# 🤗 Database Setup

This notebook uploads the Hugging Face Climate Policy Radar dataset to a Postgres table.

Refer to README.md for instructions on setting up the Postgres database.

In [1]:
import os
import regex as re
from tqdm.notebook import tqdm
from dotenv import load_dotenv
import pandas as pd
import numpy as np
import pgai
import torch
import glob
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM
from datasets import load_dataset, Features, Value
from functions import generate_embeddings_for_text, store_database_batched
from sqlalchemy import create_engine, text

## 1. Load the Huggingface dataset.

In [4]:
# Log in first, e.g. by running `huggingface-cli login` in a terminal, to access this dataset.

ds = load_dataset("ClimatePolicyRadar/all-document-text-data")
ds = ds["train"]  # work with the "train" split
flat_ds = ds.flatten()  # expand nested (struct) columns into top-level columns

## 2. Save all chunks into the "climate_policy_radar" table.

The table contains every document in the Climate Policy Radar dataset. Each row represents one chunk, and each piece of associated metadata is stored in its own column.
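As a rough illustration of how nested metadata can end up in individual columns, the sketch below flattens a nested record into dot-joined keys, similar in spirit to what `Dataset.flatten()` does for struct columns. The helper `flatten_record` and the sample field names are hypothetical and are not taken from the dataset's actual schema.

```python
from typing import Any


def flatten_record(record: dict[str, Any], parent: str = "") -> dict[str, Any]:
    """Flatten nested dicts into one level, joining nested keys with a dot."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            # Recurse so that each nested field becomes its own column.
            flat.update(flatten_record(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical chunk: the field names are illustrative only.
chunk = {
    "document_id": "doc-1",
    "text_block": {"text": "Example passage.", "page_number": 3},
}
flat_chunk = flatten_record(chunk)
print(sorted(flat_chunk))
```

Each key of `flat_chunk` would then map to one column of the table, with one such flattened record per row.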

In [None]:
store_database_batched(flat_ds, num_chunks=len(flat_ds))
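The implementation of `store_database_batched` lives in the local `functions` module and is not shown in this notebook. As a minimal sketch of the general idea of batched insertion, the example below writes rows in fixed-size batches and commits after each one. Everything in it is an assumption made for illustration: the helpers `batched` and `store_rows_batched`, the table `climate_policy_radar_demo`, its two columns, and the use of an in-memory SQLite database as a stand-in for Postgres so the sketch runs without a server.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator


def batched(rows: Iterable[tuple], batch_size: int) -> Iterator[list[tuple]]:
    """Yield successive lists containing at most batch_size rows."""
    iterator = iter(rows)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def store_rows_batched(
    conn: sqlite3.Connection, rows: Iterable[tuple], batch_size: int = 1000
) -> int:
    """Insert rows batch by batch, committing after each batch (hypothetical helper)."""
    total = 0
    for batch in batched(rows, batch_size):
        conn.executemany(
            "INSERT INTO climate_policy_radar_demo (document_id, chunk_text) "
            "VALUES (?, ?)",
            batch,
        )
        conn.commit()  # keep each transaction small
        total += len(batch)
    return total


# In-memory SQLite stands in for Postgres here (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE climate_policy_radar_demo (document_id TEXT, chunk_text TEXT)"
)
demo_rows = [(f"doc-{i}", f"chunk {i}") for i in range(2500)]
inserted = store_rows_batched(conn, demo_rows, batch_size=1000)
row_count = conn.execute(
    "SELECT COUNT(*) FROM climate_policy_radar_demo"
).fetchone()[0]
print(inserted, row_count)
```

Writing in batches keeps memory use and transaction size bounded, which matters when every chunk of a large dataset is stored in one call.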