# Generating Embeddings

In this notebook we generate embeddings with two models, ClimateBERT and Word2Vec, in three steps:

1. Read the data and documents from the database.
2. Download and store the embedding models.
3. Create a new table for the embeddings and the original fields we want to keep, then generate embeddings with both models and store them in the database.

In [5]:
# Necessary imports
import os
import regex as re
from tqdm.notebook import tqdm
from dotenv import load_dotenv
import pandas as pd
import numpy as np
import pgai
import torch
import glob
from datasets import load_dataset, Features, Value

from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM


tqdm.pandas()  # registers DataFrame.progress_apply; only needs to be called once after importing tqdm

from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

#connecting to the database
load_dotenv() #loads the .env file into os.environ
engine = create_engine(os.getenv("DB_URL"))

#create session
Session = sessionmaker(bind=engine)
session = Session()

#word2vec model imports
import psycopg2
from gensim.models import KeyedVectors
from gensim.downloader import load
from gensim.utils import simple_preprocess
from collections import Counter
from gensim.models import Word2Vec


#importing functions
from functions import generate_embeddings_for_text, embed_and_store_all_embeddings, train_custom_word2vec_from_texts

## 1. Read the data from the database

Reading the table first makes the data easy to reload if the kernel crashes and the cells have to be re-run.

In [6]:
# Read the rows whose geographies field matches 'ALB' (Albania)
df = pd.read_sql('SELECT * FROM climate_policy_radar WHERE "document_metadata.geographies" ~ \'ALB\';', engine)
df.head()

Unnamed: 0,document_id,document_metadata.collection_summary,document_metadata.collection_title,document_metadata.corpus_type_name,document_metadata.corpus_import_id,document_metadata.category,document_metadata.description,document_metadata.document_title,document_metadata.family_import_id,document_metadata.family_slug,...,pipeline_metadata.parser_metadata.azure_model_id,pipeline_metadata.parser_metadata.parsing_date,text_block.text_block_id,text_block.language,text_block.type,text_block.type_confidence,text_block.coords,text_block.page_number,text_block.text,text_block.index
0,CCLW.document.i00000964.n0000,,,Laws and Policies,CCLW.corpus.i00000001.n0000,Executive,<p>The NAP is the adaptation component of the ...,Albania’s National Adaptation Plan First - pro...,CCLW.family.10661.0,national-adaptation-planning-nap-to-climate-ch...,...,prebuilt-document,2024-05-07T08:46:18.237130,1924,en,TableCell,1.0,"{{444.83040000000005,404.84159999999997},{523....",59.0,Ministry of Tourism and Environment,1924
1,CCLW.document.i00000964.n0000,,,Laws and Policies,CCLW.corpus.i00000001.n0000,Executive,<p>The NAP is the adaptation component of the ...,Albania’s National Adaptation Plan First - pro...,CCLW.family.10661.0,national-adaptation-planning-nap-to-climate-ch...,...,prebuilt-document,2024-05-07T08:46:18.237130,1925,en,TableCell,1.0,"{{523.6128,404.84159999999997},{584.6472000000...",59.0,Completed in 2020,1925
2,CCLW.document.i00000964.n0000,,,Laws and Policies,CCLW.corpus.i00000001.n0000,Executive,<p>The NAP is the adaptation component of the ...,Albania’s National Adaptation Plan First - pro...,CCLW.family.10661.0,national-adaptation-planning-nap-to-climate-ch...,...,prebuilt-document,2024-05-07T08:46:18.237130,1926,en,TableCell,1.0,"{{584.6472000000001,404.84159999999997},{760.6...",59.0,https://www.osce.org/secretariat/4 84148,1926
3,CCLW.document.i00000964.n0000,,,Laws and Policies,CCLW.corpus.i00000001.n0000,Executive,<p>The NAP is the adaptation component of the ...,Albania’s National Adaptation Plan First - pro...,CCLW.family.10661.0,national-adaptation-planning-nap-to-climate-ch...,...,prebuilt-document,2024-05-07T08:46:18.237130,1927,en,TableCell,1.0,"{{67.2624,463.7952},{98.4888,463.7952},{98.488...",59.0,5,1927
4,CCLW.document.i00000964.n0000,,,Laws and Policies,CCLW.corpus.i00000001.n0000,Executive,<p>The NAP is the adaptation component of the ...,Albania’s National Adaptation Plan First - pro...,CCLW.family.10661.0,national-adaptation-planning-nap-to-climate-ch...,...,prebuilt-document,2024-05-07T08:46:18.237130,1928,en,TableCell,1.0,"{{98.4888,463.7952},{323.4744,463.7952},{323.4...",59.0,BRIdging the GAp for Innovations in Disaster r...,1928


## 2. Embeddings generation

### 2.1 Download and load ClimateBERT

The code below will download and load the ClimateBERT model.

In [7]:
EMBEDDING_MODEL_LOCAL_DIR = os.getenv('EMBEDDING_MODEL_LOCAL_DIR')
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL")

In [9]:
# Download
tokenizer = AutoTokenizer.from_pretrained(EMBEDDING_MODEL, use_auth_token=False)
model = AutoModelForMaskedLM.from_pretrained(EMBEDDING_MODEL, use_auth_token=False)

# Save both to the local model folder
tokenizer.save_pretrained(EMBEDDING_MODEL_LOCAL_DIR)
model.save_pretrained(EMBEDDING_MODEL_LOCAL_DIR)



RuntimeError: Failed to import transformers.models.roberta.modeling_roberta because of the following error (look up to see its traceback):
partially initialized module 'torch._dynamo' has no attribute 'decorators' (most likely due to a circular import)
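Downloading on every run is slow and needs network access, so a guard that skips the download when the local folder is already populated can help. The following is a minimal sketch, assuming a saved model folder can be recognised by the `config.json` file that `save_pretrained` writes; the helper name `has_local_model` is hypothetical and not part of this project.

```python
import os
import tempfile


def has_local_model(local_dir):
    """Return True if local_dir looks like a saved Hugging Face model.

    Assumption: a folder written by `model.save_pretrained` contains a
    `config.json` file, so its presence is used as the marker.
    """
    if not local_dir:
        return False
    return os.path.isfile(os.path.join(local_dir, "config.json"))


# Small demonstration with a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    empty_check = has_local_model(tmp)  # nothing saved yet
    with open(os.path.join(tmp, "config.json"), "w") as fh:
        fh.write("{}")
    populated_check = has_local_model(tmp)  # marker file present

print(empty_check, populated_check)
```

In the download cell, the two `from_pretrained(EMBEDDING_MODEL, ...)` calls could be wrapped in `if not has_local_model(EMBEDDING_MODEL_LOCAL_DIR):` so that a re-run goes straight to loading from disk.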

In [None]:
# Load the embedding model
climatebert_tokenizer = AutoTokenizer.from_pretrained(EMBEDDING_MODEL_LOCAL_DIR)
climatebert_model = AutoModel.from_pretrained(EMBEDDING_MODEL_LOCAL_DIR)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Some weights of RobertaModel were not initialized from the model checkpoint at local_model/climatebert/distilroberta-base-climate-f and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
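The helper `generate_embeddings_for_text` is imported from `functions` and is not shown here, so how it pools token vectors is unknown. One common way to turn per-token hidden states into a single vector is mask-aware mean pooling, which does not rely on the newly initialized pooler weights mentioned in the warning above. The sketch below uses plain Python and toy values; with the real model the same arithmetic would be applied to the `last_hidden_state` and `attention_mask` tensors.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if kept == 0:
        return totals  # all padding: return the zero vector
    return [total / kept for total in totals]


# Toy example: three token positions, the last one is padding
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(hidden, mask)
print(pooled)
```

The padding position is excluded from both the sum and the count, so long and short texts in the same batch are averaged over their real tokens only.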


### 2.2 Download and load Word2Vec

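The pretrained Word2Vec model loaded below (word2vec-google-news-300) was trained on general news text, not on climate policy documents. We will check whether its vocabulary covers our corpus and train a custom model if it does not.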

In [None]:
# Choose a pretrained Word2Vec model
model_name = "word2vec-google-news-300"

# Download and load the model
print(f"🔄 Loading pretrained Word2Vec model: {model_name}")
word2vec_model = load(model_name)
print("✅ Model loaded!")

# Example: check similarity
print(word2vec_model.most_similar("climate"))


🔄 Loading pretrained Word2Vec model: word2vec-google-news-300
✅ Model loaded!
[('climate_change', 0.6569507122039795), ('Climate', 0.6230838298797607), ('climates', 0.6195024251937866), ('global_warming', 0.6047458648681641), ('environment', 0.6009922027587891), ('climatic', 0.5555011630058289), ('climatic_conditions', 0.5207005143165588), ('ambassador_Brice_Lalonde', 0.5172268152236938), ('Global_warming', 0.5048916339874268), ('Climate_Change', 0.4955976605415344)]


Next, check whether the pretrained Word2Vec vocabulary covers the climate-specific words in the Climate Policy Radar data. If frequent domain terms are missing, we will have to train our own Word2Vec model.

In [None]:
query = """
SELECT "text_block.text"
FROM climate_policy_radar
WHERE "text_block.text" IS NOT NULL
LIMIT 10000;
"""

df = pd.read_sql_query(query, engine)

# Tokenize every text block and collect all tokens
all_tokens = []
for block_text in df['text_block.text']:  # avoid shadowing sqlalchemy's `text`
    tokens = simple_preprocess(block_text)
    all_tokens.extend(tokens)

# Compare against the Word2Vec vocabulary
vocab = set(word2vec_model.key_to_index)
oov_words = [token for token in all_tokens if token not in vocab]

# Count the most frequent out-of-vocabulary (OOV) words
oov_counter = Counter(oov_words)
most_common_oov = oov_counter.most_common(50)

# Display the results
print("❌ Top OOV words not in Word2Vec:")
for word, count in most_common_oov:
    print(f"{word}: {count}")


❌ Top OOV words not in Word2Vec:
of: 3362
and: 2566
to: 1565
albania: 310
albanian: 159
dcm: 144
wem: 136
necp: 79
modelling: 72
ktoe: 69
implem: 57
meur: 43
tirana: 41
ghgs: 33
gwp: 33
adriatic: 32
montenegro: 28
oshee: 23
vlora: 23
albgaz: 23
pams: 18
neeap: 17
mva: 17
lulucf: 17
unfccc: 16
aee: 15
ionian: 15
ippu: 15
programme: 15
smes: 14
entso: 13
instat: 13
tpes: 13
wbif: 13
alkogap: 12
gwh: 12
elbasan: 12
hfc: 12
ktco: 11
labelling: 10
dumrea: 10
hpp: 9
iap: 9
mmr: 9
nzeb: 9
balkans: 9
escos: 9
ebrd: 8
kfw: 8
mte: 8


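The top three entries (of, and, to) are common stopwords, but most of the rest are domain abbreviations (e.g. necp, lulucf, unfccc) and place names (e.g. albania, tirana, vlora) that the pretrained Word2Vec model does not contain. We will therefore train a custom model on the corpus and set its vector size to 768, which matches the hidden size of the DistilRoBERTa-based ClimateBERT model.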
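Raw OOV counts can be complemented by a coverage ratio, which makes it easier to judge how serious the gap is. The sketch below is one way to compute it, assuming a small hand-picked stopword set so that words such as "of" and "and" do not dominate the result; `coverage_report` is a hypothetical helper and the toy data stands in for `all_tokens` and `vocab`.

```python
from collections import Counter


def coverage_report(tokens, vocab, stopwords=frozenset()):
    """Summarize how well a vocabulary covers a token list.

    Returns the share of token occurrences found in vocab, the share of
    distinct words found in vocab, and a Counter of the missing words.
    Stopwords are dropped first so they do not dominate the counts.
    """
    kept = [t for t in tokens if t not in stopwords]
    if not kept:
        return 0.0, 0.0, Counter()
    missing = Counter(t for t in kept if t not in vocab)
    token_cov = 1 - sum(missing.values()) / len(kept)
    type_cov = 1 - len(missing) / len(set(kept))
    return token_cov, type_cov, missing


tokens = ["of", "climate", "policy", "albania", "ktoe", "climate", "and"]
toy_vocab = {"climate", "policy"}
token_cov, type_cov, missing = coverage_report(
    tokens, toy_vocab, stopwords={"of", "and", "to"}
)
print(f"token coverage: {token_cov:.2f}, type coverage: {type_cov:.2f}")
print(missing.most_common(2))
```

Token coverage weights frequent words more heavily, while type coverage treats every distinct word equally; a large gap between the two suggests that the missing words are mostly rare.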

In [None]:
# Train a custom Word2Vec model on the corpus so that the missing words get vectors

texts = df['text_block.text'].dropna().tolist()

important_terms = [
    "albania", "albanian", "unfccc", "gwp", "ghgs", "necp", "modelling", "ktoe", 
    "tirana", "vlora", "adriatic", "ionian", "montenegro", "albgaz", "oshee",
    "lulucf", "neeap", "wbif", "instat", "tpes", "gwh", "nzeb", "entso", "smes"
]

model = train_custom_word2vec_from_texts(
    texts=texts,
    force_include_words=important_terms
)
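`train_custom_word2vec_from_texts` lives in `functions` and is not shown, so how it handles `force_include_words` is unknown. One plausible reading is that rare words are normally dropped by a minimum-count threshold, while forced words are kept regardless. The sketch below illustrates only that vocabulary-selection idea in plain Python; `select_vocab` and its `min_count` parameter are hypothetical and not the notebook's implementation.

```python
from collections import Counter


def select_vocab(tokenized_texts, min_count=5, force_include=()):
    """Keep words seen at least min_count times, plus forced words that occur."""
    counts = Counter(token for tokens in tokenized_texts for token in tokens)
    frequent = {word for word, n in counts.items() if n >= min_count}
    # A forced word can only be learned if it occurs in the corpus at all
    forced = {word for word in force_include if counts[word] > 0}
    return frequent | forced


corpus = [
    ["energy", "policy", "albania"],
    ["energy", "policy", "ktoe"],
    ["energy", "targets"],
]
vocab = select_vocab(corpus, min_count=2, force_include=["ktoe", "unfccc"])
print(sorted(vocab))
```

Note that a forced word that never appears in the training texts (here "unfccc") still ends up without a vector, which is why the coverage check in the next cell is worth running.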


In [None]:
# Double-check that the trained model loads correctly
# Load the saved model from disk

model = Word2Vec.load("./local_model/custom_word2vec_768.model")


# List of words you want to check
words_to_check = [
    "albania", "unfccc", "gwp", "oshee", "tirana", "ktoe", "neeap", "smes"
]

# Check dimensionality and coverage
for word in words_to_check:
    if word in model.wv:
        vec = model.wv[word]
        print(f"✅ '{word}' in vocab | Dim: {len(vec)}")
    else:
        print(f"❌ '{word}' NOT in vocabulary")

✅ 'albania' in vocab | Dim: 768
✅ 'unfccc' in vocab | Dim: 768
✅ 'gwp' in vocab | Dim: 768
✅ 'oshee' in vocab | Dim: 768
✅ 'tirana' in vocab | Dim: 768
✅ 'ktoe' in vocab | Dim: 768
✅ 'neeap' in vocab | Dim: 768
✅ 'smes' in vocab | Dim: 768
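Word2Vec produces one vector per word, so a text block needs an aggregation step to get a single embedding. How the project's helper functions do this is not shown here; a common baseline is to average the vectors of the in-vocabulary tokens. The sketch below uses a plain dictionary in place of `model.wv`; with gensim, `word in model.wv` and `model.wv[word]` play the same roles as in the check above.

```python
def average_word_vectors(tokens, vectors, dim):
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# Toy 2-dimensional vectors standing in for the 768-dimensional model
toy_vectors = {"albania": [1.0, 0.0], "ktoe": [0.0, 1.0]}
doc_vector = average_word_vectors(["albania", "ktoe", "xyz"], toy_vectors, dim=2)
print(doc_vector)
```

Unknown tokens are skipped rather than treated as zeros, so a text with a few OOV words is not pulled toward the origin.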


Check which countries the existing documents belong to, so that embedded documents can be grouped by country when they are uploaded to the table.

In [None]:
query = """
SELECT DISTINCT "document_metadata.geographies"
FROM climate_policy_radar
WHERE "document_metadata.geographies" IS NOT NULL;
"""

geos = pd.read_sql(query, engine)
print(geos)


   document_metadata.geographies
0                          {SRB}
1                          {MKD}
2                          {GBR}
3                          {TUV}
4                          {FRA}
5                          {ALB}
6                          {EUR}
7                          {MNE}
8                          {AZE}
9                          {CAN}
10                         {JPN}
11                         {BRA}
12                         {DEU}
13                         {XKX}
14                         {CHN}
15                         {ZAF}
16                         {BIH}
17                         {IRL}


## 3. Embedding all documents for all countries

Generate embeddings for all documents and upload them into the database.

**IMPORTANT THING TO DO BEFORE RUNNING THE CODE BELOW:**

A new table is needed; it is created by the create_table.sql file. Steps to run it:

1. Open create_table.sql.
2. Select the Postgres server at the bottom, highlight the code, then right-click and run the query to create the table.


This creates a new table in the database. The file also includes a *"DROP TABLE IF EXISTS document_embeddings;"* line to use if the table does not appear. Avoid running that line after data have been uploaded, because it drops all existing data; use it with caution. After creating the table, run the code below to generate the embeddings and store them in the database. This takes around 2 hours to finish.


In [None]:
#Embedding and storing all embeddings in the database

embed_and_store_all_embeddings(df, engine)
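Because the full run takes around two hours, processing rows in fixed-size batches and committing after each one limits how much work is lost if the kernel crashes. Whether `embed_and_store_all_embeddings` already works this way is not visible here; the sketch shows only the batching pattern, with a hypothetical `store_batch` callback standing in for the embed-and-insert step.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_in_batches(items, batch_size, store_batch):
    """Call store_batch on each batch and return the number of items handled."""
    done = 0
    for batch in iter_batches(items, batch_size):
        store_batch(batch)  # hypothetical: embed the batch and insert its rows
        done += len(batch)
    return done


stored = []
total = process_in_batches(list(range(10)), batch_size=4, store_batch=stored.append)
print(total, [len(batch) for batch in stored])
```

With a DataFrame, the same idea applies to row ranges (for example slices taken with `df.iloc`), and the running count can be logged so that an interrupted run can be resumed from the last completed batch.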
