# Generating Embeddings

This notebook generates embeddings with two models, ClimateBERT and Word2Vec, in three steps:

1. Read the data and documents from the database.
2. Download and store the embedding models.
3. Create a new table for the embeddings and the original columns we want to keep, then generate embeddings with both models and store them in the database.

In [1]:
# Necessary imports
import os
import regex as re
from tqdm.notebook import tqdm
from dotenv import load_dotenv
import pandas as pd
import numpy as np
import pgai
import torch
import glob
from datasets import load_dataset, Features, Value

from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM


tqdm.pandas()  # registers progress_apply on pandas objects

from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

#connecting to the database
load_dotenv() #loads the .env file into os.environ
engine = create_engine(os.getenv("DB_URL"))

#create session
Session = sessionmaker(bind=engine)
session = Session()

#word2vec model imports
import psycopg2
from gensim.models import KeyedVectors
from gensim.downloader import load
from gensim.utils import simple_preprocess
from collections import Counter
from gensim.models import Word2Vec


#importing functions
from functions import generate_embeddings_for_text, embed_and_store_all_embeddings, train_custom_word2vec_from_texts

## 1. Read the data from the database

Reading the data in its own cell makes it easy to reload if the kernel crashes and the notebook has to be re-run.

In [16]:
# Read the rows whose geography matches Albania (ALB)
df = pd.read_sql('SELECT * FROM climate_policy_radar WHERE "document_metadata.geographies" ~ \'ALB\';', engine)
df.head()

Unnamed: 0,document_id,document_metadata.collection_summary,document_metadata.collection_title,document_metadata.corpus_type_name,document_metadata.corpus_import_id,document_metadata.category,document_metadata.description,document_metadata.document_title,document_metadata.family_import_id,document_metadata.family_slug,...,pipeline_metadata.parser_metadata.azure_model_id,pipeline_metadata.parser_metadata.parsing_date,text_block.text_block_id,text_block.language,text_block.type,text_block.type_confidence,text_block.coords,text_block.page_number,text_block.text,text_block.index
0,CCLW.document.i00001343.n0000,,,Laws and Policies,CCLW.corpus.i00000001.n0000,Executive,<p>The national vision on combatting climate c...,National Strategy on Climate Change and Action...,CCLW.family.i00001342.n0000,national-strategy-on-climate-change-and-action...,...,,,,,,,,,,0
1,CCLW.document.i00001343.n0000,,,Laws and Policies,CCLW.corpus.i00000001.n0000,Executive,<p>The national vision on combatting climate c...,National Strategy on Climate Change and Action...,CCLW.family.i00001342.n0000,national-strategy-on-climate-change-and-action...,...,,,,,,,,,,0
2,CCLW.document.i00000002.n0000,,,Laws and Policies,CCLW.corpus.i00000001.n0000,Executive,"<p><span style=""font-size: 10pt;font-family: A...",National Energy and Climate Plan 2019 Draft,CCLW.family.i00000001.n0000,national-energy-and-climate-plan_8a4f,...,prebuilt-document,2023-12-11T11:43:23.509480,2731.0,en,TableCell,1.0,"{{70.6392,596.3976},{244.548,596.3976},{244.54...",83.0,Modelling Scenario Considered Type of Instrument,2731
3,CCLW.document.i00000002.n0000,,,Laws and Policies,CCLW.corpus.i00000001.n0000,Executive,"<p><span style=""font-size: 10pt;font-family: A...",National Energy and Climate Plan 2019 Draft,CCLW.family.i00000001.n0000,national-energy-and-climate-plan_8a4f,...,prebuilt-document,2023-12-11T11:43:23.509480,1706.0,en,Text,1.0,"{{69.3576,551.1744},{473.8104,551.1744},{473.8...",58.0,EE targets based on Article 3 of Directive 201...,1706
4,CCLW.document.i00000002.n0000,,,Laws and Policies,CCLW.corpus.i00000001.n0000,Executive,"<p><span style=""font-size: 10pt;font-family: A...",National Energy and Climate Plan 2019 Draft,CCLW.family.i00000001.n0000,national-energy-and-climate-plan_8a4f,...,prebuilt-document,2023-12-11T11:43:23.509480,1707.0,en,Text,1.0,"{{69.71039999999999,570.1536},{524.5488,569.79...",58.0,· Energy savings goal referring to final energ...,1707


## 2. Embeddings generation

### 2.1 Download and load ClimateBERT

The code below will download and load the ClimateBERT model.

In [3]:
EMBEDDING_MODEL_LOCAL_DIR = os.getenv('EMBEDDING_MODEL_LOCAL_DIR')
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL")

In [4]:
# Download
tokenizer = AutoTokenizer.from_pretrained(EMBEDDING_MODEL, use_auth_token=False)
model = AutoModelForMaskedLM.from_pretrained(EMBEDDING_MODEL, use_auth_token=False)

# Save the tokenizer and model to the local model directory
tokenizer.save_pretrained(EMBEDDING_MODEL_LOCAL_DIR)
model.save_pretrained(EMBEDDING_MODEL_LOCAL_DIR)



In [5]:
# Load the embedding model
climatebert_tokenizer = AutoTokenizer.from_pretrained(EMBEDDING_MODEL_LOCAL_DIR)
climatebert_model = AutoModel.from_pretrained(EMBEDDING_MODEL_LOCAL_DIR)

Some weights of RobertaModel were not initialized from the model checkpoint at local_model/climatebert/distilroberta-base-climate-f and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
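
The warning above is expected: the checkpoint was saved with a masked-language-model head, which has no pooler, so `AutoModel` initializes `pooler.dense` from scratch. This only matters if `pooler_output` is used. How `generate_embeddings_for_text` builds its vectors is not shown in this notebook; one common approach that avoids the pooler is to average the last hidden state over the real (non-padding) tokens. The sketch below illustrates only that pooling step with NumPy on random data. `mean_pool` is a hypothetical helper, and the random array stands in for the model output.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring padding positions.

    token_embeddings: (batch, seq_len, hidden) array, e.g. a model's last hidden state.
    attention_mask:   (batch, seq_len) array of 1 (real token) or 0 (padding).
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts

# Toy stand-in for model output: 2 sequences, 4 token positions, hidden size 768.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 768))
mask = np.array([[1, 1, 1, 0],
                 [1, 1, 0, 0]])

pooled = mean_pool(hidden, mask)
print(pooled.shape)  # (2, 768)
```

With the real model, `token_embeddings` would be `outputs.last_hidden_state` and `attention_mask` would come from the tokenizer.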


### 2.2 Download and load Word2Vec

The Word2Vec model loaded below is pretrained on Google News. We will check whether its vocabulary covers our corpus and, if it does not, train a custom model instead.

In [7]:
# Choose a pretrained Word2Vec model
model_name = "word2vec-google-news-300"

# Download and load the model
print(f"🔄 Loading pretrained Word2Vec model: {model_name}")
word2vec_model = load(model_name)
print("✅ Model loaded!")

# Example: check similarity
print(word2vec_model.most_similar("climate"))


🔄 Loading pretrained Word2Vec model: word2vec-google-news-300
✅ Model loaded!
[('climate_change', 0.6569506525993347), ('Climate', 0.6230838298797607), ('climates', 0.6195024847984314), ('global_warming', 0.6047458648681641), ('environment', 0.6009921431541443), ('climatic', 0.5555011630058289), ('climatic_conditions', 0.5207005143165588), ('ambassador_Brice_Lalonde', 0.5172268152236938), ('Global_warming', 0.5048916339874268), ('Climate_Change', 0.4955976903438568)]


Next, check whether the pretrained Word2Vec vocabulary covers the climate-specific words in the Climate Policy Radar data. If it does not, we will have to train our own Word2Vec model.

In [8]:
query = """
SELECT "text_block.text"
FROM climate_policy_radar
WHERE "text_block.text" IS NOT NULL
LIMIT 10000;
"""

df = pd.read_sql_query(query, engine)

# Tokenize the sampled text blocks and collect every token (duplicates kept for counting)
all_tokens = []
for block_text in df['text_block.text']:  # renamed so it does not shadow sqlalchemy's `text`
    tokens = simple_preprocess(block_text)
    all_tokens.extend(tokens)

# Compare to the Word2Vec vocabulary
vocab = set(word2vec_model.key_to_index)
oov_words = [token for token in all_tokens if token not in vocab]

# Count the most frequent missing words
oov_counter = Counter(oov_words)
most_common_oov = oov_counter.most_common(50)

# Display
print("❌ Top OOV words not in Word2Vec:")
for word, count in most_common_oov:
    print(f"{word}: {count}")


❌ Top OOV words not in Word2Vec:
of: 5249
and: 3486
to: 1875
nº: 310
ambiental: 251
artigo: 245
anp: 185
meio: 184
parágrafo: 157
emissões: 152
albania: 142
desenvolvimento: 142
resolução: 136
ações: 130
áreas: 124
anexo: 119
desta: 118
informações: 113
às: 110
convenção: 110
conama: 110
quilombola: 108
decreto: 102
suas: 101
redução: 98
proteção: 96
trata: 96
quilombolas: 92
órgãos: 91
são: 90
órgão: 89
devem: 88
aviação: 88
seguintes: 87
gestão: 86
capítulo: 85
produção: 84
ibama: 84
educação: 83
atividades: 83
mudança: 80
espécies: 80
serão: 78
inciso: 76
promover: 74
disposto: 74
execução: 72
direito: 72
conferência: 71
efeito: 70


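The most frequent out-of-vocabulary tokens are English stopwords (`of`, `and`, `to`) and Portuguese words, along with abbreviations and place names such as `anp`, `conama`, `ibama` and `albania`. To cover these terms we train a custom Word2Vec model on the corpus, using 768-dimensional vectors, the same size as the ClimateBERT (DistilRoBERTa) hidden state.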
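
Because stopwords dominate the raw counts, a single coverage figure can make the gap easier to judge. The sketch below is one way to compute it, under stated assumptions: `coverage_report` is a hypothetical helper, the stopword set is illustrative only, and the toy token list stands in for `all_tokens`.

```python
from collections import Counter

def coverage_report(tokens, vocab, stopwords=frozenset()):
    """Return (token-level coverage, Counter of OOV tokens), skipping stopwords."""
    kept = [t for t in tokens if t not in stopwords]
    oov = Counter(t for t in kept if t not in vocab)
    covered = len(kept) - sum(oov.values())
    ratio = covered / len(kept) if kept else 0.0
    return ratio, oov

# Toy data standing in for the tokenized text blocks and the model vocabulary.
tokens = ["of", "the", "emissões", "climate", "albania", "energy", "of", "conama"]
vocab = {"the", "climate", "energy"}
stopwords = {"of", "and", "to"}  # illustrative list, not exhaustive

ratio, oov = coverage_report(tokens, vocab, stopwords)
print(f"coverage: {ratio:.2f}")  # coverage: 0.50
print(sorted(oov))
```

In the notebook, the same function could be called with `all_tokens` and `vocab` from the cell above.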

In [9]:
# Train a custom model so that the missing words are added to the vocabulary

texts = df['text_block.text'].dropna().tolist()

important_terms = [
    "albania", "albanian", "unfccc", "gwp", "ghgs", "necp", "modelling", "ktoe", 
    "tirana", "vlora", "adriatic", "ionian", "montenegro", "albgaz", "oshee",
    "lulucf", "neeap", "wbif", "instat", "tpes", "gwh", "nzeb", "entso", "smes"
]

model = train_custom_word2vec_from_texts(
    texts=texts,
    force_include_words=important_terms
)
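
The body of `train_custom_word2vec_from_texts` lives in `functions.py` and is not shown here, so how `force_include_words` works is unknown. One plausible approach, sketched below as an assumption and not as the author's method, is to pad the tokenized corpus so every forced term reaches the `min_count` threshold before training. `pad_corpus_for_terms` is a hypothetical helper.

```python
from collections import Counter

def pad_corpus_for_terms(sentences, force_include_words, min_count=5):
    """Append one-word sentences so every forced term occurs at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    padded = list(sentences)
    for word in force_include_words:
        missing = max(0, min_count - counts[word])
        padded.extend([[word]] * missing)
    return padded

# Toy tokenized corpus; in the notebook this would come from the text blocks.
sentences = [["albania", "adopted", "the", "necp"], ["energy", "targets", "in", "ktoe"]]
padded = pad_corpus_for_terms(sentences, ["albania", "oshee"], min_count=3)

counts = Counter(token for sentence in padded for token in sentence)
print(counts["albania"], counts["oshee"])  # 3 3
```

The padded corpus could then be passed to `gensim.models.Word2Vec(sentences=padded, vector_size=768, min_count=3)`. A term that occurs only in one-word sentences has no context to learn from, so its vector would stay close to its random initialization; padding guarantees vocabulary membership, not a meaningful embedding.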


In [10]:
# Double-check that the saved model loads correctly

model = Word2Vec.load("./local_model/custom_word2vec_768.model")


# List of words you want to check
words_to_check = [
    "albania", "unfccc", "gwp", "oshee", "tirana", "ktoe", "neeap", "smes"
]

# Check dimensionality and coverage
for word in words_to_check:
    if word in model.wv:
        vec = model.wv[word]
        print(f"✅ '{word}' in vocab | Dim: {len(vec)}")
    else:
        print(f"❌ '{word}' NOT in vocabulary")

✅ 'albania' in vocab | Dim: 768
✅ 'unfccc' in vocab | Dim: 768
✅ 'gwp' in vocab | Dim: 768
✅ 'oshee' in vocab | Dim: 768
✅ 'tirana' in vocab | Dim: 768
✅ 'ktoe' in vocab | Dim: 768
✅ 'neeap' in vocab | Dim: 768
✅ 'smes' in vocab | Dim: 768
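
Word2Vec gives one vector per word, while each text block needs a single vector. The notebook does not show how `generate_embeddings_for_text` handles this; a common choice is to average the vectors of the in-vocabulary tokens. The sketch below assumes that approach and uses a toy 4-dimensional dictionary in place of `model.wv`; `average_word_vectors` is a hypothetical helper.

```python
import numpy as np

def average_word_vectors(tokens, vectors, dim):
    """Mean of the vectors of in-vocabulary tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)

# Toy 4-dimensional lookup standing in for `model.wv` (768 dimensions in the notebook).
toy_vectors = {
    "albania": np.array([1.0, 0.0, 0.0, 0.0], dtype=np.float32),
    "energy":  np.array([0.0, 1.0, 0.0, 0.0], dtype=np.float32),
}

doc = average_word_vectors(["albania", "energy", "unknownword"], toy_vectors, dim=4)
print(doc.tolist())  # [0.5, 0.5, 0.0, 0.0]
```

Unknown tokens are skipped, so a block made up entirely of out-of-vocabulary words falls back to a zero vector here; whether that fallback is appropriate depends on how the embeddings are used downstream.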


Check which countries the existing documents belong to, so that embeddings can be grouped by country when they are generated and uploaded to the table.

In [11]:
query = """
SELECT DISTINCT "document_metadata.geographies"
FROM climate_policy_radar
WHERE "document_metadata.geographies" IS NOT NULL;
"""

geos = pd.read_sql(query, engine)
print(geos)


    document_metadata.geographies
0                           {ALB}
1                           {AND}
2                           {ARE}
3                           {ARG}
4                           {AUS}
..                            ...
108                         {VNM}
109                         {XKX}
110                         {ZAF}
111                         {ZMB}
112                         {ZWE}

[113 rows x 1 columns]
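
The values above use the Postgres array-literal form (`{ALB}`). To group rows by country it can help to turn them into plain ISO codes first. The sketch below assumes the column arrives as text and that multi-country values, if any exist, look like `{ALB,MNE}` (a hypothetical example); `parse_geographies` is an illustrative helper, not part of the notebook.

```python
def parse_geographies(value):
    """Turn a Postgres-style array literal such as '{ALB,MNE}' into ['ALB', 'MNE']."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):  # some drivers already return a Python list
        return list(value)
    inner = value.strip().strip("{}")
    return [code.strip().strip('"') for code in inner.split(",") if code.strip()]

print(parse_geographies("{ALB}"))      # ['ALB']
print(parse_geographies("{ALB,MNE}"))  # ['ALB', 'MNE']
print(parse_geographies(None))         # []
```

Note that the `~ 'ALB'` filter used in Section 1 is an unanchored regular-expression match, so it selects any row whose geographies value contains `ALB`, including multi-country rows.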


## 3. Embedding all documents for all countries

Generate embeddings for all documents and upload them into the database.

**IMPORTANT: DO THIS BEFORE RUNNING THE CODE BELOW**

A new table is needed; it is created by the create_table.sql file. To run it:

1. Go to create_table.sql, which contains the query that creates the table.
2. Select the Postgres server at the bottom, then highlight the code and right-click to run the query.


This creates a new table in the database. The file also includes a *"DROP TABLE IF EXISTS document_embeddings;"* line for the case where the table does not appear. Avoid running that line once data has been uploaded, because it deletes all existing data; use it with caution. Also make sure `df` holds the table read in Section 1 (re-run that cell if needed), because Section 2.2 reuses the name `df` for a text-only sample. After creating the table, run the code below to generate the embeddings and store them in the database. This takes around 2 hours.


In [17]:
# Generate the embeddings and store them in the database

embed_and_store_all_embeddings(df, engine)


Some weights of RobertaModel were not initialized from the model checkpoint at local_model/climatebert/distilroberta-base-climate-f and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Python(74729) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


Filtering by country:   0%|          | 0/1 [00:00<?, ?it/s]

Processing all countries:   0%|          | 0/1 [00:00<?, ?it/s]

Embedding ALB:   0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/9998 [00:00<?, ?it/s]

  0%|          | 0/9998 [00:00<?, ?it/s]

  0%|          | 0/6038 [00:00<?, ?it/s]

  0%|          | 0/6038 [00:00<?, ?it/s]

Uploading ALB:   0%|          | 0/16036 [00:00<?, ?it/s]


✅ All ClimateBERT and Word2Vec embeddings uploaded directly.
