# Comparing ClimateBERT vs DistilRoBERTa Embeddings

This notebook compares how well the domain-specific ClimateBERT model captures climate-related language relative to the general-purpose DistilRoBERTa model.

**Investigation questions:**
- Do domain-specific models actually capture specialized language better?
- How do the similarity patterns differ between models?
- When should we use domain-specific vs general-purpose models?

In [31]:
import os
import sys
import numpy as np
import pandas as pd
from IPython.display import display

from lets_plot import *
LetsPlot.setup_html()

# Import the comparison script
from compare_embeddings import load_models, compare_climate_terms, visualize_comparison, run_comparison

## Running the full comparison

Let's run the complete comparison between ClimateBERT and DistilRoBERTa to see which model better captures climate-specific language:

In [32]:
run_comparison()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading models...


Some weights of RobertaModel were not initialized from the model checkpoint at ./local_models/climatebert/distilroberta-base-climate-f and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



--- Comparison Results ---
Average within-category similarity for climate terms:


| | term | climate_within_category_avg | roberta_within_category_avg |
| --- | --- | --- | --- |
| 2 | greenhouse gas reduction | 0.9977 | 0.9993 |
| 6 | carbon neutrality | 0.9976 | 0.9994 |
| 5 | climate resilience | 0.9976 | 0.9994 |
| 1 | climate change adaptation | 0.9975 | 0.9994 |
| 0 | carbon emissions | 0.9974 | 0.9995 |
| 3 | renewable energy transition | 0.9972 | 0.9994 |
| 9 | net zero emissions | 0.997 | 0.9993 |
| 7 | climate finance | 0.9969 | 0.9994 |
| 8 | loss and damage | 0.9964 | 0.9989 |
| 4 | nationally determined contributions | 0.9956 | 0.9991 |



Average within-category similarity for general terms:


| | term | climate_within_category_avg | roberta_within_category_avg |
| --- | --- | --- | --- |
| 11 | policy implementation | 0.9982 | 0.9995 |
| 15 | governance structure | 0.9974 | 0.9995 |
| 12 | sustainable development | 0.9981 | 0.9995 |
| 19 | global partnership | 0.9982 | 0.9995 |
| 10 | international cooperation | 0.9973 | 0.9995 |
| 14 | resource allocation | 0.9976 | 0.9995 |
| 16 | technical assistance | 0.9977 | 0.9995 |
| 18 | monitoring framework | 0.9979 | 0.9994 |
| 13 | economic growth | 0.9972 | 0.9993 |
| 17 | capacity building | 0.9973 | 0.9993 |



--- Summary Statistics ---


| | Model | Climate Terms Coherence | General Terms Coherence | Climate Specialization Ratio |
| --- | --- | --- | --- | --- |
| 0 | ClimateBERT | 0.9971 | 0.9977 | 0.9994 |
| 1 | DistilRoBERTa | 0.9993 | 0.9995 | 0.9998 |



--- Conclusions ---
ClimateBERT has a -0.0022 advantage in capturing similarity between climate terms.
Surprisingly, DistilRoBERTa better captures the semantic relationships between climate-specific terms.

This suggests that:
❓ The domain-specific training may not have significantly improved climate language understanding.
❓ General-purpose models may be sufficient for basic climate policy document analysis.
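
The summary figures can be checked by hand. The sketch below assumes, based only on the printed values, that the "Climate Specialization Ratio" is climate-term coherence divided by general-term coherence, and that the reported "advantage" is the difference between the two models' climate-term coherence. `run_comparison` may compute these differently internally; `specialization_ratio` is an illustrative name.

```python
# Values copied from the summary statistics printed above
coherence = {
    "ClimateBERT": {"climate": 0.9971, "general": 0.9977},
    "DistilRoBERTa": {"climate": 0.9993, "general": 0.9995},
}

def specialization_ratio(scores):
    """Assumed definition: climate coherence divided by general coherence."""
    return scores["climate"] / scores["general"]

ratios = {model: round(specialization_ratio(s), 4) for model, s in coherence.items()}

# Assumed definition: ClimateBERT minus DistilRoBERTa on climate terms
advantage = round(
    coherence["ClimateBERT"]["climate"] - coherence["DistilRoBERTa"]["climate"], 4
)

print(ratios)
print(advantage)
```

Under these assumptions both ratios fall just below 1, so neither model scores climate terms as more coherent than general terms, and the negative "advantage" means DistilRoBERTa scored higher on this metric.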


## Custom Analysis with UK NDC Document

Now let's run a comparison using text samples from the UK NDC document we analyzed previously:

In [33]:
from utils import load_ndc_doc_strings, get_embeddings
from sklearn.metrics.pairwise import cosine_similarity

# Load the UK document
corpus_dir = "data/ndc-docs-robust"
all_docs = load_ndc_doc_strings(corpus_dir, filter_english=True)
uk_doc = all_docs[all_docs['file'].str.contains("United Kingdom")]

# Extract sample sentences
sample_sentences = uk_doc['doc'].iloc[0].split(".")[:15]  # Get first 15 sentences
sample_sentences = [s.strip() + "." for s in sample_sentences if len(s.strip()) > 30]  # Clean them up

# Display samples
print("Sample sentences from UK NDC document:")
for i, s in enumerate(sample_sentences):
    print(f"{i+1}. {s}")

Loading NDC documents from data/ndc-docs-robust
Sample sentences from UK NDC document:
1. United Kingdom of Great Britain and Northern Ireland’s Nationally Determined Contribution
Presented to Parliament by the Secretary of State for Business, Energy, and Industrial Strategy by Command of His Majesty
Updated: September 2022
CP 744
United Kingdom of Great Britain and Northern Ireland’s Nationally Determined Contribution
Presented to Parliament by the Secretary of State for Business, Energy, and Industrial Strategy by Command of His Majesty
Updated: September 2022
© Crown copyright 2022
This publication is licensed under the terms of the Open Government Licence v3.
2. 0 except where otherwise stated.
3. To view this licence, visit nationalarchives.
4. uk/doc/open-government-licence/version/3.
5. Where we have identified any third party copyright information you will need to obtain permission from the copyright holders concerned.
6. This publication is available at www.
7. ISBN 978-1-5286
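
The numbered output shows that splitting on every `.` also breaks version numbers and URLs (for example `v3.` followed by `0 except where otherwise stated.`). A minimal alternative sketch is given below, assuming a sentence boundary is a terminator followed by whitespace and an uppercase letter. `split_sentences` is a hypothetical helper, not part of `utils`, and a dedicated sentence tokenizer would likely be more robust.

```python
import re

# Split after '.', '!' or '?' only when followed by whitespace and an uppercase letter
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text, min_chars=30):
    """Hypothetical helper: split text into sentences and drop short fragments."""
    parts = _SENTENCE_BOUNDARY.split(text)
    return [p.strip() for p in parts if len(p.strip()) > min_chars]

example = (
    "This publication is licensed under the terms of the Open Government Licence v3.0 "
    "except where otherwise stated. Where we have identified any third party copyright "
    "information you will need to obtain permission from the copyright holders concerned."
)

sentences = split_sentences(example)
for sentence in sentences:
    print(sentence)
```

Because the pattern requires whitespace after the terminator, `v3.0` stays in one piece. Abbreviations followed by a capitalised word would still be split, which is one reason this is only a sketch.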

In [34]:
# Load models
(climate_tokenizer, climate_model), (roberta_tokenizer, roberta_model) = load_models()

# Calculate embeddings
climate_embeddings = get_embeddings(sample_sentences, climate_model, climate_tokenizer)
roberta_embeddings = get_embeddings(sample_sentences, roberta_model, roberta_tokenizer)

# Calculate similarity matrices
climate_sim = cosine_similarity(climate_embeddings)
roberta_sim = cosine_similarity(roberta_embeddings)

# Create DataFrames for easier visualisation
climate_sim_df = pd.DataFrame(climate_sim, 
                              index=[f"S{i+1}" for i in range(len(sample_sentences))],
                              columns=[f"S{i+1}" for i in range(len(sample_sentences))])

roberta_sim_df = pd.DataFrame(roberta_sim, 
                             index=[f"S{i+1}" for i in range(len(sample_sentences))],
                             columns=[f"S{i+1}" for i in range(len(sample_sentences))])


### Visualising the similarity matrix differences

In [35]:
# Calculate difference in similarity matrices
diff_sim = climate_sim - roberta_sim

# Convert to long format for heatmap
climate_long = pd.melt(climate_sim_df.reset_index(), id_vars='index', var_name='column', value_name='similarity')
climate_long['model'] = 'ClimateBERT'

roberta_long = pd.melt(roberta_sim_df.reset_index(), id_vars='index', var_name='column', value_name='similarity')
roberta_long['model'] = 'DistilRoBERTa'

combined = pd.concat([climate_long, roberta_long])

# Create heatmaps using lets_plot
plot_climate = ggplot(climate_long, aes(x='index', y='column', fill='similarity')) + \
    geom_tile() + \
    scale_fill_gradient(low="white", high="#0C79A2") + \
    ggtitle("ClimateBERT Similarity Matrix") + \
    theme(axis_title_x=element_blank(), axis_title_y=element_blank())
display(plot_climate)

plot_roberta = ggplot(roberta_long, aes(x='index', y='column', fill='similarity')) + \
    geom_tile() + \
    scale_fill_gradient(low="white", high="#A20C73") + \
    ggtitle("DistilRoBERTa Similarity Matrix") + \
    theme(axis_title_x=element_blank(), axis_title_y=element_blank())
display(plot_roberta)

### Analyzing the differences between models

In [36]:
# Build a labelled DataFrame of the similarity differences (ClimateBERT - DistilRoBERTa)
diff_sim_df = pd.DataFrame(diff_sim, 
                          index=[f"S{i+1}" for i in range(len(sample_sentences))],
                          columns=[f"S{i+1}" for i in range(len(sample_sentences))])

# Convert difference matrix to long format
diff_long = pd.melt(diff_sim_df.reset_index(), id_vars='index', var_name='column', value_name='diff')

# Create heatmap for differences
plot_diff = ggplot(diff_long, aes(x='index', y='column', fill='diff')) + \
    geom_tile() + \
    scale_fill_gradient2(low="#A20C73", mid="white", high="#0C79A2", midpoint=0) + \
    ggtitle("Difference in Similarity: ClimateBERT - DistilRoBERTa\n(Blue = ClimateBERT higher)") + \
    theme(axis_title_x=element_blank(), axis_title_y=element_blank())
display(plot_diff)

# Find the top differences
diff_long['abs_diff'] = abs(diff_long['diff'])
top_diffs = diff_long[diff_long['index'] < diff_long['column']]  # Look at unique pairs
top_diffs = top_diffs.nlargest(5, 'abs_diff')

# Display the sentence pairs with biggest differences
print("\nSentence pairs with biggest model differences:")
for _, row in top_diffs.iterrows():
    i = int(row['index'][1:]) - 1  # Get original index
    j = int(row['column'][1:]) - 1
    
    print(f"\nPair {row['index']} & {row['column']} (diff: {row['diff']:.4f})")
    print(f"S{i+1}: {sample_sentences[i]}")
    print(f"S{j+1}: {sample_sentences[j]}")
    print(f"ClimateBERT similarity: {climate_sim[i,j]:.4f}")
    print(f"RoBERTa similarity: {roberta_sim[i,j]:.4f}")


Sentence pairs with biggest model differences:

Pair S11 & S7 (diff: -0.0108)
S11: At COP26 in November 2021, which the UK hosted in Glasgow, Parties resolved to pursue efforts to limit global temperature increase to 1.
S7: ISBN 978-1-5286-3666-7 E02785110 09/22
Printed on paper containing 40% recycled fibre content minimum
Printed in the UK by HH Associates Ltd.
ClimateBERT similarity: 0.9852
RoBERTa similarity: 0.9961

Pair S10 & S7 (diff: -0.0107)
S10: In its NDC, the UK commits to reducing economy-wide greenhouse gas emissions by at least 68% by 2030, compared to 1990 levels.
S7: ISBN 978-1-5286-3666-7 E02785110 09/22
Printed on paper containing 40% recycled fibre content minimum
Printed in the UK by HH Associates Ltd.
ClimateBERT similarity: 0.9856
RoBERTa similarity: 0.9963

Pair S7 & S9 (diff: -0.0102)
S7: ISBN 978-1-5286-3666-7 E02785110 09/22
Printed on paper containing 40% recycled fibre content minimum
Printed in the UK by HH Associates Ltd.
S9: In December 2020, the United
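
The filter `diff_long['index'] < diff_long['column']` compares the labels as strings. It still keeps each unordered pair exactly once, but the ordering is lexicographic, which is why a pair is reported as `S11 & S7` rather than `S7 & S11`. The sketch below illustrates this with 12 made-up labels and shows a numeric alternative; `label_number` is an illustrative helper.

```python
from itertools import product

labels = [f"S{i+1}" for i in range(12)]  # illustrative labels S1..S12

# String comparison: 'S11' < 'S7' because the character '1' sorts before '7'
string_pairs = [(a, b) for a, b in product(labels, repeat=2) if a < b]

def label_number(label):
    """Illustrative helper: convert 'S11' to 11."""
    return int(label[1:])

# Numeric comparison keeps the smaller sentence number first
numeric_pairs = [
    (a, b) for a, b in product(labels, repeat=2)
    if label_number(a) < label_number(b)
]

print(("S11", "S7") in string_pairs)
print(("S7", "S11") in numeric_pairs)
print(len(string_pairs), len(numeric_pairs))
```

Both filters return the same number of pairs, so the choice only affects how each pair is labelled and ordered in the printout.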

### Direct Embedding Space Comparison

Let's perform a more direct, like-for-like comparison of how exactly the same text is represented in each embedding space. This will help us see how the two models differ in their vector representations.

In [37]:
# Define comparison text samples: climate-specific terms, general terms, and phrases from the UK document
comparison_terms = [
    # Climate policy specific
    "carbon emissions",
    "greenhouse gas reduction",
    "climate finance",
    "adaptation and mitigation",
    "nationally determined contributions",
    # General policy terms, plus one off-topic control phrase
    "international cooperation",
    "resource allocation",
    "governance framework",
    "policy implementation", 
    "my doggie is so cute",
    # Some exact phrases from the UK document
    "reducing economy-wide greenhouse gas emissions",
    "limiting global warming to 1.5 degrees",
    "transition to a low-carbon economy",
    "United Kingdom's commitment to climate action",
    "implementation of the Paris Agreement"
]

# Type labels to help with visualisation
term_types = [
    "climate", "climate", "climate", "climate", "climate",
    "general", "general", "general", "general", "general",
    "document", "document", "document", "document", "document"
]

In [38]:
# Get embeddings from both models
# We'll use mean pooling as before
climate_term_embeddings = get_embeddings(comparison_terms, climate_model, climate_tokenizer)
roberta_term_embeddings = get_embeddings(comparison_terms, roberta_model, roberta_tokenizer)
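
The comment above mentions mean pooling. The implementation of `get_embeddings` is not shown in this notebook, so the following is only a sketch of what mean pooling typically does: average the token vectors of each sequence while ignoring padding positions. The function name `mean_pool` and the toy arrays are assumptions for illustration.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padding positions.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Toy batch: 2 sequences, 3 token slots, hidden size 2
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last slot is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = mean_pool(tokens, mask)
print(pooled)
```

The padded slot in the first sequence does not affect its pooled vector, which is the point of weighting by the attention mask.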

#### Vector Distance Analysis

Let's compare how these terms are positioned in each embedding space using cosine similarity matrices.
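
For reference, a cosine similarity matrix can be computed by normalising each row to unit length and taking dot products. The sketch below uses plain NumPy on toy vectors; `cosine_similarity_matrix` is an illustrative name, and for dense inputs it should agree with `sklearn.metrics.pairwise.cosine_similarity` up to floating-point error.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a 2D array."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

vectors = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
sim = cosine_similarity_matrix(vectors)
print(np.round(sim, 4))
```

Cosine similarity ignores vector length, so `[0, 2]` and `[0, 1]` would be treated as identical directions.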

In [39]:
# Calculate similarity matrices
climate_sim_matrix = cosine_similarity(climate_term_embeddings)
roberta_sim_matrix = cosine_similarity(roberta_term_embeddings)

# Create DataFrames for better visualization
climate_sim_df = pd.DataFrame(climate_sim_matrix, 
                             index=comparison_terms,
                             columns=comparison_terms)

roberta_sim_df = pd.DataFrame(roberta_sim_matrix, 
                             index=comparison_terms,
                             columns=comparison_terms)

In [47]:
# Calculate the difference between models
diff_matrix = climate_sim_matrix - roberta_sim_matrix
diff_df = pd.DataFrame(diff_matrix,
                      index=comparison_terms, 
                      columns=comparison_terms)

# Find the term pairs with the largest differences
# We'll flatten the upper triangle of the matrix to avoid duplicates
diff_flat = []
for i in range(len(comparison_terms)):
    for j in range(i+1, len(comparison_terms)):
        diff_flat.append({
            'term1': comparison_terms[i],
            'term2': comparison_terms[j],
            'type1': term_types[i],
            'type2': term_types[j],
            'climate_sim': climate_sim_matrix[i,j],
            'roberta_sim': roberta_sim_matrix[i,j],
            'diff': climate_sim_matrix[i,j] - roberta_sim_matrix[i,j],
            'abs_diff': abs(climate_sim_matrix[i,j] - roberta_sim_matrix[i,j])
        })

diff_pairs_df = pd.DataFrame(diff_flat)

# Display the term pairs with the largest differences
print("Term pairs with largest embedding differences between models:")
display(diff_pairs_df.nlargest(10, 'abs_diff')
       .sort_values('diff', ascending=False)
       [['term1', 'type1', 'term2', 'type2', 'climate_sim', 'roberta_sim', 'diff']]
       .style.format({'climate_sim': '{:.4f}', 'roberta_sim': '{:.4f}', 'diff': '{:.4f}'})
       .set_caption('Largest differences in semantic similarity between models'))


Term pairs with largest embedding differences between models:


| | term1 | type1 | term2 | type2 | climate_sim | roberta_sim | diff |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 63 | international cooperation | general | my doggie is so cute | general | 0.9928 | 0.9986 | -0.0058 |
| 33 | climate finance | climate | my doggie is so cute | general | 0.9928 | 0.9986 | -0.0058 |
| 93 | my doggie is so cute | general | United Kingdom's commitment to climate action | document | 0.9919 | 0.9978 | -0.0059 |
| 44 | adaptation and mitigation | climate | my doggie is so cute | general | 0.9925 | 0.9986 | -0.0061 |
| 92 | my doggie is so cute | general | transition to a low-carbon economy | document | 0.992 | 0.9983 | -0.0063 |
| 98 | reducing economy-wide greenhouse gas emissions | document | implementation of the Paris Agreement | document | 0.9924 | 0.9987 | -0.0064 |
| 90 | my doggie is so cute | general | reducing economy-wide greenhouse gas emissions | document | 0.9913 | 0.998 | -0.0067 |
| 91 | my doggie is so cute | general | limiting global warming to 1.5 degrees | document | 0.9907 | 0.9979 | -0.0072 |
| 54 | nationally determined contributions | climate | my doggie is so cute | general | 0.9905 | 0.9982 | -0.0077 |
| 94 | my doggie is so cute | general | implementation of the Paris Agreement | document | 0.9902 | 0.998 | -0.0078 |
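
The nested loop that flattens the upper triangle can also be written with `np.triu_indices`. The sketch below shows the idea on a toy 3x3 difference matrix; `upper_triangle_pairs` and the labels are illustrative and not part of the notebook's helpers.

```python
import numpy as np

def upper_triangle_pairs(matrix, labels):
    """Return (label_i, label_j, value) for every pair with i < j."""
    rows, cols = np.triu_indices(len(labels), k=1)  # k=1 skips the diagonal
    return [(labels[i], labels[j], float(matrix[i, j])) for i, j in zip(rows, cols)]

labels = ["a", "b", "c"]
matrix = np.array([
    [0.0, 0.1, -0.3],
    [0.1, 0.0, 0.2],
    [-0.3, 0.2, 0.0],
])

pairs = upper_triangle_pairs(matrix, labels)
largest = max(pairs, key=lambda p: abs(p[2]))
print(pairs)
print(largest)
```

For n terms this yields n(n-1)/2 pairs, so the 15 comparison terms give 105 pairs.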


#### Visualise Embeddings with Dimensionality Reduction

In [41]:
from sklearn.manifold import TSNE
import numpy as np

# Use t-SNE to reduce dimensionality to 2D for visualization
tsne = TSNE(n_components=2, random_state=42, perplexity=5)

# Transform both embedding sets
climate_coords = tsne.fit_transform(climate_term_embeddings)
roberta_coords = tsne.fit_transform(roberta_term_embeddings)

# Create DataFrame for plotting with lets_plot
climate_df = pd.DataFrame({
    'x': climate_coords[:, 0],
    'y': climate_coords[:, 1],
    'term': comparison_terms,
    'type': term_types,
    'model': ['ClimateBERT'] * len(comparison_terms)
})

roberta_df = pd.DataFrame({
    'x': roberta_coords[:, 0],
    'y': roberta_coords[:, 1],
    'term': comparison_terms,
    'type': term_types,
    'model': ['DistilRoBERTa'] * len(comparison_terms)
})

plot_df = pd.concat([climate_df, roberta_df])

In [49]:
# Create separate plots for each model for clarity
climate_plot = ggplot(climate_df, aes(x='x', y='y', color='type', label='term')) + \
    geom_point(size=5) + \
    geom_text(nudge_x=0.5, nudge_y=0.5) + \
    ggtitle('ClimateBERT Embeddings Projected to 2D') + \
    theme_minimal()

roberta_plot = ggplot(roberta_df, aes(x='x', y='y', color='type', label='term')) + \
    geom_point(size=5) + \
    geom_text(nudge_x=0.5, nudge_y=0.5) + \
    ggtitle('DistilRoBERTa Embeddings Projected to 2D') + \
    theme_minimal()

display(climate_plot)
display(roberta_plot)
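
t-SNE layouts depend on the perplexity and random seed, and distances between well-separated points in the plot are not directly interpretable. As a deterministic cross-check, a linear PCA projection can be computed with an SVD. This swaps in a different technique (PCA, not t-SNE); `pca_2d` and the random stand-in data are assumptions for illustration only.

```python
import numpy as np

def pca_2d(embeddings):
    """Project rows onto their first two principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    explained = (singular_values[:2] ** 2) / np.sum(singular_values ** 2)
    return coords, explained

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(15, 8))  # stand-in for 15 term embeddings

coords, explained = pca_2d(toy_embeddings)
print(coords.shape)
print(explained)
```

The `explained` values give the share of variance captured by each of the two axes, which helps judge how much structure a 2D view can show at all.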

#### Compare Word Distances Between Types

Let's examine how the models structure the relationships between different types of terms:

In [43]:
# Function to calculate average similarity between term types
def avg_similarity_between_types(sim_matrix, terms, types, type1, type2):
    type1_indices = [i for i, t in enumerate(types) if t == type1]
    type2_indices = [i for i, t in enumerate(types) if t == type2]
    
    similarities = []
    for i in type1_indices:
        for j in type2_indices:
            if i != j:  # Avoid comparing term with itself
                similarities.append(sim_matrix[i, j])
    
    return np.mean(similarities)

# Calculate average similarities between and within types
term_types_unique = sorted(set(term_types))  # sorted for a reproducible order
comparison_results = []

for t1 in term_types_unique:
    for t2 in term_types_unique:
        # Use distinct names so the climate_sim / roberta_sim matrices
        # from the UK document cells are not overwritten by scalars
        climate_avg = avg_similarity_between_types(climate_sim_matrix, comparison_terms, term_types, t1, t2)
        roberta_avg = avg_similarity_between_types(roberta_sim_matrix, comparison_terms, term_types, t1, t2)
        comparison_results.append({
            'type1': t1,
            'type2': t2,
            'climate_sim': climate_avg,
            'roberta_sim': roberta_avg,
            'diff': climate_avg - roberta_avg
        })

type_comparison_df = pd.DataFrame(comparison_results)
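
To make the averaging concrete, here is a small vectorised sketch of the same idea on a toy 3x3 similarity matrix: select the block for two types with a boolean mask and drop the diagonal so a term is never compared with itself. `avg_similarity` and the toy values are illustrative only.

```python
import numpy as np

def avg_similarity(sim_matrix, types, type1, type2):
    """Mean similarity between two groups of terms, excluding self-pairs."""
    types = np.asarray(types)
    block_mask = np.outer(types == type1, types == type2)
    block_mask &= ~np.eye(len(types), dtype=bool)  # drop term-vs-itself entries
    return float(sim_matrix[block_mask].mean())

toy_types = ["climate", "climate", "general"]
toy_sim = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.4],
    [0.2, 0.4, 1.0],
])

within_climate = avg_similarity(toy_sim, toy_types, "climate", "climate")
climate_general = avg_similarity(toy_sim, toy_types, "climate", "general")
print(within_climate)
print(climate_general)
```

Without the diagonal exclusion, the within-type average would be inflated by the self-similarities of 1.0. A type with a single term has no within-type pairs, so its within-type average would be undefined.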

In [44]:
# Display results
display(type_comparison_df.pivot(index='type1', columns='type2', values=['climate_sim', 'roberta_sim', 'diff'])
       .style.format('{:.4f}')
       .set_caption("Average similarity between term types"))

# Create a visualization
type_plot_df = pd.melt(
    type_comparison_df, 
    id_vars=['type1', 'type2'],
    value_vars=['climate_sim', 'roberta_sim'],
    var_name='model',
    value_name='similarity'
)

# Clean up model names
type_plot_df['model'] = type_plot_df['model'].map({
    'climate_sim': 'ClimateBERT',
    'roberta_sim': 'DistilRoBERTa'
})

# Filter for same type comparisons
same_type_df = type_plot_df[type_plot_df['type1'] == type_plot_df['type2']]

# Create a bar chart of within-category similarity
within_category_plot = ggplot(same_type_df, aes(x='type1', y='similarity', fill='model')) + \
    geom_bar(stat='identity', position='dodge') + \
    ggtitle('Within-Category Term Similarity by Model') + \
    theme_minimal()

display(within_category_plot)

Average similarity between term types (rows: `type1`; columns: measure and `type2`)

| type1 | climate_sim (climate) | climate_sim (document) | climate_sim (general) | roberta_sim (climate) | roberta_sim (document) | roberta_sim (general) | diff (climate) | diff (document) | diff (general) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| climate | 0.9968 | 0.9955 | 0.9959 | 0.9993 | 0.999 | 0.9992 | -0.0026 | -0.0036 | -0.0033 |
| document | 0.9955 | 0.9945 | 0.9945 | 0.999 | 0.9989 | 0.9988 | -0.0036 | -0.0045 | -0.0042 |
| general | 0.9959 | 0.9945 | 0.996 | 0.9992 | 0.9988 | 0.9992 | -0.0033 | -0.0042 | -0.0033 |
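
One way to read this table is to compare, for each model, the within-climate average against the climate-to-general average. The sketch below does that arithmetic on the rounded values shown above. The gaps sit at the fourth decimal place and rest on five terms per group, so they are illustrative rather than evidence for either model.

```python
# Values copied from the table above (rounded to four decimals)
within_climate = {"ClimateBERT": 0.9968, "DistilRoBERTa": 0.9993}
climate_vs_general = {"ClimateBERT": 0.9959, "DistilRoBERTa": 0.9992}

# Gap between within-category and cross-category similarity, per model
gaps = {
    model: round(within_climate[model] - climate_vs_general[model], 4)
    for model in within_climate
}
print(gaps)
```

A larger gap would mean a model separates climate terms from general terms more clearly, independent of its overall similarity level.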
