## Classification of Climate | Non-Climate posts using Sentence Embeddings + Cosine Similarity

In [1]:
pip install sentence-transformers scikit-learn


In [5]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import json

In [6]:
with open("/Users/tobiasmichelsen/Bachelor_Project/DS_BachelorProject_PH/data/filtered/english_posts.json") as f:
    english_posts = json.load(f)

In [8]:
# Extend these prompts so they capture what "climate" means from multiple perspectives.

# A limitation of a cosine-similarity classifier for [Climate] vs. [Non-Climate] is that the result depends on the reference (anchor) sentences chosen to represent climate.

# The current prompts are NOT based on actual data; they were generated with ChatGPT. Revise accordingly and discuss with Luca.
climate_prompts = [
    "The world is heating up",
    "Global warming is imminent",
    "Marine life is dwindling",
    "The seas are rising",
    "Carbon emissions are reaching a record high",
    "The Climate crisis is a hoax",
]

In [3]:
#https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

model = SentenceTransformer("all-MiniLM-L6-v2")

In [10]:
anchor_embeddings = model.encode(climate_prompts)
print(anchor_embeddings)

[[ 0.02835135 -0.00131723  0.03569233 ... -0.10806449 -0.06337425
   0.03882696]
 [-0.07531293  0.02271377  0.08144233 ... -0.06254279 -0.01569482
  -0.00166639]
 [-0.029637    0.03345019  0.0721257  ... -0.08058383  0.00560287
   0.08174925]
 [-0.05719218  0.0022417   0.0710148  ... -0.05447808 -0.02128065
   0.10292619]
 [ 0.02706154  0.00828526  0.03643303 ... -0.12502515 -0.02559752
   0.0579204 ]
 [-0.01609303  0.02864322  0.07125508 ... -0.03571245 -0.04944063
   0.02691268]]


In [12]:
# Use the first 15,000 of the 150,000 English posts as a subset

subset_english_posts = english_posts[:15000]

In [13]:
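# Takes approx. 8.5 minutes for 15,000 posts

# Similarity cutoff to tune: a lower threshold flags more posts as climate-related (more false positives), a higher one flags fewer
threshold = 0.6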

climate_related_posts = []

for post in subset_english_posts:
    text = post.get("text", "")
    if not text.strip():
        continue

    post_embedding = model.encode([text])[0]
    similarities = cosine_similarity([post_embedding], anchor_embeddings)[0]

    if np.max(similarities) > threshold:
        climate_related_posts.append(post)

print(f"Identified {len(climate_related_posts)} climate-related posts out of {len(subset_english_posts)}")

Identified 2 climate-related posts out of 15000
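
The loop above encodes and compares one post at a time. As a rough sketch of an alternative, the comparison step can be done for all posts in a single matrix product once their embeddings are available (for example by passing the whole list of texts to `model.encode`, which accepts a list of sentences). The helper name `max_anchor_similarity` and the toy 3-dimensional vectors below are assumptions made for illustration; they are not part of the notebook and stand in for the real model output.

```python
import numpy as np

def max_anchor_similarity(post_embeddings, anchor_embeddings):
    """Return, for each post, its highest cosine similarity to any anchor."""
    posts = np.asarray(post_embeddings, dtype=float)
    anchors = np.asarray(anchor_embeddings, dtype=float)
    # L2-normalise each row so that a dot product equals cosine similarity.
    # (Assumes no all-zero rows, which would divide by zero.)
    posts = posts / np.linalg.norm(posts, axis=1, keepdims=True)
    anchors = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    # One (n_posts, n_anchors) similarity matrix instead of a Python loop.
    similarities = posts @ anchors.T
    return similarities.max(axis=1)

# Toy vectors standing in for real sentence embeddings (illustrative only).
toy_anchors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
toy_posts = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 0.0]])

toy_threshold = 0.6
scores = max_anchor_similarity(toy_posts, toy_anchors)
keep_mask = scores > toy_threshold  # same rule as in the loop: max similarity > threshold
print(scores, keep_mask)
```

With real data, `keep_mask` could be used to select the matching entries of `subset_english_posts`; empty posts would still need to be filtered out before encoding, as the loop does.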


In [14]:
output_path = "/Users/tobiasmichelsen/Bachelor_Project/DS_BachelorProject_PH/data/filtered/climate_posts.json"
with open(output_path, "w") as f:
    json.dump(climate_related_posts, f, indent=2)

Next steps: scrape English posts over a week, "randomly" sample climate-related posts and add them to the anchor points, then rerun the classification to find `climate_related_posts` and tune the threshold further.
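
One way to approach the threshold tuning mentioned above is to store each post's maximum anchor similarity once and then count how many posts each candidate threshold would keep. The sketch below assumes such a list of scores already exists; `count_above_thresholds` and the toy scores are hypothetical and only illustrate the idea.

```python
import numpy as np

def count_above_thresholds(max_similarities, thresholds):
    """Map each candidate threshold to the number of posts scoring above it."""
    scores = np.asarray(max_similarities, dtype=float)
    return {float(t): int((scores > t).sum()) for t in thresholds}

# Made-up maximum similarities for six posts (not real results).
toy_scores = [0.12, 0.35, 0.48, 0.55, 0.61, 0.72]
counts = count_above_thresholds(toy_scores, [0.3, 0.4, 0.5, 0.6])
print(counts)
```

Counts alone do not say whether the kept posts are actually about climate, so a manual check of a sample at each threshold would still be needed.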

In [None]:
# See the code from Hugging Face below if we want more transparency into how the embeddings are computed
"""from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)"""
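
To make the pooling step in the Hugging Face snippet easier to follow, here is a small NumPy sketch of the same idea: average the token vectors of each sentence while ignoring padding positions, then L2-normalise the result. The function name `masked_mean_pool` and the toy numbers are assumptions for illustration; the real code operates on PyTorch tensors.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, skipping positions where the mask is 0."""
    tokens = np.asarray(token_embeddings, dtype=float)          # (batch, seq, dim)
    mask = np.asarray(attention_mask, dtype=float)[..., None]   # (batch, seq, 1)
    summed = (tokens * mask).sum(axis=1)                        # padding contributes 0
    n_tokens = np.clip(mask.sum(axis=1), 1e-9, None)            # avoid division by zero
    return summed / n_tokens

# One toy sentence with two real tokens and one padding token.
toy_tokens = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
toy_mask = [[1, 1, 0]]

pooled = masked_mean_pool(toy_tokens, toy_mask)
normalised = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
print(pooled, normalised)
```

The padding vector `[100.0, 100.0]` has no effect on the result, which is the point of taking the attention mask into account.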
