<h1 align="center">LAB1S1_p4. Contextual word-embeddings for text representation</h1>

<h3 style="display:block; margin-top:5px;" align="center">Natural Language and Information Retrieval</h3>
<h3 style="display:block; margin-top:5px;" align="center">Degree in Data Science</h3>
<h3 style="display:block; margin-top:5px;" align="center">2024-2025</h3>    
<h3 style="display:block; margin-top:5px;" align="center">ETSInf. Universitat Politècnica de València</h3>
<br>

## Authors:
- Marcos Ranchal
- Marc Siquier

In [11]:
!pip install -U transformers
!pip install -U emoji
!pip install -U ipywidgets



## Import libraries

In [12]:
import pandas as pd
import torch
from transformers import AutoModel, AutoTokenizer
from transformers import BertTokenizer, BertModel, RobertaTokenizer, RobertaModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

## Read the corpora

In [13]:
filepath = {
    "english": "EXIST2024_EN_examples_mini.csv",
    "spanish": "EXIST2024_ES_examples_mini.csv"
}
df = {k: pd.read_csv(v, sep="\t") for k, v in filepath.items()}
for k, v in df.items():
    print(f"DataFrame for {k}:")
    print(v.head())
# OR USING Google colab
#from google.colab import drive
#drive.mount('/content/drive')
#df = {
#    "english": pd.read_csv("/content/drive/MyDrive/LNR/LNR2025/Lab1/EXIST2024_EN_examples.csv", sep="\t"),
#    "spanish": pd.read_csv("/content/drive/MyDrive/LNR/LNR2025/Lab1/EXIST2024_ES_examples.csv", sep="\t")
#}

DataFrame for english:
       id                                               text label  size
0  200002  Writing a uni essay in my local pub with a cof...   YES   255
1  200003  @UniversalORL it is 2021 not 1921. I dont appr...   YES   191
2  200006  According to a customer I have plenty of time ...   YES   183
3  200007  So only 'blokes' drink beer? Sorry, but if you...   YES   197
4  200008  New to the shelves this week - looking forward...    NO   172
DataFrame for spanish:
       id                                               text label  size
0  100001  @TheChiflis Ignora al otro, es un capullo.El p...   YES   281
1  100002  @ultimonomada_ Si comicsgate se parece en algo...    NO   226
2  100003  @Steven2897 Lee sobre Gamergate, y como eso ha...    NO   233
3  100005  @novadragon21 @icep4ck @TvDannyZ Entonces como...   YES   305
4  100006  @yonkykong Aaah sí. Andrew Dobson. El que se d...    NO   149


## Model names

In [14]:
modelnames = {
    "english": ["bert-base-uncased", "roberta-base"],
    "spanish": ["dccuchile/bert-base-spanish-wwm-uncased", "PlanTL-GOB-ES/roberta-base-bne"]
}

## Which device to use?

In [15]:
if torch.backends.mps.is_available():  # Mac M? GPU
    device = torch.device("mps")
elif torch.cuda.is_available():  # Nvidia GPU
    device = torch.device("cuda")
else:  # CPU
    device = torch.device("cpu")
print(device)

cpu


## Load the tokenizers and the models

In [16]:
# Tokenizers, one per (language, architecture) pair
tokenizer_bert_en = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer_roberta_en = RobertaTokenizer.from_pretrained("roberta-base")
tokenizer_bert_es = BertTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
tokenizer_roberta_es = RobertaTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-bne")

# Pretrained encoders matching the tokenizers above
model_bert_en = BertModel.from_pretrained("bert-base-uncased")
model_roberta_en = RobertaModel.from_pretrained("roberta-base")
model_bert_es = BertModel.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
model_roberta_es = RobertaModel.from_pretrained("PlanTL-GOB-ES/roberta-base-bne")

# Move every model to the selected device
model_bert_en.to(device)
model_roberta_en.to(device)
model_bert_es.to(device)
model_roberta_es.to(device)

# Evaluation mode disables dropout; the models are only used for inference
model_bert_en.eval()
model_roberta_en.eval()
model_bert_es.eval()
model_roberta_es.eval()

print("Models and tokenizers loaded successfully.")


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertModel were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-uncased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaModel were not initialized from the model checkpoint at PlanTL-GOB-ES/roberta-base-bne and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Models and tokenizers loaded successfully.


## Compute tweets representations

In [17]:
batch_size = 16 

def compute_tweet_representation_from_df(df, tokenizer, model, device, batch_size=16):
    """Return one [CLS] vector per tweet in df["text"] as a CPU tensor of shape (n_tweets, hidden_size)."""
    model.to(device)
    model.eval()

    texts = df["text"].tolist()
    tensor_list = []

    # Encode the tweets in consecutive batches; the last batch may be smaller
    for start in range(0, len(texts), batch_size):
        batch_texts = texts[start:start + batch_size]
        inputs = tokenizer(batch_texts, padding=True, truncation=True, return_tensors="pt")
        inputs = {key: value.to(device) for key, value in inputs.items()}

        with torch.no_grad():
            outputs = model(**inputs)
            # The hidden state of the first token ([CLS] / <s>) represents the whole tweet
            cls_vector = outputs.last_hidden_state[:, 0, :]

        tensor_list.append(cls_vector.cpu())

    return torch.cat(tensor_list)

embeddings_english = compute_tweet_representation_from_df(df=df["english"],tokenizer=tokenizer_bert_en,model=model_bert_en,device=device,batch_size=batch_size)

embeddings_spanish = compute_tweet_representation_from_df(df=df["spanish"],tokenizer=tokenizer_bert_es,model=model_bert_es,device=device,batch_size=batch_size)

print("English embeddings:", embeddings_english.shape)
print("Spanish embeddings:", embeddings_spanish.shape)

English embeddings: torch.Size([748, 768])
Spanish embeddings: torch.Size([702, 768])
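The calls above encode each corpus with its BERT tokenizer and model only. A minimal sketch of how the same texts could be encoded once per model, so that every model name maps to its own vectors, is shown below. The names `encode_per_model`, `load_fn`, `encode_fn`, `fake_load` and `fake_encode` are hypothetical and introduced only for illustration; they are not part of the notebook or of any library.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def encode_per_model(
    texts: Iterable[str],
    model_names: Iterable[str],
    load_fn: Callable[[str], Tuple[Any, Any]],
    encode_fn: Callable[[List[str], Any, Any], Any],
) -> Dict[str, Any]:
    """Encode the same texts once per model and key the result by model name.

    Assumptions (hypothetical hooks): load_fn(name) returns a
    (tokenizer, model) pair, and encode_fn(texts, tokenizer, model)
    returns the embeddings for those texts.
    """
    texts = list(texts)
    embeddings = {}
    for name in model_names:
        tokenizer, model = load_fn(name)
        embeddings[name] = encode_fn(texts, tokenizer, model)
    return embeddings


# Toy stand-ins so the sketch runs without downloading any model.
def fake_load(name):
    return f"tok:{name}", f"model:{name}"


def fake_encode(texts, tokenizer, model):
    # One small "vector" per text; the second value depends on the model,
    # so two different models cannot end up sharing the same embeddings.
    return [[len(t), len(model)] for t in texts]


demo = encode_per_model(
    ["hello", "hi"], ["bert-base-uncased", "roberta-base"], fake_load, fake_encode
)
print(demo["bert-base-uncased"] != demo["roberta-base"])
```

In the notebook, `load_fn` could wrap the `from_pretrained` calls and `encode_fn` could wrap `compute_tweet_representation_from_df`; keeping the result in a dictionary keyed by model name makes it harder to compare two models on the same vectors by mistake.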


## Compute cosine similarities

In [18]:
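def encontrar_mas_similar(vectores_texto, etiquetas, etiqueta_objetivo, modelo, idioma="english"):
    """Return the IDs and the cosine similarity of the two most similar tweets with a given label.

    Reads the global `df`; `modelo` is accepted but not used.
    """
    # Keep only the tweets (and their vectors) whose label matches the target
    indices = df[idioma][etiquetas] == etiqueta_objetivo
    sub_df = df[idioma][indices]
    sub_vectores = vectores_texto[indices]
    similitud_coseno = cosine_similarity(sub_vectores)

    # Cosine similarity lies in [-1, 1], so start below any possible value
    max_similitud, mejor_par = -np.inf, (None, None)
    for i in range(sub_vectores.shape[0]):
        # j > i skips self-pairs and avoids visiting each pair twice
        for j in range(i + 1, sub_vectores.shape[0]):
            if similitud_coseno[i, j] > max_similitud:
                max_similitud = similitud_coseno[i, j]
                mejor_par = (sub_df.iloc[i]["id"], sub_df.iloc[j]["id"])

    return mejor_par, max_similitud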
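The double loop above scans the upper triangle of the similarity matrix. As a hedged alternative, the same search can be sketched with NumPy alone. `most_similar_pair` is a hypothetical helper, not part of the notebook; it returns row positions rather than tweet IDs, and it assumes at least two rows and no all-zero vectors.

```python
import numpy as np


def most_similar_pair(vectors):
    """Return ((i, j), similarity) for the most similar pair of distinct rows."""
    vectors = np.asarray(vectors, dtype=float)
    # Normalise each row so that a dot product equals the cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    sims = unit @ unit.T
    # Strict upper triangle: every pair with i < j exactly once.
    rows, cols = np.triu_indices(len(vectors), k=1)
    best = int(np.argmax(sims[rows, cols]))
    i, j = int(rows[best]), int(cols[best])
    return (i, j), float(sims[i, j])


# Toy vectors: rows 0 and 2 point in almost the same direction.
toy = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.1]])
pair, sim = most_similar_pair(toy)
print(pair, round(sim, 4))
```

The returned positions can be mapped back to tweet IDs with `.iloc` on the label-filtered DataFrame, as the loop version does.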


## Show results

In [19]:
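# Embeddings to compare, keyed by model name.
# NOTE: only the BERT embeddings are computed above. The RoBERTa entries reuse them,
# so both models report identical pairs and similarities in the output below.

models_en = {
    "bert-base-uncased": embeddings_english,
    "roberta-base": embeddings_english  # placeholder: BERT vectors, not RoBERTa ones
}

models_es = {
    "dccuchile/bert-base-spanish-wwm-uncased": embeddings_spanish,
    "PlanTL-GOB-ES/roberta-base-bne": embeddings_spanish  # placeholder: BERT vectors, not RoBERTa ones
}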

for name, vectors in models_en.items():
    print(f"----------------------\n{name}\n" + "-" * 22)

    for label in ['NO', 'YES']:
        best_pair, similarity = encontrar_mas_similar(vectors, "label", label, name, idioma="english")
        print(f"Label: {label} \nTweets IDs: {best_pair} \nSimilarity: {similarity:.4f}")
        print(f"Tweets: \n \t1: {df['english'][df['english']['id'] == best_pair[0]]['text'].values[0]} \n \t2: {df['english'][df['english']['id'] == best_pair[1]]['text'].values[0]}")
        if label == "NO":
            print("-" * 20)

for name, vectors in models_es.items():
    print(f"----------------------\n{name}\n" + "-" * 22)
    for label in ['NO', 'YES']:
        best_pair, similarity = encontrar_mas_similar(vectors, "label", label, name, idioma="spanish")
        print(f"Label: {label} \nTweets IDs: {best_pair} \nSimilarity: {similarity:.4f}")
        print(f"Tweets: \n \t1: {df['spanish'][df['spanish']['id'] == best_pair[0]]['text'].values[0]} \n \t2: {df['spanish'][df['spanish']['id'] == best_pair[1]]['text'].values[0]}")
        if label == "NO":
            print("-" * 20)

----------------------
bert-base-uncased
----------------------


Label: NO 
Tweets IDs: (200469, 200637) 
Similarity: 0.9739
Tweets: 
 	1: I still wish they turned this into a boss fight. https://t.co/HyvPYJPHJc 
 	2: I don't particularly care or want to know about the cock carousel. Everyone has a past. https://t.co/73WMTyEKHt
--------------------
Label: YES 
Tweets IDs: (200538, 200642) 
Similarity: 0.9774
Tweets: 
 	1: The mighty ass. Call me sexist I do not care. https://t.co/LzXw4iRbLR 
 	2: @RP_JetBlack Not shaming you at all! I too am a massive slut and a total cock tease. https://t.co/HbZiZXRi0N
----------------------
roberta-base
----------------------
Label: NO 
Tweets IDs: (200469, 200637) 
Similarity: 0.9739
Tweets: 
 	1: I still wish they turned this into a boss fight. https://t.co/HyvPYJPHJc 
 	2: I don't particularly care or want to know about the cock carousel. Everyone has a past. https://t.co/73WMTyEKHt
--------------------
Label: YES 
Tweets IDs: (200538, 200642) 
Similarity: 0.9774
Tweets: 
 	1: The mighty ass. Call me sexist I d