# Contextual Word Embeddings for Text Representation

This notebook explores the use of contextual word embeddings for text representation in Natural Language Processing (NLP). It focuses on generating tweet embeddings using pre-trained transformer-based models such as BERT and RoBERTa for both English and Spanish corpora.

The notebook includes steps for:
* loading datasets
* tokenizing text
* computing embeddings using pre-trained models
* preparing representations for downstream tasks like classification or semantic similarity analysis

The notebook is implemented in Python with libraries such as Hugging Face Transformers, PyTorch, and scikit-learn.

In [1]:
!pip install -U transformers
!pip install -U emoji
!pip install -U ipywidgets

Collecting transformers
  Downloading transformers-4.50.0-py3-none-any.whl.metadata (39 kB)
Downloading transformers-4.50.0-py3-none-any.whl (10.2 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.49.0
    Uninstalling transformers-4.49.0:
      Successfully uninstalled transformers-4.49.0
Successfully installed transformers-4.50.0
Collecting emoji
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.14.1-py3-none-any.whl (590 kB)
Installing collected packages: emoji
Successfully installed emoji-2.14.1
Collecting ipywidgets
  Downloading ipywidgets-8.1.5-py3-none-any.whl.metadata (2.3 kB)
Collecting comm>=0.1.3 (from ipywidgets)

## Import libraries

In [2]:
import pandas as pd
import torch
from transformers import AutoModel, AutoTokenizer
from transformers import BertTokenizer, BertModel, RobertaTokenizer, RobertaModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

## Read the corpora

In [3]:
# filepath = {
#     "english": "EXIST2024_EN_examples_mini.csv",
#     "spanish": "EXIST2024_ES_examples_mini.csv"
# }
# df = {k: pd.read_csv(v, sep="\t") for k, v in filepath.items()}

from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [4]:
df = {
   "english": pd.read_csv("/content/drive/MyDrive/EXIST2024_EN_examples_mini.csv", sep="\t"),
   "spanish": pd.read_csv("/content/drive/MyDrive/EXIST2024_ES_examples_mini.csv", sep="\t")
}


In [5]:
modelnames = {
    "english": ["bert-base-uncased", "roberta-base"],
    "spanish": ["dccuchile/bert-base-spanish-wwm-uncased", "PlanTL-GOB-ES/roberta-base-bne"]
}

In [6]:
if torch.backends.mps.is_available():  # Apple Silicon GPU (MPS backend)
    device = torch.device("mps")
elif torch.cuda.is_available():  # Nvidia GPU
    device = torch.device("cuda")
else:  # CPU
    device = torch.device("cpu")
print(device)

cuda


## Compute tweet representations

In [13]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="huggingface_hub.utils._auth")

batch_size = 16

def get_embeds(tokenizer, model, model_name, text):
    """Return one first-token ([CLS] / <s>) vector per input text."""
    # Put the model in inference mode on the target device once, not per batch.
    model.eval()
    model.to(device)

    tensor_list = []
    for i in range(0, len(text), batch_size):
        batch = text[i:i + batch_size]

        # Pad or truncate every tweet to exactly 100 tokens.
        inputs = tokenizer(batch, padding="max_length", max_length=100, truncation=True, return_tensors="pt")
        inputs = inputs.to(device)
        with torch.no_grad():
            outputs = model(**inputs)
            encoded_layers = outputs[0]  # last hidden state: (batch, seq_len, hidden)
            cls_vector = encoded_layers[:, 0, :]  # embedding of the first token

        tensor_list.append(cls_vector)
    cls_vector = torch.cat(tensor_list).cpu()
    print(f"Model: {model_name}, {cls_vector.size()}")
    return cls_vector

import transformers  # required for transformers.logging below

transformers.logging.set_verbosity_error()
data = []
for lang in ["english", "spanish"]:
    for model_name in modelnames[lang]:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModel.from_pretrained(model_name)
        text = df[lang]["text"].tolist()

        embed = get_embeds(tokenizer, model, model_name, text)
        data.append((embed, model_name, lang))

Model: bert-base-uncased, torch.Size([748, 768])
Model: roberta-base, torch.Size([748, 768])
Model: dccuchile/bert-base-spanish-wwm-uncased, torch.Size([702, 768])
Model: PlanTL-GOB-ES/roberta-base-bne, torch.Size([702, 768])
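
The shapes printed above follow from the batching loop in `get_embeds`: each batch produces a `(batch, seq_len, hidden)` tensor, the slice `[:, 0, :]` keeps only the first-token vector, and concatenating along the batch axis gives one row per tweet. Below is a minimal NumPy sketch of that bookkeeping. `fake_encoder` and `batched_first_token` are hypothetical names introduced here for illustration; `fake_encoder` returns random values instead of real transformer outputs, and the sizes 748, 100, and 768 are taken from the English run above.

```python
import numpy as np

BATCH_SIZE = 16   # same value as in the notebook
SEQ_LEN = 100     # max_length used when tokenizing
HIDDEN = 768      # hidden size reported in the output above

rng = np.random.default_rng(0)

def fake_encoder(batch):
    """Hypothetical stand-in for model(**inputs)[0]: random hidden states."""
    return rng.normal(size=(len(batch), SEQ_LEN, HIDDEN)).astype(np.float32)

def batched_first_token(texts, batch_size=BATCH_SIZE):
    """Collect the first-token vector of every text, batch by batch."""
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        hidden = fake_encoder(batch)        # (batch, SEQ_LEN, HIDDEN)
        chunks.append(hidden[:, 0, :])      # (batch, HIDDEN)
    return np.concatenate(chunks, axis=0)   # (len(texts), HIDDEN)

texts = [f"tweet {i}" for i in range(748)]
embeddings = batched_first_token(texts)
print(embeddings.shape)  # (748, 768)
```

Because 748 = 46 × 16 + 12, the final batch holds only 12 texts; plain slicing handles the short batch without any special case.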


## Compute cosine similarities

In [9]:
def find_closest_similarity(model_embed, tweets, is_sexist):
  """Print the two tweets whose embeddings have the highest cosine similarity."""
  similarity = np.round(cosine_similarity(model_embed, model_embed), 4)

  # Search only above the diagonal so self-pairs and mirrored pairs are skipped.
  tri_upper_indices = np.triu_indices_from(similarity, k=1)
  max_index = np.argmax(similarity[tri_upper_indices])
  tweet_idx1, tweet_idx2 = tri_upper_indices[0][max_index], tri_upper_indices[1][max_index]

  label = "Yes" if is_sexist else "NO"
  print(f"label: {label}\n sentence1: {tweets.iloc[tweet_idx1]['text']} \n --------------------")
  # The value printed as "distance" is the cosine similarity (higher means closer).
  print(f"sentence2: {tweets.iloc[tweet_idx2]['text']} \n distance: {similarity[tweet_idx1, tweet_idx2]:.4f}\n")
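
To make the upper-triangle search concrete, here is a small self-contained sketch on made-up 2-D vectors. The data and the helper name `most_similar_pair` are illustrative and not part of the notebook, and cosine similarity is computed with plain NumPy rather than scikit-learn so the block has no extra dependencies.

```python
import numpy as np

def most_similar_pair(vectors):
    """Return (i, j, similarity) for the most similar pair of distinct rows."""
    # Normalize rows so that the dot product equals the cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    similarity = unit @ unit.T
    # k=1 excludes the diagonal (self-similarity) and the mirrored lower triangle.
    rows, cols = np.triu_indices_from(similarity, k=1)
    best = np.argmax(similarity[rows, cols])
    i, j = int(rows[best]), int(cols[best])
    return i, j, float(similarity[i, j])

toy = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 0.1],   # nearly parallel to row 0
    [-1.0, 0.0],
])
i, j, sim = most_similar_pair(toy)
print(i, j, round(sim, 4))  # 0 2 0.9988
```

Without `k=1`, the argmax would always land on the diagonal, where every vector has similarity 1.0 with itself.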


## Show results

In [12]:
def show_results(tweets, name, model_embed):
  tweets_nonsexist = tweets[tweets["label"] == "NO"].reset_index(drop=True)
  tweets_sexist = tweets[tweets["label"] == "YES"].reset_index(drop=True)

  # Pick the embedding rows for each label (assumes the default 0..n-1 index).
  embeds_sexist = np.array([model_embed[i] for i in tweets[tweets["label"] == "YES"].index.to_list()])
  embeds_nonsexist = np.array([model_embed[i] for i in tweets[tweets["label"] == "NO"].index.to_list()])

  print(f"{name}\n# =======================================\n")
  # Use a separate loop variable so the `tweets` argument is not overwritten.
  for subset, is_sexist, embeddings in [(tweets_nonsexist, False, embeds_nonsexist), (tweets_sexist, True, embeds_sexist)]:
      find_closest_similarity(embeddings, subset, is_sexist)

spanish_data = df["spanish"]
english_data = df["english"]

for embed, model_name, lang in data:
      show_results(df[lang], model_name, embed)

bert-base-uncased
label: NO
 sentence1: I still wish they turned this into a boss fight. https://t.co/HyvPYJPHJc 
 --------------------
sentence2: I don't particularly care or want to know about the cock carousel. Everyone has a past. https://t.co/73WMTyEKHt 
 distance: 0.9739

label: Yes
 sentence1: The mighty ass. Call me sexist I do not care. https://t.co/LzXw4iRbLR 
 --------------------
sentence2: @RP_JetBlack Not shaming you at all! I too am a massive slut and a total cock tease. https://t.co/HbZiZXRi0N 
 distance: 0.9774

roberta-base
label: NO
 sentence1: Thank you beautiful friend 😊Sending love and 🕯️🚨 light your way 💓 https://t.co/EbPpAKWqjo https://t.co/n3MDADAH7N 
 --------------------
sentence2: Have a lovely day beautiful sunshine 🌞 ❤️♥️💜🔥🔥🔥🔥🔥🔥🐎 https://t.co/w4yoltPn6z https://t.co/qDf358MMsH 
 distance: 0.9992

label: Yes
 sentence1: @lkmeenha we can’t even have a day without women making it about themselves 🙄 
 --------------------
sentence2: @BigDILF01 Can’t go a day w