# **Rosetta Stone Group Project**
## Group members:
Brandetti Claudia 793871, Mammetti Francesco 805431, Sarcina Daniele 793031


---

# **Set Up**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip cache purge

Files removed: 0




---


# Dataset composition
The dataset contains 949080 rows of sentence pairs spanning 11 languages. Some sentence pairs share the same meaning, while others differ.

The 11 present languages are: English (en), Spanish (es), French (fr), Italian (it), Japanese (ja), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Chinese (zh), German (de).

The dataset is composed of 5 columns: "sentence1", "sentence2", "score", "lang1", "lang2".



---


# Libraries used
* Counter: used to count word frequencies - https://docs.python.org/3/library/collections.html#counter-objects
* pandas: for handling and processing tabular data - https://pandas.pydata.org/docs/
* numpy: for numerical operations and array handling - https://numpy.org/doc/
* torch: PyTorch, used for building and training neural networks - https://docs.pytorch.org/docs/stable/index.html
* tqdm: adds progress bars to loops (e.g., during training) - https://tqdm.github.io/
* warnings: to suppress or handle warning messages - https://docs.python.org/3/library/warnings.html
* fasttext: library for efficient word embeddings and language detection - https://fasttext.cc/docs/en/supervised-tutorial.html
* nltk: natural language processing tools (e.g., tokenization, stopwords) - https://www.nltk.org/
* sklearn: scikit-learn, used here for data splitting and evaluation metrics - https://scikit-learn.org/stable/
* random: for random number generation and reproducibility - https://docs.python.org/3/library/random.html

### Transformers
* BertTokenizer, BertModel: BERT tokenizer and model used for sentence encoding
* MarianMTModel, MarianTokenizer: machine translation model/tokenizer from HuggingFace for back-translation
* AutoTokenizer, AutoModelForCausalLM: generic tokenizer and causal language model interface from HuggingFace
* SentenceTransformer (SBert): to calculate the similarity score for Japanese

A more in-depth explanation of how the transformers have been used can be found later on.



---


# **1. DATA PROCESSING**
The dataset is cleaned by removing rows with missing or empty sentences and duplicate rows, normalizing whitespace, stripping commas, and lowercasing all sentences. Scores are also checked for NaN and out-of-range values.



In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("/content/drive/MyDrive/rosetta stone.csv")
df.shape

(949080, 5)

In [None]:
df.head(100)

Unnamed: 0,sentence1,sentence2,score,lang1,lang2
0,Ein Flugzeug hebt gerade ab.,An air plane is taking off.,5.0,de,en
1,Ein Flugzeug hebt gerade ab.,Un avión está despegando.,5.0,de,es
2,Ein Flugzeug hebt gerade ab.,Un avion est en train de décoller.,5.0,de,fr
3,Ein Flugzeug hebt gerade ab.,Un aereo sta decollando.,5.0,de,it
4,Ein Flugzeug hebt gerade ab.,飛行機が離陸します。,5.0,de,ja
...,...,...,...,...,...
95,Самолет взлетает.,飛行機が離陸します。,5.0,ru,ja
96,Самолет взлетает.,Er gaat een vliegtuig opstijgen.,5.0,ru,nl
97,Самолет взлетает.,Samolot wystartował.,5.0,ru,pl
98,Самолет взлетает.,Um avião aéreo está a descolar.,5.0,ru,pt


# **Preprocessing**  

This function (`preprocess_multilingual_dataset`) standardizes and cleans a multilingual sentence-pair dataset for training/evaluating **semantic similarity models**. It ensures data quality by handling missing values, duplicates, text normalization, and score validation.

## **Key Steps**  

### **1. Input & Initial Setup**  
- **Input DataFrame**: Expects columns:  
  - `sentence1`, `sentence2`: Text pairs to compare.  
  - `lang1`, `lang2`: Language codes (e.g., `'en'`, `'fr'`).  
  - `score`: Numeric similarity score (e.g., 0–5).  
- **Parameters**:  
  - `language_pairs`: Optional filter for specific language combinations.  
  - `lowercase`: Set `False` for case-sensitive models (e.g., LaBSE).  
  - `score_range`: Valid score bounds (default: `(0, 5)`).  
  - `remove_commas`: Strips commas to reduce noise (default: `True`).  

### **2. Data Cleaning Pipeline**  

#### **Step 1: Remove Invalid Rows**  
- Drops rows where either `sentence1` or `sentence2` is:  
  - `NaN` (missing).  
  - Empty (after stripping whitespace).  

#### **Step 2: Deduplication**  
- Removes duplicate rows based on **exact matches** of:  
  - Sentence pairs (`sentence1`, `sentence2`).  
  - Language pairs (`lang1`, `lang2`).  
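
As a minimal sketch of this step, the snippet below runs `drop_duplicates` on a toy DataFrame. The rows are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "sentence1": ["a plane is taking off."] * 3,
    "sentence2": ["un avion décolle.", "un avion décolle.", "un avión despega."],
    "score": [5.0, 5.0, 5.0],
    "lang1": ["en", "en", "en"],
    "lang2": ["fr", "fr", "es"],
})

key_columns = ["sentence1", "sentence2", "lang1", "lang2"]
# Keeps the first row of each exact match on the key columns
deduped = toy.drop_duplicates(subset=key_columns)

print(len(toy), "->", len(deduped))
```

Because the match is exact and this step runs before text normalization, rows that differ only in whitespace or letter case are treated as distinct.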

#### **Step 3: Text Normalization**  
1. **Whitespace Handling**:  
   - Trims leading/trailing spaces.  
   - Replaces multiple spaces/tabs/newlines with a single space (regex: `\s+`).  
2. **Optional Transformations**:  
   - **Lowercasing**: Applied if `lowercase=True`.  
   - **Comma Removal**: Controlled by `remove_commas`.  
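
The normalization steps can be sketched on a single string as follows. `normalize_sentence` is a hypothetical helper written for this illustration; the notebook applies the same operations column-wise with pandas.

```python
import re

def normalize_sentence(text, lowercase=False, remove_commas=True):
    """Apply the normalization steps to one sentence (hypothetical helper)."""
    text = text.strip()                # trim leading/trailing whitespace
    text = re.sub(r"\s+", " ", text)   # collapse runs of spaces/tabs/newlines
    if remove_commas:
        text = text.replace(",", "")   # removes the ASCII comma only
    if lowercase:
        text = text.lower()
    return text

print(normalize_sentence("  Ein Flugzeug,\thebt  gerade ab.\n", lowercase=True))
```

Note that `replace(',', '')` targets the ASCII comma, so full-width punctuation such as `、` or `，` in Japanese and Chinese sentences is left untouched.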

#### **Step 4: Score Validation**  
- Converts the `score` column to numeric (invalid entries → `NaN`).  
- Filters scores to lie within `score_range` (e.g., 0–5).  
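
A small sketch of the score validation on invented values is shown below. It uses `Series.between` as a compact equivalent of the `>=` / `<=` comparison in the function; the sample values are assumptions made for the example.

```python
import pandas as pd

scores = pd.Series(["5.0", "2.5", "n/a", "7", "-1"])  # invented values
numeric = pd.to_numeric(scores, errors="coerce")       # "n/a" becomes NaN

min_score, max_score = 0, 5
# Inclusive bounds; NaN compares as False and is therefore dropped
valid = numeric[numeric.between(min_score, max_score)]

print(valid.tolist())
```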


## **Output & Metrics**  
- Returns a cleaned `DataFrame` with:  
  - Consistent text formatting.  
  - Valid scores.  
  - No duplicates/missing values.  
- Prints preprocessing statistics:  
  - Rows removed at each step.  
  - Final dataset size and reduction percentage.  

In [None]:
import re

def preprocess_multilingual_dataset(
    df,
    language_pairs=None,
    lowercase=False,
    score_range=(0, 5),
    remove_commas=True  # Parameter controlling comma removal
):
    """
    Preprocess a multilingual sentence pair dataset for semantic similarity detection.

    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame containing the multilingual sentence pairs dataset.
    language_pairs : list of tuples, optional
        List of (lang1, lang2) tuples to filter by. If None, all language pairs are kept.
        Example: [('en', 'fr'), ('en', 'de')]
    lowercase : bool, default=False
        Whether to lowercase the text. Set to False for case-sensitive models like LaBSE.
    score_range : tuple, default=(0, 5)
        The valid range for similarity scores (min, max).
    remove_commas : bool, default=True
        Whether to remove commas from sentences.

    Returns:
    --------
    pandas.DataFrame
        Preprocessed DataFrame ready for a sentence embedding model.
    """
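    # Make a copy to avoid modifying the original
    df = df.copy()
    initial_rows = len(df)
    print(f"Initial dataset: {initial_rows} rows")

    # Optional: keep only the requested (lang1, lang2) pairs
    if language_pairs is not None:
        wanted = set(language_pairs)
        keep = [pair in wanted for pair in zip(df['lang1'], df['lang2'])]
        df = df[keep]
        print(f"After filtering language pairs: {len(df)} rows")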

    # Step 1: Remove rows with missing or empty sentences
    df = df.dropna(subset=['sentence1', 'sentence2'])
    df = df[(df['sentence1'].str.strip() != '') & (df['sentence2'].str.strip() != '')]
    print(f"After removing rows with missing sentences: {len(df)} rows")

    # Step 2: Drop duplicates
    df = df.drop_duplicates(subset=['sentence1', 'sentence2', 'lang1', 'lang2'])
    print(f"After removing duplicates: {len(df)} rows")

    # Step 3: Normalize text
    # Remove leading/trailing whitespace
    df['sentence1'] = df['sentence1'].str.strip()
    df['sentence2'] = df['sentence2'].str.strip()

    # Replace multiple whitespace characters (including newlines, tabs) with a single space
    df['sentence1'] = df['sentence1'].apply(lambda x: re.sub(r'\s+', ' ', x))
    df['sentence2'] = df['sentence2'].apply(lambda x: re.sub(r'\s+', ' ', x))

    # Remove commas from sentences
    if remove_commas:
        df['sentence1'] = df['sentence1'].apply(lambda x: x.replace(',', ''))
        df['sentence2'] = df['sentence2'].apply(lambda x: x.replace(',', ''))
        print("Removed commas from sentences")

    # Lowercase
    if lowercase:
        df['sentence1'] = df['sentence1'].str.lower()
        df['sentence2'] = df['sentence2'].str.lower()
        print("Applied lowercase transformation")

    # Step 4: Ensure score column is numeric, drop rows with invalid scores
    # Convert to numeric, with errors='coerce' to set invalid values to NaN
    df['score'] = pd.to_numeric(df['score'], errors='coerce')

    # Drop rows with missing scores
    df = df.dropna(subset=['score'])
    print(f"After cleaning score column: {len(df)} rows")

    # Filter by valid score range
    min_score, max_score = score_range
    df = df[(df['score'] >= min_score) & (df['score'] <= max_score)]
    print(f"After filtering by score range ({min_score}-{max_score}): {len(df)} rows")

    # Optional: Reset index
    df = df.reset_index(drop=True)

    print(f"Final dataset: {len(df)} rows")
    print(f"Rows removed: {initial_rows - len(df)} ({(initial_rows - len(df))/initial_rows*100:.2f}%)")

    return df

In [None]:
clean_df = preprocess_multilingual_dataset(df, lowercase=True)


Initial dataset: 949080 rows
After removing rows with missing sentences: 949080 rows
After removing duplicates: 940538 rows
Removed commas from sentences
Applied lowercase transformation
After cleaning score column: 940538 rows
After filtering by score range (0-5): 940538 rows
Final dataset: 940538 rows
Rows removed: 8542 (0.90%)


In [None]:
clean_df.head(100)

Unnamed: 0,sentence1,sentence2,score,lang1,lang2
0,ein flugzeug hebt gerade ab.,an air plane is taking off.,5.0,de,en
1,ein flugzeug hebt gerade ab.,un avión está despegando.,5.0,de,es
2,ein flugzeug hebt gerade ab.,un avion est en train de décoller.,5.0,de,fr
3,ein flugzeug hebt gerade ab.,un aereo sta decollando.,5.0,de,it
4,ein flugzeug hebt gerade ab.,飛行機が離陸します。,5.0,de,ja
...,...,...,...,...,...
95,самолет взлетает.,飛行機が離陸します。,5.0,ru,ja
96,самолет взлетает.,er gaat een vliegtuig opstijgen.,5.0,ru,nl
97,самолет взлетает.,samolot wystartował.,5.0,ru,pl
98,самолет взлетает.,um avião aéreo está a descolar.,5.0,ru,pt


In [None]:
clean_df.shape

(940538, 5)

In [None]:
import pandas as pd
clean_df = pd.read_csv("/content/drive/MyDrive/clean df.csv")



---


# **2. DATA AUGMENTATION**

To augment the dataset we followed three approaches, each applied to a different group of sentences according to the language of "sentence1".

### Part One
For English (en), Spanish (es), French (fr), Italian (it), German (de), Chinese (zh), Dutch (nl), and Russian (ru), we applied back-translation. English is the pivot language for every language except English itself, which uses French as the pivot.

### Part Two
For Portuguese (pt) and Polish (pl) we used a lightweight word-level substitution method based on synonym or semantic replacement strategies. We chose this approach because back-translation was problematic for these languages: they were not supported by MarianMTModel.

### Part Three
For Japanese (ja) we used Rinna's Japanese GPT-2 Medium (AutoModelForCausalLM) with left-padded tokenization (AutoTokenizer), because the back-translation setup used in Part One did not support the language.



---


## **PART ONE**

We created 8 subsets from `clean_df`:
*   df_en where lang1 = en
*   df_es where lang1 = es
*   df_fr where lang1 = fr
*   df_it where lang1 = it
*   df_de where lang1 = de
*   df_ru where lang1 = ru
*   df_zh where lang1 = zh
*   df_nl where lang1 = nl
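
As a rough sketch, the eight subsets could also be built in one pass and stored in a dictionary keyed by language code. The toy DataFrame and the names `toy_clean_df`, `part_one_langs`, and `subsets` are assumptions made for this example; the notebook itself creates each `df_xx` in its own cell.

```python
import pandas as pd

# Toy stand-in for clean_df (rows invented for illustration)
toy_clean_df = pd.DataFrame({
    "sentence1": ["a plane is taking off.", "un avión está despegando.", "a man is playing a flute."],
    "sentence2": ["un avión está despegando.", "a plane is taking off.", "un homme joue de la flûte."],
    "score": [5.0, 5.0, 5.0],
    "lang1": ["en", "es", "en"],
    "lang2": ["es", "en", "fr"],
})

part_one_langs = ["en", "es", "fr", "it", "de", "ru", "zh", "nl"]

# One DataFrame per source language, keyed by language code
subsets = {
    lang: toy_clean_df[toy_clean_df["lang1"] == lang]
    for lang in part_one_langs
}

print({lang: len(sub) for lang, sub in subsets.items() if len(sub) > 0})
```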

We will apply back-translation to these languages using `MarianMTModel` and `MarianTokenizer`. Documentation is available at https://huggingface.co/docs/transformers/model_doc/marian.

### **English**

In [None]:
df_en = clean_df[clean_df['lang1'] == 'en']
df_en.shape

(85591, 5)

# **Back-Translation for Data Augmentation**  

This code performs **back-translation** (translating text to a target language and back to the source language) to generate paraphrased versions of English sentences, enhancing dataset diversity while preserving semantic meaning. The same logic is applied to other languages (Spanish, French, etc.) using English as the pivot language.


## **Key Components**  

### **1. Model & Tokenizer Setup**  
- **Translation Models**: Uses Hugging Face's `MarianMT` models:  
  - **English → French**: `Helsinki-NLP/opus-mt-en-fr`  
  - **French → English**: `Helsinki-NLP/opus-mt-fr-en`  
- **Device Optimization**: Loads models on GPU (`cuda`) if available, otherwise CPU.  

### **2. Core Functions**  
1. **`translate_batch(sentences, tokenizer, model)`**:  
   - Tokenizes and translates a batch of sentences.  
   - Uses `padding` and `truncation` for uniform input length.  
   - Returns decoded translations (skipping special tokens such as `</s>` and `<pad>`).  

2. **`back_translate(sentences)`**:  
   - **Two-Step Process**:  
     - Translates English → French (`src_to_fr`).  
     - Translates French → English (`fr_to_src`).  
   - Outputs paraphrased English sentences (e.g., *"Hello"* → *"Bonjour"* → *"Hi"*).  
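
The two-step structure can be sketched without loading any model by passing the translators in as plain functions. The lookup tables below are toy stand-ins for the MarianMT calls, and the three-argument `back_translate` signature is an assumption of this sketch (the notebook's version closes over global models).

```python
def back_translate(sentences, to_pivot, from_pivot):
    """Source -> pivot -> source; both translators map list[str] -> list[str]."""
    pivot_sentences = to_pivot(sentences)
    return from_pivot(pivot_sentences)

# Toy lookup-table "translators", stand-ins for the MarianMT calls
EN_FR = {"a plane is taking off.": "un avion décolle."}
FR_EN = {"un avion décolle.": "a plane takes off."}

def en_to_fr(batch):
    return [EN_FR[s] for s in batch]

def fr_to_en(batch):
    return [FR_EN[s] for s in batch]

print(back_translate(["a plane is taking off."], en_to_fr, fr_to_en))
```

Injecting the translators this way would let one function serve every language pair, instead of defining a separate `back_translate_xx` per language.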

### **3. Batch Processing**  
- **Input**: English sentences from `df_en['sentence1']`.  
- **Batching**: Processes sentences in chunks (`batch_size=64`) for efficiency.  
- **Progress Tracking**: Uses `tqdm` to monitor back-translation progress.  
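
The chunking itself is plain list slicing. The sketch below uses a hypothetical `iter_batches` helper and placeholder sentences to show how the batch count follows from the number of rows and `batch_size`.

```python
import math

def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

sentences = [f"sentence {i}" for i in range(150)]  # placeholder data
batches = list(iter_batches(sentences, 64))

print([len(b) for b in batches])   # the last chunk may be smaller
print(math.ceil(85591 / 64))       # batches needed for the 85591 English rows
```

The second value, 1338, matches the total reported by the progress bar for the English run.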

### **4. Output**  
- Creates `df_en_aug`, a copy of the original DataFrame with paraphrased `sentence1` values.  


## **Workflow Example**  
| Original Sentence (en) | Back-Translated (en→fr→en) |  
|------------------------|---------------------------|  
| "The cat sat on the mat." | "The cat was sitting on the rug." |  


## **Why Back-Translation?**  
- **Data Augmentation**: Generates semantically equivalent but syntactically varied sentences.  
- **Language Coverage**: Applied to multiple languages (es, fr, it, de, etc.) using English as a pivot.  
- **Model Robustness**: Helps train NLP models to handle paraphrases and stylistic variations.  


## **Technical Notes**  
- **Batch Processing**: Optimizes GPU utilization and speeds up translation.  
- **Warnings**: Suppressed with `warnings.filterwarnings("ignore")` for cleaner output; exceptions are not caught.  
- **Scalability**: The same logic is replicated for other languages by swapping model paths (e.g., `opus-mt-en-es` for Spanish).  
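
Swapping model paths can be expressed as a small naming helper. This is a sketch that assumes the `Helsinki-NLP/opus-mt-{src}-{tgt}` pattern seen in the checkpoints used here; `marian_model_names` is a hypothetical function, and not every language pair is guaranteed to have a published checkpoint, so the names should be checked before loading.

```python
def marian_model_names(lang, pivot="en", pivot_for_english="fr"):
    """Return (forward, backward) checkpoint names for back-translating `lang`."""
    # English cannot pivot through itself, so it goes through French instead
    target = pivot_for_english if lang == pivot else pivot
    prefix = "Helsinki-NLP/opus-mt"
    return f"{prefix}-{lang}-{target}", f"{prefix}-{target}-{lang}"

for lang in ["en", "es", "zh"]:
    print(lang, marian_model_names(lang))
```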


In [None]:
from transformers import MarianMTModel, MarianTokenizer
import torch
from tqdm import tqdm
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# Load models and tokenizers
src_to_fr = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
src_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

fr_to_src = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-fr-en")
fr_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-fr-en")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
src_to_fr.to(device)
fr_to_src.to(device)

# Define translation functions
def translate_batch(sentences, tokenizer, model):
    encoded = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        translated = model.generate(**encoded)
    return tokenizer.batch_decode(translated, skip_special_tokens=True)

def back_translate(sentences):
    fr = translate_batch(sentences, src_tokenizer, src_to_fr)
    return translate_batch(fr, fr_tokenizer, fr_to_src)

# Apply back-translation to all English sentences
batch_size = 64
sentence1_list = df_en['sentence1'].tolist()
bt_results = []

for i in tqdm(range(0, len(sentence1_list), batch_size), desc="Back-translating"):
    batch = sentence1_list[i:i+batch_size]
    bt = back_translate(batch)
    bt_results.extend(bt)

# Create augmented DataFrame with back-translated sentence1
df_en_aug = df_en.copy()
df_en_aug['sentence1'] = bt_results


Back-translating: 100%|██████████| 1338/1338 [19:49<00:00,  1.12it/s]


In [None]:
df_en.head(100)

Unnamed: 0,sentence1,sentence2,score,lang1,lang2
10,a plane is taking off.,ein flugzeug hebt gerade ab.,5.0,en,de
11,a plane is taking off.,un avión está despegando.,5.0,en,es
12,a plane is taking off.,un avion est en train de décoller.,5.0,en,fr
13,a plane is taking off.,un aereo sta decollando.,5.0,en,it
14,a plane is taking off.,飛行機が離陸します。,5.0,en,ja
...,...,...,...,...,...
1005,a person is throwing a cat on to the ceiling.,iemand gooit een kat op het plafond.,5.0,en,nl
1006,a person is throwing a cat on to the ceiling.,człowiek rzuca kota na sufit.,5.0,en,pl
1007,a person is throwing a cat on to the ceiling.,uma pessoa atira um gato para o tecto.,5.0,en,pt
1008,a person is throwing a cat on to the ceiling.,человек бросает кошку на потолок.,5.0,en,ru


In [None]:
df_en_aug['sentence1'] = bt_results
df_en_aug.head(100)

Unnamed: 0,sentence1,sentence2,score,lang1,lang2
10,A plane takes off.,ein flugzeug hebt gerade ab.,5.0,en,de
11,A plane takes off.,un avión está despegando.,5.0,en,es
12,A plane takes off.,un avion est en train de décoller.,5.0,en,fr
13,A plane takes off.,un aereo sta decollando.,5.0,en,it
14,A plane takes off.,飛行機が離陸します。,5.0,en,ja
...,...,...,...,...,...
1005,a person throws a cat on the ceiling.,iemand gooit een kat op het plafond.,5.0,en,nl
1006,a person throws a cat on the ceiling.,człowiek rzuca kota na sufit.,5.0,en,pl
1007,a person throws a cat on the ceiling.,uma pessoa atira um gato para o tecto.,5.0,en,pt
1008,a person throws a cat on the ceiling.,человек бросает кошку на потолок.,5.0,en,ru


In [None]:
df_en_aug.shape

(85591, 5)

### **Spanish**

In [None]:
df_es = clean_df[clean_df['lang1'] == 'es']
df_es.shape

(85548, 5)

In [None]:
from transformers import MarianMTModel, MarianTokenizer
import torch
from tqdm import tqdm
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# Load models and tokenizers for Spanish ↔ English
es_to_en_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-es-en")
es_to_en_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")

en_to_es_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-es")
en_to_es_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
es_to_en_model.to(device)
en_to_es_model.to(device)

# Define translation functions
def translate_batch(sentences, tokenizer, model):
    encoded = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        translated = model.generate(**encoded)
    return tokenizer.batch_decode(translated, skip_special_tokens=True)

def back_translate_es(sentences):
    en_sentences = translate_batch(sentences, es_to_en_tokenizer, es_to_en_model)
    return translate_batch(en_sentences, en_to_es_tokenizer, en_to_es_model)

# Apply back-translation to all Spanish sentences
batch_size = 64
sentence1_list = df_es['sentence1'].tolist()
bt_results_es = []

for i in tqdm(range(0, len(sentence1_list), batch_size), desc="Back-translating"):
    batch = sentence1_list[i:i+batch_size]
    bt = back_translate_es(batch)
    bt_results_es.extend(bt)

# Create augmented DataFrame with back-translated sentence1
df_es_aug = df_es.copy()
df_es_aug['sentence1'] = bt_results_es


Back-translating: 100%|██████████| 1337/1337 [30:28<00:00,  1.37s/it]


In [None]:
df_es_aug['sentence1'] = bt_results_es
df_es_aug.head(100)

Unnamed: 0,sentence1,sentence2,score,lang1,lang2
20,Un avión despega.,ein flugzeug hebt gerade ab.,5.0,es,de
21,Un avión despega.,an air plane is taking off.,5.0,es,en
22,Un avión despega.,un avion est en train de décoller.,5.0,es,fr
23,Un avión despega.,un aereo sta decollando.,5.0,es,it
24,Un avión despega.,飛行機が離陸します。,5.0,es,ja
...,...,...,...,...,...
1015,Una persona está lanzando un gato en el techo.,iemand gooit een kat op het plafond.,5.0,es,nl
1016,Una persona está lanzando un gato en el techo.,człowiek rzuca kota na sufit.,5.0,es,pl
1017,Una persona está lanzando un gato en el techo.,uma pessoa atira um gato para o tecto.,5.0,es,pt
1018,Una persona está lanzando un gato en el techo.,человек бросает кошку на потолок.,5.0,es,ru


### **French**

In [None]:
df_fr = clean_df[clean_df['lang1'] == 'fr']
df_fr.shape

(85490, 5)

In [None]:
from transformers import MarianMTModel, MarianTokenizer
import torch
from tqdm import tqdm
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# Load models and tokenizers for French ↔ English
fr_to_en_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-fr-en")
fr_to_en_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-fr-en")

en_to_fr_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
en_to_fr_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
fr_to_en_model.to(device)
en_to_fr_model.to(device)

# Define translation functions
def translate_batch(sentences, tokenizer, model):
    encoded = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        translated = model.generate(**encoded)
    return tokenizer.batch_decode(translated, skip_special_tokens=True)

def back_translate_fr(sentences):
    en_sentences = translate_batch(sentences, fr_to_en_tokenizer, fr_to_en_model)
    return translate_batch(en_sentences, en_to_fr_tokenizer, en_to_fr_model)

# Apply back-translation to all French sentences
batch_size = 64
sentence1_list = df_fr['sentence1'].tolist()
bt_results_fr = []

for i in tqdm(range(0, len(sentence1_list), batch_size), desc="Back-translating French"):
    batch = sentence1_list[i:i+batch_size]
    bt = back_translate_fr(batch)
    bt_results_fr.extend(bt)

# Create augmented DataFrame with back-translated sentence1
df_fr_aug = df_fr.copy()
df_fr_aug['sentence1'] = bt_results_fr


Back-translating French: 100%|██████████| 1336/1336 [20:10<00:00,  1.10it/s]


In [None]:
df_fr_aug['sentence1'] = bt_results_fr
df_fr_aug.head(100)

Unnamed: 0,sentence1,sentence2,score,lang1,lang2
30,Un avion décolle.,ein flugzeug hebt gerade ab.,5.0,fr,de
31,Un avion décolle.,an air plane is taking off.,5.0,fr,en
32,Un avion décolle.,un avión está despegando.,5.0,fr,es
33,Un avion décolle.,un aereo sta decollando.,5.0,fr,it
34,Un avion décolle.,飛行機が離陸します。,5.0,fr,ja
...,...,...,...,...,...
1025,une personne jette un chat sur le plafond.,iemand gooit een kat op het plafond.,5.0,fr,nl
1026,une personne jette un chat sur le plafond.,człowiek rzuca kota na sufit.,5.0,fr,pl
1027,une personne jette un chat sur le plafond.,uma pessoa atira um gato para o tecto.,5.0,fr,pt
1028,une personne jette un chat sur le plafond.,человек бросает кошку на потолок.,5.0,fr,ru


### **Italian**

In [None]:
df_it = clean_df[clean_df['lang1'] == 'it']
df_it.shape

(85526, 5)

In [None]:
from transformers import MarianMTModel, MarianTokenizer
import torch
from tqdm import tqdm
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# Load models and tokenizers for Italian ↔ English
it_to_en_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-it-en")
it_to_en_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-it-en")

en_to_it_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-it")
en_to_it_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-it")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
it_to_en_model.to(device)
en_to_it_model.to(device)

# Define translation functions
def translate_batch(sentences, tokenizer, model):
    encoded = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        translated = model.generate(**encoded)
    return tokenizer.batch_decode(translated, skip_special_tokens=True)

def back_translate_it(sentences):
    en_sentences = translate_batch(sentences, it_to_en_tokenizer, it_to_en_model)
    return translate_batch(en_sentences, en_to_it_tokenizer, en_to_it_model)

# Apply back-translation to all Italian sentences
batch_size = 64
sentence1_list = df_it['sentence1'].tolist()
bt_results_it = []

for i in tqdm(range(0, len(sentence1_list), batch_size), desc="Back-translating Italian"):
    batch = sentence1_list[i:i+batch_size]
    bt = back_translate_it(batch)
    bt_results_it.extend(bt)

# Create augmented DataFrame with back-translated sentence1
df_it_aug = df_it.copy()
df_it_aug['sentence1'] = bt_results_it

Back-translating Italian: 100%|██████████| 1337/1337 [26:02<00:00,  1.17s/it]


In [None]:
df_it_aug['sentence1'] = bt_results_it
df_it_aug.head(100)

Unnamed: 0,sentence1,sentence2,score,lang1,lang2
40,Un aereo sta decollando.,ein flugzeug hebt gerade ab.,5.0,it,de
41,Un aereo sta decollando.,an air plane is taking off.,5.0,it,en
42,Un aereo sta decollando.,un avión está despegando.,5.0,it,es
43,Un aereo sta decollando.,un avion est en train de décoller.,5.0,it,fr
44,Un aereo sta decollando.,飛行機が離陸します。,5.0,it,ja
...,...,...,...,...,...
1035,Una persona sta gettando un gatto sul soffitto.,iemand gooit een kat op het plafond.,5.0,it,nl
1036,Una persona sta gettando un gatto sul soffitto.,człowiek rzuca kota na sufit.,5.0,it,pl
1037,Una persona sta gettando un gatto sul soffitto.,uma pessoa atira um gato para o tecto.,5.0,it,pt
1038,Una persona sta gettando un gatto sul soffitto.,человек бросает кошку на потолок.,5.0,it,ru


### **German**

In [None]:
df_de = clean_df[clean_df['lang1'] == 'de']
df_de.shape

(85465, 5)

In [None]:
from transformers import MarianMTModel, MarianTokenizer
import torch
from tqdm import tqdm
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# Load models and tokenizers for German ↔ English
de_to_en_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-de-en")
de_to_en_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")

en_to_de_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")
en_to_de_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
de_to_en_model.to(device)
en_to_de_model.to(device)

# Define translation functions
def translate_batch(sentences, tokenizer, model):
    encoded = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        translated = model.generate(**encoded)
    return tokenizer.batch_decode(translated, skip_special_tokens=True)

def back_translate_de(sentences):
    en_sentences = translate_batch(sentences, de_to_en_tokenizer, de_to_en_model)
    return translate_batch(en_sentences, en_to_de_tokenizer, en_to_de_model)

# Apply back-translation to all German sentences
batch_size = 64
sentence1_list = df_de['sentence1'].tolist()
bt_results_de = []

for i in tqdm(range(0, len(sentence1_list), batch_size), desc="Back-translating German"):
    batch = sentence1_list[i:i+batch_size]
    bt = back_translate_de(batch)
    bt_results_de.extend(bt)

# Create augmented DataFrame with back-translated sentence1
df_de_aug = df_de.copy()
df_de_aug['sentence1'] = bt_results_de

Back-translating German: 100%|██████████| 1336/1336 [23:37<00:00,  1.06s/it]


In [None]:
df_de_aug['sentence1'] = bt_results_de
df_de_aug.head(100)

Unnamed: 0,sentence1,sentence2,score,lang1,lang2
0,Ein Flugzeug startet gerade.,an air plane is taking off.,5.0,de,en
1,Ein Flugzeug startet gerade.,un avión está despegando.,5.0,de,es
2,Ein Flugzeug startet gerade.,un avion est en train de décoller.,5.0,de,fr
3,Ein Flugzeug startet gerade.,un aereo sta decollando.,5.0,de,it
4,Ein Flugzeug startet gerade.,飛行機が離陸します。,5.0,de,ja
...,...,...,...,...,...
995,Eine Person wirft eine Katze auf die Decke.,iemand gooit een kat op het plafond.,5.0,de,nl
996,Eine Person wirft eine Katze auf die Decke.,człowiek rzuca kota na sufit.,5.0,de,pl
997,Eine Person wirft eine Katze auf die Decke.,uma pessoa atira um gato para o tecto.,5.0,de,pt
998,Eine Person wirft eine Katze auf die Decke.,человек бросает кошку на потолок.,5.0,de,ru


### **Russian**

In [None]:
df_ru = clean_df[clean_df['lang1'] == 'ru']
df_ru.shape

(85445, 5)

In [None]:
from transformers import MarianMTModel, MarianTokenizer
import torch
from tqdm import tqdm
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# Load models and tokenizers for Russian ↔ English
ru_to_en_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-ru-en")
ru_to_en_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ru-en")

en_to_ru_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-ru")
en_to_ru_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-ru")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ru_to_en_model.to(device)
en_to_ru_model.to(device)

# Define translation functions
def translate_batch(sentences, tokenizer, model):
    encoded = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        translated = model.generate(**encoded)
    return tokenizer.batch_decode(translated, skip_special_tokens=True)

def back_translate_ru(sentences):
    en_sentences = translate_batch(sentences, ru_to_en_tokenizer, ru_to_en_model)
    return translate_batch(en_sentences, en_to_ru_tokenizer, en_to_ru_model)

# Apply back-translation to all Russian sentences
batch_size = 64
sentence1_list = df_ru['sentence1'].tolist()
bt_results_ru = []

for i in tqdm(range(0, len(sentence1_list), batch_size), desc="Back-translating Russian"):
    batch = sentence1_list[i:i+batch_size]
    bt = back_translate_ru(batch)
    bt_results_ru.extend(bt)

# Create augmented DataFrame with back-translated sentence1
df_ru_aug = df_ru.copy()
df_ru_aug['sentence1'] = bt_results_ru

Back-translating Russian: 100%|██████████| 1336/1336 [31:07<00:00,  1.40s/it]


In [None]:
df_ru_aug['sentence1'] = bt_results_ru
df_ru_aug.head(100)

Unnamed: 0,sentence1,sentence2,score,lang1,lang2
0,Самолёт взлетает.,ein flugzeug hebt gerade ab.,5.0,ru,de
1,Самолёт взлетает.,an air plane is taking off.,5.0,ru,en
2,Самолёт взлетает.,un avión está despegando.,5.0,ru,es
3,Самолёт взлетает.,un avion est en train de décoller.,5.0,ru,fr
4,Самолёт взлетает.,un aereo sta decollando.,5.0,ru,it
...,...,...,...,...,...
95,Человек кидает кошку на потолок.,人が天井に猫を投げつける。,5.0,ru,ja
96,Человек кидает кошку на потолок.,iemand gooit een kat op het plafond.,5.0,ru,nl
97,Человек кидает кошку на потолок.,człowiek rzuca kota na sufit.,5.0,ru,pl
98,Человек кидает кошку на потолок.,uma pessoa atira um gato para o tecto.,5.0,ru,pt


### **Chinese**

In [None]:
df_zh = clean_df[clean_df['lang1'] == 'zh']
df_zh.shape

(85456, 5)

In [None]:
from transformers import MarianMTModel, MarianTokenizer
import torch
from tqdm import tqdm
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# Load models and tokenizers for Chinese ↔ English
zh_to_en_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-zh-en")
zh_to_en_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-zh-en")

en_to_zh_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
en_to_zh_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-zh")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
zh_to_en_model.to(device)
en_to_zh_model.to(device)

# Define translation functions
def translate_batch(sentences, tokenizer, model):
    encoded = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        translated = model.generate(**encoded)
    return tokenizer.batch_decode(translated, skip_special_tokens=True)

def back_translate_zh(sentences):
    en_sentences = translate_batch(sentences, zh_to_en_tokenizer, zh_to_en_model)
    return translate_batch(en_sentences, en_to_zh_tokenizer, en_to_zh_model)

# Apply back-translation to all Chinese sentences
batch_size = 64
sentence1_list = df_zh['sentence1'].tolist()
bt_results_zh = []

for i in tqdm(range(0, len(sentence1_list), batch_size), desc="Back-translating Chinese"):
    batch = sentence1_list[i:i+batch_size]
    bt = back_translate_zh(batch)
    bt_results_zh.extend(bt)

# Create augmented DataFrame with back-translated sentence1
df_zh_aug = df_zh.copy()
df_zh_aug['sentence1'] = bt_results_zh

Back-translating Chinese: 100%|██████████| 1336/1336 [20:05<00:00,  1.11it/s]


In [None]:
df_zh_aug['sentence1'] = bt_results_zh
df_zh_aug.head(100)

Unnamed: 0,sentence1,sentence2,score,lang1,lang2
100,飞机起飞了,ein flugzeug hebt gerade ab.,5.0,zh,de
101,飞机起飞了,an air plane is taking off.,5.0,zh,en
102,飞机起飞了,un avión está despegando.,5.0,zh,es
103,飞机起飞了,un avion est en train de décoller.,5.0,zh,fr
104,飞机起飞了,un aereo sta decollando.,5.0,zh,it
...,...,...,...,...,...
1095,一个男人把一只猫扔到天花板上,人が天井に猫を投げつける。,5.0,zh,ja
1096,一个男人把一只猫扔到天花板上,iemand gooit een kat op het plafond.,5.0,zh,nl
1097,一个男人把一只猫扔到天花板上,człowiek rzuca kota na sufit.,5.0,zh,pl
1098,一个男人把一只猫扔到天花板上,uma pessoa atira um gato para o tecto.,5.0,zh,pt


### **Dutch**

In [None]:
df_nl = clean_df[clean_df['lang1'] == 'nl']
df_nl.shape

(85521, 5)

In [None]:
from transformers import MarianMTModel, MarianTokenizer
import torch
from tqdm import tqdm
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# Load models and tokenizers for Dutch ↔ English
nl_to_en_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-nl-en")
nl_to_en_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-nl-en")

en_to_nl_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-nl")
en_to_nl_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-nl")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
nl_to_en_model.to(device)
en_to_nl_model.to(device)

# Define translation functions
def translate_batch(sentences, tokenizer, model):
    encoded = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        translated = model.generate(**encoded)
    return tokenizer.batch_decode(translated, skip_special_tokens=True)

def back_translate_nl(sentences):
    en_sentences = translate_batch(sentences, nl_to_en_tokenizer, nl_to_en_model)
    return translate_batch(en_sentences, en_to_nl_tokenizer, en_to_nl_model)

# Apply back-translation to all Dutch sentences
batch_size = 64
sentence1_list = df_nl['sentence1'].tolist()
bt_results_nl = []

for i in tqdm(range(0, len(sentence1_list), batch_size), desc="Back-translating Dutch", ncols=100):
    batch = sentence1_list[i:i+batch_size]
    bt = back_translate_nl(batch)
    bt_results_nl.extend(bt)

# Create augmented DataFrame with back-translated sentence1
df_nl_aug = df_nl.copy()
df_nl_aug['sentence1'] = bt_results_nl

Back-translating Dutch: 100%|███████████████████████████████████| 1337/1337 [33:16<00:00,  1.49s/it]


In [None]:
df_nl_aug['sentence1'] = bt_results_nl
df_nl_aug.head(100)

Unnamed: 0,sentence1,sentence2,score,lang1,lang2
60,Er gaat een vliegtuig opstijgen.,ein flugzeug hebt gerade ab.,5.0,nl,de
61,Er gaat een vliegtuig opstijgen.,an air plane is taking off.,5.0,nl,en
62,Er gaat een vliegtuig opstijgen.,un avión está despegando.,5.0,nl,es
63,Er gaat een vliegtuig opstijgen.,un avion est en train de décoller.,5.0,nl,fr
64,Er gaat een vliegtuig opstijgen.,un aereo sta decollando.,5.0,nl,it
...,...,...,...,...,...
1055,Iemand gooit een kat tegen het plafond.,人が天井に猫を投げつける。,5.0,nl,ja
1056,Iemand gooit een kat tegen het plafond.,człowiek rzuca kota na sufit.,5.0,nl,pl
1057,Iemand gooit een kat tegen het plafond.,uma pessoa atira um gato para o tecto.,5.0,nl,pt
1058,Iemand gooit een kat tegen het plafond.,человек бросает кошку на потолок.,5.0,nl,ru
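
The Russian, Chinese, and Dutch cells above repeat the same batching loop with different models. The following is a minimal sketch of how that loop could be factored into a single helper, assuming the two translation steps are passed in as callables. The names `back_translate_in_batches`, `to_pivot`, and `from_pivot` are hypothetical and do not appear in the notebook, and the stub translators only stand in for the MarianMT calls so that the sketch runs without downloading any model.

```python
from typing import Callable, List

# A translator maps a batch of sentences to a batch of translations.
Translator = Callable[[List[str]], List[str]]

def back_translate_in_batches(
    sentences: List[str],
    to_pivot: Translator,
    from_pivot: Translator,
    batch_size: int = 64,
) -> List[str]:
    """Back-translate sentences batch by batch, preserving the input order."""
    results: List[str] = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        pivot = to_pivot(batch)          # source language -> pivot language
        restored = from_pivot(pivot)     # pivot language -> source language
        if len(restored) != len(batch):
            raise ValueError("translator returned a different number of sentences")
        results.extend(restored)
    return results

# Stub translators (assumption: placeholders for the MarianMT-based translate_batch).
def fake_ru_to_en(batch: List[str]) -> List[str]:
    return [s.upper() for s in batch]

def fake_en_to_ru(batch: List[str]) -> List[str]:
    return [s.lower() for s in batch]

demo = back_translate_in_batches(
    ["Привет", "Мир", "Кот"], fake_ru_to_en, fake_en_to_ru, batch_size=2
)
print(demo)
```

With the notebook's own functions, the callables could be written as `lambda b: translate_batch(b, ru_to_en_tokenizer, ru_to_en_model)` and its reverse, so that only the model pair changes from one language to the next.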




---


## **PART TWO**

## **Polish and Portuguese**

We adopted a **lightweight word-level substitution method**: frequent words are replaced with their nearest neighbours in embedding space, which act as approximate synonyms.

We used the pre-trained **FastText** word vectors for the two languages, available at https://fasttext.cc/docs/en/crawl-vectors.html

FastText is a free, open-source, lightweight library for learning text representations and text classifiers.
These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives.
The word vectors are available in both binary and text formats. We used the binary format because loading the text format required the Gensim library, which conflicted with the NumPy version in our environment.


### **Polish and Portuguese**

In [None]:
!pip install fasttext



The following cells download and extract the pre-trained FastText word vectors for Portuguese and Polish. In each cell, `wget` downloads the compressed `.bin.gz` model file and `gunzip` extracts it, producing a usable `.bin` file (`cc.pt.300.bin` and `cc.pl.300.bin`, respectively).


In [None]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.pt.300.bin.gz
!gunzip cc.pt.300.bin.gz

--2025-05-09 12:22:34--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.pt.300.bin.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 13.227.219.70, 13.227.219.10, 13.227.219.59, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|13.227.219.70|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4505096760 (4.2G) [application/octet-stream]
Saving to: ‘cc.pt.300.bin.gz.1’


2025-05-09 12:22:55 (200 MB/s) - ‘cc.pt.300.bin.gz.1’ saved [4505096760/4505096760]

gzip: cc.pt.300.bin already exists; do you wish to overwrite (y or n)? n
	not overwritten


In [None]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.pl.300.bin.gz
!gunzip cc.pl.300.bin.gz

--2025-05-09 12:23:05--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.pl.300.bin.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 13.227.219.10, 13.227.219.70, 13.227.219.59, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|13.227.219.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4503081312 (4.2G) [application/octet-stream]
Saving to: ‘cc.pl.300.bin.gz.1’


2025-05-09 12:23:26 (209 MB/s) - ‘cc.pl.300.bin.gz.1’ saved [4503081312/4503081312]

gzip: cc.pl.300.bin already exists; do you wish to overwrite (y or n)? n
	not overwritten


In [None]:
import fasttext
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

These lines load the pre-trained Portuguese and Polish FastText word embedding models (`cc.pt.300.bin` and `cc.pl.300.bin`) using the `fasttext` library.


In [None]:
ft_pt = fasttext.load_model('cc.pt.300.bin')

In [None]:
ft_pl = fasttext.load_model('cc.pl.300.bin')

`df_pl_pt` is the subset of `clean_df` containing only the rows whose `lang1` column is `'pl'` or `'pt'`.

In [None]:
df_pl_pt = clean_df[(clean_df['lang1'] == 'pl') | (clean_df['lang1'] == 'pt')]
df_pl_pt.shape

(170981, 5)

# **Word-Level Paraphrasing with FastText Embeddings**  

This code performs **word substitution-based paraphrasing** for Portuguese (`pt`) and Polish (`pl`) text using **FastText word embeddings**. It replaces frequent words with their nearest neighbours in embedding space, with the aim of keeping the meaning close to the original.  

## **Key Steps**  

### **1. Setup & Model Loading**  
- **FastText Models**: Pre-trained FastText models (`cc.pt.300.bin` for Portuguese, `cc.pl.300.bin` for Polish) are loaded to compute word similarities.  
- **NLTK Tokenization**: The `punkt` tokenizer is used to split sentences into words.  
- **Warnings**: Suppressed for cleaner output.  

### **2. Building the Substitution Dictionary**  
For each language (`pt`/`pl`), the code:  
1. **Tokenizes and Preprocesses** sentences from `sentence1`.  
2. **Extracts Top-N Most Frequent Words** (default: 500) to focus on relevant vocabulary.  
3. **Finds Similar Words** using FastText’s nearest-neighbor search:  
   - Only keeps replacements with a **similarity score ≥ 0.75**.  
   - Avoids replacing a word with itself.  
   - If no suitable synonym is found, the original word is retained.  

### **3. Applying Substitutions**  
- Each sentence in `sentence1` is lowercased and then tokenized.  
- Words are **replaced if they exist in the substitution dictionary**, otherwise kept as-is.  
- The modified sentences are stored in a new column, `sentence1_aug`.  

### **4. Output**  
- The paraphrased sentences for both languages are combined into a single DataFrame (`df_paraphrased`).
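
The steps above can be illustrated with a small, self-contained sketch. It does not load FastText: the two-dimensional vectors are invented for illustration, `nearest_neighbors` is a hypothetical stand-in that returns `(score, word)` pairs in the same shape as FastText's `get_nearest_neighbors`, and whitespace splitting replaces the NLTK tokenizer.

```python
import math
from collections import Counter

# Toy 2-D "embeddings" (assumption: invented values, not real FastText vectors).
toy_vectors = {
    "carro": (1.0, 0.1),
    "automóvel": (0.9, 0.2),
    "gato": (0.0, 1.0),
}

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest_neighbors(word, vectors):
    """Return (score, neighbor) pairs, most similar first, excluding the word itself."""
    scored = [(cosine(vectors[word], vec), other)
              for other, vec in vectors.items() if other != word]
    return sorted(scored, reverse=True)

def build_sub_dict(sentences, vectors, top_n=500, threshold=0.75):
    """Map each frequent word to its best neighbor, if that neighbor clears the threshold."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    sub = {}
    for word, _ in counts.most_common(top_n):
        if word not in vectors:
            continue  # no embedding available in this toy vocabulary
        for score, neighbor in nearest_neighbors(word, vectors):
            if score >= threshold:
                sub[word] = neighbor
                break
    return sub

def substitute(sentence, sub):
    """Replace every token found in the dictionary; keep all other tokens unchanged."""
    return " ".join(sub.get(tok, tok) for tok in sentence.lower().split())

sub_dict = build_sub_dict(["O carro vermelho", "o gato e o carro"], toy_vectors)
print(sub_dict)
print(substitute("O gato e o carro", sub_dict))
```

In this toy vocabulary `gato` has no neighbor above the threshold, so it is left unchanged, while `carro` is replaced. A nearest neighbor is not guaranteed to be a true synonym, which is why the threshold matters.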

In [None]:
import fasttext
import pandas as pd
import nltk
import warnings
from collections import Counter
from nltk.tokenize import word_tokenize
from tqdm import tqdm

nltk.download('punkt')
warnings.filterwarnings("ignore")

# Load FastText models
ft_pt = fasttext.load_model('cc.pt.300.bin')
ft_pl = fasttext.load_model('cc.pl.300.bin')

model_cache = {
    'pt': ft_pt,
    'pl': ft_pl
}

# tokenization function
def tokenize(text, lang):
    return word_tokenize(text)

# substitution dictionary building
def build_substitution_dict(model, texts, lang, top_n_words=500, similarity_threshold=0.75):
    all_tokens = [tokenize(s.lower(), lang) for s in texts]
    flat_tokens = [token for sublist in all_tokens for token in sublist]
    freq_words = [w for w, _ in Counter(flat_tokens).most_common(top_n_words)]

    sub_dict = {}
    for word in tqdm(freq_words, desc=f"Building substitution dict for {lang}"):
        try:
            neighbors = model.get_nearest_neighbors(word)
            for score, neighbor in neighbors:
                if score >= similarity_threshold and neighbor != word:
                    sub_dict[word] = neighbor
                    break
        except Exception:
            continue
    return sub_dict

# apply word substitutions to a sentence
def substitute_words(text, sub_dict, lang):
    tokens = tokenize(text.lower(), lang)
    return ' '.join([sub_dict.get(t, t) for t in tokens])

# paraphrasing application on sentence1
augmented_rows = []

for lang in ['pt', 'pl']:
    subset = df_pl_pt[df_pl_pt["lang1"] == lang].copy()
    if subset.empty:
        continue

    print(f"Processing language: {lang}")

    model = model_cache[lang]
    sub_dict = build_substitution_dict(model, subset["sentence1"], lang)

    subset["sentence1_aug"] = subset["sentence1"].apply(lambda s: substitute_words(s, sub_dict, lang))

    augmented_rows.append(subset)

df_paraphrased = pd.concat(augmented_rows, ignore_index=True)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Processing language: pt


Building substitution dict for pt: 100%|██████████| 500/500 [05:12<00:00,  1.60it/s]


Processing language: pl


Building substitution dict for pl: 100%|██████████| 500/500 [05:14<00:00,  1.59it/s]


This code overwrites the first column of `df_paraphrased` (`sentence1`) with the values of the `sentence1_aug` column created in the previous cell, then drops `sentence1_aug`.

In [None]:
df_paraphrased.iloc[:, 0] = df_paraphrased.iloc[:, 5]
df_paraphrased.drop(df_paraphrased.columns[5], axis=1, inplace=True)


In [None]:
df_paraphrased = pd.read_csv('/content/drive/MyDrive/paraphrased_data.csv')

In [None]:
df_paraphrased.head()

Unnamed: 0,sentence1,sentence2,score,lang1,lang2
0,Um helicóptero Está a descolar .,ein flugzeug hebt gerade ab.,5.0,pt,de
1,Um helicóptero Está a descolar .,an air plane is taking off.,5.0,pt,en
2,Um helicóptero Está a descolar .,un avión está despegando.,5.0,pt,es
3,Um helicóptero Está a descolar .,un avion est en train de décoller.,5.0,pt,fr
4,Um helicóptero Está a descolar .,un aereo sta decollando.,5.0,pt,it
5,Um helicóptero Está a descolar .,飛行機が離陸します。,5.0,pt,ja
6,Um helicóptero Está a descolar .,er gaat een vliegtuig opstijgen.,5.0,pt,nl
7,Um helicóptero Está a descolar .,samolot wystartował.,5.0,pt,pl
8,Um helicóptero Está a descolar .,взлетает самолет.,5.0,pt,ru
9,Um helicóptero Está a descolar .,一架飞机正在起飞。,5.0,pt,zh




---


## **PART THREE**
## **Japanese**
For Japanese text augmentation, we used Rinna's Japanese GPT-2 Medium (loaded with `AutoModelForCausalLM`), a model trained to predict the next token in a sequence, together with a left-padded tokenizer (`AutoTokenizer`) to generate fluent, context-aware sentence continuations.

The documentation is available at https://huggingface.co/rinna/japanese-gpt2-medium.
Left-padding adds padding tokens (here the EOS token) to the beginning of the shorter sequences in a batch so that all sequences have equal length. Because the model generates text left-to-right, left-padding keeps the last position of every sequence a real token, so generation continues directly from the prompt rather than from padding.
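
To make the difference between left- and right-padding concrete, here is a minimal sketch on plain lists of token ids. The helper `pad_batch` and the token ids are hypothetical and chosen only for illustration; in the notebook this work is done by the Hugging Face tokenizer.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to equal length and build the matching attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (width - len(seq))
        mask_pad = [0] * len(padding)   # 0 = position ignored by attention
        mask_real = [1] * len(seq)      # 1 = real token
        if side == "left":
            input_ids.append(padding + seq)
            attention_mask.append(mask_pad + mask_real)
        else:
            input_ids.append(seq + padding)
            attention_mask.append(mask_real + mask_pad)
    return input_ids, attention_mask

# Toy token ids (assumption: invented values; 2 plays the role of the EOS/pad id).
batch = [[11, 12, 13], [21]]
left_ids, left_mask = pad_batch(batch, pad_id=2, side="left")
right_ids, right_mask = pad_batch(batch, pad_id=2, side="right")

print(left_ids)    # the last column holds only real tokens
print(left_mask)
print(right_ids)   # the last column of the short row is padding
```

With left-padding, the final position of every row is a real token, which is the position a causal model continues from during batched generation.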

In [None]:
df_ja = clean_df[clean_df['lang1'] == 'ja']
df_ja.shape

(85515, 5)

# **Japanese Text Paraphrasing with GPT-2**

This code uses a Japanese GPT-2 model to generate continuations of input sentences, appending new text to the original input. Unlike paraphrasing, this process extends the input rather than rewriting it.

## **Key Steps**
1. **Initial Setup**:
   - Imports necessary libraries
   - Configures to use GPU if available, otherwise falls back to CPU
   - Suppresses unnecessary warning messages

2. **Model Loading**:
   - Loads a pre-trained Japanese GPT-2 tokenizer and model from Rinna
   - Configures the tokenizer to use left-padding with EOS (end-of-sequence) token as padding

3. **Core Functions**:
   - `generate_batch()`: Takes a list of texts and generates new text continuations with:
     - Temperature sampling (0.9) for varied but coherent outputs
     - No repeat n-grams to avoid repetitive phrases
     - Left-padding for consistent processing
     - Maximum 50 new tokens generated per input

   - `paraphrase_dataframe()`: Processes a DataFrame in batches to:
     - Generate a continuation for each text in the specified column
     - Clean the output by removing the original text
     - Show progress with a tqdm progress bar

4. **Output**:
   - Creates a new column `sentence1_aug` with the generated continuations
   - Makes a copy of the augmented DataFrame in `df_ja_aug`

The implementation is optimized for batch processing and handles Japanese text specifically using a model trained on Japanese language data.
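
One detail of the cleaning step is worth a small illustration. The notebook removes the prompt with `str.replace`, which deletes every occurrence of the prompt, including any repetition inside the generated continuation. The sketch below shows a possible alternative that strips the prompt only when it is a prefix. `strip_prompt` is a hypothetical helper, and the example strings are invented rather than real model output.

```python
def strip_prompt(generated: str, prompt: str) -> str:
    """Remove the prompt only when it appears at the start of the generated text."""
    if generated.startswith(prompt):
        return generated[len(prompt):].strip()
    return generated.strip()

# Invented example: the continuation happens to repeat the prompt at its end.
prompt = "飛行機が離陸します。"
generated = prompt + " 空港は混雑しています。飛行機が離陸します。"

prefix_only = strip_prompt(generated, prompt)
replace_all = generated.replace(prompt, "").strip()

print(prefix_only)   # keeps the repeated sentence inside the continuation
print(replace_all)   # the repeated sentence is removed as well
```

Which behavior is preferable depends on whether repeated copies of the prompt should count as part of the augmented sentence.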

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from tqdm import tqdm
import pandas as pd
import torch
import warnings
from transformers import logging
logging.set_verbosity_error()

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load tokenizer with correct padding configuration
tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt2-medium")
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

# Force left-padding in all operations
tokenizer.__dict__['_padding_side'] = "left"
tokenizer.__dict__['_pad_token'] = tokenizer.eos_token

# Load model
model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt2-medium").to(device)
model.eval()

# Optimized batch generation
def generate_batch(texts, max_new_tokens=50, temperature=0.9):
    # Explicitly enforce left padding
    inputs = tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=128,
        padding_side="left"  # Explicitly set here as well
    ).to(device)

    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            pad_token_id=tokenizer.pad_token_id,
            no_repeat_ngram_size=2
        )

    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

# DataFrame processing
def paraphrase_dataframe(df, text_column="sentence1", batch_size=32):
    augmented_texts = []

    for i in tqdm(range(0, len(df), batch_size), desc="Paraphrasing"):
        batch = df[text_column].iloc[i:i+batch_size].tolist()
        augmented_batch = generate_batch(batch)
        # Clean output by removing original text
        augmented_texts.extend([x.replace(text, "").strip() for x, text in zip(augmented_batch, batch)])

    return augmented_texts

# Apply to a copy, so the slice taken from clean_df is not modified in place
df_ja_aug = df_ja.copy()
df_ja_aug["sentence1_aug"] = paraphrase_dataframe(df_ja_aug)

Using device: cuda


Paraphrasing: 100%|██████████| 2673/2673 [58:53<00:00,  1.32s/it]


In [None]:
df_ja_aug.iloc[:, 0] = df_ja_aug.iloc[:, 5]
df_ja_aug.drop(df_ja_aug.columns[5], axis=1, inplace=True)

In [None]:
df_ja_aug.head()

Unnamed: 0,sentence1,sentence2,score,lang1,lang2
50,空から見ると、大きな飛行機、とても巨大な飛行機です。 小さな子供がはしゃいでいました。 そし...,ein flugzeug hebt gerade ab.,5.0,ja,de
51,機体の重量、スピード、エンジン出力、機体が空を飛んでいる間、空から、飛行機は地面と平行に飛行...,an air plane is taking off.,5.0,ja,en
52,着陸する時に機体が風を受けて風抵抗が強くなります。 主翼の空気抵抗が機体に影響を与えるので、...,un avión está despegando.,5.0,ja,es
53,目的地、空港、または目的地の電話番号を入力すると、フライト情報のページに移動できます。次に空...,un avion est en train de décoller.,5.0,ja,fr
54,空港の端には、「出発」「着陸」といったアナウンスがあります。 それぞれ英語と中国語で表記され...,un aereo sta decollando.,5.0,ja,it


## **Recalculating the similarity score for Japanese**

This code computes **semantic similarity scores** between Japanese-{lang} sentence pairs using a multilingual sentence embedding model from **SentenceTransformers** (documentation available at https://www.sbert.net/). SentenceTransformers is a library built on top of BERT and its variants and designed to produce sentence embeddings: dense representations of entire sentences rather than of individual words or tokens. The underlying approach is called **SBERT** (Sentence-BERT).

Cosine similarity can in principle range from `-1` to `1`. In practice, scores for sentence pairs typically fall between `0` (unrelated) and `1` (identical meaning), with higher values indicating greater semantic similarity.

## **Key Steps**  

1. **Initial Setup**  
   - Imports libraries for sentence embeddings (`SentenceTransformer`), similarity calculation (`util`), and data handling (`pandas`, `torch`).  
   - Configures GPU usage if available for faster processing.  
   - Suppresses warnings to keep logs clean.  

2. **Model Loading**  
   - Loads the **`paraphrase-multilingual-mpnet-base-v2`** model, which is optimized for:  
     - Multilingual text (including Japanese and English).  
     - Semantic similarity tasks (e.g., paraphrase detection).  
   - Moves the model to GPU if available (`model.to(device)`).  

3. **Similarity Calculation**  
   - **`compute_similarity(row)`**:  
     1. Encodes two sentences (`sentence1` and `sentence2`) into dense vector embeddings.  
     2. Computes the **cosine similarity** between the embeddings (a value between `-1` and `1`).  
     3. Returns the similarity score as a float.  

4. **DataFrame Processing**  
   - Applies the `compute_similarity` function to each row of the DataFrame (`df_ja_aug`).  
   - Uses `tqdm.progress_apply` to show a progress bar during computation.  
   - Stores results in a new column **`score`**.  
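
As a reminder of what the similarity step computes, the sketch below implements cosine similarity on plain Python lists. The vectors are invented for illustration; in the notebook the embeddings come from `model.encode` and the similarity from `util.pytorch_cos_sim`.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented 2-D vectors covering the three boundary cases.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])        # unrelated vectors
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])        # opposite vectors

print(same_direction, orthogonal, opposite)
```

The third case shows why the theoretical range is `-1` to `1`: a negative similarity would also give a negative value after multiplying by 5.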



In [None]:
df_ja_aug = pd.read_csv('/content/drive/MyDrive/df_ja_aug.csv')

In [None]:
from sentence_transformers import SentenceTransformer, util
import torch
import pandas as pd
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# 1. Load multilingual similarity model (Japanese-English)
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2').to(device)
model.eval()

# 2. Compute similarity function
def compute_similarity(row):
    embeddings = model.encode([row['sentence1'], row['sentence2']], convert_to_tensor=True)
    return util.pytorch_cos_sim(embeddings[0], embeddings[1]).item()

# 3. Apply to dataframe with progress bar
tqdm.pandas(desc="Computing similarity scores")
df_ja_aug['score'] = df_ja_aug.progress_apply(compute_similarity, axis=1)


Using device: cuda


Computing similarity scores: 100%|██████████| 85515/85515 [18:51<00:00, 75.60it/s]


In [None]:
df_ja_aug.head()

Unnamed: 0,sentence1,sentence2,score,lang1,lang2
0,空から見ると、大きな飛行機、とても巨大な飛行機です。 小さな子供がはしゃいでいました。 そし...,ein flugzeug hebt gerade ab.,0.715067,ja,de
1,機体の重量、スピード、エンジン出力、機体が空を飛んでいる間、空から、飛行機は地面と平行に飛行...,an air plane is taking off.,0.640819,ja,en
2,着陸する時に機体が風を受けて風抵抗が強くなります。 主翼の空気抵抗が機体に影響を与えるので、...,un avión está despegando.,0.685746,ja,es
3,目的地、空港、または目的地の電話番号を入力すると、フライト情報のページに移動できます。次に空...,un avion est en train de décoller.,0.234093,ja,fr
4,空港の端には、「出発」「着陸」といったアナウンスがあります。 それぞれ英語と中国語で表記され...,un aereo sta decollando.,0.429997,ja,it


In [None]:
df_ja_aug_score = df_ja_aug.copy()
df_ja_aug_score.to_csv('/content/drive/MyDrive/df_ja_aug_score.csv', index=False)

In [None]:
import pandas as pd
df_ja_aug_score = pd.read_csv('/content/drive/MyDrive/df_ja_aug_score.csv')

### **Rescaling the similarity score**
This code **rescales the `score` column** from `0-1` to `0-5` (multiplying by 5 and rounding to 2 decimals).  

In [None]:
df_ja_aug_final = df_ja_aug_score.copy()

df_ja_aug_final['score'] = (df_ja_aug_score['score'] * 5).round(2)

df_ja_aug_final.head()

Unnamed: 0,sentence1,sentence2,score,lang1,lang2
0,空から見ると、大きな飛行機、とても巨大な飛行機です。 小さな子供がはしゃいでいました。 そし...,ein flugzeug hebt gerade ab.,3.58,ja,de
1,機体の重量、スピード、エンジン出力、機体が空を飛んでいる間、空から、飛行機は地面と平行に飛行...,an air plane is taking off.,3.2,ja,en
2,着陸する時に機体が風を受けて風抵抗が強くなります。 主翼の空気抵抗が機体に影響を与えるので、...,un avión está despegando.,3.43,ja,es
3,目的地、空港、または目的地の電話番号を入力すると、フライト情報のページに移動できます。次に空...,un avion est en train de décoller.,1.17,ja,fr
4,空港の端には、「出発」「着陸」といったアナウンスがあります。 それぞれ英語と中国語で表記され...,un aereo sta decollando.,2.15,ja,it


In [None]:
df_ja_aug_final.to_csv('/content/drive/MyDrive/df_ja_aug_final.csv', index=False)

---
## **Final Dataset**

In [None]:
df_en_aug = pd.read_csv('/content/drive/MyDrive/Copy of df_en_aug.csv') # English
df_es_aug = pd.read_csv('/content/drive/MyDrive/df_es_aug.csv') # Spanish
df_fr_aug = pd.read_csv('/content/drive/MyDrive/df_fr_aug.csv') # French
df_it_aug = pd.read_csv('/content/drive/MyDrive/df_it_aug.csv') # Italian
df_de_aug = pd.read_csv('/content/drive/MyDrive/df_de_aug.csv') # German
df_ru_aug = pd.read_csv('/content/drive/MyDrive/df_ru_aug.csv') # Russian
df_zh_aug = pd.read_csv('/content/drive/MyDrive/df_zh_aug.csv') # Chinese
df_nl_aug = pd.read_csv('/content/drive/MyDrive/df_nl_aug.csv') # Dutch
df_paraphrased = pd.read_csv('/content/drive/MyDrive/paraphrased_data.csv') # Portuguese & Polish
df_ja_aug_final = pd.read_csv('/content/drive/MyDrive/df_ja_aug_final.csv') # Japanese
clean_df = pd.read_csv('/content/drive/MyDrive/clean df.csv')

In [None]:
augmented_df = pd.concat([clean_df, df_en_aug, df_es_aug, df_fr_aug, df_it_aug, df_de_aug, df_ru_aug, df_zh_aug, df_nl_aug, df_paraphrased, df_ja_aug_final], axis=0, ignore_index=True)
augmented_df.shape

(1881076, 5)

This code drops duplicate sentence pairs and rows containing NaN from `augmented_df` and stores the result in `final_df`.



In [None]:
# .copy() makes final_df an independent DataFrame, so the column assignments below do not raise SettingWithCopyWarning
final_df = augmented_df.drop_duplicates(subset=['sentence1', 'sentence2'])
final_df = final_df.dropna(subset=["sentence1", "sentence2", "score"]).copy()
final_df["sentence1"] = final_df["sentence1"].astype(str)
final_df["sentence2"] = final_df["sentence2"].astype(str)
final_df.shape


(1847162, 5)

In [None]:
final_df.to_csv('/content/drive/MyDrive/final_df.csv', index=False)

In [None]:
clean_df.shape

(940538, 5)


# Dataset Augmentation  

**From 940,538 to 1,847,162 rows** - We nearly doubled our dataset size through:  

1. **Back-Translation**  
   - Generated paraphrased versions of `sentence1` by translating through a pivot language and back (e.g., ru→en→ru, zh→en→zh, nl→en→nl)  

2. **Word-Level Augmentation**  
   - Used FastText embeddings to create synonym-swapped variants of the Polish and Portuguese sentences  

3. **Per-Language Processing**  
   - Applied augmentation separately for each source language (`lang1`), so that every sentence is handled by a model for its own language  

4. **Text Continuation**  
   - Leveraged Japanese GPT-2 to generate fluent, context-aware sentence continuations, then recomputed the similarity scores for these pairs  

The augmented dataset contains more lexical and syntactic variation. For back-translated and synonym-swapped rows the original scores are kept, on the assumption that the augmentation preserves meaning.  

*(+906,624 rows | +96.4% expansion)*
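
The expansion figures can be checked directly from the shapes reported earlier in the notebook. The variable names below are hypothetical; the three row counts are the outputs of `clean_df.shape`, `augmented_df.shape`, and `final_df.shape`.

```python
rows_before = 940_538     # clean_df.shape[0]
rows_concat = 1_881_076   # augmented_df.shape[0], before cleaning
rows_after = 1_847_162    # final_df.shape[0], after dropping duplicates and NaN

added = rows_after - rows_before
growth_pct = 100 * added / rows_before
dropped = rows_concat - rows_after

print(f"+{added:,} rows | +{growth_pct:.1f}% | {dropped:,} rows dropped during cleaning")
```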



---


# **3. MODEL DEVELOPMENT and EVALUATION**


To train our model we used `BertTokenizer` and `BertModel`; the documentation is available at: https://huggingface.co/docs/transformers/model_doc/bert

BERT is a bidirectional transformer pretrained on unlabeled text to predict masked tokens in a sentence and to predict whether one sentence follows another. The main idea is that by randomly masking some tokens, the model can train on text to the left and right, giving it a more thorough understanding.





In [None]:
import pandas as pd
final_df = pd.read_csv('/content/drive/MyDrive/final_df.csv')

In [None]:
final_df.head(100)

Unnamed: 0,sentence1,sentence2,score,lang1,lang2
0,ein flugzeug hebt gerade ab.,an air plane is taking off.,5.0,de,en
1,ein flugzeug hebt gerade ab.,un avión está despegando.,5.0,de,es
2,ein flugzeug hebt gerade ab.,un avion est en train de décoller.,5.0,de,fr
3,ein flugzeug hebt gerade ab.,un aereo sta decollando.,5.0,de,it
4,ein flugzeug hebt gerade ab.,飛行機が離陸します。,5.0,de,ja
...,...,...,...,...,...
95,самолет взлетает.,飛行機が離陸します。,5.0,ru,ja
96,самолет взлетает.,er gaat een vliegtuig opstijgen.,5.0,ru,nl
97,самолет взлетает.,samolot wystartował.,5.0,ru,pl
98,самолет взлетает.,um avião aéreo está a descolar.,5.0,ru,pt


# **Multilingual Sentence Similarity with BERT**

## Overview
This code implements a BERT-based model for predicting similarity scores between sentence pairs across multiple languages. We tested it on a **subset of 50,000** rows for efficient prototyping.

## Key Components

### 1. Setup & Data Preparation
- **Reproducibility**: Fixed random seeds (42) for all libraries
- **Dataset Sampling**:
  - 50,000 random samples (`small_df`)
- **Train-Test Split**: 90-10 ratio

### 2. Data Processing
- **Tokenizer**: `bert-base-multilingual-cased` (handles 104 languages)
- **Custom Dataset Class**:
  - Processes sentence pairs with fixed-length padding and truncation (`max_len=64`)
  - Returns dictionary with:
    - `input_ids`, `attention_mask`, `token_type_ids`
    - Similarity `score` as a float tensor (0-5 range, not rescaled)
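
To show what the three returned fields look like for a sentence pair, here is a toy sketch that works on word lists instead of real subword ids. `encode_pair` is a hypothetical helper, not part of the `transformers` API, and its tie-breaking rule during truncation is a simplifying assumption.

```python
def encode_pair(tokens_a, tokens_b, max_len=12, pad="[PAD]"):
    """Toy BERT-style pair encoding: [CLS] A [SEP] B [SEP], padded to max_len."""
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)
    # 'longest_first'-style truncation: trim the longer sentence one token at a time
    # until the pair fits together with the 3 special tokens.
    while len(tokens_a) + len(tokens_b) > max_len - 3:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS] + A + [SEP]; segment 1 covers B + [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)
    n_pad = max_len - len(tokens)
    return {
        "tokens": tokens + [pad] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }

enc = encode_pair(["a", "plane", "takes", "off"], ["un", "avion", "décolle"], max_len=12)
short = encode_pair(list("abcdefgh"), list("xy"), max_len=8)

print(enc["tokens"])
print(enc["token_type_ids"])
print(enc["attention_mask"])
print(short["tokens"])
```

Every example has exactly `max_len` positions, so batches can be stacked without further padding, at the cost of computing attention over padding for short pairs.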


In [None]:
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertModel
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
import random
import warnings
warnings.filterwarnings("ignore")
from tqdm import tqdm
import logging
logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.ERROR)

# ------------------ SETUP ------------------

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed()

# Subsample the dataset for fast prototyping
small_df = final_df.sample(n=50000, random_state=42)

# Train/test split
train_df, test_df = train_test_split(small_df, test_size=0.1, random_state=42)

# Multilingual tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

# ------------------ DATASET ------------------

class SentencePairDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len=64):
        self.df = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        # Encode the two sentences as one pair: [CLS] sentence1 [SEP] sentence2 [SEP]
        inputs = self.tokenizer(
            row["sentence1"],
            row["sentence2"],
            padding='max_length',
            truncation='longest_first',
            max_length=self.max_len,
            return_overflowing_tokens=False,
            return_tensors="pt",
        )
        # Drop the batch dimension added by return_tensors="pt"
        item = {key: val.squeeze(0) for key, val in inputs.items()}
        item["score"] = torch.tensor(row["score"], dtype=torch.float)
        return item

# ------------------ MODEL ------------------

class BERTSimilarityModel(nn.Module):
    def __init__(self):
        super(BERTSimilarityModel, self).__init__()
        self.bert = BertModel.from_pretrained("bert-base-multilingual-cased")
        self.dropout = nn.Dropout(0.3)
        self.regressor = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )
        cls_output = outputs.pooler_output
        x = self.dropout(cls_output)
        return self.regressor(x).squeeze(-1)

# ------------------ TRAINING ------------------

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BERTSimilarityModel().to(device)

train_dataset = SentencePairDataset(train_df, tokenizer, max_len=64)
test_dataset = SentencePairDataset(test_df, tokenizer, max_len=64)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16)

optimizer = optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.MSELoss()

def train_model(model, train_loader, loss_fn, optimizer, epochs=1):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for batch in train_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            token_type_ids = batch['token_type_ids'].to(device)
            scores = batch['score'].to(device)

            optimizer.zero_grad()
            outputs = model(input_ids, attention_mask, token_type_ids)
            loss = loss_fn(outputs, scores)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        print(f"Epoch {epoch+1}/{epochs} - Loss: {total_loss / len(train_loader):.4f}")

# ------------------ EVALUATION ------------------

def evaluate_model(model, test_loader):
    model.eval()
    all_preds = []
    all_targets = []

    with torch.no_grad():
        for batch in test_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            token_type_ids = batch['token_type_ids'].to(device)
            scores = batch['score'].to(device)

            outputs = model(input_ids, attention_mask, token_type_ids)
            all_preds.extend(outputs.cpu().numpy())
            all_targets.extend(scores.cpu().numpy())

    mse = mean_squared_error(all_targets, all_preds)
    mae = mean_absolute_error(all_targets, all_preds)
    r2 = r2_score(all_targets, all_preds)

    print("\n Evaluation:")
    print(f"→ MSE: {mse:.4f}")
    print(f"→ MAE: {mae:.4f}")
    print(f"→ R2 Score: {r2:.4f}")

# ------------------ RUN ------------------

train_model(model, train_loader, loss_fn, optimizer, epochs=1)
evaluate_model(model, test_loader)

Epoch 1/1 - Loss: 1.0843

 Evaluation:
→ MSE: 0.6595
→ MAE: 0.6221
→ R2 Score: 0.6966


# **Multilingual Sentence Similarity with BERT**

The code below implements a BERT-based regression model that predicts semantic similarity scores for sentence pairs across multiple languages. Scores lie on a continuous scale (typically 0-5).

## Implementation Overview

### 1. Initial Setup

* **Imports**: Key libraries including:
  - PyTorch for model building and training
  - Transformers for BERT implementation
  - Pandas for data handling
  - Scikit-learn for metrics and utilities
  - Additional utilities (tqdm, logging, warnings)

* **Configuration**:
  - Sets up GPU if available, otherwise uses CPU
  - Configures logging to suppress tokenizer warnings
  - Initializes the multilingual BERT tokenizer

### 2. Dataset Preparation

* **Custom Dataset Classes**:
  - `TrainSentencePairDataset`: Processes training data with:
    - Tokenization of sentence pairs
    - Padding/truncation to fixed length (32 tokens)
    - Includes similarity scores as targets
    - Preserves language pair information
  
  - `TestSentencePairDataset`: Similar but without scores (for prediction)

* **Data Loading**:
  - Creates DataLoader instances for batch processing
  - Handles shuffling (training) and fixed ordering (testing)

### 3. Model Architecture

* **BERTSimilarityModel**:
  - Built on pretrained `bert-base-multilingual-cased`
  - Architecture components:
    1. BERT encoder (fine-tuned end to end; all of its parameters are passed to the optimizer and none are frozen)
    2. Dropout layer (p=0.3) for regularization
    3. Linear regression head for score prediction
  - Processes input through:
    - Token IDs
    - Attention masks
    - Token type IDs (for sentence pair differentiation)
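
The three inputs listed above can be illustrated without loading BERT. The following is a minimal sketch, assuming the standard BERT pair layout `[CLS] A [SEP] B [SEP]`; the helper `encode_pair` and the word-level tokens are hypothetical and stand in for what the real tokenizer produces (it emits integer IDs for WordPiece tokens, not strings).

```python
def encode_pair(tokens_a, tokens_b, max_len=10):
    """Build BERT-style inputs for a sentence pair (illustrative only)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers B and its [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Real tokens get mask 1 so attention can see them.
    attention_mask = [1] * len(tokens)

    pad = max_len - len(tokens)
    if pad < 0:
        raise ValueError("pair is longer than max_len; truncate it first")
    # Padding positions are masked out (0) and assigned to segment 0.
    tokens += ["[PAD]"] * pad
    token_type_ids += [0] * pad
    attention_mask += [0] * pad
    return tokens, attention_mask, token_type_ids


tokens, attention_mask, token_type_ids = encode_pair(["the", "cat"], ["il", "gatto"])
print(tokens)
print(attention_mask)
print(token_type_ids)
```

The attention mask separates real tokens from padding, while the token type IDs tell the encoder which of the two sentences each position belongs to.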

### 4. Training Process

* **Optimization Setup**:
  - AdamW optimizer (learning rate 2e-5)
  - MSE loss function
  - Mixed precision training (autocast + GradScaler)

* **Training Loop**:
  - Batched processing with progress tracking
  - Gradient scaling for mixed precision
  - Loss reporting per epoch
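
To see why gradient scaling accompanies mixed precision, the sketch below uses NumPy only (not the notebook's training code) to show a small value underflowing to zero in float16 and surviving when it is scaled first. The gradient value `1e-8` and the scale factor `2**16` are illustrative assumptions.

```python
import numpy as np

grad = np.float32(1e-8)      # hypothetical tiny gradient value
scale = np.float32(2 ** 16)  # illustrative loss-scale factor

# Cast directly to half precision: the value is below the float16 range and becomes 0.
unscaled_fp16 = np.float16(grad)

# Scale first, cast to half precision, then unscale in float32.
scaled_fp16 = np.float16(grad * scale)
recovered = np.float32(scaled_fp16) / scale

print("direct float16 cast:", unscaled_fp16)
print("scaled, cast, unscaled:", recovered)
```

Multiplying the loss by a constant multiplies every gradient by the same constant, so dividing the gradients by it before the optimizer step restores their original magnitude.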

### 5. Evaluation & Prediction

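* **Validation** (Optional):
  - Splits off 20% of the training data as a validation subset; because the split is made after the model has trained on all of `train_df`, the metrics describe fit on already-seen data rather than generalization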
  - Computes metrics:
    - Mean Squared Error (MSE)
    - Mean Absolute Error (MAE)
    - R² Score

* **Prediction**:
  - Generates scores for test data
  - Preserves original test data structure
  - Tracks language pairs with predictions
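
The three validation metrics listed above can be written out by hand. The sketch below is a pure-Python illustration on made-up numbers, not the notebook's evaluation code (which calls scikit-learn); the function `regression_metrics` and the toy values are hypothetical.

```python
def regression_metrics(targets, preds):
    """Return (MSE, MAE, R^2) for two equal-length lists of floats."""
    n = len(targets)
    errors = [p - t for p, t in zip(preds, targets)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # R^2 compares the model's squared error with that of always predicting the mean target.
    mean_t = sum(targets) / n
    var_t = sum((t - mean_t) ** 2 for t in targets) / n
    r2 = 1.0 - mse / var_t
    return mse, mae, r2


# Toy values for illustration only
targets = [0.0, 1.0, 2.0, 3.0, 4.0]
preds = [0.5, 1.0, 2.5, 2.5, 4.0]
mse, mae, r2 = regression_metrics(targets, preds)
print(f"MSE={mse:.3f} MAE={mae:.3f} R2={r2:.3f}")  # MSE=0.150 MAE=0.300 R2=0.925
```

Because R² equals one minus MSE divided by the variance of the targets, MSE and R² carry the same information once the spread of the scores is known, while MAE is less sensitive to a few large errors.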

## Usage Flow

1. Initializes model and data loaders
2. Trains for the specified number of epochs (3 in this run; `train_model` defaults to 1)
3. Optionally evaluates on validation split
4. Generates predictions for test set
5. Returns results dataframe with predicted scores

In [None]:
test_df = final_df.sample(frac=0.2, random_state=42)
train_df = final_df.drop(test_df.index)

In [None]:
test_df = test_df.drop(columns = ['score'])

In [None]:
test_df.head()

Unnamed: 0,sentence1,sentence2,lang1,lang2
260396,мусульмане сказали бы то же самое о коране.,印度教徒对《薄伽梵歌》也会有同样的说法。,ru,zh
341732,según la oficina del censo la población hispan...,de spaanse bevolking is ten opzichte van de vo...,es,nl
747221,但最终，该蠕虫所做的只是访问了一个色情网站，赛门铁克安全响应公司在加州的安全总监vincen...,ma vincent weafer direttore della sicurezza di...,zh,it
481613,les stocks de la chine ont augmenté après la r...,w środę zapasy chińskie zamykają się wyżej,fr,pl
1398946,"Ожидается, что министры сельского хозяйства из...",die us-landwirtschaftsministerin ann veneman e...,ru,de


In [None]:
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertModel
import torch.optim as optim
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
import warnings
warnings.filterwarnings("ignore")
from tqdm import tqdm
import logging
logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.ERROR)


# Setup
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Dataset classes
class TrainSentencePairDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len=32):
        self.df = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        inputs = self.tokenizer(
            row["sentence1"],
            row["sentence2"],
            padding='max_length',
            truncation='longest_first',
            max_length=self.max_len,
            return_overflowing_tokens=False,
            return_tensors="pt",
        )
        item = {key: val.squeeze(0) for key, val in inputs.items()}
        item["score"] = torch.tensor(row["score"], dtype=torch.float)
        item["lang1"] = row["lang1"]
        item["lang2"] = row["lang2"]
        return item

class TestSentencePairDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len=32):
        self.df = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        inputs = self.tokenizer(
            row["sentence1"],
            row["sentence2"],
            padding='max_length',
            truncation='longest_first',
            max_length=self.max_len,
            return_overflowing_tokens=False,
            return_tensors="pt",
        )
        item = {key: val.squeeze(0) for key, val in inputs.items()}
        item["lang1"] = row["lang1"]
        item["lang2"] = row["lang2"]
        # No score for test set
        return item

# Model
class BERTSimilarityModel(nn.Module):
    def __init__(self):
        super(BERTSimilarityModel, self).__init__()
        self.bert = BertModel.from_pretrained("bert-base-multilingual-cased")
        self.dropout = nn.Dropout(0.3)
        self.regressor = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )
        x = self.dropout(outputs.pooler_output)
        return self.regressor(x).squeeze(-1)

# Dataloaders
train_loader = DataLoader(TrainSentencePairDataset(train_df, tokenizer), batch_size=32, shuffle=True)
test_loader = DataLoader(TestSentencePairDataset(test_df, tokenizer), batch_size=32)

# Model setup
model = BERTSimilarityModel().to(device)
optimizer = optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()

# Training loop
def train_model(model, train_loader, loss_fn, optimizer, epochs=1):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}"):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            token_type_ids = batch['token_type_ids'].to(device)
            scores = batch['score'].to(device)

            optimizer.zero_grad()
            with torch.cuda.amp.autocast():
                outputs = model(input_ids, attention_mask, token_type_ids)
                loss = loss_fn(outputs, scores)

            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
            total_loss += loss.item()

        print(f"\n Epoch {epoch+1}/{epochs} - Loss: {total_loss / len(train_loader):.4f}")

# Generate predictions for test set
def generate_predictions(model, test_loader):
    model.eval()
    all_preds = []
    all_lang_pairs = []

    with torch.no_grad():
        for batch in tqdm(test_loader, desc="Generating predictions"):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            token_type_ids = batch['token_type_ids'].to(device)

            lang1_list = batch['lang1']
            lang2_list = batch['lang2']

            with torch.cuda.amp.autocast():
                outputs = model(input_ids, attention_mask, token_type_ids)

            predictions = outputs.cpu().numpy()
            all_preds.extend(predictions)

            # Save language pairs for reference
            for i in range(len(predictions)):
                all_lang_pairs.append((lang1_list[i], lang2_list[i]))

    # Create a DataFrame with predictions
    results_df = test_df.copy()
    results_df['predicted_score'] = all_preds

    print(f"Generated {len(all_preds)} predictions")
    return results_df

# Function to evaluate model on validation data (if needed)
def evaluate_model(model, val_loader):
    model.eval()
    all_preds, all_targets = [], []

    with torch.no_grad():
        for batch in tqdm(val_loader, desc="Evaluating"):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            token_type_ids = batch['token_type_ids'].to(device)
            scores = batch['score'].to(device)

            with torch.cuda.amp.autocast():
                outputs = model(input_ids, attention_mask, token_type_ids)

            all_preds.extend(outputs.cpu().numpy())
            all_targets.extend(scores.cpu().numpy())

    mse = mean_squared_error(all_targets, all_preds)
    mae = mean_absolute_error(all_targets, all_preds)
    r2 = r2_score(all_targets, all_preds)
    print("\n Evaluation:")
    print(f"→ MSE: {mse:.4f}")
    print(f"→ MAE: {mae:.4f}")
    print(f"→ R2 Score: {r2:.4f}")

    return mse, mae, r2

# Training and prediction flow
print(f"Training on {len(train_df)} samples")
train_model(model, train_loader, loss_fn, optimizer, epochs=3)

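# Optional: evaluate on a validation split drawn from train_df.
# Note: the model above was trained on all of train_df, so val_subset was seen
# during training and the metrics below are optimistic. For a held-out estimate,
# make this split before calling train_model and train on train_subset only.
from sklearn.model_selection import train_test_split
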
train_subset, val_subset = train_test_split(train_df, test_size=0.2, random_state=42)
val_loader = DataLoader(TrainSentencePairDataset(val_subset, tokenizer), batch_size=32)
mse, mae, r2 = evaluate_model(model, val_loader)

# Generate predictions for the test set
print(f"Generating predictions for {len(test_df)} test samples")
results_df = generate_predictions(model, test_loader)

Training on 1477730 samples


Epoch 1: 100%|██████████| 46180/46180 [1:41:18<00:00,  7.60it/s]



 Epoch 1/3 - Loss: 0.3140


Epoch 2: 100%|██████████| 46180/46180 [1:41:23<00:00,  7.59it/s]



 Epoch 2/3 - Loss: 0.0826


Epoch 3: 100%|██████████| 46180/46180 [1:41:35<00:00,  7.58it/s]



 Epoch 3/3 - Loss: 0.0604


Evaluating: 100%|██████████| 9236/9236 [05:44<00:00, 26.83it/s]



 Evaluation:
→ MSE: 0.0339
→ MAE: 0.1062
→ R2 Score: 0.9844
Generating predictions for 369432 test samples


Generating predictions: 100%|██████████| 11545/11545 [06:51<00:00, 28.04it/s]


Generated 369432 predictions
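
The iteration counts in the progress bars above follow from the sample counts and the batch size of 32. As a quick sanity check, the sketch below reproduces them, assuming a loader that keeps the final partial batch; `num_batches` is a hypothetical helper, not part of the notebook.

```python
import math


def num_batches(n_samples, batch_size=32):
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(n_samples / batch_size)


n_train = 1_477_730   # "Training on 1477730 samples"
n_test = 369_432      # "Generating predictions for 369432 test samples"
n_val = n_train // 5  # the 20% validation split of train_df

print(num_batches(n_train))  # 46180
print(num_batches(n_val))    # 9236
print(num_batches(n_test))   # 11545
```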


In [None]:
results_df.head(100)

Unnamed: 0,sentence1,sentence2,lang1,lang2,predicted_score
260396,мусульмане сказали бы то же самое о коране.,印度教徒对《薄伽梵歌》也会有同样的说法。,ru,zh,0.656250
341732,según la oficina del censo la población hispan...,de spaanse bevolking is ten opzichte van de vo...,es,nl,3.205078
747221,但最终，该蠕虫所做的只是访问了一个色情网站，赛门铁克安全响应公司在加州的安全总监vincen...,ma vincent weafer direttore della sicurezza di...,zh,it,4.964844
481613,les stocks de la chine ont augmenté après la r...,w środę zapasy chińskie zamykają się wyżej,fr,pl,3.275391
1398946,"Ожидается, что министры сельского хозяйства из...",die us-landwirtschaftsministerin ann veneman e...,ru,de,3.771484
...,...,...,...,...,...
1639042,Um pequeno automóvel amarelo em cima de Um sem...,rode sportwagen bovenop witte semi-vrachtwagen.,pt,nl,4.066406
1110349,El funeral de la novia de Oscar Pistorius reev...,"oscar pistorius ""a tiré sur steenkamp dans la ...",es,fr,1.721680
399050,se a índia culpa os militantes estrangeiros fa...,abe a demandé une politique étrangère plus aff...,pt,fr,1.138672
1465384,黑白母牛停在门口,白い飛行機が空を飛んでいます。,zh,ja,0.041138
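
Since `results_df` keeps the `lang1` and `lang2` columns, predictions can be summarized per language pair. The sketch below shows one possible way to do this with pandas; it uses a small hypothetical frame (`toy_results`, with made-up scores) in place of the real `results_df`.

```python
import pandas as pd

# Hypothetical stand-in for results_df; the scores are made up for illustration.
toy_results = pd.DataFrame({
    "lang1": ["ru", "ru", "es", "es"],
    "lang2": ["zh", "zh", "nl", "nl"],
    "predicted_score": [0.5, 1.5, 3.0, 4.0],
})

# Count and average the predictions for each (lang1, lang2) combination.
summary = (
    toy_results
    .groupby(["lang1", "lang2"])["predicted_score"]
    .agg(["count", "mean"])
    .reset_index()
)
print(summary)
```

Applied to the real frame, the same grouping would show whether some language pairs receive systematically higher or lower predicted scores than others.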




---

## Model weights

In [None]:
# Save the model weights
save_path = "/content/drive/MyDrive/bert_similarity_model_weights.pth"
torch.save(model.state_dict(), save_path)
print(f"Model weights saved to: {save_path}")


Model weights saved to: /content/drive/MyDrive/bert_similarity_model_weights.pth


In [42]:
# Initialize the model architecture
model = BERTSimilarityModel().to(device)

# Load the saved weights
model.load_state_dict(torch.load("/content/drive/MyDrive/bert_similarity_model_weights.pth", map_location=device))
model.eval()
print("Model weights loaded.")


Model weights loaded.




---


# Testing our model with unseen data

The code below runs the trained model on a small set of hand-written sentence pairs that are not part of the dataset.

## Function: `predict_similarity(model, sentence1, sentence2)`

### What it does:
1. **Prepares Input**:
   - Tokenizes the sentence pair with:
     - `max_length=32` (truncates longer sequences)
     - `padding='max_length'` (pads shorter sequences)
     - `truncation='longest_first'` (truncates from the longer sentence)
   - Converts tokens to PyTorch tensors
   - Moves inputs to the specified device (GPU/CPU)

2. **Makes Prediction**:
   - Runs in evaluation mode (`model.eval()`)
   - Disables gradient calculation (`torch.no_grad()`) for efficiency
   - Returns the similarity score as a NumPy scalar (the first element of the model's output array)
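
The `longest_first` strategy mentioned above can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual implementation: `truncate_longest_first` is a hypothetical helper, the three reserved positions assume one `[CLS]` and two `[SEP]` tokens, and the tie-breaking rule (trimming the second sentence when both are equally long) is an assumption.

```python
def truncate_longest_first(tokens_a, tokens_b, max_len=32, num_special=3):
    """Trim one token at a time from whichever sentence is currently longer."""
    a, b = list(tokens_a), list(tokens_b)
    budget = max_len - num_special  # room left after the special tokens
    while len(a) + len(b) > budget:
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()  # on ties, trim the second sentence (assumed behavior)
    return a, b


long_sentence = [f"a{i}" for i in range(40)]
short_sentence = [f"b{i}" for i in range(10)]
a, b = truncate_longest_first(long_sentence, short_sentence)
print(len(a), len(b))  # 19 10
```

With a 32-token limit, a short sentence paired with a long one is left intact while the long one loses its tail, so the end of long sentences never reaches the model.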


### Output Format:
For each pair, it prints:

Frase 1: '[sentence1]'

Frase 2: '[sentence2]'

→ Score predetto: [score]

In [43]:
def predict_similarity(model, sentence1, sentence2):
    model.eval()
    inputs = tokenizer(
        sentence1,
        sentence2,
        padding='max_length',
        truncation='longest_first',
        max_length=32,
        return_tensors="pt"
    ).to(device)

    with torch.no_grad():
        outputs = model(**inputs).cpu().numpy()[0]

    return outputs

# Example sentences (replace them with cases relevant to your task)
test_cases = [
    ("Il gatto si siede sul tappeto", "The cat sits on the mat"),
    ("Il cielo è blu", "数学は難しい"),
    ("La pizza est délicieuse", "La pizza è deliziosa"),
    ("Pythonはプログラミング言語です", "Python es un lenguaje de programación"),
    ("Der Himmel ist blau", "Python est un serpent?"),
    ("Banks are closed", "Los bancos son peligrosos"),
    ("De hond slaapt op de bank", "Der Hund schläft auf der Couch"),
    ("Het boek is interessant", "Książka jest interesująca"),
    ("De appel is rood", "Яблоко вкусное"),
    ("狗坐在垫子上", "Der Hund sitzt auf der Matte"),
    ("苹果好吃", "Apple ist ein Unternehmen")
]

for s1, s2 in test_cases:
    score = predict_similarity(model, s1, s2)
    print(f"Frase 1: '{s1}'\nFrase 2: '{s2}'\n→ Score predetto: {score:.2f}\n")

Frase 1: 'Il gatto si siede sul tappeto'
Frase 2: 'The cat sits on the mat'
→ Score predetto: 2.70

Frase 1: 'Il cielo è blu'
Frase 2: '数学は難しい'
→ Score predetto: 1.14

Frase 1: 'La pizza est délicieuse'
Frase 2: 'La pizza è deliziosa'
→ Score predetto: 4.03

Frase 1: 'Pythonはプログラミング言語です'
Frase 2: 'Python es un lenguaje de programación'
→ Score predetto: 3.47

Frase 1: 'Der Himmel ist blau'
Frase 2: 'Python est un serpent?'
→ Score predetto: 0.74

Frase 1: 'Banks are closed'
Frase 2: 'Los bancos son peligrosos'
→ Score predetto: 3.01

Frase 1: 'De hond slaapt op de bank'
Frase 2: 'Der Hund schläft auf der Couch'
→ Score predetto: 3.20

Frase 1: 'Het boek is interessant'
Frase 2: 'Książka jest interesująca'
→ Score predetto: 3.13

Frase 1: 'De appel is rood'
Frase 2: 'Яблоко вкусное'
→ Score predetto: 1.97

Frase 1: '狗坐在垫子上'
Frase 2: 'Der Hund sitzt auf der Matte'
→ Score predetto: 3.13

Frase 1: '苹果好吃'
Frase 2: 'Apple ist ein Unternehmen'
→ Score predetto: 2.62

