# Language Classification Model Using the XLM-RoBERTa Transformer

Load and Filter Dataset:

Load the dataset and filter it to include only specific languages (English, Spanish, French, German, and Italian). Then take a sample of the filtered dataset for quick prototyping.

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv("sentences.csv", sep="\t", header=None, names=["id", "lang", "text"])

# Filter dataset for a few languages (e.g., English, Spanish, French, German, Italian)
languages = ["eng", "spa", "fra", "deu", "ita"]
df_filtered = df[df['lang'].isin(languages)]

# Select a sample for quick prototyping
df_sample = df_filtered.sample(1000)
print(df_sample.head())

                id lang                                               text
3713910    3955456  eng  The thick clouds which cover Venus cause a "gr...
5368446    5723029  ita                                          Le vedrò.
11393369  11860590  eng             Leonid’s eyes turned into a reptile’s.
10243780  10703006  eng  At that moment I was walking towards the station.
11251489  11717624  eng  Pietro came as fast as he could to the costume...


In [2]:
from transformers import pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score


Prepare Data for Training: 

Extract texts and labels, then split the data into training and test sets.

In [3]:
texts = df_sample['text'].tolist()
labels = df_sample['lang'].tolist()

# Split data into training and test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.2, random_state=42)
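
The labels above are language-code strings, while a sequence-classification head predicts integer class ids. A minimal sketch of one way to build that mapping follows. The names `label2id`, `id2label`, and `encode_labels` are illustrative and not part of the original notebook, and sorting the codes to fix the id order is an assumption.

```python
# Illustrative helper (not from the original notebook): map language codes
# to the integer class ids that a classification head expects.
languages = ["eng", "spa", "fra", "deu", "ita"]

# Sorting keeps the mapping stable regardless of the order above (an assumption).
label2id = {code: idx for idx, code in enumerate(sorted(languages))}
id2label = {idx: code for code, idx in label2id.items()}


def encode_labels(labels):
    """Convert a list of language codes into integer class ids."""
    return [label2id[code] for code in labels]


example_labels = ["eng", "ita", "eng", "deu"]
print(encode_labels(example_labels))  # [1, 3, 1, 0]
```

Keeping `id2label` next to `label2id` makes it possible to translate predicted ids back into language codes at evaluation time.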

Load Model:

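Load the pretrained xlm-roberta-base checkpoint in a text-classification pipeline. The checkpoint ships without a fine-tuned classification head, so the head is randomly initialized and its predictions are not meaningful until the model is trained, as the warning below indicates.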

In [4]:
classifier = pipeline("text-classification", model="xlm-roberta-base")

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Make Predictions:

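Predict the language on the test set and evaluate the results. Because the classification head is untrained, the pipeline returns generic labels such as LABEL_0 rather than language codes, so the scores below are expected to be zero.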

In [5]:
predictions = classifier(test_texts)

# Extract predicted labels
predicted_labels = [prediction['label'] for prediction in predictions]

In [6]:
print("Accuracy:", accuracy_score(test_labels, predicted_labels))
print("Classification Report:\n", classification_report(test_labels, predicted_labels))

Accuracy: 0.0
Classification Report:
               precision    recall  f1-score   support

     LABEL_0       0.00      0.00      0.00       0.0
         deu       0.00      0.00      0.00      37.0
         eng       0.00      0.00      0.00      89.0
         fra       0.00      0.00      0.00      19.0
         ita       0.00      0.00      0.00      41.0
         spa       0.00      0.00      0.00      14.0

    accuracy                           0.00     200.0
   macro avg       0.00      0.00      0.00     200.0
weighted avg       0.00      0.00      0.00     200.0
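
The 0.00 scores follow from a string mismatch: the pipeline returns generic names such as LABEL_0, while the reference labels are codes such as eng, so no prediction can equal its reference. The sketch below reproduces this with hand-written stand-in predictions (not real model output) and shows how translating through a label mapping makes the comparison meaningful. The mapping `pipeline_label_to_code` is hypothetical and would only be valid for a model fine-tuned with that label order.

```python
# Stand-in data (hypothetical, not real model output).
reference = ["eng", "deu", "ita", "eng"]
raw_predictions = ["LABEL_1", "LABEL_0", "LABEL_3", "LABEL_2"]

# Hypothetical mapping; valid only if the model was fine-tuned with this order.
pipeline_label_to_code = {
    "LABEL_0": "deu",
    "LABEL_1": "eng",
    "LABEL_2": "fra",
    "LABEL_3": "ita",
    "LABEL_4": "spa",
}


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the reference."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Raw pipeline names never match language codes.
print(accuracy(reference, raw_predictions))  # 0.0

# After mapping, the comparison is between like and like.
mapped = [pipeline_label_to_code[p] for p in raw_predictions]
print(accuracy(reference, mapped))  # 0.75
```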





In [7]:
# Load the model
classifier = pipeline("text-classification", model="xlm-roberta-base")

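# Define the mapping from label to actual language.
# NOTE: this mapping is a placeholder. The untrained head's LABEL_n ids carry no
# language meaning until the model is fine-tuned with this label order.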
label_to_language = {
    'LABEL_0': 'English',
    'LABEL_1': 'Spanish',
    'LABEL_2': 'French',
    'LABEL_3': 'German',
    'LABEL_4': 'Italian'
}

# Example sentences to test the classifier
sentences = [
    "This is a test sentence in English.",
    "Esta es una frase de prueba en español.",
    "C'est une phrase de test en français.",
    "Dies ist ein Testsatz auf Deutsch.",
    "Questa è una frase di prova in italiano."
]

# Predict the language and map the label to actual language name
for sentence in sentences:
    prediction = classifier(sentence)
    label = prediction[0]['label']
    language = label_to_language[label]
    
    print(f"Sentence: {sentence}")
    print(f"Predicted Language: {language}\n")

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sentence: This is a test sentence in English.
Predicted Language: English

Sentence: Esta es una frase de prueba en español.
Predicted Language: English

Sentence: C'est une phrase de test en français.
Predicted Language: English

Sentence: Dies ist ein Testsatz auf Deutsch.
Predicted Language: English

Sentence: Questa è una frase di prova in italiano.
Predicted Language: English



Load and Tokenize Additional Dataset:

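Load the PAWS-X dataset (via the XTREME benchmark) for training and testing the model, then apply tokenization. Note that PAWS-X is a paraphrase-identification dataset: its label column marks whether two sentences are paraphrases, not which language they are written in.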

In [19]:
from transformers import XLMRobertaForSequenceClassification, XLMRobertaTokenizer, Trainer, TrainingArguments
from datasets import load_dataset, concatenate_datasets
from sklearn.model_selection import train_test_split

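# Load PAWS-X for English, Spanish, German, and French.
# PAWS-X has no Italian config, so the Italian split below comes from PAN-X, an NER
# dataset with different columns (tokens/ner_tags rather than sentence1/sentence2/label).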
dataset_en = load_dataset('xtreme', 'PAWS-X.en')
dataset_es = load_dataset('xtreme', 'PAWS-X.es')
dataset_de = load_dataset('xtreme', 'PAWS-X.de')
dataset_fr = load_dataset('xtreme', 'PAWS-X.fr')
dataset_it = load_dataset('xtreme', 'PAN-X.it')

# Concatenate the datasets
train_dataset = concatenate_datasets([dataset_en['train'], dataset_es['train'], dataset_de['train'], dataset_fr['train'], dataset_it['train']])
test_dataset = concatenate_datasets([dataset_en['test'], dataset_es['test'], dataset_de['test'], dataset_fr['test'], dataset_it['test']])
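
For language classification, each example needs a label naming its language, which the PAWS-X label column does not provide (it marks paraphrase pairs). One possible approach is to attach a language id to every example before concatenating the per-language splits. The sketch below uses plain lists of dictionaries as stand-ins for the loaded splits. `LANG2ID` and `add_language_id` are hypothetical names, and the rows are made-up examples rather than PAWS-X data.

```python
# Toy stand-ins for per-language splits (hypothetical rows, not PAWS-X data).
splits = {
    "en": [{"sentence1": "Hello there.", "sentence2": "Hi there."}],
    "es": [{"sentence1": "Hola.", "sentence2": "Buenos días."}],
    "de": [{"sentence1": "Guten Tag.", "sentence2": "Hallo."}],
}

# Hypothetical language-to-id mapping.
LANG2ID = {"en": 0, "es": 1, "de": 2}


def add_language_id(rows, lang):
    """Return copies of the rows with a 'lang_id' field naming their language."""
    return [{**row, "lang_id": LANG2ID[lang]} for row in rows]


# Tag each split with its language, then concatenate everything into one list.
combined = []
for lang, rows in splits.items():
    combined.extend(add_language_id(rows, lang))

print([row["lang_id"] for row in combined])  # [0, 1, 2]
```

With the datasets library, the same idea would be applied to each split before the concatenation step, and the new column would then serve as the training label.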

Tokenize dataset

In [20]:
# Load the tokenizer
tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')

# Tokenize the dataset; adjust 'sentence1' and 'sentence2' if needed based on dataset structure
def tokenize(batch):
    # Concatenate sentence1 and sentence2, replacing None values with an empty string
    combined_sentences = [(s1 if s1 is not None else "") + " " + (s2 if s2 is not None else "") for s1, s2 in zip(batch['sentence1'], batch['sentence2'])]
    return tokenizer(combined_sentences, padding="max_length", truncation=True, max_length=128)
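
The arguments `padding="max_length"`, `truncation=True`, and `max_length=128` make every encoded example exactly 128 token ids long. The toy function below imitates that behavior on a plain list of ids so the effect on `input_ids` and `attention_mask` is visible. It is a simplified illustration, not the tokenizer's implementation (for instance, it ignores special tokens), and the name `pad_or_truncate` is made up. The pad id of 1 matches XLM-RoBERTa's `<pad>` token; the other ids are arbitrary.

```python
PAD_ID = 1  # XLM-RoBERTa's <pad> token id


def pad_or_truncate(token_ids, max_length):
    """Force a list of token ids to exactly max_length entries.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding.
    """
    kept = token_ids[:max_length]       # truncation: drop ids past max_length
    n_pad = max_length - len(kept)      # padding: how many pad ids to append
    input_ids = kept + [PAD_ID] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Short input: padded up to max_length.
ids, mask = pad_or_truncate([0, 581, 2], max_length=5)
print(ids)   # [0, 581, 2, 1, 1]
print(mask)  # [1, 1, 1, 0, 0]

# Long input: truncated down to max_length.
ids, mask = pad_or_truncate(list(range(10, 20)), max_length=4)
print(ids)   # [10, 11, 12, 13]
print(mask)  # [1, 1, 1, 1]
```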

Apply tokenization to train and test datasets

In [21]:
train_data = train_dataset.map(tokenize, batched=True)
test_data = test_dataset.map(tokenize, batched=True)

Convert Labels to Integer Format

Filter out samples with None labels and convert labels to integers.

In [23]:
# Filter Out None Labels by removing samples with None as the label
train_data = train_data.filter(lambda x: x['label'] is not None)
test_data = test_data.filter(lambda x: x['label'] is not None)

# Convert remaining labels to integers
def convert_labels(batch):
    batch['label'] = int(batch['label'])
    return batch

train_data = train_data.map(convert_labels)
test_data = test_data.map(convert_labels)


Filter: 100%|██████████| 217355/217355 [00:23<00:00, 9383.42 examples/s] 
Filter: 100%|██████████| 18000/18000 [00:02<00:00, 8783.31 examples/s]
Map: 100%|██████████| 197355/197355 [00:25<00:00, 7652.31 examples/s]
Map: 100%|██████████| 8000/8000 [00:01<00:00, 7494.42 examples/s]


Set format for PyTorch:

Prepare the dataset for use with PyTorch.

In [24]:
train_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

Now train_data and test_data are ready to be used with Trainer.

First, set up the model and specify training arguments.

In [25]:
from transformers import XLMRobertaForSequenceClassification

model = XLMRobertaForSequenceClassification.from_pretrained('xlm-roberta-base', num_labels=5)  # 5 for the five languages you're using


Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Define Training Arguments

In [26]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)
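
These settings determine how many optimizer steps training will take. As a rough check, the arithmetic below uses the 197,355 training examples reported by the Map progress bar earlier in the notebook, and assumes a single device with no gradient accumulation.

```python
import math

num_examples = 197_355  # training rows left after filtering (from the Map log)
batch_size = 8          # per_device_train_batch_size
num_epochs = 3          # num_train_epochs

# Assumes one device and gradient_accumulation_steps=1 (the default).
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch)  # 24670
print(total_steps)      # 74010
```

With this many steps, a smaller subset of the training data may be worth considering for a first run.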


Create the Trainer:

Initialize the Trainer object.

In [27]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=test_data,
)
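
As configured, the Trainer reports only the evaluation loss. Trainer also accepts a `compute_metrics` callable that receives the model's logits together with the reference labels. The sketch below shows one possible accuracy metric. It is written in plain Python so it runs on its own, and the toy logits are hand-written rather than model output.

```python
def compute_metrics(eval_pred):
    """Return accuracy for a (logits, labels) pair.

    eval_pred unpacks into per-example logits and integer labels.
    """
    logits, labels = eval_pred
    # Argmax over the class dimension for each example.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Toy check with hand-written logits (not model output).
toy_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9]]
toy_labels = [1, 0, 1]
print(compute_metrics((toy_logits, toy_labels)))  # {'accuracy': 0.6666666666666666}

# Possible usage: Trainer(..., compute_metrics=compute_metrics)
```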

Train and Evaluate Model: 

Finally, train the model and evaluate its performance.

In [None]:
trainer.train()

Evaluate the Model

In [None]:
trainer.evaluate()

Test with New Sentences:
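
The notebook leaves this step open. One way to fill it would be to tokenize a sentence, run the fine-tuned model, and turn the output logits into a language name. The sketch below covers only that final post-processing step, with hand-written logits standing in for model output. The `id2label` mapping mirrors the label order used earlier in the notebook and is an assumption: it must match the label order actually used during training.

```python
import math

# Assumed mapping; it must match the label order used during training.
id2label = {0: "English", 1: "Spanish", 2: "French", 3: "German", 4: "Italian"}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


def predict_language(logits):
    """Return the most probable language name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return id2label[best], probs[best]


# Hand-written logits standing in for one sentence's model output.
language, confidence = predict_language([0.2, 3.1, 0.5, -1.0, 0.4])
print(language)  # Spanish
```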

Save the Model:

In [None]:
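# save_pretrained requires a target directory; this path is only an example
model.save_pretrained("./xlm-roberta-language-id")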