<a href="https://colab.research.google.com/github/HaywhyCoder/english-to-spanish-translation/blob/main/English_Spanish_Translator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### English to Spanish Translator

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("bryanpark/parallelsents")

print("Path to dataset files:", path)

#### Import Libraries

In [None]:
import pandas as pd
import warnings
import logging
from datasets import Dataset, DatasetDict
from transformers import MarianTokenizer, MarianMTModel, DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import torch
from sacrebleu import corpus_bleu

# Suppress specific warning related to 'Trainer.tokenizer'
logging.getLogger("transformers").setLevel(logging.ERROR)


#### Load Pre-trained Model

In [None]:
model_name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Define the English sentence(s) to be translated
english_sentences = [
    "Hello, how are you?",
    "What is your name?",
    "I love programming.",
    "Transformers are powerful models.",
    "Please translate this sentence."
]

# Translate each sentence
translated_sentences = []
for sentence in english_sentences:
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(inputs["input_ids"])
    translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    translated_sentences.append(translated_text)

# Print the translations
for original, translation in zip(english_sentences, translated_sentences):
    print(f"Original (English): {original}")
    print(f"Translation (Spanish): {translation}")
    print()


Original (English): Hello, how are you?
Translation (Spanish): Hola, ¿cómo estás?

Original (English): What is your name?
Translation (Spanish): ¿Cómo te llamas?

Original (English): I love programming.
Translation (Spanish): Me encanta la programación.

Original (English): Transformers are powerful models.
Translation (Spanish): Los transformadores son modelos poderosos.

Original (English): Please translate this sentence.
Translation (Spanish): Por favor, traduzca esta frase.



#### Load the Dataset

In [None]:
# Build the file path from the kagglehub download location instead of hard-coding it
data = pd.read_csv(f"{path}/1000sents.csv", on_bad_lines='skip')
data.head()

Unnamed: 0,ID,HEADWORD,POS,ENGLISH,JAPANESE,SPANISH,INDONESIAN,EXAMPLE (KO),EXAMPLE (EN),EXAMPLE (JA),EXAMPLE (ES),EXAMPLE (ID)
0,1,가게,NOUN,"shop, store",店,"tienda, almacén","toko, kedai",그 가게는 열 시에 문을 연다.,The store opens at 10 A.M.,その店は十時に開く。,El almacén abre a las 10 de la mañana.,Toko itu buka jam 10 pagi
1,2,가격,NOUN,price,"価格, 値段",precio,harga,맘에 들어? 가격은 어때?,Do you like it? What's the price?,気に入った? 価格はどう?,¿Te gusta? ¿Cuál es el precio?,Berapa harganya?
2,3,가깝다,ADJ,"to be near, to be close",近い,cerca,dekat,우리 집은 학교에서 가깝다.,My house is near my school.,我家は学校から近い。,Mi casa está cerca de mi escuela.,Rumah saya dekat dengan sekolah saya
3,4,가끔,ADV,"sometimes, occasionally",たまに,"a veces, de vez en cuando","kadang-kadang, terkadang",나는 가끔 맥주를 마신다.,I sometimes drink beer.,私はたまにビールを飲む。,A veces bebo cerveza.,Kadang – kadang saya minum bir
4,5,가능하다,ADJ,to be possible,"可能だ, できる","ser posible, poder",memungkinkan,가능하다면 내일 오세요.,"Come tomorrow, if possible.",できれば明日来てください。,"Venga mañana, si es posible.","Jika memungkinkan, datanglah besok"


#### Test the Model Before Fine-tuning

In [None]:
# Extract English and Spanish sentences from the data
source_sentences = data['EXAMPLE (EN)'].tolist()
target_sentences = data['EXAMPLE (ES)'].tolist()

# === TESTING THE MODEL ===
translated_sentences = []

# Loop through each English sentence and translate it to Spanish
for sentence in source_sentences:
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(inputs["input_ids"])
    translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    translated_sentences.append(translated_text)

# Compute BLEU score
bleu_score = corpus_bleu(translated_sentences, [target_sentences]).score

# Print the results
for i in range(len(source_sentences)):
    print(f"Original (English): {source_sentences[i]}")
    print(f"Translation (Spanish): {translated_sentences[i]}")
    print(f"Expected (Spanish): {target_sentences[i]}\n")

print(f"BLEU score: {bleu_score}")

Original (English): The store opens at 10 A.M.
Translation (Spanish): La tienda abre a las 10 A.M.
Expected (Spanish): El almacén abre a las 10 de la mañana.

Original (English): Do you like it? What's the price?
Translation (Spanish): ¿Te gusta? ¿Cuál es el precio?
Expected (Spanish): ¿Te gusta? ¿Cuál es el precio?

Original (English): My house is near my school.
Translation (Spanish): Mi casa está cerca de mi escuela.
Expected (Spanish): Mi casa está cerca de mi escuela.

Original (English): I sometimes drink beer.
Translation (Spanish): A veces bebo cerveza.
Expected (Spanish): A veces bebo cerveza.

Original (English): Come tomorrow, if possible.
Translation (Spanish): Ven mañana, si es posible.
Expected (Spanish): Venga mañana, si es posible.

Original (English): My father went to Seoul early in the morning.
Translation (Spanish): Mi padre fue a Seúl temprano en la mañana.
Expected (Spanish): Mi padre fue a Seul temprano por la mañana.

Original (English): I taught her how to driv
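
The BLEU score computed above rewards n-gram overlap between each translation and its reference. As a rough illustration only, the sketch below computes clipped n-gram precision for one sentence pair from the output. It assumes plain whitespace tokenization and omits the brevity penalty and the geometric mean over n-gram orders that `corpus_bleu` applies, so its numbers will not match sacreBLEU's. The helper names `ngrams` and `clipped_precision` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears in the
    reference (the "clipping" used by BLEU). Whitespace tokenization is a
    simplifying assumption; sacreBLEU uses its own tokenizer.
    """
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


hyp = "La tienda abre a las 10 A.M."
ref = "El almacén abre a las 10 de la mañana."
p1 = clipped_precision(hyp, ref, 1)  # unigram precision
p2 = clipped_precision(hyp, ref, 2)  # bigram precision
print(p1, p2)
```

A translation can be perfectly acceptable and still lose overlap, as "La tienda" does against "El almacén" here, which is one reason BLEU is best read as a relative measure on a fixed test set.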

#### Fine-tune the Model on the Data

In [None]:
# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

MarianMTModel(
  (model): MarianModel(
    (shared): Embedding(65001, 512, padding_idx=65000)
    (encoder): MarianEncoder(
      (embed_tokens): Embedding(65001, 512, padding_idx=65000)
      (embed_positions): MarianSinusoidalPositionalEmbedding(512, 512)
      (layers): ModuleList(
        (0-5): 6 x MarianEncoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation_fn): SiLU()
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05

In [None]:
# Split into 80% training and 20% evaluation
train_size = int(0.8 * len(data))
train_data = data[:train_size]
eval_data = data[train_size:]

# Create Dataset objects for Hugging Face
train_dict = {"English": train_data["EXAMPLE (EN)"].tolist(), "Spanish": train_data['EXAMPLE (ES)'].tolist()}
eval_dict = {"English": eval_data["EXAMPLE (EN)"].tolist(), "Spanish": eval_data['EXAMPLE (ES)'].tolist()}
datasets = DatasetDict({
    "train": Dataset.from_dict(train_dict),
    "eval": Dataset.from_dict(eval_dict)
})
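
The split above takes the first 80% of rows in file order. The `data.head()` preview suggests the CSV is sorted by Korean headword, so the evaluation rows may all come from the tail of that ordering. A shuffled split is one possible alternative; the sketch below shows the idea in plain Python with placeholder sentence pairs. The function name `shuffled_split` and the seed value are assumptions, not part of the original notebook.

```python
import random


def shuffled_split(rows, train_fraction=0.8, seed=42):
    """Shuffle row indices with a fixed seed, then cut at train_fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    cut = int(train_fraction * len(rows))
    train = [rows[i] for i in indices[:cut]]
    evaluation = [rows[i] for i in indices[cut:]]
    return train, evaluation


# Placeholder pairs standing in for the (English, Spanish) example columns
pairs = [(f"en {i}", f"es {i}") for i in range(998)]
train_pairs, eval_pairs = shuffled_split(pairs)
print(len(train_pairs), len(eval_pairs))
```

With a DataFrame, shuffling the rows first (for example with `data.sample(frac=1, random_state=42)`) and then slicing as before would have a similar effect.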

In [None]:
# Tokenize data
def tokenize_function(examples):
    # Truncate only; the data collator pads each batch and masks label padding with -100
    return tokenizer(examples["English"], text_target=examples["Spanish"], truncation=True)

# Map tokenization over dataset
tokenized_datasets = datasets.map(tokenize_function, batched=True)

Map:   0%|          | 0/798 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [None]:
# Data collator for dynamic padding
collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Set training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    num_train_epochs=2,
    save_total_limit=1,
    eval_strategy='epoch',
    predict_with_generate=True,
    logging_dir="./logs",
    logging_steps=10,
    save_strategy="no",
    report_to="none",
)
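
Dynamic padding means each batch is padded only to the length of its own longest sequence. `DataCollatorForSeq2Seq` pads labels with -100 by default, the index that PyTorch's cross-entropy loss ignores. The sketch below imitates that behaviour in plain Python; the token ids are made up for illustration and `pad_batch` is a hypothetical helper, not the collator's actual implementation.

```python
def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


PAD_TOKEN_ID = 65000  # matches padding_idx in the model printout above
IGNORE_INDEX = -100   # label positions with this value are skipped by the loss

# Illustrative token ids only; real ids come from the tokenizer
input_ids = [[12, 7, 0], [5, 0]]
labels = [[33, 0], [41, 18, 9, 0]]

padded_inputs = pad_batch(input_ids, PAD_TOKEN_ID)
padded_labels = pad_batch(labels, IGNORE_INDEX)
print(padded_inputs)
print(padded_labels)
```

Padding labels with the ordinary pad token instead of -100 would make the loss count padding positions as targets, which is why the two pad values differ.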

In [None]:
# Define compute_metrics function for evaluation
def compute_metrics(eval_preds):
    predictions, labels = eval_preds

    # Convert tensors to numpy arrays if necessary
    if isinstance(predictions, torch.Tensor):
        predictions = predictions.cpu().numpy()
    if isinstance(labels, torch.Tensor):
        labels = labels.cpu().numpy()

    # Decode predictions
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Decode labels, handling -100 masking for tokenizers
    labels = [[label for label in batch if label != -100] for batch in labels]
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Compute metrics
    precision, recall, f1, _ = precision_recall_fscore_support(
        decoded_labels, decoded_preds, average="weighted", zero_division=1
    )
    acc = accuracy_score(decoded_labels, decoded_preds)

    # Calculate BLEU score
    bleu = corpus_bleu(decoded_preds, [decoded_labels]).score

    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall, "bleu": bleu}
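
Because `accuracy_score` receives decoded strings here, every distinct sentence acts as its own class, so the reported accuracy amounts to the share of translations that match the reference exactly. The sketch below spells that out in plain Python using sentence pairs from the earlier output; `exact_match_accuracy` is a hypothetical helper name.

```python
def exact_match_accuracy(predictions, references):
    """Share of predictions that are character-for-character equal to the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(pred == ref for pred, ref in zip(predictions, references))
    return matches / len(references)


preds = [
    "A veces bebo cerveza.",
    "Ven mañana, si es posible.",
    "La tienda abre a las 10 A.M.",
]
refs = [
    "A veces bebo cerveza.",
    "Venga mañana, si es posible.",
    "El almacén abre a las 10 de la mañana.",
]
score = exact_match_accuracy(preds, refs)
print(score)
```

Exact match is much stricter than BLEU, so a low accuracy alongside a moderate BLEU score is expected. With `zero_division=1`, reference sentences that are never predicted probably count as precision 1, which may explain a reported precision of 1.0 and should not be read as a sign of perfect output.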


In [None]:
# Initialize Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["eval"],
    data_collator=collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

{'loss': 1.3713, 'grad_norm': 4.593786716461182, 'learning_rate': 4.775e-05, 'epoch': 0.1}
{'loss': 0.7524, 'grad_norm': 3.974125385284424, 'learning_rate': 4.525e-05, 'epoch': 0.2}
{'loss': 0.6495, 'grad_norm': 4.395814418792725, 'learning_rate': 4.275e-05, 'epoch': 0.3}
{'loss': 0.8092, 'grad_norm': 4.835759162902832, 'learning_rate': 4.025e-05, 'epoch': 0.4}
{'loss': 0.6389, 'grad_norm': 3.843903064727783, 'learning_rate': 3.775e-05, 'epoch': 0.5}
{'loss': 0.617, 'grad_norm': 2.682528495788574, 'learning_rate': 3.525e-05, 'epoch': 0.6}
{'loss': 0.6101, 'grad_norm': 4.20250940322876, 'learning_rate': 3.275e-05, 'epoch': 0.7}
{'loss': 0.7481, 'grad_norm': 4.187467575073242, 'learning_rate': 3.025e-05, 'epoch': 0.8}
{'loss': 0.7511, 'grad_norm': 5.35354471206665, 'learning_rate': 2.7750000000000004e-05, 'epoch': 0.9}
{'loss': 0.6944, 'grad_norm': 4.577513217926025, 'learning_rate': 2.525e-05, 'epoch': 1.0}
{'eval_loss': 0.4903583228588104, 'eval_accuracy': 0.16, 'eval_f1': 0.16, 'eval_

TrainOutput(global_step=200, training_loss=0.6148200166225434, metrics={'train_runtime': 27.1041, 'train_samples_per_second': 58.884, 'train_steps_per_second': 7.379, 'train_loss': 0.6148200166225434, 'epoch': 2.0})
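
The step count and learning rates in the log can be reproduced with a little arithmetic. The sketch below assumes the `Seq2SeqTrainingArguments` defaults that the notebook does not override (a per-device batch size of 8, a base learning rate of 5e-5, a linear decay schedule with no warmup) and a single device. The function names are hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Optimizer steps when every batch triggers one update."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_decay_lr(completed_steps, total_steps, base_lr=5e-5):
    """Learning rate after a number of scheduler updates under linear decay to zero."""
    remaining = max(0, total_steps - completed_steps)
    return base_lr * remaining / total_steps


# 798 training examples, assumed default batch size of 8, 2 epochs
steps = total_training_steps(num_examples=798, batch_size=8, epochs=2)

# The first log line (training step 10) is consistent with 9 scheduler updates
lr_at_first_log = linear_decay_lr(completed_steps=9, total_steps=steps)
print(steps, lr_at_first_log)
```

Under these assumptions the step count agrees with `global_step=200`, and the first logged learning rate of 4.775e-05 is consistent with the value being read just before the scheduler's tenth update.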

In [None]:
# Evaluate the model and print the results
eval_results = trainer.evaluate()
print(f"Evaluation Results: {eval_results}")

Evaluation Results: {'eval_loss': 0.4921717047691345, 'eval_accuracy': 0.17, 'eval_f1': 0.17, 'eval_precision': 1.0, 'eval_recall': 0.17, 'eval_bleu': 47.205640759511105, 'eval_runtime': 5.6816, 'eval_samples_per_second': 35.201, 'eval_steps_per_second': 4.4, 'epoch': 2.0}


#### Test Model

In [None]:
translated_sentences = []

# Loop through each English sentence and translate it to Spanish
for sentence in source_sentences[:10]:
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True).to(device)
    outputs = model.generate(inputs["input_ids"])
    translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    translated_sentences.append(translated_text)

for i in range(len(source_sentences[:10])):
    print(f"Original (English): {source_sentences[i]}")
    print(f"Translation (Spanish): {translated_sentences[i]}")
    print(f"Expected (Spanish): {target_sentences[i]}\n")


Original (English): The store opens at 10 A.M.
Translation (Spanish): El almacén abre a las 10 de la mañana.
Expected (Spanish): El almacén abre a las 10 de la mañana.

Original (English): Do you like it? What's the price?
Translation (Spanish): ¿Te gusta? ¿Cuál es el precio?
Expected (Spanish): ¿Te gusta? ¿Cuál es el precio?

Original (English): My house is near my school.
Translation (Spanish): Mi casa está cerca de mi escuela.
Expected (Spanish): Mi casa está cerca de mi escuela.

Original (English): I sometimes drink beer.
Translation (Spanish): A veces bebo cerveza.
Expected (Spanish): A veces bebo cerveza.

Original (English): Come tomorrow, if possible.
Translation (Spanish): Venga mañana, si es posible.
Expected (Spanish): Venga mañana, si es posible.

Original (English): My father went to Seoul early in the morning.
Translation (Spanish): Mi padre fue a Seul temprano por la mañana.
Expected (Spanish): Mi padre fue a Seul temprano por la mañana.

Original (English): I taught he

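Fine-tuning raised the BLEU score from 36.15 to 47.21, which suggests the model adapted to the dataset's vocabulary and phrasing. For example, it now produces "El almacén" and "Venga" where the pre-trained model produced "La tienda" and "Ven". Note that the earlier cell scores the pre-trained model on every sentence in the dataset, whereas the fine-tuned score comes from the 200-sentence evaluation split, so the two figures are not strictly like-for-like.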