<a href="https://colab.research.google.com/github/Merha23/AI_Tigrinya_Translation/blob/main/Final_Project_Source_Code_Program.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Integrating GitHub with Google Colab**

In [None]:
# Clone the GitHub repository into Colab using a Personal Access Token (PAT).
# Generate a PAT in your GitHub account settings and enter it at the prompt below
# instead of a username/password.
# Never paste the token itself into the notebook: it would be saved and published
# together with the code.

import os
from getpass import getpass

token = getpass('Enter your GitHub Personal Access Token: ')
os.environ["GITHUB_TOKEN"] = token

repo_url = f"https://{token}@github.com/Merha23/AI_Tigrinya_Translation.git"
!git clone {repo_url}
%cd AI_Tigrinya_Translation


In [None]:
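# Caution: removing .git deletes the clone's history and its `origin` remote,
# so later commands that reference origin/main need the remote to be re-added.
!rm -rf .git
!git init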


In [None]:
cd /content/AI_Tigrinya_Translation


In [None]:
# List the repository contents to confirm that the clone succeeded.

!ls -la


In [None]:
# Mount Google Drive in Colab
# Run this command in Colab to access your Drive:

from google.colab import drive
drive.mount('/content/drive')


In [None]:
# Pull the latest changes from GitHub.
# Sync the local copy with the remote before pushing: this fetches the remote
# `main` branch and merges it into the current branch.

!git pull https://{os.environ['GITHUB_TOKEN']}@github.com/Merha23/AI_Tigrinya_Translation.git main

In [None]:
# Push Your Changes to GitHub
# After successfully pulling the latest changes and resolving any conflicts (if needed), you can now push your local changes:

!git push https://{os.environ['GITHUB_TOKEN']}@github.com/Merha23/AI_Tigrinya_Translation.git main

In [None]:
# To reset to the remote state (undo local changes):

!git reset --hard origin/main


In [None]:
# Push Updates from Colab to GitHub
# After editing files, commit and push:

!git config --global user.email "merhagebrelibanos29@gmail.com"
!git config --global user.name "Merha Gebrelibanos"
!git add .
!git commit -m "Updated final project source code program"
!git push https://{os.environ['GITHUB_TOKEN']}@github.com/Merha23/AI_Tigrinya_Translation.git main

1)    **Load the CSV File into a Pandas DataFrame**

In [None]:
import pandas as pd

df = pd.read_csv("Medical Translation.csv", encoding='utf-8')
df.head()  # Display first few rows

In [None]:
!pip install datasets

2)     **Data Cleaning & Handling Missing Values**

Before tokenization, we remove unusable rows: missing values, duplicates, and improperly formatted sentences.
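The cleaning steps can be sketched on plain Python data. This is a minimal, dependency-free illustration rather than the notebook's pandas pipeline; the helper name `clean_pairs` and the sample rows are hypothetical. It also shows why missing values should be handled before converting to strings: `str(float("nan"))` is the literal text `"nan"`, which would otherwise survive as a fake sentence.

```python
import math

def clean_pairs(pairs):
    """Drop missing, empty, and duplicate (english, tigrinya) pairs."""
    seen = set()
    cleaned = []
    for eng, tir in pairs:
        # Treat None and float NaN as missing before any str() conversion.
        if any(v is None or (isinstance(v, float) and math.isnan(v)) for v in (eng, tir)):
            continue
        eng, tir = str(eng).strip(), str(tir).strip()
        if not eng or not tir:
            continue  # skip rows that are empty after stripping
        if (eng, tir) in seen:
            continue  # skip exact duplicates
        seen.add((eng, tir))
        cleaned.append((eng, tir))
    return cleaned

# Hypothetical sample rows, for illustration only.
rows = [
    ("Take one tablet daily. ", "sentence A"),
    ("Take one tablet daily.", "sentence A"),  # duplicate after stripping
    (float("nan"), "sentence B"),              # missing English side
    ("Drink water.", "   "),                   # empty Tigrinya side
]
result = clean_pairs(rows)
print(result)
```

Only the first row survives: the second is a duplicate once whitespace is stripped, and the other two have a missing or empty side.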

In [None]:
# Fill missing values first, then convert both columns to strings.
# (astype(str) on its own would turn NaN into the literal string "nan".)
df['english'] = df['english'].fillna("").astype(str)
df['tigrinya'] = df['tigrinya'].fillna("").astype(str)

In [None]:
# Fill missing values (NaN) in the columns with an empty string
df['english'] = df['english'].fillna("")
df['tigrinya'] = df['tigrinya'].fillna("")

In [None]:
# Check for non-string values in 'english' and 'tigrinya'
print(df['english'].apply(type).value_counts())
print(df['tigrinya'].apply(type).value_counts())

In [None]:
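# Strip unwanted whitespace
df['english'] = df['english'].str.strip()
df['tigrinya'] = df['tigrinya'].str.strip()

# Remove rows where either side is missing or empty.
# dropna() alone misses the empty strings produced by fillna("") above.
df = df.dropna(subset=['english', 'tigrinya'])
df = df[(df['english'] != "") & (df['tigrinya'] != "")]

# Remove duplicates
df = df.drop_duplicates()

print(f"Dataset size after cleaning: {df.shape}")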

3) **Sentence Tokenization**:

Since we are training a sequence-to-sequence model, we split the data into training, validation, and test sets and then tokenize each sentence pair.
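The next cell builds an 80/10/10 split from two successive splits: 20% is held out first, and that held-out part is then halved. The sketch below reproduces the arithmetic with the standard library only; `three_way_split` is a hypothetical helper, and its shuffle will not select the same rows as scikit-learn's `train_test_split`.

```python
import random

def three_way_split(items, holdout=0.2, seed=42):
    """Shuffle items, hold out `holdout` of them, then halve the held-out part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout)
    train = shuffled[: len(shuffled) - n_holdout]
    temp = shuffled[len(shuffled) - n_holdout:]
    n_val = len(temp) // 2
    return train, temp[:n_val], temp[n_val:]

train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

Fixing the seed keeps the three subsets disjoint and reproducible across runs, which matters when the test set is reused for later evaluation.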

In [None]:
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split

# Split dataset into train (80%), validation (10%), and test (10%)
train_texts, temp_texts, train_labels, temp_labels = train_test_split(
    df['english'].tolist(), df['tigrinya'].tolist(), test_size=0.2, random_state=42
)

val_texts, test_texts, val_labels, test_labels = train_test_split(
    temp_texts, temp_labels, test_size=0.5, random_state=42
)

# Convert to Hugging Face Dataset format
train_data = Dataset.from_dict({"english": train_texts, "tigrinya": train_labels})
val_data = Dataset.from_dict({"english": val_texts, "tigrinya": val_labels})
test_data = Dataset.from_dict({"english": test_texts, "tigrinya": test_labels})

# Prepare the DatasetDict for tokenization
datasets = DatasetDict({
    "train": train_data,
    "validation": val_data,
    "test": test_data
})


In [None]:
from transformers import AutoTokenizer

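# Load the tokenizer with explicit NLLB language codes (English -> Tigrinya)
model_name = "facebook/nllb-200-distilled-600M"  # Replace with your model name if different
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn", tgt_lang="tir_Ethi")

# Define a tokenization function.
# `text_target` tokenizes the Tigrinya side as labels; passing it as the second
# positional argument would instead encode it as the second half of a text pair.
def tokenize_function(examples):
    return tokenizer(examples['english'], text_target=examples['tigrinya'],
                     max_length=128, padding="max_length", truncation=True)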

# Apply the tokenization function to the datasets
tokenized_datasets = datasets.map(tokenize_function, batched=True)


In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./nllb_trained',  # Output directory for saving model checkpoints
    eval_strategy="epoch",  # Evaluate after each epoch (use eval_strategy instead of evaluation_strategy)
    learning_rate=2e-5,  # Learning rate
    per_device_train_batch_size=8,  # Batch size for training
    per_device_eval_batch_size=8,  # Batch size for evaluation
    num_train_epochs=3,  # Number of epochs
    weight_decay=0.01,  # Weight decay for regularization
    save_steps=500,  # Save model every 500 steps
    logging_dir='./logs',  # Directory for logs
    logging_steps=100,  # Log every 100 steps
    report_to="none"  # Disable logging to WandB
)


In [None]:
!pip show wandb


In [None]:
!echo $WANDB_API_KEY


In [None]:
!pip uninstall -y wandb


In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments

# Load pre-trained model and tokenizer
model_name = "facebook/nllb-200-distilled-600M"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize the data. NLLB needs explicit language codes, and `text_target`
# replaces the deprecated `as_target_tokenizer()` context manager.
tokenizer.src_lang, tokenizer.tgt_lang = "eng_Latn", "tir_Ethi"  # English -> Tigrinya

def tokenize_function(examples):
    # The returned encoding holds input_ids, attention_mask, and labels.
    return tokenizer(
        examples["english"],
        text_target=examples["tigrinya"],
        max_length=128,
        truncation=True,
        padding="max_length",
    )

# Tokenizing the datasets
tokenized_datasets = datasets.map(tokenize_function, batched=True)

# Prepare training arguments
training_args = TrainingArguments(
    output_dir='./nllb_trained',
    eval_strategy="epoch",  # `evaluation_strategy` is the deprecated name
    logging_dir='./logs',
    report_to="none"  # Disable W&B logging (optional)
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
)

# Start training
trainer.train()


In [None]:
import os
from transformers import TrainingArguments, Trainer, AutoModelForSeq2SeqLM, AutoTokenizer
from datasets import load_dataset

# Disable WandB logging globally by setting this environment variable
os.environ["WANDB_DISABLED"] = "true"

# Load the pre-trained model
model_name = "facebook/nllb-200-distilled-600M"  # Replace with your model name if different
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load your dataset (replace this with your actual dataset)
# tokenized_datasets = load_dataset('your_dataset')

# Prepare your training arguments
training_args = TrainingArguments(
    output_dir='./nllb_trained',  # Output directory for saving model checkpoints
    eval_strategy="epoch",  # Use eval_strategy instead of the deprecated evaluation_strategy
    learning_rate=2e-5,  # Learning rate
    per_device_train_batch_size=8,  # Train batch size
    per_device_eval_batch_size=8,   # Evaluation batch size
    num_train_epochs=3,  # Number of training epochs
    save_steps=10_000,  # Save model every 10k steps
    save_total_limit=2,  # Limit the total saved models
    logging_dir='./logs',  # Directory for logs
    logging_strategy="steps",  # Log every set number of steps
    logging_steps=500,  # Log every 500 steps
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],  # Ensure the dataset is loaded correctly
    eval_dataset=tokenized_datasets["validation"],  # Use validation data
    tokenizer=tokenizer,  # Tokenizer for data processing
)

# Start the training process
trainer.train()


In [None]:
from transformers import TrainingArguments, Trainer, AutoModelForSeq2SeqLM, AutoTokenizer
from datasets import load_dataset

# Load the pre-trained model
model_name = "facebook/nllb-200-distilled-600M"  # Replace with your model name if different
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load dataset (replace this with your dataset)
# tokenized_datasets = load_dataset('your_dataset')

# Prepare your training arguments
training_args = TrainingArguments(
    output_dir='./nllb_trained',  # Output directory for saving model checkpoints
    eval_strategy="epoch",  # Use eval_strategy instead of the deprecated evaluation_strategy
    learning_rate=2e-5,  # Learning rate
    per_device_train_batch_size=8,  # Train batch size
    per_device_eval_batch_size=8,   # Evaluation batch size
    num_train_epochs=3,  # Number of training epochs
    save_steps=10_000,  # Save model every 10k steps
    save_total_limit=2,  # Limit the total saved models
    report_to="none",  # Explicitly disable WandB and other logging integrations
    logging_dir='./logs',  # Directory for logs
    logging_strategy="steps",  # Log every set number of steps
    logging_steps=500,  # Log every 500 steps
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],  # Ensure the dataset is loaded correctly
    eval_dataset=tokenized_datasets["validation"],  # Use validation data
    tokenizer=tokenizer,  # Tokenizer for data processing
)

# Start the training process
trainer.train()


In [None]:
from transformers import Trainer, TrainingArguments

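# Initialize the Trainer with the tokenized splits; the raw `train_data` and
# `val_data` contain only text columns, which the model cannot consume.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    processing_class=tokenizer,  # Use processing_class instead of tokenizer
)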


In [None]:
from transformers import TrainingArguments, Trainer, AutoModelForSeq2SeqLM, AutoTokenizer

# Define Training Arguments (without 'predict_with_generate')
training_args = TrainingArguments(
    output_dir='./nllb_trained',  # Output directory for saving model checkpoints
    eval_strategy="epoch",  # Evaluate after each epoch (use eval_strategy instead of evaluation_strategy)
    learning_rate=2e-5,  # Learning rate
    per_device_train_batch_size=8,  # Batch size for training
    per_device_eval_batch_size=8,  # Batch size for evaluation
    num_train_epochs=3,  # Number of epochs
    weight_decay=0.01,  # Weight decay for regularization
    save_steps=500,  # Save model every 500 steps
    logging_dir='./logs',  # Directory for logs
    logging_steps=100  # Log every 100 steps
)

# Load the pre-trained model
model_name = "facebook/nllb-200-distilled-600M"  # Replace with your model name if different
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Initialize the Trainer (no need to pass 'predict_with_generate')
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer  # No longer need to explicitly set `predict_with_generate`
)

# Start the training process
trainer.train()


In [None]:
from transformers import TrainingArguments, Trainer, AutoModelForSeq2SeqLM, AutoTokenizer

# Define Training Arguments without `predict_with_generate`
training_args = TrainingArguments(
    output_dir='./nllb_trained',  # Output directory for saving model checkpoints
    eval_strategy="epoch",  # Evaluate after each epoch
    learning_rate=2e-5,  # Learning rate
    per_device_train_batch_size=8,  # Batch size for training
    per_device_eval_batch_size=8,  # Batch size for evaluation
    num_train_epochs=3,  # Number of epochs
    weight_decay=0.01,  # Weight decay for regularization
    save_steps=500,  # Save model every 500 steps
    logging_dir='./logs',  # Directory for logs
    logging_steps=100  # Log every 100 steps
)

# Load the pre-trained model
model_name = "facebook/nllb-200-distilled-600M"  # Replace with your model name if different
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Initialize the Trainer. `predict_with_generate` is not a Trainer argument;
# it belongs to Seq2SeqTrainingArguments (used together with Seq2SeqTrainer).
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
)

# Start the training process
trainer.train()


In [None]:
from transformers import Seq2SeqTrainingArguments

# `predict_with_generate` is only accepted by Seq2SeqTrainingArguments
# (to be used with Seq2SeqTrainer); plain TrainingArguments rejects it.
training_args = Seq2SeqTrainingArguments(
    output_dir='./nllb_trained',  # Output directory for saving model checkpoints
    eval_strategy="epoch",  # Evaluate after each epoch
    learning_rate=2e-5,  # Learning rate
    per_device_train_batch_size=8,  # Batch size for training
    per_device_eval_batch_size=8,  # Batch size for evaluation
    num_train_epochs=3,  # Number of epochs
    weight_decay=0.01,  # Weight decay for regularization
    save_steps=500,  # Save model every 500 steps
    logging_dir='./logs',  # Directory for logs
    logging_steps=100,  # Log every 100 steps
    predict_with_generate=True  # Use generate method for evaluation
)


In [None]:
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
import pandas as pd

# Load your dataset
df = pd.read_csv("Medical Translation.csv")  # Adjust the path if necessary

# Split dataset into train (80%), validation (10%), and test (10%)
train_texts, temp_texts, train_labels, temp_labels = train_test_split(
    df["english"].tolist(), df["tigrinya"].tolist(), test_size=0.2, random_state=42
)

val_texts, test_texts, val_labels, test_labels = train_test_split(
    temp_texts, temp_labels, test_size=0.5, random_state=42
)

# Convert to Hugging Face Dataset format
train_data = Dataset.from_dict({"english": train_texts, "tigrinya": train_labels})
val_data = Dataset.from_dict({"english": val_texts, "tigrinya": val_labels})
test_data = Dataset.from_dict({"english": test_texts, "tigrinya": test_labels})

# Prepare the DatasetDict for tokenization
datasets = DatasetDict({
    "train": train_data,
    "validation": val_data,
    "test": test_data
})


In [None]:
# Split each text into sentences on its sentence-final punctuation:
# '. ' for English and '። ' (the Ethiopic full stop) for Tigrinya.
df['tokenized_english'] = df['english'].apply(lambda x: str(x).split('. '))
df['tokenized_tigrinya'] = df['tigrinya'].apply(lambda x: str(x).split('። '))
df[['tokenized_english', 'tokenized_tigrinya']].head()

**Check Sentence Length Distribution**

We need to analyze the length of sentences to ensure they fit within the model's input constraints.
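As a small illustration of this check, the sketch below summarizes word counts with the standard library. The helper `length_summary` and the sample sentences are hypothetical. Word counts are only a proxy: the model's limit applies to subword tokens, and a sentence usually produces more tokens than it has words.

```python
import statistics

def length_summary(sentences):
    """Return min, median, and max word counts for a list of sentences."""
    lengths = [len(s.split()) for s in sentences]
    return {
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }

# Hypothetical sample sentences, for illustration only.
sample = ["Take one tablet daily.", "Rest.", "Drink plenty of water after each meal."]
summary = length_summary(sample)
print(summary)  # {'min': 1, 'median': 4, 'max': 7}
```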

In [None]:
# Check sentence length distribution
df['eng_length'] = df['english'].apply(lambda x: len(str(x).split()))
df['tir_length'] = df['tigrinya'].apply(lambda x: len(str(x).split()))

df[['eng_length', 'tir_length']].describe()

**Filtering Extremely Long or Short Sentences**

Very short or long sentences might reduce model performance. We filter out sentences that are too short (<3 words) or too long (>128 words).

In [None]:
# Filter out sentences that are too short or too long
df = df[(df['eng_length'] >= 3) & (df['eng_length'] <= 128)]
df = df[(df['tir_length'] >= 3) &  (df['tir_length'] <= 128)]

print(f"Dataset size after filtering: {df.shape}")

In [None]:
print(df.columns)  # Verify column names

**Save the Processed Dataset**

After preprocessing, we save the cleaned and tokenized dataset for use in model training.
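One thing to keep in mind when saving list-valued columns such as `tokenized_english`: CSV has no list type, so each list is written as its string representation and comes back as a plain string. The sketch below demonstrates this with the standard `csv` module (pandas' `to_csv` is assumed to behave the same way for object columns) and restores the list with `ast.literal_eval`, which only parses Python literals.

```python
import ast
import csv
import io

# Write one row whose second column holds a Python list.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["english", "tokenized_english"])
writer.writerow(["Rest. Drink water.", ["Rest", "Drink water."]])

# Read it back: the list is now a plain string.
buffer.seek(0)
row = list(csv.DictReader(buffer))[0]
raw = row["tokenized_english"]

# ast.literal_eval parses Python literals only, so it cannot run arbitrary code.
restored = ast.literal_eval(raw)
print(type(raw).__name__, restored)
```

If the lists are needed again after reloading, a format that preserves them (for example JSON Lines) avoids the parsing step altogether.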

In [None]:
df.to_csv("cleaned_Medical Translation.csv", index=False, encoding='utf-8')
print("Dataset saved successfully!")

In [None]:
import os
print(os.getcwd())  # Shows current directory
print(os.listdir())  # Lists all files in the directory


In [None]:
import os

file_path = "/content/AI_Tigrinya_Translation/cleaned_Medical Translation.csv"

if os.path.exists(file_path):
    print("File exists:", file_path)
else:
    print("File does NOT exist!")


In [None]:
import os

# Get the current working directory
current_dir = os.getcwd()

# List all files in the directory
files = os.listdir(current_dir)

# Print full paths
for file in files:
    full_path = os.path.join(current_dir, file)
    print(full_path)

In [None]:
df['tokenized_tigrinya'] = df['tokenized_tigrinya'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)

In [None]:
df.to_csv("cleaned_dataset.csv", index=False, encoding='utf-8-sig')

In [None]:
df[['tokenized_english', 'tokenized_tigrinya']].head(10)

In [None]:
# To download the cleaned dataset

from google.colab import files
files.download("cleaned_Medical Translation.csv")

**Fine-Tuning the NLLB-200 Model**

Now that we have a cleaned and tokenized dataset, we can fine-tune the NLLB-200 model to improve translation performance for English ⇆ Tigrinya in medical and legal contexts.

**Install and Load Hugging Face Transformers**

Before using the model, install and import the necessary libraries:

In [None]:
!pip install transformers datasets torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Trainer, TrainingArguments

**Load Pretrained NLLB-200 Model**

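We will use Facebook's NLLB-200 multilingual translation model, specifically the distilled 600M-parameter checkpoint (`facebook/nllb-200-distilled-600M`), the smallest of the released NLLB-200 checkpoints and therefore the easiest to fine-tune on a Colab GPU.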

In [None]:
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [None]:
import pandas as pd

df = pd.read_csv("cleaned_Medical Translation.csv")
print(df.head())  # Show first few rows
print(df.columns)  # Display column names


In [None]:
df.columns = df.columns.str.strip()  # Remove leading/trailing spaces
df.rename(columns={"english": "source", "tigrinya": "target"}, inplace=True)


In [None]:
train_texts, temp_texts, train_labels, temp_labels = train_test_split(
    df["source"].tolist(), df["target"].tolist(), test_size=0.2, random_state=42
)


In [None]:
from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, TrainingArguments, Trainer
import torch
import numpy as np
from sklearn.model_selection import train_test_split


In [None]:
import pandas as pd
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split

# Load dataset
df = pd.read_csv("cleaned_Medical Translation.csv")

# Debug: Print column names
print("Dataset Columns:", df.columns)

# Rename columns to match expected format
df.rename(columns={"english": "source", "tigrinya": "target"}, inplace=True)

# Split dataset into train (80%), validation (10%), and test (10%)
train_texts, temp_texts, train_labels, temp_labels = train_test_split(
    df["source"], df["target"], test_size=0.2, random_state=42
)

val_texts, test_texts, val_labels, test_labels = train_test_split(
    temp_texts, temp_labels, test_size=0.5, random_state=42
)

# Convert to Hugging Face Dataset format
train_dataset = Dataset.from_dict({"source": train_texts.tolist(), "target": train_labels.tolist()})
val_dataset = Dataset.from_dict({"source": val_texts.tolist(), "target": val_labels.tolist()})
test_dataset = Dataset.from_dict({"source": test_texts.tolist(), "target": test_labels.tolist()})

dataset = DatasetDict({
    "train": train_dataset,
    "validation": val_dataset,
    "test": test_dataset,
})

print(dataset)


In [None]:
from transformers import AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"  # Choose appropriate NLLB-200 variant
tokenizer = AutoTokenizer.from_pretrained(model_name)

# NLLB needs explicit language codes; `text_target` tokenizes the targets as labels.
tokenizer.src_lang, tokenizer.tgt_lang = "eng_Latn", "tir_Ethi"  # English -> Tigrinya

def preprocess_function(examples):
    return tokenizer(examples["source"], text_target=examples["target"],
                     max_length=128, truncation=True, padding="max_length")

# Apply tokenization
tokenized_datasets = dataset.map(preprocess_function, batched=True)

print("Tokenization complete!")


In [None]:
!pip install --upgrade transformers


In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Define the model name (Use the actual model name you're working with)
model_name = "facebook/nllb-200-distilled-600M"  # Or another variant

# Load Pretrained Model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [None]:
# Tokenize the dataset. NLLB needs explicit language codes, and `text_target`
# tokenizes the Tigrinya side as labels.
tokenizer.src_lang, tokenizer.tgt_lang = "eng_Latn", "tir_Ethi"  # English -> Tigrinya

def preprocess_function(examples):
    return tokenizer(examples["english"], text_target=examples["tigrinya"],
                     max_length=128, truncation=True, padding="max_length")

# Convert to Hugging Face Dataset format
train_data = Dataset.from_dict({"english": train_texts, "tigrinya": train_labels})
val_data = Dataset.from_dict({"english": val_texts, "tigrinya": val_labels})
test_data = Dataset.from_dict({"english": test_texts, "tigrinya": test_labels})

# Apply the tokenization function
tokenized_datasets = DatasetDict({
    "train": train_data.map(preprocess_function, batched=True),
    "validation": val_data.map(preprocess_function, batched=True),
    "test": test_data.map(preprocess_function, batched=True),
})

print("Tokenization complete!")

In [None]:
import torch
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./nllb_trained",
    eval_strategy="epoch",  # `evaluation_strategy` is the deprecated name
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=3,
    fp16=torch.cuda.is_available(),  # Enable FP16 only when a GPU is present
    logging_dir="./logs",
    logging_steps=500,
)


In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Define the model name
model_name = "facebook/nllb-200-distilled-600M"  # Or the path to your fine-tuned model

# Load Pretrained Model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


from transformers import Trainer

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
)

# Start training
trainer.train()


**Prepare Data for Fine-Tuning**

We need to format our cleaned dataset to be compatible with the model.
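One detail worth checking when labels are padded to a fixed length: the loss is computed over every label position unless that position holds `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the masking step on plain lists. The helper `mask_padding` is hypothetical, and `pad_token_id=1` is an assumption for the example; in the notebook the value would come from `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def mask_padding(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Hypothetical token ids; pad_token_id=1 is an assumption for this example.
labels = [[256, 17, 42, 2, 1, 1], [256, 99, 2, 1, 1, 1]]
masked = mask_padding(labels, pad_token_id=1)
print(masked)
```

Without this step, a model trained on heavily padded labels spends part of its capacity learning to predict padding, which can make the reported loss look better than the translations are.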

In [None]:
from datasets import Dataset

# NLLB needs explicit language codes; `text_target` tokenizes the targets as labels.
tokenizer.src_lang, tokenizer.tgt_lang = "eng_Latn", "tir_Ethi"  # English -> Tigrinya

def preprocess_function(examples):
    return tokenizer(examples["english"], text_target=examples["tigrinya"],
                     max_length=128, truncation=True, padding="max_length")

# Load cleaned dataset
import pandas as pd
df = pd.read_csv("cleaned_Medical Translation.csv")

# Convert to Hugging Face dataset format
dataset = Dataset.from_pandas(df)
tokenized_datasets = dataset.map(preprocess_function, batched=True)

**Define Training Arguments**

We set up configurations for fine-tuning, such as batch size, learning rate, and evaluation strategy.
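When choosing step-based settings such as `save_steps` and `logging_steps`, it helps to estimate how many optimizer steps the run will take. The sketch below is a rough single-device estimate without gradient accumulation; `total_training_steps` is a hypothetical helper and the dataset size of 4,000 pairs is an assumption, while the batch size and epoch count follow the values used in this notebook.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Estimate optimizer steps for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# Hypothetical dataset size; batch size and epochs follow the cells in this notebook.
steps = total_training_steps(num_examples=4000, batch_size=8, num_epochs=3)
print(steps)  # 1500
```

Under these assumptions the run lasts 1,500 steps, so `save_steps=10_000` would never reach a periodic checkpoint, whereas `save_steps=500` would be reached at steps 500, 1000, and 1500.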

**Fine-tune the Model**

Now you can fine-tune the NLLB-200 model using the training arguments you defined. Make sure that you have your dataset loaded and preprocessed correctly.

In [None]:
!pip install --upgrade transformers


In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

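# Load the pre-trained NLLB-200 model and tokenizer.
# Note: the 3.3B checkpoint needs far more memory than the distilled 600M
# checkpoint loaded earlier in this notebook.
model_name = "facebook/nllb-200-3.3B"  # Replace with the correct NLLB model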
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
!pip install datasets

In [None]:
ls -la

In [None]:
!find . -name ".gitignore"

In [None]:
!find . -iname "cleaned_Medical Translation.csv"

In [None]:
ls | grep "cleaned_Medical Translation.csv"


In [None]:
!git ls-files | grep -i "cleaned_Medical Translation.csv"

In [None]:
!git ls-files | grep -F ".env"

In [None]:
import pandas as pd
import os
from sklearn.model_selection import train_test_split
from datasets import Dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Trainer, TrainingArguments

# Define dataset path
dataset_path = "/content/AI_Tigrinya_Translation/cleaned_Medical Translation.csv"

# Load dataset into Pandas DataFrame
df = pd.read_csv(dataset_path)

# Ensure dataset has correct columns
source_column = "english"
target_column = "tigrinya"

if source_column not in df.columns or target_column not in df.columns:
    raise ValueError(f"Expected columns '{source_column}' and '{target_column}' not found in dataset!")

# Drop any rows with missing values (avoiding NoneType errors)
df = df.dropna(subset=[source_column, target_column])

# Split dataset (90% training, 10% evaluation)
train_df, eval_df = train_test_split(df, test_size=0.1, random_state=42)

# Convert to Hugging Face Dataset
train_dataset = Dataset.from_pandas(train_df)
eval_dataset = Dataset.from_pandas(eval_df)

# Load NLLB-200 model and tokenizer
model_name = "facebook/nllb-200-3.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# ✅ Tokenization function (updated)
tokenizer.src_lang, tokenizer.tgt_lang = "eng_Latn", "tir_Ethi"  # English -> Tigrinya

def tokenize_function(examples):
    # Ensure no NoneType values exist
    inputs = [str(text) for text in examples[source_column]]
    targets = [str(text) for text in examples[target_column]]

    # `text_target` takes the target texts themselves (not a boolean) and
    # stores their token ids under "labels" in the returned encoding.
    return tokenizer(inputs, text_target=targets, max_length=128,
                     padding="max_length", truncation=True)

# ✅ Apply tokenization
train_dataset = train_dataset.map(tokenize_function, batched=True)
eval_dataset = eval_dataset.map(tokenize_function, batched=True)

# ✅ Remove unnecessary columns
columns_to_keep = ["input_ids", "attention_mask", "labels"]
train_dataset.set_format(type="torch", columns=columns_to_keep)
eval_dataset.set_format(type="torch", columns=columns_to_keep)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    push_to_hub=False
)

# Initialize Trainer
trainer = Trainer(
    model=AutoModelForSeq2SeqLM.from_pretrained(model_name),
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer
)

# ✅ Train the model
trainer.train()

# ✅ Evaluate the model
trainer.evaluate()

In [None]:
import pandas as pd
import os
from sklearn.model_selection import train_test_split
from datasets import Dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Trainer, TrainingArguments

# Define dataset path
dataset_path = "/content/AI_Tigrinya_Translation/cleaned_Medical Translation.csv"

# Load dataset into Pandas DataFrame
df = pd.read_csv(dataset_path)

# Ensure dataset has correct columns
source_column = "english"   # Update if different
target_column = "tigrinya"  # Update if different

if source_column not in df.columns or target_column not in df.columns:
    raise ValueError(f"Expected columns '{source_column}' and '{target_column}' not found in dataset!")

# Split dataset (90% training, 10% evaluation)
train_df, eval_df = train_test_split(df, test_size=0.1, random_state=42)

# Convert to Hugging Face Dataset
train_dataset = Dataset.from_pandas(train_df)
eval_dataset = Dataset.from_pandas(eval_df)

# Load NLLB-200 model and tokenizer
model_name = "facebook/nllb-200-3.3B"  # Update if needed
tokenizer = AutoTokenizer.from_pretrained(model_name)

# ✅ Tokenization function (sentence-level tokenization)
# `text_target` replaces the deprecated `as_target_tokenizer()` context manager.
tokenizer.src_lang, tokenizer.tgt_lang = "eng_Latn", "tir_Ethi"  # English -> Tigrinya

def tokenize_function(examples):
    return tokenizer(examples[source_column], text_target=examples[target_column],
                     max_length=128, padding="max_length", truncation=True)

# ✅ Apply tokenization (ensuring 'input_ids' and 'attention_mask' exist)
train_dataset = train_dataset.map(tokenize_function, batched=True)
eval_dataset = eval_dataset.map(tokenize_function, batched=True)

# ✅ Remove unnecessary columns (keep only required ones)
columns_to_keep = ["input_ids", "attention_mask", "labels"]
train_dataset.set_format(type="torch", columns=columns_to_keep)
eval_dataset.set_format(type="torch", columns=columns_to_keep)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    push_to_hub=False
)

# Initialize Trainer
trainer = Trainer(
    model=AutoModelForSeq2SeqLM.from_pretrained(model_name),
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer
)

# ✅ Train the model
trainer.train()

# ✅ Evaluate the model
trainer.evaluate()


In [None]:
import pandas as pd
from datasets import Dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split

# Load the cleaned dataset
dataset_path = "/content/AI_Tigrinya_Translation/cleaned_Medical Translation.csv"
df = pd.read_csv(dataset_path).dropna()

# Ensure necessary columns exist
required_columns = {"input_ids", "attention_mask", "tokenized_tigrinya"}
if not required_columns.issubset(df.columns):
    raise ValueError(f"Dataset is missing required columns: {required_columns - set(df.columns)}")

# Rename 'tokenized_tigrinya' to 'labels' (this is required for the Trainer)
df = df.rename(columns={"tokenized_tigrinya": "labels"})

# Drop unnecessary columns
df = df[["input_ids", "attention_mask", "labels"]]

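# Convert string representations of lists into actual lists.
# ast.literal_eval only parses Python literals, so it is safer than eval().
import ast
df["input_ids"] = df["input_ids"].apply(ast.literal_eval)
df["attention_mask"] = df["attention_mask"].apply(ast.literal_eval)
df["labels"] = df["labels"].apply(ast.literal_eval)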

# Split the dataset
train_df, eval_df = train_test_split(df, test_size=0.1, random_state=42)

# Convert to Hugging Face Dataset format
train_dataset = Dataset.from_pandas(train_df)
eval_dataset = Dataset.from_pandas(eval_df)

# Load pre-trained NLLB-200 model and tokenizer
model_name = "facebook/nllb-200-3.3B"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,  # Reduce batch size if OOM
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    push_to_hub=False
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer
)

# Fine-tune the model
trainer.train()

# Evaluate the model
trainer.evaluate()
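
The cell above turns the stringified token lists stored in the CSV back into Python lists. The sketch below shows one cautious way to do that parsing with `ast.literal_eval`, which accepts literals only. The helper name `parse_id_list` and the sample strings are hypothetical and not part of the notebook.

```python
import ast

def parse_id_list(cell: str) -> list[int]:
    """Parse a stringified list of token ids such as "[256047, 1617, 2]"."""
    value = ast.literal_eval(cell)  # literals only; expressions and calls are rejected
    if not isinstance(value, list) or not all(isinstance(v, int) for v in value):
        raise ValueError(f"expected a list of ints, got: {cell!r}")
    return value

print(parse_id_list("[256047, 1617, 2]"))

# A cell that contains code instead of a literal is refused rather than run.
try:
    parse_id_list("__import__('os').getcwd()")
except (ValueError, SyntaxError) as err:
    print("rejected:", type(err).__name__)
```

With pandas, the same helper could be passed to `Series.apply` in place of `eval`.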


In [None]:
import pandas as pd
from datasets import Dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split

# Load the cleaned, already tokenized dataset
dataset_path = "/content/AI_Tigrinya_Translation/cleaned_Medical Translation.csv"
df = pd.read_csv(dataset_path).dropna()

# Ensure the tokenized columns are present
if "input_ids" not in df.columns or "attention_mask" not in df.columns:
    raise ValueError("Tokenized dataset must contain 'input_ids' and 'attention_mask' columns!")

# Split the dataset (90% training, 10% evaluation)
train_df, eval_df = train_test_split(df, test_size=0.1, random_state=42)

# Convert Pandas DataFrame to Hugging Face Dataset
train_dataset = Dataset.from_pandas(train_df)
eval_dataset = Dataset.from_pandas(eval_df)

# Load pre-trained NLLB-200 model and tokenizer
model_name = "facebook/nllb-200-3.3B"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,  # Reduce batch size if OOM
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    push_to_hub=False
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer
)

# Fine-tune the model
trainer.train()

# Evaluate the model
trainer.evaluate()
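
The cells above hold out 10% of the rows for evaluation and fix the random seed so the split is repeatable. The pure-Python sketch below illustrates that idea only. The helper `split_rows` and the sample sentence pairs are hypothetical, and the shuffling does not reproduce scikit-learn's `train_test_split` row for row.

```python
import random

def split_rows(rows, test_size=0.1, seed=42):
    """Shuffle a copy of rows with a fixed seed and split off the evaluation share."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed, same order on every run
    n_eval = max(1, round(len(shuffled) * test_size))
    return shuffled[n_eval:], shuffled[:n_eval]

# Hypothetical sentence pairs standing in for the English/Tigrinya rows.
pairs = [(f"english sentence {i}", f"tigrinya sentence {i}") for i in range(50)]
train_rows, eval_rows = split_rows(pairs)
print(len(train_rows), len(eval_rows))
```

Because the seed is fixed, rerunning the cell yields the same evaluation rows, which keeps evaluation losses comparable across runs.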

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Trainer, TrainingArguments

# Load dataset
dataset_path = "/content/AI_Tigrinya_Translation/cleaned_Medical Translation.csv"
df = pd.read_csv(dataset_path).dropna()  # Drop missing values

# Ensure dataset has correct columns
source_column = "english"
target_column = "tigrinya"

if source_column not in df.columns or target_column not in df.columns:
    raise ValueError(f"Expected columns '{source_column}' and '{target_column}' not found in dataset!")

# Convert everything to string
df[source_column] = df[source_column].astype(str)
df[target_column] = df[target_column].astype(str)

# Split the dataset (90% training, 10% evaluation)
train_df, eval_df = train_test_split(df, test_size=0.1, random_state=42)

# Convert to Hugging Face Dataset format
train_dataset = Dataset.from_pandas(train_df)
eval_dataset = Dataset.from_pandas(eval_df)

# Load the pre-trained NLLB-200 model and tokenizer
model_name = "facebook/nllb-200-3.3B"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

# Tokenization function
def tokenize_function(examples):
    return tokenizer(
        [str(text) for text in examples[source_column]],
        text_target=[str(text) for text in examples[target_column]],
        padding="max_length",
        truncation=True
    )

# Apply tokenization
train_dataset = train_dataset.map(tokenize_function, batched=True)
eval_dataset = eval_dataset.map(tokenize_function, batched=True)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,  # Reduce batch size if OOM error occurs
    per_device_eval_batch_size=4,   # Reduce batch size if OOM error occurs
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    push_to_hub=False
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer
)

# Fine-tune the model
trainer.train()

# Evaluate the model
trainer.evaluate()


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Trainer, TrainingArguments

# Load dataset
dataset_path = "/content/AI_Tigrinya_Translation/cleaned_Medical Translation.csv"
df = pd.read_csv(dataset_path).dropna()  # Remove missing values

# Ensure dataset has correct columns
source_column = "english"
target_column = "tigrinya"

if source_column not in df.columns or target_column not in df.columns:
    raise ValueError(f"Expected columns '{source_column}' and '{target_column}' not found in dataset!")

# Convert everything to string
df[source_column] = df[source_column].astype(str)
df[target_column] = df[target_column].astype(str)

# Split the dataset (90% training, 10% evaluation)
train_df, eval_df = train_test_split(df, test_size=0.1, random_state=42)

# Convert to Hugging Face Dataset format
train_dataset = Dataset.from_pandas(train_df)
eval_dataset = Dataset.from_pandas(eval_df)

# Load the pre-trained NLLB-200 model and tokenizer
model_name = "facebook/nllb-200-3.3B"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

# Sentence-based tokenization function
def tokenize_function(examples):
    # Tokenize sentences individually
    inputs = [sentence.strip() for sentence in examples[source_column]]
    targets = [sentence.strip() for sentence in examples[target_column]]

    return tokenizer(
        inputs,
        text_target=targets,
        padding="max_length",  # Ensures fixed input length
        truncation=True,  # Prevents exceeding max length
        max_length=128  # Adjust based on sentence length
    )

# Apply tokenization sentence-wise
train_dataset = train_dataset.map(tokenize_function, batched=True)
eval_dataset = eval_dataset.map(tokenize_function, batched=True)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,  # Adjust if OOM occurs
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    push_to_hub=False
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer
)

# Fine-tune the model
trainer.train()

# Evaluate the model
trainer.evaluate()
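
The tokenizer call above combines `padding="max_length"`, `truncation=True`, and `max_length=128` so that every example has the same length. The sketch below shows what that means for a single sequence, in plain Python. `pad_or_truncate` and `PAD_ID` are hypothetical names, and a real tokenizer also keeps its special tokens in place when truncating, which this sketch does not model.

```python
def pad_or_truncate(ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = ids[:max_length]              # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)       # padding: fill the remainder with pad_id
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 0 marks padding positions
    return input_ids, attention_mask

PAD_ID = 1  # hypothetical pad id; a real tokenizer exposes its own pad_token_id

short_ids, short_mask = pad_or_truncate([17, 42, 2], max_length=6, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate(list(range(10, 20)), max_length=6, pad_id=PAD_ID)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Fixed-length padding is simple but spends compute on padding tokens when most sentences are much shorter than `max_length`.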


In [None]:
import pandas as pd
import os
from sklearn.model_selection import train_test_split
from datasets import Dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Trainer, TrainingArguments

# Define paths
dataset_path = "/content/AI_Tigrinya_Translation/cleaned_Medical Translation.csv"

# Load dataset into a Pandas DataFrame
df = pd.read_csv(dataset_path)

# Ensure dataset has correct columns
source_column = "english"  # Update if different
target_column = "tigrinya"  # Update if different

if source_column not in df.columns or target_column not in df.columns:
    raise ValueError(f"Expected columns '{source_column}' and '{target_column}' not found in dataset!")

# Split the dataset (90% training, 10% evaluation)
train_df, eval_df = train_test_split(df, test_size=0.1, random_state=42)

# Convert to Hugging Face Dataset format
train_dataset = Dataset.from_pandas(train_df)
eval_dataset = Dataset.from_pandas(eval_df)

# Load the pre-trained NLLB-200 model and tokenizer
model_name = "facebook/nllb-200-3.3B"  # Update with the correct model if needed
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenization function
def tokenize_function(examples):
    return tokenizer(examples[source_column], text_target=examples[target_column],
                     padding="max_length", truncation=True)

# Apply tokenization
train_dataset = train_dataset.map(tokenize_function, batched=True)
eval_dataset = eval_dataset.map(tokenize_function, batched=True)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    push_to_hub=False
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer
)

# Fine-tune the model
trainer.train()

# Evaluate the model
trainer.evaluate()

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

# Load the pre-trained NLLB-200 model and tokenizer
model_name = "facebook/nllb-200-3.3B"  # Replace with the correct NLLB model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the custom CSV dataset, then split it so evaluation uses held-out rows.
# A local CSV is loaded through the "csv" builder with data_files.
data_path = '/content/AI_Tigrinya_Translation/cleaned_Medical Translation.csv'  # Update with actual data path
raw_dataset = load_dataset('csv', data_files=data_path)['train']
dataset_split = raw_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = dataset_split['train']
eval_dataset = dataset_split['test']

# Tokenize the datasets
def tokenize_function(examples):
    # The CSV stores the sentence pairs in the 'english' and 'tigrinya' columns
    return tokenizer(examples['english'], text_target=examples['tigrinya'], padding="max_length", truncation=True)

train_dataset = train_dataset.map(tokenize_function, batched=True)
eval_dataset = eval_dataset.map(tokenize_function, batched=True)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',              # output directory for model checkpoints
    evaluation_strategy="epoch",         # evaluate the model at the end of each epoch
    save_strategy="epoch",               # save model checkpoint at the end of each epoch (to match eval_strategy)
    learning_rate=2e-5,                  # learning rate for the optimizer
    per_device_train_batch_size=8,       # batch size for training
    per_device_eval_batch_size=8,        # batch size for evaluation
    num_train_epochs=3,                  # number of training epochs
    weight_decay=0.01,                   # weight decay to avoid overfitting
    logging_dir='./logs',                # directory for storing logs
    logging_steps=500,                   # log every 500 steps
    save_steps=500,                      # ignored while save_strategy="epoch"; applies only with save_strategy="steps"
    save_total_limit=2,                  # maximum number of checkpoints to save
    load_best_model_at_end=True,         # load the best model when finished training
    metric_for_best_model="loss",        # select the best checkpoint by evaluation loss (no compute_metrics function reports accuracy)
    push_to_hub=False                    # set to True to push model to Hugging Face Hub
)

# Initialize the Trainer with the training arguments, model, and dataset
trainer = Trainer(
    model=model,                        # the model to be fine-tuned
    args=training_args,                 # the training arguments defined earlier
    train_dataset=train_dataset,        # the training dataset
    eval_dataset=eval_dataset,          # the evaluation dataset
    tokenizer=tokenizer                 # the tokenizer used to encode/decode text
)

# Fine-tune the model
trainer.train()
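
The training arguments above mix per-epoch settings (evaluation and checkpointing) with a per-step setting (`logging_steps=500`). The sketch below works out how the two relate. It assumes a single device, no gradient accumulation, and a hypothetical training set of 4,500 examples; `training_steps` is an illustrative helper, not a `transformers` function.

```python
import math

def training_steps(num_examples, batch_size, epochs):
    """Return (optimizer steps per epoch, total optimizer steps)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    return steps_per_epoch, steps_per_epoch * epochs

n_train = 4500        # hypothetical number of training examples
logging_steps = 500   # value used in the TrainingArguments above

per_epoch, total = training_steps(n_train, batch_size=8, epochs=3)
print("steps per epoch:", per_epoch)
print("total steps:", total)
print("loss log entries:", total // logging_steps)
```

With a much smaller dataset the total can fall below 500 steps, in which case no intermediate training loss is logged and a smaller `logging_steps` may be worth considering.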

In [None]:
!pip install transformers datasets torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Trainer, TrainingArguments

Load Pretrained NLLB-200 Model

In [None]:
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

Prepare Data for Fine-Tuning

In [None]:
from datasets import Dataset

def preprocess_function(examples):
    # Tokenize source and target together; text_target encodes the Tigrinya
    # side as labels using the tokenizer's target-language settings.
    return tokenizer(
        examples["english"],
        text_target=examples["tigrinya"],
        max_length=128,
        truncation=True,
    )

dataset = Dataset.from_pandas(df)
tokenized_datasets = dataset.map(preprocess_function, batched=True)

        **Train the Model**

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
)


In [None]:
# Split the tokenized dataset into training and evaluation sets.
# Dataset.from_pandas returns a single Dataset, so there is no "train" key to index.
dataset_split = tokenized_datasets.train_test_split(test_size=0.1)

# Collect both splits in a plain dict keyed by split name
dataset = {
    "train": dataset_split["train"],
    "test": dataset_split["test"],  # This will be used as eval_dataset
}

# Assign train and eval sets
train_data = dataset["train"]
eval_data = dataset["test"]

print(train_data)
print(eval_data)

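# Examples were tokenized without padding, so pad each batch when it is assembled
from transformers import DataCollatorForSeq2Seq

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,  # Now eval_dataset is correctly provided
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)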
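
The preprocessing for this Trainer truncates sequences but does not pad them, so the examples in one batch can differ in length and have to be padded when the batch is assembled. The pure-Python sketch below illustrates that per-batch padding, with label padding set to -100 so those positions are ignored by the loss. The `collate` helper, `PAD_ID`, and the toy batch are hypothetical; in `transformers`, `DataCollatorForSeq2Seq` serves this purpose.

```python
def collate(batch, pad_id, label_pad_id=-100):
    """Pad every example to the longest sequence found in this batch."""
    max_in = max(len(ex["input_ids"]) for ex in batch)
    max_lab = max(len(ex["labels"]) for ex in batch)
    out = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in batch:
        n_in = max_in - len(ex["input_ids"])
        n_lab = max_lab - len(ex["labels"])
        out["input_ids"].append(ex["input_ids"] + [pad_id] * n_in)
        out["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * n_in)
        # -100 is the index that cross-entropy loss skips by default in PyTorch
        out["labels"].append(ex["labels"] + [label_pad_id] * n_lab)
    return out

PAD_ID = 1  # hypothetical pad id
toy_batch = [
    {"input_ids": [5, 6, 7, 2], "labels": [9, 2]},
    {"input_ids": [8, 2], "labels": [3, 4, 2]},
]
padded = collate(toy_batch, pad_id=PAD_ID)
for name, rows in padded.items():
    print(name, rows)
```

Padding to the longest sequence in each batch, rather than to a global maximum, keeps short batches short.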
