Good day, my name is Mcpraise Leightong Okoi.

Today, we’ll be building a machine learning model that translates conversational English into written German. We’ll fine-tune it specifically on workplace conversations so that its translations are tailored to the vocabulary and tone of that setting.

Now, without further ado, let’s begin! 🥂

#### Importing the Libraries

In [None]:
from transformers import T5ForConditionalGeneration, T5Tokenizer, Trainer, TrainingArguments
import pandas as pd
import torch

#### Optional Step: Verify That transformers and accelerate Are Compatible
The Trainer used later relies on the accelerate package, so check that your installed versions of transformers and accelerate are compatible. If they are not, upgrade both.

In [None]:
# import transformers
# import accelerate

# print("Transformers version:", transformers.__version__)
# print("Accelerate version:", accelerate.__version__)
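If one of the two packages is missing, the imports above fail before anything is printed. A minimal sketch of a more forgiving check, using only the standard library, is shown below; the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report both packages without raising if one of them is absent
for name in ("transformers", "accelerate"):
    found = installed_version(name)
    print(f"{name}: {found if found is not None else 'not installed'}")
```

If either package is reported as missing or outdated, `pip install --upgrade transformers accelerate` upgrades both in one step.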

#### We will use the 't5-small' model from Hugging Face (published as google-t5/t5-small) and fine-tune it to align with our end goal.

The code below downloads the model and caches it locally (by default under '~/.cache/huggingface'). We’ll then use it to translate every question and answer in the dataset into German.

The logic behind this approach is that, to achieve our end goal (i.e., for the model to recognize patterns and provide tailored translations), we need the data in both English and its corresponding German form.

In [None]:
model_name = "t5-small"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)


df = pd.read_csv("./datasets/Conversation.csv")


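def translate_text(text):
    # T5 selects the task from a text prefix
    input_text = f"translate English to German: {text}"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=128)
    # Without an explicit limit, generate() can stop at a short default length
    # and cut longer sentences off; 128 is an assumed upper bound
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=128)
    translated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return translated_text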

# Translate the questions and answers
df['questions_german'] = df['question'].apply(translate_text)
df['answers_german'] = df['answer'].apply(translate_text)

# Save the translated dataset next to the source data, where the training step reads it
df.to_csv("./datasets/translated_dataset.csv", index=False)
print("Translation complete. Saved as './datasets/translated_dataset.csv'")
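Translating one row at a time with `apply` calls the model once per sentence, which can be slow on larger datasets. A minimal sketch of chunking the texts so each model call handles several sentences is shown below. The names `chunked`, `translate_in_batches`, and `fake_translate_batch` are hypothetical, and the fake translator only stands in for a real model call to show the data flow.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_in_batches(texts, translate_batch, batch_size=16):
    """Apply `translate_batch` (list of str -> list of str) chunk by chunk."""
    results = []
    for batch in chunked(list(texts), batch_size):
        results.extend(translate_batch(batch))
    return results


def fake_translate_batch(batch):
    # Stand-in for a real model call, used only to show the data flow
    return [f"<de>{text}</de>" for text in batch]


sample = ["Good morning.", "Where is the meeting room?", "See you tomorrow."]
translated = translate_in_batches(sample, fake_translate_batch, batch_size=2)
print(translated)
```

With the real model, `translate_batch` could tokenize the prefixed sentences with `padding=True`, pass them to `model.generate`, and decode the result with `tokenizer.batch_decode`.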

#### Training

Now, we’ll fine-tune the model on the translated dataset, writing checkpoints to './results' and logs to './logs', and then save the fine-tuned model and tokenizer to 'fine_tuned_t5'.

In [None]:
df = pd.read_csv("./datasets/translated_dataset.csv")

def prepare_data(df):
    input_texts = []
    target_texts = []
    
    for index, row in df.iterrows():
        
        input_text = f"translate English to German: {row['question']}"
        target_text = row['questions_german']
        
        input_texts.append(input_text)
        target_texts.append(target_text)
        
    return input_texts, target_texts

input_texts, target_texts = prepare_data(df)

class TranslationDataset(torch.utils.data.Dataset):
    def __init__(self, tokenizer, input_texts, target_texts, max_input_length=128, max_target_length=128):
        self.tokenizer = tokenizer
        self.input_texts = input_texts
        self.target_texts = target_texts
        self.max_input_length = max_input_length
        self.max_target_length = max_target_length
        
    def __len__(self):
        return len(self.input_texts)

    def __getitem__(self, idx):
        input_encoding = self.tokenizer(self.input_texts[idx], truncation=True, padding='max_length', max_length=self.max_input_length, return_tensors='pt')
        target_encoding = self.tokenizer(self.target_texts[idx], truncation=True, padding='max_length', max_length=self.max_target_length, return_tensors='pt')

        # Replace padding in the labels with -100 so the loss ignores those positions
        labels = target_encoding['input_ids'].flatten().clone()
        labels[labels == self.tokenizer.pad_token_id] = -100

        return {
            'input_ids': input_encoding['input_ids'].flatten(),
            'attention_mask': input_encoding['attention_mask'].flatten(),
            'labels': labels
        }


model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)


train_dataset = TranslationDataset(tokenizer, input_texts, target_texts)


training_args = TrainingArguments(
    output_dir='./results',            
    num_train_epochs=3,                
    per_device_train_batch_size=8,     
    save_steps=10_000,                 
    save_total_limit=2,                
    logging_dir='./logs',              
    logging_steps=200,                  
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)


trainer.train()


model.save_pretrained("fine_tuned_t5")
tokenizer.save_pretrained("fine_tuned_t5")
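
The training cell uses every row for training, which leaves nothing aside for measuring quality on unseen sentences. One possible way to hold out a validation portion is a seeded shuffle over the paired lists, sketched below. The function name `train_val_split`, the 10% default, and the demo sentences are assumptions made for this example.

```python
import random


def train_val_split(inputs, targets, val_fraction=0.1, seed=42):
    """Shuffle paired examples and split them into train and validation parts."""
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same length")
    indices = list(range(len(inputs)))
    # A seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(indices)
    val_size = max(1, int(len(indices) * val_fraction))
    val_idx, train_idx = indices[:val_size], indices[val_size:]
    train = ([inputs[i] for i in train_idx], [targets[i] for i in train_idx])
    val = ([inputs[i] for i in val_idx], [targets[i] for i in val_idx])
    return train, val


demo_inputs = [f"translate English to German: sentence {i}" for i in range(10)]
demo_targets = [f"Satz {i}" for i in range(10)]
(train_in, train_out), (val_in, val_out) = train_val_split(
    demo_inputs, demo_targets, val_fraction=0.2
)
print(len(train_in), len(val_in))
```

Each returned pair of lists can be passed to `TranslationDataset` in place of the full `input_texts` and `target_texts`.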

#### Model Evaluation

With the model already trained and saved, we can now load it to calculate metrics and assess its performance.

In [None]:
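import numpy as np
import evaluate  # pip install evaluate; load_metric was deprecated and later removed from datasets
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Load the tokenizer and fine-tuned model if not already loaded
# (skip this step if already in memory)
tokenizer = T5Tokenizer.from_pretrained("fine_tuned_t5")
model = T5ForConditionalGeneration.from_pretrained("fine_tuned_t5")

# Validation data. Assumption: no separate validation file exists, so the last
# 100 training pairs are reused here; replace this with a held-out set if you have one
val_dataset = TranslationDataset(tokenizer, input_texts[-100:], target_texts[-100:])

bleu_metric = evaluate.load("bleu")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    # -100 marks ignored or padded positions and cannot be decoded
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    bleu = bleu_metric.compute(predictions=decoded_preds,
                               references=[[label] for label in decoded_labels])
    return {"bleu": bleu["bleu"]}

# predict_with_generate makes evaluation return generated token ids instead of logits
eval_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=128,
)

# Initialize the Trainer for evaluation
trainer = Seq2SeqTrainer(
    model=model,
    args=eval_args,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

# Run evaluation only (this will not trigger training)
eval_results = trainer.evaluate()

# Print out evaluation results
print("Evaluation metrics:", eval_results)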
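
To make the BLEU number easier to interpret, the sketch below computes the clipped n-gram precision that BLEU is built on, for a single prediction and reference. It is not a full BLEU implementation: the real metric combines the precisions for n = 1 to 4 and applies a brevity penalty. The function names and the two German sentences are invented for this illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(prediction, reference, n):
    """Modified n-gram precision: each predicted n-gram counts at most as
    often as it appears in the reference."""
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matched / total


prediction = "ich freue mich auf die Zusammenarbeit"
reference = "ich freue mich auf unsere Zusammenarbeit"
print("unigram precision:", clipped_precision(prediction, reference, 1))
print("bigram precision:", clipped_precision(prediction, reference, 2))
```

Here five of the six predicted words and three of the five predicted word pairs also occur in the reference, so a single wrong word lowers the bigram precision more than the unigram precision.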

#### Testing

Now we test the fine-tuned model on a few workplace phrases to see how it performs.

In [None]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the fine-tuned model and tokenizer
model_name = "fine_tuned_t5"  # Directory where your fine-tuned model is saved
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

# Function to translate English to German
def translate(text):
    input_text = f"translate English to German: {text}"
    inputs = tokenizer(input_text, return_tensors="pt")
    # Allow up to 128 new tokens (an assumed limit) so longer sentences are not cut off
    outputs = model.generate(**inputs, max_new_tokens=128)
    translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return translated_text

# Test the model with workplace conversational phrases
test_phrases = [
    "Could you please send me the latest report?",
    "Let's schedule a meeting for tomorrow at 10 AM.",
    "Thank you for your assistance with this project.",
    "I'm looking forward to our collaboration on this task.",
]

# Translate and print the results
for phrase in test_phrases:
    translation = translate(phrase)
    print(f"English: {phrase}")
    print(f"German: {translation}\n")

## THE END

Thank you for taking the time to review my work! Due to hardware limitations, I was unable to extend it to produce state-of-the-art results, but I hope you appreciated the concept.

Stay safe and stay blessed!