# 🚀 Fine-Tuning T5 for Instruction-Based Text Generation 🧠

Welcome to this notebook, where we explore **instruction-based fine-tuning** with the **T5 Transformer model**, loaded through Hugging Face Transformers! 🤗✨

---

## 📌 Objective

The goal of this notebook is to fine-tune the lightweight `t5-small` model on a contextual dataset of instructions, inputs, and expected outputs. This process, known as **Supervised Fine-Tuning (SFT)**, teaches the model to generate accurate and contextually relevant outputs based on structured instructions.

---

## 📊 Dataset Overview

We use a 10k-row dataset in which each row contains:

- **Instruction** 📝 – The task description
- **Input** 🔡 – Context or additional information
- **Output** 💬 – Desired response from the model
- **Domain** 🌐 – Task domain (e.g., NLP, Coding, QA)
- **Source** 📁 – Origin of the instruction
- **Quality Score** ⭐ – Human-annotated score for output quality

Before training, we plot the domain, source, and quality-score distributions, along with word counts, to understand how the data is distributed.
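
The column list above can be checked before any plotting starts. The sketch below is a minimal illustration rather than part of the original notebook: `missing_columns` is a hypothetical helper, and the lower-case column names are taken from how the later cells index the DataFrame (`df['instruction']`, `df['quality_score']`, and so on).

```python
# Hypothetical helper: report which expected columns are absent.
EXPECTED_COLUMNS = ("instruction", "input", "output", "domain", "source", "quality_score")

def missing_columns(columns, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are not in `columns`, in order."""
    present = set(columns)
    return [name for name in expected if name not in present]

# Stand-in for `df.columns`; with the real data the call would be missing_columns(df.columns).
example_columns = ["instruction", "input", "output", "domain"]
print(missing_columns(example_columns))
```

An empty list means every column the later cells rely on is present.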

---

## 🧪 Workflow Summary

Here's what we cover in this notebook:

1. 🔍 **Exploratory Data Analysis (EDA)** – Understanding domain distribution, quality scores, text lengths, and common instruction words.
2. 🧹 **Data Preprocessing** – Tokenizing text with the T5 tokenizer and preparing inputs/labels.
3. 🏋️ **Model Fine-Tuning** – Training the `t5-small` model using Hugging Face's `Trainer`.
4. 📉 **Loss Visualization** – Plotting training loss over time.
5. 🧪 **Testing** – Running test prompts to see the model’s predictions!

---

## 🛠️ Tools & Libraries Used

- 🤗 **Transformers** (T5, Trainer)
- 📊 **Matplotlib** & **Seaborn** (visualizations)
- 🧼 **Regex** & **WordCloud** (text cleaning + word cloud)
- 🐍 **PyTorch** (dataset creation)
- 📁 **Pandas** (data handling)

---

Let's dive in and teach our model how to understand and complete instructions like a pro! 😎🔥


# 🛠️ **Installing Modules**

In [None]:
import torch
print(torch.cuda.is_available())  # Prints True when a CUDA GPU is available

In [None]:
# !pip install -q transformers datasets
# !pip install sentencepiece
# !pip install hf_xet
# !pip install "accelerate>=0.26.0"  # quoted so the shell does not treat >= as a redirect
# !pip install wordcloud
# !pip install openpyxl
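
Because the install commands above are commented out, it can help to check which packages are already importable before uncommenting anything. This is a small sketch, not part of the original notebook: `find_missing` is a hypothetical helper, and the import names are assumed to match the pip package names (`hf_xet` is left out because its import name is not confirmed here).

```python
import importlib.util

def find_missing(module_names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Import names assumed to match the pip package names in the cell above.
REQUIRED = ["transformers", "datasets", "sentencepiece", "accelerate", "wordcloud", "openpyxl"]
print("Missing packages:", find_missing(REQUIRED))
```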

# 📚 **Importing Libraries**

In [None]:
import re
import warnings

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import torch
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
from wordcloud import WordCloud

warnings.filterwarnings('ignore')  # silence library warnings in the notebook output

# ⚙️ **Basic Settings**

In [None]:
plt.style.use('dark_background')

# 📂 **Loading Dataset**

In [None]:
df = pd.read_excel('/kaggle/input/contextual-input-sft-dataset/SFT_Contextual_10000.xlsx', sheet_name='Sheet1')

# 🔍 **Exploring Dataset**

In [None]:
df.info()

In [None]:
df.sample(10)

# 📊 **Basic Exploratory Data Analysis (EDA)**

## **📊 Domain-Level Analysis**

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(data=df, y='domain', order=df['domain'].value_counts().index)
plt.title('Instruction Counts by Domain')
plt.xlabel('Number of Instructions')
plt.ylabel('Domain')
plt.grid(True)
plt.show()

## **🧠 Quality Score Distribution**

In [None]:
sns.countplot(data=df, x='quality_score')
plt.title('Distribution of Quality Scores')
plt.xlabel('Quality Score')
plt.ylabel('Count')
plt.grid(True)
plt.show()

## **🧮 Compare Human vs Synthetic**

In [None]:
sns.countplot(data=df, x='source', hue='quality_score')
plt.title('Quality Scores by Source')
plt.xlabel('Source')
plt.ylabel('Count')
plt.grid(True)
plt.show()

## **🧬 Word Count Analysis (Instruction/Input/Output)**

In [None]:
df['instruction_length'] = df['instruction'].apply(lambda x: len(str(x).split()))
df['input_length'] = df['input'].apply(lambda x: len(str(x).split()))
df['output_length'] = df['output'].apply(lambda x: len(str(x).split()))

df[['instruction_length', 'input_length', 'output_length']].describe()

In [None]:
sns.boxplot(data=df[['instruction_length', 'input_length', 'output_length']])
plt.title('Word Count Distribution')
plt.show()
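
The tokenization step later in this notebook truncates prompts at 512 tokens and outputs at 128 tokens, so the length statistics above give a first hint of how much text might be cut. The sketch below is a rough illustration only: `share_over_budget` is a hypothetical helper, and it counts whitespace-separated words, which typically undercounts tokens because the T5 tokenizer can split one word into several pieces.

```python
def share_over_budget(texts, budget):
    """Fraction of texts whose whitespace-separated word count exceeds `budget`."""
    texts = list(texts)
    if not texts:
        return 0.0
    over = sum(1 for text in texts if len(str(text).split()) > budget)
    return over / len(texts)

# Toy strings standing in for a DataFrame column such as df['output'].
sample_outputs = ["short answer", "a much longer answer with seven words", "ok"]
print(share_over_budget(sample_outputs, budget=5))
```

With the real data, a call such as `share_over_budget(df['output'], 128)` would give a lower bound on the share of outputs that could be truncated.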

## **📈 Quality Score vs Output Length**

In [None]:
sns.boxplot(data=df, x='quality_score', y='output_length')
plt.title('Output Length by Quality Score')
plt.grid(True)
plt.show()

In [None]:
def clean_text(text):
    # Lower-case and strip punctuation; str() keeps non-string cells (e.g. NaN) from raising
    return re.sub(r'[^\w\s]', '', str(text).lower())

all_text = " ".join(df['instruction'].apply(clean_text))

wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_text)

plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Most Common Words in Instructions")
plt.show()
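
A word cloud shows relative frequency only by font size, so a plain count can serve as a numerical companion. The sketch below applies the same lower-casing and punctuation stripping as `clean_text`; `top_words` and the sample instructions are hypothetical, and unlike the default `WordCloud` settings it does not drop stopwords.

```python
import re
from collections import Counter

def top_words(texts, n=5):
    """Count words after lower-casing and stripping punctuation, then return the n most common."""
    counts = Counter()
    for text in texts:
        cleaned = re.sub(r"[^\w\s]", "", str(text).lower())
        counts.update(cleaned.split())
    return counts.most_common(n)

# Made-up instructions standing in for df['instruction'].
sample_instructions = ["Summarize the text.", "Translate the text!", "Summarize the report."]
print(top_words(sample_instructions, n=3))
```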

In [None]:
df['quality_score'].value_counts()

In [None]:
df['source'].value_counts()

In [None]:
df['output'].value_counts()

In [None]:
df['domain'].value_counts()

# 🎯 **Fine-Tuning**

## **Prepare Data (10k → 2k for speed)**

In [None]:
small_df = df.sample(2000, random_state=42)
train_df, val_df = train_test_split(small_df, test_size=0.1, random_state=42)  # fixed seed so the split is reproducible

train_texts = ["instruction: " + i + " input: " + inp for i, inp in zip(train_df['instruction'], train_df['input'])]
val_texts = ["instruction: " + i + " input: " + inp for i, inp in zip(val_df['instruction'], val_df['input'])]

train_labels = train_df['output'].tolist()
val_labels = val_df['output'].tolist()
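
The prompt template used above (`"instruction: ... input: ..."`) can be factored into a small function so that training and test prompts are built the same way. This is a sketch under assumptions, not the notebook's own code: `build_prompt` is a hypothetical helper, and dropping the `input:` part when no context is given is a design choice that the original string concatenation does not make.

```python
def build_prompt(instruction, context=None):
    """Format one example as 'instruction: ... input: ...', omitting an empty input."""
    instruction = str(instruction).strip()
    # Treat None, NaN (which is not equal to itself), and blank strings as "no input".
    if context is None or context != context or not str(context).strip():
        return "instruction: " + instruction
    return "instruction: " + instruction + " input: " + str(context).strip()

print(build_prompt("Summarize the text", "T5 is an encoder-decoder model."))
print(build_prompt("Tell a joke", float("nan")))
```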

## **Tokenization**

In [None]:
tokenizer = AutoTokenizer.from_pretrained("t5-small")

def preprocess(texts, labels):
    model_inputs = tokenizer(texts, max_length=512, truncation=True, padding="max_length", return_tensors="pt")
    labels = tokenizer(labels, max_length=128, truncation=True, padding="max_length", return_tensors="pt")["input_ids"]
    labels[labels == tokenizer.pad_token_id] = -100  # -100 marks padding so the loss ignores those positions
    model_inputs["labels"] = labels
    return model_inputs

train_encodings = preprocess(train_texts, train_labels)
val_encodings = preprocess(val_texts, val_labels)
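
The `-100` replacement in `preprocess` works because PyTorch's cross-entropy loss skips target positions equal to its `ignore_index`, whose default is `-100`. The torch-free sketch below illustrates the idea with plain lists; `mask_padding` and `mean_loss` are hypothetical helpers, and the token ids and per-token losses are made up for the example.

```python
PAD_TOKEN_ID = 0      # pad id of the t5-small tokenizer
IGNORE_INDEX = -100   # default ignore_index of PyTorch's cross-entropy loss

def mask_padding(label_ids, pad_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX):
    """Replace padding ids with ignore_index so they do not count toward the loss."""
    return [ignore_index if tok == pad_id else tok for tok in label_ids]

def mean_loss(per_token_losses, labels, ignore_index=IGNORE_INDEX):
    """Average per-token losses over the positions that are not ignored."""
    kept = [loss for loss, lab in zip(per_token_losses, labels) if lab != ignore_index]
    return sum(kept) / len(kept) if kept else 0.0

# Made-up label ids; 1 is T5's end-of-sequence id, the trailing zeros are padding.
labels = mask_padding([363, 19, 1, 0, 0])
print(labels)
print(mean_loss([2.0, 1.0, 3.0, 9.0, 9.0], labels))
```

Without the mask, the two large losses on the padded positions would be averaged in and would pull the model toward predicting padding.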


## **Dataset Object**

In [None]:
class SFTDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    def __len__(self):
        return len(self.encodings["input_ids"])
    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.encodings.items()}

train_dataset = SFTDataset(train_encodings)
val_dataset = SFTDataset(val_encodings)
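
`SFTDataset` turns column-wise encodings (one tensor per field) into row-wise examples (one dict per index). The sketch below reproduces that indexing logic over plain lists so it can be inspected without PyTorch; `ColumnDataset` and the toy ids are hypothetical stand-ins.

```python
class ColumnDataset:
    """Torch-free stand-in for SFTDataset: same __len__/__getitem__ logic over plain lists."""

    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        # The number of examples equals the number of rows in any field.
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        # Pick row `idx` from every field and return them as one example.
        return {key: values[idx] for key, values in self.encodings.items()}

toy = ColumnDataset({
    "input_ids": [[5, 6, 1], [7, 1, 0]],
    "attention_mask": [[1, 1, 1], [1, 1, 0]],
    "labels": [[9, 1], [8, 1]],
})
print(len(toy), toy[1])
```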

## **Define and Train Model**

In [None]:
model = T5ForConditionalGeneration.from_pretrained("t5-small")

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    logging_dir="./logs",
    logging_steps=10,
    save_total_limit=1,
    fp16=False  # mixed precision is off; enabling it requires a CUDA GPU
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()
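
The settings above determine how many optimizer steps the run takes and how many loss values get logged. The arithmetic can be sketched as follows, assuming a single device and no gradient accumulation; `training_schedule` is a hypothetical helper.

```python
import math

def training_schedule(num_examples, batch_size, epochs=1, logging_steps=10):
    """Return (optimizer_steps, logged_points) for one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return total_steps, total_steps // logging_steps

# 2000 sampled rows with test_size=0.1 leave 1800 training examples.
print(training_schedule(num_examples=1800, batch_size=4))
```

Under those assumptions the run takes 450 steps and logs 45 loss values, which is the number of points to expect in the loss plot.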

# 📊 **Visualizing Results**

In [None]:
loss_values = [log["loss"] for log in trainer.state.log_history if "loss" in log]
steps = list(range(1, len(loss_values) + 1))

plt.figure(figsize=(10, 5))
plt.plot(steps, loss_values, label="Training Loss")
plt.xlabel("Logging Step")
plt.ylabel("Loss")
plt.title("Training Loss Over Time")
plt.legend()
plt.grid()
plt.show()
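
With only one logged value per 10 steps, the raw curve can be noisy, and the x-axis above counts log entries rather than optimizer steps. The sketch below shows one way to recover the actual step numbers and smooth the curve. It assumes each log entry carries a `"step"` key alongside `"loss"`; `extract_loss`, `moving_average`, and the sample history are hypothetical, and the numbers are not real training results.

```python
def extract_loss(log_history):
    """Return (steps, losses) from entries that report a training loss."""
    rows = [(entry.get("step"), entry["loss"]) for entry in log_history if "loss" in entry]
    return [step for step, _ in rows], [loss for _, loss in rows]

def moving_average(values, window=3):
    """Simple moving average; returns len(values) - window + 1 points."""
    return [sum(values[i:i + window]) / window for i in range(len(values) - window + 1)]

# Made-up entries shaped like trainer.state.log_history.
fake_history = [
    {"loss": 4.0, "step": 10},
    {"loss": 3.0, "step": 20},
    {"loss": 2.0, "step": 30},
    {"train_loss": 3.0, "step": 30},  # end-of-training summary has no "loss" key
]
steps, losses = extract_loss(fake_history)
print(steps, losses, moving_average(losses, window=2))
```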

# 🧪 **Test Your Fine-Tuned Model**

In [None]:
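def test_model(input_text):
    # Tokenize the prompt and move it to the model's device (the Trainer may have placed the model on a GPU)
    input_ids = tokenizer.encode(input_text, return_tensors="pt", truncation=True, max_length=512).to(model.device)
    # Generate up to 100 tokens and decode them back to text
    output_ids = model.generate(input_ids, max_length=100)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)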

In [None]:
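# Note: training prompts used the "instruction: ... input: ..." template; these raw prompts do not follow it
test_model("Translate to German: I love data science.")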

In [None]:
test_model("Translate to French: I love data science.")

In [None]:
test_model("Translate to Romanian: I love data science.")

# **Thank You**