---
---

# **Niloufar Abbasi | 401209996**
# Deep Learning, Homework 4, Part3

---
---

Please note that all cells should be executed sequentially. Running certain cells, especially those involved in processing steps such as dropping data, more than once may yield different results.

---


This notebook combines Markdown cells with in-code comments to explain the purpose and functionality of each section.

---

### Table of Contents
- Required Libraries
- Dataset Loading and Preprocessing
- Tokenization and Data Preparation
- Model Training and Fine Tuning
- Evaluation & Results

---

**Required Libraries**

In [11]:
import pandas            as pd
import numpy             as np
import matplotlib.pyplot as plt
import torch
import torch.nn          as nn

from torch.utils.data        import Dataset, DataLoader
from transformers            import GPT2Tokenizer, GPT2LMHeadModel  # AdamW is taken from torch.optim below
from torch.cuda.amp          import autocast, GradScaler
from sklearn.model_selection import train_test_split
from torch.nn                import CrossEntropyLoss
from transformers            import pipeline

---

# **Part (A)**

---

**Load the dataset**

In [12]:
#with open('/kaggle/input/ferdousi/ferdousi.txt', 'r') as file:
#    lines = file.readlines()

with open('ferdousi.txt', 'r', encoding='utf-8') as file:  # explicit encoding for the Persian text
    lines = file.readlines()

**Process each line**

In [13]:
data = []
for i, line in enumerate(lines):
    line = line.strip()  # remove newline character
    if i % 2 == 0:
        data.append({"mesra1": line})
    else:
        data[-1]["mesra2"] = line
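
The pairing logic above can be tried on a handful of lines in isolation. The following is a minimal sketch, assuming the file alternates first and second mesras after a two-line header; the helper name `pair_lines` and the sample list are hypothetical and only serve as an illustration.

```python
def pair_lines(lines):
    """Group consecutive lines into {'mesra1', 'mesra2'} records."""
    records = []
    for i, line in enumerate(lines):
        line = line.strip()  # remove surrounding whitespace and the newline
        if i % 2 == 0:
            records.append({"mesra1": line})
        else:
            records[-1]["mesra2"] = line
    return records


# Hypothetical sample: two header lines, one full beyt, one unpaired line.
sample = [
    "ferdousi.txt\n",
    "number of beyts:\t49609\n",
    "به نام خداوند جان و خرد\n",
    "کزین برتر اندیشه برنگذرد\n",
    "خداوند نام و خداوند جای\n",
]
records = pair_lines(sample)
print(len(records))               # number of records built
print("mesra2" in records[-1])    # whether the last record is complete
```

With an odd number of lines, the last record has no `mesra2` key, which becomes a missing value once the records are converted to a DataFrame.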

**Convert to DataFrame**

In [14]:
data = pd.DataFrame(data)
# Put mesra2 first so that each row reads right to left, like a real 'beyt'
data = data[["mesra2", "mesra1"]]
data.head(10)

| | mesra2 | mesra1 |
|---|---|---|
| 0 | number of beyts:\t49609 | ferdousi.txt |
| 1 | کزین برتر اندیشه برنگذرد | به نام خداوند جان و خرد |
| 2 | خداوند روزی ده رهنمای | خداوند نام و خداوند جای |
| 3 | فروزنده ماه و ناهید و مهر | خداوند کیوان و گردان سپهر |
| 4 | نگارندهٔ بر شده پیکرست | ز نام و نشان و گمان برترست |
| 5 | نبینی مرنجان دو بیننده را | به بینندگان آفریننده را |
| 6 | که او برتر از نام و از جایگاه | نیابد بدو نیز اندیشه راه |
| 7 | نیابد بدو راه جان و خرد | سخن هر چه زین گوهران بگذرد |
| 8 | همان را گزیند که بیند همی | خرد گر سخن برگزیند همی |
| 9 | میان بندگی را ببایدت بست | ستودن نداند کس او را چو هست |


**Processing the dataframe**

In [15]:
data.drop(0, inplace=True)  # row 0 holds file metadata (file name and verse count), not poetry

In [16]:
# Add a new column for verse numbers
data['verses_number'] = range(0, len(data))
# Reorder the columns
data = data[['verses_number', 'mesra2', 'mesra1']]
data.head(5)
# The verse numbers let us check, after shuffling, that each pair of mesras stayed together.

| | verses_number | mesra2 | mesra1 |
|---|---|---|---|
| 1 | 0 | کزین برتر اندیشه برنگذرد | به نام خداوند جان و خرد |
| 2 | 1 | خداوند روزی ده رهنمای | خداوند نام و خداوند جای |
| 3 | 2 | فروزنده ماه و ناهید و مهر | خداوند کیوان و گردان سپهر |
| 4 | 3 | نگارندهٔ بر شده پیکرست | ز نام و نشان و گمان برترست |
| 5 | 4 | نبینی مرنجان دو بیننده را | به بینندگان آفریننده را |


In [17]:
num_rows = data.shape[0]
print(f'The DataFrame has {num_rows} rows.')

The DataFrame has 49609 rows.


In [18]:
# Replace any missing cells with a placeholder string (this can happen if the last line has no partner)
data = data.fillna("این قسمت فاقد محتوا می باشد:)")

In [19]:
# Shuffle the DataFrame with a fixed seed so that the split is reproducible
# and does not depend on a verse's position in the poem
shuffled_df = data.sample(frac=1, random_state=42)

# Reset the index
shuffled_df.reset_index(drop=True, inplace=True)

# Calculate the size of the training set
train_size = int(0.8 * len(shuffled_df))

# Split the DataFrame
train_df = shuffled_df[:train_size]
test_df = shuffled_df[train_size:]

# Now I have train_df and test_df

print(f'The train DataFrame has {train_df.shape[0]} rows.')
print(f'The test DataFrame has {test_df.shape[0]} rows.')

The train DataFrame has 39687 rows.
The test DataFrame has 9922 rows.
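
The split sizes printed above follow directly from `int(0.8 * 49609)`. The following is a minimal sketch of the same shuffle-then-slice idea in plain Python, assuming only the row count from the notebook; the helper `shuffle_and_split` is hypothetical, and because it uses Python's `random` module instead of `DataFrame.sample`, the resulting order differs from the notebook's.

```python
import random


def shuffle_and_split(items, train_fraction=0.8, seed=42):
    """Shuffle a copy of `items` and slice it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    train_size = int(train_fraction * len(shuffled))
    return shuffled[:train_size], shuffled[train_size:]


verse_numbers = range(49609)  # one id per beyt, matching the row count above
train_ids, test_ids = shuffle_and_split(verse_numbers)
print(len(train_ids), len(test_ids))
```

Every verse id ends up in exactly one of the two parts, so no beyt is shared between training and test data.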


---

# **Part (B)**

---

**Tokenization and Dataset Preparation for Seq2Seq Model**

In this part, we prepare the data for training with the Persian GPT-2 tokenizer. The two lines of each beyt (mesra1 and mesra2) are concatenated, and the resulting sequences are tokenized for training and evaluation. Since GPT-2 is a decoder-only language model, the sequence-to-sequence task is framed as continuing mesra1 with mesra2.

In [106]:
tokenizer = GPT2Tokenizer.from_pretrained('HooshvareLab/gpt2-fa')  # GPT-2 tokenizer for the Persian language ('gpt2-fa')

# Set the padding token to "هیچ" ('none' in Persian)
tokenizer.pad_token = "هیچ"

# Concatenate mesra1 and mesra2 for both train and test data, separated by the Persian comma ('،')
train_df = train_df.copy()
test_df  = test_df.copy()
train_df.loc[:, 'concatenated'] = train_df['mesra1'] + '،' + train_df['mesra2']
test_df.loc[:, 'concatenated']  = test_df['mesra1']  + '،' + test_df['mesra2']

# Tokenize the concatenated sequences, prepending a space to each one and padding to a uniform length
train_encodings = tokenizer(train_df['concatenated'].apply(lambda x: ' ' + x).tolist(), truncation=True, padding=True)
test_encodings  = tokenizer(test_df['concatenated'].apply(lambda x: ' ' + x).tolist() , truncation=True, padding=True)

# Build the labels by dropping the first character of each string and tokenizing the result.
# Note: this is a character-level offset; GPT2LMHeadModel also shifts the labels by one token internally.
train_labels = tokenizer(train_df['concatenated'].apply(lambda x: x[1:]).tolist(), truncation=True, padding=True)
test_labels  = tokenizer(test_df['concatenated'].apply(lambda x: x[1:]).tolist() , truncation=True, padding=True)

#-------------------------------------------------------
# Define a custom PyTorch dataset class (Seq2SeqDataset) that wraps the tokenized encodings and labels.
class Seq2SeqDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels['input_ids'][idx])
        return item

    def __len__(self):
        return len(self.encodings.input_ids)
#-------------------------------------------------------

# Create the data loaders
train_dataset = Seq2SeqDataset(train_encodings, train_labels)
test_dataset  = Seq2SeqDataset(test_encodings , test_labels )

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True )
test_loader  = DataLoader(test_dataset , batch_size=16, shuffle=False)


The code above tokenizes the concatenated Persian verses with the GPT-2 tokenizer and wraps them in data loaders for training and evaluation.
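
For a causal language model such as GPT-2, a common alternative way to build labels is to reuse the input ids and mask the padded positions. The following is a minimal sketch of that idea in plain Python, assuming the Hugging Face convention that label value `-100` is ignored by the loss and that the model shifts the labels by one token internally. The helper `make_labels` and the toy token ids are hypothetical; this is not what the cell above does.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def make_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids and replace padded positions with ignore_index."""
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Toy batch: two sequences padded to length 5 (the pad id 0 is made up).
input_ids = [[11, 12, 13, 0, 0], [21, 22, 23, 24, 25]]
attention_mask = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
labels = make_labels(input_ids, attention_mask)
print(labels)
```

With labels built this way, padded positions would not contribute to the loss, and inputs and labels always have the same length.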

---

# **Part (C)**

---


What this part of the code does:

1. **Load Pre-trained Model:**
   - Uses 'HooshvareLab/gpt2-fa', a GPT-2 model for the Persian language.

2. **Move to GPU:**
   - Moves the model to the GPU if one is available.

3. **Optimizer & Loss:**
   - Uses the AdamW optimizer. A CrossEntropyLoss criterion is also created, but the loss that is minimized is the one returned by the model (`outputs.loss`).

4. **Training Loop:**
   - Trains for 20 epochs on the poetry dataset.

5. **Results:**
   - The training loss decreases from about 3.12 to about 0.45.
   - The aim is better generation of Persian poetry.


In [21]:
# Load the pre-trained model
model = GPT2LMHeadModel.from_pretrained('HooshvareLab/gpt2-fa')

# Move the model to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model  = model.to(device)

# Set up the optimizer and the loss function
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

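# Run one training epoch and return the mean batch loss
# (the loss is taken from outputs.loss; `criterion` is not used inside the function)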
def train(model, dataloader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for batch in dataloader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(dataloader)


# Train the model
num_epochs = 20
for epoch in range(num_epochs):
    loss = train(model, train_loader, optimizer, criterion, device)
    print(f'Epoch {epoch+1}/{num_epochs}: Loss = {loss}')


Epoch 1/20: Loss = 3.12295079370604
Epoch 2/20: Loss = 2.4618343951864525
Epoch 3/20: Loss = 2.131697087410908
Epoch 4/20: Loss = 1.857341863344101
Epoch 5/20: Loss = 1.608992741310707
Epoch 6/20: Loss = 1.3777685616344082
Epoch 7/20: Loss = 1.1727102534850726
Epoch 8/20: Loss = 0.9949515821567614
Epoch 9/20: Loss = 0.8469533993562639
Epoch 10/20: Loss = 0.731245081626999
Epoch 11/20: Loss = 0.6487587131064345
Epoch 12/20: Loss = 0.587440934129801
Epoch 13/20: Loss = 0.5494004616041618
Epoch 14/20: Loss = 0.522333134282553
Epoch 15/20: Loss = 0.5041049857117484
Epoch 16/20: Loss = 0.4846757662267466
Epoch 17/20: Loss = 0.4753985822633215
Epoch 18/20: Loss = 0.4620984543816114
Epoch 19/20: Loss = 0.4536150928251881
Epoch 20/20: Loss = 0.4453157692823329


**Explanation:**

*   The training loss decreases in every epoch, from about 3.12 after epoch 1 to about 0.45 after epoch 20.

*   A decreasing loss indicates that the model is getting better at predicting the target sequences in the training data.

*   A lower training loss generally corresponds to a better fit. It is measured on the training data only, so it does not by itself show how well the model generalizes.


These results show that fine-tuning is working: the model adapts to the patterns in the concatenated Persian verses, and as training progresses its parameters are refined, which reduces the training loss.

Monitoring the loss across epochs is essential for assessing convergence. Loss values that are low and have stabilized indicate that training has converged.
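
One simple way to monitor convergence is to look at the relative improvement between consecutive epochs. The following is a minimal sketch that uses the loss values logged above; the helper `relative_improvements` is hypothetical and only illustrates the idea.

```python
# Training losses copied from the log above (epochs 1 to 20).
losses = [
    3.12295079370604, 2.4618343951864525, 2.131697087410908,
    1.857341863344101, 1.608992741310707, 1.3777685616344082,
    1.1727102534850726, 0.9949515821567614, 0.8469533993562639,
    0.731245081626999, 0.6487587131064345, 0.587440934129801,
    0.5494004616041618, 0.522333134282553, 0.5041049857117484,
    0.4846757662267466, 0.4753985822633215, 0.4620984543816114,
    0.4536150928251881, 0.4453157692823329,
]


def relative_improvements(values):
    """Fractional decrease of each value relative to the previous one."""
    return [(prev - cur) / prev for prev, cur in zip(values, values[1:])]


improvements = relative_improvements(losses)
for epoch, gain in enumerate(improvements, start=2):
    print(f"Epoch {epoch}: loss fell by {gain:.1%} relative to the previous epoch")
```

The early epochs reduce the loss by roughly a fifth each, while the last epochs reduce it by only a few percent, which matches the flattening visible in the log.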

---

# **Part (D)**

---

**Random Generation with some Arbitrary Sentences**

(You can rerun this cell as many times as you like. Because sampling is enabled, it generates new poems each time, and some of them are really fun.)

In [105]:
# Create a text generation pipeline
generator = pipeline('text-generation', model=model, tokenizer=tokenizer, device=0 if torch.cuda.is_available() else -1)  # fall back to CPU when no GPU is present

def generate_and_print_results(input_sentence, generator, max_length=100, temperature=0.7):
    print("\nInput Sentence:", input_sentence)

    # Generate paired sentences given an input sentence
    generated_text = generator(input_sentence, max_length=max_length, do_sample=True, temperature=temperature, pad_token_id=generator.tokenizer.eos_token_id)
    generated_text = generated_text[0]['generated_text'].split(' ، ')

    print("\nGenerated Sentences:")
    for text in generated_text:
        # Print at most the first two mesras of each piece
        for line in text.split('،')[:2]:
            print(line)

    print("===============================")


# Examples
print('\nExample #1:')
generate_and_print_results("تو نیکی میکن و در دجله", generator, temperature=0.8)
print('\nExample #2:')
generate_and_print_results("ای ایران"               , generator, temperature=0.8)




Example #1:

Input Sentence: تو نیکی میکن و در دجله

Generated Sentences:
تو نیکی میکن و در دجله شوی
چنان چون بباید گزندی بدوی

Example #2:

Input Sentence: ای ایران

Generated Sentences:
ای ایران ز هر سو بجنگ آمدی
ز هامون بگذاشتی و بجنگ آمدی


---

For each example, the input prompt is printed first, followed by the model's continuation split into its two mesras at the Persian comma.
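
Splitting a generated string into its two mesras can also be written as a small standalone helper. The following is a minimal sketch, assuming the model emits the Persian comma ('،') between the two halves as in the training data. The helper `split_beyt` is hypothetical, the example string is assembled from the output shown above, and the logic is simpler than the cell's (it splits only at the first separator).

```python
def split_beyt(text, separator="،"):
    """Split text into two mesras at the first separator and trim whitespace."""
    first, _, second = text.partition(separator)
    return first.strip(), second.strip()


# Example assembled from the second generated sample above.
example = "ای ایران ز هر سو بجنگ آمدی،ز هامون بگذاشتی و بجنگ آمدی"
mesra1, mesra2 = split_beyt(example)
print(mesra1)
print(mesra2)
```

If the separator is missing, the second element is an empty string, so the caller can detect an incomplete beyt.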

---

**Generation on Samples from the Test Data: Completing the First Mesra and Comparing with the Target Verse**

In [51]:
def generate_and_print_results(input_sentence, target_sentence, model, tokenizer, device):

    # Tokenize the input sentence
    input_ids = tokenizer(input_sentence, return_tensors="pt", max_length=50, truncation=True, padding=True)['input_ids'].to(device)

    # Generate a complete sequence with beam search (sampling is off, so no temperature is passed)
    generated_ids = model.generate(input_ids, max_length=100, num_beams=5, pad_token_id=tokenizer.eos_token_id)
    generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)

    print("Input Mesra 1:   ", input_sentence)
    print("\n-------------------------------")

    # Print generated text in two lines
    print("Generated :\n   ")
    print(generated_text.replace('،', '،\n'))
    print("\n-------------------------------")

    print("Target (original):\n   ", target_sentence)

    print("\n================================================\n")


# Print results for a few examples on the test data
example_indices = [1,3,100]

for idx in example_indices:
    print(f'Test data, number {idx}\n')
    input_sentence = test_df['mesra1'].iloc[idx]
    target_sentence = '\n' + test_df['mesra1'].iloc[idx] + ' \n ' + test_df['mesra2'].iloc[idx]  # Original concatenated target
    generate_and_print_results(input_sentence, target_sentence, model, tokenizer, device)



Test data, number 1

Input Mesra 1:    بیامد پر اندیشه و روی زرد

-------------------------------
Generated :
   
بیامد پر اندیشه و روی زرد،
به پیش فریدون شد آن شوخ مرد

-------------------------------
Target (original):
    
بیامد پر اندیشه و روی زرد 
 بپرسید زان نامداران مرد


Test data, number 3

Input Mesra 1:    ولیکن بدین رای هشیار من

-------------------------------
Generated :
   
ولیکن بدین رای هشیار من،
همانا که بود از در کار من

-------------------------------
Target (original):
    
ولیکن بدین رای هشیار من 
 یکی بنگرد ژرف سالار من


Test data, number 100

Input Mesra 1:    کدامست مردی پژوهنده راز

-------------------------------
Generated :
   
کدامست مردی پژوهنده راز،
کز اندیشهٔ بد بپرداز نیاز

-------------------------------
Target (original):
    
کدامست مردی پژوهنده راز 
 که پیماید این ژرف راه دراز




---

**Perplexity**

Perplexity measures how well a language model predicts a given sequence of tokens. It is widely used in natural language processing to evaluate probabilistic language models.

The perplexity of a model on a dataset is computed from the likelihood that the model assigns to the sequences in that dataset. The lower the perplexity, the better the model predicts them. It is the exponential of the mean per-token cross-entropy loss:

$$\text{Perplexity} = \exp\left(\text{Cross-Entropy Loss}\right)$$




In [107]:
def calculate_perplexity(model, dataloader, criterion, device):
    model.eval()
    total_loss = 0
    total_tokens = 0

    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_loss += loss.item()
            total_tokens += attention_mask.sum().item()

    # total_loss is a sum of per-batch mean losses; here it is divided by the total token count
    perplexity = np.exp(total_loss / total_tokens)
    return perplexity

test_perplexity = calculate_perplexity(model, test_loader, criterion, device)
print(f"Perplexity on the test set: {test_perplexity}")

Perplexity on the test set: 1.0146381756956528
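
The exponent in the perplexity formula is a mean loss per token. Since `outputs.loss` from `GPT2LMHeadModel` is already averaged over the tokens of a batch, one way to aggregate over batches is to weight each batch loss by its token count before averaging. The following is a minimal sketch of that aggregation with made-up numbers; the helper `token_weighted_perplexity` is hypothetical and is not the function used in the cell above.

```python
import math


def token_weighted_perplexity(batch_mean_losses, batch_token_counts):
    """exp of the per-token mean loss, from per-batch means and token counts."""
    total_nll = sum(
        loss * count for loss, count in zip(batch_mean_losses, batch_token_counts)
    )
    total_tokens = sum(batch_token_counts)
    return math.exp(total_nll / total_tokens)


# Made-up values: mean loss per batch and the number of tokens
# that contributed to each batch's loss.
batch_mean_losses = [2.0, 1.0]
batch_token_counts = [10, 30]
ppl = token_weighted_perplexity(batch_mean_losses, batch_token_counts)
print(f"Perplexity: {ppl:.4f}")
```

Dividing a sum of per-batch means directly by the token count would give a much smaller exponent, and therefore a perplexity close to 1, so the weighting step matters when comparing results.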
