<a href="https://www.kaggle.com/code/taheriodgewala/minimize-perplexity-with-fine-tuning-chatgpt2?scriptVersionId=210237558" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Competition overview 
This Kaggle competition is about solving a unique natural language processing (NLP) task. The challenge involves rearranging scrambled words in text passages to restore coherence. The aim is to minimize the perplexity score, a metric that reflects how well a language model predicts the sequence of words. A lower perplexity score indicates better fluency and coherence in the rearranged text.



# Develop a Solution
Use a pre-trained language model (like GPT, T5, or BERT) to understand the context and reorder words effectively.
Fine-tune the model, if needed, using relevant datasets (e.g., Christmas stories or other narrative datasets).
Implement beam search, top-k sampling, or another decoding strategy to optimize the word order.
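
As a rough illustration of the sampling-based decoding strategies mentioned above, the sketch below applies top-k and nucleus (top-p) filtering to a next-word distribution before sampling. The function names and the toy distribution are hypothetical; a real pipeline would take these probabilities from the language model.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def nucleus_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    kept, mass = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    return {tok: p / mass for tok, p in kept}

def sample(probs, rng):
    """Draw one token according to the given probabilities."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Made-up next-word distribution, used only for illustration.
next_word_probs = {"tree": 0.5, "gift": 0.3, "snow": 0.15, "the": 0.05}
top2 = top_k_filter(next_word_probs, k=2)
nucleus = nucleus_filter(next_word_probs, top_p=0.9)
choice = sample(nucleus, random.Random(0))
```

Top-k keeps a fixed number of candidates, while nucleus filtering adapts the number of candidates to how peaked the distribution is.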

# Evaluate with Perplexity
Ensure a rearranged text minimizes the perplexity score.
Test on validation data and compare your score with the baseline.
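
To make the metric concrete: perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of per-token log-probabilities. The helper name `perplexity_from_log_probs` and the probabilities are made up for illustration and are not competition data.

```python
import math

def perplexity_from_log_probs(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical per-token probabilities assigned by a language model.
fluent = [math.log(p) for p in (0.5, 0.5, 0.5, 0.5)]
scrambled = [math.log(p) for p in (0.1, 0.1, 0.1, 0.1)]

fluent_ppl = perplexity_from_log_probs(fluent)
scrambled_ppl = perplexity_from_log_probs(scrambled)
```

A model that assigns each token probability 0.5 has perplexity 2, while one that assigns 0.1 has perplexity 10, so the text the model finds more predictable gets the lower score.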

# Potential Solution Framework
- Pre-trained language models:
  - Use models like GPT-3, Google Gemini-1.5, T5, or others capable of handling long contexts.
  - Fine-tune the model on domain-specific data (e.g., Christmas tales or books).
- Data augmentation:
  - Create synthetic training data by scrambling the words of coherent texts, so the model sees scrambled and original versions of the same passage.
- Custom decoding techniques:
  - Implement decoding strategies that prioritize logical word sequences, like:
    - Beam search to find a high-scoring word order.
    - Nucleus sampling to maintain diversity and fluency.
- Metric calculation:
  - Validate your model's predictions using the perplexity metric and refine hyperparameters to minimize it.
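
The data augmentation idea can be sketched as follows, assuming whitespace-separated words and a fixed random seed for reproducibility. The function name `make_scrambled_pairs` and the example sentences are hypothetical.

```python
import random

def make_scrambled_pairs(sentences, seed=0):
    """Return (scrambled, original) pairs; each scrambled text is a word-level permutation."""
    rng = random.Random(seed)
    pairs = []
    for sentence in sentences:
        words = sentence.split()
        shuffled = words[:]  # copy so the original order is preserved
        rng.shuffle(shuffled)
        pairs.append((" ".join(shuffled), " ".join(words)))
    return pairs

# Made-up example sentences, not competition data.
corpus = [
    "the elf wrapped the gifts by the fireplace",
    "snow fell quietly on the chimney all night",
]
pairs = make_scrambled_pairs(corpus)
```

Each pair gives a scrambled input and its coherent target, which is the shape of data a reordering model would need for training.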

Import the libraries 📚

In [1]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel


 # Step 1 
 Load the Pre-trained GPT-2 Model and Tokenizer

In [2]:
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)



 Put the model in evaluation mode, which disables dropout ⬇︎

In [3]:
model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

# Step 2
Define the beam search function. It starts from the input sequence with a score of 0. At each step it generates the logits for the next token, takes the top `beam_width` tokens and their probabilities, appends each token to the current sequence to create a new candidate, and keeps the `beam_width` highest-scoring candidates for the next iteration. After `max_length` steps it returns the best sequence, decoded to text. Note that this extends the input with generated tokens rather than permuting the input words.

In [4]:
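def beam_search(input_ids, model, tokenizer, beam_width=5, max_length=50):
    """
    Extend input_ids with beam search and return the decoded best sequence.

    Note: this appends max_length generated tokens to the input; it does
    not restrict the output to a permutation of the input words.
    """
    # Each beam entry is (token ids of shape (1, seq_len), cumulative log-probability).
    sequences = [(input_ids, 0.0)]

    for _ in range(max_length):
        all_candidates = []

        for seq, score in sequences:
            # Generate logits for the next token; decoding needs no gradients.
            with torch.no_grad():
                outputs = model(seq)
            logits = outputs.logits[:, -1, :]
            log_probs = torch.log_softmax(logits, dim=-1)

            # Get the top beam_width tokens and their log-probabilities.
            top_log_probs, top_tokens = log_probs.topk(beam_width, dim=-1)

            # Drop the batch dimension so both tensors have shape (beam_width,).
            top_log_probs = top_log_probs.squeeze(0)
            top_tokens = top_tokens.squeeze(0)

            # Create new candidate sequences.
            for token, log_prob in zip(top_tokens, top_log_probs):
                # Reshape the token to (1, 1) so it can be concatenated to seq.
                candidate_seq = torch.cat([seq, token.view(1, -1)], dim=-1)
                candidate_score = score + log_prob.item()
                all_candidates.append((candidate_seq, candidate_score))

        # Select the top sequences for the next iteration.
        ordered = sorted(all_candidates, key=lambda x: x[1], reverse=True)
        sequences = ordered[:beam_width]

    # Return the best sequence, decoded to text.
    best_sequence = sequences[0][0]
    return tokenizer.decode(best_sequence[0], skip_special_tokens=True)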
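
For comparison, here is a minimal sketch of a beam search that only ever places words from the input, so every result is a permutation of the scrambled passage. This is not the method used in this notebook. The function `permutation_beam_search`, the `toy_score` scorer, and the bigram table are hypothetical stand-ins; in practice the score would come from a language model's log-probability of the partial ordering.

```python
from collections import Counter

def permutation_beam_search(words, score_fn, beam_width=3):
    """Beam search over orderings of `words`; each input word is used exactly once."""
    # Each beam entry is (ordered prefix, counter of words still unused).
    beams = [((), Counter(words))]
    for _ in range(len(words)):
        candidates = []
        for prefix, remaining in beams:
            for word in remaining:
                new_remaining = remaining.copy()
                new_remaining[word] -= 1
                if new_remaining[word] == 0:
                    del new_remaining[word]
                candidates.append((prefix + (word,), new_remaining))
        # Keep the highest-scoring partial orderings.
        candidates.sort(key=lambda c: score_fn(c[0]), reverse=True)
        beams = candidates[:beam_width]
    return list(beams[0][0])

# Hypothetical bigram scores standing in for language model log-probabilities.
BIGRAM_SCORES = {
    ("the", "elf"): 2.0,
    ("elf", "wraps"): 2.0,
    ("wraps", "gifts"): 2.0,
}

def toy_score(prefix):
    """Sum the scores of adjacent word pairs; unknown pairs count as -1."""
    return sum(BIGRAM_SCORES.get(pair, -1.0) for pair in zip(prefix, prefix[1:]))

scrambled_words = ["gifts", "elf", "the", "wraps"]
best = permutation_beam_search(scrambled_words, toy_score, beam_width=3)
```

Tracking the unused words in a `Counter` handles repeated words correctly, since each occurrence can be placed only once.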


# Step 3 
Define Perplexity Calculation Function

In [5]:
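def calculate_perplexity(text, model, tokenizer):
    """
    Calculate the perplexity of a given text under the model.
    """
    # Truncate long inputs to 512 tokens.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    # Passing the input ids as labels makes the model return the mean
    # cross-entropy loss over the predicted tokens.
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    # Perplexity is the exponential of the mean cross-entropy loss.
    return torch.exp(outputs.loss).item()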


# Step 4
Load and Process Dataset

In [6]:
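def load_data(filepath):
    """
    Load jumbled text passages from a file, one passage per line.
    """
    # Each line is kept as raw text, so for a CSV file the header row and
    # the leading id column become part of the passages.
    with open(filepath, "r", encoding="utf-8") as f:
        passages = f.readlines()
    return [passage.strip() for passage in passages]

input_file = "/kaggle/input/santa-2024/sample_submission.csv"
data = load_data(input_file)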
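
The preview output later in this notebook shows rows beginning with `id,text` and `0,"advent ...`, which suggests the input is a CSV file with `id` and `text` columns. Under that assumption, a sketch of loading only the passage text with the standard `csv` module could look like this. The helper name `load_passages` and the in-memory sample rows are hypothetical.

```python
import csv
import io

def load_passages(csv_file):
    """Read (id, text) pairs from a CSV file object with `id` and `text` columns."""
    reader = csv.DictReader(csv_file)
    return [(row["id"], row["text"]) for row in reader]

# In-memory stand-in for the competition file; the rows are made up.
sample_csv = io.StringIO(
    'id,text\n'
    '0,"advent chimney elf family"\n'
    '1,"yuletide decorations gifts cheer"\n'
)
passages = load_passages(sample_csv)
```

With a real file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, which keeps the header and the id out of the text passed to the model.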


# Step 5 
Process and Decode Each Passage

In [7]:
decoded_passages = []
for passage in data:
    input_ids = tokenizer.encode(passage, return_tensors="pt")
    reordered_text = beam_search(input_ids, model, tokenizer)
    perplexity = calculate_perplexity(reordered_text, model, tokenizer)
    decoded_passages.append((reordered_text, perplexity))
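
Since the task is to rearrange the given words, it can help to check that a decoded passage uses exactly the words of its input before scoring it. The sketch below assumes whitespace-separated words; the helper name `is_word_permutation` and the example strings are hypothetical.

```python
from collections import Counter

def is_word_permutation(original, candidate):
    """Return True if candidate uses exactly the same words as original, in any order."""
    return Counter(original.split()) == Counter(candidate.split())

# Made-up examples for illustration.
scrambled_text = "gifts elf the wraps"
valid = is_word_permutation(scrambled_text, "the elf wraps gifts")
extended = is_word_permutation(scrambled_text, "the elf wraps gifts for everyone")
```

A check like this flags outputs that add, drop, or duplicate words, which a free-running decoder can produce.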

# Step 6
Save Results and Calculate Average Perplexity

In [8]:
output_file = "reordered_passages.txt"
total_perplexity = 0
with open(output_file, "w") as f:
    for idx, (text, perplexity) in enumerate(decoded_passages):
        total_perplexity += perplexity
        f.write(f"Passage {idx + 1}:\n{text}\nPerplexity: {perplexity}\n\n")

average_perplexity = total_perplexity / len(decoded_passages)
print(f"Average Perplexity: {average_perplexity}")


Average Perplexity: 46.71575801713126


# Save predictions
Write the predictions to a CSV file

In [9]:
import csv

output_csv = "predictions.csv"


with open(output_csv, mode="w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Passage_ID", "Reordered_Text", "Perplexity"])  # Header

    for idx, (text, perplexity) in enumerate(decoded_passages):
        writer.writerow([idx + 1, text, perplexity])

print(f"Predictions saved to {output_csv}")


Predictions saved to predictions.csv


# Prepare the data for display
Convert the results to a DataFrame, display it in a tabular format, and show the first few rows for preview

In [10]:
import pandas as pd

# Prepare the data for display
predictions = [{"Passage_ID": idx + 1, "Reordered_Text": text, "Perplexity": perplexity}
               for idx, (text, perplexity) in enumerate(decoded_passages)]

# Convert to DataFrame
df = pd.DataFrame(predictions)

# Display the DataFrame in a tabular format
from IPython.display import display
display(df)

# Show the first few rows for preview
print("Preview of Results:")
print(df.head(30))


Unnamed: 0,Passage_ID,Reordered_Text,Perplexity
0,1,"id,text-transform: uppercase; font-size: 14px;...",2.021008
1,2,"0,""advent chimney elf family fireplace gingerb...",9.90353
2,3,"1,""advent chimney elf family fireplace gingerb...",21.591696
3,4,"2,""yuletide decorations gifts cheer holiday ca...",39.660198
4,5,"3,""yuletide decorations gifts cheer holiday ca...",58.412903
5,6,"4,""hohoho candle poinsettia snowglobe peppermi...",54.484875
6,7,"5,""advent chimney elf family fireplace gingerb...",140.936096


Preview of Results:
   Passage_ID                                     Reordered_Text  Perplexity
0           1  id,text-transform: uppercase; font-size: 14px;...    2.021008
1           2  0,"advent chimney elf family fireplace gingerb...    9.903530
2           3  1,"advent chimney elf family fireplace gingerb...   21.591696
3           4  2,"yuletide decorations gifts cheer holiday ca...   39.660198
4           5  3,"yuletide decorations gifts cheer holiday ca...   58.412903
5           6  4,"hohoho candle poinsettia snowglobe peppermi...   54.484875
6           7  5,"advent chimney elf family fireplace gingerb...  140.936096


In [11]:
! pip install tabulate






In [12]:
from tabulate import tabulate


table_data = [[idx + 1, text, perplexity] for idx, (text, perplexity) in enumerate(decoded_passages)]
headers = ["Passage_ID", "Reordered_Text", "Perplexity"]


print(tabulate(table_data, headers=headers, tablefmt="fancy_grid"))



# Conclusion 
This notebook used the pretrained GPT-2 model with a custom decoding technique, beam search, to prioritize likely word sequences, and used the perplexity metric to evaluate the results. The solution framework also calls for fine-tuning the model for this competition, creating synthetic data through data augmentation, and refining hyperparameters to minimize perplexity, though the cells above run the pretrained model as is.

Feel free to connect and share your suggestions in the comments. Thank you!