# **Step 1: Set Up Google Colab Environment**


This notebook fine-tunes a pre-trained sequence-to-sequence model from the Hugging Face `transformers` library (`facebook/bart-large-cnn`) on a dataset of dialogue and summary pairs so that it can summarize long chats. The steps below cover the full workflow, from environment setup to testing the fine-tuned model.

Open Google Colab and create a new notebook.
Ensure you have a GPU runtime enabled (Runtime > Change runtime type > Hardware Accelerator > GPU).
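
Before installing anything, it can be worth confirming that the GPU runtime is actually active. The check below is a minimal sketch that assumes the runtime exposes NVIDIA's `nvidia-smi` command-line tool, as Colab GPU runtimes typically do; the helper name `gpu_runtime_available` is made up for illustration.

```python
import shutil
import subprocess

def gpu_runtime_available() -> bool:
    """Return True if the `nvidia-smi` tool exists and exits successfully."""
    # Assumption: an NVIDIA GPU runtime ships the `nvidia-smi` executable
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0

print("GPU runtime detected:", gpu_runtime_available())
```

If this prints `False`, recheck the runtime type before continuing, since later steps enable mixed-precision training that expects a GPU.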

# **Step 2: Install Required Libraries**

In [None]:
!pip install transformers datasets accelerate
!pip install wandb  # Optional, for experiment tracking



# **Step 3: Prepare Your Dataset**

In [None]:
import csv

# Generate a synthetic dataset of templated dialogue and summary pairs
def generate_large_dataset(num_samples=1000):
    data = []
    for i in range(num_samples):
        dialogue = f"User: Hello, this is message {i}. How are you?\nAssistant: I'm fine, thank you. How can I assist you with message {i}?"
        summary = f"User greeted and inquired about assistance for message {i}."
        data.append({"dialogue": dialogue, "summary": summary})
    return data

# Create the dataset
data = generate_large_dataset(1000)

# Write the dataset to a CSV file
csv_file_path = "large_dialogue_summary_dataset.csv"

with open(csv_file_path, mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["dialogue", "summary"])
    writer.writeheader()
    writer.writerows(data)

print(f"CSV file has been created: {csv_file_path}")


CSV file has been created: large_dialogue_summary_dataset.csv


In [None]:
from datasets import load_dataset

# Load the CSV created above; point data_files at your own file to use real data
dataset = load_dataset('csv', data_files='large_dialogue_summary_dataset.csv')
print(dataset)



DatasetDict({
    train: Dataset({
        features: ['dialogue', 'summary'],
        num_rows: 1000
    })
})


In [None]:
# Split once with a fixed seed so the train and validation sets do not overlap
split = dataset['train'].train_test_split(test_size=0.1, seed=42)
train_dataset = split['train']
val_dataset = split['test']
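
However the split is produced, the validation rows should not also appear in the training set, otherwise the validation loss overstates how well the model generalizes. The following plain-Python sketch (not the `datasets` implementation; `split_indices` is a hypothetical helper) shows the underlying idea: shuffle the row indices once with a fixed seed, then cut the shuffled list in two.

```python
import random

def split_indices(num_rows, test_size=0.1, seed=42):
    """Shuffle row indices once, then cut them into disjoint train/validation parts."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    n_val = int(round(num_rows * test_size))
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices(1000)

# Both parts come from one shuffle, so no row can land on both sides
assert not set(train_idx) & set(val_idx)
print(len(train_idx), len(val_idx))  # 900 100
```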


# **Step 4: Choose a Pre-trained Model**

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



# **Step 5: Preprocess the Data**

In [None]:
def preprocess_function(examples):
    # Truncate only; DataCollatorForSeq2Seq pads each batch dynamically and
    # fills label padding with -100 so padded positions are ignored by the loss
    inputs = tokenizer(examples['dialogue'], max_length=1024, truncation=True)
    outputs = tokenizer(text_target=examples['summary'], max_length=128, truncation=True)
    inputs['labels'] = outputs['input_ids']
    return inputs

tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_val = val_dataset.map(preprocess_function, batched=True)


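
How the labels are padded affects the loss. PyTorch's cross-entropy loss skips target positions equal to -100 by default, and `DataCollatorForSeq2Seq` uses that same value when it pads labels; labels padded with the tokenizer's ordinary pad token instead are scored like real tokens. The sketch below illustrates the padding scheme in plain Python with made-up token IDs; `pad_labels` is a hypothetical helper, not a `transformers` function.

```python
def pad_labels(label_batch, ignore_index=-100):
    """Right-pad each label sequence with ignore_index to the longest length in the batch."""
    target_len = max(len(seq) for seq in label_batch)
    return [seq + [ignore_index] * (target_len - len(seq)) for seq in label_batch]

# Two label sequences of different lengths (token IDs are invented for illustration)
batch = [[0, 713, 2], [0, 713, 16, 10, 2]]
padded = pad_labels(batch)
print(padded)  # [[0, 713, 2, -100, -100], [0, 713, 16, 10, 2]]
```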

# **Step 6: Define the Data Collator**

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)


# **Step 7: Set Up Training Arguments**

In [None]:
from transformers import Seq2SeqTrainingArguments

# Set up training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # Newer transformers releases rename this to `eval_strategy`
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,
    predict_with_generate=True,
    fp16=True,  # Use mixed precision for faster training on GPU
    logging_dir='./logs',
    logging_steps=10,
    report_to="none",  # Disable Weights & Biases and other logging integrations
)
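
To get a feel for how long training will run, the number of optimizer steps can be worked out from the settings above. This is a rough sketch that assumes a single GPU and no gradient accumulation (neither is stated in the notebook), with the 900 training rows that remain after the 90/10 split.

```python
import math

num_train_examples = 900   # 90% of the 1,000 generated rows
per_device_batch_size = 4  # per_device_train_batch_size
num_epochs = 3             # num_train_epochs
logging_steps = 10
num_devices = 1            # assumption: one Colab GPU, no gradient accumulation

steps_per_epoch = math.ceil(num_train_examples / (per_device_batch_size * num_devices))
total_steps = steps_per_epoch * num_epochs
log_events = total_steps // logging_steps

print(steps_per_epoch, total_steps, log_events)  # 225 675 67
```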


# **Step 8: Train the Model**

In [None]:
from transformers import Seq2SeqTrainer

# Initialize the Trainer with the data collator defined in Step 6
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Start training
trainer.train()

# Save the final model
trainer.save_model("./final_chat_summary_model")
tokenizer.save_pretrained("./final_chat_summary_model")

print("Training completed! The fine-tuned model is saved to './final_chat_summary_model'.")




| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 0.0038 | 0.040804 |
| 2 | 0.002 | 0.025514 |
| 3 | 0.0009 | 0.034156 |
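
In the results logged above, training loss keeps falling while validation loss rises again in the last epoch, which is a common sign of overfitting on a small, repetitive dataset. A simple way to read such a log is to pick the epoch with the lowest validation loss, sketched here with the three logged rows:

```python
# Values copied from the training log above
history = [
    {"epoch": 1, "train_loss": 0.0038, "val_loss": 0.040804},
    {"epoch": 2, "train_loss": 0.002, "val_loss": 0.025514},
    {"epoch": 3, "train_loss": 0.0009, "val_loss": 0.034156},
]

# The best checkpoint by this criterion is the one with the smallest validation loss
best = min(history, key=lambda row: row["val_loss"])
print("Best epoch by validation loss:", best["epoch"])  # 2
```

In `transformers`, the `load_best_model_at_end` training argument is one way to keep the best checkpoint automatically; it requires the save strategy to match the evaluation strategy.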




Training completed! The fine-tuned model is saved to './final_chat_summary_model'.


# **Step 9: Evaluate the Model**

In [None]:
!pip install evaluate




In [None]:
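# Install the required dependency
!pip install rouge_score

import numpy as np
import evaluate

# Load ROUGE metric
rouge = evaluate.load("rouge")

# Function to compute metrics during evaluation
def compute_metrics(eval_pred):
    # With predict_with_generate=True, predictions are generated token IDs, not logits
    preds, labels = eval_pred
    # -100 marks ignored positions; swap in the pad token so decoding works
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    predictions = tokenizer.batch_decode(preds, skip_special_tokens=True)
    references = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Compute ROUGE score
    return rouge.compute(predictions=predictions, references=references)

# Attach the metric function to the trainer from Step 8 and run evaluation
trainer.compute_metrics = compute_metrics
print(trainer.evaluate())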


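
To make the metric less of a black box, here is a minimal sketch of the idea behind ROUGE-1: count the unigrams a prediction shares with its reference, then combine precision and recall into an F1 score. The `rouge_score` package applies its own tokenization and optional stemming, so its numbers can differ; `rouge1_f1` is a hypothetical helper and the prediction string is invented for illustration.

```python
import re
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred_tokens = re.findall(r"[a-z0-9]+", prediction.lower())
    ref_tokens = re.findall(r"[a-z0-9]+", reference.lower())
    # Clipped overlap: each shared word counts at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "User greeted and inquired about assistance for message 7."
prediction = "User greeted and asked about assistance."
score = rouge1_f1(prediction, reference)
print(round(score, 3))  # 0.667
```

Here five of the six predicted words appear among the nine reference words, giving a precision of 5/6, a recall of 5/9, and an F1 of 2/3.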


# **Step 10: Test the Model**

In [None]:
# Import torch
import torch

# `chat` holds the transcript to summarize; this short example is a placeholder
chat = "User: Hello, I need help with order 12345.\nAssistant: Sure, how can I assist you with order 12345?"

# Make sure to move the model and input tensors to the same device (GPU if available, else CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Move model to the correct device
model = model.to(device)

# Tokenize the input chat and move inputs to the correct device
inputs = tokenizer(chat, return_tensors="pt", max_length=1024, truncation=True).to(device)

# Generate the summary, passing the attention mask along with the input IDs
summary_ids = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)

# Decode and print the summary
print("Summary:", tokenizer.decode(summary_ids[0], skip_special_tokens=True))


Summary: User greeted and inquired about assistance for 12345 12345.User asked for assistance for assistance with order 12345 and asked for help canceling.User canceled the order and thanked the user for assistance.User and assistance were greeted and greeted by and assistance for the 12345, and asked about help for other orders.
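
The tokenizer call in this step truncates input at 1,024 tokens, so anything past that point in a long chat never reaches the model. One common workaround, which is not part of the notebook above, is to split the chat into chunks, summarize each chunk, and then join or re-summarize the results. The sketch below covers only the splitting step. It assumes one chat turn per line and uses a word budget as a rough stand-in for the token limit; `chunk_chat` and the default `max_words=600` are hypothetical choices.

```python
def chunk_chat(chat, max_words=600):
    """Group whole chat turns (one per line) into chunks of at most max_words words.

    A single turn longer than max_words is kept whole as its own oversized chunk.
    """
    chunks, current, count = [], [], 0
    for turn in chat.splitlines():
        n_words = len(turn.split())
        # Start a new chunk when adding this turn would exceed the budget
        if current and count + n_words > max_words:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(turn)
        count += n_words
    if current:
        chunks.append("\n".join(current))
    return chunks

# Five four-word turns with a budget of eight words per chunk
demo_chat = "\n".join(f"User: message number {i}" for i in range(5))
demo_chunks = chunk_chat(demo_chat, max_words=8)
print(len(demo_chunks))  # 3
```

Each chunk can then be passed through the tokenize, generate, and decode calls from this step in place of the full transcript.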
