# Introduction 

This notebook presents a comprehensive guide to fine-tuning Facebook's BART (Bidirectional and Auto-Regressive Transformers) model for the task of summarizing chat conversations. 

The notebook walks through data preparation, tokenization, fine-tuning, and ROUGE-based evaluation.

Fine-tuning uses three datasets:

1. **DialogSum Dataset:** A specialized dataset for dialogue summarization, offering diverse conversational examples.

2. **SAMSum Dataset by Samsung:** This dataset comprises scripted chat conversations with associated human-written summaries, providing a rich basis for training and validating summarization models.

3. **Custom Dataset:** A personally curated dataset, designed to include a variety of chat styles and topics, ensuring robustness and versatility in the model's performance.


In [1]:
# Importing necessary libraries
import json
import pandas as pd
import random
from transformers import BartTokenizer, BartForConditionalGeneration, Trainer, TrainingArguments
from torch.utils.data import Dataset
import torch



We install the `rouge_score` package so that the summaries generated by the model can be evaluated with ROUGE.

In [2]:
!pip install rouge_score






# 1. DialogSum dataset

### Defining functions to read and clean the dataset

**Note**

- To enhance the model's applicability to real-world scenarios, the original DialogSum dataset was modified by replacing generic placeholders 'Person1' and 'Person2' with a diverse list of names. 
- This adaptation ensures that the model is trained on data more representative of actual conversation patterns, thereby improving its practical utility and performance in real-world applications.
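
The cell that defines `read_jsonl_to_dataframe`, `replace_names`, and `names_list` is not shown in this notebook. The sketch below is one possible implementation, not the original code. It assumes that each line of the JSONL file is a JSON object with `dialogue` and `summary` fields and that speakers appear as the placeholder strings `#Person1#` and `#Person2#`. The `placeholders` and `seed` parameters and the short `names_list` are illustrative assumptions.

```python
import json
import random

import pandas as pd

# Illustrative subset only; the original list of names is not shown in the notebook.
names_list = ["Aaron", "Bob", "Diana", "Edward", "Kylie", "Samuel", "Ursula", "Zelda"]


def read_jsonl_to_dataframe(file_path):
    """Read a JSON Lines file (one JSON object per line) into a DataFrame."""
    records = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return pd.DataFrame(records)


def replace_names(df, names, placeholders=("#Person1#", "#Person2#"), seed=None):
    """Return a copy of df in which each placeholder is replaced by a random name.

    The same name is used in the dialogue and in its summary, and the names
    drawn for one row are distinct from each other.
    """
    rng = random.Random(seed)
    out = df.copy()
    dialogues, summaries = [], []
    for dialogue, summary in zip(out["dialogue"], out["summary"]):
        chosen = rng.sample(names, len(placeholders))
        for placeholder, name in zip(placeholders, chosen):
            dialogue = dialogue.replace(placeholder, name)
            summary = summary.replace(placeholder, name)
        dialogues.append(dialogue)
        summaries.append(summary)
    out["dialogue"] = dialogues
    out["summary"] = summaries
    return out
```

Drawing the names per row, rather than once for the whole dataset, keeps the model from associating a fixed pair of names with every conversation.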

### Training set of DialogSum

In [4]:
# Path to JSONL file
train_file_path = '/kaggle/input/dialogue-chat/dialogsum.train.jsonl'

# Reading the JSONL file and creating a DataFrame
train_df1 = read_jsonl_to_dataframe(train_file_path)

# Replacing names in the DataFrame
train_df1 = replace_names(train_df1, names_list)

# Displaying the first few rows of the DataFrame with replaced names
train_df1

index,dialogue,summary
0,"Ursula: Hi, Mr. Smith. I'm Doctor Hawkins. Why...","Mr. Smith's getting a check-up, and Doctor Haw..."
1,"Queenie: Hello Mrs. Parker, how have you been?...",Mrs Parker takes Ricky for his vaccines. Dr. P...
2,"Bob: Excuse me, did you see a set of keys?\nSe...",Bob's looking for a set of keys and asks for S...
3,Zelda: Why didn't you tell me you had a girlfr...,Zelda's angry because Victor didn't tell Zelda...
4,"Samuel: Watsup, ladies! Y'll looking'fine toni...",Malik invites Nikki to dance. Nikki agrees if ...
...,...,...
12455,Diana: Excuse me. You are Mr. Green from Manch...,Tan Ling picks Mr. Green up who is easily reco...
12456,Samuel: Mister Ewing said we should show up at...,Samuel and Bob plan to take the underground to...
12457,Edward: How can I help you today?\nAaron: I wo...,Aaron rents a small car for 5 days with the he...
12458,Kylie: You look a bit unhappy today. What's up...,Zach's mom lost her job. Zach hopes mom won't ...


### Validation set of DialogSum

In [3]:
from datasets import load_dataset

billsum = load_dataset("Kyudan/GTNT_8.25M")

In [4]:
split_dataset = billsum['train'].train_test_split(test_size=0.2)

# Check the train and test datasets
train_dataset = split_dataset['train']
test_dataset = split_dataset['test']

In [5]:
from transformers import BartTokenizer, BartForConditionalGeneration

# Initialize the tokenizer for BART
# 'facebook/bart-base' is a pretrained model identifier
# The tokenizer is responsible for converting text input into tokens that the model can understand
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')

# Initialize the BART model for conditional generation
# This model is used for tasks like summarization where the output is conditional on the input text
# The model is loaded with pretrained weights from 'facebook/bart-base'
model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')



In [6]:
def preprocess_function(examples):
    # Prepends the string "translate: " to each entry in the 'NT' field of the input examples.
    # Task prefixes are a T5 convention; BART was not pretrained with them, so here the prefix is simply part of the input text.
    inputs = ["translate: " + doc for doc in examples["NT"]]

    # Tokenizes the prefixed input texts to convert them into a format that can be fed into the BART model.
    # Sets a maximum token length of 256, and truncates any text longer than this limit.
    model_inputs = tokenizer(inputs, max_length=256, truncation=True)

    # Tokenizes the 'GT' field of the input examples to prepare the target labels.
    # Sets a maximum token length of 256, and truncates any text longer than this limit.
    labels = tokenizer(text_target=examples["GT"], max_length=256, truncation=True)

    # Assigns the tokenized labels to the 'labels' field of model_inputs.
    # The 'labels' field is used during training to calculate the loss and guide model learning.
    model_inputs["labels"] = labels["input_ids"]

    # Returns the prepared inputs and labels as a single dictionary, ready for training.
    return model_inputs

In [7]:
tokenized_billsum = split_dataset.map(preprocess_function, batched=True)

Map:  29%|██▉       | 1913000/6602527 [07:41<18:51, 4144.51 examples/s]


KeyboardInterrupt: 

# Fine-tuning the model

In [17]:
from transformers import TrainingArguments

# Define training arguments for the model
training_args = TrainingArguments(
    output_dir='./results',          # Directory to save model output and checkpoints
    num_train_epochs=2,              # Number of epochs to train the model
    per_device_train_batch_size=8,   # Batch size per device during training
    per_device_eval_batch_size=8,    # Batch size for evaluation
    warmup_steps=500,                # Number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # Weight decay for regularization
    logging_dir='./logs',            # Directory to save logs
    logging_steps=10,                # Log metrics every specified number of steps
    evaluation_strategy="epoch",     # Evaluation is done at the end of each epoch
    report_to='none'                 # Disables reporting to any online services (e.g., TensorBoard, WandB)
)
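
Because the arguments above set `warmup_steps=500` but neither `learning_rate` nor `lr_scheduler_type`, the run relies on the `TrainingArguments` defaults, which in current `transformers` releases are a peak learning rate of 5e-5 and a linear schedule. The plain-Python sketch below shows the shape of that schedule under those assumptions. The function name `linear_lr_with_warmup` is hypothetical, and `total_steps=3678` is taken from the training output reported later in this notebook.

```python
def linear_lr_with_warmup(step, total_steps, warmup_steps=500, peak_lr=5e-5):
    """Learning rate at a given optimizer step.

    The rate rises linearly from 0 to peak_lr over warmup_steps,
    then decays linearly to 0 at total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 3678  # global_step reported by trainer.train() in this notebook
for step in (0, 250, 500, 2000, total_steps):
    print(step, linear_lr_with_warmup(step, total_steps))
```

With only 3678 steps in total, the 500 warmup steps cover roughly the first 14% of training, so the warmup length is worth revisiting whenever the dataset size or batch size changes.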

In [18]:
# Initializing the Trainer object
trainer = Trainer(
    model=model,             # The model to be trained (e.g., our BART model)
    args=training_args,      # Training arguments specifying training parameters like learning rate, batch size, etc.
    train_dataset=train_dataset,  # The dataset to be used for training the model
    eval_dataset=valid_dataset    # The dataset to be used for evaluating the model during training
)

# Starting the training process
trainer.train()



Epoch,Training Loss,Validation Loss
1,0.0995,0.086062
2,0.0834,0.081724




TrainOutput(global_step=3678, training_loss=0.4837404892229399, metrics={'train_runtime': 5644.8729, 'train_samples_per_second': 10.423, 'train_steps_per_second': 0.652, 'total_flos': 1.793783686496256e+16, 'train_loss': 0.4837404892229399, 'epoch': 2.0})
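
The throughput figures in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below uses only numbers printed above plus the per-device batch size from `TrainingArguments`. The ratio of samples per second to steps per second comes out close to 16, which would be consistent with two devices at a per-device batch size of 8. That is an inference from the log, since the notebook does not state the hardware.

```python
import math

# Values copied from the TrainOutput above.
train_runtime_s = 5644.8729
samples_per_second = 10.423
steps_per_second = 0.652
global_step = 3678
num_epochs = 2
per_device_batch_size = 8  # from TrainingArguments

# Samples consumed per optimizer step, i.e. the effective batch size.
samples_per_step = samples_per_second / steps_per_second
effective_batch_size = round(samples_per_step)

# Number of devices implied by the effective batch size (an inference, not a logged value).
implied_devices = effective_batch_size // per_device_batch_size

# Steps per epoch, from the logged step count and from the throughput estimate.
steps_per_epoch = global_step // num_epochs
approx_examples_per_epoch = samples_per_second * train_runtime_s / num_epochs
expected_steps_per_epoch = math.ceil(approx_examples_per_epoch / effective_batch_size)

print(f"samples per step:          {samples_per_step:.2f}")
print(f"implied devices:           {implied_devices}")
print(f"steps per epoch (logged):  {steps_per_epoch}")
print(f"steps per epoch (derived): {expected_steps_per_epoch}")
```

The two step counts agree, which suggests the logged throughput and step count describe the same run of roughly 29,000 training examples per epoch.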

# Model Evaluation using ROUGE Score

In [19]:
from datasets import load_metric
from torch.utils.data import DataLoader

# Load the ROUGE metric for evaluation
rouge = load_metric('rouge')

def generate_summaries(model, tokenizer, dataset, batch_size=8):
    """
    Generate summaries using the provided model and tokenizer on the given dataset.

    Args:
        model: The trained summarization model.
        tokenizer: Tokenizer associated with the model.
        dataset: Dataset for which summaries need to be generated.
        batch_size: Number of data samples to process in each batch.

    Returns:
        summaries: Generated summaries by the model.
        references: Actual summaries from the dataset for comparison.
    """
    # Set model to evaluation mode
    model.eval()
    summaries = []    # List to store generated summaries
    references = []   # List to store actual summaries

    # Create a DataLoader for batch processing
    dataloader = DataLoader(dataset, batch_size=batch_size)

    # Disable gradient calculations for efficiency
    with torch.no_grad():
        for batch in dataloader:
            # Move input data to the same device as the model
            input_ids = batch['input_ids'].to(model.device)
            attention_mask = batch['attention_mask'].to(model.device)

            # Generate summaries with the model
            outputs = model.generate(input_ids, attention_mask=attention_mask, max_length=2048, num_beams=2)
            batch_summaries = [tokenizer.decode(ids, skip_special_tokens=True) for ids in outputs]

            # Append generated and actual summaries to the respective lists
            summaries.extend(batch_summaries)
            references.extend(batch['summary'])

    return summaries, references

# Generate summaries for the validation dataset
generated_summaries, actual_summaries = generate_summaries(model, tokenizer, valid_dataset, batch_size=8)

# Compute and print the ROUGE score for evaluation
rouge_score = rouge.compute(predictions=generated_summaries, references=actual_summaries)
print(rouge_score)


{'rouge1': AggregateScore(low=Score(precision=0.5203167404161397, recall=0.454729290212834, fmeasure=0.4632587632485201), mid=Score(precision=0.5354479435805708, recall=0.46890642945251565, fmeasure=0.47534285466245074), high=Score(precision=0.5502350562403443, recall=0.4824408056039646, fmeasure=0.48742275414871145)), 'rouge2': AggregateScore(low=Score(precision=0.25078879710302293, recall=0.21609817018001448, fmeasure=0.22050781070275272), mid=Score(precision=0.26568774609184886, recall=0.2292050456292191, fmeasure=0.2331994917519589), high=Score(precision=0.2808212845633841, recall=0.24289826444917545, fmeasure=0.24596704167812236)), 'rougeL': AggregateScore(low=Score(precision=0.43189569859182625, recall=0.3784438094396151, fmeasure=0.38432840225294457), mid=Score(precision=0.4465577989373004, recall=0.39079956181403797, fmeasure=0.39648919210982875), high=Score(precision=0.46139814838987636, recall=0.4039602038430952, fmeasure=0.4090468688096221)), 'rougeLsum': AggregateScore(low=

### Displaying the ROUGE scores to better understand the results

In [21]:
rouge_scores = {
    'rouge1': {
        'low': {'precision': 0.5203, 'recall': 0.4547, 'fmeasure': 0.4632},
        'mid': {'precision': 0.5354, 'recall': 0.4689, 'fmeasure': 0.4753},
        'high': {'precision': 0.5502, 'recall': 0.4824, 'fmeasure': 0.4874}
    },
    'rouge2': {
        'low': {'precision': 0.2507, 'recall': 0.2160, 'fmeasure': 0.2205},
        'mid': {'precision': 0.2656, 'recall': 0.2292, 'fmeasure': 0.2331},
        'high': {'precision': 0.2808, 'recall': 0.2428, 'fmeasure': 0.2459}
    },
    'rougeL': {
        'low': {'precision': 0.4318, 'recall': 0.3784, 'fmeasure': 0.3843},
        'mid': {'precision': 0.4465, 'recall': 0.3907, 'fmeasure': 0.3964},
        'high': {'precision': 0.4613, 'recall': 0.4039, 'fmeasure': 0.4090}
    },
    'rougeLsum': {
        'low': {'precision': 0.4324, 'recall': 0.3770, 'fmeasure': 0.3830},
        'mid': {'precision': 0.4463, 'recall': 0.3903, 'fmeasure': 0.3960},
        'high': {'precision': 0.4616, 'recall': 0.4031, 'fmeasure': 0.4075}
    }
}

# Convert the nested dictionary into a Pandas DataFrame
scores = pd.DataFrame.from_dict({(i, j): rouge_scores[i][j] 
                            for i in rouge_scores.keys() 
                            for j in rouge_scores[i].keys()},
                            orient='index')

# Set column names for readability
scores.columns = ['Precision', 'Recall', 'F-Measure']

# Display the DataFrame
scores

metric,bound,Precision,Recall,F-Measure
rouge1,low,0.5203,0.4547,0.4632
rouge1,mid,0.5354,0.4689,0.4753
rouge1,high,0.5502,0.4824,0.4874
rouge2,low,0.2507,0.216,0.2205
rouge2,mid,0.2656,0.2292,0.2331
rouge2,high,0.2808,0.2428,0.2459
rougeL,low,0.4318,0.3784,0.3843
rougeL,mid,0.4465,0.3907,0.3964
rougeL,high,0.4613,0.4039,0.409
rougeLsum,low,0.4324,0.377,0.383
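
The table above was built from values copied by hand. As an alternative, the result returned by `rouge.compute` could be converted directly, which avoids transcription slips. The sketch below assumes the result is a dictionary of objects with `low`, `mid`, and `high` attributes, each carrying `precision`, `recall`, and `fmeasure`, as the printed output suggests. The `Score` and `AggregateScore` namedtuples are stand-ins defined here so the example runs on its own, and `rouge_to_dataframe` is a hypothetical helper.

```python
from collections import namedtuple

import pandas as pd

# Stand-ins that mirror the field names visible in the printed ROUGE output.
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])
AggregateScore = namedtuple("AggregateScore", ["low", "mid", "high"])


def rouge_to_dataframe(rouge_result, decimals=4):
    """Flatten {metric: AggregateScore} into a DataFrame indexed by (metric, bound)."""
    keys, rows = [], []
    for metric, aggregate in rouge_result.items():
        for bound in ("low", "mid", "high"):
            score = getattr(aggregate, bound)
            keys.append((metric, bound))
            rows.append({
                "Precision": score.precision,
                "Recall": score.recall,
                "F-Measure": score.fmeasure,
            })
    index = pd.MultiIndex.from_tuples(keys, names=["metric", "bound"])
    return pd.DataFrame(rows, index=index).round(decimals)


# ROUGE-1 values taken from the output printed earlier in this notebook.
example_result = {
    "rouge1": AggregateScore(
        low=Score(0.5203167404161397, 0.454729290212834, 0.4632587632485201),
        mid=Score(0.5354479435805708, 0.46890642945251565, 0.47534285466245074),
        high=Score(0.5502350562403443, 0.4824408056039646, 0.48742275414871145),
    ),
}

scores = rouge_to_dataframe(example_result)
print(scores)
```

Rounded values can differ in the last digit from the hand-copied ones, which appear to have been truncated rather than rounded.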


### Testing the model on a user-supplied conversation

In [22]:
# Check if CUDA (GPU support) is available and choose the device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the chosen device
model = model.to(device)

In [23]:
def summarize_text(text, max_length=1024):
    """
    Generates a summary for the given text using the fine-tuned model.

    Args:
        text (str): The text to be summarized.
        max_length (int): The maximum number of input tokens; longer inputs are truncated.
            BART accepts at most 1024 input positions.

    Returns:
        str: The generated summary of the input text.
    """
    # Encode the input text using the tokenizer. The 'pt' indicates PyTorch tensors.
    # Truncation keeps the input within the model's 1024-token position limit.
    inputs = tokenizer.encode(text, return_tensors="pt", max_length=max_length, truncation=True)
    
    # Move the encoded text to the same device as the model (e.g., GPU or CPU)
    inputs = inputs.to(device)

    # Generate summary IDs with the model. num_beams controls the beam search width.
    # early_stopping is set to False for a thorough search, though it can be set to True for faster results.
    summary_ids = model.generate(inputs, max_length=2000, num_beams=30, early_stopping=False)

    # Decode the generated IDs back to text, skipping special tokens like padding or EOS.
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # Return the generated summary
    return summary

In [31]:
# Prompt the user to enter text for summarization
text = input('Enter the text: ')
print()

# Call the summarize_text function to generate a summary of the input text
summary = summarize_text(text)

# Print the generated summary
print(summary)

Enter the text:  Web Developer (You): Hey, I just launched a new website with some exciting features. Would you like to check it out? Machine Learning Enthusiast: That sounds interesting! I'd love to see how you've integrated machine learning into it. Computer Science Student: Speaking of machine learning, have you heard about the latest breakthroughs in natural language processing? Science Enthusiast: Yes, I've been following those developments closely. It's amazing how AI is transforming language understanding. Mathematics Enthusiast: Absolutely! The mathematical foundations of deep learning play a crucial role in these advancements. News Enthusiast: By the way, did you catch the latest headlines? There's a lot happening in the world right now. Web Developer (You): I did! In fact, my website can recommend personalized news articles based on user preferences. Clinical Medical Assistant: That's impressive! Speaking of recommendations, have you worked on any projects related to healthca


## Conclusion

In this notebook, I fine-tuned the BART Base model for summarizing chat conversations. The model showed promising performance, especially in capturing the essence of dialogues. The findings revealed the model's strengths and areas of improvement as follows:

1. **Consistency in Capturing Key Points:** ROUGE-1 measures unigram overlap with the reference summaries. The mid estimates (precision 0.5354, recall 0.4689; upper confidence bounds 0.5502 and 0.4824) indicate that roughly half of the words in the generated summaries also appear in the references, and slightly under half of the reference words are recovered. This suggests that the generated summaries generally cover the main topics of the conversations.

2. **Phrase-Level Overlap:** ROUGE-2 measures bigram overlap, a stricter test than single-word matching. The mid estimates (precision 0.2656, recall 0.2292; upper bounds 0.2808 and 0.2428) are lower than the ROUGE-1 scores, as expected, and show that the model reproduces the exact phrasing of the references less often than their vocabulary.

3. **Sequence-Level Overlap:** ROUGE-L and ROUGE-Lsum are based on the longest common subsequence between generated and reference summaries. With ROUGE-L mid estimates of 0.4465 precision and 0.3907 recall (upper bounds 0.4613 and 0.4039), the generated summaries follow the word order of the references to a moderate degree.

- While the model is effective at summarizing chat conversations, there is room for improvement, particularly in matching the references at the phrase level, as the ROUGE-2 scores suggest.
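
To make the ROUGE-N figures discussed here more concrete, the following is a simplified sketch of how n-gram precision, recall, and F-measure are computed for a single prediction and reference. It uses lowercase whitespace tokenization and no stemming, so its numbers will not match the `rouge_score` package exactly. The function names and the example sentences are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(prediction, reference, n=1):
    """Simplified ROUGE-N: n-gram overlap between one prediction and one reference."""
    pred_counts = Counter(ngrams(prediction.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))

    # Each n-gram counts as matched at most as often as it occurs in both texts.
    overlap = sum((pred_counts & ref_counts).values())

    precision = overlap / max(1, sum(pred_counts.values()))
    recall = overlap / max(1, sum(ref_counts.values()))
    if precision + recall == 0:
        return precision, recall, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


prediction = "bob lost his keys"
reference = "bob is looking for his keys"

print("ROUGE-1:", rouge_n(prediction, reference, n=1))
print("ROUGE-2:", rouge_n(prediction, reference, n=2))
```

In this toy case three of the four predicted words appear in the reference, but only one of the three predicted bigrams does, which mirrors the gap between the ROUGE-1 and ROUGE-2 scores reported for the model.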

I welcome any feedback or questions in the comments and am open to collaborations on similar projects. For further discussions or networking opportunities, feel free to connect with me on [LinkedIn](https://www.linkedin.com/in/farneet-singh-6b155b208/).