In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
nileshmalode1_samsum_dataset_text_summarization_path = kagglehub.dataset_download('nileshmalode1/samsum-dataset-text-summarization')

print('Data source import complete.')


<div style="font-family: Calibri, serif; text-align: center;">
    <hr style="border: none;
               border-top: 15px solid #1d3580;
               width: 100%;
               margin-bottom: 20px;
               margin-left: 45;
               height: 20%"><br>
    <img src = "https://i.imgur.com/r3fweff.png" width = "256" height = "256">
    <br><br><br><br><br>
    <div style="font-size: 62px; color: #02011a"><b>Text Summarization with<br>Large Language Models</b></div><br>
        <hr style="border: none;
               border-top: 15px solid #1d3580;
               width: 100%;
               margin-bottom: 20px;
               margin-left: 45;
               height: 20%"> <br>
    <div style="font-weight: bold;
                text-transform: uppercase;
                margin-top: 20px;
                letter-spacing: 2.5px;
                color: #02011a;
                ">2023 | <a href ="https://www.kaggle.com/lusfernandotorres/">© Luis Fernando Torres</a></div>
</div>

<div style="font-family: Calibri, serif; text-align: left;">
    <hr style="border: none;
               border-top: 2px solid #041445;
               width: 100%;
               margin-top: 30px;
               margin-bottom: 20px;
               margin-left: 0;">
    <div style="font-size: 16px; letter-spacing: 1.5px; color: #02011a"><b>Table of Contents</b></div>
</div>

- [Introduction](#intro)<br><br>
    - [The Transformer Architecture](#transformers)<br><br>
- [This Notebook](#this_notebook)<br><br>
    - [The Task](#task)<br><br>
    - [The Dataset](#data)<br><br>
    - [The Model](#model)<br><br>
    - [Evaluation Metrics](#eval)<br><br>    
- [Exploring the Dataset](#eda)<br><br>
    - [Train Dataset](#train)<br><br>
    - [Test Dataset](#test)<br><br>
    - [Validation Dataset](#val)<br><br>
- [Preprocessing Data](#preprocess)<br><br>
- [Modeling](#modeling)<br><br>
- [Evaluating and Saving Model](#evaluating)<br><br>
- [Conclusion and Deployment](#conclusion)<br><br>

<div id = 'intro'
     style="font-family: Calibri, serif; text-align: left;">
    <hr style="border: none;
               border-top: 2.85px solid #041445;
               width: 100%;
               margin-top: 62px;
               margin-bottom: auto;
               margin-left: 0;">
    <div style="font-size: 56px; letter-spacing: 2.25px;color: #02011a;"><b>Introduction</b></div>
</div>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">November 30<sup>th</sup>, 2022, marks a significant chapter in the history of <b>machine learning</b>. It was the day OpenAI released ChatGPT, setting a new benchmark for chatbots powered by <b>Large Language Models</b> and offering the public an unparalleled conversational experience.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Ever since then, large language models, also referred to as <b>LLMs</b>, have been in the public eye because of the wide range of tasks they can perform. Examples include:</p>

<div style = "margin-left: 25px;">
    
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a"><b>• Text Summarization</b>: These models can summarize long texts, including legal documents, reviews, and dialogues, among many others.</p>
    
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a"><b>• Sentiment Analysis</b>: They can read through reviews of products and services and classify them as positive, negative, or neutral. They can also be used in finance to gauge whether the general public feels <i>Bullish</i> or <i>Bearish</i> about certain securities.</p>
    
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a"><b>• Language Translation</b>: They can provide real-time translations from one language to another.</p>
    
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a"><b>• Text-based Recommender Systems</b>: They can also recommend new products to a client based on their reviews of previously bought products.</p>
</div>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">But how do these models actually work? 🤔</p>

<div id = 'transformers'
     style="font-family: Calibri, serif; text-align: left;">
    <hr style="border: none;
               width: 100%;
               margin-top: 62px;
               margin-bottom: auto;
               margin-left: 0;">
    <div style="font-size: 32px; letter-spacing: 2.25px;color: #02011a;"><b>The Transformer Architecture</b></div>
</div>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">To understand the current state of LLMs, we must go back to Google's 2017 paper <b>Attention Is All You Need</b>. This paper introduced the <b>Transformer</b> architecture to the world, and it changed the industry forever.</p>

<center>
    <img src = "https://d2mk45aasx86xg.cloudfront.net/Transformer_architecture_a1d5ffc1e9.webp" width = "512" height = "950">
<p style = "font-size: 16px;
            font-family: 'Georgia', serif;
            text-align: center;
            margin-top: 10px;">Transformer architecture. <br>Source: <a href = "https://www.turing.com/kb/brief-introduction-to-transformers-and-their-power">Turing</a></p>
</center>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">While recurrent neural networks could be used to enable computers to comprehend text, these models were limited because they process a text one word at a time, in sequence. Information from earlier words tends to fade by the time the model reaches later ones, so the model struggles to capture the full context of a long text.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The <b>transformer architecture</b>, however, is based on the attention mechanism, which allows the model to process an entire sentence or paragraph at once, rather than one word at a time. Because every token can draw on every other token in the input, the model can take the full context into account, which gives much more power to all these language processing models.</p>
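
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">To make the idea of attention more concrete, here is a minimal sketch of scaled dot-product attention, the core operation described in the paper, written with NumPy. It is an illustration under simplifying assumptions, not the full architecture: the function name <code>scaled_dot_product_attention</code> and the random toy vectors are my own, and the learned query, key, and value projections, the multiple heads, and the masking used in real models are left out.</p>

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Return the attention output and the attention weights for one sequence.

    q, k, v are arrays of shape (seq_len, d_k).
    """
    d_k = q.shape[-1]
    # Similarity of every token with every other token, scaled by sqrt(d_k)
    scores = q @ k.T / np.sqrt(d_k)
    # Softmax over each row (subtracting the row maximum for numerical stability)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of all value vectors
    return weights @ v, weights

rng = np.random.default_rng(42)
x = rng.normal(size=(5, 8))  # 5 toy token vectors with 8 dimensions each (assumed sizes)

# Self-attention without learned projections: queries, keys, and values are all x
attn_output, attn_weights = scaled_dot_product_attention(x, x, x)

print(attn_output.shape)
print(attn_weights.sum(axis=-1))
```

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Each row of <code>attn_weights</code> sums to 1 and indicates how much every token in the sequence contributes to the new representation of a given token. All rows are computed in a single matrix operation, which is what allows the whole sequence to be processed at once.</p>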

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The processing of text input with the transformer architecture is based on <mark style="background-color:#0d0259;
          color:white;
          border-radius:4px;
          opacity:1.0"><b>tokenization</b></mark>, which is the process of splitting text into smaller units called tokens. These can be words, subwords, or individual characters.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The tokens are then mapped to numerical IDs, which are unique for each word or subword. Each ID is then transformed into an <b>embedding</b>: a dense, high-dimensional vector that contains numerical values. These values are designed to capture the original meaning of the tokens and serve as input for the transformer model.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">It is important to note that these embeddings are high-dimensional, with each dimension capturing certain aspects of a token’s meaning. Due to their high-dimensional nature, embeddings are not easily interpreted by humans, but transformer models readily use them to identify and group together tokens with similar meanings in the vector space.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Take the following example: </p>

<table style="font-family: Calibri, serif; font-size: 18px; letter-spacing: .85px;color: #02011a">
  <tr>
    <th><b>Original Text</b></th>
    <th><b>Tokenized Text</b></th>
    <th><b>Numerical IDs</b></th>
    <th><b>Embeddings (First 3 Dimensions)</b></th>
  </tr>
  <tr>
    <td>As she said this, she looked down at her hands, and was surprised to find<br> that she had put on one of the rabbit's little gloves while she was talking.</td>
    <td>['As', ' she', ' said', ' this', ',', ' she', ' looked', ' down', ' at', ' her', ' hands', ',', ' and', ' was', ' surprised', <br>' to', ' find',  ' that', ' she', ' had', ' put', ' on', ' one', ' of', ' the', ' rabbit', "'s", ' little', ' gloves', <br>' while', ' she', ' was', ' talking', '.']</td>
    <td>[7, 22, 258, 430, $\dots$, 589, 22, 78, 98, 5890]</td>
    <td>{'As': [1.12, -0.56, 0.07], <br> ' she': [0.88, 0.45, -2.03], <br> $\dots$ <br>}</td>
  </tr>
</table>
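
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The three steps in the table (text to tokens, tokens to IDs, IDs to embeddings) can be sketched in a few lines of plain Python. This is a toy version built on assumptions of my own: the regular-expression tokenizer, the vocabulary built on the fly, and the random 3-dimensional vectors are stand-ins. Real tokenizers use learned subword vocabularies, and real embeddings are learned during training and have hundreds of dimensions.</p>

```python
import random
import re

text = "As she said this, she looked down at her hands"

# Step 1 - Tokenization: a toy tokenizer that splits words and punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", text)

# Step 2 - Numerical IDs: every distinct token gets one integer ID (first-seen order)
vocab = {}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))
ids = [vocab[tok] for tok in tokens]

# Step 3 - Embeddings: a lookup table with one vector per ID
# (random values here; a trained model learns them)
random.seed(42)
embedding_dim = 3
embedding_table = [
    [round(random.uniform(-1, 1), 2) for _ in range(embedding_dim)]
    for _ in vocab
]
embeddings = [embedding_table[i] for i in ids]

print(tokens)
print(ids)
print(embeddings[1], embeddings[5])  # both occurrences of 'she' share one vector
```

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Note that the repeated word <i>she</i> maps to the same ID and therefore to the same embedding vector both times it appears.</p>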

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">By using this vector as input, the transformer model learns how to generate outputs based on the <b>probabilities of subsequent words that may naturally follow an input word</b>. This process gets repeated until the model creates an entire paragraph starting from an initial statement.</p>
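
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The generation loop described above can be sketched as follows. Everything in this example is hypothetical: the <code>toy_logits</code> table holds made-up scores, and it only looks at the last word, whereas a transformer computes its scores from the whole preceding context. The sketch simply shows how raw scores become probabilities (softmax) and how the most probable word is appended, step by step, until an end marker is chosen.</p>

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores: for each current word, a raw score for each candidate next word
toy_logits = {
    "the": {"rabbit": 2.0, "gloves": 1.0, "<end>": -1.0},
    "rabbit": {"ran": 1.5, "<end>": 0.5},
    "gloves": {"<end>": 1.0},
    "ran": {"<end>": 2.0},
}

def generate(start, max_steps=10):
    """Greedy generation: always append the most probable next word."""
    words = [start]
    for _ in range(max_steps):
        candidates = toy_logits[words[-1]]
        probs = softmax(list(candidates.values()))
        best_word = max(zip(candidates, probs), key=lambda pair: pair[1])[0]
        if best_word == "<end>":
            break
        words.append(best_word)
    return words

print(generate("the"))  # ['the', 'rabbit', 'ran']
```

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Always picking the most probable word is known as greedy decoding. Other strategies, such as sampling from the probabilities or beam search, are also common.</p>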
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">There is a very intriguing post on Andrej Karpathy's blog, <a href = "http://karpathy.github.io/2015/05/21/rnn-effectiveness/"><i><b>The Unreasonable Effectiveness of Recurrent Neural Networks</b></i></a>, that explains why neural-network-based models are effective in predicting the next word of a text. One factor contributing to their effectiveness is the inherent <i>rules</i> in human languages, such as grammar, which constrain word usage in sentences.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">When you feed your model with examples of written language — news articles, Twitter/X posts, product reviews, messages, dialogues, etc. — it implicitly acquires the rules of language through these examples, which helps it to predict sequences of words and generate human-like texts.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">A large language model, such as <i>GPT</i>, <i>BERT</i>, or <i>RoBERTa</i>, is a transformer model on a much larger scale. These models are trained on enormous amounts of text, so they learn the patterns and structures of language. GPT-4, the model behind the premium version of ChatGPT, was trained on massive amounts of text data from the internet, such as books, articles, and websites.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">It is also relevant to note that different languages exhibit different patterns and structures. While Western European languages like English, French, German, Spanish, Portuguese, and Italian may share many structural similarities, other languages, such as Arabic and Japanese, are very distinct, posing unique challenges to modeling.</p>

<div id = 'this_notebook'
     style="font-family: Calibri, serif; text-align: left;">
    <hr style="border: none;
               border-top: 2.85px solid #041445;
               width: 100%;
               margin-top: 62px;
               margin-bottom: auto;
               margin-left: 0;">
    <div style="font-size: 56px; letter-spacing: 2.25px;color: #02011a;"><b>This Notebook</b></div>
</div>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The goal of this notebook is to demonstrate how Large Language Models can be used for several tasks related to language processing. In this case, I am going to leverage the power of <b>transfer learning</b> to build a model capable of summarizing dialogues.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">For those of you who may not be aware, transfer learning is a machine learning technique in which we take a <i>pre-trained model</i>, one that is already knowledgeable in a wide domain, and tailor its expertise to a specific task by training it further on a task-specific dataset. This process may also be referred to as <b>fine-tuning</b>.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The <a href = "https://huggingface.co/docs/transformers/index"><b>🤗 Transformers</b></a> library, one of the most popular libraries for working with deep learning tasks, supports the following architectures for summarization:</p>

<table style="font-family: Calibri, serif; font-size: 20px; letter-spacing: .85px;color: #02011a; float: left;">
  <tr>
    <th><b>Model Architectures</b></th>
  </tr>
  <tr>
    <td>BART, BigBird-Pegasus, Blenderbot, BlenderbotSmall, Encoder decoder, <br> FairSeq Machine-Translation, GPTSAN-japanese, LED, LongT5, M2M100, Marian, <br>mBART, MT5, MVP, NLLB, NLLB-MOE, Pegasus, <br>PEGASUS-X, PLBart, ProphetNet, SwitchTransformers, T5, UMT5, <br>XLM-ProphetNet</td>
  </tr>
</table>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The <b>🤗 Transformers</b> library allows us to easily download and fine-tune state-of-the-art pre-trained models, and also allows us to easily work with both <b>TensorFlow</b> and <b>PyTorch</b> for several tasks related to Natural Language Processing, Computer Vision, Audio, etc.</p>

<div id = 'task'
     style="font-family: Calibri, serif; text-align: left;">
    <hr style="border: none;
               width: 100%;
               margin-top: 62px;
               margin-bottom: auto;
               margin-left: 0;">
    <div style="font-size: 32px; letter-spacing: 2.25px;color: #02011a;"><b>The Task</b></div>
</div>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">As previously mentioned, the task at hand is <b>Text Summarization</b>. From the documentation of the 🤗 Transformers library, summarization can be described as the creation of <i>a shorter version of a document or an article that captures all the important information</i>.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">In this case, we are going to summarize dialogues by using a dataset containing chat texts.</p>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">For this task, we are going to use the <a href = "https://www.kaggle.com/datasets/nileshmalode1/samsum-dataset-text-summarization/versions/1"><b>SamSum Dataset</b></a>, which contains three <i>csv</i> files for training, testing, and validation. Each file has three columns: an <code>id</code>, a <code>dialogue</code>, and a <code>summary</code>. The SamSum dataset consists of chat texts, which makes it ideal for the summarization of dialogues.</p>
     

<div id = 'model'
     style="font-family: Calibri, serif; text-align: left;">
    <hr style="border: none;
               width: 100%;
               margin-top: 62px;
               margin-bottom: auto;
               margin-left: 0;">
    <div style="font-size: 32px; letter-spacing: 2.25px;color: #02011a;"><b>The Model</b></div>
</div>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">As previously mentioned, we are going to harness the power of a pre-trained model for this task. In this case, I have decided to use the <b>BART</b> architecture, proposed in the 2019 paper <a href = "https://arxiv.org/abs/1910.13461">BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension</a>. More specifically, I am going to fine-tune a version of BART that has already been trained to perform text summarization of news articles, which is the <a href ="https://huggingface.co/facebook/bart-large-xsum"><b>facebook/bart-large-xsum</b></a> version.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Briefly, BART is a denoising autoencoder: during pre-training, the input text is corrupted in several ways, such as masking some words or shuffling the order of sentences, and the model learns to reconstruct the original text. According to the paper, BART matches the performance of RoBERTa on comprehension benchmarks such as GLUE and SQuAD, and it is especially effective in summarization tasks, due to its ability to generate text and learn the context of the input text.</p>
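
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">To give an intuition of what "corrupting the input" means, the sketch below applies two of the noising strategies described in the BART paper, token masking and sentence permutation, to a made-up text. The helper names <code>mask_tokens</code> and <code>permute_sentences</code>, the example sentences, and the 30% masking probability are assumptions chosen for illustration. This is not the pre-training code used by the authors, and it does not cover the other strategies in the paper (token deletion, text infilling, and document rotation).</p>

```python
import random

def mask_tokens(tokens, mask_prob, rng, mask_token="<mask>"):
    """Replace each token with mask_token with probability mask_prob."""
    return [mask_token if rng.random() < mask_prob else tok for tok in tokens]

def permute_sentences(sentences, rng):
    """Return the sentences in a random order, leaving the input list untouched."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(42)

# Made-up example text, split into sentences
original_sentences = [
    "The apple is green .",
    "It fell from the tree .",
    "Anna picked it up .",
]
original_tokens = " ".join(original_sentences).split()

# Corrupt the text: shuffle the sentences, then mask some of the tokens
noisy_sentences = permute_sentences(original_sentences, rng)
noisy_tokens = mask_tokens(" ".join(noisy_sentences).split(), mask_prob=0.3, rng=rng)

# During pre-training, the corrupted text is the input and the original is the target
print("Corrupted:", " ".join(noisy_tokens))
print("Original: ", " ".join(original_tokens))
```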
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">For a deeper comprehension of BART, I highly suggest you read the research paper linked above, where this architecture was first introduced.</p>

<div id = 'eval'
     style="font-family: Calibri, serif; text-align: left;">
    <hr style="border: none;
               width: 100%;
               margin-top: 62px;
               margin-bottom: auto;
               margin-left: 0;">
    <div style="font-size: 32px; letter-spacing: 2.25px;color: #02011a;"><b>Evaluation Metrics</b></div>
</div>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Evaluating performance for language models can be quite tricky, especially when it comes to text summarization. The goal of our model is to produce a short summary of a dialogue, while preserving all the important information within that dialogue.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">One of the quantitative metrics we can employ to evaluate performance is the <b>ROUGE Score</b>. It is one of the most widely used metrics for text summarization, and it evaluates performance by comparing a machine-generated summary to a human-written summary used for reference.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The similarities between both summaries are measured by analyzing the overlapping <i>n-grams</i>, either single words or sequences of words that are present in both summaries. These can be unigrams (ROUGE-1), where only the overlap of sole words is measured; bigrams (ROUGE-2), where we measure the overlap of two-word sequences; trigrams (ROUGE-3), where we measure the overlap of three-word sequences; etc. Besides that, we also have:</p>

<div style = "margin-left: 25px;">
    
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a"><b>• ROUGE-L</b>: It measures the <i>Longest Common Subsequence (LCS)</i> between the two summaries, which helps to capture content coverage of the machine-generated text. The words of a subsequence must appear in the same order in both summaries, but they do not need to be adjacent. If both summaries contain the words <i>"the apple is green"</i> in that order, they count as a match even if other words appear in between.</p>
    
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a"><b>• ROUGE-S</b>: It evaluates the overlap of skip-bigrams, which are pairs of words in their original order that permit gaps between them. This helps to measure the coherence of a machine-generated summary. For example, the skip-bigram <i>"apple green"</i> is matched in the phrase <i>"this apple is absolutely green"</i>, even though other words separate the two terms.</p>
</div>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">These scores range from 0 to 1, and they are often reported as percentages from 0 to 100, where 0 indicates no match and the maximum value indicates a perfect match between both summaries.</p>
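
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">To see how these scores come about, here is a simplified, from-scratch sketch of ROUGE-1, ROUGE-2, and ROUGE-L, reported as F1 scores. The function names are my own and the tokenization is a plain lowercase split on whitespace, so the numbers can differ from those of the <code>rouge-score</code> package, which applies its own tokenization and optional stemming. The two example sentences are the ones used in the descriptions above.</p>

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def f1(overlap, candidate_total, reference_total):
    """Harmonic mean of precision and recall; 0.0 when there is no overlap."""
    if overlap == 0 or candidate_total == 0 or reference_total == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)

def rouge_n(candidate, reference, n):
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    overlap = sum((cand & ref).values())  # each n-gram counts at most as often as in both texts
    return f1(overlap, sum(cand.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l(candidate, reference):
    cand, ref = candidate.lower().split(), reference.lower().split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))

reference = "the apple is green"               # human-written summary
candidate = "this apple is absolutely green"   # machine-generated summary

print(f"ROUGE-1: {rouge_n(candidate, reference, 1):.4f}")  # 0.6667
print(f"ROUGE-2: {rouge_n(candidate, reference, 2):.4f}")  # 0.2857
print(f"ROUGE-L: {rouge_l(candidate, reference):.4f}")     # 0.6667
```

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Three of the five candidate words appear in the reference (<i>apple</i>, <i>is</i>, <i>green</i>), which gives a precision of 3/5 and a recall of 3/4 for ROUGE-1. Only one bigram (<i>apple is</i>) is shared, which is why ROUGE-2 is much lower. Multiplying these values by 100 gives the 0 to 100 scale.</p>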
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Besides quantitative metrics, it is useful to use <b>human evaluation</b> to analyze the output of language models, since we are able to comprehend text in a way that a machine does not. So we might read the dialogue and then read the summary to check if it is an accurate summarization.</p>

In [None]:
!nvidia-smi # Checking GPU

In [None]:
!pip install transformers # Installing the transformers library (https://huggingface.co/docs/transformers/index)

In [None]:
!pip install datasets # Installing the datasets library (https://huggingface.co/docs/datasets/index)

In [None]:
!pip install evaluate # Installing the evaluate library (https://huggingface.co/docs/evaluate/main/en/index)

In [None]:
!pip install rouge-score # Installing rouge-score library (https://pypi.org/project/rouge-score/)

In [None]:
!pip install py7zr # Installing py7zr, a library for reading and writing 7z archives (https://pypi.org/project/py7zr/)

In [None]:
# Importing Libraries

# Data Handling
import pandas as pd
import numpy as np
from datasets import Dataset  # load_metric is not imported: recent datasets releases removed it in favor of the evaluate library
import shutil

# Data Visualization
import plotly.express as px
import plotly.graph_objs as go
import plotly.subplots as sp
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.io as pio
from IPython.display import display
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

# Statistics & Mathematics
import scipy.stats as stats
import statsmodels.api as sm
from scipy.stats import shapiro, skew, anderson, kstest, gaussian_kde,spearmanr
import math

# Hiding warnings
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Transformers
from transformers import BartTokenizer, BartForConditionalGeneration      # BART Tokenizer and architecture
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments         # These will help us to fine-tune our model
from transformers import pipeline                                         # Pipeline
from transformers import DataCollatorForSeq2Seq                           # DataCollator to batch the data
import torch                                                              # PyTorch
import evaluate                                                           # Hugging Face's library for model evaluation


# Other NLP libraries
from textblob import TextBlob                                             # This is going to help us fix spelling mistakes in texts
from sklearn.feature_extraction.text import TfidfVectorizer               # This is going to help identify the most relevant terms in the corpus
import re                                                                 # This library allows us to clean text data
import nltk                                                               # Natural Language Toolkit
nltk.download('punkt')                                                    # This divides a text into a list of sentences

> <p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">By observing the imports above, you can clearly note that I have chosen to work with <b>PyTorch</b> for this notebook.</p>

In [None]:
# Configuring Pandas to exhibit larger columns
'''
This is going to allow us to fully read the dialogues and their summary
'''
pd.set_option('display.max_colwidth', 1000)

In [None]:
# Configuring notebook
seed = 42
#paper_color =
#bg_color =
colormap = 'cividis'
template = 'plotly_dark'

In [None]:
# Checking if GPU is available
if torch.cuda.is_available():
    print("GPU is available. \nUsing GPU")
    device = torch.device('cuda')
else:
    print("GPU is not available. \nUsing CPU")
    device = torch.device('cpu')

In [None]:
def display_feature_list(features, feature_type):

    '''
    This function displays the features within each list for each type of data
    '''

    print(f"\n{feature_type} Features: ")
    print(', '.join(features) if features else 'None')

def describe_df(df):
    """
    This function prints some basic info on the dataset and
    sets global variables for feature lists.
    """

    global categorical_features, continuous_features, binary_features
    categorical_features = [col for col in df.columns if df[col].dtype == 'object']
    binary_features = [col for col in df.columns if df[col].nunique() <= 2 and df[col].dtype != 'object']
    continuous_features = [col for col in df.columns if df[col].dtype != 'object' and col not in binary_features]

    print(f"\n{type(df).__name__} shape: {df.shape}")
    print(f"\n{df.shape[0]:,.0f} samples")
    print(f"\n{df.shape[1]:,.0f} attributes")
    print(f'\nMissing Data: \n{df.isnull().sum()}')
    print(f'\nDuplicates: {df.duplicated().sum()}')
    print(f'\nData Types: \n{df.dtypes}')

    #negative_valued_features = [col for col in df.columns if (df[col] < 0).any()]
    #print(f'\nFeatures with Negative Values: {", ".join(negative_valued_features) if negative_valued_features else "None"}')

    display_feature_list(categorical_features, 'Categorical')
    display_feature_list(continuous_features, 'Continuous')
    display_feature_list(binary_features, 'Binary')

    print(f'\n{type(df).__name__} Head: \n')
    display(df.head(5))
    print(f'\n{type(df).__name__} Tail: \n')
    display(df.tail(5))

In [None]:
def histogram_boxplot(df,hist_color, box_color, height, width, legend, name):
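    '''
    This function plots a Box Plot and a density curve (KDE) side by side
    for every numeric column in the dataframe

    Parameters:
    df = Dataframe with the numeric columns to plot
    hist_color = The color of the density curve
    box_color = The color of the boxplots
    height and width = Image size
    legend = Whether to display the legend or not
    name = Name of the dataset, used in the plot title
    '''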

    features = df.select_dtypes(include = [np.number]).columns.tolist()

    for feat in features:
        try:
            fig = make_subplots(
                rows=1,
                cols=2,
                subplot_titles=["Box Plot", "Histogram"],
                horizontal_spacing=0.2
            )

            density = gaussian_kde(df[feat])
            x_vals = np.linspace(min(df[feat]), max(df[feat]), 200)
            density_vals = density(x_vals)

            fig.add_trace(go.Scatter(x=x_vals, y = density_vals, mode = 'lines',
                                     fill = 'tozeroy', name="Density", line_color=hist_color), row=1, col=2)
            fig.add_trace(go.Box(y=df[feat], name="Box Plot", boxmean=True, line_color=box_color), row=1, col=1)

            fig.update_layout(title={'text': f'<b>{name} Word Count<br><sup><i>&nbsp;&nbsp;&nbsp;&nbsp;{feat}</i></sup></b>',
                                     'x': .025, 'xanchor': 'left'},
                             margin=dict(t=100),
                             showlegend=legend,
                             template = template,
                             #plot_bgcolor=bg_color,paper_bgcolor=paper_color,
                             height=height, width=width
                            )

            fig.update_yaxes(title_text=f"<b>Words</b>", row=1, col=1, showgrid=False)
            fig.update_xaxes(title_text="", row=1, col=1, showgrid=False)

            fig.update_yaxes(title_text="<b>Frequency</b>", row=1, col=2,showgrid=False)
            fig.update_xaxes(title_text=f"<b>Words</b>", row=1, col=2, showgrid=False)

            fig.show()
            print('\n')
        except Exception as e:
            print(f"An error occurred: {e}")

In [None]:
def plot_correlation(df, title, subtitle, height, width, font_size):
    '''
    This function plots a correlation heatmap of the numeric features in the dataset.

    Parameters:
    height = Define height
    width = Define width
    font_size = Define the font size for the annotations
    '''
    corr = np.round(df.corr(numeric_only = True), 2)
    mask = np.triu(np.ones_like(corr, dtype = bool))
    c_mask = np.where(~mask, corr, 100)

    c = []
    for i in c_mask.tolist()[1:]:
        c.append([x for x in i if x != 100])



    fig = ff.create_annotated_heatmap(z=c[::-1],
                                      x=corr.index.tolist()[:-1],
                                      y=corr.columns.tolist()[1:][::-1],
                                      colorscale = colormap)

    fig.update_layout(title = {'text': f"<b>{title} Heatmap<br><sup>&nbsp;&nbsp;&nbsp;&nbsp;<i>{subtitle}</i></sup></b>",
                                'x': .025, 'xanchor': 'left', 'y': .95},
                    margin = dict(t=210, l = 110),
                    yaxis = dict(autorange = 'reversed', showgrid = False),
                    xaxis = dict(showgrid = False),
                    template = template,
                    #plot_bgcolor=bg_color,paper_bgcolor=paper_color,
                    height = height, width = width)


    fig.add_trace(go.Heatmap(z = c[::-1],
                             colorscale = colormap,
                             showscale = True,
                             visible = False))
    fig.data[1].visible = True

    for i in range(len(fig.layout.annotations)):
        fig.layout.annotations[i].font.size = font_size

    fig.show()

In [None]:
def compute_tfidf(df_column, ngram_range=(1,1), max_features=15):
    vectorizer = TfidfVectorizer(max_features=max_features, stop_words='english', ngram_range=ngram_range)
    x = vectorizer.fit_transform(df_column.fillna(''))
    df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
    return df_tfidfvect

<div id = 'eda'
     style="font-family: Calibri, serif; text-align: left;">
    <hr style="border: none;
               border-top: 2.85px solid #041445;
               width: 100%;
               margin-top: 62px;
               margin-bottom: auto;
               margin-left: 0;">
    <div style="font-size: 56px; letter-spacing: 2.25px;color: #02011a;"><b>Exploring the Dataset</b></div>
</div>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">We can start our analysis of the dataset by loading all the three sets available, <code>train</code>, <code>test</code>, and <code>val</code>.</p>

In [None]:
# Loading data
train = pd.read_csv('/kaggle/input/samsum-dataset-text-summarization/samsum-train.csv')
test = pd.read_csv('/kaggle/input/samsum-dataset-text-summarization/samsum-test.csv')
val = pd.read_csv('/kaggle/input/samsum-dataset-text-summarization/samsum-validation.csv')

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">I am now going to analyze each dataset separately.</p>

<div id = 'train'
     style="font-family: Calibri, serif; text-align: left;">
    <hr style="border: none;
               width: 100%;
               margin-top: 62px;
               margin-bottom: auto;
               margin-left: 0;">
    <div style="font-size: 32px; letter-spacing: 2.25px;color: #02011a;"><b>Train Dataset</b></div>
</div>

In [None]:
# Extracting info on the training Dataframe
describe_df(train)

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">We have 14,732 pairs of dialogues and summaries. One of the dialogues also appears to be empty, so let's investigate it further.</p>

In [None]:
mask = train['dialogue'].isnull() # Creating mask with null dialogues
filtered_train = train[mask] # filtering dataframe
filtered_train # Visualizing

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">It seems that sample <b>6054</b> does not really add anything to the dataset. We have a Null dialogue and the summary does not give us a clue on what this dialogue was supposed to be. We will remove this entry.</p>

In [None]:
train = train.dropna() # removing null values

In [None]:
# Removing 'Id' from categorical features list
categorical_features.remove('id')

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">We can now analyze the length of both dialogues and summaries by counting the words in them. This might give us a clue about how these texts are structured.</p>

In [None]:
df_text_length = pd.DataFrame() # Creating an empty dataframe
for feat in categorical_features: # Iterating through features --> Dialogue & Summary
    df_text_length[feat] = train[feat].apply(lambda x: len(str(x).split())) # Counting words for each feature

# Plotting histogram-boxplot
histogram_boxplot(df_text_length,'#89c2e0', '#d500ff', 600, 1000, True, 'Train Dataset')

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">On average, dialogues consist of about 94 words. We do have some outliers with very extensive texts, going way over 300 words per dialogue.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Summaries are naturally shorter texts, consisting of about 20 words on average, although we also have some outliers with extensive summaries.</p>
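
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">To put numbers on what counts as an outlier, one option is the usual boxplot rule of 1.5 times the interquartile range above the third quartile. The snippet below is a minimal sketch on a made-up dataframe; <code>toy</code>, <code>word_counts</code>, and <code>upper_fence</code> are illustrative names, not part of the notebook's helpers.</p>

```python
import pandas as pd

# Made-up data standing in for the SAMSum dataframes
toy = pd.DataFrame({
    "dialogue": [
        "A: hi\nB: hello there",
        "A: are you coming tonight?\nB: yes, see you at 8",
        "A: ok",
    ],
    "summary": ["A greets B.", "B will come tonight at 8.", "A agrees."],
})

# Whitespace-separated word counts per column, as in the cell above
word_counts = toy[["dialogue", "summary"]].apply(lambda col: col.str.split().str.len())

# Boxplot-style upper fence: Q3 + 1.5 * IQR
q1, q3 = word_counts["dialogue"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = word_counts[word_counts["dialogue"] > upper_fence]

print(word_counts.describe().loc[["mean", "50%", "max"]])
print("Dialogues above the upper fence:", len(outliers))
```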

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">We can also use scikit-learn's <code>TfidfVectorizer</code> to extract more information on the dialogues and summaries. Its output, once converted to a dataframe, is restricted to the top $n$ most frequent terms in the corpus, where $n$ is set through the <code>max_features</code> parameter.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">In this dataframe, each column represents one of the $n$ most frequent terms in the overall corpus, while each row represents one entry in the original dataframe, such as <code>train</code>. For each term in each entry, we will see its TF-IDF score, which quantifies the relevance of the term in a given dialogue (or summary) relative to its frequency across all other dialogues (or summaries).</p>
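
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">As a reference, and assuming scikit-learn's default settings (<code>smooth_idf=True</code>, <code>norm='l2'</code>), the score of a term $t$ in a document $d$ can be written as follows, where $N$ is the number of documents, $\mathrm{tf}(t, d)$ is the count of $t$ in $d$, and $\mathrm{df}(t)$ is the number of documents containing $t$:</p>

$$
\mathrm{idf}(t) = \ln\!\left(\frac{1 + N}{1 + \mathrm{df}(t)}\right) + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Each row is then divided by its Euclidean norm, so every score lies between 0 and 1.</p>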
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">We will also use the <code>ngram_range</code> parameter to select the most frequent words (unigrams), the most frequent sequences of two words (bigrams), and the most frequent sequences of three words (trigrams). The <code>stop_words = 'english'</code> parameter filters out common English stop words, which are words that add little to the overall context, such as <i>"and"</i>, <i>"of"</i>, etc.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">After extracting the most frequent terms, I will plot a heatmap displaying the correlations between their TF-IDF scores. This may help us understand how frequently they are used together in dialogues. For instance, how often does the word <i>"party"</i> appear when the word <i>"come"</i> is present?</p>

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english') # Top 15 terms
x = vectorizer.fit_transform(train['dialogue'])
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Unigrams', 'Train - Dialogue', 800, 800, 12)

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">You can see that the correlations between these terms are neither strongly positive nor strongly negative. The most positively correlated terms are <i>"don"</i> and <i>"know"</i>, at 0.12. Note that <code>TfidfVectorizer</code> tokenizes the text before scoring it: by default, it splits on punctuation and discards single-character tokens. This is why <i>don't</i> is split at the apostrophe and only <i>"don"</i> survives, while the lone <i>"t"</i> is dropped.</p>
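
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The effect can be reproduced with the standard library alone. The sketch below applies the regular expression that scikit-learn uses as its default <code>token_pattern</code> to a made-up sentence; the function name <code>default_like_tokens</code> is hypothetical, and the real vectorizer additionally removes stop words afterwards.</p>

```python
import re

# scikit-learn's default token_pattern: runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_like_tokens(text):
    """Lowercase the text and extract tokens the way the default pattern would."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = default_like_tokens("I don't know, it's fine")
print(tokens)  # ['don', 'know', 'it', 'fine']
```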
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">It is also interesting to notice a negative correlation, although still a weak one, between the terms <i>"yes"</i> and <i>"yeah"</i>. Maybe this happens because it would be redundant to include both in the same dialogue, or perhaps the data captures a tendency of individuals to use <i>"yeah"</i> instead of <i>"yes"</i> during conversations. These are some hypotheses we can consider when analyzing this type of heatmap.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Let's perform the same analysis on the summaries.</p>

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english') # Top 15 terms
x = vectorizer.fit_transform(train['summary'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Unigrams', 'Train - Summary', 800, 800, 12)

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The correlations of terms in summaries seem to be more pronounced than those in dialogues, even though these correlations are still not strong. This suggests that summaries may convey relevant information more succinctly than full dialogues, which is exactly the idea behind a summary.</p>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">We have positively correlated pairs such as <i>"going"</i> and <i>"meet"</i>, <i>"come"</i> and <i>"party"</i>, as well as <i>"buy"</i> and <i>"wants"</i>. It makes perfect sense to see these unigrams appearing together across texts.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Conversely, it's reasonable for negatively correlated pairs <b>not</b> to co-occur frequently in texts, such as <i>"going"</i> and <i>"wants"</i>, and <i>"going"</i> and <i>"got"</i>.</p>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Let's now analyze bigrams across dialogues and summaries.</p>

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range = (2,2)) # Top 15 terms
x = vectorizer.fit_transform(train['dialogue'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Bigrams', 'Train - Dialogue', 800, 800, 12)

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Once more, the correlations are not extremely strong. Still, we can see some pairs that seem reasonable to be together, such as <i>"good idea"</i> and <i>"sounds like"</i>.</p>

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range = (2,2)) # Top 15 terms
x = vectorizer.fit_transform(train['summary'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Bigrams', 'Train - Summary', 800, 800, 12)

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Only one pair shows a noticeable correlation: the bigrams <i>"wants buy"</i> and <i>"buy new"</i>. The other terms do not appear to be correlated at all.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">It is interesting to see that the summaries tend to mention durations in minutes, which does not seem to happen in the dialogues. We can investigate this further by querying the entries whose summary contains the bigram <i>15 minutes</i>.</p>

In [None]:
# Filtering dataset to see those containing the term '15 minutes' in the summary
filtered_train = train[train['summary'].str.contains('15 minutes', case=False, na=False)]
filtered_train.head()

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The last row gives us an idea of why we see so many terms related to minutes in summaries, but not in dialogues. In dialogues, people may write "15min" as a single word, or use other forms such as "15m", whereas the summaries use a standardized wording. A single spelling concentrates all the occurrences, which naturally makes it more prominent than the many informal variants.</p>
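
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">One way to check this hypothesis would be to search for several spellings at once. The following is a small sketch using a hypothetical regular expression, <code>MINUTES_VARIANTS</code>, on made-up messages; the list of variants is an assumption and would need to be adapted to what actually appears in the dialogues.</p>

```python
import re

# Hypothetical pattern: "15", an optional space, then a spelling of "minutes"
MINUTES_VARIANTS = re.compile(r"\b15\s?(?:minutes|mins|min|m)\b", re.IGNORECASE)

# Made-up messages, not taken from SAMSum
samples = [
    "I'll be there in 15min",
    "give me 15 m",
    "See you in 15 minutes!",
    "It costs 15 dollars",
]

found = []
for message in samples:
    match = MINUTES_VARIANTS.search(message)
    found.append(match.group(0) if match else None)

print(found)  # ['15min', '15 m', '15 minutes', None]
```

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The same pattern could be passed to <code>train['dialogue'].str.contains(...)</code> to filter the real dataframe.</p>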

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Let's now visualize the trigrams.</p>

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range = (3,3)) # Top 15 terms
x = vectorizer.fit_transform(train['dialogue'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Trigrams', 'Train - Dialogue', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range = (3,3)) # Top 15 terms
x = vectorizer.fit_transform(train['summary'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Trigrams', 'Train - Summary', 800, 800, 12)

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Once more, we can see that the terms are not strongly correlated. But still, it is possible to see pairs that seem logical to appear together in the corpus.</p>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">I will now perform the exact same analysis on the <code>test</code> and <code>val</code> datasets. We can expect the same behavior as in the training set, which is why I will refrain from commenting on the following plots to avoid redundancy. However, if something different appears, we will investigate it further.</p>

<div id = 'test'
     style="font-family: Calibri, serif; text-align: left;">
    <hr style="border: none;
               width: 100%;
               margin-top: 62px;
               margin-bottom: auto;
               margin-left: 0;">
    <div style="font-size: 32px; letter-spacing: 2.25px;color: #02011a;"><b>Test Dataset</b></div>
</div>

In [None]:
# Extracting info on the test dataset
describe_df(test)

In [None]:
# Removing 'Id' from categorical features list
categorical_features.remove('id')

In [None]:
df_text_length = pd.DataFrame()
for feat in categorical_features:
    df_text_length[feat] = test[feat].apply(lambda x: len(str(x).split()))

histogram_boxplot(df_text_length,'#89c2e0', '#d500ff', 600, 1000, True, 'Test Dataset')

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english') # Top 15 terms
x = vectorizer.fit_transform(test['dialogue'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Unigrams', 'Test - Dialogue', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english') # Top 15 terms
x = vectorizer.fit_transform(test['summary'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Unigrams', 'Test - Summary', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range = (2,2)) # Top 15 terms
x = vectorizer.fit_transform(test['dialogue'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Bigrams', 'Test - Dialogue', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range = (2,2)) # Top 15 terms
x = vectorizer.fit_transform(test['summary'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Bigrams', 'Test - Summary', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range = (3,3)) # Top 15 terms
x = vectorizer.fit_transform(test['dialogue'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Trigrams', 'Test - Dialogue', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range = (3,3)) # Top 15 terms
x = vectorizer.fit_transform(test['summary'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Trigrams', 'Test - Summary', 800, 800, 12)

<div id = 'val'
     style="font-family: Calibri, serif; text-align: left;">
    <hr style="border: none;
               width: 100%;
               margin-top: 62px;
               margin-bottom: auto;
               margin-left: 0;">
    <div style="font-size: 32px; letter-spacing: 2.25px;color: #02011a;"><b>Validation Dataset</b></div>
</div>

In [None]:
# Extracting info on the val dataset
describe_df(val)

In [None]:
# Removing 'Id' from categorical features list
categorical_features.remove('id')

In [None]:
df_text_length = pd.DataFrame()
for feat in categorical_features:
    df_text_length[feat] = val[feat].apply(lambda x: len(str(x).split()))

histogram_boxplot(df_text_length,'#89c2e0', '#d500ff', 600, 1000, True, 'Validation Dataset')

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english') # Top 15 terms
x = vectorizer.fit_transform(val['dialogue'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Unigrams', 'Validation - Dialogue', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english') # Top 15 terms
x = vectorizer.fit_transform(val['summary'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Unigrams', 'Validation - Summary', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range = (2,2)) # Top 15 terms
x = vectorizer.fit_transform(val['dialogue'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Bigrams', 'Validation - Dialogue', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range = (2,2)) # Top 15 terms
x = vectorizer.fit_transform(val['summary'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Bigrams', 'Validation - Summary', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range = (3,3)) # Top 15 terms
x = vectorizer.fit_transform(val['dialogue'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Trigrams', 'Validation - Dialogue', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range = (3,3)) # Top 15 terms
x = vectorizer.fit_transform(val['summary'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Trigrams', 'Validation - Summary', 800, 800, 12)

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Overall, we have similar patterns across all three datasets. Summaries are shorter than dialogues, as expected, and many terms that we would expect to appear together show a higher degree of correlation.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">By analyzing the <i>n-gram</i> heatmaps, it is also clear that this data consists of chat/dialogue texts, since we can see a lot of terms that would usually appear in conversations.</p>

<div id = 'preprocess'
     style="font-family: Calibri, serif; text-align: left;">
    <hr style="border: none;
               border-top: 2.85px solid #041445;
               width: 100%;
               margin-top: 62px;
               margin-bottom: auto;
               margin-left: 0;">
    <div style="font-size: 56px; letter-spacing: 2.25px;color: #02011a;"><b>Preprocessing Data</b></div>
</div>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">One of the main advantages of working with pre-trained models, such as BART, is that these models are usually extremely robust and require very little data preprocessing.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">While performing the EDA, I noticed that a few texts contain tags, such as <code>&lt;file_photo&gt;</code>. Let's take a look at a few examples.</p>

In [None]:
print(train['dialogue'].iloc[14727])

In [None]:
print(test['dialogue'].iloc[0])

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">I am going to use the <code>clean_tags</code> function defined below to remove these tags from the texts, so we can make them cleaner.</p>

In [None]:
def clean_tags(text):
    tag_pattern = re.compile(r'<.*?>') # Compiling tag pattern, e.g. <file_photo>
    clean = tag_pattern.sub('', text) # Replacing tags by an empty string

    # Removing turns left empty after tag removal (speaker name followed by nothing)
    clean = '\n'.join([line for line in clean.split('\n') if not re.match(r'.*:\s*$', line)])

    return clean

In [None]:
test1 = clean_tags(train['dialogue'].iloc[14727]) # Applying function to example text
test2 = clean_tags(test['dialogue'].iloc[0]) # Applying function to example text

# Printing results
print(test1)
print('\n' *3)
print(test2)

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">You can see that we have successfully removed the tags from the texts. I am now going to define the <code>clean_df</code> function, which applies <code>clean_tags</code> to entire datasets.</p>
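
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">To make the two steps easier to follow, here is a self-contained sketch of the same logic applied to a made-up dialogue. The names <code>strip_tags</code>, <code>TAG_RE</code>, and <code>EMPTY_TURN_RE</code> are illustrative and simply mirror what <code>clean_tags</code> does.</p>

```python
import re

TAG_RE = re.compile(r"<.*?>")            # anything between angle brackets
EMPTY_TURN_RE = re.compile(r".*:\s*$")   # a line ending in ':' plus optional whitespace

def strip_tags(text):
    """Remove tags, then drop turns that became empty (speaker name only)."""
    without_tags = TAG_RE.sub("", text)
    kept = [line for line in without_tags.split("\n")
            if not EMPTY_TURN_RE.match(line)]
    return "\n".join(kept)

# Made-up dialogue, not taken from SAMSum
toy_dialogue = "Anna: <file_photo>\nBen: Nice picture!\nAnna: Thanks <file_gif>"
cleaned = strip_tags(toy_dialogue)
print(cleaned)
```

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The first turn disappears because nothing is left after the speaker's name, while the last turn keeps its text. One caveat of this pattern is that any line ending with a colon is dropped, even when it carries real text.</p>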

In [None]:
# Defining function to clean every text in the dataset.
def clean_df(df, cols):
    for col in cols:
        df[col] = df[col].fillna('').apply(clean_tags)
    return df

In [None]:
# Cleaning texts in all datasets
train = clean_df(train,['dialogue', 'summary'])
test = clean_df(test,['dialogue', 'summary'])
val = clean_df(val,['dialogue', 'summary'])

In [None]:
train.tail(3) # Visualizing results

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The tags have been removed from the texts. It's beneficial to conduct such data cleansing to eliminate noise—information that might not significantly contribute to the overall context and could potentially impair performance.</p>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">I am now going to perform some preprocessing that is necessary to prepare our data to serve as input to the pre-trained model and for fine-tuning. Most of what I'm doing here is a part of the tutorial on Text Summarization described in the 🤗 Transformers documentation, which you can see <a href ="https://huggingface.co/docs/transformers/tasks/summarization">here</a>.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">First, I am going to use the 🤗 Datasets library to convert our Pandas Dataframes to Datasets. This is going to make our data ready to be processed across the whole Hugging Face ecosystem.</p>

In [None]:
# Transforming dataframes into datasets
train_ds = Dataset.from_pandas(train)
test_ds = Dataset.from_pandas(test)
val_ds = Dataset.from_pandas(val)

# Visualizing results
print(train_ds)
print('\n' * 2)
print(test_ds)
print('\n' * 2)
print(val_ds)

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">To see the content inside a 🤗 Dataset, we can select a specific row, as below.</p>

In [None]:
train_ds[0] # Visualizing the first row

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">This way, we can see the original ID, the dialogue, as well as the reference summary. The <code>__index_level_0__</code> column holds the pandas index, which is no longer contiguous after dropping the null row. It does not add anything to the data and will be removed later.</p>
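
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The pandas side of this can be seen on a made-up dataframe: dropping a row leaves a gap in the index, and <code>reset_index(drop=True)</code> closes it. This is only a sketch with illustrative names (<code>toy</code>, <code>dropped</code>, <code>renumbered</code>); the claim that the leftover index becomes an extra column is based on how <code>Dataset.from_pandas</code> behaves with its default <code>preserve_index</code> setting.</p>

```python
import pandas as pd

# Made-up data with one null dialogue, mimicking the situation in the train set
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "dialogue": ["A: hi", None, "B: bye", "C: ok"],
    "summary": ["s1", "s2", "s3", "s4"],
})

dropped = toy.dropna()                        # index keeps the original labels
renumbered = dropped.reset_index(drop=True)   # index renumbered from zero

print(list(dropped.index))      # [0, 2, 3]
print(list(renumbered.index))   # [0, 1, 2]
```

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">If the index were reset before the conversion, the extra column should not be created, and it would then have to be left out of the list of columns removed after tokenization.</p>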

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">After successfully converting the pandas dataframes to 🤗Datasets, we can move on to the modeling process.</p>

<div id = 'modeling'
     style="font-family: Calibri, serif; text-align: left;">
    <hr style="border: none;
               border-top: 2.85px solid #041445;
               width: 100%;
               margin-top: 62px;
               margin-bottom: auto;
               margin-left: 0;">
    <div style="font-size: 56px; letter-spacing: 2.25px;color: #02011a;"><b>Modeling</b></div>
</div>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">As I have previously mentioned, we are going to fine-tune a version of BART that has been trained on several news articles for text summarization, <a href ="https://huggingface.co/facebook/bart-large-xsum"><b>facebook/bart-large-xsum</b></a>.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">I will briefly demonstrate this model by loading a summarization pipeline with it to show you how it works on news data.</p>

In [None]:
# Loading summarization pipeline with the bart-large-xsum model
summarizer = pipeline('summarization', model = 'facebook/bart-large-xsum')

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">As an example, I am going to use the following news article, published on CNN on October 24<sup>th</sup>, 2023, <i><a href ="https://edition.cnn.com/2023/10/24/europe/bobi-oldest-ever-dog-dies-intl-scli/index.html">Bobi, the world’s oldest dog ever, dies aged 31</a></i>. Notice that this is a totally unseen news article that I'm passing to the model, so we can see how it performs.</p>

In [None]:
news = '''Bobi, the world’s oldest dog ever, has died after reaching the almost inconceivable age of 31 years and 165 days, said Guinness World Records (GWR) on Monday.
His death at an animal hospital on Friday was initially announced by veterinarian Dr. Karen Becker.
She wrote on Facebook that “despite outliving every dog in history, his 11,478 days on earth would never be enough, for those who loved him.”
There were many secrets to Bobi’s extraordinary old age, his owner Leonel Costa told GWR in February. He always roamed freely, without a leash or chain, lived in a “calm, peaceful” environment and ate human food soaked in water to remove seasonings, Costa said.
He spent his whole life in Conqueiros, a small Portuguese village about 150 kilometers (93 miles) north of the capital Lisbon, often wandering around with cats.
Bobi was a purebred Rafeiro do Alentejo – a breed of livestock guardian dog – according to his owner. Rafeiro do Alentejos have a life expectancy of about 12-14 years, according to the American Kennel Club.
But Bobi lived more than twice as long as that life expectancy, surpassing an almost century-old record to become the oldest living dog and the oldest dog ever – a title which had previously been held by Australian cattle-dog Bluey, who was born in 1910 and lived to be 29 years and five months old.
However, Bobi’s story almost had a different ending.
When he and his three siblings were born in the family’s woodshed, Costa’s father decided they already had too many animals at home.
Costa and his brothers thought their parents had taken all the puppies away to be destroyed. However, a few sad days later, they found Bobi alive, safely hidden in a pile of logs.
The children hid the puppy from their parents and, by the time Bobi’s existence became known, he was too old to be put down and went on to live his record-breaking life.
His 31st birthday party in May was attended by more than 100 people and a performing dance troupe, GWR said.
His eyesight deteriorated and walking became harder as Bobi grew older but he still spent time in the backyard with the cats, rested more and napped by the fire.
“Bobi is special because looking at him is like remembering the people who were part of our family and unfortunately are no longer here, like my father, my brother, or my grandparents who have already left this world,” Costa told GWR in May. “Bobi represents those generations.”
'''
summarizer(news) # Using the pipeline to generate a summary of the text above

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">You can observe that the model is able to accurately produce a much shorter text consisting of the most relevant information present in the input text. This is a successful summarization.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">However, this checkpoint has been fine-tuned on XSum, a dataset of BBC news articles paired with one-sentence summaries, not on dialogue data. This is why I'm going to fine-tune it with the SamSum dataset.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Let's go ahead and load BartTokenizer and BartForConditionalGeneration using the <i><b>facebook/bart-large-xsum</b></i> checkpoint.</p>

In [None]:
checkpoint = 'facebook/bart-large-xsum' # Model
tokenizer = BartTokenizer.from_pretrained(checkpoint) # Loading Tokenizer

In [None]:
model = BartForConditionalGeneration.from_pretrained(checkpoint) # Loading Model

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">We can also print the architecture of the model below.</p>

In [None]:
print(model) # Visualizing model's architecture

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">We can see that the model consists of an encoder and a decoder. The linear layers are also visible, as well as the activation functions, which use $GeLU$ instead of the more typical $ReLU$.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">It is also interesting to observe the output layer, <b>lm_head</b>, a linear layer with <code>out_features=50264</code>. It maps each decoder state to one score per token in the vocabulary, which is what allows the model to generate text. This makes the architecture suitable for summarization, as well as for other sequence-to-sequence tasks such as translation.</p>
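
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The role of such an output layer can be illustrated with a toy NumPy sketch. The sizes and random weights below are made up for readability and have nothing to do with BART's real parameters; the point is only to show a hidden state being projected to one score per vocabulary entry, from which the next token is chosen.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration; the printed model reports out_features=50264
hidden_size, vocab_size = 8, 20

decoder_state = rng.normal(size=hidden_size)                 # one decoder output vector
lm_head_weight = rng.normal(size=(vocab_size, hidden_size))  # bias-free linear layer

logits = lm_head_weight @ decoder_state   # one score per vocabulary entry
probs = np.exp(logits - logits.max())     # numerically stable softmax
probs /= probs.sum()

next_token_id = int(np.argmax(probs))     # greedy choice of the next token
print(logits.shape, next_token_id)
```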

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Now we must preprocess our datasets with BartTokenizer so that our data can be read by the BART model.</p>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The following <code>preprocess_function</code> can be directly copied from the 🤗 Transformers documentation, and it serves well to preprocess data for several NLP tasks. I am going to delve a bit deeper into how it preprocesses the data by explaining the steps it takes.</p>

<div style = "margin-left: 25px;">
    
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a"><b>• <code>inputs = [doc for doc in examples["dialogue"]]:</code></b> In this line, we are iterating over every <code>dialogue</code> in the dataset and saving them as input to the model.</p>
    
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a"><b>• <code>model_inputs = tokenizer(inputs, max_length=1024, truncation=True):</code></b> Here, we are using the <code>tokenizer</code> to convert the input dialogues into token IDs that the BART model can process. The <code>truncation=True</code> parameter cuts off any dialogue longer than 1024 tokens, the limit set by the <code>max_length</code> parameter.</p>
    
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a"><b>• <code>labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True):</code></b> This line performs a tokenization very similar to the one above. This time, however, it tokenizes the target variable, which is our summaries. Also, note that <code>max_length</code> is significantly lower here, at 128, since we expect summaries to be much shorter than dialogues.</p>
    
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a"><b>• <code>model_inputs["labels"] = labels["input_ids"]:</code></b> This line is essentially adding the tokenized labels to the preprocessed dataset, alongside the tokenized inputs.</p>
  
</div>

In [None]:
def preprocess_function(examples):
    inputs = [doc for doc in examples["dialogue"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    # Tokenizing the targets (summaries); text_target replaces the deprecated
    # as_target_tokenizer() context manager
    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
# Applying preprocess_function to the datasets
tokenized_train = train_ds.map(preprocess_function, batched=True,
                               remove_columns=['id', 'dialogue', 'summary', '__index_level_0__']) # Removing features

tokenized_test = test_ds.map(preprocess_function, batched=True,
                               remove_columns=['id', 'dialogue', 'summary']) # Removing features

tokenized_val = val_ds.map(preprocess_function, batched=True,
                               remove_columns=['id', 'dialogue', 'summary']) # Removing features

# Printing results
print('\n' * 3)
print('Preprocessed Training Dataset:\n')
print(tokenized_train)
print('\n' * 2)
print('Preprocessed Test Dataset:\n')
print(tokenized_test)
print('\n' * 2)
print('Preprocessed Validation Dataset:\n')
print(tokenized_val)

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Our tokenized datasets consist now of only three features, <code>input_ids</code>, <code>attention_mask</code>, and <code>labels</code>. Let's print a sample from our tokenized train dataset to investigate further how the preprocess function altered the data.</p>

In [None]:
# Selecting a sample from the dataset
sample = tokenized_train[0]

# Printing its features
print("input_ids:")
print(sample['input_ids'])
print("\n")
print("attention_mask:")
print(sample['attention_mask'])
print("\n")
print("labels:")
print(sample['labels'])
print("\n")

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Let's dive a bit deeper into what each feature means.</p>

<div style = "margin-left: 25px;">
    
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a"><b>• input_ids</b>: These are the token IDs mapped to the dialogues. Each ID represents a word or subword in BART's vocabulary. For instance, the number <i><b>5219</b></i> could map to a word like <i>"hello"</i>. Each entry in the vocabulary has its own unique ID.</p>
    
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a"><b>• attention_mask</b>: This mask indicates which tokens the model should pay attention to and which tokens should be ignored. It is mostly used alongside padding, where extra tokens are added to equalize the lengths of the sequences in a batch. Padding tokens do not hold any meaningful information, so the attention mask ensures the model does not focus on them. In the case of this specific sample, all tokens are marked as '1', meaning they are all relevant and none of them are used for padding.</p>
    
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a"><b>• labels</b>: Similarly to the first feature, these are token IDs obtained from the words and subwords in the summaries. These are the tokens that the model will be trained on to give as output.</p>
  
</div>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">We must now use <code>DataCollatorForSeq2Seq</code> to batch the data. These data collators may also automatically apply some processing techniques, such as padding. They are important for the task of fine-tuning models and are also present in the 🤗 Transformers documentation for text summarization.</p>

In [None]:
# Instantiating Data Collator
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
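
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The padding performed by the collator can be sketched in plain Python. The <code>toy_collate</code> function below is a hypothetical, simplified stand-in and not the library's implementation: it pads every sequence to the longest one in the batch, marks padded positions with 0 in the attention mask, and pads labels with -100 so that those positions can be ignored by the loss. The value <code>pad_token_id=1</code> and the token IDs are assumptions made for illustration.</p>

```python
def toy_collate(features, pad_token_id=1, label_pad_token_id=-100):
    # Pad to the longest sequence in this batch (dynamic padding).
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        pad_in = max_in - len(f["input_ids"])
        pad_lab = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * pad_in)
        # 1 = real token, 0 = padding the model should ignore
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * pad_in)
        # -100 marks label positions that should not contribute to the loss
        batch["labels"].append(f["labels"] + [label_pad_token_id] * pad_lab)
    return batch


# Made-up token IDs, for illustration only
features = [
    {"input_ids": [0, 11, 12, 13, 2], "labels": [0, 21, 2]},
    {"input_ids": [0, 14, 2], "labels": [0, 22, 23, 24, 2]},
]
collated = toy_collate(features)
print(collated["input_ids"][1])       # [0, 14, 2, 1, 1]
print(collated["attention_mask"][1])  # [1, 1, 1, 0, 0]
print(collated["labels"][0])          # [0, 21, 2, -100, -100]
```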

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Next, I am going to load the ROUGE metrics and define a new function to evaluate the model.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The <code>compute_metrics</code> function is also available in the documentation. In this function, we extract the model-generated summaries and the human-generated summaries and decode them back into text. We then use ROUGE to measure how similar they are, which gives us our evaluation of performance.</p>

In [None]:
metric = load_metric('rouge') # Loading ROUGE Score

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred  # Obtaining predictions and true labels

    # Decoding predictions
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Obtaining the true labels tokens, while eliminating any possible masked token (i.e., label = -100)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]


    # Computing rouge score
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()} # Keeping the mid F-measure of each score, as a percentage

    # Add mean-generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
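
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">To see what the F-measure extracted above represents, here is a minimal sketch of a ROUGE-1 style score computed by hand. It is a simplification and not the rouge-score implementation: <code>rouge1_f</code> is a hypothetical helper that lowercases, splits on whitespace, and applies no stemming, and the two sentences are made-up examples.</p>

```python
from collections import Counter


def rouge1_f(prediction, reference):
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped unigram overlap: each word counts at most as often as it
    # appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # share of predicted words that match
    recall = overlap / len(ref_tokens)      # share of reference words recovered
    return 2 * precision * recall / (precision + recall)


score = rouge1_f(
    "alex and sam are watching tv",
    "alex and sam are watching millionaires",
)
print(round(score * 100, 4))  # 83.3333
```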

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">We now use the <code>Seq2SeqTrainingArguments</code> class to set some relevant settings for fine-tuning. I will first define a directory to serve as output, and then define the evaluation strategy, learning rate, etc.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">This class can be quite extensive, with several different parameters. I highly suggest you take your time with <a href = "https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments">the documentation</a> to get familiar with them.</p>

In [None]:
# Defining parameters for training
'''
Please don't forget to check the documentation.
Both the Seq2SeqTrainingArguments and Seq2SeqTrainer classes have quite an extensive list of parameters.

doc: https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/trainer

'''
training_args = Seq2SeqTrainingArguments(
    output_dir = 'bart_samsum',
    evaluation_strategy = "epoch",
    save_strategy = 'epoch',
    load_best_model_at_end = True,
    metric_for_best_model = 'eval_loss',
    seed = seed,
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=True,
    report_to="none"
)
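
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">One consequence of these settings is worth spelling out: with gradient accumulation, the model weights are updated only after several small batches have been processed. The short calculation below is a sketch of that arithmetic. The number of devices and the dataset size are assumptions made for illustration, and the exact step count reported by the Trainer may differ slightly depending on how the last incomplete batch is handled.</p>

```python
import math

per_device_train_batch_size = 4   # from the training arguments above
gradient_accumulation_steps = 2   # from the training arguments above
num_devices = 1                   # assumption: a single GPU
num_train_examples = 1000         # hypothetical dataset size, for illustration only

# Number of examples that contribute to each weight update
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Approximate number of weight updates in one pass over the data
approx_updates_per_epoch = math.ceil(num_train_examples / effective_batch_size)

print(effective_batch_size, approx_updates_per_epoch)  # 8 125
```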

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Finally, the <code>Seq2SeqTrainer</code> class allows us to use <b>PyTorch</b> to fine-tune the model. In this class, we are basically defining the model, the training arguments, the datasets used for training and evaluation, the tokenizer, the data_collator, and the metrics.</p>

In [None]:
# Defining Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train() # Training model

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">We finally finished fine-tuning after 4 epochs. Since we had <code>load_best_model_at_end = True</code> in the training arguments, the Trainer automatically reloads the checkpoint with the best performance at the end of training, which in this case is the one with the lowest <code>Validation Loss</code>. Note that this loss is computed on the <code>eval_dataset</code> passed to the Trainer, which here is the tokenized test set.</p>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The second epoch was the one with the lowest validation loss, at <code><b>1.443861</b></code>. It also achieved the highest <code>Rouge1</code> and <code>Rouge2</code> scores, as well as the highest <code>Rougelsum</code> score.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">I have not presented the <code><b>Rougelsum</b></code> score previously. According to <a href = "https://pypi.org/project/rouge-score/">the documentation</a> of the rouge-score library, it is similar to the RougeL score, but it treats newlines as sentence boundaries and measures content coverage sentence by sentence, instead of over the entire summary as a single sequence.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The <code>Gen Len</code> column gives us the average length of the model-generated summaries. It is relevant to remember that we want short, yet informative, texts. In this case, the second epoch also yielded the shortest summaries on average.</p>
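
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Both RougeL and Rougelsum are built on the longest common subsequence (LCS) between two texts. The sketch below illustrates that idea only: <code>lcs_length</code> and <code>rouge_l_f</code> are hypothetical helpers that treat each text as a single sequence of whitespace-separated words, without stemming and without the sentence-level aggregation that Rougelsum adds. The two sentences are made-up examples.</p>

```python
def lcs_length(a, b):
    # Dynamic-programming table for the longest common subsequence.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l_f(prediction, reference):
    pred = prediction.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


f_score = rouge_l_f(
    "lawrence will finish the article today",
    "lawrence will finish writing the article soon",
)
print(round(f_score, 4))  # 0.7692
```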

<div id = 'evaluating'
     style="font-family: Calibri, serif; text-align: left;">
    <hr style="border: none;
               border-top: 2.85px solid #041445;
               width: 100%;
               margin-top: 62px;
               margin-bottom: auto;
               margin-left: 0;">
    <div style="font-size: 56px; letter-spacing: 2.25px;color: #02011a;"><b>Evaluating and Saving Model</b></div>
</div>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">After training and testing the model, we can evaluate its performance on the <code>validation</code> dataset. We can use the <code>evaluate</code> method for that.</p>

In [None]:
# Evaluating model performance on the tokenized validation dataset
validation = trainer.evaluate(eval_dataset = tokenized_val)
print(validation) # Printing results

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">This outputs the same metrics we have previously seen during training and testing. Here, we can see that performance is even <b>higher</b> in every metric than it was on the test set. When it comes to <code>Gen Len</code>, the summaries generated for the validation set are also more concise.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Considering that our results seem to be satisfactory at this point, we can go ahead and use the <code>save_model</code> method to save our fine-tuned model in the <code>bart_finetuned_samsum</code> directory. We can also use the <code>shutil</code> package to save the model in a <i>zip</i> file.</p>

In [None]:
# Saving model to a custom directory
directory = "bart_finetuned_samsum"
trainer.save_model(directory)

# Saving model tokenizer
tokenizer.save_pretrained(directory)

In [None]:
# Saving model in .zip format
shutil.make_archive('bart_finetuned_samsum', 'zip', '/kaggle/working/bart_finetuned_samsum')
shutil.move('bart_finetuned_samsum.zip', '/kaggle/working/bart_finetuned_samsum.zip')
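
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">If you want to check how <code>shutil.make_archive</code> behaves before running it on a real model directory, the sketch below archives a throwaway folder. The folder name <code>toy_model</code> and its single <code>config.json</code> file are hypothetical placeholders created inside a temporary directory, so nothing is written to <code>/kaggle/working</code>.</p>

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical model directory with a single placeholder file
    model_dir = os.path.join(tmp, "toy_model")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "config.json"), "w") as f:
        f.write("{}")

    # First argument: archive path without the extension.
    # Third argument: directory whose contents are archived.
    archive_path = shutil.make_archive(os.path.join(tmp, "toy_model"), "zip", model_dir)

    # Listing the archive contents to confirm what was saved
    with zipfile.ZipFile(archive_path) as zf:
        archived_files = zf.namelist()

print(os.path.basename(archive_path))
print(archived_files)
```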

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">After saving your model, you can easily <a href = "https://huggingface.co/docs/hub/models-uploading">upload it to Hugging Face Models</a> and use it on new datasets and texts.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The fine-tuned model we trained here is now available for everyone on Hugging Face, and you can have access to it by clicking on <a href = "https://huggingface.co/luisotorres/bart-finetuned-samsum">luisotorres/bart-finetuned-samsum</a>.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Let's load the model, using the summarization pipeline, and generate some summaries for human evaluation, where we evaluate if the model-generated summaries are accurate or not.</p>

In [None]:
# Loading summarization pipeline and model
summarizer = pipeline('summarization', model = 'luisotorres/bart-finetuned-samsum')

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">After loading the pipeline, we can now produce some summaries. I'll first start by using examples from the validation dataset, so we can compare our model-generated summaries to the reference summaries.</p>

In [None]:
# Obtaining an example from the validation dataset
val_ds[35]

In [None]:
text = "John: doing anything special?\r\nAlex: watching 'Millionaires' on tvn\r\nSam: me too! He has a chance to win a million!\r\nJohn: ok, fingers crossed then! :)"
summary = "Alex and Sam are watching Millionaires."
generated_summary = summarizer(text)

In [None]:
print('Original Dialogue:\n')
print(text)
print('\n' * 2)
print('Reference Summary:\n')
print(summary)
print('\n' * 2)
print('Model-generated Summary:\n')
print(generated_summary)
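
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The summarization pipeline returns a list with one dictionary per input, and the text itself is stored under the <code>summary_text</code> key. The sketch below shows how to pull out just the text. The <code>pipeline_output</code> value is a hand-written placeholder with that structure, not real model output, and <code>extract_summaries</code> is a hypothetical helper.</p>

```python
def extract_summaries(pipeline_output):
    # One dictionary per input text; keep only the generated string.
    return [item["summary_text"] for item in pipeline_output]


# Placeholder with the same structure as the pipeline's return value
pipeline_output = [{"summary_text": "placeholder summary"}]

summaries = extract_summaries(pipeline_output)
print(summaries[0])  # placeholder summary
```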

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">The model-generated summary is just a bit longer than the reference summary, but it still captures the content of the dialogue quite well.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Let's see another example.</p>

In [None]:
val_ds[22]

In [None]:
text = "Madison: Hello Lawrence are you through with the article?\r\nLawrence: Not yet sir. \r\nLawrence: But i will be in a few.\r\nMadison: Okay. But make it quick.\r\nMadison: The piece is needed by today\r\nLawrence: Sure thing\r\nLawrence: I will get back to you once i am through."
summary = "Lawrence will finish writing the article soon."
generated_summary = summarizer(text)

print('Original Dialogue:\n')
print(text)
print('\n' * 2)
print('Reference Summary:\n')
print(summary)
print('\n' * 2)
print('Model-generated Summary:\n')
print(generated_summary)

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Once again, the model-generated summary is longer than the reference summary. However, I would definitely say that the model-generated summary is more informative than the reference one because it lets us know that there's a sense of urgency for Lawrence to finish the article since Madison needs it by today.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Let's see another example.</p>

In [None]:
val_ds[4]

In [None]:
text = "Robert: Hey give me the address of this music shop you mentioned before\r\nRobert: I have to buy guitar cable\r\nFred: Catch it on google maps\r\nRobert: thx m8\r\nFred: ur welcome"
summary = "Robert wants Fred to send him the address of the music shop as he needs to buy guitar cable."
generated_summary = summarizer(text)

In [None]:
print('Original Dialogue:\n')
print(text)
print('\n' * 2)
print('Reference Summary:\n')
print(summary)
print('\n' * 2)
print('Model-generated Summary:\n')
print(generated_summary)

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">In this case, while the generated text captures the essence of the dialogue, it suffers from a lack of clarity due to ambiguity. Specifically, the pronoun <i>he</i> creates uncertainty about whether Fred or Robert intends to buy the guitar cable. In the original dialogue, it is clear that Robert is the one who has to buy the cable.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Now that we have been able to compare summaries, we can create some dialogues and input them into the model to check how it performs on them.</p>

In [None]:
# Creating new dialogues for evaluation
text = "John: Hey! I've been thinking about getting a PlayStation 5. Do you think it is worth it? \r\nDan: Idk man. R u sure ur going to have enough free time to play it? \r\nJohn: Yeah, that's why I'm not sure if I should buy one or not. I've been working so much lately idk if I'm gonna be able to play it as much as I'd like."
generated_summary = summarizer(text)

In [None]:
print('Original Dialogue:\n')
print(text)
print('\n' * 2)
print('Model-generated Summary:\n')
print(generated_summary)

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">For this dialogue, I have decided to include some abbreviations, such as <i>idk</i> (for <i>I don't know</i>) and <i>r u</i> (for <i>are you</i>), to observe how the model would interpret them.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">We can see that the model has successfully captured the essence of the dialogue and identified the main subject, which is John's uncertainty about buying a PlayStation 5, given that he has so little time to play it.</p>

In [None]:
text = "Camilla: Who do you think is going to win the competition?\r\nMichelle: I believe Jonathan should win but I'm sure Mike is cheating!\r\nCamilla: Why do you say that? Can you prove Mike is really cheating?\r\nMichelle: I can't! But I just know!\r\nCamilla: You shouldn't accuse him of cheating if you don't have any evidence to support it."
generated_summary = summarizer(text)

In [None]:
print('Original Dialogue:\n')
print(text)
print('\n' * 2)
print('Model-generated Summary:\n')
print(generated_summary)

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Once more the model captures the main theme of the conversation, which is Michelle's belief that Jonathan should win the competition, but that Mike may be cheating. Some further improvements could be made, though, such as including the information that Michelle cannot really show any evidence to support her belief that Mike is cheating.</p>

<div id = 'conclusion'
     style="font-family: Calibri, serif; text-align: left;">
    <hr style="border: none;
               border-top: 2.85px solid #041445;
               width: 100%;
               margin-top: 62px;
               margin-bottom: auto;
               margin-left: 0;">
    <div style="font-size: 56px; letter-spacing: 2.25px;color: #02011a;"><b>Conclusion and Deployment </b></div>
</div>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">In this notebook, we have explored how we can use <b><i>Large Language Models</i></b> for Natural Language Processing tasks and, more specifically, for Text Summarization.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">We delved into how Hugging Face's Transformers, Evaluate, and Datasets libraries can be used together with frameworks such as PyTorch to fine-tune pre-trained models with a large number of parameters. This type of technique is usually referred to as <b>transfer learning</b>, which allows Data Scientists and Machine Learning Engineers to exploit the knowledge gained from previous tasks to improve generalization on a new task.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">We used a BART model that had already been trained to summarize news articles and fine-tuned it to summarize dialogues with the <b>SamSum</b> dataset.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Thanks to Hugging Face's Models and Spaces, I have uploaded this model online, and it is free for anyone to use on their own summarization tasks or to fine-tune further on other tasks. I highly suggest you visit the <a href = "https://huggingface.co/luisotorres/bart-finetuned-samsum">luisotorres/bart-finetuned-samsum</a> model page for more information on how to use this model.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">I have also built a <b>web app</b> where you can use the model for the summarization of dialogues and news articles. Below, you can see some images of the web app, which is also available for free on <a href = "https://huggingface.co/spaces/luisotorres/bart-text-summarization">Bart Text Summarization</a>.</p>

<center>
    <img src = "https://i.imgur.com/ZMIsCHL.png">
<p style = "font-size: 16px;
            font-family: 'Calibri', serif;
            text-align: center;
            margin-top: 10px;">Example of Summarization of News Article</p>
</center>

<center>
    <img src = "https://i.imgur.com/fc48B0l.png">
<p style = "font-size: 16px;
            font-family: 'Calibri', serif;
            text-align: center;
            margin-top: 10px;">Example of Summarization of Dialogue</p>
</center>

<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">I hope that this notebook serves as a good introduction for those interested in the use of LLMs for Natural Language Processing tasks, as well as for those who already work with them and are in search of refining their knowledge on the subject.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">This notebook took quite a while to be made and I highly appreciate your feedback on this work. Feel free to leave your comments, suggestions, and upvotes if you liked the content presented here.</p>
          
<p style="font-family: Calibri, serif; text-align: left;
          font-size: 24px; letter-spacing: .85px;color: #02011a">Thank you very much!</p>

<hr style="border: 0;
           height: 1px;
           border-top: 0.85px solid #b2b2b2">
           
<div style="text-align: left;
            color: #8d8d8d;
            padding-left: 15px;
            font-size: 14.25px;">
    Luis Fernando Torres, 2023 <br><br>
    Let's connect!🔗<br>
    <a href="https://www.linkedin.com/in/luuisotorres/">LinkedIn</a> • <a href="https://medium.com/@luuisotorres">Medium</a> • <a href = "https://huggingface.co/luisotorres">Hugging Face</a><br><br>
</div>