# Overview

In this notebook, we visualize the SAMSum dataset and compute statistics on it. The tools we use are shown below.

![](https://github.com/Aisuko/notebooks/raw/main/images/Data%20Science.svg)

In [1]:
import warnings

warnings.filterwarnings('ignore')

In [2]:
# configuring pandas to display wider columns

import pandas as pd

pd.set_option('display.max_colwidth', 1000)

# Exploring the dataset

We are going to analyze each dataset separately.

In [3]:
train=pd.read_csv('/kaggle/input/samsum-dataset-text-summarization/samsum-train.csv')
test=pd.read_csv('/kaggle/input/samsum-dataset-text-summarization/samsum-test.csv')
val=pd.read_csv('/kaggle/input/samsum-dataset-text-summarization/samsum-validation.csv')
type(train)

pandas.core.frame.DataFrame

# Data Visualization

In [4]:
from IPython.display import display

def display_feature_list(features, feature_type):
    '''
    This function displays the features within each list for each type of data
    '''
    print(f'\n{feature_type} Features:')
    print(', '.join(features) if features else 'None')


def describe_dataframe(dataframe):
    global categorical_features, continuous_features, binary_features
    categorical_features=[col for col in dataframe.columns if dataframe[col].dtype=='object']
    binary_features=[col for col in dataframe.columns if dataframe[col].nunique() <=2 and dataframe[col].dtype!='object']
    continuous_features=[col for col in dataframe.columns if dataframe[col].dtype!='object' and col not in binary_features]
    
    print(f'\n{type(dataframe).__name__} shape: {dataframe.shape}')
    print(f'\n{dataframe.shape[0]:,.0f} samples')
    print(f'\n{dataframe.shape[1]:,.0f} attributes')
    print(f'\nMissing Data: \n{dataframe.isnull().sum()}')
    print(f'\nDuplicates:{dataframe.duplicated().sum()}')
    print(f'\nData types:\n{dataframe.dtypes}')
    
    display_feature_list(categorical_features, 'Categorical')
    display_feature_list(continuous_features, 'Continuous')
    display_feature_list(binary_features, 'Binary')
    
    print(f'\n{type(dataframe).__name__} Head: \n')
    display(dataframe.head(5))
    print(f'\n{type(dataframe).__name__} Tail: \n')
    display(dataframe.tail(5))


describe_dataframe(train)


DataFrame shape: (14732, 3)

14,732 samples

3 attributes

Missing Data: 
id          0
dialogue    1
summary     0
dtype: int64

Duplicates:0

Data types:
id          object
dialogue    object
summary     object
dtype: object

Categorical Features:
id, dialogue, summary

Continuous Features:
None

Binary Features:
None

DataFrame Head: 



Unnamed: 0,id,dialogue,summary
0,13818513,Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-),Amanda baked cookies and will bring Jerry some tomorrow.
1,13728867,Olivia: Who are you voting for in this election? \r\nOliver: Liberals as always.\r\nOlivia: Me too!!\r\nOliver: Great,Olivia and Olivier are voting for liberals in this election.
2,13681000,"Tim: Hi, what's up?\r\nKim: Bad mood tbh, I was going to do lots of stuff but ended up procrastinating\r\nTim: What did you plan on doing?\r\nKim: Oh you know, uni stuff and unfucking my room\r\nKim: Maybe tomorrow I'll move my ass and do everything\r\nKim: We were going to defrost a fridge so instead of shopping I'll eat some defrosted veggies\r\nTim: For doing stuff I recommend Pomodoro technique where u use breaks for doing chores\r\nTim: It really helps\r\nKim: thanks, maybe I'll do that\r\nTim: I also like using post-its in kaban style",Kim may try the pomodoro technique recommended by Tim to get more stuff done.
3,13730747,"Edward: Rachel, I think I'm in ove with Bella..\r\nrachel: Dont say anything else..\r\nEdward: What do you mean??\r\nrachel: Open your fu**ing door.. I'm outside",Edward thinks he is in love with Bella. Rachel wants Edward to open his door. Rachel is outside.
4,13728094,"Sam: hey overheard rick say something\r\nSam: i don't know what to do :-/\r\nNaomi: what did he say??\r\nSam: he was talking on the phone with someone\r\nSam: i don't know who\r\nSam: and he was telling them that he wasn't very happy here\r\nNaomi: damn!!!\r\nSam: he was saying he doesn't like being my roommate\r\nNaomi: wow, how do you feel about it?\r\nSam: i thought i was a good rommate\r\nSam: and that we have a nice place\r\nNaomi: that's true man!!!\r\nNaomi: i used to love living with you before i moved in with me boyfriend\r\nNaomi: i don't know why he's saying that\r\nSam: what should i do???\r\nNaomi: honestly if it's bothering you that much you should talk to him\r\nNaomi: see what's going on\r\nSam: i don't want to get in any kind of confrontation though\r\nSam: maybe i'll just let it go\r\nSam: and see how it goes in the future\r\nNaomi: it's your choice sam\r\nNaomi: if i were you i would just talk to him and clear the air","Sam is confused, because he overheard Rick complaining about him as a roommate. Naomi thinks Sam should talk to Rick. Sam is not sure what to do."



DataFrame Tail: 



Unnamed: 0,id,dialogue,summary
14727,13863028,"Romeo: You are on my ‘People you may know’ list.\nGreta: Ah, maybe it is because of the changed number of somebody’s?\nGreta: I don’t know you?\nRomeo: This might be the beginning of a beautiful relationship\nRomeo: How about adding me on your friend list and talk a bit?\nGreta: No.\nRomeo: Okay I see.",Romeo is trying to get Greta to add him to her friend list but she refuses.
14728,13828570,"Theresa: <file_photo>\r\nTheresa: <file_photo>\r\nTheresa: Hey Louise, how are u?\r\nTheresa: This is my workplace, they always give us so much food here 😊\r\nTheresa: Luckily they also offer us yoga classes, so all the food isn't much of a problem 😂\r\nLouise: Hey!! 🙂 \r\nLouise: Wow, that's awesome, seems great 😎 Haha\r\nLouise: I'm good! Are you coming to visit Stockholm this summer? 🙂\r\nTheresa: I don't think so :/ I need to prepare for Uni.. I will probably attend a few lessons this winter\r\nLouise: Nice! Do you already know which classes you will attend?\r\nTheresa: Yes, it will be psychology :) I want to complete a few modules that I missed :)\r\nLouise: Very good! Is it at the Uni in Prague?\r\nTheresa: No, it will be in my home town :)\r\nLouise: I have so much work right now, but I will continue to work until the end of summer, then I'm also back to Uni, on the 26th September!\r\nTheresa: You must send me some pictures, so I can see where you live :) \r\nLouise: I will,...","Theresa is at work. She gets free food and free yoga classes. Theresa won't go to visit Louise in Stockholm, because she will prepare for university psychology lessons. She'll be back at uni on 26th September."
14729,13819050,"John: Every day some bad news. Japan will hunt whales again\r\nErica: Yes, I've read this. It's very upsetting\r\nJohn: Cruel Japanese\r\nFaith: I think this is a racist remark. Because Island and Norways has never joined this international whaling agreement\r\nErica: really? I haven't known, everybody is so outraged by Japan\r\nFaith: sure, European hypocrisy \r\nJohn: not entirely. Scandinavians don't use the nets that Japanese use, so Norway and Island kill much less specimens that Japan will\r\nFaith: oh, it's much more complex than one may expect\r\nJohn: True, but the truth is, that all of them should stop\r\nJohn: and this decision is a step back\r\nFaith: yes, this is worrying\r\nErica: And it seems that the most important whaling countries are out of the agreement right now\r\nFaith: yes, seems so\r\nJohn: Just like USA leaving the Paris Agreement",Japan is going to hunt whales again. Island and Norway never stopped hunting them. The Scandinavians kill fewer whales than the Japanese.
14730,13828395,"Jennifer: Dear Celia! How are you doing?\r\nJennifer: The afternoon with the Collins was very pleasant, nice folks, but we missed you.\r\nJennifer: But I appreciate your consideration for Peter.\r\nCelia: My dear Jenny! It turns out that my decision not to come, though I wanted so much to see you again and Peter and the Collins, was right. Yesterday it all developed into a full bore cold. Sh.....\r\nCelia: All symptoms like in a text book.\r\nCelia: Luckily it's contagious only on the first 2, 3 days, so when we meet next week it should be alright.\r\nCelia: Thanks for asking! Somehow for all of us Peter comes first now.\r\nJennifer: That's too bad. Poor you...\r\nJennifer: I'll be driving to FR, do you want me to bring you sth? It's on my way.\r\nCelia: Thank you dear! I was at the pharmacy yesterday and had done my shopping the day before.\r\nCelia: You'd better still stay away from me in case I'm still contagious\r\nJennifer: Right. So I'll only leave a basket on your terrace. A...","Celia couldn't make it to the afternoon with the Collins and Jennifer as she is ill. She's working, but doesn't want to meet with Jennifer as it might be contagious. Jennifer will leave a basket with cookies on Celia's terrace."
14731,13729017,"Georgia: are you ready for hotel hunting? We need to book something finally for Lisbon\r\nJuliette: sure we can go on, show me what you found\r\nGeorgia: <file_photo>\r\nJuliette: nah... it looks like an old lady's room lol\r\nGeorgia: <file_photo>\r\nJuliette: that's better... but the bed doesn't look very comfortable\r\nGeorgia: i kind of like it and it's really close to the city center\r\nJuliette: show me the others please\r\nGeorgia: <file_photo>\r\nJuliette: nah... this one sucks too, look at those horrible curtains \r\nGeorgia: aff Julie you are such a princess\r\nJuliette: i just want to be comfortable\r\nGeorgia: come on, stop whining you know we are on a budget\r\nJuliette: well hopefully we can find something that's decent right?\r\nGeorgia: i did show you decent but you want a Marriott or something :/\r\nJuliette: ok ok don't get angry\r\nGeorgia: we need to decide today, the longer we wait the higher the prices get \r\nJuliette: ok how about we get the second one then?...","Georgia and Juliette are looking for a hotel in Lisbon. Juliette dislikes Georgia's choices. Juliette and Georgia decide on the second option presented by Georgia, but it has already been booked. Finally Georgia books the third hotel."


In [5]:
# Inspecting rows with a missing dialogue before removing them

mask=train['dialogue'].isnull() # creating mask with null dialogues
filtered_train=train[mask] # filtering dataframe
filtered_train

Unnamed: 0,id,dialogue,summary
6054,13828807,,problem with visualization of the content


In [6]:
# removing it
train=train.dropna()

# removing Id from categorical features list
categorical_features.remove('id')

In [7]:
# Data Visualization
import numpy as np
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

# Statistics & Mathematics
from scipy.stats import gaussian_kde

# paper_color=
# bg_color =

template='plotly_dark'

def histogram_boxplot(df, hist_color, box_color, height, width, legend, name):
    '''This function plots a Histogram and a Box plot side by side
    
    Parameters:
    * hist_color: the color of the histogram
    * box_color: the color of the boxplots
    * height and width: Image size
    * legend = Either to display legend or not
    '''
    
    features=df.select_dtypes(include=[np.number]).columns.tolist()
    
    for feat in features:
        try:
            fig=make_subplots(rows=1, cols=2, subplot_titles=['Box Plot', 'Histogram'], horizontal_spacing=0.2)
            density=gaussian_kde(df[feat])
            x_vals=np.linspace(min(df[feat]), max(df[feat]), 200)
            density_vals=density(x_vals)
            
            fig.add_trace(go.Scatter(x=x_vals, y=density_vals, mode='lines', fill='tozeroy', name='Density', line_color=hist_color), row=1, col=2)
            fig.add_trace(go.Box(y=df[feat], name='Box Plot', boxmean=True, line_color=box_color), row=1, col=1)
            fig.update_layout(title={'text':f'<b>{name} Word Count<br><sup><i>&nbsp;&nbsp;&nbsp;&nbsp;{feat}</i></sup></b>', 'x':.025, 'xanchor':'left'}, 
                              margin=dict(t=100),
                              showlegend=legend, 
                              template=template,
                              #plot_bgcolor=bg_color,
                              #paper_bgcolor=paper_color,
                              height=height,
                              width=width,
                             )
            fig.update_yaxes(title_text=f'<b>Word</b>', row=1, col=1, showgrid=False)
            fig.update_xaxes(title_text='', row=1, col=1, showgrid=False)
            fig.update_yaxes(title_text='<b>Frequency</b>', row=1, col=2, showgrid=False)
            fig.update_xaxes(title_text=f'<b>Words</b>', row=1, col=2, showgrid=False)
            
            fig.show()
            print('\n')
        except Exception as e:
            print(f'An error occurred: {e}')
            
    
    
df_text_length=pd.DataFrame() # creating an empty dataframe
for feat in categorical_features: # iterating through features --> dialogue & summary
    df_text_length[feat]=train[feat].apply(lambda x: len(str(x).split())) # counting words for each feature

# plotting histogram-boxplot
histogram_boxplot(df_text_length, '#89c2e0', '#d500ff', 600, 1000, True, 'Train Dataset')









From the result above, dialogues consist of about 94 words. There are some outliers with very long texts, going well over 300 words per dialogue. Summaries are naturally shorter, at about 20 words on average, although some outliers have long summaries as well.
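The figures read off the plots can also be checked numerically. The sketch below is a minimal illustration on a hypothetical two-row dataframe (`toy` is not part of the notebook); it applies the same whitespace-based word count and summarizes it with `describe()`.

```python
import pandas as pd

# Hypothetical toy data standing in for the `train` dataframe.
toy = pd.DataFrame({
    "dialogue": ["A: hi there\nB: hello, how are you?", "A: ok"],
    "summary": ["A greets B.", "A agrees."],
})

# Same whitespace-based word count as the one used for the plots.
word_counts = toy[["dialogue", "summary"]].apply(
    lambda col: col.map(lambda text: len(str(text).split()))
)

# Mean, quartiles (50% is the median), and extremes for each column.
stats = word_counts.describe()
print(stats)
```

On the real data, the same call on `df_text_length` should give the mean and median word counts directly, which is less error-prone than reading them off a box plot.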


# Statistics on the data

We can also use scikit-learn's `TfidfVectorizer` to extract more information from the available dialogues and summaries. Its output, converted to a dataframe, holds the top $n$ most frequent terms in the corpus, where $n$ is set through the `max_features` parameter.

In this dataframe, each column corresponds to one of the $n$ most frequent terms in the overall corpus, while each row corresponds to one entry in the original dataframe, such as `train`. For each term in each entry we see the associated **TF-IDF score**, which quantifies the relevance of a term in a given dialogue (or summary) relative to its frequency across all other dialogues (or summaries). We will also use the `ngram_range` parameter to select the most frequent single words (unigrams), sequences of two words (bigrams), and sequences of three words (trigrams). The `stop_words='english'` parameter filters out common English stop words, which add little to the overall context, such as "and", "of", etc.

We will then plot a heatmap of the correlations between these terms' scores. This may help us understand how frequently they are used together in dialogues.

In [8]:
# NLP libraries
from sklearn.feature_extraction.text import TfidfVectorizer

# data visualization
import plotly.figure_factory as ff

colormap='cividis'

def plot_correlation(df, title, subtitle, height, width, font_size):
    '''
    This function is responsible for plotting a correlation map among features in the dataset.
    
    Parameters:
    * height=Define height
    * width=Define width
    * font_size=Define the font size for the annotations
    '''
    corr=np.round(df.corr(numeric_only=True), 2)
    mask=np.triu(np.ones_like(corr, dtype=bool))
    c_mask=np.where(~mask, corr, 100)
    
    c=[]
    for i in c_mask.tolist()[1:]:
        c.append([x for x in i if x!=100])
    
    fig=ff.create_annotated_heatmap(z=c[::-1], x=corr.index.tolist()[:-1], y=corr.columns.tolist()[1:][::-1], colorscale=colormap)
    fig.update_layout(title={'text': f'<b>{title} Heatmap<br><sup>&nbsp;&nbsp;&nbsp;&nbsp;<i>{subtitle}</i></sup></b>', 'x':.025,'xanchor':'left','y':.95},
                      margin=dict(t=210, l=110),
                      yaxis=dict(autorange='reversed', showgrid=False),
                      xaxis=dict(showgrid=False),
                      template=template,
                      height=height,
                      width=width)
    fig.add_trace(go.Heatmap(z=c[::-1], colorscale=colormap, showscale=True, visible=False))
    fig.data[1].visible=True
    
    for i in range(len(fig.layout.annotations)):
        fig.layout.annotations[i].font.size=font_size
        
    fig.show()
    

vectorizer=TfidfVectorizer(max_features=15, stop_words='english') # top 15 terms
x=vectorizer.fit_transform(train['dialogue'])
df_tfidfvect=pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Unigrams','Train - Dialogue', 800, 800, 12)

We can see the correlations between these terms are neither strongly positive nor strongly negative. The most positively correlated terms are `don` and `know`, at 0.12. Note that `TfidfVectorizer` tokenizes the text with a default pattern that splits on punctuation and discards single-character tokens, which explains why `don't` appears as `don`, without the apostrophe and the `t`.
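The tokenization behaviour can be inspected directly through the vectorizer's analyzer. This is a minimal sketch with default settings (no stop-word list), applied to a made-up sentence.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Default settings: lowercase the text, then keep tokens of two or more word characters.
analyzer = TfidfVectorizer().build_analyzer()

tokens = analyzer("I don't know")
print(tokens)
```

The apostrophe acts as a token boundary, and the one-letter pieces `i` and `t` are too short to be kept, leaving only `don` and `know`.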

It is also interesting to notice a negative correlation, although still not a strong one, between the terms `yes` and `yeah`. Maybe this happens because it would be redundant to include both in the same dialogue, or perhaps the data captures a tendency of individuals to use `yeah` instead of `yes` during conversations. These are some hypotheses we can consider when analyzing this type of heatmap.

Let's perform the same analysis on the summaries.

In [9]:
vectorizer=TfidfVectorizer(max_features=15, stop_words='english')
x=vectorizer.fit_transform(train['summary'].fillna(''))
df_tfidvect=pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidvect, 'Unigrams', 'Train - Summary', 800, 800, 12)

The correlations of terms in summaries seem to be more pronounced than those in dialogues, even though these correlations are still not strong. This suggests that summaries may convey relevant information more succinctly than full dialogues, which is exactly the idea behind a summary.

We have positively correlated pairs such as `going` and `meet`, `come` and `party`, as well as `buy` and `wants`. It makes perfect sense to see these unigrams appearing together across texts. Conversely, it's reasonable for negatively correlated pairs **not** to co-occur frequently in texts, such as `going` and `wants`, and `going` and `got`.
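Reading the strongest pairs off a heatmap can be tedious, so a ranked list is a handy complement. The sketch below uses hypothetical scores (`scores` is made-up data, not the notebook's TF-IDF output) and lists each term pair once, sorted by correlation.

```python
import numpy as np
import pandas as pd

# Hypothetical TF-IDF-like scores for three terms across five documents.
scores = pd.DataFrame({
    "going": [0.9, 0.0, 0.8, 0.0, 0.7],
    "meet":  [0.8, 0.1, 0.9, 0.0, 0.6],
    "got":   [0.0, 0.9, 0.1, 0.8, 0.0],
})

corr = scores.corr()

# Keep only the strict lower triangle so each pair appears once
# and self-correlations are dropped.
lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))

# Flatten to a Series indexed by (term, term), sorted from most positive to most negative.
pairs = lower.stack().dropna().sort_values(ascending=False)
print(pairs)
```

Applied to a dataframe such as `df_tfidvect`, the head and tail of `pairs` would give the most positively and most negatively correlated terms.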


# Bigram statistics across dialogues and summaries

In [10]:
vectorizer=TfidfVectorizer(max_features=15, stop_words='english', ngram_range=(2,2))
x=vectorizer.fit_transform(train['dialogue'].fillna(''))
df_tfidfvect=pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Bigrams', 'Train - Dialogue', 800, 800, 12)

The correlations are not extremely strong. Still, we can see some pairs that seem reasonable to be together, such as `good idea` and `sounds like`.

In [11]:
vectorizer=TfidfVectorizer(max_features=15, stop_words='english', ngram_range=(2,2))
x=vectorizer.fit_transform(train['summary'].fillna(''))
df_tfidvect=pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidvect, 'Bigrams', 'Train - Summary', 800, 800, 12)

The only noticeable correlation is between the bigrams `wants buy` and `buy new`. The other terms do not appear to be correlated at all.

It is interesting to see the tendency of the summaries to contain information on minutes, which does not seem to be present in the dialogues. We can investigate this relationship further by querying some entries where the bigram `15 minutes` appears in the summary.

In [12]:
# filtering dataset to see those containing the term `15 minutes` in the summary.
filtered_train=train[train['summary'].str.contains('15 minutes', case=False, na=False)]
filtered_train.head()

Unnamed: 0,id,dialogue,summary
136,13827893,"Kate: I'm here <file_other>\r\nKate: there was no place in Red Lion\r\nSteven: hey! but it's quite far away\r\nKate: c'mon it's just 10 min by bike!\r\nSteven: yes, but I'm not by bike\r\nKate: car?\r\nSteven: nope\r\nSteven: by foot :P :P \r\nSteven: anyway google maps says 15 min and I'm there:D\r\nKate: ok, w8in ^^",Kate will meet with Steven in 15 minutes.
428,13811484-1,"Jenny: Let's go out to eat.\r\nLucy: That sounds like fun.\r\nJenny: Where do you wanna go?\r\nLucy: Let me think a minute.\r\nJenny: I feel like Chinese.\r\nLucy: That sounds yummy.\r\nJenny: I know a good Chinese restaurant.\r\nLucy: How far away is it?\r\nJenny: It's only 10 minutes from my place.\r\nLucy: Do we have to book a table?\r\nJenny: Oh, no. We can walk right in.\r\nLucy: Cool. Will be in 15 minute. I'm really hungry!",Jenny and Lucy are going to a Chinese restaurant to eat. They do not need to book a table. Lucy will be at Jenny's in 15 minutes.
570,13818296,Danielle: hey where RU?\r\nJuan: I told u I'd be late!\r\nDanielle: but it's been almost 45 mins!\r\nDanielle: <file_gif>\r\nJuan: I'll be there in 15 minutes\r\nJuan: <file_gif>,Juan is almost 45 minutes late. He'll be there in 15 minutes.
1213,13682296-1,"John: I know you will be outraged but I like to provoke you :P\r\nTyre: What is it?\r\nJohn: I talked to our neighbour today and I am really starting to think that religious people are just stupid.\r\nTyre: Gosh. You know it's a stupid claim.\r\nJohn: I know that there are some clever, religious individuals. But statistically religious people are stupid.\r\nTyre: It's not true. There are stupid religious people and clever ones, just like atheists.\r\nJohn: But most of academics are not religious.\r\nTyre: How do you know it?\r\nJohn: Experience but also some data I've seen.\r\nTyre: It's just not true.\r\nJohn: They are mostly people believing in things that have nothing to do with logic or reason: miracles, ghosts, witchcraft, just as our neighbour.\r\nTyre: I think it's only one part of them. There are theologians, people who actually know a lot about philosophy, logic etc.\r\nJohn: Yes, there are also people doing ""scientifically"" tarot, horoscopes and astrology.\r\nTyre: You ca...",John and Tyre's neighbour stopped John in the staircase and talked about some miracles for 15 minutes. John thinks that religious people are stupid. Tyre disagrees with this generalization.
1812,13820691,"Madge: are you alive? xD\r\nDorothy: i'm still drunk\r\nMadge: xDDDDDDDD jeeez\r\nFelicia: I don't know...how much did i drink?\r\nMadge: like 10 rounds\r\nFelicia: SHIT \r\nFelicia: you gotta be kidding me ahahaha xDDDDDDDDDDDDDDDDDD\r\nDorothy: of course she is\r\nDorothy: it was at least 15\r\nFelicia: ;________________;\r\nFelicia: was nice to meet you girls...shame on me as always\r\nDorothy: oh stop talking\r\nDorothy: just live the moment B-)\r\nFelicia: how am i supossed to live the moment if i don't remember the half o the night XD\r\nDorothy: well it happens :p \r\nMadge: we gotta repeat it, i had a lot of fun :D\r\nDorothy: i'm in. in 15min?\r\nFelicia: you're crazy ;-;",Dorothy is still intoxicated after at least 15 rounds of drink yesterday and can't remember much of what happened. She would like to meet her friends for a drink again in 15 minutes.


The last row gives us an idea of why we see so many terms related to minutes in summaries, but not in dialogues. In dialogues, people may write `15min` as one token or use other forms, such as `15m`, whereas the summaries use a standardized description, which naturally makes it more prominent than other ways of expressing time.
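One way to check this hypothesis is to search for several spellings at once with a regular expression. The following is a sketch on made-up strings; the pattern is an assumption about which variants matter and would likely need extending (for example, to cover `15m`).

```python
import pandas as pd

# Hypothetical examples of how "15 minutes" may be written.
texts = pd.Series([
    "I'll be there in 15 minutes",
    "i'm in. in 15min?",
    "google maps says 15 min",
    "see you at 15:00",
])

# Optional whitespace after the number, then "min" with an optional "ute" and optional "s".
pattern = r"\b15\s?min(?:ute)?s?\b"

matches = texts.str.contains(pattern, case=False, regex=True)
print(matches.tolist())
```

Counting matches of such a pattern in `train['dialogue']` versus `train['summary']` would show whether time expressions are really rarer in dialogues or just written less uniformly.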

## Trigram statistics

In [13]:
vectorizer=TfidfVectorizer(max_features=15, stop_words='english', ngram_range=(3,3))
x=vectorizer.fit_transform(train['dialogue'].fillna(''))
df_tfidfvect=pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Trigrams', 'Train - Dialogue', 800, 800, 12)

In [14]:
vectorizer=TfidfVectorizer(max_features=15, stop_words='english', ngram_range=(3,3))
x=vectorizer.fit_transform(train['summary'].fillna(''))
df_tfidfvect=pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Trigrams', 'Train - Summary', 800, 800, 12)

We can see that the terms are not strongly correlated. Still, some pairs seem logical to appear together in the corpus. Let's perform the exact same analysis on the test and validation datasets. We expect the same behavior as seen in the analysis of the training set. If something different appears, we will investigate further.

# Visualization and Statistics of the Test Dataset

In [15]:
describe_dataframe(test)


DataFrame shape: (819, 3)

819 samples

3 attributes

Missing Data: 
id          0
dialogue    0
summary     0
dtype: int64

Duplicates:0

Data types:
id          object
dialogue    object
summary     object
dtype: object

Categorical Features:
id, dialogue, summary

Continuous Features:
None

Binary Features:
None

DataFrame Head: 



Unnamed: 0,id,dialogue,summary
0,13862856,"Hannah: Hey, do you have Betty's number?\nAmanda: Lemme check\nHannah: <file_gif>\nAmanda: Sorry, can't find it.\nAmanda: Ask Larry\nAmanda: He called her last time we were at the park together\nHannah: I don't know him well\nHannah: <file_gif>\nAmanda: Don't be shy, he's very nice\nHannah: If you say so..\nHannah: I'd rather you texted him\nAmanda: Just text him 🙂\nHannah: Urgh.. Alright\nHannah: Bye\nAmanda: Bye bye",Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.
1,13729565,Eric: MACHINE!\r\nRob: That's so gr8!\r\nEric: I know! And shows how Americans see Russian ;)\r\nRob: And it's really funny!\r\nEric: I know! I especially like the train part!\r\nRob: Hahaha! No one talks to the machine like that!\r\nEric: Is this his only stand-up?\r\nRob: Idk. I'll check.\r\nEric: Sure.\r\nRob: Turns out no! There are some of his stand-ups on youtube.\r\nEric: Gr8! I'll watch them now!\r\nRob: Me too!\r\nEric: MACHINE!\r\nRob: MACHINE!\r\nEric: TTYL?\r\nRob: Sure :),Eric and Rob are going to watch a stand-up on youtube.
2,13680171,"Lenny: Babe, can you help me with something?\r\nBob: Sure, what's up?\r\nLenny: Which one should I pick?\r\nBob: Send me photos\r\nLenny: <file_photo>\r\nLenny: <file_photo>\r\nLenny: <file_photo>\r\nBob: I like the first ones best\r\nLenny: But I already have purple trousers. Does it make sense to have two pairs?\r\nBob: I have four black pairs :D :D\r\nLenny: yeah, but shouldn't I pick a different color?\r\nBob: what matters is what you'll give you the most outfit options\r\nLenny: So I guess I'll buy the first or the third pair then\r\nBob: Pick the best quality then\r\nLenny: ur right, thx\r\nBob: no prob :)",Lenny can't decide which trousers to buy. Bob advised Lenny on that topic. Lenny goes with Bob's advice to pick the trousers that are of best quality.
3,13729438,"Will: hey babe, what do you want for dinner tonight?\r\nEmma: gah, don't even worry about it tonight\r\nWill: what do you mean? everything ok?\r\nEmma: not really, but it's ok, don't worry about cooking though, I'm not hungry\r\nWill: Well what time will you be home?\r\nEmma: soon, hopefully\r\nWill: you sure? Maybe you want me to pick you up?\r\nEmma: no no it's alright. I'll be home soon, i'll tell you when I get home. \r\nWill: Alright, love you. \r\nEmma: love you too.",Emma will be home soon and she will let Will know.
4,13828600,"Ollie: Hi , are you in Warsaw\r\nJane: yes, just back! Btw are you free for diner the 19th?\r\nOllie: nope!\r\nJane: and the 18th?\r\nOllie: nope, we have this party and you must be there, remember?\r\nJane: oh right! i lost my calendar.. thanks for reminding me\r\nOllie: we have lunch this week?\r\nJane: with pleasure!\r\nOllie: friday?\r\nJane: ok\r\nJane: what do you mean "" we don't have any more whisky!"" lol..\r\nOllie: what!!!\r\nJane: you just call me and the all thing i heard was that sentence about whisky... what's wrong with you?\r\nOllie: oh oh... very strange! i have to be carefull may be there is some spy in my mobile! lol\r\nJane: dont' worry, we'll check on friday.\r\nOllie: don't forget to bring some sun with you\r\nJane: I can't wait to be in Morocco..\r\nOllie: enjoy and see you friday\r\nJane: sorry Ollie, i'm very busy, i won't have time for lunch tomorrow, but may be at 6pm after my courses?this trip to Morocco was so nice, but time consuming!\r\nOllie: ok fo...",Jane is in Warsaw. Ollie and Jane has a party. Jane lost her calendar. They will get a lunch this week on Friday. Ollie accidentally called Jane and talked about whisky. Jane cancels lunch. They'll meet for a tea at 6 pm.



DataFrame Tail: 



Unnamed: 0,id,dialogue,summary
814,13611902-1,"Alex: Were you able to attend Friday night's basketball game?\r\nBenjamin: I was unable to make it.\r\nAlex: You should have been there. It was intense.\r\nBenjamin: Is that right. Who ended up winning?\r\nAlex: Our team was victorious.\r\nBenjamin: I wish I was free that night. I'm kind of mad that I didn't go.\r\nAlex: It was a great game. Everything alright tough?\r\nBenjamin: Yeah man thanks for asking, it's just that my mom is sick and I am taking care of her.\r\nAlex: Oh sorry to hear that. Hope she makes a fast recovery 💪\r\nBenjamin: She will, she just has a nasty flu but she will be alright :D\r\nAlex: Glad to hear that!\r\nBenjamin: What was the score at the end of the game?\r\nAlex: Our team won 101-98.\r\nBenjamin: Sounds like it was a close game then.\r\nAlex: That's the reason it was such a great game.\r\nBenjamin: I'll go to the next one for sure.\r\nAlex: It's next weekend so you better put on your calendar ahaha\r\nBenjamin: ahaha I will I will. Talk to you later!\...",Benjamin didn't come to see a basketball game on Friday's night. The team supported by Alex won 101-98. Benjamin's mom has a flu and he's looking after her. Benjamin declares to attend the next basketball match.
815,13820989,Jamilla: remember that the audition starts at 7.30 P.M.\r\nKiki: which station?\r\nJamilla: Antena 3\r\nYoyo: roger that,The audition starts at 7.30 P.M. in Antena 3.
816,13717193,"Marta: <file_gif>\r\nMarta: Sorry girls, I clicked something by accident :D\r\nAgnieszka: No problem :p\r\nWeronika: Hahaha\r\nAgnieszka: Good thing you didn't send something from your gallery ;)","Marta sent a file accidentally,"
817,13829115,"Cora: Have you heard how much fuss British media made about meet and greet with James Charles in Birmingham?\r\nEllie: no...! what happened?\r\nCora: Well, there was a meet and greet with James Charles in one of the malls in Birmingham and about 8000 fans showed up for it.\r\nCora: It cause a gridlock around the mall and - of course - British media had to make some (quite negative) comments on it.\r\nEllie: they came for sister James?! >:(\r\nEllie: i sister snapped!! :p :D\r\nCora: Haha :D\r\nCora: You shouldn't watch so much youtube, you're getting weirder and weirder. :d\r\nEllie: sister shut up :P so, what did they say?\r\nCora: ;) :* ""Daily Mail"" was surprised that a meet and greet with a ""virtually unknown"" youtuber gathered 8000 people. :p\r\nCora: A host from LBC tried to find an answer to an unanswerable question: ""Who is James Charles?"". Eventually James called him and introduced himself. On air. :D\r\nEllie: there's something called google lol\r\nCora: Right? :p\r\nCora:...",There was a meet-and-greet with James Charles in Birmingham which gathered 8000 people.
818,13818810,"Rachel: <file_other>\r\nRachel: Top 50 Best Films of 2018\r\nRachel: :)\r\nJanice: Omg, I've watched almost all 50... xDD\r\nSpencer: Hahah, Deadpool 2 also??\r\nJanice: Yep\r\nSpencer: Really??\r\nJanice: My bf forced me to watch it xD\r\nRachel: Hahah\r\nJanice: It wasn't that bad\r\nJanice: I thought it'd be worse\r\nRachel: And Avengers? :D\r\nJanice: 2 times\r\nRachel: Omg\r\nJanice: xP\r\nRachel: You are the best gf in the world\r\nRachel: Your bf should appreciate that ;-)\r\nJanice: He does\r\nJanice: x)","Rachel sends a list of Top 50 films of 2018. Janice watched almost half of them, Deadpool 2 and Avengers included."


In [16]:
# remove 'id' from the global categorical_features list (it is reset by each describe_dataframe call)
categorical_features.remove('id')

df_text_length=pd.DataFrame()
for feat in categorical_features:
    df_text_length[feat]=test[feat].apply(lambda x: len(str(x).split()))

histogram_boxplot(df_text_length, '#89c2e0', '#d500ff', 600, 1000, True, 'Test Dataset')









In [17]:
vectorizer=TfidfVectorizer(max_features=15, stop_words='english')
x=vectorizer.fit_transform(test['dialogue'].fillna(''))
df_tfidfvect=pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Unigrams', 'Test - Dialogue', 800, 800, 12)

In [18]:
vectorizer=TfidfVectorizer(max_features=15, stop_words='english')
x=vectorizer.fit_transform(test['summary'].fillna(''))
df_tfidfvect=pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Unigrams', 'Test - Summary', 800, 800, 12)

In [19]:
vectorizer=TfidfVectorizer(max_features=15, stop_words='english', ngram_range=(2,2))
x=vectorizer.fit_transform(test['dialogue'].fillna(''))
df_tfidfvect=pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Bigrams', 'Test - Dialogue', 800, 800, 12)

In [20]:
vectorizer=TfidfVectorizer(max_features=15, stop_words='english', ngram_range=(2,2))
x=vectorizer.fit_transform(test['summary'].fillna(''))
df_tfidfvect=pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Bigrams', 'Test - Summary', 800, 800, 12)

In [21]:
vectorizer=TfidfVectorizer(max_features=15, stop_words='english', ngram_range=(3,3))
x=vectorizer.fit_transform(test['dialogue'].fillna(''))
df_tfidfvect=pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Trigrams', 'Test - Dialogue', 800, 800, 12)

In [22]:
vectorizer=TfidfVectorizer(max_features=15, stop_words='english', ngram_range=(3,3))
x=vectorizer.fit_transform(test['summary'].fillna(''))
df_tfidfvect=pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Trigrams', 'Test - Summary', 800, 800, 12)

# Validation Dataset

In [23]:
describe_dataframe(val)


DataFrame shape: (818, 3)

818 samples

3 attributes

Missing Data: 
id          0
dialogue    0
summary     0
dtype: int64

Duplicates:0

Data types:
id          object
dialogue    object
summary     object
dtype: object

Categorical Features:
id, dialogue, summary

Continuous Features:
None

Binary Features:
None

DataFrame Head: 



Unnamed: 0,id,dialogue,summary
0,13817023,"A: Hi Tom, are you busy tomorrow’s afternoon?\r\nB: I’m pretty sure I am. What’s up?\r\nA: Can you go with me to the animal shelter?.\r\nB: What do you want to do?\r\nA: I want to get a puppy for my son.\r\nB: That will make him so happy.\r\nA: Yeah, we’ve discussed it many times. I think he’s ready now.\r\nB: That’s good. Raising a dog is a tough issue. Like having a baby ;-) \r\nA: I'll get him one of those little dogs.\r\nB: One that won't grow up too big;-)\r\nA: And eat too much;-))\r\nB: Do you know which one he would like?\r\nA: Oh, yes, I took him there last Monday. He showed me one that he really liked.\r\nB: I bet you had to drag him away.\r\nA: He wanted to take it home right away ;-).\r\nB: I wonder what he'll name it.\r\nA: He said he’d name it after his dead hamster – Lemmy - he's a great Motorhead fan :-)))",A will go to the animal shelter tomorrow to get a puppy for her son. They already visited the shelter last Monday and the son chose the puppy.
1,13716628,"Emma: I’ve just fallen in love with this advent calendar! Awesome! I wanna one for my kids!\r\nRob: I used to get one every year as a child! Loved them! \r\nEmma: Yeah, i remember! they were filled with chocolates!\r\nLauren: they are different these days! much more sophisticated! Haha!\r\nRob: yeah, they can be fabric/ wooden, shop bought/ homemade, filled with various stuff\r\nEmma: what do you fit inside?\r\nLauren: small toys, Christmas decorations, creative stuff, hair bands & clips, stickers, pencils & rubbers, small puzzles, sweets\r\nEmma: WOW! That’s brill! X\r\nLauren: i add one more very special thing as well- little notes asking my children to do something nice for someone else\r\nRob: i like that! My sister adds notes asking her kids questions about christmas such as What did the 3 wise men bring? etc\r\nLauren: i reckon it prepares them for Christmas \r\nEmma: and makes it more about traditions and being kind to other people\r\nLauren: my children get very excited eve...","Emma and Rob love the advent calendar. Lauren fits inside calendar various items, for instance, small toys and Christmas decorations. Her children are excited whenever they get the calendar."
2,13829420,Jackie: Madison is pregnant\r\nJackie: but she doesn't wanna talk about it\r\nIggy: why\r\nJackie: I don't know why because she doesn't wanna talk about it\r\nIggy: ok\r\nJackie: I wanted to prepare you for it because people get super excited and ask lots of questions\r\nJackie: and she looked way more anxious than excited\r\nIggy: she's probably worrying about it\r\nIggy: she's taking every commitment really seriously\r\nJackie: it could be money problems or relationship problems\r\nIggy: or maybe she wants an abortion\r\nJackie: it could be all of the above\r\nIggy: but you know what?\r\nIggy: once my friend was pregnant and I couldn't bring myself to be happy about it\r\nJackie: why?\r\nIggy: I felt they were immature and I couldn't picture this couple as parents\r\nJackie: I felt similar way on Patricia's wedding\r\nIggy: Patricia Stevens?\r\nJackie: yes\r\nIggy: so we're talking about the same person\r\nJackie: what a coincidence\r\nJackie: so she's pregnant?\r\nIggy: she thou...,Madison is pregnant but she doesn't want to talk about it. Patricia Stevens got married and she thought she was pregnant.
3,13819648,"Marla: <file_photo>\r\nMarla: look what I found under my bed\r\nKiki: lol\r\nTamara: is that someone's underwear?\r\nMarla: it certainly isn't mine, my ass is big but it isn't huge\r\nKiki: it looks like male underwear\r\nTamara: not necessarily, maybe some butch had fun in your room while you were gone\r\nMarla: ok but how can you leave your underwear after hooking up? wtf is wrong with people\r\nKiki: she or he could be too wasted to notice\r\nTamara: or maybe someone put their pants there to piss you off\r\nMarla: that makes no sense\r\nMarla: it's so fucking childish\r\nKiki: if it's childish then it must have been your sister's idea\r\nMarla: she's 13, she doesn't have underwear that isn't pink\r\nTamara: maybe it belonged to one of your exes?\r\nKiki: she would have recognized it\r\nMarla: lol we're doing total CSI investigation on one pair of boxers :D\r\nKiki: <file_gif>\r\nTamara: lol\r\nTamara: I think your sister convinced someone to put their underwear in your room as a...",Marla found a pair of boxers under her bed.
4,13728448,Robert: Hey give me the address of this music shop you mentioned before\r\nRobert: I have to buy guitar cable\r\nFred: <file_other>\r\nFred: Catch it on google maps\r\nRobert: thx m8\r\nFred: ur welcome,Robert wants Fred to send him the address of the music shop as he needs to buy guitar cable.



DataFrame Tail: 



Unnamed: 0,id,dialogue,summary
813,13829423,"Carla: I've got it...\r\nDiego: what?\r\nCarla: my date for graduation. Hope you're coming\r\nDiego: if you tell me when...\r\nCarla: oups sorry. June 4th\r\nDiego: we've got time.\r\nCarla: of course, but you have to book your plane\r\nDiego: i still don't know, and it's quite expensive\r\nCarla: that's why you have to book it right now. Please tell me you'll come\r\nDiego: i'd love to for sure\r\nCarla: come, come, please\r\nDiego: ok, i'll have a look and tell you.\r\nCarla: you could stay home for the week, my roommate won't be there.\r\nDiego: didn't you tell me your parents would come?\r\nCarla: yes they will, but they've got friends they could stay with.\r\nDiego: what was the company you flew with when you came last month?\r\nCarla: aeromexico was the cheapest at that time, but check with delta\r\nDiego: i think there is some flight comparison websites and also some apps.\r\nCarla: i only know the canadian one\r\nDiego: don't worry i'll find out \r\nCarla: ok ! i've to l...",Carla's date for graduation is on June 4th. Diego will try to come then.
814,13727710,"Gita: Hello, this is Beti's Mum Gita, I wanted to ask if you were going on the school trip?\r\nBev: Hi Gita, yes, Milo wants me to come, he's a bit nervous going away from home or school still.\r\nGita: Yes, Beti is the same, they are still only 4 or 5 after all.\r\nBev: I know, still so young! It will help the teachers and TAs anyway, they have a lot to cope with!\r\nGita: I know, I could never do their job! I work part time as a music teacher, going round schools.\r\nBev: Oh really? I am in Marks, part time too, love it there! \r\nGita: Yes, it really helps to do some sort of work doesn't it! I could never manage full time, though.\r\nBev: Oh, I know, Gita. My sister's in management and she doesn't see her kids from 6.30am to 6.30pm every day! She is a high flier, but she does miss them. She does do lots with them on the weekend, though.\r\nGita: Yes, but children need time to just be at home and play or just be with family, not galavanting around all the time!\r\nBev: I agree 10...",Bev is going on the school trip with her son. Gita is going on the school trip with her daughter. Bev's sister rarely sees her children during the week because of her job. Gita has a few pets at home. The mothers with their children have to be at school at 7.45 to not miss the bus.
815,13829261,"Julia: Greg just texted me\r\nRobert: ugh, delete him already\r\nJulia: He's saying he's sorry\r\nRobert: damn girl, delete the bastard\r\nJulia: it's not that simple, you know it\r\nRobert: No Julia, it is pretty simple\r\nRobert: go and delete him\r\nJulia: But he apologised, ok? He's never done it before\r\nRobert: srsly?\r\nRobert: do I need to remind you he cheated on you?\r\nRobert: Julia I'm not going through this again with you\r\nJulia: People change, I do believe it, maybe he changed. He apologised\r\nRobert: and that's it? That' ok? how's different from two other times?\r\nJulia: i told you - he apologised! he's sorry, he wants to meet\r\nRobert: don't, honey, really. We've been through this\r\nJulia: I know, but it's not easy. I think I love him\r\nRobert: i know you do, but you need to be strong. do you want to come over?\r\nJulia: no, thank you love, but i have to get up early tomorrow\r\nRobert: ok, you should go to sleep then\r\nJulia: what about Greg?\r\nRobert: do...",Greg cheated on Julia. He apologises to her. Robert tells Julia not to meet Greg.
816,13680226,"Marry: I broke my nail ;(\r\nTina: oh, no!\r\nMarry: u know I have that party tomorrow!!!\r\nTina: I know, let me think...\r\nTina: I got it!. My sister friend is a cosmetitian, maybe she 'll help\r\nMarry: anyone will be good, I'm desperate!\r\nTina: I'll call her and let u know, ok?\r\nMarry: ok, I'll wait, but hurry!",Marry broke her nail and has a party tomorrow. Tina will call a cosmetician that she knows and let Marry know if she can help.
817,13862383,"Paige: I asked them to wait and send the declaration later\nPaige: Even end of March if it's possible\nMaddy: What did they say?\nPaige: They want to close it asap cause Lisa is afraid she forgets about it later\nPaige: But I can remind her in a couple of weeks\nPaige: It's my responsibility after all\nMaddy: But does it really matter? I mean the declaration\nMaddy: I think the deadline for payment is 31 March anyway\nPaige: I'm not sure, that's what I asked her\nPaige: Hope she confirms",Paige wants to have the declaration sent later. Lisa wants to send it soon. The deadline for payment is 31 March.


In [24]:
categorical_features.remove('id')
df_text_length=pd.DataFrame()
for feat in categorical_features:
    df_text_length[feat]=val[feat].apply(lambda x: len(str(x).split()))

histogram_boxplot(df_text_length, '#89c2e0', '#d500ff', 600, 1000, True, 'Validation Dataset')









In [25]:
vectorizer=TfidfVectorizer(max_features=15, stop_words='english')
x=vectorizer.fit_transform(val['dialogue'].fillna(''))
df_tfidfvect=pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Unigrams', 'Validation - Dialogue', 800, 800, 12)

In [26]:
vectorizer=TfidfVectorizer(max_features=15, stop_words='english')
x=vectorizer.fit_transform(val['summary'].fillna(''))
df_tfidfvect=pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Unigrams', 'Validation - Summary', 800, 800, 12)

In [27]:
vectorizer=TfidfVectorizer(max_features=15, stop_words='english', ngram_range=(2,2))
x=vectorizer.fit_transform(val['dialogue'].fillna(''))
df_tfidfvect=pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Bigrams', 'Validation - Dialogue', 800, 800, 12)

In [28]:
vectorizer=TfidfVectorizer(max_features=15, stop_words='english', ngram_range=(2,2))
x=vectorizer.fit_transform(val['summary'].fillna(''))
df_tfidfvect=pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Bigrams', 'Validation - Summary', 800, 800, 12)

In [29]:
vectorizer=TfidfVectorizer(max_features=15, stop_words='english', ngram_range=(3,3))
x=vectorizer.fit_transform(val['dialogue'].fillna(''))
df_tfidfvect=pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Trigrams', 'Validation - Dialogue', 800, 800, 12)

In [30]:
vectorizer=TfidfVectorizer(max_features=15, stop_words='english', ngram_range=(3,3))
x=vectorizer.fit_transform(val['summary'].fillna(''))
df_tfidfvect=pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Trigrams', 'Validation - Summary', 800, 800, 12)

Overall, the three datasets show similar patterns. Summaries are shorter than dialogues, as expected, and terms that plausibly occur together show a higher degree of correlation. The `n-grams` heatmaps also make it clear that this data consists of chat/dialogue texts, since many of the terms are ones that usually appear in conversations. We will use this dataset to fine-tune an LLM for text summarization tasks in [Text Summarization with Bart Series LLM](https://www.kaggle.com/code/aisuko/text-summarization-with-bart-series-llm).

# Credit

* https://www.kaggle.com/code/lusfernandotorres/text-summarization-with-large-language-models/notebook