# The One With ML : Episode clustering

We will embed the summary of every episode of the Friends TV show with an embedding model. Then, we will reduce the embeddings to a 2D plane with UMAP and cluster the reduced points with DBSCAN.

The goal is to see whether similar episodes end up grouped together.

We will use the data from the [Friends Series Dataset](https://www.kaggle.com/datasets/rezaghari/friends-series-dataset).

### Imports

In [None]:
from tqdm import tqdm

# HuggingFace
from transformers import AutoModel, AutoTokenizer
from datasets import Dataset

# ML
from sklearn.cluster import DBSCAN
import pandas as pd
import numpy as np
import torch
import torch.nn.functional as F
import nltk
import umap.umap_ as umap

# Visualization
import plotly.express as px


In [2]:
data_path = 'data/'
dataset_file_name = 'friends_episodes_augmented.csv'
dataset_path = data_path + dataset_file_name

### Loading data

In [3]:
# The first row of the dataset contains an unknown character. Simply replace its description with:
# Monica and the gang introduce Rachel to the ""real world"" after she leaves her fiancé at the altar.

friends_episodes_description_df = pd.read_csv(dataset_path)
friends_episodes_description_df.head(5)

Unnamed: 0,Year_of_prod,Season,Episode Number,Episode_Title,Duration,Summary,Director,Stars,Votes
0,1994,1,1,The One Where Monica Gets a Roommate: The Pilot,22,"Monica and the gang introduce Rachel to the ""r...",James Burrows,8.3,7440
1,1994,1,2,The One with the Sonogram at the End,22,Ross finds out his ex-wife is pregnant. Rachel...,James Burrows,8.1,4888
2,1994,1,3,The One with the Thumb,22,Monica becomes irritated when everyone likes h...,James Burrows,8.2,4605
3,1994,1,4,The One with George Stephanopoulos,22,Joey and Chandler take Ross to a hockey game t...,James Burrows,8.1,4468
4,1994,1,5,The One with the East German Laundry Detergent,22,"Eager to spend time with Rachel, Ross pretends...",Pamela Fryman,8.5,4438
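
The replacement described in the comment above can also be applied in pandas after loading, instead of editing the CSV by hand. The following is a minimal sketch on a toy DataFrame: `toy_df` and `FIXED_PILOT_SUMMARY` are illustrative names, and the position of the unreadable character (shown here as U+FFFD) is an assumption.

```python
import pandas as pd

# Toy stand-in for the first rows of the dataset.
# Assumption: the unreadable character shows up as U+FFFD; its position here is arbitrary.
toy_df = pd.DataFrame({
    "Episode_Title": [
        "The One Where Monica Gets a Roommate: The Pilot",
        "The One with the Sonogram at the End",
    ],
    "Summary": [
        "Monica and the gang introduce Rachel to the \"real world\" after she leaves her fianc\ufffd at the altar.",
        "Ross finds out his ex-wife is pregnant.",
    ],
})

FIXED_PILOT_SUMMARY = (
    'Monica and the gang introduce Rachel to the "real world" '
    "after she leaves her fiancé at the altar."
)

# Overwrite only the first row's summary; every other row stays untouched.
toy_df.loc[0, "Summary"] = FIXED_PILOT_SUMMARY
```

On the real data, the same `.loc[0, 'Summary']` assignment would be applied to `friends_episodes_description_df` right after `pd.read_csv`.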


In [4]:
print(friends_episodes_description_df.head(5).to_markdown())

|    |   Year_of_prod |   Season |   Episode Number | Episode_Title                                   |   Duration | Summary                                                                                                                                                                                                                      | Director      |   Stars |   Votes |
|---:|---------------:|---------:|-----------------:|:------------------------------------------------|-----------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------|--------:|--------:|
|  0 |           1994 |        1 |                1 | The One Where Monica Gets a Roommate: The Pilot |         22 | Monica and the gang introduce Rachel to the "real world" after she leaves her fiancé at the altar.                                 

Let's look at some statistics about the episode summaries.

In [5]:
# Find number of characters and words in summaries
# (nltk.word_tokenize requires NLTK's Punkt tokenizer data to be downloaded beforehand)
friends_episodes_description_df['summary_char_length'] = friends_episodes_description_df['Summary'].str.len()
friends_episodes_description_df['summary_word_length'] = friends_episodes_description_df['Summary'].map(lambda x: len(nltk.word_tokenize(x)))

In [6]:
px.histogram(
    friends_episodes_description_df['summary_char_length'], 
    nbins=30, 
    title='Number of characters in episode summaries',
    width=1024,
    height=512
).update_layout(
    xaxis_title='Number of characters',
    yaxis_title='Count',
    showlegend=False
)

In [7]:
px.histogram(
    friends_episodes_description_df['summary_word_length'], 
    nbins=30, 
    title='Number of words in episode summaries',
    width=1024,
    height=512
).update_layout(
    xaxis_title='Number of words',
    yaxis_title='Count',
    showlegend=False
)

To give the model more context about each episode, let's prepend the episode title to its summary.

In [8]:
friends_episodes_description_df['title_summary'] = friends_episodes_description_df['Episode_Title'] + ' : ' + friends_episodes_description_df['Summary']
friends_episodes_description_df.head()

Unnamed: 0,Year_of_prod,Season,Episode Number,Episode_Title,Duration,Summary,Director,Stars,Votes,summary_char_length,summary_word_length,title_summary
0,1994,1,1,The One Where Monica Gets a Roommate: The Pilot,22,"Monica and the gang introduce Rachel to the ""r...",James Burrows,8.3,7440,98,21,The One Where Monica Gets a Roommate: The Pilo...
1,1994,1,2,The One with the Sonogram at the End,22,Ross finds out his ex-wife is pregnant. Rachel...,James Burrows,8.1,4888,151,29,The One with the Sonogram at the End : Ross fi...
2,1994,1,3,The One with the Thumb,22,Monica becomes irritated when everyone likes h...,James Burrows,8.2,4605,181,36,The One with the Thumb : Monica becomes irrita...
3,1994,1,4,The One with George Stephanopoulos,22,Joey and Chandler take Ross to a hockey game t...,James Burrows,8.1,4468,197,41,The One with George Stephanopoulos : Joey and ...
4,1994,1,5,The One with the East German Laundry Detergent,22,"Eager to spend time with Rachel, Ross pretends...",Pamela Fryman,8.5,4438,220,42,The One with the East German Laundry Detergent...


### Model Loading

In [9]:
# Pick the best available device: CUDA GPU, then Apple MPS, otherwise CPU
DEVICE = 'cpu'

if torch.cuda.is_available():
    DEVICE = 'cuda'
elif torch.backends.mps.is_available():
    DEVICE = 'mps'

In [10]:
# Load bge model used for generating embeddings of summaries
model_path = 'BAAI/bge-large-en-v1.5'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(DEVICE)

### Embeddings generation

In [11]:
# Convert to a HuggingFace dataset for easier batching and better RAM handling
dataset = Dataset.from_pandas(friends_episodes_description_df)

In [12]:
MAX_LENGTH_TOKENS = 64 # Summaries are usually not more than 60 words; longer inputs are truncated

def generate_embeddings(summaries):
    """Converts a list of summaries into a list containing their respective embeddings"""
    tokenized_summaries = tokenizer(summaries, max_length=MAX_LENGTH_TOKENS, padding=True, truncation=True, return_tensors='pt').to(DEVICE)
    with torch.no_grad(): # Inference only, no gradients needed
        outputs = model(**tokenized_summaries)
    embeddings = outputs.last_hidden_state[:, 0] # Hidden state of the first ([CLS]) token
    normalized_embeddings = F.normalize(embeddings, p=2, dim=1)
    return {'embeddings' : normalized_embeddings.cpu().numpy()}

In [13]:
# Generate embeddings from summaries (with added title)
dataset = dataset.map(lambda x: generate_embeddings(x['title_summary']), batched=True, batch_size=8)

Map: 100%|██████████| 236/236 [00:04<00:00, 55.16 examples/s]


In [14]:
friends_episodes_description_df['embeddings'] = dataset['embeddings']
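
Because the embeddings are L2-normalized, the dot product of two embeddings equals their cosine similarity. The sketch below illustrates this on toy 2D vectors standing in for the real embeddings; `l2_normalize` and `most_similar` are hypothetical helpers, not part of the notebook.

```python
import numpy as np

def l2_normalize(vectors):
    """Scale each row to unit L2 norm (assumes no all-zero rows)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

def most_similar(embeddings, query_index, top_k=3):
    """Return the indices of the top_k rows most similar to the query row."""
    # For unit vectors, the dot product is the cosine similarity
    similarities = embeddings @ embeddings[query_index]
    similarities[query_index] = -np.inf  # Never return the query itself
    return np.argsort(-similarities)[:top_k]

# Toy vectors standing in for episode embeddings
toy_embeddings = l2_normalize(np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
]))

neighbours = most_similar(toy_embeddings, query_index=0, top_k=2)
```

Applied to the real embeddings, this kind of lookup gives a quick sanity check that neighbouring episodes share a theme before any clustering is done.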

### Dimensionality reduction

In [None]:
manifold = umap.UMAP(random_state=42)
reduced_embeddings = manifold.fit_transform(friends_episodes_description_df['embeddings'].tolist())

### Clustering

We will now cluster the episodes using the UMAP-reduced embeddings.

In [16]:

def diversity_measure(data):
    """Ratio of distinct labels to the number of samples"""
    return len(np.unique(data)) / len(data)

def entropy(data):
    """
    Computes the entropy of the label distribution divided by the maximum theoretical entropy
    """
    unique_labels, counts = np.unique(data, return_counts=True)
    probabilities = counts / len(data)
    raw_entropy = -np.sum(probabilities * np.log2(probabilities))
    # The small constant avoids a division by zero when there is a single label
    return raw_entropy / np.log2(len(unique_labels) + 1e-5)

def clustering_factor(data):
    """Average of the diversity and the normalized entropy of the labels"""
    diversity = diversity_measure(data)
    uniformity = entropy(data)
    return diversity * 0.5 + uniformity * 0.5

def grid_search_db_scans_params(
    embeddings, 
    eps_list,
    min_samples_list,
    max_clusters = 10 
):
    """
    Performs a simple grid search over DBSCAN's parameters `eps` and `min_samples`. Each configuration is
    scored with `clustering_factor`, which combines the entropy of the label distribution and the number of
    distinct labels. The best configuration is the one with the highest score among those that produce more
    than 4 and fewer than `max_clusters` distinct labels (the noise label counts as one).

    Args :
        - embeddings : Embeddings used for clustering
        - eps_list : List of possible values for the `eps` attribute
        - min_samples_list : List of possible values for the `min_samples` attribute
        - max_clusters : Max number of clusters we want during clustering. This is used to prevent the ideal 
        configuration from simply being all samples in different clusters

    Returns :
        - A tuple `(best_params, max_score)` where `best_params` is `(eps, min_samples)`
    """
    max_score = 0
    best_params = (eps_list[0], min_samples_list[0])

    for eps in tqdm(eps_list, total=len(eps_list)):
        for min_samples in min_samples_list:

            dbscan = DBSCAN(eps=eps, min_samples=min_samples)
            clusters = dbscan.fit(embeddings)
            current_score = clustering_factor(clusters.labels_)
            nb_clusters = len(np.unique(clusters.labels_)) # Includes the noise label (-1), if any

            if 4 < nb_clusters < max_clusters and current_score > max_score:
                max_score = current_score
                best_params = (eps, min_samples)
    return best_params, max_score

In [17]:
grid_search_db_scans_params(
    reduced_embeddings, 
    eps_list=[x / 10.0 for x in range(1, 100)],
    min_samples_list=list(range(1, 10)),
    max_clusters=len(dataset) / 10
)

100%|██████████| 99/99 [00:00<00:00, 150.57it/s]


((0.3, 5), np.float64(0.488581374626665))
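
To get a feel for the score used by the grid search, here is a small self-contained sketch that recomputes it on hand-made label arrays. The function bodies mirror the definitions above; the label arrays are invented for illustration.

```python
import numpy as np

def normalized_entropy(labels):
    """Entropy of the label distribution divided by its maximum possible value."""
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / len(labels)
    return -np.sum(probabilities * np.log2(probabilities)) / np.log2(len(counts) + 1e-5)

def clustering_factor(labels):
    """Average of label diversity and normalized entropy."""
    diversity = len(np.unique(labels)) / len(labels)
    return 0.5 * diversity + 0.5 * normalized_entropy(labels)

balanced = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])  # Three clusters of equal size
skewed = np.array([0, 0, 0, 0, 0, 0, 0, 1, 2])    # One dominant cluster
single = np.zeros(9, dtype=int)                   # Everything in one cluster
singletons = np.arange(9)                         # Every sample in its own cluster

scores = {
    "balanced": clustering_factor(balanced),
    "skewed": clustering_factor(skewed),
    "single": clustering_factor(single),
    "singletons": clustering_factor(singletons),
}
```

Balanced clusters score higher than a skewed split, and putting every sample in its own cluster scores close to 1. That last case is the degenerate optimum the `max_clusters` bound is meant to exclude.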

In [18]:
clustering = DBSCAN(eps=0.3, min_samples=5)
clusters = clustering.fit(reduced_embeddings)

In [19]:
# DBSCAN labels noise points as -1, so after the shift cluster 0 contains the noise
# (assigning to an existing column overwrites it, so the cell can be re-run safely)
friends_episodes_description_df['cluster'] = clusters.labels_ + 1
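
DBSCAN marks points that belong to no cluster with the label -1, so after adding 1 the label 0 stands for noise and the real clusters start at 1. The sketch below shows how the cluster sizes and the amount of noise could be inspected; `raw_labels` is a made-up array, not the notebook's actual output.

```python
import numpy as np

# Hypothetical DBSCAN output: -1 marks noise, 0..k-1 mark clusters
raw_labels = np.array([-1, 0, 0, 1, -1, 1, 1, 2, 2, 0])

shifted_labels = raw_labels + 1  # Noise becomes 0, clusters become 1..k

labels, counts = np.unique(shifted_labels, return_counts=True)
cluster_sizes = dict(zip(labels.tolist(), counts.tolist()))

n_noise = cluster_sizes.get(0, 0)                       # Points left unassigned
n_real_clusters = sum(1 for label in labels if label != 0)
```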

### Visualization

In [20]:
cluster_assignments = clusters.labels_ + 1

In [21]:
radii = []
centers = []
# Iterate over the labels that actually exist (0 is the noise cluster, if any)
for label in np.unique(cluster_assignments):
    cluster_points = reduced_embeddings[cluster_assignments == label]
    center = np.mean(cluster_points, axis=0)
    centers.append(center)
    distances = np.linalg.norm(cluster_points - center, axis=1)
    radii.append(np.max(distances)) # Radius = distance to the farthest point


In [22]:
n_clusters = len(np.unique(cluster_assignments))

# Use built-in qualitative color sequences
colors = px.colors.qualitative.Plotly + px.colors.qualitative.Set1 + px.colors.qualitative.Pastel


fig = px.scatter(
    dataset, 
    reduced_embeddings[:, 0],
    reduced_embeddings[:, 1],
    color=cluster_assignments.astype(str),
    color_discrete_sequence=colors[:n_clusters],
    hover_data=['Episode_Title', 'Episode Number', 'Season'],
    width=600,
    height=600
)

# Allows us to retrieve the range of axis (needed to filter circles that are too big)
fig = fig.full_figure_for_development()

color_map = {}
for trace in fig.data:
    color_map[trace.name] = trace.marker.color

# Add circles around clusters
# Do not draw big circles, for aesthetic reasons
x_range = fig.layout.xaxis.range[1] - fig.layout.xaxis.range[0]
for label, center, radius in zip(np.unique(cluster_assignments), centers, radii):
    if radius > x_range / 6:
        continue

    fig.add_shape(
        type="circle",
        xref="x", yref="y",
        x0=center[0] - radius, y0=center[1] - radius,
        x1=center[0] + radius, y1=center[1] + radius,
        line_color=color_map[str(label)],
        line_width=2,
        opacity=1.0
    )

fig.update_layout(showlegend=False)
fig.show()




In [28]:
nb_clusters = len(np.unique(cluster_assignments))

stars = []
total_avg = np.mean(dataset['Stars'])

for i in range(nb_clusters):
    indices = list(np.where(cluster_assignments == i)[0]) # Indices of elements in cluster
    cluster_stars = dataset[indices]['Stars'] # Stars of elements in cluster
    avg_stars = np.mean(cluster_stars)
    stars.append(avg_stars - total_avg)

print(total_avg)

8.458898305084746
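
The per-cluster deviation from the overall rating can also be computed with a pandas `groupby`, which avoids the explicit loop over cluster indices. This is a sketch on invented ratings; only the column names `cluster` and `Stars` come from the notebook.

```python
import pandas as pd

# Toy ratings standing in for the real dataset
toy_ratings = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2],
    "Stars": [8.0, 8.2, 8.8, 9.0, 8.4, 8.6],
})

overall_mean = toy_ratings["Stars"].mean()

# Mean rating of each cluster minus the mean rating of all episodes
deviation_by_cluster = toy_ratings.groupby("cluster")["Stars"].mean() - overall_mean
```

The result is a Series indexed by cluster label, so it stays correct even if some labels are missing from the data.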


In [29]:
px.bar(
    stars,
    color=[color_map[str(x)] for x in range(nb_clusters)],
    color_discrete_map='identity', # Use the scatter plot colors as-is
    orientation='h',
).update_layout(
    showlegend=False,
    yaxis_title='Cluster',
    xaxis_title='Distance from average (cluster average - total average)',
)

In [41]:
most_popular_cluster = 14

pd.set_option('display.max_colwidth', None) # Show full column contents instead of truncating them
print(friends_episodes_description_df.groupby('cluster').get_group(most_popular_cluster)['title_summary'].head(5).to_markdown())

|     | title_summary                                                                                                                                                                          |
|----:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|   8 | The One Where Underdog Gets Away : The gang's plans for Thanksgiving go awry after they get locked out of Monica and Rachel's apartment.                                               |
|  56 | The One with the Football : Old sibling rivalry between Monica and Ross resurfaces and postpones Thanksgiving dinner when the gang decide to play a game of "touch" football.          |
| 178 | The One with the Rumor : Monica invites Will, an old school friend of her and Ross over for Thanksgiving dinner, unaware he isn't too fond of Rachel.                                  |
| 201 | The One with Rachel's Other

In [40]:
second_most_popular_cluster = 10
print(friends_episodes_description_df.groupby('cluster').get_group(second_most_popular_cluster)['title_summary'].head(10).to_markdown())

|     | title_summary                                                                                                                                                                                                                                                                   |
|----:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|  30 | The One Where Ross Finds Out : A drunken Rachel calls Ross and reveals her feelings for him on his answering machine. Meanwhile, Monica keeps busy by being Chandler's personal trainer and Phoebe constantly wonders why her current boyfriend won't sleep with her.           |
|  63 | The One with the Morning After : Ross is guilt-ridden after sleeping with Chloe and desperately tries to stop Rachel from finding out and when she

In [42]:
third_most_popular_cluster = 6
print(friends_episodes_description_df.groupby('cluster').get_group(third_most_popular_cluster)['title_summary'].head(5).to_markdown())

|     | title_summary                                                                                                                                                                                                                                                               |
|----:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|  22 | The One with the Birth : As Carol goes into labor; Ross and Susan argue, Rachel flirts with Carol's doctor, and Joey assists a pregnant single mother.                                                                                                                      |
|  84 | The One with the Embryos : Phoebe's uterus is examined for implantation of the embryos. Meanwhile, a seemingly harmless game between Chandler and Joey against