# BERT Analysis

Plan for using BERT to analyze the overview column:
* Problem Understanding
  * Extract themes, genres, or patterns that correlate with popularity
  * BERT can reveal the underlying patterns by identifying contextual keywords and themes
* Using BERT for Key Information Extraction
  * BERT is a pre-trained model that reads text bidirectionally: each token's representation is computed from the words on both its left and its right
  * This gives it a better understanding of each word's context
  * Encoder vs. Decoder
    * Encoder: Extracts contextual information (good for classification, clustering)
      * The goal is either to assign labels (genre, sentiment) based on the content or to group similar texts together by meaning or theme
      * BERT captures word meaning well because it uses both the left and the right context
    * Decoder: Generates sequences (like summaries or paraphrasing)
  * An encoder fits best here, since we can classify the theme of the show as "romance," "heist," etc.
  * Decoder might be used if we need to summarize the overview or generate a more compact feature from it

# Feature Extraction via Embeddings
* Convert each overview into a BERT embedding, which is its vector representation
* Use a pre-trained BERT model from Hugging Face (bert-base-uncased) to generate embeddings
  * BERT-base is the smaller of the two original BERT configurations
  * Using the uncased model since capitalization rarely changes the meaning of an overview
  * Will test the results of the BERT model on cleaned and uncleaned text data
    * Read that feeding over-cleaned text into BERT can negatively affect its performance
  * Can use cosine similarity to measure how similar two overviews are (useful for clustering shows with similar themes)
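
The cosine similarity mentioned above can be sketched with NumPy alone. This is a minimal illustration, not the notebook's method: `cosine_similarity_matrix` is a hypothetical helper, and the toy 4-dimensional vectors only stand in for real 768-dimensional BERT embeddings.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Return the pairwise cosine similarity between the rows of `embeddings`."""
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # guard against all-zero rows, such as the zero vector used for a missing overview
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T

# toy vectors (assumed values, not taken from the dataset)
toy_embeddings = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [2.0, 0.0, 2.0, 0.0],  # same direction as row 0
    [0.0, 1.0, 0.0, 1.0],  # orthogonal to row 0
])
similarity = cosine_similarity_matrix(toy_embeddings)
print(np.round(similarity, 3))
```

Rows pointing in the same direction score close to 1 regardless of their length, which is why cosine similarity is a common choice for comparing embeddings.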

In [None]:
import pandas as pd
import os
from IPython.display import display
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

TMDB_filename = os.path.join(os.getcwd(), "TMDB_tv_dataset_v3.csv")
df = pd.read_csv(TMDB_filename)

In [None]:
# import the transformers library, a popular open-source library from Hugging Face
from transformers import BertTokenizer, BertModel
# import PyTorch, which the BERT model runs on
import torch
from tqdm import tqdm  # for progress bar

# loading the pre-trained BERT model and tokenizer 
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# embed a single overview: tokenize it, run BERT, and mean-pool the token vectors
def get_embedding(text):
  # missing overviews get an all-zero vector of the model's hidden size
  if pd.isnull(text):
    return torch.zeros(model.config.hidden_size).tolist()
  # padding=True pads to the longest sequence in the batch; with a single text per
  # call it adds no padding, but it is kept in case inputs are batched later
  # truncation=True cuts sequences that exceed the model's maximum input length
  # return_tensors="pt" returns PyTorch tensors, which the model requires as input
  tokens = tokenizer(text, padding=True, truncation=True, return_tensors="pt").to(device)
  # torch.no_grad() disables gradient tracking, which reduces memory use during inference
  with torch.no_grad():
    outputs = model(**tokens)  # pass the tokenized input to the pre-trained BERT model
  # outputs.last_hidden_state holds the last layer's embedding vector for every token
  # mean(dim=1) averages over the token dimension, giving one vector for the whole text
  # that can be used for downstream tasks such as classification or clustering
  return outputs.last_hidden_state.mean(dim=1).squeeze().tolist()  # mean pooling of embeddings
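
Because `get_embedding` handles one text per call, no padding tokens enter the average. If overviews were ever embedded in batches, a plain mean over the token dimension would also average the padding positions. The sketch below shows mask-aware mean pooling on toy NumPy arrays. `masked_mean_pool` is a hypothetical helper, and the arrays are assumed stand-ins for `last_hidden_state` and the `attention_mask` that Hugging Face tokenizers return.

```python
import numpy as np

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors, counting only positions where the mask is 1.

    last_hidden_state: array of shape (batch, tokens, hidden)
    attention_mask:    array of shape (batch, tokens), 1 = real token, 0 = padding
    """
    hidden = np.asarray(last_hidden_state, dtype=float)
    mask = np.asarray(attention_mask, dtype=float)[:, :, None]
    summed = (hidden * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid dividing by zero
    return summed / counts

# toy batch: 2 texts, 3 token slots, hidden size 2 (assumed values)
hidden = np.array([
    [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],
    [[2.0, 4.0], [4.0, 8.0], [9.0, 9.0]],  # the last slot of this text is padding
])
mask = np.array([[1, 1, 1], [1, 1, 0]])

pooled = masked_mean_pool(hidden, mask)
naive = hidden.mean(axis=1)  # also averages the padding slot
print(pooled)
print(naive)
```

The two results agree for the first text and differ for the second, where the naive mean is pulled toward the padding vector.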

In [None]:
# function to process the DataFrame with progress tracking
def process_with_progress(series):
  embeddings = []
  for text in tqdm(series, desc="Processing Embeddings"):
    embeddings.append(get_embedding(text))
  return embeddings

In [None]:
filtered_df = df[(df['genres'].notna() & (df['genres'] != 'Unknown')) | (df['overview'].notna() & (df['overview'] != 'Unknown'))]

missing_both = df[(df['genres'].isna() | (df['genres'] == 'Unknown')) & (df['overview'].isna() | (df['overview'] == 'Unknown'))]

percentage_missing_both = (len(missing_both) / len(df)) * 100
print(f"Percentage of rows with both 'genres' and 'overview' missing: {percentage_missing_both:.2f}%")

* Use the get_embedding function on the text data from the overview column

In [None]:
# process the overview column and assign embeddings to a new column
df['bert_cleaned_overview'] = process_with_progress(df['overview'])

print("Processing complete!")

In [None]:
# filter out rows where both 'genres' and 'overview' are either NaN or 'Unknown'
# .copy() makes df_filtered an independent DataFrame so new columns can be added to it safely
df_filtered = df[
  ~((df['genres'].isna() | (df['genres'] == 'Unknown')) & (df['overview'].isna() | (df['overview'] == 'Unknown')))
].copy()
print(f"Filtered DataFrame shape: {df_filtered.shape}")

# Clustering/Classification
* After converting the overviews to embeddings:
  * Use clustering algorithms (K-Means) to find shows with similar themes
  * Train a classification model to predict genre based on overview content
  * If we notice that certain clustered shows share a theme like "heist", we can make this a new feature in our dataset
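
To make the clustering step concrete, here is a minimal sketch of the assign-and-update loop behind K-means, written in plain NumPy. It is standard (full-batch) Lloyd's algorithm rather than the mini-batch variant, `simple_kmeans` is a hypothetical name, and the 2D toy points are assumed values standing in for embeddings.

```python
import numpy as np

def simple_kmeans(points, n_clusters, n_iters=20, seed=0):
    """Plain Lloyd's k-means: alternate nearest-center assignment and center updates."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # start from randomly chosen data points as the initial centers
    centers = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iters):
        # squared Euclidean distance from every point to every center
        distances = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        for k in range(n_clusters):
            members = points[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    # final assignment against the final centers
    distances = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1), centers

# two well-separated toy groups (assumed values)
toy_points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])
labels, centers = simple_kmeans(toy_points, n_clusters=2)
print(labels)
print(centers)
```

The cluster numbers themselves are arbitrary: which group ends up as 0 or 1 depends on the random initialization.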

In [None]:
# stack the per-row embedding lists into one 2D numpy array (rows = shows, columns = embedding dimensions); no normalization is applied here
embeddings_array = np.vstack(df_filtered['bert_cleaned_overview'].to_numpy())

In [None]:
# use MiniBatchKMeans, a faster and more memory-efficient variant of KMeans that fits on small batches; it also avoided some threading issues for us
from sklearn.cluster import MiniBatchKMeans

num_clusters = 5
kmeans = MiniBatchKMeans(n_clusters=num_clusters, random_state=0, batch_size=100)
df_filtered['cluster'] = kmeans.fit_predict(embeddings_array)

* Since we are using BERT-base embeddings, each embedding vector has 768 dimensions
* This means for every text in our dataset, there is a 768-dimensional vector

In [None]:
# inspect the shape to see the dimension of the data
print(f"Shape of embeddings array: {embeddings_array.shape}")

In [None]:
# use PCA to reduce the high-dimensional embeddings to 2 dimensions so they can be plotted
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2) # reduce dimensions to 2 for visualization
reduced_embeddings = pca.fit_transform(embeddings_array)

plt.figure(figsize=(10, 8))
plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], c=df_filtered['cluster'], cmap='viridis', alpha=0.5)
plt.title('K-means Clustering of TV Show Overviews')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.colorbar(label='Cluster')
plt.show()

### K-means clustering results for TV show overviews reduced to two dimensions using PCA
* Axes (PCA Component 1 and 2)
  * The first two principal components from PCA, which is used to reduce the dimensionality of the data
  * Since the BERT embeddings are high-dimensional, PCA was applied to project them into a 2D space
* Clusters
  * Each point represents one TV show's overview text, converted into a BERT embedding
  * We specified 5 clusters, so K-means split the data points into 5 groups based on similarity
  * The similarity between TV shows is based on the semantic meaning captured by the BERT embeddings
  * Shows that are closer together have more similar overviews, while shows in different clusters differ more in their text
  * The color bar on the right-hand side indicates the cluster label assigned by the K-means algorithm
    * The numbers 0, 1, 2, 3, 4 correspond to different clusters
* Insights
  * The overlap between clusters may mean that some TV shows share themes across multiple groups, although clusters that are separate in the full 768-dimensional space can also appear to overlap once projected to 2D
  * The isolated blue cluster on the right could suggest a group of TV shows whose overviews are semantically distinct from the rest
    * Worth checking: rows with a missing overview all receive the same all-zero embedding, so they may form a tight group of their own
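
How much of the structure survives the 2D projection can be estimated from the share of variance the first two components explain. The sketch below computes PCA through an SVD in NumPy and reports that share. `pca_project` is a hypothetical helper and the random data is an assumed stand-in for the real embeddings.

```python
import numpy as np

def pca_project(data, n_components=2):
    """Project `data` onto its top principal components and report explained variance."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # rows of vt are the principal directions, ordered by decreasing variance
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    variance = singular_values ** 2
    explained_ratio = variance[:n_components] / variance.sum()
    return projected, explained_ratio

rng = np.random.default_rng(0)
# toy data: 200 points in 10 dimensions, with most of the spread in the first two
scales = np.array([5.0, 3.0] + [0.5] * 8)
toy = rng.normal(size=(200, 10)) * scales

projected, explained_ratio = pca_project(toy)
print(projected.shape)
print(explained_ratio, explained_ratio.sum())
```

With the fitted scikit-learn object from the notebook, the same quantity is available as `pca.explained_variance_ratio_`. A low total would mean the scatter plot hides most of the variation that K-means actually used.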

### Next Steps
* Analyze the content of the clusters by looking at the shows in each group
* Experiment with more clusters

View Sample Data for Each Cluster

In [None]:
for cluster_id in range(num_clusters):
  print(f"\nCluster {cluster_id}:")
  display(df_filtered[df_filtered['cluster'] == cluster_id][['name', 'overview', 'genres']].head(10))

Get Genre Distributions per Cluster

In [None]:
'''
# create a breakdown of genres in each cluster
for cluster_id in range(num_clusters):
    print(f"\nCluster {cluster_id} Genre Distribution:")
    genre_counts = df[df['cluster'] == cluster_id]['genres'].value_counts()
    print(genre_counts.head(10))  # show top 10 most common genres
'''

# iterate over each cluster to display its top 10 genres
# (assumes one-hot genre columns spanning 'Action & Adventure' through 'Western')
for cluster_id in range(num_clusters):
	print(f"\nCluster {cluster_id} Genre Distribution:")
	cluster_df = df_filtered[df_filtered['cluster'] == cluster_id]
	genre_sums = cluster_df.loc[:, 'Action & Adventure':'Western'].sum()
	sorted_genres = genre_sums.sort_values(ascending=False)
	print(sorted_genres.head(10))

Plot Genre Distribution Per Cluster For Visual Comparison

In [None]:
'''
# flatten genres into individual entries
df_exploded = df.explode('genres')  # if 'genres' is a list

plt.figure(figsize=(15, 5))
sns.countplot(data=df_exploded, x='genres', hue='cluster')
plt.xticks(rotation=90)
plt.title("Genre Distribution by Cluster")
plt.show()
'''

Identify Themes or Topics in Overviews

In [None]:
import nltk
# nltk.download('stopwords')
from nltk.corpus import stopwords

# get the list of stop words
stop_words = set(stopwords.words('english'))

In [None]:
from collections import Counter
import re

# function to get the top words from overviews in a given cluster
def get_top_words(cluster_id, num_words=20):
  overviews = df_filtered[df_filtered['cluster'] == cluster_id]['overview'].fillna('')
  
  # combine all overviews into a single lowercase string
  all_words = ' '.join(overviews).lower()
  
  # extract word tokens, which also drops punctuation
  all_words = re.findall(r'\b\w+\b', all_words)
  
  # filter out stop words
  filtered_words = [word for word in all_words if word not in stop_words]
  
  # count word frequencies
  word_counts = Counter(filtered_words)
  
  return word_counts.most_common(num_words)

# print the top words for each cluster
for cluster_id in range(num_clusters):
  print(f"\nTop words in Cluster {cluster_id}:")
  print(get_top_words(cluster_id))

Creating a Word Cloud for Each Cluster

In [None]:
from wordcloud import WordCloud

def generate_word_cloud(cluster_id):
  overviews = df_filtered[df_filtered['cluster'] == cluster_id]['overview'].fillna('')
  all_words = ' '.join(overviews).lower()
  words = re.findall(r'\b\w+\b', all_words)
  filtered_words = [word for word in words if word not in stop_words]
  text = ' '.join(filtered_words)

  # create the word cloud object
  wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
  
  plt.figure(figsize=(10, 5))
  plt.imshow(wordcloud, interpolation='bilinear')
  plt.axis('off')
  plt.title(f'Word Cloud for Cluster {cluster_id}', fontsize=16)
  plt.show()

# generate word clouds for all clusters
for cluster_id in range(num_clusters):
  generate_word_cloud(cluster_id)


In [None]:
# compute the top words once per cluster
cluster_top_words = {}
for cluster_id in range(num_clusters):
  cluster_top_words[cluster_id] = get_top_words(cluster_id)

# plot the top words of each cluster
for cluster_id in range(num_clusters):
	# unpack the cached top words and their counts for the current cluster
	words, counts = zip(*cluster_top_words[cluster_id])
	
	plt.figure(figsize=(10, 4))
	plt.bar(words, counts, color='skyblue')
	plt.title(f"Top Words in Cluster {cluster_id}", fontsize=14)
	plt.xlabel("Words", fontsize=12)
	plt.ylabel("Frequency", fontsize=12)
	plt.xticks(rotation=45, fontsize=10)
	plt.tight_layout() 
	plt.show()

In [None]:
from IPython.display import display, HTML

# filter rows where the word "family" appears in the 'cleaned_overview' column
family_shows = df_filtered[df_filtered['cleaned_overview'].str.contains(r'\bfamily\b', case=False, na=False)]

html_output = family_shows[['name', 'popularity']].to_html()
display(HTML(f'<div style="max-height: 300px; overflow-y: scroll;">{html_output}</div>'))

In [None]:
# create a new feature 'has_family' that is 1 if 'family' is in the cleaned overview, else 0
df['has_family'] = df['cleaned_overview'].fillna('').str.contains(r'\bfamily\b', case=False, na=False).astype(int)

print(df[['cleaned_overview', 'has_family']].head())

In [None]:
print(df['has_family'].value_counts())

In [None]:
# Good separation of clusters
# Popularities for each cluster
# Getting a tag (boolean field, e.g. whether the overview includes the word 'world')
# If a word is very popular, or two words are correlated with each other, find relations between words
# create a feature based on the word (can be multiple words)
# family and love can be a feature
# because we are missing 44% of the data, we should combine the overview column with another feature
# if we are able to get a genre for those rows, we can consider the other 44 percent and populate the missing genre
# With the overview column, we can compare how the model performs with the overview column and without it
# If it is not making much of a difference, we can exclude it
# Simple model to predict the existing popularity

# Llama model
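
The notes above suggest tagging overviews with boolean keyword features and looking at how keywords relate to each other. A minimal sketch of that idea follows. `keyword_flags` is a hypothetical helper, the overviews are made-up examples, and only `None` is handled as missing (pandas `NaN` values would need `fillna('')` first).

```python
import re
import numpy as np

def keyword_flags(texts, keywords):
    """Return an (n_texts, n_keywords) 0/1 array of whole-word, case-insensitive matches."""
    patterns = [re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE) for word in keywords]
    return np.array([
        [1 if pattern.search(text or "") else 0 for pattern in patterns]
        for text in texts
    ])

# made-up overviews for illustration only
toy_overviews = [
    "A family moves to a small town.",
    "Love and family collide at a wedding.",
    "A detective hunts a killer.",
    "Two strangers fall in love.",
    None,
]
keywords = ["family", "love"]

flags = keyword_flags(toy_overviews, keywords)
keyword_totals = flags.sum(axis=0)                  # how many overviews mention each keyword
both = int((flags[:, 0] & flags[:, 1]).sum())       # overviews mentioning both keywords
relation = np.corrcoef(flags[:, 0], flags[:, 1])[0, 1]  # correlation between the two flags
print(keyword_totals, both, relation)
```

Each column of `flags` could be attached to the DataFrame as a feature in the same way as `has_family`.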

In [None]:
average_popularity_per_cluster = df_filtered.groupby('cluster')['popularity'].mean()

print("Average Popularity per Cluster:")
print(average_popularity_per_cluster)

In [None]:
# Bar plot to visualize average popularity for each cluster
average_popularity_per_cluster.plot(kind='bar', color='skyblue', edgecolor='black')
plt.xlabel('Cluster')
plt.ylabel('Average Popularity')
plt.title('Average Popularity Across Clusters')
plt.xticks(rotation=0)
plt.show()

# try a log transformation
# check how many rows have a missing overview but have a genre, and vice versa
# If both are missing, we can ignore those rows for now and see how the BERT model performs
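
One way the log transformation noted above could look, assuming it is meant for the `popularity` column: `np.log1p` computes log(1 + x), so zero values stay at zero and large values are compressed. The numbers below are made-up examples, not values from the dataset.

```python
import numpy as np

# made-up popularity scores with a long right tail
popularity = np.array([0.0, 0.6, 1.4, 12.0, 250.0, 3000.0])

log_popularity = np.log1p(popularity)   # log(1 + x), defined at zero
recovered = np.expm1(log_popularity)    # inverse transform, back to the original scale

print(np.round(log_popularity, 3))
```

The ordering of the shows is unchanged, but the gap between the largest and smallest values shrinks, which tends to make averages and plots less dominated by a few very popular shows.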

In [None]:
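# note: cluster ids are arbitrary labels rather than an ordered scale, so this Pearson
# correlation changes if the clusters are renumbered; treat it as a rough check only
correlation = df_filtered['popularity'].corr(df_filtered['cluster'])
print(f"Correlation between Popularity and Clusters: {correlation}")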
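
Since cluster ids are category labels, a measure that ignores their numbering may be more informative than a Pearson correlation. The sketch below uses the correlation ratio (eta squared), the share of total variance explained by differences between group means. This is a swapped-in technique, not something the notebook computes. `correlation_ratio` is a hypothetical helper and the data is made up.

```python
import numpy as np

def correlation_ratio(labels, values):
    """Eta squared: between-group sum of squares divided by total sum of squares."""
    labels = np.asarray(labels)
    values = np.asarray(values, dtype=float)
    grand_mean = values.mean()
    between = 0.0
    for label in np.unique(labels):
        group = values[labels == label]
        between += len(group) * (group.mean() - grand_mean) ** 2
    total = ((values - grand_mean) ** 2).sum()
    return between / total if total > 0 else 0.0

# made-up example: popularity is fully determined by the cluster
values = np.array([1.0, 1.0, 5.0, 5.0, 3.0, 3.0])
labels_a = np.array([0, 0, 1, 1, 2, 2])
labels_b = np.array([0, 0, 2, 2, 1, 1])  # same grouping, clusters 1 and 2 renumbered

eta_a = correlation_ratio(labels_a, values)
eta_b = correlation_ratio(labels_b, values)
pearson_a = np.corrcoef(labels_a, values)[0, 1]
pearson_b = np.corrcoef(labels_b, values)[0, 1]
print(eta_a, eta_b, pearson_a, pearson_b)
```

Renumbering the clusters leaves eta squared unchanged, while the Pearson value shifts even though the grouping is identical.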

In [None]:
# missing or 'Unknown' overview but a valid genre
missing_or_unknown_overview_with_genre = df[
  (df['overview'].isna() | (df['overview'] == 'Unknown')) & 
  ~(df['genres'].isna() | (df['genres'] == 'Unknown'))
]
count_missing_or_unknown_overview_with_genre = missing_or_unknown_overview_with_genre.shape[0]

# missing or 'Unknown' genre but a valid overview
missing_or_unknown_genre_with_overview = df[
  (df['genres'].isna() | (df['genres'] == 'Unknown')) & 
  ~(df['overview'].isna() | (df['overview'] == 'Unknown'))
]
count_missing_or_unknown_genre_with_overview = missing_or_unknown_genre_with_overview.shape[0]

print(f"Number of rows with missing or 'Unknown' overview but a valid genre: {count_missing_or_unknown_overview_with_genre}")
print(f"Number of rows with missing or 'Unknown' genre but a valid overview: {count_missing_or_unknown_genre_with_overview}")


In [None]:
# save the DataFrame to a CSV file without the BERT embeddings
# column_to_exclude = 'bert_cleaned_overview'
# df.drop(columns=[column_to_exclude]).to_csv("TMDB_tv_dataset_v3.csv", index=False)