# IMDb Movie Genre Classification Dataset - Topic Modeling
## IMD1107 - Natural Language Processing
### Lucas Pires de Souza, Mariana Emerenciano

This notebook provides a comprehensive workflow for discovering, analyzing, and visualizing thematic topics from movie synopses using the BERTopic library. The process can be broken down into several key stages.

## 1. Setup and Library Imports

The notebook begins by importing the necessary libraries, each serving a specific purpose:

* **pandas**: The cornerstone for data manipulation, used to load and manage the movie data from the CSV files.
* **sentence_transformers**: BERTopic uses this library under the hood; here it is also imported directly to load the fine-tuned embedding model. It provides the language models that convert text into meaningful numerical embeddings.
* **bertopic**: The core library for the entire analysis. It orchestrates the topic modeling pipeline.
* **plotly.express**: A powerful and user-friendly library for creating interactive visualizations. It is used in the notebook as an alternative method for plotting the "Topics per Genre" distribution.
* **sklearn (Scikit-learn)**: Used for machine learning utilities: `train_test_split` is imported for splitting data, and `CountVectorizer` is later passed to BERTopic to build the vocabulary used for topic keywords.

In [1]:
import os
import zipfile
import pandas as pd
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
from bertopic import BERTopic

# loading data
overview_df = pd.read_csv("data/movies_overview.csv")
genres_df = pd.read_csv("data/movies_genres.csv")

# mapping genre IDs to names
genre_map = dict(zip(genres_df["id"], genres_df["name"]))

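import ast

def decode_genres(genre_ids_str):
    # genre_ids is stored as a string such as "[18, 80]"
    try:
        genre_ids = ast.literal_eval(genre_ids_str)
        return ", ".join([genre_map.get(gid, "") for gid in genre_ids])
    except (ValueError, SyntaxError, TypeError):
        # malformed or missing genre lists map to an empty string
        return ""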

overview_df["genres"] = overview_df["genre_ids"].apply(decode_genres)





## 2. Data Loading and Preprocessing

This is a critical stage where the raw data is prepared for analysis.

* **Loading Data**: The notebook loads two datasets: `movies_overview.csv` (containing movie titles, overviews, and genre IDs) and `movies_genres.csv` (which acts as a lookup table for genre IDs and their names).
* **Merging & Mapping Genres**: The most important preprocessing step performed here is the mapping of `genre_ids` to human-readable genre names, stored in a new `genres` column. The code creates a dictionary from the `genres_df` (e.g., `{18: 'Drama', 80: 'Crime'}`) and then applies it to the `genre_ids` column in the `overview_df`. This enriches the main dataset, making it possible to later analyze topics in the context of their known genres. This step is fundamental for generating the "Topics per Class" visualization.
* **Data Cleaning**: Rows with a missing overview are dropped (`dropna`) before the embeddings are generated, as this text is essential for the topic modeling process.
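
The genre mapping described above can be sketched without pandas. This is a minimal illustration, assuming `genre_ids` is stored as a string such as `"[18, 80]"`; the two-entry lookup table reuses the example IDs from the text, and the function signature is a simplification, not the notebook's exact code.

```python
import ast

# Small lookup table mirroring the shape of movies_genres.csv (id -> name).
# Only the two example IDs mentioned in the text are included.
genre_map = {18: "Drama", 80: "Crime"}

def decode_genres(genre_ids_str, lookup):
    """Turn a string such as "[18, 80]" into a list of genre names."""
    try:
        genre_ids = ast.literal_eval(genre_ids_str)
    except (ValueError, SyntaxError):
        # Malformed values decode to an empty list.
        return []
    # Skip IDs that are absent from the lookup table.
    return [lookup[gid] for gid in genre_ids if gid in lookup]

print(decode_genres("[18, 80]", genre_map))
```

Returning a list rather than a comma-joined string keeps each genre separate, which is convenient when a movie has to be counted once per genre later on.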

In [2]:
# unzipping the fine-tuned model

zip_path = 'models/fine_tuned_imdb_model.zip'

extract_path = 'models/fine_tuned_imdb_model'

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)


In [3]:
# fine tuning
# creating positive pairs: each non-empty overview is paired with itself (label 1.0)
examples = [
    InputExample(texts=[row["overview"], row["overview"]], label=1.0)
    for _, row in overview_df.iterrows()
    if isinstance(row["overview"], str) and row["overview"].strip() != ""
]

# base architecture: all-MiniLM-L6-v2
# the fine-tuned version is loaded directly from the extracted folder
base_model = SentenceTransformer("models/fine_tuned_imdb_model/fine_tuned_imdb_model")


In [4]:
# generating embeddings 
valid_overviews = overview_df["overview"].dropna().tolist()
embeddings = base_model.encode(valid_overviews, show_progress_bar=True)


Batches: 100%|██████████| 312/312 [00:37<00:00,  8.22it/s]


## 3. BERTopic Model Training

This is the core of the discovery phase.

* **Model Instantiation**: A BERTopic model object is created with `topic_model = BERTopic(...)`. In this notebook it receives the fine-tuned embedding model, a custom `UMAP` configuration, and a `CountVectorizer`. Other parameters, such as `min_topic_size` (the minimum number of movies required to form a topic) or `language` (which selects a default embedding model when none is supplied), are left at their defaults.
* **Fitting the Model**: The `topic_model.fit_transform(docs, embeddings)` command is executed, where `docs` is the list of movie overviews and `embeddings` are the vectors computed beforehand. This single command triggers the BERTopic pipeline:
    * **Embedding**: Each movie overview is represented by a high-dimensional vector that captures its semantic meaning. Because the embeddings are precomputed and passed in, BERTopic does not need to recompute them.
    * **Dimensionality Reduction**: `UMAP` is used to reduce the dimensions of these vectors, making them easier to cluster while preserving their spatial relationships.
    * **Clustering**: `HDBSCAN` is applied to group similar movie overviews into clusters. Each cluster represents a potential topic. Outlier documents that don't fit well into any cluster are grouped into `Topic -1`.
    * **Topic Representation**: The `c-TF-IDF` algorithm is used to extract the most representative keywords for each cluster, allowing for human interpretation of the topics.
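
To make the topic representation step more concrete, here is a simplified sketch of class-based TF-IDF in plain Python. It assumes the weighting `tf * log(1 + A / f)`, where `tf` is a term's relative frequency inside a cluster, `f` is its frequency across all clusters, and `A` is the average number of words per cluster. The toy cluster texts are invented for illustration, and the real library adds tokenization, stop-word removal, and n-grams on top of this.

```python
import math
from collections import Counter

# Toy clusters: each cluster's documents are joined into one "class document".
# The texts are made up and are not taken from the dataset.
clusters = {
    0: "alien planet earth alien invasion",
    1: "school teacher student school prom",
}

# Term counts per cluster.
term_counts = {cid: Counter(text.split()) for cid, text in clusters.items()}

# Frequency of each term across all clusters (f in the formula).
total_counts = Counter()
for cluster_counts in term_counts.values():
    total_counts.update(cluster_counts)

# Average number of words per cluster (A in the formula).
avg_words = sum(sum(c.values()) for c in term_counts.values()) / len(term_counts)

def c_tf_idf(cluster_id):
    """Score each term of a cluster as tf * log(1 + A / f)."""
    cluster_counts = term_counts[cluster_id]
    n_words = sum(cluster_counts.values())
    return {
        term: (freq / n_words) * math.log(1 + avg_words / total_counts[term])
        for term, freq in cluster_counts.items()
    }

def top_words(cluster_id, n=3):
    """Return the n highest-scoring terms of a cluster."""
    scores = c_tf_idf(cluster_id)
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(top_words(0, n=1))
print(top_words(1, n=1))
```

Terms that are frequent inside one cluster and rare elsewhere receive the highest scores, which is why they work well as topic keywords.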

In [5]:
from umap import UMAP
from sklearn.feature_extraction.text import CountVectorizer

# umap model for dimensionality reduction
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

# count vectorizer
vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 3))

topic_model = BERTopic(
    embedding_model=base_model,
    umap_model=umap_model,
    vectorizer_model=vectorizer_model,
    verbose=True
)


## 4. Generating Visualizations and Insights

After the model is trained, the notebook proceeds to generate the suite of visualizations discussed below. This is how each one was created:

* **Barchart of Topic-Word Scores (`visualize_barchart`)**
    * This was created by calling `topic_model.visualize_barchart()`. This function accesses the calculated `c-TF-IDF` scores for each topic and plots the top N words, providing a direct view into the theme of each topic cluster.

* **Intertopic Distance Map (`visualize_topics`)**
    * This map was generated by calling `topic_model.visualize_topics()`. The function takes the topic embeddings, applies a 2D projection (like `UMAP`), and plots them using `Plotly`. The size of each circle is automatically scaled based on the topic's frequency.

* **Hierarchical Clustering Dendrogram (`visualize_hierarchy`)**
    * This chart was created using `topic_model.visualize_hierarchy()`. This function uses the topic embeddings to perform a hierarchical clustering analysis and plots the resulting tree structure, showing how topics are related and how they can be merged.

* **Topics per Class (Genre Distribution)**
    * This was the most involved visualization, as shown in the notebook snippet. The process was:
        * **Prepare the Dataframe**: A new DataFrame (`counts`) is built. It lists every genre-topic combination and the frequency of each topic within each genre.
        * **Call the Plotting Function**: The notebook then calls `topic_model.visualize_topics_per_class(counts, ...)` after renaming the `Genre` column to `Class`. The function takes this pre-aggregated DataFrame and creates the stacked bar chart. `normalize_frequency=True` is used to show the relative distribution rather than raw counts.
        * **Fallback Visualization**: The call is wrapped in a `try...except` block. If the BERTopic function fails for any reason, the notebook falls back to a custom plot using `plotly.express`. This alternative creates a faceted bar chart of the 15 most frequent genres, which is another effective way to show the same information.
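
The aggregation behind the "Topics per Class" data can be sketched in plain Python. This is an illustrative version of the count-then-normalize logic, not the notebook's pandas code; the `movies` list is hypothetical and only the topic numbers 5 and 6 echo the text.

```python
from collections import Counter, defaultdict

# Hypothetical (genres, topic) pairs: one entry per movie.
movies = [
    (["War", "Drama"], 5),
    (["War"], 5),
    (["Drama"], 6),
    (["Drama", "Family"], 6),
    (["Family"], 6),
]

# Count every genre-topic combination.
# A movie with several genres is counted once per genre.
pair_counts = Counter()
for genres, topic in movies:
    for genre in genres:
        pair_counts[(genre, topic)] += 1

# Total number of assignments per genre, used as the denominator.
genre_totals = defaultdict(int)
for (genre, _topic), count in pair_counts.items():
    genre_totals[genre] += count

# Relative frequency of each topic within its genre.
frequency = {
    (genre, topic): count / genre_totals[genre]
    for (genre, topic), count in pair_counts.items()
}

for (genre, topic), share in sorted(frequency.items()):
    print(f"{genre}: topic {topic} -> {share:.2f}")
```

Because each genre's shares sum to 1, genres with very different numbers of movies can be compared on the same scale.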

In [6]:
# training the bertopic model
topics, probs = topic_model.fit_transform(valid_overviews, embeddings)

# results
topic_model.visualize_topics().show()
topic_model.visualize_barchart(top_n_topics=12).show()
topic_model.save("models/bertopic_imdb_model", serialization="pytorch")



2025-06-21 09:43:30,008 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-06-21 09:44:02,320 - BERTopic - Dimensionality - Completed ✓
2025-06-21 09:44:02,320 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-06-21 09:44:02,771 - BERTopic - Cluster - Completed ✓
2025-06-21 09:44:02,771 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-06-21 09:44:03,536 - BERTopic - Representation - Completed ✓


### Intertopic Distance Map

The first chart is a 2D visualization of the "thematic universe" of the dataset. Each circle is a topic. The circle size represents the frequency of the topic (how many movies belong to it), and the distance between circles represents semantic similarity: nearby topics are thematically similar, distant topics are different.

There's a large cluster in the center and other groups on the periphery. This suggests broad, interconnected thematic areas in the center and more specific, distinct themes on the edges.

Topic 0 is the largest bubble, indicating that the most prevalent topic in the dataset talks about common themes in sci-fi films.

We can infer relationships such as Topic 2 (High School) being far from Topic 3 (Police), since the themes addressed in these types of films tend to be different. Likewise, Topic 0 (planet earth, aliens) and Topic 1 (ancient king, goku??) appear to address common themes in sci-fi films.

The map also suggests that the 9980 movies were grouped into 11 clusters, something the later charts help confirm.

Altogether, the Intertopic Distance Map offers a macro view of the movies. It not only shows the topics but also how they relate to one another, helping to understand the overall structure of the dataset.
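
The idea that nearby topics are semantically similar can be illustrated with cosine similarity between topic vectors. This is only a sketch: the three vectors and their labels are hypothetical, and the actual map is a 2D projection of much higher-dimensional topic representations.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional topic vectors, for illustration only.
topic_vectors = {
    "aliens": [0.9, 0.1, 0.0],
    "space": [0.8, 0.2, 0.1],
    "high school": [0.0, 0.2, 0.9],
}

close = cosine_similarity(topic_vectors["aliens"], topic_vectors["space"])
far = cosine_similarity(topic_vectors["aliens"], topic_vectors["high school"])
print(f"aliens vs space: {close:.3f}")
print(f"aliens vs high school: {far:.3f}")
```

Topics whose vectors point in a similar direction end up close together on the map, while topics with little overlap are placed far apart.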

### Topic Word Scores

This chart is the heart of topic interpretation. Each group of bars represents an individual topic discovered by the model. For each topic, it displays the 5 most representative words and their corresponding c-TF-IDF scores. A longer bar indicates that the word is more important and distinctive for that topic compared to all others.

This visualization confirms that BERTopic has identified coherent and easily identifiable narrative themes, such as Topic 5 (war related), Topic 9 (high school related), Topic 8 (espionage related), Topic 6 (family related) and others.

This chart validates the quality of the model. The clarity and coherence of the keywords for each topic demonstrate that the model was able to separate the synopses into meaningful thematic categories that make sense to a human. It serves as the foundation for subsequent analyses.

In [7]:
# hierarchical visualization
hierarchical_topics = topic_model.hierarchical_topics(valid_overviews)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics).show(height=800, width=1200)


100%|██████████| 58/58 [00:00<00:00, 399.70it/s]


### Hierarchical Clustering 

This is probably the most technical chart. It shows the "making of" the topic grouping. It's a tree diagram that illustrates how individual topics are merged into larger clusters as we move up the Y-axis (which represents the distance/dissimilarity).

The tree structure reveals "families" of themes. We can see that the first topics to merge are 0 and 1 (sci-fi related), indicating that both themes address similar dynamics in sci-fi movies.

The behavior inferred in the Intertopic Distance Map (the dataset being grouped in 11 clusters) can be seen in this chart, proving that these 11 clusters are a faithful representation of the hierarchical structure of the themes in the dataset. This shows that the grouping is not random but based on the semantic similarity calculated by the model.
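
How a dendrogram is built can be sketched with a small agglomerative procedure. The distances between the four topics below are hypothetical, and single linkage is used only to keep the example short; BERTopic's own hierarchy is computed by a library linkage function over topic representations, so this is an illustration of the principle rather than of its implementation.

```python
# Hypothetical pairwise distances between four topics (smaller = more similar).
distances = {
    frozenset({0, 1}): 0.2,
    frozenset({0, 2}): 0.9,
    frozenset({0, 3}): 0.8,
    frozenset({1, 2}): 0.85,
    frozenset({1, 3}): 0.7,
    frozenset({2, 3}): 0.4,
}

def cluster_distance(a, b):
    """Single linkage: distance between the closest members of two clusters."""
    return min(distances[frozenset({i, j})] for i in a for j in b)

def agglomerate(topic_ids):
    """Repeatedly merge the two closest clusters and record each merge."""
    clusters = [(t,) for t in topic_ids]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest distance.
        a, b = min(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda pair: cluster_distance(*pair),
        )
        merges.append((a, b, cluster_distance(a, b)))
        # Replace the two merged clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

merges = agglomerate([0, 1, 2, 3])
for left, right, height in merges:
    print(left, right, height)
```

Each recorded merge corresponds to one junction in the dendrogram, and its height on the axis is the distance at which the two groups were joined.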

In [8]:
import ast

movies_overview = pd.read_csv('data/movies_overview.csv')
movies_genres = pd.read_csv('data/movies_genres.csv')

# converting 'genre_ids' from string to list
movies_overview['genre_ids'] = movies_overview['genre_ids'].apply(lambda x: ast.literal_eval(x) if pd.notnull(x) else [])

# genre dictionary
genre_dict = movies_genres.set_index('id')['name'].to_dict()

# creating a new column with genre names
movies_overview['genres'] = movies_overview['genre_ids'].apply(
    lambda ids: [genre_dict[gid] for gid in ids if gid in genre_dict]
)

# removing rows with empty overviews
movies_overview = movies_overview.dropna(subset=['overview']).reset_index(drop=True)

In [9]:
from bertopic.plotting import visualize_topics_per_class


# obtaining topic information
topic_info = topic_model.get_topic_info()

#  mapping 
topic_name_map = dict(zip(topic_info['Topic'], topic_info['Name']))

# extracting words for each topic
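def get_topic_words(topic):
    # returns the top 5 words of a topic, or ["N/A"] if the topic cannot be read
    try:
        topic_words = topic_model.get_topic(topic)
        if isinstance(topic_words[0], tuple):
            # get_topic yields (word, score) pairs; keep only the words
            return [word for word, _ in topic_words[:5]]
        else:
            return topic_words[:5]
    except Exception:
        return ["N/A"]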


# creating a dataframe for topics per class
data = []
for topic, genres in zip(topics, movies_overview['genres']):
    for genre in genres:
        data.append({
            'Genre': genre,
            'Topic': topic,
            'Name': topic_name_map.get(topic, f"Topic_{topic}"),
            'Words': ", ".join(get_topic_words(topic))
        })

topics_per_class = pd.DataFrame(data)


# calculating frequencies of topics per genre
counts = topics_per_class.groupby(['Genre', 'Topic', 'Name', 'Words']).size().reset_index(name='Count')
genre_totals = counts.groupby('Genre')['Count'].sum()
counts['Frequency'] = counts.apply(lambda x: x['Count'] / genre_totals[x['Genre']], axis=1)


try:
    # default visualization
    fig = topic_model.visualize_topics_per_class(
        counts.rename(columns={'Genre': 'Class'}),
        top_n_topics=60,
        normalize_frequency=True,
        title='Topic Distribution by Genre',
        height=600,
        width=1200
    )
    fig.show()
except Exception as e:
    print(f"Using fallback visualization due to: {e}")
    
    # workaround for the issue with the default visualization
    import plotly.express as px
    
    top_genres = counts.groupby('Genre')['Count'].sum().nlargest(15).index
    filtered = counts[counts['Genre'].isin(top_genres)]
    
    fig = px.bar(
        filtered.sort_values(['Genre', 'Frequency'], ascending=[True, False]),
        x='Genre',
        y='Frequency',
        color='Name',
        hover_data=['Words'],
        facet_row='Name',
        height=1200,
        title='Topic Distribution by Genre (Top 15)'
    )
    fig.update_layout(showlegend=False)
    fig.show()

### Topics per Genre Distribution

In this visualization, each vertical bar represents a movie genre. The bar is segmented by colors, where each color corresponds to a topic, and the size of the segment represents the frequency of that topic within the genre.

With this chart, we can infer that some genres are almost entirely dominated by a single topic, such as Topic 5 (war related) in the "War" genre. However, in certain cases the topic keywords do not match what one would expect of the genre, as with Topic 56 (monster, frankenstein, dracula, samurai) and the Western genre.

Cross-cutting topics are also noticeable, such as Topic 6 (family related), which is present in genres such as Fantasy, Horror, Mystery and TV Movie.

This chart connects the unsupervised analysis of BERTopic with the supervised nature of our data. It not only proves that the model is working but also allows us to describe genres in a much richer way. Instead of just saying the genre of a movie, we can say that the genre is composed of "X% crime themes", "Y% friendship themes" and "Z% alien themes", for example.

## Summary

In summary, this notebook executes a sophisticated unsupervised NLP project. It begins with raw data, cleans and prepares it, applies a state-of-the-art topic modeling technique, and finally generates a suite of interactive visualizations. Each graph is a product of specific functions within the BERTopic library, designed to inspect the model's results from different angles: from the micro-level of individual topic keywords to the macro-level of the entire thematic landscape and its relationship with pre-existing metadata like genres.
