# IMDb Movie Genre Classification Dataset
## IMD1107 - Natural Language Processing
### Lucas Pires de Souza, Mariana Emerenciano

# Movie Dataset Description

## Overview

This dataset contains information about movies along with their genres, structured across two related CSV files. The data enables analysis of movie synopses and their associated genres, making it particularly useful for Natural Language Processing (NLP) tasks and multi-label classification problems.

## File Structure

### 1. `movies_overview.csv`

This file contains the core movie information:

| Column     | Type   | Description                                                                 |
|------------|--------|-----------------------------------------------------------------------------|
| `title`    | string | The title of the movie                                                      |
| `overview` | string | A brief description or synopsis of the movie's plot                         |
| `genre_ids`| string | One or more genre identifiers, stored as a stringified list (e.g. `[18, 80]`) |

**Key Characteristics:**
- Each row represents a unique movie
- The `genre_ids` field may contain multiple values (multi-label)
- The `overview` provides textual data suitable for NLP analysis

### 2. `movies_genres.csv`

This file provides the genre reference mapping:

| Column | Type   | Description                          |
|--------|--------|--------------------------------------|
| `id`   | int    | Unique identifier for each genre     |
| `name` | string | The human-readable name of the genre |

**Key Characteristics:**
- Serves as a lookup table for genre identifiers
- Enables conversion of numeric genre IDs to meaningful labels
- Typically contains standard film genres (e.g., Action, Comedy, Drama)

## Dataset Relationships

The two files relate through the genre identifiers:
- `movies_overview.genre_ids` → `movies_genres.id`
- Multiple genres per movie are represented as a stringified list of comma-separated IDs in `genre_ids` (e.g. `[18, 80]`)
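
The relationship above can be sketched as a simple lookup. This is only an illustration: it assumes `genre_ids` holds a stringified list such as `"[18, 80]"`, `genre_lookup` is a hand-written excerpt standing in for `movies_genres.csv`, and `genre_names` is a hypothetical helper, not part of the notebook.

```python
from ast import literal_eval

# Hand-written excerpt of the id -> name mapping (stand-in for movies_genres.csv)
genre_lookup = {18: "Drama", 36: "History", 80: "Crime", 10752: "War"}


def genre_names(genre_ids):
    """Convert a stringified ID list such as "[18, 80]" into genre names."""
    ids = literal_eval(genre_ids)  # safely parse the string into a Python list
    # Skip any ID that is missing from the lookup instead of raising KeyError
    return [genre_lookup[i] for i in ids if i in genre_lookup]


print(genre_names("[18, 36, 10752]"))
```

Unknown IDs are dropped silently here; raising an error instead may be preferable if the two files are expected to be fully consistent.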


In [1]:
# Import libraries with standard conventions
import pandas as pd
import numpy as np
import re
from ast import literal_eval
from itertools import chain
from collections import Counter
from pprint import pprint
import plotly.express as px
import plotly.graph_objects as go
import spacy
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


# Loading the data
df_movies = pd.read_csv('data/movies_overview.csv')
df_genres = pd.read_csv('data/movies_genres.csv')

In [2]:
# Download NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /home/lucas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/lucas/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# EDA

In [3]:
print("=== Movies Overview ===")
display(df_movies.head())
print("\n=== Genres ===")
display(df_genres.head())


print("\nMovie dataset information:")
print(f"- Total records: {len(df_movies)}")
print(f"- Columns: {df_movies.columns.tolist()}")
print(f"- Null values:\n{df_movies.isna().sum()}")

=== Movies Overview ===


|   | title | overview | genre_ids |
|---|-------|----------|-----------|
| 0 | The Shawshank Redemption | Imprisoned in the 1940s for the double murder ... | [18, 80] |
| 1 | The Godfather | Spanning the years 1945 to 1955, a chronicle o... | [18, 80] |
| 2 | The Godfather Part II | In the continuing saga of the Corleone crime f... | [18, 80] |
| 3 | Schindler's List | The true story of how businessman Oskar Schind... | [18, 36, 10752] |
| 4 | 12 Angry Men | The defense and the prosecution have rested an... | [18] |



=== Genres ===


|   | id | name |
|---|----|------|
| 0 | 28 | Action |
| 1 | 12 | Adventure |
| 2 | 16 | Animation |
| 3 | 35 | Comedy |
| 4 | 80 | Crime |



Movie dataset information:
- Total records: 9980
- Columns: ['title', 'overview', 'genre_ids']
- Null values:
title        0
overview     0
genre_ids    0
dtype: int64


In [4]:
df_movies["genre_ids"].value_counts()

genre_ids
[18]                           577
[35]                           571
[18, 10749]                    273
[35, 10749]                    246
[35, 18]                       232
                              ... 
[28, 53, 18, 12]                 1
[16, 35, 878, 10770, 10749]      1
[878, 53, 80, 9648]              1
[16, 878, 10751, 12]             1
[18, 36, 35, 10749]              1
Name: count, Length: 2222, dtype: int64

In [5]:
id_name_map = pd.read_csv('data/movies_genres.csv').to_dict(orient='records')
id_name_map = {item['id']: item['name'] for item in id_name_map}

It is worth checking how evenly the genres are distributed across the dataset.

With almost 10,000 data points, it may be worth applying imbalanced-learning techniques such as SMOTE to genres with fewer than 500 entries.

Because this is a multi-label classification problem, the impact of any resampling needs to be checked first.

Looking at the genre counts, Drama and Comedy dominate the dataset with more than 3,000 occurrences each, followed by Thriller and Action with more than 2,000 each. At the other end, Western, TV Movie, and Music have fewer than 300 occurrences each, indicating a significant imbalance in the distribution.

This imbalance could hurt the model's performance on the underrepresented genres.
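
One simple way to quantify this kind of imbalance is a per-label ratio against the most frequent label. The sketch below uses toy labels (hypothetical, not taken from the dataset) and the name `imbalance_ratio` is our own; it is meant as a starting point for deciding which genres might need resampling, not as a validated procedure.

```python
from collections import Counter
from itertools import chain

# Toy multi-label data (hypothetical): each movie carries one or more genres
labels = [
    ["Drama", "Crime"],
    ["Drama"],
    ["Comedy", "Romance"],
    ["Drama", "Western"],
    ["Comedy"],
]

# Count how often each genre appears across all movies
counts = Counter(chain.from_iterable(labels))
max_count = max(counts.values())

# Ratio of the most frequent label's count to each label's count;
# 1.0 means "as frequent as the top label", larger means rarer
imbalance_ratio = {genre: max_count / n for genre, n in counts.items()}

for genre, ratio in sorted(imbalance_ratio.items(), key=lambda kv: kv[1]):
    print(f"{genre:<8} count={counts[genre]} ratio={ratio:.1f}")
```

Note that standard SMOTE is designed for single-label targets, so applying it here would require a multi-label adaptation or per-genre handling.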

In [6]:
from pprint import pprint

pprint(id_name_map)
pprint(df_movies[["overview", "genre_ids"]].sample(5).to_dict(orient='records'))

{12: 'Adventure',
 14: 'Fantasy',
 16: 'Animation',
 18: 'Drama',
 27: 'Horror',
 28: 'Action',
 35: 'Comedy',
 36: 'History',
 37: 'Western',
 53: 'Thriller',
 80: 'Crime',
 99: 'Documentary',
 878: 'Science Fiction',
 9648: 'Mystery',
 10402: 'Music',
 10749: 'Romance',
 10751: 'Family',
 10752: 'War',
 10770: 'TV Movie'}
[{'genre_ids': '[18, 36, 10752]',
  'overview': 'Two families, abolitionist Northerners the Stonemans and '
              'Southern landowners the Camerons, intertwine. When Confederate '
              'colonel Ben Cameron is captured in battle, nurse Elsie Stoneman '
              'petitions for his pardon. In Reconstruction-era South Carolina, '
              "Cameron founds the Ku Klux Klan, battling Elsie's congressman "
              'father and his African-American protégé, Silas Lynch.'},
 {'genre_ids': '[9648, 27, 878]',
  'overview': 'In the middle of a routine patrol, officer Daniel Carter '
              'happens upon a blood-soaked figure limping down a 

In [7]:
import re

def remove_extra_spaces(text):
    return re.sub(r'\s+', ' ', text).strip()

def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)


df_movies["overview"] = df_movies["overview"].apply(remove_punctuation)
df_movies["overview"] = df_movies["overview"].apply(remove_extra_spaces)

The graph shows the distribution of overview lengths. There is a dense cluster around 25 words, meaning that the texts are not long.

Some overviews have fewer than 10 words, which may be too little text to identify a genre. Let's check what they look like.

In [8]:
import plotly.graph_objects as go

word_count = df_movies["overview"].str.split().str.len()

fig = go.Figure(data=[go.Histogram(x=word_count)])

fig.update_layout(
    title="Distribution of Word Count in Movie Overviews",
    xaxis_title="Word Count",
    yaxis_title="Frequency",
    xaxis=dict(
        tickmode='linear',
        dtick=1,
        range=[0, 80]
    )
)
fig.show()

Clearly, overviews with fewer than 10 words carry much less information than longer overviews with 80+ words.

The longer overviews tend to provide more detailed plot descriptions and better context for genre classification, often containing 80-100 words. These detailed synopses typically include multiple plot points, character relationships, and thematic elements that are valuable for accurately determining movie genres.

In contrast, the shorter overviews (under 10 words) are often too brief to provide enough context, sometimes containing only basic statements or incomplete descriptions that make genre classification more challenging.

This observation suggests that setting a minimum overview length could improve the quality of our dataset for genre prediction tasks. That is not necessarily the case, however, since some short overviews still carry enough detail to predict the genre.
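
If a minimum length were adopted, the filter could look like the sketch below. The threshold `MIN_WORDS = 10` and the toy rows are assumptions for illustration; inspecting the rows that would be dropped before removing them seems prudent, given that some short overviews are still informative.

```python
import pandas as pd

MIN_WORDS = 10  # assumed threshold, to be tuned against the real data

# Toy rows (hypothetical titles and synopses)
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": [
        "A comedy show",
        "Two brothers have half of a powerful ancient medallion and must protect it",
        "The life story of a football legend",
    ],
})

# Whitespace-based word count, the same measure used for the histogram
df["word_count"] = df["overview"].str.split().str.len()

short = df[df["word_count"] < MIN_WORDS]   # candidates for manual review
kept = df[df["word_count"] >= MIN_WORDS]   # rows that pass the length filter

print(f"{len(short)} short overviews, {len(kept)} kept")
```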

In [9]:
pprint(df_movies[df_movies["overview"].apply(lambda overview: len(overview.split()) <= 9)].sample(10).to_dict(orient='records'))

[{'genre_ids': '[35, 10749]',
  'overview': 'Two emotionally unavailable men attempt a relationship',
  'title': 'Bros'},
 {'genre_ids': '[35]',
  'overview': 'A comedy show',
  'title': 'Tel chi el telùn'},
 {'genre_ids': '[18, 10402]',
  'overview': 'A ferocious bullying music teacher teaches a dedicated student',
  'title': 'Whiplash'},
 {'genre_ids': '[16, 35]',
  'overview': 'Bill struggles to put together his shattered psyche',
  'title': "It's Such a Beautiful Day"},
 {'genre_ids': '[18]',
  'overview': 'The life story of Brazilian football legend Pele',
  'title': 'Pelé: Birth of a Legend'},
 {'genre_ids': '[18]',
  'overview': '8mm work directed by Norihiko Morinaga',
  'title': 'RETURN'},
 {'genre_ids': '[35]',
  'overview': 'A comic movie divided in three episodes',
  'title': 'Grande, grosso e Verdone'},
 {'genre_ids': '[35]',
  'overview': 'Four different stories about italian football teams supporter',
  'title': 'Tifosi'},
 {'genre_ids': '[18, 10749]',
  'overview': 'A b

In [10]:
pprint(df_movies[df_movies["overview"].apply(lambda overview: len(overview.split()) > 20)].sample(10).to_dict(orient='records'))

[{'genre_ids': '[28, 53, 80]',
  'overview': 'Chev Chelios a hit man wanting to go straight lets his latest '
              'target slip away Then he awakes the next morning to a phone '
              'call that informs him he has been poisoned and has only an hour '
              'to live unless he keeps adrenaline coursing through his body '
              'while he searches for an antidote',
  'title': 'Crank'},
 {'genre_ids': '[35, 10749, 878]',
  'overview': 'Three magazine employees head out on an assignment to interview '
              'a guy who placed a classified ad seeking a companion for time '
              'travel',
  'title': 'Safety Not Guaranteed'},
 {'genre_ids': '[18, 12, 10751]',
  'overview': 'The adventure of Bella a dog who embarks on an epic 400mile '
              'journey home after she is separated from her beloved human',
  'title': "A Dog's Way Home"},
 {'genre_ids': '[28, 35]',
  'overview': 'Victor Maynard is a middleaged solitary assassin who lives to '
 

As it stands, stop words add a lot of noise to our dataset.

Common words like "the," "and," and "to" dominate the frequency counts, with some exceeding 20,000 occurrences. These high-frequency stop words provide little value in understanding genre-specific vocabulary, as they are used across all text regardless of genre. To better identify genre patterns, we need to filter them out and focus on content words that can point to specific genres, such as "murder" for crime or "love" for romance.
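
The effect of stop-word filtering on word counts can be seen on a toy example. The stop list below is a tiny hand-written set and the two texts are made up; the notebook itself relies on the NLTK and spaCy lists later on.

```python
from collections import Counter

# Tiny illustrative stop list (assumption); real lists contain hundreds of words
stop_words = {"the", "a", "and", "to", "of", "in", "is"}

# Made-up overviews
texts = [
    "the detective is sent to solve a murder in the city",
    "a murder and a love story in the city",
]

tokens = [word for text in texts for word in text.lower().split()]

raw_counts = Counter(tokens)
filtered_counts = Counter(word for word in tokens if word not in stop_words)

print("With stop words:   ", raw_counts.most_common(3))
print("Without stop words:", filtered_counts.most_common(3))
```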

In [11]:
from collections import Counter

def most_frequent_words(texts, n=10):
    words = [word for txt in texts for word in txt.split()]

    word_counts = Counter(words)

    most_common = dict(word_counts.most_common(n))

    fig = go.Figure(
        go.Bar(
            x=list(most_common.keys()),
            y=list(most_common.values())
        )
    )
    fig.update_layout(
        title_text=f"Top {n} most frequent words in the text",
        title_x=0.5,  # Center the title
        xaxis_title="Words",
        yaxis_title="Frequency",
        xaxis_tickangle=-45,
    )
    fig.show()
most_frequent_words(df_movies["overview"], 40)


# Preprocessing

In [12]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 18.7 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [13]:
import spacy

nlp = spacy.load("en_core_web_sm")

stop_words_spacy = nlp.Defaults.stop_words

list(stop_words_spacy)[:10]

['already',
 'amount',
 'through',
 'out',
 'various',
 '’ve',
 'himself',
 'off',
 'was',
 'by']

It's also worth checking for duplicates in our dataset. Duplicates do not contribute to the training step; they only increase training time.
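
Before dropping anything, it can help to count how many rows share an overview. A small sketch on toy data (the titles and synopses are hypothetical):

```python
import pandas as pd

# Toy rows: the first two share the same overview text
df = pd.DataFrame({
    "title": ["Movie A", "Movie A (re-release)", "Movie B"],
    "overview": ["Same synopsis text", "Same synopsis text", "Another synopsis"],
})

# keep=False flags every member of a duplicate group, not only the repeats
dup_mask = df.duplicated(subset=["overview"], keep=False)
print(f"Rows sharing an overview with another row: {dup_mask.sum()}")

# drop_duplicates keeps the first occurrence by default;
# .copy() makes the result safe to modify afterwards
deduped = df.drop_duplicates(subset=["overview"]).copy()
print(f"Rows after de-duplication: {len(deduped)}")
```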

In [14]:
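# Remove rows with duplicate overviews; .copy() returns an independent DataFrame
# so that adding columns later does not trigger SettingWithCopyWarning
movies_clean = df_movies.drop_duplicates(subset=['overview']).copy()
movies_clean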

|   | title | overview | genre_ids |
|---|-------|----------|-----------|
| 0 | The Shawshank Redemption | Imprisoned in the 1940s for the double murder ... | [18, 80] |
| 1 | The Godfather | Spanning the years 1945 to 1955 a chronicle of... | [18, 80] |
| 2 | The Godfather Part II | In the continuing saga of the Corleone crime f... | [18, 80] |
| 3 | Schindler's List | The true story of how businessman Oskar Schind... | [18, 36, 10752] |
| 4 | 12 Angry Men | The defense and the prosecution have rested an... | [18] |
| ... | ... | ... | ... |
| 9975 | Double Dragon | Two brothers have half of a powerful ancient C... | [28, 12, 35, 14, 878] |
| 9976 | The Fanatic | A rabid film fan stalks his favorite action he... | [80, 53] |
| 9977 | SPF-18 | 18yearold Penny Cooper spent years pining for ... | [10749, 18] |
| 9978 | Fantastic Four | Four young outsiders teleport to a dangerous u... | [28, 12, 878] |


To make the genres easier to read, let's add a column with the genre names assigned to each movie.

In [15]:
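import re

def parse_genre_ids_alternative(x):
    """Map a stringified ID list such as "[18, 80]" to a list of genre names."""
    if pd.isnull(x):
        return []
    # Extract every run of digits, then keep only IDs present in the lookup
    genre_ids = [int(match) for match in re.findall(r'\d+', str(x))]
    return [id_name_map[genre_id] for genre_id in genre_ids if genre_id in id_name_map]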

movies_clean['genres'] = movies_clean['genre_ids'].apply(parse_genre_ids_alternative)

movies_clean






|   | title | overview | genre_ids | genres |
|---|-------|----------|-----------|--------|
| 0 | The Shawshank Redemption | Imprisoned in the 1940s for the double murder ... | [18, 80] | [Drama, Crime] |
| 1 | The Godfather | Spanning the years 1945 to 1955 a chronicle of... | [18, 80] | [Drama, Crime] |
| 2 | The Godfather Part II | In the continuing saga of the Corleone crime f... | [18, 80] | [Drama, Crime] |
| 3 | Schindler's List | The true story of how businessman Oskar Schind... | [18, 36, 10752] | [Drama, History, War] |
| 4 | 12 Angry Men | The defense and the prosecution have rested an... | [18] | [Drama] |
| ... | ... | ... | ... | ... |
| 9975 | Double Dragon | Two brothers have half of a powerful ancient C... | [28, 12, 35, 14, 878] | [Action, Adventure, Comedy, Fantasy, Science F... |
| 9976 | The Fanatic | A rabid film fan stalks his favorite action he... | [80, 53] | [Crime, Thriller] |
| 9977 | SPF-18 | 18yearold Penny Cooper spent years pining for ... | [10749, 18] | [Romance, Drama] |
| 9978 | Fantastic Four | Four young outsiders teleport to a dangerous u... | [28, 12, 878] | [Action, Adventure, Science Fiction] |



In [16]:
from ast import literal_eval
from itertools import chain
from collections import Counter


genre_list_id = list(chain.from_iterable(df_movies["genre_ids"].apply(literal_eval)))

genre_list = [id_name_map[genre_id] for genre_id in genre_list_id if genre_id in id_name_map]

genre_counter = Counter(genre_list)

genre_counter

Counter({'Drama': 4523,
         'Comedy': 3626,
         'Thriller': 2757,
         'Action': 2349,
         'Adventure': 1700,
         'Romance': 1699,
         'Crime': 1573,
         'Horror': 1475,
         'Science Fiction': 1235,
         'Fantasy': 1154,
         'Family': 1134,
         'Mystery': 966,
         'Animation': 910,
         'History': 490,
         'War': 324,
         'Music': 279,
         'Western': 152,
         'TV Movie': 119})

A bar chart gives a clearer view of the genre distribution. As shown later, some n-grams are related to this distribution.

In [17]:
import plotly.express as px


exploded_genres = movies_clean.explode('genres')

# Genre counts
genre_counts = exploded_genres['genres'].value_counts().reset_index()
genre_counts.columns = ['Genre', 'Count']

# Plot with Plotly
fig = px.bar(genre_counts, x='Genre', y='Count', title='Genre Distribution')
fig.update_layout(xaxis_tickangle=-45)
fig.show()

After removing stop words, we can observe that meaningful content words like "life", "young", "world", "family", and "love" have emerged as the most frequent terms. This is an improvement over the earlier counts, which were dominated by stop words.

However, these common content words still appear frequently across multiple genres and don't necessarily help distinguish between them.
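
One rough way to see which words are shared across genres is to count how many distinct genres each word co-occurs with. The sketch below uses made-up samples and a hypothetical `spread` measure; a word with a high spread is unlikely to separate genres on its own.

```python
from collections import defaultdict

# (cleaned overview, genres) pairs; made-up examples
samples = [
    ("young detective hunt killer", ["Crime", "Thriller"]),
    ("young couple fall love", ["Romance"]),
    ("killer stalks young family", ["Horror", "Thriller"]),
]

# For every word, collect the set of genres of the overviews it appears in
genres_per_word = defaultdict(set)
for text, genres in samples:
    for word in set(text.split()):
        genres_per_word[word].update(genres)

# Number of distinct genres per word: higher means less genre-specific
spread = {word: len(genres) for word, genres in genres_per_word.items()}

for word, n_genres in sorted(spread.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{word}: appears in {n_genres} genres")
```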

In [18]:
import string

stop_words = set(stopwords.words('english')) | set(stop_words_spacy)
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.lower()
    tokens = text.split()
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return ' '.join(tokens)

movies_clean['cleaned_overview'] = movies_clean['overview'].apply(clean_text)
most_frequent_words(movies_clean["cleaned_overview"], 40)






Below are two representations of the cleaned overview text, built with CountVectorizer and TF-IDF.

In [19]:

count_vec = CountVectorizer(binary=True, max_features=1000)

X_count = count_vec.fit_transform(movies_clean['cleaned_overview'])

count_df = pd.DataFrame(X_count.toarray(), columns=count_vec.get_feature_names_out())
print("One-hot matrix dimensions:", X_count.shape)
display(count_df.iloc[:5, :10]) 


One-hot matrix dimensions: (9971, 1000)


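|   | abandoned | ability | able | accident | accidentally | accused | act | action | actor | actress |
|---|-----------|---------|------|----------|--------------|---------|-----|--------|-------|---------|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |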


In [20]:
tfidf_vec = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))

X_tfidf = tfidf_vec.fit_transform(movies_clean['cleaned_overview'])

tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vec.get_feature_names_out())
print("TF-IDF matrix dimensions:", X_tfidf.shape)
display(tfidf_df.iloc[:5, :10])  

TF-IDF matrix dimensions: (9971, 1000)


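|   | abandoned | ability | able | accident | accidentally | accused | act | action | actor | actress |
|---|-----------|---------|------|----------|--------------|---------|-----|--------|-------|---------|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.370512 | 0.0 | 0.0 | 0.0 | 0.0 |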


The bigram analysis reveals interesting patterns in the movie overviews:

- Location-based phrases like "new york", "los angeles", and "york city" are among the most frequent, suggesting many movies are set in major cities
- Character relationships and demographics appear often through bigrams like "best friend", "young woman", "young man", "young girl" and "young boy"
- Time-related phrases such as "year later", "year old", "year ago" are common narrative devices
- Plot elements emerge through phrases like "fall love", "world war", "true story", "serial killer"
- Professional roles appear in bigrams like "police officer", "fbi agent"

The high frequency of youth-related terms ("young woman/man/girl/boy") and relationship terms ("best friend", "fall love") suggests many movies focus on coming-of-age stories and relationships. The presence of action-oriented phrases ("police officer", "serial killer") and historical terms ("world war") reflects the prominence of crime, action and historical genres.
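
Bigrams such as "fall love" look ungrammatical because they are formed after stop words have been removed, so words that were not adjacent in the original sentence become neighbours. A minimal sketch of that effect, with a made-up sentence, a one-word stop list, and a hypothetical `ngrams` helper:

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


stop_words = {"in"}  # one-word stop list, enough for this illustration
text = "they fall in love in new york"

# Drop stop words first, as the cleaning step does, then build bigrams
tokens = [word for word in text.split() if word not in stop_words]
print(ngrams(tokens, 2))
```

The same mechanism also produces pairs like "love new" that span unrelated parts of a sentence, which is worth keeping in mind when reading the less frequent n-grams.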

In [30]:
bigram_vec = CountVectorizer(ngram_range=(2, 2), max_features=30)
X_bigram = bigram_vec.fit_transform(movies_clean['cleaned_overview'])

bigram_counts = X_bigram.sum(axis=0)
bigram_freq = [(word, bigram_counts[0, idx]) for word, idx in bigram_vec.vocabulary_.items()]
bigram_freq = sorted(bigram_freq, key=lambda x: x[1], reverse=True)

bigram_df = pd.DataFrame(bigram_freq, columns=['Bigram', 'Count'])

fig = px.bar(bigram_df, x='Count', y='Bigram', orientation='h',
             title='Top 30 Most Frequent Bigrams')
fig.show()

The trigram analysis provides deeper insights into narrative patterns in movie overviews:

Common time-related phrases:
- "world war ii" reflects the prevalence of WWII historical dramas and war films
- "life begin change" and "takes place future" signal key plot transitions and drama settings

Character relationships and demographics: 
- "young high school" and "high school student" highlight coming-of-age stories and teen films
- "one day life" and "man young woman" indicate relationship-focused narratives

Location-based descriptors:
- "new york city" appears frequently, reinforcing New York as a popular setting
- "los angeles california" establishes west coast locations

Action and plot elements:
- "must find way" and "trying find way" suggest quest/journey narratives
- "group young people" hints at ensemble casts and team dynamics

These trigrams reveal common storytelling structures and thematic elements across genres, providing insight into how movie plots are typically constructed and described.

Some trigrams point to a specific genre (for example, "world war ii" to War), while others are cross-genre terms.
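
Whether a given n-gram is genre-specific can be checked by looking at the genres of the overviews that contain it. The sketch below uses made-up samples and a hypothetical `genre_share` helper; on the real data the shares would of course differ.

```python
from collections import Counter

# (cleaned overview, genres) pairs; made-up examples
samples = [
    ("soldier survives world war ii battle", ["War", "Drama"]),
    ("spy mission world war ii", ["War", "Thriller"]),
    ("must find way home", ["Adventure"]),
    ("must find way stop killer", ["Thriller"]),
]


def genre_share(ngram, data):
    """Share of each genre among the overviews that contain the n-gram."""
    # Pad with spaces so the n-gram only matches on whole-word boundaries
    hits = [genres for text, genres in data if f" {ngram} " in f" {text} "]
    if not hits:
        return {}
    counts = Counter(genre for genres in hits for genre in genres)
    return {genre: n / len(hits) for genre, n in counts.items()}


print(genre_share("world war ii", samples))
print(genre_share("must find way", samples))
```

An n-gram whose hits concentrate in one genre is a useful signal for that genre, while one whose hits are spread evenly behaves like a cross-genre term.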

In [None]:
trigram_vec = CountVectorizer(ngram_range=(3, 3), max_features=30)
X_trigram = trigram_vec.fit_transform(movies_clean['cleaned_overview'])

trigram_counts = X_trigram.sum(axis=0)
trigram_freq = [(word, trigram_counts[0, idx]) for word, idx in trigram_vec.vocabulary_.items()]
trigram_freq = sorted(trigram_freq, key=lambda x: x[1], reverse=True)

trigram_df = pd.DataFrame(trigram_freq, columns=['Trigram', 'Count'])

fig = px.bar(trigram_df, x='Count', y='Trigram', orientation='h',
             title='Top 30 Most Frequent Trigrams')
fig.show()