# IMDb Movie Genre Classification Dataset
## IMD1107 - Natural Language Processing
### Lucas Pires de Souza, Mariana Emerenciano

# Movie Dataset Description

## Overview

This dataset contains information about movies along with their genres, structured across two related CSV files. The data enables analysis of movie synopses and their associated genres, making it particularly useful for Natural Language Processing (NLP) tasks and multi-label classification problems.

## File Structure

### 1. `movies_overview.csv`

This file contains the core movie information:

| Column     | Type   | Description                                                                 |
|------------|--------|-----------------------------------------------------------------------------|
| `title`    | string | The title of the movie                                                      |
| `overview` | string | A brief description or synopsis of the movie's plot                         |
| `genre_ids`| string | One or more genre identifiers, stored as a stringified list (e.g. `[18, 80]`) |

**Key Characteristics:**
- Each row represents a unique movie
- The `genre_ids` field may contain multiple values (multi-label)
- The `overview` provides textual data suitable for NLP analysis

### 2. `movies_genres.csv`

This file provides the genre reference mapping:

| Column | Type   | Description                          |
|--------|--------|--------------------------------------|
| `id`   | int    | Unique identifier for each genre     |
| `name` | string | The human-readable name of the genre |

**Key Characteristics:**
- Serves as a lookup table for genre identifiers
- Enables conversion of numeric genre IDs to meaningful labels
- Typically contains standard film genres (e.g., Action, Comedy, Drama)

## Dataset Relationships

The two files relate through the genre identifiers:
- `movies_overview.genre_ids` → `movies_genres.id`
- Multiple genres per movie are stored as a stringified list of IDs in `genre_ids` (e.g. `[18, 36, 10752]`)
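
As a minimal sketch of how this relationship can be resolved, the snippet below parses one `genre_ids` string and maps each ID to its name. The names `genre_lookup` and `ids_to_names` are hypothetical helpers introduced only for illustration, and the lookup holds just a few ID/name pairs rather than the full contents of `movies_genres.csv`.

```python
from ast import literal_eval

# Hypothetical excerpt of the lookup table in movies_genres.csv (id -> name).
genre_lookup = {18: "Drama", 36: "History", 80: "Crime", 10752: "War"}


def ids_to_names(raw_ids, lookup):
    """Parse a stringified list such as "[18, 80]" and map each ID to its name."""
    ids = literal_eval(raw_ids)  # safely evaluates the list literal
    # IDs missing from the lookup are skipped instead of raising a KeyError.
    return [lookup[i] for i in ids if i in lookup]


print(ids_to_names("[18, 36, 10752]", genre_lookup))  # ['Drama', 'History', 'War']
```

Skipping unknown IDs is a design choice of this sketch; raising an error instead would surface inconsistencies between the two files earlier.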


In [None]:
# Import libraries with standard conventions
import pandas as pd
import numpy as np
import re
from ast import literal_eval
from itertools import chain
from collections import Counter
from pprint import pprint
import plotly.express as px
import plotly.graph_objects as go
import spacy
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


# Loading the data
df_movies = pd.read_csv('data/movies_overview.csv')
df_genres = pd.read_csv('data/movies_genres.csv')

In [None]:
# Download NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\marie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\marie\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# EDA

In [None]:
print("=== Movies Overview ===")
display(df_movies.head())
print("\n=== Genres ===")
display(df_genres.head())


print("\nMovie dataset information:")
print(f"- Total records: {len(df_movies)}")
print(f"- Columns: {df_movies.columns.tolist()}")
print(f"- Null values:\n{df_movies.isna().sum()}")

=== Movies Overview ===


Unnamed: 0,title,overview,genre_ids
0,The Shawshank Redemption,Imprisoned in the 1940s for the double murder ...,"[18, 80]"
1,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...","[18, 80]"
2,The Godfather Part II,In the continuing saga of the Corleone crime f...,"[18, 80]"
3,Schindler's List,The true story of how businessman Oskar Schind...,"[18, 36, 10752]"
4,12 Angry Men,The defense and the prosecution have rested an...,[18]



=== Genres ===


Unnamed: 0,id,name
0,28,Action
1,12,Adventure
2,16,Animation
3,35,Comedy
4,80,Crime



Movie dataset information:
- Total records: 9980
- Columns: ['title', 'overview', 'genre_ids']
- Null values:
title        0
overview     0
genre_ids    0
dtype: int64


In [None]:
df_movies["genre_ids"].value_counts()

genre_ids
[18]                        577
[35]                        571
[18, 10749]                 273
[35, 10749]                 246
[35, 18]                    232
                           ... 
[35, 27, 878, 53]             1
[10749, 35, 18, 14]           1
[80, 53, 9648, 28]            1
[28, 10752, 12, 36, 18]       1
[878, 28, 35, 10770, 27]      1
Name: count, Length: 2222, dtype: int64

In [None]:
# Build a lookup from genre ID to genre name
id_name_map = pd.read_csv('data/movies_genres.csv').to_dict(orient='records')
id_name_map = {item['id']: item['name'] for item in id_name_map}

Counter({'Drama': 4523,
         'Comedy': 3626,
         'Thriller': 2757,
         'Action': 2349,
         'Adventure': 1700,
         'Romance': 1699,
         'Crime': 1573,
         'Horror': 1475,
         'Science Fiction': 1235,
         'Fantasy': 1154,
         'Family': 1134,
         'Mystery': 966,
         'Animation': 910,
         'History': 490,
         'War': 324,
         'Music': 279,
         'Western': 152,
         'TV Movie': 119})

The sample overviews need little regex-based cleaning, but removing stop words and punctuation should help.

Lemmatization and lowercasing are also worth applying, since both shrink the vocabulary and therefore the dimensionality of the features.
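
To illustrate why normalization reduces dimensionality, here is a small sketch that compares vocabulary sizes before and after lowercasing and punctuation removal. The two texts and the helper names (`vocabulary`, `normalize`) are hypothetical and not taken from the dataset or the notebook.

```python
import re

# Hypothetical snippets, not taken from the dataset.
texts = [
    "The Family fights. The family wins!",
    "A family, divided; a FAMILY reunited.",
]


def vocabulary(docs):
    """Return the set of distinct whitespace-separated tokens."""
    return {token for doc in docs for token in doc.split()}


def normalize(text):
    """Drop punctuation and lowercase the text."""
    text = re.sub(r"[^\w\s]", "", text)
    return text.lower()


raw_vocab = vocabulary(texts)
clean_vocab = vocabulary(normalize(t) for t in texts)
print(len(raw_vocab), len(clean_vocab))  # 11 7
```

Variants such as `Family`, `family,` and `FAMILY` collapse into a single token, which is the same effect lemmatization has on inflected forms.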

In [None]:
from pprint import pprint

pprint(id_name_map)
pprint(df_movies[["overview", "genre_ids"]].sample(5).to_dict(orient='records'))

{12: 'Adventure',
 14: 'Fantasy',
 16: 'Animation',
 18: 'Drama',
 27: 'Horror',
 28: 'Action',
 35: 'Comedy',
 36: 'History',
 37: 'Western',
 53: 'Thriller',
 80: 'Crime',
 99: 'Documentary',
 878: 'Science Fiction',
 9648: 'Mystery',
 10402: 'Music',
 10749: 'Romance',
 10751: 'Family',
 10752: 'War',
 10770: 'TV Movie'}
[{'genre_ids': '[14, 18, 9648]',
  'overview': 'As children in the loving Ekdahl family, Fanny and Alexander '
              'enjoy a happy life with their parents, who run a theater '
              'company. After their father dies unexpectedly, however, the '
              'siblings end up in a joyless home when their mother, Emilie, '
              'marries a stern bishop. The bleak situation gradually grows '
              'worse as the bishop becomes more controlling, but dedicated '
              'relatives make a valiant attempt to aid Emilie, Fanny and '
              'Alexander.'},
 {'genre_ids': '[28, 80, 53]',
  'overview': 'After being enlisted to recove

In [None]:
import re

def remove_extra_spaces(text):
    return re.sub(r'\s+', ' ', text).strip()

def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)


df_movies["overview"] = df_movies["overview"].apply(remove_punctuation)
df_movies["overview"] = df_movies["overview"].apply(remove_extra_spaces)

The histogram shows the distribution of overview lengths. Most overviews cluster around 25 words, so the texts are short.
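
A summary statistic can serve as a quick cross-check on what a histogram suggests. The sketch below computes word counts and their median in plain Python; the three overviews are hypothetical, so the numbers say nothing about the actual dataset.

```python
from statistics import median

# Hypothetical overviews, not taken from the dataset.
overviews = [
    "A retired detective returns for one last case",
    "Two friends cross the country in a stolen car",
    "An orphan discovers a hidden kingdom beneath the city and must defend it",
]

# Same whitespace-based word count the notebook uses via str.split().str.len()
word_counts = [len(text.split()) for text in overviews]
print(word_counts)          # [8, 9, 13]
print(median(word_counts))  # 9
```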

In [None]:
import plotly.graph_objects as go

word_count = df_movies["overview"].str.split().str.len()

fig = go.Figure(data=[go.Histogram(x=word_count)])

fig.update_layout(
    title="Distribution of Word Count in Movie Overviews",
    xaxis_title="Word Count",
    yaxis_title="Frequency",
    xaxis=dict(
        tickmode='linear',
        dtick=1,
        range=[0, 80]
    )
)
fig.show()

At this point the dataset is still noisy: stop words dominate the counts, and some of them exceed 20,000 occurrences.
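
One way to quantify this noise is the share of tokens that are stop words. The following sketch assumes a tiny hand-written stop-word set and two hypothetical texts; the notebook itself relies on the NLTK and spaCy lists.

```python
from collections import Counter

# Tiny hypothetical stop-word list; the notebook uses the NLTK and spaCy lists.
stop_words = {"the", "a", "of", "and", "to", "in"}

# Hypothetical overviews, not taken from the dataset.
texts = [
    "the story of a king and the fall of a city",
    "a journey to the edge of the world",
]

tokens = [token for text in texts for token in text.split()]
counts = Counter(tokens)

# Sum the occurrences of every token that is a stop word
stop_total = sum(n for word, n in counts.items() if word in stop_words)
share = stop_total / len(tokens)
print(f"{stop_total} of {len(tokens)} tokens ({share:.0%}) are stop words")
```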

In [None]:
from collections import Counter

def most_frequent_words(texts, n=10):
    words = [word for txt in texts for word in txt.split()]

    word_counts = Counter(words)

    most_common = dict(word_counts.most_common(n))

    fig = go.Figure(
        go.Bar(
            x=list(most_common.keys()),
            y=list(most_common.values())
        )
    )
    fig.update_layout(
        title_text=f"Top {n} most frequent words in the text",
        title_x=0.5,  # Center the title
        xaxis_title="Words",
        yaxis_title="Frequency",
        xaxis_tickangle=-45,
    )
    fig.show()

most_frequent_words(df_movies["overview"], 40)


# Preprocessing

In [None]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

stop_words_spacy = nlp.Defaults.stop_words

list(stop_words_spacy)[:10]

['without',
 'around',
 'yourselves',
 'our',
 'as',
 'thus',
 'made',
 'become',
 'anything',
 'namely']

It is also worth checking the dataset for duplicates. Duplicated overviews add no information for training and only increase training time.
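
Before dropping anything, it can help to count how many rows would be affected. This is a minimal sketch on a hypothetical three-row DataFrame; only the column names mirror the notebook's data.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the notebook's DataFrame.
df = pd.DataFrame({
    "title": ["Movie A", "Movie A (re-release)", "Movie B"],
    "overview": [
        "A heist goes wrong",
        "A heist goes wrong",
        "A quiet village hides a secret",
    ],
})

# duplicated() marks every row whose overview repeats an earlier one
n_duplicates = int(df.duplicated(subset=["overview"]).sum())

# .copy() returns an independent DataFrame, which avoids
# SettingWithCopyWarning when new columns are added later
deduplicated = df.drop_duplicates(subset=["overview"]).copy()
print(n_duplicates, len(df), len(deduplicated))  # 1 3 2
```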

In [None]:
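# Remove rows with duplicated overviews; .copy() makes movies_clean an independent
# DataFrame, so adding columns later does not trigger SettingWithCopyWarning
movies_clean = df_movies.drop_duplicates(subset=['overview']).copy()
movies_clean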

Unnamed: 0,title,overview,genre_ids
0,The Shawshank Redemption,Imprisoned in the 1940s for the double murder ...,"[18, 80]"
1,The Godfather,Spanning the years 1945 to 1955 a chronicle of...,"[18, 80]"
2,The Godfather Part II,In the continuing saga of the Corleone crime f...,"[18, 80]"
3,Schindler's List,The true story of how businessman Oskar Schind...,"[18, 36, 10752]"
4,12 Angry Men,The defense and the prosecution have rested an...,[18]
...,...,...,...
9975,Double Dragon,Two brothers have half of a powerful ancient C...,"[28, 12, 35, 14, 878]"
9976,The Fanatic,A rabid film fan stalks his favorite action he...,"[80, 53]"
9977,SPF-18,18yearold Penny Cooper spent years pining for ...,"[10749, 18]"
9978,Fantastic Four,Four young outsiders teleport to a dangerous u...,"[28, 12, 878]"


To make the genres easier to read alongside the movies, let's add a column with the genre names assigned to each movie.

In [None]:
import re

def parse_genre_ids_alternative(x):
    """Extract the numeric IDs from a stringified list and map them to genre names."""
    if pd.isnull(x):
        return []
    ids = re.findall(r'\d+', str(x))
    # Skip IDs that are missing from the lookup table
    return [id_name_map[int(genre_id)] for genre_id in ids if int(genre_id) in id_name_map]

movies_clean['genres'] = movies_clean['genre_ids'].apply(parse_genre_ids_alternative)

movies_clean






Unnamed: 0,title,overview,genre_ids,genres
0,The Shawshank Redemption,Imprisoned in the 1940s for the double murder ...,"[18, 80]","[Drama, Crime]"
1,The Godfather,Spanning the years 1945 to 1955 a chronicle of...,"[18, 80]","[Drama, Crime]"
2,The Godfather Part II,In the continuing saga of the Corleone crime f...,"[18, 80]","[Drama, Crime]"
3,Schindler's List,The true story of how businessman Oskar Schind...,"[18, 36, 10752]","[Drama, History, War]"
4,12 Angry Men,The defense and the prosecution have rested an...,[18],[Drama]
...,...,...,...,...
9975,Double Dragon,Two brothers have half of a powerful ancient C...,"[28, 12, 35, 14, 878]","[Action, Adventure, Comedy, Fantasy, Science F..."
9976,The Fanatic,A rabid film fan stalks his favorite action he...,"[80, 53]","[Crime, Thriller]"
9977,SPF-18,18yearold Penny Cooper spent years pining for ...,"[10749, 18]","[Romance, Drama]"
9978,Fantastic Four,Four young outsiders teleport to a dangerous u...,"[28, 12, 878]","[Action, Adventure, Science Fiction]"


It is also worth checking how evenly the genres are distributed across the dataset.

With almost 10,000 data points, techniques for imbalanced data such as SMOTE could be considered for the genres with fewer than 500 entries (History, War, Music, Western and TV Movie).

Because this is a multi-label classification problem, the impact of such resampling still needs to be checked.
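
As a rough sketch of how the imbalance could be quantified, the snippet below uses the genre counts from the `Counter` output shown earlier in this notebook. The per-genre ratio (majority count divided by the genre's count) is an assumption made here as one simple imbalance measure, not something the notebook computes; the 500-entry threshold comes from the text above.

```python
# Genre counts copied from the Counter output shown earlier in this notebook.
genre_counts = {
    "Drama": 4523, "Comedy": 3626, "Thriller": 2757, "Action": 2349,
    "Adventure": 1700, "Romance": 1699, "Crime": 1573, "Horror": 1475,
    "Science Fiction": 1235, "Fantasy": 1154, "Family": 1134, "Mystery": 966,
    "Animation": 910, "History": 490, "War": 324, "Music": 279,
    "Western": 152, "TV Movie": 119,
}

majority = max(genre_counts.values())

# Assumed imbalance measure: how many times rarer a genre is than the most common one
imbalance_ratio = {genre: majority / n for genre, n in genre_counts.items()}

# Genres below the 500-entry threshold mentioned in the text
rare_genres = [genre for genre, n in genre_counts.items() if n < 500]

for genre in rare_genres:
    print(f"{genre}: {genre_counts[genre]} entries, ratio {imbalance_ratio[genre]:.1f}")
```

One caveat worth keeping in mind: standard SMOTE is defined for samples with a single class label, so oversampling a movie tagged with a rare genre also replicates every other genre attached to it.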

In [None]:
from ast import literal_eval
from itertools import chain
from collections import Counter


genre_list_id = list(chain.from_iterable(df_movies["genre_ids"].apply(literal_eval)))

genre_list = [id_name_map[genre_id] for genre_id in genre_list_id if genre_id in id_name_map]

genre_counter = Counter(genre_list)

genre_counter

The bar chart gives a clearer view of the genre distribution. Later on, some of the n-grams turn out to reflect this distribution.

In [None]:
import plotly.express as px


exploded_genres = movies_clean.explode('genres')

# Genre counts
genre_counts = exploded_genres['genres'].value_counts().reset_index()
genre_counts.columns = ['Genre', 'Count']

# Plot with Plotly
fig = px.bar(genre_counts, x='Genre', y='Count', title='Genre Distribution')
fig.update_layout(xaxis_tickangle=-45)
fig.show()

These are some of the most frequent words after cleaning. They suggest themes that may be present in the dataset, such as World War II, family and love.

In [None]:
import string

stop_words = set(stopwords.words('english')) | set(stop_words_spacy)
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.lower()
    tokens = text.split()
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return ' '.join(tokens)

movies_clean['cleaned_overview'] = movies_clean['overview'].apply(clean_text)
most_frequent_words(movies_clean["cleaned_overview"], 40)






Here are two representations of the cleaned overview text, built with CountVectorizer and TF-IDF.
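
To make the difference between the two representations concrete, the sketch below computes both by hand for a toy corpus. It follows the weighting that scikit-learn documents for `TfidfVectorizer` with default settings (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization); the corpus and the helper name `tfidf` are hypothetical, and tokenization is simplified to whitespace splitting.

```python
import math
from collections import Counter

# Hypothetical cleaned overviews, not taken from the dataset.
docs = ["war soldier war", "family love", "soldier family"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Weight each term by count * smoothed idf, then L2-normalize the vector."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


# Binary representation: only presence matters, as with CountVectorizer(binary=True)
binary_rows = [sorted(set(tokens)) for tokens in tokenized]

first = tfidf(tokenized[0])
print(binary_rows[0])
print({term: round(w, 3) for term, w in first.items()})
```

In the binary view `war` and `soldier` look identical for the first document, while TF-IDF ranks `war` higher because it occurs twice there and appears in fewer documents overall.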

In [None]:

count_vec = CountVectorizer(binary=True, max_features=1000)

X_count = count_vec.fit_transform(movies_clean['cleaned_overview'])

count_df = pd.DataFrame(X_count.toarray(), columns=count_vec.get_feature_names_out())
print("One-hot matrix dimensions:", X_count.shape)
display(count_df.iloc[:5, :10]) 


One-hot matrix dimensions: (9971, 1000)


Unnamed: 0,abandoned,ability,able,accident,accidentally,accused,act,action,actor,actress
0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0


In [None]:
tfidf_vec = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))

X_tfidf = tfidf_vec.fit_transform(movies_clean['cleaned_overview'])

tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vec.get_feature_names_out())
print("TF-IDF matrix dimensions:", X_tfidf.shape)
display(tfidf_df.iloc[:5, :10])  

TF-IDF matrix dimensions: (9971, 1000)


Unnamed: 0,abandoned,ability,able,accident,accidentally,accused,act,action,actor,actress
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.370512,0.0,0.0,0.0,0.0


With bigrams, some genres start to show: mostly drama, with some thriller and action occurrences.
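
For intuition about what an n-gram count is, here is a plain-Python sketch that counts pairs of adjacent tokens. The helper `ngrams` and the three texts are hypothetical; unlike `CountVectorizer`, this version splits on whitespace only and applies no token filtering.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Hypothetical cleaned overviews, not taken from the dataset.
docs = ["young woman fall love", "young man fall love", "world war ii soldier"]

bigram_counts = Counter(
    " ".join(gram) for doc in docs for gram in ngrams(doc.split(), 2)
)
print(bigram_counts.most_common(1))  # [('fall love', 2)]
```

Passing `n=3` to the same helper yields trigrams, which is the only change needed to mirror the trigram analysis.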

In [None]:
bigram_vec = CountVectorizer(ngram_range=(2, 2), max_features=30)
X_bigram = bigram_vec.fit_transform(movies_clean['cleaned_overview'])

bigram_counts = X_bigram.sum(axis=0)
bigram_freq = [(word, bigram_counts[0, idx]) for word, idx in bigram_vec.vocabulary_.items()]
bigram_freq = sorted(bigram_freq, key=lambda x: x[1], reverse=True)

bigram_df = pd.DataFrame(bigram_freq, columns=['Bigram', 'Count'])

fig = px.bar(bigram_df, x='Count', y='Bigram', orientation='h',
             title='Top 30 Most Frequent Bigrams')
fig.show()

With trigrams the genres become more evident, with some action, thriller and war.

In [None]:
trigram_vec = CountVectorizer(ngram_range=(3, 3), max_features=30)
X_trigram = trigram_vec.fit_transform(movies_clean['cleaned_overview'])

trigram_counts = X_trigram.sum(axis=0)
trigram_freq = [(word, trigram_counts[0, idx]) for word, idx in trigram_vec.vocabulary_.items()]
trigram_freq = sorted(trigram_freq, key=lambda x: x[1], reverse=True)

trigram_df = pd.DataFrame(trigram_freq, columns=['Trigram', 'Count'])

fig = px.bar(trigram_df, x='Count', y='Trigram', orientation='h',
             title='Top 30 Most Frequent Trigrams')
fig.show()