# 🍿 Final Project - The Movie Database

## Introduction 

Is money strong enough to determine profitability, popularity and public appreciation of movies?  
Are movies so plain that one may guess their genres with only their overview? 

For this project, we analyze movies in order to predict:
- The budget a movie should be allocated to reach a given revenue, public rating and popularity (numerical task);
- The genre or genres of a movie, given its text overview (textual task).

This project uses the TMDB API but is not endorsed or certified by TMDB.

TODOs:
- TODO: more models 
- TODO: comparison of the models with visualization 
- TODO: conclusion - business recommendations 

Imports we will be using for the project

In [1]:
import requests
import json
import pandas as pd
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

import torch
import spacy
import spacy.cli

from transformers import AutoTokenizer, AutoModel

In [2]:
# For BERT embeddings
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

In [3]:
# Download the French spaCy pipeline, which includes tokenization, part-of-speech tagging and lemmatization
SPACY_MODEL = "fr_core_news_md"
spacy.cli.download(SPACY_MODEL)

Collecting fr-core-news-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_md-3.8.0/fr_core_news_md-3.8.0-py3-none-any.whl (45.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


## How did we collect the data?
> The Movie Database (TMDB) offers a public API that provides a lot of data about movies and TV shows, including titles, descriptions, ratings, and more.

In [None]:
DATA_LANGUAGE="fr-FR"
API_KEY = input("Enter your API key: ")

For safety, we will verify whether the entered API_KEY is correct.  

In [None]:
# Throw error if API_KEY is incorrect
try:
    if not API_KEY:
        raise ValueError("no API_KEY provided")
    url = "https://api.themoviedb.org/3/authentication"
    headers = {"accept": "application/json", "Authorization": f"Bearer {API_KEY}"}
    response = requests.get(url, headers=headers)
    response.raise_for_status()
except Exception as e:
    raise ValueError(
        "API_KEY is not set or invalid. Please provide a valid API key."
    ) from e

We need to get the IDs of the movies we'd like to collect.  

TMDB uses a system of pages, where each page contains a list of movies.  
We will fetch the data with API requests in French. 
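
As a rough illustration of the page system, the sketch below walks through consecutive pages and gathers the IDs while dropping duplicates. It assumes that the same movie may show up on more than one page if the listing changes between requests, which we have not verified. `collect_ids` and `fake_fetch_page` are hypothetical names, and the fake fetcher stands in for a real API call so that the example runs offline.

```python
from typing import Callable, Iterable


def collect_ids(fetch_page: Callable[[int], Iterable[int]], first_page: int, last_page: int) -> list[int]:
    """Gather IDs from pages first_page..last_page, keeping first-seen order."""
    seen: set[int] = set()
    ordered: list[int] = []
    for page in range(first_page, last_page + 1):
        for movie_id in fetch_page(page):
            if movie_id not in seen:  # skip IDs already returned by an earlier page
                seen.add(movie_id)
                ordered.append(movie_id)
    return ordered


# Hypothetical stand-in for an API call: made-up IDs, with one overlap between pages 1 and 2
def fake_fetch_page(page: int) -> list[int]:
    pages = {1: [11, 12, 13], 2: [13, 14, 15], 3: []}
    return pages.get(page, [])


movie_ids = collect_ids(fake_fetch_page, 1, 3)
print(movie_ids)
```

Passing the fetcher as an argument keeps the pagination logic separate from the network code, which makes it easy to try out without an API key.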

In [None]:
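def get_movie_ids(page):
    """Return the movie IDs listed on one page of the discover endpoint."""
    url = "https://api.themoviedb.org/3/discover/movie"
    headers = {
        "accept": "application/json",
        "Authorization": f"Bearer {API_KEY}"
    }
    params = {"language": DATA_LANGUAGE, "page": page}
    try:
        response = requests.get(url, headers=headers, params=params, timeout=10)
        response.raise_for_status()
        return [movie['id'] for movie in response.json()['results']]
    except (requests.RequestException, KeyError, ValueError) as e:
        # Skip the page, but report the failure instead of silently ignoring it
        print(f"Page {page} skipped: {e}")
        return []

IDs = []
for page in tqdm(range(1, 501)):
    IDs += get_movie_ids(page)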



In [None]:
# Display the first 10 movie IDs
IDs[:10] 

Now that we have collected the movie IDs, we can fetch the details of each movie in French.

In [None]:
data = []
headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}
for movie_id in tqdm(IDs):  # Iterate over the IDs directly; this avoids shadowing the built-in `id`
    url = f"https://api.themoviedb.org/3/movie/{movie_id}?language={DATA_LANGUAGE}"
    response = requests.get(url, headers=headers)
    data.append(response.json())  # Parse the JSON body into a dict

We convert the list into a `pandas` DataFrame.

In [None]:
df = pd.DataFrame(data)
pd.set_option('display.max_columns', None) # To display all columns
df

Finally, we filter out the movies that have no French overview,  
and we save the DataFrame to disk, so that we do not have to fetch the data again.  

In [None]:
df_copy = df.drop(df[df.overview == ""].index)  # Drop rows with empty values in 'overview' (meaning they have no French overview)
df_copy.to_pickle("all_data_fr.pkl") # Save DataFrame to a pickle file (To directly use the DataFrame later, instead of re-fetching)

## Preprocessing data

### Importing and filtering data 

If we have already fetched the data, we can load the DataFrame directly with the following command.

In [None]:
df: pd.DataFrame = pd.read_pickle("all_data_fr.pkl")
fr_data = df.copy()
fr_data

### Numerical values

Let's plot the budget against the vote average, the popularity and the revenue.

In [None]:
fig, ax = plt.subplots(2, 2)

ax[0][0].scatter(fr_data['vote_average'], fr_data['budget'])
ax[0][0].set_xlabel('Vote Average')
ax[0][0].set_ylabel('Budget')

ax[0][1].scatter(fr_data['popularity'], fr_data['budget'])
ax[0][1].set_xlabel('Popularity')
ax[0][1].set_ylabel('Budget')

ax[1][0].scatter(fr_data['revenue'], fr_data['budget'])
ax[1][0].set_xlabel('Revenue')
ax[1][0].set_ylabel('Budget')

ax[1][1].axis('off')

# plt.xscale/plt.yscale only affect the current (last) axes, so set the log scale on each plot explicitly
for axis in (ax[0][0], ax[0][1], ax[1][0]):
    axis.set_xscale('log')
    axis.set_yscale('log')

plt.tight_layout()
plt.show()

For the numerical values, we remove the rows whose revenue or budget is missing (stored as 0). Then, we split the dataset into train and test sets.

Finally, we standardize the X train set, and we reuse the mean and standard deviation learned on it to transform the X test set.  
We do not need to scale y, because it is the target being predicted. 

We also do not filter out the extreme values (outliers), because they are relevant in our study.
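
To make the scaling step concrete, here is a minimal pure-Python sketch of standardization where the statistics are learned on the train values only and then reused for the test values. The helper names and the numbers are made up for illustration; the notebook itself relies on `StandardScaler`, which also uses the population standard deviation.

```python
import statistics


def fit_standardizer(column: list[float]) -> tuple[float, float]:
    """Learn the mean and population standard deviation from the training values only."""
    return statistics.fmean(column), statistics.pstdev(column)


def standardize(column: list[float], mean: float, std: float) -> list[float]:
    """Apply z = (x - mean) / std with statistics learned elsewhere."""
    return [(x - mean) / std for x in column]


train_revenue = [10.0, 20.0, 30.0, 40.0]  # made-up values
test_revenue = [25.0, 100.0]              # made-up values, including an extreme one

mean, std = fit_standardizer(train_revenue)         # statistics come from the train set
train_scaled = standardize(train_revenue, mean, std)
test_scaled = standardize(test_revenue, mean, std)  # the test set reuses them

print(train_scaled)
print(test_scaled)
```

Because the extreme test value is scaled with the train statistics, it lands far from zero, which is the expected behaviour when outliers are deliberately kept.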

In [None]:
filtered_num_df = fr_data[(fr_data['revenue'] > 0) & (fr_data['budget'] > 0)] # Some movies do not have revenue or budget data
print("filtered df shape: ", filtered_num_df.shape)
numerical_num_df = filtered_num_df[["budget", "vote_average", "revenue", "popularity"]] # Extracting numerical columns

X = numerical_num_df[['revenue', 'vote_average', 'popularity']]
y = numerical_num_df['budget']

X_train_num, X_test_num, y_train_num, y_test_num = train_test_split(X, y, test_size=0.2, random_state=42)

num_scaler = StandardScaler()
X_train_num_scaled = num_scaler.fit_transform(X_train_num)
X_test_num_scaled = num_scaler.transform(X_test_num)

X_train_num_scaled

### Textual values

We treat the textual part as a multi-label classification problem: a movie can be associated with several `genres`, so each overview can carry more than one label. 

Furthermore, we preprocess the text: we remove named entities, punctuation and stop words, and we lemmatize the remaining words (lowercasing is left to the vectorizers, which apply it by default).  
We can preprocess before the train & test split without risking data leakage, because each overview is transformed independently and nothing is learned from the dataset.  

In [None]:
df_text = fr_data[['overview', 'genres']].copy()  # Work on a copy to avoid pandas' SettingWithCopyWarning

# Convert genres from a list of dictionaries to a list of strings
df_text['genres'] = df_text['genres'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

# Preprocess the overview text
# The full pipeline stays enabled: tokenization, parsing (syntactic analysis), lemmatization and NER (named entity recognition)
nlp = spacy.load(SPACY_MODEL)
preprocessed_texts = []
for doc in nlp.pipe(df_text["overview"], batch_size=50, n_process=-1):
    # Keep the lemma of every token that is not part of a named entity, punctuation or a stop word
    preprocessed_texts.append(' '.join(token.lemma_ for token in doc if not token.ent_type_ and not token.is_punct and not token.is_stop))
df_text["overview"] = preprocessed_texts

df_text

For the classification, the model needs to know every possible target. Therefore, we gather all the movie genres in one list. 

Classes are the possible categories of the dataset.  
Labels are the actual categories assigned to the data samples. 

In [None]:
all_genres_set = set()
for genres in df_text['genres']:
    all_genres_set.update(genres)
all_genres: list[str] = sorted(list(all_genres_set))
all_genres

We will encode the targets with a multi-label binarizer, as the teacher recommended. 

It transforms each movie's list of genres into a binary vector with one column per genre: 1 if the movie has the genre, 0 otherwise. 
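
The encoding can be pictured with a small hand-written version. This is only a sketch of the idea, not scikit-learn's implementation; `binarize`, the class list and the samples are made up for illustration.

```python
def binarize(samples: list[list[str]], classes: list[str]) -> list[list[int]]:
    """Return one row per sample and one column per class (1 = label present)."""
    rows = []
    for labels in samples:
        present = set(labels)
        rows.append([1 if c in present else 0 for c in classes])
    return rows


classes = ["Action", "Comédie", "Drame"]          # made-up, sorted class list
samples = [["Drame"], ["Action", "Comédie"], []]  # made-up labels; the last movie has no genre

encoded = binarize(samples, classes)
for row in encoded:
    print(row)
```

A movie without any genre becomes a row of zeros, which is why the row sums can differ from one sample to the next.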

In [None]:
mlb = MultiLabelBinarizer(classes=all_genres)
y_text = mlb.fit_transform(df_text['genres']) # Encode the target as a binary indicator matrix

X_train_text, X_test_text, y_train_text, y_test_text = train_test_split(
    df_text['overview'],
    y_text,
    test_size=0.2,
    random_state=42
)

y_text


When we display the class distribution of the dataset, we can observe it is skewed towards "drama".  
We will need to handle this problem later.  

In [None]:
plt.figure(figsize=(10, 5))
plt.bar(all_genres, height=np.sum(y_text, axis=0))
plt.xticks(rotation=90)
plt.title("Distribution of genre labels")
plt.xlabel("Genres")
plt.ylabel("Number of Samples")
plt.show()

Let's display a word cloud for every movie genre.

We can see that some words are strongly associated with their genre:
- the word "documentaire" is very frequent in the "documentaire" genre;
- the words "soldat" and "guerre" are overwhelmingly present in the samples of the "guerre" genre;
- the word "musique" is frequent in the "musique" genre.

Later on, this should help the label prediction. 

In [None]:
from wordcloud import WordCloud

def display_wordclouds(n_cols=4):
    n_rows = (len(all_genres) + n_cols - 1) // n_cols
    idx = 0

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(5 * n_cols, 4 * n_rows))
    axes = axes.flatten()

    for label in all_genres:
        # Get all overviews for this label
        text = " ".join(
            df_text[df_text["genres"].apply(lambda genres: label in genres)]["overview"]
        )
        wordcloud = WordCloud(background_color="white").generate(text)
        axes[idx].imshow(wordcloud, interpolation="bilinear")
        axes[idx].set_title(label)
        axes[idx].axis("off")
        idx += 1

    # Hide any unused subplots
    for j in range(idx, len(axes)):
        axes[j].axis("off")

    plt.tight_layout()
    plt.show()
    
display_wordclouds()

In [None]:
# Calculate the length of each overview in words
plt.figure(figsize=(8, 5))
plt.hist(df_text['overview'].apply(lambda x: len(x.split())), bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Number of Words')
plt.ylabel('Number of Samples')
plt.title('Histogram of Text Sample Lengths')
plt.tight_layout()
plt.show()

## Implementation of models

### Numerical values 

For the numerical values, we compare four regressors, starting with a Linear Regression.

In [None]:
models = ["Linear Regression", "Decision Tree Regressor", "Gradient Boosting Regressor", "Random Forest Regressor"]
values = []

In [None]:
lr_model_num = LinearRegression()
lr_model_num.fit(X_train_num_scaled, y_train_num)

y_pred_num = lr_model_num.predict(X_test_num_scaled)

r2 = r2_score(y_test_num, y_pred_num)
values.append(r2)
print("Mean Squared Error:", mean_squared_error(y_test_num, y_pred_num))
print("R² Score:", r2)

In [None]:
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(max_depth=5, random_state=42)
regressor.fit(X_train_num_scaled, y_train_num)
y_pred_num_tree = regressor.predict(X_test_num_scaled)

r2 = r2_score(y_test_num, y_pred_num_tree)
values.append(r2)
print("Mean Squared Error of the Decision Tree Regressor:", mean_squared_error(y_test_num, y_pred_num_tree))
print("R² Score:", r2)

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Use a dedicated name: `model` already holds the DistilBERT model loaded at the top of the notebook
gb_model_num = GradientBoostingRegressor(random_state=42)

param_grid = {
    'max_depth': [3, 5, 10, 20, 30, None],
    'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.5, 1.0]
}

grid_search = GridSearchCV(
    estimator=gb_model_num,
    param_grid=param_grid,
    scoring='r2',
    cv=5,
    n_jobs=-1,
    verbose=1,
)

grid_search.fit(X_train_num_scaled, y_train_num)

y_pred = grid_search.predict(X_test_num_scaled)

grid_search

In [None]:
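# Append the test-set R², as for the other models (best_score_ is a cross-validated score on the training set)
r2 = r2_score(y_test_num, y_pred)
values.append(r2)
print("Best cross-validated R2: ", grid_search.best_score_)
print("Test R² Score:", r2)
print("Best parameters: ", grid_search.best_params_)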

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rf_model_num  = RandomForestRegressor(random_state=42)

param_grid = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [None, 10, 20, 30],  
    'min_samples_split': [2, 5, 10],  
    'min_samples_leaf': [1, 2, 4] 
}

scoring = 'r2'

grid_search = GridSearchCV(
    estimator=rf_model_num ,
    param_grid=param_grid,
    scoring=scoring,
    cv=5,
    n_jobs=-1,
    verbose=1,
)

grid_search.fit(X_train_num_scaled, y_train_num)

print("Best parameters:", grid_search.best_params_)

# grid_search

y_pred_rf_num = grid_search.predict(X_test_num_scaled)

r2 = r2_score(y_test_num, y_pred_rf_num)
values.append(r2)
print("Random Forest Mean Squared Error:", mean_squared_error(y_test_num, y_pred_rf_num))
print("Random Forest R² Score:", r2)

### Textual values

#### TF-IDF

We vectorize the text with TF-IDF, which was the best-performing vectorizer in the second lab. 
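
For intuition, the sketch below computes TF-IDF weights by hand on three made-up, already tokenized overviews. It follows the smoothed formula documented for scikit-learn's default settings, `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, as far as we understand it; the function name and the data are assumptions for illustration only.

```python
import math
from collections import Counter


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        weights = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Made-up tokenized overviews: "film" appears everywhere, the other words only once
docs = [["guerre", "soldat", "film"], ["musique", "film"], ["film", "documentaire"]]
vectors = tfidf(docs)
print(vectors[0])
```

A word that appears in every document, such as "film" here, ends up with a lower weight than a word specific to one document, which is what makes genre-specific vocabulary stand out.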

In [None]:
tfidf_vectorizer = TfidfVectorizer(stop_words=None)
X_train_text_tfidf = tfidf_vectorizer.fit_transform(X_train_text)

# Display the first 5 rows of the TF-IDF vectors (slice first, so that only 5 rows are converted to a dense array)
X_train_text_tfidf[:5].toarray()

We will use OneVsRestClassifier to solve our multi-label classification problem.  
It trains one binary classifier per class, each learning to distinguish its class from all the others. At prediction time, every classifier decides independently whether its class applies, so a sample can receive zero, one or several labels. 

To handle class imbalance, we also use a "balanced" class weight, which penalizes misclassification of minority genres more.   
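
As an illustration of what "balanced" means, the sketch below applies the formula given in the scikit-learn documentation, `n_samples / (n_classes * count(class))`, to a made-up binary target for a single genre. The function name and the data are hypothetical.

```python
from collections import Counter


def balanced_class_weights(y: list[int]) -> dict[int, float]:
    """weight(c) = n_samples / (n_classes * number of samples in class c)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in sorted(counts.items())}


# Made-up binary target for one genre: 2 positive samples out of 10
y_genre = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
weights = balanced_class_weights(y_genre)
print(weights)
```

The rare positive class receives a weight four times larger than the negative class in this example, so an error on a minority genre costs more during training.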

In [None]:
### Multi-label Logistic Regression on TF-IDF

def MultiLabel_LR(X_train_text, X_test_text):
    lr_model_text = OneVsRestClassifier(LogisticRegression(max_iter=1000, class_weight="balanced"))
    lr_model_text.fit(X_train_text, y_train_text)

    y_pred_text = lr_model_text.predict(X_test_text)
    print("Classification Report:\n", classification_report(y_test_text, y_pred_text, target_names=mlb.classes_, zero_division=np.nan))


MultiLabel_LR(X_train_text_tfidf, tfidf_vectorizer.transform(X_test_text))

After fitting the One-vs-Rest classifier, we can estimate its effectiveness on the test set.  
Note that the accuracy score is very low. Indeed, in multi-label classification, accuracy is the subset accuracy: a test sample only counts as correct when its set of predicted labels exactly matches its set of true labels.
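
The difference between this strict accuracy and a per-label view can be checked on a toy example. The two helpers below are hand-written sketches with made-up label matrices, not the scikit-learn functions; the per-label score is one minus the Hamming loss.

```python
def subset_accuracy(y_true: list[list[int]], y_pred: list[list[int]]) -> float:
    """Share of samples whose predicted label vector equals the true one exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def label_accuracy(y_true: list[list[int]], y_pred: list[list[int]]) -> float:
    """Share of individual (sample, label) decisions that are correct."""
    pairs = [(t, p) for row_t, row_p in zip(y_true, y_pred) for t, p in zip(row_t, row_p)]
    return sum(t == p for t, p in pairs) / len(pairs)


# Made-up indicator matrices: 4 samples, 3 genres
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

strict = subset_accuracy(y_true, y_pred)
per_label = label_accuracy(y_true, y_pred)
print(strict, per_label)
```

Here a single wrong label per sample is enough to bring the strict score down to one correct sample out of four, even though most individual label decisions are right.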

In [None]:
### Naive Bayes on TF-IDF

from sklearn.naive_bayes import MultinomialNB

def naive_bayes(X_train_text, X_test_text):
    nb_model_text = OneVsRestClassifier(MultinomialNB())
    nb_model_text.fit(X_train_text, y_train_text)

    y_pred_text_tfidf = nb_model_text.predict(X_test_text)
    print("Classification Report:\n", classification_report(y_test_text, y_pred_text_tfidf, target_names=mlb.classes_))

naive_bayes(X_train_text_tfidf, tfidf_vectorizer.transform(X_test_text))


#### BERT Embeddings

In [None]:
current_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(current_device)
model.eval()  # Set model to evaluation mode

def get_embeddings(texts, tokenizer, model, batch_size=16, device=current_device):
    """Return one embedding per text: the hidden state of the first ([CLS]) token."""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        # Pad to the longest text of the batch and truncate anything beyond 256 tokens
        inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt", max_length=256)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        with torch.no_grad():  # Inference only: no gradients needed
            outputs = model(**inputs)
        # Position 0 of the last hidden state is the [CLS] token, used here as the text embedding
        batch_embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()
        embeddings.append(batch_embeddings)
    return np.vstack(embeddings)

# Example: get embeddings for train and test sets
X_train_embeddings = get_embeddings(X_train_text.tolist(), tokenizer, model)
X_test_embeddings = get_embeddings(X_test_text.tolist(), tokenizer, model)

In [None]:
# Naive Bayes is not applicable to BERT embeddings: MultinomialNB requires non-negative features, and the embeddings contain negative values.

# Multi-label Logistic Regression on BERT embeddings
MultiLabel_LR(X_train_embeddings, X_test_embeddings)

In [None]:
plt.bar(models, values)
plt.xticks(rotation=45)
plt.xlabel('Models')
plt.ylabel('R² Score')
plt.title('Model Performance Comparison (Numerical Data)')
plt.show()

As we can see above, the Random Forest model performs better than the others, for several likely reasons:
- Random Forest trains a set of decision trees on different bootstrap subsets of the data, which limits overfitting;
- Random Forest captures non-linear relationships (real-life problems are rarely perfectly linear);
- Random Forest averages the predictions of all its trees, which reduces the variance (a single Decision Tree can have a high variance) and makes the model more stable.
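
The variance argument can be illustrated with a small simulation. This is not a Random Forest: it only assumes independent noisy predictors around a made-up true value, whereas the trees of a real forest are correlated, so the reduction is smaller in practice.

```python
import random
import statistics

random.seed(0)
TRUE_VALUE = 10.0  # made-up quantity that every predictor tries to estimate


def noisy_prediction() -> float:
    """One high-variance predictor: the true value plus Gaussian noise."""
    return TRUE_VALUE + random.gauss(0, 2.0)


# 1000 predictions from a single noisy predictor
single = [noisy_prediction() for _ in range(1000)]
# 1000 predictions, each one the average of 25 noisy predictors
averaged = [statistics.fmean(noisy_prediction() for _ in range(25)) for _ in range(1000)]

var_single = statistics.pvariance(single)
var_avg = statistics.pvariance(averaged)
print(var_single, var_avg)
```

The averaged predictions vary much less from one run to the next than those of a single predictor, which is the stabilizing effect described in the last bullet.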

#### BoW

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# We need to download the stop words for French (not provided by default in sklearn)
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
french_stop_words = stopwords.words('french')

vectorizer = CountVectorizer(stop_words=french_stop_words)
X_train_vectorized_bow = vectorizer.fit_transform(X_train_text)

In [None]:
### Naive Bayes on BoW

# naive_bayes() (defined above) fits its own classifier, so no separate fit is needed here
naive_bayes(X_train_vectorized_bow, vectorizer.transform(X_test_text))

In [None]:
MultiLabel_LR(X_train_vectorized_bow, vectorizer.transform(X_test_text))