# Cleaning and Filtering the Movies Dataset

In the first notebook, we conducted an Exploratory Data Analysis (EDA) on the movies dataset to confirm that we have enough data and to understand its structure. In this notebook, we will clean the dataset and filter it so that only films relevant to our recommender system remain.

In [2]:
import pandas as pd
import ast
import numpy as np
import warnings
warnings.filterwarnings("ignore")

## Movies Dataset

In [3]:
mdf = pd.read_csv('../Data/Raw/movies_metadata.csv', quotechar='"', lineterminator='\n')
mdf.columns

Index(['adult', 'backdrop_path', 'belongs_to_collection', 'budget', 'genres',
       'homepage', 'id', 'imdb_id', 'origin_country', 'original_language',
       'original_title', 'overview', 'popularity', 'poster_path',
       'production_companies', 'production_countries', 'release_date',
       'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title',
       'video', 'vote_average', 'vote_count', 'movieId'],
      dtype='object')

In [4]:
# Select only the relevant columns
mdf = mdf[['movieId', 'id', 'title', 'genres', 'overview', 'release_date', 
           'popularity', 'runtime', 'status', 'tagline',  'vote_average', 
           'vote_count', 'poster_path', 'backdrop_path']]

mdf.head().transpose()

Unnamed: 0,0,1,2,3,4
movieId,16,11,7,8,1
id,524,9087,11860,45325,862
title,Casino,The American President,Sabrina,Tom and Huck,Toy Story
genres,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","[{'id': 10749, 'name': 'Romance'}, {'id': 18, ...","[{'id': 10751, 'name': 'Family'}, {'id': 28, '...","[{'id': 16, 'name': 'Animation'}, {'id': 12, '..."
overview,"In early-1970s Las Vegas, Sam ""Ace"" Rothstein ...","Widowed U.S. president Andrew Shepherd, one of...","Sabrina Fairchild, a chauffeur's daughter, gre...","A mischievous young boy, Tom Sawyer, witnesses...","Led by Woody, Andy's toys live happily in his ..."
release_date,1995-11-22,1995-11-17,1995-12-15,1995-12-22,1995-11-22
popularity,6.6361,1.6287,2.7156,0.7317,22.3485
runtime,179,114,127,97,81
status,Released,Released,Released,Released,Released
tagline,No one stays at the top forever.,Why can't the most powerful man in the world h...,You are cordially invited to the most surprisi...,A lot of kids get into trouble. These two inve...,Hang on for the comedy that goes to infinity a...


In [5]:
mdf.dtypes

movieId            int64
id                 int64
title             object
genres            object
overview          object
release_date      object
popularity       float64
runtime            int64
status            object
tagline           object
vote_average     float64
vote_count         int64
poster_path       object
backdrop_path     object
dtype: object

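Most features already have a suitable data type. The exceptions are `release_date` and `genres`, which are stored as strings and are converted later in this notebook.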

In [6]:
print(f'The original movies dataset has {mdf.shape[0]:,} movies')

The original movies dataset has 86,351 movies


## Cleaning and Preprocessing the Dataset

We will clean the dataset by removing rows that lack important information. Specifically, we will remove movies that do not have a `title`, an `overview`, or any `genres`, as these are essential for identifying movies and calculating similarities in a content-based recommender system. Without a title, we cannot determine which movie it is, and without an overview or genres, we cannot make meaningful comparisons between movies.

We will also convert the `genres` column into a format suitable for further analysis. It is currently stored as a stringified list of dictionaries, so we will parse it and keep only the genre names.
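As a minimal sketch of this parsing step, the snippet below turns a stringified list of genre dictionaries into a plain list of names. The sample strings and the helper name `parse_genre_names` are hypothetical and only serve as an illustration. Returning an empty list for malformed input is an assumption, not necessarily how this notebook treats such rows.

```python
import ast


def parse_genre_names(raw):
    """Return the genre names from a stringified list of dicts, or [] if parsing fails."""
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Malformed strings and non-string values (e.g. NaN) end up here
        return []
    if not isinstance(items, list):
        return []
    return [item["name"] for item in items if isinstance(item, dict) and "name" in item]


# Hypothetical sample in the same shape as the `genres` column
sample = "[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name': 'Drama'}]"

print(parse_genre_names(sample))        # ['Crime', 'Drama']
print(parse_genre_names("not a list"))  # []
print(parse_genre_names("[]"))          # []
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for strings that come from a data file.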


In [7]:
# Drop films without a title or an overview
mdf = mdf.dropna(subset=['title', 'overview'])

In [8]:
def extract_dict(s):
    try:
        return ast.literal_eval(s)
    except (ValueError, SyntaxError):
        return np.nan

In [9]:
# Parse the stringified genres and keep only the genre names
mdf['genres'] = mdf['genres'].apply(ast.literal_eval).apply(lambda genres: [g['name'] for g in genres])
mdf.head().transpose()

Unnamed: 0,0,1,2,3,4
movieId,16,11,7,8,1
id,524,9087,11860,45325,862
title,Casino,The American President,Sabrina,Tom and Huck,Toy Story
genres,"[Crime, Drama]","[Comedy, Drama, Romance]","[Romance, Drama]","[Family, Action, Adventure, Drama]","[Animation, Adventure, Family, Comedy]"
overview,"In early-1970s Las Vegas, Sam ""Ace"" Rothstein ...","Widowed U.S. president Andrew Shepherd, one of...","Sabrina Fairchild, a chauffeur's daughter, gre...","A mischievous young boy, Tom Sawyer, witnesses...","Led by Woody, Andy's toys live happily in his ..."
release_date,1995-11-22,1995-11-17,1995-12-15,1995-12-22,1995-11-22
popularity,6.6361,1.6287,2.7156,0.7317,22.3485
runtime,179,114,127,97,81
status,Released,Released,Released,Released,Released
tagline,No one stays at the top forever.,Why can't the most powerful man in the world h...,You are cordially invited to the most surprisi...,A lot of kids get into trouble. These two inve...,Hang on for the comedy that goes to infinity a...


Next, we remove the films that have no genres. An empty genre list is first replaced with `NaN` so that `dropna` can discard those rows.

In [10]:
mdf['genres'] = mdf['genres'].apply(lambda x: np.nan if not x else x)
mdf.dropna(subset='genres', inplace=True)

### Poster Path

The poster path and the backdrop path will be used to display each movie's images in the user interface, so let's check how many of them are missing.

In [11]:
mdf['poster_path'].isnull().sum()

1111

In [12]:
mdf['backdrop_path'].isnull().sum()

10890

In [13]:
mdf[mdf['poster_path'].isnull()].sort_values(by='popularity', ascending=False)[['title', 'popularity']].head()

Unnamed: 0,title,popularity
32843,Carlos Spills the Beans,1.7193
22353,Hard Sun,1.549
22395,Serial Killer Culture,1.2484
31976,Targeting,1.1432
44121,Trailer Park Boys: Swearnet Live,1.0292


The movies without a poster path appear to be quite unpopular. Since the poster is essential for displaying a film in the final UI, we will remove the films that lack a poster path. On the other hand, almost 11,000 movies do not have a backdrop path, so removing them would cost too much data and we will keep those rows as they are.
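One way to check the popularity claim is to compare the median popularity of movies with and without a poster. The sketch below uses a small made-up DataFrame (`toy`) rather than the real dataset, so the numbers are purely illustrative.

```python
import pandas as pd

# Toy data, not taken from the real dataset
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "popularity": [12.0, 0.8, 5.5, 1.2],
    "poster_path": ["/a.jpg", None, "/c.jpg", None],
})

# Label each row by whether a poster is available
poster_label = toy["poster_path"].notna().map({True: "with poster", False: "without poster"})

# Median popularity per group
summary = toy.groupby(poster_label)["popularity"].median()
print(summary)

# Keep only the rows that have a poster
cleaned = toy.dropna(subset=["poster_path"])
print(len(cleaned))  # 2
```

On the real data, the same comparison could be run on `mdf` before the rows without a poster are dropped.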

In [14]:
mdf.dropna(subset='poster_path', inplace=True)

In [15]:
print(f'After doing some cleaning we are left with {mdf.shape[0]:,} movies')

After doing some cleaning we are left with 83,663 movies


## Filtering

To enhance the performance of our recommender system, we will filter and select only relevant and current movies for the following reasons:

- **Relevance to Modern Audiences:** Older movies may not resonate with today’s viewers. By focusing on more recent or popular titles, we ensure that recommendations remain aligned with current trends and user preferences.
- **Avoiding Data Sparsity:** Older movies typically have fewer interactions and ratings, leading to data sparsity. Since recommender systems rely on user interactions, movies with limited data may not generate meaningful recommendations.
- **Reducing Complexity:** A smaller, more focused dataset of relevant movies reduces model complexity. Working with fewer, more relevant movies means fewer features to process, resulting in faster computation and more efficient learning.
- **Improving User-Item Interactions:** Users are generally more engaged with recent or trending movies. Filtering out older titles helps focus the system on movies with more active user interactions, thereby enhancing the model’s accuracy.

In short, older movies tend to align poorly with current user behavior, which makes them less useful for the recommender system.

### Votes

We will keep only the movies that have more than 200 votes, since movies with fewer votes might not be relevant.

In [16]:
mdf['vote_count'] = mdf['vote_count'].astype('int')
mdf = mdf[mdf['vote_count'] > 200]
mdf.shape

(12420, 14)

This filter has a large effect: only about 12,400 films remain.
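The comparison `> 200` used in the cell above is strict, so a movie with exactly 200 votes is dropped. The sketch below shows the difference between the strict and the inclusive comparison on a made-up three-row DataFrame (`toy` is hypothetical).

```python
import pandas as pd

# Toy data around the threshold, not taken from the real dataset
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "vote_count": [199, 200, 201],
})

strict = toy[toy["vote_count"] > 200]      # more than 200 votes
inclusive = toy[toy["vote_count"] >= 200]  # at least 200 votes

print(strict["title"].tolist())     # ['C']
print(inclusive["title"].tolist())  # ['B', 'C']
```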

### Year & Rating

We will extract the release year of the films and keep the movies that were either released in 1995 or later or have an average rating above 7.5. Because the two conditions are combined with a logical OR, highly rated movies released before 1995 that are still very popular are not lost. Movies with a missing release date are assigned the placeholder year 1989, so they are kept only if their rating is above 7.5.

In [17]:
mdf['release_date'] = pd.to_datetime(mdf['release_date'], errors='coerce')
# Missing or unparseable dates get the placeholder year 1989, which falls before the 1995 cutoff
mdf['year'] = mdf['release_date'].dt.year.fillna(1989).astype('int')
# Drop the release date
mdf = mdf.drop(columns=['release_date']).reset_index(drop=True)

In [18]:
# Keep movies released in 1995 or later, or rated above 7.5
mdf = mdf[(mdf['year'] > 1994) | (mdf['vote_average'] > 7.5)]
mdf.head()

Unnamed: 0,movieId,id,title,genres,overview,popularity,runtime,status,tagline,vote_average,vote_count,poster_path,backdrop_path,year
0,16,524,Casino,"[Crime, Drama]","In early-1970s Las Vegas, Sam ""Ace"" Rothstein ...",6.6361,179,Released,No one stays at the top forever.,7.997,5994,/gziIkUSnYuj9ChCi8qOu2ZunpSC.jpg,/iZGiMD0p1M2AOmzKknFo5bkuz94.jpg,1995
1,11,9087,The American President,"[Comedy, Drama, Romance]","Widowed U.S. president Andrew Shepherd, one of...",1.6287,114,Released,Why can't the most powerful man in the world h...,6.5,732,/yObOAYFIHXHkFPQ3jhgkN2ezaD.jpg,/62BnXyJtVEq4WKNSpnPG7QPYYDI.jpg,1995
2,7,11860,Sabrina,"[Romance, Drama]","Sabrina Fairchild, a chauffeur's daughter, gre...",2.7156,127,Released,You are cordially invited to the most surprisi...,6.204,641,/i8PbLJDPU7vCwwscWD625oHbJy.jpg,/oMGV48EGhsNavC1PL8HMeWs5Udq.jpg,1995
3,1,862,Toy Story,"[Animation, Adventure, Family, Comedy]","Led by Woody, Andy's toys live happily in his ...",22.3485,81,Released,Hang on for the comedy that goes to infinity a...,7.968,18714,/uXDfjJbdP4ijW5hWSBrPrlKpxab.jpg,/3Rfvhy1Nl6sSGJwyjb0QiZzZYlB.jpg,1995
4,12,12110,Dracula: Dead and Loving It,"[Comedy, Horror]",When a lawyer shows up at the vampire's doorst...,2.0197,88,Released,You'll laugh until you die...then you'll rise ...,6.1,997,/fwdXQU3Dbs2CMHu8K87bAxlWV0t.jpg,/47HK3UNlIfLcS2C2pDIznSyUTsI.jpg,1995


In [19]:
mdf.shape

(10073, 14)

Finally, we are left with just over 10,000 films that have complete data, more than 200 votes, and were either released after 1994 or have a high average rating.
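To make the "either released after 1994 or highly rated" rule concrete, the sketch below applies the same steps to a made-up four-row DataFrame (`toy` and its values are hypothetical). It also shows what happens to a row with a missing release date, which receives the placeholder year 1989.

```python
import pandas as pd

# Toy data, not taken from the real dataset
toy = pd.DataFrame({
    "title": ["Recent, average", "Old, acclaimed", "Old, average", "No date, average"],
    "release_date": ["2010-05-01", "1972-03-14", "1980-06-20", None],
    "vote_average": [6.2, 8.7, 6.0, 6.5],
})

# Same steps as in the notebook: parse the date, extract the year, fill gaps with 1989
toy["release_date"] = pd.to_datetime(toy["release_date"], errors="coerce")
toy["year"] = toy["release_date"].dt.year.fillna(1989).astype("int")

# A row is kept if it satisfies at least one of the two conditions
kept = toy[(toy["year"] > 1994) | (toy["vote_average"] > 7.5)]
print(kept["title"].tolist())  # ['Recent, average', 'Old, acclaimed']
```

Replacing `|` with `&` would instead require both conditions at once and would drop the older acclaimed title.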

In [21]:
# Save only the ids to a CSV file
mdf[['movieId', 'id']].to_csv('../Data/Processed/movies_ids.csv', index=False)