# Create the final movies dataset

In this notebook I create the final movies dataset: I clean and filter the movie metadata, then merge it with the cast and crew information used by the recommender system.

In [1]:
import pandas as pd
import ast
import numpy as np
import warnings
warnings.filterwarnings("ignore")

## Movies Dataset

In [2]:
mdf = pd.read_csv('../Data/movies_metadata.csv')
# Select only the relevant columns
mdf = mdf[['id', 'title', 'genres', 'overview', 'popularity',
       'belongs_to_collection', 'production_companies', 
       'release_date', 'tagline', 'vote_average', 'vote_count']]
mdf.head().transpose()

Unnamed: 0,0,1,2,3,4
id,862,8844,15602,31357,11862
title,Toy Story,Jumanji,Grumpier Old Men,Waiting to Exhale,Father of the Bride Part II
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","[{'id': 35, 'name': 'Comedy'}]"
overview,"Led by Woody, Andy's toys live happily in his ...",When siblings Judy and Peter discover an encha...,A family wedding reignites the ancient feud be...,"Cheated on, mistreated and stepped on, the wom...",Just when George Banks has recovered from his ...
popularity,21.946943,17.015539,11.7129,3.859495,8.387519
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",,"{'id': 96871, 'name': 'Father of the Bride Col..."
production_companies,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'name': 'Warner Bros.', 'id': 6194}, {'name'...",[{'name': 'Twentieth Century Fox Film Corporat...,"[{'name': 'Sandollar Productions', 'id': 5842}..."
release_date,1995-10-30,1995-12-15,1995-12-22,1995-12-22,1995-02-10
tagline,,Roll the dice and unleash the excitement!,Still Yelling. Still Fighting. Still Ready for...,Friends are the people who let you be yourself...,Just When His World Is Back To Normal... He's ...
vote_average,7.7,6.9,6.5,6.1,5.7


## Clean the dataset

We will clean the dataset by removing rows that don't contain important information and converting certain columns to the required data types.

### Title and Overview

Drop the films that have no title or overview: without the title we cannot identify the movie, and without the overview we cannot compute the similarity between movies.

In [3]:
mdf.dropna(subset=['title', 'overview'], inplace=True)

### Genres, Belongs to Collection & Production Companies

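Extract the genres, the collection (if available), and the production companies. These columns are stored as strings containing Python-style lists and dictionaries (JSON-like, but with single quotes), so we parse them with ast.literal_eval.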

In [4]:
def extract_dict(s):
    try:
        return ast.literal_eval(s)
    except (ValueError, SyntaxError):
        return np.nan
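
As a small illustration of why the parsing goes through ast.literal_eval rather than a JSON parser, the sketch below uses a made-up `raw` string in the same shape as the genres column. The sample values are assumptions for demonstration, not rows read from the file.

```python
import ast
import json

import numpy as np

# Hypothetical cell value, shaped like the 'genres' column (single quotes).
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# json.loads expects double-quoted strings, so it rejects this value.
try:
    json.loads(raw)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# ast.literal_eval evaluates Python literals only (no arbitrary code).
names = [item['name'] for item in ast.literal_eval(raw)]


def extract_dict(s):
    """Same helper as in the notebook: return NaN when the cell cannot be parsed."""
    try:
        return ast.literal_eval(s)
    except (ValueError, SyntaxError):
        return np.nan


# A missing cell arrives as a float NaN, which literal_eval refuses with ValueError.
missing = extract_dict(float('nan'))
```

The helper matters for belongs_to_collection, where most cells are empty; genres and production_companies are parsed directly in the next cell.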

In [5]:
# Parse the stringified lists and keep only the names
mdf['genres'] = mdf['genres'].apply(ast.literal_eval).apply(lambda x: [item['name'] for item in x])
# belongs_to_collection can be NaN, so parse it defensively with extract_dict
mdf['belongs_to_collection'] = mdf['belongs_to_collection'].apply(extract_dict).apply(lambda x: x['name'] if isinstance(x, dict) else np.nan)
mdf['production_companies'] = mdf['production_companies'].apply(ast.literal_eval).apply(lambda x: [item['name'] for item in x])
mdf.head().transpose()

Unnamed: 0,0,1,2,3,4
id,862,8844,15602,31357,11862
title,Toy Story,Jumanji,Grumpier Old Men,Waiting to Exhale,Father of the Bride Part II
genres,"[Animation, Comedy, Family]","[Adventure, Fantasy, Family]","[Romance, Comedy]","[Comedy, Drama, Romance]",[Comedy]
overview,"Led by Woody, Andy's toys live happily in his ...",When siblings Judy and Peter discover an encha...,A family wedding reignites the ancient feud be...,"Cheated on, mistreated and stepped on, the wom...",Just when George Banks has recovered from his ...
popularity,21.946943,17.015539,11.7129,3.859495,8.387519
belongs_to_collection,Toy Story Collection,,Grumpy Old Men Collection,,Father of the Bride Collection
production_companies,[Pixar Animation Studios],"[TriStar Pictures, Teitler Film, Interscope Co...","[Warner Bros., Lancaster Gate]",[Twentieth Century Fox Film Corporation],"[Sandollar Productions, Touchstone Pictures]"
release_date,1995-10-30,1995-12-15,1995-12-22,1995-12-22,1995-02-10
tagline,,Roll the dice and unleash the excitement!,Still Yelling. Still Fighting. Still Ready for...,Friends are the people who let you be yourself...,Just When His World Is Back To Normal... He's ...
vote_average,7.7,6.9,6.5,6.1,5.7


Delete the films that have no genres

In [6]:
mdf['genres'] = mdf['genres'].apply(lambda x: np.nan if not x else x)
mdf.dropna(subset='genres', inplace=True)

See all the genres we have

In [7]:
genres = set([genre for sublist in mdf['genres'] for genre in sublist])
len(genres), genres

(20,
 {'Action',
  'Adventure',
  'Animation',
  'Comedy',
  'Crime',
  'Documentary',
  'Drama',
  'Family',
  'Fantasy',
  'Foreign',
  'History',
  'Horror',
  'Music',
  'Mystery',
  'Romance',
  'Science Fiction',
  'TV Movie',
  'Thriller',
  'War',
  'Western'})

Set production_companies to NaN for the movies that have no production company

In [8]:
mdf['production_companies'] = mdf['production_companies'].apply(lambda x: np.nan if not x else x)

## Filtering the films

Now that the data is extracted and cleaned, we filter the movies

### Votes

We keep only the movies that have more than 50 votes

In [9]:
mdf['vote_count'] = mdf['vote_count'].astype('int')
mdf = mdf[mdf['vote_count'] > 50]
mdf.shape

(9008, 11)

We are left with approximately 25% of the dataset

### Year

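Extract the release year of each film. Release dates that cannot be parsed become NaT, and their year is filled with the placeholder 1989, which falls before the 1994 cutoff used below.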

In [10]:
mdf['release_date'] = pd.to_datetime(mdf['release_date'], errors='coerce')
mdf['year'] = mdf['release_date'].dt.year.fillna(1989).astype('int')

Select the movies released after 1994 or with a rating higher than 8.

In [11]:
mdf = mdf[ (mdf['year'] > 1994) | (mdf['vote_average'] > 8)]
mdf.head()

Unnamed: 0,id,title,genres,overview,popularity,belongs_to_collection,production_companies,release_date,tagline,vote_average,vote_count,year
0,862,Toy Story,"[Animation, Comedy, Family]","Led by Woody, Andy's toys live happily in his ...",21.946943,Toy Story Collection,[Pixar Animation Studios],1995-10-30,,7.7,5415,1995
1,8844,Jumanji,"[Adventure, Fantasy, Family]",When siblings Judy and Peter discover an encha...,17.015539,,"[TriStar Pictures, Teitler Film, Interscope Co...",1995-12-15,Roll the dice and unleash the excitement!,6.9,2413,1995
2,15602,Grumpier Old Men,"[Romance, Comedy]",A family wedding reignites the ancient feud be...,11.7129,Grumpy Old Men Collection,"[Warner Bros., Lancaster Gate]",1995-12-22,Still Yelling. Still Fighting. Still Ready for...,6.5,92,1995
4,11862,Father of the Bride Part II,[Comedy],Just when George Banks has recovered from his ...,8.387519,Father of the Bride Collection,"[Sandollar Productions, Touchstone Pictures]",1995-02-10,Just When His World Is Back To Normal... He's ...,5.7,173,1995
5,949,Heat,"[Action, Crime, Drama, Thriller]","Obsessive master thief, Neil McCauley leads a ...",17.924927,,"[Regency Enterprises, Forward Pass, Warner Bros.]",1995-12-15,A Los Angeles Crime Saga,7.7,1886,1995


In [12]:
mdf.shape

(6717, 12)

Finally, we are left with 6,717 films that have complete data and were either released after 1994 or have a high average rating.
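
To make the effect of the year placeholder and the OR condition concrete, here is a minimal sketch on a made-up four-row frame (`toy` and its values are hypothetical). It applies the same three steps as the cells above.

```python
import pandas as pd

# Hypothetical rows: a recent film, an old highly rated one,
# an old average one, and one with an unparseable date.
toy = pd.DataFrame({
    'title': ['new', 'old classic', 'old average', 'undated'],
    'release_date': ['2001-05-04', '1972-03-14', '1980-01-01', 'not a date'],
    'vote_average': [6.0, 8.5, 6.0, 7.0],
})

# Unparseable dates become NaT instead of raising.
toy['release_date'] = pd.to_datetime(toy['release_date'], errors='coerce')
# NaT has no year, so it is filled with the 1989 placeholder.
toy['year'] = toy['release_date'].dt.year.fillna(1989).astype('int')

# Keep films released after 1994 OR rated above 8.
kept = toy[(toy['year'] > 1994) | (toy['vote_average'] > 8)]
```

In this toy frame the undated film is dropped because its placeholder year is below the cutoff and its rating is not above 8; an undated film rated above 8 would still be kept.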

In [13]:
# Drop the release date and year
mdf = mdf.drop(columns=['release_date', 'year']).reset_index(drop=True)

In [14]:
mdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6717 entries, 0 to 6716
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     6717 non-null   object 
 1   title                  6717 non-null   object 
 2   genres                 6717 non-null   object 
 3   overview               6717 non-null   object 
 4   popularity             6717 non-null   object 
 5   belongs_to_collection  1374 non-null   object 
 6   production_companies   6520 non-null   object 
 7   tagline                5139 non-null   object 
 8   vote_average           6717 non-null   float64
 9   vote_count             6717 non-null   int64  
dtypes: float64(1), int64(1), object(8)
memory usage: 524.9+ KB


In [15]:
# Convert popularity to Float
mdf['popularity'] = mdf['popularity'].astype('float')

In [16]:
mdf.isnull().sum()

id                          0
title                       0
genres                      0
overview                    0
popularity                  0
belongs_to_collection    5343
production_companies      197
tagline                  1578
vote_average                0
vote_count                  0
dtype: int64

In the final dataset we have:

- 5343 films that do not belong to a collection
- 197 films with no information about their production companies
- 1578 films without a tagline

The null taglines do not matter, as we are going to concatenate the tagline with the overview. We will fill the null values of production companies with 'Unknown', and later we will decide what to do with the null values in the belongs_to_collection column.

### Taglines and Overview

Concatenate the tagline with the overview so the recommender system can use both as a single text feature when computing similarity.

In [20]:
# Join with a space so the overview's last word does not fuse with the tagline
mdf['overview'] = (mdf['overview'] + ' ' + mdf['tagline'].fillna('')).str.strip()
mdf.drop(columns='tagline', inplace=True)
mdf.head()

Unnamed: 0,id,title,genres,overview,popularity,belongs_to_collection,production_companies,vote_average,vote_count
0,862,Toy Story,"[Animation, Comedy, Family]","Led by Woody, Andy's toys live happily in his ...",21.946943,Toy Story Collection,[Pixar Animation Studios],7.7,5415
1,8844,Jumanji,"[Adventure, Fantasy, Family]",When siblings Judy and Peter discover an encha...,17.015539,,"[TriStar Pictures, Teitler Film, Interscope Co...",6.9,2413
2,15602,Grumpier Old Men,"[Romance, Comedy]",A family wedding reignites the ancient feud be...,11.7129,Grumpy Old Men Collection,"[Warner Bros., Lancaster Gate]",6.5,92
3,11862,Father of the Bride Part II,[Comedy],Just when George Banks has recovered from his ...,8.387519,Father of the Bride Collection,"[Sandollar Productions, Touchstone Pictures]",5.7,173
4,949,Heat,"[Action, Crime, Drama, Thriller]","Obsessive master thief, Neil McCauley leads a ...",17.924927,,"[Regency Enterprises, Forward Pass, Warner Bros.]",7.7,1886
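
The reason for the fillna('') call is that string concatenation with a missing value yields a missing value. A minimal sketch on two made-up rows (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one with a tagline, one without.
toy = pd.DataFrame({
    'overview': ['A heist goes wrong.', 'Toys come alive.'],
    'tagline': ['A Los Angeles Crime Saga', np.nan],
})

# Without fillna, the row that has no tagline loses its overview too.
naive = toy['overview'] + toy['tagline']

# With fillna(''), the overview survives. The ' ' separator keeps the last
# word of the overview from fusing with the first word of the tagline,
# and strip() removes the trailing space when the tagline is empty.
safe = (toy['overview'] + ' ' + toy['tagline'].fillna('')).str.strip()
```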


### Production Companies

Let's explore the films that do not have a production company

In [21]:
mdf[mdf['production_companies'].isna()].sort_values(by='popularity', ascending=False)[['title', 'popularity']]

Unnamed: 0,title,popularity
6673,Black Mirror: White Christmas,24.910782
428,The Opposite of Sex,14.734622
2296,The Librarian: Return to King Solomon's Mines,12.494010
4712,Puss in Boots: The Three Diablos,12.018446
2271,Big Nothing,11.802069
...,...,...
5567,Amiche da morire,2.572216
3641,Birdemic: Shock and Terror,2.529680
6393,The Present,2.435431
3068,Objectified,2.400134


Some popular films have no production company, so we cannot simply delete these rows. Instead we assign them the value 'Unknown'.

In [22]:
mdf['production_companies'] = mdf['production_companies'].fillna('Unknown')

Finally we end up with this movies dataset

In [23]:
mdf.isna().sum()

id                          0
title                       0
genres                      0
overview                    0
popularity                  0
belongs_to_collection    5343
production_companies        0
vote_average                0
vote_count                  0
dtype: int64

In the recommendation system, we will use the 'Belongs to a Collection' feature with one-hot encoding, so the null values do not affect the process.
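
The encoding step itself is not part of this notebook. As a sketch of why the nulls are harmless, assuming the encoding is done with pandas.get_dummies and its default settings, a missing collection simply becomes a row of zeros. The `collections` series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the belongs_to_collection column.
collections = pd.Series([
    'Toy Story Collection',
    np.nan,
    'Grumpy Old Men Collection',
    np.nan,
])

# By default (dummy_na=False) NaN gets no column of its own:
# films outside any collection are encoded as all zeros.
encoded = pd.get_dummies(collections)
flags_per_row = encoded.sum(axis=1).tolist()
```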

We will combine vote_average and vote_count to calculate a weighted average for more accurate recommendations, using the "Bayesian rating" technique.
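
The notebook does not spell out the formula. One common form of a Bayesian (weighted) rating is WR = v/(v+m) * R + m/(v+m) * C, where R is the film's vote_average, v its vote_count, C a prior mean rating and m the weight given to that prior. The sketch below assumes that form; `weighted_rating`, the value of `m` and the value of `C` are hypothetical choices, not values taken from the project.

```python
import pandas as pd


def weighted_rating(df, m, C):
    """Shrink each film's average rating toward the prior mean C.

    m acts as the number of 'virtual' votes at rating C: films with
    few real votes stay close to C, films with many votes keep their own average.
    """
    v = df['vote_count']
    R = df['vote_average']
    return (v / (v + m)) * R + (m / (v + m)) * C


# Hypothetical films: a well-voted one and a barely-voted one.
toy = pd.DataFrame({
    'title': ['many votes', 'few votes'],
    'vote_average': [7.7, 9.0],
    'vote_count': [5415, 60],
})

C = 6.0   # assumed prior mean rating
m = 500   # assumed prior weight, in votes
toy['score'] = weighted_rating(toy, m, C)
```

With these assumed values the film with 60 votes is pulled most of the way toward C, so it ranks below the film with 5415 votes despite its higher raw average.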

## Movies Crew

Now we add the cast and crew information, namely the actors and the director, to the films

In [24]:
cdf = pd.read_csv('../Data/credits.csv')
cdf.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [25]:
mdf['id'] = mdf['id'].astype('int')

Merge the credits with the movie metadata

In [26]:
df = mdf.merge(cdf, on='id')
df.shape

(6733, 11)

The merge returns 6,733 rows, more than the 6,717 movies we had before, which means some ids are matched more than once. Let's drop the duplicated movies

In [27]:
df.drop_duplicates(subset=['id'], inplace=True)
df.shape

(6710, 11)
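
A merge grows when a key is repeated in one of the tables, because every matching pair of rows is emitted. The sketch below reproduces that on made-up frames (`movies` and `credits` are hypothetical) and shows an alternative that deduplicates before merging and asks pandas to check the key relationship.

```python
import pandas as pd

# Hypothetical tables: id 862 appears twice in the credits.
movies = pd.DataFrame({'id': [862, 8844], 'title': ['Toy Story', 'Jumanji']})
credits = pd.DataFrame({'id': [862, 862, 8844], 'cast': ['a', 'a', 'b']})

# Inner merge: the repeated id yields two rows for the same movie.
merged = movies.merge(credits, on='id')
deduped = merged.drop_duplicates(subset=['id'])

# Alternative: deduplicate first; validate raises MergeError
# if either side still has repeated ids.
clean = movies.merge(
    credits.drop_duplicates(subset='id'),
    on='id',
    validate='one_to_one',
)
```

Since the merge is an inner join, movies whose id does not appear in the credits are dropped as well, which is why the row count after deduplication can fall below the count before the merge.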

Extract the directors and actors

In [28]:
# Convert them to a python object
df['crew'] = df['crew'].apply(lambda x: ast.literal_eval(x))
df['cast'] = df['cast'].apply(lambda x: ast.literal_eval(x))

In [29]:
# Function to extract the director from the crew
def get_director(x):
    for item in x:
        if item['job'] == 'Director':
            return item['name']
    return np.nan

In [30]:
df['director'] = df['crew'].apply(lambda x: get_director(x))
df['cast'] = df['cast'].apply(lambda x: [item['name'] for item in x])

In [31]:
# Set NaN for the films that have no cast
df['cast'] = df['cast'].apply(lambda x: np.nan if not x else x)

Check how many films have no directors or actors

In [32]:
# Double quotes outside: reusing the same quote inside an f-string fails before Python 3.12
print(f"{df['director'].isna().sum()} Films do not have director")
print(f"{df['cast'].isna().sum()} Films do not have actors")

5 Films do not have director
18 Films do not have actors


Only 23 films lack this information, which is great. We are going to delete these films from our dataset

In [33]:
df.dropna(subset=['director', 'cast'], inplace=True)
print(f"{df['director'].isna().sum()} Films do not have director")
print(f"{df['cast'].isna().sum()} Films do not have actors")

0 Films do not have director
0 Films do not have actors


In [34]:
# Drop Crew feature, since we have already extracted the director
df.drop(columns='crew', inplace=True)

## Final Dataset

In [35]:
df.head()

Unnamed: 0,id,title,genres,overview,popularity,belongs_to_collection,production_companies,vote_average,vote_count,cast,director
0,862,Toy Story,"[Animation, Comedy, Family]","Led by Woody, Andy's toys live happily in his ...",21.946943,Toy Story Collection,[Pixar Animation Studios],7.7,5415,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...",John Lasseter
1,8844,Jumanji,"[Adventure, Fantasy, Family]",When siblings Judy and Peter discover an encha...,17.015539,,"[TriStar Pictures, Teitler Film, Interscope Co...",6.9,2413,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...",Joe Johnston
2,15602,Grumpier Old Men,"[Romance, Comedy]",A family wedding reignites the ancient feud be...,11.7129,Grumpy Old Men Collection,"[Warner Bros., Lancaster Gate]",6.5,92,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...",Howard Deutch
3,11862,Father of the Bride Part II,[Comedy],Just when George Banks has recovered from his ...,8.387519,Father of the Bride Collection,"[Sandollar Productions, Touchstone Pictures]",5.7,173,"[Steve Martin, Diane Keaton, Martin Short, Kim...",Charles Shyer
4,949,Heat,"[Action, Crime, Drama, Thriller]","Obsessive master thief, Neil McCauley leads a ...",17.924927,,"[Regency Enterprises, Forward Pass, Warner Bros.]",7.7,1886,"[Al Pacino, Robert De Niro, Val Kilmer, Jon Vo...",Michael Mann


In [36]:
df.shape

(6687, 11)

In [38]:
df.isnull().sum()

id                          0
title                       0
genres                      0
overview                    0
popularity                  0
belongs_to_collection    5321
production_companies        0
vote_average                0
vote_count                  0
cast                        0
director                    0
dtype: int64

Save the final dataset

In [37]:
df.to_csv('../Data/movies_clean.csv')
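
One thing to keep in mind when this file is read back: CSV has no list type, so list-valued columns such as genres and cast come back as strings. A minimal round-trip sketch, using an in-memory buffer and a made-up one-row frame instead of the real file:

```python
import ast
import io

import pandas as pd

# Hypothetical one-row frame with a list-valued column.
df = pd.DataFrame({'id': [862], 'genres': [['Animation', 'Comedy']]})

# Write to an in-memory buffer; index=False here avoids writing
# the row index as an extra unnamed column.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

loaded = pd.read_csv(buf)
# The list was serialized as text, e.g. "['Animation', 'Comedy']".
was_string = isinstance(loaded.loc[0, 'genres'], str)

# Parse it back into a Python list.
loaded['genres'] = loaded['genres'].apply(ast.literal_eval)
```

For production_companies, the rows filled with 'Unknown' are plain text rather than list literals, so a defensive parser like extract_dict would be the safer choice there.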