Importing Libraries

In [1]:
from pathlib import Path
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

# Show at most 20 rows and 50 columns when displaying DataFrames
pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', 50)

Data Loading

In [2]:
# Path to the raw movies dataset
movie_path = Path('../data/movies.csv')

# Load the movies data
movies = pd.read_csv(movie_path)
movies.head(2)

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director
0,0,237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron
1,1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski


Handling Missing Values & Identifying Irrelevant Data

In [3]:
# Checking for missing values
missing_values = movies.isnull().sum()
missing_values[missing_values > 0] / movies.shape[0] * 100

genres           0.582969
homepage        64.355611
keywords         8.577972
overview         0.062461
release_date     0.020820
runtime          0.041641
tagline         17.572351
cast             0.895274
director         0.624610
dtype: float64

Several columns contain missing values, so we must either drop the affected rows or replace the missing entries. Dropping rows unfortunately loses data, so for now we replace the missing values. If the model's recommendations turn out not to be good enough, we will drop the rows with null values instead.
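
To make the trade-off concrete, here is a minimal sketch on a made-up toy frame (the `toy` DataFrame and its values are assumptions for illustration, not rows from `movies.csv`). It compares how many rows survive dropping versus filling.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `movies`; the values are invented for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "tagline": ["x", np.nan, np.nan, "y"],
    "runtime": [100.0, 90.0, np.nan, 120.0],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: keep all rows and substitute placeholder values.
filled = toy.copy()
filled["tagline"] = filled["tagline"].fillna("unknown")
filled["runtime"] = filled["runtime"].fillna(filled["runtime"].mean())

rows_lost = len(toy) - len(dropped)
print(f"dropna keeps {len(dropped)} of {len(toy)} rows; fillna keeps {len(filled)}")
```

In this toy case dropping discards half of the rows, while filling keeps all of them at the cost of introducing placeholder values.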



In [4]:
# Categorical: fill with the most common value (mode)
movies['genres'] = movies['genres'].fillna(movies['genres'].mode()[0])

# Free-text columns: fill with a placeholder string
movies['homepage'] = movies['homepage'].fillna('unknown url')
movies['keywords'] = movies['keywords'].fillna('unknown')
movies['overview'] = movies['overview'].fillna('unknown')
movies['tagline'] = movies['tagline'].fillna('unknown')
movies['cast'] = movies['cast'].fillna('unknown')
movies['director'] = movies['director'].fillna('unknown')

# Dates: carry the previous row's value forward
movies['release_date'] = movies['release_date'].ffill()

# Numeric: fill with the column mean
movies['runtime'] = movies['runtime'].fillna(movies['runtime'].mean())



Every column that contained null values has now been filled, so later steps do not have to handle missing entries.

In [5]:
movies.isna().sum().sum()

0

No null values remain in the data.

#### Parsing/Transforming Fields

In [6]:
movies['genres'] = movies['genres'].str.split(' ')
movies['keywords'] = movies['keywords'].str.split(' ')
movies['cast'] = movies['cast'].str.split(' ')
movies['director'] = movies['director'].str.split(' ')

The columns above have been split into lists of tokens in the hope of improving the recommendations. If they do not help during recommendation, these features may be dropped.
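
One thing to keep in mind is that splitting on a single space also breaks multi-word values such as "Science Fiction" into separate tokens. The sketch below shows the effect on one genre string, plus one possible workaround. The `MULTI_WORD` lookup is a hypothetical helper introduced here for illustration and is not part of the notebook.

```python
import pandas as pd

genres = pd.Series(["Action Adventure Fantasy Science Fiction"])

# Plain whitespace split: "Science Fiction" becomes two unrelated tokens.
naive_tokens = genres.str.split(" ")
print(naive_tokens[0])

# Hypothetical lookup of multi-word genres to protect before splitting.
MULTI_WORD = {"Science Fiction": "Science_Fiction"}

protected = genres
for phrase, token in MULTI_WORD.items():
    # Literal (non-regex) replacement of the phrase with a single token.
    protected = protected.str.replace(phrase, token, regex=False)

safe_tokens = protected.str.split(" ")
print(safe_tokens[0])
```

The same caveat applies to `cast` and `director`, where first and last names end up as independent tokens.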

#### Feature Creation

In [7]:
movies['combined_features'] = (
    movies['title'].astype(str) + ' '
    + movies['keywords'].astype(str) + ' '
    + movies['cast'].astype(str) + ' '
    + movies['genres'].astype(str)
)
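
Because `keywords`, `cast` and `genres` hold lists at this point, `astype(str)` produces each list's printed representation (brackets, quotes and commas included) rather than plain text. If plain space-separated text is preferred, one option is `Series.str.join`. The following is a sketch on a made-up one-row frame, not a change to the cell above.

```python
import pandas as pd

# Toy row shaped like the parsed `movies` frame; values are illustrative.
toy = pd.DataFrame({
    "title": ["Avatar"],
    "keywords": [["culture", "clash", "future"]],
    "genres": [["Action", "Adventure"]],
})

# astype(str) keeps the list punctuation in the resulting string.
as_repr = toy["title"] + " " + toy["keywords"].astype(str)

# str.join concatenates the list elements with a separator instead.
as_text = (
    toy["title"] + " "
    + toy["keywords"].str.join(" ") + " "
    + toy["genres"].str.join(" ")
)

print(as_repr[0])
print(as_text[0])
```

Whether the extra punctuation matters depends on how the text is tokenized later, so this is a design choice rather than a required fix.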

In [8]:
# Collect the names of all numeric columns
numerical_data = movies.select_dtypes('number').columns.to_list()

# Rescale each numeric column to the [0, 1] range
scaler = MinMaxScaler()
movies[numerical_data] = scaler.fit_transform(movies[numerical_data])

A combined text feature has been added, and all numerical columns have been rescaled to the [0, 1] range with MinMaxScaler.
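
Min-max scaling maps each value x to (x - min) / (max - min) per column. Since `select_dtypes('number')` selects every numeric column, identifier columns such as `id` and `index` are rescaled too. If the original identifiers are needed later, a possible variant is to exclude them first. The sketch below uses a made-up toy frame, and the `ID_COLUMNS` list is an assumption about which columns count as identifiers.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "id": [19995, 285, 206647],
    "budget": [237000000, 300000000, 245000000],
    "runtime": [162.0, 169.0, 148.0],
})

# Assumption: these columns are identifiers and should stay untouched.
ID_COLUMNS = ["id"]

# All numeric columns except the identifiers.
numeric_cols = (
    toy.select_dtypes("number").columns.difference(ID_COLUMNS).to_list()
)

scaled = toy.copy()
scaled[numeric_cols] = MinMaxScaler().fit_transform(scaled[numeric_cols])

print(scaled)
```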

In [9]:
movies.head(2)

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director,combined_features
0,0.0,0.623684,"[Action, Adventure, Fantasy, Science, Fiction]",http://www.avatarmovie.com/,0.043505,"[culture, clash, future, space, war, space, co...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",0.171815,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,1.0,0.47929,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,0.72,0.858057,"[Sam, Worthington, Zoe, Saldana, Sigourney, We...","[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...","[James, Cameron]","Avatar ['culture', 'clash', 'future', 'space',..."
1,0.000208,0.789474,"[Adventure, Fantasy, Action]",http://disney.go.com/disneypictures/pirates/,0.000609,"[ocean, drug, abuse, exotic, island, east, ind...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",0.158846,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,0.344696,0.5,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,0.69,0.327225,"[Johnny, Depp, Orlando, Bloom, Keira, Knightle...","[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...","[Gore, Verbinski]",Pirates of the Caribbean: At World's End ['oce...


In [10]:
# Save the cleaned data without writing the DataFrame index
movies.to_csv('../data/cleaned_data.csv', index=False)
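
Note that CSV stores list-valued columns as text, so `genres`, `keywords`, `cast` and `director` come back as strings when the file is read again. A minimal sketch of restoring them with `ast.literal_eval` follows. It uses an in-memory buffer and a made-up row instead of the real file.

```python
import ast
import io

import pandas as pd

# Toy frame with one list-valued column; values are illustrative.
toy = pd.DataFrame({"title": ["Avatar"], "genres": [["Action", "Adventure"]]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)

# After the round trip the column holds the list's text form, not a list.
is_string = isinstance(reloaded.loc[0, "genres"], str)

# Parse the text back into Python lists.
reloaded["genres"] = reloaded["genres"].apply(ast.literal_eval)

print(is_string, reloaded.loc[0, "genres"])
```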