# 100_Movie_Details_prep

### Purpose

The purpose of this notebook is to load and prepare the Movie Details dataset. This will include cleaning the dataset and looking for any missing or redundant values. This stage is important to ensure greater accuracy of the later analysis. 

### Datasets

The dataset used here is called movie details and was obtained from Kaggle. The link is provided below:

https://www.kaggle.com/stephanerappeneau/350-000-movies-from-themoviedborg/data


# Loading the data

We start by importing all the libraries and loading in the raw dataset. 

In [1]:
import pandas as pd
import os.path
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
details = pd.read_csv('../../data/raw/AllMoviesDetailsCleaned.csv', sep=';', engine = 'python')

FileNotFoundError: [Errno 2] No such file or directory: '../../data/raw/AllMoviesDetailsCleaned.csv'

# Data Cleaning and Preparation

The raw data is messy, so we need to run a number of checks and cleaning steps.

In [3]:
details.head(5)

Unnamed: 0,ï»¿id,budget,genres,imdb_id,original_language,original_title,overview,popularity,production_companies,production_countries,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,production_companies_number,production_countries_number,spoken_languages_number
0,2,0,Drama|Crime,tt0094675,fi,Ariel,Taisto Kasurinen is a Finnish coal miner whose...,0.823904,Villealfa Filmproduction Oy,Finland,...,69.0,suomi,Released,,Ariel,7.1,40,2,1,2
1,3,0,Drama|Comedy,tt0092149,fi,Varjoja paratiisissa,"An episode in the life of Nikander, a garbage ...",0.47445,Villealfa Filmproduction Oy,Finland,...,76.0,English,Released,,Shadows in Paradise,7.0,32,1,1,3
2,5,4000000,Crime|Comedy,tt0113101,en,Four Rooms,It's Ted the Bellhop's first night on the job....,1.698,Miramax Films,United States of America,...,98.0,English,Released,Twelve outrageous guests. Four scandalous requ...,Four Rooms,6.5,485,2,1,1
3,6,0,Action|Thriller|Crime,tt0107286,en,Judgment Night,"While racing to a boxing match, Frank, Mike, J...",1.32287,Universal Pictures,Japan,...,110.0,English,Released,Don't move. Don't whisper. Don't even breathe.,Judgment Night,6.5,69,3,2,1
4,8,42000,Documentary,tt0825671,en,Life in Loops (A Megacities RMX),Timo Novotny labels his new project an experim...,0.054716,inLoops,Austria,...,80.0,English,Released,A Megacities remix.,Life in Loops (A Megacities RMX),6.4,4,1,1,5


In [4]:
details.drop(['budget', 'original_language', 'overview', 'production_companies', 'production_countries', 'spoken_languages', 'status', 'tagline', 
              'production_companies_number', 'production_countries_number', 'revenue', 'spoken_languages_number', 'imdb_id',
              'original_title'
             ], axis=1, inplace=True)

We dropped the columns that we were not going to use, to reduce redundancy in the data.

In [5]:
details.head(5)

Unnamed: 0,ï»¿id,genres,popularity,release_date,runtime,title,vote_average,vote_count
0,2,Drama|Crime,0.823904,21/10/1988,69.0,Ariel,7.1,40
1,3,Drama|Comedy,0.47445,16/10/1986,76.0,Shadows in Paradise,7.0,32
2,5,Crime|Comedy,1.698,25/12/1995,98.0,Four Rooms,6.5,485
3,6,Action|Thriller|Crime,1.32287,15/10/1993,110.0,Judgment Night,6.5,69
4,8,Documentary,0.054716,01/01/2006,80.0,Life in Loops (A Megacities RMX),6.4,4


In [6]:
details.rename(columns={'ï»¿id': 'Movie ID', 'genres': 'Genres','popularity': 'Popularity', 
                        'release_date': 'Released','runtime': 'Runtime', 'title': 'Movie Title',
                        'vote_average': 'Average Rating','vote_count': 'Number of Ratings'}, inplace=True)

We renamed the remaining columns to make the dataset more descriptive.
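The `ï»¿id` column name is what a UTF-8 byte-order mark (BOM) looks like when a file is decoded with a single-byte encoding. The following is a minimal sketch, assuming the raw CSV is UTF-8 with a BOM (this has not been confirmed against the real file), of reading such data with `encoding='utf-8-sig'` so that the header comes out as plain `id`. The in-memory bytes are a made-up stand-in for the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the raw file: UTF-8 with a BOM, ';'-separated.
raw = b"\xef\xbb\xbfid;title\n2;Ariel\n3;Shadows in Paradise\n"

# 'utf-8-sig' decodes UTF-8 and drops a leading BOM if one is present.
sample = pd.read_csv(io.BytesIO(raw), sep=';', encoding='utf-8-sig')
print(list(sample.columns))
```

If the real file were loaded this way, the rename mapping would need the key `'id'` rather than `'ï»¿id'`.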

In [7]:
details.head()

Unnamed: 0,Movie ID,Genres,Popularity,Released,Runtime,Movie Title,Average Rating,Number of Ratings
0,2,Drama|Crime,0.823904,21/10/1988,69.0,Ariel,7.1,40
1,3,Drama|Comedy,0.47445,16/10/1986,76.0,Shadows in Paradise,7.0,32
2,5,Crime|Comedy,1.698,25/12/1995,98.0,Four Rooms,6.5,485
3,6,Action|Thriller|Crime,1.32287,15/10/1993,110.0,Judgment Night,6.5,69
4,8,Documentary,0.054716,01/01/2006,80.0,Life in Loops (A Megacities RMX),6.4,4


In [8]:
details.shape

(329044, 8)

At the moment the dataset is large, with 329,044 rows. We need to check for missing values, as they could negatively affect the later stages.

In [9]:
details.isnull().sum()

Movie ID                  0
Genres               121529
Popularity                0
Released              24046
Runtime               36792
Movie Title               1
Average Rating            0
Number of Ratings         0
dtype: int64

In [10]:
details.head(10)

Unnamed: 0,Movie ID,Genres,Popularity,Released,Runtime,Movie Title,Average Rating,Number of Ratings
0,2,Drama|Crime,0.823904,21/10/1988,69.0,Ariel,7.1,40
1,3,Drama|Comedy,0.47445,16/10/1986,76.0,Shadows in Paradise,7.0,32
2,5,Crime|Comedy,1.698,25/12/1995,98.0,Four Rooms,6.5,485
3,6,Action|Thriller|Crime,1.32287,15/10/1993,110.0,Judgment Night,6.5,69
4,8,Documentary,0.054716,01/01/2006,80.0,Life in Loops (A Megacities RMX),6.4,4
5,9,Drama,0.001647,02/09/2004,15.0,Sunday in August,5.3,2
6,11,Adventure|Action|Science Fiction,10.492614,25/05/1977,121.0,Star Wars,8.0,6168
7,12,Animation|Family,9.915573,30/05/2003,100.0,Finding Nemo,7.6,5531
8,13,Comedy|Drama|Romance,10.351236,06/07/1994,142.0,Forrest Gump,8.2,7204
9,14,Drama,8.191009,15/09/1999,122.0,American Beauty,7.9,2994


We found a lot of missing values, so we now check how badly they could affect our analysis. Some values may be recorded as 0 instead of NaN or null, so we check for those as well.
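The per-column checks that follow can also be gathered into one table. This is a small sketch on a made-up frame (the `toy` data and the `summary` name are illustrative, not part of the notebook) that counts NaN entries and exact zeros side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the renamed frame; the values are invented.
toy = pd.DataFrame({
    'Genres': ['Drama', None, 'Comedy', None],
    'Runtime': [90.0, 0.0, np.nan, 120.0],
    'Average Rating': [7.1, 0.0, 0.0, 6.5],
})

# One row per column: how many entries are NaN and how many are exactly 0.
summary = pd.DataFrame({
    'missing': toy.isnull().sum(),
    'zeros': (toy == 0).sum(),
})
print(summary)
```

Seeing both counts together makes it easier to judge which columns hide missing data behind a 0 placeholder.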

In [11]:
(details.Genres.isnull()).value_counts()

False    207515
True     121529
Name: Genres, dtype: int64

In [12]:
(details.Popularity == 0).value_counts()

False    329044
Name: Popularity, dtype: int64

In [13]:
(details.Runtime == 0).value_counts()

False    264254
True      64790
Name: Runtime, dtype: int64

In [14]:
(details['Average Rating'] == 0).value_counts()

True     197744
False    131300
Name: Average Rating, dtype: int64

In [15]:
(details['Number of Ratings'] == 0).value_counts()

True     196636
False    132408
Name: Number of Ratings, dtype: int64

In [16]:
details.isnull().sum()

Movie ID                  0
Genres               121529
Popularity                0
Released              24046
Runtime               36792
Movie Title               1
Average Rating            0
Number of Ratings         0
dtype: int64

There are many NaN and 0 values, so we decided to drop the rows that contain them.

In [17]:
details.dropna(subset=['Genres', 'Released'], inplace = True)

We dropped the rows that lack a genre or a release date. Missing runtimes are left in place for now.

In [18]:
details.isnull().sum()

Movie ID                 0
Genres                   0
Popularity               0
Released                 0
Runtime              20998
Movie Title              0
Average Rating           0
Number of Ratings        0
dtype: int64

In [19]:
details.head(10)

Unnamed: 0,Movie ID,Genres,Popularity,Released,Runtime,Movie Title,Average Rating,Number of Ratings
0,2,Drama|Crime,0.823904,21/10/1988,69.0,Ariel,7.1,40
1,3,Drama|Comedy,0.47445,16/10/1986,76.0,Shadows in Paradise,7.0,32
2,5,Crime|Comedy,1.698,25/12/1995,98.0,Four Rooms,6.5,485
3,6,Action|Thriller|Crime,1.32287,15/10/1993,110.0,Judgment Night,6.5,69
4,8,Documentary,0.054716,01/01/2006,80.0,Life in Loops (A Megacities RMX),6.4,4
5,9,Drama,0.001647,02/09/2004,15.0,Sunday in August,5.3,2
6,11,Adventure|Action|Science Fiction,10.492614,25/05/1977,121.0,Star Wars,8.0,6168
7,12,Animation|Family,9.915573,30/05/2003,100.0,Finding Nemo,7.6,5531
8,13,Comedy|Drama|Romance,10.351236,06/07/1994,142.0,Forrest Gump,8.2,7204
9,14,Drama,8.191009,15/09/1999,122.0,American Beauty,7.9,2994


We converted the Released column to a datetime type to aid later analysis.

In [20]:
details['Released'] = pd.to_datetime(details['Released'])
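The raw Released values appear to be written day first (for example `21/10/1988`). A string such as `02/09/2004` is ambiguous without a hint and may be read month first. The sketch below assumes every value follows `DD/MM/YYYY`, which has not been checked against the full file, and passes an explicit format so that nothing is guessed.

```python
import pandas as pd

# Two strings in the day/month/year layout seen in the Released column.
dates = pd.Series(['21/10/1988', '02/09/2004'])

# An explicit format removes any guessing about which field is the day.
parsed = pd.to_datetime(dates, format='%d/%m/%Y')
print(parsed.dt.month.tolist())  # [10, 9]
```

With an explicit format, a value that does not match raises an error instead of being silently misread, which is usually the safer failure mode during cleaning.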

We sorted the data by date to see how far back and forward it goes.

In [21]:
details.sort_values(by='Released')

Unnamed: 0,Movie ID,Genres,Popularity,Released,Runtime,Movie Title,Average Rating,Number of Ratings
199569,315946,Documentary,0.091601,1874-12-09,1.0,Passage of Venus,6.0,15
115398,194079,Documentary,0.071501,1878-06-14,1.0,Sallie Gardner at a Gallop,6.2,22
293283,426903,Documentary,0.000152,1883-11-19,1.0,Buffalo Running,6.7,3
288521,421316,Documentary,0.000143,1884-01-01,1.0,Descending Stairs and Turning Around,0.0,0
288520,421315,Documentary,0.000143,1884-01-01,2.0,Human Figure in Motion,5.0,1
99061,159897,Documentary,0.158993,1887-08-18,1.0,Man Walking Around a Corner,4.0,15
67440,96882,Documentary,0.001601,1888-01-01,1.0,Accordion Player,4.4,17
217252,335765,Documentary,0.002793,1888-05-31,1.0,Pferd und Reiter Springen Ãœber ein Hindernis,6.7,3
10067,16463,History|Documentary,0.29507,1888-10-14,1.0,Roundhay Garden Scene,6.4,35
10068,16464,Documentary,0.050777,1888-10-15,1.0,Traffic Crossing Leeds Bridge,5.8,24


We realised that the data includes a lot of movies that are missing vital information, as well as a lot of movies that have not been released yet. We tackled this by removing every row with an average rating of 0, since unreleased movies will not have a rating.

In [22]:
(details['Average Rating'] == 0.0).value_counts()

False    107678
True      94793
Name: Average Rating, dtype: int64

In [23]:
details = details[details['Average Rating'] != 0]

In [24]:
(details['Average Rating'] == 0.0).value_counts()

False    107678
Name: Average Rating, dtype: int64

After this step, only movies with a non-zero average rating remain. However, the Runtime column still contains zero and missing values, so we remove those rows as well.

In [74]:
details.sort_values(by='Released')

Unnamed: 0,Movie ID,Genres,Popularity,Released,Runtime,Movie Title,Average Rating,Number of Ratings
199569,315946,Documentary,0.091601,1874-12-09,1.0,Passage of Venus,6.0,15
115398,194079,Documentary,0.071501,1878-06-14,1.0,Sallie Gardner at a Gallop,6.2,22
293283,426903,Documentary,0.000152,1883-11-19,1.0,Buffalo Running,6.7,3
288520,421315,Documentary,0.000143,1884-01-01,2.0,Human Figure in Motion,5.0,1
99061,159897,Documentary,0.158993,1887-08-18,1.0,Man Walking Around a Corner,4.0,15
67440,96882,Documentary,0.001601,1888-01-01,1.0,Accordion Player,4.4,17
217252,335765,Documentary,0.002793,1888-05-31,1.0,Pferd und Reiter Springen Ãœber ein Hindernis,6.7,3
10067,16463,History|Documentary,0.29507,1888-10-14,1.0,Roundhay Garden Scene,6.4,35
10068,16464,Documentary,0.050777,1888-10-15,1.0,Traffic Crossing Leeds Bridge,5.8,24
20553,32406,Documentary,0.01014,1889-01-14,1.0,"Leisurely Pedestrians, Open Topped Buses and H...",6.0,1


In [75]:
(details['Runtime'] == 0).value_counts()

False    99224
True      8454
Name: Runtime, dtype: int64

In [76]:
details = details[details['Runtime'] != 0]

In [77]:
(details['Runtime'] == 0).value_counts()

False    99224
Name: Runtime, dtype: int64

In [78]:
details.dropna(subset=['Runtime'], inplace = True)

The data should now be clean, although we have lost a lot of rows.
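The rating and runtime filters above can be expressed as one boolean mask, which makes it easy to see how many rows each rule removes. This is a sketch on invented data; `toy`, `keep` and `cleaned` are illustrative names, not objects from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: only 'A' passes every rule.
toy = pd.DataFrame({
    'Movie Title': ['A', 'B', 'C', 'D'],
    'Average Rating': [7.1, 0.0, 6.5, 5.0],
    'Runtime': [90.0, 100.0, 0.0, np.nan],
})

# Keep rows with a non-zero rating and a runtime that is present and non-zero.
keep = (
    (toy['Average Rating'] != 0)
    & (toy['Runtime'] != 0)
    & toy['Runtime'].notna()
)

# .copy() gives an independent frame, so later column assignments
# do not trigger chained-assignment warnings.
cleaned = toy[keep].copy()
print(len(toy), '->', len(cleaned))
```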

In [79]:
details.shape

(94207, 8)

In [80]:
details.head()

Unnamed: 0,Movie ID,Genres,Popularity,Released,Runtime,Movie Title,Average Rating,Number of Ratings
0,2,Drama|Crime,0.823904,1988-10-21,69.0,Ariel,7.1,40
1,3,Drama|Comedy,0.47445,1986-10-16,76.0,Shadows in Paradise,7.0,32
2,5,Crime|Comedy,1.698,1995-12-25,98.0,Four Rooms,6.5,485
3,6,Action|Thriller|Crime,1.32287,1993-10-15,110.0,Judgment Night,6.5,69
4,8,Documentary,0.054716,2006-01-01,80.0,Life in Loops (A Megacities RMX),6.4,4


We realised that the Genres column holds several genres per movie, separated by '|'. We decided to split these into separate columns, each holding '1' if the movie belongs to that genre and '0' otherwise. This makes the calculations of movie prominence easier in the analysis stage.

In [81]:
genre_names = ['Drama', 'Crime', 'Action', 'Documentary', 'Adventure', 'Animation', 'Comedy',
               'Mystery', 'Horror', 'Western', 'Science Fiction', 'Thriller', 'Romance',
               'Fantasy', 'War', 'Family', 'Music', 'History', 'TV Movie', 'Foreign']

# Add one 0/1 indicator column per genre.
for genre in genre_names:
    details[genre] = details['Genres'].str.contains(genre).astype('int64')

The Genres column is now split up.
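For comparison, pandas can derive the indicator columns directly from the data with `Series.str.get_dummies`, without listing the genre names by hand. This is a sketch of that alternative on three sample strings, not the approach used in this notebook.

```python
import pandas as pd

# Sample values in the same '|'-separated layout as the Genres column.
genres = pd.Series(['Drama|Crime', 'Action|Thriller|Crime', 'Documentary'])

# Split on '|' and build one 0/1 column per distinct genre found in the data.
flags = genres.str.get_dummies(sep='|')
print(flags.columns.tolist())
```

Because the split is on the separator rather than a substring search, a genre name that happens to be contained in another name cannot produce a false match.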

In [82]:
details.head()

Unnamed: 0,Movie ID,Genres,Popularity,Released,Runtime,Movie Title,Average Rating,Number of Ratings,Drama,Crime,...,Science Fiction,Thriller,Romance,Fantasy,War,Family,Music,History,TV Movie,Foreign
0,2,Drama|Crime,0.823904,1988-10-21,69.0,Ariel,7.1,40,1,1,...,0,0,0,0,0,0,0,0,0,0
1,3,Drama|Comedy,0.47445,1986-10-16,76.0,Shadows in Paradise,7.0,32,1,0,...,0,0,0,0,0,0,0,0,0,0
2,5,Crime|Comedy,1.698,1995-12-25,98.0,Four Rooms,6.5,485,0,1,...,0,0,0,0,0,0,0,0,0,0
3,6,Action|Thriller|Crime,1.32287,1993-10-15,110.0,Judgment Night,6.5,69,0,1,...,0,1,0,0,0,0,0,0,0,0
4,8,Documentary,0.054716,2006-01-01,80.0,Life in Loops (A Megacities RMX),6.4,4,0,0,...,0,0,0,0,0,0,0,0,0,0


We checked the data types to see whether they are appropriate for every field.

In [83]:
details.dtypes

Movie ID                      int64
Genres                       object
Popularity                   object
Released             datetime64[ns]
Runtime                     float64
Movie Title                  object
Average Rating              float64
Number of Ratings             int64
Drama                         int64
Crime                         int64
Action                        int64
Documentary                   int64
Adventure                     int64
Animation                     int64
Comedy                        int64
Mystery                       int64
Horror                        int64
Western                       int64
Science Fiction               int64
Thriller                      int64
Romance                       int64
Fantasy                       int64
War                           int64
Family                        int64
Music                         int64
History                       int64
TV Movie                      int64
Foreign                     

The Popularity column is an object, but it should be a float. Initially we were not able to convert it because some values in the Popularity column included the letter 'E'. To overcome this we decided to get rid of those rows.

In [84]:
details = details[~details['Popularity'].str.contains('E')]

In [85]:
details["Popularity"] = pd.to_numeric(details.Popularity)

Popularity is now a float
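If the values containing 'E' are numbers in scientific notation (an assumption, since the notebook does not show them), they could be converted rather than dropped. The sketch below uses made-up strings to show that `pd.to_numeric` parses such notation and, with `errors='coerce'`, turns anything unparseable into NaN.

```python
import pandas as pd

# Hypothetical Popularity strings, including one in scientific notation.
popularity = pd.Series(['0.823904', '1.5E-05', 'not a number'])

# Scientific notation parses directly; unparseable entries become NaN.
as_float = pd.to_numeric(popularity, errors='coerce')
print(as_float.tolist())
```

Counting the NaN values afterwards would show how many rows are genuinely unusable, as opposed to merely written in a different notation.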

In [86]:
details.dtypes

Movie ID                      int64
Genres                       object
Popularity                  float64
Released             datetime64[ns]
Runtime                     float64
Movie Title                  object
Average Rating              float64
Number of Ratings             int64
Drama                         int64
Crime                         int64
Action                        int64
Documentary                   int64
Adventure                     int64
Animation                     int64
Comedy                        int64
Mystery                       int64
Horror                        int64
Western                       int64
Science Fiction               int64
Thriller                      int64
Romance                       int64
Fantasy                       int64
War                           int64
Family                        int64
Music                         int64
History                       int64
TV Movie                      int64
Foreign                     

It will be much easier to work with just the year and month in the analysis, so we created two extra columns based on each movie's release date.

In [87]:
details['Year'], details['Month'] = details['Released'].dt.year, details['Released'].dt.month

In [88]:
details.head()

Unnamed: 0,Movie ID,Genres,Popularity,Released,Runtime,Movie Title,Average Rating,Number of Ratings,Drama,Crime,...,Romance,Fantasy,War,Family,Music,History,TV Movie,Foreign,Year,Month
0,2,Drama|Crime,0.823904,1988-10-21,69.0,Ariel,7.1,40,1,1,...,0,0,0,0,0,0,0,0,1988,10
1,3,Drama|Comedy,0.47445,1986-10-16,76.0,Shadows in Paradise,7.0,32,1,0,...,0,0,0,0,0,0,0,0,1986,10
2,5,Crime|Comedy,1.698,1995-12-25,98.0,Four Rooms,6.5,485,0,1,...,0,0,0,0,0,0,0,0,1995,12
3,6,Action|Thriller|Crime,1.32287,1993-10-15,110.0,Judgment Night,6.5,69,0,1,...,0,0,0,0,0,0,0,0,1993,10
4,8,Documentary,0.054716,2006-01-01,80.0,Life in Loops (A Megacities RMX),6.4,4,0,0,...,0,0,0,0,0,0,0,0,2006,1


We saved this dataset as a pickle so that it can be used in the other notebooks.

In [89]:
details.to_pickle('../../data/processed/Movies_Details.pkl')
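A pickle keeps the column dtypes, including the datetime column, which a CSV would not. The following sketch checks that round trip on a one-row made-up frame written to a temporary directory; the file name `toy.pkl` is illustrative.

```python
import os
import tempfile
import pandas as pd

# Hypothetical frame with a datetime and a float column.
toy = pd.DataFrame({
    'Movie Title': ['Ariel'],
    'Released': pd.to_datetime(['1988-10-21']),
    'Popularity': [0.823904],
})

# Write to a temporary directory so the sketch does not touch project data.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.pkl')
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# True when every column came back with the same dtype.
print(restored.dtypes.equals(toy.dtypes))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.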