# Business Problem

For this project, a client has requested the production of a MySQL database on movies from a subset of IMDb's publicly available dataset. They would like an analysis of this database to help determine what makes a movie successful, along with recommendations on how to make a successful movie.

## Part 1

Download several files from IMDb's movie dataset and filter out the subset of movies requested by the stakeholder.

## Specifications

- Exclude any movie with missing values for genre or runtime.
- Include only full-length movies (titleType = "movie").
- Include only fictional movies (exclude the Documentary genre).
- Include only movies released from 2000 through 2021, inclusive.
- Include only movies released in the United States.

# Data Location
- The dataset files can be accessed and downloaded from https://datasets.imdbws.com/. 

# Data Dictionary

## Title AKAs

- titleId (string) - a tconst, an alphanumeric unique identifier of the title
- ordering (integer) - a number to uniquely identify rows for a given titleId
- title (string) - the localized title
- region (string) - the region for this version of the title
- language (string) - the language of the title
- types (array) - enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
- attributes (array) - additional terms to describe this alternative title, not enumerated
- isOriginalTitle (boolean) - 0: not original title; 1: original title

## Title Basics

- tconst (string) - alphanumeric unique identifier of the title
- titleType (string) - the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
- primaryTitle (string) - the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) - original title, in the original language
- isAdult (boolean) - 0: non-adult title; 1: adult title
- startYear (YYYY) - represents the release year of a title. In the case of TV Series, it is the series start year
- endYear (YYYY) - TV Series end year. '\N' for all other title types
- runtimeMinutes - primary runtime of the title, in minutes
- genres (string array) - includes up to three genres associated with the title

## Title Ratings

- tconst (string) - alphanumeric unique identifier of the title
- averageRating – weighted average of all the individual user ratings
- numVotes - number of votes the title has received
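
The `genres` field in Title Basics is described as a string array, but in the downloaded file it arrives as one comma-separated string per title. The following is a minimal sketch, on made-up demo rows, of how such a column could be split into individual genres with pandas; the names `genre_demo`, `genres_split`, and `genre_long` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical demo rows shaped like Title Basics
genre_demo = pd.DataFrame({
    "tconst": ["tt0035423", "tt0069049"],
    "genres": ["Comedy,Fantasy,Romance", "Drama"],
})

# Turn the comma-separated string into a Python list per row
genre_demo["genres_split"] = genre_demo["genres"].str.split(",")

# One row per (title, genre) pair, which is convenient for counting genres
genre_long = genre_demo.explode("genres_split")
print(genre_long[["tconst", "genres_split"]])
```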

# Import Libraries

In [1]:
import pandas as pd
import numpy as np
import os

# Download Data

In [2]:
basics_url="https://datasets.imdbws.com/title.basics.tsv.gz"
aka_url= "https://datasets.imdbws.com/title.akas.tsv.gz"
ratings_url="https://datasets.imdbws.com/title.ratings.tsv.gz"

In [3]:
# Create Dataframes
basics_df = pd.read_csv(basics_url, sep='\t', low_memory=False)

aka_df = pd.read_csv(aka_url, sep='\t', low_memory=False)

ratings_df = pd.read_csv(ratings_url, sep='\t', low_memory=False)
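
As an alternative to replacing "\N" after loading, `pd.read_csv` can mark the placeholder as missing while it reads the file, through its `na_values` parameter. The sketch below uses a small in-memory TSV (made-up rows in `sample_tsv`) instead of the real download, so it runs without network access; it is an illustration, not the approach this notebook takes.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for title.basics.tsv (hypothetical sample rows)
sample_tsv = (
    "tconst\ttitleType\tstartYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tshort\t1894\t\\N\t1\tDocumentary,Short\n"
    "tt0035423\tmovie\t2001\t\\N\t118\tComedy,Fantasy,Romance\n"
    "tt9999999\tmovie\t\\N\t\\N\t\\N\t\\N\n"
)

# na_values tells pandas to treat the literal string "\N" as missing
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

# Count missing values per column
print(sample_df.isna().sum())
```

Columns that contain missing values are read as floats or objects rather than integers, so an explicit `astype(int)` is still needed after the null rows are dropped.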

# Preprocessing

## Title Basics 

- Replace "\N" with np.nan
- Eliminate movies that are null for runtimeMinutes
- Eliminate movies that are null for genre
- Keep only titleType == "movie"
- Keep startYear 2000-2022 (one year beyond the 2000-2021 range given in the Specifications)
- Eliminate movies that include "Documentary" in genre (see tip below)
- Keep only US movies (Use AKAs table, see "Filtering one dataframe based on another" section below)
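
The last step in the list, keeping only US movies, relies on filtering one dataframe by the identifiers found in another. A minimal sketch of that pattern with `Series.isin` follows; `basics_demo` and `akas_demo` are small made-up frames standing in for the real tables.

```python
import pandas as pd

# Hypothetical stand-ins for Title Basics and Title AKAs
basics_demo = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003"],
    "primaryTitle": ["A", "B", "C"],
})
akas_demo = pd.DataFrame({
    "titleId": ["tt001", "tt001", "tt002", "tt003"],
    "region": ["US", "DE", "FR", "US"],
})

# Collect the ids of titles that have at least one US entry
us_ids = akas_demo.loc[akas_demo["region"] == "US", "titleId"].unique()

# Boolean mask: True where the basics row's tconst is among the US ids
keep = basics_demo["tconst"].isin(us_ids)
basics_us = basics_demo[keep]
print(basics_us)
```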

In [4]:
basics_df.head()

| | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres |
|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0000001 | short | Carmencita | Carmencita | 0 | 1894 | \N | 1 | Documentary,Short |
| 1 | tt0000002 | short | Le clown et ses chiens | Le clown et ses chiens | 0 | 1892 | \N | 5 | Animation,Short |
| 2 | tt0000003 | short | Pauvre Pierrot | Pauvre Pierrot | 0 | 1892 | \N | 4 | Animation,Comedy,Romance |
| 3 | tt0000004 | short | Un bon bock | Un bon bock | 0 | 1892 | \N | 12 | Animation,Short |
| 4 | tt0000005 | short | Blacksmith Scene | Blacksmith Scene | 0 | 1893 | \N | 1 | Comedy,Short |


In [5]:
# Replace "\N" with np.nan
basics_df.replace({'\\N':np.nan},inplace=True)

In [6]:
# Eliminate movies that are missing values for runtimeMinutes, genres, startYear
basics_df = basics_df.dropna(subset = ['runtimeMinutes', 'genres', 'startYear'])

In [7]:
# Keep titleType movie
basics_df = basics_df.loc[basics_df['titleType'] == 'movie']

In [8]:
# Keep startYear 2000-2022
basics_df['startYear'] = basics_df['startYear'].astype(int)
basics_df = basics_df.loc[(basics_df['startYear'] >= 2000) & (basics_df['startYear'] <=2022)]

In [9]:
basics_df['startYear'].value_counts()

2017    14388
2018    14358
2019    14112
2016    13974
2015    13483
2014    13126
2022    12961
2021    12433
2013    12397
2012    11655
2020    11601
2011    10783
2010    10216
2009     9372
2008     8167
2007     6972
2006     6528
2005     5851
2004     5219
2003     4601
2002     4140
2001     3878
2000     3646
Name: startYear, dtype: int64
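
The counts above include 2022, while the Specifications ask for 2000 through 2021. If the narrower range is wanted, the bounds can be kept in named constants and applied with `Series.between`, which is inclusive on both ends by default. This is a sketch on made-up rows; `FIRST_YEAR`, `LAST_YEAR`, and `years_demo` are illustrative names.

```python
import pandas as pd

# Bounds taken from the Specifications section (inclusive)
FIRST_YEAR, LAST_YEAR = 2000, 2021

# Hypothetical demo rows; startYear arrives as text in the raw file
years_demo = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003", "tt004"],
    "startYear": ["1999", "2000", "2021", "2022"],
})
years_demo["startYear"] = years_demo["startYear"].astype(int)

# between() keeps both endpoints by default
in_range = years_demo["startYear"].between(FIRST_YEAR, LAST_YEAR)
years_kept = years_demo.loc[in_range]
print(years_kept)
```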

In [10]:
# Eliminate movies that include "Documentary" in genre
documentary_filter = basics_df['genres'].str.contains('documentary', case=False)
basics_df = basics_df[~documentary_filter]
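
The filter above works because rows with missing `genres` were already dropped. If it were applied to a column that still contains missing values, `str.contains` would return NaN for those rows and the `~` inversion would fail. A minimal sketch of a more defensive variant, using the `na=False` argument on made-up rows, is shown here.

```python
import pandas as pd

# Hypothetical demo rows, including one with a missing genre
genres_demo = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003", "tt004"],
    "genres": ["Comedy,Romance", "Documentary", "Biography,Documentary", None],
})

# na=False treats a missing genre as "not a documentary"
is_documentary = genres_demo["genres"].str.contains(
    "documentary", case=False, na=False
)

# ~ inverts the mask, keeping every row that is not a documentary
fiction_demo = genres_demo[~is_documentary]
print(fiction_demo)
```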

In [11]:
# making new folder with os
os.makedirs('Data/',exist_ok=True) 
# Confirm folder created
os.listdir("Data/")

['title_akas.csv.gz', 'title_basics.csv.gz', 'title_ratings.csv.gz']

In [12]:
## Save dataframe to file
basics_df.to_csv("Data/title_basics.csv.gz", compression='gzip', index=False)

In [13]:
# Open saved file and preview again
basics_df = pd.read_csv("Data/title_basics.csv.gz", low_memory=False)
basics_df.head()

| | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres |
|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0035423 | movie | Kate & Leopold | Kate & Leopold | 0 | 2001 | NaN | 118 | Comedy,Fantasy,Romance |
| 1 | tt0043139 | movie | Life of a Beijing Policeman | Wo zhe yi bei zi | 0 | 2013 | NaN | 120 | Drama,History |
| 2 | tt0062336 | movie | The Tango of the Widower and Its Distorting Mi... | El tango del viudo y su espejo deformante | 0 | 2020 | NaN | 70 | Drama |
| 3 | tt0069049 | movie | The Other Side of the Wind | The Other Side of the Wind | 0 | 2018 | NaN | 122 | Drama |
| 4 | tt0088751 | movie | The Naked Monster | The Naked Monster | 0 | 2005 | NaN | 100 | Comedy,Horror,Sci-Fi |


In [14]:
basics_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147799 entries, 0 to 147798
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   tconst          147799 non-null  object 
 1   titleType       147799 non-null  object 
 2   primaryTitle    147799 non-null  object 
 3   originalTitle   147799 non-null  object 
 4   isAdult         147799 non-null  int64  
 5   startYear       147799 non-null  int64  
 6   endYear         0 non-null       float64
 7   runtimeMinutes  147799 non-null  int64  
 8   genres          147799 non-null  object 
dtypes: float64(1), int64(3), object(5)
memory usage: 10.1+ MB


## Title AKAs
- Replace "\N" with np.nan
- Keep only US movies (region == "US")

In [15]:
# Create dataframe
akas_df = pd.read_csv(aka_url,sep='\t', low_memory=False)

In [16]:
## Save dataframe to file
akas_df.to_csv("Data/title_akas.csv.gz", compression='gzip', index=False)

In [17]:
# Replace "\N" with np.nan
akas_df.replace({'\\N':np.nan},inplace=True)

In [18]:
akas_df = akas_df.loc[akas_df['region'] =='US']

In [19]:
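# Open saved file and preview again
# The file was saved in In [16], before the "\N" replacement and the US filter,
# so this reload discards both steps and the preview below is unfiltered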
akas_df = pd.read_csv("Data/title_akas.csv.gz", low_memory=False)
akas_df.head()

| | titleId | ordering | title | region | language | types | attributes | isOriginalTitle |
|---|---|---|---|---|---|---|---|---|
| 0 | tt0000001 | 1 | Карменсіта | UA | \N | imdbDisplay | \N | 0 |
| 1 | tt0000001 | 2 | Carmencita | DE | \N | \N | literal title | 0 |
| 2 | tt0000001 | 3 | Carmencita - spanyol tánc | HU | \N | imdbDisplay | \N | 0 |
| 3 | tt0000001 | 4 | Καρμενσίτα | GR | \N | imdbDisplay | \N | 0 |
| 4 | tt0000001 | 5 | Карменсита | RU | \N | imdbDisplay | \N | 0 |


In [20]:
akas_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36361502 entries, 0 to 36361501
Data columns (total 8 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   titleId          object
 1   ordering         int64 
 2   title            object
 3   region           object
 4   language         object
 5   types            object
 6   attributes       object
 7   isOriginalTitle  object
dtypes: int64(1), object(7)
memory usage: 2.2+ GB
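
In the AKAs cells above, the file is written before the "\N" replacement and the US filter run, and then read back, so the reloaded frame still holds every region (about 36 million rows). The sketch below shows one ordering that would keep the filter: replace, filter, save, then reload. It uses made-up rows and a temporary directory rather than the notebook's `Data/` folder, so the names and paths are assumptions.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical demo rows shaped like Title AKAs
akas_demo = pd.DataFrame({
    "titleId": ["tt001", "tt001", "tt002"],
    "title": ["Carmencita", "Carmencita", "Le clown et ses chiens"],
    "region": ["US", "DE", "\\N"],
    "language": ["\\N", "\\N", "fr"],
})

# 1. Replace the "\N" placeholder with NaN
akas_demo = akas_demo.replace({"\\N": np.nan})

# 2. Keep only rows whose region is US
akas_us = akas_demo.loc[akas_demo["region"] == "US"]

# 3. Save the filtered frame, then reload it to confirm what was written
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "title_akas_demo.csv.gz")
    akas_us.to_csv(path, compression="gzip", index=False)
    akas_reloaded = pd.read_csv(path)

print(akas_reloaded)
```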


## Title Ratings
- Replace "\N" with np.nan (if any)
- Keep only US movies (Use AKAs table, see "Filtering one dataframe based on another" section below)

In [21]:
# Create dataframe
ratings_df = pd.read_csv(ratings_url,sep='\t', low_memory=False)

In [22]:
ratings_df.head()

| | tconst | averageRating | numVotes |
|---|---|---|---|
| 0 | tt0000001 | 5.7 | 1982 |
| 1 | tt0000002 | 5.8 | 265 |
| 2 | tt0000003 | 6.5 | 1838 |
| 3 | tt0000004 | 5.5 | 178 |
| 4 | tt0000005 | 6.2 | 2625 |


In [23]:
# Replace "\N" with np.nan
ratings_df.replace({'\\N':np.nan},inplace=True)

In [24]:
## Save dataframe to file
ratings_df.to_csv("Data/title_ratings.csv.gz", compression='gzip', index=False)

In [25]:
# Open saved file and preview again
ratings_df = pd.read_csv("Data/title_ratings.csv.gz", low_memory=False)
ratings_df.head()

| | tconst | averageRating | numVotes |
|---|---|---|---|
| 0 | tt0000001 | 5.7 | 1982 |
| 1 | tt0000002 | 5.8 | 265 |
| 2 | tt0000003 | 6.5 | 1838 |
| 3 | tt0000004 | 5.5 | 178 |
| 4 | tt0000005 | 6.2 | 2625 |


In [26]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1323907 entries, 0 to 1323906
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   tconst         1323907 non-null  object 
 1   averageRating  1323907 non-null  float64
 2   numVotes       1323907 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 30.3+ MB
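
The ratings cells above replace "\N" and save the file, but the "Keep only US movies" step from the Title Ratings list is not applied: the saved file still has all 1,323,907 rows. One way that step could be written is as a small helper that accepts any table keyed by `tconst`. The function name `keep_titles_in_region` and the demo frames are hypothetical and only illustrate the idea.

```python
import pandas as pd

def keep_titles_in_region(df, akas, id_col="tconst", region="US"):
    """Return the rows of df whose id appears in akas for the given region."""
    region_ids = akas.loc[akas["region"] == region, "titleId"]
    return df[df[id_col].isin(region_ids)]

# Hypothetical demo rows shaped like Title Ratings and Title AKAs
ratings_demo = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003"],
    "averageRating": [5.7, 5.8, 6.5],
    "numVotes": [1982, 265, 1838],
})
akas_demo = pd.DataFrame({
    "titleId": ["tt001", "tt002", "tt002", "tt003"],
    "region": ["UA", "US", "DE", "GB"],
})

ratings_us = keep_titles_in_region(ratings_demo, akas_demo)
print(ratings_us)
```

The filter would need to run before the save in order for the written file to contain only US titles.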
