# IMDB Movie Analysis
Cameron Peace

### Assignment Checklist

<mark>***Download several files from IMDb’s movie data set and filter out the subset of movies requested by the stakeholder.***

Tables requested:
* title.basics.tsv.gz
* title.ratings.tsv.gz
* title.akas.tsv.gz

Filtering/Cleaning Steps:
* Title Basics:
    * [x] Replace "\N" with np.nan
    * [x] Eliminate movies that are null for runtimeMinutes
    * [x] Eliminate movies that are null for genre
    * [x] keep only titleType==Movie
    * [x] keep startYear 2000-2022
    * [x] Eliminate movies that include "Documentary" in genre (see tip below)
    * [x] Keep only US movies (Use AKAs table, see "Filtering one dataframe based on another" section below)
* AKAs:
    * [x] keep only US movies.
    * [x] Replace "\N" with np.nan
* Ratings:
    * [x] Replace "\N" with np.nan (if any)
    * [x] Keep only US movies (Use AKAs table, see "Filtering one dataframe based on another" section below)
    
Deliverable:
After filtering out movies that do not meet the stakeholder's specifications:

* [x] Before saving, run a final .info() for each of the dataframes to show a summary of how many movies remain and the datatypes of each feature
* [x] Save each dataframe as a compressed csv file in the "Data/" folder inside your repository.
* [x] Commit your changes to your repository in GitHub desktop and Publish repository / Push Changes.
* [x] Submit the link to your repository

### Data Background

IMDb Datasets
Subsets of IMDb data are available for access to customers for personal and non-commercial use. You can hold local copies of this data, and it is subject to our terms and conditions. Please refer to the Non-Commercial Licensing and copyright/license and verify compliance.

Data Location:  The dataset files can be accessed and downloaded from https://datasets.imdbws.com/. The data is refreshed daily.

### Data Dictionaries

**title.basics.tsv.gz:**

* tconst (string) - alphanumeric unique identifier of the title
* titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
* primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
* originalTitle (string) - original title, in the original language
* isAdult (boolean) - 0: non-adult title; 1: adult title
* startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
* endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
* runtimeMinutes – primary runtime of the title, in minutes
* genres (string array) – includes up to three genres associated with the title

---

**title.akas.tsv.gz:**

* titleId (string) - a tconst, an alphanumeric unique identifier of the title
* ordering (integer) – a number to uniquely identify rows for a given titleId
* title (string) – the localized title
* region (string) - the region for this version of the title
* language (string) - the language of the title
* types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
* attributes (array) - Additional terms to describe this alternative title, not enumerated
* isOriginalTitle (boolean) – 0: not original title; 1: original title

---
**title.ratings.tsv.gz:**

* tconst (string) - alphanumeric unique identifier of the title
* averageRating – weighted average of all the individual user ratings
* numVotes - number of votes the title has received

## Imports

In [56]:
import pandas as pd
import numpy as np
import os

## Loading and Viewing Data

In [4]:
# imdb urls for datasets
basics_url = "https://datasets.imdbws.com/title.basics.tsv.gz"
akas_url = "https://datasets.imdbws.com/title.akas.tsv.gz"
ratings_url = "https://datasets.imdbws.com/title.ratings.tsv.gz"

In [6]:
# loading the data
basics = pd.read_csv(basics_url, sep='\t', low_memory=False)
akas = pd.read_csv(akas_url, sep='\t', low_memory=False)
ratings = pd.read_csv(ratings_url, sep='\t', low_memory=False)

# viewing the data
display(basics.head(), akas.head(), ratings.head())

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0


Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1960
1,tt0000002,5.8,263
2,tt0000003,6.5,1799
3,tt0000004,5.6,179
4,tt0000005,6.2,2597


## Cleaning Data

### Replacing '\N' with np.nan in All DFs

In [11]:
# replacing '\N' with np.nan
basics = basics.replace('\\N', np.nan).copy()
akas = akas.replace('\\N', np.nan).copy()
ratings = ratings.replace('\\N', np.nan).copy()

In [12]:
# confirming changes
display(basics.head(3), akas.head(3))

# ratings did not appear to have any NaN values
ratings.isna().sum()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,,4,"Animation,Comedy,Romance"


Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,,imdbDisplay,,0
1,tt0000001,2,Carmencita,DE,,,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,,imdbDisplay,,0


tconst           0
averageRating    0
numVotes         0
dtype: int64

<mark><u>**Comment:**</u>

<font color='dodgerblue' size=4><i>
The replacement worked for all three tables. The ratings table has no null values at the moment, but I'm leaving the replacement code in place in case that changes in a future data refresh.
</i></font>
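
The `'\N'` replacement above can also be done while reading the file, which avoids a second pass over the full tables. The following is a minimal sketch, not the method used in this notebook: the sample rows and the helper name `load_imdb_tsv` are made up for illustration, and only the `'\N'` convention comes from the IMDb files.

```python
import io
import pandas as pd

# Hypothetical stand-in for an IMDb TSV file; it uses the same '\N' marker.
sample_tsv = (
    "tconst\tstartYear\tendYear\truntimeMinutes\n"
    "tt0000001\t1894\t\\N\t1\n"
    "tt0000002\t\\N\t\\N\t5\n"
)

def load_imdb_tsv(source):
    """Read an IMDb-style TSV, parsing the literal backslash-N marker as NaN."""
    return pd.read_csv(source, sep="\t", na_values="\\N", low_memory=False)

sample = load_imdb_tsv(io.StringIO(sample_tsv))

# Null counts per column, comparable to the .isna().sum() check used above
print(sample.isna().sum())
```

A side effect of this approach is that numeric columns such as `startYear` are parsed as floats straight away instead of as strings.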

### Cleaning Akas DF

In [20]:
# double-checking region values in case "USA" or something similar occurs
akas['region'].sort_values().unique()

array(['AD', 'AE', 'AF', 'AG', 'AI', 'AL', 'AM', 'AN', 'AO', 'AQ', 'AR',
       'AS', 'AT', 'AU', 'AW', 'AZ', 'BA', 'BB', 'BD', 'BE', 'BF', 'BG',
       'BH', 'BI', 'BJ', 'BM', 'BN', 'BO', 'BR', 'BS', 'BT', 'BUMM', 'BW',
       'BY', 'BZ', 'CA', 'CC', 'CD', 'CF', 'CG', 'CH', 'CI', 'CK', 'CL',
       'CM', 'CN', 'CO', 'CR', 'CSHH', 'CSXX', 'CU', 'CV', 'CW', 'CY',
       'CZ', 'DDDE', 'DE', 'DJ', 'DK', 'DM', 'DO', 'DZ', 'EC', 'EE', 'EG',
       'EH', 'ER', 'ES', 'ET', 'FI', 'FJ', 'FM', 'FO', 'FR', 'GA', 'GB',
       'GD', 'GE', 'GF', 'GH', 'GI', 'GL', 'GM', 'GN', 'GP', 'GQ', 'GR',
       'GT', 'GU', 'GW', 'GY', 'HK', 'HN', 'HR', 'HT', 'HU', 'ID', 'IE',
       'IL', 'IM', 'IN', 'IQ', 'IR', 'IS', 'IT', 'JE', 'JM', 'JO', 'JP',
       'KE', 'KG', 'KH', 'KI', 'KM', 'KN', 'KP', 'KR', 'KW', 'KY', 'KZ',
       'LA', 'LB', 'LC', 'LI', 'LK', 'LR', 'LS', 'LT', 'LU', 'LV', 'LY',
       'MA', 'MC', 'MD', 'ME', 'MG', 'MH', 'MK', 'ML', 'MM', 'MN', 'MO',
       'MP', 'MQ', 'MR', 'MS', 'MT', 'MU', 'MV', 

In [25]:
# getting length of US films to compare against for confirmation
display(f"Number akas entries: {len(akas)}  \
        Number of aka US films: {len(akas[akas['region'] == 'US'])}")

'Number akas entries: 35307575 ,         Number of aka US films: 1423164'

In [26]:
akas = akas[akas['region'] == 'US'].copy()

# confirming changes
len(akas)

1423164

<mark><u>**Comment:**</u>

<font color='dodgerblue' size=4><i>
The row count after filtering (1,423,164) matches the number of US entries counted beforehand, so akas now contains only US-region rows.
</i></font>
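
Because the akas table has tens of millions of rows, the US filter could also be applied chunk by chunk while reading, so the full table never sits in memory. This is only a sketch under assumptions: the rows below are invented, `load_region` is a hypothetical helper, and the tiny `chunksize` is chosen so the toy data spans more than one chunk. It also passes `keep_default_na=False`, because pandas treats the string `NA` as missing by default, which would blank out the region code `NA`.

```python
import io
import pandas as pd

# Hypothetical stand-in rows for title.akas.tsv.gz (titles are invented).
sample_tsv = (
    "titleId\tordering\ttitle\tregion\n"
    "tt0000001\t1\tCarmencita\tDE\n"
    "tt0000001\t2\tCarmencita\tUS\n"
    "tt0000002\t1\tExample Title\tNA\n"
    "tt0000003\t1\tAnother Example\tUS\n"
)

def load_region(source, region="US", chunksize=2):
    """Read a TSV in chunks and keep only rows for one region code."""
    reader = pd.read_csv(
        source,
        sep="\t",
        na_values=["\\N"],      # only the IMDb marker counts as missing
        keep_default_na=False,  # keep the literal region code "NA" as a string
        chunksize=chunksize,
    )
    kept = [chunk[chunk["region"] == region] for chunk in reader]
    return pd.concat(kept, ignore_index=True)

us_akas = load_region(io.StringIO(sample_tsv))
na_akas = load_region(io.StringIO(sample_tsv), region="NA")
print(len(us_akas), len(na_akas))
```

For the real file a much larger `chunksize` would be appropriate; the value that works best depends on available memory.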

### Cleaning Basics DF


* [x] Eliminate movies that are null for runtimeMinutes
* [x] Eliminate movies that are null for genre
* [x] keep only titleType==Movie
* [x] keep startYear 2000-2022
* [x] Eliminate movies that include "Documentary" in genre (see tip below)
* [x] Keep only US movies

In [15]:
basics['titleType'].unique()

array(['short', 'movie', 'tvShort', 'tvMovie', 'tvSeries', 'tvEpisode',
       'tvMiniSeries', 'tvSpecial', 'video', 'videoGame', 'tvPilot'],
      dtype=object)

<mark><u>**Comment:**</u>

<font color='dodgerblue' size=4><i>
This dataset contains both 'movie' and 'tvMovie' title types. "Movies" could arguably include TV movies, but the LP instructions call for filtering on 'movie' only, so that is what I do below. I'm noting this in case the distinction matters to the stakeholder.
</i></font>
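
If the stakeholder later decides that TV movies should count, the title-type filter can be written so that the allowed types are a parameter. A minimal sketch on toy data follows; `keep_title_types` is a hypothetical helper name, not something used elsewhere in this notebook.

```python
import pandas as pd

# Toy rows using titleType values that appear in the real table
titles = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "titleType": ["movie", "tvMovie", "short", "movie"],
})

def keep_title_types(df, allowed=("movie",)):
    """Return only the rows whose titleType is in the allowed collection."""
    return df[df["titleType"].isin(allowed)].copy()

movies_only = keep_title_types(titles)
movies_and_tv = keep_title_types(titles, allowed=("movie", "tvMovie"))
print(len(movies_only), len(movies_and_tv))
```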

In [31]:
# 'startYear' is a string column, changing it to a float (to account for NaNs)
basics['startYear'] = basics['startYear'].astype(float)

# confirming
basics.dtypes

tconst             object
titleType          object
primaryTitle       object
originalTitle      object
isAdult            object
startYear         float64
endYear            object
runtimeMinutes     object
genres             object
dtype: object

In [None]:
# quick look at the genres column (comma-separated strings)
basics['genres'].head()

### Basics - filtering out NaNs

In [34]:
# filtering out nulls in genres and runtimeMinutes
basics = basics[(basics['genres'].notnull()) &
               (basics['runtimeMinutes'].notnull())
        ].copy()

In [35]:
# confirming changes
basics.isna().sum()

tconst                  0
titleType               0
primaryTitle            1
originalTitle           1
isAdult                 0
startYear          160253
endYear           2722977
runtimeMinutes          0
genres                  0
dtype: int64
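
The boolean-mask filter above has an equivalent spelling with `dropna(subset=...)`, which some readers find easier to scan. The sketch below uses toy data to show that the two give the same result and that nulls in other columns (such as `endYear`) are left alone, as in the counts above.

```python
import numpy as np
import pandas as pd

# Toy frame with nulls scattered across the two required columns
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "runtimeMinutes": ["90", np.nan, "120", np.nan],
    "genres": ["Drama", "Comedy", np.nan, np.nan],
    "endYear": [np.nan, np.nan, np.nan, np.nan],
})

# Style used in this notebook: combine notnull() masks
mask_version = toy[toy["genres"].notnull() & toy["runtimeMinutes"].notnull()]

# Equivalent: drop rows that are null in any of the listed columns
dropna_version = toy.dropna(subset=["genres", "runtimeMinutes"])

print(mask_version.equals(dropna_version))
```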

### Basics - filtering for other conditions
- type: movie
- Start Year: between 2000 and 2022
- Genre: No Documentaries
- Region: Only US Movies

In [38]:
basics = basics[(basics['titleType'] == 'movie') &
               (basics['startYear'].between(2000, 2022, inclusive='both')) &
               (~basics['genres'].str.contains('documentary', case=False)) &
               (basics['tconst'].isin(akas['titleId']))
        ].copy()

In [42]:
# confirming changes
for i in ['titleType', 'startYear']:
    print(i.upper(), '\n', sorted(basics[i].unique()))

TITLETYPE 
 ['movie']
STARTYEAR 
 [2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0, 2006.0, 2007.0, 2008.0, 2009.0, 2010.0, 2011.0, 2012.0, 2013.0, 2014.0, 2015.0, 2016.0, 2017.0, 2018.0, 2019.0, 2020.0, 2021.0, 2022.0]


In [43]:
# confirming changes
any(basics['genres'].str.contains('documentary', case=False))

False
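
The documentary filter relies on a case-insensitive substring match against the comma-separated `genres` string. A stricter alternative is to split the string and test for the exact genre token. The sketch below compares the two on invented rows; on this toy data they agree, and whether they agree on the full table depends on the genre vocabulary, which I have not checked exhaustively.

```python
import pandas as pd

# Toy genres values in the same comma-separated format as title.basics
genres = pd.Series([
    "Documentary,Short",
    "Comedy,Drama",
    "Biography,Documentary",
    "Action",
])

# Substring match, as used in the filter above
substring_mask = genres.str.contains("documentary", case=False)

# Exact-token match: split on commas and look for the whole genre name
token_mask = genres.str.split(",").apply(lambda parts: "Documentary" in parts)

non_documentaries = genres[~token_mask]
print(non_documentaries.tolist())
```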

In [47]:
# confirming changes
display(all(basics['tconst'].isin(akas['titleId'])))

# confirming only US movies in the basics df
display(basics.merge(akas, left_on='tconst', right_on='titleId')['region'].unique())

True

array(['US'], dtype=object)

<mark><u>**Comment:**</u>

<font color='dodgerblue' size=4><i>
All four filters (title type, start year, genre, and region) were applied successfully.
</i></font>

### Ratings - filtering out Non US movies

In [48]:
ratings = ratings[ratings['tconst'].isin(akas['titleId'])].copy()

In [49]:
# confirming changes
ratings.merge(akas, left_on='tconst', right_on='titleId')['region'].unique()

array(['US'], dtype=object)
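
A note on why `isin` is used for the filter and `merge` only for the check: akas can hold several US rows for the same title, so a merge repeats rating rows, while `isin` keeps each rating row at most once. The following toy example (invented values) illustrates the difference.

```python
import pandas as pd

ratings_toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "averageRating": [5.7, 6.5, 7.1],
    "numVotes": [10, 20, 30],
})

# tt1 has two US alternate titles; tt2 has none
akas_toy = pd.DataFrame({
    "titleId": ["tt1", "tt1", "tt3"],
    "region": ["US", "US", "US"],
})

# Membership filter: one row per rated title that has a US entry
filtered = ratings_toy[ratings_toy["tconst"].isin(akas_toy["titleId"])].copy()

# Inner merge: tt1 appears twice because it matches two akas rows
merged = ratings_toy.merge(akas_toy, left_on="tconst", right_on="titleId")

print(len(filtered), len(merged))
```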

## Summary of cleaned dfs

### Basics

In [53]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 86039 entries, 34803 to 9703217
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          86039 non-null  object 
 1   titleType       86039 non-null  object 
 2   primaryTitle    86039 non-null  object 
 3   originalTitle   86039 non-null  object 
 4   isAdult         86039 non-null  object 
 5   startYear       86039 non-null  float64
 6   endYear         0 non-null      object 
 7   runtimeMinutes  86039 non-null  object 
 8   genres          86039 non-null  object 
dtypes: float64(1), object(8)
memory usage: 6.6+ MB
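
The summary above shows `isAdult` and `runtimeMinutes` still stored as `object` and `startYear` as `float64`. If numeric dtypes are wanted for later analysis, one option is pandas' nullable `Int64` dtype. This is an optional sketch on toy data, not a step the assignment asks for; `errors="coerce"` is an assumption that unparseable values should become missing rather than raise.

```python
import pandas as pd

# Toy rows with the same dtypes the cleaned basics table ends up with
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2"],
    "isAdult": ["0", "1"],
    "startYear": [2001.0, 2019.0],
    "runtimeMinutes": ["90", "118"],
})

typed = toy.assign(
    # strings -> numbers; anything unparseable becomes missing
    isAdult=pd.to_numeric(toy["isAdult"], errors="coerce").astype("Int64"),
    runtimeMinutes=pd.to_numeric(toy["runtimeMinutes"], errors="coerce").astype("Int64"),
    # whole-number floats -> nullable integers
    startYear=toy["startYear"].astype("Int64"),
)

print(typed.dtypes)
```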


### Akas

In [54]:
akas.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1423164 entries, 5 to 35307319
Data columns (total 8 columns):
 #   Column           Non-Null Count    Dtype 
---  ------           --------------    ----- 
 0   titleId          1423164 non-null  object
 1   ordering         1423164 non-null  int64 
 2   title            1423164 non-null  object
 3   region           1423164 non-null  object
 4   language         3859 non-null     object
 5   types            976026 non-null   object
 6   attributes       46220 non-null    object
 7   isOriginalTitle  1421819 non-null  object
dtypes: int64(1), object(7)
memory usage: 97.7+ MB


### Ratings

In [55]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 492764 entries, 0 to 1291943
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   tconst         492764 non-null  object 
 1   averageRating  492764 non-null  float64
 2   numVotes       492764 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 15.0+ MB


## Exporting Data

In [57]:
# creating folder in directory
os.makedirs('Data/',exist_ok=True) 
# confirming folder creation
os.listdir("Data/")

[]

In [58]:
# saving cleaned dfs to csv
basics.to_csv("Data/title_basics.csv.gz", compression='gzip', index=False)
akas.to_csv("Data/title_akas.csv.gz", compression='gzip', index=False)
ratings.to_csv("Data/title_ratings.csv.gz", compression='gzip', index=False)

In [59]:
# creating a new df to confirm export
basics_new = pd.read_csv("Data/title_basics.csv.gz", low_memory=False)

In [61]:
display(basics.shape, basics_new.shape)

(86039, 9)

(86039, 9)

<mark><u>**Comment:**</u>

<font color='dodgerblue' size=4><i>
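The basics file reloads with the same shape as the original dataframe (86039, 9), so the export appears to have worked. The akas and ratings files were written the same way but were not re-read here.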
</i></font>
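
The same shape check could be run for every exported file in one loop. The sketch below is a possible way to do that, shown on toy dataframes written to a temporary folder so that it does not touch "Data/"; `export_and_verify` is a hypothetical helper name.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def export_and_verify(frames, folder):
    """Write each dataframe to <folder>/<name>.csv.gz and compare shapes on reload."""
    os.makedirs(folder, exist_ok=True)
    results = {}
    for name, df in frames.items():
        path = os.path.join(folder, f"{name}.csv.gz")
        df.to_csv(path, compression="gzip", index=False)
        reloaded = pd.read_csv(path, low_memory=False)
        results[name] = reloaded.shape == df.shape
    return results

# Toy stand-ins for the three cleaned tables
toy_frames = {
    "title_basics": pd.DataFrame({"tconst": ["tt1", "tt2"], "endYear": [np.nan, np.nan]}),
    "title_akas": pd.DataFrame({"titleId": ["tt1"], "region": ["US"]}),
    "title_ratings": pd.DataFrame({"tconst": ["tt1"], "averageRating": [5.7]}),
}

with tempfile.TemporaryDirectory() as tmp_folder:
    checks = export_and_verify(toy_frames, tmp_folder)

print(checks)
```

A shape match confirms only row and column counts; comparing values or dtypes would need an extra step, since dtypes can change on a CSV round trip.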