# Project 3 Code
- 09.19.2023
- Data Enrichment

## Data Dictionary

**title.akas.tsv.gz**
- titleId (string) - a tconst, an alphanumeric unique identifier of the title
- ordering (integer) - a number to uniquely identify rows for a given titleId
- title (string) - the localized title
- region (string) - the region for this version of the title
- language (string) - the language of the title
- types (array) - enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning.
- attributes (array) - additional terms to describe this alternative title, not enumerated
- isOriginalTitle (boolean) - 0: not original title; 1: original title

**title.basics.tsv.gz**
- tconst (string) - alphanumeric unique identifier of the title
- titleType (string) - the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc.)
- primaryTitle (string) - the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) - original title, in the original language
- isAdult (boolean) - 0: non-adult title; 1: adult title
- startYear (YYYY) - represents the release year of a title. In the case of TV Series, it is the series start year
- endYear (YYYY) - TV Series end year. '\N' for all other title types
- runtimeMinutes - primary runtime of the title, in minutes
- genres (string array) - includes up to three genres associated with the title

**title.ratings.tsv.gz**
- tconst (string) - alphanumeric unique identifier of the title
- averageRating - weighted average of all the individual user ratings
- numVotes - number of votes the title has received
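
The dictionary above implies that the three files share one identifier: `tconst` in basics and ratings, `titleId` in akas. The following is a minimal sketch of how they relate, using made-up toy rows rather than the real IMDb data; the frame and column names `basics_ratings`, `has_us_aka`, and `genres_list` are introduced here only for illustration.

```python
import pandas as pd

# Toy rows shaped like the three files described above (values are made up).
basics = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "primaryTitle": ["Example A", "Example B"],
    "genres": ["Comedy,Romance", "Drama"],
})
ratings = pd.DataFrame({
    "tconst": ["tt0000001"],
    "averageRating": [6.4],
    "numVotes": [100],
})
akas = pd.DataFrame({
    "titleId": ["tt0000001", "tt0000001", "tt0000002"],
    "region": ["US", "GB", "FR"],
})

# basics and ratings share the tconst key directly.
basics_ratings = basics.merge(ratings, on="tconst", how="left")

# akas stores the same identifier under the name titleId.
us_ids = akas.loc[akas["region"] == "US", "titleId"]
basics_ratings["has_us_aka"] = basics_ratings["tconst"].isin(us_ids)

# genres arrives as one comma-separated string; split it into a list.
basics_ratings["genres_list"] = basics_ratings["genres"].str.split(",")

print(basics_ratings)
```

A left merge keeps titles that have no rating yet, which show up with `NaN` in `averageRating` and `numVotes`.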

In [31]:
# Imports
import pandas as pd
import numpy as np
import seaborn as sns

In [32]:
# Load dataset URLs
# akas_url = "https://datasets.imdbws.com/title.akas.tsv.gz"
# basics_url = "https://datasets.imdbws.com/title.basics.tsv.gz"
# ratings_url = "https://datasets.imdbws.com/title.ratings.tsv.gz"

In [33]:
# Read in the datasets
# akas = pd.read_csv(akas_url, sep='\t', low_memory=False)
# basics = pd.read_csv(basics_url, sep='\t', low_memory=False)
# ratings = pd.read_csv(ratings_url, sep='\t', low_memory=False)
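
As an alternative to replacing `\N` after loading, `pd.read_csv` can treat it as missing while parsing through its `na_values` parameter. The sketch below assumes a small in-memory TSV (`raw_tsv`, a made-up stand-in for the real files) so it runs without downloading anything.

```python
import io
import pandas as pd

# Made-up stand-in for a raw IMDb TSV: tab-separated, with \N marking nulls.
raw_tsv = (
    "tconst\tstartYear\tendYear\tgenres\n"
    "tt0000001\t2001\t\\N\tComedy\n"
    "tt0000002\t\\N\t\\N\t\\N\n"
)

# na_values tells pandas to parse the literal string \N as NaN.
sample = pd.read_csv(io.StringIO(raw_tsv), sep="\t", na_values="\\N")

print(sample)
print(sample.dtypes)
```

With the nulls handled at read time, numeric columns such as `startYear` can be parsed as floats directly instead of as strings.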

In [34]:
# Open saved akas file and preview again
akas = pd.read_csv("Data/title_akas.csv.gz", low_memory = False)
akas.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,6,Carmencita,US,,imdbDisplay,,0.0
1,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0.0
2,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0.0
3,tt0000005,1,Blacksmithing Scene,US,,alternative,,0.0
4,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0.0


In [35]:
# Open saved basics file and preview again
basics = pd.read_csv("Data/title_basics.csv.gz", low_memory = False)
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama
2,tt0068865,movie,Lives of Performers,Lives of Performers,0,2016.0,,90,Drama
3,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,,122,Drama
4,tt0082328,movie,Embodiment of Evil,Encarnação do Demônio,0,2008.0,,94,Horror


In [36]:
# Open saved ratings file and preview again
ratings = pd.read_csv("Data/title_ratings.csv.gz", low_memory = False)
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0035423,6.4,87548
1,tt0062336,6.4,180
2,tt0068865,5.4,74
3,tt0069049,6.7,7827
4,tt0082328,5.9,1749


In [40]:
# Keep only rows for the US region in the akas dataframe
akas_us = akas[akas['region'] == 'US']
akas_us.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,6,Carmencita,US,,imdbDisplay,,0.0
1,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0.0
2,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0.0
3,tt0000005,1,Blacksmithing Scene,US,,alternative,,0.0
4,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0.0


In [41]:
# Make a copy of akas_us
akas_us_copy = akas_us.copy()

In [42]:
# Replace "\N" with np.nan and verify that it stuck
akas_us_copy.replace({'\\N':np.nan}, inplace=True)
akas_us_copy.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,6,Carmencita,US,,imdbDisplay,,0.0
1,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0.0
2,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0.0
3,tt0000005,1,Blacksmithing Scene,US,,alternative,,0.0
4,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0.0


In [43]:
# Keep only titles in the basics dataframe that have a US entry in akas
basics_keepers = basics['tconst'].isin(akas_us_copy['titleId'])
basics_filtered = basics[basics_keepers]
basics_filtered

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama
2,tt0068865,movie,Lives of Performers,Lives of Performers,0,2016.0,,90,Drama
3,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,,122,Drama
4,tt0082328,movie,Embodiment of Evil,Encarnação do Demônio,0,2008.0,,94,Horror
...,...,...,...,...,...,...,...,...,...
114687,tt9915436,movie,Vida em Movimento,Vida em Movimento,0,2019.0,,70,Documentary
114688,tt9915872,movie,The Last White Witch,Boku no kanojo wa mahoutsukai,0,2019.0,,97,"Comedy,Drama,Fantasy"
114689,tt9916170,movie,The Rehearsal,O Ensaio,0,2019.0,,51,Drama
114690,tt9916190,movie,Safeguard,Safeguard,0,2020.0,,95,"Action,Adventure,Thriller"
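
The `isin` filter in the cell above can be illustrated on a toy example. This is a sketch with made-up IDs, not the real data. It shows that a title with several US akas rows still appears only once in the filtered result, because `isin` builds a boolean mask over basics rather than joining the two frames.

```python
import pandas as pd

# Made-up IDs: tt1 has two US akas rows, tt2 has none, tt3 has one.
akas_us = pd.DataFrame({
    "titleId": ["tt1", "tt1", "tt3"],
    "region": ["US", "US", "US"],
})
basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "titleType": ["movie", "movie", "short"],
})

# One True/False per basics row: does its tconst occur anywhere in akas_us?
keepers = basics["tconst"].isin(akas_us["titleId"])
basics_filtered = basics[keepers]

print(basics_filtered)
```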


In [44]:
# Make a copy of basics_filtered
basics_filtered_copy = basics_filtered.copy()

In [45]:
# Replace "\N" with np.nan and verify that it stuck
basics_filtered_copy.replace({'\\N':np.nan}, inplace=True)
basics_filtered_copy.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama
2,tt0068865,movie,Lives of Performers,Lives of Performers,0,2016.0,,90,Drama
3,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,,122,Drama
4,tt0082328,movie,Embodiment of Evil,Encarnação do Demônio,0,2008.0,,94,Horror


In [46]:
# Drop titles that are null for runtimeMinutes or for genres
basics_filtered_copy = basics_filtered_copy.dropna(subset=['runtimeMinutes', 'genres'])
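
A small sketch of what `dropna(subset=...)` does, using a made-up frame (`toy`): a row is dropped if either listed column is null, while nulls in other columns such as `endYear` are ignored.

```python
import numpy as np
import pandas as pd

# Made-up rows covering each combination of missing values.
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "runtimeMinutes": [90, np.nan, 120, np.nan],
    "genres": ["Drama", "Comedy", np.nan, np.nan],
    "endYear": [np.nan, np.nan, np.nan, np.nan],
})

# Only runtimeMinutes and genres are checked; endYear being all-null is fine.
cleaned = toy.dropna(subset=["runtimeMinutes", "genres"])

print(cleaned)
```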

In [47]:
# Keep only rows where titleType == 'movie'
basics_filtered_copy = basics_filtered_copy[basics_filtered_copy['titleType'] == 'movie']

In [48]:
# Convert the startYear column to float data type.
basics_filtered_copy['startYear'] = basics_filtered_copy['startYear'].astype(float)
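
`astype(float)` works here because the `\N` placeholders were already replaced with `np.nan`. If a column might still contain stray non-numeric strings, `pd.to_numeric` with `errors="coerce"` is one possible alternative. The sketch below uses a made-up series (`raw_years`) to show the difference.

```python
import pandas as pd

# Made-up year column that still contains an unreplaced \N placeholder.
raw_years = pd.Series(["2001", "\\N", "2015"])

# errors="coerce" turns anything unparseable into NaN instead of raising.
years = pd.to_numeric(raw_years, errors="coerce")

print(years)
```

Calling `raw_years.astype(float)` on the same series would raise a `ValueError`, because `"\\N"` cannot be parsed as a number.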

In [49]:
# Filter the dataframe on startYear, keeping 2000 through 2021 inclusive
in_range = (basics_filtered_copy['startYear'] >= 2000) & (basics_filtered_copy['startYear'] <= 2021)
basics_filtered_copy = basics_filtered_copy[in_range]
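
The inclusive year filter can also be written with `Series.between`, which includes both endpoints by default. The sketch below uses a made-up frame (`years`) with values on either side of each boundary to check that the two forms select the same rows.

```python
import pandas as pd

# Made-up years just outside and exactly on the 2000 and 2021 boundaries.
years = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "startYear": [1999.0, 2000.0, 2021.0, 2022.0],
})

# Explicit comparison, as in the notebook cell.
mask_compare = (years["startYear"] >= 2000) & (years["startYear"] <= 2021)

# Equivalent shorthand; between() is inclusive on both ends by default.
mask_between = years["startYear"].between(2000, 2021)

kept = years[mask_between]
print(kept)
```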

In [50]:
# Filter the ratings DataFrame based on tconst from basics_filtered_copy
ratings_filtered = ratings[ratings['tconst'].isin(basics_filtered_copy['tconst'])]

In [51]:
# Make a copy of ratings_filtered
ratings_filtered_copy = ratings_filtered.copy()

In [52]:
# Replace "\N" with np.nan in the ratings DataFrame
ratings_filtered_copy.replace({'\\N':np.nan}, inplace=True)
ratings_filtered_copy.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0035423,6.4,87548
1,tt0062336,6.4,180
2,tt0068865,5.4,74
3,tt0069049,6.7,7827
4,tt0082328,5.9,1749


In [59]:
# Run a final .info() on akas_us_copy dataframe
akas_us_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1466561 entries, 0 to 1466560
Data columns (total 8 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   titleId          1466561 non-null  object 
 1   ordering         1466561 non-null  int64  
 2   title            1466561 non-null  object 
 3   region           1466561 non-null  object 
 4   language         4135 non-null     object 
 5   types            983910 non-null   object 
 6   attributes       47502 non-null    object 
 7   isOriginalTitle  1465219 non-null  float64
dtypes: float64(1), int64(1), object(6)
memory usage: 100.7+ MB


In [60]:
# Run a final .info() on basics_filtered_copy dataframe
basics_filtered_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 114692 entries, 0 to 114691
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   tconst          114692 non-null  object 
 1   titleType       114692 non-null  object 
 2   primaryTitle    114692 non-null  object 
 3   originalTitle   114692 non-null  object 
 4   isAdult         114692 non-null  int64  
 5   startYear       114692 non-null  float64
 6   endYear         0 non-null       float64
 7   runtimeMinutes  114692 non-null  int64  
 8   genres          114692 non-null  object 
dtypes: float64(2), int64(2), object(5)
memory usage: 8.8+ MB


In [61]:
# Run a final .info() on ratings_filtered_copy dataframe
ratings_filtered_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 87718 entries, 0 to 87717
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   tconst         87718 non-null  object 
 1   averageRating  87718 non-null  float64
 2   numVotes       87718 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 2.7+ MB


In [62]:
# Save akas_us_copy dataframe to csv file
akas_us_copy.to_csv("Data/title_akas.csv.gz", compression='gzip', index=False)

In [63]:
# Save basics_filtered_copy dataframe to csv file
basics_filtered_copy.to_csv("Data/title_basics.csv.gz", compression='gzip', index=False)

In [64]:
# Save ratings_filtered_copy dataframe to csv file
ratings_filtered_copy.to_csv("Data/title_ratings.csv.gz", compression='gzip', index=False)
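
The `index=False` argument in the save cells keeps the row index out of the file. The sketch below writes a made-up frame (`toy`) to a temporary directory both ways and reads it back; the file names are hypothetical and nothing under `Data/` is touched.

```python
import os
import tempfile
import pandas as pd

# Made-up frame standing in for one of the cleaned dataframes.
toy = pd.DataFrame({"tconst": ["tt1", "tt2"], "averageRating": [6.4, 5.9]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, "with_index.csv.gz")
    without_index = os.path.join(tmp_dir, "without_index.csv.gz")

    # The default writes the row index as an extra, unnamed first column.
    toy.to_csv(with_index, compression="gzip")
    # index=False writes only the data columns.
    toy.to_csv(without_index, compression="gzip", index=False)

    # read_csv infers gzip compression from the .gz extension.
    cols_with_index = pd.read_csv(with_index).columns.tolist()
    cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

The file written with its index comes back with an extra `Unnamed: 0` column, while the one written with `index=False` round-trips with the original columns only.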