# Data Importation and Basic Cleaning

This notebook downloads several files from IMDb's public datasets and filters them down to the subset of movies requested by the stakeholder. It also performs basic data cleaning, such as checking for duplicates and missing values.

The following files are to be downloaded and filtered:
* [Akas](https://datasets.imdbws.com/title.akas.tsv.gz)
* [Ratings](https://datasets.imdbws.com/title.ratings.tsv.gz)
* [Basics](https://datasets.imdbws.com/title.basics.tsv.gz)
* [Crew](https://datasets.imdbws.com/title.crew.tsv.gz)
* [Principals](https://datasets.imdbws.com/title.principals.tsv.gz)
* [Names](https://datasets.imdbws.com/name.basics.tsv.gz)

The data dictionary can be found [here](https://www.imdb.com/interfaces/).

The datasets themselves can be downloaded [here](https://datasets.imdbws.com/).


## Library Importation, Folder Creation, and Function Implementation

This section imports the required libraries (such as pandas) and creates the folder that will hold the output files.

In [1]:
#Importing numpy and pandas for basic data manipulation
import numpy as np
import pandas as pd

#Importing os to connect with operating system
import os

In [2]:
#Setting pandas options to max column and row displays
pd.set_option('display.max_columns', None) #Used for displaying columns
pd.set_option('display.max_rows', None) #Used for displaying rows

In [3]:
#Making data folder if one does not already exist
FOLDER = "Data/"
os.makedirs(FOLDER, exist_ok=True)
os.listdir(FOLDER)

[]

## Creating and Cleaning Databases

The following filters define the subset of movies kept across all datasets:

* Exclude any movie with missing values for genre or runtime
* Include only full-length movies (titleType = "movie")
* Include only fictional movies (exclude the Documentary genre)
* Include only movies released from 2000 through 2021, inclusive
* Include only movies that were released in the United States
* Replace all "\N" values with np.nan
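The filters listed above can be sketched on a small hand-made table. This is only an illustration: the column names follow the IMDb schema used later in this notebook, but the rows and the names `toy`, `keep`, and `toy_filtered` are made up for the example, and the region filter is left out because it depends on the akas file.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only; this is not real IMDb data.
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5"],
    "titleType": ["movie", "short", "movie", "movie", "movie"],
    "startYear": ["2005", "2010", "1999", "2021", "2015"],
    "runtimeMinutes": ["95", "12", "110", "\\N", "88"],
    "genres": ["Drama", "Comedy", "Action", "Drama", "Documentary,History"],
})

# Replace the IMDb missing-value marker with NaN
toy = toy.replace({"\\N": np.nan})
toy["startYear"] = pd.to_numeric(toy["startYear"])

# Combine the row-level filters into one boolean mask
keep = (
    (toy["titleType"] == "movie")
    & toy["startYear"].between(2000, 2021)  # inclusive on both ends
    & ~toy["genres"].str.contains("Documentary", na=False)
)

# Drop rows that still lack a runtime or genre
toy_filtered = toy[keep].dropna(subset=["runtimeMinutes", "genres"])
print(toy_filtered["tconst"].tolist())
```

Only `tt1` passes every filter here: `tt2` is a short, `tt3` predates 2000, `tt4` has no runtime, and `tt5` is a documentary.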

### Creating and Cleaning Akas

akas_url: https://datasets.imdbws.com/title.akas.tsv.gz

In [4]:
akas_url = "https://datasets.imdbws.com/title.akas.tsv.gz"

akas_df = pd.read_csv(akas_url, sep = "\t", low_memory = False)
akas_df.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0


In [5]:
#Checking which region code denotes the United States
#i.e. is the code "US" or "USA"?
akas_df["region"].value_counts()

DE      4301742
FR      4297613
JP      4296175
IN      4237269
ES      4217497
IT      4198332
PT      4128642
\N      1890319
US      1437292
GB       449125
CA       227936
XWW      173059
AU       137103
BR       117847
RU        96083
MX        94268
GR        92369
PL        88217
FI        87060
SE        76618
HU        74626
NL        62968
AR        59999
PH        58019
NO        56666
DK        55201
TR        52720
XWG       51711
SUHH      37671
HK        34693
BE        33450
KR        32570
TW        31974
CN        31300
ZA        31300
SG        30064
AT        29817
RO        28374
BG        27822
UA        26020
RS        22716
CZ        20418
IL        20005
ID        18999
IE        16373
AE        16338
XYU       15780
EG        14873
HR        14228
IR        13644
CH        13235
VE        13145
NZ        12882
TH        11949
LT        11580
VN        11489
CL        10341
CSHH       9868
DDDE       9645
EC         9537
SI         9449
CO         8933
NG      

<center>The region code for the USA is "US".</center>

In [6]:
#Filtering out non-US regions
akas_filter = akas_df["region"] == "US"

akas_df = akas_df[akas_filter]

akas_df["region"].value_counts()

US    1437292
Name: region, dtype: int64

In [7]:
#Replacing all \N values with NaN
akas_df = akas_df.replace({"\\N":np.nan})

akas_df.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
5,tt0000001,6,Carmencita,US,,imdbDisplay,,0
14,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0
33,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0
36,tt0000005,1,Blacksmithing Scene,US,,alternative,,0
41,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0
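As a possible alternative to replacing "\N" after loading, `pd.read_csv` can treat the marker as missing while parsing, through its `na_values` parameter. The sketch below uses a made-up two-row sample held in memory (`sample_tsv` and `sample_df` are hypothetical names) rather than the real download. Note that column dtypes may differ from the replace-afterwards approach, since numeric columns are then parsed directly.

```python
import io

import pandas as pd

# Hypothetical sample in the same tab-separated layout as the IMDb files.
sample_tsv = (
    "titleId\tordering\ttitle\tregion\tlanguage\n"
    "tt0000001\t1\tCarmencita\tUS\t\\N\n"
    "tt0000001\t2\tCarmencita\tDE\t\\N\n"
)

# na_values tells the parser to read the literal string \N as NaN
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")
print(sample_df.isna().sum())
```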


In [8]:
#Checking for duplicated values
akas_df.duplicated().sum()

0

In [9]:
#Preliminary check for missing values
akas_df.isna().sum()

titleId                  0
ordering                 0
title                    0
region                   0
language           1433364
types               458387
attributes         1390707
isOriginalTitle       1345
dtype: int64

In [10]:
#Preliminary check for missing values by %
akas_df.isna().sum()/len(akas_df) * 100

titleId             0.000000
ordering            0.000000
title               0.000000
region              0.000000
language           99.726708
types              31.892406
attributes         96.758835
isOriginalTitle     0.093579
dtype: float64
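The percentage check above divides the count of missing values by the number of rows. An equivalent shorthand, shown here on a made-up frame (`demo` is a hypothetical name), is to take the mean of the boolean mask, since the mean of True/False values is the fraction that are True.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "a" is half missing, column "b" a quarter missing.
demo = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan],
    "b": ["x", "y", None, "z"],
})

pct_missing_sum = demo.isna().sum() / len(demo) * 100  # form used in this notebook
pct_missing_mean = demo.isna().mean() * 100            # equivalent shorthand
print(pct_missing_mean)
```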

### Creating and Cleaning Ratings

ratings_url: https://datasets.imdbws.com/title.ratings.tsv.gz

In [11]:
ratings_url = "https://datasets.imdbws.com/title.ratings.tsv.gz"

ratings_df = pd.read_csv(ratings_url, sep = "\t", low_memory = False)
ratings_df.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1972
1,tt0000002,5.8,264
2,tt0000003,6.5,1819
3,tt0000004,5.6,178
4,tt0000005,6.2,2615


In [12]:
#Filtering out non-US ratings
ratings_in_US_filter = ratings_df["tconst"].isin(akas_df["titleId"])

ratings_df = ratings_df[ratings_in_US_filter]
ratings_df.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1972
1,tt0000002,5.8,264
4,tt0000005,6.2,2615
5,tt0000006,5.1,181
6,tt0000007,5.4,818
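The `isin` filter above keeps a ratings row whenever its `tconst` appears at least once among the US akas rows. A minimal sketch with made-up ids (`us_titles`, `ratings`, and `ratings_us` are hypothetical names) shows that repeated ids on the akas side do not duplicate ratings rows, which is one reason a membership test may be preferable to a merge for this step.

```python
import pandas as pd

# Hypothetical ids; akas can list the same title more than once.
us_titles = pd.DataFrame({"titleId": ["tt1", "tt1", "tt3"]})
ratings = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "averageRating": [5.7, 5.8, 6.2],
})

# True where the rating's title appears anywhere in the US list
in_us = ratings["tconst"].isin(us_titles["titleId"])
ratings_us = ratings[in_us]
print(ratings_us)
```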


In [13]:
#Replacing all \N values with NaN
ratings_df = ratings_df.replace({"\\N":np.nan})

ratings_df.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1972
1,tt0000002,5.8,264
4,tt0000005,6.2,2615
5,tt0000006,5.1,181
6,tt0000007,5.4,818


In [14]:
#Checking for duplicated values
ratings_df.duplicated().sum()

0

In [15]:
#Preliminary check for missing values
ratings_df.isna().sum()

tconst           0
averageRating    0
numVotes         0
dtype: int64

### Creating and Cleaning Basics

basics_url: https://datasets.imdbws.com/title.basics.tsv.gz

In [16]:
basics_url = "https://datasets.imdbws.com/title.basics.tsv.gz"

basics_df = pd.read_csv(basics_url, sep = "\t", low_memory = False)
basics_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


In [17]:
#Filtering out non-US movies
movies_in_US_filter = basics_df["tconst"].isin(akas_df["titleId"])

basics_df = basics_df[movies_in_US_filter]
basics_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,\N,1,Short
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,\N,1,"Short,Sport"


In [18]:
# Replacing all \N values with NaN
basics_df = basics_df.replace({"\\N": np.nan})

basics_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,,5,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,,1,"Comedy,Short"
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,,1,Short
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,,1,"Short,Sport"


In [19]:
#Filtering out non-movies
isMovie = basics_df["titleType"] == "movie"
basics_df = basics_df[isMovie]

basics_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,,45.0,Romance
144,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0,1897,,100.0,"Documentary,News,Sport"
570,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906,,70.0,"Action,Adventure,Biography"
587,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1907,,90.0,Drama
625,tt0000630,movie,Hamlet,Amleto,0,1908,,,Drama


In [20]:
#Checking type of "genres"
basics_df["genres"].info()

<class 'pandas.core.series.Series'>
Int64Index: 296731 entries, 8 to 9854244
Series name: genres
Non-Null Count   Dtype 
--------------   ----- 
286069 non-null  object
dtypes: object(1)
memory usage: 4.5+ MB


In [21]:
#Filtering out documentaries
is_documentary = basics_df["genres"].str.contains("documentary", na = False)
is_Documentary = basics_df["genres"].str.contains("Documentary", na = False)
basics_df = basics_df[~is_documentary & ~is_Documentary]

basics_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,,45.0,Romance
570,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906,,70.0,"Action,Adventure,Biography"
587,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1907,,90.0,Drama
625,tt0000630,movie,Hamlet,Amleto,0,1908,,,Drama
672,tt0000679,movie,The Fairylogue and Radio-Plays,The Fairylogue and Radio-Plays,0,1908,,120.0,"Adventure,Fantasy"
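The two masks above cover the lowercase and capitalized spellings separately. As a sketch of an alternative, `str.contains` accepts `case=False`, so a single mask can cover both. The series below is made up for illustration; `na=False` makes rows with a missing genre count as "not a documentary" so that the mask stays purely boolean.

```python
import numpy as np
import pandas as pd

# Hypothetical genre strings, including a missing value.
genres = pd.Series(["Drama", "Documentary,History", np.nan, "documentary"])

# One case-insensitive mask instead of two case-specific ones
is_doc = genres.str.contains("documentary", case=False, na=False)
kept = genres[~is_doc]
print(kept)
```

Rows with a missing genre survive this step, as they do in the notebook's own filter; they are dropped later when missing genres are removed.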


In [22]:
#Converting startYear to float (a NumPy int column cannot hold NaN)
basics_df["startYear"] = basics_df["startYear"].astype(float)

#Filters for movies from 2000-2021 inclusive
is_2000_or_later = basics_df["startYear"] >= 2000
is_2021_or_earlier = basics_df["startYear"] <= 2021

basics_df = basics_df[is_2000_or_later & is_2021_or_earlier]

#Checking to make sure filters work
basics_df.describe()

Unnamed: 0,startYear
count,97315.0
mean,2013.114895
std,5.730841
min,2000.0
25%,2009.0
50%,2014.0
75%,2018.0
max,2021.0
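The year column is cast to float because missing years cannot be held in a plain integer column. If whole-number years are preferred, pandas' nullable `Int64` dtype is one option. The sketch below uses made-up values (`years`, `years_int`, and `in_range` are hypothetical names) and is not what the notebook itself does.

```python
import numpy as np
import pandas as pd

# Hypothetical year strings with one missing value.
years = pd.Series(["1999", "2000", np.nan, "2021", "2022"], dtype=object)

years_float = years.astype(float)                  # approach used in this notebook
years_int = pd.to_numeric(years).astype("Int64")   # nullable integer alternative

# between() is inclusive on both ends by default; missing years become False
in_range = years_int.between(2000, 2021).fillna(False)
print(in_range.tolist())
```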


In [23]:
#Converting runtimeMinutes to float
basics_df["runtimeMinutes"] = basics_df["runtimeMinutes"].astype(float)

#Removing NA values in runtimeMinutes, genres
basics_df = basics_df.dropna(subset = ["runtimeMinutes", "genres"])

#Preliminary check for missing values
basics_df.isna().sum()

tconst                0
titleType             0
primaryTitle          0
originalTitle         0
isAdult               0
startYear             0
endYear           81738
runtimeMinutes        0
genres                0
dtype: int64

In [24]:
#Checking for duplicates
basics_df.duplicated().sum()

0

### Creating and Cleaning Crew

crew_url: https://datasets.imdbws.com/title.crew.tsv.gz

In [25]:
crew_url = "https://datasets.imdbws.com/title.crew.tsv.gz"

crew_df = pd.read_csv(crew_url, sep = "\t", low_memory = False)
crew_df.head()

Unnamed: 0,tconst,directors,writers
0,tt0000001,nm0005690,\N
1,tt0000002,nm0721526,\N
2,tt0000003,nm0721526,\N
3,tt0000004,nm0721526,\N
4,tt0000005,nm0005690,\N


In [26]:
#Keeping only crew rows for titles that remain in basics
movies_in_basics_filter = crew_df["tconst"].isin(basics_df["tconst"])

crew_df = crew_df[movies_in_basics_filter]
crew_df.head()

Unnamed: 0,tconst,directors,writers
34803,tt0035423,nm0003506,"nm0737216,nm0003506"
61116,tt0062336,"nm0749914,nm0765384","nm0749914,nm1146177"
67669,tt0069049,nm0000080,"nm0000080,nm0462648"
86801,tt0088751,"nm0628399,nm0078540",nm0628399
93938,tt0096056,nm0324875,"nm0234502,nm0324875"


In [27]:
# Replacing all \N values with NaN
crew_df = crew_df.replace({"\\N": np.nan})

crew_df.head()

Unnamed: 0,tconst,directors,writers
34803,tt0035423,nm0003506,"nm0737216,nm0003506"
61116,tt0062336,"nm0749914,nm0765384","nm0749914,nm1146177"
67669,tt0069049,nm0000080,"nm0000080,nm0462648"
86801,tt0088751,"nm0628399,nm0078540",nm0628399
93938,tt0096056,nm0324875,"nm0234502,nm0324875"


In [28]:
#Checking for duplicates
crew_df.duplicated().sum()

0

In [29]:
#Preliminary check for missing values
crew_df.isna().sum()

tconst          0
directors     887
writers      4455
dtype: int64

In [30]:
#Preliminary check for missing values by %
crew_df.isna().sum()/len(crew_df) * 100

tconst       0.000000
directors    1.085175
writers      5.450341
dtype: float64

#### Creating unique pairs of Directors and Writers

In [31]:
#Splitting writers and directors into lists
crew_df["directors_split"] = crew_df["directors"].str.split(',')
crew_df["writers_split"] = crew_df["writers"].str.split(',')
crew_df.head()

Unnamed: 0,tconst,directors,writers,directors_split,writers_split
34803,tt0035423,nm0003506,"nm0737216,nm0003506",[nm0003506],"[nm0737216, nm0003506]"
61116,tt0062336,"nm0749914,nm0765384","nm0749914,nm1146177","[nm0749914, nm0765384]","[nm0749914, nm1146177]"
67669,tt0069049,nm0000080,"nm0000080,nm0462648",[nm0000080],"[nm0000080, nm0462648]"
86801,tt0088751,"nm0628399,nm0078540",nm0628399,"[nm0628399, nm0078540]",[nm0628399]
93938,tt0096056,nm0324875,"nm0234502,nm0324875",[nm0324875],"[nm0234502, nm0324875]"


In [32]:
#Removing unnecessary directors and writers columns
crew_df = crew_df.drop(columns = ["directors", "writers"])
crew_df.head()

Unnamed: 0,tconst,directors_split,writers_split
34803,tt0035423,[nm0003506],"[nm0737216, nm0003506]"
61116,tt0062336,"[nm0749914, nm0765384]","[nm0749914, nm1146177]"
67669,tt0069049,[nm0000080],"[nm0000080, nm0462648]"
86801,tt0088751,"[nm0628399, nm0078540]",[nm0628399]
93938,tt0096056,[nm0324875],"[nm0234502, nm0324875]"


In [33]:
#Exploding directors and writers
crew_df = crew_df.explode("directors_split")
crew_df = crew_df.explode("writers_split")
crew_df.head()

Unnamed: 0,tconst,directors_split,writers_split
34803,tt0035423,nm0003506,nm0737216
34803,tt0035423,nm0003506,nm0003506
61116,tt0062336,nm0749914,nm0749914
61116,tt0062336,nm0749914,nm1146177
61116,tt0062336,nm0765384,nm0749914
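Exploding both list columns in sequence yields every director/writer combination for a title, so a movie with two directors and two writers becomes four rows. The sketch below reproduces this on made-up ids (`crew` and `pairs` are hypothetical names) and also shows that a missing writer is kept as a single NaN row rather than dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical crew rows: ttA has 2 directors and 2 writers, ttB has no writer.
crew = pd.DataFrame({
    "tconst": ["ttA", "ttB"],
    "directors": ["d1,d2", "d3"],
    "writers": ["w1,w2", np.nan],
})

# Split the comma-separated ids into lists (NaN stays NaN)
crew["director"] = crew["directors"].str.split(",")
crew["writer"] = crew["writers"].str.split(",")

# Explode one column after the other to get all combinations per title
pairs = crew[["tconst", "director", "writer"]].explode("director").explode("writer")
print(pairs)
```

The exploded frame repeats the original index for every generated row; calling `reset_index(drop=True)` afterwards is an option if a unique index is needed.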


In [35]:
#Renaming directors and writers columns
crew_df = crew_df.rename(columns = {"directors_split": "director", "writers_split": "writer"})
crew_df.head()

Unnamed: 0,tconst,director,writer
34803,tt0035423,nm0003506,nm0737216
34803,tt0035423,nm0003506,nm0003506
61116,tt0062336,nm0749914,nm0749914
61116,tt0062336,nm0749914,nm1146177
61116,tt0062336,nm0765384,nm0749914


In [36]:
#Collecting the unique director IDs
unique_director = crew_df["director"].unique()
unique_director

array(['nm0003506', 'nm0749914', 'nm0765384', ..., 'nm8063415',
       'nm5412267', 'nm7308376'], dtype=object)

In [37]:
#Collecting the unique writer IDs
unique_writer = crew_df["writer"].unique()
unique_writer

array(['nm0737216', 'nm0003506', 'nm0749914', ..., 'nm5412267',
       'nm6743460', 'nm3471432'], dtype=object)

### Creating and Cleaning Principals

principals_url: https://datasets.imdbws.com/title.principals.tsv.gz

In [38]:
principals_url = "https://datasets.imdbws.com/title.principals.tsv.gz"

principals_df = pd.read_csv(principals_url, sep = "\t", low_memory = False)

principals_df.head()

Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0000001,1,nm1588970,self,\N,"[""Self""]"
1,tt0000001,2,nm0005690,director,\N,\N
2,tt0000001,3,nm0374658,cinematographer,director of photography,\N
3,tt0000002,1,nm0721526,director,\N,\N
4,tt0000002,2,nm1335271,composer,\N,\N


In [40]:
#Keeping only principals rows for titles that remain in basics
movies_in_basics_filter = principals_df["tconst"].isin(basics_df["tconst"])

principals_df = principals_df[movies_in_basics_filter]
principals_df.head()

Unnamed: 0,tconst,ordering,nconst,category,job,characters
283378,tt0035423,10,nm0107463,editor,\N,\N
283379,tt0035423,1,nm0000212,actress,\N,"[""Kate McKay""]"
283380,tt0035423,2,nm0413168,actor,\N,"[""Leopold""]"
283381,tt0035423,3,nm0000630,actor,\N,"[""Stuart Besser""]"
283382,tt0035423,4,nm0005227,actor,\N,"[""Charlie McKay""]"


In [41]:
# Replacing all \N values with NaN
principals_df = principals_df.replace({"\\N": np.nan})

principals_df.head()

Unnamed: 0,tconst,ordering,nconst,category,job,characters
283378,tt0035423,10,nm0107463,editor,,
283379,tt0035423,1,nm0000212,actress,,"[""Kate McKay""]"
283380,tt0035423,2,nm0413168,actor,,"[""Leopold""]"
283381,tt0035423,3,nm0000630,actor,,"[""Stuart Besser""]"
283382,tt0035423,4,nm0005227,actor,,"[""Charlie McKay""]"


In [42]:
#Checking for duplicates
principals_df.duplicated().sum()

0

In [43]:
#Preliminary check for missing values
principals_df.isna().sum()

tconst             0
ordering           0
nconst             0
category           0
job           571572
characters    405600
dtype: int64

In [44]:
#Preliminary check for missing values by %
principals_df.isna().sum() / len(principals_df) * 100

tconst         0.000000
ordering       0.000000
nconst         0.000000
category       0.000000
job           78.801323
characters    55.919143
dtype: float64

### Creating and Cleaning Names

names_url: https://datasets.imdbws.com/name.basics.tsv.gz

In [45]:
names_url = "https://datasets.imdbws.com/name.basics.tsv.gz"

names_df = pd.read_csv(names_url, sep = "\t", low_memory = False)

names_df.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0045537,tt0050419,tt0053137,tt0072308"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0038355,tt0117057,tt0037382,tt0075213"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,music_department","tt0049189,tt0054452,tt0056404,tt0057345"
3,nm0000004,John Belushi,1949,1982,"actor,soundtrack,writer","tt0078723,tt0072562,tt0077975,tt0080455"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0060827,tt0050986,tt0083922,tt0069467"


In [46]:
#Keeping only names that appear in principals, or as a director or writer in crew
names_in_principals_filter = names_df["nconst"].isin(principals_df["nconst"])
names_in_directors_filter = names_df["nconst"].isin(unique_director)
names_in_writers_filter = names_df["nconst"].isin(unique_writer)

names_df = names_df[names_in_principals_filter | 
                    names_in_directors_filter |
                    names_in_writers_filter]
names_df.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0038355,tt0117057,tt0037382,tt0075213"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0060827,tt0050986,tt0083922,tt0069467"
5,nm0000006,Ingrid Bergman,1915,1982,"actress,soundtrack,producer","tt0038787,tt0038109,tt0036855,tt0034583"
7,nm0000008,Marlon Brando,1924,2004,"actor,soundtrack,director","tt0070849,tt0047296,tt0068646,tt0078788"
11,nm0000012,Bette Davis,1908,1989,"actress,soundtrack,make_up_department","tt0056687,tt0042192,tt0031210,tt0035140"
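The three masks above are combined with `|`, which keeps a name if it appears in any of the three sources. An equivalent formulation, sketched here on made-up ids (all variable names are hypothetical), first builds one set of wanted ids and then applies a single `isin` test.

```python
import numpy as np
import pandas as pd

# Hypothetical id collections standing in for principals, directors and writers.
principal_ids = pd.Series(["nm1", "nm2"])
director_ids = np.array(["nm2", "nm3", np.nan], dtype=object)
writer_ids = np.array(["nm4"], dtype=object)

# Union of all ids, ignoring missing values
wanted = set(principal_ids) | set(director_ids) | set(writer_ids)
wanted = {x for x in wanted if isinstance(x, str)}

names = pd.DataFrame({
    "nconst": ["nm1", "nm3", "nm5"],
    "primaryName": ["Person A", "Person B", "Person C"],
})
names_kept = names[names["nconst"].isin(wanted)]
print(names_kept)
```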


In [47]:
# Replacing all \N values with NaN
names_df = names_df.replace({"\\N": np.nan})

names_df.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0038355,tt0117057,tt0037382,tt0075213"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0060827,tt0050986,tt0083922,tt0069467"
5,nm0000006,Ingrid Bergman,1915,1982,"actress,soundtrack,producer","tt0038787,tt0038109,tt0036855,tt0034583"
7,nm0000008,Marlon Brando,1924,2004,"actor,soundtrack,director","tt0070849,tt0047296,tt0068646,tt0078788"
11,nm0000012,Bette Davis,1908,1989,"actress,soundtrack,make_up_department","tt0056687,tt0042192,tt0031210,tt0035140"


In [48]:
#Checking for duplicates
names_df.duplicated().sum()

0

In [49]:
#Preliminary check for missing values
names_df.isna().sum()

nconst                    0
primaryName               0
birthYear            317955
deathYear            389537
primaryProfession      5214
knownForTitles         7722
dtype: int64

In [52]:
#Preliminary check for missing values by %
names_df.isna().sum()/len(names_df) * 100

nconst                0.000000
primaryName           0.000000
birthYear            79.983850
deathYear            97.990813
primaryProfession     1.311619
knownForTitles        1.942524
dtype: float64

## Deliverables

To showcase what has been done:

* A summary of how many rows remain in each dataset, along with the datatype of each feature, will be provided
* Each pandas dataframe will be saved as a gzip-compressed CSV file in the "Data/" folder

### Showcasing how many rows remain in each dataset, with feature datatypes

In [53]:
akas_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1437292 entries, 5 to 35918156
Data columns (total 8 columns):
 #   Column           Non-Null Count    Dtype 
---  ------           --------------    ----- 
 0   titleId          1437292 non-null  object
 1   ordering         1437292 non-null  int64 
 2   title            1437292 non-null  object
 3   region           1437292 non-null  object
 4   language         3928 non-null     object
 5   types            978905 non-null   object
 6   attributes       46585 non-null    object
 7   isOriginalTitle  1435947 non-null  object
dtypes: int64(1), object(7)
memory usage: 98.7+ MB


In [54]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 499319 entries, 0 to 1312091
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   tconst         499319 non-null  object 
 1   averageRating  499319 non-null  float64
 2   numVotes       499319 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 15.2+ MB


In [55]:
basics_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 81738 entries, 34803 to 9854120
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          81738 non-null  object 
 1   titleType       81738 non-null  object 
 2   primaryTitle    81738 non-null  object 
 3   originalTitle   81738 non-null  object 
 4   isAdult         81738 non-null  object 
 5   startYear       81738 non-null  float64
 6   endYear         0 non-null      object 
 7   runtimeMinutes  81738 non-null  float64
 8   genres          81738 non-null  object 
dtypes: float64(2), object(7)
memory usage: 6.2+ MB


In [56]:
crew_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 212601 entries, 34803 to 9854120
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   tconst    212601 non-null  object
 1   director  211555 non-null  object
 2   writer    207096 non-null  object
dtypes: object(3)
memory usage: 6.5+ MB


In [57]:
principals_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 725333 entries, 283378 to 56143155
Data columns (total 6 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   tconst      725333 non-null  object
 1   ordering    725333 non-null  int64 
 2   nconst      725333 non-null  object
 3   category    725333 non-null  object
 4   job         153761 non-null  object
 5   characters  319733 non-null  object
dtypes: int64(1), object(5)
memory usage: 38.7+ MB


In [58]:
names_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 397524 entries, 1 to 12533718
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   nconst             397524 non-null  object
 1   primaryName        397524 non-null  object
 2   birthYear          79569 non-null   object
 3   deathYear          7987 non-null    object
 4   primaryProfession  392310 non-null  object
 5   knownForTitles     389802 non-null  object
dtypes: object(6)
memory usage: 21.2+ MB


### Compressing datasets into .csv.gz files

In [59]:
#Saving the filtered dataframes as gzip-compressed CSV files
akas_df.to_csv("Data/title_akas.csv.gz",compression='gzip',index=False)
ratings_df.to_csv("Data/title_ratings.csv.gz",compression='gzip',index=False)
basics_df.to_csv("Data/title_basics.csv.gz",compression='gzip',index=False)
crew_df.to_csv("Data/title_crew.csv.gz",compression='gzip',index=False)
principals_df.to_csv("Data/title_principals.csv.gz",compression='gzip',index=False)
names_df.to_csv("Data/title_names.csv.gz",compression='gzip',index=False)
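To check that a saved file can be read back, a small round trip can be run. The sketch below writes a made-up frame to a temporary directory (so it does not touch the "Data/" folder) and reloads it; `pd.read_csv` infers gzip compression from the `.gz` extension by default.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame standing in for one of the filtered datasets.
df = pd.DataFrame({"tconst": ["tt1", "tt2"], "averageRating": [5.7, 6.2]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "title_ratings.csv.gz")
    # Same call pattern as above: gzip compression, no index column
    df.to_csv(path, compression="gzip", index=False)
    # Compression is inferred from the file extension on read
    restored = pd.read_csv(path)

print(restored)
```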