# Official IMDB Data Analysis

- Overview/Data Dictionary: https://www.imdb.com/interfaces/

## Objective

**Specifications**

Your stakeholder only wants you to include information for movies based on the following specifications:

- Exclude any movie with missing values for genre or runtime.
- Include only full-length movies (titleType = "movie").
- Include only fictional movies (exclude the Documentary genre).
- Include only movies released from 2000 through 2021, inclusive.
- Include only movies released in the United States.
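
As a rough illustration of how these specifications translate into pandas filters, the sketch below applies all five rules to a tiny made-up dataset. The frames `toy_basics` and `toy_akas` and every value in them are hypothetical stand-ins for the IMDB tables, not real data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for title.basics and title.akas (all values are made up)
toy_basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5", "tt6"],
    "titleType": ["movie", "short", "movie", "movie", "movie", "movie"],
    "startYear": [2001, 2005, 1999, 2010, 2015, 2021],
    "runtimeMinutes": ["118", "5", "90", np.nan, "95", "101"],
    "genres": ["Comedy", "Short", "Drama", "Drama",
               "Documentary,History", "Horror"],
})
toy_akas = pd.DataFrame({
    "titleId": ["tt1", "tt1", "tt5", "tt6"],
    "region": ["US", "DE", "US", "FR"],
})

# IDs of titles that have at least one US release entry
us_ids = toy_akas.loc[toy_akas["region"] == "US", "titleId"]

# One boolean condition per stakeholder specification
mask = (
    toy_basics["runtimeMinutes"].notna()
    & toy_basics["genres"].notna()
    & (toy_basics["titleType"] == "movie")
    & ~toy_basics["genres"].str.contains("documentary", case=False, na=False)
    & toy_basics["startYear"].between(2000, 2021)  # inclusive on both ends
    & toy_basics["tconst"].isin(us_ids)
)
filtered = toy_basics[mask]
print(filtered["tconst"].tolist())  # ['tt1']
```

Only `tt1` survives: each of the other rows fails exactly one rule (title type, year, missing runtime, documentary genre, or no US entry).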


**Deliverable**

After filtering out movies that do not meet the stakeholder's specifications:

- Before saving, run a final .info() on each dataframe to show how many movies remain and the datatype of each feature.
- Save each dataframe as a compressed csv file in a "Data/" folder inside your repository.
- Commit your changes in GitHub Desktop and Publish repository / Push changes.
- Submit the link to your repository.
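
A minimal sketch of the first two deliverable steps, assuming a small made-up dataframe (`toy`) and a temporary directory standing in for the repository's "Data/" folder. The file name `toy_ratings.csv.gz` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for one of the filtered IMDB tables
toy = pd.DataFrame({"tconst": ["tt1", "tt2"], "averageRating": [5.7, 6.4]})

# Final summary before saving: row count and dtype of each column
toy.info()

# A temporary directory stands in for the repository's "Data/" folder
out_dir = os.path.join(tempfile.mkdtemp(), "Data")
os.makedirs(out_dir, exist_ok=True)
out_path = os.path.join(out_dir, "toy_ratings.csv.gz")

# Write a gzip-compressed csv without the index column
toy.to_csv(out_path, compression="gzip", index=False)

# read_csv infers gzip compression from the .gz extension
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```

Reading the file back is a cheap check that the compressed csv round-trips with the expected number of rows and columns.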

## Imports

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import os

## Data

In [2]:
# Basics
basics_url="https://datasets.imdbws.com/title.basics.tsv.gz"
basics = pd.read_csv(basics_url, sep='\t', low_memory=False)

# Akas
akas_url="https://datasets.imdbws.com/title.akas.tsv.gz"
akas = pd.read_csv(akas_url, sep='\t', low_memory=False)

# Ratings
ratings_url="https://datasets.imdbws.com/title.ratings.tsv.gz"
ratings = pd.read_csv(ratings_url, sep='\t', low_memory=False)

### AKAs Data

In [3]:
# Display the first 5 rows of the akas dataframe
akas.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0


In [4]:
# Display the column names, count of non-null values, and their datatypes
akas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36461473 entries, 0 to 36461472
Data columns (total 8 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   titleId          object
 1   ordering         int64 
 2   title            object
 3   region           object
 4   language         object
 5   types            object
 6   attributes       object
 7   isOriginalTitle  object
dtypes: int64(1), object(7)
memory usage: 2.2+ GB


In [5]:
# Display the descriptive statistics for the numeric columns
akas.describe()

Unnamed: 0,ordering
count,36461470.0
mean,4.14976
std,3.93739
min,1.0
25%,2.0
50%,4.0
75%,6.0
max,249.0


In [6]:
# Display the number of duplicate rows in the dataset
print(f'There are {akas.duplicated().sum()} duplicate rows.')

There are 0 duplicate rows.


In [7]:
# Display the total number of missing values ("\N" placeholders are not counted until they are replaced with np.nan)
print(f'There are {akas.isna().sum().sum()} missing values.')

There are 119 missing values.


In [8]:
# Replace "\N" with np.nan
akas.replace({'\\N': np.nan}, inplace=True)
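
The order of these steps matters: `isna()` does not treat the literal string `\N` as missing, so a count taken before the replacement understates the real number of gaps. The sketch below shows this on two made-up rows in the same tab-separated layout, and also shows `na_values` in `pd.read_csv` as one possible alternative that handles the placeholder at load time. The sample text is an assumption for illustration only.

```python
import io

import numpy as np
import pandas as pd

# Two made-up rows laid out like title.basics (tab-separated, "\N" = missing)
sample_tsv = (
    "tconst\ttitleType\tstartYear\truntimeMinutes\tgenres\n"
    "tt0000001\tshort\t1894\t1\tDocumentary,Short\n"
    "tt0000002\tmovie\t\\N\t\\N\t\\N\n"
)

# Read as-is: "\N" stays a plain string, so nothing counts as missing
raw = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
missing_before = int(raw.isna().sum().sum())

# Replace the placeholder, as the notebook does, then count again
cleaned = raw.replace({"\\N": np.nan})
missing_after = int(cleaned.isna().sum().sum())

# Alternative: let read_csv convert the placeholder while parsing
parsed = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")
missing_parsed = int(parsed.isna().sum().sum())

print(missing_before, missing_after, missing_parsed)  # 0 3 3
```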

In [9]:
# Keep only US movies
# Drop rows with a null region
akas.dropna(subset=['region'], inplace=True)
# Apply filter
akas = akas[akas['region'] == 'US']
akas['region'].value_counts()

US    1449468
Name: region, dtype: int64

In [10]:
print(akas)

            titleId  ordering                                      title  \
5         tt0000001         6                                 Carmencita   
14        tt0000002         7                     The Clown and His Dogs   
33        tt0000005        10                           Blacksmith Scene   
36        tt0000005         1                        Blacksmithing Scene   
41        tt0000005         6                        Blacksmith Scene #1   
...             ...       ...                                        ...   
36460999  tt9916560         1  March of Dimes Presents: Once Upon a Dime   
36461069  tt9916620         1                          The Copeland Case   
36461158  tt9916702         1              Loving London: The Playground   
36461201  tt9916756         1                   Pretty Pretty Black Girl   
36461217  tt9916764         1                                         38   

         region language        types             attributes isOriginalTitle  
5       

### Basics Data

In [11]:
# Display the first 5 rows of the basics dataframe
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


In [12]:
# Display the column names, count of non-null values, and their datatypes
basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9985918 entries, 0 to 9985917
Data columns (total 9 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   tconst          object
 1   titleType       object
 2   primaryTitle    object
 3   originalTitle   object
 4   isAdult         object
 5   startYear       object
 6   endYear         object
 7   runtimeMinutes  object
 8   genres          object
dtypes: object(9)
memory usage: 685.7+ MB


In [13]:
# Display the descriptive statistics for all columns
basics.describe()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
count,9985918,9985918,9985907,9985907,9985918,9985918,9985918,9985918,9985903
unique,9985918,11,4508287,4530472,11,154,96,893,2340
top,tt0000001,tvEpisode,Episode #1.1,Episode #1.1,0,\N,\N,\N,Drama
freq,1,7595077,48302,48302,9671592,1345133,9876330,7020913,1137875


In [14]:
# Display the number of duplicate rows in the dataset
print(f'There are {basics.duplicated().sum()} duplicate rows.')

There are 0 duplicate rows.


In [15]:
# Display the total number of missing values ("\N" placeholders are not counted until they are replaced with np.nan)
print(f'There are {basics.isna().sum().sum()} missing values.')

There are 37 missing values.


In [16]:
# Replace "\N" with np.nan
basics.replace({'\\N':np.nan}, inplace = True)

In [17]:
# Eliminate movies that are null for runtimeMinutes & genres
basics.dropna(subset=['runtimeMinutes','genres'], inplace = True)

In [18]:
# Keep only titleType == 'movie'
basics = basics.loc[basics['titleType']=='movie']

In [19]:
# Check categories left in column titleType
basics['titleType'].value_counts()

movie    384853
Name: titleType, dtype: int64

In [36]:
# Keep startYear 2000-2021
# .info() shows startYear is stored as object, so convert it to int
# (.copy() avoids a SettingWithCopyWarning on the filtered frame)
basics = basics.dropna(subset=['startYear']).copy()
basics['startYear'] = basics['startYear'].astype(int)

# Keep rows where startYear is between 2000 and 2021, inclusive
basics = basics[(basics['startYear'] >= 2000) & (basics['startYear'] <= 2021)]
basics['startYear'].value_counts()

2019    5876
2018    5779
2017    5641
2016    5255
2021    5157
2015    5055
2020    5005
2014    4916
2013    4710
2012    4522
2011    4229
2010    3861
2009    3555
2008    2910
2007    2575
2006    2438
2005    2182
2004    1902
2003    1684
2001    1575
2002    1570
2000    1456
Name: startYear, dtype: int64
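
The `astype(int)` conversion above assumes every remaining `startYear` value parses cleanly as an integer. A slightly more defensive sketch, using made-up values, is to convert with `pd.to_numeric(..., errors="coerce")` so that unparseable entries become NaN and fall out of the range filter on their own.

```python
import numpy as np
import pandas as pd

# Made-up years stored as strings, with one missing value
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "startYear": ["1999", "2000", "2021", np.nan],
})

# Unparseable or missing entries become NaN instead of raising an error
toy["startYear"] = pd.to_numeric(toy["startYear"], errors="coerce")

# between() is inclusive on both ends by default; NaN compares as False
in_range = toy[toy["startYear"].between(2000, 2021)]
print(in_range["tconst"].tolist())  # ['tt2', 'tt3']
```

Note that the column is float after coercion because NaN cannot be stored in a plain int column. It can be cast to int once the missing rows are gone.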

In [37]:
# Eliminate movies that include "Documentary" in genre 
is_documentary = basics['genres'].str.contains('documentary',case=False)
basics = basics[~is_documentary]

In [38]:
# Keep only US movies
# Filter the basics table down to only include US titles by
# using the filtered akas dataframe
basics_keepers = basics['tconst'].isin(akas['titleId'])

# Filter basics
basics = basics[basics_keepers]
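
One reason `isin` suits this step is that akas can hold several US rows for the same title (the akas preview shows tt0000005 three times). The sketch below, on made-up frames, contrasts the `isin` filter with an inner merge. The merge is shown only for comparison and is not part of the notebook.

```python
import pandas as pd

# Made-up frames: tt1 has two US entries in the akas stand-in
toy_basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "primaryTitle": ["A", "B", "C"],
})
toy_akas_us = pd.DataFrame({
    "titleId": ["tt1", "tt1", "tt3"],
    "region": ["US", "US", "US"],
})

# isin keeps each basics row at most once, however many akas rows match
keepers = toy_basics["tconst"].isin(toy_akas_us["titleId"])
via_isin = toy_basics[keepers]

# An inner merge repeats a basics row once per matching akas row
via_merge = toy_basics.merge(toy_akas_us, left_on="tconst", right_on="titleId")

print(len(via_isin), len(via_merge))  # 2 3
```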

In [39]:
print(basics)

          tconst titleType                                       primaryTitle  \
0      tt0035423     movie                                     Kate & Leopold   
1      tt0062336     movie  The Tango of the Widower and Its Distorting Mi...   
2      tt0069049     movie                         The Other Side of the Wind   
3      tt0088751     movie                                  The Naked Monster   
4      tt0096056     movie                               Crime and Punishment   
...          ...       ...                                                ...   
86912  tt9914942     movie                             Life Without Sara Amat   
86913  tt9915872     movie                               The Last White Witch   
86914  tt9916170     movie                                      The Rehearsal   
86915  tt9916190     movie                                          Safeguard   
86916  tt9916362     movie                                              Coven   

                           

### Ratings Data

In [24]:
# Display the first 5 rows of the ratings dataframe
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1984
1,tt0000002,5.8,265
2,tt0000003,6.5,1840
3,tt0000004,5.5,178
4,tt0000005,6.2,2623


In [25]:
# Display the column names, count of non-null values, and their datatypes
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1327278 entries, 0 to 1327277
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   tconst         1327278 non-null  object 
 1   averageRating  1327278 non-null  float64
 2   numVotes       1327278 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 30.4+ MB


In [26]:
# Display the descriptive statistics for all columns
ratings.describe()

Unnamed: 0,averageRating,numVotes
count,1327278.0,1327278.0
mean,6.955056,1038.381
std,1.382464,17474.77
min,1.0,5.0
25%,6.2,11.0
50%,7.1,26.0
75%,7.9,101.0
max,10.0,2761075.0


In [27]:
# Display the number of duplicate rows in the dataset
print(f'There are {ratings.duplicated().sum()} duplicate rows.')

There are 0 duplicate rows.


In [28]:
# Display the total number of missing values
print(f'There are {ratings.isna().sum().sum()} missing values.')

There are 0 missing values.


In [29]:
# Replace "\N" with np.nan
ratings.replace({'\\N': np.nan}, inplace=True)

In [40]:
# Keep only US movies
# Filter the ratings table down to only include US titles by
# using the filtered akas dataframe
ratings_keepers = ratings['tconst'].isin(akas['titleId'])

# Filter ratings
ratings = ratings[ratings_keepers]

In [41]:
print(ratings)

           tconst  averageRating  numVotes
0       tt0000001            5.7      1984
1       tt0000002            5.8       265
2       tt0000005            6.2      2623
3       tt0000006            5.1       182
4       tt0000007            5.4       820
...           ...            ...       ...
502895  tt9916200            8.1       230
502896  tt9916204            8.2       264
502897  tt9916348            8.3        18
502898  tt9916362            6.4      5405
502899  tt9916428            3.8        14

[502900 rows x 3 columns]


## Save Files into Repository

In [32]:
os.makedirs('Data/', exist_ok=True)
# Confirm the folder exists and list its contents
os.listdir("Data/")

['title_akas.csv.gz', 'title_basics.csv.gz', 'title_ratings.csv.gz']

In [34]:
## Save current dataframe to file.
akas.to_csv("Data/title_akas.csv.gz",compression='gzip',index=False)

# Open saved file and preview again
akas = pd.read_csv("Data/title_akas.csv.gz", low_memory = False)
akas.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,6,Carmencita,US,,imdbDisplay,,0.0
1,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0.0
2,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0.0
3,tt0000005,1,Blacksmithing Scene,US,,alternative,,0.0
4,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0.0


In [42]:
## Save current dataframe to file.
basics.to_csv("Data/title_basics.csv.gz",compression='gzip',index=False)

# Open saved file and preview again
basics = pd.read_csv("Data/title_basics.csv.gz", low_memory = False)
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020,,70,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
3,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"
4,tt0096056,movie,Crime and Punishment,Crime and Punishment,0,2002,,126,Drama


In [35]:
## Save current dataframe to file.
ratings.to_csv("Data/title_ratings.csv.gz",compression='gzip',index=False)

# Open saved file and preview again
ratings = pd.read_csv("Data/title_ratings.csv.gz", low_memory = False)
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1984
1,tt0000002,5.8,265
2,tt0000005,6.2,2623
3,tt0000006,5.1,182
4,tt0000007,5.4,820
