# Imports

In [1]:
# Imports 
import pandas as pd
import numpy as np

# Data Dictionary

**IMDb Datasets**

Subsets of IMDb data are available for access to customers for personal and non-commercial use. You can hold local copies of this data, and it is subject to our terms and conditions. Please refer to the Non-Commercial Licensing and copyright/license and verify compliance.

**Data Location**

The dataset files can be accessed and downloaded from https://datasets.imdbws.com/. The data is refreshed daily.

**IMDb Dataset Details**

Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A ‘\N’ is used to denote that a particular field is missing or null for that title/name. The available datasets are as follows:

**title.akas.tsv.gz** - Contains the following information for titles:

- titleId (string) - a tconst, an alphanumeric unique identifier of the title

- ordering (integer) – a number to uniquely identify rows for a given titleId

- title (string) – the localized title

- region (string) - the region for this version of the title

- language (string) - the language of the title

- types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning

- attributes (array) - Additional terms to describe this alternative title, not enumerated

- isOriginalTitle (boolean) – 0: not original title; 1: original title

**title.basics.tsv.gz** - Contains the following information for titles:

- tconst (string) - alphanumeric unique identifier of the title

- titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)

- primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release

- originalTitle (string) - original title, in the original language

- isAdult (boolean) - 0: non-adult title; 1: adult title

- startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year

- endYear (YYYY) – TV Series end year. ‘\N’ for all other title types

- runtimeMinutes – primary runtime of the title, in minutes

- genres (string array) – includes up to three genres associated with the title

**title.ratings.tsv.gz** - Contains the IMDb rating and votes information for titles:

- tconst (string) - alphanumeric unique identifier of the title

- averageRating - weighted average of all the individual user ratings

- numVotes - number of votes the title has received
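
The data dictionary says missing fields are written as `\N` and that `genres` holds up to three comma-separated values. As a minimal sketch of what that means for parsing, the cell below reads a tiny made-up sample (the rows and the names `sample_tsv`, `sample`, and `genres_list` are illustrative assumptions, not real IMDb data) and asks pandas to treat `\N` as missing:

```python
import io
import pandas as pd

# Tiny made-up sample in the title.basics layout (not real IMDb rows).
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tmovie\tExample Film\t2001\t\\N\t95\tDrama,Romance\n"
    "tt0000002\ttvSeries\tExample Show\t2010\t2012\t\\N\t\\N\n"
)

# na_values="\\N" makes pandas read the literal two characters \N as missing.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

# genres is a comma-separated string; split it into a list per title.
# Rows with a missing genres value stay missing.
sample["genres_list"] = sample["genres"].str.split(",")

print(sample[["tconst", "endYear", "runtimeMinutes", "genres_list"]])
```

Because the `\N` markers become real missing values, the year and runtime columns in this sample are inferred as numeric instead of text.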

# Loading the Data

In [2]:
# URLs for each dataframe needed
basics_url = 'https://datasets.imdbws.com/title.basics.tsv.gz'
akas_url = 'https://datasets.imdbws.com/title.akas.tsv.gz'
ratings_url = 'https://datasets.imdbws.com/title.ratings.tsv.gz'
# Loading the data for basics
basics = pd.read_csv(basics_url, sep='\t', low_memory=False)

In [3]:
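# Loading the data for akas
# low_memory=False infers each column's dtype from the whole file instead of
# chunk by chunk, matching the basics load above
akas = pd.read_csv(akas_url, sep='\t', low_memory=False)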


In [4]:
# Loading the data for ratings
ratings = pd.read_csv(ratings_url, sep='\t', low_memory=True)
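
The files loaded above are large, so one possible alternative is to read them in chunks and keep only the rows of interest while loading. The sketch below is an assumption-laden illustration, not the method used in this notebook: `load_filtered` is a hypothetical helper, and the demo rows are made up so the cell runs without a download. Passing `basics_url` as `source` should work the same way, since `read_csv` infers gzip compression from the `.gz` extension.

```python
import io
import pandas as pd


def load_filtered(source, keep_types, chunksize=100_000):
    """Hypothetical helper: read a title.basics-style TSV in chunks and
    keep only rows whose titleType is in keep_types."""
    kept = []
    # chunksize makes read_csv return an iterator of DataFrames.
    reader = pd.read_csv(source, sep="\t", na_values="\\N", chunksize=chunksize)
    for chunk in reader:
        kept.append(chunk[chunk["titleType"].isin(keep_types)])
    return pd.concat(kept, ignore_index=True)


# Made-up rows standing in for the real file.
demo_tsv = (
    "tconst\ttitleType\tprimaryTitle\n"
    "tt0000001\tmovie\tExample Film\n"
    "tt0000002\tshort\tExample Short\n"
    "tt0000003\tmovie\tAnother Film\n"
)
movies = load_filtered(io.StringIO(demo_tsv), keep_types=["movie"], chunksize=2)
print(movies)
```

Filtering per chunk keeps peak memory close to the size of one chunk plus the retained rows, at the cost of a slightly longer loading loop.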

# Data Cleaning

## Cleaning the Basics Dataset

In [5]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800939 entries, 0 to 9800938
Data columns (total 9 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   tconst          object
 1   titleType       object
 2   primaryTitle    object
 3   originalTitle   object
 4   isAdult         object
 5   startYear       object
 6   endYear         object
 7   runtimeMinutes  object
 8   genres          object
dtypes: object(9)
memory usage: 673.0+ MB
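
Every column in `basics` is reported as `object`, including fields the data dictionary describes as numeric (`isAdult`, `startYear`, `endYear`, `runtimeMinutes`). One way to convert such columns, sketched here on a small made-up frame (`demo` and `numeric_cols` are illustrative names, and the values are not real IMDb data), is `pd.to_numeric` with `errors="coerce"`:

```python
import pandas as pd

# Made-up frame mimicking how the columns arrive: everything as text,
# with the literal marker \N where a value is missing.
demo = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "isAdult": ["0", "1"],
    "startYear": ["1999", "\\N"],
    "runtimeMinutes": ["120", "\\N"],
})

numeric_cols = ["isAdult", "startYear", "runtimeMinutes"]
for col in numeric_cols:
    # errors="coerce" turns anything that cannot be parsed, including \N,
    # into NaN; the nullable Int64 dtype then stores it as <NA>.
    demo[col] = pd.to_numeric(demo[col], errors="coerce").astype("Int64")

print(demo.dtypes)
```

Note that `errors="coerce"` also hides any genuinely malformed value as missing, so it is worth counting the new missing values after the conversion.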


In [6]:
akas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35704646 entries, 0 to 35704645
Data columns (total 8 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   titleId          object
 1   ordering         int64 
 2   title            object
 3   region           object
 4   language         object
 5   types            object
 6   attributes       object
 7   isOriginalTitle  object
dtypes: int64(1), object(7)
memory usage: 2.1+ GB


In [7]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1304466 entries, 0 to 1304465
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   tconst         1304466 non-null  object 
 1   averageRating  1304466 non-null  float64
 2   numVotes       1304466 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 29.9+ MB
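
The `+` after each memory figure above means pandas did not inspect the contents of `object` columns, so the true footprint is larger than reported. A minimal sketch of the difference, using a made-up frame forced to `object` dtype (the names `demo`, `shallow`, and `deep` are assumptions for illustration):

```python
import pandas as pd

# Made-up rows; dtype=object mirrors the object columns reported by info().
demo = pd.DataFrame(
    {
        "tconst": ["tt0000001", "tt0000002", "tt0000003"],
        "titleType": ["movie", "short", "movie"],
    },
    dtype=object,
)

# Shallow: counts only the pointers stored in each object column.
shallow = demo.memory_usage(deep=False).sum()
# Deep: also counts the Python strings those pointers refer to.
deep = demo.memory_usage(deep=True).sum()

print(f"shallow: {shallow} bytes, deep: {deep} bytes")
```

On the real frames, `basics.info(memory_usage='deep')` reports the deep figure directly, though it can take noticeably longer on tens of millions of rows.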
