In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### **IMDb Datasets**

Subsets of IMDb data are available to customers for personal and non-commercial use. You can hold local copies of this data, subject to IMDb's terms and conditions. Please refer to the [Non-Commercial Licensing](https://help.imdb.com/article/imdb/general-information/can-i-use-imdb-data-in-my-software/G5JTRESSHJBBHTGX) and [copyright/license](http://www.imdb.com/Copyright) pages and verify compliance.

**Data Location**

The dataset files can be accessed and downloaded from [https://datasets.imdbws.com/](https://datasets.imdbws.com/). The data is refreshed daily.

**IMDb Dataset Details**

Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A _'\N'_ is used to denote that a particular field is missing or null for that title/name. The available datasets are as follows:

**title.akas.tsv.gz** - Contains the following information for titles:

- titleId (string) - a tconst, an alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- title (string) – the localized title
- region (string) - the region for this version of the title
- language (string) - the language of the title
- types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
- attributes (array) - Additional terms to describe this alternative title, not enumerated
- isOriginalTitle (boolean) – 0: not original title; 1: original title

**title.basics.tsv.gz** - Contains the following information for titles:

- tconst (string) - alphanumeric unique identifier of the title
- titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
- primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) - original title, in the original language
- isAdult (boolean) - 0: non-adult title; 1: adult title
- startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
- endYear (YYYY) – TV Series end year. '\N' for all other title types
- runtimeMinutes – primary runtime of the title, in minutes
- genres (string array) – includes up to three genres associated with the title

**title.crew.tsv.gz** – Contains the director and writer information for all the titles in IMDb. Fields include:

- tconst (string) - alphanumeric unique identifier of the title
- directors (array of nconsts) - director(s) of the given title
- writers (array of nconsts) – writer(s) of the given title

**title.episode.tsv.gz** – Contains the TV episode information. Fields include:

- tconst (string) - alphanumeric identifier of episode
- parentTconst (string) - alphanumeric identifier of the parent TV Series
- seasonNumber (integer) – season number the episode belongs to
- episodeNumber (integer) – episode number of the tconst in the TV series

**title.principals.tsv.gz** – Contains the principal cast/crew for titles

- tconst (string) - alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- nconst (string) - alphanumeric unique identifier of the name/person
- category (string) - the category of job that person was in
- job (string) - the specific job title if applicable, else '\N'
- characters (string) - the name of the character played if applicable, else '\N'

**title.ratings.tsv.gz** – Contains the IMDb rating and votes information for titles

- tconst (string) - alphanumeric unique identifier of the title
- averageRating – weighted average of all the individual user ratings
- numVotes - number of votes the title has received

**name.basics.tsv.gz** – Contains the following information for names:

- nconst (string) - alphanumeric unique identifier of the name/person
- primaryName (string) – name by which the person is most often credited
- birthYear – in YYYY format
- deathYear – in YYYY format if applicable, else '\N'
- primaryProfession (array of strings) – the top-3 professions of the person
- knownForTitles (array of tconsts) – titles the person is known for
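
Because every file marks missing fields with '\N', it can help to tell pandas about that marker at load time. The following is a minimal sketch, not the loading code used in this notebook: it reads two rows copied from the name.basics preview out of an in-memory string (an assumption made so the example runs without the data files) and requests nullable integer columns for the year fields.

```python
import io

import pandas as pd

# Two rows in the documented name.basics layout; "\N" marks a missing value.
sample_tsv = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\n"
    "nm0000001\tFred Astaire\t1899\t1987\n"
    "nm0000003\tBrigitte Bardot\t1934\t\\N\n"
)

names = pd.read_csv(
    io.StringIO(sample_tsv),  # a path to a .tsv or .tsv.gz file would go here
    sep="\t",
    na_values="\\N",          # treat the IMDb null marker as missing
    dtype={"birthYear": "Int64", "deathYear": "Int64"},  # nullable integers
)

print(names["deathYear"].isna().tolist())
```

With this option the year columns stay numeric instead of becoming strings, at the cost of previews showing a missing-value marker rather than the literal '\N'.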

In [2]:
data_name_basics_tsv = pd.read_csv("./data/name.basics.tsv/data.tsv", sep="\t", header=0)

In [3]:
data_name_basics_tsv.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0043044,tt0050419,tt0053137,tt0072308"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0037382,tt0117057,tt0071877,tt0038355"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,producer","tt0049189,tt0059956,tt0057345,tt0054452"
3,nm0000004,John Belushi,1949,1982,"actor,writer,soundtrack","tt0072562,tt0080455,tt0078723,tt0077975"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0069467,tt0083922,tt0050986,tt0050976"
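
The knownForTitles column stores several tconst values in one comma-separated string. One possible way to work with it, sketched here on two shortened rows from the preview above, is to split the string and expand it to one row per (person, title) pair. The column name `tconst` for the expanded values and the variable names are choices made for this illustration.

```python
import pandas as pd

# Shortened rows from the name.basics preview (only two titles kept per person).
names = pd.DataFrame({
    "nconst": ["nm0000001", "nm0000002"],
    "knownForTitles": ["tt0043044,tt0050419", "tt0037382,tt0117057"],
})

known_for = (
    names.assign(tconst=names["knownForTitles"].str.split(","))  # string -> list
         .explode("tconst")                                      # one row per title
         .drop(columns="knownForTitles")
         .reset_index(drop=True)
)

print(known_for)
```

The resulting long table has the same key column as the title datasets, so it can be matched against them by tconst.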


In [4]:
data_title_akas_tsv = pd.read_csv("./data/title.akas.tsv/data.tsv", sep="\t", header=0)



In [5]:
data_title_akas_tsv.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
1,tt0000001,2,Καρμενσίτα,GR,\N,\N,\N,0
2,tt0000001,3,Карменсита,RU,\N,\N,\N,0
3,tt0000001,4,Carmencita,US,\N,\N,\N,0
4,tt0000001,5,Carmencita,\N,\N,original,\N,1


In [6]:
data_title_akas_tsv[data_title_akas_tsv.region == "US"].head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
3,tt0000001,4,Carmencita,US,\N,\N,\N,0
10,tt0000002,6,The Clown and His Dogs,US,\N,\N,literal English title,0
23,tt0000005,1,Blacksmithing Scene,US,\N,alternative,\N,0
26,tt0000005,4,Blacksmith Scene #1,US,\N,alternative,\N,0
27,tt0000005,5,Blacksmithing,US,\N,\N,informal alternative title,0


In [7]:
data_title_basics_tsv = pd.read_csv("./data/title.basics.tsv/data.tsv", sep="\t", header=0)

In [8]:
data_title_basics_tsv.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,\N,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


In [9]:
data_title_basics_tsv[data_title_basics_tsv.genres.str.contains("Comedy", na=False)].head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
13,tt0000014,short,The Sprinkler Sprinkled,L'arroseur arrosé,0,1895,\N,1,"Comedy,Short"
18,tt0000019,short,The Clown Barber,The Clown Barber,0,1898,\N,\N,"Comedy,Short"
31,tt0000033,short,Trick Riding,La voltige,0,1895,\N,1,"Comedy,Documentary,Short"
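
The filter above matches substrings, which is fine for "Comedy" but could over-match for a genre whose name is contained in another one (for example, a search for "Music" would also match "Musical" if both labels occur in the data). A hedged alternative is to split the genres string and test for exact membership. The helper `has_genre` and the sample rows below are hypothetical and assume the genres column contains only strings.

```python
import pandas as pd

# Hypothetical rows; "\N" is the literal null marker as loaded in this notebook.
titles = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "genres": ["Comedy,Short", "Musical,Romance", "\\N"],
})

def has_genre(df, genre):
    """Return a boolean Series: True where `genre` is one of the listed genres."""
    return df["genres"].str.split(",").apply(lambda genre_list: genre in genre_list)

print(titles[has_genre(titles, "Comedy")])
```

On these rows, `has_genre(titles, "Music")` selects nothing, whereas `titles["genres"].str.contains("Music")` would select the "Musical,Romance" row.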


In [10]:
data_title_crew_tsv = pd.read_csv("./data/title.crew.tsv/data.tsv", sep="\t", header=0)

In [11]:
data_title_crew_tsv.head()

Unnamed: 0,tconst,directors,writers
0,tt0000001,nm0005690,\N
1,tt0000002,nm0721526,\N
2,tt0000003,nm0721526,\N
3,tt0000004,nm0721526,\N
4,tt0000005,nm0005690,\N


In [12]:
data_title_episode_tsv = pd.read_csv("./data/title.episode.tsv/data.tsv", sep="\t", header=0)

In [13]:
data_title_episode_tsv.head()

Unnamed: 0,tconst,parentTconst,seasonNumber,episodeNumber
0,tt0041951,tt0041038,1,9
1,tt0042816,tt0989125,1,17
2,tt0042889,tt0989125,\N,\N
3,tt0043426,tt0040051,3,42
4,tt0043631,tt0989125,2,16
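
Since seasonNumber and episodeNumber can contain '\N', they may be loaded as strings rather than integers. As a sketch under that assumption, the columns can be converted after loading and the episodes counted per parent series. The rows below are taken from the preview above; `episodes_per_series` is a name introduced for this example.

```python
import pandas as pd

# Rows from the title.episode preview, with numbers kept as strings.
episodes = pd.DataFrame({
    "tconst": ["tt0041951", "tt0042816", "tt0042889", "tt0043631"],
    "parentTconst": ["tt0041038", "tt0989125", "tt0989125", "tt0989125"],
    "seasonNumber": ["1", "1", "\\N", "2"],
    "episodeNumber": ["9", "17", "\\N", "16"],
})

# Non-numeric entries such as "\N" become missing values; Int64 keeps integers nullable.
for column in ["seasonNumber", "episodeNumber"]:
    episodes[column] = pd.to_numeric(episodes[column], errors="coerce").astype("Int64")

# Number of episode rows recorded for each parent series.
episodes_per_series = episodes.groupby("parentTconst")["tconst"].count()
print(episodes_per_series)
```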


In [14]:
data_title_principals_tsv = pd.read_csv("./data/title.principals.tsv/data.tsv", sep="\t", header=0)

In [15]:
data_title_principals_tsv.head()

Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0000001,1,nm1588970,self,\N,"[""Herself""]"
1,tt0000001,2,nm0005690,director,\N,\N
2,tt0000001,3,nm0374658,cinematographer,director of photography,\N
3,tt0000002,1,nm0721526,director,\N,\N
4,tt0000002,2,nm1335271,composer,\N,\N
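
In the preview, the characters field looks like a JSON array stored as text (`["Herself"]`). Assuming that holds for every non-null value, which this notebook does not verify, the field could be decoded into Python lists as sketched below. `parse_characters` and `character_list` are hypothetical names.

```python
import json

import pandas as pd

# Two rows from the title.principals preview.
principals = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000001"],
    "nconst": ["nm1588970", "nm0005690"],
    "characters": ['["Herself"]', "\\N"],
})

def parse_characters(value):
    """Decode the JSON-style array; map the "\\N" null marker to an empty list."""
    if value == "\\N":
        return []
    return json.loads(value)

principals["character_list"] = principals["characters"].apply(parse_characters)
print(principals[["nconst", "character_list"]])
```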


In [16]:
data_title_ratings_tsv = pd.read_csv("./data/title.ratings.tsv/data.tsv", sep="\t", header=0)

In [17]:
data_title_ratings_tsv.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.8,1492
1,tt0000002,6.3,181
2,tt0000003,6.6,1131
3,tt0000004,6.4,110
4,tt0000005,6.2,1832
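
The ratings table identifies titles only by tconst, so reading it usually means joining it to title.basics. A minimal sketch of that join follows, using three rows from the previews above. The 1000-vote threshold is an arbitrary value chosen for illustration, not something taken from the dataset documentation.

```python
import pandas as pd

# Rows copied from the title.basics and title.ratings previews.
basics = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "primaryTitle": ["Carmencita", "Le clown et ses chiens", "Pauvre Pierrot"],
})
ratings = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "averageRating": [5.8, 6.3, 6.6],
    "numVotes": [1492, 181, 1131],
})

# Keep only titles present in both tables.
rated = basics.merge(ratings, on="tconst", how="inner")

# Illustrative filter: titles with at least 1000 votes, best rated first.
top = rated[rated["numVotes"] >= 1000].sort_values("averageRating", ascending=False)
print(top[["primaryTitle", "averageRating", "numVotes"]])
```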
