## Dataset information ([Source](https://developer.imdb.com/non-commercial-datasets/))
# title.akas.tsv
- titleId (string) - a tconst, an alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- title (string) – the localized title
- region (string) - the region for this version of the title
- language (string) - the language of the title
- types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
- attributes (array) - Additional terms to describe this alternative title, not enumerated
- isOriginalTitle (boolean) – 0: not original title; 1: original title

# title.basics.tsv
- tconst (string) - alphanumeric unique identifier of the title
- titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
- primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) - original title, in the original language
- isAdult (boolean) - 0: non-adult title; 1: adult title
- startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
- endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
- runtimeMinutes – primary runtime of the title, in minutes
- genres (string array) – includes up to three genres associated with the title

# title.crew.tsv
- tconst (string) - alphanumeric unique identifier of the title
- directors (array of nconsts) - director(s) of the given title
- writers (array of nconsts) – writer(s) of the given title

# title.episode.tsv
- tconst (string) - alphanumeric identifier of episode
- parentTconst (string) - alphanumeric identifier of the parent TV Series
- seasonNumber (integer) – season number the episode belongs to
- episodeNumber (integer) – episode number of the tconst in the TV series

# title.principals.tsv
- tconst (string) - alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- nconst (string) - alphanumeric unique identifier of the name/person
- category (string) - the category of job that person was in
- job (string) - the specific job title if applicable, else '\N'
- characters (string) - the name of the character played if applicable, else '\N'

# title.ratings.tsv
- tconst (string) - alphanumeric unique identifier of the title
- averageRating – weighted average of all the individual user ratings
- numVotes - number of votes the title has received

# name.basics.tsv
- nconst (string) - alphanumeric unique identifier of the name/person
- primaryName (string) – name by which the person is most often credited
- birthYear – in YYYY format
- deathYear – in YYYY format if applicable, else '\N'
- primaryProfession (array of strings) – the top-3 professions of the person
- knownForTitles (array of tconsts) – titles the person is known for
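The schema above marks missing values with '\N', and the array-typed columns are stored in the TSV files as plain comma-separated strings. A minimal sketch of a loader that handles both is shown here. The helper name `read_imdb_tsv` and the two sample rows are made up for illustration and are not part of the IMDb files.

```python
import csv
import io

import pandas as pd

# Made-up rows in the shape of name.basics.tsv (not real IMDb data).
sample_tsv = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\tprimaryProfession\tknownForTitles\n"
    "nm0000001\tPerson A\t1900\t1980\tactor,producer\ttt0000001,tt0000002\n"
    "nm0000002\tPerson B\t1950\t\\N\tactress\t\\N\n"
)


def read_imdb_tsv(source, array_columns=()):
    # Hypothetical helper: reads an IMDb-style TSV and splits array columns.
    df = pd.read_csv(
        source,
        sep="\t",
        na_values="\\N",          # treat the IMDb missing-value marker as NaN
        keep_default_na=False,    # keep literal strings such as "NA" or "None" as text
        quoting=csv.QUOTE_NONE,   # do not give quote characters special meaning
    )
    for column in array_columns:
        # "a,b,c" -> ["a", "b", "c"]; missing values stay NaN
        df[column] = df[column].str.split(",")
    return df


names = read_imdb_tsv(
    io.StringIO(sample_tsv),
    array_columns=["primaryProfession", "knownForTitles"],
)
print(names)
```

With this loader the '\N' cells no longer show up as text, so a column like deathYear becomes numeric with NaN for people who are still alive.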

In [1]:
# Import list
import pandas as pd

In [3]:
title_principals = pd.read_csv('data/title.principals.tsv', sep='\t')
title_principals.head()

Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0000001,1,nm1588970,self,\N,"[""Self""]"
1,tt0000001,2,nm0005690,director,\N,\N
2,tt0000001,3,nm0005690,producer,producer,\N
3,tt0000001,4,nm0374658,cinematographer,director of photography,\N
4,tt0000002,1,nm0721526,director,\N,\N


# Creating a new data set with only the necessary information for the project

For challenge A (degrees of separation between actors) we need:
- Actor names and nconst from name.basics.tsv, keeping only actors
- tconst and nconst from title.principals.tsv, keeping only rows whose category is actor or actress

We could also use title.basics.tsv to separate movies from non-movies, matching on tconst, but I don't think it is necessary.
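If the movie filter turns out to be useful after all, it could look roughly like the sketch below. It assumes the filter is done on the titleType column of title.basics.tsv and then applied to title.principals.tsv through tconst. The two small DataFrames are made-up stand-ins for the real files.

```python
import pandas as pd

# Made-up miniature stand-ins for title.basics.tsv and title.principals.tsv.
title_basics_sample = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "titleType": ["movie", "short", "tvSeries"],
})
title_principals_sample = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000001", "tt0000002", "tt0000003"],
    "nconst": ["nm0000001", "nm0000002", "nm0000001", "nm0000003"],
    "category": ["actor", "actress", "actor", "actress"],
})

# Collect the identifiers of titles whose type is "movie".
movie_ids = title_basics_sample.loc[
    title_basics_sample["titleType"] == "movie", "tconst"
]

# Keep only the principals rows that belong to one of those titles.
movie_principals = title_principals_sample[
    title_principals_sample["tconst"].isin(movie_ids)
]
print(movie_principals)
```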

In [10]:
# Loading the datasets into pandas DataFrames
base_path = "data/"

name_basics_df = pd.read_csv(f"{base_path}name.basics.tsv", sep='\t')
# title_akas_df = pd.read_csv(f"{base_path}title.akas.tsv", sep='\t')
# title_basics_df = pd.read_csv(f"{base_path}title.basics.tsv", sep='\t')
# title_crew_df = pd.read_csv(f"{base_path}title.crew.tsv", sep='\t')
# title_episode_df = pd.read_csv(f"{base_path}title.episode.tsv", sep='\t')
title_principals_df = pd.read_csv(f"{base_path}title.principals.tsv", sep='\t')
# title_ratings_df = pd.read_csv(f"{base_path}title.ratings.tsv", sep='\t')
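Since only a few columns are needed for the project, one option is to read just those columns instead of the full files. The sketch below uses the `usecols` and `dtype` parameters of `pd.read_csv` on a made-up inline sample in the shape of title.principals.tsv. Whether this is worth doing depends on how much memory the full load takes on your machine.

```python
import io

import pandas as pd

# Made-up rows in the shape of title.principals.tsv (not real IMDb data).
principals_tsv = (
    'tconst\tordering\tnconst\tcategory\tjob\tcharacters\n'
    'tt0000001\t1\tnm0000001\tself\t\\N\t["Self"]\n'
    'tt0000001\t2\tnm0000002\tdirector\t\\N\t\\N\n'
    'tt0000002\t1\tnm0000003\tactor\t\\N\t["Hero"]\n'
)

principals_small = pd.read_csv(
    io.StringIO(principals_tsv),
    sep="\t",
    usecols=["tconst", "nconst", "category"],  # skip ordering, job and characters
    dtype={"category": "category"},            # few distinct values, so store as categorical
)
print(principals_small)
```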

In [13]:
name_basics_df[['nconst', 'primaryName']]

Unnamed: 0,nconst,primaryName
0,nm0000001,Fred Astaire
1,nm0000002,Lauren Bacall
2,nm0000003,Brigitte Bardot
3,nm0000004,John Belushi
4,nm0000005,Ingmar Bergman
...,...,...
13378119,nm9993714,Romeo del Rosario
13378120,nm9993716,Essias Loberg
13378121,nm9993717,Harikrishnan Rajan
13378122,nm9993718,Aayush Nair


Challenge A)

In [14]:
actors_only_df = title_principals_df[title_principals_df['category'].isin(['actor', 'actress'])][['tconst', 'nconst', 'category']]
nconst_name_mapping_df = name_basics_df[['nconst', 'primaryName']]
challenge_A_df = pd.merge(actors_only_df, nconst_name_mapping_df, on='nconst', how='inner')
challenge_A_df
# We can save that DataFrame and use it for the project

Unnamed: 0,tconst,nconst,category,primaryName
0,tt0000005,nm0443482,actor,Charles Kayser
1,tt0000005,nm0653042,actor,John Ott
2,tt0000007,nm0179163,actor,James J. Corbett
3,tt0003116,nm0179163,actor,James J. Corbett
4,tt0003730,nm0179163,actor,James J. Corbett
...,...,...,...,...
35503716,tt9916856,nm10538646,actor,Andreas Demmel
35503717,tt9916856,nm10538647,actress,Kathrin Knöpfle
35503718,tt9916856,nm10538651,actress,Beatrice Bresolin
35503719,tt9916856,nm10538648,actor,Amit Goldenberg
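One possible way to use a table of this shape for challenge A is a breadth-first search in which two actors are linked when they share a tconst. The sketch below is only an illustration under that assumption: the function name `degrees_of_separation` is hypothetical, and the small DataFrame uses made-up identifiers rather than rows from challenge_A_df.

```python
from collections import defaultdict, deque

import pandas as pd

# Made-up sample with the same tconst/nconst columns as challenge_A_df.
sample_df = pd.DataFrame({
    "tconst": ["tt1", "tt1", "tt2", "tt2", "tt3"],
    "nconst": ["nm1", "nm2", "nm2", "nm3", "nm4"],
})


def degrees_of_separation(df, start, goal):
    # Hypothetical helper: number of shared-title hops from start to goal,
    # or None if either actor is unknown or no chain of shared titles exists.
    title_to_actors = defaultdict(set)
    actor_to_titles = defaultdict(set)
    for row in df.itertuples(index=False):
        title_to_actors[row.tconst].add(row.nconst)
        actor_to_titles[row.nconst].add(row.tconst)

    if start not in actor_to_titles or goal not in actor_to_titles:
        return None

    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        actor, distance = queue.popleft()
        if actor == goal:
            return distance
        # Every co-star in every title of this actor is one hop further away.
        for title in actor_to_titles[actor]:
            for co_actor in title_to_actors[title]:
                if co_actor not in visited:
                    visited.add(co_actor)
                    queue.append((co_actor, distance + 1))
    return None


print(degrees_of_separation(sample_df, "nm1", "nm3"))
```

In the sample, nm1 and nm3 never appear in the same title but both worked with nm2, so they are two hops apart, while nm4 shares no title with anyone and cannot be reached.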
