## Dataset information ([Source](https://developer.imdb.com/non-commercial-datasets/))
# title.akas.tsv
- titleId (string) - a tconst, an alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- title (string) – the localized title
- region (string) - the region for this version of the title
- language (string) - the language of the title
- types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
- attributes (array) - Additional terms to describe this alternative title, not enumerated
- isOriginalTitle (boolean) – 0: not original title; 1: original title

# title.basics.tsv
- tconst (string) - alphanumeric unique identifier of the title
- titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
- primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) - original title, in the original language
- isAdult (boolean) - 0: non-adult title; 1: adult title
- startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
- endYear (YYYY) – TV Series end year. '\N' for all other title types
- runtimeMinutes – primary runtime of the title, in minutes
- genres (string array) – includes up to three genres associated with the title

# title.crew.tsv
- tconst (string) - alphanumeric unique identifier of the title
- directors (array of nconsts) - director(s) of the given title
- writers (array of nconsts) – writer(s) of the given title

# title.episode.tsv
- tconst (string) - alphanumeric identifier of episode
- parentTconst (string) - alphanumeric identifier of the parent TV Series
- seasonNumber (integer) – season number the episode belongs to
- episodeNumber (integer) – episode number of the tconst in the TV series

# title.principals.tsv
- tconst (string) - alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- nconst (string) - alphanumeric unique identifier of the name/person
- category (string) - the category of job that person was in
- job (string) - the specific job title if applicable, else '\N'
- characters (string) - the name of the character played if applicable, else '\N'

# title.ratings.tsv
- tconst (string) - alphanumeric unique identifier of the title
- averageRating – weighted average of all the individual user ratings
- numVotes - number of votes the title has received

# name.basics.tsv
- nconst (string) - alphanumeric unique identifier of the name/person
- primaryName (string) - name by which the person is most often credited
- birthYear – in YYYY format
- deathYear – in YYYY format if applicable, else '\N'
- primaryProfession (array of strings) - the top-3 professions of the person
- knownForTitles (array of tconsts) – titles the person is known for
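
The descriptions above rely on two conventions that are worth handling at load time: missing values are written as the literal `\N`, and array columns are stored as comma-separated strings. A minimal sketch, assuming a tiny inline sample (two rows with shortened `knownForTitles` lists) in place of the real name.basics.tsv file:

```python
import io
import pandas as pd

# Tiny inline sample standing in for name.basics.tsv (shortened, illustrative rows)
sample_tsv = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\tprimaryProfession\tknownForTitles\n"
    "nm0000102\tKevin Bacon\t1958\t\\N\tactor,producer,director\ttt0087277,tt0164052\n"
    "nm10210267\tKevin Bacon\t\\N\t\\N\t\\N\ttt1554356\n"
)

# Treat the literal \N marker as a missing value while parsing
names = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

# Array columns arrive as comma-separated strings; split them into lists
names["primaryProfession"] = names["primaryProfession"].str.split(",")

print(names[["nconst", "birthYear", "primaryProfession"]])
```

Note that the marker has to be written as `"\\N"` in Python source, because `"\N"` on its own starts a Unicode name escape.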

In [1]:
# Import list
import pandas as pd
import time
# Loading the datasets into pandas DataFrames
base_path = "data/"

In [2]:
title_principals = pd.read_csv('data/title.principals.tsv', sep='\t')
title_principals.head()

Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0000001,1,nm1588970,self,\N,"[""Self""]"
1,tt0000001,2,nm0005690,director,\N,\N
2,tt0000001,3,nm0005690,producer,producer,\N
3,tt0000001,4,nm0374658,cinematographer,director of photography,\N
4,tt0000002,1,nm0721526,director,\N,\N


# Challenge A

For Challenge A (degrees of separation between actors), we need:
- From name.basics.tsv: nconst and primaryName, to attach a name to each actor/actress
- From title.principals.tsv: tconst and nconst, keeping only rows whose category is actor or actress

We could also use title.basics.tsv to separate movies from non-movies based on the tconst ID, but I don't think it is necessary.
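
For reference, the optional movie filter could look like the sketch below. It assumes small in-memory stand-ins for title.basics.tsv and title.principals.tsv; the rows, the IDs, and the names `title_basics_sample` and `principals_sample` are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (made-up IDs and rows)
title_basics_sample = pd.DataFrame({
    "tconst": ["tt_a", "tt_b", "tt_c"],
    "titleType": ["movie", "short", "movie"],
})
principals_sample = pd.DataFrame({
    "tconst": ["tt_a", "tt_b", "tt_c", "tt_c"],
    "nconst": ["nm_x", "nm_y", "nm_x", "nm_z"],
    "category": ["actor", "actress", "actor", "director"],
})

# Keep only the tconst values whose titleType is "movie"
movie_ids = title_basics_sample.loc[
    title_basics_sample["titleType"] == "movie", "tconst"
]

# Restrict the rows to actors/actresses appearing in those movies
is_cast = principals_sample["category"].isin(["actor", "actress"])
is_movie = principals_sample["tconst"].isin(movie_ids)
movie_cast = principals_sample.loc[is_cast & is_movie, ["tconst", "nconst"]]

print(movie_cast)
```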

In [15]:
# Start the timer
start_time = time.time()

name_basics_df = pd.read_csv(f"{base_path}name.basics.tsv", sep='\t')
title_principals_df = pd.read_csv(f"{base_path}title.principals.tsv", sep='\t')

# End the timer
end_time = time.time()
# Calculate and print the total time taken
total_time = end_time - start_time
print(f"Total time to load data : {total_time} seconds")

Total time to load data : 110.52546501159668 seconds
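
Loading the two files took about 110 seconds. Since Challenge A only uses a few columns, one possible way to reduce load time and memory is to read just those columns. The sketch below shows the idea on an inline sample with the title.principals.tsv layout; it is not a measured improvement on the full files.

```python
import io
import pandas as pd

# Inline sample with the title.principals.tsv column layout
sample_tsv = (
    "tconst\tordering\tnconst\tcategory\tjob\tcharacters\n"
    "tt0000001\t2\tnm0005690\tdirector\t\\N\t\\N\n"
    "tt0000001\t3\tnm0005690\tproducer\tproducer\t\\N\n"
    "tt0000002\t1\tnm0721526\tdirector\t\\N\t\\N\n"
)

principals = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    usecols=["tconst", "nconst", "category"],  # skip the columns we never use
    dtype={"category": "category"},            # few distinct values, so store compactly
    na_values="\\N",                           # IMDb's marker for missing values
)

print(principals.dtypes)
```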


In [4]:
name_basics_df[['nconst', 'primaryName']].head()

Unnamed: 0,nconst,primaryName
0,nm0000001,Fred Astaire
1,nm0000002,Lauren Bacall
2,nm0000003,Brigitte Bardot
3,nm0000004,John Belushi
4,nm0000005,Ingmar Bergman


In [5]:
title_principals_df[['tconst', 'nconst', 'category']].head(50)

Unnamed: 0,tconst,nconst,category
0,tt0000001,nm1588970,self
1,tt0000001,nm0005690,director
2,tt0000001,nm0005690,producer
3,tt0000001,nm0374658,cinematographer
4,tt0000002,nm0721526,director
5,tt0000002,nm1335271,composer
6,tt0000003,nm0721526,director
7,tt0000003,nm1770680,producer
8,tt0000003,nm0721526,producer
9,tt0000003,nm1335271,composer


In [16]:
# Start the timer
start_time = time.time()

actors_only_df = title_principals_df[title_principals_df['category'].isin(['actor', 'actress'])][['tconst', 'nconst']]
challenge_A_df = pd.merge(actors_only_df, name_basics_df[['nconst', 'primaryName']], on='nconst')


# End the timer
end_time = time.time()
# Calculate and print the total time taken
total_time = end_time - start_time
print(f"Total time to merge data : {total_time} seconds")

challenge_A_df.head(10)

Total time to merge data : 42.00557494163513 seconds


Unnamed: 0,tconst,nconst,primaryName
0,tt0000005,nm0443482,Charles Kayser
1,tt0000005,nm0653042,John Ott
2,tt0000007,nm0179163,James J. Corbett
3,tt0003116,nm0179163,James J. Corbett
4,tt0003730,nm0179163,James J. Corbett
5,tt0003730,nm0179163,James J. Corbett
6,tt0010460,nm0179163,James J. Corbett
7,tt0011603,nm0179163,James J. Corbett
8,tt0012927,nm0179163,James J. Corbett
9,tt0020949,nm0179163,James J. Corbett
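
The output above lists the pair (tt0003730, nm0179163) twice, presumably because title.principals.tsv can hold several rows for the same person in one title. Repeated pairs add nothing to the co-star graph, so one option is to drop them before computing Bacon Numbers. A sketch using rows copied from that output:

```python
import pandas as pd

# Rows copied from the merged output shown above
cast = pd.DataFrame({
    "tconst": ["tt0003116", "tt0003730", "tt0003730", "tt0010460"],
    "nconst": ["nm0179163"] * 4,
    "primaryName": ["James J. Corbett"] * 4,
})

# Keep one row per (title, person) pair
cast_unique = cast.drop_duplicates(subset=["tconst", "nconst"]).reset_index(drop=True)

print(len(cast), "->", len(cast_unique))
```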


In [5]:
# To test later with only that dataframe
# challenge_A_df.to_csv('ChallA.csv', index=False)

In [None]:
# Need to sort the DF by movie so that we can try the Bacon Number with just a sample of the DF without having only movie with one actor
# NOT USEFUL IF WE USE THE FULL DATA
# sorted_df = challenge_A_df.sort_values(by='tconst')
# challenge_A_df = sorted_df.iloc[10000000:11000000]
# challenge_A_df

In [17]:
# Look up Kevin Bacon's nconst (swap in another name to start from a different actor)
kevin_bacon_nconst = name_basics_df[name_basics_df['primaryName'] == 'Kevin Bacon']['nconst'].iloc[0]
kevin_bacon_nconst

'nm0000102'

In [42]:
# Possible problem later: several people named Kevin Bacon are in the database
name_basics_df[name_basics_df['primaryName'] == 'Kevin Bacon']

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
101,nm0000102,Kevin Bacon,1958,\N,"actor,producer,director","tt0087277,tt0164052,tt0361127,tt0327056"
1116648,nm10210267,Kevin Bacon,\N,\N,\N,tt1554356
2339911,nm11500328,Kevin Bacon,\N,\N,sound_department,"tt0206501,tt9187244"
3395206,nm12606581,Kevin Bacon,\N,\N,sound_department,tt10810428
4539805,nm13843189,Kevin Bacon,\N,\N,sound_department,tt18554124
8216275,nm3636162,Kevin Bacon,\N,\N,"camera_department,actor,cinematographer","tt1954811,tt4875654,tt2063666,tt2186712"
8563563,nm4025714,Kevin Bacon,\N,\N,"sound_department,director,writer","tt1737781,tt3067274,tt0098193,tt8429394"
12816242,nm9323132,Kevin Bacon,\N,\N,"actor,soundtrack","tt5112578,tt7470112,tt0259795"
13206272,nm9792572,Kevin Bacon,\N,\N,sound_department,"tt28358231,tt28485954,tt11506404,tt8183128"
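
Since nine people share the name, taking `.iloc[0]` only works here because the well-known actor happens to come first. One hedged alternative is to pick, among the namesakes, the person with the most acting credits. The helper name `pick_most_credited` and the miniature tables below are hypothetical.

```python
import pandas as pd

def pick_most_credited(names_df, cast_df, name):
    """Return the nconst with the most rows in cast_df among people called `name`."""
    candidates = names_df.loc[names_df["primaryName"] == name, "nconst"]
    credits = cast_df.loc[cast_df["nconst"].isin(candidates), "nconst"].value_counts()
    if credits.empty:
        raise ValueError(f"No acting credits found for {name!r}")
    return credits.idxmax()

# Hypothetical miniature tables: two namesakes, one with more credits
names_df = pd.DataFrame({
    "nconst": ["nm_a", "nm_b", "nm_c"],
    "primaryName": ["Same Name", "Same Name", "Other Person"],
})
cast_df = pd.DataFrame({
    "tconst": ["tt_1", "tt_2", "tt_3", "tt_3"],
    "nconst": ["nm_a", "nm_a", "nm_b", "nm_c"],
})

chosen = pick_most_credited(names_df, cast_df, "Same Name")
print(chosen)
```

With the real data, `cast_df` would be the actor/actress rows and `names_df` the name table; whether "most credits" is the right tie-breaker is a judgment call.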


In [18]:
# Start the timer
start_time = time.time()

# Initialize the Bacon Numbers DataFrame
bacon_numbers = pd.DataFrame({'nconst': [kevin_bacon_nconst], 'BaconNumber': [0]})
processed_movies = set()

# While loop until no new movies can be processed
while True:
    # Select rows of not-yet-processed movies that feature at least one actor with a known Bacon Number
    known_actor_rows = challenge_A_df['nconst'].isin(bacon_numbers['nconst'])
    reachable_movies = challenge_A_df.loc[known_actor_rows, 'tconst']
    new_movies = challenge_A_df[
        challenge_A_df['tconst'].isin(reachable_movies)
        & ~challenge_A_df['tconst'].isin(processed_movies)
    ]
    processed_movies.update(new_movies['tconst'].unique())

    # If no new movies to process, break
    if new_movies.empty:
        break

    # Get all actors from these new movies
    new_actors = challenge_A_df[challenge_A_df['tconst'].isin(new_movies['tconst'])]

    # Proceed only if there are truly new actors
    if not new_actors.empty:
        new_bacons = new_actors.merge(bacon_numbers, on='nconst', how='left')
        new_bacons['BaconNumber'] = new_bacons.groupby('tconst')['BaconNumber'].transform(lambda x: x.min() + 1)

        # Update the Bacon Numbers DataFrame
        bacon_numbers = pd.concat([bacon_numbers, new_bacons[['nconst', 'BaconNumber']].drop_duplicates()], ignore_index=True)

        # Ensure that each actor has only the lowest Bacon Number
        bacon_numbers = bacon_numbers.groupby('nconst').agg({'BaconNumber': 'min'}).reset_index()

# End the timer
end_time = time.time()

# Calculate and print the total time taken
total_time = end_time - start_time
print(f"Total time taken to compute the Bacon Numbers: {total_time} seconds")

Total time taken to compute the Bacon Numbers: 1183.2064416408539 seconds


Timings when starting from A Martinez:
- 0.1 million rows: 0.4 s
- 1 million rows: 22 s
- 2 million rows: 50 s
- 5 million rows: 129 s
- full 35 million rows: 919 s (about 15 min)

Timings when starting from Kevin Bacon:
- full dataset: 1201 s (about 20 min)
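
The loop above rescans the whole DataFrame on every pass, which is probably where most of the running time goes. An alternative sketch is a plain breadth-first search over two dictionaries (title to cast, person to titles). It is shown on a made-up toy table and has not been timed on the full dataset; `bacon_numbers_bfs` is a hypothetical helper name.

```python
from collections import defaultdict, deque
import pandas as pd

def bacon_numbers_bfs(cast_df, source):
    """Breadth-first search from `source` over the actor/title graph."""
    cast_of = defaultdict(set)    # tconst -> set of nconst
    titles_of = defaultdict(set)  # nconst -> set of tconst
    for tconst, nconst in zip(cast_df["tconst"], cast_df["nconst"]):
        cast_of[tconst].add(nconst)
        titles_of[nconst].add(tconst)

    distance = {source: 0}
    seen_titles = set()
    queue = deque([source])
    while queue:
        person = queue.popleft()
        for tconst in titles_of[person]:
            if tconst in seen_titles:
                continue  # each title is expanded only once
            seen_titles.add(tconst)
            for co_star in cast_of[tconst]:
                if co_star not in distance:
                    distance[co_star] = distance[person] + 1
                    queue.append(co_star)
    return distance

# Made-up toy data: a-b share t1, b-c share t2, c-d share t3, e is isolated
toy = pd.DataFrame({
    "tconst": ["t1", "t1", "t2", "t2", "t3", "t3", "t4"],
    "nconst": ["a", "b", "b", "c", "c", "d", "e"],
})
toy_distances = bacon_numbers_bfs(toy, "a")
print(toy_distances)
```

People who cannot be reached from the source (here `e`) simply do not appear in the result, which matches the behaviour of the DataFrame loop.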

In [6]:
# Sort the Bacon Numbers DataFrame by the Bacon Number
bacon_numbers = bacon_numbers.sort_values(by='BaconNumber')
bacon_numbers

Unnamed: 0,nconst,BaconNumber
91,nm0000102,0.0
600,nm0000622,1.0
1901915,nm4456120,1.0
3461,nm0005213,1.0
289187,nm0783033,1.0
...,...,...
2099564,nm5476488,11.0
1001928,nm13801027,11.0
1211922,nm15338150,12.0
641763,nm11549950,12.0
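
The Bacon Numbers display as floats (0.0, 1.0, ...), most likely because the left merge inside the loop introduces NaN for actors without a known number, which forces a float column. Once the loop has finished, every remaining row should have a value, so the column can be cast to integers and summarised. A sketch on a few rows copied from the output above:

```python
import pandas as pd

# A few rows copied from the sorted output above
bacon_sample = pd.DataFrame({
    "nconst": ["nm0000102", "nm0000622", "nm4456120", "nm5476488", "nm15338150"],
    "BaconNumber": [0.0, 1.0, 1.0, 11.0, 12.0],
})

# Assumes no NaN remains at this point, so the integer cast is safe
bacon_sample["BaconNumber"] = bacon_sample["BaconNumber"].astype(int)

# Number of people at each degree of separation
distribution = bacon_sample["BaconNumber"].value_counts().sort_index()
print(distribution)
```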


# Challenge B
For Challenge B (average rating of the titles each actor appears in), we need:
- From title.principals.tsv: tconst and nconst, keeping only actor/actress rows
- From title.ratings.tsv: tconst and averageRating

In [11]:
# Start the timer
start_time = time.time()

title_principals_df = pd.read_csv(f"{base_path}title.principals.tsv", sep='\t')
title_ratings_df = pd.read_csv(f"{base_path}title.ratings.tsv", sep='\t')

# End the timer
end_time = time.time()
# Calculate and print the total time taken
total_time = end_time - start_time
print(f"Total time to load data : {total_time} seconds")

Total time to load data : 82.9723207950592 seconds


In [12]:
# Start the timer
start_time = time.time()
actors_only_df = title_principals_df[title_principals_df['category'].isin(['actor', 'actress'])][['tconst', 'nconst']]
challenge_B_df = pd.merge(actors_only_df, title_ratings_df[['tconst', 'averageRating']], on='tconst')

# End the timer
end_time = time.time()
# Calculate and print the total time taken
total_time = end_time - start_time
print(f"Total time to merge data : {total_time} seconds")

challenge_B_df.head(20)

Total time to merge data : 14.546685218811035 seconds


Unnamed: 0,tconst,nconst,averageRating
0,tt0000005,nm0443482,6.2
1,tt0000005,nm0653042,6.2
2,tt0000007,nm0179163,5.4
3,tt0000007,nm0183947,5.4
4,tt0000008,nm0653028,5.4
5,tt0000009,nm0063086,5.3
6,tt0000009,nm0183823,5.3
7,tt0000009,nm1309758,5.3
8,tt0000011,nm3692297,5.2
9,tt0000014,nm0166380,7.1


In [13]:
# Start the timer
start_time = time.time()

average_ratings = challenge_B_df.groupby('nconst')['averageRating'].mean().reset_index()

# End the timer
end_time = time.time()
# Calculate and print the total time taken
total_time = end_time - start_time
print(f"Total time to group the data: {total_time} seconds")

average_ratings.head(50)

Total time to group the data: 5.2402184009552 seconds


Unnamed: 0,nconst,averageRating
0,nm0000001,6.867692
1,nm0000002,6.748
2,nm0000003,5.864103
3,nm0000004,7.075124
4,nm0000005,7.471429
5,nm0000006,6.612963
6,nm0000007,6.766667
7,nm0000008,7.106897
8,nm0000009,7.499153
9,nm0000010,6.734783


In [9]:
average_ratings[average_ratings['nconst']=='nm0183823']

Unnamed: 0,nconst,averageRating
63711,nm0183823,5.933333
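
A plain mean treats an actor with a single rated title the same as one with hundreds. One possible refinement is to report the number of rated titles next to the mean, so that thin samples can be filtered out. The sketch below uses made-up rows with the same columns as `challenge_B_df`; the threshold of 2 titles is an arbitrary assumption.

```python
import pandas as pd

# Made-up rows with the same columns as challenge_B_df
ratings_sample = pd.DataFrame({
    "tconst": ["tt_1", "tt_2", "tt_3", "tt_3"],
    "nconst": ["nm_a", "nm_a", "nm_a", "nm_b"],
    "averageRating": [5.0, 6.0, 7.0, 7.0],
})

# Mean rating and number of rated titles per person
per_actor = (
    ratings_sample.groupby("nconst")["averageRating"]
    .agg(meanRating="mean", numTitles="count")
    .reset_index()
)

# Hypothetical threshold: keep people with at least 2 rated titles
reliable = per_actor[per_actor["numTitles"] >= 2]
print(reliable)
```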
