# Data Cleaning

Please be aware that this notebook loads several very large datasets, and running all the cells will take a few minutes.

## Imports

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

## Movies Duration

In this first part, we clean the title.basics dataset and reduce it to a much smaller CSV file, which the Streamlit app uses to plot the graph.

In [2]:
# We first load the dataset into a dataframe
# Due to its size, we isolate this step into one independent cell

basics_df = pd.read_csv("https://datasets.imdbws.com/title.basics.tsv.gz", sep="\t", low_memory=False)

In [3]:
# We filter out the adult category, and select only the movies

basics_df_movies = basics_df[basics_df['isAdult'] == '0']
basics_df_movies =  basics_df_movies[basics_df_movies['titleType'] == 'movie']

# We split the genres column to isolate the main genre of the movies
basics_df_movies[['mainGenre', 'secondaryGenres']] = basics_df_movies['genres'].str.split(',', n=1, expand=True)

# We replace the missing values by a proper NaN

basics_df_movies_nan = basics_df_movies.replace('\\N', np.nan)

# We remove the columns we don't need anymore and drop the missing values

runtime = basics_df_movies_nan[['startYear', 'runtimeMinutes', 'mainGenre']]
runtime_dropna = runtime.dropna()
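
As a possible alternative to replacing `'\\N'` after loading, pandas can treat that marker as missing while parsing, through the `na_values` parameter of `read_csv`. The sketch below is only an illustration on a tiny in-memory table: the rows and the `toy_` names are made up, not taken from the IMDb files. Note that parsing this way may change column dtypes (for instance `isAdult` may be read as a number rather than the string `'0'` that the cells above compare against).

```python
import io

import pandas as pd

# Hypothetical rows using the same column names as title.basics
toy_tsv = (
    "tconst\ttitleType\tisAdult\tstartYear\truntimeMinutes\tgenres\n"
    "tt0000001\tmovie\t0\t1994\t142\tDrama\n"
    "tt0000002\tmovie\t0\t\\N\t\\N\tComedy,Romance\n"
    "tt0000003\tshort\t0\t2001\t12\t\\N\n"
)

# na_values tells the parser to read the '\N' marker as NaN directly
toy_df = pd.read_csv(io.StringIO(toy_tsv), sep="\t", na_values="\\N")

# Same split as in the cell above: first genre, then everything else
toy_df[["mainGenre", "secondaryGenres"]] = toy_df["genres"].str.split(",", n=1, expand=True)

print(toy_df[["tconst", "startYear", "mainGenre", "secondaryGenres"]])
```

With this approach the later `replace('\\N', np.nan)` step would no longer be needed, at the cost of rechecking every comparison that relies on string values.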

In [4]:
# We change the type of those columns to integers and set boundaries to the movies duration

runtime_int = runtime_dropna.astype({'runtimeMinutes': int})
runtime_int = runtime_int.astype({'startYear': int})
runtime_int = runtime_int[(runtime_int['runtimeMinutes'] >= 58) & (runtime_int['runtimeMinutes'] <= 270)]

# We limit the date range to the movies that were released between the years 1918 and 2021

runtime_int = runtime_int[(runtime_int['startYear'] >= 1918) & (runtime_int['startYear'] <= 2021)]
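
The conversion and range filters above can also be written with `pd.to_numeric` and `Series.between`. This is a minimal sketch on made-up rows (the `toy_` names are hypothetical); it assumes the same bounds as the cell above, 58 to 270 minutes and 1918 to 2021.

```python
import pandas as pd

# Hypothetical rows, stored as strings like in the raw dataset
toy_runtime = pd.DataFrame({
    "startYear": ["1917", "1950", "2005", "2022"],
    "runtimeMinutes": ["90", "45", "120", "100"],
    "mainGenre": ["Drama", "Comedy", "Action", "Crime"],
})

# errors="coerce" turns anything that is not a number into NaN instead of raising
toy_runtime["startYear"] = pd.to_numeric(toy_runtime["startYear"], errors="coerce")
toy_runtime["runtimeMinutes"] = pd.to_numeric(toy_runtime["runtimeMinutes"], errors="coerce")

# Series.between is inclusive on both ends by default
keep = (
    toy_runtime["runtimeMinutes"].between(58, 270)
    & toy_runtime["startYear"].between(1918, 2021)
)
toy_filtered = toy_runtime[keep]

print(toy_filtered)
```

Only the 2005 row passes both filters here: the others fall outside the year range or the duration range.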

In [5]:
# We group by year and get the mean of the duration of the movies of all genres for each year, rounded to 2 decimals

runtime_mean = runtime_int.drop('mainGenre', axis=1)
runtime_mean = round(runtime_mean.groupby('startYear').mean('runtimeMinutes'), 2)
runtime_mean.reset_index(inplace=True)

In [6]:
# Now let's have a look at a sample of the different genres, which we determined to be the ones counting the most top rated movies:
# Comedy, Drama, Action, Adventure, Crime
# Let's start with Comedy

runtime_comedy = runtime_int[runtime_int['mainGenre'] == 'Comedy']
runtime_comedy = runtime_comedy.drop('mainGenre', axis=1)
runtime_comedy = round(runtime_comedy.groupby('startYear').mean('runtimeMinutes'), 2)
runtime_comedy.rename({'runtimeMinutes': 'Comedy'}, axis=1, inplace=True)
runtime_comedy.reset_index(inplace=True)

In [7]:
# Next Drama

runtime_drama = runtime_int[runtime_int['mainGenre'] == 'Drama']
runtime_drama = runtime_drama.drop('mainGenre', axis=1)
runtime_drama = round(runtime_drama.groupby('startYear').mean('runtimeMinutes'), 2)
runtime_drama.rename({'runtimeMinutes': 'Drama'}, axis=1, inplace=True)
runtime_drama.reset_index(inplace=True)

In [8]:
# Next Action

runtime_action = runtime_int[runtime_int['mainGenre'] == 'Action']
runtime_action = runtime_action.drop('mainGenre', axis=1)
runtime_action = round(runtime_action.groupby('startYear').mean('runtimeMinutes'), 2)
runtime_action.rename({'runtimeMinutes': 'Action'}, axis=1, inplace=True)
runtime_action.reset_index(inplace=True)

In [9]:
# Next Adventure

runtime_adventure = runtime_int[runtime_int['mainGenre'] == 'Adventure']
runtime_adventure = runtime_adventure.drop('mainGenre', axis=1)
runtime_adventure = round(runtime_adventure.groupby('startYear').mean('runtimeMinutes'), 2)
runtime_adventure.rename({'runtimeMinutes': 'Adventure'}, axis=1, inplace=True)
runtime_adventure.reset_index(inplace=True)

In [10]:
# And finally Crime

runtime_crime = runtime_int[runtime_int['mainGenre'] == 'Crime']
runtime_crime = runtime_crime.drop('mainGenre', axis=1)
runtime_crime = round(runtime_crime.groupby('startYear').mean('runtimeMinutes'), 2)
runtime_crime.rename({'runtimeMinutes': 'Crime'}, axis=1, inplace=True)
runtime_crime.reset_index(inplace=True)

In [11]:
# Now we can gather all that data into one final dataframe

# We merge on startYear rather than assigning columns directly, so that a genre
# with no movie in a given year gets a NaN for that year instead of shifting the rows

runtime_final = runtime_mean.rename(columns={'runtimeMinutes': 'Average'})
for genre_df in (runtime_comedy, runtime_drama, runtime_action, runtime_adventure, runtime_crime):
    runtime_final = runtime_final.merge(genre_df, on='startYear', how='left')
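
The overall average and the per-genre averages could also be computed in one pass with `pivot_table`, which keys every value on the year. The sketch below runs on made-up rows (the `toy_` names and the reduced genre list are assumptions for illustration), not on the real `runtime_int` dataframe.

```python
import pandas as pd

# Hypothetical rows with the same columns as runtime_int
toy_runtime_int = pd.DataFrame({
    "startYear": [1950, 1950, 1950, 1951, 1951],
    "runtimeMinutes": [90, 100, 80, 120, 110],
    "mainGenre": ["Comedy", "Comedy", "Drama", "Drama", "Horror"],
})
genres = ["Comedy", "Drama"]

# Mean over all genres, one value per year
overall = toy_runtime_int.groupby("startYear")["runtimeMinutes"].mean().rename("Average")

# One column per selected genre, one row per year
per_genre = toy_runtime_int[toy_runtime_int["mainGenre"].isin(genres)].pivot_table(
    index="startYear", columns="mainGenre", values="runtimeMinutes", aggfunc="mean"
)

# Joining on the startYear index leaves NaN where a genre has no movie that year
toy_final = overall.to_frame().join(per_genre).round(2).reset_index()

print(toy_final)
```

In this toy table, 1951 has no comedy, so its `Comedy` cell is NaN while the other columns stay attached to the right year.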


In [12]:
# We can now plot the results using plotly

fig = go.Figure()

fig.add_trace(go.Scatter(
  x=runtime_final['startYear'],
  y=runtime_final['Average'],
  line_shape='spline',
  line_color='green',
  name='Average'
))

fig.update_layout(
        width=1300,
        height=600,
        template='plotly_dark',
        title='Average Movie Duration per Year',
        xaxis_title='Year',
        yaxis_title='Duration in Minutes'
    )

fig.show()

In [13]:
# We add another plot showing the average duration of the movies per genre

fig = go.Figure()

fig.add_trace(go.Scatter(
  x=runtime_final['startYear'],
  y=runtime_final['Average'],
  line_shape='spline',
  line_color='green',
  line_width=8,
  opacity=0.9,
  name='Average'
))

fig.add_trace(go.Scatter(
  x=runtime_final['startYear'],
  y=runtime_final['Comedy'],
  line_shape='spline',
  line_color='beige',
  line_width=1,
  opacity=0.8,
  name='Comedy'
))

fig.add_trace(go.Scatter(
  x=runtime_final['startYear'],
  y=runtime_final['Drama'],
  line_shape='spline',
  line_color='blueviolet',
  line_width=1,
  opacity=0.8,
  name='Drama'
))

fig.add_trace(go.Scatter(
  x=runtime_final['startYear'],
  y=runtime_final['Adventure'],
  line_shape='spline',
  line_color='coral',
  line_width=1,
  opacity=0.8,
  name='Adventure'
))

fig.add_trace(go.Scatter(
  x=runtime_final['startYear'],
  y=runtime_final['Action'],
  line_shape='spline',
  line_color='royalblue',
  line_width=1,
  opacity=0.8,
  name='Action'
))

fig.add_trace(go.Scatter(
  x=runtime_final['startYear'],
  y=runtime_final['Crime'],
  line_shape='spline',
  line_color='red',
  line_width=1,
  opacity=0.8,
  name='Crime'
))

fig.update_layout(
        width=1300,
        height=600,
        template='plotly_dark',
        title='Average Movie Duration per Year',
        legend_title='Genre',
        xaxis_title='Year',
        yaxis_title='Duration in Minutes'
    )

fig.show()

In [14]:
# We can now export the cleaned dataframe to a csv file

runtime_final.to_csv('data/movies_duration.csv.zip', compression='zip')

## Top Rated Movies

In [15]:
# We first load the required dataset into a dataframe

ratings_df = pd.read_csv("https://datasets.imdbws.com/title.ratings.tsv.gz", sep="\t")

In [16]:
# The title.basics dataset is already loaded and filtered on non-adult movies, so we only need to select the relevant columns

basics_df_stripped = basics_df_movies[['tconst', 'primaryTitle', 'startYear', 'genres']]

# We can then merge the two dataframes

movies_ratings = pd.merge(basics_df_stripped, ratings_df, how='inner', left_on='tconst', right_on='tconst')
movies_ratings.reset_index(drop=True, inplace=True)
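
When merging on `tconst`, it can help to let pandas check that the key is unique on both sides, using the `validate` parameter of `pd.merge`. The sketch below uses made-up rows (the `toy_` names are hypothetical); it assumes each title appears at most once in each table, which should be verified on the real data.

```python
import pandas as pd

# Hypothetical rows standing in for the basics and ratings tables
toy_basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "primaryTitle": ["A", "B", "C"],
})
toy_ratings = pd.DataFrame({
    "tconst": ["tt1", "tt3", "tt4"],
    "averageRating": [8.5, 6.1, 7.0],
    "numVotes": [25000, 300, 1200],
})

# validate="one_to_one" raises a MergeError if tconst is duplicated on either side
toy_merged = pd.merge(toy_basics, toy_ratings, how="inner", on="tconst", validate="one_to_one")

print(toy_merged)
```

The inner join keeps only the titles present in both tables, here `tt1` and `tt3`.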

In [17]:
# We finally filter the dataframe based on the rating and the number of votes
# This ensures that we only keep the very best movies that were decently popular
# This also removes possible outliers with very few votes that could be heavily biased

movies_ratings_top = movies_ratings[movies_ratings['averageRating'] >= 8.4]
movies_ratings_top = movies_ratings_top[movies_ratings_top['numVotes'] >= 20000]

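# Lastly, we split the genres column into two in order to only use the main genre of a movie
# We work on a copy so that the new columns are not assigned to a slice of movies_ratings

movies_ratings_top = movies_ratings_top.copy()
movies_ratings_top[['mainGenre', 'secondaryGenres']] = movies_ratings_top['genres'].str.split(',', n=1, expand=True)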


In [18]:
# Let's plot the results with a 3D scatter plot first

fig = px.scatter_3d(
    movies_ratings_top, 
    x='startYear', 
    y='averageRating', 
    z='mainGenre', 
    color='averageRating', 
    size='numVotes', 
    opacity = 0.8,
    labels={
        'startYear': 'Year',
        'averageRating': 'Rating',
        'mainGenre': 'Genre',
        'numVotes': 'Number of Votes'
    }, 
    size_max=25,    
    hover_name='primaryTitle'
)


fig.update_layout(
    width=1300,
    height=1000,
    scene=dict(zaxis=dict(nticks=11)),
    title='IMDB Top Rated Movies (>= 8.4) per Genre, Number of Votes and Year',
    template='plotly_dark'
)
fig.show()

In [19]:
# Then let's use a bar chart to see the distribution of the genres

fig = px.histogram(
    data_frame=movies_ratings_top,
    x='mainGenre',
    color='mainGenre',
    labels={'mainGenre': 'Genre'},
    color_discrete_sequence=px.colors.qualitative.Pastel
)

fig.update_layout(
    width=1300,
    height=600,
    title='IMDB Top Rated Movies (>= 8.4) Genre Distribution',
    template='plotly_dark'
)
fig.show()

In [20]:
# We can now export the dataframe as a new csv file

movies_ratings_top.to_csv('data/movies_ratings.csv.zip', compression='zip')

## Most Active Actors

In [21]:
# We load two other datasets into dataframes

name_df = pd.read_csv("https://datasets.imdbws.com/name.basics.tsv.gz", sep="\t")
principals_df = pd.read_csv("https://datasets.imdbws.com/title.principals.tsv.gz", sep="\t")

In [22]:
# We now select only the actors and actresses

actors = principals_df[(principals_df['category'] == 'actor') | (principals_df['category'] == 'actress')]

# And then we merge that dataframe with the names

list_actors = pd.merge(actors, name_df, how='inner')

In [23]:
# We can reuse the title.basics dataframe for another merge

cols = ['tconst', 'startYear','titleType', 'isAdult', 'runtimeMinutes','genres']
actors_titleType = pd.merge(list_actors, basics_df[cols], how='inner')

In [24]:
# Let's remove the adult movies again, although we will see in the analysis that this filter is not perfect

actors_titleType_no_adult = actors_titleType[actors_titleType['isAdult']=='0']

# We now remove all the title types other than movies, replace missing values with NaN, and drop columns

actors_movies = actors_titleType_no_adult[(actors_titleType_no_adult['titleType']=='movie')]
actors_movies_nan = actors_movies.replace('\\N', np.nan)
actors_movies_filtre = actors_movies_nan[['primaryName','runtimeMinutes','startYear']]

# More cleaning to remove rows with missing values, convert duration and year to integers

actors_movies_clean = actors_movies_filtre.dropna()
actors_movies_clean = actors_movies_clean.astype({'startYear': int})
actors_movies_clean = actors_movies_clean.astype({'runtimeMinutes': int})

# We use the same filter for the movies duration, and we set the first year to 1920 in order to keep proper decades

actors_movies_clean = actors_movies_clean[(actors_movies_clean['runtimeMinutes'] >= 58) & (actors_movies_clean['runtimeMinutes'] <= 270)]
actors_movies_year = actors_movies_clean.loc[actors_movies_clean['startYear'] >= 1920]
actors_movies_year_final =  actors_movies_year[['primaryName', 'startYear']]

In [25]:
# We can now plot the top 5 most productive actors in movies per decade

depart = 1920
fin = 1929
subplot = []
for i in range(11):
    actors_movies_decade = actors_movies_year[(actors_movies_year['startYear'] >= depart)&(actors_movies_year['startYear'] <= fin)]
    temp = actors_movies_decade['primaryName'].value_counts()[:5].rename_axis('name').reset_index(name='count')
    subplot.append(temp)
    depart += 10
    fin += 10
globa = actors_movies_year['primaryName'].value_counts()[:5].rename_axis('name').reset_index(name='count')


fig = make_subplots(
    rows=4, cols=3,
    subplot_titles=('1920-1929', '1930-1939','1940-1949','1950-1959','1960-1969','1970-1979','1980-1989','1990-1999','2000-2009','2010-2019','2020-2029','Overall Results'),
    )

# We add one bar chart per decade, then the overall results, filling the 4x3 grid row by row

for i, top5 in enumerate(subplot + [globa]):
    fig.add_trace(
        go.Bar(x=top5['name'],
        y=top5['count'],
        marker_color=px.colors.qualitative.Plotly),
        row=i // 3 + 1, col=i % 3 + 1
    )

fig.update_layout(
    template='plotly_dark',
    title='5 Most Active Actors in Movies per Decade',
    showlegend=False,
    height = 1250,
    width=1300
)
fig.show()
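
The per-decade counts built in the loop above could also be obtained by deriving a decade column with integer division and grouping on it. This is a sketch on made-up names (the `toy_` dataframe and `top_n` are hypothetical), and ties may be ordered differently than with `value_counts`.

```python
import pandas as pd

# Hypothetical credits: one row per actor per movie
toy_actors = pd.DataFrame({
    "primaryName": ["Ann", "Ann", "Bob", "Ann", "Bob", "Cid"],
    "startYear": [1921, 1925, 1929, 1930, 1931, 1931],
})

# 1921 -> 1920, 1931 -> 1930, and so on
toy_actors["decade"] = toy_actors["startYear"] // 10 * 10

top_n = 2
toy_counts = (
    toy_actors.groupby(["decade", "primaryName"]).size().reset_index(name="count")
    .sort_values(["decade", "count"], ascending=[True, False])
    .groupby("decade").head(top_n)
)

print(toy_counts)
```

Each decade then keeps its `top_n` most frequent names, which can be passed to the plotting code one decade at a time.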

In [26]:
# We can now export the file as a csv

actors_movies_year.to_csv('data/actors_movies_year.csv.zip', compression='zip')

In [27]:
# We will now repeat the operations above to get the actors from series

actors_series = actors_titleType.loc[actors_titleType['titleType']=='tvSeries']

# We split the genres into different columns in order to filter some of them out

actors_series_genres = actors_series.copy()
actors_series_genres[['mainGenre', 'secondaryGenres','thirdGenres']] = actors_series_genres['genres'].str.split(',', expand=True)
actors_series_nan = actors_series_genres.replace('\\N', np.nan)

In [28]:
# We can now filter out all the genres that are not relevant for our analysis

excluded_genres = ['Animation', 'Reality-TV', 'Talk-Show', 'Game-Show', 'Adult', 'Short']

# A series is removed as soon as one of its genre columns holds an excluded genre
for col in ['mainGenre', 'secondaryGenres', 'thirdGenres']:
    actors_series_nan = actors_series_nan[~actors_series_nan[col].isin(excluded_genres)]

In [29]:
# Now we can drop the nan values and keep only the columns we need

actors_series_filtre = actors_series_nan[['primaryName','startYear']]
actors_series_clean = actors_series_filtre.dropna()
actors_series_clean = actors_series_clean.astype({'startYear': int})

# One last filter to keep, as previously, only the data from 1920 onwards

actors_series_year = actors_series_clean[actors_series_clean['startYear']>=1920]

In [30]:
# We can now run the same script on the new data to display the same plots, with actors from series

depart = 1920
fin = 1929
subplot = []
for i in range(11):
    actors_series_decade = actors_series_year[(actors_series_year['startYear']>=depart)&(actors_series_year['startYear']<=fin)]
    temp = actors_series_decade['primaryName'].value_counts()[:5].rename_axis('name').reset_index(name='count')
    subplot.append(temp)
    depart+=10
    fin+=10
globa = actors_series_year['primaryName'].value_counts()[:5].rename_axis('name').reset_index(name='count')

fig = make_subplots(
    rows=4, cols=3,
    subplot_titles=('1920-1929', '1930-1939','1940-1949','1950-1959','1960-1969','1970-1979','1980-1989','1990-1999','2000-2009','2010-2019','2020-2029','Overall Results'),
    )

# We add one bar chart per decade, then the overall results, filling the 4x3 grid row by row

for i, top5 in enumerate(subplot + [globa]):
    fig.add_trace(
        go.Bar(x=top5['name'],
        y=top5['count'],
        marker_color=px.colors.qualitative.Plotly),
        row=i // 3 + 1, col=i % 3 + 1
    )

fig.update_layout(
    template='plotly_dark',
    title='5 Most Active Actors in Series per Decade',
    showlegend=False,
    height = 1250,
    width=1300
)
fig.show()

In [31]:
# We can now export the dataframe into a csv

actors_series_year.to_csv('data/actors_series_year.csv.zip', compression='zip')

## Mean Age of Actors in Movies

In [32]:
# We need the name.basics dataset again; it is already loaded, so we start by sorting it by birth year to help with the cleaning process

actors = name_df.sort_values(by=['birthYear']).reset_index(drop=True)
actors.replace('\\N', np.nan, inplace=True)

In [33]:
# We now need the title.basics dataset

movies = basics_df.sort_values(by=['startYear']).reset_index(drop=True)
movies.replace('\\N', np.nan, inplace=True)


In [34]:
# Last table required here, the title.principals dataset

title = principals_df.copy()
title = title.loc[(title['category']=='actor')|(title['category']=='actress')]
title.reset_index(drop=True, inplace=True)
title.replace('\\N', np.nan, inplace=True)

In [35]:
# Now we can start properly cleaning the movies dataframe by keeping only the movie category

movies = movies[movies['titleType'] == 'movie']

# We then filter out the adult movies and drop the unneeded columns

movies = movies[movies['isAdult'] == '0']
movies = movies.drop(columns=['primaryTitle','titleType', 'endYear', 'isAdult'])

# We apply the same filter on movie duration as before and drop the NaN rows

movies.dropna(subset=['startYear', 'runtimeMinutes' ], inplace=True)
movies = movies.astype({'startYear': int})
movies = movies.astype({'runtimeMinutes': int})
movies = movies[(movies['runtimeMinutes']>=58) & (movies['runtimeMinutes']<=270)]

# We now set the index on the tconst column

movies = movies.sort_values('tconst')
movies = movies.set_index('tconst')


In [36]:
# Next we clean the actors dataframe, starting with dropping the nan values in the birth year column

actors.dropna(subset=['birthYear'], inplace=True)

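# We now set this column to int and limit the birth year to 1800 minimum
# The conversion comes first, because comparing strings would be lexicographic (for example '900' >= '1800' is True)

actors = actors.astype({'birthYear': int})
actors = actors[actors['birthYear'] >= 1800]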

# We set the index to the nconst column

actors = actors.sort_values('nconst')
actors = actors.set_index('nconst')

In [37]:
# We now work on the title dataframe and drop the unneeded columns

title.drop(columns=['ordering','job', 'characters'], inplace=True)

# We set the index on nconst again, since it is the key used for the merge in the next cell

title = title.sort_values('nconst')
title = title.set_index('nconst')

In [38]:
# Now to the merging part, starting with actors and title

df_actors_title = actors.merge(title , how='inner', on='nconst')
df_actors_title = df_actors_title.reset_index()

# Now the new dataframe df_actors_title and movies

actors_title_movies = df_actors_title.merge(movies, how='inner', on='tconst' )

In [39]:
# There are outliers showing actors as cast of a movie released long after their death, so we'll filter those
# Some cases exist where an actor died during production and was not alive during the release, so we'll allow a 2-year delta
# Actors with no death year have a NaN year_diff, so we keep them explicitly instead of letting the comparison drop them

actors_title_movies = actors_title_movies.astype({'deathYear': float})
actors_title_movies['year_diff'] = actors_title_movies['deathYear'] - actors_title_movies['startYear']
actors_title_movies = actors_title_movies[actors_title_movies['year_diff'].isna() | (actors_title_movies['year_diff'] >= -2)]
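
One point worth keeping in mind with this kind of filter: in pandas, any comparison with NaN evaluates to False, so a row without a death year only survives a `>=` test if it is kept explicitly. The sketch below shows the difference on made-up rows (all names and years are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical cast: one living actor, one posthumous outlier, one death during production
toy_cast = pd.DataFrame({
    "primaryName": ["Alive Actor", "Died Before", "Died During"],
    "deathYear": [np.nan, 1950.0, 1999.0],
    "startYear": [2000, 1990, 2000],
})
toy_cast["year_diff"] = toy_cast["deathYear"] - toy_cast["startYear"]

# NaN >= -2 is False, so the living actor is dropped by the plain comparison
plain_filter = toy_cast[toy_cast["year_diff"] >= -2]

# Keeping NaN explicitly retains actors with no recorded death year
nan_aware_filter = toy_cast[toy_cast["year_diff"].isna() | (toy_cast["year_diff"] >= -2)]

print(plain_filter["primaryName"].tolist())
print(nan_aware_filter["primaryName"].tolist())
```

The posthumous outlier is removed in both cases; only the treatment of the living actor differs.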

In [40]:
# We can now create another column to compute the age of the actor or actress when the movie was released, and drop the ones we don't need

actors_title_movies['age_actor_actress'] = actors_title_movies['startYear'] - actors_title_movies['birthYear']
actors_title_movies.reset_index(drop=True, inplace=True)
actors_title_movies.drop(columns=['tconst', 'nconst', 'primaryProfession','knownForTitles','runtimeMinutes', 'genres', 'year_diff'], inplace=True)

# We apply another filter based on the release year of the movie, the same as before

actors_title_movies = actors_title_movies[(actors_title_movies['startYear'] >= 1918) & (actors_title_movies['startYear'] <= 2021)]

In [41]:
# We group the data by year and aggregate the data by the mean age of the actors, and drop more columns

age_mean_year = round(actors_title_movies.groupby(by=["startYear"]).mean('age_actor_actress'),2)
age_mean_year.reset_index(inplace=True)
age_mean_year.drop(columns=['deathYear', 'birthYear'], inplace =True)

# We can rename the age column to make its meaning more explicit

age_mean_year.rename(columns={'age_actor_actress':'mean_age_actors_actress'}, inplace=True)

In [42]:
# Now that we have a usable table containing the mean age of the acting staff per year, we can do the same with actors and actresses specifically, starting with actors

actors = actors_title_movies[actors_title_movies['category'] == "actor"]

# And actresses
actress = actors_title_movies[actors_title_movies['category'] == "actress"]

In [43]:
# Same operations as before, here with the actors only

age_mean_year_actors = round(actors.groupby(by=["startYear"]).mean('age_actor_actress'),2)
age_mean_year_actors.reset_index(inplace=True)
age_mean_year_actors.drop(columns=['deathYear', 'birthYear'], inplace=True)
age_mean_year_actors.rename(columns={'age_actor_actress':'mean_age_actors'}, inplace=True)

In [44]:
# Now with actresses

age_mean_year_actress = round(actress.groupby(by=["startYear"]).mean('age_actor_actress'),2)
age_mean_year_actress.reset_index(inplace=True)
age_mean_year_actress.drop(columns=['deathYear', 'birthYear'], inplace=True)
age_mean_year_actress.rename(columns={'age_actor_actress':'mean_age_actress'}, inplace=True)

In [45]:
# Finally we can merge all that data into a single dataframe

# We merge on startYear so that each mean stays attached to its own year, even if a year is missing from one table

df_graph = age_mean_year.merge(age_mean_year_actress, on='startYear', how='left')
df_graph = df_graph.merge(age_mean_year_actors, on='startYear', how='left')

In [46]:
# We can now plot the graph, as a line chart showing all 3 age columns

fig = go.Figure() 

fig.add_trace(go.Scatter(x=df_graph.startYear, 
                    y=df_graph.mean_age_actors_actress, 
                    name="Both Genders",
                    line_shape='spline',
                    line_color='green'))

fig.add_trace(go.Scatter(x=df_graph.startYear, 
                    y=df_graph.mean_age_actress,
                    name="Actresses",
                    line_shape='spline',
                    line_color='rgb(231,107,243)'))

fig.add_trace(go.Scatter(x=df_graph.startYear, 
                    y=df_graph.mean_age_actors,
                    name="Actors",
                    line_shape='spline',
                    line_color='blue'))

fig.update_layout(title ='Mean Age of Actors and Actresses',
                    width=1300,
                    height=600,
                    legend_title="Gender",
                    template='plotly_dark'
                    )

fig.show()

In [47]:
# We can finally export the dataframe as a csv

df_graph.to_csv('data/actors_age.csv.zip', compression='zip')

## Recommendation Algorithm Preprocessing

In [97]:
# Here we repeat the steps from the previous cleanings: we keep only movies, remove adult movies, and apply the same runtime and release year filters

basics_df_2 =  basics_df.drop('endYear',axis=1)
movies = basics_df_2.loc[basics_df_2['titleType']=='movie']
movies = movies.drop('titleType',axis=1)
movies_ok = movies.loc[movies['isAdult']=='0']
movies_ok = movies_ok.drop('isAdult',axis=1)
movies_ok_nan = movies_ok.replace('\\N', np.nan)
movies_clean = movies_ok_nan.dropna()
movies_clean = movies_clean.astype({'runtimeMinutes': int})
movies_clean = movies_clean.astype({'startYear': int})
movies_clean = movies_clean[(movies_clean['runtimeMinutes'] >= 58) & (movies_clean['runtimeMinutes'] <= 270)]
movies_clean = movies_clean[(movies_clean['startYear'] >= 1918) & (movies_clean['startYear'] <= 2021)]

In [98]:
# Here we filter out movies with a rating lower than 6.0 or with fewer than 1000 votes

movies_rating = pd.merge(movies_clean, ratings_df, how='inner')
movies_7 = movies_rating.loc[movies_rating['averageRating']>=6.0]
movies_7_rating = movies_7[movies_7['numVotes'] >= 1000]

# Now we can drop the unnecessary columns and reset the index

movies_merged = movies_7_rating.drop(columns=['primaryTitle','runtimeMinutes','averageRating','numVotes'],axis =1)
movies_merged.reset_index(drop=True, inplace=True)

In [99]:
# We need to cast the year type as string to add it to the column that will be used by the algorithm

movies_merged = movies_merged.astype({'startYear': str})

# Now we can concatenate the year to the genres to create a new column

movies_merged['data'] = movies_merged['startYear'] + ',' + movies_merged['genres']

# And we drop the original columns

movies_merged.drop(columns=['startYear','genres'],axis=1,inplace=True)

In [100]:
# We load the title.crew dataset to get the directors and writers

crew_df = pd.read_csv("https://datasets.imdbws.com/title.crew.tsv.gz", sep="\t")

In [101]:
# We merge the crew data into our main dataframe

movies_crew = pd.merge(movies_merged,crew_df,how='inner')

# We replace the null values with a None string, as this is information that could be used by the algorithm

movies_crew = movies_crew.replace('\\N', 'None')

In [102]:
# We get the actors from the title.principals dataset

actors = principals_df.loc[(principals_df['category']=='actor')|(principals_df['category']=='actress')]
actors_clean = actors.copy()
actors_clean.drop(columns=['ordering', 'category', 'job', 'characters'],axis=1,inplace=True)

In [103]:
# We now group by the tconst and aggregate the nconst data as a list

actors_clean = (actors_clean.groupby(['tconst'])['nconst'].apply(list)).reset_index()

# And we can merge this data to our main Dataframe

movies_all = pd.merge(movies_crew, actors_clean, how='inner')

In [104]:
# Let's make the actors column name a bit more readable

movies_all.rename(columns={'nconst': 'actors'}, inplace=True)

# And now we can transform the lists into single strings

movies_all['actors'] = movies_all['actors'].str.join(',')

# Finally we concat all those columns into the data column

movies_all['data'] = movies_all['data'] + ',' + movies_all['directors'] + ',' + movies_all['writers'] + ',' + movies_all['actors']
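
The two steps above (collecting the `nconst` values as a list, then joining the list into a string) could also be done in a single aggregation. This is a sketch on made-up identifiers; the `toy_` names are hypothetical.

```python
import pandas as pd

# Hypothetical principals rows: two actors for tt1, one for tt2
toy_principals = pd.DataFrame({
    "tconst": ["tt1", "tt1", "tt2"],
    "nconst": ["nm1", "nm2", "nm3"],
})

# Join all the nconst values of a title into one comma-separated string
toy_actors_per_title = (
    toy_principals.groupby("tconst")["nconst"].agg(",".join).reset_index(name="actors")
)

print(toy_actors_per_title)
```

The result has one row per title with an `actors` column that is already a string, ready to be concatenated to the `data` column.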

In [105]:
# Now we can bring the finishing touches to this dataset's cleaning

movies_final = movies_all.drop(columns=['directors','writers','actors'],axis=1)

# We create a small function to replace the commas with spaces in order to prepare the data for the algorithm

def clear(data):
    # Split on commas and join the pieces back with single spaces
    return " ".join(data.split(','))

# And we apply it

movies_final['data'] = movies_final['data'].apply(clear)

In [107]:
movies_final.shape

(23259, 3)

In [108]:
# And the dataset is now exported as a csv to be used by the recommendation algorithm

movies_final.to_csv('data/movies_merged.csv.zip',index=False, compression='zip')

## Conclusion

To help with the data exploration, besides the usual .head() and .tail() methods that we didn't show here to avoid clutter, we built a conceptual data model (MCD, modèle conceptuel de données) based on the dataset documentation and our own observations. This helped greatly with the merges.

As the datasets are very large, we decided to clean them and reduce them to only the rows and columns needed for the graphs and the recommendation algorithm, and then to export those dataframes to new CSV files, which are the only ones used in the Streamlit app. That way we didn't have to limit our scope to a certain category or time period. The only concession in that regard was to limit the recommendation dataset to movies with an average rating of at least 6.0 and at least 1000 votes. It is worth noting that the algorithm's performance is quite good with those settings, and it is possible, if needed, to expand the dataset to include more movies by lowering the rating and votes filters, without affecting the algorithm's performance too much.

Regarding the variables used to compare and select movies, we decided to use the movie's release year, its director(s), its writers, its actors, and its genres (up to 3 different ones per movie). Our tests showed that the recommendations are already very relevant with those settings. One thing that could be added without too much work is the movie's title, but we didn't feel it was necessary, as the algorithm already finds the other movies from the same series or universe without it, and it could lead to an unwanted bias.
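
To illustrate how a space-separated `data` string like the one prepared above can be compared between movies, here is a minimal content-based sketch. It is an assumption for illustration only: this notebook does not show the recommendation algorithm itself, and the titles, tokens, and the `cosine_similarity` and `recommend` helpers below are hypothetical. It counts tokens per movie and ranks the other movies by cosine similarity.

```python
import math
from collections import Counter

# Hypothetical movies: year, genres, and people identifiers as space-separated tokens
toy_data = {
    "Movie A": "1999 Action Sci-Fi nm001 nm002 nm010",
    "Movie B": "2003 Action Sci-Fi nm001 nm002 nm011",
    "Movie C": "1994 Drama nm020 nm021",
}

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the token counts of two strings."""
    counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(title, data, top_n=2):
    """Return the top_n other titles, most similar first."""
    scores = {
        other: cosine_similarity(data[title], text)
        for other, text in data.items()
        if other != title
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

recommendations = recommend("Movie A", toy_data)
print(recommendations)
```

Here "Movie B" shares two genres and two people with "Movie A", so it ranks first, while "Movie C" shares no token at all. Adding title words to the string would add more shared tokens between movies of the same series, which is the bias mentioned above.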

There are two main ways to improve the algorithm in the future. One would be to get the IMDb summary of every movie and use this text, once tokenized, to find similar movies. Unfortunately, that summary is not available in the dataset, so this would require a very heavy and long scraping process, not to mention that if it's not in the dataset, it's probably for a reason, and scraping that editorial data is morally gray at best.

The other way would be to switch to a user-based system instead of the current content-based one. To do that, the theatre would need to build a user database and store the movies each user watched, along with their rating for each movie. That way we could perform some clustering on the user base and make recommendations to users based on what similar users have watched and liked. This requires a huge amount of data though, and may not be applicable in this scenario, even in the future.

## The End