# Intermovie project

This notebook analyzes a film dataset in order to extract the following information:

- The list of actors by film.

- The list of American films (keeping their name in French) and their average rating.

- The average scores of the different genres.

- The average rating of each actor across the films in which they appear.


## Structure of tsv files

- name.basics :         nconst / primaryName / birthYear / deathYear / primaryProfession / knownForTitles

- title.akas :          titleId / ordering / title / region / language / types / attributes / isOriginalTitle

- title.basics :        tconst / titleType / primaryTitle / originalTitle / isAdult / startYear / endYear / runtimeMinutes / genres

- title.principals :    tconst / ordering / nconst / category / job / characters

- title.ratings :       tconst / averageRating / numVotes


## Library imports

In [2]:
# IPython extension that reloads modified modules before each cell is executed.
%load_ext autoreload
%autoreload 2

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from src.credentials import Credentials as cr
import src.split_datas as sd

## Data split

We use the **split_datas** function to partition a dataset by the values of a given column: it writes one CSV file per distinct value into a folder named after that column.

*Example*: splitting the films by region creates a "region" folder containing one CSV file per region (FR.csv, UK.csv, US.csv, etc.).
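The behaviour described above can be illustrated with a minimal sketch. It is not the project's `split_datas` implementation, whose source is not shown here: the function name `split_by_column`, its signature, and the toy data are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd


def split_by_column(df: pd.DataFrame, column: str, out_dir: Path) -> list[Path]:
    """Hypothetical helper: write one CSV per distinct value of `column`
    into out_dir/<column>/ and return the paths that were written."""
    target = out_dir / column
    target.mkdir(parents=True, exist_ok=True)
    written = []
    # groupby skips rows whose key is NaN, so those rows end up in no file.
    for value, group in df.groupby(column):
        path = target / f"{value}.csv"
        group.to_csv(path, index=False)
        written.append(path)
    return written


# Toy data standing in for title.akas (all values are made up).
akas = pd.DataFrame({
    "titleId": ["tt1", "tt1", "tt2", "tt3"],
    "title": ["Titre", "Title", "Other", "Autre"],
    "region": ["FR", "US", "US", "FR"],
})

with tempfile.TemporaryDirectory() as tmp:
    files = split_by_column(akas, "region", Path(tmp))
    names = sorted(p.name for p in files)
    us_rows = len(pd.read_csv(Path(tmp) / "region" / "US.csv"))
```

On real IMDb files, which are several gigabytes in size, a chunked read (`pd.read_csv(..., chunksize=...)`) would probably be needed instead of loading the whole table at once.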

In [3]:
sd.split_datas(cr.TITLE_BASICS, 'titleType')
sd.split_datas(cr.TITLE_PRINCIPALS, 'category')
sd.split_datas(cr.TITLE_AKAS, 'region')

c:\Users\utilisateur\Documents\Intermovie\Intermovie-Group1\data\CURATED\titleType\
c:\Users\utilisateur\Documents\Intermovie\Intermovie-Group1\data\CURATED\titleType\short.csv
c:\Users\utilisateur\Documents\Intermovie\Intermovie-Group1\data\CURATED\titleType\movie.csv
c:\Users\utilisateur\Documents\Intermovie\Intermovie-Group1\data\CURATED\titleType\tvMovie.csv
c:\Users\utilisateur\Documents\Intermovie\Intermovie-Group1\data\CURATED\titleType\tvSeries.csv
c:\Users\utilisateur\Documents\Intermovie\Intermovie-Group1\data\CURATED\titleType\tvEpisode.csv
c:\Users\utilisateur\Documents\Intermovie\Intermovie-Group1\data\CURATED\titleType\tvShort.csv
c:\Users\utilisateur\Documents\Intermovie\Intermovie-Group1\data\CURATED\titleType\tvMiniSeries.csv
c:\Users\utilisateur\Documents\Intermovie\Intermovie-Group1\data\CURATED\titleType\tvSpecial.csv
c:\Users\utilisateur\Documents\Intermovie\Intermovie-Group1\data\CURATED\titleType\video.csv
c:\Users\utilisateur\Documents\Intermovie\Intermovie-Grou

## Creation of "global" DataFrames

The following cells create the DataFrames that are reused throughout this notebook. By convention they are named global_<filename>, after the source file they are loaded from.
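Each of these cells follows the same pattern: read a few columns, drop incomplete rows, drop duplicates. The sketch below factors that pattern into a helper; `load_clean` is a hypothetical name, not part of the project. It also assumes the raw IMDb TSV files mark missing values with the literal string `\N`, which `dropna()` only removes if pandas is told to treat it as NaN.

```python
import io

import pandas as pd


def load_clean(source, columns, sep=","):
    """Hypothetical helper: read selected columns, then drop incomplete
    and duplicated rows."""
    # Assumption: missing values are written as the literal string \N.
    # na_values makes pandas read them as NaN so dropna() can remove them.
    df = pd.read_csv(source, usecols=columns, sep=sep, na_values=["\\N"])
    return df.dropna().drop_duplicates().reset_index(drop=True)


# Made-up sample in the layout of title.ratings.tsv.
sample = io.StringIO(
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t5.6\t1500\n"
    "tt0000001\t5.6\t1500\n"
    "tt0000002\t\\N\t10\n"
)
ratings = load_clean(sample, ["tconst", "averageRating"], sep="\t")
```

In this sample the duplicated row and the row with a missing rating are both removed, leaving a single record.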

In [4]:
global_title_basics = pd.read_csv("./data/CURATED/titleType/movie.csv", usecols = ['tconst', 'originalTitle'])

global_title_basics = global_title_basics.dropna(axis = 0)
global_title_basics = global_title_basics.drop_duplicates()
global_title_basics.describe()

| | tconst | originalTitle |
|---|---|---|
| count | 536034 | 536034 |
| unique | 536034 | 477163 |
| top | tt8764430 | Home |
| freq | 1 | 49 |


In [5]:
local_actor = pd.read_csv("./data/CURATED/category/actor.csv", usecols = ['tconst', 'nconst'])
local_actress = pd.read_csv("./data/CURATED/category/actress.csv", usecols = ['tconst', 'nconst'])
local_self = pd.read_csv("./data/CURATED/category/self.csv", usecols = ['tconst', 'nconst'])

global_title_principals = pd.concat([local_actor, local_actress, local_self])
global_title_principals = global_title_principals.dropna(axis = 0)
global_title_principals = global_title_principals.drop_duplicates()

del local_actor
del local_actress
del local_self

global_title_principals.describe()

| | tconst | nconst |
|---|---|---|
| count | 20971198 | 20971198 |
| unique | 5127449 | 2514373 |
| top | tt4295846 | nm0318114 |
| freq | 10 | 8535 |


In [6]:
global_name_basics = pd.read_csv("./data/RAW/name.basics.tsv", usecols = ['nconst', 'primaryName'], delimiter = '\t')

global_name_basics = global_name_basics.dropna(axis = 0)
global_name_basics = global_name_basics.drop_duplicates()
global_name_basics.describe()

| | nconst | primaryName |
|---|---|---|
| count | 9706922 | 9706922 |
| unique | 9706922 | 7618021 |
| top | nm3926615 | David Smith |
| freq | 1 | 310 |


In [7]:
global_title_akas = pd.read_csv("./data/CURATED/region/US.csv", usecols = ['titleId'])

global_title_akas = global_title_akas.dropna(axis = 0)
global_title_akas = global_title_akas.drop_duplicates()
global_title_akas = global_title_akas.rename(columns = {'titleId' : 'tconst'})
global_title_akas.describe()

| | tconst |
|---|---|
| count | 815853 |
| unique | 815853 |
| top | tt5659564 |
| freq | 1 |


In [8]:
global_title_ratings = pd.read_csv("./data/RAW/title.ratings.tsv", usecols = ['tconst', 'averageRating'], delimiter = '\t')

global_title_ratings = global_title_ratings.dropna(axis=0)
global_title_ratings = global_title_ratings.drop_duplicates()
global_title_ratings.describe()

| | averageRating |
|---|---|
| count | 993153.0 |
| mean | 6.886546 |
| std | 1.400235 |
| min | 1.0 |
| 25% | 6.1 |
| 50% | 7.1 |
| 75% | 7.9 |
| max | 10.0 |


# 1- List of actors by film

In [9]:
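# Attach the film title and the performer's name to each credit.
local_request_1 = global_title_principals.merge(global_title_basics, how = 'left', on = 'tconst')
local_request_1 = local_request_1.merge(global_name_basics, how = 'left', on = 'nconst')
local_request_1 = local_request_1.drop(['tconst', 'nconst'], axis = 1)
# Credits without a matching movie title or name come out of the left joins as NaN and are dropped here.
local_request_1 = local_request_1.dropna(axis = 0)

# Concatenate the cast of each title; films that share the same originalTitle end up in a single row.
local_request_1_final = local_request_1.groupby('originalTitle', as_index = False).agg({'primaryName': ','.join})
local_request_1_final.columns = ['originalTitle', 'movieCasting']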

local_request_1_final.to_csv('./data/REQUESTS/request_1.csv')

del local_request_1
del local_request_1_final
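Since many films share a title ("Home" appears 49 times in the movie table), grouping on the title alone merges their casts. A possible variant, sketched here on made-up toy data, groups on `tconst` as well and uses inner joins in place of a left join followed by `dropna()`. The frame and variable names are illustrative assumptions.

```python
import pandas as pd

# Toy frames mirroring the columns used above (all values are made up).
principals = pd.DataFrame({
    "tconst": ["tt1", "tt1", "tt2", "tt3"],
    "nconst": ["nm1", "nm2", "nm3", "nm1"],
})
titles = pd.DataFrame({
    "tconst": ["tt1", "tt2"],
    "originalTitle": ["Home", "Home"],  # two different films, same title
})
names = pd.DataFrame({
    "nconst": ["nm1", "nm2", "nm3"],
    "primaryName": ["Ann Lee", "Bob Ray", "Cy Dunn"],
})

# Inner joins (the default) keep only credits that match a title and a name.
cast = principals.merge(titles, on="tconst").merge(names, on="nconst")

# Grouping on the identifier as well as the title keeps homonyms apart.
casting = (
    cast.groupby(["tconst", "originalTitle"], as_index=False)["primaryName"]
    .agg(", ".join)
    .rename(columns={"primaryName": "movieCasting"})
)
```

Here the two films titled "Home" keep separate cast lists, and the credit for `tt3`, which has no entry in the title table, is discarded by the join.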

# 2- List of American films (keeping their name in French) and their average rating

In [10]:
local_request_2 = global_title_akas.merge(global_title_basics, how = 'left', on = 'tconst')
local_request_2 = local_request_2.merge(global_title_ratings, how = 'left', on = 'tconst')
local_request_2 = local_request_2.drop(['tconst'], axis = 1)
local_request_2 = local_request_2.dropna(axis = 0)
local_request_2 = local_request_2.drop_duplicates()

local_request_2.to_csv('./data/REQUESTS/request_2.csv')

mean_averageRating = local_request_2['averageRating'].mean()
print(mean_averageRating)

del local_request_2

6.130151340059558


# 3- The average scores of the different genres

In [11]:
local_title_basics = pd.read_csv("./data/RAW/title.basics.tsv", usecols = ['tconst', 'genres'], delimiter = '\t')
local_title_basics_split = local_title_basics['genres'].str.split(",", expand = True)
local_title_basics_split = local_title_basics_split.join(local_title_basics).drop(['genres'], axis = 1)

local_request_3 = local_title_basics_split.merge(global_title_ratings, how = 'left', on = 'tconst')

local_mean_1 = local_request_3.groupby([0])['averageRating'].mean()
local_mean_2 = local_request_3.groupby([1])['averageRating'].mean()
local_mean_3 = local_request_3.groupby([2])['averageRating'].mean()

local_mean = pd.concat([local_mean_1, local_mean_2, local_mean_3], axis = 1, keys = ["mean1", "mean2", "mean3"])
local_mean['mean'] = local_mean[['mean1', 'mean2', 'mean3']].mean(axis = 1)

local_request_3 = local_mean.drop(['mean1', 'mean2', 'mean3'], axis = 1)

local_request_3.to_csv('./data/REQUESTS/request_3.csv')

del local_title_basics
del local_title_basics_split
del local_mean_1
del local_mean_2
del local_mean_3
del local_mean
del local_request_3
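The cell above averages the three per-position means, so a film counts differently depending on where a genre appears in its list. An alternative, sketched below on made-up data, produces one row per (film, genre) pair with `DataFrame.explode` and takes a single mean per genre. This is an illustration under stated assumptions, not the notebook's method.

```python
import pandas as pd

# Toy frames in the layout of title.basics and title.ratings (made-up values).
basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "genres": ["Drama,Comedy", "Comedy", "Drama", "Comedy"],
})
ratings = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "averageRating": [8.0, 6.0, 5.0, 4.0],
})

# One row per (film, genre): split the comma-separated string, then explode.
per_genre = (
    basics.assign(genre=basics["genres"].str.split(","))
    .explode("genre")
    .merge(ratings, on="tconst")
)

# Each film now counts exactly once towards every genre it belongs to.
genre_mean = per_genre.groupby("genre")["averageRating"].mean()
```

With the real files, titles whose genre is missing would form a genre of their own unless the missing-value marker is converted to NaN when the file is read.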

# 4- The average rating of each actor across the films in which they appear


In [12]:
local_request_4 = global_title_principals.merge(global_title_basics, how = 'left', on = 'tconst')
local_request_4 = local_request_4.merge(global_title_ratings, how = 'left', on = 'tconst')
local_request_4 = local_request_4.merge(global_name_basics, how = 'left', on = 'nconst')
local_request_4 = local_request_4.drop(['tconst', 'nconst'], axis = 1)
local_request_4 = local_request_4.dropna(axis = 0)

# Per-actor mean rating over the films they are credited in.
local_actor_average_rating = local_request_4.groupby('primaryName', as_index = False)['averageRating'].mean()
local_actor_average_rating.columns = ['primaryName', 'actorAverageRating']
# Per-actor filmography as a comma-separated string.
local_actor_filmography = local_request_4.groupby('primaryName', as_index = False).agg({'originalTitle' : ','.join})
local_actor_filmography.columns = ['primaryName', 'actorFilmography']

local_request_4 = local_request_4.merge(local_actor_average_rating, how = 'left', on = 'primaryName')
local_request_4 = local_request_4.merge(local_actor_filmography, how = 'left', on = 'primaryName')
# Once the per-film columns are dropped, each actor is repeated once per film: keep a single row per actor.
local_request_4 = local_request_4.drop(['originalTitle', 'averageRating'], axis = 1).drop_duplicates()

local_request_4.to_csv('./data/REQUESTS/request_4.csv')

del local_actor_average_rating
del local_actor_filmography
del local_request_4
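The two group-bys and the two merges can also be expressed as a single named aggregation. The sketch below uses made-up data and additionally groups on `nconst`, so that distinct people who share a name (310 entries are called "David Smith") are not averaged together. The frame `roles` is an illustrative stand-in for the merged table.

```python
import pandas as pd

# Toy frame standing in for the merged principals/titles/ratings/names
# table (all values are made up).
roles = pd.DataFrame({
    "nconst": ["nm1", "nm1", "nm2", "nm3"],
    "primaryName": ["David Smith", "David Smith", "David Smith", "Ann Lee"],
    "originalTitle": ["A", "B", "C", "A"],
    "averageRating": [6.0, 8.0, 4.0, 6.0],
})

# Named aggregation builds both per-actor columns in one pass and yields
# one row per person; nconst keeps namesakes separate.
per_actor = roles.groupby(["nconst", "primaryName"], as_index=False).agg(
    actorAverageRating=("averageRating", "mean"),
    actorFilmography=("originalTitle", ", ".join),
)
```

The result has one row per person, so no deduplication step is needed afterwards.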

In [13]:
del global_name_basics
del global_title_akas
del global_title_basics
del global_title_principals
del global_title_ratings