# Intermovie
* The list of actors per film
* The list of American films (keeping their French title) and their average rating
* The average ratings of the different genres
* The average rating of each actor across the films in which they appear

## Imports and useful functions

In [4]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import random
from sklearn.feature_extraction.text import CountVectorizer

from modules.loader import IntermovieDataLoader 
#TODO filter column csv
#TODO idempotence

CURATED_LOCAL_PATH = '../data/CURATED/'
RAW_LOCAL_PATH = '../data/RAW/'

data_loader = IntermovieDataLoader()

In [None]:
data_loader.split_data('title.basics.tsv', 'titleType', 'movie')
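
The `IntermovieDataLoader.split_data` helper lives in `modules.loader` and is not shown in this notebook. Judging only from how it is called here, it appears to filter a raw TSV on one column and write one CSV per requested value into the curated folder. A minimal sketch of that idea, assuming this behaviour; the function name `split_tsv_by_value` and its parameters are hypothetical and are not the project's actual implementation:

```python
import csv
from pathlib import Path


def split_tsv_by_value(raw_path, curated_dir, column, values):
    """Write one CSV per requested value of `column` (hypothetical helper)."""
    # Accept a single value or a list of values, as the calls in this notebook do.
    if isinstance(values, str):
        values = [values]
    curated_dir = Path(curated_dir)
    curated_dir.mkdir(parents=True, exist_ok=True)

    files, writers = {}, {}
    try:
        with open(raw_path, newline='', encoding='utf-8') as src:
            # QUOTE_NONE: treat quote characters inside titles as plain text.
            reader = csv.DictReader(src, delimiter='\t', quoting=csv.QUOTE_NONE)
            for value in values:
                files[value] = open(curated_dir / f'{value}.csv', 'w',
                                    newline='', encoding='utf-8')
                writers[value] = csv.DictWriter(files[value],
                                                fieldnames=reader.fieldnames)
                writers[value].writeheader()
            # Stream row by row so the large raw file never has to fit in memory.
            for row in reader:
                if row[column] in writers:
                    writers[row[column]].writerow(row)
    finally:
        for handle in files.values():
            handle.close()
    return [curated_dir / f'{value}.csv' for value in values]
```

Streaming the file rather than loading it with `pd.read_csv` is a design choice for this sketch only: `title.principals.tsv` has tens of millions of rows, and a row-by-row filter keeps memory use flat.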

In [8]:
df_movies = pd.read_csv(f'{CURATED_LOCAL_PATH}movie.csv', usecols=['tconst', 'originalTitle', 'genres'], index_col='tconst', encoding='utf-8')
df_names = pd.read_csv(f'{RAW_LOCAL_PATH}name.basics.tsv', sep='\t', usecols=["nconst", "primaryName"], encoding='utf-8')
# df_principals = pd.read_csv(f'{RAW_LOCAL_PATH}title.principals.tsv', sep='\t', usecols=["tconst", "nconst", "category"], encoding='utf-8')

In [6]:
data_loader.split_data('title.principals.tsv', 'category', ['actor', 'actress'])


In [5]:
df_actors = pd.concat([
    pd.read_csv(f'{CURATED_LOCAL_PATH}actor.csv', usecols=['tconst', 'nconst'], encoding='utf-8'),
    pd.read_csv(f'{CURATED_LOCAL_PATH}actress.csv', usecols=['tconst', 'nconst'], encoding='utf-8'),
])


## The list of *actors* per *film*

In [9]:
# Attach each actor's name; nconst is the only column the two frames share
df_actors = df_actors.merge(df_names, on='nconst')

In [10]:
df_actors_groupby_movies = df_actors.groupby('tconst')['primaryName'].apply(', '.join)

In [12]:
df_actors_groupby_movies = df_actors_groupby_movies.to_frame().merge(df_movies, on='tconst')

In [13]:
df_actors_groupby_movies.to_csv(CURATED_LOCAL_PATH +'cast.csv')
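
The three steps above (attach names, join them per film, attach film details) can be checked on a toy dataset. A minimal sketch, assuming small hand-made frames that mimic the IMDb columns; the rows are invented for illustration only:

```python
import pandas as pd

# Toy stand-ins for the curated actor rows, the name table and the movie table.
toy_actors = pd.DataFrame({'tconst': ['tt1', 'tt1', 'tt2'],
                           'nconst': ['nm1', 'nm2', 'nm1']})
toy_names = pd.DataFrame({'nconst': ['nm1', 'nm2'],
                          'primaryName': ['Alice A', 'Bob B']})
toy_movies = pd.DataFrame({'originalTitle': ['First', 'Second']},
                          index=pd.Index(['tt1', 'tt2'], name='tconst'))

# 1. Attach each actor's name; 2. join the names of every film into one string.
cast = (toy_actors.merge(toy_names, on='nconst')
                  .groupby('tconst')['primaryName']
                  .apply(', '.join))

# 3. Attach the film details; both frames are indexed by tconst.
toy_cast = cast.to_frame().join(toy_movies)
```

The result has one row per film, with the cast as a single comma-separated string next to the film's title.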

## The list of American films (keeping their French title) and their average rating

### Films (keeping the French title)

### American titles

In [14]:
df_us = pd.read_csv(f'{CURATED_LOCAL_PATH}US.csv', usecols=['titleId', 'region'], index_col='titleId', encoding='utf-8')
df_us = df_us[~df_us.index.duplicated(keep='first')]
df_us

titleId,region
tt0000001,US
tt0000002,US
tt0000005,US
tt0000006,US
tt0000007,US
...,...
tt0097084,US
tt0097085,US
tt0097086,US
tt0097087,US
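
The de-duplication of the index matters for the merge that follows: a title that appears several times in `US.csv` would otherwise produce several rows per film. A minimal sketch on invented rows, assuming the same index names as above:

```python
import pandas as pd

# Invented rows: tt1 appears twice, as a title with several US entries would.
toy_us = pd.DataFrame({'region': ['US', 'US', 'US']},
                      index=pd.Index(['tt1', 'tt1', 'tt2'], name='titleId'))
toy_movies = pd.DataFrame({'originalTitle': ['First', 'Second', 'Third']},
                          index=pd.Index(['tt1', 'tt2', 'tt3'], name='tconst'))

# Keep only the first row of each repeated index value.
toy_us_unique = toy_us[~toy_us.index.duplicated(keep='first')]

# Inner merge on the two indexes: one row per US film, other films dropped.
toy_movies_us = toy_movies.merge(toy_us_unique, left_index=True, right_index=True)

# For comparison: merging without de-duplicating repeats tt1.
toy_with_duplicates = toy_movies.merge(toy_us, left_index=True, right_index=True)
```

Here `toy_movies_us` holds two films, while `toy_with_duplicates` holds three rows because `tt1` is matched twice.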


In [15]:
df_title_ratings = pd.read_csv(f'{RAW_LOCAL_PATH}title.ratings.tsv', sep='\t', usecols=['tconst', 'averageRating'], index_col='tconst', encoding='utf-8')

In [16]:
# Both frames are indexed by title id, so merge index on index
# (left_on/right_on cannot be combined with left_index/right_index)
df_movies_us = df_movies.merge(df_us.rename_axis('tconst'), left_index=True, right_index=True)

In [19]:
df_movies_us_ratings = df_movies_us.merge(df_title_ratings, left_index=True, right_index=True)
# Keep the index in the CSV: it holds the tconst identifier
df_movies_us_ratings.to_csv(f'{CURATED_LOCAL_PATH}movies_us_ratings.csv')
df_movies_us_ratings

tconst,originalTitle,genres,region,averageRating
tt0000009,Miss Jerry,Romance,US,5.4
tt0000147,The Corbett-Fitzsimmons Fight,"Documentary,News,Sport",US,5.2
tt0000630,Amleto,Drama,US,2.7
tt0000679,The Fairylogue and Radio-Plays,"Adventure,Fantasy",US,4.8
tt0000886,Hamlet,Drama,US,5.2
...,...,...,...,...
tt0097074,Cohen and Tate,"Crime,Thriller",US,6.4
tt0097076,Cold Feet,"Comedy,Crime",US,4.7
tt0097078,Cold Heat,"Action,Drama,Thriller",US,2.8
tt0097079,Cold Justice,Drama,US,6.4


## The average ratings of the different genres


In [20]:
df_title_basics = pd.read_csv(f'{RAW_LOCAL_PATH}title.basics.tsv', sep='\t', usecols=['tconst', 'genres'], index_col='tconst', encoding='utf-8')
df_title_basics

tconst,genres
tt0000001,"Documentary,Short"
tt0000002,"Animation,Short"
tt0000003,"Animation,Comedy,Romance"
tt0000004,"Animation,Short"
tt0000005,"Comedy,Short"
...,...
tt9916848,"Action,Drama,Family"
tt9916850,"Action,Drama,Family"
tt9916852,"Action,Drama,Family"
tt9916856,Short


In [21]:
df_title_basics['genres'] = df_title_basics['genres'].str.split(',')
df_title_basics

tconst,genres
tt0000001,"[Documentary, Short]"
tt0000002,"[Animation, Short]"
tt0000003,"[Animation, Comedy, Romance]"
tt0000004,"[Animation, Short]"
tt0000005,"[Comedy, Short]"
...,...
tt9916848,"[Action, Drama, Family]"
tt9916850,"[Action, Drama, Family]"
tt9916852,"[Action, Drama, Family]"
tt9916856,[Short]


In [23]:
df_genre_ratings = df_title_basics.merge(df_title_ratings, on='tconst')
df_genre_ratings = df_genre_ratings.explode('genres').groupby('genres').mean()
df_genre_ratings.to_csv(CURATED_LOCAL_PATH +'genre_ratings.csv')
df_genre_ratings

genres,averageRating
Action,6.951029
Adult,6.331053
Adventure,7.05673
Animation,7.046786
Biography,7.180115
Comedy,6.919199
Crime,7.165008
Documentary,7.241741
Drama,7.018454
Family,6.989731
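
The explode-then-group step can be verified by hand on a few rows. A minimal sketch with invented ratings, assuming the same `genres` and `averageRating` column names as above:

```python
import pandas as pd

toy = pd.DataFrame({'genres': ['Comedy,Drama', 'Drama', 'Comedy'],
                    'averageRating': [6.0, 8.0, 4.0]},
                   index=pd.Index(['tt1', 'tt2', 'tt3'], name='tconst'))

# Turn the comma-separated string into a list, then emit one row per (film, genre).
toy['genres'] = toy['genres'].str.split(',')
exploded = toy.explode('genres')

# Each film now contributes its rating to every genre it belongs to.
toy_genre_ratings = exploded.groupby('genres')['averageRating'].mean()
```

A film with several genres is counted once in each of them, so every value is a plain mean of film ratings per genre and is not weighted by the number of votes.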


## The average rating of each actor across the films in which they appear

In [29]:
df_actors.merge(df_title_ratings, on='tconst').groupby(['nconst', 'primaryName'])['averageRating'].mean()

nconst     primaryName     
nm0000001  Fred Astaire        6.850000
nm0000002  Lauren Bacall       7.114286
nm0000003  Brigitte Bardot     5.911111
nm0000004  John Belushi        6.550000
nm0000006  Ingrid Bergman      6.789130
                                 ...   
nm9952131  Raul de Oliveira    5.800000
nm9954430  Ivan Kalmykov       7.200000
nm9959928  Johan Price         5.700000
nm9959929  Fred Boston         5.700000
nm9959930  Thomas Boston       5.700000
Name: averageRating, Length: 54385, dtype: float64
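
Because `merge` defaults to an inner join, films without a rating are dropped before the mean is taken, so each actor's average only covers their rated films. A minimal sketch on invented rows, assuming the same column and index names as above:

```python
import pandas as pd

toy_actors = pd.DataFrame({'tconst': ['tt1', 'tt2', 'tt3', 'tt1'],
                           'nconst': ['nm1', 'nm1', 'nm1', 'nm2'],
                           'primaryName': ['Alice A', 'Alice A', 'Alice A', 'Bob B']})

# tt3 has no rating, so the default inner merge drops it.
toy_ratings = pd.DataFrame({'averageRating': [6.0, 8.0]},
                           index=pd.Index(['tt1', 'tt2'], name='tconst'))

# 'tconst' is a column on the left and the index name on the right.
toy_actor_ratings = (toy_actors.merge(toy_ratings, on='tconst')
                               .groupby(['nconst', 'primaryName'])['averageRating']
                               .mean())
```

Grouping on both `nconst` and `primaryName` keeps the readable name in the output while still separating different people who share a name.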

## Bonus
She would like to be able to generate recommendations of new films based on the user's history.

Content-based recommenders suggest items similar to a given item. They rely on item metadata (for movies: genre, director, description, actors, and so on). The underlying idea is that a person who likes a particular item will also like items similar to it, so the system draws on the metadata of the items in the user's history. YouTube is a good example: it suggests new videos based on your *history*.
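
To make the notion of item similarity concrete, the sketch below scores items by the cosine similarity of their bag-of-words metadata, using only the standard library. The three films and their metadata strings are invented, and `cosine` is a hypothetical helper name:

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented metadata: genre tokens followed by squashed cast and crew names.
metadata = {
    'Film A': 'comedy romance aliceA bobB',
    'Film B': 'comedy romance aliceA carolC',
    'Film C': 'horror daveD',
}
bags = {title: Counter(text.lower().split()) for title, text in metadata.items()}

# Rank the other films by similarity to the one the user has watched.
watched = 'Film A'
ranked = sorted((t for t in bags if t != watched),
                key=lambda t: cosine(bags[watched], bags[t]),
                reverse=True)
```

Film B shares three of its four tokens with Film A and therefore ranks first, while Film C shares none.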

### Problem formulation
Build a recommender system that recommends movies based on the genre, cast, crew and some keywords of a previously watched movie.

https://medium.com/analytics-vidhya/metadata-based-recommender-systems-in-python-c6aae213b25c
https://www.datacamp.com/community/tutorials/recommender-systems-python
# ID / Title / Genres / Cast / Crew (director)

In [13]:
data_loader.split_data('title.principals.tsv', 'category', 'director')

In [14]:
df_principals = pd.read_csv(f'{RAW_LOCAL_PATH}title.principals.tsv', sep='\t', usecols=['tconst', 'nconst', 'category'], encoding='utf-8')
df_principals

Unnamed: 0,tconst,nconst,category
0,tt0000001,nm1588970,self
1,tt0000001,nm0005690,director
2,tt0000001,nm0374658,cinematographer
3,tt0000002,nm0721526,director
4,tt0000002,nm1335271,composer
...,...,...,...
36468812,tt9916880,nm0996406,director
36468813,tt9916880,nm1482639,writer
36468814,tt9916880,nm2586970,writer
36468815,tt9916880,nm1594058,producer


In [107]:
actors_group = df_actors.groupby('tconst')['primaryName'].apply(', '.join)
actors_group


In [13]:
df_directors = df_principals[df_principals['category'] == 'director']
# The name table is loaded as df_names at the top of the notebook
df_directors = df_directors.merge(df_names, on='nconst')
df_directors

Unnamed: 0,tconst,nconst,category,primaryName
0,tt0000192,nm0924920,director,James H. White
1,tt0203699,nm0924920,director,James H. White
2,tt0203782,nm0924920,director,James H. White
3,tt0203948,nm0924920,director,James H. White
4,tt0203956,nm0924920,director,James H. White
...,...,...,...,...
166465,tt9915110,nm5225408,director,Juan Cota-Ysaguirre
166466,tt9916144,nm10282565,director,Odd Arne Olderbakk
166467,tt9916146,nm10282565,director,Odd Arne Olderbakk
166468,tt9916152,nm10282565,director,Odd Arne Olderbakk


In [14]:
directors_group = df_directors.groupby('tconst')['primaryName'].apply(', '.join)
directors_group

tconst
tt0000192       James H. White
tt0000194    Cecil M. Hepworth
tt0000609         Gaston Velle
tt0000615     Charles MacMahon
tt0000846        Hermanos Alva
                   ...        
tt9916666         Yash Chauhan
tt9916670         Hilary Audus
tt9916682         Hilary Audus
tt9916688         Hilary Audus
tt9916880         Hilary Audus
Name: primaryName, Length: 164256, dtype: object

In [106]:
df_actors_group = actors_group.to_frame()
df_actors_group = df_actors_group.rename(columns={'primaryName': 'cast'})


In [16]:
df_directors_group = directors_group.to_frame()
df_directors_group = df_directors_group.rename(columns={'primaryName': 'crew'})

In [105]:
df_cast_crew = df_actors_group.merge(df_directors_group, on='tconst')


In [104]:
df_merge = df_cast_crew.merge(df_title_basics, on='tconst')
# Squash each name into a single token ("Fred Astaire" -> "FredAstaire"), then split into a list
df_merge['cast'] = df_merge['cast'].str.replace(' ', '', regex=False).str.split(',')
df_merge['crew'] = df_merge['crew'].str.replace(' ', '', regex=False).str.split(',')
# genres is already a list if the split cell above has been run; split only raw strings
df_merge['genres'] = df_merge['genres'].apply(lambda x: x.split(',') if isinstance(x, str) else x)
df_merge


In [46]:
df_merge['metadata'] = df_merge.apply(lambda x : ' '.join(x['genres']) + ' ' + ' '.join(x['cast']) + ' ' + ' '.join(x['crew']), axis = 1)

In [None]:
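from sklearn.metrics.pairwise import cosine_similarity

# Assumption: titles come from df_movies['originalTitle']; reset_index() gives positional row ids
movies_df = df_merge.join(df_movies['originalTitle'].rename('title'), how='inner').reset_index()
count_vec = CountVectorizer(stop_words='english')
count_vec_matrix = count_vec.fit_transform(movies_df['metadata'])
cosine_sim_matrix = cosine_similarity(count_vec_matrix, count_vec_matrix)
# Map each movie title to its row position in the similarity matrix
mapping = pd.Series(movies_df.index, index=movies_df['title'])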

In [None]:

# Recommender function to recommend movies based on metadata
def recommend_movies_based_on_metadata(movie_input):
    # Assumes titles are unique, so that the lookup returns a single row position
    movie_index = mapping[movie_input]
    # Get similarity values with other movies
    similarity_score = list(enumerate(cosine_sim_matrix[movie_index]))
    similarity_score = sorted(similarity_score, key=lambda x: x[1], reverse=True)
    # Keep the 15 most similar movies; position 0 is skipped because it is normally the input movie itself
    similarity_score = similarity_score[1:16]
    movie_indices = [i[0] for i in similarity_score]
    return movies_df['title'].iloc[movie_indices]