# Intermovie
* The list of actors per film
* The list of American films (keeping their French titles) and their average rating
* The average ratings of the different genres
* The average rating of each actor across the films in which they appear

## Imports and useful functions

In [7]:
import pandas as pd
import random

#TODO filter column csv
#TODO idempotence

def skip_idx(file):
    """Return the row indices to skip so that roughly 10% of a csv file is kept."""
    # Count the lines, closing the file once done
    with open(file, encoding='utf-8') as f:
        num_lines = sum(1 for _ in f)
    print("File size", num_lines)

    # Sample size - in this case ~10%
    size = int(num_lines / 10)

    # The row indices to skip - make sure 0 is not included to keep the header!
    random.seed(30)
    # Named differently from the function to avoid shadowing it
    rows_to_skip = random.sample(range(1, num_lines), num_lines - size)

    return rows_to_skip
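
The `#TODO idempotence` note can be approached by keeping the random state local to the helper. The following is a minimal sketch, not the notebook's actual function: `sample_skip_rows` and its parameters are hypothetical names, and it takes the line count as an argument so that it can be tried without a data file.

```python
import random


def sample_skip_rows(num_lines, fraction=0.1, seed=30):
    """Return sorted row indices to skip so that about `fraction` of the data rows remain.

    Hypothetical helper: `num_lines` counts every line of the file, header included.
    """
    keep = int(num_lines * fraction)
    # A private generator leaves the global `random` state untouched,
    # so repeated calls with the same arguments return the same indices.
    rng = random.Random(seed)
    # Row 0 is the header, so it is never a candidate for skipping.
    return sorted(rng.sample(range(1, num_lines), num_lines - 1 - keep))


skip = sample_skip_rows(101)  # header + 100 data rows
```

The returned list can be passed to `pd.read_csv(..., skiprows=skip)` in the same way as the result of `skip_idx`.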

## The list of actors per film
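
This section has no code yet. As a rough sketch of one possible approach, the toy frames below stand in for cast and name tables; the column names (`tconst`, `nconst`, `category`, `primaryName`) are assumed to mirror IMDb's `title.principals` and `name.basics` files, and all values are made up.

```python
import pandas as pd

# Hypothetical stand-in for a cast table: one row per (film, person, role).
principals = pd.DataFrame({
    "tconst": ["tt1", "tt1", "tt1", "tt2"],
    "nconst": ["nm1", "nm2", "nm3", "nm1"],
    "category": ["actor", "actress", "director", "actor"],
})
# Hypothetical stand-in for a name table: one row per person.
names = pd.DataFrame({
    "nconst": ["nm1", "nm2", "nm3"],
    "primaryName": ["A. One", "B. Two", "C. Three"],
})

# Keep only acting roles, attach names, then collect one list of names per film.
cast = principals[principals["category"].isin(["actor", "actress"])]
actors_per_film = (
    cast.merge(names, on="nconst")
    .groupby("tconst")["primaryName"]
    .agg(list)
)
print(actors_per_film)
```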

## The list of American films (keeping their French titles) and their average rating

### Films

In [None]:
df_title_basics = pd.read_csv('../data/title.basics.tsv', sep='\t', skiprows=skip_idx('../data/title.basics.tsv'), usecols=['tconst', 'titleType', 'originalTitle', 'genres'])

In [80]:
df_title_basics

,tconst,titleType,originalTitle,genres
0,tt0000006,short,Chinese Opium Den,Short
1,tt0000013,short,Neuville-sur-Saône: Débarquement du congrès de...,"Documentary,Short"
2,tt0000016,short,Barque sortant du port,"Documentary,Short"
3,tt0000022,short,Les forgerons,"Documentary,Short"
4,tt0000027,short,Place des Cordeliers à Lyon,"Documentary,Short"
...,...,...,...,...
632124,tt9916758,tvEpisode,Willy Naessens,"Comedy,News,Talk-Show"
632125,tt9916766,tvEpisode,Episode #10.15,"Family,Reality-TV"
632126,tt9916798,tvEpisode,Episode #2.36,"Action,Drama,Family"
632127,tt9916810,tvEpisode,Danira Boukhriss Terkessidis,"Comedy,News,Talk-Show"


In [64]:
df_title_basics['titleType'].unique()

array(['short', 'movie', 'tvSeries', 'tvMovie', 'tvEpisode', 'tvShort',
       'tvMiniSeries', 'video', 'tvSpecial', 'videoGame'], dtype=object)

In [68]:
df_movies = df_title_basics[df_title_basics['titleType'].isin(['movie', 'tvMovie'])]
df_movies.head()

,tconst,titleType,originalTitle,genres
78,tt0000886,movie,Hamlet,Drama
83,tt0000941,movie,Locura de amor,Drama
88,tt0000984,movie,Niños en la alameda,\N
95,tt0001080,movie,Two of the Boys,\N
102,tt0001159,movie,The Connecticut Yankee,\N
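
The `\N` entries in the `genres` column are the marker the IMDb data files use for a missing field, so they are currently loaded as ordinary strings. A minimal sketch of reading them as missing values instead, using two rows copied from the table above in place of the real file:

```python
import io

import pandas as pd

# Two rows copied from the output above, in the same tab-separated layout.
sample_tsv = (
    "tconst\ttitleType\toriginalTitle\tgenres\n"
    "tt0000886\tmovie\tHamlet\tDrama\n"
    "tt0000984\tmovie\tNiños en la alameda\t\\N\n"
)

# na_values tells pandas to treat the literal string \N as NaN.
df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values=["\\N"])
missing_genres = int(df["genres"].isna().sum())
print(missing_genres)
```

The same `na_values` argument could be added to the `pd.read_csv` calls in this notebook, at the cost of having to handle `NaN` in later string operations.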


### American titles

In [8]:
df_title_akas = pd.read_csv('../data/title.akas.tsv', sep='\t', skiprows=skip_idx('../data/title.akas.tsv'), usecols=['titleId', 'title', 'region'])

File size 19527972


In [9]:
df_title_akas.head()

,titleId,title,region
0,tt0000001,Carmencita,\N
1,tt0000003,Szegény Pierrot,HU
2,tt0000003,Poor Pierrot,GB
3,tt0000004,Полная кружка пива,RU
4,tt0000005,The Blacksmith's Forge,GB


In [10]:
df_us = df_title_akas[df_title_akas['region'] == 'US']
df_us.head()

,titleId,title,region
5,tt0000009,Miss Jerry,US
18,tt0000026,The Messers. Lumière at Cards,US
20,tt0000029,Baby's Dinner,US
28,tt0000045,The Washerwomen,US
34,tt0000076,Rip and the Dwarf,US


In [40]:
df_title_ratings = pd.read_csv('../data/title.ratings.tsv', sep='\t', skiprows=skip_idx('../data/title.ratings.tsv'), usecols=['tconst', 'averageRating'])

File size 993154


In [78]:
df_movies_us_ratings = (
    df_movies
    .merge(df_us, left_on='tconst', right_on='titleId')
    .merge(df_title_ratings, on='tconst')
)
df_movies_us_ratings.to_csv('movies_us_ratings.csv', index=False)
df_movies_us_ratings

,tconst,titleType,originalTitle,genres,titleId,title,region,averageRating
0,tt0000886,movie,Hamlet,Drama,tt0000886,"Hamlet, Prince of Denmark",US,5.2
1,tt0014417,movie,La roue,Drama,tt0014417,The Wheel,US,7.4
2,tt0015191,movie,On Time,"Action,Adventure,Comedy",tt0015191,On Time!,US,4.7
3,tt0020466,movie,Sunnyside Up,"Comedy,Musical",tt0020466,Sunnyside Up,US,6.5
4,tt0020649,movie,The Arizona Kid,"Mystery,Romance,Western",tt0020649,The Arizona Kid,US,5.8
...,...,...,...,...,...,...,...,...
124,tt7389882,movie,The Area,Documentary,tt7389882,The Area,US,6.8
125,tt7573720,movie,Karma,Thriller,tt7573720,Karma,US,7.8
126,tt7583568,movie,La Negrada,Drama,tt7583568,Black Mexicans,US,6.6
127,tt8106596,movie,The Legend of Cocaine Island,Documentary,tt8106596,The Legend of Cocaine Island,US,6.3
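
The stated goal asks for the French title of each film, whereas the merge above carries the `US` title. The sketch below shows one way this might be done, assuming the French title is the akas row with `region == 'FR'` and keeping the notebook's approach of treating a film with a `US` row as American. The frames and their values are made up for illustration, and `frenchTitle` is a hypothetical column name.

```python
import pandas as pd

# Toy stand-ins for df_movies, df_title_akas and df_title_ratings.
movies = pd.DataFrame({"tconst": ["tt1", "tt2"], "originalTitle": ["Film A", "Film B"]})
akas = pd.DataFrame({
    "titleId": ["tt1", "tt1", "tt2"],
    "title": ["Film A (US)", "Film A (FR)", "Film B (US)"],
    "region": ["US", "FR", "US"],
})
ratings = pd.DataFrame({"tconst": ["tt1", "tt2"], "averageRating": [7.4, 4.7]})

# Films that have a US entry (the notebook's proxy for "American").
us_ids = akas.loc[akas["region"] == "US", "titleId"].unique()

# French titles, at most one row per film.
fr_titles = (
    akas.loc[akas["region"] == "FR", ["titleId", "title"]]
    .drop_duplicates("titleId")
    .rename(columns={"titleId": "tconst", "title": "frenchTitle"})
)

result = (
    movies[movies["tconst"].isin(us_ids)]
    .merge(fr_titles, on="tconst", how="left")  # keep films without a French title
    .merge(ratings, on="tconst", how="inner")   # keep only rated films
)
print(result)
```

The left join keeps films that have no French entry, with `frenchTitle` left empty; an inner join would drop them instead.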


## The average ratings of the different genres


In [79]:
df_genre_ratings = df_title_basics.merge(df_title_ratings, on='tconst')
# .str.split leaves missing genres as NaN instead of raising on non-string values
df_genre_ratings['genres'] = df_genre_ratings['genres'].str.split(',')
df_genre_ratings

,tconst,titleType,originalTitle,genres,averageRating
0,tt0000027,short,Place des Cordeliers à Lyon,[Drama],5.5
1,tt0000082,short,A Hard Wash,[Drama],4.9
2,tt0000361,short,Lavatory moderne,"[Action, Adventure, Comedy]",8.1
3,tt0000399,short,Jack and the Beanstalk,"[Comedy, Musical]",6.1
4,tt0000800,short,The Awakening,"[Mystery, Romance, Western]",5.3
...,...,...,...,...,...
9952,tt9878308,tvEpisode,Adam Ruins Little Bugs,,7.1
9953,tt9878754,tvEpisode,Hotel Hanky Panky,,8.5
9954,tt9898410,tvEpisode,Episode #1.7,,9.2
9955,tt9913610,tvEpisode,A Fish is to Water as a Genius is to X,,7.1


In [58]:
# Select the rating column explicitly so that text columns are not averaged
df_genre_ratings_explode = df_genre_ratings.explode('genres').groupby('genres')[['averageRating']].mean()
df_genre_ratings_explode

genres,averageRating
Action,6.290625
Adult,6.271429
Adventure,6.370588
Animation,6.24375
Biography,6.27
Comedy,6.276923
Crime,6.196154
Documentary,6.156818
Drama,6.377273
Family,6.117391
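
To make the split, explode and group steps concrete, here is a small worked example on made-up data (the frame `toy` and its values are not from the dataset). A film with several genres contributes its rating to each of them, and a film with no genre is dropped.

```python
import pandas as pd

toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "genres": ["Comedy,Drama", "Drama", None],
    "averageRating": [6.0, 8.0, 5.0],
})

per_genre = (
    toy.assign(genres=toy["genres"].str.split(","))  # "Comedy,Drama" -> ["Comedy", "Drama"]
    .explode("genres")                               # one row per (film, genre)
    .dropna(subset=["genres"])                       # drop films without a genre
    .groupby("genres")["averageRating"]
    .mean()
)
print(per_genre)
```

Here `Drama` averages the ratings of `tt1` and `tt2`, while `Comedy` only sees `tt1`.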


## Bonus
She would like to be able to generate recommendations for new films based on the user's history.