# Final Project: Movie Pairing System
To obtain the required data, download the `ml-1m`, `title.basics.tsv` and `title.crew.tsv` datasets, either with the provided `data_download.sh` shell script or by running the first few cells of this notebook.
### Why these three datasets?
The `ml-1m` dataset was chosen because it contains about 1 million ratings for around 4 thousand movies, which is enough data to recommend movies later on. The `title.basics.tsv` dataset contains an enormous amount of information on movies, more than `ml-1m` provides, but it does not contain ratings and it covers much more than movies, so it needs to be cleaned up properly. The `title.crew.tsv` dataset was chosen as a complementary dataset: it contains additional information on each title, including the directors, which will be used as a feature in our model. At first, the `name.basics.tsv` dataset was used instead, but it soon became clear that it would require more work to extract the same information and could also give false information (for example, Clint Eastwood would be listed as a director of movies he only acted in). To avoid such mistakes, it was dropped in favor of `title.crew.tsv`. That file only provides director identifiers rather than actual names, but this is not a problem for the model.

In [101]:
%%bash
# Download and extract MovieLens 1M unless the target folder already exists
if [ ! -d "data/movielens_latest" ];
then
    wget http://files.grouplens.org/datasets/movielens/ml-1m.zip
    mkdir -p data/movielens_latest
    unzip -o ml-1m.zip -d data/movielens_latest;
else
    echo "Movielens data already downloaded";
fi

# Download and decompress the two IMDb tables unless the target folder already exists
if [ ! -d "data/imdb" ];
then
    # -p also creates the parent "data" folder if it is missing
    mkdir -p data/imdb
    wget https://datasets.imdbws.com/title.crew.tsv.gz
    gunzip -c title.crew.tsv.gz > data/imdb/title.crew.tsv;
    wget https://datasets.imdbws.com/title.basics.tsv.gz
    gunzip -c title.basics.tsv.gz > data/imdb/title.basics.tsv;
else
    echo "imdb data already downloaded";
fi

--2024-07-07 19:29:16--  http://files.grouplens.org/datasets/movielens/ml-1m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5917549 (5.6M) [application/zip]
Saving to: ‘ml-1m.zip.1’


Archive:  ml-1m.zip
   creating: data/movielens_latest/ml-1m/
  inflating: data/movielens_latest/ml-1m/movies.dat  
  inflating: data/movielens_latest/ml-1m/ratings.dat  
  inflating: data/movielens_latest/ml-1m/README  
  inflating: data/movielens_latest/ml-1m/users.dat  


--2024-07-07 19:29:18--  https://datasets.imdbws.com/title.crew.tsv.gz
Resolving datasets.imdbws.com (datasets.imdbws.com)... 52.222.149.104, 52.222.149.89, 52.222.149.11, ...
Connecting to datasets.imdbws.com (datasets.imdbws.com)|52.222.149.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 71386776 (68M) [binary/octet-stream]
Saving to: ‘title.crew.tsv.gz’


In [102]:
# Imports for the project
import pandas as pd
import re

## Data Analysis and Feature Engineering
In this section, we check the integrity of the downloaded files before merging them and making some adjustments. First, we look at the IMDb data.

In [109]:
df_crew_basics = pd.read_csv("data/imdb/title.crew.tsv", sep='\t')
df_title_basics = pd.read_csv("data/imdb/title.basics.tsv", sep='\t', low_memory=False)
df_title_basics

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,5,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
...,...,...,...,...,...,...,...,...,...
10911799,tt9916848,tvEpisode,Episode #3.17,Episode #3.17,0,2009,\N,\N,"Action,Drama,Family"
10911800,tt9916850,tvEpisode,Episode #3.19,Episode #3.19,0,2010,\N,\N,"Action,Drama,Family"
10911801,tt9916852,tvEpisode,Episode #3.20,Episode #3.20,0,2010,\N,\N,"Action,Drama,Family"
10911802,tt9916856,short,The Wind,The Wind,0,2015,\N,27,Short


We drop all titles that are not movies from the titles dataset, merge both datasets on `tconst`, and keep only the `primaryTitle`, `startYear`, and `directors` columns.

In [110]:
# Keep only the rows whose titleType is 'movie'
df_title_basics = df_title_basics[df_title_basics['titleType'] == 'movie']
df_imdb = pd.merge(df_title_basics, df_crew_basics, on='tconst', how='left')
df_imdb = df_imdb[['primaryTitle', 'startYear', 'directors']]
df_imdb

Unnamed: 0,primaryTitle,startYear,directors
0,Miss Jerry,1894,nm0085156
1,The Corbett-Fitzsimmons Fight,1897,nm0714557
2,Bohemios,1905,nm0063413
3,The Story of the Kelly Gang,1906,nm0846879
4,The Prodigal Son,1907,nm0141150
...,...,...,...
685588,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,"nm9272490,nm9272491"
685589,De la ilusión al desconcierto: cine colombiano...,2007,nm0652213
685590,Dankyavar Danka,2013,nm7764440
685591,6 Gunn,2017,nm10538612
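
IMDb marks missing fields with the literal `\N`, as seen in the `endYear` and `runtimeMinutes` columns of the titles table. A minimal sketch, assuming it is acceptable to treat these placeholders as NaN at load time, passes `na_values` to `read_csv`. The two-row sample is made up for illustration and is not taken from the real files:

```python
import io
import pandas as pd

# Hypothetical two-row sample in the title.crew.tsv layout (not real IMDb rows).
sample_tsv = "tconst\tdirectors\twriters\ntt0000001\tnm0000001\t\\N\ntt0000002\t\\N\t\\N\n"

# na_values turns the literal \N placeholder into NaN while parsing.
df_sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

print(df_sample["directors"].isna().tolist())
```

With NaN instead of `\N`, a later `fillna` on the `directors` column would also cover titles whose director is unknown.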


We then look at the MovieLens dataset:

In [118]:
m_cols = ['MovieId', 'Title', 'Genres']
df_movielens = pd.read_csv('data/movielens_latest/ml-1m/movies.dat', sep='::', engine='python', encoding='latin-1', names=m_cols)
r_cols = ['UserId', 'MovieId', 'Rating', 'Timestamp']
df_ratings = pd.read_csv('data/movielens_latest/ml-1m/ratings.dat', sep='::', engine='python', encoding='latin-1', names=r_cols)
u_cols = ['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code']
df_users = pd.read_csv('data/movielens_latest/ml-1m/users.dat', sep='::', engine='python', encoding='latin-1', names=u_cols)
df_movielens

Unnamed: 0,MovieId,Title,Genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
3878,3948,Meet the Parents (2000),Comedy
3879,3949,Requiem for a Dream (2000),Drama
3880,3950,Tigerland (2000),Drama
3881,3951,Two Family House (2000),Drama


To match the format of the IMDb files, we clean up the MovieLens titles: the release year is removed from the title and stored in a separate `Year` column, a trailing article (`The` or `A`) is moved back to the front, and the title is lowercased. The IMDb titles will be lowercased later on as well.

In [112]:
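def clean_titles(raw_title):
    """Split a MovieLens title such as 'Toy Story (1995)' into (lowercase title, year)."""
    # The release year is the four-digit group in parentheses at the end of the title
    match = re.search(r'\((\d{4})\)$', raw_title)
    if match:
        year = match.group(1)
        title = raw_title[:match.start()].strip()
    else:
        year = None
        title = raw_title
    # MovieLens stores a leading article at the end ('Title, The'); move it back to the front
    parts = title.title().split(',')
    if parts[-1].strip() in ['The', 'A']:
        title = (parts[-1].strip() + " " + " ".join(parts[:-1])).strip()

    title = title.lower()
    return title, year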

df_movielens[['Title', 'Year']] = df_movielens['Title'].apply(clean_titles).apply(pd.Series)
df_movielens

Unnamed: 0,MovieId,Title,Genres,Year
0,1,toy story,Animation|Children's|Comedy,1995
1,2,jumanji,Adventure|Children's|Fantasy,1995
2,3,grumpier old men,Comedy|Romance,1995
3,4,waiting to exhale,Comedy|Drama,1995
4,5,father of the bride part ii,Comedy,1995
...,...,...,...,...
3878,3948,meet the parents,Comedy,2000
3879,3949,requiem for a dream,Drama,2000
3880,3950,tigerland,Drama,2000
3881,3951,two family house,Drama,2000
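
The title cleanup above joins the pieces of a comma-separated title with spaces, so a title that itself contains commas loses them, and it only recognises `The` and `A`. A possible variant, offered as a sketch rather than the notebook's method, splits on the last comma only and also accepts `An`. `normalize_title` and `ARTICLES` are hypothetical names, and the input strings are illustrative:

```python
import re

ARTICLES = {"the", "a", "an"}  # 'an' is an addition the original function does not handle

def normalize_title(raw_title):
    """Return (lowercase title, year string or None) for a MovieLens-style title."""
    match = re.search(r"\((\d{4})\)\s*$", raw_title)
    year = match.group(1) if match else None
    title = raw_title[:match.start()].strip() if match else raw_title.strip()

    # Split on the last comma only, so commas inside the title survive.
    head, sep, tail = title.rpartition(",")
    if sep and tail.strip().lower() in ARTICLES:
        title = f"{tail.strip()} {head.strip()}"
    return title.lower(), year

print(normalize_title("Usual Suspects, The (1995)"))
print(normalize_title("Good, The Bad and The Ugly, The (1966)"))
```

Keeping inner commas matters because the merge key is the exact lowercase title, so any difference in punctuation would prevent a match with the IMDb title.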


We split the genres and one-hot encode them, so that each genre becomes a separate feature column:

In [113]:
df_movielens['Genres'] = df_movielens['Genres'].str.split('|')
df_movielens_exploded = df_movielens.explode('Genres')
df_movielens_ohe = pd.get_dummies(df_movielens_exploded['Genres'])
df_movielens_ohe_grouped = df_movielens_ohe.groupby(df_movielens_exploded.index).sum()
df_movielens_final = pd.concat([df_movielens.drop(columns=['Genres']), df_movielens_ohe_grouped], axis=1)
df_movielens_final

Unnamed: 0,MovieId,Title,Year,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,toy story,1995,0,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,jumanji,1995,0,1,0,1,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,3,grumpier old men,1995,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,waiting to exhale,1995,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,father of the bride part ii,1995,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3878,3948,meet the parents,2000,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3879,3949,requiem for a dream,2000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3880,3950,tigerland,2000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3881,3951,two family house,2000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
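
The explode, `get_dummies`, and group-by steps can also be expressed with `Series.str.get_dummies`, which splits on a separator and encodes in one call. The sketch below uses a made-up two-row frame (`df_toy` is a hypothetical name), not the real data:

```python
import pandas as pd

# Toy frame in the MovieLens layout; the rows are illustrative.
df_toy = pd.DataFrame({
    "MovieId": [1, 2],
    "Genres": ["Animation|Comedy", "Comedy|Romance"],
})

# str.get_dummies splits on the separator and returns one 0/1 column per genre.
genre_ohe = df_toy["Genres"].str.get_dummies(sep="|")
df_toy_final = pd.concat([df_toy.drop(columns=["Genres"]), genre_ohe], axis=1)

print(df_toy_final)
```

Both approaches should give the same indicator columns; the one-call version simply avoids the intermediate exploded frame.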


We merge the datasets to keep only the relevant movies before working further on the data. The steps are:
- first, we left-merge both datasets on title and year
- since one MovieLens movie can match several IMDb entries, we remove the duplicates, keeping the first match per `MovieId`
- we drop the `primaryTitle` and `startYear` columns
- we fill all NaN values in the `directors` column with an empty string
- we one-hot encode the directors to get one feature per director

In [120]:
df_imdb['primaryTitle'] = df_imdb['primaryTitle'].str.lower()
df_movies = pd.merge(df_movielens_final, df_imdb, left_on=['Title', 'Year'], right_on=['primaryTitle', 'startYear'], how='left')
df_movies = df_movies.drop_duplicates(subset='MovieId', keep='first')
df_movies = df_movies.drop(columns=['primaryTitle','startYear'])
df_movies['directors'] = df_movies['directors'].fillna('')
df_movies

Unnamed: 0,MovieId,Title,Year,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,...,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,directors
0,1,toy story,1995,0,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,nm0005124
1,2,jumanji,1995,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,nm0002653
2,3,grumpier old men,1995,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,nm0222043
3,4,waiting to exhale,1995,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,nm0001845
4,5,father of the bride part ii,1995,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,nm0796124
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3898,3948,meet the parents,2000,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,nm0005366
3899,3949,requiem for a dream,2000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,nm0004716
3900,3950,tigerland,2000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,nm0001708
3901,3951,two family house,2000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,nm0208343
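
To judge how well the title and year join works, one option is the `indicator` parameter of `pd.merge`, which adds a `_merge` column telling where each row came from. The sketch below runs on toy stand-ins for the two frames. `left`, `right`, the second title and the `nm0000000` identifier are made up, and both year columns are assumed to be strings, as they appear to be in this notebook:

```python
import pandas as pd

# Toy stand-ins for df_movielens_final and df_imdb; values are illustrative.
left = pd.DataFrame({
    "MovieId": [1, 2],
    "Title": ["toy story", "made-up title"],
    "Year": ["1995", "1999"],
})
right = pd.DataFrame({
    "primaryTitle": ["toy story", "toy story"],
    "startYear": ["1995", "1995"],
    "directors": ["nm0005124", "nm0000000"],
})

merged = pd.merge(left, right, left_on=["Title", "Year"],
                  right_on=["primaryTitle", "startYear"],
                  how="left", indicator=True)

# 'left_only' rows found no IMDb match and will end up without a director.
unmatched = int((merged["_merge"] == "left_only").sum())

# MovieIds appearing more than once matched several IMDb entries.
dup_mask = merged.duplicated(subset="MovieId", keep=False)
ambiguous = int(merged.loc[dup_mask, "MovieId"].nunique())

print(unmatched, ambiguous)
```

The two counts show how many movies rely on the empty-director fallback and how many depend on the arbitrary `keep='first'` choice.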


In [121]:
df_movies['directors'] = df_movies['directors'].str.split(',')
df_movies_exploded = df_movies.explode('directors')
df_movies_ohe = pd.get_dummies(df_movies_exploded['directors'])
df_movies_ohe_grouped = df_movies_ohe.groupby(df_movies_exploded.index).sum()
df_final = pd.concat([df_movies.drop(columns=['directors']), df_movies_ohe_grouped], axis=1)
df_final

Unnamed: 0,MovieId,Title,Year,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,...,nm1212046,nm1301264,nm15572750,nm15643153,nm1635724,nm1871737,nm1968474,nm1969399,nm2259353,nm9054338
0,1,toy story,1995,0,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,jumanji,1995,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,grumpier old men,1995,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,waiting to exhale,1995,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,father of the bride part ii,1995,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3898,3948,meet the parents,2000,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3899,3949,requiem for a dream,2000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3900,3950,tigerland,2000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3901,3951,two family house,2000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
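
One side effect worth checking: after `fillna('')`, splitting an empty string yields a list holding one empty string, so the one-hot encoding gains a column whose name is `''`. A small sketch on placeholder identifiers (not real IMDb ids) shows the effect and one way to drop that column, assuming a "no known director" feature is not wanted:

```python
import pandas as pd

# Toy director strings as they look after fillna(''); identifiers are placeholders.
directors = pd.Series(["nm0000001,nm0000002", "", "nm0000001"])

exploded = directors.str.split(",").explode()
ohe = pd.get_dummies(exploded).groupby(level=0).sum()

# The empty string becomes its own column, marking 'no known director'.
print(list(ohe.columns))

# Dropping it keeps only real director identifiers as features.
ohe = ohe.drop(columns=[""])
print(ohe)
```

Whether to keep or drop this column is a modelling choice; keeping it makes all movies without a known director look similar to each other.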


Now that we have a proper dataset with a reasonable number of features, it is time to add the ratings. We create a user-item matrix that will be used by our model; movies that a user has not rated are filled with 0.

In [122]:
df_matrix = df_ratings.pivot(index="UserId", columns="MovieId", values="Rating")
df_matrix = df_matrix.fillna(0)
df_matrix

UserId / MovieId,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
1,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6036,0.0,0.0,0.0,2.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6038,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6039,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
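
Because MovieLens ratings range from 1 to 5, a 0 in this matrix can only mean "not rated". A short sketch on made-up ratings (`df_toy_ratings` and `toy_matrix` are hypothetical names) shows how the density of the matrix could be measured before the NaN values are replaced:

```python
import pandas as pd

# Toy ratings in the ratings.dat layout; values are illustrative.
df_toy_ratings = pd.DataFrame({
    "UserId": [1, 1, 2, 3],
    "MovieId": [10, 20, 10, 30],
    "Rating": [5, 3, 4, 2],
})

# pivot leaves NaN wherever a user has not rated a movie.
toy_matrix = df_toy_ratings.pivot(index="UserId", columns="MovieId", values="Rating")

# Share of user/movie pairs that actually carry a rating.
density = toy_matrix.notna().sum().sum() / toy_matrix.size
print(f"density: {density:.2f}")

# After fillna(0), a 0 means 'not rated', not 'rated zero'.
toy_matrix = toy_matrix.fillna(0)
print(toy_matrix)
```

Note that `pivot` raises an error if the same user and movie pair appears twice, so it also acts as a check that each pair is unique.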


## Building the model
