# 🎦**AlloCiné Recommender System**🔎📍

Once we cleaned the data, we can start to build our recommender system. The data that will be used is located in the `../Cleaned Data/` folder.

**Types of Recommender System:**

There are two types of recommender system: **`content-based`** and **`collaborative-filtering`**.

- **`Content-based`:** this recommender system is based only on the characteristics of the products. Here, we will recommend an item to a user by comparing the features between items and recommend the items with the highest similarity.

- **`Collaborative-filtering`:** this recommender system is based on the interactions between users and the items. 


---
# **Import libs**

In [158]:
# import libraries
import pandas as pd
import numpy as np
import re
from sklearn.metrics.pairwise import cosine_similarity
from ast import literal_eval
from warnings import filterwarnings
from sklearn.feature_extraction.text import TfidfVectorizer
import spacy
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop # Used to get the French stop-words


# We ignore dateparse warnings
filterwarnings("ignore",message="The localize method is no longer necessary, as this time zone supports the fold attribute")
# We ignore reindexing warnings
filterwarnings("ignore",message="Boolean Series key will be reindexed")
%matplotlib inline

# **Load the csv files**

In [159]:
def load_csv():
    '''
    Load the csv files and return a dict of dataframes.
    '''
    root_path = f"../Cleaned Data/"
    movies = pd.read_csv(f"{root_path}movies.csv", converters={"genres": literal_eval}) # Load list look-alike string as type list
    series = pd.read_csv(f"{root_path}series.csv", converters={"genres": literal_eval})
    press_movies = pd.read_csv(f"{root_path}press_movies.csv")
    press_series = pd.read_csv(f"{root_path}press_series.csv")
    user_movies = pd.read_csv(f"{root_path}user_movies.csv")
    user_series = pd.read_csv(f"{root_path}user_series.csv")
    #user_series = pd.read_csv(f"../Series/Ratings/Webscraping_Series_Ratings_user_ratings_series_#1-1.csv")
    return {"movies":movies, "series":series, "press_movies":press_movies, "press_series":press_series, "user_movies":user_movies, "user_series":user_series}
data = load_csv()
movies, series, press_movies, press_series, user_movies, user_series = data["movies"], data["series"], data["press_movies"], data["press_series"], data["user_movies"], data["user_series"]

# **Content-based**

In [160]:
def get_stop_words():
    # turn French stop-words into a list
    stop_words = list(fr_stop)
    # Load extra stop-words
    with open("stop_words_french.txt",'r', encoding='UTF-8') as file:
        additional_stopwords = file.readlines()
        additional_stopwords = [line.rstrip() for line in additional_stopwords]
    # Add the additional French stop-words to the list of stop-words if they are not already in it
    for word in additional_stopwords:
        if word not in stop_words:
            stop_words.append(word)
    stop_words.extend([""," ","#"])
    return stop_words

In [161]:
def get_tags(df: pd.DataFrame=None, stop_words: list=get_stop_words(), cols_to_get: str="title"):
    '''
    Get the tags of the movies from the title and summary.
    :param df: Dataframe to transform.
    :param stop_words: List of stop-words to remove from the tags.
    :param col_to_get: Columns to get the tags from.
    '''
    try:
        df["tags"] = df[cols_to_get].apply(lambda x: re.split(" |,|\. |\.\.\.|\"|'", x))
        df["tags"] = df["tags"].apply(lambda x: [i for i in x if i.lower() not in stop_words])
    except Exception as e:
        print(f"Error: {e}")
    return df

In [162]:
def get_dummies_df(df: pd.DataFrame=None, keep_cols: list=None, col_to_dummies: list=None):
    '''
    Get a dataframe with the dummies of the column col_to_dummies.
    :param df: Dataframe to transform.
    :param keep_cols: List of columns to keep.
    :param col_to_dummies: List of columns to transform to dummies.
    '''
    try:
        df_dummies = df[keep_cols]
        # We add the tag column to the dataframe
        if "tags" in col_to_dummies:
            df_dummies = get_tags(df=df_dummies) 

        # We create binary variables for each genre by One-Hot encoding the genres column    
        encoded_genres = pd.get_dummies(df_dummies.genres.apply(pd.Series).stack()).groupby(level=0).sum()
        df_dummies = pd.concat([df_dummies, encoded_genres], axis=1).sort_values(by=["title", "release_date"], ascending=[True,False])
        df_dummies.reset_index(drop=True, inplace=True)       
            
        if "tags" in col_to_dummies:
            encoded_tags = pd.get_dummies(df_dummies.tags.apply(pd.Series).stack()).groupby(level=0).sum()
            df_dummies = pd.concat([df_dummies, encoded_tags], axis=1)            
    except Exception as e:
        print(f"Error: {e}")
        return None
    return df_dummies

## **Model N°1:** Genres with rating ponderation

In [163]:
# We retrieve only the useful columns for the content-based recommender system
cols_to_keep = ["id", "title", "release_date", "user_rating", "genres"]

In [164]:
movies_dummies = get_dummies_df(df=movies, keep_cols=cols_to_keep, col_to_dummies=["genres"])

In [165]:
# We compute the cosine similarity between each movie and multiply it by the user rating.
cos_sim = cosine_similarity(movies_dummies.iloc[:,-(movies_dummies.shape[1]-len(cols_to_keep)):]) * movies_dummies.user_rating.values.tolist()

In [166]:
def CB_recommender(title='En corps', cos_sim=cos_sim, nb_recos=10, dummies_df=movies_dummies):
    CB_recos = [] # initialisation of the list of recommendations
    # If the movie is in the database
    if title in dummies_df.title.values:
        idx = dummies_df.index[dummies_df.title == title][0] # We retrieve the index of the movie (the most recent one if there are homonyms)
        score_series = pd.Series(cos_sim[idx]).drop(idx).sort_values(ascending=False) # We sort the similarity matrix of the movies and we drop the movie itself
        top_nbrecos = list(score_series.iloc[0:nb_recos].index) # We select the top nb_recos movies 
        # We store the titles of the top nb_recos movies in the list CB_recos
        for i in top_nbrecos:             
            CB_recos.append((dummies_df.title[i], cos_sim[idx][i]))            
    else:
        print(f'The movie {title} requested was not found.')
    return CB_recos

In [168]:
CB_recommender(title="Avengers", cos_sim=cos_sim, nb_recos=10)

[('Star Wars : Episode III - La Revanche des Sith', 4.200000000000001),
 ('Rogue One: A Star Wars Story', 4.1000000000000005),
 ('La Planète des singes : les origines', 4.1000000000000005),
 ('Star Trek Into Darkness', 4.000000000000001),
 ('Star Trek', 3.900000000000001),
 ("Captain America, le soldat de l'hiver", 3.900000000000001),
 ('La Planète des Singes - Suprématie', 3.900000000000001),
 ('Planète interdite', 3.8000000000000007),
 ("Avengers : L'ère d'Ultron", 3.8000000000000007),
 ('Stargate, la porte des étoiles', 3.700000000000001)]

The similarity between movies is very high, so a lot of movies with the highest similarity are recommended to the user. This is because we only considered the genres as a comparison criterion. We need to use more features if we want to have a more accurate recommendation. The order may change depending on the original order of the movies. 

By multiplying the genres with the rating, we can have a more accurate recommendation. In that way, movies with the same genres will more likely be similar if they have a high rating.

## **Model N°2:** Model 1 + tag analysis

In [169]:
# We retrieve only the useful columns for the content-based recommender system
cols_to_keep_2 = ["title", "release_date", "user_rating", "genres", "summary"]

In [170]:
movies_dummies_2 = get_dummies_df(df=movies, keep_cols=cols_to_keep_2, col_to_dummies=["genres", "tags"])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["tags"] = df[cols_to_get].apply(lambda x: re.split(" |,|\. |\.\.\.|\"|'", x))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["tags"] = df["tags"].apply(lambda x: [i for i in x if i.lower() not in stop_words])


In [171]:
movies_dummies_2

Unnamed: 0,title,release_date,user_rating,genres,summary,tags,Action,Animation,Arts Martiaux,Aventure,...,étrange,étreinte,évasion,éventreur,êtes,île,îles,Œil,œuf,–
0,# Pire soirée,2017-08-02,1.8,[Comédie],Cinq amies qui se sont connues à l’université ...,"[Pire, soirée]",0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,#Alive,2021-09-23,3.0,"[Drame, Action, Epouvante-horreur]","Comme un terrifiant virus ravage sa ville, un ...",[#Alive],1,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,#Chef,2014-10-29,3.6,"[Comédie, Drame]","Carl Casper, Chef cuisinier, préfère démission...",[#Chef],0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,#Jesuislà,2020-02-05,2.4,"[Comédie, Romance]",Stéphane mène une vie paisible au Pays Basque ...,[#Jesuislà],0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,'71,2014-09-02,3.8,"[Action, Drame, Guerre]","Belfast, 1971.Tandis que le conflit dégénère e...",[71],1,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7902,Énorme,2020-09-02,2.2,[Comédie],Ça lui prend d’un coup à 40 ans : Frédéric veu...,[Énorme],0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7903,Éperdument,2016-03-02,3.1,[Drame],"Un homme, une femme. Un directeur de prison, s...",[Éperdument],0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7904,Éternité,2016-01-01,2.1,[Drame],"Quand Valentine se marie à 20 ans avec Jules, ...",[Éternité],0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7905,Étoiles de cristal,2022-04-08,2.6,"[Drame, Thriller]","Sous pression, une ballerine choisie pour un n...","[Étoiles, cristal]",0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [178]:
m_hp = movies[movies.title.apply(lambda x: "Harry Potter" in x)]

In [179]:
summary = m_hp.title.values

In [180]:
summary

array(["Harry Potter à l'école des sorciers",
       'Harry Potter et les reliques de la mort - partie 2',
       'Harry Potter et la chambre des secrets',
       "Harry Potter et l'Ordre du Phénix",
       'Harry Potter et les reliques de la mort - partie 1',
       'Harry Potter et la Coupe de Feu',
       'Harry Potter et le Prince de sang mêlé',
       "Harry Potter et le Prisonnier d'Azkaban"], dtype=object)

In [157]:
movies_dummies_2["tags"] = movies_dummies_2["title"].apply(lambda x: re.split(" |,|\. |\.\.\.|\"|'|#", x))
movies_dummies_2["tags"] = movies_dummies_2["tags"].apply(lambda x: [i for i in x if i.lower() not in stop_words])
movies_dummies_2.head()

Unnamed: 0,title,release_date,user_rating,genres,summary,Action,Animation,Arts Martiaux,Aventure,Biopic,...,Judiciaire,Musical,Policier,Péplum,Romance,Science fiction,Sport event,Thriller,Western,tags
0,# Pire soirée,2017-08-02,1.8,[Comédie],Cinq amies qui se sont connues à l’université ...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"[Pire, soirée]"
1,#Alive,2021-09-23,3.0,"[Drame, Action, Epouvante-horreur]","Comme un terrifiant virus ravage sa ville, un ...",1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,[Alive]
2,#Chef,2014-10-29,3.6,"[Comédie, Drame]","Carl Casper, Chef cuisinier, préfère démission...",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,[Chef]
3,#Jesuislà,2020-02-05,2.4,"[Comédie, Romance]",Stéphane mène une vie paisible au Pays Basque ...,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,[Jesuislà]
4,'71,2014-09-02,3.8,"[Action, Drame, Guerre]","Belfast, 1971.Tandis que le conflit dégénère e...",1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,[71]
