# 🎦**AlloCiné Recommender System**🔎📍

Now that the data has been cleaned, we can start building our recommender system. The data used here is located in the `../Cleaned Data/` folder.

**Types of Recommender System:**

There are two main types of recommender systems: **`content-based`** and **`collaborative-filtering`**.

- **`Content-based`:** this recommender system relies only on the characteristics of the items. We recommend items to a user by comparing the features of the items and returning those with the highest similarity.

- **`Collaborative-filtering`:** this recommender system relies on the interactions between users and items.
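
To make the difference concrete, here is a minimal sketch on made-up data. The feature matrix, the rating matrix and all variable names are assumptions for illustration only, and the collaborative part shows just one possible variant (item-based), not the method used later in this notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data: 3 items described by 3 binary genre features
# (columns: Action, Comedy, Drama).
item_features = np.array([
    [1, 0, 0],  # item 0: Action
    [1, 0, 1],  # item 1: Action + Drama
    [0, 1, 0],  # item 2: Comedy
])

# Hypothetical user-item ratings (rows: 4 users, columns: the 3 items, 0 = not rated).
ratings = np.array([
    [5, 4, 0],
    [4, 5, 1],
    [0, 1, 5],
    [1, 0, 4],
])

# Content-based: compare items through their own features.
content_sim = cosine_similarity(item_features)

# Collaborative filtering (item-based variant): compare items through the
# ratings users gave them, ignoring the item features entirely.
collab_sim = cosine_similarity(ratings.T)
```

In both cases the result is an item-by-item similarity matrix. Only the evidence behind it changes: item features in the first case, user behaviour in the second.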


---
# **Import libs**

In [1]:
# import libraries
import pandas as pd
import numpy as np
import re
from sklearn.metrics.pairwise import cosine_similarity
from ast import literal_eval
from warnings import filterwarnings
from sklearn.feature_extraction.text import TfidfVectorizer
import spacy
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop # Used to get the French stop-words
from IPython.display import clear_output

# We ignore dateparse warnings
filterwarnings("ignore",message="The localize method is no longer necessary, as this time zone supports the fold attribute")
# We ignore reindexing warnings
filterwarnings("ignore",message="Boolean Series key will be reindexed")
%matplotlib inline

# **Load the csv files**

In [2]:
def load_csv():
    '''
    Load the csv files and return a dict of dataframes.
    '''
    root_path = "../Cleaned Data/"
    movies = pd.read_csv(f"{root_path}movies.csv", converters={"genres": literal_eval}) # Parse list-like strings into Python lists
    series = pd.read_csv(f"{root_path}series.csv", converters={"genres": literal_eval})
    press_movies = pd.read_csv(f"{root_path}press_movies.csv")
    press_series = pd.read_csv(f"{root_path}press_series.csv")
    user_movies = pd.read_csv(f"{root_path}user_movies.csv")
    user_series = pd.read_csv(f"{root_path}user_series.csv")
    #user_series = pd.read_csv(f"../Series/Ratings/Webscraping_Series_Ratings_user_ratings_series_#1-1.csv")
    return {"movies":movies, "series":series, "press_movies":press_movies, "press_series":press_series, "user_movies":user_movies, "user_series":user_series}
data = load_csv()
movies, series, press_movies, press_series, user_movies, user_series = data["movies"], data["series"], data["press_movies"], data["press_series"], data["user_movies"], data["user_series"]

---
# **Content-based**

## Functions:

In [3]:
def get_stop_words():
    '''
    Get the French stop-words for tagging.
    :return: list of French stop-words.
    '''
    # turn French stop-words into a list
    stop_words = list(fr_stop)
    # Load extra stop-words in French and in English
    with open("stop_words_french.txt",'r', encoding='UTF-8') as file:
        additional_stopwords = file.readlines()
        additional_stopwords = [line.rstrip() for line in additional_stopwords]
    with open("stop_words_eng.txt",'r', encoding='UTF-8') as file:
        additional_stopwords_eng = file.readlines()
        additional_stopwords_eng = [line.rstrip() for line in additional_stopwords_eng]
    
    # Add the additional French and English stop-words to the list if they are not already in it.
    # Both files are processed one after the other: zipping them would silently truncate the longer one.
    for word in additional_stopwords + additional_stopwords_eng:
        if word not in stop_words:
            stop_words.append(word)
    stop_words.extend([""," ","#","-",":","(",")"])
    return stop_words

In [4]:
def get_tags(df: pd.DataFrame=None, stop_words: list=get_stop_words(), cols_to_get: str="title"):
    '''
    Get the tags of the movies from the title and/or summary.
    :param df: Dataframe to transform.
    :param stop_words: List of stop-words to remove from the tags.
    :param cols_to_get: Column to get the tags from.
    :return: Dataframe with an additional "tags" column.
    '''
    try:
        # We remove unnecessary punctuation, symbols and digits from the title
        # and store the title's keywords in a list
        rmv_char = r"(!|#|:|\$|\%|\^|\&|\*|\(|\)|-|\+|/|\?|\.+|\d)"
        # The split pattern is a raw string so that regex escapes are not read as string escapes
        df["tags"] = df[cols_to_get].apply(lambda x: re.split(r" |,|\. |\.\.\.|\"|'|-", re.sub(rmv_char, " ", x)))
        df["tags"] = df["tags"].apply(lambda x: [i for i in x if i.lower() not in stop_words])
    except Exception as e:
        print(f"Error: {e}")
    return df

In [5]:
def get_dummies_df(df: pd.DataFrame=None, keep_cols: list=None, col_to_dummies: list=None):
    '''
    Get a dataframe with the dummies of the column col_to_dummies.
    :param df: Dataframe to transform.
    :param keep_cols: List of columns to keep.
    :param col_to_dummies: List of columns to transform to dummies.
    '''
    try:
        df_dummies = df[keep_cols]
        # We drop the movies with no user rating
        df_dummies = df_dummies.dropna(subset=["user_rating"], axis=0)
        df_dummies = df_dummies.reset_index(drop=True)
        
        # We add the tag column to the dataframe
        if "tags" in col_to_dummies:
            df_dummies = get_tags(df=df_dummies) 

        non_dummy_cols = df_dummies.shape[1]

        # We create binary variables for each genre by One-Hot encoding the genres column    
        encoded_genres = pd.get_dummies(df_dummies.genres.apply(pd.Series).stack()).groupby(level=0).sum()
        df_dummies = pd.concat([df_dummies, encoded_genres], axis=1).sort_values(by=["title", "release_date"], ascending=[True,False])
        df_dummies.reset_index(drop=True, inplace=True)       
            
        if "tags" in col_to_dummies:
            encoded_tags = pd.get_dummies(df_dummies.tags.apply(pd.Series).stack()).groupby(level=0).sum()
            df_dummies = pd.concat([df_dummies, encoded_tags], axis=1)
                
        # We replace NaN values in the dummy columns with 0
        df_dummies.iloc[:,non_dummy_cols:] = df_dummies.iloc[:,non_dummy_cols:].fillna(0)
        
    except Exception as e:
        print(f"Error: {e}")
        return None
    return df_dummies
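
The genre encoding used in `get_dummies_df` can be tried on a toy dataframe. The sketch below uses made-up titles and genres (only the `genres` column name comes from the notebook). It also shows an `explode`-based variant, which is an alternative formulation and not the notebook's own code.

```python
import pandas as pd

# Hypothetical toy dataframe: each row holds a list of genres.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["Action", "Drama"], ["Comedy"], ["Action"]],
})

# Same idea as in get_dummies_df: spread each list over columns, stack the
# values into one long Series, one-hot encode it, then sum back per row.
stacked = toy.genres.apply(pd.Series).stack()
encoded = pd.get_dummies(stacked).groupby(level=0).sum()

# Alternative formulation: explode the lists into one row per genre first.
encoded_alt = pd.get_dummies(toy.genres.explode()).groupby(level=0).sum()
```

Both versions yield one row per movie and one column per genre, with a 1 wherever the movie has that genre.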

## **Model N°1:** Genres with rating weighting

In [6]:
# We retrieve only the useful columns for the content-based recommender system
cols_to_keep = ["id", "title", "release_date", "user_rating", "genres"]

In [7]:
movies_dummies = get_dummies_df(df=movies, keep_cols=cols_to_keep, col_to_dummies=["genres"])

In [8]:
# We compute the cosine similarity between each pair of movies from the dummy columns only,
# then weight it by the user rating: cos_sim[i, j] is the similarity of movies i and j times the rating of movie j.
cos_sim = cosine_similarity(movies_dummies.iloc[:, len(cols_to_keep):]) * movies_dummies.user_rating.values

In [9]:
def CB_recommender(title: str="The Batman", cos_sim=cos_sim, nb_recos: int=10, dummies_df: pd.DataFrame=movies_dummies):
    '''
    Get the recommendations for a movie.
    :param title: Title of the movie to get the recommendations for.
    :param cos_sim: Cosine similarity matrix.
    :param nb_recos: Number of recommendations to get.
    :param dummies_df: Dataframe with the dummies of the movies.
    :return: List of nb_recos recommendations.    
    '''
    CB_recos = [] # Initialisation of the list of recommendations
    title_keywords = title.split(" ") # We split the title into keywords
    title_list = []
    # We collect all the movies titles which contain all the keywords of the title
    for full_title in dummies_df.title.values.tolist():
        if all(word.lower() in full_title.lower() for word in title_keywords):
            title_list.append(full_title)
    
    clear_output(wait=True)
    # If we get a direct match, we return the title.    
    if len(title_list) == 1:
        current_title = title_list[0]
    # Else, we get all the movies with similar names and ask the user to choose one.
    elif len(title_list) > 1:
        print("Several movies found with similar title. Please choose one of the following:")
        for i, item in enumerate(title_list,1):
            print(i, ': ' + item, sep='',end='\n')
        choice = -1
        while choice < 1 or choice > len(title_list):
            try:
                choice = int(input("Enter the number of the movie: "))
            except ValueError:
                choice = -1 # Non-numeric input: ask again
        current_title = title_list[choice-1]
    else:
        return f'Error: The movie {title} requested was not found.'

    clear_output(wait=True)
    print(f"The movie you requested is: \"{current_title}\"\nIf you like this movie, you might also like:")
    # If the movie is in the database
    idx = dummies_df.index[dummies_df.title == current_title][0] # We retrieve the index of the movie (the most recent one if there are homonyms)
    score_series = pd.Series(cos_sim[idx]).drop(idx).sort_values(ascending=False) # We sort the similarity matrix of the movies and we drop the movie itself
    top_nbrecos = list(score_series.iloc[0:nb_recos].index) # We select the top nb_recos movies 
    # We store the titles of the top nb_recos movies in the list CB_recos and return it
    for i in top_nbrecos:             
        CB_recos.append((dummies_df.title[i], cos_sim[idx][i]))
    return CB_recos

In [18]:
CB_recommender(title=input("Enter a movie title or keywords: "), cos_sim=cos_sim, nb_recos=10)

The movie you requested is: "Avengers"
If you like this movie, you might also like:


[('Star Wars : Episode III - La Revanche des Sith', 4.200000000000001),
 ('Rogue One: A Star Wars Story', 4.1000000000000005),
 ('La Planète des singes : les origines', 4.1000000000000005),
 ('Star Trek Into Darkness', 4.000000000000001),
 ('Star Trek', 3.900000000000001),
 ("Captain America, le soldat de l'hiver", 3.900000000000001),
 ('La Planète des Singes - Suprématie', 3.900000000000001),
 ('Planète interdite', 3.8000000000000007),
 ("Avengers : L'ère d'Ultron", 3.8000000000000007),
 ('Stargate, la porte des étoiles', 3.700000000000001)]

Many movies share exactly the same genres, so their cosine similarity is identical and maximal, which produces many ties among the top candidates. With such ties, the order of the recommendations depends on the original order of the movies and may therefore not be very relevant. This is because genres are the only comparison criterion. We need more features, or a weighting, to get more accurate recommendations.

So far, I obtained better recommendations by multiplying the cosine similarity matrix by the movies' ratings. That way, among movies with the same genres, the highly rated ones are ranked first.
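
The effect of this weighting can be checked on a small example. The sketch below assumes three made-up movies with identical genres and made-up ratings. It only illustrates how multiplying by the ratings breaks a tie and does not reproduce the notebook's data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical one-hot genre matrix for 3 movies (columns: Action, Sci-Fi).
genres = np.array([
    [1, 1],  # movie 0
    [1, 1],  # movie 1: same genres as movie 0
    [1, 1],  # movie 2: same genres as movie 0
])
# Hypothetical user ratings, one per movie.
user_rating = np.array([3.0, 4.5, 2.0])

raw_sim = cosine_similarity(genres)  # every entry is (numerically) 1: a three-way tie
weighted = raw_sim * user_rating     # broadcasting scales column j by the rating of movie j

# Rank the candidates for movie 0, excluding movie 0 itself.
scores = weighted[0].copy()
scores[0] = -np.inf
ranking = np.argsort(scores)[::-1][:2]
```

Note that the weighted matrix is no longer symmetric: row `i` holds the scores of all candidates for movie `i`, each scaled by the candidate's own rating. This is also why the scores in the output above look like ratings rather than values between 0 and 1.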

## **Model N°2:** Model 1 + tag analysis

In [11]:
# We retrieve only the useful columns for the content-based recommender system
cols_to_keep_2 = ["title", "release_date", "user_rating", "genres", "summary"]

In [12]:
movies_dummies_2 = get_dummies_df(df=movies, keep_cols=cols_to_keep_2, col_to_dummies=["genres", "tags"])

In [13]:
# We compute the cosine similarity between each pair of movies from the dummy columns only
# (the "+ 1" skips the "tags" column added by get_tags), then weight it by the user rating.
cos_sim_2 = cosine_similarity(movies_dummies_2.iloc[:, len(cols_to_keep_2) + 1:]) * movies_dummies_2.user_rating.values

In [25]:
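# We pass the dataframe that cos_sim_2 was built from, so that titles and similarity rows stay aligned
CB_recommender(title=input("Enter a movie title or keywords: "), cos_sim=cos_sim_2, nb_recos=10, dummies_df=movies_dummies_2)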

The movie you requested is: "Pirates des Caraïbes : la Malédiction du Black Pearl"
If you like this movie, you might also like:


[("Pirates des Caraïbes : Jusqu'au Bout du Monde", 2.114285714285714),
 ('Pirates des Caraïbes : le Secret du Coffre Maudit', 2.08463768691691),
 ('Pirates des Caraïbes : la Vengeance de Salazar', 1.9999999999999996),
 ('Black Panther', 1.7590581895678477),
 ('Spider-Man', 1.7457431218879391),
 ('Predator', 1.658455965793542),
 ('Spider-Man 2', 1.658455965793542),
 ('Black', 1.5275252316519468),
 ('Aquaman', 1.5275252316519468),
 ('Spider-Man 3', 1.5275252316519468)]

In this second model, we also used the titles of the movies, so that movies whose titles suggest the same subject can be recommended together. With this new model, movies that share parts of their title are more likely to be recommended to the user, even if they have a lower rating. Moreover, because we remove all punctuation marks and symbols, a title like `"X-Men"` becomes `"X Men"`, and since both words are treated as stop-words, they can never be used to suggest other movies from the `"X-Men Saga"`. Movies with titles like `"Le Monde de Narnia : Chapitre 1 - Le lion, la sorcière blanche et l'armoire magique"` are not harmed by this character removal.
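
The `"X-Men"` case can be reproduced with a small sketch of the tagging step. The helper `title_to_tags` and the tiny stop-word set are hypothetical: the real list comes from `get_stop_words()` and its text files, which are not shown here. `"x"` and `"men"` are included only to mirror the situation described above.

```python
import re

# Hypothetical, much smaller stop-word set than the one built by get_stop_words().
stop_words = {"x", "men", "le", "la", "de", "et", "l", ""}

def title_to_tags(title, stop_words=stop_words):
    # Same two steps as get_tags: replace symbols and digits with spaces,
    # split on separators, then drop stop-words (case-insensitive).
    rmv_char = r"(!|#|:|\$|\%|\^|\&|\*|\(|\)|-|\+|/|\?|\.+|\d)"
    tokens = re.split(r" |,|\. |\.\.\.|\"|'|-", re.sub(rmv_char, " ", title))
    return [t for t in tokens if t.lower() not in stop_words]

xmen_tags = title_to_tags("X-Men")
narnia_tags = title_to_tags(
    "Le Monde de Narnia : Chapitre 1 - Le lion, la sorcière blanche et l'armoire magique"
)
```

Under these assumptions, `xmen_tags` is an empty list, so the title contributes nothing to the similarity, while `narnia_tags` still contains the distinctive keywords of the title.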