# Content-based filtering

A content-based filtering recommender system recommends items (such as films, products or articles) to a user based on the features of the items they have previously liked or consumed.

In [1]:
import pandas as pd 
all = pd.read_csv(r"C:\Users\nolle\OneDrive\Documents\Université\M2\Mémoire\Base de données traitées\All.csv")
movies_final = pd.read_csv(r"C:\Users\nolle\OneDrive\Documents\Université\M2\Mémoire\Base de données traitées\movies_final.csv")

In [2]:
movies_final["overview"].isna().sum()
# 8 synopses are missing, so these films are removed from the dataset in the next cell.

8

In [3]:
movies_final_overview = movies_final[movies_final["overview"].notna()]
len(movies_final_overview)
# The difference is indeed 8 films (3027 - 8 = 3019).

3019

In [4]:
# Index issue: the labels still run up to 3026 although only 3019 rows remain
movies_final_overview.tail(8)

Unnamed: 0.1,Unnamed: 0,movieId,genres_x,title_bis,title_clean,date,budget,overview,release_date,revenue,tagline
3019,3025,3945,Adventure|Animation|Children's,Digimon: The Movie,Digimon: The Movie (2000),2000,5500000,The first story focused on Tai and Kari Kamiya...,2000-03-17,0.0,
3020,3026,3946,Action|Drama|Thriller,Get Carter,Get Carter (2000),2000,63600000,Remake of the Michael Caine classic. Jack Cart...,2000-10-06,19412993.0,The Truth Hurts
3021,3027,3947,Thriller,Get Carter,Get Carter (1971),1971,1814462,"Michael Caine is Jack Carter, a small-time hoo...",1971-03-03,0.0,What happens when a professional killer violat...
3022,3028,3948,Comedy,Meet the Parents,Meet the Parents (2000),2000,55000000,"Greg Focker is ready to marry his girlfriend, ...",2000-10-06,330444045.0,First comes love. Then comes the interrogation.
3023,3029,3949,Drama,Requiem for a Dream,Requiem for a Dream (2000),2000,4500000,The hopes and dreams of four ambitious people ...,2000-10-27,7390108.0,
3024,3030,3950,Drama,Tigerland,Tigerland (2000),2000,10000000,A group of recruits go through Advanced Infant...,2000-09-22,0.0,The system wanted them to become soldiers. One...
3025,3031,3951,Drama,Two Family House,Two Family House (2000),2000,0,Buddy Visalo (Michael Rispoli) is a factory wo...,2000-01-21,0.0,The only way to find out what you love is to r...
3026,3032,3952,Drama|Thriller,"Contender, The",The Contender (2000),2000,9000000,Political thriller about Laine Hanson's nomina...,2000-10-13,0.0,Sometimes you can assassinate a leader without...


In [5]:
movies_final_overview = movies_final_overview.reset_index(drop = True)
movies_final_overview.tail()

Unnamed: 0.1,Unnamed: 0,movieId,genres_x,title_bis,title_clean,date,budget,overview,release_date,revenue,tagline
3014,3028,3948,Comedy,Meet the Parents,Meet the Parents (2000),2000,55000000,"Greg Focker is ready to marry his girlfriend, ...",2000-10-06,330444045.0,First comes love. Then comes the interrogation.
3015,3029,3949,Drama,Requiem for a Dream,Requiem for a Dream (2000),2000,4500000,The hopes and dreams of four ambitious people ...,2000-10-27,7390108.0,
3016,3030,3950,Drama,Tigerland,Tigerland (2000),2000,10000000,A group of recruits go through Advanced Infant...,2000-09-22,0.0,The system wanted them to become soldiers. One...
3017,3031,3951,Drama,Two Family House,Two Family House (2000),2000,0,Buddy Visalo (Michael Rispoli) is a factory wo...,2000-01-21,0.0,The only way to find out what you love is to r...
3018,3032,3952,Drama|Thriller,"Contender, The",The Contender (2000),2000,9000000,Political thriller about Laine Hanson's nomina...,2000-10-13,0.0,Sometimes you can assassinate a leader without...


## I. Building the similarity matrix with TF-IDF

### a. TF-IDF: principle and application

**TF-IDF = Term Frequency × Inverse Document Frequency.**  
It is a weighting scheme that measures how important a word is in a text (such as a film synopsis) relative to the whole collection of texts.

**TF (Term Frequency): how often a word occurs in a document**  
→ The more often a word appears in a text, the more weight it carries in that text  
Example: "spy" appears 5 times → high TF score

**IDF (Inverse Document Frequency): how rare the word is across all documents**  
→ A word that is very common (e.g. "the", "story") carries little useful information  
→ The rarer a word, the more informative it is considered

Note: a "document" here is one cell of the `overview` column.

Example: the word "penguin" appears in only one synopsis, so it is very rare across the collection (high IDF). It also appears 3 times in that synopsis (high TF). Its TF-IDF score will therefore be very high.
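The "penguin" example can be checked by hand with a minimal sketch. It uses a made-up three-document corpus and the textbook definition (raw count times `log(N / document frequency)`); the helper `tf_idf` is hypothetical. By default, scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalises each row, so its values differ from this variant while following the same intuition.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration): each string plays the role of one overview
docs = [
    "penguin penguin penguin spy story",
    "spy story love",
    "love story family",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents in which each word appears at least once
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tf_idf(tokens):
    """Textbook TF-IDF: raw count in the document times log(N / document frequency)."""
    counts = Counter(tokens)
    return {word: count * math.log(n_docs / doc_freq[word]) for word, count in counts.items()}


scores = tf_idf(tokenized[0])
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {score:.3f}")
```

"penguin" (frequent in the document, absent elsewhere) gets the highest weight, while "story" (present in every document) gets a weight of zero.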

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer # type: ignore
import pandas as pd

# One-hot encoding of the genres
genres_df = movies_final_overview['genres_x'].str.get_dummies(sep='|')

In [7]:
# TF-IDF on the overview
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix = vectorizer.fit_transform(movies_final_overview["overview"])
overview_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

In [8]:
# Final merge: title, genre indicators and TF-IDF weights side by side
content_df = pd.concat([movies_final_overview['title_clean'], genres_df, overview_df], axis=1)
len(content_df)

3019

In [25]:
# Select the row for the film
row = content_df[content_df['title_clean'] == 'Toy Story (1995)']

# Drop the title_clean column to keep only the numeric features
row_no_title = row.drop(columns=['title_clean'])

# Transpose to "long" (vertical) format
row_long = row_no_title.T.reset_index()
row_long.columns = ['feature', 'value']


# Sort in ascending order: the first rows are therefore zero-weight features
# (use ascending=False to see the highest-weighted features instead)
row_long_sorted = row_long.sort_values(by='value', ascending=True)

# Result
print(row_long_sorted.head(30))

          feature  value
0          Action    0.0
670      personal    0.0
671         peter    0.0
672  photographer    0.0
673       picture    0.0
674         pilot    0.0
676        places    0.0
677          plan    0.0
678         plane    0.0
679        planet    0.0
680       planned    0.0
681         plans    0.0
682          play    0.0
683        played    0.0
684        player    0.0
685       playing    0.0
686         plays    0.0
687          plot    0.0
688        police    0.0
689     political    0.0
690          poor    0.0
691       popular    0.0
692          post    0.0
693         power    0.0
694      powerful    0.0
695        powers    0.0
696      pregnant    0.0
697       present    0.0
669        person    0.0
668       perfect    0.0


### b. Similarity matrix

A similarity matrix is a square table (same number of rows and columns) in which each cell indicates how similar two items are.

In our case:
- Each row and each column represents a film
- Each value is the similarity between two films; because all features here are non-negative (one-hot genres and TF-IDF weights), the cosine similarity lies between 0 and 1
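As an illustration of what each cell contains, here is a small pure-Python sketch of cosine similarity on hypothetical genre vectors (the three films and their features are made up). The notebook itself relies on `sklearn.metrics.pairwise.cosine_similarity`, which computes the same quantity for all pairs at once.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical feature vectors: [Animation, Comedy, Drama, Thriller]
film_a = [1, 1, 0, 0]
film_b = [0, 1, 0, 0]
film_c = [0, 0, 1, 1]

films = [film_a, film_b, film_c]

# Square, symmetric matrix with (approximately) 1 on the diagonal
sim_matrix = [[cosine(u, v) for v in films] for u in films]

for row in sim_matrix:
    print([round(value, 3) for value in row])
```

Films sharing one genre out of a few get an intermediate score, and films with no feature in common get 0.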

In [9]:
from sklearn.metrics.pairwise import cosine_similarity

# The 'title_clean' column is dropped below to keep only the numeric vectors

In [10]:
# Cosine similarity between every pair of films
cosine_sim = cosine_similarity(content_df.drop(columns = ['title_clean']))
cosine_sim

array([[1.        , 0.26461398, 0.28867513, ..., 0.        , 0.        ,
        0.        ],
       [0.26461398, 1.        , 0.03338645, ..., 0.        , 0.        ,
        0.        ],
       [0.28867513, 0.03338645, 1.        , ..., 0.        , 0.01381401,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.5       ,
        0.40824829],
       [0.        , 0.        , 0.01381401, ..., 0.5       , 1.        ,
        0.40824829],
       [0.        , 0.        , 0.        , ..., 0.40824829, 0.40824829,
        1.        ]])

In [11]:
cosine_sim_df = pd.DataFrame(cosine_sim, 
             index = content_df['title_clean'],
             columns = content_df['title_clean']
)

In [None]:
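# Display the full similarity matrix. For a single film, index by its full title,
# e.g. cosine_sim_df.loc['Toy Story (1995)'] (the bare key 'Toy Story' is not in the index).
cosine_sim_df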

title_clean,Toy Story (1995),Jumanji (1995),Grumpier Old Men (1995),Waiting to Exhale (1995),Father of the Bride Part II (1995),Heat (1995),Sabrina (1995),Tom and Huck (1995),Sudden Death (1995),GoldenEye (1995),...,Bamboozled (2000),Bootmen (2000),Digimon: The Movie (2000),Get Carter (2000),Get Carter (1971),Meet the Parents (2000),Requiem for a Dream (2000),Tigerland (2000),Two Family House (2000),The Contender (2000)
Toy Story (1995),1.000000,0.264614,0.288675,0.288675,0.353553,0.000000,0.288675,0.288675,0.000000,0.000000,...,0.353553,0.288675,0.500000,0.000000,0.000000,0.353553,0.000000,0.000000,0.000000,0.000000
Jumanji (1995),0.264614,1.000000,0.033386,0.000000,0.000000,0.035990,0.000000,0.577350,0.094606,0.250000,...,0.000000,0.000000,0.518117,0.005993,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
Grumpier Old Men (1995),0.288675,0.033386,1.000000,0.333333,0.434102,0.000000,0.666667,0.005854,0.000000,0.000000,...,0.408248,0.333333,0.000000,0.000000,0.010131,0.431997,0.000000,0.000000,0.013814,0.000000
Waiting to Exhale (1995),0.288675,0.000000,0.333333,1.000000,0.408248,0.009095,0.333333,0.010541,0.000000,0.000000,...,0.408248,0.666667,0.030680,0.288675,0.015718,0.432416,0.408248,0.408248,0.408248,0.333333
Father of the Bride Part II (1995),0.353553,0.000000,0.434102,0.408248,1.000000,0.000000,0.453561,0.000000,0.040416,0.000000,...,0.500000,0.408248,0.000000,0.000000,0.000000,0.542733,0.000000,0.000000,0.028711,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Meet the Parents (2000),0.353553,0.000000,0.431997,0.432416,0.542733,0.000000,0.408248,0.000000,0.008788,0.000000,...,0.500000,0.424900,0.030513,0.040201,0.040888,1.000000,0.000000,0.000000,0.015085,0.000000
Requiem for a Dream (2000),0.000000,0.000000,0.000000,0.408248,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.408248,0.000000,0.353553,0.000000,0.000000,1.000000,0.500000,0.523642,0.408248
Tigerland (2000),0.000000,0.000000,0.000000,0.408248,0.000000,0.000000,0.000000,0.012817,0.000000,0.000000,...,0.000000,0.408248,0.000000,0.353553,0.000000,0.000000,0.500000,1.000000,0.500000,0.408248
Two Family House (2000),0.000000,0.000000,0.013814,0.408248,0.028711,0.000000,0.000000,0.004450,0.000000,0.000000,...,0.020079,0.421192,0.034200,0.363396,0.023814,0.015085,0.523642,0.500000,1.000000,0.408248


## II. Recommendations

### a. Building the prediction function

The function below predicts a rating for one user and one film.
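Read as a formula, `predict_rating_for_user` computes a similarity-weighted average of the user's own ratings. The notation is introduced here for illustration only: $I_u$ is the set of films rated by user $u$, $s(t, j)$ the cosine similarity between the target film $t$ and a rated film $j$, $r_{u,j}$ the rating given by $u$ to $j$, and $s_{\min}$ the `min_similarity` threshold.

$$
\hat{r}_{u,t} = \frac{\sum_{j \in N_u(t)} s(t, j)\, r_{u,j}}{\sum_{j \in N_u(t)} s(t, j)},
\qquad
N_u(t) = \{\, j \in I_u : s(t, j) \ge s_{\min} \,\}
$$

When $N_u(t)$ is empty the denominator is zero and the function returns `None` instead of a rating.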

In [39]:
def predict_rating_for_user(user_id, 
                            all_df, 
                            similarity_df, 
                            target_title, 
                            min_similarity = 0.1) :
    """
    Predict the rating a user would give to a film they have not seen.
    
    user_id : int
    all_df : DataFrame with columns ['userId', 'rating', 'title_clean_x']
    similarity_df : film-to-film similarity matrix (DataFrame whose index and columns are titles)
    target_title : title of the film to predict (must be in similarity_df)
    min_similarity : minimum similarity for a rated film to enter the calculation (filters out small, noisy scores)
    
    Returns the predicted rating, or None if there is not enough information.
    """
    # Films rated by the user
    user_data = all_df[all_df['userId'] == user_id]
    
    # Running sums of the similarity-weighted average
    numerateur = 0
    denominateur = 0
    
    for _, row in user_data.iterrows():
        seen_title = row['title_clean_x']
        rating = row['rating']
        
        # Skip films that have no entry in the similarity matrix
        if seen_title not in similarity_df.columns or target_title not in similarity_df.index:
            continue
        
        sim = similarity_df.at[target_title, seen_title]
        
        if sim >= min_similarity :
            numerateur += sim * rating
            denominateur += sim
    
    if denominateur == 0:
        return None  # Not enough information to predict
    return numerateur / denominateur

In [40]:
predict_rating_for_user(user_id = 1, 
                        all_df = all, 
                        similarity_df = cosine_sim_df,
                        target_title = "One Flew Over the Cuckoo's Nest (1975)", 
                        )

4.459309761697153

### b. Evaluating the system

1. Building the utility matrix

In [41]:
# Build the utility matrix (user x film)
util_matrix = all.pivot_table(
    index='userId',
    columns='title_clean_x',
    values='rating'
)

2. Applying the function: for each user, 20% of their ratings are hidden and then predicted from the information in the remaining 80%.  

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tqdm import tqdm

def evaluate_content_based_recommender(all_df, similarity_df, min_similarity=0.1, test_size=0.2):
    y_true = []
    y_pred = []
    
    all_users = all_df['userId'].unique()

    for user_id in tqdm(all_users, desc="Per-user evaluation") :

        # Subset of this user's ratings
        user_ratings = all_df[all_df['userId'] == user_id]

        # A split is not possible with fewer than 5 rated films. In our dataset the least
        # active user has rated 11 films, so this guard never triggers.

        if len(user_ratings) < 5 :
            continue
        
        # 80/20 split for each user
        train_user, test_user = train_test_split(user_ratings, test_size=test_size, random_state=42)

        # Training set
        train_df = train_user.copy()

        for _, row in test_user.iterrows():
            target_title = row['title_clean_x']
            true_rating = row['rating']
            
            pred = predict_rating_for_user(
                user_id=user_id,
                all_df=train_df,
                similarity_df=similarity_df,
                target_title=target_title,
                min_similarity=min_similarity
            )
            
            y_true.append(true_rating)
            y_pred.append(pred if pred is not None else np.nan)
    
    return y_true, y_pred

In [27]:
# Uncomment the next line to run the evaluation; it defines y_true and y_pred used in the next cell.
# y_true, y_pred = evaluate_content_based_recommender(all, similarity_df=cosine_sim_df)

# Convert to numpy arrays to compute the metrics
# y_true_np = np.array(y_true)
# y_pred_np = np.array(y_pred)

# Filter out NaNs to evaluate only the predictions that were actually made
# mask = ~np.isnan(y_pred_np) # Boolean mask, False where a prediction is missing

# rmse = np.sqrt(np.mean((y_true_np[mask] - y_pred_np[mask]) ** 2))
# mae = np.mean(np.abs(y_true_np[mask] - y_pred_np[mask]))

# print(f"RMSE: {rmse:.4f}")
# print(f"MAE: {mae:.4f}")

In [None]:
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Assumes y_true and y_pred are already defined (see the previous cell)
y_true_np = np.array(y_true)
y_pred_np = np.array(y_pred)

# Mask keeping only valid (non-NaN) predictions
mask = ~np.isnan(y_pred_np)

# Metrics on the valid values only.
# np.sqrt(MSE) is used instead of squared=False, which recent scikit-learn versions no longer accept.
rmse = np.sqrt(mean_squared_error(y_true_np[mask], y_pred_np[mask]))
mae = mean_absolute_error(y_true_np[mask], y_pred_np[mask])

# Coverage = number of valid predictions / total number of test ratings
coverage = np.sum(mask) / len(y_true_np)

print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"Coverage: {coverage*100:.2f}%")

RMSE: 1.0074
MAE: 0.8035
Coverage: 99.84%


These results are computed over 181,844 predictions.

In [29]:
len(y_pred_np[mask])

181844
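To make the three metrics concrete, here is a small worked sketch on made-up values, written in pure Python, where `None` stands for a film the recommender could not score. RMSE and MAE are computed only on the covered pairs, so they should always be read together with the coverage.

```python
import math

# Hypothetical held-out ratings and predictions (None = no prediction was possible)
y_true_toy = [4.0, 3.0, 5.0, 2.0]
y_pred_toy = [3.5, None, 4.0, 2.0]

# Keep only the pairs for which a prediction exists
pairs = [(t, p) for t, p in zip(y_true_toy, y_pred_toy) if p is not None]

# RMSE penalises large errors more heavily than MAE
rmse_toy = math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))
mae_toy = sum(abs(t - p) for t, p in pairs) / len(pairs)

# Share of test ratings for which a prediction was produced
coverage_toy = len(pairs) / len(y_true_toy)

print(f"RMSE: {rmse_toy:.4f}")
print(f"MAE: {mae_toy:.4f}")
print(f"Coverage: {coverage_toy * 100:.2f}%")
```

Here the errors on the three covered ratings are 0.5, 1.0 and 0, which gives an MAE of 0.5 and a slightly larger RMSE, with 75% coverage.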

### c. Generalising the system to predict as many ratings as possible

Goal: fill every gap in the utility matrix to obtain as many predictions as possible.

In [None]:
# Create a copy to fill in
# util_matrix_filled = util_matrix.copy()

# Go through each empty cell and predict the rating
# for user_id in util_matrix_filled.index :
#     for movie in util_matrix_filled.columns :
#         if pd.isna(util_matrix_filled.at[user_id, movie]) :
#             predicted_rating = predict_rating_for_user(
#                 user_id = user_id,
#                 all_df = all,
#                 similarity_df = cosine_sim_df,
#                 target_title = movie
#             )
#             util_matrix_filled.at[user_id, movie] = predicted_rating

I had to interrupt the run and work with the predictions obtained so far, because the computation was too resource-intensive. We were nevertheless able to make more than 22,000 predictions, as we will see later.
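The nested loop is expensive because every empty cell triggers a fresh call that filters the ratings table and iterates over the user's rows. One possible alternative, sketched here under assumptions and not what was run for this work, expresses the same similarity-weighted average as two matrix products with NumPy. The function name `fill_predictions` is hypothetical, and the sketch assumes the columns of the ratings array follow the same film order as the rows and columns of the similarity matrix.

```python
import numpy as np


def fill_predictions(ratings, sim, min_similarity=0.1):
    """
    ratings : (n_users, n_films) array, np.nan where the user has not rated the film
    sim     : (n_films, n_films) similarity matrix, same film order as `ratings`
    Returns an array of the same shape: observed ratings are kept, missing ones are
    replaced by the similarity-weighted average of the user's ratings, and cells with
    no neighbour above the threshold stay np.nan.
    """
    rated = ~np.isnan(ratings)
    r = np.where(rated, ratings, 0.0)               # unrated films contribute 0
    w = np.where(sim >= min_similarity, sim, 0.0)   # drop similarities below the threshold

    num = r @ w.T                                   # num[u, t] = sum_j w[t, j] * r[u, j]
    den = rated.astype(float) @ w.T                 # den[u, t] = sum_j w[t, j] over rated j

    with np.errstate(divide="ignore", invalid="ignore"):
        pred = np.where(den > 0, num / den, np.nan)

    return np.where(rated, ratings, pred)


# Toy data (made up): 3 users, 3 films
toy_ratings = np.array([
    [5.0, np.nan, 3.0],
    [np.nan, 4.0, np.nan],
    [np.nan, np.nan, 2.0],
])
toy_sim = np.array([
    [1.00, 0.50, 0.05],
    [0.50, 1.00, 0.80],
    [0.05, 0.80, 1.00],
])

filled = fill_predictions(toy_ratings, toy_sim)
print(filled)
```

With the real data, the inputs could be obtained from `util_matrix` and `cosine_sim_df` after aligning both on a common list of titles; that alignment step is left out of this sketch.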

In [None]:
# util_matrix_filled.to_csv(r"C:\Users\PVUE733\OneDrive - LA POSTE GROUPE\Documents\Mémoire\Util matrix FC.csv")

### d. Best results for a given user

Issue: only the predicted ratings must be kept, not the ratings that already existed.

In [1]:
import pandas as pd
util_matrix_filled = pd.read_csv(r"C:\Users\PVUE733\OneDrive - LA POSTE GROUPE\Documents\Mémoire\Util matrix FC.csv")


In [31]:
# Predictions for user 1

user1 = util_matrix_filled.loc[util_matrix_filled["userId"] == 1]

In [32]:
user1

Unnamed: 0,userId,...And Justice for All (1979),1-900 (1994),10 Things I Hate About You (1999),101 Dalmatians (1996),12 Angry Men (1957),2 Days in the Valley (1996),20 Dates (1998),"20,000 Leagues Under the Sea (1954)",200 Cigarettes (1999),...,You've Got Mail (1998),Young Doctors in Love (1982),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),Young Sherlock Holmes (1985),Young and Innocent (1937),Zachariah (1971),Zero Effect (1998),eXistenZ (1999)
0,1,4.405961,3.549038,3.953809,4.26347,4.444594,4.0,4.151375,4.222804,4.349979,...,3.954327,4.164166,4.165382,4.222668,4.222916,4.2089,4.0,,4.133676,4.176754


In [34]:
user1.set_index("userId", inplace = True)

In [35]:
# Pre-existing ratings are blanked out so that only predictions remain

for i in user1.columns :
    if pd.notna(util_matrix.at[1, i]) :
        user1.at[1, i] = None

In [37]:
user1.reset_index(inplace = True)

In [None]:
user1_long = user1.melt(
    id_vars     = 'userId',
    var_name    = 'title_clean_x',
    value_name  = 'pred_rating'
)

In [30]:
util_matrix_filled[util_matrix_filled['userId'] == 1]

Unnamed: 0,userId,...And Justice for All (1979),1-900 (1994),10 Things I Hate About You (1999),101 Dalmatians (1996),12 Angry Men (1957),2 Days in the Valley (1996),20 Dates (1998),"20,000 Leagues Under the Sea (1954)",200 Cigarettes (1999),...,You've Got Mail (1998),Young Doctors in Love (1982),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),Young Sherlock Holmes (1985),Young and Innocent (1937),Zachariah (1971),Zero Effect (1998),eXistenZ (1999)
0,1,4.405961,3.549038,3.953809,4.26347,4.444594,4.0,4.151375,4.222804,4.349979,...,3.954327,4.164166,4.165382,4.222668,4.222916,4.2089,4.0,,4.133676,4.176754


In [27]:
def top_10(df, 
           nombre_films_a_suggerer = 10):
    """
    Print the films with the highest predicted ratings for a given user.

    Parameters:
    - df (pd.DataFrame): wide DataFrame with a 'userId' column and one column per film title
      (such as util_matrix_filled).
    - nombre_films_a_suggerer (int): number of films to suggest.

    Relies on the global util_matrix to tell pre-existing ratings from predictions.
    """
    user_input = input("Enter your UserID: ")

    try:
        user_id = int(user_input)
    except ValueError:
        print("Please enter a valid numeric identifier.")
        return

    # Keep this user's row only; copy it so the original DataFrame is not modified,
    # and drop the leftover CSV index column if present
    user_predictions = df[df['userId'] == user_id].copy()
    user_predictions = user_predictions.drop(columns=['Unnamed: 0'], errors='ignore')

    if user_predictions.empty :
        print(f"No predictions found for user {user_id}.")
        return

    user_predictions.set_index("userId", inplace = True)

    # Keep only predictions: blank out the films this user had already rated
    for i in user_predictions.columns :
        if i in util_matrix.columns and pd.notna(util_matrix.at[user_id, i]) :
            user_predictions.at[user_id, i] = None

    # Reshape to long format: one row per (user, film)
    user_predictions.reset_index(inplace = True)
    user_long = user_predictions.melt(
        id_vars     = 'userId',
        var_name    = 'title_clean_x',
        value_name  = 'pred_rating'
    )

    # Sort by predicted rating, descending, and keep the requested number of films
    top_10_user = user_long.sort_values("pred_rating", ascending=False).head(nombre_films_a_suggerer)

    # Display the results
    print(f"\nWe suggest to UserID {user_id} :\n")
    for _, row in top_10_user.iterrows():
        print(f"{row['title_clean_x']}, Predicted rating: {row['pred_rating']:.2f}")

In [28]:
top_10(util_matrix_filled, 
       nombre_films_a_suggerer = 10)


### e. Limitations and advantages

#### a. Top predictions rest on limited information

The goal of this section is to understand how a rating of 5 came to be predicted.

In [21]:
top_10_user_1 = user1_long.sort_values("pred_rating", ascending = False).head(10)[["userId", "title_clean_x", "pred_rating"]]

In [22]:
top_10_user_1.set_index('title_clean_x')['pred_rating']

title_clean_x
Stalingrad (1993)                        5.0
Catwalk (1995)                           5.0
Underground (1995)                       5.0
Hell Night (1981)                        5.0
Inferno (1980)                           5.0
Halloween II (1981)                      5.0
Snowriders (1996)                        5.0
The Funhouse (1981)                      5.0
The Masque of the Red Death (1964)       5.0
All Quiet on the Western Front (1930)    5.0
Name: pred_rating, dtype: float64

In [151]:
top_10_user_1_movie = top_10_user_1["title_clean_x"].to_list()

In [None]:
top_10_user_1_movie

['Stalingrad (1993)',
 'Catwalk (1995)',
 'Underground (1995)',
 'Hell Night (1981)',
 'Inferno (1980)',
 'Halloween II (1981)',
 'Snowriders (1996)',
 'The Funhouse (1981)',
 'The Masque of the Red Death (1964)',
 'All Quiet on the Western Front (1930)']

In [153]:
# Keep user 1's ratings in a separate variable so the full `all` dataset stays available
all_user1 = all.loc[all["userId"] == 1]

In [154]:
user1_seen = all_user1["title_clean_x"].to_list()
len(user1_seen)

50

In [None]:
for i in top_10_user_1_movie :
    print('\n')
    print(f'Similar movies with {i} :')

    for j in user1_seen :
        if cosine_sim_df.loc[i, j] > 0.1 :
            print(f'{j} : Rating --> {all_user1.loc[all_user1["title_clean_x"] == j, "rating"].values[0]}')



Similar movies with Stalingrad (1993) :
Schindler's List (1993) : Rating --> 5
Saving Private Ryan (1998) : Rating --> 5


Similar movies with Catwalk (1995) :
The Last Days of Disco (1998) : Rating --> 5


Similar movies with Underground (1995) :
Schindler's List (1993) : Rating --> 5
Saving Private Ryan (1998) : Rating --> 5


Similar movies with Hell Night (1981) :
The Last Days of Disco (1998) : Rating --> 5


Similar movies with Inferno (1980) :
The Last Days of Disco (1998) : Rating --> 5


Similar movies with Halloween II (1981) :
Awakenings (1990) : Rating --> 5


Similar movies with Snowriders (1996) :
The Sound of Music (1965) : Rating --> 5


Similar movies with The Funhouse (1981) :
The Last Days of Disco (1998) : Rating --> 5


Similar movies with The Masque of the Red Death (1964) :
Beauty and the Beast (1991) : Rating --> 5


Similar movies with All Quiet on the Western Front (1930) :
Schindler's List (1993) : Rating --> 5
Saving Private Ryan (1998) : Rating --> 5
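The listing above shows that each 5.0 prediction rests on one or two similar films, all rated 5. With a similarity-weighted average, a single neighbour always yields exactly that neighbour's rating, however weak the similarity. A minimal sketch with made-up similarity values (the helper `weighted_prediction` is hypothetical and mirrors the core of `predict_rating_for_user`):

```python
def weighted_prediction(neighbours):
    """neighbours: list of (similarity, rating) pairs that passed the threshold."""
    numerator = sum(sim * rating for sim, rating in neighbours)
    denominator = sum(sim for sim, _ in neighbours)
    return numerator / denominator if denominator > 0 else None


# One barely similar neighbour rated 5: the similarity cancels out
single = weighted_prediction([(0.12, 5)])

# Two neighbours, both rated 5: still 5, whatever the similarities
double = weighted_prediction([(0.15, 5), (0.40, 5)])

# Several neighbours with different ratings: a genuinely blended score
mixed = weighted_prediction([(0.15, 5), (0.40, 3), (0.30, 4)])

print(single, double, mixed)
```

This is why the top of the ranking is dominated by films supported by very few neighbours: the prediction carries no notion of how much evidence backs it.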


#### b. No cold start: recommendations for every user

Reminder: the least active user has rated 11 films.

In [8]:
all['userId'].value_counts().sort_values()

userId
3234      11
4943      13
1488      14
3222      14
6038      15
        ... 
1181    1338
1941    1431
4277    1501
1680    1598
4169    1980
Name: count, Length: 6040, dtype: int64

In [64]:
movies_final['overview'].isna()

0       False
1       False
2       False
3       False
4       False
        ...  
3022    False
3023    False
3024    False
3025    False
3026    False
Name: overview, Length: 3027, dtype: bool

In [67]:
users = all['userId'].unique().tolist()
movies = movies_final['title_clean'].unique().tolist()
results = []

for user_id in tqdm(users, desc = "Calcul des films prédictibles par utilisateur") :
    seen_movies = all[all['userId'] == user_id]['title_clean_x'].unique().tolist()
    count_pred = 0

    for movie in movies :
        if movie in cosine_sim_df.index :
            if any(
                seen in cosine_sim_df.columns and cosine_sim_df.at[movie, seen] > 0.1
                for seen in seen_movies
            ):
                count_pred += 1

    results.append({'userId': user_id, 'Films à prédire': count_pred})

similarities = pd.DataFrame(results).set_index('userId')

Calcul des films prédictibles par utilisateur: 100%|██████████| 6040/6040 [22:19<00:00,  4.51it/s]


The minimum number of predictable films is 1768, so no user in the dataset faces a cold-start problem. This minimum is reached by user 2908, who rated only 17 films.
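The count took about 22 minutes because every (user, film, seen film) combination is looked up individually. A possible shortcut, sketched here in pure Python on made-up data, precomputes for each film the set of films similar to it, then takes the union of those sets over the films a user has rated. It relies on the similarity matrix being symmetric; `build_neighbours` and `count_predictable` are hypothetical helper names.

```python
def build_neighbours(sim, threshold=0.1):
    """sim: dict of dicts, sim[a][b] = similarity. Returns film -> set of films above the threshold."""
    return {a: {b for b, s in row.items() if s > threshold} for a, row in sim.items()}


def count_predictable(seen_by_user, neighbours):
    """
    seen_by_user : dict user_id -> set of film titles rated by that user
    neighbours   : dict film -> set of films similar to it (see build_neighbours)
    Returns dict user_id -> number of films with at least one similar rated film.
    """
    counts = {}
    for user_id, seen in seen_by_user.items():
        reachable = set()
        for film in seen:
            # Films without a similarity entry contribute nothing
            reachable |= neighbours.get(film, set())
        counts[user_id] = len(reachable)
    return counts


# Toy similarity matrix (made up), symmetric with 1.0 on the diagonal
toy_sim = {
    "A": {"A": 1.0, "B": 0.5, "C": 0.0},
    "B": {"A": 0.5, "B": 1.0, "C": 0.05},
    "C": {"A": 0.0, "B": 0.05, "C": 1.0},
}
toy_seen = {1: {"A"}, 2: {"C"}, 3: {"A", "C"}}

toy_counts = count_predictable(toy_seen, build_neighbours(toy_sim))
print(toy_counts)
```

As in the loop above, a film the user has already rated counts as predictable because its similarity with itself is 1.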

In [75]:
min_value = similarities['Films à prédire'].min()
print(min_value)

1768


In [None]:
similarities[similarities['Films à prédire'] == min_value]

userId,Films à prédire
2908,1768


In [78]:
len(all[all['userId'] == 2908])

17