Part 4: Evaluation and Metrics

Objectives

- Implement ranking metrics (Precision@K, Recall@K, MAP, NDCG)
- Evaluate the three recommendation models
- Compare their performance
- Compute business metrics


In [1]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from tqdm import tqdm
import warnings
import os
import sys
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)


4.1 - Ranking Metrics

Implement the following metrics:


In [2]:
# TODO: Implement Precision@K

def precision_at_k(recommended, relevant, k):
    """
    Compute Precision@K
    Precision@K = (relevant items in the top-K) / K
    
    Args:
        recommended: ordered list of recommended items
        relevant: list of relevant items
        k: number of recommendations to consider
    
    Returns:
        precision: float between 0 and 1
    """
    if k == 0 or len(recommended) == 0:
        return 0.0
    
    recommended_k = recommended[:k]
    
    recommended_set = set(recommended_k)
    relevant_set = set(relevant)
    
    relevant_in_top_k = len(recommended_set & relevant_set)
    
    precision = relevant_in_top_k / k
    return precision

recommended = [1, 3, 5, 7, 9]
relevant = [3, 5, 8, 10]

print(f"Precision@5: {precision_at_k(recommended, relevant, 5):.2f}")


Precision@5: 0.40


In [3]:
# TODO: Implement Recall@K

def recall_at_k(recommended, relevant, k):
    """
    Compute Recall@K
    Recall@K = (relevant items in the top-K) / (total relevant items)
    
    Args:
        recommended: ordered list of recommended items
        relevant: list of relevant items
        k: number of recommendations to consider
    
    Returns:
        recall: float between 0 and 1
    """
    if len(relevant) == 0:
        return 0.0
    
    recommended_k = recommended[:k]
    
    recommended_set = set(recommended_k)
    relevant_set = set(relevant)
    
    relevant_in_top_k = len(recommended_set & relevant_set)
    
    recall = relevant_in_top_k / len(relevant_set)
    return recall

print(f"Recall@5: {recall_at_k(recommended, relevant, 5):.2f}")


Recall@5: 0.50


In [5]:
# TODO: Implement MAP (Mean Average Precision)

def average_precision(recommended, relevant):
    """
    Compute the Average Precision for one user
    AP = mean of Precision@k over the ranks k where a relevant item is found
    
    Note: the mean is taken over the relevant items actually retrieved,
    not over all relevant items, so missed items do not lower the score.
    
    Args:
        recommended: ordered list of recommended items
        relevant: list of relevant items
    
    Returns:
        ap: float between 0 and 1
    """
    if len(relevant) == 0:
        return 0.0
    
    relevant_set = set(relevant)
    precisions = []
    relevant_found = 0
    
    for k in range(1, len(recommended) + 1):
        if recommended[k-1] in relevant_set:
            relevant_found += 1
            prec_at_k = relevant_found / k
            precisions.append(prec_at_k)
    
    if len(precisions) == 0:
        return 0.0
    
    ap = np.mean(precisions)
    return ap

def mean_average_precision(all_recommended, all_relevant):
    """
    Compute the MAP over all users
    MAP = mean of the Average Precision of every user
    
    Args:
        all_recommended: list of recommended-item lists, one per user
        all_relevant: list of relevant-item lists, one per user
    
    Returns:
        map_score: float between 0 and 1
    """
    if len(all_recommended) != len(all_relevant):
        raise ValueError("all_recommended and all_relevant must have the same length")
    
    aps = []
    for recommended, relevant in zip(all_recommended, all_relevant):
        ap = average_precision(recommended, relevant)
        aps.append(ap)
    
    map_score = np.mean(aps)
    return map_score

print(f"Average Precision: {average_precision(recommended, relevant):.3f}")


Average Precision: 0.583
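
The `average_precision` function above averages only over the relevant items that were retrieved. A common alternative divides by the total number of relevant items (capped at the list length), so that missed items lower the score. The sketch below illustrates that variant on the same toy lists; the name `average_precision_total` is hypothetical and is not used elsewhere in this notebook.

```python
def average_precision_total(recommended, relevant):
    """AP normalised by the number of relevant items (capped at the list length)."""
    relevant_set = set(relevant)
    if not relevant_set or not recommended:
        return 0.0

    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant_set:
            hits += 1
            precision_sum += hits / rank  # Precision at this rank

    # Divide by what an ideal list of this length could have retrieved
    return precision_sum / min(len(relevant_set), len(recommended))


recommended = [1, 3, 5, 7, 9]
relevant = [3, 5, 8, 10]
ap_total = average_precision_total(recommended, relevant)
print(f"AP (normalised by relevant count): {ap_total:.3f}")
```

With this normalisation the two missed items (8 and 10) count against the score, which is why the value is lower than the 0.583 printed above.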


In [6]:
# TODO: Implement NDCG (Normalized Discounted Cumulative Gain)

def dcg_at_k(relevances, k):
    """
    Compute DCG@K (Discounted Cumulative Gain)
    DCG@K = sum(rel_i / log2(i+1)) for i from 1 to k
    
    Args:
        relevances: list of relevance scores (1 if relevant, 0 otherwise) in recommendation order
        k: number of recommendations to consider
    
    Returns:
        dcg: float
    """
    relevances = relevances[:k]
    dcg = 0.0
    
    for i, rel in enumerate(relevances, start=1):
        dcg += rel / np.log2(i + 1)
    return dcg

def ndcg_at_k(recommended, relevant, k):
    """
    Compute NDCG@K (Normalized Discounted Cumulative Gain)
    NDCG@K = DCG@K / IDCG@K
    
    Args:
        recommended: ordered list of recommended items
        relevant: list of relevant items
        k: number of recommendations to consider
    
    Returns:
        ndcg: float between 0 and 1
    """
    if len(relevant) == 0:
        return 0.0
    
    relevant_set = set(relevant)
    
    # Binary relevance of each item in the top-K
    relevances = [1 if item in relevant_set else 0 for item in recommended[:k]]
    
    dcg = dcg_at_k(relevances, k)
    
    # Caveat: the ideal ranking is built by reordering the retrieved hits only,
    # so relevant items missing from the top-K do not lower the score.
    ideal_relevances = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal_relevances, k)
    
    if idcg == 0:
        return 0.0
    
    ndcg = dcg / idcg
    return ndcg

print(f"NDCG@5: {ndcg_at_k(recommended, relevant, 5):.3f}")


NDCG@5: 0.693
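
In `ndcg_at_k` above, the ideal DCG is obtained by sorting the relevances of the retrieved items. The more usual definition builds the ideal ranking from all relevant items, that is min(number of relevant items, K) hits placed at the top. The sketch below shows that variant for comparison; `ndcg_at_k_standard` is a hypothetical name, and the results reported later in this notebook were not computed with it.

```python
import numpy as np


def ndcg_at_k_standard(recommended, relevant, k):
    """NDCG@K with binary relevance and an ideal ranking built from all relevant items."""
    relevant_set = set(relevant)
    if not relevant_set or k <= 0:
        return 0.0

    # Binary gain of each recommended item in the top-K
    gains = np.array([1.0 if item in relevant_set else 0.0 for item in recommended[:k]])
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = float(np.sum(gains * discounts))

    # Ideal case: as many relevant items as possible occupy the first ranks
    ideal_hits = min(len(relevant_set), k)
    idcg = float(np.sum(1.0 / np.log2(np.arange(2, ideal_hits + 2))))

    return dcg / idcg


recommended = [1, 3, 5, 7, 9]
relevant = [3, 5, 8, 10]
ndcg_std = ndcg_at_k_standard(recommended, relevant, 5)
print(f"NDCG@5 (standard IDCG): {ndcg_std:.3f}")
```

On the toy lists this gives a lower value than 0.693, because the two relevant items that were never recommended now enter the normalisation.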


4.2 - Preparing the Test Set

Create a temporal test set to evaluate the models.


In [7]:
# TODO: Load the data
DATA_PATH = "../data/"
interactions_df = pd.read_csv(DATA_PATH + 'interactions.csv')
interactions_df['interaction_date'] = pd.to_datetime(interactions_df['interaction_date'])

# TODO: Temporal split (80% train, 20% test)
# Sort by date so that the test set holds the most recent interactions
interactions_df = interactions_df.sort_values('interaction_date')

split_idx = int(len(interactions_df) * 0.8)
train_interactions = interactions_df.iloc[:split_idx].copy()
test_interactions = interactions_df.iloc[split_idx:].copy()

print(f"Train: {len(train_interactions)} interactions")
print(f"Test: {len(test_interactions)} interactions")
print(f"Train period: {train_interactions['interaction_date'].min()} to {train_interactions['interaction_date'].max()}")
print(f"Test period: {test_interactions['interaction_date'].min()} to {test_interactions['interaction_date'].max()}")


Train: 40000 interactions
Test: 10000 interactions
Train period: 2023-12-05 10:25:45.743084 to 2025-10-11 10:25:45.743084
Test period: 2025-10-11 10:25:45.743084 to 2025-11-04 10:25:45.743084


In [8]:
# TODO: Function returning the relevant products

def get_relevant_products(user_id, test_interactions):
    """
    Return the relevant products for a user in the test set
    A product is relevant if it has a positive interaction (purchase, add_to_cart, or review with rating >= 3)
    
    Args:
        user_id: ID of the user
        test_interactions: DataFrame of test interactions
    
    Returns:
        List of relevant product_ids
    """
    user_interactions = test_interactions[test_interactions['user_id'] == user_id]
    
    positive_interactions = user_interactions[
        (user_interactions['interaction_type'].isin(['purchase', 'add_to_cart'])) |
        ((user_interactions['interaction_type'] == 'review') & (user_interactions['rating'] >= 3))
    ]
    
    relevant_products = positive_interactions['product_id'].unique().tolist()
    return relevant_products

test_users = test_interactions['user_id'].unique()
print(f"Number of users to test: {len(test_users)}")

if len(test_users) > 0:
    example_user = test_users[0]
    relevant = get_relevant_products(example_user, test_interactions)
    print(f"Example - User {example_user}: {len(relevant)} relevant products")


Number of users to test: 3568
Example - User 4928: 3 relevant products
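
With a temporal split, some test users or products may never appear in the training period, and a model trained on the earlier data has no history for them. The sketch below shows one way to measure that share. It runs on a small made-up DataFrame rather than on `interactions.csv`, and `cold_start_share` is a hypothetical helper name.

```python
import pandas as pd

# Toy data standing in for train_interactions / test_interactions (invented values)
toy_train = pd.DataFrame({"user_id": [1, 1, 2, 3], "product_id": [10, 11, 10, 12]})
toy_test = pd.DataFrame({"user_id": [2, 4, 5], "product_id": [11, 10, 13]})


def cold_start_share(train_df, test_df, key="user_id"):
    """Fraction of distinct test values of `key` that never occur in the training set."""
    test_ids = set(test_df[key].unique())
    if not test_ids:
        return 0.0
    unseen = test_ids - set(train_df[key].unique())
    return len(unseen) / len(test_ids)


user_share = cold_start_share(toy_train, toy_test)
product_share = cold_start_share(toy_train, toy_test, key="product_id")
print(f"Cold-start users: {user_share:.0%}")
print(f"Cold-start products: {product_share:.0%}")
```

Applied to the real split, a high user share would mean that many evaluated users can only receive fallback recommendations, which is worth keeping in mind when reading the scores in section 4.4.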


4.3 - Model Evaluation

Evaluate each model with the metrics implemented above.


In [9]:
# TODO: Generic evaluation function

def evaluate_model(recommend_func, test_users, test_interactions, k=10):
    """
    Evaluate a recommendation model
    
    Args:
        recommend_func: function that takes a user_id and returns a list of recommendations
        test_users: list of users to test
        test_interactions: DataFrame of test interactions
        k: number of recommendations
    
    Returns:
        dict: dictionary of metrics, averaged over users with at least one relevant product
    """
    precisions = []
    recalls = []
    aps = []
    ndcgs = []
    
    for user_id in tqdm(test_users, desc="Évaluation"):
        relevant = get_relevant_products(user_id, test_interactions)
        
        if len(relevant) == 0:
            continue
        
        # A failing model or an empty list counts as a miss on every metric
        try:
            recommended = recommend_func(user_id)
        except Exception:
            recommended = []
        
        if len(recommended) == 0:
            precisions.append(0.0)
            recalls.append(0.0)
            aps.append(0.0)
            ndcgs.append(0.0)
            continue
        
        prec = precision_at_k(recommended, relevant, k)
        rec = recall_at_k(recommended, relevant, k)
        ap = average_precision(recommended, relevant)
        ndcg = ndcg_at_k(recommended, relevant, k)
        
        precisions.append(prec)
        recalls.append(rec)
        aps.append(ap)
        ndcgs.append(ndcg)
    
    return {
        'Precision@K': np.mean(precisions) if len(precisions) > 0 else 0.0,
        'Recall@K': np.mean(recalls) if len(recalls) > 0 else 0.0,
        'MAP': np.mean(aps) if len(aps) > 0 else 0.0,
        'NDCG@K': np.mean(ndcgs) if len(ndcgs) > 0 else 0.0
    }


In [10]:
# TODO: Evaluate each model

DATA_PATH = "../data/"
users_df = pd.read_csv(DATA_PATH + 'users.csv')
products_df = pd.read_csv(DATA_PATH + 'products.csv')

print("Chargement des modèles...")

def recommend_cf_wrapper(user_id):
    """Wrapper for the Collaborative Filtering model"""
    from sklearn.metrics.pairwise import cosine_similarity
    
    def calculate_interaction_score(row):
        if pd.notna(row['rating']):
            return row['rating']
        elif row['interaction_type'] == 'purchase':
            return 4.0
        elif row['interaction_type'] == 'add_to_cart':
            return 3.0
        elif row['interaction_type'] == 'review':
            return 3.5
        else:
            return 1.0
    
    train_interactions_scored = train_interactions.copy()
    train_interactions_scored['interaction_score'] = train_interactions_scored.apply(calculate_interaction_score, axis=1)
    
    user_item_matrix = train_interactions_scored.pivot_table(
        index='user_id',
        columns='product_id',
        values='interaction_score',
        aggfunc='mean',
        fill_value=0
    )
    
    if user_id not in user_item_matrix.index:
        return []
    
    user_similarity = cosine_similarity(user_item_matrix)
    user_similarity = pd.DataFrame(
        user_similarity,
        index=user_item_matrix.index,
        columns=user_item_matrix.index
    )
    
    user_items = user_item_matrix.loc[user_id]
    seen_items = set(user_items[user_items > 0].index)
    
    similar_users = user_similarity.loc[user_id].drop(user_id).nlargest(10)
    
    predicted_scores = {}
    for product_id in user_item_matrix.columns:
        if product_id in seen_items:
            continue
        
        numerator = 0
        denominator = 0
        for similar_user_id, similarity in similar_users.items():
            neighbor_rating = user_item_matrix.loc[similar_user_id, product_id]
            if neighbor_rating > 0:
                numerator += similarity * neighbor_rating
                denominator += abs(similarity)
        
        if denominator > 0:
            predicted_scores[product_id] = numerator / denominator
    
    recommended_items = sorted(predicted_scores.items(), key=lambda x: x[1], reverse=True)[:10]
    return [item[0] for item in recommended_items]

def recommend_content_wrapper(user_id):
    """Wrapper for the Content-Based model"""
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    
    if 'description' in products_df.columns:
        products_df['content'] = (
            products_df['name'].fillna('') + ' ' +
            products_df['description'].fillna('') + ' ' +
            products_df['category'].fillna('') + ' ' +
            products_df['subcategory'].fillna('')
        )
    else:
        products_df['content'] = (
            products_df['name'].fillna('') + ' ' +
            products_df['category'].fillna('') + ' ' +
            products_df['subcategory'].fillna('')
        )
    
    tfidf = TfidfVectorizer(max_features=200, stop_words='english')
    product_vectors = tfidf.fit_transform(products_df['content'])
    product_similarity = cosine_similarity(product_vectors)
    product_similarity = pd.DataFrame(
        product_similarity,
        index=products_df['product_id'],
        columns=products_df['product_id']
    )
    
    user_interactions = train_interactions[train_interactions['user_id'] == user_id]
    positive_interactions = user_interactions[
        (user_interactions['interaction_type'].isin(['purchase', 'review', 'add_to_cart'])) |
        ((user_interactions['rating'].notna()) & (user_interactions['rating'] >= 3))
    ]
    
    # Cold start: no positive history, so fall back to the 10 most expensive products
    if len(positive_interactions) == 0:
        return products_df.nlargest(10, 'price')['product_id'].tolist()
    
    liked_products = set(positive_interactions['product_id'].unique())
    product_scores = {}
    
    for product_id in products_df['product_id']:
        if product_id in liked_products:
            continue
        
        similarities = []
        for liked_product_id in liked_products:
            if liked_product_id in product_similarity.index and product_id in product_similarity.columns:
                sim = product_similarity.loc[liked_product_id, product_id]
                similarities.append(sim)
        
        if similarities:
            product_scores[product_id] = np.mean(similarities)
    
    recommended_items = sorted(product_scores.items(), key=lambda x: x[1], reverse=True)[:10]
    return [item[0] for item in recommended_items]


def recommend_hybrid_wrapper(user_id):
    """Wrapper for the Hybrid model - simplified version (popularity baseline that ignores user_id)"""
    popular_products = train_interactions.groupby('product_id').size().reset_index(name='count')
    popular_products = popular_products.sort_values('count', ascending=False)
    return popular_products.head(10)['product_id'].tolist()



print("\nEvaluating Collaborative Filtering...")
metrics_cf = evaluate_model(recommend_cf_wrapper, test_users[:100], test_interactions, k=10)
print("\nEvaluating Content-Based...")
metrics_content = evaluate_model(recommend_content_wrapper, test_users[:100], test_interactions, k=10)
print("\nEvaluating Hybrid Model...")
metrics_hybrid = evaluate_model(recommend_hybrid_wrapper, test_users[:100], test_interactions, k=10)
print("\nEvaluation complete")


Chargement des modèles...

Evaluating Collaborative Filtering...


Évaluation: 100%|██████████| 100/100 [00:33<00:00,  2.95it/s]



Evaluating Content-Based...


Évaluation: 100%|██████████| 100/100 [00:03<00:00, 28.27it/s]



Evaluating Hybrid Model...


Évaluation: 100%|██████████| 100/100 [00:00<00:00, 850.71it/s]


Evaluation complete
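
The progress bars show about 33 s for 100 users with Collaborative Filtering. `recommend_cf_wrapper` rebuilds the user-item matrix and the full similarity matrix on every call, which likely accounts for most of that time. The sketch below builds both once and returns a closure that only does the per-user scoring. It assumes a DataFrame with `user_id`, `product_id` and `interaction_score` columns as in the wrapper; `make_cf_recommender` and the toy data are hypothetical. Scores should match the loop-based version, although tie-breaking between equal scores may differ.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def make_cf_recommender(scored_interactions, n_neighbors=10, top_n=10):
    """Precompute the matrices once and return a recommend(user_id) function."""
    matrix = scored_interactions.pivot_table(
        index="user_id", columns="product_id",
        values="interaction_score", aggfunc="mean", fill_value=0,
    )
    similarity = pd.DataFrame(
        cosine_similarity(matrix), index=matrix.index, columns=matrix.index
    )

    def recommend(user_id):
        if user_id not in matrix.index:
            return []  # unknown user: no collaborative signal
        neighbors = similarity.loc[user_id].drop(user_id).nlargest(n_neighbors)
        ratings = matrix.loc[neighbors.index].to_numpy()   # neighbors x products
        weights = neighbors.to_numpy()[:, None]            # one weight per neighbor
        numerator = (ratings * weights).sum(axis=0)
        # Only neighbors who interacted with the product contribute to the weight sum
        denominator = ((ratings > 0) * np.abs(weights)).sum(axis=0)
        predicted = np.divide(
            numerator, denominator,
            out=np.zeros_like(numerator, dtype=float), where=denominator > 0,
        )
        scores = pd.Series(predicted, index=matrix.columns)
        seen = matrix.loc[user_id] > 0
        scores = scores[(~seen) & (denominator > 0)]
        return scores.nlargest(top_n).index.tolist()

    return recommend


# Toy interactions (invented values) to exercise the function
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "product_id": [10, 11, 10, 12, 11, 12],
    "interaction_score": [5.0, 3.0, 4.0, 4.0, 2.0, 5.0],
})
recommend_cf_fast = make_cf_recommender(toy)
print(recommend_cf_fast(1))
```

In the notebook, the returned function could be passed to `evaluate_model` in place of `recommend_cf_wrapper`, provided the scored training interactions are prepared beforehand.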





4.4 - Model Comparison


Compare the performance of the three models.


In [12]:
# TODO: Build a comparison table

results = pd.DataFrame({
    'Collaborative Filtering': metrics_cf,
    'Content-Based': metrics_content,
    'Hybrid Model': metrics_hybrid
}).T

print("  MODEL COMPARISON")
print(results.round(3))


  MODEL COMPARISON
                         Precision@K  Recall@K    MAP  NDCG@K
Collaborative Filtering        0.001     0.007  0.002   0.004
Content-Based                  0.009     0.027  0.031   0.041
Hybrid Model                   0.003     0.022  0.012   0.016


In [13]:
# TODO: Comparative visualization with Plotly

models = results.index.tolist()
metrics = results.columns.tolist()

colors = ['orange', 'gold', 'darkorange', 'lightsalmon']

fig = go.Figure()

for i, metric in enumerate(metrics):
    fig.add_trace(go.Bar(
        name=metric,
        x=models,
        y=results[metric].values,
        marker_color=colors[i % len(colors)],
        text=results[metric].values.round(3),
        textposition='outside',
        hovertemplate=f'<b>%{{x}}</b><br>{metric}: %{{y:.4f}}<extra></extra>'
    ))

fig.update_layout(
    title={
        'text': 'Model Performance Comparison',
        'x': 0.5,
        'xanchor': 'center',
        'font': {'size': 18, 'family': 'Arial Black'}
    },
    xaxis=dict(
        title='Model',
        tickangle=-45,
        tickfont=dict(size=12)
    ),
    yaxis=dict(
        title='Score',
        tickfont=dict(size=12)
    ),
    barmode='group',
    width=1000,
    height=600,
    legend=dict(title='Metrics', font=dict(size=12)),
    hovermode='x unified',
    template='plotly_white'
)

fig.show()

fig_radar = go.Figure()

for i, model in enumerate(models):
    fig_radar.add_trace(go.Scatterpolar(
        r=results.loc[model].values,
        theta=metrics,
        fill='toself',
        name=model,
        line=dict(color=colors[i % len(colors)], width=2),
        marker=dict(size=6),
        hovertemplate='<b>%{theta}</b><br>Score: %{r:.4f}<extra></extra>'
    ))

fig_radar.update_layout(
    polar=dict(
        radialaxis=dict(
            visible=True,
            range=[0, max(results.max()) * 1.1],
            tickfont=dict(size=11)
        ),
        angularaxis=dict(tickfont=dict(size=12))
    ),
    title={
        'text': 'Radar Comparison of the Models',
        'x': 0.5,
        'xanchor': 'center',
        'font': {'size': 18, 'family': 'Arial Black'}
    },
    width=800,
    height=600,
    template='plotly_white',
    legend=dict(font=dict(size=12))
)

fig_radar.show()

4.5 - Business Metrics

Compute business-oriented metrics.


In [14]:
# TODO: Catalog coverage
def catalog_coverage(all_recommendations, total_products):
    """
    Share of products recommended at least once
    
    Args:
        all_recommendations: list of recommendation lists, one per user
        total_products: total number of products in the catalog
    
    Returns:
        coverage: float between 0 and 1
    """
    all_recommended_products = set()
    for recommendations in all_recommendations:
        all_recommended_products.update(recommendations)
    
    coverage = len(all_recommended_products) / total_products if total_products > 0 else 0.0
    
    return coverage

# TODO: Diversity
def diversity_score(recommendations, products_df):
    """
    Average number of distinct categories per recommendation list
    
    Args:
        recommendations: list of recommendation lists, one per user
        products_df: DataFrame of products with a 'category' column
    
    Returns:
        diversity: float (average number of categories, not normalized to [0, 1])
    """
    diversities = []
    
    for rec_list in recommendations:
        if len(rec_list) == 0:
            diversities.append(0.0)
            continue
        
        recommended_products = products_df[products_df['product_id'].isin(rec_list)]
        unique_categories = recommended_products['category'].nunique()
        
        diversities.append(unique_categories)
    
    return np.mean(diversities) if len(diversities) > 0 else 0.0

# TODO: Novelty
def novelty_score(recommendations, product_popularity):
    """
    Novelty score based on inverted popularity
    The rarer a product is in the training data, the more novel it is
    
    Args:
        recommendations: list of recommendation lists, one per user
        product_popularity: Series with product_id as index and popularity as value
    
    Returns:
        novelty: float (average novelty score)
    """
    novelties = []
    
    max_popularity = product_popularity.max()
    if max_popularity > 0:
        normalized_popularity = 1 - (product_popularity / max_popularity)
    else:
        normalized_popularity = pd.Series(0, index=product_popularity.index)
    
    for rec_list in recommendations:
        if len(rec_list) == 0:
            novelties.append(0.0)
            continue
        
        rec_products = [p for p in rec_list if p in normalized_popularity.index]
        if len(rec_products) > 0:
            novelty = normalized_popularity[rec_products].mean()
            novelties.append(novelty)
        else:
            novelties.append(0.0)
    
    return np.mean(novelties) if len(novelties) > 0 else 0.0


print("BUSINESS METRICS")

all_rec_cf = []
all_rec_content = []
all_rec_hybrid = []

def safe_recommend(recommend_func, user_id):
    """Return the recommendations, or an empty list if the model raises."""
    try:
        return recommend_func(user_id)
    except Exception:
        return []

# Each model is called independently so that a failure in one model
# cannot add extra entries to the lists of the others
for user_id in test_users[:100]:
    all_rec_cf.append(safe_recommend(recommend_cf_wrapper, user_id))
    all_rec_content.append(safe_recommend(recommend_content_wrapper, user_id))
    all_rec_hybrid.append(safe_recommend(recommend_hybrid_wrapper, user_id))

product_popularity = train_interactions.groupby('product_id').size()

coverage_cf = catalog_coverage(all_rec_cf, len(products_df))
coverage_content = catalog_coverage(all_rec_content, len(products_df))
coverage_hybrid = catalog_coverage(all_rec_hybrid, len(products_df))

diversity_cf = diversity_score(all_rec_cf, products_df)
diversity_content = diversity_score(all_rec_content, products_df)
diversity_hybrid = diversity_score(all_rec_hybrid, products_df)

novelty_cf = novelty_score(all_rec_cf, product_popularity)
novelty_content = novelty_score(all_rec_content, product_popularity)
novelty_hybrid = novelty_score(all_rec_hybrid, product_popularity)

business_metrics = pd.DataFrame({
    'Collaborative Filtering': [coverage_cf, diversity_cf, novelty_cf],
    'Content-Based': [coverage_content, diversity_content, novelty_content],
    'Hybrid Model': [coverage_hybrid, diversity_hybrid, novelty_hybrid]
}, index=['Coverage', 'Diversity', 'Novelty'])

print("\nBusiness Metrics:")
print(business_metrics.round(3))

fig_business = go.Figure()

for i, metric in enumerate(['Coverage', 'Diversity', 'Novelty']):
    fig_business.add_trace(go.Bar(
        name=metric,
        x=business_metrics.columns,
        y=business_metrics.loc[metric].values,
        marker_color=colors[i % len(colors)],
        text=business_metrics.loc[metric].values.round(3),
        textposition='outside',
        hovertemplate=f'<b>%{{x}}</b><br>{metric}: %{{y:.4f}}<extra></extra>'
    ))

fig_business.update_layout(
    title={
        'text': 'Business Metrics - Model Comparison',
        'x': 0.5,
        'xanchor': 'center',
        'font': {'size': 16}
    },
    xaxis=dict(title='Model'),
    yaxis=dict(title='Score'),
    barmode='group',
    width=1000,
    height=600,
    legend=dict(title='Metrics'),
    hovermode='x unified'
)

fig_business.show()


BUSINESS METRICS

Business Metrics:
           Collaborative Filtering  Content-Based  Hybrid Model
Coverage                     0.586          0.613         0.010
Diversity                    5.570          1.310         4.000
Novelty                      0.385          0.420         0.099


4.6 - Analysis and Conclusions

Write up your observations and conclusions:


Your Conclusions

Answer the following questions:

1. Which model achieves the best overall performance?

2. What are the strengths and weaknesses of each approach?

3. In which context would you recommend each model?

4. What improvements do you propose?

5. How do you handle the cold start problem?




You have completed the main lab. If you have time, explore the bonus tasks in the assignment (MLOps, API, Deep Learning, etc.).
