# Product co-occurrence and similarity analysis

This notebook studies which products are frequently bought together and how similar products are to one another, based on customer purchase histories.

## Objectives
- Build a binary customer-product matrix
- Analyse product co-occurrence (association rules)
- Compute product-to-product similarity (cosine / Jaccard)
- Identify "similar" products to recommend

Because the similarity is derived from co-purchase behaviour rather than from product attributes, this is a starting point for item-based collaborative filtering rather than content-based recommendation.


In [1]:
# Load the cleaned dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

FILE = "../data/Online_Retail_cleaned.csv"
DATA = pd.read_csv(FILE)
print(DATA.shape)
DATA.head()

(397884, 8)


,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [2]:
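# Build the binary customer-product matrix (1 = the customer bought the product at least once).
# pivot_table averages Quantity per (customer, product) by default; only its sign is used below.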
customer_product_matrix = DATA.pivot_table(index='CustomerID', columns='StockCode', values='Quantity', fill_value=0)
customer_product_matrix = (customer_product_matrix > 0).astype(int)
print(customer_product_matrix.head())

StockCode   10002  10080  10120  10123C  10124A  10124G  10125  10133  10135  \
CustomerID                                                                     
12346.0         0      0      0       0       0       0      0      0      0   
12347.0         0      0      0       0       0       0      0      0      0   
12348.0         0      0      0       0       0       0      0      0      0   
12349.0         0      0      0       0       0       0      0      0      0   
12350.0         0      0      0       0       0       0      0      0      0   

StockCode   11001  ...  90214V  90214W  90214Y  90214Z  BANK CHARGES  C2  DOT  \
CustomerID         ...                                                          
12346.0         0  ...       0       0       0       0             0   0    0   
12347.0         0  ...       0       0       0       0             0   0    0   
12348.0         0  ...       0       0       0       0             0   0    0   
12349.0         0  ...       0    
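
To make the construction above concrete, here is a minimal sketch on a hypothetical three-customer toy table (the `toy` data and the names `counts` and `binary` are illustrative assumptions, not part of the dataset). It uses `pd.crosstab`, which counts purchase rows per (customer, product) pair, as one possible alternative to averaging `Quantity` with `pivot_table`; both lead to the same 0/1 matrix as long as quantities are positive.

```python
import pandas as pd

# Hypothetical toy transactions (not taken from the real dataset)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3],
    "StockCode": ["A", "B", "A", "A", "C", "B"],
    "Quantity": [6, 2, 1, 3, 4, 8],
})

# Count how many transaction rows exist for each (customer, product) pair
counts = pd.crosstab(toy["CustomerID"], toy["StockCode"])

# Binarise: 1 if the customer bought the product at least once, else 0
binary = (counts > 0).astype(int)
print(binary)
```

Each row is a customer and each column a product, which is the layout the later cells rely on.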

In [15]:
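# Product co-occurrence analysis (association rules)
# min_support=0.15 keeps only itemsets bought by at least 15% of customers;
# at this threshold the run below yields no rules.
from mlxtend.frequent_patterns import apriori, association_rules
# Cast to bool: mlxtend recommends boolean indicators over 0/1 integers
frequent_itemsets = apriori(customer_product_matrix.astype(bool), min_support=0.15, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
print(rules.head())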



Empty DataFrame
Columns: [antecedents, consequents, antecedent support, consequent support, support, confidence, lift, representativity, leverage, conviction, zhangs_metric, jaccard, certainty, kulczynski]
Index: []
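
An empty result here most likely means that no pair of products is bought together by at least 15% of customers. The sketch below, written with NumPy only on a hypothetical 4-customer, 3-product matrix `X`, shows how support and lift follow from the co-occurrence counts, and how the largest single-item support gives an upper bound on any useful `min_support`. The variable names are illustrative assumptions.

```python
import numpy as np

# Rows = customers, columns = products (hypothetical toy data)
X = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 0, 0],
])
n_customers = X.shape[0]

# Diagonal: number of buyers per product; off-diagonal: buyers of both products
cooccurrence = X.T @ X

# Support of a single product = share of customers who bought it
item_support = np.diag(cooccurrence) / n_customers

# Support of a pair = share of customers who bought both products
pair_support = cooccurrence / n_customers

# Lift = observed pair support / support expected under independence
lift = pair_support / np.outer(item_support, item_support)

print("max item support:", item_support.max())
print("lift:\n", lift.round(3))
```

On the real matrix, the equivalent check would be `customer_product_matrix.mean().max()`: if that value is below the chosen `min_support`, no itemset can pass the threshold.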


In [16]:
# Compute product-to-product cosine similarity (products are columns, hence the transpose)
from sklearn.metrics.pairwise import cosine_similarity
product_similarity = cosine_similarity(customer_product_matrix.T)
product_similarity_df = pd.DataFrame(product_similarity, index=customer_product_matrix.columns, columns=customer_product_matrix.columns)
print(product_similarity_df.head())

StockCode     10002  10080     10120    10123C  10124A    10124G     10125  \
StockCode                                                                    
10002      1.000000    0.0  0.094868  0.091287     0.0  0.000000  0.090351   
10080      0.000000    1.0  0.000000  0.000000     0.0  0.000000  0.032774   
10120      0.094868    0.0  1.000000  0.115470     0.0  0.000000  0.057143   
10123C     0.091287    0.0  0.115470  1.000000     0.0  0.000000  0.164957   
10124A     0.000000    0.0  0.000000  0.000000     1.0  0.447214  0.063888   

StockCode     10133     10135     11001  ...  90214V  90214W  90214Y  90214Z  \
StockCode                                ...                                   
10002      0.062932  0.098907  0.095346  ...     0.0     0.0     0.0     0.0   
10080      0.045655  0.047836  0.000000  ...     0.0     0.0     0.0     0.0   
10120      0.059702  0.041703  0.060302  ...     0.0     0.0     0.0     0.0   
10123C     0.000000  0.000000  0.000000  ...     0.0 
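
The objectives also mention Jaccard similarity, which the cell above does not compute. For binary purchase vectors both measures can be derived from the same intersection counts: cosine is |A∩B| / sqrt(|A|·|B|) and Jaccard is |A∩B| / |A∪B|. The following is a minimal sketch on a hypothetical toy matrix `X`, assuming every product has at least one buyer so that the cosine denominator is non-zero.

```python
import numpy as np

# Rows = customers, columns = products (hypothetical toy data)
X = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 0, 0],
])

# |A ∩ B|: number of customers who bought both products
intersection = X.T @ X

# |A|: number of buyers of each product
counts = np.diag(intersection)

# |A ∪ B| = |A| + |B| - |A ∩ B|
union = counts[:, None] + counts[None, :] - intersection

# Jaccard similarity, leaving 0 where the union is empty
jaccard = np.divide(
    intersection,
    union,
    out=np.zeros(intersection.shape, dtype=float),
    where=union > 0,
)

# Cosine similarity for binary vectors, from the same counts
cosine = intersection / np.sqrt(np.outer(counts, counts))

print(jaccard.round(3))
print(cosine.round(3))
```

Jaccard penalises pairs where one product is far more popular than the other more strongly than cosine does, so the two rankings can differ.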

In [17]:
# Identify similar products to recommend
def recommend_similar_products(product_code, product_similarity_df, top_n=5):
    """Return the top_n products most similar to product_code (empty Series if unknown)."""
    if product_code not in product_similarity_df.columns:
        print(f"Product {product_code} not found.")
        return pd.Series(dtype=float)
    # Drop the product itself before ranking so that exactly top_n neighbours are returned
    scores = product_similarity_df[product_code].drop(product_code)
    return scores.sort_values(ascending=False).head(top_n)

import random

ALL_DESCRIPTION = DATA[['Description', 'StockCode']].drop_duplicates()
PRODUIT = random.choice(ALL_DESCRIPTION['StockCode'].tolist())
print(f"Selected product: {ALL_DESCRIPTION[ALL_DESCRIPTION['StockCode'] == PRODUIT]['Description'].values[0]} ({PRODUIT})")
recommended_products = recommend_similar_products(PRODUIT, product_similarity_df)
print("Recommended similar products:")
for product, score in recommended_products.items():
    description = ALL_DESCRIPTION[ALL_DESCRIPTION['StockCode'] == product]['Description'].values[0]
    print(f"{description} ({product}) - Similarity: {score:.4f}")

Selected product: BATHROOM METAL SIGN (82580)
Recommended similar products:
TOILET METAL SIGN (82581) - Similarity: 0.7419
KITCHEN METAL SIGN (82578) - Similarity: 0.7239
BEWARE OF THE CAT METAL SIGN  (21165) - Similarity: 0.3310
HOT BATHS METAL SIGN (82583) - Similarity: 0.3269
LAUNDRY 15C METAL SIGN (82551) - Similarity: 0.3149


In [18]:
# Save the product similarity matrix for reuse (the StockCode index is kept as the first column)
product_similarity_df.to_csv("../data/product_similarity.csv", index=True)
product_similarity_df.head()

StockCode,10002,10080,10120,10123C,10124A,10124G,10125,10133,10135,11001,...,90214V,90214W,90214Y,90214Z,BANK CHARGES,C2,DOT,M,PADS,POST
10002,1.0,0.0,0.094868,0.091287,0.0,0.0,0.090351,0.062932,0.098907,0.095346,...,0.0,0.0,0.0,0.0,0.0,0.029361,0.0,0.067591,0.0,0.078217
10080,0.0,1.0,0.0,0.0,0.0,0.0,0.032774,0.045655,0.047836,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016345,0.0,0.0
10120,0.094868,0.0,1.0,0.11547,0.0,0.0,0.057143,0.059702,0.041703,0.060302,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071247,0.0,0.010993
10123C,0.091287,0.0,0.11547,1.0,0.0,0.0,0.164957,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10124A,0.0,0.0,0.0,0.0,1.0,0.447214,0.063888,0.044499,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
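
When the saved matrix is reloaded later, the stock codes need to come back as strings on both axes so that lookups such as `df.loc["10123C", "10002"]` keep working. The sketch below round-trips a small hypothetical similarity matrix through an in-memory buffer instead of the real file; the values and the name `reloaded` are illustrative assumptions.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical 3x3 similarity matrix indexed by stock code
codes = ["10002", "10080", "10123C"]
sim = pd.DataFrame(
    [[1.0, 0.0, 0.5], [0.0, 1.0, 0.25], [0.5, 0.25, 1.0]],
    index=pd.Index(codes, name="StockCode"),
    columns=pd.Index(codes, name="StockCode"),
)

# Write to an in-memory buffer (stands in for the CSV file on disk)
buffer = io.StringIO()
sim.to_csv(buffer, index=True)
buffer.seek(0)

# Reload, using the first column as the index
reloaded = pd.read_csv(buffer, index_col=0)

# Force both axes to strings so purely numeric codes are not read back as integers
reloaded.index = reloaded.index.astype(str)
reloaded.columns = reloaded.columns.astype(str)

print(reloaded.loc["10123C", "10002"])
```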
