
In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from scipy.sparse import csr_matrix

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

In [2]:
data = pd.read_csv("S5_007_009_ratings.csv")
data.head()

Unnamed: 0,UserId,ProductId,Rating,Description
0,AKJHHD5VEH7VG,762451459,5.0,JSR Paris Beauty Cotton Net Women's Full Cover...
1,A22ZFXQE8AWPEP,1304482596,1.0,"BlueStone The Golden Arm Yellow Gold Diamond, ..."
2,A22ZFXQE8AWPEP,1304482685,1.0,Okane Striped Men's Polo Neck T-Shirt - Buy Go...
3,A22ZFXQE8AWPEP,1304495396,1.0,Her Grace Women's A-line Dress - Buy Blue Her ...
4,A22ZFXQE8AWPEP,1304511111,1.0,Buy Eventz Gifts Allah Showpiece - 20 cm for...


In [3]:
data.shape

(181040, 4)

In [4]:
ratings = data[["UserId", "ProductId", "Rating"]].sample(10000, random_state=42)

In [5]:
ratings.shape

(10000, 3)

This dataset contains ratings (between 1 and 5) given by customers of a marketplace to specific products. Customers and products are identified by their IDs.

**Which products are the most popular (that is, those that received the most ratings)?**

In [6]:
data.groupby("ProductId")["Rating"].count().sort_values(ascending=False)

Unnamed: 0_level_0,Rating
ProductId,Unnamed: 1_level_1
B001MA0QY2,7533
B0009V1YR8,2869
B0043OYFKU,2477
B0000YUXI0,2143
B003V265QW,2088
...,...
B001SG7EE0,1
B001SKWB1C,1
B001SP7RV6,1
B001STEX50,1
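
A raw count says nothing about how well a product is rated. As a minimal sketch, assuming a toy table with the same column names as the notebook (the IDs and ratings below are hypothetical), the count and the mean rating can be computed together:

```python
import pandas as pd

# Toy ratings table: hypothetical values, same column names as the notebook
toy = pd.DataFrame({
    "UserId": ["u1", "u2", "u3", "u1", "u2", "u1"],
    "ProductId": ["p1", "p1", "p1", "p2", "p2", "p3"],
    "Rating": [5.0, 4.0, 1.0, 3.0, 5.0, 2.0],
})

# Number of ratings and mean rating per product, most-rated first
popularity = (
    toy.groupby("ProductId")["Rating"]
    .agg(n_ratings="count", mean_rating="mean")
    .sort_values("n_ratings", ascending=False)
)
print(popularity)
```

The same pattern, grouped by `UserId` instead of `ProductId`, would summarize user activity.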


**Which users are the most active (that is, those who rated the most products)?**

In [7]:
data.groupby("UserId")["Rating"].count().sort_values(ascending=False)

Unnamed: 0_level_0,Rating
UserId,Unnamed: 1_level_1
A3KEZLJ59C1JVH,389
A281NPSIMI1C2R,336
A3M174IC0VXOS2,326
A2V5R832QCSOMX,278
A3LJLRIZL38GG3,276
...,...
A2BIYSOWGIVNHB,1
A2BIQ0KPVMG8B2,1
A2BIOHEVF0LCHW,1
A2BILUPJR1QK94,1


# Recommendation based on product descriptions

Suppose now that we do not have access to customers' purchase history. We will base our recommendation on the product descriptions instead. Let's start by creating a dataset of 1000 products and their descriptions:

In [8]:
product_descriptions = data[["ProductId", "Description"]].drop_duplicates().sample(1000, random_state=42)

In [78]:
product_descriptions.head()

Unnamed: 0,ProductId,Description
119302,B004NMF63W,April6 Carmelo Printed Men's Boxer - Buy Orang...
130359,B0058P6S2Q,Maxima 13881PPGW FIBER COLLECTION Analog Watch...
154774,B008B4FKLG,Key Features of TIMBERLAKE Slim Fit Fit Women'...
173751,B00CZ4XW8Q,Herberto Girl's A-line Dress - Buy Black And R...
60300,B0016B5LAQ,Jadoo Collections Alloy Necklace - Buy Jadoo C...


Before vectorizing the descriptions, we add to the English stop-word list a few e-commerce terms that appear in a large number of product descriptions:

In [79]:
ENGLISH_STOP_WORDS

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'

In [9]:
stop_words = ENGLISH_STOP_WORDS.union(['shipping', 'online', 'guarantee', 'replacement', 'delivery', 'cash', 'free',
                                      'buy', 'day', 'discounts', 'products', 'flipkart', 'prices', 'warranty',
                                      'specifications', 'details', 'features', 'general', 'pack', 'price',
                                      'package', 'sales', 'box', 'type', 'genuine', 'material', 'cm'])
stop_words = list(stop_words)
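
To see what the extra stop words change, here is a small sketch on two hypothetical descriptions (the `docs` list and the shortened `custom` list are illustrative only). It compares the vocabulary learned with scikit-learn's built-in English list against the vocabulary learned with the extended list:

```python
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# Two hypothetical descriptions sharing e-commerce filler words
docs = [
    "Buy cotton shirt online with free shipping",
    "Buy analog watch online with free shipping",
]

# Extended list: built-in English stop words plus a few domain terms
custom = list(ENGLISH_STOP_WORDS.union(["buy", "online", "free", "shipping"]))

default_vocab = set(TfidfVectorizer(stop_words="english").fit(docs).vocabulary_)
custom_vocab = set(TfidfVectorizer(stop_words=custom).fit(docs).vocabulary_)

print(sorted(default_vocab - custom_vocab))  # terms removed by the extension
print(sorted(custom_vocab))                  # terms that remain
```

With the extended list, only the words that actually distinguish the two products remain in the vocabulary.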

**Create a `TfidfVectorizer` with the stop_words above and retrieve the vectors for the descriptions.**

In [80]:
# Build a TF-IDF vectorizer that ignores the English stop words plus the e-commerce terms above, which carry no useful signal for our analysis
vectorizer = TfidfVectorizer(stop_words=stop_words)
vectors = vectorizer.fit_transform(product_descriptions["Description"])  # fit_transform learns the vocabulary and IDF weights from the descriptions, then returns their TF-IDF vectors

In [84]:
vectors.toarray().shape

(1000, 5345)

In [83]:
# Densify the vectors for inspection. Most entries are 0, which is expected:
# each of the 1000 vectors has 5345 components, one per distinct term found
# across all product descriptions, and any single description contains only
# a small fraction of those terms.
vectors.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [87]:
vectorizer.vocabulary_

{'april6': 868,
 'carmelo': 1263,
 'printed': 3827,
 'men': 3225,
 'boxer': 1137,
 'orange': 3525,
 'rs': 4140,
 '549': 480,
 'india': 2716,
 'shop': 4336,
 'apparels': 853,
 'huge': 2646,
 'collection': 1439,
 'branded': 1149,
 'clothes': 1408,
 'com': 1453,
 'maxima': 3202,
 '13881ppgw': 104,
 'fiber': 2212,
 'analog': 824,
 'watch': 5170,
 '699': 552,
 'tonneau': 4868,
 'dial': 1807,
 'water': 5172,
 'resistant': 4064,
 'stainless': 4540,
 'steel': 4564,
 'case': 1273,
 'buckle': 1192,
 'clasp': 1384,
 'black': 1078,
 'strap': 4603,
 'great': 2475,
 '30': 322,
 'key': 2896,
 'timberlake': 4839,
 'slim': 4421,
 'fit': 2243,
 'women': 5244,
 'red': 3996,
 'jeans': 2814,
 'color': 1442,
 'standard': 4544,
 'brand': 1148,
 'number': 3467,
 'contents': 1533,
 'fabric': 2140,
 '70': 556,
 'cotton30': 1572,
 'lycra': 3124,
 'pattern': 3624,
 'solid': 4462,
 'ideal': 2667,
 'additional': 726,
 'style': 4642,
 'code': 1429,
 'abrolsons': 678,
 '10013': 40,
 'care': 1259,
 'mild': 3257,
 'was

In [93]:
list(vectorizer.vocabulary_.items())

[('april6', 868),
 ('carmelo', 1263),
 ('printed', 3827),
 ('men', 3225),
 ('boxer', 1137),
 ('orange', 3525),
 ('rs', 4140),
 ('549', 480),
 ('india', 2716),
 ('shop', 4336),
 ('apparels', 853),
 ('huge', 2646),
 ('collection', 1439),
 ('branded', 1149),
 ('clothes', 1408),
 ('com', 1453),
 ('maxima', 3202),
 ('13881ppgw', 104),
 ('fiber', 2212),
 ('analog', 824),
 ('watch', 5170),
 ('699', 552),
 ('tonneau', 4868),
 ('dial', 1807),
 ('water', 5172),
 ('resistant', 4064),
 ('stainless', 4540),
 ('steel', 4564),
 ('case', 1273),
 ('buckle', 1192),
 ('clasp', 1384),
 ('black', 1078),
 ('strap', 4603),
 ('great', 2475),
 ('30', 322),
 ('key', 2896),
 ('timberlake', 4839),
 ('slim', 4421),
 ('fit', 2243),
 ('women', 5244),
 ('red', 3996),
 ('jeans', 2814),
 ('color', 1442),
 ('standard', 4544),
 ('brand', 1148),
 ('number', 3467),
 ('contents', 1533),
 ('fabric', 2140),
 ('70', 556),
 ('cotton30', 1572),
 ('lycra', 3124),
 ('pattern', 3624),
 ('solid', 4462),
 ('ideal', 2667),
 ('addition

In [94]:
sorted(list(vectorizer.vocabulary_.items()), key=lambda x: x[1])

[('00', 0),
 ('000', 1),
 ('0008m', 2),
 ('001', 3),
 ('002', 4),
 ('0068', 5),
 ('0098', 6),
 ('010', 7),
 ('0116', 8),
 ('0123', 9),
 ('019', 10),
 ('02', 11),
 ('027_oe', 12),
 ('03', 13),
 ('0303', 14),
 ('038', 15),
 ('04', 16),
 ('045', 17),
 ('049', 18),
 ('04a', 19),
 ('04pack', 20),
 ('05', 21),
 ('050', 22),
 ('051', 23),
 ('05a', 24),
 ('06', 25),
 ('063', 26),
 ('0662', 27),
 ('07', 28),
 ('070', 29),
 ('08', 30),
 ('080', 31),
 ('0820', 32),
 ('08hd', 33),
 ('09', 34),
 ('099', 35),
 ('10', 36),
 ('100', 37),
 ('1000', 38),
 ('1000142', 39),
 ('10013', 40),
 ('1008', 41),
 ('1032', 42),
 ('104', 43),
 ('1045', 44),
 ('1049', 45),
 ('1070', 46),
 ('1070blk', 47),
 ('10796', 48),
 ('108', 49),
 ('1080', 50),
 ('109', 51),
 ('1092', 52),
 ('1095', 53),
 ('1099', 54),
 ('10out', 55),
 ('10pur', 56),
 ('11', 57),
 ('110', 58),
 ('1100', 59),
 ('110002075', 60),
 ('111', 61),
 ('1117', 62),
 ('112', 63),
 ('112015', 64),
 ('1134', 65),
 ('1139', 66),
 ('1150', 67),
 ('11549', 68

In [95]:
vectors_df = pd.DataFrame(vectors.toarray(), columns=sorted(list(vectorizer.vocabulary_.items()), key=lambda x: x[1]))
vectors_df.head()

Unnamed: 0,"(00, 0)","(000, 1)","(0008m, 2)","(001, 3)","(002, 4)","(0068, 5)","(0098, 6)","(010, 7)","(0116, 8)","(0123, 9)",...,"(zippy, 5335)","(zircon, 5336)","(zirconia, 5337)","(zircons, 5338)","(zoire, 5339)","(zone, 5340)","(zoom, 5341)","(zovon, 5342)","(zw, 5343)","(zyxel, 5344)"
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [98]:
(vectors_df!= 0).sum().sum()

23758
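
With 23758 non-zero entries out of 1000 × 5345 = 5,345,000 cells, only about 0.44% of the matrix is populated. The same count can be read from the sparse matrix without building a dense DataFrame. The sketch below uses a small hypothetical matrix; it assumes no explicit zeros are stored, since `nnz` counts stored entries:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small hypothetical matrix standing in for the TF-IDF vectors
dense = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.7],
])
sparse = csr_matrix(dense)

n_nonzero = sparse.nnz  # number of stored entries
density = n_nonzero / (sparse.shape[0] * sparse.shape[1])
print(n_nonzero, density)
```

In the notebook, `vectors.nnz` would play the role of `sparse.nnz`.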

**Group the vectors into 8 clusters using the $k$-means method.**

In [100]:
# 1. Apply the k-means algorithm with 8 clusters
kmeans = KMeans(n_clusters=8, random_state=42)
y_kmeans = kmeans.fit_predict(vectors)

In [102]:
y_kmeans

array([1, 2, 0, 1, 3, 0, 5, 3, 5, 0, 7, 0, 3, 4, 1, 3, 1, 4, 4, 4, 1, 7,
       4, 1, 1, 0, 0, 5, 5, 1, 0, 5, 6, 0, 3, 4, 4, 1, 3, 1, 4, 0, 4, 4,
       3, 4, 1, 4, 3, 7, 7, 4, 5, 7, 4, 4, 4, 4, 3, 0, 0, 4, 1, 4, 4, 6,
       4, 3, 0, 7, 3, 0, 0, 4, 0, 4, 3, 7, 4, 7, 0, 3, 7, 4, 2, 7, 4, 3,
       5, 3, 1, 5, 1, 7, 1, 4, 1, 1, 7, 5, 1, 4, 1, 1, 4, 4, 4, 7, 3, 0,
       1, 5, 4, 5, 0, 1, 0, 0, 4, 3, 4, 2, 4, 3, 1, 4, 3, 1, 5, 4, 4, 4,
       4, 6, 1, 7, 4, 5, 3, 7, 0, 4, 4, 4, 5, 7, 0, 7, 7, 5, 0, 3, 4, 0,
       7, 3, 3, 1, 1, 4, 4, 4, 4, 7, 4, 1, 3, 0, 6, 0, 4, 4, 5, 3, 4, 4,
       4, 7, 7, 7, 0, 7, 4, 4, 3, 4, 4, 4, 2, 7, 1, 6, 5, 1, 4, 1, 4, 1,
       3, 5, 5, 1, 3, 4, 3, 4, 0, 3, 7, 1, 4, 4, 0, 4, 7, 3, 1, 7, 5, 7,
       4, 4, 4, 1, 1, 4, 4, 3, 1, 5, 5, 0, 4, 0, 1, 5, 2, 7, 7, 3, 5, 1,
       4, 1, 4, 1, 4, 5, 5, 0, 1, 0, 7, 5, 3, 1, 0, 4, 4, 0, 1, 5, 3, 3,
       1, 1, 4, 4, 1, 4, 1, 0, 5, 0, 0, 3, 4, 1, 1, 0, 3, 4, 1, 1, 1, 0,
       5, 5, 5, 5, 5, 0, 4, 1, 6, 1, 7, 4, 3, 1, 1,
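
A long label array is hard to read; a count per cluster is usually more informative. As a sketch, the snippet below counts labels with `np.bincount`, using only the first ten labels of the output above as sample input:

```python
import numpy as np

# First ten labels copied from the output above (sample input only)
labels = np.array([1, 2, 0, 1, 3, 0, 5, 3, 5, 0])

# minlength=8 keeps one slot per cluster, even for clusters absent from the sample
sizes = np.bincount(labels, minlength=8)
for cluster_id, size in enumerate(sizes):
    print(f"Cluster {cluster_id}: {size} products")
```

Applied to the full `y_kmeans` array, this would show how evenly the 1000 products are spread across the 8 clusters.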

We can now identify the most prominent terms in each cluster:

In [101]:
# 2. For each cluster centroid, list the term indices sorted by decreasing weight
kmeans.cluster_centers_.argsort()[:, ::-1]

array([[2140, 1571, 5244, ..., 3254, 3251,    0],
       [4327, 5244, 4336, ..., 3521, 3522,    0],
       [5170,  824, 1807, ..., 3538, 3539,    0],
       ...,
       [1453,  322, 1809, ..., 3070, 3069, 2672],
       [1588, 4115, 3891, ..., 3558, 3559,    0],
       [4274, 3190, 1249, ..., 3444, 3445,    0]])
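
The expression `argsort()[:, ::-1]` sorts each row in ascending order and then reverses it, so the first column holds the index of the heaviest term for each centroid. A tiny worked example, with hypothetical centroids and terms, makes this concrete:

```python
import numpy as np

# Two hypothetical centroids over a three-term vocabulary
centers = np.array([
    [0.1, 0.7, 0.2],
    [0.5, 0.0, 0.4],
])
terms = np.array(["cotton", "watch", "strap"])

# Ascending argsort per row, reversed to get decreasing weights
order = centers.argsort()[:, ::-1]
print(order)

# Top two terms per centroid
top2 = terms[order[:, :2]]
print(top2)
```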

In [115]:
vectorizer.get_feature_names_out()
# e.g. vectorizer.get_feature_names_out()[2140] returns 'fabric'

array(['00', '000', '0008m', ..., 'zovon', 'zw', 'zyxel'], dtype=object)

In [116]:
# For each cluster centroid, print the 10 terms with the largest weights
print("Top terms per cluster:")
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]  # term indices by decreasing centroid weight
terms = vectorizer.get_feature_names_out()  # vocabulary terms, indexed by column
for i in range(8):  # one pass per cluster
    print(f"\n Cluster {i}:")
    for ind in order_centroids[i, :10]:  # the 10 highest-weighted terms
        print(terms[ind])

Top terms per cluster:

 Cluster 0:
fabric
cotton
women
sleeve
neck
casual
printed
shirt
pattern
solid

 Cluster 1:
shirt
women
shop
branded
huge
apparels
clothes
india
collection
men

 Cluster 2:
watch
analog
dial
strap
great
men
women
sonata
india
boys

 Cluster 3:
necklace
alloy
metal
30
com
rs
acrylic
fashion
plated
voylla

 Cluster 4:
cover
color
rs
model
ipad
best
product
brand
design
30

 Cluster 5:
com
30
diamond
gold
ring
rs
18
art
pencil
na

 Cluster 6:
covers
robust
puppy
ways
sturdiness
lost
matte
class
impress
elegance

 Cluster 7:
set
mat
car
30
allure
auto
best
container
plant
bangle


This gives a first segmentation of the products.

**What types of products will we find in each cluster?**

Consider the search term *car accessory*. Let's try to find relevant products for this search.

**Vectorize the search term, then find the associated cluster.**

Below, we see how to provide a recommendation to the customer:

In [117]:
search = vectorizer.transform(["car accessory"])

In [119]:
search

<1x5345 sparse matrix of type '<class 'numpy.float64'>'
	with 2 stored elements in Compressed Sparse Row format>

In [118]:
prediction = kmeans.predict(search)
print(f"Cluster : {prediction[0]}")

Cluster : 7


**Suggest a product from this cluster.**

In [123]:
product_descriptions["Cluster"] = y_kmeans
product_descriptions_cluster = product_descriptions[product_descriptions["Cluster"] == prediction[0]]
product_descriptions_cluster.sample(1, random_state=42).iloc[0]["Description"]

'Flipkart.com: Buy Yardley English Rose Festive Collection Pack Combo Set online only for Rs. 418 from Flipkart.com. Only Genuine Products. 30 Day Replacement Guarantee. Free Shipping. Cash On Delivery!'
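
Because the product is drawn at random from the cluster, nothing guarantees that it is close to the query itself. One possible alternative, which is not what the notebook does, is to rank descriptions by cosine distance to the query vector with `NearestNeighbors`. The sketch below assumes a small hypothetical corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical descriptions standing in for product_descriptions["Description"]
docs = [
    "car seat cover black",
    "car accessory dashboard mat",
    "gold diamond ring",
    "rose perfume combo set",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Brute-force cosine search works directly on the sparse TF-IDF matrix
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
distances, indices = nn.kneighbors(vec.transform(["car accessory"]))

closest = [docs[i] for i in indices[0]]
print(closest)
```

In the notebook, the search could be restricted to `product_descriptions_cluster` to combine both ideas, at the cost of fitting one neighbor index per cluster.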