# Capstone project 1: Pet Product Auto-Subcategorization by Review Analysis (2-5 reviews, modeling)

**Goal: Creating a system that automatically classifies products in Pet Supplies category into subcategories by analyzing the reviews.**  
In this project, the data collected during 2014 - 2015 in the US is used.

This jupyter notebook is about modeling. If you want to see the preprocessing, see a jupyter notebool about preprocessing.

## Recapitulation

Number of reviews per product, products, reviews, and tokens in the data:
  
| Reviews/product | Total reviews | Total products | Total tokens (unique) |  
|:---------------:|:-------------:|:--------------:|:---------------------:|  
|2 to 5|245,565|61,796|9,140|  
  
  
Summary statistics of the number of tokens per product:
  
|  Min  |  25%  |  50%  |  75%  |  90%  |  Max|Mean  |  SD  |  
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|  
|5|14|23|35|48|119|26.3|15.9|

## Import libraries and load data

In [1]:
%matplotlib notebook

import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from collections import Counter
import nltk
import csv
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

sns.set(context='notebook', style='ticks', palette='hls')

In [2]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle the code on/off."></form>''')

In [2]:
# Load the token list
token_list = []
with open("token_list_2_5.csv", "r", encoding="UTF-8") as f:
    reader = csv.reader(f) 
    for r in reader: 
        token_list.append(r)
        
token_list[:2]

[['materi', 'larg', 'lenght', 'larg', 'fit', 'dog', 'button', 'coat', 'dog'],
 ['beasti',
  'band',
  'collar',
  'easili',
  'adjust',
  'afford',
  'band',
  'cat',
  'chew',
  'cat',
  'floor',
  'chew',
  'con',
  'stretch',
  'rip',
  'sever',
  'collar',
  'outdoor',
  'cat',
  'cat',
  'velcro',
  'ensur',
  'head',
  'collar',
  'curl',
  'pick',
  'girl',
  'import',
  'cat',
  'collar',
  'cat']]

In [3]:
len(token_list)

61796

In [4]:
# Load the product table
product_name = pd.read_csv("product_list_2_5.csv")
product_name

Unnamed: 0,product_id,product_title
0,3270,PETSOO Puppy Dog Pets Cute Winter Clothing Coa...
1,17464,"Beastie Bands ZEBRA Cat Collar, StripesBeastie..."
2,19343,PetSafe PIF00-12917 Stay & Play Wireless Fence...
3,23478,YML Double Door Dog Kennel Cage with Plastic T...
4,52493,Hartz Groomer's Best Pedicure Kit for Dogs and...
...,...,...
61791,999879135,EzyDog Micro Doggy Flotation Device (DFD)EzyDo...
61792,999917918,Farnam Horse Health Electrolytes SupplementFar...
61793,999944581,"KONG ZoomGroom, Dog Grooming Brush, SmallKONG ..."
61794,999961811,Tradewinds 8785 Canine Tapeworm Tablets - 5 x ...


## Define the Silhouette analysis

In [6]:
# Define Silhouette analysis
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm

def silhouette(n_clusters, cluster_labels, features):
    '''
    Draw Shilhouette plot to various k
    '''

    fig, ax1 = plt.subplots(1, 1)
    fig.set_size_inches(7, 5)

    # the silhouette plot
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(features) + (n_clusters + 1) * 10])

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(features, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values,
                                    facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples
    
    # Average silhouette score
    silhouette_avg = np.mean(sample_silhouette_values)
    print("For n_clusters =", n_clusters, ", The average silhouette_score is :", silhouette_avg)
    
    ax1.set_title("The silhouette plot for the various clusters")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    
    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    plt.show()

In [7]:
# Define Silhouette analysis to k-means to decide k
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm

def kmean_silhouette(min_k, max_k, interval_k, features):
    '''
    Do K-means and draw Shilhouette plot to various k
    '''
    range_n_clusters = [i for i in range(min_k, max_k+1, interval_k)]

    for n_clusters in range_n_clusters:
        # Initialize the clusterer with n_clusters value and a random generator seed of 10 for reproducibility.
        clusterer = KMeans(n_clusters=n_clusters, random_state=10)
        cluster_labels = clusterer.fit_predict(features)
        
        silhouette(n_clusters, cluster_labels, features)
        
    plt.show()

## 4. Building the model

#### Approach:

4.1. Vectorization with Doc2Vec  
4.2. Dimensional reduction  
4.3. Applying k-means model  

### 4.1. Vectorization with Doc2Vec

In [8]:
# How many unique words?
len(set(token for review in token_list for token in review))

9140

In [9]:
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

300

In [71]:
corpus = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(token_list)]

doc2vecs = Doc2Vec(documents=corpus, dm=1, epoch=600, alpha=0.025, min_alpha=0.0001, vector_size=300, 
                     min_count=5, window=5, sample=1e-6, workers=4)

In [73]:
# Get the vectors as a list
vectors_list = [doc2vecs.docvecs[n] for n in range(len(doc2vecs.docvecs))]
len(vectors_list)

61796

**Without dimensional reduction**

In [17]:
# To decide the K value
from sklearn.cluster import KMeans
# Select k
ks1 = [10, 15, 20, 25, 30]
inertias1_dire = []

for k in ks1:
    # Create a KMeans instance with k clusters: model
    model1 = KMeans(n_clusters=k)
    
    # Fit model to samples
    model1.fit(vectors_list)
    
    # Append the inertia to the list of inertias
    inertias1_dire.append(model1.inertia_)

In [18]:
# Plot ks vs inertias
plt.figure()
plt.plot(ks1, inertias1_dire, '-o')
plt.xlabel('Number of clusters, k')
plt.ylabel('Inertia')
plt.xticks(ks1)
plt.title('SS vs. K')
plt.show()

<IPython.core.display.Javascript object>

The Elbow point is around 15 or 20. Let's try k-means when k = 16.

In [19]:
# Try K-Means
k = 16
kmeans1_dire = KMeans(n_clusters=k)

label_kmean1_dire = kmeans1_dire.fit_predict(vectors_list)

df_label_kmeans1_dire = product_name[['product_id', 'product_title']]
df_label_kmeans1_dire['label'] = label_kmean1_dire
df_label_kmeans1_dire.head(10)

Unnamed: 0,product_id,product_title,label
0,3270,PETSOO Puppy Dog Pets Cute Winter Clothing Coa...,5
1,17464,"Beastie Bands ZEBRA Cat Collar, StripesBeastie...",2
2,19343,PetSafe PIF00-12917 Stay & Play Wireless Fence...,2
3,23478,YML Double Door Dog Kennel Cage with Plastic T...,12
4,52493,Hartz Groomer's Best Pedicure Kit for Dogs and...,0
5,69972,Espree Skin & Coat Care for PetsEspree Skin & ...,12
6,70064,Perfect Pet Soft Flap Cat Door with Telescopin...,9
7,82760,Landen 60P 25.4 Gallon Rimless Low Iron Aquari...,11
8,100975,Generic Dog Puppy Pet Rainbow Colorful Rubber ...,8
9,102579,"Weco Wonder Shell Natural Minerals (3 Pack), S...",2


In [20]:
# Silhouette plot
silhouette(len(set(label_kmean1_dire)), label_kmean1_dire, vectors_list)

<IPython.core.display.Javascript object>

For n_clusters = 16 , The average silhouette_score is : 0.009322195


50

In [72]:
doc2vecs_50 = Doc2Vec(documents=corpus, dm=1, epoch=600, alpha=0.025, min_alpha=0.0001, vector_size=50, 
                     min_count=5, window=5, sample=1e-6, workers=4)

In [38]:
# Get the vectors as a list
vectors_list_50 = [doc2vecs_50.docvecs[n] for n in range(len(doc2vecs_50.docvecs))]
len(vectors_list_50)

61796

In [62]:
# To decide the K value
from sklearn.cluster import KMeans
# Select k
ks1 = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25]
inertias1_dire_50 = []

for k in ks1:
    # Create a KMeans instance with k clusters: model
    model1 = KMeans(n_clusters=k)
    
    # Fit model to samples
    model1.fit(vectors_list_50)
    
    # Append the inertia to the list of inertias
    inertias1_dire_50.append(model1.inertia_)

In [63]:
# Plot ks vs inertias
plt.figure()
plt.plot(ks1, inertias1_dire_50, '-o')
plt.xlabel('Number of clusters, k')
plt.ylabel('Inertia')
plt.xticks(ks1)
plt.title('SS vs. K')
plt.show()

<IPython.core.display.Javascript object>

In [64]:
# Try K-Means
k = 9
kmeans1_dire_2 = KMeans(n_clusters=k)

label_kmean1_dire_2 = kmeans1_dire_2.fit_predict(vectors_list_50)

df_label_kmeans1_dire_2 = product_name[['product_id', 'product_title']]
df_label_kmeans1_dire_2['label'] = label_kmean1_dire_2
df_label_kmeans1_dire_2.head(10)

Unnamed: 0,product_id,product_title,label
0,3270,PETSOO Puppy Dog Pets Cute Winter Clothing Coa...,7
1,17464,"Beastie Bands ZEBRA Cat Collar, StripesBeastie...",3
2,19343,PetSafe PIF00-12917 Stay & Play Wireless Fence...,3
3,23478,YML Double Door Dog Kennel Cage with Plastic T...,0
4,52493,Hartz Groomer's Best Pedicure Kit for Dogs and...,3
5,69972,Espree Skin & Coat Care for PetsEspree Skin & ...,3
6,70064,Perfect Pet Soft Flap Cat Door with Telescopin...,7
7,82760,Landen 60P 25.4 Gallon Rimless Low Iron Aquari...,0
8,100975,Generic Dog Puppy Pet Rainbow Colorful Rubber ...,6
9,102579,"Weco Wonder Shell Natural Minerals (3 Pack), S...",3


In [65]:
# Silhouette plot
silhouette(len(set(label_kmean1_dire_2)), label_kmean1_dire_2, vectors_list_50)

<IPython.core.display.Javascript object>

For n_clusters = 9 , The average silhouette_score is : 0.020778246


20

In [55]:
doc2vecs_20 = Doc2Vec(documents=corpus, dm=1, epoch=600, alpha=0.025, min_alpha=0.0001, vector_size=20, 
                     min_count=5, window=5, sample=1e-6, workers=4)

In [56]:
# Get the vectors as a list
vectors_list_20 = [doc2vecs_20.docvecs[n] for n in range(len(doc2vecs_20.docvecs))]
len(vectors_list_20)

61796

In [58]:
# To decide the K value
from sklearn.cluster import KMeans
# Select k
ks1 = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25]
inertias1_dire_20 = []

for k in ks1:
    # Create a KMeans instance with k clusters: model
    model1 = KMeans(n_clusters=k)
    
    # Fit model to samples
    model1.fit(vectors_list_20)
    
    # Append the inertia to the list of inertias
    inertias1_dire_20.append(model1.inertia_)

In [59]:
# Plot ks vs inertias
plt.figure()
plt.plot(ks1, inertias1_dire_20, '-o')
plt.xlabel('Number of clusters, k')
plt.ylabel('Inertia')
plt.xticks(ks1)
plt.title('SS vs. K')
plt.show()

<IPython.core.display.Javascript object>

In [60]:
# Try K-Means
k = 10
kmeans1_dire_3 = KMeans(n_clusters=k)

label_kmean1_dire_3 = kmeans1_dire_2.fit_predict(vectors_list_20)

df_label_kmeans1_dire_3 = product_name[['product_id', 'product_title']]
df_label_kmeans1_dire_3['label'] = label_kmean1_dire_3
df_label_kmeans1_dire_3.head(10)

Unnamed: 0,product_id,product_title,label
0,3270,PETSOO Puppy Dog Pets Cute Winter Clothing Coa...,1
1,17464,"Beastie Bands ZEBRA Cat Collar, StripesBeastie...",1
2,19343,PetSafe PIF00-12917 Stay & Play Wireless Fence...,0
3,23478,YML Double Door Dog Kennel Cage with Plastic T...,1
4,52493,Hartz Groomer's Best Pedicure Kit for Dogs and...,6
5,69972,Espree Skin & Coat Care for PetsEspree Skin & ...,0
6,70064,Perfect Pet Soft Flap Cat Door with Telescopin...,6
7,82760,Landen 60P 25.4 Gallon Rimless Low Iron Aquari...,6
8,100975,Generic Dog Puppy Pet Rainbow Colorful Rubber ...,8
9,102579,"Weco Wonder Shell Natural Minerals (3 Pack), S...",5


In [61]:
# Silhouette plot
silhouette(len(set(label_kmean1_dire_3)), label_kmean1_dire_3, vectors_list_20)

<IPython.core.display.Javascript object>

For n_clusters = 9 , The average silhouette_score is : 0.03425362


Silhouette score is very low. It would be better to reduce the dimensions.

### 4.2. Dimensional reduction

original dimension 300

In [12]:
# Select the number of components for TruncatedSVD
from sklearn.decomposition import TruncatedSVD

# Create a TruncatedSVD instance
svd1 = TruncatedSVD(n_components=299)

# Fit the model to 'vectors_list'
svd1.fit(vectors_list)

TruncatedSVD(algorithm='randomized', n_components=299, n_iter=5,
             random_state=None, tol=0.0)

In [21]:
svd1.explained_variance_[:10]

array([2.0563498e-04, 1.0552926e-06, 1.0537007e-06, 1.0510552e-06,
       1.0489714e-06, 1.0474677e-06, 1.0456702e-06, 1.0446420e-06,
       1.0438063e-06, 1.0423666e-06], dtype=float32)

In [16]:
# Plot the explained variances
plt.figure(figsize=(10, 5))
features = range(100) 
plt.bar(features, svd1.explained_variance_[:100]) 
plt.xlabel('SVD feature')
plt.ylabel('Variance')
plt.xticks(range(0, 101, 10))
plt.title('The explained variances')
plt.show()

<IPython.core.display.Javascript object>

Let's try 2 dimensions.

2 dimensions

In [22]:
# Create a TruncatedSVD instance using the number of components
n = 2
svd1_n = TruncatedSVD(n_components=n)
svd1_n_features = svd1_n.fit_transform(vectors_list)
svd1_n_features.shape

(61796, 2)

In [51]:
plt.figure(figsize=(10, 10))
plt.scatter(x=svd1_n_features[:,0], y=svd1_n_features[:,1], s=5)
plt.xlabel('x')
plt.ylabel('y')
plt.ylim([-0.005, 0.005])
plt.title('The visualization of the clustered data')
plt.show()

<IPython.core.display.Javascript object>

Try k-means with 2 dimensions.

In [31]:
# To decide the K value
from sklearn.cluster import KMeans
# Select k
ks1 = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25]
inertias1 = []

for k in ks1:
    # Create a KMeans instance with k clusters: model
    model1 = KMeans(n_clusters=k)
    
    # Fit model to samples
    model1.fit(svd1_n_features)
    
    # Append the inertia to the list of inertias
    inertias1.append(model1.inertia_)

In [32]:
# Plot ks vs inertias
plt.figure()
plt.plot(ks1, inertias1, '-o')
plt.xlabel('Number of clusters, k')
plt.ylabel('Inertia')
plt.xticks(ks1)
plt.title('SS vs. K')
plt.show()

<IPython.core.display.Javascript object>

According to the plot, k is around 10 or 15. Let's check Silhouette score changing the k value. 

In [33]:
# Silhouette plots when k = 7, 8, 9, 10, 11, 12 (dimension = 2)
kmean_silhouette(7, 12, 1, svd1_n_features)

<IPython.core.display.Javascript object>

For n_clusters = 7 , The average silhouette_score is : 0.49545968


<IPython.core.display.Javascript object>

For n_clusters = 8 , The average silhouette_score is : 0.48189723


<IPython.core.display.Javascript object>

For n_clusters = 9 , The average silhouette_score is : 0.46437985


<IPython.core.display.Javascript object>

For n_clusters = 10 , The average silhouette_score is : 0.45400843


<IPython.core.display.Javascript object>

For n_clusters = 11 , The average silhouette_score is : 0.43510893


<IPython.core.display.Javascript object>

For n_clusters = 12 , The average silhouette_score is : 0.4312632


k = 7

In [53]:
# Try K-Means
k = 7
kmeans1 = KMeans(n_clusters=k)

label_kmean1 = kmeans1.fit_predict(vectors_list)

df_label_kmeans1 = product_name[['product_id', 'product_title']]
df_label_kmeans1['label'] = label_kmean1
df_label_kmeans1.head(10)

Unnamed: 0,product_id,product_title,label
0,3270,PETSOO Puppy Dog Pets Cute Winter Clothing Coa...,6
1,17464,"Beastie Bands ZEBRA Cat Collar, StripesBeastie...",2
2,19343,PetSafe PIF00-12917 Stay & Play Wireless Fence...,2
3,23478,YML Double Door Dog Kennel Cage with Plastic T...,2
4,52493,Hartz Groomer's Best Pedicure Kit for Dogs and...,6
5,69972,Espree Skin & Coat Care for PetsEspree Skin & ...,1
6,70064,Perfect Pet Soft Flap Cat Door with Telescopin...,1
7,82760,Landen 60P 25.4 Gallon Rimless Low Iron Aquari...,6
8,100975,Generic Dog Puppy Pet Rainbow Colorful Rubber ...,1
9,102579,"Weco Wonder Shell Natural Minerals (3 Pack), S...",2


In [54]:
plt.figure(figsize=(10, 10))
plt.scatter(x=svd1_n_features[:,0], y=svd1_n_features[:,1], c=label_kmean1, s=5)
plt.xlabel('x')
plt.ylabel('y')
plt.ylim([-0.005, 0.005])
plt.title('The visualization of the clustered data')
plt.show()

<IPython.core.display.Javascript object>

original dimension 50

In [43]:
# Select the number of components for TruncatedSVD
from sklearn.decomposition import TruncatedSVD

# Create a TruncatedSVD instance
svd1_50 = TruncatedSVD(n_components=49)

# Fit the model to 'vectors_list'
svd1_50.fit(vectors_list_50)

TruncatedSVD(algorithm='randomized', n_components=49, n_iter=5,
             random_state=None, tol=0.0)

In [50]:
svd1_50.explained_variance_[:10]

array([6.6111318e-04, 3.5019166e-05, 3.4895795e-05, 3.4767487e-05,
       3.4702778e-05, 3.4646455e-05, 3.4579734e-05, 3.4457036e-05,
       3.4438952e-05, 3.4347766e-05], dtype=float32)

In [49]:
# Plot the explained variances
plt.figure(figsize=(10, 5))
features = range(49) 
plt.bar(features, svd1_50.explained_variance_) 
plt.xlabel('SVD feature')
plt.ylabel('Variance')
plt.xticks(range(0, 51, 5))
plt.title('The explained variances')
plt.show()

<IPython.core.display.Javascript object>

original dimension 20

In [68]:
# Select the number of components for TruncatedSVD
from sklearn.decomposition import TruncatedSVD

# Create a TruncatedSVD instance
svd1_20 = TruncatedSVD(n_components=19)

# Fit the model to 'vectors_list'
svd1_20.fit(vectors_list_20)

TruncatedSVD(algorithm='randomized', n_components=19, n_iter=5,
             random_state=None, tol=0.0)

In [69]:
svd1_20.explained_variance_[:10]

array([0.00124072, 0.00021547, 0.00021421, 0.00021273, 0.00021164,
       0.00021126, 0.00021047, 0.00021011, 0.00020951, 0.00020884],
      dtype=float32)

In [70]:
# Plot the explained variances
plt.figure(figsize=(10, 5))
features = range(19) 
plt.bar(features, svd1_20.explained_variance_) 
plt.xlabel('SVD feature')
plt.ylabel('Variance')
plt.xticks(range(0, 21, 5))
plt.title('The explained variances')
plt.show()

<IPython.core.display.Javascript object>

**The other models**

1. DBSCAN

With dimensional reduction (7 dimensions)

In [None]:
# DBSCAN
from sklearn.cluster import DBSCAN

dbscan1 = DBSCAN(eps=0.5)
label_db1 = dbscan1_dire.fit_predict(svd1_n_features_7)

df_label_db1 = product_name[['product_id', 'product_title']]
df_label_db1['label'] = label_db1
df_label_db1.head()

In [None]:
set(label_db1)

In [None]:
# Silhouette plot
silhouette(len(set(label_db1)), label_db1, svd1_n_features_7)

2. Affinity propagation

With dimensional reduction (7 dimensions)

In [23]:
# Affinity propagation
from sklearn.cluster import AffinityPropagation

Aff1 = AffinityPropagation()
label_aff1 = Aff1.fit_predict(svd1_n_features_7)

df_label_aff1 = product_name[['product_id', 'product_title']]
df_label_aff1['label'] = label_aff1
df_label_aff1.head()

MemoryError: Unable to allocate 28.5 GiB for an array with shape (61796, 61796) and data type float64

In [None]:
n_clusters_aff1 = len(set(label_aff1)
n_clusters_aff1

In [None]:
# Silhouette plot
silhouette(n_clusters_aff1, label_aff1, svd1_n_features_7)

3. Mean-shift

With dimensional reductions (7 dimensions)

In [71]:
from sklearn.cluster import MeanShift

ms1 = MeanShift(seeds=svd1_n_features_7)
label_ms1 = ms1.fit_predict(svd1_n_features_7)

df_label_ms1 = product_name[['product_id', 'product_title']]
df_label_ms1['label'] = label_ms1
df_label_ms1.head()

[[ 0.12015731 -0.03038722  0.00152214 -0.03848991  0.01080679  0.00220317
   0.007025    0.00992317 -0.00333705 -0.00784524 -0.00409096 -0.00209521
  -0.00270592]
 [ 0.41487225 -0.29141214 -0.04202477 -0.18524623  0.10881872  0.03947082
  -0.16837394 -0.19282269 -0.22346347 -0.17418922 -0.06681692 -0.0347344
  -0.16807398]]


Unnamed: 0,product_id,product_title,label
0,3270,PETSOO Puppy Dog Pets Cute Winter Clothing Coa...,0
1,17464,"Beastie Bands ZEBRA Cat Collar, StripesBeastie...",0
2,19343,PetSafe PIF00-12917 Stay & Play Wireless Fence...,0
3,23478,YML Double Door Dog Kennel Cage with Plastic T...,0
4,52493,Hartz Groomer's Best Pedicure Kit for Dogs and...,0


In [None]:
n_clusters_ms1 = len(set(label_ms1)
n_clusters_ms1

In [72]:
# Silhouette plot
silhouette(n_clusters_ms1, label_ms1, svd1_n_features_7)



<IPython.core.display.Javascript object>

The average silhouette_score is : 0.35628715558558716


4. VBGMM(Variational Bayesian Gaussian Mixture):  not good, very low Silhouette score

In [76]:
from sklearn.mixture import BayesianGaussianMixture

vbgm1 = BayesianGaussianMixture(n_components=16)
label_vbgm1 = vbgm1.fit_predict(svd1_n_features_7)

df_label_vbgm1 = product_name[['product_id', 'product_title']]
df_label_vbgm1['label'] = label_vbgm1
df_label_vbgm1.head()



Unnamed: 0,product_id,product_title,label
0,3270,PETSOO Puppy Dog Pets Cute Winter Clothing Coa...,1
1,17464,"Beastie Bands ZEBRA Cat Collar, StripesBeastie...",13
2,19343,PetSafe PIF00-12917 Stay & Play Wireless Fence...,13
3,23478,YML Double Door Dog Kennel Cage with Plastic T...,11
4,52493,Hartz Groomer's Best Pedicure Kit for Dogs and...,1


In [80]:
#print(vbgm.weights_)
plt.figure()
x_tick =np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])
plt.bar(x_tick, vbgm1.weights_, width=0.7, tick_label=x_tick)
plt.show

  


<IPython.core.display.Javascript object>

<function matplotlib.pyplot.show(*args, **kw)>

In [None]:
n_clusters_vbgm1 = len(set(label_vbgm1)
n_clusters_vbgm1

In [81]:
# Silhouette plot
silhouette(n_clusters_vbgm1, label_vbgm1, svd1_n_features_7)



<IPython.core.display.Javascript object>

The average silhouette_score is : -0.0072682444199583165


5. GMM: not good, very low Silhouette score

In [None]:
from sklearn.mixture import GaussianMixture

gmm1 = GaussianMixture(n_components=16)
label_gmm1 = gmm1.fit_predict(svd1_n_features_7)

df_label_gmm1 = product_name[['product_id', 'product_title']]
df_label_gmm1['label'] = label_gmm1
df_label_gmm1.head()

In [None]:
#print(vbgm.weights_)
plt.figure()
x_tick =np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])
plt.bar(x_tick, gmm1.weights_, width=0.7, tick_label=x_tick)
plt.show

In [None]:
# Silhouette plot
silhouette(len(set(label_gmm1)), label_gmm1, svd1_n_features_7)

6. Special Clustering: Memory error

In [82]:
from sklearn.cluster import SpectralClustering

spec1 = SpectralClustering(n_components=16, random_state=6)
label_spec1 = spec1.fit_predict(svd1_n_features_7)

df_label_spec1 = product_name[['product_id', 'product_title']]
df_label_spec1['label'] = label_spec1
df_label_spec1.head()

MemoryError: Unable to allocate 28.5 GiB for an array with shape (61796, 61796) and data type float64

In [None]:
# Silhouette plot
silhouette(len(set(label_spec1), label_spec1, svd1_n_features_7)