# Capstone project 1: Pet Product Auto-Subcategorization by Review Analysis (modeling)

**Problem and Goal:**  
An e-commerce company has a Pet Supplies category on its website. The category contains many products for dogs, cats, birds, and other animals. The company would like to **divide these products into smaller subcategories** based on their reviews, so that it can analyze trends and customer needs for specific product areas and so that customers can easily find the products they want.

This Jupyter notebook covers the clustering step. For the preprocessing step, see the preprocessing Jupyter notebook.

## Recapitulation

In the preprocessing steps, the original data (reviews) were cleaned, tokenized, and divided into three animal categories: cat, dog, and other. Each dataset now has the following numbers of products and unique tokens:
  
| Category  | Total products | Total tokens (unique) |  
|:--------:|:--------------:|:---------------------:|  
|dog|11,916|6,682|
|cat|4,099|6,001|
|other|4,388|5,914|
  
  
Summary statistics of the number of tokens per product:
  
| Category |  Min  |  25%  |  50%  |  75%  |  90%  |  Max|Mean  |  SD  |  
|:--------:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|  
|dog|7|36|45|55|66|126|46.0|14.8|
|cat|9|38|47|58|69|120|48.7|15.1|
|other|5|32|42|52|63|98|43.1|14.7|  
  
We use these tokens to group the products into subcategories. 

## Table of Contents

0. Import packages and define functions  
  
4. Clustering
5. Summary

## 0. Import packages and define functions

In [139]:
# Import the packages
%matplotlib notebook

import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from collections import Counter
import nltk
import csv
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize
from sklearn.manifold import TSNE
from ipywidgets import interact

sns.set(context='notebook', style='ticks', palette='hls')
pd.set_option("display.max_colwidth", 100)

# colors
cmap1 = plt.get_cmap('tab20b') 
cmap2 = plt.get_cmap('tab20c') 

cmap1_vals = [cmap1(i) for i in range(cmap1.N)]
cmap2_vals = [cmap2(i) for i in range(cmap2.N)]

cmaps = cmap1_vals + cmap2_vals + cmap1_vals + cmap2_vals + cmap1_vals 

In [140]:
# You can view this notebook without the code
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle the code on/off."></form>''')

In [141]:
# Define a function that plots a dendrogram
from scipy.cluster.hierarchy import dendrogram

def plot_dendrogram(model, **kwargs):
    '''
    Create linkage matrix and then plot the dendrogram
    '''

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack([model.children_, model.distances_,
                                      counts]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

In [142]:
# Define a function that draws a silhouette plot
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm

def silhouette_plot(n_clusters, cluster_labels, features):
    '''
    Draw a silhouette plot
    '''

    fig, ax1 = plt.subplots(1, 1)
    fig.set_size_inches(7, 7)

    # the silhouette plot
    ax1.set_xlim([-0.1, 1])
    
    # The (n_clusters+1)*10 is for inserting blank space between silhouette plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, features.shape[0] + (n_clusters + 1) * 10])

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(features, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values,
                                    facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples
    
    # Average silhouette score
    silhouette_avg = np.mean(sample_silhouette_values)
    print("For n_clusters =", n_clusters, ", The average silhouette_score is :", silhouette_avg)
    
    ax1.set_title("The silhouette plot for the various clusters")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    
    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    plt.show()

In [5]:

# Define Silhouette analysis (only score)
from sklearn.metrics import silhouette_score
import matplotlib.cm as cm

def aggclus_silhouette_score(min_nclus, max_nclus, interval_nclus, features):
    '''
    Run agglomerative clustering and get the silhouette score for various numbers of clusters
    '''
    n_clusters = [i for i in range(min_nclus, max_nclus+1, interval_nclus)]
    scores = []

    for k in n_clusters:
        # Note: scikit-learn >= 1.2 renames `affinity` to `metric` (`affinity` was removed in 1.4)
        model = AgglomerativeClustering(n_clusters=k, affinity='cosine', linkage='average')
        model = model.fit(features)
        
        score = silhouette_score(features, model.labels_)
        scores.append(round(score, 4))
    
    return n_clusters, scores

In [130]:
# Define a function that runs agglomerative clustering and
# gets the silhouette score and the number of products in each cluster
from sklearn.cluster import AgglomerativeClustering

def aggclus_silscore_nsamples(min_nclus, max_nclus, interval_nclus, features):
    '''
    Run agglomerative clustering and get the silhouette score and the cluster sizes
    for various numbers of clusters
    '''
    n_clusters = [i for i in range(min_nclus, max_nclus+1, interval_nclus)]
    scores = []
    n_samples = []

    for k in n_clusters:
        # Note: scikit-learn >= 1.2 renames `affinity` to `metric` (`affinity` was removed in 1.4)
        model = AgglomerativeClustering(n_clusters=k, affinity='cosine', linkage='average')
        model = model.fit(features)
        
        score = silhouette_score(features, model.labels_)
        scores.append(round(score, 4))
        
        each_datapoints = {}
        
        for i in range(k):
            each_datapoints[i] = len(np.where(model.labels_ == i)[0])
        n_samples.append(each_datapoints)
    
    return n_clusters, scores, n_samples

In [143]:
# Define a function that finds the 10 most frequent words in each cluster
def frequent_words(df, label_column, token_list):
    '''
    Find the 10 most frequent words in each cluster
    (each word is counted at most once per product)
    '''
    n_clusters = df[label_column].nunique()
    df_cluster = pd.DataFrame(columns=['cluster number','frequent words', 'number of products'])
    for i in range(n_clusters):
        words = []
        indexes = df[df[label_column] == i].index
        for index in indexes:
            for word in set(token_list[index]):
                words.append(word)
        c = Counter(words)
        values, _ = zip(*c.most_common(10))

        df_temp = pd.DataFrame([[i, values, len(indexes)]], 
                                columns=['cluster number','frequent words', 'number of products'])
        df_cluster = df_cluster.append(df_temp)

    return df_cluster.set_index('cluster number')

## 4. Clustering

First, let's imagine how the result of this project would be used:  
1. A customer visits the website of the e-commerce company.  
2. The customer selects the 'Animal products' category.  
3. The customer selects the 'Cat' category.  
4. The customer selects the 'Grooming' category.  
5. The customer selects the 'Brush, Clipper' category.  
6. The customer browses the products, finds something interesting, and buys it.  

As you can see, there are animal categories (which we already have: 'Dog', 'Cat', 'Other'). Under each animal category I would like to have some big categories (e.g. 'Grooming', 'Food', 'Toy'), and under each big category some small categories (e.g. 'Brush', 'Cat tree', 'Collar'). I have chosen hierarchical agglomerative clustering to achieve this.  
  
Hierarchical agglomerative clustering is a method of cluster analysis, which is a form of unsupervised learning. Cluster analysis is generally used to segment data points into groups without any pre-assigned labels. Hierarchical agglomerative clustering builds nested clusters by starting with each sample as its own cluster and successively merging the closest clusters until only one cluster remains. The resulting hierarchy is represented as a dendrogram.  

This fits the purpose of this project well. The one weak point of this method is its time complexity; that is, it is slow. The three datasets in this project (dog, cat, other) have roughly 4,000 to 12,000 data points each, so the clustering will take some time, but it should be acceptable. Let's start!
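To make the merge process concrete, here is a minimal sketch on toy data. It uses SciPy's `linkage` and `fcluster` rather than the scikit-learn class used later in this notebook, and the toy points, the Euclidean metric, and all variable names are assumptions for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (an assumption for illustration): three tight groups of 2-D points
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [5.0, 6.5]])
points = np.vstack([c + 0.1 * rng.standard_normal((10, 2)) for c in centers])

# Build the full merge tree: each row records one merge (cluster a, cluster b, distance, size)
tree = linkage(points, method='average', metric='euclidean')
print('number of merges:', tree.shape[0])          # n_samples - 1
print('last two merge distances:', tree[-2:, 2])   # the biggest jumps separate the groups

# Cut the tree so that at most three clusters remain
labels = fcluster(tree, t=3, criterion='maxclust')
sizes = np.bincount(labels)[1:]                    # fcluster labels start at 1
print('cluster sizes:', sizes)
```

Starting from 30 single-point clusters, the tree records 29 merges; cutting it at different heights gives coarser or finer groupings from the same tree.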

#### Approach:

4.1. Dog category  
4.1.1. Load data  
4.1.2. Vectorization    
4.1.3. Hierarchical clustering  
  
4.2. Cat category  
4.2.1. Load data  
4.2.2. Vectorization    
4.2.3. Hierarchical clustering 

4.3. Other category  
4.3.1. Load data  
4.3.2. Vectorization    
4.3.3. Hierarchical clustering 

### 4.1. Dog category

#### 4.1.1. Load data

Load the dog token list and the ID-title table we made during preprocessing.

Token list:

In [294]:
# Load the token list
dog_token_list = []
with open("dog_token_list.csv", "r", encoding="UTF-8") as f:
    reader = csv.reader(f) 
    for r in reader: 
        dog_token_list.append(r)
        
print('dog_token_list (first 5 tokens of the second product):', dog_token_list[1][:5])
print('Number of products:', len(dog_token_list))
print('Number of unique tokens:', len(set(token for review in dog_token_list for token in review)))

dog_token_list (first 5 tokens of the second product): ['chewi', 'contamin', 'food', 'dog', 'stella']
Number of products: 11916
Number of unique tokens: 6682


Dog-related tokens such as 'dog', 'puppi', or 'doggi' appear everywhere and can be a distraction. Let's remove them. 

In [295]:
# Remove dog tokens
dog_tokens = ['dog', 'puppi', 'doggi']
new_dog_token_list = []

for product in dog_token_list:
    nested_list = []
    for token in product:
        if token not in dog_tokens:
            nested_list.append(token)
    new_dog_token_list.append(nested_list)
    
print('new_dog_token_list (first 5 tokens of the second product):', new_dog_token_list[1][:5])
print('Number of products:', len(new_dog_token_list))
print('Number of unique tokens:', len(set(token for review in new_dog_token_list for token in review)))

new_dog_token_list (first 5 tokens of the second product): ['chewi', 'contamin', 'food', 'stella', 'chewi']
Number of products: 11916
Number of unique tokens: 6679
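The same filtering can be sketched more compactly with a set and a nested list comprehension. The toy token lists below are made up for illustration; only the three removed tokens come from this notebook.

```python
# Toy token lists (an assumption for illustration)
toy_token_list = [['chewi', 'food', 'dog', 'stella'],
                  ['puppi', 'leash', 'walk'],
                  ['doggi', 'dog']]

# A set gives fast membership tests
tokens_to_remove = {'dog', 'puppi', 'doggi'}

# Keep every token that is not in the removal set, product by product
filtered = [[token for token in product if token not in tokens_to_remove]
            for product in toy_token_list]
print(filtered)
```

Note that a product whose tokens are all removed ends up as an empty list, which may be worth checking for before vectorization.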


ID-title table:

In [155]:
# Load the product table
df_dog_id_name = pd.read_csv("df_dog_id_name.csv")

print('df_dog_id_name (first 5 products):')
df_dog_id_name.head()

df_dog_id_name (first 5 products):


Unnamed: 0,product_id,product_title
0,119780,"ARK Naturals PRODUCTS for PETS 326066 4-Ounce Breath-Less Chewable Brushless Toothpaste, MiniARK..."
1,202371,"Stella & Chewy's Freeze Dried Dog Food for Adult Dogs, Chicken Patties, 15 Ounce Bag - 2 PackSte..."
2,291967,Premium Deshedding Brush for Dogs and Cats with Medium to Long Hair | Veterinary Approved | Rugg...
3,490904,"Remington Coastal Pet R0206 GRN06 Rope Leash, 72-Inch, GreenRemington Coastal Pet R0206 GRN06 Ro..."
4,798322,Pet Dog Puppy Nonslip Canvas Sport Shoes Sneaker Boots Rubber Sole Size 5 Blue by MallofusaPet D...


#### 4.1.2. Vectorization

A count vectorizer is used to vectorize the tokens because frequent words look useful for describing each product (see the preprocessing notebook). new_dog_token_list has 6,679 unique tokens. Additionally, I would like to use bigrams so that frequent compound words such as 'dried food', 'training pad', and 'litter box' can be recognized as single terms. If all of the single tokens and bigrams were used, there would be more than 10,000 features. Here, I have decided to keep only the 5,000 most frequent terms for clustering. Otherwise, too many dimensions would cause very long compute times and complications from the curse of dimensionality.

In [157]:
# Create CountVectorizer object 
def dummy_tokened(text):
    return text

# Keep only the 5,000 most frequent unigrams and bigrams
cvectorizer_dog = CountVectorizer(tokenizer=dummy_tokened, lowercase=False, max_features=5000, ngram_range=(1, 2))

bow_dog = cvectorizer_dog.fit_transform(new_dog_token_list)

# Get the feature names
# (in scikit-learn >= 1.2, use get_feature_names_out() instead)
feature_names_dog = cvectorizer_dog.get_feature_names()

# Show the shape of bow_dog
print('BOW matrix: bow_dog')
print('Matrix shape:', bow_dog.shape)

BOW matrix: bow_dog
Matrix shape: (11916, 5000)
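To see what the combination of pre-tokenized input, bigrams, and `max_features` does, here is a small sketch on three made-up products. The toy tokens, the `identity` helper name, and `max_features=3` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

def identity(tokens):
    # The input is already tokenized, so return it unchanged
    return tokens

# Toy pre-tokenized products (an assumption for illustration)
toy_tokens = [['litter', 'box', 'clean'],
              ['litter', 'box', 'big'],
              ['train', 'pad']]

# Unigrams and bigrams, keeping only the 3 most frequent terms
toy_vectorizer = CountVectorizer(tokenizer=identity, lowercase=False,
                                 ngram_range=(1, 2), max_features=3)
toy_bow = toy_vectorizer.fit_transform(toy_tokens)

# vocabulary_ maps each kept term to its column index
kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)
print(toy_bow.toarray())
```

Here 'litter', 'box', and the bigram 'litter box' each occur twice, so they are the terms that survive the cut; the third product contains none of them and becomes an all-zero row.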


As shown above, some products have few tokens and some have many (min. 7, max. 126). To adjust for this difference, let's normalize the BoW matrix. Also, convert the matrix into a NumPy array so that it can be used with scikit-learn's agglomerative clustering.

In [163]:
# Normalize the bow matrix
normalized_bow_dog = normalize(bow_dog)

# Convert the normalized bow matrix into np.array to use for AgglomerativeClustering()
np_normalized_bow_dog = normalized_bow_dog.toarray()

print('Normalized BOW matrix: np_normalized_bow_dog')
print('Matrix shape:', np_normalized_bow_dog.shape)

Normalized BOW matrix: np_normalized_bow_dog
Matrix shape: (11916, 5000)
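As a small check of what this normalization does, the sketch below applies `normalize` (which uses the L2 norm by default) to made-up count vectors. The toy counts are an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Toy count vectors (an assumption for illustration):
# row 1 has the same word proportions as row 0, but ten times as many tokens
toy_counts = np.array([[2.0, 1.0, 0.0],
                       [20.0, 10.0, 0.0],
                       [0.0, 1.0, 3.0]])

unit_rows = normalize(toy_counts)            # each row is scaled to length 1
row_norms = np.linalg.norm(unit_rows, axis=1)

# For unit-length rows, the dot product is the cosine similarity
cosine_distance = 1.0 - unit_rows @ unit_rows.T

# Squared Euclidean distance between unit-length rows equals 2 * cosine distance
diffs = unit_rows[:, None, :] - unit_rows[None, :, :]
squared_euclidean = (diffs ** 2).sum(axis=2)

print(row_norms)
print(np.round(cosine_distance, 3))
```

After normalization, the short and the long product with the same word proportions become the same vector, so the number of tokens no longer affects the distance between products.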


Here, let's take a look at a t-SNE plot to get a rough idea of how many clusters there are.

In [670]:
# Create a t-SNE instance
tsne_dog = TSNE(perplexity=50, learning_rate=800, random_state=10)

# Apply fit_transform to samples
tsne_features_dog = tsne_dog.fit_transform(np_normalized_bow_dog)

# Scatter plot of t-SNE
plt.figure(figsize=(9,7))
tsne_features = tsne_features_dog

plt.scatter(tsne_features[:,0], tsne_features[:,1], s=4, alpha=0.8, c='royalblue')  
plt.title('t-SNE plot of dog products')
        
plt.subplots_adjust(left=0.035, right=0.999, bottom=0.04, top=0.85)
plt.show()

  



There seem to be fewer than 10 big clusters and many small clusters. Now, let's start hierarchical clustering!

#### 4.1.3. Hierarchical clustering

First, I would like to divide the data into fewer than 10 categories (call them big_categories). Second, each of the big_categories will be divided into subcategories (call them small_categories).

**4.1.3.1. Big categories**

Let's plot the dendrogram and see the outline.

In [165]:
# Agglomerative clustering with distance_threshold=0 to compute the full tree
model_dog = AgglomerativeClustering(distance_threshold=0, n_clusters=None, affinity='cosine', linkage='average')
model_dog = model_dog.fit(np_normalized_bow_dog)

# Draw the dendrogram
plt.figure(figsize=(15,5))
plt.title('Hierarchical Clustering Dendrogram')
# plot only the top levels of the dendrogram (p=5)
plot_dendrogram(model_dog, truncate_mode='level', p=5)

plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.subplots_adjust(left=0.04, right=0.999, bottom=0.15, top=0.95)
plt.ylabel("Distance threshold")
plt.yticks([i/10 for i in range(0,11,1)])
plt.show()


According to the dendrogram, some mini clusters join the big clusters only near the top of the dendrogram. That means these mini clusters are not very similar to the main clusters. We can call them "other".  
I would like to divide the data into fewer than 10 big_categories. Because the mini clusters also count toward k, let's check the silhouette scores from k = 5 to 20.
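As a reminder of what the silhouette score measures, here is a sketch that computes it by hand for one point of a tiny made-up dataset and compares it with scikit-learn. For a point, a is the mean distance to the other points in its own cluster, b is the mean distance to the points in the nearest other cluster, and the score is (b - a) / max(a, b). The four points and their labels are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Toy data (an assumption for illustration): two clusters of two points each
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
toy_labels = np.array([0, 0, 1, 1])

# Silhouette of the first point, computed by hand (Euclidean distance)
a = 1.0                              # distance to the only other point in its cluster
b = (5.0 + np.sqrt(26.0)) / 2.0      # mean distance to the two points of the other cluster
manual_score = (b - a) / max(a, b)

# The same value from scikit-learn (the default metric is Euclidean)
sample_scores = silhouette_samples(toy_points, toy_labels)
average_score = silhouette_score(toy_points, toy_labels)

print(manual_score, sample_scores[0], average_score)
```

Scores near 1 mean a point is much closer to its own cluster than to any other, and scores near 0 mean it sits between clusters. Both scikit-learn functions also accept a `metric` argument (for example `metric='cosine'`) when a distance other than the Euclidean default is wanted.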

In [166]:
# Agglomerative clustering from k = 5 to 20 
n_k_dog, scores_dog, n_samples_dog  = aggclus_silscore_nsamples(5, 20, 1, np_normalized_bow_dog)

In [167]:
# Plot Number of clusters vs. Silhouette score
plt.figure(figsize=(9,4))

n_clusters = n_k_dog
sil_scores = scores_dog

plt.plot(n_clusters, sil_scores, '.-')
plt.xticks(n_clusters)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.title('Number of clusters vs. Silhouette score')

plt.subplots_adjust(left=0.085, right=0.999, bottom=0.15, top=0.9)
plt.show()


The scores are relatively high at k = 9, 15, and 19. Let's check how many big clusters there are in each case.

In [168]:
# Count the number of products in each cluster
for k in [9, 15, 19]:
    n_cluster = k - 5  # position of k in n_samples_dog (the list starts at k = 5)
    count = 0
    for i in range(len(n_samples_dog[n_cluster])):
        if n_samples_dog[n_cluster][i] > 100:
            count += 1
    print('k =', k)
    print(n_samples_dog[n_cluster])
    print('Number of clusters more than 100 samples:', count)
    print('')

k = 9
{0: 2211, 1: 2485, 2: 5223, 3: 10, 4: 6, 5: 6, 6: 1852, 7: 7, 8: 116}
Number of clusters more than 100 samples: 5

k = 15
{0: 88, 1: 2485, 2: 2092, 3: 10, 4: 22, 5: 3896, 6: 1852, 7: 7, 8: 116, 9: 6, 10: 76, 11: 6, 12: 9, 13: 9, 14: 1242}
Number of clusters more than 100 samples: 6

k = 19
{0: 3896, 1: 22, 2: 1285, 3: 1200, 4: 76, 5: 2065, 6: 1293, 7: 10, 8: 27, 9: 6, 10: 70, 11: 6, 12: 9, 13: 9, 14: 1242, 15: 7, 16: 18, 17: 116, 18: 559}
Number of clusters more than 100 samples: 8



At k = 9 there is one huge cluster (cluster_2), whereas at k = 15 and k = 19 there are several big clusters and many mini clusters. Let's try k = 19. Because the merges are nested, the k = 15 result can also be recovered from the k = 19 result by merging clusters. 

In [170]:
# Agglomerative clustering; k = 19
model_dog_19 = AgglomerativeClustering(n_clusters=19, affinity='cosine', linkage='average')
model_dog_19 = model_dog_19.fit(np_normalized_bow_dog)

In [171]:
# Silhouette plot
silhouette_plot(model_dog_19.n_clusters_, model_dog_19.labels_, np_normalized_bow_dog)


For n_clusters = 19 , The average silhouette_score is : 0.03920719437585718


In [348]:
# Scatter plot, colored by the cluster labels of model_dog_19
plt.figure(figsize=(9,7))

model = model_dog_19
tsne_features = tsne_features_dog

# for i in [2]:
for i in range(model.n_clusters_):
    plt.scatter(tsne_features[model.labels_ == i][:,0], 
                tsne_features[model.labels_ == i][:,1], 
                s=4, alpha=0.8, c=[cmap1(i/model.n_clusters_)])   
    plt.text(tsne_features[model.labels_ == i][:,0][0],
             tsne_features[model.labels_ == i][:,1][0],
             str(i), color="black", size=16
             )

plt.title('t-SNE plot of dog products colored by the labels')
plt.subplots_adjust(left=0.04, right=0.999, bottom=0.04, top=0.9)
plt.show()


Additionally, let's check the frequent words in each cluster.

In [174]:
# Apply the labels to the product table
# (.copy() avoids assigning a new column to a view of df_dog_id_name)
df_label_dog = df_dog_id_name[['product_id', 'product_title']].copy()
df_label_dog['label_19'] = model_dog_19.labels_

# Show the frequent words and the number of products in each cluster
frequent_words(df_label_dog, 'label_19', new_dog_token_list)

| cluster number | frequent words | number of products |
|:---:|:---|:---:|
| 0 | (food, treat, eat, bag, mix, chew, vet, sinc, lab, pup) | 3896 |
| 1 | (area, air, devic, track, home, cat, batteri, tagg, call, let) | 22 |
| 2 | (bed, pad, fit, crate, wash, cat, door, cover, room, comfort) | 1285 |
| 3 | (fit, mix, materi, strap, car, comfort, seat, stay, chihuahua, ador) | 1200 |
| 4 | (storm, mix, thunder, calm, firework, anxieti, effect, vet, thunderstorm, rescu) | 76 |
| 5 | (collar, leash, walk, fit, train, color, pull, comfort, har, neck) | 2065 |
| 6 | (smell, coat, skin, bath, shampoo, hair, dri, flea, vet, spray) | 1293 |
| 7 | (camera, video, gift, watch, app, monitor, night, fun, view, home) | 10 |
| 8 | (beauti, pass, gift, photo, belov, person, urn, collar, ash, display) | 27 |
| 9 | (kit, aid, powder, bleed, emerg, home, hand, satisfi, prepar, vet) | 6 |


Comparing the sample sizes, cluster_2 and cluster_3 form a single cluster at k = 15, and so do cluster_6 and cluster_18. I guess cluster_6 and cluster_18 can form one category such as 'body care, cleaning', but cluster_2 and cluster_3 each have their own features. Let's look at each cluster closely.  
Take a look at the products in each cluster and consider whether the separation is appropriate.
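Instead of comparing cluster sizes by eye, the relationship between a coarse and a fine cut of the same tree can be read from a cross-tabulation. This is a sketch on made-up data using SciPy's `linkage` and `fcluster`; the toy points and all names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (an assumption for illustration): four tight groups forming two pairs
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
toy_points = np.vstack([c + 0.1 * rng.standard_normal((5, 2)) for c in centers])

# One merge tree, cut at two different numbers of clusters
toy_tree = linkage(toy_points, method='average')
coarse_labels = fcluster(toy_tree, t=2, criterion='maxclust')
fine_labels = fcluster(toy_tree, t=4, criterion='maxclust')

# Rows: fine clusters, columns: coarse clusters, cells: number of shared samples
overlap = pd.crosstab(pd.Series(fine_labels, name='fine'),
                      pd.Series(coarse_labels, name='coarse'))
print(overlap)
```

Each fine cluster has a non-zero count in exactly one column, which shows directly which fine clusters are merged into which coarse cluster.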

In [306]:
# Check the products in each cluster
df_label_dog[df_label_dog['label_19'] == 18].sample(10)

Unnamed: 0,product_id,product_title,label_19
4360,361508452,Bio-Groom Ear Fresh Ear PowderBio-Groom Ear Fresh Ear PowderBio-Groom Ear Fresh Ear PowderBio-Gr...,18
5142,429139548,"Oster CryogenX Professional Animal Clipper Blade, Size # 4 Skip ToothOster CryogenX Professional...",18
328,24568185,Wahl Professional Animal 5in1 Stainless Steel Comb Set #3379Wahl Professional Animal 5in1 Stainl...,18
10267,862636232,"Alfie Pet by Petoga Couture - Pet Home Grooming Kit - Curved, Straight, Thinning Shears, Round-T...",18
1922,154548223,"Evercare Giant Lint Roller Refill, 60 Sheets RollEvercare Giant Lint Roller Refill, 60 Sheets Ro...",18
8395,704098156,"Pet Grooming Clippers for Dogs, Cats and Small Animals. Professional Nail Clipper with Safety Gu...",18
3487,288436438,Master Equipment Adjustable Height Grooming Table for PetsMaster Equipment Adjustable Height Gro...,18
3723,307190712,"Formal Pets Bowtie, Dog Cat Pets Adjustable Bow Tie and Collar DCL01Formal Pets Bowtie, Dog Cat ...",18
6276,525333310,Oster Cryogen-X Pet Clipper BladesOster Cryogen-X Pet Clipper BladesOster Cryogen-X Pet Clipper ...,18
3068,251815707,"Groomer's Stone Pets Grooming Tool, GreyGroomer's Stone Pets Grooming Tool, GreyGroomer's Stone ...",18


Based on the silhouette plot, the labeled t-SNE plot, the frequent words, and the product titles in each cluster, cluster_0, 2, 3, 5, 6, and 14 each have distinct features, and I judge each of them to be appropriate as a big_category. Cluster_18 is similar to cluster_6, so let's merge them. The other clusters have fewer products, so let's treat them as an 'other' group. However, some of them also have a distinct character, so I'll label them as well. 

In [659]:
# Label each category title (from the above two tables)
big_category_labels_dog = {0:"food, treat, treatment", 1:"other", 2:"bed, crate, gate", 3:"clothes", 4:"other", 
                           5:"collar, leash", 6:"body care, cleaning", 7:"other", 8:"other", 9:"other", 10:"other", 
                           11:"other", 12:"other", 13:"other", 14:"toy", 15:"other", 16:"other", 17:"other", 
                           18:"body care, cleaning"}

# Label each category title of the 'other' group (from the above table)
other_category_labels_dog = {1:"other", 4:"calming", 7:"monitoring", 8:"memorial", 9:"other", 10:"tie out", 
                             11:"other", 12:"other", 13:"other", 15:"aquarium", 16:"other", 17:"eye care"}

I noticed several things. Let's take care of them later.
- Cluster_15 is not for dogs, but fish.  
- Cluster_10 can be in cluster_2 ("bed, crate, gate").  
- Cluster_17 can be in cluster_6 ("body care, cleaning"). 
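One possible way to apply adjustments like these is to remap the labels with a dictionary. The sketch below is only an illustration on a made-up label column; the mapping and variable names are assumptions, not the handling actually used later in this project.

```python
import pandas as pd

# Toy label column (an assumption for illustration)
toy_labels = pd.Series([2, 10, 6, 17, 15, 0], name='label_19')

# Hypothetical mapping: move cluster 10 into cluster 2 and cluster 17 into cluster 6
merge_into = {10: 2, 17: 6}
merged_labels = toy_labels.replace(merge_into)

# Drop the products of cluster 15, which are not dog products
dog_only_labels = merged_labels[merged_labels != 15]

print(merged_labels.tolist())
print(dog_only_labels.tolist())
```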

Now, let's extract the 7 big clusters (cluster_0, 2, 3, 5, 6, 14, 18) for further clustering.

In [183]:
# Separate the df_label_dog into two tables; the big clusters, the 'other' group
big_clusters_dog = [0, 2, 3, 5, 6, 14, 18]

# Boolean mask: True for products whose label is one of the big clusters
is_big_dog = df_label_dog['label_19'].isin(big_clusters_dog)

# Filtering with the mask keeps the original index order
df_label_dog_big = df_label_dog[is_big_dog].copy()
df_label_dog_other = df_label_dog[~is_big_dog].copy()

id-title table for the big clusters:

In [182]:
print(len(df_label_dog_big), 'products')
print('df_label_dog_big (first 5 products):')
df_label_dog_big.head()

11540 products
df_label_dog_big (first 5 products):


Unnamed: 0,product_id,product_title,label_19
0,119780,"ARK Naturals PRODUCTS for PETS 326066 4-Ounce Breath-Less Chewable Brushless Toothpaste, MiniARK...",0
1,202371,"Stella & Chewy's Freeze Dried Dog Food for Adult Dogs, Chicken Patties, 15 Ounce Bag - 2 PackSte...",0
2,291967,Premium Deshedding Brush for Dogs and Cats with Medium to Long Hair | Veterinary Approved | Rugg...,18
3,490904,"Remington Coastal Pet R0206 GRN06 Rope Leash, 72-Inch, GreenRemington Coastal Pet R0206 GRN06 Ro...",5
4,798322,Pet Dog Puppy Nonslip Canvas Sport Shoes Sneaker Boots Rubber Sole Size 5 Blue by MallofusaPet D...,3


id-title table for the other clusters:

In [184]:
print(len(df_label_dog_other), 'products')
print('df_label_dog_other (first 5 products):')
df_label_dog_other.head()

376 products
df_label_dog_other (first 5 products):


Unnamed: 0,product_id,product_title,label_19
24,2443024,"Prestige Super-Beast Dog Tie-Out, 15-FeetPrestige Super-Beast Dog Tie-Out, 15-FeetPrestige Super...",10
53,4480424,Angels Eyes Product ScoopAngels Eyes Product ScoopAngels Eyes Product ScoopAngels Eyes Product S...,17
70,5872617,Ocu-GLO Vision Supplement for Dogs by Animal Necessity - Antioxidant Vision - Protect Against Di...,17
89,6913915,"PAWCY 6100 Doggy Lazy Raft, SmallPAWCY 6100 Doggy Lazy Raft, SmallPAWCY 6100 Doggy Lazy Raft, Sm...",16
132,10339196,Beef Trachea Made in USABeef Trachea Made in USABeef Trachea Made in USABeef Trachea Made in USA...,13


Also, let's separate the bow matrix and t-SNE features of the big clusters. 

In [185]:
# Get the normalized bow matrix of the big clusters
# (the DataFrame index matches the row positions of the matrix)
np_normalized_bow_dog_big = np_normalized_bow_dog[df_label_dog_big.index.to_numpy()]

print('Normalized BOW matrix of big_categories: np_normalized_bow_dog_big')
print('Matrix shape:', np_normalized_bow_dog_big.shape)

Normalized BOW matrix of big_categories: np_normalized_bow_dog_big
Matrix shape: (11540, 5000)


In [308]:
# Get the t-SNE features of the big clusters
# (the DataFrame index matches the row positions of the t-SNE features)
tsne_features_dog_big = tsne_features_dog[df_label_dog_big.index.to_numpy()]

print('T-SNE features of big_categories: tsne_features_dog_big')
print('Feature shape:', tsne_features_dog_big.shape)

T-SNE features of big_categories: tsne_features_dog_big
Feature shape: (11540, 2)


Let's move on to further clustering for the small_categories!

**4.1.3.2. Small categories**

What is the appropriate number of small_categories? If the number is large, we can find more specific features (products), but the categories could become too specific. Also, I'll name the clusters by hand after checking the product titles, so the number needs to stay manageable. Let's say fewer than 50. First, check the silhouette scores from k = 20 to 50 in steps of 5 to get a rough idea of the number of small_categories.

In [186]:
# Agglomerative clustering; k = 20 to 50 (in steps of 5)
rough_n_k_dog, rough_scores_dog, _  = aggclus_silscore_nsamples(20, 50, 5, np_normalized_bow_dog_big)

In [187]:
# Plot Number of clusters vs. Silhouette score
plt.figure(figsize=(9,4))

n_clusters = rough_n_k_dog
sil_scores = rough_scores_dog

plt.plot(n_clusters, sil_scores, '.-')
plt.xticks(n_clusters)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.title('Number of clusters vs. Silhouette score')

plt.subplots_adjust(left=0.085, right=0.999, bottom=0.15, top=0.9)
plt.show()


The peak is around k = 35, so the number of small_categories should be around there. Let's check k = 30 to 40 in steps of 1.

In [188]:
# Agglomerative clustering; k = 30 to 40 
small_n_k_dog, small_scores_dog, small_n_samples_dog  = aggclus_silscore_nsamples(30, 40, 1, np_normalized_bow_dog_big)

In [219]:
# Plot Number of clusters vs. Silhouette score
plt.figure(figsize=(9,4))

n_clusters = small_n_k_dog
sil_scores = small_scores_dog

plt.plot(n_clusters, sil_scores, '.-')
plt.xticks(n_clusters)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.title('Number of clusters vs. Silhouette score (K = 30 to 40)')

plt.subplots_adjust(left=0.085, right=0.999, bottom=0.15, top=0.9)
plt.show()


The silhouette score is highest at k = 34, so let's use k = 34 for the small_categories. Run the clustering with k = 34 and get the labels.
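The coarse-then-fine search for k can be sketched as below. Note that this sketch takes a different route from the helper functions in this notebook: it builds one merge tree with SciPy's `linkage` and cuts it at each k with `fcluster`, instead of refitting a model for every k. The toy data, the k ranges, and all names are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import normalize

# Toy data (an assumption for illustration): 6 groups of count-like vectors
rng = np.random.default_rng(0)
toy_centers = rng.integers(0, 5, size=(6, 20))
toy_counts = np.vstack([rng.poisson(3 * c, size=(15, 20)) + 1 for c in toy_centers])
toy_features = normalize(toy_counts.astype(float))

# Build the merge tree once (average linkage, cosine distance)
toy_tree = linkage(toy_features, method='average', metric='cosine')

def silhouette_for(ks):
    '''Cut the tree at each k and return {k: silhouette score}.'''
    scores = {}
    for k in ks:
        labels = fcluster(toy_tree, t=k, criterion='maxclust')
        scores[k] = silhouette_score(toy_features, labels, metric='cosine')
    return scores

# Coarse pass: every 4th k
coarse_scores = silhouette_for(range(2, 19, 4))
k_coarse = max(coarse_scores, key=coarse_scores.get)

# Fine pass: every k around the best coarse value
fine_scores = silhouette_for(range(max(2, k_coarse - 3), k_coarse + 4))
k_fine = max(fine_scores, key=fine_scores.get)

print('best coarse k:', k_coarse)
print('best fine k:', k_fine)
```

Because the tree is built only once, scanning many values of k is cheap, and the fine pass can only match or improve on the best coarse score since it includes the best coarse k.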

In [220]:
# Agglomerative clustering (n_clusters=34, small_categories)
n_clusters=34

model_dog_34 = AgglomerativeClustering(n_clusters=n_clusters, affinity='cosine', linkage='average')
model_dog_34 = model_dog_34.fit(np_normalized_bow_dog_big)

In [221]:
# Plot silhouette scores
silhouette_plot(model_dog_34.n_clusters_, model_dog_34.labels_, np_normalized_bow_dog_big)


For n_clusters = 34 , The average silhouette_score is : 0.07248798338303429


In [352]:
# Scatter plot, colored by the labels 
plt.figure(figsize=(9,7))

model = model_dog_34
tsne_features = tsne_features_dog_big

for i in range(model.n_clusters_):
# for i in [0,1,31,3,4,28,23,20]:  # Cluster_0
# for i in [25,22,15,14,16,5]:  # Cluster_2
# for i in [33,32,17,21,27,6]:  # Cluster_3
# for i in [2,30,19,8,12,26]:  # Cluster_5
# for i in [29,7,9,18,11,24,13]:  # Cluster_6
# for i in [10]:  # Cluster_14
    plt.scatter(tsne_features[model.labels_ == i][:,0], 
                tsne_features[model.labels_ == i][:,1], 
                s=4, alpha=0.8, c=[cmaps[i]])   
    plt.text(tsne_features[model.labels_ == i][:,0][0],
             tsne_features[model.labels_ == i][:,1][0],
             str(i), color="black", size=16)

plt.title('t-SNE plot of dog products colored by the labels')
plt.subplots_adjust(left=0.04, right=0.999, bottom=0.04, top=0.90)
plt.xlim(-70, 75)
plt.ylim(-85, 70)
plt.show()


In [None]:
# Add the labels on the product information
df_label_dog_big['label_34'] = model_dog_34.labels_

In [228]:
# Get the frequent words and count the number of products in each cluster
n_clusters = df_label_dog_big['label_34'].nunique()
df_34cluster_dog = pd.DataFrame(columns=['big_category_number', 'small_category_number','frequent words', 'number of products'])

for i in range(n_clusters):
    words = []
    indexes = df_label_dog_big[df_label_dog_big['label_34'] == i].index
    big_category_number = df_label_dog_big[df_label_dog_big['label_34'] == i].iat[0, 2]
    for index in indexes:
        for word in set(new_dog_token_list[index]):
            words.append(word)
    c = Counter(words)
    values, _ = zip(*c.most_common(10))

    df_temp = pd.DataFrame([[big_category_number, i, values, len(indexes)]], 
                                columns=df_34cluster_dog.columns)
    df_34cluster_dog = df_34cluster_dog.append(df_temp) 

df_34cluster_dog.set_index(['big_category_number']).sort_index()

Unnamed: 0_level_0,small_category_number,frequent words,number of products
big_category_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0,"(vet, pill, eat, lab, supplement, food, hip, treat, sinc, pain)",440
0,1,"(treat, eat, chew, bone, bag, food, smell, pup, mix, flavor)",1494
0,31,"(rat, eat, hamster, rodent, sinc, hand, mice, smell, box, gerbil)",10
0,3,"(food, eat, bowl, water, mix, feed, dri, cat, bag, sinc)",1431
0,4,"(bag, poop, carri, fit, pick, hand, plastic, comfort, carrier, walk)",436
0,28,"(breath, vet, smell, treat, water, brush, cat, improv, effect, tartar)",78
0,23,"(date, expir, smell, tast, chew, sign, vet, pay, msm, market)",6
0,20,"(max, tast, reduc, continu, compress, convinc, joint, bay, yrs, dosag)",1
2,25,"(cool, heat, ice, water, summer, pup, walk, stay, pack, fit)",26
2,22,"(gate, area, sturdi, room, door, wall, wood, jump, panel, home)",76
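
The numbers behind the "frequent words" column are document frequencies: because of the `set(...)` in the loop, a token is counted at most once per product. A minimal, dependency-free sketch of that counting follows. The helper name `top_tokens_by_cluster` and the toy tokens are made up for illustration and are not part of the notebook.

```python
from collections import Counter, defaultdict

def top_tokens_by_cluster(labels, token_lists, n=10):
    """Return {cluster: [(token, n_products), ...]} with the n most widespread tokens.

    A token is counted once per product, so the number is a document
    frequency inside the cluster, not a raw term count.
    """
    counters = defaultdict(Counter)
    for label, tokens in zip(labels, token_lists):
        # sorted(set(...)) deduplicates and keeps the tie order reproducible
        counters[label].update(sorted(set(tokens)))
    return {label: counter.most_common(n) for label, counter in counters.items()}

# Toy data (hypothetical, not from the review dataset)
labels = [0, 0, 1, 1, 1]
token_lists = [
    ["treat", "treat", "chew"],
    ["treat", "bone"],
    ["leash", "walk"],
    ["leash", "collar", "leash"],
    ["leash", "walk"],
]

top = top_tokens_by_cluster(labels, token_lists, n=2)
print(top[0][0])  # ('treat', 2)
print(top[1])     # [('leash', 3), ('walk', 2)]
```

Counting once per product keeps a single very long review from dominating the list of frequent words.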


Let's take a look at the product titles in each cluster, and check whether this separation is appropriate.

In [292]:
# Check the products in each cluster
df_label_dog_big[df_label_dog_big['label_34'] == 13].sample(10)

Unnamed: 0,product_id,product_title,label_19,label_34
7457,625911903,For Your Dog 078279-102 2-in-1 Combo BrushFor Your Dog 078279-102 2-in-1 Combo BrushFor Your Dog...,18,13
3186,261892382,Chris Christensen A5VIII Mark VIII Round SlickerChris Christensen A5VIII Mark VIII Round Slicker...,18,13
8448,708674931,FURminator Nail Grinder (Nail Grinder w/ 6 Extra FURminator Replacement Bands)FURminator Nail Gr...,18,13
357,27433943,Furbuster 3-in-1 Grooming GloveFurbuster 3-in-1 Grooming GloveFurbuster 3-in-1 Grooming GloveFur...,18,13
36,3263923,Master Equipment Steel Versa Competition Pet Grooming TableMaster Equipment Steel Versa Competit...,18,13
2166,175051802,PET CRAZY - Pet Deshedder - Dematting Comb - 2 in 1 use as comb or brush - SALE PRICE - BEST Dog...,18,13
11387,956569487,Alfie Pet by Petoga Couture - Pet Grooming Trimmer Razor Comb - Color: BlackAlfie Pet by Petoga ...,18,13
2508,202758888,Andis Folding Blade CaseAndis Folding Blade CaseAndis Folding Blade CaseAndis Folding Blade Case...,18,13
11190,939813492,Quick Finder Nail ClipperQuick Finder Nail ClipperQuick Finder Nail ClipperQuick Finder Nail Cli...,18,13
5486,456662669,Wahl 9961-1291 Super Pocket Pro Trimmer by Wahl Professional AnimalWahl 9961-1291 Super Pocket P...,18,13


Most clusters have a distinct theme, so the clustering has worked well. Let's give each cluster a name.

In [333]:
# Label each category title (from the above two tables)
small_category_labels_dog = {0:"treatment, supplement", 1:"treat", 31:"other", 3:"food, bowl", 4:"waste bag, carrier",
                             28:"oral care", 23:"treatment, supplement", 20:"treatment, supplement", 
                             25:"other", 22:"door, gate, crate", 15:"bed", 14:"step", 16:"training pad", 
                             5:"door, gate, crate",
                             33:"shoes", 32:"other", 17:"car seat, cover", 21:"stroller", 27:"door, gate, crate", 
                             6:"costume", 
                             2:"ID tag", 30:"leash, harness", 19:"collar", 8:"leash, harness", 12:"collar", 26:"training",
                             29:"flea", 7:"odor, stain, shampoo", 9:"dryer, towel", 18:"ear cleaner",
                             11:"other", 24:"other", 
                             10:"toy", 
                             13:"brush, clipper"}

I noticed several mis-assigned clusters. Let's take care of them later.  
- Small_cluster_4 is in big_category_0, but should have been big_category_1 ("other").  
- Small_cluster_16 is in big_category_2, but should have been big_category_1 ("other").  
- Small_cluster_17 and 21 are in big_category_3, but should have been big_category_1 ("other").  
- Small_cluster_27 is in big_category_3, but should have been big_category_2.  
- Small_cluster_26 is in big_category_5, but should have been big_category_1 ("other").  
- Small_cluster_11 and 24 are in big_category_6, but should have been big_category_1 ("other").  
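
The corrections listed above can be written down as a small lookup table from small-cluster number to corrected big-category number. The sketch below is illustrative only: `BIG_CATEGORY_OVERRIDES` and `corrected_big_category` are hypothetical names, and the sample rows are made up.

```python
# Small-cluster number -> corrected big-category number (taken from the list above)
BIG_CATEGORY_OVERRIDES = {
    4: 1, 16: 1, 17: 1, 21: 1, 26: 1, 11: 1, 24: 1,  # move to big_category_1 ("other")
    27: 2,                                            # move to big_category_2
}

def corrected_big_category(big_label, small_label):
    """Return the overridden big category if one is listed, else the original."""
    return BIG_CATEGORY_OVERRIDES.get(small_label, big_label)

# Hypothetical (big_label, small_label) pairs for illustration
rows = [(0, 4), (3, 27), (18, 13)]
fixed = [corrected_big_category(big, small) for big, small in rows]
print(fixed)  # [1, 2, 18]
```

Keeping the corrections in one table makes them easy to audit against the list.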

**4.1.3.3. Make the summary tables**

Now, each dog product has a big_category label and a small_category label. In this section, let's make a summary table in which each row represents a product with its id, title, and category labels.  

In [660]:
# Make a new dataframe from df_label_dog_big for the summary
# (.copy() so that adding a column below does not raise SettingWithCopyWarning)
df_label_dog_summary = df_label_dog_big[['product_id', 'product_title', 'label_19', 'label_34']].copy()

# Add 'animal' column
df_label_dog_summary['animal'] = 'dog'

print('df_label_dog_summary (first 5 rows):')
df_label_dog_summary.head()

df_label_dog_summary (first 5 rows):


Unnamed: 0,product_id,product_title,label_19,label_34,animal
0,119780,"ARK Naturals PRODUCTS for PETS 326066 4-Ounce Breath-Less Chewable Brushless Toothpaste, MiniARK...",0,1,dog
1,202371,"Stella & Chewy's Freeze Dried Dog Food for Adult Dogs, Chicken Patties, 15 Ounce Bag - 2 PackSte...",0,3,dog
2,291967,Premium Deshedding Brush for Dogs and Cats with Medium to Long Hair | Veterinary Approved | Rugg...,18,13,dog
3,490904,"Remington Coastal Pet R0206 GRN06 Rope Leash, 72-Inch, GreenRemington Coastal Pet R0206 GRN06 Ro...",5,8,dog
4,798322,Pet Dog Puppy Nonslip Canvas Sport Shoes Sneaker Boots Rubber Sole Size 5 Blue by MallofusaPet D...,3,33,dog


Here, let's fix the mis-categorized products, and then replace the category numbers with the names.  

In [661]:
# Take care of mis-categorized products
# Small clusters that belong to big_category_1 ("other")
to_other = df_label_dog_summary['label_34'].isin([4, 16, 17, 21, 26, 11, 24])
df_label_dog_summary.loc[to_other, 'label_19'] = 1

# Small cluster 27 belongs to big_category_2
df_label_dog_summary.loc[df_label_dog_summary['label_34'] == 27, 'label_19'] = 2
        
        
# Replace the category numbers into the names
df_label_dog_summary = df_label_dog_summary.replace({'label_19': big_category_labels_dog,
                                             'label_34': small_category_labels_dog})
# Rename the columns
df_label_dog_summary = df_label_dog_summary.rename(columns={'label_19': 'big_category', 'label_34': 'small_category'})

print('df_label_dog_summary (first 5 rows):')
df_label_dog_summary.head()

df_label_dog_summary (first 5 rows):


Unnamed: 0,product_id,product_title,big_category,small_category,animal
0,119780,"ARK Naturals PRODUCTS for PETS 326066 4-Ounce Breath-Less Chewable Brushless Toothpaste, MiniARK...","food, treat, treatment",treat,dog
1,202371,"Stella & Chewy's Freeze Dried Dog Food for Adult Dogs, Chicken Patties, 15 Ounce Bag - 2 PackSte...","food, treat, treatment","food, bowl",dog
2,291967,Premium Deshedding Brush for Dogs and Cats with Medium to Long Hair | Veterinary Approved | Rugg...,"body care, cleaning","brush, clipper",dog
3,490904,"Remington Coastal Pet R0206 GRN06 Rope Leash, 72-Inch, GreenRemington Coastal Pet R0206 GRN06 Ro...","collar, leash","leash, harness",dog
4,798322,Pet Dog Puppy Nonslip Canvas Sport Shoes Sneaker Boots Rubber Sole Size 5 Blue by MallofusaPet D...,clothes,shoes,dog


I would like to merge the 'other' group into df_label_dog_summary. First, let's give df_label_dog_other the same columns as df_label_dog_summary, and fix the mis-categorized products. As a reminder:

- Cluster_15 is not for dogs, but fish.  
- Cluster_10 can be in cluster_2 ("bed, crate, gate").  
- Cluster_17 can be in cluster_6 ("body care, cleaning").  

In [662]:
# Make a new dataframe from df_label_dog_other for the summary
# (.copy() so that adding columns below does not raise SettingWithCopyWarning)
df_other_dog = df_label_dog_other[['product_id', 'product_title', 'label_19']].copy()

# Add 'big_category' column
df_other_dog['big_category'] = 'other'

# Add 'animal' column
df_other_dog['animal'] = 'dog'

# Take care of mis-categorized products 
for i in range(len(df_other_dog)):
    # Column_2 is 'label_19'
    if df_other_dog.iloc[i, 2] == 15:
        # Column_3 is 'big_category'
        df_other_dog.iloc[i, 3] = 'fish, reptile'
        # Column_4 is 'animal'
        df_other_dog.iloc[i, 4] = 'other'
    
    elif df_other_dog.iloc[i, 2] == 10:
        df_other_dog.iloc[i, 3] = 'bed, crate, gate'
    
    elif df_other_dog.iloc[i, 2] == 17:
        df_other_dog.iloc[i, 3] = 'body care, cleaning'

# Replace the category numbers into the names
df_other_dog = df_other_dog.replace({'label_19': other_category_labels_dog})

# Rename the column
df_other_dog = df_other_dog.rename(columns={'label_19': 'small_category'})

print('df_other_dog (first 5 rows):')
df_other_dog.head()

df_other_dog (first 5 rows):


Unnamed: 0,product_id,product_title,small_category,big_category,animal
24,2443024,"Prestige Super-Beast Dog Tie-Out, 15-FeetPrestige Super-Beast Dog Tie-Out, 15-FeetPrestige Super...",tie out,"bed, crate, gate",dog
53,4480424,Angels Eyes Product ScoopAngels Eyes Product ScoopAngels Eyes Product ScoopAngels Eyes Product S...,eye care,"body care, cleaning",dog
70,5872617,Ocu-GLO Vision Supplement for Dogs by Animal Necessity - Antioxidant Vision - Protect Against Di...,eye care,"body care, cleaning",dog
89,6913915,"PAWCY 6100 Doggy Lazy Raft, SmallPAWCY 6100 Doggy Lazy Raft, SmallPAWCY 6100 Doggy Lazy Raft, Sm...",other,other,dog
132,10339196,Beef Trachea Made in USABeef Trachea Made in USABeef Trachea Made in USABeef Trachea Made in USA...,other,other,dog


Merge the two tables into one.

In [663]:
# Merge df_label_dog_summary and df_other_dog
df_label_dog_summary = df_label_dog_summary.append(df_other_dog)

print('df_label_dog_summary (first 5 rows):')
df_label_dog_summary.head()

df_label_dog_summary (first 5 rows):


Unnamed: 0,product_id,product_title,big_category,small_category,animal
0,119780,"ARK Naturals PRODUCTS for PETS 326066 4-Ounce Breath-Less Chewable Brushless Toothpaste, MiniARK...","food, treat, treatment",treat,dog
1,202371,"Stella & Chewy's Freeze Dried Dog Food for Adult Dogs, Chicken Patties, 15 Ounce Bag - 2 PackSte...","food, treat, treatment","food, bowl",dog
2,291967,Premium Deshedding Brush for Dogs and Cats with Medium to Long Hair | Veterinary Approved | Rugg...,"body care, cleaning","brush, clipper",dog
3,490904,"Remington Coastal Pet R0206 GRN06 Rope Leash, 72-Inch, GreenRemington Coastal Pet R0206 GRN06 Ro...","collar, leash","leash, harness",dog
4,798322,Pet Dog Puppy Nonslip Canvas Sport Shoes Sneaker Boots Rubber Sole Size 5 Blue by MallofusaPet D...,clothes,shoes,dog


This df_label_dog_summary shows the category label of each product for dogs. That's all for dog products!

### 4.2. Cat category

Let's cluster the cat category in the same way as the dog category.

#### 4.2.1. Load data

Token list:

In [339]:
# Load the token list
cat_token_list = []
with open("cat_token_list.csv", "r", encoding="UTF-8") as f:
    reader = csv.reader(f) 
    for r in reader: 
        cat_token_list.append(r)

print('cat_token_list (first 5 tokens of the first product):', cat_token_list[0][:5])
print('Number of products:', len(cat_token_list))
print('Number of unique tokens:', len(set(token for review in cat_token_list for token in review)))

cat_token_list (first 5 tokens of the first product): ['templat', 'wont', 'door', 'pick', 'litter']
Number of products: 4099
Number of unique tokens: 6001


Remove the cat-related tokens.

In [340]:
# Remove cat tokens 
cat_tokens = ['cat', 'kitti', 'kitten']
new_cat_token_list = []

for product in cat_token_list:
    nested_list = []
    for token in product:
        if token not in cat_tokens:
            nested_list.append(token)
    new_cat_token_list.append(nested_list)

print('new_cat_token_list (first 5 tokens of the first product):', new_cat_token_list[0][:5])
print('Number of products:', len(new_cat_token_list))
print('Number of unique tokens:', len(set(token for review in new_cat_token_list for token in review)))

new_cat_token_list (first 5 tokens of the first product): ['templat', 'wont', 'door', 'pick', 'litter']
Number of products: 4099
Number of unique tokens: 5998
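
The same filtering can also be written with a set and a list comprehension. This is a sketch on made-up tokens; `STOP_TOKENS` and `drop_tokens` are illustrative names, not part of the notebook.

```python
STOP_TOKENS = {"cat", "kitti", "kitten"}  # a set gives constant-time membership tests

def drop_tokens(token_lists, stop_tokens):
    """Remove every token in stop_tokens from each product's token list."""
    return [[t for t in tokens if t not in stop_tokens] for tokens in token_lists]

# Toy input (hypothetical)
products = [["cat", "door", "litter"], ["kitten", "toy", "cat", "toy"]]
cleaned = drop_tokens(products, STOP_TOKENS)
print(cleaned)  # [['door', 'litter'], ['toy', 'toy']]

# Unique-token count before and after, computed as in the cell above
unique_before = len({t for tokens in products for t in tokens})
unique_after = len({t for tokens in cleaned for t in tokens})
print(unique_before, unique_after)  # 5 3
```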


ID-title table:

In [341]:
# Load the product table
df_cat_id_name = pd.read_csv("df_cat_id_name.csv")

print('df_cat_id_name (first 5 products):')
df_cat_id_name.head()

df_cat_id_name (first 5 products):


Unnamed: 0,product_id,product_title
0,70064,Perfect Pet Soft Flap Cat Door with Telescoping FramePerfect Pet Soft Flap Cat Door with Telesco...
1,593896,Pet Food Can Covers Lids Set of 3Pet Food Can Covers Lids Set of 3Pet Food Can Covers Lids Set o...
2,919291,Basis Pet Made in the USA Low Profile Stainless Steel Cat DishBasis Pet Made in the USA Low Prof...
3,944764,Alfie Pet by Petoga Couture - Vea 2.0 Slow-Eating Anti-Gulping Pet Food Bowl (for Dogs & Cats)Al...
4,1124833,Petmate Hooded Cat Litter PanPetmate Hooded Cat Litter PanPetmate Hooded Cat Litter PanPetmate H...


#### 4.2.2. Vectorization

new_cat_token_list has 5,998 unique tokens. In addition, I use bigrams, as with the dog dataset. To control the vocabulary size, terms that appear in fewer than 6 products (`min_df=6`) are dropped.

In [343]:
# Generate matrix of word vectors

# Create CountVectorizer object 
cvectorizer_cat = CountVectorizer(tokenizer=dummy_tokened,lowercase=False, min_df=6, ngram_range=(1, 2))

bow_cat = cvectorizer_cat.fit_transform(new_cat_token_list)

# Get the feature names
feature_names_cat = cvectorizer_cat.get_feature_names()

# Show the shape of bow_matrix
print('BOW matrix: bow_cat')
print('Matrix shape:', bow_cat.shape)

BOW matrix: bow_cat
Matrix shape: (4099, 4665)
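
To make the effect of `ngram_range=(1, 2)` and `min_df` concrete, here is a dependency-free sketch that mimics the counting on toy data. It is not scikit-learn's implementation, and `ngrams_1_2` and `vocabulary_with_min_df` are hypothetical helpers. The point to note is that `min_df` is a document frequency: a term is kept only if it occurs in at least that many documents, no matter how often it repeats within one document.

```python
from collections import Counter

def ngrams_1_2(tokens):
    """Return the unigrams plus the space-joined bigrams of one token list."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return list(tokens) + bigrams

def vocabulary_with_min_df(token_lists, min_df):
    """Keep the terms that occur in at least min_df documents."""
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(ngrams_1_2(tokens)))  # count each term once per document
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

# Toy documents (hypothetical)
docs = [
    ["litter", "box", "smell"],
    ["litter", "box", "scoop"],
    ["smell", "smell", "smell"],
]
print(vocabulary_with_min_df(docs, min_df=2))
# ['box', 'litter', 'litter box', 'smell']
```

In the toy output, "smell smell" is dropped even though it occurs twice, because both occurrences are in the same document.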


In [344]:
# Normalize the bow matrix
normalized_bow_cat = normalize(bow_cat)

# Convert the normalized bow matrix into a dense np.array for AgglomerativeClustering()
np_normalized_bow_cat = normalized_bow_cat.toarray()

print('Normalized BOW matrix: np_normalized_bow_cat')
print('Matrix shape:', np_normalized_bow_cat.shape)

Normalized BOW matrix: np_normalized_bow_cat
Matrix shape: (4099, 4665)
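
`normalize` uses the L2 norm by default, so each product's row is rescaled to unit length. For unit-length rows, the cosine distance used later in the clustering reduces to one minus the dot product. The following dependency-free sketch illustrates this on made-up count vectors; the helper names are hypothetical.

```python
import math

def l2_normalize(row):
    """Scale a vector to unit Euclidean length (all-zero rows are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm > 0 else list(row)

def cosine_distance_unit(u, v):
    """Cosine distance between two unit-length vectors: 1 - dot product."""
    return 1.0 - sum(a * b for a, b in zip(u, v))

# Toy bag-of-words counts (hypothetical)
a = l2_normalize([2, 0, 1])
b = l2_normalize([4, 0, 2])   # same direction as a, twice the counts
c = l2_normalize([0, 3, 0])   # shares no token with a

print(abs(cosine_distance_unit(a, b)) < 1e-12)  # True: review length does not matter
print(cosine_distance_unit(a, c))               # 1.0
```

This is why products with many reviews and products with few reviews can still end up close to each other, as long as they use the same words in similar proportions.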


In [353]:
# Create a t-SNE instance
tsne_cat = TSNE(perplexity=50, learning_rate=800, random_state=10)

# Apply fit_transform to samples
tsne_features_cat = tsne_cat.fit_transform(normalized_bow_cat)

# Scatter plot of t-SNE
plt.figure(figsize=(9,7))
tsne_features = tsne_features_cat

plt.scatter(tsne_features[:,0], tsne_features[:,1], s=4, alpha=0.8, c='royalblue')   

plt.title('t-SNE plot of cat products')
plt.subplots_adjust(left=0.04, right=0.999, bottom=0.04, top=0.9)
plt.show()


There seem to be fewer than 8 big clusters, plus some small clusters. Now, let's start hierarchical clustering!

#### 4.2.3. Hierarchical clustering

First, I would like to divide the data into fewer than 8 big_categories. Second, each of the big_categories will be divided into subcategories.

**4.2.3.1. Big categories**

In [354]:
# Agglomerative clustering, setting distance_threshold=0 to compute the full tree
model_cat = AgglomerativeClustering(distance_threshold=0, n_clusters=None, affinity='cosine', linkage='average')
model_cat = model_cat.fit(np_normalized_bow_cat)

In [355]:
# Draw the dendrogram
plt.figure(figsize=(15,5))
plt.title('Hierarchical Clustering Dendrogram')
# plot only the top levels of the dendrogram (p=7)
plot_dendrogram(model_cat, truncate_mode='level', p=7)

plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.subplots_adjust(left=0.035, right=0.999, bottom=0.15, top=0.95)
plt.ylabel("Distance threshold")
plt.yticks([i/10 for i in range(0,11,1)])
plt.show()


According to the dendrogram above, some tiny clusters only join the big clusters near the top of the dendrogram. Let's collect them as the "other" group.  
I would like to divide the data into fewer than 8 big_categories. Because the tiny clusters also take up cluster labels, k has to be larger than that, so let's check the silhouette scores from k = 5 to 15.

In [356]:
# Agglomerative clustering from k = 5 to 15
n_k_cat, scores_cat, n_samples_cat  = aggclus_silscore_nsamples(5, 15, 1, np_normalized_bow_cat)

In [357]:
# Plot Number of clusters vs. Silhouette score
plt.figure(figsize=(9,4))

n_clusters = n_k_cat
sil_scores = scores_cat

plt.plot(n_clusters, sil_scores, '.-')
plt.xticks(n_clusters)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.title('Number of clusters vs. Silhouette score')

plt.subplots_adjust(left=0.085, right=0.999, bottom=0.15, top=0.9)
plt.show()
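
For readers unfamiliar with the metric plotted here: the silhouette value of a sample compares its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), as (b - a) / max(a, b), and the score is the mean over all samples. The sketch below computes it from scratch on toy 2-D points. It uses Euclidean distance for readability, whereas the notebook's helper is defined elsewhere and may differ, for example in the distance metric. Singleton clusters get a value of 0, following the usual convention, and at least two clusters are assumed.

```python
import math
from collections import defaultdict

def silhouette_score(points, labels):
    """Mean silhouette value over all points (Euclidean distance, >= 2 clusters)."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:                      # singleton cluster: value defined as 0
            values.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        values.append((b - a) / max(a, b))
    return sum(values) / len(values)

# Two well-separated toy clusters (hypothetical data)
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])   # labels follow the geometry
bad = silhouette_score(points, [0, 1, 0, 1])    # labels cut across the geometry
print(good > bad)  # True
```

Scores close to 1 indicate compact, well-separated clusters; scores near 0, as in this notebook, indicate heavily overlapping clusters, which is common for sparse bag-of-words data.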


When k = 10 and 12, the scores are relatively high. Let's check the number of big clusters in each case.

In [358]:
# Count the number of samples in each cluster
for k in [10, 12]:
    n_cluster = k - 5
    count = 0
    for i in range(len(n_samples_cat[n_cluster])):
        if n_samples_cat[n_cluster][i] > 100:
            count += 1
    print('k =', k)
    print(n_samples_cat[n_cluster])
    print('Number of clusters more than 100 samples:', count)
    print('')

k = 10
{0: 961, 1: 881, 2: 185, 3: 1304, 4: 744, 5: 1, 6: 4, 7: 1, 8: 17, 9: 1}
Number of clusters more than 100 samples: 5

k = 12
{0: 518, 1: 956, 2: 185, 3: 1304, 4: 744, 5: 363, 6: 4, 7: 1, 8: 17, 9: 1, 10: 5, 11: 1}
Number of clusters more than 100 samples: 6



At k = 10, cluster_0 and cluster_5 of the k = 12 solution (518 + 363 = 881 products) form a single cluster, cluster_1. Let's try k = 12, and see whether the two clusters should be separated.
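
The relationship between the two solutions can be checked directly from the cluster sizes printed above. The sketch below copies those two dictionaries and is only a bookkeeping check; `count_big` is a hypothetical helper.

```python
# Cluster sizes copied from the output above
sizes_k10 = {0: 961, 1: 881, 2: 185, 3: 1304, 4: 744, 5: 1, 6: 4, 7: 1, 8: 17, 9: 1}
sizes_k12 = {0: 518, 1: 956, 2: 185, 3: 1304, 4: 744, 5: 363,
             6: 4, 7: 1, 8: 17, 9: 1, 10: 5, 11: 1}

def count_big(sizes, threshold=100):
    """Number of clusters with more than `threshold` products."""
    return sum(1 for n in sizes.values() if n > threshold)

print(count_big(sizes_k10), count_big(sizes_k12))  # 5 6

# Both solutions cover all 4,099 cat products
print(sum(sizes_k10.values()), sum(sizes_k12.values()))  # 4099 4099

# Clusters 0 and 5 at k = 12 add up to cluster 1 at k = 10
print(sizes_k12[0] + sizes_k12[5] == sizes_k10[1])  # True
```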

In [359]:
# Agglomerative clustering; k = 12 
model_cat_12 = AgglomerativeClustering(n_clusters=12, affinity='cosine', linkage='average')
model_cat_12 = model_cat_12.fit(np_normalized_bow_cat)

In [360]:
# Silhouette plot
silhouette(model_cat_12.n_clusters_, model_cat_12.labels_, np_normalized_bow_cat)


For n_clusters = 12 , The average silhouette_score is : 0.06875282567482995


In [362]:
# Scatter plot, colored by the labels of model_cat_12
plt.figure(figsize=(9,7))

model = model_cat_12
tsne_features = tsne_features_cat

for i in range(model.n_clusters_):
    plt.scatter(tsne_features[model.labels_ == i][:,0], 
                tsne_features[model.labels_ == i][:,1], 
                s=4, alpha=0.8, c=[cmap1(i/model.n_clusters_)])   
    plt.text(tsne_features[model.labels_ == i][:,0][0],
             tsne_features[model.labels_ == i][:,1][0],
             str(i), color="black", size=16
             )
        
plt.title('t-SNE plot of cat products colored by the labels')
plt.subplots_adjust(left=0.04, right=0.999, bottom=0.04, top=0.9)
plt.show()


In [364]:
# Apply the labels to the product table
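# (.copy() so that adding a column below does not raise SettingWithCopyWarning)
df_label_cat = df_cat_id_name[['product_id', 'product_title']].copy()
df_label_cat['label_12'] = model_cat_12.labels_

# Show the frequent words and the number of products in each cluster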
frequent_words(df_label_cat, 'label_12', new_cat_token_list)

cluster number,frequent words,number of products
0,"(dog, fit, door, side, room, comfort, insid, bed, carrier, vet)",518
1,"(play, toy, scratch, box, sturdi, post, fun, catnip, room, jump)",956
2,"(collar, color, fit, bell, neck, dog, design, stay, comfort, tag)",185
3,"(food, eat, dri, dog, feed, treat, water, flavor, vet, bowl)",1304
4,"(litter, box, smell, odor, scoop, clump, bag, floor, spray, room)",744
5,"(dog, hair, flea, brush, fur, groom, remov, comb, effect, coat)",363
6,"(chair, sound, tool, dog, train, furnitur, spray, color, result, custom)",4
7,"(jus, walmart, ask, sell, bring, pleas, wife, manufactur, tonight, shelf)",1
8,"(tank, aquarium, gallon, water, color, plant, quit, glass, home, con)",17
9,"(sugar, energi, filler, drop, health, past, betterthi, exot, pain, ingredi)",1


Judging from their frequent tokens, cluster_0 and cluster_5 have different themes from each other. The other 4 big clusters also have their own distinct themes. I think k = 12 is appropriate. Let's look at each cluster more closely to make sure.

In [370]:
# Check the products in each cluster
df_label_cat[df_label_cat['label_12'] == 8]#.sample(10)

Unnamed: 0,product_id,product_title,label_12
62,12304637,"Wall Mounted Fish Bowl Bubble for Goldfish & Beta or Hanging Terrarium with Exclusive ""Lets Get ...",8
73,14334855,"Cholla Wood, 3 Nice Pieces of Aquarium Driftwood Decoration by Aquatic ArtsCholla Wood, 3 Nice P...",8
161,35498715,Lightahead Artificial Mini Aquarium Fish Tank Multi Color LED Swimming Fish Tank with BubblesLig...,8
624,157859212,Tantora Nano Catappa Leaves - 15 LeavesTantora Nano Catappa Leaves - 15 LeavesTantora Nano Catap...,8
1420,342161363,Aqueon Beauty Max Modular LED Aquarium LampAqueon Beauty Max Modular LED Aquarium LampAqueon Bea...,8
2137,520946336,"Carib Sea ACS05832 Super Natural Peace River Sand for Aquarium, 5-PoundCarib Sea ACS05832 Super ...",8
2172,529164600,Marineland Black Diamond Premium Activated CarbonMarineland Black Diamond Premium Activated Carb...,8
2226,541618331,"Tetra 29234 Half Moon Aquarium Kit, 10-GallonTetra 29234 Half Moon Aquarium Kit, 10-GallonTetra ...",8
2517,624164642,Zilla Fresh Air Screen Cover with HingeZilla Fresh Air Screen Cover with HingeZilla Fresh Air Sc...,8
2529,625982738,Marina 2-in-1 Fish HatcheryMarina 2-in-1 Fish HatcheryMarina 2-in-1 Fish HatcheryMarina 2-in-1 F...,8


Based on the silhouette plot, the labeled t-SNE plot, the frequent words, and a look through the product titles in each cluster, cluster_0 to 5 each have a distinct theme, and I judge them to be appropriate as big_categories. The other clusters contain only a few products, so let's treat them as the 'other' group. However, some of them also have a distinct character, so I'll label them as well.

In [653]:
# Label each category title (from the above two tables)
big_category_labels_cat = {0:"door, cage, carrier, bed", 1:"toy, scratcher, cat tree", 2:"collar, leash",
                           3:"food, treatment", 4:"litter, odor, stain", 5:"grooming", 6:"other", 7:"other", 8:"other",
                           9:"other", 10:"other", 11:"other"}

# Label each category title of the 'other' group (from the above table)
other_category_labels_cat = {6:'other', 7:'other', 8:'aquarium', 9:'other', 10:'toy', 11:'other'}

I noticed several things.  Let's take care of these later.  
- Cluster_8 is not for cats, but fish.  
- Cluster_10 is a kind of toy (catnip bubbles).

Now, let's extract the 6 big clusters (cluster_0 to 5) for further clustering.

In [379]:
# Split the product information into the big clusters and the 'other' group
big_clusters_cat = [0, 1, 2, 3, 4, 5]
is_big_cat = df_label_cat['label_12'].isin(big_clusters_cat)

# Boolean masks keep the original row order, so no re-sorting is needed
df_label_cat_big = df_label_cat[is_big_cat].copy()
df_label_cat_other = df_label_cat[~is_big_cat].copy()

id-title table for the big clusters:

In [380]:
print(len(df_label_cat_big), 'products')
print('df_label_cat_big (first 5 products):')
df_label_cat_big.head()

4070 products
df_label_cat_big (first 5 products):


Unnamed: 0,product_id,product_title,label_12
0,70064,Perfect Pet Soft Flap Cat Door with Telescoping FramePerfect Pet Soft Flap Cat Door with Telesco...,0
1,593896,Pet Food Can Covers Lids Set of 3Pet Food Can Covers Lids Set of 3Pet Food Can Covers Lids Set o...,3
2,919291,Basis Pet Made in the USA Low Profile Stainless Steel Cat DishBasis Pet Made in the USA Low Prof...,3
3,944764,Alfie Pet by Petoga Couture - Vea 2.0 Slow-Eating Anti-Gulping Pet Food Bowl (for Dogs & Cats)Al...,3
4,1124833,Petmate Hooded Cat Litter PanPetmate Hooded Cat Litter PanPetmate Hooded Cat Litter PanPetmate H...,4


id-title table for the other clusters:

In [381]:
print(len(df_label_cat_other), 'products')
print('df_label_cat_other (first 5 products):')
df_label_cat_other.head()

29 products
df_label_cat_other (first 5 products):


Unnamed: 0,product_id,product_title,label_12
62,12304637,"Wall Mounted Fish Bowl Bubble for Goldfish & Beta or Hanging Terrarium with Exclusive ""Lets Get ...",8
73,14334855,"Cholla Wood, 3 Nice Pieces of Aquarium Driftwood Decoration by Aquatic ArtsCholla Wood, 3 Nice P...",8
161,35498715,Lightahead Artificial Mini Aquarium Fish Tank Multi Color LED Swimming Fish Tank with BubblesLig...,8
224,53604498,"OurPets Bouncy North-American Catnip Bubbles, 8-OunceOurPets Bouncy North-American Catnip Bubble...",10
468,121923547,"Kookamunga Krazee Kitty Catnip Bubbles, 5 ozKookamunga Krazee Kitty Catnip Bubbles, 5 ozKookamun...",10


Also, let's separate the bow matrix and t-SNE features of the big clusters.

In [382]:
# Get the normalized bow matrix of the big clusters
# (the index of df_label_cat_big holds the row positions of the big-cluster products)
np_normalized_bow_cat_big = np_normalized_bow_cat[df_label_cat_big.index.to_numpy()]

print('Normalized BOW matrix of big_categories: np_normalized_bow_cat_big')
print('Matrix shape:', np_normalized_bow_cat_big.shape)

Normalized BOW matrix of big_categories: np_normalized_bow_cat_big
Matrix shape: (4070, 4665)


In [383]:
# Get the t-SNE features of the big clusters
tsne_features_cat_big = tsne_features_cat[df_label_cat_big.index.to_numpy()]

print('T-SNE features of big_categories: tsne_features_cat_big')
print('Feature shape:', tsne_features_cat_big.shape)

T-SNE features of big_categories: tsne_features_cat_big
Feature shape: (4070, 2)


**4.2.3.2. Small categories**

Roughly scan the silhouette scores from k = 15 to 50 (in steps of 5) to get an idea of how many small_categories to use.

In [384]:
# Agglomerative clustering; k = 15 to 50 
rough_n_k_cat, rough_scores_cat, _ = aggclus_silhouette_score_nsamples(15, 50, 5, np_normalized_bow_cat_big)

In [385]:
# Plot Number of clusters vs. Silhouette score
plt.figure(figsize=(9,4))

n_clusters = rough_n_k_cat
sil_scores = rough_scores_cat

plt.plot(n_clusters, sil_scores, '.-')
plt.xticks(n_clusters)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.title('Number of clusters vs. Silhouette score')

plt.subplots_adjust(left=0.085, right=0.999, bottom=0.15, top=0.9)
plt.show()


There is a peak around k = 35. Let's check the scores more finely, from k = 30 to 40, to choose k for the small_categories. 

In [386]:
# Decide the number of small_categories
small_n_k_cat, small_scores_cat, small_n_samples_cat  = aggclus_silscore_nsamples(30, 40, 1, np_normalized_bow_cat_big)

In [387]:
# Plot Number of clusters vs. Silhouette score
plt.figure(figsize=(9,4))

n_clusters = small_n_k_cat
sil_scores = small_scores_cat

plt.plot(n_clusters, sil_scores, '.-')
plt.xticks(n_clusters)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.title('Number of clusters vs. Silhouette score (K = 30 to 40)')

plt.subplots_adjust(left=0.085, right=0.999, bottom=0.15, top=0.9)
plt.show()


When k = 35, the silhouette score is the highest, so let's use k = 35 for the small_categories. Run the clustering with k = 35, and get the labels.
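
The two-stage search used here (a coarse scan in steps of 5, then a fine scan around the best coarse value) can be sketched as follows. `score_for_k` is a stand-in for whatever computes the silhouette score of a k-cluster solution. The toy scoring function is invented purely so that the example runs, and both range ends are assumed to be inclusive.

```python
def coarse_to_fine(score_for_k, low, high, coarse_step, radius):
    """Scan k coarsely, then scan every k within `radius` of the best coarse k."""
    coarse_ks = range(low, high + 1, coarse_step)
    best_coarse = max(coarse_ks, key=score_for_k)

    fine_ks = range(max(low, best_coarse - radius), min(high, best_coarse + radius) + 1)
    return max(fine_ks, key=score_for_k)

def toy_score(k):
    """Invented score with a single peak at k = 34 (not real silhouette data)."""
    return -abs(k - 34)

best_k = coarse_to_fine(toy_score, low=15, high=50, coarse_step=5, radius=5)
print(best_k)  # 34
```

The coarse scan keeps the number of clusterings small, at the risk of missing a narrow peak that lies between coarse grid points and outside the fine-scan window.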

In [388]:
# Agglomerative clustering (n_clusters=35, small_categories)
n_clusters=35

model_cat_35 = AgglomerativeClustering(n_clusters=n_clusters, affinity='cosine', linkage='average')
model_cat_35 = model_cat_35.fit(np_normalized_bow_cat_big)

In [390]:
# Plot silhouette scores
silhouette(model_cat_35.n_clusters_, model_cat_35.labels_, np_normalized_bow_cat_big)


For n_clusters = 35 , The average silhouette_score is : 0.0929838921073329


In [452]:
# Scatter plot, colored by the labels
plt.figure(figsize=(9,7))

model = model_cat_35
tsne_features = tsne_features_cat_big

# for i in range(model.n_clusters_):
# for i in [32,4,5,27,7,13,12]:  # Cluster_0
# for i in [0,30,28,24,18,34,3,9,6]:  # Cluster_1
# for i in [14,2,19,20,26]:  # Cluster_2
# for i in [10,1,23,22,21,11,17]:  # Cluster_3
# for i in [8,29,31,15]:  # Cluster_4
for i in [25,33,16]:  # Cluster_5
    plt.scatter(tsne_features[model.labels_ == i][:,0], 
                tsne_features[model.labels_ == i][:,1], 
                s=4, alpha=0.8, c=[cmaps[i]])   
    plt.text(tsne_features[model.labels_ == i][:,0][1],
             tsne_features[model.labels_ == i][:,1][1],
             str(i), color="black", size=16)
        
plt.subplots_adjust(left=0.04, right=0.999, bottom=0.04, top=0.9)
plt.title('t-SNE plot of cat products colored by the labels')
plt.xlim(-45, 55)
plt.ylim(-55, 55)
plt.show()


In [395]:
# Add the labels to the product information
df_label_cat_big['label_35'] = model_cat_35.labels_

In [398]:
# Get the frequent words and count the number of products in each cluster
n_clusters = df_label_cat_big['label_35'].nunique()
df_35cluster_cat = pd.DataFrame(columns=['big_category_number', 'small_category_number','frequent words', 'number of products'])

for i in range(n_clusters):
    words = []
    indexes = df_label_cat_big[df_label_cat_big['label_35'] == i].index
    big_category_number = df_label_cat_big[df_label_cat_big['label_35'] == i].iat[0, 2]
    for index in indexes:
        for word in set(new_cat_token_list[index]):
            words.append(word)
    c = Counter(words)
    values, _ = zip(*c.most_common(10))

    df_temp = pd.DataFrame([[big_category_number, i, values, len(indexes)]], 
                                columns=df_35cluster_cat.columns)
    df_35cluster_cat = df_35cluster_cat.append(df_temp) 

df_35cluster_cat.set_index(['big_category_number']).sort_index()

big_category_number,small_category_number,frequent words,number of products
0,32,"(sturdi, design, surpris, lock, system, farm, anim, twist, line, charm)",3
0,4,"(bed, dog, pad, heat, winter, wash, cover, insid, fit, comfort)",152
0,5,"(door, dog, lock, box, room, flap, fit, cage, side, instruct)",101
0,27,"(cup, mug, microwav, tea, handl, husband, gift, cereal, coffe, soup)",2
0,7,"(dog, vet, fit, issu, scratch, mix, bottl, sever, skin, food)",147
0,13,"(carrier, dog, comfort, room, side, fit, carri, trip, travel, vet)",107
0,12,"(instruct, stray, unit, door, custom, batteri, outdoor, weather, cold, button)",6
1,0,"(scratch, post, sturdi, carpet, tree, play, box, jump, assembl, sit)",307
1,30,"(tent, dog, jump, toy, break, play, fun, tunnel, attach, insid)",11
1,28,"(step, stair, sturdi, dog, bed, jump, age, weight, play, room)",10


In [464]:
# Check the products in each cluster
df_label_cat_big[df_label_cat_big['label_35'] == 10].sample(10)

Unnamed: 0,product_id,product_title,label_12,label_35
1940,469839783,"Virbac Anxitane Tablets, Small Dog/Cat, 50mg, 30 CountVirbac Anxitane Tablets, Small Dog/Cat, 50...",3,10
3159,776225787,"TradeWinds Tape Worm Tabs Cat Tablets, 3 TabletsTradeWinds Tape Worm Tabs Cat Tablets, 3 Tablets...",3,10
3365,823784472,Sentry Sure Shot Wormer 100mlSentry Sure Shot Wormer 100mlSentry Sure Shot Wormer 100mlSentry Su...,3,10
3090,762510316,6 Tapeworm De-wormer Capsules Cats & Dogs 10 lbs and under - Praziquantel 23mg6 Tapeworm De-worm...,3,10
9,1479305,Spring Pet Pill Yums ~ Tasty Pocket Treat to Hide Your Pet's Medication ~ Simply Insert Pill or ...,3,10
341,86120068,MonoJect Syringe 1cc RL Box/100MonoJect Syringe 1cc RL Box/100MonoJect Syringe 1cc RL Box/100Mon...,3,10
1387,335136317,"Prolabs Feline Tapeworm Tabs , 3-23mg TabsProlabs Feline Tapeworm Tabs , 3-23mg TabsProlabs Feli...",3,10
80,16898273,Little City Dogs CHICKEN FLAVORED Praziquantel Tapeworm Wormer Capsules for Cats (6 Capsules)Lit...,3,10
2141,522119868,Sergeant's Vetscription Sure Shot Liquid Wormer Cat 100mlSergeant's Vetscription Sure Shot Liqui...,3,10
588,147805567,Chondro Flex II (250 tablets) CHEWABLESChondro Flex II (250 tablets) CHEWABLESChondro Flex II (2...,3,10


In [476]:
# Label each category title (from the above two tables)
small_category_labels_cat = {32:'other', 4:"bed", 5:"door, cage", 27:'other', 7:"pill, treatment", 13:"carrier, stroller", 
                             12:"other", 
                             0:"scratcher, cat tree", 30:"tent", 28:"step", 24:"perch, shelf", 18:'toy', 
                             34:"nail cap, furniture protector", 3:'other', 9:'toy', 6:"bed", 
                             14:"bed", 2:"collar", 19:"memorial", 20:"harness, leash", 26:"memorial", 
                             10:"pill, treatment", 1:"food, treat, water", 23:"cat grass", 22:"pill, treatment", 
                             21:"flea", 11:"pill, treatment", 17:"scale",
                             8:"litter, litter box", 29:"odor, stain, shampoo", 31:"odor, stain, shampoo", 15:"calming", 
                             25:"flea", 33:"clipper", 16:"brush, comb"}

I noticed several things. Let's take care of these later.  
  
 
- Small_cluster_7 should have been big_category_3. 
- Small_cluster_34 and 21 should have been big_category_5.  
- Small_cluster_30, 28, 24, 6 and 14 should have been big_category_0.  
- Small_cluster_27, 3, 19, 26 and 15 should have been big_category_6.
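
The reassignments listed above can be written down once as a lookup table from small cluster number to the corrected big_category number. This is only a minimal sketch: `BIG_OVERRIDE_CAT` and `corrected_big_category` are hypothetical names that do not appear elsewhere in this notebook, and the big labels in the demo pairs are made up.

```python
# Hypothetical lookup table built from the list above:
# small cluster number -> big_category number it should belong to.
BIG_OVERRIDE_CAT = {
    7: 3,
    34: 5, 21: 5,
    30: 0, 28: 0, 24: 0, 6: 0, 14: 0,
    27: 6, 3: 6, 19: 6, 26: 6, 15: 6,
}


def corrected_big_category(small_label, big_label, override=BIG_OVERRIDE_CAT):
    """Return the overridden big_category if one is listed, else the original."""
    return override.get(small_label, big_label)


# (small_label, big_label) pairs; the big labels here are made up for the demo
examples = [(7, 1), (21, 2), (8, 4)]
print([corrected_big_category(s, b) for s, b in examples])
```

With pandas, the same dictionary could probably be applied in one step, for example `df['label_35'].map(BIG_OVERRIDE_CAT).fillna(df['label_12']).astype(int)`, instead of a row-by-row loop.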

**4.2.3.3. Make the summary tables**

Now, each cat product has a big_category label and a small_category label. In this section, let's make a summary table in which each row represents one product with its id, title, and category labels.  

In [654]:
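# Make a new dataframe from df_label_cat_big for the summary
# (.copy() so that adding a column below does not write into a view of df_label_cat_big)
df_label_cat_summary = df_label_cat_big[['product_id', 'product_title', 'label_12', 'label_35']].copy()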

# Add 'animal' column
df_label_cat_summary['animal'] = 'cat'

print('df_label_cat_summary (first 5 rows):')
df_label_cat_summary.head()

df_label_cat_summary (first 5 rows):


Unnamed: 0,product_id,product_title,label_12,label_35,animal
0,70064,Perfect Pet Soft Flap Cat Door with Telescoping FramePerfect Pet Soft Flap Cat Door with Telesco...,0,5,cat
1,593896,Pet Food Can Covers Lids Set of 3Pet Food Can Covers Lids Set of 3Pet Food Can Covers Lids Set o...,3,1,cat
2,919291,Basis Pet Made in the USA Low Profile Stainless Steel Cat DishBasis Pet Made in the USA Low Prof...,3,1,cat
3,944764,Alfie Pet by Petoga Couture - Vea 2.0 Slow-Eating Anti-Gulping Pet Food Bowl (for Dogs & Cats)Al...,3,1,cat
4,1124833,Petmate Hooded Cat Litter PanPetmate Hooded Cat Litter PanPetmate Hooded Cat Litter PanPetmate H...,4,8,cat


In [655]:
# Take care of mis-categorized products
for i in range(len(df_label_cat_summary)):
    if df_label_cat_summary.iloc[i, 3] == 7:
        df_label_cat_summary.iloc[i, 2] = 3

    elif df_label_cat_summary.iloc[i, 3] in [34, 21]:
        df_label_cat_summary.iloc[i, 2] = 5

    elif df_label_cat_summary.iloc[i, 3] in [30, 28, 24, 6, 14]:
        df_label_cat_summary.iloc[i, 2] = 0
        
    elif df_label_cat_summary.iloc[i, 3] in [27, 3, 19, 26, 15]:
        df_label_cat_summary.iloc[i, 2] = 6

# Replace the category numbers into the names
df_label_cat_summary = df_label_cat_summary.replace({'label_12': big_category_labels_cat,
                                             'label_35': small_category_labels_cat})
# Rename the columns
df_label_cat_summary = df_label_cat_summary.rename(columns={'label_12': 'big_category', 'label_35': 'small_category'})

print('df_label_cat_summary (first 5 rows):')
df_label_cat_summary.head()

df_label_cat_summary (first 5 rows):


Unnamed: 0,product_id,product_title,big_category,small_category,animal
0,70064,Perfect Pet Soft Flap Cat Door with Telescoping FramePerfect Pet Soft Flap Cat Door with Telesco...,"door, cage, carrier, bed","door, cage",cat
1,593896,Pet Food Can Covers Lids Set of 3Pet Food Can Covers Lids Set of 3Pet Food Can Covers Lids Set o...,"food, treatment","food, treat, water",cat
2,919291,Basis Pet Made in the USA Low Profile Stainless Steel Cat DishBasis Pet Made in the USA Low Prof...,"food, treatment","food, treat, water",cat
3,944764,Alfie Pet by Petoga Couture - Vea 2.0 Slow-Eating Anti-Gulping Pet Food Bowl (for Dogs & Cats)Al...,"food, treatment","food, treat, water",cat
4,1124833,Petmate Hooded Cat Litter PanPetmate Hooded Cat Litter PanPetmate Hooded Cat Litter PanPetmate H...,"litter, odor, stain","litter, litter box",cat


Let's make df_label_cat_other have the same columns as df_label_cat_summary, and take care of the mis-categorized products. Remember them:

- Cluster_8 is not for cats, but fish.  
- Cluster_10 is a kind of toy.

In [656]:
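# Make a new dataframe from df_label_cat_other for the summary
# (.copy() so that adding columns below does not write into a view of df_label_cat_other)
df_other_cat = df_label_cat_other[['product_id', 'product_title', 'label_12']].copy()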

# Add 'big_category' column
df_other_cat['big_category'] = 'other'

# Add 'animal' column
df_other_cat['animal'] = 'cat'

# Take care of mis-categorized products 
for i in range(len(df_other_cat)):
    # Column_2 is 'label_12'
    if df_other_cat.iloc[i, 2] == 8:
        # Column_3 is 'big_category'
        df_other_cat.iloc[i, 3] = 'fish, reptile'
        # Column_4 is 'animal'
        df_other_cat.iloc[i, 4] = 'other'
    
    elif df_other_cat.iloc[i, 2] == 10:
        df_other_cat.iloc[i, 3] = 'toy, scratcher, cat tree' 

# Replace the category numbers into the names
df_other_cat = df_other_cat.replace({'label_12': other_category_labels_cat})

# Rename the column
df_other_cat = df_other_cat.rename(columns={'label_12': 'small_category'})

print('df_other_cat (first 5 rows):')
df_other_cat.head()

df_other_cat (first 5 rows):


Unnamed: 0,product_id,product_title,small_category,big_category,animal
62,12304637,"Wall Mounted Fish Bowl Bubble for Goldfish & Beta or Hanging Terrarium with Exclusive ""Lets Get ...",aquarium,"fish, reptile",other
73,14334855,"Cholla Wood, 3 Nice Pieces of Aquarium Driftwood Decoration by Aquatic ArtsCholla Wood, 3 Nice P...",aquarium,"fish, reptile",other
161,35498715,Lightahead Artificial Mini Aquarium Fish Tank Multi Color LED Swimming Fish Tank with BubblesLig...,aquarium,"fish, reptile",other
224,53604498,"OurPets Bouncy North-American Catnip Bubbles, 8-OunceOurPets Bouncy North-American Catnip Bubble...",toy,"toy, scratcher, cat tree",cat
468,121923547,"Kookamunga Krazee Kitty Catnip Bubbles, 5 ozKookamunga Krazee Kitty Catnip Bubbles, 5 ozKookamun...",toy,"toy, scratcher, cat tree",cat


In [657]:
# Merge df_label_cat_summary and df_other_cat
df_label_cat_summary = pd.concat([df_label_cat_summary, df_other_cat])

print('df_label_cat_summary (first 5 rows):')
df_label_cat_summary.head()

df_label_cat_summary (first 5 rows):


Unnamed: 0,product_id,product_title,big_category,small_category,animal
0,70064,Perfect Pet Soft Flap Cat Door with Telescoping FramePerfect Pet Soft Flap Cat Door with Telesco...,"door, cage, carrier, bed","door, cage",cat
1,593896,Pet Food Can Covers Lids Set of 3Pet Food Can Covers Lids Set of 3Pet Food Can Covers Lids Set o...,"food, treatment","food, treat, water",cat
2,919291,Basis Pet Made in the USA Low Profile Stainless Steel Cat DishBasis Pet Made in the USA Low Prof...,"food, treatment","food, treat, water",cat
3,944764,Alfie Pet by Petoga Couture - Vea 2.0 Slow-Eating Anti-Gulping Pet Food Bowl (for Dogs & Cats)Al...,"food, treatment","food, treat, water",cat
4,1124833,Petmate Hooded Cat Litter PanPetmate Hooded Cat Litter PanPetmate Hooded Cat Litter PanPetmate H...,"litter, odor, stain","litter, litter box",cat


This df_label_cat_summary shows the category label of each product for cats.  That's all for cat products!

### 4.3. Other category

Let's cluster the other category in the same way as the dog and cat categories.

#### 4.3.1. Load data

Token list:

In [493]:
# Load the token list
pet_token_list = []
with open("other_token_list.csv", "r", encoding="UTF-8") as f:
    reader = csv.reader(f) 
    for r in reader: 
        pet_token_list.append(r)
        
print('pet_token_list (first 5 tokens of the first product):', pet_token_list[0][:5])
print('Number of products:', len(pet_token_list))
print('Number of unique tokens:', len(set(token for review in pet_token_list for token in review)))

pet_token_list (first 5 tokens of the first product): ['lint', 'roller', 'hair', 'cloth', 'pick']
Number of products: 4388
Number of unique tokens: 5914


ID-title table:

In [494]:
# Load the product table
df_pet_id_name = pd.read_csv("df_other_id_name.csv")

print('df_pet_id_name (first 5 products):')
df_pet_id_name.head()

df_pet_id_name (first 5 products):


Unnamed: 0,product_id,product_title
0,674575,Scotch Pet Hair Roller 839RScotch Pet Hair Roller 839RScotch Pet Hair Roller 839RScotch Pet Hair...
1,690871,Petco Brooklyn 55 Gallon Metal Tank StandPetco Brooklyn 55 Gallon Metal Tank StandPetco Brooklyn...
2,1299419,Kaytee Forti Diet Pro Health Guinea Pig FoodKaytee Forti Diet Pro Health Guinea Pig FoodKaytee F...
3,1304120,Amzdeal® 12 Inch Blue led light Underwater LED Aquarium Light Strip & Airstone for Aquarium Fish...
4,1469725,"Fetch for Pets Bulk Nail Files, 6-PackFetch for Pets Bulk Nail Files, 6-PackFetch for Pets Bulk ..."


#### 4.3.2. Vectorization

pet_token_list has 5,914 unique tokens. Additionally, I use bigrams, as with the dog dataset. To control the vocabulary size, tokens that appear in fewer than 6 documents are dropped (min_df=6).

In [495]:
# Generate matrix of word vectors

# Create CountVectorizer object 
cvectorizer_pet = CountVectorizer(tokenizer=dummy_tokened,lowercase=False, min_df=6, ngram_range=(1, 2))

bow_pet = cvectorizer_pet.fit_transform(pet_token_list)

# Get the feature names (in scikit-learn 1.0 and later, get_feature_names_out() replaces this method)
feature_names_pet = cvectorizer_pet.get_feature_names()

# Show the shape of bow_matrix
print('BOW matrix: bow_pet')
print('Matrix shape:', bow_pet.shape)

BOW matrix: bow_pet
Matrix shape: (4388, 5156)
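
To make the effect of `min_df` and `ngram_range=(1, 2)` concrete, here is a small library-free sketch of the same idea on made-up token lists. It is not the `CountVectorizer` implementation; `unigrams_and_bigrams` and `build_vocabulary` are hypothetical helpers written only for this illustration.

```python
from collections import Counter


def unigrams_and_bigrams(tokens):
    """Return the unigrams plus space-joined adjacent pairs of one token list."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return list(tokens) + bigrams


def build_vocabulary(docs, min_df):
    """Keep terms that occur in at least `min_df` documents (document frequency)."""
    doc_freq = Counter()
    for tokens in docs:
        # set(): a term counts once per document, however often it repeats
        doc_freq.update(set(unigrams_and_bigrams(tokens)))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


# Toy token lists (made up for illustration)
toy_docs = [
    ["fish", "tank", "filter"],
    ["fish", "tank", "pump"],
    ["bird", "cage", "fish", "fish"],
]
vocab = build_vocabulary(toy_docs, min_df=2)
print(vocab)
```

Note that the filter is on the number of documents containing a term, not on its total count: "fish" appears twice in the third toy document but contributes only one to its document frequency.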


In [496]:
# Normalize the bow matrix
normalized_bow_pet = normalize(bow_pet)

# Convert the normalized bow matrix into np.array to use for AgglomerativeClustering()
np_normalized_bow_pet = normalized_bow_pet.toarray()

print('Normalized BOW matrix: np_normalized_bow_pet')
print('Matrix shape:', np_normalized_bow_pet.shape)

Normalized BOW matrix: np_normalized_bow_pet
Matrix shape: (4388, 5156)
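
`normalize()` rescales each row to unit L2 length by default. One consequence worth keeping in mind for the cosine-based clustering below is that, for unit-length rows, the dot product equals the cosine similarity, so products with the same word mix but different review volume look identical. The following is a small sketch on made-up count vectors; the helper names are hypothetical.

```python
import math


def l2_normalize(row):
    """Scale a vector to unit Euclidean length (zero vectors are left as is)."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm > 0 else list(row)


def cosine_similarity(u, v):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Toy count vectors (made up): v has the same word mix as u, but twice the counts
u = [2, 1, 0]
v = [4, 2, 0]
w = [0, 1, 3]

u_n, v_n, w_n = l2_normalize(u), l2_normalize(v), l2_normalize(w)
dot_uv = sum(a * b for a, b in zip(u_n, v_n))
dot_uw = sum(a * b for a, b in zip(u_n, w_n))

# After normalization, the dot product matches the cosine similarity of the raw vectors
print(round(dot_uv, 6), round(cosine_similarity(u, v), 6))
print(round(dot_uw, 6), round(cosine_similarity(u, w), 6))
```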


In [497]:
# Create a t-SNE instance
tsne_pet = TSNE(perplexity=50, learning_rate=800, random_state=10)

# Apply fit_transform to samples
tsne_features_pet = tsne_pet.fit_transform(normalized_bow_pet)

# Scatter plot of t-SNE
plt.figure(figsize=(9,7))
tsne_features = tsne_features_pet

plt.scatter(tsne_features[:,0], tsne_features[:,1], s=4, alpha=0.8, c='royalblue')   

plt.title('t-SNE plot of pet products')
plt.subplots_adjust(left=0.04, right=0.999, bottom=0.04, top=0.9)
plt.show()


There seem to be fewer than 6 big clusters, along with many small and mini clusters. Now, let's start hierarchical clustering!

#### 4.3.3. Hierarchical clustering

First, I would like to divide the data into fewer than 6 categories. Second, each of the big_categories will be divided into subcategories.

**4.3.3.1. Big categories**

In [498]:
# Agglomerative clustering with distance_threshold=0 to compute the full tree
# (note: scikit-learn 1.2 and later name the `affinity` parameter `metric`)
model_pet = AgglomerativeClustering(distance_threshold=0, n_clusters=None, affinity='cosine', linkage='average')
model_pet = model_pet.fit(np_normalized_bow_pet)

In [499]:
# Draw the dendrogram
plt.figure(figsize=(15,5))
plt.title('Hierarchical Clustering Dendrogram')
# Plot only the top levels of the dendrogram (at most p=5 levels)
plot_dendrogram(model_pet, truncate_mode='level', p=5)

plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.subplots_adjust(left=0.035, right=0.999, bottom=0.15, top=0.95)
plt.ylabel("Distance threshold")
plt.yticks([i/10 for i in range(0,11,1)])
plt.show()

  



According to the dendrogram above, some mini clusters join the big clusters only near the top of the dendrogram. Let's collect them as the "other" group.
I would like to divide the data into fewer than 6 big_categories. So, let's check the silhouette scores from k = 5 to 15.

In [500]:
# Agglomerative clustering from k = 5 to 15 
n_k_pet, scores_pet, n_samples_pet  = aggclus_silscore_nsamples(5, 15, 1, np_normalized_bow_pet)

In [501]:
# Plot Number of clusters vs. Silhouette score
plt.figure(figsize=(9,4))

n_clusters = n_k_pet
sil_scores = scores_pet

plt.plot(n_clusters, sil_scores, '.-')
plt.xticks(n_clusters)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.title('Number of clusters vs. Silhouette score')

plt.subplots_adjust(left=0.085, right=0.999, bottom=0.15, top=0.9)
plt.show()
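
For reference, the silhouette score of one sample is s = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. The sketch below computes the mean score with cosine distance in plain Python on made-up points. It is not the helper used in this notebook (whose definition lives in section 0); `mean_silhouette` is a hypothetical name, and the sketch assumes at least two clusters and no zero vectors.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_silhouette(points, labels, dist=cosine_distance):
    """Mean of s(i) = (b - a) / max(a, b) over all samples (needs >= 2 clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Singleton cluster: the score is taken to be 0
            scores.append(0.0)
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy points (made up): two tight groups pointing in different directions
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
good = mean_silhouette(points, [0, 0, 1, 1])  # labels follow the groups
bad = mean_silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
print(round(good, 3), round(bad, 3))
```

A labeling that follows the two groups scores close to 1, while one that cuts across them scores below 0. Scores near 0, like the ones reported in this notebook, indicate heavily overlapping clusters.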

  



The score is highest at k = 12. Let's check the number of big clusters.

In [502]:
# Count the number of samples in the cluster
for k in [12]:
    n_cluster = k - 5
    count = 0
    for i in range(len(n_samples_pet[n_cluster])):
        if n_samples_pet[n_cluster][i] > 100:
            count += 1
    print('k =', k)
    print(n_samples_pet[n_cluster])
    print('Number of clusters more than 100 samples:', count)
    print('')

k = 12
{0: 2689, 1: 118, 2: 116, 3: 81, 4: 39, 5: 1289, 6: 33, 7: 1, 8: 6, 9: 12, 10: 2, 11: 2}
Number of clusters more than 100 samples: 4
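
The same count can be obtained more compactly from the size dictionary itself. The sketch below copies the sizes from the k = 12 output above; the variable names are chosen only for this illustration.

```python
# Cluster sizes copied from the k = 12 output above
cluster_sizes = {0: 2689, 1: 118, 2: 116, 3: 81, 4: 39, 5: 1289,
                 6: 33, 7: 1, 8: 6, 9: 12, 10: 2, 11: 2}

threshold = 100
# Keep only the clusters with more than `threshold` products
big = {label: n for label, n in cluster_sizes.items() if n > threshold}

print('Clusters with more than', threshold, 'samples:', len(big))
print('Labels, largest first:', sorted(big, key=big.get, reverse=True))
print('Share of all products:', round(sum(big.values()) / sum(cluster_sizes.values()), 3))
```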



There are two big clusters, three small clusters, and several mini clusters. Because this dataset covers the 'other' animals, the product types are likely to be highly diverse. Let's look closely at k = 12 and see what is in each cluster.

In [503]:
# Agglomerative clustering; k = 12
model_pet_12 = AgglomerativeClustering(n_clusters=12, affinity='cosine', linkage='average')
model_pet_12 = model_pet_12.fit(np_normalized_bow_pet)

In [504]:
# Silhouette plot
silhouette(model_pet_12.n_clusters_, model_pet_12.labels_, np_normalized_bow_pet)


For n_clusters = 12 , The average silhouette_score is : 0.03564812239805721


In [505]:
# Scatter plot, colored by the labels of model_pet_12
plt.figure(figsize=(9,7))

model = model_pet_12
tsne_features = tsne_features_pet

for i in range(model.n_clusters_):
    plt.scatter(tsne_features[model.labels_ == i][:,0], 
                tsne_features[model.labels_ == i][:,1], 
                s=4, alpha=0.8, c=[cmap1(i/model.n_clusters_)])   
    plt.text(tsne_features[model.labels_ == i][:,0][0],
             tsne_features[model.labels_ == i][:,1][0],
             str(i), color="black", size=16
             )
        
plt.title('t-SNE plot of pet products colored by the labels')
plt.subplots_adjust(left=0.04, right=0.999, bottom=0.04, top=0.9)
plt.show()

  



In [506]:
# Apply the labels to the product table
# (.copy() so that adding a column does not write into a view of df_pet_id_name)
df_label_pet = df_pet_id_name[['product_id', 'product_title']].copy()
df_label_pet['label_12'] = model_pet_12.labels_

# Show the frequent words and the number of products in each cluster
frequent_words(df_label_pet, 'label_12', pet_token_list)

cluster number,frequent words,number of products
0,"(tank, water, gallon, aquarium, fish, filter, plant, color, food, pump)",2689
1,"(smell, spray, hand, area, effect, bottl, water, result, sinc, vet)",118
2,"(fit, comfort, car, color, side, door, head, strap, materi, clip)",116
3,"(hair, groom, cut, blade, brush, clipper, fit, comb, oster, tool)",81
4,"(beauti, gift, son, chain, hang, seller, tag, make, color, person)",39
5,"(bird, cage, food, eat, parrot, toy, seed, rabbit, treat, guinea)",1289
6,"(batteri, collar, fit, leash, box, shock, charg, train, hand, origin)",33
7,"(bio, colon, alreadi, benefit, meter, provid, biolog, matrix, reus, effect)",1
8,"(goat, milk, right, watch, administ, box, duti, side, dwarf, edg)",6
9,"(pad, plastic, filter, origin, name, pee, absorb, leak, sound, spin)",12


According to the frequent tokens, the two big clusters are about fish (cluster_0) and birds & rabbits (cluster_5). I can imagine that birds and rabbits share several kinds of products, such as cages, cleaning items, and food. So, I think k = 12 is appropriate. Let's look at each cluster closely to confirm this.

In [633]:
# Check the products in each cluster
df_label_pet[df_label_pet['label_12'] == 9]#.sample(10)

Unnamed: 0,product_id,product_title,label_12
265,62115607,Innovative Marine 939026 Auqa Gadget Spin Stream Universal NozzleInnovative Marine 939026 Auqa G...,9
697,158213061,Marineland PRBW150B 150b Bio Wheel Assembly Penguin Filter Parts for AquariumMarineland PRBW15...,9
902,196924825,Generic Fine Filter Media Pads Suitable for Eheim Classic 2213 / 250 2616135 (Pack of 12)Generic...,9
1195,266369719,4 Pack - 30ppi Foam Filter Pads for Rena Filstar xP by Zanyzap4 Pack - 30ppi Foam Filter Pads fo...,9
1440,322511764,"WEE WEE Puppy Training PEE Pads 30x30 DOG Chux 110 / CS, IrregularWEE WEE Puppy Training PEE Pad...",9
2050,466348598,"Petcessory PHB-001-RED-S Travel Harness with Leash, Small, RedPetcessory PHB-001-RED-S Travel Ha...",9
2291,519895349,K&amp;H Manufacturing Small Animal Heated Pad Deluxe CoverK&amp;H Manufacturing Small Animal Hea...,9
2339,530354268,Sergeant's Yippee Skippy Doggie Training PadsSergeant's Yippee Skippy Doggie Training PadsSergea...,9
2968,675945515,All-Absorb Training Pads 22-inch By 23-inch.All-Absorb Training Pads 22-inch By 23-inch.All-Abso...,9
3076,699994138,Tick Remover - World's Simplest Tick Remover by Ticked OffTick Remover - World's Simplest Tick R...,9


Judging from the silhouette plot, the labeled t-SNE plot, the frequent words, and the product titles in each cluster, each of the two big clusters has distinct features, and I consider both appropriate as big_categories. Let's treat the remaining clusters as the 'other' group. However, some of them also have a distinct character, so I'll label them as well.

In [634]:
# Label each category title (from the above two tables)
big_category_labels_pet = {0:"fish, reptile", 1:"other", 2:"other", 3:"other", 4:"other", 
                           5:"bird, rabbit, hamster", 6:"other", 7:"fish, reptile", 8:"other", 
                           9:"fish, reptile", 10:"other", 11:"other"}

# Label each category title (from the above table)
other_category_labels_pet = {1:"treatment", 2:"other", 3:"brush, comb, clipper", 4:"memorial, tag", 6:"training", 
                             7:"aquarium", 8:"other", 9:"aquarium", 10:"other", 11:"other"}

I noticed several things. Let's take care of these later.  

- Cluster_6 is for dogs, especially for training. 
- Cluster_7 and 9 are for fish.


Now, let's extract the two big clusters (cluster_0 and 5) for further clustering.

In [530]:
# Split the product information into the big clusters and the 'other' group
big_clusters_pet = [0, 5]
is_big_pet = df_label_pet['label_12'].isin(big_clusters_pet)

# Boolean indexing keeps the original row order, so the indexes stay sorted
df_label_pet_big = df_label_pet[is_big_pet].copy()
df_label_pet_other = df_label_pet[~is_big_pet].copy()

id-title table for the big clusters:

In [531]:
print(len(df_label_pet_big), 'products')
print('df_label_pet_big (first 5 products):')
df_label_pet_big.head()

3978 products
df_label_pet_big (first 5 products):


Unnamed: 0,product_id,product_title,label_12
1,690871,Petco Brooklyn 55 Gallon Metal Tank StandPetco Brooklyn 55 Gallon Metal Tank StandPetco Brooklyn...,0
2,1299419,Kaytee Forti Diet Pro Health Guinea Pig FoodKaytee Forti Diet Pro Health Guinea Pig FoodKaytee F...,5
3,1304120,Amzdeal® 12 Inch Blue led light Underwater LED Aquarium Light Strip & Airstone for Aquarium Fish...,0
5,1926454,API 1500 Watt 3 In 1 De-Icer DT15API 1500 Watt 3 In 1 De-Icer DT15API 1500 Watt 3 In 1 De-Icer...,0
6,2365091,AQUATOP AC/DC Single Battery Operated Air PumpAQUATOP AC/DC Single Battery Operated Air PumpAQUA...,0


id-title table for the other clusters:

In [532]:
print(len(df_label_pet_other), 'products')
print('df_label_pet_other (first 5 products):')
df_label_pet_other.head()

410 products
df_label_pet_other (first 5 products):


Unnamed: 0,product_id,product_title,label_12
0,674575,Scotch Pet Hair Roller 839RScotch Pet Hair Roller 839RScotch Pet Hair Roller 839RScotch Pet Hair...,3
4,1469725,"Fetch for Pets Bulk Nail Files, 6-PackFetch for Pets Bulk Nail Files, 6-PackFetch for Pets Bulk ...",2
7,2425297,"Parisian Pet I Love Daddy Dog T-Shirt, SmallParisian Pet I Love Daddy Dog T-Shirt, SmallParisian...",2
18,4493925,RECARO Performance BOOSTER Highback Booster Car Seat - RoseRECARO Performance BOOSTER Highback B...,2
46,10799804,Single Feather SmudgerSingle Feather SmudgerSingle Feather SmudgerSingle Feather SmudgerSingle F...,11


Also, let's separate the bow matrix and t-SNE features of the big clusters.

In [533]:
# Get the normalized bow matrix of the big clusters
np_normalized_bow_pet_big = np.array([np_normalized_bow_pet[i] 
                                          for i in range(len(np_normalized_bow_pet)) if i in df_label_pet_big.index])

print('Normalized BOW matrix of big_categories: np_normalized_bow_pet_big')
print('Matrix shape:', np_normalized_bow_pet_big.shape)

Normalized BOW matrix of big_categories: np_normalized_bow_pet_big
Matrix shape: (3978, 5156)


In [534]:
# Get the t-SNE features of the big clusters
tsne_features_pet_big = np.array([tsne_features_pet[i] 
                                      for i in range(len(tsne_features_pet)) if i in df_label_pet_big.index])

print('T-SNE features of big_categories: tsne_features_pet_big')
print('Feature shape:', tsne_features_pet_big.shape)

T-SNE features of big_categories: tsne_features_pet_big
Feature shape: (3978, 2)


**4.3.3.2. Small categories**

Check the silhouette scores roughly, from k = 5 to 50 in steps of 5, to get an idea of how many small_categories to use.

In [536]:
# Agglomerative clustering; k = 5 to 50 in steps of 5
rough_n_k_pet, rough_scores_pet, _ = aggclus_silhouette_score_nsamples(5, 50, 5, np_normalized_bow_pet_big)

In [537]:
# Plot Number of clusters vs. Silhouette score
plt.figure(figsize=(9,4))

n_clusters = rough_n_k_pet
sil_scores = rough_scores_pet

plt.plot(n_clusters, sil_scores, '.-')
plt.xticks(n_clusters)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.title('Number of clusters vs. Silhouette score')

plt.subplots_adjust(left=0.085, right=0.999, bottom=0.15, top=0.9)
plt.show()

  



One peak is around k = 35. Let's check the scores finely from k = 30 to 40 to choose k for the small_categories.

In [538]:
# Agglomerative clustering; k = 30 to 40
small_n_k_pet, small_scores_pet, small_n_samples_pet = aggclus_silhouette_score_nsamples(30, 40, 1, np_normalized_bow_pet_big)

In [539]:
# Plot Number of clusters vs. Silhouette score
plt.figure(figsize=(9,4))

n_clusters = small_n_k_pet
sil_scores = small_scores_pet

plt.plot(n_clusters, sil_scores, '.-')
plt.xticks(n_clusters)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.title('Number of clusters vs. Silhouette score (K = 30 to 40)')

plt.subplots_adjust(left=0.085, right=0.999, bottom=0.15, top=0.9)
plt.show()
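
The two-pass search used here (a coarse grid over k, then every k around the coarse peak) can be sketched generically as follows. The names `best_k`, `coarse_to_fine`, and `toy_score` are hypothetical; `toy_score` is a made-up stand-in for "cluster with k and compute the silhouette score".

```python
def best_k(score_fn, k_values):
    """Return the (k, score) pair with the highest score among k_values."""
    scored = [(k, score_fn(k)) for k in k_values]
    return max(scored, key=lambda pair: pair[1])


def coarse_to_fine(score_fn, start, stop, step):
    """Scan k on a coarse grid, then every k within one step of the coarse peak."""
    coarse_k, _ = best_k(score_fn, range(start, stop + 1, step))
    lo = max(start, coarse_k - step)
    hi = min(stop, coarse_k + step)
    return best_k(score_fn, range(lo, hi + 1))


# Toy stand-in for the real scoring step (made up): peaks at k = 37
def toy_score(k):
    return -abs(k - 37)


print(coarse_to_fine(toy_score, 5, 50, 5))
```

This saves many clustering runs compared with scanning every k, but it can miss a narrow peak that falls between two coarse grid points, so the coarse plot is still worth inspecting by eye.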

  



When k = 37, the silhouette score is highest, so let's use k = 37 for the small_categories. Run the clustering with k = 37 and get the labels.

In [540]:
# Agglomerative clustering (n_clusters=37, small_categories)
n_clusters=37

model_pet_37 = AgglomerativeClustering(n_clusters=n_clusters, affinity='cosine', linkage='average')
model_pet_37 = model_pet_37.fit(np_normalized_bow_pet_big)

In [541]:
# Plot silhouette scores
silhouette(model_pet_37.n_clusters_, model_pet_37.labels_, np_normalized_bow_pet_big)


For n_clusters = 37 , The average silhouette_score is : 0.03657825698406181


In [618]:
# Scatter plot, colored by the labels (non_interactive plot)
plt.figure(figsize=(9,7))

model = model_pet_37
tsne_features = tsne_features_pet_big

# for i in range(model.n_clusters_):
for i in [0,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18,20,22,23,26,27,28,29,30,31,33,35,36]:  # Cluster_0
# for i in [1,2,6,19,21,24,25,32,34]:  # Cluster_5
# for i in [1,36, 6]:
    plt.scatter(tsne_features[model.labels_ == i][:,0], 
                tsne_features[model.labels_ == i][:,1], 
                s=4, alpha=0.8, c=[cmaps[i+1]], label=i)   
    plt.text(tsne_features[model.labels_ == i][:,0][0],
             tsne_features[model.labels_ == i][:,1][0],
             str(i), color="black", size=16)
        
plt.legend(loc='lower center', bbox_to_anchor=(.5, 1), ncol=10)
plt.subplots_adjust(left=0.035, right=0.999, bottom=0.04, top=0.85)
plt.xlim(-65, 60)
plt.ylim(-55, 70)
plt.show()

  



In [544]:
# Add the labels on the product information
df_label_pet_big['label_37'] = model_pet_37.labels_

In [545]:
# Get the frequent words and count the number of products in each cluster
n_clusters = df_label_pet_big['label_37'].nunique()
df_37cluster_pet = pd.DataFrame(columns=['big_category_number', 'small_category_number','frequent words', 'number of products'])

for i in range(n_clusters):
    words = []
    indexes = df_label_pet_big[df_label_pet_big['label_37'] == i].index
    big_category_number = df_label_pet_big[df_label_pet_big['label_37'] == i].iat[0, 2]
    for index in indexes:
        for word in set(pet_token_list[index]):
            words.append(word)
    c = Counter(words)
    values, _ = zip(*c.most_common(10))

    df_temp = pd.DataFrame([[big_category_number, i, values, len(indexes)]], 
                                columns=df_37cluster_pet.columns)
    df_37cluster_pet = pd.concat([df_37cluster_pet, df_temp])

df_37cluster_pet.set_index(['big_category_number']).sort_index()

big_category_number,small_category_number,frequent words,number of products
0,0,"(terrarium, beauti, digit, humid, temp, instruct, moss, plant, tank, terra)",11
0,33,"(treatment, result, warn, spot, aliv, tissu, skin, recomend, didnt, remov)",3
0,31,"(serv, purpos, learn, deliveri, project, daddi, soak, guard, thx, busi)",2
0,30,"(eye, cavali, stain, wipe, face, hair, solut, wash, food, charl)",5
0,29,"(scale, water, eye, tank, didnt, degre, dark, glass, freez, tape)",3
0,28,"(seal, bucket, food, water, content, storag, lid, gamma, fit, cover)",7
0,27,"(side, tank, support, junk, descript, scratch, care, gap, stay, make)",7
0,26,"(crab, hermit, shell, food, tank, eat, hermi, water, bag, shape)",28
0,23,"(tank, readi, display, parasit, sinc, pack, pod, bag, color, sign)",3
0,22,"(seller, head, turn, send, suppos, tank, gallon, pressur, bulk, system)",2
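
The frequent-word columns in the table above count each token at most once per product (the loop wraps each token list in `set()`), so they rank tokens by how many products mention them rather than by raw frequency. The sketch below shows that counting rule on made-up token lists; `frequent_tokens_by_cluster` is a hypothetical helper, not the function used in this notebook.

```python
from collections import Counter


def frequent_tokens_by_cluster(token_lists, labels, top_n=3):
    """For each cluster, list the tokens found in the most products."""
    counters = {}
    for tokens, label in zip(token_lists, labels):
        # set(): count a token once per product, as in the loop above
        counters.setdefault(label, Counter()).update(set(tokens))
    return {
        label: [tok for tok, _ in counter.most_common(top_n)]
        for label, counter in sorted(counters.items())
    }


# Toy token lists and cluster labels (made up for illustration)
toy_tokens = [
    ["tank", "water", "tank"],
    ["tank", "filter"],
    ["bird", "cage"],
    ["bird", "seed", "bird"],
]
toy_labels = [0, 0, 1, 1]

top_tokens = frequent_tokens_by_cluster(toy_tokens, toy_labels, top_n=1)
print(top_tokens)
```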


In [624]:
# Check the products in each cluster
df_label_pet_big[df_label_pet_big['label_37'] == 0].sample(10)

Unnamed: 0,product_id,product_title,label_12,label_37
3949,894343448,H Potter Mini (Decorative) Box English Greenhouse Terrarium with Green Glass.H Potter Mini (Deco...,0,0
2090,473957770,Exo Terra Compact Incandescent FixtureExo Terra Compact Incandescent FixtureExo Terra Compact In...,0,0
3976,903038769,"Avianweb Digital Thermo Hygrometer, Mini, BlackAvianweb Digital Thermo Hygrometer, Mini, BlackAv...",0,0
2129,481835720,Zoo Med Naturalistic Terrarium HoodZoo Med Naturalistic Terrarium HoodZoo Med Naturalistic Terra...,0,0
491,111385414,Terrarium/Fairy Garden Kit - Create Your Own Living TerrariumTerrarium/Fairy Garden Kit - Create...,0,0
2916,662867885,Exo Terra Digital Combination Thermometer/HygrometerExo Terra Digital Combination Thermometer/Hy...,0,0
4127,937730508,Exo Terra Terrarium PlantExo Terra Terrarium PlantExo Terra Terrarium PlantExo Terra Terrarium P...,0,0
750,168473621,9GreenBox - Terrarium/Fairy Garden Kit - Create Your Own Living Terrarium or Fairy Garden9GreenB...,0,0
3154,720430757,Reptology Reptile Hygrometer Humidity and Temperature Sensor GaugesReptology Reptile Hygrometer ...,0,0
2614,594657150,"Exo Terra Digital Thermometer with Probe, Celsius and FahrenheitExo Terra Digital Thermometer wi...",0,0


In [625]:
# Label each category title (from the above two tables)
small_category_labels_pet = {0:"terrarium", 33:"aquarium", 31:"other", 
                             30:"terrarium", 29:"aquarium", 28:"aquarium", 27:"other", 26:"crustacean",
                             23:"aquarium", 22:"aquarium", 20:"aquarium", 35:"crustacean",  17:"other", 16:"aquarium", 
                             15:"other", 18:"aquarium", 13:"terrarium", 3:"terrarium", 
                             4:"aquarium", 5:"food (reptile)", 14:"waste bag", 8:"bowl", 7:"aquarium", 10:"other", 
                             11:"other", 12:"other", 9:"food (fish, turtle)", 36:"rabbit, hamster",
                             21:"chicken", 24:"chicken", 25:"chinchilla", 6:"rabbit, hamster", 
                             19:"other", 2:"chicken", 32:"litter, bedding", 1:"bird", 34:"other"}

I noticed several things. Let's take care of these later.  

- Small_cluster_17, 15, 14, 8, 12 are in big_category_0, but should have been big_category_1.
- Small_cluster_34 is in big_category_5, but should have been big_category_1.
- Small_cluster_36 is in big_category_0, but should have been big_category_5.

**4.3.3.3. Make the summary tables**

Now, each of the other products has a big_category label and a small_category label. In this section, let's make a summary table in which each row represents one product with its id, title, and category labels.  

In [648]:
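# Make a new dataframe from df_label_pet_big for the summary
# (.copy() so that adding a column below does not write into a view of df_label_pet_big)
df_label_pet_summary = df_label_pet_big[['product_id', 'product_title', 'label_12', 'label_37']].copy()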

# Add 'animal' column
df_label_pet_summary['animal'] = 'other'

print('df_label_pet_summary (first 5 rows):')
df_label_pet_summary.head()

df_label_pet_summary (first 5 rows):


Unnamed: 0,product_id,product_title,label_12,label_37,animal
1,690871,Petco Brooklyn 55 Gallon Metal Tank StandPetco Brooklyn 55 Gallon Metal Tank StandPetco Brooklyn...,0,27,other
2,1299419,Kaytee Forti Diet Pro Health Guinea Pig FoodKaytee Forti Diet Pro Health Guinea Pig FoodKaytee F...,5,6,other
3,1304120,Amzdeal® 12 Inch Blue led light Underwater LED Aquarium Light Strip & Airstone for Aquarium Fish...,0,4,other
5,1926454,API 1500 Watt 3 In 1 De-Icer DT15API 1500 Watt 3 In 1 De-Icer DT15API 1500 Watt 3 In 1 De-Icer...,0,4,other
6,2365091,AQUATOP AC/DC Single Battery Operated Air PumpAQUATOP AC/DC Single Battery Operated Air PumpAQUA...,0,4,other


In [649]:
# Take care of mis-categorized products
for i in range(len(df_label_pet_summary)):
    if df_label_pet_summary.iloc[i, 3] in [17, 15, 14, 8, 12, 34]:
        df_label_pet_summary.iloc[i, 2] = 1
    elif df_label_pet_summary.iloc[i, 3] == 36:
        df_label_pet_summary.iloc[i, 2] = 5
        
# Replace the category numbers into the names
df_label_pet_summary = df_label_pet_summary.replace({'label_12': big_category_labels_pet,
                                             'label_37': small_category_labels_pet})
# Rename the columns
df_label_pet_summary = df_label_pet_summary.rename(columns={'label_12': 'big_category', 'label_37': 'small_category'})

print('df_label_pet_summary (first 5 rows):')
df_label_pet_summary.head()

df_label_pet_summary (first 5 rows):


Unnamed: 0,product_id,product_title,big_category,small_category,animal
1,690871,Petco Brooklyn 55 Gallon Metal Tank StandPetco Brooklyn 55 Gallon Metal Tank StandPetco Brooklyn...,"fish, reptile",other,other
2,1299419,Kaytee Forti Diet Pro Health Guinea Pig FoodKaytee Forti Diet Pro Health Guinea Pig FoodKaytee F...,"bird, rabbit, hamster","rabbit, hamster",other
3,1304120,Amzdeal® 12 Inch Blue led light Underwater LED Aquarium Light Strip & Airstone for Aquarium Fish...,"fish, reptile",aquarium,other
5,1926454,API 1500 Watt 3 In 1 De-Icer DT15API 1500 Watt 3 In 1 De-Icer DT15API 1500 Watt 3 In 1 De-Icer...,"fish, reptile",aquarium,other
6,2365091,AQUATOP AC/DC Single Battery Operated Air PumpAQUATOP AC/DC Single Battery Operated Air PumpAQUA...,"fish, reptile",aquarium,other


Let's make df_label_pet_other have the same columns as df_label_pet_summary, and take care of the mis-categorized products. Remember them:

- Cluster_6 is for dogs, especially for training. 
- Cluster_7 and 9 are for fish.

In [650]:
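# Make a new dataframe from df_label_pet_other for the summary
# (.copy() so that adding columns below does not write into a view of df_label_pet_other)
df_other_pet = df_label_pet_other[['product_id', 'product_title', 'label_12']].copy()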

# Add 'big_category' column
df_other_pet['big_category'] = 'other'

# Add 'animal' column
df_other_pet['animal'] = 'other'

# Take care of mis-categorized products 
for i in range(len(df_other_pet)):
    # Column_2 is 'label_12'
    if df_other_pet.iloc[i, 2] == 6:
        # Column_3 is 'big_categories'
        df_other_pet.iloc[i, 3] = 'other'
        # Column_4 is 'animal'
        df_other_pet.iloc[i, 4] = 'dog'
    
    elif df_other_pet.iloc[i, 2] in [7, 9]:
        df_other_pet.iloc[i, 3] = 'fish, reptile'
        # Column_4 is 'animal'
        df_other_pet.iloc[i, 4] = 'other'
    
# Replace the category numbers into the names
df_other_pet = df_other_pet.replace({'label_12': other_category_labels_pet})

# Replace the category numbers into the names
df_other_pet = df_other_pet.rename(columns={'label_12': 'small_category'})

print('df_other_pet (first 5 rows):')
df_other_pet.head()

df_other_pet (first 5 rows):


Unnamed: 0,product_id,product_title,small_category,big_category,animal
0,674575,Scotch Pet Hair Roller 839RScotch Pet Hair Roller 839RScotch Pet Hair Roller 839RScotch Pet Hair...,"brush, comb, clipper",other,other
4,1469725,"Fetch for Pets Bulk Nail Files, 6-PackFetch for Pets Bulk Nail Files, 6-PackFetch for Pets Bulk ...",other,other,other
7,2425297,"Parisian Pet I Love Daddy Dog T-Shirt, SmallParisian Pet I Love Daddy Dog T-Shirt, SmallParisian...",other,other,other
18,4493925,RECARO Performance BOOSTER Highback Booster Car Seat - RoseRECARO Performance BOOSTER Highback B...,other,other,other
46,10799804,Single Feather SmudgerSingle Feather SmudgerSingle Feather SmudgerSingle Feather SmudgerSingle F...,other,other,other
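
The row-by-row loop above can also be expressed with boolean masks, which is usually shorter and faster in pandas. The following is a minimal sketch on a made-up toy table; the rows, titles, and the names `toy` and `df_other` are illustrative assumptions, not data or variables from this notebook.

```python
import pandas as pd

# Toy stand-in for df_label_pet_other (hypothetical rows, not real data)
toy = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'product_title': ['hair roller', 'training clicker', 'fish net', 'aquarium filter'],
    'label_12': [0, 6, 7, 9],
})

# Same preparation as in the notebook cell: copy, then add default labels
df_other = toy[['product_id', 'product_title', 'label_12']].copy()
df_other['big_category'] = 'other'
df_other['animal'] = 'other'

# Cluster 6 holds dog (training) products
df_other.loc[df_other['label_12'] == 6, 'animal'] = 'dog'

# Clusters 7 and 9 hold fish products
df_other.loc[df_other['label_12'].isin([7, 9]), 'big_category'] = 'fish, reptile'

print(df_other[['label_12', 'big_category', 'animal']])
```

Because the masks select rows by label value rather than by position, this version does not depend on the column order of the dataframe.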


In [651]:
# Merge df_label_pet_summary and df_other_pet
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df_label_pet_summary = pd.concat([df_label_pet_summary, df_other_pet])

print('df_label_pet_summary (first 5 rows):')
df_label_pet_summary.head()

df_label_pet_summary (first 5 rows):


Unnamed: 0,product_id,product_title,big_category,small_category,animal
1,690871,Petco Brooklyn 55 Gallon Metal Tank StandPetco Brooklyn 55 Gallon Metal Tank StandPetco Brooklyn...,"fish, reptile",other,other
2,1299419,Kaytee Forti Diet Pro Health Guinea Pig FoodKaytee Forti Diet Pro Health Guinea Pig FoodKaytee F...,"bird, rabbit, hamster","rabbit, hamster",other
3,1304120,Amzdeal® 12 Inch Blue led light Underwater LED Aquarium Light Strip & Airstone for Aquarium Fish...,"fish, reptile",aquarium,other
5,1926454,API 1500 Watt 3 In 1 De-Icer DT15API 1500 Watt 3 In 1 De-Icer DT15API 1500 Watt 3 In 1 De-Icer...,"fish, reptile",aquarium,other
6,2365091,AQUATOP AC/DC Single Battery Operated Air PumpAQUATOP AC/DC Single Battery Operated Air PumpAQUA...,"fish, reptile",aquarium,other


df_label_pet_summary now shows the category labels of each product. That's all for the 'other' products!

## 2. Summary

Now we have subcategorized all of the products in the three datasets: dog, cat, and other! Let's merge the three tables into one.

In [665]:
# Merge the three df_label_summary dataframes
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df_label_summary = pd.concat([df_label_dog_summary, df_label_cat_summary, df_label_pet_summary])

# Change the order of the columns
df_label_summary = df_label_summary[['product_id', 'product_title', 'animal', 'big_category', 'small_category']]

# Save the dataframe as a csv file if needed
# df_label_summary.to_csv('df_label_summary.csv')

print('df_label_summary (first 5 rows):')
df_label_summary.head()

df_label_summary (first 5 rows):


Unnamed: 0,product_id,product_title,animal,big_category,small_category
0,119780,"ARK Naturals PRODUCTS for PETS 326066 4-Ounce Breath-Less Chewable Brushless Toothpaste, MiniARK...",dog,"food, treat, treatment",treat
1,202371,"Stella & Chewy's Freeze Dried Dog Food for Adult Dogs, Chicken Patties, 15 Ounce Bag - 2 PackSte...",dog,"food, treat, treatment","food, bowl"
2,291967,Premium Deshedding Brush for Dogs and Cats with Medium to Long Hair | Veterinary Approved | Rugg...,dog,"body care, cleaning","brush, clipper"
3,490904,"Remington Coastal Pet R0206 GRN06 Rope Leash, 72-Inch, GreenRemington Coastal Pet R0206 GRN06 Ro...",dog,"collar, leash","leash, harness"
4,798322,Pet Dog Puppy Nonslip Canvas Sport Shoes Sneaker Boots Rubber Sole Size 5 Blue by MallofusaPet D...,dog,clothes,shoes


Great! This table is what we wanted! Each product has three labels: animal_category, big_category, and small_category. Animal_category has three classes: dog, cat, and other. Each animal class has several big_categories, and each big_category has several small_categories under it.
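
Before relying on a merged table like this, it can help to confirm that every product appears once and carries all three labels. The sketch below shows one way to do that on made-up miniature tables; the frames `dog`, `cat`, and `other` and their rows are hypothetical, and `ignore_index=True` is an optional choice (not used in the notebook cell) that gives the merged table a fresh, unique index.

```python
import pandas as pd

# Hypothetical miniature versions of the three per-animal summary tables
dog = pd.DataFrame({'product_id': [10, 11],
                    'animal': ['dog', 'dog'],
                    'big_category': ['clothes', 'collar, leash'],
                    'small_category': ['shoes', 'leash, harness']})
cat = pd.DataFrame({'product_id': [20, 21],
                    'animal': ['cat', 'cat'],
                    'big_category': ['collar, leash', 'food, treatment'],
                    'small_category': ['collar', 'cat grass']})
other = pd.DataFrame({'product_id': [30],
                      'animal': ['other'],
                      'big_category': ['fish, reptile'],
                      'small_category': ['aquarium']})

# Stack the tables; ignore_index=True renumbers the rows 0..n-1
merged = pd.concat([dog, cat, other], ignore_index=True)

# Count missing labels and repeated product IDs
label_cols = ['animal', 'big_category', 'small_category']
n_missing = int(merged[label_cols].isna().sum().sum())
n_duplicated = int(merged['product_id'].duplicated().sum())

print('missing labels:', n_missing)
print('duplicated product IDs:', n_duplicated)
```

A nonzero count in either check would point to a product that was skipped or labeled twice during the per-animal steps.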

Lastly, let's make a table showing all of the categories and the number of products in each small category. 

In [669]:
# Extract the category columns from df_label_summary, and add a helper column for counting
df_label_categories = df_label_summary[['animal', 'big_category', 'small_category']].assign(count=0)

# Group by the category columns, and count the products in each group
df_label_categories = df_label_categories.groupby(['animal', 'big_category', 'small_category']).count()

print('df_label_categories:')
pd.set_option('display.max_rows', 75)
df_label_categories

df_label_categories:


animal,big_category,small_category,count
cat,"collar, leash",collar,114
cat,"collar, leash","harness, leash",28
cat,"door, cage, carrier, bed",bed,193
cat,"door, cage, carrier, bed","carrier, stroller",107
cat,"door, cage, carrier, bed","door, cage",101
cat,"door, cage, carrier, bed",other,9
cat,"door, cage, carrier, bed","perch, shelf",26
cat,"door, cage, carrier, bed",step,10
cat,"door, cage, carrier, bed",tent,11
cat,"food, treatment",cat grass,20


This table shows which categories are popular. For example, the 'treat' category in the 'food, treat, treatment' big_category of dog products has almost the same number of products as the 'food, bowl' category. Also, if you are interested in a specific category in this table, you can extract the reviews of its products and analyze them closely.
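
As a minimal sketch of that follow-up step, the example below counts products per category and then pulls out the product IDs of one category so that their reviews could be looked up. The table `toy_summary` and its rows are made up for illustration; `groupby(...).size()` is used here as an alternative to the helper-column approach in the cell above.

```python
import pandas as pd

# Hypothetical miniature version of df_label_summary
toy_summary = pd.DataFrame({
    'product_id': [1, 2, 3, 4, 5],
    'animal': ['dog', 'dog', 'dog', 'cat', 'cat'],
    'big_category': ['food, treat, treatment'] * 3 + ['collar, leash'] * 2,
    'small_category': ['treat', 'treat', 'food, bowl', 'collar', 'collar'],
})

# Number of products in each (animal, big_category, small_category) group
counts = (toy_summary
          .groupby(['animal', 'big_category', 'small_category'])
          .size()
          .rename('count')
          .sort_values(ascending=False))

# Number of distinct groups
n_groups = len(counts)

# Product IDs of one category of interest, e.g. dog treats
mask = (toy_summary['animal'] == 'dog') & (toy_summary['small_category'] == 'treat')
treat_ids = toy_summary.loc[mask, 'product_id'].tolist()

print(counts)
print('groups:', n_groups)
print('dog treat product IDs:', treat_ids)
```

The resulting ID list can then be used to filter the original review data with `isin`, assuming the reviews are stored with a matching `product_id` column.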

**Summary:**

In total, 20,403 pet products were subcategorized. Each product has three labels: animal_category, big_category, and small_category. Animal_category has three classes: dog, cat, and other. Each animal class has several big_categories, and each big_category has several small_categories under it. As a result of the subcategorization, the products are classified into 74 groups. This information can be used to classify products on an e-commerce website, and to extract a specific group of products for closer analysis; for example, of the variety of products, the popular products in the group, or consumer needs.