### Summary of Findings Across Different Cluster Sizes    
We inspect the cluster contents for N = 100, 150 and 200 to find the number of clusters that best captures both product variety and meaningful customisation attributes.

In [203]:
import pandas as pd

In [204]:
headers = ['Word 1', 'Word 2', 'Word 3', 'Word 4', 'Word 5', 'Word 6', 'Word 7', 'Word 8', 'Word 9', 'Word 10']

#### Number of Clusters = 100
Observations:  
- Only a few types of glasses appear, so customisation is not sufficiently captured.

Conclusion:
- Too few clusters, leading to overgeneralisation.

In [205]:
cluster_100_df = pd.read_csv("../dataset/others/top_words_for_100.csv")
cluster_100_df = cluster_100_df.iloc[:, 1:]
cluster_100_df.columns = headers

# Glasses show less variety here than with more clusters (e.g. 150 or 200).
filtered_df = cluster_100_df[cluster_100_df.apply(lambda row: (row == "glass").any(), axis=1)]
filtered_df


Unnamed: 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10
30,glass,reading,reader,eyeglass,computer,blocking,light,frame,hinge,blue
42,lens,sunglass,frame,polarized,glass,eyeglass,oakley,replacement,black,mirror
55,sunglass,frame,lens,polarized,fashion,woman,retro,oversized,glass,eye
63,sunglass,polarized,men,driving,frame,sport,glass,lens,for,protection
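The row filter used in these cells can be wrapped in a small helper so the same check is reusable for any combination of words. This is a minimal sketch, not the notebook's own code: `clusters_containing` and `sample_df` are hypothetical names, and `sample_df` holds only the four rows shown in the output above rather than the full CSV.

```python
import pandas as pd


def clusters_containing(df, *words):
    """Return the rows of df whose top words include every word in `words`."""
    required = set(words)
    mask = df.apply(lambda row: required.issubset(set(row)), axis=1)
    return df[mask]


# Four rows copied from the output above; all other clusters are omitted.
headers = [f"Word {i}" for i in range(1, 11)]
sample_df = pd.DataFrame(
    [
        ["glass", "reading", "reader", "eyeglass", "computer",
         "blocking", "light", "frame", "hinge", "blue"],
        ["lens", "sunglass", "frame", "polarized", "glass",
         "eyeglass", "oakley", "replacement", "black", "mirror"],
        ["sunglass", "frame", "lens", "polarized", "fashion",
         "woman", "retro", "oversized", "glass", "eye"],
        ["sunglass", "polarized", "men", "driving", "frame",
         "sport", "glass", "lens", "for", "protection"],
    ],
    index=[30, 42, 55, 63],
    columns=headers,
)

glass_clusters = clusters_containing(sample_df, "glass")
polarized_glass = clusters_containing(sample_df, "glass", "polarized")
print(len(glass_clusters), len(polarized_glass))
```

Passing several words requires all of them to be present, which mirrors the chained `&` conditions used in the later cells.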


#### Number of Clusters = 150
Observations:  
- Compared to 100 clusters, there is a greater variety of glasses, and the clusters have distinctive customisation features.
- A variety of women’s dresses is identified, each with different customisation features (e.g. summer, party, formal, loose).
- Within floral summer dresses for women, there are meaningful distinctions between sleeved styles, sleeveless styles and styles with pockets.

Conclusion:
- A good balance between the number of clusters and the level of customisation.

In [206]:
cluster_150_df = pd.read_csv("../dataset/others/top_words_for_150.csv")
cluster_150_df = cluster_150_df.iloc[:, 1:]
cluster_150_df.columns = headers

# There is a greater variety of glasses, e.g. black frame, sports, night vision, vintage/retro, reading, sunglass.
# filtered_df = cluster_150_df[cluster_150_df.apply(lambda row: (row == "glass").any(), axis=1)].reset_index(drop=True)
# filtered_df

# There is a variety of women's dresses and customised features, e.g. tunic, summer, vintage, party.
# filtered_df = cluster_150_df[cluster_150_df.apply(lambda row: (row == "woman").any() & (row == "dress").any(), axis=1)].reset_index(drop=True)
# filtered_df

# Within items such as floral summer dresses for women, the level of customisation is appropriate,
# e.g. sleeve vs. sleeveless, and dresses with pockets.
filtered_df = cluster_150_df[
    cluster_150_df.apply(lambda row: (row == "woman").any() & (row == "dress").any() & (row == "floral").any() & (row == "summer").any(), axis=1)
]
filtered_df


Unnamed: 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10
37,dress,floral,summer,woman,casual,sleeve,neck,mini,short,long
125,swing,dress,casual,woman,sleeve,loose,pocket,summer,neck,floral
132,beach,dress,summer,maxi,floral,woman,casual,sundress,sleeveless,boho
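One way to make the distinctions between these three clusters explicit is to list, for each cluster, the words that none of the other matched clusters contain. The sketch below is illustrative only: `dress_clusters_150` and `distinctive_words` are hypothetical names, and the word lists are copied from the output above.

```python
# Top words of the three matching clusters, copied from the output above.
dress_clusters_150 = {
    37: ["dress", "floral", "summer", "woman", "casual",
         "sleeve", "neck", "mini", "short", "long"],
    125: ["swing", "dress", "casual", "woman", "sleeve",
          "loose", "pocket", "summer", "neck", "floral"],
    132: ["beach", "dress", "summer", "maxi", "floral",
          "woman", "casual", "sundress", "sleeveless", "boho"],
}


def distinctive_words(clusters):
    """Map each cluster id to the words that appear in no other cluster."""
    result = {}
    for cluster_id, words in clusters.items():
        # Collect every word used by the other clusters.
        others = set()
        for other_id, other_words in clusters.items():
            if other_id != cluster_id:
                others.update(other_words)
        # Keep the original word order for readability.
        result[cluster_id] = [w for w in words if w not in others]
    return result


unique_150 = distinctive_words(dress_clusters_150)
for cluster_id, words in unique_150.items():
    print(cluster_id, words)
```

On these rows, "pocket" is distinctive for cluster 125 and "sleeveless" for cluster 132, which matches the observation that the clusters separate along sleeve and pocket features.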


#### Number of Clusters = 200
Observations:  
- Women's dresses are broken down in more detail, but the new clusters do not add any insightful customisation features.
- Features are mixed within a cluster due to the small cluster size, e.g. sleeve and sleeveless both appear in the same cluster.
- Some clusters show only generic terms, with little differentiation.

Conclusion:
- Too many clusters, leading to over-fragmentation where some clusters lack distinctive features.

In [207]:
cluster_200_df = pd.read_csv("../dataset/others/top_words_for_200.csv")
cluster_200_df = cluster_200_df.iloc[:, 1:]
cluster_200_df.columns = headers

# The extra clusters do not provide any insightful distinction, and clusters overlap.
# Features are mixed within a cluster due to the small cluster size, e.g. sleeve and sleeveless both appear in the same cluster.
filtered_df = cluster_200_df[
    cluster_200_df.apply(lambda row: (row == "woman").any() & (row == "dress").any() & (row == "floral").any() & (row == "summer").any() & (row == "sleeve").any(), axis=1)
]
filtered_df


Unnamed: 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10
31,dress,summer,floral,casual,woman,mini,maxi,sleeve,neck,sleeveless
139,swing,dress,summer,floral,casual,woman,beach,sundress,sleeveless,sleeve
145,floral,dress,woman,print,sleeve,maxi,blue,casual,summer,short
146,midi,dress,woman,bodycon,sleeve,casual,neck,floral,summer,sleeveless
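The "mixed features" observation can be checked mechanically by flagging clusters whose top words contain both members of a mutually exclusive pair. This is a rough sketch under stated assumptions: `dress_clusters_200`, `CONFLICTING_PAIRS` and `mixed_clusters` are hypothetical names, the word lists are copied from the output above, and treating mini/maxi as a conflicting pair is an added assumption (only sleeve/sleeveless is named in the observations).

```python
# Top words of the four matching clusters, copied from the output above.
dress_clusters_200 = {
    31: ["dress", "summer", "floral", "casual", "woman",
         "mini", "maxi", "sleeve", "neck", "sleeveless"],
    139: ["swing", "dress", "summer", "floral", "casual",
          "woman", "beach", "sundress", "sleeveless", "sleeve"],
    145: ["floral", "dress", "woman", "print", "sleeve",
          "maxi", "blue", "casual", "summer", "short"],
    146: ["midi", "dress", "woman", "bodycon", "sleeve",
          "casual", "neck", "floral", "summer", "sleeveless"],
}

# Word pairs assumed to describe mutually exclusive features.
# ("mini", "maxi") is an illustrative assumption, not taken from the analysis.
CONFLICTING_PAIRS = [("sleeve", "sleeveless"), ("mini", "maxi")]


def mixed_clusters(clusters, pairs):
    """Map cluster id to the conflicting pairs found among its top words."""
    mixed = {}
    for cluster_id, words in clusters.items():
        present = set(words)
        hits = [pair for pair in pairs if pair[0] in present and pair[1] in present]
        if hits:
            mixed[cluster_id] = hits
    return mixed


mixed_200 = mixed_clusters(dress_clusters_200, CONFLICTING_PAIRS)
for cluster_id, hits in mixed_200.items():
    print(cluster_id, hits)
```

On these rows, clusters 31, 139 and 146 each contain both "sleeve" and "sleeveless", while cluster 145 does not.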


### Conclusion
We choose 150 clusters for MiniBatchKMeans clustering: of the three sizes inspected, it gives the best balance between the number of clusters and the level of customisation captured.