In [None]:
import json
import os
import pandas as pd
import gzip
import itertools
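try:
  from bertopic import BERTopic
except ImportError:
  # Fresh Colab runtime: install BERTopic, then kill the process so the
  # runtime restarts and picks up the newly installed package.
  !pip install bertopic[all]
  os.kill(os.getpid(), 9)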

#The Data
Prof. Julian McAuley at UC-San Diego has graciously let me use his “Amazon Product Data” database, which contains a large amount of data about Amazon products. We will use two of its datasets: (1) product metadata and (2) product reviews. The full database has reviews for all types of Amazon products, but those datasets are huge (~80 GB).

I’ve picked two smaller datasets that are limited to the “Clothing, Shoes & Jewelry” category: (1) metadata about products in that category and (2) reviews of products in that category.

The product data:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#Extracting the Data
First, you’ll need to identify the ASINs associated with products from your brand, Nike. To do that you’ll leverage dataset #1. Next, once you have a list of ASINs, you’ll use dataset #2 to extract reviews that match those ASINs. 



In [None]:
asins = []

# To run this code, you will need to download the metadata file from the course
# assets and upload it to your Google Drive. See the notes about that file
# regarding how it was processed from the original file into json-l format.

with gzip.open("drive/MyDrive/meta_Clothing_Shoes_and_Jewelry.jsonl.gz") as products:
    for product in products:
        data = json.loads(product)
        categories = [c.lower() for c in
                      list(itertools.chain(*data.get("categories", [])))]
        if "nike" in categories:
            asins.append(data["asin"])
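
The category check above relies on `categories` being a list of lists. As a minimal sketch of that flattening step, the record below is made up for illustration (the ASIN and category labels are hypothetical, not taken from the dataset), and `flat_categories` is a helper name introduced here:

```python
import itertools

# Hypothetical record shaped like one metadata line: "categories" is a list of lists.
record = {
    "asin": "EXAMPLE0001",  # made-up ASIN for illustration only
    "categories": [
        ["Clothing, Shoes & Jewelry", "Men", "Shoes"],
        ["Clothing, Shoes & Jewelry", "N", "Nike"],
    ],
}

def flat_categories(data):
    """Return every category label in the record, lowercased, as a set."""
    nested = data.get("categories", [])
    return {label.lower() for label in itertools.chain.from_iterable(nested)}

labels = flat_categories(record)
is_nike = "nike" in labels  # exact label match, not a substring search
print(is_nike)
```

Because the test is an exact match on a label, a category such as "Nike Shoes" would not count as Nike under this approach.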

In [None]:
asinList = []
review_corpus = []
# Use a set for the membership test below: scanning a list of ~8,000 ASINs
# for every review line is needlessly slow.
asin_set = set(asins)
with gzip.open("drive/MyDrive/reviews_Clothing_Shoes_and_Jewelry.json.gz") as reviews:
    for review in reviews:
        data = json.loads(review)
        if data["asin"] in asin_set:
            review_corpus.append(data["reviewText"])
            asinList.append(data["asin"])
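
The read-and-filter pattern used in the two cells above can be sketched on its own with a small temporary file, so it runs without the course data. The helper name `iter_jsonl_gz` and the sample reviews are assumptions made for this illustration:

```python
import gzip
import json
import os
import tempfile

def iter_jsonl_gz(path):
    """Yield one parsed JSON object per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)

# Made-up reviews written to a temporary file so the sketch runs anywhere.
sample_reviews = [
    {"asin": "A1", "reviewText": "Great shoes"},
    {"asin": "B2", "reviewText": "Not a Nike product"},
    {"asin": "A1", "reviewText": "Fits well"},
]
tmp_path = os.path.join(tempfile.mkdtemp(), "sample_reviews.json.gz")
with gzip.open(tmp_path, "wt", encoding="utf-8") as handle:
    for row in sample_reviews:
        handle.write(json.dumps(row) + "\n")

wanted = {"A1"}  # a set gives constant-time membership checks
matched = [r["reviewText"] for r in iter_jsonl_gz(tmp_path) if r["asin"] in wanted]
print(matched)
```

Streaming line by line keeps memory use flat even when the review file is large, and only the matching review texts are held in memory.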

In [None]:

print("There are this amount of ASINs for Nike:", len(asins))
asins[0:10]

There are this amount of ASINs for Nike: 8327


['B0000V9K32',
 'B0000V9K3W',
 'B0000V9K46',
 'B0000V9KNM',
 'B0000V9KRI',
 'B0000V9KRS',
 'B00012O2LA',
 'B00012O2MO',
 'B00012O2RY',
 'B00012O2R4']

In [None]:
print("There are this amount of reviews for those ASINs: ", len(review_corpus))
review_corpus[0:10]


There are this amount of reviews for those ASINs:  21570


['the colour i received is not blue as shown but yellow.Couldnt change it because it was a birthday present for my daughter and havent got time.She really didn,t like it',
 'Very cute and is really practical. Fits better on smaller wrists which is my case. I wear them everywhere. I really love this watch!',
 'The watch was exactly what i ordered and I got it very fast. Unfortunately it was a bit too big for my wrist.  I returned it for a refund without any problems.',
 'This product came promptly and as described, pleasure doing business with them!-d',
 "Why isn't Nike making these anymore?  I love this watch, and I get a lot of compliments, questions from people who would like to have one as well.",
 'good price, very good material and excellent design, very useful for traveling, totally recomendation this use this product, to buy this',
 "I mean, Roxy rocks, but I'm kinda dissapointed with the material. The purse lokks a lil' bit cheap.",
 'I love this watch, i use every day, every w

This dataset contains 8,327 unique Nike-related ASINs and 21,570 reviews for those products.

#Performing Topic Modeling
Now that you’ve identified Nike’s product ASINs and extracted the relevant reviews, you’ll need to perform topic modeling on the review text. Using one of the popular methods demonstrated in class (e.g., k-means or LDA), perform topic modeling on the data to reveal the most popular topics.

Visualize those topics. From a grade standpoint, we’ll be looking particularly at how logical the topics are: can we read them and generally understand what each one represents? There is no minimum or maximum number of topics. Instead, I want you to tweak this parameter until the topics make the most sense to you.

Here’s a list of Nike topics that aren’t terrible:


In [None]:
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(review_corpus)

2023-03-06 16:47:22,220 - BERTopic - Transformed documents to Embeddings
2023-03-06 16:48:01,023 - BERTopic - Reduced dimensionality
2023-03-06 16:48:30,150 - BERTopic - Clustered reduced embeddings


In [None]:
freq = topic_model.get_topic_info()
freq.head(50)

Unnamed: 0,Topic,Count,Name
0,-1,7517,-1_shoes_shoe_and_them
1,0,1144,0_watch_it_band_wrist
2,1,989,1_nike_nikes_of_as
3,2,641,2_socks_sock_are_they
4,3,521,3_running_run_shoes_miles
5,4,505,4_bag_gym_clothes_carry
6,5,415,5_product_item_it_was
7,6,375,6_sneakers_these_are_fit
8,7,365,7_he_son_loves_old
9,8,353,8_fit_perfectly_shoes_expected
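
In BERTopic, topic -1 collects the documents that were not assigned to any cluster (outliers). As a rough sketch, the share of reviews that ended up there can be computed from the counts shown in the table, assuming the 21,570 reviews reported earlier are exactly the documents that were passed to the model:

```python
# Counts copied from the first rows of the topic table above.
topic_counts = {-1: 7517, 0: 1144, 1: 989, 2: 641, 3: 521}
total_reviews = 21570  # number of Nike reviews reported earlier in the notebook

# Fraction of all reviews that BERTopic left in the outlier bucket.
outlier_share = topic_counts[-1] / total_reviews
print(f"{outlier_share:.1%} of reviews are in topic -1")
```

A large outlier share is worth keeping in mind when reading the topic list, since those reviews are not represented by any of the named topics.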


When we preview the top 50 topics, we can see they are mostly grouped by product type. Some topics are also interesting conceptually:

*   **Topic 10** reminds us that Amazon and Nike have a global footprint. These reviews are most likely in Spanish and/or Portuguese, since their representative words are muy, de, el, and la.
*   **Topic 26** is one we would want to investigate for brand protection. Its representative words are fake, authentic, real, and fakes.



In [None]:
topic_model.visualize_hierarchy(top_n_topics=50)

In [None]:
print(list(zip(*topic_model.get_topic(7)))[0])
print(list(zip(*topic_model.get_topic(24)))[0])
print(list(zip(*topic_model.get_topic(15)))[0])
print(list(zip(*topic_model.get_topic(37)))[0])


('he', 'son', 'loves', 'old', 'his', 'year', 'shoes', 'him', 'my', 'grandson')
('he', 'wears', 'pair', 'them', 'loves', 'son', 'she', 'bought', 'every', 'loved')
('them', 'gift', 'he', 'loves', 'son', 'she', 'loved', 'got', 'christmas', 'happy')
('fit', 'he', 'gift', 'they', 'them', 'perfect', 'loves', 'she', 'son', 'christmas')
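
The four topics printed above appear to overlap heavily. One way to check that, sketched here with the word lists copied from the output, is to intersect them and to see which topics contain explicit gift terms. The `gift_terms` set is an assumption chosen for this illustration:

```python
# Top words per topic, copied from the printed output above.
topic_words = {
    7: ('he', 'son', 'loves', 'old', 'his', 'year', 'shoes', 'him', 'my', 'grandson'),
    24: ('he', 'wears', 'pair', 'them', 'loves', 'son', 'she', 'bought', 'every', 'loved'),
    15: ('them', 'gift', 'he', 'loves', 'son', 'she', 'loved', 'got', 'christmas', 'happy'),
    37: ('fit', 'he', 'gift', 'they', 'them', 'perfect', 'loves', 'she', 'son', 'christmas'),
}

# Words that appear in the top-10 list of every one of the four topics.
shared = set.intersection(*(set(words) for words in topic_words.values()))

# Hypothetical keyword set used to spot explicitly gift-related topics.
gift_terms = {"gift", "christmas"}
topics_with_gift_terms = sorted(
    topic for topic, words in topic_words.items() if gift_terms & set(words)
)

print(sorted(shared))
print(topics_with_gift_terms)
```

The shared words suggest all four topics are about buying for someone else, while only some of them name a gift or holiday directly.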


In [None]:
df = pd.DataFrame({'topic': topics, 'review': review_corpus, 'asin': asinList})
# Keep only the reviews assigned to the four gift-related topics.
df = df[df.topic.isin([7, 24, 15, 37])]
df

Unnamed: 0,topic,review,asin
101,37,I purchased these as a gift for my son. He lo...,B0006NGUE6
205,24,My dad loved his father day gift. Perfect size...,B0007RADZ8
248,37,Did not fit so they have been returned as of ...,B0007RADZ8
332,7,I purchased this as a gift for Christmas so I ...,B0007RADZ8
337,24,It was for my hubby.He love them.His favorite ...,B0007RADZ8
...,...,...,...
21491,7,Great shoes and our son believes that they are...,B00JJ8NXXK
21493,7,My Grand-son loved these shoes.,B00JLR0FYY
21531,37,They fit like a glove and need no peds with th...,B00JZSNWEO
21549,7,"These shoes fit well, my son likes the color.",B00K8CLZTU


In [None]:
suggestions = df.asin.value_counts()
suggestions[0:10]

B000V4YZ1K    18
B007FXKMLW    13
B0098G7Q1S    12
B001LDDR0K     9
B001V6PNRW     8
B0013UXIAK     7
B004IM1GHW     6
B001YYKGK0     6
B0007RADZ8     5
B00C8P9K2E     5
Name: asin, dtype: int64

#Conclusions

*   These are the ten most-reviewed products within the gift-giving topics (7, 15, 24, and 37).
*   Many of the reviews mention holidays such as Christmas or Father’s Day, and children are mentioned frequently.
*   We recommend increasing marketing spend on these 10 products ahead of Father’s Day, Christmas, Mother’s Day, and the start of the school year, targeting shoppers looking for gift ideas.
*   The list could be refined further with sentiment analysis, for example by flagging words like 'return' or 'hated', to confirm that these gifts were well received.
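
The refinement suggested in the last bullet could start as a simple keyword screen. The sketch below is not real sentiment analysis: the keyword list, the placeholder ASINs, and the sample reviews are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical keyword list; a real analysis would use a sentiment model instead.
NEGATIVE_WORDS = {"return", "returned", "hated"}

def has_negative_word(text):
    """Return True if any word in the review matches the negative keyword list."""
    words = {w.strip(".,!?'\"").lower() for w in text.split()}
    return bool(words & NEGATIVE_WORDS)

# Made-up (asin, review) pairs standing in for the filtered gift reviews.
reviews = [
    ("ASIN_A", "My son loved these shoes."),
    ("ASIN_A", "Did not fit so they have been returned."),
    ("ASIN_B", "Perfect Christmas gift, he loves them!"),
]

flagged = Counter(asin for asin, text in reviews if has_negative_word(text))
totals = Counter(asin for asin, _ in reviews)

# Share of each product's reviews that contain a negative keyword.
flag_rate = {asin: flagged[asin] / totals[asin] for asin in totals}
print(flag_rate)
```

Products with a high flag rate could then be dropped from the recommendation list or reviewed by hand before any marketing spend is committed.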


