<a href="https://colab.research.google.com/github/TOM-BOHN/MsDS-product-review-topic-modeling/blob/main/product_review_topic_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Product Review Topic Modeling
**Thomas Bohn**   --   **2023-09-18**

This report models topics in Amazon text reviews of clothing and shoes manufactured by Nike. It uses unsupervised clustering methods to surface the most common topics in the review data.

--  [Main Report](https://github.com/TOM-BOHN/MsDS-product-review-topic-modeling/blob/main/product_review_topic_modeling.ipynb)  --  [Github Repo](https://github.com/TOM-BOHN/MsDS-product-review-topic-modeling)  --  [Presentation Slides](tbd)  --  [Presentation Video](tbd) --

# 1.&nbsp;Introduction

## Python Libraries

The following Python libraries are used in this notebook.

In [1]:
# File Connection and File Manipulation
import os
import time
from time import sleep
import pickle
import json
import glob
# Basic Data Science Toolkits
import pandas as pd
import numpy as np
import math
# Basic Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt

## Verify GPU Runtime

In [2]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

/bin/bash: line 1: nvidia-smi: command not found
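
The output above shows the bash "command not found" message rather than "Not connected to a GPU", because that message does not contain the string 'failed'. A minimal sketch of a check that also covers a missing executable, written in plain Python so it also runs outside IPython. The helper name `describe_gpu` is hypothetical and not part of the notebook.

```python
import shutil
import subprocess

def describe_gpu(tool="nvidia-smi"):
    """Return the tool's report, or None if no GPU runtime is detected."""
    # shutil.which returns None when the executable is not on PATH
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool], capture_output=True, text=True)
    # a non-zero exit code usually means the driver could not be reached
    if result.returncode != 0:
        return None
    return result.stdout

gpu_report = describe_gpu()
print(gpu_report if gpu_report else "Not connected to a GPU")
```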


In [3]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 13.6 gigabytes of available RAM

Not using a high-RAM runtime


## Mount Google Drive

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Setup Directories

In [32]:
ROOT_DIR = "/content/drive/MyDrive/MSDS_marketing_text_analytics/master_files/2_topic_modeling"
DATA_DIR = "%s/data" % ROOT_DIR
EVAL_DIR = "%s/evaluation" % ROOT_DIR
MODEL_DIR = "%s/models" % ROOT_DIR

#Create missing directories, if they don't exist
for label, path in [("data", DATA_DIR), ("evaluation", EVAL_DIR), ("model", MODEL_DIR)]:
  if not os.path.exists(path):
    # Create a new directory because it does not exist
    os.makedirs(path)
    print("The %s directory is created!" % label)

In [18]:
META_FILE_PATH = '%s/meta_Clothing_Shoes_and_Jewelry.jsonl.gz' % DATA_DIR
REVIEW_FILE_PATH = '%s/reviews_Clothing_Shoes_and_Jewelry.json.gz' % DATA_DIR

# 2.&nbsp;Data Source

## Copy Data From Source

In [43]:
#!wget <URL> -P <COLAB PATH>
#source_url = 'http://128.138.93.164/meta_Clothing_Shoes_and_Jewelry.json.gz' # true source, need better link
source_url = 'https://docs.google.com/uc?export=download&id=12cPbdNpQ6Dmqg25Fb0kAxFSEug-8t3gc&confirm=t' # local source, working for testing
dest_path = '%s/meta_Clothing_Shoes_and_Jewelry.jsonl.gz' % DATA_DIR
!wget "$source_url" -O "$dest_path"


--2023-09-18 22:13:46--  https://docs.google.com/uc?export=download&id=12cPbdNpQ6Dmqg25Fb0kAxFSEug-8t3gc&confirm=t
Resolving docs.google.com (docs.google.com)... 108.177.126.138, 108.177.126.139, 108.177.126.113, ...
Connecting to docs.google.com (docs.google.com)|108.177.126.138|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0o-58-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/7qeuhp3jkprbc4o70snmnq820i7po11u/1695075225000/15741694635513001712/*/12cPbdNpQ6Dmqg25Fb0kAxFSEug-8t3gc?e=download&uuid=df5822a8-2268-40fd-a6f4-bb63239e6a05 [following]
--2023-09-18 22:13:47--  https://doc-0o-58-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/7qeuhp3jkprbc4o70snmnq820i7po11u/1695075225000/15741694635513001712/*/12cPbdNpQ6Dmqg25Fb0kAxFSEug-8t3gc?e=download&uuid=df5822a8-2268-40fd-a6f4-bb63239e6a05
Resolving doc-0o-58-docs.googleusercontent.com (doc-0o-58-docs.googleusercontent.com)... 142.251.18.1

In [44]:
#!wget <URL> -P <COLAB PATH>
#source_url = 'http://128.138.93.164/reviews_Clothing_Shoes_and_Jewelry.json.gz' # true source, need better link
source_url = "https://docs.google.com/uc?export=download&id=12detwlesuD7S-O8i9w4LOii1DWML0i7Q&confirm=t" # local source, working for testing
dest_path = '%s/reviews_Clothing_Shoes_and_Jewelry.json.gz' % DATA_DIR
file_name = 'reviews_Clothing_Shoes_and_Jewelry.json.gz'
print(dest_path)
!wget "$source_url" -O "$dest_path"

/content/drive/MyDrive/MSDS_marketing_text_analytics/master_files/2_topic_modeling/data/reviews_Clothing_Shoes_and_Jewelry.json.gz
--2023-09-18 22:14:37--  https://docs.google.com/uc?export=download&id=12detwlesuD7S-O8i9w4LOii1DWML0i7Q&confirm=t
Resolving docs.google.com (docs.google.com)... 108.177.126.138, 108.177.126.102, 108.177.126.100, ...
Connecting to docs.google.com (docs.google.com)|108.177.126.138|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0o-58-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/bhu56oc1bj6feidn253emcrplo7110r7/1695075225000/15741694635513001712/*/12detwlesuD7S-O8i9w4LOii1DWML0i7Q?e=download&uuid=ba539796-3687-4d5a-b0d9-600b1016498d [following]
--2023-09-18 22:14:38--  https://doc-0o-58-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/bhu56oc1bj6feidn253emcrplo7110r7/1695075225000/15741694635513001712/*/12detwlesuD7S-O8i9w4LOii1DWML0i7Q?e=download&uuid=ba53979

In [45]:
meta_file_path = '%s/meta_Clothing_Shoes_and_Jewelry.jsonl.gz' % DATA_DIR
review_file_path = '%s/reviews_Clothing_Shoes_and_Jewelry.json.gz' % DATA_DIR

!gzip -d "$meta_file_path"
!gzip -d "$review_file_path"
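
As an alternative to decompressing on disk with `gzip -d`, the archives could be streamed line by line from Python. A minimal sketch, assuming the files are plain text with one record per line; `iter_gz_lines` is a hypothetical helper, not part of the notebook.

```python
import gzip

def iter_gz_lines(path, encoding="utf-8"):
    """Yield the lines of a gzip-compressed text file without unpacking it."""
    # mode 'rt' decompresses on the fly and decodes bytes to str
    with gzip.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Looping over `iter_gz_lines(dest_path)` would stand in for the `gzip -d` and `open` steps, at the cost of decompressing again on every pass over the file.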

## Load the Data

In [47]:
##this assigns the path of the file we're trying to load to a string variable, then opens it for reading
meta_file_path = '%s/meta_Clothing_Shoes_and_Jewelry.jsonl' % DATA_DIR
loadedjson = open(meta_file_path, 'r')

In [48]:
#The data used in this script comes from: http://jmcauley.ucsd.edu/data/amazon/links.html
#The data here is the 'per category' data for Clothing, Shoes and Jewelry
#use the above url to better understand the data, where it came from, and some
#tips on how to use it!

#getting reviews is going to be a two-step process:
#1) go through the Amazon product catalog for "Clothing, Shoes and Jewelry"
#and extract matching products by their ASIN
#2) go through the review data and parse out the matching reviews by ASIN

#1) - Extracting ASINs by brand
#First, let's iterate through the data and store it as a python dictionary

#let's set a counter to see how many products we have in the json
count = 0
start_time = time.time()
#loading the json file
#dictionaries must always be initialized before we can use them
allproducts = {}

#each line of data here is a product and its metadata
print('loading product data to dictionary:')
for aline in loadedjson:
    #creating a counter to know our progress in processing the entire catalog
    count += 1
    if count % 100000 == 0:
        #we're only going to print our count every 100k, this way we don't spam
        #our output console
        current_runtime = round(time.time() - start_time,3)
        print('[-] current progress:', count, 'and a runtime of', current_runtime, 'seconds.')
    #interestingly enough, this data isn't true JSON; instead, each line is a
    #python dictionary that has essentially been printed as text. It's odd, but
    #if we read the documentation, all we need to do to load a dictionary is use
    #the eval function. https://www.programiz.com/python-programming/methods/built-in/eval
    #eval takes whatever string is passed to it, interprets it as python code,
    #and runs it, so it should only be used on data from a trusted source.
    #Here, it's exactly what we need to interpret a printed python dictionary

    aproduct = eval(aline)

    #making a dictionary entry with the ASIN of the product as the key
    #and its metadata as nested dictionaries
    allproducts[aproduct['asin']] = aproduct

loading product data to dictionary:
[-] current progress: 100000 and a runtime of 14.454 seconds.
[-] current progress: 200000 and a runtime of 29.4 seconds.
[-] current progress: 300000 and a runtime of 45.329 seconds.
[-] current progress: 400000 and a runtime of 63.058 seconds.
[-] current progress: 500000 and a runtime of 79.572 seconds.
[-] current progress: 600000 and a runtime of 95.584 seconds.
[-] current progress: 700000 and a runtime of 113.943 seconds.
[-] current progress: 800000 and a runtime of 134.295 seconds.
[-] current progress: 900000 and a runtime of 152.333 seconds.
[-] current progress: 1000000 and a runtime of 171.838 seconds.
[-] current progress: 1100000 and a runtime of 192.491 seconds.
[-] current progress: 1200000 and a runtime of 211.064 seconds.
[-] current progress: 1300000 and a runtime of 235.86 seconds.
[-] current progress: 1400000 and a runtime of 255.625 seconds.
[-] current progress: 1500000 and a runtime of 276.365 seconds.


In [49]:
#print a summary of the records processed
allproducts_length = len(allproducts)
current_runtime = round(time.time() - start_time,3)
print('Process completed for', count, 'of', allproducts_length, 'records with a final runtime of', current_runtime, 'seconds.')

Process completed for 1503384 of 1503384 records with a final runtime of 277.317 seconds.
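
Because `eval` executes whatever expression it is given, a more conservative way to parse these lines is `ast.literal_eval`, which accepts only Python literals (dicts, lists, strings, numbers, and so on). A minimal sketch, assuming every line is a single dict literal. The helper `parse_product_line` and the sample record are hypothetical, and this variant may run slower than `eval` on the full catalog.

```python
import ast

def parse_product_line(aline):
    """Parse one printed Python dict without executing arbitrary code."""
    record = ast.literal_eval(aline.strip())
    if not isinstance(record, dict):
        raise ValueError("expected a dict literal, got %s" % type(record).__name__)
    return record

# hypothetical sample line in the same printed-dict style as the catalog
sample = "{'asin': 'B000TEST01', 'categories': [['Clothing, Shoes & Jewelry', 'Nike']]}"
aproduct = parse_product_line(sample)
print(aproduct['asin'])
```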


## Summarize the Categories

In [50]:
#Next we need to explore the product data to see what categories are common in the
#data. As you'll learn, product categories are loosely defined: they can be
#product categories (e.g., baby, house and home), or they can be brands!
#We're already dealing with a subset of the product categories: Clothing, Shoes
#and Jewelry. We still need to find a list of product ids for our specific
#brand. To do this, we're going to use the 'categories' metadata field to find
#the brand

##Let's create a dictionary of all the product subcategories
#and, by doing so, also come up with a list of brands and the number of products
#they have listed in the Amazon product catalog

allcategories = {}
count = 0
start_time = time.time()

#each line of data here is a product and its metadata
print('loading categories data to dictionary:')
for aproduct in allproducts:
    #creating a counter to know our progress in processing the entire catalog
    count += 1
    if count % 100000 == 0:
        #we now know there are 1.5 million products, so we can report progress
        #as a percentage of the catalog. When the counter reaches 100 %,
        #we're done!
        current_progress = int(round(count/allproducts_length,2)*100)
        current_runtime = round(time.time() - start_time,3)
        print('[-] current progress:', current_progress, '%', 'and a runtime of', current_runtime, 'seconds.')

    #setting a dict up with just one product, so we can inspect and ref it
    aproduct = allproducts[aproduct]
    #creating a dictionary entry for each product category
    #also counting the occurrences of each category
    if 'categories' in aproduct:
        for categories in aproduct['categories']:
            for acategory in categories:
                if acategory in allcategories:
                    allcategories[acategory] += 1
                else:
                    allcategories[acategory] = 1

loading categories data to dictionary:
[-] current progress: 7 % and a runtime of 0.801 seconds.
[-] current progress: 13 % and a runtime of 1.545 seconds.
[-] current progress: 20 % and a runtime of 2.029 seconds.
[-] current progress: 27 % and a runtime of 2.492 seconds.
[-] current progress: 33 % and a runtime of 2.947 seconds.
[-] current progress: 40 % and a runtime of 3.423 seconds.
[-] current progress: 47 % and a runtime of 3.853 seconds.
[-] current progress: 53 % and a runtime of 4.283 seconds.
[-] current progress: 60 % and a runtime of 4.722 seconds.
[-] current progress: 67 % and a runtime of 5.135 seconds.
[-] current progress: 73 % and a runtime of 5.583 seconds.
[-] current progress: 80 % and a runtime of 5.983 seconds.
[-] current progress: 86 % and a runtime of 6.392 seconds.
[-] current progress: 93 % and a runtime of 6.81 seconds.
[-] current progress: 100 % and a runtime of 7.197 seconds.


In [51]:
#print a summary of the categories processed
allcategories_length = len(allcategories)
current_runtime = round(time.time() - start_time,3)
print('Process completed for', allcategories_length, 'categories with a final runtime of', current_runtime, 'seconds.')

Process completed for 2773 categories with a final runtime of 7.224 seconds.
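
The counting loop above can also be expressed with `collections.Counter`, which handles the first-seen case automatically. A minimal sketch on a hypothetical three-product catalog shaped like `allproducts`; the sample data is illustrative only.

```python
from collections import Counter

# hypothetical miniature catalog with the same nested-list 'categories' layout
sample_products = {
    'A1': {'asin': 'A1', 'categories': [['Clothing', 'Nike'], ['Shoes']]},
    'B2': {'asin': 'B2', 'categories': [['Shoes', 'adidas']]},
    'C3': {'asin': 'C3'},  # products without categories are skipped
}

category_counts = Counter(
    acategory
    for product in sample_products.values()
    for categories in product.get('categories', [])
    for acategory in categories
)
print(category_counts['Shoes'])  # 2
```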


In [58]:
#create a sorted list of categories
sortedlist = []
#convert the dictionary to a list of tuples
for acategory in allcategories:
  sortedlist.append((allcategories[acategory],acategory))
#sort the list
sortedlist = sorted(sortedlist, reverse=True)
#print the top x records in the list
top_n = 20
for item in range(0,top_n):
  print('[',str(item).zfill(2),']', sortedlist[item])

[ 00 ] (3429257, 'Clothing, Shoes & Jewelry')
[ 01 ] (1086181, 'Women')
[ 02 ] (617092, 'Clothing')
[ 03 ] (541681, 'Men')
[ 04 ] (537761, 'Novelty, Costumes & More')
[ 05 ] (432653, 'Shoes')
[ 06 ] (339900, 'Novelty')
[ 07 ] (268065, 'Shoes & Accessories: International Shipping Available')
[ 08 ] (255454, 'Jewelry')
[ 09 ] (174962, 'Accessories')
[ 10 ] (97095, 'Girls')
[ 11 ] (93596, 'Tops & Tees')
[ 12 ] (87688, 'Dresses')
[ 13 ] (84549, 'T-Shirts')
[ 14 ] (82063, 'Boots')
[ 15 ] (80302, 'Shirts')
[ 16 ] (79897, 'Sandals')
[ 17 ] (79545, 'Watches')
[ 18 ] (77684, 'Boys')
[ 19 ] (73507, 'Jewelry: International Shipping Available')
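
Sorting `(count, name)` tuples works, but it breaks ties by category name in reverse alphabetical order and sorts every category just to show the top 20. A minimal sketch of an alternative using `heapq.nlargest` with an explicit key; the counts below are hypothetical stand-ins for `allcategories`.

```python
import heapq

# hypothetical counts standing in for allcategories
sample_counts = {'Women': 5, 'Men': 3, 'Shoes': 4, 'Boots': 1}

top_n = 3
# nlargest returns the top_n (name, count) pairs ordered by count, largest first
top_categories = heapq.nlargest(top_n, sample_counts.items(), key=lambda kv: kv[1])
for rank, (name, total) in enumerate(top_categories):
    print('[', str(rank).zfill(2), ']', (total, name))
```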


In [53]:
nike_categories = allcategories['Nike']
print(nike_categories, 'product records for Nike.')

8327 product records for Nike.


## Extract a List of Product Ids

In [54]:
#Now, go ahead and use the Variable Explorer in Spyder to locate a brand
#that has a lot of product entries! Alternatively, type allcategories['Brand name']
#to get a count for a specific brand. For instance:
#>>allcategories['Nike']
#>> 8327
#>>allcategories['adidas']
#>> 8645

#I'd recommend at least 1.5k products, but you're welcome to try smaller counts.
#All I care about is whether you have at least 2k reviews when it's all said and done


##Now we need to go through our first dictionary (allproducts) and extract
##the matching ASINs for Nike

##First, create a set where we will store our ASINs
##We choose a set here because we don't want duplicates
allnikeasins = set()
count = 0
start_time = time.time()

for anasin in allproducts:
    theproduct = allproducts[anasin]
    count += 1
    if count % 100000 == 0:
        current_progress = int(round(count/allproducts_length,2)*100)
        current_runtime = round(time.time() - start_time,3)
        print('[-] current progress:', current_progress, '%', 'and a runtime of', current_runtime, 'seconds.')

    #let's iterate over each category for a product; again, any given product
    #can be assigned multiple product categories
    for categories in theproduct.get('categories', []):
        #each category is actually encoded as a list (even though it should
        #just be a string), so we need to iterate one more time
        for acategory in categories:
            #checking to see if the product category matches Nike
            #lowercasing the category string in case capitalization might get
            #in the way of a match
            if 'nike' in acategory.lower():
                #let's go ahead and store it to our set of Nike ASINs
                allnikeasins.add(theproduct['asin'])

[-] current progress: 7 % and a runtime of 0.311 seconds.
[-] current progress: 13 % and a runtime of 0.61 seconds.
[-] current progress: 20 % and a runtime of 0.932 seconds.
[-] current progress: 27 % and a runtime of 1.244 seconds.
[-] current progress: 33 % and a runtime of 1.558 seconds.
[-] current progress: 40 % and a runtime of 1.858 seconds.
[-] current progress: 47 % and a runtime of 2.165 seconds.
[-] current progress: 53 % and a runtime of 2.471 seconds.
[-] current progress: 60 % and a runtime of 2.753 seconds.
[-] current progress: 67 % and a runtime of 3.037 seconds.
[-] current progress: 73 % and a runtime of 3.338 seconds.
[-] current progress: 80 % and a runtime of 3.63 seconds.
[-] current progress: 86 % and a runtime of 3.911 seconds.
[-] current progress: 93 % and a runtime of 4.201 seconds.
[-] current progress: 100 % and a runtime of 4.649 seconds.


In [55]:
#print a summary of the ASINs extracted
allnikeasins_length = len(allnikeasins)
current_runtime = round(time.time() - start_time,3)
print('Process completed for', allnikeasins_length, 'records with a final runtime of', current_runtime, 'seconds.')

Process completed for 8327 records with a final runtime of 4.684 seconds.
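
The nested loops above reduce to a single set comprehension. A minimal sketch on a hypothetical miniature catalog; `asins_for_brand` is an illustrative helper. Note that the substring test also matches any category that merely contains 'nike', which is the same behavior as the loop.

```python
# hypothetical catalog shaped like allproducts
sample_products = {
    'A1': {'asin': 'A1', 'categories': [['Clothing', 'Nike']]},
    'B2': {'asin': 'B2', 'categories': [['Shoes', 'adidas']]},
    'C3': {'asin': 'C3', 'categories': [['Shoes'], ['NIKE Golf']]},
}

def asins_for_brand(products, brand):
    """Collect ASINs whose categories mention the brand, ignoring case."""
    needle = brand.lower()
    return {
        product['asin']
        for product in products.values()
        if any(
            needle in acategory.lower()
            for categories in product.get('categories', [])
            for acategory in categories
        )
    }

print(sorted(asins_for_brand(sample_products, 'Nike')))  # ['A1', 'C3']
```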


In [56]:
# write the ASINs out to a file as a checkpoint
with open('%s/allasins.txt' % DATA_DIR, 'w') as outputfile:
    outputfile.write(','.join(allnikeasins))
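
To resume from this checkpoint in a later session, the comma-separated file can be read back into a set. A minimal sketch using context managers so the file is closed even if an error occurs. The helper names and the temporary path are hypothetical, and sorting before writing is an added choice to make the file reproducible.

```python
import os
import tempfile

def save_asins(asins, path):
    """Write ASINs as one comma-separated line, sorted for reproducible files."""
    with open(path, 'w') as outputfile:
        outputfile.write(','.join(sorted(asins)))

def load_asins(path):
    """Read the checkpoint back into a set, treating an empty file as no ASINs."""
    with open(path, 'r') as inputfile:
        content = inputfile.read().strip()
    return set(content.split(',')) if content else set()

# round trip on a hypothetical temporary file
checkpoint_path = os.path.join(tempfile.mkdtemp(), 'allasins.txt')
save_asins({'B2', 'A1'}, checkpoint_path)
print(load_asins(checkpoint_path) == {'A1', 'B2'})  # True
```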

## Filter Data to Scope of Analysis

# 3.&nbsp;Preprocessing the Data

# 4.&nbsp;Data Cleansing

# 5.&nbsp;Model: Parameter Tuning

# 6.&nbsp;Model: Final

# 7.&nbsp;Model Evaluation

# 8.&nbsp;Results

# 9.&nbsp;References