# Product Review Topic Modeling

A report on topic modeling with [tmtoolkit](https://tmtoolkit.readthedocs.io/), applied to Amazon customer reviews of clothing and shoes manufactured by Nike. Unsupervised LDA is used to discover the most common topics in the review text.

**Context**
Online marketplaces and applications can produce data at volumes too big for a human to read and analyze. An example of this is Amazon, where millions of customers buy and review products. When a company wants to extract insights from its reviews, it needs a way to process the text of the reviews and identify patterns. This exploratory analysis of big data is a common use case for topic modeling using unsupervised learning.

**Background**
The Nike brand is interested in extracting key insights from customer review data on the Amazon website. Specifically, they would like to understand when customers are not satisfied and what major problems they face with their products. To accomplish this, the Data Science team at Nike is tasked with analyzing the Amazon review data. First, the team must identify Nike's product ASINs on Amazon and extract the relevant reviews. Next, the team will complete topic modeling on the text of the reviews using the LDA clustering method to model the data into popular topics. Finally, the team will visualize the topics and derive initial insights to drive the strategy of the team.

**LDA (Latent Dirichlet Allocation)**
This project will focus on the application of [lda.LDA](https://github.com/lda-project/lda), which implements latent Dirichlet allocation (LDA).

Specifically, in natural language processing (NLP), Latent Dirichlet Allocation (LDA) is a Bayesian network used for topic modeling. It aims to explain a set of observations through unobserved groups. Each group explains why some parts of the data are similar. For most purposes, observations are words and are collected into documents. Each word's presence is attributed to one of the document's topics. Each document will contain a small number of topics. [1]
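
The generative story described above can be sketched in a few lines of standard-library Python. This is only an illustration of the model's assumptions, not how `lda.LDA` works internally: a fitted model infers the topic and word distributions from observed text, whereas this sketch samples text from distributions it makes up. The vocabulary, the number of topics, and the prior values (0.5) are all hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, doc_topic, vocab, n_words, rng):
    """Generate one document: pick a topic per word, then a word from that topic."""
    topic_ids = list(range(len(topic_word)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=doc_topic, k=1)[0]   # topic assignment for this word
        w = rng.choices(vocab, weights=topic_word[z], k=1)[0]   # word drawn from topic z
        words.append(w)
    return words

rng = random.Random(0)
vocab = ["fit", "size", "comfortable", "price", "shipping", "color"]  # hypothetical vocabulary
n_topics = 2                                                          # hypothetical topic count

# One word distribution per topic (symmetric prior of 0.5 is an assumed value)
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(n_topics)]
# One topic distribution for the document (symmetric prior of 0.5 is an assumed value)
doc_topic = sample_dirichlet([0.5] * n_topics, rng)

document = generate_document(topic_word, doc_topic, vocab, n_words=8, rng=rng)
print(doc_topic)
print(document)
```

Smaller prior values concentrate each document on fewer topics, which matches the assumption that a document contains only a small number of topics.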

**Data Source**
Prof. Julian McAuley at UC San Diego created an [Amazon Product Data](http://jmcauley.ucsd.edu/data/amazon/links.html) database, which contains Amazon product details. This project will leverage 2 datasets from the database:
- (1) meta-data about products - details related to each Amazon product
- (2) product reviews - reviews on all types of Amazon products

Since the overall size of these datasets is huge (~80 GB), a subset of the data will be used, focusing on the product category **Clothing, Shoes & Jewelry**.

The raw data sources for the project can be accessed with the following links:
- [Product Data](http://128.138.93.164/meta_Clothing_Shoes_and_Jewelry.json.gz)
- [Review Data](http://128.138.93.164/reviews_Clothing_Shoes_and_Jewelry.json.gz)

**Overview of Observations**

The **Product Details** dataset has 1,503,384 records in a json format and includes the following fields:
- asin: a unique product id assigned to the Amazon product
- title: the name of the product displayed on the Amazon page
- imUrl: a URL to the product image
- related: products related to this product
- salesRank: the main sales category and the product rank within the sales category
- categories: categories or tags grouping the products

The **Review Details** dataset has 5,748,920 records in a json format and includes the following fields:
- reviewerID: a unique id of the user who wrote the review
- asin: a unique id of the product reviewed
- reviewerName: the text name of the user who created the review
- helpful: a pair of vote counts, e.g. `[0, 2]`, giving the number of helpful votes out of the total votes cast on the review
- reviewText: the text written for the review
- overall: the rating of the product (in stars) on a scale of 1-5
- summary: the summary or title of the review
- unixReviewTime: the numeric unix representation of the time the review was created
- reviewTime: the time the review was created as a formatted date string
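
As a small illustration of how these fields can be read, the sketch below uses one sample record from this dataset. The helper names `helpful_ratio` and `review_date` are hypothetical, and reading `helpful` as `[helpful votes, total votes]` and the timestamp as UTC are assumptions based on the data source's documentation.

```python
from datetime import datetime, timezone

# A review record in the shape described above (values from a sample record in this dataset)
review = {
    "reviewerID": "AGP3D5ZB7N3NA",
    "asin": "B00594DYW8",
    "helpful": [0, 0],
    "overall": 5.0,
    "summary": "Great Item, great seller!",
    "unixReviewTime": 1322524800,
    "reviewTime": "11 29, 2011",
}

def helpful_ratio(record):
    """Return helpful votes / total votes, or None when nobody voted (assumed field layout)."""
    helpful_votes, total_votes = record["helpful"]
    return helpful_votes / total_votes if total_votes else None

def review_date(record):
    """Convert the Unix timestamp (seconds, interpreted as UTC) to a calendar date."""
    return datetime.fromtimestamp(record["unixReviewTime"], tz=timezone.utc).date()

print(helpful_ratio(review))   # None, because no votes were cast
print(review_date(review))     # 2011-11-29, matching the reviewTime field
```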

**Objective**
The objective is to build an unsupervised topic model for Nike product reviews. The topic model will group reviews into topics and create labels to identify patterns and trends in user reviews. The final evaluation will be qualitative, looking at the value of the topics modeled and the usefulness of the insights extracted from them.

**Report Overview**
The project will cover 7 key phases:
1. Data Source: Extracting, filtering, and focusing the data on the Nike brand
2. Preprocessing: Extracting Word Features with Natural Language Processing (NLP) tools
3. Parameter Tuning: Tuning the topic model parameters to improve the purity and uniqueness of topics
4. Final Model: Building the final topic model
5. Classify and Enrich Topic Data: Tuning the topic labels, tagging all records, and adding supporting data to the final dataset
6. Model Evaluation: Review the model and extract insights for the Nike Brand
7. Results: Review the findings of the Topic Evaluation.

## Setting Up
### Importing Libraries

In [29]:
# Special Install of Packages
print('[-] Importing packages...')
import os

#special_install_tmtoolkit
try:
    import tmtoolkit
except ImportError:
    print('starting patch of tmtoolkit.')
    !pip install --quiet -U "tmtoolkit[recommended,lda,sklearn,wordclouds,topic_modeling_eval_extra]"
    print('finished patch of tmtoolkit.')
    # terminate the kernel process (presumably so that a restart picks up the new install)
    os.kill(os.getpid(), 9)

#special_install_lda
try:
    from tmtoolkit.topicmod.tm_lda import compute_models_parallel
except ImportError:
    !pip install --quiet "tmtoolkit[lda]"
    from tmtoolkit.topicmod.tm_lda import compute_models_parallel

try:
    from lda import LDA
except ImportError:
    !pip install --quiet lda
    from lda import LDA

#special_install_pyLDAvis
try:
    import pyLDAvis
except ImportError:
    !pip install --quiet pyLDAvis==2.1.2
    import pyLDAvis


[-] Importing packages...


In [30]:
print('[-] Importing packages...')
# File Connection and File Manipulation
import os
import pickle
import json
import glob
# Import Usability Functions
import logging
import warnings
# Basic Data Science Toolkits
import pandas as pd
import numpy as np
import math
import random
import time
from time import sleep
# Basic Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt
# Text Preprocessing (tmtoolkit)
import tmtoolkit
from tmtoolkit.corpus import Corpus, lemmatize, to_lowercase, remove_chars, filter_clean_tokens
from tmtoolkit.corpus import filter_for_pos, remove_common_tokens, remove_uncommon_tokens
from tmtoolkit.corpus import corpus_num_tokens, corpus_tokens_flattened
from tmtoolkit.corpus import doc_tokens, tokens_table, doc_labels, dtm
from tmtoolkit.corpus import vocabulary, vocabulary_size, vocabulary_counts
from tmtoolkit.corpus.visualize import plot_doc_lengths_hist, plot_doc_frequencies_hist, plot_ranked_vocab_counts
#https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html
# Text Preprocessing(other)
from string import punctuation
import nltk
import scipy.sparse
# Topic Modeling
from lda import LDA
import pyLDAvis
from tmtoolkit.topicmod import tm_lda
from tmtoolkit.topicmod.tm_lda import compute_models_parallel
from tmtoolkit.topicmod.model_io import print_ldamodel_topic_words
from tmtoolkit.topicmod.model_io import save_ldamodel_to_pickle
from tmtoolkit.topicmod.model_io import load_ldamodel_from_pickle
from tmtoolkit.topicmod.model_io import ldamodel_top_doc_topics
from tmtoolkit.topicmod.evaluate import results_by_parameter
from tmtoolkit.topicmod.visualize import plot_eval_results
from tmtoolkit.topicmod.visualize import parameters_for_ldavis
from tmtoolkit.topicmod.visualize import generate_wordclouds_for_topic_words
from tmtoolkit.topicmod.model_stats import generate_topic_labels_from_top_words
from tmtoolkit.bow.bow_stats import doc_lengths
# Sentiment Modeling
from textblob import TextBlob
# normalize
from sklearn.preprocessing import MinMaxScaler

[-] Importing packages...


### Set Global Variables

In [31]:
random.seed(20191120)   # to make the sampling reproducible
np.set_printoptions(precision=5)

### Verify GPU Runtime

In [32]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Mon Mar 25 18:17:59 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.52                 Driver Version: 551.52         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3070 ...  WDDM  |   00000000:01:00.0  On |                  N/A |
| N/A   47C    P0             32W /  115W |     529MiB /   8192MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

### Setup Directories

In [33]:
ROOT_DIR = "."
DATA_DIR = "%s/data" % ROOT_DIR
EVAL_DIR = "%s/evaluation" % ROOT_DIR
MODEL_DIR = "%s/models" % ROOT_DIR

#Create missing directories, if they don't exist
for label, path in (("data", DATA_DIR), ("evaluation", EVAL_DIR), ("model", MODEL_DIR)):
    if not os.path.exists(path):
        # Create a new directory because it does not exist
        os.makedirs(path)
        print("The %s directory is created!" % label)


### Downloading Data

In [10]:
#The signed link will have expired by the time of grading. It's only here to show how the download could be done.
import urllib.request
url = 'https://d3c33hcgiwev3.cloudfront.net/F4BxPB4wTWSAcTweMH1ktA_cf7170556c97498f80af8f7b869f35f1_meta_Clothing_Shoes_and_Jewelry.jsonl.gz?Expires=1711497600&Signature=ABbwkDeNeTL-s2uHXlyu4QweI8JnGGaCopj50~hba0MaEMhk70IImmREl5hzHC6O0mxPkFACpq5PP~JX8XnIEVe1cph7G2neQ0zgW1Q9npt47H-RMjk0p6emo95c8oZ~aWty9kEQBZM3D7PqnwRe7dHRWbZ-eS84gYkvdAq0w48_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A'
filename = './data/meta_Clothing_Shoes_and_Jewelry.jsonl.gz'
urllib.request.urlretrieve(url, filename)

('./data/meta_Clothing_Shoes_and_Jewelry.jsonl.gz',
 <http.client.HTTPMessage at 0x1e28ab25790>)

In [12]:
url = 'https://d3c33hcgiwev3.cloudfront.net/ed_p_0DhRh-f6f9A4dYfJA_d374df8f88084b8c9384c9b910b50cf1_reviews_Clothing_Shoes_and_Jewelry.json.gz?Expires=1711497600&Signature=Nllx8wiA27Ey2uOdCpLCRGy-y31pBWRnlXG5AtnpXxVoxPLABKJBTsk4eAcUIGTPzw23f6zckeNTtAbIfOy9w8mCGa4kjFDZAKFAaLXzcmUEs4ia09dzaSYjuduIHhE4ECVvr~VyutJkqfS4MwXxWpgqvwbB44F9RqE5g3P3h6A_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A'
filename = './data/reviews_Clothing_Shoes_and_Jewelry.json.gz'
urllib.request.urlretrieve(url, filename)

('./data/reviews_Clothing_Shoes_and_Jewelry.json.gz',
 <http.client.HTTPMessage at 0x1e28b6f9ed0>)

In [13]:
meta_file_path = '%s/meta_Clothing_Shoes_and_Jewelry.jsonl.gz' % DATA_DIR
review_file_path = '%s/reviews_Clothing_Shoes_and_Jewelry.json.gz' % DATA_DIR

!gzip -d "$meta_file_path"
!gzip -d "$review_file_path"
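
The shell commands above unpack the archives on disk. Where disk space is tight, the compressed files could instead be streamed line by line with Python's standard `gzip` module. The sketch below is one possible way to do that; the helper name `iter_gzip_lines` and the temporary demo file are hypothetical stand-ins, not part of the original workflow.

```python
import gzip
import os
import tempfile

def iter_gzip_lines(path, encoding="utf-8"):
    """Yield the lines of a gzip-compressed text file without unpacking it on disk."""
    with gzip.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")

# Small demonstration with a temporary file standing in for the real archive
demo_path = os.path.join(tempfile.mkdtemp(), "demo.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write("{'asin': 'A1'}\n{'asin': 'A2'}\n")

demo_lines = list(iter_gzip_lines(demo_path))
print(len(demo_lines))   # 2
```

The trade-off is that every pass over the data pays the decompression cost again, so unpacking once, as done above, can be the better choice when the file is read repeatedly.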

## Examining Product Data

In [34]:
##this assigns the filename we're trying to load in to a string variable
meta_file_path = '%s/meta_Clothing_Shoes_and_Jewelry.jsonl' % DATA_DIR
loadedjson = open(meta_file_path, 'r')

In [35]:
#The data used in this script comes from: http://jmcauley.ucsd.edu/data/amazon/links.html
#The data here is the 'per category' data for Clothing, Shoes and Jewelry
#use the above url to better understand the data, where it came from, and some
#tips on how to use it!

#getting reviews is going to be a two step process:
#1) go through the Amazon product catalog for "Clothing, Shoes and Jewelry"
#and extract out matching products by their ASIN
#2) go through the review data and parse out the matching reviews by ASIN

#1) - Extracting ASINs by brand
#First, let's iterate through the data and store it as a python dictionary

#let's set a counter to see how many products we have in the json
count = 0
start_time = time.time()
#loading the json file
#we've always got to initiate dictionaries before we can use them
allproducts = {}

#each line of data here is a product and its metadata
print('loading product data to dictionary:')
for aline in loadedjson:
    #creating a counter to know our progress in processing the entire catalog
    count += 1
    if count % 100000 == 0:
        #we're only going to print our count every 100k, this way we don't spam
        #our output console
        current_runtime = round(time.time() - start_time,3)
        print('[-] current progress:', count, 'and a runtime of', current_runtime, 'seconds.')
    #interestingly enough, this data isn't true JSON; instead it's python
    #dictionaries that have essentially been printed as text. It's odd, but
    #per the documentation, all we need to do to load a dictionary is use
    #the eval function. https://www.programiz.com/python-programming/methods/built-in/eval
    #eval takes whatever string is passed to it, interprets it as python code,
    #and runs it, which is what we need to interpret a printed python
    #dictionary. Because eval executes arbitrary code, only use it on data
    #from a source you trust.

    aproduct = eval(aline)

    #making a dictionary entry with the ASIN of the product as the key
    #and its metadata as nested dictionaries
    allproducts[aproduct['asin']] = aproduct

loading product data to dictionary:
[-] current progress: 100000 and a runtime of 24.194 seconds.
[-] current progress: 200000 and a runtime of 36.596 seconds.
[-] current progress: 300000 and a runtime of 48.865 seconds.
[-] current progress: 400000 and a runtime of 61.053 seconds.
[-] current progress: 500000 and a runtime of 72.612 seconds.
[-] current progress: 600000 and a runtime of 85.12 seconds.
[-] current progress: 700000 and a runtime of 98.269 seconds.
[-] current progress: 800000 and a runtime of 111.138 seconds.
[-] current progress: 900000 and a runtime of 188.012 seconds.
[-] current progress: 1000000 and a runtime of 201.046 seconds.
[-] current progress: 1100000 and a runtime of 216.22 seconds.
[-] current progress: 1200000 and a runtime of 230.819 seconds.
[-] current progress: 1300000 and a runtime of 246.186 seconds.
[-] current progress: 1400000 and a runtime of 262.559 seconds.
[-] current progress: 1500000 and a runtime of 343.536 seconds.
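
Because `eval` runs whatever code a line contains, a more defensive option for printed Python dictionaries is `ast.literal_eval` from the standard library, which accepts only literals. The sketch below shows the idea on a shortened version of a record from this dataset; the helper name `parse_record` is hypothetical, and whether `literal_eval` is fast enough for the full 1.5 million lines has not been measured here.

```python
import ast

def parse_record(line):
    """Parse one line holding a printed Python dict, without executing any code."""
    record = ast.literal_eval(line.strip())
    if not isinstance(record, dict):
        raise ValueError("expected a dictionary literal")
    return record

sample_line = "{'asin': 'B00KUSKHDC', 'salesRank': {'Clothing': 288020}}\n"
product = parse_record(sample_line)
print(product["asin"])   # B00KUSKHDC

# A line containing a function call is rejected instead of being executed
try:
    parse_record("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)   # True
```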


In [36]:
#print a summary of the records processed
allproducts_length = len(allproducts)
current_runtime = round(time.time() - start_time,3)
print('Process completed for', count, 'of', allproducts_length, 'records with a final runtime of', current_runtime, 'seconds.')

Process completed for 1503384 of 1503384 records with a final runtime of 344.088 seconds.


In [37]:
#preview the product record
allproducts['B00KUSKHDC']

{'asin': 'B00KUSKHDC',
 'title': "Family Guy - Men's T-shirt Evil Monkey",
 'imUrl': 'http://ecx.images-amazon.com/images/I/41eUK6CAY4L._SX342_.jpg',
 'related': {'also_viewed': ['B004P0JEK8',
   'B00EC4UZ3M',
   'B000VYZEY2',
   'B00HZSI7QE']},
 'salesRank': {'Clothing': 288020},
 'categories': [['Clothing, Shoes & Jewelry', 'Men'],
  ['Clothing, Shoes & Jewelry',
   'Novelty, Costumes & More',
   'Novelty',
   'Clothing',
   'Men',
   'Shirts',
   'T-Shirts']]}

In [38]:
#save the files to disk
allproducts_file_path = '%s/allproducts.p' % DATA_DIR
pickle.dump(allproducts, open(allproducts_file_path, 'wb'))

### Summarize the Product Categories

In [39]:
#Next we need to explore the product data to see what categories are common in the
#data. As you'll learn, product categories are loosely defined in that they can be
#product categories (e.g., baby, house and home), or they can be brands!
#We're already dealing with a subset of the product categories: Clothing, Shoes
#and Jewelry. We still need to find a list of product ids for our specific
#brand. To do this, we're going to use the 'categories' metadata field to find
#our brand

##Let's create a dictionary of all the product subcategories
#and by doing so, also come up with a list of brands and the number of products
#they have listed in the amazon product catalog

allcategories = {}
count = 0
start_time = time.time()

#each line of data here is a product and its metadata
print('loading categories data to dictionary:')
for aproduct in allproducts:
    #creating a counter to know our progress in processing the entire catalog
    count += 1
    if count % 100000 == 0:
        #we now know there are 1.5 million products, so we can build a counter
        #that tells how our processing is going. When the counter reaches 100 %
        #we're done!
        current_progress = int(round(count/allproducts_length,2)*100)
        current_runtime = round(time.time() - start_time,3)
        print('[-] current progress:', current_progress, '%', 'and a runtime of', current_runtime, 'seconds.')

    #setting a dict up with just one product, so we can inspect and ref it
    aproduct = allproducts[aproduct]
    #creating a dictionary entry for each product category
    #also counting the occurrences of each category
    if 'categories' in aproduct:
        for categories in aproduct['categories']:
            for acategory in categories:
                if acategory in allcategories:
                    allcategories[acategory] += 1
                else:
                    allcategories[acategory] = 1

loading categories data to dictionary:
[-] current progress: 7 % and a runtime of 0.35 seconds.
[-] current progress: 13 % and a runtime of 0.699 seconds.
[-] current progress: 20 % and a runtime of 1.029 seconds.
[-] current progress: 27 % and a runtime of 1.371 seconds.
[-] current progress: 33 % and a runtime of 1.679 seconds.
[-] current progress: 40 % and a runtime of 1.985 seconds.
[-] current progress: 47 % and a runtime of 2.316 seconds.
[-] current progress: 53 % and a runtime of 2.61 seconds.
[-] current progress: 60 % and a runtime of 2.891 seconds.
[-] current progress: 67 % and a runtime of 3.171 seconds.
[-] current progress: 73 % and a runtime of 3.46 seconds.
[-] current progress: 80 % and a runtime of 3.736 seconds.
[-] current progress: 86 % and a runtime of 4.01 seconds.
[-] current progress: 93 % and a runtime of 4.28 seconds.
[-] current progress: 100 % and a runtime of 4.546 seconds.


In [40]:
#print a summary of the categories processed
allcategories_length = len(allcategories)
current_runtime = round(time.time() - start_time,3)
print('Process completed for', allcategories_length, 'categories with a final runtime of', current_runtime, 'seconds.')

Process completed for 2773 categories with a final runtime of 4.56 seconds.


In [41]:
#create a sorted list of categories
sortedlist = []
#convert the dictionary to a list of tuples
for acategory in allcategories:
  sortedlist.append((allcategories[acategory],acategory))
#sort the list
sortedlist = sorted(sortedlist, reverse=True)
#print the top x records in the list
top_n = 20
for item in range(0,top_n):
  print('[',str(item).zfill(2),']', sortedlist[item])

[ 00 ] (3429257, 'Clothing, Shoes & Jewelry')
[ 01 ] (1086181, 'Women')
[ 02 ] (617092, 'Clothing')
[ 03 ] (541681, 'Men')
[ 04 ] (537761, 'Novelty, Costumes & More')
[ 05 ] (432653, 'Shoes')
[ 06 ] (339900, 'Novelty')
[ 07 ] (268065, 'Shoes & Accessories: International Shipping Available')
[ 08 ] (255454, 'Jewelry')
[ 09 ] (174962, 'Accessories')
[ 10 ] (97095, 'Girls')
[ 11 ] (93596, 'Tops & Tees')
[ 12 ] (87688, 'Dresses')
[ 13 ] (84549, 'T-Shirts')
[ 14 ] (82063, 'Boots')
[ 15 ] (80302, 'Shirts')
[ 16 ] (79897, 'Sandals')
[ 17 ] (79545, 'Watches')
[ 18 ] (77684, 'Boys')
[ 19 ] (73507, 'Jewelry: International Shipping Available')
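
The counting and ranking steps above can also be expressed with `collections.Counter`, which handles the "first time seen" case and the sorting for us. This is a minimal sketch on a hypothetical miniature catalog in the same shape as `allproducts`; the function name `count_categories` is an assumption for illustration.

```python
from collections import Counter

def count_categories(products):
    """Count how often each category label occurs across all products."""
    counts = Counter()
    for product in products.values():
        # each product carries a list of category paths; each path is a list of labels
        for category_path in product.get("categories", []):
            counts.update(category_path)
    return counts

# Hypothetical miniature catalog in the same shape as allproducts
toy_products = {
    "A1": {"asin": "A1", "categories": [["Clothing, Shoes & Jewelry", "Men"],
                                         ["Clothing, Shoes & Jewelry", "Nike"]]},
    "A2": {"asin": "A2", "categories": [["Clothing, Shoes & Jewelry", "Women"]]},
    "A3": {"asin": "A3"},  # a record without categories is skipped
}

toy_counts = count_categories(toy_products)
for rank, (category, n) in enumerate(toy_counts.most_common(2)):
    print("[", str(rank).zfill(2), "]", (n, category))
```

One difference to be aware of: `most_common` keeps tied labels in first-seen order, while sorting `(count, label)` tuples in reverse breaks ties by label, so the two approaches can order equal counts differently.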


In [42]:
nike_categories = allcategories['Nike']
print(nike_categories, 'product records for Nike.')

8327 product records for Nike.


### Extract a List of Product Ids

In [43]:
#Now, go ahead and use the Variable Explorer in Spyder to locate a brand
#that has a lot of product entries! Alternatively, type allcategories['Brand name']
#to get a count for a specific brand. For instance:
#>>allcategories['Nike']
#>> 8327
#>>allcategories['adidas']
#>> 8645

#I'd recommend at least 1.5k products, but you're welcome to try smaller counts;
#all I care about is whether you have at least 2k reviews when it's all said and done


##Now we need to go through our product dictionary and extract the
##matching ASINs for Nike

##First, create a set where we will store our ASINs
##We choose a set here because we don't want duplicates
allnikeasins = set()
count = 0
start_time = time.time()

for anasin in allproducts:
    theproduct = allproducts[anasin]
    count += 1
    if count % 100000 == 0:
        current_progress = int(round(count/allproducts_length,2)*100)
        current_runtime = round(time.time() - start_time,3)
        print('[-] current progress:', current_progress, '%', 'and a runtime of', current_runtime, 'seconds.')

    #let's iterate over each category for a product; again, any given product
    #can be assigned multiple product categories
    for categories in theproduct['categories']:
        #each category is actually encoded as a list (even though they should
        #just be strings), so we need to iterate one more time
        for acategory in categories:
            #checking to see if the product category matches Nike
            #lowercasing the category string in case capitalization might get
            #in the way of a match
            if 'nike' in acategory.lower():
                #let's go ahead and store it to our set of Nike ASINs
                allnikeasins.add(theproduct['asin'])

[-] current progress: 7 % and a runtime of 0.21 seconds.
[-] current progress: 13 % and a runtime of 0.443 seconds.
[-] current progress: 20 % and a runtime of 0.657 seconds.
[-] current progress: 27 % and a runtime of 0.874 seconds.
[-] current progress: 33 % and a runtime of 1.083 seconds.
[-] current progress: 40 % and a runtime of 1.291 seconds.
[-] current progress: 47 % and a runtime of 1.498 seconds.
[-] current progress: 53 % and a runtime of 1.7 seconds.
[-] current progress: 60 % and a runtime of 1.904 seconds.
[-] current progress: 67 % and a runtime of 2.104 seconds.
[-] current progress: 73 % and a runtime of 2.307 seconds.
[-] current progress: 80 % and a runtime of 2.504 seconds.
[-] current progress: 86 % and a runtime of 2.697 seconds.
[-] current progress: 93 % and a runtime of 2.889 seconds.
[-] current progress: 100 % and a runtime of 3.069 seconds.
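
The same brand filter can be written compactly as a set comprehension. The sketch below is an illustrative rewrite on hypothetical toy data, not the code that produced the output above; `asins_for_brand` is a made-up helper name, and it reads the ASIN from the dictionary key rather than from the `asin` field.

```python
def asins_for_brand(products, brand):
    """Return the ASINs of products with a category label containing the brand name."""
    needle = brand.lower()
    return {
        asin
        for asin, product in products.items()
        for category_path in product.get("categories", [])
        for category in category_path
        if needle in category.lower()   # case-insensitive substring match
    }

# Hypothetical miniature catalog in the same shape as allproducts
toy_products = {
    "A1": {"categories": [["Clothing, Shoes & Jewelry", "N", "Nike"]]},
    "A2": {"categories": [["Clothing, Shoes & Jewelry", "Women"]]},
    "A3": {"categories": [["Shoes", "NIKE Golf"], ["Men", "Nike"]]},
}
nike_asins = asins_for_brand(toy_products, "nike")
print(sorted(nike_asins))   # ['A1', 'A3']
```

Note that a substring match also accepts any label that merely contains the brand name, so for other brands with short or common names an exact comparison may be the safer choice.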


In [44]:
#print a summary of the ASINs extracted
allnikeasins_length = len(allnikeasins)
current_runtime = round(time.time() - start_time,3)
print('Process completed for', allnikeasins_length, 'records with a final runtime of', current_runtime, 'seconds.')

Process completed for 8327 records with a final runtime of 3.079 seconds.


In [45]:
# write the ASINs out to a file as a checkpoint
with open('%s/allasins.txt' % DATA_DIR, 'w') as outputfile:
    outputfile.write(','.join(allnikeasins))

## Examining the Review Data

In [46]:
#this assigns the filename we're trying to load in to a string variable
review_file_path = '%s/reviews_Clothing_Shoes_and_Jewelry.json' % DATA_DIR
loadedjson = open(review_file_path, 'r')

In [47]:
#2) - Parsing the review data
#First, let's iterate through the data and store it as a python dictionary

#let's set a counter to see how many reviews we have in the json
count = 0
start_time = time.time()
#loading the json file
#we've always got to initiate dictionaries before we can use them
allreviews = {}

#each line of data here is a review and its metadata
print('loading review data to dictionary:')
for aline in loadedjson:
    #creating a counter to know our progress in processing the entire catalog
    count += 1
    if count % 500000 == 0:
        #we're only going to print our count every 500k, this way we don't spam
        #our output console
        current_runtime = round(time.time() - start_time,3)
        print('[-] current progress:', count, 'and a runtime of', current_runtime, 'seconds.')
    #interestingly enough, this data isn't true JSON; instead it's python
    #dictionaries that have essentially been printed as text. It's odd, but
    #per the documentation, all we need to do to load a dictionary is use
    #the eval function. https://www.programiz.com/python-programming/methods/built-in/eval
    #eval takes whatever string is passed to it, interprets it as python code,
    #and runs it, which is what we need to interpret a printed python
    #dictionary. Because eval executes arbitrary code, only use it on data
    #from a source you trust.

    areview = eval(aline)

    #making a dictionary entry with the iteration count as the review key
    #and its metadata as nested dictionaries
    allreviews[count] = areview
print('completed load of review data to dictionary.')

loading review data to dictionary:
[-] current progress: 500000 and a runtime of 33.901 seconds.
[-] current progress: 1000000 and a runtime of 53.423 seconds.
[-] current progress: 1500000 and a runtime of 73.157 seconds.
[-] current progress: 2000000 and a runtime of 92.679 seconds.
[-] current progress: 2500000 and a runtime of 120.684 seconds.
[-] current progress: 3000000 and a runtime of 140.692 seconds.
[-] current progress: 3500000 and a runtime of 160.421 seconds.
[-] current progress: 4000000 and a runtime of 180.362 seconds.
[-] current progress: 4500000 and a runtime of 230.885 seconds.
[-] current progress: 5000000 and a runtime of 250.479 seconds.
[-] current progress: 5500000 and a runtime of 270.205 seconds.
completed load of review data to dictionary.


In [48]:
#print a summary of the records processed
allreviews_length = len(allreviews)
current_runtime = round(time.time() - start_time,3)
print('Process completed for', count, 'of', allreviews_length, 'records with a final runtime of', current_runtime, 'seconds.')

Process completed for 5748920 of 5748920 records with a final runtime of 281.212 seconds.


### Extract a List of Reviews Related to the Product Ids

In [49]:
#Load the list of Nike Asins

allnikeasins = []
allasins_file_path = '%s/allasins.txt' % DATA_DIR

#open the file and load to a list
for data in open(allasins_file_path, 'r'):
  asins = data.split(',')
  for anasin in asins:
    allnikeasins.append(anasin)

In [50]:
#print a summary of the records processed
allnikeasins_length = len(allnikeasins)
print('Process completed for', allnikeasins_length, 'records.')
print('First 5 Asins in list:', allnikeasins[0:5])

Process completed for 8327 records.
First 5 Asins in list: ['B007I7C3IK', 'B0098YXOES', 'B007MV6BNU', 'B0053HC6T8', 'B002Z34I94']


In [51]:
#Now, we need to go through all the reviews and pick out the reviews that
#correspond to the matching ASINs, that is reviews that are tied to Nike ASINs

#let's set a counter to track how many reviews we have checked
count = 0
start_time = time.time()
#we've always got to initiate dictionaries before we can use them
nikereviews = {}

#each entry of allreviews is a review and its metadata
print('loading review data to dictionary:')
for areview in allreviews:
  count += 1
  if count % 500000 == 0:
      current_progress = int(round(count/allreviews_length,2)*100)
      current_runtime = round(time.time() - start_time,3)
      print('[-] current progress:', current_progress, '%', 'and a runtime of', current_runtime, 'seconds.')
  #setting current review as a dictionary, so we can easily reference its
  #entries
  thereview = allreviews[areview]
  theasin = thereview['asin']
  reviewerid = thereview['reviewerID']
  if theasin in allnikeasins:
      #I'm setting the key here to something unique. If we just keyed by ASIN,
      #we'd only keep one review for each ASIN, with the last review being the
      #only one stored
      thekey = '%s.%s' % (theasin, reviewerid)
      nikereviews[thekey] = thereview
print('completed load of review data to dictionary.')

loading review data to dictionary:
[-] current progress: 9 % and a runtime of 52.024 seconds.
[-] current progress: 17 % and a runtime of 103.339 seconds.
[-] current progress: 26 % and a runtime of 154.976 seconds.
[-] current progress: 35 % and a runtime of 206.807 seconds.
[-] current progress: 43 % and a runtime of 258.728 seconds.
[-] current progress: 52 % and a runtime of 309.869 seconds.
[-] current progress: 61 % and a runtime of 360.782 seconds.
[-] current progress: 70 % and a runtime of 411.814 seconds.
[-] current progress: 78 % and a runtime of 463.55 seconds.
[-] current progress: 87 % and a runtime of 514.539 seconds.
[-] current progress: 96 % and a runtime of 564.564 seconds.
completed load of review data to dictionary.
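
Most of the runtime above likely comes from testing each review's ASIN against a list of 8,327 entries, which scans the list for every one of the 5.7 million reviews. Converting the list to a set first makes each lookup a constant-time hash check on average. The sketch below shows the pattern on hypothetical toy data; `filter_reviews` is a made-up helper name, and the speed-up on the real data has not been measured here.

```python
def filter_reviews(reviews, wanted_asins):
    """Keep reviews whose ASIN is wanted, keyed by '<asin>.<reviewerID>'."""
    wanted = set(wanted_asins)   # set lookup is O(1) on average; a list scan is O(n)
    selected = {}
    for review in reviews:
        if review["asin"] in wanted:
            key = "%s.%s" % (review["asin"], review["reviewerID"])
            selected[key] = review
    return selected

# Hypothetical reviews in the same shape as the records in allreviews
toy_reviews = [
    {"asin": "A1", "reviewerID": "R1", "reviewText": "fits well"},
    {"asin": "A2", "reviewerID": "R1", "reviewText": "too small"},
    {"asin": "A1", "reviewerID": "R2", "reviewText": "great shoe"},
]
toy_selected = filter_reviews(toy_reviews, ["A1", "A9"])
print(sorted(toy_selected))   # ['A1.R1', 'A1.R2']
```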


In [52]:
#print a summary of the records processed
nikereviews_length = len(nikereviews)
current_runtime = round(time.time() - start_time,3)
print('Process completed for', count, 'of', nikereviews_length, 'records with a final runtime of', current_runtime, 'seconds.')

Process completed for 5748920 of 21570 records with a final runtime of 589.318 seconds.


In [53]:
#save our data to a JSON dictionary
allnikereviews_file_path = '%s/allnikereviews.json' % DATA_DIR
json.dump(nikereviews, open(allnikereviews_file_path, 'w'))

### Preview a Record from the File

In [54]:
#this assigns the filename we're trying to load
allnikereviews_file_path = '%s/allnikereviews.json' % DATA_DIR
json_file = json.load(open(allnikereviews_file_path, 'r'))

In [55]:
#print every 1000th review among the first 10,000 (a systematic preview, not a random sample)
count = 0
for a_review in json_file:
  count += 1
  if count % 1000 == 0:
    the_review = json_file[a_review]
    print(the_review)
    #sleep(10)
  if count >= 10000:
    break

{'reviewerID': 'A7H6Q3JAE3UTR', 'asin': 'B000G42Z2Q', 'reviewerName': 'april fritsche', 'helpful': [0, 0], 'reviewText': 'I bought these for my son and he loves them.  He said that that r very comfortable.  He said he wears them all the time.  He said that they look great.', 'overall': 5.0, 'summary': 'love them', 'unixReviewTime': 1388620800, 'reviewTime': '01 2, 2014'}
{'reviewerID': 'A278R123DX0CLR', 'asin': 'B0013UXIAK', 'reviewerName': 'AQB', 'helpful': [0, 2], 'reviewText': 'Too much money for the pair. I saw a better pair for less price. I will shop around Zappos or Nike websites next time.', 'overall': 3.0, 'summary': 'its ok', 'unixReviewTime': 1377561600, 'reviewTime': '08 27, 2013'}
{'reviewerID': 'A1CZIWT35UVJS6', 'asin': 'B001PA87L8', 'reviewerName': 'Courtney', 'helpful': [1, 1], 'reviewText': "I love these shoes, they are perfect. Very comfy especially since I'm on my feet all day at work. It's wierd that they don't have a tongue but it only makes them fit better.", 'ove

In [56]:
#print the review to the screen
the_review

{'reviewerID': 'AGP3D5ZB7N3NA',
 'asin': 'B00594DYW8',
 'reviewerName': 'S. M. Hughes "mom of 2"',
 'helpful': [0, 0],
 'reviewText': 'I absolutely love this pair of shoes. I am more prone to barefoot running, and the flexibility of these shoes are not at all restrictive on my feet and they feel great! I was so very happy to find them for a good price! Seller shipped very quickly and overall, the experience with this purchase was A+!',
 'overall': 5.0,
 'summary': 'Great Item, great seller!',
 'unixReviewTime': 1322524800,
 'reviewTime': '11 29, 2011'}
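
If a genuinely random preview is wanted instead of every 1000th record, the standard `random` module can draw a reproducible sample. This is a minimal sketch on hypothetical toy data; the helper name `sample_reviews` is an assumption, and the seed simply reuses the value set under "Set Global Variables".

```python
import random

def sample_reviews(reviews_by_key, k, seed=20191120):
    """Return k reviews drawn uniformly at random, reproducibly for a given seed."""
    rng = random.Random(seed)
    keys = sorted(reviews_by_key)                  # fixed order so the seed fully determines the draw
    chosen = rng.sample(keys, min(k, len(keys)))   # never ask for more items than exist
    return [reviews_by_key[key] for key in chosen]

# Hypothetical dictionary in the same shape as the loaded JSON file
toy_reviews = {"A1.R%d" % i: {"overall": float(i % 5 + 1)} for i in range(100)}
picked = sample_reviews(toy_reviews, k=3)
print(len(picked))   # 3
```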

### Extract a List of Products Related to Product Ids

In [57]:
#Load the list of Nike Asins

allnikeasins = []
allasins_file_path = '%s/allasins.txt' % DATA_DIR

#open the file and load to a list
for data in open(allasins_file_path, 'r'):
  asins = data.split(',')
  for anasin in asins:
    allnikeasins.append(anasin)

In [58]:
#print a summary of the records processed
allnikeasins_length = len(allnikeasins)
print('Process completed for', allnikeasins_length, 'records.')
print('First 5 Asins in list:', allnikeasins[0:5])

Process completed for 8327 records.
First 5 Asins in list: ['B007I7C3IK', 'B0098YXOES', 'B007MV6BNU', 'B0053HC6T8', 'B002Z34I94']


In [59]:
#the path for the all product dict
allproducts_file_path = '%s/allproducts.p' % DATA_DIR
#load the dict
allproducts =  pickle.load(open(allproducts_file_path, 'rb'))

In [60]:
print('size of the full product catalog:', len(allproducts))
keys = set(allnikeasins).intersection(allproducts)
allnikeproducts = {key:allproducts[key] for key in keys}
print('size of the nike product catalog:', len(allnikeproducts))

size of the full product catalog: 1503384
size of the nike product catalog: 8327


In [61]:
#save the files to disk
allnikeproducts_file_path = '%s/allnikeproducts.p' % DATA_DIR
pickle.dump(allnikeproducts, open(allnikeproducts_file_path, 'wb'))

## Preprocessing the Data
### Load the Nike Review Data

In [62]:
#this assigns the filename we're trying to load
allnikereviews_file_path = '%s/allnikereviews.json' % DATA_DIR
json_file = json.load(open(allnikereviews_file_path, 'r'))

In [63]:
#extract review text from all review details
reviews = []
for a_review in json_file:
    the_review = json_file[a_review]
    text = the_review["reviewText"]
    reviews.append(text)