<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

# Topic Modeling
## *Data Science Unit 4 Sprint 1 Assignment 4*

Analyze a corpus of Amazon reviews from Unit 4 Sprint 1 Module 1's lecture using topic modeling: 

- Fit a Gensim LDA topic model on Amazon Reviews
- Select an appropriate number of topics
- Create some dope visualizations of the topics
- Write a few bullets on your findings in markdown at the end
- **Note**: You don't *have* to use generators for this assignment

In [1]:
import numpy as np
import gensim
import os
import re

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora

from gensim.models.ldamulticore import LdaMulticore

import pandas as pd



In [2]:
# Tokenizer function

def tokenize(text):
    tokens = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            tokens.append(token)
    return tokens
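
To see what this filter keeps without pulling in gensim, here's a rough stdlib-only sketch. The regex split and the tiny `TOY_STOPWORDS` set are stand-ins I made up for `simple_preprocess` and `STOPWORDS`, and the sample sentence is hypothetical, so treat it as illustrative only.

```python
import re

# Hypothetical stand-in for gensim's STOPWORDS (the real set is much larger).
TOY_STOPWORDS = frozenset({"and", "the", "three", "with", "this"})

def tokenize_sketch(text):
    """Approximate the notebook's tokenize(): lowercase, split on non-letters,
    then drop stopwords and anything 3 characters or shorter."""
    candidates = re.findall(r"[a-z]+", text.lower())
    return [t for t in candidates if t not in TOY_STOPWORDS and len(t) > 3]

sample = "I ordered three packs and the batteries don't last long."
tokens = tokenize_sketch(sample)
print(tokens)
```

One thing this makes obvious: `len(token) > 3` throws away every word of three letters or fewer, so short but meaningful words like "bad" never reach the model.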

In [3]:
df = pd.read_csv('data/Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv')

In [5]:
df.head(1)

| | id | dateAdded | dateUpdated | name | asins | brand | categories | primaryCategories | imageURLs | keys | ... | reviews.didPurchase | reviews.doRecommend | reviews.id | reviews.numHelpful | reviews.rating | reviews.sourceURLs | reviews.text | reviews.title | reviews.username | sourceURLs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AVpgNzjwLJeJML43Kpxn | 2015-10-30T08:59:32Z | 2019-04-25T09:08:16Z | AmazonBasics AAA Performance Alkaline Batterie... | B00QWO9P0O,B00LH3DMUO | Amazonbasics | AA,AAA,Health,Electronics,Health & Household,C... | Health & Beauty | https://images-na.ssl-images-amazon.com/images... | amazonbasics/hl002619,amazonbasicsaaaperforman... | ... | | | | | 3 | https://www.amazon.com/product-reviews/B00QWO9... | I order 3 of them and one of the item is bad q... | ... 3 of them and one of the item is bad quali... | Byger yang | https://www.barcodable.com/upc/841710106442,ht... |


In [8]:
df['tokens'] = df['reviews.text'].apply(tokenize)

In [9]:
df['tokens'].head()

0    [order, item, quality, missing, backup, spring...
1                    [bulk, expensive, products, like]
2                             [duracell, price, happy]
3              [work, brand, batteries, better, price]
4             [batteries, long, lasting, price, great]
Name: tokens, dtype: object

In [11]:
df = df[['name', 'tokens']]

In [12]:
df.head()

| | name | tokens |
|---|---|---|
| 0 | AmazonBasics AAA Performance Alkaline Batterie... | [order, item, quality, missing, backup, spring... |
| 1 | AmazonBasics AAA Performance Alkaline Batterie... | [bulk, expensive, products, like] |
| 2 | AmazonBasics AAA Performance Alkaline Batterie... | [duracell, price, happy] |
| 3 | AmazonBasics AAA Performance Alkaline Batterie... | [work, brand, batteries, better, price] |
| 4 | AmazonBasics AAA Performance Alkaline Batterie... | [batteries, long, lasting, price, great] |


In [None]:
df.name.value_counts()

# k. not sure the name is meaningful... might just chop it.

In [24]:
# Here I just called the df instead of that crazy path method from lecture dealing with 
# a bunch of text files (looks more useful that way, but this is easier...)

id2word = corpora.Dictionary(df['tokens'])

In [22]:
# Checking to see if it works. It do.

id2word.token2id['battery']

2

In [25]:
# How many words are we talkin'?
len(id2word.keys())

8980

In [26]:
# This is a way of tuning/trimming the words we use

id2word.filter_extremes(no_below=5, no_above=0.9)

In [27]:
# Look at the number of words left after trimming
len(id2word.keys())

3337
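
For intuition on what `no_below` and `no_above` are doing, here's a small pure-Python sketch of the same idea: count how many documents each token appears in, then keep only tokens inside both bounds. The function name `filter_extremes_sketch` and the toy documents are mine, the rounding of `no_above` down to a whole number of documents is my reading of the behavior, and the sketch ignores the real method's `keep_n` cap, so treat it as an approximation rather than gensim's implementation.

```python
from collections import Counter

def filter_extremes_sketch(token_lists, no_below=5, no_above=0.9):
    """Return the tokens that appear in at least `no_below` documents and in
    no more than a `no_above` fraction of all documents."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each token once per document
    max_docs = int(no_above * n_docs)  # assumption: fraction truncated to whole docs
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Hypothetical mini-corpus: "great" is everywhere, "kindle"/"tablet" are rare.
toy_docs = [
    ["great", "batteries"],
    ["great", "kindle"],
    ["great", "batteries"],
    ["great", "tablet"],
]
kept = filter_extremes_sketch(toy_docs, no_below=2, no_above=0.9)
print(kept)
```

In the toy corpus only "batteries" survives: "great" is too common to separate topics, and the one-off words are too rare to learn anything from.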

In [28]:
# Need a "Bag of Words"

corpus = [id2word.doc2bow(text) for text in df['tokens']]

In [30]:
corpus[6][:10]

[(15, 1), (36, 1), (37, 1), (38, 1)]
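
Each pair in that output is `(token_id, count)` for one review. A minimal sketch of the bag-of-words step, using a made-up `token2id` mapping and a hypothetical helper name `doc2bow_sketch` (not gensim's own code), looks like this:

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Turn a token list into sorted (token_id, count) pairs, silently
    skipping tokens that are not in the dictionary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical dictionary; the real ids come from corpora.Dictionary.
toy_token2id = {"batteries": 0, "great": 1, "price": 2}
bow = doc2bow_sketch(["great", "price", "great", "unknown"], toy_token2id)
print(bow)
```

Word order is gone after this step, and any token trimmed out of the dictionary (like "unknown" here) simply disappears from the document.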

In [31]:
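# Fit a 15-topic LDA model on the bag-of-words corpus
lda = LdaMulticore(corpus=corpus,
                   id2word=id2word,
                   random_state=5,   # seed for the random number generator
                   num_topics=15,
                   passes=10,        # full sweeps over the corpus
                   workers=4)        # parallel worker processes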
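
The 15 here is picked by hand. One common way to compare candidate topic counts is a coherence score computed on each model's top words. Below is a pure-Python sketch of a UMass-style coherence (average log co-occurrence ratio over word pairs); the function name, the toy documents, and the averaging choice are my assumptions, and gensim's own `CoherenceModel` is what you'd more likely use on the real corpus.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass-style coherence for one topic: for each pair of top words
    (higher-ranked w_j, lower-ranked w_i), average
    log((D(w_i, w_j) + 1) / D(w_j)), where D counts documents."""
    doc_sets = [set(d) for d in docs]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    total, pairs = 0.0, 0
    for j, i in combinations(range(len(top_words)), 2):  # j ranks above i
        d_j = doc_freq(top_words[j])
        if d_j == 0:
            continue  # word never appears in the corpus; skip the pair
        total += math.log((doc_freq(top_words[i], top_words[j]) + 1) / d_j)
        pairs += 1
    return total / pairs if pairs else float("nan")

# Hypothetical tokenized reviews, two about batteries and two about Kindles.
toy_docs = [
    ["batteries", "long", "price"],
    ["batteries", "price", "great"],
    ["kindle", "reading", "light"],
    ["kindle", "light", "screen"],
]
coherent = umass_coherence(["batteries", "price"], toy_docs)
mixed = umass_coherence(["batteries", "kindle"], toy_docs)
print(coherent, mixed)
```

Words that actually show up together score higher than words that never share a review. Averaging a score like this over all topics, for several values of `num_topics`, gives a rough basis for picking one.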

In [32]:
# Oooooo "topics"
lda.print_topics()

[(0,
  '0.077*"great" + 0.053*"deal" + 0.040*"batteries" + 0.034*"best" + 0.028*"item" + 0.018*"love" + 0.018*"easy" + 0.015*"come" + 0.014*"convenient" + 0.014*"price"'),
 (1,
  '0.149*"batteries" + 0.056*"long" + 0.038*"work" + 0.036*"good" + 0.034*"brand" + 0.031*"price" + 0.025*"brands" + 0.022*"great" + 0.020*"amazon" + 0.018*"time"'),
 (2,
  '0.073*"kids" + 0.042*"tablet" + 0.036*"love" + 0.033*"year" + 0.029*"apps" + 0.023*"games" + 0.021*"time" + 0.020*"great" + 0.019*"easy" + 0.018*"child"'),
 (3,
  '0.057*"batteries" + 0.024*"battery" + 0.019*"amazon" + 0.010*"time" + 0.010*"charge" + 0.009*"bought" + 0.009*"remote" + 0.009*"long" + 0.009*"months" + 0.009*"work"'),
 (4,
  '0.109*"kindle" + 0.024*"light" + 0.019*"reading" + 0.018*"bought" + 0.016*"love" + 0.016*"reader" + 0.015*"read" + 0.013*"screen" + 0.012*"ipad" + 0.012*"easy"'),
 (5,
  '0.050*"happy" + 0.035*"purchase" + 0.022*"product" + 0.016*"thank" + 0.016*"satisfied" + 0.015*"service" + 0.014*"couldn" + 0.014*"purcha

In [33]:
# RegEx away all the extra silliness from the above output

words = [re.findall(r'"([^"]*)"',t[1]) for t in lda.print_topics()]

In [34]:
topics = [' '.join(t[0:5]) for t in words]
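
Here's the same regex on a single topic string (the first five terms of Topic 0 from the output above), plus a variant that also keeps the weights. The variable names are just for illustration.

```python
import re

# First five terms of Topic 0, copied from the print_topics() output.
topic_string = ('0.077*"great" + 0.053*"deal" + 0.040*"batteries" '
                '+ 0.034*"best" + 0.028*"item"')

# Same pattern as the notebook: grab whatever sits between double quotes.
topic_words = re.findall(r'"([^"]*)"', topic_string)

# Variant: capture the weight in front of each word as well.
weighted = [(word, float(weight))
            for weight, word in re.findall(r'([\d.]+)\*"([^"]*)"', topic_string)]

print(topic_words)
print(weighted)
```

If parsing strings feels fragile, `lda.show_topic(topic_id)` should hand back `(word, probability)` pairs directly, which avoids the regex entirely.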

In [35]:
# Pretty print-job of topics without all the silly extra stuff

for topic_id, t in enumerate(topics):
    print(f"------ Topic {topic_id} ------")
    print(t, end="\n\n")

------ Topic 0 ------
great deal batteries best item

------ Topic 1 ------
batteries long work good brand

------ Topic 2 ------
kids tablet love year apps

------ Topic 3 ------
batteries battery amazon time charge

------ Topic 4 ------
kindle light reading bought love

------ Topic 5 ------
happy purchase product thank satisfied

------ Topic 6 ------
easy tablet size perfect expected

------ Topic 7 ------
tablet loves great bought year

------ Topic 8 ------
amazon good quality amazing pretty

------ Topic 9 ------
books games read bought reading

------ Topic 10 ------
great good price product works

------ Topic 11 ------
tablet amazon great nice prime

------ Topic 12 ------
tablet apps amazon store google

------ Topic 13 ------
screen user clear friendly better

------ Topic 14 ------
love great tablet size price



In [36]:
import pyLDAvis.gensim

pyLDAvis.enable_notebook()



In [37]:
# This is going to become our visualizer 

pyLDAvis.gensim.prepare(lda, corpus, id2word)



### Quick Recap / reflections

* The topics here seem to be legit: not overlapping and approximately the same size.
* All of the summarized topics are overwhelmingly positive.
* Were these real people or Amazon bots? It seems almost too perfect. When I go to buy a product, the distribution of reviews makes a big difference, and Amazon products get such great reviews. My guess is that it has much more to do with their smart business practices: they don't need bots, because they can predict which products will be liked based on this sort of analysis.
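
A quick sanity check on the "not overlapping" point: the sketch below measures how many top-5 words the printed topics share, using Jaccard similarity. The word lists are copied from the printed topics above. Keep in mind that shared top words are only a crude proxy and not the same thing as the topic distances pyLDAvis draws, so this is a second opinion rather than a verdict.

```python
from collections import Counter
from itertools import combinations

# Top-5 words per topic, copied from the printed output above.
top_words = {
    0: "great deal batteries best item",
    1: "batteries long work good brand",
    2: "kids tablet love year apps",
    3: "batteries battery amazon time charge",
    4: "kindle light reading bought love",
    5: "happy purchase product thank satisfied",
    6: "easy tablet size perfect expected",
    7: "tablet loves great bought year",
    8: "amazon good quality amazing pretty",
    9: "books games read bought reading",
    10: "great good price product works",
    11: "tablet amazon great nice prime",
    12: "tablet apps amazon store google",
    13: "screen user clear friendly better",
    14: "love great tablet size price",
}
topic_sets = {k: set(v.split()) for k, v in top_words.items()}

def jaccard(a, b):
    """Shared words divided by total distinct words across both topics."""
    return len(a & b) / len(a | b)

# How many topics does each word show up in?
word_counts = Counter(w for s in topic_sets.values() for w in s)

# Every topic pair, most similar first.
overlaps = sorted(
    ((jaccard(topic_sets[i], topic_sets[j]), i, j)
     for i, j in combinations(topic_sets, 2)),
    reverse=True,
)

print(word_counts.most_common(5))
for score, i, j in overlaps[:5]:
    print(f"Topic {i} vs Topic {j}: {score:.2f}")
```

By this measure "tablet" lands in the top five of six different topics, so there is more shared vocabulary than the first bullet suggests, even if the topics sit apart in the visualization.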