# The First Solution
    This notebook applies topic modelling to Amazon reviews, analyses the outcomes, and provides insights into what value can be mined from this information.

### Reading the Data
     This piece of code reads the gzipped files and parses each line into a row of a pandas DataFrame.

In [1]:
data_dir = 'C:/Users/maruv/Desktop/DSB/LDA/'

In [2]:
import json
import os
import glob
import numpy as np
import random

In [3]:
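import ast
import gzip
import pandas as pd

def parse(path):
    # Each line is expected to hold one record written as a Python dict literal.
    # ast.literal_eval reads it without executing arbitrary code, unlike eval.
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            yield ast.literal_eval(line)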

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

In [4]:
df_reviews = getDF(data_dir+'reviews_Cell_Phones_and_Accessories_5.json.gz')
df_metadata = getDF(data_dir+'meta_Cell_Phones_and_Accessories.json.gz')

## Selecting the Item
    We need to be able to select the item to analyse. This section shows how to look up an item and view further details about it.

In [5]:
df_reviews.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4.0,Looks Good,1400630400,"05 21, 2014"
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5.0,Really great product.,1389657600,"01 14, 2014"
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5.0,LOVE LOVE LOVE,1403740800,"06 26, 2014"
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4.0,Cute!,1382313600,"10 21, 2013"
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5.0,leopard home button sticker for iphone 4s,1359849600,"02 3, 2013"


In [6]:
df_metadata.head()

Unnamed: 0,asin,related,title,price,salesRank,imUrl,brand,categories,description
0,0110400550,"{'also_bought': ['B00C56IXFG', 'B008ZUQWOK', '...",Pink &amp; White 3d Melt Ice-cream Skin Hard C...,3.33,{'Cell Phones & Accessories': 83460},http://ecx.images-amazon.com/images/I/31zn6SOL...,,"[[Cell Phones & Accessories, Cases, Basic Cases]]",Pink & White 3D Melt Ice-Cream Skin Hard Case ...
1,011040047X,"{'buy_after_viewing': ['B008RU7UL2', 'B00698LY...",Purple Hard Case Cover for Iphone 4 4s 4g with...,1.94,{'Cell Phones & Accessories': 495795},http://ecx.images-amazon.com/images/I/41WCZc2d...,,"[[Cell Phones & Accessories, Cases, Basic Cases]]",Purple Hard Case Cover for iPhone 4 4S 4G With...
2,0195866479,"{'buy_after_viewing': ['B00530RXP2', 'B004SH9B...",Hello Kitty Light-weighted Chrome Case Black C...,2.94,{'Cell Phones & Accessories': 371302},http://ecx.images-amazon.com/images/I/41fy1%2B...,,"[[Cell Phones & Accessories, Cases, Basic Cases]]","Thin and light weighted,\nCase's unique design..."
3,0214514706,"{'buy_after_viewing': ['B0042FV2SI', 'B00869D2...",Cool Summer Breeze in the Ocean Beach Collecti...,0.94,{'Cell Phones & Accessories': 778100},http://ecx.images-amazon.com/images/I/415cmp6Q...,,"[[Cell Phones & Accessories, Cases, Basic Cases]]",Product Name: Cool Summer Breeze In The Ocean...
4,0214714705,"{'buy_after_viewing': ['B008EU7HRM', 'B00869D2...",Cool Summer Breeze In The Ocean Beach Collecti...,5.79,{'Cell Phones & Accessories': 654894},http://ecx.images-amazon.com/images/I/41XDwPt2...,,"[[Cell Phones & Accessories, Cases, Basic Cases]]",Product Name: Cool Summer Breeze In The Ocean...


To view the names of all items present in the file, a new data frame holding only the asin and title columns is created.

In [7]:
df_item_names = df_metadata.loc[:,['asin','title']]
df_item_names.head()

Unnamed: 0,asin,title
0,0110400550,Pink &amp; White 3d Melt Ice-cream Skin Hard C...
1,011040047X,Purple Hard Case Cover for Iphone 4 4s 4g with...
2,0195866479,Hello Kitty Light-weighted Chrome Case Black C...
3,0214514706,Cool Summer Breeze in the Ocean Beach Collecti...
4,0214714705,Cool Summer Breeze In The Ocean Beach Collecti...


In [8]:
df_item_names.loc[df_item_names.asin.isin(['B005SUHPO6']),['title']]

Unnamed: 0,title
104895,Otterbox Defender Series Hybrid Case &amp; Hol...


In [9]:
def df_item(asin):
    '''Takes an item ID (ASIN) and returns a data frame with the text of all reviews of that item.'''
    reviewed_item = df_reviews.loc[df_reviews.asin.isin([asin]),['reviewText']]
    return reviewed_item
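
The selection step can be sketched on toy data. The two small frames below are invented for illustration; only the column names (asin, title, reviewText) come from the dataset shown above, and the helper names `title_of` and `reviews_of` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for df_metadata and df_reviews (all values are invented).
toy_metadata = pd.DataFrame({
    "asin": ["A001", "A002"],
    "title": ["Example case", "Example charger"],
})
toy_reviews = pd.DataFrame({
    "asin": ["A001", "A002", "A001"],
    "reviewText": ["Fits well.", "Stopped working.", "Sturdy case."],
})

def title_of(metadata, asin):
    """Return the title for one ASIN, or None if the ASIN is unknown."""
    match = metadata.loc[metadata["asin"] == asin, "title"]
    return match.iloc[0] if not match.empty else None

def reviews_of(reviews, asin):
    """Return the reviewText column for one ASIN, keeping the original index."""
    return reviews.loc[reviews["asin"] == asin, "reviewText"]

selected = reviews_of(toy_reviews, "A001")
print(title_of(toy_metadata, "A001"))
print(selected.tolist())
```

Because the original row index is kept, the selected reviews can still be traced back to their position in the full review table, as in the outputs further down.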

### Functions to Extract Nouns
    Using the POS tagger in NLTK, we can extract the nouns from a given sentence or review.

In [10]:
import nltk as nt
# "PRP$" and "PRP" (personal pronouns) are left out.
# Note that "POS" (possessive ending) and "WP" (wh-pronoun) are not noun tags.
noun_tags = ["NN","NNP","NNS","POS","WP"]

In [11]:
def tags(sentence):
    '''Takes a sentence and returns a list of (word, tag) tuples'''
    array_words = nt.word_tokenize(sentence)
    tags = nt.pos_tag(array_words)
    return tags

In [12]:
def noun_words(review):
    '''Takes one review (a string) and returns a list of the words whose tag is in noun_tags'''
    nouns = []
    for word, tag in tags(review):
        # Keep the word only if its POS tag is one of the selected tags.
        if tag in noun_tags:
            nouns.append(word)
    return nouns

In [13]:
def nouns_only(corpus):
    '''Takes an iterable of reviews and returns a list holding one list of nouns per review'''
    total_nouns = []
    for each in corpus:
        total_nouns.append(noun_words(each))
    return total_nouns
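
To make the filtering step concrete without running the tagger, here is a minimal sketch on hand-written (word, tag) pairs in the shape that the tagger returns. The tags below are assumed for illustration and were not produced by NLTK; `keep_tagged` is a hypothetical helper name.

```python
# Penn Treebank tags kept by the notebook, with short glosses for orientation.
NOUN_TAGS = {
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "POS": "possessive ending",
    "WP": "wh-pronoun",
}

# Hand-written (word, tag) pairs; the tags are assumptions, not tagger output.
tagged_review = [
    ("The", "DT"), ("case", "NN"), ("protects", "VBZ"),
    ("my", "PRP$"), ("phone", "NN"), ("and", "CC"),
    ("its", "PRP$"), ("buttons", "NNS"), (".", "."),
]

def keep_tagged(tagged_tokens, wanted_tags):
    """Return the words whose tag is in wanted_tags, in their original order."""
    return [word for word, tag in tagged_tokens if tag in wanted_tags]

print(keep_tagged(tagged_review, NOUN_TAGS))
```

Since "WP" is in the tag list, wh-pronouns such as "what" pass the filter, which is probably why "what" shows up among the topic words in one of the later outputs.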

## Creating the LDA Model using Gensim

This piece of code creates a document term matrix that can further be used to build LDA models.

Testing for complete reviews on the same item.

In [14]:
import gensim
from gensim import corpora

def lda_model(noun_array,n,k):
    '''Takes a list of noun lists (one per review), the number of topics n and the number of words in each topic k,
        and returns the ldamodel. k is not used during training; it only matters when the topics are printed.
        Number of passes is hardcoded to 50'''
    # Creating the term dictionary of our corpus, where every unique term is assigned an index. 
    dictionary = corpora.Dictionary(noun_array)
    # Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
    doc_term_matrix = [dictionary.doc2bow(each) for each in noun_array]
    # Creating the object for LDA model using gensim library
    Lda = gensim.models.ldamodel.LdaModel
    # Running and training the LDA model on the document term matrix.
    ldamodel = Lda(doc_term_matrix, num_topics=n, id2word = dictionary, passes=50)
    
    return ldamodel

Using TensorFlow backend.
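
The document-term representation built inside lda_model can be imitated in plain Python. This is only a sketch of the idea on two invented noun lists: the helper names are hypothetical, and the ids assigned here may differ from the ones gensim's dictionary would assign.

```python
from collections import Counter

# Two invented noun lists standing in for two reviews.
docs = [["case", "phone", "case"], ["phone", "screen"]]

def build_vocabulary(documents):
    """Assign an integer id to every distinct token, in first-seen order."""
    token_to_id = {}
    for doc in documents:
        for token in doc:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id

def to_bag_of_words(doc, token_to_id):
    """Turn one document into sorted (token_id, count) pairs."""
    counts = Counter(token_to_id[token] for token in doc)
    return sorted(counts.items())

vocab = build_vocabulary(docs)
matrix = [to_bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(matrix)
```

Each review becomes a sparse list of (term id, count) pairs, so word order inside a review is discarded before the model is trained.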


Forming a new data frame for the reviews of a selected item

In [15]:
# Creating a dataframe with the reviews of the required item
reviewed_item = df_item('B005SUHPO6')
final_corp = reviewed_item['reviewText']

In [16]:
final_corp.head()

59707    excellent product at 1/2 the price as sale at ...
59708    Sometimes the flap over the charging place is ...
59709    Great case.  Fits like every other Otterbox De...
59710    Use these for our technicians and anyone that ...
59711    It's very strong and protects my 4S phone! I t...
Name: reviewText, dtype: object

## Extracting only Nouns and various forms of nouns
Nouns were extracted; personal pronouns (I, we, them, they, etc.) are not considered.

In [17]:
total_nouns = nouns_only(final_corp)
len(total_nouns) # This gives the number of reviews

837

In [18]:
first_model = lda_model(total_nouns,5,3)

In [19]:
a = first_model.print_topics(num_topics=5, num_words=3)

In [20]:
a

[(0, '0.118*"case" + 0.052*"phone" + 0.021*"screen"'),
 (1, '0.087*"phone" + 0.054*"case" + 0.031*"iPhone"'),
 (2, '0.035*"case" + 0.033*"Otterbox" + 0.027*"phone"'),
 (3, '0.058*"case" + 0.023*"phone" + 0.015*"iPhone"'),
 (4, '0.040*"i" + 0.031*"product" + 0.016*"box"')]
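
The topic strings above are easier to compare once they are split into words and weights. A minimal sketch, assuming the strings keep the weight*"word" layout shown in this output; `parse_topic` and `TERM` are hypothetical names, not part of gensim.

```python
import re

# Two topic strings copied from the output above.
topics = [
    (0, '0.118*"case" + 0.052*"phone" + 0.021*"screen"'),
    (4, '0.040*"i" + 0.031*"product" + 0.016*"box"'),
]

# One term looks like 0.118*"case": a weight, an asterisk, a quoted word.
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Split a topic string into (word, weight) pairs, keeping their order."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]

for topic_id, text in topics:
    print(topic_id, parse_topic(text))
```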

### Documenting the results to find the optimum number of reviews:
    Now we run the topic modelling on increasing numbers of reviews for the same item, to determine what percentage of the reviews is needed to obtain the most accurate results.
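
The subset sizes follow from simple arithmetic on the 837 reviews counted above. The sketch below reproduces that arithmetic on placeholder review ids; the variable names are chosen for illustration.

```python
n_reviews = 837  # number of reviews for ASIN B005SUHPO6, as counted above
fractions = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]

# Round each fraction of the review count to a whole number of reviews.
sizes = [round(f * n_reviews) for f in fractions]
print(sizes)

# Each run uses the first `size` reviews, so the subsets are nested prefixes.
reviews = list(range(n_reviews))  # placeholder review ids
subsets = [reviews[:size] for size in sizes]
print([len(s) for s in subsets])
```

Two details are worth keeping in mind. Python 3 rounds exact halves to the nearest even integer, so 418.5 becomes 418. And because each subset is a prefix of the same list, the reviews are taken in file order rather than sampled at random.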

In [21]:
def results_array(asin,k,l):
    '''Takes arguments: ASIN, number of topics, number of words in each topic.
    Returns an array of the number of reviews and the corresponding topics for each iteration.'''
    results = [[],[]]
    reviewed_item = df_item(asin)
    final_corp = reviewed_item['reviewText']
    print(final_corp.head())
    total_nouns = nouns_only(final_corp)
    print('\ntotal reviews for this item : ',len(total_nouns))
    
    # Fractions of the reviews to model, converted to review counts.
    percent = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]
    val = [round(i * len(total_nouns)) for i in percent]

    for each in val:
        # Train on the first `each` reviews and record the printed topics.
        model = lda_model(total_nouns[:each],k,l)
        a = model.print_topics(num_topics=k, num_words=l)
        results[0].append(each)
        results[1].append(a)
    
    return results

In [22]:
help(results_array)

Help on function results_array in module __main__:

results_array(asin, k, l)
    Takes arguments: ASIN, number of topics, number of words in each topic.
    Returns an array of the number of reviews and the corresponding topics for each iteration.



In [23]:
results_array('B005SUHPO6',4,2) #Most generalized

59707    excellent product at 1/2 the price as sale at ...
59708    Sometimes the flap over the charging place is ...
59709    Great case.  Fits like every other Otterbox De...
59710    Use these for our technicians and anyone that ...
59711    It's very strong and protects my 4S phone! I t...
Name: reviewText, dtype: object

total reviews for this item :  837


[[84, 167, 251, 335, 418, 502, 586, 670, 753, 837],
 [[(0, '0.062*"case" + 0.059*"phone"'),
   (1, '0.108*"case" + 0.041*"phone"'),
   (2, '0.033*"phone" + 0.023*"Iphone"'),
   (3, '0.027*"phone" + 0.024*"case"')],
  [(0, '0.104*"case" + 0.052*"phone"'),
   (1, '0.052*"phone" + 0.031*"case"'),
   (2, '0.031*"case" + 0.015*"Otterbox"'),
   (3, '0.057*"case" + 0.049*"phone"')],
  [(0, '0.067*"phone" + 0.053*"case"'),
   (1, '0.044*"phone" + 0.029*"case"'),
   (2, '0.098*"case" + 0.037*"phone"'),
   (3, '0.055*"case" + 0.041*"phone"')],
  [(0, '0.014*"daughter" + 0.013*"Amazon"'),
   (1, '0.023*"case" + 0.020*"product"'),
   (2, '0.010*"plastic" + 0.010*"thing"'),
   (3, '0.097*"case" + 0.066*"phone"')],
  [(0, '0.058*"phone" + 0.053*"case"'),
   (1, '0.086*"case" + 0.052*"phone"'),
   (2, '0.072*"case" + 0.032*"phone"'),
   (3, '0.047*"case" + 0.043*"phone"')],
  [(0, '0.100*"case" + 0.055*"phone"'),
   (1, '0.051*"case" + 0.033*"phone"'),
   (2, '0.071*"case" + 0.052*"phone"'),
   (3, '

In [24]:
results_array('B005SUHPO6',3,3)  #moderate results

59707    excellent product at 1/2 the price as sale at ...
59708    Sometimes the flap over the charging place is ...
59709    Great case.  Fits like every other Otterbox De...
59710    Use these for our technicians and anyone that ...
59711    It's very strong and protects my 4S phone! I t...
Name: reviewText, dtype: object

total reviews for this item :  837


[[84, 167, 251, 335, 418, 502, 586, 670, 753, 837],
 [[(0, '0.039*"case" + 0.038*"phone" + 0.022*"product"'),
   (1, '0.090*"case" + 0.054*"phone" + 0.021*"screen"'),
   (2, '0.038*"phone" + 0.029*"case" + 0.021*"Iphone"')],
  [(0, '0.026*"case" + 0.022*"phone" + 0.015*"box"'),
   (1, '0.090*"case" + 0.038*"phone" + 0.021*"iPhone"'),
   (2, '0.070*"case" + 0.067*"phone" + 0.017*"Otterbox"')],
  [(0, '0.019*"phone" + 0.013*"Iphone" + 0.011*"what"'),
   (1, '0.083*"phone" + 0.077*"case" + 0.016*"screen"'),
   (2, '0.082*"case" + 0.022*"iPhone" + 0.020*"Otterbox"')],
  [(0, '0.103*"case" + 0.053*"phone" + 0.018*"iPhone"'),
   (1, '0.035*"phone" + 0.035*"case" + 0.022*"Otterbox"'),
   (2, '0.058*"case" + 0.048*"phone" + 0.013*"protection"')],
  [(0, '0.099*"case" + 0.048*"phone" + 0.018*"screen"'),
   (1, '0.059*"case" + 0.058*"phone" + 0.016*"iPhone"'),
   (2, '0.039*"case" + 0.026*"iPhone" + 0.022*"phone"')],
  [(0, '0.022*"Otterbox" + 0.018*"product" + 0.014*"color"'),
   (1, '0.098*"ca

In [25]:
results_array('B005SUHPO6',4,3) #70%

59707    excellent product at 1/2 the price as sale at ...
59708    Sometimes the flap over the charging place is ...
59709    Great case.  Fits like every other Otterbox De...
59710    Use these for our technicians and anyone that ...
59711    It's very strong and protects my 4S phone! I t...
Name: reviewText, dtype: object

total reviews for this item :  837


[[84, 167, 251, 335, 418, 502, 586, 670, 753, 837],
 [[(0, '0.067*"phone" + 0.049*"case" + 0.017*"protection"'),
   (1, '0.020*"thing" + 0.019*"Otterbox" + 0.015*"product"'),
   (2, '0.054*"case" + 0.054*"phone" + 0.016*"product"'),
   (3, '0.100*"case" + 0.042*"phone" + 0.021*"screen"')],
  [(0, '0.061*"case" + 0.040*"phone" + 0.023*"iPhone"'),
   (1, '0.075*"case" + 0.040*"phone" + 0.017*"protection"'),
   (2, '0.100*"case" + 0.044*"phone" + 0.017*"screen"'),
   (3, '0.068*"phone" + 0.017*"case" + 0.013*"box"')],
  [(0, '0.100*"case" + 0.047*"phone" + 0.022*"screen"'),
   (1, '0.075*"phone" + 0.069*"case" + 0.017*"iPhone"'),
   (2, '0.031*"phone" + 0.022*"case" + 0.020*"iPhone"'),
   (3, '0.024*"case" + 0.019*"product" + 0.012*"time"')],
  [(0, '0.111*"case" + 0.063*"phone" + 0.020*"iPhone"'),
   (1, '0.028*"phone" + 0.024*"case" + 0.022*"Otterbox"'),
   (2, '0.024*"Iphone" + 0.009*"rubber" + 0.009*"product"'),
   (3, '0.040*"phone" + 0.021*"case" + 0.021*"product"')],
  [(0, '0.014*

In [26]:
results_array('B005SUHPO6',4,4) #70%

59707    excellent product at 1/2 the price as sale at ...
59708    Sometimes the flap over the charging place is ...
59709    Great case.  Fits like every other Otterbox De...
59710    Use these for our technicians and anyone that ...
59711    It's very strong and protects my 4S phone! I t...
Name: reviewText, dtype: object

total reviews for this item :  837


[[84, 167, 251, 335, 418, 502, 586, 670, 753, 837],
 [[(0, '0.067*"case" + 0.040*"phone" + 0.020*"Otterbox" + 0.017*"protection"'),
   (1, '0.083*"case" + 0.058*"phone" + 0.024*"Defender" + 0.021*"Otterbox"'),
   (2, '0.039*"phone" + 0.036*"case" + 0.017*"protection" + 0.017*"iphone"'),
   (3, '0.035*"phone" + 0.030*"case" + 0.010*"Box" + 0.010*"Iphone"')],
  [(0, '0.010*"box" + 0.010*"flap" + 0.010*"anything" + 0.007*"thing"'),
   (1, '0.088*"case" + 0.047*"phone" + 0.024*"screen" + 0.017*"iPhone"'),
   (2, '0.082*"case" + 0.061*"phone" + 0.021*"protection" + 0.019*"iPhone"'),
   (3, '0.042*"case" + 0.035*"phone" + 0.016*"product" + 0.013*"order"')],
  [(0, '0.097*"case" + 0.050*"phone" + 0.023*"screen" + 0.016*"iPhone"'),
   (1, '0.054*"case" + 0.036*"phone" + 0.020*"iPhone" + 0.019*"Otterbox"'),
   (2, '0.058*"phone" + 0.046*"case" + 0.024*"protection" + 0.020*"product"'),
   (3, '0.038*"case" + 0.035*"phone" + 0.009*"color" + 0.007*"iPhone"')],
  [(0, '0.064*"case" + 0.031*"phone" 

In [27]:
results_array('B005SUHPO6',4,5) #Not useful at all

59707    excellent product at 1/2 the price as sale at ...
59708    Sometimes the flap over the charging place is ...
59709    Great case.  Fits like every other Otterbox De...
59710    Use these for our technicians and anyone that ...
59711    It's very strong and protects my 4S phone! I t...
Name: reviewText, dtype: object

total reviews for this item :  837


[[84, 167, 251, 335, 418, 502, 586, 670, 753, 837],
 [[(0,
    '0.060*"phone" + 0.022*"case" + 0.014*"box" + 0.014*"otter" + 0.014*"product"'),
   (1,
    '0.045*"phone" + 0.037*"case" + 0.019*"product" + 0.017*"Otterbox" + 0.014*"screen"'),
   (2,
    '0.100*"case" + 0.052*"phone" + 0.021*"screen" + 0.019*"iPhone" + 0.019*"Otterbox"'),
   (3,
    '0.016*"protection" + 0.015*"quality" + 0.012*"case" + 0.010*"flap" + 0.010*"durability"')],
  [(0,
    '0.098*"case" + 0.065*"phone" + 0.019*"protection" + 0.017*"Otterbox" + 0.016*"iPhone"'),
   (1,
    '0.081*"case" + 0.041*"phone" + 0.020*"screen" + 0.018*"iPhone" + 0.013*"plastic"'),
   (2,
    '0.019*"Otterbox" + 0.012*"one" + 0.011*"way" + 0.011*"otter" + 0.011*"colors"'),
   (3,
    '0.036*"case" + 0.031*"phone" + 0.018*"clip" + 0.014*"Defender" + 0.014*"belt"')],
  [(0,
    '0.062*"phone" + 0.061*"case" + 0.021*"screen" + 0.016*"iPhone" + 0.015*"product"'),
   (1,
    '0.036*"phone" + 0.027*"case" + 0.017*"product" + 0.011*"protectio

In [28]:
results_array('B005SUHPO6',5,3) # At 80% of the total reviews, the required topics appear with the lowest probabilities.

59707    excellent product at 1/2 the price as sale at ...
59708    Sometimes the flap over the charging place is ...
59709    Great case.  Fits like every other Otterbox De...
59710    Use these for our technicians and anyone that ...
59711    It's very strong and protects my 4S phone! I t...
Name: reviewText, dtype: object

total reviews for this item :  837


[[84, 167, 251, 335, 418, 502, 586, 670, 753, 837],
 [[(0, '0.073*"phone" + 0.070*"case" + 0.021*"product"'),
   (1, '0.061*"case" + 0.043*"phone" + 0.029*"protection"'),
   (2, '0.033*"phone" + 0.018*"cases" + 0.012*"one"'),
   (3, '0.101*"case" + 0.033*"phone" + 0.025*"Defender"'),
   (4, '0.038*"phone" + 0.020*"Iphone" + 0.020*"box"')],
  [(0, '0.082*"case" + 0.051*"phone" + 0.021*"screen"'),
   (1, '0.051*"phone" + 0.031*"case" + 0.020*"product"'),
   (2, '0.081*"case" + 0.056*"phone" + 0.026*"Otterbox"'),
   (3, '0.086*"case" + 0.041*"phone" + 0.027*"iPhone"'),
   (4, '0.055*"case" + 0.021*"protection" + 0.018*"phone"')],
  [(0, '0.090*"case" + 0.024*"phone" + 0.015*"screen"'),
   (1, '0.094*"case" + 0.064*"phone" + 0.027*"iPhone"'),
   (2, '0.039*"product" + 0.037*"case" + 0.028*"phone"'),
   (3, '0.066*"phone" + 0.051*"case" + 0.024*"Otterbox"'),
   (4, '0.035*"case" + 0.017*"silicone" + 0.016*"phone"')],
  [(0, '0.014*"holster" + 0.013*"Otterbox" + 0.012*"case"'),
   (1, '0.089

# Observations:
    It is observed that at 70% of the total reviews, with 4 topics and 4 words per topic, the results are accurate (judged by manual inspection of the topics).
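
The comparison in this notebook is done by eye. One possible way to put a number on how much the top words change between two subset sizes is the Jaccard overlap of their top-word sets. This is an illustrative assumption, not the method used above, and the helper names are hypothetical. The words are copied from the 4-topic, 4-word run for the first 84 and the first 167 reviews.

```python
# Top words per topic, copied from the results_array('B005SUHPO6',4,4) output.
run_84 = [
    ["case", "phone", "Otterbox", "protection"],
    ["case", "phone", "Defender", "Otterbox"],
    ["phone", "case", "protection", "iphone"],
    ["phone", "case", "Box", "Iphone"],
]
run_167 = [
    ["box", "flap", "anything", "thing"],
    ["case", "phone", "screen", "iPhone"],
    ["case", "phone", "protection", "iPhone"],
    ["case", "phone", "product", "order"],
]

def top_word_set(topics):
    """Collect the lower-cased top words of all topics in one run."""
    return {word.lower() for topic in topics for word in topic}

def jaccard(a, b):
    """Jaccard overlap |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

shared = sorted(top_word_set(run_84) & top_word_set(run_167))
overlap = jaccard(top_word_set(run_84), top_word_set(run_167))
print(f"shared top words: {shared}")
print(f"Jaccard overlap: {overlap:.3f}")
```

Lower-casing merges spellings such as "iPhone", "Iphone" and "iphone", which the model treats as separate terms. Computing this overlap for each pair of neighbouring subset sizes would show where the top words stop changing, under the assumption that stable top words indicate enough reviews.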

### Working with a different item

In [None]:
# TODO: run the code with a different item and different data, at 70% of the reviews, with 4 topics and 4 words per topic.