## Problem Statement
In this project we build a system that summarizes the customer reviews of a product into a handful of keywords, so that a customer visiting the product page does not have to read long reviews and can instead decide based on the product's average rating and the summary keywords.
Any tools and techniques may be used. The data consists of review and rating information for products that the client sells through its website.

The data description is as follows:

You are given a file named “Cell_Phones_and_Accessories.json”. It contains review information in the following columns:

- IC: item code of the product, e.g. B016MF3P3K
- Reviewer_Name: name of the reviewer
- Useful: number of useful votes (upvotes) the review received
- Prod_meta: a dictionary of product metadata; it holds only additional information about the product, if any is available
- Review: text of the review
- Rating: rating given to the product by the reviewer
- Rev_summ: summary of the review
- Review_timestamp: time when the review was posted (Unix time)
- Review_Date: date when the review was posted
- Prod_img: images that users posted after receiving the product
- Rev_verify: flag indicating whether the review has been verified (True/False)

With the features understood, the data needs proper cleaning. Rows with no review may be removed. Any column(s) may be used for the task, and EDA or feature engineering is welcome if it yields an important new feature.
After pre-processing all products, predict the important words that summarize the reviews of each product and return them. The number of words extracted for each topic is your choice, but you need to justify it, and the summary keywords must not exceed 30 words.

In [143]:
import pandas as pd
import numpy as np
import warnings 
warnings.filterwarnings('ignore')
import nltk
import ast
import re
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import wordnet
nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet
import json

[nltk_data] Downloading package sentiwordnet to C:\Users\Silent
[nltk_data]     night\AppData\Roaming\nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!


In [144]:
df=pd.read_json('Cell_Phones_and_Accessories.json')

In [145]:
df

Unnamed: 0,Rating,Rev_verify,Review_Date,IC,Prod_meta,Reviewer_Name,Review,Rev_summ,Review_timestamp,Useful,Prod_img
0,5,True,"09 1, 2015",B009XD5TPQ,,Sunny Zoeller,Bought it for my husband. He's very happy with it,He's very happy with,1441065600,,
1,5,True,"01 9, 2016",B016MF3P3K,,Denise Lesley,Great screen protector. Doesn't even seem as ...,Five Stars,1452297600,,
2,5,True,"04 21, 2013",B008DC8N5G,,Emir,Saved me lots of money! it's not gorilla glass...,As long as you know how to put it on!,1366502400,,
3,3,True,"02 27, 2013",B0089CH3TM,{'Color:': ' Green'},Alyse,"The material and fit is very nice, but the col...",Good case overall,1361923200,3,
4,4,True,"12 19, 2013",B00AKZWGAC,,TechGuy,This last me about 3 days till i have to charg...,Awesome Battery,1387411200,,
...,...,...,...,...,...,...,...,...,...,...,...
760445,4,False,"07 12, 2014",B00C3V9M8A,,momahjoub,Very good,Four Stars,1405123200,,
760446,5,False,"07 13, 2016",B0178BYS24,,Cindy,My name is Cynthia Beard and I believe that th...,... believe that the Samsung Galaxy car mount ...,1468368000,,
760447,4,True,"07 23, 2015",B009KY47CE,,zzrnam11,This iphone case is very durable and long last...,I LOVE THIS,1437609600,,
760448,5,True,"12 14, 2015",B00X60AYDY,{'Style:': ' 6-in-1 Silver'},ACER,great,Five Stars,1450051200,,


In [146]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 760450 entries, 0 to 760449
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Rating            760450 non-null  int64 
 1   Rev_verify        760450 non-null  bool  
 2   Review_Date       760450 non-null  object
 3   IC                760450 non-null  object
 4   Prod_meta         407826 non-null  object
 5   Reviewer_Name     760359 non-null  object
 6   Review            759920 non-null  object
 7   Rev_summ          760095 non-null  object
 8   Review_timestamp  760450 non-null  int64 
 9   Useful            62200 non-null   object
 10  Prod_img          18194 non-null   object
dtypes: bool(1), int64(2), object(8)
memory usage: 64.5+ MB


In [147]:
df.isnull().sum()

Rating                   0
Rev_verify               0
Review_Date              0
IC                       0
Prod_meta           352624
Reviewer_Name           91
Review                 530
Rev_summ               355
Review_timestamp         0
Useful              698250
Prod_img            742256
dtype: int64

From the above we can see that Prod_img, Useful, and Prod_meta have the highest numbers of missing values, so we simply drop those columns.
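As a quick check on that choice, the missing-value counts can be turned into fractions of the 760450 rows. The sketch below uses only the numbers printed by `df.isnull().sum()`; the 40% cut-off is an illustrative assumption, not a threshold taken from the notebook.

```python
# Missing-value counts copied from the df.isnull().sum() output above.
total_rows = 760450
missing = {
    "Prod_meta": 352624,
    "Reviewer_Name": 91,
    "Review": 530,
    "Rev_summ": 355,
    "Useful": 698250,
    "Prod_img": 742256,
}

# Fraction of rows in which each column is empty.
missing_fraction = {col: n / total_rows for col, n in missing.items()}

# Hypothetical cut-off: flag columns that are missing in more than 40% of rows.
THRESHOLD = 0.40
to_drop = sorted(col for col, frac in missing_fraction.items() if frac > THRESHOLD)

for col, frac in sorted(missing_fraction.items(), key=lambda kv: -kv[1]):
    print(f"{col:<15} {frac:6.1%}")
print("candidates to drop:", to_drop)
```

Prod_img and Useful are empty in over 90% of rows and Prod_meta in roughly 46%, while Review and Rev_summ are missing in well under 1%.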

In [148]:
df.drop(['Prod_meta','Useful','Prod_img'],axis=1,inplace=True)

From the problem statement, Review_Date, Reviewer_Name, and Review_timestamp are not needed, so we drop those columns as well.

In [149]:
df.drop(['Review_Date','Reviewer_Name','Review_timestamp'],axis=1,inplace=True)

In [150]:
df.isnull().sum()

Rating          0
Rev_verify      0
IC              0
Review        530
Rev_summ      355
dtype: int64

Review and Rev_summ still have missing values, so we replace them with the placeholder 'No Reviews'.

In [152]:
df.fillna('No Reviews',inplace=True)

In [153]:
## Merge the review text and the review summary (vectorized string concatenation)
df['Full_review'] = df['Review'] + ' ' + df['Rev_summ']

In [154]:
df.isnull().sum()

Rating         0
Rev_verify     0
IC             0
Review         0
Rev_summ       0
Full_review    0
dtype: int64

In [155]:
df.IC.nunique()

48134

So the final per-product DataFrame should have 48134 rows, one per unique item code.

In [156]:
df

Unnamed: 0,Rating,Rev_verify,IC,Review,Rev_summ,Full_review
0,5,True,B009XD5TPQ,Bought it for my husband. He's very happy with it,He's very happy with,Bought it for my husband. He's very happy with...
1,5,True,B016MF3P3K,Great screen protector. Doesn't even seem as ...,Five Stars,Great screen protector. Doesn't even seem as ...
2,5,True,B008DC8N5G,Saved me lots of money! it's not gorilla glass...,As long as you know how to put it on!,Saved me lots of money! it's not gorilla glass...
3,3,True,B0089CH3TM,"The material and fit is very nice, but the col...",Good case overall,"The material and fit is very nice, but the col..."
4,4,True,B00AKZWGAC,This last me about 3 days till i have to charg...,Awesome Battery,This last me about 3 days till i have to charg...
...,...,...,...,...,...,...
760445,4,False,B00C3V9M8A,Very good,Four Stars,Very good Four Stars
760446,5,False,B0178BYS24,My name is Cynthia Beard and I believe that th...,... believe that the Samsung Galaxy car mount ...,My name is Cynthia Beard and I believe that th...
760447,4,True,B009KY47CE,This iphone case is very durable and long last...,I LOVE THIS,This iphone case is very durable and long last...
760448,5,True,B00X60AYDY,great,Five Stars,great Five Stars


In [157]:
# Per-product rating statistics, broadcast back to every review row
ratings = df.groupby('IC')['Rating']
df['Average_Rating'] = ratings.transform('mean').round(1)
df['Minimum_Rating'] = ratings.transform('min')
df['Max_Rating'] = ratings.transform('max')

In [158]:
df

Unnamed: 0,Rating,Rev_verify,IC,Review,Rev_summ,Full_review,Average_Rating,Minimum_Rating,Max_Rating
0,5,True,B009XD5TPQ,Bought it for my husband. He's very happy with it,He's very happy with,Bought it for my husband. He's very happy with...,4.5,2,5
1,5,True,B016MF3P3K,Great screen protector. Doesn't even seem as ...,Five Stars,Great screen protector. Doesn't even seem as ...,3.6,1,5
2,5,True,B008DC8N5G,Saved me lots of money! it's not gorilla glass...,As long as you know how to put it on!,Saved me lots of money! it's not gorilla glass...,4.1,1,5
3,3,True,B0089CH3TM,"The material and fit is very nice, but the col...",Good case overall,"The material and fit is very nice, but the col...",4.5,1,5
4,4,True,B00AKZWGAC,This last me about 3 days till i have to charg...,Awesome Battery,This last me about 3 days till i have to charg...,4.4,1,5
...,...,...,...,...,...,...,...,...,...
760445,4,False,B00C3V9M8A,Very good,Four Stars,Very good Four Stars,3.0,1,5
760446,5,False,B0178BYS24,My name is Cynthia Beard and I believe that th...,... believe that the Samsung Galaxy car mount ...,My name is Cynthia Beard and I believe that th...,4.6,1,5
760447,4,True,B009KY47CE,This iphone case is very durable and long last...,I LOVE THIS,This iphone case is very durable and long last...,4.3,1,5
760448,5,True,B00X60AYDY,great,Five Stars,great Five Stars,4.6,1,5


In [159]:
df['Full_Review']= df.groupby(['IC'])['Full_review'].transform(lambda x: ','.join(x))

In [160]:
df

Unnamed: 0,Rating,Rev_verify,IC,Review,Rev_summ,Full_review,Average_Rating,Minimum_Rating,Max_Rating,Full_Review
0,5,True,B009XD5TPQ,Bought it for my husband. He's very happy with it,He's very happy with,Bought it for my husband. He's very happy with...,4.5,2,5,Bought it for my husband. He's very happy with...
1,5,True,B016MF3P3K,Great screen protector. Doesn't even seem as ...,Five Stars,Great screen protector. Doesn't even seem as ...,3.6,1,5,Great screen protector. Doesn't even seem as ...
2,5,True,B008DC8N5G,Saved me lots of money! it's not gorilla glass...,As long as you know how to put it on!,Saved me lots of money! it's not gorilla glass...,4.1,1,5,Saved me lots of money! it's not gorilla glass...
3,3,True,B0089CH3TM,"The material and fit is very nice, but the col...",Good case overall,"The material and fit is very nice, but the col...",4.5,1,5,"The material and fit is very nice, but the col..."
4,4,True,B00AKZWGAC,This last me about 3 days till i have to charg...,Awesome Battery,This last me about 3 days till i have to charg...,4.4,1,5,This last me about 3 days till i have to charg...
...,...,...,...,...,...,...,...,...,...,...
760445,4,False,B00C3V9M8A,Very good,Four Stars,Very good Four Stars,3.0,1,5,"poor fitting, poor quality, bad choice. I gue..."
760446,5,False,B0178BYS24,My name is Cynthia Beard and I believe that th...,... believe that the Samsung Galaxy car mount ...,My name is Cynthia Beard and I believe that th...,4.6,1,5,A great vent car mount. The magnet is strong a...
760447,4,True,B009KY47CE,This iphone case is very durable and long last...,I LOVE THIS,This iphone case is very durable and long last...,4.3,1,5,"the colors blue are like, but I like the gray ..."
760448,5,True,B00X60AYDY,great,Five Stars,great Five Stars,4.6,1,5,Works Great! I highly Recommend this! Five Sta...


In [167]:
df['Full_Review'].iloc[0]

'Bought it for my husband. He\'s very happy with it He\'s very happy with,Good product.  Does exactly what the description says it will do. I recommend buying this product. Leather case holder,This is a very nice case for medium to larger smartphones.  My Droid Razor Maxx fits very nicely, the belt clip is strong spring steel, and the magnetic clasp holds very well for a magnetic design. Nice leather smartphone case,Appears to be excellent product. Fits the phone very nicely even with the hard shell back I have on my phone. I like the fact it has a belt loop and well as the clip that slides over the belt. The loop ensures the case will not come off the belt yet you have the option of just using the clip when you have to quickly remove the phone.  Time will tell concerning the durability.  The price is hard to beat. Exceptional Value,Great fit for iPhone 5 with protective case on - yes you read correctly.  Needed a belt clip to hold iPhone 5 that has a thin protective case already on. T

In [170]:
data=df.drop_duplicates("IC")
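
The per-product table (average, minimum, and maximum rating plus the joined review text, one row per `IC`) can also be pictured as a single grouping pass. The following is a dependency-free sketch of that aggregation on a few made-up rows; the sample records and the helper name `summarize_by_product` are hypothetical and only illustrate what the `groupby`/`transform`/`drop_duplicates` steps compute.

```python
from collections import defaultdict

def summarize_by_product(rows):
    """Group (item_code, rating, text) rows and aggregate them per item code."""
    grouped = defaultdict(list)
    for item_code, rating, text in rows:
        grouped[item_code].append((rating, text))

    summary = {}
    for item_code, reviews in grouped.items():
        ratings = [r for r, _ in reviews]
        summary[item_code] = {
            "Average_Rating": round(sum(ratings) / len(ratings), 1),
            "Minimum_Rating": min(ratings),
            "Max_Rating": max(ratings),
            # Same separator as the notebook: reviews joined with a comma.
            "Full_Review": ",".join(t for _, t in reviews),
        }
    return summary

# Made-up sample rows (not taken from the dataset).
sample_rows = [
    ("A1", 5, "great case"),
    ("A1", 2, "broke quickly"),
    ("B2", 4, "good battery"),
]
result = summarize_by_product(sample_rows)
print(result["A1"])
```

In pandas, a single `groupby('IC').agg(...)` call would produce the same kind of one-row-per-product table without repeating the joined text on every review row.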

In [171]:
data

Unnamed: 0,Rating,Rev_verify,IC,Review,Rev_summ,Full_review,Average_Rating,Minimum_Rating,Max_Rating,Full_Review
0,5,True,B009XD5TPQ,Bought it for my husband. He's very happy with it,He's very happy with,Bought it for my husband. He's very happy with...,4.5,2,5,Bought it for my husband. He's very happy with...
1,5,True,B016MF3P3K,Great screen protector. Doesn't even seem as ...,Five Stars,Great screen protector. Doesn't even seem as ...,3.6,1,5,Great screen protector. Doesn't even seem as ...
2,5,True,B008DC8N5G,Saved me lots of money! it's not gorilla glass...,As long as you know how to put it on!,Saved me lots of money! it's not gorilla glass...,4.1,1,5,Saved me lots of money! it's not gorilla glass...
3,3,True,B0089CH3TM,"The material and fit is very nice, but the col...",Good case overall,"The material and fit is very nice, but the col...",4.5,1,5,"The material and fit is very nice, but the col..."
4,4,True,B00AKZWGAC,This last me about 3 days till i have to charg...,Awesome Battery,This last me about 3 days till i have to charg...,4.4,1,5,This last me about 3 days till i have to charg...
...,...,...,...,...,...,...,...,...,...,...
758007,3,True,B019NW8OV2,"Cheap plastic, not a big fan. I'm not sure i'...","Cheap case, nothing spectacular here","Cheap plastic, not a big fan. I'm not sure i'...",3.0,3,3,"Cheap plastic, not a big fan. I'm not sure i'..."
758085,5,True,B00EXQ6JMA,My daughter loves it ! Some past reveiws have...,My daughter loves it! Some past reveiws have c...,My daughter loves it ! Some past reveiws have...,5.0,5,5,My daughter loves it ! Some past reveiws have...
758141,5,False,B01739B1XA,I was happy when I found this product because ...,Solid and holds my iPhone5 perfectly. So simpl...,I was happy when I found this product because ...,5.0,5,5,I was happy when I found this product because ...
759794,3,True,B00GI8RRZE,Bulky and inconsistent,Three Stars,Bulky and inconsistent Three Stars,3.0,3,3,Bulky and inconsistent Three Stars


In [172]:
## Create a small dataset to test the feature engineering and the model, which runs much faster
test = df.head(20).copy()  # copy so later column assignments do not touch df or raise SettingWithCopyWarning

In [173]:
test

Unnamed: 0,Rating,Rev_verify,IC,Review,Rev_summ,Full_review,Average_Rating,Minimum_Rating,Max_Rating,Full_Review
0,5,True,B009XD5TPQ,Bought it for my husband. He's very happy with it,He's very happy with,Bought it for my husband. He's very happy with...,4.5,2,5,Bought it for my husband. He's very happy with...
1,5,True,B016MF3P3K,Great screen protector. Doesn't even seem as ...,Five Stars,Great screen protector. Doesn't even seem as ...,3.6,1,5,Great screen protector. Doesn't even seem as ...
2,5,True,B008DC8N5G,Saved me lots of money! it's not gorilla glass...,As long as you know how to put it on!,Saved me lots of money! it's not gorilla glass...,4.1,1,5,Saved me lots of money! it's not gorilla glass...
3,3,True,B0089CH3TM,"The material and fit is very nice, but the col...",Good case overall,"The material and fit is very nice, but the col...",4.5,1,5,"The material and fit is very nice, but the col..."
4,4,True,B00AKZWGAC,This last me about 3 days till i have to charg...,Awesome Battery,This last me about 3 days till i have to charg...,4.4,1,5,This last me about 3 days till i have to charg...
5,5,True,B00MAWPGMI,"Love this case, very sturdy!",Five Stars,"Love this case, very sturdy! Five Stars",4.4,1,5,"Love this case, very sturdy! Five Stars,Great ..."
6,5,False,B00NB7B4GI,Simple and good quality iPhone 6 case. Fits on...,Simple and good quality iPhone 6 case,Simple and good quality iPhone 6 case. Fits on...,4.9,4,5,Simple and good quality iPhone 6 case. Fits on...
7,5,True,B00NMR6N7W,Great screen protector for the money! Paid $1....,Perfect!,Great screen protector for the money! Paid $1....,4.8,4,5,Great screen protector for the money! Paid $1....
8,5,True,B018V60504,"Nice charger. One problem, one if the two USB ...",Make sure your Items work before you miss the ...,"Nice charger. One problem, one if the two USB ...",4.2,1,5,"Nice charger. One problem, one if the two USB ..."
9,5,False,B00PG8TID6,Most battery packs for iPhones come as a total...,This clever design combines a battery pack int...,Most battery packs for iPhones come as a total...,4.4,1,5,Most battery packs for iPhones come as a total...


In [178]:
from nltk.corpus import wordnet
import string
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.tokenize import WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer

In [179]:
stop_words = set(stopwords.words('english'))  # build once; set lookup is fast
tokenizer = RegexpTokenizer(r'\w+')
tag_re = re.compile('<.*?>')
def clean_text(text):
    text = str(text).lower()
    text = tag_re.sub('', text)        # strip HTML tags
    text = re.sub('[0-9]+', '', text)  # remove digits
    tokens = tokenizer.tokenize(text)
    # keep words longer than two characters that are not stopwords;
    # no stemming or lemmatization is applied, so words stay readable as keywords
    filtered_words = [w for w in tokens if len(w) > 2 and w not in stop_words]
    return " ".join(filtered_words)
test["Full_review"] = test["Full_review"].apply(clean_text)
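
To see what the cleaning steps do to a single review without loading NLTK, here is a small sketch of the same pipeline (lowercase, strip tags, drop digits, tokenize on word characters, filter short words and stopwords). The tiny `STOP_WORDS` set is an assumption standing in for NLTK's English stopword list, so results on other text will differ from the notebook's.

```python
import re

# Assumption: a tiny stand-in for NLTK's English stopword list.
STOP_WORDS = {"for", "very", "with", "the", "and", "this", "that", "have"}

TAG_RE = re.compile(r"<.*?>")     # HTML-like tags
DIGIT_RE = re.compile(r"[0-9]+")  # runs of digits
WORD_RE = re.compile(r"\w+")      # same pattern the RegexpTokenizer uses

def clean_text_sketch(text):
    """Lowercase, strip tags and digits, then keep informative words."""
    text = str(text).lower()
    text = TAG_RE.sub("", text)
    text = DIGIT_RE.sub("", text)
    tokens = WORD_RE.findall(text)
    kept = [w for w in tokens if len(w) > 2 and w not in STOP_WORDS]
    return " ".join(kept)

review = "Bought it for my husband. He's very happy with it He's very happy with"
print(clean_text_sketch(review))
```

For the first review in the dataset this prints `bought husband happy happy`, which matches the cleaned `Full_review` value in the notebook output.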

In [180]:
test

Unnamed: 0,Rating,Rev_verify,IC,Review,Rev_summ,Full_review,Average_Rating,Minimum_Rating,Max_Rating,Full_Review
0,5,True,B009XD5TPQ,Bought it for my husband. He's very happy with it,He's very happy with,bought husband happy happy,4.5,2,5,"bought husband happy happy,Good product. Does..."
1,5,True,B016MF3P3K,Great screen protector. Doesn't even seem as ...,Five Stars,great screen protector even seem though five s...,3.6,1,5,great screen protector even seem though five s...
2,5,True,B008DC8N5G,Saved me lots of money! it's not gorilla glass...,As long as you know how to put it on!,saved lots money gorilla glass careful subject...,4.1,1,5,saved lots money gorilla glass careful subject...
3,3,True,B0089CH3TM,"The material and fit is very nice, but the col...",Good case overall,material fit nice color neon green expected wo...,4.5,1,5,material fit nice color neon green expected wo...
4,4,True,B00AKZWGAC,This last me about 3 days till i have to charg...,Awesome Battery,last days till charge take forever charge make...,4.4,1,5,last days till charge take forever charge make...
5,5,True,B00MAWPGMI,"Love this case, very sturdy!",Five Stars,love case sturdy five stars,4.4,1,5,"love case sturdy five stars,Great looking case..."
6,5,False,B00NB7B4GI,Simple and good quality iPhone 6 case. Fits on...,Simple and good quality iPhone 6 case,simple good quality iphone case fits perfectly...,4.9,4,5,simple good quality iphone case fits perfectly...
7,5,True,B00NMR6N7W,Great screen protector for the money! Paid $1....,Perfect!,great screen protector money paid free shippin...,4.8,4,5,great screen protector money paid free shippin...
8,5,True,B018V60504,"Nice charger. One problem, one if the two USB ...",Make sure your Items work before you miss the ...,nice charger one problem one two usb slots mis...,4.2,1,5,nice charger one problem one two usb slots mis...
9,5,False,B00PG8TID6,Most battery packs for iPhones come as a total...,This clever design combines a battery pack int...,battery packs iphones come totally separate de...,4.4,1,5,battery packs iphones come totally separate de...


In [181]:
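def tag_text(text):
    """Tokenize the text and return a list of (word, POS tag) pairs."""
    # named tag_text so it does not shadow nltk's pos_tag imported above
    words = nltk.word_tokenize(text)
    return nltk.pos_tag(words)
test['tagged'] = test['Full_Review'].map(tag_text)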

In [182]:
test

Unnamed: 0,Rating,Rev_verify,IC,Review,Rev_summ,Full_review,Average_Rating,Minimum_Rating,Max_Rating,Full_Review,tagged
0,5,True,B009XD5TPQ,Bought it for my husband. He's very happy with it,He's very happy with,bought husband happy happy,4.5,2,5,"bought husband happy happy,Good product. Does...","[(bought, JJ), (husband, NN), (happy, JJ), (ha..."
1,5,True,B016MF3P3K,Great screen protector. Doesn't even seem as ...,Five Stars,great screen protector even seem though five s...,3.6,1,5,great screen protector even seem though five s...,"[(great, JJ), (screen, NN), (protector, NN), (..."
2,5,True,B008DC8N5G,Saved me lots of money! it's not gorilla glass...,As long as you know how to put it on!,saved lots money gorilla glass careful subject...,4.1,1,5,saved lots money gorilla glass careful subject...,"[(saved, VBN), (lots, NNS), (money, NN), (gori..."
3,3,True,B0089CH3TM,"The material and fit is very nice, but the col...",Good case overall,material fit nice color neon green expected wo...,4.5,1,5,material fit nice color neon green expected wo...,"[(material, JJ), (fit, NN), (nice, JJ), (color..."
4,4,True,B00AKZWGAC,This last me about 3 days till i have to charg...,Awesome Battery,last days till charge take forever charge make...,4.4,1,5,last days till charge take forever charge make...,"[(last, JJ), (days, NNS), (till, VB), (charge,..."
5,5,True,B00MAWPGMI,"Love this case, very sturdy!",Five Stars,love case sturdy five stars,4.4,1,5,"love case sturdy five stars,Great looking case...","[(love, VB), (case, NN), (sturdy, JJ), (five, ..."
6,5,False,B00NB7B4GI,Simple and good quality iPhone 6 case. Fits on...,Simple and good quality iPhone 6 case,simple good quality iphone case fits perfectly...,4.9,4,5,simple good quality iphone case fits perfectly...,"[(simple, NN), (good, JJ), (quality, NN), (iph..."
7,5,True,B00NMR6N7W,Great screen protector for the money! Paid $1....,Perfect!,great screen protector money paid free shippin...,4.8,4,5,great screen protector money paid free shippin...,"[(great, JJ), (screen, JJ), (protector, NN), (..."
8,5,True,B018V60504,"Nice charger. One problem, one if the two USB ...",Make sure your Items work before you miss the ...,nice charger one problem one two usb slots mis...,4.2,1,5,nice charger one problem one two usb slots mis...,"[(nice, JJ), (charger, NN), (one, CD), (proble..."
9,5,False,B00PG8TID6,Most battery packs for iPhones come as a total...,This clever design combines a battery pack int...,battery packs iphones come totally separate de...,4.4,1,5,battery packs iphones come totally separate de...,"[(battery, NN), (packs, VBZ), (iphones, NNS), ..."


In [185]:
def Aspect(tagged):
    """Collect runs of consecutive nouns (NN/NNP) as upper-cased aspect phrases."""
    aspectList = []
    current = []  # words of the noun phrase being built
    # each row of 'tagged' is already a list of (word, tag) pairs, so iterate over it directly
    for word, tag in tagged:
        if tag == 'NN' or tag == 'NNP':
            current.append(word)
        elif current:
            aspectList.append(' '.join(current).upper())
            current = []
    if current:  # flush a phrase that ends the text
        aspectList.append(' '.join(current).upper())
    return aspectList
test['Aspect'] = test['tagged'].map(Aspect)

In [98]:
## Aspect Extraction
# tagdict is assumed to map a key to a list of (ID, word, tag) triples;
# it is not defined in the cells shown here.
prevWord=''
prevTag=''
currWord=''
aspectList=[]
outputDict={}
#Extracting Aspects
for key,value in tagdict.items():  # Python 3: items() replaces iteritems()
    for ID,word,tag in value:
        if(tag=='NN' or tag=='NNP'):
            if(prevTag=='NN' or prevTag=='NNP'):
                currWord= prevWord + ' ' + word
            else:
                if prevWord:  # skip the empty initial value
                    aspectList.append(prevWord.upper())
                currWord= word
        prevWord=currWord
        prevTag=tag
#Eliminating aspects which have a count of 1 or less
for aspect in aspectList:
    if aspectList.count(aspect)>1 and aspect not in outputDict:
        outputDict[aspect]=aspectList.count(aspect)
#outputAspect=sorted(outputDict.items(), key=lambda x: x[1],reverse = True)
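
The frequency filter and the commented-out sort can be combined with `collections.Counter`. The sketch below is an illustration under assumptions: the sample aspect list is made up, and `top_aspects` with its `min_count` and `max_keywords` parameters is a hypothetical helper; the default cap of 30 follows the limit in the problem statement.

```python
from collections import Counter

def top_aspects(aspects, min_count=2, max_keywords=30):
    """Return (aspect, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(a for a in aspects if a)  # ignore empty strings
    frequent = [(a, n) for a, n in counts.most_common() if n >= min_count]
    return frequent[:max_keywords]

# Made-up aspect list for illustration.
sample_aspects = ["SCREEN PROTECTOR", "CASE", "SCREEN PROTECTOR", "", "BATTERY",
                  "CASE", "SCREEN PROTECTOR", "PRICE"]
print(top_aspects(sample_aspects))
```

Aspects mentioned only once are dropped, and the remaining ones come back ranked by frequency, so the first entries are the natural candidates for a product's summary keywords.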