In [1]:
!pip install wordcloud



## &#10148; Problem Statement
### <div class="alert alert-info">Thomas, a global market analyst, wishes to develop an automated system to analyze and monitor an enormous number of reviews. By monitoring the entire review history of products, he wishes to analyze tone, language, keywords, and trends over time to provide valuable insights that increase the success rate of existing and new products and marketing campaigns.</div>

## Introduction

Every day we come across many products, and online we scroll past hundreds of choices within a single category, which makes selection tedious for the customer. This is where reviews come in: customers who have already bought a product leave a rating and describe their experience with it. Ratings are numeric, so they can be sorted easily to judge whether a product is good or bad. Written reviews are harder, because each one has to be read to tell whether it conveys a positive or negative sentiment. Natural Language Processing (NLP) makes it possible to automate that step.

## &#10148; Required Libraries

- Importing the required libraries for the project

In [2]:
import pandas as pd
import numpy as np
import json
import csv
import gzip
import itertools
import string
import re
import datetime as dt
import warnings
#import wordcloud
import matplotlib.pyplot as plt
import pylab as pl
from textblob import TextBlob
import spacy                                  # spaCy NLP library
from spacy.lang.en import English             # English language class
nlp = spacy.load("en_core_web_sm")            # small pretrained English pipeline
warnings.filterwarnings("ignore")             # suppress warnings in the notebook output

- The gzip module provides the GzipFile class, as well as the open(), compress() and decompress() convenience functions.

- The yield keyword in Python is similar to return, but it turns the function into a generator: each yield hands back one value and pauses the function until the next value is requested.
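
The two notes above can be combined in a small sketch: a generator function that opens a gzip-compressed JSON Lines file and yields one parsed record at a time. The helper name `iter_json_lines` and the sample file are hypothetical and only serve as an illustration; the notebook itself loads the data with `pd.read_json`.

```python
import gzip
import json
import os
import tempfile


def iter_json_lines(path):
    """Yield one parsed JSON object per line of a gzip-compressed file (hypothetical helper)."""
    # gzip.open in text mode decompresses the file on the fly.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                # yield returns one record and pauses here until the next one is requested.
                yield json.loads(line)


# Build a tiny sample file so the sketch runs on its own (made-up records).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    handle.write('{"asin": "A1", "overall": 5}\n')
    handle.write('{"asin": "B2", "overall": 3}\n')

records = list(iter_json_lines(sample_path))
print(records)
```

Because the generator produces records lazily, a large file does not have to be held in memory all at once.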

## &#10148; Importing the Data

In [3]:
reviews = pd.read_json('Digital_Music.json.gz', lines=True)
metadata = pd.read_json('meta_Digital_Music.json.gz', lines=True)

In [4]:
reviews.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
0,5,True,"12 22, 2013",A1ZCPG3D3HGRSS,1388703,{'Format:': ' Audio CD'},mark l. massey,This is a great cd full of worship favorites!!...,Great worship cd,1387670400,,
1,5,True,"09 11, 2013",AC2PL52NKPL29,1388703,{'Format:': ' Audio CD'},Norma Mushen,"So creative! Love his music - the words, the ...",Gotta listen to this!,1378857600,,
2,5,True,"03 2, 2013",A1SUZXBDZSDQ3A,1388703,{'Format:': ' Audio CD'},Herbert W. Shurley,"Keith Green, gone far to early in his carreer,...",Great approach still gets the message out,1362182400,,
3,5,True,"12 2, 2012",A3A0W7FZXM0IZW,1388703,{'Format:': ' Audio CD'},Mary M Raybell,Keith Green had his special comedy style of Ch...,Great A must have,1354406400,,
4,5,False,"01 7, 2012",A12R54MKO17TW0,1388703,{'Format:': ' Audio CD'},J. Bynum,Keith Green / So you wanna go back to Egypt......,A great one from Keith with a guest appearance...,1325894400,6.0,


In [5]:
metadata.head()

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes,details
0,[],,[],,Master Collection Volume One,"[B000002UEN, B000008LD5, B01J804JKE, 747403435...",,John Michael Talbot,[],"58,291 in CDs & Vinyl (","[B000002UEN, B000008LD5, 7474034352, B000008LD...","<img src=""https://images-na.ssl-images-amazon....",,,$18.99,1377647,[],[],
1,[],,[],,Hymns Collection: Hymns 1 &amp; 2,"[5558154950, B00014K5V4]",,Second Chapter of Acts,[],"93,164 in CDs & Vinyl (","[B000008KJ3, B000008KJ0, 5558154950, B000UN8KZ...","<img src=""https://images-na.ssl-images-amazon....",,,,1529145,[],[],
2,[],,[],,Early Works - Don Francisco,"[B00004RC05, B003H8F4NA, B003ZFVHPO, B003JMP1Z...",,Don Francisco,[],"875,825 in CDs & Vinyl (","[B003H8F4NA, B003ZFVHPO, B003JMP1ZK, B00004RC0...","<img src=""https://images-na.ssl-images-amazon....",,,,1527134,[],[],
3,[],,[],,So You Wanna Go Back to Egypt,"[B0000275QQ, 0001393774, 0001388312, B0016CP2G...",,Keith Green,[],"203,263 in CDs & Vinyl (","[B00000I7JO, B0016CP2GS, 0001393774, B0000275Q...","<img src=""https://images-na.ssl-images-amazon....",,,$13.01,1388703,[],[],
4,[],,[1. Losing Game 2. I Can't Wait 3. Didn't He S...,,Early Works - Dallas Holm,"[B0002N4JP2, 0760131694, B00002EQ79, B00150K8J...",,Dallas Holm,[],"399,269 in CDs & Vinyl (","[B0002N4JP2, 0760131694, B00150K8JC, B003MTXNV...","<img src=""https://images-na.ssl-images-amazon....",,,,1526146,[],[],


- Shape of review data
- Columns of review data
- Statistical evaluation of the review data
- Data type of review data

In [6]:
print('Shape of review data:\n',reviews.shape)
print('\n')
print('#'*100,'\n')
print('Columns of review data:\n ',reviews.columns)
print('\n')
print('#'*100,'\n')
print('Statistical evualtion of the review data:\n',reviews.describe())
print('\n')
print('#'*100,'\n')
print('Data type of review data:\n',reviews.dtypes)

Shape of review data:
 (1584082, 12)


#################################################################################################### 

Columns of review data:
  Index(['overall', 'verified', 'reviewTime', 'reviewerID', 'asin', 'style',
       'reviewerName', 'reviewText', 'summary', 'unixReviewTime', 'vote',
       'image'],
      dtype='object')


#################################################################################################### 

Statistical evualtion of the review data:
             overall  unixReviewTime
count  1.584082e+06    1.584082e+06
mean   4.660555e+00    1.408211e+09
std    8.440314e-01    7.857646e+07
min    1.000000e+00    8.773056e+08
25%    5.000000e+00    1.374883e+09
50%    5.000000e+00    1.420070e+09
75%    5.000000e+00    1.457222e+09
max    5.000000e+00    1.538438e+09


#################################################################################################### 

Data type of review data:
 overall            int64
verified       

In [7]:
metadata.shape

(74347, 19)

- Shape of meta data
- Columns of meta data
- Statistical evaluation of the meta data
- Data type of meta data

In [8]:
print('Shape of meta data:\n',metadata.shape)
print('\n')
print('#'*100,'\n')
print('Columns of meta data:\n ',metadata.columns)
print('\n')
print('#'*100,'\n')
print('Data type of meta data:\n',metadata.dtypes)

Shape of meta data:
 (74347, 19)


#################################################################################################### 

Columns of meta data:
  Index(['category', 'tech1', 'description', 'fit', 'title', 'also_buy', 'tech2',
       'brand', 'feature', 'rank', 'also_view', 'main_cat', 'similar_item',
       'date', 'price', 'asin', 'imageURL', 'imageURLHighRes', 'details'],
      dtype='object')


#################################################################################################### 

Data type of meta data:
 category           object
tech1              object
description        object
fit                object
title              object
also_buy           object
tech2              object
brand              object
feature            object
rank               object
also_view          object
main_cat           object
similar_item       object
date               object
price              object
asin               object
imageURL           object
imageURLHighR

## Merging the review and metadata datasets for analysis

In [9]:
digi_muse = pd.merge(reviews,metadata,on='asin')
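
`pd.merge` performs an inner join by default, so reviews whose `asin` has no match in the metadata are dropped, and a review is repeated once for every matching metadata row. The following is a minimal sketch on made-up toy frames (not the real data) showing how `how` and `indicator` can be used to inspect what a merge keeps:

```python
import pandas as pd

# Toy stand-ins for the review and metadata tables (values are invented).
toy_reviews = pd.DataFrame({"asin": ["A1", "A1", "B2", "C3"], "overall": [5, 4, 3, 5]})
toy_meta = pd.DataFrame({"asin": ["A1", "B2", "D4"], "title": ["Album A", "Album B", "Album D"]})

# Default inner join: only asin values present in both frames survive.
inner = pd.merge(toy_reviews, toy_meta, on="asin")

# Left join with an indicator column: every review is kept and
# '_merge' marks the ones that found no metadata ('left_only').
left = pd.merge(toy_reviews, toy_meta, on="asin", how="left", indicator=True)

print(inner.shape)
print(left["_merge"].value_counts())
```

Comparing the row counts of the two results is a quick way to see how many reviews an inner join discards.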

In [10]:
digi_muse.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,...,feature,rank,also_view,main_cat,similar_item,date,price,imageURL,imageURLHighRes,details
0,5,True,"12 22, 2013",A1ZCPG3D3HGRSS,1388703,{'Format:': ' Audio CD'},mark l. massey,This is a great cd full of worship favorites!!...,Great worship cd,1387670400,...,[],"203,263 in CDs & Vinyl (","[B00000I7JO, B0016CP2GS, 0001393774, B0000275Q...","<img src=""https://images-na.ssl-images-amazon....",,,$13.01,[],[],
1,5,True,"09 11, 2013",AC2PL52NKPL29,1388703,{'Format:': ' Audio CD'},Norma Mushen,"So creative! Love his music - the words, the ...",Gotta listen to this!,1378857600,...,[],"203,263 in CDs & Vinyl (","[B00000I7JO, B0016CP2GS, 0001393774, B0000275Q...","<img src=""https://images-na.ssl-images-amazon....",,,$13.01,[],[],
2,5,True,"03 2, 2013",A1SUZXBDZSDQ3A,1388703,{'Format:': ' Audio CD'},Herbert W. Shurley,"Keith Green, gone far to early in his carreer,...",Great approach still gets the message out,1362182400,...,[],"203,263 in CDs & Vinyl (","[B00000I7JO, B0016CP2GS, 0001393774, B0000275Q...","<img src=""https://images-na.ssl-images-amazon....",,,$13.01,[],[],
3,5,True,"12 2, 2012",A3A0W7FZXM0IZW,1388703,{'Format:': ' Audio CD'},Mary M Raybell,Keith Green had his special comedy style of Ch...,Great A must have,1354406400,...,[],"203,263 in CDs & Vinyl (","[B00000I7JO, B0016CP2GS, 0001393774, B0000275Q...","<img src=""https://images-na.ssl-images-amazon....",,,$13.01,[],[],
4,5,False,"01 7, 2012",A12R54MKO17TW0,1388703,{'Format:': ' Audio CD'},J. Bynum,Keith Green / So you wanna go back to Egypt......,A great one from Keith with a guest appearance...,1325894400,...,[],"203,263 in CDs & Vinyl (","[B00000I7JO, B0016CP2GS, 0001393774, B0000275Q...","<img src=""https://images-na.ssl-images-amazon....",,,$13.01,[],[],


- Shape of merged data (digi_muse)
- Columns of merged data (digi_muse)
- Data type of merged data (digi_muse)

In [11]:
print('Shape of merge data:\n',digi_muse.shape)
print('\n')
print('#'*100,'\n')
print('Columns of merge data:\n ',digi_muse.columns)
print('\n')
print('#'*100,'\n')
digi_muse.info()  # info() prints its report itself and returns None, so it is not wrapped in print()

Shape of merge data:
 (182826, 30)


#################################################################################################### 

Columns of merge data:
  Index(['overall', 'verified', 'reviewTime', 'reviewerID', 'asin', 'style',
       'reviewerName', 'reviewText', 'summary', 'unixReviewTime', 'vote',
       'image', 'category', 'tech1', 'description', 'fit', 'title', 'also_buy',
       'tech2', 'brand', 'feature', 'rank', 'also_view', 'main_cat',
       'similar_item', 'date', 'price', 'imageURL', 'imageURLHighRes',
       'details'],
      dtype='object')


#################################################################################################### 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 182826 entries, 0 to 182825
Data columns (total 30 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   overall          182826 non-null  int64 
 1   verified         182826 non-null  bool  
 2   reviewTime       182

In [12]:
digi_muse_c=digi_muse.copy()

### Checking for missing values in our merged data (digi_muse).

In [13]:
digi_muse_c.isnull().sum()

overall                 0
verified                0
reviewTime              0
reviewerID              0
asin                    0
style               39392
reviewerName           23
reviewText             90
summary                56
unixReviewTime          0
vote               145817
image              179140
category                0
tech1                   0
description             0
fit                     0
title                   0
also_buy                0
tech2                   0
brand                   0
feature                 0
rank                    0
also_view               0
main_cat                0
similar_item            0
date                    0
price                   0
imageURL                0
imageURLHighRes         0
details              3161
dtype: int64

## Percentage of null values.

In [14]:
digi_muse_c.isnull().sum()/len(digi_muse_c)*100

overall             0.000000
verified            0.000000
reviewTime          0.000000
reviewerID          0.000000
asin                0.000000
style              21.546170
reviewerName        0.012580
reviewText          0.049227
summary             0.030630
unixReviewTime      0.000000
vote               79.757256
image              97.983875
category            0.000000
tech1               0.000000
description         0.000000
fit                 0.000000
title               0.000000
also_buy            0.000000
tech2               0.000000
brand               0.000000
feature             0.000000
rank                0.000000
also_view           0.000000
main_cat            0.000000
similar_item        0.000000
date                0.000000
price               0.000000
imageURL            0.000000
imageURLHighRes     0.000000
details             1.728966
dtype: float64

### The vote and image columns have a high percentage of null values (about 80% and 98%), so they are removed from the merged data (digi_muse_c).

In [15]:
digi_muse_c.drop(['image','vote'],axis=1 ,inplace=True)
#digi_muse_c = digi_muse_c.drop(columns='vote')
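
The same decision can be expressed with an explicit threshold instead of hard-coded column names. This is only a sketch on an invented toy frame; the 70% cut-off is an assumption chosen for illustration, not a value taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with one mostly complete column and two sparse ones (invented values).
toy = pd.DataFrame({
    "reviewText": ["good", "bad", None, "fine"],
    "vote": [np.nan, np.nan, np.nan, 6.0],
    "image": [None, None, None, None],
})

# Percentage of nulls per column: the mean of a boolean mask is the fraction of True values.
null_pct = toy.isnull().mean() * 100

threshold = 70.0  # assumed cut-off, in percent
sparse_cols = null_pct[null_pct > threshold].index.tolist()

# Drop every column whose null percentage exceeds the threshold.
trimmed = toy.drop(columns=sparse_cols)

print(null_pct)
print(sparse_cols)
```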

In [16]:
digi_muse_c.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,...,feature,rank,also_view,main_cat,similar_item,date,price,imageURL,imageURLHighRes,details
0,5,True,"12 22, 2013",A1ZCPG3D3HGRSS,1388703,{'Format:': ' Audio CD'},mark l. massey,This is a great cd full of worship favorites!!...,Great worship cd,1387670400,...,[],"203,263 in CDs & Vinyl (","[B00000I7JO, B0016CP2GS, 0001393774, B0000275Q...","<img src=""https://images-na.ssl-images-amazon....",,,$13.01,[],[],
1,5,True,"09 11, 2013",AC2PL52NKPL29,1388703,{'Format:': ' Audio CD'},Norma Mushen,"So creative! Love his music - the words, the ...",Gotta listen to this!,1378857600,...,[],"203,263 in CDs & Vinyl (","[B00000I7JO, B0016CP2GS, 0001393774, B0000275Q...","<img src=""https://images-na.ssl-images-amazon....",,,$13.01,[],[],
2,5,True,"03 2, 2013",A1SUZXBDZSDQ3A,1388703,{'Format:': ' Audio CD'},Herbert W. Shurley,"Keith Green, gone far to early in his carreer,...",Great approach still gets the message out,1362182400,...,[],"203,263 in CDs & Vinyl (","[B00000I7JO, B0016CP2GS, 0001393774, B0000275Q...","<img src=""https://images-na.ssl-images-amazon....",,,$13.01,[],[],
3,5,True,"12 2, 2012",A3A0W7FZXM0IZW,1388703,{'Format:': ' Audio CD'},Mary M Raybell,Keith Green had his special comedy style of Ch...,Great A must have,1354406400,...,[],"203,263 in CDs & Vinyl (","[B00000I7JO, B0016CP2GS, 0001393774, B0000275Q...","<img src=""https://images-na.ssl-images-amazon....",,,$13.01,[],[],
4,5,False,"01 7, 2012",A12R54MKO17TW0,1388703,{'Format:': ' Audio CD'},J. Bynum,Keith Green / So you wanna go back to Egypt......,A great one from Keith with a guest appearance...,1325894400,...,[],"203,263 in CDs & Vinyl (","[B00000I7JO, B0016CP2GS, 0001393774, B0000275Q...","<img src=""https://images-na.ssl-images-amazon....",,,$13.01,[],[],


In [17]:
digi_muse_c.shape

(182826, 28)

### Removing rows with null values in the columns that still contain them.

In [18]:
# Drop every row that has a null in any of the columns that still contain missing values.
# (The original per-column loops re-applied the same filter once per column, which was redundant.)
null_cols = ['style', 'reviewerName', 'reviewText', 'summary', 'details']
digi_muse_c = digi_muse_c.dropna(subset=null_cols)

In [19]:
digi_muse_c.shape

(140833, 28)

In [20]:
print('Old data shape =>\n',digi_muse.shape)
print('#'*100)
print('New data shape =>\n',digi_muse_c.shape)

Old data shape =>
 (182826, 30)
####################################################################################################
New data shape =>
 (140833, 28)


### Check null values

In [21]:
digi_muse_c.isnull().sum()

overall            0
verified           0
reviewTime         0
reviewerID         0
asin               0
style              0
reviewerName       0
reviewText         0
summary            0
unixReviewTime     0
category           0
tech1              0
description        0
fit                0
title              0
also_buy           0
tech2              0
brand              0
feature            0
rank               0
also_view          0
main_cat           0
similar_item       0
date               0
price              0
imageURL           0
imageURLHighRes    0
details            0
dtype: int64

### There are no missing values left

# Drop irrelevant columns

### What Is Unix Time?

###### Unix time is the number of seconds elapsed since 00:00:00 UTC on 1 January 1970. It is a timestamp format that looks different from the human-readable dates and times we are used to, but it is compact: a single number of seconds takes less space than separate values for the year, month, day, hour, and so on, and it is easy to compare and sort.
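
As a small illustration, the two `unixReviewTime` values below are copied from the first and fifth rows of the review table shown earlier. Converting them with `pd.to_datetime(..., unit="s")` suggests that they encode the same dates as the `reviewTime` column of those rows, which is consistent with treating `unixReviewTime` as redundant here.

```python
import pandas as pd

# unixReviewTime values from rows 0 and 4 of reviews.head()
unix_seconds = pd.Series([1387670400, 1325894400])

# unit="s" tells pandas the numbers are seconds since 1970-01-01 00:00:00 UTC.
as_dates = pd.to_datetime(unix_seconds, unit="s")

print(as_dates.dt.strftime("%Y-%m-%d").tolist())
```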

In [22]:
cols = ['fit', 'tech2','tech1', 'feature', 'details', 'date','imageURL', 'imageURLHighRes','unixReviewTime']

# Drop all irrelevant columns in a single call
digi_muse_c.drop(columns=cols, inplace=True)

In [23]:
print('New data shape after drop some irrelevant:\n',digi_muse_c.shape)

New data shape after drop some irrelevant:
 (140833, 19)


In [24]:
print('Columns of New data:\n ',digi_muse_c.columns)

Columns of New data:
  Index(['overall', 'verified', 'reviewTime', 'reviewerID', 'asin', 'style',
       'reviewerName', 'reviewText', 'summary', 'category', 'description',
       'title', 'also_buy', 'brand', 'rank', 'also_view', 'main_cat',
       'similar_item', 'price'],
      dtype='object')


In [25]:
digi_muse_c.reset_index(drop=True,inplace=True)
digi_muse_c.head(5)

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,category,description,title,also_buy,brand,rank,also_view,main_cat,similar_item,price
0,5,True,"12 12, 2014",A2KLYLMS27MMSX,7019098606,{'Format:': ' Audio CD'},C. oliver,Brooklyn Tab can not be beat in gospel music. ...,Uplifting,[],[],Live...Again,"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...",Brooklyn Tabernacle Choir,"725,231 in CDs & Vinyl (","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",,$13.71
1,5,True,"12 9, 2014",A1IOEZDAZD6X51,7019098606,{'Format:': ' Audio CD'},Raf,Great!,Five Stars,[],[],Live...Again,"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...",Brooklyn Tabernacle Choir,"725,231 in CDs & Vinyl (","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",,$13.71
2,5,True,"12 5, 2014",A1LUTAW6O69PQP,7019098606,{'Format:': ' Audio CD'},Paula Milo-Moultrie,Their music is always inspiring! I enjoy liste...,I enjoy listening to it as well as and singing...,[],[],Live...Again,"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...",Brooklyn Tabernacle Choir,"725,231 in CDs & Vinyl (","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",,$13.71
3,5,True,"09 29, 2014",A1UUCVS78RELYW,7019098606,{'Format:': ' Audio CD'},jadamtam,Beautiful songs,Five Stars,[],[],Live...Again,"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...",Brooklyn Tabernacle Choir,"725,231 in CDs & Vinyl (","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",,$13.71
4,5,True,"09 6, 2014",A1E2OY42QMGRXR,7019098606,{'Format:': ' Audio Cassette'},Chester Thrash,It's one of my favorites! I really love it!,I really love it!,[],[],Live...Again,"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...",Brooklyn Tabernacle Choir,"725,231 in CDs & Vinyl (","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",,$13.71


### Reformat the review date from its raw string form

In [26]:
digi_muse_c["reviewTime"] = pd.to_datetime(digi_muse_c["reviewTime"])
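
The raw strings follow a "month day, year" pattern such as `12 22, 2013`. As a hedged alternative to letting pandas infer the layout, the format can be stated explicitly; the format string below is an assumption based on the sample rows shown above, not something the dataset documents.

```python
import pandas as pd

# Sample raw values in the same shape as the reviewTime column.
raw = pd.Series(["12 22, 2013", "03 2, 2013", "01 7, 2012"])

# An explicit format avoids ambiguity between month and day and is
# typically faster than inference on large columns.
parsed = pd.to_datetime(raw, format="%m %d, %Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

Once the column is a datetime, accessors such as `.dt.year` or `.dt.to_period("M")` can be used to group reviews over time.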

In [27]:
digi_muse_c.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,category,description,title,also_buy,brand,rank,also_view,main_cat,similar_item,price
0,5,True,2014-12-12,A2KLYLMS27MMSX,7019098606,{'Format:': ' Audio CD'},C. oliver,Brooklyn Tab can not be beat in gospel music. ...,Uplifting,[],[],Live...Again,"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...",Brooklyn Tabernacle Choir,"725,231 in CDs & Vinyl (","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",,$13.71
1,5,True,2014-12-09,A1IOEZDAZD6X51,7019098606,{'Format:': ' Audio CD'},Raf,Great!,Five Stars,[],[],Live...Again,"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...",Brooklyn Tabernacle Choir,"725,231 in CDs & Vinyl (","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",,$13.71
2,5,True,2014-12-05,A1LUTAW6O69PQP,7019098606,{'Format:': ' Audio CD'},Paula Milo-Moultrie,Their music is always inspiring! I enjoy liste...,I enjoy listening to it as well as and singing...,[],[],Live...Again,"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...",Brooklyn Tabernacle Choir,"725,231 in CDs & Vinyl (","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",,$13.71
3,5,True,2014-09-29,A1UUCVS78RELYW,7019098606,{'Format:': ' Audio CD'},jadamtam,Beautiful songs,Five Stars,[],[],Live...Again,"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...",Brooklyn Tabernacle Choir,"725,231 in CDs & Vinyl (","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",,$13.71
4,5,True,2014-09-06,A1E2OY42QMGRXR,7019098606,{'Format:': ' Audio Cassette'},Chester Thrash,It's one of my favorites! I really love it!,I really love it!,[],[],Live...Again,"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...",Brooklyn Tabernacle Choir,"725,231 in CDs & Vinyl (","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",,$13.71


In [28]:
digi_muse_c = digi_muse_c[['asin', 'summary', 'reviewText','description', 'title','overall','brand','rank','verified', 'reviewerID', 'reviewerName', 'reviewTime',
                    'price','style','category','also_buy','also_view','main_cat','similar_item']]

In [29]:
digi_muse_c.head()

Unnamed: 0,asin,summary,reviewText,description,title,overall,brand,rank,verified,reviewerID,reviewerName,reviewTime,price,style,category,also_buy,also_view,main_cat,similar_item
0,7019098606,Uplifting,Brooklyn Tab can not be beat in gospel music. ...,[],Live...Again,5,Brooklyn Tabernacle Choir,"725,231 in CDs & Vinyl (",True,A2KLYLMS27MMSX,C. oliver,2014-12-12,$13.71,{'Format:': ' Audio CD'},[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
1,7019098606,Five Stars,Great!,[],Live...Again,5,Brooklyn Tabernacle Choir,"725,231 in CDs & Vinyl (",True,A1IOEZDAZD6X51,Raf,2014-12-09,$13.71,{'Format:': ' Audio CD'},[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
2,7019098606,I enjoy listening to it as well as and singing...,Their music is always inspiring! I enjoy liste...,[],Live...Again,5,Brooklyn Tabernacle Choir,"725,231 in CDs & Vinyl (",True,A1LUTAW6O69PQP,Paula Milo-Moultrie,2014-12-05,$13.71,{'Format:': ' Audio CD'},[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
3,7019098606,Five Stars,Beautiful songs,[],Live...Again,5,Brooklyn Tabernacle Choir,"725,231 in CDs & Vinyl (",True,A1UUCVS78RELYW,jadamtam,2014-09-29,$13.71,{'Format:': ' Audio CD'},[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
4,7019098606,I really love it!,It's one of my favorites! I really love it!,[],Live...Again,5,Brooklyn Tabernacle Choir,"725,231 in CDs & Vinyl (",True,A1E2OY42QMGRXR,Chester Thrash,2014-09-06,$13.71,{'Format:': ' Audio Cassette'},[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",


In [30]:
digi_muse_c.shape

(140833, 19)

### Check for duplicate values

In [31]:
#find duplicate rows across  columns
digi_muse_c.duplicated(['asin','reviewText','reviewerName','reviewerID']).sum()

17188

In [32]:
digi_muse_c.drop_duplicates(subset=['asin','reviewText','reviewerName','reviewerID'], keep='first',inplace=True)
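
To make the effect of the `subset` argument concrete, here is a minimal sketch on an invented toy frame. A row counts as a duplicate only when all key columns match an earlier row, so the same text written by a different reviewer or for a different product is kept.

```python
import pandas as pd

# Toy reviews: rows 0 and 1 are identical on every key column (invented values).
toy = pd.DataFrame({
    "asin": ["A1", "A1", "A1", "B2"],
    "reviewerID": ["R1", "R1", "R2", "R1"],
    "reviewText": ["Great!", "Great!", "Great!", "Great!"],
})

key_cols = ["asin", "reviewerID", "reviewText"]

# duplicated() flags every repeat after the first occurrence.
n_dupes = int(toy.duplicated(subset=key_cols).sum())

# keep="first" retains the first occurrence of each key combination.
deduped = toy.drop_duplicates(subset=key_cols, keep="first").reset_index(drop=True)

print(n_dupes)
print(deduped)
```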

In [33]:
#find duplicate rows across  columns
digi_muse_c.duplicated(['asin','reviewText','reviewerName','reviewerID']).sum()

0

In [34]:
print('New data shape after drop duplicate value:\n',digi_muse_c.shape)

New data shape after drop duplicate value:
 (123645, 19)


In [35]:
digi_muse_c[['rank']]

Unnamed: 0,rank
0,"725,231 in CDs & Vinyl ("
1,"725,231 in CDs & Vinyl ("
2,"725,231 in CDs & Vinyl ("
3,"725,231 in CDs & Vinyl ("
4,"725,231 in CDs & Vinyl ("
...,...
140828,"130,165 in CDs & Vinyl ("
140829,"130,165 in CDs & Vinyl ("
140830,"130,165 in CDs & Vinyl ("
140831,"130,165 in CDs & Vinyl ("


In [36]:
# For the 'rank' column:
# remove the thousands separators (',') and keep only the leading number.
# The result is still a string; it is converted to an integer in a later cell.
digi_muse_c['rank'] = digi_muse_c['rank'].replace(',', '', regex=True)
digi_muse_c['rank'] = digi_muse_c['rank'].str.extract(r'(\d+)', expand=False)
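
The cleaning step can be tried in isolation on a few sample strings. This sketch uses two rank strings in the format shown above plus an empty and a missing value (both added here as assumptions) to show that entries without digits come out as missing rather than raising an error.

```python
import pandas as pd

raw_rank = pd.Series(["725,231 in CDs & Vinyl (", "58,291 in CDs & Vinyl (", "", None])

# Remove thousands separators, then capture the first run of digits.
digits = raw_rank.str.replace(",", "", regex=False).str.extract(r"(\d+)", expand=False)

# errors="coerce" turns anything that is not a number into NaN.
rank_num = pd.to_numeric(digits, errors="coerce")

print(digits.tolist())
print(rank_num.tolist())
```

Rows where no digits are found end up as NaN, which would explain null values appearing in `rank` after this step.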

In [37]:
digi_muse_c[['rank']]

Unnamed: 0,rank
0,725231
1,725231
2,725231
3,725231
4,725231
...,...
140828,130165
140829,130165
140830,130165
140831,130165


In [38]:
digi_muse_c[['price']]

Unnamed: 0,price
0,$13.71
1,$13.71
2,$13.71
3,$13.71
4,$13.71
...,...
140828,
140829,
140830,
140831,


In [39]:
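digi_muse_c['price'].unique  # note: without parentheses this shows the bound method, not the unique values; call .unique() to list them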

<bound method Series.unique of 0         $13.71
1         $13.71
2         $13.71
3         $13.71
4         $13.71
           ...  
140828          
140829          
140830          
140831          
140832          
Name: price, Length: 123645, dtype: object>

In [40]:
# For the 'price' column:
# remove the '$' sign and the thousands separators (',').
# Prices that do not start with '$' or that contain '-' could be treated as unknown;
# that handling is left disabled here:
#product_table.loc[product_table["price"].str[0] != "$", "price"] = "unknown"
#product_table.loc[product_table["price"].str.contains('-'), "price"] = "unknown"
digi_muse_c['price'] = digi_muse_c['price'].replace(r'\$', '', regex=True)
digi_muse_c['price'] = digi_muse_c['price'].replace(',', '', regex=True)
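
A compact sketch of the same idea on sample strings: strip the currency symbol and separators, count the entries that end up empty, and convert the rest to numbers. The sample values are invented apart from `$13.71` and `$18.99`, which appear in the tables above.

```python
import pandas as pd

raw_price = pd.Series(["$13.71", "$1,299.00", "", "$18.99"])

# Remove '$' and ',' in one pass with a character class.
cleaned = raw_price.str.replace(r"[$,]", "", regex=True)

# Empty strings mark products without a listed price.
n_empty = int((cleaned == "").sum())

# errors="coerce" converts unparsable entries to NaN instead of raising.
price_num = pd.to_numeric(cleaned, errors="coerce")

print(n_empty)
print(price_num.tolist())
```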

In [41]:
print("Number of reviews having empty string :",digi_muse_c[digi_muse_c['price']==''].shape[0])

Number of reviews having empty string : 41141


In [42]:
digi_muse_c.isnull().sum()

asin               0
summary            0
reviewText         0
description        0
title              0
overall            0
brand              0
rank            4456
verified           0
reviewerID         0
reviewerName       0
reviewTime         0
price              0
style              0
category           0
also_buy           0
also_view          0
main_cat           0
similar_item       0
dtype: int64

In [43]:
digi_muse_c.dropna(inplace=True)

In [44]:
digi_muse_c.isnull().sum()

asin            0
summary         0
reviewText      0
description     0
title           0
overall         0
brand           0
rank            0
verified        0
reviewerID      0
reviewerName    0
reviewTime      0
price           0
style           0
category        0
also_buy        0
also_view       0
main_cat        0
similar_item    0
dtype: int64

In [45]:
digi_muse_c = digi_muse_c[digi_muse_c['price']!=''] 

In [46]:
print("Number of reviews having empty string :",digi_muse_c[digi_muse_c['price']==''].shape[0])

Number of reviews having empty string : 0


In [47]:
digi_muse_c.head()

Unnamed: 0,asin,summary,reviewText,description,title,overall,brand,rank,verified,reviewerID,reviewerName,reviewTime,price,style,category,also_buy,also_view,main_cat,similar_item
0,7019098606,Uplifting,Brooklyn Tab can not be beat in gospel music. ...,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A2KLYLMS27MMSX,C. oliver,2014-12-12,13.71,{'Format:': ' Audio CD'},[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
1,7019098606,Five Stars,Great!,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A1IOEZDAZD6X51,Raf,2014-12-09,13.71,{'Format:': ' Audio CD'},[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
2,7019098606,I enjoy listening to it as well as and singing...,Their music is always inspiring! I enjoy liste...,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A1LUTAW6O69PQP,Paula Milo-Moultrie,2014-12-05,13.71,{'Format:': ' Audio CD'},[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
3,7019098606,Five Stars,Beautiful songs,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A1UUCVS78RELYW,jadamtam,2014-09-29,13.71,{'Format:': ' Audio CD'},[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
4,7019098606,I really love it!,It's one of my favorites! I really love it!,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A1E2OY42QMGRXR,Chester Thrash,2014-09-06,13.71,{'Format:': ' Audio Cassette'},[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",


In [48]:
digi_muse_c.shape

(80891, 19)

In [49]:
# Change data types: rank to integer, price to float
digi_muse_c['rank'] = digi_muse_c['rank'].astype(int)
# Vectorized conversion; values that cannot be parsed become NaN
digi_muse_c['price'] = pd.to_numeric(digi_muse_c['price'], errors='coerce')

In [50]:
digi_muse_c.head(10)

Unnamed: 0,asin,summary,reviewText,description,title,overall,brand,rank,verified,reviewerID,reviewerName,reviewTime,price,style,category,also_buy,also_view,main_cat,similar_item
0,7019098606,Uplifting,Brooklyn Tab can not be beat in gospel music. ...,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A2KLYLMS27MMSX,C. oliver,2014-12-12,13.71,{'Format:': ' Audio CD'},[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
1,7019098606,Five Stars,Great!,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A1IOEZDAZD6X51,Raf,2014-12-09,13.71,{'Format:': ' Audio CD'},[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
2,7019098606,I enjoy listening to it as well as and singing...,Their music is always inspiring! I enjoy liste...,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A1LUTAW6O69PQP,Paula Milo-Moultrie,2014-12-05,13.71,{'Format:': ' Audio CD'},[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
3,7019098606,Five Stars,Beautiful songs,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A1UUCVS78RELYW,jadamtam,2014-09-29,13.71,{'Format:': ' Audio CD'},[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
4,7019098606,I really love it!,It's one of my favorites! I really love it!,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A1E2OY42QMGRXR,Chester Thrash,2014-09-06,13.71,{'Format:': ' Audio Cassette'},[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
5,7019098606,Live Again,This is an excellent CD. The CD was in good co...,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,False,A3W527BN42AXN1,Resa,2012-12-29,13.71,{'Format:': ' Audio CD'},[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
6,7019098606,Uplifting!,The Brooklyn Tabernacle Choir has been a minis...,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A1QS1CO7GHQ6EJ,Georgia Peach,2012-06-13,13.71,{'Format:': ' Audio CD'},[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
7,7019098606,thank you jesus,this is a must have cd to start the day. you ...,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A349TJMZWT77ZZ,mark dale,2012-06-04,13.71,{'Format:': ' Audio CD'},[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
8,7019098606,great cd at a great price,I have been looking for the cd for so long and...,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A87FAOGM1F70X,rox,2011-12-23,13.71,{'Format:': ' Audio CD'},[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
9,7019098606,best of the Brooklyn Tabernacle,This is my favorite CD from the Brooklyn Taber...,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,False,A2K5I4RAQ30IKI,Kevin Quinn,2004-06-09,13.71,{'Format:': ' Audio CD'},[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",


In [51]:
digi_muse_c.shape

(80891, 19)

In [52]:
# info() prints its report directly and returns None, so it is not wrapped in print()
digi_muse_c.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 80891 entries, 0 to 140821
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   asin          80891 non-null  object        
 1   summary       80891 non-null  object        
 2   reviewText    80891 non-null  object        
 3   description   80891 non-null  object        
 4   title         80891 non-null  object        
 5   overall       80891 non-null  int64         
 6   brand         80891 non-null  object        
 7   rank          80891 non-null  int32         
 8   verified      80891 non-null  bool          
 9   reviewerID    80891 non-null  object        
 10  reviewerName  80891 non-null  object        
 11  reviewTime    80891 non-null  datetime64[ns]
 12  price         79501 non-null  float64       
 13  style         80891 non-null  object        
 14  category      80891 non-null  object        
 15  also_buy      80891 non-null  objec

### Cleaning the style column

In [53]:
digi_muse_c[['style']]

Unnamed: 0,style
0,{'Format:': ' Audio CD'}
1,{'Format:': ' Audio CD'}
2,{'Format:': ' Audio CD'}
3,{'Format:': ' Audio CD'}
4,{'Format:': ' Audio Cassette'}
...,...
140814,{'Format:': ' Audio CD'}
140815,{'Format:': ' Audio CD'}
140819,{'Format:': ' Audio CD'}
140820,{'Format:': ' Audio CD'}


In [54]:
style= list(digi_muse_c['style'].items())
style

[(0, {'Format:': ' Audio CD'}),
 (1, {'Format:': ' Audio CD'}),
 (2, {'Format:': ' Audio CD'}),
 (3, {'Format:': ' Audio CD'}),
 (4, {'Format:': ' Audio Cassette'}),
 (5, {'Format:': ' Audio CD'}),
 (6, {'Format:': ' Audio CD'}),
 (7, {'Format:': ' Audio CD'}),
 (8, {'Format:': ' Audio CD'}),
 (9, {'Format:': ' Audio CD'}),
 (10, {'Format:': ' Audio CD'}),
 (11, {'Format:': ' Audio CD'}),
 (12, {'Format:': ' Audio CD'}),
 (13, {'Format:': ' Audio CD'}),
 (14, {'Format:': ' Audio CD'}),
 (15, {'Format:': ' Audio CD'}),
 (16, {'Format:': ' Audio CD'}),
 (17, {'Format:': ' Audio CD'}),
 (18, {'Format:': ' Audio CD'}),
 (19, {'Format:': ' Audio Cassette'}),
 (74, {'Format:': ' Audio CD'}),
 (75, {'Format:': ' Audio CD'}),
 (76, {'Format:': ' Audio CD'}),
 (77, {'Format:': ' MP3 Music'}),
 (78, {'Format:': ' Audio CD'}),
 (79, {'Format:': ' Audio CD'}),
 (80, {'Format:': ' Audio CD'}),
 (81, {'Format:': ' Audio CD'}),
 (82, {'Format:': ' Audio CD'}),
 (83, {'Format:': ' Audio CD'}),
 (84, {

In [55]:
#digi_muse_c.loc[digi_muse_c["style"].str[0] != "$", "price"] = "unknown"
#digi_muse_c.loc[digi_muse_c["style"].str.contains('-'), "price"] = "unknown"
digi_muse_c['style'].replace("size",'', regex=True, inplace=True)
digi_muse_c['style'].replace([":","''"],'', regex=True, inplace=True)
#digi_muse_c.fillna("unknown",inplace = True)

In [56]:
# Define function to clean dictionary string
def clean_dict_string(s):
    # Remove surrounding whitespace and curly braces (s is a plain str, so there is no .str accessor)
    s = s.strip().strip('{}').strip()
    return s

In [57]:
digi_muse_c['style'] = digi_muse_c['style'].apply(lambda x: json.dumps(x)[1:-1])

In [58]:
digi_muse_c['style'].replace(["Size","Color"],'', regex=True, inplace=True)
digi_muse_c['style'].replace([":","''"],'', regex=True, inplace=True)
digi_muse_c['style'].replace(r'[^a-zA-Z0-9\s]','', regex=True, inplace=True)

In [59]:
digi_muse_c['style'].replace(r'\s+', ' ', regex=True, inplace=True)

In [60]:
print(list(digi_muse_c['style']))

['Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio Cassette', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio Cassette', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format MP3 Music', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD', 'Format Audio CD'

In [61]:
digi_muse_c['style'].value_counts()

Format Audio CD                 62668
Format Vinyl                     9941
Format MP3 Music                 7167
Format DVD                        363
Format Audio Cassette             359
Format DVD Audio                  139
Format Bluray Audio                88
Format Hardcover                   44
Format Bluray                      40
Format Paperback                   20
Format Amazon Video                12
Format USB Memory Stick            12
Format Unknown Binding              6
Format Mass Market Paperback        5
Format CDROM                        5
Format VHS Tape                     5
Format Vinyl Bound                  4
Format Prime Video                  3
Format Video CD                     3
Format MP3 CD                       2
Format Unbound                      1
Format CD Video                     1
Format Misc Supplies                1
Format CDR                          1
Format Kitchen                      1
Name: style, dtype: int64
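Because each `style` entry starts out as a dictionary such as `{'Format:': ' Audio CD'}`, the value can also be read straight from the dictionary instead of being serialised with `json.dumps` and stripped with regular expressions. The following is only a sketch of that alternative, assuming every entry is a dict with at most a `'Format:'` key; the helper name `extract_format` and the `'Unknown'` fallback are hypothetical.

```python
import pandas as pd

def extract_format(style):
    """Return the trimmed 'Format:' value, or 'Unknown' when it is absent."""
    if isinstance(style, dict):
        return style.get("Format:", "Unknown").strip()
    return "Unknown"

# Toy data in the same shape as the style column.
styles = pd.Series([
    {"Format:": " Audio CD"},
    {"Format:": " Audio Cassette"},
    {"Format:": " MP3 Music"},
    {},
])

formats = styles.apply(extract_format)
print(formats.value_counts())
```

Unlike the regex route, this yields `Audio CD` rather than `Format Audio CD` and leaves any punctuation inside the value untouched.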

In [62]:
digi_muse_c.head(10)

Unnamed: 0,asin,summary,reviewText,description,title,overall,brand,rank,verified,reviewerID,reviewerName,reviewTime,price,style,category,also_buy,also_view,main_cat,similar_item
0,7019098606,Uplifting,Brooklyn Tab can not be beat in gospel music. ...,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A2KLYLMS27MMSX,C. oliver,2014-12-12,13.71,Format Audio CD,[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
1,7019098606,Five Stars,Great!,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A1IOEZDAZD6X51,Raf,2014-12-09,13.71,Format Audio CD,[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
2,7019098606,I enjoy listening to it as well as and singing...,Their music is always inspiring! I enjoy liste...,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A1LUTAW6O69PQP,Paula Milo-Moultrie,2014-12-05,13.71,Format Audio CD,[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
3,7019098606,Five Stars,Beautiful songs,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A1UUCVS78RELYW,jadamtam,2014-09-29,13.71,Format Audio CD,[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
4,7019098606,I really love it!,It's one of my favorites! I really love it!,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A1E2OY42QMGRXR,Chester Thrash,2014-09-06,13.71,Format Audio Cassette,[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
5,7019098606,Live Again,This is an excellent CD. The CD was in good co...,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,False,A3W527BN42AXN1,Resa,2012-12-29,13.71,Format Audio CD,[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
6,7019098606,Uplifting!,The Brooklyn Tabernacle Choir has been a minis...,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A1QS1CO7GHQ6EJ,Georgia Peach,2012-06-13,13.71,Format Audio CD,[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
7,7019098606,thank you jesus,this is a must have cd to start the day. you ...,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A349TJMZWT77ZZ,mark dale,2012-06-04,13.71,Format Audio CD,[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
8,7019098606,great cd at a great price,I have been looking for the cd for so long and...,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A87FAOGM1F70X,rox,2011-12-23,13.71,Format Audio CD,[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",
9,7019098606,best of the Brooklyn Tabernacle,This is my favorite CD from the Brooklyn Taber...,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,False,A2K5I4RAQ30IKI,Kevin Quinn,2004-06-09,13.71,Format Audio CD,[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",


In [63]:
digi_muse_c.shape

(80891, 19)

### Combining (concatenating) the review text and review summary columns

In [64]:
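# Note: '+' joins the two strings with no separator, so the end of reviewText
# runs straight into the start of summary (e.g. 'Great!' + 'Five Stars').
digi_muse_c['Review'] = digi_muse_c['reviewText'] + digi_muse_c['summary']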
digi_muse_c = digi_muse_c.drop(['reviewText','summary'], axis=1)
digi_muse_c.shape

(80891, 18)
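Plain `+` joins the two strings with nothing in between, so the last word of the review text fuses with the first word of the summary. A possible variant, shown here on a toy frame built from two of the sample rows above, is `Series.str.cat` with a single space as separator. This is a sketch of an alternative, not the step the notebook actually ran.

```python
import pandas as pd

toy = pd.DataFrame({
    "reviewText": ["Great!", "Beautiful songs"],
    "summary": ["Five Stars", "Five Stars"],
})

# sep=" " keeps a word boundary between the two fields.
toy["Review"] = toy["reviewText"].str.cat(toy["summary"], sep=" ")
print(toy["Review"].tolist())
```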

In [65]:
digi_muse_c.head(1)

Unnamed: 0,asin,description,title,overall,brand,rank,verified,reviewerID,reviewerName,reviewTime,price,style,category,also_buy,also_view,main_cat,similar_item,Review
0,7019098606,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A2KLYLMS27MMSX,C. oliver,2014-12-12,13.71,Format Audio CD,[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",,Brooklyn Tab can not be beat in gospel music. ...


### Data cleaning and removing special characters

In [66]:
def clean_text(text):
    # convert all text to lower case
    text = text.lower()
    # remove URLs and email addresses first, while '/', '.' and '@' are still present
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'\S+@\S+', '', text)
    # remove any newline characters
    text = re.sub(r'\n', '', text)
    # remove everything that is not a letter, digit or whitespace (punctuation, hyphens, symbols)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # remove any digits
    text = re.sub(r'[0-9]+', '', text)
    # collapse any extra white space into a single space
    text = re.sub(r'\s+', ' ', text)
    # remove leading and trailing white space
    text = text.strip()
    return text

In [67]:
digi_muse_c['Review'] = digi_muse_c['Review'].apply(lambda x:clean_text(x))
digi_muse_c[['Review']]

Unnamed: 0,Review
0,brooklyn tab can not be beat in gospel music t...
1,greatfive stars
2,their music is always inspiring i enjoy listen...
3,beautiful songsfive stars
4,its one of my favorites i really love iti real...
...,...
140814,i always like to hear philip glass on solo pia...
140815,an excellent choice of pieces covering the hig...
140819,i love covenantfive stars
140820,pretty good epfour stars


### Removing stop words

In [68]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS                     # Importing stop words from English language.
print('Number of stop words: %d' % len(spacy_stopwords))

Number of stop words: 326


In [69]:
print(spacy_stopwords)

{'of', 'itself', 'hence', 'cannot', 'neither', 'what', 'into', 'should', 'our', 'whatever', 'latterly', 'sixty', 'two', 'if', 'behind', 'three', 'its', 'yet', 'either', 'same', 'never', 'n’t', 'toward', 'some', 'make', 'always', 'moreover', "'re", "'d", 'mostly', 'amongst', 'wherein', 'former', 'them', 'one', 'third', 'along', 'does', 'thru', 'since', 'about', 'him', 'someone', 'hundred', 'whereafter', "'ve", 'everything', 'by', 'beforehand', 'becoming', 'serious', 'must', 'back', 'well', 'off', 'were', 'after', 'then', 'again', 'as', 'who', 'sometime', 'five', 'himself', 'than', 'still', 'last', '‘ll', 'thereby', 'using', 'an', 'ours', 'this', 'might', 'can', 'nine', 'my', 'not', 'except', '’m', 'further', 'see', 'only', 'everywhere', 'we', 'yourselves', 'whereupon', 'first', 'whether', '’ll', 'already', 'is', 'how', 'take', 'though', 'both', 'through', 'which', 'top', 'hereafter', 'next', 'nowhere', 'used', 'could', 'across', 'become', 'above', 'none', 'upon', 'she', 'no', 'whence', 

In [70]:
'five' in spacy_stopwords

True

In [71]:
'stars' in spacy_stopwords

False

In [72]:
excluding = ['against','not','don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't",
             'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 
             'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't",'shouldn', "shouldn't", 'wasn',
             "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't",'empty','enough','could','good','except','bad','five',
             'star','pretty','better','smooth','uneven']
stop_words = {word for word in spacy_stopwords if word not in excluding}  # a set gives fast membership tests

In [73]:
print('Number of stop words: %d' % len(stop_words))

Number of stop words: 319
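Words such as `not` are kept out of the stop-word list because dropping them can flip the meaning of a review. The sketch below illustrates this with a tiny hand-written stop list (an assumption for illustration, not spaCy's actual set) and a hypothetical helper `remove_stopwords`.

```python
# A tiny stand-in for a stop-word list; spaCy's real set is much larger.
base_stopwords = {"the", "is", "this", "not", "a", "it"}

# Words to keep because they carry negation or sentiment.
keep = {"not"}

# Set difference removes the kept words from the stop list.
stop_words = base_stopwords - keep

def remove_stopwords(text, stopwords):
    """Drop every whitespace-separated token that is in `stopwords`."""
    return " ".join(tok for tok in text.split() if tok not in stopwords)

review = "this is not a good album"
without_negation = remove_stopwords(review, base_stopwords)
with_negation = remove_stopwords(review, stop_words)

print(without_negation)
print(with_negation)
```

With the full stop list the negation disappears and the review reads as praise; with `not` retained, the remaining tokens still signal the opposite.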


In [74]:
digi_muse_c['Review'] = digi_muse_c['Review'].apply(lambda x: ' '.join([i for i in x.split() if i not in stop_words]))

In [75]:
def lemma(text):
    # Lemmatization: replace every token with its base form
    tokens = nlp(text.lower())
    return " ".join(word.lemma_ for word in tokens)

In [76]:
digi_muse_c['Review'] = digi_muse_c['Review'].apply(lambda x:lemma(x))
digi_muse_c['Review']

0         brooklyn tab not beat gospel music sing kind w...
1                                            greatfive star
2         music inspire enjoy listen singing alongi enjo...
3                                  beautiful songsfive star
4                                    favorite love iti love
                                ...                        
140814    like hear philip glass solo piano solo piano v...
140815    excellent choice piece cover high point einste...
140819                               love covenantfive star
140820                              pretty good epfour star
140821    michael davis hip respect trombonist rolling s...
Name: Review, Length: 80891, dtype: object

# Sentiment distribution based on polarity scores

### TextBlob: determining sentiment by computing a polarity score for the Review text

In [77]:
# Create a new column for sentiment polarity scores
digi_muse_c["polarity"] = digi_muse_c["Review"].apply(lambda x: TextBlob(x).sentiment.polarity)

In [78]:
digi_muse_c["polarity"]

0         0.700000
1         0.000000
2         0.400000
3         0.850000
4         0.500000
            ...   
140814    0.425000
140815    0.419184
140819    0.500000
140820    0.475000
140821    0.157870
Name: polarity, Length: 80891, dtype: float64

In [79]:
def pos_neg(score):
    
    '''Returns the sentiment of the polarity'''
    
    if score < 0:
        sentiment_type = 'Negative'
    elif score == 0:
        sentiment_type = 'Neutral'
    else:
        sentiment_type = 'Positive'
        
    return sentiment_type

In [80]:
digi_muse_c['Sentiment'] = digi_muse_c['polarity'].apply(pos_neg)
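To make the mapping concrete, the sketch below applies the same three-way rule to the first five polarity scores shown in the output above and tallies the labels with `collections.Counter`. The function is restated so the example runs on its own.

```python
from collections import Counter

def pos_neg(score):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if score < 0:
        return "Negative"
    if score == 0:
        return "Neutral"
    return "Positive"

# First five polarity scores from the polarity column shown earlier.
scores = [0.7, 0.0, 0.4, 0.85, 0.5]

labels = [pos_neg(s) for s in scores]
distribution = Counter(labels)
print(distribution)
```

Because `Neutral` requires a score of exactly zero, it also collects reviews in which no word was scored at all, such as row 1 (`greatfive star`), even though the rating there is five stars.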

In [81]:
digi_muse_c.head(10)

Unnamed: 0,asin,description,title,overall,brand,rank,verified,reviewerID,reviewerName,reviewTime,price,style,category,also_buy,also_view,main_cat,similar_item,Review,polarity,Sentiment
0,7019098606,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A2KLYLMS27MMSX,C. oliver,2014-12-12,13.71,Format Audio CD,[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",,brooklyn tab not beat gospel music sing kind w...,0.7,Positive
1,7019098606,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A1IOEZDAZD6X51,Raf,2014-12-09,13.71,Format Audio CD,[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",,greatfive star,0.0,Neutral
2,7019098606,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A1LUTAW6O69PQP,Paula Milo-Moultrie,2014-12-05,13.71,Format Audio CD,[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",,music inspire enjoy listen singing alongi enjo...,0.4,Positive
3,7019098606,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A1UUCVS78RELYW,jadamtam,2014-09-29,13.71,Format Audio CD,[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",,beautiful songsfive star,0.85,Positive
4,7019098606,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A1E2OY42QMGRXR,Chester Thrash,2014-09-06,13.71,Format Audio Cassette,[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",,favorite love iti love,0.5,Positive
5,7019098606,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,False,A3W527BN42AXN1,Resa,2012-12-29,13.71,Format Audio CD,[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",,excellent cd cd good condition get brooklyn ta...,0.733333,Positive
6,7019098606,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A1QS1CO7GHQ6EJ,Georgia Peach,2012-06-13,13.71,Format Audio CD,[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",,brooklyn tabernacle choir ministry year everyt...,0.44,Positive
7,7019098606,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A349TJMZWT77ZZ,mark dale,2012-06-04,13.71,Format Audio CD,[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",,cd start day feel spirit set day like would no...,0.0,Neutral
8,7019098606,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,True,A87FAOGM1F70X,rox,2011-12-23,13.71,Format Audio CD,[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",,look cd long find amazon amazing issue got scr...,0.45,Positive
9,7019098606,[],Live...Again,5,Brooklyn Tabernacle Choir,725231,False,A2K5I4RAQ30IKI,Kevin Quinn,2004-06-09,13.71,Format Audio CD,[],"[B00004R8SJ, B00SLEJI78, B000002MJ3, B079PDJSR...","[B00004R8SJ, B079PDJSRJ, B00SLEJI78, B0000DZ3C...","<img src=""https://images-na.ssl-images-amazon....",,favorite cd brooklyn tabernacle member love mu...,0.183333,Positive


In [82]:
digi_muse_c.drop(['description','polarity'],axis=1,inplace=True)

In [83]:
print('New data shape after drop value:\n',digi_muse_c.shape)

New data shape after drop value:
 (80891, 18)


In [91]:
digi_muse_data=digi_muse_c.copy()

In [93]:
digi_muse_data.columns

Index(['asin', 'title', 'overall', 'brand', 'rank', 'verified', 'reviewerID',
       'reviewerName', 'reviewTime', 'price', 'style', 'category', 'also_buy',
       'also_view', 'main_cat', 'similar_item', 'Review', 'Sentiment'],
      dtype='object')

In [94]:
digi_muse_data.isnull().sum()

asin               0
title              0
overall            0
brand              0
rank               0
verified           0
reviewerID         0
reviewerName       0
reviewTime         0
price           1390
style              0
category           0
also_buy           0
also_view          0
main_cat           0
similar_item       0
Review             0
Sentiment          0
dtype: int64

In [95]:
digi_muse_data.dropna(inplace=True)

In [96]:
digi_muse_data.isnull().sum()

asin            0
title           0
overall         0
brand           0
rank            0
verified        0
reviewerID      0
reviewerName    0
reviewTime      0
price           0
style           0
category        0
also_buy        0
also_view       0
main_cat        0
similar_item    0
Review          0
Sentiment       0
dtype: int64

In [97]:
digi_muse_data.shape

(79501, 18)
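`dropna()` without arguments removes a row if any column is null. Here only `price` contains nulls, so the result is the same, but naming the column makes the intent explicit. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "asin": ["A1", "A2", "A3"],
    "price": [13.71, None, 9.99],
    "Review": ["love it", "pretty good", "not bad"],
})

# Only rows with a missing price are removed; other columns are not inspected.
cleaned = toy.dropna(subset=["price"])
print(cleaned.shape)
```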

In [100]:
digi_muse_data.to_csv('Digi_Muse_data.csv')

In [101]:
df=pd.read_csv('Digi_Muse_data.csv')

In [102]:
df.isnull().sum()

Unnamed: 0          0
asin                0
title             591
overall             0
brand            7590
rank                0
verified            0
reviewerID          0
reviewerName        1
reviewTime          0
price               0
style               0
category            0
also_buy            0
also_view           0
main_cat            0
similar_item    79501
Review              2
Sentiment           0
dtype: int64
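The nulls that reappear after reloading are most likely a CSV round-trip effect rather than newly missing data. `to_csv` writes the index as an unnamed first column, and `read_csv` by default parses empty fields and markers such as `NA` as `NaN`. The sketch below reproduces this on toy data. The reviewer name `NA` is a hypothetical example, and whether this is what happened for `reviewerName` in the real data is an assumption.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "title": ["Live...Again", ""],   # empty string, not NaN
    "reviewerName": ["NA", "Raf"],   # hypothetical name that looks like a missing-value marker
})

# With no path given, to_csv returns the CSV text; the index is written as an unnamed column.
csv_text = toy.to_csv()

# Default read: the index comes back as "Unnamed: 0", and "" / "NA" become NaN.
default_read = pd.read_csv(io.StringIO(csv_text))

# index_col=0 restores the index; keep_default_na=False keeps the strings as written.
faithful_read = pd.read_csv(io.StringIO(csv_text), index_col=0, keep_default_na=False)

print(default_read.columns.tolist())
print(default_read.isnull().sum().to_dict())
print(faithful_read.isnull().sum().to_dict())
```

Writing with `index=False` avoids the extra `Unnamed: 0` column in the first place.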