In [None]:
# Data Source: https://www.kaggle.com/datasets/yasserh/amazon-product-reviews-dataset
# Folder: Amazon
# Description:
### The dataset consists of samples from Amazon Ratings for select products.
### The reviews are picked randomly and the corpus has nearly 1.6k reviews from different customers.
### Amazon aims to understand the main topics of these reviews so that they can be classified for easier search.

# Cleaning, Analysis, Visualization, and Modeling of Amazon Product Reviews Dataset

## Objective
- Understand the dataset and perform the necessary cleanup.
- Apply sentiment analysis to measure in more depth how positive each review is.
- Build a topic-modelling algorithm that classifies each review's topic in more detail than the review's title does.
- Create a regression model to predict product ratings based on the length of reviews.

## Libraries and Tools used throughout
- Pandas
- NLTK (sentiment analysis and intensity)
- sklearn (regression)
- langdetect & googletrans (detecting non-English reviews and translating them to English)

## In case of errors, check the following:
- Not all of the Python libraries may be installed on your machine or in your environment. Make sure to install them.
- You ran a cell after making a problematic edit to it (this notebook is designed to run without any edits).
- You are not running a Python kernel, or you are using an old version of the Python kernel.
- A library is missing a download that it needs for some or all of its functionality.
    - e.g. vader_lexicon must be downloaded before running the sentiment analysis (later in the code).
    

In [71]:
import pandas as pd
from sklearn.linear_model import LinearRegression # minimum model to be used later. May need polynomial or multivariate instead
import nltk # for NLP

In [72]:
df = pd.read_csv('product_reviews.csv')
df.head()

Unnamed: 0,id,asins,brand,categories,colors,dateAdded,dateUpdated,dimension,ean,keys,...,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.userCity,reviews.userProvince,reviews.username,sizes,upc,weight
0,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I initially had trouble deciding between the p...,"Paperwhite voyage, no regrets!",,,Cristina M,,,205 grams
1,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,Allow me to preface this with a little history...,One Simply Could Not Ask For More,,,Ricky,,,205 grams
2,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,4.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I am enjoying it so far. Great for reading. Ha...,Great for those that just want an e-reader,,,Tedd Gardiner,,,205 grams
3,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I bought one of the first Paperwhites and have...,Love / Hate relationship,,,Dougal,,,205 grams
4,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I have to say upfront - I don't like coroporat...,I LOVE IT,,,Miljan David Tanic,,,205 grams


## From this we can see that the dataset contains a lot of columns. For the purpose of our analyses, we only need a few.

## For reference, here is a description of each column 

- **id:** Unique identifier for each product.
- **asins:** ASIN (Amazon Standard Identification Number) associated with the product.
- **brand:** Brand of the product.
- **categories:** Categories to which the product belongs.
- **colors:** Colors available for the product.
- **dateAdded:** Date when the product was added.
- **dateUpdated:** Date when the product information was last updated.
- **dimension:** Dimensions of the product.
- **ean:** EAN (European Article Number) associated with the product.
- **keys:** Unique keys associated with the product.
- **manufacturer:** Manufacturer of the product.
- **manufacturerNumber:** Manufacturer number for the product.
- **name:** Name of the product.
- **prices:** Prices associated with the product, including currency and date information.
- **reviews.date:** Date when the review was posted.
- **reviews.doRecommend:** Indicates whether the reviewer recommends the product.
- **reviews.numHelpful:** Number of users who found the review helpful.
- **reviews.rating:** Rating given by the reviewer.
- **reviews.sourceURLs:** URLs to the source of the reviews.
- **reviews.text:** Text content of the review.
- **reviews.title:** Title of the review.
- **reviews.userCity:** City of the reviewer.
- **reviews.userProvince:** Province of the reviewer.
- **reviews.username:** Username of the reviewer.
- **sizes:** Sizes available for the product.
- **upc:** UPC (Universal Product Code) associated with the product.
- **weight:** Weight of the product.


In [73]:
# To get a clearer idea of the columns we are working with, list all of them
df.columns

Index(['id', 'asins', 'brand', 'categories', 'colors', 'dateAdded',
       'dateUpdated', 'dimension', 'ean', 'keys', 'manufacturer',
       'manufacturerNumber', 'name', 'prices', 'reviews.date',
       'reviews.doRecommend', 'reviews.numHelpful', 'reviews.rating',
       'reviews.sourceURLs', 'reviews.text', 'reviews.title',
       'reviews.userCity', 'reviews.userProvince', 'reviews.username', 'sizes',
       'upc', 'weight'],
      dtype='object')

In [74]:
# Let's make a new DataFrame that keeps only the columns that are actually relevant
relevant_columns = ['id', 'asins', 'brand', 'categories', 'colors', 'manufacturer',
        'name', 'prices', 'reviews.date',
       'reviews.doRecommend', 'reviews.numHelpful', 'reviews.rating', 'reviews.text', 'reviews.title',
         'sizes', 'weight']
product_reviews = df[relevant_columns]
product_reviews.tail()

Unnamed: 0,id,asins,brand,categories,colors,manufacturer,name,prices,reviews.date,reviews.doRecommend,reviews.numHelpful,reviews.rating,reviews.text,reviews.title,sizes,weight
1592,AVpfo9ukilAPnD_xfhuj,B00NO8JJZW,Amazon,"Amazon Devices & Accessories,Amazon Device Acc...",,,Alexa Voice Remote for Amazon Fire TV and Fire...,"[{""amountMax"":29.99,""amountMin"":29.99,""currenc...",2016-07-06T00:00:00.000Z,,9.0,3.0,This is not the same remote that I got for my ...,I would be disappointed with myself if i produ...,,4 ounces
1593,AVpfo9ukilAPnD_xfhuj,B00NO8JJZW,Amazon,"Amazon Devices & Accessories,Amazon Device Acc...",,,Alexa Voice Remote for Amazon Fire TV and Fire...,"[{""amountMax"":29.99,""amountMin"":29.99,""currenc...",2016-06-22T00:00:00.000Z,,41.0,1.0,I have had to change the batteries in this rem...,Battery draining remote!!!!,,4 ounces
1594,AVpfo9ukilAPnD_xfhuj,B00NO8JJZW,Amazon,"Amazon Devices & Accessories,Amazon Device Acc...",,,Alexa Voice Remote for Amazon Fire TV and Fire...,"[{""amountMax"":29.99,""amountMin"":29.99,""currenc...",2016-03-31T00:00:00.000Z,,34.0,1.0,"Remote did not activate, nor did it connect to...",replacing an even worse remote. Waste of time,,4 ounces
1595,AVpfo9ukilAPnD_xfhuj,B00NO8JJZW,Amazon,"Amazon Devices & Accessories,Amazon Device Acc...",,,Alexa Voice Remote for Amazon Fire TV and Fire...,"[{""amountMax"":29.99,""amountMin"":29.99,""currenc...",2016-04-26T00:00:00Z,,7.0,3.0,It does the job but is super over priced. I fe...,Overpriced,,4 ounces
1596,AVpfo9ukilAPnD_xfhuj,B00NO8JJZW,Amazon,"Amazon Devices & Accessories,Amazon Device Acc...",,,Alexa Voice Remote for Amazon Fire TV and Fire...,"[{""amountMax"":29.99,""amountMin"":29.99,""currenc...",2016-07-31T00:00:00Z,,10.0,1.0,I ordered this item to replace the one that no...,I am sending all of this crap back to amazon a...,,4 ounces


# Now that we have a dataset with more of the information we need, we can see that a few columns need restructuring
### Specifically the prices column and the reviews.date column.

In [75]:
product_reviews['prices'][0]

'[{"amountMax":139.99,"amountMin":139.99,"currency":"USD","dateAdded":"2017-07-18T23:52:58Z","dateSeen":["2017-07-15T18:10:23.807Z","2016-03-16T00:00:00Z"],"isSale":"false","merchant":"Amazon.com","shipping":"FREE Shipping.","sourceURLs":["https://www.amazon.com/Kindle-Paperwhite-High-Resolution-Display-Built/dp/B00QJDU3KY/ref=lp_6669702011_1_7/132-1677641-8459202?s=amazon-devices&ie=UTF8&qid=1498832761&sr=1-7","http://www.amazon.com/Kindle-Paperwhite-High-Resolution-Display-Built-/dp/B00QJDU3KY"]},{"amountMax":119.99,"amountMin":119.99,"condition":"new","currency":"EUR","dateAdded":"2016-03-08T20:21:53Z","dateSeen":["2016-01-29T00:00:00Z"],"isSale":"false","merchant":"Amazon EU Sarl","shipping":"free","sourceURLs":["http://www.amazon.co.uk/Kindle-Paperwhite-Resolution-Display-Built-/dp/B00QJDU3KY"]},{"amountMax":139.99,"amountMin":139.99,"condition":"new","currency":"CAD","dateAdded":"2016-03-08T20:21:53Z","dateSeen":["2016-01-11T00:00:00Z"],"isSale":"false","merchant":"Amazon","shipp

In [76]:
product_reviews['reviews.date']

0       2015-08-08T00:00:00.000Z
1       2015-09-01T00:00:00.000Z
2       2015-07-20T00:00:00.000Z
3       2017-06-16T00:00:00.000Z
4       2016-08-11T00:00:00.000Z
                  ...           
1592    2016-07-06T00:00:00.000Z
1593    2016-06-22T00:00:00.000Z
1594    2016-03-31T00:00:00.000Z
1595        2016-04-26T00:00:00Z
1596        2016-07-31T00:00:00Z
Name: reviews.date, Length: 1597, dtype: object

In [77]:

# Change format to datetime
product_reviews['reviews.date'] = pd.to_datetime(product_reviews['reviews.date'], format='ISO8601')

# Drop the milliseconds by formatting each timestamp as a string
product_reviews['reviews.date'] = product_reviews['reviews.date'].dt.strftime('%Y-%m-%d %H:%M:%S')
product_reviews['reviews.date'].dtype # strftime returns strings, so the column is now object dtype, not datetime

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  product_reviews['reviews.date'] = pd.to_datetime(product_reviews['reviews.date'], format='ISO8601')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  product_reviews['reviews.date'] = product_reviews['reviews.date'].dt.strftime('%Y-%m-%d %H:%M:%S')


dtype('O')
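
The warning above appears because `product_reviews` was created by selecting columns from `df`, so pandas cannot tell whether the assignment is meant to reach the original frame. Below is a minimal sketch of one common way to avoid it, assuming an explicit `.copy()` is acceptable here. The toy frame and the names `toy` and `subset` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the full dataset (hypothetical values)
toy = pd.DataFrame({
    "id": [1, 2],
    "reviews.date": ["2015-08-08T00:00:00Z", "2016-04-26T00:00:00Z"],
    "extra": ["a", "b"],
})

# Taking an explicit copy makes the subset an independent DataFrame,
# so later column assignments are unambiguous
subset = toy[["id", "reviews.date"]].copy()
subset["reviews.date"] = pd.to_datetime(subset["reviews.date"], utc=True)

print(subset.dtypes)
```

The original `toy` frame keeps its string dates; only `subset` is changed.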

In [78]:
product_reviews['reviews.date']

0       2015-08-08 00:00:00
1       2015-09-01 00:00:00
2       2015-07-20 00:00:00
3       2017-06-16 00:00:00
4       2016-08-11 00:00:00
               ...         
1592    2016-07-06 00:00:00
1593    2016-06-22 00:00:00
1594    2016-03-31 00:00:00
1595    2016-04-26 00:00:00
1596    2016-07-31 00:00:00
Name: reviews.date, Length: 1597, dtype: object

In [79]:
# Quick sanity check. This compares strings, which works because the zero-padded YYYY-MM-DD format sorts chronologically
product_reviews['reviews.date'] > '2016-02-01'

0       False
1       False
2       False
3        True
4        True
        ...  
1592     True
1593     True
1594     True
1595     True
1596     True
Name: reviews.date, Length: 1597, dtype: bool
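
The comparison above operates on strings: because every value is zero-padded in year-month-day order, lexicographic order matches chronological order. As a hedged alternative, the sketch below keeps the values as real datetimes and compares them against an explicit timestamp, which also keeps accessors such as `.dt.year` available. The sample values come from the output above but are rewritten in one uniform format, and the names `raw`, `dates`, and `cutoff` are illustrative.

```python
import pandas as pd

# A few review dates in a single uniform ISO format (illustrative sample)
raw = pd.Series([
    "2015-08-08T00:00:00Z",
    "2017-06-16T00:00:00Z",
    "2016-04-26T00:00:00Z",
])

# Parse to timezone-aware datetimes instead of formatting back to strings
dates = pd.to_datetime(raw, utc=True)

# Compare against a timestamp in the same timezone
cutoff = pd.Timestamp("2016-02-01", tz="UTC")
mask = dates > cutoff

print(mask.tolist())
print(dates.dt.year.tolist())
```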

## Now that the date is fixed, we will move on to fixing the price column


In [80]:
# As a refresher, here is what a value in the prices column looks like
prices_first_row = product_reviews['prices'][0]
print(prices_first_row)
print(type(prices_first_row))

[{"amountMax":139.99,"amountMin":139.99,"currency":"USD","dateAdded":"2017-07-18T23:52:58Z","dateSeen":["2017-07-15T18:10:23.807Z","2016-03-16T00:00:00Z"],"isSale":"false","merchant":"Amazon.com","shipping":"FREE Shipping.","sourceURLs":["https://www.amazon.com/Kindle-Paperwhite-High-Resolution-Display-Built/dp/B00QJDU3KY/ref=lp_6669702011_1_7/132-1677641-8459202?s=amazon-devices&ie=UTF8&qid=1498832761&sr=1-7","http://www.amazon.com/Kindle-Paperwhite-High-Resolution-Display-Built-/dp/B00QJDU3KY"]},{"amountMax":119.99,"amountMin":119.99,"condition":"new","currency":"EUR","dateAdded":"2016-03-08T20:21:53Z","dateSeen":["2016-01-29T00:00:00Z"],"isSale":"false","merchant":"Amazon EU Sarl","shipping":"free","sourceURLs":["http://www.amazon.co.uk/Kindle-Paperwhite-Resolution-Display-Built-/dp/B00QJDU3KY"]},{"amountMax":139.99,"amountMin":139.99,"condition":"new","currency":"CAD","dateAdded":"2016-03-08T20:21:53Z","dateSeen":["2016-01-11T00:00:00Z"],"isSale":"false","merchant":"Amazon","shippi

In [81]:
product_reviews['prices'][220]

'[{"amountMax":19.99,"amountMin":19.99,"currency":"USD","dateAdded":"2017-08-13T08:29:09Z","dateSeen":["2017-07-25T23:58:36.645Z","2017-07-25T17:33:18.056Z"],"isSale":"false","merchant":"Amazon.com","shipping":"FREE Shipping on orders over USD 25.00","sourceURLs":["https://www.amazon.com/Amazon-Echo-Case-fits-Generation/dp/B01K9KW792/ref=sr_1_127/132-5989575-2985028?s=fiona-hardware&ie=UTF8&qid=1500945247&sr=1-127","https://www.amazon.com/Amazon-Echo-Case-fits-Generation/dp/B01K9KW792/ref=sr_1_127/134-9860406-6453704?s=fiona-hardware&ie=UTF8&qid=1500944917&sr=1-127"]}]'

In [82]:
# it is a lot to take in so we'll adjust it to be more presentable
import json

# convert the value that is currently a str to a list with dictionaries
prices_1 = json.loads(prices_first_row)
print("before proper formatting; ", type(prices_1))

# makes it more presentable within json format
prices_1_format = json.dumps(prices_1, indent = 3)
print(prices_1_format)


before proper formatting;  <class 'list'>
[
   {
      "amountMax": 139.99,
      "amountMin": 139.99,
      "currency": "USD",
      "dateAdded": "2017-07-18T23:52:58Z",
      "dateSeen": [
         "2017-07-15T18:10:23.807Z",
         "2016-03-16T00:00:00Z"
      ],
      "isSale": "false",
      "merchant": "Amazon.com",
      "shipping": "FREE Shipping.",
      "sourceURLs": [
         "https://www.amazon.com/Kindle-Paperwhite-High-Resolution-Display-Built/dp/B00QJDU3KY/ref=lp_6669702011_1_7/132-1677641-8459202?s=amazon-devices&ie=UTF8&qid=1498832761&sr=1-7",
         "http://www.amazon.com/Kindle-Paperwhite-High-Resolution-Display-Built-/dp/B00QJDU3KY"
      ]
   },
   {
      "amountMax": 119.99,
      "amountMin": 119.99,
      "condition": "new",
      "currency": "EUR",
      "dateAdded": "2016-03-08T20:21:53Z",
      "dateSeen": [
         "2016-01-29T00:00:00Z"
      ],
      "isSale": "false",
      "merchant": "Amazon EU Sarl",
      "shipping": "free",
      "sourceURLs":
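
Once a prices string has been parsed into a list of dictionaries, it can also be viewed as a small table. The sketch below uses `pd.json_normalize` for that, assuming a shortened version of the first row's value (only a few keys are kept, and the name `sample_prices` is illustrative).

```python
import json
import pandas as pd

# Abridged version of one prices value (most keys omitted for brevity)
sample_prices = (
    '[{"amountMax":139.99,"amountMin":139.99,"currency":"USD",'
    '"isSale":"false","merchant":"Amazon.com"},'
    '{"amountMax":119.99,"amountMin":119.99,"condition":"new",'
    '"currency":"EUR","isSale":"false","merchant":"Amazon EU Sarl"}]'
)

# One row per price entry; keys missing from an entry become NaN
price_table = pd.json_normalize(json.loads(sample_prices))

# Keep only the entries priced in USD
usd_only = price_table[price_table["currency"] == "USD"]
print(usd_only[["amountMax", "isSale", "merchant"]])
```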

## For our purposes, we only want prices in USD. From the example shown above, we see that there can be multiple prices in USD:
- the original price when the product is not on sale, and the sale price.

## With this knowledge, we'll add two extra columns to the product reviews table and store those prices in them
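
The extraction described above can be sketched as a small helper that works on a single prices string. This is one possible reading of the rule, not the notebook's exact logic: it assumes the full price is the first USD entry with `isSale` equal to `"false"` and the sale price is the first USD entry with `isSale` equal to `"true"`. The function name `extract_usd_prices` and the sample string are hypothetical.

```python
import json

def extract_usd_prices(prices_json):
    """Return (full_price, sale_price) in USD; either may be None."""
    full_price = None
    sale_price = None
    for entry in json.loads(prices_json):
        # Ignore entries in other currencies
        if entry.get("currency") != "USD":
            continue
        amount = float(entry["amountMax"])
        if entry.get("isSale") == "true":
            if sale_price is None:
                sale_price = amount
        elif full_price is None:
            full_price = amount
    return full_price, sale_price

# Hypothetical sample with a regular USD price, a EUR price, and a USD sale price
sample = (
    '[{"amountMax":139.99,"currency":"USD","isSale":"false"},'
    '{"amountMax":119.99,"currency":"EUR","isSale":"false"},'
    '{"amountMax":119.99,"currency":"USD","isSale":"true"}]'
)
print(extract_usd_prices(sample))
```

Returning `None` for a missing price makes rows without a USD sale entry easy to spot afterwards.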

In [83]:
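# Check that every row has a price in USD. Caveat: len() returns the length of the boolean mask
# (the row count) whether or not its values are True; .sum() on the mask would count the actual matches.
len(product_reviews['prices'].str.contains("USD"))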

1597
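
To show the difference between the length of a boolean mask and the number of matches in it, here is a minimal sketch on a toy Series. The three sample strings are made up for illustration.

```python
import pandas as pd

# Toy prices column: two rows mention USD, one does not
prices = pd.Series([
    '[{"currency":"USD"}]',
    '[{"currency":"EUR"}]',
    '[{"currency":"USD"}]',
])

mask = prices.str.contains("USD")

n_rows = len(mask)          # number of rows, regardless of the mask values
n_usd = int(mask.sum())     # number of rows that actually contain "USD"
all_usd = bool(mask.all())  # True only if every row contains "USD"

print(n_rows, n_usd, all_usd)
```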

In [84]:
# Collect the full price and the USD sale price for each row
full_prices = []
sale_prices = []

for i in product_reviews.index:
    list_dict = json.loads(product_reviews['prices'][i])

    # The first entry's maximum amount is taken as the original (full) price
    original_price = float(list_dict[0]['amountMax'])

    # Reset for every row so that a row without a USD sale entry
    # does not silently reuse the previous row's sale price
    sale_price = None

    # Take the first USD entry that is flagged as a sale
    for price_info in list_dict:
        if price_info.get('currency') == 'USD' and price_info.get('isSale') == 'true':
            sale_price = float(price_info['amountMax'])
            break

    # Append prices to respective lists
    full_prices.append(original_price)
    sale_prices.append(sale_price)

In [85]:
# Both lists should have one entry per row. Equal lengths alone do not show that every row had its own sale price
print(len(sale_prices),len(full_prices))


1597 1597


In [86]:
# Now we add two columns to showcase the two prices
product_reviews.insert(8,'fullPrice',full_prices)
product_reviews.insert(9,'salePrice',sale_prices)
product_reviews.head()


Unnamed: 0,id,asins,brand,categories,colors,manufacturer,name,prices,fullPrice,salePrice,reviews.date,reviews.doRecommend,reviews.numHelpful,reviews.rating,reviews.text,reviews.title,sizes,weight
0,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,Amazon,Kindle Paperwhite,"[{""amountMax"":139.99,""amountMin"":139.99,""curre...",139.99,119.99,2015-08-08 00:00:00,,139.0,5.0,I initially had trouble deciding between the p...,"Paperwhite voyage, no regrets!",,205 grams
1,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,Amazon,Kindle Paperwhite,"[{""amountMax"":139.99,""amountMin"":139.99,""curre...",139.99,119.99,2015-09-01 00:00:00,,126.0,5.0,Allow me to preface this with a little history...,One Simply Could Not Ask For More,,205 grams
2,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,Amazon,Kindle Paperwhite,"[{""amountMax"":139.99,""amountMin"":139.99,""curre...",139.99,119.99,2015-07-20 00:00:00,,69.0,4.0,I am enjoying it so far. Great for reading. Ha...,Great for those that just want an e-reader,,205 grams
3,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,Amazon,Kindle Paperwhite,"[{""amountMax"":139.99,""amountMin"":139.99,""curre...",139.99,119.99,2017-06-16 00:00:00,,2.0,5.0,I bought one of the first Paperwhites and have...,Love / Hate relationship,,205 grams
4,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,Amazon,Kindle Paperwhite,"[{""amountMax"":139.99,""amountMin"":139.99,""curre...",139.99,119.99,2016-08-11 00:00:00,,17.0,5.0,I have to say upfront - I don't like coroporat...,I LOVE IT,,205 grams


In [87]:
# Now that this is done, we no longer need the original prices column
product_reviews = product_reviews.drop(columns='prices')


In [88]:
product_reviews

Unnamed: 0,id,asins,brand,categories,colors,manufacturer,name,fullPrice,salePrice,reviews.date,reviews.doRecommend,reviews.numHelpful,reviews.rating,reviews.text,reviews.title,sizes,weight
0,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,Amazon,Kindle Paperwhite,139.99,119.99,2015-08-08 00:00:00,,139.0,5.0,I initially had trouble deciding between the p...,"Paperwhite voyage, no regrets!",,205 grams
1,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,Amazon,Kindle Paperwhite,139.99,119.99,2015-09-01 00:00:00,,126.0,5.0,Allow me to preface this with a little history...,One Simply Could Not Ask For More,,205 grams
2,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,Amazon,Kindle Paperwhite,139.99,119.99,2015-07-20 00:00:00,,69.0,4.0,I am enjoying it so far. Great for reading. Ha...,Great for those that just want an e-reader,,205 grams
3,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,Amazon,Kindle Paperwhite,139.99,119.99,2017-06-16 00:00:00,,2.0,5.0,I bought one of the first Paperwhites and have...,Love / Hate relationship,,205 grams
4,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,Amazon,Kindle Paperwhite,139.99,119.99,2016-08-11 00:00:00,,17.0,5.0,I have to say upfront - I don't like coroporat...,I LOVE IT,,205 grams
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1592,AVpfo9ukilAPnD_xfhuj,B00NO8JJZW,Amazon,"Amazon Devices & Accessories,Amazon Device Acc...",,,Alexa Voice Remote for Amazon Fire TV and Fire...,29.99,14.99,2016-07-06 00:00:00,,9.0,3.0,This is not the same remote that I got for my ...,I would be disappointed with myself if i produ...,,4 ounces
1593,AVpfo9ukilAPnD_xfhuj,B00NO8JJZW,Amazon,"Amazon Devices & Accessories,Amazon Device Acc...",,,Alexa Voice Remote for Amazon Fire TV and Fire...,29.99,14.99,2016-06-22 00:00:00,,41.0,1.0,I have had to change the batteries in this rem...,Battery draining remote!!!!,,4 ounces
1594,AVpfo9ukilAPnD_xfhuj,B00NO8JJZW,Amazon,"Amazon Devices & Accessories,Amazon Device Acc...",,,Alexa Voice Remote for Amazon Fire TV and Fire...,29.99,14.99,2016-03-31 00:00:00,,34.0,1.0,"Remote did not activate, nor did it connect to...",replacing an even worse remote. Waste of time,,4 ounces
1595,AVpfo9ukilAPnD_xfhuj,B00NO8JJZW,Amazon,"Amazon Devices & Accessories,Amazon Device Acc...",,,Alexa Voice Remote for Amazon Fire TV and Fire...,29.99,14.99,2016-04-26 00:00:00,,7.0,3.0,It does the job but is super over priced. I fe...,Overpriced,,4 ounces


## The data is now clean, so we will move on to using NLP for the following purposes:
- measuring how positive each review is
    - creating a classification model to support classifying the level of positivity
- identifying the topic of each review


In [89]:
# Opens the NLTK downloader, which lists the packages that are available. Close the GUI once you have had a good look
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [90]:
nltk.download('vader_lexicon') # lexicon required by the sentiment intensity analyzer
from nltk.sentiment import SentimentIntensityAnalyzer # for identifying the level of sentiment (neg to pos) of text

# Create the analyzer; its polarity_scores() method scores a piece of text
sia = SentimentIntensityAnalyzer()


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\adwal\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [91]:
# Quick check to make sure no review text is missing.
product_reviews['reviews.text'].isnull().sum()

0

In [92]:
from langdetect import detect
from googletrans import Translator


translator = Translator()
# sia (the SentimentIntensityAnalyzer) was already created in an earlier cell

scores_data = []

for review in product_reviews['reviews.text']:
    try:
        # Translate the review to English if it is detected as another language
        if detect(review) != 'en':
            translation = translator.translate(review, dest='en').text
            review = translation

        # Analyze sentiment for the (translated or original) review
        score = sia.polarity_scores(review)
    except Exception as e:
        print(f"Error processing review: {e}")
        # Keep one entry per review so the scores stay aligned with the DataFrame rows
        score = {'neg': float('nan'), 'neu': float('nan'), 'pos': float('nan'), 'compound': float('nan')}
    scores_data.append(score)

scores_data[:20]


[{'neg': 0.038, 'neu': 0.793, 'pos': 0.169, 'compound': 0.9804},
 {'neg': 0.041, 'neu': 0.812, 'pos': 0.147, 'compound': 0.9874},
 {'neg': 0.181, 'neu': 0.596, 'pos': 0.223, 'compound': 0.4364},
 {'neg': 0.03, 'neu': 0.865, 'pos': 0.105, 'compound': 0.9743},
 {'neg': 0.089, 'neu': 0.715, 'pos': 0.195, 'compound': 0.993},
 {'neg': 0.061, 'neu': 0.87, 'pos': 0.069, 'compound': 0.1695},
 {'neg': 0.041, 'neu': 0.812, 'pos': 0.147, 'compound': 0.9874},
 {'neg': 0.0, 'neu': 0.756, 'pos': 0.244, 'compound': 0.9765},
 {'neg': 0.038, 'neu': 0.793, 'pos': 0.169, 'compound': 0.9804},
 {'neg': 0.181, 'neu': 0.596, 'pos': 0.223, 'compound': 0.4364},
 {'neg': 0.023, 'neu': 0.86, 'pos': 0.117, 'compound': 0.9614},
 {'neg': 0.0, 'neu': 0.767, 'pos': 0.233, 'compound': 0.9804},
 {'neg': 0.181, 'neu': 0.596, 'pos': 0.223, 'compound': 0.4364},
 {'neg': 0.04, 'neu': 0.864, 'pos': 0.096, 'compound': 0.5149},
 {'neg': 0.043, 'neu': 0.798, 'pos': 0.159, 'compound': 0.9997},
 {'neg': 0.0, 'neu': 0.781, 'pos':

In [93]:
# Insert a column to store the positivity scores (the compound score of each review)
product_reviews.insert(15,'positivityScore',[score['compound'] for score in scores_data])
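
Since `scores_data` is a list of dictionaries with the same keys, it can also be turned into a DataFrame directly, which keeps the `neg`, `neu`, and `pos` components available next to `compound`. A small sketch, using the first three scores from the output above; the names `scores` and `scores_frame` are illustrative.

```python
import pandas as pd

# First three score dictionaries from the output above
scores = [
    {'neg': 0.038, 'neu': 0.793, 'pos': 0.169, 'compound': 0.9804},
    {'neg': 0.041, 'neu': 0.812, 'pos': 0.147, 'compound': 0.9874},
    {'neg': 0.181, 'neu': 0.596, 'pos': 0.223, 'compound': 0.4364},
]

# Each dictionary becomes a row and each key becomes a column
scores_frame = pd.DataFrame(scores)

print(scores_frame)
print(scores_frame['compound'].tolist())
```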

In [94]:
positivity_level = []

for i in product_reviews['positivityScore']:
    if .66 <= i <= 1:
        positivity_level.append("highly positive")
    elif .33 <= i < .66:
        positivity_level.append("positive")
    elif .1 <= i < .33:
        positivity_level.append("fairly positive")
    elif -.1 <= i < .1:
        positivity_level.append("neutral")
    elif -.33 <= i < -.1:
        positivity_level.append("fairly negative")
    elif -.66 <= i < -.33:
        positivity_level.append("negative")
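    elif -1 <= i < -.66:
        positivity_level.append("highly negative")
    else:
        # Missing (NaN) or out-of-range score: append a placeholder so the list stays aligned with the rows
        positivity_level.append("unknown")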



product_reviews.insert(16,'positivityLevel',positivity_level)
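
The same thresholds can be expressed with `pd.cut`, which bins a numeric column in one call. This is a sketch of an alternative rather than the notebook's method. It assumes left-closed bins (`right=False`) to match the `a <= i < b` conditions above, and it uses infinity as the top edge so that a score of exactly 1 still lands in the highest bin. The names `edges`, `labels`, and `sample_scores` are illustrative.

```python
import pandas as pd

# Bin edges and labels mirroring the if/elif thresholds above
edges = [-1, -0.66, -0.33, -0.1, 0.1, 0.33, 0.66, float("inf")]
labels = [
    "highly negative", "negative", "fairly negative", "neutral",
    "fairly positive", "positive", "highly positive",
]

sample_scores = pd.Series([0.9804, 0.4364, 0.1695, 0.0, -0.5, -1.0, 1.0])

# right=False makes every bin include its left edge and exclude its right edge
levels = pd.cut(sample_scores, bins=edges, labels=labels, right=False)

print(levels.tolist())
```

Scores that are missing come back as NaN rather than raising an error, so the result always has one entry per row.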

In [95]:
product_reviews.head(3)

Unnamed: 0,id,asins,brand,categories,colors,manufacturer,name,fullPrice,salePrice,reviews.date,reviews.doRecommend,reviews.numHelpful,reviews.rating,reviews.text,reviews.title,positivityScore,positivityLevel,sizes,weight
0,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,Amazon,Kindle Paperwhite,139.99,119.99,2015-08-08 00:00:00,,139.0,5.0,I initially had trouble deciding between the p...,"Paperwhite voyage, no regrets!",0.9804,highly positive,,205 grams
1,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,Amazon,Kindle Paperwhite,139.99,119.99,2015-09-01 00:00:00,,126.0,5.0,Allow me to preface this with a little history...,One Simply Could Not Ask For More,0.9874,highly positive,,205 grams
2,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,Amazon,Kindle Paperwhite,139.99,119.99,2015-07-20 00:00:00,,69.0,4.0,I am enjoying it so far. Great for reading. Ha...,Great for those that just want an e-reader,0.4364,positive,,205 grams


### Now we'll go over to creating the algorithm for identifying the topic within each review

In [96]:
# If you know the specific package that you want, you can download it by name
nltk.download('product_reviews_2')

[nltk_data] Downloading package product_reviews_2 to
[nltk_data]     C:\Users\adwal\AppData\Roaming\nltk_data...
[nltk_data]   Package product_reviews_2 is already up-to-date!


True

In [97]:
# See how many files are within this dataset
from nltk.corpus import product_reviews_2
len(product_reviews_2.fileids())


10

In [98]:
# get a quick look at them
product_reviews_2.fileids()

['Canon_PowerShot_SD500.txt',
 'Canon_S100.txt',
 'Diaper_Champ.txt',
 'Hitachi_router.txt',
 'Linksys_Router.txt',
 'MicroMP3.txt',
 'Nokia_6600.txt',
 'README.txt',
 'ipod.txt',
 'norton.txt']

In [99]:
# Take a closer look at the raw contents of one of them
print(product_reviews_2.raw(fileids='Linksys_Router.txt'))

[t]
router[+2]##This router does everything that it is supposed to do, so i dont really know how to talk that bad about it. 
setup[+2], installation[+2] ##It was a very quick setup and installation, in fact the disc that it comes with pretty much makes sure you cant mess it up. 
install[+3]##By no means do you have to be a tech junkie to be able to install it, just be able to put a CD in the computer and it tells you what to do. 
works[+3] ##It works great, i am usually at the full 54 mbps, although every now and then that drops to around 36 mbps only because i am 2 floors below where the router is. 
##That only happens every so often, but its not that big of a drawback really, just a little slower than usual. 
router[+2][p] ##It really is a great buy if you are lookin at having just one modem but many computers around the house. 
router[+2] ##There are 3 computers in my house all getting wireless connection from this router, and everybody is happy with it. 
##I do not really know why 
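
The raw text above appears to follow a simple pattern: optional `feature[+n]` or `feature[-n]` annotations, then `##`, then the sentence. The sketch below parses one such line with a regular expression. It is based only on the lines shown above, so it is an assumption about the format rather than a full parser for the corpus; the function name `parse_line` is hypothetical, and lines without `##` (such as `[t]`) are skipped.

```python
import re

# Matches "feature[+2]" or "feature[-1]" and captures the feature and its score
ANNOTATION = re.compile(r"([^,\[\]#]+)\[([+-]\d)\]")

def parse_line(line):
    """Return ([(feature, score), ...], sentence), or None if the line has no '##'."""
    annotations, separator, sentence = line.partition("##")
    if not separator:
        return None
    features = [
        (name.strip(), int(score))
        for name, score in ANNOTATION.findall(annotations)
    ]
    return features, sentence.strip()

print(parse_line("setup[+2], installation[+2] ##It was a very quick setup and installation."))
print(parse_line("router[+2][p] ##It really is a great buy."))
print(parse_line("[t]"))
```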

In [None]:
# Used for splitting the reviews into words
from nltk.tokenize import word_tokenize

# Pretrained Punkt tokenizer models that word_tokenize relies on
nltk.download("punkt")

In [None]:
product_reviews_2.words()