### download dataset 

In [None]:
# download a small dataset (ideally less than 400 MB); the example used here is US Kitchen (~888 MB compressed)
!wget https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Kitchen_v1_00.tsv.gz
# unzip the dataset
!gunzip "amazon_reviews_us_Kitchen_v1_00.tsv.gz"

--2023-05-14 14:07:00--  https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Kitchen_v1_00.tsv.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.224.48, 54.231.171.72, 52.216.43.192, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.224.48|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 930744854 (888M) [application/x-gzip]
Saving to: ‘amazon_reviews_us_Kitchen_v1_00.tsv.gz’


2023-05-14 14:07:12 (78.3 MB/s) - ‘amazon_reviews_us_Kitchen_v1_00.tsv.gz’ saved [930744854/930744854]
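
If the goal is only to look at the data, the first rows can be read straight from the `.gz` archive without decompressing all of it. The following is a minimal sketch using only the standard library; `peek_tsv_gz` is a hypothetical helper name introduced here for illustration, not part of the notebook.

```python
import csv
import gzip
from itertools import islice

def peek_tsv_gz(path, n_rows=3):
    """Return the header and the first n_rows data rows of a gzipped TSV file."""
    # mode 'rt' opens the archive as text, so lines are decompressed lazily
    with gzip.open(path, mode='rt', encoding='utf-8', newline='') as handle:
        # QUOTE_NONE treats quote characters as ordinary text
        reader = csv.reader(handle, delimiter='\t', quoting=csv.QUOTE_NONE)
        header = next(reader)
        rows = list(islice(reader, n_rows))
    return header, rows

# example (assumes the .gz file is still on disk, i.e. before gunzip replaces it):
# header, rows = peek_tsv_gz('amazon_reviews_us_Kitchen_v1_00.tsv.gz')
```

Note that `gunzip` removes the `.gz` file once it has written the decompressed copy, so this only applies before the unzip step.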



### shorten dataset (remove unwanted data and save memory) 

need to <font color='red'>restart runtime (clean all ram)</font>

In [None]:
# load required modules --------------------------------------------------------
import pandas as pd                                                             # load pandas for data handling
# load dataset (full) ----------------------------------------------------------
# on_bad_lines : {'error', 'warn', 'skip'} or callable, default 'error', Specifies what to do upon encountering a 
# bad line (a line with too many fields). Allowed values are :
#       - 'error', raise an Exception when a bad line is encountered.
#       - 'warn', raise a warning when a bad line is encountered and skip that line.
#       - 'skip', skip bad lines without raising or warning when they are encountered.
reviews_df=pd.read_csv('amazon_reviews_us_Kitchen_v1_00.tsv',sep='\t',on_bad_lines='skip')
# see head of data 
reviews_df.head(3)

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,37000337,R3DT59XH7HXR9K,B00303FI0G,529320574,Arthur Court Paper Towel Holder,Kitchen,5.0,0.0,0.0,N,Y,Beautiful. Looks great on counter,Beautiful. Looks great on counter.,2015-08-31
1,US,15272914,R1LFS11BNASSU8,B00JCZKZN6,274237558,Olde Thompson Bavaria Glass Salt and Pepper Mi...,Kitchen,5.0,0.0,1.0,N,Y,Awesome & Self-ness,I personally have 5 days sets and have also bo...,2015-08-31
2,US,36137863,R296RT05AG0AF6,B00JLIKA5C,544675303,Progressive International PL8 Professional Man...,Kitchen,5.0,0.0,0.0,N,Y,Fabulous and worth every penny,Fabulous and worth every penny. Used for clean...,2015-08-31
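
To see what `on_bad_lines` does on a small scale, the sketch below parses a tiny TSV in which one line has an extra field. The sample data is made up for illustration and is not taken from the dataset.

```python
import io
import pandas as pd

# made-up sample: the second data line has four fields instead of three
sample_tsv = "a\tb\tc\n1\t2\t3\n4\t5\t6\t7\n8\t9\t10\n"

# 'skip' silently drops the malformed line
skipped_df = pd.read_csv(io.StringIO(sample_tsv), sep='\t', on_bad_lines='skip')
print(skipped_df.shape)

# the default ('error') raises instead of dropping the line
try:
    pd.read_csv(io.StringIO(sample_tsv), sep='\t')
    raised = False
except pd.errors.ParserError:
    raised = True
print('default raised ParserError:', raised)
```

Because `'skip'` gives no feedback, switching to `'warn'` for one run is a simple way to find out how many lines are being dropped.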


In [None]:
# see dataset info and stats ---------------------------------------------------
# see shape of dataset 
print('Shape of dataset:',reviews_df.shape)                                     # rows, columns
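# see info (note: info() prints its report itself and returns None, so 'Info:' is followed by 'None' in the output)
print('Info:\n',reviews_df.info(),'\n')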
# describe dataset (numerical values)
print('Describe (for numerical columns only):\n',reviews_df.describe(),'\n')
# describe dataset (for non-numerical values)
print('Describe (for non-numerical columns):\n',reviews_df.describe(include=object),'\n')

Shape of dataset: (4874890, 15)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4874890 entries, 0 to 4874889
Data columns (total 15 columns):
 #   Column             Dtype  
---  ------             -----  
 0   marketplace        object 
 1   customer_id        int64  
 2   review_id          object 
 3   product_id         object 
 4   product_parent     int64  
 5   product_title      object 
 6   product_category   object 
 7   star_rating        float64
 8   helpful_votes      float64
 9   total_votes        float64
 10  vine               object 
 11  verified_purchase  object 
 12  review_headline    object 
 13  review_body        object 
 14  review_date        object 
dtypes: float64(3), int64(2), object(10)
memory usage: 557.9+ MB
Info:
 None 

Describe (for numerical columns only):
         customer_id  product_parent   star_rating  helpful_votes   total_votes
count  4.874890e+06    4.874890e+06  4.874887e+06   4.874887e+06  4.874887e+06
mean   2.921071e+07    4.997882e+0
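
`DataFrame.info()` writes its report to standard output and returns `None`, which is why `Info:` is followed by `None` in the output above. If the report is needed as a string, one option is to pass a text buffer through the `buf` parameter. A sketch on a small made-up frame:

```python
import io
import pandas as pd

# made-up frame, only for illustration
demo_df = pd.DataFrame({'star_rating': [5.0, 4.0], 'review_body': ['Great', 'Good']})

buffer = io.StringIO()
result = demo_df.info(buf=buffer)       # writes into the buffer, returns None
info_text = buffer.getvalue()           # the report as a plain string

print('Info:\n' + info_text)
```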

In [None]:
# drop unwanted columns and save space -----------------------------------------
# see all columns
print('All columns:',reviews_df.columns,'\n')
# Columns to drop
# + 'marketplace' - contains only one single value - "US"
# + 'review_date' - no relevance
# + 'customer_id','review_id','product_id','product_parent','product_title' - as our target is to predict (something like sentiment analysis) 
#            only on the basis of the customer review, so it should not depend on the product or customer.
# Note: Other columns - 'product_category', 'helpful_votes', 'total_votes', 'vine', and 'verified_purchase' are also not used but still kept.
# drop unwanted columns and save space (makes the change in the original dataset)
reviews_df.drop(columns=['marketplace','customer_id','review_id','product_id','product_parent','product_title','review_date'],inplace=True)
print('Data head:\n',reviews_df.head(3),'\n')                                   # see head of data 
print('Data tail:\n',reviews_df.tail(3),'\n')                                   # see tail of data
print('See shape of data:',reviews_df.shape)                                    # see shape of data 

All columns: Index(['marketplace', 'customer_id', 'review_id', 'product_id',
       'product_parent', 'product_title', 'product_category', 'star_rating',
       'helpful_votes', 'total_votes', 'vine', 'verified_purchase',
       'review_headline', 'review_body', 'review_date'],
      dtype='object') 

Data head:
   product_category  star_rating  helpful_votes  total_votes vine  \
0          Kitchen          5.0            0.0          0.0    N   
1          Kitchen          5.0            0.0          1.0    N   
2          Kitchen          5.0            0.0          0.0    N   

  verified_purchase                    review_headline  \
0                 Y  Beautiful. Looks great on counter   
1                 Y                Awesome & Self-ness   
2                 Y     Fabulous and worth every penny   

                                         review_body  
0                Beautiful.  Looks great on counter.  
1  I personally have 5 days sets and have also bo...  
2  Fabulous an
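
An alternative to loading all 15 columns and dropping seven of them afterwards is to read only the wanted columns in the first place, using the `usecols` parameter of `read_csv`. The unused columns then never reach memory. The sketch below uses a made-up two-row TSV, and `keep_columns` is a name chosen here for illustration.

```python
import io
import pandas as pd

# made-up sample with a subset of the dataset's column names
sample_tsv = (
    "marketplace\tcustomer_id\tstar_rating\treview_headline\treview_body\n"
    "US\t1\t5\tNice\tWorks well\n"
    "US\t2\t1\tBad\tBroke quickly\n"
)
keep_columns = ['star_rating', 'review_headline', 'review_body']

# only the listed columns are parsed and stored
small_df = pd.read_csv(io.StringIO(sample_tsv), sep='\t', usecols=keep_columns)
print(small_df.columns.tolist())
```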

In [None]:
# save the reduced / cleaned / shortened dataset (as a CSV file)
reviews_df.to_csv('amazon_reviews_us_Kitchen_v1_00.csv',index=False)

<font color='red'>restart runtime (clean all ram)</font>

In [None]:
exit()                                                                          # restart runtime

### load required modules 

In [None]:
import gc                                                                       # Garbage Collector interface (Source: https://docs.python.org/3/library/gc.html)                           
gc.enable()                                                                     # Enable automatic garbage collection.
import re,string                                                                # load re — Regular expression operations and string for string manipulation 
import numpy as np                                                              # load numerical python 
import pandas as pd                                                             # load pandas for data handling
from matplotlib import pyplot as plt                                            # for plotting graphs 

### download nltk data 

See for more - [installing NLTK Data](https://www.nltk.org/data.html)

In [None]:
# download nltk data
nltk_root=__import__('nltk')                                                    # load nltk 
nltk_root.download('twitter_samples')                                           # download twitter samples
nltk_root.download('punkt')                                                     # download Punkt tokenizer models (for tokenization)
nltk_root.download('stopwords')                                                 # download all stopwords
nltk_root.download('wordnet')                                                   # download WordNet data for WordNetLemmatizer
nltk_root.download('averaged_perceptron_tagger')                                # download data for nltk POS tagger 
nltk_root.download('tagsets')                                                   # download tagset info (nltk.help.upenn_tagset)
del nltk_root

### load dataset 

In [None]:
# load the cleaned / shortened dataset (requires less space; saves approx. 1 to 2 GB of RAM)
reviews_df=pd.read_csv('amazon_reviews_us_Kitchen_v1_00.csv',sep=',')
# see head of data 
reviews_df.head(3)

Unnamed: 0,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body
0,Kitchen,5.0,0.0,0.0,N,Y,Beautiful. Looks great on counter,Beautiful. Looks great on counter.
1,Kitchen,5.0,0.0,1.0,N,Y,Awesome & Self-ness,I personally have 5 days sets and have also bo...
2,Kitchen,5.0,0.0,0.0,N,Y,Fabulous and worth every penny,Fabulous and worth every penny. Used for clean...


In [None]:
# see tail of data 
reviews_df.tail(3)

Unnamed: 0,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body
4874887,Kitchen,4.0,55.0,60.0,N,N,Ice Cream Like a Dream,"According to my wife, this is \\""the best birt..."
4874888,Kitchen,4.0,30.0,42.0,N,N,Opens anything and everything,Hoffritz has a name of producing a trendy and ...
4874889,Kitchen,5.0,5.0,5.0,N,N,"The more you listen, the more you hear...",OK. I was late to snap to the Dead Reckoners. ...


In [None]:
# see shape of dataset 
reviews_df.shape                                                                # rows, columns

(4874890, 8)

### see dataset info and stats 

In [None]:
# see info 
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4874890 entries, 0 to 4874889
Data columns (total 8 columns):
 #   Column             Dtype  
---  ------             -----  
 0   product_category   object 
 1   star_rating        float64
 2   helpful_votes      float64
 3   total_votes        float64
 4   vine               object 
 5   verified_purchase  object 
 6   review_headline    object 
 7   review_body        object 
dtypes: float64(3), object(5)
memory usage: 297.5+ MB
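
The `+` in `297.5+ MB` means the figure is a lower bound: it does not include the strings held by the `object` columns. `memory_usage(deep=True)` counts them as well. Columns that hold only a few distinct values, such as the `Y`/`N` flags in `vine` and `verified_purchase`, are candidates for the `category` dtype. A sketch on made-up data (the savings on the real dataset are not measured here):

```python
import pandas as pd

# made-up flag columns, repeated to make the size difference visible
demo_df = pd.DataFrame({
    'vine': ['N', 'N', 'Y', 'N'] * 1000,
    'verified_purchase': ['Y', 'N', 'Y', 'Y'] * 1000,
})

bytes_before = demo_df.memory_usage(deep=True).sum()

# store each distinct value once and keep small integer codes per row
compact_df = demo_df.astype({'vine': 'category', 'verified_purchase': 'category'})
bytes_after = compact_df.memory_usage(deep=True).sum()

print('string columns  :', bytes_before, 'bytes')
print('category columns:', bytes_after, 'bytes')
```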


In [None]:
# describe dataset (numerical values)
reviews_df.describe()

Unnamed: 0,star_rating,helpful_votes,total_votes
count,4874887.0,4874887.0,4874887.0
mean,4.207311,2.24735,2.678836
std,1.287006,22.92469,24.10152
min,1.0,0.0,0.0
25%,4.0,0.0,0.0
50%,5.0,0.0,0.0
75%,5.0,1.0,1.0
max,5.0,11173.0,11501.0


In [None]:
# describe dataset (for non-numerical values)
reviews_df.describe(include=object)

Unnamed: 0,product_category,vine,verified_purchase,review_headline,review_body
count,4874890,4874887,4874887,4874864,4874644
unique,4,2,2,2494360,4567472
top,Kitchen,N,Y,Five Stars,Great
freq,4874887,4850478,4094402,608887,6068
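
The counts above hint at a few malformed rows: `product_category` has 4 unique values although `Kitchen` accounts for 4874887 of the 4874890 rows, and `vine` and `verified_purchase` have only 4874887 non-null entries. One way to inspect and drop such rows is sketched below on made-up data. Whether the three odd rows in the real dataset are the same ones that have missing values is an assumption that should be checked.

```python
import pandas as pd

# made-up frame imitating one misparsed row and one missing review
demo_df = pd.DataFrame({
    'product_category': ['Kitchen', 'Kitchen', '2015-01-01', 'Kitchen'],
    'star_rating': [5.0, 4.0, None, 1.0],
    'review_body': ['Great', None, None, 'Broke'],
})

# rows whose category is not the expected single value
odd_rows = demo_df[demo_df['product_category'] != 'Kitchen']
print(odd_rows)

# keep rows with the expected category and a non-missing rating and review text
clean_df = demo_df[demo_df['product_category'] == 'Kitchen'].dropna(
    subset=['star_rating', 'review_body']
)
print(clean_df.shape)
```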


### look / search for special characters

### clean data

### test of sample review of 2023

### save model and functions to local disk (for model deployment)

# References / Further reading

* [Official python docs](https://docs.python.org/3/)
* [Official python tutorials](https://docs.python.org/3/tutorial/index.html)
* [NLTK :: Natural Language Toolkit](https://www.nltk.org/)
* [Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit](https://www.nltk.org/book/), a book by Steven Bird, Ewan Klein, and Edward Loper. 
* [spaCy · Industrial-strength Natural Language Processing](https://spacy.io/), Docs - https://spacy.io/api/doc , Example - https://spacy.io/api/example
* [spaCy 101: Everything you need to know](https://spacy.io/usage/spacy-101)
* scikit-learn's [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html)
* [Example usage of NLTK modules](https://www.nltk.org/howto.html)
