Introduction
The Amazon Fine Food Reviews dataset contains reviews of fine foods from Amazon. Some important statistics of this dataset are:

- Number of reviews: 568,454
- Number of users: 256,059
- Number of products: 74,258
- Timespan: Oct 1999 - Oct 2012
- Number of attributes in the data: 10

What are the Attributes?
1. Id - Review ID
2. ProductId - ID of the Product reviewed
3. UserId - ID of the User who reviewed
4. ProfileName - Name of User (in its Profile)
5. HelpfulnessNumerator - Number of People who found the review helpful
6. HelpfulnessDenominator - Number of People who indicated whether the review was helpful or not
7. Score - Rating (between 1 to 5)
8. Time - Timestamp of the review (in UNIX Time Stamp format)
9. Summary - Brief Summary (or basically the title) of the Review
10. Text - Text of the Review
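
The Time attribute stores seconds since the UNIX epoch. A minimal sketch of turning one such value into a calendar date, using the timestamp of the first review in this dataset and assuming the values are UTC seconds:

```python
from datetime import datetime, timezone

# Timestamp of the review with Id 1 (assumed to be seconds since the epoch, in UTC).
unix_time = 1303862400

# Convert to a timezone-aware datetime so the result does not depend on the local timezone.
review_date = datetime.fromtimestamp(unix_time, tz=timezone.utc)
print(review_date.date())
```

The same conversion can be applied to a whole column at once, but the single-value form is enough to sanity-check the timespan quoted above.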

Goal
Our main goal is to predict whether a review is positive or not. Reviews with a rating of 4 or 5 are treated as positive, and reviews with a rating of 1 or 2 as negative. A rating of 3 is considered neutral and, as you will see below, is ignored. We are thus trying to determine the polarity (or sentiment) of each review.

In [1]:
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import nltk
import seaborn as sns
import string

In [2]:
# connect to the SQLite database that holds the reviews table

conn = sqlite3.connect('/home/keenu/Documents/datasets/amazon-fine-food-review/database.sqlite')

In [3]:
# consider only positive or negative reviews,
# dropping those whose Score == 3

flt_data = pd.read_sql_query("""select * from reviews where Score != 3 """, conn)

In [4]:
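# Label a review as positive or negative based on its score.
# Scores of 3 were already filtered out, so only 1, 2, 4 and 5 reach this function.


def partition(request):
    if request > 3:
        return 'positive'
    return 'negative'

# Map each Score value to its positive/negative label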

actual_data = flt_data['Score']
positiveNegative = actual_data.map(partition)
flt_data['Score'] = positiveNegative


In [5]:
flt_data.shape

(525814, 10)

In [6]:
flt_data.head(10)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,positive,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,negative,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,positive,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,negative,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,positive,1350777600,Great taffy,Great taffy at a great price. There was a wid...
5,6,B006K2ZZ7K,ADT0SRK1MGOEU,Twoapennything,0,0,positive,1342051200,Nice Taffy,I got a wild hair for taffy and ordered this f...
6,7,B006K2ZZ7K,A1SP2KVKFXXRU1,David C. Sullivan,0,0,positive,1340150400,Great! Just as good as the expensive brands!,This saltwater taffy had great flavors and was...
7,8,B006K2ZZ7K,A3JRGQVEQN31IQ,Pamela G. Williams,0,0,positive,1336003200,"Wonderful, tasty taffy",This taffy is so good. It is very soft and ch...
8,9,B000E7L2R4,A1MZYO9TZK0BBI,R. James,1,1,positive,1322006400,Yay Barley,Right now I'm mostly just sprouting this so my...
9,10,B00171APVA,A21BT40VZCCYT4,Carol A. Reed,0,0,positive,1351209600,Healthy Dog Food,This is a very healthy dog food. Good for thei...


In [7]:
display = pd.read_sql_query("""select * from reviews where Score != 3 and UserID ="A1SP2KVKFXXRU1" 
ORDER BY productID""", conn)
display

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,194151,B0058AVD44,A1SP2KVKFXXRU1,David C. Sullivan,0,0,2,1341014400,"Tasted good, but many were stuck together","The candy tasted great, but I had to throw man..."
1,7,B006K2ZZ7K,A1SP2KVKFXXRU1,David C. Sullivan,0,0,5,1340150400,Great! Just as good as the expensive brands!,This saltwater taffy had great flavors and was...


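These two reviews come from the same user but cover different products and have different text, so they are legitimate.
In the next query, however, the user, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary and Text are identical across several products. Keeping such repeated rows would bias our results.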

In [8]:
display = pd.read_sql_query("""select * from reviews where Score != 3 and UserID ="AR5J8UI46CURR" 
ORDER BY productID""", conn)
display

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
2,138277,B000HDOPYM,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
3,73791,B000HDOPZG,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
4,155049,B000PAQ75C,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...


To fix this, we have to drop the duplicate entries from the data. This is part of data cleaning.


# Data Cleaning


Deduplication means removing duplicate rows. It is necessary to remove duplicates in order to get unbiased results. We check for duplicates based on UserId, ProfileName, Time and Text: if all of these values are equal, we keep only the first of those records and drop the rest. (A user cannot plausibly write the same review at exactly the same time for different products.)
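
To make the effect of this rule concrete, here is a small sketch on toy data. All values below are hypothetical and chosen only for illustration; the column names match the dataset.

```python
import pandas as pd

# Toy data (hypothetical values): user U1 has the same review attached to three products.
toy = pd.DataFrame({
    "ProductId": ["B3", "B1", "B2", "B4"],
    "UserId": ["U1", "U1", "U1", "U2"],
    "ProfileName": ["Ann", "Ann", "Ann", "Bob"],
    "Time": [100, 100, 100, 200],
    "Text": ["great wafers", "great wafers", "great wafers", "too sweet"],
})

# Sort first so that "first" is well defined, then keep one row per duplicate group.
toy_sorted = toy.sort_values("ProductId")
deduped = toy_sorted.drop_duplicates(
    subset=["UserId", "ProfileName", "Time", "Text"], keep="first"
)
print(deduped[["ProductId", "UserId"]])
```

Of the three identical U1 rows, only the one with the smallest ProductId survives, and the unrelated U2 row is untouched.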

In [9]:
sorted_data = flt_data.sort_values('ProductId', axis=0,ascending=True,inplace=False)

In [10]:
final_data = sorted_data.drop_duplicates(subset=["UserId", "ProfileName", "Time", "Text"], keep="first", inplace=False)


In [11]:
flt_data.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

In [12]:
final_data.shape

(364173, 10)

HelpfulnessNumerator should always be less than or equal to HelpfulnessDenominator, because the number of people who found a review helpful cannot exceed the number of people who voted on it. We check this condition and remove the records that violate it.
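
A minimal sketch of this check on toy data (the values are hypothetical). It builds a boolean mask once, so the violating rows can be inspected before they are dropped:

```python
import pandas as pd

# Toy data (hypothetical values): the row with Id 2 has more "helpful" votes than total votes.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "HelpfulnessNumerator": [1, 3, 0],
    "HelpfulnessDenominator": [1, 2, 0],
})

# True where the row is consistent (numerator does not exceed denominator).
valid = toy["HelpfulnessNumerator"] <= toy["HelpfulnessDenominator"]

kept = toy[valid]        # rows that pass the check
dropped = toy[~valid]    # rows that violate it, useful to look at before discarding
print(dropped)
```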

In [13]:
final = final_data[final_data.HelpfulnessNumerator <= final_data.HelpfulnessDenominator]

In [14]:
final.shape


(364171, 10)

In [15]:
final['Score'].value_counts()

positive    307061
negative     57110
Name: Score, dtype: int64

In [16]:
final_X = final['Text']
final_y = final['Score']

# Text-Processing

In [19]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

Converting all words to lowercase and removing punctuation and HTML tags, if any.

Stemming - Converting words to their base (stem) form (Ex - 'tasty' is reduced to the stem 'tasti'). Because related word forms tend to collapse to a shared stem, this reduces the vector dimension: we no longer keep a separate entry for every variant of a word.

Stopwords - Stopwords are very common words that can usually be removed without changing the sentiment of the sentence.

Ex - This pasta is so tasty ==> pasta tasty (This, is, so are stopwords, so they are removed)
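
The cleaning steps described above can be sketched without any external library. This is an illustration only: the stopword set is a hypothetical hand-picked one rather than NLTK's list, stemming is left out, and the punctuation rule (drop everything that is not a letter) is a simplification.

```python
import re

# Hypothetical, hand-picked stopword set for illustration only;
# it stands in for a full English stopword list.
TOY_STOPWORDS = {"this", "is", "so", "a", "the"}


def clean(text, stopwords=TOY_STOPWORDS):
    """Lowercase, strip HTML tags and punctuation, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"<.*?>", " ", text)       # replace HTML tags with a space
    text = re.sub(r"[^a-z\s]", " ", text)    # replace anything that is not a letter or whitespace
    return [word for word in text.split() if word not in stopwords]


cleaned = clean("This pasta is <b>so</b> tasty!")
print(cleaned)
```

A stemmer would then be applied to each remaining word.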

In [25]:
stop = set(stopwords.words('english'))     # build the stopword set once; set lookups are fast
sno = nltk.stem.SnowballStemmer('english')
cleanr = re.compile('<.*?>')               # pattern for HTML tags, compiled once outside the loop
temp = []
for sentence in final_X:
    sentence = sentence.lower()                         # Converting to lowercase
    sentence = re.sub(cleanr, ' ', sentence)            # Removing HTML tags
    sentence = re.sub(r'[?!\'"#|]', r'', sentence)      # Deleting these punctuation marks
    sentence = re.sub(r'[.,)(/]', r' ', sentence)       # Replacing these punctuation marks with a space

    # Stemming and removing stopwords (checked against the prebuilt set, not a fresh list per word)
    words = [sno.stem(word) for word in sentence.split() if word not in stop]
    temp.append(words)

final_X = temp


In [26]:
print(final_X[1])



['grew', 'read', 'sendak', 'book', 'watch', 'realli', 'rosi', 'movi', 'incorpor', 'love', 'son', 'love', 'howev', 'miss', 'hard', 'cover', 'version', 'paperback', 'seem', 'kind', 'flimsi', 'take', 'two', 'hand', 'keep', 'page', 'open']


Techniques for Encoding

BAG OF WORDS

In BoW we construct a dictionary containing the set of all unique words in our review texts, and we count how often each word occurs. If there are d unique words in the dictionary, then every sentence or review is represented by a vector of length d, and the count of each word in the review is stored at that word's position in the vector. Such a vector is highly sparse, because a single review uses only a small fraction of the dictionary.

Ex. pasta is tasty and pasta is good

Dictionary: [a] [and] [good] [is] [pasta] [tasty]

Vector: [0] [1] [1] [2] [2] [1]

(In the full dictionary, every other position of the vector is zero.)
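
The same example can be reproduced in a few lines of plain Python. This is a sketch of the idea, not the vectorizer used later in the notebook; `bow_vector` is a hypothetical helper name, and the vocabulary is fixed by hand to match the example.

```python
from collections import Counter


def bow_vector(sentence, vocabulary):
    """Return the count of each vocabulary word in the sentence, in vocabulary order."""
    counts = Counter(sentence.lower().split())
    # Counter returns 0 for words that never occur, which gives the zero entries.
    return [counts[word] for word in vocabulary]


vocabulary = ["a", "and", "good", "is", "pasta", "tasty"]
vector = bow_vector("pasta is tasty and pasta is good", vocabulary)
print(vector)
```

Each position of `vector` lines up with the word at the same position in `vocabulary`.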

BI-GRAM BOW

Building the dictionary from pairs of consecutive words gives a Bi-Gram model. Tri-Grams use three consecutive words, and so on for N-Grams.

CountVectorizer has a parameter ngram_range; when it is set to (1,2), the dictionary contains both single words and Bi-Grams.

However, this massively increases the dictionary size.
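
To see why the dictionary grows, here is a plain-Python sketch that lists the terms produced for one short sentence when both single words and pairs are kept. The `ngrams` helper is hypothetical and only mimics the idea; it is not how CountVectorizer is implemented.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams of the token list for n between n_min and n_max, inclusive."""
    terms = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the tokens.
        for start in range(len(tokens) - n + 1):
            terms.append(" ".join(tokens[start:start + n]))
    return terms


tokens = "pasta is tasty".split()
terms = ngrams(tokens)
print(terms)
```

Three words already yield five terms. Over a large corpus, most word pairs are rare, so the number of distinct terms grows far faster than the number of distinct words.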

In [30]:
# Join each review's stemmed tokens back into one space-separated string
final_X = [' '.join(row) for row in final_X]


In [31]:
from sklearn.feature_extraction.text import CountVectorizer 
count_vect = CountVectorizer(ngram_range=(1,2))
Bigram_data = count_vect.fit_transform(final_X)
print(Bigram_data[1])


  (0, 352561)	1
  (0, 2592392)	1
  (0, 1665573)	2
  (0, 1273268)	1
  (0, 2265120)	1
  (0, 2463299)	1
  (0, 3061916)	1
  (0, 2271891)	1
  (0, 2373784)	1
  (0, 1829046)	1
  (0, 1433205)	1
  (0, 1390119)	1
  (0, 1792391)	1
  (0, 1314826)	1
  (0, 691737)	1
  (0, 3024994)	1
  (0, 2026845)	1
  (0, 2451939)	1
  (0, 1531811)	1
  (0, 1103948)	1
  (0, 2767562)	1
  (0, 2951006)	1
  (0, 1305283)	1
  (0, 1517770)	1
  (0, 2016036)	1
  :	:
  (0, 2266614)	1
  (0, 2463301)	1
  (0, 353371)	1
  (0, 3062713)	1
  (0, 2274085)	1
  (0, 2373807)	1
  (0, 1829314)	1
  (0, 1433342)	1
  (0, 1670885)	1
  (0, 2593170)	1
  (0, 1668410)	1
  (0, 1391678)	1
  (0, 1792987)	1
  (0, 1315213)	1
  (0, 693000)	1
  (0, 3026295)	1
  (0, 2026855)	1
  (0, 2453237)	1
  (0, 1532668)	1
  (0, 1104086)	1
  (0, 2770297)	1
  (0, 2952321)	1
  (0, 1306301)	1
  (0, 1519437)	1
  (0, 2016352)	1
