
### Amazon Fine Food Reviews Analysis
Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews

EDA: https://nycdatascience.com/blog/student-works/amazon-fine-foods-visualization/

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.

Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 - Oct 2012
Number of Attributes/Columns in data: 10

#### Attribute Information:

Id
ProductId - unique identifier for the product
UserId - unique identifier for the user
ProfileName
HelpfulnessNumerator - number of users who found the review helpful
HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
Score - rating between 1 and 5
Time - timestamp for the review
Summary - brief summary of the review
Text - text of the review
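
The sample rows printed later in this notebook carry Time values such as 1303862400, which look like Unix epoch seconds. The attribute list does not state the unit, so the sketch below treats that as an assumption and converts one value to a calendar date with the standard library.

```python
from datetime import datetime, timezone

# Assumption: Time is seconds since 1970-01-01 UTC (the unit is not stated in the dataset description).
sample_time = 1303862400  # Time value of the first row shown by df.head()

review_date = datetime.fromtimestamp(sample_time, tz=timezone.utc)
print(review_date.date())  # 2011-04-27
```

Under the same assumption, the whole column could be converted in one step with `pd.to_datetime(df['Time'], unit='s')`.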

#### Objective:
Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).


[Q] How to determine if a review is positive or negative?

[Ans] We could use the Score/Rating. A rating of 4 or 5 can be considered a positive review, and a rating of 1 or 2 a negative one. A review with a rating of 3 is considered neutral, and such reviews are excluded from our analysis. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.

### Importing Libraries

In [0]:
#Ignoring unnecessary warnings
import warnings
warnings.filterwarnings("ignore")                    

#for large and multi-dimensional arrays
import numpy as np                                  

#for data manipulation and analysis
import pandas as pd                                 

#Natural language processing tool-kit
#NLTK is a leading platform for building Python programs to work with human language data.
import nltk                                         

#Stopwords corpus
from nltk.corpus import stopwords  

# Stemmer
from nltk.stem import PorterStemmer                 

#For Bag of words
from sklearn.feature_extraction.text import CountVectorizer   

#For TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer          

#For Word2Vec
from gensim.models import Word2Vec                                   

### Reading Data

In [0]:
df = pd.read_csv('/Users/mohdsaquib/downloads/amazon/Reviews.csv')

### Exploratory Data Analysis

In [0]:
# Getting the head of data 
df.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


In [0]:
df.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

The Score column takes the values 1, 2, 3, 4 and 5. Scores of 1 and 2 are treated as negative reviews and scores of 4 and 5 as positive reviews. A score of 3 is treated as neutral, so those rows are deleted and every remaining review can be labeled as either positive or negative.

In [0]:
# Removing neutral reviews (.copy() so that later column assignment works on an independent DataFrame)
df_1 = df[df['Score'] != 3].copy()

In [0]:
# Converting Score values into a class label, either positive or negative.
def partition(x):
    # Neutral reviews (Score 3) were removed above, so only 1, 2, 4 and 5 remain.
    if x > 3:
        return 'positive'
    return 'negative'

score_upd = df_1['Score']
t = score_upd.map(partition)
df_1['Score'] = t

In [0]:
df_1.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | positive | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | negative | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | positive | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | negative | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | positive | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


## Preprocessing

### 1). Data Cleaning: Deduplication
Reviews may have duplicate entries. Hence it is necessary to remove duplicates in order to get unbiased results for the analysis of the data.
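
As a minimal sketch of what deduplication on a set of key columns does, the toy frame below (hypothetical data, not taken from the dataset) contains the same review text from the same user at the same time under two product IDs. Only the first of the two rows survives.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are the same review listed under two product IDs.
toy = pd.DataFrame({
    "ProductId": ["P1", "P2", "P3"],
    "UserId": ["U1", "U1", "U2"],
    "ProfileName": ["ann", "ann", "bob"],
    "Time": [100, 100, 200],
    "Text": ["great tea", "great tea", "too sweet"],
})

# Rows count as duplicates when all four key columns match; the first occurrence is kept.
key_cols = ["UserId", "ProfileName", "Time", "Text"]
deduped = toy.drop_duplicates(subset=key_cols, keep="first")

print(len(toy), "->", len(deduped))  # 3 -> 2
```

ProductId is deliberately left out of the key, so the same review attached to several products collapses to one row.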

In [0]:
# Dropping duplicates: rows that match on all four columns are treated as the same review
final_data = df_1.drop_duplicates(subset=["UserId", "ProfileName", "Time", "Text"])

### 2). Helpfulness Numerator should always be less than or equal to Helpfulness Denominator
HelpfulnessNumerator is the number of users who found the review helpful, and HelpfulnessDenominator is the number of users who voted either way (helpful plus not helpful). The numerator can therefore never exceed the denominator, so only rows that satisfy this condition are kept.

In [0]:
# Helpfulness numerator should always be less than or equal to the denominator
final = final_data[final_data['HelpfulnessNumerator'] <= final_data['HelpfulnessDenominator']]
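
Before dropping rows it can be useful to see how many violate the condition. The sketch below uses a hypothetical three-row frame with one inconsistent row and counts it with a boolean mask.

```python
import pandas as pd

# Hypothetical toy data: the second row claims 3 helpful votes out of only 2 votes cast.
toy = pd.DataFrame({
    "HelpfulnessNumerator": [1, 3, 0],
    "HelpfulnessDenominator": [1, 2, 0],
})

# A row is inconsistent when more users found it helpful than voted at all.
invalid = toy["HelpfulnessNumerator"] > toy["HelpfulnessDenominator"]
print(invalid.sum())  # 1

# Keeping the complement of the mask is equivalent to filtering with <=.
valid_rows = toy[~invalid]
```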

In [0]:
final_X = final['Text']
final_y = final['Score']

### 3). Converting all words to lowercase and removing punctuation and HTML tags, if any

Stemming - Converting words into their base or stem form (Ex - tasty is converted to the stem 'tasti'). This reduces the vector dimension because variants of a word are no longer counted as separate features.

Stopwords - Stopwords are common words that can usually be removed without changing the sentiment of a sentence.

Ex - This pasta is so tasty ==> pasta tasty (This, is and so are stopwords, so they are removed)

To see all the stopwords, see the code cell below.


In [0]:
stop = set(stopwords.words('english')) 
print(stop)

{'those', 'been', 'whom', 'them', 'off', 'couldn', 'above', 'through', "that'll", 'have', "wouldn't", 'so', 'myself', "don't", "hadn't", 'out', 'very', 'or', 'until', 'a', 's', 'am', 'mightn', "couldn't", 'we', 'of', 'between', 'just', "won't", 'then', 'doesn', 't', 'yours', 'why', "it's", 'm', 'they', 'you', 'now', 'which', 'your', 'an', 'yourselves', 'from', 'haven', 'theirs', 'having', "doesn't", 'she', 'their', 'will', "weren't", 'under', 'again', 'further', 'for', 'wasn', 'needn', 're', 'itself', 'yourself', 'weren', 'each', 'own', 'against', 'aren', 'being', 'because', 'over', 'hasn', 'only', 'into', 'some', 'these', 'll', 'with', 'not', 'while', 'no', 'ourselves', "you're", 'has', 'this', 'are', 'such', 'all', 'hers', 'his', "needn't", 'didn', 'ours', 'in', "you'd", "you've", 'themselves', 'isn', 'and', 'how', "mustn't", 'i', "aren't", 'than', 'up', 'our', 'when', 'is', 'me', 'few', 'here', 'during', 'after', 'did', 'other', 'ma', 'd', 'shouldn', 'it', "should've", "didn't", 'at
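
Before stopwords are filtered out and the remaining words are stemmed, each review is lowercased and stripped of HTML tags and punctuation. The sketch below shows only that cleaning step, using the standard library on one made-up sentence. The sample string is hypothetical and the character classes are a simplified assumption, not the exact ones used in the cell that follows.

```python
import re

TAG_RE = re.compile(r"<.*?>")           # non-greedy, so each tag is matched separately

raw = "Great taffy!<br />Soft, chewy (and fresh)."  # hypothetical review text
text = raw.lower()                       # lowercase everything
text = TAG_RE.sub(" ", text)             # replace each HTML tag with a space
text = re.sub(r"[?!'\"#]", "", text)     # delete these characters outright
text = re.sub(r"[.,()/]", " ", text)     # turn separators into spaces
tokens = text.split()                    # split on any run of whitespace

print(tokens)  # ['great', 'taffy', 'soft', 'chewy', 'and', 'fresh']
```

Replacing tags and separators with a space rather than an empty string keeps neighbouring words from being glued together, as would otherwise happen with "taffy!<br />Soft".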

In [0]:
import re
temp = []
snow = nltk.stem.SnowballStemmer('english')
cleanr = re.compile('<.*?>')                          # Compiled once, outside the loop
for sentence in final_X:
    sentence = sentence.lower()                       # Converting to lowercase
    sentence = cleanr.sub(' ', sentence)              # Removing HTML tags
    sentence = re.sub(r'[?!\'"#|]', r'', sentence)    # Deleting ? ! ' " # and |
    sentence = re.sub(r'[.,)(/]', r' ', sentence)     # Replacing remaining punctuation with spaces

    # Removing stopwords, then stemming. `stop` is the set built in the cell above,
    # so the stopword list is not reloaded for every word.
    words = [snow.stem(word) for word in sentence.split() if word not in stop]
    temp.append(words)

final_X = temp

In [0]:
print(final_X[1])

['product', 'arriv', 'label', 'jumbo', 'salt', 'peanut', 'peanut', 'actual', 'small', 'size', 'unsalt', 'sure', 'error', 'vendor', 'intend', 'repres', 'product', 'jumbo']


In [0]:
# Joining each list of stemmed words back into a single space-separated string
sent = [' '.join(row) for row in final_X]

final_X = sent
print(final_X[1])

product arriv label jumbo salt peanut peanut actual small size unsalt sure error vendor intend repres product jumbo
