<a href="https://colab.research.google.com/github/Drownie/sentiment-analysist-review/blob/master/sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prepare Dataset 🗿

In [None]:
# Download the dataset from Kaggle
!curl -L -o amazon-fine-food-reviews.zip\
  https://www.kaggle.com/api/v1/datasets/download/snap/amazon-fine-food-reviews

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  242M  100  242M    0     0   100M      0  0:00:02  0:00:02 --:--:--  134M


In [None]:
# Unzip the dataset
!unzip amazon-fine-food-reviews.zip

Archive:  amazon-fine-food-reviews.zip
  inflating: Reviews.csv             
  inflating: database.sqlite         
  inflating: hashes.txt              


# Install Modules

In [None]:
# bs4 and the lxml parser are imported below, so install them alongside nltk
!pip install nltk beautifulsoup4 lxml



# Import Modules

In [None]:
import pandas as pd
import nltk

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

import re
from bs4 import BeautifulSoup

from sklearn.metrics import confusion_matrix, classification_report

In [None]:
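# Downloads every NLTK resource, which is slow; this notebook only uses
# the VADER lexicon, the stopword list, the Punkt tokenizer, and WordNet
nltk.download('all')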

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_rus to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |  

True

# Preprocess Data


In [None]:
df = pd.read_csv('Reviews.csv')

# Drop reviews with duplicate text, then drop rows with any missing value
df.drop_duplicates(subset=['Text'], inplace=True)
df.dropna(axis=0, inplace=True)

In [None]:
print(f'count: {len(df)}')
# df.head(10)

count: 393560


In [None]:
# Sample the data (optional)
# The full dataset also works, but it needs more compute.
# Pass random_state to df.sample if you need the same sample on every run.
sample_df = df.sample(n=50000)
print(f'count: {len(sample_df)}')

count: 50000


In [None]:
# Build these once instead of on every call to preprocess_data
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_data(text):
  # Lowercase, then strip HTML tags such as <br />
  new_string = text.lower()
  new_string = BeautifulSoup(new_string, "lxml").text
  # Remove parenthesized asides, double quotes, and possessive 's
  new_string = re.sub(r'\([^)]*\)', '', new_string)
  new_string = re.sub('"', '', new_string)
  new_string = re.sub(r"'s\b", "", new_string)
  # Replace every remaining non-letter character with a space
  new_string = re.sub("[^a-zA-Z]", " ", new_string)

  # Tokenize and filter stopwords (tokens are already lowercase)
  tokens = word_tokenize(new_string)
  filtered_tokens = [word for word in tokens if word not in stop_words]

  # Lemmatize and return the tokens joined into one string
  lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
  return ' '.join(lemmatized_tokens)

sample_df['ReviewText'] = sample_df['Text'].apply(preprocess_data)
sample_df.head(10)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,ReviewText
7092,7093,B004K30HO2,A1WHXQGV8DI7BK,Mary Ann,2,2,5,1307318400,Keurig Cups Lids & Filters,These cups and filters work excellent. A chea...,cup filter work excellent cheaper alternative ...
400254,400255,B002IEVJRY,A73DFWJE0CGY6,"Flight Risk (The Gypsy Moth) ""Exiled Yankee""",0,0,4,1339718400,"balanced flavor, good coffee buzz",Thia was an enjoyable alternative to many such...,thia enjoyable alternative many coffee drink m...
239077,239078,B0083QJUL8,A3ART1EIT6930S,Vivian,2,2,5,1326844800,Fantastic!,This price is amazing! We are a whole foods fa...,price amazing whole food family use baking waf...
309125,309126,B0014K91GY,A21KE10M10LCTE,Justine,14,15,1,1219190400,Not good at all,I thought it was great that there was hot coco...,thought great hot cocoa keurig machine watery ...
2212,2213,B0007T3V82,A2G3R12TYEX4RN,Lisa Merriman,3,3,2,1284681600,It's A Boy Bubble Gum,"Hi,<br /><br />My first grandson was born and ...",hi first grandson born wanted everything sun c...
386657,386658,B005Y10ZMS,AKAZT5193KFR1,"L. Samuelson ""L.W. Samuelson""",0,0,4,1351036800,Fills the Void,These diet bars have a rich dark chocolate tas...,diet bar rich dark chocolate taste little grit...
124476,124477,B000LKUYGE,A281NPSIMI1C2R,"Rebecca of Amazon ""The Rebecca Review""",0,0,3,1344902400,Chocolate with a Poem,Chocolove's dark chocolate is more like milk c...,chocolove dark chocolate like milk chocolate l...
241654,241655,B0081XPTBS,A2J7DG2LRZ7TN7,milkers_mom,0,0,5,1315958400,I would order this again!,The four pack of infant formula was delivered ...,four pack infant formula delivered quickly wit...
532862,532863,B0009F3SFA,AY3QU54B6NG72,barkely,0,0,5,1283731200,It works,I almost gave up on nursing but low in behold ...,almost gave nursing low behold ran product dau...
78964,78965,B0039OZOI2,A3T4LCHSYH8YL3,FYI,0,1,1,1298332800,this item isn't kosher,Just wanted to let you know that this item has...,wanted let know item mistaken kosher certifica...
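
To see what the regex steps in `preprocess_data` do on their own, here is a minimal sketch that reproduces only those substitutions with the standard library. The helper name `clean_text` and the sample strings are made up for illustration, and the final whitespace collapse stands in for what `word_tokenize` and `' '.join` do in the real function. HTML stripping, stopword filtering, and lemmatization are left out.

```python
import re

def clean_text(text: str) -> str:
    """Regex-only portion of the cleaning steps (no NLTK, no BeautifulSoup)."""
    s = text.lower()
    s = re.sub(r'\([^)]*\)', '', s)   # drop parenthesized asides
    s = re.sub('"', '', s)            # drop double quotes
    s = re.sub(r"'s\b", "", s)        # drop possessive 's
    s = re.sub("[^a-zA-Z]", " ", s)   # every other non-letter becomes a space
    return " ".join(s.split())        # collapse runs of whitespace

sample = 'Chocolove\'s dark chocolate (70% cocoa) is "great"!'
print(clean_text(sample))                        # chocolove dark chocolate is great
print(clean_text("this item isn't kosher"))      # this item isn t kosher
```

The second call shows a side effect worth knowing about: contractions such as "isn't" are split into "isn" and "t", because the apostrophe is replaced by a space.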


In [None]:
# Store the preprocessed sample (sample_df holds the ReviewText column, df does not)
sample_df.to_csv('Reviews_preprocessed.csv', index=False)

In [None]:
analyzer = SentimentIntensityAnalyzer()

In [None]:
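def get_sentiment(text):
    # polarity_scores returns 'neg', 'neu', 'pos', and 'compound' scores
    scores = analyzer.polarity_scores(text)

    # Predict positive (1) whenever VADER assigns any positive weight, else 0
    return 1 if scores['pos'] > 0 else 0

def get_positive_score(score):
    # Ground truth from the star rating: 3 stars or more counts as positive
    return 1 if score >= 3 else 0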

sample_df['Sentiment'] = sample_df['ReviewText'].apply(get_sentiment)
sample_df['Positive'] = sample_df['Score'].apply(get_positive_score)
sample_df.head(10)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,ReviewText,Sentiment,Positive
7092,7093,B004K30HO2,A1WHXQGV8DI7BK,Mary Ann,2,2,5,1307318400,Keurig Cups Lids & Filters,These cups and filters work excellent. A chea...,cup filter work excellent cheaper alternative ...,1,1
400254,400255,B002IEVJRY,A73DFWJE0CGY6,"Flight Risk (The Gypsy Moth) ""Exiled Yankee""",0,0,4,1339718400,"balanced flavor, good coffee buzz",Thia was an enjoyable alternative to many such...,thia enjoyable alternative many coffee drink m...,1,1
239077,239078,B0083QJUL8,A3ART1EIT6930S,Vivian,2,2,5,1326844800,Fantastic!,This price is amazing! We are a whole foods fa...,price amazing whole food family use baking waf...,1,1
309125,309126,B0014K91GY,A21KE10M10LCTE,Justine,14,15,1,1219190400,Not good at all,I thought it was great that there was hot coco...,thought great hot cocoa keurig machine watery ...,1,0
2212,2213,B0007T3V82,A2G3R12TYEX4RN,Lisa Merriman,3,3,2,1284681600,It's A Boy Bubble Gum,"Hi,<br /><br />My first grandson was born and ...",hi first grandson born wanted everything sun c...,1,0
386657,386658,B005Y10ZMS,AKAZT5193KFR1,"L. Samuelson ""L.W. Samuelson""",0,0,4,1351036800,Fills the Void,These diet bars have a rich dark chocolate tas...,diet bar rich dark chocolate taste little grit...,1,1
124476,124477,B000LKUYGE,A281NPSIMI1C2R,"Rebecca of Amazon ""The Rebecca Review""",0,0,3,1344902400,Chocolate with a Poem,Chocolove's dark chocolate is more like milk c...,chocolove dark chocolate like milk chocolate l...,1,1
241654,241655,B0081XPTBS,A2J7DG2LRZ7TN7,milkers_mom,0,0,5,1315958400,I would order this again!,The four pack of infant formula was delivered ...,four pack infant formula delivered quickly wit...,1,1
532862,532863,B0009F3SFA,AY3QU54B6NG72,barkely,0,0,5,1283731200,It works,I almost gave up on nursing but low in behold ...,almost gave nursing low behold ran product dau...,1,1
78964,78965,B0039OZOI2,A3T4LCHSYH8YL3,FYI,0,1,1,1298332800,this item isn't kosher,Just wanted to let you know that this item has...,wanted let know item mistaken kosher certifica...,1,0
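
Every row in the preview above gets `Sentiment` 1, including the 1-star and 2-star reviews, because `scores['pos'] > 0` holds as soon as a review contains any positive wording. One possible alternative, sketched below, thresholds VADER's `compound` score instead. The 0.05 cutoff is a commonly cited convention and an assumption here, not something this notebook uses, and the score dictionaries are hand-written for illustration rather than real VADER output.

```python
def label_any_positive(scores: dict) -> int:
    """The rule used above: 1 whenever the positive proportion is non-zero."""
    return 1 if scores['pos'] > 0 else 0

def label_compound(scores: dict, threshold: float = 0.05) -> int:
    """Alternative: 1 only when the overall compound score clears a threshold."""
    return 1 if scores['compound'] >= threshold else 0

# Hand-written dicts in the shape polarity_scores returns; the values are
# made up for illustration and are NOT real VADER output.
mostly_negative = {'neg': 0.45, 'neu': 0.40, 'pos': 0.15, 'compound': -0.62}
mostly_positive = {'neg': 0.00, 'neu': 0.55, 'pos': 0.45, 'compound': 0.81}

for name, scores in [('mostly_negative', mostly_negative),
                     ('mostly_positive', mostly_positive)]:
    print(name, label_any_positive(scores), label_compound(scores))
```

Under these made-up scores the two rules disagree only on the mostly negative review, which is the case the `pos > 0` rule cannot separate.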


In [None]:
print(confusion_matrix(sample_df['Positive'], sample_df['Sentiment']), end="\n\n")
print(classification_report(sample_df['Positive'], sample_df['Sentiment']), end="\n\n")

[[  713  6486]
 [  765 42036]]

              precision    recall  f1-score   support

           0       0.48      0.10      0.16      7199
           1       0.87      0.98      0.92     42801

    accuracy                           0.85     50000
   macro avg       0.67      0.54      0.54     50000
weighted avg       0.81      0.85      0.81     50000
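
The 0.85 accuracy needs context: 42801 of the 50000 sampled reviews are labeled positive, so a model that always predicts 1 would score about the same. The sketch below recomputes the headline numbers from the confusion matrix printed above using plain arithmetic; the variable names are chosen here for illustration.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
tn, fp = 713, 6486     # true 0: predicted 0, predicted 1
fn, tp = 765, 42036    # true 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
majority_baseline = (fn + tp) / total   # accuracy of always predicting 1

# Metrics for the negative class (label 0)
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"accuracy          {accuracy:.3f}")
print(f"majority baseline {majority_baseline:.3f}")
print(f"class 0 precision {precision_0:.3f}")
print(f"class 0 recall    {recall_0:.3f}")
print(f"class 0 f1        {f1_0:.3f}")
```

The classifier's accuracy comes out slightly below the always-positive baseline, and recall on negative reviews is about 0.10, so the macro-averaged scores are the more informative rows of the report.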




In [None]:
# Store the sample with predicted and ground-truth labels
sample_df.to_csv('Reviews_sentiment.csv', index=False)