## About the dataset

The dataset consists of 4 columns: product category (e.g. headsets, cell phones), review title, review content and rating. The rating is numerical and takes one of the values 1, 2, 3, 4 or 5, where 1 is the worst score and 5 is the best. The data is not cleaned, so it needs to be preprocessed before building models.
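
A quick way to see why preprocessing is needed is to count duplicates and missing values before doing anything else. The sketch below does this on a few made-up rows that only mimic the four columns described above; `summarize` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Made-up rows that only mimic the dataset's four columns
sample = pd.DataFrame({
    "category": ["Headsets", "Headsets", "Cell Phones", "Headsets"],
    "review title": ["Great", "Great", None, "Bad mic"],
    "review content": ["Works well", "Works well", "Battery died", "Mic too quiet"],
    "rating": [5, 5, 1, 2],
})


def summarize(df):
    """Return a few basic quality indicators for a review table."""
    return {
        "rows": len(df),
        # rows that repeat an earlier row exactly
        "duplicate_rows": int(df.duplicated().sum()),
        # rows with at least one missing value
        "rows_with_missing": int(df.isna().any(axis=1).sum()),
        # how many reviews carry each rating
        "rating_counts": df["rating"].value_counts().sort_index().to_dict(),
    }


summary = summarize(sample)
print(summary)
```

On the real data the same checks could be run on `raw_data` once it is loaded.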

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

In [5]:
raw_data = pd.read_csv('/home/suraj/ClickUp/Jan-Feb/DataScience_ML_DL_Projects/MachineLearning/Ecommerce_Product_Review/sentiment_analysis/data/ebay_reviews.csv')
raw_data.head()

Unnamed: 0,category,review title,review content,rating
0,Headsets,Wireless gaming headset,This gaming headset ticks all the boxes # look...,5
1,Headsets,"Good for those with a big head, low budget","Easy setup, rated for 6 hours battery but mine...",3
2,Headsets,MezumiWireless Gaming Headset,I originally bought this wireless headset for ...,5
3,Headsets,HW- S2 great headset.,"This is my 2nd Mezumi headset, It kills the fi...",5
4,Headsets,BEST HEADPHONES I'VE PURCHASED IN MY ENTIRE LIFE,This is probably the best headset I've purchas...,5


Let's clean the dataset with the following steps:

1. Bringing the rating to a -1, 0, 1 scale, where 4 and 5 are positive (1), 3 is neutral (0) and 1 and 2 are negative (-1).

2. Removing duplicate words, stop words and duplicate reviews, and applying a stemmer.

Preprocessing reference - https://www.kaggle.com/code/wojtekbonicki/text-data-cleaning-using-user-defined-transformers
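
As a minimal sketch of step 1, the rating can be mapped onto the -1/0/1 scale with a plain dictionary. The ratings below are made up for illustration, and `SCALE` is a name chosen for this example.

```python
import pandas as pd

# 1 and 2 are negative, 3 is neutral, 4 and 5 are positive
SCALE = {1: -1, 2: -1, 3: 0, 4: 1, 5: 1}

# Made-up ratings, one per review
ratings = pd.Series([5, 3, 1, 4, 2], name="rating")

# Series.map looks every value up in the dictionary;
# a value missing from the dictionary would become NaN
sentiment = ratings.map(SCALE)
print(sentiment.tolist())
```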

In [None]:
cols = ['review title', 'review content', 'rating']


Dataset transformations via https://scikit-learn.org/stable/data_transforms.html
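
Before the actual transformers, here is a minimal sketch of the contract they follow: `fit` returns `self`, `transform` returns a modified copy, and `TransformerMixin` supplies `fit_transform`. `Lowercaser` and `Deduplicator` are hypothetical classes written only for this illustration, and the three rows are made up.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


# The scikit-learn developer guide recommends listing mixins before BaseEstimator
class Lowercaser(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that lowercases the given text columns."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; returning self keeps the step chainable
        return self

    def transform(self, X):
        X2 = X.copy()  # leave the caller's frame untouched
        for col in (self.columns or []):
            X2[col] = X2[col].str.lower()
        return X2


class Deduplicator(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that drops exact duplicate rows."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.drop_duplicates().reset_index(drop=True)


# Made-up rows: the first two differ only in letter case
demo = pd.DataFrame({"review title": ["Great", "GREAT", "Bad"], "rating": [5, 5, 1]})

pipe = Pipeline([
    ("lower", Lowercaser(columns=["review title"])),
    ("dedupe", Deduplicator()),
])
cleaned = pipe.fit_transform(demo)
print(cleaned)
```

Because lowercasing runs first, the two rows that differ only in case collapse into one in the second step, which is the same reason the pipeline further down removes duplicates again after cleaning.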

In [20]:
class DuplicatesRemover(BaseEstimator, TransformerMixin):
    #Transformer to remove duplicate reviews
    def fit(self,X, y=None):
        return self
    
    def transform(self,X):
        X2 = X.copy()
        #indices of duplicated reviews
        duplicate_idx = X2.duplicated()
        X2= X2[~duplicate_idx].dropna()
        return X2.set_index(np.arange(X2.shape[0]))

In [21]:
# text cleaner
class TextCleaner(BaseEstimator, TransformerMixin):
    # Transformer to remove punctuation and multiple spaces from text and change uppercase to lowercase

    def __init__(self,pattern="[!\"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]"):
        self.pattern= pattern
    
    def fit(self,X,y=None):
        return self
    
    def transform(self,X):
        X2= X.copy()
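        # regex=True is required, otherwise the pattern is matched as a literal string and nothing is collapsed
        X2.replace({r"\s\s+":" "}, regex=True, inplace=True)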

        for col in X2.columns:
            if X2.loc[:,col].dtypes == int: continue
            X2.loc[:,col] = X2.loc[:,col].str.replace(self.pattern,"", regex=True).str.lower()
        return X2

In [22]:
class StopWordsRemover(BaseEstimator,TransformerMixin):
    # Transformer to remove popular english words with some default exceptions. User can add his own words to keep.
    # This is basically done to ensure keywords remain mostly in the reviews
    def __init__(self,words_to_keep = ['few','not','off','all','any','not','no','very']):
        stop_words = set(stopwords.words('english'))
        self.eng_words = stop_words.difference(set(words_to_keep))

    def fit(self,X,y=None):
        return self

    def transform(self,X):
        X2 = X.copy()

        for col in X2.columns:
            if X2.loc[:, col].dtypes == int:continue
            # assign the whole column at once; this avoids chained assignment
            # (and the SettingWithCopyWarning it triggers) and the bare except
            X2[col] = X2[col].astype(str).apply(
                lambda review: " ".join(j for j in review.split(" ") if j.lower() not in self.eng_words)
            )

        return X2

In [23]:
# bring down keywords to their root by stemming
class Stemmer(BaseEstimator,TransformerMixin):
    # Transformer to stem words.
    def __init__(self, stem=True):
        self.stemmer = nltk.PorterStemmer()
        self.stem = stem
    
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if not self.stem:
            return X
        X2 = X.copy()
        for col in X2.columns:
            if X2.loc[:, col].dtypes == int: continue
            # stem every word and assign the whole column at once
            # (the earlier X2[:, col] indexing raised an error that the bare except swallowed,
            # so nothing was ever stemmed)
            X2[col] = X2[col].astype(str).apply(
                lambda review: " ".join(self.stemmer.stem(j) for j in review.split(" "))
            )
        return X2


In [24]:
class Rating(BaseEstimator, TransformerMixin):
    # Transformer to change reviews rating number to positive, negative and neutral
    # -1 for negative, 0 for neutral and 1 for positive

    def __init__(self, scale={1:-1, 2:-1, 3:0, 4:1, 5:1}, labels_to_del=[]):
        self.scale = scale
        self.labels_to_del = labels_to_del
    
    def fit(self, X, y=None):
        if self.labels_to_del != []:
            self.idx_to_del = X['rating'] == self.labels_to_del[0]
        return self

    def transform(self, X):    
        X2 = X.copy()
        if self.labels_to_del:
            X2 = X2[~self.idx_to_del]
        X2.replace(self.scale, inplace=True)
        return X2

In [25]:
cols = ['review title', 'review content', 'rating']

#pipeline
preprocessor = Pipeline([
    #at first duplicated reviews will be removed
    ('DuplicateRemover', DuplicatesRemover()),
    #symbols that will be removed are defined in the transformer but a user can define his own/some additional symbols
    ('TextCleaning',TextCleaner()),
    #removing popular english words
    ('StopWordsRemover',StopWordsRemover()),
    #if stem is False the words will not be stemmed
    ('Stemmer', Stemmer(stem=False)),
    #rating changer, in this example negative(1, 2) ratings are equal to -1, neutral (3) 0 and positive(4,5) 1
    ('Rating', Rating(scale={1:-1, 2:-1, 3:0, 4:1, 5:1})),
    #the author noticed that some duplicated reviews remain after cleaning, so DuplicatesRemover is applied a second time (using it only once would work too, but would make the data cleaning take longer)
    ('DuplicateRemover2', DuplicatesRemover())
])

In [27]:
#cleaned reviews are available after running this cell (note: it may take a while)
preprocessor.fit(raw_data[cols])
data_preprocessed = preprocessor.transform(raw_data[cols])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


In [28]:
data_preprocessed.head()

Unnamed: 0,review title,review content,rating
0,wireless gaming headset,gaming headset ticks all boxes looks grate b...,1
1,good big head low budget,easy setup rated 6 hours battery mine lasted s...,0
2,mezumiwireless gaming headset,originally bought wireless headset xbox latest...,1
3,hw s2 great headset,2nd mezumi headset kills first one better ran...,1
4,best headphones ive purchased entire life,probably best headset ive purchased till date ...,1


#### Approach 1
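Basic weighted-sum approach

* Remove stop words, reduce each word to its root with a stemmer, then compute the net rating of the text: +1 for each positive keyword, -1 for each negative keyword and 0 for each neutral keyword. The sign of the sum gives the sentiment.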
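
The net-rating idea can be sketched in a few lines. The two keyword sets below are toy lexicons invented for this example (the notebook loads its own from text files), and `net_rating` and `label` are hypothetical helper names. Sets are used so that each membership test stays cheap.

```python
# Toy lexicons, invented for this sketch only
POSITIVE = {"good", "great", "comfortable"}
NEGATIVE = {"bad", "worthless", "broken"}


def net_rating(text):
    """Add 1 for each positive keyword and subtract 1 for each negative one."""
    score = 0
    for word in text.lower().split():
        if word in POSITIVE:
            score += 1
        elif word in NEGATIVE:
            score -= 1
        # any other word contributes 0
    return score


def label(score):
    """Turn the sign of the net rating into a sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(label(net_rating("great sound and comfortable fit")))
print(label(net_rating("worthless and broken")))
print(label(net_rating("good sound but bad microphone")))
```

The last call shows the main limitation of a plain sum: one positive and one negative keyword cancel out, so a mixed review is labelled neutral.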



In [90]:
# creating a list of positive, negative and neutral keywords
with open('/home/suraj/ClickUp/Jan-Feb/DataScience_ML_DL_Projects/MachineLearning/Ecommerce_Product_Review/sentiment_analysis/data/positive-words.txt', 'r') as f:
    positive_words = [line.strip() for line in f]
with open('/home/suraj/ClickUp/Jan-Feb/DataScience_ML_DL_Projects/MachineLearning/Ecommerce_Product_Review/sentiment_analysis/data/negative-words.txt', 'r') as f:
    negative_words = [line.strip() for line in f]
with open('/home/suraj/ClickUp/Jan-Feb/DataScience_ML_DL_Projects/MachineLearning/Ecommerce_Product_Review/sentiment_analysis/data/neutral-words.txt', 'r') as f:
    neutral_words = [line.strip() for line in f]

In [91]:
print(positive_words)

['a+', 'abound', 'abounds', 'abundance', 'abundant', 'accessable', 'accessible', 'acclaim', 'acclaimed', 'acclamation', 'accolade', 'accolades', 'accommodative', 'accomodative', 'accomplish', 'accomplished', 'accomplishment', 'accomplishments', 'accurate', 'accurately', 'achievable', 'achievement', 'achievements', 'achievible', 'acumen', 'adaptable', 'adaptive', 'adequate', 'adjustable', 'admirable', 'admirably', 'admiration', 'admire', 'admirer', 'admiring', 'admiringly', 'adorable', 'adore', 'adored', 'adorer', 'adoring', 'adoringly', 'adroit', 'adroitly', 'adulate', 'adulation', 'adulatory', 'advanced', 'advantage', 'advantageous', 'advantageously', 'advantages', 'adventuresome', 'adventurous', 'advocate', 'advocated', 'advocates', 'affability', 'affable', 'affably', 'affectation', 'affection', 'affectionate', 'affinity', 'affirm', 'affirmation', 'affirmative', 'affluence', 'affluent', 'afford', 'affordable', 'affordably', 'afordable', 'agile', 'agilely', 'agility', 'agreeable', 'ag

In [92]:
print(negative_words)



In [93]:
print(neutral_words)

['adequate', 'anchored', 'apathetic', 'balanced', 'bland', 'clandestine', 'commonplace', 'complacent', 'concealed', 'conforming', 'constant', 'content', 'drowsy', 'durable', 'enigmatic', 'established', 'exhausted', 'explicit', 'factual', 'faithful', 'fatigued', 'fixed', 'genuine', 'graphic', 'humble', 'impartial', 'indifferent', 'informational', 'jaded', 'knowledgeable', 'legitimate', 'lifelike', 'mediocre', 'meek', 'mellow', 'moderate', 'modest', 'nonchalant', 'objective', 'obscure', 'ordinary', 'placid', 'plainspoken', 'presentable', 'puzzling', 'reported', 'satisfied', 'secretive', 'sedate', 'serene', 'settled', 'sluggish', 'stable', 'standard', 'subdued', 'tangible', 'tolerable', 'tranquil', 'typical', 'unassuming', 'unbiased', 'unchangeable', 'unconcerned', 'unintelligible', 'unopinionated', 'unpretentious', 'veiled', 'weary', 'normal']


In [94]:
# taking a random review and calculating its sentiment
# review taken from - https://www.amazon.in/Top-Examples-Use-Essay-Evidence/product-reviews/1479248738
review = """This is a useful book for SAT practice. My daughter was able to increase her SAT score significantly by understanding how the SATs are reviewed and graded. It's not a short cut but it does help practice the test and allow you to focus on what's important. It is beyond my understanding why the school system in our area doesn't use such a tool rather than leaving it to the kids to have to navigate this area on their own. This is not cheating, it simply explains what the examiners are looking for."""

In [95]:
# creating functions that preprocess a review paragraph by cleaning it, removing stop words, stemming etc.
# Steps >>
# 1. Duplicate remover
# 2. Text cleaning
# 3. Stop word remover
# 4. Stemmer and lemmatizer
from collections import Counter
import re
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer, WordNetLemmatizer

# one-time downloads of the NLTK resources used below
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# This is a helper function that maps NLTK part-of-speech tags to WordNet POS constants
# Full list is available here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

#function to remove duplicate words
def remove_duplicate(review):
    review = review.split(" ")
    # Counter keeps one key per distinct word, in order of first appearance
    unique_words = Counter(review)
    # join the distinct words back into a single string
    s = " ".join(unique_words.keys())
    return s

#function to clean text
def clean_text(review):
    #converting to lowercase
    text = review.lower()
    # removing @mentions, URLs, a leading "rt" and any character that is not a letter, digit, space or tab
    text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", text)
    #removing stopwords
    stop = set(stopwords.words('english'))
    text = " ".join([word for word in text.split() if word not in stop])
    # stemming
    snow = SnowballStemmer('english')
    # Word Tokenizer
    words = word_tokenize(text)
    stemmed_text = " ".join(snow.stem(w) for w in words)
    # lemmatization of the stemmed text
    wl = WordNetLemmatizer()
    words = word_tokenize(stemmed_text)
    # Get part-of-speech tags
    word_pos_tags = nltk.pos_tag(words)
    # Map each tag to its WordNet equivalent and lemmatize the token
    lemmatized_sentence = [wl.lemmatize(word, get_wordnet_pos(tag)) for word, tag in word_pos_tags]
    lemmatized_text = " ".join(lemmatized_sentence)
    return lemmatized_text




[nltk_data] Downloading package stopwords to /home/suraj/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/suraj/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/suraj/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /home/suraj/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [96]:
# creating predict_function
def predict_sentiment(review):
    #removing duplicate
    review=remove_duplicate(review)
    review= clean_text(review)
    total_review_rating = []
    words = review.split()
    for word in words:
        if word in negative_words:
            total_review_rating.append(-1)
        elif word in positive_words:
            total_review_rating.append(1)
        elif word in neutral_words:
            total_review_rating.append(0)
    print(total_review_rating)
    net_rating = sum(total_review_rating)
    if net_rating ==0:
        return "neutral"
    elif net_rating >0:
        return "positive"
    else:
        return "negative"




In [98]:
#predict
print(predict_sentiment(review))

[-1]
negative


In [99]:
# review link - https://www.amazon.in/gp/customer-reviews/RIKJKSB3IR6LP?ASIN=B0006U7FC0
# the review's opening sentence ("This product stopped working after just used for 3-4 times.") is not part of the string below
review2= """It is really worthless to buy such a costly product to waste money.
Neither seller is helping nor Amazon is helping me to get it services. I am not able to find a service center in Hyderabad."""

In [100]:
print(predict_sentiment(review2))

[-1]
negative


In [101]:
# link - https://www.amazon.in/product-reviews/B01F8YGEXE
review3 = """Liked the product. Cushioning is good. Some what small. But then it is 3ft by 6ft. Which is good for 1 person to sleep.
It is very comfortable. Has cotton cover. But its not waterproof. Hence better to put waterproof sheet over if using for an infant.
Overall good product and recommended."""

In [102]:
print(predict_sentiment(review3))

[1, 1, 1, 1, 1, 1]
positive


# Conclusion of first method 
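The first method is complete. Its results depend entirely on the keywords in the three files (positive-words.txt, negative-words.txt and neutral-words.txt). On the three sample reviews the labels look broadly satisfactory, although the first review, which reads as favorable, was labelled negative because only a single keyword matched. Stemming may be part of the cause, since stemmed words no longer necessarily match the unstemmed entries in the keyword files.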
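
One way to put a number on "satisfactory" would be to compare the predicted labels with labels derived from the ratings. The sketch below assumes both are available as lists of strings; the two lists are made up, and `agreement` is a hypothetical helper rather than something the notebook defines.

```python
from collections import Counter


def agreement(predicted, actual):
    """Share of reviews whose predicted label equals the rating-derived label."""
    if len(predicted) != len(actual):
        raise ValueError("both lists must have the same length")
    if not actual:
        raise ValueError("at least one review is required")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Made-up labels for four reviews
predicted = ["negative", "negative", "positive", "neutral"]
actual = ["positive", "negative", "positive", "negative"]

print(agreement(predicted, actual))

# Count each (actual, predicted) pair to see where the disagreements fall
confusion = Counter(zip(actual, predicted))
print(confusion)
```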

# 2. Count Vectorizer with Multinomial Naive Bayes