# Sentiment Analysis on Women's E-Commerce data
In this project we apply sentiment analysis methods to the dataset "Women's E-Commerce Clothing Reviews" (https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews).

## Content of the Dataset

This dataset includes 23486 rows and 10 feature variables. Each row corresponds to a customer review, and includes the variables:

- Clothing ID: Integer Categorical variable that refers to the specific piece being reviewed.
- Age: Positive Integer variable of the reviewer's age.
- Title: String variable for the title of the review.
- Review Text: String variable for the review body.
- Rating: Positive Ordinal Integer variable for the product score granted by the customer, from 1 (worst) to 5 (best).
- Recommended IND: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
- Positive Feedback Count: Positive Integer documenting the number of other customers who found this review positive.
- Division Name: Categorical name of the product's high-level division.
- Department Name: Categorical name of the product department.
- Class Name: Categorical name of the product class.

## Approach
The sentiment analysis of the clothing reviews is divided into the following 4 steps:
1. Data pre-processing
2. Build a lexicographic approach
3. Build a supervised machine-learning model
4. Evaluation and results

## 0. Load and explore the data

In [26]:
# NLP libraries and regular expressions
import nltk
import re

# Basic manipulation and numerics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# NLTK corpora and tools
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# function which does a train-test split for training a machine-learning model
from sklearn.model_selection import train_test_split
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /home/max/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/max/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [27]:
# Read the data: 23486 reviews of women's e-commerce clothing with 10 features
data = pd.read_csv("data/WomensEcomm.csv")
data = data[data["Review Text"].notna()]  # remove samples without Review Text
column_names = np.array(data.columns)[1:]  # skip the first column

# Read the sentiment lexicons into sets (fast membership tests, files closed afterwards)
with open("data/positive_words.txt", "r") as f:
    pos_words = set(f.read().split("\n"))
with open("data/negative_words.txt", "r") as f:
    neg_words = set(f.read().split("\n"))

# Print the column names
print("Columns of the data: \n%s " % column_names)

Columns of the data: 
['Clothing ID' 'Age' 'Title' 'Review Text' 'Rating' 'Recommended IND'
 'Positive Feedback Count' 'Division Name' 'Department Name' 'Class Name'] 


### The first 5 rows of the data-set

In [28]:
data.head()

| Unnamed: 0.1 | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 767 | 33 | | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1 | 1080 | 34 | | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |


## 1. Data pre-processing
- Create a train-test split
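- Tokenize the sentences and remove stop words and tokens of three characters or fewer, which also drops punctuation and most numbers (done via the `process_sentence` function)
- Apply the pre-processing to the training split (all reviews, plus the recommended and not-recommended subsets)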

In [29]:
# Split the data into a training set (70%) and a test set (30%)
train_data, test_data = train_test_split(data, test_size = 0.3)

train_pos = train_data[ train_data["Recommended IND"] == 1]
train_neg = train_data[ train_data["Recommended IND"] == 0]

stop_words = set(stopwords.words('english'))

def process_sentence(sample):
    """Tokenize a review and drop stop words and tokens of three characters or fewer."""
    word_tokens = word_tokenize(sample)
    # The length filter also removes punctuation tokens; stop word matching is case-sensitive
    return [w for w in word_tokens if w not in stop_words and len(w) > 3]

# Processed Review Texts (Train-Set)
train_data_text = [process_sentence(sentence) for sentence in train_data["Review Text"]]
# Processed Review Texts (Negative reviews)
train_data_neg = [process_sentence(sentence) for sentence in train_neg["Review Text"]]
# Processed Review Texts (Positive reviews)
train_data_pos = [process_sentence(sentence) for sentence in train_pos["Review Text"]]

### An example of the pre-processing output

In [30]:
print("Original sentence: \n%s \n" % train_data["Review Text"].iloc[1])
print("Pre-processed sentence:\n%s" % train_data_text[1]) 

Original sentence: 
Recommend this vest - nice design - fabric is thicker than expected - might suggest a size smaller - plenty of room in loose styling 

Pre-processed sentence:
['Recommend', 'vest', 'nice', 'design', 'fabric', 'thicker', 'expected', 'might', 'suggest', 'size', 'smaller', 'plenty', 'room', 'loose', 'styling']
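
The tokens above keep their original casing, and it is the length filter rather than an explicit rule that removes punctuation. A minimal sketch of a stricter variant follows, assuming lowercase alphabetic tokens are wanted. The regex tokenizer, the function name `process_sentence_strict` and the tiny `TOY_STOP_WORDS` set are illustrative stand-ins for `word_tokenize` and the NLTK stop word list, not what this notebook uses.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); only a handful of entries.
TOY_STOP_WORDS = {"this", "is", "than", "a", "of", "in", "the", "and"}

def process_sentence_strict(sample, stop_words=TOY_STOP_WORDS, min_len=4):
    """Lowercase the text, keep alphabetic tokens only, drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", sample.lower())
    return [t for t in tokens if t not in stop_words and len(t) >= min_len]

example = "Recommend this vest - nice design - fabric is thicker than expected"
print(process_sentence_strict(example))
```

Here `min_len=4` mirrors the `len(w) > 3` filter used in `process_sentence`; lowercasing first makes the stop word check independent of casing.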


## 2. Lexicographic approach:
Given a lexicon of words and their corresponding sentiments, we count the number of negative and positive words in each review.
The lexicon originates from the NYU course Statistics for Social Data and is available at: http://ptrckprry.com/course/ssd/.

The function *get_sentiment_score* counts the positive and negative words in a tokenized sentence, given the positive words (*pos_words*) and the negative words (*neg_words*).

The prediction is based on the share of positive words among all sentiment words found in the sentence. If this share exceeds the threshold *ratio* (0.5 by default, i.e. more positive than negative words), we classify the sentence as *1* (*positive sentiment*). Otherwise, and whenever the sentence contains no sentiment words at all, it is *0* (*negative sentiment*).

In [53]:
def get_sentiment_score(sentence, ratio=0.5):
    neg_cnt = 0
    pos_cnt = 0
    
    for word in sentence:
        if word in pos_words:
            pos_cnt += 1
        elif word in neg_words:
            neg_cnt += 1
    if pos_cnt + neg_cnt > 0:
        if pos_cnt / (pos_cnt + neg_cnt) > ratio:
            return 1
        else:
            return 0
    else:
        return 0
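
To make the effect of the threshold concrete, here is a small self-contained sketch of the same decision rule. The function name `lexicon_label` and the two toy lexicons are made up for illustration; the notebook itself uses the global `pos_words` and `neg_words` loaded from file.

```python
def lexicon_label(tokens, pos_lexicon, neg_lexicon, ratio=0.5):
    """Return 1 if the share of positive sentiment words exceeds `ratio`, else 0."""
    pos_cnt = sum(1 for t in tokens if t in pos_lexicon)
    # A word found in both lexicons counts as positive only, as with if/elif.
    neg_cnt = sum(1 for t in tokens if t in neg_lexicon and t not in pos_lexicon)
    total = pos_cnt + neg_cnt
    if total == 0:
        return 0  # no sentiment words: fall back to the negative class
    return 1 if pos_cnt / total > ratio else 0

# Toy lexicons, purely illustrative.
TOY_POS = {"nice", "love", "flattering"}
TOY_NEG = {"flaws", "cheap", "itchy"}

tokens = ["nice", "design", "love", "flaws"]  # 2 positive, 1 negative -> share 2/3
for threshold in (0.5, 0.6, 0.7):
    print(threshold, lexicon_label(tokens, TOY_POS, TOY_NEG, threshold))
```

Raising the threshold makes the rule stricter about calling a review positive, which is the trade-off explored later with the 40%, 50% and 60% settings.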

## 3.1 Supervised Machine Learning Approach: Naive Bayes
- Feature extraction using binary word-presence features over the training vocabulary
- Classification using Bayes' theorem

In [33]:
# Build (list of words, sentiment) tuples
train_data_pos_df = pd.DataFrame({'Text':train_data_pos})
train_data_pos_df['Sentiment'] = "Positive"
train_data_neg_df = pd.DataFrame({'Text':train_data_neg})
train_data_neg_df['Sentiment'] = "Negative"
frames = [train_data_pos_df, train_data_neg_df]
train_data_bayes = pd.concat(frames)

train_data_bayes = tuple(zip(train_data_bayes.Text, train_data_bayes.Sentiment))

In [34]:
# Collect all words from the (words, sentiment) tuples into one list
def get_words(train_data):
    all_words = []  # avoid shadowing the built-in all()
    for (words, sentiment) in train_data:
        all_words.extend(words)
    return all_words

# Return the distinct words (the vocabulary) via a frequency distribution
def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    features = wordlist.keys()
    return features

w_features = get_word_features(get_words(train_data_bayes))


def extract_features(document):
    document_words = set(document)
    features = {}
    for word in w_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
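
As an illustration of what these features look like, the following sketch rebuilds the same idea on a made-up two-review training set. `collections.Counter` stands in for `nltk.FreqDist` so that the example has no dependencies; the names `toy_train`, `vocabulary` and `toy_extract_features` are hypothetical.

```python
from collections import Counter

# Toy training set of (tokens, label) pairs, invented for illustration.
toy_train = [
    (["love", "dress", "pretty"], "Positive"),
    (["dress", "flaws", "cheap"], "Negative"),
]

# Vocabulary: every distinct word seen in training, in first-seen order.
vocabulary = list(Counter(w for tokens, _ in toy_train for w in tokens))

def toy_extract_features(document, vocab=vocabulary):
    """Map a token list to one boolean 'contains(word)' feature per vocabulary word."""
    document_words = set(document)
    return {"contains(%s)" % w: (w in document_words) for w in vocab}

features = toy_extract_features(["pretty", "dress", "unseen"])
print(features)
```

Words that never occurred in training (here `unseen`) produce no feature at all, so they cannot influence the classifier.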

## 3.2 Supervised Machine Learning Approach: Support Vector Machine
- Feature Extraction using tf_idf
- Classification using SVM with kernels: linear, polynomial, rbf and sigmoid

In [35]:
# Feature extraction with tf_idf
from sklearn.feature_extraction.text import TfidfVectorizer 
# The SVM implementation
from sklearn import svm
# Metrics for easy model evaluation
from sklearn.metrics import classification_report

# A vectorizer that fits TF-IDF features on some data and transforms data accordingly
vectorizer = TfidfVectorizer(min_df = 5,
                             max_df = 0.8,
                             sublinear_tf = True,
                             use_idf = True)

# Feature extraction on training- and testing data
train_vectors = vectorizer.fit_transform(train_data['Review Text'])
test_vectors = vectorizer.transform(test_data['Review Text'])
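
For intuition about the weights the vectorizer produces, here is a dependency-free sketch of TF-IDF with sublinear term frequency, smoothed inverse document frequency and L2 normalisation, which is how scikit-learn documents these settings. It is a simplified illustration, not the library implementation: it takes pre-tokenized documents and leaves out the `min_df` and `max_df` filtering; `tfidf_vector` and `toy_corpus` are hypothetical names.

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, corpus, sublinear_tf=True):
    """TF-IDF weights for one document, L2-normalised, with the vocabulary taken from `corpus`."""
    n_docs = len(corpus)
    # Document frequency: in how many corpus documents each term occurs.
    df = Counter(term for doc in corpus for term in set(doc))
    # Terms not seen in the corpus are ignored, as when transforming test data.
    counts = Counter(t for t in doc_tokens if t in df)
    weights = {}
    for term, count in counts.items():
        tf = 1.0 + math.log(count) if sublinear_tf else float(count)
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0  # smoothed idf
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

toy_corpus = [["dress", "pretty", "dress"], ["dress", "cheap"], ["shirt", "pretty"]]
vec = tfidf_vector(toy_corpus[0], toy_corpus)
print(vec)
```

With sublinear scaling, the second occurrence of `dress` raises its weight by a factor of `1 + ln 2` rather than doubling it.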

## 4. Evaluation and results

In [46]:
# Training the Naive Bayes classifier
training_set = nltk.classify.apply_features(extract_features, train_data_bayes)
classifier = nltk.NaiveBayesClassifier.train(training_set)

# Making predictions for the testing dataset (same pre-processing as for the training data)
predictions_bayes = [classifier.classify(extract_features(process_sentence(obj))) for obj in test_data["Review Text"]]
bin_pred_bayes = [1 if sentiment == "Positive" else 0 for sentiment in predictions_bayes]

In [47]:
# Training the SVM with a linear kernel
classifier_linear = svm.SVC(kernel='linear')
classifier_linear.fit(train_vectors, train_data['Recommended IND'])
prediction_linear = classifier_linear.predict(test_vectors)

In [48]:
# Training the SVM with a polynomial kernel (the polynomial degree is 3 by default)
classifier_poly = svm.SVC(kernel='poly')
classifier_poly.fit(train_vectors, train_data['Recommended IND'])
prediction_poly = classifier_poly.predict(test_vectors)

In [49]:
# Training the SVM with an RBF kernel
classifier_rbf = svm.SVC(kernel='rbf')
classifier_rbf.fit(train_vectors, train_data['Recommended IND'])
prediction_rbf = classifier_rbf.predict(test_vectors)

In [50]:
# Training the SVM with a sigmoid kernel
classifier_sigm = svm.SVC(kernel='sigmoid')
classifier_sigm.fit(train_vectors, train_data['Recommended IND'])
prediction_sigm = classifier_sigm.predict(test_vectors)
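
The four kernels differ only in how they measure the similarity of two feature vectors. The sketch below writes the standard kernel formulas out in plain Python for two small vectors. The values `gamma=1.0` and `coef0=0.0` are arbitrary choices for illustration and are not necessarily the defaults that `svm.SVC` applies.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    # <x, y>
    return dot(x, y)

def poly_kernel(x, y, gamma=1.0, coef0=0.0, degree=3):
    # (gamma * <x, y> + coef0) ** degree
    return (gamma * dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=1.0):
    # exp(-gamma * ||x - y||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    # tanh(gamma * <x, y> + coef0)
    return math.tanh(gamma * dot(x, y) + coef0)

x, y = [1.0, 2.0], [2.0, 0.0]
print(linear_kernel(x, y), poly_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))
```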

In [54]:
# Classify using the Lexicographic approach (scored on the training split; the method needs no fitting)
sentiments_40 = [get_sentiment_score(sentence, 0.4) for sentence in train_data_text]
sentiments_50 = [get_sentiment_score(sentence, 0.5) for sentence in train_data_text]
sentiments_60 = [get_sentiment_score(sentence, 0.6) for sentence in train_data_text]

In [57]:
# Reports for SVM
report_svm_linear = classification_report(test_data['Recommended IND'], prediction_linear, output_dict=True)
report_svm_poly = classification_report(test_data['Recommended IND'], prediction_poly, output_dict=True)
report_svm_rbf = classification_report(test_data['Recommended IND'], prediction_rbf, output_dict=True)
report_svm_sigm = classification_report(test_data['Recommended IND'], prediction_sigm, output_dict=True)

# Report for Naive Bayes
report_bayes = classification_report(test_data["Recommended IND"], bin_pred_bayes, output_dict=True)

# Reports for Lexicographic approach (computed on the training split, unlike the test-set reports above)
report_lex_40 = classification_report(train_data["Recommended IND"], sentiments_40, output_dict=True)
report_lex_50 = classification_report(train_data["Recommended IND"], sentiments_50, output_dict=True)
report_lex_60 = classification_report(train_data["Recommended IND"], sentiments_60, output_dict=True)
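
The report entries used in the summary are per-class precision and recall plus overall accuracy. As a reminder of what these numbers mean, here is a small sketch that computes them by hand on made-up labels; `per_class_metrics` and `accuracy` are hypothetical helper names, not part of scikit-learn.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision and recall for one class, treating `label` as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Invented labels: 1 = recommended, 0 = not recommended.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]

print("class 1:", per_class_metrics(y_true, y_pred, 1))
print("class 0:", per_class_metrics(y_true, y_pred, 0))
print("accuracy:", accuracy(y_true, y_pred))
```

Reporting both classes separately matters when the classes are imbalanced, because accuracy alone can look good while the minority class is poorly recognised.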

In [93]:
models = {'linear' : report_svm_linear,
           'poly' : report_svm_poly,
           'rbf' : report_svm_rbf,
           'sigm' : report_svm_sigm,
           'bayes' : report_bayes,
           'lex_40' : report_lex_40,
           'lex_50' : report_lex_50,
           'lex_60' : report_lex_60}

def get_results(report):
    return [str(round(report['0']['precision']*100, 2)) + "%",
            str(round(report['0']['recall']*100, 2)) + "%",
            str(round(report['1']['precision']*100, 2)) + "%",
            str(round(report['1']['recall']*100, 2)) + "%",
            str(round(report['accuracy']*100, 2)) + "%"]

results_summary = pd.DataFrame([get_results(models[model]) for model in models.keys()],
                               index=['SVM linear kernel',
                                      'SVM polynomial kernel',
                                      'SVM rbf kernel',
                                      'SVM sigmoid kernel',
                                      'Naive Bayes',
                                      'Lexicographic ratio = 40%',
                                      'Lexicographic ratio = 50%',
                                      'Lexicographic ratio = 60%'],
                               columns=['precision negative',
                                        'recall negative',
                                        'precision positive',
                                        'recall positive',
                                        'total accuracy'])

In [97]:
results_summary.to_csv('results', header=True)
