# Sentiment Analysis on Women's E-Commerce data
In this project we apply sentiment analysis methods to the dataset "Women's E-Commerce Clothing Reviews" (https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews).

## Content of the Dataset

This dataset includes 23486 rows and 10 feature variables. Each row corresponds to a customer review, and includes the variables:

- Clothing ID: Integer Categorical variable that refers to the specific piece being reviewed.
- Age: Positive Integer variable of the reviewer's age.
- Title: String variable for the title of the review.
- Review Text: String variable for the review body.
- Rating: Positive Ordinal Integer variable for the product score granted by the customer from 1 Worst, to 5 Best.
- Recommended IND: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
- Positive Feedback Count: Positive Integer documenting the number of other customers who found this review positive.
- Division Name: Categorical name of the product's high-level division.
- Department Name: Categorical name of the product's department.
- Class Name: Categorical name of the product's class.

## Approach
The sentiment analysis of the clothing reviews is divided into the following four steps:
1. Data pre-processing
2. Build a lexicographic approach
3. Build a supervised machine-learning model
4. Evaluation and results

## 0. Load and explore the data

In [4]:
# NLP libraries and regular expressions
import nltk
import re

# Basic manipulation and numerics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# NLTK corpora and tools
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# function which does a train-test split for training a machine-learning model
from sklearn.model_selection import train_test_split

In [25]:
# Read the 23486 reviews of women's e-commerce clothing with their 10 features
data = pd.read_csv("data/WomensEcomm.csv")
data = data[data["Review Text"].notna()]  # remove samples without Review Text
column_names = np.array(data.columns)[1:]  # skip the first (unnamed) column

# Read in the sentiment dictionaries; the files are closed automatically
with open("data/positive_words.txt", "r") as f:
    pos_words = f.read().split("\n")
with open("data/negative_words.txt", "r") as f:
    neg_words = f.read().split("\n")

# Print the column names
print("Columns of the data: \n%s " % column_names)

Columns of the data: 
['Clothing ID' 'Age' 'Title' 'Review Text' 'Rating' 'Recommended IND'
 'Positive Feedback Count' 'Division Name' 'Department Name' 'Class Name'] 


### The first 5 rows of the data-set

In [17]:
data.head()

| Unnamed: 0.1 | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 767 | 33 | | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1 | 1080 | 34 | | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |


## 1. Data pre-processing
- Create a train-test split
- Tokenize the sentences and remove stopwords and tokens of three characters or fewer, which drops most punctuation and short numbers (done via the process_sentence function)
- Apply the pre-processing to the review texts of the training set

In [19]:
# Split the data into a training set (70%) and a test set (30%)
train_data, test_data = train_test_split(data, test_size = 0.3)

train_pos = train_data[ train_data["Recommended IND"] == 1]
train_neg = train_data[ train_data["Recommended IND"] == 0]

stop_words = set(stopwords.words('english'))

def process_sentence(sample):
    # Tokenize the review text into words and punctuation marks
    word_tokens = word_tokenize(sample)

    # Keep tokens that are not stopwords and are longer than 3 characters;
    # the length filter also drops most punctuation and short numbers
    filtered_sentence = [w for w in word_tokens if w not in stop_words and len(w) > 3]

    return filtered_sentence

# Processed Review Texts (Train-Set)
train_data_text = [process_sentence(sentence) for sentence in train_data["Review Text"]]
# Processed Review Texts (Negative reviews)
#train_data_neg = [process_sentence(sentence) for sentence in train_neg["Review Text"]]
# Processed Review Texts (Positive reviews)
#train_data_pos = [process_sentence(sentence) for sentence in train_pos["Review Text"]]

### An example of the pre-processing output

In [24]:
print("Original sentence: \n%s \n" % train_data["Review Text"].iloc[1])
print("Pre-processed sentence:\n%s" % train_data_text[1]) 

Original sentence: 
These are super comfortable. i love feeling like i'm wearing pajamas in public but being properly dressed, although i wear them around the house often, too. i don't regret spending this much money on these pants, it was worth it since i'm wearing them so much. 

Pre-processed sentence:
['These', 'super', 'comfortable', 'love', 'feeling', 'like', 'wearing', 'pajamas', 'public', 'properly', 'dressed', 'although', 'wear', 'around', 'house', 'often', 'regret', 'spending', 'much', 'money', 'pants', 'worth', 'since', 'wearing', 'much']
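
The output above still contains the capitalized 'These': the NLTK stopword list is lowercase and the membership test is case-sensitive. The following is a minimal sketch of a stricter variant, assuming a regular-expression tokenizer instead of `word_tokenize`. The names `TOY_STOP_WORDS` and `process_sentence_strict` are hypothetical and not part of the notebook; in the notebook one would pass the NLTK `stop_words` set instead of the toy set.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# the notebook uses nltk.corpus.stopwords.words('english') instead.
TOY_STOP_WORDS = {"these", "are", "them", "this", "that", "with"}

def process_sentence_strict(sample, stop_words=TOY_STOP_WORDS, min_len=4):
    """Lowercase, keep alphabetic tokens only, drop stopwords and short tokens."""
    # [a-z]+ matches runs of letters, so digits and punctuation never become tokens
    tokens = re.findall(r"[a-z]+", sample.lower())
    return [t for t in tokens if t not in stop_words and len(t) >= min_len]

print(process_sentence_strict("These are super comfortable, size 10!"))
# ['super', 'comfortable', 'size']
```

Lowercasing before the stopword test is the design choice that matters here; whether to also drop short tokens depends on how much signal words such as "fit" or "bad" carry for the task.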


## 2. Lexicographic approach:
Given a lexicon of words and their corresponding sentiments, we count the number of negative and positive words in each review.
The lexicon originates from the Statistics for Social Data course at NYU and is available at: http://ptrckprry.com/course/ssd/.

The following function, *get_sentiment_score*, counts the number of positive and negative words in a given sentence, using a list of positive words (*pos_words*) and a list of negative words (*neg_words*).

The prediction is obtained by comparing the two counts. If there are more positive than negative words, we classify the sentence as *1* (*positive sentiment*); if there are more negative than positive words, as *0* (*negative sentiment*). If the counts are equal, the function returns *2* (undecided).

In [26]:
def get_sentiment_score(sentence):
    neg_cnt = 0
    pos_cnt = 0
    
    for word in sentence:
        if word in pos_words:
            pos_cnt += 1
        elif word in neg_words:
            neg_cnt += 1
    
    if pos_cnt > neg_cnt:
        return 1
    
    elif neg_cnt > pos_cnt:
        return 0
    
    else:
        return 2
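
Because the function returns 2 on a tie, comparing its output directly with the 0/1 labels counts every tie as an error. A minimal sketch of one alternative is shown here: report the share of decided reviews (coverage) separately from the accuracy on those reviews. The lexicons, reviews and the name `lexicon_score` are invented for illustration; the notebook loads the full word lists from text files.

```python
# Hypothetical toy lexicons (sets give constant-time membership tests)
toy_pos = {"love", "comfortable", "flattering"}
toy_neg = {"regret", "flaws", "cheap"}

def lexicon_score(tokens, pos_words, neg_words):
    """Return 1 (positive), 0 (negative) or 2 (tie) from lexicon hit counts."""
    pos_cnt = sum(1 for w in tokens if w in pos_words)
    # Mirrors the elif in get_sentiment_score: a word found in both lists counts as positive
    neg_cnt = sum(1 for w in tokens if w not in pos_words and w in neg_words)
    if pos_cnt > neg_cnt:
        return 1
    if neg_cnt > pos_cnt:
        return 0
    return 2

# Invented (tokens, label) pairs
reviews = [
    (["love", "comfortable", "pants"], 1),
    (["major", "design", "flaws"], 0),
    (["love", "fabric", "regret", "price"], 1),  # one hit each: a tie
]

preds = [lexicon_score(tokens, toy_pos, toy_neg) for tokens, _ in reviews]
decided = [(p, y) for p, (_, y) in zip(preds, reviews) if p != 2]
coverage = len(decided) / len(reviews)                      # share of non-tied reviews
accuracy = sum(p == y for p, y in decided) / len(decided)   # accuracy on decided reviews
print(preds, coverage, accuracy)
```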

## 3. Supervised Machine Learning Approach: Feature Extraction and Classification
- Feature extraction using word frequencies
- Classification using a Naive Bayes classifier (Bayes' theorem)
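
To make the second bullet concrete: Naive Bayes picks the label that maximizes P(label) multiplied by the product of P(feature | label) over all features, treating the features as independent given the label. The block below is a minimal pure-Python sketch over word-presence features with add-one smoothing. It illustrates the principle only and is not NLTK's implementation, whose default smoothing differs; the function names and toy documents are made up.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (tokens, label) pairs. Returns the fitted counts."""
    label_counts = Counter(label for _, label in labeled_docs)
    vocab = {w for tokens, _ in labeled_docs for w in tokens}
    # doc_freq[label][word] = number of documents with that label containing word
    doc_freq = defaultdict(Counter)
    for tokens, label in labeled_docs:
        doc_freq[label].update(set(tokens))
    return label_counts, vocab, doc_freq

def predict_nb(tokens, model):
    label_counts, vocab, doc_freq = model
    total = sum(label_counts.values())
    present = set(tokens)
    best_label, best_logp = None, -math.inf
    for label, n_docs in label_counts.items():
        logp = math.log(n_docs / total)  # log prior P(label)
        for word in vocab:
            # Add-one smoothed estimate of P(word present | label)
            p = (doc_freq[label][word] + 1) / (n_docs + 2)
            logp += math.log(p if word in present else 1 - p)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Invented toy training data: 1 = recommended, 0 = not recommended
toy_train = [
    (["love", "dress", "pretty"], 1),
    (["love", "perfect"], 1),
    (["poor", "quality", "returned"], 0),
    (["returned", "disappointed"], 0),
]
model = train_nb(toy_train)
print(predict_nb(["love", "pretty", "color"], model))  # 1
print(predict_nb(["returned", "poor"], model))         # 0
```

Working in log space avoids numerical underflow when the vocabulary is large, since the product of many probabilities below 1 quickly becomes too small for floating-point numbers.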

In [60]:
#TODO: remove!!
#Getting tupples of list of words and sentiment
#train_data_pos_df = pd.DataFrame({'Text':train_data_pos})
#train_data_pos_df['Sentiment'] = "Positive"
#train_data_neg_df = pd.DataFrame({'Text':train_data_neg})
#train_data_neg_df['Sentiment'] = "Negative"
#frames = [train_data_pos_df, train_data_neg_df]
#train_data1 = pd.concat(frames)

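# Pair each tokenized training review with its label (1 = recommended, 0 = not recommended).
# The tokenized texts are used so that the vocabulary is built from words, not characters.
train_data_with_labels = tuple(zip(train_data_text, train_data["Recommended IND"]))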

In [63]:
# Collect all words of the corpus in one list
def get_words_in_corpus(corpus):
    all_words = []
    for (words, sentiment) in corpus:
        all_words.extend(words)
    return all_words

# Build the frequency distribution and return its distinct words
def get_word_features(wordlist):
    return list(nltk.FreqDist(wordlist).keys())

# Computed once; rebuilding the vocabulary for every document is very slow
w_features = get_word_features(get_words_in_corpus(train_data_with_labels))

# Encode a tokenized document as binary "contains(word)" features
def extract_features(document, word_features=w_features):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
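
A small worked example may help to see what the feature dictionaries look like. The sketch below uses an invented two-review corpus and `collections.Counter` as a stand-in for `nltk.FreqDist` (which subclasses `Counter`); `toy_corpus`, `vocabulary` and `toy_extract_features` are hypothetical names.

```python
from collections import Counter

# Invented toy corpus of (tokens, label) pairs, for illustration only
toy_corpus = (
    (["love", "dress", "love"], 1),
    (["poor", "quality"], 0),
)

# Counter stands in for nltk.FreqDist here
word_counts = Counter(w for tokens, _ in toy_corpus for w in tokens)
vocabulary = list(word_counts.keys())  # only the distinct words are used

def toy_extract_features(document):
    """Map a tokenized document to binary contains(word) features."""
    document_words = set(document)
    return {"contains(%s)" % w: (w in document_words) for w in vocabulary}

print(word_counts.most_common(1))
# [('love', 2)]
print(toy_extract_features(["love", "quality", "shoes"]))
# {'contains(love)': True, 'contains(dress)': False, 'contains(poor)': False, 'contains(quality)': True}
```

Two points are visible here: the counts themselves are discarded, so the features encode word presence rather than frequency, and a word that never occurred in the training corpus ("shoes") produces no feature at all.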

## 4. Evaluation and results

In [None]:
# Training the Naive Bayes classifier
training_set = nltk.classify.apply_features(extract_features, train_data_with_labels)
classifier = nltk.NaiveBayesClassifier.train(training_set)

In [None]:
#Getting data for testing
test_pos = test_data[ test_data["Recommended IND"] == 1]['Review Text']
test_neg = test_data[ test_data["Recommended IND"] == 0]['Review Text']

In [None]:
# Measure how the Naive Bayes classifier scores on up to 100 test reviews per class.
# The labels are the values of "Recommended IND": 0 (negative) and 1 (positive).
neg_cnt = 0
pos_cnt = 0
for obj in test_neg[:100]:
    res = classifier.classify(extract_features(process_sentence(obj)))
    if res == 0:
        neg_cnt = neg_cnt + 1
for obj in test_pos[:100]:
    res = classifier.classify(extract_features(process_sentence(obj)))
    if res == 1:
        pos_cnt = pos_cnt + 1

# Correctly classified reviews / evaluated reviews, per class
print('[Negative]: %s/%s ' % (neg_cnt, len(test_neg[:100])))
print('[Positive]: %s/%s ' % (pos_cnt, len(test_pos[:100])))
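
The per-class hit counts above can be condensed into the usual summary metrics. The following is a minimal sketch in plain Python, assuming two parallel lists of true and predicted 0/1 labels; `binary_metrics` is a hypothetical helper and the label lists are invented, not results from this notebook.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted or never present
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Invented labels for illustration only
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Calling the helper a second time with `positive=0` gives precision and recall for the negative class, which is worth reporting separately whenever the two classes are not equally frequent.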

In [None]:
# Measuring how the Lexicographic approach scored
#sentiments = [get_sentiment_score(sentence) for sentence in train_data_text]
#acc = np.mean([1 if pred  == label else 0 for pred, label in zip(sentiments, train_data["Recommended IND"])])
#acc