Women's Clothing E-Commerce Reviews (Supervised NLP model)

https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews

In [27]:
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [24]:
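# build the stop word set and the lemmatizer once, not once per word
STOP_WORDS = set(stopwords.words('english'))
lmz = WordNetLemmatizer()

def text_cleaner(text):
    # keep letters only, lowercase, and split into tokens
    text = re.sub('[^a-zA-Z]',' ',text)
    text = text.lower()
    text = text.split()
    # drop stop words and lemmatize what is left
    text = [lmz.lemmatize(word) for word in text if word not in STOP_WORDS]
    text = ' '.join(text)
    return text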
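
To make the effect of the cleaning steps concrete, here is a minimal stdlib-only sketch of the regex, lowercasing, splitting, and stop word filtering. It is not the function above: lemmatization is left out, and `DEMO_STOPWORDS` and `demo_clean` are hypothetical names, with the set standing in for the much longer NLTK English list, so real output may differ.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
DEMO_STOPWORDS = {"this", "it", "s", "i", "and", "the"}

def demo_clean(text):
    """Keep letters only, lowercase, split, and drop stop words (no lemmatization)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in DEMO_STOPWORDS)

print(demo_clean("Love this dress! it's sooo pretty."))
# prints: love dress sooo pretty
```

Because the apostrophe is replaced by a space, "it's" splits into "it" and "s" before the stop word filter runs.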

In [21]:
# import text data
data = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")
data.head(3)

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses


In [22]:
# select only "Review Text" & "Rating" columns
data = data[["Review Text","Rating"]]
data.head(3)

Unnamed: 0,Review Text,Rating
0,Absolutely wonderful - silky and sexy and comf...,4
1,Love this dress! it's sooo pretty. i happene...,5
2,I had such high hopes for this dress and reall...,3


In [23]:
# rating is 1-5 (1 = bad, 5 = outstanding); treat ratings 4-5 as positive (1) and 3 or below as negative (0)
data["Rating"] = (data["Rating"] >= 4).astype(int)
data.head(3)

Unnamed: 0,Review Text,Rating
0,Absolutely wonderful - silky and sexy and comf...,1
1,Love this dress! it's sooo pretty. i happene...,1
2,I had such high hopes for this dress and reall...,0


In [47]:
# cast to str so the cleaner can process every row (missing reviews become the string "nan")
data["Review Text"] = data["Review Text"].astype(str)
data.dtypes

Review Text    object
Rating          int64
dtype: object

In [50]:
data.shape

(23486, 2)

In [48]:
corpus = []
for review in data["Review Text"]:
    cleaned_review = text_cleaner(review)
    corpus.append(cleaned_review)

In [52]:
from sklearn.feature_extraction.text import CountVectorizer

count_v = CountVectorizer(max_features=23000)

In [53]:
# independent and dependent variables
X = count_v.fit_transform(corpus).toarray()

In [54]:
y = data.iloc[:,1].values

In [55]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

## Random Forest Classifier

In [56]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
train_rfc = rfc.fit(X_train,y_train)

In [57]:
print("Train score: ",rfc.score(X_train,y_train))
print("\nTest score: ",rfc.score(X_test,y_test))

Train score:  0.9922290823930168

Test score:  0.8222647935291614


## Logistic Regression

In [58]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
train_lr = lr.fit(X_train,y_train)

In [59]:
print("Train score: ",lr.score(X_train,y_train))
print("\nTest score: ",lr.score(X_test,y_test))

Train score:  0.9390568447945498

Test score:  0.8729246487867177


## Confusion Matrix

In [60]:
from sklearn.metrics import confusion_matrix

y_pred_rfc = rfc.predict(X_test)
y_pred_lr = lr.predict(X_test)

cm_rfc = confusion_matrix(y_test,y_pred_rfc)
cm_lr = confusion_matrix(y_test,y_pred_lr)

In [61]:
cm_rfc

array([[ 495,  595],
       [ 240, 3368]])

Random Forest Classifier (out of 4698 predictions):

1) Number Correct: 3863/4698 (82.23%)

2) Number Incorrect: 835/4698 (17.77%)
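
The counts above can be read straight off the confusion matrix. The sketch below is plain Python with a hypothetical helper name, `summarize`, and assumes scikit-learn's layout (rows are true labels, columns are predicted labels, class 0 first). It also reports recall per class and the accuracy of always predicting the majority class, which help put the 82% figure in context.

```python
def summarize(cm):
    """Accuracy and related rates from a 2x2 confusion matrix [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "correct": tn + tp,
        "incorrect": fp + fn,
        "accuracy": (tn + tp) / total,
        # share of actual negative reviews that were predicted negative
        "negative_recall": tn / (tn + fp),
        # share of actual positive reviews that were predicted positive
        "positive_recall": tp / (tp + fn),
        # accuracy of always predicting the larger class
        "majority_baseline": max(tn + fp, fn + tp) / total,
    }

rfc_stats = summarize([[495, 595], [240, 3368]])
for name, value in rfc_stats.items():
    print(name, round(value, 4))
```

For the random forest this gives 3863 correct and 835 incorrect. Negative recall comes out near 0.45, so more than half of the negative reviews in the test set are labeled positive, and the majority baseline is about 0.77.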

In [62]:
cm_lr

array([[ 689,  401],
       [ 196, 3412]])

Logistic Regression (out of 4698 predictions):

1) Number Correct: 4101/4698 (87.29%)

2) Number Incorrect: 597/4698 (12.71%)

## Improving Logistic Regression Model Performance

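Try SVD dimensionality reduction (LSA), first on the count vectors and then on tf-idf vectors, to see whether either improves on the logistic regression baseline.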

In [64]:
# tf-idf vectorizer; its features are reduced and modeled after the count-vector SVD run
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_v = TfidfVectorizer(max_df=0.5,
                             min_df=2, 
                             stop_words='english',
                             use_idf=True, 
                             norm=u'l2', 
                             smooth_idf=True
                            )
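
For reference, the weighting these settings select can be sketched in plain Python. With `use_idf=True` and `smooth_idf=True`, scikit-learn documents the inverse document frequency as idf(t) = ln((1 + n) / (1 + df(t))) + 1, and `norm='l2'` scales each document vector to unit length. The helper `tfidf_rows` and the toy documents are hypothetical, and the sketch skips the `max_df`, `min_df`, and stop word filtering.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Smoothed tf-idf with L2 row normalization for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    n = len(tokenized)
    vocab = sorted({tok for toks in tokenized for tok in toks})
    # document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        # guard against an all-zero row (empty document)
        length = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / length for w in weights])
    return vocab, rows

vocab, rows = tfidf_rows(["love dress", "love fit", "poor fit"])
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

In the first toy document, "dress" appears in one document and "love" in two, so "dress" receives the larger weight even though both occur once.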

In [65]:
X_2 = tfidf_v.fit_transform(corpus).toarray()
y_2 = data.iloc[:,1].values

In [66]:
X2_train,X2_test,y2_train,y2_test = train_test_split(X_2,y_2,test_size=0.2,random_state=42)

In [90]:
# dimension reduction (SVD)
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

svd = TruncatedSVD(2000)
lsa = make_pipeline(svd,Normalizer(copy=False))

In [92]:
X_train.shape

(18788, 12110)

In [91]:
X_train_lsa = lsa.fit_transform(X_train)

In [97]:
X_test_lsa = lsa.transform(X_test)

In [93]:
variance_explained = svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance captured by all components: ",total_variance*100)

Percent variance captured by all components:  96.05383797024875
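
One way to sanity-check the choice of 2000 components is to accumulate the explained variance ratios and find the smallest number of leading components that reaches a target share. The sketch below uses made-up ratios and a hypothetical helper, `components_for`; with the fitted `svd` above, `svd.explained_variance_ratio_` would be passed instead.

```python
from itertools import accumulate

def components_for(ratios, target):
    """Smallest k whose first k ratios sum to at least target, or None if never reached."""
    for k, running_total in enumerate(accumulate(ratios), start=1):
        if running_total >= target:
            return k
    return None

# Made-up ratios for illustration only (not from the notebook's data)
demo_ratios = [0.5, 0.25, 0.125, 0.0625]
print(components_for(demo_ratios, 0.8))
print(components_for(demo_ratios, 0.9))
print(components_for(demo_ratios, 0.99))
```

A result of `None` means the fitted components never reach the target, which would suggest either lowering the target or fitting more components.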


In [105]:
lr2 = LogisticRegression()
train_lr = lr2.fit(X_train_lsa,y_train)

In [106]:
print("Train score: ",lr2.score(X_train_lsa,y_train))
print("\nTest score: ",lr2.score(X_test_lsa,y_test))

Train score:  0.8891845859058973

Test score:  0.8656875266070668


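In [107]:
y_pred_lr2 = lr2.predict(X_test_lsa)
# score lr2's own predictions (y_pred_lr2), not y_pred_lr from the first model
cm_lr2 = confusion_matrix(y_test,y_pred_lr2)

In [108]:
cm_lr2

array([[ 689,  401],
       [ 196, 3412]])

The matrix shown above came from a run that passed y_pred_lr instead of y_pred_lr2, so it repeats cm_lr rather than describing lr2. The counts below are derived from the lr2 test score (0.8657).

Logistic Regression w/ count vectorized data + SVD (out of 4698 predictions):

1) Correct: 4067/4698 (86.57%)

2) Incorrect: 631/4698 (13.43%)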

In [102]:
# refit the same SVD pipeline on the tf-idf features (this overwrites the count-vector fit)
X2_train_lsa = lsa.fit_transform(X2_train)

In [103]:
X2_test_lsa = lsa.transform(X2_test)

In [104]:
variance_explained = svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance captured by all components: ",total_variance*100)

Percent variance captured by all components:  90.14079550190873


In [109]:
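lr_tfidf = LogisticRegression()
# use the labels from the tf-idf split (same values as y_train/y_test: same data, same random_state)
train_lr = lr_tfidf.fit(X2_train_lsa,y2_train)

In [111]:
print("Train score: ",lr_tfidf.score(X2_train_lsa,y2_train))
print("\nTest score: ",lr_tfidf.score(X2_test_lsa,y2_test))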

Train score:  0.8924313391526506

Test score:  0.8659003831417624


In [112]:
y_pred_lr_tfidf = lr_tfidf.predict(X2_test_lsa)
cm_lr_tfidf = confusion_matrix(y2_test,y_pred_lr_tfidf)
cm_lr_tfidf

array([[ 603,  487],
       [ 143, 3465]])

Logistic Regression w/ tf-idf vectorized data + SVD (out of 4698 predictions):

1) Correct: 4068/4698 (86.59%)

2) Incorrect: 630/4698 (13.41%)

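Neither SVD variant beat the plain count-vector logistic regression, which scored 87.29% on the test set. After SVD, the count-vector model reached 86.57% and the tf-idf model 86.59%, a difference of a single test prediction, so the two vectorizers performed essentially the same. The reduction did narrow the gap between train and test accuracy, from about 6.6 points to about 2.5.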