## Predicting E-Commerce Product Recommendation Ratings from Reviews:

This is a classic NLP problem that uses data from an e-commerce store focused on women's clothing. Each record in the dataset is a customer review consisting of a review title, a text description, and a rating (from 1 to 5) for a product, among other features.

I convert this into a binary classification problem: a customer recommends a product (label 1) if the rating is > 3; otherwise they do not recommend it (label 0). A neutral 3-star rating therefore falls into the "not recommended" class.

Main Objective: Leverage the review text attributes to predict the recommendation rating (classification)

In [1]:
import numpy as np
import pandas as pd

from sklearn.metrics import confusion_matrix, classification_report

In [2]:
# Load and View the Dataset
df = pd.read_csv('https://raw.githubusercontent.com/dipanjanS/feature_engineering_session_dhs18/master/ecommerce_product_ratings_prediction/Womens%20Clothing%20E-Commerce%20Reviews.csv', keep_default_na=False)
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [3]:
## Data Processing:

# 1) Merge all review text attributes (title, text description) into one attribute
df['Review'] = (df['Title'].map(str) +' '+ df['Review Text']).apply(lambda row: row.strip())

# 2) Convert the 5-star rating system into a binary recommendation rating of 1 or 0 
df['Rating'] = [1 if rating > 3 else 0 for rating in df['Rating']]

# Select only 'Review' and 'Rating' columns:
df = df[['Review', 'Rating']]
df.head()

Unnamed: 0,Review,Rating
0,Absolutely wonderful - silky and sexy and comf...,1
1,Love this dress! it's sooo pretty. i happene...,1
2,Some major design flaws I had such high hopes ...,0
3,"My favorite buy! I love, love, love this jumps...",1
4,Flattering shirt This shirt is very flattering...,1


In [4]:
# Remove all records with no review:
df = df[df['Review'] != '']
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22642 entries, 0 to 23485
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  22642 non-null  object
 1   Rating  22642 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 530.7+ KB


In [5]:
# Check Rating balance:
df['Rating'].value_counts()

1    17449
0     5193
Name: Rating, dtype: int64
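
The classes are imbalanced, so a useful reference point is the accuracy of a trivial model that always predicts the majority class. A minimal sketch using the counts printed above (the variable names are illustrative and not from the original notebook):

```python
# Class counts copied from the value_counts() output above
n_recommended = 17449      # label 1
n_not_recommended = 5193   # label 0

total = n_recommended + n_not_recommended
# Accuracy of a model that always predicts label 1
majority_baseline = n_recommended / total

print(total)
print(round(majority_baseline, 4))
```

Any classifier trained on this data should be judged against this baseline rather than against 50%.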

In [6]:
# Split the dataset into train and test datasets:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[['Review']], df['Rating'], random_state=42, stratify = df['Rating'])
X_train.shape, X_test.shape

((16981, 1), (5661, 1))

In [7]:
y_train.value_counts(normalize= True)

1    0.770626
0    0.229374
Name: Rating, dtype: float64

In [8]:
y_test.value_counts(normalize= True)

1    0.770712
0    0.229288
Name: Rating, dtype: float64
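
Passing `stratify` keeps the class ratio nearly identical in both splits, which is why the two outputs above agree to three decimals. A small sketch on synthetic labels (the toy data is an assumption made only for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same imbalance as the review data
y = np.array([1] * 770 + [0] * 230)
X = np.arange(len(y)).reshape(-1, 1)   # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

# Share of positive labels in each split
train_pos = y_tr.mean()
test_pos = y_te.mean()
print(train_pos, test_pos)
```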

### Text Pre-processing and Wrangling:


*   Text lowercasing
*   Expansion of contractions
*   Removing unnecessary characters, numbers and symbols
*   Stemming
*   Stopword removal

In [9]:
!pip install contractions
!pip install textsearch
!pip install tqdm
import nltk
nltk.download('punkt')
nltk.download('stopwords')

True

In [10]:
import nltk
import contractions
import re

# remove some stopwords to capture negation in n-grams if possible
stop_words = nltk.corpus.stopwords.words('english')
stop_words.remove('no')
stop_words.remove('not')
stop_words.remove('but')

# load up a simple porter stemmer
ps = nltk.porter.PorterStemmer()

def simple_text_preprocessor(document): 
    # lower case
    document = str(document).lower()
    
    # expand contractions
    document = contractions.fix(document)
    
    # remove unnecessary characters
    document = re.sub(r'[^a-zA-Z]',r' ', document)
    document = re.sub(r'nbsp', r'', document)
    document = re.sub(' +', ' ', document)
    
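    # simple porter stemming
    # note: stemming runs before stopword removal, so stemmed forms such as
    # 'thi', 'wa' and 'doe' no longer match the stopword list and are kept
    document = ' '.join([ps.stem(word) for word in document.split()])
    
    # stopwords removal
    document = ' '.join([word for word in document.split() if word not in stop_words])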
    
    return document

stp = np.vectorize(simple_text_preprocessor)
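
To see what the character-cleaning step does in isolation, here is a standard-library sketch that mirrors the three `re.sub` calls above on a made-up sentence. It deliberately skips contraction expansion, stemming and stopword removal, and the helper name `clean_characters` is hypothetical:

```python
import re

def clean_characters(text):
    """Mirror only the regex cleaning steps of simple_text_preprocessor."""
    text = text.lower()
    # every non-letter (digits, punctuation, symbols) becomes a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # drop leftover HTML entity fragments
    text = re.sub(r'nbsp', '', text)
    # collapse runs of spaces
    text = re.sub(' +', ' ', text)
    # trim the ends (the original achieves this later via split/join)
    return text.strip()

sample = "Love it!! Size 4 fits&nbsp;well."
cleaned = clean_characters(sample)
print(cleaned)
```

Note that digits are dropped entirely, so size information such as "4" does not survive into the cleaned text.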

In [11]:
# Apply the defined function on train and test datasets:
X_train['Clean Review'] = stp(X_train['Review'].values)
X_test['Clean Review'] = stp(X_test['Review'].values)

X_train.head()

Unnamed: 0,Review,Clean Review
19748,Loved the colors Love the dress. runs a little...,love color love dress run littl big order size...
20740,Great basic- but long! I'm fairly petite and w...,great basic but long fairli petit wa not expec...
15111,Love this top! I'm so glad i bought this. the ...,love thi top glad bought thi pictur doe not ju...
5607,"Graphic appeal I tend to like grey, black simp...",graphic appeal tend like grey black simpl clot...
8489,T-shirt love T la is such a great brand for t ...,shirt love la great brand shirt thi one soft c...


In [12]:
# Basic NLP Count based Features:
import string

# Character Count: total number of characters in the documents
X_train['char_count'] = X_train['Review'].apply(len)
# Word Count: total number of words in the documents
X_train['word_count'] = X_train['Review'].apply(lambda x: len(x.split()))
# Average Word Density: average length of the words used in the documents
X_train['word_density'] = X_train['char_count'] / (X_train['word_count']+1)
# Punctuation Count: total number of punctuation marks in the documents
X_train['punctuation_count'] = X_train['Review'].apply(lambda x: len("".join(_ for _ in x if _ in string.punctuation))) 
# Title Word Count: total number of proper case (title) words in the documents
X_train['title_word_count'] = X_train['Review'].apply(lambda x: len([wrd for wrd in x.split() if wrd.istitle()]))
# Upper Case Count: total number of upper case words in the documents
X_train['upper_case_word_count'] = X_train['Review'].apply(lambda x: len([wrd for wrd in x.split() if wrd.isupper()]))

# The same for test dataset:
X_test['char_count'] = X_test['Review'].apply(len)
X_test['word_count'] = X_test['Review'].apply(lambda x: len(x.split()))
X_test['word_density'] = X_test['char_count'] / (X_test['word_count']+1)
X_test['punctuation_count'] = X_test['Review'].apply(lambda x: len("".join(_ for _ in x if _ in string.punctuation))) 
X_test['title_word_count'] = X_test['Review'].apply(lambda x: len([wrd for wrd in x.split() if wrd.istitle()]))
X_test['upper_case_word_count'] = X_test['Review'].apply(lambda x: len([wrd for wrd in x.split() if wrd.isupper()]))
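
As a quick check of what each count feature measures, the same expressions can be applied to a single made-up review (the sample text is an assumption for illustration):

```python
import string

review = "Love this top! It is SO soft."
words = review.split()

char_count = len(review)
word_count = len(words)
# same smoothed ratio as above: the +1 avoids division by zero on empty text
word_density = char_count / (word_count + 1)
punctuation_count = sum(1 for ch in review if ch in string.punctuation)
# str.istitle(): first cased letter upper case, the rest lower case
title_word_count = sum(1 for w in words if w.istitle())
# str.isupper(): all cased letters are upper case
upper_case_word_count = sum(1 for w in words if w.isupper())

print(char_count, word_count, word_density,
      punctuation_count, title_word_count, upper_case_word_count)
```

A single-letter word such as "I" satisfies both `istitle()` and `isupper()`, so it is counted in both features.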

In [13]:
X_train.head()

Unnamed: 0,Review,Clean Review,char_count,word_count,word_density,punctuation_count,title_word_count,upper_case_word_count
19748,Loved the colors Love the dress. runs a little...,love color love dress run littl big order size...,207,41,4.928571,6,2,0
20740,Great basic- but long! I'm fairly petite and w...,great basic but long fairli petit wa not expec...,182,34,5.2,7,1,0
15111,Love this top! I'm so glad i bought this. the ...,love thi top glad bought thi pictur doe not ju...,420,79,5.25,21,1,0
5607,"Graphic appeal I tend to like grey, black simp...",graphic appeal tend like grey black simpl clot...,304,61,4.903226,11,2,1
8489,T-shirt love T la is such a great brand for t ...,shirt love la great brand shirt thi one soft c...,131,28,4.517241,3,1,1


In [14]:
# Add Features from Sentiment Analysis:
import textblob     # Unsupervised, lexicon-based sentiment analysis

x_train_snt_obj = X_train['Review'].apply(lambda row: textblob.TextBlob(row).sentiment)
X_train['Polarity'] = [obj.polarity for obj in x_train_snt_obj.values]
X_train['Subjectivity'] = [obj.subjectivity for obj in x_train_snt_obj.values]

x_test_snt_obj = X_test['Review'].apply(lambda row: textblob.TextBlob(row).sentiment)
X_test['Polarity'] = [obj.polarity for obj in x_test_snt_obj.values]
X_test['Subjectivity'] = [obj.subjectivity for obj in x_test_snt_obj.values]

In [15]:
X_train.head()

Unnamed: 0,Review,Clean Review,char_count,word_count,word_density,punctuation_count,title_word_count,upper_case_word_count,Polarity,Subjectivity
19748,Loved the colors Love the dress. runs a little...,love color love dress run littl big order size...,207,41,4.928571,6,2,0,0.149826,0.471528
20740,Great basic- but long! I'm fairly petite and w...,great basic but long fairli petit wa not expec...,182,34,5.2,7,1,0,0.4375,0.560714
15111,Love this top! I'm so glad i bought this. the ...,love thi top glad bought thi pictur doe not ju...,420,79,5.25,21,1,0,0.373718,0.54359
5607,"Graphic appeal I tend to like grey, black simp...",graphic appeal tend like grey black simpl clot...,304,61,4.903226,11,2,1,0.060417,0.569643
8489,T-shirt love T la is such a great brand for t ...,shirt love la great brand shirt thi one soft c...,131,28,4.517241,3,1,1,0.342857,0.421429


In [16]:
# Adding Bag of Words based Features - 1-grams:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df=0.0, max_df=1.0, ngram_range=(1, 1))
X_traincv = cv.fit_transform(X_train['Clean Review']).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
X_traincv = pd.DataFrame(X_traincv, columns=cv.get_feature_names_out())

X_testcv = cv.transform(X_test['Clean Review']).toarray()
X_testcv = pd.DataFrame(X_testcv, columns=cv.get_feature_names_out())
X_traincv.head()



Unnamed: 0,aa,aaaaaaamaz,aaaah,aaaahmaz,aam,ab,abbey,abbi,abck,abdomen,abdomin,abercrombi,abhor,abil,abject,abl,abnorm,abo,abolut,abou,abov,abroad,abruptli,absenc,abso,absolut,absoluti,absolutley,absolutli,absorb,abstract,absurd,absurdli,abt,abund,abus,abut,ac,acacia,accent,...,yogini,yoke,yolk,yoo,yore,york,yoself,young,younger,yourselv,youth,youthful,yr,yuck,yucki,yuk,yum,yummi,yummiest,yummysweat,yup,zag,zara,zed,zermatt,zero,zig,zigzag,zillion,zing,zip,zipepr,zipper,zipperi,zombi,zone,zooland,zoom,zowi,zuma
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
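
On a toy corpus the bag-of-words representation is easier to inspect. The sketch below uses made-up documents and reads the column order from the fitted `vocabulary_` mapping, which works regardless of scikit-learn version:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["love dress love color", "not love fit"]   # toy, already "cleaned"
toy_cv = CountVectorizer(ngram_range=(1, 1))

train_counts = toy_cv.fit_transform(train_docs).toarray()
# vocabulary_ maps each term to its column index
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)

# terms not seen during fit (here "zipper") are silently ignored at transform time
test_counts = toy_cv.transform(["love zipper"]).toarray()

print(vocab)
print(train_counts)
print(test_counts)
```

This is why the vectorizer is fitted on the training reviews only and then reused to transform the test reviews: both matrices end up with the same columns.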


In [17]:
X_train_metadata = X_train.drop(['Review', 'Clean Review'], axis=1).reset_index(drop=True)
X_test_metadata = X_test.drop(['Review', 'Clean Review'], axis=1).reset_index(drop=True)

X_train_comb = pd.concat([X_train_metadata, X_traincv], axis=1)
X_test_comb = pd.concat([X_test_metadata, X_testcv], axis=1)

X_train_comb.head()

Unnamed: 0,char_count,word_count,word_density,punctuation_count,title_word_count,upper_case_word_count,Polarity,Subjectivity,aa,aaaaaaamaz,aaaah,aaaahmaz,aam,ab,abbey,abbi,abck,abdomen,abdomin,abercrombi,abhor,abil,abject,abl,abnorm,abo,abolut,abou,abov,abroad,abruptli,absenc,abso,absolut,absoluti,absolutley,absolutli,absorb,abstract,absurd,...,yogini,yoke,yolk,yoo,yore,york,yoself,young,younger,yourselv,youth,youthful,yr,yuck,yucki,yuk,yum,yummi,yummiest,yummysweat,yup,zag,zara,zed,zermatt,zero,zig,zigzag,zillion,zing,zip,zipepr,zipper,zipperi,zombi,zone,zooland,zoom,zowi,zuma
0,207,41,4.928571,6,2,0,0.149826,0.471528,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,182,34,5.2,7,1,0,0.4375,0.560714,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,420,79,5.25,21,1,0,0.373718,0.54359,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,304,61,4.903226,11,2,1,0.060417,0.569643,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,131,28,4.517241,3,1,1,0.342857,0.421429,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
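
Calling `.toarray()` materializes a dense matrix with one column per vocabulary term, which can become memory-hungry for larger corpora. One possible alternative, sketched here on toy data and not part of the original notebook, is to keep the counts sparse and append the metadata columns with `scipy.sparse.hstack`:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

docs = ["love dress love color", "not love fit"]   # toy documents
# toy metadata: [char_count, word_count] per document
metadata = np.array([[21, 4], [12, 3]], dtype=float)

bow = CountVectorizer().fit_transform(docs)         # stays a sparse matrix
# metadata columns first, then one column per vocabulary term
combined = hstack([csr_matrix(metadata), bow]).tocsr()

print(combined.shape)
```

scikit-learn's `LogisticRegression` accepts sparse matrices directly, so a matrix built this way could be passed to `fit` without converting it back to a DataFrame.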


## Model Training and Evaluation

In [18]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=1, random_state=42, solver='liblinear')
lr.fit(X_train_comb, y_train)
predictions = lr.predict(X_test_comb)

print(classification_report(y_test, predictions))
pd.DataFrame(confusion_matrix(y_test, predictions))

              precision    recall  f1-score   support

           0       0.77      0.71      0.74      1298
           1       0.92      0.94      0.93      4363

    accuracy                           0.88      5661
   macro avg       0.84      0.82      0.83      5661
weighted avg       0.88      0.88      0.88      5661



Unnamed: 0,0,1
0,921,377
1,281,4082
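
The figures in the classification report can be reproduced by hand from the confusion matrix, where rows are true labels and columns are predicted labels. A plain-Python sketch (variable names are illustrative):

```python
# Confusion matrix from the output above: rows = true label, columns = predicted
tn, fp = 921, 377     # true label 0
fn, tp = 281, 4082    # true label 1

def prf(hit, false_pos, false_neg):
    """Return (precision, recall, f1) for one class."""
    precision = hit / (hit + false_pos)
    recall = hit / (hit + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# For class 0 the "hits" are the true negatives; fn are reviews wrongly predicted as 0
p0, r0, f0 = prf(tn, fn, fp)
p1, r1, f1 = prf(tp, fp, fn)

support0, support1 = tn + fp, fn + tp
total = support0 + support1
accuracy = (tn + tp) / total
macro_f1 = (f0 + f1) / 2
weighted_f1 = (support0 * f0 + support1 * f1) / total

for name, value in [("precision 0", p0), ("recall 0", r0), ("f1 0", f0),
                    ("precision 1", p1), ("recall 1", r1), ("f1 1", f1),
                    ("accuracy", accuracy), ("macro f1", macro_f1),
                    ("weighted f1", weighted_f1)]:
    print(f"{name}: {value:.2f}")
```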


## Conclusion:

This looks promising.

We correctly identify 71% of the negative (not recommended) reviews and 94% of the positive (recommended) reviews; these are the recall values. Precision is 77% for negative reviews and 92% for positive reviews.

The F1-score is 74% for negative reviews and 93% for positive reviews.

This gives a weighted-average F1-score of 88% (macro average 83%), which is quite good given that always predicting the majority class would reach only about 77% accuracy on this test set.