## **Predicting E-commerce product recommendation ratings from reviews**
### **Main Objective:** Leverage the review text attributes to predict the recommendation rating.

**Importing useful libraries**




In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression

### Importing the dataset

In [3]:
data = pd.read_csv('https://raw.githubusercontent.com/dipanjanS/feature_engineering_session_dhs18/master/ecommerce_product_ratings_prediction/Womens%20Clothing%20E-Commerce%20Reviews.csv', keep_default_na=False)
data.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


## **Data Preprocessing**


*   We will merge the two review text columns (`Title` and `Review Text`) into one.
*   We will convert the 5-star rating into two categories: 0 (rating of 3 or below) and 1 (rating above 3).



In [4]:
# merge the title and the review body into a single text column
data['Reviews'] = data.Title + ' ' + data['Review Text']
# binarize the 5-star rating: 4 and 5 stars -> 1, 3 stars and below -> 0
data['Rating'] = [1 if rating > 3 else 0 for rating in data['Rating']]
# keep only the model input and the target
data = data[['Reviews','Rating']]

In [5]:
pd.set_option('display.max_colwidth', 1000)
data.head()

Unnamed: 0,Reviews,Rating
0,Absolutely wonderful - silky and sexy and comfortable,1
1,"Love this dress! it's sooo pretty. i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite. i bought a petite and am 5'8"". i love the length on me- hits just a little below the knee. would definitely be a true midi on someone who is truly petite.",1
2,"Some major design flaws I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c",0
3,"My favorite buy! I love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!",1
4,Flattering shirt This shirt is very flattering to all due to the adjustable front tie. it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan. love this shirt!!!,1


In [6]:
data.Rating.value_counts()

1    18208
0     5278
Name: Rating, dtype: int64

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Reviews  23486 non-null  object
 1   Rating   23486 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 367.1+ KB


In [8]:
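# Title + ' ' + Review Text is never an empty string (it is at least ' '),
# so comparing against '' finds nothing to drop and all 23486 rows are kept.
# A whitespace-aware filter would look like:
# data = data[data['Reviews'].str.strip() != ''].reset_index(drop=True)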
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Reviews  23486 non-null  object
 1   Rating   23486 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 367.1+ KB


### **Splitting the data into dependent and independent variables x , y**

In [9]:
x = data.iloc[:,0:1]
y = data.iloc[:,1:2]

### **Further splitting the data into train and test sets**

In [10]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=1)

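`Counter` returns a dictionary-like object that maps each distinct value to the number of times it occurs. Here it shows how many reviews of each class ended up in the train and test sets.

In [11]:
from collections import Counter
# y_train and y_test are single-column DataFrames, so count the 'Rating' column itself;
# iterating over the DataFrame directly would only yield the column name
Counter(y_train['Rating']), Counter(y_test['Rating'])

# Counter is not applied to x because the review texts are almost all unique, so the counts would be of no use to us.

(Counter({1: 14595, 0: 4193}), Counter({1: 3613, 0: 1085}))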

# Basic NLP with counting features
## We will use simple count-based features that can help text classification models, such as:


*   **Word count:** total number of words
*   **Character count:** total number of characters
*   **Average word density:** average length of the words used
*   **Punctuation count:** total number of punctuation marks
*   **Upper case count:** total number of upper case words
*   **Title word count:** total number of title case words

The code below computes the first three; the punctuation and title-word counts are left commented out.
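
As a minimal sketch of the features listed above, the function below computes all six for a single string using only the standard library. The function name `count_features` and the sample sentence are illustrative assumptions, not part of the notebook; the `+1` in the denominator mirrors the notebook's `Avg_word_density` formula.

```python
import string

def count_features(text: str) -> dict:
    """Return simple count-based features for one review string (illustrative helper)."""
    words = text.split()
    return {
        "char_count": len(text),
        "word_count": len(words),
        # +1 in the denominator mirrors the notebook and avoids division by zero
        "avg_word_density": len(text) / (len(words) + 1),
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "upper_case_count": sum(w.isupper() for w in words),
        "title_word_count": sum(w.istitle() for w in words),
    }

features = count_features("Love this dress! It is SO pretty.")
print(features)
```

With pandas, the same function could be applied row by row through `Series.apply`, in the same way the notebook applies `len`.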










In [12]:
import string

x_train['char_count'] = x_train['Reviews'].apply(len)
x_train['word_count'] = x_train['Reviews'].apply(lambda x: len(x.split()))
x_train['Avg_word_density'] = x_train['char_count']/(x_train['word_count']+1)
# x_train['Punctutation_count'] = x_train['Reviews'].apply(lambda x: len("".join(i for i in x if i in string.punctuation)))
# x_train['Title_word_count'] = x_train['Reviews'].apply(lambda x: len([j for j in x.split() if j.istitle()]))

x_test['char_count'] = x_test['Reviews'].apply(len)
x_test['word_count'] = x_test['Reviews'].apply(lambda x: len(x.split()))
x_test['Avg_word_density'] = x_test['char_count']/(x_test['word_count']+1)
# x_test['Punctutation_count'] = x_test['Reviews'].apply(lambda x: len("".join(i for i in x if i in string.punctuation)))
# x_test['Title_word_count'] = x_test['Reviews'].apply(lambda x: len([j for j in x.split() if j.istitle()]))


In [13]:
x_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18788 entries, 17668 to 235
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Reviews           18788 non-null  object 
 1   char_count        18788 non-null  int64  
 2   word_count        18788 non-null  int64  
 3   Avg_word_density  18788 non-null  float64
dtypes: float64(1), int64(2), object(1)
memory usage: 733.9+ KB


## **Trying to train a Logistic Regression model**

In [14]:
log_reg = LogisticRegression()
log_reg.fit(x_train.drop(['Reviews'], axis=1),y_train)

  y = column_or_1d(y, warn=True)


LogisticRegression()

In [15]:
y_pred = log_reg.predict(x_test.drop(['Reviews'], axis=1)) 

In [16]:
print(pd.DataFrame(confusion_matrix(y_test, y_pred)))
# classification_report expects (y_true, y_pred), in the same order as confusion_matrix
print(classification_report(y_test, y_pred))

   0     1
0  0  1085
1  0  3613
              precision    recall  f1-score   support

           0       0.00      0.00      0.00      1085
           1       0.77      1.00      0.87      3613

    accuracy                           0.77      4698
   macro avg       0.38      0.50      0.43      4698
weighted avg       0.59      0.77      0.67      4698





### The confusion matrix shows that our model did not predict a single 0. This was expected, as the features so far only measure review length and carry no information about sentiment or meaning.

## **Using Text Sentiment**

TextBlob is an open-source library that makes common NLP tasks, including sentiment analysis, easy to perform. It ships with a sentiment lexicon (in the form of an XML file), which it uses to produce both polarity and subjectivity scores.


*   The polarity score is a float within the range [-1.0, 1.0], where -1.0 is most negative and 1.0 is most positive.
*   The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.
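
To give a rough feel for how a lexicon can yield a polarity score in [-1.0, 1.0], here is a toy sketch that averages per-word scores. It is not TextBlob's implementation: the lexicon `TOY_LEXICON`, its values, and the function `toy_polarity` are made up for illustration, and the sketch ignores intensifiers and negation.

```python
import re

# Hypothetical mini-lexicon: word -> polarity in [-1.0, 1.0] (values invented for illustration)
TOY_LEXICON = {
    "love": 0.5,
    "great": 0.8,
    "perfect": 1.0,
    "cheap": -0.4,
    "tight": -0.2,
    "absurd": -0.5,
}

def toy_polarity(text: str) -> float:
    """Average the lexicon scores of the known words; 0.0 if none are found."""
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = [TOY_LEXICON[tok] for tok in tokens if tok in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

positive = toy_polarity("Perfect fit, great fabric")
negative = toy_polarity("Cheap and absurd")
neutral = toy_polarity(" ")
print(positive, negative, neutral)
```

Because the result is an average of values that each lie in [-1.0, 1.0], it stays in that range, and text without any lexicon words scores 0.0.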





In [17]:
import textblob

In [18]:
x_train_snt_obj = x_train['Reviews'].apply(lambda row: textblob.TextBlob(row).sentiment)
x_train['Polarity'] = [obj.polarity for obj in x_train_snt_obj.values]
x_train['Subjectivity'] = [obj.subjectivity for obj in x_train_snt_obj.values]

x_test_snt_obj = x_test['Reviews'].apply(lambda row: textblob.TextBlob(row).sentiment)
x_test['Polarity'] = [obj.polarity for obj in x_test_snt_obj.values]
x_test['Subjectivity'] = [obj.subjectivity for obj in x_test_snt_obj.values]

In [19]:
x_train.head()

Unnamed: 0,Reviews,char_count,word_count,Avg_word_density,Polarity,Subjectivity
17668,"Wide, not as expected, or pictured Wow this is huge! i'm all for the tent-look with the right style, but this was absurd. i could have fit two of me into the width, and with the peplum style, extra width doesn't flatter anyone. the fabric is also so thin that it doesn't hang nicely, it kind of floats around the body. if you want a wide top, then this is perfect, but if you wanted something that hangs nicely, look elsewhere. i purchased a petite s, my typical size.",468,89,5.2,0.174603,0.659048
5824,"Wonderful fabric, fabulously stylish Softest denim -- they feel like they've been through the wash a hundred times already! i'm pleased that the belt is removable as it adds just a bit too much bulk for me. i'm 5'4"" and bought the petite for a perfect fit. these will become a go-to pair!",288,53,5.333333,0.6375,0.766667
10122,"I love it! I saw this in the store and purchased it right on the spot (didn't even see the reviews until now). i can wear either a small or x-small in retailer but for this top, the x-small was a better fit as it draped nicely and was still very loose on me (i'm 5'4, 125 lbs, 34b). i was worried that the opening was too wide but it was perfect. i like the elbow length sleeves and the fact that the blouse is lined which mean it's not sheer and don't have to wear a camisole. i usually stick with the classics i",513,105,4.839623,0.22294,0.56717
8894,"Lovely but so small! The skirt is beautiful, but the sizing is kind of off. i usually wear a 6, at 5'6"" and 130 lbs. i ordered an 8, though, since sometimes athropologie clothes run small. even the 8 was too tight, so i'll have to exchange it.",243,48,4.959184,0.13699,0.569388
22124,,1,0,1.0,0.0,0.0


### **Training the Model and Evaluating its outcome**

In [20]:
log_reg.fit(x_train.drop(['Reviews'], axis=1),y_train)
pred = log_reg.predict(x_test.drop(['Reviews'], axis=1))
print(classification_report(y_test, pred))
cm = confusion_matrix(y_test, pred)
pd.DataFrame(cm)

  y = column_or_1d(y, warn=True)


              precision    recall  f1-score   support

           0       0.69      0.23      0.34      1085
           1       0.81      0.97      0.88      3613

    accuracy                           0.80      4698
   macro avg       0.75      0.60      0.61      4698
weighted avg       0.78      0.80      0.76      4698



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Unnamed: 0,0,1
0,246,839
1,112,3501


### **Text Pre-Processing and Text Wrangling**
We will focus on:

*   Text Lowercasing
*   Expanding contractions
*   Removing unnecessary characters, numbers and symbols
*   Stemming
*   Stopword removal





In [21]:
!pip install contractions
!pip install textsearch
!pip install tqdm
import nltk
nltk.download('punkt')
nltk.download('stopwords')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [22]:
import contractions
import nltk
import re

contractions.fix('i didn\'t like this ping-pong bat')

'i did not like this ping-pong bat'

In [23]:
# remove some stopwords to capture negation in n-grams if possible
stop_words = nltk.corpus.stopwords.words('english')
stop_words.remove('no')
stop_words.remove('not')
stop_words.remove('but')

# load up a simple porter stemmer - nothing fancy
ps = nltk.porter.PorterStemmer()

def simple_text_preprocessor(document): 
    # lower case
    document = str(document).lower()
    
    # expand contractions
    document = contractions.fix(document)
    
    # remove unnecessary characters
    document = re.sub(r'[^a-zA-Z]',r' ', document)
    document = re.sub(r'nbsp', r'', document)
    document = re.sub(' +', ' ', document)
    
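    # simple porter stemming
    # note: stemming runs before stopword removal, so stemmed forms of stopwords
    # (e.g. 'thi', 'wa', 'veri') no longer match the list and stay in the text
    document = ' '.join([ps.stem(word) for word in document.split()])
    
    # stopwords removal
    document = ' '.join([word for word in document.split() if word not in stop_words])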
    
    return document

stp = np.vectorize(simple_text_preprocessor)
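
The cell above keeps 'no', 'not' and 'but' out of the stopword list so that negation survives preprocessing. The sketch below illustrates the effect on bigrams with a tiny hand-written stopword set; the names `STOP_ALL`, `STOP_KEEP_NEGATION`, `filter_tokens` and `bigrams` are assumptions for this example and the set is far smaller than NLTK's list.

```python
# Tiny illustrative stopword sets (not NLTK's list)
STOP_ALL = {"this", "is", "the", "a", "no", "not", "but"}
STOP_KEEP_NEGATION = STOP_ALL - {"no", "not", "but"}

def filter_tokens(text, stop_words):
    """Lowercase, split on whitespace and drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in stop_words]

def bigrams(tokens):
    """Join each pair of neighbouring tokens into one bigram string."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

review = "this is not flattering"
without_negation = bigrams(filter_tokens(review, STOP_ALL))
with_negation = bigrams(filter_tokens(review, STOP_KEEP_NEGATION))
print(without_negation)  # []
print(with_negation)     # ['not flattering']
```

With unigram features only, as in this notebook's `ngram_range=(1,1)`, the benefit is smaller: the token 'not' is kept as a feature, but it is no longer tied to the word it negates.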

In [24]:
x_train['Clean Reviews'] = stp(x_train['Reviews'].values)
x_test['Clean Reviews'] = stp(x_test['Reviews'].values)
x_train.head()

Unnamed: 0,Reviews,char_count,word_count,Avg_word_density,Polarity,Subjectivity,Clean Reviews
17668,"Wide, not as expected, or pictured Wow this is huge! i'm all for the tent-look with the right style, but this was absurd. i could have fit two of me into the width, and with the peplum style, extra width doesn't flatter anyone. the fabric is also so thin that it doesn't hang nicely, it kind of floats around the body. if you want a wide top, then this is perfect, but if you wanted something that hangs nicely, look elsewhere. i purchased a petite s, my typical size.",468,89,5.2,0.174603,0.659048,wide not expect pictur wow thi huge tent look right style but thi wa absurd could fit two width peplum style extra width doe not flatter anyon fabric also thin doe not hang nice kind float around bodi want wide top thi perfect but want someth hang nice look elsewher purchas petit typic size
5824,"Wonderful fabric, fabulously stylish Softest denim -- they feel like they've been through the wash a hundred times already! i'm pleased that the belt is removable as it adds just a bit too much bulk for me. i'm 5'4"" and bought the petite for a perfect fit. these will become a go-to pair!",288,53,5.333333,0.6375,0.766667,wonder fabric fabul stylish softest denim feel like wash hundr time alreadi pleas belt remov add bit much bulk bought petit perfect fit becom go pair
10122,"I love it! I saw this in the store and purchased it right on the spot (didn't even see the reviews until now). i can wear either a small or x-small in retailer but for this top, the x-small was a better fit as it draped nicely and was still very loose on me (i'm 5'4, 125 lbs, 34b). i was worried that the opening was too wide but it was perfect. i like the elbow length sleeves and the fact that the blouse is lined which mean it's not sheer and don't have to wear a camisole. i usually stick with the classics i",513,105,4.839623,0.22294,0.56717,love saw thi store purchas right spot not even see review wear either small x small retail but thi top x small wa better fit drape nice wa still veri loos lb b wa worri open wa wide but wa perfect like elbow length sleev fact blous line mean not sheer not wear camisol usual stick classic
8894,"Lovely but so small! The skirt is beautiful, but the sizing is kind of off. i usually wear a 6, at 5'6"" and 130 lbs. i ordered an 8, though, since sometimes athropologie clothes run small. even the 8 was too tight, so i'll have to exchange it.",243,48,4.959184,0.13699,0.569388,love but small skirt beauti but size kind usual wear lb order though sinc sometim athropologi cloth run small even wa tight exchang
22124,,1,0,1.0,0.0,0.0,


In [25]:
x_train_metadata = x_train.drop(['Reviews', 'Clean Reviews'], axis=1).reset_index(drop=True)
x_test_metadata = x_test.drop(['Reviews', 'Clean Reviews'], axis=1).reset_index(drop=True)

x_train_metadata.head()

Unnamed: 0,char_count,word_count,Avg_word_density,Polarity,Subjectivity
0,468,89,5.2,0.174603,0.659048
1,288,53,5.333333,0.6375,0.766667
2,513,105,4.839623,0.22294,0.56717
3,243,48,4.959184,0.13699,0.569388
4,1,0,1.0,0.0,0.0


## **Experiment 3:** 
### Adding Bag of Words based Features

In [26]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df = 0.0, max_df=1.0, ngram_range=(1,1))
x_train_cv = cv.fit_transform(x_train['Clean Reviews']).toarray()
x_train_cv = pd.DataFrame(x_train_cv, columns = cv.get_feature_names_out())

x_test_cv = cv.transform(x_test['Clean Reviews']).toarray()
x_test_cv = pd.DataFrame(x_test_cv, columns = cv.get_feature_names_out())

x_train_cv.head()



Unnamed: 0,aa,aaaaaaamaz,aaaaandidon,aaaah,aaah,aam,ab,abbey,abbi,abdomen,...,zip,zipepr,ziploc,zipper,zombi,zone,zooland,zoom,zowi,zuma
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
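
The count matrix above can be reproduced in miniature with plain Python. This is a simplified sketch of the bag-of-words idea, not scikit-learn's implementation: it splits on whitespace only, whereas `CountVectorizer` applies its own token pattern. The helper names `fit_vocabulary` and `transform` and the sample documents are assumptions for illustration.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Collect the sorted set of distinct tokens seen in the training documents."""
    return sorted({tok for doc in docs for tok in doc.split()})

def transform(docs, vocabulary):
    """Turn each document into a row of token counts, one column per vocabulary term."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts.get(term, 0) for term in vocabulary])
    return rows

train_docs = ["love thi dress", "dress not fit", "love love fit"]
vocab = fit_vocabulary(train_docs)
X_train = transform(train_docs, vocab)

# Tokens that were not seen during fitting ('zipper' here) are ignored
X_new = transform(["love zipper"], vocab)

print(vocab)
print(X_train)
print(X_new)
```

The vocabulary is fixed when fitting on the training text, which is why the notebook calls `fit_transform` on the training reviews and only `transform` on the test reviews.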


In [67]:
x_test_metadata

Unnamed: 0,char_count,word_count,Avg_word_density,Polarity,Subjectivity
0,178,34,5.085714,0.833333,0.866667
1,500,99,5.000000,0.113333,0.342051
2,302,53,5.592593,0.446042,0.636667
3,463,94,4.873684,0.260534,0.605983
4,218,42,5.069767,-0.160000,0.393000
...,...,...,...,...,...
4693,336,66,5.014925,0.119444,0.556389
4694,385,74,5.133333,0.025974,0.640909
4695,413,83,4.916667,0.451323,0.622884
4696,278,52,5.245283,0.157692,0.386686


In [54]:
x_train_n = pd.concat([x_train_metadata, x_train_cv], axis=1)
x_test_n = pd.concat([x_test_metadata, x_test_cv], axis=1)

x_train_n.head()

Unnamed: 0,char_count,word_count,Avg_word_density,Polarity,Subjectivity,aa,aaaaaaamaz,aaaaandidon,aaaah,aaah,...,zip,zipepr,ziploc,zipper,zombi,zone,zooland,zoom,zowi,zuma
0,468,89,5.2,0.174603,0.659048,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,288,53,5.333333,0.6375,0.766667,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,513,105,4.839623,0.22294,0.56717,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,243,48,4.959184,0.13699,0.569388,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,1.0,0.0,0.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### **Training the model on the new features and evaluating it**

In [28]:
log_reg.fit(x_train_n, y_train)
pred = log_reg.predict(x_test_n)

print(classification_report(y_test, pred))
pd.DataFrame(confusion_matrix(y_test, pred))

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


              precision    recall  f1-score   support

           0       0.77      0.66      0.71      1085
           1       0.90      0.94      0.92      3613

    accuracy                           0.87      4698
   macro avg       0.83      0.80      0.81      4698
weighted avg       0.87      0.87      0.87      4698



Unnamed: 0,0,1
0,714,371
1,218,3395


### **Our model predicts '0' with a precision of 77% and '1' with a precision of 90%.**
### **Overall accuracy is about 87%, up from 80% with the count and sentiment features alone.**
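
The figures quoted here follow directly from the confusion matrix, whose rows are true classes and whose columns are predicted classes. The sketch below redoes that arithmetic for the bag-of-words matrix and for the first, count-features-only matrix; the helper name `per_class_metrics` is an assumption for this example.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples with true class i predicted as class j.

    Returns ({class: (precision, recall)}, accuracy), all rounded to 2 decimals.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    metrics = {}
    for k in range(n):
        true_positive = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        metrics[k] = (round(precision, 2), round(recall, 2))
    accuracy = round(sum(cm[k][k] for k in range(n)) / total, 2)
    return metrics, accuracy

# Bag-of-words model (confusion matrix shown above)
bow_metrics, bow_accuracy = per_class_metrics([[714, 371], [218, 3395]])

# First model with count features only: every review was predicted as class 1
count_metrics, count_accuracy = per_class_metrics([[0, 1085], [0, 3613]])

print(bow_metrics, bow_accuracy)
print(count_metrics, count_accuracy)
```

The second call shows why accuracy alone can mislead on this data: predicting 1 for every review already reaches 0.77 accuracy, because 3613 of the 4698 test reviews belong to class 1.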

Now I take an arbitrary sentence as a test review to check whether the trained model can predict its rating from the text alone.

First we convert the sentence into the same feature format the model was trained on, following the same steps as above.
Any sentence can then be tested with the code below.

In [127]:
# build the same features for a single review as were built for the training data
# replace 'type here' with the review to score, e.g. 'this product is so amazing, the best one i bought'
test_1 = pd.DataFrame(['type here'])
test_1['char_count'] = test_1[0].apply(len)
test_1['word_count'] = test_1[0].apply(lambda x: len(x.split()))
# divide by the review's own word count, not by x_test['word_count']
test_1['Avg_word_density'] = test_1['char_count']/(test_1['word_count']+1)

test_1_snt_obj = test_1[0].apply(lambda row: textblob.TextBlob(row).sentiment)
test_1['Polarity'] = [obj.polarity for obj in test_1_snt_obj.values]
test_1['Subjectivity'] = [obj.subjectivity for obj in test_1_snt_obj.values]
test_1

Unnamed: 0,0,char_count,word_count,Avg_word_density,Polarity,Subjectivity
0,type here,9,2,3.0,0.0,0.0


In [128]:
# clean the text with the same preprocessor that was applied before fitting the vectorizer
test_1_cv = cv.transform(stp(test_1[0].values)).toarray()
test_1_cv = pd.DataFrame(test_1_cv, columns = cv.get_feature_names_out())
test_1_cv

Unnamed: 0,aa,aaaaaaamaz,aaaaandidon,aaaah,aaah,aam,ab,abbey,abbi,abdomen,...,zip,zipepr,ziploc,zipper,zombi,zone,zooland,zoom,zowi,zuma
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [129]:
test_2_cv = pd.concat([test_1, test_1_cv], axis=1)
# drop the raw text column (column 0) so that only the model features remain
test_n = test_2_cv.iloc[:,1:]
test_n


Unnamed: 0,char_count,word_count,Avg_word_density,Polarity,Subjectivity,aa,aaaaaaamaz,aaaaandidon,aaaah,aaah,...,zip,zipepr,ziploc,zipper,zombi,zone,zooland,zoom,zowi,zuma
0,9,2,3.0,0.0,0.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [130]:
pred_new = log_reg.predict(test_n)
pred_new

array([1])