## Women's E-Commerce Clothing Reviews

### About Dataset
#### Context
Welcome. This is a Women’s Clothing E-Commerce dataset revolving around the reviews written by customers. Its nine supportive features offer a great environment to parse out the text through its multiple dimensions. Because this is real commercial data, it has been anonymized, and references to the company in the review text and body have been replaced with “retailer”.

#### Content
This dataset includes 23486 rows and 10 feature variables. Each row corresponds to a customer review, and includes the variables:

- **Clothing ID:** Integer Categorical variable that refers to the specific piece being reviewed.
- **Age:** Positive Integer variable of the reviewer's age.
- **Title:** String variable for the title of the review.
- **Review Text:** String variable for the review body.
- **Rating:** Positive Ordinal Integer variable for the product score granted by the customer, from 1 (worst) to 5 (best).
- **Recommended IND:** Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
- **Positive Feedback Count:** Positive Integer documenting the number of other customers who found this review positive.
- **Division Name:** Categorical name of the product's high-level division.
- **Department Name:** Categorical name of the product's department.
- **Class Name:** Categorical name of the product's class.
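
The variable descriptions above translate into simple range checks. The following is a minimal sketch, not part of the dataset's documentation: the `validate_review` helper and its rules are illustrative assumptions, and it treats Positive Feedback Count as non-negative because the sample rows shown later contain zeros.

```python
def validate_review(row):
    """Return a list of problems found in one review record (a dict)."""
    problems = []
    if not 1 <= row["Rating"] <= 5:
        problems.append("Rating must be between 1 and 5")
    if row["Recommended IND"] not in (0, 1):
        problems.append("Recommended IND must be 0 or 1")
    if row["Age"] <= 0:
        problems.append("Age must be positive")
    if row["Positive Feedback Count"] < 0:
        problems.append("Positive Feedback Count must be non-negative")
    return problems


# Hypothetical record shaped like the first row of the dataset
sample = {
    "Clothing ID": 767,
    "Age": 33,
    "Rating": 4,
    "Recommended IND": 1,
    "Positive Feedback Count": 0,
}
print(validate_review(sample))
```

An empty list means the record passed every check; each failed check adds one message.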

In [1]:
# Importing required libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

'''import nltk
from nltk.stem.porter import PorterStemmer
nltk.download('stopwords')
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))'''

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
# import WordCloud
import pickle
import re 

import warnings
warnings.filterwarnings('ignore')


In [3]:
# Load dataset
df = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")
df

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses
...,...,...,...,...,...,...,...,...,...,...,...
23481,23481,1104,34,Great dress for many occasions,I was very happy to snag this dress at such a ...,5,1,0,General Petite,Dresses,Dresses
23482,23482,862,48,Wish it was made of cotton,"It reminds me of maternity clothes. soft, stre...",3,1,0,General Petite,Tops,Knits
23483,23483,1104,31,"Cute, but see through","This fit well, but the top was very see throug...",3,0,1,General Petite,Dresses,Dresses
23484,23484,1084,28,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,3,1,2,General,Dresses,Dresses


In [4]:
# Column names
print(f"Feature names: {df.columns}")

Feature names: Index(['Unnamed: 0', 'Clothing ID', 'Age', 'Title', 'Review Text', 'Rating',
       'Recommended IND', 'Positive Feedback Count', 'Division Name',
       'Department Name', 'Class Name'],
      dtype='object')


In [5]:
df["Review Text"]

0        Absolutely wonderful - silky and sexy and comf...
1        Love this dress!  it's sooo pretty.  i happene...
2        I had such high hopes for this dress and reall...
3        I love, love, love this jumpsuit. it's fun, fl...
4        This shirt is very flattering to all due to th...
                               ...                        
23481    I was very happy to snag this dress at such a ...
23482    It reminds me of maternity clothes. soft, stre...
23483    This fit well, but the top was very see throug...
23484    I bought this dress for a wedding i have this ...
23485    This dress in a lovely platinum is feminine an...
Name: Review Text, Length: 23486, dtype: object

In [6]:
df["Division Name"].value_counts()

Division Name
General           13850
General Petite     8120
Initmates          1502
Name: count, dtype: int64

In [7]:
df["Department Name"].value_counts()

Department Name
Tops        10468
Dresses      6319
Bottoms      3799
Intimate     1735
Jackets      1032
Trend         119
Name: count, dtype: int64

In [8]:
df["Class Name"].value_counts()

Class Name
Dresses           6319
Knits             4843
Blouses           3097
Sweaters          1428
Pants             1388
Jeans             1147
Fine gauge        1100
Skirts             945
Jackets            704
Lounge             691
Swim               350
Outerwear          328
Shorts             317
Sleep              228
Legwear            165
Intimates          154
Layering           146
Trend              119
Casual bottoms       2
Chemises             1
Name: count, dtype: int64

In [9]:
def sentiment(i):
  if i == 3:
    return 'Neutral'
  elif i<3:
    return 'Negative'
  else:
    return 'Positive'

df['Sentiment'] = df['Rating'].apply(sentiment)
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Sentiment
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates,Positive
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses,Positive
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,Neutral
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants,Positive
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses,Positive
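
To make the class boundaries of the rule above explicit, this sketch applies the same thresholds to every possible rating and tallies a small list of ratings. The `ratings` list is made up for illustration and is not drawn from the dataset.

```python
from collections import Counter


def sentiment(rating):
    # Same thresholds as the notebook cell: 3 is neutral, below is negative
    if rating == 3:
        return "Neutral"
    elif rating < 3:
        return "Negative"
    else:
        return "Positive"


# Show which class each of the five ratings falls into
mapping = {rating: sentiment(rating) for rating in range(1, 6)}
print(mapping)

# Tally classes for a hypothetical list of ratings
ratings = [4, 5, 3, 5, 5, 1, 2]
counts = Counter(sentiment(r) for r in ratings)
print(counts)
```

Collapsing five ratings into three classes puts two ratings each into Negative and Positive but only one into Neutral, which is worth keeping in mind when reading the per-class scores later.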


In [10]:
df.isnull().sum()

Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
Sentiment                     0
dtype: int64

In [11]:
# Function to fill review text based on recommended indicator and null values
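def fill_review_text(row):
    # `row` is a single DataFrame row, passed in by df.apply(..., axis=1)
    if pd.isnull(row['Review Text']):
        # No review body: substitute a placeholder word based on the recommendation flag
        if row['Recommended IND'] == 1:
            return 'good'
        else:
            return 'bad'
    else:
        return row['Review Text']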

# Apply the function to fill the review text column
df['Review Text'] = df.apply(fill_review_text, axis=1)

# Print the updated DataFrame
df

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Sentiment
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates,Positive
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses,Positive
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,Neutral
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants,Positive
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses,Positive
...,...,...,...,...,...,...,...,...,...,...,...,...
23481,23481,1104,34,Great dress for many occasions,I was very happy to snag this dress at such a ...,5,1,0,General Petite,Dresses,Dresses,Positive
23482,23482,862,48,Wish it was made of cotton,"It reminds me of maternity clothes. soft, stre...",3,1,0,General Petite,Tops,Knits,Neutral
23483,23483,1104,31,"Cute, but see through","This fit well, but the top was very see throug...",3,0,1,General Petite,Dresses,Dresses,Neutral
23484,23484,1084,28,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,3,1,2,General,Dresses,Dresses,Neutral


In [12]:
df.isnull().sum()

Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3810
Review Text                   0
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
Sentiment                     0
dtype: int64

In [14]:
print(df[df["Review Text"]== "good"]["Review Text"])

92       good
93       good
98       good
135      good
142      good
         ... 
23258    good
23301    good
23303    good
23470    good
23480    good
Name: Review Text, Length: 774, dtype: object


In [15]:
print(df[df["Review Text"]== "bad"]["Review Text"])

165      bad
523      bad
574      bad
580      bad
1046     bad
        ... 
20773    bad
22017    bad
22230    bad
22492    bad
23127    bad
Name: Review Text, Length: 71, dtype: object


In [16]:
# Features Datatypes 
print(f"Feature Datatypes: ")
print(df.dtypes)

Feature Datatypes: 
Unnamed: 0                  int64
Clothing ID                 int64
Age                         int64
Title                      object
Review Text                object
Rating                      int64
Recommended IND             int64
Positive Feedback Count     int64
Division Name              object
Department Name            object
Class Name                 object
Sentiment                  object
dtype: object


In [17]:
# checking for Null Values.
df[df["Division Name"].isnull()]

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Sentiment
9444,9444,72,25,My favorite socks!!!,"I never write reviews, but these socks are so ...",5,1,0,,,,Positive
13767,13767,492,23,So soft!,I just love this hoodie! it is so soft and com...,5,1,1,,,,Positive
13768,13768,492,49,Wardrobe staple,Love this hoodie. so soft and goes with everyt...,5,1,0,,,,Positive
13787,13787,492,48,,good,5,1,0,,,,Positive
16216,16216,152,36,Warm and cozy,"Just what i was looking for. soft, cozy and warm.",5,1,0,,,,Positive
16221,16221,152,37,Love!,I am loving these. they are quite long but are...,5,1,0,,,,Positive
16223,16223,152,39,"""long and warm""",These leg warmers are perfect for me. they are...,5,1,0,,,,Positive
18626,18626,184,34,Nubby footless tights,"These are amazing quality. i agree, size up to...",5,1,5,,,,Positive
18671,18671,184,54,New workhorse,These tights are amazing! if i care for them w...,5,1,0,,,,Positive
20088,20088,772,50,Comfy sweatshirt!,This sweatshirt is really nice! it's oversize...,5,1,0,,,,Positive


In [18]:
df.shape

(23486, 12)

# Count vectorization

In [20]:
'''vectorizer = CountVectorizer()

train_data,test_data = train_test_split( df,test_size=0.1)

X_train = vectorizer.fit_transform(train_data["Review Text"])

y_train = train_data['Sentiment']

X_test = vectorizer.transform(test_data["Review Text"])

y_test = test_data['Sentiment']'''


In [21]:
'''# decision tree
import datetime as d
import sklearn.metrics as mt
start=d.datetime.now()
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
pred = dt.predict(X_test)
print('Elapsed time: ',str(d.datetime.now()-start))
print('Accuracy score:',mt.accuracy_score(y_test, pred))'''

In [22]:
'''#logistic Regression
import datetime as d
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as mt
start= d.datetime.now()
lr = LogisticRegression(tol=0.002)
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
print('Elapsed time: ',str(d.datetime.now()-start))
print('Accuracy score:',mt.accuracy_score(y_test, pred))'''

In [23]:
'''print("Logistic Regression")
print("Classification Report on train data")
print(mt.classification_report(y_train, lr.predict(X_train)))'''


In [24]:
'''lr_cm=confusion_matrix(y_test.values, lr.predict(X_test))

plt.figure(figsize=(5,5))
plt.suptitle("Confusion Matrix",fontsize=24)

sns.heatmap(lr_cm, annot = True, cmap="viridis",cbar=False);
plt.xlabel('Predicted Value')
plt.ylabel('Actual Value')'''


In [25]:
'''print("Logistic Regression")
print("Classification Report on test data")
print(mt.classification_report(y_test, lr.predict(X_test)))'''


In [26]:
vectorizer = CountVectorizer()

x = df["Review Text"]
y = df["Sentiment"]


xtrain, xtest, ytrain, ytest = train_test_split(x,y,test_size=0.1, random_state=10)

xtrain = vectorizer.fit_transform(xtrain)

xtest = vectorizer.transform(xtest)
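
The cell above fits the vectorizer on the training text only and then reuses the learned vocabulary for the test text. As a rough, dependency-free sketch of what that fit/transform split means (this is a simplified illustration, not scikit-learn's implementation; the helper names and toy sentences are assumptions):

```python
import re
from collections import Counter

# Tokens of two or more word characters, mirroring CountVectorizer's
# default token_pattern; text is lowercased first, as the default does
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN.findall(text.lower())


def fit_vocabulary(docs):
    # Learn a term -> column index mapping from the training documents only
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}


def transform(docs, vocab):
    # Count known terms per document; terms unseen during fit are ignored
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in tokenize(doc) if tok in vocab)
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows


train_docs = ["Love this dress", "this dress runs small"]
test_docs = ["love the dress dress"]

vocab = fit_vocabulary(train_docs)
test_matrix = transform(test_docs, vocab)
print(vocab)
print(test_matrix)
```

The word "the" appears only in the test sentence, so it has no column and is dropped. This is why the vectorizer must be fitted before, not after, the split is applied to the test data.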



In [28]:
import datetime as d
from sklearn.linear_model import LogisticRegression
logistic_model = LogisticRegression()

# Train the model
logistic_model.fit(xtrain, ytrain)

# Predict on the testing data
y_pred = logistic_model.predict(xtest)

# Calculate accuracy
accuracy = accuracy_score(ytest, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8292890591741167


In [31]:
print("Logistic Regression")
print("Classification Report on test data")
print(classification_report(ytest, logistic_model.predict(xtest)))

Logistic Regression
Classification Report on test data
              precision    recall  f1-score   support

    Negative       0.60      0.50      0.55       264
     Neutral       0.43      0.36      0.39       263
    Positive       0.90      0.94      0.92      1822

    accuracy                           0.83      2349
   macro avg       0.64      0.60      0.62      2349
weighted avg       0.82      0.83      0.82      2349
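
The `macro avg` and `weighted avg` rows can be reproduced by hand from the per-class rows. The sketch below does this for recall, using the rounded values copied from the test-data report above; because the inputs are already rounded to two decimals, results computed this way can differ from the report in the last digit for other columns.

```python
# (recall, support) per class, copied from the test-data report above
report = {
    "Negative": (0.50, 264),
    "Neutral": (0.36, 263),
    "Positive": (0.94, 1822),
}

total_support = sum(support for _, support in report.values())

# Macro average: every class counts equally, regardless of size
macro_recall = sum(recall for recall, _ in report.values()) / len(report)

# Weighted average: each class contributes in proportion to its support
weighted_recall = (
    sum(recall * support for recall, support in report.values()) / total_support
)

print(total_support)
print(round(macro_recall, 2))
print(round(weighted_recall, 2))
```

The gap between the two averages reflects the class imbalance: the Positive class dominates the weighted figure, while the macro figure exposes the weaker Negative and Neutral recall.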



In [32]:
print("Logistic Regression")
print("Classification Report on train data")
print(classification_report(ytrain, logistic_model.predict(xtrain)))

Logistic Regression
Classification Report on train data
              precision    recall  f1-score   support

    Negative       0.84      0.77      0.80      2143
     Neutral       0.78      0.63      0.69      2608
    Positive       0.94      0.98      0.96     16386

    accuracy                           0.91     21137
   macro avg       0.85      0.79      0.82     21137
weighted avg       0.91      0.91      0.91     21137



# TF-IDF VECTORIZER

In [33]:
'''vectorizer = TfidfVectorizer()

train_data,test_data = train_test_split( df,test_size=0.1)

xtrain = vectorizer.fit_transform(train_data["Review Text"])

ytrain = train_data['Sentiment']

xtest = vectorizer.transform(test_data["Review Text"])

ytest = test_data['Sentiment']'''


In [34]:
'''#logistic Regression
import datetime as d
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as mt
start= d.datetime.now()
lr = LogisticRegression(tol=0.002)
lr.fit(xtrain, ytrain)
pred = lr.predict(xtest)
print('Elapsed time: ',str(d.datetime.now()-start))
print('Accuracy score:',mt.accuracy_score(ytest, pred))'''

In [35]:
vectorizer = TfidfVectorizer()

x = df["Review Text"]
y = df["Sentiment"]


xtrain, xtest, ytrain, ytest = train_test_split(x,y,test_size=0.1, random_state=10)

xtrain = vectorizer.fit_transform(xtrain)

xtest = vectorizer.transform(xtest)
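
For intuition about how TF-IDF reweights the raw counts, here is a small hand computation. It assumes scikit-learn's default settings (smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row); the toy counts and helper names are made up for illustration.

```python
import math


def smooth_idf(n_docs, doc_freq):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def tfidf_row(counts, doc_freqs, n_docs):
    # Scale each term count by its IDF, then L2-normalize the row
    weights = [tf * smooth_idf(n_docs, df) for tf, df in zip(counts, doc_freqs)]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]


# Toy corpus of 2 documents over the vocabulary [dress, love, small]
counts = [[1, 1, 0], [1, 0, 1]]
doc_freqs = [2, 1, 1]  # "dress" occurs in both documents, the others in one

rows = [tfidf_row(row, doc_freqs, n_docs=2) for row in counts]
for row in rows:
    print([round(w, 3) for w in row])
```

A term that appears in every document ("dress") gets the minimum IDF of 1, so the rarer term in each row ends up with the larger weight.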

In [104]:
logistic_model = LogisticRegression()

start= d.datetime.now()

# Train the model
logistic_model.fit(xtrain, ytrain)

# Predict on the testing data
y_pred = logistic_model.predict(xtest)

# Calculate accuracy
accuracy = accuracy_score(ytest, y_pred)

print('Elapsed time: ',str(d.datetime.now()-start))
print("Accuracy:", accuracy)

Elapsed time:  0:00:01.358070
Accuracy: 0.8365261813537676


In [39]:
print("Logistic Regression")
print("Classification Report on test data")
print(classification_report(ytest, logistic_model.predict(xtest)))

Logistic Regression
Classification Report on test data
              precision    recall  f1-score   support

    Negative       0.66      0.48      0.55       264
     Neutral       0.46      0.28      0.35       263
    Positive       0.88      0.97      0.92      1822

    accuracy                           0.84      2349
   macro avg       0.67      0.57      0.61      2349
weighted avg       0.81      0.84      0.82      2349



In [40]:
print("Logistic Regression")
print("Classification Report on train data")
print(classification_report(ytrain, logistic_model.predict(xtrain)))

Logistic Regression
Classification Report on train data
              precision    recall  f1-score   support

    Negative       0.79      0.61      0.69      2143
     Neutral       0.70      0.41      0.52      2608
    Positive       0.90      0.98      0.94     16386

    accuracy                           0.87     21137
   macro avg       0.80      0.67      0.71     21137
weighted avg       0.86      0.87      0.86     21137



In [41]:
'''lr_cm=confusion_matrix(ytest.values, lr.predict(xtest))

plt.figure(figsize=(5,5))
plt.suptitle("Confusion Matrix",fontsize=24)

sns.heatmap(lr_cm, annot = True, cmap="viridis",cbar=False);
plt.xlabel('Predicted Value')
plt.ylabel('Actual Value')'''


In [58]:
#SVM
from sklearn.svm import SVC  # SVC was used without being imported
start=d.datetime.now()
svm = SVC()
svm.fit(xtrain, ytrain)
pred = svm.predict(xtest)
print('Elapsed time: ',str(d.datetime.now()-start))
print('Accuracy score:',accuracy_score(ytest, pred))

Elapsed time:  0:03:31.151034
Accuracy score: 0.8343976160068114


In [43]:
#KNN
start=d.datetime.now()
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(xtrain, ytrain)
pred = neigh.predict(xtest)
print('Elapsed time: ',str(d.datetime.now()-start))
print('Accuracy score:',accuracy_score(ytest, pred))

Elapsed time:  0:00:10.052439
Accuracy score: 0.7679863771817795


In [105]:
#Naive bayes
from sklearn.naive_bayes import MultinomialNB
nb= MultinomialNB()
start=d.datetime.now()
nb.fit(xtrain, ytrain)
pred = nb.predict(xtest)
print('Elapsed time: ',str(d.datetime.now()-start))
print('Accuracy score:',accuracy_score(ytest, pred))

Elapsed time:  0:00:00.056427
Accuracy score: 0.7756492124308216


In [55]:
# decision tree
start=d.datetime.now()
dt = DecisionTreeClassifier()
dt.fit(xtrain, ytrain)
pred = dt.predict(xtest)
print('Elapsed time: ',str(d.datetime.now()-start))
print('Accuracy score:',accuracy_score(ytest, pred))

Elapsed time:  0:00:17.204069
Accuracy score: 0.7458492975734355
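
Each model cell above repeats the same start/stop timing code. One way to factor that out is a small context manager; this is only a sketch, the `timed` helper is a hypothetical name, and the workload inside the `with` block is a stand-in for a `fit`/`predict` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    # Record wall-clock seconds for the enclosed block under `label`
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Stand-in workload; in the notebook this would wrap model.fit / model.predict
with timed("sum of squares", timings):
    total = sum(i * i for i in range(100_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```

Collecting the timings in one dictionary also makes it easy to compare models side by side once all of them have run.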


# Preprocessing

In [59]:
# cleaning the text 

import re
def clean(text):
  text = re.sub('[^A-Za-z]+', ' ', text)
  return text

df['Cleaned Reviews'] = df['Review Text'].apply(clean)
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Sentiment,Cleaned Reviews
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates,Positive,Absolutely wonderful silky and sexy and comfor...
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses,Positive,Love this dress it s sooo pretty i happened to...
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,Neutral,I had such high hopes for this dress and reall...
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants,Positive,I love love love this jumpsuit it s fun flirty...
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses,Positive,This shirt is very flattering to all due to th...
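
It helps to see exactly what the `clean` function above keeps and discards. The sketch below runs the same regular expression on a made-up sentence: every run of non-letter characters becomes a single space, so punctuation and digits disappear and contractions such as "it's" are split in two.

```python
import re


def clean(text):
    # Replace each run of non-letter characters with one space
    return re.sub('[^A-Za-z]+', ' ', text)


# Hypothetical review text, not taken from the dataset
sample = "Love this dress!  it's sooo pretty, size 10."
cleaned = clean(sample)
print(repr(cleaned))
```

Note that sizes and other numbers are removed entirely, and a trailing space is left where the final punctuation was; neither affects the vectorizers used here, but both matter if the cleaned text is reused elsewhere.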


In [60]:
'''import nltk
nltk.download('punkt')'''

In [61]:
# tokenization

import nltk
from nltk.tokenize import word_tokenize

def Word_tokenize(text):
  return word_tokenize(text)

df['token Reviews'] = df['Cleaned Reviews'].apply(Word_tokenize)

df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Sentiment,Cleaned Reviews,token Reviews
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates,Positive,Absolutely wonderful silky and sexy and comfor...,"[Absolutely, wonderful, silky, and, sexy, and,..."
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses,Positive,Love this dress it s sooo pretty i happened to...,"[Love, this, dress, it, s, sooo, pretty, i, ha..."
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,Neutral,I had such high hopes for this dress and reall...,"[I, had, such, high, hopes, for, this, dress, ..."
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants,Positive,I love love love this jumpsuit it s fun flirty...,"[I, love, love, love, this, jumpsuit, it, s, f..."
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses,Positive,This shirt is very flattering to all due to th...,"[This, shirt, is, very, flattering, to, all, d..."


In [62]:
'''import nltk
nltk.download('averaged_perceptron_tagger')'''

In [63]:
nltk.download('stopwords')
from nltk.corpus import stopwords
nltk.download('wordnet')
from nltk.corpus import wordnet
from nltk import pos_tag


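pos_dict = {'J':wordnet.ADJ, 'V':wordnet.VERB, 'N':wordnet.NOUN, 'R':wordnet.ADV}
# Build the stop-word set once instead of rebuilding it for every token
stop_words = set(stopwords.words('english'))

def token_stop_pos(text):
    tags = pos_tag(word_tokenize(text))
    newlist = []
    for word, tag in tags:
        if word.lower() not in stop_words:
            # Map the tag's first letter to a WordNet POS (None if unmapped)
            newlist.append((word, pos_dict.get(tag[0])))
    return newlist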

df['POS tagged'] = df['Cleaned Reviews'].apply(token_stop_pos)
df.head()

[nltk_data] Error loading stopwords: <urlopen error [WinError 10060] A
[nltk_data]     connection attempt failed because the connected party
[nltk_data]     did not properly respond after a period of time, or
[nltk_data]     established connection failed because connected host
[nltk_data]     has failed to respond>
[nltk_data] Error loading wordnet: <urlopen error [WinError 10060] A
[nltk_data]     connection attempt failed because the connected party
[nltk_data]     did not properly respond after a period of time, or
[nltk_data]     established connection failed because connected host
[nltk_data]     has failed to respond>


Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Sentiment,Cleaned Reviews,token Reviews,POS tagged
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates,Positive,Absolutely wonderful silky and sexy and comfor...,"[Absolutely, wonderful, silky, and, sexy, and,...","[(Absolutely, r), (wonderful, a), (silky, n), ..."
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses,Positive,Love this dress it s sooo pretty i happened to...,"[Love, this, dress, it, s, sooo, pretty, i, ha...","[(Love, v), (dress, n), (sooo, a), (pretty, r)..."
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,Neutral,I had such high hopes for this dress and reall...,"[I, had, such, high, hopes, for, this, dress, ...","[(high, a), (hopes, n), (dress, n), (really, r..."
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants,Positive,I love love love this jumpsuit it s fun flirty...,"[I, love, love, love, this, jumpsuit, it, s, f...","[(love, v), (love, r), (love, v), (jumpsuit, n..."
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses,Positive,This shirt is very flattering to all due to th...,"[This, shirt, is, very, flattering, to, all, d...","[(shirt, n), (flattering, a), (due, a), (adjus..."
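
The `pos_dict.get(tag[0])` lookup above relies on the first letter of each Penn Treebank tag. The sketch below shows that mapping in isolation, without NLTK: the single-letter values are the ones WordNet uses for adjective, verb, noun and adverb (they match the `a`, `v`, `n`, `r` codes in the output above), and `to_wordnet_pos` is a hypothetical helper name.

```python
# First letter of a Penn Treebank tag -> WordNet part-of-speech code
POS_MAP = {"J": "a", "V": "v", "N": "n", "R": "r"}


def to_wordnet_pos(penn_tag):
    # Tags outside the four groups (e.g. determiners) map to None
    return POS_MAP.get(penn_tag[0])


tags = ["JJ", "VBD", "NNS", "RB", "DT"]
converted = {tag: to_wordnet_pos(tag) for tag in tags}
print(converted)
```

Words whose tag maps to `None` are kept as they are by the lemmatization step, since there is no WordNet category to lemmatize them under.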


In [64]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
def lemmatize(pos_data):
    lemma_rew = " "
    for word, pos in pos_data:
      if not pos:
        lemma = word
        lemma_rew = lemma_rew + " " + lemma
      else:
        lemma = wordnet_lemmatizer.lemmatize(word, pos=pos)
        lemma_rew = lemma_rew + " " + lemma
    return lemma_rew

df['Lemma'] = df['POS tagged'].apply(lemmatize)
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Sentiment,Cleaned Reviews,token Reviews,POS tagged,Lemma
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates,Positive,Absolutely wonderful silky and sexy and comfor...,"[Absolutely, wonderful, silky, and, sexy, and,...","[(Absolutely, r), (wonderful, a), (silky, n), ...",Absolutely wonderful silky sexy comfortable
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses,Positive,Love this dress it s sooo pretty i happened to...,"[Love, this, dress, it, s, sooo, pretty, i, ha...","[(Love, v), (dress, n), (sooo, a), (pretty, r)...",Love dress sooo pretty happen find store gla...
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,Neutral,I had such high hopes for this dress and reall...,"[I, had, such, high, hopes, for, this, dress, ...","[(high, a), (hopes, n), (dress, n), (really, r...",high hope dress really want work initially o...
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants,Positive,I love love love this jumpsuit it s fun flirty...,"[I, love, love, love, this, jumpsuit, it, s, f...","[(love, v), (love, r), (love, v), (jumpsuit, n...",love love love jumpsuit fun flirty fabulous ...
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses,Positive,This shirt is very flattering to all due to th...,"[This, shirt, is, very, flattering, to, all, d...","[(shirt, n), (flattering, a), (due, a), (adjus...",shirt flattering due adjustable front tie pe...


# Count Vectorizer Post Preprocessing

In [65]:
vectorizer = CountVectorizer()

train_data,test_data = train_test_split(df,test_size=0.1)

Xtrain = vectorizer.fit_transform(train_data["Lemma"])

Ytrain = train_data['Sentiment']

Xtest = vectorizer.transform(test_data["Lemma"])

Ytest = test_data['Sentiment']

In [67]:
logistic_model = LogisticRegression()

# Train the model
logistic_model.fit(Xtrain, Ytrain)

print("Logistic Regression")
print("Classification Report on test data")
print(classification_report(Ytest, logistic_model.predict(Xtest)))

Logistic Regression
Classification Report on test data
              precision    recall  f1-score   support

    Negative       0.53      0.50      0.51       249
     Neutral       0.42      0.30      0.35       301
    Positive       0.89      0.94      0.92      1799

    accuracy                           0.81      2349
   macro avg       0.61      0.58      0.59      2349
weighted avg       0.79      0.81      0.80      2349



In [68]:
print("Logistic Regression")
print("Classification Report on train data")
print(classification_report(Ytrain, logistic_model.predict(Xtrain)))

Logistic Regression
Classification Report on train data
              precision    recall  f1-score   support

    Negative       0.88      0.78      0.82      2158
     Neutral       0.83      0.65      0.73      2570
    Positive       0.94      0.98      0.96     16409

    accuracy                           0.92     21137
   macro avg       0.88      0.80      0.84     21137
weighted avg       0.92      0.92      0.92     21137



# TF-IDF Vectorizer Post Preprocessing

In [69]:
vectorizer = TfidfVectorizer()

train_data,test_data = train_test_split(df,test_size=0.1)

X_train = vectorizer.fit_transform(train_data["Lemma"])

Y_train = train_data['Sentiment']

X_test = vectorizer.transform(test_data["Lemma"])

Y_test = test_data['Sentiment']

In [99]:
# AFTER PREPROCESSING

# Logistic Regression Model
logistic_model = LogisticRegression()

start= d.datetime.now()

# Train the model
logistic_model.fit(X_train, Y_train)

# Predict on the test and train data
Y_test_pred = logistic_model.predict(X_test)
Y_train_pred = logistic_model.predict(X_train)

# TIME TAKEN
print('Elapsed time: ',str(d.datetime.now()-start))

# Calculate accuracy
accuracy = accuracy_score(Y_test, Y_test_pred)
print("Testing Accuracy:", accuracy)
print("Training Accuracy:",accuracy_score(Y_train, Y_train_pred))

Elapsed time:  0:00:01.050464
Testing Accuracy: 0.8424861643252448
Training Accuracy: 0.8634621753323556


In [74]:
print("Logistic Regression")
print("Classification Report on test data")
print(classification_report(Y_test, logistic_model.predict(X_test)))

Logistic Regression
Classification Report on test data
              precision    recall  f1-score   support

    Negative       0.67      0.46      0.54       236
     Neutral       0.45      0.23      0.30       262
    Positive       0.88      0.98      0.93      1851

    accuracy                           0.84      2349
   macro avg       0.67      0.56      0.59      2349
weighted avg       0.81      0.84      0.82      2349
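
The macro and weighted averages in the report above differ because the classes are imbalanced. As a rough check, the sketch below recomputes both precision averages from the rounded per-class values printed in the report. Because the inputs are already rounded to two decimals, the results are approximations that happen to agree with the report here.

```python
# Per-class precision and support copied from the test report above
precision = {"Negative": 0.67, "Neutral": 0.45, "Positive": 0.88}
support = {"Negative": 236, "Neutral": 262, "Positive": 1851}

total = sum(support.values())

# Macro average: every class counts equally
macro_precision = sum(precision.values()) / len(precision)

# Weighted average: every class counts in proportion to its support
weighted_precision = sum(precision[c] * support[c] for c in precision) / total

print(f"macro avg precision:    {macro_precision:.2f}")
print(f"weighted avg precision: {weighted_precision:.2f}")
```

The weighted figure is pulled towards the Positive class, which makes up 1851 of the 2349 test rows, so the macro average is the more telling number for the Negative and Neutral classes.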



In [None]:
# Models still to compare on the TF-IDF features:
# KNeighbors
# SVM
# Naive Bayes
# Decision Tree

In [100]:
# KNeighbors Classifier Model
KN_model = KNeighborsClassifier()

start= d.datetime.now()

# Train the model
KN_model.fit(X_train, Y_train)

# Predict on the test and train data
Y_test_pred = KN_model.predict(X_test)
Y_train_pred = KN_model.predict(X_train)

# TIME TAKEN
print('Elapsed time: ',str(d.datetime.now()-start))

# Calculate accuracy
accuracy = accuracy_score(Y_test, Y_test_pred)
print("Testing Accuracy:", accuracy)
print("Training Accuracy:",accuracy_score(Y_train, Y_train_pred))

Elapsed time:  0:00:40.729139
Testing Accuracy: 0.7969348659003831
Training Accuracy: 0.839523111132138


In [101]:
# SVM Classifier Model
SV_model = SVC()

start= d.datetime.now()

# Train the model
SV_model.fit(X_train, Y_train)

# Predict on the test and train data
Y_test_pred = SV_model.predict(X_test)
Y_train_pred = SV_model.predict(X_train)

# TIME TAKEN
print('Elapsed time: ',str(d.datetime.now()-start))

# Calculate accuracy
accuracy = accuracy_score(Y_test, Y_test_pred)
print("Testing Accuracy:", accuracy)
print("Training Accuracy:",accuracy_score(Y_train, Y_train_pred))

Elapsed time:  0:03:29.807062
Testing Accuracy: 0.8441890166028098
Training Accuracy: 0.9539196669347589


In [102]:
# Multinomial Naive Bayes Classifier Model
NB_model = MultinomialNB()

start= d.datetime.now()

# Train the model
NB_model.fit(X_train, Y_train)

# Predict on the test and train data
Y_test_pred = NB_model.predict(X_test)
Y_train_pred = NB_model.predict(X_train)

# TIME TAKEN
print('Elapsed time: ',str(d.datetime.now()-start))

# Calculate accuracy
accuracy = accuracy_score(Y_test, Y_test_pred)
print("Testing Accuracy:", accuracy)
print("Training Accuracy:",accuracy_score(Y_train, Y_train_pred))

Elapsed time:  0:00:00.049795
Testing Accuracy: 0.7905491698595147
Training Accuracy: 0.7782088281213039


In [103]:
# Decision Tree Classifier Model
DT_model = DecisionTreeClassifier()

start= d.datetime.now()

# Train the model
DT_model.fit(X_train, Y_train)

# Predict on the test and train data
Y_test_pred = DT_model.predict(X_test)
Y_train_pred = DT_model.predict(X_train)

# TIME TAKEN
print('Elapsed time: ',str(d.datetime.now()-start))

# Calculate accuracy
accuracy = accuracy_score(Y_test, Y_test_pred)
print("Testing Accuracy:", accuracy)
print("Training Accuracy:",accuracy_score(Y_train, Y_train_pred))

Elapsed time:  0:00:10.587380
Testing Accuracy: 0.7445721583652618
Training Accuracy: 0.9974925486114397
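
The five accuracy pairs printed above can be put side by side to see which models overfit and how each compares with always predicting the majority class. The sketch below uses the printed accuracies (rounded to four decimals) and assumes all five models were scored on the 2349-row test split from the classification report, of which 1851 rows are Positive.

```python
# Accuracies copied from the outputs above, rounded to four decimals
train_acc = {
    "Logistic Regression": 0.8635,
    "KNeighbors": 0.8395,
    "SVM": 0.9539,
    "Naive Bayes": 0.7782,
    "Decision Tree": 0.9975,
}
test_acc = {
    "Logistic Regression": 0.8425,
    "KNeighbors": 0.7969,
    "SVM": 0.8442,
    "Naive Bayes": 0.7905,
    "Decision Tree": 0.7446,
}

# Accuracy of always predicting "Positive" on the test split
baseline = 1851 / 2349

# Train/test gap per model; a large positive gap suggests overfitting
gaps = {name: train_acc[name] - test_acc[name] for name in test_acc}

print(f"majority-class baseline: {baseline:.4f}")
for name in sorted(gaps, key=gaps.get, reverse=True):
    verdict = "above" if test_acc[name] > baseline else "below"
    print(f"{name:<20} gap={gaps[name]:+.4f}  test={test_acc[name]:.4f} ({verdict} baseline)")
```

Read this way, the Decision Tree has the largest train/test gap and falls below the majority-class baseline on the test split, while Logistic Regression and SVM reach the highest test accuracy.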


In [92]:
# Classification Reports
print("Logistic Regression")
print(classification_report(Y_test, logistic_model.predict(X_test)))
print("\n KNeighbours")
print(classification_report(Y_test, KN_model.predict(X_test)))
print("\n Support Vector Machine (SVM)")
print(classification_report(Y_test, SV_model.predict(X_test)))
print("\n Naive Bayes")
print(classification_report(Y_test, NB_model.predict(X_test)))
print("\n Decision Tree")
print(classification_report(Y_test, DT_model.predict(X_test)))

Logistic Regression
              precision    recall  f1-score   support

    Negative       0.67      0.46      0.54       236
     Neutral       0.45      0.23      0.30       262
    Positive       0.88      0.98      0.93      1851

    accuracy                           0.84      2349
   macro avg       0.67      0.56      0.59      2349
weighted avg       0.81      0.84      0.82      2349


 KNeighbours
              precision    recall  f1-score   support

    Negative       0.49      0.32      0.39       236
     Neutral       0.28      0.13      0.17       262
    Positive       0.85      0.95      0.90      1851

    accuracy                           0.80      2349
   macro avg       0.54      0.47      0.49      2349
weighted avg       0.75      0.80      0.77      2349


 Support Vector Machine (SVM)
              precision    recall  f1-score   support

    Negative       0.71      0.44      0.54       236
     Neutral       0.50      0.20      0.28       262
    Positi

###########################################################################################################################