**Goal of the project**

One fundamental problem in sentiment analysis is categorization of sentiment polarity. Given a piece of written text, the problem is to categorize the text into one specific sentiment polarity, positive or negative (or neutral).

In this project, we aim to tackle the problem of sentiment polarity categorization.

**Load the packages**

In [97]:
# Importing libraries
import pandas as pd
import numpy as np

In [98]:
pd.options.mode.chained_assignment = None

**Load the data**

The data used in this project is [a set of product reviews](https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews) collected from an e-commerce business.

In [99]:
# Load dataset
df = pd.read_csv('../input/womens-ecommerce-clothing-reviews/Womens Clothing E-Commerce Reviews.csv').iloc[: , 2:]

In [100]:
# Rename Pandas columns to lower case
df.columns = df.columns.str.lower()

In [101]:
# Examine the data
df.head()

Unnamed: 0,age,title,review text,rating,recommended ind,positive feedback count,division name,department name,class name
0,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [102]:
# Overview of all variables, their datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   age                      23486 non-null  int64 
 1   title                    19676 non-null  object
 2   review text              22641 non-null  object
 3   rating                   23486 non-null  int64 
 4   recommended ind          23486 non-null  int64 
 5   positive feedback count  23486 non-null  int64 
 6   division name            23472 non-null  object
 7   department name          23472 non-null  object
 8   class name               23472 non-null  object
dtypes: int64(4), object(5)
memory usage: 1.6+ MB


**Identify review sentiment**

Because the dataset includes star ratings, there is arguably little need to engineer any additional sentiment-related features: the rating itself tells us whether a review was positive, negative or neutral. Ratings below 3 are labelled Negative, a rating of 3 is Neutral, and ratings of 4 or above are Positive.

In [103]:
star_rating = {'poor_rating': 3, 'great_rating': 4}

df['customer sentiment'] = 'Neutral'
# Assign through df.loc with a row mask and a column label to avoid chained assignment
df.loc[df['rating'] < star_rating['poor_rating'], 'customer sentiment'] = 'Negative'
df.loc[df['rating'] >= star_rating['great_rating'], 'customer sentiment'] = 'Positive'
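
The same thresholds can also be expressed as bins. The following is a minimal sketch on a toy Series, assuming ratings run from 1 to 5; `pd.cut` is an alternative to the mask-based assignment above, not the approach used in this notebook.

```python
import pandas as pd

# Toy ratings covering the assumed 1-5 scale
ratings = pd.Series([1, 2, 3, 4, 5], name='rating')

# Bins are right-inclusive: (0, 2] -> Negative, (2, 3] -> Neutral, (3, 5] -> Positive
sentiment = pd.cut(ratings, bins=[0, 2, 3, 5],
                   labels=['Negative', 'Neutral', 'Positive'])

print(sentiment.tolist())
# ['Negative', 'Negative', 'Neutral', 'Positive', 'Positive']
```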

**Define the target variable**

In [104]:
# Replace target variables
df['customer sentiment'] = df['customer sentiment'].replace(['Negative', 'Neutral', 'Positive'], [-1, 0, 1])

**Check for missing values**

In [105]:
df.isnull().sum()

age                           0
title                      3810
review text                 845
rating                        0
recommended ind               0
positive feedback count       0
division name                14
department name              14
class name                   14
customer sentiment            0
dtype: int64

Any NaN values for the title and review text fields are filled with blank strings.

In [106]:
df['title'] = df['title'].fillna('')

In [107]:
df['review text'] = df['review text'].fillna('')

**Concatenate the text into a single column**

In [108]:
df['all text'] = df['title'] + ' ' + df['review text']
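
Filling the missing values before concatenating matters because a missing value on either side makes the combined value missing too. A small sketch on a made-up two-row frame (the rows are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing title
toy = pd.DataFrame({'title': ['Great fit', np.nan],
                    'review text': ['Runs true to size', 'Too tight']})

# Without filling, the second row's review text is lost
unfilled = toy['title'] + ' ' + toy['review text']

# After filling, the review text survives (with a leading space)
filled = toy['title'].fillna('') + ' ' + toy['review text'].fillna('')

print(unfilled.isna().tolist())  # [False, True]
print(filled.tolist())           # ['Great fit Runs true to size', ' Too tight']
```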

In [109]:
# Drop unnecessary columns
df.drop(['title', 'review text'], axis = 1, inplace = True)

**Text Preprocessing**

We will need to preprocess our text to remove misleading junk and noise in order to get the best results from our model.

In [110]:
import unicodedata
import re
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

In [111]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [112]:
stop_words = set(stopwords.words('english'))

In [113]:
def text_cleaning(text_data):

  # Make sure we are working with a string before calling string methods
  text_data = str(text_data)

  # Remove accented characters
  text_data = unicodedata.normalize('NFKD', text_data).encode('ascii', 'ignore').decode('utf-8', 'ignore')

  # Case conversion
  text_data = text_data.lower()

  # Replace anything that is not a letter (digits, punctuation) with a space
  text_data = re.sub(r"[^a-z]+", ' ', text_data)

  # Tokenization
  tokenizer = ToktokTokenizer()
  text_data = tokenizer.tokenize(text_data)

  # Removing stopwords
  text_data = [item for item in text_data if item not in stop_words]

  # Convert list of tokens back to a single string
  text_data = ' '.join(text_data)

  return text_data

In [114]:
df['all text'] = df['all text'].apply(text_cleaning)
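
To see what these steps do to a single review, here is a dependency-free sketch. It assumes a tiny hand-written stop-word set (`TOY_STOP_WORDS`, hypothetical) in place of NLTK's English list and a plain whitespace split in place of ToktokTokenizer, so its output will not match `text_cleaning` exactly.

```python
import re
import unicodedata

# Hypothetical mini stop-word list; the notebook uses NLTK's English list instead
TOY_STOP_WORDS = {'the', 'is', 'and', 'a', 'it', 'this'}

def clean_text_sketch(text):
    # Strip accents by decomposing characters and dropping the non-ASCII parts
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    # Lowercase, then replace runs of non-letters with a single space
    text = re.sub(r'[^a-z]+', ' ', text.lower())
    # Whitespace split stands in for ToktokTokenizer here
    tokens = [tok for tok in text.split() if tok not in TOY_STOP_WORDS]
    return ' '.join(tokens)

print(clean_text_sketch("This café dress is SO pretty, and it fits!!"))
# cafe dress so pretty fits
```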

**Define X and y**

Now that we’ve prepared our data, we need to create one dataset to train the model and another to hold back for testing. The aim of the model is to predict our target variable, customer sentiment, from the set of features X. The first step is therefore to define which columns go into X and y.

In [115]:
X = df.drop('customer sentiment', axis = 1)

In [116]:
y = df['customer sentiment']

**Split the train and test data**

As usual, we’ll be splitting our data into train and test subsets while ensuring that the resulting split is stratified.

In [117]:
from sklearn.model_selection import train_test_split

In [118]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
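
What `stratify` buys us is easiest to see on a small imbalanced example. The sketch below uses synthetic labels (10% / 10% / 80%, chosen for illustration) and checks that both subsets keep those proportions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 10 x -1, 10 x 0, 80 x 1
y_toy = pd.Series([-1] * 10 + [0] * 10 + [1] * 80)
X_toy = pd.DataFrame({'feature': range(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

# Class counts per subset, ordered -1, 0, 1
print(y_tr.value_counts().sort_index().tolist())  # [8, 8, 64]
print(y_te.value_counts().sort_index().tolist())  # [2, 2, 16]
```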

**Create a model pipeline**

1. The first step in this pipeline is to use a SimpleImputer to fill in missing values (np.nan) in the categorical columns with the constant "missing". Other strategies exist, but there may be underlying reasons in the data collection why an observation has missing data, so filling the gaps with the most_frequent value would introduce bias on our part. Without knowing more about why these values are missing, we simply treat "missing" as its own category.

2. We then pipe this into a OneHotEncoder in order to encode each categorical variable's values as a separate binary column.

3. Next, we use a RobustScaler to scale our non-textual data (it centers each feature on its median and scales by the interquartile range).

4. Finally, we need to convert our text to a numeric form. Machine learning models can’t use text, so the final step is to use a text preprocessing technique called Count Vectorization to turn the text into a vector of numbers via the CountVectorizer module in scikit-learn.

In [119]:
import time
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import RobustScaler

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

In [120]:
def get_pipeline(X, model):
  
    categorical_columns = list(X.select_dtypes(include = ['object']).columns.values.tolist())
    categorical_columns.remove('all text')
    
    numeric_columns = list(X.select_dtypes(exclude = ['object']).columns.values.tolist()) 
    
    categorical_transformer = Pipeline(steps = [('simple_imputer', SimpleImputer(missing_values = np.nan, fill_value = 'missing', strategy = 'constant')),
                                                ('one_hot_encoder', OneHotEncoder(sparse_output = False, handle_unknown = 'ignore')),  # use sparse = False on scikit-learn < 1.2
                                                ('scaler', RobustScaler())])    
     
    preprocessor = ColumnTransformer(transformers = [('text', CountVectorizer(), 'all text'),
                                                     ('categorical', categorical_transformer, categorical_columns),
                                                     ('numeric', RobustScaler(), numeric_columns)], remainder = 'passthrough')

    bundled_pipeline = Pipeline(steps = [('preprocessor', preprocessor),
                                         ('model', model)])
    
    return bundled_pipeline

**Apply model selection**

To undertake the model selection step, we first need to create a dictionary that maps the name of each model we want to test to an instance of the model class, e.g. XGBClassifier(random_state = 42), wrapped in a OneVsRestClassifier.

Next we’ll create a Pandas dataframe in which to store the results. Then we’ll loop over the models: for each one we calculate the mean macro F1 score from 5 rounds of cross-validation on the training data, then fit the model using the X_train and y_train data, generate predictions from X_test and calculate the macro F1 score on that held-out set. That will give us the average cross-validated F1 score for the training data set, plus the F1 score for the X_test data.
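
For reference, the macro F1 score is the unweighted mean of the per-class F1 scores, so each class counts equally regardless of its size. A small worked example with made-up labels:

```python
from sklearn.metrics import f1_score

# Made-up true and predicted labels for the three classes
y_true = [1, 1, 1, 1, 0, 0, -1, -1]
y_hat = [1, 1, 1, 0, 0, 1, -1, -1]

# Per-class F1, in the order -1, 0, 1
per_class = f1_score(y_true, y_hat, average=None, labels=[-1, 0, 1])
print(per_class.tolist())  # [1.0, 0.5, 0.75]

# Macro F1 is the plain average of those three values
macro = f1_score(y_true, y_hat, average='macro')
print(macro)  # 0.75
```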

In [121]:
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

In [122]:
def select_model(X, y, pipeline = None):

  classifiers = {}
  classifiers.update({'XGBClassifier': OneVsRestClassifier(XGBClassifier(random_state = 42))})
  classifiers.update({'LGBMClassifier': OneVsRestClassifier(LGBMClassifier(random_state = 42))})
  classifiers.update({'DecisionTreeClassifier': OneVsRestClassifier(DecisionTreeClassifier(random_state = 42))})
  classifiers.update({'RandomForestClassifier': OneVsRestClassifier(RandomForestClassifier(random_state = 42))})
  classifiers.update({'ExtraTreesClassifier': OneVsRestClassifier(ExtraTreesClassifier(random_state = 42))})
  classifiers.update({'GradientBoostingClassifier': OneVsRestClassifier(GradientBoostingClassifier(random_state = 42))})    
  classifiers.update({'BaggingClassifier': OneVsRestClassifier(BaggingClassifier(random_state = 42))})
  classifiers.update({'AdaBoostClassifier': OneVsRestClassifier(AdaBoostClassifier(random_state = 42))})
  

  df_models = pd.DataFrame(columns = ['model', 'run_time', 'f1_score_cv', 'f1_score'])

  for key in classifiers:

      print('*', key)

      start_time = time.time()
      
      pipeline = get_pipeline(X, classifiers[key])

      cv = cross_val_score(pipeline, X, y, cv = 5, scoring = 'f1_macro', n_jobs = -1)
      
      # Fit on the data passed in (the training split); X_test and y_test are the globals defined above
      pipeline.fit(X, y)
      y_pred = pipeline.predict(X_test)
    
      row = {'model': key,
             'run_time': format(round((time.time() - start_time) / 60, 2)),
             'f1_score_cv': cv.mean(),
             'f1_score': f1_score(y_test, y_pred, average = 'macro')}
      
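      # DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
      df_models = pd.concat([df_models, pd.DataFrame([row])], ignore_index = True)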

  df_models = df_models.sort_values(by = 'f1_score', ascending = False)
      
  return df_models

In [123]:
models = select_model(X_train, y_train)

* XGBClassifier
* LGBMClassifier
* DecisionTreeClassifier
* RandomForestClassifier
* ExtraTreesClassifier
* GradientBoostingClassifier
* BaggingClassifier
* AdaBoostClassifier


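After 5-6 minutes, the model selection process had completed. Six of the eight models, including XGBClassifier, achieved a macro F1 score of 1.0 on both the cross-validation folds and the test set. Scores this perfect should be read with caution: the target was derived directly from rating, and rating is still one of the columns in X, so the models can recover the label from that column alone rather than from the review text.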

In [124]:
models.head(10)

Unnamed: 0,model,run_time,f1_score_cv,f1_score
0,XGBClassifier,0.42,1.0,1.0
1,LGBMClassifier,0.24,1.0,1.0
2,DecisionTreeClassifier,0.05,1.0,1.0
5,GradientBoostingClassifier,0.66,1.0,1.0
6,BaggingClassifier,0.12,1.0,1.0
7,AdaBoostClassifier,0.17,1.0,1.0
3,RandomForestClassifier,1.41,0.966691,0.951954
4,ExtraTreesClassifier,2.27,0.872257,0.86324
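
A perfect score across several tree-based models is worth a sanity check, because customer sentiment was derived from rating and rating is still one of the columns in X. The sketch below reproduces that situation on synthetic data (random ratings and ages, not the review dataset) and compares a decision tree trained with and without the rating column.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: random ratings (1-5) and ages, unrelated to each other
rng = np.random.default_rng(42)
toy = pd.DataFrame({'rating': rng.integers(1, 6, size=500),
                    'age': rng.integers(18, 80, size=500)})

# Same rule as the notebook: < 3 -> -1, 3 -> 0, >= 4 -> 1
toy['customer sentiment'] = np.select(
    [toy['rating'] < 3, toy['rating'] >= 4], [-1, 1], default=0)

train, test = toy.iloc[:400], toy.iloc[400:]
target = 'customer sentiment'

# With the rating column, the tree only has to learn two thresholds
with_rating = DecisionTreeClassifier(random_state=42)
with_rating.fit(train[['rating', 'age']], train[target])
acc_with = with_rating.score(test[['rating', 'age']], test[target])

# Without it, the remaining feature carries no information about the label
without_rating = DecisionTreeClassifier(random_state=42)
without_rating.fit(train[['age']], train[target])
acc_without = without_rating.score(test[['age']], test[target])

print(acc_with, acc_without)
```

On the real data, the equivalent check would be to drop rating (and possibly recommended ind) from X and rerun the model selection; that rerun is not part of this notebook.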


**Examine the performance of the best model**

In [125]:
bundled_pipeline = get_pipeline(X_train, OneVsRestClassifier(XGBClassifier(random_state = 42)))
bundled_pipeline.fit(X_train, y_train)
y_pred = bundled_pipeline.predict(X_test)

Since this is a multi-class classification problem, we can assess its performance using common classification metrics, such as the F1 score and classification report. 


In [126]:
from sklearn.metrics import classification_report

In [127]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

          -1       1.00      1.00      1.00       482
           0       1.00      1.00      1.00       574
           1       1.00      1.00      1.00      3642

    accuracy                           1.00      4698
   macro avg       1.00      1.00      1.00      4698
weighted avg       1.00      1.00      1.00      4698

