<a href="https://colab.research.google.com/github/KelvinLam05/sentiment_classification/blob/main/sentiment_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Goal of the project**

A fundamental problem in sentiment analysis is classifying sentiment polarity: given a piece of written text, assign it to exactly one polarity class, which here is positive, negative, or neutral.

In [1]:
# Importing libraries
import pandas as pd
import numpy as np

**Load the data**

Here, we're going to load real [customer review data](https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews) from a women's clothing ecommerce business.

In [2]:
# Load dataset
df = pd.read_csv('/content/womens_clothing_e-commerce_reviews.csv').iloc[: , 2:]

In [3]:
# Rename Pandas columns to lower case
df.columns = df.columns.str.lower()

In [4]:
# Examine the data
df.head()

Unnamed: 0,age,title,review text,rating,recommended ind,positive feedback count,division name,department name,class name
0,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [5]:
# Overview of all variables, their datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   age                      23486 non-null  int64 
 1   title                    19676 non-null  object
 2   review text              22641 non-null  object
 3   rating                   23486 non-null  int64 
 4   recommended ind          23486 non-null  int64 
 5   positive feedback count  23486 non-null  int64 
 6   division name            23472 non-null  object
 7   department name          23472 non-null  object
 8   class name               23472 non-null  object
dtypes: int64(4), object(5)
memory usage: 1.6+ MB


**Identify review sentiment**

Because every review comes with a star rating, there is little need to engineer additional sentiment-related features. All we need to know is whether a given review was positive, negative, or neutral, and we can read that from the star rating the customer gave: ratings below 3 are labelled negative, a rating of 3 is labelled neutral, and ratings of 4 or 5 are labelled positive.

In [None]:
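# Rating thresholds: below 3 is negative, 4 and above is positive, 3 stays neutral
star_rating = {'poor_rating': 3, 'great_rating': 4}

# Assign with df.loc[mask, column] to avoid chained-assignment warnings
df['sentiment_label'] = 'neutral'
df.loc[df['rating'] < star_rating['poor_rating'], 'sentiment_label'] = 'negative'
df.loc[df['rating'] >= star_rating['great_rating'], 'sentiment_label'] = 'positive'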

In [7]:
# Encode the target labels as integers: negative = 0, neutral = 1, positive = 2
df['sentiment_label'] = df['sentiment_label'].replace(['negative', 'neutral', 'positive'], [0, 1, 2])
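
The labelling and encoding steps can also be combined into one. The following is a minimal sketch on a toy DataFrame (hypothetical data, not the review dataset) that uses `np.select` with the same thresholds; the names `toy`, `conditions` and `codes` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy ratings covering every star value (hypothetical data)
toy = pd.DataFrame({'rating': [1, 2, 3, 4, 5]})

# Same thresholds as the notebook: below 3 is negative, 4 and up is positive
conditions = [toy['rating'] < 3, toy['rating'] >= 4]
codes = [0, 2]  # 0 = negative, 2 = positive

# Anything matching neither condition (a rating of 3) falls back to 1 = neutral
toy['sentiment_label'] = np.select(conditions, codes, default=1)

print(toy)
```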

**Examine the data**

As the `value_counts()` of the target column shows, this dataset is imbalanced: the positive class (2) far outnumbers the neutral (1) and negative (0) classes.

In [8]:
df['sentiment_label'].value_counts()

2    18208
1     2871
0     2407
Name: sentiment_label, dtype: int64
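
To put the imbalance in numbers, the counts above can be turned into class shares. This small sketch hard-codes the counts printed by `value_counts()`; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {2: 18208, 1: 2871, 0: 2407}

total = sum(counts.values())

# Share of each class, rounded to three decimals
shares = {label: round(n / total, 3) for label, n in counts.items()}

print(total)
print(shares)
```

The share of the largest class is also the accuracy of a model that always predicts "positive", which is why a class-aware metric such as macro F1 is more informative than accuracy here.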

**Remove NaN values**

In [9]:
df.isnull().sum()

age                           0
title                      3810
review text                 845
rating                        0
recommended ind               0
positive feedback count       0
division name                14
department name              14
class name                   14
sentiment_label               0
dtype: int64

Any NaN values for the title and review text fields are filled with blank strings.

In [10]:
df['title'] = df['title'].fillna('')

In [11]:
df['review text'] = df['review text'].fillna('')

**Concatenate the text into a single column**

The next step is to merge the individual text columns into a single column.

In [12]:
df['all text'] = df['title'] + ' ' + df['review text']

In [14]:
# Drop unnecessary columns
df.drop(['title', 'review text'], axis = 1, inplace = True)
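
The reason for filling NaN values before concatenating is that adding a string to NaN yields NaN, which would wipe out the review text whenever the title is missing. A minimal sketch on toy data (hypothetical rows; the `.str.strip()` call is an optional extra not used in the notebook):

```python
import pandas as pd

# Toy rows: the second review has no title (hypothetical data)
toy = pd.DataFrame({'title': ['Great fit', None],
                    'review text': ['Runs small', 'Lovely fabric']})

# Without fillna, the second row concatenates to NaN
unfilled = toy['title'] + ' ' + toy['review text']

# With fillna, the review text survives; strip() removes the leading space
filled = (toy['title'].fillna('') + ' ' + toy['review text'].fillna('')).str.strip()

print(unfilled.isna().tolist())
print(filled.tolist())
```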

**Text Preprocessing**

We will need to preprocess our text to remove misleading junk and noise in order to get the best results from our model.

In [15]:
import re
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

In [16]:
import unicodedata

In [17]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [18]:
stop_words = set(stopwords.words('english'))

We will now set up our cleaning function.

In [19]:
def text_cleaning(text_data):

  # Remove accented characters
  text_data = unicodedata.normalize('NFKD', text_data).encode('ascii', 'ignore').decode('utf-8', 'ignore')

  # Case conversion
  text_data = text_data.lower()

  # Remove special characters
  text_data = re.sub(r"[^a-zA-Z]+", ' ', text_data)

  # Text as string objects
  text_data = str(text_data)

  # Tokenization
  tokenizer = ToktokTokenizer()
  text_data = tokenizer.tokenize(text_data)

  # Removing stopwords
  text_data = [item for item in text_data if item not in stop_words]
  
  # Convert list of tokens back to a single string
  text_data = ' '.join(text_data)

  return text_data

In [20]:
df['all text'] = df['all text'].apply(text_cleaning)
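
To see what the cleaning steps do to a single review, here is a self-contained sketch of the same sequence (accent removal, lower-casing, stripping non-letters, stop word removal). It is an approximation under stated assumptions: `DEMO_STOP_WORDS` is a tiny stand-in for NLTK's English stop word list, and `str.split()` replaces `ToktokTokenizer`, so results on real text may differ from `text_cleaning`.

```python
import re
import unicodedata

# Tiny stand-in stop word list (assumption; the notebook uses NLTK's English list)
DEMO_STOP_WORDS = {'this', 'is', 'a', 'it', 'and', 'the', 's'}

def demo_clean(text):
    # Decompose accented characters, then drop anything outside ASCII
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    # Lower-case, then replace every run of non-letters with a single space
    text = re.sub(r'[^a-z]+', ' ', text.lower())
    # Whitespace tokenization (stand-in for ToktokTokenizer)
    tokens = text.split()
    # Drop stop words and join the remaining tokens back into a string
    return ' '.join(token for token in tokens if token not in DEMO_STOP_WORDS)

cleaned = demo_clean("This café dress is a 10/10, it's SO pretty!")
print(cleaned)
```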

**Split the train and test data**

As usual, we’ll be splitting our data into train and test subsets while ensuring that the resulting split is stratified.

In [21]:
X = df.drop('sentiment_label', axis = 1)

In [22]:
y = df['sentiment_label']

In [23]:
from sklearn.model_selection import train_test_split

In [24]:
# Split imbalanced dataset into training and testing sets with stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
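
Stratification keeps the class proportions of `y` the same in the train and test subsets, which matters when some classes are rare. The following sketch illustrates this on a toy target (hypothetical data with an 80/10/10 class mix); the `_toy` names are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 10 negatives, 10 neutrals, 80 positives (hypothetical)
y_toy = pd.Series([0] * 10 + [1] * 10 + [2] * 80)
X_toy = pd.DataFrame({'feature': range(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

# Per-class counts in each subset, ordered by class label
train_counts = y_tr.value_counts().sort_index().tolist()
test_counts = y_te.value_counts().sort_index().tolist()

print(train_counts, test_counts)
```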

**Run model selection**

In [25]:
import time
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import cross_val_score

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import RobustScaler
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score

In [26]:
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier, XGBRFClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB

In [27]:
def get_pipeline(X, model):

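    # Numeric features are every non-object column
    numeric_columns = X.select_dtypes(exclude = ['object']).columns.tolist()

    # Categorical features are the object columns, minus the free-text column removed below
    categorical_columns = X.select_dtypes(include = ['object']).columns.tolist()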
    categorical_columns.remove('all text')
    
    categorical_transformer = Pipeline(steps = [('simple_imputer', SimpleImputer(missing_values = np.nan, fill_value = 'missing', strategy = 'constant')),
                                                ('one_hot_encoder', OneHotEncoder(drop = 'if_binary', sparse = False, handle_unknown = 'ignore')),
                                                ('scaler', RobustScaler())])    
     
    preprocessor = ColumnTransformer(transformers = [('text', TfidfVectorizer(), 'all text'),
                                                     ('numeric', RobustScaler(), numeric_columns), 
                                                     ('categorical', categorical_transformer, categorical_columns)], remainder = 'passthrough')

    bundled_pipeline = Pipeline(steps = [('preprocessor', preprocessor),
                                         ('model', model)])
    return bundled_pipeline
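
One detail of `get_pipeline` worth spelling out: `TfidfVectorizer` expects a one-dimensional sequence of strings, so the text column is passed to `ColumnTransformer` as a plain string (`'all text'`), while the scaler receives a list of column names. A minimal sketch on toy data (hypothetical rows and a hypothetical `age` feature only):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import RobustScaler

# Toy frame with one text column and one numeric column (hypothetical data)
toy = pd.DataFrame({'all text': ['love dress', 'poor fit', 'love fit'],
                    'age': [30, 45, 60]})

preprocessor = ColumnTransformer(transformers=[
    ('text', TfidfVectorizer(), 'all text'),   # string selector: 1-D input
    ('numeric', RobustScaler(), ['age']),      # list selector: 2-D input
])

features = preprocessor.fit_transform(toy)

# Four TF-IDF vocabulary terms (dress, fit, love, poor) plus one scaled numeric column
print(features.shape)
```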

In [28]:
def select_model(X, y, pipeline = None):

  classifiers = {}
  classifiers.update({'XGBClassifier': OneVsRestClassifier(XGBClassifier(random_state = 42))})
  classifiers.update({'XGBRFClassifier': OneVsRestClassifier(XGBRFClassifier(random_state = 42))})
  classifiers.update({'LGBMClassifier': OneVsRestClassifier(LGBMClassifier(random_state = 42))})
  classifiers.update({'RandomForestClassifier': OneVsRestClassifier(RandomForestClassifier(random_state = 42))})
  classifiers.update({'DecisionTreeClassifier': OneVsRestClassifier(DecisionTreeClassifier(random_state = 42))})
  classifiers.update({'ExtraTreesClassifier': OneVsRestClassifier(ExtraTreesClassifier(random_state = 42))})
  classifiers.update({'GradientBoostingClassifier': OneVsRestClassifier(GradientBoostingClassifier(random_state = 42))})    
  classifiers.update({'BaggingClassifier': OneVsRestClassifier(BaggingClassifier(random_state = 42))})
  classifiers.update({'AdaBoostClassifier': OneVsRestClassifier(AdaBoostClassifier(random_state = 42))})
  classifiers.update({'SGDClassifier': OneVsRestClassifier(SGDClassifier(random_state = 42))})
  classifiers.update({'BernoulliNB': OneVsRestClassifier(BernoulliNB())})
  classifiers.update({'SVC': OneVsRestClassifier(SVC(random_state = 42))})

  df_models = pd.DataFrame(columns = ['model', 'run_time', 'f1_score_cv', 'f1_score'])

  for key in classifiers:

      print('*', key)

      start_time = time.time()
      
      # Fit on the data passed to the function (X_test and y_test are still read from the notebook's globals)
      pipeline = get_pipeline(X, classifiers[key])
      pipeline.fit(X, y)
      y_pred = pipeline.predict(X_test)

      cv = cross_val_score(pipeline, X, y, cv = 10, scoring = 'f1_macro', n_jobs = -1)

      row = {'model': key,
             'run_time': format(round((time.time() - start_time) / 60, 2)),
             'f1_score_cv': cv.mean(),
             'f1_score': f1_score(y_test, y_pred, average = 'macro')}
      
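      # DataFrame.append was removed in pandas 2.0; concatenate the new row instead
      df_models = pd.concat([df_models, pd.DataFrame([row])], ignore_index = True)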

  df_models = df_models.sort_values(by = 'f1_score', ascending = False)
      
  return df_models

In [29]:
models = select_model(X_train, y_train)

* XGBClassifier
* XGBRFClassifier
* LGBMClassifier
* RandomForestClassifier
* DecisionTreeClassifier
* ExtraTreesClassifier
* GradientBoostingClassifier
* BaggingClassifier
* AdaBoostClassifier
* SGDClassifier
* BernoulliNB
* SVC


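After 11-12 minutes, the model selection process had completed. Seven models, including XGBClassifier, tied with a macro F1 score of 1.0 on both cross-validation and the test set. Scores this perfect should be treated with suspicion: `X` still contains the `rating` column from which `sentiment_label` was derived, so the models can almost certainly read the label directly from that feature (target leakage) rather than learn it from the review text.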

In [30]:
models.head(10)

Unnamed: 0,model,run_time,f1_score_cv,f1_score
0,XGBClassifier,2.01,1.0,1.0
1,XGBRFClassifier,2.13,1.0,1.0
2,LGBMClassifier,1.26,1.0,1.0
4,DecisionTreeClassifier,0.13,1.0,1.0
6,GradientBoostingClassifier,3.1,1.0,1.0
7,BaggingClassifier,0.35,1.0,1.0
8,AdaBoostClassifier,0.64,1.0,1.0
11,SVC,3.23,0.998315,0.999327
3,RandomForestClassifier,3.36,0.979518,0.983144
9,SGDClassifier,0.12,0.96627,0.963779
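
When many models score exactly 1.0, a useful sanity check is whether a single feature fully determines the label. The sketch below runs that check on toy data (hypothetical rows) and then drops the offending column. It is one possible remedy, not the approach taken in this notebook, and the scores reported here were not recomputed with it.

```python
import pandas as pd

# Toy frame mirroring the notebook's setup: the label is derived from rating
toy = pd.DataFrame({'rating': [1, 2, 3, 4, 5, 5],
                    'age': [25, 31, 40, 52, 38, 29],
                    'sentiment_label': [0, 0, 1, 2, 2, 2]})

# If every rating value maps to exactly one label, rating leaks the target
labels_per_rating = toy.groupby('rating')['sentiment_label'].nunique()
rating_determines_label = bool((labels_per_rating == 1).all())

# Drop the leaking column together with the target when building the features
X_no_leak = toy.drop(columns=['sentiment_label', 'rating'])

print(rating_determines_label)
print(list(X_no_leak.columns))
```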


**Examine the performance of the best model**

In [31]:
bundled_pipeline = get_pipeline(X_train, OneVsRestClassifier(XGBClassifier(random_state = 42)))
bundled_pipeline.fit(X_train, y_train)
y_pred = bundled_pipeline.predict(X_test)

Since this is a multi-class classification problem, we can assess the model's performance with standard classification metrics, such as the macro-averaged F1 score and the per-class classification report.


In [32]:
from sklearn.metrics import classification_report

In [33]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       482
           1       1.00      1.00      1.00       574
           2       1.00      1.00      1.00      3642

    accuracy                           1.00      4698
   macro avg       1.00      1.00      1.00      4698
weighted avg       1.00      1.00      1.00      4698
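
The "macro avg" row is the unweighted mean of the per-class scores, so every class counts equally regardless of its support, whereas "weighted avg" weights each class by its support. As a worked illustration, this sketch computes macro F1 by hand on toy predictions (hypothetical labels, unrelated to the results above); `macro_f1` is an illustrative helper, not a scikit-learn function.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels with two mistakes (hypothetical data)
y_true_toy = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 2, 2, 0]

score = macro_f1(y_true_toy, y_pred_toy)
print(round(score, 3))
```

Here the per-class F1 scores are 0.5, 0.8 and about 0.857, giving a macro F1 of about 0.719.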

