<a href="https://colab.research.google.com/github/KelvinLam05/sentiment_classification/blob/main/sentiment_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Goal of the project**

A fundamental problem in sentiment analysis is classifying sentiment polarity: given a piece of written text, assign it to one polarity class, positive, negative, or neutral.

In [39]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

**Load the data**

In [40]:
# Load dataset
df = pd.read_csv("/content/boat's_headphone_reviews.csv")

In [41]:
# Examine the data
df.head()

                                              review  rating
0  It was nice produt. I like it's design a lot. ...       5
1  awesome sound....very pretty to see this nd th...       5
2  awesome sound quality. pros 7-8 hrs of battery...       4
3  I think it is such a good product not only as ...       5
4  awesome bass sound quality very good bettary l...       5


In [42]:
# Overview of all variables, their datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9976 entries, 0 to 9975
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   review  9976 non-null   object
 1   rating  9976 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 156.0+ KB


**Identify review sentiment**

Since every review comes with a star rating, there is arguably little need to engineer additional sentiment-related features to label the data. The star rating already tells us whether a review was positive, negative, or neutral: ratings below 3 are labeled negative, a rating of 3 is labeled neutral, and ratings of 4 or above are labeled positive.

In [None]:
star_rating = {'poor_rating': 3, 'great_rating': 4}

df['sentiment_label'] = 'neutral'
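# Assign through a single df.loc call; chained indexing (df[col].loc[mask] = ...)
# may write to a temporary copy and leave df unchanged
df.loc[df['rating'] < star_rating['poor_rating'], 'sentiment_label'] = 'negative'
df.loc[df['rating'] >= star_rating['great_rating'], 'sentiment_label'] = 'positive'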

In [44]:
# Encode the target labels as integers
df['sentiment_label'] = df['sentiment_label'].map({'negative': 0, 'neutral': 1, 'positive': 2})
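
A quick way to confirm that the thresholds behave as intended is to cross-tabulate ratings against the derived labels. The sketch below uses a small made-up frame (`toy` is hypothetical, not the review data) with the same cut-offs.

```python
import pandas as pd

# Hypothetical toy ratings, one per star value
toy = pd.DataFrame({'rating': [1, 2, 3, 4, 5]})

star_rating = {'poor_rating': 3, 'great_rating': 4}

# Same labeling rule as in the notebook, applied to the toy frame
toy['sentiment_label'] = 'neutral'
toy.loc[toy['rating'] < star_rating['poor_rating'], 'sentiment_label'] = 'negative'
toy.loc[toy['rating'] >= star_rating['great_rating'], 'sentiment_label'] = 'positive'

# Rows are star ratings, columns are labels; each rating should map to exactly one label
check = pd.crosstab(toy['rating'], toy['sentiment_label'])
print(check)
```

Running the same `pd.crosstab` call on `df` before the labels are encoded would show whether any rating falls into an unexpected class.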

**Examine the data**

As the `value_counts()` of the target column shows, this dataset is imbalanced: the positive class (2) makes up about 81% of the reviews.

In [45]:
df['sentiment_label'].value_counts()

2    8091
0    1001
1     884
Name: sentiment_label, dtype: int64
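
One common response to this kind of imbalance, not used in this notebook, is to weight classes inversely to their frequency. The sketch below is only an illustration: it rebuilds a label vector from the counts printed above and asks scikit-learn for `'balanced'` class weights.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts copied from the value_counts() output above
counts = {0: 1001, 1: 884, 2: 8091}

# Rebuild a label vector with the same class frequencies
y = np.repeat(list(counts.keys()), list(counts.values()))

# Share of each class in the dataset
shares = {label: n / len(y) for label, n in counts.items()}

# 'balanced' weights are n_samples / (n_classes * class_count)
classes = np.array(sorted(counts))
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)
class_weights = dict(zip(classes.tolist(), weights.tolist()))

print(shares)
print(class_weights)
```

Many scikit-learn estimators accept such a dictionary, or the string `'balanced'`, through their `class_weight` parameter.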

**Check for special characters**

In [46]:
text_data = str()

for sentence in df['review'].values:
    text_data += sentence
    
''.join(set(text_data))

'😇✓👎𝑡 B💗😟VYx😉ś𝑐,F6🔊️😞Z9~𝓮z🤜kKc💜T💌Î★8💕♂💓𝑛🎊°🙈📦𝗲⚡😲💟!😶b🤔💃👈℅💖👍😫’😑🤷😂😋ᴜëS🎶☝•OØ|𝑝d😡😅🙏₹✔😐😄ɪ🎙J👉♥🕒🔸{💣*🔛😃🤙s0+😳😘@🆗1🤯f🏼A💵_D\U0001f90d🙆😱😤🎻r👂𝑠💨ay😊Q🏻©🔋2R𝑢à🙌🔥X\'PEi😍𝑓U𝑎5:💩✨👿☹😔𝓬😜🕺🎧🖤𝑤👋😩m𝑔💥À🥰𝐭😙𝓲✊🙂😒😣🛶🔵🗣u😛…øIoG🥳𝑜w)🙃👻]📞🥵🤩💛𝑙$🤗=💫ᴘ😕×😏t🎉j/ᴏ>&◀🤟😠💘🤓є}ᴄɴ[𝗡✌☺𝐝🤣𝑏😀😌🎮-😻N💤ÃLM🙄𝑆C%🌹❌#�💐✳💞‼𝑖4❇🥇W❤😗h🎸"3𝑑🔮nÍ👏💀🌚∆🤘ʀÇ🧐.🍭𝑦🎵𝑒👳💚⭐Ñ❣í?H🤑𝓝▶💪🤝\u200b☠g<lÉℎ➡𝗶;😎👌💯⛵😓e(😈💝7😁😭💙𝐮🦊𝑟p𝐜ᴅ𝗰☑ᴇ🤭qvᴛ🌟😚💰\u200d'

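The reviews contain plenty of special characters: emoji, accented letters, currency symbols, stylized mathematical alphabets, and zero-width characters.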
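
To get a rough picture of what kinds of characters are present, the distinct characters can be grouped by Unicode general category. This is a minimal sketch using only the standard library; `sample_reviews` is a made-up stand-in for `df['review']`.

```python
import unicodedata
from collections import Counter

# Hypothetical sample strings standing in for df['review']
sample_reviews = ['Awesome sound 👍', 'Très bon, worth ₹999!', 'good\u200b bass']

# Collect the distinct characters, as in the cell above
chars = set(''.join(sample_reviews))

# Keep only characters outside plain ASCII
non_ascii = {ch for ch in chars if ord(ch) > 127}

# Count them by Unicode general category, e.g. 'So' (other symbol, which
# covers most emoji), 'Ll' (lowercase letter), 'Sc' (currency symbol), 'Cf' (format)
by_category = Counter(unicodedata.category(ch) for ch in non_ascii)
print(by_category)
```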

**Text Preprocessing**

We will need to preprocess our text to remove misleading junk and noise in order to get the best results from our model.

In [47]:
import re
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

In [48]:
import unicodedata
from emoji import demojize

In [None]:
nltk.download('stopwords')

In [50]:
stop_words = set(stopwords.words('english'))

We will now set up our cleaning function.

In [51]:
def text_cleaning(text_data):

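  # Demojize first: convert emoji to text names (e.g. ':thumbs_up:') before
  # the ASCII encoding step below, which would otherwise discard them
  text_data = demojize(text_data)

  # Remove accented characters
  text_data = unicodedata.normalize('NFKD', text_data).encode('ascii', 'ignore').decode('utf-8', 'ignore')

  # Case conversion
  text_data = text_data.lower()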

  # Remove special characters
  text_data = re.sub(r"[^a-zA-Z]+", ' ', text_data)

  # Text as string objects
  text_data = str(text_data)

  # Tokenization
  tokenizer = ToktokTokenizer()
  text_data = tokenizer.tokenize(text_data)

  # Removing stopwords
  text_data = [item for item in text_data if item not in stop_words]
  
  # Convert list of tokens to string data type
  text_data = ' '.join(text_data)

  return text_data

In [52]:
df['clean_review'] = df['review'].apply(text_cleaning)
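
The accent-removal step relies on NFKD normalization followed by an ASCII round trip. A minimal sketch using only the standard library shows what that step does to an illustrative string (`sample` and `strip_to_ascii` are made up for this example): accented letters are reduced to their base letters, while characters with no ASCII decomposition, emoji included, are dropped entirely. Emoji therefore only contribute tokens if they are converted to text before this step runs.

```python
import unicodedata

def strip_to_ascii(text):
    # NFKD splits 'é' into 'e' plus a combining accent; encoding to ASCII
    # with errors='ignore' then drops the accent and any other non-ASCII character
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

sample = 'Très bon café 👍'
print(repr(strip_to_ascii(sample)))
```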

In [53]:
# Drop unwanted column
df.drop(['review'], axis = 1, inplace = True)

**Split the train and test data**

In [54]:
# Features: every column except the target (this keeps both 'clean_review' and 'rating')
X = df.drop('sentiment_label', axis = 1)

In [55]:
y = df['sentiment_label']

In [56]:
from sklearn.model_selection import train_test_split

In [57]:
# Split imbalanced dataset into training and testing sets with stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
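
The effect of `stratify` can be checked on a small example. The sketch below uses made-up labels (`X_toy` and `y_toy` are hypothetical) with an 80/10/10 class split and confirms that both partitions keep the same proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 rows of class 2, 10 each of classes 0 and 1
y_toy = np.array([2] * 80 + [0] * 10 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

# Per-class row counts (classes 0, 1, 2) in each partition
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```

Without `stratify`, a random 20% sample of an imbalanced dataset could easily over- or under-represent the minority classes.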

**Run model selection**

In [58]:
import time
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import RobustScaler
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score

In [59]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier, XGBRFClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.naive_bayes import BernoulliNB

In [60]:
def get_pipeline(model):
                 
  preprocessor = ColumnTransformer(transformers = [('tfidf', TfidfVectorizer(), 'clean_review'),
                                                   ('scaler', RobustScaler(), ['rating'])])

  bundled_pipeline = Pipeline(steps = [('preprocessor', preprocessor),
                                       ('model', model)])
  
  return bundled_pipeline

In [61]:
def select_model(X, y, pipeline = None):

  classifiers = {}
  classifiers.update({'XGBClassifier': OneVsRestClassifier(XGBClassifier(random_state = 42))})
  classifiers.update({'XGBRFClassifier': OneVsRestClassifier(XGBRFClassifier(random_state = 42))})
  classifiers.update({'LogisticRegression': OneVsRestClassifier(LogisticRegression(random_state = 42))})
  classifiers.update({'LGBMClassifier': OneVsRestClassifier(LGBMClassifier(random_state = 42))})
  classifiers.update({'RandomForestClassifier': OneVsRestClassifier(RandomForestClassifier(random_state = 42))})
  classifiers.update({'DecisionTreeClassifier': OneVsRestClassifier(DecisionTreeClassifier(random_state = 42))})
  classifiers.update({'ExtraTreesClassifier': OneVsRestClassifier(ExtraTreesClassifier(random_state = 42))})
  classifiers.update({'GradientBoostingClassifier': OneVsRestClassifier(GradientBoostingClassifier(random_state = 42))})    
  classifiers.update({'BaggingClassifier': OneVsRestClassifier(BaggingClassifier(random_state = 42))})
  classifiers.update({'AdaBoostClassifier': OneVsRestClassifier(AdaBoostClassifier(random_state = 42))})
  classifiers.update({'KNeighborsClassifier': OneVsRestClassifier(KNeighborsClassifier())})
  classifiers.update({'SGDClassifier': OneVsRestClassifier(SGDClassifier(random_state = 42))})
  classifiers.update({'BernoulliNB': OneVsRestClassifier(BernoulliNB())})
  classifiers.update({'LinearSVC': OneVsRestClassifier(LinearSVC(random_state = 42))})
  classifiers.update({'SVC': OneVsRestClassifier(SVC(random_state = 42))})

  df_models = pd.DataFrame(columns = ['model', 'run_time', 'f1_score_cv', 'f1_score'])

  for key in classifiers:

      print('*', key)

      start_time = time.time()
      
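      # Fit on the training split and predict the hold-out split
      # (X_train, y_train and X_test are read from the notebook's global scope)
      pipeline = get_pipeline(classifiers[key])
      pipeline.fit(X_train, y_train)
      y_pred = pipeline.predict(X_test)

      # 10-fold cross-validated macro F1 on the data passed to the function
      cv = cross_val_score(pipeline, X, y, cv = 10, scoring = 'f1_macro', n_jobs = -1)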

      row = {'model': key,
             'run_time': format(round((time.time() - start_time) / 60, 2)),
             'f1_score_cv': cv.mean(),
             'f1_score': f1_score(y_test, y_pred, average = 'macro')}
      
      # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
      df_models = pd.concat([df_models, pd.DataFrame([row])], ignore_index = True)

  df_models = df_models.sort_values(by = 'f1_score_cv', ascending = False)
      
  return df_models

In [62]:
models = select_model(X_train, y_train)

* XGBClassifier
* XGBRFClassifier
* LogisticRegression
* LGBMClassifier
* RandomForestClassifier
* DecisionTreeClassifier
* ExtraTreesClassifier
* GradientBoostingClassifier
* BaggingClassifier
* AdaBoostClassifier
* KNeighborsClassifier
* SGDClassifier
* BernoulliNB
* LinearSVC
* SVC


After 1-2 minutes, the model selection process had completed. XGBClassifier appears at the top of the table with a cross-validated macro F1 score of 1.0, but the other models listed below reach the same score, so the ordering among them is arbitrary.

In [63]:
models.head()

                    model  run_time  f1_score_cv  f1_score
0           XGBClassifier      0.47          1.0       1.0
1         XGBRFClassifier       0.5          1.0       1.0
2      LogisticRegression      0.06          1.0       1.0
3          LGBMClassifier      0.19          1.0       1.0
5  DecisionTreeClassifier      0.03          1.0       1.0
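
The `f1_macro` scoring used for the ranking averages the per-class F1 scores with equal weight, so the two small classes count as much as the large positive class. The sketch below uses made-up labels to illustrate the difference from the weighted average: a classifier that always predicts the majority class still gets a fairly high weighted F1 but a low macro F1.

```python
from sklearn.metrics import f1_score

# Hypothetical labels: a classifier that always predicts the majority class 2
y_true = [2, 2, 2, 2, 2, 2, 2, 2, 0, 1]
y_pred = [2] * 10

# zero_division=0 scores classes that are never predicted as 0 instead of warning
macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
weighted = f1_score(y_true, y_pred, average='weighted', zero_division=0)

print(round(macro, 3), round(weighted, 3))
```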


**Examine the performance of the best model**

In [64]:
pipeline = get_pipeline(OneVsRestClassifier(XGBClassifier(random_state = 42)))
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

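Since this is a multi-class classification problem, we can assess its performance using common classification metrics, such as the F1 score and the classification report. The model reaches an F1 score of 1.0 on the test set. A perfect score like this should be read with caution: `sentiment_label` was derived directly from `rating`, and `rating` is still one of the features in `X` (it is scaled by `RobustScaler` in the pipeline), so the model can recover the label from the rating alone without using the review text.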



In [65]:
from sklearn.metrics import classification_report

In [66]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       200
           1       1.00      1.00      1.00       177
           2       1.00      1.00      1.00      1619

    accuracy                           1.00      1996
   macro avg       1.00      1.00      1.00      1996
weighted avg       1.00      1.00      1.00      1996
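
Since `sentiment_label` is derived from `rating`, one way to check how much the review text contributes on its own is to build the pipeline without the `rating` column. The sketch below is an assumption about how that could look, not the notebook's method: `get_text_only_pipeline` is a hypothetical helper, `LogisticRegression` is used only because it needs no extra packages, and the four toy reviews are invented. No scores are shown because the result on the real data is not known.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def get_text_only_pipeline(model):
    # Hypothetical variant of get_pipeline: only the cleaned text is vectorized;
    # remainder='drop' (the default, made explicit) discards 'rating'
    preprocessor = ColumnTransformer(
        transformers=[('tfidf', TfidfVectorizer(), 'clean_review')],
        remainder='drop')
    return Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])

# Invented toy data with the same column names as the notebook
toy = pd.DataFrame({
    'clean_review': ['awesome sound quality', 'bad battery stopped working',
                     'great bass love', 'poor build broke week'],
    'rating': [5, 1, 5, 1],
})
labels = [2, 0, 2, 0]

pipeline = get_text_only_pipeline(LogisticRegression(random_state=42))
pipeline.fit(toy, labels)

# The model sees only TF-IDF columns, one per vocabulary term
n_features = pipeline.named_steps['model'].coef_.shape[1]
print(n_features)
```

Passing a pipeline built this way to the same train/test split and `cross_val_score` call would give scores that reflect the text features only.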

