<a href="https://colab.research.google.com/github/KelvinLam05/sentiment_analysis/blob/main/sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Goal of the project**

One fundamental problem in sentiment analysis is the categorization of sentiment polarity: given a piece of written text, assign it to exactly one polarity class (positive, negative, or neutral).

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

**Load the data**

In [2]:
# Load dataset
df = pd.read_csv("/content/boat's_headphone_reviews.csv")

In [3]:
# Examine the data
df.head()

Unnamed: 0,review,rating
0,It was nice produt. I like it's design a lot. ...,5
1,awesome sound....very pretty to see this nd th...,5
2,awesome sound quality. pros 7-8 hrs of battery...,4
3,I think it is such a good product not only as ...,5
4,awesome bass sound quality very good bettary l...,5


In [4]:
# Overview of all variables, their datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9976 entries, 0 to 9975
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   review  9976 non-null   object
 1   rating  9976 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 156.0+ KB


**Identify review sentiment**

Since we have star ratings, there's arguably little need to engineer any additional sentiment-related features. All we really need to know is whether a given review was positive, negative, or neutral, and we can read that from the star rating the customer has given.

In [None]:
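# Ratings below 3 are negative, ratings of 4 or more are positive, and 3 is neutral
star_rating = {'poor_rating': 3, 'great_rating': 4}

# Assign through df.loc in a single step to avoid chained assignment
df['sentiment_label'] = 'neutral'
df.loc[df['rating'] < star_rating['poor_rating'], 'sentiment_label'] = 'negative'
df.loc[df['rating'] >= star_rating['great_rating'], 'sentiment_label'] = 'positive'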
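The key names in `star_rating` can be misleading, because a rating of exactly 3 ends up neutral rather than negative. As a quick sanity check, the same thresholds can be applied to each possible star value. The helper `rating_to_sentiment` is a hypothetical name introduced only for this illustration and is not part of the notebook.

```python
# Thresholds copied from the notebook cell above
star_rating = {'poor_rating': 3, 'great_rating': 4}

def rating_to_sentiment(rating):
    """Hypothetical helper: map a 1-5 star rating to a sentiment label."""
    if rating < star_rating['poor_rating']:
        return 'negative'
    if rating >= star_rating['great_rating']:
        return 'positive'
    return 'neutral'

# Apply the rule to every possible star value
labels = {rating: rating_to_sentiment(rating) for rating in range(1, 6)}
print(labels)
```

Ratings 1 and 2 map to negative, 3 to neutral, and 4 and 5 to positive.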

In [6]:
# Encode the target labels as integers
df['sentiment_label'] = df['sentiment_label'].map({'negative': 0, 'neutral': 1, 'positive': 2})

**Examine the data**

The `value_counts()` of the target column show that this dataset is imbalanced: about 81% of the reviews are positive.

In [7]:
df['sentiment_label'].value_counts()

2    8091
0    1001
1     884
Name: sentiment_label, dtype: int64
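To put numbers on the imbalance, the sketch below computes each class's share and a "balanced" weight of `n_samples / (n_classes * count)`, which is the heuristic scikit-learn applies when an estimator is given `class_weight = 'balanced'`. The notebook itself does not use class weights, so treat this as an optional illustration built only from the counts printed above.

```python
# Class counts taken from the value_counts() output above
class_counts = {2: 8091, 0: 1001, 1: 884}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Share of each class in the dataset
class_share = {label: count / n_samples for label, count in class_counts.items()}

# "Balanced" weights: n_samples / (n_classes * count), so rare classes weigh more
class_weight = {label: n_samples / (n_classes * count) for label, count in class_counts.items()}

for label in sorted(class_counts):
    print(label, round(class_share[label], 3), round(class_weight[label], 3))
```

The positive class gets a weight below 1, while the two minority classes get weights above 3.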

**Check for special characters**

In [8]:
# Join all reviews into one string, then list every distinct character
text_data = ''.join(df['review'].values)

''.join(set(text_data))

'üò§ü§óüëçK+üíØ!ü§îüíú7üò†‚ÑÖ‚ö°ùëìüï∫p,‚Çπüò≥üôÜ]üò≠&üé∏üòóV_üç≠…¥√éüòÇ0büî•JüíÉSùëôU|üëåùó∞=DüéßZüò≤üíµ¬©üõ∂<üíõüé∂üò´rdQgFü•µüò∂√ë√∏‚Ñé?·¥ÑtüòÑ"üí§ùìùüéµüëãü§òüò£/üìûü§≠YùëéüéÆ‚òπùê≠ùëíBüôàüèª…™ü§ú‚ò∫Eüì¶Ô∏è√óüéäüíåüèºüíê\'~‚≠ê‚ú®üåöüôåùë°lüòªhoüîõüñ§¬∞OyXüòêfüíôüôÇùë§üòäüòò4üòü1üí®üò°üíì‚Äôa$‚ò†ü§ù√á·¥úcüë≥kùëÜùëî\U0001f90düíÄ)‚û°—îqüòõ‚õµü§ëùëë6vü§©IüíïüòÖüòöL‚òÖ·¥ÖüòúùìÆ‚àÜùêú-üôè·¥áüí∞‚òëüî∏üí•(üíûüéâùëèH√ò√ÉMüòé‚úì‚úäzüòô@:üòàuüíüüëéüòâsüëèüåπüòãüó£j‚òùüòì‚ù£.ùó∂üíóüíñxüí£ÔøΩùëñ·¥èùë¶wüïíüòÅ*niüòå‚úåüôÑü§üüôÉ‚ôÇ8üíùùêÆùë†mùì≤}·¥ò‚Äº>ü•≥\u200büòï√†üòáüòç√çü•∞ùëü üòíCeüí´üëªüîµüòÉüò±#üßêüéªüéôG·¥õùë¢ùëùR≈õü§£üëâùëê‚ô•üîã√´üëàüåüüíö‚ñ∂ü§Ø‚ùå√Äüîä√≠üí©ü§ôüÜó‚Ä¶A[ü¶äüîÆ{ùêùüòèüíò‚óÄWüí™√âüòÄT‚ú≥ùó°5\u200d; Ä3üëøN‚ùáüëÇ‚ù§üò©Püòëüòîü§∑2‚Ä¢ü§ìùëúüòûü•áùëõ‚úîùó≤%9ùì¨'

The reviews contain plenty of special characters, including emoji, accented letters, currency symbols, and zero-width characters.
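A raw character dump is hard to read, so one option is to group the non-ASCII characters by their Unicode general category using the standard-library `unicodedata` module. The helper name `summarize_non_ascii` and the sample reviews are made up for this sketch.

```python
import unicodedata
from collections import Counter

def summarize_non_ascii(texts):
    """Hypothetical helper: count non-ASCII characters by Unicode general category."""
    counts = Counter()
    for text in texts:
        for char in text:
            if ord(char) > 127:
                counts[unicodedata.category(char)] += 1
    return counts

# Made-up sample reviews, not taken from the dataset:
# a thumbs-up emoji, the rupee sign, and an accented letter
sample_reviews = ['Great bass \U0001F44D', 'Worth \u20b9999', 'tr\u00e8s bien']
summary = summarize_non_ascii(sample_reviews)
print(summary)
```

Here `So` (other symbol) covers the emoji, `Sc` (currency symbol) the rupee sign, and `Ll` (lowercase letter) the accented letter.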

**Text Preprocessing**

We will need to preprocess our text to remove misleading junk and noise in order to get the best results from our model.

In [9]:
import re
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 

In [11]:
import unicodedata
from emoji import demojize

In [12]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [13]:
stop_words = set(stopwords.words('english'))
lemmatizer = nltk.stem.WordNetLemmatizer()

We will now set up our cleaning function.

In [14]:
# Create the tokenizer once instead of on every call
tokenizer = ToktokTokenizer()

def text_cleaning(text_data):

  # Demojize first: convert emoji to text names (e.g. ":thumbs_up:") before
  # the ASCII step below, which would otherwise delete them
  text_data = demojize(text_data)

  # Remove accented characters (and any remaining non-ASCII characters)
  text_data = unicodedata.normalize('NFKD', text_data).encode('ascii', 'ignore').decode('utf-8', 'ignore')

  # Case conversion
  text_data = text_data.lower()

  # Replace everything that is not a letter (digits, punctuation, emoji-name delimiters) with a space
  text_data = re.sub(r"[^a-z]+", ' ', text_data)

  # Tokenization
  tokens = tokenizer.tokenize(text_data)

  # Removing stopwords
  tokens = [item for item in tokens if item not in stop_words]

  # Convert list of tokens back to a single string
  return ' '.join(tokens)

In [15]:
df['clean_review'] = df['review'].apply(text_cleaning)
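The order of the cleaning steps matters. The accent-removal expression keeps only characters that survive an ASCII encoding, so it drops emoji as well as accents. Any emoji therefore has to be converted to text by `demojize` before that step runs, or it is lost. The sketch below isolates the expression with the standard library only. The wrapper name `strip_to_ascii` and the sample string are assumptions made for illustration.

```python
import unicodedata

def strip_to_ascii(text):
    """Same accent-removal expression used in the cleaning function."""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

# Made-up sample: an accented letter followed by a thumbs-up emoji
sample = 'caf\u00e9 \U0001F44D'
ascii_only = strip_to_ascii(sample)
print(repr(ascii_only))
```

The accented letter is reduced to its base letter, while the emoji disappears entirely.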

In [16]:
# Drop unwanted column
df.drop(['review'], axis = 1, inplace = True)

**Split the train and test data**

In [17]:
X = df.drop('sentiment_label', axis = 1)

In [18]:
y = df['sentiment_label']

In [19]:
from sklearn.model_selection import train_test_split

In [20]:
# Split imbalanced dataset into training and testing sets with stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
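The effect of `stratify` can be checked on a toy example: with stratification, each class should appear in the training and test sets in the same proportion as in the full data. The arrays below are synthetic and only mirror the kind of imbalance seen in this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (80/10/10), not the notebook's data
y_toy = np.array([2] * 80 + [0] * 10 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(X_toy, y_toy, test_size = 0.2, random_state = 42, stratify = y_toy)

# Class counts in each split, indexed by label (0, 1, 2)
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```

Both splits keep the original 10% / 10% / 80% class ratio.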

**Run model selection**

In [21]:
import time
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import RobustScaler
from sklearn.multiclass import OneVsOneClassifier
from sklearn.metrics import f1_score

In [22]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier, XGBRFClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.naive_bayes import BernoulliNB

In [23]:
def get_pipeline(model):
                 
  preprocessor = ColumnTransformer(transformers = [('tfidf', TfidfVectorizer(), 'clean_review'),
                                                   ('scaler', RobustScaler(), ['rating'])])

  bundled_pipeline = Pipeline(steps = [('preprocessor', preprocessor),
                                       ('model', model)])
  
  return bundled_pipeline
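One detail of the `ColumnTransformer` is easy to miss. `TfidfVectorizer` expects a one-dimensional sequence of strings, so its column is selected with a bare string (`'clean_review'`). `RobustScaler` expects a two-dimensional array, so its column is selected with a list (`['rating']`). The sketch below runs the same preprocessor on a few made-up rows to show the shape of the combined feature matrix.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import RobustScaler

# Made-up rows with the same column names as the notebook's feature frame
toy = pd.DataFrame({'clean_review': ['good sound', 'bad battery', 'good battery'],
                    'rating': [5, 1, 4]})

preprocessor = ColumnTransformer(transformers = [('tfidf', TfidfVectorizer(), 'clean_review'),
                                                 ('scaler', RobustScaler(), ['rating'])])

features = preprocessor.fit_transform(toy)

# 4 distinct words give 4 TF-IDF columns, plus 1 scaled rating column
print(features.shape)
```

Note that the scaled `rating` column sits next to the text features in the matrix passed to the model.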

In [24]:
def select_model(X, y, pipeline = None):

  classifiers = {}
  classifiers.update({'XGBClassifier': OneVsOneClassifier(XGBClassifier(random_state = 42))})
  classifiers.update({'XGBRFClassifier': OneVsOneClassifier(XGBRFClassifier(random_state = 42))})
  classifiers.update({'LogisticRegression': OneVsOneClassifier(LogisticRegression(random_state = 42))})
  classifiers.update({'LGBMClassifier': OneVsOneClassifier(LGBMClassifier(random_state = 42))})
  classifiers.update({'RandomForestClassifier': OneVsOneClassifier(RandomForestClassifier(random_state = 42))})
  classifiers.update({'DecisionTreeClassifier': OneVsOneClassifier(DecisionTreeClassifier(random_state = 42))})
  classifiers.update({'ExtraTreesClassifier': OneVsOneClassifier(ExtraTreesClassifier(random_state = 42))})
  classifiers.update({'GradientBoostingClassifier': OneVsOneClassifier(GradientBoostingClassifier(random_state = 42))})    
  classifiers.update({'BaggingClassifier': OneVsOneClassifier(BaggingClassifier(random_state = 42))})
  classifiers.update({'AdaBoostClassifier': OneVsOneClassifier(AdaBoostClassifier(random_state = 42))})
  classifiers.update({'KNeighborsClassifier': OneVsOneClassifier(KNeighborsClassifier())})
  classifiers.update({'SGDClassifier': OneVsOneClassifier(SGDClassifier(random_state = 42))})
  classifiers.update({'BernoulliNB': OneVsOneClassifier(BernoulliNB())})
  classifiers.update({'LinearSVC': OneVsOneClassifier(LinearSVC(random_state = 42))})
  classifiers.update({'SVC': OneVsOneClassifier(SVC(random_state = 42))})

  rows = []

  for key in classifiers:

      print('*', key)

      start_time = time.time()

      # Fit on the data passed in; X_test and y_test come from the notebook's global scope
      pipeline = get_pipeline(classifiers[key])
      pipeline.fit(X, y)
      y_pred = pipeline.predict(X_test)

      # 10-fold cross-validated macro F1 on the training data
      cv = cross_val_score(pipeline, X, y, cv = 10, scoring = 'f1_macro', n_jobs = -1)

      # run_time is measured in minutes
      rows.append({'model': key,
                   'run_time': round((time.time() - start_time) / 60, 2),
                   'f1_score_cv': cv.mean(),
                   'f1_score': f1_score(y_test, y_pred, average = 'macro')})

  # DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
  df_models = pd.DataFrame(rows, columns = ['model', 'run_time', 'f1_score_cv', 'f1_score'])
  df_models = df_models.sort_values(by = 'f1_score_cv', ascending = False)

  return df_models

In [25]:
models = select_model(X_train, y_train)

* XGBClassifier
* XGBRFClassifier
* LogisticRegression
* LGBMClassifier
* RandomForestClassifier
* DecisionTreeClassifier
* ExtraTreesClassifier
* GradientBoostingClassifier
* BaggingClassifier
* AdaBoostClassifier
* KNeighborsClassifier
* SGDClassifier
* BernoulliNB
* LinearSVC
* SVC


After 1-2 minutes, the model selection process had completed. XGBClassifier is listed first, with a cross-validated macro F1 score of 1.0, although the other models shown below reach the same score.

In [26]:
models.head()

Unnamed: 0,model,run_time,f1_score_cv,f1_score
0,XGBClassifier,0.67,1.0,1.0
1,XGBRFClassifier,0.42,1.0,1.0
2,LogisticRegression,0.04,1.0,1.0
3,LGBMClassifier,0.14,1.0,1.0
5,DecisionTreeClassifier,0.03,1.0,1.0
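Identical scores of exactly 1.0 across unrelated model families are unusual for text classification. A likely explanation is that the feature matrix includes `rating`, the column the label was derived from. The sketch below uses synthetic ratings (not the notebook's data) to show that a single decision tree trained on the rating alone already classifies perfectly, without seeing any text.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic star ratings cycling through 1-5, not the notebook's data
ratings = np.tile([1, 2, 3, 4, 5], 100)

# Same thresholds as the labelling step: <3 negative (0), 3 neutral (1), >=4 positive (2)
labels = np.where(ratings < 3, 0, np.where(ratings >= 4, 2, 1))

# Train on the first 400 ratings, evaluate on the remaining 100
tree = DecisionTreeClassifier(random_state = 42)
tree.fit(ratings[:400].reshape(-1, 1), labels[:400])
accuracy = tree.score(ratings[400:].reshape(-1, 1), labels[400:])
print(accuracy)
```

Because the label is a deterministic function of the rating, the tree only needs two splits to separate the three classes.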


**Examine the performance of the best model**

In [27]:
pipeline = get_pipeline(OneVsOneClassifier(XGBClassifier(random_state = 42)))
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

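Since this is a multi-class classification problem, we can assess performance with common classification metrics, such as the macro F1 score and the classification report. The model reaches an F1 score of 1.00 on every class. A perfect score should be read with caution here: `rating` is one of the input features and the sentiment label was derived directly from it, so the model can recover the label without using the review text.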
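Macro F1 is the unweighted mean of the per-class F1 scores, which makes it a stricter measure than accuracy on imbalanced data. The sketch below is a plain-Python version written for illustration (the notebook itself relies on scikit-learn's `f1_score`), applied to made-up labels where a model always predicts the majority class.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never seen or predicted
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up labels: a model that always predicts the majority class (2)
y_true_toy = [2, 2, 2, 2, 2, 2, 2, 2, 0, 1]
y_pred_toy = [2] * 10

accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
score = macro_f1(y_true_toy, y_pred_toy)
print(accuracy, round(score, 3))
```

Accuracy looks respectable at 0.8, but macro F1 falls to roughly 0.3 because the two minority classes each contribute an F1 of zero.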



In [28]:
from sklearn.metrics import classification_report

In [29]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       200
           1       1.00      1.00      1.00       177
           2       1.00      1.00      1.00      1619

    accuracy                           1.00      1996
   macro avg       1.00      1.00      1.00      1996
weighted avg       1.00      1.00      1.00      1996
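To measure how well the review text alone predicts sentiment, the rating has to be kept out of the feature matrix. The sketch below is one possible variant of `get_pipeline`. The name `get_text_only_pipeline` and the toy rows are hypothetical, and the results in this notebook were not produced with it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def get_text_only_pipeline(model):
    """Hypothetical variant of get_pipeline that ignores the rating column."""
    # remainder = 'drop' (the default) discards every column not listed, including rating
    preprocessor = ColumnTransformer(transformers = [('tfidf', TfidfVectorizer(), 'clean_review')],
                                     remainder = 'drop')
    return Pipeline(steps = [('preprocessor', preprocessor), ('model', model)])

# Made-up rows with the notebook's column names
toy = pd.DataFrame({'clean_review': ['good sound', 'bad battery', 'great bass', 'poor build'],
                    'rating': [5, 1, 5, 2]})
toy_labels = [2, 0, 2, 0]

text_pipeline = get_text_only_pipeline(LogisticRegression(random_state = 42))
text_pipeline.fit(toy, toy_labels)

# The model sees one feature per vocabulary word and no rating column
n_features = text_pipeline.named_steps['model'].coef_.shape[1]
print(n_features)
```

Scores from a text-only pipeline would be expected to fall below 1.0, and they would give a more realistic picture of how the model handles reviews that arrive without a star rating.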

