<a href="https://colab.research.google.com/github/KelvinLam05/sentiment_analysis/blob/main/sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Goal of the project**

For each textual review, we want to predict whether it is a great review (the customer is happy) or a poor one (the customer is not satisfied). Overall ratings range from 1/5 to 5/5. To simplify the problem, we split them into two categories:

* negative: ratings <= 2

* positive: ratings >= 4

The challenge is to predict this label using only the raw text of the review. Let's get started!

In [73]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

**Load the data**

In [74]:
# Load dataset
df = pd.read_csv('/content/online_customer_reviews.csv')

In [75]:
# Examine the data
df.head()

| | review | rating |
|---|---|---|
| 0 | It was nice produt. I like it's design a lot. ... | 5 |
| 1 | awesome sound....very pretty to see this nd th... | 5 |
| 2 | awesome sound quality. pros 7-8 hrs of battery... | 4 |
| 3 | I think it is such a good product not only as ... | 5 |
| 4 | awesome bass sound quality very good bettary l... | 5 |


In [76]:
# Overview of all variables, their datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9976 entries, 0 to 9975
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   review  9976 non-null   object
 1   rating  9976 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 156.0+ KB


**Identify review sentiment**

To analyse the things that customers like and dislike you need a simple way of differentiating a positive review from a negative one. 

In fact, since we have star ratings, there’s arguably little need to engineer any additional sentiment-related features. All we really need to know is whether the text from a given review was positive or negative, and we can see that from the star rating the customer has given. 

In [None]:
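# Ratings below 3 are negative, ratings of 4 and above are positive, 3 stays neutral
star_rating = {'poor_rating': 3, 'great_rating': 4}

df['sentiment_label'] = 'neutral'
# Assign through df.loc in a single step to avoid chained assignment
df.loc[df['rating'] < star_rating['poor_rating'], 'sentiment_label'] = 'negative'
df.loc[df['rating'] >= star_rating['great_rating'], 'sentiment_label'] = 'positive'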

In [78]:
# Drop neutral reviews; copy so that later column assignments act on an independent frame
df = df[df['rating'] != 3].copy()

At the moment, the sentiment_label column contains the string 'positive' or 'negative', but we need to “binarise” this into a numeric value the model can use. A simple replace( ) is one of several ways to do this.

In [None]:
df['sentiment_label'] = df['sentiment_label'].replace(('positive', 'negative'), (1, 0))
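
For comparison, two other common ways to binarise string labels are an explicit `map` and a boolean comparison. The sketch below runs on a small hypothetical Series (`labels`) rather than on the notebook's `df`, and is only meant to illustrate the alternatives.

```python
import pandas as pd

# Hypothetical labels standing in for df['sentiment_label']
labels = pd.Series(['positive', 'negative', 'positive'])

# Option 1: explicit mapping; any unexpected label becomes NaN, which makes mistakes visible
mapped = labels.map({'positive': 1, 'negative': 0})

# Option 2: boolean comparison cast to integers
compared = (labels == 'positive').astype(int)

print(mapped.tolist())
print(compared.tolist())
```

The mapping version is stricter, because a label that is neither 'positive' nor 'negative' surfaces as a missing value instead of silently becoming 0.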

**Examine the data**

As we will see from examining the value_counts( ) of the target variable column, this dataset is imbalanced.

In [80]:
df['sentiment_label'].value_counts()

1    8091
0    1001
Name: sentiment_label, dtype: int64
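
To put a number on the imbalance, the counts above can be turned into a class share and a ratio. This is plain arithmetic on the two printed counts; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
n_positive = 8091
n_negative = 1001
n_total = n_positive + n_negative

# Share of positive reviews and the positive-to-negative ratio
positive_share = n_positive / n_total
imbalance_ratio = n_positive / n_negative

print(f"positive share: {positive_share:.3f}")
print(f"positive-to-negative ratio: {imbalance_ratio:.2f} : 1")
```

Roughly 89% of the remaining reviews are positive, so a model that always answers "positive" would already be right about 89% of the time.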

**Check for special characters**

In [81]:
text_data = str()

for sentence in df['review'].values:
    text_data += sentence
    
''.join(set(text_data))

'🤑ø(😙,😳m🔊😑í🦊✊Eᴇ😋𝓲🧐😒𝑡🔥😐Ñ🔸ᴜ2💨6💤💯\U0001f90d❤c👏😊d😍3Í🙈𝑠🌚🙌𝑎%𝑦N🎵xj😂𝓬😌☠💀a🌟👂𝗲X𝑏😚℅k🎧🛶Pn🎶:RW𝑝𝑖G🏼𝑙😏tvV|1💃8😀🤔💵🙏🎙☝M👌7👿є🎻\u200b💫💌{😘>💝👋ᴛ🤘🏻🎉🔵𝑆=s!OÇ𝐜😁💰À👻UÉ🤝@Kqr♂oY😱ɪ\'Ø☑𝑑😜Cë🌹📦🕒w➡𝑜🤗#️F𝑤😲☹🤓🤷👍🤜ᴄ😭g]💘h💗•💞👳𝑐💜à🖤👉🤣Aɴ😟…👈𝐝𝓮😈𝓝5D🙂✨u😞𝗡i😕❌’+‼💐🎮°T~$⚡🎸💛"ᴅ𝗶📞🙄🎊🔮💩[😉𝗰?I;𝑔₹🙆🆗&z𝑢🔛L🤭_𝐭⭐🤩💣😃🗣*◀😡💪4∆p🤟💟B⛵.ᴏ😔💖𝐮💓😣Zʀ🤯💙🙃/🥇✔0 J▶🥳❣★😠f😄🕺\u200d😫🥰-𝑛😇<Î✳)𝑓🍭S𝑟yH☺l💕🤙😛©😓eℎ💚Q🥵😻Ã×ᴘ}💥❇✌😅♥ś🔋✓👎9�b😎😗𝑒'

There are plenty of special characters: emoji, accented letters, stylised Unicode letters, currency symbols and zero-width characters.
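
A raw character dump is hard to read, so it can help to group the non-ASCII characters by Unicode general category. The helper below (`summarise_non_ascii`, a hypothetical name) is a minimal sketch using only the standard library, and the three sample reviews are made up for illustration.

```python
import unicodedata
from collections import Counter

def summarise_non_ascii(texts):
    """Count non-ASCII characters by Unicode general category (e.g. 'So' = other symbol)."""
    counts = Counter()
    for text in texts:
        for char in text:
            if ord(char) > 127:
                counts[unicodedata.category(char)] += 1
    return counts

# Made-up samples: \u00e8 is an accented e, \U0001F60D and \U0001F525 are emoji,
# \u20b9 is the rupee sign
sample_reviews = ["tr\u00e8s bon \U0001F60D",
                  "awesome sound \U0001F525\U0001F525",
                  "price \u20b9999"]

summary = summarise_non_ascii(sample_reviews)
print(summary)
```

On the real data, the same call would be `summarise_non_ascii(df['review'])`.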

**Text Preprocessing**

We will need to preprocess our text to remove misleading junk and noise in order to get the best results from our model.

In [82]:
import re
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 

In [83]:
import unicodedata
from emoji import demojize

In [84]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [85]:
stop_words = set(stopwords.words('english'))
lemmatizer = nltk.stem.WordNetLemmatizer()

We will now set up our cleaning function.

In [86]:
def text_cleaning(text_data):

  # Convert emoji to text names first; the ASCII step below would otherwise drop them
  text_data = demojize(text_data)

  # Remove accented and other non-ASCII characters
  text_data = unicodedata.normalize('NFKD', text_data).encode('ascii', 'ignore').decode('utf-8', 'ignore')

  # Case conversion
  text_data = text_data.lower()

  # Replace everything except letters with a space
  text_data = re.sub(r"[^a-zA-Z]+", ' ', text_data)

  # Tokenization
  tokenizer = ToktokTokenizer()
  text_data = tokenizer.tokenize(text_data)

  # Removing stopwords
  text_data = [item for item in text_data if item not in stop_words]
  
  # Convert list of tokens to string data type
  text_data = ' '.join(text_data)

  return text_data
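
The accent-removal step is worth checking in isolation, because it removes more than accents. The sketch below wraps the same `unicodedata` call in a hypothetical helper, `ascii_fold`, and applies it to a made-up string.

```python
import unicodedata

def ascii_fold(text):
    """Decompose characters (NFKD) and drop everything that is not plain ASCII."""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

# \u00e8 is an accented e, \U0001F60D is the heart-eyes emoji
folded = ascii_fold("tr\u00e8s bon \U0001F60D")
print(repr(folded))
```

The accented letter is reduced to its base letter, but the emoji has no ASCII decomposition and disappears entirely. Any emoji-to-text conversion such as `demojize` therefore only has an effect if it runs before this step.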

In [87]:
df['clean_review'] = df['review'].apply(text_cleaning)

In [88]:
# Drop unwanted column
df.drop(['review'], axis = 1, inplace = True)

**Split the train and test data**

In [89]:
X = df.drop('sentiment_label', axis = 1)

In [90]:
y = df['sentiment_label']

In [91]:
from sklearn.model_selection import train_test_split

In [92]:
# Split imbalanced dataset into training and testing sets with stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
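
The split sizes can be sanity-checked with a little arithmetic. The sketch assumes that a fractional `test_size` is rounded up to a whole number of rows, which matches the test-set support of 1819 reported later in the notebook.

```python
import math

n_samples = 9092   # 8091 positive + 1001 negative reviews
test_size = 0.2

# Assumption: a fractional test_size is rounded up to a whole number of rows
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Stratification should keep the positive share nearly identical in both sets
overall_positive_share = 8091 / n_samples
test_positive_share = 1619 / 1819   # test-set support from the classification report

print(n_train, n_test)
print(f"{overall_positive_share:.4f} vs {test_positive_share:.4f}")
```

The two shares agree to about three decimal places, which is what `stratify = y` is meant to guarantee.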

**Run model selection**

In [93]:
import time
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import roc_auc_score

In [94]:
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier, XGBRFClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.naive_bayes import BernoulliNB

In [95]:
def get_pipeline(model):
                 
  preprocessor = ColumnTransformer(transformers = [('tfidf', TfidfVectorizer(), 'clean_review'),
                                                   ('scaler', RobustScaler(), ['rating'])])

  bundled_pipeline = Pipeline(steps = [('preprocessor', preprocessor),
                                       ('model', model)])
  
  return bundled_pipeline
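
Note that `get_pipeline` feeds `rating` to the model alongside the text, and `sentiment_label` is derived directly from `rating`. For the stated goal of predicting sentiment from the text alone, a text-only variant may be the better fit. The following is a minimal sketch, not the notebook's method: `get_text_only_pipeline` is a hypothetical name and the toy data is invented for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def get_text_only_pipeline(model):
    """Hypothetical variant of get_pipeline that sees only the review text."""
    return Pipeline(steps = [('tfidf', TfidfVectorizer()),
                             ('model', model)])

# Toy data for illustration only, not taken from the dataset
toy = pd.DataFrame({
    'clean_review': ['awesome sound quality', 'good product nice design',
                     'bad battery stopped working', 'poor quality waste money'],
    'sentiment_label': [1, 1, 0, 0],
})

text_pipeline = get_text_only_pipeline(LogisticRegression())

# TfidfVectorizer expects a 1-D collection of strings, so pass the column itself
text_pipeline.fit(toy['clean_review'], toy['sentiment_label'])
predictions = text_pipeline.predict(pd.Series(['nice sound', 'bad quality']))
print(predictions)
```

With the real data this would be fitted on `X_train['clean_review']` and `y_train`, and the scores would then reflect what can be learned from the text.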

In [96]:
def select_model(X, y, pipeline = None):

  classifiers = {}
  classifiers.update({'DummyClassifier': DummyClassifier(strategy = 'most_frequent')})
  classifiers.update({'XGBClassifier': XGBClassifier()})
  classifiers.update({'XGBRFClassifier': XGBRFClassifier()})
  classifiers.update({'LogisticRegression': LogisticRegression()})
  classifiers.update({'LGBMClassifier': LGBMClassifier()})
  classifiers.update({'RandomForestClassifier': RandomForestClassifier()})
  classifiers.update({'DecisionTreeClassifier': DecisionTreeClassifier()})
  classifiers.update({'ExtraTreeClassifier': ExtraTreesClassifier()})
  classifiers.update({'GradientBoostingClassifier': GradientBoostingClassifier()})    
  classifiers.update({'BaggingClassifier': BaggingClassifier()})
  classifiers.update({'AdaBoostClassifier': AdaBoostClassifier()})
  classifiers.update({'KNeighborsClassifier': KNeighborsClassifier()})
  classifiers.update({'SGDClassifier': SGDClassifier()})
  classifiers.update({'BernoulliNB': BernoulliNB()})
  classifiers.update({'LinearSVC': LinearSVC()})
  classifiers.update({'SVC': SVC()})

  # Collect one result row per model; DataFrame.append was removed in pandas 2.0
  rows = []

  for key in classifiers:

      print('*', key)

      start_time = time.time()
      
      # Fit on the data passed in; X_test and y_test come from the split above
      pipeline = get_pipeline(classifiers[key])
      pipeline.fit(X, y)
      y_pred = pipeline.predict(X_test)

      cv = cross_val_score(pipeline, X, y, cv = 10, scoring = 'roc_auc')

      rows.append({'model': key,
                   'run_time': round((time.time() - start_time) / 60, 2),
                   'roc_auc_cv': cv.mean(),
                   'roc_auc': roc_auc_score(y_test, y_pred)})

  df_models = pd.DataFrame(rows, columns = ['model', 'run_time', 'roc_auc_cv', 'roc_auc'])
  df_models = df_models.sort_values(by = 'roc_auc_cv', ascending = False)
      
  return df_models
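
One detail of the function above: the `roc_auc` column is computed from the hard 0/1 output of `predict`. ROC AUC is a ranking metric, so it is normally computed from continuous scores such as the output of `predict_proba` or `decision_function`. The dependency-free sketch below shows how much information thresholding can lose; `roc_auc` is a hand-written helper and the scores are hypothetical.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(y_score, y_true) if y == 1]
    negatives = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 1, 1]
scores = [0.2, 0.3, 0.4, 0.45]                  # hypothetical positive-class scores
hard_labels = [int(s >= 0.5) for s in scores]   # what a 0.5 threshold would output

auc_from_scores = roc_auc(y_true, scores)
auc_from_labels = roc_auc(y_true, hard_labels)
print(auc_from_scores, auc_from_labels)
```

Here the scores rank every positive above every negative, yet all of them fall below 0.5, so the thresholded labels carry no ranking information at all.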

In [97]:
models = select_model(X_train, y_train)

* DummyClassifier
* XGBClassifier
* XGBRFClassifier
* LogisticRegression
* LGBMClassifier
* RandomForestClassifier
* DecisionTreeClassifier
* ExtraTreeClassifier
* GradientBoostingClassifier
* BaggingClassifier
* AdaBoostClassifier
* KNeighborsClassifier
* SGDClassifier
* BernoulliNB
* LinearSVC
* SVC


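After 1-2 minutes, the model selection process had completed. XGBClassifier sits at the top of the table with a ROC AUC of 1.0, but it is tied with several very different models. A perfect score across the board should be read as a warning sign rather than a result: the pipeline receives `rating` as an input feature, and `sentiment_label` is derived directly from `rating`, so the models can recover the label without reading the text.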

In [98]:
models.head()

| | model | run_time | roc_auc_cv | roc_auc |
|---|---|---|---|---|
| 1 | XGBClassifier | 0.21 | 1.0 | 1.0 |
| 2 | XGBRFClassifier | 0.17 | 1.0 | 1.0 |
| 3 | LogisticRegression | 0.03 | 1.0 | 1.0 |
| 4 | LGBMClassifier | 0.08 | 1.0 | 1.0 |
| 6 | DecisionTreeClassifier | 0.02 | 1.0 | 1.0 |


**Examine the performance of the best model**

In [99]:
pipeline = get_pipeline(XGBClassifier(random_state = 42))
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

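Since this is now just a regular binary classification problem, we can assess its performance using common classification metrics, such as the F1 score and the classification report. The report shows an F1 score of 1.00 for both classes. As with the ROC AUC scores above, this reflects the `rating` feature, from which the label is derived, rather than what the model has learned from the review text.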



In [100]:
from sklearn.metrics import classification_report

In [101]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       200
           1       1.00      1.00      1.00      1619

    accuracy                           1.00      1819
   macro avg       1.00      1.00      1.00      1819
weighted avg       1.00      1.00      1.00      1819
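
With classes this imbalanced, any score is easier to judge against a trivial baseline. The arithmetic below works out what a model that predicts "positive" for every review would achieve on this test set, using only the support counts from the report; it corresponds in spirit to the `DummyClassifier(strategy = 'most_frequent')` entry in the model list.

```python
# Test-set support copied from the classification report above
n_negative = 200
n_positive = 1619
n_total = n_negative + n_positive

# Baseline: predict "positive" for every review
accuracy = n_positive / n_total
precision_pos = n_positive / n_total   # every review is predicted positive
recall_pos = 1.0                       # every true positive is recovered
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)
recall_neg = 0 / n_negative            # no negative review is ever predicted

print(f"accuracy: {accuracy:.3f}")
print(f"F1 (positive class): {f1_pos:.3f}")
print(f"recall (negative class): {recall_neg:.3f}")
```

The baseline reaches about 89% accuracy while never finding a single negative review, which is why the per-class rows of the report matter more here than the overall accuracy.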

