**Goal of the project**

In this project, we predict whether a sentence expresses positive (label 1) or negative (label 0) sentiment.

**Load the packages**

In [749]:
# Importing libraries
import pandas as pd
import numpy as np

**Load the data**

For this project I’ve used the [UCI Sentiment Labelled Sentences Data Set](https://www.kaggle.com/datasets/marklvl/sentiment-labelled-sentences-data-set).

The dataset was first described in a 2015 paper by Kotzias, Denil, De Freitas, and Smyth, who compared three approaches: logistic regression with bag-of-words features, logistic regression with embeddings, and GICF with embeddings. Their best result was an accuracy of 0.882. Let’s see how close we can get to that score.


In [750]:
# Load dataset
df = pd.read_csv('../input/sentiment-labelled-sentences-data-set/sentiment labelled sentences/amazon_cells_labelled.txt', delimiter = '\t', header = None, names = ['text', 'label'])

In [751]:
# Rename Pandas columns to lower case
df.columns = df.columns.str.lower()

In [752]:
# Examine the data
df.head()

| | text | label |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |


In [753]:
# Overview of all variables, their datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    1000 non-null   object
 1   label   1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB


**Examine the target variable**

Running df['label'].value_counts() shows 500 items in the positive class and 500 in the negative class, so the classes are perfectly balanced.

In [754]:
df['label'].value_counts()

0    500
1    500
Name: label, dtype: int64

**Feature engineering**

Next we’ll create some features.

In [755]:
df['word_count'] = df['text'].apply(lambda x: len(str(x).split(' ')))

In [756]:
df['sentence_count'] = df['text'].apply(lambda x: len(str(x).split('.')))
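
To make the two count features concrete, here is a minimal sketch that applies the same split logic to plain strings. The helper names `word_count` and `sentence_count` are illustrative and not part of the notebook. Note that `sentence_count` is really the number of period-delimited pieces, so a sentence ending in a period yields 2, not 1.

```python
def word_count(text):
    # Same logic as the notebook's lambda: split on single spaces
    return len(str(text).split(' '))

def sentence_count(text):
    # Counts period-delimited pieces, i.e. (number of periods) + 1
    return len(str(text).split('.'))

for sample in ["Good case, Excellent value.", "Works great"]:
    print(repr(sample), word_count(sample), sentence_count(sample))
# 'Good case, Excellent value.' 4 2
# 'Works great' 2 1
```

Since the feature is only used as a model input, the off-by-one does not matter much, but it is worth knowing when reading the raw values.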

In [757]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [758]:
sia = SentimentIntensityAnalyzer()  # build the analyzer once instead of once per row
df['polarity_score'] = df['text'].apply(lambda x: sia.polarity_scores(x)['compound'])

**Preprocess the data**

Before modeling, we clean the text to reduce noise: we strip accents, convert to lower case, remove non-letter characters, and drop stop words.

In [759]:
import re
import nltk
import unicodedata
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

In [760]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [761]:
stop_words = set(stopwords.words('english'))

In [762]:
def text_cleaning(text_data):

  # Remove accented characters
  text_data = unicodedata.normalize('NFKD', text_data).encode('ascii', 'ignore').decode('utf-8', 'ignore')

  # Case conversion
  text_data = text_data.lower()

  # Remove special characters
  text_data = re.sub(r"[^a-zA-Z]+", ' ', text_data)

  # Text as string objects
  text_data = str(text_data)

  # Tokenization
  tokenizer = ToktokTokenizer()
  text_data = tokenizer.tokenize(text_data)

  # Removing stopwords
  text_data = [item for item in text_data if item not in stop_words]
  
  # Convert list of tokens to string data type
  text_data = ' '.join(text_data)

  return text_data

In [763]:
df['text'] = df['text'].apply(text_cleaning)
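
To see what the cleaning steps do to a single sentence, here is a standard-library-only sketch. It is a simplification under stated assumptions: `clean_demo` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word set is a tiny stand-in for NLTK's English list, and a whitespace split replaces `ToktokTokenizer`.

```python
import re
import unicodedata

# Hypothetical mini stop-word set; the notebook uses NLTK's full English list
DEMO_STOP_WORDS = {"the", "was", "is", "a", "not"}

def clean_demo(text):
    # Strip accents: decompose characters, then drop anything outside ASCII
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8", "ignore")
    # Lower-case, then replace every run of non-letters with a single space
    text = re.sub(r"[^a-z]+", " ", text.lower())
    # Whitespace split stands in for ToktokTokenizer in this sketch
    tokens = [tok for tok in text.split() if tok not in DEMO_STOP_WORDS]
    return " ".join(tokens)

print(clean_demo("The café was NOT great!!! 10/10"))
# cafe great
```

The example also shows a side effect worth keeping in mind: NLTK's English stop-word list includes negations such as "not", so removing stop words can erase the word that flips a sentence's sentiment.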

**Create training and test data**

To start the modeling process, we’ll assign every column except label (the cleaned text plus the engineered features) to our X feature set, and the label column to our target variable y. We’ll then use the scikit-learn train_test_split() function to divide the data into a training and a test set, with 20% of the data held back for testing.

In [764]:
X = df.drop('label', axis = 1)

In [765]:
y = df['label']

In [766]:
from sklearn.model_selection import train_test_split

In [767]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
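
As an optional variation that this notebook does not use, `train_test_split` accepts a `stratify` argument that preserves the class ratio in both splits. The sketch below uses toy stand-in data (the `texts` and `labels` lists are made up for illustration) with the same 500/500 balance as the real dataset.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-in for the 1,000 labelled sentences: 500 negative, 500 positive
labels = [0] * 500 + [1] * 500
texts = [f"sentence {i}" for i in range(1000)]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# With stratification the 200 test rows are split evenly between the classes
test_counts = Counter(y_te)
print(test_counts[0], test_counts[1])
# 100 100
```

Without `stratify`, the split is purely random, so the test set can end up slightly unbalanced even when the full dataset is perfectly balanced.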

**Create a model pipeline**

We construct a pipeline that applies a standard scaler to the numerical features, converts the text column into TF-IDF vectors, and then fits a classifier on the combined features.

In [768]:
import time
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

In [769]:
def get_pipeline(X, model):

    
    numeric_columns = X.select_dtypes(exclude = ['object']).columns.tolist()
    numeric_pipeline = Pipeline(steps = [('scaler', StandardScaler())])
    
    preprocessor = ColumnTransformer(transformers = [('tfidf', TfidfVectorizer(), 'text'),
                                                     ('numeric', numeric_pipeline, numeric_columns)], remainder = 'passthrough')

    bundled_pipeline = Pipeline(steps = [('preprocessor', preprocessor),
                                         ('model', model)])
    
    return bundled_pipeline
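
To illustrate what the preprocessor produces, here is a minimal sketch on a made-up three-row DataFrame (the `toy` frame and its values are assumptions for illustration only). It shows how the TF-IDF columns and the scaled numeric column end up side by side in one feature matrix.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Made-up rows standing in for the cleaned text and one engineered feature
toy = pd.DataFrame({
    "text": ["good phone", "bad battery", "good battery life"],
    "word_count": [2, 2, 3],
})

preprocessor = ColumnTransformer(transformers=[
    # A string selector hands TfidfVectorizer a 1-D column of documents
    ("tfidf", TfidfVectorizer(), "text"),
    # A list selector hands StandardScaler a 2-D block of numeric columns
    ("numeric", StandardScaler(), ["word_count"]),
])

features = preprocessor.fit_transform(toy)
print(features.shape)  # 5 distinct TF-IDF terms + 1 scaled numeric column
# (3, 6)
```

The distinction between `'text'` (a string) and `['word_count']` (a list) matters: text vectorizers expect a 1-D sequence of documents, while scalers expect a 2-D array.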

**Apply model selection**

For the model selection step, we first create a dictionary that maps the name of each model we want to test to an instance of its class, e.g. XGBClassifier(random_state = 42).

Next we’ll collect the results in a Pandas dataframe. We loop over the models and, for each one, calculate the mean accuracy from 5-fold cross-validation on the training data, then fit the model on X_train and y_train and generate predictions for X_test. That gives us two numbers per model: the average cross-validated accuracy on the training set and the accuracy on the held-out test set.
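
The two accuracy estimates can be sketched in isolation as follows. This is a minimal illustration, not the notebook's code: it uses synthetic data from `make_classification` and a single `LogisticRegression` model as stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data standing in for the notebook's features (an assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000, random_state=42)

# Estimate 1: accuracy on each of 5 folds of the training portion only
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring="accuracy")

# Estimate 2: fit on the whole training portion, score once on the held-out test set
model.fit(X_tr, y_tr)
holdout_accuracy = accuracy_score(y_te, model.predict(X_te))

print(round(cv_scores.mean(), 3), round(holdout_accuracy, 3))
```

`cross_val_score` fits clones of the estimator, so the explicit `fit` call afterwards is still needed before predicting on the test set.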

In [770]:
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB

In [771]:
def select_model(X, y, pipeline = None):

  classifiers = {}
  classifiers.update({'XGBClassifier': XGBClassifier(random_state = 42)})
  classifiers.update({'LGBMClassifier': LGBMClassifier(random_state = 42)})
  classifiers.update({'DecisionTreeClassifier': DecisionTreeClassifier(random_state = 42)})
  classifiers.update({'RandomForestClassifier': RandomForestClassifier(random_state = 42)})
  classifiers.update({'ExtraTreesClassifier': ExtraTreesClassifier(random_state = 42)})
  classifiers.update({'GradientBoostingClassifier': GradientBoostingClassifier(random_state = 42)})    
  classifiers.update({'BaggingClassifier': BaggingClassifier(random_state = 42)})
  classifiers.update({'AdaBoostClassifier': AdaBoostClassifier(random_state = 42)})
  classifiers.update({'CatBoostClassifier': CatBoostClassifier(silent = True, random_seed = 42)})
  classifiers.update({'LogisticRegression': LogisticRegression(random_state = 42)})
  classifiers.update({'BernoulliNB': BernoulliNB()})

  rows = []

  for key in classifiers:

      print('*', key)

      start_time = time.time()
      
      pipeline = get_pipeline(X, classifiers[key])

      # Mean accuracy over 5 folds of the data passed in (the training set)
      cv = cross_val_score(pipeline, X, y, cv = 5, scoring = 'accuracy', n_jobs = -1)
      
      # Refit on the full training set, then score on the held-out test set
      # (X_test and y_test are the notebook-level variables created earlier)
      pipeline.fit(X, y)
      y_pred = pipeline.predict(X_test)
    
      rows.append({'model': key,
                   'run_time': round((time.time() - start_time) / 60, 2),  # minutes
                   'accuracy_score_cv': cv.mean(),
                   'accuracy_score': accuracy_score(y_test, y_pred)})

  # DataFrame.append was removed in pandas 2.0, so build the frame from the list of rows
  df_models = pd.DataFrame(rows, columns = ['model', 'run_time', 'accuracy_score_cv', 'accuracy_score'])
  df_models = df_models.sort_values(by = 'accuracy_score', ascending = False)
      
  return df_models

In [772]:
models = select_model(X_train, y_train)

* XGBClassifier
* LGBMClassifier
* DecisionTreeClassifier
* RandomForestClassifier
* ExtraTreesClassifier
* GradientBoostingClassifier
* BaggingClassifier
* AdaBoostClassifier
* CatBoostClassifier
* LogisticRegression
* BernoulliNB


After 1-2 minutes, the model selection process completed. ExtraTreesClassifier was the top-performing model, with an accuracy of 86.5% both in cross-validation and on the held-out test set.

In [773]:
models.head(10)

| | model | run_time | accuracy_score_cv | accuracy_score |
|---|---|---|---|---|
| 4 | ExtraTreesClassifier | 0.02 | 0.865 | 0.865 |
| 3 | RandomForestClassifier | 0.02 | 0.855 | 0.85 |
| 10 | BernoulliNB | 0.0 | 0.85125 | 0.845 |
| 0 | XGBClassifier | 0.03 | 0.845 | 0.835 |
| 5 | GradientBoostingClassifier | 0.02 | 0.85625 | 0.835 |
| 9 | LogisticRegression | 0.0 | 0.855 | 0.835 |
| 8 | CatBoostClassifier | 0.6 | 0.84875 | 0.83 |
| 1 | LGBMClassifier | 0.01 | 0.835 | 0.825 |
| 6 | BaggingClassifier | 0.01 | 0.8575 | 0.815 |
| 7 | AdaBoostClassifier | 0.01 | 0.83125 | 0.81 |


**Evaluate model performance**

In [774]:
bundled_pipeline = get_pipeline(X_train, ExtraTreesClassifier(random_state = 42))
bundled_pipeline.fit(X_train, y_train)
y_pred = bundled_pipeline.predict(X_test)

Accuracy alone does not show how errors are distributed across the classes. The classification report, available via the classification_report() function, breaks performance down into precision, recall, and F1-score for each class.

In [775]:
from sklearn.metrics import classification_report

In [776]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.89      0.86        93
           1       0.90      0.84      0.87       107

    accuracy                           0.86       200
   macro avg       0.86      0.87      0.86       200
weighted avg       0.87      0.86      0.87       200
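
The per-class numbers in the report can be reproduced by hand from the confusion-matrix counts. The counts below are not notebook output: they are reconstructed from the rounded report (support of 93 and 107, recall of 0.89 and 0.84), so treat them as an inference. The helper name `prf` is illustrative.

```python
# Counts reconstructed from the rounded report above (an inference, not notebook output)
tn, fp = 83, 10   # true class 0: predicted 0, predicted 1
fn, tp = 17, 90   # true class 1: predicted 0, predicted 1

def prf(true_pos, false_pos, false_neg):
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Class 1 treats "positive" as the target; class 0 swaps the roles
p1, r1, f1_1 = prf(tp, fp, fn)
p0, r0, f1_0 = prf(tn, fn, fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class 0: {p0:.2f} {r0:.2f} {f1_0:.2f}")
print(f"class 1: {p1:.2f} {r1:.2f} {f1_1:.2f}")
print(f"accuracy: {accuracy:.3f}")
# class 0: 0.83 0.89 0.86
# class 1: 0.90 0.84 0.87
# accuracy: 0.865
```

Read this way, the model is more precise on positive sentences (0.90) but misses more of them: 17 positives are labelled negative, against 10 negatives labelled positive.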

