### 3. Sentiment Classification

Building on the feature extraction from the previous section, we now build a sentiment classifier. We will train Naive Bayes and Support Vector Machine (SVM) classifiers. Specifically, we cover:

#### 1. Loading Dataset Features

1. Unpickling features and vectorizers
2. Visualizing the Feature Dataset
3. Train and Test Splits


#### 2. Building Classifiers

1. Naive Bayes Classifier
2. Classify Unseen Text - with NaiveBayes
3. Support Vector Machine
4. Classify Unseen Text - with SVM

### Loading Dataset and Features


### 1. Loading Pickled/Serialized Objects

Recall from the previous note that we saved the feature dataset and vectorizers to pickled files. Using Python's `pickle` library, we can restore these objects as they were and continue to the model-building stage. The mapping from pickled file to Python object is:


1. `tfidf_vectorizer.pk`: `review_tfidf_vectorizer`
2. `count_vectorizer.pk`: `review_countVectorizer`
3. `count_features_dataframe.pk`: `bowords_features`
4. `tfidf_features_dataframe.pk`: `tfidf_features`
5. `outcomes.pk`: `verified_data.rate`

Below, we load the pickled files back into Python objects for vectorization and modeling.

In [10]:
import pickle
import pandas as pd
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer


def load_pickled_object(file_name: str):
    """
    Loads and returns a Python object from a pickled file.
    
    This function reads from a specified file containing a pickled object, unpickles it, and returns the object. 
    It's useful for retrieving Python objects, like models or datasets, that were previously serialized with 
    pickle and saved to a file.
    
    Parameters:
    - file_name: str. The name of the file (with path, if necessary) from which the serialized object will be 
      loaded. The file should exist and be accessible.
    
    Returns:
    - The Python object that was deserialized from the file.
    """
    try:
        with open(file_name, 'rb') as file:
            return pickle.load(file)
    except FileNotFoundError:
        print(f"The file {file_name} was not found.")
        return None
    except Exception as e:
        print(f"An error occurred while loading the pickled object: {e}")
        return None


In [12]:
review_tfidf_vectorizer = load_pickled_object('tfidf_vectorizer.pk')
review_countVectorizer = load_pickled_object('count_vectorizer.pk')
bowords_features = load_pickled_object('count_features_dataframe.pk')
tfidf_features = load_pickled_object('tfidf_features_dataframe.pk')
outcomes = load_pickled_object('outcomes.pk')
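
As a quick sanity check of the save/load cycle, the sketch below pickles a small stand-in object to a temporary file and reads it back. The file name `demo_vectorizer.pk` and the dictionary contents are hypothetical and are not part of the original notebook.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted vectorizer or feature table.
original = {"vocabulary": ["able", "absolute", "absolutely"], "ngram_range": (1, 1)}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_vectorizer.pk")

    # pickle reads and writes bytes, so the file is opened in binary mode.
    with open(path, "wb") as file:
        pickle.dump(original, file)

    # Deserialize; the restored object is an equal but distinct copy.
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == original, restored is original)
```

Because unpickling can execute arbitrary code, it is safest to load only pickle files that you created yourself or otherwise trust.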

<br>

### 2. Visualizing Dataset

With the pickled objects loaded, we can now inspect the feature dataframes produced by the `Bag of Words` and `TFIDF` vectorizers, as well as the outcome labels.

In [13]:
bowords_features.head()

Unnamed: 0,able,absolute,absolutely,accessory,actual,actually,add,addict,addition,adorable,...,wrinkle,wrong,x,yeah,year,yellow,yes,young,yum,yummy
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
tfidf_features.head()

Unnamed: 0,able,absolute,absolutely,accessory,actual,actually,add,addict,addition,adorable,...,wrinkle,wrong,x,yeah,year,yellow,yes,young,yum,yummy
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.32858,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
outcomes.head()

0    1
1    1
2   -1
4    1
7   -1
Name: rate, dtype: int64

#### Testing the Vectorizer on a New Sentence

We can now test the vectorizers on a new sentence. Words that are not in the fitted vocabulary are ignored, so most entries of the resulting vectors are zero.

In [16]:
sample_review = 'I am looking forward to trying the new restaurants in Bremen'

review_countVectorizer.transform([sample_review]).toarray()

array([[0, 0, 0, ..., 0, 0, 0]])

In [17]:
review_tfidf_vectorizer.transform([sample_review]).toarray()

array([[0., 0., 0., ..., 0., 0., 0.]])

#### Train and Test Split 

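The code below implements a 70/30 train/test split. Passing `stratify` keeps the class proportions the same in both splits, which matters here because the classes are heavily imbalanced.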

In [29]:
import numpy as np
from sklearn.model_selection import train_test_split

x_train_bow, x_test_bow, y_train_bow, y_test_bow = train_test_split(bowords_features, pd.DataFrame(outcomes), test_size=.30, stratify=pd.DataFrame(outcomes) )
x_train_tfidf, x_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(tfidf_features, pd.DataFrame(outcomes), test_size=.30, stratify=pd.DataFrame(outcomes) )

y_train_bow.value_counts(), y_test_bow.value_counts(), y_train_tfidf.value_counts(), y_test_tfidf.value_counts(),

(rate
  1      2247
 -1        84
  0        18
 Name: count, dtype: int64,
 rate
  1      963
 -1       36
  0        8
 Name: count, dtype: int64,
 rate
  1      2247
 -1        84
  0        18
 Name: count, dtype: int64,
 rate
  1      963
 -1       36
  0        8
 Name: count, dtype: int64)
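
The effect of `stratify` can be checked on synthetic data. The sketch below uses hypothetical label counts and a placeholder feature column, then compares the share of each class in the two splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced labels (hypothetical counts).
labels = np.array([1] * 900 + [-1] * 80 + [0] * 20)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder feature column

_, _, y_train, y_test = train_test_split(
    features, labels, test_size=0.30, stratify=labels, random_state=0
)

def class_share(y, label):
    """Fraction of samples in y that carry the given label."""
    return float(np.mean(y == label))

# Print label, share in the training split, share in the test split.
for label in (1, -1, 0):
    print(label, round(class_share(y_train, label), 3), round(class_share(y_test, label), 3))
```

Without `stratify`, a rare class could by chance end up almost entirely in one of the two splits.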

<br>

## 2. Sentiment Classifiers


### Naive Bayes Classifier

The Naive Bayes classification algorithm is generally an effective technique for classifying text. Derived from Bayes' theorem, it evaluates the probability that a review belongs to a sentiment class (for example, positive or negative) given the words contained in the review.
Mathematically, it all begins with Bayes' rule of conditional probability:

$$ P(A|B) = \frac {P(A)P(B|A)} {P(B)} $$

 


Translating it to the Naive Bayes Algorithm, we want to predict the sentiment of some text given the presence of words in the text. Mathematically, it can be expressed as:


 $$P(positive\ sentiment|w_1,w_2,w_3,...w_n) = \frac {P(positive\ sentiment)P(w_1,w_2,w_3,...w_n|positive\ sentiment)} {P(w_1, w_2, w_3,...,w_n)}$$


where $w_i$ is the $i$-th word of the review.

The key assumption of Naive Bayes (and the source of its name) is conditional independence: given the class, every word is assumed to be independent of every other word. This is useful because the joint likelihood of the words then factors into a product of per-word likelihoods, and the probability can be written as:


$$P(positive\ sentiment|w_1,w_2,w_3,...w_n) = \frac {P(positive\ sentiment)P(w_1|positive\ sentiment)P(w_2| positive\ sentiment)P(w_3| positive\ sentiment)...P(w_n| positive\ sentiment)} {P(w_1, w_2, w_3,...,w_n)}$$




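The above expansion gives a general formula for the probability of any sentiment class:

$$P(sentiment|w_1,w_2,w_3,...w_n) = \frac {1}{Z} P(sentiment) \prod_{i=1}^{n} P(w_i|sentiment) $$

where $Z = P(w_1, w_2, w_3,...,w_n)$ is the normalizer, i.e. the probability of observing the words of the review regardless of sentiment. $Z$ is the same for every class, so it does not change which class has the highest probability.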
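
The arithmetic behind this formula can be sketched in a few lines. All probabilities below are hypothetical values chosen only for illustration, and $Z$ is computed as the sum of the unnormalized scores over the classes.

```python
import math

# Hypothetical priors and per-word likelihoods, for illustration only.
priors = {"positive": 0.9, "negative": 0.1}
word_likelihoods = {
    "positive": {"food": 0.05, "terrible": 0.001},
    "negative": {"food": 0.04, "terrible": 0.03},
}
review_words = ["food", "terrible"]

# Unnormalized score per class: P(sentiment) * prod_i P(w_i | sentiment),
# accumulated in log space to avoid underflow on long reviews.
log_scores = {
    sentiment: math.log(priors[sentiment])
    + sum(math.log(word_likelihoods[sentiment][word]) for word in review_words)
    for sentiment in priors
}

# Z is the sum of the unnormalized scores over all classes.
scores = {sentiment: math.exp(value) for sentiment, value in log_scores.items()}
z = sum(scores.values())
posteriors = {sentiment: score / z for sentiment, score in scores.items()}

print(posteriors)
```

In this toy setup the strong prior toward the positive class is outweighed by the word "terrible", which is far more likely under the negative class.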

Enough with the math and theory; let's see this in action with Python.


<br>

#### Naive Bayes Model Implementation in Python

The code below initializes two Multinomial Naive Bayes models, one per feature set, with the additive (Laplace/Lidstone) smoothing parameter `alpha` set to 0.5.

In [32]:
from sklearn.naive_bayes import MultinomialNB
    
naive_bayes_bow = MultinomialNB(alpha=.5, fit_prior=True)
naive_bayes_tfidf = MultinomialNB(alpha=.5, fit_prior=True)
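
To see what `alpha` does, the sketch below recomputes the smoothed word probabilities by hand on a toy count matrix and compares them with what `MultinomialNB` stores in `feature_log_prob_`. The matrix `X` and labels `y` are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: rows are documents, columns are three hypothetical words.
X = np.array([[2, 0, 1], [1, 0, 0], [0, 2, 1], [0, 1, 0]])
y = np.array([1, 1, -1, -1])
alpha = 0.5

model = MultinomialNB(alpha=alpha, fit_prior=True).fit(X, y)

# Smoothed estimate of P(word | class):
# (word count + alpha) / (total count in class + alpha * number of words).
manual = []
for label in model.classes_:
    word_counts = X[y == label].sum(axis=0)
    manual.append((word_counts + alpha) / (word_counts.sum() + alpha * X.shape[1]))
manual = np.array(manual)

print(np.round(manual, 3))
print(np.allclose(manual, np.exp(model.feature_log_prob_)))
```

Smoothing keeps every word probability strictly positive, so a word that never appeared in one class during training cannot force that class's probability to zero.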

In [36]:
naive_bayes_bow.fit(x_train_bow, np.ravel(y_train_bow.values))
naive_bayes_tfidf.fit(x_train_tfidf, np.ravel(y_train_tfidf))

<br>

#### Train and Test Assessment

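After training the models, we compute the accuracy on the training and test sets. The test accuracy indicates how well the model does on reviews it has not seen. Keep in mind that 963 of the 1,007 test reviews are positive, so a model that always predicts the positive class would already reach an accuracy of about 0.96.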

In [37]:
from sklearn.metrics import accuracy_score

# accuracy_score expects the true labels first, then the predictions.
print("BOW Training Accuracy:", round(accuracy_score(y_train_bow, naive_bayes_bow.predict(x_train_bow)), 2))
print("BOW Test Accuracy:", round(accuracy_score(y_test_bow, naive_bayes_bow.predict(x_test_bow)), 2))

print("TFIDF Training Accuracy:", round(accuracy_score(y_train_tfidf, naive_bayes_tfidf.predict(x_train_tfidf)), 2))
print("TFIDF Test Accuracy:", round(accuracy_score(y_test_tfidf, naive_bayes_tfidf.predict(x_test_tfidf)), 2))

BOW Training Accuracy: 0.99
BOW Test Accuracy: 0.99
TFIDF Training Accuracy: 0.98
TFIDF Test Accuracy: 0.98
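
Accuracy alone can be misleading on imbalanced data. As a rough point of comparison, the sketch below scores a hypothetical baseline that always predicts the majority class, using labels with the same class counts as the test split shown earlier. It is not one of the trained models.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Labels with the same class counts as the test split (963 / 36 / 8).
y_true = np.array([1] * 963 + [-1] * 36 + [0] * 8)

# Hypothetical baseline "model" that always predicts the majority class.
y_majority = np.ones_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
baseline_balanced = balanced_accuracy_score(y_true, y_majority)

print("Accuracy:", round(baseline_accuracy, 2))
print("Balanced accuracy:", round(baseline_balanced, 2))

# Rows are true classes, columns are predicted classes, in the order -1, 0, 1.
print(confusion_matrix(y_true, y_majority, labels=[-1, 0, 1]))
```

Balanced accuracy averages the recall of each class, so it penalizes a model that ignores the minority classes, and the confusion matrix shows exactly which classes are being missed.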


<br>

### Predicting New Reviews on Naive Bayes Model

To run the model on reviews outside of the test and train set, we must implement the same preprocessing steps and vectorization. Below is an example of the implementation.

In [45]:
pos_review = 'the food at the restaurant was amazing'
neg_review = 'the food at the restaurant was absolutely terrible'

In [46]:
bow_tests = review_countVectorizer.transform([pos_review, neg_review])
tfidf_tests = review_tfidf_vectorizer.transform([pos_review, neg_review])

In [48]:
naive_bayes_bow.predict(bow_tests[1]), naive_bayes_tfidf.predict(tfidf_tests[1])

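(array([1]), array([1]))

Note that index `1` selects `neg_review`, yet both models predict `1`, the label that dominates the training data. This is consistent with the heavy class imbalance seen in the train and test splits.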
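
One way to guarantee that new text goes through exactly the same vectorization as the training data is to bundle the vectorizer and the classifier in a scikit-learn pipeline. The sketch below is a toy illustration with a hypothetical two-review training set, not the models trained above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training set; 1 = positive, -1 = negative.
train_reviews = ["amazing food", "terrible food"]
train_labels = [1, -1]

# The pipeline applies the same fitted vectorizer at training and prediction time.
toy_model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=0.5))
toy_model.fit(train_reviews, train_labels)

new_reviews = ["amazing service", "terrible service"]
predictions = toy_model.predict(new_reviews)
probabilities = toy_model.predict_proba(new_reviews)

for review, label, proba in zip(new_reviews, predictions, probabilities):
    # Columns of predict_proba follow the order of toy_model.classes_.
    print(review, int(label), proba.round(2))
```

Calling `predict_proba` alongside `predict` also shows how confident the model is, which helps when a prediction looks suspicious.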

### Sentiment Classification with Support Vector Machine

Support Vector Machines (SVMs) are a family of classification algorithms that classify by finding the hyperplane that separates the classes in question. At their core, SVMs are linear classifiers: they choose the hyperplane that maximizes the margin, i.e. the distance between the hyperplane and the nearest observations of each class. They can be extended with kernel functions to learn non-linear decision boundaries.


SVMs with a linear kernel tend to work well on high-dimensional, sparse data such as text features. Given that our feature matrices are very sparse, let's use SVM to determine the sentiment.
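
As a small illustration, scikit-learn's `SVC` can be fitted directly on the SciPy sparse matrix that a vectorizer returns. The four training reviews and the names `toy_vectorizer` and `toy_svm` below are hypothetical.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Tiny hypothetical training set; 1 = positive, -1 = negative.
train_reviews = ["amazing food", "great service", "terrible food", "awful service"]
train_labels = [1, 1, -1, -1]

toy_vectorizer = TfidfVectorizer()
x_train = toy_vectorizer.fit_transform(train_reviews)  # SciPy sparse matrix

toy_svm = SVC(kernel="linear", C=1)
toy_svm.fit(x_train, train_labels)

# New text is transformed with the already fitted vectorizer.
x_new = toy_vectorizer.transform(["amazing service", "awful food"])
toy_predictions = toy_svm.predict(x_new)
print(issparse(x_train), toy_predictions)
```

Keeping the features sparse avoids materializing a large dense array, which matters when the vocabulary has thousands of columns.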


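Note that SVM has the following tuning parameters:

1. Kernel: the kernel function used to compute the decision boundary (for example, `linear` or `rbf`).
2. Gamma: the kernel coefficient for non-linear kernels such as `rbf`; it controls how far the influence of a single observation reaches. It is ignored by the linear kernel.
3. C Parameter: the regularization strength, which balances classifying the training observations correctly against keeping the decision boundary smooth.

Below is the implementation in Python: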

In [56]:
from sklearn.svm import SVC

svm_linear_bow = SVC(C=1,                # Regularization strength (1 is the default)
                     kernel='linear',    # Linear kernel
                     gamma=100,          # Has no effect with the linear kernel
                     probability=True,   # Enable predict_proba() (slower to fit)
                     random_state=42)

svm_linear_tfidf = SVC(C=1,                # Regularization strength (1 is the default)
                       kernel='linear',    # Linear kernel
                       gamma=100,          # Has no effect with the linear kernel
                       probability=True,   # Enable predict_proba() (slower to fit)
                       random_state=42)
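
To check how `gamma` interacts with the kernel choice, the sketch below fits `SVC` models on synthetic data (generated with `make_classification`, not the review features) using two very different `gamma` values for each of the linear and RBF kernels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data, used only to compare the fitted models.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

linear_small_gamma = SVC(kernel="linear", C=1, gamma=0.001).fit(X, y)
linear_large_gamma = SVC(kernel="linear", C=1, gamma=100).fit(X, y)
rbf_small_gamma = SVC(kernel="rbf", C=1, gamma=0.001).fit(X, y)
rbf_large_gamma = SVC(kernel="rbf", C=1, gamma=100).fit(X, y)

# Linear kernel: do the two gamma values give the same hyperplane?
same_linear = np.allclose(linear_small_gamma.coef_, linear_large_gamma.coef_)

# RBF kernel: do the two gamma values give the same decision function?
same_rbf = np.allclose(
    rbf_small_gamma.decision_function(X), rbf_large_gamma.decision_function(X)
)

print(same_linear, same_rbf)
```

With the linear kernel the two fits coincide, while with the RBF kernel they differ, so tuning `gamma` is only meaningful for the non-linear kernels.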

<br>

#### Fitting the model

Much like we did with Naive Bayes, we fit the SVM models using the `fit()` method and then compute the training and test accuracy.

In [57]:
svm_linear_bow.fit(x_train_bow, np.ravel(y_train_bow))

svm_linear_tfidf.fit(x_train_tfidf, np.ravel(y_train_tfidf))

In [58]:
print("BOW Training Accuracy:", round(accuracy_score(np.ravel(y_train_bow), svm_linear_bow.predict(x_train_bow)), 2))
print("BOW Test Accuracy:", round(accuracy_score(np.ravel(y_test_bow), svm_linear_bow.predict(x_test_bow)), 2))

BOW Training Accuracy: 1.0
BOW Test Accuracy: 1.0


In [59]:
print("TFIDF Training Accuracy:", round(accuracy_score(np.ravel(y_train_tfidf), svm_linear_tfidf.predict(x_train_tfidf)), 2))
print("TFIDF Test Accuracy:", round(accuracy_score(np.ravel(y_test_tfidf), svm_linear_tfidf.predict(x_test_tfidf)), 2))

TFIDF Training Accuracy: 1.0
TFIDF Test Accuracy: 0.99
