## 5. Yelp Dataset Feature Development

Text Feature Extraction

The next step in the pipeline is to convert text into numerical features that a model can use to predict the sentiment of the text. To that end, we cover the features that we will use to develop a sentiment analysis model. Namely, we will cover:

1. Bag of Words
2. Term Frequency Inverse Document Frequency (TF-IDF)


We will use the verified and preprocessed dataset to implement these techniques, but first let's explore these techniques independently.


### 1. Bag of Words

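The bag of words technique generates features by encoding, for each document, which vocabulary words it contains. Conceptually, all words in the corpus are placed into a bag (the vocabulary). Each document is then mapped to a vector with one entry per vocabulary word: 0 if the word is not present in the document and 1 if it is. Note that scikit-learn's `CountVectorizer`, used below, records how many times each word occurs rather than a strict 0/1 flag; the two coincide here because no word repeats within a document. To better understand how Bag of Words works, let's demonstrate it with a corpus of 5 relatively simple documents.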


In [1]:
import pandas as pd
from nltk import word_tokenize

corpus = [ "the restaurant had great food",
           "i love python programming",
           "i prefer R to python",
           "computers are fun to use",
           "i did not like the movie"] 

from sklearn.feature_extraction.text import CountVectorizer

bows_counter = CountVectorizer( analyzer = 'word',            # Word level vectorizer
                                lowercase = True,             # Lower case the text
                                ngram_range = (1, 1),         # Use unigrams (1-grams) only
                                tokenizer = word_tokenize,    # Use this tokenizer
                                stop_words = 'english',       # Remove English stop words
                                token_pattern = None )        # Not used when a tokenizer is supplied

bows_counter.fit(corpus)
features = bows_counter.transform(corpus).toarray()

In [2]:
features

array([[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]])

In [3]:
features_df = pd.DataFrame(features, columns=bows_counter.get_feature_names_out())
features_df

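| | computers | did | food | fun | great | like | love | movie | prefer | programming | python | r | restaurant | use |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |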


Notice that each sentence is its own document, that each row encodes one document, and that the columns together make up the vocabulary of the whole corpus (after stop word removal).

Notice also that the dataframe has 1-gram tokens and an encoding that shows whether a document contains each token. This set of features can help us model the sentiment of the text.

Another thing to notice is that the matrix can be quite sparse, depending on the size of the vocabulary and how often each word occurs. Therefore, it may be useful to limit the n-gram range and to filter features using frequency thresholds.
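To make the encoding and the frequency threshold concrete, here is a minimal sketch in plain Python rather than scikit-learn. The helper name `bag_of_words` and its `min_df` and `binary` arguments are illustrative assumptions that loosely mirror the `CountVectorizer` options, and the documents are assumed to be already lowercased, tokenized, and stripped of stop words.

```python
from collections import Counter

# Toy corpus: assumed to be already lowercased, tokenized, and free of stop words.
docs = [
    ["restaurant", "great", "food"],
    ["love", "python", "programming"],
    ["prefer", "r", "python"],
    ["computers", "fun", "use"],
    ["did", "like", "movie"],
]

def bag_of_words(docs, min_df=1, binary=False):
    """Hypothetical helper: return (vocabulary, matrix) for tokenized documents."""
    # Document frequency: the number of documents each token occurs in.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    # Keep tokens found in at least `min_df` documents, in alphabetical order.
    vocabulary = sorted(token for token, df in doc_freq.items() if df >= min_df)
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[token] for token in vocabulary]
        if binary:
            # Collapse counts to a strict present/absent flag.
            row = [min(count, 1) for count in row]
        matrix.append(row)
    return vocabulary, matrix

vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
print(matrix[0])

# Raising the threshold shrinks the vocabulary: only "python" occurs in two documents.
small_vocabulary, small_matrix = bag_of_words(docs, min_df=2)
print(small_vocabulary)
```

With the default threshold, the first row reproduces the first row of the feature matrix shown earlier. Raising `min_df` trades vocabulary coverage for a smaller, denser matrix.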

<br>

### 2. Term Frequency - Inverse Document Frequency

Term Frequency Inverse Document Frequency, a.k.a. TF-IDF, is a commonly used weighting technique that assigns weights reflecting the importance of a word to a document. The basis of this technique is the idea that if a word appears frequently across all documents, it is less likely to hold significant information about any specific document. On the other hand, words that appear frequently in one or a few documents and rarely across the rest of the corpus are considered to have specific importance and should be assigned higher weights.

The mathematical expression of tf-idf (in one of its many forms) is:

<br>

 
$$ \text{tf-idf}_{t,d} = \text{frequency}_{t,d} \times \log \frac{\text{total documents}}{\text{total documents containing the term}} $$

<br>

It is simply the number of times a word appears in a document multiplied by the logarithm of the total number of documents divided by the number of documents that contain the word.

Intuitively, a word that appears in every document has a ratio of 1, and since $\log 1 = 0$ its weight is zero no matter how often it occurs; words that appear in nearly all documents get a weight close to zero. Conversely, words with high frequency within a specific document and low frequency across the corpus will have a higher weight.


Let's see an example using our small corpus above.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer( analyzer='word',          # Word level vectorizer
                                    lowercase=True,           # Lowercase the text
                                    stop_words = 'english',
                                    tokenizer= word_tokenize, # Use this tokenizer
                                    token_pattern = None) 

tfidf_vectorizer.fit(corpus)
tfidf_features = tfidf_vectorizer.transform(corpus).toarray()

In [5]:
tfidf_df = pd.DataFrame(tfidf_features, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df

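| | computers | did | food | fun | great | like | love | movie | prefer | programming | python | r | restaurant | use |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.57735 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.614189 | 0.0 | 0.0 | 0.614189 | 0.495524 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.614189 | 0.0 | 0.495524 | 0.614189 | 0.0 | 0.0 |
| 3 | 0.57735 | 0.0 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.57735 |
| 4 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |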
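The weights in this table do not follow the simple formula exactly. By default, `TfidfVectorizer` uses a smoothed inverse document frequency, $\ln\frac{1 + n}{1 + df} + 1$, and then rescales each row to unit Euclidean length. The sketch below reproduces two of the rows by hand in plain Python; the helper name `tfidf_rows` is a hypothetical one, and the documents are assumed to be already tokenized with stop words removed.

```python
import math
from collections import Counter

# Toy corpus: assumed to be already lowercased, tokenized, and free of stop words.
docs = [
    ["restaurant", "great", "food"],
    ["love", "python", "programming"],
    ["prefer", "r", "python"],
    ["computers", "fun", "use"],
    ["did", "like", "movie"],
]

def tfidf_rows(docs):
    """Hypothetical helper: smoothed tf-idf with L2-normalized rows."""
    n_docs = len(docs)
    # Document frequency: the number of documents each token occurs in.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {token: math.log((1 + n_docs) / (1 + df)) + 1 for token, df in doc_freq.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = {token: count * idf[token] for token, count in counts.items()}
        # Rescale the row so that its Euclidean length is 1.
        norm = math.sqrt(sum(weight * weight for weight in raw.values()))
        rows.append({token: weight / norm for token, weight in raw.items()})
    return rows

rows = tfidf_rows(docs)
print(round(rows[0]["food"], 5))    # compare with row 0 of the table
print(round(rows[1]["love"], 6))    # compare with row 1 of the table
print(round(rows[1]["python"], 6))  # lower weight: "python" occurs in two documents
```

Because "python" occurs in two of the five documents, its idf is smaller than that of "love" or "programming", which is why it receives the lower weight within the same row.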


## Yelp Review: Generating CountVectorizer and TFIDF Vectorizer 

Now that we have an understanding and a template for how these methods work, we can apply them to the text data we just processed.

In [6]:
train_data = pd.read_csv('yelp_train_df.txt')
train_data.head()

| | label | text |
|---|---|---|
| 0 | 4 | food amazing service notch food quality ... |
| 1 | 0 | dear bill griffin ni photo time technician... |
| 2 | 3 | want fairly quick meal stay wynn terrace caf... |
| 3 | 1 | great restaurant s great location menu in... |
| 4 | 4 | magnificent incredible amazing fantastic... |


### CountVectorizer

In [9]:
review_countVectorizer = CountVectorizer( analyzer = 'word', 
                                          lowercase = True, 
                                          tokenizer = word_tokenize, 
                                          token_pattern = None, 
                                          stop_words = 'english', 
                                          ngram_range = (1, 1),
                                          min_df = 5,)

review_countVectorizer.fit( train_data.text )

In [12]:
bowords_features = pd.DataFrame( review_countVectorizer.transform(train_data.text).toarray(), columns=review_countVectorizer.get_feature_names_out() )
bowords_features.head()

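| | aa | aaa | aaron | ab | aback | abandon | abc | abd | abend | aber | ... | zu | zucchini | zum | zumanity | zumba | zupas | zur | zuzu | zwar | zwei |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |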


### TFIDF Vectorizer

In [17]:
review_tfidf_vectorizer = TfidfVectorizer( analyzer='word',          # Word level vectorizer
                                           lowercase=True,           # Lowercase the text
                                           stop_words = 'english',   # remove english stopwords
                                           min_df = 5,               # keep words found in >= 5 documents
                                           tokenizer= word_tokenize, # Use this tokenizer
                                           token_pattern = None) 

review_tfidf_vectorizer.fit( train_data.text )

In [18]:
tfidf_features = pd.DataFrame( review_tfidf_vectorizer.transform(train_data.text).toarray(), columns = review_tfidf_vectorizer.get_feature_names_out() )
tfidf_features.head()

Unnamed: 0,aa,aaa,aaron,ab,aback,abandon,abc,abd,abend,aber,...,zu,zucchini,zum,zumanity,zumba,zupas,zur,zuzu,zwar,zwei
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Pickling Vectorizers and Datasets for Modelling

Preprocessing and vectorizing a large dataset often takes a long time to run. It is therefore useful to save checkpoints so that the work does not have to be repeated. For the fitted vectorizers and feature datasets, we will use `pickle` to save their state so that the vectorizers can later be reloaded and applied to new text.

In [19]:
import pickle

def pickle_object(object_to_pickle, file_name):
    """
    Serializes an object and saves it to a file using the pickle protocol.
    
    This function takes any Python object and a file name as input. It serializes the object using pickle and 
    saves it to the specified file. This is particularly useful for saving model objects or data transformers 
    for later use.
    
    Parameters:
    - object_to_pickle: The Python object to serialize. This can be any object that pickle can handle, including
      custom classes, lists, dictionaries, etc.
    - file_name: The name of the file (with path, if necessary) where the serialized object will be saved. 
      It's recommended to use a '.pkl' extension for clarity.
    
    Returns:
    - None
    """
    try:
        with open(file_name, 'wb') as file:
            pickle.dump(object_to_pickle, file)
    except Exception as e:
        print(f"An error occurred while pickling the object: {e}")

In [20]:
data_objects = { 'yelp_tfidf_vectorizer.pk': review_tfidf_vectorizer, 
                 'yelp_count_vectorizer.pk': review_countVectorizer,
                 'yelp_count_features_dataframe.pk': bowords_features,
                 'yelp_tfidf_features_dataframe.pk': tfidf_features}

for name, obj in data_objects.items():
    pickle_object(obj, name)
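Saving is only half of the checkpoint; the objects also have to be read back in a later session. A minimal sketch of the reverse step follows. The function name `load_pickled_object` is a hypothetical counterpart to `pickle_object`, and a small dictionary stands in for a fitted vectorizer so the round trip can be run on its own.

```python
import os
import pickle
import tempfile

def load_pickled_object(file_name):
    """Hypothetical counterpart to pickle_object: read a pickled object back from disk."""
    with open(file_name, 'rb') as file:
        return pickle.load(file)

# Stand-in object; a fitted vectorizer or dataframe would be restored the same way.
example_object = {'min_df': 5, 'vocabulary': ['food', 'great', 'restaurant']}
example_path = os.path.join(tempfile.mkdtemp(), 'example_object.pk')

# Write the object to a temporary file, then read it back.
with open(example_path, 'wb') as file:
    pickle.dump(example_object, file)

restored_object = load_pickled_object(example_path)
print(restored_object == example_object)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code. Pickled scikit-learn objects are also best reloaded with the same library version that created them.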

<br>

# 2.  Sentiment Classifiers


### Naive Bayes Classifier

The Naive Bayes classification algorithm is generally an effective technique for classifying text. Derived from Bayes' theorem, it evaluates the probability of each sentiment class (for example, positive or negative) given the words contained in the review.
Mathematically, it all begins with Bayes' rule of conditional probability:

$$ P(A|B) = \frac {P(A)P(B|A)} {P(B)} $$

 


Translating it to the Naive Bayes Algorithm, we want to predict the sentiment of some text given the presence of words in the text. Mathematically, it can be expressed as:


 $$P(positive\ sentiment|w_1,w_2,w_3,...w_n) = \frac {P(positive\ sentiment)P(w_1,w_2,w_3,...w_n|positive\ sentiment)} {P(w_1, w_2, w_3,...,w_n)}$$


where $w_i$  is a word

The most important assumption of naive Bayes (and where it gets its name from) is conditional independence, which stipulates that the words are independent of one another given the class. This property is very useful because the joint likelihood then factorizes into one term per word, so we can write the probability, up to the normalizing denominator, as:


$$P(positive\ sentiment|w_1,w_2,w_3,...,w_n) \propto P(positive\ sentiment)\,P(w_1|positive\ sentiment)\,P(w_2|positive\ sentiment)\,P(w_3|positive\ sentiment) \cdots P(w_n|positive\ sentiment)$$




The above expansion can help us formulate a general formula for the class probability as follows:

$$P(sentiment|w_1,w_2,w_3,...,w_n) = \frac {1}{Z}\, P(sentiment) \prod_{i=1}^{n} P(w_i|sentiment) $$

 


where $Z$ is the normalizer, i.e. the probability of observing the words, $P(w_1, w_2, w_3,...,w_n)$, which is the same for every class.
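To make the posterior described above concrete, the following toy calculation works it out in plain Python. The training reviews, labels, and the smoothing value `alpha = 1.0` are made up for illustration, and the function name `posterior` is hypothetical. Word likelihoods use additive smoothing so that an unseen word does not force a probability of zero.

```python
import math
from collections import Counter

# Made-up, already tokenized training reviews with made-up labels.
training_data = [
    (["great", "food"], "positive"),
    (["great", "service"], "positive"),
    (["terrible", "food"], "negative"),
    (["terrible", "slow", "service"], "negative"),
]
alpha = 1.0  # additive smoothing strength (assumed value)

vocabulary = sorted({token for tokens, _ in training_data for token in tokens})
class_doc_counts = Counter(label for _, label in training_data)
word_counts = {label: Counter() for label in class_doc_counts}
for tokens, label in training_data:
    word_counts[label].update(tokens)

def posterior(tokens):
    """Hypothetical helper: P(class | tokens) under the naive independence assumption."""
    log_scores = {}
    for label, n_docs in class_doc_counts.items():
        # Log prior: the share of training reviews with this label.
        log_score = math.log(n_docs / len(training_data))
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # skip words never seen in training
            likelihood = (word_counts[label][token] + alpha) / (total_words + alpha * len(vocabulary))
            log_score += math.log(likelihood)
        log_scores[label] = log_score
    # Z: the sum of the unnormalized class scores, so the posteriors add up to 1.
    z = sum(math.exp(score) for score in log_scores.values())
    return {label: math.exp(score) / z for label, score in log_scores.items()}

result = posterior(["great", "food"])
print(result)
```

Working in log space and normalizing only at the end avoids multiplying many small probabilities together, which matters once reviews contain more than a handful of words.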

Enough with the math and theory, let's see this in action with Python


<br>

#### Naive Bayes Model Implementation in Python

The code below initializes two multinomial Naive Bayes models, one per feature set, with the additive (Laplace-style) smoothing parameter `alpha` set to 0.7.

In [29]:
from sklearn.naive_bayes import MultinomialNB
    
naive_bayes_bow = MultinomialNB(alpha=.7, fit_prior=True)
naive_bayes_tfidf = MultinomialNB(alpha=.7, fit_prior=True)

In [30]:
naive_bayes_bow.fit(bowords_features, train_data.label)

In [31]:
naive_bayes_tfidf.fit(tfidf_features, train_data.label)

<br>

#### Train and Test Assessment

After training the model, we can compute the accuracy on the training and test sets. This helps us determine how well our model performs on new reviews.

In [32]:
test_data = pd.read_csv('yelp_test_df.txt')
test_data.head()

| | label | text |
|---|---|---|
| 0 | 0 | cox great speak customer service person juan ... |
| 1 | 3 | las vegas work time look cheap strip room s... |
| 2 | 4 | visit morton s night dinner seat waiter gr... |
| 3 | 1 | look forward posh time idea chef create mea... |
| 4 | 1 | thing save joint star tater tot ballanty... |


In [33]:
test_bow_features = pd.DataFrame( review_countVectorizer.transform(test_data.text).toarray(), columns = review_countVectorizer.get_feature_names_out() )
test_tfidf_features = pd.DataFrame( review_tfidf_vectorizer.transform(test_data.text).toarray(), columns = review_tfidf_vectorizer.get_feature_names_out() )

In [34]:
from sklearn.metrics import accuracy_score

# accuracy_score expects the true labels first, then the predictions
print("BOW Training Accuracy:", round( accuracy_score(train_data.label, naive_bayes_bow.predict(bowords_features)), 2) )
print("BOW Test Accuracy:", round( accuracy_score(test_data.label, naive_bayes_bow.predict(test_bow_features)), 2))


print("TFIDF Training Accuracy:", round( accuracy_score(train_data.label, naive_bayes_tfidf.predict(tfidf_features)), 2) )
print("TFIDF Test Accuracy:", round( accuracy_score(test_data.label, naive_bayes_tfidf.predict(test_tfidf_features)), 2))

BOW Training Accuracy: 0.62
BOW Test Accuracy: 0.5
TFIDF Training Accuracy: 0.64
TFIDF Test Accuracy: 0.5
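With several rating classes, an accuracy figure is easier to interpret next to a simple baseline, such as always predicting the most common label. The sketch below shows both computations in plain Python. The label lists are made up for illustration and are not the Yelp results, and the function name `accuracy` is hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    matches = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return matches / len(y_true)

# Made-up labels on the same 0 to 4 scale as the reviews.
y_true = [4, 0, 3, 1, 4, 4, 2, 0]
y_pred = [4, 0, 4, 1, 3, 4, 2, 1]

# Baseline: always predict the most frequent true label.
majority_label, _ = Counter(y_true).most_common(1)[0]
baseline_pred = [majority_label] * len(y_true)

model_accuracy = accuracy(y_true, y_pred)
baseline_accuracy = accuracy(y_true, baseline_pred)
print("Model accuracy:", model_accuracy)
print("Majority-class baseline:", baseline_accuracy)
```

A model is only useful to the extent that it beats this baseline. The gap between training and test accuracy shows how much of the fit fails to generalize to unseen reviews.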


### Predicting New Reviews on Naive Bayes Model

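To run the model on reviews outside of the train and test sets, we must apply the same preprocessing steps and the same fitted vectorizers. Below is an example. For brevity, the two short reviews are passed directly to the fitted vectorizers, which lowercase, tokenize, and remove stop words; any additional cleaning applied to the training text should be applied to new reviews as well.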

In [44]:
pos_review = 'the food at the restaurant was incredible'
neg_review = 'the food at the restaurant was absolutely terrible'

In [45]:
bow_tests = review_countVectorizer.transform([pos_review, neg_review])
tfidf_tests = review_tfidf_vectorizer.transform([pos_review, neg_review])

In [46]:
naive_bayes_bow.predict(bow_tests), naive_bayes_tfidf.predict(tfidf_tests)

(array([4, 0]), array([4, 0]))

### Sentiment Classification with Support Vector Machine

Support Vector Machines (SVMs) are a family of classification algorithms that classify by determining the hyperplane that separates the classes in question. In their basic form SVMs are linear classifiers, and they can be extended with kernel functions to learn non-linear boundaries. In both cases, the goal is the hyperplane that maximizes the distance (the margin) between the observations of different classes.


Linear SVMs tend to work well with high-dimensional, sparse data such as text features. Given that we have very sparse feature matrices, let's use an SVM to determine the sentiment.
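To make the idea of a maximum-margin hyperplane concrete, the sketch below evaluates the soft-margin objective, $\frac{1}{2}\lVert w \rVert^2 + C \sum_i \max(0,\ 1 - y_i(w \cdot x_i + b))$, for one fixed hyperplane. The points, labels, weights, and bias are made-up values, and the function names are hypothetical. The sketch does not train an SVM; it only shows what the optimizer trades off.

```python
def decision_value(weights, bias, point):
    """Signed score w . x + b; its sign is the predicted side of the hyperplane."""
    return sum(w * x for w, x in zip(weights, point)) + bias

def soft_margin_objective(weights, bias, points, labels, C=1.0):
    """0.5 * ||w||^2 + C * sum of hinge losses; labels must be -1 or +1."""
    regularizer = 0.5 * sum(w * w for w in weights)
    hinge_total = sum(
        max(0.0, 1.0 - label * decision_value(weights, bias, point))
        for point, label in zip(points, labels)
    )
    return regularizer + C * hinge_total

# Made-up 2-D points; the last one lies on the wrong side of the hyperplane.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (1.0, 0.0), (2.0, 1.5)]
labels = [1, 1, -1, -1, -1]
weights, bias = (1.0, 1.0), -3.0

# The margin width is 2 / ||w||: smaller weights mean a wider margin.
margin_width = 2.0 / sum(w * w for w in weights) ** 0.5
objective_low_c = soft_margin_objective(weights, bias, points, labels, C=1.0)
objective_high_c = soft_margin_objective(weights, bias, points, labels, C=10.0)
print(margin_width, objective_low_c, objective_high_c)
```

The first term rewards a wide margin, while the second penalizes points that fall inside the margin or on the wrong side. A larger `C` makes each such violation more expensive, which pushes the optimizer toward fitting the training data more closely.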


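Notice that SVM has the following tuning parameters:

- Kernel: Specifies the kernel function used when determining the decision boundary
- Gamma: Kernel coefficient that controls how far the influence of a single observation reaches; in scikit-learn it applies to the `rbf`, `poly`, and `sigmoid` kernels and has no effect with a linear kernel
- C Parameter: Balance between classifying the training observations correctly (larger `C`) and keeping a smooth, wide-margin boundary (smaller `C`)

Below is the implementation in Python: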

In [47]:
from sklearn.svm import SVC

svm_linear_bow =  SVC( C=1,                # Setting C at its default value
                       kernel='linear',    # Using a linear kernel
                       gamma=100,          # Ignored by the linear kernel
                       probability=True,   # Enable probability estimates (slower to fit)
                       random_state= 42)

svm_linear_tfidf = SVC( C=1,                # Setting C at its default value
                        kernel='linear',    # Using a linear kernel
                        gamma=100,          # Ignored by the linear kernel
                        probability=True,   # Enable probability estimates (slower to fit)
                        random_state= 42)

<br>

#### Fitting the model

Much like we did with Naive Bayes, we fit the model for SVM using the `fit()` method

In [None]:
svm_linear_bow.fit(bowords_features, train_data.label)

svm_linear_tfidf.fit(tfidf_features, train_data.label)

In [None]:
print("Bow Training Accuracy:", round( accuracy_score(train_data.label, svm_linear_bow.predict(bowords_features)), 2) )
print("Bow Test Accuracy:", round( accuracy_score(test_data.label, svm_linear_bow.predict(test_bow_features)), 2) )

In [None]:
print("TFIDF Training Accuracy:", round( accuracy_score(train_data.label, svm_linear_tfidf.predict(tfidf_features)), 2) )
print("TFIDF Test Accuracy:", round( accuracy_score(test_data.label, svm_linear_tfidf.predict(test_tfidf_features)), 2))

In [None]:
pickle_object(naive_bayes_bow, 'naive_bayes_bow.pk')
pickle_object(naive_bayes_tfidf, 'naive_bayes_tfidf.pk')
pickle_object(svm_linear_bow, 'svm_linear_bow.pk')
pickle_object(svm_linear_tfidf, 'svm_linear_tfidf.pk')