# Naive Bayes Classification

## Lecture Overview

- Classification as a Machine Learning Problem and Human Learning Problem
- Naive Bayes Classification Formal Definitions
- Naive Bayes Classification Example
- Refactoring Code for Naive Bayes Classification
  - Naive Bayes Classification Example: Spam Filtering
  - Naive Bayes Classification Example: Twitter Sentiment Analysis
- Beyond Naivete: Where Naive Bayes Classification Breaks Down

## Classification as a Machine Learning Problem and Human Learning Problem

<img align="right" style="float: right; padding: 0px 0px 0px 3px;" src="https://pictures.abebooks.com/inventory/31359579518.jpg" height="500" width="400"> "Categorization is not a matter to be taken lightly. There is nothing more basic than categorization to our thought, perception, action, and speech. Every time we see something as a kind of thing, for example, a tree, we are categorizing. Whenever we reason about kinds of things--chairs, nations, illnesses, emotions, any kind of thing at all--we are employing categories." [5-6] --[George Lakoff](https://en.wikipedia.org/wiki/George_Lakoff)


In both machine learning and human learning, we can think of classification as the problem of assigning a label to an input. In both cases the labels are discrete categories, the inputs are described by features, and the goal is to learn a function that maps inputs to labels.

Some examples of classification problems:

* Spam detection
* Sentiment analysis
* Language identification
* Authorship identification
* Topic identification
* Part-of-speech tagging

"The goal of classification is to take a single observation, extract some useful features, and thereby classify the observation into one of a set of discrete classes." [59] Jurafsky and Martin.

## Naive Bayes Classification Formal Definitions

Machine learning overview:

<img src="../images/Neuron.drawio.png">

### Inputs, Outputs, and Features

* `Inputs = features = x`

* `Outputs = labels or categories = y`, where $y \in \{c_1, c_2, \ldots, c_n\}$

* Training set = $X_{train} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$ 

* Validation set = $X_{val} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$

* Test set = $X_{test} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$
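
As a rough sketch of how one labeled dataset might be divided into these three sets: the helper name `split_dataset`, the toy data, and the 80/10/10 proportions below are illustrative assumptions, not part of the lecture.

```python
import random

def split_dataset(pairs, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle (input, label) pairs and slice them into train, validation, and test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Toy data: 20 (feature, label) pairs with two classes
data = [(i, "spam" if i % 2 else "ham") for i in range(20)]
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))
```

The three lists do not overlap, so the validation and test sets contain only examples the model never saw during training.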

### Bayesian Inference

Thomas Bayes (c. 1701-1761) was an English mathematician and statistician. He is best known for Bayes' theorem, a fundamental result in statistics and machine learning. It is used to compute the posterior probability of an event given some evidence.

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$


Thus, when we train we estimate the prior $P(c)$ and the likelihood $P(d|c)$ from labeled data. When we test we use them to compute the posterior probability of each label $c$ given a document $d$ and choose the most probable label.

$$\hat{c} = \operatorname*{argmax}_{c\in C} P(c|d) = \operatorname*{argmax}_{c\in C} \frac{P(d|c)P(c)}{P(d)}$$

We can simplify the above equation by dropping the denominator $P(d)$ because it is the same for all $c$.

$$\hat{c} = \operatorname*{argmax}_{c\in C} P(c|d) = \operatorname*{argmax}_{c\in C} P(d|c)P(c)$$
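
A small numeric sketch may make this concrete. The priors and likelihoods below are made-up numbers for a single hypothetical document; the point is only that the class with the highest posterior is the same whether or not we divide by $P(d)$.

```python
# Hypothetical priors P(c) and likelihoods P(d|c) for one document d
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {"spam": 0.02, "ham": 0.005}

# Unnormalized scores P(d|c) * P(c)
scores = {c: likelihoods[c] * priors[c] for c in priors}

# Evidence P(d), obtained by summing the scores over all classes
evidence = sum(scores.values())

# Full posteriors P(c|d)
posteriors = {c: scores[c] / evidence for c in scores}

best_with_evidence = max(posteriors, key=posteriors.get)
best_without_evidence = max(scores, key=scores.get)
print(posteriors)
print(best_with_evidence, best_without_evidence)
```

Dividing every score by the same positive constant cannot change their ordering, which is why the denominator can be dropped when only the argmax is needed.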

### Prior Probability and Likelihood

Choose the class with the highest product of two probabilities: the prior probability of the class and the likelihood of the data given the class.

$$\hat{c} = \operatorname*{argmax}_{c\in C} P(c|d) = \operatorname*{argmax}_{c\in C} \underbrace{P(d|c)}_{\text{likelihood}} \; \underbrace{P(c)}_{\text{prior}}$$


### Naive Bayes Assumption

We have used the bag of words method in previous lectures, which assumes that the order of the words in a document does not matter. We now add a second assumption: the words in a document are independent of each other given the class. This is the Naive Bayes assumption. It is a useful assumption because it allows us to simplify the computation of the likelihood. We can compute the likelihood by multiplying the probabilities of the individual words.

$$P(d|c) = P(w_1, w_2, \ldots, w_n|c) = P(w_1|c)P(w_2|c)\ldots P(w_n|c)$$

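Ergo,

$$c_{NB} = \operatorname*{argmax}_{c\in C} P(c) \prod_{i\in positions} P(w_i|c)$$

Here positions = all word positions in the document. Multiplying many small probabilities is prone to floating-point underflow, so in practice we work in log space, which turns the product into a sum.

$$c_{NB} = \operatorname*{argmax}_{c\in C} \left[ \log P(c) + \sum_{i\in positions} \log P(w_i|c) \right]$$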
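
A brief sketch of what the log form buys numerically. It uses a made-up per-word probability that is the same at every position (an assumption chosen only to keep the arithmetic simple).

```python
import math

# Hypothetical per-word probability P(w_i|c), identical at every position
p_word = 1e-4
n_positions = 100

# Multiplying 100 small probabilities underflows to exactly 0.0 in 64-bit floats
product = 1.0
for _ in range(n_positions):
    product *= p_word

# Summing the logs of the same probabilities keeps a finite, comparable score
log_sum = sum(math.log(p_word) for _ in range(n_positions))
print(product, log_sum)
```

Once the product has collapsed to zero, two classes can no longer be compared, whereas their log scores remain distinct.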

### Summary of Naive Bayes

Naive Bayes is a statistical classification technique based on Bayes' theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Consider classifying a fruit from several of its features. Even if these features depend on each other or upon the existence of the other features, the classifier treats each of them as contributing independently to the probability that a particular fruit is an apple or an orange or a banana, and that is why it is known as 'naive'.

Bayes theorem provides a way of calculating posterior probability $P(c|x)$ from $P(c)$, $P(x)$, and $P(x|c)$. Look at the equation below:

$$P(c|x) = \frac{P(x|c) \cdot P(c)}{P(x)}$$

Here,
- $P(c|x)$ is the posterior probability of class $c$ given predictor $x$,
- $P(c)$ is the prior probability of class $c$,
- $P(x|c)$ is the likelihood, which is the probability of predictor $x$ given class $c$,
- $P(x)$ is the prior probability of the predictor (the evidence).


Informally, Naive Bayes classifier is a fast, easy to understand, and highly scalable algorithm that's a good choice when dealing with text data and categorical data. Despite its simplicity and the naive assumption of feature independence, it can perform surprisingly well and is widely used in practice including spam filtering, sentiment analysis, and many other classification problems.

### A Note on Evaluation

* gold labels = human labels of the input data
* predicted labels = labels predicted by the model

* precision = $\frac{TP}{TP + FP}$
* recall = $\frac{TP}{TP + FN}$
* F-measure = $F_\beta = \frac{(\beta^2 + 1)\times precision \times recall}{\beta^2 \times precision + recall}$, where F1 is the special case $\beta = 1$
* accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$
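
The four metrics above can be sketched directly from confusion counts. The function name `precision_recall_fbeta` and the counts used below are hypothetical, chosen only for illustration.

```python
def precision_recall_fbeta(tp, fp, fn, tn, beta=1.0):
    """Compute precision, recall, F-beta, and accuracy from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta ** 2 * precision + recall
    f_beta = (beta ** 2 + 1) * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f_beta, accuracy

# Hypothetical confusion counts: 40 TP, 10 FP, 20 FN, 30 TN
p, r, f1, acc = precision_recall_fbeta(tp=40, fp=10, fn=20, tn=30)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f} accuracy={acc:.3f}")
```

With `beta=1` precision and recall are weighted equally; a larger `beta` weights recall more heavily.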


## Naive Bayes Classification Example

Let's code out the above in practice. We will create our own dataset for the first example.


In [None]:
import numpy as np

class NaiveBayes:
    
    def fit(self, X, y):
        """Estimate the mean, variance, and prior of each class from X and y."""
        # get the number of samples and features
        n_samples, n_features = X.shape
        self._classes = np.unique(y)
        n_classes = len(self._classes)
        
        # initialize to zeros
        self._mean = np.zeros((n_classes, n_features), dtype=np.float64)
        self._var = np.zeros((n_classes, n_features), dtype=np.float64)
        self._priors = np.zeros(n_classes, dtype=np.float64)
        
        # calculate the mean, variance, and prior for each class
        for idx, c in enumerate(self._classes):
            X_c = X[y == c]
            self._mean[idx, :] = X_c.mean(axis=0) # numpy.mean
            self._var[idx, :] = X_c.var(axis=0) # numpy.var
            self._priors[idx] = X_c.shape[0] / float(n_samples) # number of samples
    
    def predict(self, X):
        y_pred = [self._predict(x) for x in X]
        return np.array(y_pred)

    def _predict(self, x):
        posteriors = []
        
        # calculate the log posterior (up to a constant) for each class
        for idx, c in enumerate(self._classes):
            prior = np.log(self._priors[idx])
            log_likelihood = np.sum(np.log(self._pdf(idx, x))) # gaussian distribution
            posterior = log_likelihood + prior
            posteriors.append(posterior)
        
        # return class with highest posterior probability
        return self._classes[np.argmax(posteriors)]
    
    def _pdf(self, class_idx, x):
        mean = self._mean[class_idx]
        var = self._var[class_idx]
        numerator = np.exp(- (x - mean) ** 2 / (2 * var))
        denominator = np.sqrt(2 * np.pi * var)
        return numerator / denominator
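
The `_pdf` method evaluates the Gaussian density and `_predict` then takes its log. As an optional illustration, not the lecture's implementation, the log-density can also be written directly, which stays finite when the density itself underflows. The function name `gaussian_log_pdf` and the `eps` variance floor (a guard against zero-variance features) are assumptions added here.

```python
import math

def gaussian_log_pdf(x, mean, var, eps=1e-9):
    """Log of the normal density N(x; mean, var), with a small variance floor."""
    var = var + eps
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

# Close to the mean: agrees with log(pdf) computed the long way
x, mean, var = 1.0, 0.0, 1.0
direct = math.log(math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var))
print(gaussian_log_pdf(x, mean, var), direct)

# Far from the mean: exp() underflows to 0.0, but the log form stays finite
underflowed = math.exp(-(50.0 ** 2) / 2)
far_log = gaussian_log_pdf(50.0, 0.0, 1.0)
print(underflowed, far_log)
```

Taking `np.log` of a density that has underflowed to zero yields `-inf`, so a direct log-density is one possible refinement when features lie far from a class mean.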

In [None]:
## Data set creation
from sklearn.model_selection import train_test_split
from sklearn import datasets

def accuracy(y_true, y_pred):
    accuracy = np.sum(y_true == y_pred) / len(y_true)
    return accuracy

X, y = datasets.make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=123)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

nb = NaiveBayes()

nb.fit(X_train, y_train)

predictions = nb.predict(X_test)

print(f'Naive bayes classification accuracy: {accuracy(y_test, predictions):.2f}')

In [None]:
len(X_train), len(X_test), len(y_train), len(y_test)

In [None]:
## Print a confusion matrix using the scikit-learn library
from sklearn.metrics import confusion_matrix

# Create the confusion matrix
cfm = confusion_matrix(y_test, predictions)

## Plot the confusion matrix
import seaborn as sns

## Visualize the confusion matrix
sns.heatmap(cfm, annot=True, fmt='d')


In [None]:
## Let's create a scatter plot of the features in the dataset
import matplotlib.pyplot as plt

# Scatter the first two features, colored by class label
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
plt.show()

In [None]:
## Plot the first two features of the test data, colored by predicted label

plt.scatter(X_test[:, 0], X_test[:, 1], c=predictions, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
plt.show()

In [None]:
## Create the distribution plot of the features
sns.histplot(X[:, 0], kde=True, color='red', label='Feature 1')
sns.histplot(X[:, 1], kde=True, color='blue', label='Feature 2')

In [None]:
## Plot the distribution of the test data

sns.histplot(X_test[:, 0], kde=True, color='red', label='Feature 1')
sns.histplot(X_test[:, 1], kde=True, color='blue', label='Feature 2')

## Refactoring Code for Naive Bayes Classification

### Naive Bayes Classification Example: Twitter Sentiment Analysis

In [None]:
import nltk
import numpy as np
import pandas as pd

from nltk.corpus import stopwords
from nltk.corpus import twitter_samples
from nltk.tokenize import TweetTokenizer

In [None]:
## Download the nltk data
nltk.download('twitter_samples')
nltk.download('stopwords')

In [None]:
## Load the positive and negative tweets

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

len(positive_tweets), len(negative_tweets)

In [None]:
## Let's plot the distribution of tweet lengths (in characters) for positive and negative tweets

sns.histplot([len(tweet) for tweet in positive_tweets], kde=True, color='green', label='Positive')
sns.histplot([len(tweet) for tweet in negative_tweets], kde=True, color='red', label='Negative')

In [None]:
## Let's use sklearn to split the data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(positive_tweets + negative_tweets,
                                                    np.append(np.ones(len(positive_tweets)),
                                                    np.zeros(len(negative_tweets))),
                                                    test_size=0.2,
                                                    random_state=123)

len(X_train), len(X_test), len(y_train), len(y_test)


In [None]:
X_train[0], X_test[0], y_train[0], y_test[0]

In [None]:
## Plot the distribution of tweet lengths (in characters) in the training and test sets

sns.histplot([len(tweet) for tweet in X_train], kde=True, color='green', label='Train', legend=True)
sns.histplot([len(tweet) for tweet in X_test], kde=True, color='red', label='Test', legend=True)

In [None]:
# Let's plot the distribution of the classes in the training and test sets

sns.barplot(x=['Positive', 'Negative'], y=[len(y_train[y_train == 1]), len(y_train[y_train == 0])])

In [None]:
sns.barplot(x=['Positive', 'Negative'], y=[len(y_test[y_test == 1]), len(y_test[y_test == 0])])

### Preprocessing and cleaning the tweets

In [None]:
### Preprocess the data
import re
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def process_tweet(tweet):
    """Input: tweet a string containing a tweet
    Return:
    tweets_clean: a list of words containing the processed tweet
    """
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    
    # remove retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    
    # remove the hash sign (#) but keep the hashtag word
    tweet = re.sub(r'#', '', tweet)
    
    # Instantiate stemmer class
    stemmer = PorterStemmer()
    
    # Create stopwords list
    stopwords_english = stopwords.words('english')
    
    # Instantiate the tokenizer (lowercases, strips @handles, shortens repeated characters)
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    
    # Tokenize the tweet
    tweet_tokens = tokenizer.tokenize(tweet)
    
    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and word not in string.punctuation):
            stem_word = stemmer.stem(word)
            tweets_clean.append(stem_word)
    
    return tweets_clean
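
To see what the four regular expressions in `process_tweet` do before tokenization, here is a standalone sketch that needs only the `re` module. The sample tweet and its URL are invented for illustration.

```python
import re

# Invented sample tweet with a retweet marker, hashtag, ticker, and link
tweet = "RT @user: Loving #NLP today! $GE is up https://example.com/page"

tweet = re.sub(r'\$\w*', '', tweet)                 # drop tickers such as $GE
tweet = re.sub(r'^RT[\s]+', '', tweet)              # drop a leading "RT "
tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)  # drop the URL through end of line
tweet = re.sub(r'#', '', tweet)                     # drop '#', keep the hashtag word
print(repr(tweet))
```

Two details are worth noticing. The `@user` handle survives this stage because it is removed later by the tokenizer (`strip_handles=True`), and the hyperlink pattern discards everything from the URL to the end of the line, including any text that follows the link.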

### Create a dictionary of words and their frequencies

In [None]:
## Let's create a dictionary of words and their frequencies
def count_tweets(result, tweets, ys):
    """Input:
    result: a dictionary that will contain the frequency of each pair (word, label)
    tweets: a list of tweets
    ys: an m x 1 array with the sentiment label of each tweet (either 0 or 1)
    """
    # iterate through each tweet and its label
    for y, tweet in zip(ys, tweets):
        # process the tweet to get the words in the form of a list
        for word in process_tweet(tweet):
            # increment the word count for the pair (word, label)
            pair = (word, y)
            if pair in result:
                result[pair] += 1
            else:
                result[pair] = 1
    return result
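
The shape of the resulting dictionary is easier to see on a toy input. The sketch below uses pre-tokenized, made-up tweets as stand-ins for the output of `process_tweet`, and a `collections.Counter` in place of the explicit `if/else` (an alternative, not the code above).

```python
from collections import Counter

# Pre-tokenized toy tweets (stand-ins for the output of process_tweet)
tweets = [["happi", "day"], ["sad", "day"], ["happi", "happi"]]
labels = [1.0, 0.0, 1.0]

freqs_demo = Counter()
for tokens, label in zip(tweets, labels):
    for word in tokens:
        # key is the (word, label) pair, value is how often it occurred
        freqs_demo[(word, label)] += 1

print(dict(freqs_demo))
```

The same word can appear under both labels, as `day` does here, which is what later lets us compare how strongly a word is associated with each class.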

### Train our Naive Bayes classifier

To train our model, we first need to build a dictionary of the tokens in our training set. Let's call the dictionary `freqs`.

In [None]:
## Let's train our model

freqs = count_tweets({}, X_train, y_train)

### Examine the results dictionary

In [None]:
freqs

### Define a Naive Bayes classifier

In [None]:
## Let's create a function to train the Naive Bayes classifier

def naive_bayes(freqs, X_train, y_train):
    """Train a Naive Bayes classifier on twitter data.

    Args:
        freqs (dict): dictionary of (word, label): frequency pairs
        X_train (list): list of tweets
        y_train (array): sentiment label of each tweet (either 0 or 1)
        
    Returns:
        logprior (float): log ratio of positive to negative tweet counts
        loglikelihood (dict): dictionary of word: log likelihood ratio pairs
    """
    ## Compare the code here with Jurafsky and Martin's pseudocode
    
    loglikelihood = {}
    logprior = 0
    
    vocab = set([pair[0] for pair in freqs.keys()]) # words in the vocabulary
    V = len(vocab) # number of unique words in the vocabulary
    
    # Calculate N_pos and N_neg (total word counts in positive and negative tweets)
    N_pos, N_neg = 0, 0
    
    # Sum the frequencies of all (word, label) pairs for each class
    for pair in freqs.keys():
        # positive tweets
        if pair[1] > 0:
            N_pos += freqs[pair]
        # negative tweets
        else:
            N_neg += freqs[pair]
    
    # Documents = total number of tweets
    D = len(X_train)
    
    # Calculate # of positive and negative documents
    D_pos = np.sum(y_train)
    D_neg = D - D_pos
    
    logprior = np.log(D_pos) - np.log(D_neg)
    
    for word in vocab:
        
        freq_pos = freqs.get((word, 1.0), 0)
        freq_neg = freqs.get((word, 0.0), 0)
        
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)
        
        loglikelihood[word] = np.log(p_w_pos / p_w_neg)
    
    return logprior, loglikelihood
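
The add-one (Laplace) smoothing and the log ratio can be followed by hand on a tiny example. The counts below are invented, and the names `freqs_demo` and `llr_demo` are illustrative only.

```python
import math

# Toy (word, label) counts; label 1.0 = positive, 0.0 = negative
freqs_demo = {("happi", 1.0): 3, ("day", 1.0): 1, ("sad", 0.0): 2, ("day", 0.0): 1}

vocab = {word for word, _ in freqs_demo}
V = len(vocab)
# Total word counts per class
N_pos = sum(n for (_, label), n in freqs_demo.items() if label > 0)
N_neg = sum(n for (_, label), n in freqs_demo.items() if label == 0)

llr_demo = {}
for word in vocab:
    # Add-one smoothing keeps unseen (word, class) pairs from getting probability 0
    p_w_pos = (freqs_demo.get((word, 1.0), 0) + 1) / (N_pos + V)
    p_w_neg = (freqs_demo.get((word, 0.0), 0) + 1) / (N_neg + V)
    llr_demo[word] = math.log(p_w_pos / p_w_neg)

for word in sorted(llr_demo):
    print(word, round(llr_demo[word], 3))
```

Words seen mostly in positive tweets get a positive log ratio, words seen mostly in negative tweets get a negative one, and words seen about equally in both land near zero.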
    

In [None]:
## Let's test our function
logprior, loglikelihood = naive_bayes(freqs, X_train, y_train)

In [None]:
logprior, loglikelihood # log prior ratio and per-word log likelihood ratios

In [None]:
## Let's test our model

def predict_naive_bayes(tweet, logprior, loglikelihood):
    """Input:
    tweet: a string
    logprior: a number
    loglikelihood: a dictionary of words mapping to numbers
    Output:
    p: the sum of the log likelihood ratios of each word in the tweet (if found in the dictionary) + logprior (a number); p > 0 indicates positive sentiment
    """
    word_l = process_tweet(tweet)
    
    p = 0
    p += logprior
    
    for word in word_l:
        if word in loglikelihood:
            p += loglikelihood[word]
    
    return p
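
A compact sketch of the same scoring rule on already tokenized input. The parameter values and the helper name `score` are hypothetical; words missing from the dictionary contribute nothing, mirroring the `if word in loglikelihood` check above.

```python
# Hypothetical trained parameters
logprior_demo = 0.0
llr_demo = {"happi": 1.2, "sad": -1.3, "day": -0.1}

def score(tokens, logprior, loglikelihood):
    """Sum the log prior and the log likelihood ratios of the known tokens."""
    return logprior + sum(loglikelihood.get(tok, 0.0) for tok in tokens)

for tokens in (["happi", "day"], ["sad", "day"], ["unseen", "word"]):
    s = score(tokens, logprior_demo, llr_demo)
    # A score above 0 is read as positive; 0 or below as negative
    print(tokens, round(s, 2), "positive" if s > 0 else "negative")
```

A tweet made up entirely of unknown words scores exactly the log prior, so with balanced classes it falls on the decision boundary and is labeled negative by the `> 0` rule.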

In [None]:
## Let's use our tweets from above

test_tweets = [
    'This is a happy tweet because I am learning NLP',
    'The AI revolution will not silence us! We will fight for our rights!',
    'Yesterday was a sad day for me. I lost my job.',
]

for t in test_tweets:
    print( '%s -> %f' % (t, predict_naive_bayes(t, logprior, loglikelihood)))

In [None]:
## Let's test our model on the test set

def test_naive_bayes(X_test, y_test, logprior, loglikelihood):
    """Input:
    X_test: a list of tweets
    y_test: (m, 1) array with the sentiment label of each tweet (either 0 or 1)
    logprior: a number
    loglikelihood: a dictionary of words mapping to numbers
    Output:
    accuracy: (# of tweets classified correctly) / (total # of tweets)
    """
    
    # Let's score the accuracy of our model
    accuracy = 0
    
    # Our predictions will be stored in y_hat
    y_hat = []
    
    for tweet in X_test:
        if predict_naive_bayes(tweet, logprior, loglikelihood) > 0:
            y_hat.append(1)
        else:
            y_hat.append(0)
    
    # error is the average of the absolute values of the differences between y_hat and y_test
    error = np.mean(np.abs(y_hat - y_test))
    
    accuracy = 1 - error
    
    return accuracy

In [None]:
## Let's test our function
print("Naive Bayes accuracy = %f" % test_naive_bayes(X_test, y_test, logprior, loglikelihood))

In [None]:
## Let's visualize some examples
X_test[0:10], y_test[0:10]

## Beyond Naivete: Where Naive Bayes Classification Breaks Down

In [None]:
def get_ratio(freqs, word):
    '''
    Input:
        freqs: dictionary containing the words
        word: string to lookup

    Output: a dictionary with keys 'positive', 'negative', and 'ratio'.
        Example: {'positive': 10, 'negative': 20, 'ratio': 0.5}
    '''
    pos_neg_ratio = {'positive': 0, 'negative': 0, 'ratio': 0.0}
    
    
    pos_neg_ratio['positive'] = freqs.get((word, 1.0), 0)

    pos_neg_ratio['negative'] = freqs.get((word, 0.0), 0)

    # calculate the ratio of positive to negative counts for the word
    pos_neg_ratio['ratio'] = (pos_neg_ratio['positive'] + 1)/(pos_neg_ratio['negative'] + 1)
    
    return pos_neg_ratio


def threshold_lookup(freqs, label, threshold):
    """Input:
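    freqs: dictionary of (word, label): frequency pairs
    label: 1 to select words at or above the threshold, 0 to select words at or below it
    threshold: cutoff on the positive-to-negative ratio of a word
    Output:
    words: dictionary mapping each selected word to its get_ratio result
    """

    words = {}

    for key in freqs.keys():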
        word, _ = key

        # get the positive/negative ratio for a word
        pos_neg_ratio = get_ratio(freqs, word)

        # if the label is 1 and the ratio is greater than or equal to the threshold...
        if label == 1 and pos_neg_ratio['ratio'] >= threshold:

            # Add the pos_neg_ratio to the dictionary
            words[word] = pos_neg_ratio

        # If the label is 0 and the pos_neg_ratio is less than or equal to the threshold...
        elif label == 0 and pos_neg_ratio['ratio'] <= threshold:

            # Add the pos_neg_ratio to the dictionary
            words[word] = pos_neg_ratio
            
    return words

In [None]:
get_ratio(freqs, 'peopl')

In [None]:
threshold_lookup(freqs, label=1, threshold=2.0)

In [None]:
# Error analysis: print the test tweets that the model misclassifies
print('Truth Predicted Tweet')
for x, y in zip(X_test, y_test):
    # get the label prediction for the tweet
    y_hat = predict_naive_bayes(x, logprior, loglikelihood)
    
    # if the prediction is not equal to the label, print the tweet and the prediction
    if y != (np.sign(y_hat) > 0):
        
        # print out the gold label ('y'), the predicted label ('y_hat'), and the tweet ('x')
        print('%d\t%0.2f\t%s' % (y, np.sign(y_hat) > 0, ' '.join(
            process_tweet(x)).encode('ascii', 'ignore')))

## Naive Bayes Classification Example: Movie Review

In the above, we precomputed the frequency of each word in our vocabulary. We can, however, implement a Naive Bayes classifier in a different manner. In this example, we will create some classes that will allow us to train a Naive Bayes classifier on the fly. We will use the [IMDB Movie Review Dataset](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) from Kaggle. The dataset contains 50,000 movie reviews from IMDB. The reviews are labeled as positive or negative. The goal is to predict the sentiment of a movie review given the text of the review.

## Imports

In [None]:
import random
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from tqdm.autonotebook import tqdm

from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords

## html display
from IPython.display import display, HTML

## Load the data

In [None]:
## Let's create a dataframe to store the results

## if you are working in Colab replace the path with the url:
# imdb_dataset = 'https://github.com/JamesMTucker/DATA_340_NLP/blob/master/Notebooks/data/IMDB_Dataset.csv'

df = pd.read_csv('../datasets/IMDB_Dataset.csv')
df

In [None]:
df.shape

## Create our train and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.2, random_state=42)

In [None]:
len(X_train), len(X_test), len(y_train), len(y_test)

### Randomly sample our data

In [None]:
## Randomly select a sample from the training set
## (randrange excludes the upper bound, so the index is always valid)
idx = random.randrange(len(X_train))

# display the sample in HTML format
display(HTML(f"<p>Review: {X_train.iloc[idx]}</p>"))
print('label:', y_train.iloc[idx])

## Naive Bayes Classifier Class

In [None]:
import math

class NaiveBayesClassifier:
    def __init__(self):
        self.positive_word_counts = {}
        self.negative_word_counts = {}
        self.positive_total_count = 0  # number of positive documents
        self.negative_total_count = 0  # number of negative documents
        self.positive_token_count = 0  # number of word tokens in positive documents
        self.negative_token_count = 0  # number of word tokens in negative documents
        self.vocab = set()

    def train(self, data):
        for text, label in data:
            if label == 'positive':
                self.positive_total_count += 1
                for word in text.split():
                    self.positive_word_counts[word] = self.positive_word_counts.get(word, 0) + 1
                    self.positive_token_count += 1
                    self.vocab.add(word)
            elif label == 'negative':
                self.negative_total_count += 1
                for word in text.split():
                    self.negative_word_counts[word] = self.negative_word_counts.get(word, 0) + 1
                    self.negative_token_count += 1
                    self.vocab.add(word)

    def predict(self, text):
        # Calculate the log prior of each class from the document counts
        # (the small constants avoid log(0) and division by zero before training)
        total_docs = self.positive_total_count + self.negative_total_count + 2e-10
        positive_score = math.log((self.positive_total_count + 1e-10) / total_docs)
        negative_score = math.log((self.negative_total_count + 1e-10) / total_docs)

        # Add the log likelihood of each known word given each class
        vocab_size = len(self.vocab)
        for word in text.split():
            if word in self.vocab:
                # Laplace smoothing: add 1 to the word count and |V| to the class token count
                positive_score += math.log((self.positive_word_counts.get(word, 0) + 1) / (self.positive_token_count + vocab_size))
                negative_score += math.log((self.negative_word_counts.get(word, 0) + 1) / (self.negative_token_count + vocab_size))

        # Compare the log posteriors directly: exponentiating them would
        # underflow to 0.0 for long reviews and make the classes indistinguishable
        if positive_score > negative_score:
            return 'positive'
        else:
            return 'negative'

### Train the Naive Bayes Classifier

In [None]:
## Shape our data
nb = NaiveBayesClassifier()
nb.train(zip(X_train, y_train))

In [None]:
## Test random sample
random_idx = random.randrange(len(X_test))
result = nb.predict(X_test.iloc[random_idx])
display(HTML(f"<p>Review: {X_test.iloc[random_idx]}</p> <p>{y_test.iloc[random_idx]}</p>"))
print('prediction:', result)

## Test the accuracy of our model

In [None]:
## Test accuracy

correct_predictions = 0
total_predictions = len(X_test)

for text, label in zip(X_test, y_test):
    pred = nb.predict(text)
    if pred == label:
        correct_predictions += 1

accuracy = correct_predictions / total_predictions
print('Accuracy:', accuracy)

### Can we improve our model?

In [None]:
# Import the following libraries

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

In [None]:
## if you are working in Colab replace the path with the url:
# imdb_dataset = 'https://github.com/JamesMTucker/DATA_340_NLP/blob/master/Notebooks/data/IMDB_Dataset.csv'

df = pd.read_csv('../datasets/IMDB_Dataset.csv')
df.shape

In [None]:
## Let's convert the sentiment to labels
sentiment = {'positive': 1, 'negative': 0}

df['sentiment'] = df['sentiment'].map(sentiment)
df.head()

In [None]:
## Let's define our X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.2, random_state=42)

len(X_train), len(X_test), len(y_train), len(y_test)

### vectorize our dataset

In [None]:
# Vectorize the text data using the CountVectorizer class
vectorizer = CountVectorizer(stop_words='english')
train_vectors = vectorizer.fit_transform(X_train)
validation_vectors = vectorizer.transform(X_test)

In [None]:
train_vectors.shape, validation_vectors.shape

In [None]:
train_vectors[0], validation_vectors[0]
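
For intuition about what the vectorizer produces, here is a deliberately simplified, pure-Python sketch of count vectors. It splits on whitespace, keeps every word, and builds dense lists, whereas `CountVectorizer` applies its own tokenization and stop-word filtering and returns a sparse matrix. The toy documents and the helper name `to_count_vector` are assumptions.

```python
from collections import Counter

train_docs = ["a great great movie", "a dull movie"]
test_docs = ["great plot"]

# "fit": build a sorted vocabulary from the training documents only
vocab = sorted({word for doc in train_docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

def to_count_vector(doc):
    """Map a document to a list of per-word counts; unseen words are ignored."""
    counts = Counter(word for word in doc.split() if word in index)
    return [counts.get(word, 0) for word in vocab]

# "transform": every document becomes a vector with one column per vocabulary word
train_matrix = [to_count_vector(doc) for doc in train_docs]
test_matrix = [to_count_vector(doc) for doc in test_docs]
print(vocab)
print(train_matrix, test_matrix)
```

Because the vocabulary is fixed by the training documents, the word `plot` in the test document has no column and is dropped, which is the same reason the notebook calls `fit_transform` on the training data but only `transform` on the test data.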

In [None]:
# Train a Naive Bayes classifier using the MultinomialNB class
classifier = MultinomialNB()
classifier.fit(train_vectors, y_train)

# Make predictions on the validation data
y_pred = classifier.predict(validation_vectors)

# Calculate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')  # Output: Accuracy: 0.80

### Check the model's accuracy

In [None]:
## Create a confusion matrix
import seaborn as sns
from sklearn.metrics import confusion_matrix

## compute the confusion matrix
cfm = confusion_matrix(y_test, y_pred)

## plot the confusion matrix
sns.heatmap(cfm, annot=True, fmt='d', cmap='Blues', xticklabels=['negative', 'positive'], yticklabels=['negative', 'positive'])

In [None]:
# output some of the misclassified reviews
misclassified = np.where(y_pred != y_test)[0]
print('Misclassified reviews:', len(misclassified))

## create a dataframe to store the misclassified reviews
misclassified_df = pd.DataFrame({'text': X_test.iloc[misclassified], 'actual': y_test.iloc[misclassified], 'predicted': y_pred[misclassified]})
misclassified_df.head()