If possible, update scikit-learn to version 1.3.2 so that results do not differ between installations.

In [4]:
!pip3 install scikit-learn==1.3.2

In [2]:
import numpy as np
import pandas as pd
import sklearn
from scipy.linalg import solve
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
print('The scikit-learn version is {}.'.format(sklearn.__version__))

The scikit-learn version is 1.3.2.


## Naive Bayes
From the 20Newsgroups dataset we fetch the documents belonging to three categories, which we use as classes.

In [4]:
categories = ['alt.atheism', 'talk.politics.guns',
              'sci.space']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

For example, the first document in the training data is the following one:

In [5]:
print(train.data[0])

From: fcrary@ucsu.Colorado.EDU (Frank Crary)
Subject: Re: Riddle me this...
Nntp-Posting-Host: ucsu.colorado.edu
Organization: University of Colorado, Boulder
Distribution: usa
Lines: 16

In article <1r1lp1INN752@mojo.eng.umd.edu> chuck@eng.umd.edu (Chuck Harris - WA3UQV) writes:
>>If so, why was CS often employed against tunnels in Vietnam?

>CS "tear-gas" was used in Vietnam because it makes you wretch so hard that
>your stomach comes out thru your throat.  Well, not quite that bad, but
>you can't really do much to defend yourself while you are blowing cookies.

I think the is BZ gas, not CS or CN. BZ gas exposure results in projectile
vomiting, loss of essentially all muscle control, inability to concentrate
or think rationally and fatal reactions in a significant fraction of
the population. For that reason its use is limited to military
applications.

                                                          Frank Crary
                                                          CU B

The target vector encodes the classes as indices from zero to two. The target names tell us which index belongs to which class.

In [6]:
len(train.target)

1619

In [7]:
y_train = train.target
y_train

array([2, 2, 1, ..., 1, 2, 2])

In [8]:
train.target_names

['alt.atheism', 'sci.space', 'talk.politics.guns']

We represent the documents in a bag-of-words format. That is, we create a data matrix ``D`` such that ``D[j,i]=1`` if the j-th document contains the i-th feature (word), and ``D[j,i]=0`` otherwise.

In [9]:
vectorizer = CountVectorizer(stop_words="english", min_df=5, token_pattern=r"[^\W\d_]+", binary=True)
D = vectorizer.fit_transform(train.data)
D_test = vectorizer.transform(test.data)
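To make the binary bag-of-words representation concrete, here is a minimal hand-rolled sketch on a toy corpus. The documents, the whitespace tokenization, and the name `D_toy` are assumptions for illustration; `CountVectorizer` additionally applies stop-word removal, the `min_df` threshold, and its own token pattern.

```python
import numpy as np

# Toy corpus (hypothetical, not taken from 20Newsgroups)
docs = ["the rocket reached orbit", "the gun debate", "orbit orbit orbit"]

# Vocabulary: sorted unique tokens, using a plain whitespace split
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

# Binary document-term matrix: entry [j, i] is 1 if document j contains word i
D_toy = np.zeros((len(docs), len(vocab)), dtype=int)
for j, doc in enumerate(docs):
    for word in set(doc.split()):
        D_toy[j, index[word]] = 1

print(vocab)
print(D_toy)
```

Note that the third document repeats `orbit` three times, yet its entry is still 1: the matrix records presence, not counts.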

The following array contains the vocabulary and maps each feature index to its word.

In [10]:
vectorizer.get_feature_names_out()

array(['aa', 'aario', 'aaron', ..., 'zoology', 'zv', 'ÿ'], dtype=object)

For example, the word `naive` has the index 4044.

In [11]:
np.where(vectorizer.get_feature_names_out() == 'naive')[0]

array([4044])

### Exercise a 
Compute the class prior probabilities $p(y)$.

In [51]:
total_documents = D.shape[0]

class_prior_probabilities = {}

# Compute the prior probability of each class.
# class_index: index into the class label array
# class_label: 'alt.atheism' | 'sci.space' | 'talk.politics.guns'
for class_index, class_label in enumerate(train.target_names):
    # Number of documents in this class
    documents_in_class = np.sum(y_train == class_index)

    # Prior probability: fraction of all documents that belong to this class
    class_prior_probabilities[class_label] = documents_in_class / total_documents

class_prior_probabilities

{'alt.atheism': 0.2964793082149475,
 'sci.space': 0.3662754786905497,
 'talk.politics.guns': 0.3372452130945028}
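The same priors can also be obtained without a loop. The following is a minimal sketch on a hypothetical label vector `y_toy` (a stand-in for `train.target`), using `np.bincount` to count the documents per class.

```python
import numpy as np

# Hypothetical label vector with three classes (stand-in for train.target)
y_toy = np.array([2, 2, 1, 0, 1, 2, 0, 2])

# Count documents per class, then normalise by the total number of documents
class_counts = np.bincount(y_toy, minlength=3)
priors = class_counts / class_counts.sum()
print(priors)
```

The entries of `priors` are ordered by class index and sum to one.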

### Exercise b
What are the log-probabilities of the word 'naive' given each class? Use Laplace smoothing with $\alpha = 10^{-5}$. Note that in machine learning, log denotes the natural logarithm (base $e$) by default.
Assuming that $x_{naive}$ denotes the random variable for the feature word 'naive', compute $\log p(x_{naive} = 1 \mid y)$ for each class $y$.

In [105]:
word_index = np.where(vectorizer.get_feature_names_out() == 'naive')[0][0]

# Laplace smoothing parameter
α = 1e-5

# Dictionary to store the result
word_class_log_probability = {}

# For every class, compute the smoothed probability that 'naive' occurs in a document
for class_index, class_label in enumerate(train.target_names):

    # Select the documents that belong to this class
    class_documents = D[y_train == class_index]

    # Number of documents in this class that contain 'naive' (D is binary)
    word_occurrences = class_documents[:, word_index].sum()

    # Number of documents (samples) in this class
    documents_in_class = np.sum(y_train == class_index)

    # Probability with Laplace smoothing; the factor 2 reflects the two
    # possible outcomes (word present or absent)
    probability_word = (word_occurrences + α) / (documents_in_class + α * 2)

    # Store the log-probability for this class
    word_class_log_probability[class_label] = np.log(probability_word)

word_class_log_probability

{'alt.atheism': -4.564346233136502,
 'sci.space': -6.385184432774537,
 'talk.politics.guns': -4.916322151258175}
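The smoothed estimate used here is $p(x_k = 1 \mid y) = (n_{k,y} + \alpha) / (n_y + 2\alpha)$, where $n_{k,y}$ is the number of documents of class $y$ that contain word $k$ and $n_y$ is the number of documents in class $y$. The following sketch applies it to all words at once; the toy matrix `X_toy`, the labels `y_toy`, and the helper `smoothed_log_prob` are hypothetical names introduced for illustration.

```python
import numpy as np

alpha = 1e-5  # same smoothing constant as in the exercise

# Hypothetical binary document-term matrix: 4 documents, 3 words
X_toy = np.array([[1, 0, 1],
                  [1, 0, 0],
                  [0, 0, 1],
                  [1, 0, 1]])
y_toy = np.array([0, 0, 1, 1])

def smoothed_log_prob(X, y, class_index, alpha):
    """Return log p(x_k = 1 | y = class_index) for every word k."""
    X_class = X[y == class_index]
    docs_with_word = X_class.sum(axis=0)   # documents containing each word
    n_docs = X_class.shape[0]              # documents in the class
    # Each word is a binary variable, so there are 2 possible outcomes
    return np.log((docs_with_word + alpha) / (n_docs + 2 * alpha))

log_p = smoothed_log_prob(X_toy, y_toy, 0, alpha)
print(log_p)
```

The second word never occurs in class 0, so without smoothing its log-probability would be $-\infty$; with $\alpha = 10^{-5}$ it is a large negative but finite number.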

### Exercise c
Compute the class-conditioned log-probabilities $\log p(x_k \mid y)$ for each word and class combination. Apply the naive Bayes algorithm to compute the log-probabilities $\log p(x_0, y)$ for the first document $x_0$ in the training dataset. Again use Laplace smoothing with $\alpha = 10^{-5}$.

$p(x_0, y) = p(x_0 \mid y)\,p(y)$

We use equations 62 and 64 in the notebook to calculate the probability.

In [167]:
for class_index, class_label in enumerate(train.target_names):

    # Select the documents (rows of the binary matrix D) that belong to this class
    documents_in_class = D[y_train == class_index]
    num_of_documents_in_class = documents_in_class.shape[0]

    # Total number of features (words)
    vocabulary_size = len(vectorizer.vocabulary_)

    # For each word, count the documents in this class that contain it
    word_occurrences = documents_in_class.sum(axis=0)

    # Smoothed class-conditioned log-probability for each word
    # (the denominator uses α * vocabulary_size, unlike α * 2 in Exercise b)
    class_conditioned_probs = np.log((word_occurrences + α) / (num_of_documents_in_class + α * vocabulary_size))

    # Log prior plus the sum of the log-probabilities over all vocabulary words
    # (note: the sum is not restricted to the words that occur in x_0)
    log_prior = np.log(class_prior_probabilities[class_label])
    log_joint = log_prior + np.sum(class_conditioned_probs)

    print(f"log p(x=x0, y={class_index}) =", log_joint)

log p(x=x0, y=0) = -57210.94738605933
log p(x=x0, y=1) = -51552.20958428776
log p(x=x0, y=2) = -49191.18333249678
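For a binary bag-of-words model, the joint log-probability of a single document normally depends on which words that document contains. Below is a minimal sketch on toy data, assuming the Bernoulli event model (which the $2\alpha$ denominator in Exercise b suggests): $\log p(x, y) = \log p(y) + \sum_k \left[ x_k \log p_k + (1 - x_k) \log (1 - p_k) \right]$ with $p_k = p(x_k = 1 \mid y)$. The names `X_toy`, `y_toy`, and `joint_log_prob` are hypothetical, and whether this matches equations 62 and 64 exactly is not confirmed here.

```python
import numpy as np

alpha = 1e-5  # same smoothing constant as in the exercise

# Hypothetical binary document-term matrix: 4 documents, 3 words
X_toy = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 1],
                  [0, 1, 0]])
y_toy = np.array([0, 0, 1, 1])
x0 = X_toy[0]  # the document to score

def joint_log_prob(X, y, x, class_index, alpha):
    """Return log p(x, y = class_index) under a Bernoulli naive Bayes model."""
    X_class = X[y == class_index]
    n_docs = X_class.shape[0]
    # Smoothed probability that each word occurs in a document of this class
    p = (X_class.sum(axis=0) + alpha) / (n_docs + 2 * alpha)
    # Present words contribute log p_k, absent words contribute log(1 - p_k)
    log_likelihood = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    log_prior = np.log(n_docs / X.shape[0])
    return log_prior + log_likelihood

scores = np.array([joint_log_prob(X_toy, y_toy, x0, c, alpha) for c in (0, 1)])
predicted = int(np.argmax(scores))
print(scores, predicted)
```

The predicted class is the one with the largest joint log-probability; normalising the scores is not needed for the decision because $p(x_0)$ is the same for every class.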
