If possible, update scikit-learn to version 1.3.2 so that results are comparable across installations.

In [4]:
!pip3 install scikit-learn==1.3.2

In [2]:
import numpy as np
import sklearn
from scipy.linalg import solve
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
print('The scikit-learn version is {}.'.format(sklearn.__version__))

The scikit-learn version is 1.3.2.


## Naive Bayes
From the 20Newsgroups dataset we fetch the documents belonging to three categories, which we use as classes.

In [3]:
categories = ['alt.atheism', 'talk.politics.guns',
              'sci.space']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

For example, the first document in the training data is the following one:

In [4]:
print(train.data[0])

From: fcrary@ucsu.Colorado.EDU (Frank Crary)
Subject: Re: Riddle me this...
Nntp-Posting-Host: ucsu.colorado.edu
Organization: University of Colorado, Boulder
Distribution: usa
Lines: 16

In article <1r1lp1INN752@mojo.eng.umd.edu> chuck@eng.umd.edu (Chuck Harris - WA3UQV) writes:
>>If so, why was CS often employed against tunnels in Vietnam?

>CS "tear-gas" was used in Vietnam because it makes you wretch so hard that
>your stomach comes out thru your throat.  Well, not quite that bad, but
>you can't really do much to defend yourself while you are blowing cookies.

I think the is BZ gas, not CS or CN. BZ gas exposure results in projectile
vomiting, loss of essentially all muscle control, inability to concentrate
or think rationally and fatal reactions in a significant fraction of
the population. For that reason its use is limited to military
applications.

                                                          Frank Crary
                                                          CU B

The target vector encodes the classes as indices from zero to two. The target names tell us which index belongs to which class.

In [5]:
len(train.target)

1619

In [6]:
y_train = train.target
y_train

array([2, 2, 1, ..., 1, 2, 2])

In [7]:
train.target_names

['alt.atheism', 'sci.space', 'talk.politics.guns']

We represent the documents in a bag-of-words format. That is, we create a data matrix ``D`` such that ``D[j,i]=1`` if the j-th document contains the i-th feature (word), and ``D[j,i]=0`` otherwise.

In [8]:
# Raw string so that \W and \d are passed to the regex engine unchanged
vectorizer = CountVectorizer(stop_words="english", min_df=5, token_pattern=r"[^\W\d_]+", binary=True)
D = vectorizer.fit_transform(train.data)
D_test = vectorizer.transform(test.data)
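
To make the binary bag-of-words representation concrete, here is a minimal sketch on a hypothetical three-document toy corpus. The names `toy_docs`, `word_to_index` and `D_toy` are illustrative assumptions and not part of the exercise. It uses plain whitespace tokenisation rather than the `CountVectorizer` settings above.

```python
import numpy as np

# Hypothetical toy corpus, already lower-cased; tokens are split on whitespace
toy_docs = ["gas exposure results", "space gas", "space launch"]

# Vocabulary: sorted unique words, one column per word
vocabulary = sorted({word for doc in toy_docs for word in doc.split()})
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# Binary document-term matrix: D_toy[j, i] = 1 if document j contains word i
D_toy = np.zeros((len(toy_docs), len(vocabulary)), dtype=int)
for j, doc in enumerate(toy_docs):
    for word in set(doc.split()):
        D_toy[j, word_to_index[word]] = 1

print(vocabulary)
print(D_toy)
```

Each row is a document and each column a word, mirroring the layout of ``D`` above.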

The following array contains the vocabulary and maps each feature index to its word.

In [9]:
vectorizer.get_feature_names_out()

array(['aa', 'aario', 'aaron', ..., 'zoology', 'zv', 'ÿ'], dtype=object)

For example, the word `naive` has the index 4044.

In [10]:
np.where(vectorizer.get_feature_names_out() == 'naive')[0]

array([4044])

### Exercise a 
Compute the class prior probabilities $p(y)$.

In [11]:
total_documents = D.shape[0]

class_prior_probabilities = {}

# Compute the prior probability of each class.
# class_index: index into the class labels array
# class_label: 'alt.atheism' | 'sci.space' | 'talk.politics.guns'
for class_index, class_label in enumerate(train.target_names):
    # Number of documents in specific class
    documents_in_class = np.sum(y_train == class_index)
    
    # The probability of the document in the specific class
    class_prior_probabilities[class_label] = documents_in_class / total_documents

class_prior_probabilities

{'alt.atheism': 0.2964793082149475,
 'sci.space': 0.3662754786905497,
 'talk.politics.guns': 0.3372452130945028}
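
The same priors can be computed without an explicit loop. The sketch below uses a hypothetical label vector `y_toy` (an assumption for illustration, not the notebook's `y_train`) and `np.bincount` to count how often each class index occurs.

```python
import numpy as np

# Hypothetical label vector with three classes (0, 1, 2)
y_toy = np.array([2, 2, 1, 0, 1, 2, 0, 1])

# bincount returns the number of occurrences of each class index
class_counts = np.bincount(y_toy, minlength=3)

# Prior p(y = c) is the fraction of documents that belong to class c
priors = class_counts / y_toy.size
log_priors = np.log(priors)

print(priors)
```

Working with `log_priors` rather than `priors` is convenient later, because naive Bayes adds log-probabilities instead of multiplying probabilities.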

### Exercise b
What are the log-probabilities of the word 'naive' given each class? Use Laplace smoothing with $\alpha = 10^{-5}$. Note that in machine learning, log denotes the natural logarithm (base $e$) by default.
Assuming that $x_{naive}$ denotes the random variable for the feature-word 'naive', compute the following probabilities:

In [12]:
y_train.shape

(1619,)

In [13]:
D.shape[1]

6880

In [14]:
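# Documents of class 0 (named so that the `test` dataset is not overwritten)
class_0_documents = D[y_train == 0]
class_0_documents.sum()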

49267

In [28]:
word_index = np.where(vectorizer.get_feature_names_out() == 'naive')[0][0]
total_documents = D.shape[0]

# Laplace smoothing
α = 1e-5
total_unique_words = D.shape[1]

# Object to store result in
word_class_log_probability = {}

# For every class, estimate the probability that 'naive' occurs in a document of that class
for class_index, class_label in enumerate(train.target_names):
    
    # Filter documents by each class label
    class_documents = D[y_train == class_index]
    
    # Count occurrences of the word 'naive' in each class
    word_occurrences = class_documents[:, word_index].sum()

    # Number of documents(samples) in specific class
    documents_in_class = np.sum(y_train == class_index)
    
    # Calculate probability with Laplace smoothing
    probability_word = (word_occurrences + α) / (documents_in_class + α * 2)
    
    # Store log probability for each class
    word_class_log_probability[class_label] = np.log(probability_word)
    
word_class_log_probability

{'alt.atheism': -4.564346233136502,
 'sci.space': -6.385184432774537,
 'talk.politics.guns': -4.916322151258175}
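
The smoothed estimate used above is $p(x_k = 1 \mid y = c) = (n_{kc} + \alpha) / (n_c + 2\alpha)$, where $n_{kc}$ is the number of class-$c$ documents containing word $k$ and $n_c$ is the number of class-$c$ documents. As a sketch, the same formula can be applied to all words at once. The toy matrix `D_toy` and labels `y_toy` below are hypothetical and only serve to illustrate the computation.

```python
import numpy as np

alpha = 1e-5

# Hypothetical binary document-term matrix (4 documents, 3 words) and labels
D_toy = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 0, 1],
                  [0, 1, 1]])
y_toy = np.array([0, 0, 1, 1])
n_classes = 2

# log_theta[c, k] = log p(x_k = 1 | y = c) with Laplace smoothing
log_theta = np.empty((n_classes, D_toy.shape[1]))
for c in range(n_classes):
    docs_c = D_toy[y_toy == c]
    word_doc_counts = docs_c.sum(axis=0)   # n_kc for every word k
    n_c = docs_c.shape[0]                  # number of documents in class c
    log_theta[c] = np.log((word_doc_counts + alpha) / (n_c + 2 * alpha))

print(log_theta)
```

A word that never occurs in a class receives a very small but non-zero probability, so its logarithm stays finite.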

### Exercise c
Compute the class-conditioned log-probabilities $\log p(x_k, y)$ for each word and class combination. Apply the naive Bayes algorithm to compute the class-conditioned log-probabilities for the first document $x_0$ in the training dataset. Again, use Laplace smoothing with $\alpha = 10^{-5}$.

$P(x_0, y) = P(x_0 \mid y)\,P(y)$

Loop over all samples and features; here, the features are the words.
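
One possible way to combine prior and word probabilities is sketched below on hypothetical toy data. It assumes a Bernoulli naive Bayes model, in which absent words contribute $\log(1 - \theta_{kc})$; this matches the $2\alpha$ denominator used in exercise b, but it is an assumption rather than the notebook's confirmed solution. The names `D_toy`, `y_toy` and `log_joint` are illustrative.

```python
import numpy as np

alpha = 1e-5

# Hypothetical binary document-term matrix (4 documents, 3 words) and labels
D_toy = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 0, 1],
                  [0, 1, 1]])
y_toy = np.array([0, 0, 1, 1])

x = D_toy[0]      # the document to score
log_joint = {}    # class index -> log p(x, y = c)

for c in np.unique(y_toy):
    docs_c = D_toy[y_toy == c]
    n_c = docs_c.shape[0]

    # log p(y = c)
    log_prior = np.log(n_c / D_toy.shape[0])

    # theta[k] = p(x_k = 1 | y = c) with Laplace smoothing
    theta = (docs_c.sum(axis=0) + alpha) / (n_c + 2 * alpha)

    # Naive Bayes: words are independent given the class, so log-probabilities add up
    log_likelihood = np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

    log_joint[int(c)] = log_prior + log_likelihood

predicted_class = max(log_joint, key=log_joint.get)
print(log_joint, predicted_class)
```

The class with the largest joint log-probability is the naive Bayes prediction for the document.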

In [71]:
# use feature count

for word in D[0]:
    print(word[0])
    
D[0]

  (0, 2266)	1
  (0, 6398)	1
  (0, 1133)	1
  (0, 1938)	1
  (0, 2437)	1
  (0, 1419)	1
  (0, 5951)	1
  (0, 5267)	1
  (0, 4146)	1
  (0, 4662)	1
  (0, 2855)	1
  (0, 4338)	1
  (0, 6460)	1
  (0, 727)	1
  (0, 1784)	1
  (0, 6494)	1
  (0, 3543)	1
  (0, 376)	1
  (0, 4946)	1
  (0, 3627)	1
  (0, 3066)	1
  (0, 3951)	1
  (0, 2007)	1
  (0, 6418)	1
  (0, 1031)	1
  :	:
  (0, 6063)	1
  (0, 5024)	1
  (0, 1588)	1
  (0, 677)	1
  (0, 1357)	1
  (0, 6192)	1
  (0, 1104)	1
  (0, 2183)	1
  (0, 5224)	1
  (0, 3605)	1
  (0, 2067)	1
  (0, 4028)	1
  (0, 1332)	1
  (0, 1221)	1
  (0, 2254)	1
  (0, 5008)	1
  (0, 5625)	1
  (0, 2431)	1
  (0, 4641)	1
  (0, 5026)	1
  (0, 6498)	1
  (0, 3537)	1
  (0, 3874)	1
  (0, 313)	1
  (0, 1468)	1


<1x6880 sparse matrix of type '<class 'numpy.int64'>'
	with 66 stored elements in Compressed Sparse Row format>

In [59]:
np.where(vectorizer.get_feature_names_out() == 'fraction')[0]

array([2431])

In [61]:
D

<1619x6880 sparse matrix of type '<class 'numpy.int64'>'
	with 174034 stored elements in Compressed Sparse Row format>

In [65]:
# Sparse row vector of the first document
D[0]

<1x6880 sparse matrix of type '<class 'numpy.int64'>'
	with 66 stored elements in Compressed Sparse Row format>

In [81]:
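# Iterate through the vocabulary; each word corresponds to one column of D
for feature_index, word in enumerate(vectorizer.get_feature_names_out()):
    # Access the corresponding column in the bag-of-words matrix D
    word_column = D[:, feature_index]
    # Print the column only if the first document contains this word
    if D[0, feature_index] != 0:
        print(word_column)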

  (0, 0)	1
  (110, 0)	1
  (127, 0)	1
  (150, 0)	1
  (155, 0)	1
  (203, 0)	1
  (350, 0)	1
  (415, 0)	1
  (488, 0)	1
  (501, 0)	1
  (577, 0)	1
  (603, 0)	1
  (609, 0)	1
  (727, 0)	1
  (842, 0)	1
  (931, 0)	1
  (985, 0)	1
  (1088, 0)	1
  (1177, 0)	1
  (1390, 0)	1
  (1435, 0)	1
  (1449, 0)	1
  (1459, 0)	1
  (1523, 0)	1
  (1555, 0)	1
  (0, 0)	1
  (1, 0)	1
  (2, 0)	1
  (6, 0)	1
  (7, 0)	1
  (8, 0)	1
  (12, 0)	1
  (13, 0)	1
  (18, 0)	1
  (19, 0)	1
  (20, 0)	1
  (21, 0)	1
  (22, 0)	1
  (25, 0)	1
  (26, 0)	1
  (27, 0)	1
  (29, 0)	1
  (30, 0)	1
  (31, 0)	1
  (33, 0)	1
  (34, 0)	1
  (37, 0)	1
  (38, 0)	1
  (40, 0)	1
  (41, 0)	1
  :	:
  (1580, 0)	1
  (1581, 0)	1
  (1584, 0)	1
  (1586, 0)	1
  (1587, 0)	1
  (1588, 0)	1
  (1590, 0)	1
  (1591, 0)	1
  (1592, 0)	1
  (1593, 0)	1
  (1594, 0)	1
  (1596, 0)	1
  (1599, 0)	1
  (1601, 0)	1
  (1602, 0)	1
  (1603, 0)	1
  (1606, 0)	1
  (1607, 0)	1
  (1609, 0)	1
  (1610, 0)	1
  (1613, 0)	1
  (1614, 0)	1
  (1616, 0)	1
  (1617, 0)	1
  (1618, 0)	1
  (0, 0)	1
  (10, 0

  (0, 0)	1
  (9, 0)	1
  (22, 0)	1
  (68, 0)	1
  (93, 0)	1
  (128, 0)	1
  (230, 0)	1
  (246, 0)	1
  (269, 0)	1
  (325, 0)	1
  (517, 0)	1
  (619, 0)	1
  (749, 0)	1
  (842, 0)	1
  (922, 0)	1
  (935, 0)	1
  (979, 0)	1
  (1008, 0)	1
  (1037, 0)	1
  (1070, 0)	1
  (1082, 0)	1
  (1088, 0)	1
  (1158, 0)	1
  (1258, 0)	1
  (1296, 0)	1
  (1305, 0)	1
  (1322, 0)	1
  (1379, 0)	1
  (1387, 0)	1
  (1393, 0)	1
  (1542, 0)	1
  (1562, 0)	1
  (0, 0)	1
  (13, 0)	1
  (116, 0)	1
  (347, 0)	1
  (415, 0)	1
  (586, 0)	1
  (739, 0)	1
  (763, 0)	1
  (842, 0)	1
  (882, 0)	1
  (1171, 0)	1
  (1248, 0)	1
  (1459, 0)	1
  (0, 0)	1
  (9, 0)	1
  (88, 0)	1
  (93, 0)	1
  (272, 0)	1
  (288, 0)	1
  (290, 0)	1
  (306, 0)	1
  (419, 0)	1
  (434, 0)	1
  (446, 0)	1
  (607, 0)	1
  (637, 0)	1
  (687, 0)	1
  (694, 0)	1
  (945, 0)	1
  (1023, 0)	1
  (1380, 0)	1
  (0, 0)	1
  (88, 0)	1
  (331, 0)	1
  (357, 0)	1
  (446, 0)	1
  (473, 0)	1
  (499, 0)	1
  (507, 0)	1
  (560, 0)	1
  (602, 0)	1
  (663, 0)	1
  (687, 0)	1
  (757, 0)	1
  (822, 0)	

  (0, 0)	1
  (1, 0)	1
  (2, 0)	1
  (3, 0)	1
  (5, 0)	1
  (6, 0)	1
  (7, 0)	1
  (8, 0)	1
  (9, 0)	1
  (10, 0)	1
  (11, 0)	1
  (12, 0)	1
  (13, 0)	1
  (14, 0)	1
  (15, 0)	1
  (16, 0)	1
  (17, 0)	1
  (18, 0)	1
  (20, 0)	1
  (21, 0)	1
  (22, 0)	1
  (23, 0)	1
  (24, 0)	1
  (25, 0)	1
  (26, 0)	1
  :	:
  (1594, 0)	1
  (1595, 0)	1
  (1596, 0)	1
  (1597, 0)	1
  (1598, 0)	1
  (1599, 0)	1
  (1600, 0)	1
  (1601, 0)	1
  (1602, 0)	1
  (1603, 0)	1
  (1604, 0)	1
  (1605, 0)	1
  (1606, 0)	1
  (1607, 0)	1
  (1608, 0)	1
  (1609, 0)	1
  (1610, 0)	1
  (1611, 0)	1
  (1612, 0)	1
  (1613, 0)	1
  (1614, 0)	1
  (1615, 0)	1
  (1616, 0)	1
  (1617, 0)	1
  (1618, 0)	1
  (0, 0)	1
  (1, 0)	1
  (4, 0)	1
  (9, 0)	1
  (22, 0)	1
  (60, 0)	1
  (77, 0)	1
  (162, 0)	1
  (163, 0)	1
  (220, 0)	1
  (240, 0)	1
  (334, 0)	1
  (366, 0)	1
  (375, 0)	1
  (376, 0)	1
  (395, 0)	1
  (419, 0)	1
  (438, 0)	1
  (483, 0)	1
  (551, 0)	1
  (560, 0)	1
  (569, 0)	1
  (572, 0)	1
  (593, 0)	1
  (598, 0)	1
  :	:
  (951, 0)	1
  (963, 0)	1
  (964,

In [86]:
vectorizer.transform(D)

AttributeError: 'csr_matrix' object has no attribute 'lower'
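
The error above appears because `transform` expects an iterable of raw text documents, not an already vectorized matrix: during preprocessing each document is lower-cased, and a sparse row has no `lower` method. A small sketch on a hypothetical toy corpus (`toy_docs` and `toy_vectorizer` are illustrative names) shows both the intended call and the failing one.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus of raw strings
toy_docs = ["tear gas was used", "gas exposure is dangerous"]

toy_vectorizer = CountVectorizer(binary=True)
toy_D = toy_vectorizer.fit_transform(toy_docs)   # raw strings in, sparse matrix out

# Transforming the raw text again reproduces the same matrix
same = toy_vectorizer.transform(toy_docs)
matrices_match = (toy_D != same).nnz == 0

# Passing the matrix itself fails: each "document" is then a sparse row
try:
    toy_vectorizer.transform(toy_D)
    error_message = None
except AttributeError as err:
    error_message = str(err)

print(matrices_match, error_message)
```

Since ``D`` is already the vectorized training data, it can be used directly without another call to `transform`.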