## Goal ##

Spam filtering is one of the most common real-world applications of machine learning, and most modern email providers already have systems that automatically flag unwanted messages as junk.

In this project, a simple Naïve Bayes classifier is implemented to detect spam in SMS messages. The model is trained on a labeled dataset so that distinctions between spam and non-spam messages can be learned. Spam texts are often characterized by the presence of attention-grabbing words such as “FREE”, “WIN”, “PRIZE”, “CASH”, or by stylistic patterns such as extensive use of all caps and multiple exclamation marks.

From a machine learning perspective, this is a binary classification task — each message is either spam or not spam. It also falls under supervised learning, since the model relies on a dataset of pre-labeled messages to learn patterns and make predictions on unseen data.

## Overview ##

This project has been broken down into the following steps:

- Step 0: Introduction to the Naive Bayes Theorem
- Step 1.1: Understanding the dataset
- Step 1.2: Data Preprocessing
- Step 2.1: Bag of Words (BoW)
- Step 2.2: Implementing BoW from scratch
- Step 2.3: Implementing Bag of Words in scikit-learn
- Step 3.1: Training and testing sets
- Step 3.2: Applying Bag of Words processing to the dataset
- Step 4.1: Bayes Theorem implementation from scratch
- Step 4.2: Naive Bayes implementation from scratch
- Step 5: Naive Bayes implementation using scikit-learn
- Step 6: Evaluating the model
- Step 7: Conclusion

### Step 0: Introduction to the Naive Bayes Theorem ###

Bayes’ Theorem is regarded as one of the earliest algorithms for probabilistic inference. It was originally formulated by Reverend Thomas Bayes and, despite its age, continues to be highly effective in many applications.

The idea can be understood more clearly through an example. Imagine a situation where security personnel are assigned to protect a political candidate during a public campaign event. Since the event is open to everyone, constant vigilance is required and each individual must be assessed for potential risk. Characteristics such as age, possession of a bag, or visible nervousness can be evaluated to determine whether a person poses a threat. If a person displays enough of these characteristics to cross a certain threshold of suspicion, precautionary action can be taken. This is essentially how Bayes’ Theorem operates: the probability of an event (a person being a threat) is calculated using the probabilities of related events (age, bag possession, nervous behavior, etc.).

An important consideration is the assumption of independence among these features. For instance, nervousness in a child is far less concerning than nervousness in an adult. If nervousness alone were used as a predictor, many false alarms would be raised, as minors are more likely to appear nervous. By combining features such as age and nervousness, a more accurate judgment is obtained. This leads to the “naïve” assumption in Naïve Bayes: all features are treated as independent, even though in reality correlations often exist. While this assumption is an oversimplification, it allows the algorithm to remain efficient and surprisingly effective.

Another intuitive example can be found in medical diagnosis. Suppose a patient is tested for a rare disease. The test is not perfect: it sometimes returns positive even when the patient is healthy (false positive), and sometimes returns negative when the patient is actually sick (false negative). Bayes’ Theorem allows the doctor to update the probability that the patient truly has the disease by combining:

- The prior probability of having the disease (based on how common it is in the population).
- The likelihood of the observed test result given the disease status.

This way, rather than relying only on the test outcome, a more accurate probability can be calculated.

In summary, Bayes’ Theorem provides a way to compute the probability of an event (e.g., a message being spam, a person being a threat, or a patient having a disease) from the probabilities of related evidence (e.g., keywords in text, observable behavior, or medical test results). The detailed mechanics of Bayes’ Theorem will be explored later, but first, attention will be given to understanding the dataset being used.

### Step 1.1: Understanding the Dataset ###
For this project, the SMS Spam Collection dataset from the UCI Machine Learning Repository is used. The UCI repository is widely known for hosting benchmark datasets frequently applied in experimental research. The dataset [abstract](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) and the original [compressed data file](https://archive.ics.uci.edu/ml/machine-learning-databases/00228/) are both available on the UCI site.

Preview of the data:

<img src="./images/dqnb.png" height="1242" width="1242">

The dataset contains two columns:

The first column is the label:

- 'ham' → the SMS is not spam

- 'spam' → the SMS is spam

The second column contains the actual text of the SMS message.

In [1]:
# '!' allows us to run shell commands from a Jupyter notebook ('dir' is the Windows command for listing files).
print("List all the files in the current directory\n")
!dir
# The required data table can be found under smsspamcollection/SMSSpamCollection
print("\n List all the files inside the smsspamcollection directory\n")
!dir smsspamcollection

List all the files in the current directory

 Volume in drive C is OS
 Volume Serial Number is B8A2-7C6A

 Directory of C:\Users\Asus\NN_Projects\NaiveBayes

19.08.2025  11:16    <DIR>          .
19.08.2025  11:25    <DIR>          ..
19.08.2025  11:16    <DIR>          .ipynb_checkpoints
19.08.2025  11:16    <DIR>          images
19.08.2025  11:16    <DIR>          smsspamcollection
19.08.2025  11:16            90.990 SpamClassifier.ipynb
               1 File(s)         90.990 bytes
               5 Dir(s)  432.080.314.368 bytes free

 List all the files inside the smsspamcollection directory

 Volume in drive C is OS
 Volume Serial Number is B8A2-7C6A

 Directory of C:\Users\Asus\NN_Projects\NaiveBayes\smsspamcollection

19.08.2025  11:16    <DIR>          .
19.08.2025  11:16    <DIR>          ..
19.08.2025  11:16           483.481 SMSSpamCollection
               1 File(s)        483.481 bytes
               2 Dir(s)  432.080.314.368 bytes free


>**Instructions:**

- Load the dataset into a pandas DataFrame using the read_table() function. 
- Since the data is tab-separated, pass \t as the value for the sep parameter.
- Provide custom column names by setting the names argument to ['label', 'sms_message'].
- Finally, display the first five rows of the DataFrame to verify that the data has been imported correctly.

In [2]:
import pandas as pd 
# Path to the dataset
file_path = "smsspamcollection/SMSSpamCollection"

df = pd.read_table(file_path, sep = '\t', header = None, names = ['label', 'sms_message'])
print(df.head())



  label                                        sms_message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


### Step 1.2: Data Preprocessing ###

Once the dataset has been explored, the label column is converted into binary values:  
- **0** → _ham_ (not spam)  
- **1** → _spam_  

This conversion is performed to make the data directly usable for computation.  

The reason for this step lies in how **scikit-learn and most machine learning libraries operate under the hood**: computations are carried out on **numerical representations**. If string labels such as `"ham"` and `"spam"` were left unchanged, scikit-learn classifiers would still accept them, but the mapping to internal class indices would happen implicitly (the sorted unique labels are stored in the fitted estimator's `classes_` attribute) rather than being visible in the preprocessing code.  

While the model would still be able to train and predict, leaving this process to the library can cause confusion later. For example, binary **performance metrics** (precision, recall, F1-score) treat the label `1` as the positive class by default (`pos_label=1`), so with string labels the positive class has to be named explicitly.  

Converting categorical labels into integers beforehand is considered **best practice in ML pipelines**. Not only does it make the preprocessing explicit and transparent, it also aligns with how modern ML models and data pipelines are typically structured:  

- **Transparency** – it is always clear what numeric value corresponds to each category.  
- **Consistency** – the same encoding can be reused when new data arrives.  
- **Compatibility** – some algorithms (e.g., logistic regression, Naïve Bayes) expect purely numeric input matrices.  
- **Performance** – numerical operations are much faster than string operations.  

In short, preprocessing steps like label encoding are not merely for convenience but are part of the broader practice of ensuring that all inputs to a model are **clean, consistent, and numerical**—which is essential for reliable model training and evaluation.  


>**Instructions:**
* Convert the values in the 'label' column to numerical values using the `map` method as follows:
    - `{'ham':0, 'spam':1}` This maps the 'ham' value to 0 and the 'spam' value to 1.

In [3]:
df['label'] = df.label.map({'ham':0, 'spam':1})
print(df.shape)
df.head()

(5572, 2)


|   | label | sms_message |
|---|:-----:|-------------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
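
One thing worth checking after a dictionary-based `map` is that no label was left unmapped, since pandas returns `NaN` for any value missing from the dictionary instead of raising an error. The sketch below uses a small hypothetical frame (`toy`, not the real dataset) with a deliberately mis-cased label to illustrate the check.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame; 'Spam' is deliberately mis-cased.
toy = pd.DataFrame({
    "label": ["ham", "spam", "Spam"],
    "sms_message": ["see you soon", "WIN a prize now", "FREE entry"],
})

mapped = toy["label"].map({"ham": 0, "spam": 1})

# Values absent from the mapping become NaN rather than raising an error.
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)
```

If the count is non-zero, the raw labels probably need cleaning (for example lowercasing) before the mapping is applied.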


### Step 2.1: Bag of Words (BoW)  

A dataset containing 5,572 SMS messages is being examined.  
Since most machine learning algorithms are designed to operate on **numerical feature vectors**, a direct use of raw text is not possible.  
Therefore, a transformation into a numerical representation is required before the data can be processed.  

---

### What is Bag of Words (BoW)?  

The **Bag of Words (BoW)** model is considered one of the most fundamental text representation techniques in Natural Language Processing (NLP).  

In this approach:  
- Each text document is regarded as a *“bag”* of words.  
- The order of the words is disregarded.  
- Only the **frequency** (or presence) of words is retained.  

As a result, a **document-term matrix (DTM)** is produced, where:  
- Each **row** is associated with a document (e.g., an SMS message).  
- Each **column** corresponds to a unique word (token).  
- Each **cell** contains the frequency of that word in the corresponding document.  

---

### Why is BoW important?  

- **Numerical input for ML models**  
  By this method, text is converted into numeric input that can be used by algorithms such as Naïve Bayes, Logistic Regression, and SVMs.  

- **Foundation for other techniques**  
  **TF-IDF** reweights BoW counts directly, while embedding-based representations such as **Word2Vec**, **GloVe**, and contextual models (**BERT, GPT**) were developed in part to address BoW's limitations.  

- **Efficiency & simplicity**  
  Despite its simplicity, good performance is often achieved in classification tasks (e.g., spam detection).  

- **Interpretability**  
  Since each feature corresponds directly to an actual word, interpretability is enhanced. For instance, if words like “win” or “prize” are assigned high weights in a classifier, the decision can be intuitively understood.  

---

### Example  

Consider the following four short documents:  

```python
['Hello, how are you!',  
 'Win money, win from home.',  
 'Call me now',  
 'Hello, call you tomorrow?']
 ```

**Vocabulary creation**

From the text collection, the following unique tokens (after lowercasing and ignoring punctuation) are extracted:

```python
['hello', 'how', 'are', 'you', 'win', 'money', 'from', 'home', 'call', 'me', 'now', 'tomorrow']
```

Each document is then represented as a vector of token frequencies:  

| Document                         | hello | how | are | you | win | money | from | home | call | me | now | tomorrow |
|----------------------------------|:-----:|:---:|:---:|:---:|:---:|:-----:|:----:|:----:|:----:|:--:|:---:|:--------:|
| Doc1: "Hello, how are you!"      |   1   |  1  |  1  |  1  |  0  |   0   |  0   |  0   |  0   | 0  |  0  |    0     |
| Doc2: "Win money, win from home."|   0   |  0  |  0  |  0  |  2  |   1   |  1   |  1   |  0   | 0  |  0  |    0     |
| Doc3: "Call me now"              |   0   |  0  |  0  |  0  |  0  |   0   |  0   |  0   |  1   | 1  |  1  |    0     |
| Doc4: "Hello, call you tomorrow?"|   1   |  0  |  0  |  1  |  0  |   0   |  0   |  0   |  1   | 0  |  0  |    1     |

---
### Limitations of BoW

Although widely used, several limitations are associated with BoW:

- **Loss of word order and context**
Sentences such as “dog bites man” and “man bites dog” are represented identically.

- **High dimensionality**
A large vocabulary produces very sparse matrices, which can increase memory usage and computation time.

- **No semantic understanding**
Words such as “reward” and “prize” are treated as unrelated, even though their meanings are close.

- **Sensitivity to vocabulary**
Misspellings, capitalization, or rare words can result in unnecessary dimensions.
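
The first and third limitations can be seen in a few lines of plain Python. This is a minimal sketch using a hypothetical helper `bag_of_words` (whitespace tokenization only, not scikit-learn's tokenizer):

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for a lowercased, whitespace-tokenized string."""
    return Counter(text.lower().split())

# Word order is lost: both sentences produce the same bag.
a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")
print(a == b)

# No semantics: 'reward' and 'prize' are simply two unrelated keys.
c = bag_of_words("you won a reward")
d = bag_of_words("you won a prize")
shared = set(c) & set(d)
print(sorted(shared))
```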

---



### Step 2.2: Implementing Bag of Words from Scratch  

Before turning to a library implementation, a minimal Bag of Words pipeline is constructed manually to reveal the operations performed behind the scenes.

**Step 1 — Lowercasing**  
All strings in the document set are converted to lower case. This normalization ensures that tokens differing only by case (e.g., “Hello” vs “hello”) are treated identically.

**Given documents**
```python
documents = [
    'Hello, how are you!',
    'Win money, win from home.',
    'Call me now.',
    'Hello, Call hello you tomorrow?'
]

```
>**Instructions:**
* All strings are to be converted to lower case and stored in a list named `lower_case_documents`.
* The Python string method **.lower()** is to be applied (e.g., via a list comprehension).


In [4]:
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

lower_case_documents = [i.lower() for i in documents]
print(lower_case_documents)

['hello, how are you!', 'win money, win from home.', 'call me now.', 'hello, call hello you tomorrow?']


**Step 2: Removing all punctuation**
- After converting to lowercase, punctuation symbols are removed to prevent them from being treated as distinct tokens. For example, `"hello!"` and `"hello"` should not be considered different words.  

In [5]:
import re

# Remove all punctuation from the strings in the document set. 
# Save them into a list 
sans_punctuation_documents = []
for i in lower_case_documents:
    sans_punctuation_documents.append(re.sub(r'[^\w\s]', '', i))

print(sans_punctuation_documents)

['hello how are you', 'win money win from home', 'call me now', 'hello call hello you tomorrow']
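
The regular expression used in the cell above is one of several ways to strip punctuation. As a hedged comparison, the sketch below contrasts it with `str.translate` and `string.punctuation`; the sample string is made up for illustration, and the two approaches differ on underscores and on non-ASCII punctuation.

```python
import re
import string

# Made-up sample containing an underscore and a non-ASCII ellipsis (U+2026).
text = "hello_world, it's 50% off\u2026 call now!"

# Regex approach from the cell above: drop everything that is not a word
# character or whitespace. The underscore counts as a word character, so it stays.
via_regex = re.sub(r"[^\w\s]", "", text)

# Alternative: delete exactly the ASCII characters listed in string.punctuation.
# The underscore is in that list, but the non-ASCII ellipsis is not.
via_translate = text.translate(str.maketrans("", "", string.punctuation))

print(via_regex)
print(via_translate)
```

Which variant is preferable depends on the data; for SMS text with informal symbols, it is worth checking which characters actually survive.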


**Step 3: Tokenization**

Tokenization refers to the process of splitting sentences into individual words (tokens). A **delimiter** is used to determine where one word ends and the next begins. In most cases, whitespace is used as the delimiter; calling `split()` without arguments splits on any run of whitespace.

In [6]:
# Tokenize the strings stored in sans_punctuation_documents using the 
# split() method. Store the final document set in a list 

#preprocessed_documents = [i.split() for i in sans_punctuation_documents]

preprocessed_documents = []
for i in sans_punctuation_documents:
    preprocessed_documents.append(i.split())

print(preprocessed_documents)


[['hello', 'how', 'are', 'you'], ['win', 'money', 'win', 'from', 'home'], ['call', 'me', 'now'], ['hello', 'call', 'hello', 'you', 'tomorrow']]
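
The choice between `split()` and `split(" ")` matters when messages contain repeated spaces or tabs. A small illustration on a made-up string:

```python
text = "call  me\tnow"  # two spaces, then a tab

no_arg = text.split()       # splits on any run of whitespace
one_space = text.split(" ")  # splits on every single space character only

print(no_arg)
print(one_space)
```

With the single-space delimiter, an empty token appears between the two spaces and the tab is left inside a token, which is why the argument-free form is used above.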



**Step 4: Counting Frequencies**  

Once the document set has been tokenized into lists of words, the next step is to determine how often each word appears in every document. This is accomplished by computing the frequency of occurrence for each token.  

For this purpose, the `Counter` class from Python’s `collections` library is utilized. The `Counter` automatically tallies the frequency of each element in a list and produces a dictionary-like object, where the keys represent the items being counted and the values correspond to their respective counts.

In [7]:
# Using the Counter class with preprocessed_documents as input,
# create one dictionary-like object per document, with the keys being each word in that document
# and the corresponding values being the frequency of occurrence of that word.
# Save each Counter as an item in a list called frequency_list.

import pprint
from collections import Counter

frequency_list = [Counter(i) for i in preprocessed_documents]

pprint.pprint(frequency_list)


[Counter({'hello': 1, 'how': 1, 'are': 1, 'you': 1}),
 Counter({'win': 2, 'money': 1, 'from': 1, 'home': 1}),
 Counter({'call': 1, 'me': 1, 'now': 1}),
 Counter({'hello': 2, 'call': 1, 'you': 1, 'tomorrow': 1})]
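
The list of counters can be turned into a document-term matrix by fixing a column order. The sketch below is one possible way to finish the from-scratch pipeline; the names `vocabulary` and `doc_term_matrix` are illustrative, and alphabetical ordering is an assumption chosen so that the columns are stable.

```python
from collections import Counter

# Same token lists as produced in the tokenization step above.
preprocessed_documents = [
    ['hello', 'how', 'are', 'you'],
    ['win', 'money', 'win', 'from', 'home'],
    ['call', 'me', 'now'],
    ['hello', 'call', 'hello', 'you', 'tomorrow'],
]
frequency_list = [Counter(doc) for doc in preprocessed_documents]

# Vocabulary: every distinct token, sorted alphabetically for a stable column order.
vocabulary = sorted(set().union(*frequency_list))

# One row per document; a Counter returns 0 for tokens the document does not contain.
doc_term_matrix = [[counts[word] for word in vocabulary] for counts in frequency_list]

print(vocabulary)
for row in doc_term_matrix:
    print(row)
```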


### Step 2.3: Implementing Bag of Words in scikit-learn ###

In [8]:
'''
Here we will look to create a frequency matrix on a smaller document set to make sure we understand how the 
document-term matrix generation happens. We have created a sample document set 'documents'.
'''
documents = ['Hello, how are you!',
                'Win money, win from home.',
                'Call me now.',
                'Hello, Call hello you tomorrow?']

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()

**Data preprocessing with CountVectorizer()**

In Step 2.2, a simplified version of what `CountVectorizer()` does was implemented manually, which required preprocessing steps such as converting all text to lowercase and removing punctuation. `CountVectorizer()` provides several parameters that handle these operations automatically:  

* **`lowercase = True`**  

    By default, the `lowercase` parameter is set to `True`, ensuring that all text is automatically converted into lowercase form before further processing.  

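* **`token_pattern = r'(?u)\b\w\w+\b'`**  

    The `token_pattern` parameter defaults to the regular expression `(?u)\b\w\w+\b` (the `get_params()` output shows it with doubled backslashes, which is how Python prints the string). With this setting, punctuation marks are ignored and treated as delimiters, while alphanumeric strings of length two or greater are identified as valid tokens, so single-character words such as "a" or "i" are dropped.  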

* **`stop_words`**  

    The `stop_words` parameter, when specified as `'english'`, removes all tokens from the document set that match a predefined list of English stop words contained within scikit-learn. Since the dataset in question consists of short SMS messages rather than larger bodies of text such as emails, this parameter is not applied in this case.  
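
The effect of the default `token_pattern` can be previewed with Python's `re` module alone. This is a sketch on a made-up sentence; it applies the same pattern after lowercasing and is not a substitute for the vectorizer itself.

```python
import re

# Default token pattern of CountVectorizer, written as a raw string.
token_pattern = r"(?u)\b\w\w+\b"

text = "Hello, I am a dog!"
tokens = re.findall(token_pattern, text.lower())
print(tokens)
```

The single-character words "i" and "a" do not appear in the result because the pattern requires at least two word characters.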


In [10]:
'''
Get parameters of the 'count_vector' object which is an instance of 'CountVectorizer()'
'''
count_vector.get_params()

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.int64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': None,
 'min_df': 1,
 'ngram_range': (1, 1),
 'preprocessor': None,
 'stop_words': None,
 'strip_accents': None,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': None,
 'vocabulary': None}

In [11]:
# Fitting the document dataset to the CountVectorizer() object
# Get the list of words which have been categorized as features by using the method get_feature_names_out()

count_vector.fit(documents)
count_vector.get_feature_names_out()

array(['are', 'call', 'from', 'hello', 'home', 'how', 'me', 'money',
       'now', 'tomorrow', 'win', 'you'], dtype=object)

The `get_feature_names_out()` method returns the output feature names for this dataset, which is the set of words that make up the vocabulary for 'documents'.

In [12]:
# Create a matrix with the rows being each of the 4 documents, and the columns being each word.
# The corresponding (row, column) value is the frequency of occurrence of that word.
# Achieve this using the `transform()` method and passing in the document data set as the
# argument. The `transform()` method returns a SciPy sparse matrix of integer counts;
# we can convert this to a dense NumPy array using `toarray()`. Call the array `doc_array`.

doc_array = count_vector.transform(documents).toarray()
doc_array

array([[1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 2, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1]], dtype=int64)

The array above gives a clean representation of the documents in terms of the frequency distribution of their words. To make it easier to read, the next step is to convert this array into a DataFrame and name the columns appropriately.

In [13]:
# Convert the array stored in doc_array into a dataframe and
# set the column names to the word names (which were computed earlier using get_feature_names_out())

feature_names = count_vector.get_feature_names_out()
frequency_matrix = pd.DataFrame(doc_array, columns = feature_names)
frequency_matrix

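|   | are | call | from | hello | home | how | me | money | now | tomorrow | win | you |
|---|:---:|:----:|:----:|:-----:|:----:|:---:|:--:|:-----:|:---:|:--------:|:---:|:---:|
| 0 |  1  |  0   |  0   |   1   |  0   |  1  | 0  |   0   |  0  |    0     |  0  |  1  |
| 1 |  0  |  0   |  1   |   0   |  1   |  0  | 0  |   1   |  0  |    0     |  2  |  0  |
| 2 |  0  |  1   |  0   |   0   |  0   |  0  | 1  |   0   |  1  |    0     |  0  |  0  |
| 3 |  0  |  1   |  0   |   2   |  0   |  0  | 0  |   0   |  0  |    1     |  0  |  1  |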


One limitation of applying this method directly is that in the case of very large text datasets (for instance, collections of news articles or large volumes of emails), certain words will naturally occur far more frequently simply due to the structure of the language. Words such as *is*, *the*, *an*, pronouns, and other grammatical constructs can dominate the matrix and distort the analysis.  

To mitigate this issue, two approaches are commonly employed:  

- The `stop_words` parameter can be set to `'english'`. When enabled, all words that appear in scikit-learn’s built-in list of English stop words are automatically removed from the input text.  

- The `TfidfVectorizer` class can be used as an alternative. This approach reduces the weight of very frequent words and increases the importance of less common but informative words. While powerful, the details of this method are beyond the current lesson’s scope.  


### Step 3.1: Training and testing sets ###

In [14]:
print(df.columns)

Index(['label', 'sms_message'], dtype='object')


In [15]:
# Split the dataset into a training and testing set

from sklearn.model_selection import train_test_split

X = df["sms_message"]
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)

print(f'Number of rows in the total set: {df.shape[0]}')
print(f'Number of rows in the training set: {X_train.shape[0]}')
print(f'Number of rows in the test set: {X_test.shape[0]}')

Number of rows in the total set: 5572
Number of rows in the training set: 3733
Number of rows in the test set: 1839


### Step 3.2: Applying Bag of Words processing to our dataset. ###

Now that the data has been split, the next objective is to follow the steps from Step 2 (Bag of Words) and convert the data into the desired matrix format. To do this, `CountVectorizer()` is used as before. There are two steps to consider here:

* First, fit `CountVectorizer()` on the training data (`X_train`), which learns the vocabulary, and transform that data into a document-term matrix.
* Second, transform the testing data (`X_test`) into a document-term matrix using the vocabulary learned from the training data, without fitting again.

Note that `X_train` is the training data for the 'sms_message' column in the dataset, and it is used to train the model.

`X_test` is the testing data for the 'sms_message' column, and it is the data on which predictions are made (after transformation to a matrix).

This is similar to the process in Step 2.3.

In [16]:
# Instantiate a CountVectorizer object
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data and return the matrix. Note we are not fitting the testing data into the CountVectorizer()
testing_data = count_vector.transform(X_test)
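
The reason the test data is only transformed can be illustrated without scikit-learn. The sketch below uses hypothetical helpers (`tokenize`, `fit_vocabulary`, `transform`) that mimic the fit/transform split in plain Python; they are not scikit-learn's API, and the toy documents are made up.

```python
from collections import Counter

def tokenize(text):
    return text.lower().split()

def fit_vocabulary(docs):
    """Learn a sorted vocabulary from the training documents only."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def transform(docs, vocabulary):
    """Count vocabulary words per document; words outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[word] for word in vocabulary])
    return rows

train_docs = ["win money now", "call me now"]
test_docs = ["win a free prize now"]

vocabulary = fit_vocabulary(train_docs)
train_matrix = transform(train_docs, vocabulary)
test_matrix = transform(test_docs, vocabulary)

print(vocabulary)
print(test_matrix)
```

Because the vocabulary comes from the training documents alone, both matrices share the same columns, and words that appear only in the test message ("a", "free", "prize") contribute nothing.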

### Step 4.1: Bayes Theorem Implementation from Scratch ###

Once the dataset has been prepared in the required format, the next stage involves the implementation of the algorithm that will be applied to classify whether a message should be considered spam or not. At the beginning of this project, Bayes’ theorem was briefly introduced; however, more detailed attention will now be given.

In simple terms, Bayes’ theorem is used to calculate the probability of an event occurring, conditioned on prior knowledge of related probabilities. The theorem is composed of two main components:

- **Prior Probability (Priors):** Probabilities known beforehand or assumed from external knowledge.  
- **Posterior Probability (Posterior):** Probabilities that are updated after observing evidence (computed using priors).

---

### Example: Medical Test for Diabetes

To demonstrate the concept, consider the following illustrative example: the probability of an individual having diabetes given that a medical test has returned a positive result. In medicine, such probabilistic reasoning is critical since it directly influences diagnostic decisions, often with life-or-death consequences.

The following assumptions are made (note: purely hypothetical, not based on real data):

- **P(D):** Probability of having diabetes.  
  - Assumed to be 0.01 (1% of the general population).  

- **P(Pos):** Probability of receiving a positive test result.  

- **P(Neg):** Probability of receiving a negative test result.  

- **P(Pos|D):** Probability of a positive result given that diabetes is present.  
  - Value: 0.9 (i.e., the test is correct 90% of the time for diabetic individuals).  
  - This is also known as **Sensitivity** or **True Positive Rate (TPR)**.  

- **P(Neg|~D):** Probability of a negative result given that diabetes is absent.  
  - Value: 0.9 (i.e., the test is correct 90% of the time for non-diabetic individuals).  
  - This is also known as **Specificity** or **True Negative Rate (TNR)**.  

---

### Additional Notes

- If **specificity** is high but **sensitivity** is low, false negatives are more likely (patients with diabetes may be missed).  
- If **sensitivity** is high but **specificity** is low, false positives are more likely (healthy individuals may be incorrectly diagnosed).  
- In machine learning, this trade-off between sensitivity and specificity is often managed using a **confusion matrix**, **precision/recall metrics**, and **ROC curves**.  
- Bayes’ theorem forms the foundation of the **Naïve Bayes classifier**, a popular algorithm in Natural Language Processing (NLP) for tasks such as spam detection, sentiment analysis, and document classification.  

---

In the next step, the Bayes theorem will be implemented from scratch using this example, allowing us to observe how priors and conditional probabilities are combined to compute posterior probabilities.


### Bayes’ Theorem Formula

The Bayes theorem can be expressed as follows:
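
In case the image does not render, the standard statement of the theorem, which the term definitions in this section appear to follow, can be written in LaTeX as:

$$
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
$$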

<img src="images/bayes_formula.png" height="242" width="242">

Where each term is defined as:

- **P(A):** The *prior probability* of event **A** occurring independently.  
  - In this example, `P(D)` represents the probability of an individual having diabetes.  
  - This value is assumed or given to us beforehand.

- **P(B):** The *marginal probability* of event **B** (the evidence), regardless of whether **A** occurred.  
  - In this example, `P(Pos)` represents the probability of a positive test result.  
  - This value is computed using the **law of total probability**:
    $$
    P(Pos) = P(Pos \mid D) \cdot P(D) + P(Pos \mid \lnot D) \cdot P(\lnot D)
    $$
    This ensures all possible scenarios are accounted for.

- **P(A|B):** The *posterior probability* that event **A** occurs given that event **B** has occurred.  
  - In this example, `P(D|Pos)` represents the probability of an individual having diabetes **given that** the test returned positive.  
  - This is the **value of interest** that is to be calculated.

- **P(B|A):** The *likelihood probability* of event **B** occurring, given that event **A** has occurred.  
  - In this example, `P(Pos|D)` represents the probability of receiving a positive result **given that diabetes is present**.  
  - This value is provided (0.9 in our case).

---

### Additional Insights

- The **denominator** `P(B)` plays a critical role as a *normalizing constant*.  
  Without it, probabilities could exceed 1 or fail to sum to 1 across all possible outcomes.  

- A common pitfall is the **base rate fallacy**, where the low prevalence of a condition (e.g., diabetes at 1%) is ignored.  
  Even with a highly accurate test, the posterior probability of truly having the condition may remain surprisingly low.

- In practical machine learning applications, this formula is extended across multiple features under the **Naïve Bayes assumption** (features are conditionally independent).  
  Despite the simplifying assumption, Naïve Bayes often performs remarkably well in high-dimensional domains such as text classification.

---

In the subsequent implementation, these probabilities will be plugged into the formula to compute `P(D|Pos)` step by step, before extending the approach to spam classification.


### Substituting Values into Bayes’ Theorem

Using the diabetes-testing example, the posterior probability is obtained by substituting the assumed values into Bayes’ theorem:

$$
P(D \mid Pos) = \frac{P(D)\, P(Pos \mid D)}{P(Pos)}
$$

The denominator $P(Pos)$ (the marginal probability of a positive test) is computed via **sensitivity** and **specificity** using the law of total probability:

$$
P(Pos) = P(D)\cdot \underbrace{P(Pos \mid D)}_{\text{Sensitivity}}
+ P(\lnot D)\cdot \underbrace{(1-\text{Specificity})}_{P(Pos \mid \lnot D)}
$$

---

#### Numerical Illustration (with the assumed values)
- $P(D) = 0.01$ (1% prevalence), hence $P(\lnot D) = 0.99$.  
- Sensitivity $= P(Pos \mid D) = 0.9$.  
- Specificity $= P(Neg \mid \lnot D) = 0.9 \;\Rightarrow\; (1-\text{Specificity}) = 0.1$.  

**Step 1 — Compute $P(Pos)$:**

$$
P(Pos) = 0.01 \cdot 0.9 + 0.99 \cdot 0.1
= 0.009 + 0.099 = 0.108
$$

**Step 2 — Compute $P(D \mid Pos)$:**

$$
P(D \mid Pos) = \frac{0.01 \cdot 0.9}{0.108}
= \frac{0.009}{0.108}
\approx 0.0833 \; (8.33\%)
$$
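
The two steps above can be reproduced in a few lines of Python. This is a minimal sketch using the same hypothetical values; the function name `posterior_given_positive` is illustrative.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """Return (P(Pos), P(D | Pos)) for a binary test."""
    # Law of total probability for the marginal P(Pos).
    p_pos = prior * sensitivity + (1 - prior) * (1 - specificity)
    # Bayes' theorem for the posterior P(D | Pos).
    return p_pos, prior * sensitivity / p_pos

# Assumed values from the illustration (hypothetical, not real medical data).
p_pos, p_d_given_pos = posterior_given_positive(prior=0.01, sensitivity=0.9, specificity=0.9)

print(f"P(Pos)   = {p_pos:.3f}")
print(f"P(D|Pos) = {p_d_given_pos:.4f}")
```

Changing `prior` while keeping the test characteristics fixed shows how strongly the posterior depends on prevalence.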

> **Interpretation:** Despite a highly accurate test (90% sensitivity & specificity), the **posterior** probability of actually having diabetes after a positive result is only about **8.33%** because the **base rate** (prevalence) is low. This is a classic illustration of the **base-rate fallacy**.

---

### Tips, Tricks, and Useful Technical Notes

- **Likelihood Ratios (LR):** Updates can be simplified by using  

$$
LR^+ = \frac{\text{Sensitivity}}{1-\text{Specificity}}, \quad
LR^- = \frac{1-\text{Sensitivity}}{\text{Specificity}}
$$

and updating **odds** instead of probabilities:  

$$
Posterior \; Odds = Prior \; Odds \times LR^+
$$

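for a positive test.  
(With the numbers above, $LR^+ = 0.9/0.1 = 9$ and prior odds $= 0.01/0.99 \approx 0.0101$, so posterior odds $\approx 0.0909$ and $P(D \mid Pos) = 0.0909/1.0909 \approx 0.0833$, matching the direct calculation.)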

- **PPV & NPV:**  
  - **Positive Predictive Value (PPV)** equals $P(D \mid Pos)$ (computed above).  
  - **Negative Predictive Value (NPV)** equals $P(\lnot D \mid Neg)$, which increases as prevalence decreases and specificity/sensitivity improve.  

- **Calibration Awareness:** Test performance (sensitivity/specificity) is population-dependent. If actual deployment data differ from the study population, recalibration may be required.  

- **Numerical Stability (Logs):** In high-dimensional ML (e.g., Naïve Bayes for text), products of many small probabilities should be handled in **log-space** to avoid underflow:  

$$
\log P = \sum \log P_i
$$

- **Connection to Spam Filtering (Naïve Bayes):**
  - **Class Priors:** Set $P(\text{spam})$ and $P(\text{ham})$ from corpus frequencies (class imbalance matters).  
  - **Conditional Likelihoods:** Estimate $P(\text{token} \mid \text{class})$ with **Laplace smoothing** (e.g., $\alpha=1$) to avoid zero probabilities.  
  - **Decision Rule:** Compute log-posterior for each class and choose the larger. Thresholds can be shifted to trade off **precision vs. recall**.  
  - **Feature Independence Assumption:** Although “naïve,” it often performs strongly in text due to sparse, high-dimensional signals.  
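
The odds-form update from the Likelihood Ratios note can be sketched in a few lines. This is a minimal illustration using the diabetes numbers above; the function name `posterior_from_likelihood_ratio` is a hypothetical helper, not part of the notebook.

```python
def posterior_from_likelihood_ratio(prior: float, sensitivity: float, specificity: float) -> float:
    """Return P(D | Pos) using the odds form of Bayes' theorem."""
    lr_positive = sensitivity / (1 - specificity)  # LR+
    prior_odds = prior / (1 - prior)               # P(D) / P(~D)
    posterior_odds = prior_odds * lr_positive      # odds update for a positive test
    return posterior_odds / (1 + posterior_odds)   # convert odds back to a probability


# Values assumed in the numerical illustration: 1% prevalence, 90% sensitivity and specificity.
posterior = posterior_from_likelihood_ratio(prior=0.01, sensitivity=0.9, specificity=0.9)
print(f"P(D | Pos) via odds update: {posterior:.4f}")
```

The result agrees with the direct application of Bayes' theorem, since both forms are algebraically equivalent.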

---

**Quick sanity check for your notes**
- Posterior: $P(D \mid Pos) = \dfrac{P(D)\,P(Pos \mid D)}{P(Pos)}$  
- Marginal: $P(Pos) = P(D)\cdot \text{Sensitivity} + P(\lnot D)\cdot (1-\text{Specificity})$  

> In subsequent sections, the same mechanics can be mirrored to implement a Naïve Bayes classifier from scratch for spam detection, swapping medical events with **class labels** (spam/ham) and **evidence** with token occurrences.
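
The log-space note above can be made concrete with a small sketch. The token probabilities below are hypothetical values chosen only to trigger underflow; they do not come from the SMS dataset.

```python
import math

# Hypothetical per-token likelihoods: 1000 tokens, each with probability 1e-4.
token_probs = [1e-4] * 1000

# Multiplying directly underflows to 0.0 in double precision.
naive_product = 1.0
for p in token_probs:
    naive_product *= p

# Summing logarithms keeps the score finite and comparable across classes.
log_score = sum(math.log(p) for p in token_probs)

print("Direct product:", naive_product)
print("Sum of logs:", log_score)
```

Because the logarithm is monotonic, comparing log-scores between classes selects the same class as comparing the original products would.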



In [17]:
# Calculate probability of getting a positive test result, P(Pos)

# P(D)
p_diabetes = 0.01

# P(~D)
p_no_diabetes = 0.99

# Sensitivity or P(Pos|D)
p_pos_diabetes = 0.9

# Specificity or P(Neg|~D)
p_neg_no_diabetes = 0.9

# P(Pos)
p_pos = (p_diabetes * p_pos_diabetes) + (p_no_diabetes * (1 - p_neg_no_diabetes))
print(f'The probability of getting a positive test result P(Pos) is: {p_pos:.3f}')

The probability of getting a positive test result P(Pos) is: 0.108


### Calculating Posterior Probabilities

Using the previously defined values, the posterior probabilities can now be calculated:

- **Probability of having diabetes, given a positive test result:**

$$
P(D \mid Pos) = \frac{P(D) \cdot \text{Sensitivity}}{P(Pos)}
$$

- **Probability of not having diabetes, given a positive test result:**

$$
P(\lnot D \mid Pos) = \frac{P(\lnot D) \cdot (1 - \text{Specificity})}{P(Pos)}
$$

---

### Important Observation

The sum of the posteriors will always equal **1**:

$$
P(D \mid Pos) + P(\lnot D \mid Pos) = 1
$$

This property ensures that the probabilities are properly normalized after applying Bayes’ theorem.

---

### Additional Notes

- In binary classification, the two posterior probabilities ($P(D \mid Pos)$ and $P(\lnot D \mid Pos)$) fully describe the model’s belief after observing the evidence.
- In **multi-class Naïve Bayes**, this idea generalizes:  
  $$
  \sum_{i=1}^k P(C_i \mid x) = 1
  $$
  where $C_i$ are the possible classes and $x$ is the observed evidence.
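- Normalization is needed to interpret the scores as probabilities, because the unnormalized products $P(C_i)\,P(x \mid C_i)$ do not sum to 1. For choosing the most likely class alone, the unnormalized scores suffice, since every class shares the same denominator $P(x)$.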


In [18]:


# Compute the probability that an individual has diabetes, given that the individual got a positive test result.
# In other words, compute P(D|Pos).

# The formula is: P(D|Pos) = (P(D) * P(Pos|D)) / P(Pos)

# P(D|Pos)
p_diabetes_pos = (p_diabetes * p_pos_diabetes) / p_pos
print(f'Probability of an individual having diabetes, given '
      f'that that individual got a positive test result is: {p_diabetes_pos:.3f}') 


Probability of an individual having diabetes, given that that individual got a positive test result is: 0.083


In [19]:
'''
Compute the probability that an individual does not have diabetes, given that the individual got a positive test result.
In other words, compute P(~D|Pos).

The formula is: P(~D|Pos) = (P(~D) * P(Pos|~D)) / P(Pos)

Note that P(Pos|~D) can be computed as 1 - P(Neg|~D).

Therefore:
P(Pos|~D) = p_pos_no_diabetes = 1 - 0.9 = 0.1
'''
# P(Pos|~D)
p_pos_no_diabetes = 0.1

# P(~D|Pos)
p_no_diabetes_pos = (p_no_diabetes * p_pos_no_diabetes) / p_pos
print(f'Probability of an individual not having diabetes, given '
      f'that individual got a positive test result is: {p_no_diabetes_pos:.3f}')

Probability of an individual not having diabetes, given that individual got a positive test result is: 0.917


### What Does the Term *Naïve* in "Naïve Bayes" Mean?

The term *naïve* refers to the simplifying assumption made by the algorithm: **all features are considered conditionally independent of each other, given the class label.**  

This assumption is often unrealistic in practice, since real-world features are frequently correlated.  
- In the diabetes example, only one feature (the test result) was considered, so independence was not an issue.  
- Suppose an additional feature is introduced, such as *exercise habits*. Let this variable take a binary value:  
  - `0` → the individual exercises less than or equal to 2 days a week  
  - `1` → the individual exercises 3 or more days a week  
  If both *test result* and *exercise* were to be included, Bayes’ theorem in its pure form would require modeling their **joint distribution**, and the number of parameters to estimate grows exponentially with the number of features.

Naïve Bayes extends Bayes’ theorem by assuming that the features are **conditionally independent given the class**, so that the joint distribution can be factorized into simpler conditional probabilities:

$$
P(x_1, x_2, \dots, x_n \mid C) = \prod_{i=1}^{n} P(x_i \mid C)
$$

This drastically reduces the number of parameters that need to be estimated.
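
To give a rough sense of the reduction, the sketch below counts free parameters per class for binary features. It assumes every feature is binary (as with the test result and exercise variables above); both function names are hypothetical.

```python
def full_joint_params(n_features: int) -> int:
    """Free parameters per class for an unrestricted joint over n binary features."""
    return 2 ** n_features - 1


def naive_bayes_params(n_features: int) -> int:
    """Free parameters per class when binary features are conditionally independent."""
    return n_features


for n in (2, 10, 20):
    print(f"n={n:>2}  full joint: {full_joint_params(n):>9,}  naive Bayes: {naive_bayes_params(n)}")
```

With two features the difference is negligible (3 versus 2), but the unrestricted joint grows exponentially while the factorized model grows linearly.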

---

### Why the "Naïve" Assumption Works Surprisingly Well

- Even though strict independence is rarely true, the algorithm often performs remarkably well in practice—especially in **text classification** tasks (spam detection, sentiment analysis, document categorization).  
- The assumption simplifies learning and inference, while still capturing enough signal to be useful in high-dimensional, sparse domains.  
- With proper smoothing (e.g., **Laplace or Lidstone smoothing**), the algorithm can handle unseen words or rare features effectively.
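
The smoothing mentioned above can be sketched as follows. This is a minimal illustration with a hypothetical four-word vocabulary and made-up spam tokens; it assumes every token belongs to the vocabulary.

```python
from collections import Counter


def smoothed_likelihoods(tokens, vocabulary, alpha=1.0):
    """Estimate P(word | class) with additive (Laplace/Lidstone) smoothing."""
    counts = Counter(tokens)
    denominator = len(tokens) + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / denominator for word in vocabulary}


# Hypothetical tokens observed in spam messages; "meeting" never appears in spam.
spam_tokens = ["free", "win", "free", "prize"]
vocabulary = ["free", "win", "prize", "meeting"]

probs = smoothed_likelihoods(spam_tokens, vocabulary, alpha=1.0)
for word, p in probs.items():
    print(f"P({word} | spam) = {p:.3f}")
```

With `alpha=1.0` the unseen word receives a small non-zero probability, and the estimates still sum to 1 over the vocabulary.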

---

### Key Takeaways
- *Naïve* = assumes **conditional independence of the features given the class**.  
- Advantage: simplifies computation and mitigates the curse of dimensionality, since each feature's likelihood is estimated separately.  
- Limitation: performance may degrade if strong feature dependencies exist (e.g., overlapping signals).  
- Despite being naïve, the method remains a **baseline model of choice** for many classification problems due to its speed, simplicity, and interpretability.


### Step 4.2: Naive Bayes implementation from scratch ###

#### Extending Bayes’ Theorem to Multiple Features

Once the foundations of Bayes’ theorem have been understood, the next step is to extend the framework to situations where **more than one feature** must be considered simultaneously.

---

#### Example: Political Candidates and Word Usage

Suppose two candidates are being observed:

- Jill Stein (Green Party)  
- Gary Johnson (Libertarian Party)  

The following probabilities are assumed regarding the likelihood of each candidate using specific words during a speech:

- Jill Stein:  
  - $P(F \mid J) = 0.1$ (says "freedom")  
  - $P(I \mid J) = 0.1$ (says "immigration")  
  - $P(E \mid J) = 0.8$ (says "environment")  

- Gary Johnson:  
  - $P(F \mid G) = 0.7$  
  - $P(I \mid G) = 0.2$  
  - $P(E \mid G) = 0.1$  

Additionally, it is assumed that the prior probability of either candidate giving a speech is equal:  

$$
P(J) = 0.5, \quad P(G) = 0.5
$$

---

#### Why Naïve Bayes?

If the probability of Jill Stein using both the words *freedom* and *immigration* is to be computed, the **joint probability** must be considered.  
This is where **Naïve Bayes** becomes valuable: it assumes that the features (*words*, in this case) are **conditionally independent** given the class (the candidate).

The joint probability is then factorized as:

$$
P(F, I \mid J) = P(F \mid J) \cdot P(I \mid J)
$$

rather than modeling $P(F, I \mid J)$ directly, which would otherwise require significantly more data.

---

#### Naïve Bayes Formula

The general form of the Naïve Bayes theorem is:

<img src="images/naivebayes.png" height="342" width="342">

Where:  

- $y$ → the **class variable** (in this example, the candidate).  
- $x_1, x_2, \dots, x_n$ → the **features** (in this example, the words spoken).  
- Independence is assumed among the features, simplifying the computation of the posterior:

$$
P(y \mid x_1, x_2, \dots, x_n) \propto P(y) \prod_{i=1}^n P(x_i \mid y)
$$

---

### Key Insights

- Without the independence assumption, modeling joint probabilities like $P(F, I, E \mid J)$ would require **exponentially more data**.  
- The *naïve* assumption dramatically reduces complexity while still achieving strong performance in many domains (especially text classification).  
- In practice, features are rarely truly independent; however, Naïve Bayes remains robust, especially in **high-dimensional, sparse settings** such as Natural Language Processing.  
- The decision rule is to select the class $y$ that maximizes the posterior probability:

$$
\hat{y} = \underset{y}{\operatorname{argmax}} \; P(y) \prod_{i=1}^n P(x_i \mid y)
$$
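
The decision rule can be sketched directly with the candidate example. The snippet below is an illustration only: the dictionary layout and the `classify` helper are assumptions, and scores are summed in log-space, which leaves the argmax unchanged.

```python
import math

# Priors and word likelihoods assumed in the example above.
priors = {"Jill Stein": 0.5, "Gary Johnson": 0.5}
likelihoods = {
    "Jill Stein": {"freedom": 0.1, "immigration": 0.1, "environment": 0.8},
    "Gary Johnson": {"freedom": 0.7, "immigration": 0.2, "environment": 0.1},
}


def classify(words):
    """Return the class maximizing log P(y) + sum_i log P(x_i | y)."""
    def log_score(candidate):
        return math.log(priors[candidate]) + sum(
            math.log(likelihoods[candidate][word]) for word in words
        )
    return max(priors, key=log_score)


print(classify(["freedom", "immigration"]))
print(classify(["environment"]))
```

No normalization is needed here, because the denominator is the same for every class and does not affect which class wins.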


### Computing Posterior Probabilities with Naïve Bayes

To classify which candidate is more likely to have said the words *freedom* and *immigration*, the posterior probabilities for each candidate must be computed.

---

#### 1. Posterior for Jill Stein

The posterior probability that Jill Stein is the speaker, given that both words (*freedom* and *immigration*) were heard, is:

$$
P(J \mid F, I) = \frac{P(J) \cdot P(F \mid J) \cdot P(I \mid J)}{P(F, I)}
$$

Where:
- $P(J) = 0.5$  
- $P(F \mid J) = 0.1$  
- $P(I \mid J) = 0.1$

Thus:

$$
P(J \mid F, I) = \frac{0.5 \cdot 0.1 \cdot 0.1}{P(F,I)}
= \frac{0.005}{P(F,I)}
$$

---

#### 2. Posterior for Gary Johnson

The posterior probability that Gary Johnson is the speaker, given that both words were heard, is:

$$
P(G \mid F, I) = \frac{P(G) \cdot P(F \mid G) \cdot P(I \mid G)}{P(F, I)}
$$

Where:
- $P(G) = 0.5$  
- $P(F \mid G) = 0.7$  
- $P(I \mid G) = 0.2$

Thus:

$$
P(G \mid F, I) = \frac{0.5 \cdot 0.7 \cdot 0.2}{P(F,I)}
= \frac{0.07}{P(F,I)}
$$

---

#### 3. Denominator $P(F,I)$

The denominator is the marginal probability of the words *freedom* and *immigration* being spoken in any speech, regardless of candidate:

$$
P(F,I) = P(J) \cdot P(F \mid J) \cdot P(I \mid J) \;+\; P(G) \cdot P(F \mid G) \cdot P(I \mid G)
$$

Substituting values:

$$
P(F,I) = (0.5 \cdot 0.1 \cdot 0.1) + (0.5 \cdot 0.7 \cdot 0.2)
= 0.005 + 0.07 = 0.075
$$

---

#### 4. Final Posterior Probabilities

- For Jill Stein:

$$
P(J \mid F, I) = \frac{0.005}{0.075} \approx 0.0667 \; (6.67\%)
$$

- For Gary Johnson:

$$
P(G \mid F, I) = \frac{0.07}{0.075} \approx 0.9333 \; (93.33\%)
$$

---

### Interpretation

- If both words (*freedom* and *immigration*) are heard in a speech, **Gary Johnson** is far more likely (93.3%) to be the speaker compared to Jill Stein (6.7%).  
- This illustrates how Naïve Bayes combines multiple features under the independence assumption to strongly favor one class over another, even when priors are equal.  


In [20]:
'''
Compute the probability of the words 'freedom' and 'immigration' being said in a speech, or
P(F,I).

The first step is multiplying the probabilities of Jill Stein giving a speech with her individual 
probabilities of saying the words 'freedom' and 'immigration'. Store this in a variable called p_j_text

The second step is multiplying the probabilities of Gary Johnson giving a speech with his individual 
probabilities of saying the words 'freedom' and 'immigration'. Store this in a variable called p_g_text

The third step is to add both of these probabilities and you will get P(F,I).
'''
# P(J)
p_j = 0.5

# P(F|J)
p_j_f = 0.1

# P(I|J)
p_j_i = 0.1

p_j_text = p_j * p_j_f * p_j_i
print(p_j_text)


0.005000000000000001


In [21]:
# P(G)
p_g = 0.5

# P(F|G)
p_g_f = 0.7

# P(I|G)
p_g_i = 0.2

p_g_text = p_g * p_g_f * p_g_i
print(p_g_text)

0.06999999999999999


In [22]:
'''
Step 3: Compute P(F,I) and store it in p_f_i
'''
p_f_i = p_j_text + p_g_text
print(f'Probability of the words freedom and immigration being said is: {p_f_i:.3f}')

Probability of the words freedom and immigration being said is: 0.075


### Computing $P(J \mid F,I)$ and $P(G \mid F,I)$ (Naïve Bayes, two features)

Using the Naïve Bayes assumption (features are conditionally independent given the class), the posteriors for **Jill Stein** ($J$) and **Gary Johnson** ($G$) conditioned on hearing the words *freedom* ($F$) and *immigration* ($I$) are obtained as follows:

---

#### Step 1 — Write the posterior forms
$$
P(J \mid F,I) \;=\; \frac{P(J)\,P(F\mid J)\,P(I\mid J)}{P(F,I)}, 
\qquad
P(G \mid F,I) \;=\; \frac{P(G)\,P(F\mid G)\,P(I\mid G)}{P(F,I)}.
$$

#### Step 2 — Compute the normalizer (marginal)
By the law of total probability:
$$
P(F,I) \;=\; P(J)\,P(F\mid J)\,P(I\mid J) \;+\; P(G)\,P(F\mid G)\,P(I\mid G).
$$

With the given values
$$
P(J)=P(G)=0.5,\quad
P(F\mid J)=0.1,\; P(I\mid J)=0.1,\quad
P(F\mid G)=0.7,\; P(I\mid G)=0.2,
$$
it follows that
$$
P(F,I) \;=\; (0.5)(0.1)(0.1) \;+\; (0.5)(0.7)(0.2)
\;=\; 0.005 \;+\; 0.070 \;=\; 0.075.
$$

#### Step 3 — Plug in and simplify
- For Jill Stein:
$$
P(J \mid F,I) \;=\; \frac{(0.5)(0.1)(0.1)}{0.075}
\;=\; \frac{0.005}{0.075}
\;\approx\; 0.0667 \;\text{ (6.67\%)}.
$$

- For Gary Johnson:
$$
P(G \mid F,I) \;=\; \frac{(0.5)(0.7)(0.2)}{0.075}
\;=\; \frac{0.070}{0.075}
\;\approx\; 0.9333 \;\text{ (93.33\%)}.
$$

---

### Notes
- The posteriors sum to one:
$$
P(J \mid F,I) + P(G \mid F,I) = 0.0667 + 0.9333 \approx 1.
$$
- The result strongly favors the candidate for whom both words are much more likely, despite equal class priors.  
- This procedure scales to additional words by multiplying the corresponding likelihood terms under the independence assumption:
$$
P(y \mid x_1,\dots,x_n) \propto P(y)\prod_{i=1}^n P(x_i \mid y).
$$
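
As an illustration of how the procedure scales, the sketch below adds a third word, *environment*, using the likelihoods assumed earlier. The `posteriors` helper and the dictionary layout are hypothetical, introduced only for this example.

```python
# Priors and word likelihoods assumed in the candidate example.
priors = {"Jill Stein": 0.5, "Gary Johnson": 0.5}
likelihoods = {
    "Jill Stein": {"freedom": 0.1, "immigration": 0.1, "environment": 0.8},
    "Gary Johnson": {"freedom": 0.7, "immigration": 0.2, "environment": 0.1},
}


def posteriors(words):
    """Return normalized posteriors P(y | words) under the naive assumption."""
    scores = {}
    for candidate, prior in priors.items():
        score = prior
        for word in words:
            score *= likelihoods[candidate][word]  # one likelihood term per word
        scores[candidate] = score
    evidence = sum(scores.values())  # marginal probability of the words
    return {candidate: score / evidence for candidate, score in scores.items()}


two_words = posteriors(["freedom", "immigration"])
three_words = posteriors(["freedom", "immigration", "environment"])
print({name: round(p, 4) for name, p in two_words.items()})
print({name: round(p, 4) for name, p in three_words.items()})
```

Adding *environment* multiplies Jill Stein's score by 0.8 and Gary Johnson's by 0.1, so her posterior rises from about 6.7% to about 36.4% even though Gary Johnson remains the more likely speaker.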


In [23]:
'''
Compute P(J|F,I) using the formula P(J|F,I) = (P(J) * P(F|J) * P(I|J)) / P(F,I) and store it in a variable p_j_fi
'''
p_j_fi = p_j_text / p_f_i
print(f'The probability that Jill Stein is the speaker, given the words Freedom and Immigration: {p_j_fi:.3f}')

The probability that Jill Stein is the speaker, given the words Freedom and Immigration: 0.067


In [24]:
'''
Compute P(G|F,I) using the formula P(G|F,I) = (P(G) * P(F|G) * P(I|G)) / P(F,I) and store it in a variable p_g_fi
'''
p_g_fi = p_g_text / p_f_i
print(f'The probability that Gary Johnson is the speaker, given the words Freedom and Immigration: {p_g_fi:.3f}')

The probability that Gary Johnson is the speaker, given the words Freedom and Immigration: 0.933


### Conclusion: Posteriors Sum to 1 and Final Interpretation

As observed—consistent with Bayes’ theorem—the **posterior probabilities are normalized** and their sum equals 1:

$$
P(J \mid F, I) + P(G \mid F, I) = 1.
$$

Based on the computed values:

- $P(J \mid F, I) \approx 0.0667$ (≈ **6.7%**)
- $P(G \mid F, I) \approx 0.9333$ (≈ **93.3%**)

**Interpretation:** Under the given assumptions and the naïve (conditional independence) factorization, a speech containing the words *“freedom”* and *“immigration”* was given by **Jill Stein** with probability approximately **6.7%** and by **Gary Johnson** with probability approximately **93.3%**. These values sum to 1 by construction due to the normalization via the marginal $P(F, I)$.

---

#### Quick Sanity Check (Optional)
A compact verification is shown below to confirm normalization:
$$
\frac{0.5 \cdot 0.1 \cdot 0.1}{0.075}
\;+\;
\frac{0.5 \cdot 0.7 \cdot 0.2}{0.075}
=
\frac{0.005 + 0.070}{0.075}
=
\frac{0.075}{0.075}
= 1.
$$

> **Note on rounding:** Percentages were rounded to one decimal place (6.7% and 93.3%). Minor differences (e.g., 6.6% vs. 6.7%) arise from rounding.


### A Generic Example of Naïve Bayes in Action

Consider the scenario where the term *“Sacramento Kings”* is entered into a search engine.  
For accurate results, the system must recognize the phrase as referring to the **NBA basketball team** rather than interpreting the words independently:

- If *“Sacramento”* and *“Kings”* are treated separately, results could include:
  - Images of the city of Sacramento (city landscapes).
  - Images of crowns or historical monarchs (the word *“Kings”*).  
- The desired outcome, however, is content related specifically to the **Sacramento Kings basketball team**.  

This mismatch occurs because a *naïve* approach assumes independence between words, failing to capture their joint meaning as a phrase.

---

### Connection to Spam Detection

The same principle applies when classifying text messages or emails as **spam** or **not spam**.  
- The Naïve Bayes classifier evaluates each word **individually**, without modeling associations between words.  
- While this independence assumption is simplistic, it is often effective in spam detection because certain *red-flag words* are strong indicators of spam.  

**Examples of spam-trigger words:**
- "viagra"
- "lottery"
- "100% free"
- "exclusive offer"

Even though context and word associations are ignored, the sheer presence of such words frequently provides enough evidence for accurate classification.

---

### Key Takeaway

- **Naïve Assumption:** Words/features are treated independently.  
- **Strength:** Works well in domains where individual tokens carry strong predictive power (e.g., spam filters, sentiment analysis).  
- **Limitation:** Fails to capture semantic relationships between words or phrases (e.g., “Sacramento Kings” as a unit).  

Despite being “naïve,” this simplicity is what makes the algorithm computationally efficient and surprisingly powerful in many real-world text classification problems.
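
One common way to soften this limitation is to add word pairs (bigrams) as extra features, so that a phrase such as “Sacramento Kings” becomes a token in its own right. The sketch below is a plain-Python illustration with a hypothetical `ngram_features` helper; it is not the tokenization used elsewhere in this notebook.

```python
def ngram_features(text, max_n=2):
    """Return all word n-grams of length 1..max_n from a whitespace-tokenized text."""
    tokens = text.lower().split()
    features = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features


print(ngram_features("Sacramento Kings"))
# ['sacramento', 'kings', 'sacramento kings']
```

The classifier still treats each feature as independent, but the phrase itself now contributes its own evidence alongside the individual words.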


## Step 5: Naïve Bayes Implementation with scikit-learn

Returning to the spam-classification setting, the implementation can be carried out using **scikit-learn** without deriving the math from first principles. The `sklearn.naive_bayes` module provides several variants; for text data represented as **discrete counts**, the **Multinomial Naïve Bayes** classifier is typically employed. In contrast, **Gaussian Naïve Bayes** is suited to **continuous** features under a (conditional) normality assumption.

---

### Why Multinomial NB for Text?
- Text documents are commonly represented as **bag-of-words** or **n-gram** **count vectors** (non-negative integers).  
- The Multinomial model assumes conditional independence of features and models counts with a multinomial likelihood, which aligns with token-count inputs (e.g., from `CountVectorizer` or `TfidfVectorizer` with non-negative features).

> **Rule of thumb:**  
> - Use **MultinomialNB** for **counts** or **TF–IDF** features (TF–IDF must be non-negative).  
> - Use **GaussianNB** for continuous, real-valued features (e.g., sensor readings).  
> - Use **BernoulliNB** for **binary** features (e.g., “word present/absent”).

---

### Minimal, Reproducible Pipeline (Recommended)

It is considered best practice to wrap vectorization and classification inside a **Pipeline** so that preprocessing is tied to the model and cross-validation becomes straightforward.

```python
# Example code to paste into a code cell.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# X: list/Series of raw SMS texts
# y: corresponding labels, e.g., 'spam' or 'ham'

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Option A: Count-based features
pipe_counts = Pipeline([
    ("vect", CountVectorizer(ngram_range=(1, 1), lowercase=True)),
    ("clf", MultinomialNB(alpha=1.0)),  # alpha=1.0 (Laplace smoothing) is the default
])

# Option B: TF–IDF features (still valid with MultinomialNB as long as non-negative)
pipe_tfidf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), lowercase=True)),
    ("clf", MultinomialNB(alpha=0.5)),
])

# Fit and evaluate
pipe = pipe_tfidf  # choose either pipe_counts or pipe_tfidf
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=3))
```


In [25]:
'''
The training data has been loaded into the variable 'training_data' and the testing data into the 
variable 'testing_data'.

Import the MultinomialNB classifier and fit it to the training data using fit(). Name your classifier
'naive_bayes'. You will be training the classifier using 'training_data' and 'y_train' from our split earlier. 
'''
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)
predictions = naive_bayes.predict(testing_data)

## Step 6: Evaluating the Model

With predictions obtained on the test set, the next objective is to **quantify performance** using appropriate metrics. Several complementary measures are typically reported, each highlighting a different aspect of classification quality—especially important under **class imbalance** (common in spam detection).

---

### Core Metrics

**Accuracy**  
The fraction of all predictions that are correct. Suitable when classes are balanced; misleading under heavy imbalance.

$$
\text{Accuracy} \;=\; \frac{TP + TN}{TP + TN + FP + FN}
$$

**Precision (Positive Predictive Value)**  
Of the items predicted as *spam*, the proportion that truly are *spam*. High precision implies few **false positives**.

$$
\text{Precision} \;=\; \frac{TP}{TP + FP}
$$

**Recall (Sensitivity / True Positive Rate)**  
Of the items that truly are *spam*, the proportion correctly identified as *spam*. High recall implies few **false negatives**.

$$
\text{Recall} \;=\; \frac{TP}{TP + FN}
$$

**F1 Score**  
The harmonic mean of precision and recall; balances both when a single number is needed.

$$
\text{F1} \;=\; \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

> **Interpretation tip:**  
> - **Precision** answers: “When the model says *spam*, how often is it correct?”  
> - **Recall** answers: “Of all *spam*, how much did the model catch?”
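
The four formulas above can be computed directly from confusion-matrix counts. The counts below are hypothetical and chosen only for illustration; they are not results from the SMS dataset.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 8 spam caught, 2 ham flagged as spam, 2 spam missed, 88 ham kept.
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=88)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The guards against zero denominators return 0.0 by convention, which covers the case where the model never predicts the positive class.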

---

### Why Accuracy Alone Is Insufficient (Imbalance)

In skewed datasets (e.g., 2% spam, 98% ham), a classifier that labels everything as *ham* achieves 98% **accuracy** yet is useless. For such cases, **precision**, **recall**, and **F1** are preferred, often alongside **confusion matrices** to expose failure modes.

A confusion matrix for the *spam* (positive) class:

|                | Predicted: Spam | Predicted: Not Spam |
|----------------|------------------|----------------------|
| **Actual: Spam**     | TP               | FN                   |
| **Actual: Not Spam** | FP               | TN                   |
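
The 2% / 98% scenario can be checked with a short sketch. The labels below are synthetic, constructed only to reproduce that class ratio.

```python
# Synthetic labels: 2 spam and 98 ham messages.
y_true = ["spam"] * 2 + ["ham"] * 98

# A trivial classifier that labels every message as ham.
y_pred = ["ham"] * len(y_true)

tp = sum(t == "spam" and p == "spam" for t, p in zip(y_true, y_pred))
fn = sum(t == "spam" and p == "ham" for t, p in zip(y_true, y_pred))
correct = sum(t == p for t, p in zip(y_true, y_pred))

accuracy = correct / len(y_true)
recall = tp / (tp + fn)

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall on spam: {recall:.2f}")
```

Accuracy comes out at 0.98 while recall on the spam class is 0.00, which is why accuracy alone says little about this classifier.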

---

### Practical Evaluation Guidance (for Spam Detection)

- **Report per-class metrics** and **macro/micro averages**  
  - *Macro-F1*: unweighted average over classes (treats classes equally).  
  - *Micro-F1*: aggregates contributions of all classes (favors majority class).

- **Inspect Precision–Recall (PR) behavior**  
  - **PR curves** and **Average Precision (AP)** are more informative than ROC under heavy imbalance.

- **Threshold tuning**  
  - Default predictions use argmax of posterior probabilities.  
  - A custom threshold on $P(\text{spam}\mid x)$ can be set to prioritize **recall** (catch more spam) or **precision** (reduce false alarms), depending on product needs.

- **Calibration**  
  - Naïve Bayes probabilities can be overconfident.  
  - If decision thresholds or downstream costs rely on probability quality, apply **probability calibration** (e.g., `CalibratedClassifierCV` with Platt scaling or isotonic regression).

- **Cross-validation and stratification**  
  - Use **stratified** splits so class ratios are preserved across folds and the test set.  
  - Report mean ± std over folds for stable estimates.

- **Additional metrics (optional but useful)**  
  - **Specificity (TNR)**: $\frac{TN}{TN+FP}$, which measures how well *ham* is protected from false spam flags.  
  - **Matthews Correlation Coefficient (MCC)** — a balanced metric even with severe imbalance.  
  - **Balanced Accuracy** — average of TPR and TNR.
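
The threshold-tuning guidance can be illustrated with a small sketch. The scores and labels below are hypothetical stand-ins for $P(\text{spam}\mid x)$ and the true classes; in practice the scores would come from the fitted model's predicted probabilities.

```python
# Hypothetical P(spam | x) scores and true labels (1 = spam, 0 = ham).
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]


def precision_recall_at(threshold):
    """Precision and recall when messages with score >= threshold are flagged as spam."""
    predicted = [int(score >= threshold) for score in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predicted, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.35, 0.50, 0.70):
    precision, recall = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

On this toy data, lowering the threshold to 0.35 catches every spam message at the cost of a false alarm, while raising it to 0.70 removes the false alarm but misses one spam message.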

---

### Minimal scikit-learn Evaluation Snippet

```python
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

y_pred = pipe.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=3))
```


In [26]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print(f'Accuracy score: {accuracy_score(y_test, predictions):.3f}')
print(f'Precision score: {precision_score(y_test, predictions):.3f}')
print(f'Recall score: {recall_score(y_test, predictions):.3f}')
print(f'F1 score: {f1_score(y_test, predictions):.3f}')

Accuracy score: 0.990
Precision score: 0.975
Recall score: 0.951
F1 score: 0.963


## Step 7: Conclusion

The **Naïve Bayes algorithm** offers several advantages compared to other classification methods, particularly in text classification contexts such as spam detection:

---

### Key Advantages

- **Scalability with High-Dimensional Data**  
  Each word in the vocabulary can be treated as a feature, resulting in thousands of dimensions.  
  Naïve Bayes handles this efficiently without requiring complex feature selection.

- **Robustness to Irrelevant Features**  
  Even if many features carry little or no discriminative value, the algorithm is relatively unaffected, as informative features dominate the posterior computations.

- **Simplicity and Ease of Use**  
  Naïve Bayes typically performs well with minimal parameter tuning. The main choices are the smoothing parameter and the variant (Multinomial, Bernoulli, Gaussian), which is selected to match the feature type (e.g., Gaussian for continuous features).

- **Low Risk of Overfitting**  
  Because each parameter is a simple frequency estimate and the model is strongly constrained by the independence assumption, Naïve Bayes tends to generalize well even with limited data compared to more flexible models.

- **Efficiency**  
  Training and inference are extremely fast, making the method well-suited for large-scale datasets or applications requiring real-time predictions.

---

### Final Note

Naïve Bayes may appear simplistic—even “naïve”—due to its independence assumption. Yet, in practice, it remains **remarkably effective** for text classification and many other domains where features are numerous, sparse, and relatively independent.


## Recap: Why Use Naïve Bayes?

The **Naïve Bayes algorithm** applies probabilistic reasoning to classification tasks.  
It belongs to the family of **supervised machine learning algorithms**, where training is performed on labeled data across binary or multi-class categories.

---

### Advantages of Naïve Bayes

- **Handles High-Dimensional Feature Spaces**  
  Works effectively even when the number of features is extremely large, as in text classification where each unique word is a feature.

- **Robust to Irrelevant Features**  
  The presence of non-informative features has little impact because strong signals dominate the probability computation.

- **Simplicity and Ease of Use**  
  Performs well with minimal preprocessing and hyperparameter tuning. Different variants (Multinomial, Bernoulli, Gaussian) adapt to different data distributions.

- **Low Risk of Overfitting**  
  Compared to more flexible models, Naïve Bayes tends to generalize well and is less prone to fitting noise in the data.

- **Efficiency**  
  Training and inference are extremely fast, even on large datasets—making it suitable for real-time systems such as spam filters or document classifiers.

---

### Limitations of Naïve Bayes

- **Independence Assumption**  
  The core assumption that features are independent given the class label rarely holds in practice.  
  For example, the words *“New”* and *“York”* are strongly correlated; treating them independently can degrade performance.

- **Zero-Frequency Problem**  
  If a feature never appears in the training data for a class, its probability becomes zero, wiping out the entire posterior.  
  → Solution: apply **Laplace/Lidstone smoothing** (e.g., `alpha` parameter in scikit-learn).

- **Poor Probability Calibration**  
  Although classification accuracy may be high, the predicted probabilities are often **overconfident**.  
  → Solution: apply **probability calibration** methods (e.g., Platt scaling, isotonic regression).

- **Limited Expressiveness**  
  Discriminative models such as Logistic Regression do not rely on the independence assumption, and nonlinear models (e.g., kernel SVMs, Deep Neural Networks) can capture interactions between features, which Naïve Bayes ignores.

- **Performance Drops with Correlated Features**  
  If many features are highly dependent on each other, the independence assumption can distort posterior estimates and harm classification accuracy.
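
The zero-frequency problem and its fix can be seen in a short sketch. The token counts, vocabulary size, and the `spam_score` helper are hypothetical, chosen only to show the effect of the smoothing parameter.

```python
def spam_score(message_tokens, token_counts, vocabulary_size, prior, alpha):
    """Unnormalized score P(spam) * prod_i P(token_i | spam) with additive smoothing."""
    total = sum(token_counts.values())
    score = prior
    for token in message_tokens:
        count = token_counts.get(token, 0)
        score *= (count + alpha) / (total + alpha * vocabulary_size)
    return score


# Hypothetical training counts for the spam class; "tonight" was never seen in spam.
spam_counts = {"free": 3, "prize": 1}
message = ["free", "prize", "tonight"]

unsmoothed = spam_score(message, spam_counts, vocabulary_size=5, prior=0.5, alpha=0.0)
smoothed = spam_score(message, spam_counts, vocabulary_size=5, prior=0.5, alpha=1.0)

print("alpha=0:", unsmoothed)
print("alpha=1:", smoothed)
```

Without smoothing, the single unseen token drives the whole score to zero regardless of the other evidence; with `alpha=1.0` the score stays small but positive.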

---

### Key Takeaway

Despite these limitations, Naïve Bayes remains a **reliable baseline algorithm** that is fast, interpretable, and robust in many high-dimensional, sparse domains (like spam detection).  
It should often be the **first model tried** in text classification tasks, and its performance serves as a benchmark for evaluating more complex approaches.
