**`Theory NaiveBayes Classifier.`**

- 1. **`Motivation.`** Let $C_1, C_2, \ldots, C_n$ be $n$ classes (subsets of $\mathbb{R}^d$) and let $x \in \mathbb{R}^d$ be a data point. For each $k \in \lbrace 1, \ldots, n \rbrace$ we want the posterior probability

$$ P(C_k \vert x) = P(y(\omega) \in C_k \vert \omega = x), $$

and we look for the class that maximizes it, i.e. find $C^{*} = \underset{k \in \lbrace 1, \ldots, n \rbrace }{ \text{argmax} } P(C_k \vert x) $ 

- 2. **`Bayes_theorem & its corollary.`**
We have
$$ P(A \vert B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B \cap A)}{P(A)} \frac{P(A)}{P(B)} = \frac{P(B \vert A) P(A)}{P(B)} $$
then, since $P(x)$ does not depend on $k$,
$$ C^{*} = \underset{k \in \lbrace 1, \ldots, n \rbrace }{ \text{argmax} } \frac{P(x \vert C_k) P(C_k)}{P(x)} = \underset{k \in \lbrace 1, \ldots, n \rbrace }{ \text{argmax} } \left( P(x \vert C_k) P(C_k) \right). $$

- 3. **`Naive Bayes Classifier.`**

- - Using `maximum likelihood estimation (MLE)`, the prior $P(C_k)$ can be estimated by $$ \frac{\vert C_k \vert}{\sum_{i=1}^{n} \vert C_i \vert}, $$
where $\vert A \vert$ is the number of elements in the set $A$.

- - The term $P(x \vert C_k)$ is much harder to compute when $x$ is a `multi-dimensional` random vector.

- - - Without any independence assumption on the features, the chain rule gives
$$ P(x \vert C_k) = P(x_1, \ldots, x_d \vert C_k) = P(x_1 \vert x_2, \ldots, x_d, C_k) \, P(x_2 \vert x_3, \ldots, x_d, C_k) \cdots P(x_d \vert C_k), $$
and estimating all the `joint probabilities` produced by this expansion quickly becomes infeasible as the number of features grows.

- - - **Solution.** Assume that the features $x = (x_1, \ldots, x_d)$ are conditionally independent given the class; then
$$ P(x \vert C_k) = \prod_{j=1}^d P(x_j \vert C_k). $$
This assumption is very strict and rarely holds exactly, but it is useful in `large_scale` problems because both training and testing are very fast. The assumption is `"naive"`, hence the name `Naive Bayes classifier.`
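
To get a feel for why the full joint expansion is hard while the naive factorization is not, the sketch below counts the free parameters needed per class. It assumes (this is not stated in the text) that all $d$ features are binary; the function names are illustrative only.

```python
def joint_table_params(d: int) -> int:
    """Free parameters of a full joint table over d binary features (per class)."""
    return 2 ** d - 1


def naive_bayes_params(d: int) -> int:
    """Free parameters under the naive assumption: one P(x_j = 1 | C_k) per feature."""
    return d


# The joint table grows exponentially in d, the naive model only linearly.
for d in (5, 10, 20):
    print(d, joint_table_params(d), naive_bayes_params(d))
```

With only 20 binary features the joint table already needs more than a million parameters per class, while the naive model needs 20.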

There are 3 basic model types for this problem, depending on the kind of dataset; the simplest case is `bernoulliNB`.

####  NBC algorithms. 

- We have
$$ C^{*} = \underset{k \in \lbrace 1, \ldots, n \rbrace }{ \text{argmax} } P(C_k) \prod_{j=1}^d P(x_j \vert C_k). $$

- 1. `Training NBC.` 
- - i) For each class $C_k$, compute the posterior probability given the feature vector $x$, i.e. $P(C_k \vert x)$ (up to the common factor $P(x)$).
- - ii) The `class assignment` is selected by the `maximum a posteriori (MAP)` rule.

- 2. `The log trick.`
- - i) Multiplying many small probabilities together causes numerical instability (underflow) in
$$ \underset{k \in \lbrace 1, \ldots, n \rbrace }{ \text{argmax} } P(C_k) \prod_{j=1}^d P(x_j \vert C_k). $$
- - ii) Since `log` is strictly increasing, applying it does not change the argmax and turns the product into a sum:
$$ \underset{k \in \lbrace 1, \ldots, n \rbrace }{ \text{argmax} } \left( \log P(C_k) + \sum_{j=1}^d \log P(x_j \vert C_k) \right). $$
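
The underflow issue can be reproduced in a few lines. The following is a minimal sketch with made-up numbers: the two classes `"A"` and `"B"`, their priors, and the 400 per-feature likelihoods are hypothetical values chosen only to trigger underflow.

```python
import math

# Hypothetical per-feature likelihoods P(x_j | C_k) for two classes (illustrative only).
likelihoods = {"A": [0.01] * 400, "B": [0.02] * 400}
priors = {"A": 0.5, "B": 0.5}


def product_score(prior, probs):
    """Direct product P(C_k) * prod_j P(x_j | C_k); underflows for many small factors."""
    score = prior
    for p in probs:
        score *= p
    return score


def log_score(prior, probs):
    """log P(C_k) + sum_j log P(x_j | C_k); stays in a safe numeric range."""
    return math.log(prior) + sum(math.log(p) for p in probs)


raw = {k: product_score(priors[k], likelihoods[k]) for k in priors}
logs = {k: log_score(priors[k], likelihoods[k]) for k in priors}

print(raw)                        # both products collapse to 0.0
print(max(logs, key=logs.get))    # the log scores still separate the classes
```

In floating point both direct products become exactly `0.0`, so the argmax is undefined, whereas the log scores remain finite and comparable.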

**Example 1. Predicting `playing tennis` with `Naive Bayes`.**

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.metrics import confusion_matrix, accuracy_score 
from sklearn.naive_bayes import MultinomialNB 

## Loading & viewing dataset in the first example

path = r"C:\Users\Admin\Desktop\Nhan_pro\Data\ML\tennis_weather.xlsx"
df = pd.read_excel(path, index_col = 'DAY')
df

| DAY | Temp | Humidtity | Outlook | Wind | Play tennis? |
|---|---|---|---|---|---|
| D1 | Hot | High | Sunny | Weak | No |
| D2 | Hot | High | Sunny | Strong | No |
| D3 | Hot | High | Overcast | Weak | Yes |
| D4 | Mild | High | Rain | Weak | Yes |
| D5 | Cool | Normal | Rain | Weak | Yes |
| D6 | Cool | Normal | Rain | Strong | No |
| D7 | Cool | Normal | Overcast | Strong | Yes |
| D8 | Mild | High | Sunny | Weak | No |
| D9 | Cool | Normal | Sunny | Weak | Yes |
| D10 | Mild | Normal | Rain | Weak | Yes |


Now, let us look at the probabilities.

In [2]:
df_Temp_Play = pd.crosstab(index = df['Temp'], columns = df['Play tennis?'])

print('frequency table: \n')
print(df_Temp_Play)

print('\nprobabilities table :')
df_Temp_Play / df_Temp_Play.sum(axis=0)

frequency table: 

Play tennis?  No  Yes
Temp                 
Cool           1    3
Hot            2    2
Mild           2    4

probabilities table :


| Temp | No | Yes |
|---|---|---|
| Cool | 0.2 | 0.333333 |
| Hot | 0.4 | 0.222222 |
| Mild | 0.4 | 0.444444 |


**Explanations:** 

- 1) The `freq table` lists the possible values of Temperature (i.e. `Cool`, `Hot`, `Mild`) against the outcome `"playing tennis or not"`:

- - When the Temperature is `Cool`, there are `3` cases where `play tennis` occurs (`Yes`) and `1` case where it doesn't (`No`).
- - When the Temperature is `Hot`, there are `2` cases where `play tennis` occurs and `2` cases where it doesn't.
- - When the Temperature is `Mild`, there are `4` cases where `play tennis` occurs and `2` cases where it doesn't.

- 2) The `probs table` holds the conditional probabilities of the temperature given the outcome (each column of the frequency table is divided by its sum); for instance

$$ \begin{array}{lcl} P\left( \text{play = 'Yes'} \right) &=& \dfrac{9}{14} \\ P\left( \text{temp = 'Cool'} \right) &=& \dfrac{4}{14} \\ P \left( \lbrace \text{play = 'Yes'} \rbrace \cap \lbrace \text{temp = 'Cool'} \rbrace \right) &=& \dfrac{3}{14} \end{array} $$

hence,
$$ P\left( \text{temp = 'Cool'} \vert \text{play = 'Yes'} \right) = \dfrac{P \left( \lbrace \text{play = 'Yes'} \rbrace \cap \lbrace \text{temp = 'Cool'} \rbrace \right)}{P(\text{play = 'Yes'})} = \dfrac{3/14}{9/14} = \dfrac{1}{3} $$
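
As a cross-check, the entry `0.333333` in the `Cool` / `Yes` cell can be recomputed from the frequency table with exact fractions. This is a small sketch using only the counts printed above; the variable names are illustrative. It also shows that conditioning in the other direction gives a different number.

```python
from fractions import Fraction

# Counts copied from the frequency table above: {temp: (No, Yes)}
counts = {"Cool": (1, 3), "Hot": (2, 2), "Mild": (2, 4)}

n_no = sum(no for no, _ in counts.values())
n_yes = sum(yes for _, yes in counts.values())
total = n_no + n_yes

p_yes = Fraction(n_yes, total)                    # P(play = Yes)
p_cool = Fraction(sum(counts["Cool"]), total)     # P(temp = Cool)
p_cool_and_yes = Fraction(counts["Cool"][1], total)

# Column-normalized entry of the probabilities table: P(temp = Cool | play = Yes)
p_cool_given_yes = p_cool_and_yes / p_yes
# Conditioning the other way round gives a different quantity: P(play = Yes | temp = Cool)
p_yes_given_cool = p_cool_and_yes / p_cool

print(p_cool_given_yes, p_yes_given_cool)
```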

Likewise, we get the `probs_tables` of the other features with respect to `play tennis`.

In [3]:
df_Wind_Play = pd.crosstab(index = df['Wind'], columns = df['Play tennis?'])
df_Wind_Play / df_Wind_Play.sum()

| Wind | No | Yes |
|---|---|---|
| Strong | 0.6 | 0.333333 |
| Weak | 0.4 | 0.666667 |


In [4]:
df_oulk_Play = pd.crosstab(index = df['Outlook'], columns = df['Play tennis?'])
df_oulk_Play / df_oulk_Play.sum()

| Outlook | No | Yes |
|---|---|---|
| Overcast | 0.0 | 0.444444 |
| Rain | 0.4 | 0.333333 |
| Sunny | 0.6 | 0.222222 |


In [5]:
df_Humd_Play = pd.crosstab(index = df['Humidtity'], columns = df['Play tennis?'])
df_Humd_Play / df_Humd_Play.sum()

| Humidtity | No | Yes |
|---|---|---|
| High | 0.8 | 0.333333 |
| Normal | 0.2 | 0.666667 |


Now consider the new observation 
$$ x' = \left( \text{Outlook = 'Sunny', Temp = 'Cool', Humd = 'High', Wind = 'Strong' } \right)$$
By Bayes' theorem and the naive independence assumption we have, up to the common factor $P(x')$,
$$ \begin{array}{lcl} P \left( C = \text{"Yes"} \left \vert x' \right. \right) &\propto& P \left( \text{"sunny" | "yes"} \right) P \left( \text{"cool" | "yes"} \right) P \left( \text{"high" | "yes"} \right) P \left( \text{"strong" | "yes"} \right) P \left( \text{"yes"} \right) \\ P \left( C = \text{"No"} \left \vert x' \right. \right) &\propto& P \left( \text{"sunny" | "no"} \right) P \left( \text{"cool" | "no"} \right) P \left( \text{"high" | "no"} \right) P \left( \text{"strong" | "no"} \right) P \left( \text{"no"} \right)\end{array} $$

Applying this to our problem, we obtain

In [6]:
## Conditional probabilities of each observed feature value given the class.
## Every array below holds the pair (P(. | No), P(. | Yes)).

## prob("outlook = sunny" given "play"): 3/5 and 2/9 (see the 'Sunny' row of the Outlook table).
## Here, we reuse the dataframe "df_oulk_Play" above.
c1 = (df_oulk_Play / df_oulk_Play.sum()).loc['Sunny'].values

## Likewise, in the same order as the feature names below
c2 = (df_Temp_Play / df_Temp_Play.sum()).loc['Cool'].values
c3 = (df_Humd_Play / df_Humd_Play.sum()).loc['High'].values
c4 = (df_Wind_Play / df_Wind_Play.sum()).loc['Strong'].values

## priors P(No) = 5/14 and P(Yes) = 9/14
c5 = (df_Temp_Play.sum(axis=0) / df_Temp_Play.sum(axis=0).sum()).values

## row labels; the order must match c1..c6
features = ['Outlook = Sunny', 'Temp = Cool', 'Humd = High', 'Wind = Strong', 'Overall', 'probs']

## unnormalized scores for P(No | x) and P(Yes | x): multiply all the above probs (c1 to c5)
c6 = c1*c2*c3*c4*c5

## initialize the table
df_x = pd.DataFrame([c1, c2, c3, c4, c5, c6], columns = ['play = No', 'play = yes'])

## set the index to the feature names
df_x.index = features

## round to 4 decimals
df_x.round(4)

| | play = No | play = yes |
|---|---|---|
| Outlook = Sunny | 0.6 | 0.2222 |
| Temp = Cool | 0.2 | 0.3333 |
| Humd = High | 0.8 | 0.3333 |
| Wind = Strong | 0.6 | 0.3333 |
| Overall | 0.3571 | 0.6429 |
| probs | 0.0206 | 0.0053 |


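Now we can conclude that for the observation
$$ x' = \left( \text{Outlook = 'Sunny', Temp = 'Cool', Humd = 'High', Wind = 'Strong' } \right)$$
the unnormalized score of `playing tennis = No` is about `0.0206`, which is larger than `0.0053` for `playing tennis = Yes`, so the classifier predicts `No`. These two numbers omit the common factor $1/P(x')$ and do not sum to 1; only their comparison matters.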
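
If actual posterior probabilities are wanted, the two scores can be normalized so that they sum to 1. The sketch below redoes the computation with exact fractions, using the conditional probabilities read off the tables above; the helper name `score` is illustrative.

```python
from fractions import Fraction as F

# (P(. | No), P(. | Yes)) read from the probability tables above
sunny = (F(3, 5), F(2, 9))
cool = (F(1, 5), F(3, 9))
high = (F(4, 5), F(3, 9))
strong = (F(3, 5), F(3, 9))
prior = (F(5, 14), F(9, 14))


def score(k):
    """Unnormalized posterior score for class index k (0 = No, 1 = Yes)."""
    return sunny[k] * cool[k] * high[k] * strong[k] * prior[k]


score_no, score_yes = score(0), score(1)

# Dividing by the sum of the scores plays the role of dividing by P(x').
posterior_no = score_no / (score_no + score_yes)
posterior_yes = score_yes / (score_no + score_yes)

print(round(float(score_no), 4), round(float(score_yes), 4))
print(round(float(posterior_no), 3), round(float(posterior_yes), 3))
```

After normalization the posterior for `No` is about `0.795`, so the decision is the same but the numbers are now interpretable as probabilities.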

So, what is the problem with the `NBC algorithm`?

What happens if `a category with no entries yields a conditional probability of "0"`?

For instance, if we consider
$$ P(C \vert X) \propto P\left( X_1 \vert C \right) P\left( X_2 \vert C \right) P\left( C \right), $$
then $P\left( X_1 \vert C \right) = 0$ (equivalently, $P\left( X_1 \cap C \right) = 0$) forces the whole product to 0, whatever the other factors are.

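***`Solution : Laplace smoothing method.`*** Add "1" to every count in the `"numerator"` and the number of categories to the `"denominator"`, so that empty categories no longer get probability 0; for example, if $X_1$ never occurs together with class $C$,
$$ P\left( X_1 \vert C \right) = \frac{0 + 1}{|C| + n_1} = \frac{1}{|C| + n_1} $$
and, for a category that does occur,
$$ P\left( X_2 \vert C \right)  = \frac{|X_2 \cap C| + 1}{|C| + n_2}$$
where $n_j$ is the number of distinct values that feature $j$ can take.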
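
As a concrete illustration, the `Outlook` table above has a zero entry for `Overcast` given `No`. The sketch below applies add-one smoothing to the corresponding counts (0, 2 and 3 out of the 5 `No` days, recovered from the proportions 0.0, 0.4 and 0.6). The function name `laplace_smooth` and the `alpha` parameter are illustrative, and the number of categories is taken to be the number of distinct Outlook values.

```python
from fractions import Fraction


def laplace_smooth(counts, alpha=1):
    """Add `alpha` to every count and `alpha * (number of categories)` to the total."""
    total = sum(counts.values())
    k = len(counts)
    return {value: Fraction(c + alpha, total + alpha * k) for value, c in counts.items()}


# Outlook counts among the 5 "No" days, recovered from the Outlook table above
outlook_given_no = {"Overcast": 0, "Rain": 2, "Sunny": 3}

smoothed = laplace_smooth(outlook_given_no)
for value, p in smoothed.items():
    print(value, p)
```

The zero probability becomes 1/8, and the smoothed values still sum to 1.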

Depending on the `data-type` (`binary (True / False)`, `discrete (e.g. counts)`, or `continuous`), we have the corresponding `Naive Bayes model`: `bernoulli, multinomial, gaussian.`

**Combining `feature types`.** What happens when the `model features` have different data types?

- **option 1.** Bin continuous features to create categorical ones and fit multinomial model

- **option 2.** Fit Gaussian model on continuous features and multinomial on categorical features; combine to create `"meta model"`

First, we begin with the basic models; the `combining model example` will be discussed in another topic named `meta_data model`.

**Example 2. Using `BernoulliNB` in the problem `playing tennis`**

This model is used when the values in the `dataset` are binary: `T / F` or {1, 0}.

In [7]:
from sklearn.naive_bayes import BernoulliNB
import numpy as np 

In [8]:
## separate the dataframe into features (X) & target (y) 

X_data = pd.get_dummies(df[['Outlook', 'Temp', 'Humidtity', 'Wind']])
y_target = pd.DataFrame(df['Play tennis?'])

X_data

| DAY | Outlook_Overcast | Outlook_Rain | Outlook_Sunny | Temp_Cool | Temp_Hot | Temp_Mild | Humidtity_High | Humidtity_Normal | Wind_Strong | Wind_Weak |
|---|---|---|---|---|---|---|---|---|---|---|
| D1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| D2 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| D3 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| D4 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| D5 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| D6 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| D7 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| D8 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| D9 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| D10 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |


In [9]:
y_target = y_target.to_numpy().ravel()
y_target

array(['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes',
       'Yes', 'Yes', 'Yes', 'No'], dtype=object)

In [10]:
## train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_data, y_target, test_size=0.25, random_state=42)

print(y_test)
X_test

['Yes' 'Yes' 'No' 'Yes']


| DAY | Outlook_Overcast | Outlook_Rain | Outlook_Sunny | Temp_Cool | Temp_Hot | Temp_Mild | Humidtity_High | Humidtity_Normal | Wind_Strong | Wind_Weak |
|---|---|---|---|---|---|---|---|---|---|---|
| D10 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| D12 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| D1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| D13 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |


In [11]:
model = BernoulliNB()
model.fit(X_train, y_train)

#Predict Output 
predicted = model.predict(X_test)
print(predicted)

['Yes' 'No' 'No' 'Yes']


**Comments:**
- A. In this case, we have
$$ P(x_i \vert C) = P(i \vert C)^{x_i} \left( 1 - P(i \vert C) \right)^{1 - x_i}, \quad x_i \in \lbrace 0, 1 \rbrace $$ 
where $P(i \vert C)$ is the probability that feature $i$ is present (equal to 1) in class $C$; in text classification this is the `probability that the i-th word appears in a document` of class $C$.
- B. `X_data` (built with `pandas.get_dummies`) shows how each `categorical variable` is converted into dummy/indicator variables.

For instance, the feature `Temp` has 3 categories ('Cool', 'Hot' & 'Mild'); likewise `Wind` has 2 categories, 'Weak' & 'Strong'. In total, the 4 features yield 10 indicator columns (3 + 3 + 2 + 2). On the first day `(D1)`, we have `Temp = 'Hot', Humidity = 'High', Outlook = 'Sunny', Wind = 'Weak'`; so `Outlook_Overcast = 0 (outlook == 'overcast');  Outlook_Rain = 0;  Outlook_Sunny = 1; etc.`
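
The encoding and the Bernoulli likelihood of comment A can be sketched without any library. In the sketch below the category order follows the columns shown for `X_data`; the helper names and the uniform probabilities `p_uniform` are hypothetical and used only for illustration, not estimated from the data.

```python
# Category order follows the indicator columns of X_data shown above.
categories = {
    "Outlook": ["Overcast", "Rain", "Sunny"],
    "Temp": ["Cool", "Hot", "Mild"],
    "Humidtity": ["High", "Normal"],
    "Wind": ["Strong", "Weak"],
}


def one_hot(row):
    """Return {column_name: 0 or 1} for one observation, one column per category."""
    return {
        f"{feature}_{value}": int(row[feature] == value)
        for feature, values in categories.items()
        for value in values
    }


def bernoulli_likelihood(x, p):
    """prod_i p_i^(x_i) * (1 - p_i)^(1 - x_i) for binary indicators x_i."""
    result = 1.0
    for name, x_i in x.items():
        result *= p[name] if x_i == 1 else 1.0 - p[name]
    return result


d1 = {"Outlook": "Sunny", "Temp": "Hot", "Humidtity": "High", "Wind": "Weak"}
x_d1 = one_hot(d1)
print(list(x_d1.values()))  # same 0/1 pattern as row D1 of X_data

# Hypothetical P(i | C): every indicator equally likely to be 0 or 1.
p_uniform = {name: 0.5 for name in x_d1}
print(bernoulli_likelihood(x_d1, p_uniform))
```

Note that the Bernoulli model also multiplies in a factor $1 - P(i \vert C)$ for every indicator that is 0, so absent categories contribute to the likelihood as well.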

**What happens if we apply `MultinomialNB`?** (the `Multinomial model` with the number of categories = 2 reduces to the `Bernoulli model`)

In [12]:
from sklearn.naive_bayes import MultinomialNB
model2 = MultinomialNB()
## fit the multinomial model (model2), not the BernoulliNB instance `model` from above
model2.fit(X_train, y_train)

#Predict Output 
predicted = model2.predict(X_test)
print(predicted)

['Yes' 'No' 'No' 'Yes']


**Comments.**

- The `multinomial Naive Bayes classifier` is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as `tf-idf` may also work.

- The next example uses `TfidfVectorizer` to decide whether each message is `spam` or not (`ham`, i.e. `legitimate`), based on the column `text_mes` of the dataset `spam.csv`. Note that the code below fits `BernoulliNB` on the tf-idf features; `MultinomialNB` can be fitted on the same features in the same way.

In [13]:
path = r"C:\Users\Admin\Desktop\Nhan_pro\Data\ML\spam.csv"
data2 = pd.read_csv(path, usecols = ['text_mes', 'target'], encoding='ISO-8859-1')
data2.head(10)

| | target | text_mes |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


In [14]:
from sklearn.feature_extraction.text import CountVectorizer

text_process = CountVectorizer().fit_transform(data2['text_mes'])

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer()

In [16]:
X_train, X_test, y_train, y_test = train_test_split(data2['text_mes'], data2['target'], test_size = 0.3, 
                                                    stratify = data2['target'], random_state = 42)

## learn the vocabulary and idf weights on the training set only, then reuse them for the test set
tfidf_train = tfidf_vect.fit_transform(X_train)
tfidf_test = tfidf_vect.transform(X_test)

x_train = tfidf_train.toarray()
x_test = tfidf_test.toarray()

model = BernoulliNB(alpha = 1.0)
model.fit(x_train, y_train)

tr_pred = model.predict(x_train)
t_preds = model.predict(x_test)

from sklearn.metrics import accuracy_score as acc

print("training_acc_score = ", acc(y_train, tr_pred))
print("testing_acc_score = ", acc(y_test, t_preds))

training_acc_score =  0.9876923076923076
testing_acc_score =  0.9742822966507177


In [17]:
d = tfidf_train.shape[1]

print('dimension d =', d)
x_train.shape, tfidf_train.shape, y_test.shape

dimension d = 7202


((3900, 7202), (3900, 7202), (1672,))

**Explanation.**

- Each `document` (each line of `text_mes`) is described by a vector of length `d` (built here with `TfidfVectorizer`; `CountVectorizer` works the same way). So `d = 7202` is the number of distinct words in the `dictionary_documents` learned from X_train(text_mes). With `CountVectorizer`, the $i^{th}$ `coordinate` of each vector is the number of times the $i^{th}$ word appears in the `doc` (with `TfidfVectorizer` it is a weighted version of that count). Hence, for the multinomial model,

$$ \lambda_{C_i} = P(X_i \vert C) = \frac{|X_i \cap C|}{|C|} $$
where 
- - $|X_i \cap C|$ is the number of occurrences of the $i^{th}$ word in the `docs` of `class C`
- - $|C| = \sum_{i=1}^d |X_i \cap C|$ and hence $\sum_{i=1}^d \lambda_{C_i} = 1$

- When a word never appears in `class C`, its estimate is 0; to avoid this we return to the `Laplace_smoothing` method with alpha = 1.0:

$$ \hat{\lambda}_{C_i} = \frac{|X_i \cap C| + \alpha}{|C| + d\alpha} $$
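
The smoothed estimate can be checked on a tiny made-up example. In the sketch below the vocabulary size `d = 4` and the word counts are hypothetical, and `multinomial_params` is an illustrative helper, not part of scikit-learn.

```python
def multinomial_params(word_counts, alpha=1.0):
    """Smoothed estimates (count_i + alpha) / (total + d * alpha) for one class."""
    d = len(word_counts)
    total = sum(word_counts)
    return [(c + alpha) / (total + d * alpha) for c in word_counts]


# Hypothetical total counts of 4 vocabulary words over all docs of one class
counts_in_class = [3, 0, 1, 0]

unsmoothed = multinomial_params(counts_in_class, alpha=0.0)
smoothed = multinomial_params(counts_in_class, alpha=1.0)

print(unsmoothed)  # words that never occur get probability 0
print(smoothed)    # every word gets a strictly positive probability
```

Both sets of estimates sum to 1, but only the smoothed one avoids zero factors in the product over words.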

**Example 3. Using `GaussianNB`** when the dataset has the continuous values; such as `iris`

In [18]:
import seaborn as sns    ## pip install seaborn
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

## Loading dataset
iris = datasets.load_iris()
X = iris.data
## show the first 3 rows of the dataset; columns: sepal length, sepal width, petal length, petal width (cm)
X[:3]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2]])

In [19]:
y = iris.target
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

`0 = 'setosa', 1 = 'versicolor'; 2 = 'virginica'`

In [20]:
## train_test split the 'iris'_dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## call GaussianNB
clf = GaussianNB()
# training 
clf.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [21]:
# predict the model
y_pred = clf.predict(X_test)
print('Predicting_class of X_test: ', y_pred) 

Predicting_class of X_test:  [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 2 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1
 0 0 0 2 1 1 0 0]


In [22]:
from sklearn.metrics import accuracy_score as acc

print("acc_score = ", acc(y_test, y_pred))

acc_score =  0.9777777777777777


**Algorithm Explanation.** For each feature $i$ and each `class C`, we assume that $X_i \vert C \sim \mathcal{N}(\mu_{C_i}, \sigma^2_{C_i})$; hence the likelihood of an observed value $x_i$ is the Gaussian density

$$ P(X_i = x_i \vert C) = {\mathcal{N}(\mu_{C_i}, \sigma^2_{C_i})}(x_i) = \frac{1}{\sigma_{C_i} \sqrt{2\pi}} \exp \left( - \frac{(x_i - \mu_{C_i})^2}{2 \sigma_{C_i}^2} \right) $$

where the parameters $\mu_{C_i}, \sigma^2_{C_i}$ can be estimated by the `M.L.E` method.
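
For a single feature, the estimation and the density can be sketched in plain Python. The two classes `"a"` and `"b"` and their sample values below are made up for illustration; the sketch uses the MLE variance (division by $n$) and leaves out the small `var_smoothing` term that `GaussianNB` adds to the variances.

```python
import math


def fit_gaussian(values):
    """MLE of the mean and variance (variance divides by n, not n - 1)."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return mu, var


def gaussian_pdf(x, mu, var):
    """Density of N(mu, var) evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


# Hypothetical training values of one feature for two classes
samples = {"a": [1.0, 2.0, 3.0], "b": [6.0, 7.0, 8.0]}
params = {label: fit_gaussian(values) for label, values in samples.items()}

# With equal priors, the class with the larger likelihood wins.
x_new = 2.5
likelihood = {label: gaussian_pdf(x_new, mu, var) for label, (mu, var) in params.items()}
print(params)
print(max(likelihood, key=likelihood.get))
```

With several features, the per-feature densities are multiplied (or their logs are summed) exactly as in the general Naive Bayes rule.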