<a href="https://colab.research.google.com/github/44REAM/CEB-machine-learning/blob/main/naive_bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Naive Bayes

Modified from Aj Ratchainant's slides.

## Theory

## Conditional probability

<p align="center">
<img src="https://github.com/44REAM/CEB-machine-learning/blob/main/images/set.png?raw=1" width="300" />
</p>

$$
P(X|Y) = \frac{P(X \cap Y)}{P(Y)} \to \frac{P(X, Y)}{P(Y)}
$$

$$
P(Y|X) = \frac{P(X \cap Y)}{P(X)} \to \frac{P(X, Y)}{P(X)}
$$


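Terms:
- joint probability: $P(X, Y)$, the probability that $X$ and $Y$ both occur
- marginal probability: $P(X)$ or $P(Y)$, the probability of one event regardless of the other
- conditional probability: $P(X|Y)$, the probability of $X$ given that $Y$ has occurred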
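
The three terms can be made concrete with a small numeric sketch. The joint table below is hypothetical (the numbers are chosen only for illustration, not taken from the slides), and the helper names `marginal_x`, `marginal_y`, and `joint` are assumptions of this example.

```python
# Hypothetical joint distribution over two binary events X and Y.
# Keys are (x, y) outcomes; values are joint probabilities P(X=x, Y=y).
joint = {
    (True, True): 0.2,
    (True, False): 0.1,
    (False, True): 0.2,
    (False, False): 0.5,
}

def marginal_x(x):
    """Marginal P(X=x): sum the joint over every value of Y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def marginal_y(y):
    """Marginal P(Y=y): sum the joint over every value of X."""
    return sum(p for (_, yi), p in joint.items() if yi == y)

# Conditional probability = joint probability / marginal of the condition.
p_x_given_y = joint[(True, True)] / marginal_y(True)
p_y_given_x = joint[(True, True)] / marginal_x(True)

print(f"P(X|Y) = {p_x_given_y:.3f}")
print(f"P(Y|X) = {p_y_given_x:.3f}")
```

Note that $P(X|Y)$ and $P(Y|X)$ share the same numerator but are divided by different marginals, so they generally differ.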

## Bayes' theorem
Let $A$ and $B$ be events with $P(B) > 0$.

$$
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
$$

where $P(A|B)$ is the posterior, $P(B|A)$ is the likelihood, $P(A)$ is the prior, and $P(B)$ is the marginal.


## Example

*Historical data tell us that 10% of patients visiting our clinic have liver disease. 7% of patients diagnosed with liver disease are alcoholics. According to the test, 5% of patients are alcoholics. Find the probability that a patient has liver disease given that the patient is an alcoholic.*

**Prior**: Historical data tell us that 10% of patients visiting our clinic
have liver disease, $P(\text{liver disease}) = 0.1$

**Likelihood**: 7% of patients diagnosed with liver disease
are alcoholics, $P(\text{alcoholics} | \text{liver disease}) = 0.07$

**Marginal**: According to the test, 5% of patients are alcoholics,
$P(\text{alcoholics}) = 0.05$

**Posterior**: $P(\text{liver disease} | \text{alcoholics}) = (0.07 \times 0.1)/0.05 = 0.14$
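
The same calculation can be written as a short function. This is a minimal sketch of Bayes' theorem applied to the numbers in the example; the function name `bayes_posterior` and its parameter names are assumptions of this illustration.

```python
def bayes_posterior(likelihood, prior, marginal):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if marginal <= 0:
        raise ValueError("The marginal P(B) must be positive.")
    return likelihood * prior / marginal

# Values from the liver disease example:
# P(alcoholics | liver disease) = 0.07, P(liver disease) = 0.1, P(alcoholics) = 0.05
p_liver_given_alcoholic = bayes_posterior(likelihood=0.07, prior=0.1, marginal=0.05)
print(f"{p_liver_given_alcoholic:.2f}")  # prints 0.14
```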


## Conditional independence

Let $A$, $B$, and $C$ be events.

**Definition**: Events $A$ and $B$ are independent if and only if
$$P(A,B) = P(A)P(B)$$

Equivalently, if $P(B) > 0$, then $P(A|B) = P(A)$.

**Definition**: Two events $A$ and $B$ are conditionally independent given event $C$ with $P(C) > 0$ if and only if

$$P(A,B|C) = P(A|C)P(B|C)$$

If $P(B, C) > 0$, we also get

$$P(A|B,C) = P(A|C)$$

**Translation**: Once $C$ is known, learning about event $B$ does not improve our knowledge of $A$.
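
The definition can be checked numerically. The sketch below builds a joint distribution over three binary events from hypothetical numbers, constructed so that $A$ and $B$ are conditionally independent given $C$. It then compares $P(A|B,C)$ with $P(A|C)$, and $P(A|B)$ with $P(A)$. All probabilities and helper names are assumptions made for illustration.

```python
from itertools import product

# Hypothetical parameters: P(C=c), P(A=True | C=c), P(B=True | C=c).
p_c = {True: 0.3, False: 0.7}
p_a_given_c = {True: 0.9, False: 0.2}
p_b_given_c = {True: 0.8, False: 0.1}

def bernoulli(p_true, value):
    """Probability of a binary outcome given the probability of True."""
    return p_true if value else 1.0 - p_true

# Joint built as P(C) * P(A|C) * P(B|C), so A and B are
# conditionally independent given C by construction.
joint = {
    (a, b, c): p_c[c] * bernoulli(p_a_given_c[c], a) * bernoulli(p_b_given_c[c], b)
    for a, b, c in product([True, False], repeat=3)
}

def prob(a=None, b=None, c=None):
    """Sum the joint over all outcomes consistent with the fixed values."""
    total = 0.0
    for (ai, bi, ci), p in joint.items():
        if a is not None and ai != a:
            continue
        if b is not None and bi != b:
            continue
        if c is not None and ci != c:
            continue
        total += p
    return total

p_a_given_bc = prob(a=True, b=True, c=True) / prob(b=True, c=True)
p_a_given_c_only = prob(a=True, c=True) / prob(c=True)
p_a_given_b = prob(a=True, b=True) / prob(b=True)
p_a = prob(a=True)

print(f"P(A|B,C) = {p_a_given_bc:.3f}, P(A|C) = {p_a_given_c_only:.3f}")
print(f"P(A|B)   = {p_a_given_b:.3f}, P(A)   = {p_a:.3f}")
```

In this construction the first pair matches while the second pair does not: conditional independence given $C$ does not imply that $A$ and $B$ are independent when $C$ is unknown.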

**Example**:

1. Lung cancer prediction from smoking and sex

Once smoking status is known, knowing the patient's sex does not improve
the posterior probability of lung cancer.

2. Rain prediction from heavy cloud and low light

Once we know there is heavy cloud, knowing that the light is low does not improve the posterior probability of rain.



## Naive Bayes model

<p align="center">
<img src="https://github.com/44REAM/CEB-machine-learning/blob/main/images/network.png?raw=1" width="300" />
</p>
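
In its standard form, the naive Bayes model assumes that the features are conditionally independent given the class. Combining this assumption with Bayes' theorem gives the factorization below. The notation is an assumption of this sketch, not taken from the figure: $y$ is the class and $x_1, \dots, x_n$ are the features.

$$
P(y | x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i | y)}{P(x_1, \dots, x_n)}
$$

Since the denominator does not depend on $y$, the predicted class is

$$
\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i | y)
$$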

## Coding

### Categorical naive Bayes

### Gaussian naive Bayes

In [None]:
# modified from https://www.kaggle.com/code/nisasoylu/naive-bayes-implementation-on-cancer-dataset/notebook
!wget https://raw.githubusercontent.com/44REAM/CEB-machine-learning/main/data.csv


In [None]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB, CategoricalNB
from sklearn.model_selection import train_test_split

In [None]:
cancer_df = pd.read_csv('data.csv')
cancer_df = cancer_df.drop(columns = ['id', 'Unnamed: 32'])
cancer_df.head()

| | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [None]:
# Encode the label: 'M' becomes 1, any other value becomes 0
y = cancer_df['diagnosis']
y = [1 if diag == 'M' else 0 for diag in y]

# Divide each feature by its column maximum
features = cancer_df.drop(columns = ['diagnosis'])
x = features / features.max(axis = 0)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.5, random_state = 42)

In [None]:
nb = GaussianNB()
nb.fit(x_train, y_train)

GaussianNB()

In [None]:
print("Naive Bayes score: ",nb.score(x_test, y_test))

Naive Bayes score:  0.9403508771929825


### Practice

### Discriminant analysis (optional)