# <center>Naïve Bayes Classifier</center>
 
The objective of this unit is to explain the basic idea behind the Naïve Bayes Classifier in simple words. No background in statistics or data science is required. I use the terms <i>probability</i> and <i>chance</i> interchangeably throughout the text.

***
    
##  Introduction 


Suppose that you work in an organization where, every few years, an employee may receive a promotion that is announced as a surprise at the end-of-year party. You wonder whether you could make a good guess about your own case, so that you can be prepared and well dressed for the surprise. So you start shaping a story about it.
You think about the features that may lead to an employee's promotion and end up with four: **appearance, punctuality, client satisfaction** and **charisma**. 
    
You have the impression that good-looking or well-dressed people are more likely to be promoted. Likewise, people who are punctual in their tasks and working hours, as well as those with happy clients, have a higher chance of promotion. Finally, you think that charismatic employees are more likely to be promoted. 

You are aware that some of these __features are related to each other__, which makes your analysis less accurate. For instance, good-looking or charismatic employees may also get more positive feedback from clients. But you don't want to be super precise about your story, so you decide to stick with your **naïve** thoughts. 

Ultimately, you are interested in **finding out your chance of promotion given your scores on the features**. In other words, assuming that there are two classes of employees, **promoted** and **not yet promoted**, you want to find out which class you will belong to, or how likely you are to be in each of these classes. So the first step is to give yourself some scores. You are often well dressed, often punctual, often receive positive feedback from clients, and you do not have a charismatic character. You may think that your chance depends on two elements: 

1. **How often there is a promotion in general**.<br> 
    The number or frequency of promoted employees or, more precisely, the chance of a promotion happening at all in your company is going to matter. Imagine that no one in your company has ever received a promotion. Then the chance of an employee receiving a promotion is zero, irrespective of her scores on the features. On the other hand, suppose that half of the employees in your company have received a promotion so far, that is, the chance of such an event is 50%. Clearly, the higher this chance is, the higher is your chance of eventually being one of them. 


2. **How often promoted persons have a particular feature**. <br>
    The number or frequency of your promoted colleagues who have the same scores as yours also matters. This makes a lot of sense! Imagine that your office mate has almost the same scores as you and she got promoted last year. That makes you excited, and you may think that you will be next. 


<div class="alert alert-info">
<b>If you are convinced that these two elements are relevant then you are already using the Bayes rule. Welcome to the club!</b><br>
Note that the full specification of the Bayes rule (see below) is slightly different from what we have sketched so far.<br>   
<b>Bayes rule implies that</b>:

<center>Chance of promotion given the scores is proportional to $\{$chance of each score given the promotion $\times$ chance of promotion$\}$</center>
</div>


Now let's put things into action. For the first element, let's suppose that out of 100 employees in total, 50 have already received a promotion. For the second element, let's count the promoted persons who share each of your scores. 

* Out of the 50 promoted persons, 45 are as punctual as you. 
* 10 out of 50 are in the same class of appearance as you. 
* 40 out of 50 have client feedback rates close to yours, and finally 
* 20 out of 50 are not charismatic.

We can translate these numbers into chances:

* $\frac{45}{50}$ is the chance that a promoted person is as punctual as you. 
* $\frac{10}{50}$ is the chance that a promoted person is in the same class of appearance, and finally 
* $\frac{40}{50}$ and $\frac{20}{50}$ are the corresponding chances for client satisfaction and for not being charismatic. 

Now you can quantify the probabilities. Once again, you are interested in calculating the probability of promotion given your scores. Putting the two elements (1) and (2) together we have: 

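$\frac{50}{100}\times \big[\frac{45}{50} \times \frac{10}{50} \times \frac{40}{50} \times \frac{20}{50}\big]$ which is about 3\%. 
To score the other class, you need the same chances among the 50 employees who are not yet promoted. In general these must be counted separately, because the chance of a score among the not-promoted is not simply one minus its chance among the promoted. Here we assume for simplicity that the counts happen to be the complements of the ones above (5, 40, 10 and 30 out of 50). Your score for not receiving a promotion is then: 

$\frac{50}{100}\times \big[(1-\frac{45}{50}) \times (1-\frac{10}{50}) \times (1-\frac{40}{50}) \times (1-\frac{20}{50})\big]$ which is about 0.5\%, and is 6 times smaller than the score for receiving the promotion. 

Thus, it is 6 times more likely that you will be in the promotion class than in the other class. Note that the two numbers are relative scores rather than final probabilities, which is why they do not add up to 100\%.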
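
The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the variable names are illustrative, and the chances for the not-promoted class are taken as the complements of the promoted ones, exactly as in the calculation above.

```python
from math import prod

# Element (1): how often there is a promotion in general (50 out of 100).
prior_promoted = 50 / 100
prior_not_promoted = 50 / 100

# Element (2): chance of each of your scores among promoted employees,
# in the order punctuality, appearance, client satisfaction, charisma.
likelihood_promoted = [45 / 50, 10 / 50, 40 / 50, 20 / 50]

# Assumption taken from the text: the chances for the other class are
# the complements of the chances for the promoted class.
likelihood_not_promoted = [1 - p for p in likelihood_promoted]

# Score of a class = prior * product of the per-feature chances.
score_promoted = prior_promoted * prod(likelihood_promoted)
score_not_promoted = prior_not_promoted * prod(likelihood_not_promoted)
ratio = score_promoted / score_not_promoted

print(f"promoted:     {score_promoted:.4f}")
print(f"not promoted: {score_not_promoted:.4f}")
print(f"ratio:        {ratio:.1f}")
```

The two scores come out at roughly 0.0288 and 0.0048, which matches the "about 3%" and "about 0.5%" quoted above.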

<div class="alert alert-success">
Congratulations! You have just built your first Naïve Bayes Classifier. You can also use it to assess your colleagues' chances of promotion given their specific features.
</div>

If you have followed all the steps so far, you will be able to read the following expression which gives some structure to the story:<br>

$ P(A \mid B) \propto 
\prod_{i=1}^4 P(B_i\mid A)
P(A) $

where $A$ denotes promotion and $B$ denotes the features, with $B_i$ the $i$-th feature. Mathematically speaking, $P(A\mid B)$ is called a conditional probability and refers to the probability of event $A$ occurring given that $B$ is true. The general form of the Bayes rule is as follows:

$ P(A \mid B) = \frac{P(B\mid A)P(A) }{P(B)} $

Note that we have dropped the denominator: it is the same for every class and can therefore be ignored when comparing classes. We will be more precise about this in the following tasks.
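
If an actual probability is wanted, the denominator can be recovered from the numerators. As a sketch, write $A'$ for "no promotion" (a symbol introduced here for illustration). The law of total probability gives:

$ P(B) = P(B\mid A)P(A) + P(B\mid A')P(A') $

so dividing each class score by the sum of all class scores turns it into a proper probability. With the two scores from the promotion example (about 0.0288 and 0.0048), and under the same naive assumption:

$ P(A \mid B) \approx \frac{0.0288}{0.0288 + 0.0048} \approx 0.86 $

The ranking of the classes is unchanged by this division, which is why the denominator can be skipped when only the most likely class matters.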

### **Summary:**
* We can use the Bayes rule to build a simple classification tool.
* The classifier is naive because we assume the features are independent of each other within each class. That is,

$P(B \mid A) = \prod_{i=1}^4 P(B_i\mid A)= P(B_1\mid A)P(B_2\mid A)P(B_3\mid A)P(B_4\mid A)$
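
To make the "simple classification tool" concrete, here is one possible sketch of a reusable helper. The function name `classify` and the dictionary layout are choices made for this illustration, not part of any library. It multiplies the prior by the per-feature chances, as in the product above, and returns the class with the largest score.

```python
from math import prod

def classify(priors, likelihoods):
    """Return (best_class, scores) under the naive independence assumption.

    priors:      {class_name: P(class)}
    likelihoods: {class_name: [P(feature_i | class), ...]}
    """
    # Unnormalized score per class: P(A) * P(B_1|A) * ... * P(B_n|A).
    scores = {c: priors[c] * prod(likelihoods[c]) for c in priors}
    # The denominator P(B) is common to all classes, so the largest
    # score already identifies the most likely class.
    best_class = max(scores, key=scores.get)
    return best_class, scores

# Promotion example from the introduction.
best_class, scores = classify(
    priors={"promoted": 0.5, "not promoted": 0.5},
    likelihoods={
        "promoted": [0.9, 0.2, 0.8, 0.4],
        "not promoted": [0.1, 0.8, 0.2, 0.6],
    },
)
print(best_class, scores)
```

The same helper works for any number of classes and features, as long as every class lists its feature chances in the same order.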

---
### Task 1 
Suppose that you have 1000 fruits, each of which is a banana, an orange or some other type. The objective is to find out whether a fruit is a banana, an orange or other given a set of features, namely long, sweet and yellow. Below is all we know about the features:

|<span style="color:blue">type</span>     |   long   | not long |   sweet  | not sweet|  yellow  | not yellow |  **<span style="color:blue">total</span>**  |
|:-------- | :---: | :---: | :---: | :---: | :---: | :---:   | --------:|
|<span style="color:blue">banana</span>   |   400    |   100    |   350    |   150    |  450     |    50      | <span style="color:blue">500</span>     |
|<span style="color:blue">orange</span>   |   0      |   300    |   150    |   150    |   300    |    0       |   <span style="color:blue">300</span>   |
|<span style="color:blue">other</span>    |   100    |   100    |   150    |   50     |   50     |    150     |   <span style="color:blue">200</span>   |
|**<span style="color:blue">total</span>**    |   500    |   500    |    650   |    350   |  800     |   200      |  **<span style="color:blue">1000</span>**   |

This example differs from our earlier example in two ways. First, we have three classes instead of two. We could also have developed the earlier example with three classes: employees who get a promotion, those who remain at the same level, and those who are demoted. The second difference is in the amount of available information, that is, the number of features. Here we have three features, whereas we had four in our earlier example. However, we will be able to use the Bayes Classifier in the same way as before. As a little hint, we infer from the table that the probability of having a banana is 50\% because 500 out of 1000 fruits are bananas.
 
Use the idea of the Bayes Classifier we just developed and: <br> 
1. **Find the chance that a fruit is a banana if it is long, sweet and yellow.** 
1. **Calculate the same quantity for orange and other fruits. What's your conclusion?**
1. **Explain why the Bayes Classifier would be naive**. 

--- 
### Task 2

Suppose that you read in the newspaper that a new virus has emerged from a particular region and is spreading, causing a new type of flu. The following table is also published, listing the symptoms and diagnoses of the suspected cases in your local community. Y means yes and N means no.

|recently travelling  |runny nose| headache|fever | **<span style="color:blue">flu?</span>** |  
|:---: | :---: | :---: | :---: | :---: | 
| Y | N | mild | Y | **<span style="color:red">N</span>** |
| Y | Y | No | N | **<span style="color:green">Y</span>** |
| Y | N | strong | Y | **<span style="color:green">Y</span>** |
| N | Y | mild | Y | **<span style="color:green">Y</span>** |
| N | N | No | N | **<span style="color:red">N</span>** |
| N | Y | strong | Y | **<span style="color:green">Y</span>** |
| N | Y | strong | N | **<span style="color:red">N</span>** |
| Y | Y | mild | Y | **<span style="color:green">Y</span>** |

You wonder how you should interact with your colleagues or friends if they have similar symptoms. For instance you just realized that your neighbor had the following symptoms this morning:

|recently travelling  |runny nose| headache|fever |  
|:---: | :---: | :---: | :---: | 
| Y | N | mild | N | 

The main difference between this task and the previous one is the way the data are presented. If we had the symptoms and diagnoses of, say, 1000 suspects, we would need to present the information in a more compact way, as in the table of Task 1.  <br>
**Use the idea of the Naive Bayes Classifier to decide whether you need to take any precautionary measures with your neighbor.**


### Task 3

In this task we would like to discover other use cases of the Naive Bayes Classifier. Examples are endless, but below is a shortlist. Do you think the following items could be the titles of naive Bayes classification problems? If so, think about how you can develop a story and, in particular, how you would define classes and features for each case.

* email spam detection!
* classifying news!
* classifying clients' feelings (sentiment analysis)!
* weather prediction!

<a id='solutions'></a> 
## Solutions 

### Task 1
```
A: {banana}
B: {long, sweet, yellow}
B1: {long}
B2: {sweet}
B3: {yellow}
---------------------------------------------------------
P(A|B) = P(banana | long, sweet, yellow) ? 

From Bayes rule we know that:
P(A|B) = P(A)P(B1|A)P(B2|A)P(B3|A) / P(B) 

Information in the table implies that:

P(banana) = 0.5
P(long|banana) = 0.8
P(sweet|banana) = 0.7
P(yellow|banana) = 0.9

P(banana|B) = 0.5*0.8*0.7*0.9 / P(B) 
            = 0.252 / P(B)
---------------------------------------------------------
Following the same logic:
P(orange|B) = 0 / P(B) 
P(other|B) = 0.2*0.5*0.75*0.25 / P(B)
           = 0.01875 / P(B) 
```
**We note that the chance of picking up a banana given the features is much higher than that of other fruits**.
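
The same numbers can be reproduced directly from the counts in the Task 1 table. The sketch below is one way to do it (the dictionary layout and variable names are illustrative). It also divides by the sum of the scores, which plays the role of $P(B)$ under the naive assumption, to obtain probabilities that add up to one.

```python
from math import prod

# Counts from the Task 1 table for the features of interest.
counts = {
    "banana": {"long": 400, "sweet": 350, "yellow": 450, "total": 500},
    "orange": {"long": 0, "sweet": 150, "yellow": 300, "total": 300},
    "other": {"long": 100, "sweet": 150, "yellow": 50, "total": 200},
}
features = ["long", "sweet", "yellow"]
n_fruits = sum(c["total"] for c in counts.values())  # 1000 fruits

scores = {}
for fruit, c in counts.items():
    prior = c["total"] / n_fruits                        # e.g. P(banana)
    likelihoods = [c[f] / c["total"] for f in features]  # e.g. P(long | banana)
    scores[fruit] = prior * prod(likelihoods)

# Normalize: divide each score by the sum of all scores.
evidence = sum(scores.values())
posterior = {fruit: s / evidence for fruit, s in scores.items()}

for fruit in counts:
    print(f"{fruit}: score={scores[fruit]:.5f}, posterior={posterior[fruit]:.3f}")
```

After normalization the banana class gets a probability of roughly 0.93. Note that the single zero count (no long oranges) forces the whole orange score to zero.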

### Task 2
```
A: {flu=Y}
B: {travelling=Y, runny nose=N, headache=mild, fever=N}
B1: {travelling=Y}
B2: {runny nose=N}
B3: {headache=mild}
B4: {fever=N}
---------------------------------------------------------
P(A|B) = P(flu=Y | travelling=Y, runny nose=N, headache=mild, fever=N) ? 

From Bayes rule we know that:
P(A|B) = P(A)P(B1|A)P(B2|A)P(B3|A)P(B4|A) / P(B) 

Information in the table implies that:

P(flu=Y) = 0.625
P(travelling=Y|flu=Y) = 0.6
P(runny nose=N|flu=Y) = 0.2
P(headache=mild|flu=Y) = 0.4
P(fever=N|flu=Y) =0.2

P(A|B) = 0.625*0.6*0.2*0.4*0.2 / P(B) 
       = 0.006 / P(B)
---------------------------------------------------------
Similarly:

P(flu=N) = 0.375
P(travelling=Y|flu=N) = 0.333
P(runny nose=N|flu=N) = 0.667
P(headache=mild|flu=N) = 0.333
P(fever=N|flu=N) = 0.667

P(A'|B) = 0.375*0.333*0.667*0.333*0.667 / P(B) 
        = 0.0185 / P(B)
where A' refers to not having flu.
```
**We note that the score for having flu given the symptoms (0.006) is about 3 times smaller than the score for not having flu (0.0185). Dividing by their sum, the chance of flu is about 24%**.
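
When the data come as individual rows, as in Task 2, the chances can be counted directly from the rows. The following is a minimal sketch under the naive assumption. The tuple layout and the helper name `score` are choices made for this illustration.

```python
from math import prod

# Rows of the Task 2 table: (travelling, runny nose, headache, fever, flu).
rows = [
    ("Y", "N", "mild", "Y", "N"),
    ("Y", "Y", "No", "N", "Y"),
    ("Y", "N", "strong", "Y", "Y"),
    ("N", "Y", "mild", "Y", "Y"),
    ("N", "N", "No", "N", "N"),
    ("N", "Y", "strong", "Y", "Y"),
    ("N", "Y", "strong", "N", "N"),
    ("Y", "Y", "mild", "Y", "Y"),
]
# Symptoms of the neighbor, in the same column order.
neighbor = ("Y", "N", "mild", "N")

def score(label):
    """Prior of `label` times the chance of each neighbor symptom given it."""
    group = [r for r in rows if r[4] == label]
    prior = len(group) / len(rows)
    likelihoods = [
        sum(r[i] == value for r in group) / len(group)
        for i, value in enumerate(neighbor)
    ]
    return prior * prod(likelihoods)

score_flu = score("Y")
score_no_flu = score("N")
# Normalizing by the sum of the two scores gives a proper probability.
p_flu = score_flu / (score_flu + score_no_flu)

print(f"flu: {score_flu:.4f}, no flu: {score_no_flu:.4f}, P(flu | symptoms): {p_flu:.2f}")
```

With only eight rows the estimated chances are very coarse, so the result should be read as an illustration of the method rather than as medical guidance.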