# Classification using Naive Bayes

## 1.1) Naive Bayes
- Naive Bayes maps the probabilities of observed input features given each class to a probability distribution over classes, using Bayes' Theorem.
    - \begin{equation*}
    P(A|B)=\frac{P(B|A)P(A)}{P(B)}
    \end{equation*}
  
## 1.2) Naive Bayes Terms
<p>
-Naive Bayes is useful because it provides a very simple way to classify data into classes.<br>
 </p>
 <p>
    -Given a data sample x with n features (x1, x2, ..., xn), we treat x as a feature vector.<br>
    -The goal of Naive Bayes is to determine the probability that the sample belongs to each of K possible classes (Y1, Y2, ..., YK).
 </p>
    \begin{equation*}
    P(Yk|x)=\frac{P(x|Yk)P(Yk)}{P(x)}
    \end{equation*}
    
         - P(Yk): Prior. Portrays how classes are distributed before any data is observed.
             - "What's the probability that this thing is the specific class Yk?"
         - P(Yk|x): Posterior. Portrays how classes are distributed given the extra knowledge of the observation.
             - "What's the probability that this thing is the specific class Yk, given all these features are present?"
         - P(x|Yk): Likelihood. Joint distribution of the n features given that the sample belongs to class Yk.
             - "What's the probability that all these features are present given that this thing is the specific class Yk?"
             - Because we (NAIVELY) assume the features are independent given the class, the joint conditional distribution of the n features is the product of the individual feature conditional distributions.
                 - P(x|Yk)=P(x1|Yk)*P(x2|Yk)*...*P(xn|Yk).
         - P(x): Evidence. Depends only on the distribution of the features, not on any specific class.
             - "What's the probability that all these features are present?"
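As a minimal sketch of the factorization above, the following computes P(Yk|x) for every class from the priors and the per-feature conditional probabilities. The function name `naive_bayes_posterior` and the toy numbers are illustrative assumptions, not part of the original material.

```python
from math import prod

def naive_bayes_posterior(priors, feature_likelihoods):
    """Return P(Yk|x) for each class k.

    priors: list of P(Yk), one entry per class.
    feature_likelihoods: one list per class holding P(xi|Yk)
        for the observed features x1..xn.
    """
    # Numerator of Bayes' Theorem under the naive independence assumption:
    # P(x|Yk) * P(Yk) = P(x1|Yk) * ... * P(xn|Yk) * P(Yk)
    joint = [p * prod(ls) for p, ls in zip(priors, feature_likelihoods)]
    # Evidence P(x): the sum of the numerators over all classes
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Hypothetical two-class example with two observed features
posteriors = naive_bayes_posterior([0.5, 0.5], [[0.9, 0.2], [0.3, 0.4]])
print(posteriors)
```

Because P(x) is the same for every class, it only rescales the numerators so that the posteriors sum to 1.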

## 1.3) Example 1: Coin Flipping.
- Example: You have two coins and pick one at random, each with equal probability. One coin flips fairly; the other lands heads 80% of the time. What's the probability that the coin you just flipped is the unfair one, given that you got heads?
    - \begin{equation*}
    P(\text{Coin Unfair}|\text{Is Heads})=\frac{P(\text{Is Heads}|\text{Coin Unfair})*P(\text{Coin Unfair})}{P(\text{Is Heads})}
    \end{equation*}

     \begin{equation*}
    P(\text{Coin Unfair}|\text{Is Heads})=\frac{P(\text{Is Heads}|\text{Coin Unfair})*P(\text{Coin Unfair})}{P(\text{Is Heads}|\text{Coin Unfair})*P(\text{Coin Unfair})+ P(\text{Is Heads}|\text{Coin Fair})*P(\text{Coin Fair})}
    \end{equation*}
    
     \begin{equation*}
    P(\text{Coin Unfair}|\text{Is Heads})=\frac{0.8*0.5}{0.8*0.5+0.5*0.5}=\frac{8}{13}\approx 0.615
    \end{equation*}
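The coin calculation can be checked with exact fractions. This is a small sketch; the variable names are illustrative, and the 1/2 prior assumes each coin is equally likely to be the one flipped.

```python
from fractions import Fraction

# Assumed prior: each coin is equally likely to be picked
p_unfair = Fraction(1, 2)
p_fair = Fraction(1, 2)
p_heads_given_unfair = Fraction(8, 10)
p_heads_given_fair = Fraction(1, 2)

# Evidence: total probability of seeing heads
p_heads = p_heads_given_unfair * p_unfair + p_heads_given_fair * p_fair

# Bayes' Theorem
p_unfair_given_heads = p_heads_given_unfair * p_unfair / p_heads
print(p_unfair_given_heads)  # 8/13
```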

## 1.4) Example 2: Spam Mail
### 1.4.1) Let's say we have four emails, each labeled as spam or not spam, containing the terms below. How do we predict how likely a new email is to be spam?

<table style="width:100%">
  <tr>
    <th>ID</th> 
    <th>Terms in e-mail</th>
    <th>Is it Spam</th>
  </tr>
  <tr>
    <td>1</td>
    <td>Click win prize</td>
    <td>Yes</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Click meeting setup meeting </td>
    <td>No</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Prize free prize</td>
    <td>Yes</td>
  </tr>
  <tr>
    <td>4</td>
    <td>Click prize free</td>
    <td>Yes</td>
  </tr>
  <tr>
    <td>5</td>
    <td>Free setup meeting free</td>
    <td>?</td>
  </tr>
  
 </table>
 
 The first four emails, which are labeled, are what we train the classifier on (training set), and the last email is what we're testing (test set).

   
### 1.4.2: Steps
#### 1.4.2.i: Define S and NS events as the email being spam or not spam.
- From the training set, we get the following:
\begin{equation*}
    P(S)=\frac{3}{4}
    \end{equation*}
\begin{equation*}
    P(NS)=\frac{1}{4}
    \end{equation*}
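The same priors can be counted directly from the training labels. This is a small sketch; the list `train_labels` simply transcribes the "Is it Spam" column for emails 1 to 4.

```python
from collections import Counter

# Labels of the four training emails (IDs 1-4) from the table
train_labels = ["S", "NS", "S", "S"]

counts = Counter(train_labels)
# Prior of a class = number of emails in that class / total number of emails
priors = {label: count / len(train_labels) for label, count in counts.items()}
print(priors)  # {'S': 0.75, 'NS': 0.25}
```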


#### 1.4.2.ii: Set up P(x|S), P(x|NS):
- We want to calculate P(x|S), where x=(free, setup, meeting, free).
- We'll need P(free|S), P(setup|S), P(meeting|S) based on the training set.
    - Each is the ratio of the term's occurrences to all term occurrences in class S in the training set.
- However, because "free" never appears in NS, P(free|NS) is 0. Because of this and the conditional independence assumption, P(x|NS)=0, and because this term is in the numerator of our equation, P(NS|x) immediately becomes 0.
    - To smooth this out, we'll start counting term occurrences from 1 rather than 0.
        - This is Laplace smoothing.
    - \begin{equation*}
    P(free|S)=\frac{2+1}{9+6}=\frac{3}{15}
    \end{equation*}
    
    - 2: Occurrences of the term "free" in the S class.
    - 1: Laplace smoothing constant.
    - 9: Total term occurrences in the S class (spam).
    - 6: One additional count per vocabulary term from the Laplace smoothing constant (click, win, prize, meeting, setup, free).
    
     \begin{equation*}
    P(free|NS)=\frac{0+1}{4+6}=\frac{1}{10}
    \end{equation*}
    
    Similarly...
    
     \begin{equation*}
    P(setup|S)=\frac{0+1}{9+6}=\frac{1}{15}
    \end{equation*}
    
     \begin{equation*}
    P(setup|NS)=\frac{1+1}{4+6}=\frac{2}{10}
    \end{equation*}
    
     \begin{equation*}
    P(meeting|S)=\frac{0+1}{9+6}=\frac{1}{15}
    \end{equation*}
    
     \begin{equation*}
    P(meeting|NS)=\frac{2+1}{4+6}=\frac{3}{10}
    \end{equation*}
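The smoothed conditionals above can be reproduced from the training emails. This is a sketch under the assumption that terms are lowercased and split on whitespace; `smoothed_likelihood` and `alpha` are illustrative names.

```python
from collections import Counter
from fractions import Fraction

# Training emails from the table, lowercased
train = [
    ("click win prize", "S"),
    ("click meeting setup meeting", "NS"),
    ("prize free prize", "S"),
    ("click prize free", "S"),
]

# Per-class term counts and the shared vocabulary
term_counts = {"S": Counter(), "NS": Counter()}
for text, label in train:
    term_counts[label].update(text.split())
vocabulary = set(term_counts["S"]) | set(term_counts["NS"])

def smoothed_likelihood(term, label, alpha=1):
    """P(term|label) with Laplace smoothing constant alpha."""
    total = sum(term_counts[label].values())
    return Fraction(term_counts[label][term] + alpha,
                    total + alpha * len(vocabulary))

for term in ("free", "setup", "meeting"):
    print(term, smoothed_likelihood(term, "S"), smoothed_likelihood(term, "NS"))
```

`Fraction` prints values in lowest terms, so 3/15 appears as 1/5 and 2/10 as 1/5.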
    
    Therefore...
#### 1.4.2.iii: Calculate P(S|x), P(NS|x)

    \begin{equation*}
    P(S|x)=\frac{P(x|S)P(S)}{P(x)}
    \end{equation*}
    
    \begin{equation*}
    P(S)=\frac{3}{4}
    \end{equation*}
    
    \begin{equation*}
    P(S|x)=\frac{(P(free|S)*P(setup|S)*P(meeting|S)*P(free|S))*P(S)}{P(x|S)*P(S)+P(x|NS)*P(NS)}
    \end{equation*}
    
    \begin{equation*}
    P(S|x)=\frac{(P(free|S)*P(setup|S)*P(meeting|S)*P(free|S))*P(S)}{(P(free|S)*P(setup|S)*P(meeting|S)*P(free|S)*P(S))+(P(free|NS)*P(setup|NS)*P(meeting|NS)*P(free|NS)*P(NS))}
    \end{equation*}
    
    \begin{equation*}
    P(S|x)=\frac{\frac{3}{15}*\frac{1}{15}*\frac{1}{15}*\frac{3}{15}*\frac{3}{4}}{\frac{3}{15}*\frac{1}{15}*\frac{1}{15}*\frac{3}{15}*\frac{3}{4}+(\frac{1}{10}*\frac{2}{10}*\frac{3}{10}*\frac{1}{10}*\frac{1}{4})}
    \end{equation*}
    
    \begin{equation*}
    P(S|x)=\frac{8}{17}=1-P(NS|x)
    \end{equation*}
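The arithmetic can be verified with exact fractions. This sketch plugs in the smoothed conditionals and priors from the steps above; the variable names are illustrative.

```python
from fractions import Fraction
from math import prod

# Smoothed conditionals for x = (free, setup, meeting, free)
likelihood_s = [Fraction(3, 15), Fraction(1, 15), Fraction(1, 15), Fraction(3, 15)]
likelihood_ns = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(1, 10)]
prior_s, prior_ns = Fraction(3, 4), Fraction(1, 4)

# Numerators P(x|S)*P(S) and P(x|NS)*P(NS)
joint_s = prod(likelihood_s) * prior_s
joint_ns = prod(likelihood_ns) * prior_ns

posterior_s = joint_s / (joint_s + joint_ns)
print(posterior_s, 1 - posterior_s)  # 8/17 9/17
```

Since P(S|x) is below 1/2, email 5 would be labeled not spam under a most-probable-class rule.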
    

# 2: Implementing a Naive Bayes classifier to tell whether a story is happy or sad based on the terms it contains

## 2.1) Importing required packages
## 2.2) Set paths
## 2.3) Take data and prepare it for parsing.

In [1]:
import naive_bayes_work #File with back-end equations
import os
import csv
import numpy as np

data_dir = 'data'

# Read the vocabulary: the first column of each row in words.csv
with open(os.path.join(data_dir, 'words.csv'), 'r', newline='') as file:
    reader = csv.reader(file)
    vocabulary = [item[0] for item in reader]

# Loading data into numpy arrays
features_train = np.genfromtxt(os.path.join(data_dir, 'features_train.csv'), delimiter=',')
labels_train = np.genfromtxt(os.path.join(data_dir, 'labels_train.csv'), delimiter=',')
features_test = np.genfromtxt(os.path.join(data_dir, 'features_test.csv'), delimiter=',')
labels_test = np.genfromtxt(os.path.join(data_dir, 'labels_test.csv'), delimiter=',')

[NbConvertApp] Converting notebook naive_bayes_work.ipynb to script
[NbConvertApp] Writing 4884 bytes to naive_bayes_work.py


## 2.4) Set hyperparameter constants.
### Laplace Smoothing Constant: The count added to every feature so that a term unseen in a class does not get zero probability.

In [2]:
lsc=1

## 2.5) Get P(x|Yk) values
### Create the matrix of conditional probabilities P(x|Yk): how strongly each word indicates a happy or a sad story.

In [3]:
likelihood=naive_bayes_work.NBFeatureGivenLabel(features_train, labels_train, lsc)
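The source of `naive_bayes_work` is not shown here, so the following is only a guess at what such a function might compute, not its actual implementation. It assumes `features` is an (n_samples, n_words) count matrix and that labels are 0/1; the name `feature_given_label_sketch` and the toy data are hypothetical.

```python
import numpy as np

def feature_given_label_sketch(features, labels, alpha):
    """Hypothetical stand-in, not the naive_bayes_work implementation.

    Assumes features is an (n_samples, n_words) count matrix and labels
    holds 0/1. Returns a (2, n_words) matrix whose row k is P(word | class k).
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    n_words = features.shape[1]
    likelihood = np.empty((2, n_words))
    for k in (0, 1):
        # Total count of each word over all samples of class k
        word_counts = features[labels == k].sum(axis=0)
        # Laplace smoothing: add alpha to every word count
        likelihood[k] = (word_counts + alpha) / (word_counts.sum() + alpha * n_words)
    return likelihood

toy_features = np.array([[1, 0, 2], [0, 3, 1], [2, 1, 0]])
toy_labels = np.array([0, 1, 0])
toy_likelihood = feature_given_label_sketch(toy_features, toy_labels, alpha=1)
print(toy_likelihood)
```

Each row of the result sums to 1, because the smoothing adds `alpha` once per vocabulary word to both the numerator and the denominator.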

## 2.6) Get P(Yk) value
### Scalar indicating the chance that a story is sad, based on all training labels.

In [4]:
prior=naive_bayes_work.NBLabelPrior(labels_train)
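A prior of this kind is just a class frequency. The sketch below is a hypothetical stand-in, not the `naive_bayes_work` code, and it assumes that the class of interest is encoded as label 1.

```python
import numpy as np

def label_prior_sketch(labels):
    """Hypothetical stand-in: fraction of training labels equal to 1."""
    labels = np.asarray(labels)
    return float(np.mean(labels == 1))

toy_prior = label_prior_sketch([1, 0, 0, 1, 1])
print(toy_prior)  # 0.6
```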

## 2.7) Classify each test story using P(Yk|x)

In [5]:
labels_guess=naive_bayes_work.NBClassifier(likelihood, prior, features_test)
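One common way to implement this step is to compare log-probabilities, which avoids numerical underflow when many small probabilities are multiplied. The sketch below is an assumption about how such a classifier could look, not the `naive_bayes_work` implementation; the function name, the argument layout, and the 0/1 label encoding are all hypothetical.

```python
import numpy as np

def classify_sketch(likelihood, prior_class1, features):
    """Hypothetical stand-in, not the naive_bayes_work implementation.

    likelihood: (2, n_words) matrix of P(word | class k).
    prior_class1: scalar P(class 1).
    features: (n_samples, n_words) count matrix.
    Returns the predicted class (0 or 1) for each sample.
    """
    log_prior = np.log(np.array([1.0 - prior_class1, prior_class1]))
    # Sum of count * log P(word|class) replaces the product of probabilities
    log_joint = np.asarray(features, dtype=float) @ np.log(likelihood).T + log_prior
    # P(x) is shared by both classes, so the largest numerator wins
    return np.argmax(log_joint, axis=1)

toy_likelihood = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])
toy_features = np.array([[3, 1, 0], [0, 1, 4]])
toy_guess = classify_sketch(toy_likelihood, 0.5, toy_features)
print(toy_guess)  # [0 1]
```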

## 2.8) Calculate the error
### Based on your truth (labels_test) and what your classifier predicted (labels_guess).

In [6]:
labelcount = labels_guess.shape[0]
# Count the predictions that disagree with the true labels
incorrect_match = 0
for i in range(labelcount):
    if labels_guess[i] != labels_test[i]:
        incorrect_match += 1
# Error rate: fraction of test stories that were misclassified
error = float(incorrect_match) / float(labelcount)
print(error)

0.1103448275862069
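The same error rate can be computed without an explicit loop. This is a sketch on toy arrays; `guess` and `truth` stand in for `labels_guess` and `labels_test`.

```python
import numpy as np

# Toy stand-ins for labels_guess and labels_test
guess = np.array([1, 0, 1, 1, 0])
truth = np.array([1, 0, 0, 1, 1])

# The mean of the boolean mismatch mask is the fraction of wrong predictions
toy_error = float(np.mean(guess != truth))
print(toy_error)  # 0.4
```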


# Glossary
## Binary Classification: Labels each data sample as one of two classes.
## Multiclass Classification: Labels each data sample as one of three or more classes.
## Multilabel Classification: Assigns each data sample any number of labels at once, often by combining many binary or multiclass classifiers.
## Named-Entity Recognition (NER): Subtask of information extraction that seeks to locate named entities in text and classify them into pre-defined categories, such as names of persons and organizations.