# The Naive Bayes Classifier

## Definition

Let's begin by defining the Naive Bayes Classifier; then we can discuss why it is useful to us and how to use it well.

We will define a Naive Bayes Classifier as a simple probabilistic classifier based on Bayes' Theorem,

$$
p(\theta|x') = q(\theta)\frac{q(x'|\theta)}{q(x')}
$$

It is considered *naive* because of the assumption one makes about the values of the predictors, or features: namely, that all features are independent of one another given the class. One can easily see why this is naive, but let's work through an example to explore this assumption and the other nuances of this Classifier.
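
Written out, the independence assumption lets the likelihood of a feature vector factor into one term per feature. The symbols $x_i$ (the individual features of $x'$) and $n$ (their count) are notation introduced here for illustration; the denominator $q(x')$ is dropped because it is the same for every class.

$$
p(\theta|x') \propto q(\theta)\prod_{i=1}^{n} q(x_i|\theta)
$$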

## The NCAA Men's Basketball Tournament - Let the Madness Begin!

I thought that it might be fun to generate an example from the basketball statistics of March Madness. 

**I will go step by step through how I approach the data and apply my classifier to it.**

Let's begin.

### Defining the Problem

In basketball, there are many stats one can use to characterize a team. We will use some of these stats to build our classifier.

First, we must decide what data will be useful. I will take the data from the past 7 years of NCAA seasons and tournament results.

In [1]:
import pandas as pd
import numpy as np
import math

data17 = pd.read_csv('data/fulldata.csv')
data17.head()

| Unnamed: 0 | Tournament Wins | School | Season | PPG | Opp PPG | Rank Rating | Kenpom | Seed | SOS | 2PPG | ... | DRBPG | ASTPG | STLPG | BLKPG | TOVPG | PFPG | FG% | 2P% | 3P% | FT% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Southern California | 2016-17 | 78.7 | 73.18 | 7.07 | 350.0 | 11.0 | 98.0 | 19.222222 | ... | 20.777778 | 14.0 | 5.805556 | 1.833333 | 12.305556 | 15.222222 | 0.456 | 0.51 | 0.362 | 0.741 |
| 1 | 1 | Florida State | 2016-17 | 82.45 | 71.27 | 8.31 | 350.0 | 3.0 | 34.0 | 22.942857 | ... | 25.085714 | 15.8 | 6.914286 | 2.828571 | 10.171429 | 18.714286 | 0.483 | 0.544 | 0.347 | 0.689 |
| 2 | 1 | Iowa State | 2016-17 | 80.88 | 72.0 | 8.03 | 350.0 | 5.0 | 11.0 | 19.628571 | ... | 24.485714 | 13.457143 | 4.771429 | 5.2 | 10.685714 | 18.142857 | 0.472 | 0.518 | 0.4 | 0.7 |
| 3 | 1 | Kansas State | 2016-17 | 71.73 | 66.94 | 6.84 | 350.0 | 11.0 | 25.0 | 17.771429 | ... | 22.971429 | 15.714286 | 6.542857 | 3.514286 | 11.114286 | 16.857143 | 0.461 | 0.518 | 0.362 | 0.694 |
| 4 | 1 | Michigan State | 2016-17 | 71.73 | 68.36 | 6.35 | 350.0 | 9.0 | 12.0 | 18.057143 | ... | 23.0 | 15.628571 | 4.628571 | 3.457143 | 10.542857 | 16.6 | 0.47 | 0.529 | 0.373 | 0.669 |


The stats I have chosen are the per-game values of the following:

- Points
- Opponent's Points
- Ranks of many kinds
- Strength of Schedule
- 2P and 2PA
- 3P and 3PA
- FT and FTA
- the shooting percentages for the above three
- ORB
- DRB
- AST
- STL
- BLK
- TOV
- PF

And the stat I have chosen to use as our classes is the number of Tournament wins in the given season.

As we can see, not all of these are independent. We need to choose carefully which to include and which to leave out in order to obtain a worthwhile result.
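
One way to make the dependence between features concrete is to list the pairs of columns whose correlation is high. The following is only a sketch: the helper `correlated_pairs`, the 0.9 threshold, and the synthetic `demo` table are assumptions made for illustration and are not part of the original analysis or data.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for every pair of columns with |r| >= threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for a in range(len(cols)):
        for b in range(a + 1, len(cols)):
            r = corr.iloc[a, b]
            if abs(r) >= threshold:
                pairs.append((cols[a], cols[b], float(r)))
    return pairs

# Synthetic stand-in for the real table: made-up numbers, not tournament data.
rng = np.random.default_rng(0)
fta = rng.normal(20.0, 3.0, size=200)            # free throw attempts per game
ft = 0.7 * fta + rng.normal(0.0, 0.5, size=200)  # makes track attempts closely
stl = rng.normal(6.0, 1.0, size=200)             # generated independently of the others
demo = pd.DataFrame({'FTPG': ft, 'FTAPG': fta, 'STLPG': stl})

pairs = correlated_pairs(demo)
print(pairs)
```

Applied to the real table, a call such as `correlated_pairs(data17[feature_columns])` (with `feature_columns` a hypothetical list of numeric columns) would flag candidates to drop before training.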

In [2]:
import seaborn as sb

sb.pairplot(data17)

<seaborn.axisgrid.PairGrid at 0x7fae186b5a20>

We also wish to understand the data well enough to pick the most appropriate prior possible. While we have already seen that more data can overwhelm the prior, 7 years' worth of tournaments does not actually give us much data, so we want to put our best foot forward.

In [3]:
# the 20 features used by the classifier, in a fixed order
features = ['PPG', 'Opp PPG', 'SOS', '2PPG', '2PAPG', '3PPG', '3PAPG', 'FTPG', 'FTAPG', 'ORBPG', 'DRBPG', 'ASTPG', 'STLPG', 'BLKPG', 'TOVPG', 'PFPG', 'FG%', '2P%', '3P%', 'FT%']

# class labels: tournament wins for each team
y = data17['Tournament Wins'].tolist()

print(y)

# one list of feature values per team
x = data17[features].values.tolist()

print(x)

[2, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 3, 1, 1, 1, 0, 0, 0, 4, 0, 0, 0, 2, 1, 2, 1, 2, 2, 2, 4, 2, 2, 1, 3, 3, 1, 1, 2, 3, 6, 1, 5, 2, 2, 2, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 4, 1, 3, 1, 1, 2, 0, 0, 1, 2, 0, 0, 1, 0, 2, 2, 1, 2, 0, 1, 2, 0, 3, 1, 4, 0, 6, 5, 3, 3, 1, 1, 0, 0, 0, 0, 6, 1, 3, 4, 0, 0, 3, 0, 1, 0, 0, 0, 0, 2, 1, 0, 1, 0, 0, 0, 4, 0, 0, 2, 0, 0, 2, 1, 0, 0, 1, 1, 1, 1, 2, 2, 0, 0, 0, 3, 0, 0, 0, 1, 2, 0, 2, 1, 0, 0, 1, 1, 2, 2, 1, 1, 0, 5, 3, 1, 3, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 3, 2, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 2, 0, 1, 0, 0, 0, 6, 2, 0, 2, 0, 1, 1, 5, 1, 0, 0, 0, 2, 3, 1, 4, 1, 1, 1, 0, 1, 1, 2, 4, 0, 3]
[[78.7, 73.18, 98.0, 19.22222222, 37.69444444, 7.8611111110000005, 21.72222222, 16.19444444, 21.86111111, 7.083333333, 20.77777778, 14.0, 5.805555556, 1.8333333330000001, 12.30555556, 15.22222
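
To get a feel for the labels before choosing a prior, one option is to tabulate how often each win total occurs. This is a minimal sketch: `class_frequencies` is a hypothetical helper and `sample_y` is a short made-up list, not the `y` printed above.

```python
from collections import Counter

def class_frequencies(labels, n_classes=7):
    """Fraction of labels equal to each class 0..n_classes-1."""
    counts = Counter(labels)
    n = len(labels)
    return [counts.get(k, 0) / n for k in range(n_classes)]

# Hypothetical short list of win totals, for illustration only.
sample_y = [2, 1, 1, 1, 0, 0, 0, 0, 3, 6]
freqs = class_frequencies(sample_y)
print(freqs)
```

Calling `class_frequencies(y)` on the full label list would give an empirical alternative to compare against whatever prior is chosen.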

### Selecting our Priors

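Now that we have our data selected, we choose the distributions that go into Bayes' Theorem: the conditional distribution of the features and the prior over the classes.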


#### Choosing the conditional distribution

$$
q(x'|\theta)
$$

We decide this by looking back at our histograms. Since our features are continuous, we want the probability density function that fits the data best. From the histograms, a Gaussian looks like a fair choice.

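$$
q(x'|\theta) = \prod_i \frac{1}{\sqrt{2\pi \sigma_{\theta,i}^2}}e^{-\frac{(x_i - \mu_{\theta,i})^2}{2\sigma^2_{\theta,i}}}
$$

Here $\mu_{\theta,i}$ and $\sigma^2_{\theta,i}$ are the mean and variance of feature $x_i$ among the teams in class $\theta$.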
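
As a small illustration of fitting and evaluating a Gaussian for a single feature within a single class, consider the sketch below. The helper names `fit_gaussian` and `gaussian_pdf` and the PPG values are assumptions introduced here, not part of the original notebook.

```python
import math

def fit_gaussian(values):
    """Return the mean and (population) variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var

def gaussian_pdf(x, mean, var):
    """Density at x of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical PPG values for the teams in one class (made-up numbers).
ppg = [70.0, 74.0, 78.0, 82.0]
mu, var = fit_gaussian(ppg)
density = gaussian_pdf(76.0, mu, var)
print(mu, var, density)
```

Under the naive assumption, the likelihood of a whole team is the product of one such density per feature.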

#### Choosing the prior distribution for tournament wins

$$
q(\theta)
$$

This is decidedly trickier. The only thing I know about the tournament is that, in the 64-team bracket, one team will win six games (the Champion), one will win five games, two will win four games, four will win three games, eight will win two games, 16 will win one game, and 32 will win none.

For this, I have decided to treat each game as a coin flip, so, leaving the prior unnormalized,

$$
q(\theta) = 2^{-\theta}
$$
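
To see what the coin-flip assumption implies, the sketch below computes the exact probability of finishing with each win total and compares it with the unnormalized $2^{-\theta}$. It assumes a 64-team, six-round single-elimination bracket with no play-in games; the function names are hypothetical.

```python
def coin_flip_win_distribution(rounds=6):
    """P(exactly k wins), k = 0..rounds, when each game is a fair coin flip."""
    probs = [0.5 ** (k + 1) for k in range(rounds)]  # win k games, then lose one
    probs.append(0.5 ** rounds)                      # win every round
    return probs

def bracket_counts(teams=64, rounds=6):
    """Expected number of teams finishing with exactly k wins."""
    return [round(teams * p) for p in coin_flip_win_distribution(rounds)]

exact = coin_flip_win_distribution()
counts = bracket_counts()

# The prior q(theta) = 2^(-theta), before and after normalization.
unnormalized = [2.0 ** -k for k in range(7)]
total = sum(unnormalized)
normalized = [u / total for u in unnormalized]
print(counts)
```

Since a classifier only compares scores across classes for the same team, the missing normalizing constant does not change which class scores highest. Note that the exact coin-flip probabilities agree with $2^{-\theta}$ up to a constant factor for zero through five wins, but not for six.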

In [7]:
class FullGNB:
    # Gaussian Naive Bayes over the seven classes (0 to 6 tournament wins)
    def __init__(self):
        self.n_classes = 7
        self.X_tn = []   # training features
        self.T_tn = []   # training targets
        self.E_x = []    # per-class, per-feature mean E[x]
        self.E_x2 = []   # per-class, per-feature E[x^2]
        self.var_x = []  # per-class, per-feature variance

    def train(self, x, y):
        # store a copy of the training data, then fit the Gaussians
        self.X_tn = [list(xi) for xi in x]
        self.T_tn = list(y)
        self.build()

    def build(self):
        # this generates the mean and variance for the distributions over each class
        n_features = len(self.X_tn[0])
        self.E_x, self.E_x2, self.var_x = [], [], []
        for Ck in range(self.n_classes):
            rows = [xi for xi, yi in zip(self.X_tn, self.T_tn) if yi == Ck]
            c = len(rows)  # every class needs at least one training row
            mean = [sum(float(r[j]) for r in rows) / c for j in range(n_features)]
            mean2 = [sum(float(r[j])**2 for r in rows) / c for j in range(n_features)]
            self.E_x.append(mean)
            self.E_x2.append(mean2)
            # Var[x] = E[x^2] - E[x]^2
            self.var_x.append([m2 - m**2 for m, m2 in zip(mean, mean2)])

    def predict(self, x):
        # this generates the unnormalized probability for each class by team:
        # the prior 2^(-Ck) once, times the likelihood of every feature
        p = [[2**(-Ck) for Ck in range(self.n_classes)] for xi in x]

        for i in range(len(x)):
            for j in range(len(x[0])):
                for Ck in range(self.n_classes):
                    p[i][Ck] *= self.Guass(x[i][j], Ck, j)

        return p

    def Guass(self, xi, Ck, feature):
        # this is the Gaussian density of feature value xi under class Ck
        var = self.var_x[Ck][feature]
        mu = self.E_x[Ck][feature]
        return (1 / (2 * math.pi * var)**0.5) * math.exp(-(xi - mu)**2 / (2 * var))

In [8]:
cls = FullGNB()

cls.train(x, y)

data18 = pd.read_csv('data/2018data.csv')

# same feature columns, in the same order, as the training data
xtest = data18[['PPG', 'Opp PPG', 'SOS', '2PPG', '2PAPG', '3PPG', '3PAPG', 'FTPG', 'FTAPG', 'ORBPG', 'DRBPG', 'ASTPG', 'STLPG', 'BLKPG', 'TOVPG', 'PFPG', 'FG%', '2P%', '3P%', 'FT%']].values.tolist()

In [9]:
p = cls.predict(xtest)
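
`predict` is written to return one unnormalized score per class for each team, so the predicted number of wins is simply the class with the largest score. Multiplying twenty densities can produce very small numbers, and a common alternative is to add logarithms instead. The sketch below shows that variant on a made-up two-class, two-feature problem; `log_posterior`, `most_likely_class`, and all parameter values are assumptions for illustration, not the notebook's fitted values.

```python
import math

def log_posterior(x, means, variances, log_prior):
    """Unnormalized log posterior of one feature vector for each class."""
    scores = []
    for k in range(len(log_prior)):
        s = log_prior[k]
        for xi, mu, var in zip(x, means[k], variances[k]):
            # log of the Gaussian density of xi under class k
            s += -0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
        scores.append(s)
    return scores

def most_likely_class(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=lambda k: scores[k])

# Hypothetical parameters: two classes, two features each.
means = [[70.0, 10.0], [80.0, 14.0]]
variances = [[16.0, 4.0], [16.0, 4.0]]
log_prior = [math.log(0.5), math.log(0.5)]

scores = log_posterior([79.0, 13.0], means, variances, log_prior)
best = most_likely_class(scores)
print(best)
```

For the product-based scores, the same selection is `most_likely_class(p[i])` for team `i`.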