# 07 Naive Bayes Tutorial
**Q4(b)**  
A ranking classifier can order the examples in a test set by its confidence in a given classification outcome.  
Naive Bayes is a ranking classifier because the ‘probability’ it assigns to each class can be used as that confidence measure.
1. Train a Naive Bayes classifier from the `AthleteSelection` data. Use `GaussianNB`.
2. Load the test data from `AthleteTest.csv` and apply the classifier. 
3. Use the `predict_proba` method to find the probability of being selected. 
4. Rank the test set by probability of being selected.  
    4.1. Who is most likely to be selected?  
    4.2. Who is least likely?  


In [1]:
import pandas as pd
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.metrics import confusion_matrix 

In [2]:
athlete = pd.read_csv('AthleteSelection.csv',index_col = 'Athlete')
athlete.head()

Athlete,Speed,Agility,Selected
x1,2.5,6.0,No
x2,3.75,8.0,No
x3,2.25,5.5,No
x4,3.25,8.25,No
x5,2.75,7.5,No


In [3]:
y = athlete['Selected'].values
X = athlete[['Speed','Agility']].values

In [4]:
# Fit a Gaussian Naive Bayes model, then predict on the training data itself
gnb = GaussianNB()
ath_NB = gnb.fit(X,y)
y_dash = ath_NB.predict(X)

In [5]:
confusion = confusion_matrix(y, y_dash)
print("Confusion matrix:\n{}".format(confusion)) 

Confusion matrix:
[[12  0]
 [ 1  7]]


In [6]:
print(y)
print(y_dash)

['No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'Yes' 'Yes'
 'Yes' 'Yes' 'Yes' 'Yes' 'Yes' 'Yes']
['No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'Yes' 'Yes'
 'No' 'Yes' 'Yes' 'Yes' 'Yes' 'Yes']
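
The confusion matrix above is easier to read with labelled rows and columns. The following is a minimal sketch that rebuilds `y` and `y_dash` from the arrays printed above rather than from the CSV file. The `labels` argument fixes the row and column order. The `cm_df` wrapper and its row and column names are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# True labels: 12 'No' followed by 8 'Yes', as printed above.
y = np.array(['No'] * 12 + ['Yes'] * 8)
# Predictions: identical except the third 'Yes' athlete, predicted as 'No'.
y_dash = y.copy()
y_dash[14] = 'No'

labels = ['No', 'Yes']
cm = confusion_matrix(y, y_dash, labels=labels)

# Rows are the true classes, columns are the predicted classes.
cm_df = pd.DataFrame(cm,
                     index=['true ' + l for l in labels],
                     columns=['pred ' + l for l in labels])
print(cm_df)
```

Read this way, the single off-diagonal count is one selected athlete that the model predicts as not selected. Because the predictions are made on the training data, this is a training-set result rather than an estimate of test performance.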


## Test Data 

In [7]:
ath_test = pd.read_csv('AthleteTest.csv',index_col = 'Athlete')
ath_test

Athlete,Speed,Agility
t1,3.3,8.2
t2,4.5,4.5
t3,5.5,7.2
t4,3.8,8.8
t5,5.5,5.2
t6,8.1,7.8
t7,7.7,5.2
t8,6.1,5.5
t9,5.5,6.0
t10,6.1,5.5


In [8]:
X_test = ath_test[['Speed','Agility']].values

In [9]:
yt_dash = ath_NB.predict_proba(X_test)
yt_dash

array([[9.58686371e-01, 4.13136290e-02],
       [8.77017219e-01, 1.22982781e-01],
       [8.80671574e-02, 9.11932843e-01],
       [8.49522335e-01, 1.50477665e-01],
       [2.00167162e-01, 7.99832838e-01],
       [2.64304710e-06, 9.99997357e-01],
       [5.48092049e-05, 9.99945191e-01],
       [2.70690822e-02, 9.72930918e-01],
       [1.45717357e-01, 8.54282643e-01],
       [2.70690822e-02, 9.72930918e-01]])

In [10]:
ath_NB.classes_

array(['No', 'Yes'], dtype='<U3')
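
The columns of the `predict_proba` output follow the order of `classes_`, so the column for 'Yes' can be looked up instead of hard-coded. A small sketch, using the class order and the first three probability rows printed above (the names `classes`, `proba` and `yes_col` are illustrative):

```python
import numpy as np

# Class order as reported by ath_NB.classes_ above.
classes = np.array(['No', 'Yes'])
# First three rows of the predict_proba output above (t1, t2, t3).
proba = np.array([[9.58686371e-01, 4.13136290e-02],
                  [8.77017219e-01, 1.22982781e-01],
                  [8.80671574e-02, 9.11932843e-01]])

# Find the column that corresponds to 'Yes' rather than assuming it is column 1.
yes_col = int(np.where(classes == 'Yes')[0][0])
p_yes = proba[:, yes_col]
print(yes_col, p_yes)
```

In the notebook itself the same lookup would use `ath_NB.classes_` and `yt_dash` directly.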

In [19]:
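# Columns of yt_dash follow ath_NB.classes_: column 0 is 'No', column 1 is 'Yes'
ath_test['P No'] = yt_dash[:,0]
ath_test['P Yes'] = yt_dash[:,1]
# Sorting by ascending 'P No' puts the athletes most likely to be selected first
ath_test.sort_values('P No')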

Athlete,Speed,Agility,P No,P Yes
t6,8.1,7.8,3e-06,0.999997
t7,7.7,5.2,5.5e-05,0.999945
t8,6.1,5.5,0.027069,0.972931
t10,6.1,5.5,0.027069,0.972931
t3,5.5,7.2,0.088067,0.911933
t9,5.5,6.0,0.145717,0.854283
t5,5.5,5.2,0.200167,0.799833
t4,3.8,8.8,0.849522,0.150478
t2,4.5,4.5,0.877017,0.122983
t1,3.3,8.2,0.958686,0.041314
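
Questions 4.1 and 4.2 can be read straight off this ranking. As a minimal sketch, the 'P Yes' values from the table above are copied into a standalone Series (the name `p_yes` is illustrative) and sorted in descending order:

```python
import pandas as pd

# 'P Yes' values copied from the table above, in test-set order t1..t10.
p_yes = pd.Series(
    [0.041314, 0.122983, 0.911933, 0.150478, 0.799833,
     0.999997, 0.999945, 0.972931, 0.854283, 0.972931],
    index=['t1', 't2', 't3', 't4', 't5', 't6', 't7', 't8', 't9', 't10'],
    name='P Yes')

# Rank from most to least likely to be selected.
ranked = p_yes.sort_values(ascending=False)
most_likely = ranked.index[0]
least_likely = ranked.index[-1]
print(most_likely, least_likely)
```

On these figures t6 is the most likely to be selected and t1 the least likely. t8 and t10 share the same probability because they have identical Speed and Agility values.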


**Q4(c)**

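When a `GaussianNB` model is trained, the fitted Gaussians are stored in two attributes: `theta_` (the per-class mean of each feature) and `var_` (the per-class variance of each feature); the class priors are kept separately in `class_prior_`. Train a `GaussianNB` model and check whether these values agree with your own estimates. 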

Hint: this code will give you the estimates you need.  
`athlete[athlete['Selected']=='No']['Agility'].mean()`  
`athlete[athlete['Selected']=='No']['Agility'].var(ddof=0)`

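The `var_` attribute contains the variance (the square of the standard deviation) rather than the standard deviation.  
The figures should agree to the displayed precision if `ddof` is set to zero in the `var` calculation; `GaussianNB` adds a very small smoothing term (controlled by `var_smoothing`) to each variance, so the match is not exact to the last digit.  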
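
The `ddof` setting matters because pandas `var` defaults to `ddof=1`, which divides by n - 1 (the sample variance), whereas `ddof=0` divides by n. A small sketch of the difference, using made-up values that are not from the athlete data:

```python
import pandas as pd

# Illustrative values only: mean is 5, sum of squared deviations is 32.
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

pop_var = s.var(ddof=0)   # divides by n = 8
sample_var = s.var()      # default ddof=1, divides by n - 1 = 7
print(pop_var, sample_var)
```

With only a handful of athletes per class, the gap between the two estimates is large enough to show up in the comparison with `var_`.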

In [13]:
ath_NB.theta_

array([[3.39583333, 5.08333333],
       [6.40625   , 6.96875   ]])

In [16]:
ath_NB.var_

array([[0.80685764, 3.99305556],
       [1.37402344, 3.91308594]])

In [25]:
print(athlete[athlete['Selected']=='No']['Agility'].mean())
print(athlete[athlete['Selected']=='Yes']['Agility'].mean())
print(athlete[athlete['Selected']=='No']['Speed'].mean())
print(athlete[athlete['Selected']=='Yes']['Speed'].mean(), end='\n\n')
print(athlete[athlete['Selected']=='No']['Agility'].var(ddof=0))
print(athlete[athlete['Selected']=='Yes']['Agility'].var(ddof=0))
print(athlete[athlete['Selected']=='No']['Speed'].var(ddof=0))
print(athlete[athlete['Selected']=='Yes']['Speed'].var(ddof=0))

5.083333333333333
6.96875
3.3958333333333335
6.40625

3.9930555555555554
3.9130859375
0.8068576388888888
1.3740234375
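
To see how these parameters produce the probabilities reported earlier, the posterior for one test athlete can be recomputed by hand. The sketch below assumes the default `GaussianNB` behaviour, in which class priors are the class frequencies in the training labels (12 'No' and 8 'Yes'). The means and variances are copied from the `theta_` and `var_` outputs above, and the variable names are illustrative.

```python
import numpy as np

# Copied from the theta_ and var_ outputs above.
# Rows are the classes ['No', 'Yes']; columns are ['Speed', 'Agility'].
theta = np.array([[3.39583333, 5.08333333],
                  [6.40625, 6.96875]])
var = np.array([[0.80685764, 3.99305556],
                [1.37402344, 3.91308594]])
# Assumed priors: class frequencies in the training labels (12 'No', 8 'Yes').
prior = np.array([12 / 20, 8 / 20])

x = np.array([3.3, 8.2])  # test athlete t1: Speed, Agility

# Log of the Gaussian density for every class/feature pair.
log_lik = -0.5 * (np.log(2 * np.pi * var) + (x - theta) ** 2 / var)
# Naive Bayes: add the log prior to the sum of per-feature log likelihoods.
joint = np.log(prior) + log_lik.sum(axis=1)

# Normalise so the two class posteriors sum to 1.
posterior = np.exp(joint - joint.max())
posterior /= posterior.sum()
print(posterior)
```

The second entry of `posterior` corresponds to 'Yes' and should come out close to the value that `predict_proba` reported for t1.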
