# ML Classification
- Learn a model that can clssify which type of given pokemon belongs to based on its 6 attributes: (1) HP (2) Attack (3) Defense (4) Special Attack (5) Special Defense (6) Speed.
- **Generative Model** is used in this project.
- This is just a **Binary Classification case** which can only classify grass or fire type.

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

In [2]:
data = pd.read_csv('/Users/yuwenchen/Pokemon-Classifier/Pokemon.csv')
data = pd.DataFrame(data)
data.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


In [3]:
df = data[['Type 1', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']]

# extract all grass type pokemon 
grassType = df.loc[df['Type 1'] == 'Grass']
# extract all fire type pokemon
fireType = df.loc[df['Type 1'] == 'Fire']

# only derive 80% grass type pokemon as training set(35)
grass_train = grassType.sample(frac=0.8, replace=False, random_state=1)
# only derive 80% fire type pokemon as training set(49)
fire_train = fireType.sample(frac=0.8, replace=False, random_state=1)

# the 50% rest is testing set
grass_test = grassType[~grassType.isin(grass_train)].dropna()
fire_test = fireType[~fireType.isin(fire_train)].dropna()
all_test = pd.concat([grass_test, fire_test], ignore_index=True, sort=False)

## Classification

- <h4>In this project, class1 is grass type, class2 is fire type.</h4>
<br>
- <h3>Bayes' theorem</h3>
<p>The way to do the classfication is using <strong>Bayes' theorem</strong>. For example, given input x the probability of it belonging to class 1  is P(C1|x), which is:<br><br>
    $$
    P(C_1|x)=\frac{P(C_1)P(x|C_1)}{P(C_1)P(x|C_1)+P(C_2)P(x|C_2)}
    $$<br><br>    
    We can caculate <strong>P(x|C1)</strong> and <strong>P(x|C2)</strong> by using <strong>Multivariate Gaussian Distrubution</strong>.
</p>
<br>
- <h3>Multivariate Gaussian Distrubution</h3>
<p>Assume the training set is sampled from a Multivariate Gaussian Distrubution.<br><br>
    Once we get the mean (μ) and corvariance (∑) of the Gausssian, we can use this model (Multivariate Gausssian Distrubution) to caculate the probability of given sample belonging C1 or C2.<br><br>
    Following is the formular of Multivariate Gaussian Distrubution:<br><br>
$$
f_{u,\Sigma}(x)=\frac{1}{(2\pi)^{\frac{D}{2}}}\frac{1}{|\Sigma|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-u)^T\Sigma^{-1}(x-u)}
$$<br><br>
Since the likelihoods of each sample are differnt, we use <strong>Maximum Likelihood</strong> to find a optimal likelihood for all the sample.
</p>

- <h3>Maximum Likelihood</h3>
<p>The mean(μ*) and corvariance(∑*) can generate the highest probability for all the sample.<br><br>
    The way we find μ* and ∑* is using following formulars: (Notice that you could use differential fo find μ* and ∑* but I did not)<br><br>
    $$
    w = {sample \ amount}
    $$
   <br>
    $$
    u^*,\Sigma^*=\arg \max\limits_{u,\Sigma} L(u,\Sigma)
    $$
   <br>
    $$
    u^*=\frac{1}{w} \sum_{n=1}^{w} \ x^n
    $$
   <br>
    $$
    \Sigma^*=\frac{1}{w}\sum_{n=1}^{w}(x^n-u^*)(x^n-u^*)^T
    $$
</p>

- <h3>Therefore, P(x|C1) is:</h3>
<p>$$
f_{u^*,\Sigma^*}(x)=\frac{1}{(2\pi)^{\frac{D}{2}}}\frac{1}{|\Sigma^*|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-u^*)^T\Sigma^{-1}(x-u^*)}
$$
</p>

In [4]:
# amount of training set for each
num_g = grass_train.shape[0]
num_f = fire_train.shape[0]

# transform all the attributes into list form
grass_list = grass_train.values.T.tolist()
fire_list = fire_train.values.T.tolist()

#### Caculate P(C1) and P(C2) for Bayes' theorem

In [5]:
# 35 grass type pokemon, 49 fire type pokemon
p_c1 = num_g/(num_g + num_f)
p_c2 = num_f/(num_g + num_f)

#### Caculate P(x|C1) and P(x|C2)

In [6]:
# function that find the optimal Gaussian Distribution by caculating Maximum Likelihood.
'''
Type 1  : 0
HP      : 1
Attack  : 2
Defense : 3
Sp. Atk : 4
Sp. Def : 5
Speed   : 6
'''
def getMeanCor(pokeList:list, length:int):
    
    # caculate the μ*
    u = 0
    for i in range (length):
        x_n = np.array([[pokeList[1][i]],[pokeList[2][i]],[pokeList[3][i]],[pokeList[4][i]],[pokeList[5][i]],[pokeList[6][i]]])
        u = u + x_n

    u_max = u/num_g
    u_max = u_max.astype(int)

    # caculate the ∑*
    sig = 0
    for i in range (length):
        # current
        x_n = np.array([[pokeList[1][i]],[pokeList[2][i]],[pokeList[3][i]],[pokeList[4][i]],[pokeList[5][i]],[pokeList[6][i]]])
        left = x_n - u_max
        right = left.T
        sig = sig + left * right

    sig_max = sig/num_g
    sig_max = sig_max.astype(int)
    
    u_max_flat = list(np.concatenate(u_max).flat)    
    
    return u_max_flat, sig_max

In [7]:
# optiaml Gaussian Distrubution
# class1: grass
meanC1, sigC1 = getMeanCor(grass_list, num_g)
# class2: fire
meanC2, sigC2 = getMeanCor(fire_list, num_f)

# let the two distributions share the same mean
sum_sig = (num_g / (num_g + num_f))*sigC1 + (num_f / (num_g + num_f))*sigC2

In [8]:
sigC1

array([[390, 333, 228, 299, 161, 139],
       [333, 600, 322, 372, 140, 197],
       [228, 322, 579, 118, 242, 125],
       [299, 372, 118, 666, 195, 309],
       [161, 140, 242, 195, 369, 193],
       [139, 197, 125, 309, 193, 777]])

In [9]:
sigC2

array([[ 509,  538,  418,  503,  450,  418],
       [ 538,  975,  482,  627,  520,  533],
       [ 418,  482,  701,  680,  542,  255],
       [ 503,  627,  680, 1131,  746,  467],
       [ 450,  520,  542,  746,  687,  406],
       [ 418,  533,  255,  467,  406,  798]])

In [10]:
p_x_C1 = multivariate_normal(mean=meanC1, cov=sum_sig)
p_x_C2 = multivariate_normal(mean=meanC2, cov=sum_sig)

## Testing Phase
<p>Having all the parameters for Bayes' theorem now, it will give us the probabilty of being which class once we give it a input.<br><br>
If the probability is greater than 0.5 then it belongs to certain category.<br><br>
Use the accuracy formula:
$$
accuracy=\frac{TP+TN}{TP+TN+FP+FN}
$$
</p>

In [11]:
# transform testing set from datafram to list
test = all_test.values.T.tolist()

In [12]:
# the length of the testing set
tlen = all_test.shape[0]

In [13]:
#  TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.
TP = TN = FP = FN = 0

In [14]:
# if the output > 0.6 then it belongs to class1, else class2

for i in range(tlen):
    temp = [int(test[1][i]) ,int(test[2][i]), int(test[3][i]), int(test[4][i]), int(test[5][i]), int(test[6][i])]
    output =  p_x_C1.pdf(temp)*p_c1 / (p_x_C1.pdf(temp)*p_c1 + p_x_C2.pdf(temp)*p_c2)
    
    if output > 0.5: # the prediction is grass
        if test[0][i] == 'Grass': # if it is actually grass type
            TP += 1
        else: # if it is actually NOT Grass type
            TN += 1
    else: # the prediction is fire
        if test[0][i] == 'Fire': # if it is actually fire type
            FP += 1
        else: # if it is actually NOT fire type
            FN += 1

In [15]:
# caculate the accuracy
accuracy = (TP+TN) / (TP+TN+FP+FN)
accuracy

0.75