# Classification Algorithms

As the name suggests, classification algorithms assign items to classes based on the items' features.
The classes can be binary (yes or no) or multi-class, for example assigning hand-written digits to 10 classes (0-9).

In this workshop we will look at three traditional ML classification algorithms, with no neural networks involved. We will work with the [mlb dataset](https://github.com/matloff/regtools/blob/master/data/mlb.txt.gz) hosted in the regtools repository.



## Setup

### Environment

In [237]:
import numpy as np
import operator
import pandas as pd
import matplotlib.pyplot as plt
import math
from collections import Counter
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score



### Data Pre-processing

We will start by downloading and reshaping the data.
The [mlb](https://github.com/matloff/regtools/blob/master/data/mlb.txt.gz) dataset contains Major League Baseball player information. The code below reads it from an unpacked local file, `mlb.txt`.




In [238]:
raw_df = pd.read_csv("mlb.txt", sep=" ")
raw_df.head()

Unnamed: 0,Name,Team,Position,Height,Weight,Age,PosCategory
1,Adam_Donachie,BAL,Catcher,74,180,22.99,Catcher
2,Paul_Bako,BAL,Catcher,74,215,34.69,Catcher
3,Ramon_Hernandez,BAL,Catcher,72,210,30.78,Catcher
4,Kevin_Millar,BAL,First_Baseman,72,210,35.43,Infielder
5,Chris_Gomez,BAL,First_Baseman,73,188,35.71,Infielder


In [239]:
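def compare_df(y_true, y_pred):
    return pd.DataFrame.from_dict({
        "real": y_true,
        "predict": y_pred
    })

def accuracy(y_true, y_pred):
    # fraction of predictions that match the true labels
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))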

def splitData(df, train_frac=0.8):
    nrow = df.shape[0]
    train_len = math.floor(nrow * train_frac)
    test_len = nrow - train_len
    return df.head(train_len), df.tail(test_len)

train, test = splitData(raw_df)

In [240]:
X_train = train[['Height', 'Age', 'Weight']]
y_train = train['Position'].map(lambda x: int(x == 'Catcher'))
X_test = test[['Height', 'Age', 'Weight']]
y_test = test['Position'].map(lambda x: int(x == 'Catcher'))
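
`splitData` takes the first 80% of the rows in file order, so the test set depends on how the file happens to be sorted. As a possible alternative, here is a minimal sketch of a shuffled, stratified split using the `train_test_split` function imported earlier. The `toy_df` table and its values are made up for illustration and only mimic the columns of the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the player table (made-up rows, same column names).
toy_df = pd.DataFrame({
    "Height": [74, 74, 72, 72, 73, 69, 70, 75, 71, 76],
    "Weight": [180, 215, 210, 210, 188, 176, 190, 220, 185, 230],
    "Age": [22.99, 34.69, 30.78, 35.43, 35.71, 29.39, 27.5, 31.2, 24.8, 33.1],
    "Position": ["Catcher", "Catcher", "Catcher", "First_Baseman",
                 "First_Baseman", "Second_Baseman", "Shortstop",
                 "Outfielder", "Outfielder", "Starting_Pitcher"],
})

X = toy_df[["Height", "Age", "Weight"]]
y = (toy_df["Position"] == "Catcher").astype(int)  # 1 = catcher, 0 = other

# shuffle=True mixes the rows; stratify=y keeps the catcher ratio
# roughly equal in both parts; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.4, shuffle=True, stratify=y, random_state=0
)

print(len(X_tr), len(X_te))
print("catchers in train:", y_tr.sum(), "catchers in test:", y_te.sum())
```

Stratifying matters most when one class is rare, because an unlucky split could otherwise leave almost no positive examples in one of the two parts.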

## K Nearest Neighbors (KNN)

In this section, we will use the kNN algorithm to predict whether a player is a catcher from the player's height, age, and weight.
The key idea of kNN is to compute the distance from a new instance to every instance in the existing data, then take a majority vote among the labels of the k closest ones as the guess.

Here we use Height, Age, and Weight as features, and the label is 1 for a catcher and 0 otherwise.
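
Before wrapping this in a class, it may help to see the two steps (distances, then a vote among the k nearest) on a handful of points. This is a minimal sketch with made-up numbers; `X_toy`, `y_toy`, and `query` are hypothetical and not taken from the dataset.

```python
import numpy as np
from collections import Counter

# Hypothetical training points: (Height, Age, Weight), label 1 = catcher.
X_toy = np.array([
    [72.0, 30.0, 210.0],
    [74.0, 25.0, 215.0],
    [73.0, 28.0, 205.0],
    [70.0, 27.0, 170.0],
    [76.0, 33.0, 190.0],
])
y_toy = np.array([1, 1, 0, 0, 0])
query = np.array([73.0, 29.0, 212.0])

# Euclidean distance from the query to every training row at once.
distances = np.sqrt(((X_toy - query) ** 2).sum(axis=1))

# Indices of the k smallest distances, then a majority vote on their labels.
k = 3
nearest = np.argsort(distances)[:k]
vote = Counter(y_toy[nearest].tolist()).most_common(1)[0][0]

print("distances:", np.round(distances, 2))
print("nearest rows:", nearest, "-> predicted label:", vote)
```

Note that Weight has a much larger numeric range than Height or Age, so it dominates the Euclidean distance unless the features are rescaled first.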

In [241]:
class kNN:
    def __init__(self,k=3):
        self.k = k

    def fit(self, X, y):
        self.X_train = X.to_numpy()
        self.y_train = y.to_numpy()

        
    def predict_df(self, X):
        predicted_labels = [self._predict(x) for x in X.to_numpy()]
        return np.array(predicted_labels)

    def predict(self, X):
        return self._predict(np.array(X))

    def _predict(self, x):
        # compute the distances
        distances = [self._distance(x, x_train) for x_train in self.X_train]
        
        # get the k-nearest samples, labels
        # sort distances, return an array of indices, get only first k
        k_indices = np.argsort(distances)[:self.k]
        k_nearest_labels = [self.y_train[i] for i in k_indices]
        # majority vote, most common class label
        most_common = Counter(k_nearest_labels).most_common(1)
        
        return most_common[0][0]
    
    def _distance(self, x1, x2):
        return np.sqrt(np.sum((x1 - x2) ** 2))

In [242]:

model = kNN()
model.fit(X_train, y_train)

model.predict([100, 23, 200])

0

In [243]:
predictions = model.predict_df(X_test)
print(compare_df(y_test, predictions))
print("The accuracy of our model: ", accuracy(y_test, predictions))
print(accuracy_score(y_test,predictions))

      real  predict
831      1        0
832      1        0
833      1        0
834      0        0
835      0        0
...    ...      ...
1030     0        0
1031     0        0
1032     0        0
1033     0        0
1034     0        0

[203 rows x 2 columns]
The accuracy of our model:  0.9310344827586207
0.9310344827586207
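
Accuracy alone can look good when one class is rare, since a model that always predicts the majority class already scores well. The sketch below shows this on made-up labels (not the results above) using `confusion_matrix` and `recall_score` from scikit-learn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 2 positives among 10 samples.
y_true = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
# A trivial "model" that always predicts the majority class 0.
y_always_zero = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_zero)
# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true, y_always_zero, labels=[0, 1])
# Recall for class 1: share of true positives that were found.
rec = recall_score(y_true, y_always_zero)

print("accuracy:", acc)
print("confusion matrix:\n", cm)
print("recall for class 1:", rec)
```

Running the same three calls on `y_test` and `predictions` would show whether the model finds any catchers at all.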


## Naive Bayes

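The idea of this algorithm comes directly from conditional probability (Bayes' theorem):
$$
P(y | X) = \frac{P(X | y) \cdot P(y)}{P(X)} = \frac{P(X_1 | y) \cdots P(X_n | y) \cdot P(y)}{P(X)}
$$
where the second equality assumes that all features are independent given the class $y$.

So if we want to choose the $y$ with the highest probability, it would be
$$
y = \operatorname*{arg\,max}_y P(y | X) = \operatorname*{arg\,max}_y \frac{P(X_1 | y) \cdots P(X_n | y) \cdot P(y)}{P(X)} = \operatorname*{arg\,max}_y P(X_1 | y) \cdots P(X_n | y) \cdot P(y)
$$
since $P(X)$ does not depend on $y$. In practice we maximize the logarithm, $\log P(y) + \sum_i \log P(X_i | y)$, which has the same maximizer and avoids multiplying many small numbers.

For each $P(X_i | y)$ we assume a normal distribution, so the density is
$$
P(X_i = x_i | y) = \frac{1}{\sigma_{i,y} \sqrt{2\pi}} e^{- \frac{1}{2}(\frac{x_i - \mu_{i,y}}{\sigma_{i,y}})^2}
$$
where $\mu_{i,y}$ and $\sigma_{i,y}$ are the mean and standard deviation of feature $X_i$ among the training samples of class $y$.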
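
To make the formulas concrete, here is a small worked sketch that scores one sample against two classes in log space. The priors, means, and variances in `stats` are made-up numbers chosen for illustration, and `gaussian_log_pdf` is a helper name introduced here.

```python
import numpy as np

def gaussian_log_pdf(x, mean, var):
    # log of the normal density, computed per feature
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)

# Hypothetical per-class statistics for (Height, Age, Weight).
stats = {
    0: {"prior": 0.9,
        "mean": np.array([73.0, 28.0, 200.0]),
        "var": np.array([5.0, 18.0, 400.0])},
    1: {"prior": 0.1,
        "mean": np.array([72.0, 30.0, 205.0]),
        "var": np.array([3.0, 20.0, 250.0])},
}
x = np.array([72.0, 30.0, 205.0])

# log P(y) + sum_i log P(x_i | y) for every class
log_post = {
    c: np.log(s["prior"]) + gaussian_log_pdf(x, s["mean"], s["var"]).sum()
    for c, s in stats.items()
}
best = max(log_post, key=log_post.get)

print(log_post)
print("predicted class:", best)
```

The prior enters as `np.log(prior)`, so it is added to the log-likelihoods on the same scale. A rare class needs clearly better likelihoods to win.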

In [244]:
class NaiveBayes:
    def __init__(self):
        self._classes = []
        self._mean = {}
        self._var = {}
        self._priors = {}

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # find unique elements of y, use as classes
        self._classes = np.unique(y)

        for c in self._classes:    # for each class
            X_c = X[y == c]
            self._mean[c] = X_c.mean(axis=0)    # find mean of each column
            self._var[c] = X_c.var(axis=0)
            # frequency of this class in training sample
            self._priors[c] = X_c.shape[0]/float(n_samples)

    def predict_df(self, X):
        X = X.to_numpy()
        y_pred = [self._predict(x) for x in X]
        return y_pred
    
    def predict(self, X):
        return self._predict(np.array(X))

    def _predict(self, x):    # one sample
        posteriors = []
        for c in self._classes:
            # work in log space: log P(y) + sum_i log P(x_i | y)
            prior = np.log(self._priors[c])
            class_conditional = np.sum(np.log(self._pdf(c, x)))
            posterior = prior + class_conditional
            posteriors.append(posterior)
        return self._classes[np.argmax(posteriors)]

    def _pdf(self, class_name, x):    # probability density function
        mean = self._mean[class_name].values
        var = self._var[class_name].values

        numerator = np.exp(- (x-mean)**2/(2*var))
        denominator = np.sqrt(2*np.pi*var)
        return numerator/denominator


In [245]:
model = NaiveBayes()
model.fit(X_train, y_train)

model.predict([100, 23, 200])


0

In [246]:
predictions = model.predict_df(X_test)
print(compare_df(y_test, predictions))
print("The accuracy of our model: ", accuracy(y_test, predictions))
print(accuracy_score(y_test,predictions))

      real  predict
831      1        1
832      1        0
833      1        0
834      0        0
835      0        0
...    ...      ...
1030     0        0
1031     0        0
1032     0        0
1033     0        0
1034     0        0

[203 rows x 2 columns]
The accuracy of our model:  0.896551724137931
0.896551724137931
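
One way to sanity-check a hand-written Gaussian Naive Bayes is to compare it with scikit-learn's `GaussianNB` on the same data. The sketch below only shows the scikit-learn side on synthetic, well-separated data. The arrays `X_syn` and `y_syn` are generated here and have nothing to do with the mlb data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data: 80 points around (0, 0), 20 around (5, 5).
rng = np.random.default_rng(0)
X_syn = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(80, 2)),
    rng.normal(loc=5.0, scale=1.0, size=(20, 2)),
])
y_syn = np.array([0] * 80 + [1] * 20)

gnb = GaussianNB()
gnb.fit(X_syn, y_syn)

# class_prior_ holds the class frequencies, theta_ the per-class feature means.
print("priors:", gnb.class_prior_)
print("means:\n", gnb.theta_)
print("training accuracy:", gnb.score(X_syn, y_syn))
```

Fitting `GaussianNB` on `X_train` and `y_train` and comparing its predictions with those of the class above would be a quick consistency check.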


## Support Vector Machine (SVM)

In [247]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

clf = SVC(kernel='linear')
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test,y_pred))

0.9359605911330049
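
The hand-written implementation that follows can be read as per-sample subgradient descent on a regularized hinge loss. The formulation below is an interpretation of that code, not something stated in the source. It assumes labels $y_i \in \{-1, +1\}$, no bias term, a learning rate $\alpha$, and a regularization weight $\lambda = 1/\text{epoch}$ that shrinks as training proceeds.

$$
J_i(w) = \lambda \lVert w \rVert^2 + \max\left(0,\ 1 - y_i\, w^\top x_i\right)
$$

Taking a subgradient of $J_i$ with respect to $w$ gives the update for a single sample $(x_i, y_i)$:

$$
w \leftarrow
\begin{cases}
w - \alpha \, (2 \lambda w) & \text{if } y_i\, w^\top x_i \ge 1 \\
w + \alpha \, (y_i x_i - 2 \lambda w) & \text{otherwise}
\end{cases}
$$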


In [248]:
# Hand-written linear SVM on the first two features (Height and Age).
l = len(X_train)
train_f1 = np.array(X_train.iloc[:, 0]).reshape(l, 1)
train_f2 = np.array(X_train.iloc[:, 1]).reshape(l, 1)

# Each weight is stored as a column of l copies; every update below is
# applied to the whole column, so the copies stay identical.
w1 = np.zeros((l, 1))
w2 = np.zeros((l, 1))

# The hinge loss expects labels in {-1, +1} instead of {0, 1}.
y_train = np.array([-1 if x == 0 else x for x in y_train]).reshape(-1, 1)

alpha = 0.0001    # learning rate

for epoch in range(1, 1000):
    # Scores and margins are computed once per epoch.
    y = w1 * train_f1 + w2 * train_f2
    prod = y * y_train
    for count, val in enumerate(prod):
        if val >= 1:
            # Margin satisfied: only the regularization term (lambda = 1/epoch) acts.
            w1 = w1 - alpha * (2 * 1/epoch * w1)
            w2 = w2 - alpha * (2 * 1/epoch * w2)
        else:
            # Margin violated: also step towards classifying this sample correctly.
            w1 = w1 + alpha * (train_f1[count] * y_train[count] - 2 * 1/epoch * w1)
            w2 = w2 + alpha * (train_f2[count] * y_train[count] - 2 * 1/epoch * w2)

In [249]:

# w1 and w2 hold l identical copies of each weight; keep l_test of them
# so that they line up with the test features.
l_test = len(X_test)
w1 = w1[:l_test].reshape(l_test, 1)
w2 = w2[:l_test].reshape(l_test, 1)

# Extract the test data features (Height and Age)
test_f1 = np.array(X_test.iloc[:, 0]).reshape(l_test, 1)
test_f2 = np.array(X_test.iloc[:, 1]).reshape(l_test, 1)

# Predict: label 1 when the score exceeds 1, otherwise 0
y_pred = w1 * test_f1 + w2 * test_f2
predictions = (y_pred > 1).astype(int).ravel()
print(accuracy_score(y_test, predictions))

0.9359605911330049
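
For comparison, here is a minimal sketch of a linear SVM trained by per-sample subgradient descent with a single weight vector and a bias term. It is not the code above: the function names `fit_linear_svm` and `predict_linear_svm`, the fixed regularization weight `lam`, the bias update, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_linear_svm(X, y, lam=0.01, alpha=0.01, n_epochs=200):
    """Train on labels y in {-1, +1}; returns (w, b)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            margin = y_i * (x_i @ w + b)
            if margin >= 1:
                # margin satisfied: only the regularizer pulls on w
                w -= alpha * (2 * lam * w)
            else:
                # margin violated: hinge-loss subgradient plus regularizer
                w -= alpha * (2 * lam * w - y_i * x_i)
                b += alpha * y_i
    return w, b

def predict_linear_svm(X, w, b):
    # the sign of the decision function gives the class in {-1, +1}
    return np.where(X @ w + b >= 0, 1, -1)

# Synthetic, linearly separable data: two clusters around (-2, -2) and (2, 2).
rng = np.random.default_rng(0)
X_syn = np.vstack([
    rng.normal(loc=-2.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=2.0, scale=0.5, size=(50, 2)),
])
y_syn = np.array([-1] * 50 + [1] * 50)

w, b = fit_linear_svm(X_syn, y_syn)
acc = np.mean(predict_linear_svm(X_syn, w, b) == y_syn)
print("weights:", w, "bias:", b, "training accuracy:", acc)
```

Here the prediction threshold is 0 on the decision function, which is the usual convention for a linear SVM. Real features such as Height and Age would typically be standardized before training.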
