<div style="float:right; padding-top: 15px; padding-right: 15px">
    <div>
        <a href="https://whiteboxml.com">
            <img src="https://whiteboxml.com/static/img/logo/black_bg_white.svg" width="250">
        </a>
    </div>
</div>

# Supervised Learning: Classification

In classification, the label column is discrete: the outcome takes one of two or more possible values, called classes.

The figure below contrasts classification with regression, showing what the outcome looks like in each case and what we are trying to achieve. In classification, the outcome takes one of a small set of discrete values, so we are trying to find the boundary between the classes. In regression, the outcome is continuous, so we are trying to find the line (not necessarily straight!) that best follows the shape of our data.

![Classification vs. regression](img/classification_vs_regression.png)

Classification problems can be grouped into:
- **binary problems:** is this tumor cancerous or not?
- **multi-class problems:** what type of animal is this?
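
A quick way to tell which group a dataset belongs to is to count the distinct values in its label column. The sketch below does this for the iris dataset used in the next section; the names `iris`, `n_classes` and `problem_type` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Distinct label values and how many samples carry each one.
classes, counts = np.unique(iris.target, return_counts=True)

# Two distinct labels means a binary problem, more than two means multi-class.
n_classes = len(classes)
problem_type = "binary" if n_classes == 2 else "multi-class"

print(problem_type)
for name, count in zip(iris.target_names, counts):
    print(f"{name}: {count} samples")
```

Iris has three species, so it falls into the multi-class group.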

## 1. Dataset

In [1]:
import numpy as np

from sklearn.datasets import load_iris

data = load_iris()

data

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

In [2]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(112, 4)
(38, 4)
(112,)
(38,)
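
The split above is random, so the rows that land in the test set (and every score computed later) will change from run to run. A minimal sketch of a reproducible, class-balanced variant follows, assuming a fixed seed is acceptable; `random_state=42` and the `*_s` variable names are arbitrary choices for illustration, not something the original notebook uses.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()

# random_state fixes the shuffle; stratify keeps the class proportions
# of data.target roughly equal in the train and test parts.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    data.data,
    data.target,
    test_size=0.25,
    random_state=42,
    stratify=data.target,
)

print(X_train_s.shape, X_test_s.shape)
print("train class counts:", np.bincount(y_train_s))
print("test class counts: ", np.bincount(y_test_s))
```

Stratifying matters most for small or imbalanced datasets, where a purely random split can leave a class under-represented in the test set.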


## 2. Algorithms

For restless souls: <a href="http://themlbook.com/">Andriy Burkov - The Hundred-Page Machine Learning Book</a>

### 2.1 Logistic Regression

In [3]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(X_train, y_train)

LogisticRegression()

In [4]:
lr.predict(X_test)

array([1, 2, 2, 1, 1, 1, 0, 0, 2, 2, 0, 1, 2, 2, 0, 1, 1, 1, 0, 2, 0, 0,
       1, 0, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 1, 0, 0, 2])

In [5]:
lr.predict_proba(X_test)

array([[3.26081373e-03, 7.66154282e-01, 2.30584904e-01],
       [1.91232742e-07, 7.06423701e-03, 9.92935572e-01],
       [1.80596632e-03, 4.32564561e-01, 5.65629473e-01],
       [8.64439779e-02, 9.03052856e-01, 1.05031658e-02],
       [3.87044155e-03, 7.68270458e-01, 2.27859101e-01],
       [4.47024589e-03, 8.57365386e-01, 1.38164368e-01],
       [9.79783898e-01, 2.02159189e-02, 1.82834162e-07],
       [9.57406545e-01, 4.25923920e-02, 1.06324889e-06],
       [1.53429896e-04, 1.48997525e-01, 8.50849045e-01],
       [9.97278324e-06, 2.33878846e-02, 9.76602143e-01],
       [9.76180971e-01, 2.38189158e-02, 1.13096817e-07],
       [8.66626520e-03, 8.09949699e-01, 1.81384036e-01],
       [1.34739450e-05, 2.95569600e-02, 9.70429566e-01],
       [1.16113192e-07, 7.24753847e-03, 9.92752345e-01],
       [9.83607964e-01, 1.63919899e-02, 4.62098507e-08],
       [6.07483517e-02, 8.73281940e-01, 6.59697081e-02],
       [2.88957222e-02, 9.05205038e-01, 6.58992393e-02],
       [1.08834371e-02, 8.20994
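
Each row of `predict_proba` holds one probability per class, and the label returned by `predict` is the class with the highest probability. The sketch below checks both properties on a fresh split; the fixed seed, `max_iter=1000` (raised only to avoid a possible convergence warning) and the names `clf` and `labels_from_proba` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.25, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)

# Each row is a probability distribution over the classes.
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)

# Picking the most probable class per row reproduces predict().
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
matches_predict = np.array_equal(labels_from_proba, clf.predict(X_te))

print(rows_sum_to_one, matches_predict)
```

The probabilities are useful when a prediction needs a confidence attached, for example to flag samples where the top two classes are close.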

### 2.2 Random Forest

In [6]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()

rf.fit(X_train, y_train)

RandomForestClassifier()

In [8]:
from sklearn.metrics import accuracy_score

accuracy_score(y_true=y_test, y_pred=rf.predict(X_test))

0.9736842105263158
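
Accuracy is the fraction of test samples whose predicted label equals the true label. The score above, 0.9736842105263158, is 37/38, which means one of the 38 test samples was misclassified. A minimal sketch of the same computation done by hand, using a fixed seed and illustrative names (`forest`, `manual_accuracy`, `library_accuracy`) that are not in the original notebook:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.25, random_state=0
)

forest = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = forest.predict(X_te)

# Fraction of positions where prediction and truth agree.
manual_accuracy = np.mean(y_pred == y_te)
library_accuracy = accuracy_score(y_true=y_te, y_pred=y_pred)

print(manual_accuracy, library_accuracy)
```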

### 2.3 SVM

In [11]:
from sklearn.svm import SVC

svm = SVC()

svm.fit(X_train, y_train)

SVC()

In [12]:
accuracy_score(y_true=y_test, y_pred=svm.predict(X_test))

0.9736842105263158

### 2.4 K-Nearest Neighbours (kNN)

In [13]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

knn.fit(X_train, y_train)

KNeighborsClassifier()

In [14]:
accuracy_score(y_true=y_test, y_pred=knn.predict(X_test))

1.0
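
With only 38 test samples, a single prediction changes accuracy by about 2.6 percentage points, so the differences between the models above may simply reflect which rows ended up in the test set. One common way to get a steadier comparison is k-fold cross-validation. The sketch below assumes 5 folds and default hyperparameters (plus `max_iter=1000` for logistic regression and a fixed seed for the forest); the `models` and `cv_scores` dictionaries are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

data = load_iris()

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
    "svm": SVC(),
    "knn": KNeighborsClassifier(),
}

cv_scores = {}
for name, model in models.items():
    # Each model is trained and scored 5 times, each time on a different fold.
    scores = cross_val_score(model, data.data, data.target, cv=5)
    cv_scores[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Comparing the mean together with the spread across folds gives a better sense of whether one model is really ahead on this dataset.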

<div style="padding-top: 25px; float: right">
    <div>    
        <i>&nbsp;&nbsp;© Copyright by</i>
    </div>
    <div>
        <a href="https://whiteboxml.com">
            <img src="https://whiteboxml.com/static/img/logo/black_bg_white.svg" width="125">
        </a>
    </div>
</div>