# Class 08: Supervised Machine Learning - Classification




In [2]:
import statistics
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from urllib.request import urlopen

import matplotlib.pyplot as plt
%matplotlib inline

## 1. Machine Learning:  Features (X) and labels (y)

In supervised machine learning, we use a computer algorithm called a "pattern classifier" to learn the relationship between a set of features X and a label y. When the classifier is later given the features X of new examples, it can predict their labels y.


In [4]:
penguins = sns.load_dataset("penguins")

penguins = penguins.dropna()

penguins = penguins.sample(frac = 1)

penguins.head()

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
125,Adelie,Torgersen,40.6,19.0,199.0,4000.0,Male
64,Adelie,Biscoe,36.4,17.1,184.0,2850.0,Female
98,Adelie,Dream,33.1,16.1,178.0,2900.0,Female
230,Gentoo,Biscoe,40.9,13.7,214.0,4650.0,Female
34,Adelie,Dream,36.4,17.0,195.0,3325.0,Female


In [5]:
# Count how many penguins of each species there are in our data set
species_counts = penguins.groupby("species").agg(count = ("species","count"))
species_counts



species,count
Adelie,146
Chinstrap,68
Gentoo,119


#### Questions: 

1. If we had to guess the species of a penguin without knowing any of the penguin's features, which species should we guess?
A: Always guess Adelie


2. If we were to follow this optimal guessing strategy, what percent of our guesses would be correct (i.e., what would our classification accuracy be)?


In [7]:
# get proportion that are a particular species
species_counts["count"]/sum(species_counts["count"])




species
Adelie       0.438438
Chinstrap    0.204204
Gentoo       0.357357
Name: count, dtype: float64
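
As a quick check on this baseline, the short sketch below rebuilds the label column from the species counts shown above (146, 68, 119) rather than loading the penguins data, and computes the accuracy of always guessing the most common species. The names `species_labels`, `species_proportions`, `majority_class`, and `baseline_accuracy` are illustrative and are not used elsewhere in this notebook.

```python
import pandas as pd

# Rebuild the labels from the counts shown above (assumption: we use the
# counts directly instead of loading the penguins data set)
species_labels = pd.Series(["Adelie"] * 146 + ["Chinstrap"] * 68 + ["Gentoo"] * 119)

# Proportion of the data belonging to each species
species_proportions = species_labels.value_counts(normalize=True)

# With no features available, the best strategy is to always guess
# the most common species
majority_class = species_proportions.idxmax()
baseline_accuracy = species_proportions.max()

print(majority_class, round(baseline_accuracy, 3))
```

Any classifier that uses the features should be compared against this baseline of roughly 44% accuracy.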

To begin the classification process, let's store the features (X) and the labels (y) in separate names called `X_penguin_features` and `y_penguin_labels` respectively. 

In [9]:
# get the features and the labels
X_penguin_features = penguins[["bill_length_mm","bill_depth_mm","flipper_length_mm","body_mass_g"]]
y_penguin_labels = penguins["species"]



print(X_penguin_features.head())
print(y_penguin_labels.head())

     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
125            40.6           19.0              199.0       4000.0
64             36.4           17.1              184.0       2850.0
98             33.1           16.1              178.0       2900.0
230            40.9           13.7              214.0       4650.0
34             36.4           17.0              195.0       3325.0
125    Adelie
64     Adelie
98     Adelie
230    Gentoo
34     Adelie
Name: species, dtype: object


In [10]:
type(y_penguin_labels)

pandas.core.series.Series

## 2. k-Nearest Neighbors classifier

To explore classification, let's use a k-Nearest Neighbors classifier to predict the species of a penguin based on particular features the penguin has such as the penguin's bill length and body mass. 

Let's construct a k-Nearest Neighbors (KNN) classifier that uses 5 neighbors to make predictions (i.e., k = 5, so we are using a 5-Nearest Neighbors classifier).

We can do this by creating a `KNeighborsClassifier` object and setting the number of neighbors with the `n_neighbors` argument.



In [12]:
from sklearn.neighbors import KNeighborsClassifier

# Construct a 5-nearest-neighbor classifier
knn = KNeighborsClassifier(n_neighbors = 5)


Let's now train the classifier (the KNN classifier just stores the data during training)


In [14]:
# “train” the classifier (which for a KNN classifier just involves memorizing the training data)

knn.fit(X_penguin_features,y_penguin_labels)
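
To make "memorizing the training data" more concrete, here is a minimal from-scratch sketch of the prediction step, assuming Euclidean distance and a simple majority vote among the k closest training points. The function `knn_predict` and the toy arrays are hypothetical and written only for illustration. This is not how scikit-learn implements the classifier internally, and ties in the vote are broken arbitrarily here.

```python
import numpy as np
from collections import Counter


def knn_predict(X_train, y_train, X_new, k=5):
    """Predict a label for each row of X_new by majority vote of its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_new = np.asarray(X_new, dtype=float)

    predictions = []
    for x in X_new:
        # Euclidean distance from x to every stored training example
        distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # Indices of the k closest training examples
        nearest = np.argsort(distances)[:k]
        # Majority vote among the labels of those neighbors
        votes = Counter(y_train[nearest])
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)


# Toy data (hypothetical): two well-separated groups of points
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array(["A", "A", "A", "B", "B", "B"])

toy_predictions = knn_predict(X_toy, y_toy, [[0.05, 0.1], [5.0, 5.1]], k=3)
print(toy_predictions)
```

Note that there is no separate "learning" step in this sketch: all of the work happens at prediction time, using the stored training examples.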


Let's now use the classifier to make predictions

In [51]:
# make predictions
predictions = knn.predict(X_penguin_features)
predictions[0:5]



array(['Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Adelie'], dtype=object)

Let's get the prediction accuracy (classification accuracy), which is the proportion of predictions that are correct.

In [54]:
# get the classification accuracy
np.mean(predictions == y_penguin_labels)



0.8378378378378378

Let's repeat our analysis with k = 1 to see what happens...

In [58]:
# What happens if k = 1?
# construct a classifier
knn = KNeighborsClassifier(n_neighbors = 1)


# “train” the classifier (which for a KNN classifier just involves memorizing the training data)
knn.fit(X_penguin_features,y_penguin_labels)


# make predictions
predictions = knn.predict(X_penguin_features)


# get classification accuracy
np.mean(predictions == y_penguin_labels)


1.0

Do we believe we have a perfect classifier???


## 3. Cross-validation

To detect over-fitting and get an honest estimate of how well the classifier works on new data, we need to split our data into a training set and a test set.

The classifier "learns" the relationship between features (X) and labels (y) on the **training set**.

The classifier makes predictions on the features (X) of the **test set**. 

We compare the classifier's predictions on the test features (X) to the actual test labels (y) to get a more accurate assessment of the **classification accuracy**.


Let's try this now...



In [64]:
# manually create a training set with 250 examples, and a test set that has the rest of the data
X_manual_train = X_penguin_features[0:250]
y_manual_train = y_penguin_labels[0:250]

X_manual_test = X_penguin_features[250:]
y_manual_test = y_penguin_labels[250:]

print(X_manual_train.shape,X_manual_test.shape)

(250, 4) (83, 4)


In [66]:
from sklearn.model_selection import train_test_split

# split data into a training and test set
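
One possible way to do this split with `train_test_split()` is sketched below. So that it runs on its own, the sketch uses synthetic stand-in data from `make_blobs` (an assumption; 333 examples, 4 features, and 3 classes, to mirror the shape of the penguin data) and illustrative variable names ending in `_demo`.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 333 examples, 4 features, 3 classes
X_demo, y_demo = make_blobs(n_samples=333, centers=3, n_features=4, random_state=0)

# test_size=83 holds out 83 examples for testing; the remaining 250 are used
# for training. random_state only makes the random shuffle reproducible.
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=83, random_state=0
)

print(X_train_demo.shape, X_test_demo.shape)
```

In this notebook, the same call could be applied to `X_penguin_features` and `y_penguin_labels`. Unlike the manual slicing above, `train_test_split()` shuffles the rows itself by default, so it does not rely on the data having been shuffled beforehand.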








In [68]:
from sklearn.neighbors import KNeighborsClassifier


# construct a classifier
knn = KNeighborsClassifier(n_neighbors = 1)


# “train” the classifier (which for a KNN classifier just involves memorizing the training data)
knn.fit(X_manual_train,y_manual_train)



In [70]:
# get the predictions

predictions = knn.predict(X_manual_test)



In [72]:
# Get the prediction accuracy 

np.mean(predictions == y_manual_test)


0.891566265060241

In [28]:
# Test the classifier on the test set using the .score() method

# prediction accuracy on the test set
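
A hedged sketch of what this could look like is shown below. It uses synthetic stand-in data from `make_blobs` (an assumption, so that the example runs on its own) and illustrative `_demo` names, and checks that `.score()` gives the same number as computing the proportion of correct predictions by hand.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 333 examples, 4 features, 3 classes
X_demo, y_demo = make_blobs(n_samples=333, centers=3, n_features=4, random_state=0)

# Manual split: first 250 rows for training, the rest for testing
X_demo_train, y_demo_train = X_demo[:250], y_demo[:250]
X_demo_test, y_demo_test = X_demo[250:], y_demo[250:]

knn_demo = KNeighborsClassifier(n_neighbors=1)
knn_demo.fit(X_demo_train, y_demo_train)

# .score() makes predictions on the test features and returns the
# proportion that match the test labels
score_accuracy = knn_demo.score(X_demo_test, y_demo_test)

# The same quantity computed by hand
manual_accuracy = np.mean(knn_demo.predict(X_demo_test) == y_demo_test)

print(score_accuracy, manual_accuracy)
```

With this notebook's own variables, the equivalent call would be `knn.score(X_manual_test, y_manual_test)`.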





In [80]:
# What happens if we test the classifier on the training set? 
predictions = knn.predict(X_manual_train)
np.mean(predictions == y_manual_train)



1.0
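
A perfect score on the training set is expected for a 1-Nearest Neighbor classifier: each training point is its own nearest neighbor (at distance 0), so the classifier simply looks up that point's stored label. The sketch below illustrates this with random features and random labels (synthetic data, an assumption made for illustration), where there is nothing real to learn. The names `X_random`, `y_random`, and `knn_random` are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Random features and random labels: the labels are unrelated to the features
X_random = rng.normal(size=(300, 4))
y_random = rng.integers(0, 3, size=300)

# First 200 rows for training, remaining 100 for testing
X_rand_train, y_rand_train = X_random[:200], y_random[:200]
X_rand_test, y_rand_test = X_random[200:], y_random[200:]

knn_random = KNeighborsClassifier(n_neighbors=1)
knn_random.fit(X_rand_train, y_rand_train)

# Accuracy on the data the classifier memorized vs. on data it has not seen
train_accuracy = knn_random.score(X_rand_train, y_rand_train)
test_accuracy = knn_random.score(X_rand_test, y_rand_test)

print(train_accuracy, test_accuracy)
```

The training accuracy is perfect even though the labels are pure noise, while the test accuracy stays near chance. This is why accuracy measured on the training set tells us little about how the classifier will do on new data.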

### K-fold cross-validation

In k-fold cross-validation, we split our data into k parts. (Note that the k here has no relation to the k in k-Nearest Neighbors; k is simply a letter commonly used in math to denote an integer.)

To run a k-fold cross-validation analysis, we train the classifier on k-1 parts of the data and test it on the remaining part. We repeat this process k times to get k classification accuracies. We then take the average of these results as our estimate of our overall classification accuracy. 
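
The procedure just described can be sketched by hand as follows, assuming synthetic stand-in data from `make_blobs` and scikit-learn's `KFold` splitter to generate the k train/test partitions. The variable names (`kfold`, `fold_accuracies`, `knn_fold`) are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 333 examples, 4 features, 3 classes
X_demo, y_demo = make_blobs(n_samples=333, centers=3, n_features=4, random_state=0)

# Split the row indices into 5 parts (folds)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_accuracies = []
for train_index, test_index in kfold.split(X_demo):
    # Train on the 4 parts in train_index...
    knn_fold = KNeighborsClassifier(n_neighbors=5)
    knn_fold.fit(X_demo[train_index], y_demo[train_index])
    # ...and test on the 1 remaining part
    fold_accuracies.append(knn_fold.score(X_demo[test_index], y_demo[test_index]))

# Average the 5 accuracies to get the overall estimate
mean_accuracy = np.mean(fold_accuracies)
print(fold_accuracies, mean_accuracy)
```

Every example is used for testing exactly once, and never in the same round in which it was used for training.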

We can use the scikit-learn `cross_val_score()` function to do this easily...


In [31]:
from sklearn.model_selection import cross_val_score


# construct knn classifier



# do 5-fold cross-validation
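
One possible way to fill in these two steps is sketched below, again on synthetic stand-in data from `make_blobs` (an assumption, so that the example runs on its own) and with illustrative names `knn_cv` and `cv_scores`.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 333 examples, 4 features, 3 classes
X_demo, y_demo = make_blobs(n_samples=333, centers=3, n_features=4, random_state=0)

# Construct the KNN classifier (it does not need to be fit beforehand)
knn_cv = KNeighborsClassifier(n_neighbors=5)

# cv=5 requests 5-fold cross-validation; one accuracy is returned per fold
cv_scores = cross_val_score(knn_cv, X_demo, y_demo, cv=5)

print(cv_scores)
print(cv_scores.mean())
```

In this notebook, the same call could be made with `X_penguin_features` and `y_penguin_labels`. When given a classifier and an integer `cv`, `cross_val_score()` uses stratified folds, so each fold keeps roughly the same proportion of each class.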




