# November 18th, 2021

In [1]:
import pandas as pd
df = pd.DataFrame()
df["height(cm)"] = [158,163,165,168]
df["weight(kg)"] = [58,61,61,66]
df["t-shirt size"] = ["M","M","L","L"]

print(df)

   height(cm)  weight(kg) t-shirt size
0         158          58            M
1         163          61            M
2         165          61            L
3         168          66            L


# Intro to ML (Machine Learning)
* Supervised learning: labeled data (i.e. there is an attribute (AKA feature) whose value you want to predict for unseen instances)
    * The attribute is often called the "class" or the "class label"
    * If the attribute is categorical, the task is classification
    * If the attribute is numeric, the task is regression
    * Example algorithm we are using today: kNN (k nearest neighbors)
* Unsupervised learning: unlabeled data
    * Example algorithm: k-means clustering
             

## Supervised Learning 
* Need a way to divide a dataset into a training set and a testing set
    * The training set is used to build/train an algorithm/model
    * The testing set is used to evaluate the algorithm/model
    * The training set and the testing set *are different*
* Example
    * We have this super tiny t-shirt sizes dataset
        * 4 instances
        * 3 attributes (1 is the class)
        * Goal is to use the height and weight attributes to predict the t-shirt size
        * We will do this for a test set with a single unseen instance
            * height=161 weight=63 t-shirt=?
            * Let's say the "ground truth value" is M (Medium)

## kNN Algorithm
* Identify the k nearest neighbors in the training set to a test set instance
    * The most frequently occurring class label among the k nearest neighbors is the predicted class label for the unseen instance
* We need a way to measure "nearness" AKA "closeness"
    * 2D: Pythagorean theorem
    * ND: Euclidean distance formula: $dist(a,b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i) ^2}$
* We need to normalize (AKA scale) our attributes so that one attribute isn't unintentionally weighted more than another (e.g. height has a larger scale than weight, so it would dominate the distance formula)
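
The distance formula and the normalization step above can be sketched in plain Python. This is a minimal illustration, not the notebook's own code: the helper names `euclidean_distance` and `min_max_scale` are made up here, and the scaling assumes no column is constant (otherwise the denominator would be zero).

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b)))

def min_max_scale(rows):
    """Rescale each column to [0, 1]; also return the column mins and maxs.

    Assumes every column has at least two distinct values.
    """
    mins = [min(col) for col in zip(*rows)]
    maxs = [max(col) for col in zip(*rows)]
    scaled = [
        [(v - lo) / (hi - lo) for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]
    return scaled, mins, maxs

# height(cm), weight(kg) from the t-shirt dataset, plus the unseen instance
train = [[158, 58], [163, 61], [165, 61], [168, 66]]
test = [161, 63]

# Distances on the raw attributes
raw_distances = [euclidean_distance(row, test) for row in train]

# Distances after scaling; the test instance reuses the training mins and maxs
scaled_train, mins, maxs = min_max_scale(train)
scaled_test = [(v - lo) / (hi - lo) for v, lo, hi in zip(test, mins, maxs)]
scaled_distances = [euclidean_distance(row, scaled_test) for row in scaled_train]

print(raw_distances)
print(scaled_distances)
```

Note that the test instance is scaled with the minimum and maximum of the training data, never with its own values.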

In [2]:
X_train = df.drop("t-shirt size",axis=1)
print(X_train)
y_train = df["t-shirt size"]
print(y_train)
X_test = [[161,63]]
print(X_test)

# Step 1: Normalize the x data
# Step 2: Compute the distances to each unseen instance in the test set
# Step 3: Apply majority voting to the k=(3) closest distance labels


   height(cm)  weight(kg)
0         158          58
1         163          61
2         165          61
3         168          66
0    M
1    M
2    L
3    L
Name: t-shirt size, dtype: object
[[161, 63]]


## Basic k-NN Algorithm
```
# Input: list of rows, number of attributes (n, where the nth is the label), instance to classify, k
def kNN_classifier(training_set, n, instance, k):
    row_distances = []
    for row in training_set:
        d = distance(row, instance, n - 1)
        row_distances.append([d, row])
    top_k_rows = get_top_k(row_distances, k)
    label = select_class_label(top_k_rows)
    return label
```
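
The pseudocode above leaves `distance`, `get_top_k`, and `select_class_label` undefined. One possible way to fill them in is sketched below; the helper bodies are assumptions for illustration (Euclidean distance, sort by distance, majority vote), and the training rows are the t-shirt data after min-max scaling, hard-coded here to keep the example short.

```python
import math
from collections import Counter

def distance(row, instance, n_features):
    """Euclidean distance over the first n_features values."""
    return math.sqrt(sum((row[i] - instance[i]) ** 2 for i in range(n_features)))

def get_top_k(row_distances, k):
    """Return the k [distance, row] pairs with the smallest distances."""
    return sorted(row_distances, key=lambda pair: pair[0])[:k]

def select_class_label(top_k_rows):
    """Majority vote over the labels (the last value of each row)."""
    labels = [row[-1] for _, row in top_k_rows]
    return Counter(labels).most_common(1)[0][0]

def kNN_classifier(training_set, n, instance, k):
    row_distances = []
    for row in training_set:
        d = distance(row, instance, n - 1)  # n - 1 features; the nth value is the label
        row_distances.append([d, row])
    top_k_rows = get_top_k(row_distances, k)
    return select_class_label(top_k_rows)

# Min-max scaled height, weight, and label for the four t-shirt instances
training_set = [
    [0.0, 0.0, "M"],
    [0.5, 0.375, "M"],
    [0.7, 0.375, "L"],
    [1.0, 1.0, "L"],
]
# height=161, weight=63 scaled with the training mins and maxs
instance = [0.3, 0.625]

prediction = kNN_classifier(training_set, 3, instance, 3)
print(prediction)  # M
```

With k = 3 the nearest neighbors are two M rows and one L row, so the vote agrees with the "ground truth value" of M. An even k can produce ties; this sketch breaks them by whichever label was seen first, which is an arbitrary choice.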

## kNN Example
Example adapted from [this kNN example](https://people.revoledu.com/kardi/tutorial/KNN/KNN_Numerical-example.html)

Suppose we have the following dataset that has two attributes (acid durability and strength) and a class attribute (whether a special paper tissue is good or not):

|Acid durability (seconds)|Strength (kg/square meter)|Classification|
|-|-|-|
|7|7|Bad|
|7|4|Bad|
|3|4|Good|
|1|4| Good|

Now the factory produces a new paper tissue with acid durability = 3 seconds and strength = 7 kg/square meter. Can we predict what the classification of this new tissue is? Use kNN with $k$ = 3. 

### Make a Prediction Manually
Steps:
1. Normalize
1. Compute distance of each training instance to the test instance
1. Determine the majority classification of the $k$ closest instances... this is your prediction for the test instance

After normalization:

|Acid durability (seconds)|Strength (kg/square meter)|Classification|
|-|-|-|
|1|1|Bad|
|1|0|Bad|
|0.33|0|Good|
|0|0| Good|

Test instance normalization: 0.33, 1

Distances:

|Acid durability (seconds)|Strength (kg/square meter)|Classification|Distance|
|-|-|-|-|
|1|1|Bad|0.67|
|1|0|Bad|1.20|
|0.33|0|Good|1.00|
|0|0|Good|1.05|

Work:
* $\sqrt{(1-0.33)^2 + (1-1)^2} = 0.67$
* $\sqrt{(1-0.33)^2 + (0-1)^2} \approx 1.20$
* $\sqrt{(0.33-0.33)^2 + (0-1)^2} = 1.00$
* $\sqrt{(0-0.33)^2 + (0-1)^2} \approx 1.05$

Majority classification: 
The 3 nearest neighbors are 1 Bad (0.67) and 2 Goods (1.00 and 1.05) => Good!
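
The manual steps can be double-checked with a few lines of plain Python. This is a sketch for verification only: it uses unrounded normalized values, so the distances may differ in the last digit from the hand-rounded ones, and the `scale` helper is a name introduced here.

```python
import math
from collections import Counter

rows = [(7, 7, "Bad"), (7, 4, "Bad"), (3, 4, "Good"), (1, 4, "Good")]
test = (3, 7)
k = 3

# Column-wise min and max, taken from the training rows only
mins = [min(r[i] for r in rows) for i in range(2)]
maxs = [max(r[i] for r in rows) for i in range(2)]

def scale(point):
    """Min-max scale the two attributes of a point."""
    return [(point[i] - mins[i]) / (maxs[i] - mins[i]) for i in range(2)]

test_scaled = scale(test)

# (distance, label) pairs, nearest first
distances = sorted((math.dist(scale(r), test_scaled), r[2]) for r in rows)
for d, label in distances:
    print(round(d, 3), label)
# 0.667 Bad
# 1.0 Good
# 1.054 Good
# 1.202 Bad

# Majority vote among the k nearest
prediction = Counter(label for _, label in distances[:k]).most_common(1)[0][0]
print(prediction)  # Good
```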

### Make a Prediction with Scikit-Learn
Steps:
1. Load data
1. Normalize
1. Train kNN classifier with training set
1. Test kNN classifier on test instance

In [3]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_normalized = scaler.transform(X_train) # often combined into one step, using fit_transform()

In [4]:
# load data
import pandas as pd

data = [[7, 7, "Bad"], [7, 4, "Bad"], [3, 4, "Good"], [1, 4, "Good"]]
df = pd.DataFrame(data, columns=["Acid durability (seconds)", "Strength (kg/square meter)", "Classification"])

print(df)

   Acid durability (seconds)  Strength (kg/square meter) Classification
0                          7                           7            Bad
1                          7                           4            Bad
2                          3                           4           Good
3                          1                           4           Good


In [5]:
# normalize
from sklearn.preprocessing import MinMaxScaler

X_train = df.drop("Classification", axis=1)
y_train = df["Classification"]

scaler = MinMaxScaler()
scaler.fit(X_train)
print(scaler.data_min_)
print(scaler.data_max_)

X_train_normalized = scaler.transform(X_train)
print(X_train_normalized)

[1. 4.]
[7. 7.]
[[1.         1.        ]
 [1.         0.        ]
 [0.33333333 0.        ]
 [0.         0.        ]]


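We use min-max scaling to rescale every attribute to the range [0, 1]. This puts the attributes on the same scale, so one isn't inherently weighted more than another in the distance formula.

Note that min-max scaling does not correct outliers: an extreme value still becomes the min or max and compresses the remaining values into a narrow part of the range.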
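
As a sketch of what `MinMaxScaler` computes with its default settings, each value $x$ in a column is rescaled using that column's training minimum and maximum (the symbols $x'$, $x_{min}$, and $x_{max}$ are notation introduced here):

$$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$$

For the acid durability column ($x_{min} = 1$, $x_{max} = 7$), the value 3 becomes $(3 - 1) / (7 - 1) \approx 0.33$, matching the third row of the printed array.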

In [6]:
# train
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train_normalized, y_train)

# test
# wrap the test instance in a DataFrame with the training column names,
# so the scaler receives the feature names it was fitted with
X_test = pd.DataFrame([[3, 7]], columns=X_train.columns)
X_test = scaler.transform(X_test)
y_test_prediction = neigh.predict(X_test)
print(y_test_prediction)

['Good']


## Warmup

# Classifier Evaluation
* In our previous demo, we had 1 instance in our "test set"
    * If our classifier predicted this instance's class correctly, accuracy = 100%
    * If our classifier predicted this instance's class incorrectly, accuracy = 0%
    * This is very strict. We only gave it one chance
* Notes
    * We should use a "large" test set to get a better picture of how our classifier is performing
    * Accuracy doesn't tell the whole story...
        * E.g. 100 samples... 99 M, 1 L
        * And our classifier simply only predicts M
        * We have 99% accuracy, yet this isn't really a good classifier: it never predicts L
        * Accuracy only makes sense when your class labels are nearly evenly distributed
* Given a dataset, we need a way to "divide" our dataset into a training set and a test set
    * A few ways to do this...
        1. Hold out method 
        1. Random subsampling
        1. Cross validation
        1. Bootstrap method
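
The 99 M / 1 L example from the notes above can be reproduced in a few lines. This is an illustrative sketch: the label lists are constructed to match that example, and `per_class_accuracy` is a helper name made up here.

```python
# 100 samples: 99 M and 1 L; the classifier always predicts M
y_true = ["M"] * 99 + ["L"]
y_pred = ["M"] * 100

# Overall accuracy: fraction of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.99

def per_class_accuracy(label):
    """Fraction of the instances with this true label that were predicted correctly."""
    predictions = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in predictions) / len(predictions)

print(per_class_accuracy("M"))  # 1.0
print(per_class_accuracy("L"))  # 0.0
```

Looking at each class separately exposes what the single accuracy number hides: every L instance is misclassified.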

## Hold out Method
* "Hold out" a certain number or percentage of instances in a dataset for testing
    * Train on the remaining instances
    * Typically choose a standard split or percentage
        * E.g. 2:1 split: 1/3 of data held out for testing; 2/3 used for training
        * E.g. 25% of data held out for testing; 75% used for training
            * Default for sklearn's `train_test_split()`
        
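Stratify: let's be more intentional about how we produce the test set, so that the class distribution in the test set matches the class distribution in the whole dataset. E.g. if the dataset is basically 50/50, make sure the test set is basically 50/50 too. In sklearn, pass `stratify=y` to `train_test_split()`.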
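
To see the effect of stratifying, here is a small sketch with made-up data (12 M and 8 L labels, and a placeholder single-column `X`); the counts are chosen so a 25% test set can match the 60/40 distribution exactly.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 60% M, 40% L
y = ["M"] * 12 + ["L"] * 8
X = [[i] for i in range(len(y))]  # placeholder feature values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Class counts in each part of the split
print(Counter(y_train))
print(Counter(y_test))
```

Here the 5-instance test set gets 3 M and 2 L, the same 60/40 ratio as the full dataset. Without `stratify=y`, the ratio in the test set depends on the random shuffle.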

In [7]:
df = pd.read_csv("shirt_sizes_long.csv")

X = df.drop("t-shirt size",axis=1)
y = df["t-shirt size"]

scaler = MinMaxScaler()
X = scaler.fit_transform(X)

# print(X)
# print(y)

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


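# 25% of the instances (the default test size); sklearn rounds this up to a whole number
print(len(X) * 0.25)
# stratify=y keeps the class distribution of the test set close to that of the full dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)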
#print(X_train) 
#print(y_train)
#print(X_test)
print(y_test)

knn_clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn_clf.fit(X_train,y_train)
y_predicted = knn_clf.predict(X_test)
print(y_predicted)
print(list(y_test)) # 80%
accuracy = accuracy_score(y_test, y_predicted)
print(accuracy)

4.5
12    L
6     M
7     L
0     M
10    L
Name: t-shirt size, dtype: object
['L' 'M' 'M' 'M' 'L']
['L', 'M', 'L', 'M', 'L']
0.8


In [9]:
# Again but with a decision tree
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train, y_train)
y_predicted = tree_clf.predict(X_test)
print(y_predicted)
print(list(y_test)) # 100%
accuracy = accuracy_score(y_test, y_predicted)
print(accuracy)

['L' 'M' 'L' 'M' 'L']
['L', 'M', 'L', 'M', 'L']
1.0


## Random Subsampling 
* Perform the hold out method k times (a different k from the k in kNN)
* Accuracy is the mean accuracy over the k runs
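
Random subsampling could be sketched as a loop over hold out splits. The function name `random_subsampling`, the parameter `k_runs`, and the data below are assumptions for illustration; each run uses a different `random_state` so that the splits differ.

```python
from statistics import mean
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical, already-scaled data: 10 M instances followed by 10 L instances
X = [[i / 20, i / 20] for i in range(20)]
y = ["M"] * 10 + ["L"] * 10

def random_subsampling(X, y, k_runs=5, test_size=0.25):
    """Repeat the hold out method k_runs times; return one accuracy per run."""
    accuracies = []
    for seed in range(k_runs):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed, stratify=y
        )
        clf = KNeighborsClassifier(n_neighbors=3)
        clf.fit(X_train, y_train)
        accuracies.append(accuracy_score(y_test, clf.predict(X_test)))
    return accuracies

accuracies = random_subsampling(X, y, k_runs=5)
print(accuracies)
print(mean(accuracies))  # the reported accuracy is the mean over the runs
```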

## Cross Validation
* With random subsampling, we are not guaranteed that each instance ends up in a test set at least once
* With cross validation, we are more intentional about our "partitions"
* Algorithm: Divide the dataset into k folds (also a different k)
    * For each fold:
        * Hold out the fold and test on it
        * Train on the remaining folds
* With this algorithm each instance is tested exactly 1 time
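
The fold bookkeeping in this algorithm can be sketched without any library. This is a simplified illustration: `k_fold_indices` is a made-up helper that creates contiguous folds with no shuffling and no stratification, which may differ from how sklearn assigns instances to folds.

```python
def k_fold_indices(n_instances, k):
    """Split the indices 0..n_instances-1 into k contiguous folds of near-equal size."""
    # The first (n_instances % k) folds get one extra instance
    fold_sizes = [n_instances // k + (1 if i < n_instances % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = k_fold_indices(18, 5)
for test_idx in folds:
    # Train on every fold except the one held out for testing
    train_idx = [i for fold in folds if fold is not test_idx for i in fold]
    print("test:", test_idx, "train size:", len(train_idx))

# Every index appears in exactly one test fold
tested = sorted(i for fold in folds for i in fold)
print(tested == list(range(18)))  # True
```

With 18 instances and 5 folds the fold sizes are 4, 4, 4, 3, 3, so each model trains on 14 or 15 instances.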

In [10]:
from sklearn.model_selection import cross_val_score, cross_val_predict 

# run 5-fold cross validation for both the knn and tree
for clf in (knn_clf, tree_clf):
    print(type(clf))
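    # a lazy approach: one accuracy per fold
    accuracies = cross_val_score(clf, X, y, cv=5) # cv is the number of folds
    print(accuracies)
    # a better approach: collect each instance's held-out prediction, then compute one overall accuracy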
    y_predicted = cross_val_predict(clf, X, y, cv=5)
    print(y_predicted)
    accuracy = accuracy_score(y, y_predicted)
    print(accuracy)

<class 'sklearn.neighbors._classification.KNeighborsClassifier'>
[0.75       0.5        1.         1.         0.66666667]
['M' 'M' 'M' 'M' 'M' 'M' 'L' 'M' 'L' 'M' 'M' 'L' 'L' 'L' 'L' 'L' 'L' 'L']
0.7777777777777778
<class 'sklearn.tree._classes.DecisionTreeClassifier'>
[0.5        0.5        1.         1.         0.66666667]
['M' 'M' 'L' 'M' 'M' 'M' 'L' 'M' 'M' 'M' 'L' 'L' 'L' 'L' 'L' 'L' 'L' 'L']
0.7222222222222222


### Variants of cross validation
* Stratified k fold cross validation: roughly the same distribution of class labels in each fold
* LOOCV (leave one out cross validation): k = N, so each fold contains exactly one instance
    * Good for when you need as much training data as possible
    * Inefficient: the model has to be trained N times
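
Both variants are available as splitters in `sklearn.model_selection`. The sketch below only inspects the folds they produce, using made-up labels (6 M and 6 L) and a placeholder `X`; no classifier is trained.

```python
from collections import Counter
from sklearn.model_selection import LeaveOneOut, StratifiedKFold

# Hypothetical dataset: 6 M and 6 L
y = ["M"] * 6 + ["L"] * 6
X = [[i] for i in range(len(y))]  # placeholder feature values

# Stratified k fold: each test fold should mirror the 50/50 class distribution
skf = StratifiedKFold(n_splits=3)
fold_counts = []
for train_idx, test_idx in skf.split(X, y):
    fold_counts.append(Counter(y[i] for i in test_idx))
print(fold_counts)

# LOOCV: N splits, each with a single test instance
loo = LeaveOneOut()
loo_test_sizes = [len(test_idx) for _, test_idx in loo.split(X)]
print(len(loo_test_sizes), set(loo_test_sizes))
```

Either splitter can be passed as the `cv` argument of `cross_val_score` or `cross_val_predict` in place of an integer.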