In [1]:
import pandas as pd

df = pd.read_csv("shirt_sizes.csv")

print(df)

   height(cm)  weight(kg) t-shirt size
0         158          58            M
1         163          61            M
2         165          61            L
3         168          66            L


# Intro to ML
* Supervised machine learning: labeled data (i.e. there is an attribute we want to predict for unseen data)
    * e.g. kNN (k nearest neighbors) algorithm
* Unsupervised machine learning: unlabeled data
    * e.g. k-means clustering 
    
## Supervised Machine Learning
* Somehow, we need to divide our dataset into a training set and a testing set
    * The testing set is how you evaluate your classifier
    * The testing set is *different* from your training set
* Example
    * We have the super small t-shirt sizes dataset
        * Two features (AKA attributes)
            * height and weight
        * One class label (AKA attribute)
            * t-shirt size
            * This is what we want to predict for unseen instances
            * Ex. say we have a new instance, height = 161 and weight = 63
            * What should its t-shirt size be?

## kNN Algorithm
* Find the nearest neighbors in a training set to a test instance (e.g. 161, 63)
* Pick the majority class from among the k nearest neighbors
    * This is the test instance's predicted class
* Need a way to measure "near" AKA "close"
    * 2D: Pythagorean theorem to find the distance between two points
    * ND: Distance formula (Euclidean) 
    * $dist(a, b) = \sqrt{\sum_{i = 1}^n(a_i - b_i)^2}$
* We need to normalize (AKA scale) our features so that the units don't cause an unanticipated weighting (e.g. height is on a larger scale than weight)
    * Use a min-max scaling approach
    * For each feature, the min becomes 0 and the max becomes 1
    * Subtract the feature min from each value, then divide each value by the original range (max - min)
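
The steps above can be traced by hand on the four-row t-shirt dataset. The following is a minimal pure-Python sketch, not the scikit-learn implementation; the helper names `min_max_scale`, `euclidean`, and `knn_predict` are made up for illustration, and the scaling assumes no feature has a range of zero.

```python
import math
from collections import Counter

def min_max_scale(rows, mins, maxs):
    """Map each feature to [0, 1] using (value - min) / (max - min)."""
    return [[(v - lo) / (hi - lo) for v, lo, hi in zip(row, mins, maxs)]
            for row in rows]

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(X_train, y_train, x_test, k=3):
    # pair each training index with its distance, keep the k closest
    neighbors = sorted(
        (euclidean(row, x_test), i) for i, row in enumerate(X_train))[:k]
    # majority vote among the k nearest neighbors
    votes = Counter(y_train[i] for _, i in neighbors)
    return votes.most_common(1)[0][0], neighbors

X_train = [[158, 58], [163, 61], [165, 61], [168, 66]]
y_train = ["M", "M", "L", "L"]

# the min and max come from the training set only
mins = [min(col) for col in zip(*X_train)]
maxs = [max(col) for col in zip(*X_train)]

X_train_scaled = min_max_scale(X_train, mins, maxs)
x_test_scaled = min_max_scale([[161, 63]], mins, maxs)[0]

label, neighbors = knn_predict(X_train_scaled, y_train, x_test_scaled, k=3)
print("prediction:", label)
print("neighbors (distance, index):", neighbors)
```

Note that the test instance is scaled with the training set's min and max, so it lands in the same coordinate system as the training instances.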
    

In [2]:
# kNN using scikit-learn
# X are our feature vectors (instances) minus their class labels
# y are our class labels
X_train = df.drop("t-shirt size", axis=1) # axis=1 is for columns
y_train = df["t-shirt size"]

#print(X_train)
#print(y_train)

In [3]:
# normalize features to [0, 1]
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train) # or fit_transform(X_train) to fit and transform in one step
X_train_normalized = scaler.transform(X_train)
print(X_train_normalized)
X_test = [161, 63]
X_test_normalized = scaler.transform([X_test])
print(X_test_normalized)

[[0.    0.   ]
 [0.5   0.375]
 [0.7   0.375]
 [1.    1.   ]]
[[0.3   0.625]]


In [4]:
# set up our kNN classifier (to get predictions and distances)
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")

knn_clf.fit(X_train_normalized, y_train)
y_predicted = knn_clf.predict(X_test_normalized)
print("prediction:", y_predicted)
print("distances:", knn_clf.kneighbors(X_test_normalized))

prediction: ['M']
distances: (array([[0.32015621, 0.47169906, 0.69327123]]), array([[1, 2, 0]]))


Some closing thoughts on the kNN algorithm:
* What do we do if our features are categorical?
    * Simple approach: convert our string feature values into numeric values
        * E.g. `from sklearn.preprocessing import LabelEncoder`
    * Advanced approach: define your own distance metric for the categorical feature
        * E.g. $dist(v_1, v_2) = 0$ if $v_1 == v_2$, 1 otherwise
* kNN is not the only classification algorithm
    * Decision trees (random forests)
    * Naive Bayes
    * SVMs (support vector machines)
    * Neural networks 
    * etc.
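
For the categorical-feature point above, here is one way the 0/1 distance could be folded into a Euclidean-style distance. This is only a sketch: `mixed_distance` is a made-up helper, and the third feature ("fit preference") is hypothetical and not part of the t-shirt dataset.

```python
import math

def categorical_dist(v1, v2):
    """0 if the two category values match, 1 otherwise."""
    return 0 if v1 == v2 else 1

def mixed_distance(a, b, categorical_idxs):
    """Euclidean-style distance where categorical positions use the 0/1 rule."""
    total = 0.0
    for i, (ai, bi) in enumerate(zip(a, b)):
        if i in categorical_idxs:
            total += categorical_dist(ai, bi) ** 2
        else:
            total += (ai - bi) ** 2
    return math.sqrt(total)

# hypothetical instances: [scaled height, scaled weight, fit preference]
a = [0.3, 0.625, "slim"]
b = [0.5, 0.375, "slim"]
c = [0.5, 0.375, "loose"]

print(mixed_distance(a, b, {2}))  # categories match, only numeric features contribute
print(mixed_distance(a, c, {2}))  # mismatch adds 1 under the square root
```

Because the numeric features are already scaled to [0, 1], a category mismatch counts as much as the largest possible difference in a single numeric feature.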

## Classifier Evaluation
* For our simple example (trace of kNN), we didn't have the "ground truth label" (AKA class) for our test instance
    * So how do we know if our classifier was right?
* So we need to have a "test set" that has ground truth labels for all instances
* How do we get a test set?
    * Divide a dataset into a "training set" and a "testing set"
* A few ways to do this
    * Hold out method
    * Random subsampling
    * k fold cross validation
    * Bootstrap method
    
### Hold Out Method
* "Hold out" a certain number or percentage of instances from your dataset to form your test set (the remaining instances form your training set)
    * Typically you choose a split ratio or a percentage, and you typically stratify (keep a similar class distribution in both sets)
    * E.g. 2:1 (2/3 in training set and 1/3 in test set)
    * E.g. `from sklearn.model_selection import train_test_split` uses a default of 25% for the test set

In [5]:
long_df = pd.read_csv("shirt_sizes_long.csv")
#print(long_df)

y = long_df["t-shirt size"]
X = long_df.drop("t-shirt size", axis=1)
#print(X)
#print(y)

In [6]:
# hold out
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
# task: apply min max scaling to our X (before train test split)
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

# use random_state for reproducibility
# use stratify to ensure a similar distribution of 
# class labels in your training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)
print(X_train)
print(y_train)
print(X_test)
print(y_test)

[[0.58333333 0.3       ]
 [1.         1.        ]
 [0.83333333 0.5       ]
 [0.41666667 0.2       ]
 [0.58333333 0.7       ]
 [0.         0.5       ]
 [0.         0.1       ]
 [0.41666667 0.6       ]
 [1.         0.6       ]
 [0.16666667 0.1       ]
 [0.16666667 0.2       ]
 [1.         0.5       ]
 [0.83333333 0.8       ]]
9     L
17    L
13    L
5     M
11    L
2     M
1     M
8     L
16    L
3     M
4     M
15    L
14    L
Name: t-shirt size, dtype: object
[[0.83333333 0.4       ]
 [0.41666667 0.3       ]
 [0.16666667 0.6       ]
 [0.         0.        ]
 [0.58333333 0.4       ]]
12    L
6     M
7     L
0     M
10    L
Name: t-shirt size, dtype: object


In [7]:
# task: get predictions for X_test and compare them to y_test
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
clf.fit(X_train, y_train)

y_predicted = clf.predict(X_test)
print(y_predicted)
# GS note: score needs to be X_test, y_test
accuracy = clf.score(X_test, y_test)
print(accuracy)

['L' 'M' 'M' 'M' 'L']
0.8


In [8]:
# do it again with decision tree
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(random_state=0)
tree_clf.fit(X_train, y_train)
y_predicted_tree = tree_clf.predict(X_test)
print(y_predicted_tree)
accuracy_tree = tree_clf.score(X_test, y_test)
print(accuracy_tree)

['L' 'M' 'L' 'M' 'L']
1.0


### Random Subsampling
* Perform the hold out method k times (this k is different from the k in kNN)
* The accuracy is the average accuracy over the k runs
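
A rough sketch of random subsampling in plain Python follows. The splits here are not stratified, and `majority_fit_predict` is a hypothetical stand-in classifier (it always predicts the majority training label) so that the example runs without scikit-learn; any function with the same signature could be passed in instead.

```python
import random
from collections import Counter

def majority_fit_predict(X_train, y_train, X_test):
    """Hypothetical stand-in classifier: predicts the majority training label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return [majority] * len(X_test)

def random_subsampling(X, y, fit_predict, k=10, test_fraction=0.25, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility
    n_test = max(1, round(len(X) * test_fraction))
    accuracies = []
    for _ in range(k):
        idxs = list(range(len(X)))
        rng.shuffle(idxs)  # a fresh random hold-out split on every run
        test_idxs, train_idxs = idxs[:n_test], idxs[n_test:]
        y_pred = fit_predict([X[i] for i in train_idxs],
                             [y[i] for i in train_idxs],
                             [X[i] for i in test_idxs])
        correct = sum(pred == y[i] for pred, i in zip(y_pred, test_idxs))
        accuracies.append(correct / n_test)
    # the reported accuracy is the average over the k runs
    return sum(accuracies) / k

X = [[158, 58], [163, 61], [165, 61], [168, 66]]
y = ["M", "M", "L", "L"]
print(random_subsampling(X, y, majority_fit_predict, k=10))
```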

### k Fold Cross Validation
* Be more intentional about our "partitions"
* Every instance is tested exactly one time
* Divide the dataset into k folds 
* For each fold:
    * Test on the fold
    * Train on the remaining folds (folds - fold)
* Variants
    * LOOCV: leave one out cross validation (k = N)
        * Test on each instance one at a time
        * When you need as much data as possible for training
    * Stratified k fold CV
* Accuracy is the total # of correct classifications over the k iterations divided by the total # of instances
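
To make the partitioning concrete, here is a small sketch that builds the fold indexes by hand. `k_fold_indexes` is a made-up helper; it does no shuffling or stratification, so in practice the instances would usually be shuffled first (or a stratified variant used) when the dataset is sorted by class.

```python
def k_fold_indexes(n, k):
    """Split indexes 0..n-1 into k contiguous folds whose sizes differ by at most 1."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# 18 instances and 5 folds, the same sizes used in the cross validation cell
folds = k_fold_indexes(18, 5)
for test_idxs in folds:
    # train on every fold except the one currently used for testing
    train_idxs = [i for fold in folds if fold is not test_idxs for i in fold]
    print("test:", test_idxs, "train size:", len(train_idxs))
```

Each index appears in exactly one test fold, which is what allows the predictions from all k iterations to be pooled into a single accuracy.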

In [9]:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score

for model in [clf, tree_clf]:
    print(type(model))
    accuracies = cross_val_score(model, X, y, cv=5)
    print(accuracies, accuracies.mean())
    y_predictions = cross_val_predict(model, X, y, cv=5) # GS: look into random_state for this one
    print(y_predictions)
    # better estimate of accuracy
    accuracy = accuracy_score(y, y_predictions)
    print(accuracy)

<class 'sklearn.neighbors.classification.KNeighborsClassifier'>
[0.6        1.         1.         1.         0.66666667] 0.8533333333333333
['M' 'M' 'M' 'M' 'M' 'M' 'L' 'M' 'L' 'M' 'L' 'L' 'L' 'L' 'L' 'L' 'L' 'L']
0.8333333333333334
<class 'sklearn.tree.tree.DecisionTreeClassifier'>
[0.6        0.75       1.         1.         0.66666667] 0.8033333333333333
['M' 'M' 'L' 'M' 'M' 'M' 'L' 'M' 'M' 'L' 'L' 'L' 'L' 'L' 'L' 'L' 'L' 'L']
0.7777777777777778


### Bootstrap Method
* Like random subsampling, but with replacement
* Let D = # of instances in a dataset
* Randomly select D instances with replacement
    * This sample is used for training (on average it contains ~63.2% of the distinct original instances)
    * On average ~36.8% of the instances will not be in the training set (these form your test set)
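
The percentages come from the chance that a given instance is never picked in D draws, $(1 - \frac{1}{D})^D \approx e^{-1} \approx 0.368$ for large D. The sketch below checks this empirically; `bootstrap_split` is a made-up helper and D = 10,000 is an arbitrary choice for the demonstration.

```python
import random

def bootstrap_split(n, rng):
    """Draw n indexes with replacement for training; unpicked indexes form the test set."""
    train_idxs = [rng.randrange(n) for _ in range(n)]
    chosen = set(train_idxs)
    test_idxs = [i for i in range(n) if i not in chosen]
    return train_idxs, test_idxs

D = 10_000
rng = random.Random(0)  # seeded for reproducibility
train_idxs, test_idxs = bootstrap_split(D, rng)

print("distinct instances in training set:", len(set(train_idxs)) / D)
print("instances left for testing:", len(test_idxs) / D)
print("theoretical test fraction:", (1 - 1 / D) ** D)
```

The training set still has D rows; it is the number of distinct instances that shrinks, because some instances are drawn more than once.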

TODO: confusion matrices and warning about accuracy with unbalanced class distributions

## Classification Evaluation Metrics
* For binary classification...
    * Choose one of the class labels to be "positive"
    * Choose the other class label to be "negative"
    * P: the # of positive instances in the test set
    * N: the # of negative instances in the test set
    * TP: the # of positives that are correctly classified as positive
    * TN: the # of negatives that are correctly classified as negative
    * FP: the # of negatives that are incorrectly classified as positive
    * FN: the # of positives that are incorrectly classified as negative
* Accuracy: Percent of test instances correctly classified
    * Accuracy = $\frac{TP + TN}{P + N}$
    * Warning!! Accuracy can be misleading if your class distribution is unbalanced
* Error rate (1 - accuracy)
    * Error rate = $\frac{FP + FN}{P + N}$
* Precision
* Recall
* F Measure
* AUC (area under the ROC curve)...
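
As a sketch of how these counts turn into metrics, the code below uses the standard definitions precision = TP / (TP + FP), recall = TP / P, and F1 = 2 * precision * recall / (precision + recall). The labels are copied from the kNN hold out output above; treating "L" as the positive class is an arbitrary assumption, and the helper names are made up.

```python
def binary_counts(y_true, y_pred, positive):
    """Return (TP, TN, FP, FN) for the chosen positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def safe_div(num, den):
    """Avoid a ZeroDivisionError when a class is never predicted or never present."""
    return num / den if den else 0.0

y_true = ["L", "M", "L", "M", "L"]  # y_test from the hold out split
y_pred = ["L", "M", "M", "M", "L"]  # kNN predictions for X_test

tp, tn, fp, fn = binary_counts(y_true, y_pred, positive="L")
accuracy = safe_div(tp + tn, tp + tn + fp + fn)
precision = safe_div(tp, tp + fp)
recall = safe_div(tp, tp + fn)
f1 = safe_div(2 * precision * recall, precision + recall)
print(tp, tn, fp, fn)
print(accuracy, precision, recall, f1)
```

Here the classifier never labels an "M" as "L", so precision is perfect, but it misses one true "L", which lowers recall.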

## Regression Evaluation Metrics
* Standard error
* Mean absolute error
* Root mean square error
* etc...
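
Two of these metrics can be sketched in a few lines, using the usual definitions MAE = $\frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|$ and RMSE = $\sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2}$. The predicted values below are hypothetical and do not come from any model in this notebook.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_square_error(y_true, y_pred):
    """Square root of the average squared difference; penalizes large errors more."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [58, 61, 61, 66]  # actual weights (kg) from the small dataset
y_pred = [60, 61, 63, 62]  # hypothetical predicted weights

print(mean_absolute_error(y_true, y_pred))
print(root_mean_square_error(y_true, y_pred))
```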