In [1]:
import pandas as pd

df = pd.read_csv("shirts.csv")
print(df)

   height(cm)  weight(kg) t-shirt size
0         158          58            M
1         163          61            M
2         165          61            L
3         168          66            L


# Intro to Machine Learning
* Supervised machine learning: labeled data (i.e., the dataset has an attribute we are interested in predicting for unseen instances)
    * The attribute we want to predict is called the *class* or the *target*
    * If the class is categorical -> classification task
    * If the class is numeric -> regression task
    * Example algorithm: k nearest neighbors (kNN) classifier and regressor
* Unsupervised machine learning: unlabeled data (i.e., the dataset does not have an attribute we are interested in predicting)
    * E.g., clustering/grouping, associations, outliers/anomalies, trends, etc.
    * Example algorithm: k means clustering
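
The classification/regression distinction above can be sketched with the kNN regressor mentioned in the list. This is a minimal illustration, assuming we treat weight as a numeric class and height as the only feature; that pairing is chosen here for demonstration and is not part of the original demo. The values are copied from the four-row shirts table.

```python
from sklearn.neighbors import KNeighborsRegressor

# heights (cm) as the single feature; weights (kg) as the numeric class
# (assumption for illustration: the original demo predicts t-shirt size instead)
X_train = [[158], [163], [165], [168]]
y_train = [58, 61, 61, 66]

# a kNN regressor averages the class values of the k nearest neighbors
knn_reg = KNeighborsRegressor(n_neighbors=2)
knn_reg.fit(X_train, y_train)

# the 2 nearest heights to 164 are 163 and 165, both with weight 61
y_predicted = knn_reg.predict([[164]])
print(y_predicted)  # [61.]
```

A classifier would instead return the majority label among the neighbors, which is what the kNN example later in these notes does.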

## Supervised ML
* We need to divide a dataset into *training* and *testing* sets
    * We train/build an algorithm/model using a training set
    * We evaluate the algorithm/model using a testing set
    * The training and testing sets are *different* (no instance appears in both), so the test set measures performance on unseen instances

## kNN Example w/scikit-learn

In [2]:
# notation/convention of the scikit-learn API
# X: feature matrix (2D data structure)
# rows are instances and columns are attributes/features
# y: class column (1D data structure)
# the attribute/feature we want to predict
# AKA target array, target vector, labels, etc.
# _train and _test are used to denote training and testing sets, respectively
X_train = df.drop(["t-shirt size"], axis="columns")
y_train = df["t-shirt size"]
print(X_train)
print(y_train)

   height(cm)  weight(kg)
0         158          58
1         163          61
2         165          61
3         168          66
0    M
1    M
2    L
3    L
Name: t-shirt size, dtype: object


In [3]:
# let's scale our feature values
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
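# the fit() and transform() calls are typically combined into one fit_transform() call
# (only for the training set; the test set is transformed with the already-fit scaler)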
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
print(X_train_scaled)
X_test = [[161, 63]] #2D
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)

[[0.    0.   ]
 [0.5   0.375]
 [0.7   0.375]
 [1.    1.   ]]
[[0.3   0.625]]
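
The scaled values above can be reproduced by hand. With its default range of 0 to 1, `MinMaxScaler` maps each column with x' = (x - min) / (max - min), where min and max come from the training set only. The following is a sketch using plain NumPy, with the numbers copied from the cell above; it is meant as a check on the arithmetic, not as a replacement for the scaler.

```python
import numpy as np

# training and test instances copied from the cell above
X_train = np.array([[158, 58], [163, 61], [165, 61], [168, 66]], dtype=float)
X_test = np.array([[161, 63]], dtype=float)

# per-column min and max, learned from the training set only
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)

# the same formula is applied to both sets
X_train_scaled = (X_train - col_min) / (col_max - col_min)
X_test_scaled = (X_test - col_min) / (col_max - col_min)
print(X_train_scaled)
print(X_test_scaled)
```

For example, the test height 161 becomes (161 - 158) / (168 - 158) = 0.3 and the test weight 63 becomes (63 - 58) / (66 - 58) = 0.625.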




In [4]:
# now for kNN!
# https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
# build/train
knn_clf.fit(X_train_scaled, y_train)
# predict
y_predicted = knn_clf.predict(X_test_scaled)
print(y_predicted)
# challenge: can you print out the neighbor distances?
# read the docs: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

['M']
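
One possible answer to the challenge above is the classifier's `kneighbors()` method, which returns the distances to the nearest training instances along with their row positions. The sketch below rebuilds the classifier from the scaled values printed earlier so that it runs on its own; `neighbor_labels` is a helper name introduced here.

```python
from sklearn.neighbors import KNeighborsClassifier

# scaled values copied from the MinMaxScaler output above
X_train_scaled = [[0.0, 0.0], [0.5, 0.375], [0.7, 0.375], [1.0, 1.0]]
y_train = ["M", "M", "L", "L"]
X_test_scaled = [[0.3, 0.625]]

knn_clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn_clf.fit(X_train_scaled, y_train)

# kneighbors() returns (distances, indices), one row per test instance,
# ordered from nearest to farthest
distances, indices = knn_clf.kneighbors(X_test_scaled)
print(distances)
print(indices)

# map the row positions back to their training labels
neighbor_labels = [y_train[i] for i in indices[0]]
print(neighbor_labels)  # ['M', 'L', 'M']
```

Two of the three nearest neighbors are labeled M, which is why the prediction is M.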


## Closing Thoughts on kNN
* Inefficient algorithm (in the basic version, each prediction computes a distance to every training instance), but it is a great first algorithm because it is easy to understand and implement
* What if you have features that are categorical?
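    * Encode the values as integers (0, 1, 2, ...)
        * scikit-learn's [`LabelEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) can help you with this (it is designed for the 1D class column; `OrdinalEncoder` is the counterpart for 2D feature columns)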
    * Use another distance function, or create your own
* kNN is one of MANY supervised ML algorithms
    * Decision trees
    * Random forests
    * Naive Bayes
    * Support vector machines (SVMs)
    * Neural networks
    * etc.
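
The integer encoding mentioned above can be sketched as follows. The `neckline` feature is hypothetical and does not appear in the shirts dataset. The sketch uses `OrdinalEncoder` because it accepts 2D feature columns; `LabelEncoder` expects a 1D array.

```python
from sklearn.preprocessing import OrdinalEncoder

# hypothetical categorical feature "neckline" (not in the shirts dataset)
X_cat = [["crew"], ["v-neck"], ["crew"], ["polo"]]

encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(X_cat)

# categories are sorted, so crew -> 0, polo -> 1, v-neck -> 2
print(encoder.categories_)
print(X_encoded.ravel())  # [0. 2. 0. 1.]
```

Keep in mind that Euclidean distance treats these codes as ordered numbers (crew is "closer" to polo than to v-neck here), which is one reason the list above also suggests using another distance function.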

In [5]:
df = pd.read_csv("shirt_sizes_long.csv")
X = df.drop(["t-shirt size"], axis="columns")
y = df["t-shirt size"]
print(X)
print(y)

    height(cm)  weight(kg)
0          158          58
1          158          59
2          158          63
3          160          59
4          160          60
5          163          60
6          163          61
7          160          64
8          163          64
9          165          61
10         165          62
11         165          65
12         168          62
13         168          63
14         168          66
15         170          63
16         170          64
17         170          68
0     M
1     M
2     M
3     M
4     M
5     M
6     M
7     L
8     L
9     L
10    L
11    L
12    L
13    L
14    L
15    L
16    L
17    L
Name: t-shirt size, dtype: object


## Classifier Evaluation
* In our last demo, we had 1 instance in our "test set"
    * If the classifier predicted the label correctly -> 100% accuracy
    * If the classifier predicted the label incorrectly -> 0% accuracy
* Notes
    * We want a "large" test set to get a good sense of how well our algorithm learned and generalizes to unseen instances
    * Accuracy doesn't tell the whole story... (more later)
* We need to divide a dataset into training and test sets
    * A few ways to do this
        * Holdout method (DA7)
        * Cross validation

### Holdout Method
* "hold out" some instances for testing
    * Train on the remaining instances
* Typically use a standard "split" or percentage holdout
    * 2:1 split: hold out 1/3 for testing, train on remaining 2/3
    * 25% holdout: hold out 25% for testing, train on remaining 75%
        * scikit-learn default for `train_test_split()` (`test_size=0.25` when no size is given)
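
To make the holdout idea concrete, here is a minimal sketch that builds a 2:1 split by hand with NumPy. `holdout_split` is a hypothetical helper written for this note, not a scikit-learn function, and it rounds the test set size to the nearest integer, which is a choice made here for simplicity.

```python
import numpy as np

def holdout_split(n_instances, test_fraction, seed=0):
    """Return (train_indices, test_indices) for a shuffled holdout split."""
    rng = np.random.default_rng(seed)
    # shuffle the row positions so the split does not depend on row order
    shuffled = rng.permutation(n_instances)
    n_test = int(round(n_instances * test_fraction))
    # the first n_test shuffled rows are held out; the rest are for training
    return shuffled[n_test:], shuffled[:n_test]

# 2:1 split of an 18-instance dataset
train_idx, test_idx = holdout_split(18, 1 / 3)
print(len(train_idx), len(test_idx))  # 12 6
```

The index arrays can then be used to select rows, e.g. `X.iloc[train_idx]` for a pandas DataFrame. In practice `train_test_split()` does this and can also stratify by class label.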

In [6]:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
from sklearn.model_selection import train_test_split

# shuffles by default
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)
print(X_train)
print(y_train)

    height(cm)  weight(kg)
9          165          61
17         170          68
13         168          63
5          163          60
11         165          65
2          158          63
1          158          59
8          163          64
16         170          64
3          160          59
4          160          60
15         170          63
14         168          66
9     L
17    L
13    L
5     M
11    L
2     M
1     M
8     L
16    L
3     M
4     M
15    L
14    L
Name: t-shirt size, dtype: object


In [7]:
print(X_test)
print(y_test)

    height(cm)  weight(kg)
12         168          62
6          163          61
7          160          64
0          158          58
10         165          62
12    L
6     M
7     L
0     M
10    L
Name: t-shirt size, dtype: object


In [8]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
from sklearn.metrics import accuracy_score

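# note: the features are used unscaled here, unlike the earlier demo
knn_clf.fit(X_train, y_train)
y_predicted = knn_clf.predict(X_test)
print(y_predicted)
# accuracy = number of correct predictions / number of test instances
acc = accuracy_score(y_test, y_predicted)
print("accuracy:", acc)
# GS: adding after class another way to get accuracy
# score() predicts on X_test and compares against y_test in one step
# (the classifier is already fit above, so this second fit() is redundant but harmless)
knn_clf.fit(X_train, y_train)
acc = knn_clf.score(X_test, y_test)
print("accuracy:", acc)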

['L' 'L' 'M' 'M' 'L']
accuracy: 0.6
accuracy: 0.6


In [9]:
# can a decision tree do better?
# https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier()
# task: take it from here!
tree_clf.fit(X_train, y_train)
y_predicted = tree_clf.predict(X_test)
tree_acc = accuracy_score(y_test, y_predicted)
print("tree accuracy:", tree_acc)

tree accuracy: 1.0


### k Fold Cross Validation (GS adding after class)
* With cross validation, every instance is in the test set exactly one time
* Basic algorithm: divide the dataset into "folds"
    * For each fold
        * Test on the fold
        * Train on the remaining folds (folds - fold)
* Accuracy is the total correctly predicted over all the folds divided by the total number of instances
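
The basic algorithm above can be written out as an explicit loop. This is a sketch under a few assumptions: it uses `StratifiedKFold` to form the folds, a 3-nearest-neighbor classifier on unscaled features, and the 18 instances printed earlier typed in by hand. `times_tested` is a bookkeeping array added here to confirm that every instance is tested exactly once.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# the 18 (height, weight) instances and labels printed earlier
X = np.array([
    [158, 58], [158, 59], [158, 63], [160, 59], [160, 60], [163, 60],
    [163, 61], [160, 64], [163, 64], [165, 61], [165, 62], [165, 65],
    [168, 62], [168, 63], [168, 66], [170, 63], [170, 64], [170, 68],
])
y = np.array(["M"] * 7 + ["L"] * 11)

n_correct = 0
times_tested = np.zeros(len(X), dtype=int)
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # train on the remaining folds, test on the current fold
    clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
    clf.fit(X[train_idx], y[train_idx])
    y_predicted = clf.predict(X[test_idx])
    n_correct += int(np.sum(y_predicted == y[test_idx]))
    times_tested[test_idx] += 1

# total correct over all folds divided by the total number of instances
accuracy = n_correct / len(X)
print("accuracy:", accuracy)
print(times_tested)
```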

In [10]:
from sklearn.model_selection import cross_val_score, cross_val_predict
import numpy as np

# do 5 fold cross validation for both the knn and decision tree classifiers
for clf in [knn_clf, tree_clf]:
    print(type(clf))
    accuracies = cross_val_score(clf, X, y, cv=5)
    print(accuracies, np.mean(accuracies))
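    # better way to calculate accuracy: pool the held-out predictions for all
    # instances and score once (matches "total correct / total instances")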
    y_predicted = cross_val_predict(clf, X, y, cv=5)
    acc = accuracy_score(y, y_predicted)
    print(acc)

<class 'sklearn.neighbors._classification.KNeighborsClassifier'>
[0.75       0.5        1.         0.66666667 0.66666667] 0.7166666666666666
0.7222222222222222
<class 'sklearn.tree._classes.DecisionTreeClassifier'>
[0.5        0.5        1.         1.         0.66666667] 0.7333333333333333
0.7222222222222222
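
The two accuracy figures printed for each classifier differ because the folds are not all the same size. The arithmetic for the kNN row can be checked with the short sketch below. The fold sizes are an inference from the printed fractions (18 instances over 5 folds, with 0.75 implying a fold of 4 and 0.667 implying a fold of 3); they are not printed in the output itself.

```python
# per-fold accuracies printed above for the kNN classifier
fold_accuracies = [0.75, 0.5, 1.0, 2 / 3, 2 / 3]
# fold sizes inferred from those fractions (assumption: 4 + 4 + 4 + 3 + 3 = 18)
fold_sizes = [4, 4, 4, 3, 3]

# mean of per-fold accuracies: every fold counts equally
mean_of_folds = sum(fold_accuracies) / len(fold_accuracies)

# pooled accuracy: every instance counts equally
n_correct = sum(round(a * n) for a, n in zip(fold_accuracies, fold_sizes))
pooled = n_correct / sum(fold_sizes)

print(mean_of_folds)      # about 0.7167
print(n_correct, pooled)  # 13 correct, about 0.7222
```

The same calculation on the decision tree row also gives 13 correct out of 18, which is why both classifiers show the same pooled accuracy despite different per-fold results.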


Variants of cross validation
* Stratified k fold cross validation: each fold has roughly the same distribution of class labels
    * Default for scikit-learn's `cross_val_score()` and `cross_val_predict()` when the estimator is a classifier and `cv` is an integer
        * https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
        * https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html
* LOOCV (leave one out cross validation): k = N, so each fold contains exactly one test instance
    * Inefficient: the model is trained N times
    * Good when you need all the training data you can get
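
LOOCV can be sketched by passing scikit-learn's `LeaveOneOut` splitter as the `cv` argument. The example below assumes the same 18 instances printed earlier (typed in by hand so it runs on its own) and a 3-nearest-neighbor classifier on unscaled features.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# the 18 (height, weight) instances and labels printed earlier
X = np.array([
    [158, 58], [158, 59], [158, 63], [160, 59], [160, 60], [163, 60],
    [163, 61], [160, 64], [163, 64], [165, 61], [165, 62], [165, 65],
    [168, 62], [168, 63], [168, 66], [170, 63], [170, 64], [170, 68],
])
y = np.array(["M"] * 7 + ["L"] * 11)

knn_clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
# one fold per instance: train on 17 instances, test on the remaining 1
scores = cross_val_score(knn_clf, X, y, cv=LeaveOneOut())

print(len(scores))  # 18, one score per instance
print("LOOCV accuracy:", scores.mean())
```

Each score is either 0 or 1 because each test fold has a single instance, so the mean of the scores equals the fraction of instances predicted correctly.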