# 09.03.01 - Classification

## Purpose

This notebook walks through two classification strategies using the data set we discussed in 09.01 and 09.02.

## Libraries

* scikit-learn (`sklearn`)
* pandas
* seaborn (only for `load_dataset`)

## References/Reading

* KNeighborsClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
  * https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
* LogisticRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
  * https://en.wikipedia.org/wiki/Logistic_regression
* Confusion Matrix
  * Tutorial: https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
  * Scikit: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

In [1]:
import pandas as pd
from seaborn import load_dataset
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

In [2]:
# Helper methods
def createCategoricalDummies(dataFrame, categoryList):
    return pd.get_dummies(dataFrame[categoryList], prefix_sep = "::", drop_first = True)
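
To make the helper concrete, here is a minimal sketch on a hypothetical three-row frame (the toy values are made up for illustration and are not Titanic rows). With `drop_first=True`, the alphabetically first category of each column (`C` and `female` here) is dropped and becomes the implicit baseline.

```python
import pandas as pd

def createCategoricalDummies(dataFrame, categoryList):
    return pd.get_dummies(dataFrame[categoryList], prefix_sep="::", drop_first=True)

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    "embarked": ["S", "C", "Q"],
    "sex": ["male", "female", "female"],
})

# astype(int) only makes the printout show 0/1 regardless of pandas version
dummies = createCategoricalDummies(toy, ["embarked", "sex"]).astype(int)
print(dummies)
```

A row with all zeros (the second one here) therefore means "embarked at C, female".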

# Part 1: Prepare the data
We've seen all of this already, so we'll just do it in one place.

In [3]:
titanicDataSet = load_dataset("titanic")
columns = ["survived", "pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]
categories = ["embarked", "sex"]

In [4]:
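# Select the columns and drop incomplete rows in one step; calling
# dropna(inplace=True) on a column slice can raise SettingWithCopyWarning
titanicDataSet = titanicDataSet[columns].dropna()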
titanicDataSet = pd.concat(
    [titanicDataSet.drop(categories, axis=1), createCategoricalDummies(titanicDataSet, categories)], axis= 1)

features = list(titanicDataSet.columns)
features.remove("survived")
target = "survived"

In [5]:
X = titanicDataSet[features]
y = titanicDataSet[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
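
Because `train_test_split` shuffles randomly, the scores shown in this notebook will differ from run to run. If you want a repeatable split, a minimal sketch looks like the following: `random_state` fixes the shuffle and `stratify` keeps the class ratio the same in both parts. The toy arrays and the seed value 42 are placeholders, not values used elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 12 of class 0 and 8 of class 1 (illustrative only)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 12 + [1] * 8)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.25,
    random_state=42,   # arbitrary seed; any fixed integer makes the split repeatable
    stratify=y_toy,    # preserve the 60/40 class ratio in both parts
)

print(len(X_tr), len(X_te))   # 15 5
print(y_tr.sum(), y_te.sum())  # number of class-1 samples in each part
```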

# Modeling with k-Nearest Neighbors (k-NN)

The scikit-learn estimators share a very similar interface (construct, `fit`, `score`, `predict`), which makes them easy to use and to swap. You'll see the same calls again later with Logistic Regression.
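
As a rough illustration of that shared interface, the sketch below runs both estimators through the same loop. It uses synthetic data from `make_classification` instead of the Titanic frame so that it runs on its own; the sample counts and seeds are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: used only to keep the example self-contained)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

models = {
    "k-NN": KNeighborsClassifier(n_neighbors=3),
    "Logistic Regression": LogisticRegression(solver="liblinear"),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)                    # same call for every estimator
    scores[name] = model.score(X_te, y_te)   # mean accuracy on held-out data
    print(f"{name}: {scores[name]:.2f}")
```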

In [6]:
knn = KNeighborsClassifier(n_neighbors=3)
knn

KNeighborsClassifier(n_neighbors=3)

In [7]:
knn.fit(X_train, y_train)    # Remember, X = features, y = target

KNeighborsClassifier(n_neighbors=3)

In [8]:
knn.score(X_train, y_train)  # What's our score with the training data set?

0.8220973782771536

In [9]:
knn.score(X_test, y_test)    # What's our score with the test data set?


0.6853932584269663

## Model notes
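Looking at our scores above, they don't look great. Roughly 69% accuracy on the test set is a bit better than a coin flip, which isn't saying much, and the drop from 82% on the training set suggests the model does not generalize well.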
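
One way to look into a gap like this is to try several values of `n_neighbors` and compare training and test accuracy side by side. The sketch below does that on synthetic data (an assumption made so the example is self-contained); the candidate values of k are arbitrary, and the numbers it prints say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the Titanic set
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)

results = {}
for k in (1, 3, 5, 9, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for each k
    results[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"k={k:>2}  train={results[k][0]:.2f}  test={results[k][1]:.2f}")
```

With k=1 every training point is its own nearest neighbor, so training accuracy is perfect by construction. A large train/test gap at small k is a common sign of overfitting.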

Now's a good time to review 09.02.01, specifically the part on evaluation and what the confusion matrix means. We'll also use sklearn to compute that information for us.

In [10]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, recall_score, precision_score, fbeta_score, classification_report

def printMetrics(test, predictions):
    print("Confusion Matrix:")
    print(confusion_matrix(test, predictions))
    print("------------------")
    print(f"Accuracy: {accuracy_score(test, predictions):.2f}")
    print(f"Recall: {recall_score(test, predictions):.2f}")
    print(f"Precision: {precision_score(test, predictions):.2f}")
    print(f"f-measure: {fbeta_score(test, predictions, beta=1):.2f}")
    print("------------------")
    print(classification_report(test, predictions))

In [11]:
predictions = knn.predict(X_test)
printMetrics(y_test, predictions)

Confusion Matrix:
[[84 33]
 [23 38]]
------------------
Accuracy: 0.69
Recall: 0.62
Precision: 0.54
f-measure: 0.58
------------------
              precision    recall  f1-score   support

           0       0.79      0.72      0.75       117
           1       0.54      0.62      0.58        61

    accuracy                           0.69       178
   macro avg       0.66      0.67      0.66       178
weighted avg       0.70      0.69      0.69       178
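
The summary numbers above can be reproduced by hand from the confusion matrix, which is a useful check on what each metric means. scikit-learn puts true labels in rows and predicted labels in columns, so for a binary problem the matrix reads `[[TN, FP], [FN, TP]]`. The variable names below are only for illustration.

```python
# Counts taken from the k-NN confusion matrix printed above
tn, fp = 84, 33
fn, tp = 23, 38
total = tn + fp + fn + tp

accuracy = (tp + tn) / total           # share of all predictions that were right
recall = tp / (tp + fn)                # share of actual survivors the model found
precision = tp / (tp + fp)             # share of predicted survivors who survived
f1 = 2 * precision * recall / (precision + recall)
baseline = (tn + fp) / total           # accuracy of always predicting class 0

print(f"Accuracy:  {accuracy:.2f}")    # 0.69
print(f"Recall:    {recall:.2f}")      # 0.62
print(f"Precision: {precision:.2f}")   # 0.54
print(f"F1:        {f1:.2f}")          # 0.58
print(f"Baseline:  {baseline:.2f}")    # 0.66
```

The baseline line shows that always predicting "did not survive" would already score about 0.66 on this split, so 0.69 is only a small improvement.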



## Predict some new samples

Let's define a few new people. You can pass them to `predict` either as a list or as a DataFrame. We'll only use a DataFrame here, and we'll generate our sample people randomly.
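
For reference, here is a small sketch of both input styles on a hypothetical two-feature model (the column names `a` and `b` and all values are invented for illustration). Both calls should give the same predictions as long as the list values are in the same column order used for training.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical two-feature training data, for illustration only
train = pd.DataFrame({"a": [0, 0, 5, 6], "b": [0, 1, 5, 5]})
labels = [0, 0, 1, 1]
model = KNeighborsClassifier(n_neighbors=1).fit(train, labels)

# Option 1: a DataFrame with the same column names as the training data
new_frame = pd.DataFrame({"a": [0.2, 5.5], "b": [0.1, 5.0]})
from_frame = model.predict(new_frame)

# Option 2: a plain list of rows, values in the same column order.
# scikit-learn may warn about missing feature names here, because the
# model was fitted on a DataFrame.
from_list = model.predict([[0.2, 0.1], [5.5, 5.0]])

print(from_frame, from_list)
```

The DataFrame route is a little more verbose but keeps the column names attached, which makes ordering mistakes harder to make.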

In [12]:
import random as rnd
rnd.seed(1024)
titanicDataSet

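|     | survived | pclass | age  | sibsp | parch | fare    | embarked::Q | embarked::S | sex::male |
|-----|----------|--------|------|-------|-------|---------|-------------|-------------|-----------|
| 0   | 0        | 3      | 22.0 | 1     | 0     | 7.2500  | 0           | 1           | 1         |
| 1   | 1        | 1      | 38.0 | 1     | 0     | 71.2833 | 0           | 0           | 0         |
| 2   | 1        | 3      | 26.0 | 0     | 0     | 7.9250  | 0           | 1           | 0         |
| 3   | 1        | 1      | 35.0 | 1     | 0     | 53.1000 | 0           | 1           | 0         |
| 4   | 0        | 3      | 35.0 | 0     | 0     | 8.0500  | 0           | 1           | 1         |
| ... | ...      | ...    | ...  | ...   | ...   | ...     | ...         | ...         | ...       |
| 885 | 0        | 3      | 39.0 | 0     | 5     | 29.1250 | 1           | 0           | 0         |
| 886 | 0        | 2      | 27.0 | 0     | 0     | 13.0000 | 0           | 1           | 1         |
| 887 | 1        | 1      | 19.0 | 0     | 0     | 30.0000 | 0           | 1           | 0         |
| 889 | 1        | 1      | 26.0 | 0     | 0     | 30.0000 | 0           | 0           | 1         |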


In [13]:
numElements = 3
samplePeople = []
for _ in range(numElements):
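    person = {}  # renamed from dict to avoid shadowing the built-in
    for column in X.columns:
        minValue = 0  # lower bound for every feature (note: this allows pclass 0, which is not a real class)
        maxValue = round(max(titanicDataSet[column].values))
        person[column] = rnd.randint(minValue, maxValue)
    samplePeople.append(person)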
samplePeople

[{'pclass': 0,
  'age': 61,
  'sibsp': 3,
  'parch': 2,
  'fare': 102,
  'embarked::Q': 1,
  'embarked::S': 1,
  'sex::male': 1},
 {'pclass': 3,
  'age': 12,
  'sibsp': 5,
  'parch': 5,
  'fare': 143,
  'embarked::Q': 1,
  'embarked::S': 0,
  'sex::male': 1},
 {'pclass': 1,
  'age': 55,
  'sibsp': 3,
  'parch': 3,
  'fare': 355,
  'embarked::Q': 0,
  'embarked::S': 0,
  'sex::male': 0}]

In [14]:
pdSamplePeople = pd.DataFrame(samplePeople)  # a list of dicts becomes one row per person

In [15]:
predictions = knn.predict(pdSamplePeople)
predictions

array([0, 0, 1])

In [16]:
pdPredictedPeople = pdSamplePeople.copy()  # copy so pdSamplePeople keeps only the model's input columns
pdPredictedPeople["Survived?"] = predictions.astype(bool)
pdPredictedPeople

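|   | pclass | age | sibsp | parch | fare | embarked::Q | embarked::S | sex::male | Survived? |
|---|--------|-----|-------|-------|------|-------------|-------------|-----------|-----------|
| 0 | 0      | 61  | 3     | 2     | 102  | 1           | 1           | 1         | False     |
| 1 | 3      | 12  | 5     | 5     | 143  | 1           | 0           | 1         | False     |
| 2 | 1      | 55  | 3     | 3     | 355  | 0           | 0           | 0         | True      |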


# Modeling with Logistic Regression

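We'll use the same train/test data as above with a new classifier. Calling logistic regression a "classifier" is intentionally fudging the term: it is a regression model that outputs a probability, and we get a binary prediction by applying a cutoff to that probability.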
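
To see the cutoff idea in isolation, the sketch below fits a logistic regression on synthetic data (an assumption, chosen so the example runs on its own) and checks that thresholding the class-1 probability at 0.5 gives the same labels as `predict`. The variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, not the Titanic set
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(solver="liblinear").fit(X_demo, y_demo)

# Column 1 of predict_proba is the estimated probability of class 1
prob_class1 = model.predict_proba(X_demo)[:, 1]

# Applying a 0.5 cutoff to that probability gives the class labels
manual = (prob_class1 > 0.5).astype(int)
agreement = (manual == model.predict(X_demo)).mean()
print(f"Agreement with model.predict: {agreement:.2f}")
```

Working from `predict_proba` directly also lets you pick a cutoff other than 0.5, for example to trade precision against recall.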

In [17]:
lr = LogisticRegression(solver="liblinear")
lr

LogisticRegression(solver='liblinear')

In [18]:
lr.fit(X_train, y_train)

LogisticRegression(solver='liblinear')

In [19]:
lr.score(X_train, y_train)

0.8014981273408239

In [20]:
lr.score(X_test, y_test)

0.7921348314606742

In [21]:
predictions = lr.predict(X_test)
printMetrics(y_test, predictions)

Confusion Matrix:
[[97 20]
 [17 44]]
------------------
Accuracy: 0.79
Recall: 0.72
Precision: 0.69
f-measure: 0.70
------------------
              precision    recall  f1-score   support

           0       0.85      0.83      0.84       117
           1       0.69      0.72      0.70        61

    accuracy                           0.79       178
   macro avg       0.77      0.78      0.77       178
weighted avg       0.79      0.79      0.79       178

