# 09.02.01 - SkLearnIntro

## Purpose

This notebook is primarily documentation. It is intended as a quick start in machine learning with scikit-learn. Note that we will hand-wave past many of the deeper concepts and cover only a small portion of what is available.

## Libraries

* scikit-learn (`sklearn`)
* pandas
* seaborn (used here only to load the Titanic sample dataset)

## References/Reading
* Choosing the right classifier - https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
* Examples - https://scikit-learn.org/stable/auto_examples/index.html
* Logistic Regression classification - https://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html#sphx-glr-auto-examples-linear-model-plot-iris-logistic-py

![Sklearn ML Chart](https://scikit-learn.org/stable/_static/ml_map.png)

In [1]:
# Import modules using the from syntax
from sklearn.cluster import KMeans                      # k-means clustering
from sklearn.model_selection import train_test_split    # For generating test/train
from sklearn.linear_model import LogisticRegression     # Logistic regression


# A note about imports

Import only what you need to get the job done. Every import adds names to your global namespace, so if you aren't using something, don't import it. Simple as that.

# Evaluating model accuracy

## Confusion Matrix Explained

- True Positive (TP) : Observation is positive, and is predicted to be positive.
- False Negative (FN) : Observation is positive, but is predicted negative.
- True Negative (TN) : Observation is negative, and is predicted to be negative.
- False Positive (FP) : Observation is negative, but is predicted positive.
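
The four outcomes above can be counted directly from paired lists of actual and predicted labels. The following is a minimal sketch in plain Python; the label lists and the `confusion_counts` helper are hypothetical and exist only for illustration.

```python
# Hypothetical labels for illustration: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

def confusion_counts(actual, predicted):
    """Return (TP, FN, TN, FP) for binary labels where 1 is the positive class."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1   # positive, predicted positive
        elif a == 1 and p == 0:
            fn += 1   # positive, predicted negative
        elif a == 0 and p == 0:
            tn += 1   # negative, predicted negative
        else:
            fp += 1   # negative, predicted positive
    return tp, fn, tn, fp

tp, fn, tn, fp = confusion_counts(y_true, y_pred)
print(f"TP={tp} FN={fn} TN={tn} FP={fp}")  # TP=3 FN=1 TN=3 FP=1
```

scikit-learn provides `sklearn.metrics.confusion_matrix` for the same purpose; for binary labels 0 and 1 it returns the counts arranged as `[[TN, FP], [FN, TP]]`.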

### Classification Rate (Accuracy)
- Accuracy is given by the relation: (TP + TN) / (TP + TN + FN + FP)
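
As a quick check of the relation above, here is a small sketch with made-up counts. The `accuracy` helper and all numbers are assumptions for illustration; the second case hints at why accuracy alone can mislead when one class dominates the data.

```python
def accuracy(tp, tn, fn, fp):
    """Fraction of all observations that were classified correctly."""
    total = tp + tn + fn + fp
    return (tp + tn) / total if total else 0.0

# Hypothetical balanced case: 3 TP, 3 TN, 1 FN, 1 FP.
balanced = accuracy(tp=3, tn=3, fn=1, fp=1)
print(balanced)  # 0.75

# Hypothetical imbalanced case: 95 negatives, 5 positives, and a model that
# predicts "negative" for everything. Accuracy looks high, yet every positive is missed.
always_negative = accuracy(tp=0, tn=95, fn=5, fp=0)
print(always_negative)  # 0.95
```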

### Recall
- Recall is the ratio of correctly classified positive examples to the total number of positive examples.
- High Recall indicates that most positive examples are correctly recognized (small number of FN).
- Recall is given by the relation: TP / (TP + FN)
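
A minimal sketch of the recall relation, using hypothetical counts. The `recall` helper is not part of scikit-learn; the library's own equivalent is `sklearn.metrics.recall_score`, which works on label arrays rather than on counts.

```python
def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 when there are no positive examples."""
    positives = tp + fn
    return tp / positives if positives else 0.0

# Hypothetical counts: 5 positives in total, none of them found.
print(recall(tp=0, fn=5))  # 0.0
# Hypothetical counts: 4 positives in total, 3 of them found.
print(recall(tp=3, fn=1))  # 0.75
```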

### Precision
- Precision is the ratio of correctly classified positive examples to the total number of examples predicted as positive.
- High Precision indicates an example labeled as positive is indeed positive (small number of FP).
- Precision is given by the relation: TP / (TP + FP)

High recall, low precision: Most of the positive examples are correctly recognized (low FN), but there are many false positives (high FP).

Low recall, high precision: Many positive examples are missed (high FN), but those predicted as positive are indeed positive (low FP).
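
The two situations described above can be made concrete with a pair of hypothetical models. This is only a sketch: the counts, the model names, and the `precision` and `recall` helpers are all invented for illustration.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0

def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 when there are no positive examples."""
    positives = tp + fn
    return tp / positives if positives else 0.0

# Hypothetical "eager" model: flags many examples as positive.
eager = {"tp": 9, "fn": 1, "fp": 21}
# Hypothetical "cautious" model: flags only the examples it is sure about.
cautious = {"tp": 3, "fn": 7, "fp": 0}

for name, c in [("eager", eager), ("cautious", cautious)]:
    r = recall(c["tp"], c["fn"])
    p = precision(c["tp"], c["fp"])
    print(f"{name}: recall={r:.2f} precision={p:.2f}")
# eager: recall=0.90 precision=0.30
# cautious: recall=0.30 precision=1.00
```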

### F-measure
- The F-Measure combines Precision and Recall into a single score. It uses the harmonic mean in place of the arithmetic mean because the harmonic mean punishes extreme values more.
- The F-Measure will always be nearer to the smaller value of Precision or Recall.
- F-Measure : (2 * Recall * Precision) / (Recall + Precision)
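
To see how the harmonic mean punishes an extreme value, the sketch below compares it with the arithmetic mean for one lopsided pair of scores. The `f_measure` helper and the scores are hypothetical; scikit-learn's own implementation is `sklearn.metrics.f1_score`, which takes label arrays.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return (2 * recall * precision) / (recall + precision)

# Hypothetical scores: perfect precision but very poor recall.
precision, recall = 1.0, 0.1

arithmetic_mean = (precision + recall) / 2
f1 = f_measure(precision, recall)

print(f"arithmetic mean: {arithmetic_mean:.3f}")  # arithmetic mean: 0.550
print(f"F-measure: {f1:.3f}")                     # F-measure: 0.182
```

The F-Measure lands much closer to the weak recall of 0.1 than the arithmetic mean does, which is the behaviour the bullets above describe.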

# Generating training and test data

In [2]:
import pandas as pd
from seaborn import load_dataset

In [3]:
# Helper methods
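def createCategoricalDummies(dataFrame, categoryList):
    # One-hot encode the listed columns; drop_first removes one redundant level per column.
    # dtype=int keeps the 0/1 output shown below on newer pandas, which defaults to booleans.
    return pd.get_dummies(dataFrame[categoryList], prefix_sep="::", drop_first=True, dtype=int)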
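
To make the effect of `drop_first=True` and `prefix_sep="::"` visible, here is a small self-contained sketch on a hypothetical three-row frame (the `toy` data is invented, not taken from the Titanic dataset).

```python
import pandas as pd

# Hypothetical three-row frame that mimics two Titanic columns.
toy = pd.DataFrame({
    "embarked": ["S", "C", "Q"],
    "sex": ["male", "female", "female"],
})

dummies = pd.get_dummies(toy[["embarked", "sex"]], prefix_sep="::", drop_first=True)

# The first level of each column (embarked "C", sex "female") is dropped,
# so a row with zeros in both embarked columns means "C".
print(list(dummies.columns))
# ['embarked::Q', 'embarked::S', 'sex::male']

# Cast to int so the values read as 0/1 regardless of the pandas version.
print(dummies.astype(int).to_dict("list"))
```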

In [4]:
titanicDataSet = load_dataset("titanic")
columns = ["survived", "pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]
categories = ["embarked", "sex"]

In [5]:
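# Keep only the columns of interest and drop rows with missing values.
# Chaining dropna() avoids modifying a slice in place (SettingWithCopyWarning).
titanicDataSet = titanicDataSet[columns].dropna()
# Replace the categorical columns with their dummy-encoded equivalents.
titanicDataSet = pd.concat(
    [titanicDataSet.drop(categories, axis=1), createCategoricalDummies(titanicDataSet, categories)], axis=1)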

In [6]:
titanicDataSet


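| | survived | pclass | age | sibsp | parch | fare | embarked::Q | embarked::S | sex::male |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 | 0 | 1 | 1 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 |
| 2 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 | 0 | 1 | 0 |
| 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | 0 | 1 | 0 |
| 4 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 885 | 0 | 3 | 39.0 | 0 | 5 | 29.1250 | 1 | 0 | 0 |
| 886 | 0 | 2 | 27.0 | 0 | 0 | 13.0000 | 0 | 1 | 1 |
| 887 | 1 | 1 | 19.0 | 0 | 0 | 30.0000 | 0 | 1 | 0 |
| 889 | 1 | 1 | 26.0 | 0 | 0 | 30.0000 | 0 | 0 | 1 |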


In [7]:
features = list(titanicDataSet.columns)
features.remove("survived")
target = "survived"

print(f"Feature categories: {features}")
print(f"Target feature: {target}")

Feature categories: ['pclass', 'age', 'sibsp', 'parch', 'fare', 'embarked::Q', 'embarked::S', 'sex::male']
Target feature: survived


In [8]:
X = titanicDataSet[features]
X

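| | pclass | age | sibsp | parch | fare | embarked::Q | embarked::S | sex::male |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 22.0 | 1 | 0 | 7.2500 | 0 | 1 | 1 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 |
| 2 | 3 | 26.0 | 0 | 0 | 7.9250 | 0 | 1 | 0 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1000 | 0 | 1 | 0 |
| 4 | 3 | 35.0 | 0 | 0 | 8.0500 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 885 | 3 | 39.0 | 0 | 5 | 29.1250 | 1 | 0 | 0 |
| 886 | 2 | 27.0 | 0 | 0 | 13.0000 | 0 | 1 | 1 |
| 887 | 1 | 19.0 | 0 | 0 | 30.0000 | 0 | 1 | 0 |
| 889 | 1 | 26.0 | 0 | 0 | 30.0000 | 0 | 0 | 1 |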


In [9]:
y = titanicDataSet[target]
y

0      0
1      1
2      1
3      1
4      0
      ..
885    0
886    0
887    1
889    1
890    0
Name: survived, Length: 712, dtype: int64

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

print(f"Length of X_train (feature training set): {len(X_train)}")
print(f"Length of X_test (feature test set): {len(X_test)}")
print(f"Length of y_train (target training set): {len(y_train)}")
print(f"Length of y_test (target test set): {len(y_test)}")

Length of X_train (feature training set): 534
Length of X_test (feature test set): 178
Length of y_train (target training set): 534
Length of y_test (target test set): 178
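
The split above is random, so the exact rows that land in each set change from run to run. As a hedged sketch, two optional `train_test_split` parameters address this: `random_state` makes the shuffle repeatable, and `stratify` preserves the class proportions in both sets. The toy data below is hypothetical and stands in for the Titanic features and target.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 12 negatives followed by 8 positives.
X_toy = [[i] for i in range(20)]
y_toy = [0] * 12 + [1] * 8

# random_state fixes the shuffle; stratify keeps the 60/40 class ratio in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy)

print(len(X_tr), len(X_te))  # 15 5
print(sum(y_tr), sum(y_te))  # 6 2

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy)
print(X_te == X_te2)  # True
```

Whether stratification is appropriate depends on the problem; it tends to matter most when one class is much rarer than the other.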
