# Overview of Machine Learning

Machine Learning (ML) is a subfield of Artificial Intelligence that applies math and computation to build models from data. As with modeling in other fields, the goal of machine learning is to cut through the messiness, chaos, and complexity of real-world data and find the underlying relationships that represent the system in a meaningful way, all without being explicitly programmed to do so.

ML algorithms learn these representations from data and are generally based on statistics and mathematical optimization techniques. They can surface patterns, trends, and relationships in data that may have gone unnoticed by humans.

There are different types of algorithms available, and which ones to apply will largely depend on the data and the problem at hand. Algorithms are usually grouped into the following types:

- **Supervised learning**: learns patterns from labeled data, then applies them to make predictions about new, similar data
- **Unsupervised learning**: gives insight into the structure of the data or reduces the number of variables (or features) to those that are relevant
- **Recommender systems**: learn relationships within data to make useful recommendations
- **Reinforcement learning**: finds the optimal way to perform a task or learns how to interact with an environment, given a system of rewards and punishments

Machine Learning is closely related to predictive statistics, and is sometimes referred to as predictive analytics or predictive modeling.

```python
# Import libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')

from sklearn.model_selection import train_test_split
```

## Supervised Learning

Supervised learning builds models that make predictions. The data fed to the machine learning algorithm include the "answer" or outcome the model needs to predict, which is also called the target variable or labels. The model learns a mapping between the independent variables (the inputs, or features) and the labels, and that mapping is later applied to new, unseen data to make predictions going forward.
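The workflow described above can be sketched as follows. This is a minimal illustration rather than a recommended pipeline: the synthetic data, the choice of `LinearRegression`, and the 25% hold-out size are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y is linear in x plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))                   # features (inputs)
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1, size=200)    # labels (target variable)

# Hold out 25% of the observations to stand in for "new, unseen" data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)       # learn the mapping from features to labels
y_pred = model.predict(X_test)    # apply it to data the model has not seen

print(f"Learned slope: {model.coef_[0]:.2f}, intercept: {model.intercept_:.2f}")
print(f"R^2 on held-out data: {model.score(X_test, y_test):.3f}")
```

Scoring on the held-out rows, rather than on the training rows, gives a rough estimate of how the model would behave on data it has never seen.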

Supervised learning is further broken down by the type of problem: **classification** problems predict the class or category of an observation, and **regression** problems predict a numerical value for an observation.

### Classification

Examples of classification algorithms:

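- **Logistic Regression**: applies the logistic (sigmoid) function to a linear combination of the features to estimate the probability that an observation belongs to a class
- **$k$-Nearest Neighbors (kNN)**: finds the $k$ nearest neighbors of an observation based on a similarity (distance) function and uses a majority vote among them to determine the class
- **Stochastic Gradient Descent Classifier**: fits a linear classifier with stochastic gradient descent, which makes it capable of handling large datasets and training in batches (or on each instance independently)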
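As a rough sketch of how the three algorithms above might be tried side by side in scikit-learn: the synthetic dataset from `make_classification`, the scaling step, and the hyperparameters (`n_neighbors=5`, `max_iter=1000`) are assumptions chosen for illustration, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustrative only)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "SGD Classifier": SGDClassifier(random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    # Scale features first: kNN distances and SGD updates are scale-sensitive
    model = make_pipeline(StandardScaler(), clf)
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)   # accuracy on held-out data
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

Accuracy is used here only because it is the default `score` for classifiers; the performance measures below give a fuller picture.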

#### Performance Measures

Performance measures tie closely to the confusion matrix, which shows a grid of actual labels against predicted results.

| | Predicted False | Predicted True |
| ----- | ----- | ----- |
| **Actual False** | TN | FP |
| **Actual True** | FN | TP |
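The same grid can be produced with scikit-learn's `confusion_matrix`, which puts actual labels in rows and predicted labels in columns, matching the table above. The ten labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = false (negative), 1 = true (positive)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# Rows are actual labels, columns are predicted labels, both sorted as [0, 1]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Unpacking with `ravel()` follows the row-major order of the grid: TN, FP, FN, TP.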


A **type I error** (false positive) is when the label is false but the observation is incorrectly classified as true. The probability of a type I error is called **alpha risk**. A **type II error** (false negative) is when the label is true but the observation is incorrectly classified as false. The probability of a type II error is called **beta risk**.

- **Accuracy**: the fraction of all predictions that are correct: $\frac{TN + TP}{TN + FN + TP + FP}$
- **Precision**: the fraction of everything predicted as positive that is actually positive. In other words, when the model claims an observation is positive, it's correct this fraction of the time. A good metric to use when what is classified as positive MUST be correct (e.g., predicting appropriate videos for kids): $\frac{TP}{TP + FP}$
- **Recall (sensitivity, true positive rate)**: the fraction of everything that is positive that the model detects. A good metric to use when you can't let any positives slip through the cracks (e.g., predicting malignant tumors so the patient receives timely treatment): $\frac{TP}{TP + FN}$
- **F1 Score**: the harmonic mean of precision and recall; a high value is only possible when both precision and recall are high, and the maximum (1) requires both to equal 1. Good for situations that favor neither precision nor recall but want to maximize both, or for when the positive class is scarce: $\frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}} = \frac{TP}{TP + \frac{FN + FP}{2}}$
- **Specificity (true negative rate)**: the fraction of everything that is negative that is correctly classified as negative: $\frac{TN}{TN + FP}$
- **False positive rate**: the fraction of everything that is negative that is falsely classified as positive, equal to $1 - \text{specificity}$: $\frac{FP}{TN + FP}$
- **False negative rate**: the fraction of everything that is positive that is falsely classified as negative, equal to $1 - \text{recall}$: $\frac{FN}{TP + FN}$
- **Receiver Operating Characteristic (ROC)**: plots the TPR/recall/sensitivity (correctly classified positive values to all positive values) against the FPR (incorrectly classified negative values to all negative values) across decision thresholds. As the threshold is lowered to increase the TPR, the model incorrectly classifies more and more negative values as positive, which raises the FPR, and vice versa. The **area under the ROC curve (AUROC)** summarizes the curve in a single number: 1 indicates a perfect classifier and 0.5 is no better than random guessing. The ROC curve is similar to the precision-recall curve and F1; prefer those when the positive class is scarce or when you care more about the false positives than the false negatives, and use ROC otherwise.
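To make the formulas above concrete, here is a small worked example in plain Python. The four counts are hypothetical, picked only so the arithmetic is easy to follow.

```python
# Hypothetical confusion-matrix counts, chosen only for illustration
tn, fp, fn, tp = 50, 10, 5, 35

accuracy = (tn + tp) / (tn + fn + tp + fp)
precision = tp / (tp + fp)
recall = tp / (tp + fn)                       # sensitivity, true positive rate
f1 = 2 / (1 / precision + 1 / recall)         # harmonic mean of the two
f1_from_counts = tp / (tp + (fn + fp) / 2)    # equivalent count-based form
specificity = tn / (tn + fp)                  # true negative rate
false_positive_rate = fp / (tn + fp)
false_negative_rate = fn / (tp + fn)

metrics = {
    "accuracy": accuracy,
    "precision": precision,
    "recall": recall,
    "f1": f1,
    "specificity": specificity,
    "false positive rate": false_positive_rate,
    "false negative rate": false_negative_rate,
}
for name, value in metrics.items():
    print(f"{name:>20}: {value:.3f}")
```

Both forms of the F1 formula give the same value, and the false positive and false negative rates are the complements of specificity and recall, respectively.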

```python
# scikit-learn cross-validation helpers and classification metrics
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import (
    confusion_matrix,
    precision_score,
    recall_score,
    f1_score,
    roc_curve,
    roc_auc_score,
)
```
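One possible way to put these imports to work is sketched below. The imbalanced synthetic dataset, the `LogisticRegression` model, and the 5-fold setting are assumptions for the sake of the example; `cross_val_predict` is used so that every prediction comes from a model that did not see that observation during training.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score,
    f1_score, roc_curve, roc_auc_score,
)
from sklearn.model_selection import cross_val_predict

# Synthetic data with a scarce positive class (roughly 20%), illustrative only
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
clf = LogisticRegression(max_iter=1000)

# Out-of-fold class predictions and positive-class probabilities
y_pred = cross_val_predict(clf, X, y, cv=5)
y_score = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]

print(confusion_matrix(y, y_pred))
precision = precision_score(y, y_pred)
recall = recall_score(y, y_pred)
f1 = f1_score(y, y_pred)
print(f"precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}")

# ROC curve: one (FPR, TPR) point per decision threshold
fpr, tpr, thresholds = roc_curve(y, y_score)
auc = roc_auc_score(y, y_score)
print(f"AUROC={auc:.3f}")
```

The `fpr` and `tpr` arrays can be passed straight to `plt.plot(fpr, tpr)` to draw the curve, with FPR on the x-axis and TPR on the y-axis.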