<a href="https://colab.research.google.com/github/bundickm/DS-Unit-2-Sprint-3-Classification-Validation/blob/master/Classification_Validation_Cheat_Sheet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Resources
[Sprint GitHub Repo](https://github.com/bundickm/DS-Unit-2-Sprint-3-Classification-Validation)

[Cheat Sheet Repo](https://github.com/bundickm/CheatSheets)

[Yellowbrick: Machine Learning Visualization](https://www.scikit-yb.org/en/latest/index.html)

#Baselines and Validation
[**Model Validation**](https://github.com/bundickm/DS-Unit-2-Sprint-3-Classification-Validation/blob/master/module2-baselines-validation/model-validation-preread.md) - The process where a trained model is evaluated with a testing data set. The testing data set is a separate portion of the same data set from which the training set is derived. The main purpose of using the testing data set is to test the generalization ability of a trained model.

Some Model Validation Methods
- Performance estimation
 - 2-way holdout method (train/test split)
 - (Repeated) k-fold cross-validation without independent test set
- Model selection (hyperparameter optimization) and performance estimation ← *We usually want to do this*
 - 3-way holdout method (train/validation/test split)
 - (Repeated) k-fold cross-validation with independent test set
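
As a rough illustration of the 3-way holdout method listed above, the sketch below splits toy data into train, validation, and test sets with two calls to `train_test_split`. The toy arrays, the `_demo` variable names, and the 60/20/20 proportions are assumptions made for this example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 100 rows, 3 features, balanced binary target
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0, 1] * 50)

# First hold out the test set (20%); it is only used once, at the very end
X_trainval, X_test_demo, y_trainval, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Then split the remainder: 25% of the remaining 80% = 20% of all rows
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42,
    stratify=y_trainval)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```

The validation set is what you tune hyperparameters against; the test set stays untouched until the final performance estimate.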

"There is a pair of ideas that you must understand in order to do inference correctly:
```
        Each observation can either be used for exploration or confirmation, not both.

        You can use an observation as many times as you like for exploration, 
        but you can only use it once for confirmation. As soon as you use an 
        observation twice, you’ve switched from confirmation to exploration.
```
This is necessary because to confirm a hypothesis you must use data independent of the data that you used to generate the hypothesis. Otherwise you will be over optimistic. There is absolutely nothing wrong with exploration, but you should never sell an exploratory analysis as a confirmatory analysis because it is fundamentally misleading."

[**Baselines**](https://github.com/bundickm/DS-Unit-2-Sprint-3-Classification-Validation/blob/master/module2-baselines-validation/model-validation-preread.md#why-begin-with-baselines) - A baseline is a very basic model or solution used to create predictions for a dataset. You can use these predictions to measure the baseline's performance (e.g., accuracy); this metric then becomes what you compare any other machine learning algorithm against. Why start with a baseline? It will take you less than 1/10th of the time, could provide up to 90% of the results, and puts a more complex model into context.
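
One common baseline for classification is to always predict the majority class. The sketch below uses scikit-learn's `DummyClassifier` on a toy target; the 80/20 class split and the `_toy` names are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy imbalanced target (assumption): 80 negatives, 20 positives
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((100, 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_toy))
print('Majority-class baseline accuracy:', baseline_accuracy)
```

Any real model should beat this number before its accuracy is considered meaningful.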

**Leaky Data** - Any feature whose value would not actually be available in practice at the time you'd want to use the model to make a prediction (e.g., data from the future that you would only get after the event occurred). Results that are "too good to be true" are an indicator of leaky data.

In [0]:
from sklearn.tree import DecisionTreeClassifier

#Shallow tree can work as a quick baseline
tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X_train,y_train)

In [0]:
import graphviz
from sklearn.tree import export_graphviz

#visualizing the tree can expose possible leaky data
dot_data = export_graphviz(tree, out_file=None, 
                           feature_names=X_train.columns, 
                           class_names=['No', 'Yes'], filled=True, 
                           impurity=False, proportion=True)

graphviz.Source(dot_data)

#Logistic Regression
**Logistic Regression** - a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes).

The goal of logistic regression is to find the best fitting model to describe the relationship between the dichotomous characteristic of interest (dependent variable) and a set of independent (predictor or explanatory) variables.
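
To make the relationship concrete, here is a minimal sketch showing that, for a binary problem with scikit-learn's default settings, the predicted probability is the logistic (sigmoid) function applied to a linear combination of the features. The synthetic data from `make_classification` and the variable names are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (assumption)
X_lr, y_lr = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X_lr, y_lr)

# Linear score: one coefficient per feature plus an intercept
z = X_lr @ model.coef_[0] + model.intercept_[0]

# The sigmoid squashes the score into a probability between 0 and 1
manual_proba = 1 / (1 + np.exp(-z))

# Compare with the positive-class column of predict_proba
sklearn_proba = model.predict_proba(X_lr)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```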

In [0]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

#read in the data
df = pd.read_csv('some.csv')

#split the data into features and target
target = 'target_feature'
X = df.drop(columns=target)
y = df[target]

#train-test-split or some other validation split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                    test_size=0.2, random_state=42, stratify=y)

#instantiate and fit the model
log_reg = LogisticRegression(solver='lbfgs')
log_reg.fit(X_train, y_train)

#make predictions
y_pred = log_reg.predict(X_test)

In [0]:
from sklearn.metrics import accuracy_score

#Calculate Accuracy (correct predictions/total predictions)
accuracy_score(y_test, y_pred)

In [0]:
from sklearn.model_selection import cross_val_score

#10-fold cross-validation on the training data, scoring accuracy on each fold
scores = cross_val_score(log_reg, X_train, y_train, cv=10, scoring='accuracy')

print('Cross-Validation Accuracy Scores', scores)
scores.min(), scores.mean(), scores.max()

In [0]:
#see the coefficients of the model
coefficients = pd.Series(log_reg.coef_[0], X_train.columns)
coefficients

In [0]:
#see the predicted probabilities instead of the binary predictions
#(X_test here; any rows with the same columns as X_train will work)
log_reg.predict_proba(X_test)

#Random Forests and Gradient Boosting

**Ensemble Methods** - Machine learning techniques that combine several base models in order to produce one optimal predictive model.

**Bagging (Bootstrap Aggregating)** - A machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.

<center><img src="https://victorzhou.com/media/random-forest-post/random-forest.svg" width="400"/></center>

**Random Forest** - An ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
1. Each tree trains on a random bootstrap sample (sample with replacement) of the data. (In scikit-learn, for `RandomForestRegressor` and `RandomForestClassifier`, the bootstrap parameter's default is `True`.) This type of ensembling is called Bagging.
2. Each split considers a random subset of the features. (In scikit-learn, when the `max_features` parameter is not `None`.)
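
The two steps above can be sketched by hand with plain decision trees. This is a simplified illustration, not scikit-learn's implementation: it aggregates by majority vote, whereas `RandomForestClassifier` averages the trees' predicted probabilities. The synthetic data, the 25 trees, and the variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption)
X_bag, y_bag = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.RandomState(0)

trees = []
for _ in range(25):
    # Step 1: bootstrap sample, drawing row indices with replacement
    idx = rng.randint(0, len(X_bag), size=len(X_bag))
    # Step 2: max_features='sqrt' makes each split consider a random
    # subset of the features
    tree_i = DecisionTreeClassifier(max_depth=4, max_features='sqrt',
                                    random_state=rng.randint(10**6))
    trees.append(tree_i.fit(X_bag[idx], y_bag[idx]))

# Aggregate: majority vote across the trees (labels are 0/1, 25 is odd)
votes = np.array([t.predict(X_bag) for t in trees])
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
print('Ensemble training accuracy:', (ensemble_pred == y_bag).mean())
```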


In [0]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, max_depth=4, 
                                class_weight={1:5,0:1})
forest.fit(X,y)

**Boosting** - Boosting works in a similar way to bagging, except that the models are made sequentially: each model is grown using information from previously made models. Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly. Given the current model, we fit a decision model to the residuals from the current model. We then add this new model into the fitted function in order to update the residuals. By fitting models to the residuals, we slowly improve $\hat{f}$ in areas where it does not perform well. Note that in boosting, unlike in bagging, the construction of each model depends strongly on the models that have already been grown.

In [0]:
from xgboost import XGBClassifier

booster = XGBClassifier(n_estimators=20, n_jobs=-1)
booster.fit(X,y)
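
The "fit to the residuals, then add" loop from the Boosting description can be sketched by hand for a regression target. This is only an illustration of the idea under assumed toy data, learning rate, and tree depth; libraries such as XGBoost generalize it (for classification they fit to gradients of the loss rather than raw residuals).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (assumption): a noisy sine wave
rng = np.random.RandomState(0)
X_boost = rng.uniform(-3, 3, size=(200, 1))
y_boost = np.sin(X_boost[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1                    # small steps: "learn slowly"
prediction = np.zeros_like(y_boost)    # current fitted function
mse_history = []

for _ in range(50):
    # Residuals of the current model
    residuals = y_boost - prediction
    # Fit a small tree to the residuals
    small_tree = DecisionTreeRegressor(max_depth=2).fit(X_boost, residuals)
    # Add a shrunken version of the new tree to the fitted function
    prediction += learning_rate * small_tree.predict(X_boost)
    mse_history.append(np.mean((y_boost - prediction) ** 2))

print('Training MSE after 1 round:', mse_history[0])
print('Training MSE after 50 rounds:', mse_history[-1])
```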

#Metrics

##Classification Metrics and Confusion Matrix

**Accuracy** - The proportion of predictions a model got correct. \begin{align}Accuracy = \frac{\text{True Positives + True Negatives}}{\text{Total Number of Predictions}}\end{align}

**Precision** - The proportion of positive identifications that were actually correct. "How useful the results are" \begin{align}Precision = \frac{\text{True Positives}}{\text{True Positives + False Positives}}\end{align}

**Recall** - The proportion of actual positives that were correctly identified.  "How complete the results are" \begin{align}Recall = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}\end{align}

**F1 Score** - The harmonic mean of the precision and recall; an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0.
\begin{align}F1 = 2\frac{\text{Precision} \cdot \text{Recall}}{\text{Precision + Recall}}\end{align}

Precision and recall both measure the relevance of results and are intertwined. Generally, an increase in one comes with a decrease in the other. In simple terms, high precision means that an algorithm returned substantially more relevant results than irrelevant ones, while high recall means that an algorithm returned most of the relevant results. The F1 score attempts to measure how well precision and recall are balanced.
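
As a worked example of the formulas above, the sketch below counts true/false positives and negatives on a small made-up set of labels and computes each metric by hand. The toy labels and variable names are assumptions.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Toy labels (assumption): 4 actual positives, 6 actual negatives
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print('TP, FP, FN, TN:', tp, fp, fn, tn)
print('Accuracy:', accuracy, 'Precision:', precision,
      'Recall:', recall, 'F1:', f1)
```

The hand-computed values can be checked against `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` called on the same two lists.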

**Confusion Matrix (error matrix)** - A table layout that allows visualization of the performance of an algorithm. Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (or vice versa). scikit-learn's `confusion_matrix` uses the "vice versa" layout: rows are actual classes and columns are predicted classes.

In [0]:
from sklearn.metrics import classification_report, confusion_matrix

#The classification_report will give precision, recall, and F1
#(compare y_pred with the labels it was predicted for, here y_test)
print(classification_report(y_test, y_pred))

#Or we can calculate the metrics from the confusion matrix
pd.DataFrame(confusion_matrix(y_test, y_pred), 
             columns=['Predicted Negative', 'Predicted Positive'], 
             index=['Actual Negative', 'Actual Positive'])

##ROC-AUC

**Discrimination Threshold** - The probability or score at which the positive class is chosen over the negative class. Generally, this is set to 50% but the threshold can be adjusted to increase or decrease the sensitivity to false positives or to other application factors.

**Receiver Operating Characteristic (ROC) Curve** - A graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. In simple terms, it illustrates the price you pay in terms of false positive rate to increase the true positive rate. The conservatism is controlled via thresholds on confidence scores to assign the positive and negative label.

**[ROC-AUC](https://www.kaggle.com/learn-forum/53782)** - The area under the ROC curve, which measures how well a classifier ranks predicted probabilities. It ranges from 0 to 1. ROC-AUC is useful for classification with a class imbalance, since a naive majority-class baseline scores only 0.5 no matter how high its accuracy is. AUC scoring also allows us to evaluate models independently of the threshold.
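
The sketch below illustrates both claims on a toy imbalanced target: a constant-score majority baseline gets high accuracy but an ROC-AUC of 0.5, while scores that rank every positive above every negative reach an ROC-AUC of 1 even though none of them crosses the usual 0.5 threshold. The labels and scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy imbalanced labels (assumption): 90 negatives, 10 positives
y_imb = np.array([0] * 90 + [1] * 10)

# Naive baseline: the same score for every row, so everything is
# predicted negative at a 0.5 threshold
constant_scores = np.zeros(100)
baseline_acc = accuracy_score(y_imb, (constant_scores > 0.5).astype(int))
baseline_auc = roc_auc_score(y_imb, constant_scores)

# Scores that rank all positives above all negatives, but stay below 0.5
ranked_scores = np.where(y_imb == 1, 0.4, 0.1)
ranked_auc = roc_auc_score(y_imb, ranked_scores)

print('Baseline accuracy:', baseline_acc, 'Baseline ROC-AUC:', baseline_auc)
print('Well-ranked scores ROC-AUC:', ranked_auc)
```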

In [0]:
#visualize the ROC curve
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

#predicted probability of the positive class for each test row
y_pred_proba = log_reg.predict_proba(X_test)[:, 1]

#get the false positive and true positive rates at each threshold
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

#plot it
plt.plot(fpr, tpr)
plt.title('ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

print('Area under the Receiver Operating Characteristic curve:', 
      roc_auc_score(y_test, y_pred_proba))

In [0]:
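#adjusting threshold for accuracy (share of True = accuracy at this threshold)
pipe.fit(X,y)
#reuse y's index so the comparison lines up row by row
pred_pos = pd.Series(pipe.predict_proba(X)[:,1] > .65, index=y.index)
(pred_pos == y).value_counts(normalize=True)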

In [0]:
#adjusting threshold for precision
pipe.fit(X,y)
#rows predicted positive at a threshold of 0.1
pred_pos = pipe.predict_proba(X)[:,1] > .1
true_pos = (pred_pos & (y == 1).to_numpy()).sum()
print('Precision:', true_pos/pred_pos.sum())

In [0]:
#adjusting threshold for recall
pipe.fit(X,y)
#rows predicted positive at a threshold of 0.1
pred_pos = pipe.predict_proba(X)[:,1] > .1
true_pos = (pred_pos & (y == 1).to_numpy()).sum()
actual_pos = (y == 1).sum()
print('Recall:', true_pos/actual_pos)

##Imbalanced Classes

**[Imbalanced Classes](https://www.svds.com/tbt-learning-imbalanced-classes/)** - Target classes with unequal distribution (more `True` than `False`, more `Red` than `Blue` or `Yellow`). Conventional algorithms are often biased towards the majority class because their loss functions attempt to optimize quantities such as error rate, not taking the data distribution into consideration. In the worst case, minority examples are treated as outliers of the majority class and ignored. The learning algorithm simply generates a trivial classifier that classifies every example as the majority class.

A rough outline of useful approaches. Approximately in order of effort:
- Do nothing. Sometimes you get lucky and nothing needs to be done. You can train on the so-called natural (or stratified) distribution and sometimes it works without need for modification.
- Balance the training set in some way:
 - Oversample the minority class.
 - Undersample the majority class.
 - Synthesize new minority-class examples.
- Throw away minority examples and switch to an anomaly detection framework.
- At the algorithm level, or after it:
 - *Adjust the class weight* (misclassification costs).
 - *Adjust the decision threshold.*
 - Modify an existing algorithm to be more sensitive to rare classes.
- Construct an entirely new algorithm to perform well on imbalanced data.
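
Two of the algorithm-level options above, adjusting the class weight and adjusting the decision threshold, can be sketched with logistic regression. The synthetic data, the roughly 95/5 class split, and the 0.2 threshold are assumptions chosen for illustration; the recall you get will depend on your data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): roughly 95% negatives, 5% positives
X_syn, y_syn = make_classification(n_samples=2000, n_features=5,
                                   weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0, stratify=y_syn)

# Default model: every misclassification costs the same
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
recall_plain = recall_score(y_te, plain.predict(X_te))

# Adjust the class weight: errors on the rare class cost more
weighted = LogisticRegression(max_iter=1000,
                              class_weight='balanced').fit(X_tr, y_tr)
recall_weighted = recall_score(y_te, weighted.predict(X_te))

# Adjust the decision threshold: same plain model, cutoff lowered to 0.2
low_threshold_pred = (plain.predict_proba(X_te)[:, 1] > 0.2).astype(int)
recall_low_threshold = recall_score(y_te, low_threshold_pred)

print('Recall, default model:', recall_plain)
print('Recall, class_weight="balanced":', recall_weighted)
print('Recall, threshold 0.2:', recall_low_threshold)
```

Both adjustments trade precision for recall on the minority class, so check precision (or F1) alongside recall before choosing one.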


#Feature Engineering and Misc.

In [0]:
!pip install category_encoders
import category_encoders as ce

#one hot encode all string columns
encoder = ce.OneHotEncoder(use_cat_names=True)
encoder.fit_transform(X_train)

In [0]:
!pip install category_encoders
import category_encoders as ce

#encode all string columns
encoder = ce.OrdinalEncoder()
encoder.fit_transform(X_train)

In [0]:
from sklearn.impute import SimpleImputer

#fill in missing values (SimpleImputer's default strategy is the column mean)
imputer = SimpleImputer()
imputer.fit_transform(X_train)

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt

#visualize the coefficients
coefficients = pd.Series(log_reg.coef_[0], X_train.columns)
plt.figure(figsize=(10,10))
coefficients.sort_values().plot.barh(color='grey');

In [0]:
from sklearn.preprocessing import StandardScaler

#Standardize the data (center on the mean and set unit variance)
scaler = StandardScaler()
scaler.fit_transform(X_train)

In [0]:
from sklearn.preprocessing import MinMaxScaler

#set all features to have a range of 0 to 1
scaler = MinMaxScaler()
scaler.fit_transform(X_train)

In [0]:
from sklearn.pipeline import make_pipeline

#example pipeline, all parts in pipe must follow sklearn format
pipe = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    SimpleImputer(), 
    MinMaxScaler(), 
    LogisticRegression(solver='lbfgs', max_iter=1000))

#End Case 1: Scoring Method
scores = cross_val_score(pipe, X, y, cv=10)

#End Case 2: Calling pipe just like any other sklearn model
#(fit on the training split only, so the test rows stay unseen)
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

In [0]:
#Test multiple models in quick succession, preprocess data before this
models = [LogisticRegression(solver='lbfgs', max_iter=1000), 
          DecisionTreeClassifier(max_depth=3), 
          DecisionTreeClassifier(max_depth=None), 
          RandomForestClassifier(max_depth=3, n_estimators=100, n_jobs=-1, 
                                 random_state=42), 
          RandomForestClassifier(max_depth=None, n_estimators=100, n_jobs=-1, 
                                 random_state=42), 
          XGBClassifier(max_depth=3, n_estimators=100, n_jobs=-1, 
                        random_state=42)]

for model in models:
  print(model, '\n')
  score = cross_val_score(model, titanic_X, titanic_y, 
                          scoring='accuracy', cv=5).mean()
  print('Cross-Validation Accuracy:', score, '\n', '\n')

In [0]:
#visualize feature importance for multiple models, pairs with the cell above
for model in models:
  name = model.__class__.__name__
  model.fit(titanic_X, titanic_y)
  if name == 'LogisticRegression':
    coefficients = pd.Series(model.coef_[0], titanic_X.columns)
    coefficients.sort_values().plot.barh(color='grey', title=name)
    plt.show()
  else:
    importances = pd.Series(model.feature_importances_, titanic_X.columns)
    title = f'{name}, max_depth={model.max_depth}'
    importances.sort_values().plot.barh(color='grey', title=title)
    plt.show()