# Ch. 1 - Basic Modeling in Scikit-learn

### What is model validation
- ensuring your model performs as expected on new data
- testing model performance on holdout sets
- selecting the best model, parameters, and accuracy metrics
- achieving the best accuracy for the data given

### Basic modeling steps
- create a model by instantiating a model type and its parameters (ex. rf = RandomForestRegressor(n_estimators=500))
- fit the model to a training set of data (ex. rf.fit(X_train, y_train))
- generate predictions for the testing set of data (ex. y_pred = rf.predict(X_test))
- assess the accuracy metrics (ex. MSE(y_test, y_pred))

In [45]:
import pandas as pd
candy = pd.read_csv('candy-data.csv')
ttt = pd.read_csv('tic-tac-toe.csv')

#### Testing model accuracy on seen vs unseen data

In [34]:
# Basic Imports to use
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae

# Split the data into predictors and response
X = candy.drop(['competitorname', 'winpercent'], axis=1)
y = candy['winpercent']

# Create a training and testing set of data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=1111)

# Instantiate the model
rf = RandomForestRegressor(random_state=1111)

# Fit the model
rf.fit(X_train, y_train)

# Predict on both the training and testing set
train_pred = rf.predict(X_train)
test_pred = rf.predict(X_test)

# Calculate errors for train and test predictions
train_error = mae(y_train, train_pred)
test_error = mae(y_test, test_pred)

# Print the accuracy for seen and unseen data
print("Model error on seen data: {0:.2f}.".format(train_error))
print("Model error on unseen data: {0:.2f}.\n".format(test_error))

Model error on seen data: 3.59.
Model error on unseen data: 9.49.



### Regression Models - RandomForestRegressor
We will focus on 3 hyperparameters
- n_estimators: number of trees in the forest
- max_depth: maximum depth of the trees (levels)
- random_state: random seed to ensure reproducibility

You can set these hyperparameters in two ways
- when instantiating the model: rf = RandomForestRegressor(n_estimators=500)
- after the model is created, but before it is fit: rf.n_estimators = 50 or rf.max_depth = 10
    - this could be helpful when testing out different sets of parameters
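
The two approaches above can be sketched as follows. This is a minimal illustration rather than part of the original notebook; `set_params` and `get_params` are standard scikit-learn estimator methods, and the values chosen here are arbitrary.

```python
from sklearn.ensemble import RandomForestRegressor

# Option 1: set hyperparameters when instantiating the model
rf_demo = RandomForestRegressor(n_estimators=500, random_state=1111)

# Option 2: change them afterwards, before calling fit()
rf_demo.n_estimators = 50          # direct attribute assignment
rf_demo.set_params(max_depth=10)   # set_params rejects unknown parameter names

# get_params() reports the values the next fit() call will use
params = rf_demo.get_params()
print(params['n_estimators'], params['max_depth'])
```

Changing a hyperparameter on an already fitted model has no effect on its predictions until the model is fit again.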

Feature Importance can be assessed to see how each feature contributed to the model
- for i, item in enumerate(rf.feature_importances_): print('{}: {:.2f}'.format(X.columns[i], item))

In [33]:
# Print the Feature Importances from the model created above
print('Feature Importance Scores:')
for i, item in enumerate(rf.feature_importances_):
    print('{}: {:.2f}'.format(X.columns[i], item))

Feature Importance Scores:
chocolate: 0.43
fruity: 0.04
caramel: 0.02
peanutyalmondy: 0.08
nougat: 0.00
crispedricewafer: 0.02
hard: 0.01
bar: 0.03
pluribus: 0.01
sugarpercent: 0.20
pricepercent: 0.16


### Classification Models - RandomForestClassifier
We'll use the tic-tac-toe dataset since it is a complete data set of all possible game combinations.
- x = player one
- o = player two
- b = blank space at end of game
- positive = player one wins

#### Prepare the Tic-Tac-Toe data for model building

In [51]:
# Turn Class into binary (positive=1, negative=0)
ttt.replace('positive', 1, inplace=True)
ttt.replace('negative', 0, inplace=True)

# Get Dummies for other features
ttt_prep = pd.get_dummies(ttt, drop_first=True)

#### Build the Classification model

In [88]:
# Basic Imports to use
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error as  mae

# Split the data into predictors and response
X = ttt_prep.drop(['Class'], axis=1)
y = ttt_prep.Class

# Create a training and testing set of data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1111)

# Instantiate the model
rf = RandomForestClassifier(random_state=1111)

# Fit the model
rf.fit(X_train, y_train)

# Predict the outcomes
y_pred = rf.predict(X_test)

# Score the accuracy
print('Training Set Score: ',rf.score(X_train, y_train))
print('Test Set Score: ',rf.score(X_test, y_test))

Training Set Score:  1.0
Test Set Score:  0.9861111111111112


# Ch. 2 - Validation Basics
## Holdout Samples
- Training set: Used to build and train models
- Validation set: Used to assess and compare models, and to tune hyperparameters
- Testing set: Used to assess the final model's performance

In [74]:
# Split the data into predictors and response
X = ttt_prep.drop(['Class'], axis=1)
y = ttt_prep.Class

# Imports
from sklearn.model_selection import train_test_split

# Split X into temporary and test sets, then split the temp set into train and validation sets
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=1111)

X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=1111)
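
Holding out 20% for testing and then 25% of the remaining 80% for validation gives roughly a 60/20/20 split. The sketch below checks this on a hypothetical 100-row stand-in array (not the tic-tac-toe data), with variable names chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

# First split: 80% temporary, 20% test
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1111)

# Second split: 25% of the temporary set becomes the validation set
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=1111)

print(len(X_tr), len(X_va), len(X_te))  # 60 20 20
```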

## Accuracy Metrics
### Regression
The metric you pick is application specific. MAE is measured in the units of the response and MSE in squared units, so the two should not be compared with each other.
- <b>Mean Absolute Error (MAE)</b>:
    - simplest and most intuitive
    - treats all points equally
    - less sensitive to outliers than MSE
- <b>Mean Squared Error (MSE)</b>:
    - most widely used regression metric
    - larger errors have a higher impact (due to squaring)
    - more sensitive to outliers
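
To make the outlier point concrete, here is a small hand-computed sketch with made-up numbers. The helper names `mean_abs_err` and `mean_sq_err` are hypothetical; `sklearn.metrics` provides equivalent functions.

```python
import numpy as np

def mean_abs_err(y_true, y_pred):
    # Average of the absolute differences
    return np.mean(np.abs(y_true - y_pred))

def mean_sq_err(y_true, y_pred):
    # Average of the squared differences
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_even = np.array([12.0, 18.0, 32.0, 38.0])   # every prediction off by 2
y_spike = np.array([10.0, 20.0, 30.0, 48.0])  # one prediction off by 8

print(mean_abs_err(y_true, y_even), mean_sq_err(y_true, y_even))    # 2.0 4.0
print(mean_abs_err(y_true, y_spike), mean_sq_err(y_true, y_spike))  # 2.0 16.0
```

Both sets of predictions have the same MAE, but the single large miss quadruples the MSE.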
    

In [80]:
# Basic Imports to use
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as MAE
from sklearn.metrics import mean_squared_error as MSE

# Split the data into predictors and response
X = candy.drop(['competitorname', 'winpercent'], axis=1)
y = candy['winpercent']

# Create a training and testing set of data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1111)

# Instantiate the model
rf = RandomForestRegressor(random_state=1111)

# Fit the model
rf.fit(X_train, y_train)

# Predict on the testing set
y_pred = rf.predict(X_test)

# Calculate and print error with MAE and MSE
mae_rf = MAE(y_test, y_pred)
mse_rf = MSE(y_test, y_pred)

print('Mean Absolute Error (MAE): {:.2f}'.format(mae_rf))
print('Mean Squared Error (MSE): {:.2f}'.format(mse_rf))

Mean Absolute Error (MAE): 9.91
Mean Squared Error (MSE): 144.10


### Classification Metrics
Precision, Recall, Accuracy, Specificity, F1-Score (and its variations), and many more
##### Confusion Matrix can help with these scores (from Tic-Tac-Toe classification above)
[[204  60]
 
 [ 19 484]]
- <b>Precision</b>:
    - proportion of true positives out of all predicted positive values
    - Used when we don't want to over-predict positive values
    - 484 / (484 + 60) = 0.89
- <b>Recall (Sensitivity)</b>:
    - proportion of true positives out of all actual positives
    - Used when we can't afford to miss any positive values (medical diagnosis)
    - 484 / (484 + 19) = 0.962
- <b>Accuracy</b>:
    - Proportion of correct predictions compared to entire sample
    - (204 + 484) / (204 + 60 + 19 + 484) = 0.897
    
from sklearn.metrics import accuracy_score, precision_score, recall_score
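
The arithmetic above can be reproduced directly from the four cells of the confusion matrix. This sketch assumes scikit-learn's layout, where rows are true classes and columns are predicted classes, so the matrix reads [[TN, FP], [FN, TP]].

```python
# Cells of the confusion matrix shown above, assuming [[TN, FP], [FN, TP]]
tn, fp = 204, 60
fn, tp = 19, 484

precision = tp / (tp + fp)                   # correct among predicted positives
recall = tp / (tp + fn)                      # correct among actual positives
accuracy = (tp + tn) / (tn + fp + fn + tp)   # correct among all predictions

print('Precision: {:.3f}'.format(precision))
print('Recall: {:.3f}'.format(recall))
print('Accuracy: {:.3f}'.format(accuracy))
```

With true labels and predictions available, `precision_score(y_test, y_pred)`, `recall_score(y_test, y_pred)`, and `accuracy_score(y_test, y_pred)` compute the same quantities.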

## The Bias-Variance Tradeoff
- Variance:
    - model pays too close attention to the training data
    - fails to generalize
    - low training error but high testing error
    - overfitting, associated with high model complexity
- Bias:
    - model fails to find relationships between data and response
    - high errors on both training and testing
    - underfitting, associated with low model complexity

In [94]:
# Basic Imports to use
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Split the data into predictors and response
X = ttt_prep.drop(['Class'], axis=1)
y = ttt_prep.Class

# Create a training and testing set of data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1111)

# Use a for loop to test different accuracy scores for a range of n_estimators
test_scores, train_scores = [], []
for i in [1, 2, 3, 4, 5, 10, 20, 50]:
    rfc = RandomForestClassifier(n_estimators=i, random_state=1111)
    rfc.fit(X_train, y_train)
    # Create predictions for the X_train and X_test datasets.
    train_predictions = rfc.predict(X_train)
    test_predictions = rfc.predict(X_test)
    # Append the accuracy score for the test and train predictions.
    train_scores.append(round(accuracy_score(y_train, train_predictions), 2))
    test_scores.append(round(accuracy_score(y_test, test_predictions), 2))
    
# Print the train and test scores.
print("The training scores were: {}".format(train_scores))
print("The testing scores were: {}".format(test_scores))

The training scores were: [0.94, 0.94, 0.99, 0.98, 1.0, 1.0, 1.0, 1.0]
The testing scores were: [0.83, 0.86, 0.93, 0.93, 0.93, 0.96, 0.97, 0.99]


Notice that with only one tree, both the train and test scores are at their lowest. As trees are added, both scores generally improve, and test accuracy is still rising at 50 trees, so even more trees might help. At some point, though, additional trees increase training time without reducing testing error.
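
Model complexity can also be varied through max_depth. The sketch below is a hypothetical illustration on synthetic data from `make_classification`, not the tic-tac-toe set; it records train and test accuracy for a range of depths so that underfitting at shallow depths and a widening train/test gap at larger depths can be inspected. The exact numbers depend on the generated data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1111)

depths = [1, 2, 4, 8, None]  # None lets each tree grow until its leaves are pure
train_acc, test_acc = [], []
for depth in depths:
    clf = RandomForestClassifier(n_estimators=25, max_depth=depth, random_state=1111)
    clf.fit(X_tr, y_tr)
    train_acc.append(clf.score(X_tr, y_tr))
    test_acc.append(clf.score(X_te, y_te))

for depth, tr, te in zip(depths, train_acc, test_acc):
    print('max_depth={}: train={:.2f}, test={:.2f}'.format(depth, tr, te))
```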

# Ch. 3 - Cross-Validation
- Holdout samples run the risk of holding out data that could have key effects on the model building and predictions
- something as simple as changing a random seed state can drastically affect the performance of a model
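
A rough way to see the seed effect is to repeat a holdout split with different random_state values and compare the test errors. The sketch uses a small synthetic regression set from `make_regression` as a stand-in, so the size of the spread here is not a claim about the candy data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Small synthetic dataset (an assumption for illustration only)
X_demo, y_demo = make_regression(
    n_samples=80, n_features=5, noise=20.0, random_state=0)

errors = []
for seed in range(5):
    # Only the split seed changes; the model settings stay fixed
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed)
    model = RandomForestRegressor(n_estimators=20, random_state=1111)
    model.fit(X_tr, y_tr)
    errors.append(mean_absolute_error(y_te, model.predict(X_te)))

print('Test MAE per split seed:', [round(e, 2) for e in errors])
print('Spread (max - min): {:.2f}'.format(max(errors) - min(errors)))
```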

## Cross validation: the gold standard
cross_val_score parameters:
- estimator - model to use
- X - complete training set
- y - response variable data set
- cv - number of folds
- scoring - metric used to score each fold (optional)

In [103]:
# Imports
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, make_scorer

# Split the data into predictors and response
X = candy.drop(['competitorname', 'winpercent'], axis=1)
y = candy['winpercent']

# Instantiate the regressor
rf = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=1111)

# Make scorer for Cross validation (optional)
mse = make_scorer(mean_squared_error)

# Run cross validation
cv_results = cross_val_score(rf, X, y, cv=5, scoring=mse)

# Evaluate results
print(cv_results)
print('Cross Val Mean: {}'.format(cv_results.mean()))
print('Cross Val Std: {}'.format(cv_results.std()))

[152.03588775 103.95118251  89.34700797 206.07148616 140.8862936 ]
Cross Val Mean: 138.45837159620368
Cross Val Std: 40.90097589634289
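
As an alternative to building a scorer with make_scorer, scikit-learn accepts the built-in scoring string 'neg_mean_squared_error'. It returns the negated MSE so that higher is always better, which means the sign has to be flipped to recover the MSE. The sketch below uses synthetic stand-in data rather than the candy set.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption for illustration only)
X_demo, y_demo = make_regression(
    n_samples=60, n_features=4, noise=10.0, random_state=0)

rf_demo = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=1111)

# Built-in scorer: values are negative MSE, one per fold
neg_mse = cross_val_score(rf_demo, X_demo, y_demo, cv=5,
                          scoring='neg_mean_squared_error')
mse_scores = -neg_mse  # flip the sign to get ordinary MSE

print('Cross Val Mean: {:.2f}'.format(mse_scores.mean()))
print('Cross Val Std: {:.2f}'.format(mse_scores.std()))
```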


### Leave-one-out Cross Validation (LOOCV)
LOOCV is k-fold CV where k = n, the number of observations. Every point is used as a validation set all by itself, so the model is fit n times, which makes LOOCV very computationally expensive.

- Use when:
    - data is limited
    - want to use as much data for training as possible
- Don't use when:
    - computation resources are limited
    - large datasets
    - lots of parameters to test

In [105]:
# Imports
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, make_scorer

# Split the data into predictors and response
X = candy.drop(['competitorname', 'winpercent'], axis=1)
y = candy['winpercent']

# Instantiate the regressor
rf = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=1111)

# Make scorer for Cross validation (optional)
mse = make_scorer(mean_squared_error)

# Run cross validation, using the number of observations as the number of folds
n = X.shape[0]
cv_results = cross_val_score(rf, X, y, cv=n, scoring=mse)

# Evaluate results
print('Cross Val Mean: {}'.format(cv_results.mean()))
print('Cross Val Std: {}'.format(cv_results.std()))

Cross Val Mean: 134.12865873926833
Cross Val Std: 180.11962158885254
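
Instead of passing the number of observations as cv, the splitter class `LeaveOneOut` from `sklearn.model_selection` can be passed directly. The sketch below runs it on a small synthetic dataset (a stand-in, not the candy data). Since each fold scores a single point, the per-fold errors vary widely, which is probably why the standard deviation above is so much larger than in the 5-fold run.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset (an assumption for illustration only)
X_demo, y_demo = make_regression(
    n_samples=30, n_features=4, noise=10.0, random_state=0)

rf_demo = RandomForestRegressor(n_estimators=10, max_depth=5, random_state=1111)

# One fold per observation
loo = LeaveOneOut()
neg_mse = cross_val_score(rf_demo, X_demo, y_demo, cv=loo,
                          scoring='neg_mean_squared_error')
loo_scores = -neg_mse  # squared error of each held-out point

print('Number of folds:', len(loo_scores))
print('Cross Val Mean: {:.2f}'.format(loo_scores.mean()))
print('Cross Val Std: {:.2f}'.format(loo_scores.std()))
```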


# Ch. 4 - Hyperparameter Tuning
Parameters
- learned or calculated by the algorithm based on training data

Hyperparameters
- Set manually by the modeler to tune the performance
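
The distinction can be seen on a random forest. Hyperparameters are available through get_params() before any data is seen, while learned attributes (scikit-learn marks them with a trailing underscore) exist only after fit(). This is a minimal sketch on synthetic stand-in data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption for illustration only)
X_demo, y_demo = make_regression(
    n_samples=50, n_features=4, noise=5.0, random_state=0)

rf_demo = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=1111)

# Hyperparameters: set by the modeler, readable before fitting
print('max_depth:', rf_demo.get_params()['max_depth'])
fitted_before = hasattr(rf_demo, 'estimators_')  # False, nothing learned yet

rf_demo.fit(X_demo, y_demo)

# Learned from the training data: the fitted trees and feature importances
print('Number of fitted trees:', len(rf_demo.estimators_))
print('Feature importances:', rf_demo.feature_importances_.round(2))
```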