#### From Lecture
#### Training and Evaluation
* training to learn underlying concepts - not memorize
* allows for generalization from training to test 
* underfitting - model is too simple, only looks at basic patterns
    * high error on training and testing data
* overfitting - too complex of a model, matches the training data too perfectly - very little error
    * too many parameters/degrees of freedom
    * low error on training, high error on testing data
* for linear regression, compare R2 on train and test
    * always want a higher R2 on test (better generalization)
    * check for overfitting with the gap between training R2 and test R2 (e.g. much better on training than on test)
    * underfitting - check test R2 itself rather than the train/test gap; if model1 has a lower test R2 than model2, model1 might be underfitting 
* too complex (overfitting) - train R2 approaches 1, and test R2 can become negative
    * not enough data points per features
* can plot R2 scores per complexity on train and test
* main goal is to get a good R2 score on test - try to maximize performance on test
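
A minimal sketch of the train/test R2 comparison described above. The synthetic quadratic data, the polynomial degrees (1, 2, 12) and the variable names are assumptions chosen for illustration, not part of the lecture.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): a noisy quadratic trend
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = X[:, 0] ** 2 + rng.normal(0, 1.0, size=40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit models of increasing complexity and record R2 on both splits
results = {}
for degree in (1, 2, 12):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    r2_train = r2_score(y_train, model.predict(X_train))
    r2_test = r2_score(y_test, model.predict(X_test))
    results[degree] = (r2_train, r2_test)
    print(f"degree={degree:2d}  train R2={r2_train:.3f}  test R2={r2_test:.3f}")
```

A low R2 on both splits (degree 1 here) points to underfitting, while a large gap between train and test R2 points to overfitting.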

#### Sampling Bias
* when the training data comes from a different distribution than the new (test) data
* e.g. studying past exams, then being tested on new exams written by a different teacher
* old time periods vs new time periods, different locations, sexes, phone/email/polling data

#### Cross validation
* get a different training and test set every time
* smaller data sets = more likely to suffer from sampling bias
* vary which data you are using as test data, then average the resulting error (loss function)
* if the errors have a high standard deviation (variance) across splits - we probably need more data
* k-fold: with k=5, hold out a different fifth (20%) of the data each time until 100% of the data has been used as test
* other method: leave one out - hold out a single data point as the test set each time, a different one every time
* average R2 - because multiple test sets are used, the cross-validated average is a more reliable measure of R2 
* if model is not hard to run - cross validation can be done easily
* done manually or through KFold on sklearn
* more robust estimate of true sample error

#### After cross-validating
* none of the cross-validation models are actually used - they only give us an estimate of how the model will perform
* once we have that estimate, we can train our model on the entire data set
* the choice of model is based on the cross-validation results
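
A small sketch of this workflow: estimate performance with k-fold cross validation, then refit on all the data. The synthetic data, the choice of `LinearRegression` and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (assumption)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.5, size=100)

# 5 folds: each fold is used once as the test set
cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R2 per fold:", np.round(scores, 3))
print("mean R2:", scores.mean())
print("std of R2:", scores.std())  # a large spread suggests more data is needed

# The five fold models are discarded; the final model is trained on all data
final_model = LinearRegression().fit(X, y)
```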

#### Hyperparameter tuning
* parameters that are set before training - not learned during training 
* usually done via guessing and checking - try hyperparameters and set to whatever gives the lowest error
* common techniques: grid search (tries every combination in a specified grid), random search (selects at random within some boundary)
* evaluate which combination achieves the lowest loss - can be done with cross validation
* can become computationally expensive 
* other advanced techniques: bayesian optimization, genetic algorithms
* goal is to choose the best model: first step choose the right model, and then choose the right hyperparameters to fine tune even more
* hyperparameters are being tuned on the cross validation data - so the scores are somewhat overfit to it
* can do an initial split again with train/test, and then subset train again for k-folds
* create a 'validation' dataset so hyperparameters aren't being trained on **any** test data  
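
One possible sketch of the split-then-tune idea above: hold out a test set first, run a random search with k-fold cross validation on the training part only, then score once on the untouched test set. The `Ridge` model, the `alpha` range and the synthetic data are assumptions made for this example.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data (assumption): only the first two features matter
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1.0, size=200)

# Initial split: the test set is never seen during tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

# Random search: sample 20 alpha values, each scored with 5-fold CV on the training set
search = RandomizedSearchCV(
    Ridge(),
    param_distributions={"alpha": loguniform(1e-3, 1e3)},
    n_iter=20,
    cv=5,
    random_state=2,
)
search.fit(X_train, y_train)

best_alpha = search.best_params_["alpha"]
test_r2 = search.score(X_test, y_test)  # uses the model refit with the best alpha
print("best alpha:", best_alpha)
print("mean CV R2 for best alpha:", search.best_score_)
print("R2 on held-out test set:", test_r2)
```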


#### Classification
* same thing can be done for categorical variables
* evaluate using accuracy/precision - predicted vs actual (see lecture for more details)



#### From Compass
* Validation - deciding whether numerical results about the relationships between variables are acceptable 
* common types: holdout, k-fold, leave one out, bootstrap
* Cross Validation - testing performance of the model on 'unseen' data
    * holdout method (splitting training and test)
    * k-fold cross validation 

Holdout Method
* split data for train and test
* test and 'unseen' data must be similar, test must be close approximation of unseen
* a lot rides on the test data 
* representation of training and test - may not be equal 
* instead we can use k-fold Cross Validation

K-Fold Cross Validation
* essentially repeating the holdout method k times, k subsets of the data gets used as test/validation set
* creates and evaluates multiple models on multiple subsets of the dataset
* reduces selection bias 
* reduces bias and variance as most of the data is being used in validation
* general rule of 5-10 Ks
* creating a 'population' of performance measures (loss functions)
* we can then calculate the mean/st.d of these measures to get an idea of how well the procedure does on average and how much they vary
* resampling method
* goal is to find a model that maximizes estimated skill
* once a model is finalized - it is saved for use - and makes predictions on new data
* cross-validation models and train-test datasets can be discarded
* final model is then trained on all available data

Stratified K-fold cross validation
* each fold contains the same percentage of samples of each target class as the complete set
* e.g. with more blues than reds in the dataset, unstratified folds can bias classification towards blue
* ensures each class is represented in the same proportion in every training and test fold
* K-fold and stratified K-fold are non-exhaustive cross validation methods - they do not compute all ways of splitting the data; the number of subsets is decided beforehand 
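
A short sketch of stratification using `StratifiedKFold` from sklearn. The toy labels (80 "blue", 20 "red") are an assumption chosen to mirror the example above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels (assumption): 80 "blue" (0) and 20 "red" (1)
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

red_share = []
for train_index, test_index in skf.split(X, y):
    # fraction of "red" samples in this test fold
    red_share.append(y[test_index].mean())
    print("test size:", len(test_index), "share of red:", y[test_index].mean())
```

Each test fold keeps the 20% share of reds found in the full dataset.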

**Exhaustive Methods of Cross Validation**

Leave-P-out Cross validation
* leaves p data points out of the training data 
* the p left-out points form the validation set; the remaining n-p are used for training
* repeated for all combinations in which the original sample can be separated 
* similar to k-fold: leave-one-out is the specific case k=n, which trains on all data except 1 point and averages the error 
* good way to validate because each data point becomes part of validation, but very intensive because the model must be refit every time
* popular: p=1, "Leave one out cross validation"
    * intensive computation
    * number of possible combinations = number of data points
* useful technique for getting effectiveness of model
* mitigates overfitting, and deciding hyper parameters 
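
To make the cost of exhaustive methods concrete, here is a small sketch that counts the splits produced by `LeaveOneOut` and `LeavePOut` in sklearn. The six-point toy array is an assumption.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

# Toy data (assumption): 6 data points
X = np.arange(6).reshape(-1, 1)
n = len(X)

n_loo = LeaveOneOut().get_n_splits(X)     # one split per data point
n_lpo = LeavePOut(p=2).get_n_splits(X)    # one split per pair of points

print("leave-one-out splits:", n_loo)     # 6
print("leave-2-out splits:", n_lpo)       # 15 = C(6, 2)

# Each leave-one-out split validates on exactly one point
for train_index, test_index in LeaveOneOut().split(X):
    print("train:", train_index, "validate:", test_index)
```

The number of leave-p-out splits grows as C(n, p), which is why these methods get expensive quickly.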

**Best Practices**
* K-fold cross validation 
    * for small amount of data, and can't afford to drop values
* Holdout set and Cross Validation
    * splits data into train/test
    * apply cross-validation on training set to get optimal model
    * test model on test set to see performance on "new"/"unseen" data


Bootstrap Methods
* randomly draw datasets from the training sample - with replacement
* data point can be selected more than once
* each sample is the same size as the training sample
* refit the model with bootstrap samples
* examine the model
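
The bootstrap steps above can be sketched with numpy alone. The linear toy data, the 500 resamples and the choice to examine the fitted slope are assumptions for illustration.

```python
import numpy as np

# Toy training sample (assumption): y = 2x + 1 plus noise
rng = np.random.default_rng(3)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=n)

n_boot = 500
slopes = np.empty(n_boot)
for b in range(n_boot):
    # draw n indices with replacement: a point can be selected more than once
    idx = rng.integers(0, n, size=n)
    # refit the model (a straight line) on the bootstrap sample
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    slopes[b] = slope

# examine the model: how much does the fitted slope vary across resamples?
print("mean slope:", slopes.mean())
print("std of slope:", slopes.std())
print("95% interval:", np.percentile(slopes, [2.5, 97.5]))
```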


#### Overfitting
* noise interferes with signal, algorithm that's too complex ends up memorizing the noise instead of finding the signal
* 'goodness of fit' - linked to approximation error
    * how closely a model's predicted values match the observed (true) values
* Overfit
    * model that learned the noise instead of the signal is considered overfit 
    * more variance in their predictions
    * too much variance = too complex
* Underfitting - model is too simple
    * less variance in their predictions - but more biased towards wrong outcomes
    * too much bias - too simple 
* want to get sweet spot  with low bias and low variance
* Detecting overfitting
    * can't know until testing
    * can use the holdout method (train-test split) to see performance on 'new' data
    * does well on training but shows a significant drop-off on test data
    * start with a simple model and make more complex
* Preventing overfitting
    * 1. cross-validation - use initial training data to generate multiple mini train-test splits, use these to tune your model
    * standard k-fold cross validation - partition the data into k subsets (folds)
    * iterate on folds while holding out as a test 
    * 2. train with more data, - can help algorithms detect the signal better
    * make sure data you are adding is relevant and clean
    * 3. Remove features
    * removing irrelevant input features
    * make sure the features make sense 
    * 4. Early stopping
    * how well each iteration performs, stop training process early
    * usually for deep learning
    * 5. Regularization
    * used commonly for classical machine learning
    * forcing model to be simpler (ridge, lasso)
    * pruning decision tree, dropout on neural network, penalty parameters on the cost function in regression
    * can be tuned through cross validation (because it is a hyperparameter)
    * 6. Ensembling
    * combining predictions from multiple separate models
    * Bagging - reduces the chance of overfitting complex models - trains a large number of strong learners in parallel
    * combines all 'strong' learners together to smooth out predictions
    * strong to smooth 
    * Boosting - trains a large number of weak learners (constrained model)
    * each one in the sequence focuses on learning from mistakes of the one before it
    * combines all weak learners into a single strong learner 
    * simple to complex (boost)
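
A hedged sketch of the two ensembling styles above: bagging of unconstrained ("strong") trees versus boosting of depth-1 ("weak") trees. The noisy sine data and all parameter values are assumptions picked for illustration.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Toy data (assumption): noisy sine wave
rng = np.random.default_rng(4)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

# A single unconstrained tree, for reference
tree = DecisionTreeRegressor(random_state=4).fit(X_train, y_train)

# Bagging: many unconstrained trees trained independently on bootstrap samples,
# predictions averaged to smooth them out
bagging = BaggingRegressor(
    DecisionTreeRegressor(), n_estimators=100, random_state=4
).fit(X_train, y_train)

# Boosting: many depth-1 trees trained in sequence,
# each one fitting the errors left by the ones before it
boosting = GradientBoostingRegressor(
    max_depth=1, n_estimators=200, random_state=4
).fit(X_train, y_train)

for name, model in [("single tree", tree), ("bagging", bagging), ("boosting", boosting)]:
    print(f"{name:12s} train R2={model.score(X_train, y_train):.3f} "
          f"test R2={model.score(X_test, y_test):.3f}")
```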

Walkthrough - Data Splitting

In [3]:
# initialize a list
X = list(range(10))
print(X)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


In [5]:
# squared values with list comprehension
y = [x*x for x in X]
print(y)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


In [7]:
# splitting the data to train and test
import sklearn.model_selection as model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, train_size=0.75,test_size=0.25, random_state=101)
print ("X_train: ", X_train)
print ("y_train: ", y_train)
print("X_test:", X_test)
print ("y_test: ", y_test)

X_train:  [4, 9, 3, 5, 7, 6, 1]
y_train:  [16, 81, 9, 25, 49, 36, 1]
X_test: [8, 2, 0]
y_test:  [64, 4, 0]


In [9]:
# note: by default train_test_split does not keep the original ascending order,
# i.e. the split shuffles the data (shuffle=True by default, can be set to False)
# this does the same thing as above - older code imported it from sklearn.cross_validation (since removed), hence the alias
import sklearn.model_selection as cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, train_size=0.75, random_state=101)

Using KFold cross validation
* note that with n_splits=5 each fold holds out 1/5 (0.2) of X as test, i.e. 2 of the 10 points
* shuffle is False by default for KFold (set to True below)

In [15]:
from sklearn.model_selection import KFold
import numpy as np
kf = KFold(n_splits=5, shuffle=True)
X = np.array(X)
y = np.array(y)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("X_test: ", X_test)

X_test:  [2 8]
X_test:  [4 9]
X_test:  [0 7]
X_test:  [1 3]
X_test:  [5 6]


To see the score of each of your cross validation
* can use model_selection.cross_val_score  - see tutorial
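
A minimal sketch of `cross_val_score` on the same squared-values data as the walkthrough. The choice of `LinearRegression` and of negative MSE as the scoring metric are assumptions; with only two test points per fold, MSE behaves more sensibly than R2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Same data as the walkthrough: X = 0..9, y = X squared
X = np.arange(10).reshape(-1, 1)
y = X[:, 0] ** 2

kf = KFold(n_splits=5, shuffle=True, random_state=101)

# sklearn scorers follow "higher is better", so MSE is returned negated
scores = cross_val_score(
    LinearRegression(), X, y, cv=kf, scoring="neg_mean_squared_error"
)
mse_per_fold = -scores

print("MSE per fold:", mse_per_fold)
print("mean MSE:", mse_per_fold.mean())
```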

#### Class imbalance
* over representation of one class of data, not equally distributed
* e.g. fraud prediction - 1% of frauds, and the rest are normal observations, Churn/NoChurn
* models will be biased towards predicting normal observations rather than the 1% frauds
* can use 'resampling' 
* can start with undersampling the class with the bigger occurrence
* remember to always test the model on a test set with the original class proportions - it must have the correct representation

Ways to Combat
* 1. collect more data, more examples of minor classes
* 2. change performance metric. - can't use accuracy here
    * confusion matrix
        * showing correct (diagonal) and incorrect 
    * precision - measure of classifier exactness
    * recall - measure of classifier completeness (out of the actual positives, how many were found)
    * F1 Score (F-score) - harmonic mean of precision and recall
    * Cohen's kappa
    * ROC curves
* 3. Resampling the data
    * more balanced data
    * add copies of instances from underrepresented class (oversampling) (< 10k data)
    * delete instances from over-represented class called under sampling (100k+ data)
    * stratified sampling
    * resampled ratios 
* 4. Synthetic samples
    * naive bayes to sample each attribute independently when run in reverse
    * data non-linear relations may not be preserved
    * SMOTE (synthetic minority over-sampling technique) - instead of creating copies it creates synthetic samples
* 5. Try different algorithms
    * decision trees are often good for imbalanced data sets
    * if in doubt try a random forest
* 6. Penalized models
    * give same models a different perspective on the problem
    * SVM, LDA have penalized version
* 7. Different perspective
    * anomaly or change detection
    * anomaly detection of rare events - detecting outliers
    * change detection - looks for change/difference e.g. bank transactions
* 8. creativity
    * break down to smaller problems 
    * different approaches possible
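
A sketch of random oversampling (item 3 above) with `sklearn.utils.resample`. The 5% minority rate and the random features are assumptions. The test set is split off first, so it keeps the original class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data (assumption): 50 positives out of 1000
rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 2))
y = np.array([1] * 50 + [0] * 950)

# Split first, stratified, so the test set keeps the real proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=5
)

# Oversample the minority class in the training set only
X_min = X_train[y_train == 1]
X_maj = X_train[y_train == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=5)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))

print("balanced training counts:", np.bincount(y_bal))
print("test share of positives:", y_test.mean())
```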

#### Sampling Bias
* the way you find your respondents affects the questions you can ask them
* who you are asking will bias the response you will get 
* some members of intended population have lower sampling probability
* sample is not representative of the population you want

#### Model Evaluation
* quantifying model performance
* depends on machine learning task, such as classification, regression, or clustering
* Regression
    * MSE (mean squared error)
        * most preferred - chosen more often than other metrics 
        * can be optimized more easily 
    * Root mean squared error (RMSE)
        * square root of the averaged squared difference
        * leads to high penalty for large errors - useful when you don't want large errors 
    * mean abs error (MAE)
        * absolute difference between actual and predicted
        * robust to outliers (doesn't penalize errors as much)
        * all weighted equally (linear)
        * bad if you want to look at outliers 
    * R2 - coefficient of determination
        * 1-MSE(model)/MSE(baseline)
        * compares current model with constant baseline - and how much our model is better 
        * baseline is just the mean of the data 
        * scale free - less than or equal to 1
        * can be negative (neg infinity to 1) - usually means the model does not follow the trend in the data 
        * negative when the MSE of the model (numerator) is higher than the MSE of the baseline - ratio > 1
            * could be due to outliers, or a missing intercept 
    * adjusted R2
        * *doesn't actually adjust the model* - but adjusts the R2
        * never higher than R2 because it applies a penalty for the number of predictors 

* Classification
    * actual vs predicted - get TP, FP, FN, TN. P/N is the predicted label, T/F is whether that prediction matches the actual label 
    * accuracy
        * (TP + TN) / (TP + FP + FN + TN)
        * may not always be what you want
        * can be very accurate by labeling everyone Negative - misleading if positives are really rare and they are what you care about detecting 
        * for example COVID+ people
    * recall
        * better evaluation of how model performs in finding true positives
        * gives fraction of correctly identified positive out of all the positive 
        * Recall = TP / (TP + FN)
        * note false negative - meaning labeled negative, but actually positive
        * falls short if you end up with a lot of false positives - they are ignored by the calculation
    * precision
        * TP / (TP + FP)
        * fraction of correctly identified positive out of all labeled positive 
        * penalizes false positives - many false positives give a low score 
        * good when detecting rare TP cases, and have good way of detecting TNs 
        * leads to a precision/recall trade off 
    * summary:
        * all negative - accuracy: high, recall: low, precision: low
        * all positive - accuracy: low, recall: high, precision: low
        * only the max probability as positive: acc: high, rec: low, prec: high
        * increase in precision, reduces recall (and vice versa)
        * depends on problem if you want to maximize detection of TP, and don't care about false positives for e.g. 
    * precision/recall tradeoff
        * can strategize to predict positive if output reliability is > 0.3 (higher recall, lower precision) - maximize positive detection
        * can strategize for higher precision - predict positive if output reliability >0.9 (low recall, high precision) 
    * F1-score
        * combines precision and recall scores into one 
        * 2 * precision * recall / (precision + recall)
        * harmonic mean - pulled towards the smaller value (precision 1 and recall 0 give an arithmetic mean of 0.5 but an F1 of 0)
        * good when you have similar precision/recall 
    * ROC curve/AUC score
        * receiver operating characteristic 
        * classification problems with probability outputs 
        * convert probability outputs to classifications to be more or less selective of what's labeled positive/negative
        * ROC plots false positive rate vs true positive rate 
        * adjusting threshold can see effects on TPr and FPr
        * plot TPr and FPr against each other - will get a curve, find area under the curve (AUC)
        * larger AUC means TPr rises quickly while FPr remains low
    * Lift/Lift-curve [revisit](https://algolytics.com/tutorial-how-to-establish-quality-and-correctness-of-classification-models-part-5-lift-curve/)
        * pictures gains from applying a classifier to a section of data compared to a random classifier
        * with percentage scale (accumulated/nonaccumulated)
        * with quotient scale (acc/nonacc)
        * first sort data (descending) 
        * divide into quantiles - number of quantiles is important, too small - non-reliable, too large and not detailed enough 
        * plots density of positive observations over quantiles (step function going down)
        * accumulated - will show more a curve, non-accumulated lift will be more stepwise 
        * can get a quotient scale
        * (??) needs revisiting
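
The accuracy, precision, recall and F1 formulas from the list above can be checked by hand from a confusion matrix. The labels below are toy values (assumption). `confusion_matrix(...).ravel()` returns the counts in the order TN, FP, FN, TP for binary labels.

```python
from sklearn.metrics import confusion_matrix

# Toy binary labels (assumption)
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0, 1]

# Unpack the 2x2 confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)          # of everything labeled positive, how much was right
recall = tp / (tp + fn)             # of all actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("accuracy:", accuracy)
print("precision:", precision)
print("recall:", recall)
print("F1:", f1)
```

With these toy labels TP=4, FP=1, FN=1, TN=4, so all four metrics come out at 0.8.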

Regression Model Evaluation

In [17]:
import numpy as np
y_true = np.random.normal(0,1,10)
# generate random errors
errors = np.random.normal(0,0.02,10)
y_pred = y_true + errors

In [18]:
# import MSE from sklearn
from sklearn.metrics import mean_squared_error
# compute MSE takes two arrays, true values and prediction values
MSE = mean_squared_error(y_true,y_pred)  
# print MSE
print(MSE)

0.0004382144625764541


In [20]:
# for RMSE
# RMSE by Numpy
RMSE = np.sqrt(MSE)
print(RMSE)
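# RMSE by sklearn (the squared argument was removed in scikit-learn 1.6; newer versions provide sklearn.metrics.root_mean_squared_error)
RMSE = mean_squared_error(y_true,y_pred,squared=False)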
print(RMSE)

0.020933572618558306
0.020933572618558306
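
For the remaining regression metrics in the notes (MAE, R2, adjusted R2), here is a numpy-only sketch using the formulas above. The four toy values and the assumed number of predictors `p = 1` are illustrative only.

```python
import numpy as np

# Toy values (assumption)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y_true - y_pred))

# R2 = 1 - MSE(model) / MSE(baseline), where the baseline always predicts the mean
mse_baseline = np.mean((y_true - y_true.mean()) ** 2)
r2 = 1 - mse / mse_baseline

# Adjusted R2 penalizes the number of predictors p (assumed to be 1 here)
n, p = len(y_true), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print("MSE:", mse)
print("RMSE:", rmse)
print("MAE:", mae)
print("R2:", r2)
print("adjusted R2:", adj_r2)
```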


Classification model evaluation

In [22]:
# note ONLY for binary
# ground truth
y_true = [1,1,0,1,0,0,1,0,0,1]

# simulate probabilities of the positive class for each observation
y_proba = [0.9,0.7,0.2,0.99,0.7,0.1,0.5,0.2,0.4,0.6]

# set the threshold to predict positive class - anything above will be labelled as positive 
thres = 0.5

# class predictions
y_pred = [int(value > thres) for value in y_proba]

In [23]:
# for accuracy 
# import accuracy_score from sklearn
from sklearn.metrics import accuracy_score

# compute accuracy
accuracy = accuracy_score(y_true,y_pred)

# print accuracy
print(accuracy)

0.8


In [24]:
# for f1_score
from sklearn.metrics import f1_score

# compute F1-score (stored as f1 so the imported f1_score function is not overwritten)
f1 = f1_score(y_true,y_pred)

# print F1-score
print(f1)

0.8000000000000002


In [26]:
# ROC_AUC score - uses probabilities instead of class labels; passing binary predictions gives a less informative score (the second argument is the probability)
# import roc_auc_score from sklearn
from sklearn.metrics import roc_auc_score

# compute AUC-score
auc = roc_auc_score(y_true,y_proba)

# print AUC-score
print(auc)


0.9
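
To illustrate the precision/recall trade-off from the notes, the sketch below reuses the same toy `y_true` and `y_proba` and varies the threshold. The three threshold values are assumptions for illustration.

```python
from sklearn.metrics import precision_score, recall_score

# Same toy values as the cells above
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]
y_proba = [0.9, 0.7, 0.2, 0.99, 0.7, 0.1, 0.5, 0.2, 0.4, 0.6]

results = {}
for thres in (0.3, 0.5, 0.9):
    # anything strictly above the threshold is labeled positive
    y_pred = [int(p > thres) for p in y_proba]
    results[thres] = (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
    )
    print(f"threshold={thres}: precision={results[thres][0]:.3f} "
          f"recall={results[thres][1]:.3f}")
```

On this data a threshold of 0.3 finds every positive (recall 1.0) at precision 5/7, while 0.9 gives precision 1.0 but recall of only 0.2.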


#### Grid Search in sklearn
* used for hyperparameter tuning

In [28]:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn import datasets, svm
import matplotlib.pyplot as plt

In [29]:
# Load the digit data - darkness of pixel in an 8x8 image of handwritten digit
digits = datasets.load_digits()

In [30]:
# View the features of the first observation - handwritten 0
digits.data[0:1]

array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
        15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
        12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
         0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
        10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])

In [31]:
# splitting data for later cross validation
# Create dataset 1
data1_features = digits.data[:1000]
data1_target = digits.target[:1000]

# Create dataset 2
data2_features = digits.data[1000:]
data2_target = digits.target[1000:]

In [32]:
parameter_candidates = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

In [33]:
# test out all combinations of parameters and keep the one with the highest score - default cv is 5-fold since scikit-learn 0.22 (3-fold before), stratified for classifiers
# Create a classifier object with the classifier and parameter candidates
clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates, n_jobs=-1)

# Train the classifier on data1's feature and target data
clf.fit(data1_features, data1_target)  

In [36]:
# check the accuracy 
# View the accuracy score
print('Best score for data1:', clf.best_score_) 

Best score for data1: 0.966


In [38]:
# View the best parameters for the model found using grid search
print('Best C:',clf.best_estimator_.C) 
print('Best Kernel:',clf.best_estimator_.kernel)
print('Best Gamma:',clf.best_estimator_.gamma)

Best C: 10
Best Kernel: rbf
Best Gamma: 0.001
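
A possible follow-up, sketched under the assumption that dataset 2 is meant as the held-out set: score the tuned model on `data2`, which the grid search never saw. The block repeats the setup so it runs on its own; the exact score depends on the scikit-learn version.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()

# Dataset 1 is used for the grid search, dataset 2 is held out
data1_features, data1_target = digits.data[:1000], digits.target[:1000]
data2_features, data2_target = digits.data[1000:], digits.target[1000:]

parameter_candidates = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates)
clf.fit(data1_features, data1_target)

# GridSearchCV refits the best parameters on all of data1 by default,
# so score() evaluates that refit model on the unseen data2
data2_score = clf.score(data2_features, data2_target)
print('Best parameters:', clf.best_params_)
print('Accuracy on data2:', data2_score)
```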
