## Model Evaluation

To evaluate a model, you compare the predicted values, $\hat y$, with the actual values, $y$. The difference between an actual value and its prediction is called a residual. When using a train-test split, you compute residuals for both the training set and the test set:

$r_{i,train} = y_{i,train} - \hat y_{i,train}$ 

$r_{i,test} = y_{i,test} - \hat y_{i,test}$ 

To get a summarized measure over all the instances in the test set and training set, a popular metric is the (Root) Mean Squared Error:

RMSE = $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat y_{i})^2}$

MSE = $\frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat y_{i})^2$

Again, you can compute these for both the training and the test set. A test-set (R)MSE that is much larger than the training-set (R)MSE is an indication of overfitting.
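
As a minimal sketch of the formulas above, the residuals, MSE, and RMSE can be computed directly with NumPy. The arrays `y` and `y_hat` are made-up illustration values, not output from any model in these notes.

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y - y_hat            # r_i = y_i - y_hat_i
mse = np.mean(residuals ** 2)    # mean of the squared residuals
rmse = np.sqrt(mse)              # square root puts the error back in the units of y

print('MSE:', mse)
print('RMSE:', rmse)
```

Because the RMSE is in the same units as the target, it is often the easier of the two to interpret.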

#### R squared

    r2 = model.score(X,y)
    print(r2)
    


#### Mean Squared Error for test and train 

    # residuals follow the definition above: actual minus predicted
    train_residuals = y_train - y_hat_train
    test_residuals = y_test - y_hat_test
    from sklearn.metrics import mean_squared_error
    train_mse = mean_squared_error(y_train, y_hat_train)
    test_mse = mean_squared_error(y_test, y_hat_test)
    print('Train Mean Squared Error:', train_mse)
    print('Test Mean Squared Error:', test_mse)
    
#### Mean Absolute Error
    
    train_residuals = y_train - y_hat_train
    test_residuals = y_test - y_hat_test
    from sklearn.metrics import mean_absolute_error
    # mean absolute error on the training and test predictions
    train_mae = mean_absolute_error(y_train, y_hat_train)
    test_mae = mean_absolute_error(y_test, y_hat_test)
    print('Train Mean Absolute Error:', train_mae)
    print('Test Mean Absolute Error:', test_mae)

    
#### K-Fold Cross Validation

K-Fold Cross Validation expands on the idea of training and testing splits by splitting the entire dataset into $K$ equal sections (folds) of data. We then iteratively train $K$ linear regression models, with each model using a different fold as the testing set and all other folds combined as the training set.

We can then average the individual results from each of these models to get a cross-validation MSE. This is usually a more reliable estimate of the model's MSE on unseen data than a single train-test split, because results from folds that happen to be unusually high or unusually low tend to average out.

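    from sklearn.model_selection import cross_val_score
    # scores are negated MSEs (one per fold), so flip the sign before averaging
    cv_5_results = cross_val_score(linreg, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_5_mse = -cv_5_results.mean()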
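
To make the procedure described above concrete, here is a rough sketch that runs the $K$ train/test rounds by hand with `KFold` and then checks the average against `cross_val_score`. The data is synthetic and generated only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data for illustration only: y = 2*x0 - x1 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100)

kf = KFold(n_splits=5)
fold_mse = []
for train_idx, test_idx in kf.split(X):
    # Fit on K-1 folds, evaluate on the single held-out fold
    linreg = LinearRegression().fit(X[train_idx], y[train_idx])
    y_hat = linreg.predict(X[test_idx])
    fold_mse.append(mean_squared_error(y[test_idx], y_hat))

cv_mse = np.mean(fold_mse)
print('Per-fold MSE:', fold_mse)
print('Cross-validation MSE:', cv_mse)

# cross_val_score returns negated MSE, so flip the sign to compare
scores = cross_val_score(LinearRegression(), X, y, cv=kf,
                         scoring='neg_mean_squared_error')
print('Same via cross_val_score:', -scores.mean())
```

Passing the same `KFold` object as `cv` makes both approaches use identical splits, which is why the two averages agree.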

## Bias variance trade-off (underfitting and overfitting)

Another perspective on the problem of overfitting versus underfitting is the bias-variance trade-off. The idea is that we can decompose the expected Mean Squared Error as the sum of 
- *squared bias*
- *variance*, and
- *irreducible error*.

Formally, this is written as: 
$MSE = Bias(\hat{f}(x))^2 + Var(\hat{f}(x)) + \sigma^2$. The derivation of this result can be found on the Wikipedia page of the bias-variance trade-off, [here](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff#Derivation).

<img src="./images/bias_variance.png" alt="Drawing" style="width: 500px;"/>
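
The decomposition can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "true" function `f`, the noise level `sigma`, the evaluation point `x0`, and the polynomial degrees are all arbitrary choices made for this example. For each degree, a polynomial is fitted on many freshly drawn training sets, and the squared bias and variance of its prediction at `x0` are estimated from the spread of those predictions.

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    # Assumed "true" function, chosen only for this illustration
    return np.sin(2 * np.pi * x)

sigma = 0.3     # standard deviation of the irreducible noise
x0 = 0.25       # point at which the decomposition is evaluated
n_train, n_sims = 30, 2000

def simulate(degree):
    """Fit a polynomial of the given degree on many fresh training sets
    and return (bias^2, variance) of the prediction at x0."""
    preds = np.empty(n_sims)
    for s in range(n_sims):
        x = rng.uniform(0, 1, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        coefs = np.polyfit(x, y, degree)
        preds[s] = np.polyval(coefs, x0)
    bias_sq = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    return bias_sq, variance

results = {degree: simulate(degree) for degree in (1, 3, 7)}
for degree, (bias_sq, variance) in results.items():
    expected_mse = bias_sq + variance + sigma ** 2
    print(f'degree={degree}: bias^2={bias_sq:.4f}, '
          f'variance={variance:.4f}, expected MSE={expected_mse:.4f}')
```

In a run like this, the straight line (degree 1) should show by far the largest squared bias, since it cannot follow the curve, while the degree-7 fit should show a larger variance than the degree-3 fit. The `sigma ** 2` term is the same for every model, which is why it is called irreducible.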

## AIC and BIC

#### AIC

The AIC (Akaike Information Criterion) is generally used to compare candidate models. The nice thing about the AIC is that for every model fitted with **Maximum Likelihood Estimation**, the log-likelihood is already computed, and as a consequence the AIC is very easy to calculate:

**AIC(model) = -2 \* log-likelihood(model) + 2 \* (length of the parameter space)**

The AIC acts as a penalized log-likelihood criterion, balancing goodness of fit (a high log-likelihood) against complexity (complex models are penalized more than simple ones). The AIC is unbounded and can take any real value, so its absolute size carries no meaning; the bottom line is that when comparing models fitted to the same data, the model with the lowest AIC should be selected.

Note that directly comparing the maximized log-likelihoods of different models (without the penalty) is not good enough for model comparison, because adding parameters to a model never decreases the maximized log-likelihood. For that reason, searching for the model with maximal log-likelihood would always lead to the model with the most parameters. The AIC balances this by penalizing the number of parameters, hence favoring models with few parameters that still fit the data well.

#### BIC

The BIC (Bayesian Information Criterion) is very similar to the AIC and emerged as a Bayesian response to it, but it can be used for exactly the same purposes. The idea is to select the candidate model with the highest probability given the data. This idea can be formalized inside a Bayesian framework, involving prior probabilities on candidate models along with prior densities on all parameters in the models. Compared with the AIC, the penalty changes slightly and depends on the number of rows in the data set:

**BIC(model) = -2 \* log-likelihood(model) + log(number of observations) \* (length of the parameter space)**
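
As a rough sketch of how these criteria behave, the example below computes the Gaussian log-likelihood of two least-squares fits by hand and derives the AIC (using the standard penalty of 2 per parameter) and the BIC from it. The data is synthetic, `x2` is deliberately irrelevant, and counting the noise variance as a parameter is one common convention that is assumed here. The helper `gaussian_ic` is hypothetical and written only for this illustration.

```python
import numpy as np

# Synthetic data for illustration only; x2 has no effect on y
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(scale=1.0, size=n)

def gaussian_ic(X, y):
    """Return (log-likelihood, AIC, BIC) for an OLS fit with Gaussian errors."""
    n_obs = len(y)
    design = np.column_stack([np.ones(n_obs), X])    # add intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta) ** 2)
    sigma2 = rss / n_obs                              # MLE of the noise variance
    log_lik = -0.5 * n_obs * (np.log(2 * np.pi * sigma2) + 1)
    k = design.shape[1] + 1                           # coefficients + variance
    aic = -2 * log_lik + 2 * k
    bic = -2 * log_lik + np.log(n_obs) * k
    return log_lik, aic, bic

small = gaussian_ic(x1.reshape(-1, 1), y)
large = gaussian_ic(np.column_stack([x1, x2]), y)
print('x1 only : logL=%.2f  AIC=%.2f  BIC=%.2f' % small)
print('x1 + x2 : logL=%.2f  AIC=%.2f  BIC=%.2f' % large)
```

The larger model can never have a lower log-likelihood than the nested smaller one, so only the penalty terms can make the smaller model win. With 200 observations, log(200) is about 5.3, so the BIC charges more per extra parameter than the AIC does.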

#### LassoLarsIC to visualize AIC/BIC 

    import numpy as np
    import matplotlib.pyplot as plt

    from sklearn.linear_model import LassoLarsIC
    model_bic = LassoLarsIC(criterion='bic')
    model_bic.fit(df_inter, y)
    alpha_bic_ = model_bic.alpha_

    model_aic = LassoLarsIC(criterion='aic')
    model_aic.fit(df_inter, y)
    alpha_aic_ = model_aic.alpha_


    def plot_ic_criterion(model, name, color):
        """Plot the criterion along the regularization path and mark the selected alpha."""
        eps = 1e-4  # small offset so that alpha = 0 does not produce log10(0)
        alpha_ = model.alpha_ + eps
        alphas_ = model.alphas_ + eps
        criterion_ = model.criterion_
        plt.plot(-np.log10(alphas_), criterion_, '--', color=color, linewidth=2, label=name)
        plt.axvline(-np.log10(alpha_), color=color, linewidth=2,
                    label='alpha for %s' % name)
        plt.xlabel('-log10(alpha)')
        plt.ylabel('criterion')

    plt.figure()
    plot_ic_criterion(model_aic, 'AIC', 'green')
    plot_ic_criterion(model_bic, 'BIC', 'blue')
    plt.legend()
    plt.title('Information-criterion for model selection');

## Classification Report
#### Provides precision, recall, F1 score and support per class, plus overall accuracy

    from sklearn.metrics import classification_report
    print(classification_report(y_test, pred))


## Confusion Matrix Visualization

### Best for multi-class classification evaluation

    from sklearn.metrics import confusion_matrix
    # compute initial confusion matrix (true labels first, so rows are true classes)
    cnf_matrix = confusion_matrix(y_test, y_hat_test)
    # visualization packages
    import numpy as np
    import itertools
    import matplotlib.pyplot as plt
    %matplotlib inline

    def plot_confusion_matrix(cm, classes,
                              normalize=False,
                              title='Confusion matrix',
                              cmap=plt.cm.Blues):
        """Plot a confusion matrix, with color and normalization options.
        If normalize=True, show row-wise proportions rather than raw counts."""
        # Add normalization option
        if normalize:
            cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
            print("Normalized confusion matrix")
        else:
            print('Confusion matrix, without normalization')

        print(cm)

        plt.imshow(cm, interpolation='nearest', cmap=cmap)
        plt.title(title)
        plt.colorbar()
        tick_marks = np.arange(len(classes))
        plt.xticks(tick_marks, classes, rotation=45)
        plt.yticks(tick_marks, classes)
    
        fmt = '.2f' if normalize else 'd'
        thresh = cm.max() / 2.
        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
            plt.text(j, i, format(cm[i, j], fmt),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")

        plt.tight_layout()
        plt.ylabel('True label')
        plt.xlabel('Predicted label')
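
A small sketch of the layout that the plotting function relies on: `confusion_matrix` takes the true labels first, so rows are true classes and columns are predicted classes. The labels below are made up for illustration only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true = ['cat', 'cat', 'cat', 'dog', 'dog', 'bird']
y_pred = ['cat', 'cat', 'dog', 'dog', 'dog', 'cat']
labels = ['bird', 'cat', 'dog']

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Dividing each row by its total puts the per-class recall on the diagonal
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print(np.round(cm_normalized, 2))
```

Reading the second row, two of the three true cats were predicted as cats and one as a dog. Swapping the two arguments would transpose the matrix and mislabel the axes of the plot.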
    
    

## Feature Importance 

    
    def plot_feature_importances(model):
        n_features = X_train.shape[1]
        plt.figure(figsize=(8,8))
        plt.barh(range(n_features), model.feature_importances_, align='center') 
        plt.yticks(np.arange(n_features), X_train.columns.values) 
        plt.xlabel("Feature importance")
        plt.ylabel("Feature")

    # pass a fitted estimator that exposes feature_importances_ (e.g. a tree-based model)
    plot_feature_importances(model)

## ROC, AUC Visualization (requires a classifier that outputs scores or probabilities, e.g. logistic regression)

### Best evaluation for binary classification problems

    from sklearn.metrics import roc_curve, auc

    # The ROC curve shows TPR against FPR across the decision thresholds applied to the classifier's scores

    # First calculate the decision scores of each of the data points:
    y_score = model_log.decision_function(X_test)

    fpr, tpr, thresholds = roc_curve(y_test, y_score)
    print('Test AUC:', auc(fpr, tpr))

    y_train_score = model_log.decision_function(X_train)
    train_fpr, train_tpr, train_thresholds = roc_curve(y_train, y_train_score)
    print('Train AUC:', auc(train_fpr, train_tpr))
    
    
    import seaborn as sns
    import matplotlib.pyplot as plt
    %matplotlib inline
    sns.set_style("darkgrid", {"axes.facecolor": ".9"})
    plt.figure(figsize=(10,10))
    plt.plot(fpr, tpr, color='darkorange', lw=5, label='ROC curve')
    plt.plot([0,1], [0,1], color = 'blue', linestyle = '--')
    plt.xlim([-0.1,1.0])
    plt.ylim([0.0,1.1])
    plt.xticks([i/20.0 for i in range(21)])
    plt.yticks([i/20.0 for i in range(21)])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend(loc='best')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.show()
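
As a minimal sketch of what the AUC summarizes, the toy example below computes it from the ROC curve and then again as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The four labels and scores are made up for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Made-up labels and classifier scores, for illustration only
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Area under the ROC curve, from the curve itself
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)
print('AUC from ROC curve:', roc_auc)

# Same quantity as the share of positive/negative pairs ranked correctly
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean(pos[:, None] > neg[None, :])
print('AUC from pairwise ranking:', pairwise)

# roc_auc_score is a shortcut that skips building the curve
print('AUC from roc_auc_score:', roc_auc_score(y_true, y_score))
```

Three of the four positive/negative pairs are ranked correctly here, so all three computations give 0.75. A value of 0.5 corresponds to the dashed diagonal in the plot above, which is what random scoring would give.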


## Regressor Evaluations
#### MAE, MSE, RMSE, R2

    import numpy as np
    from sklearn import metrics

    print('R Squared Value:', metrics.r2_score(y_test, y_pred))
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))