# Model Evaluation
---

## The model development process

A rigorous approach to model development uses both cross-validation and a separate validation set. Cross-validation on the training data is used to tune hyperparameters, while the held-out validation set lets you compare the scores of different algorithms (e.g., logistic regression vs. Naive Bayes vs. decision tree) and select a champion model. Finally, the test set, used only once on the champion model, gives a benchmark score for performance on new data. This process is illustrated in the diagram below.

![model_development_diagram](images/model_development_diagram.png)

*Image source: Google Advanced Data Analytics Certificate*
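
The workflow described above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the certificate's own code: the 60/20/20 split, the two candidate algorithms, the hyperparameter grids, and the choice of F1 as the comparison metric are all assumptions made for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption: stands in for a real dataset)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out 20% as the test set, then split the remainder 75/25 into train and validation
X_tr, X_test, y_tr, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tr, y_tr, test_size=0.25, stratify=y_tr, random_state=0)

# Candidate algorithms with small, illustrative hyperparameter grids
candidates = {
    'logistic regression': (LogisticRegression(max_iter=1000), {'C': [0.1, 1.0, 10.0]}),
    'decision tree': (DecisionTreeClassifier(random_state=0), {'max_depth': [2, 4, 8]}),
}

fitted = {}
val_scores = {}
for name, (estimator, grid) in candidates.items():
    # 1. Cross-validation on the training set tunes the hyperparameters
    search = GridSearchCV(estimator, grid, scoring='f1', cv=5)
    search.fit(X_train, y_train)
    fitted[name] = search
    # 2. The validation set compares the tuned models against each other
    val_scores[name] = f1_score(y_val, search.predict(X_val))

# 3. The champion model is scored once on the untouched test set
champion = max(val_scores, key=val_scores.get)
test_f1 = f1_score(y_test, fitted[champion].predict(X_test))
print(champion, round(test_f1, 3))
```

Keeping the test set out of every tuning and selection step is what makes the final score a fair estimate of performance on new data.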

---
## Metrics used to evaluate regression models

| Metric | Perfect Score | Description |
| --- | --- | --- |
| R-squared (R^2) | 1 | Proportion of the variance in the dependent variable (Y) explained by the independent variables (X). Usually between 0 and 1, where 1 represents a perfect fit of the model to the data. It can be negative on held-out data when the model fits worse than predicting the mean. |
| Adjusted R-squared | 1 | Modified R^2 that accounts for the number of predictors by penalizing unnecessary explanatory variables, which discourages overfitting. Never exceeds R^2 and can be negative; 1 represents a perfect fit of the model to the data. |
| Mean Squared Error (MSE) | 0 | Measures the average squared difference between predicted and actual values. |
| Root Mean Squared Error (RMSE) | 0 | Square root of MSE, providing a measure of the average magnitude of errors. |
| Mean Absolute Error (MAE) | 0 | Measures the average absolute difference between predicted and actual values. |
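| Mean Percentage Error (MPE) | 0 | Measures the average signed percentage difference between predicted and actual values. Positive and negative errors can cancel out, so it indicates bias (systematic over- or under-prediction) rather than overall accuracy. |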
| Mean Absolute Percentage Error (MAPE) | 0 | Measures average absolute percentage difference between predicted and actual values. |

The perfect score for each metric is fixed, but what counts as a good score depends on the problem context and on the scale of the target variable. For error metrics such as MSE, RMSE, and MAE, lower values indicate better performance; for R-squared, higher values (closer to 1) indicate a better fit of the model to the data.

R-squared (R^2) and Adjusted R-squared have a maximum of 1, which represents a perfect fit of the model to the data. Both usually fall between 0 and 1, but they can be negative when the model fits worse than simply predicting the mean of the target.

Mean Percentage Error (MPE) and Mean Absolute Percentage Error (MAPE) are expressed as percentages. The perfect score for both is 0, indicating no percentage difference between the predicted and actual values. Both are undefined when an actual value is 0.
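
The regression metrics in the table can be computed directly from their definitions. The sketch below uses NumPy and a hypothetical helper named `regression_metrics`; the sign convention for MPE (actual minus predicted, divided by actual) is an assumption, since conventions differ between sources.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_predictors):
    """Compute the regression metrics from the table above (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    errors = y_true - y_pred

    ss_res = np.sum(errors ** 2)                    # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    # Adjusted R^2 penalizes each additional predictor
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    mse = np.mean(errors ** 2)

    return {
        'r2': r2,
        'adj_r2': adj_r2,
        'mse': mse,
        'rmse': np.sqrt(mse),
        'mae': np.mean(np.abs(errors)),
        'mpe': np.mean(errors / y_true) * 100,           # signed, errors can cancel
        'mape': np.mean(np.abs(errors / y_true)) * 100,  # unsigned
    }

# Made-up values: every prediction is off by exactly 10 units
scores = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390], n_predictors=1)
for name, value in scores.items():
    print(f'{name}: {value:.4f}')
```

In this example the MPE is about -1.46% while the MAPE is about 5.21%: the over- and under-predictions partly cancel in the signed metric, which is why an MPE near 0 does not by itself mean the predictions are accurate.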

## Metrics used to evaluate classification models

| Metric | Ideal Score | Description | Appropriate When... |
| --- | --- | --- | --- |
| Accuracy | 1 | Measures the overall correctness of the model's predictions, the proportion of correct classifications. | Classes are balanced or misclassification costs are equal across classes. |
| Precision | 1 | Measures the proportion of correctly predicted positive instances out of the total predicted positive instances. | The cost of false positives is high (e.g., spam detection). |
| Recall (Sensitivity) | 1 | Measures the proportion of correctly predicted positive instances out of the total actual positive instances. | The cost of false negatives is high (e.g., disease detection). |
| F1 Score | 1 | Harmonic mean of precision and recall. Provides a balanced measure of model performance. | Balance between precision and recall is desired (e.g., text classification). |
| Area Under the ROC Curve (AUC) | 1 | Represents the probability that a randomly selected positive instance is ranked higher than a randomly selected negative instance according to the model's predictions. | Ranking predictions is important, i.e., the model should correctly order instances by their predicted probability of belonging to a particular class. Also commonly used when the class distribution is imbalanced. |
| Confusion Matrix | N/A | A table representing the performance of a classification model. It shows true positive, true negative, false positive, and false negative values. | Comprehensive understanding of the model's performance across different classes is needed. |

It's important to consider the specific requirements, goals, and costs associated with the classification problem when selecting the appropriate evaluation metric. Depending on the context, different performance measures may be favored. For example, where the cost of false positives is high (e.g., spam detection, where a legitimate email could be filtered out), precision is a crucial metric. Where the cost of false negatives is high (e.g., disease or fraud detection, where a missed case is costly), recall becomes more important.
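
To make the trade-offs concrete, the sketch below computes accuracy, precision, recall, and F1 from the four confusion-matrix counts. The helper name `classification_metrics` and the counts are hypothetical, chosen to show how accuracy can look excellent on an imbalanced problem while recall is poor.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute core classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Hypothetical imbalanced problem: only 20 of 1,000 instances are positive
scores = classification_metrics(tp=8, fp=2, fn=12, tn=978)
print(scores)
```

Here accuracy is 0.986, yet the model finds only 40% of the positive instances (recall 0.4, precision 0.8, F1 about 0.533). If false negatives are costly, accuracy alone would hide the problem.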

---
## Classification model metric evaluation throughout the train, validate, test process



### Imports

In [1]:
import pandas as pd  # needed for the results tables built below
import sklearn.metrics as metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay


#### Define the scoring metrics (req. for GridSearchCV object instantiation)

In [2]:
# Define the scoring metrics to capture (a list of scikit-learn scorer names)
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
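
The scoring metrics only take effect once they are passed to `GridSearchCV`. The sketch below shows one way this might look; the synthetic data, the small parameter grid, and the choice of `refit='roc_auc'` are assumptions for illustration, and `rf_val` simply mirrors the variable name used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Scorer names to record for every hyperparameter combination
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

# Synthetic training data (assumption: stands in for the real X_train, y_train)
X_train, y_train = make_classification(n_samples=200, n_features=6, random_state=42)

# Small illustrative hyperparameter grid
cv_params = {'max_depth': [2, 4], 'n_estimators': [10, 20]}

# With several metrics, `refit` must name the one used to pick the final model
rf_val = GridSearchCV(
    RandomForestClassifier(random_state=42),
    cv_params,
    scoring=scoring,
    cv=4,
    refit='roc_auc',
)
rf_val.fit(X_train, y_train)

# Each metric appears in cv_results_ under the key 'mean_test_<scorer name>'
metric_columns = sorted(c for c in rf_val.cv_results_ if c.startswith('mean_test_'))
print(metric_columns)
```

The `mean_test_*` keys are the column names that a results function needs to look up when it reads `cv_results_`.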

### Step 1 - Training



#### Define a function to generate cross-validated scores on training data

In [None]:
def get_training_results(model_name:str, model_object, metric:str):
    '''
    Arguments:
        model_name (string): what you want the model to be called in the output table
        model_object: a fit GridSearchCV object (fit with all five scoring metrics)
        metric (string): precision, recall, f1, accuracy, or auc

    Returns a pandas df with the F1, recall, precision, accuracy, and AUC scores
    for the model with the best mean 'metric' score across all validation folds.
    '''

    # Create dictionary that maps input metric to actual metric name in GridSearchCV
    metric_dict = {'precision': 'mean_test_precision',
                 'recall': 'mean_test_recall',
                 'f1': 'mean_test_f1',
                 'accuracy': 'mean_test_accuracy',
                 'auc': 'mean_test_roc_auc'
                 }

    # Get all the results from the CV and put them in a df
    cv_results = pd.DataFrame(model_object.cv_results_)

    # Isolate the row of the df with the max(metric) score
    # (idxmax returns an index label, so select with .loc rather than .iloc)
    best_estimator_results = cv_results.loc[cv_results[metric_dict[metric]].idxmax(), :]

    # Extract accuracy, precision, recall, f1, and AUC scores from that row
    f1 = best_estimator_results.mean_test_f1
    recall = best_estimator_results.mean_test_recall
    precision = best_estimator_results.mean_test_precision
    accuracy = best_estimator_results.mean_test_accuracy
    auc = best_estimator_results.mean_test_roc_auc

    # Create table of results
    table = pd.DataFrame({'Model': model_name,
                        'Precision': precision,
                        'Recall': recall,
                        'F1': f1,
                        'Accuracy': accuracy,
                        'AUC': auc,
                        },
                        index=[0]
                    )

    return table


#### Get scores on training data

In [None]:
# Call 'get_training_results()' on the fit GridSearchCV object
results = get_training_results('random forest 1: auc', rf_val, 'auc')
results


### Step 2 - Validating



#### Define a function to generate scores on validation data


In [None]:

def get_validation_scores(model_name:str, preds, y_val_data):
    '''
    Generate a table of validation scores.

    In: 
        model_name (string): Your choice: how the model will be named in the output table
        preds: numpy array of validation predictions
        y_val_data: numpy array of y_val data

    Out: 
        table: a pandas df of precision, recall, f1, accuracy, and auc scores for your model
    '''
    accuracy = round(accuracy_score(y_val_data, preds), 3)
    precision = round(precision_score(y_val_data, preds), 3)
    recall = round(recall_score(y_val_data, preds), 3)
    f1 = round(f1_score(y_val_data, preds), 3)
    # Note: preds are hard class labels, so for a binary target this AUC equals
    # balanced accuracy; pass predicted probabilities for a ranking-based AUC
    auc = round(roc_auc_score(y_val_data, preds), 3)

    table = pd.DataFrame({'model': [model_name],
                        'precision': [precision], 
                        'recall': [recall],
                        'f1': [f1],
                        'accuracy': [accuracy],
                        'AUC': [auc]
                        })
    return table


#### Get scores on validation data


In [None]:

model_validation_scores = get_validation_scores('model_name', preds, y_val)
model_validation_scores
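
One caveat about the AUC column: `roc_auc_score` is designed for scores or probabilities, and feeding it hard class labels collapses the ROC curve to a single threshold. The toy example below (made-up labels and probabilities) shows the difference. With a fitted scikit-learn classifier, the probabilities would typically come from `predict_proba(X_val)[:, 1]`, which is an assumption about the model in use.

```python
from sklearn.metrics import roc_auc_score

# Made-up validation labels and predicted probabilities of the positive class
y_val = [0, 0, 1, 1, 1]
pred_probs = [0.2, 0.6, 0.55, 0.7, 0.9]

# Hard class labels obtained by thresholding the probabilities at 0.5
preds = [int(p >= 0.5) for p in pred_probs]

# AUC from probabilities measures how well the model ranks positives above negatives
auc_from_probs = roc_auc_score(y_val, pred_probs)

# AUC from hard labels only reflects the single 0.5 threshold
auc_from_labels = roc_auc_score(y_val, preds)

print(round(auc_from_probs, 3), round(auc_from_labels, 3))  # 0.833 0.75
```

The two numbers answer different questions, so it is worth stating which one a results table reports.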


### Step 3 - Testing



#### Define a function to generate scores on test data


In [None]:

def get_test_scores(model_name:str, preds, y_test_data):
    '''
    Generate a table of test scores.

    In: 
        model_name (string): Your choice: how the model will be named in the output table
        preds: numpy array of test predictions
        y_test_data: numpy array of y_test data

    Out: 
        table: a pandas df of precision, recall, f1, accuracy, and auc scores for your model
    '''
    accuracy = round(accuracy_score(y_test_data, preds), 3)
    precision = round(precision_score(y_test_data, preds), 3)
    recall = round(recall_score(y_test_data, preds), 3)
    f1 = round(f1_score(y_test_data, preds), 3)
    # Note: preds are hard class labels, so for a binary target this AUC equals
    # balanced accuracy; pass predicted probabilities for a ranking-based AUC
    auc = round(roc_auc_score(y_test_data, preds), 3)

    table = pd.DataFrame({'model': [model_name],
                        'precision': [precision], 
                        'recall': [recall],
                        'f1': [f1],
                        'accuracy': [accuracy],
                        'AUC': [auc]
                        })
  
    return table



#### Get scores on test data


In [None]:

# Get final test scores for the selected model
results = get_test_scores('random forest: auc', preds, y_test)
results
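
The imports at the top include `confusion_matrix` and `ConfusionMatrixDisplay`, which complement the score table by showing where the errors fall. A minimal sketch follows, using made-up labels in place of the real `y_test` and `preds`; the helper `plot_confusion_matrix` and its class names are hypothetical, and plotting requires matplotlib.

```python
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Made-up test labels and predictions (assumption: stand-ins for y_test and preds)
y_test = [0, 0, 0, 1, 1, 1, 1, 0]
preds = [0, 1, 0, 1, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, preds, labels=[0, 1])

# For a binary problem, ravel() yields the counts in this order
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f'TN={tn} FP={fp} FN={fn} TP={tp}')

def plot_confusion_matrix(cm, class_names):
    """Draw the matrix as a heatmap (hypothetical helper, requires matplotlib)."""
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
    disp.plot(values_format='d')
    return disp

# Usage in a notebook cell:
# plot_confusion_matrix(cm, ['negative', 'positive'])
```

Reading the false positive and false negative counts next to precision and recall makes it easier to judge whether the champion model's errors are the acceptable kind for the problem.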