# Evaluating a model

1. Estimator's built-in `score()` method. See [Coefficient of Determination (R<sup>2</sup>)](https://www.sciencedirect.com/topics/mathematics/determination-coefficient)
2. The `scoring` parameter. See [Cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html)
3. Problem-specific metric functions

Read more: [Scikit-Learn Model Evaluation](https://scikit-learn.org/stable/modules/model_evaluation.html)
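
As a rough sketch of the three approaches listed above, the following runs each one on a small synthetic dataset. The data from `make_classification` and the `LogisticRegression` model are assumptions chosen only to keep the example self-contained; they are not the data or models used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption: not the notebook's heart-disease data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# 1. Estimator's built-in score(): mean accuracy for classifiers
builtin_score = model.score(X_te, y_te)

# 2. The scoring parameter of a cross-validation helper (here: F1 on 5 folds)
cv_f1 = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5, scoring="f1")

# 3. A problem-specific metric function applied to predictions
metric_score = accuracy_score(y_te, model.predict(X_te))

print(builtin_score, cv_f1.mean(), metric_score)
```

For a classifier, approaches 1 and 3 give the same number here because `score()` reports accuracy by default.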

In [None]:
# !pip install cupy-cuda12x cutensor-cu12 nvidia-nccl-cu12 nvidia-cudnn-cu112
! pip install "cuml-cu12==25.8.*" polars

In [3]:
import pandas as pd
import polars as pl
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, classification_report, confusion_matrix, ConfusionMatrixDisplay, accuracy_score, roc_curve, roc_auc_score, precision_score, recall_score, f1_score
from sklearn.datasets import fetch_california_housing
import cupy as cp
from cuml.ensemble import RandomForestClassifier as cuRFC, RandomForestRegressor as cuRFR
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

## Estimator's builtin `score()` method

### Classifiers

In [None]:
np.random.seed(0)  # scikit-learn draws from NumPy's global RNG, so seed NumPy (seeding CuPy does not affect it)

hd = pd.read_csv('./heart-disease.csv')

In [None]:
X = hd.drop('target', axis=1)
y = hd['target']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train);
# clf.estimator_params

In [None]:
score = clf.score(X_train, y_train)  # Scoring on the data the model was trained on: a flexible model
# such as a random forest has effectively memorized it, so expect a score at or near 1.0.
score

In [None]:
score = clf.score(X_test, y_test)
score

### Regressors

In [None]:
np.random.seed(0)  # scikit-learn draws from NumPy's global RNG, so seed NumPy (seeding CuPy does not affect it)

housing = fetch_california_housing()
housing_df = pd.DataFrame(housing['data'], columns=housing['feature_names'])
housing_df['target'] = housing['target']

In [None]:
X = housing_df.drop('target', axis=1)
y = housing_df['target']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
reg = RandomForestRegressor(n_estimators=100)
reg.fit(X_train, y_train);

In [None]:
score = reg.score(X_test, y_test)
score  # 0.7998221927014879 with 10000 estimators, 0.7999345435994805 with 1000 estimators, 0.7980392453626703 with 100 estimators

## Classifiers evaluation metrics
1. **Accuracy**
2. **[Receiver Operating Characteristic (ROC) curve](https://pmc.ncbi.nlm.nih.gov/articles/PMC8831439/) and the area under it (ROC AUC)**
3. **Confusion Matrix**
4. **Classification report**

See [Scikit-Learn Docs - Model Evaluation - Classification Metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics)

#### Accuracy

In [None]:
cp.random.seed(0)

hd = pd.read_csv('./heart-disease.csv')

In [None]:
X = hd.drop('target', axis=1)
y = hd['target']

In [None]:
clf = cuRFC(n_estimators=300)

In [None]:
cvs = cross_val_score(clf, X, y, cv=10)
cvs = np.mean(cvs)  # cross_val_score returns a NumPy array, so average it with NumPy
cvs

In [None]:
print(f'Heart disease classifier accuracy: {cvs * 100:.2f}%')

#### Area under the Receiver Operating Characteristic (ROC) curve

The ROC curve plots the model's **_True Positive Rate (TPR)_** against its **_False Positive Rate (FPR)_** at every classification threshold.
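
To make the two rates concrete, here is a minimal sketch that computes one (TPR, FPR) point per threshold by hand. The helper `tpr_fpr` and the labels and scores are hypothetical, made up for illustration; `roc_curve` does the equivalent work for every relevant threshold.

```python
def tpr_fpr(y_true, y_score, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    tp = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s < threshold)
    fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s >= threshold)
    tn = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical labels and predicted probabilities of the positive class
y_true_demo = [0, 0, 1, 1, 0, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]

# Lowering the threshold moves the point from (0, 0) towards (1, 1)
for thr in (0.9, 0.5, 0.3, 0.0):
    tpr_demo, fpr_demo = tpr_fpr(y_true_demo, y_score_demo, thr)
    print(f"threshold={thr:.1f}  TPR={tpr_demo:.2f}  FPR={fpr_demo:.2f}")
```

Each printed line is one point on the ROC curve; a good model reaches a high TPR while the FPR is still low.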

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
fitted_clf = clf.fit(X_train, y_train)

In [None]:
y_probs = clf.predict_proba(X_test)

In [None]:
y_probs_positives = np.asarray(y_probs)[:, 1]  # column 1 holds the predicted probability of the positive class (label 1)

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_probs_positives)

In [None]:
def plot_roc(fpr, tpr):
    roc_fig, ax = plt.subplots()
    ax.set_title('Receiver Operating Characteristic (ROC) Curve')
    ax.plot(fpr, tpr, color='orange', label='ROC')
    ax.plot([0, 1], [0, 1], color='darkblue', label='guessing', linestyle='--')
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.legend()

    plt.show()

In [None]:
plot_roc(fpr, tpr)

In [None]:
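roc_auc_sc = roc_auc_score(y_test, y_probs_positives)
roc_auc_sc
# A value of about 0.91 means the area under the ROC curve covers about 91% of the unit square
# (1.0 is a perfect ranking, 0.5 is no better than guessing).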
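
Another way to read the AUC: it is the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below computes it that way in plain Python; `auc_by_ranking` and the data are hypothetical names and values used only for illustration.

```python
def auc_by_ranking(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities of the positive class
y_true_demo = [0, 0, 1, 1, 0, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]

auc_demo = auc_by_ranking(y_true_demo, y_score_demo)     # 7 of the 9 pairs are ranked correctly
auc_perfect = auc_by_ranking(y_true_demo, y_true_demo)   # labels used as scores: perfect ranking
print(auc_demo, auc_perfect)
```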

In [None]:
# Passing y_test as though it were the model's scores gives the ROC curve of a perfect classifier (AUC = 1.0)
perfect_fpr, perfect_tpr, perfect_thresholds = roc_curve(y_test, y_test)

In [None]:
plot_roc(perfect_fpr, perfect_tpr)

#### Confusion Matrix

A confusion matrix counts, for each actual class, how many samples the model assigned to each predicted class.

In [None]:
y_predictions = clf.predict(X_test)

In [None]:
conf_matrix = confusion_matrix(y_test, y_predictions)
conf_matrix
# array([[20,  4],
#        [ 3, 34]])
# Rows are actual labels, columns are predicted labels:
# 20: actual 0, predicted 0 (true negatives)
# 34: actual 1, predicted 1 (true positives)
#  4: actual 0, predicted 1 (false positives)
#  3: actual 1, predicted 0 (false negatives)
# The main diagonal counts correct predictions; the anti-diagonal counts the cases the model confused.
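
The same layout can be rebuilt by counting (actual, predicted) pairs directly. This is a minimal sketch for binary 0/1 labels; `confusion_counts` and the label lists are hypothetical, not part of scikit-learn.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary 0/1 labels (rows: actual, columns: predicted)."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(0, 0)], counts[(0, 1)]],
            [counts[(1, 0)], counts[(1, 1)]]]

# Hypothetical actual and predicted labels
actual_demo = [0, 0, 1, 1, 1, 0, 1, 0]
predicted_demo = [0, 1, 1, 1, 0, 0, 1, 0]

matrix_demo = confusion_counts(actual_demo, predicted_demo)
print(matrix_demo)
```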

In [None]:
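# Pass plain arrays: pd.crosstab aligns Series on their index, and y_test keeps the shuffled
# index from train_test_split, so misaligned inputs can collapse the table (e.g. to a single row).
crosstab = pd.crosstab(np.asarray(y_test),
                       np.asarray(y_predictions),
                       rownames=['Actual'],
                       colnames=['Predicted'])
crosstab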

In [None]:
# Set seaborn font scale to a higher value for readability
sns.set(font_scale=1.5)

In [None]:
sns.heatmap(conf_matrix);

In [None]:
ConfusionMatrixDisplay.from_estimator(estimator=clf, X=X_test, y=y_test)  # evaluate on held-out data only, not on rows the model was trained on

In [None]:
ConfusionMatrixDisplay.from_predictions(y_true=y_test, y_pred=y_predictions)

#### Classification Report

**Precision**: The proportion of **predicted positives** that are actually positive: true_positives / (true_positives + false_positives). A model that produces no false positives has a precision of 1.0.

**Recall**: The proportion of **actual positives** that the model identifies: true_positives / (true_positives + false_negatives). A model that produces no false negatives has a recall of 1.0.

**F1**: The harmonic mean of precision and recall: 2 * (precision * recall) / (precision + recall). A perfect model has an F1 score of 1.0.

**Accuracy**: The proportion of all predictions that are correct: (true_positives + true_negatives) / total_samples.

**Macro avg**: The unweighted mean of each metric (**precision**, **recall**, **F1 score**) across classes. It does not take **class imbalance** into account, so every class counts equally.

**Weighted avg**: The mean of each metric across classes, weighted by the number of samples in each class (its support).
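
As a worked example of these definitions, the sketch below applies them to the confusion matrix quoted earlier in this notebook (`[[20, 4], [3, 34]]`). The helper `prf` is a hypothetical name, and the numbers are only as representative as that one matrix.

```python
# Counts from the confusion matrix quoted earlier: rows = actual, columns = predicted
tn, fp, fn, tp = 20, 4, 3, 34

def prf(tp, fp, fn):
    """Precision, recall and F1 for one class, given its TP, FP and FN counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p1, r1, f1_1 = prf(tp, fp, fn)   # class 1 treated as the positive class
p0, r0, f1_0 = prf(tn, fn, fp)   # class 0: the roles of the counts swap

support0, support1 = tn + fp, fn + tp   # actual samples per class: 24 and 37
accuracy = (tp + tn) / (tn + fp + fn + tp)
macro_f1 = (f1_0 + f1_1) / 2
weighted_f1 = (f1_0 * support0 + f1_1 * support1) / (support0 + support1)

print(f"class 1: precision={p1:.3f} recall={r1:.3f} f1={f1_1:.3f}")
print(f"class 0: precision={p0:.3f} recall={r0:.3f} f1={f1_0:.3f}")
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} weighted_f1={weighted_f1:.3f}")
```

Because class 1 has more samples and the higher F1 here, the weighted average comes out above the macro average.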

More on [Scikit-Learn documentation](https://scikit-learn.org/stable/modules/model_evaluation.html#metrics-and-scoring-quantifying-the-quality-of-predictions)

In [None]:
c_report = classification_report(y_test, y_predictions)

In [None]:
print(c_report)

In [None]:
c_report = classification_report(y_test, y_predictions, output_dict=True)
print(c_report)
c_report_df = pd.DataFrame(c_report)
c_report_df

## Regressors evaluation metrics

1. **R<sup>2</sup> (Coefficient of Determination)**: the proportion of the variance in the target (dependent) variable y that is explained by the feature (independent) variables X. It **compares the model's predictions to the <u>mean of the targets</u>**: it ranges from negative infinity to 1.0, equals 1.0 for perfect predictions, and equals 0.0 if every prediction is the mean of the targets.
2. **Mean Absolute Error (MAE)**: the **average of the absolute differences** between predictions and actual values, in the same units as the target. It gives an idea of how far off predictions are on average.
3. **Mean Squared Error (MSE)**: the **average of the squared differences** between predictions and actual values. Squaring amplifies large errors, so MSE penalizes outliers more heavily than MAE.
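
The three definitions above can be written out in a few lines of plain Python. This is a sketch on made-up numbers; the function names `manual_r2`, `manual_mae` and `manual_mse` are hypothetical and exist only for this illustration.

```python
def manual_mae(y_true, y_pred):
    """Mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def manual_mse(y_true, y_pred):
    """Mean of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def manual_r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

targets = [1.0, 2.0, 3.0, 4.0]
good_preds = [1.1, 1.9, 3.2, 3.8]      # close to the targets
mean_preds = [2.5, 2.5, 2.5, 2.5]      # always predict the mean of the targets
reversed_preds = [4.0, 3.0, 2.0, 1.0]  # worse than predicting the mean

print(manual_mae(targets, good_preds), manual_mse(targets, good_preds))
print(manual_r2(targets, good_preds), manual_r2(targets, mean_preds), manual_r2(targets, reversed_preds))
```

Predicting the mean gives an R<sup>2</sup> of 0.0, and predictions worse than the mean push it below zero.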


See [Scikit-Learn Docs - Model Evaluation - Regression Metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics)

#### R<sup>2</sup> (Coefficient of Determination)

In [4]:
cp.random.seed(0)

housing = fetch_california_housing()
housing_df = pd.DataFrame(housing['data'], columns=housing['feature_names'])
housing_df['target'] = housing['target']

In [11]:
X = housing_df.drop('target', axis=1)
y = housing_df['target']

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [8]:
reg = RandomForestRegressor(n_estimators=100)
reg.fit(X_train, y_train);
y_predictions = reg.predict(X_test)

In [6]:
r2_s = reg.score(X_test, y_test)     # R2 is the default for `score()`

In [7]:
r2_s

0.8210634595805447

In [7]:
r2_score_from_func = r2_score(y_test, y_predictions)   # same metric computed from the predictions; it equals reg.score(X_test, y_test) for the same fitted model and split

In [8]:
r2_score_from_func

0.8190853992521174

#### Mean Absolute Error (MAE)

In [8]:
mae = mean_absolute_error(y_test, y_predictions)
mae  # On average, each prediction is off by about ±0.31 from the actual value

0.31181006887112417

In [22]:
%%time
# Manually with polars
pl_df = pl.DataFrame({'actual_values': y_test, 'predicted_values': y_predictions});
pl_df = pl_df.with_columns(differences=pl.col('actual_values').sub(pl.col('predicted_values')).abs())


CPU times: user 1.11 ms, sys: 5 μs, total: 1.11 ms
Wall time: 715 μs


In [23]:
%%time
mae_polars = pl_df.select('differences').mean()
mae_polars

CPU times: user 188 μs, sys: 8 μs, total: 196 μs
Wall time: 197 μs


differences
f64
0.330665


In [24]:
%%time
# Manually with pandas
pd_df = pd.DataFrame({'actual_values': y_test, 'predicted_values': y_predictions});
pd_df['differences'] = pd_df['actual_values'] - pd_df['predicted_values']

CPU times: user 1.01 ms, sys: 0 ns, total: 1.01 ms
Wall time: 901 μs


In [25]:
%%time
mae_pd = np.abs(pd_df['differences']).mean()
mae_pd

CPU times: user 457 μs, sys: 0 ns, total: 457 μs
Wall time: 451 μs


np.float64(0.33066513752422494)

#### Mean Squared Error (MSE)

In [16]:
mse = mean_squared_error(y_test, y_predictions)
mse

0.265457516240721

In [26]:
%%time
# Manually with polars
pl_df = pl_df.with_columns(squared_differences=pl.col('differences').pow(2))
mse_polars = pl_df.select(pl.col('squared_differences')).mean()
mse_polars


CPU times: user 0 ns, sys: 1.57 ms, total: 1.57 ms
Wall time: 870 μs


squared_differences
f64
0.265458


In [27]:
%%time
# Manually with Pandas
pd_df['squared_differences'] = np.square(pd_df['actual_values'] - pd_df['predicted_values'])  # squaring already removes the sign, so np.abs is not needed
mse_pandas = pd_df['squared_differences'].mean()
mse_pandas

CPU times: user 656 μs, sys: 74 μs, total: 730 μs
Wall time: 786 μs


np.float64(0.265457516240721)

## The `scoring` parameter

The `scoring` parameter of model-evaluation tools such as `cross_val_score` selects the metric used to evaluate the estimator.
If `scoring=None`, the estimator's own `score()` method is used: regressors usually report the R<sup>2</sup> score, while classifiers usually report accuracy.

See [The scoring parameter: defining model evaluation rules](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules) and [String name scorers](https://scikit-learn.org/stable/modules/model_evaluation.html#string-name-scorers)
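
A minimal sketch of how the parameter behaves, assuming a synthetic dataset from `make_regression` and a plain `LinearRegression` (both are stand-ins, not the notebook's data or model). It shows that `scoring=None` falls back to the estimator's default and that error metrics come back negated.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the California housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = LinearRegression()

# scoring=None (the default) uses model.score(), which is R^2 for regressors
cv_default = cross_val_score(model, X_demo, y_demo, cv=5)
cv_r2 = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")

# Error metrics are exposed as "neg_*" scorers so that higher is always better
cv_neg_mae = cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error")

print("R^2 per fold:", cv_r2)
print("MAE (sign flipped back):", -cv_neg_mae.mean())
```

Flipping the sign of a `neg_*` score recovers the error in its usual, positive form.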

### Classifier example

In [28]:
hd = pd.read_csv('./heart-disease.csv')

In [29]:
X = hd.drop('target', axis=1)
y = hd['target']

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [37]:
cp.random.seed(0)

clf = RandomForestClassifier(n_estimators=100)
# clf.estimator_params

In [42]:
cp.random.seed(0)

# Cross-validation accuracy
cv_acc = cross_val_score(clf, X, y, cv=5, scoring=None)  # default is accuracy
print(f'Cross-validation accuracy is: {cp.mean(cv_acc) * 100 :.2f}%')

Cross-validation accuracy is: 82.49%


In [45]:
cp.random.seed(0)

# Cross-validation precision
cv_prec = cross_val_score(clf, X, y, cv=5, scoring='precision')
print(f'Cross-validation precision is: {cp.mean(cv_prec)}')

Cross-validation precision is: 0.8215199727533694


In [46]:
cp.random.seed(0)

# Cross-validation recall
cv_recall = cross_val_score(clf, X, y, cv=5, scoring='recall')
print(f'Cross-validation recall is: {cp.mean(cv_recall)}')

Cross-validation recall is: 0.8545454545454545


### Regressor example

In [47]:
housing = fetch_california_housing()
housing_df = pd.DataFrame(housing['data'], columns=housing['feature_names'])
housing_df['target'] = housing['target']

In [48]:
X = housing_df.drop('target', axis=1)
y = housing_df['target']

In [68]:
cp.random.seed(0)

reg = cuRFR(n_estimators=100)   # Uses GPU, there are a lot of samples

In [69]:
cp.random.seed(0)

# Cross-validation R2 score
cv_r2 = cross_val_score(reg, X, y, cv=5, scoring=None) # Default scoring is R2
print(f'Cross-validation R2 score is: {cp.mean(cv_r2)}')

Cross-validation R2 score is: 0.6478298505824341


In [72]:
cp.random.seed(0)

# Cross-validation MAE
cv_mae = cross_val_score(reg, X, y, cv=5, scoring='neg_mean_absolute_error')   # Scorers follow a "higher is better" convention, so error metrics are returned negated; flip the sign to read the value as an MAE
print(f'Cross-validation Mean Absolute Error is: {cp.mean(cv_mae)}')

Cross-validation Mean Absolute Error is: -0.469995690528779


In [73]:
cp.random.seed(0)

# Cross-validation MSE
cv_mse = cross_val_score(reg, X, y, cv=5, scoring='neg_mean_squared_error')   # Scorers follow a "higher is better" convention, so error metrics are returned negated; flip the sign to read the value as an MSE
print(f'Cross-validation Mean Squared Error is: {cp.mean(cv_mse)}')

Cross-validation Mean Squared Error is: -0.4359976091039309


## Metrics Functions

In [74]:
hd = pd.read_csv('./heart-disease.csv')

In [75]:
X = hd.drop('target', axis=1)
y = hd['target']

In [110]:
cp.random.seed(0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [111]:
cp.random.seed(0)

clf = cuRFC(n_estimators=100)
clf.fit(X_train, y_train)

In [112]:
y_preds = clf.predict(X_test)

In [113]:
acc_score = accuracy_score(y_test, y_preds)

In [114]:
prec_score = precision_score(y_test, y_preds)

In [115]:
rec_score = recall_score(y_test, y_preds)

In [116]:
f1_sc = f1_score(y_test, y_preds)

In [117]:
acc_score, prec_score, rec_score, f1_sc

(0.9016393442622951, 0.8787878787878788, 0.9354838709677419, 0.90625)