# Evaluation in Machine Learning

Generally speaking, we don't make final evaluations on our _training set_.  Instead, what we care most about is how well our model generalizes to unseen data. Thus, we usually hold some data in reserve (which we call "held-out" or "test" data), train on the non-test data (called the "training data"), and evaluate on the test data.  The gold standard for this is "cross-validation," which performs this process multiple times and then averages scores across the different held-out samples. However, cross-validation is computationally more intensive, and so sometimes we'll use simpler methods (e.g., a single train test split).

In this notebook, we'll explore essential techniques for evaluating and validating machine learning models. First though, let's review a couple of different metrics we can use to evaluate regressions.

## Scoring Regression Results

When evaluating regression models, three commonly used metrics are R² (R-squared), RMSE (Root Mean Square Error), and MAE (Mean Absolute Error). Let's explore these metrics in detail and see how to implement them in Python.

### R² (R-squared) Score

#### What is R²?

R², also known as the coefficient of determination, is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It provides an indication of the goodness of fit of a model.

- R² ranges from 0 to 1.
- An R² of 0 indicates that the model explains none of the variability of the data.
- An R² of 1 indicates that the model explains all the variability of the data.

#### Formula

$$ R^2 = 1 - \frac{SSR}{SST} $$
Where:

SSR is the sum of squared residuals
SST is the total sum of squares

More specifically:
$$ R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}i)^2}{\sum{i=1}^n (y_i - \bar{y})^2} $$
Where:

$y_i$ are the observed values
$\hat{y}_i$ are the predicted values
$\bar{y}$ is the mean of the observed data


#### Interpretation

- An R² of 0.7 means that 70% of the variance in the target variable can be explained by the model.
- Higher R² values indicate a better fit, but be cautious of overfitting when R² is very close to 1.




### RMSE (Root Mean Square Error)

RMSE (which we've discussed before) is a frequently used measure of the differences between values predicted by a model and the values actually observed. It represents the standard deviation of the residuals (prediction errors).

- RMSE is always non-negative, and a value of 0 indicates a perfect fit to the data.
- It has the same units as the dependent variable.
- Lower values of RMSE indicate better fit.

#### Formula

The formula for RMSE is:

$$ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2} $$
Where:

$n$ is the number of observations
$y_i$ are the observed values
$\hat{y}_i$ are the predicted values

#### Interpretation

- RMSE can be interpreted as the average deviation of the predictions from the observed values.
- It gives more weight to large errors due to the squaring operation.

### Mean Absolute Error (MAE)

Mean Absolute Error (MAE) is another common metric used to evaluate regression models. It measures the average magnitude of the errors in a set of predictions, without considering their direction. 

#### Formula

The formula for MAE is:

$$ MAE = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i| $$

Where:
- $n$ is the number of observations
- $y_i$ are the observed values
- $\hat{y}_i$ are the predicted values

#### Interpretation

- MAE is always non-negative, and a value of 0 indicates a perfect fit to the data.
- It has the same units as the dependent variable.
- Lower values of MAE indicate better fit.
- MAE represents the average absolute difference between predicted and actual values.





## Comparing MAE and RMSE

Both MAE and RMSE are commonly used metrics for regression problems, but they have some key differences:

1. **Interpretation**: 
   - MAE is easier to interpret as it's in the same units as the target variable and represents the average absolute error.
   - RMSE is in the same units as the target variable, but it represents the standard deviation of the residuals.

2. **Sensitivity to outliers**:
   - MAE is less sensitive to outliers because it doesn't square the errors.
   - RMSE gives higher weight to large errors due to the squaring operation, making it more sensitive to outliers.

3. **Mathematical properties**:
   - MAE is based on the L1 norm (sum of absolute values).
   - RMSE is based on the L2 norm (sum of squared values).

4. **Formula comparison**:
   MAE: $\frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|$
   RMSE: $\sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2}$

5. **Use cases**:
   - MAE is preferred when you want to treat all errors equally.
   - RMSE is preferred when large errors are particularly undesirable, as it penalizes them more heavily.


#### Choosing Between MAE and RMSE

- Use MAE when you want to treat all errors equally and when outliers are not particularly problematic for your application.
- Use RMSE when large errors are especially undesirable, or when you want to maintain mathematical properties like differentiability (RMSE is differentiable everywhere, while MAE is not differentiable at 0).
- Often, it's beneficial to report both metrics to provide a more comprehensive view of your model's performance.

Remember, the choice between MAE and RMSE (or using both) can depend on your specific problem, the nature of your data, and the requirements of your stakeholders.


## Using sklearn

Scikit-learn provides easy to use implementations most common metrics in the "metrics" package of the library.



In [17]:

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

# R² score
r2 = r2_score(y_true, y_pred)
print(f"R² Score: {r2:.4f}")

# RMSE
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE: {rmse:.4f}")

# MAE
mae = mean_absolute_error(y_true, y_pred)
print(f"MAE: {mae:.4f}")


R² Score: 0.9486
RMSE: 0.6124
MAE: 0.5000




## 2. Train/Test Splits

Train/test splits are crucial in machine learning for several reasons:
1. They help us estimate how well our model will perform on unseen data.
2. They prevent overfitting by evaluating the model on data it hasn't seen during training.
3. They provide a way to compare different models fairly.

### Implementing train/test split in raw Python

Let's start by implementing a simple train/test split function:


In [9]:

import numpy as np

# This initializes the random state for reproduceability 
np.random.seed(42)

def my_train_test_split(X, y, test_size=0.2, random_state=None):
    
    
    n_samples = len(X)
    n_test = int(n_samples * test_size)
    
    # Create random indices
    indices = np.random.permutation(n_samples)
    test_indices = indices[:n_test]
    train_indices = indices[n_test:]
    
    # Split the data
    X_train, X_test = X[train_indices], X[test_indices]
    y_train, y_test = y[train_indices], y[test_indices]
    
    return X_train, X_test, y_train, y_test

# Example usage
X = np.random.rand(100, 1)  # 100 samples, 1 feature
y = 2 * X + 1 + np.random.randn(100, 1) * 0.1  # Linear relationship with some noise

X_train, X_test, y_train, y_test = my_train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Train set shape: {X_train.shape}, Test set shape: {X_test.shape}")



Train set shape: (80, 1), Test set shape: (20, 1)


With this train test split, we can now train and evaluate a predictor.  We'll use sklearn's linear regression for this.

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

lr = LinearRegression()
lr.fit(X_train,y_train)

y_pred = lr.predict(X_test)
print(f"RMSE = {mean_squared_error(y_test,y_pred)}")
print(f"Explained Variance = {r2_score(y_test, y_pred)}")



RMSE = 0.012927141660667569
Explained Variance = 0.9452137743424628


Note that different splits will return different scores because they are randomly chosen.  This is one reason why we use cross validation!

In [11]:
for trial in range(5):
    X_train, X_test, y_train, y_test = my_train_test_split(X, y, test_size=0.2, random_state=42)
    lr = LinearRegression()
    lr.fit(X_train,y_train)

    y_pred = lr.predict(X_test)
    print(f"Trial {trial+1}. RMSE = {mean_squared_error(y_test,y_pred)}")
    print(f"Trial {trial+1}. Explained Variance = {r2_score(y_test, y_pred)}\n")

Trial 1. RMSE = 0.006067782758113735
Trial 1. Explained Variance = 0.9840161691700738

Trial 2. RMSE = 0.008795323257008715
Trial 2. Explained Variance = 0.9792219687827336

Trial 3. RMSE = 0.010294716137719685
Trial 3. Explained Variance = 0.9750807417475278

Trial 4. RMSE = 0.007274012847573889
Trial 4. Explained Variance = 0.9788703444365846

Trial 5. RMSE = 0.006743404542508677
Trial 5. Explained Variance = 0.972488694087248




### Using scikit-learn for train/test split

The same result can be achieved more concisely with scikit learn:


In [12]:
from sklearn.model_selection import train_test_split
np.random.seed(42)

for trial in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    lr = LinearRegression()
    lr.fit(X_train,y_train)

    y_pred = lr.predict(X_test)
    print(f"Trial {trial+1}. RMSE = {mean_squared_error(y_test,y_pred)}")
    print(f"Trial {trial+1}. Explained Variance = {r2_score(y_test, y_pred)}\n")


Trial 1. RMSE = 0.006536995137170029
Trial 1. Explained Variance = 0.9825431689004598

Trial 2. RMSE = 0.014438846524693839
Trial 2. Explained Variance = 0.9458126172986798

Trial 3. RMSE = 0.003921147362602184
Trial 3. Explained Variance = 0.9808558934078107

Trial 4. RMSE = 0.00534263214183482
Trial 4. Explained Variance = 0.9876195998568261

Trial 5. RMSE = 0.00958719625066688
Trial 5. Explained Variance = 0.9648974460279144




As you can see, scikit-learn's implementation is more concise and likely more optimized.

## 3. Cross-Validation

### Why do we need cross-validation?

Cross-validation helps us:
1. Get a more robust estimate of model performance by using multiple train/test splits.
2. Reduce the impact of data variability on model evaluation.
3. Make better use of limited data for both training and validation.

### Implementing cross-validation in raw Python

Let's implement a simple 5-fold cross-validation function:


In [6]:
import numpy as np

def five_fold_cross_validation(X, y, random_state=None):
    if random_state is not None:
        np.random.seed(random_state)
    
    n_samples = len(X)
    fold_size = n_samples // 5
    indices = np.random.permutation(n_samples)
    
    for i in range(5):
        test_start = i * fold_size
        test_end = (i + 1) * fold_size if i < 4 else n_samples
        
        test_indices = indices[test_start:test_end]
        train_indices = np.concatenate([indices[:test_start], indices[test_end:]])
        
        yield X[train_indices], X[test_indices], y[train_indices], y[test_indices]

X = np.random.rand(100, 1)  # 100 samples, 1 feature
y = 2 * X + 1 + np.random.randn(100, 1) * 0.1 

# Perform cross-validation
cv_scores = []
for X_train, X_test, y_train, y_test in five_fold_cross_validation(X, y, random_state=42):
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    cv_scores.append(mean_squared_error(y_test, y_pred))

print(f"Cross-validation RMSE scores: {cv_scores}")
print(f"Mean RMSE: {np.mean(cv_scores):.4f}")


Cross-validation RMSE scores: [np.float64(0.012907383102223256), np.float64(0.005676334975620066), np.float64(0.008995239911145073), np.float64(0.013058370647536618), np.float64(0.013084154878304746)]
Mean RMSE: 0.0107



### Using scikit-learn for cross-validation

Scikit-learn has lots of tools for handling cross-validation.  `cross_val_score` is an easy one liner that handles most of the simple cases for cross validation.  By default, `cross_val_score` uses R-squared rather than RMSE for regression tasks, but we can pass a string to get one of several predefined scores (see the [docs](https://scikit-learn.org/stable/modules/model_evaluation.html)) or even pass a scoring function.  By default, metrics specified as string in `cross_val_score` are written as fitness functions, so that "good" values are higher.  Hence, instead of reporting RMSE, we get _negative_ RMSE. 

In [11]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np

# Create a linear regression model
model = LinearRegression()



# Perform cross-validation
cv_scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")

print(f"Cross-validation neg RMSE scores: {cv_scores}")
print(f"Mean neg RMSE: {np.mean(cv_scores):.4f}")


Cross-validation neg RMSE scores: [-0.00762899 -0.01442171 -0.01197623 -0.01278082 -0.00685235]
Mean neg RMSE: -0.0107


Note that `cross_val_score` only reports one measure (passed as the 'scoring' parameter).  If you want to pass more than one measure, you can use the `cross_validate` method. The `cross_validate` API requires a scorer (rather that a simple metric), so we use the `make_scorer` method to turn a metric into a scorer.

In [14]:

from sklearn.metrics import mean_squared_error, r2_score, make_scorer
from sklearn.model_selection import cross_validate



scoring = {
    'R2': make_scorer(lambda y, y_pred: r2_score(y, y_pred)),
    'RMSE': make_scorer(lambda y, y_pred: np.sqrt(mean_squared_error(y, y_pred)))
}

# Using multiple metrics
scores = cross_validate(model, X, y, cv=5, scoring=scoring,return_train_score=True)
for key, values in scores.items():
    print(f"{key}: {values.mean():.3f} (+/- {values.std() * 2:.3f})")


fit_time: 0.001 (+/- 0.000)
score_time: 0.001 (+/- 0.000)
test_R2: 0.980 (+/- 0.012)
train_R2: 0.983 (+/- 0.003)
test_RMSE: 0.103 (+/- 0.029)
train_RMSE: 0.100 (+/- 0.007)


#### **KFold sampler**

If you want even more control over your training and testing, you can use the `KFold` class in sklearn.  This works a lot like our python implementation above.

In [16]:
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

kf = KFold(n_splits=5, random_state=42, shuffle=True)


# Initialize list to store accuracy for each fold
rmse_list = []

# Loop through each fold
for train_index, test_index in kf.split(X_train):
    # Split the data into current train and test set
    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # It's a good idea to use a fresh, untrained model each time you run on new data
    # The "clone" command does that, but simplifies things by copying other parameters

    lr = LinearRegression()

    # Fit the model
    lr.fit(X_train, y_train)

    # Make predictions
    y_pred = lr.predict(X_test)

    # Calculate accuracy
    rmse_list.append(mean_squared_error(y_test, y_pred))

# Calculate and print the average accuracy
mean_rmse = np.mean(rmse_list)
print(f'Average RMSE: {mean_rmse}')

Average RMSE: 0.010748635085188005
