# Prologue: Statistical Metrics and Evaluation

To get the most out of this page, I recommend consulting the relevant topic for a deeper treatment of each subject.

## Training, Validation, and Test Sets

Partitioning a dataset into training, validation, and test sets is essential for evaluating predictive models effectively. The role of each subset is as follows [Dangeti, 2017, Priddy and Keller, 2005]:

1. **Training Set**: The training set, usually the largest portion of the dataset, is the data on which the model is fitted. The learning algorithm uses it to learn the patterns and relationships in the data that it will later rely on to make predictions.

2. **Validation Set**: After training, the model's performance is assessed on the validation set. This evaluation is used to tune the model's hyperparameters and to gauge its ability to generalize to new data. Comparing training and validation performance helps detect overfitting or underfitting of the training data.

3. **Test Set**: Distinct from both the training and validation sets, the test set serves as an objective yardstick for evaluating the model's performance on entirely new and unseen data. Its primary function is to provide an estimate of how well the model is likely to perform when deployed in a real-world context.

<center>
<img src="https://raw.githubusercontent.com/HatefDastour/hatefdastour.github.io/master/_notes/Introduction_to_Digital_Engineering/_images/Data_Distribution.jpg" alt="picture" width="500">
</center>

Partitioning a dataset into these three subsets is a fundamental, well-established practice for developing robust machine learning models, in both academic research and applied data analysis.
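
As a minimal sketch of such a split, the following applies scikit-learn's `train_test_split` twice to synthetic data. The 60/20/20 proportions, the random data, and the variable names are assumptions chosen for illustration, not a recommendation from the cited sources.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: 1000 samples with 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)

# Step 1: hold out 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

# Step 2: split the remaining 80% into training (75%) and validation (25%),
# which corresponds to 60% and 20% of the full dataset
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

The test set is separated first so that nothing done during training or hyperparameter tuning can touch it.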

There are scenarios where only the training and test sets are used, and the validation set is omitted. The decision to exclude the validation set can be based on the specifics of the problem and the available data. Here are a few situations where using only the training and test sets might be appropriate:

1. **Large Datasets**: When you have a substantial amount of data, the training set can be sizeable enough to adequately train the model while still leaving a substantial portion for testing. In such cases, the need for a separate validation set is reduced.

2. **K-Fold Cross-Validation**: Instead of a fixed validation set, you can perform K-fold cross-validation using the training data. This involves splitting the training data into K subsets (folds) and iteratively using each fold as a validation set while training on the remaining data. This can provide a more robust assessment of model performance without the need for a dedicated validation set.

3. **Limited Data Availability**: In situations where data is scarce, allocating a portion to a validation set may significantly reduce the training data's size, impacting the model's ability to learn. In such cases, some practitioners choose to rely on the test set to evaluate the model's performance.

4. **Time Series Data**: When dealing with time series data, it's common to use only a training set and a test set. This is because the order of data points is critical, and random shuffling, as typically done with validation sets, can lead to data leakage. In time series analysis, the test set often represents future data points.

However, it's essential to be cautious when omitting a validation set. The validation set is crucial for hyperparameter tuning and ensuring that the model generalizes well. If you choose to exclude it, you should be aware of the potential risks, such as overfitting or suboptimal model performance due to poor hyperparameter choices. The decision should be made based on a careful consideration of the specific problem, dataset size, and available resources.
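
The K-fold and time-series cases above can be sketched with scikit-learn's `KFold`, `TimeSeriesSplit`, and `cross_val_score`. The synthetic data, the choice of `LinearRegression`, and the use of five splits are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

# Synthetic data for illustration only: a noisy linear relationship
rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(200, 1))
y = 3 * X.ravel() + rng.normal(scale=0.5, size=200)

model = LinearRegression()

# K-fold: each of the 5 folds serves once as the validation set
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_scores = cross_val_score(model, X, y, cv=kfold, scoring='r2')
print('K-fold R-squared per fold:', np.round(kfold_scores, 3))

# Time-ordered splits: every validation index comes after all training indices,
# so the model is never validated on data that precedes its training data
tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(X))
for train_idx, val_idx in splits:
    print(f'train: 0..{train_idx.max()}, validate: {val_idx.min()}..{val_idx.max()}')
```

In both cases the test set, if one is kept, stays outside these splits and is used only once for the final evaluation.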

## Standard Error

In statistics, the **standard error (SE)** measures the precision of an estimate, such as a sample mean or a coefficient in a linear regression. It is the standard deviation of the estimate's sampling distribution, that is, the amount by which the estimate would typically vary across repeated samples from the same population [Breiman, 2017, Scott, 2010].

The standard error of a sample mean is calculated using this formula:

\begin{equation} \text{SE} = \frac{s}{\sqrt{n}} \end{equation}

Where:
- $ s $ represents the sample standard deviation, which measures the variability of data points within the sample.
- $ n $ is the number of data points (sample size).

This formula provides a straightforward way to compute the standard error, which quantifies the precision of an estimated value based on a given sample.
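
As a small worked check of the formula, the sketch below computes the standard error of the mean by hand and compares it with `scipy.stats.sem`, which uses `ddof=1` by default. The eight sample values are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of 8 measurements (illustrative values only)
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

s = sample.std(ddof=1)        # sample standard deviation
n = sample.size               # sample size
se_manual = s / np.sqrt(n)    # SE = s / sqrt(n)
se_scipy = stats.sem(sample)  # same quantity computed by SciPy

print(f'manual SE = {se_manual:.4f}, scipy SE = {se_scipy:.4f}')  # both 0.7559
```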

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Set the matplotlib style using a custom style file
plt.style.use('https://raw.githubusercontent.com/HatefDastour/ENGG_680/main/Files/mystyle.mplstyle')

# Define the URL for the data
Link = 'https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=27211&Year=2007&Month=1&Day=1&time=&timeframe=3&submit=Download+Data'

# Read the CSV data and select specific columns
df = pd.read_csv(Link, usecols=['Date/Time', 'Year', 'Month', 'Mean Temp (°C)'])

# Convert the 'Date/Time' column to datetime format and set it as the index
df['Date/Time'] = pd.to_datetime(df['Date/Time'])
df = df.set_index('Date/Time')

# Display the DataFrame
display(df)

# Create a figure and axes
fig, ax = plt.subplots(figsize=(9.5, 4.5))

# Create a line plot for 'Mean Temp (°C)'
df['Mean Temp (°C)'].plot(ax=ax, kind='line', color='Maroon', marker='.',
                          linestyle='--', linewidth=1,
                          title="""Mean Temperature Over Time (CALGARY INT'L CS)""",
                          xlabel='Date/Time', ylabel='Mean Temp (°C)',
                          xlim=['2000-01-01', '2008-01-01'], ylim=[-15, 25])

# Ensure the plot layout is tight
plt.tight_layout()

In [None]:
import numpy as np

df['Month Name'] = df.index.month_name()
# Group once so the numerator and denominator share the same index
grouped = df.groupby(['Month', 'Month Name'])['Mean Temp (°C)']
# SE = s / sqrt(n); count() ignores missing values, just as std() does
Standard_Error = grouped.std(ddof=1) / np.sqrt(grouped.count())
Standard_Error = Standard_Error.to_frame('Standard Error')
display(Standard_Error)

We also have the option to compute this using the `scipy` package:

In [None]:
from scipy import stats
Standard_Error = df.dropna().groupby(['Month', 'Month Name'])['Mean Temp (°C)']\
                            .agg(Mean_Temperature_C = 'mean',
                                 SE = lambda x: stats.sem(x)).reset_index(drop = False)
display(Standard_Error)

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Create a figure and axis
fig, ax = plt.subplots(figsize=(7.5, 6))

# Create the bar plot with error bars
ax.bar(x=Standard_Error['Month Name'],
       height=Standard_Error.Mean_Temperature_C,
       yerr= 1.96 * Standard_Error.SE,
       capsize=5,
       color='Bisque',
       edgecolor='Tomato',
       linewidth=2,
       error_kw=dict(linewidth=2, ecolor='DarkRed'))

# Set labels and title
ax.set_xlabel('Month')
ax.set_ylabel('Mean Temperature (°C)')
ax.set_title('Monthly Mean Temperature with Standard Errors\nError bars demonstrate a 95% confidence interval', weight='bold')

# Set the Y-axis limits for better visualization
ax.set(ylim=[-10, 20])

# Rotate x-axis labels for better readability
ax.tick_params(axis='x', rotation=45)

# Add a grid for clarity
ax.grid(axis='y', linestyle='--', alpha=0.7)

# Add a horizontal line at y=0 for reference
ax.axhline(0, color='k', linewidth=1.5, linestyle='--')

# Adjust the layout so labels and titles do not overlap
plt.tight_layout()

Let's break down what the plot represents:

1. **X-Axis (Month Name)**: The X-axis of the plot represents the months of the year, from January to December. Each month is labeled with its name.

2. **Y-Axis (Mean Temperature in °C)**: The Y-axis represents the average (mean) temperature in degrees Celsius for each respective month. It shows the typical or average temperature for that month.

3. **Bars**: The bisque-colored bars with tomato-red edges represent the mean temperature for each month. The height of each bar indicates the average temperature for that specific month, so you can visually compare which months have higher or lower average temperatures.

4. **Error Bars**: The dark red vertical lines with caps, centered on the end of each bar, are error bars. Each one extends 1.96 standard errors above and below the monthly mean, which corresponds to an approximate 95% confidence interval for that mean. Longer error bars indicate greater uncertainty in the estimate, caused by higher variability in that month's temperatures, fewer observations, or both.

---


<font color='Red'><b>Note:</b></font>

If the error bar for a particular month (for example, March) is longer than the bar itself, the uncertainty in that mean is large relative to its value. This often occurs simply because the mean is close to zero, but it can also happen for other reasons, such as:

1. **Small Sample Size:** If you have a small sample size for the data point (March in this case), the estimate of the mean (the actual bar) might have higher uncertainty, leading to a longer error bar.

2. **Outliers:** Outliers in the data can significantly affect the standard error and, consequently, the length of the error bars.

3. **Heteroscedasticity:** If the variability of data is not consistent across all data points, it can result in longer error bars for data points with higher variability.

4. **Skewed Data:** Data distributions that are not symmetric can lead to differences in the standard error and actual values, affecting the error bar length.
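
The effect of sample size on the error bar length can be illustrated with a short simulation. The normal distribution, its standard deviation of 5, and the three sample sizes are assumptions chosen only to show the trend; they are unrelated to the temperature data.

```python
import numpy as np

rng = np.random.default_rng(0)
population_sd = 5.0  # hypothetical spread of the underlying data

se_by_n = {}
for n in (10, 100, 1000):
    sample = rng.normal(loc=0.0, scale=population_sd, size=n)
    # SE = s / sqrt(n): shrinks as the sample grows
    se_by_n[n] = sample.std(ddof=1) / np.sqrt(n)
    half_width = 1.96 * se_by_n[n]  # half-length of a 95% error bar
    print(f'n = {n:>4}: SE = {se_by_n[n]:.3f}, 95% half-width = {half_width:.3f}')
```

Because the standard error scales with $1/\sqrt{n}$, a hundredfold increase in sample size shortens the error bar by roughly a factor of ten.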

---

<font color='Blue'><b>Example:</b></font> In this example, two variables, $y$ and $\hat{y}$, are used to illustrate statistical metrics such as R-squared ($R^2$), Mean Squared Error (MSE), and Mean Absolute Error (MAE). Both are synthetic: neither comes from real-world data or from a fitted model. $y$ plays the role of the observed data and is generated as a linear function of $X$ plus random noise. $\hat{y}$ plays the role of a model's predictions and is constructed as a quadratic function of $X$ plus random noise, so it approximates $y$ without matching it exactly. Comparing $y$ with $\hat{y}$ provides a basis for understanding how each metric quantifies the agreement between observed and predicted values.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Generate sample data
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 3 * X + np.random.rand(100, 1)

# Create a model or use any other method to make predictions
# In this example, we use a simple quadratic model for demonstration
y_hat = 1 * X**2 + 2 * X + np.random.rand(100, 1)

# =============================================================================
# Figure
# =============================================================================
fig, ax = plt.subplots(2, 2, figsize=(9.5, 9), gridspec_kw={'width_ratios': [.6, .4]})
ax = ax.ravel()

# Plot the data points and model predictions with enhanced styling
ax[0].scatter(X, y, c='blue', edgecolor='k', s=60,
              alpha=0.7, label=r'Observed Data ($y$)')
ax[0].scatter(X, y_hat, c='red', edgecolor='k', s=60,
              alpha=0.7, label=r'Model Prediction ($\hat{y}$)')
ax[0].set(xlabel=r'$X$', ylabel=r'$y$ and $\hat{y}$')
ax[0].legend(loc='upper left')

_ = ax[1].scatter(y, y_hat, facecolors='SkyBlue', edgecolors='MidnightBlue', alpha=0.8)
# Draw the identity line (y_hat = y) over the combined range of both axes
lims = [min(ax[1].get_xlim()[0], ax[1].get_ylim()[0]),
        max(ax[1].get_xlim()[1], ax[1].get_ylim()[1])]
_ = ax[1].plot(lims, lims, '--r', linewidth=2)
_ = ax[1].set(xlabel=r'Observed Data ($y$)',
              ylabel=r'Model Prediction ($\hat{y}$)')

gs = ax[2].get_gridspec()
for i in range(2, 4):
    ax[i].remove()
ax3 = fig.add_subplot(gs[1, :])
markerline, _, _ = ax3.stem(np.arange(y.shape[0]), abs(y - y_hat),
                            linefmt='LightCoral', markerfmt='o')
_ = markerline.set_markerfacecolor('red')
_ = ax3.set(aspect='auto', xlabel='Index',
            ylabel=r'$|y - \hat{y}|$',
            yscale='log',
            ylim=[1e-2, 1e1])

_ = fig.suptitle(r'y and $\hat{y}$ comparison', weight='bold', fontsize=16)

plt.tight_layout()

## R-squared Statistic

The R-squared statistic, symbolized as $R^2$, is a metric used in linear regression analysis to gauge the proportion of the total variability in the dependent variable ($Y$) that is explained by the independent variable(s) ($X$) included in the model. For a least-squares linear model with an intercept, evaluated on the data it was fitted to, $R^2$ lies between 0 and 1: a value of 1 indicates that the model perfectly accounts for the variability in the dependent variable, while a value of 0 indicates that it explains none of it. When the formula is applied to arbitrary predictions, such as those of a poorly specified model or predictions on new data, $R^2$ can be negative, meaning the predictions fit worse than always predicting the mean of $Y$ [Powers, 2011, Scott, 2010, scikit-learn Developers, 2023].

The formula for calculating $R^2$ is expressed as:

\begin{align}
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\end{align}

Here are the components in the formula:
- $n$ represents the number of data points in the dataset.
- $y_i$ signifies the actual value of the dependent variable for the $i$-th data point.
- $\hat{y}_i$ corresponds to the predicted value of the dependent variable for the $i$-th data point based on the regression model.
- $\bar{y}$ denotes the mean value of the dependent variable.

$R^2$ is an essential tool in assessing the goodness of fit of a linear regression model. It quantifies how well the model's predictions align with the actual data. A high $R^2$ value implies that a substantial proportion of the variability in the dependent variable is captured by the model, indicating a better fit. Conversely, a low $R^2$ suggests that the model is not effectively explaining the variability in the dependent variable, and improvements may be necessary.

$R^2$ has important uses, but it's essential to complement its interpretation with other evaluation techniques, especially when working with complex models or in the presence of multicollinearity or heteroscedasticity. Combining $R^2$ with domain knowledge and diagnostic tests ensures a comprehensive understanding of the model's performance.
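
To make the formula concrete, the sketch below evaluates it for three hand-made sets of predictions. The helper name `r2_manual` and the five data values are hypothetical and chosen for illustration. Note that predictions worse than the mean give a negative value.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R-squared computed directly from its definition: 1 - SSE / SST."""
    ss_res = np.sum((y_true - y_pred) ** 2)          # sum of squared errors
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r2_perfect = r2_manual(y_true, y_true)                           # exact predictions
r2_mean = r2_manual(y_true, np.full_like(y_true, y_true.mean()))  # always predict the mean
r2_reversed = r2_manual(y_true, y_true[::-1])                    # predictions in reverse order

print(r2_perfect, r2_mean, r2_reversed)  # 1.0 0.0 -3.0
```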

In [None]:
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 1, figsize=(9.5, 9))
ax = ax.ravel()
y_bar = np.mean(y)
markerline, _, _ = ax[0].stem(np.arange(y.shape[0]), (y - y_hat)**2,
                            linefmt='#198c19', markerfmt='o')
_ = markerline.set_markerfacecolor('#304529')
_ = ax[0].set(aspect='auto', xlabel='Index',
            ylabel=r'$(y - \hat{y})^2$',
            yscale='log',
            ylim=[1e-4, 1e2])

markerline, _, _ = ax[1].stem(np.arange(y.shape[0]), (y - y_bar)**2,
                            linefmt='#5830ea', markerfmt='o')
_ = markerline.set_markerfacecolor('#331689')
_ = ax[1].set(aspect='auto', xlabel='Index',
            ylabel=r'$(y - \bar{y})^2$',
            yscale='log',
            ylim=[1e-4, 1e2])

plt.tight_layout()

In [None]:
import numpy as np

def r2_score(y, y_hat):
    # the mean of y
    y_bar = np.mean(y)
    # Sum of Squares Error
    SSE = np.sum((y - y_hat)**2)
    # Total Sum of Squares
    SST = np.sum((y - y_bar)**2)
    return 1 - (SSE / SST)

# Calculate R-squared
r_squared = r2_score(y, y_hat)
print(f'R-squared = {r_squared:.4f}')

Using sklearn's built-in `r2_score`:

In [None]:
from sklearn.metrics import r2_score
r_squared = r2_score(y, y_hat)
print(f'R-squared = {r_squared:.4f}')

## Mean Squared Error (MSE)

The Mean Squared Error (MSE) is a fundamental metric used to quantify the goodness of fit of a regression model by measuring the average squared difference between the predicted values and the actual values of the dependent variable. It provides a measure of how well the model's predictions align with the observed data [Powers, 2011, Scott, 2010, scikit-learn Developers, 2023].

Mathematically, the MSE is calculated using the following formula:

\begin{equation}
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\end{equation}

Here are the components in the formula:
- $ n $ represents the number of data points in the dataset.
- $ y_i $ signifies the actual value of the dependent variable for the $ i $-th data point.
- $ \hat{y}_i $ corresponds to the predicted value of the dependent variable for the $ i $-th data point based on the regression model.

Interpreting the MSE:
- The MSE measures the average squared difference between the predicted and actual values.
- A lower MSE indicates that the model's predictions are closer to the actual values, implying a better fit.
- A higher MSE suggests that the model's predictions deviate more from the actual values, indicating a poorer fit.

MSE is commonly used as a loss function to assess the performance of regression models during training. It plays a crucial role in model selection and comparison, helping to identify the model that provides the best balance between predictive accuracy and generalization.

It's important to note that while MSE is a valuable metric for evaluating model performance, it should be interpreted in conjunction with other metrics and diagnostic tools to ensure a comprehensive understanding of the model's strengths and limitations.

<font color='Blue'><b>Example:</b></font> Using the data from the previous example, we have:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1, figsize=(9.5, 4.5))
y_bar = np.mean(y)
markerline, _, _ = ax.stem(np.arange(y.shape[0]), ((y - y_hat)**2)/len(y),
                            linefmt='#e483b3', markerfmt='o')
_ = markerline.set_markerfacecolor('#a94375')
_ = ax.set(aspect='auto', xlabel='Index',
            ylabel=r'$\dfrac{(y - \hat{y})^2}{n}$',
            yscale='log',
            ylim=[1e-6, 1e-1])

plt.tight_layout()

In [None]:
import numpy as np

def MSE_score(y, y_hat):
    # Sum of Squares Error
    SSE = np.sum((y - y_hat)**2)
    return SSE/len(y)

# Calculate the Mean Squared Error to assess the goodness of fit
mse = MSE_score(y, y_hat)
print(f'MSE = {mse:.4f}')

Using sklearn's built-in `mean_squared_error`:

In [None]:
from sklearn.metrics import mean_squared_error
# Calculate the Mean Squared Error to assess the goodness of fit
mse = mean_squared_error(y, y_hat)
print(f'MSE = {mse:.4f}')

## Mean Absolute Error (MAE)

The Mean Absolute Error (MAE) is a fundamental metric used in regression analysis to assess the accuracy of a predictive model by measuring the average absolute differences between the predicted values and the actual values of the dependent variable. It provides insight into how closely the model's predictions align with the observed data [Powers, 2011, Scott, 2010, scikit-learn Developers, 2023].

Mathematically, the MAE is calculated as follows:

\begin{equation}
MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
\end{equation}

Here are the components in the formula:
- $ n $ represents the number of data points in the dataset.
- $ y_i $ denotes the actual value of the dependent variable for the $ i $-th data point.
- $ \hat{y}_i $ represents the predicted value of the dependent variable for the $ i $-th data point based on the regression model.

Interpreting the MAE:
- The MAE measures the average absolute difference between the predicted and actual values.
- A lower MAE indicates that the model's predictions are closer to the actual values, suggesting a better fit.
- A higher MAE implies that the model's predictions have larger absolute differences from the actual values, indicating a poorer fit.

MAE is often used in combination with other metrics to evaluate the performance of regression models, especially when dealing with outliers or when you want to penalize large prediction errors linearly. It provides valuable insights into the model's predictive accuracy and is a useful tool for model selection and comparison.
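
The difference between linear and quadratic penalties can be seen in a small hand-made comparison. The two hypothetical prediction sets below have the same total absolute error, but one spreads it evenly and the other concentrates it in a single outlier. All values are illustrative.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
pred_even = y_true + 2.0                             # every prediction is off by 2
pred_outlier = np.array([1.0, 2.0, 3.0, 4.0, 15.0])  # one prediction is off by 10

def mae_and_mse(actual, predicted):
    errors = actual - predicted
    return np.mean(np.abs(errors)), np.mean(errors ** 2)

mae_even, mse_even = mae_and_mse(y_true, pred_even)
mae_outlier, mse_outlier = mae_and_mse(y_true, pred_outlier)

print(f'even errors:   MAE = {mae_even:.1f}, MSE = {mse_even:.1f}')        # MAE = 2.0, MSE = 4.0
print(f'single outlier: MAE = {mae_outlier:.1f}, MSE = {mse_outlier:.1f}')  # MAE = 2.0, MSE = 20.0
```

MAE treats both cases identically, while MSE is five times larger when the error is concentrated in one point.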


In [None]:
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1, figsize=(9.5, 4.5))
y_bar = np.mean(y)
markerline, _, _ = ax.stem(np.arange(y.shape[0]), (abs(y - y_hat)/len(y)),
                            linefmt='#d45353', markerfmt='o')
_ = markerline.set_markerfacecolor('#7f1717')
_ = ax.set(aspect='auto', xlabel='Index',
            ylabel=r'$\dfrac{|y - \hat{y}|}{n}$',
            yscale='log',
            ylim=[1e-5, 1e-1])

plt.tight_layout()

In [None]:
import numpy as np

def MAE_score(y, y_hat):
    # Sum of Absolute Errors
    SAE = np.sum(abs(y - y_hat))
    return SAE/len(y)

# Calculate the Mean Absolute Error to assess the goodness of fit
mae = MAE_score(y, y_hat)
print(f'MAE = {mae:.4f}')

Using sklearn's built-in `mean_absolute_error`:

In [None]:
from sklearn.metrics import mean_absolute_error
# Calculate the Mean Absolute Error to assess the goodness of fit
mae = mean_absolute_error(y, y_hat)
print(f'MAE = {mae:.4f}')

## Correlation Matrix

The correlation matrix is a square matrix that displays the correlations between different variables (parameters) in a given dataset. Each element in the matrix represents the correlation coefficient between two variables. The correlation coefficient measures the strength and direction of the linear relationship between two variables. It ranges from -1 to +1, with -1 indicating a perfect negative linear relationship, +1 indicating a perfect positive linear relationship, and 0 indicating no linear relationship [Powers, 2011, Scott, 2010, scikit-learn Developers, 2023].

The correlation coefficient between two variables $X$ and $Y$ can be calculated using the formula:

\begin{align}
\text{Cor}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\end{align}

where:
- $n$ is the number of data points in the dataset.
- $x_i$ and $y_i$ are the values of the $X$ and $Y$ variables, respectively, for the $i$th data point.
- $\bar{x}$ and $\bar{y}$ are the mean values of $X$ and $Y$ variables, respectively.

The correlation matrix provides a convenient way to visualize the relationships between multiple variables in a dataset. It helps identify patterns of association and potential multicollinearity (high correlation between predictor variables) in regression models. A positive correlation coefficient indicates that the variables tend to increase or decrease together, while a negative correlation coefficient indicates an inverse relationship. A correlation coefficient close to 0 suggests little or no linear relationship between the variables.
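
As a sketch of the formula above, the following computes the correlation coefficient term by term and compares it with NumPy's `np.corrcoef`. The helper name `pearson_manual`, the variables `u` and `v`, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def pearson_manual(x, y):
    """Pearson correlation computed directly from the definition."""
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    return np.sum(x_dev * y_dev) / (
        np.sqrt(np.sum(x_dev ** 2)) * np.sqrt(np.sum(y_dev ** 2)))

# Synthetic data: v depends linearly on u, plus noise
rng = np.random.default_rng(0)
u = rng.normal(size=50)
v = 2 * u + rng.normal(scale=0.5, size=50)

r_manual = pearson_manual(u, v)
r_numpy = np.corrcoef(u, v)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
print(f'manual = {r_manual:.4f}, numpy = {r_numpy:.4f}')
```

`np.corrcoef` returns the full correlation matrix, so for two variables the coefficient of interest is the off-diagonal entry.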

<font color='Blue'><b>Example - Iris Flower Correlation Heatmaps:</b></font>

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset and format column titles and species names
iris = sns.load_dataset('iris')
iris.columns = [x.replace('_', ' ').title() for x in iris.columns]
iris.Species = iris.Species.map(lambda x: x.title())

# Display the formatted dataset
display(iris)

# Create a 2x2 grid of subplots with a shared y-axis
fig, axes = plt.subplots(2, 2, figsize=(9.5, 9), sharex=False, sharey=True)
axes = axes.ravel()
# Remove the last subplot
fig.delaxes(axes[-1])

# List of species (Using only the first three species)
species_list = iris['Species'].unique()[:3]

# Calculate and display the correlation matrices for each species
for i, (species, ax) in enumerate(zip(species_list, axes)):
    # Extract data for the current species
    subset = iris[iris['Species'] == species]

    # Calculate the correlation matrix for the subset
    correlation_matrix = subset.drop(columns=['Species']).corr()

    # Create a heatmap for the correlation matrix on the respective subplot
    sns.heatmap(correlation_matrix, ax=ax,
                annot=True, fmt='.2f',
                cmap='RdYlGn', linewidths=0.5, cbar=False,
                vmin=0, vmax=1,
                annot_kws={"fontsize": 14})

    # Set title for the subplot with proper formatting
    ax.set_title(f'Correlation Matrix for {species}', weight='bold')

    # Disable grid lines
    ax.grid(False)

plt.setp(axes[0].get_xticklabels(), visible=False)
# Add a single color bar outside the subplots
cax = fig.add_axes([0.92, 0.15, 0.02, 0.7])  # Define the position and size of the color bar
sm = plt.cm.ScalarMappable(cmap='RdYlGn')
sm.set_array([])
cbar = plt.colorbar(sm, ax=axes, cax=cax)
cbar.set_label('Correlation', rotation=90)

In [None]:
# Import necessary libraries
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

# Create a 2x2 grid of subplots
fig, axes = plt.subplots(2, 2, figsize=(9.5, 9))
ax = axes.ravel()

# Define species and corresponding variable pairs
species = ['Setosa', 'Versicolor', 'Versicolor', 'Virginica']
variables = [('Sepal Length', 'Sepal Width'), ('Petal Length', 'Sepal Length'),
             ('Petal Length', 'Petal Width'), ('Petal Length', 'Sepal Length')]

# Loop through species and variables
for i, spec in enumerate(species):
    # Select subset of data for the current species
    subset = iris[iris['Species'] == spec]
    x_var, y_var = variables[i]

    # Create a regression plot for the selected variables
    sns.regplot(x=subset[x_var], y=subset[y_var], ci=None,
                scatter_kws={'s': 40, 'fc' : '#7fbdd4', 'ec': '#335f8a'},
                line_kws={'color': 'red'}, ax=ax[i])

    # Calculate and display the Pearson correlation coefficient
    corr = stats.pearsonr(subset[x_var], subset[y_var]).statistic
    ax[i].set_title(f'{x_var} vs. {y_var} for {spec}. Corr = {corr:.3f}')

# Adjust subplot layout for better presentation
plt.tight_layout()

The heatmaps and the scatterplots above are complementary views of the same statistical relationships. Each heatmap summarizes, in matrix form, the correlation coefficients between all pairs of variables for one species, giving a compact overview of the strength and direction of every pairwise linear relationship. Each scatterplot, with its fitted regression line, examines a single pair of variables within one species and shows the individual observations behind the corresponding coefficient. Both are valuable tools in exploratory data analysis: the heatmap reveals the overall correlation pattern, while the scatterplot shows the detail of an individual relationship.

## Confusion Matrix

The confusion matrix is a fundamental tool used to evaluate the performance of classification algorithms by comparing their predictions against actual outcomes. It's especially important for understanding the types of errors a model makes. To explain the math behind the confusion matrix, let's consider a binary classification scenario, where there are two classes: "Positive" (P) and "Negative" (N). In this context, the confusion matrix is a 2x2 matrix with four entries [Powers, 2011, Scott, 2010, scikit-learn Developers, 2023]:


<center>
<img src="https://raw.githubusercontent.com/HatefDastour/hatefdastour.github.io/master/_notes/Introduction_to_Digital_Engineering/_images/CM.png" alt="picture" width="450">
</center>


Here's how each of these terms is defined:

- **True Positives (TP)**: The number of instances that are actually positive (P) and are correctly predicted as positive by the classification algorithm.

- **False Positives (FP)**: The number of instances that are actually negative (N) but are incorrectly predicted as positive (P) by the algorithm.

- **True Negatives (TN)**: The number of instances that are actually negative (N) and are correctly predicted as negative by the algorithm.

- **False Negatives (FN)**: The number of instances that are actually positive (P) but are incorrectly predicted as negative (N) by the algorithm.

These elements provide valuable information about the performance of a classification model. They can be used to calculate various metrics like accuracy, precision, recall (sensitivity), specificity, and F1-score, which provide deeper insights into the model's effectiveness in different aspects.

Keep in mind that the terms TP, FP, TN, and FN can have different interpretations depending on the context of the problem. For instance, in medical diagnostics, a false negative (FN) might mean failing to identify a disease in a patient, which could have significant consequences. Understanding the confusion matrix helps in understanding the strengths and weaknesses of a classification model in real-world applications.

<font color='Blue'><b>Confusion Matrix Example in Environmental Classification: Water Pollution Detection (Fictional):</b></font>
Consider the following fictional example of a confusion matrix in the context of environmental classification, specifically applied to the detection of pollution in water samples:

|                              | **Actual Polluted Water** | **Actual Clean Water** |
|:----------------------------:|:-------------------------:|:----------------------:|
| **Predicted Polluted Water** |               85          |            10          |
|   **Predicted Clean Water**  |               15          |           120          |

In this confusion matrix:

- "Actual Clean Water" represents water samples that are truly clean and free from pollution.
- "Actual Polluted Water" represents water samples that are truly contaminated or polluted.
- "Predicted Clean Water" represents water samples that the pollution detection system classified as clean, whether correctly or not.
- "Predicted Polluted Water" represents water samples that the pollution detection system classified as polluted, whether correctly or not.

Now, let's interpret the numbers:

- There are 120 water samples that are truly clean (Clean Water) and were correctly classified as clean (Predicted Clean).
- There are 15 water samples that are polluted (Polluted Water) but were incorrectly classified as clean (Predicted Clean).
- There are 10 water samples that are truly clean (Clean Water) but were incorrectly classified as polluted (Predicted Polluted).
- There are 85 water samples that are polluted (Polluted Water) and were correctly classified as polluted (Predicted Polluted).

In the context of the provided confusion matrix, which is used to evaluate the performance of a pollution detection system in water samples, we can interpret the terms TP (True Positives), FP (False Positives), TN (True Negatives), and FN (False Negatives) as follows:

1. **True Positives (TP):**
   - TP represents the cases where the pollution detection system correctly predicted that water samples were polluted (Predicted Polluted) and indeed, they were polluted (Actual Polluted Water).
   - In this example, there are 85 instances where the system correctly identified polluted water samples.

2. **False Positives (FP):**
   - FP represents the cases where the pollution detection system incorrectly predicted that water samples were polluted (Predicted Polluted), but in reality, they were clean (Actual Clean Water).
   - There are 10 instances where the system incorrectly classified clean water samples as polluted.

3. **True Negatives (TN):**
   - TN represents the cases where the pollution detection system correctly predicted that water samples were clean (Predicted Clean) and indeed, they were clean (Actual Clean Water).
   - In this example, there are 120 instances where the system correctly identified clean water samples.

4. **False Negatives (FN):**
   - FN represents the cases where the pollution detection system incorrectly predicted that water samples were clean (Predicted Clean), but in reality, they were polluted (Actual Polluted Water).
   - There are 15 instances where the system failed to detect pollution in water samples that were actually polluted.
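
The mapping from matrix cells to the four outcomes can be sketched in a few lines of Python. The dictionary layout and variable names below are illustrative assumptions, with "Polluted" treated as the positive class.

```python
# Confusion matrix of the Water Pollution Detection example,
# keyed by (actual class, predicted class).
confusion = {
    ("Polluted", "Polluted"): 85,  # true positives
    ("Clean", "Polluted"): 10,     # false positives
    ("Clean", "Clean"): 120,       # true negatives
    ("Polluted", "Clean"): 15,     # false negatives
}

positive, negative = "Polluted", "Clean"

tp = confusion[(positive, positive)]
fp = confusion[(negative, positive)]
tn = confusion[(negative, negative)]
fn = confusion[(positive, negative)]
total = tp + fp + tn + fn

print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}, total={total}")
```

Swapping the values of `positive` and `negative` would exchange TP with TN and FP with FN, which is how the per-class metrics for "Clean Water" are obtained.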

## Accuracy: Evaluating Correct Classifications

Accuracy is a metric that quantifies the ratio of correctly classified instances to the total predictions made by a model. It's determined using the elements of the confusion matrix [Powers, 2011, Scott, 2010, scikit-learn Developers, 2023]:

\begin{equation}
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\end{equation}

While commonly employed, accuracy may lead to misinterpretations, particularly in cases of imbalanced datasets. In such scenarios, where one class dominates, high accuracy can be achieved even when the model struggles to discern the minority class. Therefore, while accuracy provides an overall view of model performance, it's advisable to complement it with additional metrics, especially when dealing with imbalanced data distributions.

<font color='Blue'><b>Example:</b></font>
In the context of the Water Pollution Detection example, accuracy is computed using the following formula:

\begin{equation*}
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\end{equation*}

Where:
- TP (True Positives) denotes instances where the pollution detection system correctly predicted "Polluted Water" (85).
- TN (True Negatives) represents instances where the system correctly predicted "Clean Water" (120).
- FP (False Positives) correspond to instances where the system incorrectly predicted "Polluted Water" when it was actually "Clean Water" (10).
- FN (False Negatives) denote instances where the system incorrectly predicted "Clean Water" when it was actually "Polluted Water" (15).

Utilizing these values in the formula, we compute the accuracy as follows:

\begin{equation*}
\text{Accuracy} = \frac{85 + 120}{85 + 120 + 10 + 15} = \frac{205}{230} \approx 0.8913
\end{equation*}

Thus, the accuracy of the pollution detection system in this example is approximately 0.8913 or 89.13%.
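
The caveat about imbalanced data can be illustrated with a short sketch. The first call reproduces the example above; the second uses a hypothetical dataset (990 clean and 10 polluted samples, not part of the original example) and a model that labels every sample as clean.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Water Pollution Detection example
acc_example = accuracy(tp=85, tn=120, fp=10, fn=15)
print(f"Accuracy (example): {acc_example:.4f}")

# Hypothetical imbalanced case: 990 clean and 10 polluted samples,
# and a model that predicts "clean" for everything.
acc_naive = accuracy(tp=0, tn=990, fp=0, fn=10)
print(f"Accuracy (always predicts clean): {acc_naive:.4f}")
```

The second model detects none of the polluted samples, yet its accuracy is higher than that of the example system, which is why accuracy alone can mislead on skewed class distributions.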

## Precision

Precision measures the proportion of a model's positive predictions that are actually positive, so it decreases as false positives increase. It's derived from the confusion matrix elements [Powers, 2011, Scott, 2010, scikit-learn Developers, 2023]:

\begin{equation}
\text{Precision} = \frac{TP}{TP + FP}
\end{equation}

Precision matters most when false positives are costly. Where mistakenly predicting a positive outcome has notable consequences, precision indicates how well a model avoids erroneous positive predictions.

<font color='Blue'><b>Example:</b></font> In the context of the Water Pollution Detection example,

* **Precision for Polluted Water:**
\begin{equation*} \text{Precision} = \frac{TP}{TP + FP}  = \frac{85}{85 + 10} = \frac{85}{95} \approx 0.8947 \end{equation*}

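* **Precision for Clean Water** (treating Clean Water as the positive class, which swaps the roles of TP with TN and of FP with FN; this quantity is also called the negative predictive value):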
\begin{equation*} \text{Precision} = \frac{TN}{TN + FN} = \frac{120}{120 + 15} = \frac{120}{135} \approx 0.8889 \end{equation*}
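
As a minimal sketch, both per-class precisions can be computed with one helper. The function name `precision` and its zero-division convention are assumptions made for this illustration.

```python
def precision(correct, incorrect):
    """Share of the predictions for one class that are correct.

    Returns 0.0 if the class was never predicted (a convention chosen here).
    """
    predicted = correct + incorrect
    return correct / predicted if predicted else 0.0

tp, tn, fp, fn = 85, 120, 10, 15

# Predicted polluted: 85 correct, 10 incorrect
precision_polluted = precision(tp, fp)
# Predicted clean: 120 correct, 15 incorrect
precision_clean = precision(tn, fn)

print(f"Precision (polluted): {precision_polluted:.4f}")
print(f"Precision (clean):    {precision_clean:.4f}")
```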

## Recall (Sensitivity or True Positive Rate)

Recall, also known as sensitivity or the true positive rate, quantifies the proportion of actual positive instances that a model correctly identifies, so it decreases as false negatives increase. This metric is calculated using the confusion matrix elements [Powers, 2011, Scott, 2010, scikit-learn Developers, 2023]:

\begin{equation}
\text{Recall} = \frac{TP}{TP + FN}
\end{equation}

Recall matters most when false negatives must be avoided. When failing to detect positive cases has significant repercussions, recall indicates how few positive instances the model overlooks.

<font color='Blue'><b>Example:</b></font> In the context of the Water Pollution Detection example,

* **Recall for Polluted Water:**
\begin{equation*} \text{Recall} = \frac{TP}{TP + FN} =  \frac{85}{85 + 15} = \frac{85}{100} = 0.85 \end{equation*}

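* **Recall for Clean Water** (treating Clean Water as the positive class; this equals the specificity, or true negative rate, when Polluted Water is the positive class):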
\begin{equation*} \text{Recall} = \frac{TN}{TN + FP} = \frac{120}{120 + 10} = \frac{120}{130} \approx 0.9231 \end{equation*}
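
The following sketch computes both recalls together with their complements, the false negative rate and the false positive rate. Variable names are illustrative assumptions.

```python
tp, tn, fp, fn = 85, 120, 10, 15

# Recall for each class
recall_polluted = tp / (tp + fn)  # sensitivity / true positive rate
recall_clean = tn / (tn + fp)     # specificity / true negative rate

# Complementary error rates
false_negative_rate = fn / (fn + tp)  # equals 1 - recall_polluted
false_positive_rate = fp / (fp + tn)  # equals 1 - recall_clean

print(f"Recall (polluted):   {recall_polluted:.4f}")
print(f"Recall (clean):      {recall_clean:.4f}")
print(f"False negative rate: {false_negative_rate:.4f}")
print(f"False positive rate: {false_positive_rate:.4f}")
```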

## F1-Score

The F1-Score is the harmonic mean of precision and recall, so it accounts for both false positives and false negatives. It's calculated using the confusion matrix elements [Powers, 2011, Scott, 2010, scikit-learn Developers, 2023]:

\begin{equation}
\text{F1-Score} = 2 \times  \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{equation}

When a trade-off situation emerges between precision and recall, the F1-Score serves as a valuable indicator. In scenarios where achieving high precision might lower recall or vice versa, the F1-Score provides comprehensive insight into a model's performance by considering the balance between these two vital aspects of prediction accuracy.

<font color='Blue'><b>Example:</b></font>
In the context of the Water Pollution Detection example,

* **F1 Score for Polluted Water:**
\begin{equation*} \text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}  = 2 \cdot \frac{0.8947 \cdot 0.85}{0.8947 + 0.85} \approx 0.8718 \end{equation*}

* **F1 Score for Clean Water:**
\begin{equation*} \text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}  = 2 \cdot \frac{0.8889 \cdot 0.9231}{0.8889 + 0.9231} \approx 0.9057 \end{equation*}
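
To avoid rounding in the intermediate precision and recall values, the F1-Score can also be computed directly from the counts, since $2 \cdot \frac{PR}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}$. The sketch below checks that both forms agree; the function names are assumptions made for this illustration.

```python
def f1_from_rates(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def f1_from_counts(tp, fp, fn):
    """Algebraically equivalent form that uses the raw counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

tp, tn, fp, fn = 85, 120, 10, 15

# Polluted as the positive class
f1_polluted = f1_from_counts(tp, fp, fn)
f1_polluted_rates = f1_from_rates(tp / (tp + fp), tp / (tp + fn))

# Clean as the positive class: TN plays the role of TP, and FP/FN swap
f1_clean = f1_from_counts(tn, fn, fp)

print(f"F1 (polluted): {f1_polluted:.4f}")
print(f"F1 (clean):    {f1_clean:.4f}")
```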

## Balanced Accuracy

Balanced Accuracy is a metric that addresses the challenges posed by imbalanced datasets, where one class significantly outweighs the others. It averages the recall obtained on each class, so every class contributes equally regardless of its size, providing a more reliable measure of overall model performance [Powers, 2011, Scott, 2010, scikit-learn Developers, 2023].

The Balanced Accuracy is calculated as the average of the true positive rate (Recall) for each class:

\begin{equation}
\text{Balanced Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i}
\end{equation}

Where:
- $N$ is the number of classes.
- $TP_i$ is the true positive count for class $i$.
- $FN_i$ is the false negative count for class $i$.

Balanced Accuracy takes into account the individual class performance, making it particularly useful for evaluating models in scenarios where class distribution is skewed. It offers a fairer assessment of how well a model performs across all classes, providing a more accurate representation of its effectiveness on imbalanced datasets.

In binary classification scenarios, Balanced Accuracy can be defined as the average of two key metrics: sensitivity (the true positive rate) and specificity (the true negative rate). Alternatively, it can be likened to the area under the Receiver Operating Characteristic (ROC) curve when using binary predictions instead of continuous scores. Mathematically, this can be expressed as:

\begin{equation} \text{Balanced Accuracy} = \frac{1}{2} \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right) \end{equation}

This formula captures the balanced evaluation of a binary classification model's ability to correctly identify both positive and negative instances while considering the trade-off between them.

<font color='Blue'><b>Example:</b></font> In the context of the Water Pollution Detection example,
\begin{align*}
\text{Balanced Accuracy} &= \frac{1}{2} \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right) \\
&= \frac{\text{Recall for Polluted Water} + \text{Recall for Clean Water}}{2} \\
&= \frac{1}{2} \left( \frac{85}{100} + \frac{120}{130} \right) \approx 0.8865.
\end{align*}
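
The general $N$-class formula can be sketched as a function of a confusion matrix. The row and column layout (rows are actual classes, columns are predicted classes) and the choice to skip classes with no actual samples are assumptions made for this illustration.

```python
def balanced_accuracy(matrix):
    """Mean of per-class recall.

    matrix[i][j] is the number of samples whose actual class is i
    and whose predicted class is j.
    """
    recalls = []
    for i, row in enumerate(matrix):
        actual_count = sum(row)
        if actual_count:  # skip classes that never occur in the data
            recalls.append(row[i] / actual_count)
    return sum(recalls) / len(recalls)

# Rows: actual Clean, actual Polluted; columns: predicted Clean, predicted Polluted
water_matrix = [
    [120, 10],
    [15, 85],
]

bal_acc = balanced_accuracy(water_matrix)
print(f"Balanced accuracy: {bal_acc:.4f}")
```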


## Matthews Correlation Coefficient (MCC)

The Matthews Correlation Coefficient (MCC) is a metric used to assess the quality of binary classification models. It takes into account true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) to provide a balanced evaluation of classification performance, especially when dealing with imbalanced datasets [Powers, 2011, Scott, 2010, scikit-learn Developers, 2023].

MCC ranges from -1 to +1, where:
- +1 indicates perfect prediction.
- 0 indicates no better than random prediction.
- -1 indicates perfect disagreement between predictions and actual classes.

Mathematically, MCC is calculated as:

\begin{equation}
\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{equation}

Here's how the components of the formula contribute:

- The $TP \times TN$ term rewards correct positive and negative predictions.
- The $FP \times FN$ term penalizes false positive and false negative predictions.
- The denominator normalizes the score, accounting for class distribution.

Key points about MCC:

1. **Balanced Metric**: MCC is suitable for imbalanced datasets because it considers TP, TN, FP, and FN, providing a balanced evaluation.

2. **Symmetry**: MCC is symmetric, meaning that swapping the positive and negative classes doesn't affect the score.

3. **Robust to Class Imbalance**: Unlike accuracy, MCC isn't easily misled by imbalanced data.

4. **Usefulness**: MCC helps identify how well the model's predictions align with the actual classes, considering all four possible outcomes of binary classification.
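
The symmetry and imbalance properties listed above can be checked numerically. This is a minimal sketch: `mcc_from_counts` is a hypothetical helper, the zero-denominator convention of returning 0.0 is an assumption, and the imbalanced dataset (990 clean, 10 polluted) is invented for illustration.

```python
import math

def mcc_from_counts(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; 0.0 if the denominator is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

# Symmetry: swapping the positive and negative classes exchanges
# TP with TN and FP with FN, and leaves MCC unchanged.
original = mcc_from_counts(tp=85, tn=120, fp=10, fn=15)
swapped = mcc_from_counts(tp=120, tn=85, fp=15, fn=10)
print(f"MCC original: {original:.4f}, swapped: {swapped:.4f}")

# Imbalance: a model that labels all 1000 samples as clean reaches
# 99% accuracy, but its MCC is 0 under the convention used here.
naive_accuracy = (0 + 990) / 1000
naive_mcc = mcc_from_counts(tp=0, tn=990, fp=0, fn=10)
print(f"Naive model accuracy: {naive_accuracy:.2f}, MCC: {naive_mcc:.2f}")
```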

<font color='Blue'><b>Example:</b></font>
We can also calculate the Matthews Correlation Coefficient (MCC) for the Water Pollution Detection example as follows:

In [None]:
import math

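def calculate_mcc(tp, tn, fp, fn):
    """Compute the Matthews Correlation Coefficient from confusion matrix counts."""
    numerator = (tp * tn) - (fp * fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

    # If any marginal total is zero the denominator vanishes; return 0.0 by convention
    mcc = numerator / denominator if denominator != 0 else 0.0

    return mcc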

mcc = calculate_mcc(tp=85, tn=120, fp=10, fn=15)
print(f"Matthews Correlation Coefficient (MCC): {mcc:.4f}")
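
For reference, substituting the example counts into the MCC formula by hand gives the following. The intermediate values are worked out here from the counts and rounded to four decimal places.

\begin{equation*}
\text{MCC} = \frac{85 \times 120 - 10 \times 15}{\sqrt{(85 + 10)(85 + 15)(120 + 10)(120 + 15)}} = \frac{10050}{\sqrt{166725000}} \approx 0.7783
\end{equation*}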

## Cohen's Kappa
Cohen's Kappa, often referred to simply as Kappa, is a statistic used to measure the level of agreement between two raters or evaluators, especially in situations involving categorical data or classification tasks. It's particularly useful when assessing the reliability or consistency of human judgments or the agreement between human judgments and machine predictions [Powers, 2011, Scott, 2010, scikit-learn Developers, 2023].

Kappa takes into account the agreement that could occur by chance and provides a normalized score that indicates the extent to which the observed agreement between raters or evaluators is beyond what could be expected by chance alone.

The formula for Cohen's Kappa is as follows:

\begin{equation}
\text{Kappa} = \frac{P_o - P_e}{1 - P_e}
\end{equation}

Where:
- $ P_o $ is the observed proportion of agreement between the raters or evaluators.
- $ P_e $ is the proportion of agreement expected by chance, calculated by multiplying the two raters' marginal proportions for each category and summing these products over all categories.

Key points about Cohen's Kappa:

1. **Range of Values**: Kappa values range from -1 to +1.
   - $ +1 $: Perfect agreement between raters.
   - $ 0 $: Agreement equivalent to chance.
   - $ -1 $: Complete disagreement between raters.

2. **Adjustment for Chance**: Kappa adjusts for agreement that could occur by random chance.

3. **Interpretation**: The interpretation of Kappa values varies, but typically:
   - $ 0.2 $ or less: Poor agreement.
   - $ 0.21 - 0.40 $: Fair agreement.
   - $ 0.41 - 0.60 $: Moderate agreement.
   - $ 0.61 - 0.80 $: Substantial agreement.
   - $ 0.81 - 1.00 $: Almost perfect agreement.

4. **Considerations**: Cohen's Kappa is sensitive to the distribution of categories and can be affected by the prevalence of a particular category.
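
The interpretation bands listed above can be turned into a small lookup. This is only a sketch: `interpret_kappa` is a hypothetical helper, and assigning values that fall between two listed bands (for example 0.205) to the higher band is an assumption.

```python
def interpret_kappa(kappa):
    """Return the agreement band for a kappa value, using the cut-offs listed above."""
    bands = [
        (0.20, "Poor agreement"),
        (0.40, "Fair agreement"),
        (0.60, "Moderate agreement"),
        (0.80, "Substantial agreement"),
        (1.00, "Almost perfect agreement"),
    ]
    for upper_bound, label in bands:
        if kappa <= upper_bound:
            return label
    raise ValueError("kappa cannot exceed 1")

print(interpret_kappa(0.7776))
```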

Cohen's Kappa is a valuable tool in fields such as inter-rater reliability studies, medical diagnoses, and social sciences, where assessing agreement among raters is crucial. It provides insight into the reliability of judgments beyond raw agreement percentages and helps ensure the consistency and quality of human or machine-based evaluations.

In the context of a binary classification confusion matrix used in machine learning and statistics, Cohen's Kappa formula can be expressed as follows:

\begin{equation}
\kappa = \frac{2 \times (TP \times TN - FP \times FN)}{(TP + FP) \times (FP + TN) + (TP + FN) \times (FN + TN)}
\end{equation}

Where:
- TP represents the true positives.
- FP represents the false positives.
- TN represents the true negatives.
- FN represents the false negatives.

In this binary setting, Cohen's Kappa quantifies how far the agreement between predicted and actual classes exceeds what would be expected by chance. Notably, it is equivalent to the Heidke skill score used in meteorology, a measure first introduced by Myrick Haskell Doolittle in 1888.

<font color='Blue'><b>Example:</b></font> In the Water Pollution Detection example, we can calculate Cohen's Kappa as follows:

**Total Number of Samples ($N$):**
\begin{equation}
N = \text{TP} + \text{TN} + \text{FP} + \text{FN} = 85 + 120 + 10 + 15 = 230
\end{equation}

**Observed Agreement ($P_o$):**
\begin{equation}
P_o = \frac{\text{TP} + \text{TN}}{N} = \frac{85 + 120}{230} = \frac{205}{230}
\end{equation}

**Probabilities of Random Agreement ($P_{e,\text{polluted}}, P_{e,\text{clean}}$):**
\begin{equation}
P_{e,\text{polluted}} = \frac{(\text{TP} + \text{FP}) \times (\text{TP} + \text{FN})}{N^2} = \frac{(85 + 10) \times (85 + 15)}{230^2}
\end{equation}
\begin{equation}
P_{e,\text{clean}} = \frac{(\text{TN} + \text{FP}) \times (\text{TN} + \text{FN})}{N^2} = \frac{(120 + 10) \times (120 + 15)}{230^2}
\end{equation}

**Cohen's Kappa ($\kappa$):**
\begin{equation}
\kappa = \frac{P_o - (P_{e,\text{polluted}} + P_{e,\text{clean}})}{1 - (P_{e,\text{polluted}} + P_{e,\text{clean}})} \approx 0.7776
\end{equation}

In [None]:
def calculate_cohens_kappa(tp, tn, fp, fn):
    # Calculate the total number of samples
    total_samples = tp + tn + fp + fn

    # Calculate observed agreement (po)
    po = (tp + tn) / total_samples

    # Calculate the probabilities of random agreement for each category
    pe_polluted = ((tp + fp) / total_samples) * ((tp + fn) / total_samples)
    pe_clean = ((tn + fp) / total_samples) * ((tn + fn) / total_samples)

    # Calculate Cohen's Kappa
    kappa = (po - (pe_polluted + pe_clean)) / (1 - (pe_polluted + pe_clean))

    return kappa

# Define the values (you can replace these with your specific values)
tp = 85
tn = 120
fp = 10
fn = 15

# Calculate Cohen's Kappa using the function
kappa = calculate_cohens_kappa(tp, tn, fp, fn)
print(f"Cohen's Kappa: {kappa:.4f}")

Calculating Cohen's Kappa for the Water Pollution Detection example using the binary formula:

\begin{align}
\kappa &= \frac{2 \times (TP \times TN - FP \times FN)}{(TP + FP) \times (FP + TN) + (TP + FN) \times (FN + TN)} \\
&= \frac{2 \times (85 \times 120 - 10 \times 15)}{(85 + 10) \times (10 + 120) + (85 + 15) \times (15 + 120)} \\
&= \frac{2 \times (10200 - 150)}{(95) \times (130) + (100) \times (135)} \\
&= \frac{2 \times 10050}{12350 + 13500} \\
&= \frac{20100}{25850} \approx 0.7776
\end{align}

In [None]:
def calculate_cohens_kappa_alt(tp, tn, fp, fn):
    # Calculate Cohen's Kappa using the alternative formula
    kappa = (2 * (tp * tn - fp * fn)) / ((tp + fp) * (fp + tn) + (tp + fn) * (fn + tn))
    return kappa

# Define the values (you can replace these with your specific values)
tp = 85
tn = 120
fp = 10
fn = 15

# Calculate Cohen's Kappa using the alternative formula and the function
kappa = calculate_cohens_kappa_alt(tp, tn, fp, fn)
print(f"Cohen's Kappa: {kappa:.4f}")

## Scikit-learn's metrics

Scikit-learn implements these metrics as functions in `sklearn.metrics`, such as `sklearn.metrics.accuracy_score`, `sklearn.metrics.precision_score`, `sklearn.metrics.recall_score`, and `sklearn.metrics.f1_score`. Interpreting their outputs in terms of the confusion matrix gives practitioners clearer insight into a model's capabilities and limitations, which supports informed decisions throughout model development and deployment [Pedregosa et al., 2011, Powers, 2011, scikit-learn Developers, 2023].


| Metric                           | Description                                                                                         | Formula or Usage                                                                                   |
|----------------------------------|-----------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|
| `accuracy_score`                 | Computes accuracy, the ratio of correctly predicted instances to the total instances.               | `accuracy_score(y_true, y_pred)`                                                                   |
| `balanced_accuracy_score`        | Computes balanced accuracy, the average of the recall obtained on each class.                       | `balanced_accuracy_score(y_true, y_pred)`                                                          |
| `precision_score`                | Measures the proportion of true positive predictions among all positive predictions.                | `precision_score(y_true, y_pred)`                                                                  |
| `recall_score`                   | Computes the proportion of true positive predictions among actual positive instances.               | `recall_score(y_true, y_pred)`                                                                     |
| `f1_score`                       | Combines precision and recall into a single score, their harmonic mean.                             | `f1_score(y_true, y_pred)`                                                                         |
| Specificity (true negative rate) | Measures the proportion of true negative predictions among actual negative instances.               | No dedicated function; `TN / (TN + FP)`, or `recall_score(y_true, y_pred, pos_label=0)` when the negative class is labeled `0` |
| Negative predictive value        | Measures the proportion of true negative predictions among all predicted negatives.                 | No dedicated function; `TN / (TN + FN)`                                                            |
| False positive rate              | Computes the proportion of false positive predictions among actual negative instances.              | No dedicated function; `FP / (FP + TN)`, which equals `1 - specificity`                            |
| False negative rate              | Measures the rate of false negative predictions among actual positive instances.                    | No dedicated function; `FN / (FN + TP)`, which equals `1 - recall`                                 |
| `matthews_corrcoef`              | Incorporates true positives, true negatives, false positives, and false negatives into one value.   | `matthews_corrcoef(y_true, y_pred)`                                                                |
| `cohen_kappa_score`              | Measures agreement between raters or evaluators while considering chance.                           | `cohen_kappa_score(y1, y2)`                                                                        |
| ...                              | Additional metrics such as AUC-ROC, AUC-PR, log loss, etc.                                          | Various functions available in `sklearn.metrics`                                                   |
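
As a hedged illustration of how these functions are called, the sketch below rebuilds label vectors from the Water Pollution Detection counts and passes them to `sklearn.metrics`. The label encoding (1 for polluted, 0 for clean) and the variable names are assumptions made for this example, and scikit-learn must be installed.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    cohen_kappa_score,
    confusion_matrix,
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
)

# Rebuild label vectors from the example counts (1 = polluted, 0 = clean).
tp, fn, fp, tn = 85, 15, 10, 120
y_true = [1] * tp + [1] * fn + [0] * fp + [0] * tn
y_pred = [1] * tp + [0] * fn + [1] * fp + [0] * tn

# For labels {0, 1}, ravel() returns the counts in the order TN, FP, FN, TP.
tn_, fp_, fn_, tp_ = confusion_matrix(y_true, y_pred).ravel()

scores = {
    "accuracy": accuracy_score(y_true, y_pred),
    "balanced accuracy": balanced_accuracy_score(y_true, y_pred),
    "precision (polluted)": precision_score(y_true, y_pred),
    "recall (polluted)": recall_score(y_true, y_pred),
    # Specificity is the recall of the negative class (label 0).
    "specificity": recall_score(y_true, y_pred, pos_label=0),
    "F1 (polluted)": f1_score(y_true, y_pred),
    "MCC": matthews_corrcoef(y_true, y_pred),
    "Cohen's kappa": cohen_kappa_score(y_true, y_pred),
}

for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

By default, `precision_score`, `recall_score`, and `f1_score` report the metric for the class labeled `1` in binary problems, which is why "polluted" is encoded as `1` here.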
