<font>
<div dir=ltr align=center>
<img src='Sharif_logo.png' width=250 height=250> <br>
<font color=0F5298 size=7>
Applied Data Science<br>
<font color=2565AE size=5>
Spring 2025<br>
<font color=3C99D size=5>
HW5 - Accuracy Metrics <br>
<font color=696880 size=4>
Ali Mohammadzade Shabestari - 401106482 - Computer Engineering



# 1. Import Libraries

In [26]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import Binarizer, KBinsDiscretizer
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error, r2_score
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report, hamming_loss
import matplotlib.pyplot as plt

# 2. Load Dataset

In this notebook, we use the diabetes dataset, loaded with `load_diabetes` from the `scikit-learn` library.

In [27]:
diabetes = load_diabetes()

# Create DataFrame
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Add target variable
df['target'] = diabetes.target

df.head()

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 |


# 3.1 Regression Task

In machine learning, it is essential to evaluate the performance of a regression model on unseen data. To achieve this, the dataset is typically split into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance. This process helps in assessing how well the model generalizes to new data.

The `train_test_split` function from the `sklearn.model_selection` module is commonly used to split the dataset into training and testing sets. It randomly divides the data into specified proportions, ensuring that the model is trained and tested on different subsets of the data.

In [28]:
# Split data
X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

With the data split, we can proceed with training a regression model on all available features (no further preprocessing or feature selection is applied in this notebook). Here, we use the `LinearRegression` model from `scikit-learn`.


In [29]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# 3.2 Regression Accuracy Metrics

When evaluating the performance of a regression model, several metrics can be used to measure how well the model's predictions match the actual values. Here are some commonly used metrics:

### Mean Squared Error (MSE)
Mean Squared Error (MSE) is the average of the squared differences between the predicted and actual values. It is calculated as:

$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations. MSE gives a higher weight to larger errors, making it sensitive to outliers.
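As a quick check of the formula, the sketch below computes MSE by hand in plain Python. The values are made up for illustration and are not taken from the diabetes data; `mse` and `largest_share` are names introduced here. It also reports how much of the total squared error comes from the single largest miss.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative values only (not from the diabetes dataset).
y_true_demo = [10.0, 20.0, 30.0]
y_pred_demo = [11.0, 22.0, 40.0]  # errors of 1, 2 and 10

mse_demo = mse(y_true_demo, y_pred_demo)  # (1 + 4 + 100) / 3

# Share of the total squared error contributed by the largest miss.
squared_errors = [(t - p) ** 2 for t, p in zip(y_true_demo, y_pred_demo)]
largest_share = max(squared_errors) / sum(squared_errors)

print(f"MSE: {mse_demo:.1f}")
print(f"Share of squared error from the largest miss: {largest_share:.1%}")
```

In this toy case the error of 10 is responsible for most of the MSE, even though it is only one of three predictions.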

### Mean Absolute Error (MAE)
Mean Absolute Error (MAE) is the average of the absolute differences between the predicted and actual values. It is calculated as:

$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$

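MAE weights every error in proportion to its size, so it does not emphasize large errors as much as MSE. It is also expressed in the same units as the target variable, which makes it easy to interpret.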
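To make the difference between MAE and MSE concrete, the following sketch compares two sets of made-up predictions (illustrative values, not from the diabetes data). Both sets have the same total absolute error, but in one of them the error is concentrated in a single prediction.

```python
def mae(y_true, y_pred):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative values only.
y_actual = [100.0, 150.0, 200.0, 250.0]
y_spread = [110.0, 140.0, 210.0, 240.0]  # every prediction is off by 10
y_single = [100.0, 150.0, 200.0, 290.0]  # three exact hits, one miss of 40

mae_spread, mae_single = mae(y_actual, y_spread), mae(y_actual, y_single)
mse_spread, mse_single = mse(y_actual, y_spread), mse(y_actual, y_single)

print(f"MAE: spread={mae_spread:.1f}, single={mae_single:.1f}")
print(f"MSE: spread={mse_spread:.1f}, single={mse_single:.1f}")
```

MAE cannot tell the two cases apart, while MSE is four times larger when the error sits in one prediction.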

### Mean Absolute Percentage Error (MAPE)
Mean Absolute Percentage Error (MAPE) is the average of the absolute percentage differences between the predicted and actual values. It is calculated as:

$\text{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$

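MAPE expresses the error relative to the actual values, making it easier to interpret in the context of the data. Note that scikit-learn's `mean_absolute_percentage_error` returns the result as a fraction, without the factor of 100 in the formula above, so a returned value of 0.1 corresponds to 10%. The formula is undefined when an actual value $y_i$ is zero.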
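The sketch below computes MAPE by hand on made-up values and prints it both as a fraction and as a percentage, since the two conventions are easy to mix up. The helper name `mape_fraction` is introduced here for illustration, and the function assumes that no actual value is zero.

```python
def mape_fraction(y_true, y_pred):
    """Mean absolute relative error, as a fraction (multiply by 100 for %).

    Assumes every actual value is non-zero.
    """
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative values only (not from the diabetes dataset).
y_actual = [50.0, 100.0, 200.0]
y_hat = [60.0, 100.0, 150.0]  # relative errors: 0.20, 0.00, 0.25

mape_frac = mape_fraction(y_actual, y_hat)
mape_pct = 100 * mape_frac

print(f"MAPE as a fraction:   {mape_frac:.3f}")
print(f"MAPE as a percentage: {mape_pct:.1f}%")
```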

### R-squared (R²)
R-squared (R²) is a statistical measure that represents the proportion of the variance in the dependent variable that is explained by the independent variables in the model. It is calculated as:

$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$

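where $\bar{y}$ is the mean of the actual values. For a linear model with an intercept, R² lies between 0 and 1 on the training data. On held-out data it can be negative, which happens when the model predicts worse than always guessing the mean. Higher values indicate a better fit of the model to the data.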
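A small hand computation helps to anchor the scale of R². The sketch below, on made-up values, evaluates three reference cases: perfect predictions, always predicting the mean, and predictions that are worse than the mean. The function name `r2` and the data are illustrative.

```python
def r2(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Illustrative values only.
y_actual = [1.0, 2.0, 3.0, 4.0]

r2_perfect = r2(y_actual, y_actual)               # exact predictions
r2_mean = r2(y_actual, [2.5, 2.5, 2.5, 2.5])      # always predict the mean
r2_reversed = r2(y_actual, [4.0, 3.0, 2.0, 1.0])  # worse than the mean

print(r2_perfect, r2_mean, r2_reversed)
```

The third case yields a negative score, which is the situation to watch for when evaluating on a test set.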

Each of these metrics provides different insights into the performance of a regression model, and they are often used together to get a comprehensive understanding of the model's accuracy.


In [30]:
# Regression Metrics
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Regression Metrics:\nMSE: {mse:.3f}\nMAE: {mae:.3f}\nMAPE: {mape:.3f}\nR2 Score: {r2:.3f}\n")

Regression Metrics:
MSE: 2900.194
MAE: 42.794
MAPE: 0.375
R2 Score: 0.453



## Interpretation

- <b>MSE:</b> An MSE of 2900.194 corresponds to a root mean squared error of about 53.85 units. Because errors are squared, a few large misses contribute disproportionately to this value.

- <b>MAE:</b> On average, the model’s predictions are off by about 42.79 units. The target in this dataset ranges from 25 to 346, so this is a substantial error.

- <b>MAPE:</b> scikit-learn reports MAPE as a fraction, so 0.375 means the model’s predictions are, on average, 37.5% off from the actual values, which is relatively high.

- <b>R2:</b> The model explains only 45.3% of the variance in the test targets, meaning that more than half of the variability in the target variable remains unexplained. This suggests that the model may be underfitting or missing important features.
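One way to put these numbers in context is to compare them with a baseline that always predicts the mean of the test targets. Since $R^2 = 1 - \text{MSE} / \text{MSE}_{\text{baseline}}$, the baseline error can be estimated from the reported values alone. The sketch below does this arithmetic; the result is approximate because the reported R² is rounded to three decimals.

```python
import math

# Values copied from the regression output above.
mse_reported = 2900.194
r2_reported = 0.453  # rounded, so the derived numbers are approximate

# RMSE puts the squared error back on the scale of the target.
rmse = math.sqrt(mse_reported)

# R^2 = 1 - MSE / MSE_baseline  =>  MSE_baseline = MSE / (1 - R^2),
# where the baseline always predicts the mean of the test targets.
baseline_mse = mse_reported / (1 - r2_reported)
baseline_rmse = math.sqrt(baseline_mse)

print(f"Model RMSE:           {rmse:.2f}")
print(f"Mean-baseline RMSE: ~ {baseline_rmse:.1f}")
```

Under this estimate, the linear model reduces the typical error noticeably compared with the mean-only baseline, but a large part of the spread in the target remains.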

# 4.1 Binary Classification Task

In the context of a regression problem, binary classification can be applied by converting the continuous target variable into a binary variable. This is typically done by defining a threshold value, above which the target variable is classified as one class (e.g., 1) and below which it is classified as the other class (e.g., 0).

For example, in the diabetes dataset, we could convert the continuous target variable (which represents a quantitative measure of disease progression) into a binary variable indicating whether the progression is above or below a certain threshold. This allows us to apply binary classification techniques to predict whether a patient's disease progression is high or low.

Binary classification accuracy metrics, such as precision, recall, F1-score, and AUC-ROC, can then be used to evaluate the performance of the classification model. These metrics provide insights into the model's ability to correctly classify instances into the two classes, which is particularly useful when dealing with imbalanced datasets.


In [31]:
# Convert target into binary: High (>=140) vs Low (<140)
y_binary = (df['target'] >= 140).astype(int)
X_train_bin, X_test_bin, y_train_bin, y_test_bin = train_test_split(df.drop(columns=['target']), y_binary, test_size=0.2, random_state=42)
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train_bin, y_train_bin)
y_pred_bin = classifier.predict(X_test_bin)

# 4.2 Binary Classification Accuracy Metrics

### Precision
Precision is the ratio of correctly predicted positive observations to the total predicted positives. It answers the question: "Of all the instances that were predicted as positive, how many were actually positive?" Precision is calculated as:

$\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}$

### Recall
Recall (also known as Sensitivity or True Positive Rate) is the ratio of correctly predicted positive observations to all the observations in the actual class. It answers the question: "Of all the instances that were actually positive, how many were correctly predicted as positive?" Recall is calculated as:

$\text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}$

### F1-Score
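The F1-Score is the harmonic mean of Precision and Recall. Unlike the arithmetic mean, it is pulled toward the smaller of the two values, so a model cannot score well by excelling at only one of them. It is especially useful when you need a balance between Precision and Recall, and when you have an uneven class distribution. The F1-Score is calculated as: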

$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

These metrics provide different insights into the performance of a classification model, and they are often used together to get a comprehensive understanding of the model's accuracy.
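The three formulas can be checked directly from confusion-matrix counts. The sketch below uses hypothetical counts (they are not the counts of the classifier in this notebook) and a helper named `precision_recall_f1` introduced here for illustration. It also compares F1 with the arithmetic mean of precision and recall.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, chosen only for illustration.
tp, fp, fn = 8, 2, 4

precision_demo, recall_demo, f1_demo = precision_recall_f1(tp, fp, fn)
arithmetic_mean = (precision_demo + recall_demo) / 2

print(f"Precision: {precision_demo:.3f}")
print(f"Recall:    {recall_demo:.3f}")
print(f"F1-Score:  {f1_demo:.3f} (arithmetic mean would be {arithmetic_mean:.3f})")
```

The F1-Score comes out slightly below the arithmetic mean because it gives more weight to the lower of the two values.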

In [32]:
precision_bin = precision_score(y_test_bin, y_pred_bin)
recall_bin = recall_score(y_test_bin, y_pred_bin)
f1_bin = f1_score(y_test_bin, y_pred_bin)

print(f"Binary Classification Metrics:\nPrecision: {precision_bin:.3f}\nRecall: {recall_bin:.3f}\nF1-Score: {f1_bin:.3f}\n")

Binary Classification Metrics:
Precision: 0.732
Recall: 0.714
F1-Score: 0.723



## Interpretation

- <b>Precision:</b> Of all the instances predicted as positive, 73.2% were actually positive, so 26.8% of the positive predictions were false positives.

- <b>Recall:</b> The model identified 71.4% of the actual positive cases and missed the remaining 28.6% (false negatives).

- <b>F1:</b> The F1-score of 72.3% lies between precision and recall, which are close to each other here, so the model is not trading one heavily for the other. When precision and recall differ significantly, the F1-score is more informative than either metric alone.

# 5.1 Multi-class Classification Task

In the context of a regression problem, multiclass classification can be applied by converting the continuous target variable into multiple discrete classes. This is typically done by defining multiple threshold values, which divide the target variable into several classes. For example, in the diabetes dataset, we could convert the continuous target variable (which represents a quantitative measure of disease progression) into multiple classes indicating different levels of progression (e.g., low, medium, high).

By applying multiclass classification techniques, we can gain a more nuanced understanding of the model's performance and its ability to distinguish between different levels of the target variable. This approach is especially valuable in scenarios where the target variable exhibits a wide range of values and a simple binary classification would not capture the complexity of the data.

In [33]:
# Discretizing target into 3 categories (Low, Medium, High)
encoder = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
y_multi = encoder.fit_transform(df[['target']]).astype(int).flatten()
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(df.drop(columns=['target']), y_multi, test_size=0.2, random_state=42)

classifier_multi = LogisticRegression(max_iter=1000)
classifier_multi.fit(X_train_multi, y_train_multi)
y_pred_multi = classifier_multi.predict(X_test_multi)
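With `strategy='uniform'`, the target range is cut into intervals of equal width, not into groups of equal size, so the resulting classes can hold very different numbers of samples when the target is skewed. The following is a minimal pure-Python sketch of equal-width binning, not scikit-learn's implementation. The helper names are hypothetical, and the range of 25 to 346 is taken from the dataset description.

```python
from bisect import bisect_right

def uniform_bin_edges(lo, hi, n_bins):
    """Return n_bins + 1 equally spaced edges between lo and hi."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

def assign_bin(value, edges):
    """Map a value to a bin index using the interior edges only,
    so the maximum value falls into the last bin."""
    return bisect_right(edges[1:-1], value)

# Assumption: the diabetes target spans 25 to 346.
edges = uniform_bin_edges(25.0, 346.0, 3)

# A few target values, including the maximum.
sample_targets = [75.0, 141.0, 206.0, 346.0]
sample_bins = [assign_bin(v, edges) for v in sample_targets]

print("Bin edges:", edges)
print("Bins:     ", sample_bins)
```

If balanced classes are preferred, `strategy='quantile'` is the option in `KBinsDiscretizer` that aims for roughly equal counts per bin.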

# 5.2 Multi-class Classification Accuracy Metrics

### Precision for Each Class
Precision for each class is the ratio of correctly predicted instances of that class to the total instances predicted as that class. It answers the question: "Of all the instances predicted as a specific class, how many were actually of that class?" High precision indicates a low false positive rate.

### Recall for Each Class
Recall for each class is the ratio of correctly predicted instances of that class to the total actual instances of that class. It answers the question: "Of all the instances that actually belong to a specific class, how many were correctly predicted?" High recall indicates a low false negative rate.

### Macro-Averaged F1-Score
Macro-averaged F1-Score calculates the F1-Score for each class independently and then takes the unweighted average. This treats all classes equally, regardless of their frequency, so a rare class that is predicted poorly lowers the score as much as a common one. It is useful when performance on every class matters, including the minority classes.

### Weighted-Averaged F1-Score
Weighted-averaged F1-Score calculates the F1-Score for each class independently and then takes the average, weighted by the number of true instances for each class. This accounts for class imbalance by giving more importance to classes with more instances.

### Micro-Averaged F1-Score
Micro-averaged F1-Score aggregates the contributions of all classes before computing the score. It sums the true positives, false negatives, and false positives across all classes and then computes a single F1-Score from these totals. In single-label multi-class problems, every misclassification counts as one false positive and one false negative, so the micro-averaged F1-Score equals the accuracy. Because each sample counts equally, the score is dominated by the frequent classes and can hide poor performance on rare ones.
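The three averaging schemes can be compared on a small example. The sketch below uses made-up labels with an imbalanced class distribution (six, three, and one sample) and computes each average from per-class counts in plain Python. The helper names are introduced here for illustration; F1 is computed as $2\,TP / (2\,TP + FP + FN)$, which is algebraically the same as the harmonic mean of precision and recall.

```python
def class_counts(y_true, y_pred, label):
    """Return (TP, FP, FN) for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up labels: class 0 is frequent, class 2 is rare and never predicted.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred_toy = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]
labels = [0, 1, 2]

counts = {c: class_counts(y_true_toy, y_pred_toy, c) for c in labels}
f1_per_class = {c: f1_from_counts(*counts[c]) for c in labels}
support = {c: y_true_toy.count(c) for c in labels}

# Macro: plain average over classes.
macro_f1 = sum(f1_per_class.values()) / len(labels)
# Weighted: average weighted by the number of true instances per class.
weighted_f1 = sum(f1_per_class[c] * support[c] for c in labels) / len(y_true_toy)
# Micro: pool the counts over all classes, then compute one F1.
total_tp, total_fp, total_fn = (sum(col) for col in zip(*counts.values()))
micro_f1 = f1_from_counts(total_tp, total_fp, total_fn)

accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)

print(f"Macro F1:    {macro_f1:.3f}")
print(f"Weighted F1: {weighted_f1:.3f}")
print(f"Micro F1:    {micro_f1:.3f} (accuracy: {accuracy:.3f})")
```

The ignored rare class pulls the macro average down the most, while the micro average matches the accuracy.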

In [34]:
print("Multi-class Classification Report:")
print(classification_report(y_test_multi, y_pred_multi, digits=3))

Multi-class Classification Report:
              precision    recall  f1-score   support

           0      0.642     0.956     0.768        45
           1      0.545     0.375     0.444        32
           2      0.000     0.000     0.000        12

    accuracy                          0.618        89
   macro avg      0.396     0.444     0.404        89
weighted avg      0.521     0.618     0.548        89





### Interpretation

- Class 2 is never predicted (precision and recall are both 0 on its 12 test samples) → Try balancing the dataset or adjusting class weights.

- Class 1 has poor recall (0.375) → Consider feature selection or a different model.

- The majority class (0) dominates predictions (recall 0.956, but precision only 0.642) → A weighted loss function or oversampling may help.

# 6. Football Problem

### Hamming Loss

Hamming Loss is a metric used to evaluate the performance of a multi-label classification model. It measures the fraction of individual labels that are incorrectly predicted, counted across all samples and all labels.

The formula for Hamming Loss is:

$$
\text{Hamming Loss} = \frac{1}{n_{\text{samples}} \times n_{\text{labels}}} \sum_{i=1}^{n_{\text{samples}}} \sum_{j=1}^{n_{\text{labels}}} \mathbf{1}(y_{ij} \neq \hat{y}_{ij})
$$

where:
- $n_{\text{samples}}$ is the number of samples.

- $n_{\text{labels}}$ is the number of labels.

- $y_{ij}$ is the true value of the $j$-th label for the $i$-th sample.

- $\hat{y}_{ij}$ is the predicted value of the $j$-th label for the $i$-th sample.

- $\mathbf{1}$ is the indicator function that returns 1 if $y_{ij} \neq \hat{y}_{ij}$ and 0 otherwise.

A lower Hamming Loss indicates better performance, as it means fewer labels are incorrectly predicted.
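The double sum in the formula amounts to counting mismatched cells in two label matrices of the same shape. A minimal sketch in plain Python, with a made-up 3 × 4 example and a helper name (`hamming_loss_manual`) chosen here for illustration:

```python
def hamming_loss_manual(y_true, y_pred):
    """Fraction of label positions where prediction and truth differ."""
    n_samples = len(y_true)
    n_labels = len(y_true[0])
    mismatches = sum(
        t != p
        for row_true, row_pred in zip(y_true, y_pred)
        for t, p in zip(row_true, row_pred)
    )
    return mismatches / (n_samples * n_labels)

# Made-up example: 3 samples, 4 binary labels each.
y_true_small = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
]
y_pred_small = [
    [1, 0, 0, 0],  # third label wrong
    [0, 1, 1, 1],  # fourth label wrong
    [1, 0, 0, 0],  # second label wrong
]

loss_small = hamming_loss_manual(y_true_small, y_pred_small)
print(f"Hamming Loss: {loss_small:.2f}")  # 3 mismatches out of 12 labels
```

Note that every sample here contains one wrong label, so exact-match accuracy would be 0, while the Hamming Loss still gives credit for the nine correct labels.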


In [35]:
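# Simulated data: 1000 samples with 4 binary labels each.
n_samples = 1000
n_classes = 4  # number of labels per sample
np.random.seed(42)
# Both the ground truth and the "predictions" are drawn uniformly at random,
# so no trained model is involved in this example.
y_true_multi_label = np.random.randint(0, 2, size=(n_samples, n_classes))
y_pred_multi_label = np.random.randint(0, 2, size=(n_samples, n_classes))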

In [36]:
# Multi-label Classification Metric (Hamming Loss)
hamming = hamming_loss(y_true_multi_label, y_pred_multi_label)
print(f"\nMulti-Label Classification (Football Example):\nHamming Loss: {hamming:.3f}")


Multi-Label Classification (Football Example):
Hamming Loss: 0.499


### Interpretation

A Hamming Loss of 0.499 indicates that 49.9% of the individual labels are incorrectly predicted.

Note that `y_pred_multi_label` is not produced by a model here: both the true and the predicted labels are drawn uniformly at random, so each predicted label matches the true one with probability 0.5, and a Hamming Loss close to 0.5 is the expected result. The value therefore serves as a chance-level baseline for binary labels. A model that has learned to predict the labels for each player should score well below it.
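To put 0.499 in context, the sketch below compares two simulated predictors on random binary labels, using only the standard library. One guesses every label uniformly at random; the other copies the true label and flips it with probability 0.1. Both predictors, the flip rate, and the seed are assumptions made for illustration and do not correspond to a trained model.

```python
import random

def hamming(y_true, y_pred):
    """Fraction of label positions where prediction and truth differ."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for row_true, row_pred in zip(y_true, y_pred)
        for t, p in zip(row_true, row_pred)
    )
    return wrong / total

rng = random.Random(0)  # arbitrary seed, for repeatability
n_samples, n_labels = 1000, 4

y_true_sim = [[rng.randint(0, 1) for _ in range(n_labels)] for _ in range(n_samples)]

# Predictor 1: uniform random guesses, independent of the truth.
y_random = [[rng.randint(0, 1) for _ in range(n_labels)] for _ in range(n_samples)]

# Predictor 2: copies the truth but flips each label with probability 0.1.
flip_rate = 0.1
y_noisy = [[1 - t if rng.random() < flip_rate else t for t in row] for row in y_true_sim]

random_loss = hamming(y_true_sim, y_random)
noisy_loss = hamming(y_true_sim, y_noisy)

print(f"Random guessing:  {random_loss:.3f}")
print(f"10% label flips:  {noisy_loss:.3f}")
```

The random predictor should land near 0.5 and the noisy copy near its flip rate, which illustrates that the Hamming Loss can be read as a per-label error rate.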