# Performance Metrics

Performance metrics are essential for evaluating the effectiveness of machine learning models. The choice of metric depends on the type of problem (classification, regression, clustering, or time series) and the specific goals of the analysis. Below, we break down the most commonly used metrics for each type of problem.

- Learning Curves: Plot training and validation performance against training set size (or training iterations) to diagnose overfitting or underfitting.

## Classification Metrics
- Confusion Matrix: Tabulates true positives, false positives, true negatives, and false negatives for a classification model.
- Accuracy: Proportion of correctly classified instances. When to use: Balanced classes, equal misclassification costs.
- Precision: Proportion of true positives among all predicted positives. When to use: Imbalanced classes, asymmetric costs (e.g., fraud detection).
- Recall: Proportion of true positives among all actual positives. When to use: High cost of false negatives (e.g., disease detection).
- F1-Score: Harmonic mean of precision and recall. When to use: Need for balance between precision and recall.
- ROC-AUC: Area under the ROC curve; measures how well the model separates the classes. When to use: Ranking quality, threshold-independent evaluation.
- PR-AUC: Area under the precision-recall curve. When to use: Highly imbalanced datasets.

## Regression Metrics
- Mean Squared Error (MSE): Average squared difference between predictions and actuals. When to use: General purpose, penalizes large errors.
- Mean Absolute Error (MAE): Average absolute difference between predictions and actuals. When to use: Need for interpretability, less sensitivity to outliers.
- R-squared: Proportion of variance explained. When to use: Comparing models, explaining fit to stakeholders.
- MAPE: Mean absolute percentage error. When to use: Comparing errors across different scales.

## Clustering Metrics
- Silhouette score: Measures how similar an object is to its own cluster compared to other clusters.
- Davies-Bouldin index: Measures the average similarity ratio of each cluster with the cluster that is most similar to it.

## Time Series Metrics
- Mean Absolute Percentage Error (MAPE): Average absolute forecast error, expressed as a percentage of the actual values.
- AIC (Akaike Information Criterion): Measures the trade-off between model complexity and goodness of fit.
- BIC (Bayesian Information Criterion): Similar to AIC but penalizes model complexity more heavily.


### Learning Curves

Learning curves can be used for classification, regression, time series, and clustering problems. They are a versatile diagnostic tool for understanding how a model performs during training and for identifying potential issues such as overfitting or underfitting.

- When to use:
  - During model training to monitor performance.
  - To determine if the model needs more data, regularization, or architectural changes.
- Interpretation:
  - Overfitting: Training performance improves, but validation performance plateaus or worsens.
  - Underfitting: Both training and validation performance are poor.
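
The interpretation rules above can be sketched as a small heuristic. The function name `diagnose_fit` and the thresholds `gap_tol` and `min_score` are hypothetical choices made for illustration, not standard values; sensible cut-offs depend on the metric and the problem.

```python
import numpy as np

def diagnose_fit(train_scores, val_scores, gap_tol=0.1, min_score=0.7):
    """Heuristic read of the last point of a learning curve (higher score = better).

    gap_tol and min_score are illustrative thresholds, not established defaults.
    """
    final_train = float(np.asarray(train_scores)[-1])
    final_val = float(np.asarray(val_scores)[-1])
    # Both curves end low: the model cannot even fit the training data
    if final_train < min_score and final_val < min_score:
        return "underfitting"
    # Training score is far above validation score: the model does not generalize
    if final_train - final_val > gap_tol:
        return "overfitting"
    return "reasonable fit"

print(diagnose_fit([0.99, 1.0, 1.0], [0.70, 0.72, 0.71]))
print(diagnose_fit([0.55, 0.58, 0.60], [0.52, 0.55, 0.57]))
print(diagnose_fit([0.95, 0.93, 0.92], [0.85, 0.88, 0.90]))
```

A check like this only looks at the final scores; the shape of the whole curve is usually more informative than any single threshold.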

#### Learning Curves for Classification
Learning curves for classification track a metric such as accuracy, precision, recall, or F1-score on the training set and validation set as training progresses (epochs or iterations) or, as in the example below, as the training set grows.

In [None]:
from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Generate learning curves
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(), X, y, cv=5, scoring='accuracy', train_sizes=np.linspace(0.1, 1.0, 10)
)

# Plot learning curves
plt.plot(train_sizes, np.mean(train_scores, axis=1), label='Training Score')
plt.plot(train_sizes, np.mean(val_scores, axis=1), label='Validation Score')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.title('Learning Curves (Classification)')
plt.legend()
plt.show()

#### Learning Curves for Regression
Learning curves for regression track a metric such as MSE, MAE, or R² on the training set and validation set as training progresses or as the training set grows.

- When to use:
  - To diagnose overfitting or underfitting in regression tasks.
  - To determine if the model needs more data or regularization.

In [None]:
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
import numpy as np

# Generate synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1)

# Generate learning curves
train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, cv=5, scoring='neg_mean_squared_error', train_sizes=np.linspace(0.1, 1.0, 10)
)

# Plot learning curves
plt.plot(train_sizes, -np.mean(train_scores, axis=1), label='Training MSE')
plt.plot(train_sizes, -np.mean(val_scores, axis=1), label='Validation MSE')
plt.xlabel('Training Set Size')
plt.ylabel('Mean Squared Error')
plt.title('Learning Curves (Regression)')
plt.legend()
plt.show()

#### Learning Curves for Time Series
Learning curves for time series track a metric such as MAPE or RMSE on the training set and validation set as the training window grows. The splits must respect temporal order, so validation data always comes after the training data.
- When to use:
  - To diagnose overfitting or underfitting in time series forecasting tasks.
  - To determine if the model needs more data or better feature engineering.


In [None]:
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import RandomForestRegressor
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic time series data
X = np.array([i for i in range(100)])
y = np.sin(X) + np.random.normal(0, 0.1, 100)

# Time Series Split
tscv = TimeSeriesSplit(n_splits=5)

# Generate learning curves
train_sizes = []
train_scores = []
val_scores = []
for train_index, val_index in tscv.split(X):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
    model = RandomForestRegressor()
    model.fit(X_train.reshape(-1, 1), y_train)
    train_scores.append(model.score(X_train.reshape(-1, 1), y_train))
    val_scores.append(model.score(X_val.reshape(-1, 1), y_val))
    train_sizes.append(len(X_train))

# Plot learning curves
plt.plot(train_sizes, train_scores, label='Training Score')
plt.plot(train_sizes, val_scores, label='Validation Score')
plt.xlabel('Training Set Size')
plt.ylabel('R² Score')
plt.title('Learning Curves (Time Series)')
plt.legend()
plt.show()

#### Learning Curves for Clustering

Learning curves for clustering track a quality measure such as the silhouette score or Davies-Bouldin index as the amount of data (or the number of iterations) grows. They are less common than for supervised models but can still be used to monitor convergence or clustering quality.
- When to use:
  - To diagnose if the clustering algorithm is converging or if it needs more iterations.
  - To determine the optimal number of clusters.

In [None]:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import numpy as np

# Generate synthetic clustering dataset
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0)

# Generate learning curves
train_sizes = np.linspace(0.1, 1.0, 10)
silhouette_scores = []
for size in train_sizes:
    n_samples = int(size * len(X))
    X_subset = X[:n_samples]
    model = KMeans(n_clusters=4)
    model.fit(X_subset)
    silhouette_scores.append(silhouette_score(X_subset, model.labels_))

# Plot learning curves
plt.plot(train_sizes * len(X), silhouette_scores)
plt.xlabel('Training Set Size')
plt.ylabel('Silhouette Score')
plt.title('Learning Curves (Clustering)')
plt.show()

### Classification Metrics

#### Confusion Matrix

A table that visualizes the performance of a classification model by showing true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).


In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=data.target_names)
disp.plot()
plt.show()

#### Accuracy
Measures the proportion of correctly classified instances.

- When to use:
  - When classes are balanced.
  - When the cost of misclassification is symmetric (e.g., equal cost for false positives and false negatives).

In [None]:
from sklearn.metrics import accuracy_score

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
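
To illustrate why accuracy can mislead when classes are imbalanced, here is a minimal sketch on made-up labels (95 negatives, 5 positives) with a trivial baseline that always predicts the majority class. The data and variable names are hypothetical and unrelated to the iris example above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives
y_true_imb = np.array([0] * 95 + [1] * 5)
# A trivial "model" that always predicts the majority (negative) class
y_pred_majority = np.zeros_like(y_true_imb)

acc_imb = accuracy_score(y_true_imb, y_pred_majority)
rec_imb = recall_score(y_true_imb, y_pred_majority, zero_division=0)
print("Accuracy:", acc_imb)  # high, although no positive is ever found
print("Recall:", rec_imb)
```

The baseline scores 95% accuracy while detecting none of the positive instances, which is why precision, recall, or PR-AUC are usually preferred in this setting.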

#### Precision
Measures the proportion of true positives among all predicted positives.
- When to use:
  - When the cost of false positives is high (e.g., spam detection).
  - For imbalanced datasets where the positive class is rare.

In [None]:
from sklearn.metrics import precision_score

# Calculate precision (for binary classification, use average='binary')
precision = precision_score(y_test, y_pred, average='macro')  # For multiclass
print("Precision:", precision)

#### Recall / Sensitivity
Measures the proportion of true positives among all actual positives.
- When to use:
  - When the cost of false negatives is high (e.g., disease detection).
  - For imbalanced datasets where missing positive instances is costly.

In [None]:
from sklearn.metrics import recall_score

# Calculate recall (for binary classification, use average='binary')
recall = recall_score(y_test, y_pred, average='macro')  # For multiclass
print("Recall:", recall)

#### F1-Score
Harmonic mean of precision and recall, providing a balance between the two.

- When to use:
  - When you need a single metric that balances precision and recall.
  - For imbalanced datasets where both false positives and false negatives are important.


In [None]:
from sklearn.metrics import f1_score

# Calculate F1-Score (for binary classification, use average='binary')
f1 = f1_score(y_test, y_pred, average='macro')  # For multiclass
print("F1-Score:", f1)
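
As a cross-check on the definitions above, precision, recall, and F1 can be computed directly from the confusion matrix counts. This is a sketch on a small made-up binary example; the arrays and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical binary labels and predictions
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_manual = tp / (tp + fp)   # TP / all predicted positives
recall_manual = tp / (tp + fn)      # TP / all actual positives
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print("Manual: ", precision_manual, recall_manual, f1_manual)
print("sklearn:", precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      f1_score(y_true_demo, y_pred_demo))
```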

#### ROC-AUC (Receiver Operating Characteristic - Area Under Curve)
Measures the model's ability to discriminate between classes by plotting the true positive rate (TPR) against the false positive rate (FPR) at various thresholds.
- When to use:
  - For threshold-independent evaluation.
  - When ranking or probability outputs are important (e.g., credit scoring).


In [None]:
from sklearn.metrics import roc_auc_score

# ROC-AUC for the three-class iris problem, averaged one-vs-rest ('ovr'); 'ovo' is the alternative
# For binary classification, pass the positive-class probabilities and omit `multi_class`
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test), multi_class='ovr')
print("ROC-AUC:", roc_auc)
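
One way to read ROC-AUC as a ranking measure: for binary labels it equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as half. The sketch below checks this on four made-up scores; the arrays and variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical binary labels and model scores
y_true_rank = np.array([0, 0, 1, 1])
scores_rank = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores_rank[y_true_rank == 1]
neg = scores_rank[y_true_rank == 0]

# Compare every positive score with every negative score; ties count as half a win
wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
auc_pairwise = wins / (len(pos) * len(neg))

print("Pairwise AUC:", auc_pairwise)
print("roc_auc_score:", roc_auc_score(y_true_rank, scores_rank))
```

The pairwise loop is quadratic in the number of samples, so it is only meant to show what the number means, not to replace `roc_auc_score`.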

#### PR-AUC (Precision-Recall - Area Under Curve)
Measures the trade-off between precision and recall, especially useful for imbalanced datasets.

- When to use:
  - For highly imbalanced datasets where the positive class is rare.
  - When false positives and false negatives have asymmetric costs.

In [None]:
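from sklearn.metrics import precision_recall_curve, auc

# precision_recall_curve expects binary labels, so treat iris class 1 as the
# positive class (one-vs-rest) and use the predicted probability for that class
y_test_binary = (y_test == 1).astype(int)
pr_precision, pr_recall, _ = precision_recall_curve(y_test_binary, model.predict_proba(X_test)[:, 1])
pr_auc = auc(pr_recall, pr_precision)
print("PR-AUC:", pr_auc)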

### Regression Metrics

#### Mean Squared Error (MSE)
Measures the average squared difference between predicted and actual values.

- When to use:
  - For general-purpose regression evaluation.
  - When large errors should be penalized more heavily.

In [None]:
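from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# Fit a regressor so y_test and y_pred hold continuous values, not iris class labels
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse)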

#### Mean Absolute Error (MAE)
Measures the average absolute difference between predicted and actual values.

- When to use:
  - When interpretability is important.
  - When the dataset contains outliers (less sensitive to outliers than MSE).

In [None]:
from sklearn.metrics import mean_absolute_error

# Calculate MAE
mae = mean_absolute_error(y_test, y_pred)
print("MAE:", mae)
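
The difference in outlier sensitivity between MSE and MAE can be seen on a tiny made-up example. The values and variable names below are hypothetical: five predictions that are each off by 1, then the same predictions with one error raised to 10.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical targets and two sets of predictions
y_true_reg = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
y_pred_clean = np.array([11.0, 11.0, 15.0, 15.0, 19.0])  # every error has magnitude 1
y_pred_outlier = y_pred_clean.copy()
y_pred_outlier[-1] = 28.0                                 # one error of magnitude 10

mae_clean = mean_absolute_error(y_true_reg, y_pred_clean)
mse_clean = mean_squared_error(y_true_reg, y_pred_clean)
mae_outlier = mean_absolute_error(y_true_reg, y_pred_outlier)
mse_outlier = mean_squared_error(y_true_reg, y_pred_outlier)

print("Clean:        MAE =", mae_clean, " MSE =", mse_clean)
print("With outlier: MAE =", mae_outlier, " MSE =", mse_outlier)
```

A single large error multiplies the MSE by roughly 20 here but the MAE by less than 3, because squaring amplifies large residuals.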

#### R-squared (Coefficient of Determination)
Measures the proportion of variance in the dependent variable that is explained by the model.
- When to use:
  - For comparing models.
  - When explaining model performance to stakeholders.

In [None]:
from sklearn.metrics import r2_score

# Calculate R-squared
r2 = r2_score(y_test, y_pred)
print("R-squared:", r2)
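
To make "proportion of variance explained" concrete, R-squared can be computed from its definition, one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up values and hypothetical variable names, and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions
y_true_r2 = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_r2 = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true_r2 - y_pred_r2) ** 2)          # unexplained variation
ss_tot = np.sum((y_true_r2 - y_true_r2.mean()) ** 2)   # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print("Manual R-squared:", r2_manual)
print("r2_score:", r2_score(y_true_r2, y_pred_r2))
```

Because the baseline is "always predict the mean", R-squared becomes negative when a model does worse than that baseline.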

#### Mean Absolute Percentage Error (MAPE)
Measures the average percentage error between predicted and actual values.

- When to use:
  - When comparing errors across datasets with different scales.
  - For business contexts where percentage errors are more intuitive.


In [None]:
import numpy as np

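# Calculate MAPE for continuous regression targets and predictions
# (undefined when y_test contains zeros and unstable when values are close to zero)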
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100
print("MAPE:", mape)

### Clustering Metrics

#### Silhouette Score
Measures how similar an object is to its own cluster compared to other clusters. Ranges from -1 (poor clustering) to 1 (excellent clustering).

- When to use:
  - For evaluating the quality of clustering when ground truth labels are unavailable.
  - To determine the optimal number of clusters.

In [None]:
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic clustering dataset
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0)

# Train KMeans
model = KMeans(n_clusters=4)
model.fit(X)

# Calculate Silhouette Score
silhouette = silhouette_score(X, model.labels_)
print("Silhouette Score:", silhouette)
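
One possible way to use the silhouette score for choosing the number of clusters is to fit KMeans for several candidate values of k and keep the one with the highest score. This is a sketch under stated assumptions: the four well-separated centers, the candidate range 2 to 7, and all variable names are chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data: four well-separated blobs with known centers
demo_centers = [[0, 0], [10, 10], [-10, 10], [10, -10]]
X_demo, _ = make_blobs(n_samples=400, centers=demo_centers, cluster_std=1.0, random_state=0)

# Fit KMeans for each candidate k and record the silhouette score
scores_by_k = {}
for k in range(2, 8):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores_by_k[k] = silhouette_score(X_demo, labels_k)

best_k = max(scores_by_k, key=scores_by_k.get)
print("Silhouette by k:", {k: round(s, 3) for k, s in scores_by_k.items()})
print("Best k by silhouette:", best_k)
```

On real data the maximum is often less clear-cut, so it helps to inspect the whole score profile rather than only the top value.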

#### Davies-Bouldin Index
Measures the average similarity ratio of each cluster with the cluster that is most similar to it. Lower values indicate better clustering.
- When to use:
  - For comparing clustering algorithms or configurations.
  - When ground truth labels are unavailable.

In [None]:
from sklearn.metrics import davies_bouldin_score

# Calculate Davies-Bouldin Index
db_index = davies_bouldin_score(X, model.labels_)
print("Davies-Bouldin Index:", db_index)

### Time Series Metrics

#### Mean Absolute Percentage Error (MAPE)
Measures the average percentage error between predicted and actual values in time series forecasting.

- When to use:
  - For evaluating time series models where percentage errors are meaningful.
  - When comparing models across different time series.

In [None]:
import numpy as np

# Example time series data
y_true = np.array([100, 200, 300, 400, 500])
y_pred = np.array([110, 190, 310, 420, 480])

# Calculate MAPE
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print("MAPE:", mape)
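
Because MAPE divides by the actual values, it is undefined whenever an actual value is zero. The helper below is a minimal sketch of one way to handle this, by skipping zero-valued actuals. The name `safe_mape` and the choice to drop those points are assumptions; other conventions exist, such as adding a small constant to the denominator.

```python
import numpy as np

def safe_mape(y_true, y_pred):
    """MAPE in percent, computed only over entries where y_true is non-zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0  # entries where the percentage error is defined
    if not mask.any():
        raise ValueError("MAPE is undefined when every actual value is zero")
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100)

# Same hypothetical series as above, plus a series containing a zero
print(safe_mape([100, 200, 300, 400, 500], [110, 190, 310, 420, 480]))
print(safe_mape([0, 100], [5, 110]))  # the zero-valued actual is ignored
```

Dropping points changes what the metric averages over, so the number of skipped entries is worth reporting alongside the score.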

#### AIC (Akaike Information Criterion)
Measures the trade-off between model complexity and goodness of fit. Among models fitted to the same data, lower values indicate a better trade-off; the absolute value on its own is not meaningful.

- When to use:
  - For comparing time series models (e.g., ARIMA, SARIMA).
  - When balancing model fit and complexity.

In [None]:
from statsmodels.tsa.arima.model import ARIMA

# Example time series data
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Fit ARIMA model
model = ARIMA(data, order=(1, 1, 1))
model_fit = model.fit()

# Calculate AIC
aic = model_fit.aic
print("AIC:", aic)

#### BIC (Bayesian Information Criterion)
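Similar to AIC but penalizes model complexity more heavily: each parameter costs ln(n) instead of 2, which is the larger penalty once the sample size n reaches 8. Lower values indicate better models.
- When to use:
  - For comparing time series models when simplicity is a priority.
  - When selecting models for forecasting.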

In [None]:
# Calculate BIC (using the same ARIMA model as above)
bic = model_fit.bic
print("BIC:", bic)
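
For reference, the textbook definitions of the two criteria are given below, where $k$ is the number of estimated parameters, $n$ the number of observations, and $\hat{L}$ the maximized likelihood. Exactly which parameters a given library counts in $k$ is not covered here and should be checked in its documentation.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
\qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```

Both criteria share the goodness-of-fit term $-2\ln\hat{L}$ and differ only in the complexity penalty, $2k$ versus $k\ln n$.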