### 🧮 Classification Accuracy

**Definition:**  
Classification accuracy measures how often a classification model correctly predicts the class labels.

---

**Formula:**  
$$
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \times 100
$$

---

**Example:**  
If your model made **100 predictions**, and **85** of them were correct:

$$
\text{Accuracy} = \frac{85}{100} \times 100 = 85\%
$$


In [1]:
def accuracy_metric(actual, predicted):
    correct = 0
    for label, prediction in  zip(actual, predicted):
        if label == prediction:
            correct += 1
    return (correct/len(actual))*100

In [4]:
actual = [0,0,0,0,0,1,1,1,1,1]
predicted = [0,1,0,0,0,1,0,1,1,1]

accuracy = accuracy_metric(actual, predicted)
print(accuracy)

80.0


### 4.2.2 Confusion Matrix

A **confusion matrix** provides a summary of all predictions made by a classification model compared to the expected (actual) values.  
It is represented as a table (matrix) showing counts of predictions versus actual outcomes.

---

**Explanation:**

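- Each **row** of the matrix represents the **predicted** class.
- Each **column** represents the **actual** class. This is one common convention; some libraries, such as scikit-learn, use the transposed layout with actual classes as rows.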
- The **diagonal elements** (from top-left to bottom-right) indicate **correct predictions**.
- Off-diagonal elements show **misclassifications**.

---

**Interpretation:**

A perfect classifier will have all predictions along the **main diagonal**, meaning the predicted labels match the actual labels for every instance.

**Example Confusion Matrix:**

$$
\begin{bmatrix}
50 & 2 & 0 \\
1 & 45 & 4 \\
0 & 3 & 47
\end{bmatrix}
$$
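
As a rough illustration of the diagonal rule, accuracy can be read straight off the example matrix: sum the diagonal and divide by the sum of all cells. The helper name `accuracy_from_matrix` is made up for this sketch and is not part of the notebook.

```python
# Example confusion matrix from the text (rows and columns share one class order).
matrix = [
    [50, 2, 0],
    [1, 45, 4],
    [0, 3, 47],
]


def accuracy_from_matrix(matrix):
    """Return accuracy as a percentage from a square confusion matrix."""
    # Diagonal cells are the correct predictions.
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    # Every cell together is the total number of predictions.
    total = sum(sum(row) for row in matrix)
    return correct / total * 100


print(round(accuracy_from_matrix(matrix), 2))  # 93.42
```

Here 142 of the 152 predictions lie on the diagonal, so the 10 off-diagonal counts are the misclassifications.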



In [14]:
def confusion_matrix(actual, predicted):
    # Sorted union of labels: fixes the row/column order and also covers
    # classes that appear only in the predictions (avoids a KeyError).
    unique = sorted(set(actual) | set(predicted))
    lookup = {value: i for i, value in enumerate(unique)}
    matrix = [[0] * len(unique) for _ in unique]

    for a, p in zip(actual, predicted):
        # Rows are predicted classes, columns are actual classes.
        matrix[lookup[p]][lookup[a]] += 1

    print('(A)' + ' '.join(str(x) for x in unique))
    print('(P)---------------')
    for i, label in enumerate(unique):
        print("%s| %s" % (label, " ".join(str(count) for count in matrix[i])))

    # return unique, matrix

In [15]:
actual = [0,0,0,0,0,1,1,1,1,1]
predicted = [0,1,1,0,0,1,0,1,1,1]
confusion_matrix(actual, predicted)


(A)0 1
(P)---------------
0| 3 1
1| 2 4


### 4.2.3 Mean Absolute Error (MAE)

**Definition:**  
Regression problems involve predicting continuous (real) values.  
A simple and intuitive evaluation metric for regression is the **Mean Absolute Error (MAE)**,  
which measures the average magnitude of the errors between predicted and actual values.

---

**Explanation:**  
- The **error** is the difference between the predicted and actual values.  
- The **absolute value** of each error is taken to avoid negative values canceling positive ones.  
- The MAE is then the **average** of these absolute errors.

---

**Formula:**

$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \text{predicted}_i - \text{actual}_i \right|
$$

where:  
- $ n $ = total number of predictions  
- $ \text{predicted}_i $ = predicted value for the $ i^{th} $ observation  
- $ \text{actual}_i $ = actual (true) value for the $ i^{th} $ observation  

---

**Interpretation:**  
A lower MAE value indicates that the predictions are closer to the actual values, meaning better model performance.


In [21]:
def mae_metric(actual, predicted):
    sum_error = 0.0
    for a, p in zip(actual, predicted):
        sum_error += abs(a-p)

    return sum_error/float(len(actual))

In [22]:
actual = [0.1, 0.2, 0.3, 0.4, 0.5]
predicted = [0.11, 0.19, 0.29, 0.41, 0.5]
mae = mae_metric(actual, predicted)
print(mae)

0.007999999999999993
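
As a hand check of the output above, the same example can be worked through the MAE formula. The exact answer is 0.008; the printed value differs only by floating-point rounding.

$$
\text{MAE} = \frac{|0.11 - 0.1| + |0.19 - 0.2| + |0.29 - 0.3| + |0.41 - 0.4| + |0.5 - 0.5|}{5} = \frac{0.04}{5} = 0.008
$$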


### 4.2.4 Root Mean Squared Error (RMSE)

**Definition:**  
Another widely used metric for evaluating regression models is the **Root Mean Squared Error (RMSE)**.  
It is the square root of the average squared difference between the predicted and actual values.  
Without the final square root, the same quantity is called the **Mean Squared Error (MSE)**.

---

**Explanation:**  
- Each prediction error (difference between predicted and actual) is **squared** to remove negative signs.  
- The **mean** of these squared errors is computed.  
- Finally, the **square root** of that mean gives RMSE, bringing the error measure back to the original units.

---

**Formula:**

$$
\text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (\text{predicted}_i - \text{actual}_i)^2 }
$$

where:  
- $ n $ = total number of predictions  
- $ \text{predicted}_i $ = predicted value for the $ i^{th} $ observation  
- $ \text{actual}_i $ = actual (true) value for the $ i^{th} $ observation  

---

**Interpretation:**  
- RMSE gives a higher weight to large errors (due to squaring), making it more sensitive to outliers.  
- A **lower RMSE** indicates better model performance and predictions closer to actual values.


In [24]:
from math import sqrt

def rmse_metric(actual, predicted):
    sum_error = 0.0
    for a, p in zip(actual, predicted):
        sum_error += (a - p) ** 2  # squared error for one prediction
    mean_error = sum_error / len(actual)

    return sqrt(mean_error)

In [25]:
actual = [0.1, 0.2, 0.3, 0.4, 0.5]
predicted = [0.11, 0.19, 0.29, 0.41, 0.5]
rmse = rmse_metric(actual, predicted)
print(rmse)

0.00894427190999915
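
To make the outlier sensitivity concrete, the sketch below compares MAE and RMSE on two made-up prediction sets with the same total absolute error. The data and the short helper names `mae` and `rmse` are hypothetical, chosen only so the arithmetic is easy to follow.

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


actual = [10.0, 10.0, 10.0, 10.0]
even_errors = [12.0, 8.0, 12.0, 8.0]    # four errors of size 2
one_outlier = [10.0, 10.0, 10.0, 18.0]  # a single error of size 8

print(mae(actual, even_errors), rmse(actual, even_errors))  # 2.0 2.0
print(mae(actual, one_outlier), rmse(actual, one_outlier))  # 2.0 4.0
```

Both prediction sets have the same MAE, but the RMSE doubles when the error is concentrated in one large miss.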


### 4.2.5 Precision for Classification

**Definition:**  
Precision is an important evaluation metric for classification problems, especially when the cost of false positives is high.  
It measures how many of the instances predicted as *positive* are actually *positive*.  
In other words, it answers the question:  
*"Of all the samples the model labeled as positive, how many were correct?"*

---

**Explanation:**  
Precision focuses only on the **positive predictions** made by the model and helps assess how reliable those predictions are.  
It is particularly useful in applications such as **spam detection**, **medical diagnosis**, or **fraud detection**,  
where incorrectly predicting a positive result (false positive) can have serious consequences.

---

**Formula:**

$$
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
$$

where:  
- **True Positives (TP):** Correctly predicted positive instances  
- **False Positives (FP):** Incorrectly predicted positive instances  

---

**Interpretation:**  
A higher precision value indicates that the model produces fewer false positives,  
meaning its positive predictions are more **accurate and trustworthy**.  
Precision is often used together with **Recall** to provide a more complete evaluation of classification performance.


In [28]:
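def precision_metric_binary(actual, predicted):
    true_positives = 0
    false_positives = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            true_positives += 1    # predicted positive, actually positive
        elif p == 1:
            false_positives += 1   # predicted positive, actually negative

    predicted_positives = true_positives + false_positives
    if predicted_positives == 0:
        # No positive predictions: precision is undefined, 0.0 is used by convention.
        return 0.0
    return true_positives / predicted_positives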

In [29]:
actual = [0,0,0,0,0,1,1,1,1,1]
predicted = [0,1,1,0,0,1,0,1,1,1]

print(precision_metric_binary(actual,predicted))

0.6666666666666666


## 2. Multi-Class (Categorical) Classification

For categorical problems (with 3 or more classes), **precision** is computed per class, treating each class as the “positive” class while others are “negative.”

Then, you combine these class-wise precisions using one of the following approaches:

---

### a) **Macro Precision**

**Formula:**  
$ \text{Macro Precision} = \frac{1}{K} \sum_{i=1}^{K} \text{Precision}_i $

- Calculates precision for each class independently  
- Takes the **simple average** (treats all classes equally)

---

### b) **Weighted Precision**

**Formula:**  
$ \text{Weighted Precision} = \frac{\sum_{i=1}^{K} n_i \times \text{Precision}_i}{\sum_{i=1}^{K} n_i} $

- Each class’s precision is **weighted by how many samples it has**  
- Useful when classes are **imbalanced**

---

### c) **Micro Precision**

**Formula:**  
$ \text{Micro Precision} = \frac{\sum_{i=1}^{K} TP_i}{\sum_{i=1}^{K} (TP_i + FP_i)} $

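- Aggregates **True Positives (TP)** and **False Positives (FP)** across all classes before computing precision  
- Gives **more weight to larger classes**, because every sample counts equally  
- When each sample has exactly one label, every wrong prediction is exactly one false positive, so micro precision equals overall accuracy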


In [25]:
def precision_metric_categorical(actual, predicted, method):
    # Use labels from both lists so a class that is only ever predicted
    # does not raise a KeyError; sorting fixes the order of the output.
    unique = sorted(set(actual) | set(predicted))
    lookup = {v: i for i, v in enumerate(unique)}
    true_positives = [0] * len(unique)
    false_positives = [0] * len(unique)

    for a, p in zip(actual, predicted):
        if a == p:
            true_positives[lookup[a]] += 1
        else:
            # A wrong prediction is a false positive for the predicted class.
            false_positives[lookup[p]] += 1

    precisions = []
    for tp, fp in zip(true_positives, false_positives):
        # A class that is never predicted has no defined precision; use 0.0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
    print(precisions)

    if method == "macro":
        return sum(precisions) / len(unique)

    elif method == "weighted":
        # Weight each class by its number of actual samples.
        weighted_precision_sum = sum(
            actual.count(label) * precisions[lookup[label]] for label in unique
        )
        return weighted_precision_sum / len(actual)

    elif method == "micro":
        sum_of_true_positives = sum(true_positives)
        sum_of_false_positives = sum(false_positives)
        return sum_of_true_positives / (sum_of_true_positives + sum_of_false_positives)

    raise ValueError("method must be 'macro', 'weighted', or 'micro'")

In [26]:
# --- Example test data ---
actual =    [0,0,0,1,1,1,2,2,2,3,3,3,0,1,2,3,0,1,2,3]
predicted = [0,1,0,1,1,2,2,0,2,3,2,3,1,1,2,3,0,0,3,3]

# --- Evaluate all methods ---
macro_precision = precision_metric_categorical(actual, predicted, method="macro")
weighted_precision = precision_metric_categorical(actual, predicted, method="weighted")
micro_precision = precision_metric_categorical(actual, predicted, method="micro")

# --- Print results ---
print(f"Macro Precision   : {macro_precision:.4f}")
print(f"Weighted Precision: {weighted_precision:.4f}")
print(f"Micro Precision   : {micro_precision:.4f}")

[0.6, 0.6, 0.6, 0.8]
[0.6, 0.6, 0.6, 0.8]
[0.6, 0.6, 0.6, 0.8]
Macro Precision   : 0.6500
Weighted Precision: 0.6500
Micro Precision   : 0.6500


### 4.2.5 Recall for Classification

Recall is a common evaluation metric used in **classification problems**, particularly when it is important to correctly identify all positive instances.  
It measures the proportion of **actual positive samples** that are correctly predicted by the model.

In other words, Recall answers the question:

> *“Out of all the actual positive cases, how many did the model successfully identify?”*

A high recall indicates that most positive instances were captured by the model,  
while a low recall suggests that many positives were missed.

Mathematically, Recall is defined as:

**Recall =** $ \frac{TP}{TP + FN} $

where:  
- **TP (True Positives)** → Number of positive samples correctly classified as positive  
- **FN (False Negatives)** → Number of positive samples incorrectly classified as negative  

The value of Recall ranges from **0 to 1**, where:  
- **1** → Perfect recall (no positive samples were missed)  
- **0** → Model failed to identify any positive samples  

---

### 🔹 Difference Between Accuracy, Precision, and Recall

| Metric | Formula | Measures | Best Used When |
|:-------|:---------|:----------|:----------------|
| **Accuracy** | $ \frac{TP + TN}{TP + TN + FP + FN} $ | Overall correctness of the model | When classes are roughly balanced and all errors cost the same |
| **Precision** | $ \frac{TP}{TP + FP} $ | How many predicted positives are actually positive | When the cost of a false positive is high |
| **Recall** | $ \frac{TP}{TP + FN} $ | How many actual positives are correctly identified | When the cost of missing a positive is high |

---

### 🔹 Confusion Matrix (Conceptual)

|                | **Predicted Positive** | **Predicted Negative** |
|:----------------|:----------------------:|:----------------------:|
| **Actual Positive** | True Positive (TP) | False Negative (FN) |
| **Actual Negative** | False Positive (FP) | True Negative (TN) |

---
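
The four cells of the table above can be counted directly from two label lists, and the three formulas then follow from the counts. This is a minimal sketch that reuses the binary example lists from earlier; `binary_counts` is a hypothetical helper, and class `1` is assumed to be the positive class.

```python
def binary_counts(actual, predicted, positive=1):
    """Count TP, FP, FN, TN for one class treated as positive."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


actual = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
predicted = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]

tp, fp, fn, tn = binary_counts(actual, predicted)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, fp, fn, tn)                          # 4 2 1 3
print(accuracy, round(precision, 4), recall)   # 0.7 0.6667 0.8
```

On this data the model misses only one positive (high recall) but raises two false alarms (lower precision), which is the kind of difference accuracy alone hides.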

**Example:**  
In medical screening, **Recall** is often prioritized over Precision, because missing a patient who actually has the disease is usually costlier than flagging a healthy individual for follow-up testing.


In [34]:
def recall(actual, predicted):
    # Returns one recall value per class, in sorted label order.
    unique = sorted(set(actual))
    lookup = {v: i for i, v in enumerate(unique)}
    true_positives = [0] * len(unique)
    false_negatives = [0] * len(unique)

    for a, p in zip(actual, predicted):
        if a == p:
            true_positives[lookup[a]] += 1
        else:
            # A missed sample is a false negative for its actual class.
            false_negatives[lookup[a]] += 1

    # Every class in `unique` occurs in `actual`, so tp + fn is never zero.
    return [tp / (tp + fn) for tp, fn in zip(true_positives, false_negatives)]

In [35]:
actual =    [0,0,0,1,1,1,2,2,2,3,3,3,0,1,2,3,0,1,2,3]
predicted = [0,1,0,1,1,2,2,0,2,3,2,3,1,1,2,3,0,0,3,3]

print(recall(actual, predicted))

[0.6, 0.6, 0.6, 0.8]


### 4.2.6 F1 Score for Classification

The **F1 Score** is an important evaluation metric for **classification problems**, especially when there is an **imbalance between classes**.  
It combines both **Precision** and **Recall** into a single metric by taking their harmonic mean.

While **Precision** measures *how many predicted positives are actually positive*, and **Recall** measures *how many actual positives were correctly identified*,  
the **F1 Score** balances both — giving a better sense of the model’s overall effectiveness.

Mathematically, the F1 Score is defined as:

**F1 Score =** $  2 \times \frac{(\text{Precision} \times \text{Recall})}{(\text{Precision} + \text{Recall})}  $

The value of F1 ranges from **0 to 1**, where:  
- **1** → Perfect precision and recall (best performance)  
- **0** → Poor performance (either precision or recall is zero)

---

**Key Points:**
- F1 is often more informative than Accuracy for **imbalanced datasets**, where a model can reach high accuracy by always predicting the majority class.  
- It penalizes models that perform well on one metric (e.g., precision) but poorly on the other (e.g., recall).  
- It is commonly used in information retrieval, fraud detection, and medical diagnosis tasks.

---

| Metric | Balances | Best Used When |
|:-------|:-----------|:----------------|
| **F1 Score** | Precision and Recall | When both false positives and false negatives are costly |

---
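
A minimal sketch of the F1 formula as code, assuming precision and recall have already been computed. The function name `f1_score` and the zero-denominator convention (return 0.0) are choices made for this example, not something the notebook defines.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Both metrics are zero: define F1 as 0.0 instead of dividing by zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision 4/6 and recall 4/5, computed from the binary example lists used earlier.
print(round(f1_score(4 / 6, 4 / 5), 4))  # 0.7273

# A lopsided model: perfect precision, very low recall.
print(round(f1_score(1.0, 0.1), 4))      # 0.1818
```

The second call shows the penalty described above: the arithmetic mean of 1.0 and 0.1 would be 0.55, while the harmonic mean stays close to the weaker of the two values.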



### 4.2.7 Area Under ROC Curve or AUC for Classification

The **Area Under the Receiver Operating Characteristic (ROC) Curve**, or **AUC**, is a powerful evaluation metric for **binary classification** problems.  
It measures how well the model can **distinguish between classes** — that is, how well it separates positive and negative samples.

The **ROC Curve** plots:  
- **True Positive Rate (TPR)** on the Y-axis, and  
- **False Positive Rate (FPR)** on the X-axis.

Where:  
- $ \text{TPR} = \frac{TP}{TP + FN} $ (Recall)  
- $ \text{FPR} = \frac{FP}{FP + TN} $

The **AUC** represents the **area under this curve**, providing a single value summary of the model’s performance.

**Interpretation:**
- **AUC = 1.0** → Perfect classifier  
- **AUC = 0.5** → No discrimination (random guessing)  
- **AUC < 0.5** → Worse than random guessing

---
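
One way to compute AUC without drawing the curve is the pairwise (rank-based) formulation, which gives the same value as the area under the ROC curve: the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as half. The sketch below assumes the model outputs a score per sample; `auc_pairwise` and the example scores are hypothetical.

```python
def auc_pairwise(labels, scores, positive=1):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == positive]
    neg = [s for y, s in zip(labels, scores) if y != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical model scores

print(auc_pairwise(labels, scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, giving 0.75. The double loop is quadratic in the number of samples, so this form is meant for illustration on small data.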

**Key Points:**
- AUC is **threshold-independent**, meaning it measures performance across all possible decision thresholds.  
- Because it depends only on how the scores rank the samples, it can compare models whose predicted probabilities are calibrated differently.  
- Works best for **binary classification**; can be extended to multi-class via **one-vs-rest (OvR)** or **macro averaging**.

---

| AUC Range | Interpretation |
|:-----------|:----------------|
| 0.9 – 1.0 | Excellent |
| 0.8 – 0.9 | Good |
| 0.7 – 0.8 | Fair |
| 0.6 – 0.7 | Poor |
| 0.5 | Random |


### 4.2.8 Goodness of Fit (R² Score) for Regression

The **R-squared (R²)**, also known as the **Coefficient of Determination**, is a key metric used to evaluate **regression models**.  
It measures how well the model’s predictions approximate the actual data points.

In simple terms, R² shows the **proportion of variance** in the dependent variable that is explained by the independent variables.

Mathematically, R² is defined as:

**R² =** $ 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} $

where:  
- $ y_i $ → Actual value  
- $ \hat{y}_i $ → Predicted value  
- $ \bar{y} $ → Mean of actual values  

---

**Interpretation:**
- **R² = 1** → Perfect fit (model explains all variance)  
- **R² = 0** → Model explains no variance (predictions are no better than always predicting the mean)  
- **R² < 0** → Model performs worse than simply predicting the mean

---
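
The three cases above can be reproduced with a small function written in the same style as the other metrics. This is a sketch only; `r2_metric` is a hypothetical name, and the first call reuses the regression example values from the MAE and RMSE sections.

```python
def r2_metric(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are identical")
    return 1 - ss_res / ss_tot


actual = [0.1, 0.2, 0.3, 0.4, 0.5]

close = [0.11, 0.19, 0.29, 0.41, 0.5]   # near-perfect predictions
mean_only = [0.3] * 5                   # always predict the mean
reversed_order = [0.5, 0.4, 0.3, 0.2, 0.1]  # worse than the mean

print(round(r2_metric(actual, close), 3))  # 0.996
print(r2_metric(actual, mean_only))        # approximately 0
print(r2_metric(actual, reversed_order))   # negative
```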

**Key Points:**
- A higher R² indicates a better model fit.  
- However, R² alone doesn’t indicate whether a model is appropriate: for a least-squares linear model, R² on the training data never decreases when predictors are added, even irrelevant ones.  
- Adjusted R² is preferred when comparing models with different numbers of predictors.

---
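
For reference, the usual form of Adjusted R² is shown below. The symbols $ n $ (number of observations) and $ p $ (number of predictors) are introduced here and are not defined elsewhere in these notes.

$$
R^2_{\text{adj}} = 1 - (1 - R^2) \, \frac{n - 1}{n - p - 1}
$$

Because the correction factor grows with $ p $, adding a predictor raises Adjusted R² only if it improves R² by more than the penalty for the extra parameter.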

| R² Value | Interpretation |
|:----------|:----------------|
| 0.9 – 1.0 | Excellent fit |
| 0.7 – 0.9 | Good fit |
| 0.5 – 0.7 | Moderate fit |
| < 0.5 | Poor fit |

---
