#### When Is Accuracy Misleading?

Accuracy can be misleading when it hides important details about model performance.
Although it measures overall correctness, it does not tell the complete story.

---

##### 1. Imbalanced Datasets

If one class dominates the dataset, accuracy can appear high even when the model performs poorly.

Example:
If 95% of samples belong to class 0,
a model predicting only class 0 will achieve 95% accuracy.

However, it completely fails to detect the minority class.

In such cases, consider:
- Precision
- Recall
- F1-score
- ROC-AUC
- Confusion Matrix
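
The 95% example can be reproduced in a few lines. This is a minimal sketch using a made-up label list (95 zeros, 5 ones) and a "model" that always predicts class 0; none of it comes from a real dataset.

```python
# Hypothetical dataset: 95 samples of class 0 and 5 samples of class 1.
y_true = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for class 1: fraction of actual positives that were found.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall (class 1): {recall:.2f}")
```

Accuracy looks strong while recall on the minority class is zero, which is why the class-specific metrics listed above are worth checking.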

---

##### 2. Unequal Cost of Errors

Accuracy assumes:

Cost(False Positive) = Cost(False Negative)

But in many real-world problems, this assumption is false.

Example:
- Medical diagnosis → Missing a disease (False Negative) can be critical.
- Fraud detection → Missing fraud may be more costly than false alarms.
- Spam detection → Marking a genuine email as spam may be unacceptable.

Accuracy does not reflect the severity or cost of different types of errors.
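
One way to make unequal costs concrete is to attach a price to each error type. The sketch below assumes two hypothetical models with identical accuracy and assumed costs (`COST_FP = 1`, `COST_FN = 10`); the numbers are illustrative, not taken from any real application.

```python
# Hypothetical confusion counts for two models on the same 10,000 samples.
models = {
    "A": {"tp": 1000, "fn": 200, "fp": 800, "tn": 8000},
    "B": {"tp": 1000, "fn": 500, "fp": 500, "tn": 8000},
}

# Assumed costs per error (illustrative only): a miss is 10x worse than a false alarm.
COST_FP = 1
COST_FN = 10


def accuracy(counts):
    """Fraction of all samples that were classified correctly."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


def total_cost(counts):
    """Total cost of errors under the assumed per-error costs."""
    return COST_FP * counts["fp"] + COST_FN * counts["fn"]


for name, counts in models.items():
    print(f"Model {name}: accuracy={accuracy(counts):.2f}, cost={total_cost(counts)}")
```

Both models score the same accuracy, yet under these assumed costs model B is roughly twice as expensive because it misses more positives.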

---

##### 3. Multiclass with Class Imbalance

In multiclass problems, if one class dominates,
accuracy may mostly reflect performance on that dominant class.

Minor classes may be predicted poorly without significantly affecting the overall accuracy.

---

##### 4. Does Not Show Error Distribution

Two models can have the same accuracy but very different confusion matrices.

Example:
Both models have 90% accuracy,
but one model makes many False Negatives,
while the other makes many False Positives.

Accuracy cannot distinguish between these behaviors.

---

##### 5. During Model Optimization

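Accuracy is not a usable objective for gradient-based training.
It is based on hard class predictions (0 or 1), so it is piecewise constant in the model parameters: its gradient is zero almost everywhere and undefined where a prediction flips.

Most gradient-trained models therefore optimize smooth surrogate losses such as:
- Cross-entropy (also called log loss)
- Hinge loss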

Accuracy is typically used only for evaluation, not for training.
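
The sketch below illustrates why accuracy gives no training signal. It uses a hypothetical one-parameter logistic model `sigmoid(w * x)` on four made-up points: changing `w` leaves accuracy unchanged, while log loss responds to every change.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def accuracy(w, xs, ys):
    """Accuracy of the hard predictions (threshold 0.5)."""
    preds = [1 if sigmoid(w * x) >= 0.5 else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)


def log_loss(w, xs, ys):
    """Mean binary cross-entropy of the predicted probabilities."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(ys)


# Hypothetical 1-D data: negative x -> class 0, positive x -> class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

for w in (0.5, 1.0, 2.0):
    print(f"w={w}: accuracy={accuracy(w, xs, ys):.2f}, log loss={log_loss(w, xs, ys):.4f}")
```

Accuracy stays flat across all three weights, so an optimizer could not tell which direction improves the model; log loss keeps decreasing as the predictions become more confident.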

---



Accuracy is reliable when:
- Classes are balanced
- Error costs are similar
- Overall correctness is the main concern

Accuracy is misleading when:
- Data is imbalanced
- Error costs are unequal
- Class-wise performance matters

In such cases, always examine the confusion matrix and class-specific metrics.

#### Model 1

|              | Sent to Spam | Not Sent to Spam |
| ------------ | ------------ | ---------------- |
| **Spam**     | 100 (TP)     | 170 (FN)         |
| **Not Spam** | 30 (FP)      | 700 (TN)         |


#### Model 2

|              | Sent to Spam | Not Sent to Spam |
| ------------ | ------------ | ---------------- |
| **Spam**     | 100 (TP)     | 190 (FN)         |
| **Not Spam** | 10 (FP)      | 700 (TN)         |


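> Both models have the same accuracy (800/1000 = 0.80), but Model 2 sends fewer genuine emails to spam (10 false positives vs. 30), so **Model 2 is safer to deploy**.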

---

What proportion of predicted positives is truly positive? This is called **Precision**.

$$
\text{Precision} = \frac{TP}{TP + FP}
$$
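
Applying this formula to the two spam tables above gives a quick numerical check. The counts are copied from the tables; the helper names `accuracy` and `precision` are just for this sketch.

```python
# Counts copied from the Model 1 and Model 2 spam tables.
spam_models = {
    "Model 1": {"tp": 100, "fn": 170, "fp": 30, "tn": 700},
    "Model 2": {"tp": 100, "fn": 190, "fp": 10, "tn": 700},
}


def accuracy(c):
    return (c["tp"] + c["tn"]) / (c["tp"] + c["tn"] + c["fp"] + c["fn"])


def precision(c):
    # Of everything sent to spam, how much really was spam?
    return c["tp"] / (c["tp"] + c["fp"])


results = {name: (accuracy(c), precision(c)) for name, c in spam_models.items()}

for name, (acc, prec) in results.items():
    print(f"{name}: accuracy={acc:.2f}, precision={prec:.2f}")
```

Accuracy is identical for both models, while precision separates them: roughly 0.77 for Model 1 and 0.91 for Model 2.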

## Understanding Recall 

**Model 1**

|                | Detected Cancer | Not Detected |
| -------------- | --------------- | ------------ |
| **Has Cancer** | 1000 (TP)       | 200 (FN)     |
| **No Cancer**  | 800 (FP)        | 8000 (TN)    |


**Model 2**

|                | Detected Cancer | Not Detected |
| -------------- | --------------- | ------------ |
| **Has Cancer** | 1000 (TP)       | 500 (FN)     |
| **No Cancer**  | 500 (FP)        | 8000 (TN)    |


> Model 1 and Model 2 have the same accuracy (9000/10000 = 0.90), but their recall differs.

$$
\text{Recall}_{\text{Model 1}} = \frac{1000}{1000 + 200} = \frac{1000}{1200} = 0.83
$$

$$
\text{Recall}_{\text{Model 2}} = \frac{1000}{1000 + 500} = \frac{1000}{1500} = 0.67
$$

What proportion of actual positives is correctly classified? This is called **Recall**.

$$
\text{Recall} = \frac{TP}{TP + FN}
$$
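
The cancer example can be checked the same way. The counts below are copied from the two tables above; the function names are only for this sketch.

```python
# Counts copied from the two cancer-detection tables.
cancer_models = {
    "Model 1": {"tp": 1000, "fn": 200, "fp": 800, "tn": 8000},
    "Model 2": {"tp": 1000, "fn": 500, "fp": 500, "tn": 8000},
}


def accuracy(c):
    return (c["tp"] + c["tn"]) / (c["tp"] + c["tn"] + c["fp"] + c["fn"])


def recall(c):
    # Of all patients who have cancer, how many were detected?
    return c["tp"] / (c["tp"] + c["fn"])


summary = {name: (accuracy(c), recall(c)) for name, c in cancer_models.items()}

for name, (acc, rec) in summary.items():
    missed = cancer_models[name]["fn"]
    print(f"{name}: accuracy={acc:.2f}, recall={rec:.2f}, missed cases={missed}")
```

With equal accuracy, Model 1 misses 200 patients and Model 2 misses 500, which is exactly the difference recall exposes.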

##### Choosing Between Precision and Recall Based on Error Type

The choice is not about which metric to "use": we optimize for the metric that reduces the more dangerous error.

---

##### When Type II Error (False Negative) Is More Dangerous

Type II Error = FN  

Example: Cancer detection, fraud detection  

We care about catching all real positives.

$$
\text{Recall} = \frac{TP}{TP + FN}
$$

Since:

$$
\text{Type II Error Rate} = 1 - \text{Recall}
$$

Higher Recall → Fewer False Negatives  

---

##### When Type I Error (False Positive) Is More Dangerous

Type I Error = FP  

Example: Spam detection, loan approval  

We care about avoiding false alarms.

$$
\text{Precision} = \frac{TP}{TP + FP}
$$

Higher Precision → Fewer False Positives  

---

##### Decision Rule

- If False Negatives are more costly → Optimize Recall  
- If False Positives are more costly → Optimize Precision  
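
In practice, one common way to favour one metric over the other is to move the decision threshold applied to the model's predicted scores. The sketch below uses ten made-up scores and labels (assumptions for illustration) to show the trade-off.

```python
# Hypothetical predicted probabilities and true labels, sorted by score.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]


def precision_recall_at(threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.75, 0.50, 0.25):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f}: precision={p:.3f}, recall={r:.3f}")
```

On this toy data, a high threshold yields few false positives (high precision) but misses positives, and a low threshold catches every positive (high recall) at the price of more false alarms.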

## F1 Score 

##### Why Do We Use F1 Score?

In some cases, Precision and Recall alone cannot clearly tell us which model is better.

For example:

- Model A has higher Precision but lower Recall  
- Model B has higher Recall but lower Precision  

Now it becomes difficult to decide which model to choose.

In such situations, we use **F1 Score**, which balances both Precision and Recall into a single metric.

$$
F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

F1 Score is the harmonic mean of Precision and Recall.

It is useful when:

- Dataset is imbalanced  
- Both False Positives and False Negatives matter  
- We need a single evaluation metric  

F1 penalizes extreme imbalance between Precision and Recall, ensuring that both must be reasonably high for a good score.

##### Harmonic Mean

The harmonic mean of two numbers $a$ and $b$ is defined as:

$$
H = \frac{2ab}{a + b}
$$

For F1 Score, the harmonic mean is applied to Precision and Recall:

$$
F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

##### Why Harmonic Mean?

The harmonic mean is used instead of the arithmetic mean because it penalizes large differences between the two values.

Example:

If  
Precision = 0.9  
Recall = 0.1  

Arithmetic Mean:

$$
\frac{0.9 + 0.1}{2} = 0.5
$$

Harmonic Mean (F1):

$$
\frac{2 \cdot 0.9 \cdot 0.1}{0.9 + 0.1} = 0.18
$$

The harmonic mean becomes low when one value is very small.

##### Key Insight

Harmonic mean forces both Precision and Recall to be high.

If either Precision or Recall is low, the F1 score will also be low.
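
The comparison between the two means is easy to reproduce. The helper names below (`f1_from_pr`, `f1_from_counts`) are hypothetical; the second one uses the equivalent count form $F1 = 2TP / (2TP + FP + FN)$, which follows from substituting the Precision and Recall definitions.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_from_counts(tp, fp, fn):
    """Same quantity written directly in terms of confusion counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def arithmetic_mean(a, b):
    return (a + b) / 2


# Both pairs have the same arithmetic mean, but very different F1.
for p, r in [(0.5, 0.5), (0.9, 0.1)]:
    print(f"P={p}, R={r}: arithmetic={arithmetic_mean(p, r):.2f}, F1={f1_from_pr(p, r):.2f}")

# Cross-check the two formulas on the cancer Model 1 counts from earlier.
tp, fp, fn = 1000, 800, 200
print(f1_from_pr(tp / (tp + fp), tp / (tp + fn)), f1_from_counts(tp, fp, fn))
```

The balanced pair keeps an F1 of 0.5, while the lopsided pair drops to 0.18 even though its arithmetic mean is also 0.5.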

In [2]:
import pandas as pd 
import numpy as np

In [3]:
df = pd.read_csv("heart_disease_uci.csv")

In [4]:
df.head()

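|   | id | age | sex | dataset | cp | trestbps | chol | fbs | restecg | thalch | exang | oldpeak | slope | ca | thal | num |
| - | -- | --- | --- | ------- | -- | -------- | ---- | --- | ------- | ------ | ----- | ------- | ----- | -- | ---- | --- |
| 0 | 1 | 63 | Male | Cleveland | typical angina | 145.0 | 233.0 | True | lv hypertrophy | 150.0 | False | 2.3 | downsloping | 0.0 | fixed defect | 0 |
| 1 | 2 | 67 | Male | Cleveland | asymptomatic | 160.0 | 286.0 | False | lv hypertrophy | 108.0 | True | 1.5 | flat | 3.0 | normal | 2 |
| 2 | 3 | 67 | Male | Cleveland | asymptomatic | 120.0 | 229.0 | False | lv hypertrophy | 129.0 | True | 2.6 | flat | 2.0 | reversable defect | 1 |
| 3 | 4 | 37 | Male | Cleveland | non-anginal | 130.0 | 250.0 | False | normal | 187.0 | False | 3.5 | downsloping | 0.0 | normal | 0 |
| 4 | 5 | 41 | Female | Cleveland | atypical angina | 130.0 | 204.0 | False | lv hypertrophy | 172.0 | False | 1.4 | upsloping | 0.0 | normal | 0 |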


In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [6]:
# Drop ID
df = df.drop('id', axis=1)

In [7]:
# Drop highly missing columns
df = df.drop(columns=['ca', 'thal', 'slope'])

In [8]:
# Drop remaining missing rows
df = df.dropna()

In [9]:
# Binarize the target: 0 stays 0 (no disease), any value > 0 becomes 1 (disease present)
df['num'] = (df['num'] > 0).astype(int)

In [10]:
X = df.drop('num', axis=1)
y = df['num']

In [11]:
# One-hot encode
X = pd.get_dummies(X, drop_first=True)

In [12]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [13]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [14]:
dt = DecisionTreeClassifier(random_state=42)
lr = LogisticRegression(max_iter=1000)

In [15]:
dt.fit(X_train , y_train)

DecisionTreeClassifier(random_state=42)


In [16]:
lr.fit(X_train_scaled, y_train)

LogisticRegression(max_iter=1000)


In [17]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

In [18]:
y_pred_dt = dt.predict(X_test)

print("----- Decision Tree -----")
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_dt))
print("Accuracy:", accuracy_score(y_test, y_pred_dt))
print("Precision:", precision_score(y_test, y_pred_dt))
print("Recall:", recall_score(y_test, y_pred_dt))
print("F1 Score:", f1_score(y_test, y_pred_dt))

----- Decision Tree -----
Confusion Matrix:
 [[52 19]
 [17 60]]
Accuracy: 0.7567567567567568
Precision: 0.759493670886076
Recall: 0.7792207792207793
F1 Score: 0.7692307692307693


In [19]:
y_pred_lr = lr.predict(X_test_scaled)

print("\n----- Logistic Regression -----")
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_lr))
print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print("Precision:", precision_score(y_test, y_pred_lr))
print("Recall:", recall_score(y_test, y_pred_lr))
print("F1 Score:", f1_score(y_test, y_pred_lr))


----- Logistic Regression -----
Confusion Matrix:
 [[60 11]
 [12 65]]
Accuracy: 0.8445945945945946
Precision: 0.8552631578947368
Recall: 0.8441558441558441
F1 Score: 0.8496732026143791
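
The reported scores can be recomputed by hand from the two confusion matrices printed above. The sketch assumes scikit-learn's layout for labels `[0, 1]`: rows are actual classes, columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Confusion matrices copied from the notebook output above.
# Layout (rows = actual, columns = predicted): [[TN, FP], [FN, TP]]
matrices = {
    "Decision Tree": [[52, 19], [17, 60]],
    "Logistic Regression": [[60, 11], [12, 65]],
}


def metrics_from_matrix(cm):
    """Derive accuracy, precision, recall and F1 for the positive class (1)."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


metrics = {name: metrics_from_matrix(cm) for name, cm in matrices.items()}

for name, m in metrics.items():
    print(name, {k: round(v, 4) for k, v in m.items()})
```

The values agree with the `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` outputs above, which confirms how the four cells of the matrix map to TP, FP, FN, and TN.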


### These Metrics in Multiclass Classification 

In [21]:
print("Precision per class:", precision_score(y_test, y_pred_lr, average=None))

Precision per class: [0.83333333 0.85526316]


##### Precision and Recall in Binary vs Multiclass Classification

In binary classification, we technically compute Precision and Recall for both classes.

However, by convention, we usually focus on the **Positive class**.

Why?

Because in most real-world problems, the positive class represents the event of interest, such as:

- Disease present  
- Fraud detected  
- Spam email  
- Default risk  

So when we say:

$$
\text{Precision} = \frac{TP}{TP + FP}
$$

$$
\text{Recall} = \frac{TP}{TP + FN}
$$

We are referring to the Positive class.

---

In multiclass classification, the situation is different.

We compute Precision, Recall, and F1 **for each class separately** using a one-vs-rest approach.

Each class is treated as the positive class once, while the remaining classes are treated as negative.

Then we combine the results using:

- Macro average  
- Micro average  
- Weighted average  

---



- Binary classification → Metrics usually reported for Positive class  
- Multiclass classification → Metrics computed per class and then averaged  

### Multi-class 

| Actual \ Predicted | Dog | Cat | Rabbit | Total |
| ------------------ | --- | --- | ------ | ----- |
| **Dog**            | 25  | 5   | 10     | 40    |
| **Cat**            | 0   | 30  | 4      | 34    |
| **Rabbit**         | 4   | 10  | 20     | 34    |
| **Total**          | 29  | 45  | 34     | 108   |


##### Precision (Per Class)

Precision formula:

$$
\text{Precision} = \frac{TP}{TP + FP}
$$

From the confusion matrix:

- For Dog:
$$
P_{\text{Dog}} = \frac{25}{29} = 0.86
$$

- For Cat:
$$
P_{\text{Cat}} = \frac{30}{45} = 0.67
$$

- For Rabbit:
$$
P_{\text{Rabbit}} = \frac{20}{34} = 0.59
$$

---

##### Macro Precision

Macro Precision is the simple average of per-class precision:

$$
\text{Macro Precision} = \frac{0.86 + 0.67 + 0.59}{3}
$$

$$
= 0.71
$$

---

##### Micro Precision

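Micro Precision is the total number of true positives divided by the total number of positive predictions (TP + FP summed over all classes). In single-label classification every sample receives exactly one predicted class, so this denominator equals the total number of samples: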

Total TP = 25 + 30 + 20 = 75  
Total samples = 108  

$$
\text{Micro Precision} = \frac{75}{108}
$$

$$
= 0.69
$$

In multiclass single-label classification, Micro Precision = Micro Recall = Micro F1.

---

##### Weighted Precision

Weighted Precision weights each class by its support (actual samples).

Supports:
- Dog = 40  
- Cat = 34  
- Rabbit = 34  
- Total = 108  

$$
\text{Weighted Precision}
=
\frac{
(40 \cdot 0.86) + (34 \cdot 0.67) + (34 \cdot 0.59)
}{108}
$$

$$
=
\frac{
34.4 + 22.78 + 20.06
}{108}
$$

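$$
\approx 0.72
$$

This value uses the per-class precisions rounded to two decimals. With unrounded precisions the numerator is about 77.15, giving 77.15 / 108 ≈ 0.714.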

---

##### Final Results

- Macro Precision = 0.71  
- Micro Precision = 0.69  
- Weighted Precision = 0.72  
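
The three averages can be recomputed directly from the Dog/Cat/Rabbit confusion matrix. This is a plain-Python sketch with variable names chosen for illustration. Because it uses unrounded per-class values, the weighted result comes out near 0.714 rather than the 0.72 obtained from two-decimal inputs.

```python
labels = ["Dog", "Cat", "Rabbit"]
# Rows = actual class, columns = predicted class (same order as `labels`).
cm = [
    [25, 5, 10],
    [0, 30, 4],
    [4, 10, 20],
]

n = len(labels)
total = sum(sum(row) for row in cm)                               # 108 samples
support = [sum(row) for row in cm]                                # actual count per class
predicted = [sum(cm[r][c] for r in range(n)) for c in range(n)]   # column totals
tp = [cm[i][i] for i in range(n)]                                 # diagonal entries

# Per-class precision: TP / (all samples predicted as that class).
precision = [tp[i] / predicted[i] for i in range(n)]

macro_precision = sum(precision) / n
micro_precision = sum(tp) / sum(predicted)
weighted_precision = sum(support[i] * precision[i] for i in range(n)) / total

for name, p in zip(labels, precision):
    print(f"{name}: {p:.3f}")
print(f"macro={macro_precision:.3f}, micro={micro_precision:.3f}, weighted={weighted_precision:.3f}")
```

The same numbers should be obtainable from scikit-learn's `precision_score` with `average="macro"`, `"micro"`, or `"weighted"` when given the underlying label arrays.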

##### When Do We Use Different Types of Precision?

In multiclass classification, we use different averaging methods depending on the problem and data distribution.

---

##### 1. Macro Precision

Macro Precision treats all classes equally.

$$
\text{Macro Precision} = \frac{1}{K} \sum_{i=1}^{K} P_i
$$

Use Macro Precision when:

- All classes are equally important.
- You want to evaluate performance per class fairly.
- You want to detect poor performance on minority classes.
- Dataset is imbalanced and you do not want majority class to dominate the metric.

Example:
Medical severity levels where each class matters equally.

---

##### 2. Micro Precision

Micro Precision aggregates all true positives and false positives across classes before computing precision.

$$
\text{Micro Precision} = \frac{\sum TP}{\sum TP + \sum FP}
$$

Use Micro Precision when:

- You care about overall system performance.
- You want a global metric.
- Class imbalance exists but you are fine with majority class dominating.
- You want consistency with overall accuracy (in single-label multiclass, micro precision = accuracy).

Example:
General image classification where total correct predictions matter most.

---

##### 3. Weighted Precision

Weighted Precision averages class precision weighted by class support.

$$
\text{Weighted Precision} = \sum \left( \frac{n_i}{N} \cdot P_i \right)
$$

Use Weighted Precision when:

- Dataset is imbalanced.
- You want class imbalance reflected in the final score.
- You want a per-class average in which frequent classes count more than rare ones.

Example:
Real-world datasets where some classes naturally occur more often.

---

##### Summary Decision Rule

- Equal importance to all classes → Use Macro
- Overall performance focus → Use Micro
- Imbalanced data with realistic weighting → Use Weighted
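
The difference between the averaging methods shows most clearly on imbalanced data. The sketch below uses a made-up 3-class confusion matrix (classes A, B, C with supports 90, 5, 5) and a hypothetical helper `averaged_precision`.

```python
def averaged_precision(cm):
    """Return macro, micro and weighted precision for a square confusion matrix.

    Rows are actual classes, columns are predicted classes.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    support = [sum(row) for row in cm]
    predicted = [sum(cm[r][c] for r in range(n)) for c in range(n)]
    tp = [cm[i][i] for i in range(n)]
    # Guard against classes that were never predicted.
    per_class = [tp[i] / predicted[i] if predicted[i] else 0.0 for i in range(n)]
    return {
        "per_class": per_class,
        "macro": sum(per_class) / n,
        "micro": sum(tp) / total,
        "weighted": sum(support[i] * per_class[i] for i in range(n)) / total,
    }


# Hypothetical imbalanced problem: class A dominates (90 of 100 samples).
imbalanced_cm = [
    [85, 3, 2],   # actual A
    [2, 2, 1],    # actual B
    [2, 1, 2],    # actual C
]

scores = averaged_precision(imbalanced_cm)
print("per class:", [round(p, 3) for p in scores["per_class"]])
print(f"macro={scores['macro']:.3f}, micro={scores['micro']:.3f}, weighted={scores['weighted']:.3f}")
```

Micro and weighted precision stay close to 0.9 because class A dominates, while macro precision falls to about 0.56 and exposes the weak minority classes.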

## Recall in Multiclass Data

In multiclass classification, Recall is computed separately for each class using a one-vs-rest approach.

For each class, we treat that class as Positive and all other classes as Negative.

---

##### Recall Formula (Per Class)

$$
\text{Recall}_i = \frac{TP_i}{TP_i + FN_i}
$$

Where:

- $TP_i$ = True Positives for class i  
- $FN_i$ = False Negatives for class i  

Recall answers:

What proportion of actual samples of class i were correctly classified?

---

##### Example (Dog, Cat, Rabbit)

From the confusion matrix:

| Actual \ Predicted | Dog | Cat | Rabbit |
|--------------------|-----|-----|--------|
| Dog                | 25  | 5   | 10     |
| Cat                | 0   | 30  | 4      |
| Rabbit             | 4   | 10  | 20     |

Supports:
- Dog = 40  
- Cat = 34  
- Rabbit = 34  

Per-class Recall:

For Dog:
$$
R_{\text{Dog}} = \frac{25}{40} = 0.63
$$

For Cat:
$$
R_{\text{Cat}} = \frac{30}{34} = 0.88
$$

For Rabbit:
$$
R_{\text{Rabbit}} = \frac{20}{34} = 0.59
$$

---

##### Macro Recall

Average of per-class recall:

$$
\text{Macro Recall} = \frac{0.63 + 0.88 + 0.59}{3}
$$

$$
= 0.70
$$

---

##### Micro Recall

Micro Recall aggregates all true positives:

Total TP = 25 + 30 + 20 = 75  
Total samples = 108  

$$
\text{Micro Recall} = \frac{75}{108}
$$

$$
= 0.69
$$

In single-label multiclass classification:
Micro Recall = Micro Precision = Micro F1.

---

##### Weighted Recall

Weighted by class support:

$$
\text{Weighted Recall}
=
\frac{
(40 \cdot 0.63) + (34 \cdot 0.88) + (34 \cdot 0.59)
}{108}
$$

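Using the unrounded recalls, each term $n_i \cdot R_i$ equals $TP_i$, so:

$$
= \frac{25 + 30 + 20}{108} = \frac{75}{108} = 0.69
$$

Weighted Recall therefore always equals Micro Recall (and accuracy) in single-label classification.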

---

##### Key Insight

- Macro Recall treats all classes equally.
- Micro Recall reflects overall performance.
- Weighted Recall accounts for class imbalance.
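
The recall averages can be verified from the same confusion matrix. In this plain-Python sketch (variable names are illustrative), weighted recall and micro recall come out identical, since weighting each class's recall by its support recovers the total true-positive count.

```python
labels = ["Dog", "Cat", "Rabbit"]
# Rows = actual class, columns = predicted class.
cm = [
    [25, 5, 10],
    [0, 30, 4],
    [4, 10, 20],
]

n = len(labels)
total = sum(sum(row) for row in cm)
support = [sum(row) for row in cm]   # TP + FN for each class
tp = [cm[i][i] for i in range(n)]

# Per-class recall: TP / (all samples that actually belong to the class).
recall = [tp[i] / support[i] for i in range(n)]

macro_recall = sum(recall) / n
micro_recall = sum(tp) / total
weighted_recall = sum(support[i] * recall[i] for i in range(n)) / total

for name, r in zip(labels, recall):
    print(f"{name}: {r:.3f}")
print(f"macro={macro_recall:.3f}, micro={micro_recall:.3f}, weighted={weighted_recall:.3f}")
```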

## F1 Score in Multiclass Data

In multiclass classification, F1 Score is computed for each class separately
using a one-vs-rest approach, and then averaged.

---

##### F1 Formula (Per Class)

For class i:

$$
F1_i = 2 \cdot \frac{P_i \cdot R_i}{P_i + R_i}
$$

Where:

- $P_i$ = Precision of class i  
- $R_i$ = Recall of class i  

F1 balances Precision and Recall for each class.

---

##### Example (Dog, Cat, Rabbit)

From previous calculations:

- Dog:
  - Precision = 0.86
  - Recall = 0.63

$$
F1_{\text{Dog}} =
2 \cdot \frac{0.86 \cdot 0.63}{0.86 + 0.63}
=
0.73
$$

- Cat:
  - Precision = 0.67
  - Recall = 0.88

$$
F1_{\text{Cat}} =
2 \cdot \frac{0.67 \cdot 0.88}{0.67 + 0.88}
=
0.76
$$

- Rabbit:
  - Precision = 0.59
  - Recall = 0.59

$$
F1_{\text{Rabbit}} =
2 \cdot \frac{0.59 \cdot 0.59}{0.59 + 0.59}
=
0.59
$$

---

##### Macro F1

Simple average of per-class F1:

$$
\text{Macro F1} =
\frac{0.73 + 0.76 + 0.59}{3}
$$

$$
= 0.69
$$

---

##### Micro F1

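Micro F1 uses total TP, FP, and FN. In single-label classification, every misclassified sample counts once as a false positive (for the predicted class) and once as a false negative (for the actual class), so total FP = total FN:

Total TP = 75  
Total FP = Total FN = 108 - 75 = 33  

$$
\text{Micro F1} =
\frac{2 \cdot 75}{2 \cdot 75 + 33 + 33}
=
\frac{75}{108}
$$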

$$
= 0.69
$$

In single-label multiclass classification:

$$
\text{Micro F1} = \text{Micro Precision} = \text{Micro Recall}
$$

---

##### Weighted F1

Weighted by class support:

Supports:
- Dog = 40
- Cat = 34
- Rabbit = 34

$$
\text{Weighted F1}
=
\frac{
(40 \cdot 0.73) + (34 \cdot 0.76) + (34 \cdot 0.59)
}{108}
$$

$$
= 0.69
$$

---

##### Key Insight

- Macro F1 treats all classes equally.
- Micro F1 reflects overall performance.
- Weighted F1 accounts for class imbalance.
- F1 does not consider True Negatives.
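
The F1 averages can be recomputed from the confusion matrix with the count form $F1_i = 2TP_i / (2TP_i + FP_i + FN_i)$. This is a sketch with illustrative variable names; because it works from raw counts, the Dog F1 prints as roughly 0.72 instead of the 0.73 obtained from two-decimal precision and recall.

```python
labels = ["Dog", "Cat", "Rabbit"]
# Rows = actual class, columns = predicted class.
cm = [
    [25, 5, 10],
    [0, 30, 4],
    [4, 10, 20],
]

n = len(labels)
total = sum(sum(row) for row in cm)
support = [sum(row) for row in cm]
predicted = [sum(cm[r][c] for r in range(n)) for c in range(n)]

tp = [cm[i][i] for i in range(n)]
fp = [predicted[i] - tp[i] for i in range(n)]   # predicted as class i, but wrong
fn = [support[i] - tp[i] for i in range(n)]     # actual class i, but missed

# Per-class F1 from counts; true negatives never appear in the formula.
f1 = [2 * tp[i] / (2 * tp[i] + fp[i] + fn[i]) for i in range(n)]

macro_f1 = sum(f1) / n
micro_f1 = 2 * sum(tp) / (2 * sum(tp) + sum(fp) + sum(fn))
weighted_f1 = sum(support[i] * f1[i] for i in range(n)) / total

for name, score in zip(labels, f1):
    print(f"{name}: {score:.3f}")
print(f"macro={macro_f1:.3f}, micro={micro_f1:.3f}, weighted={weighted_f1:.3f}")
```

All three averages round to 0.69 on this matrix, and micro F1 equals 75/108, the overall accuracy.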