# **Task 5: Train-Test Split & Evaluation Metrics:**

## **Importing Required Modules and Loading Dataset:**

In [34]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

In [35]:
df = pd.read_csv('Datasets/heart.csv')

## **Analysis of Dataset:**

In [36]:
df.head()

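| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |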


In [37]:
df.shape

(1025, 14)

In [38]:
df.dtypes

| Column | dtype |
| :--- | :--- |
| age | int64 |
| sex | int64 |
| cp | int64 |
| trestbps | int64 |
| chol | int64 |
| fbs | int64 |
| restecg | int64 |
| thalach | int64 |
| exang | int64 |
| oldpeak | float64 |


In [39]:
np.sort(df['age'].unique())  # ages range from 29 to 77

array([29, 34, 35, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 74, 76, 77])

In [40]:
df['sex'].unique() # 1 - male and 0 - female

array([1, 0])

In [41]:
df.describe()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| count | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 |
| mean | 54.434146 | 0.69561 | 0.942439 | 131.611707 | 246.0 | 0.149268 | 0.529756 | 149.114146 | 0.336585 | 1.071512 | 1.385366 | 0.754146 | 2.323902 | 0.513171 |
| std | 9.07229 | 0.460373 | 1.029641 | 17.516718 | 51.59251 | 0.356527 | 0.527878 | 23.005724 | 0.472772 | 1.175053 | 0.617755 | 1.030798 | 0.62066 | 0.50007 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 48.0 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 132.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 56.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 152.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 275.0 | 0.0 | 1.0 | 166.0 | 1.0 | 1.8 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |


In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   cp        1025 non-null   int64  
 3   trestbps  1025 non-null   int64  
 4   chol      1025 non-null   int64  
 5   fbs       1025 non-null   int64  
 6   restecg   1025 non-null   int64  
 7   thalach   1025 non-null   int64  
 8   exang     1025 non-null   int64  
 9   oldpeak   1025 non-null   float64
 10  slope     1025 non-null   int64  
 11  ca        1025 non-null   int64  
 12  thal      1025 non-null   int64  
 13  target    1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB


## **Splitting dataset into train and test sets:**

### **The purpose of Training and Testing:**
In **machine learning**, **training** and **testing** serve two distinct but equally important purposes. During the training phase, the model **learns** from labeled data by **identifying patterns** and adjusting its internal parameters to **minimize errors**, which builds the model's predictive logic. Good performance on the training data alone is not sufficient, however, because the model may simply memorize that data, a problem known as **overfitting**. This is where testing becomes essential. In the **testing phase**, the already trained model is evaluated on **unseen data** to assess how well it generalizes to new, real-world cases. As long as the test data played no part in training, it gives an unbiased estimate of performance through metrics such as **accuracy or error rate** and shows whether the model has learned general patterns rather than memorized examples. Together, training and testing ensure that a model is both well fitted and reliable in practical use.
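
The gap between training and testing performance can be made concrete with a small experiment. The sketch below is not part of the analysis in this notebook: it uses synthetic data from `make_classification` and swaps in an unrestricted `DecisionTreeClassifier` (both are assumptions chosen for illustration) because such a tree can memorize its training rows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): flip_y=0.2 assigns a
# random class to about 20% of the samples, so part of the labels is noise.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# An unrestricted decision tree keeps splitting until it fits every training row.
memorizer = DecisionTreeClassifier(random_state=0)
memorizer.fit(X_tr, y_tr)

train_acc = memorizer.score(X_tr, y_tr)  # accuracy on data the model has seen
test_acc = memorizer.score(X_te, y_te)   # accuracy on held-out data

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

A training accuracy well above the test accuracy is the typical signature of overfitting, and only the held-out score says anything about new data.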


In [43]:
# 'target' is the label to predict; all other columns are features

In [44]:
X = df.drop(columns=['target'])
y = df['target']

In [45]:
X

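| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | 1 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 2 | 0 | 2 |
| 1021 | 60 | 1 | 0 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 |
| 1022 | 47 | 1 | 0 | 110 | 275 | 0 | 0 | 118 | 1 | 1.0 | 1 | 1 | 2 |
| 1023 | 50 | 0 | 0 | 110 | 254 | 0 | 0 | 159 | 0 | 0.0 | 2 | 0 | 2 |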


In [46]:
y

| | target |
| :--- | :--- |
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| ... | ... |
| 1020 | 1 |
| 1021 | 0 |
| 1022 | 0 |
| 1023 | 1 |


In [47]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)
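
The call above holds out 20% of the rows and fixes the shuffle with `random_state=2`. As an optional variation, `train_test_split` also accepts a `stratify` argument that keeps the class proportions about the same in both subsets. The sketch below runs on hypothetical stand-in data (`X_demo` and `y_demo` are assumptions, sized like this dataset with roughly 51% positives), not on `heart.csv` itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1025 rows, 526 positives and 499 negatives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1025, 3))
y_demo = np.array([1] * 526 + [0] * 499)

# stratify=y_demo asks for the same class balance in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2, stratify=y_demo
)

print(len(X_tr), len(X_te))                     # sizes of the two subsets
print(y_demo.mean(), y_tr.mean(), y_te.mean())  # share of positives in each
```

Stratifying matters most when one class is rare, since a purely random split can leave the test set with very few examples of it.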

## **Train a simple model (Logistic Regression):**

In [48]:
from sklearn.linear_model import LogisticRegression

In [49]:
model = LogisticRegression(max_iter=1000)

In [50]:
model.fit(X_train,y_train)

## **Predict on test data  & Calculate accuracy, precision, recall:**

In [51]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, f1_score

In [52]:
y_pred = model.predict(X_test)

In [53]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

### **Accuracy:**

**Accuracy** is the most intuitive classification metric. It simply measures how often the model is correct.

#### **Definition:**
The percentage of total predictions that were correct (both True Positives and True Negatives).

#### **Formula:**
$$
\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}
$$

or, using Confusion Matrix terms:

$$
\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}
$$

#### **When to Use It:**
* **Balanced Datasets:** When the classes are roughly equal (e.g., 50% Cat images, 50% Dog images).
* **Equal Error Cost:** When a False Positive is just as bad as a False Negative.

##### **The "Accuracy Trap" (Imbalanced Data):**
Accuracy is highly misleading when datasets are imbalanced.

**Example:**
* **Scenario:** Terrorist detection system.
* **Data:** 99,900 Normal Passengers, 100 Terrorists.
* **Lazy Model:** Predicts "Normal Passenger" for *everyone*.
* **Result:** The model is **99.9% Accurate**.
* **Reality:** The model is useless because it missed every single terrorist.

> **Takeaway:** Never trust accuracy alone on imbalanced datasets. Always look at the Confusion Matrix or F1 Score.
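
The scenario above can be reproduced in a few lines of plain Python. This is a minimal sketch: the label lists are built directly from the counts in the example, and the second number is the share of actual positives that the model finds (the recall metric introduced later in this notebook).

```python
# Labels from the scenario above: 1 = positive (rare) class, 0 = normal passenger.
y_true = [0] * 99_900 + [1] * 100
# The "lazy" model predicts the majority class for everyone.
y_lazy = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_lazy))
accuracy_lazy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_lazy))
recall_lazy = true_positives / sum(y_true)

print(f"accuracy: {accuracy_lazy:.3f}")  # 0.999
print(f"recall:   {recall_lazy:.3f}")    # 0.000
```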

In [54]:
accuracy

0.848780487804878
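
As a cross-check, the same value follows from the accuracy formula once the four counts are known. The sketch below plugs in the counts from the confusion matrix output shown later in this notebook; the variable names are chosen here for illustration.

```python
# Counts taken from the confusion matrix output shown later in this notebook.
tn, fp, fn, tp = 81, 24, 7, 93

total = tp + tn + fp + fn
accuracy_by_hand = (tp + tn) / total

print(total)             # number of test samples
print(accuracy_by_hand)  # should agree with accuracy_score up to rounding
```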

### **Precision & Recall:**

When accuracy is misleading, we turn to Precision and Recall. These metrics usually trade off against one another: improving one often lowers the other.

#### **Precision (Quality of Positive Predictions):**
Precision answers: *"Of all the times the model predicted YES, how often was it right?"*

##### **Formula:**
$$
\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
$$

##### **When to Prioritize Precision:**
Use Precision when **False Positives (Type 1 Errors)** are costly or annoying.

* **Example:** **Email Spam Detection**.
* **Reasoning:** You want to be absolutely sure before you move an email to the junk folder. Losing an important email (False Positive) is worse than seeing a spam email (False Negative).

---

#### **Recall (Quantity of Positives Found):**
Recall answers: *"Of all the actual YES cases, how many did the model find?"*

##### **Formula:**
$$
\text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}
$$

---
##### **When to Prioritize Recall:**
Use Recall when **False Negatives (Type 2 Errors)** are dangerous or expensive.

* **Example:** **Cancer Diagnosis** or **Fraud Detection**.
* **Reasoning:** It is acceptable to flag a few healthy people for further testing (False Positive) if it means you catch every single person who actually has cancer. Missing a case (False Negative) could be fatal.
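
The trade-off between the two metrics shows up when the decision threshold moves. The sketch below uses made-up scores and labels (both hypothetical, not taken from the heart dataset) and a small helper, `precision_recall_at`, written for this illustration.

```python
# Hypothetical predicted probabilities and true labels, for illustration only.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

def precision_recall_at(threshold):
    """Return (precision, recall) when scores >= threshold count as positive.

    Assumes at least one positive prediction and one actual positive,
    which holds for the thresholds used below.
    """
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    return tp / (tp + fp), tp / (tp + fn)

results = {t: precision_recall_at(t) for t in (0.25, 0.55, 0.85)}
for t, (p, r) in results.items():
    print(f"threshold={t:.2f}  precision={p:.3f}  recall={r:.3f}")
```

On this toy data, raising the threshold makes the model more selective: precision climbs while recall drops. With a fitted scikit-learn classifier, scores of this kind are available from `predict_proba`.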

In [55]:
precision

0.7948717948717948

In [56]:
recall

0.93
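
Both values can be verified by hand from the formulas above. The sketch below again uses the counts from the confusion matrix output shown later in this notebook; the variable names are illustrative.

```python
# Counts taken from the confusion matrix output shown later in this notebook.
tn, fp, fn, tp = 81, 24, 7, 93

precision_by_hand = tp / (tp + fp)  # 93 of the 117 positive predictions were right
recall_by_hand = tp / (tp + fn)     # 93 of the 100 actual positives were found

print(precision_by_hand)
print(recall_by_hand)
```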

### **F1 Score:**

The **F1 Score** is the "middle ground" metric. It combines Precision and Recall into a single number.

#### **Definition:**
The F1 Score is the **Harmonic Mean** of Precision and Recall.

#### **Formula:**
$$
\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

---
#### **Why Harmonic Mean?**
Why not just take the arithmetic mean, $\frac{P+R}{2}$?

The Harmonic Mean punishes extreme values. If your model has 100% Recall but only 1% Precision, the arithmetic mean would be about 50% (seemingly decent), but the F1 Score would be about 2% (terrible).

**F1 ensures the model is decent at BOTH Precision and Recall.**
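
A short numeric comparison makes the difference between the two means visible. In the sketch below, `f1_from` is a helper written for this illustration, and the lopsided precision of 1% is a hypothetical value; the last line reuses the precision and recall reported in this notebook.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (defined as 0.0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical lopsided model: it finds every positive, but only 1% of its
# positive predictions are correct.
p, r = 0.01, 1.0
arithmetic_mean = (p + r) / 2
harmonic_mean = f1_from(p, r)

print(f"arithmetic mean: {arithmetic_mean:.3f}")  # 0.505
print(f"F1 score:        {harmonic_mean:.3f}")    # 0.020

# The notebook's own precision and recall (93/117 and 93/100).
print(f1_from(93 / 117, 93 / 100))
```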

#### **When to Use It:**
1.  **Imbalanced Datasets:** When you have far more negatives than positives (or vice versa).
2.  **Unclear Trade-offs:** When you don't clearly prefer Precision over Recall (e.g., classifying images of Cats vs. Dogs).
3.  **Comparing Models:** When you need a single metric to decide which model is "better" overall.

#### **Multi-Class F1 Score:**
For datasets with more than 2 classes (e.g., Cat, Dog, Rabbit), F1 is computed separately for each class and the per-class scores are then averaged:
* **Macro F1:** Average F1 across all classes (treats all classes equally).
* **Weighted F1:** Average F1 weighted by class size (gives more importance to larger classes).
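
The two averaging schemes differ only in how the per-class scores are weighted. The sketch below uses hypothetical per-class F1 values and class sizes (none of these numbers come from this notebook).

```python
# Hypothetical per-class F1 scores and class sizes (support), for illustration only.
per_class_f1 = {"cat": 0.90, "dog": 0.80, "rabbit": 0.40}
support = {"cat": 50, "dog": 40, "rabbit": 10}

# Macro F1: plain average, every class counts the same.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted F1: each class contributes in proportion to its number of samples.
total = sum(support.values())
weighted_f1 = sum(per_class_f1[c] * support[c] for c in per_class_f1) / total

print(f"macro F1:    {macro_f1:.2f}")     # 0.70
print(f"weighted F1: {weighted_f1:.2f}")  # 0.81
```

The small, poorly classified class pulls the macro average down much more than the weighted one. In scikit-learn, the same choice is made through the `average` argument of `f1_score`, for example `average="macro"` or `average="weighted"`.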

In [57]:
f1_score(y_test,y_pred)

0.8571428571428571

## **Confusion Matrix & Types of Errors:**


The **Confusion Matrix** is a table that breaks down the model's predictions to show *where* it is getting confused. It is the foundation for calculating Precision, Recall, and F1 Score.

#### **The Structure (Binary Classification):**

| | **Predicted Negative (0)** | **Predicted Positive (1)** |
| :--- | :--- | :--- |
| **Actual Negative (0)** | **True Negative (TN)**<br>*(Correctly predicted No)* | **False Positive (FP)**<br>*(Type 1 Error)* |
| **Actual Positive (1)** | **False Negative (FN)**<br>*(Type 2 Error)* | **True Positive (TP)**<br>*(Correctly predicted Yes)* |

#### **Terminology:**
* **True Positive (TP):** Reality was **Yes**, Model said **Yes**.
* **True Negative (TN):** Reality was **No**, Model said **No**.
* **False Positive (FP):** Reality was **No**, Model said **Yes**.
* **False Negative (FN):** Reality was **Yes**, Model said **No**.
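
The four terms can be counted directly from paired labels and predictions. The sketch below uses short hypothetical lists (`y_true` and `y_hat` are made up for illustration) and arranges the counts in the same layout as the table above.

```python
# Hypothetical true labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # reality Yes, model Yes
tn = sum(t == 0 and p == 0 for t, p in pairs)  # reality No, model No
fp = sum(t == 0 and p == 1 for t, p in pairs)  # reality No, model Yes (Type 1)
fn = sum(t == 1 and p == 0 for t, p in pairs)  # reality Yes, model No (Type 2)

# Same layout as the table above: rows are actual classes, columns are predictions.
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)  # [[3, 1], [1, 3]]
```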

#### **Types of Errors:**

##### **Type 1 Error (False Positive):**
* **Definition:** The model raised a false alarm.
* **Example:** A spam filter marks a legitimate job offer email as "Spam".

##### **Type 2 Error (False Negative):**
* **Definition:** The model failed to detect an event.
* **Example:** A medical test tells a sick patient they are "Healthy".
* **Impact:** Type 2 errors are often more dangerous in critical fields like healthcare or security.

In [58]:
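import sklearn.metrics  # call through the module so this cell can be re-run after the name below is rebound
confusion_matrix = sklearn.metrics.confusion_matrix(y_test, y_pred)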

In [59]:
confusion_matrix

array([[81, 24],
       [ 7, 93]])
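
For binary labels, the four cells can be unpacked from the array in one step. The sketch below re-enters the matrix by hand under the name `cm` (a name chosen here so the snippet runs on its own) and also sums the rows and columns.

```python
import numpy as np

# The matrix printed above, re-entered by hand so the snippet runs on its own.
cm = np.array([[81, 24],
               [7, 93]])

# For binary labels 0/1, ravel() flattens row by row: TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

actual_counts = cm.sum(axis=1)     # row sums: actual negatives, actual positives
predicted_counts = cm.sum(axis=0)  # column sums: predicted negatives, predicted positives

print(tn, fp, fn, tp)
print(actual_counts, predicted_counts)
```

The row sums give the number of actual negatives and positives in the test set, and the column sums give how many of each the model predicted.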

## **Interpreting the results:**


### **True Negatives (81):**  
✅ 81 people without heart disease were correctly identified as healthy by the model. In practice, these patients would be spared unnecessary worry and follow-up testing.

### **False Positives (24):**  
⚠️ 24 healthy people were wrongly flagged as having heart disease (Type I Error). While concerning, this leads to additional testing rather than missed treatment.

### **False Negatives (7):**  
❌ Most critical: 7 patients with heart disease were wrongly classified as healthy (Type II Error). Missed diagnoses like these delay life-saving treatment.

### **True Positives (93):**  
🎯 93 patients with heart disease were correctly detected, so they could be referred for medical intervention and treatment.

In [61]:
accuracy

0.848780487804878

In [62]:
precision

0.7948717948717948

In [63]:
recall

0.93

In [64]:
confusion_matrix

array([[81, 24],
       [ 7, 93]])