### **Metrics on Classification Dataset (HeartDisease)**

* Metrics provide a benchmark for judging whether a model performs well.
* Accuracy, recall, precision, and F1 score are essential metrics used to evaluate the performance of classification models.
* Precision, recall, and F1 score are especially informative when the dataset is imbalanced, where accuracy alone can be misleading.

* **Accuracy** is the proportion of correct predictions out of all predictions.

* Formula : **Accuracy = (True Positives (TP) + True Negatives (TN)) / Total samples**

Total samples = TP + TN + FP + FN


* **Precision** measures the proportion of predicted positives that are actually positive. In other words, out of all instances the model predicts as positive, how many are actually positive?

* Formula : **Precision = True Positives (TP) / (True Positives (TP) + False Positives (FP))**

* **Recall** measures the proportion of actual positives that the model correctly identifies. In other words, out of all instances that are actually positive, how many does the model predict as positive?

* Formula : **Recall = True Positives (TP) / (True Positives (TP) + False Negatives (FN))**

* **F1-score** is the harmonic mean of precision and recall; it balances the two.
* It is used when you need a single metric to evaluate your model's performance.
* Formula : **F1-Score = 2 * Precision * Recall / (Precision + Recall)**
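As a worked check of the four formulas, the sketch below plugs in the counts from the confusion matrix printed in Step 6 of this notebook (TN = 68, FP = 9, FN = 20, TP = 87). The helper name `metrics_from_counts` is illustrative and not part of scikit-learn, and it assumes none of the denominators is zero.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Counts taken from the confusion matrix shown in Step 6 of this notebook
accuracy, precision, recall, f1 = metrics_from_counts(tp=87, tn=68, fp=9, fn=20)
print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, "
      f"Recall: {recall:.2f}, F1: {f1:.2f}")
```

Rounded to two decimals, these values agree with the scikit-learn outputs reported in Step 6 (0.84, 0.91, 0.81, 0.86).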

**Step 1 : Load Necessary Libraries**

In [1]:
import pandas as pd 
import numpy as np 
from sklearn.impute import SimpleImputer 
from scipy.stats import f_oneway
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

**Step 2 : Load the Dataset**
* Load the data from a user-provided path rather than a fixed location. The path below is machine-specific, so replace it with the location of your copy of `heart.csv`.

In [2]:
df = pd.read_csv("E:\\Machine Learning\\Datasets\\heart.csv")

In [3]:
df.head()

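|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|-----|-----|---------------|-----------|-------------|-----------|------------|-------|----------------|---------|----------|--------------|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |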


In [4]:
df.shape

(918, 12)
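One way to keep the loading step from depending on a single machine's folder layout is sketched below. The helper `load_heart_data` and the environment variable `HEART_CSV` are hypothetical names introduced only for this illustration; the fallback to `heart.csv` in the working directory is likewise an assumption.

```python
import os
from pathlib import Path

import pandas as pd


def load_heart_data(path=None):
    """Load the heart-disease CSV from an explicit path, an env var, or the cwd."""
    # HEART_CSV is a hypothetical environment variable used only for this sketch
    csv_path = Path(path or os.environ.get("HEART_CSV", "heart.csv"))
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found: {csv_path}")
    return pd.read_csv(csv_path)
```

With this helper, the cell above could become `df = load_heart_data("E:\\Machine Learning\\Datasets\\heart.csv")`, and a missing file would fail with a clear message instead of a bare pandas error.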

**Step 3 : Clean and preprocess the dataset.**

In [5]:
print("Missing values per column:")
print(df.isnull().sum())

Missing values per column:
Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64


In [6]:
df_num = df.select_dtypes(include = [np.number])
print(df_num.columns)

Index(['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak',
       'HeartDisease'],
      dtype='object')


In [7]:
df_cat = df.select_dtypes(include = ['object'])
print(df_cat.columns)

Index(['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope'], dtype='object')


**Step 4 : Encoding the categorical columns**

In [8]:
encoder = LabelEncoder()
categorical_columns = df_cat.columns

In [10]:
# Encode each categorical column as integer codes.
# The encoder is refit per column, so only the last column's mapping is kept.
for column in categorical_columns:
    df[column] = encoder.fit_transform(df[column])

print(df.head())

   Age  Sex  ChestPainType  RestingBP  Cholesterol  FastingBS  RestingECG  \
0   40    1              1        140          289          0           1   
1   49    0              2        160          180          0           1   
2   37    1              1        130          283          0           2   
3   48    0              0        138          214          0           1   
4   54    1              2        150          195          0           1   

   MaxHR  ExerciseAngina  Oldpeak  ST_Slope  HeartDisease  
0    172               0      0.0         2             0  
1    156               0      1.0         1             1  
2     98               0      0.0         2             0  
3    108               1      1.5         1             1  
4    122               0      0.0         2             0  
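To make the integer codes in the output above easier to read, the sketch below encodes the five `ChestPainType` values shown in `df.head()` and recovers the label-to-code mapping from the encoder's `classes_` attribute. It uses only those five values, so it is an illustration and not a re-run of the notebook cell.

```python
from sklearn.preprocessing import LabelEncoder

# Category values as they appear in the ChestPainType column of df.head()
chest_pain = ["ATA", "NAP", "ATA", "ASY", "NAP"]

encoder = LabelEncoder()
codes = encoder.fit_transform(chest_pain)

# classes_ holds the sorted labels; each label's position is its integer code
mapping = {str(label): int(code)
           for code, label in enumerate(encoder.classes_)}
print(codes.tolist())  # [1, 2, 1, 0, 2]
print(mapping)         # {'ASY': 0, 'ATA': 1, 'NAP': 2}
```

The codes follow alphabetical order of the labels and carry no clinical meaning, so a linear model will still treat them as ordered numbers. One-hot encoding is a common alternative when that ordering is not wanted.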


**Step 5 : Standard-Scaling the Numerical Columns**

In [11]:
numerical_columns = ['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak']
scaler = StandardScaler()

df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

print(df.head())

        Age  Sex  ChestPainType  RestingBP  Cholesterol  FastingBS  \
0 -1.433140    1              1   0.410909     0.825070  -0.551341   
1 -0.478484    0              2   1.491752    -0.171961  -0.551341   
2 -1.751359    1              1  -0.129513     0.770188  -0.551341   
3 -0.584556    0              0   0.302825     0.139040  -0.551341   
4  0.051881    1              2   0.951331    -0.034755  -0.551341   

   RestingECG     MaxHR  ExerciseAngina   Oldpeak  ST_Slope  HeartDisease  
0           1  1.382928               0 -0.832432         2             0  
1           1  0.754157               0  0.105664         1             1  
2           2 -1.525138               0 -0.832432         2             0  
3           1 -1.132156               1  0.574711         1             1  
4           1 -0.581981               0 -0.832432         2             0  
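As a small check on what `StandardScaler` computes, the sketch below compares it with a manual z-score, (x - mean) / std, on the five `Age` values shown in `df.head()`. Because the statistics here come from five rows and not the full 918, the scaled numbers differ from the output above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five Age values from df.head(); the real column has 918 rows
ages = np.array([[40.0], [49.0], [37.0], [48.0], [54.0]])

scaled = StandardScaler().fit_transform(ages)

# Manual z-score; np.std defaults to the population standard deviation (ddof=0)
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

print(np.allclose(scaled, manual))  # True
```

In this notebook the scaler is fit on the full dataset before the train/test split. Fitting it on the training split only, then applying `transform` to the test split, is a common way to keep test-set statistics out of preprocessing.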


**Step 6 : Applying Logistic Regression and Metrics to the Data**

In [12]:
x = df.drop(columns=['HeartDisease'])
y = df['HeartDisease']

X_train,X_test,y_train,y_test = train_test_split(x,y, test_size = 0.2, random_state = 42)


In [13]:
model = LogisticRegression()
model.fit(X_train,y_train)

In [14]:
y_pred = model.predict(X_test)

In [15]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.84
Classification Report:
              precision    recall  f1-score   support

           0       0.77      0.88      0.82        77
           1       0.91      0.81      0.86       107

    accuracy                           0.84       184
   macro avg       0.84      0.85      0.84       184
weighted avg       0.85      0.84      0.84       184

Confusion Matrix:
[[68  9]
 [20 87]]
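In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes, so the matrix above reads as TN = 68, FP = 9, FN = 20, TP = 87. The sketch below shows how the four counts can be unpacked with `ravel()`; the labels are toy values chosen for illustration, not data from the heart dataset.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (not taken from the heart dataset)
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 0]

# For binary labels {0, 1}, ravel() flattens the 2x2 matrix to TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 2 3
```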


In [16]:
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

Precision: 0.91
Recall: 0.81
F1 Score: 0.86


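**Step 7 : Applying Regularization to the Model**
* `LogisticRegression` already applies an L2 penalty with `C=1.0` by default, so the L2 model below uses the same penalty as the Step 6 baseline; only the L1 model changes the form of regularization.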

In [17]:
logreg_l1 = LogisticRegression(penalty='l1', solver='saga', C=1.0, max_iter=1000, random_state=42)
logreg_l1.fit(X_train, y_train)

In [18]:
logreg_l2 = LogisticRegression(penalty='l2', solver='lbfgs', C=1.0, max_iter=1000, random_state=42)
logreg_l2.fit(X_train, y_train)

In [19]:
y_pred_l1 = logreg_l1.predict(X_test)
y_pred_l2 = logreg_l2.predict(X_test)

In [20]:
accuracy_score(y_test,y_pred_l1)

0.8532608695652174

In [21]:
accuracy_score(y_test,y_pred_l2)

0.842391304347826
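Accuracy alone does not show how the two penalties differ. A minimal sketch of the usual distinction follows: L1 tends to drive some coefficients exactly to zero, while L2 only shrinks them. The data is synthetic (`make_classification`), and the `liblinear` solver and `C=0.05` are choices made for this illustration, not the settings used in the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only: 20 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=3, n_redundant=0,
    random_state=0,
)

# Small C means strong regularization; liblinear supports both penalties
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.05)
l1_model.fit(X_demo, y_demo)
l2_model.fit(X_demo, y_demo)

# L1 tends to set some coefficients exactly to zero; L2 only shrinks them
n_zero_l1 = int((l1_model.coef_ == 0).sum())
n_zero_l2 = int((l2_model.coef_ == 0).sum())
print(f"Zero coefficients with L1: {n_zero_l1}, with L2: {n_zero_l2}")
```

On the notebook's own models, the same comparison could be made by inspecting `logreg_l1.coef_` and `logreg_l2.coef_`.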

**Step 8 : Train the Decision Tree Classifier**

In [19]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier( random_state=42)
clf.fit(X_train, y_train)


In [20]:
# Note: this is accuracy on the training data, not on the held-out test set
train_accuracy = clf.score(X_train, y_train)

train_accuracy

1.0
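The score of 1.0 above is measured on the training data, where an unconstrained decision tree can memorize every sample, so it says little about generalization. The sketch below illustrates the gap between training and test accuracy on synthetic data; the dataset and variable names are assumptions made for this example, not results from the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only (not the heart dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# An unconstrained tree keeps splitting until it fits the training data
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print(f"Train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

On the notebook's own data, `clf.score(X_test, y_test)` would give the held-out figure that is comparable to the logistic regression results in Step 6.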