
In [1]:
import pandas as pd

### 1. Accuracy Score

*   **What it means:** The ratio of correctly predicted observations to the total number of observations. It's a common metric for classification problems.
*   **Applications:** General performance evaluation for balanced datasets.
*   **Interpretation:** Higher accuracy means more correct predictions. However, it can be misleading on imbalanced datasets.
*   **Formula:** `(TP + TN) / (TP + TN + FP + FN)`, where TP, TN, FP, and FN are the true/false positive/negative counts defined in the Confusion Matrix section.
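
To see why accuracy can mislead on imbalanced data, here is a minimal sketch. The labels are made up (95 negatives, 5 positives, not taken from the heart dataset), and the "model" is a hypothetical one that always predicts the negative class.

```python
# Hypothetical, hand-made labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * 100

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Share of the actual positives that the model found.
positives_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
positive_hit_rate = positives_found / sum(y_true)

print(accuracy)           # 0.95
print(positive_hit_rate)  # 0.0
```

The score looks strong at 0.95 even though not a single positive case was detected.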

### 2. Confusion Matrix

*   **What it means:** A table used to describe the performance of a classification model on a set of test data for which the true values are known. It has four basic components:
    *   **True Positives (TP):** Correctly predicted positive cases.
    *   **True Negatives (TN):** Correctly predicted negative cases.
    *   **False Positives (FP):** Negative cases incorrectly predicted as positive (Type I error, also known as a "false alarm").
    *   **False Negatives (FN):** Positive cases incorrectly predicted as negative (Type II error, also known as a "miss").
*   **Applications:** Detailed analysis of classifier performance, especially for imbalanced datasets.
*   **Interpretation:** Helps to understand where the model is making errors. For example, a high FN count means the model is missing many positive cases.
*   **Example:** In heart disease prediction, TPs are patients correctly identified as having heart disease, while FNs are patients who have heart disease but were predicted to be healthy.
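
The four components can be counted directly from paired labels. The sketch below uses a hypothetical helper, `confusion_counts`, and made-up labels. It is meant to illustrate the definitions, not to replace a library routine.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for binary labels; `positive` marks the positive class."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative (false alarm)
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive (miss)
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # 3 3 1 1
```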

### 3. Precision

*   **What it means:** The ratio of correctly predicted positive observations to the total predicted positive observations. It answers: "Of all instances predicted as positive, how many are actually positive?"
*   **Applications:** Scenarios where the cost of False Positives is high (e.g., spam detection, medical diagnosis where a false positive leads to unnecessary treatment).
*   **Interpretation:** High precision means fewer false positives.
*   **Formula:** `TP / (TP + FP)`

### 4. Recall (Sensitivity)

*   **What it means:** The ratio of correctly predicted positive observations to all observations that are actually positive. It answers: "Of all actual positive instances, how many did we correctly predict as positive?"
*   **Applications:** Scenarios where the cost of False Negatives is high (e.g., fraud detection, disease screening where missing a positive case is critical).
*   **Interpretation:** High recall means fewer false negatives.
*   **Formula:** `TP / (TP + FN)`

### 5. F1-Score

*   **What it means:** The harmonic mean of Precision and Recall. It combines the two into a single number and is especially useful when the class distribution is uneven.
*   **Applications:** General performance metric, especially on imbalanced datasets, when you need a balance between FP and FN.
*   **Interpretation:** A high F1-score indicates that the model has good values for both precision and recall. An F1-score of 1 means perfect precision and recall, and the score falls to 0 when either of them is 0.
*   **Formula:** `2 * (Precision * Recall) / (Precision + Recall)`
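
The three formulas above can be checked by hand. In this sketch the counts TP = 29, FP = 6, and FN = 0 are read from the logistic regression confusion matrix printed later in this notebook. The arithmetic mean is included only for comparison and is not part of the F1 definition.

```python
# Counts read from the logistic regression confusion matrix in this notebook.
tp, fp, fn = 29, 6, 0

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# For comparison only: the plain average of the two scores.
arithmetic_mean = (precision + recall) / 2

print(round(precision, 3))        # 0.829
print(round(recall, 3))           # 1.0
print(round(f1, 3))               # 0.906
print(round(arithmetic_mean, 3))  # 0.914
```

Because it is a harmonic mean, F1 sits below the plain average and is pulled toward the weaker of the two scores.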


In [2]:
df = pd.read_csv('heart.csv')

In [3]:
df.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       1


In [4]:
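from sklearn.model_selection import train_test_split
# Features: every column except the last. Label: the final `target` column. Hold out 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :-1], df.iloc[:, -1], test_size=0.2, random_state=2)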

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

In [6]:
clf1 = LogisticRegression()
clf2 = DecisionTreeClassifier()

In [7]:
clf1.fit(X_train,y_train)
clf2.fit(X_train,y_train)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
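
The warning above reports that the solver hit its iteration limit, and it suggests either raising `max_iter` or scaling the features. Below is a minimal sketch that combines both remedies. It assumes synthetic stand-in data from `make_classification` rather than `heart.csv`. The names `X_demo`, `y_demo`, and `scaled_clf` are illustrative, and `max_iter=1000` is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 13 numeric features.
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)

# Standardize the features, then fit logistic regression with a higher iteration cap.
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_clf.fit(X_demo, y_demo)

# Iterations the solver actually used; a value below max_iter means it converged.
n_iter = scaled_clf.named_steps["logisticregression"].n_iter_[0]
print(n_iter)
```

Applying the same pipeline to `X_train` and `y_train` may change the scores reported in this notebook, since those come from the unscaled model that triggered the warning.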


In [8]:
y_pred1 = clf1.predict(X_test)
y_pred2 = clf2.predict(X_test)

In [9]:
from sklearn.metrics import accuracy_score,confusion_matrix
print("Accuracy of Logistic Regression",accuracy_score(y_test,y_pred1))
print("Accuracy of Decision Trees",accuracy_score(y_test,y_pred2))

Accuracy of Logistic Regression 0.9016393442622951
Accuracy of Decision Trees 0.8524590163934426


In [10]:
confusion_matrix(y_test,y_pred1)

array([[26,  6],
       [ 0, 29]])
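
scikit-learn arranges this matrix with actual classes as rows and predicted classes as columns, so for binary labels it reads `[[TN, FP], [FN, TP]]`. A small sketch, with the array above copied by hand into a nested list, unpacks the four counts and recovers the accuracy:

```python
# The array printed above, copied by hand as a nested list.
cm = [[26, 6],
      [0, 29]]

# Rows are actual classes, columns are predicted classes.
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

print(tn, fp, fn, tp)      # 26 6 0 29
print(total)               # 61
print(round(accuracy, 4))  # 0.9016
```

On the NumPy array itself, `confusion_matrix(y_test, y_pred1).ravel()` yields the same four values in the order `tn, fp, fn, tp`.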

In [11]:
print("Logistic Regression Confusion Matrix\n")
pd.DataFrame(confusion_matrix(y_test,y_pred1),columns=list(range(0,2)))

Logistic Regression Confusion Matrix



    0   1
0  26   6
1   0  29


In [12]:
print("Decision Tree Confusion Matrix\n")
pd.DataFrame(confusion_matrix(y_test,y_pred2),columns=list(range(0,2)))

Decision Tree Confusion Matrix



    0   1
0  25   7
1   2  27


In [13]:
result = pd.DataFrame()
result['Actual Label'] = y_test
result['Logistic Regression Prediction'] = y_pred1
result['Decision Tree Prediction'] = y_pred2

In [14]:
result.sample(10)

     Actual Label  Logistic Regression Prediction  Decision Tree Prediction
259             0                               1                         1
157             1                               1                         1
292             0                               0                         0
99              1                               1                         1
66              1                               1                         1
184             0                               0                         0
126             1                               1                         1
74              1                               1                         1
65              1                               1                         1
11              1                               1                         1


In [15]:
from sklearn.metrics import recall_score,precision_score,f1_score

In [16]:
print("For Logistic regression Model")
print("-"*50)
cdf = pd.DataFrame(confusion_matrix(y_test,y_pred1),columns=list(range(0,2)))
print(cdf)
print("-"*50)
print("Precision - ",precision_score(y_test,y_pred1))
print("Recall - ",recall_score(y_test,y_pred1))
print("F1 score - ",f1_score(y_test,y_pred1))

For Logistic regression Model
--------------------------------------------------
    0   1
0  26   6
1   0  29
--------------------------------------------------
Precision -  0.8285714285714286
Recall -  1.0
F1 score -  0.90625


In [17]:
print("For DT Model")
print("-"*50)
cdf = pd.DataFrame(confusion_matrix(y_test,y_pred2),columns=list(range(0,2)))
print(cdf)
print("-"*50)
print("Precision - ",precision_score(y_test,y_pred2))
print("Recall - ",recall_score(y_test,y_pred2))
print("F1 score - ",f1_score(y_test,y_pred2))

For DT Model
--------------------------------------------------
    0   1
0  25   7
1   2  27
--------------------------------------------------
Precision -  0.7941176470588235
Recall -  0.9310344827586207
F1 score -  0.8571428571428571


In [18]:
precision_score(y_test,y_pred1,average=None)

array([1.        , 0.82857143])

In [19]:
precision_score(y_test,y_pred2,average=None)

array([0.92592593, 0.79411765])

In [20]:
recall_score(y_test,y_pred2,average=None)

array([0.78125   , 0.93103448])
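
With `average=None`, the scores are reported once per class, treating each class in turn as the positive one. The sketch below reproduces the decision tree values from its confusion matrix, copied by hand from the output earlier in this notebook. Per-class precision divides the diagonal entry by its column sum, and per-class recall divides it by its row sum.

```python
# Decision tree confusion matrix from this notebook (rows: actual, columns: predicted).
cm = [[25, 7],
      [2, 27]]

n_classes = len(cm)
precision_per_class = []
recall_per_class = []
for k in range(n_classes):
    predicted_as_k = sum(cm[i][k] for i in range(n_classes))  # column sum
    actually_k = sum(cm[k])                                   # row sum
    precision_per_class.append(cm[k][k] / predicted_as_k)
    recall_per_class.append(cm[k][k] / actually_k)

print([round(p, 3) for p in precision_per_class])  # [0.926, 0.794]
print([round(r, 3) for r in recall_per_class])     # [0.781, 0.931]
```

The recall for class 0 is the share of actual negatives that were correctly identified, which is also called specificity.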