# 🩺 Liver Disease Dataset

### 📘 Overview
Binary classification dataset for predicting **liver disease (1)** or **no disease (2)** from medical test results.

### 🧩 Features
- **Age:** Age of patient  
- **Gender:** Male/Female  
- **Total_Bilirubin, Direct_Bilirubin:** Bilirubin levels  
- **Alkaline_Phosphotase, Alamine_Aminotransferase (ALT), Aspartate_Aminotransferase (AST):** Liver enzymes  
- **Total_Protiens, Albumin:** Protein levels  
- **Albumin_and_Globulin_Ratio:** Protein ratio  
- **Dataset (Target):** 1 = Liver disease, 2 = No disease  

### 🧾 Sample
| Age | Gender | Total_Bilirubin | Direct_Bilirubin | Alk_Phos | ALT | AST | Total_Protiens | Albumin | A/G_Ratio | Dataset |
|-----|---------|-----------------|------------------|-----------|-----|-----|----------------|----------|------------|----------|
| 65 | Female | 0.7 | 0.1 | 187 | 16 | 18 | 6.8 | 3.3 | 0.90 | 1 |
| 62 | Male | 10.9 | 5.5 | 699 | 64 | 100 | 7.5 | 3.2 | 0.74 | 1 |
| 62 | Male | 7.3 | 4.1 | 490 | 60 | 68 | 7.0 | 3.3 | 0.89 | 1 |

### 🎯 Goal
Predict whether a patient has **liver disease (1)** or **not (2)** from the medical attributes listed above.





# 1. Load Dataset

In [2]:
import pandas as pd 
df = pd.read_csv("indian_liver_patient.csv")
df.head()

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Dataset
0,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.9,1
1,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1
2,62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89,1
3,58,Male,1.0,0.4,182,14,20,6.8,3.4,1.0,1
4,72,Male,3.9,2.0,195,27,59,7.3,2.4,0.4,1


# 2. Data Preprocessing

### Exploring Dataset (Missing Values)
### Encoding
### Scaling
### Balancing Classes
### Feature Selection
### Train Test Split

# Exploring Dataset

In [3]:
df.isnull().sum()

Age                           0
Gender                        0
Total_Bilirubin               0
Direct_Bilirubin              0
Alkaline_Phosphotase          0
Alamine_Aminotransferase      0
Aspartate_Aminotransferase    0
Total_Protiens                0
Albumin                       0
Albumin_and_Globulin_Ratio    4
Dataset                       0
dtype: int64

In [4]:
df.dropna(inplace=True)

In [5]:
df.shape

(579, 11)

In [6]:
df.duplicated().sum()

13

In [7]:
df.drop_duplicates(inplace=True)

# Encoding


- **Gender** is a categorical feature with values `Male` and `Female`.  
- Convert it into numeric form using **Label Encoding** or **One-Hot Encoding**.

### Example:
| Gender | Encoded_Value |
|---------|----------------|
| Male    | 1 |
| Female  | 0 |

🧠 Encoding helps ML models process non-numeric (categorical) data effectively.
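
The bullets above mention One-Hot Encoding as an alternative, so here is a minimal sketch of it using `pandas.get_dummies`. The `toy` frame and its values are illustrative assumptions, not rows taken from the notebook's `df`.

```python
import pandas as pd

# Small illustrative frame with the same Gender categories as the dataset
toy = pd.DataFrame({"Gender": ["Female", "Male", "Male"], "Age": [65, 62, 58]})

# drop_first=True keeps one indicator column (Gender_Male);
# the dropped category (Female) is implied when Gender_Male is 0
encoded = pd.get_dummies(toy, columns=["Gender"], drop_first=True, dtype=int)
print(encoded)
```

For a two-category feature such as Gender, this yields the same 0/1 column as label encoding; one-hot encoding mainly pays off for features with three or more unordered categories.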


In [8]:
# using sklearn
from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns codes in sorted order: Female -> 0, Male -> 1
encoder = LabelEncoder()
df['Gender'] = encoder.fit_transform(df['Gender'])
df.head(2)

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Dataset
0,65,0,0.7,0.1,187,16,18,6.8,3.3,0.9,1
1,62,1,10.9,5.5,699,64,100,7.5,3.2,0.74,1


In [9]:
# manually
# df['Gender'] = df['Gender'].map({"Male":1, "Female":0})


## ⚙️ Feature Scaling

- Scaling puts all numerical features on a comparable range, which helps distance-based and gradient-based models (for example SVC and logistic regression) train well.  
- Common methods:
  - **StandardScaler:** Scales data to mean = 0 and std = 1  
  - **MinMaxScaler:** Scales values between 0 and 1

### Example:
| Feature | Before | After (MinMax) |
|----------|---------|----------------|
| Age | 65 | 0.78 |
| Total_Bilirubin | 10.9 | 0.85 |

📊 Apply scaling to all numeric columns (except target) before model training.
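
To make the two methods concrete, the sketch below applies both scalers to a small illustrative column and checks them against the formulas written out by hand. The `ages` values are made up for the example and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One illustrative feature column (scikit-learn expects a 2-D array)
ages = np.array([[20.0], [40.0], [60.0], [80.0]])

standardized = StandardScaler().fit_transform(ages)
min_max = MinMaxScaler().fit_transform(ages)

# The same transformations by hand
# StandardScaler: (x - mean) / std, using the population std (ddof=0)
manual_standardized = (ages - ages.mean()) / ages.std()
# MinMaxScaler: (x - min) / (max - min)
manual_min_max = (ages - ages.min()) / (ages.max() - ages.min())

print(standardized.ravel())
print(min_max.ravel())
```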


In [10]:
from sklearn.preprocessing import StandardScaler

# Numeric columns to scale; Gender (already 0/1) and the target are left out
features = ['Age', 'Total_Bilirubin', 'Direct_Bilirubin',
            'Alkaline_Phosphotase', 'Alamine_Aminotransferase',
            'Aspartate_Aminotransferase', 'Total_Protiens', 'Albumin',
            'Albumin_and_Globulin_Ratio']

# Standardize each column to mean 0 and standard deviation 1
scaler = StandardScaler()
df[features] = scaler.fit_transform(df[features])

df.head(3)

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Dataset
0,1.236928,0,-0.420124,-0.49519,-0.429625,-0.352659,-0.315148,0.280819,0.194225,-0.150315,1
1,1.052432,1,1.203777,1.406906,1.654054,-0.088755,-0.033926,0.925059,0.068445,-0.651328,1
2,1.052432,1,0.630636,0.91377,0.80349,-0.110747,-0.143671,0.464887,0.194225,-0.181628,1


# Data Balancing

#### ⚖️ Class Balance

- Check whether the target classes (`1` = Liver Disease, `2` = No Disease) are balanced.  
- Imbalanced data can bias the model toward the majority class.  
- Use techniques like:
  - **SMOTE / Oversampling** → add minority samples  
  - **Undersampling** → reduce majority samples  

### Example:
| Class | Count |
|--------|--------|
| 1 (Liver disease) | 404 |
| 2 (No disease) | 162 |

📈 Aim for a balanced distribution before training.
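
As a small illustration of the undersampling option listed above, the sketch below counts the classes in a toy frame and then randomly keeps as many majority rows as there are minority rows. The `toy` data and the column name `feature` are assumptions made for the example; only the `Dataset` target name comes from the notebook.

```python
import pandas as pd

# Illustrative imbalanced data: 7 rows of class 1, 3 rows of class 2
toy = pd.DataFrame({
    "feature": range(10),
    "Dataset": [1] * 7 + [2] * 3,
})

# Step 1: inspect the class distribution
counts = toy["Dataset"].value_counts()
print(counts)

# Step 2: random undersampling, keeping as many rows per class
# as the smallest class has
minority_size = int(counts.min())
balanced = toy.groupby("Dataset", group_keys=False).sample(
    n=minority_size, random_state=42
)
print(balanced["Dataset"].value_counts())
```

Undersampling discards data, which can hurt on a dataset this small; that is one reason oversampling with SMOTE may be the more attractive choice here.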


In [11]:
!pip install -q imbalanced-learn




In [12]:
from imblearn.over_sampling import SMOTE

X = df.drop('Dataset',axis=1)
y = df['Dataset']

smote = SMOTE(random_state=42)

X_resampled, y_resampled = smote.fit_resample(X, y)

y_resampled.value_counts()

Dataset
1    404
2    404
Name: count, dtype: int64


## ✂️ Train-Test Split

- Split the dataset to evaluate model performance on unseen data.  
- Common ratio: **80% training**, **20% testing**.  
- Use `train_test_split()` from `sklearn.model_selection`.
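
One common variation, sketched here on synthetic data, is to split first and fit the scaler on the training portion only, so that no statistics from the test rows influence preprocessing. The same ordering applies to oversampling: resampling only the training split keeps synthetic points from being derived from test rows. The arrays `X_demo` and `y_demo` are made-up stand-ins, and `stratify` is an optional argument that preserves the class ratio in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and the 1/2 target labels
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = np.array([1] * 70 + [2] * 30)

# Split first; stratify keeps the 70/30 class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training rows only, then reuse it for the test rows
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```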



In [13]:
from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

# 3. Train and Evaluate Models

### Logistic Regression
### SVC
### Decision Tree
### Random Forest

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

model = LogisticRegression()

model.fit(X_train,y_train)

y_pred = model.predict(X_test)

print("\n\n",classification_report(y_test, y_pred))

print("\n\n", confusion_matrix(y_test, y_pred))



               precision    recall  f1-score   support

           1       0.79      0.59      0.68        76
           2       0.70      0.86      0.77        86

    accuracy                           0.73       162
   macro avg       0.75      0.73      0.73       162
weighted avg       0.74      0.73      0.73       162



 [[45 31]
 [12 74]]
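
The figures in the classification report follow directly from the confusion matrix. The sketch below recomputes them for the logistic regression result, assuming scikit-learn's convention that rows are true classes and columns are predicted classes (here in the order 1, 2).

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
cm = [[45, 31],
      [12, 74]]

total = sum(sum(row) for row in cm)

# Class 1: 45 correct, 31 missed (predicted as 2), 12 false alarms
precision_1 = cm[0][0] / (cm[0][0] + cm[1][0])
recall_1 = cm[0][0] / (cm[0][0] + cm[0][1])

# Class 2: 74 correct, 12 missed (predicted as 1), 31 false alarms
precision_2 = cm[1][1] / (cm[1][1] + cm[0][1])
recall_2 = cm[1][1] / (cm[1][1] + cm[1][0])

accuracy = (cm[0][0] + cm[1][1]) / total

print(round(precision_1, 2), round(recall_1, 2))
print(round(precision_2, 2), round(recall_2, 2))
print(round(accuracy, 2))
```

The low recall for class 1 (liver disease) means 31 of the 76 diseased test cases were missed, which in a screening setting usually matters more than overall accuracy.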


### Random Forest

In [15]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

model_r = RandomForestClassifier()

model_r.fit(X_train,y_train)

y_pred = model_r.predict(X_test)

print("\n\n",classification_report(y_test, y_pred))

print("\n\n", confusion_matrix(y_test, y_pred))



               precision    recall  f1-score   support

           1       0.81      0.76      0.78        76
           2       0.80      0.84      0.82        86

    accuracy                           0.80       162
   macro avg       0.80      0.80      0.80       162
weighted avg       0.80      0.80      0.80       162



 [[58 18]
 [14 72]]


### SVC

In [16]:
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report

model = SVC()

model.fit(X_train,y_train)

y_pred = model.predict(X_test)

print("\n\n",classification_report(y_test, y_pred))

print("\n\n", confusion_matrix(y_test, y_pred))



               precision    recall  f1-score   support

           1       0.82      0.61      0.70        76
           2       0.72      0.88      0.79        86

    accuracy                           0.75       162
   macro avg       0.77      0.74      0.74       162
weighted avg       0.77      0.75      0.75       162



 [[46 30]
 [10 76]]


# 4. Inference Prediction System 

In [17]:
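# test 1: predict for the first row (features are already encoded and scaled)
# Selecting with [[0]] keeps a DataFrame, so column names match those seen during training
pred = model_r.predict(df.iloc[[0], :-1])

if pred[0] == 1:
    print("Liver Disease")
else:
    print("No Liver Disease")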

Liver Disease




In [18]:
# test 2: predict for the row at position 23 (features are already encoded and scaled)
pred = model_r.predict(df.iloc[[23], :-1])

if pred[0] == 1:
    print("Liver Disease")
else:
    print("No Liver Disease")

No Liver Disease
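
The two tests above pass rows that were already encoded and scaled. For new, raw patient values the same preprocessing has to run before prediction; one way to arrange that, sketched here on synthetic data, is to bundle the scaler and the classifier in a scikit-learn `Pipeline`. The data, the two-feature layout, and the helper `describe` are hypothetical and only illustrate the pattern.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic raw data: label 1 when the first feature exceeds 50, else 2
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(300, 2))
y_raw = np.where(X_raw[:, 0] > 50.0, 1, 2)

# The pipeline scales inputs and then classifies, at fit and predict time alike
pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=42))
pipe.fit(X_raw, y_raw)

def describe(label):
    """Map the numeric class label to the message used in this notebook."""
    return "Liver Disease" if label == 1 else "No Liver Disease"

# Raw, unscaled values for a hypothetical new patient
new_patient = np.array([[70.0, 45.0]])
print(describe(pipe.predict(new_patient)[0]))
```

Keeping preprocessing inside the pipeline means the fitted scaler travels with the model, so callers cannot forget or mismatch the scaling step.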


