# About Dataset
Link: https://www.kaggle.com/datasets/uciml/indian-liver-patient-records


The number of patients with liver disease has been rising steadily because of excessive alcohol consumption, inhalation of harmful gases, and intake of contaminated food, pickles, and drugs. This dataset was used to evaluate prediction algorithms in an effort to reduce the burden on doctors.

Content
This dataset contains 416 liver patient records and 167 non-liver patient records collected from the north-east of Andhra Pradesh, India. The "Dataset" column is the class label that divides the records into liver patients (liver disease) and non-patients (no disease). The dataset contains 441 male patient records and 142 female patient records.

Any patient whose age exceeded 89 is listed as being of age "90".

### Imports

In [20]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import tensorflow as tf

### Load the Dataset

In [2]:
df = pd.read_csv("indian_liver_patient.csv")
df.head()

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Dataset
0,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.9,1
1,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1
2,62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89,1
3,58,Male,1.0,0.4,182,14,20,6.8,3.4,1.0,1
4,72,Male,3.9,2.0,195,27,59,7.3,2.4,0.4,1


### Exploring the data

In [3]:
df.isnull().sum()

Unnamed: 0,0
Age,0
Gender,0
Total_Bilirubin,0
Direct_Bilirubin,0
Alkaline_Phosphotase,0
Alamine_Aminotransferase,0
Aspartate_Aminotransferase,0
Total_Protiens,0
Albumin,0
Albumin_and_Globulin_Ratio,4


In [4]:
df.duplicated().value_counts()

Unnamed: 0,count
False,570
True,13


In [5]:
print(df[df.duplicated()])

     Age  Gender  Total_Bilirubin  Direct_Bilirubin  Alkaline_Phosphotase  \
19    40  Female              0.9               0.3                   293   
26    34    Male              4.1               2.0                   289   
34    38  Female              2.6               1.2                   410   
55    42    Male              8.9               4.5                   272   
62    58    Male              1.0               0.5                   158   
106   36    Male              5.3               2.3                   145   
108   36    Male              0.8               0.2                   158   
138   18    Male              0.8               0.2                   282   
143   30    Male              1.6               0.4                   332   
158   72    Male              0.7               0.1                   196   
164   39    Male              1.9               0.9                   180   
174   31    Male              0.6               0.1                   175   
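
The notebook reports 13 duplicated rows but keeps them in the data. If one wanted to remove them before modelling, a minimal sketch on a small made-up frame (the `toy` values below are hypothetical, not taken from the dataset) could look like this:

```python
import pandas as pd

# Hypothetical toy frame: the third row is an exact copy of the second.
toy = pd.DataFrame({
    "Age": [40, 34, 34, 58],
    "Gender": ["Female", "Male", "Male", "Male"],
    "Total_Bilirubin": [0.9, 4.1, 4.1, 1.0],
})

n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)       # 1
print(len(deduplicated))  # 3
```

Whether rows with identical lab values are true duplicates or distinct patients is a judgment call that the dataset description does not settle.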

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Age                         583 non-null    int64  
 1   Gender                      583 non-null    object 
 2   Total_Bilirubin             583 non-null    float64
 3   Direct_Bilirubin            583 non-null    float64
 4   Alkaline_Phosphotase        583 non-null    int64  
 5   Alamine_Aminotransferase    583 non-null    int64  
 6   Aspartate_Aminotransferase  583 non-null    int64  
 7   Total_Protiens              583 non-null    float64
 8   Albumin                     583 non-null    float64
 9   Albumin_and_Globulin_Ratio  579 non-null    float64
 10  Dataset                     583 non-null    int64  
dtypes: float64(5), int64(5), object(1)
memory usage: 50.2+ KB


### Preprocessing

In [7]:
mean = df["Albumin_and_Globulin_Ratio"].mean()

In [8]:
df["Albumin_and_Globulin_Ratio"]= df["Albumin_and_Globulin_Ratio"].fillna(mean)

In [9]:
df.isnull().sum()

Unnamed: 0,0
Age,0
Gender,0
Total_Bilirubin,0
Direct_Bilirubin,0
Alkaline_Phosphotase,0
Alamine_Aminotransferase,0
Aspartate_Aminotransferase,0
Total_Protiens,0
Albumin,0
Albumin_and_Globulin_Ratio,0


### Encoding

In [10]:
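def binary_encoding(df, column, positive_value):
    """Return a copy of df with `column` set to 1 where it equals positive_value, else 0."""
    df = df.copy()  # leave the caller's DataFrame untouched
    df[column] = df[column].apply(lambda x: 1 if x == positive_value else 0)
    return df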
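
For reference, the same encoding can be written without a Python-level lambda. The sketch below is a hypothetical variant (the name `binary_encoding_vectorized` and the toy frame are not part of the notebook) that relies on pandas' element-wise `eq` comparison:

```python
import pandas as pd

def binary_encoding_vectorized(df, column, positive_value):
    """Hypothetical variant of binary_encoding using a vectorized comparison."""
    df = df.copy()
    # eq() yields a boolean Series; astype(int) maps True/False to 1/0.
    df[column] = df[column].eq(positive_value).astype(int)
    return df

toy = pd.DataFrame({"Gender": ["Female", "Male", "Male"]})
encoded = binary_encoding_vectorized(toy, "Gender", "Male")
print(encoded["Gender"].tolist())  # [0, 1, 1]
```

On a dataset of 583 rows the speed difference is negligible, so this is mostly a matter of style.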

In [11]:
df = binary_encoding(df, "Gender" ,"Male")

In [12]:
df["Dataset"].value_counts()

Dataset,count
1,416
2,167


In [13]:
# Original label 2 (no liver disease) becomes 1; label 1 (liver patient) becomes 0.
df = binary_encoding(df, "Dataset", 2)

In [14]:
df["Dataset"].value_counts()

Dataset,count
0,416
1,167


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Age                         583 non-null    int64  
 1   Gender                      583 non-null    int64  
 2   Total_Bilirubin             583 non-null    float64
 3   Direct_Bilirubin            583 non-null    float64
 4   Alkaline_Phosphotase        583 non-null    int64  
 5   Alamine_Aminotransferase    583 non-null    int64  
 6   Aspartate_Aminotransferase  583 non-null    int64  
 7   Total_Protiens              583 non-null    float64
 8   Albumin                     583 non-null    float64
 9   Albumin_and_Globulin_Ratio  583 non-null    float64
 10  Dataset                     583 non-null    int64  
dtypes: float64(5), int64(6)
memory usage: 50.2 KB


### Train Test Split

In [17]:
x = df.drop(columns=["Dataset"])
y = df["Dataset"]

In [18]:
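# Note: the scaler is fitted on all 583 rows before the split, so test rows influence the scaling statistics.
scaler = StandardScaler()
x = scaler.fit_transform(x)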

In [19]:
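# Note: test_size=0.8 holds out 467 of the 583 rows for testing and leaves only 116 for training;
# test_size=0.2 is the more usual choice. The outputs below come from the 0.8 setting.
x_train, x_test, y_train, y_test = train_test_split(
    x, y,
    test_size=0.8,
    random_state=42
)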
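
A common alternative ordering is to split first and fit the scaler on the training rows only, so that no test-set statistics leak into preprocessing. The sketch below is not what the notebook ran: it uses random stand-in arrays of the same shape, and the 80/20 split and `stratify` argument are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
features = rng.normal(size=(583, 10))  # stand-in for the 10 feature columns
labels = rng.integers(0, 2, size=583)  # stand-in for the encoded "Dataset" column

# Assumption: an 80/20 train/test split, stratified on the label.
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42
)

scaler = StandardScaler()
f_train_scaled = scaler.fit_transform(f_train)  # statistics come from training rows only
f_test_scaled = scaler.transform(f_test)        # test rows reuse those statistics

print(f_train_scaled.shape, f_test_scaled.shape)  # (466, 10) (117, 10)
```

With this ordering the test set stays untouched until evaluation, and the model sees 466 training rows instead of 116.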

In [21]:
x.shape

(583, 10)

In [41]:
inputs = tf.keras.Input(shape=(10,))

X = tf.keras.layers.Dense(64,activation="relu")(inputs)
X = tf.keras.layers.Dropout(0.5)(X) # Add dropout layer
X = tf.keras.layers.Dense(64,activation="relu")(X)
X = tf.keras.layers.Dropout(0.5)(X) # Add dropout layer
outputs = tf.keras.layers.Dense(1,activation="sigmoid")(X)

In [42]:
model = tf.keras.Model(inputs=inputs, outputs=outputs)

In [43]:
model.compile(
    optimizer="adam",
    loss= "binary_crossentropy",
    metrics = [
        "accuracy"
    ]
)

In [44]:
history = model.fit(
    x_train,
    y_train,
    validation_split=0.2,
    batch_size = 64,
    epochs = 25,
    verbose=0
)

### Result

In [45]:
import plotly.express as px
fig = px.line(
    history.history,
    y=['loss', 'val_loss'],
    labels={'index': "Epoch", 'value': "Loss"},
    title="Training and Validation Loss"
)

fig.show()

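The model is overfitting. With test_size=0.8, only 116 of the 583 rows are available for training (and validation_split=0.2 holds out part of those), which likely contributes.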
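
One way to back up a claim of overfitting is to find the epoch at which validation loss stops improving and compare the final validation loss with the final training loss. The sketch below uses made-up numbers shaped like `history.history`; they are hypothetical and are not the values from this run.

```python
# Hypothetical per-epoch losses, shaped like `history.history` from model.fit.
example_history = {
    "loss":     [0.70, 0.62, 0.55, 0.50, 0.46, 0.42],
    "val_loss": [0.69, 0.63, 0.60, 0.59, 0.61, 0.64],
}

val_loss = example_history["val_loss"]

# 0-based index of the epoch with the lowest validation loss.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

# Gap between validation and training loss at the last epoch.
final_gap = example_history["val_loss"][-1] - example_history["loss"][-1]

print(best_epoch)           # 3
print(round(final_gap, 2))  # 0.22
```

A training loss that keeps falling while validation loss rises after `best_epoch` is the usual signature of overfitting. Keras also provides an `EarlyStopping` callback that can stop training near that point.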


In [46]:
model.evaluate(x_test, y_test)

15/15 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7314 - loss: 0.5415


[0.5606002807617188, 0.7152034044265747]
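
To put the accuracy in context, it helps to compare it with the share of the majority class, which is the accuracy of always predicting class 0. The sketch below uses the class counts printed earlier for the full dataset; the class mix inside the test split may differ slightly, so treat the comparison as approximate.

```python
# Class counts reported by df["Dataset"].value_counts() after encoding.
class_counts = {0: 416, 1: 167}

total = sum(class_counts.values())
majority_share = max(class_counts.values()) / total  # accuracy of always predicting class 0

reported_test_accuracy = 0.7152034044265747  # second value returned by model.evaluate
improvement = reported_test_accuracy - majority_share

print(total)                     # 583
print(round(majority_share, 4))  # 0.7136
print(round(improvement, 4))     # 0.0017
```

Under that approximation the network is barely ahead of a constant prediction, which suggests looking at per-class metrics such as recall rather than accuracy alone.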

;(