
# Advanced Certification Programme in AI and MLOps
## A programme by IISc and TalentSprint
### Mini-Project Notebook: Structured Data Classification

## Problem Statement

Predict whether a patient has heart disease.

## Learning Objectives

At the end of the experiment, you will be able to:

* understand the Cleveland Clinic Foundation for Heart Disease dataset
* pre-process this dataset
* build a neural network architecture/model using the Keras Sequential or Functional API
* perform model training
* perform inference on unseen data
* build a Gradio interface for this application

## Introduction

This example demonstrates how to do structured data classification, starting from a raw
CSV file. Our data includes both numerical and categorical features. We will do preprocessing to normalize the numerical features and vectorize the categorical
ones.
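
To make "vectorize the categorical ones" concrete, here is a minimal plain-Python sketch of one-hot encoding, one common way to turn a categorical value into a vector. The helper `one_hot` and the category list are illustrative assumptions; the notebook itself keeps categorical features as integer codes.

```python
def one_hot(value, categories):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


# Hypothetical category list for the `thal` feature
thal_categories = ["normal", "fixed", "reversible"]
encoded = one_hot("fixed", thal_categories)
print(encoded)  # [0.0, 1.0, 0.0]
```

Each category gets its own position, so the model does not read an artificial ordering into the codes.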

### Dataset

[Our dataset](https://archive.ics.uci.edu/ml/datasets/heart+Disease) is provided by the
Cleveland Clinic Foundation for Heart Disease.
It's a CSV file with 303 rows. Each row contains information about a patient (a
**sample**), and each column describes an attribute of the patient (a **feature**). We
use the features to predict whether a patient has heart disease (**binary
classification**).

Here's the description of each feature:

Column | Description | Feature Type
------------|--------------------|----------------------
Age | Age in years | Numerical
Sex | Sex (1 = male; 0 = female) | Categorical
CP | Chest pain type (0, 1, 2, 3, 4) | Categorical
Trestbps | Resting blood pressure (in mm Hg on admission) | Numerical
Chol | Serum cholesterol in mg/dl | Numerical
FBS | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) | Categorical
RestECG | Resting electrocardiogram results (0, 1, 2) | Categorical
Thalach | Maximum heart rate achieved | Numerical
Exang | Exercise induced angina (1 = yes; 0 = no) | Categorical
Oldpeak | ST depression induced by exercise relative to rest | Numerical
Slope | Slope of the peak exercise ST segment | Numerical
CA | Number of major vessels (0-3) colored by fluoroscopy | Both numerical & categorical
Thal | 3 = normal; 6 = fixed defect; 7 = reversible defect | Categorical
Target | Diagnosis of heart disease (1 = true; 0 = false) | Target

In [None]:
#@title Download the data
!wget -qq https://cdn.iisc.talentsprint.com/AIandMLOps/Datasets/heart.csv
print("Data Downloaded Successfully!!")
!ls | grep '.csv'

Data Downloaded Successfully!!
heart.csv


## Grading = 10 Points

### Import Required Packages

In [None]:
import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers

## Load the data and pre-process it [3 Marks]

### Load data into a Pandas dataframe

Hint: `pd.read_csv`

In [None]:
file_url = "/content/heart.csv"
## YOUR CODE HERE

data = pd.read_csv(file_url)

Check the shape of the dataset:

In [None]:
## YOUR CODE HERE

data.shape

(303, 14)

Check the preview of a few samples:

Hint: `head()`

In [None]:
## YOUR CODE HERE

data.head(10)

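| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | fixed | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | normal | 1 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | reversible | 0 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | normal | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | normal | 0 |
| 5 | 56 | 1 | 2 | 120 | 236 | 0 | 0 | 178 | 0 | 0.8 | 1 | 0 | normal | 0 |
| 6 | 62 | 0 | 4 | 140 | 268 | 0 | 2 | 160 | 0 | 3.6 | 3 | 2 | normal | 1 |
| 7 | 57 | 0 | 4 | 120 | 354 | 0 | 0 | 163 | 1 | 0.6 | 1 | 0 | normal | 0 |
| 8 | 63 | 1 | 4 | 130 | 254 | 0 | 2 | 147 | 0 | 1.4 | 2 | 1 | reversible | 1 |
| 9 | 53 | 1 | 4 | 140 | 203 | 1 | 2 | 155 | 1 | 3.1 | 3 | 0 | reversible | 0 |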


Draw some inferences from the data. What does the target column indicate?

The last column, "target", indicates whether the patient has heart disease (1) or not
(0).

### Missing values

In [None]:
# Check whether any missing values are present
## YOUR CODE HERE

print(data.isna().sum() > 0)

age         False
sex         False
cp          False
trestbps    False
chol        False
fbs         False
restecg     False
thalach     False
exang       False
oldpeak     False
slope       False
ca          False
thal        False
target      False
dtype: bool


### Show the unique values present in each categorical column

- Remove the rows that have '1' or '2' as the value in the `thal` column

In [None]:
# Show all the columns in dataframe
## YOUR CODE HERE

data.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

In [None]:
# Print the number of unique values present in each categorical column

categorical_cols = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'ca', 'thal']

## YOUR CODE HERE
data[categorical_cols].nunique()

sex        2
cp         5
fbs        2
restecg    3
exang      2
ca         4
thal       5
dtype: int64


In [None]:
# Print the unique values present in each categorical column along with their counts

for cat_col in categorical_cols:
  print(data[cat_col].value_counts())

sex
1    205
0     98
Name: count, dtype: int64
cp
4    142
3     84
2     49
1     24
0      4
Name: count, dtype: int64
fbs
0    258
1     45
Name: count, dtype: int64
restecg
0    149
2    146
1      8
Name: count, dtype: int64
exang
0    204
1     99
Name: count, dtype: int64
ca
0    176
1     67
2     40
3     20
Name: count, dtype: int64
thal
normal        168
reversible    115
fixed          18
1               1
2               1
Name: count, dtype: int64


- Remove the rows that have '1' or '2' as the value in the `thal` column

In [None]:
# Find the indices of the rows that have '1' or '2' as the value in the `thal` column

idx = data.loc[(data['thal']=='1') | (data['thal']=='2')].index
idx

Index([247, 252], dtype='int64')

In [None]:
# Drop the above indexed rows

data.drop(index=idx,inplace=True)

In [None]:
# Recheck the unique values present in each categorical column

for cat_col in categorical_cols:
  print(data[cat_col].value_counts())

sex
1    204
0     97
Name: count, dtype: int64
cp
4    142
3     84
2     49
1     23
0      3
Name: count, dtype: int64
fbs
0    257
1     44
Name: count, dtype: int64
restecg
0    147
2    146
1      8
Name: count, dtype: int64
exang
0    202
1     99
Name: count, dtype: int64
ca
0    176
1     66
2     39
3     20
Name: count, dtype: int64
thal
normal        168
reversible    115
fixed          18
Name: count, dtype: int64


### Convert the categorical values present in `thal` column to numerical labels

Hint: Create a dictionary mapping

In [None]:
thal_dict={'normal':0 , 'fixed':1 , 'reversible':2}
data['thal']=data['thal'].map(thal_dict)
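
Values that are missing from the mapping dictionary become `NaN` when passed through pandas `Series.map`, which is why the stray `'1'` and `'2'` labels were dropped first. The check can be sketched in plain Python; `find_unmapped` is a hypothetical helper and the sample values are made up:

```python
thal_dict = {"normal": 0, "fixed": 1, "reversible": 2}


def find_unmapped(values, mapping):
    """Return the sorted distinct values that have no entry in `mapping`."""
    return sorted({v for v in values if v not in mapping})


# Made-up column contents, including the two stray labels seen earlier
raw_thal = ["normal", "fixed", "1", "reversible", "2", "normal"]
unmapped = find_unmapped(raw_thal, thal_dict)
print(unmapped)  # ['1', '2']

# Encode only the values that the mapping knows about
clean = [v for v in raw_thal if v in thal_dict]
encoded = [thal_dict[v] for v in clean]
print(encoded)  # [0, 1, 2, 0]
```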

### Split the dataset into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split

## YOUR CODE HERE (perform stratified sampling/splitting)
x_train, x_test, y_train, y_test = train_test_split(
    data.drop('target', axis=1),
    data['target'],
    test_size=0.2,
    stratify=data['target'],
    random_state=42,
)

In [None]:
x_train.shape,x_test.shape,y_train.shape,y_test.shape

((240, 13), (61, 13), (240,), (61,))
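
Stratification keeps the class ratio roughly equal in both splits. The plain-Python sketch below illustrates the idea on made-up labels. It is not how `train_test_split` is implemented, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with similar class ratios in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Made-up labels: 30 positives and 70 negatives
labels = [1] * 30 + [0] * 70
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))     # 80 20
print(sum(labels[i] for i in test_idx))  # 6
```

Both splits end up with 30% positives, matching the full label list.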

### Scale the numerical features

In [None]:
numerical_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope']

In [None]:
from sklearn.preprocessing import StandardScaler

scaler=StandardScaler()
x_train[numerical_cols]=scaler.fit_transform(x_train[numerical_cols])
x_test[numerical_cols]=scaler.transform(x_test[numerical_cols])

In [None]:
x_train.head()

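| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 101 | 1.362636 | 0 | 3 | -0.963595 | 5.868071 | 0 | 2 | 0.498048 | 0 | 0.429344 | 0.691529 | 0 | 2 |
| 55 | -0.282397 | 1 | 2 | -0.679558 | 1.414375 | 0 | 0 | 1.014755 | 0 | -0.744452 | -0.935599 | 0 | 0 |
| 107 | 0.594954 | 1 | 4 | 0.456589 | 0.818064 | 0 | 2 | 0.928637 | 0 | 0.093973 | 0.691529 | 2 | 2 |
| 183 | 0.156278 | 1 | 4 | -0.111484 | 0.631717 | 1 | 2 | -1.95631 | 1 | 0.429344 | 2.318657 | 0 | 2 |
| 17 | -0.06306 | 1 | 4 | 0.456589 | -0.188211 | 0 | 0 | 0.498048 | 0 | 0.093973 | -0.935599 | 0 | 0 |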
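
`StandardScaler` subtracts the mean and divides by the population standard deviation of each column. The scaler is fitted on the training rows only, and the same statistics are reused for the test rows. A plain-Python sketch of that computation, with made-up ages and illustrative helper names (`fit_scaler`, `transform`):

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Return the mean and population standard deviation of `values`."""
    return fmean(values), pstdev(values)


def transform(values, mean, std):
    """Standardize `values` with previously fitted statistics."""
    return [(v - mean) / std for v in values]


# Made-up ages, not taken from heart.csv
train_age = [40.0, 50.0, 60.0, 70.0]
test_age = [55.0, 80.0]

mean, std = fit_scaler(train_age)              # fitted on training data only
train_scaled = transform(train_age, mean, std)
test_scaled = transform(test_age, mean, std)   # reuses the training statistics

print(mean)            # 55.0
print(test_scaled[0])  # 0.0
```

Fitting on the training data alone keeps information about the test set out of the preprocessing step.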


## Building the model [3 Marks]

* Use tf.keras.layers.Input() for input layer
* Add dense layers
* Add dropout layers
* Add a classification layer at the end


In [None]:
from keras import Input

In [None]:
input_shape=x_train.shape[1]

model_seq = keras.Sequential(name="sequential_model")
model_seq.add(tf.keras.layers.Input(shape=(input_shape,)))   #specifying the input here
model_seq.add(keras.layers.Dense(64, activation=tf.nn.relu, name="first_layer"))
model_seq.add(keras.layers.Dropout(0.2))
model_seq.add(keras.layers.Dense(10, activation=tf.nn.relu, name="second_layer"))
model_seq.add(keras.layers.Dropout(0.2))
model_seq.add(keras.layers.Dense(1, activation=tf.nn.sigmoid, name="final_layer"))
model_seq.summary()
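
The parameter counts reported by `model_seq.summary()` can be checked by hand: a Dense layer with `n_inputs` inputs and `n_units` units has `n_inputs * n_units` weights plus `n_units` biases, and Dropout layers add none. A short sketch, assuming the 13 input features shown by `x_train.shape`; `dense_params` is an illustrative helper, not a Keras function:

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


layer_sizes = [13, 64, 10, 1]  # input features, then units per Dense layer
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts)       # [896, 650, 11]
print(sum(counts))  # 1557
```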

In [None]:
# Compile model with 'adam' optimizer, appropriate loss and metric

model_seq.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy','precision','recall'])
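
For reference, binary cross-entropy averages `-(y*log(p) + (1-y)*log(1-p))` over the samples, where `p` is the predicted probability of class 1. A minimal plain-Python sketch with made-up values; the `eps` clipping is an assumption added here to avoid `log(0)`:

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; `eps` clipping avoids log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Made-up labels and predicted probabilities
undecided = binary_crossentropy([1, 0], [0.5, 0.5])
confident = binary_crossentropy([1, 0], [0.9, 0.1])
print(round(undecided, 4))  # 0.6931
print(round(confident, 4))  # 0.1054
```

A classifier that always outputs 0.5 scores ln 2 ≈ 0.693, which is close to the loss logged in the first training epoch.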

In [None]:
# Perform training
epochs=50
batch_size=32
validation_split=0.2

model_seq.fit(x_train,y_train,epochs=epochs,batch_size=batch_size,validation_split=validation_split)

Epoch 1/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 2s 85ms/step - accuracy: 0.5531 - loss: 0.6976 - precision: 0.3144 - recall: 0.5446 - val_accuracy: 0.7708 - val_loss: 0.6359 - val_precision: 0.6000 - val_recall: 0.6429
Epoch 2/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.7305 - loss: 0.6093 - precision: 0.5407 - recall: 0.5459 - val_accuracy: 0.7500 - val_loss: 0.6069 - val_precision: 0.6250 - val_recall: 0.3571
Epoch 3/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.7281 - loss: 0.5862 - precision: 0.5204 - recall: 0.3424 - val_accuracy: 0.7292 - val_loss: 0.5792 - val_precision: 0.6000 - val_recall: 0.2143
Epoch 4/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.8039 - loss: 0.5311 - precision: 0.6473 - recall: 0.5204 - val_accuracy: 0.7292 - val_loss: 0.5517 - val_precision: 0.6000 - val_recall: 0.2143
Epoch 5/50
...

<keras.src.callbacks.history.History at 0x7bbe954e0610>
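
The `6/6` in the training log follows from the split sizes. With `validation_split=0.2`, Keras holds out the last 20% of the rows passed to `fit()` for validation, before shuffling, and trains on the rest. The arithmetic as a small sketch; `steps_per_epoch` is an illustrative helper:

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to cover all samples once."""
    return math.ceil(n_samples / batch_size)


n_rows = 240                               # rows in x_train
validation_split = 0.2
batch_size = 32

n_val = round(n_rows * validation_split)   # rows held out for validation
n_fit = n_rows - n_val                     # rows used for gradient updates
print(n_fit, n_val)                        # 192 48
print(steps_per_epoch(n_fit, batch_size))  # 6
```

The same helper gives `steps_per_epoch(61, 32) == 2` for the 61 test rows at the default batch size of 32, matching the `2/2` shown by `evaluate()`.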

In [None]:
# Performance on test set

model_seq.evaluate(x_test,y_test)

2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.8381 - loss: 0.3986 - precision: 0.7143 - recall: 0.6303


[0.4191562533378601,
 0.8196721076965332,
 0.7142857313156128,
 0.5882353186607361]
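
The returned list is `[loss, accuracy, precision, recall]`. The three metrics are linked through the confusion matrix. The notebook does not print one, so the counts below are an inference: they are integer counts over the 61 test rows that reproduce the reported values, not numbers read from the model.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Inferred counts (assumption), consistent with the evaluation output above
accuracy, precision, recall = classification_metrics(tp=10, fp=4, fn=7, tn=40)
print(round(accuracy, 4), round(precision, 4), round(recall, 4))
# 0.8197 0.7143 0.5882
```

Read this way, a recall of about 0.59 means that 7 of the 17 patients with heart disease in the test set would be missed at the 0.5 threshold.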

## Inference on new data [1 Mark]

To get a prediction for a new sample, you can simply call `model.predict()`.

In [None]:
# Inference on new data

sample = {
    "age": 60,
    "sex": 1,
    "cp": 1,
    "trestbps": 145,
    "chol": 233,
    "fbs": 1,
    "restecg": 2,
    "thalach": 150,
    "exang": 0,
    "oldpeak": 2.3,
    "slope": 3,
    "ca": 0,
    "thal": 0,
}
sample_df=pd.DataFrame(sample,index=[0])
sample_df[numerical_cols]=scaler.transform(sample_df[numerical_cols])

In [None]:
sample_df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,0.594954,1,1,0.740626,-0.300019,1,2,0.067459,0,1.016241,2.318657,0,0


In [None]:
out = model_seq.predict(sample_df)
# predict() returns an array of shape (1, 1); read out the single probability
prob = float(out[0][0])
if prob < 0.5:
    print("No")
else:
    print("Yes")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 52ms/step
No
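
`model.predict()` returns one row per sample, each holding a single sigmoid probability, so converting predictions to labels means comparing each probability with a threshold. A plain-Python sketch with a nested list standing in for that array; `to_labels` and the probabilities are illustrative assumptions:

```python
def to_labels(predictions, threshold=0.5):
    """Map rows of [probability] to 'Yes'/'No' using a decision threshold."""
    return ["Yes" if row[0] >= threshold else "No" for row in predictions]


# Made-up model outputs for three patients
predictions = [[0.12], [0.50], [0.93]]
print(to_labels(predictions))                 # ['No', 'Yes', 'Yes']
print(to_labels(predictions, threshold=0.6))  # ['No', 'No', 'Yes']
```

Lowering the threshold flags more patients as positive, which trades precision for recall.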


## Gradio Implementation [3 Marks]

Create a Gradio interface for this `Heart Disease Prediction` application. For the feature values given by the user as input, perform prediction using the trained model, and return the result to the user.

Make use of Gradio elements such as Textbox, Radio buttons, etc.

In [None]:
%%capture
!pip -q install gradio

In [None]:
import gradio as gr

In [None]:
# UI - Input components
## YOUR CODE HERE ...

def predict_heart_disease(age, sex, cp, trestbps, chol, fbs, restecg,
                          thalach, exang, oldpeak, slope, ca, thal):
    """Build a one-row dataframe from the form values and classify it."""
    sample = {
        "age": age,
        "sex": 1 if sex == "Male" else 0,  # same encoding as the dataset
        "cp": cp,
        "trestbps": trestbps,
        "chol": chol,
        "fbs": fbs,
        "restecg": restecg,
        "thalach": thalach,
        "exang": exang,
        "oldpeak": oldpeak,
        "slope": slope,
        "ca": ca,
        "thal": thal,
    }
    sample_df = pd.DataFrame(sample, index=[0])
    # Reuse the scaler fitted on the training set
    sample_df[numerical_cols] = scaler.transform(sample_df[numerical_cols])

    # UI - Output component
    ## YOUR CODE HERE ...
    # predict() returns an array of shape (1, 1) holding the sigmoid probability
    prob = float(model_seq.predict(sample_df)[0][0])
    if prob < 0.5:
        return "No Heart disease"
    return "The patient has Heart disease. Admit immediately!!"


# gr.Number passes numeric values to the function; "text" inputs arrive as strings
number_fields = ["cp", "trestbps", "chol", "fbs", "restecg", "thalach",
                 "exang", "oldpeak", "slope", "ca", "thal"]

demo = gr.Interface(
    fn=predict_heart_disease,
    inputs=[gr.Number(label="age"),
            gr.Radio(["Male", "Female"], value="Male", label="sex")]
           + [gr.Number(label=name) for name in number_fields],
    outputs=gr.Textbox(label="Prediction"),
)

# Example input: age=60, sex=Male, cp=1, trestbps=145, chol=233, fbs=1,
# restecg=2, thalach=150, exang=0, oldpeak=2.3, slope=3, ca=0, thal=0

demo.launch(share=True)




Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://35f0eff19df52b132e.gradio.live

In [None]:
# Label prediction function

## YOUR CODE HERE


In [None]:
# Create Gradio interface object and launch it with (share=True)

## YOUR CODE HERE
