# Disease Prediction Model

This notebook contains the original code for defining and training the prediction model.

In [5]:
#Getting all necessary imports

import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
import pickle
import sklearn
import joblib

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.utils import to_categorical

In [6]:
#Reading Training and Testing files

df_train = pd.read_csv("Training.csv")
df_test = pd.read_csv("Testing.csv")

In [7]:
#Identifying all possible diseases from Training file (and checking length)

#unique() keeps each prognosis once, in order of first appearance
train_labels = df_train["prognosis"].unique().tolist()

#print(len(train_labels))
#print(train_labels)

In [8]:
#Defining function that manually encodes every prognosis with a corresponding integer
#(for one hot encoding later)

def prognosis_encode(arr):
  #Mapping each prognosis to its position in train_labels
  label_to_int = {label: idx for idx, label in enumerate(train_labels)}

  #Values not found in train_labels are skipped
  encoded_column = [label_to_int[val] for val in arr if val in label_to_int]

  return encoded_column

For the data to be compatible with the neural network, any string data must be one-hot encoded. This represents each category as a binary vector with a single 1 in the position of that category. However, the TensorFlow function that performs this step (to_categorical) only accepts integer class labels.

As a result, the function defined above manually encodes each prognosis as an integer from 0 to 40, one for each of the 41 diseases.
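
The two encoding steps can be sketched with NumPy alone. This is a minimal illustration, not the notebook's code: the label names below are made up, and `np.eye` stands in for TensorFlow's `to_categorical`, which produces the same kind of matrix from integer class labels.

```python
import numpy as np

# Hypothetical labels, standing in for the 41 prognosis values
labels = ["Allergy", "Migraine", "Allergy", "Malaria"]

# Step 1: integer-encode each label by order of first appearance
classes = list(dict.fromkeys(labels))
label_to_int = {name: i for i, name in enumerate(classes)}
int_encoded = np.array([label_to_int[name] for name in labels])

# Step 2: one-hot encode. Row i of the identity matrix is the vector for class i
one_hot = np.eye(len(classes), dtype="float32")[int_encoded]

print(int_encoded)
print(one_hot)
```

Each row of `one_hot` contains a single 1, in the column given by the integer code for that row's label.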

In [9]:
#Integer encoding the prognosis column, then one hot encoding the result
int_encoded_col = prognosis_encode(df_train["prognosis"]) #Storing manually encoded column to new variable
y_train = to_categorical(int_encoded_col, num_classes=41)
#Using tensorflow's one hot encoding function on integer encoded column (more compatible with neural network)

In [10]:
#Dropping unnecessary columns from X_train and X_test

X_train = df_train.drop(["prognosis", "encoder"], axis='columns')
X_test = df_test.drop(["prognosis", "encoder"], axis='columns')

#Converting all input data to float32 (decimals)

X_train = X_train.astype('float32')
y_train = y_train.astype('float32')
X_test = X_test.astype('float32')

#Converting X_train and X_test to numpy objects

X_train = X_train.to_numpy()
X_test = X_test.to_numpy()

In [11]:
#Creating Neural Network (defining architecture)

model = Sequential() #Defining neural network type (sequential stack of layers, i.e. feedforward)
model.add(Dense(64, input_dim=132, activation='relu')) #First hidden layer (64 units, 132 input features)
model.add(Dense(32, activation="relu")) #Second hidden layer (32 units)
model.add(Dense(41, activation='softmax')) #Output layer (41 possible diseases)
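
The numbers passed to Dense are unit counts, and the number of trainable parameters is considerably larger. As a rough check, the sketch below computes the parameter count of each fully connected layer by hand, assuming the default Dense behaviour of one weight per input-unit pair plus one bias per unit. The helper name `dense_param_count` is made up for this illustration.

```python
# Layer sizes taken from the architecture above: 132 inputs -> 64 -> 32 -> 41
layer_sizes = [132, 64, 32, 41]

def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Pair each layer's input size with its unit count
per_layer = [dense_param_count(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [8512, 2080, 1353]
print(total)      # 11945
```

These figures should agree with what `model.summary()` reports for this architecture.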

In [12]:
#Compiling model (given previously defined architecture)

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [13]:
#Fitting model to input data

model.fit(X_train, y_train, epochs=5, batch_size=32)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x1d8a1120750>
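
Once trained, the softmax output layer returns one probability per disease, so a prediction has to be mapped back to a disease name. The sketch below shows that decoding step with NumPy. The label list and probability rows are made-up stand-ins; in this notebook they would presumably be `train_labels` and the output of `model.predict(X_test)`.

```python
import numpy as np

# Hypothetical stand-ins: a 3-class label list and two rows of softmax output
example_labels = ["Allergy", "Migraine", "Malaria"]
example_probs = np.array([[0.1, 0.7, 0.2],
                          [0.8, 0.1, 0.1]])

# The predicted class is the index with the highest probability in each row
predicted_idx = np.argmax(example_probs, axis=1)

# Look up the disease name for each predicted index
predicted_names = [example_labels[i] for i in predicted_idx]

print(predicted_names)  # ['Migraine', 'Allergy']
```

This only works if the label list is in the same order that was used to integer-encode the training labels.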

In [14]:
#Saving trained model to a file with joblib

filename = "trained_disease_model.pkl"
joblib.dump(model, filename)

['trained_disease_model.pkl']