# Introduction

In this project, we will create an artificial neural network that will be trained to predict diseases based on a series of symptoms. In total, there are 132 different symptoms and 42 possible diseases.

The dataset used is not owned by me; all credit for its organization and creation goes to the user kaushil268, who made it freely available on the [Kaggle](https://www.kaggle.com/) website.

You can find everything about the dataset, including the download link, [here](https://www.kaggle.com/datasets/kaushil268/disease-prediction-using-machine-learning).

# About the Dataset

The dataset contains two different files, "Testing" and "Training," both in '.csv' format. We will train our model using the "Training" file and then make predictions on the data in the "Testing" file to verify our model's effectiveness and accuracy.

The "Training" file has 4,920 rows, meaning we have 4,920 different instances for training, and the "Testing" file has 42 instances available for testing our model.

# Starting the project

### Importing the packages

In [137]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
import pandas as pd

### Creating the dataframes

In [138]:
# Creating both train and testing dataframes
train_df = pd.read_csv("Data/Training.csv")
testing_df = pd.read_csv("Data/Testing.csv")

# Changing the name of the "prognosis" column to: "target", in both dataframes
train_df.rename(columns={"prognosis": "target"}, inplace=True)
testing_df.rename(columns={"prognosis": "target"}, inplace=True)

# Dropping the last column of the train dataframe, an extra column introduced when pandas reads the CSV file
train_df = train_df.iloc[:, :-1]
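
The positional drop above removes whatever column happens to be last. As a hedged alternative, the sketch below assumes the extra column is an entirely empty one, for example caused by a trailing comma at the end of each CSV line (an assumption about the file, not something confirmed here), and removes it by content rather than by position. The miniature CSV and the names `demo_df` and `cleaned_df` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical miniature CSV whose lines end with a trailing comma,
# which makes pandas create an extra, empty "Unnamed: N" column.
csv_text = (
    "itching,skin_rash,prognosis,\n"
    "1,0,Fungal infection,\n"
    "0,1,Allergy,\n"
)
demo_df = pd.read_csv(io.StringIO(csv_text))

# Drop only the columns that are entirely empty, instead of relying on position.
cleaned_df = demo_df.dropna(axis="columns", how="all")

print(list(demo_df.columns))
print(list(cleaned_df.columns))
```

Dropping by content leaves the dataframe untouched if the file is ever fixed at the source, whereas `iloc[:, :-1]` would silently remove a real column in that case.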

In [139]:
train_df.head()

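|  | itching | skin_rash | nodal_skin_eruptions | continuous_sneezing | shivering | chills | joint_pain | stomach_pain | acidity | ulcers_on_tongue | ... | blackheads | scurring | skin_peeling | silver_like_dusting | small_dents_in_nails | inflammatory_nails | blister | red_sore_around_nose | yellow_crust_ooze | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fungal infection |
| 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fungal infection |
| 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fungal infection |
| 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fungal infection |
| 4 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fungal infection |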


In [140]:
testing_df.head()

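|  | itching | skin_rash | nodal_skin_eruptions | continuous_sneezing | shivering | chills | joint_pain | stomach_pain | acidity | ulcers_on_tongue | ... | blackheads | scurring | skin_peeling | silver_like_dusting | small_dents_in_nails | inflammatory_nails | blister | red_sore_around_nose | yellow_crust_ooze | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fungal infection |
| 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Allergy |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | GERD |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Chronic cholestasis |
| 4 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Drug Reaction |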


The cell below shows the shape of the training dataframe as (rows, columns).

In [141]:
train_df.shape

(4920, 133)

The cell below shows the shape of the testing dataframe as (rows, columns).

In [142]:
testing_df.shape

(42, 133)

# Before we train the model

### Type conversion for better memory efficiency

The cell below shows basic information about the training dataframe. Each row holds 132 int64 values, which is overkill for columns that only store 0s and 1s. We can convert them to int8 (which ranges from -128 to 127) and save a lot of memory. We will apply this type conversion to both dataframes.

In [143]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4920 entries, 0 to 4919
Columns: 133 entries, itching to target
dtypes: int64(132), object(1)
memory usage: 5.0+ MB


In the cell below, we loop over the columns and convert them one by one from int64 to int8, excluding the last column, which holds strings.

In [147]:
# Conversion for the training dataframe
for column in train_df.columns[:-1]:  # Loop through the first 132 columns
    train_df[column] = train_df[column].astype('int8')

# Conversion for the testing dataframe
for column in testing_df.columns[:-1]:  # Loop through the first 132 columns
    testing_df[column] = testing_df[column].astype('int8')
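
As a possible alternative to hard-coding `'int8'`, `pd.to_numeric` with `downcast="integer"` picks the smallest signed integer dtype that can hold each column's values, so a column that unexpectedly contains larger numbers is not truncated. The sketch below uses a made-up three-row frame (`demo_df`) rather than the real data, so treat it as an illustration and not as the notebook's method.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the real symptom data.
demo_df = pd.DataFrame(
    {
        "itching": [1, 0, 1],
        "skin_rash": [1, 1, 0],
        "target": ["Fungal infection", "Allergy", "GERD"],
    }
)

# Downcast every numeric column; the string "target" column is not selected,
# so it is left untouched.
for column in demo_df.select_dtypes(include="number").columns:
    demo_df[column] = pd.to_numeric(demo_df[column], downcast="integer")

print(demo_df.dtypes)
```

For columns that only hold 0s and 1s, this ends up at int8, the same dtype the explicit loop above requests.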


In [149]:
train_df.info()
print("\n")
testing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4920 entries, 0 to 4919
Columns: 133 entries, itching to target
dtypes: int8(132), object(1)
memory usage: 672.8+ KB


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Columns: 133 entries, itching to target
dtypes: int8(132), object(1)
memory usage: 5.9+ KB


Just by converting the data from int64 (the default integer type in pandas dataframes) to int8, we reduced the memory needed to hold the training dataframe by almost 87%. In this small dataset the savings may not matter much, but with larger datasets they can make a huge difference.
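
To make the arithmetic behind that figure concrete, here is a minimal sketch that rebuilds a frame with the same shape as the training data (4,920 rows, 132 integer columns plus one string column) and measures it before and after the conversion. The all-zero values, the `symptom_i` column names, and the `demo_df` name are assumptions made for illustration; only the shape comes from the output above. Memory is measured without the index and without the contents of the strings, which matches the shallow figure that `info()` reports with a `+`.

```python
import numpy as np
import pandas as pd

n_rows, n_symptoms = 4920, 132  # shape reported by train_df.shape

# Synthetic stand-in for the training data: all-zero flags plus a label column.
demo_df = pd.DataFrame(
    np.zeros((n_rows, n_symptoms), dtype=np.int64),
    columns=[f"symptom_{i}" for i in range(n_symptoms)],  # hypothetical names
)
demo_df["target"] = pd.Series(["Fungal infection"] * n_rows, dtype=object)

# Shallow memory usage: 8 bytes per int64 value and per object reference.
bytes_before = demo_df.memory_usage(index=False).sum()

# Convert every column except the last one ("target") to int8 in one call.
symptom_columns = demo_df.columns[:-1]
demo_df = demo_df.astype({column: "int8" for column in symptom_columns})

bytes_after = demo_df.memory_usage(index=False).sum()
reduction = 1 - bytes_after / bytes_before
print(f"{bytes_before} -> {bytes_after} bytes ({reduction:.1%} smaller)")
```

The integer columns alone shrink by exactly 87.5% (1 byte instead of 8). The label column still costs 8 bytes per row for its object references, which is why the whole frame lands slightly below that.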