# Disease Recommendation/Prediction Application

## Import Data

In [7]:
# Load the necessary libraries
import pandas as pd # pandas is used for data manipulation and analysis
import numpy as np  # numpy is used for numerical operations

In [9]:
# Load datasets

# Load the dataset containing diseases and symptoms
dataset_df = pd.read_csv('resources/dataset.csv') # reading the dataset.csv file into a pandas DataFrame

# Load the symptom severity dataset
severity_df = pd.read_csv('resources/Symptom-severity.csv') # reading the Symptom-severity.csv file into a pandas DataFrame

# Load the synthetic disease data for testing
synthetic_test_df = pd.read_csv('resources/synthetic_disease_data.csv') # reading the synthetic_disease_data.csv file into a pandas DataFrame

In [11]:
# Inspect the datasets to understand their structure and contents

# Display the first few rows of the dataset containing diseases and symptoms
print("Dataset (diseases and symptoms):")
print(dataset_df.head()) # displaying the first 5 rows of dataset_df

# Display the first few rows of the symptom severity dataset
print("\nSymptom Severity Dataset:")
print(severity_df.head()) # displaying the first 5 rows of severity_df

# Display the first few rows of the synthetic disease test dataset
print("\nSynthetic Disease Test Dataset:")
print(synthetic_test_df.head()) # displaying the first 5 rows of synthetic_test_df

Dataset (diseases and symptoms):
            Disease   Symptom_1              Symptom_2              Symptom_3  \
0  Fungal infection     itching              skin_rash   nodal_skin_eruptions   
1  Fungal infection   skin_rash   nodal_skin_eruptions    dischromic _patches   
2  Fungal infection     itching   nodal_skin_eruptions    dischromic _patches   
3  Fungal infection     itching              skin_rash    dischromic _patches   
4  Fungal infection     itching              skin_rash   nodal_skin_eruptions   

              Symptom_4 Symptom_5 Symptom_6 Symptom_7 Symptom_8 Symptom_9  \
0   dischromic _patches       NaN       NaN       NaN       NaN       NaN   
1                   NaN       NaN       NaN       NaN       NaN       NaN   
2                   NaN       NaN       NaN       NaN       NaN       NaN   
3                   NaN       NaN       NaN       NaN       NaN       NaN   
4                   NaN       NaN       NaN       NaN       NaN       NaN   

  Symptom_10 Symp

## Data Cleaning

In [20]:
# Prepare the symptom severity data for merging with the main dataset

# Drop duplicate symptom names in severity_df, keeping the first occurrence,
# so that each symptom maps to exactly one weight
severity_df = severity_df.drop_duplicates(subset='Symptom')

In [22]:
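# Add a weight column for each symptom column
symptom_weights = severity_df.set_index('Symptom')['weight'] # lookup Series (symptom name -> weight), built once instead of on every iteration
symptom_columns = [col for col in dataset_df.columns if col.startswith('Symptom') and not col.endswith('_weight')] # snapshot, so re-running the cell does not create '_weight_weight' columns
for col in symptom_columns:
    dataset_df[col + '_weight'] = dataset_df[col].map(symptom_weights) # symptoms with no exact match in severity_df map to NaN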
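The printed symptom names above appear to carry leading spaces, and one contains an embedded space (`dischromic _patches`). `Series.map` only matches exact strings, so if the severity file spells a name differently, that symptom silently receives `NaN` as its weight. The following is a minimal sketch of one way to normalize names and report unmatched symptoms. The `demo_df` and `demo_severity` frames, their weight values, and the `normalize_symptom` helper are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature stand-ins for dataset_df and severity_df (toy weights)
demo_df = pd.DataFrame({
    "Disease": ["Fungal infection", "Fungal infection"],
    "Symptom_1": [" itching", " skin_rash"],
    "Symptom_2": [" skin_rash", " dischromic _patches"],
})
demo_severity = pd.DataFrame({
    "Symptom": ["itching", "skin_rash", "dischromic_patches"],
    "weight": [1, 3, 6],
})

def normalize_symptom(name):
    """Remove all whitespace from a symptom name; leave missing values untouched."""
    if isinstance(name, str):
        return "".join(name.split())
    return name

lookup = demo_severity.set_index("Symptom")["weight"]
symptom_cols = [c for c in demo_df.columns if c.startswith("Symptom")]

unmatched = set()
for col in symptom_cols:
    cleaned = demo_df[col].map(normalize_symptom)
    weights = cleaned.map(lookup)
    # A symptom that is present but has no weight indicates a naming mismatch
    unmatched.update(cleaned[weights.isna() & cleaned.notna()])
    demo_df[col] = cleaned
    demo_df[col + "_weight"] = weights

print("Symptoms without a severity weight:", sorted(unmatched))
```

If the reported set is non-empty, the listed names are the ones to reconcile between the two files before trusting the weight columns.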

In [24]:
# The resulting dataset_df will now include the weights associated with each symptom.
print("Dataset after adding symptom weights:")
print(dataset_df.head()) # displaying the first 5 rows of the modified dataset_df


Dataset after adding symptom weights:
            Disease  Symptom_1             Symptom_2             Symptom_3  \
0  Fungal infection    itching             skin_rash  nodal_skin_eruptions   
1  Fungal infection  skin_rash  nodal_skin_eruptions   dischromic _patches   
2  Fungal infection    itching  nodal_skin_eruptions   dischromic _patches   
3  Fungal infection    itching             skin_rash   dischromic _patches   
4  Fungal infection    itching             skin_rash  nodal_skin_eruptions   

             Symptom_4 Symptom_5 Symptom_6 Symptom_7 Symptom_8 Symptom_9  ...  \
0  dischromic _patches       NaN       NaN       NaN       NaN       NaN  ...   
1                  NaN       NaN       NaN       NaN       NaN       NaN  ...   
2                  NaN       NaN       NaN       NaN       NaN       NaN  ...   
3                  NaN       NaN       NaN       NaN       NaN       NaN  ...   
4                  NaN       NaN       NaN       NaN       NaN       NaN  ...   

  Symp

In [38]:
# Replace NaN values with 0
# Not every disease has 17 symptoms, and an absent symptom should contribute a weight of 0.

# Note: fillna(0) is applied to the whole DataFrame, so empty symptom-name cells also become 0, not only the weight columns
dataset_df.fillna(0, inplace=True)
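Because `fillna(0)` runs on the entire DataFrame, the symptom-name columns end up mixing strings with the number 0, as the output that follows shows for `Symptom_5` onward. If only the weights should be zero-filled, one option is to restrict the call to the weight columns. This is a sketch on a hypothetical toy frame (`demo_df` and its values are illustrative only), not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same column naming scheme as dataset_df
demo_df = pd.DataFrame({
    "Disease": ["A", "B"],
    "Symptom_1": ["itching", "skin_rash"],
    "Symptom_2": ["skin_rash", np.nan],
    "Symptom_1_weight": [1.0, 3.0],
    "Symptom_2_weight": [3.0, np.nan],
})

# Zero-fill only the numeric weight columns; symptom names keep their NaN markers
weight_cols = [c for c in demo_df.columns if c.endswith("_weight")]
demo_df[weight_cols] = demo_df[weight_cols].fillna(0)

print(demo_df)
```

Keeping `NaN` in the name columns preserves the distinction between "no symptom recorded" and a symptom literally named `0`.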


In [40]:
# Display the updated dataset to confirm the changes
print("Dataset after replacing NaN values with 0:")
print(dataset_df.head()) # displaying the first 5 rows of the modified dataset_df

Dataset after replacing NaN values with 0:
            Disease  Symptom_1             Symptom_2             Symptom_3  \
0  Fungal infection    itching             skin_rash  nodal_skin_eruptions   
1  Fungal infection  skin_rash  nodal_skin_eruptions    dischromic_patches   
2  Fungal infection    itching  nodal_skin_eruptions    dischromic_patches   
3  Fungal infection    itching             skin_rash    dischromic_patches   
4  Fungal infection    itching             skin_rash  nodal_skin_eruptions   

            Symptom_4 Symptom_5 Symptom_6 Symptom_7 Symptom_8 Symptom_9  ...  \
0  dischromic_patches         0         0         0         0         0  ...   
1                   0         0         0         0         0         0  ...   
2                   0         0         0         0         0         0  ...   
3                   0         0         0         0         0         0  ...   
4                   0         0         0         0         0         0  ...   

  Sympt

In [42]:
# Calculate the total symptom weight for each row (each row is one disease record)
# The new 'weight_total' column holds the sum of that row's symptom weights.

weight_columns = [col for col in dataset_df.columns if col.endswith('_weight')] # selecting all weight columns
dataset_df['weight_total'] = dataset_df[weight_columns].sum(axis=1) # summing the weights across the selected columns

In [44]:
# Display the updated dataset with the 'weight_total' column
print("Dataset after calculating total weight for each disease:")
print(dataset_df[['Disease', 'weight_total']].head()) # displaying the first 5 rows with Disease and weight_total columns

Dataset after calculating total weight for each disease:
            Disease  weight_total
0  Fungal infection          14.0
1  Fungal infection          13.0
2  Fungal infection          11.0
3  Fungal infection          10.0
4  Fungal infection           8.0


In [47]:
# Define the path to save the file
output_file_path = 'resources/combined_dataset.csv'

# Save the updated dataset as a CSV file
dataset_df.to_csv(output_file_path, index=False) # saving the DataFrame to a CSV file without the index

print(f"Updated dataset saved successfully to {output_file_path}")

Updated dataset saved successfully to resources/combined_dataset.csv


## Prepare the data for deep learning models

In [None]:
# Load the combined dataset from the 'resources' folder
combined_dataset_df = pd.read_csv('resources/combined_dataset.csv') # reading the combined_dataset.csv file into a pandas DataFrame

In [None]:
# Define the feature columns (all symptom-name and symptom-weight columns; 'weight_total' is not included)
feature_columns = [col for col in combined_dataset_df.columns if col.startswith('Symptom')]

In [65]:
# Define the target column (disease)
target_column = 'Disease'

In [67]:
# Separate the features and the target
X = combined_dataset_df[feature_columns]  # features (symptoms and their weights)
y = combined_dataset_df[target_column]    # target (disease)

In [69]:
# Display the shapes of X and y to ensure everything is correct
print("Features (X) shape:", X.shape)
print("Target (y) shape:", y.shape)

Features (X) shape: (4920, 34)
Target (y) shape: (4920,)
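The 34 feature columns selected above include the 17 symptom-name columns, which hold strings, and the target `y` holds disease names. Neural networks need numeric inputs and numeric class labels, so some encoding step is required before training. A minimal sketch, assuming only the numeric weight columns are kept as inputs and the target is integer-encoded with scikit-learn's `LabelEncoder`; the `demo_X` and `demo_y` objects are hypothetical toy data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy stand-ins for X and y
demo_X = pd.DataFrame({
    "Symptom_1": ["itching", "skin_rash", "itching"],
    "Symptom_1_weight": [1, 3, 1],
})
demo_y = pd.Series(["Fungal infection", "Allergy", "Fungal infection"])

# Keep only numeric columns (here, the weight column) as model inputs
numeric_X = demo_X.select_dtypes(include="number")

# Map each disease name to an integer class index (classes are sorted alphabetically)
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(demo_y)

print(numeric_X.columns.tolist())
print(y_encoded)
print(label_encoder.classes_)
```

`label_encoder.inverse_transform` converts predicted class indices back to disease names at inference time.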


In [73]:
# Split the data into training and testing sets
# The held-out test set lets us evaluate the deep learning models on data they were not trained on.

from sklearn.model_selection import train_test_split # importing train_test_split for data splitting

In [75]:
# Split the data: 80% for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
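With many disease classes, a purely random split can leave some classes under-represented in the test set. `train_test_split` accepts a `stratify` argument that preserves class proportions in both subsets. The sketch below illustrates this on hypothetical toy arrays (`demo_X`, `demo_y`); whether stratification is needed for the actual dataset depends on its class balance, which this notebook does not show.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, two balanced classes
demo_X = np.arange(40).reshape(20, 2)
demo_y = np.array(["A"] * 10 + ["B"] * 10)

# stratify=demo_y keeps the A/B ratio the same in the train and test subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42, stratify=demo_y
)

print(sorted(y_te.tolist()))
```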

In [77]:
# Display the shapes of the training and testing sets to confirm the split
print("Training features (X_train) shape:", X_train.shape)
print("Testing features (X_test) shape:", X_test.shape)
print("Training target (y_train) shape:", y_train.shape)
print("Testing target (y_test) shape:", y_test.shape)

Training features (X_train) shape: (3936, 34)
Testing features (X_test) shape: (984, 34)
Training target (y_train) shape: (3936,)
Testing target (y_test) shape: (984,)


In [93]:
# Standardize only the numeric feature data (symptom weights)
# StandardScaler rescales each weight column to zero mean and unit variance, which generally helps neural networks converge during training.

from sklearn.preprocessing import StandardScaler # importing StandardScaler for feature standardization

In [95]:
# Identify the weight columns that need to be scaled
weight_columns = [col for col in X.columns if '_weight' in col]

In [97]:
# Initialize the scaler
scaler = StandardScaler()

In [101]:
# Fit the scaler on the training data and transform both training and testing data
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

X_train_scaled[weight_columns] = scaler.fit_transform(X_train[weight_columns]) # fitting and transforming the training data
X_test_scaled[weight_columns] = scaler.transform(X_test[weight_columns])       # transforming the testing data using the same scaler
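Fitting the scaler on the training data only, then reusing it for the test data, avoids leaking test-set statistics into training. A quick sanity check is that the scaled training columns have mean close to 0 and standard deviation close to 1, while the test columns will generally be near but not exactly at those values. The sketch below demonstrates this on hypothetical random data (`train` and `test` are illustrative arrays, not the notebook's features).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical integer-valued "weights", generated with a fixed seed
rng = np.random.default_rng(0)
train = rng.integers(0, 8, size=(100, 3)).astype(float)
test = rng.integers(0, 8, size=(25, 3)).astype(float)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from training data only
test_scaled = scaler.transform(test)        # reuses the training statistics

print("train mean per column:", train_scaled.mean(axis=0).round(6))
print("train std per column: ", train_scaled.std(axis=0).round(6))
```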

In [103]:
# Display the first few rows of the scaled training data to confirm
print("First few rows of the scaled training features (X_train_scaled):")
print(X_train_scaled[weight_columns].head()) # displaying the scaled weight columns

First few rows of the scaled training features (X_train_scaled):
      Symptom_1_weight  Symptom_2_weight  Symptom_3_weight  Symptom_4_weight  \
1807         -1.812392         -0.932567         -0.057841         -1.067440   
184          -0.298315          0.696270         -0.057841          1.634801   
205          -1.812392         -0.118149         -1.536180          0.013456   
4581          0.458724         -0.118149         -0.057841          0.013456   
410          -1.812392         -0.932567         -0.057841          1.094353   

      Symptom_5_weight  Symptom_6_weight  Symptom_7_weight  Symptom_8_weight  \
1807          1.769614          0.134829          1.260963          1.911182   
184          -0.086041          0.972538          0.832366          1.020692   
205           0.377873          0.553684          0.832366          1.020692   
4581          0.841787          0.972538          1.689560         -0.760287   
410          -1.477782         -1.121734         -0.88

## Select and Train Deep Learning Models
Here we train, test, and compare three deep learning models to see which one performs best on this dataset.


##### Three deep learning architectures commonly used for multi-class classification tasks:

**Model 1: Multi-Layer Perceptron (MLP)**  
   - A fully connected feedforward neural network.
   - Suitable for tabular data with independent features.
   - Uses multiple layers to learn complex patterns.
   - **Pro:** Simple and effective for structured data.
   - **Con:** Struggles with spatial or sequential data.
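As a rough illustration of the MLP idea, here is a minimal sketch using scikit-learn's `MLPClassifier` on synthetic data from `make_classification`. The layer sizes, iteration limit, and synthetic dataset are assumptions chosen for the example; they are not the notebook's model or data, and a Keras or PyTorch implementation would look different.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for tabular symptom-weight features (17 columns, 4 classes)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=17, n_informative=10, n_classes=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Standardize using training statistics only
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

# Two fully connected hidden layers (sizes are illustrative)
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42)
mlp.fit(X_tr, y_tr)

accuracy = mlp.score(X_te, y_te)
print(f"Test accuracy on synthetic data: {accuracy:.3f}")
```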

**Model 2: Convolutional Neural Network (CNN)**  
   - Designed to capture local patterns in data.
   - Typically used for image data but adaptable to other data types.
   - Includes convolutional layers and pooling layers for feature extraction.
   - **Pro:** Excellent for data with spatial relationships.
   - **Con:** Computationally intensive, especially with large datasets.

**Model 3: Recurrent Neural Network (RNN) with LSTM units**  
   - Captures dependencies in sequence data.
   - LSTM units address the vanishing gradient problem.
   - Maintains a memory of previous inputs for better sequence prediction.
   - **Pro:** Powerful for tasks involving temporal dependencies.
   - **Con:** Less effective for non-sequential data and can be slower to train.
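One practical difference between these models is input shape. An MLP consumes a 2-D array of shape `(samples, features)`, whereas Keras-style `Conv1D` and `LSTM` layers expect 3-D input of shape `(samples, steps, channels)`. A minimal sketch of that reshaping with NumPy, assuming the 17 symptom weights are treated as 17 steps with one channel each (the array here is a hypothetical placeholder):

```python
import numpy as np

# Hypothetical placeholder for scaled symptom weights: 5 samples, 17 weight columns
n_samples, n_features = 5, 17
X_tabular = np.zeros((n_samples, n_features), dtype="float32")

# Add a trailing channel axis: (samples, features) -> (samples, steps, channels)
X_sequence = X_tabular[..., np.newaxis]

print("MLP input shape:     ", X_tabular.shape)
print("CNN / LSTM input shape:", X_sequence.shape)
```

Treating columns as steps imposes an ordering on the features, so a CNN or LSTM is only likely to benefit if the column order carries meaning.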