### Generating Synthetic Data for Model Accuracy Testing

A synthetic dataset of 250,000 individuals, each with a disease and a correctly matched set of symptoms, gives us a large test bed for measuring the accuracy of a deep learning model. Each individual is generated with a different number of symptoms, so the dataset covers cases ranging from a single symptom to the full symptom list of a disease. This simulates real-world scenarios in which a disease presents with only some of its symptoms, and lets us evaluate how well the model predicts the correct disease from the symptoms it is given. A population of 250,000 is intended to provide enough variation in symptom combinations to show whether the model generalizes, rather than only recognizing complete symptom lists.

- **Diversity of Data:** A large dataset exposes the model to many different symptom subsets for each disease, which tests its ability to generalize to inputs it has not seen in exactly that form.
- **Model Evaluation:** Measuring accuracy over this many samples covers a wide range of possible symptom combinations, so the result reflects more than a few favorable cases.
- **Realistic Training:** Simulating a large population in which individuals report only some of their symptoms approximates the conditions the model would face in real-life medical diagnostics.
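
To make "a wide range of possible symptom combinations" concrete, the sketch below counts how many distinct non-empty symptom selections a single disease record can yield. The helper names `count_symptom_subsets` and `count_symptom_orderings` and the example sizes are illustrative assumptions, not part of the notebook. It is a rough aid for judging coverage, not evidence that any particular population size is sufficient.

```python
from math import comb, perm


def count_symptom_subsets(k: int) -> int:
    """Number of non-empty subsets of k symptoms, ignoring order (equals 2**k - 1)."""
    return sum(comb(k, n) for n in range(1, k + 1))


def count_symptom_orderings(k: int) -> int:
    """Number of non-empty selections of k symptoms when column order matters."""
    return sum(perm(k, n) for n in range(1, k + 1))


# Hypothetical record sizes: a short, a medium, and a full 17-symptom record
for k in (3, 5, 17):
    print(k, count_symptom_subsets(k), count_symptom_orderings(k))
```

Because the generation step draws symptoms in random order, the same subset can appear in different column arrangements, which is why both counts are shown.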


In [2]:
import pandas as pd
import numpy as np

In [4]:
# Load the dataset
dataset_path = 'resources/dataset.csv'
df = pd.read_csv(dataset_path)

In [6]:
# List of symptom columns: 'Symptom_1' through 'Symptom_17'
symptom_columns = [f'Symptom_{i}' for i in range(1, 18)]

In [8]:
# Prepare the synthetic data
synthetic_data = []

In [12]:
# Generate 250,000 individuals
np.random.seed(42)  # arbitrary seed, set only so that the run is reproducible
for _ in range(250000):
    # Randomly choose a source row; diseases with more rows in df are drawn more often
    disease_row = df.sample().iloc[0]
    disease = disease_row['Disease']

    # Get the symptoms recorded for this disease row, skipping empty columns
    symptoms = disease_row[symptom_columns].dropna().tolist()

    # Randomly choose how many symptoms to keep (between 1 and all available)
    num_symptoms = np.random.randint(1, len(symptoms) + 1)

    # Randomly select that many symptoms without repetition
    selected_symptoms = np.random.choice(symptoms, num_symptoms, replace=False)

    # Build the row: disease, selected symptoms, then None padding up to the column count
    padding = [None] * (len(symptom_columns) - num_symptoms)
    row = [disease] + selected_symptoms.tolist() + padding

    # Append to the synthetic data list
    synthetic_data.append(row)

In [13]:
# Convert the list to a DataFrame
synthetic_df = pd.DataFrame(synthetic_data, columns=['Disease'] + symptom_columns)
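
Since the dataset is meant to include cases with different numbers of symptoms, it can help to look at how many rows have one symptom, two symptoms, and so on. The following is a minimal sketch on toy data; the function name `symptom_count_distribution` and the toy values are assumptions made for illustration.

```python
import pandas as pd


def symptom_count_distribution(frame, symptom_columns):
    """Return how many rows have 1, 2, ... non-missing symptoms."""
    counts = frame[symptom_columns].notna().sum(axis=1)
    return counts.value_counts().sort_index()


# Toy stand-in for the generated data (hypothetical values)
toy_columns = ['Symptom_1', 'Symptom_2', 'Symptom_3']
toy = pd.DataFrame({
    'Disease': ['A', 'A', 'B'],
    'Symptom_1': ['itching', 'itching', 'cough'],
    'Symptom_2': ['rash', None, 'fever'],
    'Symptom_3': [None, None, 'fatigue'],
})

dist = symptom_count_distribution(toy, toy_columns)
print(dist)
```

On the notebook's own data the equivalent call would be `symptom_count_distribution(synthetic_df, symptom_columns)`.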

In [14]:
# Save the generated dataset to the resources folder
synthetic_df.to_csv('resources/synthetic_disease_data.csv', index=False)
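
Because the dataset is only useful if diseases and symptoms are correctly matched, a quick consistency check is worth running. The sketch below counts synthetic rows whose symptoms are not all known for their disease in the source data. The function name `count_mismatched_rows` and the toy frames are hypothetical, and the check compares against the union of symptoms across all source rows of a disease, which is a looser test than matching one specific source row.

```python
import pandas as pd


def count_mismatched_rows(source_df, synthetic_df, symptom_columns):
    """Count synthetic rows with no symptoms or with a symptom unknown for their disease."""
    # Map each disease to every symptom recorded for it in the source data
    known = {}
    for disease, group in source_df.groupby('Disease'):
        values = group[symptom_columns].to_numpy().ravel()
        known[disease] = {v for v in values if pd.notna(v)}

    mismatches = 0
    for _, row in synthetic_df.iterrows():
        chosen = set(row[symptom_columns].dropna())
        if not chosen or not chosen <= known.get(row['Disease'], set()):
            mismatches += 1
    return mismatches


# Toy data (hypothetical): the second synthetic row pairs 'Allergy' with a flu symptom
cols = ['Symptom_1', 'Symptom_2']
toy_source = pd.DataFrame({
    'Disease': ['Flu', 'Allergy'],
    'Symptom_1': ['fever', 'itching'],
    'Symptom_2': ['cough', None],
})
toy_synthetic = pd.DataFrame({
    'Disease': ['Flu', 'Allergy'],
    'Symptom_1': ['cough', 'fever'],
    'Symptom_2': [None, None],
})

print(count_mismatched_rows(toy_source, toy_synthetic, cols))
```

On the notebook's own data, `count_mismatched_rows(df, synthetic_df, symptom_columns)` should return 0 if the generation step worked as described.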