### Important

This notebook explains the steps needed to create the synthetic dataset. The same functionality is also encapsulated in a script. To run it, navigate to the repository root and run:

```bash
python scripts/generate_health_data.py
```

# Synthetic Data Generation

This notebook creates the synthetic data needed to start implementing the first models. The real data will be provided at a later stage of the project. It will have the same structure, but the values in the rows will be entirely different. The synthetic data generated here serves only as a placeholder for the actual data.

In [None]:
import os
import sys
from pathlib import Path

root = Path(os.getcwd()).parent
sys.path.append(str(root))

import pandas
import numpy as np
import random
import matplotlib.pyplot as plt

### The structure of the data

The data columns and their data types are listed in the `health_data.md` file, where the most important columns are also described.


### Neighborhood IDs

The unique `neighborhood_id` values come from the real morphology file we already have access to. We extract every unique value it contains and then randomly assign each neighborhood a number of participants (between 8 and 14).
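
The assignment described above can be sketched on toy data. The neighborhood IDs below are hypothetical placeholders, and the seeded `np.random.default_rng` generator is an assumption added here for repeatability; the notebook cells themselves use the unseeded `np.random` functions.

```python
import numpy as np

# Hypothetical neighborhood IDs; the notebook reads the real ones from the morphology file.
toy_neighborhood_ids = np.array(["N01", "N02", "N03"])

# Seeded generator so the draw is repeatable (assumption: not used in the notebook itself).
rng = np.random.default_rng(42)

# Draw between 8 and 14 participants per neighborhood (the upper bound 15 is exclusive).
toy_counts = rng.integers(8, 15, size=len(toy_neighborhood_ids))

# Repeat each neighborhood ID once per participant assigned to it.
toy_participant_neighborhoods = np.repeat(toy_neighborhood_ids, toy_counts)

# One neighborhood label per simulated participant.
assert len(toy_participant_neighborhoods) == toy_counts.sum()
```

Seeding matters only if the same synthetic dataset needs to be regenerated exactly. Without a seed, each run produces a different number of participants.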

In [9]:
# The neighborhoods to simulate are those listed in the "data/morphology_data_cleaned.csv" file
morphology_data = pandas.read_csv(root / "data" / "morphology_data_cleaned.csv")
neighborhoods_ids = morphology_data["id"].unique()
n_neighborhoods = len(neighborhoods_ids)

# Draw 8 to 14 participants per neighborhood (the upper bound of randint is exclusive)
n_participants_per_neighborhood = np.random.randint(8, 15, size=n_neighborhoods)

### Generation of the Synthetic Data

In [21]:
# Total number of participants
n_participants = np.sum(n_participants_per_neighborhood)

# Participant IDs
participant_ids = np.arange(1, n_participants + 1)
assert len(participant_ids) == n_participants

# Neighborhood IDs
neighborhood_ids = []
for i, n in enumerate(n_participants_per_neighborhood):
    neighborhood_id = neighborhoods_ids[i]
    neighborhood_ids.extend([neighborhood_id] * n)
neighborhood_ids = np.array(neighborhood_ids)
assert len(neighborhood_ids) == n_participants


# Socio-Demographic Data
# np.random.randint excludes the upper bound, so ages fall between 18 and 79
ages = np.random.randint(18, 80, size=n_participants)
sexes = np.random.choice(["Male", "Female"], size=n_participants)
incomes = np.random.choice(["Low", "Medium", "High"], size=n_participants)
education_levels = np.random.choice(
    ["High School", "Bachelor", "Master", "PhD"], size=n_participants
)
socio_demograph_data = pandas.DataFrame(
    {
        "participant_id": participant_ids,
        "neighborhood_id": neighborhood_ids,
        "age": ages,
        "sex": sexes,
        "income": incomes,
        "education_level": education_levels,
    }
)

# Health Data
heart_failures = np.random.choice([0, 1], size=n_participants, p=[0.9, 0.1])
heart_rhythms = np.random.choice([0, 1], size=n_participants, p=[0.85, 0.15])
d_metabolic_diabetes_I = np.random.choice([0, 1], size=n_participants, p=[0.95, 0.05])
d_metabolic_diabetes_II = np.random.choice([0, 1], size=n_participants, p=[0.9, 0.1])
d_metabolic_obesity = np.random.choice([0, 1], size=n_participants, p=[0.8, 0.2])
d_breath_respiratory = np.random.choice([0, 1], size=n_participants, p=[0.85, 0.15])
d_breath_asthma = np.random.choice([0, 1], size=n_participants, p=[0.9, 0.1])
# Integer scores: GHQ12 in 0-12 and sleep deprivation points in 0-20 (upper bounds are exclusive)
GHQ12_scores = np.random.randint(0, 13, size=n_participants)
points_sleep_deprivation = np.random.randint(0, 21, size=n_participants)
sleep_disorder_hot = np.random.choice([0, 1], size=n_participants, p=[0.8, 0.2])
sleeping_hours = np.random.uniform(4, 10, size=n_participants)
bedtime_hours = [
    f"{random.randint(20, 23)}:{random.randint(0, 59):02d}"
    for _ in range(n_participants)
]
health_data = pandas.DataFrame(
    {
        "participant_id": participant_ids,
        "neighborhood_id": neighborhood_ids,
        "heart_failure": heart_failures,
        "heart_rhythm": heart_rhythms,
        "d_metabolic_diabetes_I": d_metabolic_diabetes_I,
        "d_metabolic_diabetes_II": d_metabolic_diabetes_II,
        "d_metabolic_obesity": d_metabolic_obesity,
        "d_breath_respiratory": d_breath_respiratory,
        "d_breath_asthma": d_breath_asthma,
        "GHQ12_score": GHQ12_scores,
        "points_sleep_deprivation": points_sleep_deprivation,
        "sleep_disorder_hot": sleep_disorder_hot,
        "sleeping_hours": sleeping_hours,
        "bedtime_hour": bedtime_hours,
    }
)

In [None]:
# Save the data to an XLSX file with two sheets: "Participant_SocioDemograph_Data" and "Participant_HEALTH_Data"

with pandas.ExcelWriter(root / "data" / "synthetic_health_data.xlsx") as writer:
    socio_demograph_data.to_excel(
        writer, sheet_name="Participant_SocioDemograph_Data", index=False
    )
    health_data.to_excel(writer, sheet_name="Participant_HEALTH_Data", index=False)
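
Because the two sheets share the `participant_id` and `neighborhood_id` columns, it can be useful to check that they line up before relying on the file. The following is a minimal sketch on toy frames with hypothetical values. It is not part of the generation script, and the outer merge is just one possible way to do the check.

```python
import pandas

# Toy stand-ins for the two sheets (hypothetical values, not the generated data).
toy_socio = pandas.DataFrame(
    {
        "participant_id": [1, 2, 3],
        "neighborhood_id": ["N01", "N01", "N02"],
        "age": [34, 51, 27],
    }
)
toy_health = pandas.DataFrame(
    {
        "participant_id": [1, 2, 3],
        "neighborhood_id": ["N01", "N01", "N02"],
        "heart_failure": [0, 0, 1],
    }
)

# An outer merge with indicator=True labels each row as "both", "left_only", or "right_only".
keys = ["participant_id", "neighborhood_id"]
merged = toy_socio.merge(toy_health, on=keys, how="outer", indicator=True)

# Every participant should appear in both tables, exactly once.
assert (merged["_merge"] == "both").all()
assert toy_socio["participant_id"].is_unique
```

The same two assertions could be applied to `socio_demograph_data` and `health_data` before they are written to disk.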