Import libraries
pandas: Used for data manipulation and exporting to CSV.
numpy: Used for generating synthetic data (e.g., random numbers, distributions).

In [1]:
import pandas as pd
import numpy as np

The script below generates a synthetic health dataset for 10,000 individuals. The variables include demographic features, health conditions, symptom profiles, and severity indicators, structured to support analytics, visualization, and predictive modeling. The simulation uses NumPy to draw each variable from a specified probability distribution intended to mimic real-world patterns, and Pandas to assemble and export the result.

1. Demographics
age: Normally distributed with mean 50, std 15; clipped to 18–100 and truncated to integers
sex: Binary (0 = Female, 1 = Male), 50/50 distribution
family_history: Binary (0 = No, 1 = Yes); 20% positive history

2. Cancer Diagnosis (Target Variable)
cancer: Binary target; 30% diagnosed with cancer, 70% healthy

3. Symptom Variables (conditioned on cancer status)
Each symptom is generated with a different probability depending on whether the individual has cancer. The symptoms are fatigue, weight_loss, pain, fever, night_sweats, bleeding, lumps, cough, and bowel_bladder_changes.
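
The conditional draw described above can be sketched with a small helper. This is a minimal illustration, not the notebook's own code: the function name `conditional_binary`, the seed, and the sample size are assumptions made for the example, and it uses NumPy's `default_rng` generator rather than the legacy `np.random.seed` interface. The probabilities 0.6 and 0.3 are the fatigue rates used in the generation script.

```python
import numpy as np

def conditional_binary(condition, p_if_true, p_if_false, rng):
    """Draw 0/1 flags whose probability of 1 depends on a boolean condition.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    condition = np.asarray(condition, dtype=bool)
    # Pick the per-row probability, then compare against one uniform draw per row.
    p = np.where(condition, p_if_true, p_if_false)
    return (rng.random(condition.shape[0]) < p).astype(int)

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
cancer = rng.choice([0, 1], size=100_000, p=[0.7, 0.3])
fatigue = conditional_binary(cancer == 1, 0.6, 0.3, rng)

# Empirical rates should land close to the target probabilities.
rate_cancer = fatigue[cancer == 1].mean()
rate_healthy = fatigue[cancer == 0].mean()
print(round(rate_cancer, 2), round(rate_healthy, 2))
```

Drawing a single uniform number per row and comparing it with a row-specific probability gives the same distribution as the two-array `np.where` pattern, while consuming half as many random draws.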

4. Severity Indicators
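Numeric and categorical variables reflecting the intensity of symptoms. Each is drawn from the cancer status alone, independently of the corresponding symptom flag:
- pain_severity: Integer scale from 0–10; normal with mean 6 (cancer) or 3 (no cancer) and std 2, clipped to the scale
- weight_loss_amount: Exponential with mean 5 for cancer cases; uniform on 0–2 otherwise
- bleeding_severity: Ordinal scale (0–3), higher for cancer cases
- vital_sign_abnormalities: Binary; 30% abnormal for cancer patients, 5% otherwise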

5. Emergency Classification
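The emergency variable is computed from a combination of cancer presence and severity criteria. Individuals with cancer and at least one of pain_severity ≥ 8, bleeding_severity ≥ 2, or abnormal vital signs are flagged as emergencies with 80% probability; everyone else is flagged with 1% probability.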
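
As a rough cross-check, the expected share of emergency cases can be worked out from the parameters in the generation code: 30% cancer prevalence, the thresholds pain_severity ≥ 8, bleeding_severity ≥ 2, and vital_sign_abnormalities = 1, and emergency rates of 80% when the criteria are met and 1% otherwise. The sketch below assumes the three severity draws are independent given cancer status, which matches how the script generates them. The helper `normal_sf` is a hypothetical name introduced for this example.

```python
from math import erf, sqrt

def normal_sf(x, mean, std):
    """P(X >= x) for X ~ Normal(mean, std); hypothetical helper for this sketch."""
    return 0.5 * (1 - erf((x - mean) / (std * sqrt(2))))

p_cancer = 0.30
# Truncating to int does not change the event: int(x) >= 8 exactly when x >= 8.
p_pain_high = normal_sf(8, mean=6, std=2)   # pain_severity >= 8 among cancer cases
p_bleed_high = 0.2 + 0.2                    # bleeding_severity of 2 or 3
p_vitals = 0.30                             # abnormal vital signs

# Probability that at least one criterion holds for a cancer case.
p_any = 1 - (1 - p_pain_high) * (1 - p_bleed_high) * (1 - p_vitals)
p_criteria = p_cancer * p_any

# Mix the two emergency rates by how often the criteria are met.
p_emergency = 0.8 * p_criteria + 0.01 * (1 - p_criteria)
print(f"{p_emergency:.3f}")
```

Comparing this figure with the mean of the emergency column in the generated data is a quick way to confirm that the rule was applied as intended.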

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Number of samples
n_samples = 10000

# Initialize data dictionary
data = {}

# Demographic variables
data['age'] = np.clip(np.random.normal(50, 15, n_samples), 18, 100).astype(int)
data['sex'] = np.random.choice([0, 1], n_samples, p=[0.5, 0.5])  # 0=Female, 1=Male
data['family_history'] = np.random.choice([0, 1], n_samples, p=[0.8, 0.2])

# Target: Cancer diagnosis
data['cancer'] = np.random.choice([0, 1], n_samples, p=[0.7, 0.3])

# Symptoms (conditional on cancer)
data['fatigue'] = np.where(
    data['cancer'] == 1,
    np.random.choice([0, 1], n_samples, p=[0.4, 0.6]),
    np.random.choice([0, 1], n_samples, p=[0.7, 0.3])
)
data['weight_loss'] = np.where(
    data['cancer'] == 1,
    np.random.choice([0, 1], n_samples, p=[0.6, 0.4]),
    np.random.choice([0, 1], n_samples, p=[0.9, 0.1])
)
data['pain'] = np.where(
    data['cancer'] == 1,
    np.random.choice([0, 1], n_samples, p=[0.5, 0.5]),
    np.random.choice([0, 1], n_samples, p=[0.8, 0.2])
)
data['fever'] = np.where(
    data['cancer'] == 1,
    np.random.choice([0, 1], n_samples, p=[0.7, 0.3]),
    np.random.choice([0, 1], n_samples, p=[0.9, 0.1])
)
data['night_sweats'] = np.where(
    data['cancer'] == 1,
    np.random.choice([0, 1], n_samples, p=[0.75, 0.25]),
    np.random.choice([0, 1], n_samples, p=[0.95, 0.05])
)
data['bleeding'] = np.where(
    data['cancer'] == 1,
    np.random.choice([0, 1], n_samples, p=[0.7, 0.3]),
    np.random.choice([0, 1], n_samples, p=[0.95, 0.05])
)
data['lumps'] = np.where(
    data['cancer'] == 1,
    np.random.choice([0, 1], n_samples, p=[0.65, 0.35]),
    np.random.choice([0, 1], n_samples, p=[0.9, 0.1])
)
data['cough'] = np.where(
    data['cancer'] == 1,
    np.random.choice([0, 1], n_samples, p=[0.6, 0.4]),
    np.random.choice([0, 1], n_samples, p=[0.85, 0.15])
)
data['bowel_bladder_changes'] = np.where(
    data['cancer'] == 1,
    np.random.choice([0, 1], n_samples, p=[0.7, 0.3]),
    np.random.choice([0, 1], n_samples, p=[0.9, 0.1])
)

# Severity indicators
data['pain_severity'] = np.where(
    data['cancer'] == 1,
    np.clip(np.random.normal(6, 2, n_samples), 0, 10).astype(int),
    np.clip(np.random.normal(3, 2, n_samples), 0, 10).astype(int)
)
data['weight_loss_amount'] = np.where(
    data['cancer'] == 1,
    np.random.exponential(5, n_samples),
    np.random.uniform(0, 2, n_samples)
)
data['bleeding_severity'] = np.where(
    data['cancer'] == 1,
    np.random.choice([0, 1, 2, 3], n_samples, p=[0.3, 0.3, 0.2, 0.2]),
    np.random.choice([0, 1, 2, 3], n_samples, p=[0.95, 0.04, 0.01, 0.0])
)
data['vital_sign_abnormalities'] = np.where(
    data['cancer'] == 1,
    np.random.choice([0, 1], n_samples, p=[0.7, 0.3]),
    np.random.choice([0, 1], n_samples, p=[0.95, 0.05])
)

# Target: Emergency status (based on cancer and severity)
data['emergency'] = np.where(
    (data['cancer'] == 1) & (
        (data['pain_severity'] >= 8) | 
        (data['bleeding_severity'] >= 2) | 
        (data['vital_sign_abnormalities'] == 1)
    ),
    np.random.choice([0, 1], n_samples, p=[0.2, 0.8]),  # 80% emergency if criteria met
    np.random.choice([0, 1], n_samples, p=[0.99, 0.01])  # 1% emergency otherwise
)

After completing the simulation, the generated dictionary was converted into a structured Pandas DataFrame containing all variables. The dataset was then saved locally as a CSV file named coded_data.csv.

In [4]:
# Create DataFrame for coded data
coded_df = pd.DataFrame(data)

# Save coded data
coded_df.to_csv('coded_data.csv', index=False)
print("Coded dataset saved as 'coded_data.csv'")

Coded dataset saved as 'coded_data.csv'
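
One optional check after saving is to reload the file and confirm that the columns and values survive the round trip. The sketch below is self-contained and does not touch coded_data.csv: it uses a small stand-in DataFrame with made-up values and a hypothetical file name inside a temporary directory.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small stand-in for coded_df; column names follow the notebook, values are made up.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "age": rng.integers(18, 101, size=5),
    "cancer": rng.integers(0, 2, size=5),
    "weight_loss_amount": rng.uniform(0, 2, size=5),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "coded_data_demo.csv")  # hypothetical file name
    demo_df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Column order should be preserved; numeric values are compared with a tolerance.
same_columns = list(reloaded.columns) == list(demo_df.columns)
same_values = np.allclose(
    reloaded.to_numpy(dtype=float), demo_df.to_numpy(dtype=float)
)
print(same_columns, same_values)
```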


To verify the structure and content of the generated dataset, the first five rows of the coded_data.csv file were displayed using pandas.DataFrame.head().

In [8]:
# Display first few rows for verification
print("\nFirst 5 rows of coded_data.csv:")
print(coded_df.head())


First 5 rows of coded_data.csv:
   age  sex  family_history  cancer  fatigue  weight_loss  pain  fever  \
0   57    0               0       0        1            0     0      0   
1   47    0               0       0        1            1     0      1   
2   59    0               1       0        0            0     0      1   
3   72    0               0       0        1            0     0      0   
4   46    1               0       0        1            0     0      0   

   night_sweats  bleeding  lumps  cough  bowel_bladder_changes  pain_severity  \
0             0         0      0      0                      0              3   
1             0         0      0      0                      0              1   
2             0         0      0      0                      0              5   
3             0         0      0      0                      0              2   
4             0         0      0      0                      0              6   

   weight_loss_amount  bleeding_sev

To enhance interpretability for non-technical users and visualization tools, a human-readable version of the dataset was created by converting binary and categorical variables into string labels. This was done by copying the original coded dataset and applying the following transformations:
- Binary variables such as cancer, family_history, fatigue, etc., were converted from 0/1 to "No"/"Yes".
- The sex column was mapped from 0/1 to "Female"/"Male".
- The bleeding_severity variable, originally coded as integers 0–3, was converted into descriptive levels: None, Mild, Moderate, and Severe.
- The column weight_loss_amount was renamed to weight_loss_amount_kg to clearly indicate the unit of measurement.

In [5]:
# Create raw data by converting binary/categorical variables to strings
raw_df = coded_df.copy()

# Convert binary variables to Yes/No
binary_columns = [
    'family_history', 'cancer', 'fatigue', 'weight_loss', 'pain', 'fever', 
    'night_sweats', 'bleeding', 'lumps', 'cough', 'bowel_bladder_changes', 
    'vital_sign_abnormalities', 'emergency'
]
for col in binary_columns:
    raw_df[col] = raw_df[col].map({0: 'No', 1: 'Yes'})

# Convert sex to Male/Female
raw_df['sex'] = raw_df['sex'].map({0: 'Female', 1: 'Male'})

# Convert bleeding_severity to None/Mild/Moderate/Severe
raw_df['bleeding_severity'] = raw_df['bleeding_severity'].map({
    0: 'None', 1: 'Mild', 2: 'Moderate', 3: 'Severe'
})

# Rename weight_loss_amount to include unit
raw_df = raw_df.rename(columns={'weight_loss_amount': 'weight_loss_amount_kg'})
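
Because `Series.map` with a dictionary returns NaN for any value that is not a key, it can be worth confirming that the relabeling left no gaps. The following sketch shows the idea on a few made-up rows. The names `coded_demo`, `raw_demo`, and `unmapped` are illustrative and not part of the notebook.

```python
import pandas as pd

# Made-up rows using two of the notebook's column names.
coded_demo = pd.DataFrame({"cancer": [0, 1, 1], "bleeding_severity": [0, 2, 3]})

raw_demo = coded_demo.copy()
raw_demo["cancer"] = raw_demo["cancer"].map({0: "No", 1: "Yes"})
raw_demo["bleeding_severity"] = raw_demo["bleeding_severity"].map(
    {0: "None", 1: "Mild", 2: "Moderate", 3: "Severe"}
)

# Series.map yields NaN for keys missing from the dict, so count them per column.
unmapped = raw_demo.isna().sum()
print(unmapped.to_dict())
```

A non-zero count for any column would indicate a code that the mapping dictionary does not cover.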

After formatting the dataset into a more human-readable form, the final version was saved as a CSV file named raw_data.csv. This file contains descriptive string labels rather than numeric codes, making it easier to interpret and visualize. The raw dataset is intended for Power BI dashboards, SQL queries, reports, and presentations where clarity is essential.

In [6]:
# Save raw data
raw_df.to_csv('raw_data.csv', index=False)
print("Raw dataset saved as 'raw_data.csv'")

Raw dataset saved as 'raw_data.csv'
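
One caveat when loading raw_data.csv back into pandas: the bleeding_severity label "None" is stored as plain text, and recent pandas versions appear to include the string None among the markers that `read_csv` treats as missing by default. This is worth checking against the installed version. A minimal sketch of a defensive read, using made-up rows in the same layout, passes `keep_default_na=False` so the label is kept as a string:

```python
import io

import pandas as pd

# Made-up rows in the raw_data.csv layout (only two columns shown).
csv_text = "age,bleeding_severity\n57,None\n47,Mild\n"

# keep_default_na=False stops read_csv from treating marker strings as missing values.
labels_kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(labels_kept["bleeding_severity"].tolist())
```

With this option, empty fields are also read as empty strings rather than NaN, which is acceptable here because the simulated dataset contains no missing values.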


To verify the structure and content of the human-readable dataset, the first five rows of the raw_data.csv file were displayed using pandas.DataFrame.head().

In [9]:
# Display first few rows for verification
print("\nFirst 5 rows of raw_data.csv:")
print(raw_df.head())


First 5 rows of raw_data.csv:
   age     sex family_history cancer fatigue weight_loss pain fever  \
0   57  Female             No     No     Yes          No   No    No   
1   47  Female             No     No     Yes         Yes   No   Yes   
2   59  Female            Yes     No      No          No   No   Yes   
3   72  Female             No     No     Yes          No   No    No   
4   46    Male             No     No     Yes          No   No    No   

  night_sweats bleeding lumps cough bowel_bladder_changes  pain_severity  \
0           No       No    No    No                    No              3   
1           No       No    No    No                    No              1   
2           No       No    No    No                    No              5   
3           No       No    No    No                    No              2   
4           No       No    No    No                    No              6   

   weight_loss_amount_kg bleeding_severity vital_sign_abnormalities emergency  
0    