
# Data Dictionary

| Column Name           | Data Type | Description |
|-----------------------|-----------|-------------|
| **Patient_ID**        | `string`  | Unique identifier for each patient, formatted as `PID1000`, `PID1001`, and so on. |
| **Age**               | `int`     | Age of the patient in years. Drawn from a **normal distribution** (mean 50, standard deviation 15), clipped to 18–90 to simulate adult patients, then truncated to an integer. |
| **Gender**            | `string`  | Biological sex of the patient. Randomly chosen as `'Male'` or `'Female'` with equal probability. |
| **Smoking_Status**    | `string`  | Whether the patient smokes. Randomly chosen as `'Yes'` (30% chance) or `'No'` (70%). |
| **Alcohol_Use**       | `string`  | Whether the patient consumes alcohol. Randomly chosen as `'Yes'` (40% chance) or `'No'` (60%). |
| **Family_History**    | `string`  | Whether there is a family history of cancer. Randomly chosen as `'Yes'` (20% chance) or `'No'` (80%). Included as a commonly cited risk factor, although the generated label does not depend on it. |
| **Blood_Marker_1**    | `float`   | A simulated **biomarker level** relevant to cancer (e.g. PSA, CA-125). Drawn from a normal distribution (mean 5, standard deviation 2) and rounded to 2 decimals. Most values fall between 0 and 10, but the values are not clipped, so a few may fall outside that range. Values above 7 count as elevated. |
| **Blood_Marker_2**    | `float`   | Another simulated **biomarker**, e.g. white blood cell count or another lab value. Drawn from a normal distribution (mean 100, standard deviation 20) and rounded to 1 decimal. Most values fall between 50 and 150. Values above 130 count as elevated. |
| **Symptom_Score**     | `int`     | A score from 0 to 10 standing in for symptoms such as fatigue, pain, or weight loss, where a higher score represents more severe symptoms. Generated as a uniformly random integer; the generated label does not depend on it. |
| **Cancer_Diagnosis**  | `int`     | **Target label**: `1` = cancer present, `0` = no cancer. If `Blood_Marker_1` > 7 and `Blood_Marker_2` > 130, the label is `1` with 80% probability; otherwise it is `1` with 20% probability. The randomness simulates false positives and false negatives. |

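The data dictionary can double as a set of checks on individual records. The snippet below is a minimal sketch of that idea, not part of the original notebook: `validate_record` and `YES_NO` are hypothetical names, and the checks cover only the columns whose allowed values are stated exactly above. The two blood markers are left out on purpose, because their ranges are described as "mostly" rather than as hard limits.

```python
import re

# Allowed values for the three Yes/No columns in the data dictionary.
YES_NO = {"Yes", "No"}


def validate_record(record):
    """Return a list of data-dictionary violations for one patient record (a dict)."""
    problems = []
    if not re.fullmatch(r"PID\d+", str(record["Patient_ID"])):
        problems.append("Patient_ID must look like PID1000")
    if not 18 <= record["Age"] <= 90:
        problems.append("Age must be between 18 and 90")
    if record["Gender"] not in {"Male", "Female"}:
        problems.append("Gender must be 'Male' or 'Female'")
    for column in ("Smoking_Status", "Alcohol_Use", "Family_History"):
        if record[column] not in YES_NO:
            problems.append(f"{column} must be 'Yes' or 'No'")
    if not 0 <= record["Symptom_Score"] <= 10:
        problems.append("Symptom_Score must be between 0 and 10")
    if record["Cancer_Diagnosis"] not in {0, 1}:
        problems.append("Cancer_Diagnosis must be 0 or 1")
    return problems


# First row of the preview printed by the notebook.
example = {
    "Patient_ID": "PID1000", "Age": 57, "Gender": "Male",
    "Smoking_Status": "No", "Alcohol_Use": "Yes", "Family_History": "No",
    "Blood_Marker_1": 5.93, "Blood_Marker_2": 155.1,
    "Symptom_Score": 3, "Cancer_Diagnosis": 0,
}
print(validate_record(example))  # → []
```

With the generated DataFrame, one way to apply it would be to loop over `df.to_dict("records")` and collect any non-empty results.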


In [4]:
import pandas as pd
import numpy as np
import random

# Set random seed for reproducibility
np.random.seed(42)
random.seed(42)

# Number of synthetic patients
num_records = 1000

# Generate Patient IDs
patient_ids = [f"PID{1000+i}" for i in range(num_records)]

# Generate Ages (normal distribution, clipped)
ages = np.clip(np.random.normal(loc=50, scale=15, size=num_records), 18, 90).astype(int)

# Generate Gender (Male/Female, equal probability)
genders = np.random.choice(['Male', 'Female'], size=num_records)

# Generate Smoking Status (Yes/No)
smoking_status = np.random.choice(['Yes', 'No'], size=num_records, p=[0.3, 0.7])

# Generate Alcohol Use (Yes/No)
alcohol_use = np.random.choice(['Yes', 'No'], size=num_records, p=[0.4, 0.6])

# Family History of Cancer (Yes/No)
family_history = np.random.choice(['Yes', 'No'], size=num_records, p=[0.2, 0.8])

# Blood Marker 1 (mostly 0-10, elevated if > 7; not clipped, so a few values may fall outside 0-10)
marker1 = np.round(np.random.normal(loc=5, scale=2, size=num_records), 2)

# Blood Marker 2 (mostly 50-150, elevated if > 130)
marker2 = np.round(np.random.normal(loc=100, scale=20, size=num_records), 1)

# Symptom Score (0-10)
symptom_score = np.random.randint(0, 11, size=num_records)

# Target: Cancer Diagnosis (0 = No, 1 = Yes)
# We base it on a simple rule: marker1 > 7 and marker2 > 130 → more likely to have cancer
diagnosis = []
for i in range(num_records):
    if marker1[i] > 7 and marker2[i] > 130:
        diagnosis.append(1 if random.random() > 0.2 else 0)  # 80% chance of a positive label when both markers are elevated
    else:
        diagnosis.append(0 if random.random() > 0.2 else 1)  # 20% chance of a positive label otherwise (label noise)

# Create DataFrame
df = pd.DataFrame({
    'Patient_ID': patient_ids,
    'Age': ages,
    'Gender': genders,
    'Smoking_Status': smoking_status,
    'Alcohol_Use': alcohol_use,
    'Family_History': family_history,
    'Blood_Marker_1': marker1,
    'Blood_Marker_2': marker2,
    'Symptom_Score': symptom_score,
    'Cancer_Diagnosis': diagnosis
})

# Preview the data
print(df.head())

# Save to Excel (pandas needs an Excel writer engine such as openpyxl for .xlsx files)
df.to_excel("AI_Alula_CleanedDataset.xlsx", index=False)
print("Synthetic dataset saved to AI_Alula_CleanedDataset.xlsx")

  Patient_ID  Age  Gender Smoking_Status Alcohol_Use Family_History  \
0    PID1000   57    Male             No         Yes             No   
1    PID1001   47  Female            Yes          No            Yes   
2    PID1002   59    Male            Yes          No            Yes   
3    PID1003   72  Female            Yes         Yes             No   
4    PID1004   46    Male             No          No             No   

   Blood_Marker_1  Blood_Marker_2  Symptom_Score  Cancer_Diagnosis  
0            5.93           155.1              3                 0  
1            3.90            93.9              3                 1  
2            5.63           114.6              5                 0  
3            3.23            87.0              5                 0  
4            5.36            58.2              6                 0  
Synthetic dataset saved to AI_Alula_CleanedDataset.xlsx
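
Because the label rule combines two marker thresholds with 20% noise, it is worth estimating how often each branch of the rule is expected to fire. The snippet below is a rough sketch using only the distribution parameters from the code above; `normal_tail` is a hypothetical helper, and the calculation ignores the rounding applied to the marker values.

```python
import math


def normal_tail(threshold, mean, std):
    """P(X > threshold) for X ~ Normal(mean, std)."""
    z = (threshold - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2))


# The two markers are drawn independently, so the probabilities multiply.
p_marker1_high = normal_tail(7, mean=5, std=2)
p_marker2_high = normal_tail(130, mean=100, std=20)
p_rule = p_marker1_high * p_marker2_high

# Overall chance of a positive label: 80% when the rule fires, 20% otherwise.
p_positive = 0.8 * p_rule + 0.2 * (1 - p_rule)

# Fraction of positive labels that come from the marker rule rather than noise.
share_rule_driven = 0.8 * p_rule / p_positive

print(f"P(both markers elevated) = {p_rule:.4f}")
print(f"P(Cancer_Diagnosis = 1)  = {p_positive:.4f}")
print(f"Share of positives from the rule = {share_rule_driven:.4f}")
```

Under these assumptions, both markers are elevated for only about 1% of patients, roughly 21% of labels are expected to be positive, and only about 4% of those positives come from the marker rule. A model trained on this dataset should therefore not be expected to recover much signal from the features.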
