# 🧪 Synthetic Oral Cancer Dataset Generator

**Author:** Pranshu Goyal  
**Date:** 2025-08-17  
**Python:** >=3.10  
**Dependencies:** `fpdf2` (third-party); `random`, `datetime`, `time` (standard library)  

---

## 📖 Abstract
This project implements a **synthetic dataset generator** for oral/oropharyngeal cancer research, with emphasis on **biomarkers from saliva, blood, and urine** and **clinical note simulation**.  
It provides a scalable method to generate **structured** (demographics, biomarkers) and **unstructured** (admission, pathology, discharge notes) patient records for training and benchmarking AI/ML healthcare systems.  

The system compiles all generated records into a **single PDF document**, simulating an anonymized hospital dataset of up to **10,000 patients**.  
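
To make the "structured" side of a record concrete, here is a minimal sketch that returns a biomarker panel as rows of data rather than formatted text. It reuses the midpoint rule that the generator code in this notebook applies for its `[HIGH]` flag; the names `RANGES` and `sample_panel` are hypothetical and exist only for this illustration, and the two ranges are copied from the generator's biomarker pool.

```python
import random

# Hypothetical subset of (low, high) sampling ranges, copied from the generator's pool.
RANGES = {"saliva_il_6": (5.0, 20.0), "blood_albumin": (3.4, 5.4)}


def sample_panel(ranges, rng):
    """Return one structured row per marker instead of a formatted text report."""
    rows = []
    for marker, (low, high) in ranges.items():
        value = round(rng.uniform(low, high), 2)
        midpoint = low + (high - low) * 0.5
        # A value in the upper half of its sampling range is flagged as high.
        rows.append({"marker": marker, "value": value, "high": value > midpoint})
    return rows


panel = sample_panel(RANGES, random.Random(0))  # seeded so repeated runs match
for row in panel:
    print(row)
```

Rows in this form can be loaded into a table or data frame directly, which may be more convenient for ML benchmarking than parsing the values back out of the PDF text.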

---

## ⚙️ Features
- **Synthetic Clinical API**  
  - Generates de-identified demographics (Patient ID, MRN, Name, Age, Gender, Ethnicity).  
  - Creates biomarker reports across saliva, blood, and urine with units/threshold flags.  
  - Produces unstructured clinical documentation:  
    - Admission Note  
    - Pathology Report  
    - Discharge Summary  

- **PDF Export**  
  - Uses `fpdf2` to compile all records into a well-structured, paginated PDF.  
  - Custom headers/footers for readability and research usability.  

- **Scalability**  
  - Designed for **10,000 records** (default) but can be scaled down for quick testing.  
  - Progress updates every 100 records for monitoring.  
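
The scalability behaviour listed above can be sketched with a lazy generator: records are produced one at a time, progress is printed at a fixed interval, and a quick test consumes only a slice of the full run. This is a simplified stand-in, not the notebook's generator; `generate_records`, its `progress_every` and `seed` parameters, and the 30 to 85 age range are assumptions made for the example.

```python
import random
from itertools import islice


def generate_records(num_records, progress_every=100, seed=None):
    """Yield minimal placeholder records lazily, reporting progress periodically."""
    rng = random.Random(seed)  # a fixed seed makes a test run repeatable
    for i in range(1, num_records + 1):
        yield {"patient_id": f"PAT{i:07d}", "age": rng.randint(30, 85)}
        if i % progress_every == 0:
            print(f"  ...generated {i}/{num_records} records...")


# Quick test: consume only the first 250 records of a 10,000-record run.
sample = list(islice(generate_records(10_000, seed=42), 250))
print(len(sample), sample[0]["patient_id"], sample[-1]["patient_id"])
```

Because the generator is lazy, stopping early costs nothing for the records that were never requested, so the configured record count does not need to change for a smoke test.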

In [None]:
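# ============================
# 📦 Imports
# ============================

import random
import time
from datetime import datetime, timedelta

from fpdf import FPDF  # provided by the fpdf2 package


# ============================
# 🔧 Configuration
# ============================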

# Number of synthetic patient records to generate
# NOTE: 10,000 records can be computationally heavy, so use smaller numbers for demo/testing.
NUM_RECORDS_TO_GENERATE = 10000  

# Output file to save generated dataset
OUTPUT_FILENAME = "synthetic_oral_cancer_dataset.pdf"


# ============================
# 🧪 Synthetic Clinical Data API
# ============================

class SyntheticClinicalDataAPI:
    """
    A simulated API class for generating synthetic, de-identified clinical data.
    
    - Produces structured demographics (name, age, gender, ethnicity, etc.)
    - Produces biomarker panels for saliva, blood, urine
    - Produces unstructured text notes (Admission, Pathology, Discharge)

    Mimics interaction with a real-world generative clinical data API.
    """

    def __init__(self):
        # --- Internal Data Pools for Randomized Generation ---
        # Names, genders, ethnicities, symptoms, risk factors, biomarkers, diagnoses, and locations
        self._first_names = [                                                       # Predefined pool of common first names
            'James', 'Mary', 'John', 'Patricia', 'Robert', 'Jennifer', 'Michael', 'Linda', 'William', 'Elizabeth',
            'David', 'Barbara', 'Richard', 'Susan', 'Joseph', 'Jessica', 'Thomas', 'Sarah', 'Charles', 'Karen',
            'Christopher', 'Nancy', 'Daniel', 'Lisa', 'Matthew', 'Margaret', 'Anthony', 'Betty', 'Donald', 'Sandra',
            'Mark', 'Ashley', 'Paul', 'Kimberly', 'Steven', 'Emily', 'Andrew', 'Donna', 'Kenneth', 'Michelle',
            'Joshua', 'Dorothy', 'Kevin', 'Carol', 'Brian', 'Amanda', 'George', 'Melissa', 'Edward', 'Deborah',
            'Ronald', 'Stephanie', 'Timothy', 'Rebecca', 'Jason', 'Laura', 'Jeffrey', 'Sharon', 'Ryan', 'Cynthia',
            'Gary', 'Kathleen', 'Jacob', 'Amy', 'Nicholas', 'Shirley', 'Eric', 'Angela', 'Jonathan', 'Helen',
            'Stephen', 'Anna', 'Larry', 'Brenda', 'Justin', 'Pamela', 'Scott', 'Nicole', 'Brandon', 'Emma',
            'Benjamin', 'Samantha', 'Samuel', 'Katherine', 'Gregory', 'Christine', 'Frank', 'Debra', 'Alexander', 'Rachel',
            'Patrick', 'Catherine', 'Raymond', 'Carolyn', 'Jack', 'Janet', 'Dennis', 'Ruth', 'Jerry', 'Maria',
            'Tyler', 'Heather', 'Aaron', 'Diane', 'Jose', 'Virginia', 'Henry', 'Julie', 'Douglas', 'Joyce'
        ]
        self._last_names = [                                                        # Predefined pool of common last names
            'Smith', 'Johnson', 'Williams', 'Brown', 'Jones', 'Garcia', 'Miller', 'Davis', 'Rodriguez', 'Martinez',
            'Hernandez', 'Lopez', 'Gonzalez', 'Wilson', 'Anderson', 'Thomas', 'Taylor', 'Moore', 'Jackson', 'Martin',
            'Lee', 'Perez', 'Thompson', 'White', 'Harris', 'Sanchez', 'Clark', 'Ramirez', 'Lewis', 'Robinson',
            'Walker', 'Young', 'Allen', 'King', 'Wright', 'Scott', 'Torres', 'Nguyen', 'Hill', 'Flores',
            'Green', 'Adams', 'Nelson', 'Baker', 'Hall', 'Rivera', 'Campbell', 'Mitchell', 'Carter', 'Roberts'
        ]
        self._genders = ['Male', 'Female']
        self._ethnicities = [
            "WHITE", "BLACK/AFRICAN", "HISPANIC", "ASIAN", "MIDDLE EASTERN", "NATIVE AMERICAN", "OTHER", "UNKNOWN"
        ]
        self._complaints = [                                                        # Common presenting complaints in oral cancers
            "a non-healing ulcer on the lateral border of the tongue", "a persistent sore throat and dysphagia",
            "a new lump in the neck", "difficulty chewing and a poorly fitting denture",
            "a red and white patch (erythroplakia) on the floor of the mouth", "unexplained bleeding from the mouth"
        ]
        self._risk_factors = [                                                      # Epidemiological risk factors
            "long-term heavy tobacco use (2 PPD x 30 years)", "significant alcohol consumption",
            "positive history for HPV-16 infection", "poor oral hygiene and chronic irritation",
            "family history of head and neck cancers", "long-term smokeless tobacco use"
        ]
        self._biomarkers = {                                                        # (low, high) sampling range per marker; values above the midpoint are flagged HIGH
            "saliva_mirna_21": (1.5, 4.0), "saliva_cyfra_21_1": (3.5, 10.0), "blood_scc_ag": (2.0, 8.0),
            "urine_creatinine": (0.6, 1.2), "urine_urea": (20, 40), "blood_albumin": (3.4, 5.4),
            "saliva_il_6": (5.0, 20.0), "blood_cea": (3.0, 10.0)
        }
        self._diagnosis = [                                                        # Histological types of oral cancers
            "Moderately differentiated Squamous Cell Carcinoma (SCC)", "Poorly differentiated Squamous Cell Carcinoma (SCC)",
            "Well-differentiated Squamous Cell Carcinoma (SCC), HPV-16 positive", "Verrucous Carcinoma"
        ]
        self._locations = [                                                         # Common anatomical sites for oral cancers
            "right lateral tongue", "floor of mouth", "left tonsillar pillar",
            "soft palate", "buccal mucosa", "retromolar trigone"
        ]

    # --- Demographics Generator ---
    def _generate_demographics(self, patient_id):
        """Generates a demographic block (DOB, Age, MRN, Gender, Ethnicity)."""
        dob = datetime(1940, 1, 1) + timedelta(days=random.randint(0, 20000))
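        # Calendar-based age: subtract one year if this year's birthday has not happened yet
        today = datetime.now()
        age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))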
        return {
            "patient_id": f"PAT{patient_id:07d}",   # Unique patient ID
            "mrn": f"{random.randint(1000000, 9999999)}",  # Random MRN
            "name": f"{random.choice(self._last_names)}, {random.choice(self._first_names)}",
            "dob": dob.strftime("%Y-%m-%d"),
            "age": age,
            "gender": random.choice(self._genders),
            "ethinicity": random.choice(self._ethnicities),
        }

    # --- Biomarker Report Generator ---
    def _generate_biomarker_report(self):
        """
        Creates a biomarker panel with realistic units and random values 
        flagged as 'HIGH' if above threshold.
        """
        report = "BIOMARKER ANALYSIS REPORT (Saliva/Blood/Urine Panel):\n"
        for marker, (low, high) in self._biomarkers.items():
            value = round(random.uniform(low, high), 2)  # Random within range
            # Choose unit depending on biomarker type
            if "ag" in marker or "cea" in marker or "cyfra" in marker:
                unit = "ng/mL"
            elif "mirna" in marker:
                unit = "fold change"
            elif "creatinine" in marker or "urea" in marker:
                unit = "mg/dL"
            elif "albumin" in marker:
                unit = "g/dL"
            else:
                unit = "pg/mL"
            # Flag values that fall in the upper half of the (low, high) range
            flag = " [HIGH]" if value > low + (high - low) * 0.5 else ""
            report += f"  - {marker.replace('_', ' ').title()}: {value} {unit}{flag}\n"
        # Add interpretation text
        report += ("\nINTERPRETATION: Panel results are highly suggestive of "
                   "cellular stress and turnover consistent with malignancy. "
                   "Elevated salivary and serum markers point towards a head and neck primary.")
        return report

    # --- Unstructured Clinical Notes ---
    def _generate_unstructured_note(self, patient_info, note_type):
        """
        Generates Admission Notes, Pathology Reports, or Discharge Summaries
        with random but medically coherent content.
        """
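        # Draw the shared clinical facts from a per-patient RNG so that the
        # admission note, pathology report, and discharge summary of one
        # patient agree on admission date, lesion site, and diagnosis.
        rng = random.Random(f"{patient_info['patient_id']}-{patient_info['mrn']}-{patient_info['dob']}")
        admission_date = datetime(2022, 1, 1) + timedelta(days=rng.randint(0, 700))
        location = rng.choice(self._locations)
        diagnosis = rng.choice(self._diagnosis)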

        # --- Common Header Block ---
        header = f"ADMISSION DATE: {admission_date.strftime('%Y-%m-%d')}     REPORT DATE: {(admission_date + timedelta(days=random.randint(1, 5))).strftime('%Y-%m-%d')}\n"
        header += f"PATIENT: {patient_info['name']}     MRN: {patient_info['mrn']}\n"
        header += "-" * 70 + "\n"

        # --- Admission Note ---
        if note_type == "Admission Note":
            # Structured sections: Chief Complaint, HPI, Examination, Assessment
            note = header + "CHIEF COMPLAINT: Patient is a "
            note += f"{patient_info['age']}-year-old {patient_info['gender'].lower()} presenting with {random.choice(self._complaints)} for the past {random.randint(4, 12)} weeks.\n\n"
            note += "HISTORY OF PRESENT ILLNESS (HPI):\n"
            note += f"Patient reports progressive growth and pain. Relevant risk factor: {random.choice(self._risk_factors)}.\n\n"
            note += "PHYSICAL EXAMINATION:\n"
            note += f"OROPHARYNX: Lesion on {location}. Firm to palpation. No bleeding.\n"
            note += "NECK: Palpable lymphadenopathy.\n\n"
            note += "ASSESSMENT AND PLAN:\n"
            note += f"1. Suspicion for {diagnosis}.\n"
            note += "2. Biopsy planned + biomarker panel.\n"
            return note

        # --- Pathology Report ---
        if note_type == "Pathology Report":
            note = header + f"SPECIMEN SOURCE: Biopsy of {location} lesion.\n\n"
            note += "MICROSCOPIC DESCRIPTION:\n"
            note += f"Histology shows features consistent with {diagnosis}.\n\n"
            note += f"FINAL DIAGNOSIS:\n{diagnosis.upper()}."
            return note

        # --- Discharge Summary ---
        if note_type == "Discharge Summary":
            note = header + f"DISCHARGE DIAGNOSIS: {diagnosis}.\n\n"
            note += "HOSPITAL COURSE:\n"
            note += "Biopsy confirmed diagnosis. PET/CT staging positive. Case discussed at tumor board.\n\n"
            note += "DISCHARGE PLAN:\nFollow-up with oncology; start chemoradiation.\n"
            return note

        return ""

    # --- Main Patient Record Generator ---
    def generate_patient_records(self, num_records: int):
        """
        Generator that yields synthetic patient records with:
        - demographics
        - biomarker reports
        - clinical notes (Admission, Pathology, Discharge)
        """
        print(f"API call received: Generating {num_records} synthetic patient records...")
        for i in range(1, num_records + 1):
            demographics = self._generate_demographics(i)
            biomarker_report = self._generate_biomarker_report()
            
            # Sequential notes for realism
            admission_note = self._generate_unstructured_note(demographics, "Admission Note")
            pathology_report = self._generate_unstructured_note(demographics, "Pathology Report")
            discharge_summary = self._generate_unstructured_note(demographics, "Discharge Summary")
            
            yield {
                "demographics": demographics,
                "biomarker_report": biomarker_report,
                "notes": [admission_note, pathology_report, discharge_summary]
            }

            # Print progress every 100 records
            if i % 100 == 0:
                print(f"  ...generated {i}/{num_records} records...")


# ============================
# 📄 PDF Writer Class
# ============================

class PDF(FPDF):
    """Subclass of FPDF with custom header and footer."""
    
    def header(self):
        self.ln(1)  # minimal spacing

    def footer(self):
        self.set_y(-15)  # Position footer 1.5 cm from bottom
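        # 'Helvetica' is the core font that fpdf2 substitutes for 'Arial';
        # naming it directly and passing align by keyword avoids deprecation warnings.
        self.set_font('Helvetica', 'I', 8)
        self.cell(0, 10, 'Page ' + str(self.page_no()) + '/{nb}', align='C')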


# ============================
# 📦 Dataset Creation Pipeline
# ============================

def create_dataset_pdf(api, num_records, output_filename):
    """
    Generates synthetic records via API and writes them to a single structured PDF.
    """
    start_time = time.time()
    
    # Initialize PDF document
    pdf = PDF()
    pdf.set_auto_page_break(auto=True, margin=15)
    pdf.alias_nb_pages()

    # Iterate over generated patient records
    record_generator = api.generate_patient_records(num_records)
    for record in record_generator:
        pdf.add_page()
        
        # --- Patient Header ---
        demographics = record['demographics']
        pdf.set_font('Courier', 'B', 12)
        # new_x/new_y replace the deprecated ln=1 argument in fpdf2
        pdf.cell(0, 8, f"--- PATIENT RECORD: {demographics['patient_id']} ---", new_x="LMARGIN", new_y="NEXT", align='C')
        pdf.set_font('Courier', '', 10)
        pdf.multi_cell(0, 5, 
            f"MRN: {demographics['mrn']} | NAME: {demographics['name']} | "
            f"AGE: {demographics['age']} | GENDER: {demographics['gender']} | ETHNICITY: {demographics['ethinicity']}"
        )
        pdf.ln(5)

        # --- Biomarker Report ---
        pdf.set_font('Courier', 'B', 11)
        pdf.cell(0, 8, "Biomarker Analysis", new_x="LMARGIN", new_y="NEXT")
        pdf.set_font('Courier', '', 9)
        pdf.multi_cell(0, 5, record['biomarker_report'])
        pdf.ln(5)

        # --- Clinical Notes (Admission, Pathology, Discharge) ---
        for note in record['notes']:
            pdf.set_font('Courier', 'B', 11)
            # Infer type of note
            if "CHIEF COMPLAINT" in note:
                note_title = "Admission Note"
            elif "SPECIMEN SOURCE" in note:
                note_title = "Pathology Report"
            else:
                note_title = "Discharge Summary"
            
            pdf.cell(0, 8, f"Clinical Note: {note_title}", new_x="LMARGIN", new_y="NEXT")
            pdf.set_font('Courier', '', 9)
            pdf.multi_cell(0, 5, note)
            pdf.ln(8)

    # Save PDF
    print("\nWriting data to PDF. This may take a moment...")
    pdf.output(output_filename)
    
    # Log execution time
    end_time = time.time()
    print("-" * 50)
    print("Dataset Generation Complete!")
    print(f"Successfully generated '{output_filename}' with {num_records} records.")
    print(f"Total time taken: {end_time - start_time:.2f} seconds.")
    print("-" * 50)


# ============================
# 🚀 Main Execution
# ============================

if __name__ == "__main__":
    # Initialize synthetic API
    api_client = SyntheticClinicalDataAPI()
    
    # Generate dataset and compile into PDF
    create_dataset_pdf(api_client, NUM_RECORDS_TO_GENERATE, OUTPUT_FILENAME)

API call received: Generating 10000 synthetic patient records...


  ...generated 100/10000 records...
  ...generated 200/10000 records...
  ...generated 300/10000 records...
  ...generated 400/10000 records...
  ...generated 500/10000 records...
  ...generated 600/10000 records...
  ...generated 700/10000 records...
  ...generated 800/10000 records...
  ...generated 900/10000 records...
  ...generated 1000/10000 records...
  ...generated 1100/10000 records...
  ...generated 1200/10000 records...
  ...generated 1300/10000 records...
  ...generated 1400/10000 records...
  ...generated 1500/10000 records...
  ...generated 1600/10000 records...
  ...generated 1700/10000 records...
  ...generated 1800/10000 records...
  ...generated 1900/10000 records...
  ...generated 2000/10000 records...
  ...generated 2100/10000 records...
  ...generated 2200/10000 records...
  ...generated 2300/10000 records...
  ...generated 2400/10000 records...
  ...generated 2500/10000 records...
  ...generated 2600/10000 records...
  ...generated 2700/10000 records...
  ...gener