# Generating Synthetic Data

One of the hardest things to do in data science is getting access to high-quality datasets that relate to your specific questions. There are any number of reasons why researchers and analysts get incorrect results or misread the answers they're getting, and [bad data practice](https://www.forbes.com/sites/kalevleetaru/2018/02/19/how-bad-data-practice-is-leading-to-bad-research/#5e7cfbad1c35) may be leading to bad research. Some bad data practice themes include:

* Honest Statistical/Computing Error
* Honest Misunderstanding of Data
* Honest Misapplication of Methods
* Honest Failure to Normalize
* Malicious Manipulation
* (made worse through the) Poor citation practices of Copy-Paste Google Scholar-ship.

In this class we're going to learn how to process and analyze data using Python. Since I work in healthcare, we'll be using a tool called [Synthea](https://github.com/synthetichealth/synthea/wiki/Getting-Started), which will help us create consistent, meaningful datasets at scale in a variety of formats (Text, CSV, C-CDA, and FHIR). We'll use these datasets to:

* Learn about basic and advanced Python concepts
* Learn about handling data in Python
* Answer simple and complex questions about the data
* Generate interesting visualizations


# What is Synthea?

[Synthea Getting Started](https://github.com/synthetichealth/synthea/wiki/Getting-Started)

> Synthea™ is a synthetic patient generator that models the medical history of synthetic patients. Our mission is to output high-quality synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare. The resulting data is free from cost, privacy, and security restrictions. It can be used without restriction for a variety of secondary uses in academia, research, industry, and government (although a citation would be appreciated).

> Synthea™ generates synthetic patient records using an agent-based approach. Each synthetic patient is generated independently, as they progress from birth to death through modular representations of various diseases and conditions. Each patient runs through every module in the system. Once a patient dies or the simulation reaches the current day, that patient record can be exported in a number of different formats.

## Synthea Data Formats

Synthea generates these datasets in a variety of commonly used healthcare data formats, including Text, CSV, C-CDA, and FHIR.

### Text

The Text format is a quick, human-readable format that doesn't adhere to any particular standard. Text records are most commonly consumed by humans, such as clinicians or other application end users. The other data formats (CSV, C-CDA, and FHIR) can be easily converted to Text. However, converting Text to any of the other data formats is extremely challenging. We often use Natural Language Processing (NLP) and regular expressions (RegEx) when attempting the Text to (CSV, C-CDA, FHIR) conversion.

**Sample:**
```
Mekhi724 Kemmer911
==================
Race:           White
Ethnicity:      Non-Hispanic
Gender:         F
Age:            33
Birth Date:     1983-11-04
Marital Status: M
--------------------------------------------------------------------------------
ALLERGIES: N/A
--------------------------------------------------------------------------------
MEDICATIONS:
2013-08-22 [CURRENT] : Acetaminophen 160 MG for Acute bronchitis (disorder)
1996-05-12 [CURRENT] : Acetaminophen 160 MG for Acute bronchitis (disorder)
1995-04-13 [CURRENT] : Acetaminophen 160 MG for Acute bronchitis (disorder)
1984-01-14 [CURRENT] : Penicillin V Potassium 250 MG for Streptococcal sore throat (disorder)
--------------------------------------------------------------------------------
CONDITIONS:
2015-10-30 - 2015-11-07 : Fetus with chromosomal abnormality
2015-10-30 - 2015-11-07 : Miscarriage in first trimester
2015-10-30 - 2015-11-07 : Normal pregnancy
2013-08-22 - 2013-09-08 : Acute bronchitis (disorder)
1985-08-07 -            : Food Allergy: Fish
--------------------------------------------------------------------------------
CARE PLANS:
2013-08-22 [STOPPED] : Respiratory therapy
                         Reason: Acute bronchitis (disorder)
                         Activity: Recommendation to avoid exercise
                         Activity: Deep breathing and coughing exercises
--------------------------------------------------------------------------------
OBSERVATIONS:
2014-01-14 : Body Weight                              73.9 kg
2014-01-14 : Body Height                              163.7 cm
2014-01-14 : Body Mass Index                          27.6 kg/m2
2014-01-14 : Systolic Blood Pressure                  133.0 mmHg
2014-01-14 : Diastolic Blood Pressure                 76.0 mmHg
2014-01-14 : Blood Pressure                           2.0 
--------------------------------------------------------------------------------
PROCEDURES:
2015-10-30 : Standard pregnancy test for Normal pregnancy
2014-01-14 : Documentation of current medications
--------------------------------------------------------------------------------
ENCOUNTERS:
2015-11-07 : Encounter for Fetus with chromosomal abnormality
2015-10-30 : Encounter for Normal pregnancy
2014-01-14 : Outpatient Encounter
2013-08-22 : Encounter for Acute bronchitis (disorder)
--------------------------------------------------------------------------------
```
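
To make the Text to structured data problem concrete, here is a minimal sketch that pulls the demographics header and the medication lines out of a record like the one above using regular expressions. The function name `parse_text_record` and both patterns are my own assumptions, inferred from this single sample; the Text format follows no standard, so a real export may need different patterns.

```python
import re

# A shortened copy of the sample record above.
SAMPLE = """\
Mekhi724 Kemmer911
==================
Race:           White
Ethnicity:      Non-Hispanic
Gender:         F
Age:            33
Birth Date:     1983-11-04
Marital Status: M
--------------------------------------------------------------------------------
ALLERGIES: N/A
--------------------------------------------------------------------------------
MEDICATIONS:
2013-08-22 [CURRENT] : Acetaminophen 160 MG for Acute bronchitis (disorder)
1984-01-14 [CURRENT] : Penicillin V Potassium 250 MG for Streptococcal sore throat (disorder)
--------------------------------------------------------------------------------
"""

# Assumed pattern: "Key:   value" lines in the demographics header.
FIELD_RE = re.compile(r"^([A-Z][A-Za-z ]+):[ \t]+(.+)$", re.MULTILINE)
# Assumed pattern: "YYYY-MM-DD [STATUS] : drug for reason" medication lines.
MED_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}) \[(\w+)\] : (.+?) for (.+)$", re.MULTILINE
)


def parse_text_record(text):
    """Hypothetical helper: extract name, demographics, and medications."""
    # Everything before the first dashed separator is the header.
    header = re.split(r"^-{3,}$", text, maxsplit=1, flags=re.MULTILINE)[0]
    demographics = {key.strip(): value.strip()
                    for key, value in FIELD_RE.findall(header)}
    medications = [
        {"date": date, "status": status, "drug": drug, "reason": reason}
        for date, status, drug, reason in MED_RE.findall(text)
    ]
    return {
        "name": header.splitlines()[0].strip(),
        "demographics": demographics,
        "medications": medications,
    }


record = parse_text_record(SAMPLE)
print(record["name"], record["demographics"]["Birth Date"])
for med in record["medications"]:
    print(med["date"], med["drug"])
```

Even this small example leans on details such as the literal word "for" separating the drug from the reason, which is why going from Text to a structured format is fragile compared with the reverse direction.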

### CSV

Comma-Separated Values (CSV) files are common in healthcare; one could argue CSV is one of the three most common data formats, the others being HL7v2 and C-CDA. Unlike the Text format generated by Synthea, which contains a single patient per file, the CSV format contains many patients per file. However, the files themselves are "resource" based: each file holds one kind of record. The resources generated include:

* Patients - patients.csv
* Encounters - encounters.csv
* Allergies - allergies.csv
* Medications - medications.csv
* Conditions - conditions.csv
* Care Plans - careplans.csv
* Observations - observations.csv
* Procedures - procedures.csv
* Immunizations - immunizations.csv

**Sample:**

patients.csv
```CSV
ID,BIRTHDATE,DEATHDATE,SSN,DRIVERS,PASSPORT,PREFIX,FIRST,LAST,SUFFIX,MAIDEN,MARITAL,RACE,ETHNICITY,GENDER,BIRTHPLACE,ADDRESS
5e0d195e-1cd9-494d-8f9a-757c15da2aed,1946-12-14,2015-10-03,999-12-2377,S99962866,false,Mrs.,Miracle267,Ledner332,,Raynor597,M,white,irish,F,Millbury MA,2502 Fisher Manor Boston MA 02132
52082709-06ce-4fde-9c93-cfb4e6542ae1,1968-05-23,,999-17-1808,S99941406,X41451685X,Mrs.,Alda869,Gorczany848,,Funk527,M,white,italian,F,Gardner MA,46973 Velda Gateway Franklin Town MA 02038
8b4c62c8-b116-4b58-9259-466485b0345c,1967-06-22,1985-07-04,999-11-1173,S99955795,,Ms.,Moshe832,Zulauf396,,,,white,english,F,Boston MA,250 Reba Park Carver MA 02330
965c5539-598b-4a9b-a670-e0259667deb8,1934-11-04,2015-06-19,999-63-2195,S99931866,X71888970X,Mr.,Verla554,Roberts329,,,S,white,irish,M,Fall River MA,321 Abdullah Bridge Needham MA 02492
2b28d6c3-9e0c-48d4-99f9-292488133101,1964-08-13,,999-55-5054,S99990374,X68574707X,Ms.,Henderson277,Labadie810,,,S,black,dominican,F,North Attleborough MA,55825 Barrows Prairie Suite 144 Boston MA 02134
```

conditions.csv

```CSV
START,STOP,PATIENT,ENCOUNTER,CODE,DESCRIPTION
1965-10-10,,5e0d195e-1cd9-494d-8f9a-757c15da2aed,918b17f4-e815-44ef-9eeb-41953bbcf7e9,38341003,Hypertension
1966-09-09,,5e0d195e-1cd9-494d-8f9a-757c15da2aed,918b17f4-e815-44ef-9eeb-41953bbcf7e9,15777000,Prediabetes
1988-09-25,,5e0d195e-1cd9-494d-8f9a-757c15da2aed,918b17f4-e815-44ef-9eeb-41953bbcf7e9,239872002,Osteoarthritis of hip
1990-09-01,,5e0d195e-1cd9-494d-8f9a-757c15da2aed,918b17f4-e815-44ef-9eeb-41953bbcf7e9,410429000,Cardiac Arrest
1990-09-01,,5e0d195e-1cd9-494d-8f9a-757c15da2aed,918b17f4-e815-44ef-9eeb-41953bbcf7e9,429007001,History of cardiac arrest (situation)
```

We'll talk more about each of these resources below when we discuss the FHIR data format.
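
Because the files are resource based, answering a question about a patient usually means linking files together. In the samples above, the `PATIENT` column in conditions.csv holds the same value as the `ID` column in patients.csv. Here is a minimal sketch of that join using only the standard library. The rows are copied from the samples above, and the helper name `load_rows` is hypothetical; with real exports we would open the files from disk instead of using inline strings.

```python
import csv
import io
from collections import defaultdict

# Two rows copied from the patients.csv sample above.
PATIENTS_CSV = """\
ID,BIRTHDATE,DEATHDATE,SSN,DRIVERS,PASSPORT,PREFIX,FIRST,LAST,SUFFIX,MAIDEN,MARITAL,RACE,ETHNICITY,GENDER,BIRTHPLACE,ADDRESS
5e0d195e-1cd9-494d-8f9a-757c15da2aed,1946-12-14,2015-10-03,999-12-2377,S99962866,false,Mrs.,Miracle267,Ledner332,,Raynor597,M,white,irish,F,Millbury MA,2502 Fisher Manor Boston MA 02132
52082709-06ce-4fde-9c93-cfb4e6542ae1,1968-05-23,,999-17-1808,S99941406,X41451685X,Mrs.,Alda869,Gorczany848,,Funk527,M,white,italian,F,Gardner MA,46973 Velda Gateway Franklin Town MA 02038
"""

# Two rows copied from the conditions.csv sample above.
CONDITIONS_CSV = """\
START,STOP,PATIENT,ENCOUNTER,CODE,DESCRIPTION
1965-10-10,,5e0d195e-1cd9-494d-8f9a-757c15da2aed,918b17f4-e815-44ef-9eeb-41953bbcf7e9,38341003,Hypertension
1966-09-09,,5e0d195e-1cd9-494d-8f9a-757c15da2aed,918b17f4-e815-44ef-9eeb-41953bbcf7e9,15777000,Prediabetes
"""


def load_rows(csv_text):
    """Hypothetical helper: parse CSV text into a list of dicts keyed by header."""
    return list(csv.DictReader(io.StringIO(csv_text)))


patients = load_rows(PATIENTS_CSV)
conditions = load_rows(CONDITIONS_CSV)

# Group condition descriptions by the patient ID they point at.
conditions_by_patient = defaultdict(list)
for row in conditions:
    conditions_by_patient[row["PATIENT"]].append(row["DESCRIPTION"])

# Attach each patient's conditions to a readable name.
summary = {}
for patient in patients:
    full_name = f'{patient["FIRST"]} {patient["LAST"]}'
    summary[full_name] = conditions_by_patient.get(patient["ID"], [])

for full_name, descriptions in summary.items():
    print(full_name, "->", descriptions or "no recorded conditions")
```

Empty fields such as a missing `DEATHDATE` come back as empty strings from `csv.DictReader`, so any date handling needs to check for that first.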

### C-CDA
The Consolidated Clinical Document Architecture (C-CDA) format is an XML-based standard defined by HL7 that uses templates from a standard library to represent clinical concepts. For more information on C-CDA, see http://www.hl7.org/implement/standards/product_brief.cfm?product_id=258.

**Sample:**

```XML
<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:sdtc="urn:hl7-org:sdtc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 http://xreg2.nist.gov:8080/hitspValidation/schema/cdar2c32/infrastructure/cda/C32_CDA.xsd">
  <realmCode code="US"/>
  <typeId root="2.16.840.1.113883.1.3" extension="POCD_HD000040"/>
  <templateId root="2.16.840.1.113883.10.20.22.1.1" extension="2015-08-01"/>
  <templateId root="2.16.840.1.113883.10.20.22.1.2" extension="2015-08-01"/>
  <id root="2.16.840.1.113883.19.5" extension="47b10305-eb1b-4a47-a27c-24b7e96ee1da" assigningAuthorityName="https://github.com/synthetichealth/synthea"/>
  <code code="34133-9" displayName="Summarization of episode note" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC"/>
  <title>C-CDA R2.1 Patient Record: Augustine565 Cummings51</title>
  <effectiveTime value="20190308211454"/>
  <confidentialityCode code="N"/>
  <languageCode code="en-US"/>
  <recordTarget>
    <patientRole>
      <id root="2.16.840.1.113883.19.5" extension="47b10305-eb1b-4a47-a27c-24b7e96ee1da" assigningAuthorityName="https://github.com/synthetichealth/synthea"/>
      <addr use="HP">
        <streetAddressLine>931 Watsica Lock</streetAddressLine>
        <city>Pittsburgh</city>
        <state>Pennsylvania</state>
        <postalCode>15106</postalCode>
      </addr>
      <telecom nullFlavor="NI"/>
      <patient>
        <name>
          <given>Augustine565</given>
          <family>Cummings51</family>
        </name>
        <administrativeGenderCode code="F" codeSystem="2.16.840.1.113883.5.1" codeSystemName="HL7 AdministrativeGender"/>
        <birthTime value="19690509221454"/>
        <raceCode code="2028-9" displayName="asian" codeSystemName="CDC Race and Ethnicity" codeSystem="2.16.840.1.113883.6.238"/>
        <ethnicGroupCode code="2186-5" displayName="non-hispanic" codeSystemName="CDC Race and Ethnicity" codeSystem="2.16.840.1.113883.6.238"/>
        <languageCommunication>
          <languageCode code="en-US"/>
        </languageCommunication>
      </patient>
    </patientRole>
  </recordTarget>
  <!-- ... -->
</ClinicalDocument>
```
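
Reading a C-CDA document in Python mostly comes down to handling the XML namespace correctly. Every element in the sample above lives in the `urn:hl7-org:v3` namespace, so lookups need a prefix mapping. The sketch below uses the standard library's `xml.etree.ElementTree` on a trimmed copy of the sample; the `hl7` prefix is an arbitrary label I chose, and the element paths are taken from this one sample rather than from the full C-CDA specification.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# A trimmed copy of the sample document above (XML declaration omitted).
CCDA = """\
<ClinicalDocument xmlns="urn:hl7-org:v3">
  <recordTarget>
    <patientRole>
      <addr use="HP">
        <city>Pittsburgh</city>
        <state>Pennsylvania</state>
      </addr>
      <patient>
        <name>
          <given>Augustine565</given>
          <family>Cummings51</family>
        </name>
        <administrativeGenderCode code="F"/>
        <birthTime value="19690509221454"/>
      </patient>
    </patientRole>
  </recordTarget>
</ClinicalDocument>
"""

# Map a prefix of our choosing to the document's default namespace.
NS = {"hl7": "urn:hl7-org:v3"}

root = ET.fromstring(CCDA)
role = root.find("hl7:recordTarget/hl7:patientRole", NS)
patient = role.find("hl7:patient", NS)

given = patient.findtext("hl7:name/hl7:given", namespaces=NS)
family = patient.findtext("hl7:name/hl7:family", namespaces=NS)
gender = patient.find("hl7:administrativeGenderCode", NS).get("code")
city = role.findtext("hl7:addr/hl7:city", namespaces=NS)

# birthTime is a compact timestamp; the first 8 digits are YYYYMMDD.
raw_birth = patient.find("hl7:birthTime", NS).get("value")
birth_date = datetime.strptime(raw_birth[:8], "%Y%m%d").date()

print(given, family, gender, birth_date.isoformat(), city)
```

Forgetting the namespace mapping is the most common stumbling block here: a plain `root.find("recordTarget")` returns `None` for this document.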

### FHIR

[HL7 FHIR](https://www.hl7.org/fhir/) is possibly the most exciting and interesting data format to deal with. Many of the formats discussed above were created at a time when the vast majority of Health IT systems ran on-premises. Those data format and transport protocol standards are extremely reliable, but not so friendly to the web developer.

HL7 FHIR started as a community response to legacy Health IT standards that are hard to work with. In just a few years the community has grown dramatically, and most (if not all) EMR vendors have some support for the standard, which is unheard of in the Health IT space. The primary driver of this rapid pace has been the US government, which has released a number of regulations and mandates for open data access.

> FHIR® – Fast Healthcare Interoperability Resources (hl7.org/fhir) – is a next generation standards framework created by HL7. FHIR combines the best features of HL7’s Version 2, Version 3 and CDA® product lines while leveraging the latest web standards and applying a tight focus on implementability.

Synthea™ currently supports exporting patients as Fast Healthcare Interoperability Resources (FHIR), versions 3.5.0 (R4), 3.0.1 (STU3) and 1.0.2 (DSTU2). FHIR is a standard created by HL7 for exchanging healthcare information electronically. While FHIR supports both XML and JSON, Synthea exports FHIR as JSON only.

**Sample:**
```JSON
{
  "resourceType": "Bundle",
  "type": "transaction",
  "entry": [
    {
      "fullUrl": "urn:uuid:4bd23de9-7d28-48a5-8093-1ac7ff1c64b7",
      "resource": {
        "resourceType": "Patient",
        "id": "4bd23de9-7d28-48a5-8093-1ac7ff1c64b7",
        "text": {
          "status": "generated",
          "div": "<div xmlns=\"http://www.w3.org/1999/xhtml\">Generated by <a href=\"https://github.com/synthetichealth/synthea\">Synthea</a>.Version identifier: v2.4.0-44-g6dbf88c6\n .   Person seed: -1236052134575208584  Population seed: 12345</div>"
        },
        "name": [
          {
            "use": "official",
            "family": "Cummings51",
            "given": [
              "Augustine565"
            ],
            "prefix": [
              "Mrs."
            ]
          },
          {
            "use": "maiden",
            "family": "Cremin516",
            "given": [
              "Augustine565"
            ],
            "prefix": [
              "Mrs."
            ]
          }
        ]
      }
    }
  ]
}
```
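
Since FHIR exports are plain JSON, Python's built-in `json` module is enough to start exploring them. The sketch below loads a trimmed copy of the Bundle above, filters its entries by `resourceType`, and picks out the patient's official and maiden names. The helper names `resources_of_type` and `name_by_use` are hypothetical, written for this example only.

```python
import json

# A trimmed copy of the sample Bundle above.
BUNDLE_JSON = """
{
  "resourceType": "Bundle",
  "type": "transaction",
  "entry": [
    {
      "fullUrl": "urn:uuid:4bd23de9-7d28-48a5-8093-1ac7ff1c64b7",
      "resource": {
        "resourceType": "Patient",
        "id": "4bd23de9-7d28-48a5-8093-1ac7ff1c64b7",
        "name": [
          {"use": "official", "family": "Cummings51",
           "given": ["Augustine565"], "prefix": ["Mrs."]},
          {"use": "maiden", "family": "Cremin516",
           "given": ["Augustine565"], "prefix": ["Mrs."]}
        ]
      }
    }
  ]
}
"""


def resources_of_type(bundle, resource_type):
    """Hypothetical helper: return every resource in the Bundle of one type."""
    return [
        entry["resource"]
        for entry in bundle.get("entry", [])
        if entry.get("resource", {}).get("resourceType") == resource_type
    ]


def name_by_use(patient, use):
    """Hypothetical helper: format the name with the given 'use', or None."""
    for name in patient.get("name", []):
        if name.get("use") == use:
            parts = name.get("prefix", []) + name.get("given", []) + [name["family"]]
            return " ".join(parts)
    return None


bundle = json.loads(BUNDLE_JSON)
patients = resources_of_type(bundle, "Patient")
official = name_by_use(patients[0], "official")
maiden = name_by_use(patients[0], "maiden")

print(official)
print(maiden)
```

Filtering on `resourceType` matters because a full Synthea Bundle contains many other entries for the same patient, and the same filter works for any of them.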