# "Synthesize Medical Data 2"

> Second attempt
- toc: true
- branch: master
- badges: false
- comments: true
- hide: false
- search_exclude: true
- image: images/SurvivalAnalysis_lifelines.png
- categories: [Medical_Industry,  Medical_Data_Synthesis,  faker]
- show_tags: true

This is a second attempt to synthesize the data. The output of this notebook represents the "true" data, which should then be:

* contaminated with noise, e.g. 
    * for values in text features
        * dropping/inserting a character here and there, using variants of names, etc 
    * for values generally
        * make some NaNs
* split into different source datasets, for example one each for *patients*, *encounters*, *conditions*, and *medications*. It may be useful to keep a 'phantom id' and distribute it to each fragment, so that correct reassembly can be verified and errors possibly reported.
    * the SSN is a handy id to tie all tables together, but it may be deliberately left out of some tables to make joining more challenging (and to simulate reality, where SSNs are sometimes not included)

* Once this is done, the different datasets can be subjected to a data science pipeline to combine (join) again and perform machine learning.
    * will use fuzzy matching
    * also traditional joining techniques
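
The plan above can be sketched on a toy scale. This is only an illustration under several assumptions: the helper `add_typo`, the `phantom_id` column name, the three-row frame, and the use of the standard library's `difflib` for fuzzy matching are all hypothetical choices, not the pipeline this notebook will eventually use.

```python
import difflib
import random

import numpy as np
import pandas as pd

rng = random.Random(42)  # fixed seed so the sketch is repeatable


def add_typo(text, rng):
    """Drop or insert one random character (hypothetical noise helper)."""
    if not text:
        return text
    pos = rng.randrange(len(text))
    if rng.random() < 0.5:
        return text[:pos] + text[pos + 1:]  # drop a character
    return text[:pos] + rng.choice("abcdefghijklmnopqrstuvwxyz") + text[pos:]  # insert one


# Tiny stand-in for the "true" data (illustrative rows only).
true_df = pd.DataFrame({
    "pat_ssn": ["752-49-0206", "518-31-3638", "718-30-9548"],
    "pat_name": ["Hannah Harvey", "Casey Barnes", "Shawn Lane"],
    "con_name": ["Constipation, Chronic", "Testicular Cancer", "GERD"],
    "med_name": ["lactulose", "dactinomycin", "ranitidine"],
})

# The phantom id travels with every fragment so reassembly can be checked later.
true_df["phantom_id"] = range(len(true_df))

# Split into source tables; the conditions table deliberately omits the SSN.
patients = true_df[["phantom_id", "pat_ssn", "pat_name"]].copy()
conditions = true_df[["phantom_id", "pat_name", "con_name", "med_name"]].copy()

# Contaminate: one typo per name, and one missing medication.
conditions["pat_name"] = [add_typo(name, rng) for name in conditions["pat_name"]]
conditions.loc[1, "med_name"] = np.nan

# Fuzzy-match each noisy name to the closest clean name.
clean_names = patients["pat_name"].tolist()


def best_match(name):
    hits = difflib.get_close_matches(name, clean_names, n=1, cutoff=0.6)
    return hits[0] if hits else None


conditions["matched_name"] = conditions["pat_name"].map(best_match)

# Rejoin on the matched name, then use the phantom id to count wrong assemblies.
lookup = patients.rename(columns={"phantom_id": "phantom_id_true", "pat_name": "clean_name"})
rejoined = conditions.merge(lookup, left_on="matched_name", right_on="clean_name", how="left")
n_errors = int((rejoined["phantom_id"] != rejoined["phantom_id_true"]).sum())
print(f"wrongly assembled rows: {n_errors}")
```

In a real run the phantom id would be hidden from the joining step and only consulted afterwards to score it, as it is here.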

In [1]:
# !pip install faker

Collecting faker
  Downloading Faker-13.4.0-py3-none-any.whl (1.5 MB)
Installing collected packages: faker
Successfully installed faker-13.4.0


In [7]:
import faker
#- faker.__version__

In [12]:
import pandas as pd
pd.__version__

'1.4.2'

In [15]:
import numpy as np
np.__version__

'1.22.3'

## 1 DATA SOURCES

### 1.1 Definitions

Words and terminology are often overloaded with a wide variety of meanings, so context is everything. Here are some formal definitions (in the *medical* context) to get us all on the same page:

#### 1.1.1 Patient
We all know what a patient is - most of us have been patients before ... ;-)

#### 1.1.2 Encounter
Medical encounter means an encounter between a client and a physician, physician assistant, podiatrist, ophthalmologist, optometrist, chiropractor, advanced practice registered nurse, or nurse midwife, for the purpose of diagnosis or treatment of an illness or injury.
- https://www.lawinsider.com/dictionary/medical-encounter#:~:text=Medical%20encounter%20means%20an%20encounter,Sample%201

#### 1.1.3 Condition
A medical condition is a broad term that includes all diseases, lesions, and disorders.
While the term medical condition generally includes mental illnesses, in some contexts the term is used specifically to denote any illness, injury, or disease except for mental illnesses. The Diagnostic and Statistical Manual of Mental Disorders (DSM) uses the term "general medical condition" to refer to all diseases, illnesses, and injuries except for mental disorders.
As it is more value-neutral than terms like disease, people sometimes prefer the term "medical condition."
The term medical condition is also a synonym for medical state, which describes an individual patient's current state from a medical standpoint.
- https://pallipedia.org/medical-condition/

#### 1.1.4 Medication
Medication is defined as the process of treating an illness with medicine. It refers to the administration or application of medicine to remedy an illness or injury. It may also refer to the chemical substance, either natural or synthetic, which has a pharmacologic effect on the body.
- http://www.differencebetween.net/science/health/drugs-health/difference-between-medicine-and-medication/

### 1.2 Condition-Medication mappings
This is a many-to-many mapping. The file *condition_medication.csv* is an example of a 'starter file'. It stands in for live queries against this API, which are paid for and therefore not ideal when many tests will be run (unless someone wants to volunteer their credit card number). The file should be uploaded to S3, where it waits for someone to push the button that starts the assembly.
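
To make "many-to-many" concrete, here is a minimal sketch on a hand-built frame. The five rows reuse disease/drug pairs that appear in this notebook's outputs, but the frame itself and the variable names are assumptions for illustration; the real file has many more rows.

```python
import pandas as pd

# Hypothetical mini version of condition_medication.csv.
pairs = pd.DataFrame({
    "disease": [
        "Abdominal Distension", "Abdominal Distension", "Abdominal Distension",
        "GERD", "Zollinger-Ellison Syndrome",
    ],
    "drug": ["bethanechol", "pamabrom", "bethanechol", "ranitidine", "ranitidine"],
})

# Repeated (disease, drug) pairs would be over-represented when sampling rows,
# so this sketch drops them first.
pairs = pairs.drop_duplicates()

# One disease can map to several drugs, and one drug to several diseases.
drugs_per_disease = pairs.groupby("disease")["drug"].nunique()
diseases_per_drug = pairs.groupby("drug")["disease"].nunique()
is_many_to_many = bool(drugs_per_disease.max() > 1 and diseases_per_drug.max() > 1)

print(drugs_per_disease)
print(diseases_per_drug)
```

Whether duplicates should be dropped depends on intent: keeping them makes a uniform row sample favor the repeated pairs.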

In [49]:
df_con_med = pd.read_csv('condition_medication.csv')
df_con_med

Unnamed: 0,api_num,disease,drug
0,0,Alkylating Agent Cystitis,sodium bicarbonate
1,1,Alkylating Agent Cystitis,citric acid / sodium citrate
2,2,Abdominal Distension,bethanechol
3,3,Abdominal Distension,pamabrom
4,4,Abdominal Distension,bethanechol
...,...,...,...
14678,14677,Zollinger-Ellison Syndrome,lansoprazole
14679,14678,Zollinger-Ellison Syndrome,ranitidine
14680,14679,Zollinger-Ellison Syndrome,rabeprazole
14681,14680,Zollinger-Ellison Syndrome,cimetidine


In [17]:
# 
# test sampling against this file
con_med_sample = df_con_med.sample()
con_med_sample

Unnamed: 0,api_num,disease,drug
13651,13656,Thyrotoxicosis,propranolol


In [18]:
con_med_sample['disease'].values[0]

'Thyrotoxicosis'

In [19]:
con_med_sample['drug'].values[0]

'propranolol'

## 2 SYNTHESIZE DATA
Still need to address the setting of a random seed to ensure reproducibility. TODO.

In [20]:
# 
# setup Faker
fake = faker.Faker()

In [23]:
# 
# test a profile
profile = fake.profile()
profile

{'job': 'International aid/development worker',
 'company': 'Rasmussen PLC',
 'ssn': '787-55-0662',
 'residence': '22522 Hendricks Unions\nParsonsmouth, KS 95739',
 'current_location': (Decimal('5.960006'), Decimal('120.784558')),
 'blood_group': 'AB+',
 'website': ['https://www.todd.com/',
  'http://lewis-murillo.com/',
  'https://chang-schultz.com/'],
 'username': 'rachel21',
 'name': 'Erin Smith',
 'sex': 'F',
 'address': '882 Nicholas Mountain\nSmithberg, NM 79695',
 'mail': 'kowens@hotmail.com',
 'birthdate': datetime.date(1992, 6, 24)}

In [59]:
# hide
# import random
# from dateutil.relativedelta import relativedelta
# prof = fake.profile()
# print(prof['birthdate'])
# print(prof['birthdate'] + relativedelta(years=random.randint(10, 20)))

2014-02-02
2024-02-02


In [90]:
# 
# create a list of dicts with fake values for patients (converted to a dataframe below)
def make_patients(num):
    # value lists to sample from for each encounter
    practitioner_roles = ['physician', 'physician assistant', 'registered nurse', 'advanced practice registered nurse']
    encounter_types = ['first visit', 'checkup', 'refill']
    fake_patients = []
    for x in range(num):
        prof = fake.profile()
        prof_prac = fake.profile()
        con_med_sample = df_con_med.sample()
        fake_patients.append({
            'pat_ssn':prof['ssn'],
            'pat_name':prof['name'],
            'pat_address':prof['address'], 
            'pat_username':prof['username'],
            'pat_sex':prof['sex'],
            'pat_birthdate':prof['birthdate'],
            'pat_blood_group':prof['blood_group'],
            'pat_email':prof['mail'],
            'pat_job':prof['job'],
            'pat_company':prof['company'],
            'enc_date':fake.date_between(start_date='-30y', end_date='today'),
            'enc_practitioner_name':prof_prac['name'],
            'enc_prac_role':np.random.choice(practitioner_roles, p=[0.50, 0.30, 0.10, 0.10]),
            'enc_type':np.random.choice(encounter_types, p=[0.20, 0.40, 0.40]),
            'con_name':con_med_sample['disease'].values[0],
            'con_diag_date':fake.date_between(start_date='-30y', end_date='today'), #needs to become more realistic
            'med_name':con_med_sample['drug'].values[0],
        })
    return fake_patients

patients_df = pd.DataFrame(make_patients(num=10000))
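
As written, `enc_date` and `con_diag_date` are drawn independently of `pat_birthdate`, so a diagnosis or encounter can fall before the patient was born. A minimal sketch of one way to make the dates consistent follows. The helper names and the ordering rule (birth, then diagnosis, then encounter) are assumptions for illustration; in reality a first visit may precede the diagnosis.

```python
import datetime
import random


def random_date(start, end, rng):
    """Return a uniformly random date with start <= date <= end."""
    span_days = (end - start).days
    return start + datetime.timedelta(days=rng.randint(0, span_days))


def consistent_dates(birthdate, rng, today=None):
    """Draw dates so that birthdate <= diagnosis <= encounter <= today."""
    today = today or datetime.date.today()
    con_diag_date = random_date(birthdate, today, rng)
    enc_date = random_date(con_diag_date, today, rng)
    return con_diag_date, enc_date


rng = random.Random(0)                      # fixed seed for repeatability
birth = datetime.date(1992, 6, 24)          # birthdate from the sample profile
cutoff = datetime.date(2022, 4, 1)          # hypothetical "today" for the sketch
diag, enc = consistent_dates(birth, rng, today=cutoff)
print(birth, diag, enc)
```

Inside `make_patients`, the two `fake.date_between(...)` calls could be replaced by one call to such a helper, passing `prof['birthdate']`.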

In [91]:
patients_df.shape

(10000, 17)

In [93]:
pd.set_option('display.max_rows', None)
patients_df[:200]

Unnamed: 0,pat_ssn,pat_name,pat_address,pat_username,pat_sex,pat_birthdate,pat_blood_group,pat_email,pat_job,pat_company,enc_date,enc_practitioner_name,enc_prac_role,enc_type,con_name,con_diag_date,med_name
0,752-49-0206,Hannah Harvey,"860 Smith Point Suite 687\nSouth Jeffrey, MN 1...",bradleyjames,F,1958-12-14,AB+,davisalan@gmail.com,Commercial art gallery manager,Hart Group,2015-04-22,Jose Valencia,physician,checkup,"Constipation, Chronic",2001-11-22,lactulose
1,518-31-3638,Casey Barnes,"389 Yvette Lock\nNorth Mariechester, OR 72680",fevans,F,1935-06-05,B+,meganmoses@hotmail.com,Youth worker,Warren-White,2003-02-18,Tony Rodriguez,physician assistant,refill,Testicular Cancer,2021-04-02,dactinomycin
2,718-30-9548,Shawn Lane,"228 Duke Passage Apt. 102\nJosephfort, SD 10909",zsnyder,M,1995-08-23,AB+,kurtduke@hotmail.com,IT technical support officer,Jones Group,1997-03-13,Sandra Weaver,physician,first visit,GERD,2003-10-07,ranitidine
3,427-69-0299,Kelly Cobb,"35085 Hutchinson Avenue\nEast Michael, NY 76633",sarah66,F,1960-01-30,O-,christopher98@gmail.com,Merchant navy officer,"English, Harper and Nielsen",1997-10-17,Brandon Jones,physician,refill,Venography,1992-12-30,iodixanol
4,656-34-0036,Taylor Bennett,"9021 Sierra Forges Suite 597\nLake Thomas, WV ...",mollyramirez,F,1970-12-16,AB+,sanchezanna@hotmail.com,Garment/textile technologist,Murphy and Sons,2014-02-13,Mark Garcia,physician,first visit,Bipolar Disorder,1998-02-12,topiramate
5,256-15-7607,Kayla Hoffman,"4291 Patel Alley Suite 976\nOlsonfurt, TN 03067",silvacrystal,F,1934-09-15,B-,cwilliamson@yahoo.com,"Lecturer, further education",Wallace-Martinez,1993-12-05,Douglas Simmons,physician assistant,refill,Multiple Myeloma,1996-02-04,melphalan
6,411-92-2086,Tracy Burch,"7639 Greene Flats\nMatthewsside, GA 79654",barbaraclark,F,1914-07-10,O-,joerodriguez@gmail.com,Fashion designer,"Myers, Cain and Beck",2014-05-02,Carrie Bauer,registered nurse,refill,Iron Deficiency Anemia,2015-04-05,iron polysaccharide
7,140-99-4662,John Sawyer,"7596 Abigail Extensions\nDixonmouth, NM 45001",taylorrice,M,1957-09-18,B+,gregorygrant@yahoo.com,Horticultural consultant,"Barnes, Quinn and Kennedy",2003-06-19,Jerry Peterson,physician assistant,checkup,Osteoarthritis,2010-10-05,aspirin
8,338-28-0776,Terry Perez,2068 Williams Haven Apt. 748\nPort Christineto...,gloria03,M,1906-10-24,A-,laurie70@hotmail.com,Professor Emeritus,"Walker, Walter and Blake",2019-10-21,Eric Hernandez,physician assistant,checkup,Intraabdominal Infection,1993-10-07,cefotaxime
9,121-89-3506,Mr. James Jones,"169 Brenda Landing\nNorth Patriciashire, GA 53769",johnryan,M,1956-06-06,A+,philip94@yahoo.com,"Programme researcher, broadcasting/film/video",Edwards-West,2006-12-13,Sharon Allen,physician,checkup,Left Ventricular Dysfunction,2004-08-06,metoprolol
