## Data Dictionary

### Covariates
This datathon challenge uses a real-world evidence dataset from Health Verity (HV), one of the largest healthcare data ecosystems in the US, as its main data source. The HV dataset used here contains health-related information on patients in the US who were diagnosed with metastatic triple-negative breast cancer. We enriched it with the US Zip Codes Database, which was built from the ground up using authoritative sources including the U.S. Postal Service™, U.S. Census Bureau, National Weather Service, American Community Survey, and the IRS, to obtain additional socioeconomic information based on patient location. The dataset was then further enriched, again at the zip code level, with toxicology data from NASA/Columbia University to explore the relationship between health outcomes and toxic air conditions.

### Target

- `DiagPeriodL90D`: Diagnosis Period Less Than 90 Days. A binary indicator (0/1) of whether the cancer was diagnosed within 90 days.
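
Because the target is a binary flag, a first step is usually to look at the class balance. The sketch below uses only the five target values visible at the top of the data preview later in this notebook, held in a stand-in frame named `toy` (a hypothetical name); on the real data the same call would run on the full training frame.

```python
import pandas as pd

# Stand-in frame: the target values of the first five preview rows only.
toy = pd.DataFrame({"DiagPeriodL90D": [1, 1, 1, 0, 0]})

# Share of each class (normalize=True turns counts into proportions).
class_share = toy["DiagPeriodL90D"].value_counts(normalize=True)
print(class_share)
```

Five rows say nothing about the real class balance; the point is only the call pattern.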

---------------------------------

### Patient Related info

#### Identifier:
- `patient_id` - Unique identification number of patient

#### Physical parameters of a patient: 
- `patient_race` - Asian, African American, Hispanic or Latino, White, Other Race
- `patient_age` - Derived from Patient Year of Birth (index year minus year of birth)
- `patient_gender` - Patient gender (F, M) on the metastatic date
- `bmi` - Earliest BMI recorded after the metastatic date, if available

#### Diagnosis related info:
- `breast_cancer_diagnosis_code` - ICD10 or ICD9 diagnosis code
- `breast_cancer_diagnosis_desc` - ICD10 or ICD9 code description. This column is raw text and may require NLP processing and cleaning
- `metastatic_cancer_diagnosis_code` - ICD10 diagnosis code
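
Since `breast_cancer_diagnosis_code` mixes two code systems and `breast_cancer_diagnosis_desc` is raw text, a little normalization can help before modeling. The following is a minimal sketch, not a complete solution: `icd_version`, `clean_description`, and the `ABBREVIATIONS` table are hypothetical helpers, and the rule "starts with a letter means ICD10, all digits means ICD9" is an assumption that fits the codes seen in the preview (e.g. `C50919` and `1749`).

```python
import re

# Abbreviations seen in the preview descriptions; extend as needed.
ABBREVIATIONS = {"malig": "malignant", "neoplm": "neoplasm", "unsp": "unspecified"}


def icd_version(code: str) -> str:
    """Guess the ICD revision of a breast cancer diagnosis code.

    Assumption: ICD10 codes start with a letter, ICD9 codes here are numeric.
    """
    code = code.strip().upper()
    if re.fullmatch(r"[A-Z]\d[0-9A-Z]*", code):
        return "ICD10"
    if re.fullmatch(r"\d{3,5}", code):
        return "ICD9"
    return "unknown"


def clean_description(text: str) -> str:
    """Lowercase, drop punctuation, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


print(icd_version("C50919"), icd_version("1749"))
print(clean_description("Malig neoplm of upper-outer quadrant of right"))
```

With a DataFrame, these helpers could be applied column-wise via `Series.map`.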

#### Treatment related info:
- `metastatic_first_novel_treatment` - Generic drug name of the first novel treatment (e.g. "Cisplatin") after metastatic diagnosis
- `metastatic_first_novel_treatment_type` - Description of Treatment (e.g. Antineoplastic) of first novel treatment after metastatic diagnosis

#### Payment type: 
- `payer_type` - Payer type (Medicaid, Commercial, Medicare) on the metastatic date

---------------------------------

### Geolocation related info

#### Geographical location of a patient:
- `patient_state` - Patient State (e.g. AL, AK, AZ, AR, CA, CO etc…) on the metastatic date
- `patient_zip3` - Patient Zip3 (e.g. 190) on the metastatic date
- `Region` - Region of patient location
- `Division` - Division of patient location
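
One thing to watch with `patient_zip3`: it is a code, not a quantity, and a default CSV load parses it as an integer, which drops any leading zero. A small sketch of the difference, using an in-memory stand-in for the CSV file (the ids and the `019` row are hypothetical, included only to show the effect):

```python
import io

import pandas as pd

# Tiny stand-in for the training file; values are illustrative only.
csv_text = "patient_id,patient_state,patient_zip3\n1,CA,924\n2,MA,019\n"

# Default parsing infers an integer column and turns "019" into 19.
as_int = pd.read_csv(io.StringIO(csv_text))

# Forcing a string dtype keeps the three-character code intact.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"patient_zip3": str})

print(as_int["patient_zip3"].tolist())
print(as_str["patient_zip3"].tolist())
```

Keeping the column as a string also makes it harder to treat it as a numeric feature by accident.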

#### Air Quality in patient's Geolocation:
- `Ozone` - Annual Ozone (O3) concentration at Zip3 level. Included to explore how air quality may affect health.
- `PM25` - Annual Fine Particulate Matter (PM2.5) concentration at Zip3 level. Included to explore how air quality may affect health.
- `N02` - Annual Nitrogen Dioxide (NO2) concentration at Zip3 level. Note that the column name is spelled with a zero, not the letter O. Included to explore how air quality may affect health.
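
The header of the data preview spells these columns `Ozone`, `PM25`, and `N02` (with a digit zero), which is easy to mistype. One option, sketched here with hypothetical target names, is to rename them once right after loading:

```python
import pandas as pd

# One-row stand-in built from the first preview row of the training data.
air = pd.DataFrame({"Ozone": [52.237210], "PM25": [8.650555], "N02": [18.606528]})

# Hypothetical target names; "N02" (digit zero) becomes "no2" (letter O).
RENAMES = {"Ozone": "ozone", "PM25": "pm25", "N02": "no2"}
air = air.rename(columns=RENAMES)

print(list(air.columns))
```

Whether to rename at all is a matter of taste; the same mapping would need to be applied to any test file as well.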

---------------------------------

### Population related info in patient's geolocation

##### General:
- `population` - An estimate of the zip code's population.
- `density` - The estimated population per square kilometer.
- `poverty` - The percentage of residents with income below the poverty line.
- `commute_time` - The median commute time of resident workers in minutes.

##### Age: 
- `age_median` - The median age of residents in the zip code.
- `age_under_10` - The percentage of residents aged 0-9.
- `age_10_to_19` - The percentage of residents aged 10-19.
- `age_20s` - The percentage of residents aged 20-29.
- `age_30s` - The percentage of residents aged 30-39.
- `age_40s` - The percentage of residents aged 40-49.
- `age_50s` - The percentage of residents aged 50-59.
- `age_60s` - The percentage of residents aged 60-69.
- `age_70s` - The percentage of residents aged 70-79.
- `age_over_80` - The percentage of residents aged over 80.
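
Since the nine age brackets cover all ages, their percentages should add up to roughly 100 for every row, which makes for a cheap sanity check. The sketch below runs it on the first row of the data preview; the tolerance of one percentage point is an arbitrary assumption.

```python
import pandas as pd

AGE_COLUMNS = [
    "age_under_10", "age_10_to_19", "age_20s", "age_30s", "age_40s",
    "age_50s", "age_60s", "age_70s", "age_over_80",
]

# Age shares from the first preview row (zip3 924).
ages = pd.DataFrame(
    [[16.014286, 15.542857, 17.614286, 14.014286, 11.614286,
      11.557143, 7.571429, 4.000000, 2.100000]],
    columns=AGE_COLUMNS,
)

# Row-wise total of the bracket percentages.
age_total = ages[AGE_COLUMNS].sum(axis=1)

# Flag rows that drift more than 1 percentage point from 100.
suspicious = (age_total - 100).abs() > 1.0

print(age_total.iloc[0], suspicious.iloc[0])
```

The total is close to, but not exactly, 100 here, presumably because of averaging and rounding at the zip code level.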

##### Gender: 
- `male` - The percentage of residents who report being male (e.g. 55.1).
- `female` - The percentage of residents who report being female (e.g. 44.9).

##### Race:
- `race_multiple` - The percentage of residents who report their race as Two or more races.
- `race_white` - The percentage of residents who report their race as White.
- `race_black` - The percentage of residents who report their race as Black or African American.
- `race_asian` - The percentage of residents who report their race as Asian.
- `race_native` - The percentage of residents who report their race as American Indian and Alaska Native.
- `race_pacific` - The percentage of residents who report their race as Native Hawaiian and Other Pacific Islander.
- `race_other` - The percentage of residents who report their race as Some other race.
- `hispanic` - The percentage of residents who report being Hispanic. Note: Hispanic is considered to be an ethnicity and not a race.

##### Health determining situation:
- `health_uninsured` - The percentage of residents who report not having health insurance.
- `disabled` - The percentage of residents who report a disability.
- `veteran` - The percentage of residents who are veterans.

##### Social status:
- `married` - The percentage of residents who report being married (e.g. 44.9).
- `divorced` - The percentage of residents who are divorced.
- `never_married` - The percentage of residents who have never married.
- `widowed` - The percentage of residents who are widowed.

##### Family: 
- `family_size` - The average size of resident families (e.g. 3.22).

##### Home ownership: 
- `home_ownership` - Percentage of households that own (rather than rent) their residence.
- `housing_units` - The number of housing units (or households) in the zip code.
- `home_value` - The median value of homes that are owned by residents.

##### Rent:
- `rent_median` - The median rent paid by renters.
- `rent_burden` - The median rent as a percentage of the median renter's household income.
    
##### Education:
- `education_college_or_above` - The percentage of residents with at least a 4-year degree.
- `education_less_highschool` - The percentage of residents with less than a high school education.
- `education_highschool` - The percentage of residents with a high school diploma but no more.
- `education_some_college` - The percentage of residents with some college but no more.
- `education_bachelors` - The percentage of residents with a bachelor's degree (or equivalent) but no more.
- `education_graduate` - The percentage of residents with a graduate degree.
- `education_stem_degree` - The percentage of college graduates with a Bachelor's degree or higher in a Science and Engineering (or related) field.
- `limited_english` - The percentage of residents who only speak limited English.
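
Some of these columns overlap. In the preview rows, `education_college_or_above` appears to equal `education_bachelors` plus `education_graduate`, so it may carry no extra information. A quick check on the first three preview rows (the tolerance is an assumption to absorb rounding):

```python
import pandas as pd

# Education shares copied from the first three preview rows.
edu = pd.DataFrame({
    "education_bachelors": [8.357143, 23.739394, 19.678333],
    "education_graduate": [3.257143, 12.245455, 10.115000],
    "education_college_or_above": [11.614286, 35.984848, 29.793333],
})

# Absolute gap between the sum of the parts and the aggregate column.
gap = (
    edu["education_bachelors"]
    + edu["education_graduate"]
    - edu["education_college_or_above"]
).abs()

# True when the aggregate is just the sum of the two parts, up to rounding.
is_redundant = bool((gap < 1e-4).all())
print(gap.max(), is_redundant)
```

If the relationship holds on the full data, dropping one of the three columns avoids perfect collinearity in linear models.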

##### Employment:
- `labor_force_participation` - The percentage of residents 16 and older in the labor force.
- `unemployment_rate` - The percentage of residents unemployed.
- `self_employed` - The percentage of households reporting self-employment income on their 2016 IRS tax return.
   
  
##### Household income:
- `income_household_median` - Median household income in USD.
- `income_household_six_figure` - Percentage of households that earn at least $100,000 (e.g. 25.3)
- `family_dual_income` - The percentage of families with dual income earners.
- `income_household_under_5` - The percentage of households with income under $5,000.
- `income_household_5_to_10` - The percentage of households with income from $5,000-$10,000.
- `income_household_10_to_15` - The percentage of households with income from $10,000-$15,000.
- `income_household_15_to_20` - The percentage of households with income from $15,000-$20,000.
- `income_household_20_to_25` - The percentage of households with income from $20,000-$25,000.
- `income_household_25_to_35` - The percentage of households with income from $25,000-$35,000.
- `income_household_35_to_50` - The percentage of households with income from $35,000-$50,000.
- `income_household_50_to_75` - The percentage of households with income from $50,000-$75,000.
- `income_household_75_to_100` - The percentage of households with income from $75,000-$100,000.
- `income_household_100_to_150` - The percentage of households with income from $100,000-$150,000.
- `income_household_150_over` - The percentage of households with income over $150,000.
- `income_individual_median` - The median income of individuals in the zip code.
- `farmer` - The percentage of households reporting farm income on their 2016 IRS tax return.
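
The income brackets invite two consistency checks: the eleven bracket percentages should sum to about 100, and `income_household_six_figure` should match the sum of the top two brackets. A sketch using the first preview row, where `income_household_six_figure` is listed as 19.1:

```python
import pandas as pd

BRACKETS = [
    "income_household_under_5", "income_household_5_to_10",
    "income_household_10_to_15", "income_household_15_to_20",
    "income_household_20_to_25", "income_household_25_to_35",
    "income_household_35_to_50", "income_household_50_to_75",
    "income_household_75_to_100", "income_household_100_to_150",
    "income_household_150_over",
]

# Bracket shares from the first preview row (zip3 924).
row = pd.Series(
    [3.142857, 4.000000, 6.157143, 5.142857, 6.271429, 10.142857,
     13.300000, 20.000000, 12.742857, 11.571429, 7.528571],
    index=BRACKETS,
)

# All brackets together should cover every household.
bracket_total = row.sum()

# Households earning at least $100,000 are the top two brackets combined.
six_figure = row["income_household_100_to_150"] + row["income_household_150_over"]

print(round(bracket_total, 4), round(six_figure, 4))
```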
    

```python
import pandas as pd

# Show every column when displaying wide frames
pd.set_option('display.max_columns', None)

# Load the training data and preview it
train = pd.read_csv('training.csv')
train
```

Unnamed: 0,patient_id,patient_race,payer_type,patient_state,patient_zip3,patient_age,patient_gender,bmi,breast_cancer_diagnosis_code,breast_cancer_diagnosis_desc,metastatic_cancer_diagnosis_code,metastatic_first_novel_treatment,metastatic_first_novel_treatment_type,Region,Division,population,density,age_median,age_under_10,age_10_to_19,age_20s,age_30s,age_40s,age_50s,age_60s,age_70s,age_over_80,male,female,married,divorced,never_married,widowed,family_size,family_dual_income,income_household_median,income_household_under_5,income_household_5_to_10,income_household_10_to_15,income_household_15_to_20,income_household_20_to_25,income_household_25_to_35,income_household_35_to_50,income_household_50_to_75,income_household_75_to_100,income_household_100_to_150,income_household_150_over,income_household_six_figure,income_individual_median,home_ownership,housing_units,home_value,rent_median,rent_burden,education_less_highschool,education_highschool,education_some_college,education_bachelors,education_graduate,education_college_or_above,education_stem_degree,labor_force_participation,unemployment_rate,self_employed,farmer,race_white,race_black,race_asian,race_native,race_pacific,race_other,race_multiple,hispanic,disabled,poverty,limited_english,commute_time,health_uninsured,veteran,Ozone,PM25,N02,DiagPeriodL90D
0,475714,,MEDICAID,CA,924,84,F,,C50919,Malignant neoplasm of unsp site of unspecified...,C7989,,,West,Pacific,31437.75000,1189.562500,30.642857,16.014286,15.542857,17.614286,14.014286,11.614286,11.557143,7.571429,4.000000,2.100000,49.857143,50.142857,36.571429,11.885714,47.114286,4.442857,3.928571,52.228571,52996.28571,3.142857,4.000000,6.157143,5.142857,6.271429,10.142857,13.300000,20.000000,12.742857,11.571429,7.528571,19.100000,24563.57143,44.585714,8674.500000,2.646343e+05,1165.000000,37.442857,33.257143,29.200000,25.914286,8.357143,3.257143,11.614286,39.557143,61.528571,8.471429,13.428571,0.000000,44.100000,13.100000,5.100000,1.485714,0.342857,27.114286,8.757143,66.685714,12.871429,22.542857,10.100000,27.814286,11.200000,3.500000,52.237210,8.650555,18.606528,1
1,349367,White,COMMERCIAL,CA,928,62,F,28.49,C50411,Malig neoplm of upper-outer quadrant of right ...,C773,,,West,Pacific,39121.87879,2295.939394,38.200000,11.878788,13.354545,14.230303,13.418182,13.333333,14.060606,10.248485,5.951515,3.503030,49.893939,50.106061,50.245455,9.827273,35.290909,4.651515,3.622727,61.736364,102741.63640,2.327273,1.536364,2.648485,2.178788,2.409091,5.163636,7.972727,13.936364,12.469697,19.760606,29.596970,49.357576,41287.27273,61.463636,11725.666670,6.776885e+05,2003.125000,34.753125,14.230303,19.987879,29.796970,23.739394,12.245455,35.984848,47.918182,65.230303,5.103030,15.224242,0.027273,54.030303,2.527273,20.827273,0.587879,0.300000,11.645455,10.081818,37.948485,8.957576,10.109091,8.057576,30.606061,7.018182,4.103030,42.301121,8.487175,20.113179,1
2,138632,White,COMMERCIAL,TX,760,43,F,38.09,C50112,Malignant neoplasm of central portion of left ...,C773,,,South,West South Central,21996.68333,626.236667,37.906667,13.028333,14.463333,12.531667,13.545000,12.860000,12.770000,11.426667,6.565000,2.811667,50.123333,49.876667,55.753333,12.330000,27.195000,4.710000,3.260667,55.801667,85984.74138,2.483333,1.305000,2.716667,2.938333,2.766667,6.763333,12.061667,15.835000,13.560000,20.875000,18.680000,39.555000,40399.03333,72.745000,7786.583333,2.377131e+05,1235.907407,29.358491,10.811667,27.038333,32.368333,19.678333,10.115000,29.793333,37.308475,66.428333,4.560000,13.722034,3.650847,75.820000,9.231667,3.618333,0.463333,0.146667,3.816667,6.898333,19.370000,11.253333,9.663333,3.356667,31.394915,15.066667,7.446667,40.108207,7.642753,14.839351,1
3,617843,White,COMMERCIAL,CA,926,45,F,,C50212,Malig neoplasm of upper-inner quadrant of left...,C773,,,West,Pacific,32795.32558,1896.220930,42.871429,10.071429,12.135714,12.538095,12.464286,12.650000,14.847619,12.280952,8.216667,4.759524,49.066667,50.933333,52.604762,11.623810,31.142857,4.623810,3.098095,54.564286,120533.83330,3.435714,1.273810,2.180952,2.211905,2.100000,4.380952,5.885714,10.897619,10.721429,18.850000,38.057143,56.907143,55336.28571,59.221429,12171.302330,1.012474e+06,2354.738095,32.030952,5.835714,12.145238,26.269048,33.285714,22.459524,55.745238,48.938095,64.430952,5.264286,18.502381,0.052381,65.014286,1.438095,18.845238,0.430952,0.252381,5.428571,8.611905,16.716667,8.845238,8.688095,5.280952,27.561905,4.404762,4.809524,42.070075,7.229393,15.894123,0
4,817482,,COMMERCIAL,ID,836,55,F,,1749,"Malignant neoplasm of breast (female), unspeci...",C773,,,West,Mountain,10886.26000,116.886000,43.473469,10.824000,13.976000,9.492000,10.364000,12.600000,14.992000,14.836000,9.462000,3.466000,52.312000,47.688000,57.882000,14.964000,21.760000,5.406000,3.352653,47.214286,61075.13043,2.594000,1.960000,3.168000,3.240000,4.778000,11.462000,15.656000,22.432000,12.480000,13.620000,8.606000,22.226000,29073.18367,77.098000,3768.060000,2.498457e+05,919.743590,27.029730,11.576000,29.590000,39.168000,13.978000,5.684000,19.662000,42.332653,57.488000,4.258000,13.029545,6.890909,86.712000,0.426000,0.656000,0.760000,0.108000,5.080000,6.258000,13.334000,15.276000,11.224000,1.946000,26.170213,12.088000,13.106000,41.356058,4.110749,11.722197,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12901,674178,White,,OH,436,50,F,32.11,C50411,Malig neoplm of upper-outer quadrant of right ...,C773,,,Midwest,East North Central,19413.05882,1196.805882,36.911765,12.876471,13.435294,14.394118,12.705882,11.694118,13.329412,11.764706,6.188235,3.617647,48.264706,51.735294,36.429412,15.700000,42.158824,5.705882,3.039412,43.217647,48452.41176,5.517647,6.005882,7.405882,4.800000,6.058824,10.364706,14.194118,16.217647,11.047059,10.994118,7.358824,18.352941,27888.52941,55.905882,8227.764706,1.005470e+05,772.647059,31.776471,12.923529,31.723529,32.564706,14.400000,8.370588,22.770588,38.288235,61.429412,9.135294,9.105882,0.023529,62.182353,27.770588,1.217647,0.270588,0.064706,2.476471,6.005882,7.747059,17.400000,23.600000,0.864706,19.841176,6.300000,6.247059,38.753055,8.068682,21.140731,1
12902,452909,,COMMERCIAL,CA,945,50,F,,C50912,Malignant neoplasm of unspecified site of left...,C773,,,West,Pacific,30153.87952,976.289157,42.135802,10.753086,12.714815,11.725926,13.101235,12.817284,13.301235,12.771605,8.413580,4.408642,49.727160,50.272840,53.076543,10.912346,30.534568,5.466667,3.271125,55.760000,122863.89610,2.051250,1.136250,2.127500,1.647500,2.073750,5.148750,6.473750,12.807500,11.286250,18.003750,37.245000,55.248750,52778.65000,67.480000,10267.108430,8.179491e+05,2223.445946,32.100000,8.916049,16.504938,29.396296,26.903704,18.277778,45.181481,52.645000,63.281481,5.332099,14.116250,0.416250,54.060494,5.906173,21.497531,0.586420,0.695062,7.986420,9.274074,21.861728,11.243210,7.837037,5.411250,34.700000,3.845679,5.671605,36.469947,6.265266,10.728732,1
12903,357486,,COMMERCIAL,CA,926,61,F,29.24,C50912,Malignant neoplasm of unspecified site of left...,C7931,,,West,Pacific,32795.32558,1896.220930,42.871429,10.071429,12.135714,12.538095,12.464286,12.650000,14.847619,12.280952,8.216667,4.759524,49.066667,50.933333,52.604762,11.623810,31.142857,4.623810,3.098095,54.564286,120533.83330,3.435714,1.273810,2.180952,2.211905,2.100000,4.380952,5.885714,10.897619,10.721429,18.850000,38.057143,56.907143,55336.28571,59.221429,12171.302330,1.012474e+06,2354.738095,32.030952,5.835714,12.145238,26.269048,33.285714,22.459524,55.745238,48.938095,64.430952,5.264286,18.502381,0.052381,65.014286,1.438095,18.845238,0.430952,0.252381,5.428571,8.611905,16.716667,8.845238,8.688095,5.280952,27.561905,4.404762,4.809524,42.070075,7.229393,15.894123,1
12904,935417,,,NY,112,37,F,31.00,1749,"Malignant neoplasm of breast (female), unspeci...",C773,,,Northeast,Middle Atlantic,71374.13158,17326.407890,36.476316,12.986842,11.318421,14.971053,17.255263,12.631579,11.460526,9.789474,6.000000,3.581579,47.668421,52.331579,39.923684,10.239474,44.642105,5.186842,3.412105,53.447368,74499.71053,4.334211,3.305263,5.863158,4.460526,4.042105,7.589474,9.897368,13.542105,10.742105,14.889474,21.318421,36.207895,39491.78947,29.931579,25922.552630,8.708732e+05,1678.447368,35.213158,16.200000,24.334211,18.447368,24.371053,16.655263,41.026316,40.857895,64.197368,7.184211,18.145946,0.002703,44.100000,28.831579,11.205263,0.515789,0.068421,9.184211,6.089474,18.960526,10.194737,18.642105,14.173684,42.502632,6.392105,1.755263,37.722740,7.879795,27.496367,0
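
The preview already shows gaps in `patient_race`, `bmi`, and the treatment columns, so the share of missing values per column is worth measuring early. A minimal sketch on a stand-in frame named `preview` (a hypothetical name), built from four columns of the first five rows above:

```python
import pandas as pd

# Four columns from the first five preview rows; None marks an empty cell.
preview = pd.DataFrame({
    "patient_race": [None, "White", "White", "White", None],
    "payer_type": ["MEDICAID", "COMMERCIAL", "COMMERCIAL", "COMMERCIAL", "COMMERCIAL"],
    "bmi": [None, 28.49, 38.09, None, None],
    "metastatic_first_novel_treatment": [None] * 5,
})

# Fraction of missing entries per column, worst first.
missing_share = preview.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the full data the same expression would run on `train`; the fractions from five rows are not representative.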
