## Data Dictionary

### Covariates
For this datathon challenge, we are using a real-world evidence dataset from Health Verity (HV), one of the largest healthcare data ecosystems in the US, as the main data source for the Datathon. In particular, the HV dataset that we use for this challenge contains health related information of patients who were diagnosed with metastatic triple negative breast cancers in the US. We also enriched the data set with the US Zip Codes Database which were built from the ground up using authoritative sources including the U.S. Postal Service™, U.S. Census Bureau, National Weather Service, American Community Survey, and the IRS, to obtain additional social economic information based on the locations of the patients. The dataset was then further enriched, also using zip code level, with toxicology data from NASA/Columbia University, to explore the relations between health outcomes and toxic air conditions.

### Target

- `DiagPeriodL90D`: Diagnosis Period Less Than 90 Days. This is an indication of whether the cancer was diagnosed within 90 Days.

---------------------------------

### Patient Related info

#### Identifier:
- `patient_id` - Unique identification number of patient

#### Physical parameters of a patient: 
- `patient_race` - Asian, African American, Hispanic or Latino, White, Other Race
- `patient_age` - Derived from Patient Year of Birth (index year minus year of birth)
- `patient_gender` - F, M on the metastatic date
- `bmi` - If Available, will show available BMI information (Earliest BMI recording post metastatic date)

#### Diagnosis related info:
- `breast_cancer_diagnosis_code` - ICD10 or ICD9 diagnoses code
- `breast_cancer_diagnosis_desc` - ICD10 or ICD9 code description. This column is raw text and may require NLP/ processing and cleaning
- `metastatic_cancer_diagnosis_code` - ICD10 diagnoses code

#### Treatment related info:
- `metastatic_first_novel_treatment` - Generic drug name of the first novel treatment (e.g. "Cisplatin") after metastatic diagnosis
- `metastatic_first_novel_treatment_type` - Description of Treatment (e.g. Antineoplastic) of first novel treatment after metastatic diagnosis

#### Payment type: 
- `payer_type` - payer type at Medicaid, Commercial, Medicare on the metastatic date

---------------------------------

### Geolocation related info

#### Geographical location of a patient:
- `patient_state` - Patient State (e.g. AL, AK, AZ, AR, CA, CO etc…) on the metastatic date
- `patient_zip3` - Patient Zip3 (e.g. 190) on the metastatic date
- `region` - Region of patient location
- `division` - Division of patient location

#### Air Quality in patient's Geolocation:
- `ozone` - Annual Ozone (O3) concentration data at Zip3 level. This data shows how air quality data may impact health.
- `PM25` - Annual Fine Particulate Matter (PM2.5) concentration data at Zip3 level. This data shows how air quality data may impact health.
- `N02` - Annual Nitrogen Dioxide (NO2) concentration data at Zip3 level. This data shows how air quality data may impact health.

---------------------------------

### Population related info in patient's geolocation

##### General:
- `population` - An estimate of the zip code's population.
- `density` - The estimated population per square kilometer.
- `poverty` - The median value of owner occupied homes.
- `commute_time` - The median commute time of resident workers in minutes.

##### Age: 
- `age_median` - The median age of residents in the zip code.
- `age_under_10` - The percentage of residents aged 0-9.
- `age_10_to_19` - The percentage of residents aged 10-19.
- `age_20s` - The percentage of residents aged 20-29.
- `age_30s` - The percentage of residents aged 30-39.
- `age_40s` - The percentage of residents aged 40-49.
- `age_50s` - The percentage of residents aged 50-59.
- `age_60s` - The percentage of residents aged 60-69.
- `age_70s` - The percentage of residents aged 70-79.
- `age_over_80` - The percentage of residents aged over 80.

##### Gender: 
- `male` - The percentage of residents who report being male (e.g. 55.1).
- `female` - The percentage of residents who report being female (e.g. 44.9).

##### Race or ethnicity:
- `race_multiple` - The percentage of residents who report their race as Two or more races.
- `race_white` - The percentage of residents who report their race White.
- `race_black` - The percentage of residents who report their race as Black or African American.
- `race_asian` - The percentage of residents who report their race as Asian.
- `race_native` - The percentage of residents who report their race as American Indian and Alaska Native.
- `race_pacific` - The percentage of residents who report their race as Native Hawaiian and Other Pacific Islander.
- `race_other` - The percentage of residents who report their race as Some other race.
- `hispanic` - The percentage of residents who report being Hispanic. Note: Hispanic is considered to be an ethnicity and not a race.

##### Health determining situation:
- `health_uninsured` - The percentage of residents who report not having health insurance.
- `disabled` - The percentage of residents who report a disability.
- `veteran` - The percentage of residents who are veterans.

##### Social status:
- `married` - The percentage of residents who report being married (e.g. 44.9).
- `divorced` - The percentage of residents divorced.
- `never_married` - The percentage of residents never married.
- `widowed` - The percentage of residents never widowed.

##### Family: 
- `family_size` - The average size of resident families (e.g. 3.22).

##### Home ownership: 
- `home_ownership` - Percentage of households that own (rather than rent) their residence.
- `housing_units` - The number of housing units (or households) in the zip code.
- `home_value` - The median value of homes that are owned by residents.

##### Rent:
- `rent_median` - The median rent paid by renters.
- `rent_burden` - The median rent as a percentage of the median renter's household income.
    
##### Educaton:
- `education_college_or_above` - The percentage of residents with at least a 4-year degree.
- `education_less_highschool` - The percentage of residents with less than a high school education.
- `education_highschool` - The percentage of residents with a high school diploma but no more.
- `education_some_college` - The percentage of residents with some college but no more.
- `education_bachelors` - The percentage of residents with a bachelor's degree (or equivalent) but no more.
- `education_graduate` - The percentage of residents with a graduate degree.
- `education_stem_degree` - The percentage of college graduates with a Bachelor's degree or higher in a Science and Engineering (or related) field.
- `limited_english` - The percentage of residents who only speak limited English.

##### Employment:
- `labor_force_participation` - The percentage of residents 16 and older in the labor force.
- `unemployment_rate` - The percentage of residents unemployed.
- `self_employed` - The percentage of households reporting self-employment income on their 2016 IRS tax return.
   
  
##### Houshold income:
- `income_household_median` - Median household income in USD.
- `income_household_six_figure` - Percentage of households that earn at least $100,000 (e.g. 25.3)
- `family_dual_income` - The percentage of families with dual income earners.
- `income_household_under_5` - The percentage of households with income under $5,000.
- `income_household_5_to_10` - The percentage of households with income from $5,000-$10,000.
- `income_household_10_to_15` - The percentage of households with income from $10,000-$15,000.
- `income_household_15_to_20` - The percentage of households with income from $15,000-$20,000.
- `income_household_20_to_25` - The percentage of households with income from $20,000-$25,000.
- `income_household_25_to_35` - The percentage of households with income from $25,000-$35,000.
- `income_household_35_to_50` - The percentage of households with income from $35,000-$50,000.
- `income_household_50_to_75` - The percentage of households with income from $50,000-$75,000.
- `income_household_75_to_100` - The percentage of households with income from $75,000-$100,000.
- `income_household_100_to_150` - The percentage of households with income from $100,000-$150,000.
- `income_household_150_over` - The percentage of households with income over $150,000.
- `income_individual_median` - The median income of individuals in the zip code.
- `farmer` - The percentage of households reporting farm income on their 2016 IRS tax return.
    

In [26]:
import pandas as pd
import numpy as np

In [27]:
pd.set_option('display.max_columns', None)

In [28]:
train = pd.read_csv('training.csv')
train

Unnamed: 0,patient_id,patient_race,payer_type,patient_state,patient_zip3,patient_age,patient_gender,bmi,breast_cancer_diagnosis_code,breast_cancer_diagnosis_desc,metastatic_cancer_diagnosis_code,metastatic_first_novel_treatment,metastatic_first_novel_treatment_type,Region,Division,population,density,age_median,age_under_10,age_10_to_19,age_20s,age_30s,age_40s,age_50s,age_60s,age_70s,age_over_80,male,female,married,divorced,never_married,widowed,family_size,family_dual_income,income_household_median,income_household_under_5,income_household_5_to_10,income_household_10_to_15,income_household_15_to_20,income_household_20_to_25,income_household_25_to_35,income_household_35_to_50,income_household_50_to_75,income_household_75_to_100,income_household_100_to_150,income_household_150_over,income_household_six_figure,income_individual_median,home_ownership,housing_units,home_value,rent_median,rent_burden,education_less_highschool,education_highschool,education_some_college,education_bachelors,education_graduate,education_college_or_above,education_stem_degree,labor_force_participation,unemployment_rate,self_employed,farmer,race_white,race_black,race_asian,race_native,race_pacific,race_other,race_multiple,hispanic,disabled,poverty,limited_english,commute_time,health_uninsured,veteran,Ozone,PM25,N02,DiagPeriodL90D
0,475714,,MEDICAID,CA,924,84,F,,C50919,Malignant neoplasm of unsp site of unspecified...,C7989,,,West,Pacific,31437.75000,1189.562500,30.642857,16.014286,15.542857,17.614286,14.014286,11.614286,11.557143,7.571429,4.000000,2.100000,49.857143,50.142857,36.571429,11.885714,47.114286,4.442857,3.928571,52.228571,52996.28571,3.142857,4.000000,6.157143,5.142857,6.271429,10.142857,13.300000,20.000000,12.742857,11.571429,7.528571,19.100000,24563.57143,44.585714,8674.500000,2.646343e+05,1165.000000,37.442857,33.257143,29.200000,25.914286,8.357143,3.257143,11.614286,39.557143,61.528571,8.471429,13.428571,0.000000,44.100000,13.100000,5.100000,1.485714,0.342857,27.114286,8.757143,66.685714,12.871429,22.542857,10.100000,27.814286,11.200000,3.500000,52.237210,8.650555,18.606528,1
1,349367,White,COMMERCIAL,CA,928,62,F,28.49,C50411,Malig neoplm of upper-outer quadrant of right ...,C773,,,West,Pacific,39121.87879,2295.939394,38.200000,11.878788,13.354545,14.230303,13.418182,13.333333,14.060606,10.248485,5.951515,3.503030,49.893939,50.106061,50.245455,9.827273,35.290909,4.651515,3.622727,61.736364,102741.63640,2.327273,1.536364,2.648485,2.178788,2.409091,5.163636,7.972727,13.936364,12.469697,19.760606,29.596970,49.357576,41287.27273,61.463636,11725.666670,6.776885e+05,2003.125000,34.753125,14.230303,19.987879,29.796970,23.739394,12.245455,35.984848,47.918182,65.230303,5.103030,15.224242,0.027273,54.030303,2.527273,20.827273,0.587879,0.300000,11.645455,10.081818,37.948485,8.957576,10.109091,8.057576,30.606061,7.018182,4.103030,42.301121,8.487175,20.113179,1
2,138632,White,COMMERCIAL,TX,760,43,F,38.09,C50112,Malignant neoplasm of central portion of left ...,C773,,,South,West South Central,21996.68333,626.236667,37.906667,13.028333,14.463333,12.531667,13.545000,12.860000,12.770000,11.426667,6.565000,2.811667,50.123333,49.876667,55.753333,12.330000,27.195000,4.710000,3.260667,55.801667,85984.74138,2.483333,1.305000,2.716667,2.938333,2.766667,6.763333,12.061667,15.835000,13.560000,20.875000,18.680000,39.555000,40399.03333,72.745000,7786.583333,2.377131e+05,1235.907407,29.358491,10.811667,27.038333,32.368333,19.678333,10.115000,29.793333,37.308475,66.428333,4.560000,13.722034,3.650847,75.820000,9.231667,3.618333,0.463333,0.146667,3.816667,6.898333,19.370000,11.253333,9.663333,3.356667,31.394915,15.066667,7.446667,40.108207,7.642753,14.839351,1
3,617843,White,COMMERCIAL,CA,926,45,F,,C50212,Malig neoplasm of upper-inner quadrant of left...,C773,,,West,Pacific,32795.32558,1896.220930,42.871429,10.071429,12.135714,12.538095,12.464286,12.650000,14.847619,12.280952,8.216667,4.759524,49.066667,50.933333,52.604762,11.623810,31.142857,4.623810,3.098095,54.564286,120533.83330,3.435714,1.273810,2.180952,2.211905,2.100000,4.380952,5.885714,10.897619,10.721429,18.850000,38.057143,56.907143,55336.28571,59.221429,12171.302330,1.012474e+06,2354.738095,32.030952,5.835714,12.145238,26.269048,33.285714,22.459524,55.745238,48.938095,64.430952,5.264286,18.502381,0.052381,65.014286,1.438095,18.845238,0.430952,0.252381,5.428571,8.611905,16.716667,8.845238,8.688095,5.280952,27.561905,4.404762,4.809524,42.070075,7.229393,15.894123,0
4,817482,,COMMERCIAL,ID,836,55,F,,1749,"Malignant neoplasm of breast (female), unspeci...",C773,,,West,Mountain,10886.26000,116.886000,43.473469,10.824000,13.976000,9.492000,10.364000,12.600000,14.992000,14.836000,9.462000,3.466000,52.312000,47.688000,57.882000,14.964000,21.760000,5.406000,3.352653,47.214286,61075.13043,2.594000,1.960000,3.168000,3.240000,4.778000,11.462000,15.656000,22.432000,12.480000,13.620000,8.606000,22.226000,29073.18367,77.098000,3768.060000,2.498457e+05,919.743590,27.029730,11.576000,29.590000,39.168000,13.978000,5.684000,19.662000,42.332653,57.488000,4.258000,13.029545,6.890909,86.712000,0.426000,0.656000,0.760000,0.108000,5.080000,6.258000,13.334000,15.276000,11.224000,1.946000,26.170213,12.088000,13.106000,41.356058,4.110749,11.722197,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12901,674178,White,,OH,436,50,F,32.11,C50411,Malig neoplm of upper-outer quadrant of right ...,C773,,,Midwest,East North Central,19413.05882,1196.805882,36.911765,12.876471,13.435294,14.394118,12.705882,11.694118,13.329412,11.764706,6.188235,3.617647,48.264706,51.735294,36.429412,15.700000,42.158824,5.705882,3.039412,43.217647,48452.41176,5.517647,6.005882,7.405882,4.800000,6.058824,10.364706,14.194118,16.217647,11.047059,10.994118,7.358824,18.352941,27888.52941,55.905882,8227.764706,1.005470e+05,772.647059,31.776471,12.923529,31.723529,32.564706,14.400000,8.370588,22.770588,38.288235,61.429412,9.135294,9.105882,0.023529,62.182353,27.770588,1.217647,0.270588,0.064706,2.476471,6.005882,7.747059,17.400000,23.600000,0.864706,19.841176,6.300000,6.247059,38.753055,8.068682,21.140731,1
12902,452909,,COMMERCIAL,CA,945,50,F,,C50912,Malignant neoplasm of unspecified site of left...,C773,,,West,Pacific,30153.87952,976.289157,42.135802,10.753086,12.714815,11.725926,13.101235,12.817284,13.301235,12.771605,8.413580,4.408642,49.727160,50.272840,53.076543,10.912346,30.534568,5.466667,3.271125,55.760000,122863.89610,2.051250,1.136250,2.127500,1.647500,2.073750,5.148750,6.473750,12.807500,11.286250,18.003750,37.245000,55.248750,52778.65000,67.480000,10267.108430,8.179491e+05,2223.445946,32.100000,8.916049,16.504938,29.396296,26.903704,18.277778,45.181481,52.645000,63.281481,5.332099,14.116250,0.416250,54.060494,5.906173,21.497531,0.586420,0.695062,7.986420,9.274074,21.861728,11.243210,7.837037,5.411250,34.700000,3.845679,5.671605,36.469947,6.265266,10.728732,1
12903,357486,,COMMERCIAL,CA,926,61,F,29.24,C50912,Malignant neoplasm of unspecified site of left...,C7931,,,West,Pacific,32795.32558,1896.220930,42.871429,10.071429,12.135714,12.538095,12.464286,12.650000,14.847619,12.280952,8.216667,4.759524,49.066667,50.933333,52.604762,11.623810,31.142857,4.623810,3.098095,54.564286,120533.83330,3.435714,1.273810,2.180952,2.211905,2.100000,4.380952,5.885714,10.897619,10.721429,18.850000,38.057143,56.907143,55336.28571,59.221429,12171.302330,1.012474e+06,2354.738095,32.030952,5.835714,12.145238,26.269048,33.285714,22.459524,55.745238,48.938095,64.430952,5.264286,18.502381,0.052381,65.014286,1.438095,18.845238,0.430952,0.252381,5.428571,8.611905,16.716667,8.845238,8.688095,5.280952,27.561905,4.404762,4.809524,42.070075,7.229393,15.894123,1
12904,935417,,,NY,112,37,F,31.00,1749,"Malignant neoplasm of breast (female), unspeci...",C773,,,Northeast,Middle Atlantic,71374.13158,17326.407890,36.476316,12.986842,11.318421,14.971053,17.255263,12.631579,11.460526,9.789474,6.000000,3.581579,47.668421,52.331579,39.923684,10.239474,44.642105,5.186842,3.412105,53.447368,74499.71053,4.334211,3.305263,5.863158,4.460526,4.042105,7.589474,9.897368,13.542105,10.742105,14.889474,21.318421,36.207895,39491.78947,29.931579,25922.552630,8.708732e+05,1678.447368,35.213158,16.200000,24.334211,18.447368,24.371053,16.655263,41.026316,40.857895,64.197368,7.184211,18.145946,0.002703,44.100000,28.831579,11.205263,0.515789,0.068421,9.184211,6.089474,18.960526,10.194737,18.642105,14.173684,42.502632,6.392105,1.755263,37.722740,7.879795,27.496367,0


In [29]:
test = pd.read_csv('test.csv')
test

Unnamed: 0,patient_id,patient_race,payer_type,patient_state,patient_zip3,patient_age,patient_gender,bmi,breast_cancer_diagnosis_code,breast_cancer_diagnosis_desc,metastatic_cancer_diagnosis_code,metastatic_first_novel_treatment,metastatic_first_novel_treatment_type,Region,Division,population,density,age_median,age_under_10,age_10_to_19,age_20s,age_30s,age_40s,age_50s,age_60s,age_70s,age_over_80,male,female,married,divorced,never_married,widowed,family_size,family_dual_income,income_household_median,income_household_under_5,income_household_5_to_10,income_household_10_to_15,income_household_15_to_20,income_household_20_to_25,income_household_25_to_35,income_household_35_to_50,income_household_50_to_75,income_household_75_to_100,income_household_100_to_150,income_household_150_over,income_household_six_figure,income_individual_median,home_ownership,housing_units,home_value,rent_median,rent_burden,education_less_highschool,education_highschool,education_some_college,education_bachelors,education_graduate,education_college_or_above,education_stem_degree,labor_force_participation,unemployment_rate,self_employed,farmer,race_white,race_black,race_asian,race_native,race_pacific,race_other,race_multiple,hispanic,disabled,poverty,limited_english,commute_time,health_uninsured,veteran,Ozone,PM25,N02
0,573710,White,MEDICAID,IN,467,54,F,,C50412,Malig neoplasm of upper-outer quadrant of left...,C773,,,Midwest,East North Central,5441.435484,85.620968,40.880328,12.732258,14.088710,10.659677,11.625806,11.208065,15.619355,12.322581,8.409677,3.343548,49.154839,50.845161,55.175806,13.982258,24.266129,6.583871,3.073226,52.980645,66187.22807,1.611290,1.277419,2.645161,3.853226,3.172581,13.275806,12.633871,21.485484,16.717742,15.238710,8.070968,23.309677,33553.43333,84.112903,2064.741935,152749.5370,825.122449,23.895455,12.429032,40.667742,28.959677,11.895161,6.046774,17.941935,35.591379,63.303226,3.406557,10.655357,5.551786,94.793548,0.364516,0.303226,0.119355,0.009677,0.770968,3.630645,3.564516,13.996774,7.985484,0.969355,24.955357,10.838710,8.080645,38.724876,7.947165,11.157161
1,593679,,COMMERCIAL,FL,337,52,F,,C50912,Malignant neoplasm of unspecified site of left...,C787,,,South,South Atlantic,19613.820510,1555.107692,49.107692,8.069231,8.587179,10.684615,11.302564,10.971795,15.823077,15.902564,11.828205,6.815385,49.658974,50.341026,44.800000,17.779487,29.102564,8.310256,2.917105,46.665789,64711.71053,3.873684,2.044737,3.807895,4.239474,4.242105,9.347368,13.018421,17.373684,12.889474,14.442105,14.702632,29.144737,34678.61538,68.673684,8502.230769,265860.6053,1343.394737,34.957895,8.379487,26.558974,30.200000,22.100000,12.764103,34.864103,43.250000,57.035897,5.002632,11.564103,0.005128,78.217949,10.889744,3.453846,0.187179,0.076923,1.841026,5.328205,10.261538,16.020513,13.602564,2.836842,23.952632,10.579487,9.302564,36.918257,7.838973,13.599985
2,184532,Hispanic,MEDICAID,CA,917,61,F,,C50911,Malignant neoplasm of unsp site of right femal...,C773,,,West,Pacific,43030.500000,2048.578261,38.852174,11.306522,12.897826,14.121739,13.532609,13.160870,13.378261,11.473913,6.380435,3.736957,49.052174,50.947826,48.504348,10.117391,36.408696,4.969565,3.674783,59.219565,86330.39130,2.226087,1.528261,2.897826,2.747826,3.173913,6.647826,9.617391,15.965217,13.589130,19.752174,21.847826,41.600000,34317.82609,61.397826,12609.260870,572606.5000,1778.000000,34.595652,17.491304,22.656522,29.263043,20.200000,10.404348,30.604348,46.208696,63.154348,6.197826,15.708696,0.015217,38.708696,3.963043,25.565217,1.193478,0.269565,18.858696,11.426087,47.726087,9.895652,10.515217,12.745652,32.530435,7.263043,3.810870,47.310325,9.595719,20.084231
3,447383,Hispanic,MEDICARE ADVANTAGE,CA,917,64,F,,C50912,Malignant neoplasm of unspecified site of left...,C779,,,West,Pacific,43030.500000,2048.578261,38.852174,11.306522,12.897826,14.121739,13.532609,13.160870,13.378261,11.473913,6.380435,3.736957,49.052174,50.947826,48.504348,10.117391,36.408696,4.969565,3.674783,59.219565,86330.39130,2.226087,1.528261,2.897826,2.747826,3.173913,6.647826,9.617391,15.965217,13.589130,19.752174,21.847826,41.600000,34317.82609,61.397826,12609.260870,572606.5000,1778.000000,34.595652,17.491304,22.656522,29.263043,20.200000,10.404348,30.604348,46.208696,63.154348,6.197826,15.708696,0.015217,38.708696,3.963043,25.565217,1.193478,0.269565,18.858696,11.426087,47.726087,9.895652,10.515217,12.745652,32.530435,7.263043,3.810870,47.310325,9.595719,20.084231
4,687972,Black,,CA,900,40,F,23.00,C50412,Malig neoplasm of upper-outer quadrant of left...,C779,,,West,Pacific,36054.117650,5294.330882,36.653846,9.761538,11.267692,17.233846,17.441538,13.090769,12.304615,9.407692,5.673846,3.824615,50.510769,49.489231,33.478462,11.301538,50.456923,4.766154,3.442857,55.531746,69266.69355,6.320312,2.953125,6.806250,4.175000,4.125000,7.843750,10.164062,14.417188,10.479688,13.726562,18.962500,32.689062,36053.40000,31.504687,12949.117650,873755.9661,1651.145161,37.367742,22.915385,18.236923,21.269231,23.886154,13.689231,37.575385,41.748438,64.387692,8.683077,21.233333,0.006349,42.824615,12.216923,12.703077,1.120000,0.146154,22.135385,8.850769,45.526154,11.901538,20.760000,14.737500,30.709375,10.341538,3.030769,41.186992,11.166898,21.644261
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5787,977076,White,,KY,404,63,F,29.60,C50411,Malig neoplm of upper-outer quadrant of right ...,C773,,,South,East South Central,7765.000000,131.040000,40.434783,11.556522,14.021739,13.208696,11.160870,13.191304,14.239130,11.252174,7.860870,3.500000,49.473913,50.526087,52.482609,14.917391,25.456522,7.134783,3.036364,45.222727,47693.95238,3.886364,5.472727,8.936364,5.450000,6.850000,8.122727,13.622727,21.172727,11.695455,9.495455,5.318182,14.813636,23461.00000,80.309091,2985.360000,117472.2941,672.062500,27.250000,18.404348,43.273913,21.465217,10.956522,5.900000,16.856522,43.291304,52.500000,6.743478,12.259091,9.686364,95.486957,1.930435,0.321739,0.113043,0.013043,0.452174,1.673913,1.243478,20.404348,20.813636,0.350000,30.152174,6.473913,5.908696,39.947326,7.622672,9.154618
5788,922960,White,,IA,507,69,F,,C50912,Malignant neoplasm of unspecified site of left...,C773,,,Midwest,West North Central,19332.750000,346.250000,38.525000,12.200000,13.025000,14.675000,12.750000,10.975000,12.275000,12.925000,7.525000,3.600000,50.575000,49.425000,42.850000,14.525000,36.825000,5.775000,2.992500,51.175000,50797.00000,3.150000,3.475000,5.350000,5.900000,4.650000,11.100000,14.600000,22.075000,12.700000,11.100000,5.850000,16.950000,30365.50000,63.475000,8277.250000,116581.7500,801.750000,28.100000,11.975000,36.100000,33.350000,13.575000,5.100000,18.675000,37.500000,65.775000,6.675000,8.625000,0.600000,75.250000,14.875000,1.875000,0.250000,0.400000,2.250000,5.125000,6.175000,16.675000,15.900000,2.800000,16.800000,5.475000,6.875000,35.825340,7.610534,9.712786
5789,759690,,MEDICARE ADVANTAGE,WA,980,84,F,28.28,C50411,Malig neoplm of upper-outer quadrant of right ...,C773,,,West,Pacific,28628.290910,1091.825455,39.679245,12.143396,12.462264,11.320755,15.213208,14.449057,14.105660,11.226415,5.841509,3.230189,49.969811,50.030189,57.096226,10.601887,28.198113,4.105660,3.122264,58.656604,120277.07550,1.650943,1.166038,1.662264,1.879245,2.030189,4.135849,6.737736,13.224528,11.154717,19.994340,36.343396,56.337736,56562.28302,68.620755,10666.381820,683702.3585,1914.207547,30.030769,5.830189,16.381132,28.124528,30.443396,19.211321,49.654717,52.200000,68.041509,4.069811,12.731481,0.203704,64.613208,4.150943,17.926415,0.539623,0.728302,3.901887,8.132075,9.511321,9.752830,6.432075,5.094340,31.275472,5.309434,5.807547,36.618644,4.939852,23.393650
5790,911717,,COMMERCIAL,OK,740,58,F,,1749,"Malignant neoplasm of breast (female), unspeci...",C773,,,South,West South Central,9716.970149,150.602985,39.588060,11.768657,15.576119,12.500000,11.220896,12.462687,12.601493,12.555224,7.591045,3.707463,50.617910,49.382090,51.743284,14.323881,26.828358,7.110448,3.149231,45.758462,56232.32258,2.737879,4.737879,4.603030,5.266667,4.740909,12.315152,13.143939,17.715152,12.536364,13.215152,8.983333,22.198485,29720.98485,73.150000,3595.000000,131170.1525,808.467742,26.516393,11.173134,37.686567,30.708955,13.974627,6.447761,20.422388,38.242424,56.434328,4.831818,11.425000,7.768333,74.488060,1.653731,1.180597,10.462687,0.129851,1.074627,11.004478,4.374627,15.544776,16.603030,0.513636,25.877273,14.926866,7.600000,39.832235,8.030925,9.769358


## Processing NaN

In [30]:
test.isna().sum().sort_values(ascending=False)[:15]

metastatic_first_novel_treatment         5781
metastatic_first_novel_treatment_type    5781
bmi                                      4015
patient_race                             2901
payer_type                                760
patient_state                              21
Region                                     21
Division                                   21
N02                                        14
PM25                                       14
Ozone                                      14
income_household_15_to_20                   1
income_household_50_to_75                   1
income_household_35_to_50                   1
income_household_20_to_25                   1
dtype: int64

In [31]:
train.isna().sum().sort_values(ascending=False)[:15]

metastatic_first_novel_treatment         12882
metastatic_first_novel_treatment_type    12882
bmi                                       8965
patient_race                              6385
payer_type                                1803
Region                                      52
Division                                    52
patient_state                               51
N02                                         29
PM25                                        29
Ozone                                       29
income_household_25_to_35                    4
income_household_15_to_20                    4
income_household_35_to_50                    4
income_household_20_to_25                    4
dtype: int64

In [32]:
train.drop([
    'metastatic_first_novel_treatment', 
    'metastatic_first_novel_treatment_type', 
    'breast_cancer_diagnosis_desc', 
    'bmi'], 
    axis=1, 
    inplace=True
    )

test.drop([
    'metastatic_first_novel_treatment', 
    'metastatic_first_novel_treatment_type', 
    'breast_cancer_diagnosis_desc', 
    'bmi'], 
    axis=1, 
    inplace=True
    )

Drop the rows with too many absent values. 

In [33]:
train[train.family_size.isna()]

Unnamed: 0,patient_id,patient_race,payer_type,patient_state,patient_zip3,patient_age,patient_gender,breast_cancer_diagnosis_code,metastatic_cancer_diagnosis_code,Region,Division,population,density,age_median,age_under_10,age_10_to_19,age_20s,age_30s,age_40s,age_50s,age_60s,age_70s,age_over_80,male,female,married,divorced,never_married,widowed,family_size,family_dual_income,income_household_median,income_household_under_5,income_household_5_to_10,income_household_10_to_15,income_household_15_to_20,income_household_20_to_25,income_household_25_to_35,income_household_35_to_50,income_household_50_to_75,income_household_75_to_100,income_household_100_to_150,income_household_150_over,income_household_six_figure,income_individual_median,home_ownership,housing_units,home_value,rent_median,rent_burden,education_less_highschool,education_highschool,education_some_college,education_bachelors,education_graduate,education_college_or_above,education_stem_degree,labor_force_participation,unemployment_rate,self_employed,farmer,race_white,race_black,race_asian,race_native,race_pacific,race_other,race_multiple,hispanic,disabled,poverty,limited_english,commute_time,health_uninsured,veteran,Ozone,PM25,N02,DiagPeriodL90D
2996,514282,,COMMERCIAL,TX,772,47,F,C50919,C773,South,West South Central,4459.0,3376.1,20.6,0.0,35.3,62.1,1.5,0.8,0.0,0.2,0.0,0.0,47.1,52.9,0.9,0.2,98.9,0.0,,,,,,,,,,,,,,,,4316.0,,0.0,,,,0.0,0.0,44.8,41.7,13.5,55.2,73.0,30.7,18.8,,,47.6,23.0,20.1,0.0,0.0,1.7,7.6,18.2,4.6,,,16.2,4.5,1.6,35.849956,10.535904,22.426824,0
8481,387901,Other,MEDICAID,TX,772,63,F,C50911,C7931,South,West South Central,4459.0,3376.1,20.6,0.0,35.3,62.1,1.5,0.8,0.0,0.2,0.0,0.0,47.1,52.9,0.9,0.2,98.9,0.0,,,,,,,,,,,,,,,,4316.0,,0.0,,,,0.0,0.0,44.8,41.7,13.5,55.2,73.0,30.7,18.8,,,47.6,23.0,20.1,0.0,0.0,1.7,7.6,18.2,4.6,,,16.2,4.5,1.6,35.849956,10.535904,22.426824,1
10542,224030,Black,MEDICAID,FL,332,41,F,C50911,C7800,South,South Atlantic,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,35.544993,8.714016,16.558153,0
11314,411586,,COMMERCIAL,TX,772,57,F,C50812,C773,South,West South Central,4459.0,3376.1,20.6,0.0,35.3,62.1,1.5,0.8,0.0,0.2,0.0,0.0,47.1,52.9,0.9,0.2,98.9,0.0,,,,,,,,,,,,,,,,4316.0,,0.0,,,,0.0,0.0,44.8,41.7,13.5,55.2,73.0,30.7,18.8,,,47.6,23.0,20.1,0.0,0.0,1.7,7.6,18.2,4.6,,,16.2,4.5,1.6,35.849956,10.535904,22.426824,0


In [34]:
list(train[train.family_size.isna()].index)

[2996, 8481, 10542, 11314]

In [35]:
train.drop(10542, inplace=True)
train = train.drop(list(train[train.family_size.isna()].index)).reset_index(drop=True)

test = test.drop(1622).reset_index(drop=True)

In [36]:
train

Unnamed: 0,patient_id,patient_race,payer_type,patient_state,patient_zip3,patient_age,patient_gender,breast_cancer_diagnosis_code,metastatic_cancer_diagnosis_code,Region,Division,population,density,age_median,age_under_10,age_10_to_19,age_20s,age_30s,age_40s,age_50s,age_60s,age_70s,age_over_80,male,female,married,divorced,never_married,widowed,family_size,family_dual_income,income_household_median,income_household_under_5,income_household_5_to_10,income_household_10_to_15,income_household_15_to_20,income_household_20_to_25,income_household_25_to_35,income_household_35_to_50,income_household_50_to_75,income_household_75_to_100,income_household_100_to_150,income_household_150_over,income_household_six_figure,income_individual_median,home_ownership,housing_units,home_value,rent_median,rent_burden,education_less_highschool,education_highschool,education_some_college,education_bachelors,education_graduate,education_college_or_above,education_stem_degree,labor_force_participation,unemployment_rate,self_employed,farmer,race_white,race_black,race_asian,race_native,race_pacific,race_other,race_multiple,hispanic,disabled,poverty,limited_english,commute_time,health_uninsured,veteran,Ozone,PM25,N02,DiagPeriodL90D
0,475714,,MEDICAID,CA,924,84,F,C50919,C7989,West,Pacific,31437.75000,1189.562500,30.642857,16.014286,15.542857,17.614286,14.014286,11.614286,11.557143,7.571429,4.000000,2.100000,49.857143,50.142857,36.571429,11.885714,47.114286,4.442857,3.928571,52.228571,52996.28571,3.142857,4.000000,6.157143,5.142857,6.271429,10.142857,13.300000,20.000000,12.742857,11.571429,7.528571,19.100000,24563.57143,44.585714,8674.500000,2.646343e+05,1165.000000,37.442857,33.257143,29.200000,25.914286,8.357143,3.257143,11.614286,39.557143,61.528571,8.471429,13.428571,0.000000,44.100000,13.100000,5.100000,1.485714,0.342857,27.114286,8.757143,66.685714,12.871429,22.542857,10.100000,27.814286,11.200000,3.500000,52.237210,8.650555,18.606528,1
1,349367,White,COMMERCIAL,CA,928,62,F,C50411,C773,West,Pacific,39121.87879,2295.939394,38.200000,11.878788,13.354545,14.230303,13.418182,13.333333,14.060606,10.248485,5.951515,3.503030,49.893939,50.106061,50.245455,9.827273,35.290909,4.651515,3.622727,61.736364,102741.63640,2.327273,1.536364,2.648485,2.178788,2.409091,5.163636,7.972727,13.936364,12.469697,19.760606,29.596970,49.357576,41287.27273,61.463636,11725.666670,6.776885e+05,2003.125000,34.753125,14.230303,19.987879,29.796970,23.739394,12.245455,35.984848,47.918182,65.230303,5.103030,15.224242,0.027273,54.030303,2.527273,20.827273,0.587879,0.300000,11.645455,10.081818,37.948485,8.957576,10.109091,8.057576,30.606061,7.018182,4.103030,42.301121,8.487175,20.113179,1
2,138632,White,COMMERCIAL,TX,760,43,F,C50112,C773,South,West South Central,21996.68333,626.236667,37.906667,13.028333,14.463333,12.531667,13.545000,12.860000,12.770000,11.426667,6.565000,2.811667,50.123333,49.876667,55.753333,12.330000,27.195000,4.710000,3.260667,55.801667,85984.74138,2.483333,1.305000,2.716667,2.938333,2.766667,6.763333,12.061667,15.835000,13.560000,20.875000,18.680000,39.555000,40399.03333,72.745000,7786.583333,2.377131e+05,1235.907407,29.358491,10.811667,27.038333,32.368333,19.678333,10.115000,29.793333,37.308475,66.428333,4.560000,13.722034,3.650847,75.820000,9.231667,3.618333,0.463333,0.146667,3.816667,6.898333,19.370000,11.253333,9.663333,3.356667,31.394915,15.066667,7.446667,40.108207,7.642753,14.839351,1
3,617843,White,COMMERCIAL,CA,926,45,F,C50212,C773,West,Pacific,32795.32558,1896.220930,42.871429,10.071429,12.135714,12.538095,12.464286,12.650000,14.847619,12.280952,8.216667,4.759524,49.066667,50.933333,52.604762,11.623810,31.142857,4.623810,3.098095,54.564286,120533.83330,3.435714,1.273810,2.180952,2.211905,2.100000,4.380952,5.885714,10.897619,10.721429,18.850000,38.057143,56.907143,55336.28571,59.221429,12171.302330,1.012474e+06,2354.738095,32.030952,5.835714,12.145238,26.269048,33.285714,22.459524,55.745238,48.938095,64.430952,5.264286,18.502381,0.052381,65.014286,1.438095,18.845238,0.430952,0.252381,5.428571,8.611905,16.716667,8.845238,8.688095,5.280952,27.561905,4.404762,4.809524,42.070075,7.229393,15.894123,0
4,817482,,COMMERCIAL,ID,836,55,F,1749,C773,West,Mountain,10886.26000,116.886000,43.473469,10.824000,13.976000,9.492000,10.364000,12.600000,14.992000,14.836000,9.462000,3.466000,52.312000,47.688000,57.882000,14.964000,21.760000,5.406000,3.352653,47.214286,61075.13043,2.594000,1.960000,3.168000,3.240000,4.778000,11.462000,15.656000,22.432000,12.480000,13.620000,8.606000,22.226000,29073.18367,77.098000,3768.060000,2.498457e+05,919.743590,27.029730,11.576000,29.590000,39.168000,13.978000,5.684000,19.662000,42.332653,57.488000,4.258000,13.029545,6.890909,86.712000,0.426000,0.656000,0.760000,0.108000,5.080000,6.258000,13.334000,15.276000,11.224000,1.946000,26.170213,12.088000,13.106000,41.356058,4.110749,11.722197,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12897,674178,White,,OH,436,50,F,C50411,C773,Midwest,East North Central,19413.05882,1196.805882,36.911765,12.876471,13.435294,14.394118,12.705882,11.694118,13.329412,11.764706,6.188235,3.617647,48.264706,51.735294,36.429412,15.700000,42.158824,5.705882,3.039412,43.217647,48452.41176,5.517647,6.005882,7.405882,4.800000,6.058824,10.364706,14.194118,16.217647,11.047059,10.994118,7.358824,18.352941,27888.52941,55.905882,8227.764706,1.005470e+05,772.647059,31.776471,12.923529,31.723529,32.564706,14.400000,8.370588,22.770588,38.288235,61.429412,9.135294,9.105882,0.023529,62.182353,27.770588,1.217647,0.270588,0.064706,2.476471,6.005882,7.747059,17.400000,23.600000,0.864706,19.841176,6.300000,6.247059,38.753055,8.068682,21.140731,1
12898,452909,,COMMERCIAL,CA,945,50,F,C50912,C773,West,Pacific,30153.87952,976.289157,42.135802,10.753086,12.714815,11.725926,13.101235,12.817284,13.301235,12.771605,8.413580,4.408642,49.727160,50.272840,53.076543,10.912346,30.534568,5.466667,3.271125,55.760000,122863.89610,2.051250,1.136250,2.127500,1.647500,2.073750,5.148750,6.473750,12.807500,11.286250,18.003750,37.245000,55.248750,52778.65000,67.480000,10267.108430,8.179491e+05,2223.445946,32.100000,8.916049,16.504938,29.396296,26.903704,18.277778,45.181481,52.645000,63.281481,5.332099,14.116250,0.416250,54.060494,5.906173,21.497531,0.586420,0.695062,7.986420,9.274074,21.861728,11.243210,7.837037,5.411250,34.700000,3.845679,5.671605,36.469947,6.265266,10.728732,1
12899,357486,,COMMERCIAL,CA,926,61,F,C50912,C7931,West,Pacific,32795.32558,1896.220930,42.871429,10.071429,12.135714,12.538095,12.464286,12.650000,14.847619,12.280952,8.216667,4.759524,49.066667,50.933333,52.604762,11.623810,31.142857,4.623810,3.098095,54.564286,120533.83330,3.435714,1.273810,2.180952,2.211905,2.100000,4.380952,5.885714,10.897619,10.721429,18.850000,38.057143,56.907143,55336.28571,59.221429,12171.302330,1.012474e+06,2354.738095,32.030952,5.835714,12.145238,26.269048,33.285714,22.459524,55.745238,48.938095,64.430952,5.264286,18.502381,0.052381,65.014286,1.438095,18.845238,0.430952,0.252381,5.428571,8.611905,16.716667,8.845238,8.688095,5.280952,27.561905,4.404762,4.809524,42.070075,7.229393,15.894123,1
12900,935417,,,NY,112,37,F,1749,C773,Northeast,Middle Atlantic,71374.13158,17326.407890,36.476316,12.986842,11.318421,14.971053,17.255263,12.631579,11.460526,9.789474,6.000000,3.581579,47.668421,52.331579,39.923684,10.239474,44.642105,5.186842,3.412105,53.447368,74499.71053,4.334211,3.305263,5.863158,4.460526,4.042105,7.589474,9.897368,13.542105,10.742105,14.889474,21.318421,36.207895,39491.78947,29.931579,25922.552630,8.708732e+05,1678.447368,35.213158,16.200000,24.334211,18.447368,24.371053,16.655263,41.026316,40.857895,64.197368,7.184211,18.145946,0.002703,44.100000,28.831579,11.205263,0.515789,0.068421,9.184211,6.089474,18.960526,10.194737,18.642105,14.173684,42.502632,6.392105,1.755263,37.722740,7.879795,27.496367,0


### Fill NaN in State, Region and Division

In [37]:
zip_suffixes = pd.read_html("https://en.wikipedia.org/wiki/List_of_ZIP_Code_prefixes#Notes")

zip_state_dict = {}
# Iterate through the DataFrame
for i in range(len(zip_suffixes)):
    for column in zip_suffixes[i].columns:
        for j in range(len(zip_suffixes[i].columns)):
            try:
                key = int(zip_suffixes[i][column][j][:3])
                if len(str(key)) == 3:  # Check if the length of key is equal to 3
                    value = zip_suffixes[i][column][j][4:6]  # Assuming the state code is always after the space
                    zip_state_dict[key] = value
            except ValueError:
                pass


test.patient_state = [(lambda zip3: zip_state_dict.get(zip3, None))(zip3) for zip3 in test.patient_zip3]
train.patient_state = [(lambda zip3: zip_state_dict.get(zip3, None))(zip3) for zip3 in train.patient_zip3]

In [38]:
states = pd.read_csv('states.csv')
states.head()

Unnamed: 0,State,State Code,Region,Division
0,Alaska,AK,West,Pacific
1,Alabama,AL,South,East South Central
2,Arkansas,AR,South,West South Central
3,Arizona,AZ,West,Mountain
4,California,CA,West,Pacific


In [39]:
reg = states.set_index('State Code')['Region'].to_dict()
div = states.set_index('State Code')['Division'].to_dict()

train['Region'] = [(lambda state: reg.get(state, None))(state) for state in train.patient_state]
train['Division'] = [(lambda state: reg.get(state, None))(state) for state in train.patient_state]

test['Region'] = [(lambda state: reg.get(state, None))(state) for state in test.patient_state]
test['Division'] = [(lambda state: reg.get(state, None))(state) for state in test.patient_state]


In [40]:
train[['Region', 'Division']]

Unnamed: 0,Region,Division
0,West,West
1,West,West
2,South,South
3,West,West
4,West,West
...,...,...
12897,Midwest,Midwest
12898,West,West
12899,West,West
12900,Northeast,Northeast


## Categorical Values

Transform values with type `object` to `category` 

In [41]:
train = train.apply(lambda column: column.astype('category') if column.dtype == 'O' else column)
test = test.apply(lambda column: column.astype('category') if column.dtype == 'O' else column)

In [42]:
train.select_dtypes(include='category')

Unnamed: 0,patient_race,payer_type,patient_state,patient_gender,breast_cancer_diagnosis_code,metastatic_cancer_diagnosis_code,Region,Division
0,,MEDICAID,CA,F,C50919,C7989,West,West
1,White,COMMERCIAL,CA,F,C50411,C773,West,West
2,White,COMMERCIAL,TX,F,C50112,C773,South,South
3,White,COMMERCIAL,CA,F,C50212,C773,West,West
4,,COMMERCIAL,ID,F,1749,C773,West,West
...,...,...,...,...,...,...,...,...
12897,White,,OH,F,C50411,C773,Midwest,Midwest
12898,,COMMERCIAL,CA,F,C50912,C773,West,West
12899,,COMMERCIAL,CA,F,C50912,C7931,West,West
12900,,,NY,F,1749,C773,Northeast,Northeast


In [43]:
train.select_dtypes(include='category').isna().sum()

patient_race                        6383
payer_type                          1803
patient_state                          0
patient_gender                         0
breast_cancer_diagnosis_code           0
metastatic_cancer_diagnosis_code       0
Region                                 0
Division                               0
dtype: int64

### One-Hot Encode Cats

In [44]:
cats = list(train.select_dtypes('category').columns)
cats

['patient_race',
 'payer_type',
 'patient_state',
 'patient_gender',
 'breast_cancer_diagnosis_code',
 'metastatic_cancer_diagnosis_code',
 'Region',
 'Division']

In [45]:
train = pd.get_dummies(train, columns=cats, dummy_na=True, dtype=float)
test = pd.get_dummies(test, columns=cats, dummy_na=True, dtype=float)


In [46]:
train

Unnamed: 0,patient_id,patient_zip3,patient_age,population,density,age_median,age_under_10,age_10_to_19,age_20s,age_30s,age_40s,age_50s,age_60s,age_70s,age_over_80,male,female,married,divorced,never_married,widowed,family_size,family_dual_income,income_household_median,income_household_under_5,income_household_5_to_10,income_household_10_to_15,income_household_15_to_20,income_household_20_to_25,income_household_25_to_35,income_household_35_to_50,income_household_50_to_75,income_household_75_to_100,income_household_100_to_150,income_household_150_over,income_household_six_figure,income_individual_median,home_ownership,housing_units,home_value,rent_median,rent_burden,education_less_highschool,education_highschool,education_some_college,education_bachelors,education_graduate,education_college_or_above,education_stem_degree,labor_force_participation,unemployment_rate,self_employed,farmer,race_white,race_black,race_asian,race_native,race_pacific,race_other,race_multiple,hispanic,disabled,poverty,limited_english,commute_time,health_uninsured,veteran,Ozone,PM25,N02,DiagPeriodL90D,patient_race_Asian,patient_race_Black,patient_race_Hispanic,patient_race_Other,patient_race_White,patient_race_nan,payer_type_COMMERCIAL,payer_type_MEDICAID,payer_type_MEDICARE ADVANTAGE,payer_type_nan,patient_state_AK,patient_state_AL,patient_state_AR,patient_state_AZ,patient_state_CA,patient_state_CO,patient_state_DC,patient_state_DE,patient_state_FL,patient_state_GA,patient_state_HI,patient_state_IA,patient_state_ID,patient_state_IL,patient_state_IN,patient_state_KS,patient_state_KY,patient_state_LA,patient_state_MD,patient_state_MI,patient_state_MN,patient_state_MO,patient_state_MS,patient_state_MT,patient_state_NC,patient_state_ND,patient_state_NE,patient_state_NM,patient_state_NV,patient_state_NY,patient_state_OH,patient_state_OK,patient_state_OR,patient_state_PA,patient_state_SC,patient_state_SD,patient_state_TN,patient_state_TX,patient_state_UT,patient_state_VA,patient_state_WA,patient_state_WI,patient_state_WV,patient_state_WY,patient_state_nan,patient_gender_F,patient_gender_nan,breast_cancer_diagnosis_code_1741,breast_cancer_diagnosis_code_1742,breast_cancer_diagnosis_code_1743,breast_cancer_diagnosis_code_1744,breast_cancer_diagnosis_code_1745,breast_cancer_diagnosis_code_1746,breast_cancer_diagnosis_code_1748,breast_cancer_diagnosis_code_1749,breast_cancer_diagnosis_code_1759,breast_cancer_diagnosis_code_19881,breast_cancer_diagnosis_code_C50,breast_cancer_diagnosis_code_C5001,breast_cancer_diagnosis_code_C50011,breast_cancer_diagnosis_code_C50012,breast_cancer_diagnosis_code_C50019,breast_cancer_diagnosis_code_C50021,breast_cancer_diagnosis_code_C5011,breast_cancer_diagnosis_code_C50111,breast_cancer_diagnosis_code_C50112,breast_cancer_diagnosis_code_C50119,breast_cancer_diagnosis_code_C5021,breast_cancer_diagnosis_code_C50211,breast_cancer_diagnosis_code_C50212,breast_cancer_diagnosis_code_C50219,breast_cancer_diagnosis_code_C5031,breast_cancer_diagnosis_code_C50311,breast_cancer_diagnosis_code_C50312,breast_cancer_diagnosis_code_C50319,breast_cancer_diagnosis_code_C5041,breast_cancer_diagnosis_code_C50411,breast_cancer_diagnosis_code_C50412,breast_cancer_diagnosis_code_C50419,breast_cancer_diagnosis_code_C50421,breast_cancer_diagnosis_code_C5051,breast_cancer_diagnosis_code_C50511,breast_cancer_diagnosis_code_C50512,breast_cancer_diagnosis_code_C50519,breast_cancer_diagnosis_code_C50611,breast_cancer_diagnosis_code_C50612,breast_cancer_diagnosis_code_C50619,breast_cancer_diagnosis_code_C5081,breast_cancer_diagnosis_code_C50811,breast_cancer_diagnosis_code_C50812,breast_cancer_diagnosis_code_C50819,breast_cancer_diagnosis_code_C509,breast_cancer_diagnosis_code_C5091,breast_cancer_diagnosis_code_C50911,breast_cancer_diagnosis_code_C50912,breast_cancer_diagnosis_code_C50919,breast_cancer_diagnosis_code_C50929,breast_cancer_diagnosis_code_nan,metastatic_cancer_diagnosis_code_C770,metastatic_cancer_diagnosis_code_C771,metastatic_cancer_diagnosis_code_C772,metastatic_cancer_diagnosis_code_C773,metastatic_cancer_diagnosis_code_C774,metastatic_cancer_diagnosis_code_C775,metastatic_cancer_diagnosis_code_C778,metastatic_cancer_diagnosis_code_C779,metastatic_cancer_diagnosis_code_C7800,metastatic_cancer_diagnosis_code_C7801,metastatic_cancer_diagnosis_code_C7802,metastatic_cancer_diagnosis_code_C781,metastatic_cancer_diagnosis_code_C782,metastatic_cancer_diagnosis_code_C7830,metastatic_cancer_diagnosis_code_C7839,metastatic_cancer_diagnosis_code_C784,metastatic_cancer_diagnosis_code_C785,metastatic_cancer_diagnosis_code_C786,metastatic_cancer_diagnosis_code_C787,metastatic_cancer_diagnosis_code_C7880,metastatic_cancer_diagnosis_code_C7889,metastatic_cancer_diagnosis_code_C7900,metastatic_cancer_diagnosis_code_C7901,metastatic_cancer_diagnosis_code_C7910,metastatic_cancer_diagnosis_code_C7911,metastatic_cancer_diagnosis_code_C7919,metastatic_cancer_diagnosis_code_C792,metastatic_cancer_diagnosis_code_C7931,metastatic_cancer_diagnosis_code_C7932,metastatic_cancer_diagnosis_code_C7940,metastatic_cancer_diagnosis_code_C7949,metastatic_cancer_diagnosis_code_C7951,metastatic_cancer_diagnosis_code_C7952,metastatic_cancer_diagnosis_code_C7960,metastatic_cancer_diagnosis_code_C7961,metastatic_cancer_diagnosis_code_C7962,metastatic_cancer_diagnosis_code_C7970,metastatic_cancer_diagnosis_code_C7971,metastatic_cancer_diagnosis_code_C7972,metastatic_cancer_diagnosis_code_C7981,metastatic_cancer_diagnosis_code_C7982,metastatic_cancer_diagnosis_code_C7989,metastatic_cancer_diagnosis_code_C799,metastatic_cancer_diagnosis_code_nan,Region_Midwest,Region_Northeast,Region_South,Region_West,Region_nan,Division_Midwest,Division_Northeast,Division_South,Division_West,Division_nan
0,475714,924,84,31437.75000,1189.562500,30.642857,16.014286,15.542857,17.614286,14.014286,11.614286,11.557143,7.571429,4.000000,2.100000,49.857143,50.142857,36.571429,11.885714,47.114286,4.442857,3.928571,52.228571,52996.28571,3.142857,4.000000,6.157143,5.142857,6.271429,10.142857,13.300000,20.000000,12.742857,11.571429,7.528571,19.100000,24563.57143,44.585714,8674.500000,2.646343e+05,1165.000000,37.442857,33.257143,29.200000,25.914286,8.357143,3.257143,11.614286,39.557143,61.528571,8.471429,13.428571,0.000000,44.100000,13.100000,5.100000,1.485714,0.342857,27.114286,8.757143,66.685714,12.871429,22.542857,10.100000,27.814286,11.200000,3.500000,52.237210,8.650555,18.606528,1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,349367,928,62,39121.87879,2295.939394,38.200000,11.878788,13.354545,14.230303,13.418182,13.333333,14.060606,10.248485,5.951515,3.503030,49.893939,50.106061,50.245455,9.827273,35.290909,4.651515,3.622727,61.736364,102741.63640,2.327273,1.536364,2.648485,2.178788,2.409091,5.163636,7.972727,13.936364,12.469697,19.760606,29.596970,49.357576,41287.27273,61.463636,11725.666670,6.776885e+05,2003.125000,34.753125,14.230303,19.987879,29.796970,23.739394,12.245455,35.984848,47.918182,65.230303,5.103030,15.224242,0.027273,54.030303,2.527273,20.827273,0.587879,0.300000,11.645455,10.081818,37.948485,8.957576,10.109091,8.057576,30.606061,7.018182,4.103030,42.301121,8.487175,20.113179,1,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
2,138632,760,43,21996.68333,626.236667,37.906667,13.028333,14.463333,12.531667,13.545000,12.860000,12.770000,11.426667,6.565000,2.811667,50.123333,49.876667,55.753333,12.330000,27.195000,4.710000,3.260667,55.801667,85984.74138,2.483333,1.305000,2.716667,2.938333,2.766667,6.763333,12.061667,15.835000,13.560000,20.875000,18.680000,39.555000,40399.03333,72.745000,7786.583333,2.377131e+05,1235.907407,29.358491,10.811667,27.038333,32.368333,19.678333,10.115000,29.793333,37.308475,66.428333,4.560000,13.722034,3.650847,75.820000,9.231667,3.618333,0.463333,0.146667,3.816667,6.898333,19.370000,11.253333,9.663333,3.356667,31.394915,15.066667,7.446667,40.108207,7.642753,14.839351,1,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,617843,926,45,32795.32558,1896.220930,42.871429,10.071429,12.135714,12.538095,12.464286,12.650000,14.847619,12.280952,8.216667,4.759524,49.066667,50.933333,52.604762,11.623810,31.142857,4.623810,3.098095,54.564286,120533.83330,3.435714,1.273810,2.180952,2.211905,2.100000,4.380952,5.885714,10.897619,10.721429,18.850000,38.057143,56.907143,55336.28571,59.221429,12171.302330,1.012474e+06,2354.738095,32.030952,5.835714,12.145238,26.269048,33.285714,22.459524,55.745238,48.938095,64.430952,5.264286,18.502381,0.052381,65.014286,1.438095,18.845238,0.430952,0.252381,5.428571,8.611905,16.716667,8.845238,8.688095,5.280952,27.561905,4.404762,4.809524,42.070075,7.229393,15.894123,0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,817482,836,55,10886.26000,116.886000,43.473469,10.824000,13.976000,9.492000,10.364000,12.600000,14.992000,14.836000,9.462000,3.466000,52.312000,47.688000,57.882000,14.964000,21.760000,5.406000,3.352653,47.214286,61075.13043,2.594000,1.960000,3.168000,3.240000,4.778000,11.462000,15.656000,22.432000,12.480000,13.620000,8.606000,22.226000,29073.18367,77.098000,3768.060000,2.498457e+05,919.743590,27.029730,11.576000,29.590000,39.168000,13.978000,5.684000,19.662000,42.332653,57.488000,4.258000,13.029545,6.890909,86.712000,0.426000,0.656000,0.760000,0.108000,5.080000,6.258000,13.334000,15.276000,11.224000,1.946000,26.170213,12.088000,13.106000,41.356058,4.110749,11.722197,0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12897,674178,436,50,19413.05882,1196.805882,36.911765,12.876471,13.435294,14.394118,12.705882,11.694118,13.329412,11.764706,6.188235,3.617647,48.264706,51.735294,36.429412,15.700000,42.158824,5.705882,3.039412,43.217647,48452.41176,5.517647,6.005882,7.405882,4.800000,6.058824,10.364706,14.194118,16.217647,11.047059,10.994118,7.358824,18.352941,27888.52941,55.905882,8227.764706,1.005470e+05,772.647059,31.776471,12.923529,31.723529,32.564706,14.400000,8.370588,22.770588,38.288235,61.429412,9.135294,9.105882,0.023529,62.182353,27.770588,1.217647,0.270588,0.064706,2.476471,6.005882,7.747059,17.400000,23.600000,0.864706,19.841176,6.300000,6.247059,38.753055,8.068682,21.140731,1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
12898,452909,945,50,30153.87952,976.289157,42.135802,10.753086,12.714815,11.725926,13.101235,12.817284,13.301235,12.771605,8.413580,4.408642,49.727160,50.272840,53.076543,10.912346,30.534568,5.466667,3.271125,55.760000,122863.89610,2.051250,1.136250,2.127500,1.647500,2.073750,5.148750,6.473750,12.807500,11.286250,18.003750,37.245000,55.248750,52778.65000,67.480000,10267.108430,8.179491e+05,2223.445946,32.100000,8.916049,16.504938,29.396296,26.903704,18.277778,45.181481,52.645000,63.281481,5.332099,14.116250,0.416250,54.060494,5.906173,21.497531,0.586420,0.695062,7.986420,9.274074,21.861728,11.243210,7.837037,5.411250,34.700000,3.845679,5.671605,36.469947,6.265266,10.728732,1,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
12899,357486,926,61,32795.32558,1896.220930,42.871429,10.071429,12.135714,12.538095,12.464286,12.650000,14.847619,12.280952,8.216667,4.759524,49.066667,50.933333,52.604762,11.623810,31.142857,4.623810,3.098095,54.564286,120533.83330,3.435714,1.273810,2.180952,2.211905,2.100000,4.380952,5.885714,10.897619,10.721429,18.850000,38.057143,56.907143,55336.28571,59.221429,12171.302330,1.012474e+06,2354.738095,32.030952,5.835714,12.145238,26.269048,33.285714,22.459524,55.745238,48.938095,64.430952,5.264286,18.502381,0.052381,65.014286,1.438095,18.845238,0.430952,0.252381,5.428571,8.611905,16.716667,8.845238,8.688095,5.280952,27.561905,4.404762,4.809524,42.070075,7.229393,15.894123,1,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
12900,935417,112,37,71374.13158,17326.407890,36.476316,12.986842,11.318421,14.971053,17.255263,12.631579,11.460526,9.789474,6.000000,3.581579,47.668421,52.331579,39.923684,10.239474,44.642105,5.186842,3.412105,53.447368,74499.71053,4.334211,3.305263,5.863158,4.460526,4.042105,7.589474,9.897368,13.542105,10.742105,14.889474,21.318421,36.207895,39491.78947,29.931579,25922.552630,8.708732e+05,1678.447368,35.213158,16.200000,24.334211,18.447368,24.371053,16.655263,41.026316,40.857895,64.197368,7.184211,18.145946,0.002703,44.100000,28.831579,11.205263,0.515789,0.068421,9.184211,6.089474,18.960526,10.194737,18.642105,14.173684,42.502632,6.392105,1.755263,37.722740,7.879795,27.496367,0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [47]:
train.isna().sum().sort_values(ascending=False)[:15]

Ozone                                  29
PM25                                   29
N02                                    29
patient_id                              0
breast_cancer_diagnosis_code_C5051      0
breast_cancer_diagnosis_code_C50212     0
breast_cancer_diagnosis_code_C50219     0
breast_cancer_diagnosis_code_C5031      0
breast_cancer_diagnosis_code_C50311     0
breast_cancer_diagnosis_code_C50312     0
breast_cancer_diagnosis_code_C50319     0
breast_cancer_diagnosis_code_C5041      0
breast_cancer_diagnosis_code_C50411     0
breast_cancer_diagnosis_code_C50412     0
breast_cancer_diagnosis_code_C50419     0
dtype: int64

In [48]:
test.isna().sum().sort_values(ascending=False)[:15]

Ozone                                  14
PM25                                   14
N02                                    14
patient_id                              0
breast_cancer_diagnosis_code_C50412     0
breast_cancer_diagnosis_code_C5021      0
breast_cancer_diagnosis_code_C50211     0
breast_cancer_diagnosis_code_C50212     0
breast_cancer_diagnosis_code_C50219     0
breast_cancer_diagnosis_code_C5031      0
breast_cancer_diagnosis_code_C50311     0
breast_cancer_diagnosis_code_C50312     0
breast_cancer_diagnosis_code_C50319     0
breast_cancer_diagnosis_code_C5041      0
breast_cancer_diagnosis_code_C50411     0
dtype: int64

In [49]:
train.to_csv('train_1.csv')
test.to_csv('test_1.csv')

In [284]:
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import RandomOverSampler

# Assuming 'train' is your DataFrame
# Extract the features and target columns
features = train.drop(['patient_race'], axis=1)
target_race = train['patient_race']
cat_features = ['patient_state', 'patient_gender', 'Region', 'Division', 'breast_cancer_diagnosis_code', 'metastatic_cancer_diagnosis_code']

# Identify the indices where 'patient_race' is NaN
missing_race_indices = target_race[target_race.isna()].index

# Drop rows with missing values from features
features.drop(missing_race_indices, inplace=True)

# Drop corresponding rows from the target_race
target_race.drop(missing_race_indices, inplace=True)


In [285]:
target_race = target_race.astype('category').cat.codes

In [286]:
# Split the data into training and testing sets
X_train_race, X_test_race, y_train_race, y_test_race = train_test_split(features, target_race, test_size=0.1, random_state=42)


In [287]:
np.bincount(y_train_race)

array([2623, 3244], dtype=int64)

In [288]:
# Define the desired number of samples for each class after oversampling
desired_samples = {
    0: 3244,  
    1: 3244, 
}

In [289]:
# Initialize the RandomOverSampler
over_sampler = RandomOverSampler(sampling_strategy=desired_samples, random_state=42)

# Fit and transform the training data
X_train_race, y_train_race = over_sampler.fit_resample(X_train_race, y_train_race)

In [290]:
# Identify non-categorical columns
non_cat_columns = [col for col in X_train_race.columns if col not in cat_features]

# Apply Min-Max scaling to non-categorical columns
scaler = MinMaxScaler()
X_train_race_scaled = X_train_race.copy()
X_train_race[non_cat_columns] = scaler.fit_transform(X_train_race[non_cat_columns])

X_test_race_scaled = X_test_race.copy()
X_test_race[non_cat_columns] = scaler.transform(X_test_race[non_cat_columns])


In [291]:
# Define the parameter grid for hyperparameter search
param_grid = {
    'iterations': [300, 350, 400, 450, 500],
    'depth': [4, 5, 6, 7, 8],
    'learning_rate': [0.02, 0.05, 0.1, 0.15, 0.2],
}

# Create CatBoost classifier
base_classifier = CatBoostClassifier(loss_function='MultiClass', cat_features=cat_features)

# Use RandomizedSearchCV for hyperparameter optimization
grid_search = RandomizedSearchCV(base_classifier, param_distributions=param_grid, n_iter=20, cv=3, random_state=42, scoring='accuracy', n_jobs=-1)

# Fit the classifier for 'patient_race' using scaled features and hyperparameter optimization
grid_search.fit(X_train_race, y_train_race)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Predict on the entire dataset
predicted_race_values = grid_search.predict(features)

# Continue with the evaluation metrics as previously shown


0:	learn: 0.6820150	total: 126ms	remaining: 56.6s
1:	learn: 0.6727949	total: 249ms	remaining: 55.8s
2:	learn: 0.6643021	total: 372ms	remaining: 55.4s
3:	learn: 0.6558476	total: 534ms	remaining: 59.6s
4:	learn: 0.6483842	total: 733ms	remaining: 1m 5s
5:	learn: 0.6413935	total: 879ms	remaining: 1m 5s
6:	learn: 0.6342838	total: 1.02s	remaining: 1m 4s
7:	learn: 0.6282118	total: 1.16s	remaining: 1m 4s
8:	learn: 0.6235787	total: 1.31s	remaining: 1m 4s
9:	learn: 0.6182291	total: 1.4s	remaining: 1m 1s
10:	learn: 0.6129288	total: 1.49s	remaining: 59.6s
11:	learn: 0.6079213	total: 1.58s	remaining: 57.9s
12:	learn: 0.6028975	total: 1.76s	remaining: 59s
13:	learn: 0.5984786	total: 1.9s	remaining: 59.3s
14:	learn: 0.5945284	total: 2.04s	remaining: 59.3s
15:	learn: 0.5904466	total: 2.17s	remaining: 58.9s
16:	learn: 0.5872098	total: 2.34s	remaining: 59.7s
17:	learn: 0.5843188	total: 2.48s	remaining: 59.6s
18:	learn: 0.5818735	total: 2.62s	remaining: 59.5s
19:	learn: 0.5793933	total: 2.77s	remaining: 

In [292]:
best_params

{'learning_rate': 0.05, 'iterations': 450, 'depth': 8}

In [293]:
# Create CatBoost classifier with the best hyperparameters from the grid search
best_classifier = CatBoostClassifier(
    iterations=best_params['iterations'],
    depth=best_params['depth'],
    learning_rate=best_params['learning_rate'],
    loss_function='MultiClass',
    cat_features=cat_features
)

# Fit the classifier on the training data
best_classifier.fit(X_train_race, y_train_race)

# Extract feature importances
feature_importances = best_classifier.get_feature_importance()

# Identify the indices of the most important features
num_features_to_select = 10  # Adjust this based on your preference
selected_feature_indices = np.argsort(feature_importances)[::-1][:num_features_to_select]

# Select the most important features
X_train_selected = X_train_race.iloc[:, selected_feature_indices]
X_test_selected = X_test_race.iloc[:, selected_feature_indices]


0:	learn: 0.6820150	total: 140ms	remaining: 1m 2s
1:	learn: 0.6727949	total: 303ms	remaining: 1m 7s
2:	learn: 0.6643021	total: 431ms	remaining: 1m 4s
3:	learn: 0.6558476	total: 543ms	remaining: 1m
4:	learn: 0.6483842	total: 662ms	remaining: 58.9s
5:	learn: 0.6413935	total: 749ms	remaining: 55.4s
6:	learn: 0.6342838	total: 845ms	remaining: 53.5s
7:	learn: 0.6282118	total: 932ms	remaining: 51.5s
8:	learn: 0.6235787	total: 1.03s	remaining: 50.4s
9:	learn: 0.6182291	total: 1.11s	remaining: 48.9s
10:	learn: 0.6129288	total: 1.19s	remaining: 47.6s
11:	learn: 0.6079213	total: 1.3s	remaining: 47.6s
12:	learn: 0.6028975	total: 1.45s	remaining: 48.6s
13:	learn: 0.5984786	total: 1.57s	remaining: 48.9s
14:	learn: 0.5945284	total: 1.68s	remaining: 48.8s
15:	learn: 0.5904466	total: 1.8s	remaining: 48.9s
16:	learn: 0.5872098	total: 1.95s	remaining: 49.6s
17:	learn: 0.5843188	total: 2.09s	remaining: 50.2s
18:	learn: 0.5818735	total: 2.25s	remaining: 51s
19:	learn: 0.5793933	total: 2.38s	remaining: 51.

In [294]:

# Train a new CatBoost model using only the selected features
selected_classifier = CatBoostClassifier(
    iterations=best_params['iterations'],
    depth=best_params['depth'],
    learning_rate=best_params['learning_rate'],
    loss_function='MultiClass',
    cat_features=list(X_train_selected.select_dtypes(include=['category', 'object']).columns)  # Assuming cat_features are not the same as the original model
)

# Fit the classifier on the training data with selected features
selected_classifier.fit(X_train_selected, y_train_race)


0:	learn: 0.6821460	total: 78ms	remaining: 35s
1:	learn: 0.6723268	total: 180ms	remaining: 40.4s
2:	learn: 0.6637058	total: 241ms	remaining: 36s
3:	learn: 0.6555031	total: 311ms	remaining: 34.6s
4:	learn: 0.6481595	total: 369ms	remaining: 32.8s
5:	learn: 0.6409884	total: 423ms	remaining: 31.3s
6:	learn: 0.6340938	total: 479ms	remaining: 30.3s
7:	learn: 0.6287494	total: 530ms	remaining: 29.3s
8:	learn: 0.6231380	total: 588ms	remaining: 28.8s
9:	learn: 0.6190208	total: 613ms	remaining: 27s
10:	learn: 0.6140165	total: 662ms	remaining: 26.4s
11:	learn: 0.6094453	total: 716ms	remaining: 26.1s
12:	learn: 0.6052325	total: 766ms	remaining: 25.8s
13:	learn: 0.6011295	total: 821ms	remaining: 25.6s
14:	learn: 0.5978777	total: 879ms	remaining: 25.5s
15:	learn: 0.5944993	total: 938ms	remaining: 25.4s
16:	learn: 0.5924730	total: 965ms	remaining: 24.6s
17:	learn: 0.5895711	total: 1.03s	remaining: 24.7s
18:	learn: 0.5864342	total: 1.09s	remaining: 24.7s
19:	learn: 0.5842410	total: 1.15s	remaining: 24.

<catboost.core.CatBoostClassifier at 0x1695f372d40>

In [295]:
X_test_selected.breast_cancer_diagnosis_code = X_test_selected.breast_cancer_diagnosis_code.astype(str)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test_selected.breast_cancer_diagnosis_code = X_test_selected.breast_cancer_diagnosis_code.astype(str)


In [296]:
# Predict on the test set with selected features
y_pred_test = selected_classifier.predict(X_test_selected)

In [297]:
# Assuming y_true contains the true labels and y_pred contains the predicted labels
y_true = target_race.loc[X_test_race.index]  # Replace with the actual true labels


In [298]:
y_true

6051     1
5744     1
7447     1
2276     1
7608     0
        ..
2313     1
12889    1
6347     1
10506    0
12034    1
Length: 652, dtype: int8

In [299]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report



# Convert categorical labels in y_true to numerical labels
y_true_numerical = y_true

# Calculate accuracy
accuracy = accuracy_score(y_true_numerical, y_pred_test)
print(f"Accuracy: {accuracy:.2%}")

# Calculate precision, recall, and F1 score
precision = precision_score(y_true_numerical, y_pred_test, average='weighted')
recall = recall_score(y_true_numerical, y_pred_test, average='weighted')
f1 = f1_score(y_true_numerical, y_pred_test, average='weighted')

print(f"Precision: {precision:.2%}")
print(f"Recall: {recall:.2%}")
print(f"F1 Score: {f1:.2%}")

# Confusion matrix
conf_matrix = confusion_matrix(y_true_numerical, y_pred_test)
print("Confusion Matrix:")
print(conf_matrix)

# Classification report
class_report = classification_report(y_true_numerical, y_pred_test)
print("Classification Report:")
print(class_report)



Accuracy: 70.09%
Precision: 70.21%
Recall: 70.09%
F1 Score: 70.11%
Confusion Matrix:
[[218  90]
 [105 239]]
Classification Report:
              precision    recall  f1-score   support

           0       0.67      0.71      0.69       308
           1       0.73      0.69      0.71       344

    accuracy                           0.70       652
   macro avg       0.70      0.70      0.70       652
weighted avg       0.70      0.70      0.70       652

