## Data Dictionary

### Covariates
For this datathon challenge, the main data source is a real-world evidence dataset from Health Verity (HV), one of the largest healthcare data ecosystems in the US. The HV dataset used here contains health-related information on patients in the US who were diagnosed with metastatic triple-negative breast cancer. The data was enriched with the US Zip Codes Database, which was built from the ground up using authoritative sources including the U.S. Postal Service™, U.S. Census Bureau, National Weather Service, American Community Survey, and the IRS, to add socioeconomic information based on patient location. It was then further enriched, also at the zip code level, with toxicology data from NASA/Columbia University, to explore the relationship between health outcomes and toxic air conditions.
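
The enrichment described above amounts to a left join on the zip code prefix. A minimal sketch, assuming two small hypothetical tables (`patients` and `zip_stats`, with made-up values) rather than the actual HV or US Zip Codes data:

```python
import pandas as pd

# Hypothetical patient-level records (values are illustrative only).
patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "patient_zip3": [190, 606, 999],
})

# Hypothetical zip3-level attributes standing in for the enrichment sources.
zip_stats = pd.DataFrame({
    "patient_zip3": [190, 606],
    "population": [20000.0, 35000.0],
    "ozone": [40.1, 38.7],
})

# A left join keeps every patient; zip3 values with no match get NaN.
enriched = patients.merge(zip_stats, on="patient_zip3", how="left")
print(enriched)
```

Patients whose zip3 has no entry in the lookup table keep their row but receive missing values in the enriched columns, which is one plausible source of the NaNs inspected later in the notebook.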

### Target

- `DiagPeriodL90D`: Diagnosis Period Less Than 90 Days. Indicates whether the cancer was diagnosed within 90 days.

---------------------------------

### Patient Related info

#### Identifier:
- `patient_id` - Unique identification number of patient

#### Physical parameters of a patient: 
- `patient_race` - Asian, African American, Hispanic or Latino, White, Other Race
- `patient_age` - Derived from Patient Year of Birth (index year minus year of birth)
- `patient_gender` - Patient gender (F, M) on the metastatic date
- `bmi` - If available, the earliest BMI recording after the metastatic date

#### Diagnosis related info:
- `breast_cancer_diagnosis_code` - ICD10 or ICD9 diagnoses code
- `breast_cancer_diagnosis_desc` - ICD10 or ICD9 code description. This column is raw text and may require NLP/ processing and cleaning
- `metastatic_cancer_diagnosis_code` - ICD10 diagnoses code

#### Treatment related info:
- `metastatic_first_novel_treatment` - Generic drug name of the first novel treatment (e.g. "Cisplatin") after metastatic diagnosis
- `metastatic_first_novel_treatment_type` - Description of Treatment (e.g. Antineoplastic) of first novel treatment after metastatic diagnosis

#### Payment type: 
- `payer_type` - Payer type (Medicaid, Commercial, Medicare) on the metastatic date

---------------------------------

### Geolocation related info

#### Geographical location of a patient:
- `patient_state` - Patient State (e.g. AL, AK, AZ, AR, CA, CO etc…) on the metastatic date
- `patient_zip3` - Patient Zip3 (e.g. 190) on the metastatic date
- `region` - Region of patient location
- `division` - Division of patient location

#### Air Quality in patient's Geolocation:
- `ozone` - Annual Ozone (O3) concentration data at Zip3 level. This data shows how air quality data may impact health.
- `PM25` - Annual Fine Particulate Matter (PM2.5) concentration data at Zip3 level. This data shows how air quality data may impact health.
- `N02` - Annual Nitrogen Dioxide (NO2) concentration data at Zip3 level. This data shows how air quality data may impact health.

---------------------------------

### Population related info in patient's geolocation

##### General:
- `population` - An estimate of the zip code's population.
- `density` - The estimated population per square kilometer.
- `poverty` - The percentage of residents living in poverty.
- `commute_time` - The median commute time of resident workers in minutes.

##### Age: 
- `age_median` - The median age of residents in the zip code.
- `age_under_10` - The percentage of residents aged 0-9.
- `age_10_to_19` - The percentage of residents aged 10-19.
- `age_20s` - The percentage of residents aged 20-29.
- `age_30s` - The percentage of residents aged 30-39.
- `age_40s` - The percentage of residents aged 40-49.
- `age_50s` - The percentage of residents aged 50-59.
- `age_60s` - The percentage of residents aged 60-69.
- `age_70s` - The percentage of residents aged 70-79.
- `age_over_80` - The percentage of residents aged over 80.

##### Gender: 
- `male` - The percentage of residents who report being male (e.g. 55.1).
- `female` - The percentage of residents who report being female (e.g. 44.9).

##### Race or ethnicity:
- `race_multiple` - The percentage of residents who report their race as Two or more races.
- `race_white` - The percentage of residents who report their race White.
- `race_black` - The percentage of residents who report their race as Black or African American.
- `race_asian` - The percentage of residents who report their race as Asian.
- `race_native` - The percentage of residents who report their race as American Indian and Alaska Native.
- `race_pacific` - The percentage of residents who report their race as Native Hawaiian and Other Pacific Islander.
- `race_other` - The percentage of residents who report their race as Some other race.
- `hispanic` - The percentage of residents who report being Hispanic. Note: Hispanic is considered to be an ethnicity and not a race.

##### Health determining situation:
- `health_uninsured` - The percentage of residents who report not having health insurance.
- `disabled` - The percentage of residents who report a disability.
- `veteran` - The percentage of residents who are veterans.

##### Social status:
- `married` - The percentage of residents who report being married (e.g. 44.9).
- `divorced` - The percentage of residents divorced.
- `never_married` - The percentage of residents never married.
- `widowed` - The percentage of residents who are widowed.

##### Family: 
- `family_size` - The average size of resident families (e.g. 3.22).

##### Home ownership: 
- `home_ownership` - Percentage of households that own (rather than rent) their residence.
- `housing_units` - The number of housing units (or households) in the zip code.
- `home_value` - The median value of homes that are owned by residents.

##### Rent:
- `rent_median` - The median rent paid by renters.
- `rent_burden` - The median rent as a percentage of the median renter's household income.
    
##### Education:
- `education_college_or_above` - The percentage of residents with at least a 4-year degree.
- `education_less_highschool` - The percentage of residents with less than a high school education.
- `education_highschool` - The percentage of residents with a high school diploma but no more.
- `education_some_college` - The percentage of residents with some college but no more.
- `education_bachelors` - The percentage of residents with a bachelor's degree (or equivalent) but no more.
- `education_graduate` - The percentage of residents with a graduate degree.
- `education_stem_degree` - The percentage of college graduates with a Bachelor's degree or higher in a Science and Engineering (or related) field.
- `limited_english` - The percentage of residents who only speak limited English.

##### Employment:
- `labor_force_participation` - The percentage of residents 16 and older in the labor force.
- `unemployment_rate` - The percentage of residents unemployed.
- `self_employed` - The percentage of households reporting self-employment income on their 2016 IRS tax return.
   
  
##### Household income:
- `income_household_median` - Median household income in USD.
- `income_household_six_figure` - Percentage of households that earn at least $100,000 (e.g. 25.3)
- `family_dual_income` - The percentage of families with dual income earners.
- `income_household_under_5` - The percentage of households with income under $5,000.
- `income_household_5_to_10` - The percentage of households with income from $5,000-$10,000.
- `income_household_10_to_15` - The percentage of households with income from $10,000-$15,000.
- `income_household_15_to_20` - The percentage of households with income from $15,000-$20,000.
- `income_household_20_to_25` - The percentage of households with income from $20,000-$25,000.
- `income_household_25_to_35` - The percentage of households with income from $25,000-$35,000.
- `income_household_35_to_50` - The percentage of households with income from $35,000-$50,000.
- `income_household_50_to_75` - The percentage of households with income from $50,000-$75,000.
- `income_household_75_to_100` - The percentage of households with income from $75,000-$100,000.
- `income_household_100_to_150` - The percentage of households with income from $100,000-$150,000.
- `income_household_150_over` - The percentage of households with income over $150,000.
- `income_individual_median` - The median income of individuals in the zip code.
- `farmer` - The percentage of households reporting farm income on their 2016 IRS tax return.
    

In [None]:
import pandas as pd
import numpy as np

: 

In [None]:
pd.set_option('display.max_columns', None)

: 

In [None]:
train = pd.read_csv('training.csv')
train

: 

In [None]:
test = pd.read_csv('test.csv')
test

: 

In [None]:
test.isna().sum().sort_values(ascending=False)[:15]

: 

In [None]:
train.isna().sum().sort_values(ascending=False)[:15]

: 

In [None]:
train.drop([
    'metastatic_first_novel_treatment', 
    'metastatic_first_novel_treatment_type', 
    'breast_cancer_diagnosis_desc', 
    'bmi'], 
    axis=1, 
    inplace=True
    )

test.drop([
    'metastatic_first_novel_treatment', 
    'metastatic_first_novel_treatment_type', 
    'breast_cancer_diagnosis_desc', 
    'bmi'], 
    axis=1, 
    inplace=True
    )

: 

In [None]:
for column in train.select_dtypes(include='object').columns: 
    train[column] = train[column].astype('category')

train.select_dtypes(include='category')

: 

In [None]:
# Number of patients for each (state, race) combination
race_by_state = train.groupby(['patient_state', 'patient_race'])['patient_id'].count().reset_index()
# Total number of patients per state, repeated on every row of that state
race_by_state['total'] = race_by_state.groupby('patient_state')['patient_id'].transform('sum')
race_by_state['percentage'] = race_by_state['patient_id'] / race_by_state['total']
race_by_state.sort_values(by=['total', 'patient_id'], ascending=False)[:20]

: 

### Fill NaN in State, Region and Division

In [None]:
# read_html returns one DataFrame per HTML table found on the page
zip_prefix_tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_ZIP_Code_prefixes#Notes")

zip_state_dict = {}
# Walk every cell of every table (rows, not just the first few)
for table in zip_prefix_tables:
    for column in table.columns:
        for cell in table[column]:
            try:
                key = int(cell[:3])
                if len(str(key)) == 3:  # Keep only keys that are three digits long
                    # Assuming the state code always follows the space after the prefix
                    zip_state_dict[key] = cell[4:6]
            except (ValueError, TypeError):
                # Skip empty cells and cells that do not start with a number
                pass


# Look up the state for every zip3; prefixes missing from the dictionary become NaN
test['patient_state'] = test['patient_zip3'].map(zip_state_dict)
train['patient_state'] = train['patient_zip3'].map(zip_state_dict)

: 
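
The cell above replaces every `patient_state` value with the looked-up one. If the goal is only to fill the gaps, as the heading suggests, a variant could keep existing values and use the lookup just for missing entries. A minimal sketch, assuming a tiny hypothetical lookup (`zip_state_lookup`) and made-up rows rather than the real data:

```python
import pandas as pd

# Hypothetical lookup and data; the real dictionary is built from the Wikipedia tables.
zip_state_lookup = {190: "PA", 606: "IL"}
df = pd.DataFrame({
    "patient_zip3": [190, 606, 606, 123],
    "patient_state": ["PA", None, "IL", None],
})

# Map every zip3 to a state, then use it only where patient_state is missing.
looked_up = df["patient_zip3"].map(zip_state_lookup)
df["patient_state"] = df["patient_state"].fillna(looked_up)
print(df)
```

Rows whose zip3 is absent from the lookup stay missing. If the column has already been cast to `category`, it may need to be converted back to a plain string column before calling `fillna`.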

In [None]:
states = pd.read_csv('states.csv')
states.head()

: 

In [None]:
# Lookup tables from state code to census region and division
reg = states.set_index('State Code')['Region'].to_dict()
div = states.set_index('State Code')['Division'].to_dict()

train['Region'] = train['patient_state'].map(reg)
train['Division'] = train['patient_state'].map(div)  # use the division lookup, not the region one

test['Region'] = test['patient_state'].map(reg)
test['Division'] = test['patient_state'].map(div)


: 

Drop the rows with too many missing values.

In [None]:
train.drop(10542, inplace=True)
train.drop(list(train[train.family_size.isna()].index), inplace=True)

test.drop(1622, inplace=True)

: 
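
The row labels dropped above are hard-coded. One way such rows could be located is by measuring the share of missing values per row. A minimal sketch on a made-up frame, where the 0.5 cut-off is an assumption chosen for illustration and not a value taken from this notebook:

```python
import numpy as np
import pandas as pd

# Made-up data: the middle row is entirely missing.
df = pd.DataFrame({
    "population": [1000.0, np.nan, 2500.0],
    "density": [50.0, np.nan, 80.0],
    "family_size": [3.1, np.nan, np.nan],
})

# Fraction of missing values in each row.
missing_share = df.isna().mean(axis=1)

# Hypothetical cut-off: drop rows where more than half of the columns are missing.
threshold = 0.5
sparse_rows = missing_share[missing_share > threshold].index
cleaned = df.drop(index=sparse_rows)
print(list(sparse_rows))
```

Deriving the labels this way keeps the step reproducible if the input file or its row order changes.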

### Fill NaN in Race

In [None]:
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

# Extract the features and target columns
features = train.drop(['patient_race', 'payer_type'], axis=1)
# Copy the target so the in-place drop below does not touch the column inside 'train'
target_race = train['patient_race'].copy()
# target_payer = train['payer_type']
cat_features = ['patient_state', 'patient_gender', 'Region', 'Division', 'breast_cancer_diagnosis_code', 'metastatic_cancer_diagnosis_code']

# Identify the indices where 'patient_race' is NaN
missing_race_indices = target_race[target_race.isna()].index

# Drop rows with missing values from features
features.drop(missing_race_indices, inplace=True)

# Drop corresponding rows from the target_race
target_race.drop(missing_race_indices, inplace=True)


: 

In [None]:
len(features)

: 

In [None]:
len(target_race)

: 

In [None]:
target_race = target_race.astype('category').cat.codes
target_race.unique()

: 

In [None]:
# Split the data into training and testing sets
X_train_race, X_test_race, y_train_race, y_test_race = train_test_split(features, target_race, test_size=0.2, random_state=42)


: 

In [None]:
# Initialize the RandomUnderSampler
under_sampler = RandomUnderSampler(sampling_strategy='majority', random_state=42)

# Fit and transform the training data
X_train_resampled, y_train_resampled = under_sampler.fit_resample(X_train_race, y_train_race)

: 

In [None]:
np.bincount(y_train_resampled)

: 

In [None]:
# Define the desired number of samples for each class after oversampling
desired_samples = {
    0: 2893,  
    1: 2893, 
    2: 2893,
    3: 2893,
    4: 2893 
}

: 

In [None]:
# Initialize the RandomOverSampler
over_sampler = RandomOverSampler(sampling_strategy=desired_samples, random_state=42)

# Oversample the under-sampled data so that every class reaches the desired count
X_train_race, y_train_race = over_sampler.fit_resample(X_train_resampled, y_train_resampled)

: 

In [None]:
np.bincount(y_train_race)

: 

In [None]:
# Calculate class weights based on class frequencies in the training set
# class_weights = list(X_train_race.shape[0] / (len(np.unique(y_train_race)) * np.bincount(y_train_race)))
# class_weights

: 

In [None]:
from sklearn.utils.class_weight import compute_class_weight

: 

In [None]:
# Calculate class weights based on class frequencies in the training set
# class_counts = np.bincount(y_train_race)
# total_samples = len(y_train_race)

# # Calculate the inverse of class frequencies
# class_weights = total_samples / (len(np.unique(y_train_race)) * class_counts)

# # Normalize the weights to sum to the number of classes
# class_weights /= class_weights.sum()
# class_weights

: 
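
The commented-out cells above compute inverse-frequency class weights as `n_samples / (n_classes * class_counts)`, which is the same formula scikit-learn documents for its `'balanced'` class weights. A minimal sketch of that arithmetic on made-up labels (the array `y` is hypothetical):

```python
import numpy as np

# Hypothetical encoded labels with three imbalanced classes.
y = np.array([0, 0, 0, 0, 1, 1, 2, 2])

# Number of samples per class, indexed by class code.
class_counts = np.bincount(y)
n_classes = len(class_counts)

# Rare classes receive larger weights, frequent classes smaller ones.
class_weights = len(y) / (n_classes * class_counts)
print(class_weights)
```

With these labels the majority class gets a weight of 2/3 and each minority class 4/3, so every class contributes the same total weight.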

In [None]:
# Identify non-categorical columns
non_cat_columns = [col for col in X_train_race.columns if col not in cat_features]

# Apply Min-Max scaling to non-categorical columns
scaler = MinMaxScaler()
X_train_race_scaled = X_train_race.copy()
X_train_race_scaled[non_cat_columns] = scaler.fit_transform(X_train_race[non_cat_columns])

X_test_race_scaled = X_test_race.copy()
X_test_race_scaled[non_cat_columns] = scaler.transform(X_test_race[non_cat_columns])


: 
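
The scaling cell fits the scaler on the training split and reuses it on the held-out split. A minimal sketch of the same min-max arithmetic done by hand, on made-up ages (the frames `train_part` and `test_part` are hypothetical):

```python
import pandas as pd

train_part = pd.DataFrame({"patient_age": [40.0, 60.0, 80.0]})
test_part = pd.DataFrame({"patient_age": [50.0, 90.0]})

# Statistics come from the training split only.
col_min = train_part.min()
col_range = train_part.max() - col_min

# The same minimum and range are applied to both splits.
train_scaled = (train_part - col_min) / col_range
test_scaled = (test_part - col_min) / col_range
print(test_scaled)
```

Held-out values outside the training range land outside [0, 1] (here 90 maps to 1.25). That is expected when the scaler is fitted on the training split alone.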

In [None]:
# Define the parameter grid for hyperparameter search
param_grid = {
    'iterations': [250, 300, 350, 400, 450],
    'depth': [3, 4, 5, 6, 7],
    'learning_rate': [0.005, 0.02, 0.05, 0.1, 0.15],
}

# Create CatBoost classifier
base_classifier = CatBoostClassifier(loss_function='MultiClass', cat_features=cat_features)

# Use RandomizedSearchCV for hyperparameter optimization
grid_search = RandomizedSearchCV(base_classifier, param_distributions=param_grid, n_iter=20, cv=3, random_state=42, scoring='accuracy', n_jobs=-1)

# Fit the classifier for 'patient_race' using scaled features and hyperparameter optimization
grid_search.fit(X_train_race_scaled, y_train_race)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

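# Predict on the entire dataset, applying the same scaling that was used for training
features_scaled = features.copy()
features_scaled[non_cat_columns] = scaler.transform(features[non_cat_columns])
predicted_race_values = grid_search.predict(features_scaled)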


: 


In [None]:
best_params

: 

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# The held-out labels are already integer-encoded (see target_race above), so no re-encoding is needed
y_true = y_test_race
y_true_numerical = y_true

# Predict on the scaled hold-out features; np.ravel flattens a column-shaped prediction array
y_pred = np.ravel(grid_search.predict(X_test_race_scaled))

# Calculate accuracy
accuracy = accuracy_score(y_true_numerical, y_pred)
print(f"Accuracy: {accuracy:.2%}")

# Calculate precision, recall, and F1 score
precision = precision_score(y_true_numerical, y_pred, average='weighted')
recall = recall_score(y_true_numerical, y_pred, average='weighted')
f1 = f1_score(y_true_numerical, y_pred, average='weighted')

print(f"Precision: {precision:.2%}")
print(f"Recall: {recall:.2%}")
print(f"F1 Score: {f1:.2%}")

# Confusion matrix
conf_matrix = confusion_matrix(y_true_numerical, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Classification report
class_report = classification_report(y_true_numerical, y_pred)
print("Classification Report:")
print(class_report)



: 

In [None]:
# Create CatBoost classifier for 'patient_race'
# race_classifier = CatBoostClassifier(iterations=200, depth=5, learning_rate=0.1, loss_function='MultiClass', cat_features=cat_features)


: 

In [None]:
# Fit the classifier for 'patient_race'
# race_classifier.fit(X_train_race, y_train_race)

: 

In [None]:
# Predict missing values for 'patient_race'
# predicted_race_values = race_classifier.predict(features)

: 

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# y_test_race holds the integer-encoded true labels for the hold-out split
y_true = y_test_race
# predicted_race_values is an array of predictions, not a model, so predict with the fitted search object
y_pred = np.ravel(grid_search.predict(X_test_race_scaled))

# Calculate accuracy
accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.2%}")

# Calculate precision, recall, and F1 score
precision = precision_score(y_true, y_pred, average='weighted')
recall = recall_score(y_true, y_pred, average='weighted')
f1 = f1_score(y_true, y_pred, average='weighted')

print(f"Precision: {precision:.2%}")
print(f"Recall: {recall:.2%}")
print(f"F1 Score: {f1:.2%}")

# Confusion matrix
conf_matrix = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(conf_matrix)


: 

In [None]:
# # Update the 'patient_race' column with predicted values
# train.loc[missing_race_indices, 'patient_race'] = predicted_race_values

# # Split the data into training and testing sets for 'payer_type'
# X_train_payer, X_test_payer, y_train_payer, y_test_payer = train_test_split(features.drop(['patient_race', 'payer_type'], axis=1), target_payer, test_size=0.2, random_state=42)

# # Create CatBoost classifier for 'payer_type'
# payer_classifier = CatBoostClassifier(iterations=100, depth=5, learning_rate=0.1, loss_function='MultiClass', cat_features=['patient_state', 'patient_gender', 'Region', 'Division'])

# # Fit the classifier for 'payer_type'
# payer_classifier.fit(X_train_payer, y_train_payer)

# # Predict missing values for 'payer_type'
# predicted_payer_values = payer_classifier.predict(features.loc[missing_payer_indices].drop(['patient_race', 'payer_type'], axis=1))

# # Update the 'payer_type' column with predicted values
# train.loc[missing_payer_indices, 'payer_type'] = predicted_payer_values

: 