### Univariate Distribution of Features

#### Load and Examine Data

In [None]:
# Import libraries and modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

In [None]:
# Read in data, examine first few rows
df = pd.read_csv("../data/train.csv")
pd.set_option('display.max_columns', None)
df.head(5)

In [None]:
print("This data frame has {} rows and {} columns".format(df.shape[0], df.shape[1]))

##### Check for Missing Values

In [None]:
# Check the percent of values in each column that are missing.
# We see that there are eight features with 20% or more missing values,
# and two features with over 50% missing values.
# The two columns which encode our target (efs and efs_time)
# have no missing values.

pd.DataFrame(df.isna().sum()/df.shape[0] * 100).reset_index() \
    .rename(columns={"index":"Feature", 0:"Percent Missing"}) \
    .sort_values(by="Percent Missing", ascending=False)

In [None]:
# For now, we will keep all rows and columns--even those with missing data.
# We drop the ID column, which does not contain information useful for modeling.
# We also replace the numerical values in the efs column
# with text values, which are easier to interpret.

df = df.drop("ID", axis=1)
df['efs'] = df['efs'].replace({0:"Censored", 1:"Event"})

##### Basic Descriptive Statistics for Numeric Columns

In [None]:
# Check which datatypes exist in the data.
# We have dtype 'O' (for 'object', which pandas uses for string columns)
# as well as integer and floating-point datatypes.
# We will confirm that the datatype for each column makes sense
# when we examine the columns individually.
df.dtypes.unique()

In [None]:
# Compute basic descriptive statistics for numerical variables.
# It appears that many of these features take just a few 
# integer values. 

df_numeric = df.select_dtypes(['float64', 'int64'])
df_numeric.describe()

#### Define Helper Functions to Summarize and Plot Features

In [None]:
# Function that takes the name of a discrete feature
# and produces a barplot of the number of 
# cases for each value of the feature.
# May be used for either categorical features
# or integer features which only take a
# few distinct values.

def plot_discrete_feature(feat_name, df=df, tick_angle = 0, figsize=(5, 3)):
    fig, ax = plt.subplots(figsize=figsize)
    cat_order = None
    if df[feat_name].dtype == 'O':
        cat_order = df[feat_name].value_counts().index.to_list()
    sns.countplot(data=df, x=feat_name, order=cat_order, ax=ax)
    plt.xlabel(feat_name)
    plt.ylabel("Number of cases")
    plt.title("Number of cases by {}".format(feat_name))
    ax.tick_params(axis='x', rotation=tick_angle)
    plt.show()

In [None]:
# Function that takes the name of a discrete feature
# and returns a dataframe with the percentage
# of cases that take on each level of the feature.
# Note that this ignores any cases where the
# value of the feature is missing.

def get_percentages(feat_name, df=df):
    percentages = df[feat_name].value_counts()/df[feat_name].count() * 100
    formatted = round(pd.DataFrame(percentages).reset_index(), 2) \
         .rename(columns={"count":"percent"}) 
    
    if df[feat_name].dtype == 'O':
        formatted = formatted.sort_values(by="percent", ascending=False)
    else:
        formatted = formatted.sort_values(by = feat_name)
        
    return formatted

In [None]:
# Function that takes the name of a numeric feature
# and produces a figure with two subplots:
# a boxplot of the feature, and a histogram.
# Boxplots are often useful for detecting outliers,
# while histograms give more insight into the
# overall shape of a distribution.

def plot_numeric_feature(feat_name, data = df, bins=10, kde=True, discrete=False):
    fig, ax = plt.subplots(1, 2, figsize=(10, 3))
    fig.suptitle("Distribution of {}".format(feat_name))
    # Use the `data` argument so that a dataframe passed by the caller is respected
    sns.boxplot(data=data, y=feat_name, ax=ax[0])
    sns.histplot(data=data, x=feat_name, ax=ax[1], kde=kde, bins=bins, discrete=discrete)
    plt.subplots_adjust(wspace=0.4)
    ax[1].set_ylabel("Number of cases")
    plt.show()

In [None]:
# While this notebook is mostly for univariate analysis,
# I am including a function here which plots efs_time (survival time)
# broken down by efs (event/censored).
# These two variables can be viewed as jointly encoding
# our true modeling objective (survival time).

def plot_efs_vs_efs_time(bins=10):
    fig, ax = plt.subplots(1, 2, figsize=(12, 3))
    fig.suptitle("Distribution of efs_time by efs")
    sns.boxplot(data = df, y="efs_time", hue="efs", ax=ax[0])
    sns.histplot(data = df, x="efs_time", hue="efs", ax=ax[1], multiple='stack', bins=bins)
    plt.subplots_adjust(wspace=0.3)
    ax[0].legend([], [], frameon=False)
    ax[1].set_ylabel("Number of cases")
    plt.show()

In [None]:
len(df.columns)

#### Univariate Distributions of Features

##### Distribution of dri_score

In [None]:
# View percentage of cases by value of dri_score
get_percentages('dri_score')

In [None]:
# Plot number of cases by value of dri_score
plot_discrete_feature('dri_score', tick_angle=90, figsize=(8, 3))

**Notes:** This feature represents a categorical disease risk index. This is a relatively high-cardinality feature, with 11 levels. Of these, 5 levels are relatively rare, representing 5% or less of cases. For simplicity, we may want to combine some levels. For example, `High - TED AML case <missing cytogenetics` could be binned with `High`. Some of these levels seem to represent missing data: `TBD cytogenetics` and `Missing Disease Status` could be coded as missing data.
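The recoding suggested in these notes could be sketched roughly as follows. This is only an illustration on a small made-up series, not the real column. The helper name `recode_levels` and its parameters are hypothetical, and the level names passed in must be matched exactly to the strings that appear in the data.

```python
import numpy as np
import pandas as pd

def recode_levels(series, merge_map=None, missing_levels=()):
    """Merge selected levels into others, then turn placeholder levels into NaN."""
    # merge_map maps an old level name to the level it should be binned with
    out = series.replace(merge_map) if merge_map else series.copy()
    # Levels that really mean "unknown" are recoded as missing values
    return out.mask(out.isin(list(missing_levels)), np.nan)

# Made-up example values (not taken from the real data)
toy = pd.Series([
    "High",
    "High - TED AML case <missing cytogenetics",
    "Missing Disease Status",
    "Low",
])

recoded = recode_levels(
    toy,
    merge_map={"High - TED AML case <missing cytogenetics": "High"},
    missing_levels=["Missing Disease Status"],
)
print(recoded.value_counts(dropna=False))
```

Keeping the mapping as an argument makes it easy to review the binning decisions separately from the code that applies them.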

##### Distribution of psych_disturb

In [None]:
# View percentage of cases by value of psych_disturb
get_percentages('psych_disturb')

In [None]:
# Plot number of cases by value of psych_disturb
plot_discrete_feature('psych_disturb')

**Notes:** The value of `psych_disturb` is `Yes` in about 13% of cases where this feature is present. This is relatively rare, but potentially still worth considering. We may want to code the rare value `Not done` as missing data.
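Several features in this notebook use a `Not done` level that may be better treated as missing data. One possible way to apply that recoding to a group of columns is sketched below on made-up data. The helper name `placeholders_to_nan` and the `PLACEHOLDERS` list are assumptions for illustration and would need to be adjusted after inspecting each column.

```python
import pandas as pd

# Assumption: placeholder strings that should count as missing
PLACEHOLDERS = ["Not done"]

def placeholders_to_nan(frame, columns, placeholders=PLACEHOLDERS):
    """Return a copy of frame with placeholder strings in `columns` set to NaN."""
    out = frame.copy()
    subset = out[columns]
    # mask() replaces entries where the condition is True with NaN
    out[columns] = subset.mask(subset.isin(placeholders))
    return out

# Made-up example values (not taken from the real data)
toy = pd.DataFrame({
    "psych_disturb": ["No", "Yes", "Not done", None],
    "diabetes": ["Not done", "No", "No", "Yes"],
})

cleaned = placeholders_to_nan(toy, ["psych_disturb", "diabetes"])
print(cleaned.isna().sum())
```

After this step, `get_percentages` would report percentages over cases with a real `Yes` or `No` answer only, since it ignores missing values.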

##### Distribution of cyto_score

In [None]:
# View percentage of cases by value of cyto_score
get_percentages('cyto_score')

In [None]:
# Plot number of cases by value of cyto_score
plot_discrete_feature('cyto_score', figsize=(8, 3))

**Notes:** This categorical variable encodes the level of abnormality observed while analyzing a patient's bone-marrow cells. We may wish to code the values `TBD`, `Other` and `Not tested` as missing data. It's possible that `Normal` should be binned with `Favorable`. Analyzing how these levels correlate with survival times would help show whether this makes sense.

##### Distribution of diabetes

In [None]:
# View percentage of cases by value of diabetes
get_percentages('diabetes')

In [None]:
# Plot number of cases by value of diabetes
plot_discrete_feature('diabetes')

**Notes:** For cases where this feature is present, roughly 16% have diabetes. We may wish to treat the `Not done` category as missing data.

##### Distribution of hla_match_c_high

In [None]:
# View percentage of cases by value of hla_match_c_high
get_percentages('hla_match_c_high')

In [None]:
# Plot number of cases by value of hla_match_c_high
plot_discrete_feature('hla_match_c_high')

**Notes:** The feature represents recipient / 1st donor allele-level (high resolution) matching at HLA-C. The fact that less than 0.33% of cases have value `0` is striking. It's possible that donors with a poor match on this allele are usually excluded. We may want to examine these specific cases in more detail--they may be outliers, or unusual in some other way.

##### Distribution of hla_high_res_8

In [None]:
# View percentage of cases by value of hla_high_res_8
get_percentages('hla_high_res_8')

In [None]:
# Plot number of cases by value of hla_high_res_8
plot_discrete_feature('hla_high_res_8')

**Notes:** This feature represents recipient / 1st donor allele-level (high resolution) matching at multiple genetic loci: HLA-A, HLA-B, HLA-C and HLA-DRB1. This takes discrete integer values up to 8. However, we have essentially no values less than 4. The most common value is 8, at almost 60% of cases.

##### Distribution of tbi_status

In [None]:
# View percentage of cases by value of tbi_status
get_percentages('tbi_status')

In [None]:
# Plot number of cases by value of tbi_status
plot_discrete_feature('tbi_status', figsize=(8, 3), tick_angle = 90)

**Notes:** Since we are analyzing leukemia cases, the meaning of `TBI` in this context is likely ["total-body irradiation"](https://www.cancerresearchuk.org/about-cancer/treatment/bone-marrow-stem-cell-transplants/total-body-irradiation-tbi), which is often performed before a bone-marrow or stem-cell transplant. We will likely need to bin the feature into fewer levels--for example, by grouping all except the first two categories together into a single `Other` category.

##### Distribution of arrhythmia

In [None]:
# View percentage of cases by value of arrhythmia
get_percentages('arrhythmia')

In [None]:
# Plot number of cases by value of arrhythmia
plot_discrete_feature('arrhythmia')

**Notes:** Since the value of arrhythmia is `No` in almost 95% of cases where the feature is present, this feature may not be very useful. Unless it has a very strong relation to the target, this may be one we drop.
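To screen for low-variance features like this one in a consistent way, one option is to compute the share of non-missing cases taken by each column's most common value. The sketch below uses made-up data. The helper name `dominant_share`, the level labels, and the 90% cutoff are illustrative assumptions rather than choices made elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def dominant_share(frame):
    """Percent of non-missing rows taken by each column's most common value."""
    def share(col):
        counts = col.value_counts(normalize=True)  # sorted descending, NaN dropped
        return counts.iloc[0] * 100 if len(counts) else np.nan
    return frame.apply(share)

# Made-up example values (not taken from the real data)
toy = pd.DataFrame({
    "arrhythmia": ["No"] * 19 + ["Yes"],
    "graft_type": ["Type A"] * 10 + ["Type B"] * 10,
})

shares = dominant_share(toy)
# Assumption: flag columns where one level covers at least 90% of cases
low_variance = shares[shares >= 90].index.tolist()
print(shares.round(1))
print(low_variance)
```

A flagged feature is not necessarily useless. As noted above, a rare level with a very strong relation to the target can still be worth keeping.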

##### Distribution of hla_low_res_6

In [None]:
# View percentage of cases by value of hla_low_res_6
get_percentages('hla_low_res_6')

In [None]:
# Plot number of cases by value of hla_low_res_6
plot_discrete_feature('hla_low_res_6')

**Notes:** This feature represents recipient / 1st donor antigen-level (low resolution) matching at HLA-A,-B,-DRB1. Note that this is similar to the definition of `hla_high_res_8`, except that `hla_high_res_8` refers to high-resolution matching and this feature does not include locus HLA-C. Note also that the feature `hla_match_c_high` encodes high-resolution matching on HLA-C. Hence `hla_high_res_6` may be redundant, unless one or both of the other two features are missing.

As in the case of `hla_high_res_8` and `hla_match_c_high`, this feature is equal to the maximum value a majority of the time when it is present. Values below 3 are very rare and may be outliers.

##### Distribution of graft_type

In [None]:
# View percentage of cases by value of graft_type
get_percentages('graft_type')

In [None]:
# Plot number of cases by value of graft_type
plot_discrete_feature('graft_type')

**Notes:** This feature has only two values, and both values are reasonably well represented in the data.

##### Distribution of vent_hist

In [None]:
# View percentage of cases by value of vent_hist
get_percentages('vent_hist')

In [None]:
# Plot number of cases by value of vent_hist
plot_discrete_feature('vent_hist')

**Notes:** This feature represents history of mechanical ventilation, and is `Yes` less than 3% of the time. Because mechanical ventilation potentially indicates a serious medical issue, this may be worth exploring further. For example, do we see more `Yes` values during the peak COVID-19 years, indicating that it may be a proxy for severe COVID-19 infection?
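The question about `Yes` values by year could be explored with a simple grouped rate, sketched below on made-up data. The column names `year_hct` and `vent_hist` come from this dataset, but the values shown are invented for illustration only.

```python
import pandas as pd

# Made-up example values (not taken from the real data)
toy = pd.DataFrame({
    "year_hct": [2018, 2018, 2019, 2019, 2020, 2020],
    "vent_hist": ["No", "No", "No", "Yes", "Yes", "Yes"],
})

# Percent of non-missing cases with vent_hist == "Yes", per transplant year
yes_rate_by_year = (
    toy.dropna(subset=["vent_hist"])
       .assign(is_yes=lambda d: d["vent_hist"].eq("Yes"))
       .groupby("year_hct")["is_yes"]
       .mean() * 100
)
print(yes_rate_by_year)
```

With few cases in some years, these rates can be noisy, so it may help to look at the per-year counts alongside the percentages.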

##### Distribution of renal_issue

In [None]:
# View percentage of cases by value of renal_issue
get_percentages('renal_issue')

In [None]:
# Plot number of cases by value of renal_issue
plot_discrete_feature('renal_issue')

**Notes:** This feature encodes the presence of moderate to severe kidney issues, with 98% of the values being `No`. Since this is a low-variance feature, it may not be useful for our model. Alternatively, we may want to combine it with other rare health issues such as history of mechanical ventilation. We may wish to code `Not done` as missing data.
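One way to combine rare health indicators, as suggested above, is to count how many of them are `Yes` for each patient. The sketch below uses made-up data. The new column names `n_rare_issues` and `any_rare_issue` are hypothetical, and the list of columns to combine is an assumption that could be extended with other rare indicators.

```python
import pandas as pd

# Made-up example values (not taken from the real data)
toy = pd.DataFrame({
    "renal_issue": ["No", "Yes", "No", None],
    "vent_hist": ["No", "Yes", "Yes", "No"],
})

# Assumption: which rare indicators to combine
issue_cols = ["renal_issue", "vent_hist"]

# Missing values compare as False here, so they count as "not Yes"
is_yes = toy[issue_cols].eq("Yes")
toy["n_rare_issues"] = is_yes.sum(axis=1)
toy["any_rare_issue"] = is_yes.any(axis=1)
print(toy)
```

Treating missing values as "not Yes" is a simplification. If missingness itself is informative, it may be better to track it in a separate indicator.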

##### Distribution of pulm_severe

In [None]:
# View percentage of cases by value of pulm_severe
get_percentages('pulm_severe')

In [None]:
# Plot number of cases by value of pulm_severe
plot_discrete_feature('pulm_severe')

**Notes:** This feature encodes the presence of severe pulmonary issues, with 93% of the values being `No`. Since this is a low-variance feature, it may not be useful for our model. Alternatively, we may want to combine it with other rare health issues such as history of mechanical ventilation. We may wish to code `Not done` as missing data.

##### Distribution of prim_disease_hct

In [None]:
# View percentage of cases by value of prim_disease_hct
get_percentages('prim_disease_hct')

In [None]:
# Plot number of cases by value of prim_disease_hct
plot_discrete_feature('prim_disease_hct', figsize=(8, 3), tick_angle=90)

**Notes:** This is a high-cardinality feature. There are 17 values total. Of these, 11 account for less than 5% of cases each. We may wish to bin less-common values together into an "other" category. 
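Binning less-common values into an "other" category could be done with a frequency threshold, as in the sketch below. The data is made up, and the helper name `lump_rare_levels`, the 5% cutoff, and the `Other` label are illustrative assumptions.

```python
import pandas as pd

def lump_rare_levels(series, min_percent=5.0, other_label="Other"):
    """Replace levels covering less than min_percent of non-missing cases."""
    shares = series.value_counts(normalize=True) * 100
    rare = shares[shares < min_percent].index
    # isin() is False for NaN, so missing values are left untouched
    return series.where(~series.isin(rare), other_label)

# Made-up example values (not taken from the real data)
toy = pd.Series(["AML"] * 50 + ["ALL"] * 46 + ["X"] * 2 + ["Y"] * 2)

lumped = lump_rare_levels(toy)
print(lumped.value_counts())
```

The same helper could be reused for other high-cardinality features, with the threshold tuned per feature.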

##### Distribution of hla_high_res_6

In [None]:
# View percentage of cases by value of hla_high_res_6
get_percentages('hla_high_res_6')

In [None]:
# Plot number of cases by value of hla_high_res_6
plot_discrete_feature('hla_high_res_6')

**Notes:** This feature encodes recipient / 1st donor allele-level (high resolution) matching at HLA-A,-B,-DRB1. This is identical to `hla_low_res_6`, except that the matching is high-resolution instead of low. The distributions of the two features are also quite similar, with a majority of values equal to 6 and virtually none less than 3. These two features are likely to be very highly correlated. However, before discarding either, we may want to examine how often a patient has non-missing data in one of the two, but not both.
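To check how often a patient has non-missing data in only one of `hla_low_res_6` and `hla_high_res_6`, a cross-tabulation of missingness is one option. The sketch below uses made-up values and is only meant to show the shape of the check.

```python
import numpy as np
import pandas as pd

# Made-up example values (not taken from the real data)
toy = pd.DataFrame({
    "hla_low_res_6": [6, 6, np.nan, 5, np.nan],
    "hla_high_res_6": [6, np.nan, 6, 5, np.nan],
})

# Label each value as present or missing, then cross-tabulate the two columns
labels = {True: "missing", False: "present"}
low_state = toy["hla_low_res_6"].isna().map(labels)
high_state = toy["hla_high_res_6"].isna().map(labels)
overlap = pd.crosstab(low_state, high_state)

# Rows where exactly one of the two features is available
only_one = int((toy["hla_low_res_6"].isna() ^ toy["hla_high_res_6"].isna()).sum())
print(overlap)
print(only_one)
```

If `only_one` is large relative to the dataset, keeping both features (or deriving a combined one) may preserve information that would otherwise be lost.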

##### Distribution of cmv_status

In [None]:
# View percentage of cases by value of cmv_status
get_percentages('cmv_status')

In [None]:
# Plot number of cases by value of cmv_status
plot_discrete_feature('cmv_status')

**Notes:** This feature encodes donor/recipient CMV serostatus--that is, presence of [antibodies to CMV](https://pmc.ncbi.nlm.nih.gov/articles/PMC3512215/) in the donor's and recipient's blood. It appears that a majority of both donors and recipients are positive. However, each of the four possible combinations is represented reasonably well in the data.

##### Distribution of hla_high_res_10

In [None]:
# View percentage of cases by value of hla_high_res_10
get_percentages('hla_high_res_10')

In [None]:
# Plot number of cases by value of hla_high_res_10
plot_discrete_feature('hla_high_res_10')

**Notes:** This feature encodes recipient / 1st donor allele-level (high resolution) matching at genetic loci HLA-A,-B,-C,-DRB1,-DQB1. Note that this is identical to `hla_high_res_8` except for the presence of a new allele, `DQB1`. Since we also have a feature `hla_match_dqb1_high`, this feature may be redundant unless either `hla_high_res_8` or `hla_match_dqb1_high` is missing.
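Whether `hla_high_res_10` really equals `hla_high_res_8` plus `hla_match_dqb1_high` is an assumption based on the feature definitions, and it can be tested directly on rows where all three values are present. The sketch below shows the check on made-up values.

```python
import numpy as np
import pandas as pd

# Made-up example values (not taken from the real data)
toy = pd.DataFrame({
    "hla_high_res_8": [8, 7, 8, np.nan],
    "hla_match_dqb1_high": [2, 2, 1, 2],
    "hla_high_res_10": [10, 9, 10, 10],
})

cols = ["hla_high_res_8", "hla_match_dqb1_high", "hla_high_res_10"]
complete = toy.dropna(subset=cols)  # only rows where all three are present

# Does the 10-locus score equal the 8-locus score plus the DQB1 match?
agrees = (
    complete["hla_high_res_8"] + complete["hla_match_dqb1_high"]
    == complete["hla_high_res_10"]
)
print("Percent of complete rows where the sum rule holds:", round(agrees.mean() * 100, 1))
```

If the rule holds almost everywhere, the composite score could also be used to fill in a missing component, which is one argument for not dropping it outright.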

##### Distribution of hla_match_dqb1_high

In [None]:
# View percentage of cases by value of hla_match_dqb1_high
get_percentages('hla_match_dqb1_high')

In [None]:
# Plot number of cases by value of hla_match_dqb1_high
plot_discrete_feature('hla_match_dqb1_high')

**Notes:** This feature encodes recipient / 1st donor allele level (high resolution) matching at HLA-DQB1, one of the alleles included in `hla_high_res_10`. As with the other HLA match features we've seen so far, a majority of cases take the maximum possible value. Virtually no cases have a value of 0. These may be outliers.

##### Distribution of tce_imm_match

In [None]:
# View percentage of cases by value of tce_imm_match
get_percentages('tce_imm_match')

In [None]:
# Plot number of cases by value of tce_imm_match
plot_discrete_feature('tce_imm_match')

**Notes:** This feature encodes T-cell [epitope](https://en.wikipedia.org/wiki/Epitope) immunogenicity/diversity match. Epitopes are short sequences of amino acids which are recognized by T-cells. An epitope mismatch between donor and recipient may cause the recipient's immune system to attack the donor cells. The majority of cases in our data are matches of type P/P. Non-matching combinations between donor and recipient are rare. We may want to bin these as a `mismatched` category.

##### Distribution of hla_nmdp_6

In [None]:
# View percentage of cases by value of hla_nmdp_6
get_percentages('hla_nmdp_6')

In [None]:
# Plot number of cases by value of hla_nmdp_6
plot_discrete_feature('hla_nmdp_6')

**Notes:** This feature encodes recipient / 1st donor matching at HLA-A(lo),-B(lo),-DRB1(hi). Assuming "lo" and "hi" refer to low-resolution and high-resolution matching, this should be quite similar to the feature `hla_low_res_6`. However, it's possible that we will have patients with non-missing data in only one of these two features.

##### Distribution of hla_match_c_low

In [None]:
# View percentage of cases by value of hla_match_c_low
get_percentages('hla_match_c_low')

In [None]:
# Plot number of cases by value of hla_match_c_low
plot_discrete_feature('hla_match_c_low')

**Notes:** This feature encodes recipient / 1st donor antigen level (low resolution) matching at HLA-C. This is similar to `hla_match_c_high`, except that the resolution is low instead of high. The distribution of the two features is also very similar, with essentially no 0's and about 76% 2's. However, it's possible that we will have patients with non-missing data in only one of these two features.

##### Distribution of rituximab

In [None]:
# View percentage of cases by value of rituximab
get_percentages('rituximab')

In [None]:
# Plot number of cases by value of rituximab
plot_discrete_feature('rituximab')

**Notes:** This feature records whether the medication rituximab was used prior to the hct procedure. However, the value `Yes` only occurs in about 2% of patients. This is a low-variance feature, which may not be helpful for modeling.

##### Distribution of hla_match_drb1_low

In [None]:
# View percentage of cases by value of hla_match_drb1_low
get_percentages('hla_match_drb1_low')

In [None]:
# Plot number of cases by value of hla_match_drb1_low
plot_discrete_feature('hla_match_drb1_low')

**Notes:** This feature encodes recipient / 1st donor antigen level (low resolution) matching at HLA-DRB1. The only two values are 1 and 2, with 2 being most common. Note that HLA-DRB1 is included in many other scores (for example `hla_low_res_6`). In addition, we expect this feature to be similar to `hla_match_drb1_high`, which tests matching of the same allele. For these reasons, this feature may be partially redundant.

##### Distribution of hla_match_dqb1_low

In [None]:
# View percentage of cases by value of hla_match_dqb1_low
get_percentages('hla_match_dqb1_low')

In [None]:
# Plot number of cases by value of hla_match_dqb1_low
plot_discrete_feature('hla_match_dqb1_low')

**Notes:** The feature encodes recipient / 1st donor antigen level (low resolution) matching at HLA-DQB1. Low-resolution matches on this allele contribute to `hla_low_res_10` scores. In addition, we have a feature `hla_match_dqb1_high` which encodes high-resolution matches of the same allele, and has an almost-identical distribution. Hence this feature is at least somewhat redundant.

As with similar features, most cases take the maximum value. There are virtually no 0's, and we may want to consider such cases outliers.

##### Distribution of prod_type

In [None]:
# View percentage of cases by value of prod_type
get_percentages('prod_type')

In [None]:
# Plot number of cases by value of prod_type
plot_discrete_feature('prod_type')

**Notes:** This is a categorical feature with only two values, and both values are reasonably well represented in the data. However, the meaning of the categories is not obvious from the data dictionary.

##### Distribution of cyto_score_detail

In [None]:
# View percentage of cases by value of cyto_score_detail
get_percentages('cyto_score_detail')

In [None]:
# Plot number of cases by value of cyto_score_detail
plot_discrete_feature('cyto_score_detail')

**Notes:** This feature is described as 'Cytogenetics for DRI (AML/MDS)'. To understand this feature, we will need to see how it interacts with `dri_score`. We may want to code the values `TBD` and `Not tested` as missing data.

##### Distribution of conditioning_intensity

In [None]:
# View percentage of cases by value of conditioning_intensity
get_percentages('conditioning_intensity')


In [None]:
# Plot number of cases by value of conditioning_intensity
plot_discrete_feature('conditioning_intensity', tick_angle=90)

**Notes:** This feature encodes the type of chemotherapy given prior to the hct procedure. We may wish to code the value `TBD` as missing data. There are two very rare values indicating no drugs. Since this is such a rare condition, we may want to drop these rows, or at least examine them more closely.

##### Distribution of ethnicity

In [None]:
# View percentage of cases by value of ethnicity
get_percentages('ethnicity')

In [None]:
# Plot number of cases by value of ethnicity
plot_discrete_feature('ethnicity', figsize=(8, 3))

**Notes:** This feature encodes whether a US-based patient is of Hispanic/Latino ancestry. While a majority of cases are non-Hispanic, there is a non-negligible number of Hispanic patients. `Non-resident of the U.S.` is a separate category, presumably because other countries do not track Hispanic/Latino ancestry as a separate category. We may want to code this as missing data.

##### Distribution of year_hct

In [None]:
# Plot number of cases by value of year_hct
plot_discrete_feature('year_hct', figsize=(8, 3), tick_angle=90)

**Notes:** The cases in this dataset span just over a decade, from 2008 to 2020. The number of transplants peaks 2016-2018, and then drops off sharply in 2019. We note that using year as a feature in our predictive model would probably not make sense. However, it may be interesting to look for changes in survival time by year. If some years are very different from others, we may need to account for that in our analysis. For example, we may wish to discard the few values from 2020, since COVID-19 would potentially create very unusual conditions.
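A first look at changes in survival time by year could use a grouped summary of `efs_time`, split by `efs`. The sketch below runs on made-up values and assumes `efs` has already been recoded to the text labels `Event` and `Censored`, as done earlier in this notebook.

```python
import pandas as pd

# Made-up example values (not taken from the real data)
toy = pd.DataFrame({
    "year_hct": [2018, 2018, 2018, 2019, 2019, 2020],
    "efs": ["Event", "Event", "Censored", "Event", "Censored", "Censored"],
    "efs_time": [5.0, 7.0, 40.0, 6.0, 30.0, 12.0],
})

# Number of cases and median efs_time for each year and outcome
summary = toy.groupby(["year_hct", "efs"])["efs_time"].agg(["count", "median"])
print(summary)
```

Note that for censored cases `efs_time` is only a lower bound on survival time, so these medians are a rough descriptive summary. A Kaplan-Meier estimate per year would be a more appropriate comparison.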

##### Distribution of obesity

In [None]:
# View percentage of cases by value of obesity
get_percentages('obesity')

In [None]:
# Plot number of cases by value of obesity
plot_discrete_feature('obesity')

**Notes:** Fewer than 7% of cases with the feature `obesity` are coded as `Yes`. This is surprising at first, since obesity is a common condition in the U.S. One possible explanation is that leukemia itself causes weight loss. Since this feature is low variance, it may be of limited value to our model. As with other features, we may wish to code `Not done` as missing data.

##### Distribution of mrd_hct

In [None]:
# View percentage of cases by value of mrd_hct
get_percentages('mrd_hct')

In [None]:
# Plot number of cases by value of mrd_hct
plot_discrete_feature('mrd_hct')

**Notes:** This feature encodes the presence of [minimal residual disease](https://www.mdanderson.org/cancerwise/what-is-minimal-residual-disease--mrd--multiple-myeloma-lymphoma-leukemia-patients.h00-159383523.html) (MRD) in patients with AML or ALL. MRD occurs when a very small number of cancer cells remain in the body, even after successful treatment. These cells can be detected with certain assays, even if they do not appear in a typical biopsy. A majority of cases are negative, but there are a reasonable number of positive cases in the data as well.

##### Distribution of in_vivo_tcd

In [None]:
# View percentage of cases by value of in_vivo_tcd
get_percentages('in_vivo_tcd')


In [None]:
# Plot number of cases by value of in_vivo_tcd
plot_discrete_feature('in_vivo_tcd')

**Notes:** This feature encodes whether an in-vivo [T-cell depletion](https://jhoonline.biomedcentral.com/articles/10.1186/s13045-018-0668-3) was performed using ATG/alemtuzumab. In-vivo T-cell depletion is a procedure that helps prevent graft vs. host disease. For this feature, there is a reasonable number of both `Yes` and `No` values in the dataset.

##### Distribution of tce_match

In [None]:
# View percentage of cases by value of tce_match
get_percentages('tce_match')

In [None]:
# Plot number of cases by value of tce_match
plot_discrete_feature('tce_match', figsize=(8, 3))

**Notes:** This feature encodes whether the [T-cell epitope matching](https://pmc.ncbi.nlm.nih.gov/articles/PMC3813000/) between donor and patient is an exact match, a low-risk mismatch (permissive), or a higher-risk mismatch. The most common value is `Permissive`. This feature also has a large number of missing values. It's possible that `tce_match` is only considered when there is a possible mismatch in certain HLA alleles. Note that overall, there are only about 2500 non-permissive matches, a very small fraction of the overall data.


##### Distribution of hla_match_a_high

In [None]:
# View percentage of cases by value of hla_match_a_high
get_percentages('hla_match_a_high')

In [None]:
# Plot number of cases by value of hla_match_a_high
plot_discrete_feature('hla_match_a_high')

**Notes:** The feature encodes recipient / 1st donor allele level (high resolution) matching at HLA-A. As with other features based on HLA allele matches, there are concerns with redundancy. The feature `hla_match_a_low` encodes matching on the same allele, and there are several features which include total matches over a set of alleles including A. As with other features based on HLA alleles, a majority of values are equal to the maximum. There are virtually no 0's.

##### Distribution of hepatic_severe

In [None]:
# View percentage of cases by value of hepatic_severe
get_percentages('hepatic_severe')

In [None]:
# Plot number of cases by value of hepatic_severe
plot_discrete_feature('hepatic_severe')

**Notes:** This feature encodes the presence of moderate to severe liver issues. The value `Yes` appears in less than 6% of cases. We may consider dropping this feature or combining it with other health indicators. The value `Not done` may be coded as missing data.

##### Distribution of donor_age

In [None]:
# Plot distribution of values for donor_age
plot_numeric_feature('donor_age')

**Notes:** The distribution of donor_age has a minimum of approximately 18. It is likely not permitted for younger people to donate. The distribution has a peak around 30, and then is relatively flat through the mid-60's before sharply dropping off. The box plot does not show any outliers, but the presence of a donor over 80 years old is remarkable.

##### Distribution of prior_tumor

In [None]:
# View percentage of cases by value of prior_tumor
get_percentages('prior_tumor')

In [None]:
# Plot number of cases by value of prior_tumor
plot_discrete_feature('prior_tumor')

**Notes:** The feature encodes whether the patient had a prior solid tumor. The `Yes` values account for just over 11% of cases, making this feature somewhat low variance. We may want to encode the `Not done` category as missing data.

##### Distribution of hla_match_b_low

In [None]:
# View percentage of cases by value of hla_match_b_low
get_percentages('hla_match_b_low')

In [None]:
# Plot number of cases by value of hla_match_b_low
plot_discrete_feature('hla_match_b_low')

**Notes:** This feature encodes recipient / 1st donor antigen level (low resolution) matching at HLA-B. As with other features that encode matching on a specific HLA-allele, this feature may be somewhat redundant. We have other features that give overall matching on a set of alleles that include HLA-B. In addition, we have a feature `hla_match_b_high` which gives high-resolution match on the same allele. 

As with other features that encode HLA-allele matches, a majority take the maximum value. There are virtually no 0's.

##### Distribution of peptic_ulcer

In [None]:
# View percentage of cases by value of peptic_ulcer
get_percentages('peptic_ulcer')

In [None]:
# Plot number of cases by value of peptic_ulcer
plot_discrete_feature('peptic_ulcer')

**Notes:** This feature records the presence or absence of peptic ulcers. Since `Yes` values account for less than 1% of cases, this feature may not have much predictive power. We may consider dropping it, or combining it with other health indicators. We may also want to code the value of `Not done` as missing data.

##### Distribution of age_at_hct

In [None]:
# Plot distribution of values for age_at_hct
plot_numeric_feature('age_at_hct', kde=False)

**Notes:** Age at HCT has a bimodal distribution, with one peak near zero for pediatric cases, and a much flatter peak from roughly ages 30-65 representing adults. The distribution does not seem to have outliers per se, but it seems to drop off sharply in the mid-late 60's.

##### Distribution of hla_match_a_low

In [None]:
# View percentage of cases by value of hla_match_a_low
get_percentages('hla_match_a_low')

In [None]:
# Plot number of cases by value of hla_match_a_low
plot_discrete_feature('hla_match_a_low')

**Notes:** The feature encodes recipient / 1st donor antigen level (low resolution) matching at HLA-A. As with other features based on HLA allele matches, there are concerns with redundancy. The feature `hla_match_a_high` encodes matching on the same allele, and has a very similar distribution of values. In addition, there are several features which include total matches over a set of alleles including A. As with other features based on HLA alleles, a majority of values are equal to the maximum. There are virtually no 0's.

##### Distribution of gvhd_proph

In [None]:
# View percentage of cases by value of gvhd_proph
get_percentages('gvhd_proph')

In [None]:
# Plot number of cases by value of gvhd_proph
plot_discrete_feature('gvhd_proph', figsize=(8,3), tick_angle=90)


**Notes:** This feature encodes the treatments used to prevent graft-vs-host disease. This is a high-cardinality feature, with 12 levels that each individually account for less than 5% of cases. We may wish to bin these less-common levels into an "other" category.

##### Distribution of rheum_issue

In [None]:
# View percentage of cases by value of rheum_issue
get_percentages('rheum_issue')

In [None]:
# Plot number of cases by value of rheum_issue
plot_discrete_feature('rheum_issue')

**Notes:** This feature encodes the presence or absence of a rheumatologic issue. However, the value is `Yes` for less than 2% of cases. We may want to drop this feature or combine it with other health indicators in some way.

##### Distribution of sex_match

In [None]:
# View percentage of cases by value of sex_match
get_percentages('sex_match')

In [None]:
# Plot number of cases by value of sex_match
plot_discrete_feature('sex_match')


**Notes:** This feature encodes donor/recipient sex match. Interestingly, it appears that there are more male recipients than female overall. The data does not seem to show a strong preference for sex match between donor and recipient.

##### Distribution of hla_match_b_high

In [None]:
# View percentage of cases by value of hla_match_b_high
get_percentages('hla_match_b_high')

In [None]:
# Plot number of cases by value of hla_match_b_high
plot_discrete_feature('hla_match_b_high')

**Notes:** This feature encodes recipient / 1st donor allele level (high resolution) matching at HLA-B. As with other features that encode matching on a specific HLA allele, this feature may be somewhat redundant. We have other features that give overall matching on a set of alleles that include HLA-B. In addition, we have a feature `hla_match_b_low` which gives the low-resolution match on the same allele, and has a very similar distribution to `hla_match_b_high`.

As with other features that encode HLA-allele matches, a majority take the maximum value. There are virtually no 0's.

##### Distribution of race_group

In [None]:
# View percentage of cases by value of race_group
get_percentages('race_group')

In [None]:
# Plot number of cases by value of race_group
plot_discrete_feature('race_group', tick_angle=90)

**Notes:** The distribution of racial groups in the data is surprisingly well-balanced. Given that one of the goals of the project is to improve equity across demographic groups, it seems reasonable to suppose that the data was deliberately stratified by race.

##### Distribution of comorbidity_score

In [None]:
# View percentage of cases by value of comorbidity score
get_percentages('comorbidity_score')

In [None]:
plot_numeric_feature('comorbidity_score', kde=False, discrete=True)

**Notes:** This feature encodes the Sorror comorbidity score, which takes integer values from 0 to 10. This is a right-skewed distribution, with values clustered near 0 and a median of 1. While box plots may not be the most appropriate for a discrete variable with relatively few values, it is interesting to note that a box plot flags any score of 6 or more as an outlier.
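The outlier flags in a box plot come from the usual 1.5 x IQR rule, which is the default whisker setting in matplotlib and seaborn. The sketch below reproduces that rule on a made-up set of scores. The counts are invented and only chosen to give a right-skewed shape similar to the one described above, and the helper name `tukey_fences` is hypothetical.

```python
import pandas as pd

def tukey_fences(values, k=1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up scores (not taken from the real data): right-skewed, median of 1
scores = pd.Series(
    [0] * 40 + [1] * 25 + [2] * 15 + [3] * 10 + [4] * 4 + [5] * 3 + [6] * 2 + [10]
)

lower, upper = tukey_fences(scores)
flagged = scores[(scores < lower) | (scores > upper)]
print(lower, upper)
print(sorted(flagged.tolist()))
```

In this toy example the quartiles are 0 and 2, so the upper limit is 5 and every score of 6 or more is flagged. For a discrete score like this one, the flags say more about the skew of the distribution than about data errors.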

##### Distribution of karnofsky_score

In [None]:
# View percentage of cases by value of karnofsky_score
get_percentages('karnofsky_score')

In [None]:
# Plot distribution of values for karnofsky_score
plot_discrete_feature('karnofsky_score')

**Notes:** The feature encodes the [Karnofsky performance scale](https://www.npcrc.org/files/news/karnofsky_performance_scale.pdf), indicating how much a disease interferes with daily life. The most common score is 90, indicating normal functioning and minimal symptoms. A second peak at 70 corresponds to a person able to care for themselves, but not participate in normal activities such as work.

##### Distribution of hepatic_mild

In [None]:
# View percentage of cases by value of hepatic_mild
get_percentages('hepatic_mild')

In [None]:
# Plot number of cases by value of hepatic_mild
plot_discrete_feature('hepatic_mild')

**Notes:** This feature encodes the presence of mild hepatic (liver) issues. The `Yes` value corresponds to only 7% of cases, which makes this a relatively low-variance feature. We may want to combine this feature with other health indicators, or drop it entirely. As with other features, we may want to encode the value `Not done` as missing data.

##### Distribution of tce_div_match

In [None]:
# View percentage of cases by value of tce_div_match
get_percentages('tce_div_match')

In [None]:
# Plot number of cases by value of tce_div_match
plot_discrete_feature('tce_div_match', figsize=(8, 3), tick_angle=90)

**Notes:** Similar to `tce_match`, `tce_div_match` seems to encode whether the [T-cell epitope matching](https://pmc.ncbi.nlm.nih.gov/articles/PMC3813000/) between donor and patient is a low-risk mismatch (permissive), or a higher-risk mismatch. The most common value is `Permissive`. The difference between this feature and `tce_match` is not immediately clear from the data dictionary. Like `tce_match`, this feature also has a large number of missing values. It's possible that `tce_div_match` is only considered when there is a possible mismatch in certain HLA alleles.

##### Distribution of donor_related

In [None]:
# View percentage of cases by value of donor_related
get_percentages('donor_related')

In [None]:
# Plot number of cases by value of donor_related
plot_discrete_feature('donor_related', figsize=(8, 3))

**Notes:** The dataset has a reasonable balance between related and unrelated donors. The `Multiple donor (non-UCB)` category is rare. We may want to drop those records, or combine that category with one of the two main ones.

##### Distribution of melphalan_dose

In [None]:
# View percentage of cases by value of melphalan_dose
get_percentages('melphalan_dose')

In [None]:
# Plot number of cases by value of melphalan_dose
plot_discrete_feature('melphalan_dose')

**Notes:** The feature records whether melphalan was given as a treatment prior to hct. This feature has no rare levels.