# Machine Learning Pipeline - Data Analysis

In this notebooks, we will go through the implementation of each of the steps in the Machine Learning Pipeline. 

We will discuss:

1. **Data Analysis**
2. **Feature Engineering**
3. Feature Selection
4. Model Training
5. Obtaining Predictions / Scoring

# Project Context

This radiomics experiment is focused on connecting tissue properties with imaging patterns. The data originated from two MRI-localized biopsy cohorts. GBM patients were conscented and enrolled in Columbia and Mayo Clinic biopsy collection programs where multiple image localized biopsies were extracted from their tumor prior to gross or subtotal resection. Samples were sent to pathology for tissue analysis. In this work we focus on Ki67, a marker of cell proliferation. On the other side we have coregistered imaging data associated with the same time point as tissue extraction, including qualitative MRI sequences such as T1-weighted post contrast injection (T1Gd), T2-weighted (T2), and a quantitative MRI, apparent diffusion coefficients (ADC). 

Each of these image type were preprocessed offline, appropriate for the type of MRI that they belonged to, and quantitative features were extracted from a small area around each biopsy location using the pyradiomics pipeline.
The result of this analysis is a input csv file where each row data related to a unique biopsy, including  pyradiomics imaging features from 3 MRI types as well as the target, and some potentially biologically relevant features such as patient sex, age at death if available, type of tumor: recurrent or primary, and the source institution.

===================================================================================================

## Predicting Ki67 abbundance

The aim of the project is to build a machine learning model to predict the abbundance of ki67 in biopsies based on different imaging features describing patterns around the biopsies.


### Why is this important? 

Predicting ki67 is useful to identify if imaging patterns explain proliferation is biospy samples. We know that proliferation is elevated where tumor cells are present. If we can predict where proliferation happens we can basically create maps corresponding to the spatial distribution of tumor cells across whole tumors. So basically identifying where tumor cells are. These maps can theoretically inform radiation plans,, improving the efficacy of radiation therapy.


### What is the objective of the machine learning model?

We aim to minimise the difference between the real and the estimated abbundance of the target by our model. We will evaluate model performance with the:

1. mean squared error (mse)
2. root squared of the mean squared error (rmse)
3. r-squared (r2).


### How do I download the dataset?

you cant. it is proprietary data, sorry.


# Data Analysis

Let's load the dataset.

In [1]:
# to handle paths
import os

# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# for the yeo-johnson transformation
import scipy.stats as stats

# to display all the columns of the dataframe in the notebook
pd.pandas.set_option('display.max_columns', None)

In [2]:
# load dataset
datadir = os.path.join(os.getcwd(), 'dataset')
data = pd.read_csv(os.path.join(datadir, 'cumc+csbc_t1gd+t2+adc_predKi67ForMayo_pyradiomics.csv'))

# rows and columns of the data
print(data.shape)

(318, 5251)


In [None]:
# drop Anon id, it is just a number given to identify each sample
# source too, it is just the institution name
dropcols = ['AnonID', 'Source']
for c in dropcols:
    data.drop(c, axis=1, inplace=True)

# set index to the unique biopsy id for each row
data.set_index('biopsyImage', inplace=True)
data.shape

The dataset contains 318 rows, that is, biopsy samples, and 5248 columns, i.e., variables. 

Some will be predictive, a lot are going to be redundant, some are sample descriptive, and one is the target variable: Ki67 LI

## Analysis

**We will analyse the following:**

1. The target variable
2. Variable types (categorical and numerical)
3. Missing data
4. Numerical variables
    - Discrete
    - Continuous
    - Distributions
    - Transformations

5. Categorical variables
    - Special mappings
    

## Target

Let's begin by exploring the target distribution. First lets check there are no missing values in target, if there is drop the row altogether.

In [None]:
target = 'Ki67 LI'
if data[target].isnull().sum() == 0:
    print('no missing target')
else:
    df = df.dropna(subset=[target])

print(data.shape)

In [None]:
# histogram to evaluate target distribution
data[target].hist(bins=50, density=True)
plt.ylabel('counts')
plt.xlabel(target)
plt.show()

We can see that the target is continuous, and the distribution is skewed towards the left. We have one extreme outlier also, which is going to complicate things. But dropping samples is never a good idea, specially in a case like this where samples are scarse.

Lets see if we can improve the spread of target with a transformation.

In [None]:
# transforming the target using the logarithm is not possible
# since there are 0s in the target. 
# we will use yeo-johnson instead
target_tr_vals, _ = stats.yeojohnson(data[target])
target_tr_vals = list(target_tr_vals)
#np.log(data[target]).hist(bins=50, density=True)
plt.hist(target_tr_vals, bins=50, density=True)
plt.ylabel('counts')
plt.xlabel('%s - transformed' % target)
plt.show()

much better.

## Variable Types

Next, let's identify the categorical and numerical variables

In [None]:
# let's identify the categorical variables
# we will capture those of type *object*

cat_vars = [var for var in data.columns if data[var].dtype == 'O']
print(len(cat_vars))
print(cat_vars)

In [None]:
# cast all cat variables as categorical
data[cat_vars] = data[cat_vars].astype('O')

In [None]:
# now let's identify the numerical variables
num_vars = [
    var for var in data.columns if var not in cat_vars and var != target
]

# number of numerical variables
len(num_vars)

ok.

# Missing values

Let's go ahead and find out which variables of the dataset contain missing values.

In [None]:
# make a list of the variables that contain missing values
vars_with_na = [var for var in data.columns if var!=target and data[var].isnull().sum() > 0]
print('%d variables with missing values' % len(vars_with_na))
# determine percentage of missing values (expressed as decimals)
# and display the result ordered by % of missin data

data[vars_with_na].isnull().mean().sort_values(ascending=False)

Our dataset contains one variables with a big proportion of missing values (Age at Dealth). And a whole lot of other variables with a small percentage of missing observations.

This means that to train a machine learning model with this data set, we need to impute the missing data in these variables.

We can also visualize the percentage of missing values in the variables as follows:

### ploting cell : 
### dont plot though, 5000 vars are alot

data[vars_with_na].isnull().mean().sort_values(
    ascending=False).plot.bar(figsize=(10, 4))
plt.ylabel('Percentage of missing data')
plt.axhline(y=0.50, color='r', linestyle='-')
plt.axhline(y=0.10, color='g', linestyle='-')

plt.show()

In [None]:
# now we can determine which variables, from those with missing data,
# are numerical and which are categorical

cat_na = [var for var in cat_vars if var in vars_with_na]
num_na = [var for var in num_vars if var in vars_with_na]


print('Number of categorical variables with na: ', len(cat_na))
print('Number of numerical variables with na: ', len(num_na))

## Relationship between missing data and Ki67

Let's evaluate the target in samples where the information is missing. We will do this for each variable that shows missing data.

In [None]:
def analyse_na_value(df, var):

    # copy of the dataframe, so that we do not override the original data
    # see the link for more details about pandas.copy()
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html
    df = df.copy()

    # let's make an interim variable that indicates 1 if the
    # observation was missing or 0 otherwise
    df[var] = np.where(df[var].isnull(), 1, 0)

    # let's compare the median SalePrice in the observations where data is missing
    # vs the observations where data is available

    # determine the median abundance of target in the groups 1 and 0,
    # and the standard deviation of the target,
    # and we capture the results in a temporary dataset
    tmp = df.groupby(var)[target].agg(['mean', 'std'])

    # plot into a bar graph
    tmp.plot(kind="barh", y="mean", legend=False,
             xerr="std", title=target, color='green')

    plt.show()

In [None]:
# let's run the function on each variable with missing data
for var in num_na[:5]:
    analyse_na_value(data, var)


Seems like in numeric variables, the average target in samples with missing is lowerfrom cases with the variable is present. This could suggest that data being missing could be a good predictor of target. However, confidence intervals overlap though, so take this with a grain of salt. 

## Temporal variables

We dont have temporal variables in this dataset.

## Discrete variables

Let's go ahead and find which variables are discrete, i.e., show a finite number of values

In [None]:
#  let's male a list of discrete variables
discrete_vars = [var for var in num_vars if len(
    data[var].unique()) < 20]

print('Number of discrete variables: ', len(discrete_vars))

Ok. no discrete variables. move on to categorical variables.

## Categorical variables

how many unique values do we have in these categorical values?

In [None]:
data[cat_vars].nunique().sort_values(ascending=False).plot.bar(figsize=(5,5))

seems like we only have 2 values per variable. great. lets map them to numeric.


In [None]:
for var in cat_vars:
    # make boxplot with Catplot
    sns.catplot(x=var, y=target, data=data, kind="box", height=4, aspect=1.5)
    # add data points to boxplot with stripplot
    sns.stripplot(x=var, y=target, data=data, jitter=0.1, alpha=0.3, color='k')
    plt.show()

for some categorical variables we dont see a marked difference: sex for instance. 
Whereas for the type of tumor (primary or recurrent) we see higher ki67 values for primary cases. This makes sense biologically. 

Transform categorical variables into numeric

In [None]:
# this is alphabetical order for which will be 0 and which will be 1.
cat_mappings = {'Sex':{'F':0, 'M': 1}, 'Primary/Recurrent':{'Primary':0, 'Recurrent': 1}}

# map the variable values
for var in cat_vars:
    var_map = cat_mappings[var]
    data[var] = data[var].map(var_map)

data[cat_vars].head()

## numeric variables

Let's go ahead and find the distribution of the numeric (continuous in this project). 

In [None]:
# let's visualise the numeric variables
data[num_vars].head()

plotting 5000 variables is ridiculous, we need to create a test that captures skewness quantitatively.
Statistically, two numerical measures of shape – skewness and excess kurtosis – can be used to test for normality. If skewness is not close to zero, then your data set is not normally distributed.

Skewness is a measure of the asymmetry of the probability distribution of a random variable about its mean. In other words, skewness tells you the amount and direction of skew (departure from horizontal symmetry). The skewness value can be positive or negative, or even undefined. If skewness is 0, the data are perfectly symmetrical, although it is quite unlikely for real-world data. As a general rule of thumb:

skewness < -1 or skewness > 1 ----> highly skewed.

-1 <skewness < -0.5 | 0.5 < skewness < 1, ----> moderately skewed.

-0.5 < skewness < 0.5 ----> approximately symmetric. or no skewness

Lets find out how many of each we have.

In [None]:
# sort based on skewness
def get_skewed_variables(df, varlist):
    
    skewness = df[varlist].skew(axis=0)
    
    moderately_skewed = []
    threshold = 0.5
    for var in varlist:
        if 0.5 <= abs(skewness[var]) <= 1.0:
            moderately_skewed.append(var)
    
    extremely_skewed = []    
    threshold = 1  
    for var in varlist:
        if abs(skewness[var]) > 1.0:
            extremely_skewed.append(var)
    
    return moderately_skewed, extremely_skewed

print('# total numeric vars:', len(num_vars))
num_vars_modskewed, num_vars_extskewed = get_skewed_variables(data, num_vars)
print('# extremely skewed variables:', len(num_vars_extskewed))
print('# moderately skewed variables:', len(num_vars_modskewed))

# capture the remaining continuous variables
cont_vars = [v for v in num_vars if v not in num_vars_modskewed+num_vars_extskewed]
print('# ok numeric variables:', len(cont_vars))

In [None]:
# lets visualize some of the moderately skewed ones
for var in num_vars_modskewed[:5]: 
    
    data[var].hist(bins=50, density=True)
    plt.ylabel('counts')
    plt.xlabel(var)
    plt.show()

Lets apply the yeo-johnson transformation to moderately skewed variables. Sometimes, transforming the variables to improve the value spread, improves the model performance.

### Yeo-Johnson transformation

In [None]:
data_tr = data.copy()
for var in num_vars_modskewed:
    data_tr[var], param = stats.yeojohnson(data_tr[var])

res1, res2 = get_skewed_variables(data_tr, num_vars_modskewed)
notfixable = res1 + res2
fixable = [v for v in num_vars_modskewed if v not in notfixable]

print('# modskewed variables that are maybe fixable with yeo-johnson:', len(fixable))
print('# modskewed variables that are not fixable:', len(notfixable))

In [None]:
# let's plot the original or transformed variables for fixable ones
# vs target, and see if there is a relationship

# visualize for the first 5
target_logvals = np.log(data[target].values)
for var in fixable[:5]:
    
    plt.figure(figsize=(12, 4))
    
    # plot the original variable vs target    
    plt.subplot(1, 2, 1)
    plt.scatter(data[var], target_logvals)
    plt.ylabel(target)
    plt.xlabel('Original ' + var)

    # plot transformed variable vs sale price
    plt.subplot(1, 2, 2)
    plt.scatter(data_tr[var], target_logvals)
    plt.ylabel(target)
    plt.xlabel('Transformed ' + var)
                
    plt.show()

By eye, the transformations doesn't seems to improve the relationship.
Let's try a different transformation. 
 
 ### Logarithmic transformation

In [None]:
# Let's go ahead and analyse the distributions of these variables
# after applying a logarithmic transformation. 
# we can only do this on variables that dont have 0
fixable = [v for v in fixable if 0 not in data[v].values]

data_tr = data.copy()

for var in fixable:
    # transform the variable with logarithm
    data_tr[var] = np.log(data[var].values)
    
# visualize for the first 5
target_logvals = np.log(data[target])
for var in fixable[:5]:
    
    plt.figure(figsize=(12, 4))
    
    # plot the original variable vs target    
    plt.subplot(1, 2, 1)
    plt.scatter(data[var], target_logvals)
    plt.ylabel(target)
    plt.xlabel('Original ' + var)

    # plot transformed variable vs sale price
    plt.subplot(1, 2, 2)
    plt.scatter(data_tr[var], target_logvals)
    plt.ylabel(target)
    plt.xlabel('Transformed ' + var)
                
    plt.show()

 doesn't look like log transform did much for these variables. We can decide later if we want to do a transformation or not. Note this whether transformation actually helps improve the predictive power remains to be seen. To determine if this is the case, we should train a model with the original values and one with the transformed values, and determine model performance, and feature importance.

## Super skewed variables

These super skewed variables are not normally distributed even with transformation. It is unlikely that a transformation will help change the distribution of these variables dramatically.
Let's transform them into binary variables and see how predictive they are. But first how can we transform them? lets plot some:

In [None]:
for var in num_vars_extskewed[:5]: 
    
    data[var].hist(bins=50, density=True)
    plt.ylabel('counts')
    plt.xlabel(var)
    plt.show()

well, yeah. they are very skewed. we can use the median to binarize them

In [None]:
for var in num_vars_extskewed[:5]:
    
    tmp = data.copy()
    
    # map the variable values into 0 and 1
    tmp[var] = np.where(data[var]<=data[var].median(), 0, 1)
    
    # determine mean sale price in the mapped values
    tmp = tmp.groupby(var)[target].agg(['mean', 'std'])

    # plot into a bar graph
    tmp.plot(kind="barh", y="mean", legend=False,
             xerr="std", title=target, color='green')

    plt.show()

There seem to be not much of a difference in target in the mapped values and the confidence intervals overlap, so most likely this is not significant or predictive.


## summarize all you'll take to next step

In [113]:
# functions


def drop_select_vars(df, cols):
    """Drop unrelated or leaky columns."""
    for c in cols:
        if c in df.columns.values:  # ensure column exists in df
            df.drop(c, axis=1, inplace=True)
    return df


def transform_target_var(df, target):
    """
    Transforms target variable based on skewness
    using either yeojohnson or log transform.
    
    Returns df with target col transformed.
    """

    assert target in df.columns.values
    
    # most of the times there are 0s in target 
    # find out if that is the case
    has_zero = df[target].isin([0]).any().any()
    
    # if has zero use yeojohnson
    if has_zero:
        transformedvals, _ = stats.yeojohnson(df[target])
        df[target] = transformedvals
        return df
    
    # else, use log
    df[target] = np.log(df[target].values)
    return df


def categorize_vars_based_on_skewness(df, varlist):
    """
    Breaks varlist into 3 sublists: 
    notskewed, moderately skewed, extremely skewed
    
    
    Definition for these categories:
    
    -  skewness < -1 or skewness > 1 : highly skewed.
    -  -1 < skewness < -0.5 or 0.5 < skewness < 1 : moderately skewed.
    -  -0.5 < skewness < 0.5 : approximately symmetric or no skewness
    
    Returns these 3 sublists.
    """

    assert len(varlist) > 0

    skewness = df[varlist].skew(axis=0)

    moderately_skewed = []
    threshold = 0.5
    for var in varlist:
        if 0.5 <= abs(skewness[var]) <= 1.0:
            moderately_skewed.append(var)

    extremely_skewed = []    
    threshold = 1  
    for var in varlist:
        if abs(skewness[var]) > 1.0:
            extremely_skewed.append(var)

    # catch all others
    notskewed = [v for v in varlist if var not in extremely_skewed+moderately_skewed]

    return notskewed, moderately_skewed, extremely_skewed


def get_cat_vars(df):
    return [var for var in df.columns.values if df[var].dtype == 'O']


def get_numericvars_transformation_plan(df, varlist):
    """
    Sequentially breaks varlist into bins
    based on their skewness and test a number of strategies fixes their skewness.
    
    strategies include 'Yeojohnson', and 'Log' transform. if these two dont work, we
    will move on with binarizing the remaining variables using their median.
    
    Returns transformation dataframe with index as varlist and values for 
    the column showing the decision for that column.
    """ 
    
    # ensure numeric variables are actually numeric
    # check if there are categorical variables among varlist
    # remove these from the varlist
    cat_vars = get_cat_vars(df[varlist])
    if cat_vars:
        for c in cat_vars:
            varlist.remove(c)
        
    # create result dataframe to keep track of decisions
    res = pd.DataFrame(index=varlist)
    res['transformType'] = None
    
    # test skewness in variables
    groups = categorize_vars_based_on_skewness(df, varlist)
    notskewed, modskewed, extskewed = groups
    print('notskewed, modskewed, extskewed counts:', [len(g) for g in groups])
    
    # If notskewed, no need for transformation
    res.loc[notskewed, 'transformType'] = 'Skip'
    
    # If moderately skewed, apply yeo transform
    # see if this transform fixed skewness
    print('focusing on modskewed..')
    df_tr = df.copy()
    for var in modskewed:
        df_tr[var], _ = stats.yeojohnson(df_tr[var])
    groups = categorize_vars_based_on_skewness(df_tr, modskewed)
    fixable, no_change, worsened = groups
    print('fixable, no_change, worsened counts after yeojohnnson:', [len(g) for g in groups])
    
    # keep fixable results
    res.loc[fixable, 'transformType'] = 'YeoJohnson'
    
    # see if log transform helps no_change and worsened vars
    # make sure to only apply to cases without 0s
    print('focusing on no change and worsened ones (without 0)')
    targetlist = no_change + worsened
    targetlist = [v for v in targetlist if not df[v].isin([0]).any().any()]
    for var in targetlist:
        df_tr[var] = np.log(df_tr[var])
    groups = categorize_vars_based_on_skewness(df, targetlist)
    fixable, no_change, worsened = groups
    print('fixable, no_change, worsened counts after yeojohnnson:', [len(g) for g in groups])

    # update res for fixable
    res.loc[fixable, 'transformType'] = 'Log'
    
    # all remaining have to be binarized
    # you can use different criteria, we'll use medium here
    # update varlist
    print('everything else cant be fixed. binarizing  ')
    binthese = [v for v in varlist if res.loc[v, 'transformType'] is None]
    res.loc[binthese, 'transformType'] = 'Binarize'
    
    # you are done. return result
    for v in ['Skip', 'YeoJohnson', 'Log', 'Binarize']:
        print(v, res['transformType'].count(v))
    return res


def transform_cat_vars(df, trans_map):
    """
    Given a transformmation plan, this function turns categorical vars 
    into numeric vars.
    """
    # ensure trans_plan is not empty
    assert type(trans_map) == dict
    
    if trans_map == {}:
        print('no plan provided. return df unchanged')
        return df
    
    for var in trans_map.keys():
        
        prevals = set(df[var].values)
        postvals = set(trans_map[var].values())
        
        # check if pre and post vals have no overlap
        if prevals.isdisjoint(postvals):
            df[var] = df[var].map(trans_map[var])
        else:
            print('%s already transformed.skip.' % var)
            # transform already done. dont do it again
            continue
        
    # done. return transformed df
    return df


def transform_num_vars(df, trans_map):
    """
    Given a transformmation plan, this function applies all the requested 
    transforms to the data.
    """
    
    # ensure trans_plan is not empty
    if trans_map.empty:
        print('no plan provided. return df unchanged')
        return df
    
    plancol = 'transformType'
    uniq_plans = list(set(trans_map[plancol].values))
    
    df_tr = df.copy()
    
    for pl in uniq_plans:
        pl_vars = trans_map[trans_map[plancol] == pl].index.values
        
        if pl == 'Skip':  # do nothing
            pass
        
        elif pl == 'YeoJohnson':
            for var in pl_vars:
                df_tr[var], _ = stats.yeojohnson(df[var])
        
        elif pl == 'Log':
            for var in pl_vars:
                df_tr[var] = np.log(df[var])
        
        elif pl == 'Binarize':
            for var in pl_vars:
                df_tr[var] = np.where(df[var]<=df[var].median(), 0, 1)
    
        else:  # do nothing
            print('transformation %s unknown. skip these' % pl)
            pass
    
    # done. return transformed df
    return df_tr


def fill_in_missing_vals(df):
    
    # find vars with missing values
    vars_na = [v for v in df.columns.values if df[v].isnull().sum()>0]
    
    # skip if no missing vals
    if not vars_na:
        return df
    
    # divide into cat and num variable
    cat_na = [v for v in vars_na if df[v].dtype == 'O']
    num_na = [v for v in vars_na if v not in cat_na]
    print('# categorical variables with na: ', len(cat_na))
    print('# numerical variables with na: ', len(num_na))
    
    # for cat vars: use 0.3 as a guide/threshold (this is heuristically-decided)
    # >= 30% missing: create a new category
    # < 30% missing: replace with most commmon category
    cat_miss_th = 0.3
    perc_miss_df = df[cat_na].isnull().mean().sort_values(ascending=False)
    
    for v in cat_na:
        na_indices = list(df[df[v].isnull() == True].index.values)
        
        if perc_miss_df[v] >= cat_miss_th:
            new_cat = 'Unknown'
            df.loc[na_indices, v] = new_cat
        else:
            # replace with most common category
            df.loc[na_indices, v] = df[v].mode()
            
    ## for numeric, use 0.5 as threshold
    # >= 50% missing: drop
    # < 30% missing: replace with median
    num_miss_th = 0.5
    perc_miss_df = df[num_na].isnull().mean().sort_values(ascending=False)
    
    for v in num_na:
        na_indices = list(df[df[v].isnull() == True].index.values)
        
        if perc_miss_df[v] >= num_miss_th:
            df.drop(v, inplace=True)
        else:
            # replace with most common category
            df.loc[na_indices, v] = df[v].median()
    
    # return new df
    return df


## unittests
def test_fill_missing(df):
    
    tmp = fill_in_missing_vals(df)
    vars_na = [v for v in tmp.columns.values if tmp[v].isnull().sum()>0]
    assert len(vars_na) == 0


def test_drop_select_vars(df, drop_vars):
    
    tmp = drop_select_vars(df, drop_vars)
    for c in drop_vars:
        assert c not in tmp.columns.values

def test_transform_vat_vars(df, mapping):

    tmp = transform_cat_vars(df, mapping)
    for var in mapping.keys():
        
        prevals  = set(df[var].values)
        postvals = set(tmp[var].values)
        assert prevals != postvals
        assert prevals not in postvals
        

In [106]:
datadir = os.path.join(os.getcwd(), 'dataset')
data = pd.read_csv(os.path.join(datadir, 'cumc+csbc_t1gd+t2+adc_predKi67ForMayo_pyradiomics.csv'))

target = 'Ki67 LI'
indexcol = 'biopsyImage'
drop_vars = ['Source', 'AnonID']
cat_vars = ['Sex', 'Primary/Recurrent']
num_vars =  [var for var in data.columns if var not in cat_vars+drop_vars+[target, indexcol]]

In [107]:
if data.index.name is not indexcol:
    data.set_index(indexcol, inplace=True)
print('baseline size:', data.shape)

# 1. deal with drop vars
data = drop_select_vars(data, drop_vars)
print('after cleaning:', data.shape)

# 2. clean up
data = fill_in_missing_vals(data)
print('size after filling missing:', data.shape)

## tests
test_drop_select_vars(data, drop_vars)
test_fill_missing(data)

baseline size: (318, 5250)
after cleaning: (318, 5248)
# categorical variables with na:  0
# numerical variables with na:  5245
size after filling missing: (318, 5248)


In [95]:
# 3. transform target if necessary
data = transform_target_var(data, target)



In [114]:
# transform numeric predictors
num_mappings = get_numericvars_transformation_plan(data, num_vars)
data = transform_num_vars(data, num_mappings)

# transformm categorical predictors
cat_mappings = {'Sex': {'Unknown': 0, 'F': 1, 'M': 2}, 
                'Primary/Recurrent': {'Unknown': 0, 'Primary':1, 'Recurrent': 2}}

data = transform_cat_vars(data, cat_mappings)
data[cat_mappings.keys()].head()

notskewed, modskewed, extskewed counts: [0, 717, 3740]
focusing on modskewed..
fixable, no_change, worsened counts after yeojohnnson: [717, 5, 0]
focusing on no change and worsened ones (without 0)
fixable, no_change, worsened counts after yeojohnnson: [0, 5, 0]
everything else cant be fixed. binarizing  
       transformType
count           5245
unique             2
top         Binarize
freq            4528


  result = getattr(ufunc, method)(*inputs, **kwargs)


Sex already transformed.skip.
Primary/Recurrent already transformed.skip.


Unnamed: 0_level_0,Sex,Primary/Recurrent
biopsyImage,Unnamed: 1_level_1,Unnamed: 2_level_1
x323-x20161115-351-239-42,2,2
x096-x20140109-377-170-74,2,2
x116-x20140409-141-160-83,1,2
x117-x20140414-34-109-105,1,2
x077-x20131010-82-138-55,2,2
