## <center>Missing data in supervised ML</center>
### <center>Andras Zsom</center>
<center>Lead Data Scientist and Adjunct Lecturer in Data Science</center>
<center>Center for Computation and Visualization</center>
<center>Brown University</center>
<center>Providence, RI, USA</center>

## About me
- Born and raised in Hungary
- Astrophysics PhD at MPIA, Heidelberg, Germany
- Postdoctoral researcher at MIT (still in astrophysics at the time)
- Started at Brown in December 2015 as a Data Scientist
- Promoted to Lead Data Scientist in 2017
- Adjunct Lecturer in Data Science this semester
   - Teaching the course *DATA1030: Hands-on data science* to the DS master students at Brown

## Data Science at Brown
- Center for Computation and Visualization
- Institutional Data group
   - Data-driven decision support and predictive modeling for Brown’s administrative units
   - Academic research on data-intensive projects
- **OPEN POSITION** - more on this later

## Learning Objectives

By the end of this workshop, you will be able to
- Describe the three main types of missingness patterns
- Evaluate simple approaches for handling missing values
- Apply XGBoost to a dataset with missing values
- Apply multivariate imputation
- Apply the reduced-features model (also called the pattern submodel approach)
- Decide which approach is best for your dataset

## Before we start, a few words on our dataset: Kaggle house prices
- good for educational purposes
   - messy data that requires quite a bit of preprocessing
   - a nice mixture of continuous, ordinal, and categorical features; each feature type has missing values
- lots of excellent kernels on Kaggle
   - check them out [here]()
- dataset and description available in repo
   - let's take a look!

## <font color='LIGHTGRAY'>Learning Objectives</font>

<font color='LIGHTGRAY'>By the end of this workshop, you will be able to</font>
- **Describe the three main types of missingness patterns**
- <font color='LIGHTGRAY'>Evaluate simple approaches for handling missing values</font>
- <font color='LIGHTGRAY'>Apply XGBoost to a dataset with missing values</font>
- <font color='LIGHTGRAY'>Apply multivariate imputation</font>
- <font color='LIGHTGRAY'>Apply the reduced-features model (also called the pattern submodel approach)</font>
- <font color='LIGHTGRAY'>Decide which approach is best for your dataset</font>

## Missing values often occur in datasets
- survey data: not everyone answers all the questions
- medical data: not all tests/treatments/etc are performed on all patients
- sensor data: sensors can be offline or malfunctioning
- customer data: not every user uses all features of an app

## Missing values are an issue for multiple reasons

#### Conceptual reason
- missing values can introduce biases
    - bias: the samples (the data points) are not representative of the underlying distribution/population
    - any conclusion drawn from a biased dataset is also biased.
    - example: rich people tend not to fill out survey questions about their salaries, so the mean salary estimated from survey data tends to be lower than the true value


#### Practical reason
- missing values (NaN, NA, inf) are incompatible with most sklearn estimators
   - all values in an array need to be finite numbers, otherwise sklearn will throw a *ValueError*
- there are a few supervised ML techniques that work with missing values (e.g., XGBoost, CatBoost)
   - we will cover those later today
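
A minimal sketch of the practical problem, assuming a tiny made-up feature matrix (the arrays are illustrative and not taken from the house price data). `LinearRegression` is used here only as one example of an estimator that rejects NaN:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# toy feature matrix with one missing value (illustrative data only)
X_toy = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
y_toy = np.array([1.0, 2.0, 3.0])

try:
    LinearRegression().fit(X_toy, y_toy)
    fit_failed = False
except ValueError as err:
    # sklearn validates the input before fitting and refuses NaN
    fit_failed = True
    print('sklearn refused to fit:', type(err).__name__)
```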

# Missing data patterns

- **MCAR** - Missing Completely At Random
   - some people skip some survey questions by accident
- **MAR** - Missing At Random
   - males are less likely to fill out a survey on depression
   - this has nothing to do with their level of depression after accounting for maleness
- **MNAR** - Missing Not At Random
   - depressed people are less likely to fill out a survey on depression due to their level of depression

## MCAR test

- MCAR can be diagnosed with a statistical test ([Little, 1988](https://www.tandfonline.com/doi/abs/10.1080/01621459.1988.10478722))
   - python implementation available in the [pymice](https://github.com/RianneSchouten/pymice) package or in the skipped slide

In [1]:
# from the pymice package 
# https://github.com/RianneSchouten/pymice

import numpy as np
import pandas as pd
import math as ma
import scipy.stats as st

def checks_input_mcar_tests(data):
    """ Checks whether the input parameter of class McarTests is correct
            Parameters
            ----------
            data:
                The input of McarTests specified as 'data'
            Returns
            -------
            bool
                True if input is correct
            """

    if not isinstance(data, pd.DataFrame):
        print("Error: Data should be a Pandas DataFrame")
        return False

    # np.float and np.int were removed in NumPy 1.24; the builtin types are equivalent here
    if not any(data.dtypes.values == float):
        if not any(data.dtypes.values == int):
            print("Error: Dataset cannot contain other value types than floats and/or integers")
            return False

    if not data.isnull().values.any():
        print("Error: No NaN's in given data")
        return False

    return True


def mcar_test(data):
    """ Implementation of Little's MCAR test
    Parameters
    ----------
    data: Pandas DataFrame
        An incomplete dataset with samples as index and variables as columns
    Returns
    -------
    p_value: Float
        This value is the outcome of a chi-square statistical test, testing whether the null hypothesis
        'the missingness mechanism of the incomplete dataset is MCAR' can be rejected.
    """

    if not checks_input_mcar_tests(data):
        raise Exception("Input not correct")

    dataset = data.copy()
    vars = dataset.dtypes.index.values
    n_var = dataset.shape[1]

    # mean and covariance estimates
    # ideally, this is done with a maximum likelihood estimator
    gmean = dataset.mean()
    gcov = dataset.cov()

    # set up missing data patterns
    r = 1 * dataset.isnull()
    mdp = np.dot(r, list(map(lambda x: ma.pow(2, x), range(n_var))))
    sorted_mdp = sorted(np.unique(mdp))
    n_pat = len(sorted_mdp)
    correct_mdp = list(map(lambda x: sorted_mdp.index(x), mdp))
    dataset['mdp'] = pd.Series(correct_mdp, index=dataset.index)

    # calculate statistic and df
    pj = 0
    d2 = 0
    for i in range(n_pat):
        dataset_temp = dataset.loc[dataset['mdp'] == i, vars]
        select_vars = ~dataset_temp.isnull().any()
        pj += np.sum(select_vars)
        select_vars = vars[select_vars]
        means = dataset_temp[select_vars].mean() - gmean[select_vars]
        select_cov = gcov.loc[select_vars, select_vars]
        mj = len(dataset_temp)
        parta = np.dot(means.T, np.linalg.solve(select_cov, np.identity(select_cov.shape[1])))
        d2 += mj * (np.dot(parta, means))

    df = pj - n_var

    # perform test and save output
    p_value = 1 - st.chi2.cdf(d2, df)

    return p_value

## Takeaway
- it can be challenging to infer the missingness pattern from an incomplete dataset
   - There is a statistical test to diagnose MCAR
   - MAR and MNAR are difficult/impossible to diagnose to the best of my knowledge
- multiple patterns can be present in the data
   - even worse, multiple patterns can be present in one feature!
   - missing values in a feature can occur due to a mix of MCAR, MAR, MNAR 


## <font color='LIGHTGRAY'>Learning Objectives</font>

<font color='LIGHTGRAY'>By the end of this workshop, you will be able to</font>
- <font color='LIGHTGRAY'>Describe the three main types of missingness patterns</font>
- **Evaluate simple approaches for handling missing values**
- <font color='LIGHTGRAY'>Apply XGBoost to a dataset with missing values</font>
- <font color='LIGHTGRAY'>Apply multivariate imputation</font>
- <font color='LIGHTGRAY'>Apply the reduced-features model (also called the pattern submodel approach)</font>
- <font color='LIGHTGRAY'>Decide which approach is best for your dataset</font>

## Simple approaches for handling missing values
- 1) categorical/ordinal features: treat missing values as another category
    - missing values in categorical/ordinal features are not a big deal
- 2) continuous features: this is the tough part
    - sklearn's SimpleImputer
- 3) exclude points or features with missing values
    - might be OK

### 1a) Missing values in a categorical feature
- YAY - this is not an issue at all!
- Categorical feature needs to be one-hot encoded anyway
- Just replace the missing values with 'NA' or 'missing' and treat it as a separate category
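
A small sketch of this idea with pandas, assuming a toy column modeled on the `Alley` feature (the values are illustrative):

```python
import numpy as np
import pandas as pd

# toy column modeled on the 'Alley' feature (illustrative values)
alley = pd.DataFrame({'Alley': ['Grvl', np.nan, 'Pave', np.nan]})

# replace NaN with an explicit category, then one-hot encode
alley_filled = alley.fillna('missing')
alley_onehot = pd.get_dummies(alley_filled, columns=['Alley'])
print(alley_onehot.columns.tolist())
```

The missing values end up in their own `Alley_missing` column, so no information is lost.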

### 1b) Missing values in an ordinal feature
- this can be a bit trickier but usually fine
- Ordinal encoder is applied to ordinal features
    - where does 'NA' or 'missing' fit into the order of the categories?
    - usually first or last
- if you can figure this out, you are golden
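
A sketch of the ordinal case, assuming a toy `BsmtQual`-style column in which a missing value means "no basement" and therefore belongs below the worst quality level:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# toy 'BsmtQual'-style column; NaN is assumed to mean "no basement"
qual = pd.DataFrame({'BsmtQual': ['Gd', np.nan, 'TA', 'Ex']})

# step 1: turn NaN into the explicit label 'NA'
qual_filled = qual.fillna('NA')
# step 2: encode with 'NA' placed first, i.e. below the worst quality
encoder = OrdinalEncoder(categories=[['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex']])
qual_encoded = encoder.fit_transform(qual_filled)
print(qual_encoded.ravel())
```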

In [2]:
# read the data
import pandas as pd
import numpy  as np
from sklearn.model_selection import train_test_split

# Let's load the data
df = pd.read_csv('data/train.csv')
# drop the ID
df.drop(columns=['Id'],inplace=True)

# the target variable
y = df['SalePrice']
df.drop(columns=['SalePrice'],inplace=True)
# the unprocessed feature matrix
X = df.values
print(X.shape)
# the feature names
ftrs = df.columns

(1460, 79)


In [13]:
# let's split to train and test
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=0)
print(X_train.head())

     MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
618          20       RL         90.0    11694   Pave   NaN      Reg   
870          20       RL         60.0     6600   Pave   NaN      Reg   
92           30       RL         80.0    13360   Pave  Grvl      IR1   
817          20       RL          NaN    13265   Pave   NaN      IR1   
302          20       RL        118.0    13704   Pave   NaN      IR1   

    LandContour Utilities LotConfig  ... ScreenPorch PoolArea PoolQC Fence  \
618         Lvl    AllPub    Inside  ...         260        0    NaN   NaN   
870         Lvl    AllPub    Inside  ...           0        0    NaN   NaN   
92          HLS    AllPub    Inside  ...           0        0    NaN   NaN   
817         Lvl    AllPub   CulDSac  ...           0        0    NaN   NaN   
302         Lvl    AllPub    Corner  ...           0        0    NaN   NaN   

    MiscFeature MiscVal  MoSold  YrSold  SaleType  SaleCondition  
618         NaN       0       7

In [4]:
# collect the various features
cat_ftrs = ['MSZoning','Street','Alley','LandContour','LotConfig','Neighborhood','Condition1','Condition2',\
            'BldgType','HouseStyle','RoofStyle','RoofMatl','Exterior1st','Exterior2nd','MasVnrType','Foundation',\
           'Heating','CentralAir','Electrical','GarageType','PavedDrive','MiscFeature','SaleType','SaleCondition']
ordinal_ftrs = ['LotShape','Utilities','LandSlope','ExterQual','ExterCond','BsmtQual','BsmtCond','BsmtExposure',\
               'BsmtFinType1','BsmtFinType2','HeatingQC','KitchenQual','Functional','FireplaceQu','GarageFinish',\
               'GarageQual','GarageCond','PoolQC','Fence']
ordinal_cats = [['Reg','IR1','IR2','IR3'],['AllPub','NoSewr','NoSeWa','ELO'],['Gtl','Mod','Sev'],\
               ['Po','Fa','TA','Gd','Ex'],['Po','Fa','TA','Gd','Ex'],['NA','Po','Fa','TA','Gd','Ex'],\
               ['NA','Po','Fa','TA','Gd','Ex'],['NA','No','Mn','Av','Gd'],['NA','Unf','LwQ','Rec','BLQ','ALQ','GLQ'],\
               ['NA','Unf','LwQ','Rec','BLQ','ALQ','GLQ'],['Po','Fa','TA','Gd','Ex'],['Po','Fa','TA','Gd','Ex'],\
               ['Sal','Sev','Maj2','Maj1','Mod','Min2','Min1','Typ'],['NA','Po','Fa','TA','Gd','Ex'],\
               ['NA','Unf','RFn','Fin'],['NA','Po','Fa','TA','Gd','Ex'],['NA','Po','Fa','TA','Gd','Ex'],
               ['NA','Fa','TA','Gd','Ex'],['NA','MnWw','GdWo','MnPrv','GdPrv']]
num_ftrs = ['MSSubClass','LotFrontage','LotArea','OverallQual','OverallCond','YearBuilt','YearRemodAdd',\
             'MasVnrArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','1stFlrSF','2ndFlrSF',\
             'LowQualFinSF','GrLivArea','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr',\
             'KitchenAbvGr','TotRmsAbvGrd','Fireplaces','GarageYrBlt','GarageCars','GarageArea','WoodDeckSF',\
             'OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscVal','MoSold','YrSold']

In [5]:
# preprocess with pipeline and columntransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# one-hot encoder
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant',fill_value='missing')),
    # note: scikit-learn >= 1.2 renamed `sparse` to `sparse_output`
    ('onehot', OneHotEncoder(sparse=False,handle_unknown='ignore'))])

# ordinal encoder
ordinal_transformer = Pipeline(steps=[
    ('imputer2', SimpleImputer(strategy='constant',fill_value='NA')),
    ('ordinal', OrdinalEncoder(categories = ordinal_cats))])

# standard scaler
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

# collect the transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_ftrs),
        ('cat', categorical_transformer, cat_ftrs),
        ('ord', ordinal_transformer, ordinal_ftrs)])

In [16]:
# fit_transform the data
X_prep = preprocessor.fit_transform(X_train)
# little hacky, but collect feature names
# (scikit-learn >= 1.0 provides get_feature_names_out; get_feature_names was removed in 1.2)
feature_names = preprocessor.transformers_[0][-1] + \
                list(preprocessor.named_transformers_['cat'][1].get_feature_names(cat_ftrs)) + \
                preprocessor.transformers_[2][-1]

df_train = pd.DataFrame(data=X_prep,columns=feature_names)

print(df_train.shape)

df_train.to_csv('data/house_price_prep.csv',index=False)

# transform the test
df_test = preprocessor.transform(X_test)
df_test = pd.DataFrame(data=df_test,columns = feature_names)
print(df_test.shape)

(1168, 225)
(292, 225)


### Let's take a closer look at our missing values!

In [17]:
print('data dimensions:',df_train.shape)
perc_missing_per_ftr = df_train.isnull().sum(axis=0)/df_train.shape[0]
print('fraction of missing values in features:')
print(perc_missing_per_ftr[perc_missing_per_ftr > 0])
frac_missing = sum(df_train.isnull().sum(axis=1)!=0)/df_train.shape[0]
print('fraction of points with missing values:',frac_missing)

data dimensions: (1168, 225)
fraction of missing values in features:
LotFrontage    0.181507
MasVnrArea     0.005137
GarageYrBlt    0.049658
dtype: float64
fraction of points with missing values: 0.23116438356164384


### 2) Continuous features: mean or median imputation
- Imputation means you infer the missing values from the known part of the data
- sklearn's SimpleImputer can do mean and median imputation
- USUALLY A BAD IDEA!
   - MCAR: the mean/median of the non-missing values is an unbiased estimate of the mean/median of the true underlying distribution, but the variance of the imputed feature is too small
   - not MCAR: the mean/median and the variance of the completed dataset will be off
   - supervised ML model is too confident (MCAR) or systematically off (not MCAR)
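
The variance problem can be seen on synthetic data. This is a sketch under assumed numbers: the feature distribution and the 30% MCAR missingness rate are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
# made-up continuous feature, loosely 'LotFrontage'-like
x_true = rng.normal(loc=70.0, scale=20.0, size=(5000, 1))

# remove 30% of the values completely at random (MCAR)
x_obs = x_true.copy()
x_obs[rng.random(5000) < 0.3, 0] = np.nan

# fill every gap with the mean of the observed values
x_imp = SimpleImputer(strategy='mean').fit_transform(x_obs)

print('mean: true', x_true.mean(), 'imputed', x_imp.mean())
print('std:  true', x_true.std(), 'imputed', x_imp.std())
```

The mean is preserved, but the standard deviation shrinks because all imputed points sit exactly at the mean.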

### 3) Exclude points or features with missing values

- easy to do with pandas
- it is an ACCEPTABLE approach if Little's test does not reject MCAR (p > 0.05) and one of the following holds:
    - only a small fraction of points contain missing values (maybe a few percent?), so you can drop those points
    - the missing values are limited to one or a few features and a large fraction of points are missing from those features (maybe up to 90%?), so you can drop those features
- if the MCAR assumption is justified, dropping points will not introduce biases to your model
- due to the smaller sample size, the confidence of your model might suffer. 
- what will you do with missing values when you deploy the model?

In [8]:
print(df_train.shape)
# by default, rows/points are dropped
df_r = df_train.dropna()
print(df_r.shape)
# drop features with missing values
df_c = df_train.dropna(axis=1)
print(df_c.shape)

(1168, 225)
(898, 225)
(1168, 222)


## <font color='LIGHTGRAY'>Learning Objectives</font>

<font color='LIGHTGRAY'>By the end of this workshop, you will be able to</font>
- <font color='LIGHTGRAY'>Describe the three main types of missingness patterns</font>
- <font color='LIGHTGRAY'>Evaluate simple approaches for handling missing values</font>
- **Apply XGBoost to a dataset with missing values**
- <font color='LIGHTGRAY'>Apply multivariate imputation</font>
- <font color='LIGHTGRAY'>Apply the reduced-features model (also called the pattern submodel approach)</font>
- <font color='LIGHTGRAY'>Decide which approach is best for your dataset</font>

## XGBoost and missing values
- most sklearn estimators raise an error if the feature matrix (X) contains NaNs. 
- XGBoost doesn't! 
- If a feature with missing values is split:
    - XGBoost tries sending the points with missing values to the left child and to the right child
    - calculates the split gain (the reduction in the loss) for both options
    - sends the points with missing values to the side with the higher gain
- if missingness correlates with the target variable, XGBoost extracts this info!
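
The "try both sides" idea can be sketched for a single split. This is a simplified illustration and not XGBoost's implementation: it scores a split with the sum of squared errors, whereas XGBoost uses a gradient-based gain. The function names and the toy data are hypothetical.

```python
import numpy as np

def sse(y):
    """Sum of squared errors around the mean (0 for an empty node)."""
    return float(((y - y.mean()) ** 2).sum()) if len(y) else 0.0

def best_missing_direction(x, y, threshold):
    """Try sending rows with missing x left and right; return the better side."""
    missing = np.isnan(x)
    goes_left = np.zeros_like(missing)
    goes_left[~missing] = x[~missing] < threshold
    scores = {}
    for side in ('left', 'right'):
        left = goes_left | missing if side == 'left' else goes_left
        scores[side] = sse(y[left]) + sse(y[~left])
    return min(scores, key=scores.get), scores

# toy data: rows with missing x have large targets, like the rows with x >= 5
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, np.nan, np.nan])
y = np.array([10.0, 11.0, 9.0, 30.0, 31.0, 29.0, 32.0])
direction, scores = best_missing_direction(x, y, threshold=5.0)
print(direction, scores)
```

Because the rows with missing `x` resemble the high-`x` rows in the target, sending them right gives the purer split. This is how missingness that correlates with the target gets used.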

In [9]:
import xgboost
from sklearn.model_selection import ParameterGrid

param_grid = {"learning_rate": [0.03],
              "n_estimators": [2000],
              "seed": [0],
              #"n_jobs": [6],
              #"reg_alpha": [0e0,0.1,0.31622777,1.,3.16227766,10.],
              #"reg_lambda": [0e0,0.1,0.31622777,1.,3.16227766,10.],
              "missing": [np.nan], 
              #"max_depth": [1,2,3,4,5],
              "colsample_bytree": [0.9],              
              "subsample": [0.66]}

XGB = xgboost.XGBRegressor()
XGB.set_params(**ParameterGrid(param_grid)[0])
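# note: xgboost >= 1.6 deprecates passing early_stopping_rounds to fit();
# set it in the constructor or via set_params instead
XGB.fit(df_train,y_train,early_stopping_rounds=50,eval_set=[(df_test, y_test)], verbose=False)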

[0]	validation_0-rmse:193966
Will train until validation_0-rmse hasn't improved in 50 rounds.
[1]	validation_0-rmse:188693
[2]	validation_0-rmse:183680
...
[563]	validation_0-rmse:29281.7
[564]	validation_0-rmse:29281.6
...

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bytree=0.9, gamma=0, importance_type='gain',
             learning_rate=0.03, max_delta_step=0, max_depth=3,
             min_child_weight=1, missing=None, n_estimators=2000, n_jobs=1,
             nthread=None, objective='reg:linear', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, seed=0, silent=True,
             subsample=0.66)

## <font color='LIGHTGRAY'>Learning Objectives</font>

<font color='LIGHTGRAY'>By the end of this workshop, you will be able to</font>
- <font color='LIGHTGRAY'>Describe the three main types of missingness patterns</font>
- <font color='LIGHTGRAY'>Evaluate simple approaches for handling missing values</font>
- <font color='LIGHTGRAY'>Apply XGBoost to a dataset with missing values</font>
- **Apply multivariate imputation**
- <font color='LIGHTGRAY'>Apply the reduced-features model (also called the pattern submodel approach)</font>
- <font color='LIGHTGRAY'>Decide which approach is best for your dataset</font>

## Multivariate Imputation

- models each feature with missing values as a function of other features, and uses that estimate for imputation
   - at each step, a feature column is designated as target variable y and the other feature columns are treated as feature matrix X
   - a regressor is trained on (X, y) for known y
   - then, the regressor is used to predict the missing values of y
- in the ML pipeline:
   - create n imputed datasets
   - run all of them through the ML pipeline
   - generate n holdout scores
   - the uncertainty in the holdout scores is due to the uncertainty in imputation
- works on MCAR and MAR, fails on MNAR
- paper [here](https://www.jstatsoft.org/article/view/v045i03)

# sklearn's IterativeImputer

In [10]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

print(df_train[['LotFrontage','MasVnrArea','GarageYrBlt']].iloc[155:160])

imputer = IterativeImputer(estimator = RandomForestRegressor(n_estimators=10), random_state=0)
X_impute = imputer.fit_transform(df_train)
df_train_imp = pd.DataFrame(data=X_impute, columns = df_train.columns)

print(df_train_imp[['LotFrontage','MasVnrArea','GarageYrBlt']].iloc[155:160])

df_test_imp = pd.DataFrame(data=imputer.transform(df_test), columns = df_train.columns)


     LotFrontage  MasVnrArea  GarageYrBlt
155     0.234846   -0.566716          NaN
156          NaN   -0.566716    -1.935994
157     0.365656   -0.566716    -1.775132
158     1.063308   -0.566716          NaN
159          NaN   -0.336702     0.718226
     LotFrontage  MasVnrArea  GarageYrBlt
155     0.234846   -0.566716    -0.552582
156     0.317693   -0.566716    -1.935994
157     0.365656   -0.566716    -1.775132
158     1.063308   -0.566716    -0.078040
159     0.365656   -0.336702     0.718226




In [11]:
XGB.fit(df_train_imp,y_train,early_stopping_rounds=50,eval_set=[(df_test_imp, y_test)], verbose=False)

[0]	validation_0-rmse:193966
Will train until validation_0-rmse hasn't improved in 50 rounds.
[1]	validation_0-rmse:188693
[2]	validation_0-rmse:183680
...
[1825]	validation_0-rmse:27717.7
[1826]	validation_0-rmse:27715.5
...

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bytree=0.9, gamma=0, importance_type='gain',
             learning_rate=0.03, max_delta_step=0, max_depth=3,
             min_child_weight=1, missing=None, n_estimators=2000, n_jobs=1,
             nthread=None, objective='reg:linear', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, seed=0, silent=True,
             subsample=0.66)

## <font color='LIGHTGRAY'>Learning Objectives</font>

<font color='LIGHTGRAY'>By the end of this workshop, you will be able to</font>
- <font color='LIGHTGRAY'>Describe the three main types of missingness patterns</font>
- <font color='LIGHTGRAY'>Evaluate simple approaches for handling missing values</font>
- <font color='LIGHTGRAY'>Apply XGBoost to a dataset with missing values</font>
- <font color='LIGHTGRAY'>Apply multivariate imputation</font>
- **Apply the reduced-features model (also called the pattern submodel approach)**
- <font color='LIGHTGRAY'>Decide which approach is best for your dataset</font>
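
The slides do not spell out the reduced-features model, so the following is only a sketch of one common reading of the pattern submodel idea: for each missingness pattern in the data to be predicted, train a separate model that uses only the features observed in that pattern. The helper name, the use of `LinearRegression`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def reduced_features_predict(X_train, y_train, X_test):
    """Fit one submodel per missingness pattern found in X_test."""
    y_pred = np.full(X_test.shape[0], np.nan)
    test_mask = np.isnan(X_test)
    for pattern in np.unique(test_mask, axis=0):
        observed = ~pattern  # features available in this pattern
        rows_test = (test_mask == pattern).all(axis=1)
        # train only on rows where all of these features are observed
        rows_train = ~np.isnan(X_train[:, observed]).any(axis=1)
        model = LinearRegression()
        model.fit(X_train[rows_train][:, observed], y_train[rows_train])
        y_pred[rows_test] = model.predict(X_test[rows_test][:, observed])
    return y_pred

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=400)
X[rng.random(400) < 0.2, 0] = np.nan   # feature 0 missing in ~20% of rows
X[rng.random(400) < 0.1, 2] = np.nan   # feature 2 missing in ~10% of rows

# feature 1 is never missing here, so every pattern keeps at least one feature
y_pred = reduced_features_predict(X[:300], y[:300], X[300:])
print('predictions without imputation:', y_pred[:5])
```

No value is imputed at any point; each test point is predicted by a model that never saw the features it is missing.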

## <font color='LIGHTGRAY'>Learning Objectives</font>

<font color='LIGHTGRAY'>By the end of this workshop, you will be able to</font>
- <font color='LIGHTGRAY'>Describe the three main types of missingness patterns</font>
- <font color='LIGHTGRAY'>Evaluate simple approaches for handling missing values</font>
- <font color='LIGHTGRAY'>Apply XGBoost to a dataset with missing values</font>
- <font color='LIGHTGRAY'>Apply multivariate imputation</font>
- <font color='LIGHTGRAY'>Apply the reduced-features model (also called the pattern submodel approach)</font>
- **Decide which approach is best for your dataset**

## Which approach is best for my data?
- **XGB**: run $n$ XGB models with $n$ different seeds
- **imputation**: prepare $n$ different imputations and run $n$ XGB models on them
- **reduced-features**: run $n$ reduced-features models with $n$ different seeds
- rank the three methods based on how significantly different the corresponding mean scores are
   - I hope to talk about the results of this experiment at ODSC East next year!
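
One possible way to quantify "how significantly different" two mean scores are is to express the difference in units of the combined standard deviation. This is a sketch of that one reading, not the experiment itself; the function name is hypothetical and the scores below are placeholders, NOT real results.

```python
import numpy as np

def n_sigma_difference(scores_a, scores_b):
    """Difference of the mean scores in units of the combined standard deviation."""
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    combined_std = np.sqrt(scores_a.std(ddof=1) ** 2 + scores_b.std(ddof=1) ** 2)
    return (scores_a.mean() - scores_b.mean()) / combined_std

# placeholder RMSE values for two hypothetical methods (NOT real results)
rmse_method_a = [3.0, 3.2, 2.8, 3.1, 2.9]
rmse_method_b = [3.5, 3.6, 3.4, 3.7, 3.3]
n_sigma = n_sigma_difference(rmse_method_a, rmse_method_b)
print('difference in units of sigma:', n_sigma)
```

A negative value means the first method has the lower mean RMSE; the larger the absolute value, the less likely the difference is due to seed-to-seed noise.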

Now you can
- Describe the three main types of missingness patterns
- Evaluate simple approaches for handling missing values
- Apply XGBoost to a dataset with missing values
- Apply multivariate imputation
- Apply the reduced-features model (also called the pattern submodel approach)
- Decide which approach is best for your dataset

## We are hiring!
- If you don't mind the harsh New England winters and you enjoy working in an academic environment, please come and talk to me or send an email (andras_zsom@brown.edu).
- The successful applicant will collaborate with Brown's Advancement, work on academic research projects with faculty members, and be encouraged to organize and teach at workshops and supervise interns.
- MSc is required!
- PhD and/or industry experience preferred.
- Earliest starting date: February 1st 2020.

### Thanks for your attention!