# Preprocessing and Training
In this part of Capstone Project 2, the data will be preprocessed to make it ready for use in machine learning algorithms.

## Table of Contents
* [1 Import Packages and Load Data](#1-Import-Packages-and-Load-Data)
* [2 Initial Preprocessing](#2-Initial-Preprocessing)
    * [2.1 Imputing Missing Values](#2.1-Imputing-Missing-Values)
    * [2.2 Removing Rows Without Target Labels](#2.2-Removing-Rows-Without-Target-Labels)
    * [2.3 Splitting Dataframe into Feature Matrix and Target Vector](#2.3-Splitting-Dataframe-into-Feature-Matrix-and-Target-Vector)
* [3 One Hot Encoding of Categorical Features](#3-One-Hot-Encoding-of-Categorical-Features)
* [4 Train-Test Split](#4-Train-Test-Split)
* [5 Scaling Numeric Features](#5-Scaling-Numeric-Features)
* [6 Export Results](#6-Export-Results)
* [7 Final Remarks](#7-Final-Remarks)

# 1 Import Packages and Load Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import sparse
from sklearn.preprocessing import QuantileTransformer, StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
import pickle

In [2]:
path = '../Data/df_clean_null.pkl'
df = pd.read_pickle(path)

# 2 Initial Preprocessing

## 2.1 Imputing Missing Values

**Defining Imputer Function**

In [3]:
def imputer(dataframe, category_value_tofill, columns_drop, columns_mode, columns_median):
    '''Impute missing values in the input dataframe.
    
       Parameters
       ----------
       dataframe: Pandas dataframe
           Dataframe whose missing values will be imputed.
       
       category_value_tofill: int, float, or string
           Value used to fill missing values in categorical features.
       
       columns_drop: list-like
           List of columns to drop.
       
       columns_mode: list-like
           List of numeric columns to impute with the mode.
           
       columns_median: list-like
           List of numeric columns to impute with the median.
       
       Returns
       -------
       Pandas dataframe
           Dataframe with the listed columns dropped and missing values filled.
    '''
    
    #Fill null values in categorical features with category_value_tofill
    for column in dataframe.select_dtypes(include='category'):
        if any(dataframe[column].isnull()):
            dataframe[column] = dataframe[column].cat.add_categories([category_value_tofill])
            dataframe[column] = dataframe[column].fillna(value=category_value_tofill)
            
    #Dropping columns, imputing with mode, and imputing with median.
    dataframe = dataframe.drop(columns=columns_drop)
    dataframe = dataframe.fillna(dataframe[columns_mode].mode().iloc[0, :])
    dataframe = dataframe.fillna(dataframe[columns_median].median())
    
    return dataframe
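
To make the three imputation steps easier to follow, here is a minimal sketch on a toy dataframe. The column names (`smoker`, `visits`, `height`) and values are made up for illustration and are not part of the survey data; the calls mirror the ones used inside `imputer`.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns (not from the survey data).
toy = pd.DataFrame({
    "smoker": pd.Categorical([1.0, 2.0, np.nan]),
    "visits": [3.0, np.nan, 3.0],
    "height": [150.0, 170.0, np.nan],
})

# Categorical column: the placeholder must be registered as a category
# before it can be used as a fill value.
toy["smoker"] = toy["smoker"].cat.add_categories([999.0]).fillna(999.0)

# Numeric columns: fillna with a Series fills each column whose name
# matches a label in the Series index and leaves the others untouched.
toy = toy.fillna(toy[["visits"]].mode().iloc[0, :])
toy = toy.fillna(toy[["height"]].median())

print(toy)
```

Because `fillna` matches on column labels, the mode is only written into the mode columns and the median only into the median columns.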

**Defining Parameters for the Imputing Function**

In [4]:
#Value used to fill null values in categorical features
val_null = 999.0

#Columns to drop
cols_to_drop = ['POORHLTH_ALT', 'DIABAGE2_ALT', 'AVEDRNK2_ALT', 'DRNK3GE5_ALT', 'MAXDRNKS_ALT', \
                'BLDSUGAR_ALT', 'FEETCHK2_ALT', 'DOCTDIAB_ALT', 'CHKHEMO3_ALT', 'FEETCHK_ALT', \
                'LONGWTCH_ALT', 'ASTHMAGE_ALT', 'ASERVIST_ALT', 'ASDRVIST_ALT', 'ASRCHKUP_ALT', \
                'ASACTLIM_ALT', 'SCNTWRK1_ALT', 'SCNTLWK1_ALT', '_STRWT', '_CLLCPWT', '_DUALCOR', \
                'EXACTOT1', 'EXACTOT2', 'IDATE', 'SEQNO', '_PSU', 'HIVTSTD3', 'CVDINFR4', \
                'CVDCRHD4', 'HAREHAB1', 'STREHAB1']

#Columns to impute with mode
cols_imp_mode = ['NUMADULT', 'NUMMEN', 'NUMWOMEN', 'HHADULT_ALT', 'PHYSHLTH_ALT', 'MENTHLTH_ALT', \
                 'CHILDREN_ALT', 'ALCDAY5_ALT', 'ADPLEASR_ALT', 'ADDOWN_ALT', 'ADSLEEP_ALT', 'ADENERGY_ALT', \
                 'ADEAT1_ALT', 'ADFAIL_ALT', 'ADTHINK_ALT', 'ADMOVE_ALT', 'FTJUDA1_', 'FRUTDA1_', 'BEANDAY_', \
                 'GRENDAY_', 'ORNGDAY_', 'VEGEDA1_', '_FRUTSUM', '_VEGESUM']

#Columns to impute with median
cols_imp_median = ['HTIN4', 'HTM4', 'WTKG3_ALT', '_BMI5', 'METVL11_', 'METVL21_', 'MAXVO2__ALT', 'FC60__ALT', \
                   'PADUR1_', 'PADUR2_', 'PAFREQ1__ALT', 'PAFREQ2__ALT', '_MINAC11', '_MINAC21', \
                   'STRFREQ__ALT', 'PAMIN11_', 'PAMIN21_', 'PA1MIN_', 'PAVIG11_', 'PAVIG21_', 'PA1VIGM_', \
                   'DROCDY3__ALT', '_DRNKWEK_ALT']

**Applying Imputing Function**

In [5]:
df = imputer(df, val_null, cols_to_drop, cols_imp_mode, cols_imp_median)

## 2.2 Removing Rows Without Target Labels

In [6]:
index_to_drop = df[df['_MICHD'] == val_null].index
df = df.drop(labels = index_to_drop)
df['_MICHD'] = df['_MICHD'].cat.remove_unused_categories()
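
The effect of this cell can be checked on a small example. The sketch below uses a hypothetical `label` column in place of `_MICHD` and shows that the placeholder category stays registered after the rows are dropped, until `remove_unused_categories` is called.

```python
import pandas as pd

# Hypothetical target column; 999.0 plays the role of val_null.
toy = pd.DataFrame({"label": pd.Categorical([1.0, 2.0, 999.0, 2.0])})

# Drop the rows whose label is the placeholder.
toy = toy.drop(labels=toy[toy["label"] == 999.0].index)

# The placeholder is still listed as a category at this point...
before = list(toy["label"].cat.categories)

# ...and disappears only after the unused categories are removed.
toy["label"] = toy["label"].cat.remove_unused_categories()
after = list(toy["label"].cat.categories)

print(before, after)
```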

## 2.3 Splitting Dataframe into Feature Matrix and Target Vector

In [7]:
y = df['_MICHD'].map({1.0:1.0, 2.0:0.0}) #Remapping of labels: 0.0 means "No", 1.0 means "Yes"
X = df.drop(columns=['_MICHD'])

del df

# 3 One Hot Encoding of Categorical Features

**Splitting the Features**  
The features fall into three groups, and each group gets its own transformation: one hot encoding for categorical features, quantile transformation for numeric features with right-skewed distributions, and standard scaling for numeric features that are approximately Gaussian.

One thing to keep in mind is that the one hot encoding will happen before the train-test split, but the quantile transformation and the standard scaling will happen after it.

In [8]:
cat_cols = X.select_dtypes(include='category').columns

skewed_num_cols = ['NUMADULT', 'NUMMEN', 'NUMWOMEN', 'HHADULT_ALT', 'PHYSHLTH_ALT', 'MENTHLTH_ALT', \
                   'CHILDREN_ALT', 'ALCDAY5_ALT', 'ADPLEASR_ALT', 'ADDOWN_ALT', 'ADSLEEP_ALT', \
                   'ADENERGY_ALT', 'ADEAT1_ALT', 'ADFAIL_ALT', 'ADTHINK_ALT', 'ADMOVE_ALT', '_RAWRAKE', \
                   '_WT2RAKE', '_LLCPWT', 'WTKG3_ALT', '_BMI5', 'DROCDY3__ALT', '_DRNKWEK_ALT', \
                   'FTJUDA1_', 'FRUTDA1_', 'BEANDAY_', 'GRENDAY_', 'ORNGDAY_', 'VEGEDA1_', '_MISFRTN', \
                   '_MISVEGN', '_FRUTSUM', '_VEGESUM', 'METVL11_', 'METVL21_', 'PADUR1_', \
                   'PADUR2_', 'PAFREQ1__ALT', 'PAFREQ2__ALT', '_MINAC11', '_MINAC21', 'STRFREQ__ALT', \
                   'PAMIN11_', 'PAMIN21_', 'PA1MIN_', 'PAVIG11_', 'PAVIG21_', 'PA1VIGM_']

non_skewed_num_cols = [column for column in X.select_dtypes(include='float').columns \
                       if column not in skewed_num_cols]
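
The list of skewed columns above is written out by hand. As a rough illustration of how such a list could be checked, the sketch below computes the sample skewness of two synthetic columns with `DataFrame.skew()`. The data, the column names, and the cutoff of 0.5 are assumptions made for this example, not the criterion used in this project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: one roughly symmetric column, one right-skewed column.
toy = pd.DataFrame({
    "roughly_normal": rng.normal(size=5000),
    "right_skewed": rng.exponential(size=5000),
})

# Sample skewness per column; positive values indicate a longer right tail.
skewness = toy.skew()

# Hypothetical cutoff, chosen only for this illustration.
threshold = 0.5
skewed = skewness[skewness > threshold].index.tolist()

print(skewness)
print(skewed)
```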

In [9]:
X_cat = X[cat_cols]
X_skewed = X[skewed_num_cols]
X_non_skewed = X[non_skewed_num_cols]

del X

**Saving Column Names of Numeric Features**  
Eventually, the feature matrix will become a sparse matrix and its column labels will be lost. I will therefore save them now so I can later search the columns of the sparse matrix.

In [10]:
skewed_col_names = X_skewed.columns
non_skewed_col_names = X_non_skewed.columns

**One Hot Encoding and Making Matrices of Numeric Features Sparse**  
The output of the one hot encoder is a sparse matrix. To be able to concatenate the three matrices, the other two need to also be made into sparse matrices.

In [11]:
oh_encoder = OneHotEncoder()
X_cat = oh_encoder.fit_transform(X_cat)

X_skewed = sparse.csr_matrix(X_skewed)

X_non_skewed = sparse.csr_matrix(X_non_skewed)

**Saving Column Names of One Hot Encoded Categorical Features**

In [12]:
cat_col_names = oh_encoder.get_feature_names_out()

**List of All Column Names after One Hot Encoding**

In [13]:
col_names = np.concatenate((cat_col_names, np.array(skewed_col_names), np.array(non_skewed_col_names)))

**Getting Slices for each of the Three Subsets**  
Once the three subsets are converted to sparse matrices, they lose their column labels, so columns can no longer be selected by name. To access each subset after concatenation, we record the column range (slice) that it occupies.

In [14]:
cat_slice = slice(0, X_cat.shape[1])
skewed_slice = slice(cat_slice.stop, cat_slice.stop + X_skewed.shape[1])
non_skewed_slice = slice(skewed_slice.stop, skewed_slice.stop + X_non_skewed.shape[1])
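
The slice bookkeeping can be verified on three small matrices. In this sketch the matrices `a`, `b`, and `c` are placeholders for the three feature subsets, filled with constant values so that it is obvious which block a slice returns.

```python
import numpy as np
from scipy import sparse

# Placeholder blocks with 3, 2, and 1 columns.
a = sparse.csr_matrix(np.ones((2, 3)))
b = sparse.csr_matrix(np.full((2, 2), 2.0))
c = sparse.csr_matrix(np.full((2, 1), 3.0))

# Each slice starts where the previous one stops.
a_slice = slice(0, a.shape[1])
b_slice = slice(a_slice.stop, a_slice.stop + b.shape[1])
c_slice = slice(b_slice.stop, b_slice.stop + c.shape[1])

# CSR format supports column slicing after stacking.
stacked = sparse.hstack([a, b, c], format="csr")
recovered_b = stacked[:, b_slice]

print(a_slice, b_slice, c_slice)
print(recovered_b.toarray())
```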

**Concatenating the Three Subsets after One Hot Encoding**

In [15]:
X = sparse.hstack([X_cat, X_skewed, X_non_skewed], format='csr')

del X_cat, X_skewed, X_non_skewed

# 4 Train-Test Split

**Checking Proportion of Positive Labels**  
The output below shows that the dataset is imbalanced: only about 8.8% of the labels are positive.

In [16]:
prop_1 = len(y[y == 1]) / len(y)
prop_0 = 1 - prop_1

prop_1, prop_0

(0.08830117436242041, 0.9116988256375795)

**Train-Test Split with Stratification**  
Since the dataset is imbalanced, we want the proportion of positive labels in both the training and test sets to be similar to the proportion before the split.

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y)

del X

Checking that the proportion of positive labels after the split is similar to the proportion before it.

In [18]:
prop_1_train = len(y_train[y_train==1]) / len(y_train)
prop_0_train = 1 - prop_1_train

prop_1_test = len(y_test[y_test==1]) / len(y_test)
prop_0_test = 1 - prop_1_test

prop_1_train, prop_0_train, prop_1_test, prop_0_test

(0.08830008199742294,
 0.9116999180025771,
 0.08830554380992651,
 0.9116944561900735)
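
The same check can be reproduced on synthetic labels. In the sketch below the label vector with 10% positives is made up, and `random_state=0` is an assumption added only to make the toy run repeatable; the split above does not set it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 100 positives, 900 negatives.
y_toy = np.array([1] * 100 + [0] * 900)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions close to equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=0
)

print(y_toy.mean(), y_tr.mean(), y_te.mean())
```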

# 5 Scaling Numeric Features
The numeric features in the train and test matrices will be scaled so that they have similar orders of magnitude. Two different types of scaling are used: quantile transformation for numeric features with right-skewed distributions, and standard scaling for features that are approximately Gaussian. Both transformers are fitted on the training set only and then applied to both sets.

**Splitting into Subsets**  
We use the previously defined slices to split the features into the three subsets again: categorical features, skewed features, and non-skewed features. 

In [19]:
X_train_cat = X_train[:, cat_slice]
X_train_skewed = X_train[:, skewed_slice]
X_train_non_skewed = X_train[:, non_skewed_slice]

X_test_cat = X_test[:, cat_slice]
X_test_skewed = X_test[:, skewed_slice]
X_test_non_skewed = X_test[:, non_skewed_slice]

del X_train, X_test

**Applying Quantile Transformation on Right-Skewed Features**

In [None]:
q_transformer = QuantileTransformer()
q_transformer.fit(X_train_skewed)

X_train_skewed = q_transformer.transform(X_train_skewed)
X_test_skewed = q_transformer.transform(X_test_skewed)
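
As a rough illustration of what this step does, the sketch below applies the same fit-on-train, transform-both pattern to synthetic right-skewed data. The exponential samples and `n_quantiles=100` are assumptions made to keep the example small; the cell above uses the default settings.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)

# Synthetic right-skewed feature, split into a train and a test part.
train = rng.exponential(size=(1000, 1))
test = rng.exponential(size=(200, 1))

# Quantiles are learned from the training part only.
qt = QuantileTransformer(n_quantiles=100)
qt.fit(train)

# With the default output distribution, values are mapped into [0, 1].
train_t = qt.transform(train)
test_t = qt.transform(test)

print(train_t.min(), train_t.max(), test_t.min(), test_t.max())
```

Test values outside the range seen during fitting are mapped to the bounds of the output range, so the test set stays within [0, 1] as well.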

**Applying Standard Scaling on Gaussian-Distributed Features**  
It's important to note that features cannot be centered during standard scaling if the feature matrix is sparse. Therefore, the feature submatrix has to be made dense first, then scaled, then made sparse again, and finally be concatenated with the other two submatrices. 

In [None]:
#Making the matrices dense
X_train_non_skewed_dense = X_train_non_skewed.toarray()
X_test_non_skewed_dense = X_test_non_skewed.toarray()

#Initializing and fitting the scaler
scaler = StandardScaler()
scaler.fit(X_train_non_skewed_dense)

#Applying the scaling to the dense matrices
X_train_non_skewed_dense = scaler.transform(X_train_non_skewed_dense)
X_test_non_skewed_dense = scaler.transform(X_test_non_skewed_dense)

#Making the matrices sparse again
X_train_non_skewed = sparse.csr_matrix(X_train_non_skewed_dense)
X_test_non_skewed = sparse.csr_matrix(X_test_non_skewed_dense)

del X_train_non_skewed_dense, X_test_non_skewed_dense
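
The reason for the round trip through a dense matrix can be seen on a tiny example. In this sketch the 3x2 matrix is made up; fitting a default `StandardScaler` on its sparse version raises a `ValueError`, while the dense version is centered and scaled as expected.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

dense = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
sp = sparse.csr_matrix(dense)

# Centering would destroy sparsity, so scikit-learn refuses to do it.
try:
    StandardScaler().fit(sp)
    centering_failed = False
except ValueError:
    centering_failed = True

# Dense input works: each column ends up with mean 0 and unit variance.
scaled = StandardScaler().fit_transform(dense)

# The scaled result can be made sparse again for concatenation.
scaled_sparse = sparse.csr_matrix(scaled)

print(centering_failed)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Passing `with_mean=False` would allow sparse input, but the features would then be scaled without being centered, which is a different transformation from the one applied here.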

**Concatenating the Submatrices After Applying Transformations**

In [None]:
X_train = sparse.hstack([X_train_cat, X_train_skewed, X_train_non_skewed])
X_test = sparse.hstack([X_test_cat, X_test_skewed, X_test_non_skewed])

del X_train_cat, X_train_skewed, X_train_non_skewed
del X_test_cat, X_test_skewed, X_test_non_skewed

# 6 Export Results

In [None]:
with open('../Data/preprocessed.pkl.zip', 'wb') as f:
    dump_list = [col_names, X_train, X_test, y_train, y_test]
    dump_list.insert(0, len(dump_list))
    for x in dump_list:
        pickle.dump(x, f)
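
Since the file holds several pickled objects written one after another, with the object count stored first, it has to be read back with repeated `pickle.load` calls. The sketch below shows the round trip on placeholder objects and a temporary file; the path and the contents are assumptions for illustration only.

```python
import os
import pickle
import tempfile

# Placeholder objects standing in for col_names, X_train, and so on.
objects = ["names", [1, 2], [3, 4]]

# Temporary file used only for this example.
path = os.path.join(tempfile.mkdtemp(), "toy.pkl")

# Write the count first, then each object in order.
with open(path, "wb") as f:
    pickle.dump(len(objects), f)
    for obj in objects:
        pickle.dump(obj, f)

# Read the count, then load exactly that many objects.
with open(path, "rb") as f:
    count = pickle.load(f)
    loaded = [pickle.load(f) for _ in range(count)]

print(count, loaded)
```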

# 7 Final Remarks
The data has now been preprocessed. I have used one hot encoding on categorical features, split the data into a training set and a test set, and applied appropriate scaling to numeric features. The data stored in the variables X_train, X_test, y_train, and y_test are now ready to be used as input for machine learning algorithms.