## Case Study Assignment – Data Mining
A healthcare organization, together with a couple of government hospitals in a city, has collected information about patients' vitals that could reveal whether a person is likely to develop coronary heart disease within the next ten years. Such a study is useful for identifying the disease early and intervening medically where necessary. This would help improve not only health outcomes but also the economy, since health performance and economic performance have been found to be interlinked.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
# for the model
from sklearn.model_selection import train_test_split

# the scaler - for standardisation
from sklearn.preprocessing import StandardScaler

# for Q-Q plots
import scipy.stats as stats

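# from feature-engine (module paths as in releases before 1.0; later releases
# expose these as feature_engine.imputation and feature_engine.outliers)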
from feature_engine import missing_data_imputers as mdi
from feature_engine.outlier_removers import Winsorizer

In [3]:
# load dataset

data = pd.read_csv('data_files/Problem2_Data.csv')

data.shape

(34281, 25)

In [None]:
data.head()

### Types of variables

In [None]:
# let's inspect the type of variables in pandas

data.dtypes

In [None]:
# let's inspect the variable values

for var in data.columns:
    print(var, data[var].unique()[0:20], '\n')

In [None]:
# numerical: discrete vs continuous

discrete = [var for var in data.columns if data[var].dtype!='O' and var!='Target' and data[var].nunique()<10]
continuous = [var for var in data.columns if data[var].dtype!='O' and var!='Target' and var not in discrete]

print('There are {} discrete variables: {}'.format(len(discrete), discrete))
print('There are {} continuous variables: {}'.format(len(continuous), continuous))

### Variable characteristics

In [None]:
# missing data

data.isnull().mean()

In [None]:
data.isnull()

In [None]:
# outliers

data[continuous].boxplot(figsize=(10,4))

In [None]:
# outliers in discrete
data[discrete].boxplot(figsize=(10,4))

In [None]:
# feature magnitude

data.describe()

In [None]:
# separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data.drop('Target', axis=1),  # predictors
    data['Target'],  # target
    test_size=0.2,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

In [None]:
X_train_copy = X_train.copy(deep=True)

### Missing data imputation

In [None]:
X_train.A2.isnull().mean()

In [None]:
# let's check the distribution of A2 before
# the imputation: histogram

fig = plt.figure()
ax = fig.add_subplot(111)

# original data
X_train['A2'].hist(bins=50, ax=ax, density=True, color='red')


In [None]:
# we call the imputer from feature-engine
# we specify the imputation strategy, median in this case
cols_to_use = ['A2']
imputer = mdi.MeanMedianImputer(imputation_method='median', variables=cols_to_use)

In [None]:
# we fit the imputer on the train set and impute A2 in one step

X_train = imputer.fit_transform(X_train)

In [None]:
# here we can see the median assigned to each variable
imputer.imputer_dict_

In [None]:
# feature-engine returns a dataframe, so we can compare it with the untouched copy:
# fraction of missing values in A2 before imputation

X_train_copy['A2'].isnull().mean()

In [None]:
X_train['A2'].isnull().mean()

In [None]:
# we can see that the distribution has changed slightly,
# with more values now accumulating at the median

fig = plt.figure()
ax = fig.add_subplot(111)

# original variable distribution
X_train_copy['A2'].plot(kind='kde', ax=ax, label='A2 original')

# variable imputed with the median
X_train['A2'].plot(kind='kde', ax=ax, color='red', label='A2 median imputed')

# add legends (explicit labels, otherwise both curves are named A2)
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')

As noted in the comments above, median imputation does not distort the original distribution of A2 much: because few values are missing, the imputed variable only shows slightly more values concentrated at the median. As the variable is skewed, the mean is biased by the values at the far end of the distribution. Therefore, the median is a better representation of the majority of the values in the variable.
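To illustrate why the median is preferred for a skewed variable, here is a minimal sketch on a synthetic right-skewed sample. The lognormal sample and the variable names are assumptions made for illustration; they are not the A2 column.

```python
import numpy as np
import pandas as pd

# hypothetical right-skewed sample (NOT the A2 column from the dataset)
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

mean_value = values.mean()
median_value = values.median()

# share of observations lying below each candidate fill value
share_below_mean = (values < mean_value).mean()
share_below_median = (values < median_value).mean()

print(f"mean={mean_value:.2f}, median={median_value:.2f}")
print(f"share below mean={share_below_mean:.2f}, "
      f"share below median={share_below_median:.2f}")
```

In a sample like this the mean sits above well over half of the observations, while the median splits them evenly, which is why it is the more typical fill value.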

In [None]:
# median imputation adds values at a single point, so it tends to reduce the variance
# the change should be small here, because the percentage of missing
# data in A2 is quite low, ~5%

print('Original variable variance: ', X_train_copy['A2'].var())
print('Variance after median imputation: ', X_train['A2'].var())

In [None]:
# outliers in A2 before median imputation 

X_train_copy[['A2']].boxplot()

In [None]:
# outliers in A2 after median imputation 

X_train[['A2']].boxplot()

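From the boxplots above, we can see that after the imputation there are a few more outliers among the higher A2 values. Filling the gaps with the median narrows the interquartile range, so more of the high values fall beyond the whiskers.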
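The effect can be reproduced on synthetic data. The sketch below is an illustration under assumptions: the lognormal sample, the 5% missing share and the helper `count_iqr_outliers` are invented here and do not come from the dataset.

```python
import numpy as np
import pandas as pd


def count_iqr_outliers(series, fold=1.5):
    """Count values outside [Q1 - fold*IQR, Q3 + fold*IQR]; NaN is never counted."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - fold * iqr) | (series > q3 + fold * iqr)
    return int(outside.sum())


# hypothetical skewed variable with 5% of its values removed at random
rng = np.random.default_rng(0)
original = pd.Series(rng.lognormal(size=10_000))
missing_positions = rng.choice(len(original), size=500, replace=False)
original.iloc[missing_positions] = np.nan

# median imputation, done by hand with pandas
imputed = original.fillna(original.median())

print('variance before / after:', original.var(), imputed.var())
print('IQR outliers before / after:',
      count_iqr_outliers(original), count_iqr_outliers(imputed))
```

Adding values at the median pulls both quartiles towards it, so the fences of the IQR rule can only move inwards and the outlier count cannot go down.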

In [None]:
# function to create histogram, Q-Q plot and
# boxplot. We learned this in section 3 of the course


def diagnostic_plots(df, variable):
    # function takes a dataframe (df) and
    # the variable of interest as arguments

    # define figure size
    plt.figure(figsize=(16, 4))

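    # histogram (sns.histplot needs seaborn >= 0.11; it replaces
    # sns.distplot, which newer seaborn releases no longer provide)
    plt.subplot(1, 3, 1)
    sns.histplot(df[variable], bins=30, kde=True)
    plt.title('Histogram')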

    # Q-Q plot
    plt.subplot(1, 3, 2)
    stats.probplot(df[variable], dist="norm", plot=plt)
    plt.ylabel('Variable quantiles')

    # boxplot
    plt.subplot(1, 3, 3)
    sns.boxplot(y=df[variable])
    plt.title('Boxplot')

    plt.show()

In [None]:
# let's look for outliers in A2 before capping

diagnostic_plots(X_train, 'A2')

In [None]:
# create the capper

windsoriser = Winsorizer(distribution='skewed', # choose skewed for IQR rule boundaries or gaussian for mean and std
                          tail='both', # cap left, right or both tails 
                          fold=1.5,
                          variables=continuous)

X_train = windsoriser.fit_transform(X_train)

In [None]:
# the same diagnostic plots for A2 after capping
diagnostic_plots(X_train, 'A2')

In [None]:
# we can inspect the minimum caps for each variable
windsoriser.left_tail_caps_
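The IQR rule named in the capper's comment can be sketched by hand. This is a minimal illustration on a toy series, assuming the 'skewed' option places the caps at Q1 - fold * IQR and Q3 + fold * IQR as that comment describes. The names `iqr_caps` and `toy` are invented here, and the numbers are not taken from the dataset.

```python
import pandas as pd


def iqr_caps(series, fold=1.5):
    """Return the (lower, upper) caps given by the IQR proximity rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr


# hypothetical toy variable with one extreme value at the right tail
toy = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 50.0])

lower, upper = iqr_caps(toy)

# capping (winsorising): values beyond a cap are replaced by the cap itself
capped = toy.clip(lower=lower, upper=upper)

print('caps:', lower, upper)
print(capped.tolist())
```

Only the extreme value is moved to the upper cap; every value inside the caps is left unchanged, which is what the diagnostic plots after capping should reflect.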

In [None]:
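# apply the fitted imputer and capper to the test set and keep the result
X_test = windsoriser.transform(imputer.transform(X_test))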

In [None]:
X_train.describe()

In [None]:
# standardisation: with the StandardScaler from sklearn

# set up the scaler
scaler = StandardScaler()

# fit the scaler to the train set, it will learn the parameters
scaler.fit(X_train)

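# transform train and test sets
X_train_scaled = scaler.transform(X_train)
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = scaler.transform(X_test)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)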

In [None]:
np.round(X_train_scaled.describe(), 1)

As expected, the mean of each variable, which was not centered at zero before scaling, is now around zero, and the standard deviation is 1.
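Standardisation subtracts each column's mean and divides by its standard deviation. The sketch below does this by hand on a toy dataframe with invented values; it assumes the population standard deviation (`ddof=0`), which is the convention `StandardScaler` uses.

```python
import numpy as np
import pandas as pd

# hypothetical toy data with two columns on very different scales
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 10.0],
    'b': [100.0, 150.0, 200.0, 250.0, 300.0],
})

# z = (x - mean) / std, column by column, with the population std (ddof=0)
standardised = (toy - toy.mean()) / toy.std(ddof=0)

column_means = standardised.mean().to_numpy()
column_stds = standardised.std(ddof=0).to_numpy()

print(np.round(column_means, 6))
print(np.round(column_stds, 6))
```

Note that `describe()` reports the sample standard deviation (`ddof=1`), so on a small sample it would show values slightly above 1; with tens of thousands of rows the difference disappears after rounding.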

In [None]:
# let's compare the variable distributions before and after scaling

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 5))

# before scaling
ax1.set_title('Before Scaling')
sns.kdeplot(X_train['A1'], ax=ax1)
sns.kdeplot(X_train['A2'], ax=ax1)
sns.kdeplot(X_train['A5'], ax=ax1)

# after scaling
ax2.set_title('After Standard Scaling')
sns.kdeplot(X_train_scaled['A1'], ax=ax2)
sns.kdeplot(X_train_scaled['A2'], ax=ax2)
sns.kdeplot(X_train_scaled['A5'], ax=ax2)
plt.show()

In the plots above, standardisation centered all the distributions at zero but preserved the shape of each original distribution. The value ranges are still not identical, but they are now much more homogeneous across the variables.
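That the shape is preserved follows from standardisation being a shift and a rescaling by a positive constant. A minimal check on a synthetic skewed sample (an assumption for illustration, not one of the A variables) compares the skewness and kurtosis before and after:

```python
import numpy as np
import pandas as pd

# hypothetical skewed sample standing in for one continuous variable
rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(size=5_000))

# standardise by hand: shift by the mean, rescale by the standard deviation
z = (x - x.mean()) / x.std(ddof=0)

# shape statistics are unchanged by a shift and a positive rescaling
print('skewness before / after:', x.skew(), z.skew())
print('kurtosis before / after:', x.kurt(), z.kurt())
```

A skewed variable therefore stays skewed after scaling; if a more symmetric distribution is needed, that calls for a separate variable transformation.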

In [None]:
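# rows where A15 lies strictly between -1 and 1
data[(data['A15']<1) & (data['A15']>-1)].shape

In [None]:
# rows where A15 equals -99
data[(data['A15'] == -99)].shape

In [None]:
# total number of rows, for comparison
data.shape[0]

In [None]:
# rows where A15 equals 99
data[(data['A15'] == 99)].shape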