# MS4610 Introduction to Data Analytics || Course Project 
### Data Cleaning and Augmentation
Notebook by **Group 12**

This notebook cleans the raw training data: it standardizes the tags used for missing values and corrects column data types. The columns were also renamed (outside this notebook) to make them easier to reference; see the EDA notebook for information on column names. The following operations have been performed:

1. Missing value tags (missing, na, N/A) replaced with `numpy.nan`
2. Label encoding some categorical columns and typecasting to appropriate dtypes 

**NOTE**: Synthetic imputation of missing data was performed outside this notebook using standalone Python scripts. Because these processes are computationally expensive and time-consuming, they were run on **Google Colab** kernels with GPU support. The code files are available in the main repository.

In [15]:
# Data handling and transformation libraries

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Imputation libraries

from impyute.imputation.cs import mice
from impyute.imputation.cs import fast_knn
from missingpy import MissForest

# Resampling library

from imblearn.over_sampling import SMOTE as smote

# Other libraries

import sys
import warnings

sys.setrecursionlimit(1000000)
warnings.filterwarnings("ignore")

print("Dependencies loaded")

Dependencies loaded


In [4]:
# Load training dataset

train = pd.read_csv('.././data/train.csv')
train_mf = pd.read_csv("/home/nishant/Desktop/IDA Project/mod_data/train_mf.csv")
train_mf_res = pd.read_csv("/home/nishant/Desktop/IDA Project/mod_data/train_mf_res.csv")

In [24]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83000 entries, 0 to 82999
Data columns (total 50 columns):
application_key          83000 non-null int64
credit_score             83000 non-null object
risk_score               77114 non-null float64
sev_def_any              82465 non-null float64
sev_def_auto             82465 non-null float64
sev_def_edu              82465 non-null float64
min_credit_rev           83000 non-null object
max_credit_act           83000 non-null object
max_credit_act_rev       83000 non-null object
total_credit_1_miss      83000 non-null object
total_credit             83000 non-null object
due_collected            83000 non-null object
total_due                83000 non-null object
annual_pay               83000 non-null object
annual_income            83000 non-null int64
property_value           83000 non-null object
fc_cards_act_rev         83000 non-null object
fc_cards_act             83000 non-null object
fc_lines_act             83000 non-null obj

## Useful Functions
This section defines helper functions that are used repeatedly during data cleaning, which keeps the code readable and the operations repeatable. Available functions:
1. **missing_table**: Tally of missing values in the dataset by column, sorted in descending order by default.
2. **drop_bad_rows**: Drops rows whose fraction of null values is at or above the specified threshold.
3. **drop_bad_cols**: Drops columns whose fraction of null values is at or above the specified threshold.

In [7]:
def missing_table(df, threshold=None, ascending=False):
    """
    Counts number of missing values and percentage of missing values in every column
    of input pandas DataFrame object.
    
    :params: 
        threshold: (int/float) returns only those columns with number/percentage of missing
                   values higher than this value, default is None (returns all columns)
        ascending: (boolean) sorts table in ascending order of missing values
                   if set to True, default is False
                   
    :return: columns with missing values above threshold; pandas DataFrame object with column 
             name, number of missing values and percentage of missing values
    """
    miss_vals = df.isnull().sum().values
    
    # Build the table from a dict so the count and percentage columns stay numeric
    miss_table = pd.DataFrame({'column': df.columns,
                               'missing values': miss_vals,
                               '% missing values': (miss_vals / len(df))*100})
    
    if threshold is not None:
        if threshold <= 1.0:
            # Thresholds up to 1.0 are read as a fraction of rows
            miss_table = miss_table.loc[miss_table['% missing values'] >= threshold*100.0, :]
        else:
            # Larger thresholds are read as an absolute count of missing values
            miss_table = miss_table.loc[miss_table['missing values'] >= threshold, :]
    
    return miss_table.sort_values(by='missing values', ascending=ascending)
            
            
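def drop_bad_rows(df, target=None, threshold=0.8):
    """
    Drops rows whose fraction of missing values is at or above threshold.
    If target is given, prints how many rows of each class were dropped.
    """
    # Rows of the input DataFrame with too many missing values
    bad_rows = df.loc[df.isnull().sum(axis=1)/df.shape[1] >= threshold, :]
    if target is not None:
        # Count by class label instead of relying on value_counts() ordering
        class0, class1 = (bad_rows[target] == 0).sum(), (bad_rows[target] == 1).sum()
        class0_pc, class1_pc = class0/len(df), class1/len(df)
        print("Dropped %d (%.4f%%) class 0 and %d (%.4f%%) class 1 examples" % 
              (class0, class0_pc*100, class1, class1_pc*100))
    df_new = df.drop(bad_rows.index, axis=0)
    return df_new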


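def drop_bad_cols(df, threshold=0.4):
    """
    Drops columns whose missing values are at or above threshold
    (a fraction of rows if <= 1.0, otherwise an absolute count).
    """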
    miss_table = missing_table(df, threshold=threshold)
    bad_cols = miss_table['column'].values.tolist()
    df_new = df.drop(bad_cols, axis=1)
    print("Dropped %d of %d columns" % (len(bad_cols), df.shape[1]))
    return df_new
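
The column-level tally and drop described above can also be sketched directly with `isnull().mean()`. This is a minimal illustration on a hypothetical toy frame (the column names `a`, `b`, `c` and the data are made up for the example), not a replacement for the helpers defined in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data) with a different amount of missingness per column
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# Fraction of missing values per column, largest first
missing_frac = toy.isnull().mean().sort_values(ascending=False)

# Columns at or above a 40% missing threshold are dropped
bad_cols = missing_frac[missing_frac >= 0.4].index.tolist()
toy_clean = toy.drop(columns=bad_cols)

print(missing_frac)
print("Dropped %d of %d columns" % (len(bad_cols), toy.shape[1]))
```

Here only column `b` (75% missing) crosses the threshold, so `toy_clean` keeps `a` and `c`.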

## Missing values tag correction
Missing values are represented by a variety of strings (`missing`, `na`, `N/A`), which makes them hard to handle during data exploration with `pandas`. We convert all of them to `numpy.nan` (not-a-number) values.

In [3]:
# NOTE: 
# Looking for a vectorized implementation of this operation
# The code below takes about 10 seconds to run, which is quite slow

miss_tags = ['missing', 'na', 'N/A']

for col in train.columns:
    for i in range(len(train)):
        if train.at[i, col] in miss_tags:
            train.at[i, col] = np.nan
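
A vectorized version of the loop above can be sketched with `DataFrame.replace`, which substitutes every exact match of a tag across all columns in one call. The toy frame below is hypothetical (its column names are borrowed from the dataset, the values are made up); on the real data the same call would be applied to `train`.

```python
import numpy as np
import pandas as pd

miss_tags = ['missing', 'na', 'N/A']

# Hypothetical toy data that mixes real values with missing-value tags
toy = pd.DataFrame({
    "credit_score": ["700", "missing", "650"],
    "total_due": ["na", "1200", "N/A"],
    "annual_income": [50000, 62000, 48000],
})

# Replace every cell that exactly equals one of the tags with NaN
toy_clean = toy.replace(miss_tags, np.nan)

print(toy_clean.isnull().sum())
```

`pd.read_csv` also accepts an `na_values` argument, so the tags could instead be converted at load time.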

After this operation, the DataFrame's `info()` method gives a clearer picture of the dataset: very few columns are fully non-null.

In [27]:
# DataFrame information post null value tagging

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83000 entries, 0 to 82999
Data columns (total 50 columns):
application_key          83000 non-null int64
credit_score             79267 non-null object
risk_score               77114 non-null float64
sev_def_any              82465 non-null float64
sev_def_auto             82465 non-null float64
sev_def_edu              82465 non-null float64
min_credit_rev           63299 non-null object
max_credit_act           75326 non-null object
max_credit_act_rev       63291 non-null object
total_credit_1_miss      71318 non-null object
total_credit             82465 non-null object
due_collected            36283 non-null object
total_due                68422 non-null object
annual_pay               73311 non-null object
annual_income            83000 non-null int64
property_value           49481 non-null object
fc_cards_act_rev         63757 non-null object
fc_cards_act             66501 non-null object
fc_lines_act             67641 non-null obj

## Label Encoding and Typecasting
Some columns have type `object` and need to be encoded as `float` so that further processing is possible. In this section, we perform the relevant transformations.

In [44]:
# Label encoding categorical columns

enc_columns = ['card_type']

le = LabelEncoder()
for col in enc_columns:
    # LabelEncoder expects a 1-D array, so encode one column at a time
    train[col] = le.fit_transform(train[col])

In [45]:
# Converting all columns to float data type
# We only have two categorical columns (card_type and location_id) which we can 
# change back to categorical type later

train = train.astype('float')
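
If a categorical column still contains `NaN` when it is encoded, one possible alternative is to use pandas category codes and restore the missing entries afterwards. This is a swapped-in technique and an assumption, not what the notebook does; the values `"C"` and `"L"` below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing entry
toy = pd.Series(["C", "L", np.nan, "C"], name="card_type")

# Category codes are integers assigned in sorted category order; -1 marks missing
codes = toy.astype("category").cat.codes

# Cast to float and turn the -1 placeholder back into NaN
encoded = codes.astype("float").where(codes >= 0)

print(encoded)
```

The missing entry stays missing, so it can still be handled by the imputation step later.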

In [8]:
# First, we will remove some rows and columns with too many missing values

train = drop_bad_rows(train, target='default_ind', threshold=0.75)
train = drop_bad_cols(train, threshold=0.4)

Dropped 378 (0.4554%) class 0 and 157 (0.1892%) class 1 examples
Dropped 9 of 50 columns


## Missing value imputation
In this section, we try several methods for synthetic imputation of missing values in the dataset and choose the best one based on model performance on the imputed datasets. The following methods have been used:
1. **Zero imputation**: All missing values are replaced with zero.
2. **Missing Forest (MissForest) imputation**: A random forest is trained for each column with missing values, treating that column as the target and the other columns as features, and its predictions fill in the gaps.

In [5]:
# Zero imputation

train_zero = train.fillna(0)

In [None]:
# Missing Forest imputation
# NOTE (DO NOT RUN ON JUPYTER):
# This imputation is very computationally expensive and time consuming
# Was performed externally on PyCharm

cols = train.columns.tolist()

# Impute values
# Function returns a numpy ndarray, which we convert to DataFrame again
imputer = MissForest()

print("[INFO] Imputation started")
X_imputed = imputer.fit_transform(train.values)

print("[INFO] Imputation complete")
train_mf = pd.DataFrame(X_imputed, columns=cols)

# Save new DataFrame to drive
train_mf.to_csv("/home/nishant/Desktop/IDA Project/mod_data/train_mf.csv", index=False)
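
The idea behind this imputation (predict each incomplete column from the others with a random forest, and iterate) can be sketched on a small scale with scikit-learn's `IterativeImputer` wrapped around a `RandomForestRegressor`. This is a swapped-in approximation on synthetic data, not the `missingpy` `MissForest` used above, and the sizes and parameters are chosen only to keep the example fast.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)

# Synthetic data: the third column is strongly related to the first
X_full = rng.normal(size=(60, 3))
X_full[:, 2] = X_full[:, 0] + 0.1 * rng.normal(size=60)

# Hide roughly 10% of the entries at random
X_missing = X_full.copy()
mask = rng.rand(60, 3) < 0.1
X_missing[mask] = np.nan

# Each column with gaps is modelled in turn from the remaining columns
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    max_iter=3,
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)

print("Missing before: %d, after: %d" % (np.isnan(X_missing).sum(), np.isnan(X_imputed).sum()))
```

Observed entries are left untouched; only the hidden entries are filled with model predictions.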

## Resampling minority class using SMOTE
**Synthetic Minority Over-sampling Technique (SMOTE)** is a method for reducing class imbalance in datasets: it creates synthetic minority-class examples by interpolating between existing minority examples and their nearest minority-class neighbours. In the cell below, we perform SMOTE resampling on the Missing Forest imputed data to obtain a resampled dataset with equal representation of defaulters and non-defaulters.

In [6]:
# Separate features and target

feature_cols = [c for c in train_mf.columns if c != 'default_ind']
y = train_mf['default_ind'].values
X = train_mf[feature_cols].values

# SMOTE resampling
sm = smote(sampling_strategy='auto', random_state=123)
X_res, y_res = sm.fit_resample(X, y)

pc_increase = (X_res.shape[0] - X.shape[0])*100/X.shape[0]
print("Increase in dataset size: %.2f%% (%d rows added)" % (pc_increase, X_res.shape[0] - X.shape[0]))

# Reform dataframe; the target is stacked as the last column, so it is named last
res_df_vals = np.hstack((X_res, y_res.reshape((-1, 1))))
res_df = pd.DataFrame(res_df_vals, columns=feature_cols + ['default_ind'])

res_df = res_df.sample(frac=1).reset_index(drop=True)

res_df.to_csv("/home/nishant/Desktop/IDA Project/mod_data/train_mf_res.csv", index=False)

Increase in dataset size: 42.53% (35069 rows added)
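
The interpolation step at the core of SMOTE can be sketched in plain NumPy. This is a simplified illustration on synthetic data, not the `imblearn` implementation used above; the function name `smote_sketch` and its parameters are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and one of their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)              # random base samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour per base
    gap = rng.random((n_new, 1))                                # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Synthetic imbalanced data: 50 majority and 20 minority samples
rng = np.random.default_rng(1)
X_major = rng.normal(0.0, 1.0, size=(50, 2))
X_minor = rng.normal(3.0, 1.0, size=(20, 2))

# Add enough synthetic minority points to match the majority class
X_new = smote_sketch(X_minor, n_new=len(X_major) - len(X_minor))
X_minor_balanced = np.vstack([X_minor, X_new])

print("Minority samples: %d -> %d" % (len(X_minor), len(X_minor_balanced)))
```

Each synthetic point lies on the line segment between two real minority samples, so no new point falls outside the range already covered by the minority class.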


## Something interesting happened
To check that our newest model is behaving as expected, we split the resampled data manually into training and test sets, which gives us a held-out set for a final validation.

In [14]:
data = train_mf_res
feature_cols = [c for c in data.columns if c != 'default_ind']

y = data['default_ind'].values
X = data[feature_cols].values

# Split dataset 80-20 into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=123)

# Recombine train and test X and y (target stacked as the last column)
train_arr = np.hstack((X_train, y_train.reshape((-1, 1))))
test_arr = np.hstack((X_test, y_test.reshape((-1, 1))))

# Convert to dataframes and save to drive
col_names = feature_cols + ['default_ind']
train_final = pd.DataFrame(train_arr, columns=col_names)
test_final = pd.DataFrame(test_arr, columns=col_names)

train_final.to_csv("/home/nishant/Desktop/IDA Project/mod_data/train_final.csv", index=False)
test_final.to_csv("/home/nishant/Desktop/IDA Project/mod_data/test_final.csv", index=False)
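
A quick way to confirm that a stratified split keeps the class balance is to compare the share of positive labels in each part. The sketch below uses synthetic, already balanced data as a stand-in for the resampled dataset; the array sizes are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(123)

# Synthetic stand-in for the resampled data: 4 features, balanced binary target
X = rng.normal(size=(1000, 4))
y = np.array([0] * 500 + [1] * 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123
)

# Share of class 1 in each split; stratification should keep both equal to the original
train_ratio = y_train.mean()
test_ratio = y_test.mean()

print("Class 1 share: train %.3f, test %.3f" % (train_ratio, test_ratio))
```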