# MS4610 Introduction to Data Analytics || Course Project 
### Data Cleaning
Notebook by **Group 12**

This notebook performs basic cleaning operations such as correcting data types and standardizing the tags used for missing values. The data columns were also renamed (outside this notebook) to make them easier to reference. The following operations have been performed:

1. Missing value tags (`missing`, `na`, `N/A`) replaced with `np.nan`

### Dataset Description

1. **train.csv**
    - `application_key`: primary key for dataset
    - `credit_score`: credit worthiness based on past transactions
    - `risk_score`: score based on number and riskiness of enquiries
    - `sev_def_any`: severity of default on any loan
    - `sev_def_auto`: severity of default on auto loans
    - `sev_def_edu`: severity of default on education loans
    - `min_credit_rev`: minimum credit on all revolving cards
    - `max_credit_act`: maximum credit on all active credit lines
    - `max_credit_act_rev`: maximum credit on all active revolving cards
    - `total_credit_1_miss`: sum of credit on all cards where borrower missed 1 payment
    - `total_credit`: total credit on all accepted credit lines
    - `due_collected`: amount of dues collected post default where due was more than 500
    - `total_due`: sum of amount due on all active credit cards
    - `annual_pay`: annual amount paid towards all cards last year
    - `annual_income`: annual income of individual
    - `property_value`: estimated market value of property owned by customer
    - `fc_cards_act_rev`: no. of active revolving credit cards on which full credit utilized
    - `fc_cards_act`: no. of active credit cards on which full credit utilized
    - `fc_lines_act`: no. of active credit lines on which full credit utilized
    - `pc_cards_act`: no. of active credit cards on which at least 75% credit utilized
    - `pc_lines_act`: no. of active credit lines on which at least 75% credit utilized
    - `loan_util_act_rev`: average utilization (%) on active revolving credit card loans
    - `line_util_past2`: average utilization (%) of line on all credit lines activated in past 2 years
    - `line_util_past1`: average utilization (%) of line on all credit cards activated in past 1 year
    - `line_util_1_miss`: average utilization (%) of line on credit cards on which borrower missed 1 payment during last 6 months
    - `tenure_act_rev`: average tenure of active revolving credit cards
    - `tenure_oldest_act`: tenure of oldest card among all active cards
    - `tenure_oldest_act_rev`: tenure of oldest revolving card among all active revolving cards
    - `last_miss_time`: number of days since last missed payment on any credit line
    - `tenure_oldest_line`: tenure of oldest line
    - `max_tenure_auto`: maximum tenure on all auto loans
    - `max_tenure_edu`: maximum tenure on all education loans
    - `total_tenure_act`: sum of tenures (months) on all active credit cards
    - `residence_time`: duration of stay at current residential address
    - `lines_act_1_miss`: number of active credit lines in past 6 months with 1 payment missed
    - `cards_rev_1_miss`: number of revolving credit cards in last 2 years with 1 payment missed
    - `lines_act`: number of active credit lines
    - `cards_act_t2`: credit cards with tenure of at least 2 years
    - `lines_act_2yrs`: number of credit lines activated in last 2 years
    - `lines_deli`: number of lines on which borrower has current delinquency
    - `line_util_edu`: utilization of lines (%) on active education loans
    - `line_util_auto`: utilization of lines (%) on active auto loans
    - `stress_index`: financial stress index of borrower
    - `lines_high_risk`: number of credit lines on which borrower has never missed a payment in last 2 years, yet considered high-risk loans based on market prediction of economic scenario
    - `max_due_ratio`: ratio of maximum amount due on all credit lines to sum of amounts due on all credit lines
    - `mort_2_miss`: number of mortgage loans on which 2 payments are missed
    - `auto_2_miss`: number of auto loans on which 2 payments are missed
    - `card_type`: C = Charge card or L = Lending card applied for
    - `location_id`: location ID
    - `default_ind`: default indicator

In [28]:
# Import dependencies

import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

print("Dependencies loaded")

Dependencies loaded


In [29]:
# Load training dataset

train = pd.read_csv('../data/train.csv')
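As an alternative to cleaning after loading, the missing value tags listed above could be handed to `read_csv` through its `na_values` parameter. The sketch below is only an illustration: it assumes the tags appear as exact cell values, and it uses a hypothetical three-row sample (`sample_csv`, with made-up values) in place of the real file.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for train.csv; the values are made up.
sample_csv = io.StringIO(
    "application_key,credit_score,card_type\n"
    "1,1696.0,C\n"
    "2,missing,L\n"
    "3,na,C\n"
)

# na_values adds these strings to the set pandas already parses as NaN.
sample = pd.read_csv(sample_csv, na_values=["missing", "na", "N/A"])

print(sample["credit_score"].dtype)         # float64 rather than object
print(sample["credit_score"].isna().sum())  # 2
```

Because the tags are turned into NaN during parsing, a column that holds only numbers and tags is read as a numeric dtype directly.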

In [32]:
train.head(10)

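| | application_key | credit_score | risk_score | sev_def_any | sev_def_auto | sev_def_edu | min_credit_rev | max_credit_act | max_credit_act_rev | total_credit_1_miss | ... | line_util_edu | line_util_auto | stress_index | lines_high_risk | max_due_ratio | mort_2_miss | auto_2_miss | card_type | location_id | default_ind |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 230032 | 1696.0 | 1.6541 | 0.0 | 0.0 | 0.0 | 0.0 | 6015.0 | 322.0 | 40369.0 | ... | 73.78 | 82.547 | 0.08696 | 10.0 | 0.63899 | NaN | 0.0 | C | 10 | 0 |
| 1 | 230033 | 1846.0 | 0.8095 | 0.0 | 0.0 | 0.0 | 102.0 | 7532.0 | 3171.0 | 18234.0 | ... | 99.129 | NaN | 0.0 | 13.0 | 0.63836 | NaN | NaN | L | 732 | 1 |
| 2 | 230034 | 1745.0 | 0.4001 | 0.0 | 0.0 | 0.0 | NaN | 2536.0 | NaN | NaN | ... | NaN | 29.29 | 0.0 | 1.0 | 1.0 | NaN | 0.0 | C | 89 | 1 |
| 3 | 230035 | 1739.0 | 0.2193 | 0.0 | 0.0 | 0.0 | 1982.0 | 26440.0 | 4955.0 | 20316.0 | ... | 96.272 | NaN | 0.15385 | 3.0 | 0.53241 | 0.0 | 0.0 | L | 3 | 0 |
| 4 | 230036 | 1787.0 | 0.0118 | 0.225 | 0.0 | 0.0 | 5451.0 | 5494.0 | 5494.0 | 7987.0 | ... | 115.019 | NaN | 0.0 | 1.0 | 0.92665 | NaN | NaN | L | 5 | 0 |
| 5 | 230037 | 1579.0 | NaN | 3.502 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 1.5 | 0.0 | NaN | NaN | NaN | C | 35 | 1 |
| 6 | 230038 | 1818.0 | 0.4001 | 0.0 | 0.0 | 0.0 | NaN | 1088.0 | NaN | 1536.0 | ... | 88.171 | NaN | 0.0 | 2.0 | 0.87224 | NaN | 0.0 | C | 2 | 1 |
| 7 | 230039 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | C | 2 | 0 |
| 8 | 230040 | 1836.0 | 0.1358 | 0.0 | 0.0 | 0.0 | 347.0 | 38964.0 | 17828.0 | 70729.0 | ... | NaN | NaN | 0.0 | 10.0 | 0.89868 | 0.0 | 0.0 | L | 5 | 1 |
| 9 | 230041 | 1839.0 | 0.1981 | 0.0 | 0.0 | 0.0 | 793.0 | 6131.0 | 6045.0 | 48959.0 | ... | NaN | 45.59 | 0.08824 | 14.0 | 0.33834 | NaN | 0.0 | L | 3247 | 0 |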


### Missing Values
Missing values are represented by a variety of strings, which makes them difficult to handle during data exploration with `pandas`. We will convert all of them into NumPy not-a-number values (`np.nan`).
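Before converting anything, it can help to see how often these tags occur in each column. The following is a minimal sketch on a hypothetical toy frame (`toy`, with made-up values), assuming the tags appear as exact string matches; the same call could be applied to `train`.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "credit_score": ["1696.0", "missing", "1745.0", "na"],
    "property_value": ["N/A", "250000", "missing", "missing"],
    "card_type": ["C", "L", "C", "L"],
})

miss_tags = ["missing", "na", "N/A"]

# isin() flags every cell equal to one of the tags; sum() counts the flags per column.
tag_counts = toy.isin(miss_tags).sum()
print(tag_counts)
```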

In [31]:
# NOTE: 
# Looking for a vectorized implementation of this operation
# The code below takes about 10 seconds to run, which is quite slow

miss_tags = ['missing', 'na', 'N/A']

for col in train.columns:
    for i in range(len(train)):
        if train.at[i, col] in miss_tags:
            train.at[i, col] = np.nan
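One candidate for the vectorized implementation mentioned in the note is `DataFrame.replace`, which accepts a list of values to replace. The sketch below runs on a hypothetical toy frame (`toy`, with made-up values) rather than on `train`, and assumes, like the loop, that only exact matches of a tag should be replaced.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "credit_score": ["1696.0", "missing", "1745.0"],
    "min_credit_rev": ["na", "102.0", "N/A"],
    "annual_income": [50000, 62000, 48000],
})

miss_tags = ["missing", "na", "N/A"]

# A single call replaces every exact match of a tag, across all columns.
cleaned = toy.replace(miss_tags, np.nan)

print(cleaned.isna().sum())
```

On the real data this would read `train = train.replace(miss_tags, np.nan)`. It avoids the Python-level loop over every cell, so it should run considerably faster, although the actual speed-up has not been measured here.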

This operation gives a clearer picture of the dataset when it is inspected with the DataFrame's `info()` method. Very few columns are free of null values.

In [33]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83000 entries, 0 to 82999
Data columns (total 50 columns):
application_key          83000 non-null int64
credit_score             79267 non-null object
risk_score               77114 non-null float64
sev_def_any              82465 non-null float64
sev_def_auto             82465 non-null float64
sev_def_edu              82465 non-null float64
min_credit_rev           63299 non-null object
max_credit_act           75326 non-null object
max_credit_act_rev       63291 non-null object
total_credit_1_miss      71318 non-null object
total_credit             82465 non-null object
due_collected            36283 non-null object
total_due                68422 non-null object
annual_pay               73311 non-null object
annual_income            83000 non-null int64
property_value           49481 non-null object
fc_cards_act_rev         63757 non-null object
fc_cards_act             66501 non-null object
fc_lines_act             67641 non-null object
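Many columns that hold numbers are still reported as `object` above, most likely because their values were read as strings alongside the missing value tags. A possible follow-up, sketched here on a hypothetical toy frame and not taken from the notebook, is to convert such columns with `pd.to_numeric`. The column list `numeric_cols` is an assumption and would have to be chosen by hand for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking the state after tag replacement:
# numeric values are stored as strings next to NaN.
toy = pd.DataFrame({
    "credit_score": ["1696.0", np.nan, "1745.0"],
    "card_type": ["C", "L", "C"],
})

# Assumed column list; card_type is categorical and is left untouched.
numeric_cols = ["credit_score"]

# errors="coerce" turns any string that cannot be parsed into NaN.
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(toy.dtypes)
```

Note that `errors="coerce"` silently converts unexpected strings to NaN, so it is worth comparing null counts before and after the conversion.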

In [34]:
# Export this modified dataset for further exploration

train.to_csv('/home/nishant/Desktop/IDA Project/mod_data/train.csv', index=False)