# Data Wrangling/Cleansing

### Data
The dataset contains training records for insurance customers, along with the premium renewal status of each policy (renewed or not).

| **Variable**                     | **Definition**                                               |
| -------------------------------- | ------------------------------------------------------------ |
| id                               | Unique ID of the policy                                      |
| perc_premium_paid_by_cash_credit | Percentage of premium amount paid by cash or credit card     |
| age_in_days                      | Age in days of policy holder                                 |
| Income                           | Monthly Income of policy holder                              |
| Count_3-6_months_late            | Number of premiums late by 3 to 6 months                     |
| Count_6-12_months_late           | Number of premiums late by 6 to 12 months                    |
| Count_more_than_12_months_late   | Number of premiums late by more than 12 months               |
| application_underwriting_score   | Underwriting Score of the applicant at the time of application (No applications under the score of 90 are insured) |
| no_of_premiums_paid              | Total premiums paid on time till now                         |
| sourcing_channel                 | Sourcing channel for application                             |
| residence_area_type              | Area type of Residence (Urban/Rural)                         |
| premium                          | Monthly premium amount                                       |
| renewal                          | Policy Renewed? (0 - not renewed, 1 - renewed)               |


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set()

In [None]:
df = pd.read_csv('data/train.csv', index_col='id')
df.shape

In [None]:
df.head()

In [None]:
df.columns.values

In [None]:
df.dtypes

In [None]:
# df = df.drop(['id'], axis=1)
# df.head(5)

In [None]:
df = df.rename(columns={'Income':'income',
                   'Count_3-6_months_late':'count_3-6_months_late', 
                   'Count_6-12_months_late':'count_6-12_months_late',
                   'Count_more_than_12_months_late':'count_more_than_12_months_late'
                  })
df.head(5)
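
Since every rename above only lower-cases the column name, the same result can be sketched without listing each column. This is a minimal illustration on a hypothetical empty frame (`toy` is not part of the dataset); it relies on `DataFrame.rename` accepting a callable for `columns`.

```python
import pandas as pd

# Hypothetical frame with a mix of capitalised and lower-case column names.
toy = pd.DataFrame(columns=['Income', 'Count_3-6_months_late', 'premium'])

# str.lower is applied to every column label; already lower-case names are unchanged.
toy = toy.rename(columns=str.lower)
renamed = list(toy.columns)
```

The explicit mapping used in the notebook is still the safer choice when only some columns should change.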

In [None]:
# Distinct sourcing channels
sorted( df.sourcing_channel.unique() )

In [None]:
# Distinct residence area types
sorted( df.residence_area_type.unique() )

In [None]:
# Check for null values
df.isnull().sum()
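
Raw null counts are easier to judge next to the share of rows they represent. The sketch below shows one way to tabulate both, using a small hypothetical frame (`toy` and its values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'score' is missing in half of the rows.
toy = pd.DataFrame({
    'income': [1000, 2000, 3000, 4000],
    'score': [99.1, np.nan, 98.0, np.nan],
})

# isnull().mean() is the fraction of missing rows per column.
summary = pd.DataFrame({
    'n_missing': toy.isnull().sum(),
    'pct_missing': toy.isnull().mean() * 100,
})
```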

In [None]:
df.loc[ df['count_more_than_12_months_late'].isnull() ].head(10)

#### Pre-processing 1: Convert age from days to years

In [None]:
df['age_in_yrs'] = (df['age_in_days'] / 365).astype(int)
df.head()
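
Note that `astype(int)` truncates, so the result is the number of completed years, and dividing by 365 ignores leap days. A minimal sketch on hypothetical values around year boundaries (the `toy` frame is made up) shows the behaviour and an equivalent floor-division form:

```python
import pandas as pd

# Hypothetical ages in days, chosen to sit on or just below year boundaries.
toy = pd.DataFrame({'age_in_days': [7300, 7664, 7665, 36500]})

# Same expression as the cell above: divide by 365, then truncate.
toy['age_in_yrs'] = (toy['age_in_days'] / 365).astype(int)

# Floor division gives the same result for non-negative ages.
toy['age_in_yrs_floor'] = toy['age_in_days'] // 365
```

Here 7664 days maps to 20 years and 7665 days to 21, which confirms the truncating behaviour.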

In [None]:
# rearrange columns
df = df[['age_in_days',
    'age_in_yrs',     
    'income',
    'application_underwriting_score',
    'premium',
    'perc_premium_paid_by_cash_credit',
    'no_of_premiums_paid',
    'count_3-6_months_late', 'count_6-12_months_late', 'count_more_than_12_months_late', 
    'sourcing_channel', 
    'residence_area_type',
    'renewal']]
df.head()

In [None]:
# Drop column 'age_in_days'
df.drop('age_in_days', axis=1, inplace=True)
df.head()

In [None]:
# Per Problem Statement: Underwriting Score of the applicant at the time of application 
# (No applications under the score of 90 are insured)
df[df['application_underwriting_score']<90].shape[0] == 0
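
One caveat with the check above: comparisons against `NaN` evaluate to `False`, so rows with a missing score pass silently. The sketch below, on a hypothetical series, separates the two questions ("any score below 90?" and "any score missing?").

```python
import numpy as np
import pandas as pd

# Hypothetical scores, including one missing value.
scores = pd.Series([99.89, 98.5, np.nan, 91.2],
                   name='application_underwriting_score')

below_threshold = int((scores < 90).sum())  # NaN < 90 is False, so NaN is not counted
missing = int(scores.isnull().sum())        # missing values are counted separately
min_score = scores.min()                    # min() skips NaN by default
```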

In [None]:
def isNull(df,cols):
    mask = False
    for c in cols:
        mask = mask | (df[c].isnull())
    return mask

null_delay_pay = df.loc[ isNull(df,['count_3-6_months_late', 'count_6-12_months_late', 'count_more_than_12_months_late']) ]
# null_delay_pay
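
The loop in `isNull` builds a row mask that is `True` when any of the given columns is null. As a sketch, the same mask can be obtained in one vectorised expression with `isnull().any(axis=1)`. The frame below is hypothetical, and `is_null_loop` simply restates the helper above so the two results can be compared.

```python
import numpy as np
import pandas as pd

def is_null_loop(df, cols):
    # Restatement of the notebook's helper: OR together the per-column null masks.
    mask = False
    for c in cols:
        mask = mask | df[c].isnull()
    return mask

# Hypothetical frame; column 'c' has no nulls and is not checked.
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 6.0],
    'c': [7, 8, 9],
})
cols = ['a', 'b']

loop_mask = is_null_loop(toy, cols)
vector_mask = toy[cols].isnull().any(axis=1)  # True where any selected column is null
```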

In [None]:
tmp = df.loc[ (~df['count_3-6_months_late'].isnull()) & (df['no_of_premiums_paid']<=2) ]
print(tmp['count_3-6_months_late'].median())
print(tmp['count_6-12_months_late'].median())
print(tmp['count_more_than_12_months_late'].median())

#### Pre-processing 2: Impute the delayed premium payment count columns with their median value of zero

In [None]:
# Impute the delayed premium payment count columns with their median value (zero).
tmp = df[['count_3-6_months_late', 'count_6-12_months_late', 'count_more_than_12_months_late']].fillna(0)
df.update(tmp)
df.isnull().sum()
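
The cell above hard-codes zero because that is the median printed earlier. A variant that computes the medians and fills each column with its own value might look like the sketch below; the `toy` frame and its values are hypothetical, and `late_cols` is just a convenience name.

```python
import numpy as np
import pandas as pd

late_cols = ['count_3-6_months_late',
             'count_6-12_months_late',
             'count_more_than_12_months_late']

# Hypothetical data with one missing value in each late-payment column.
toy = pd.DataFrame({
    'count_3-6_months_late': [0.0, 1.0, np.nan, 0.0],
    'count_6-12_months_late': [0.0, np.nan, 0.0, 2.0],
    'count_more_than_12_months_late': [np.nan, 0.0, 0.0, 0.0],
    'premium': [3300, 5400, 1200, 9600],
})

# median() skips NaN and returns one value per column.
medians = toy[late_cols].median()

# fillna with a Series fills each column with the value under its own label.
toy[late_cols] = toy[late_cols].fillna(medians)
```

Computing the value keeps the imputation correct if the input file changes, at the cost of making the fill value less visible in the notebook.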

In [None]:
print(df['application_underwriting_score'].mean())
print(df['application_underwriting_score'].median())
print(df['application_underwriting_score'].mode())

#### Pre-processing 3: Impute application_underwriting_score with its mode (most frequent value)

In [None]:
# Impute application_underwriting_score with its mode (most frequent value)
df.update( df['application_underwriting_score'].fillna(99.89) ) # fill with the mode value
df.isnull().sum()
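
The literal 99.89 comes from the `mode()` output printed earlier. As a sketch, the value can also be read from the data directly. `Series.mode()` returns a Series (there may be ties), so the example takes the first entry, which is the smallest of the tied values. The scores below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: 99.89 appears twice, two values are missing.
scores = pd.Series([99.89, 99.89, 98.7, np.nan, 99.1, np.nan])

# mode() ignores NaN by default and returns the modes in sorted order.
mode_value = scores.mode().iloc[0]
filled = scores.fillna(mode_value)
```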

In [None]:
print( sorted( df['no_of_premiums_paid'].unique() ) )

In [None]:
df.head()

In [None]:
# Discard the 'id' index and fall back to a default integer index
df = df.reset_index(drop=True)
df.head()

In [None]:
df.to_csv('data/train_processed_1.csv', index=False)

## Summary

The columns are re-ordered for convenience.

Four columns are renamed to lower case: `Income` and the three `Count_*` late-payment columns.

The following pre-processing steps are performed in this notebook:
1. Convert age from days to years
2. Impute the 3 delayed premium payment count columns with their median value of zero
3. Impute 'application_underwriting_score' with its mode (most frequent value) of 99.89

The following columns are dropped:
1. id
2. age_in_days (age_in_yrs is added instead)
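
For reference, the steps summarised above can be collected into a single function. This is a sketch under stated assumptions, not the notebook's actual code path: `preprocess` and `LATE_COLS` are hypothetical names, the sample rows and ids are made up, the mode is computed from the data instead of using the literal 99.89, and the column re-ordering step is omitted.

```python
import numpy as np
import pandas as pd

LATE_COLS = ['count_3-6_months_late',
             'count_6-12_months_late',
             'count_more_than_12_months_late']

def preprocess(raw):
    """Apply the notebook's cleaning steps to a frame indexed by 'id'."""
    out = raw.rename(columns=str.lower)                 # lower-case column names
    out['age_in_yrs'] = (out['age_in_days'] / 365).astype(int)
    out = out.drop(columns='age_in_days')               # replaced by age_in_yrs
    out[LATE_COLS] = out[LATE_COLS].fillna(0)           # median of these columns is zero
    score = 'application_underwriting_score'
    out[score] = out[score].fillna(out[score].mode().iloc[0])  # assumption: computed mode
    return out.reset_index(drop=True)                   # discard the 'id' index

# Hypothetical input rows shaped like the raw training file.
raw = pd.DataFrame({
    'age_in_days': [12058, 21546, 17531, 7300],
    'Income': [355060, 315150, 84140, 250510],
    'Count_3-6_months_late': [0.0, np.nan, 2.0, 0.0],
    'Count_6-12_months_late': [0.0, np.nan, 3.0, 0.0],
    'Count_more_than_12_months_late': [0.0, np.nan, 1.0, 0.0],
    'application_underwriting_score': [99.89, 99.89, np.nan, 98.5],
}, index=pd.Index([101, 102, 103, 104], name='id'))

clean = preprocess(raw)
```

Keeping the steps in one function makes it straightforward to apply the same cleaning to another file with the same columns.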