# Part I

In [1]:
import pandas as pd
import numpy as np

In [2]:
train_dem = pd.read_csv("traindemographics.csv")
train_perf = pd.read_csv("trainperf.csv")
train_prev = pd.read_csv("trainprevloans.csv")

## Description of the datasets
(cf. https://www.kaggle.com/c/data-science-nigeria-credit-risk-prediction/data)

There are three datasets, each available for both train and test. The sample submission has two possible outcomes: good (1) or bad (0).

### Demographic data (traindemographics.csv)

* customerid: Primary key used to merge to other data
* birthdate: date of birth of the customer
* bank_account_type: type of primary bank account
* longitude_gps
* latitude_gps
* bank_name_clients: name of the bank
* bank_branch_clients: location of the branch (optional, so missing in many cases)
* employment_status_clients: type of employment that customer has
* level_of_education_clients: highest level of education

### Performance data (trainperf.csv)

This is the repeat loan taken by the customer whose performance we need to predict. In other words, we need to predict whether this loan will default, given the customer's previous loans and demographics.

* customerid: Primary key used to merge to other data
* systemloanid: The id associated with the particular loan. The same customerid can have multiple systemloanid values, one for each loan the customer has taken out
* loannumber: The number of the loan that you have to predict
* approveddate: Date that loan was approved
* creationdate: Date that loan application was created
* loanamount: Loan value taken
* totaldue: Total repayment required to settle the loan (the capital loan value disbursed plus interest and fees)
* termdays: Term of loan
* referredby: customerid of the customer that referred this person; if missing, the customer was not referred
* good_bad_flag: good = loan settled on time; bad = loan not settled on time. This is the target variable that we need to predict
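
As a rough sketch of how the target and the merge key described above might be used, the example below maps `good_bad_flag` to the 1/0 encoding of the sample submission and left-joins demographics onto the performance rows. The frames `perf` and `dem` and the column name `target` are hypothetical stand-ins; the deduplication step is an assumption that the demographic file may repeat some `customerid` values.

```python
import pandas as pd

# Hypothetical stand-ins for train_perf and train_dem; values are illustrative only.
perf = pd.DataFrame({
    "customerid": ["a", "b", "c"],
    "good_bad_flag": ["Good", "Bad", "Good"],
})
dem = pd.DataFrame({
    "customerid": ["a", "a", "b"],  # "a" is repeated on purpose
    "bank_account_type": ["Savings", "Savings", "Other"],
})

# Encode the label as good = 1, bad = 0.
perf["target"] = perf["good_bad_flag"].map({"Good": 1, "Bad": 0})

# Keep one demographic row per customer so the join cannot duplicate loans.
dem_unique = dem.drop_duplicates(subset="customerid")

# Left join keeps every loan, even when no demographic record exists.
# validate raises an error if the right side still has duplicate keys.
merged = perf.merge(dem_unique, on="customerid", how="left", validate="many_to_one")
print(merged)
```

A left join is used here so that loans without a demographic record are kept with missing values instead of being dropped silently.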

### Previous loans data (trainprevloans.csv)

This dataset contains all previous loans that the customer had prior to the loan above that we want to predict the performance of. Each loan will have a different systemloanid, but the same customerid for each customer.

* customerid: Primary key used to merge to other data
* systemloanid: The id associated with the particular loan. The same customerid can have multiple systemloanid values, one for each loan the customer has taken out
* loannumber: The number of the loan that you have to predict
* approveddate: Date that loan was approved
* creationdate: Date that loan application was created
* loanamount: Loan value taken
* totaldue: Total repayment required to settle the loan (the capital loan value disbursed plus interest and fees)
* termdays: Term of loan
* closeddate: Date that the loan was settled
* referredby: customerid of the customer that referred this person; if missing, the customer was not referred
* firstduedate: Date of the first payment due in cases where the term is longer than 30 days. Where the term is 60+ days there are multiple monthly payments due, and this date reflects the first of them
* firstrepaiddate: Actual date on which the customer made the first payment as defined above
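
Because each customer can have several previous loans, this table has to be reduced to one row per `customerid` before it can be joined to the performance data. The sketch below shows one possible aggregation on a tiny hypothetical frame; `prev`, the feature names, and the "late first payment" definition (repaid strictly after the due date) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for train_prev; values are illustrative only.
prev = pd.DataFrame({
    "customerid": ["a", "a", "b"],
    "loanamount": [10000.0, 20000.0, 10000.0],
    "firstduedate": ["2017-01-10", "2017-03-10", "2017-02-01"],
    "firstrepaiddate": ["2017-01-09", "2017-03-15", "2017-02-01"],
})

# Date columns are read from CSV as strings, so parse them first.
for col in ["firstduedate", "firstrepaiddate"]:
    prev[col] = pd.to_datetime(prev[col])

# Assumed definition: the first payment is late if made after its due date.
prev["first_payment_late"] = prev["firstrepaiddate"] > prev["firstduedate"]

# One row per customer, summarising their loan history.
prev_features = prev.groupby("customerid").agg(
    n_prev_loans=("loanamount", "size"),
    mean_prev_amount=("loanamount", "mean"),
    share_late_first_payment=("first_payment_late", "mean"),
).reset_index()
print(prev_features)
```

The resulting frame has a unique `customerid` per row and could then be merged onto the performance data in the same way as the demographics.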

In [3]:
#train_dem.info()
train_dem.head()

Unnamed: 0,customerid,birthdate,bank_account_type,longitude_gps,latitude_gps,bank_name_clients,bank_branch_clients,employment_status_clients,level_of_education_clients
0,8a858e135cb22031015cbafc76964ebd,1973-10-10 00:00:00.000000,Savings,3.319219,6.528604,GT Bank,,,
1,8a858e275c7ea5ec015c82482d7c3996,1986-01-21 00:00:00.000000,Savings,3.325598,7.119403,Sterling Bank,,Permanent,
2,8a858e5b5bd99460015bdc95cd485634,1987-04-01 00:00:00.000000,Savings,5.7461,5.563174,Fidelity Bank,,,
3,8a858efd5ca70688015cabd1f1e94b55,1991-07-19 00:00:00.000000,Savings,3.36285,6.642485,GT Bank,,Permanent,
4,8a858e785acd3412015acd48f4920d04,1982-11-22 00:00:00.000000,Savings,8.455332,11.97141,GT Bank,,Permanent,


In [4]:
#train_perf.info()
train_perf.head()

Unnamed: 0,customerid,systemloanid,loannumber,approveddate,creationdate,loanamount,totaldue,termdays,referredby,good_bad_flag
0,8a2a81a74ce8c05d014cfb32a0da1049,301994762,12,2017-07-25 08:22:56.000000,2017-07-25 07:22:47.000000,30000.0,34500.0,30,,Good
1,8a85886e54beabf90154c0a29ae757c0,301965204,2,2017-07-05 17:04:41.000000,2017-07-05 16:04:18.000000,15000.0,17250.0,30,,Good
2,8a8588f35438fe12015444567666018e,301966580,7,2017-07-06 14:52:57.000000,2017-07-06 13:52:51.000000,20000.0,22250.0,15,,Good
3,8a85890754145ace015429211b513e16,301999343,3,2017-07-27 19:00:41.000000,2017-07-27 18:00:35.000000,10000.0,11500.0,15,,Good
4,8a858970548359cc0154883481981866,301962360,9,2017-07-03 23:42:45.000000,2017-07-03 22:42:39.000000,40000.0,44000.0,30,,Good


In [5]:
train_prev.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18183 entries, 0 to 18182
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   customerid       18183 non-null  object 
 1   systemloanid     18183 non-null  int64  
 2   loannumber       18183 non-null  int64  
 3   approveddate     18183 non-null  object 
 4   creationdate     18183 non-null  object 
 5   loanamount       18183 non-null  float64
 6   totaldue         18183 non-null  float64
 7   termdays         18183 non-null  int64  
 8   closeddate       18183 non-null  object 
 9   referredby       1026 non-null   object 
 10  firstduedate     18183 non-null  object 
 11  firstrepaiddate  18183 non-null  object 
dtypes: float64(2), int64(3), object(7)
memory usage: 1.7+ MB


In [8]:
train_perf.describe(include='all')

Unnamed: 0,customerid,systemloanid,loannumber,approveddate,creationdate,loanamount,totaldue,termdays,referredby,good_bad_flag
count,4368,4368.0,4368.0,4368,4368,4368.0,4368.0,4368.0,587,4368
unique,4368,,,4362,4364,,,,521,2
top,8a858966538deb1901539d3459133c24,,,2017-07-25 10:05:30.000000,2017-07-05 13:48:26.000000,,,,8a858fc55b2548dd015b286e452c678c,Good
freq,1,,,2,2,,,,8,3416
mean,,301981000.0,5.17239,,,17809.065934,21257.377679,29.261676,,
std,,13431.15,3.653569,,,10749.694571,11943.510416,11.512519,,
min,,301958500.0,2.0,,,10000.0,10000.0,15.0,,
25%,,301969100.0,2.0,,,10000.0,13000.0,30.0,,
50%,,301980100.0,4.0,,,10000.0,13000.0,30.0,,
75%,,301993500.0,7.0,,,20000.0,24500.0,30.0,,


In [9]:
train_dem.describe(include='all')

Unnamed: 0,customerid,birthdate,bank_account_type,longitude_gps,latitude_gps,bank_name_clients,bank_branch_clients,employment_status_clients,level_of_education_clients
count,4346,4346,4346,4346.0,4346.0,4346,51,3698,587
unique,4334,3297,3,,,18,45,6,4
top,8a858fe05d421ff4015d4c87d2a21ceb,1984-06-28 00:00:00.000000,Savings,,,GT Bank,OGBA,Permanent,Graduate
freq,2,5,3425,,,1598,3,3146,420
mean,,,,4.626189,7.251356,,,,
std,,,,7.184832,3.055052,,,,
min,,,,-118.247009,-33.868818,,,,
25%,,,,3.354953,6.47061,,,,
50%,,,,3.593302,6.621888,,,,
75%,,,,6.54522,7.425052,,,,
