**SOME HYPOTHESES**

**AGE:-** Are younger clients more likely to be defaulters?

**OCCUPATION:-** Are people working in lower-level positions more likely to be defaulters?

**CITY TYPE:-** Are people who live in rural areas (e.g. farmers) more likely to be defaulters?

**INCOME:-** Are people in lower income brackets more likely to be defaulters?

**DEPENDENTS:-** Are people with dependents more likely to be defaulters?

**FAMILY SIZE:-** Are people with a family size of more than 5 more likely to be defaulters?

**LAST PAYMENTS:-** Are people with more overdue past payments more likely to be defaulters?

**BRANCH CODE:-** Are banks with very few accounts likely to have fewer defaulters?

**INTEREST RATE:-** Are people paying a high interest rate more likely to be defaulters?

**LOAN TYPE:-** Are clients with student loans more likely to be defaulters?

## Loan Default Prediction

SuperLender is a local digital lending company that prides itself on its effective use of credit risk models to deliver profitable and high-impact loan alternatives. Its assessment approach is based on two main risk drivers of loan default:
- 1) willingness to pay and
- 2) ability to pay.

Since not all customers pay back, the company invests in experienced data scientists to build robust models that effectively predict the odds of repayment.

These two fundamental drivers need to be determined at the point of each application to allow the credit grantor to make a calculated decision based on repayment odds, which in turn determines if an applicant should get a loan, and if so - what the size, price and tenure of the offer will be.

There are two types of risk models in general. The first is a new business risk model, which is used to assess the risk of an application for the first loan that a customer applies for. The second is a repeat or behaviour risk model, used when an existing client applies for a repeat loan. In the latter case we have additional information on how the customer repaid prior loans, which we can incorporate into the risk model.

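It is your job to predict whether a loan was good or bad, i.e. to accurately predict a binary outcome variable. The task statement encodes Good as 1 and Bad as 0; the code in this notebook uses the opposite encoding (Good = 0, Bad = 1).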

### 1.1 Problem

This competition uses historical financial and socioeconomic data to predict whether or not an applicant will be able to repay a loan. This is a standard supervised classification task:

* __Supervised__: The labels are included in the training data and the goal is to train a model to learn to predict the labels from the features
* __Classification__: The label is a binary variable, 0 (will repay loan on time), 1 (will have difficulty repaying loan)

### 1.2 Description of the data

There are 3 different datasets for both train and test

#### 1.2.1 Demographic data (traindemographics.csv)
- customerid (Primary key used to merge to other data)
- birthdate (date of birth of the customer)
- bank_account_type (type of primary bank account)
- longitude_gps
- latitude_gps
- bank_name_clients (name of the bank)
- bank_branch_clients (location of the branch - not compulsory - so missing in a lot of the cases)
- employment_status_clients (type of employment that customer has)
- level_of_education_clients (highest level of education)

#### 1.2.2 Performance data (trainperf.csv) : 
This is the repeat loan taken by the customer whose performance we need to predict. In other words, we need to predict whether this loan will default, given the customer's previous loans and demographics.
- customerid (Primary key used to merge to other data)
- systemloanid (The id associated with the particular loan. The same customerId can have multiple systemloanid’s for each loan he/she has taken out)
- loannumber (The number of the loan that you have to predict)
- approveddate (Date that loan was approved)
- creationdate (Date that loan application was created)
- loanamount (Loan value taken)
- totaldue (Total repayment required to settle the loan - this is the capital loan value disbursed +interest and fees)
- termdays (Term of loan)
- referredby (customerId of the customer that referred this person; if missing, the customer was not referred)
- good_bad_flag (good = settled loan on time; bad = did not settle loan on time) - this is the target variable that we need to predict

#### 1.2.3 Previous loans data (trainprevloans.csv) : 
This dataset contains all previous loans that the customer had prior to the loan above that we want to predict the performance of. Each loan will have a different systemloanid, but the same customerid for each customer.

- customerid (Primary key used to merge to other data)
- systemloanid (The id associated with the particular loan. The same customerId can have multiple systemloanid’s for each loan he/she has taken out)
- loannumber (The number of the loan that you have to predict)
- approveddate (Date that loan was approved)
- creationdate (Date that loan application was created)
- loanamount (Loan value taken)
- totaldue (Total repayment required to settle the loan - this is the capital loan value disbursed + interest and fees)
- termdays (Term of loan)
- closeddate (Date that the loan was settled)
- referredby (customerId of the customer that referred this person; if missing, the customer was not referred)
- firstduedate (Date of first payment due in cases where the term is longer than 30 days. So in the case where the term is 60+ days - then there are multiple monthly payments due - and this dates reflects the date of the first payment)
- firstrepaiddate (Actual date that he/she paid the first payment as defined above)
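
The `firstduedate` and `firstrepaiddate` columns relate directly to the LAST PAYMENTS hypothesis. As a minimal sketch, the delay of the first repayment could be derived as shown here. The column name `first_payment_days_late`, the per-customer summary and the toy rows are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy rows shaped like trainprevloans.csv (values are made up for illustration)
prev = pd.DataFrame({
    "customerid": ["a", "a", "b"],
    "firstduedate": ["2017-04-04", "2017-05-30", "2017-07-03"],
    "firstrepaiddate": ["2017-04-26", "2017-05-26", "2017-07-14"],
})
for col in ["firstduedate", "firstrepaiddate"]:
    prev[col] = pd.to_datetime(prev[col])

# Positive values mean the first repayment came after the due date
prev["first_payment_days_late"] = (
    prev["firstrepaiddate"] - prev["firstduedate"]
).dt.days

# Hypothetical per-customer summary: number of late first payments
late_counts = (prev["first_payment_days_late"] > 0).groupby(prev["customerid"]).sum()
print(late_counts)
```

A summary like `late_counts` has one row per customer, so it could be joined to the performance data without multiplying rows.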

**IMPORTING NECESSARY LIBRARIES**

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
import datetime

**LOAD ALL THE THREE DATASETS**

In [2]:
df1=pd.read_csv("demographics.csv")
df2=pd.read_csv("perf.csv")
df3=pd.read_csv("trainprevloans.csv")

**ABOUT DEMOGRAPHICS DATASET i.e df1**

In [3]:
df1.head()

Unnamed: 0,customerid,birthdate,bank_account_type,longitude_gps,latitude_gps,bank_name_clients,bank_branch_clients,employment_status_clients,level_of_education_clients
0,8a858e135cb22031015cbafc76964ebd,1973-10-10 00:00:00.000000,Savings,3.319219,6.528604,GT Bank,,,
1,8a858e275c7ea5ec015c82482d7c3996,1986-01-21 00:00:00.000000,Savings,3.325598,7.119403,Sterling Bank,,Permanent,
2,8a858e5b5bd99460015bdc95cd485634,1987-04-01 00:00:00.000000,Savings,5.7461,5.563174,Fidelity Bank,,,
3,8a858efd5ca70688015cabd1f1e94b55,1991-07-19 00:00:00.000000,Savings,3.36285,6.642485,GT Bank,,Permanent,
4,8a858e785acd3412015acd48f4920d04,1982-11-22 00:00:00.000000,Savings,8.455332,11.97141,GT Bank,,Permanent,


In [4]:
df1.shape

(4346, 9)

In [5]:
df1.customerid.value_counts()

8a858fca5c35df2c015c39ad8695343e    2
8a858f965bb63a25015bbf63fd062e2e    2
8a858fe65675195a015679452588279c    2
8a858edd57f790040157ffe9b6ed3fbb    2
8a858ec65cc6352b015cc64525ea0763    2
                                   ..
8a858e675c3fe0a1015c49af1dd366b0    1
8a858e6f5cd5e874015cde13d1225f01    1
8a858f3159f585d30159f9b058fc1e2f    1
8a858f1358d4ac270158d995e64634b8    1
8a858ec95afc0922015afcf961164042    1
Name: customerid, Length: 4334, dtype: int64

In [6]:
df1[df1['customerid']=='8a858f9f5679951a01567a5b90644817']

Unnamed: 0,customerid,birthdate,bank_account_type,longitude_gps,latitude_gps,bank_name_clients,bank_branch_clients,employment_status_clients,level_of_education_clients
3546,8a858f9f5679951a01567a5b90644817,1984-12-17 00:00:00.000000,Savings,4.196662,12.429509,Access Bank,,Permanent,
4286,8a858f9f5679951a01567a5b90644817,1984-12-17 00:00:00.000000,Savings,4.196662,12.429509,Access Bank,,Permanent,


**NOTE:** The output above shows that the dataset contains duplicate rows, which need to be removed.

In [7]:
df1=df1.drop_duplicates()    # REMOVING DUPLICATE VALUES

In [8]:
df1.shape

(4334, 9)
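
The duplicate check above can also be done without picking a customer id by hand. The following is a small sketch on toy data (the frame `demo` and its values are assumptions for illustration); it lists every copy of a duplicated row and then confirms that each `customerid` is unique after de-duplication.

```python
import pandas as pd

# Toy stand-in for the demographics table
demo = pd.DataFrame({
    "customerid": ["c1", "c2", "c2", "c3"],
    "bank_name_clients": ["GT Bank", "Access Bank", "Access Bank", "UBA"],
})

# keep=False marks every copy of a duplicated row, not only the later ones
dupes = demo[demo.duplicated(keep=False)]
print(dupes)

# drop_duplicates keeps the first copy of each fully identical row
deduped = demo.drop_duplicates().reset_index(drop=True)

# After de-duplication each customerid should appear exactly once
print(deduped["customerid"].is_unique)
```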

In [9]:
df1.dtypes

customerid                     object
birthdate                      object
bank_account_type              object
longitude_gps                 float64
latitude_gps                  float64
bank_name_clients              object
bank_branch_clients            object
employment_status_clients      object
level_of_education_clients     object
dtype: object

In [10]:
df1.isnull().sum()/df1.shape[0]

customerid                    0.000000
birthdate                     0.000000
bank_account_type             0.000000
longitude_gps                 0.000000
latitude_gps                  0.000000
bank_name_clients             0.000000
bank_branch_clients           0.988233
employment_status_clients     0.149515
level_of_education_clients    0.864790
dtype: float64

In [11]:
df1.drop(['bank_branch_clients','level_of_education_clients'],axis=1,inplace=True)
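
The two columns dropped above are the ones with the highest share of missing values. A hedged sketch of doing the same selection by rule rather than by name is given here; the 0.8 cut-off mirrors the "greater than 80%" comment used for the performance data, and the toy frame is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the demographics table
demo = pd.DataFrame({
    "customerid": ["c1", "c2", "c3", "c4", "c5", "c6"],
    "bank_branch_clients": [np.nan, np.nan, np.nan, np.nan, np.nan, "Lagos"],
    "employment_status_clients": ["Permanent", np.nan, "Student",
                                  "Permanent", "Permanent", "Permanent"],
})

threshold = 0.8  # assumed cut-off: drop columns with more than 80% missing

# isnull().mean() equals isnull().sum() / number of rows
missing_share = demo.isnull().mean()
cols_to_drop = missing_share[missing_share > threshold].index.tolist()

demo = demo.drop(columns=cols_to_drop)
print(cols_to_drop)
```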

In [12]:
df1.head()

Unnamed: 0,customerid,birthdate,bank_account_type,longitude_gps,latitude_gps,bank_name_clients,employment_status_clients
0,8a858e135cb22031015cbafc76964ebd,1973-10-10 00:00:00.000000,Savings,3.319219,6.528604,GT Bank,
1,8a858e275c7ea5ec015c82482d7c3996,1986-01-21 00:00:00.000000,Savings,3.325598,7.119403,Sterling Bank,Permanent
2,8a858e5b5bd99460015bdc95cd485634,1987-04-01 00:00:00.000000,Savings,5.7461,5.563174,Fidelity Bank,
3,8a858efd5ca70688015cabd1f1e94b55,1991-07-19 00:00:00.000000,Savings,3.36285,6.642485,GT Bank,Permanent
4,8a858e785acd3412015acd48f4920d04,1982-11-22 00:00:00.000000,Savings,8.455332,11.97141,GT Bank,Permanent


In [13]:
df1.employment_status_clients.mode()

0    Permanent
dtype: object

In [14]:
df1['employment_status_clients'] = df1['employment_status_clients'].fillna('Permanent')  # fill the null values with the mode
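
Instead of typing the mode by hand, it can be read from the column itself. This is a minimal sketch on a toy Series (the values are made up); `mode()` returns a Series because ties are possible, so the first entry is taken.

```python
import numpy as np
import pandas as pd

status = pd.Series(
    ["Permanent", np.nan, "Student", "Permanent", np.nan],
    name="employment_status_clients",
)

# mode() ignores NaN by default and returns all most-frequent values
most_common = status.mode().iloc[0]

# Assign the result back instead of relying on inplace=True
filled = status.fillna(most_common)
print(filled.tolist())
```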

In [15]:
df1['birthdate']=pd.to_datetime(df1.birthdate)

In [16]:
df1.head()

Unnamed: 0,customerid,birthdate,bank_account_type,longitude_gps,latitude_gps,bank_name_clients,employment_status_clients
0,8a858e135cb22031015cbafc76964ebd,1973-10-10,Savings,3.319219,6.528604,GT Bank,Permanent
1,8a858e275c7ea5ec015c82482d7c3996,1986-01-21,Savings,3.325598,7.119403,Sterling Bank,Permanent
2,8a858e5b5bd99460015bdc95cd485634,1987-04-01,Savings,5.7461,5.563174,Fidelity Bank,Permanent
3,8a858efd5ca70688015cabd1f1e94b55,1991-07-19,Savings,3.36285,6.642485,GT Bank,Permanent
4,8a858e785acd3412015acd48f4920d04,1982-11-22,Savings,8.455332,11.97141,GT Bank,Permanent
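
With `birthdate` parsed as a datetime, the AGE hypothesis can be examined through an age column. The sketch below is one possible way to compute age in completed years; the reference date and the `age` column name are assumptions chosen only for illustration.

```python
import pandas as pd

demo = pd.DataFrame(
    {"birthdate": pd.to_datetime(["1973-10-10", "1986-01-21", "1991-07-19"])}
)

# Assumed reference date (in practice this could be the loan approval date)
reference = pd.Timestamp("2017-07-31")

# True when the birthday has already occurred in the reference year
had_birthday = (demo["birthdate"].dt.month < reference.month) | (
    (demo["birthdate"].dt.month == reference.month)
    & (demo["birthdate"].dt.day <= reference.day)
)

# Completed years: subtract one when the birthday is still to come
demo["age"] = reference.year - demo["birthdate"].dt.year - (~had_birthday).astype(int)
print(demo)
```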


*******************************************************************************************************************************
*******************************************************************************************************************************

**ABOUT PERFORMANCE DATASET i.e df2**

In [17]:
df2.head()

Unnamed: 0,customerid,systemloanid,loannumber,approveddate,creationdate,loanamount,totaldue,termdays,referredby,good_bad_flag
0,8a2a81a74ce8c05d014cfb32a0da1049,301994762,12,2017-07-25 08:22:56.000000,2017-07-25 07:22:47.000000,30000.0,34500.0,30,,Good
1,8a85886e54beabf90154c0a29ae757c0,301965204,2,2017-07-05 17:04:41.000000,2017-07-05 16:04:18.000000,15000.0,17250.0,30,,Good
2,8a8588f35438fe12015444567666018e,301966580,7,2017-07-06 14:52:57.000000,2017-07-06 13:52:51.000000,20000.0,22250.0,15,,Good
3,8a85890754145ace015429211b513e16,301999343,3,2017-07-27 19:00:41.000000,2017-07-27 18:00:35.000000,10000.0,11500.0,15,,Good
4,8a858970548359cc0154883481981866,301962360,9,2017-07-03 23:42:45.000000,2017-07-03 22:42:39.000000,40000.0,44000.0,30,,Good


In [18]:
df2.shape

(4368, 10)

In [19]:
df2=df2.drop_duplicates()

In [20]:
df2.shape             #No duplicates present 

(4368, 10)

In [21]:
df2.isnull().sum()/df2.shape[0]

customerid       0.000000
systemloanid     0.000000
loannumber       0.000000
approveddate     0.000000
creationdate     0.000000
loanamount       0.000000
totaldue         0.000000
termdays         0.000000
referredby       0.865614
good_bad_flag    0.000000
dtype: float64

In [22]:
df2.drop(['referredby'],inplace=True,axis=1) #dropping column with missing value greater than 80%

In [23]:
df2.dtypes

customerid        object
systemloanid       int64
loannumber         int64
approveddate      object
creationdate      object
loanamount       float64
totaldue         float64
termdays           int64
good_bad_flag     object
dtype: object

In [24]:
df2['approveddate']=pd.to_datetime(df2.approveddate)
df2['creationdate']=pd.to_datetime(df2.creationdate)

In [25]:
df2.head()

Unnamed: 0,customerid,systemloanid,loannumber,approveddate,creationdate,loanamount,totaldue,termdays,good_bad_flag
0,8a2a81a74ce8c05d014cfb32a0da1049,301994762,12,2017-07-25 08:22:56,2017-07-25 07:22:47,30000.0,34500.0,30,Good
1,8a85886e54beabf90154c0a29ae757c0,301965204,2,2017-07-05 17:04:41,2017-07-05 16:04:18,15000.0,17250.0,30,Good
2,8a8588f35438fe12015444567666018e,301966580,7,2017-07-06 14:52:57,2017-07-06 13:52:51,20000.0,22250.0,15,Good
3,8a85890754145ace015429211b513e16,301999343,3,2017-07-27 19:00:41,2017-07-27 18:00:35,10000.0,11500.0,15,Good
4,8a858970548359cc0154883481981866,301962360,9,2017-07-03 23:42:45,2017-07-03 22:42:39,40000.0,44000.0,30,Good


*******************************************************************************************************************************
*******************************************************************************************************************************

**ABOUT PREVIOUS LOAN DATASET i.e df3**

In [26]:
df3.head()

Unnamed: 0,customerid,systemloanid,loannumber,approveddate,creationdate,loanamount,totaldue,termdays,closeddate,referredby,firstduedate,firstrepaiddate
0,8a2a81a74ce8c05d014cfb32a0da1049,301682320,2,2016-08-15 18:22:40.000000,2016-08-15 17:22:32.000000,10000.0,13000.0,30,2016-09-01 16:06:48.000000,,2016-09-14 00:00:00.000000,2016-09-01 15:51:43.000000
1,8a2a81a74ce8c05d014cfb32a0da1049,301883808,9,2017-04-28 18:39:07.000000,2017-04-28 17:38:53.000000,10000.0,13000.0,30,2017-05-28 14:44:49.000000,,2017-05-30 00:00:00.000000,2017-05-26 00:00:00.000000
2,8a2a81a74ce8c05d014cfb32a0da1049,301831714,8,2017-03-05 10:56:25.000000,2017-03-05 09:56:19.000000,20000.0,23800.0,30,2017-04-26 22:18:56.000000,,2017-04-04 00:00:00.000000,2017-04-26 22:03:47.000000
3,8a8588f35438fe12015444567666018e,301861541,5,2017-04-09 18:25:55.000000,2017-04-09 17:25:42.000000,10000.0,11500.0,15,2017-04-24 01:35:52.000000,,2017-04-24 00:00:00.000000,2017-04-24 00:48:43.000000
4,8a85890754145ace015429211b513e16,301941754,2,2017-06-17 09:29:57.000000,2017-06-17 08:29:50.000000,10000.0,11500.0,15,2017-07-14 21:18:43.000000,,2017-07-03 00:00:00.000000,2017-07-14 21:08:35.000000


In [27]:
df3=df3.drop_duplicates()

In [28]:
df3.shape        # No duplicates present

(18183, 12)

In [29]:
df3.isnull().sum()/df3.shape[0]

customerid         0.000000
systemloanid       0.000000
loannumber         0.000000
approveddate       0.000000
creationdate       0.000000
loanamount         0.000000
totaldue           0.000000
termdays           0.000000
closeddate         0.000000
referredby         0.943574
firstduedate       0.000000
firstrepaiddate    0.000000
dtype: float64

In [30]:
df3.drop(['referredby'],axis=1,inplace=True)

In [31]:
df3['approveddate']=pd.to_datetime(df3.approveddate)
df3['creationdate']=pd.to_datetime(df3.creationdate)
df3['closeddate']=pd.to_datetime(df3.closeddate)
df3['firstduedate']=pd.to_datetime(df3.firstduedate)
df3['firstrepaiddate']=pd.to_datetime(df3.firstrepaiddate)

In [32]:
df3.dtypes

customerid                 object
systemloanid                int64
loannumber                  int64
approveddate       datetime64[ns]
creationdate       datetime64[ns]
loanamount                float64
totaldue                  float64
termdays                    int64
closeddate         datetime64[ns]
firstduedate       datetime64[ns]
firstrepaiddate    datetime64[ns]
dtype: object

In [33]:
df3.head()

Unnamed: 0,customerid,systemloanid,loannumber,approveddate,creationdate,loanamount,totaldue,termdays,closeddate,firstduedate,firstrepaiddate
0,8a2a81a74ce8c05d014cfb32a0da1049,301682320,2,2016-08-15 18:22:40,2016-08-15 17:22:32,10000.0,13000.0,30,2016-09-01 16:06:48,2016-09-14,2016-09-01 15:51:43
1,8a2a81a74ce8c05d014cfb32a0da1049,301883808,9,2017-04-28 18:39:07,2017-04-28 17:38:53,10000.0,13000.0,30,2017-05-28 14:44:49,2017-05-30,2017-05-26 00:00:00
2,8a2a81a74ce8c05d014cfb32a0da1049,301831714,8,2017-03-05 10:56:25,2017-03-05 09:56:19,20000.0,23800.0,30,2017-04-26 22:18:56,2017-04-04,2017-04-26 22:03:47
3,8a8588f35438fe12015444567666018e,301861541,5,2017-04-09 18:25:55,2017-04-09 17:25:42,10000.0,11500.0,15,2017-04-24 01:35:52,2017-04-24,2017-04-24 00:48:43
4,8a85890754145ace015429211b513e16,301941754,2,2017-06-17 09:29:57,2017-06-17 08:29:50,10000.0,11500.0,15,2017-07-14 21:18:43,2017-07-03,2017-07-14 21:08:35


**SHAPE OF DATASETS**

In [34]:
df1.shape,df2.shape,df3.shape

((4334, 7), (4368, 9), (18183, 11))

**Merging datasets**

In [35]:
customers=pd.merge(df2,df1,on='customerid',how='left')  #performance data with demographics

In [36]:
Loan_data=pd.merge(df3,customers,on='customerid',how='left')
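
Because the previous-loans table and the performance table share column names, pandas appends `_x` and `_y` to the overlapping columns after the merge. A small sketch of the same left join with explicit suffixes and a key check follows; the suffix names and toy rows are assumptions, and `validate="many_to_one"` makes pandas raise if a `customerid` appears more than once on the right-hand side.

```python
import pandas as pd

# Toy stand-ins: one current loan per customer, several previous loans
perf = pd.DataFrame({
    "customerid": ["a", "b"],
    "loanamount": [30000.0, 20000.0],
    "good_bad_flag": ["Good", "Bad"],
})
prev = pd.DataFrame({
    "customerid": ["a", "a", "b"],
    "loanamount": [10000.0, 20000.0, 10000.0],
})

merged = pd.merge(
    prev, perf,
    on="customerid", how="left",
    suffixes=("_prev", "_current"),   # clearer than the default _x / _y
    validate="many_to_one",           # each customer must be unique in perf
)
print(merged)
```

Note that every previous loan of a customer receives the same `good_bad_flag`, since the flag belongs to the current loan.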

SHAPE OF THE DATASET

In [37]:
Loan_data.shape

(18183, 25)

Proportion of null values

In [38]:
Loan_data.isnull().sum()/Loan_data.shape[0]

customerid                   0.000000
systemloanid_x               0.000000
loannumber_x                 0.000000
approveddate_x               0.000000
creationdate_x               0.000000
loanamount_x                 0.000000
totaldue_x                   0.000000
termdays_x                   0.000000
closeddate                   0.000000
firstduedate                 0.000000
firstrepaiddate              0.000000
systemloanid_y               0.000000
loannumber_y                 0.000000
approveddate_y               0.000000
creationdate_y               0.000000
loanamount_y                 0.000000
totaldue_y                   0.000000
termdays_y                   0.000000
good_bad_flag                0.000000
birthdate                    0.248034
bank_account_type            0.248034
longitude_gps                0.248034
latitude_gps                 0.248034
bank_name_clients            0.248034
employment_status_clients    0.248034
dtype: float64
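
All demographic columns show the same missing share, which is what a left join produces when some customers have no matching demographics row. A hedged sketch of measuring this directly with the `indicator` option of `pd.merge` is shown here, using toy frames as assumptions.

```python
import pandas as pd

perf = pd.DataFrame({"customerid": ["a", "b", "c", "d"]})
demo = pd.DataFrame({
    "customerid": ["a", "b", "c"],
    "bank_name_clients": ["GT Bank", "UBA", "FCMB"],
})

# indicator=True adds a _merge column: 'both' or 'left_only' for a left join
checked = pd.merge(perf, demo, on="customerid", how="left", indicator=True)

# Share of rows that found no demographics record
share_unmatched = (checked["_merge"] == "left_only").mean()
print(share_unmatched)
```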

# FEATURE ENGINEERING

In [39]:
Loan_data['bank_name_clients'].mode(),Loan_data['bank_account_type'].mode(),Loan_data['employment_status_clients'].mode()

(0    GT Bank
 dtype: object,
 0    Savings
 dtype: object,
 0    Permanent
 dtype: object)

In [40]:
Loan_data['bank_name_clients'] = Loan_data['bank_name_clients'].fillna('GT Bank')

In [41]:
Loan_data['bank_account_type'] = Loan_data['bank_account_type'].fillna('Savings')

In [42]:
Loan_data['employment_status_clients'] = Loan_data['employment_status_clients'].fillna('Permanent')

In [43]:
Loan_data["longitude_gps"]=Loan_data["longitude_gps"].interpolate(method='linear')

In [44]:
Loan_data["latitude_gps"]=Loan_data["latitude_gps"].interpolate(method='linear')
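
Linear interpolation fills a gap from the neighbouring rows, so for GPS coordinates the filled value depends on which customers happen to sit above and below the missing row. The toy sketch below shows that behaviour next to a column-median fill; the median alternative is an assumption offered for comparison, not the notebook's method.

```python
import numpy as np
import pandas as pd

lon = pd.Series([2.0, np.nan, 4.0, 10.0], name="longitude_gps")

# Linear interpolation: the gap becomes the midpoint of its two neighbours
interpolated = lon.interpolate(method="linear")

# Alternative (assumption): fill with the column median, independent of row order
median_filled = lon.fillna(lon.median())

print(interpolated.tolist())
print(median_filled.tolist())
```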

In [45]:
Loan_data.isnull().sum()

customerid                      0
systemloanid_x                  0
loannumber_x                    0
approveddate_x                  0
creationdate_x                  0
loanamount_x                    0
totaldue_x                      0
termdays_x                      0
closeddate                      0
firstduedate                    0
firstrepaiddate                 0
systemloanid_y                  0
loannumber_y                    0
approveddate_y                  0
creationdate_y                  0
loanamount_y                    0
totaldue_y                      0
termdays_y                      0
good_bad_flag                   0
birthdate                    4510
bank_account_type               0
longitude_gps                   0
latitude_gps                    0
bank_name_clients               0
employment_status_clients       0
dtype: int64

In [46]:
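# Encode the target: Good -> 0, Bad -> 1
# .loc avoids chained assignment; note that the column keeps its object dtype
Loan_data.loc[Loan_data['good_bad_flag'] == 'Good', 'good_bad_flag'] = 0
Loan_data.loc[Loan_data['good_bad_flag'] == 'Bad', 'good_bad_flag'] = 1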

In [47]:
Loan_data.head()

Unnamed: 0,customerid,systemloanid_x,loannumber_x,approveddate_x,creationdate_x,loanamount_x,totaldue_x,termdays_x,closeddate,firstduedate,...,loanamount_y,totaldue_y,termdays_y,good_bad_flag,birthdate,bank_account_type,longitude_gps,latitude_gps,bank_name_clients,employment_status_clients
0,8a2a81a74ce8c05d014cfb32a0da1049,301682320,2,2016-08-15 18:22:40,2016-08-15 17:22:32,10000.0,13000.0,30,2016-09-01 16:06:48,2016-09-14,...,30000.0,34500.0,30,0,1972-01-15,Other,3.43201,6.433055,Diamond Bank,Permanent
1,8a2a81a74ce8c05d014cfb32a0da1049,301883808,9,2017-04-28 18:39:07,2017-04-28 17:38:53,10000.0,13000.0,30,2017-05-28 14:44:49,2017-05-30,...,30000.0,34500.0,30,0,1972-01-15,Other,3.43201,6.433055,Diamond Bank,Permanent
2,8a2a81a74ce8c05d014cfb32a0da1049,301831714,8,2017-03-05 10:56:25,2017-03-05 09:56:19,20000.0,23800.0,30,2017-04-26 22:18:56,2017-04-04,...,30000.0,34500.0,30,0,1972-01-15,Other,3.43201,6.433055,Diamond Bank,Permanent
3,8a8588f35438fe12015444567666018e,301861541,5,2017-04-09 18:25:55,2017-04-09 17:25:42,10000.0,11500.0,15,2017-04-24 01:35:52,2017-04-24,...,20000.0,22250.0,15,0,1984-09-18,Other,11.13935,10.292041,EcoBank,Permanent
4,8a85890754145ace015429211b513e16,301941754,2,2017-06-17 09:29:57,2017-06-17 08:29:50,10000.0,11500.0,15,2017-07-14 21:18:43,2017-07-03,...,10000.0,11500.0,15,0,1977-10-10,Savings,3.98577,7.491708,First Bank,Permanent


**LOGISTIC REGRESSION**

In [48]:
Loan_data.columns

Index(['customerid', 'systemloanid_x', 'loannumber_x', 'approveddate_x',
       'creationdate_x', 'loanamount_x', 'totaldue_x', 'termdays_x',
       'closeddate', 'firstduedate', 'firstrepaiddate', 'systemloanid_y',
       'loannumber_y', 'approveddate_y', 'creationdate_y', 'loanamount_y',
       'totaldue_y', 'termdays_y', 'good_bad_flag', 'birthdate',
       'bank_account_type', 'longitude_gps', 'latitude_gps',
       'bank_name_clients', 'employment_status_clients'],
      dtype='object')

In [49]:
Loan_data=Loan_data[['customerid', 'systemloanid_x', 'loannumber_x', 'approveddate_x',
       'creationdate_x', 'loanamount_x', 'totaldue_x', 'termdays_x',
       'closeddate', 'firstduedate', 'firstrepaiddate', 'systemloanid_y',
       'loannumber_y', 'approveddate_y', 'creationdate_y', 'loanamount_y',
       'totaldue_y', 'termdays_y', 'good_bad_flag', 'birthdate',
       'bank_account_type', 'longitude_gps', 'latitude_gps',
       'bank_name_clients', 'employment_status_clients']]

In [50]:
Loan_data=pd.get_dummies(Loan_data,columns=['bank_name_clients','employment_status_clients','bank_account_type'],drop_first=True)
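
With `drop_first=True`, `get_dummies` drops the first category (in sorted order) of each encoded column, and that category becomes the implicit baseline. A minimal sketch on a toy column follows; the category `Current` is an assumed value used only for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"bank_account_type": ["Savings", "Other", "Current", "Savings"]})

# Categories are sorted (Current, Other, Savings); the first one is dropped
encoded = pd.get_dummies(toy, columns=["bank_account_type"], drop_first=True)
print(encoded)

# A row with zeros in every remaining dummy column belongs to the dropped category
baseline_rows = encoded.astype(int).sum(axis=1) == 0
print(baseline_rows.tolist())
```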

In [51]:
Loan_data

Unnamed: 0,customerid,systemloanid_x,loannumber_x,approveddate_x,creationdate_x,loanamount_x,totaldue_x,termdays_x,closeddate,firstduedate,...,bank_name_clients_Unity Bank,bank_name_clients_Wema Bank,bank_name_clients_Zenith Bank,employment_status_clients_Permanent,employment_status_clients_Retired,employment_status_clients_Self-Employed,employment_status_clients_Student,employment_status_clients_Unemployed,bank_account_type_Other,bank_account_type_Savings
0,8a2a81a74ce8c05d014cfb32a0da1049,301682320,2,2016-08-15 18:22:40,2016-08-15 17:22:32,10000.0,13000.0,30,2016-09-01 16:06:48,2016-09-14,...,0,0,0,1,0,0,0,0,1,0
1,8a2a81a74ce8c05d014cfb32a0da1049,301883808,9,2017-04-28 18:39:07,2017-04-28 17:38:53,10000.0,13000.0,30,2017-05-28 14:44:49,2017-05-30,...,0,0,0,1,0,0,0,0,1,0
2,8a2a81a74ce8c05d014cfb32a0da1049,301831714,8,2017-03-05 10:56:25,2017-03-05 09:56:19,20000.0,23800.0,30,2017-04-26 22:18:56,2017-04-04,...,0,0,0,1,0,0,0,0,1,0
3,8a8588f35438fe12015444567666018e,301861541,5,2017-04-09 18:25:55,2017-04-09 17:25:42,10000.0,11500.0,15,2017-04-24 01:35:52,2017-04-24,...,0,0,0,1,0,0,0,0,1,0
4,8a85890754145ace015429211b513e16,301941754,2,2017-06-17 09:29:57,2017-06-17 08:29:50,10000.0,11500.0,15,2017-07-14 21:18:43,2017-07-03,...,0,0,0,1,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18178,8a858899538ddb8e0153a2b555421fc5,301611754,2,2016-04-16 13:36:34,2016-04-16 12:36:28,10000.0,13000.0,30,2016-05-14 00:04:52,2016-05-16,...,0,0,1,1,0,0,0,0,0,1
18179,8a858899538ddb8e0153a2b555421fc5,301761267,9,2016-11-18 14:26:07,2016-11-18 13:25:51,30000.0,34400.0,30,2016-12-13 16:08:57,2016-12-19,...,0,0,1,1,0,0,0,0,0,1
18180,8a858899538ddb8e0153a2b555421fc5,301631653,4,2016-06-12 15:30:56,2016-06-12 14:30:50,10000.0,13000.0,30,2016-07-09 15:39:00,2016-07-12,...,0,0,1,1,0,0,0,0,0,1
18181,8a858f0656b7820c0156c92ca3ba436f,301697691,1,2016-08-27 20:03:45,2016-08-27 19:03:34,10000.0,13000.0,30,2016-10-15 10:17:54,2016-09-26,...,0,0,0,1,0,0,0,0,0,1


In [52]:
Loan_data.columns

Index(['customerid', 'systemloanid_x', 'loannumber_x', 'approveddate_x',
       'creationdate_x', 'loanamount_x', 'totaldue_x', 'termdays_x',
       'closeddate', 'firstduedate', 'firstrepaiddate', 'systemloanid_y',
       'loannumber_y', 'approveddate_y', 'creationdate_y', 'loanamount_y',
       'totaldue_y', 'termdays_y', 'good_bad_flag', 'birthdate',
       'longitude_gps', 'latitude_gps', 'bank_name_clients_Diamond Bank',
       'bank_name_clients_EcoBank', 'bank_name_clients_FCMB',
       'bank_name_clients_Fidelity Bank', 'bank_name_clients_First Bank',
       'bank_name_clients_GT Bank', 'bank_name_clients_Heritage Bank',
       'bank_name_clients_Keystone Bank', 'bank_name_clients_Skye Bank',
       'bank_name_clients_Stanbic IBTC',
       'bank_name_clients_Standard Chartered',
       'bank_name_clients_Sterling Bank', 'bank_name_clients_UBA',
       'bank_name_clients_Union Bank', 'bank_name_clients_Unity Bank',
       'bank_name_clients_Wema Bank', 'bank_name_clients_Zenith 

In [54]:
x = Loan_data[['loannumber_x','loanamount_x','totaldue_x','termdays_x','systemloanid_y','loannumber_y','loanamount_y','totaldue_y','termdays_y','longitude_gps','latitude_gps','bank_account_type_Savings','bank_account_type_Other','bank_name_clients_Diamond Bank',
       'bank_name_clients_EcoBank', 'bank_name_clients_FCMB',
       'bank_name_clients_Fidelity Bank', 'bank_name_clients_First Bank',
       'bank_name_clients_GT Bank', 'bank_name_clients_Heritage Bank',
       'bank_name_clients_Keystone Bank', 'bank_name_clients_Skye Bank',
       'bank_name_clients_Stanbic IBTC',
       'bank_name_clients_Standard Chartered',
       'bank_name_clients_Sterling Bank', 'bank_name_clients_UBA',
       'bank_name_clients_Union Bank', 'bank_name_clients_Unity Bank',
       'bank_name_clients_Wema Bank', 'bank_name_clients_Zenith Bank',
       'employment_status_clients_Permanent',
       'employment_status_clients_Retired',
       'employment_status_clients_Self-Employed',
       'employment_status_clients_Student',
       'employment_status_clients_Unemployed']]
y = Loan_data['good_bad_flag']

In [55]:
# Importing the train test split function
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(x,y, random_state = 56)

In [58]:
# Keep the feature names so the scaled arrays can be turned back into DataFrames
cols = train_x.columns

In [56]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [59]:
train_x_scaled = scaler.fit_transform(train_x)
train_x_scaled = pd.DataFrame(train_x_scaled, columns=cols)
train_x_scaled.head()

Unnamed: 0,loannumber_x,loanamount_x,totaldue_x,termdays_x,systemloanid_y,loannumber_y,loanamount_y,totaldue_y,termdays_y,longitude_gps,...,bank_name_clients_UBA,bank_name_clients_Union Bank,bank_name_clients_Unity Bank,bank_name_clients_Wema Bank,bank_name_clients_Zenith Bank,employment_status_clients_Permanent,employment_status_clients_Retired,employment_status_clients_Self-Employed,employment_status_clients_Student,employment_status_clients_Unemployed
0,0.0,0.122807,0.124517,0.0,0.234083,0.0,0.0,0.051635,0.2,0.453025,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.122807,0.124517,0.0,0.54645,0.12,0.2,0.24957,0.2,0.451298,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.458333,0.473684,0.515855,0.6,0.649753,0.6,0.8,0.817556,1.0,0.460284,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,0.122807,0.147718,0.2,0.372018,0.04,0.0,0.051635,0.2,0.451434,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.041667,0.122807,0.147718,0.2,0.735696,0.24,0.0,0.019363,0.0,0.468354,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [60]:
test_x_scaled = scaler.transform(test_x)
test_x_scaled = pd.DataFrame(test_x_scaled, columns=cols)
test_x_scaled.head()

Unnamed: 0,loannumber_x,loanamount_x,totaldue_x,termdays_x,systemloanid_y,loannumber_y,loanamount_y,totaldue_y,termdays_y,longitude_gps,...,bank_name_clients_UBA,bank_name_clients_Union Bank,bank_name_clients_Unity Bank,bank_name_clients_Wema Bank,bank_name_clients_Zenith Bank,employment_status_clients_Permanent,employment_status_clients_Retired,employment_status_clients_Self-Employed,employment_status_clients_Student,employment_status_clients_Unemployed
0,0.041667,0.122807,0.147718,0.2,0.984615,0.04,0.1,0.144148,0.2,0.45132,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.166667,0.298246,0.285383,0.0,0.200483,0.32,0.4,0.421687,0.2,0.45128,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.375,0.649123,0.682908,0.6,0.586503,0.4,0.6,0.654045,0.6,0.472091,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.25,0.298246,0.325599,0.2,0.594843,0.28,0.4,0.499139,0.6,0.470611,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.0,0.122807,0.147718,0.2,0.386371,0.2,0.2,0.24957,0.2,0.471709,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
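
The scaler is fitted on the training split only and then reused on the test split. Written out with NumPy, min-max scaling with the default range of 0 to 1 amounts to the sketch below (the toy arrays are assumptions); it also shows why test values may fall outside that range.

```python
import numpy as np

# Toy feature matrices: columns could stand for loan amount and term in days
train = np.array([[10000.0, 15.0], [30000.0, 30.0], [20000.0, 60.0]])
test = np.array([[40000.0, 30.0]])

# "fit": learn per-column minimum and range from the training data only
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

# "transform": apply the same training statistics to both splits
train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range

print(train_scaled)
print(test_scaled)  # first value exceeds 1 because 40000 is above the training maximum
```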


In [61]:
#importing Logistic Regression and metric F1-score
from sklearn.linear_model import LogisticRegression as LogReg
from sklearn.metrics import f1_score
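
For reference, the F1 score is the harmonic mean of precision and recall for the positive class. The plain-Python sketch below spells out that calculation; the helper name `f1_from_labels` and the toy labels are assumptions, and in the notebook `f1_score` from scikit-learn would be used instead.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Compute F1 for one positive class from two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 1 = Bad (defaulted), 0 = Good, following the encoding used in this notebook
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]
score = f1_from_labels(y_true, y_pred)
print(score)
```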

In [65]:
# Creating instance of Logistic Regression
logreg=LogReg()

In [71]:
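# Assumed reconstruction of a truncated cell: cast the labels to integers and preview them
y = y.astype('int')
y.head()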

0    0
1    0
2    0
3    0
4    0
Name: good_bad_flag, dtype: int32

In [68]:
# Fitting the model
logreg.fit(train_x_scaled, train_y)   # fit on the scaled training features prepared above

ValueError: Unknown label type: 'unknown'
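
This error most likely comes from the target column: replacing the strings `Good` and `Bad` with 0 and 1 in place leaves `good_bad_flag` with `object` dtype, and scikit-learn cannot infer a label type from such a column. The sketch below, on toy data, shows the dtype problem and one way to get integer labels; the variable names are assumptions.

```python
import pandas as pd

raw = pd.Series(["Good", "Bad", "Good"], name="good_bad_flag")

# What in-place replacement leaves behind: Python ints stored in an object column
as_objects = pd.Series([0, 1, 0], dtype=object, name="good_bad_flag")
print(as_objects.dtype)  # object

# Mapping and casting produces a true integer column that classifiers accept
encoded = raw.map({"Good": 0, "Bad": 1}).astype(int)
print(encoded.dtype)
```

In the notebook, the cast has to be applied before `train_test_split` (or to `train_y` and `test_y` directly), because the split copies of the labels do not change when `y` is converted afterwards.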