# Delinquency Telecom Model

# Definition:
Delinquency is a condition that arises when an activity or situation does not occur at its scheduled (or expected) date i.e., it occurs later than expected.
# Use Case: 
Many donors, experts, and microfinance institutions (MFI) have become convinced that using mobile financial services (MFS) is more convenient and efficient, and less costly, than the traditional high-touch model for delivering microfinance services. MFS becomes especially useful when targeting the unbanked poor living in remote areas. The implementation of MFS, though, has been uneven with both significant challenges and successes.
Today, microfinance is widely accepted as a poverty-reduction tool, representing $70 billion in outstanding loans and a global outreach of 200 million clients.

One of our Client in Telecom collaborates with an MFI to provide micro-credit on mobile balances to be paid back in 5 days. The Consumer is believed to be delinquent if he deviates from the path of paying back the loaned amount within 5 days.
The sample data from our client database is hereby given to you for the exercise.



# Exercise:
Create a delinquency model which can predict in terms of a probability for each loan transaction, whether the customer will be paying back the loaned amount within 5 days of insurance of loan 
(Label ‘1’ & ’0’)
Find Enclosed the Data Description File and The Sample Data for the Modeling Exercise.


# Loading the Data

In [17]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sn
import warnings
warnings.filterwarnings('ignore')

# Data Description
The following is the description of each of the columns of the data provided. We will first remove the columns that won't be useful in training the model.

In [18]:
pd.options.display.max_colwidth = 200
pd.read_excel('Data_Description.xlsx', sheet_name='Description')[['Variable', 'Definition']]

Unnamed: 0,Variable,Definition
0,label,"Flag indicating whether the user paid back the credit amount within 5 days of issuing the loan{1:success, 0:failure}"
1,msisdn,mobile number of user
2,aon,age on cellular network in days
3,daily_decr30,"Daily amount spent from main account, averaged over last 30 days (in Indonesian Rupiah)"
4,daily_decr90,"Daily amount spent from main account, averaged over last 90 days (in Indonesian Rupiah)"
5,rental30,Average main account balance over last 30 days
6,rental90,Average main account balance over last 90 days
7,last_rech_date_ma,Number of days till last recharge of main account
8,last_rech_date_da,Number of days till last recharge of data account
9,last_rech_amt_ma,Amount of last recharge of main account (in Indonesian Rupiah)


In [19]:
df = pd.read_csv('sample_data_intw.csv')
df.head()
df.shape

Unnamed: 0.1,Unnamed: 0,label,msisdn,aon,daily_decr30,daily_decr90,rental30,rental90,last_rech_date_ma,last_rech_date_da,...,maxamnt_loans30,medianamnt_loans30,cnt_loans90,amnt_loans90,maxamnt_loans90,medianamnt_loans90,payback30,payback90,pcircle,pdate
0,1,0,21408I70789,272.0,3055.05,3065.15,220.13,260.13,2.0,0.0,...,6.0,0.0,2.0,12,6,0.0,29.0,29.0,UPW,2016-07-20
1,2,1,76462I70374,712.0,12122.0,12124.75,3691.26,3691.26,20.0,0.0,...,12.0,0.0,1.0,12,12,0.0,0.0,0.0,UPW,2016-08-10
2,3,1,17943I70372,535.0,1398.0,1398.0,900.13,900.13,3.0,0.0,...,6.0,0.0,1.0,6,6,0.0,0.0,0.0,UPW,2016-08-19
3,4,1,55773I70781,241.0,21.228,21.228,159.42,159.42,41.0,0.0,...,6.0,0.0,2.0,12,6,0.0,0.0,0.0,UPW,2016-06-06
4,5,1,03813I82730,947.0,150.619333,150.619333,1098.9,1098.9,4.0,0.0,...,6.0,0.0,7.0,42,6,0.0,2.333333,2.333333,UPW,2016-06-22


(209593, 37)

It is always a good idea to first have a look at the data. `df.describe()` summarizes valuable statistics about the data.

It will also help us decide which columns we can eliminate straightaway.

In [20]:
df.describe()

Unnamed: 0.1,Unnamed: 0,label,aon,daily_decr30,daily_decr90,rental30,rental90,last_rech_date_ma,last_rech_date_da,last_rech_amt_ma,...,cnt_loans30,amnt_loans30,maxamnt_loans30,medianamnt_loans30,cnt_loans90,amnt_loans90,maxamnt_loans90,medianamnt_loans90,payback30,payback90
count,209593.0,209593.0,209593.0,209593.0,209593.0,209593.0,209593.0,209593.0,209593.0,209593.0,...,209593.0,209593.0,209593.0,209593.0,209593.0,209593.0,209593.0,209593.0,209593.0,209593.0
mean,104797.0,0.875177,8112.343445,5381.402289,6082.515068,2692.58191,3483.406534,3755.8478,3712.202921,2064.452797,...,2.758981,17.952021,274.658747,0.054029,18.520919,23.645398,6.703134,0.046077,3.398826,4.321485
std,60504.431823,0.330519,75696.082531,9220.6234,10918.812767,4308.586781,5770.461279,53905.89223,53374.83343,2370.786034,...,2.554502,17.379741,4245.264648,0.218039,224.797423,26.469861,2.103864,0.200692,8.813729,10.308108
min,1.0,0.0,-48.0,-93.012667,-93.012667,-23737.14,-24720.58,-29.0,-29.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,52399.0,1.0,246.0,42.44,42.692,280.42,300.26,1.0,0.0,770.0,...,1.0,6.0,6.0,0.0,1.0,6.0,6.0,0.0,0.0,0.0
50%,104797.0,1.0,527.0,1469.175667,1500.0,1083.57,1334.0,3.0,0.0,1539.0,...,2.0,12.0,6.0,0.0,2.0,12.0,6.0,0.0,0.0,1.666667
75%,157195.0,1.0,982.0,7244.0,7802.79,3356.94,4201.79,7.0,0.0,2309.0,...,4.0,24.0,6.0,0.0,5.0,30.0,6.0,0.0,3.75,4.5
max,209593.0,1.0,999860.755168,265926.0,320630.0,198926.11,200148.11,998650.377733,999171.80941,55000.0,...,50.0,306.0,99864.560864,3.0,4997.517944,438.0,12.0,3.0,171.5,171.5


As we can see below, the data is highly skewed. The majority of data items in the dataset represent items belonging to the `positive` class, i.e., those who have paid back the loan in time.

In [21]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2,random_state=1)
print("training data is :", train.shape)
train['label'].value_counts()
print("Test data is:", test.shape)
test['label'].value_counts()

training data is : (167674, 37)


1    146692
0     20982
Name: label, dtype: int64

Test data is: (41919, 37)


1    36739
0     5180
Name: label, dtype: int64

# Preprocessing the data

First we will remove the columns that won't be fed to the model. All the entries need to be converted to numeric values first. Note that we have set `errors='coerce'` so that any non-numeric values are set to NaN. Then, check if there exist any NaN entries.

In [22]:
train.drop(['Unnamed: 0','msisdn','pcircle','pdate'],axis=1,inplace=True)
test.drop(['Unnamed: 0','msisdn','pcircle','pdate'],axis=1,inplace=True)

In [23]:
train.apply(lambda x: pd.to_numeric(x, errors='coerce'))
test.apply(lambda x: pd.to_numeric(x, errors='coerce'))

Unnamed: 0,label,aon,daily_decr30,daily_decr90,rental30,rental90,last_rech_date_ma,last_rech_date_da,last_rech_amt_ma,cnt_ma_rech30,...,cnt_loans30,amnt_loans30,maxamnt_loans30,medianamnt_loans30,cnt_loans90,amnt_loans90,maxamnt_loans90,medianamnt_loans90,payback30,payback90
69724,0,240.000000,8.000000,8.000000,1569.60,1569.60,7.0,0.0,770,1,...,1,6,6.0,0.0,1.000000,6,6,0.0,0.000000,0.000000
136917,1,742.000000,12.900000,12.900000,44.10,44.10,4.0,0.0,773,2,...,3,18,6.0,1.0,3.000000,18,6,1.0,7.000000,7.000000
172099,1,91.000000,19012.634667,19065.690000,14002.06,15832.84,3.0,0.0,3178,9,...,1,12,12.0,0.0,1.000000,12,12,0.0,0.000000,0.000000
88803,1,466.000000,8468.000000,8734.440000,800.52,1040.52,1.0,0.0,1539,3,...,4,24,6.0,0.0,8.000000,48,6,0.0,3.000000,4.125000
70250,1,1405.000000,17029.748000,17121.320000,2359.28,3365.48,6.0,0.0,8000,3,...,1,12,12.0,0.0,1.000000,12,12,0.0,0.000000,0.000000
2932,0,697.000000,12056.191667,12065.230000,445.96,475.16,2.0,0.0,5787,1,...,2,12,6.0,0.0,2.000000,12,6,0.0,0.000000,9.000000
207129,1,476.000000,3.803000,3.803000,74.37,74.37,4.0,0.0,773,3,...,1,6,6.0,0.0,1.000000,6,6,0.0,0.000000,0.000000
43610,0,1603.000000,0.013333,0.013333,0.00,0.00,0.0,0.0,0,0,...,1,6,6.0,0.0,1.000000,6,6,0.0,0.000000,0.000000
19283,1,1399.000000,7882.000000,9389.000000,2618.51,2678.51,5.0,0.0,2309,2,...,0,0,0.0,0.0,2.000000,12,6,0.0,0.000000,22.000000
179502,1,927.000000,17757.000000,20740.420000,12031.79,21220.86,3.0,0.0,1539,12,...,9,54,6.0,0.0,25.000000,150,6,0.0,1.777778,1.760000


Unnamed: 0,label,aon,daily_decr30,daily_decr90,rental30,rental90,last_rech_date_ma,last_rech_date_da,last_rech_amt_ma,cnt_ma_rech30,...,cnt_loans30,amnt_loans30,maxamnt_loans30,medianamnt_loans30,cnt_loans90,amnt_loans90,maxamnt_loans90,medianamnt_loans90,payback30,payback90
195331,1,201.000000,2842.400000,2853.000000,439.25,474.89,1.0,0.0,1539,5,...,3,18,6.000000,0.0,3.0,18,6,0.0,5.500000,5.500000
61584,1,189.000000,73.905000,73.905000,448.76,448.76,6.0,0.0,3178,1,...,1,6,6.000000,0.0,1.0,6,6,0.0,0.000000,0.000000
103931,1,192.000000,21.237000,21.237000,182.10,182.10,7.0,0.0,2320,1,...,1,6,6.000000,0.0,1.0,6,6,0.0,0.000000,0.000000
113088,1,1046.000000,87.583667,87.583667,-66.69,-66.69,1.0,0.0,1547,5,...,5,30,6.000000,0.0,5.0,30,6,0.0,1.600000,1.600000
95108,1,183.000000,14969.120000,15164.900000,4321.07,5750.55,4.0,0.0,2309,8,...,5,30,6.000000,0.0,6.0,36,6,0.0,2.250000,2.200000
47805,1,98.000000,41.600000,41.600000,1195.22,1195.22,1.0,0.0,2309,5,...,2,12,59915.171046,0.0,2.0,12,6,0.0,10.500000,10.500000
72268,1,1991.000000,7580.000000,8200.200000,5991.69,9656.22,3.0,0.0,1539,3,...,3,30,12.000000,0.0,4.0,36,12,0.0,4.500000,10.250000
149926,1,1353.000000,14.842667,14.842667,159.28,159.28,0.0,0.0,0,0,...,1,6,6.000000,0.0,1.0,6,6,0.0,0.000000,0.000000
31946,1,87.000000,77.345000,77.345000,2586.30,2586.30,1.0,0.0,1539,15,...,7,42,6.000000,0.0,7.0,42,6,0.0,1.714286,1.714286
129634,1,229.000000,1230.400000,1238.000000,218.84,248.66,1.0,0.0,770,2,...,3,18,6.000000,0.0,4.0,24,6,0.0,11.500000,8.000000


In [24]:
train.isnull().any()
test.isnull().any()

label                   False
aon                     False
daily_decr30            False
daily_decr90            False
rental30                False
rental90                False
last_rech_date_ma       False
last_rech_date_da       False
last_rech_amt_ma        False
cnt_ma_rech30           False
fr_ma_rech30            False
sumamnt_ma_rech30       False
medianamnt_ma_rech30    False
medianmarechprebal30    False
cnt_ma_rech90           False
fr_ma_rech90            False
sumamnt_ma_rech90       False
medianamnt_ma_rech90    False
medianmarechprebal90    False
cnt_da_rech30           False
fr_da_rech30            False
cnt_da_rech90           False
fr_da_rech90            False
cnt_loans30             False
amnt_loans30            False
maxamnt_loans30         False
medianamnt_loans30      False
cnt_loans90             False
amnt_loans90            False
maxamnt_loans90         False
medianamnt_loans90      False
payback30               False
payback90               False
dtype: boo

label                   False
aon                     False
daily_decr30            False
daily_decr90            False
rental30                False
rental90                False
last_rech_date_ma       False
last_rech_date_da       False
last_rech_amt_ma        False
cnt_ma_rech30           False
fr_ma_rech30            False
sumamnt_ma_rech30       False
medianamnt_ma_rech30    False
medianmarechprebal30    False
cnt_ma_rech90           False
fr_ma_rech90            False
sumamnt_ma_rech90       False
medianamnt_ma_rech90    False
medianmarechprebal90    False
cnt_da_rech30           False
fr_da_rech30            False
cnt_da_rech90           False
fr_da_rech90            False
cnt_loans30             False
amnt_loans30            False
maxamnt_loans30         False
medianamnt_loans30      False
cnt_loans90             False
amnt_loans90            False
maxamnt_loans90         False
medianamnt_loans90      False
payback30               False
payback90               False
dtype: boo

There are no null entries in the data, so we are good to go! Otherwise, we can fill those entries by their respective column's mean by uncommenting the following cell

In [25]:
# train.fillna(train.mean(), inplace=True)
# test.fillna(test.mean(), inplace=True)

When the majority of data items in our dataset represents items belonging to one class, we say the dataset is skewed or imbalanced. To prevent the model from overfitting we will try resampling and data augmentation separately and see which technique fares better.

### Resampling: Oversampling minority class
Over-sampling involves adding more examples from the minority class to the training dataset so that the model does not overfit on the majority class. Here we are resampling such that the new no. of negatives now equal the old no. of posiitves. The simplest implementation of over-sampling is to duplicate random records from the minority class.

In [26]:
from sklearn.utils import resample
from scipy import stats

In [27]:
def oversample_minority(x_train, y_train) :
    train_data = x_train
    train_data['label'] = y_train
    
    train_positive=train_data[train_data.label==1]
    train_negative=train[train_data.label==0]
    
    print("\nApplying oversampling on training data...")
    
    train_upsamp=resample(train_negative,replace=True,n_samples=train_positive.shape[0],random_state=1)
    train_upsamp=pd.concat([train_positive,train_upsamp])
    
    print("\nupsampled train:", train_upsamp.shape)
    print(train_upsamp['label'].value_counts())
    
    y_train = train_upsamp['label']
    x_train = train_upsamp.drop(['label'], axis=1)
    
    return x_train, y_train

print("\ntrain:", train.shape)
print(train['label'].value_counts())
    


train: (167674, 33)
1    146692
0     20982
Name: label, dtype: int64


In [38]:
!pip install msgpack

Collecting msgpack
  Downloading https://files.pythonhosted.org/packages/e1/53/6a3da2b55587e42b220358e0961d4b00917b50aa776ca5c279b466aebfc1/msgpack-1.0.0-cp36-cp36m-win_amd64.whl (72kB)
Installing collected packages: msgpack
Successfully installed msgpack-1.0.0


You are using pip version 10.0.1, however version 20.2b1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


### Resampling: Undersampling majority class
Under-sampling involves removing samples from the majority class. Although, we are losing some information, however, it is still worth giving a shot.

In [28]:
def downsample_minority(x_train, y_train) :
    train_data = x_train
    train_data['label'] = y_train
    
    train_positive=train_data[train_data.label==1]
    train_negative=train[train_data.label==0]
    
    
    print("\nApplying downsampling on training data...")
    
    train_downsamp=resample(train_positive,replace=False,n_samples=5*train_negative.shape[0],random_state=1)
    train_downsamp=pd.concat([train_negative,train_downsamp])
    
    print("\ndownsampled train:", train_downsamp.shape)
    print(train_downsamp['label'].value_counts())
    
    y_train = train_downsamp['label']
    x_train = train_downsamp.drop(['label'], axis=1)
    
    return x_train, y_train

print("\ntrain:", train.shape)
print(train['label'].value_counts())


train: (167674, 33)
1    146692
0     20982
Name: label, dtype: int64


### Resampling: SMOTE

Synthetic Minority Oversampling Technique (SMOTE) is a data augmentation technique that creates synthetic minority class examples. It used when there is a class imbalance in the training data.

In [40]:
!pip3 install imbalanced-learn



You are using pip version 9.0.1, however version 20.1.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


In [37]:
!pip install -U imbalanced-learn

Collecting imbalanced-learn
  Using cached https://files.pythonhosted.org/packages/c8/81/8db4d87b03b998fda7c6f835d807c9ae4e3b141f978597b8d7f31600be15/imbalanced_learn-0.7.0-py3-none-any.whl
Requirement not upgraded as not directly required: scipy>=0.19.1 in c:\users\mratu\anaconda3\lib\site-packages (from imbalanced-learn) (1.1.0)
Requirement not upgraded as not directly required: numpy>=1.13.3 in c:\users\mratu\anaconda3\lib\site-packages (from imbalanced-learn) (1.14.3)
Requirement not upgraded as not directly required: joblib>=0.11 in c:\users\mratu\anaconda3\lib\site-packages (from imbalanced-learn) (0.16.0)
Requirement not upgraded as not directly required: scikit-learn>=0.23 in c:\users\mratu\anaconda3\lib\site-packages (from imbalanced-learn) (0.23.1)
Requirement not upgraded as not directly required: threadpoolctl>=2.0.0 in c:\users\mratu\anaconda3\lib\site-packages (from scikit-learn>=0.23->imbalanced-learn) (2.1.0)
Installing collected packages: imbalanced-learn
Successfully 

distributed 1.21.8 requires msgpack, which is not installed.
You are using pip version 10.0.1, however version 20.2b1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


In [14]:
def smote_resampling(x_train, y_train) :
    
    print("\nTrain:", x_train.shape)
    print(y_train.value_counts())
    print("\nApplying SMOTE...")
    
    sm = SMOTE(random_state=27)
    x_train, y_train = sm.fit_sample(x_train, y_train)
    
    print("\nTrain:", x_train.shape)
    print(y_train.value_counts())
    
    return x_train, y_train

### Data Normalization
Standardization of datasets is a common requirement for many machine learning estimators.

Normalization is the process of scaling individual samples to have unit norm. We are going to use the Scikit-Learn's implementation of Normalization.

In [15]:
from sklearn.preprocessing import Normalizer

In [16]:
def normalize_data(x_train, x_test) :
    cols = [str(i) for i in x_train.columns]
    
    print("\nApplying data normalization...")
    train_normalized = Normalizer().fit_transform(x_train)
    train_normalized = pd.DataFrame(train_normalized, columns=cols)
    test_normalized = Normalizer().fit_transform(x_test)
    test_normalized = pd.DataFrame(test_normalized, columns=cols)
    
    x_train = train_normalized
    x_test = test_normalized
    
    return x_train, x_test

### Feature Selection with RFE

RFE stands for Recursive Feature Elimination. Finding optimal features for our model to train on is very important.

This technique begins by building a model on the entire set of predictors and computing an importance score for each predictor. The least important predictor(s) are then removed, the model is re-built, and importance scores are computed again.

Since RFE takes a lot of time to execute, I have already run RFE on training data with `num_features = 25` and `num_features=7`. 

`features_RFE_25` and `features_RFE_7` denote the selected 25 and 7 features respectively out of the 32 features that we have so far.

Feel free to uncomment the lines `fit=rfe.fit(x,y_train)` and `selected_features_boolean = fit.support_` to run RFE on specified `num_features`.

The function below will return `x_train` with only `num_features` no. of columns.


In [17]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE


In [18]:
def feature_ranking(x_train, y_train, x_test, num_features) :
    
    print("\nApplying feature selection...")

    features_RFE_25 = [ True,  True,  True,  True,  True,  True, False,  True,  True,
                True,  True,  True,  True,  True,  True,  True,  True,  True,
               False, False, False, False,  True,  True,  True, False,  True,
                True,  True, False,  True,  True]
    
    features_RFE_7 = [False,  True,  True,  True, False,  True, False, False, False, False, False, False,
             False, False, False,  True, False, False, False, False, False, False, False, False,
              True, False, False, False,  True, False, False, False]
    
    rfe=RFE(RandomForestClassifier(verbose=1), num_features)
    x = x_train.values
    selected_features_boolean = features_RFE_7
    
#     fit=rfe.fit(x,y_train)
#     selected_features_boolean = fit.support_
    
    
    print("\nSelected features:", x_train.columns[selected_features_boolean])
    
    x_train = x_train[x_train.columns[selected_features_boolean]]
    x_test = x_test[x_test.columns[selected_features_boolean]]
    
    return x_train, x_test

# Training and Prediction

In [19]:
from sklearn.metrics import accuracy_score,precision_score,recall_score
from sklearn.metrics import classification_report

In [20]:
def train_and_predict(x_train, y_train, x_test, y_test, model) :
    print("\nTraining model...")
    model.fit(x_train, y_train)
    
    print("\nPredicting on test data...")
    prediction = model.predict(x_test)
    
    accuracy = round(accuracy_score(prediction,y_test)*100,2)
    precision = round(precision_score(prediction,y_test)*100,2)
    recall = round(recall_score(prediction,y_test)*100,2)
    
    print("\naccuracy:", accuracy)
    print("precision:", precision)
    print("recall:", recall)
    
    print(classification_report(prediction,y_test))

### Training and Prediction
 - Model: Random Forest Classifier
 - Resampling: SMOTE
 - Feature selection: Recursive Feature Extraction (RFE)

In [21]:
y_train = train['label']
y_test = test['label']

x_train = train.drop(['label'], axis=1)
x_test = test.drop(['label'], axis=1)

x_train, y_train = smote_resampling(x_train, y_train)

model1 = RandomForestClassifier(verbose=1)

x_train, x_test = normalize_data(x_train, x_test)

x_train, x_test = feature_ranking(x_train, y_train, x_test, 7)



Train: (167674, 32)
1    146692
0     20982
Name: label, dtype: int64

Applying SMOTE...

Train: (293384, 32)
1    146692
0    146692
Name: label, dtype: int64

Applying data normalization...

Applying feature selection...

Selected features: Index(['daily_decr30', 'daily_decr90', 'rental30', 'last_rech_date_ma',
       'sumamnt_ma_rech90', 'maxamnt_loans30', 'maxamnt_loans90'],
      dtype='object')


In [22]:
train_and_predict(x_train, y_train, x_test, y_test, model1)


Training model...


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:  1.1min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.



Predicting on test data...

accuracy: 85.72
precision: 88.36
recall: 95.01
              precision    recall  f1-score   support

           0       0.67      0.45      0.54      7752
           1       0.88      0.95      0.92     34167

    accuracy                           0.86     41919
   macro avg       0.78      0.70      0.73     41919
weighted avg       0.84      0.86      0.85     41919



[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    1.0s finished


### Training and Prediction
 - Model: Random Forest Classifier
 - Resampling: Oversampling minority
 - Feature selection: Recursive Feature Extraction (RFE)

In [23]:
y_train = train['label']
y_test = test['label']

x_train = train.drop(['label'], axis=1)
x_test = test.drop(['label'], axis=1)

x_train, y_train = oversample_minority(x_train, y_train)

x_train, x_test = normalize_data(x_train, x_test)

model2 = RandomForestClassifier(verbose=1)

x_train, x_test = feature_ranking(x_train, y_train, x_test, 25)


train: (167674, 33)
1    146692
0     20982
Name: label, dtype: int64

Applying oversampling on training data...

upsampled train: (293384, 33)
1    146692
0    146692
Name: label, dtype: int64

Applying data normalization...

Applying feature selection...

Selected features: Index(['daily_decr30', 'daily_decr90', 'rental30', 'last_rech_date_ma',
       'sumamnt_ma_rech90', 'maxamnt_loans30', 'maxamnt_loans90'],
      dtype='object')


In [24]:
train_and_predict(x_train, y_train, x_test, y_test, model2)


Training model...


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   56.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.



Predicting on test data...

accuracy: 90.29
precision: 96.07
recall: 93.07
              precision    recall  f1-score   support

           0       0.49      0.64      0.56      3994
           1       0.96      0.93      0.95     37925

    accuracy                           0.90     41919
   macro avg       0.73      0.78      0.75     41919
weighted avg       0.92      0.90      0.91     41919



[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.8s finished


### Training and Prediction
 - Model: Random Forest Classifier
 - Resampling: Downsampling majority
 - Feature selection: Recursive Feature Extraction (RFE)

In [25]:
y_train = train['label']
y_test = test['label']

x_train = train.drop(['label'], axis=1)
x_test = test.drop(['label'], axis=1)

x_train, y_train = downsample_minority(x_train, y_train)

x_train, x_test = normalize_data(x_train, x_test)

model3 = RandomForestClassifier(verbose=1)

x_train, x_test = feature_ranking(x_train, y_train, x_test, 25)


train: (167674, 33)
1    146692
0     20982
Name: label, dtype: int64

Applying downsampling on training data...

downsampled train: (125892, 33)
1    104910
0     20982
Name: label, dtype: int64

Applying data normalization...

Applying feature selection...

Selected features: Index(['daily_decr30', 'daily_decr90', 'rental30', 'last_rech_date_ma',
       'sumamnt_ma_rech90', 'maxamnt_loans30', 'maxamnt_loans90'],
      dtype='object')


In [26]:
train_and_predict(x_train, y_train, x_test, y_test, model3)


Training model...


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   29.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.



Predicting on test data...

accuracy: 90.58
precision: 96.55
recall: 92.98
              precision    recall  f1-score   support

           0       0.48      0.66      0.56      3768
           1       0.97      0.93      0.95     38151

    accuracy                           0.91     41919
   macro avg       0.72      0.80      0.75     41919
weighted avg       0.92      0.91      0.91     41919



[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.8s finished


### Training and Prediction
 - Model: XGBoost
 - Resampling: SMOTE
 - Feature selection: Recursive Feature Extraction (RFE)

In [27]:
from xgboost import XGBClassifier
y_train = train['label']
y_test = test['label']

x_train = train.drop(['label'], axis=1)
x_test = test.drop(['label'], axis=1)

x_train, y_train = smote_resampling(x_train, y_train)

x_train, x_test = normalize_data(x_train, x_test)

model4 = XGBClassifier(verbose=1)

x_train, x_test = feature_ranking(x_train, y_train, x_test, 25)


Train: (167674, 32)
1    146692
0     20982
Name: label, dtype: int64

Applying SMOTE...

Train: (293384, 32)
1    146692
0    146692
Name: label, dtype: int64

Applying data normalization...

Applying feature selection...

Selected features: Index(['daily_decr30', 'daily_decr90', 'rental30', 'last_rech_date_ma',
       'sumamnt_ma_rech90', 'maxamnt_loans30', 'maxamnt_loans90'],
      dtype='object')


In [28]:
train_and_predict(x_train, y_train, x_test, y_test, model4)


Training model...

Predicting on test data...

accuracy: 81.99
precision: 82.68
recall: 96.24
              precision    recall  f1-score   support

           0       0.77      0.39      0.51     10355
           1       0.83      0.96      0.89     31564

    accuracy                           0.82     41919
   macro avg       0.80      0.67      0.70     41919
weighted avg       0.81      0.82      0.80     41919



In [43]:
!pip install  PrettyTable

Collecting PrettyTable
Installing collected packages: PrettyTable
Successfully installed PrettyTable-0.7.2


You are using pip version 10.0.1, however version 20.2b1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


In [44]:
from prettytable import PrettyTable
    
x = PrettyTable()

x.field_names = ["Model","Resampling","Feature Selection", "Accuracy"]
x.add_row(["Random Forest Classifier","Smote","Recursive feature Extraction",0.86])
x.add_row(["Random Forest Classifier","Oversampling minority","Recursive feature Extraction",0.90])
x.add_row(["Random Forest Classifier","Downsampling majority","Recursive feature Extraction",0.91 ])
x.add_row(["XGBoost","SMOTE","Recursive feature Extraction",0.82])



print(x)

+--------------------------+-----------------------+------------------------------+----------+
|          Model           |       Resampling      |      Feature Selection       | Accuracy |
+--------------------------+-----------------------+------------------------------+----------+
| Random Forest Classifier |         Smote         | Recursive feature Extraction |   0.86   |
| Random Forest Classifier | Oversampling minority | Recursive feature Extraction |   0.9    |
| Random Forest Classifier | Downsampling majority | Recursive feature Extraction |   0.91   |
|         XGBoost          |         SMOTE         | Recursive feature Extraction |   0.82   |
+--------------------------+-----------------------+------------------------------+----------+


# Conclusion
 - Used various permutations of the following models
     - Random Forest Classifier
     - XGBoost
 - The training data was highly skewed, so tried various resampling methods
     - Oversampling of majority class
     - Downsampling of minority class
     - SMOTE
 - Also used Recursive Feature Elimination (RFE) to extract only the relavant features to train the model on
 - Among the models the one with the highest accuracy (90.58%) was the following
     - Random Forest Classifier, Downsampling Minority, RFE with 7 features extracted
 - Although, upon analyzing the complete classification report we find that the model doesn't generalize well to the minority class.
 - The recall values for all the models for the `negative (false/0)` case is pretty low.
 - This clearly indicates there is a need to improve the performance of the model
 