## 1. INTRODUCTION
Lending companies work by analyzing the financial history of their loan applicants and choosing whether or not the applicant is too risky to be given a loan. If the applicant is not, the company then determines the terms of the loan. To acquire these applicants, companies can organically receive them through their websites/ apps, often with the help of advertisement companies. Other times, lending companies partner with peer-to-peer (P2P) lending marketplaces, in order to acquire leads of possible applicants. Some examples of the marketplace include Upstart, Lending Tree, and Lending Club. In this project, we are going to asses the 'quality' of the leads our company receives from these marketplaces.

Market: The target audience is the set of loan applicants who reached out through an intermediary marketplace.
Product: A loan.
Goal: Develop a model to predict for 'quality' applicants are those who reach a key part of the loan application process.
## 2. BUSINESS CHALLENGE
1. In this Case Study, we will be working for a fin-tech company that specializes in loans. It offers low APR loans to applicants based on their financial habits, as almost all lending companies do. This company has partnered with a P2P lending marketplace that provides real-time leads (loan applicants). The numbers of conversations from these leads are satisfactory.
<br>
2. The company tasks you with creating a model that predicts whether or not these leads will complete the electronic signature phase of the loan applicant (a.k.a. e_signed). The company seeks to leverage this model to identify less 'qualify' applicants (e.g. those who are not responding to the onboarding process) and experiment with giving them different onboarding screens.
<br>
3. The reason for selecting the e_signing process as the response variable is due to the structure of the loan application.
<br>
4. The official application begins with the lead arriving on our website after we opted to acquire it. Here, The applicant begins the onboarding process to apply for a loan. The user begins to provide more financial information by going over every screen of the onboarding process. This first phase ends with the applicant providing his/ her signature indicating all of the given information is correct.
<br>
5. Any of the following screens, in which the applicant is approved/denied and given the terms of the loan, is dependent on the company, not the applicant stops having control of the application process.

## Column Details

<b>entry_id</b> -----> user identifier column. integer <br>
<b>age</b> -----> age of the applicant is always 18+<br>
<b>pay_schedule</b> -----> how the applucant gets paid. weekly/bi-weekly/monthly...<br>
<b>home_owner</b> -----> 1- Is Owner, 0- Not Owner<br>
<b>income</b> -----> Monthly income of applicant<br>
<b>months_employed</b> ----> <br>
<b>years_employed</b> -----> how many years employed<br>
<b>current_address_year</b> -----> years person stayed at Present add<br>
<b>personal_account_m</b> -----> how many months user has personal account<br>
<b>personal_account_y</b> -----> how many years user has personal account<br>
<b>has_debt</b>-----> has any debt/owe money<br>
<b>amount_requested</b> -----> the loan appplied for<br>
<b>risk_score</b> -----> these are 5 risk score from finance team<br>
<b>risk_score_2</b> -----> these are 5 risk score from finance team<br>
<b>risk_score_3</b> -----> these are 5 risk score from finance team<br>
<b>risk_score_4</b> -----> these are 5 risk score from finance team<br>
<b>risk_score_5</b> -----> these are 5 risk score from finance team<br>
<b>ext_quality_score</b> -----> got from P2P market place<br>
<b>ext_quality_score_2</b> -----> got from P2P market place<br>
<b>inquiries_last_month</b> -----> the user has had a hard check<br>
<b>e_signed</b> -----> 0: NOT Signed 1: Signed

# Importing the libraries

In [43]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Importing the Dataset

In [44]:
df = pd.read_csv('Financial_Data.csv')

In [45]:
df.head()

Unnamed: 0,entry_id,age,pay_schedule,home_owner,income,years_employed,current_address_year,has_debt,amount_requested,risk_score,risk_score_2,risk_score_3,risk_score_4,risk_score_5,ext_quality_score,ext_quality_score_2,inquiries_last_month,e_signed,personal_account_months
0,7629673,40,bi-weekly,1,3135,3,3,1,550,36200,0.737398,0.903517,0.487712,0.515977,0.580918,0.380918,10,1,30
1,3560428,61,weekly,0,3180,6,3,1,600,30150,0.73851,0.881027,0.713423,0.826402,0.73072,0.63072,9,0,86
2,6934997,23,weekly,0,1540,0,0,1,450,34550,0.642993,0.766554,0.595018,0.762284,0.531712,0.531712,7,0,19
3,5682812,40,bi-weekly,0,5230,6,1,1,700,42150,0.665224,0.960832,0.767828,0.778831,0.792552,0.592552,8,1,86
4,5335819,33,semi-monthly,0,3590,5,2,1,1100,53850,0.617361,0.85756,0.613487,0.665523,0.744634,0.744634,12,0,98


In [46]:
df.tail()

Unnamed: 0,entry_id,age,pay_schedule,home_owner,income,years_employed,current_address_year,has_debt,amount_requested,risk_score,risk_score_2,risk_score_3,risk_score_4,risk_score_5,ext_quality_score,ext_quality_score_2,inquiries_last_month,e_signed,personal_account_months
17903,9949728,31,monthly,0,3245,5,3,1,700,71700,0.691126,0.928196,0.664112,0.838012,0.727705,0.627705,2,0,74
17904,9442442,46,bi-weekly,0,6525,2,1,1,800,51800,0.648525,0.970832,0.699241,0.844724,0.774918,0.474918,3,0,39
17905,9857590,46,weekly,0,2685,5,1,1,1200,59650,0.677975,0.918141,0.687981,0.939101,0.472045,0.672045,9,0,97
17906,8708471,42,bi-weekly,0,2515,3,5,1,400,80200,0.642741,0.885684,0.456448,0.686823,0.406568,0.406568,3,1,18
17907,1498559,29,weekly,1,2665,4,10,1,600,64950,0.720889,0.874372,0.505565,0.631619,0.846163,0.846163,4,1,16


## One Hot Encoding
<b>
done for categorical column  pay_schedule

In [47]:
# counting the values
df.pay_schedule.value_counts()

bi-weekly       10716
weekly           3696
semi-monthly     2004
monthly          1492
Name: pay_schedule, dtype: int64

In [48]:
# Using pandas.get_dummies
df = pd.get_dummies(df)

In [49]:
# check the columns if the categorical values are spread
df.columns

Index(['entry_id', 'age', 'home_owner', 'income', 'years_employed',
       'current_address_year', 'has_debt', 'amount_requested', 'risk_score',
       'risk_score_2', 'risk_score_3', 'risk_score_4', 'risk_score_5',
       'ext_quality_score', 'ext_quality_score_2', 'inquiries_last_month',
       'e_signed', 'personal_account_months', 'pay_schedule_bi-weekly',
       'pay_schedule_monthly', 'pay_schedule_semi-monthly',
       'pay_schedule_weekly'],
      dtype='object')

In [50]:
df= df.drop(columns=['pay_schedule_semi-monthly'])

In [51]:
df.columns

Index(['entry_id', 'age', 'home_owner', 'income', 'years_employed',
       'current_address_year', 'has_debt', 'amount_requested', 'risk_score',
       'risk_score_2', 'risk_score_3', 'risk_score_4', 'risk_score_5',
       'ext_quality_score', 'ext_quality_score_2', 'inquiries_last_month',
       'e_signed', 'personal_account_months', 'pay_schedule_bi-weekly',
       'pay_schedule_monthly', 'pay_schedule_weekly'],
      dtype='object')

In [52]:
df.head()

Unnamed: 0,entry_id,age,home_owner,income,years_employed,current_address_year,has_debt,amount_requested,risk_score,risk_score_2,...,risk_score_4,risk_score_5,ext_quality_score,ext_quality_score_2,inquiries_last_month,e_signed,personal_account_months,pay_schedule_bi-weekly,pay_schedule_monthly,pay_schedule_weekly
0,7629673,40,1,3135,3,3,1,550,36200,0.737398,...,0.487712,0.515977,0.580918,0.380918,10,1,30,1,0,0
1,3560428,61,0,3180,6,3,1,600,30150,0.73851,...,0.713423,0.826402,0.73072,0.63072,9,0,86,0,0,1
2,6934997,23,0,1540,0,0,1,450,34550,0.642993,...,0.595018,0.762284,0.531712,0.531712,7,0,19,0,0,1
3,5682812,40,0,5230,6,1,1,700,42150,0.665224,...,0.767828,0.778831,0.792552,0.592552,8,1,86,1,0,0
4,5335819,33,0,3590,5,2,1,1100,53850,0.617361,...,0.613487,0.665523,0.744634,0.744634,12,0,98,0,0,0


# Data Pre Processing

In [53]:
x = df.loc[:, df.columns != "e_signed"]
y = df['e_signed']

In [54]:
print(x.head())

print()

print(y.head())

   entry_id  age  home_owner  income  years_employed  current_address_year  \
0   7629673   40           1    3135               3                     3   
1   3560428   61           0    3180               6                     3   
2   6934997   23           0    1540               0                     0   
3   5682812   40           0    5230               6                     1   
4   5335819   33           0    3590               5                     2   

   has_debt  amount_requested  risk_score  risk_score_2  risk_score_3  \
0         1               550       36200      0.737398      0.903517   
1         1               600       30150      0.738510      0.881027   
2         1               450       34550      0.642993      0.766554   
3         1               700       42150      0.665224      0.960832   
4         1              1100       53850      0.617361      0.857560   

   risk_score_4  risk_score_5  ext_quality_score  ext_quality_score_2  \
0      0.487712    

# Splitting the Data

In [55]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)

In [56]:
# Removing the entry_id column as it is not a real feature.
# but at the end we will use this data associating the prediction from user it came from

train_identifier = x_train['entry_id']
x_train = x_train.drop(columns=['entry_id'])

test_identifier = x_test['entry_id']
x_test = x_test.drop(columns=['entry_id'])

# Feature scaling

In [57]:
from sklearn.preprocessing import StandardScaler
sc= StandardScaler()

In [58]:
# scale the x_train and create a new variable for scaled x_train
#Standard scaler returns a numpy array so we need to convert to DataFrame
#In array form 'index' and 'column names' are lost we need it so convert to Df

x_train_scaled= pd.DataFrame(sc.fit_transform(x_train))
x_test_scaled = pd.DataFrame(sc.transform(x_test))

In [59]:
# put x_train_scaled have the columns of original x_train set
x_train_scaled.columns = x_train.columns.values

# put x_test_scaled have the columns of original x_test set
x_test_scaled.columns = x_test.columns.values

# take the indexes also
x_train_scaled.index = x_train.index.values
x_test_scaled.index = x_test.index.values

In [60]:
# Compare the original training set with new scaled training set
x_train= x_train_scaled
x_test= x_test_scaled

# Applying machine learning algorithms

### Creating a function that will give the Following output when any model is run

In [61]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def print_score(clf, x_train, y_train, x_test, y_test, train=True):
    if train:
        pred = clf.predict(x_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
        
    elif train==False:
        pred = clf.predict(x_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")        
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")

### LOGISTIC REGRESSION

In [62]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0)

clf.fit(x_train,y_train)
print_score(clf, x_train, y_train, x_test, y_test, train=True)
print_score(clf, x_train, y_train, x_test, y_test, train=False)

Train Result:
Accuracy Score: 57.96%
_______________________________________________
CLASSIFICATION REPORT:
                     0            1  accuracy     macro avg  weighted avg
precision     0.560606     0.589388  0.579576      0.574997      0.576098
recall        0.413908     0.721696  0.579576      0.567802      0.579576
f1-score      0.476215     0.648866  0.579576      0.562541      0.569145
support    6615.000000  7711.000000  0.579576  14326.000000  14326.000000
_______________________________________________
Confusion Matrix: 
 [[2738 3877]
 [2146 5565]]

Test Result:
Accuracy Score: 56.25%
_______________________________________________
CLASSIFICATION REPORT:
                     0            1  accuracy    macro avg  weighted avg
precision     0.535685     0.576386  0.562535     0.556035      0.557592
recall        0.394800     0.706432  0.562535     0.550616      0.562535
f1-score      0.454577     0.634817  0.562535     0.544697      0.551591
support    1654.000000  192

### DECISION TREE

In [63]:
from sklearn.tree import DecisionTreeClassifier
clf= DecisionTreeClassifier(random_state=0)

clf.fit(x_train,y_train)
print_score(clf, x_train, y_train, x_test, y_test, train=True)
print_score(clf, x_train, y_train, x_test, y_test, train=False)

Train Result:
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
                0       1  accuracy  macro avg  weighted avg
precision     1.0     1.0       1.0        1.0           1.0
recall        1.0     1.0       1.0        1.0           1.0
f1-score      1.0     1.0       1.0        1.0           1.0
support    6615.0  7711.0       1.0    14326.0       14326.0
_______________________________________________
Confusion Matrix: 
 [[6615    0]
 [   0 7711]]

Test Result:
Accuracy Score: 57.62%
_______________________________________________
CLASSIFICATION REPORT:
                     0            1  accuracy    macro avg  weighted avg
precision     0.540428     0.607895  0.576214     0.574161      0.576742
recall        0.549577     0.599066  0.576214     0.574322      0.576214
f1-score      0.544964     0.603448  0.576214     0.574206      0.576443
support    1654.000000  1928.000000  0.576214  3582.000000   3582.000000
__________________

### RANDOM FOREST

In [64]:
from sklearn.ensemble import RandomForestClassifier
clf1= RandomForestClassifier(random_state=0)

clf1.fit(x_train,y_train)
print_score(clf1, x_train, y_train, x_test, y_test, train=True)
print_score(clf1, x_train, y_train, x_test, y_test, train=False)

Train Result:
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
                0       1  accuracy  macro avg  weighted avg
precision     1.0     1.0       1.0        1.0           1.0
recall        1.0     1.0       1.0        1.0           1.0
f1-score      1.0     1.0       1.0        1.0           1.0
support    6615.0  7711.0       1.0    14326.0       14326.0
_______________________________________________
Confusion Matrix: 
 [[6615    0]
 [   0 7711]]

Test Result:
Accuracy Score: 62.51%
_______________________________________________
CLASSIFICATION REPORT:
                     0            1  accuracy    macro avg  weighted avg
precision     0.601834     0.642336   0.62507     0.622085      0.623634
recall        0.555623     0.684647   0.62507     0.620135      0.625070
f1-score      0.577806     0.662817   0.62507     0.620311      0.623563
support    1654.000000  1928.000000   0.62507  3582.000000   3582.000000
__________________

### SVM

In [65]:
from sklearn.svm import SVC

print("=======================Linear Kernel SVM==========================")
model = SVC(kernel='linear')
model.fit(x_train, y_train)
print_score(model, x_train, y_train, x_test, y_test, train=True)
print_score(model, x_train, y_train, x_test, y_test, train=False)

print("=======================Polynomial Kernel SVM==========================")
from sklearn.svm import SVC

model = SVC(kernel='poly', degree=2, gamma='auto')
model.fit(x_train, y_train)

print_score(model, x_train, y_train, x_test, y_test, train=True)
print_score(model, x_train, y_train, x_test, y_test, train=False)

print("=======================Radial Kernel SVM==========================")
from sklearn.svm import SVC

model = SVC(kernel='rbf', gamma=1)
model.fit(x_train, y_train)

print_score(model, x_train, y_train, x_test, y_test, train=True)
print_score(model, x_train, y_train, x_test, y_test, train=False)


Train Result:
Accuracy Score: 58.27%
_______________________________________________
CLASSIFICATION REPORT:
                     0            1  accuracy     macro avg  weighted avg
precision     0.569557     0.588899  0.582717      0.579228      0.579968
recall        0.394255     0.744391  0.582717      0.569323      0.582717
f1-score      0.465964     0.657578  0.582717      0.561771      0.569101
support    6615.000000  7711.000000  0.582717  14326.000000  14326.000000
_______________________________________________
Confusion Matrix: 
 [[2608 4007]
 [1971 5740]]

Test Result:
Accuracy Score: 56.87%
_______________________________________________
CLASSIFICATION REPORT:
                     0            1  accuracy    macro avg  weighted avg
precision     0.548273     0.578068  0.568677     0.563170      0.564310
recall        0.374244     0.735477  0.568677     0.554861      0.568677
f1-score      0.444844     0.647341  0.568677     0.546092      0.553837
support    1654.000000  192

# XGBoost

In [105]:
pip install xgboost
from xgboost import XGBClassifier

ModuleNotFoundError: No module named 'xgboost'

In [104]:
#from xgboost import XGBClassifier
import xgboost as xgb
#classifier = XGBClassifier(random_state=0)
classifier = xgb(random_state=0)

classifier.fit(x_train,y_train)
print_score(classifier, x_train, y_train, x_test, y_test, train=True)
print_score(classifier, x_train, y_train, x_test, y_test, train=False)

ModuleNotFoundError: No module named 'xgboost'

# K-Fold Cross Validation

In [66]:
from sklearn.model_selection import cross_val_score
accuracies= cross_val_score(estimator=clf1, X=x_train, y=y_train, cv=10)

print("Random Forest Classifier Accuracy: %0.2f (+/- %0.2f)" %(accuracies.mean(), accuracies.std()*2))

Random Forest Classifier Accuracy: 0.63 (+/- 0.02)


# Grid Search --> Hyper Parameter Tuning Technique

In [67]:
parameters= {'max_depth':[3, None],
             'min_samples_split':[2,5,10],
             'min_samples_leaf':[1,5,10],
             'bootstrap':[True,False],
             'criterion':['entropy']}

In [68]:
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(estimator =clf1, param_grid=parameters,
                          scoring='accuracy',
                          cv=10,
                          n_jobs=-1)
grid_search.fit(x_train, y_train)
best_params = grid_search.best_params_
print(f"Best paramters: {best_params})")

Best paramters: {'bootstrap': True, 'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 10})


In [69]:
clf1= RandomForestClassifier(**best_params)
clf1.fit(x_train,y_train)
print_score(clf1, x_train, y_train, x_test, y_test, train=True)
print_score(clf1, x_train, y_train, x_test, y_test, train=False)

Train Result:
Accuracy Score: 99.48%
_______________________________________________
CLASSIFICATION REPORT:
                     0            1  accuracy     macro avg  weighted avg
precision     0.999389     0.990872  0.994765      0.995130      0.994805
recall        0.989267     0.999481  0.994765      0.994374      0.994765
f1-score      0.994302     0.995158  0.994765      0.994730      0.994763
support    6615.000000  7711.000000  0.994765  14326.000000  14326.000000
_______________________________________________
Confusion Matrix: 
 [[6544   71]
 [   4 7707]]

Test Result:
Accuracy Score: 62.73%
_______________________________________________
CLASSIFICATION REPORT:
                     0            1  accuracy    macro avg  weighted avg
precision     0.607119     0.641663  0.627303     0.624391      0.625712
recall        0.546554     0.696577  0.627303     0.621565      0.627303
f1-score      0.575247     0.667993  0.627303     0.621620      0.625167
support    1654.000000  192

In [70]:
parameters1= {'max_depth':[None],
              'max_features':[3,5,7],
             'min_samples_split':[8,10,12],
             'min_samples_leaf':[1,2,3],
             'bootstrap':[True],
             'criterion':['entropy']}

In [71]:
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(estimator =clf1, param_grid=parameters1,
                          scoring='accuracy',
                          cv=10,
                          n_jobs=-1)
grid_search.fit(x_train, y_train)
best_params = grid_search.best_params_
print(f"Best paramters: {best_params})")

Best paramters: {'bootstrap': True, 'criterion': 'entropy', 'max_depth': None, 'max_features': 5, 'min_samples_leaf': 2, 'min_samples_split': 12})


In [72]:
clf1= RandomForestClassifier(**best_params)
clf1.fit(x_train,y_train)
print_score(clf1, x_train, y_train, x_test, y_test, train=True)
print_score(clf1, x_train, y_train, x_test, y_test, train=False)

Train Result:
Accuracy Score: 98.86%
_______________________________________________
CLASSIFICATION REPORT:
                     0            1  accuracy     macro avg  weighted avg
precision     0.997532     0.981130  0.988552      0.989331      0.988703
recall        0.977627     0.997925  0.988552      0.987776      0.988552
f1-score      0.987479     0.989456  0.988552      0.988468      0.988543
support    6615.000000  7711.000000  0.988552  14326.000000  14326.000000
_______________________________________________
Confusion Matrix: 
 [[6467  148]
 [  16 7695]]

Test Result:
Accuracy Score: 63.62%
_______________________________________________
CLASSIFICATION REPORT:
                     0            1  accuracy    macro avg  weighted avg
precision     0.617391     0.649736  0.636237     0.633564      0.634801
recall        0.558041     0.703320  0.636237     0.630680      0.636237
f1-score      0.586218     0.675467  0.636237     0.630842      0.634256
support    1654.000000  192

# Feature Selection

In [76]:
from sklearn.feature_selection import RFE # Recurssive Feature Elimination
from sklearn.ensemble import RandomForestClassifier


In [77]:
# Reducing the features to 20 from 42
classifier= RandomForestClassifier()
rfe = RFE(classifier, 6)
rfe= rfe.fit(x_train,y_train)

In [78]:
# summarise the selection attributes
# which columns are selected(True) and which are not (False)
print(rfe.support_)

[False False  True False False False False  True  True  True  True False
  True False False False False False False]


In [79]:
x_train.columns[rfe.support_]

Index(['income', 'risk_score', 'risk_score_2', 'risk_score_3', 'risk_score_4',
       'ext_quality_score'],
      dtype='object')

In [80]:
rfe.ranking_

array([ 6, 12,  1,  9,  8, 11,  4,  1,  1,  1,  1,  3,  1,  2,  7,  5, 10,
       14, 13])

In [83]:
clf1= RandomForestClassifier(random_state=0)

clf1.fit(x_train[x_train.columns[rfe.support_]],y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

In [86]:
# Predicting Test Set

y_pred = clf1.predict(x_test[x_test.columns[rfe.support_]])

In [87]:
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score

cm = confusion_matrix(y_test,y_pred)
print('Confusion Matrix:')
print(cm)
print('---------------------------------------------------')
print('Accuracy_Score:')
print(accuracy_score(y_test,y_pred))
print('---------------------------------------------------')
print('F1_Score:')
print(f1_score(y_test,y_pred))
print('---------------------------------------------------')
print('Precision_Score:')
print(precision_score(y_test,y_pred))
print('---------------------------------------------------')
print('Recall_Score:')
print(recall_score(y_test,y_pred))

Confusion Matrix:
[[ 738  916]
 [ 707 1221]]
---------------------------------------------------
Accuracy_Score:
0.5469011725293133
---------------------------------------------------
F1_Score:
0.6007380073800739
---------------------------------------------------
Precision_Score:
0.5713617220402434
---------------------------------------------------
Recall_Score:
0.633298755186722


# Final Result

In [97]:
final_results1 = pd.concat([y_test, test_identifier],axis=1).dropna()

In [98]:
final_results1

Unnamed: 0,e_signed,entry_id
3629,1,8825262
1820,1,9216889
6685,0,1762129
17241,1,7249770
8332,1,5967375
...,...,...
7546,1,9384491
9836,1,2445124
7446,1,6534419
9526,1,5501730


In [99]:
final_results1['predicted_esigned']=y_pred

In [100]:
final_results= final_results1[['entry_id','e_signed','predicted_esigned']].reset_index(drop=True)

In [101]:
final_results

Unnamed: 0,entry_id,e_signed,predicted_esigned
0,8825262,1,0
1,9216889,1,1
2,1762129,0,0
3,7249770,1,1
4,5967375,1,1
...,...,...,...
3577,9384491,1,1
3578,2445124,1,1
3579,6534419,1,0
3580,5501730,1,1
