In [47]:
import numpy as np
import pandas as pd


In [48]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [49]:
!ls

gdrive	sample_data


In [50]:
dataset = pd.read_csv('/content/gdrive/MyDrive/train.csv')


In [74]:
dataset

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...
609,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y


In [None]:
My hypothesis is that, from the dataset above, we can find an algorithm that approximates a target function mapping input examples to the appropriate output: in this case, whether a loan is accepted (Y) or rejected (N).
We will reject the null hypothesis if test accuracy is above 80%.
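
An 80% threshold is easier to interpret next to the accuracy of always predicting the most common class. The following is a minimal sketch of that comparison, assuming a made-up label array (`y_demo`, not the loan data) and scikit-learn's `DummyClassifier`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels for illustration only: 7 approvals (1) and 3 rejections (0).
y_demo = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)
print(f"majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

On the real data the same call would take the training features and the encoded `y`; a model only carries information if it beats this number.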

In [51]:
dataset.describe()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,614.0,614.0,592.0,600.0,564.0
mean,5403.459283,1621.245798,146.412162,342.0,0.842199
std,6109.041673,2926.248369,85.587325,65.12041,0.364878
min,150.0,0.0,9.0,12.0,0.0
25%,2877.5,0.0,100.0,360.0,1.0
50%,3812.5,1188.5,128.0,360.0,1.0
75%,5795.0,2297.25,168.0,360.0,1.0
max,81000.0,41667.0,700.0,480.0,1.0


In [52]:
dataset.head()
dataset = dataset.drop("Loan_ID", axis = 1)

In [53]:

X = dataset.iloc[:, :-1]
y = dataset.iloc[:,-1]

In [54]:
# Pipeline for cleaning the data: one-hot encode the categorical features, impute and scale the numeric features. The target is label encoded separately.
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

numeric_features = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount", "Loan_Amount_Term", "Credit_History"]
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

categorical_features = ["Gender", "Married", "Dependents", "Education", "Self_Employed", "Property_Area" ]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
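
To see what this kind of preprocessor produces, here is a small sketch on a toy frame. It uses the first three rows of the table shown earlier, trimmed to three columns; `toy` and `toy_preprocessor` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Three rows, three columns; LoanAmount is missing in the first row.
toy = pd.DataFrame({
    "ApplicantIncome": [5849.0, 4583.0, 3000.0],
    "LoanAmount": [np.nan, 128.0, 66.0],
    "Property_Area": ["Urban", "Rural", "Urban"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        # Numeric columns: fill gaps with the median, then standardize.
        ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                          ("scaler", StandardScaler())]),
         ["ApplicantIncome", "LoanAmount"]),
        # Categorical column: one indicator column per category.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Property_Area"]),
    ]
)

toy_matrix = toy_preprocessor.fit_transform(toy)
# 2 scaled numeric columns + 2 one-hot columns (Rural, Urban)
print(toy_matrix.shape)
```

Columns that are not listed in any transformer are dropped by default, which is why extra columns in new data do not break `transform`.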


In [55]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
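
`LabelEncoder` assigns integers in sorted order of the class names, which matters later when reading the network's output columns. A quick check on a made-up list of labels (`demo_encoder` is an illustrative name):

```python
from sklearn.preprocessing import LabelEncoder

demo_encoder = LabelEncoder()
encoded = demo_encoder.fit_transform(["Y", "N", "Y", "Y"])

# classes_ is sorted, so "N" maps to 0 and "Y" maps to 1.
print(list(demo_encoder.classes_))
print(list(encoded))
```

In the notebook, `label_encoder.classes_` can be inspected the same way to confirm the mapping.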


In [56]:
# First pass uses logistic regression for speed and ease of use. A test accuracy of about 84% is a clear sign this data is worth a closer look.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

model score: 0.837
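
With 614 rows and `test_size=0.2`, this score rests on 123 test rows, so a single prediction moves accuracy by about 0.008. Cross-validation gives a sense of that spread. A minimal sketch, assuming synthetic stand-in data from `make_classification` rather than the loan features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption): same row count as the loan table,
# but not the real features or labels.
X_demo, y_demo = make_classification(n_samples=614, n_features=8, random_state=0)

# Five stratified folds keep the class ratio roughly equal in each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv)

print("fold accuracies:", scores.round(3))
print("mean %.3f, std %.3f" % (scores.mean(), scores.std()))
```

In the notebook, the `clf` pipeline and the raw `X`, `y` could be passed to `cross_val_score` directly, so the preprocessing is refit inside each fold.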


In [57]:
#taking X_train and X_test out of clf pipeline for training other models. 
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

In [76]:
# Run this if you need to install lazypredict; it is not usually preinstalled in Google Colab or Jupyter.
!pip install lazypredict

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting lazypredict
  Downloading lazypredict-0.2.12-py2.py3-none-any.whl (12 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.12


In [78]:
#Quickly checking other algs with lazy predict
from lazypredict.Supervised import LazyClassifier
clf2 = LazyClassifier(verbose = 0, ignore_warnings=True, custom_metric = None)
models, predictions = clf2.fit(X_train, X_test, y_train, y_test)
print(models)

100%|██████████| 29/29 [00:03<00:00,  8.85it/s]

                               Accuracy  Balanced Accuracy  ROC AUC  F1 Score  \
Model                                                                           
LGBMClassifier                     0.80               0.74     0.74      0.80   
BaggingClassifier                  0.80               0.74     0.74      0.80   
AdaBoostClassifier                 0.81               0.72     0.72      0.80   
CalibratedClassifierCV             0.84               0.72     0.72      0.82   
LinearDiscriminantAnalysis         0.84               0.72     0.72      0.82   
RidgeClassifierCV                  0.84               0.72     0.72      0.82   
RidgeClassifier                    0.84               0.72     0.72      0.82   
LogisticRegression                 0.84               0.72     0.72      0.82   
LinearSVC                          0.84               0.72     0.72      0.82   
NearestCentroid                    0.78               0.71     0.71      0.78   
XGBClassifier               
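
The table reports balanced accuracy well below plain accuracy for the linear models (0.72 vs. 0.84). Balanced accuracy is the mean of per-class recall, so it drops when the minority class is often missed. A small sketch with made-up labels (not taken from the loan data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Made-up labels: 8 approvals (1) and 2 rejections (0).
# The predictions miss one of the two rejections.
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])

acc = accuracy_score(y_true, y_pred)               # 9 of 10 correct
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of recall per class
print(acc, bal_acc)
```

Here recall is 1.0 for approvals and 0.5 for rejections, so balanced accuracy is 0.75 while accuracy is 0.9.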




From the above tests, it is clear that accuracy on this data peaks at around 84 percent. I will now run a quick neural network to see if it is worth pursuing. With such a small dataset, I do not believe a NN is worth it, but it does not take much time to test.

In [58]:
# Creating the neural network layers
import tensorflow as tf
model = tf.keras.models.Sequential([
  tf.keras.layers.Dense(10, activation='relu'),
  tf.keras.layers.Dense(10, activation='relu'),
  tf.keras.layers.Dense(2, activation='softmax')
])

In [59]:
# Compile the model: choose the optimizer, loss, and metric (no hyperparameter search is run here)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

In [60]:
model.fit(X_train, y_train, epochs=50)


Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7ff9b89f5640>

In [61]:
loss, accuracy = model.evaluate(X_test, y_test)




In [64]:
accuracy

0.8455284833908081

In [65]:
model.summary()


Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_6 (Dense)             (None, 10)                250       
                                                                 
 dense_7 (Dense)             (None, 10)                110       
                                                                 
 dense_8 (Dense)             (None, 2)                 22        
                                                                 
Total params: 382
Trainable params: 382
Non-trainable params: 0
_________________________________________________________________


In [68]:
test_dataset = pd.read_csv('/content/gdrive/MyDrive/test.csv')

In [87]:
# TensorFlow had about .01 more accuracy, so we will use it for predictions. LabelEncoder sorts the classes (N=0, Y=1), so column 0 is the probability the loan is not approved and column 1 is the probability it is approved.
 
prediction_newdata = model.predict(preprocessor.transform(test_dataset))
prediction_newdata



array([[0.16773623, 0.8322638 ],
       [0.22253634, 0.7774637 ],
       [0.15637968, 0.84362036],
       [0.15342142, 0.84657854],
       [0.2890051 , 0.71099496],
       [0.33815405, 0.6618459 ],
       [0.3017324 , 0.6982676 ],
       [0.9880475 , 0.01195255],
       [0.09346248, 0.9065375 ],
       [0.12176709, 0.87823296],
       [0.29314956, 0.7068504 ],
       [0.19854884, 0.80145115],
       [0.09168326, 0.9083168 ],
       [0.9150642 , 0.08493578],
       [0.11961658, 0.8803833 ],
       [0.3009914 , 0.6990086 ],
       [0.1877432 , 0.8122568 ],
       [0.13731302, 0.862687  ],
       [0.35254866, 0.6474512 ],
       [0.18654397, 0.8134561 ],
       [0.3409306 , 0.6590693 ],
       [0.11287928, 0.8871207 ],
       [0.34448326, 0.65551674],
       [0.41386905, 0.586131  ],
       [0.33278528, 0.66721475],
       [0.9958155 , 0.00418439],
       [0.16345996, 0.8365399 ],
       [0.32034412, 0.67965585],
       [0.09755565, 0.9024443 ],
       [0.21703015, 0.7829698 ],
       [0.

In [94]:
# For a quick read, we index the first column (the probability of class 0, N). Since the label is binary, a value above .5 means the loan was rejected and a value below .5 means it was approved. Full dataframe of outputs below.
output_df = pd.DataFrame(prediction_newdata[:,0])
pd.set_option('display.max_rows', None)


Unnamed: 0,0
0,0.17
1,0.22
2,0.16
3,0.15
4,0.29
5,0.34
6,0.3
7,0.99
8,0.09
9,0.12


In [72]:
model.save('Loan_Application_model/1')


Loan Application write up: 12/27/2022

• Why did you use that model?
• What could you do to improve your model?
• What next steps would you take to?

For this write-up I will address the three questions asked. First, I chose a 3-layer neural network because it had the highest accuracy on this dataset. I started with logistic regression because the label is binary. I then ran lazypredict to quickly check that no other model was outperforming logistic regression; it is only a tool to confirm my original idea that logistic regression would be enough for this dataset, just to cover all my bases. I then ran a quick neural network using TensorFlow to see if it was favorable. It ended up about .01 better than logistic regression (0.846 vs. 0.837 test accuracy), so I chose to move forward with it. With a dataset this small, logistic regression would have been enough, but the neural network brings the ability to become more accurate with larger datasets and more parameter tuning if needed.

I have found that in most situations, starting with clean data makes the largest jump in accuracy. However, this data was relatively clean, so to break above the 90% accuracy threshold I would most likely ingest more data, say 10k rows, and then start hyperparameter tuning the neural network, starting with the learning rate and potentially adding another layer.

Next steps, again, would be to get more data. More data would mean I could trim or add outlier cases and tune hyperparameters. I may also test the coefficients on the features to see whether dropping any of them actually increases accuracy and speed. Once the model reaches an accuracy that is appropriate for the team, I would look to deploy it either behind a Flask API or on a virtual machine, depending on need.

In [95]:
#see full output
output_df

Unnamed: 0,0
0,0.17
1,0.22
2,0.16
3,0.15
4,0.29
5,0.34
6,0.3
7,0.99
8,0.09
9,0.12
