# Theory
## Sigmoid Function
$$\sigma(x) = \frac{1}{1+e^{-x}}$$
## Cost function
$$\hat{y} = \sigma(W^TX+B)$$
$$\text{Cost} = \frac{-1}{m} \sum_{i=1}^m [y_i\log(\hat{y}_i)+(1-y_i)\log(1-\hat{y}_i)]$$
## Gradient Descent
$$W = W - \alpha\frac{\delta\text{Cost}}{\delta W}$$
$$B = B - \alpha\frac{\delta\text{Cost}}{\delta B}$$
$$\frac{\delta\text{Cost}}{\delta W} = \frac{1}{m}[\hat{Y}-Y]_{(1, m)}X_{(m, n)}$$
$$\frac{\delta\text{Cost}}{\delta B} = \frac{1}{m}\sum_{i=1}^m[\hat{Y}-Y]_{(1, m)}$$
## Note
- $\alpha =$ Learning Rate
- $m = $ number of observations and $n = $ number of features.
- $W =$ Weights and $B =$ Bias. Together they comprise of the model.
### Logistic regression can only be used for binary classification.
***
# Problem Statement
Dream Housing Finance company deals in all home loans. They have a presence across all urban, semi-urban, and rural areas. Customer-first applies for a home loan after that company validates the customer eligibility for a loan.

The company wants to automate the loan eligibility process (real-time) based on customer detail provided while filling the online application form. These details are Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate this process, they have given a problem to identify the customer's segments, those are eligible for loan amount so that they can specifically target these customers. Here they have provided a partial data set.
***
# Importing the libraries and loading the dataset

In [1]:
import numpy as np
import pandas as pd

train_data = pd.read_csv("./loan-dataset.csv")
train_data

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y


# Cleaning the data

In [2]:
train_data = train_data.dropna()

# updating married column
train_data.loc[train_data['Married'] == "No", 'Married'] = 0
train_data.loc[train_data['Married'] == "Yes", 'Married'] = 1

# updating gender column
train_data.loc[train_data['Gender'] == "Male", 'Gender'] = 1
train_data.loc[train_data['Gender'] == "Female", 'Gender'] = 0

# updating Self_Employed column
train_data.loc[train_data['Self_Employed'] == "No", 'Self_Employed'] = 0
train_data.loc[train_data['Self_Employed'] == "Yes", 'Self_Employed'] = 1

# updating Loan_Status column
train_data.loc[train_data['Loan_Status'] == "N", 'Loan_Status'] = 0
train_data.loc[train_data['Loan_Status'] == "Y", 'Loan_Status'] = 1

# updating Education column
train_data.loc[train_data['Education'] == "Graduate", 'Education'] = 1
train_data.loc[train_data['Education'] == "Not Graduate", 'Education'] = 0

train_data = train_data.drop("Loan_ID", axis = "columns")
train_data = train_data.rename(columns = {'Gender':'is_male', 'Education': 'Graudation'})
train_data

Unnamed: 0,is_male,Married,Dependents,Graudation,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,1,1,1,1,0,4583,1508.0,128.0,360.0,1.0,Rural,0
2,1,1,0,1,1,3000,0.0,66.0,360.0,1.0,Urban,1
3,1,1,0,0,0,2583,2358.0,120.0,360.0,1.0,Urban,1
4,1,0,0,1,0,6000,0.0,141.0,360.0,1.0,Urban,1
5,1,1,2,1,1,5417,4196.0,267.0,360.0,1.0,Urban,1
...,...,...,...,...,...,...,...,...,...,...,...,...
609,0,0,0,1,0,2900,0.0,71.0,360.0,1.0,Rural,1
610,1,1,3+,1,0,4106,0.0,40.0,180.0,1.0,Rural,1
611,1,1,1,1,0,8072,240.0,253.0,360.0,1.0,Urban,1
612,1,1,2,1,0,7583,0.0,187.0,360.0,1.0,Urban,1


In [3]:
dummy = pd.get_dummies(train_data['Property_Area'], dtype = int)
dummy = dummy.drop("Semiurban", axis = "columns")
dummy

Unnamed: 0,Rural,Urban
1,1,0
2,0,1
3,0,1
4,0,1
5,0,1
...,...,...
609,1,0
610,1,0
611,0,1
612,0,1


In [4]:
dummy2 = pd.get_dummies(train_data['Dependents'], drop_first = True, dtype = int)
dummy2

Unnamed: 0,1,2,3+
1,1,0,0
2,0,0,0
3,0,0,0
4,0,0,0
5,0,1,0
...,...,...,...
609,0,0,0
610,0,0,1
611,1,0,0
612,0,1,0


In [5]:
train_data = train_data.drop(["Dependents", "Property_Area"], axis = "columns")
train_data = pd.concat([train_data, dummy, dummy2], axis = 'columns')
train_data

Unnamed: 0,is_male,Married,Graudation,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_Status,Rural,Urban,1,2,3+
1,1,1,1,0,4583,1508.0,128.0,360.0,1.0,0,1,0,1,0,0
2,1,1,1,1,3000,0.0,66.0,360.0,1.0,1,0,1,0,0,0
3,1,1,0,0,2583,2358.0,120.0,360.0,1.0,1,0,1,0,0,0
4,1,0,1,0,6000,0.0,141.0,360.0,1.0,1,0,1,0,0,0
5,1,1,1,1,5417,4196.0,267.0,360.0,1.0,1,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,0,0,1,0,2900,0.0,71.0,360.0,1.0,1,1,0,0,0,0
610,1,1,1,0,4106,0.0,40.0,180.0,1.0,1,1,0,0,0,1
611,1,1,1,0,8072,240.0,253.0,360.0,1.0,1,0,1,1,0,0
612,1,1,1,0,7583,0.0,187.0,360.0,1.0,1,0,1,0,1,0


# Scaling down the data-points

In [6]:
train_data["ApplicantIncome"] /= max(train_data["ApplicantIncome"])
train_data["CoapplicantIncome"] /= max(train_data["CoapplicantIncome"])
train_data["LoanAmount"] /= max(train_data["LoanAmount"])
train_data["Loan_Amount_Term"] /= max(train_data["Loan_Amount_Term"])
train_data

Unnamed: 0,is_male,Married,Graudation,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_Status,Rural,Urban,1,2,3+
1,1,1,1,0,0.056580,0.044567,0.213333,0.750,1.0,0,1,0,1,0,0
2,1,1,1,1,0.037037,0.000000,0.110000,0.750,1.0,1,0,1,0,0,0
3,1,1,0,0,0.031889,0.069687,0.200000,0.750,1.0,1,0,1,0,0,0
4,1,0,1,0,0.074074,0.000000,0.235000,0.750,1.0,1,0,1,0,0,0
5,1,1,1,1,0.066877,0.124006,0.445000,0.750,1.0,1,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,0,0,1,0,0.035802,0.000000,0.118333,0.750,1.0,1,1,0,0,0,0
610,1,1,1,0,0.050691,0.000000,0.066667,0.375,1.0,1,1,0,0,0,1
611,1,1,1,0,0.099654,0.007093,0.421667,0.750,1.0,1,0,1,1,0,0
612,1,1,1,0,0.093617,0.000000,0.311667,0.750,1.0,1,0,1,0,1,0


# Creating Custom Logistic Regression Model

In [7]:
from math import exp, log

Y = np.array(train_data["Loan_Status"])
X = train_data.drop("Loan_Status", axis = "columns").to_numpy()

sigmoid = lambda x: 1.0 / (1.0 + exp(-x))

In [8]:
learning_rate = 0.001
bias = 0.0
weights = [0.0 for i in X[0]]
get_y_cap = lambda index: sigmoid(np.dot(weights, X[index]) + bias)

# Cost function

In [9]:
def cost():
    m = X.shape[0]
    arr = []
    for i in range(m):
        a = get_y_cap(i)
        arr.append(Y[i] * log(a) + (1 - Y[i]) * log(1 - a))
    return sum(arr) / float(-m)

# Gradient Descent Function

In [10]:
def gradient_descent(weights, bias, learning_rate):
    m = float(X.shape[0])
    y_cap = np.array([get_y_cap(i) for i in range(len(X))])
    difference = np.subtract(y_cap, Y)
    differential = (1.0 / m) * np.matmul(difference, X)
    weights = np.subtract(weights, differential * learning_rate)
    bias -= (learning_rate * (1.0 / m) * np.sum(y_cap))
    return weights, bias

# Training the Model

In [11]:
for i in range(3000):
    try:
        a, b = gradient_descent(weights, bias, learning_rate)
        weights, bias = a, b
        if i % 500 == 0: print(f"Current cost: {cost()}")
    except Exception:
        print(f"Exception at {i}-th iteration")
        break
# print(weights, bias)

Current cost: 0.6930751402341806
Current cost: 0.6650604563776825
Current cost: 0.6474341833846681
Current cost: 0.6354352986158024
Current cost: 0.6268130135289127
Current cost: 0.6204237492903757


In [12]:
y_cap = np.array([get_y_cap(i) for i in range(len(X))])

# Custom Model Predictions

In [13]:
custom_model_predictions = np.array(list(map(lambda x: 0 if x < 0.5 else 1, y_cap)))
custom_model_predictions

array([1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,

# Creating the Logistic Regression Model

In [14]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(random_state = 0)
consume_output = logreg.fit(train_data.drop("Loan_Status", axis = "columns"), Y.astype('int'))

# Scikit-learn Model Predictions

In [15]:
scikit_learn_predictions = logreg.predict(train_data.drop("Loan_Status", axis = "columns"))
scikit_learn_predictions

array([1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

# Actual Values

In [16]:
Y.astype('int')

array([0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1,
       0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1,
       0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1,
       0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1,

# Testing Model Accuracy
A accuracy of a logistic regresssion model can be determined in 2 ways:
- Precision: Precision is calculated as the ratio of true positives (correctly predicted positives) to the total number of items predicted by the model as positive (true positives + false positives).
- Recall: Recall is calculated as the ratio of true positives to the total number of actual positive items (true positives + false negatives).

In [17]:
total_truth_Y = np.sum(Y)
custom_model_total_truths = np.sum(custom_model_predictions)
scikit_learn_total_truths = np.sum(scikit_learn_predictions)
custom_model_truths, scikit_learn_truths = 0, 0
custom_model_score, scikit_learn_score = 0, 0
custom_model_precision, scikit_learn_precision = 0, 0

for i in range(X.shape[0]):
    if custom_model_predictions[i] == 1 and int(Y[i]) == 1: custom_model_precision += 1
    if scikit_learn_predictions[i] == 1 and int(Y[i]) == 1: scikit_learn_precision += 1
    
    custom_model_score += (1 if custom_model_predictions[i] == int(Y[i]) else 0)
    scikit_learn_score += (1 if scikit_learn_predictions[i] == int(Y[i]) else 0)

pc1 = custom_model_precision / custom_model_total_truths
pc2 = scikit_learn_precision / scikit_learn_total_truths

rc1 = custom_model_precision / total_truth_Y
rc2 = scikit_learn_precision / total_truth_Y

sc1 = float(custom_model_score) / float(X.shape[0]) * 100.0
sc2 = float(scikit_learn_score) / float(X.shape[0]) * 100.0

print(f"Precision\nCustom Model Precision: {round(pc1 * 100, 2)} %")
print(f"Scikit-learn Precision: {round(pc2 * 100, 2)} %\n")

print(f"Recall\nCustom Model Recall: {round(rc1 * 100, 2)} %")
print(f"Scikit-learn Recall: {round(rc2 * 100, 2)} %\n")

print(f"Score\nCustom Model Score: {custom_model_score} ({round(sc1,2)} %)")
print(f"Scikit-learn Model Score: {scikit_learn_score} ({round(sc2,2)} %)")

Precision
Custom Model Precision: 79.82 %
Scikit-learn Precision: 79.46 %

Recall
Custom Model Recall: 81.02 %
Scikit-learn Recall: 97.89 %

Score
Custom Model Score: 349 (72.71 %)
Scikit-learn Model Score: 389 (81.04 %)


***