# Dataset Information

Dream Housing Finance deals in all types of home loans and has a presence across urban, semi-urban, and rural areas. A customer first applies for a home loan, and the company then validates the customer's eligibility. The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate this process, the company has posed the problem of identifying the customer segments that are eligible for a loan, so that it can specifically target these customers.

This is a standard supervised classification task: we have to predict whether a loan will be approved or not. The dataset attributes are described below.
   
Variable | Description
----------|--------------
Loan_ID | Unique Loan ID
Gender | Male/ Female
Married | Applicant married (Y/N)
Dependents | Number of dependents
Education | Applicant Education (Graduate/ Under Graduate)
Self_Employed | Self employed (Y/N)
ApplicantIncome | Applicant income
CoapplicantIncome | Coapplicant income
LoanAmount | Loan amount in thousands
Loan_Amount_Term | Term of loan in months
Credit_History | Credit history meets guidelines
Property_Area | Urban/ Semi Urban/ Rural
Loan_Status | Loan approved (Y/N)

**Download link:** https://www.kaggle.com/altruistdelhite04/loan-prediction-problem-dataset

#### Load the dataset

In [None]:
import pandas as pd

In [None]:
train = pd.read_csv('train.csv')

#### Analyse the dataset

In [None]:
train.head()

In [None]:
train.tail()

In [None]:
train.describe()

In [None]:
train.info()

In [None]:
train.isna().sum()

#### Fill na/null values

In [None]:
# Preview only: fillna returns a new Series and leaves `train` unchanged
train['Gender'].fillna(
    value=train['Gender'].mode()[0]
)

In [None]:
# Preview only: the result is displayed, not assigned back to `train`
train['LoanAmount'].fillna(
    value=train['LoanAmount'].median()
)

In [None]:
df = train.copy()

In [None]:
df.isna().sum()

In [None]:
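def fill_na_values(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values: mode for object columns, median for numeric ones."""

    for feature in df:

        if df[feature].isna().sum() > 0:

            if df[feature].dtype == 'object':
                # Categorical feature: use the most frequent value
                fill_value = df[feature].mode()[0]

            else:
                # Numerical feature: use the median
                fill_value = df[feature].median()

            # Assign back instead of calling fillna(inplace=True) on a column
            # selection, which newer pandas versions warn about and may not apply
            df[feature] = df[feature].fillna(value=fill_value)

    return df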

In [None]:
df = fill_na_values(df)

In [None]:
df.isna().sum()
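
The imputation rule used above (most frequent value for categorical columns, median for numerical ones) can be checked on a small toy frame. This is only a sketch: the values below are made up for illustration, and it uses `pd.api.types.is_numeric_dtype` as a swapped-in way of telling numeric from non-numeric columns rather than the `dtype == 'object'` check in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with one categorical and one numerical column (assumed values)
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', np.nan],
    'LoanAmount': [100.0, np.nan, 120.0, 500.0],
})

filled = toy.copy()
for column in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[column]):
        # Median of [100, 120, 500] is 120; the mean would be pulled up to 240
        filled[column] = filled[column].fillna(filled[column].median())
    else:
        # Most frequent category ('Male' appears twice)
        filled[column] = filled[column].fillna(filled[column].mode()[0])

print(filled)
```

The median is a reasonable default for skewed columns such as loan amounts, because a single large value does not shift it the way it shifts the mean.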

#### Note

The `Dependents` feature is numerical but contains characters other than digits (for example `3+`), so we use a regular expression to keep only the digits and then convert the column to `int64`.
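
As a quick illustration of this cleaning step, the sketch below applies a digits-only regular expression to a toy Series. The sample values and the helper name `to_int` are assumptions made for the example, not output from the notebook.

```python
import re

import pandas as pd

# Toy stand-in for the `Dependents` column (assumed sample values)
dependents = pd.Series(['0', '1', '2', '3+'])


def to_int(value: str) -> int:
    """Drop every non-digit character, then convert to int (hypothetical helper)."""
    return int(re.sub(r'[^0-9]', '', value))


cleaned = dependents.apply(to_int)
print(cleaned.tolist())  # [0, 1, 2, 3]
```

Note that `3+` collapses to `3`, so "three or more" is treated as exactly three. A value containing no digits at all would make `int('')` raise a `ValueError`, which is why missing values need to be filled before this step.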

In [None]:
import re

In [None]:
def remove_unwanted_characters(value: str) -> int:

    value = re.sub(r'[^0-9]', '', value)

    return int(value)

In [None]:
df['Dependents'] = df['Dependents'].apply(remove_unwanted_characters)

In [None]:
df['Dependents'] = df['Dependents'].astype(dtype='int64')

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df.drop(
    labels=['Loan_ID'],
    inplace=True,
    axis=1
)

In [None]:
for feature in df:

    if df[feature].dtype == 'object':
        sns.countplot(
            data=df,
            x=feature
        )
    else:
        sns.displot(
            data=df,
            x=feature,
            kde=True
        )

    plt.title(f'Count vs {feature}')
    plt.show()

In [None]:
df.info()

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
for feature in df:

    le = LabelEncoder()

    if df[feature].dtype == 'object':

        df[feature] = le.fit_transform(df[feature])

In [None]:
df.info()

In [None]:
df.head()
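
To interpret the encoded columns it helps to know which integer each category received. The sketch below, on assumed toy values for `Loan_Status`, shows that `LabelEncoder` sorts the classes before numbering them.

```python
from sklearn.preprocessing import LabelEncoder

# Toy target values (assumed); the real column is `Loan_Status`
status = ['Y', 'N', 'Y', 'Y']

le = LabelEncoder()
encoded = le.fit_transform(status)

# classes_ is sorted, so index 0 is 'N' and index 1 is 'Y'
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)           # {'N': 0, 'Y': 1}
print(encoded.tolist())  # [1, 0, 1, 1]
```

Because `Y` maps to 1, metrics that default to `pos_label=1`, such as `precision_score`, treat approved loans as the positive class.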

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, confusion_matrix

In [None]:
def run_model(model, df):
    """Fit `model` on an 80/20 split of `df`, print precision, return the confusion matrix."""

    X = df.drop(
        labels=['Loan_Status'],
        axis=1
    ).values

    y = df['Loan_Status'].values

    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.2,
        random_state=2022
    )

    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    score = precision_score(y_test, y_pred)

    print(f'Precision score -> {score}')

    cm = confusion_matrix(y_test, y_pred)

    return cm

In [None]:
lr = LogisticRegression(max_iter=500)
cm = run_model(lr, df)

In [None]:
# annot/fmt print the integer count inside each cell of the confusion matrix
sns.heatmap(cm, annot=True, fmt='d')
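
The precision score and the confusion matrix are two views of the same predictions. As a minimal sketch on assumed toy labels (1 = approved, 0 = rejected), precision can be recomputed from the matrix as TP / (TP + FP):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

# Toy labels (assumed values): 1 = approved, 0 = rejected
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 1, 1])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Precision: of all loans predicted as approved, the share truly approved
precision_from_cm = tp / (tp + fp)

print(cm)
print(precision_from_cm, precision_score(y_true, y_pred))
```

Here 4 of the 6 predicted approvals are correct, so both values are 4/6. Precision ignores false negatives, so reading it alongside the full matrix gives a more complete picture.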

#### Enhancement

In [None]:
import numpy as np

In [None]:
df['Total_Income'] = df['ApplicantIncome'] + df['CoapplicantIncome']

In [None]:
# log1p(x) computes log(x + 1), which stays finite when a value is 0
df['ApplicantIncome'] = np.log1p(df['ApplicantIncome'])
df['CoapplicantIncome'] = np.log1p(df['CoapplicantIncome'])
df['LoanAmount'] = np.log1p(df['LoanAmount'])
df['Total_Income'] = np.log1p(df['Total_Income'])
df['Loan_Amount_Term'] = np.log1p(df['Loan_Amount_Term'])
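
The log(x + 1) transform applied above is commonly used to compress right-skewed features such as income. The sketch below uses assumed toy incomes to show the effect on skewness. It does not show whether the transform improves the model on this dataset.

```python
import numpy as np
import pandas as pd

# Toy right-skewed incomes (assumed values) with one very large entry
income = pd.Series([1500.0, 2500.0, 4000.0, 6000.0, 81000.0])

# np.log1p(x) is log(x + 1), so a zero income maps to 0 instead of -inf
log_income = np.log1p(income)
zero_case = np.log1p(0.0)

# Sample skewness before and after the transform
print(income.skew(), log_income.skew())
print(zero_case)  # 0.0
```

The + 1 matters for `CoapplicantIncome`, where a value of zero would otherwise produce negative infinity.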

In [None]:
lr = LogisticRegression(max_iter=500)
run_model(lr, df)

In [None]:
corr = df.corr()

In [None]:
plt.figure(figsize=(20, 10))
sns.heatmap(corr)