# **Loan Status Study Notebook**

## General Idea

* A company aims to revolutionize its loan qualification process by automating and streamlining the assessment of online loan applications in real-time. The focus is on employing machine learning (ML) models that can accurately predict whether a loan should be approved for an applicant, based on the information provided during the application process. This initiative seeks to expedite the decision-making process, ensuring quick and efficient determination of loan eligibility.

## Objectives

* Business Requirement:
Conduct exploratory data analysis (EDA) to understand patterns and trends in the data, such as the impact of education level, marital status, and income on loan approval. Analyze the given data to identify common characteristics of applicants who default versus those who repay successfully, and draw conclusions from the findings.


## Inputs

* outputs/datasets/collection/LoanStatusPrediction.csv

## Outputs

* Generate code and plots that highlight the key patterns of the dataset. The output should provide evidence that addresses the business requirement.

## Additional Comments

* None


---

# Change working directory

* We assume the notebooks are stored in a subfolder. When running a notebook in the editor, you therefore need to change the working directory.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Load Data

The Loan_ID variable is a unique identifier for each application and carries no predictive information, so it is dropped for the rest of the analysis.

In [None]:
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/LoanStatusPrediction.csv")
    .drop(['Loan_ID'], axis=1)
    )
df.head(3)

---

# Data Exploration

We want to become more familiar with the dataset: check variable types and distributions, the level of missing data, and what each variable means in a business context.

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()
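
In addition to the profile report, the level of missing data can be summarised in a small table. The following is a minimal sketch on a made-up toy frame (the values are hypothetical and only the column names are borrowed from this notebook); on the real data, `toy` would be replaced by `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the loan data (values are made up)
toy = pd.DataFrame({
    "Married": ["Yes", np.nan, "No", "Yes"],
    "CoapplicantIncome": [0.0, np.nan, 1500.0, np.nan],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
})

# One row per column: its dtype and the share of missing values in percent
missing_summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing_pct": toy.isna().mean().mul(100).round(1),
}).sort_values("missing_pct", ascending=False)

print(missing_summary)
```

Sorting by `missing_pct` puts the columns that need the most attention at the top.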

---

# Handling of Missing Data

Handling missing data is required for the upcoming Correlation Study to work. Several strategies exist:
* Dropping missing values
* Imputation, e.g. with the help of mean/ median
* Predictive Imputation

A simple imputation technique is used below: the `median` for numerical variables and the `most_frequent` value for categorical variables. Only missing values are replaced; existing values stay unchanged.

In [None]:
from sklearn.impute import SimpleImputer

# For numerical variables
imputer_num = SimpleImputer(strategy='median')
numerical_vars = df.select_dtypes(include=['int64', 'float64']).columns
df[numerical_vars] = imputer_num.fit_transform(df[numerical_vars])

# For categorical variables
imputer_cat = SimpleImputer(strategy='most_frequent')
categorical_vars = df.select_dtypes(include=['object']).columns
df[categorical_vars] = imputer_cat.fit_transform(df[categorical_vars])
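
To make the effect of the two strategies concrete, here is a pandas-only sketch on a made-up toy frame. It is an illustration of median and most-frequent imputation, not the `SimpleImputer` code path itself, and the values are hypothetical. It also shows a quick check that no missing values remain afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are made up)
toy = pd.DataFrame({
    "CoapplicantIncome": [0.0, 1500.0, np.nan, 4000.0],
    "Married": ["Yes", "No", "Yes", np.nan],
})

num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

imputed = toy.copy()
# Numerical columns: fill gaps with the column median
imputed[num_cols] = toy[num_cols].fillna(toy[num_cols].median())
# Categorical columns: fill gaps with the most frequent value (first mode)
imputed[cat_cols] = toy[cat_cols].fillna(toy[cat_cols].mode().iloc[0])

# Sanity check: total number of missing values after imputation
remaining = int(imputed.isna().sum().sum())
print(imputed)
print("missing values left:", remaining)
```

The same check, `df.isna().sum().sum()`, can be run on `df` after the imputation cell above.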

---

# Correlation Study

In [None]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(df)
print(df_ohe.shape)
df_ohe.head(3)

Use `.corr()` with the `spearman` and `pearson` methods and investigate the top 10 correlations.
* The command returns a pandas Series. After sorting, its first item is the correlation of the target with itself, therefore the target (Loan_Status) is excluded with `[1:]`.
* We sort by absolute value by setting `key=abs`, so strong negative correlations rank as high as strong positive ones.

In [None]:
corr_spearman = df_ohe.corr(method='spearman')['Loan_Status'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman

The same is done for `pearson`

In [None]:
corr_pearson = df_ohe.corr(method='pearson')['Loan_Status'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson
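
The positional slice `[1:]` in both cells relies on the target's self-correlation being sorted into first place. A label-based variant avoids that assumption. The sketch below uses a hypothetical helper, `top_correlations`, and a made-up toy frame; it is one possible formulation, not the notebook's own code.

```python
import pandas as pd

# Made-up numeric frame; only the column names are borrowed from this notebook
toy = pd.DataFrame({
    "Loan_Status": [1, 0, 1, 1, 0, 1],
    "Credit_History": [1, 0, 1, 1, 0, 0],
    "CoapplicantIncome": [0, 2000, 1500, 0, 0, 3000],
})


def top_correlations(frame, target, method="spearman", n=10):
    """Return the n strongest correlations with target, ranked by absolute value."""
    return (frame.corr(method=method)[target]
            .drop(target)  # remove the self-correlation by label, not by position
            .sort_values(key=abs, ascending=False)
            .head(n))


top = top_correlations(toy, "Loan_Status", method="pearson", n=2)
print(top)
```

On the real data the call would be `top_correlations(df_ohe, "Loan_Status", method="spearman")`.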

Both methods show weak to moderate levels of correlation between Loan_Status and the other variables.
* `Credit_History` shows the highest correlation with both methods, which is also plausible from a causal point of view.

The top five correlations of each method are taken from `df_ohe`, and the associated variables are then studied in `df`.

In [None]:
top_n = 5
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

### Addressing the Top 5 Selection
The goal was to select the top five variables, but combining the two top-five lists yields six encoded columns.
Therefore a more logical selection is applied:
1. Since Credit_History appears as the most correlated variable in both analyses, it should definitely be included.
2. Married_Yes and Married_No are two encodings of the same feature, so they are counted as one variable, `Married`.
3. Property_Area should be considered as one variable, although one-hot encoding has split it into Semiurban and Rural. Counting both as separate entries would be redundant in a top-five analysis that aims for a diverse set of features.
4. CoapplicantIncome shows a notable correlation in the Spearman analysis and represents a different aspect of the applicant's financial situation. It is therefore reasonable to include it for further study, even though it pushes the combined list beyond five entries.
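
The grouping of encoded columns back into their source variables, as reasoned above, can also be done programmatically. This is a minimal sketch with a hypothetical helper, `source_variable`. It assumes that encoded columns are named `<variable>_<category>`, and the encoded names listed here are reconstructed from the discussion above rather than copied from the notebook output.

```python
def source_variable(encoded_name, original_columns):
    """Return the original column that an encoded column derives from."""
    # Try longer names first so the most specific original column wins
    for col in sorted(original_columns, key=len, reverse=True):
        if encoded_name == col or encoded_name.startswith(col + "_"):
            return col
    return encoded_name  # no match: leave the name unchanged


original_columns = ["Credit_History", "CoapplicantIncome", "Married", "Property_Area"]

# Assumed encoded names, following the <variable>_<category> pattern
encoded_top = ["Credit_History", "Married_Yes", "Married_No",
               "Property_Area_Semiurban", "Property_Area_Rural", "CoapplicantIncome"]

# dict.fromkeys removes duplicates while keeping first-seen order
vars_found = list(dict.fromkeys(source_variable(c, original_columns) for c in encoded_top))
print(vars_found)
```

With these inputs the six encoded columns collapse into four source variables.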

### The following four variables will therefore be studied further:

In [None]:
vars_to_study = ['Credit_History', 'CoapplicantIncome', 'Married', 'Property_Area']
vars_to_study

---

# EDA on selected variables

In [None]:
df_eda = df.filter(vars_to_study + ['Loan_Status'])
df_eda.head(3)

## Variables Distribution by Loan_Status

The distribution is plotted (numerical and categorical) coloured by Loan_Status.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')


def plot_categorical(df, col, target_var):
    """Count plot of a categorical column, split by the target."""
    plt.figure(figsize=(12, 5))
    # Order the bars from the most to the least frequent category
    sns.countplot(data=df, x=col, hue=target_var, order=df[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()


def plot_numerical(df, col, target_var):
    """Histogram with a KDE overlay of a numerical column, split by the target."""
    plt.figure(figsize=(8, 5))
    sns.histplot(data=df, x=col, hue=target_var, kde=True, element="step")
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()


target_var = 'Loan_Status'
for col in vars_to_study:
    if df_eda[col].dtype == 'object':
        plot_categorical(df_eda, col, target_var)
        print("\n\n")
    else:
        plot_numerical(df_eda, col, target_var)
        print("\n\n")


# Conclusions

The correlation analysis and the plots point to consistent findings. These are associations observed in the data, not proven causal effects:

* A good credit history is associated with a higher chance of loan approval.
* Applications without any co-applicant income tend to be approved less often.
* Married applicants tend to be approved more often than unmarried applicants.
* Applicants from semiurban areas are more likely to be approved than those from other property areas.
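
The findings above can be backed with numbers by computing the approval rate per group. The sketch below uses a made-up toy frame and assumes Loan_Status is encoded as 1 (approved) and 0 (not approved), which is consistent with its use in the correlation study; the values themselves are hypothetical.

```python
import pandas as pd

# Hypothetical toy data (values are made up); 1 = approved, 0 = not approved
toy = pd.DataFrame({
    "Credit_History": [1, 1, 1, 0, 0, 1, 0, 1],
    "Loan_Status":    [1, 1, 0, 0, 0, 1, 1, 1],
})

# Number of applications and share of approvals for each Credit_History value
approval_rate = (toy.groupby("Credit_History")["Loan_Status"]
                 .agg(applications="size", approval_rate="mean"))
print(approval_rate)
```

Replacing `toy` with `df_eda` and looping over `vars_to_study` would give one such table per categorical variable; a continuous variable such as CoapplicantIncome would first need to be binned, for example into zero versus non-zero income.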