# **Risk Status Study Notebook**

## General Idea

* A company aims to revolutionize its loan qualification process by automating and streamlining the assessment of online loan applications in real-time. The focus is on employing machine learning (ML) models that can accurately predict whether a loan should be approved for an applicant, based on the information provided during the application process. This initiative seeks to expedite the decision-making process, ensuring quick and efficient determination of loan eligibility.

## Objectives

* Business Requirement:
Conduct exploratory data analysis (EDA) to understand patterns and trends in the data, such as the impact of education level, marital status, and income on loan approval. Analyze the given data to identify common characteristics of applicants who default versus those who repay successfully, and draw conclusions from the findings.


## Inputs

* outputs/datasets/collection/GermanCreditData.csv

## Outputs

* Generate code and plots that highlight the key patterns of the dataset. The output should provide evidence that addresses the business requirement.

## Additional Comments

Risk 
*   "0" = "bad applicant"
*   "1" = "good applicant"

Content

*   Age (numeric)
*   Sex (text: male, female)
*   Job (numeric: 0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled)
*   Housing (text: own, rent, or free)
*   Saving accounts (text - little, moderate, quite rich, rich)
*   Checking account (numeric, in DM - Deutsche Mark)
*   Credit amount (numeric, in DM)
*   Duration (numeric, in months)
*   Purpose (text: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)
*   Risk (Value target - Good or Bad Risk)



---

# Change working directory

* We assume the notebooks are stored in a subfolder. Therefore, when running the notebook in the editor, you need to change the working directory.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Load Data

The `Unnamed: 0` column only holds a unique value per row (a saved index) and is therefore dropped for the further analysis.

In [None]:
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/GermanCreditData.csv")
    .drop(['Unnamed: 0'], axis=1)
    )
df.head(3)

---

# Data Exploration

We want to become more familiar with the dataset: check variable types and distributions, the level of missing data, and what these variables mean in a business context.

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

---

# Handling of missing Data

Handling missing data is crucial for the upcoming Correlation Study to work. Several strategies exist:
* Dropping missing values
* Imputation, e.g. with the mean or median
* Predictive imputation

In the following, a simple imputation technique is used: missing values are replaced with the `median` for numerical variables and with the `most_frequent` value for categorical variables.

In [None]:
from sklearn.impute import SimpleImputer

# For numerical variables
imputer_num = SimpleImputer(strategy='median')
numerical_vars = df.select_dtypes(include=['int64', 'float64']).columns
df[numerical_vars] = imputer_num.fit_transform(df[numerical_vars])

# For categorical variables
imputer_cat = SimpleImputer(strategy='most_frequent')
categorical_vars = df.select_dtypes(include=['object']).columns
df[categorical_vars] = imputer_cat.fit_transform(df[categorical_vars])
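
As a quick sanity check on the imputation idea above, the following sketch counts missing values before and after filling. It uses a small made-up DataFrame and plain pandas (`fillna` with the median and the mode) as a stand-in for `SimpleImputer`. The column values are illustrative assumptions, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one gap in a numerical and one in a categorical column
toy = pd.DataFrame({
    "Age": [25.0, np.nan, 35.0, 45.0],
    "Saving accounts": ["little", "little", None, "rich"],
})

missing_before = toy.isna().sum()

# Median for numerical columns, most frequent value for the remaining (categorical) columns
num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())
toy[cat_cols] = toy[cat_cols].fillna(toy[cat_cols].mode().iloc[0])

missing_after = toy.isna().sum()
print(missing_before, missing_after, sep="\n\n")
```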

---

# Correlation Study

In [None]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(df)
print(df_ohe.shape)
df_ohe.head(3)

Make use of `.corr()` for `spearman` and `pearson` methods and investigate the top 10 correlations
* This command returns a pandas Series. After sorting, the first item is the correlation of the target with itself, therefore the target (`Risk`) is excluded with `[1:]`.
* We sort values considering the absolute value, by setting `key=abs`

In [None]:
corr_spearman = df_ohe.corr(method='spearman')['Risk'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman

Same goes for `pearson`

In [None]:
corr_pearson = df_ohe.corr(method='pearson')['Risk'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson
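
To see what `sort_values(key=abs, ascending=False)[1:]` does, here is a minimal sketch on a hand-made Series. The correlation values are invented for illustration and are not results from the dataset.

```python
import pandas as pd

# Hypothetical correlations with the target; 'Risk' correlates perfectly with itself
corr = pd.Series({"Duration": -0.2, "Risk": 1.0, "Age": 0.1, "Credit amount": -0.15})

# Order by magnitude while keeping the sign, then skip the first entry (the target itself)
top = corr.sort_values(key=abs, ascending=False)[1:]
print(top)
```

Sorting by absolute value keeps strong negative correlations near the top instead of pushing them to the end of the list.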

Both methods show only weak correlations between Risk and the other variables, which indicates weak linear (Pearson) and weak rank-order (Spearman) relationships with the target variable. A broader and more nuanced approach to EDA could provide deeper insights.
Therefore, a bivariate and a multivariate analysis follow.
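
The difference between the two coefficients can be illustrated with a small sketch on synthetic data: for a monotonic but non-linear relationship, Spearman (computed here as Pearson on the ranks) reaches 1 while Pearson stays below it. The data is made up for illustration only.

```python
import pandas as pd

# Synthetic, strictly increasing but non-linear relationship
x = pd.Series(range(1, 11), dtype=float)
y = x ** 3

# Pearson measures linear association
pearson = x.corr(y, method="pearson")

# Spearman is the Pearson correlation of the ranks, so it measures monotonic association
spearman = x.rank().corr(y.rank(), method="pearson")

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```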

---

# Variable Distributions by Risk

## Bivariate Analysis

*   For categorical variables, use grouped bar charts to see how the categories relate to the target variable. 
*   For continuous variables, use histograms with a kernel density estimate (KDE) overlay to compare the distributions of the two Risk classes.
*   The distribution is plotted (numerical and categorical) coloured by Risk.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
vars_to_study = [col for col in df.columns if col != 'Risk']

def plot_categorical(df, col, target_var):

    plt.figure(figsize=(12, 5))
    sns.countplot(data=df, x=col, hue=target_var, order=df[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.title(f"{col} vs Risk Distribution", fontsize=20, y=1.05)
    plt.show()


def plot_numerical(df, col, target_var):
    plt.figure(figsize=(8, 5))
    sns.histplot(data=df, x=col, hue=target_var, kde=True, element="step")
    plt.title(f"{col} vs Risk Distribution", fontsize=20, y=1.05)
    plt.show()


target_var = 'Risk'
for col in vars_to_study:
    if df[col].dtype == 'object':
        plot_categorical(df, col, target_var)
        print("\n\n")
    else:
        plot_numerical(df, col, target_var)
        print("\n\n")
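
The count plots can be complemented with a numeric view. The sketch below uses `pd.crosstab` with `normalize="index"` to get the share of good and bad applicants within each category. The toy data is hypothetical; on the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Hypothetical toy data using the notebook's coding: 1 = good applicant, 0 = bad applicant
toy = pd.DataFrame({
    "Housing": ["own", "own", "own", "own", "rent", "rent", "free", "free"],
    "Risk":    [1, 1, 1, 0, 1, 0, 0, 0],
})

# Row-normalised cross-tabulation: each row sums to 1
shares = pd.crosstab(toy["Housing"], toy["Risk"], normalize="index")
print(shares)
```

Shares are easier to compare across categories of very different size than raw counts.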


## Multivariate Analysis

Finding "common sense" combinations for multivariate analysis is challenging because it requires a deep understanding of the domain and of the relationships between multiple variables. Features can interact in non-obvious ways to influence the outcome, which is hard to anticipate without knowledge of the specific context and the underlying data.

*   For simplicity, let's focus on 'Housing' as an example of financial stability.
*   For a second set of plots, we focus on Purpose in combination with Credit amount and Duration.

In [None]:
sns.catplot(x="Housing", hue="Risk", col="Saving accounts", kind="count", data=df, aspect=.8, col_wrap=4)
plt.subplots_adjust(top=0.9)
plt.suptitle("Risk Distribution by Housing and Saving Account Status")
plt.show()
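
As a tabular counterpart to the faceted count plot, the following sketch computes the share of good applicants and the group size for each combination of two categorical variables. The data is invented for illustration, and the mean works as a share only because Risk is coded 0/1.

```python
import pandas as pd

# Hypothetical toy data; Risk: 1 = good applicant, 0 = bad applicant
toy = pd.DataFrame({
    "Housing":         ["own", "own", "own", "rent", "rent", "rent"],
    "Saving accounts": ["little", "little", "rich", "little", "little", "rich"],
    "Risk":            [1, 0, 1, 0, 0, 1],
})

# Share of good applicants (mean of 0/1) and number of applicants per combination
summary = (
    toy.groupby(["Housing", "Saving accounts"])["Risk"]
       .agg(good_share="mean", n="count")
)
print(summary)
```

Small group sizes (`n`) are a reason to read the corresponding shares with caution.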

In [None]:
g = sns.FacetGrid(df, col="Purpose", hue="Risk", margin_titles=True, height=3.5, col_wrap=4)
g.map(sns.scatterplot, "Duration", "Credit amount")
g.add_legend()
g.set_axis_labels("Duration", "Credit Amount")
g.set_titles("{col_name}")
plt.show()

# Conclusions

The correlation analysis revealed only weak correlations. The bivariate and multivariate analyses lead to the following assumptions:

Homeownership and Loan Applications:
*   Homeowners are more likely to take out loans than renters or those without rental expenses. This could be due to homeowners feeling more financially secure or needing loans for home maintenance and improvements.

Purpose of Loans - Cars:
*   A notable number of loans are for purchasing vehicles. This indicates that car loans are popular, possibly because of the essential role cars play in personal transportation or the attractive financing options available for these purchases.

Savings Accounts and Loan Behavior:
*   Individuals with smaller savings accounts are more inclined to apply for loans, possibly due to greater financial needs. In contrast, those with more substantial savings not only seek fewer loans but also have a higher rate of repayment. This suggests that people with larger savings are in a better financial position, leading to more prudent borrowing and repayment practices.