# **Risk Status Study Notebook**

## General Idea

* A company aims to revolutionize its loan qualification process by automating and streamlining the assessment of online loan applications in real-time. The focus is on employing machine learning (ML) models that can accurately predict whether a loan should be approved for an applicant, based on the information provided during the application process. This initiative seeks to expedite the decision-making process, ensuring quick and efficient determination of loan eligibility.

## Objectives

* Business Requirement:
Conduct exploratory data analysis (EDA) to understand patterns and trends in the data, such as the impact of education level, marital status, and income on loan approval. Analyze the given data to identify common characteristics of applicants who default versus those who repay successfully, and draw conclusions from the findings.


## Inputs

* outputs/datasets/collection/GermanCreditData.csv

## Outputs

* Generate code and plots to highlight the key patterns of the dataset. The output should provide evidence that addresses the business requirement.

## Additional Comments

Risk 
*   "0" = "bad applicant"
*   "1" = "good applicant"

Content

*   Age (numeric)
*   Sex (text: male, female)
*   Job (numeric: 0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled)
*   Housing (text: own, rent, or free)
*   Saving accounts (text - little, moderate, quite rich, rich)
*   Checking account (numeric, in DM - Deutsch Mark)
*   Credit amount (numeric, in DM)
*   Duration (numeric, in months)
*   Purpose (text: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)
*   Risk (Value target - Good or Bad Risk)



---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Load Data

The `Unnamed: 0` column is a unique row identifier with no predictive value and is therefore dropped for the further analysis.

In [None]:
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/GermanCreditData.csv")
    .drop(['Unnamed: 0'], axis=1)
    )
df.head(3)

---

# Data Exploration

We want to get familiar with the dataset: check variable types and distributions, missing levels, and what the variables mean in a business context.

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()
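
As a lightweight cross-check of the profile report, the variable types and missing levels can also be tabulated directly with pandas. The sketch below is illustrative only: the toy frame `demo` and its values are made up, and in this notebook `df` would be used in its place.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df (values are illustrative only)
demo = pd.DataFrame({
    "Age": [25, 40, np.nan, 33],
    "Housing": ["own", None, "rent", "own"],
    "Risk": [1, 0, 1, 1],
})

# One row per column: dtype, number of missing values, share of missing values
summary = pd.DataFrame({
    "dtype": demo.dtypes.astype(str),
    "n_missing": demo.isna().sum(),
    "pct_missing": demo.isna().mean().mul(100).round(1),
})
print(summary)
```

Columns with a high `pct_missing` are the ones where the choice of imputation strategy matters most.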

---

# Handling of missing Data

Handling missing data is crucial for the upcoming Correlation Study to work. Several strategies exist:
* Dropping missing values
* Imputation, e.g. with the help of mean/median
* Predictive Imputation

In the following, a simple imputation technique is used: `median` for numerical variables and `most_frequent` for categorical variables.

In [None]:
from sklearn.impute import SimpleImputer

# For numerical variables
imputer_num = SimpleImputer(strategy='median')
numerical_vars = df.select_dtypes(include=['int64', 'float64']).columns
df[numerical_vars] = imputer_num.fit_transform(df[numerical_vars])

# For categorical variables
imputer_cat = SimpleImputer(strategy='most_frequent')
categorical_vars = df.select_dtypes(include=['object']).columns
df[categorical_vars] = imputer_cat.fit_transform(df[categorical_vars])
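
To make the two strategies concrete, here is a minimal sketch of the same rules written in plain pandas rather than with `SimpleImputer`. The frame `toy` and its values are hypothetical. The median is used for numbers because it is robust to outliers such as the age of 100 below.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 100.0],
    "Saving accounts": ["little", "little", np.nan, "rich"],
})

filled = toy.copy()

# Numerical column: fill with the median of the observed values (30.0 here)
filled["Age"] = toy["Age"].fillna(toy["Age"].median())

# Categorical column: fill with the most frequent observed value ("little" here)
most_frequent = toy["Saving accounts"].mode().iloc[0]
filled["Saving accounts"] = toy["Saving accounts"].fillna(most_frequent)

print(filled)
```

Note that imputing before any train/test split lets information from all rows influence the fill values. That is acceptable for EDA, but a modelling pipeline would usually fit the imputer on the training set only.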

---

# Correlation Study

In [None]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(df)
print(df_ohe.shape)
df_ohe.head(3)

Make use of `.corr()` for `spearman` and `pearson` methods and investigate the top 10 correlations
* This command returns a pandas Series. After sorting, the first item is the correlation of the target with itself, therefore the target (`Risk`) gets excluded with `[1:]`.
* We sort values considering the absolute value, by setting `key=abs`

In [None]:
corr_spearman = df_ohe.corr(method='spearman')['Risk'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman

Same goes for `pearson`

In [None]:
corr_pearson = df_ohe.corr(method='pearson')['Risk'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson
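
The `[1:]` slice relies on the target sorting to the first position, which holds because its self-correlation is 1.0. A slightly more explicit variant drops the target by label instead. The sketch below uses a hypothetical frame `toy` in place of `df_ohe`.

```python
import pandas as pd

# Hypothetical toy data standing in for df_ohe
toy = pd.DataFrame({
    "Risk":     [1, 0, 1, 1, 0, 1, 0, 1],
    "Duration": [6, 48, 12, 10, 36, 24, 30, 9],
    "Age":      [45, 22, 30, 51, 25, 38, 41, 29],
})

corr_to_target = (
    toy.corr(method="spearman")["Risk"]
    .drop("Risk")                            # remove the target's self-correlation by label
    .sort_values(key=abs, ascending=False)   # strongest relationships first, regardless of sign
    .head(10)
)
print(corr_to_target)
```

Dropping by label keeps working even if another column happens to correlate perfectly with the target and ties for first place.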

Both methods show only weak correlations between Risk and any single variable, which indicates weak linear (Pearson) and rank-order (Spearman) relationships with the target variable. A broader and more nuanced approach to EDA could provide deeper insights.
Therefore a Bivariate Analysis and a Multivariate Analysis follow.

---

# Variables Distribution by ['Risk']

## Bivariate Analysis

*   For categorical variables, use grouped bar charts to see how the categories relate to the target variable. 
*   For continuous variables, histograms with a KDE overlay show how the distribution differs between the two risk classes.
*   The distribution is plotted (numerical and categorical) coloured by Risk.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
vars_to_study = [col for col in df.columns if col != 'Risk']

def plot_categorical(df, col, target_var):

    plt.figure(figsize=(12, 5))
    sns.countplot(data=df, x=col, hue=target_var, order=df[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.title(f"{col} vs Risk Distribution" , fontsize=20, y=1.05)
    plt.show()


def plot_numerical(df, col, target_var):
    plt.figure(figsize=(8, 5))
    sns.histplot(data=df, x=col, hue=target_var, kde=True, element="step")
    plt.title(f"{col} vs Risk Distribution", fontsize=20, y=1.05)
    plt.show()


target_var = 'Risk'
for col in vars_to_study:
    if df[col].dtype == 'object':
        plot_categorical(df, col, target_var)
        print("\n\n")
    else:
        plot_numerical(df, col, target_var)
        print("\n\n")
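
Count plots show absolute numbers, which makes categories of different sizes hard to compare. A row-normalised crosstab can complement them by showing the share of good and bad applicants within each category. This is a sketch on a hypothetical frame `toy`; with the notebook's data, `df[col]` and `df['Risk']` would be passed instead.

```python
import pandas as pd

# Hypothetical toy data standing in for df
toy = pd.DataFrame({
    "Housing": ["own", "own", "own", "own", "rent", "rent", "free", "free"],
    "Risk":    [1, 1, 1, 0, 1, 0, 0, 0],
})

# normalize="index" makes every row sum to 1, so the columns read as
# the share of bad (0) and good (1) applicants within each housing category
share_by_housing = pd.crosstab(toy["Housing"], toy["Risk"], normalize="index")
print(share_by_housing.round(2))
```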


## Multivariate Analysis

Finding "common sense" combinations for multivariate analysis is challenging because it requires a deep understanding of the domain and the relationships between multiple variables. Features can interact in non-obvious ways to influence outcomes, which makes useful combinations difficult to predict without knowledge of the specific context and the underlying data relationships.

*   For simplicity, let's focus on 'Housing' in combination with 'Saving accounts' as an example of financial stability.
*   For a second plot set we will focus on 'Purpose' in combination with credit amount and duration.

In [None]:
sns.catplot(x="Housing", hue="Risk", col="Saving accounts", kind="count", data=df, aspect=.8, col_wrap=4)
plt.subplots_adjust(top=0.9)
plt.suptitle("Risk Distribution by Housing and Saving Account Status")
plt.show()

In [None]:
g = sns.FacetGrid(df, col="Purpose", hue="Risk", margin_titles=True, height=3.5, col_wrap=4)
g.map(sns.scatterplot, "Duration", "Credit amount")
g.add_legend()
g.set_axis_labels("Duration", "Credit Amount")
g.set_titles("{col_name}")
plt.show()
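
The faceted plots can be backed by a small table that gives, for each combination of two variables, the share of good applicants and the number of observations. The following is a sketch on a hypothetical frame `toy`, not the notebook's actual data. The group size `n` matters because shares based on very few applicants are unreliable.

```python
import pandas as pd

# Hypothetical toy data standing in for df
toy = pd.DataFrame({
    "Housing":         ["own", "own", "own", "rent", "rent", "rent", "own", "rent"],
    "Saving accounts": ["little", "little", "rich", "little", "little", "rich", "rich", "rich"],
    "Risk":            [1, 0, 1, 0, 0, 1, 1, 0],
})

# Since Risk is coded 0/1, its mean per group is the share of good applicants
cell_stats = (
    toy.groupby(["Housing", "Saving accounts"])["Risk"]
    .agg(good_share="mean", n="size")
    .reset_index()
)
print(cell_stats)
```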

---

### Age and Credit Risk
**Hypothesis:**

Older applicants are less likely to be classified as bad credit risks compared to younger applicants.

**Validation:**

Perform a logistic regression analysis with "Risk" as the dependent variable and "Age" as an independent variable. Additionally, a visualization (e.g., boxplot) showing the distribution of ages for good vs. bad credit risks could provide initial insights. The significance of the age coefficient in the regression model would indicate the impact of age on credit risk.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import numpy as np

# Independent and dependent variables
X = df[['Age']]
y = df['Risk']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Getting the coefficient for Age
age_coefficient = model.coef_[0][0]
print(f"Coefficient for Age: {age_coefficient}")

# A positive coefficient would indicate that as age increases, the log-odds of being a good risk (encoded as 1) increases,
# whereas a negative coefficient would indicate that as age increases, the likelihood of being a bad risk increases.
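
scikit-learn's `LogisticRegression` reports coefficients but no p-values, so the sign and size of the coefficient alone say nothing about significance. One possible alternative, swapped in here purely as an illustration, is a permutation test on the difference in mean age between good and bad applicants. The ages below are synthetic and all names are hypothetical. With the notebook's data, `df['Age']` and `df['Risk']` would be used.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical synthetic ages for good (1) and bad (0) applicants
age_good = rng.normal(loc=38, scale=10, size=300)
age_bad = rng.normal(loc=33, scale=10, size=150)

ages = np.concatenate([age_good, age_bad])
labels = np.concatenate([np.ones(age_good.size), np.zeros(age_bad.size)])

# Observed difference in mean age: good minus bad
observed = ages[labels == 1].mean() - ages[labels == 0].mean()

# Shuffle the labels many times to build the null distribution of the difference
n_perm = 2000
null_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(labels)
    null_diffs[i] = ages[shuffled == 1].mean() - ages[shuffled == 0].mean()

# One-sided p-value: share of shuffled differences at least as large as the observed one
p_value = (np.sum(null_diffs >= observed) + 1) / (n_perm + 1)
print(f"Observed difference: {observed:.2f} years, p-value: {p_value:.4f}")
```

A small p-value would suggest that the age difference between the two groups is unlikely to be due to chance alone.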

#### Conclusion
A positive coefficient for age, e.g. 0.0291, means that as people get older, they're slightly more likely to be seen as having a lower risk of defaulting on credit. This backs up the idea that older applicants might be safer bets for lenders. In simple terms, the older you are, the better your chances might be of being considered a good risk for a loan, although this effect seems quite small.
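
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives an odds ratio that is easier to interpret. The short sketch below uses the example value 0.0291 quoted above. It is not a guaranteed result of the model fit.

```python
import math

# Example coefficient quoted in the conclusion above (illustrative, not a guaranteed result)
age_coefficient = 0.0291

# exp(coefficient) is the multiplicative change in the odds of "good risk" per extra year
odds_ratio_per_year = math.exp(age_coefficient)

# For a 10-year difference the coefficient is multiplied by 10 before exponentiating
odds_ratio_per_decade = math.exp(10 * age_coefficient)

print(f"Odds ratio per year:   {odds_ratio_per_year:.4f}")
print(f"Odds ratio per decade: {odds_ratio_per_decade:.4f}")
```

Under this example value, each additional year raises the odds of being a good risk by roughly 3%, and ten additional years by roughly a third.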

---

### Impact of Savings and Checking Account Balances on Credit Risk

**Hypothesis:**

Applicants with little savings or checking account balances are more likely to be classified as bad credit risks.

**Validation:**

Conduct logistic regression analysis with "Risk" as the dependent variable and categorical encodings of "Saving accounts" and "Checking account" as independent variables. If the account balances are transformed into numerical categories or bins, an ANOVA test could additionally check whether risk classification differs significantly across the levels of savings and checking account balances.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Step 1: Load the dataset (same file as in the Load Data section)
data = pd.read_csv("outputs/datasets/collection/GermanCreditData.csv")

# Step 2: Data Preparation
# Handle missing values: forward fill, then backward fill to cover gaps in the first rows
# (fillna(method='ffill') is deprecated in recent pandas versions)
data = data.ffill().bfill()

# Encode categorical variables (one encoder per column so each mapping stays recoverable)
savings_encoder = LabelEncoder()
checking_encoder = LabelEncoder()
data['Saving accounts'] = savings_encoder.fit_transform(data['Saving accounts'])
data['Checking account'] = checking_encoder.fit_transform(data['Checking account'])

# Step 3: Exploratory Data Analysis (EDA)
# Optional: Explore the dataset using summary statistics and visualizations

# Step 4: Hypothesis Testing
# Split data into train and test sets
X = data[['Saving accounts', 'Checking account']]
y = data['Risk']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate the model on the held-out test set
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy}")


#### Conclusions

An accuracy of e.g. 0.705 means that the logistic regression model correctly predicted the risk classification for approximately 70.5% of the test instances.

In the context of credit risk assessment, this accuracy indicates how well the model performs in classifying applicants into "good" or "bad" credit risks based on their savings and checking account balances.
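
Accuracy is only informative relative to a baseline. If one class dominates, a model that always predicts that class already reaches a high accuracy. The sketch below computes such a majority-class baseline. The label distribution in `y_test_demo` is hypothetical, and with the notebook's data `y_test` would be used.

```python
import pandas as pd

# Hypothetical test labels: 70% good (1) and 30% bad (0)
y_test_demo = pd.Series([1] * 140 + [0] * 60)

# Accuracy of a model that always predicts the most frequent class
baseline_accuracy = y_test_demo.value_counts(normalize=True).max()

# Example model accuracy, assumed for illustration
model_accuracy = 0.705

print(f"Baseline accuracy:  {baseline_accuracy:.3f}")
print(f"Lift over baseline: {model_accuracy - baseline_accuracy:.3f}")
```

If the lift over the baseline is close to zero, the accuracy figure gives little support to the hypothesis, and metrics such as recall on the bad class would be more telling.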

An ANOVA test or another statistical test could validate this hypothesis further.
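
ANOVA compares the means of a numeric variable across groups. When both variables are categorical, as with "Saving accounts" and "Risk", a chi-square test of independence is a common alternative. The sketch below applies `scipy.stats.chi2_contingency` to a hypothetical frame `toy`. The counts are made up and do not come from the dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: 60 applicants with little savings, 40 with rich savings
toy = pd.DataFrame({
    "Saving accounts": ["little"] * 60 + ["rich"] * 40,
    "Risk": [0] * 30 + [1] * 30 + [0] * 8 + [1] * 32,
})

# Contingency table of observed counts (rows: savings level, columns: risk class)
contingency = pd.crosstab(toy["Saving accounts"], toy["Risk"])

# Test the null hypothesis that savings level and risk class are independent
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.4f}")
```

A p-value below the chosen significance level would indicate that risk classification is not independent of the savings level.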

---

# Conclusions

The correlation study revealed only weak correlations. The bivariate and multivariate analyses lead to the following assumptions:

Homeownership and Loan Applications:
*   Homeowners are more likely to take out loans than renters or those without rental expenses. This could be due to homeowners feeling more financially secure or needing loans for home maintenance and improvements.

Purpose of Loans - Cars:
*   A notable number of loans are for purchasing vehicles. This indicates that car loans are popular, possibly because of the essential role cars play in personal transportation or the attractive financing options available for these purchases.

Savings Accounts and Loan Behavior:
*   Individuals with smaller savings accounts are more inclined to apply for loans, possibly due to greater financial needs. In contrast, those with more substantial savings not only seek fewer loans but also have a higher rate of repayment. This suggests that people with larger savings are in a better financial position, leading to more prudent borrowing and repayment practices.