### Interpretation of Correlation Coefficients

Correlation coefficients range from -1 to +1.

*   **+1:** Perfect positive correlation (as one variable increases, the other increases proportionally).
*   **0:** No correlation (no linear relationship between the variables).
*   **-1:** Perfect negative correlation (as one variable increases, the other decreases proportionally).
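
As a quick illustration of the two endpoints, the sketch below builds two made-up variables that are exact linear functions of `x` and computes their Pearson coefficient with `np.corrcoef`. The arrays are invented for illustration only and are not part of the loan data.

```python
import numpy as np

# Hypothetical data: x plus two exact linear functions of x.
x = np.arange(1, 11, dtype=float)
y_up = 2.0 * x + 1.0     # rises as x rises
y_down = -3.0 * x + 5.0  # falls as x rises

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r.
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]

print(round(r_up, 6), round(r_down, 6))
```

Up to floating-point rounding, the first coefficient sits at the +1 endpoint and the second at the -1 endpoint.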

The **strength** of the correlation is indicated by the absolute value of the coefficient. The bands below are a common rule of thumb rather than a fixed standard:

*   **0.00 - 0.19:** Very weak
*   **0.20 - 0.39:** Weak
*   **0.40 - 0.59:** Moderate
*   **0.60 - 0.79:** Strong
*   **0.80 - 1.00:** Very strong

The **direction** of the correlation is indicated by the sign of the coefficient (+ for positive, - for negative).
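
The strength bands and the sign rule can be combined into a small helper. This is a minimal sketch: `describe_correlation` is a hypothetical name, and treating each band's upper limit as exclusive (so 0.195 counts as "Very weak") is an assumption, since the bands above leave values such as 0.195 unassigned.

```python
def describe_correlation(r):
    """Return (strength, direction) labels for a coefficient r in [-1, 1]."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")

    magnitude = abs(r)
    # (exclusive upper bound, label) pairs taken from the bands listed above.
    bands = [(0.20, "Very weak"), (0.40, "Weak"), (0.60, "Moderate"), (0.80, "Strong")]
    strength = "Very strong"
    for upper, label in bands:
        if magnitude < upper:
            strength = label
            break

    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    return strength, direction


print(describe_correlation(-0.45))
```

For example, a coefficient of -0.45 is labelled as a moderate, negative correlation.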

We will examine each feature's coefficient to understand its relationship with `LoanAmount`. Pearson's correlation measures linear relationships, while Spearman's rho and Kendall's tau measure monotonic relationships (whether the variables tend to increase or decrease together, though not necessarily at a constant rate).
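
To make the linear versus monotonic distinction concrete, the sketch below compares the three coefficients on made-up data where `y` grows exponentially with `x`. It assumes SciPy is installed (it is not imported elsewhere in this notebook), and the data are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data: y is strictly increasing in x (monotonic) but not linear.
x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)

# Each function returns a result whose first element is the coefficient.
pearson_r = stats.pearsonr(x, y)[0]
spearman_rho = stats.spearmanr(x, y)[0]
kendall_tau = stats.kendalltau(x, y)[0]

print(f"Pearson r    = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
print(f"Kendall tau  = {kendall_tau:.3f}")
```

Because the ordering of `y` follows the ordering of `x` exactly, the two rank-based coefficients reach 1, while Pearson's r falls below 1 because the points do not lie on a straight line.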

# Board-Level Strategic Insight Request
## Data Analyst: Harvey Kim Solano
## Date of Request: 09/26/2025


## Scenario

A mid-sized financial institution has been actively expanding its personal loan portfolio. However, recent shifts in customer behavior and credit risk profiles have prompted the Board of Directors to revisit the underlying factors influencing loan approvals and amounts granted. The board is particularly concerned with credit risk, profitability, and data-driven decision-making.

During a quarterly strategy meeting, the Chief Risk Officer (CRO) presents a dashboard highlighting inconsistencies in the average loan amounts issued across customer segments. Some applicants with moderate income and lower credit scores are receiving higher-than-expected loan amounts, while more qualified applicants appear to be under-leveraged.

In response, the Chairperson of the Board raises a crucial question:

"Among all the customer attributes we’ve been collecting—like income, credit score, employment history, age, debt-to-income ratio, and education level—can we clearly identify which of these factors have the strongest statistical relationship with the loan amount issued?"

The board mandates the Data Analytics Team to conduct an immediate correlation analysis to:

1. Identify which factors most strongly drive loan amounts.

2. Distinguish statistically significant variables from insignificant ones.

3. Support future decisions about automating loan approvals, tightening risk thresholds, or tailoring products to specific customer profiles.
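
One possible way to approach the first two tasks is to compute each numeric feature's Pearson coefficient against the loan amount, along with its p-value, and then rank by absolute value. The sketch below is illustrative only: it runs on synthetic stand-in data rather than the actual loan file, `rank_correlations` is a hypothetical helper name, the 0.05 significance level is an assumed convention, and SciPy is assumed to be installed.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (NOT the real loan file): LoanAmount is built to
# depend strongly on Income and weakly on CreditScore; Age is pure noise.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Income": rng.normal(60000, 15000, n),
    "CreditScore": rng.normal(700, 50, n),
    "Age": rng.integers(21, 65, n),
})
demo["LoanAmount"] = (
    0.8 * demo["Income"] + 60 * demo["CreditScore"] + rng.normal(0, 5000, n)
)


def rank_correlations(df, target, alpha=0.05):
    """Pearson r and p-value of every numeric column against `target`."""
    rows = []
    for col in df.select_dtypes("number").columns.drop(target):
        r, p = stats.pearsonr(df[col], df[target])
        rows.append({"feature": col, "r": r, "p_value": p, "significant": p < alpha})
    result = pd.DataFrame(rows)
    # Put the strongest relationships (by absolute value) first.
    order = result["r"].abs().sort_values(ascending=False).index
    return result.loc[order].reset_index(drop=True)


summary = rank_correlations(demo, "LoanAmount")
print(summary)
```

A ranking like this shows association only. A strong, significant coefficient does not by itself establish that a feature causes higher loan amounts.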

Create a Business Analytics Report that addresses the board's request.

In [None]:
%pip install qdesc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import qdesc
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
import statsmodels.api as sm
from datetime import datetime
from google.colab import files

Collecting qdesc
  Downloading qdesc-0.1.9.7-py3-none-any.whl.metadata (8.8 kB)
Downloading qdesc-0.1.9.7-py3-none-any.whl (10 kB)
Installing collected packages: qdesc
Successfully installed qdesc-0.1.9.7


In [None]:
import os
print(os.getcwd())

/content


In [None]:
os.chdir('/content/sample_data')

In [None]:
# Loading a data set and storing it into a variable.
df = pd.read_excel("HypotheticalLoansData.xlsx")
df

,Income,CreditScore,EmploymentYears,DebtToIncome,Age,EducationLevel,LoanAmount
0,67450.71,746.31,22,0.27,48,Bachelor,105119.57
1,57926.04,795.47,11,0.38,59,Bachelor,104818.75
2,69715.33,630.07,16,0.37,26,Bachelor,99397.74
3,82845.45,728.15,7,0.23,29,High School,99163.53
4,56487.70,667.47,10,0.38,40,High School,80141.29
...,...,...,...,...,...,...,...
495,68083.65,685.94,23,0.31,58,High School,82016.95
496,44441.31,789.88,25,0.34,31,Bachelor,101933.38
497,57144.92,732.04,24,0.22,55,High School,91180.47
498,46865.73,671.44,28,0.42,28,Bachelor,89112.09


In [None]:
# Display data types to identify categorical columns
# df.info() prints its summary directly and returns None, so it is not wrapped in print()
df.info()

# One-hot encode 'EducationLevel'; drop_first=True drops the first category
# (alphabetically), which then serves as the baseline
df_processed = pd.get_dummies(df, columns=['EducationLevel'], drop_first=True)

# Display the first few rows of the processed DataFrame
display(df_processed.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Income           500 non-null    float64
 1   CreditScore      500 non-null    float64
 2   EmploymentYears  500 non-null    int64  
 3   DebtToIncome     500 non-null    float64
 4   Age              500 non-null    int64  
 5   EducationLevel   500 non-null    object 
 6   LoanAmount       500 non-null    float64
dtypes: float64(4), int64(2), object(1)
memory usage: 27.5+ KB


,Income,CreditScore,EmploymentYears,DebtToIncome,Age,LoanAmount,EducationLevel_High School,EducationLevel_Master,EducationLevel_PhD
0,67450.71,746.31,22,0.27,48,105119.57,False,False,False
1,57926.04,795.47,11,0.38,59,104818.75,False,False,False
2,69715.33,630.07,16,0.37,26,99397.74,False,False,False
3,82845.45,728.15,7,0.23,29,99163.53,True,False,False
4,56487.7,667.47,10,0.38,40,80141.29,True,False,False
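
In the encoded output, rows whose original level was `Bachelor` show `False` in all three dummy columns, because `drop_first=True` removes the first category and leaves it as the implicit baseline. The sketch below reproduces this on a tiny hypothetical sample that uses only the category labels visible in the output above.

```python
import pandas as pd

# Hypothetical mini-sample with one row per education level.
sample = pd.DataFrame({"EducationLevel": ["Bachelor", "High School", "Master", "PhD"]})

encoded = pd.get_dummies(sample, columns=["EducationLevel"], drop_first=True)
print(encoded.columns.tolist())

# A row with every dummy column False belongs to the dropped baseline category.
baseline_rows = sample.loc[~encoded.any(axis=1), "EducationLevel"]
print(baseline_rows.tolist())
```

When reading correlations for the dummy columns, each one compares its level against the rest of the sample, and the baseline level has no column of its own.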
