# <center> **Home Credit Default Risk Assessment**
# <center> **Hypothesis Tests**

# **Libraries**

In [1]:
import pandas as pd
import numpy as np

import functions
import importlib
importlib.reload(functions)

import warnings

# **Display**

In [2]:
%matplotlib inline

pd.options.display.max_rows = 200
pd.options.display.max_columns = 999
pd.options.display.max_colwidth = 500

# Silence all warnings (including FutureWarning) to keep cell output readable.
warnings.filterwarnings("ignore")

size = 20

# **Data**

## **Load Data**

In [3]:
train = pd.read_csv(
    r"C:\Users\Dell\Documents\AI\Risk\Data\application_train.csv",
    index_col=False
)

## **Reduce Memory Usage**

In [5]:
train = functions.reduce_memory_usage(train)

Memory usage of dataframe is 286.23 MB
Memory usage after optimization is: 92.38 MB
Decreased by 67.7%
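
The `functions` module is not shown in this notebook, so the exact logic of `reduce_memory_usage` is unknown. As a rough illustration of how a reduction like the one reported above can be achieved, the sketch below downcasts numeric columns to the smallest dtype that still holds their values. The helper name `downcast_numeric` and the toy frame are assumptions made for this example only.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy with numeric columns downcast."""
    out = df.copy()
    # Integers shrink to the narrowest integer type that fits their range.
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    # Floats shrink to float32 at most (pandas does not downcast to float16).
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "a": np.arange(1000, dtype="int64"),
    "b": np.arange(1000, dtype="float64") * 0.5,
})
small = downcast_numeric(toy)

before = toy.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"Before: {before} bytes, after: {after} bytes")
print(small.dtypes)
```

Downcasting floats trades precision for memory, so it may be worth checking summary statistics of monetary columns before and after the conversion.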


# **Hypothesis Tests**

## **Income Stability and Default Rate (Chi-Square Test)**

**Null**: There is no relationship between income stability, proxied here by the applicant's income type (`NAME_INCOME_TYPE`), and the default rate. <BR>
**Alternative**: There is a significant relationship between income type and the default rate. <BR>
**Chi-Square Test**: A chi-square test of independence on the `NAME_INCOME_TYPE` × `TARGET` contingency table checks whether the proportion of defaults differs across income types.

In [11]:
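import pandas as pd
from scipy.stats import chi2_contingency

# Alias (not a copy) of the training frame, reused by the later test cells.
data = train

# Observed counts: one row per income type, one column per TARGET value (0 = no default, 1 = default).
contingency_table = pd.crosstab(data['NAME_INCOME_TYPE'], data['TARGET'])

# Chi-square test of independence between income type and default status.
chi2, p_value, dof, expected = chi2_contingency(contingency_table)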

print("Chi-Square Statistic:", chi2)
print("p-value:", p_value)
print("Degrees of Freedom:", dof)

alpha = 0.05  
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant relationship between income stability and default rate.")
else:
    print("Fail to reject the null hypothesis: No significant relationship between income stability and default rate.")

Chi-Square Statistic: 1253.4708080924986
p-value: 1.9281456056861122e-266
Degrees of Freedom: 7
Reject the null hypothesis: There is a significant relationship between income stability and default rate.
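
With a sample this large, a tiny p-value says little about how strong the relationship is. One common complement is Cramér's V, an effect size derived from the chi-square statistic. The sketch below is a minimal illustration under stated assumptions: the helper `cramers_v` and the toy counts are invented for the example and are not taken from the dataset. In the notebook, the `contingency_table` built above could be passed in instead.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V for an r x c contingency table (0 = no association, 1 = perfect)."""
    # correction=False so that 2 x 2 tables are not shrunk by Yates' correction.
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


# Toy stand-in for pd.crosstab(data['NAME_INCOME_TYPE'], data['TARGET']);
# these counts are made up for illustration.
toy_table = pd.DataFrame(
    {0: [900, 450, 280], 1: [100, 50, 70]},
    index=["Working", "Pensioner", "Unemployed"],
)

print("Cramér's V:", round(cramers_v(toy_table), 3))

# Share of TARGET == 1 within each income type.
default_rate = toy_table[1] / toy_table.sum(axis=1)
print(default_rate)
```

Reporting the per-category default rate next to the effect size makes it easier to see which income types drive the result.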


## **Credit Amount and Default Risk (T-test)**

**Null**: The average credit amount is the same for those who default and those who do not. <BR>
**Alternative**: The average credit amount differs between defaulters and non-defaulters. <BR>
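**Two-Sample T-Test**: A two-sample t-test compares the mean `AMT_CREDIT` of the two groups (defaulters vs. non-defaulters). The code uses Welch's variant (`equal_var=False`), which does not assume equal variances in the two groups.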

In [12]:
import pandas as pd
from scipy.stats import ttest_ind

data = train
# Credit amounts split by outcome: TARGET == 1 is default, TARGET == 0 is no default.
default_group = data[data['TARGET'] == 1]['AMT_CREDIT']
non_default_group = data[data['TARGET'] == 0]['AMT_CREDIT']

# Welch's t-test: equal_var=False means equal variances are NOT assumed.
t_stat, p_value = ttest_ind(default_group, non_default_group, equal_var=False)


print("T-Statistic:", t_stat)
print("p-value:", p_value)


alpha = 0.05 
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in credit amount between defaulters and non-defaulters.")
else:
    print("Fail to reject the null hypothesis: No significant difference in credit amount between defaulters and non-defaulters.")


T-Statistic: -19.273200010157254
p-value: 2.7206132011522836e-82
Reject the null hypothesis: There is a significant difference in credit amount between defaulters and non-defaulters.
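
The t-test shows that the group means differ, but not by how much in practical terms. The sketch below adds two hedged complements: Cohen's d as a standardized effect size, and a Mann-Whitney U test as a rank-based check that does not rely on the means of a possibly skewed variable. The helper `cohens_d` and the synthetic lognormal samples are assumptions for illustration; in the notebook, `default_group` and `non_default_group` would take their place.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def cohens_d(a, b) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))


rng = np.random.default_rng(0)
# Synthetic, right-skewed stand-ins for AMT_CREDIT in each group (not real data).
default_credit = rng.lognormal(mean=13.0, sigma=0.6, size=2_000)
non_default_credit = rng.lognormal(mean=13.1, sigma=0.6, size=20_000)

d = cohens_d(default_credit, non_default_credit)
u_stat, u_p = mannwhitneyu(default_credit, non_default_credit, alternative="two-sided")

print("Cohen's d:", round(d, 3))
print("Mann-Whitney U p-value:", u_p)
```

A negative d here has the same reading as a negative t-statistic: the first group (defaulters) has the lower mean.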


## **Age and Default Risk (Kruskal-Wallis Test)**

**Null**: Default risk is the same across customer age groups. <BR>
**Alternative**: Default risk differs significantly across age groups. <BR>
**Kruskal-Wallis Test**: The Kruskal-Wallis test (non-parametric) is applied to assess whether the distribution of `TARGET` differs significantly across age groups, with ages binned by decade from the 20s through the 60s.

In [13]:
import pandas as pd
from scipy.stats import kruskal


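# Convert DAYS_BIRTH (negative day counts) to age in years, then bin by decade.
# Note: data is an alias of train, so these columns are added to train as well.
data['AGE_YEARS'] = data['DAYS_BIRTH'] / -365
data['AGE_GROUP'] = pd.cut(data['AGE_YEARS'], bins=[20, 30, 40, 50, 60, 70], labels=['20s', '30s', '40s', '50s', '60s'])

# One list of TARGET values (0/1) per age group; observed=True skips empty bins.
grouped_data = data.groupby('AGE_GROUP', observed=True)['TARGET'].apply(list)

# Kruskal-Wallis H-test across the age groups.
stat, p_value = kruskal(*grouped_data)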

print("Kruskal-Wallis Statistic:", stat)
print("p-value:", p_value)


alpha = 0.05 
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in default rates between age groups.")
else:
    print("Fail to reject the null hypothesis: No significant difference in default rates between age groups.")


Kruskal-Wallis Statistic: 1788.6225580246416
p-value: 0.0
Reject the null hypothesis: There is a significant difference in default rates between age groups.
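
Because `TARGET` is binary, the Kruskal-Wallis test above is working with heavily tied ranks. A chi-square test of independence on the age group × `TARGET` table is a more conventional way to compare default proportions, and the per-group default rate shows the direction of the difference. The sketch below is one possible cross-check, not the notebook's method; the synthetic frame `toy` and its default probabilities are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)
n = 5_000
# Synthetic stand-in for the notebook's columns (not real data): ages in years
# and a binary TARGET whose default probability falls slightly with age.
toy = pd.DataFrame({"AGE_YEARS": rng.uniform(21, 69, size=n)})
toy["TARGET"] = rng.binomial(1, (0.14 - 0.0015 * toy["AGE_YEARS"]).to_numpy())

# Same decade bins as the notebook cell above.
toy["AGE_GROUP"] = pd.cut(
    toy["AGE_YEARS"],
    bins=[20, 30, 40, 50, 60, 70],
    labels=["20s", "30s", "40s", "50s", "60s"],
)

# Counts of TARGET values per age group, then a test of independence.
age_table = pd.crosstab(toy["AGE_GROUP"], toy["TARGET"])
chi2, p_value, dof, _ = chi2_contingency(age_table)

# Default rate (mean of the 0/1 TARGET) within each age group.
rates = toy.groupby("AGE_GROUP", observed=True)["TARGET"].mean()

print(rates)
print("Chi-square:", chi2, "p-value:", p_value, "dof:", dof)
```

On the real data, the same pattern could be applied to `data['AGE_GROUP']` and `data['TARGET']` to see which age groups default most often.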


# **Insights**

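> * **Hypothesis Test #1** (income type, chi-square): The test statistic is about 1253.47 with 7 degrees of freedom and a p-value of about 1.9e-266, so the null hypothesis is rejected at alpha = 0.05. Default rates differ across the `NAME_INCOME_TYPE` categories.
> * **Hypothesis Test #2** (credit amount, Welch's t-test): The t-statistic is about -19.27 with a p-value of about 2.7e-82, so the null hypothesis is rejected. Because the default group was passed first, the negative sign means defaulters have a lower mean `AMT_CREDIT` than non-defaulters.
> * **Hypothesis Test #3** (age group, Kruskal-Wallis): The statistic is about 1788.62 and the p-value prints as 0.0, meaning it is too small to represent in floating point, so the null hypothesis is rejected. The distribution of `TARGET` differs across the age groups.
> * **Caveat**: The dataset has approximately 300,000 rows and 122 columns, "TARGET" indicates whether the client defaulted on a loan, and many columns contain a large share of missing values. At this sample size even small differences produce very small p-values, so these results establish statistical significance but not the size of each effect.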