# <center> **Home Credit Default Risk Assessment**
# <center> **Hypothesis Tests**

# **Introduction**

In this part of the project, I conducted three hypothesis tests: one on income type, one on credit amount, and one on car ownership.

# **Libraries**

In [14]:
import pandas as pd
import numpy as np

import functions
import importlib
importlib.reload(functions)

from scipy.stats import chi2_contingency
from scipy.stats import ttest_ind
from statsmodels.stats.proportion import proportions_ztest

import warnings

# **Display**

In [2]:
%matplotlib inline

pd.options.display.max_rows = 200
pd.options.display.max_columns = 999
pd.options.display.max_colwidth = 500

# Silence all warnings (FutureWarning included) to keep cell output readable.
warnings.filterwarnings("ignore")

size = 20

# **Data**

## **Load Data**

In [3]:
train = pd.read_csv(
    r"C:\Users\Dell\Documents\AI\Risk\Data\application_train.csv",
    index_col=False
)

## **Reduce Memory Usage**

In [4]:
train = functions.reduce_memory_usage(train)

Memory usage of dataframe is 286.23 MB
Memory usage after optimization is: 92.38 MB
Decreased by 67.7%


# **Hypothesis Tests**

## **Income Stability and Default Rate (Chi-Square Test)**

**Null**: There is no relationship between income type (`NAME_INCOME_TYPE`, used here as a proxy for income stability, such as whether the person is a regular employee) and the default rate. <BR>
**Alternative**: There is a significant relationship between income type and the default rate. <BR>
**Chi-Square Test**: A chi-square test of independence on the income type by `TARGET` contingency table checks whether the proportion of defaults differs across income groups.
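To make the statistic less of a black box, here is a minimal sketch of how a chi-square test of independence can be computed by hand: expected counts come from the row and column totals, and the statistic sums the squared deviations scaled by the expected counts. The helper name `chi_square_statistic` and the toy counts are assumptions for illustration only; they are not taken from the Home Credit data.

```python
def chi_square_statistic(table):
    """Return (chi2, dof, expected) for a 2-D list of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Sum of (observed - expected)^2 / expected over every cell.
    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof, expected


# Hypothetical counts: three groups (rows) by non-default / default (columns).
toy_table = [[90, 10], [80, 20], [70, 30]]
chi2, dof, expected = chi_square_statistic(toy_table)
print(f"chi2 = {chi2:.3f}, dof = {dof}")
```

With eight income types and two target classes, the same degrees-of-freedom rule gives (8 - 1) * (2 - 1) = 7, which matches the value reported by `chi2_contingency` in this notebook.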

In [6]:
data = train


contingency_table = pd.crosstab(data['NAME_INCOME_TYPE'], data['TARGET'])

chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print("Chi-Square Statistic:", chi2)
print("p-value:", p_value)
print("Degrees of Freedom:", dof)

alpha = 0.05  
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant relationship between income type and default rate.")
else:
    print("Fail to reject the null hypothesis: No significant relationship between income type and default rate.")

Chi-Square Statistic: 1253.4708080924986
p-value: 1.9281456056861122e-266
Degrees of Freedom: 7
Reject the null hypothesis: There is a significant relationship between income type and default rate.


## **Credit Amount and Default Risk (T-test)**

**Null**: The average credit amount is the same for those who default and those who do not. <BR>
**Alternative**: The average credit amount differs between defaulters and non-defaulters. <BR>
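**Two-Sample T-Test**: A two-sample t-test compares the mean credit amount of the two groups (defaulters vs. non-defaulters). With `equal_var=False`, `ttest_ind` runs Welch's version of the test, which does not assume that the two groups have equal variances.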
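As a rough illustration of what a t-test with unequal variances (Welch's t-test) computes, the sketch below derives the t statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The function name `welch_t` and the toy samples are hypothetical and unrelated to `AMT_CREDIT`.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t_stat, dof) for Welch's unequal-variance t-test."""
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    # Sample variances (n - 1 in the denominator).
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)

    # Squared standard error contributed by each group.
    se_a, se_b = var_a / n_a, var_b / n_b

    t_stat = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof


# Hypothetical samples, only to exercise the formula.
toy_a = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(toy_a, toy_b)
print(f"t = {t_stat:.3f}, dof = {dof:.2f}")
```

The sign of the statistic follows the argument order: a negative value means the first sample has the smaller mean.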

In [7]:
data = train
default_group = data[data['TARGET'] == 1]['AMT_CREDIT']  
non_default_group = data[data['TARGET'] == 0]['AMT_CREDIT']  


t_stat, p_value = ttest_ind(default_group, non_default_group, equal_var=False)  # equal_var=False: Welch's t-test, no equal-variance assumption


print("T-Statistic:", t_stat)
print("p-value:", p_value)


alpha = 0.05 
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in credit amount between defaulters and non-defaulters.")
else:
    print("Fail to reject the null hypothesis: No significant difference in credit amount between defaulters and non-defaulters.")


T-Statistic: -19.273200010157254
p-value: 2.7206132011522836e-82
Reject the null hypothesis: There is a significant difference in credit amount between defaulters and non-defaulters.


## **Car Ownership and Loan Default (Two-Proportion Z-Test)** 

The Two-Proportion Z-Test is used to determine whether the proportions of two independent groups differ significantly. The two groups must be independent of each other. The test assumes that the sample size in each group is large enough for the normal approximation to hold.

**Null**: There is no significant difference in the proportion of defaults between clients who owned a car and those who did not.<BR>
**Alternative**: There is a significant difference in the proportion of defaults between clients who owned a car and those who did not.<BR>
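For reference, the sketch below shows one common form of the two-proportion z-test: it pools the two groups to estimate a shared proportion under the null, then compares the observed difference against the pooled standard error. I believe this pooled form matches the default behaviour of `proportions_ztest`, but treat that as an assumption. The function name `two_proportion_ztest` and the toy counts are hypothetical.

```python
import math


def two_proportion_ztest(successes_1, n_1, successes_2, n_2):
    """Return (z_stat, two-sided p_value) using the pooled proportion."""
    p_1 = successes_1 / n_1
    p_2 = successes_2 / n_2

    # Under the null both groups share one proportion, estimated by pooling.
    pooled = (successes_1 + successes_2) / (n_1 + n_2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))

    z_stat = (p_1 - p_2) / std_err
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z_stat) / math.sqrt(2))
    return z_stat, p_value


# Hypothetical counts: 30 of 100 vs. 50 of 100.
z_stat, p_value = two_proportion_ztest(30, 100, 50, 100)
print(f"z = {z_stat:.3f}, p-value = {p_value:.4g}")
```

Printing the p-value with a general format such as `.4g` avoids a very small value being rounded to `0.00`.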

In [5]:
feature = 'FLAG_OWN_CAR'
target = 'TARGET'

In [20]:
own_car = train.loc[train[feature] == 'Y']
count_own_car = own_car.shape[0]
count_own_car_default = (own_car[target] == 1).sum()
prop_own_car_default = count_own_car_default / count_own_car 
print(
    f"Proportion of clients who owned a car and defaulted {prop_own_car_default:.3f}"
)

Proportion of clients who owned a car and defaulted 0.072


In [19]:
no_car = train.loc[train[feature] == 'N']
count_no_car = no_car.shape[0]
count_no_car_default = (no_car[target] == 1).sum()
prop_no_car_default = count_no_car_default / count_no_car

print(f"Proportion of clients who didn't own a car and defaulted {prop_no_car_default:.3f}")

Proportion of clients who didn't own a car and defaulted 0.085


In [21]:
numerator = np.array([count_own_car_default, count_no_car_default])
denominator = np.array([count_own_car, count_no_car])


stat, pval = proportions_ztest(numerator, denominator, alternative="two-sided")

print(f"The p-value is: {pval:.2f}")

if pval < 0.05:
    print("Null hypothesis is rejected.")
else:
    print("Failed to reject the null hypothesis.")

The p-value is: 0.00
Null hypothesis is rejected.


# **Summary**

> * **Hypothesis Test #1**: There is a significant relationship between income type (used as a proxy for income stability) and default rate.
> * **Hypothesis Test #2**: There is a significant difference in average credit amount between defaulters and non-defaulters; the negative t-statistic indicates that defaulters have the lower average.
> * **Hypothesis Test #3**: The default rate differs significantly between clients who own a car (0.072) and those who do not (0.085).