# Loan Prediction – Hypothesis Testing

This notebook performs inferential statistical analysis on the Loan Prediction dataset.
Hypothesis tests are used to check whether a numerical feature (applicant income)
and a categorical feature (education level) are associated with loan approval.

## 1. Import Libraries and Load Cleaned Dataset

The cleaned dataset from the preprocessing stage is loaded
for statistical hypothesis testing.

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind, chi2_contingency
df = pd.read_csv('../data/loan_data_cleaned.csv')
df.head()

,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,128.0,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


## 2. Two-Sample T-Test (Applicant Income vs Loan Status)

### Hypotheses:
- Null Hypothesis (H₀):  
  The mean applicant income of approved and rejected loans is the same.
- Alternative Hypothesis (H₁):  
  The mean applicant income of approved and rejected loans is different.

In [2]:
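# Split applicant income by loan outcome
approved_income = df.loc[df['Loan_Status'] == 'Y', 'ApplicantIncome']
rejected_income = df.loc[df['Loan_Status'] == 'N', 'ApplicantIncome']
# Two-sided Student's t-test (ttest_ind assumes equal variances by default)
t_stat, p_value = ttest_ind(approved_income, rejected_income)
t_stat, p_value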

(np.float64(-0.11650844828724542), np.float64(0.907287812130518))

### T-Test Conclusion

Since the p-value (0.907) is greater than the significance level of 0.05,
we fail to reject the null hypothesis (t ≈ -0.117). The data provide no
evidence of a difference in mean applicant income between approved and
rejected loans. This is not the same as showing that the two means are equal.
Note that `ttest_ind` assumes equal group variances by default.
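
If applicant income is skewed or the two groups have unequal variances, the default Student's t-test may not be the best fit. The sketch below shows two possible robustness checks: Welch's t-test (`equal_var=False`) and the rank-based Mann-Whitney U test. The two arrays are synthetic stand-ins generated for illustration, not the notebook's data, so the printed values say nothing about the real dataset.

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

# Synthetic, right-skewed stand-ins for the two income groups (NOT the real data).
rng = np.random.default_rng(0)
approved_income = rng.lognormal(mean=8.4, sigma=0.6, size=400)
rejected_income = rng.lognormal(mean=8.4, sigma=0.7, size=200)

# Welch's t-test: does not assume equal variances in the two groups.
welch_t, welch_p = ttest_ind(approved_income, rejected_income, equal_var=False)

# Mann-Whitney U test: compares ranks, so it is less sensitive to extreme incomes.
u_stat, u_p = mannwhitneyu(approved_income, rejected_income, alternative="two-sided")

print(f"Welch t = {welch_t:.3f}, p = {welch_p:.3f}")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {u_p:.3f}")
```

In the notebook itself, the same two calls could be applied to the `approved_income` and `rejected_income` Series defined in the T-test cell.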


## 3. Chi-Square Test (Education vs Loan Status)

### Hypotheses:
- Null Hypothesis (H₀):  
  Education level and loan approval status are independent.
- Alternative Hypothesis (H₁):  
  Education level and loan approval status are dependent.

In [3]:
# Cross-tabulation
contingency_table = pd.crosstab(df['Education'], df['Loan_Status'])
contingency_table

Education,Loan_Status=N,Loan_Status=Y
Graduate,140,340
Not Graduate,52,82


In [4]:
chi2, p, dof, expected = chi2_contingency(contingency_table)
chi2, p

(np.float64(4.091490413303621), np.float64(0.043099621293573545))

### Chi-Square Test Conclusion

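Since the p-value (0.043) is less than the significance level of 0.05,
the null hypothesis of independence is rejected (χ² ≈ 4.09, 1 degree of freedom).
In the contingency table above, 340 of 480 graduates (70.8%) were approved,
compared with 82 of 134 non-graduates (61.2%). The p-value is only slightly
below 0.05, so the evidence is modest, and a significant association does not
by itself show that education causes approval. For a 2×2 table,
`chi2_contingency` applies Yates' continuity correction by default.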
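
As a supplement to the conclusion above, the following sketch rebuilds the Education × Loan_Status counts from the contingency table and shows three things: the default test with Yates' continuity correction, the same test with `correction=False`, and Cramér's V as a rough effect size. The choice of Cramér's V and the "expected count of at least 5" rule of thumb are additions for illustration, not part of the original analysis.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the contingency table above.
# Rows: Graduate, Not Graduate. Columns: Loan_Status N, Y.
observed = np.array([[140, 340],
                     [52, 82]])

# Default call: Yates' continuity correction is applied because dof == 1.
chi2_yates, p_yates, dof, expected = chi2_contingency(observed)

# Same test without the continuity correction.
chi2_raw, p_raw, _, _ = chi2_contingency(observed, correction=False)

# Common rule of thumb: every expected count should be at least 5.
expected_ok = bool(expected.min() >= 5)

# Cramér's V (effect size), computed from the uncorrected statistic.
n = observed.sum()
cramers_v = np.sqrt(chi2_raw / (n * (min(observed.shape) - 1)))

print(f"With Yates correction:    chi2 = {chi2_yates:.3f}, p = {p_yates:.3f}")
print(f"Without Yates correction: chi2 = {chi2_raw:.3f}, p = {p_raw:.3f}")
print(f"Expected counts all >= 5: {expected_ok}")
print(f"Cramér's V = {cramers_v:.3f}")
```

The corrected statistic is the one reported in the notebook output. Reporting an effect size next to the p-value helps separate "statistically detectable" from "practically large".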


## 4. ANOVA (Conceptual Explanation)

Analysis of Variance (ANOVA) compares the means of three or more
independent groups with a single F-test.

In the context of this project, ANOVA would be appropriate if applicant
income were divided into several income groups (e.g., low, medium, high)
and the mean loan amount were compared across these groups.

ANOVA avoids running multiple pairwise T-tests, each of which adds its own
chance of a Type I error. For example, three pairwise tests at a significance
level of 0.05 would, if independent, have a combined false-positive
probability of about 1 − 0.95³ ≈ 0.14.
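
The scenario described above can be sketched with `scipy.stats.f_oneway`. This is a minimal illustration under stated assumptions: the data frame `demo` is synthetic, the `IncomeGroup` column is hypothetical, and splitting income into terciles with `pd.qcut` is just one possible way to form low, medium, and high groups.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Synthetic stand-in for the two columns involved (NOT the real dataset).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "ApplicantIncome": rng.lognormal(mean=8.4, sigma=0.6, size=300),
    "LoanAmount": rng.normal(loc=140, scale=40, size=300),
})

# Hypothetical grouping: split income into three equal-sized bins (terciles).
demo["IncomeGroup"] = pd.qcut(
    demo["ApplicantIncome"], q=3, labels=["low", "medium", "high"]
)

# Collect one array of loan amounts per income group.
groups = [
    g["LoanAmount"].to_numpy()
    for _, g in demo.groupby("IncomeGroup", observed=True)
]

# One-way ANOVA: H0 is that all group means are equal.
f_stat, p_anova = f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_anova:.3f}")
```

A significant F-test only indicates that at least one group mean differs. Identifying which groups differ would require a follow-up post-hoc comparison.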

## Conclusion

In this notebook, inferential statistical techniques were applied to the
Loan Prediction dataset. The T-Test found no statistically significant
difference in mean applicant income between approved and rejected loans
(p = 0.907), while the Chi-Square test found a statistically significant
association between education level and loan approval status (p = 0.043).
These results inform data-driven model development in subsequent stages.