# Project Report - Shahin Hussain (ID 140474758)

## 1. Introduction

**Problem**: The task is to predict whether a loan application is approved or rejected, based on the applicant's financial and demographic features. The dataset consists of $s$ independent observations, each corresponding to a single historical loan application, together with a binary target variable indicating the final loan status.

This is a standard supervised binary classification problem: the goal is to predict a class label, which takes one of two possible values, from a set of explanatory variables.

To address this task, ridge-regularised logistic regression is implemented from scratch using only NumPy. The resulting model is then compared against two benchmark approaches: logistic regression implemented via $\texttt{scikit-learn}$, and a classification tree with controlled depth. We include these models to validate our implementation, assess predictive performance, and examine the effect of varying model complexity.

The dataset is first randomly partitioned into a training set and a held-out test set, using an 80/20 split. Model selection is then performed on the training data only, via $K$-fold cross-validation, with the regularisation hyperparameter selected to maximise cross-validated accuracy. Final model evaluation is carried out on the held-out test set using standard classification metrics, including accuracy, confusion matrices, and receiver operating characteristic (ROC) analysis.

This study aims to construct a predictive model for loan approval that balances predictive performance and interpretability. The study concludes with an interpretation of the learned model parameters and an assessment of predictive performance relative to alternative classification models.

## 2. Dataset and Preprocessing

The dataset consists of $s$ loan applications, each described by a collection of numerical and categorical features capturing applicant demographics, financial status, and loan characteristics. Each observation corresponds to a single applicant, together with a binary target variable indicating the final loan decision.

### Dataset Structure

The first few rows of the dataset are shown below:

|    | no_of_dependents | education     | self_employed | income_annum | loan_amount | loan_term | cibil_score | residential_assets_value | commercial_assets_value | luxury_assets_value | bank_asset_value | loan_status |
|----|------------------|---------------|---------------|--------------|-------------|-----------|-------------|--------------------------|-------------------------|---------------------|------------------|-------------|
| 0  | 2                | Graduate      | No            | 9600000      | 29900000    | 12        | 778         | 2400000                  | 17600000                | 22700000            | 8000000          | Approved    |
| 1  | 0                | Not Graduate  | Yes           | 4100000      | 12200000    | 8         | 417         | 2700000                  | 2200000                 | 8800000             | 3300000          | Rejected    |
| 2  | 3                | Graduate      | No            | 9100000      | 29700000    | 20        | 506         | 7100000                  | 4500000                 | 33300000            | 12800000         | Rejected    |


Formally, the dataset is represented by a feature matrix
$$
X \in \mathbb{R}^{s \times d},
$$
where each row $x_i^\top \in \mathbb{R}^d$ corresponds to the feature vector of the $i$-th applicant, and a target vector
$$
y = (y_1, \dots, y_s)^\top \in \{0,1\}^s,
$$
where $y_i = 1$ denotes an approved loan and $y_i = 0$ denotes a rejected loan.

The target variable exhibits a moderate class imbalance, with a larger proportion of approved loans than rejected ones. However, both classes are well represented, ensuring that sufficient samples from each class are present in both training and validation splits.

---

### Feature Types

The feature set includes:
- **Numerical variables**, such as annual income, loan amount, loan term, credit score, and asset-related quantities.
- **Categorical variables**, including education level and employment status.

Categorical variables were converted into numerical representations using one-hot encoding, resulting in a fully numerical design matrix suitable for optimisation-based learning methods.
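
As an illustration of the encoding step, the following is a minimal NumPy sketch rather than the notebook's actual code. The helper name `one_hot` and the toy `education` values are assumptions chosen to mirror the table above.

```python
import numpy as np

def one_hot(values):
    """Return (categories, indicator matrix) for a 1-D sequence of labels.

    Hypothetical helper: categories are sorted, and column j of the
    indicator matrix is 1 where the label equals categories[j].
    """
    categories, codes = np.unique(np.asarray(values), return_inverse=True)
    # Row i of the identity matrix is the indicator vector for category i
    return categories, np.eye(len(categories))[codes]

# Toy values mirroring the `education` column
education = ["Graduate", "Not Graduate", "Graduate"]
categories, indicators = one_hot(education)
print(categories)
print(indicators)
```

For a two-level category, the two indicator columns sum to one, so one of them is redundant once an intercept is present. Whether a column was dropped in this study is not stated, so the sketch keeps both.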

---

### Preprocessing Steps

Prior to model training, the following preprocessing steps were applied:

1. **Train–test split.**  
   The dataset was randomly partitioned into a training set and a held-out test set using an 80/20 split. All subsequent preprocessing statistics and model selection decisions were derived exclusively from the training data.

2. **Handling of missing values.**  
   The dataset contains no missing values across the selected features; therefore, no imputation was required.

3. **Feature standardisation.**  
   All numerical features were standardised to zero mean and unit variance using statistics computed on the training set. This ensures that all features are placed on a comparable scale and prevents variables with large magnitudes from dominating the optimisation objective.

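Standardisation is particularly important for ridge-regularised models: the $\ell_2$ penalty acts on the magnitudes of the coefficients, which in turn depend on the units of the features. Without standardisation, a feature measured in large units would receive a small coefficient and would therefore be penalised less than a comparable feature measured in small units.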
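
The split and standardisation steps can be sketched as follows. This is a hedged illustration on synthetic data: the function names, the seed, and the guard against constant columns are assumptions, not details taken from the notebook.

```python
import numpy as np

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) from a random permutation of the rows."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_test = int(round(test_fraction * n_samples))
    return perm[n_test:], perm[:n_test]

def standardise(X_train, X_test):
    """Scale both sets using the mean and standard deviation of X_train only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # assumption: leave constant columns unscaled
    return (X_train - mean) / std, (X_test - mean) / std

# Synthetic features on very different scales (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=[1.0, 1000.0], size=(100, 2))

train_idx, test_idx = train_test_split_indices(len(X))
X_train_std, X_test_std = standardise(X[train_idx], X[test_idx])
print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```

The test set is transformed with the training statistics, so its columns will generally not have exactly zero mean or unit variance.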

---

### Exploratory Visual Analysis

To gain insight into the structure of the data, exploratory analysis was conducted on the numerical features and the target variable. Figure references below correspond to plots generated in the accompanying notebook.

- The distribution of the target variable highlights the moderate class imbalance between approved and rejected loans.
- Histograms of numerical features reveal substantial variation in scale prior to standardisation.
- The correlation matrix of numerical features (see corresponding figure) shows strong positive correlations among income and asset-related variables, indicating the presence of multicollinearity.

These observations motivate the use of ridge regularisation in the subsequent modelling stage, as well as careful interpretation of learned model coefficients.
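
A correlation matrix of this kind can be computed with `np.corrcoef`. The sketch below uses synthetic stand-ins for income, assets, and loan term (the variable names and the data-generating choices are assumptions), purely to show the computation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: assets are constructed to track income closely,
# loan_term is generated independently of both
income = rng.normal(size=500)
assets = 2.0 * income + 0.5 * rng.normal(size=500)
loan_term = rng.normal(size=500)

X = np.column_stack([income, assets, loan_term])

# rowvar=False treats each column as a variable and each row as an observation
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```

A large off-diagonal entry, such as the income/assets pair here, is the pattern that signals multicollinearity.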


## Exploratory Data Analysis

## Methodology

### Ridge Logistic Regression (NumPy)

## Model Selection via K-Fold Cross-Validation

The ridge logistic regression model depends on a regularisation hyperparameter $\alpha \geq 0$, which controls the strength of the $\ell_2$ penalty applied to the model parameters; $\alpha = 0$ corresponds to no regularisation. Selecting an appropriate value of $\alpha$ is essential in order to balance model bias and variance.

### Ridge Logistic Regression Objective

Given a training dataset
$$
\mathcal{S} = \{(x_i, y_i)\}_{i=1}^s,
\qquad
x_i \in \mathbb{R}^d,
\quad
y_i \in \{0,1\},
$$
the ridge logistic regression estimator is obtained by solving
$$
\hat{w}_\alpha
=
\arg\min_{w \in \mathbb{R}^{d+1}}
\left[
\frac{1}{s}
\sum_{i=1}^s
\Big(
\log\!\left(1 + e^{\langle w, x_i \rangle}\right)
-
y_i \langle w, x_i \rangle
\Big)
+
\frac{\alpha}{2}
\lVert w \rVert_2^2
\right].
$$

Here, the first term corresponds to the empirical logistic loss, while the second term penalises large coefficients and improves numerical stability in the presence of correlated features. The intercept is included by augmenting the input vector with a constant feature, hence $w \in \mathbb{R}^{d+1}$.
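
The objective and its gradient can be written in a few lines of NumPy. The following is a minimal sketch, not the implementation used in this report: it assumes plain gradient descent with a fixed step size, and the function names, step size, and iteration count are illustrative. As in the objective above, the penalty is applied to the full augmented vector $w$, including the intercept.

```python
import numpy as np

def add_intercept(X):
    """Augment each input with a constant feature, so that w lies in R^(d+1)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def sigmoid(t):
    # Equivalent to 1 / (1 + exp(-t)), written with tanh to avoid overflow
    return 0.5 * (1.0 + np.tanh(0.5 * t))

def ridge_logistic_loss(w, X, y, alpha):
    z = X @ w
    # np.logaddexp(0, z) evaluates log(1 + e^z) stably
    return np.mean(np.logaddexp(0.0, z) - y * z) + 0.5 * alpha * (w @ w)

def ridge_logistic_grad(w, X, y, alpha):
    return X.T @ (sigmoid(X @ w) - y) / X.shape[0] + alpha * w

def fit_gradient_descent(X, y, alpha, step=0.5, n_iter=5000):
    """Assumed optimiser: fixed-step gradient descent started from zero."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = w - step * ridge_logistic_grad(w, X, y, alpha)
    return w

# Synthetic, noisy two-feature problem (illustrative only)
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 2))
y = (X_raw[:, 0] + 0.5 * X_raw[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(float)
X = add_intercept(X_raw)

w_hat = fit_gradient_descent(X, y, alpha=0.01)
print(w_hat, ridge_logistic_loss(w_hat, X, y, 0.01))
```

Because the objective is strongly convex for $\alpha > 0$, any convergent first-order method reaches the same minimiser; gradient descent is used here only for brevity.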

---

### K-Fold Cross-Validation Procedure

To select the optimal value of $\alpha$, $K$-fold cross-validation was employed. The training set $\mathcal{S}$ was randomly partitioned into $K$ disjoint subsets of approximately equal size,
$$
\mathcal{S}
=
\bigcup_{k=1}^{K} \mathcal{S}_k,
\qquad
\mathcal{S}_k \cap \mathcal{S}_\ell = \emptyset
\quad \text{for } k \neq \ell.
$$

For each fold $k = 1, \dots, K$, the model was trained on the reduced training set
$$
\mathcal{S}^{(k)}_{\text{train}} = \mathcal{S} \setminus \mathcal{S}_k,
$$
and evaluated on the validation set $\mathcal{S}_k$.

Let $\hat{w}^{(k)}_\alpha$ denote the solution obtained by minimising the ridge logistic loss on $\mathcal{S}^{(k)}_{\text{train}}$. The corresponding validation accuracy for fold $k$ is defined as
$$
\mathrm{Acc}_k(\alpha)
=
\frac{1}{|\mathcal{S}_k|}
\sum_{(x_i,y_i)\in \mathcal{S}_k}
\mathbb{I}
\big[
\hat{y}_i = y_i
\big],
$$
where the predicted labels are given by
$$
\hat{y}_i
=
\mathbb{I}
\Big(
\sigma(\langle \hat{w}^{(k)}_\alpha, x_i \rangle) \ge 0.5
\Big),
\qquad
\sigma(t) = \frac{1}{1 + e^{-t}}.
$$
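
The per-fold procedure can be sketched as below. This is an illustrative version under stated assumptions: $K = 5$, the seed, and the fixed-step gradient descent solver are not taken from the report, and the data are synthetic.

```python
import numpy as np

def sigmoid(t):
    # Same function as 1 / (1 + exp(-t)), written with tanh for stability
    return 0.5 * (1.0 + np.tanh(0.5 * t))

def fit_ridge_logistic(X, y, alpha, step=0.5, n_iter=1000):
    """Assumed solver: fixed-step gradient descent on the ridge logistic loss."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w -= step * (X.T @ (sigmoid(X @ w) - y) / X.shape[0] + alpha * w)
    return w

def kfold_accuracies(X, y, alpha, K=5, seed=0):
    """Return the K validation accuracies Acc_k(alpha)."""
    rng = np.random.default_rng(seed)
    # Disjoint folds of approximately equal size
    folds = np.array_split(rng.permutation(len(y)), K)
    accs = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = fit_ridge_logistic(X[train], y[train], alpha)
        # Predict class 1 when the estimated probability is at least 0.5
        y_hat = (sigmoid(X[val] @ w) >= 0.5).astype(float)
        accs.append(np.mean(y_hat == y[val]))
    return np.array(accs)

# Synthetic data with an intercept column (illustrative only)
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(300, 2))
y = (X_raw[:, 0] + 0.5 * X_raw[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(float)
X = np.hstack([np.ones((300, 1)), X_raw])

accs = kfold_accuracies(X, y, alpha=0.01)
print(accs)
```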

---

### Cross-Validated Performance

The cross-validated accuracy for a given value of $\alpha$ is computed by averaging over the $K$ folds,
$$
\overline{\mathrm{Acc}}(\alpha)
=
\frac{1}{K}
\sum_{k=1}^{K}
\mathrm{Acc}_k(\alpha).
$$

This quantity provides an empirical estimate of the model’s generalisation performance for a given regularisation strength.

---

### Hyperparameter Selection

The final regularisation parameter was selected according to
$$
\alpha^\ast
=
\arg\max_{\alpha}
\overline{\mathrm{Acc}}(\alpha).
$$

In this study, the selected value was $\alpha^\ast = 0.01$, which achieved the highest average cross-validation accuracy. This value was then used to train the final ridge logistic regression model on the full training set before evaluation on the held-out test set.
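
The selection rule can be illustrated using the rounded accuracies reported in the results table of this report. At that precision, $\alpha = 0.001$ and $\alpha = 0.01$ tie, so the outcome depends on how ties are broken. The sketch below breaks ties in favour of the larger $\alpha$; this rule is an assumption made for illustration and is not necessarily the one used in the notebook, where the unrounded accuracies may differ.

```python
import numpy as np

# Grid and rounded CV accuracies as reported in the results table
alphas = np.array([0.0, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1])
cv_acc = np.array([0.91276, 0.91276, 0.91276, 0.91306, 0.91364, 0.91364, 0.89959])

def select_alpha(alphas, scores, atol=1e-12):
    """Pick the largest alpha whose score matches the maximum.

    Assumes `alphas` is sorted in ascending order. Preferring the larger
    alpha among ties is a hypothetical tie-breaking rule.
    """
    tied = np.flatnonzero(np.isclose(scores, scores.max(), rtol=0.0, atol=atol))
    return alphas[tied[-1]]

alpha_star = select_alpha(alphas, cv_acc)
print(alpha_star)

# For comparison, np.argmax returns the first maximiser
print(alphas[np.argmax(cv_acc)])
```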

---

### Remarks

K-fold cross-validation allows for principled model selection while avoiding optimistic bias that would arise from tuning hyperparameters on the same data used for training. Moreover, ridge regularisation is particularly well suited to this dataset due to the presence of strong correlations among income and asset-related features, as identified during the exploratory data analysis.


### Model Selection

### Benchmark Models

## Results

### Ridge Logistic Regression

The ridge logistic regression model depends on a regularisation parameter $\alpha \geq 0$, which controls the strength of the $\ell_2$ penalty. The final model was selected via $K$-fold cross-validation (CV) over $\alpha$.

Seven values of $\alpha$ were tested, with the following results:

| $\alpha$   | CV Accuracy |
|------------|-------------|
| 0.0        | 0.91276     |
| 1e-06      | 0.91276     |
| 1e-05      | 0.91276     |
| 0.0001     | 0.91306     |
| 0.001      | 0.91364     |
| 0.01       | 0.91364     |
| 0.1        | 0.89959     |

Note that $\alpha = 0$ corresponds to no regularisation. Cross-validation accuracy is stable for small values of $\alpha$ and deteriorates at the largest value tested, $\alpha = 0.1$. The selected value was $\alpha = 0.01$, with a mean cross-validation accuracy of approximately **0.914**; at the reported precision, $\alpha = 0.001$ attains the same accuracy.

When evaluated on the held-out test set, the final model achieved the following results:

| Metric | Value | Formula |
|--------|-------|---------|
| Accuracy | 0.919 | $\displaystyle \frac{TP + TN}{n}$ |
| Misclassification Rate | 0.081 | $\displaystyle \frac{FP + FN}{n}$ |
| Precision | 0.933 | $\displaystyle \frac{TP}{TP + FP}$ |
| Recall (TPR) | 0.931 | $\displaystyle \frac{TP}{TP + FN}$ |
| False Positive Rate (FPR) | 0.098 | $\displaystyle \frac{FP}{FP + TN}$ |

Where:

- $TP$ = True Positives  
- $TN$ = True Negatives  
- $FP$ = False Positives  
- $FN$ = False Negatives  
- $n = TP + TN + FP + FN$ = number of observations in the test set  
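
The formulas in the table can be computed directly from the confusion-matrix counts. The sketch below uses a small made-up label vector, not the report's test set, and the function name is an assumption.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Compute the tabulated metrics from binary labels (1 = approved)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    n = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / n,
        "misclassification_rate": (fp + fn) / n,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }

# Made-up labels: 4 true positives, 4 true negatives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

The ratios are undefined when a denominator is zero (for example, when no positives are predicted); the sketch does not guard against that case.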



## Model Interpretation

## Discussion

## Conclusion