# Machine Learning in Python - Project 2

Alfie Plant, Markus Emmott, Oscar Youngman, Ashe Raymond-Barker

## Setup

*Install any packages here, define any functions if needed, and load data*

In [1]:
# Add any additional libraries or submodules below

# Data libraries
import pandas as pd
import numpy as np

# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Plotting defaults
plt.rcParams['figure.figsize'] = (8,5)
plt.rcParams['figure.dpi'] = 80

# sklearn modules
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC  
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import Binarizer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import GridSearchCV, KFold, StratifiedKFold
from sklearn.ensemble import BaggingClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score
from sklearn.metrics import RocCurveDisplay, roc_auc_score
from sklearn.metrics import PrecisionRecallDisplay

# Introduction

Brief discussion of problem and approaches used

### Data Overview

Data type of each feature and a short description, indicate features that can be excluded.

Columns that have been removed:

`ppmt_pnlty`, `prod_type`, `id_loan`, `id_loan_rr` and `io_ind`

In [12]:
d = pd.read_csv("freddiemac.csv", low_memory=False)
d = d.drop(columns=['id_loan', 'id_loan_rr', 'io_ind', 'prod_type', 'ppmt_pnlty'])


### Missing Data

The data include approximately 20,000 null values for **cd_msa**, whereas **zipcode** has no missing values. We have therefore decided to use zipcode in the analysis; this is discussed in more detail in the exploratory data analysis.
The remaining features that contain null values are **flag_sc** and **rr_ind**. These are categorical variables in which NaN means 'No', so all NaN entries have been replaced with 'N'.

There are 41 missing credit scores in the dataset, and 3 of these loans were defaults. Since the dataset does not contain many defaults, we do not want to exclude this information and we will proceed by ***add detail here...***. Only one observation is missing the mortgage insurance percentage; since this loan was not a default, we exclude it from the dataset. There are 6 missing observations for combined loan-to-value, and 2 of these loans are also the ones missing loan-to-value, which occurs when the loanee has no other loans. None of these loans were defaults, so they have been excluded.
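Counts like those quoted above can be reproduced with a short helper. The following is a minimal sketch rather than the exact code used for this report: the sentinel codes for `fico`, `mi_pct` and `cltv` are the ones used in this section's cleaning code, while the function name `missing_summary`, the label `'default'` for defaulted loans and the toy data are assumptions made for illustration.

```python
import pandas as pd

# Sentinel codes that mark "not available" (as used in this section's cleaning code)
SENTINELS = {'fico': 9999, 'mi_pct': 999, 'cltv': 999}


def missing_summary(df, sentinels=SENTINELS, status_col='loan_status',
                    default_label='default'):
    """Count sentinel-coded missing values per feature and how many are defaults.

    `default_label` is an assumed name for the defaulted class.
    """
    is_default = df[status_col] == default_label
    rows = []
    for col, code in sentinels.items():
        is_missing = df[col] == code
        rows.append({
            'feature': col,
            'n_missing': int(is_missing.sum()),
            'n_missing_defaults': int((is_missing & is_default).sum()),
        })
    return pd.DataFrame(rows).set_index('feature')


# Toy data for illustration only (not taken from freddiemac.csv)
toy = pd.DataFrame({
    'fico': [700, 9999, 9999, 650],
    'mi_pct': [0, 25, 999, 0],
    'cltv': [80, 999, 95, 60],
    'loan_status': ['prepaid', 'default', 'prepaid', 'default'],
})
summary = missing_summary(toy)
print(summary)
```

Running the helper before any rows are dropped makes it easy to check whether excluding a feature's missing values would also discard defaults.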

There are 2,412 missing values for the debt-to-income ratio and, more notably, a disproportionate number of these observations are loans that have defaulted. A debt-to-income ratio greater than 65% is recorded as a missing value. These are loans where the monthly debt payments exceed 65% of the loanee's monthly income, suggesting that they are higher-risk loans. To deal with this, we have discretised debt-to-income and encoded it ordinally. Cross-validation suggested 10 as a reasonable number of bins. ***check this, should we do CV using a logistic model which typically has bad performance***
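One way to implement the encoding described above is sketched below. It is an illustration under assumptions, not necessarily the code used for the final model: the sentinel value `999` for a missing debt-to-income ratio, the helper name `discretise_dti` and the equal-width (`'uniform'`) binning strategy are all assumed. Bin edges are learned from observed training values only, and missing values are placed in an extra top category so that they rank above every observed bin.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

DTI_MISSING = 999  # assumed sentinel for "not available" (ratio above 65%)


def discretise_dti(train_dti, new_dti, n_bins=10):
    """Encode DTI ordinally: observed values get codes 0..k-1, missing values get k."""
    train_dti = np.asarray(train_dti, dtype=float)
    new_dti = np.asarray(new_dti, dtype=float)

    # Learn the bin edges from observed training values only
    observed = train_dti[train_dti != DTI_MISSING].reshape(-1, 1)
    binner = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='uniform')
    binner.fit(observed)

    # Start with every value in the extra top category reserved for missing DTI
    top_code = int(binner.n_bins_[0])
    codes = np.full(new_dti.shape, top_code, dtype=int)

    # Overwrite the observed entries with their learned bin index
    is_observed = new_dti != DTI_MISSING
    if is_observed.any():
        binned = binner.transform(new_dti[is_observed].reshape(-1, 1))
        codes[is_observed] = binned.ravel().astype(int)
    return codes


# Toy example: observed ratios 1..65 plus a few missing entries
train_dti = np.array(list(range(1, 66)) + [DTI_MISSING] * 5)
codes = discretise_dti(train_dti, [1, 30, 65, DTI_MISSING])
print(codes)
```

Treating the missing category as the highest ordinal level reflects the interpretation that these loans carry the largest debt burden; whether that ordering helps the model is something to confirm on the validation set.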

We interpret the value of 9 for the program indicator as Not Applicable, meaning that the loan is not part of a program. There could be missing data embedded within this category; however, because it acts as a baseline associated with no program, this assumption does not lose any information. Finally, there are 125 values missing for property value. None of these are defaults, so they have been excluded.

In [None]:
d['flag_sc'] = d['flag_sc'].fillna('N')
d['rr_ind'] = d['rr_ind'].fillna('N')
d = d[d['fico'] != 9999] # removed missing credit scores for now but perhaps we should reconsider
d = d[d['mi_pct'] != 999]
d = d[d['cltv'] != 999]
d = d[d['property_val'] != 9]

### Data Split

***To avoid any data leakage, the dataset is split into training and test sets from this point onwards.***

We will separate active loans from non-active loans. The non-active loans will be split into a training set (70%) to build the model, a validation set (15%) to tune the model, and a test set (15%). Once the final model has been determined, it will be used to offer insight on active loans.

In [21]:
active = d[d['loan_status'] == 'active']
nonactive = d[d['loan_status'] != 'active']

# Feature matrix and response vector
X, y = nonactive.drop(['loan_status'], axis=1), nonactive['loan_status']

# Convert to numpy array
X = X.values

# Encode default
y = LabelEncoder().fit_transform(y)

# Hold out 30% of the non-active loans, stratified on the response
X_train, X_tv, y_train, y_tv = train_test_split(X, y, shuffle=True,
                                                test_size=0.3, random_state=1112, stratify=y)

# Split the held-out 30% equally into test and validation sets (15% each)
X_test, X_val, y_test, y_val = train_test_split(X_tv, y_tv, shuffle=True,
                                                test_size=0.5, random_state=1112, stratify=y_tv)

# Convert back to DataFrame. The response is appended as the last column,
# so the column names are listed in that order.
cols = list(nonactive.columns.drop('loan_status')) + ['loan_status']
df_train = pd.DataFrame(np.concatenate([X_train, y_train.reshape(-1, 1)], axis=1), columns=cols)
df_test = pd.DataFrame(np.concatenate([X_test, y_test.reshape(-1, 1)], axis=1), columns=cols)
df_val = pd.DataFrame(np.concatenate([X_val, y_val.reshape(-1, 1)], axis=1), columns=cols)
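As a possible alternative to converting to NumPy and back, the same 70/15/15 stratified split can be applied to the DataFrame directly, which preserves column names and dtypes. This is a sketch under assumptions: the helper name `split_train_val_test` and the toy data (including the `'default'` and `'prepaid'` labels) are illustrative and not part of the analysis above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_val_test(df, target='loan_status', seed=1112):
    """Stratified 70/15/15 split that returns DataFrames."""
    # Hold out 30% of the rows, keeping the class balance of `target`
    df_train, df_hold = train_test_split(
        df, test_size=0.3, random_state=seed, stratify=df[target])
    # Split the held-out rows equally into validation and test sets
    df_val, df_test = train_test_split(
        df_hold, test_size=0.5, random_state=seed, stratify=df_hold[target])
    return df_train, df_val, df_test


# Toy data for illustration only: 200 loans, 10% of which are defaults
toy = pd.DataFrame({
    'fico': range(600, 800),
    'loan_status': ['default'] * 20 + ['prepaid'] * 180,
})
toy_train, toy_val, toy_test = split_train_val_test(toy)
print(len(toy_train), len(toy_val), len(toy_test))
```

Because the rows keep their original index, it is straightforward to confirm that the three sets do not overlap.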

# Exploratory Data Analysis

### Categorical Data

### Numerical Data

### Geographical Data

# Model Fitting and Tuning

### Logistic Regression

### Linear SVM

### Non-Linear SVM

### Random Forests

### Neural Networks

### Final Model


# Discussion & Conclusions

*In this section you should provide a general overview of your final model, its performance, and reliability. You should discuss what the implications of your model are in terms of the included features, estimated parameters and relationships, predictive performance, and anything else you think is relevant.*

*This should be written with a target audience of a banking official, who understands the issues associated with mortgage defaults but may only have university level mathematics (not necessarily postgraduate statistics or machine learning). Your goal should be to highlight to this audience how your model can be useful. You should also discuss potential limitations or directions of future improvement of your model.*

*Finally, you should include recommendations on factors that may increase the risk of default, which may be useful for the companies to improve their understanding of mortgage defaults, and also to explain their decisions to clients and regulatory bodies. You should also use your model to inform the company of any active loans that are at risk of default.*

*Keep in mind that a negative result, i.e. a model that does not work well predictively, that is well explained and justified in terms of why it failed will likely receive higher marks than a model with strong predictive performance but with poor or incorrect explanations / justifications.*

# Generative AI statement

*Include a statement on how generative AI was used in the project and report.*

# References

*Include references if any*

In [None]:
# Run the following to render to PDF
!jupyter nbconvert --to pdf project2.ipynb