# EDA and Notes

## Project Goal - Predicting Termination Likelihood of Financial Arrangements

This use case focuses on building a predictive model using an anonymized dataset containing detailed records of financial arrangements. The target is the 'Terminated' feature, a binary variable describing whether an arrangement was terminated. 

## Project steps

1. Exploratory Data Analysis
2. Data pre-processing
3. Model Building and Fine-tuning
4. Evaluating results


Importing packages, loading data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Show every column when displaying DataFrames
pd.set_option('display.max_columns', None)

In [None]:
data = pd.read_csv('data/dataset_v2.csv')
data.info()

The dataset contains a mix of float64 and int64 columns, and the dtype does not consistently separate categorical from numerical features. Some features described in the task document are missing from the dataset, but these would have no predictive value anyway (either proxies or unavailable at the time of prediction).
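
Because the dtypes do not mark which columns are categorical, one rough way to shortlist candidates is to count distinct values per column. The sketch below is only a heuristic: the helper name `low_cardinality_columns`, the `max_unique=10` cutoff, and the toy values are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd


def low_cardinality_columns(df: pd.DataFrame, max_unique: int = 10) -> pd.Series:
    """Return columns with at most `max_unique` distinct non-null values."""
    counts = df.nunique(dropna=True)
    return counts[counts <= max_unique].sort_values()


# Toy frame for illustration only; the values are made up
toy_dtypes = pd.DataFrame({
    "home_owner_flag": [0, 1, 1, 0, 1],
    "arrears_amount": [10.5, 0.0, 250.0, 31.2, 7.9],
})
candidates = low_cardinality_columns(toy_dtypes, max_unique=3)
print(candidates)
```

On the real data this would be called as `low_cardinality_columns(data)`, and the shortlist still needs a manual check against the task document.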

Several features with large proportion of null values:

1. arrears_months - numerical, HANDLED
2. Arrears_Category - categorical, DROPPED
3. NoMonths_FirstPayment - numerical, HANDLED
4. NoMonth_FirstMissedPayment - numerical, HANDLED
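
The null proportions behind a list like the one above can be read off with `isna().mean()`. This is a minimal sketch; the helper name `null_share` and the toy frame are hypothetical, and the real call would be `null_share(data)`.

```python
import numpy as np
import pandas as pd


def null_share(df: pd.DataFrame, min_share: float = 0.0) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    share = df.isna().mean().sort_values(ascending=False)
    # Keep only columns whose missing share exceeds the cutoff
    return share[share > min_share]


# Toy frame for illustration only; the values are made up
toy_nulls = pd.DataFrame({
    "arrears_months": [np.nan, np.nan, 3.0, np.nan],
    "Terminated": [0, 1, 0, 1],
})
null_summary = null_share(toy_nulls)
print(null_summary)
```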

On inspection, a large proportion of features raise ethical/legal (GDPR) considerations. Protected characteristics include:

1. Age
2. Sex
3. Race
4. Socioeconomic attributes
5. Marriage / Civil Partnership status

There is no information on a DPIA or informed consent, so it is best to drop these; see the full list in data_preprocessing.py. Worth noting that in initial model tests, the social grade percentages are among the top 10 most important features.

In [None]:
data.describe()

Mean disposable income is greater than mean household income? Consider dropping these features unless further information is available.

In [None]:
print(data['Terminated'].value_counts())
plt.figure(figsize=(4, 6))
sns.countplot(x='Terminated', data=data, palette=['#e424b2', '#172344'])
plt.title("Target Distribution")
plt.show()

In [None]:
# Find duplicate rows
duplicate_rows = data[data.duplicated()]
print(duplicate_rows)

Removing features with ethical or data protection concerns

In [None]:
# Each column is listed once (the original list repeated several names)
data.drop(['Partner_Gender',
           'Partner_Employment_Status',
           'Output_Area_Classification_Code',
           'Lower_Super_Output_Area_Code',
           'AB',
           'C1',
           'C2',
           'DE',
           'DOB_Year',
           'DOB_Month',
           'Gender',
           'under_18',
           'Marital_Status',
           'Physical_Disability_Vulnerability',
           'Illness_Vulnerability',
           'Addiction_Vulnerability',
           'Mental_Health_Vulnerability',
           'age',
           'age_partner'],
          axis=1, inplace=True)

In [None]:
correlation_data = data.corr()
plt.figure(figsize=(24, 20))
# Apply a mask to the upper triangle
mask = np.triu(np.ones_like(correlation_data, dtype=bool))
sns.heatmap(correlation_data, annot=False, cmap='coolwarm', fmt='.2f', linewidths=2, mask=mask)
plt.title('Correlation Matrix')
plt.show()

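# Correlation of every feature with the target, strongest positive first
plt.figure(figsize=(6, 10))
sns.heatmap(correlation_data[['Terminated']].sort_values(by='Terminated', ascending=False), annot=True, cmap='coolwarm', fmt='.2f', linewidths=2)
plt.title('Correlation with Terminated')
plt.show()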

High correlation between arrears category, amount, and months. Strong inverse correlation between contributions expected, months before first missed payment and contributions received to date. 

Not necessarily causation but worth noting for evaluation. Strong correlation between arrears amount and arrears category, but without more information I'm uncomfortable using imputation for ~73% of the arrangements. 
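
To back the heatmap reading with numbers, the strongly correlated pairs can be listed directly. This is a sketch under assumptions: the helper name `top_correlated_pairs`, the 0.8 threshold, and the toy frame are illustrative and not from the original notebook.

```python
import pandas as pd


def top_correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """List each feature pair once where |Pearson r| >= threshold."""
    corr = df.select_dtypes("number").corr()
    cols = corr.columns
    rows = []
    # Walk the strictly lower triangle so (a, b) and (b, a) are not both reported
    for i in range(len(cols)):
        for j in range(i):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                rows.append((cols[i], cols[j], r))
    pairs = pd.DataFrame(rows, columns=["feature_a", "feature_b", "r"])
    # Strongest relationships first, regardless of sign
    return pairs.sort_values("r", key=abs, ascending=False).reset_index(drop=True)


# Toy frame for illustration only; the values are made up
toy_corr = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # exact multiple of "a"
    "c": [5, 1, 4, 2, 3],
})
pairs = top_correlated_pairs(toy_corr)
print(pairs)
```

On the real data the call would be `top_correlated_pairs(data)`; the threshold is a judgment call and worth varying.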

In [None]:
categorical_features = ['Employment_Status', 
                        'home_owner_flag', 
                        'Arrears_Category', 
                        'agreed_missed_flag']

# Three plots per row; round up so every feature gets an axis
n_rows = (len(categorical_features) + 2) // 3

fig, axes = plt.subplots(n_rows, 3, figsize=(16, 6 * n_rows))
axes = axes.ravel()  # flat indexing works for one row or many

for ax, feature in zip(axes, categorical_features):
    sns.countplot(x=feature, data=data, ax=ax)
    ax.set_title(f'Distribution of {feature}')
    ax.set_xlabel(feature)
    ax.set_ylabel('Count')

# Remove the unused axes left over in the last row
for ax in axes[len(categorical_features):]:
    fig.delaxes(ax)

plt.tight_layout()
plt.show()


In [None]:
# Disable future warnings - not a great fix
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Everything not listed as categorical is treated as numerical.
# A list comprehension keeps the column order stable (a set difference does not).
numerical_features = [col for col in data.columns if col not in categorical_features]
num_numerical_features = len(numerical_features)
numerical_rows = (num_numerical_features + 2) // 3

fig, axes = plt.subplots(numerical_rows, 3, figsize=(16, 6 * numerical_rows))
axes = axes.ravel()  # flat indexing works for one row or many

for ax, feature in zip(axes, numerical_features):
    sns.histplot(data[feature], kde=True, ax=ax)
    ax.set_title(f"Distribution of {feature}")
    ax.set_xlabel(feature)
    ax.set_ylabel("Count")

# Remove the unused axes left over in the last row
for ax in axes[num_numerical_features:]:
    fig.delaxes(ax)

plt.tight_layout()
plt.show()




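Numerical features don't all appear normally distributed or, from inspection, independent. Logistic regression does not require normally distributed inputs, but correlated features make its coefficients unstable, so rule it out for now. There is an outlier in NoMonth_FirstMissedPayment; drop that row for training/testing.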
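
The notes do not say how the outlier was identified, so the sketch below assumes Tukey's IQR rule as one common option. The helper name `iqr_outlier_mask`, the multiplier `k=1.5`, and the toy values are assumptions; the column name follows the spelling used in the null-value list above.

```python
import pandas as pd


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values more than k * IQR below Q1 or above Q3 (NaN is never flagged)."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Toy column for illustration only; the values are made up
toy_outlier = pd.DataFrame({"NoMonth_FirstMissedPayment": [2, 3, 4, 3, 5, 4, 120]})
mask = iqr_outlier_mask(toy_outlier["NoMonth_FirstMissedPayment"])
cleaned = toy_outlier[~mask]
print(f"Dropped {mask.sum()} row(s), kept {len(cleaned)}")
```

With heavily skewed features the 1.5 multiplier can flag many legitimate rows, so a larger `k` or a manual check of the flagged rows may be more appropriate here.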

Why is household disposable income often much larger than household income? Does total income refer only to wages, while disposable income includes other sources? Recommend dropping these features unless more is known about the source of the data.

In [None]:
# Step by 3 to match the slice width; a step of 5 silently skipped features
for i in range(0, len(numerical_features), 3):
    sns.pairplot(data, 
                 x_vars=numerical_features[i:i+3], 
                 y_vars=['Terminated'], 
                 hue='Terminated', 
                 palette=['#e424b2', '#172344'])


In [None]:
num_features = len(numerical_features)
num_rows = (num_features + 2) // 3
fig, axes = plt.subplots(num_rows, 3, figsize=(18, 6*num_rows))

for i, feature in enumerate(numerical_features):
    row = i // 3
    col = i % 3
    ax = axes[row, col]
    sns.boxplot(x='Terminated', y=feature, data=data, ax=ax)
    ax.set_title(f'Boxplot of {feature} by Terminated')
    ax.set_xlabel('Terminated')
    ax.set_ylabel(feature)

# Remove empty subplots
if num_features % 3 != 0:
    for i in range(num_features % 3, 3):
        fig.delaxes(axes[num_rows-1, i])

plt.tight_layout()
plt.show()


Distributions agree with the correlations and feature importances.

In [None]:
# Count the number of entries where disposable income is greater than total household income
print(data[data['household_DI'] > data['household_income']].shape[0])