**Full name**: Dang Ha Nguyen (Janice) Nguyen

**ID**: 47856491

**Repository**: [GitHub Repository](https://github.com/JaniceNguyen/BUSA8001_Assignment1)

# BUSA8001 - APPLIED PREDICTIVE ANALYTICS

---

#### **Introduction**
1. **Project Overview**:  
This project involves analyzing a dataset provided by a financial institution to understand the credit behavior of its clients. The analysis is based on two primary datasets: `application_record.csv` and `credit_record.csv`. The `application_record.csv` dataset contains detailed information about the clients, such as their income, education, employment, and demographic details. The `credit_record.csv` dataset provides a monthly record of the clients' credit status, indicating whether they have missed payments or are overdue.

The objective of this project is to preprocess the data, explore it, and develop predictive models that can help identify clients who are at risk of defaulting on their loans. By leveraging the insights gained from these models, the financial institution can better manage its credit risk and make more informed lending decisions.

2. **Problem Statement**:  
The specific problem this project aims to solve is the identification of clients who are likely to default on their loans. Defaulting clients pose a significant risk to financial institutions, and being able to predict such behavior allows the institution to take proactive measures. The goal is to use the historical application and credit data to build a model that can accurately predict whether a client will default based on their past credit records and application data.

3. **Data Description**:  
For this assignment, the `data` folder contains two files, `credit_record.csv` and `application_record.csv`, in which bank clients are linked by the `ID` column.

In `application_record.csv` we have the following variables

| Feature Name         | Explanation     | Additional Remarks |
|--------------|-----------|-----------|
| ID | Randomly allocated client number      |         |
| AMT_INCOME   | Annual income  |  |
| NAME_INCOME_TYPE   | Income Source |  |
| NAME_EDUCATION_TYPE   | Level of Education  |  |
| CODE_GENDER   | Applicant's Gender   |  |
| FLAG_OWN_CAR | Car Ownership |  | 
| CNT_CHILDREN | Number of Children | |
| FLAG_OWN_REALTY | Real Estate Ownership | | 
| NAME_FAMILY_STATUS | Relationship Status | | 
| NAME_HOUSING_TYPE | Housing Type | | 
| DAYS_BIRTH | No. of Days | Count backwards from current day (0), -1 means yesterday |
| DAYS_EMPLOYED | No. of Days | Count backwards from current day (0). If positive, it means the person is currently unemployed. |
| FLAG_MOBIL | Mobile Phone Ownership | | 
| FLAG_WORK_PHONE | Work Phone Ownership | | 
| FLAG_PHONE | Landline Phone Ownership | | 
| FLAG_EMAIL | Email Ownership | | 
| OCCUPATION_TYPE | Occupation | | 
| CNT_FAM_MEMBERS | Count of Family Members | |



In `credit_record.csv` we have the following variables


| Feature Name         | Explanation     | Additional Remarks |
|--------------|-----------|-----------|
| ID | Randomly allocated client number | |
| MONTHS_BALANCE | Number of months in the past from now when STATUS is measured | 0 = current month, -1 = last month, -2 = two months ago, etc.|
| STATUS | Number of days a payment is past due | 0: 1-29 days past due; 1: 30-59 days past due; 2: 60-89 days overdue; 3: 90-119 days overdue; 4: 120-149 days overdue; 5: overdue or bad debts, write-offs for more than 150 days; C: paid off that month; X: no loan for the month |

The two datasets are linked by the `ID` column, which represents the unique identifier for each client. By merging these datasets, we can analyze how the characteristics captured in `application_record.csv` correlate with the credit behavior recorded in `credit_record.csv`. This combined analysis will be crucial for building models to predict loan defaults.

---

In [99]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

#### **Task 1: Reading, Summarizing, and Cleaning Data**

##### **Question 1**

1. **Loading Data**:
   - Import the datasets using pandas into `df_application` and `df_credit`.
   - Check the number of rows in each dataset.
   - Determine the number of unique clients in each dataset.

In [100]:
# Import data from csv file and store it in a DataFrame 
df_application = pd.read_csv('data/application_record.csv')
df_credit = pd.read_csv('data/credit_record.csv')

In [101]:
# Describe the data
## Number of rows in each DataFrame
print(f"Number of rows in df_application: {df_application.shape[0]}")
print(f"Number of rows in df_credit: {df_credit.shape[0]}")

Number of rows in df_application: 438445
Number of rows in df_credit: 1047185


In [102]:
## Number of unique values in each DataFrame
print(f"Number of unique clients in df_application: {df_application['ID'].nunique()}")
print(f"Number of unique clients in df_credit: {df_credit['ID'].nunique()}")

Number of unique clients in df_application: 438398
Number of unique clients in df_credit: 45924


2. **Data Merging**:
   - Merge `df_application` and `df_credit` on the `ID` column to create a combined dataset `df`.
   - Evaluate the combined dataset by checking the number of rows and unique clients.

In [103]:
# Merge the DataFrames
df = pd.merge(df_application, df_credit, on='ID', how='inner')

In [104]:
# Describe the merged data
## Number of rows in the merged DataFrame
print(f"Number of rows in df: {df.shape[0]}")
## Number of unique clients in the merged DataFrame
print(f"Number of unique clients in df: {df['ID'].nunique()}")

Number of rows in df: 776325
Number of unique clients in df: 36396


The fact that there are 776,325 rows in `df` but only 36,396 unique clients suggests that each client (`ID`) is represented by multiple rows. This aligns with our understanding that the `credit_record.csv` file contains multiple records per client, corresponding to different months of credit activity.
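
The row counts above follow from how an inner merge behaves. The sketch below illustrates this on made-up toy frames (the names `toy_application`, `toy_credit`, and `toy_merged`, and all values in them, are assumptions for illustration and are not taken from the assignment data): IDs missing from either side are dropped, and the application columns are repeated once per monthly credit record.

```python
import pandas as pd

# Toy stand-ins for the two files (all values are made up for illustration)
toy_application = pd.DataFrame({"ID": [1, 2, 3], "AMT_INCOME": [50000, 72000, 61000]})
toy_credit = pd.DataFrame({
    "ID": [1, 1, 1, 2, 4],
    "MONTHS_BALANCE": [0, -1, -2, 0, 0],
    "STATUS": ["C", "0", "1", "X", "0"],
})

# An inner join keeps only IDs present in both frames and repeats the
# application columns once per monthly credit record
toy_merged = pd.merge(toy_application, toy_credit, on="ID", how="inner")

# Number of monthly records per client after the merge
rows_per_client = toy_merged.groupby("ID").size()

print(toy_merged)
print(rows_per_client)
```

In this toy case, client 3 (no credit history) and client 4 (no application) disappear, while client 1 contributes three rows.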

3. **Exploratory Data Analysis (EDA)**:
   - Describe how the merged data contains multiple rows per `ID` and what differentiates these rows.

To demonstrate how multiple rows for each `ID` in the merged dataset (`df`) are different, we can use code to show the differences between these rows. Specifically, we'll focus on the `MONTHS_BALANCE` and `STATUS` columns, which distinguish the rows for each client.

In [105]:
# Selecting the first ID for demonstration
temp_id = df['ID'].iloc[0]
# Filter the DataFrame to show all records for this ID
df_filtered = df[df['ID'] == temp_id]

In [106]:
# Display the filtered data
df_filtered_sorted = df_filtered.sort_values(by='MONTHS_BALANCE', ascending=False)
print(df_filtered_sorted[['ID', 'MONTHS_BALANCE', 'STATUS']])

         ID  MONTHS_BALANCE STATUS
0   5008804               0      C
1   5008804              -1      C
2   5008804              -2      C
3   5008804              -3      C
4   5008804              -4      C
5   5008804              -5      C
6   5008804              -6      C
7   5008804              -7      C
8   5008804              -8      C
9   5008804              -9      C
10  5008804             -10      C
11  5008804             -11      C
12  5008804             -12      C
13  5008804             -13      1
14  5008804             -14      0
15  5008804             -15      X


The output shows multiple rows for the selected `ID` (the first client in `df`, used here for demonstration), where each row is a different month's credit record. The `MONTHS_BALANCE` column gives the number of months before the current month, while the `STATUS` column shows the credit status for that month.

This shows that the rows for each `ID` differ because they capture the client's credit behavior in different months.

##### **Question 2**

1. **Change the Values of `STATUS`**

We need to map the values of the `STATUS` column in `df` according to the given rules:

- Map `{C, X, 0}` to `0`
- Map `{1, 2, 3, 4, 5}` to `1`

In [107]:
# Map the STATUS values
df['STATUS'] = df['STATUS'].replace({'C': '0', 'X': '0', '0': '0', '1': '1', '2': '1', '3': '1', '4': '1', '5': '1'})

# Ensure that STATUS is of integer type
df['STATUS'] = df['STATUS'].astype(int)
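
As an alternative way to express the same rule, the mapping can be written as a membership test. This is a minimal sketch on a toy Series, not the code used in this notebook; `toy_status` and `past_due_codes` are hypothetical names introduced for illustration.

```python
import pandas as pd

# One example of every STATUS code described in the data dictionary
toy_status = pd.Series(["C", "X", "0", "1", "2", "3", "4", "5"])

# Codes that count as past due (30 days or more) under the mapping rule
past_due_codes = ["1", "2", "3", "4", "5"]

# True/False membership converted to 1/0 gives the binary STATUS directly
toy_binary = toy_status.isin(past_due_codes).astype(int)

print(toy_binary.tolist())
```

Any code outside `past_due_codes` becomes 0 here, so unexpected values would be silently treated as not past due; the explicit dictionary above makes such values easier to notice.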

2. **Create `list_of_past_due`**

To identify clients with `STATUS = 1` at any point in the last 12 months, we filter the data on `MONTHS_BALANCE >= -12`, which keeps the current month (0) and the 12 months before it:

In [108]:
# Filter for records in the last 12 months
df_last_12_months = df[df['MONTHS_BALANCE'] >= -12]

# Find IDs with STATUS = 1
list_of_past_due = df_last_12_months[df_last_12_months['STATUS'] == 1]['ID'].unique()
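
The same selection can be checked on a small example. The sketch below uses made-up records (the frame `toy` and the names `recent`, `ever_past_due`, and `toy_past_due_ids` are assumptions for illustration) and takes the per-client maximum of the binary `STATUS` inside the window, which is 1 exactly when the client was past due at least once.

```python
import pandas as pd

# Toy monthly records with STATUS already mapped to 0/1 (values are made up)
toy = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3],
    "MONTHS_BALANCE": [0, -5, -13, 0, -1, -20],
    "STATUS": [0, 0, 1, 0, 1, 1],
})

# Keep only records inside the window used above
recent = toy[toy["MONTHS_BALANCE"] >= -12]

# The per-client maximum is 1 if the client was past due at least once
ever_past_due = recent.groupby("ID")["STATUS"].max()
toy_past_due_ids = ever_past_due[ever_past_due == 1].index.to_numpy()

print(toy_past_due_ids)
```

Client 1 was past due only 13 months ago and client 3 only 20 months ago, so neither is selected; only client 2 falls inside the window.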

3. **Create `df_final`**

Create a new DataFrame `df_final` for clients who had a past due status, ensuring only one row per `ID`.

In [109]:
# Filter df to include only the IDs in list_of_past_due
df_final = df[df['ID'].isin(list_of_past_due)].drop_duplicates(subset='ID')

# Number of rows in df_final
print(f"Number of rows in df_final: {df_final.shape[0]}")

Number of rows in df_final: 1830


4. **Add a New Column `y = 1`**

Add a new column `y` with the value `1` for all rows in `df_final`.

In [110]:
# Add y column
df_final['y'] = 1

5. **Increase `df_final` to 4,500 Rows**

We need to add more rows from `df` with IDs that are not in `list_of_past_due`.

In [111]:
# Filter for IDs not in list_of_past_due
df_remaining = df[~df['ID'].isin(list_of_past_due)].drop_duplicates(subset='ID')

# Add rows to df_final to reach 4,500 rows
df_final = pd.concat([df_final, df_remaining.iloc[:4500 - df_final.shape[0]]])

6. **Fill Missing Values of `y` and Remove Columns**

Fill missing `y` values with `0`, and remove the `STATUS` and `MONTHS_BALANCE` columns.

In [112]:
# Fill missing y values with 0
df_final.fillna({'y': 0}, inplace=True)

# Remove STATUS and MONTHS_BALANCE columns
df_final.drop(columns=['STATUS', 'MONTHS_BALANCE'], inplace=True)

##### **Question 3**

1. **Delete `ID` Column and Reset Index**

Remove the `ID` column from `df_final` and reset the index.

In [113]:
# Delete ID column and reset index
df_final = df_final.drop(columns=['ID']).reset_index(drop=True)

2. **Determine Numeric and Nominal Variables**

Analyzing the variables:

**Numeric Variables**:
   Numeric variables are those that represent quantitative measures and can be used for mathematical calculations. In the context of `df_final`, these include:
   - **AMT_INCOME**: Annual income (numeric value)
   - **CNT_CHILDREN**: Number of children (integer value)
   - **CNT_FAM_MEMBERS**: Number of family members (integer value)
   - **DAYS_BIRTH**: Age in days (integer value)
   - **DAYS_EMPLOYED**: Employment duration in days (integer value)
   - **FLAG_MOBIL**: Mobile phone ownership (binary indicator, numeric)
   - **FLAG_WORK_PHONE**: Work phone ownership (binary indicator, numeric)
   - **FLAG_PHONE**: Landline phone ownership (binary indicator, numeric)
   - **FLAG_EMAIL**: Email ownership (binary indicator, numeric)

**Ordinal Variables**:
   Ordinal variables represent categories with a meaningful order but not necessarily a uniform scale between them. In `df_final`, the only ordinal variable is:
   - **NAME_EDUCATION_TYPE**: Education level (e.g., Secondary, Higher, etc., which have a meaningful order)

**Nominal Variables**:
   Nominal variables represent categories without a meaningful order. These are purely categorical and are used to label data. In `df_final`, nominal variables include:
   - **CODE_GENDER**: Gender (e.g., Male, Female)
   - **FLAG_OWN_CAR**: Car ownership status (e.g., Yes, No)
   - **FLAG_OWN_REALTY**: Real estate ownership status (e.g., Yes, No)
   - **NAME_INCOME_TYPE**: Source of income (e.g., Commercial associate, Working)
   - **NAME_FAMILY_STATUS**: Family status (e.g., Single, Married)
   - **NAME_HOUSING_TYPE**: Housing type (e.g., Apartment, House)
   - **OCCUPATION_TYPE**: Occupation (e.g., Drivers, Core staff)

Completing the table:

|Variable type|Number of features|Features' list|
| --- | --- | --- |
|Numeric:|9| AMT_INCOME, CNT_CHILDREN, CNT_FAM_MEMBERS, DAYS_BIRTH, DAYS_EMPLOYED, FLAG_MOBIL, FLAG_WORK_PHONE, FLAG_PHONE, FLAG_EMAIL |
|Ordinal:|1| NAME_EDUCATION_TYPE |
|Nominal:|7| CODE_GENDER, FLAG_OWN_CAR, FLAG_OWN_REALTY, NAME_INCOME_TYPE, NAME_FAMILY_STATUS, NAME_HOUSING_TYPE, OCCUPATION_TYPE |
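
A quick way to cross-check a classification like the one above is to let pandas separate numeric from text columns. The sketch below does this on a made-up frame (`toy_features` and its values are assumptions for illustration); it only separates storage types, so deciding whether a text column is ordinal or nominal still requires judgment.

```python
import pandas as pd

# Toy frame with a few of the columns discussed above (values are made up)
toy_features = pd.DataFrame({
    "AMT_INCOME": [50000.0, 72000.0],
    "CNT_CHILDREN": [0, 2],
    "FLAG_PHONE": [0, 1],
    "CODE_GENDER": ["F", "M"],
    "NAME_EDUCATION_TYPE": ["Higher education", "Lower secondary"],
})

# Columns stored as numbers versus everything else
numeric_cols = toy_features.select_dtypes(include="number").columns.tolist()
text_cols = toy_features.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(text_cols)
```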

3. **Find and Comment on Missing Values**

Finally, we can check for missing values in `df_final`:

In [114]:
# Check for missing values
missing_values = df_final.isnull().sum()

# Print missing values summary
print(missing_values)

CODE_GENDER               0
FLAG_OWN_CAR              0
FLAG_OWN_REALTY           0
CNT_CHILDREN             74
AMT_INCOME                0
NAME_INCOME_TYPE          0
NAME_EDUCATION_TYPE    1831
NAME_FAMILY_STATUS        0
NAME_HOUSING_TYPE         0
DAYS_BIRTH                0
DAYS_EMPLOYED             0
FLAG_MOBIL                0
FLAG_WORK_PHONE           0
FLAG_PHONE                0
FLAG_EMAIL                0
OCCUPATION_TYPE        1351
CNT_FAM_MEMBERS           0
y                         0
dtype: int64


**Variables with Missing Values:**
   - **`CNT_CHILDREN`:** 74 missing values
   - **`NAME_EDUCATION_TYPE`:** 1,831 missing values
   - **`OCCUPATION_TYPE`:** 1,351 missing values

**Variables with No Missing Values:**
   - **`CODE_GENDER`, `FLAG_OWN_CAR`, `FLAG_OWN_REALTY`, `AMT_INCOME`, `NAME_INCOME_TYPE`, `NAME_FAMILY_STATUS`, `NAME_HOUSING_TYPE`, `DAYS_BIRTH`, `DAYS_EMPLOYED`, `FLAG_MOBIL`, `FLAG_WORK_PHONE`, `FLAG_PHONE`, `FLAG_EMAIL`, `CNT_FAM_MEMBERS`, `y`** have no missing values.

---

#### **Task 2: Data Preprocessing**

##### **Question 4: Imputing missing values**

We'll impute missing values in `df_final` considering the type of each variable and best practices. Here's how we'll approach this:

1. For numeric variables, we'll use the median to impute missing values.

2. For categorical variables, we'll use the mode (most frequent value) to impute missing values.

In [115]:
# Impute numeric variables with median
numeric_columns = ['AMT_INCOME', 'CNT_CHILDREN', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'FLAG_MOBIL', 'FLAG_WORK_PHONE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS']
for col in numeric_columns:
    df_final[col] = df_final[col].fillna(df_final[col].median())

In [116]:
# Impute categorical variables with mode
categorical_columns = ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE']
for col in categorical_columns:
    df_final[col] = df_final[col].fillna(df_final[col].mode()[0])

In [117]:
# Verify that all missing values have been imputed
print(df_final.isnull().sum())

CODE_GENDER            0
FLAG_OWN_CAR           0
FLAG_OWN_REALTY        0
CNT_CHILDREN           0
AMT_INCOME             0
NAME_INCOME_TYPE       0
NAME_EDUCATION_TYPE    0
NAME_FAMILY_STATUS     0
NAME_HOUSING_TYPE      0
DAYS_BIRTH             0
DAYS_EMPLOYED          0
FLAG_MOBIL             0
FLAG_WORK_PHONE        0
FLAG_PHONE             0
FLAG_EMAIL             0
OCCUPATION_TYPE        0
CNT_FAM_MEMBERS        0
y                      0
dtype: int64
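
To make the two imputation rules concrete, here is a minimal sketch on a made-up frame (the frame `toy` and its values are assumptions for illustration). It applies the same `fillna` pattern as above: the median for a numeric column and the most frequent value for a categorical one.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one categorical column (values are made up)
toy = pd.DataFrame({
    "CNT_CHILDREN": [0, 1, np.nan, 2, 0],
    "OCCUPATION_TYPE": ["Drivers", None, "Core staff", "Drivers", None],
})

# Numeric column: fill with the median of the observed values
toy["CNT_CHILDREN"] = toy["CNT_CHILDREN"].fillna(toy["CNT_CHILDREN"].median())

# Categorical column: fill with the most frequent observed value
toy["OCCUPATION_TYPE"] = toy["OCCUPATION_TYPE"].fillna(toy["OCCUPATION_TYPE"].mode()[0])

print(toy)
```

Note that the median of an even number of observations can be a non-integer (0.5 in this toy case), which is worth keeping in mind for count variables such as `CNT_CHILDREN`.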


##### **Question 5: Converting values in NAME_EDUCATION_TYPE**

We need to convert the values in the `NAME_EDUCATION_TYPE` column to numeric values. We'll use the following mapping:
- Lower secondary -> 1
- Secondary / secondary special -> 2
- Incomplete higher -> 3
- Higher education -> 4

In [118]:
# Define the mapping for education types
education_mapping = {
    'Lower secondary': 1,
    'Secondary / secondary special': 2,
    'Incomplete higher': 3,
    'Higher education': 4
}

In [119]:
# Apply the mapping to the NAME_EDUCATION_TYPE column
df_final['NAME_EDUCATION_TYPE'] = df_final['NAME_EDUCATION_TYPE'].map(education_mapping)

In [120]:
# Verify the transformation
print(df_final['NAME_EDUCATION_TYPE'].value_counts())

NAME_EDUCATION_TYPE
2    3610
4     725
3     150
1      15
Name: count, dtype: int64
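
One property of `Series.map` with a dictionary is that any value missing from the dictionary becomes `NaN`. The sketch below shows a possible sanity check on a toy Series; the label `"Some other level"` is a hypothetical category invented for illustration, and `toy_education`, `toy_encoded`, and `unmapped` are assumed names.

```python
import pandas as pd

# Same ordinal mapping as above
education_order = {
    "Lower secondary": 1,
    "Secondary / secondary special": 2,
    "Incomplete higher": 3,
    "Higher education": 4,
}

# Toy Series containing one hypothetical label that is not in the mapping
toy_education = pd.Series(["Higher education", "Lower secondary", "Some other level"])
toy_encoded = toy_education.map(education_order)

# Labels that were not covered by the mapping end up as NaN
unmapped = toy_education[toy_encoded.isna()].unique().tolist()

print(toy_encoded.tolist())
print(unmapped)
```

In this notebook the four counts printed above sum to 4,500, so every row of `df_final` was covered by the mapping.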


##### **Question 6: Adding dummy variables for nominal features**

We'll create dummy variables for the nominal features in `df_final` to convert them into a format suitable for machine learning models.

In [121]:
# List of nominal features
nominal_features = ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_INCOME_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE']

In [122]:
# Create dummy variables
df_dummies = pd.get_dummies(df_final[nominal_features], drop_first=True)

In [123]:
# Remove original nominal features and add dummy variables
df_final = df_final.drop(columns=nominal_features).join(df_dummies)

In [124]:
# Verify the changes
print(df_final.columns)

Index(['CNT_CHILDREN', 'AMT_INCOME', 'NAME_EDUCATION_TYPE', 'DAYS_BIRTH',
       'DAYS_EMPLOYED', 'FLAG_MOBIL', 'FLAG_WORK_PHONE', 'FLAG_PHONE',
       'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'y', 'CODE_GENDER_M', 'FLAG_OWN_CAR_Y',
       'FLAG_OWN_REALTY_Y', 'NAME_INCOME_TYPE_Pensioner',
       'NAME_INCOME_TYPE_State servant', 'NAME_INCOME_TYPE_Student',
       'NAME_INCOME_TYPE_Working', 'NAME_FAMILY_STATUS_Married',
       'NAME_FAMILY_STATUS_Separated',
       'NAME_FAMILY_STATUS_Single / not married', 'NAME_FAMILY_STATUS_Widow',
       'NAME_HOUSING_TYPE_House / apartment',
       'NAME_HOUSING_TYPE_Municipal apartment',
       'NAME_HOUSING_TYPE_Office apartment',
       'NAME_HOUSING_TYPE_Rented apartment', 'NAME_HOUSING_TYPE_With parents',
       'OCCUPATION_TYPE_Cleaning staff', 'OCCUPATION_TYPE_Cooking staff',
       'OCCUPATION_TYPE_Core staff', 'OCCUPATION_TYPE_Drivers',
       'OCCUPATION_TYPE_HR staff', 'OCCUPATION_TYPE_High skill tech staff',
       'OCCUPATION_TYPE_IT staff',
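
The effect of `drop_first=True` is easiest to see on a small example. In this sketch (the frame `toy_nominal` and its values are made up for illustration), the first category of each feature in sorted order becomes the reference level and gets no column. Passing `dtype=int` is an optional choice made here so the dummies are stored as 0/1 integers rather than booleans.

```python
import pandas as pd

# Toy frame with two nominal features (values are made up)
toy_nominal = pd.DataFrame({
    "CODE_GENDER": ["F", "M", "F"],
    "NAME_INCOME_TYPE": ["Working", "Pensioner", "Student"],
})

# drop_first=True drops the first category of each feature (the reference level)
toy_dummies = pd.get_dummies(toy_nominal, drop_first=True, dtype=int)

print(toy_dummies)
```

Here `F` and `Pensioner` are the reference levels, so a row with all dummies equal to 0 represents a female pensioner.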

---

#### **Task 3: Preparing Data for Modeling**

##### **Question 7**

**1. Creating the y array**

- We use `df_final['y']` to select the 'y' column from our DataFrame.

- `.to_numpy()` converts this pandas Series to a NumPy array.

- `dtype=int` ensures that the values are stored as integers.

In [125]:
# Create y array from the 'y' column of df_final
y = df_final['y'].to_numpy(dtype=int)

**2. Creating the `X` array**
- `df_final.drop('y', axis=1)` creates a new DataFrame without the 'y' column.

- `.to_numpy()` converts this DataFrame to a NumPy array.

In [126]:
# Create X array from all remaining features in df_final
X = df_final.drop('y', axis=1).to_numpy()

##### **Question 8**

**1. Splitting the data into train and test sets**

- We use `train_test_split` from scikit-learn to split our data.

- `test_size=0.25` ensures 75% train and 25% test split.

- `random_state=8` sets a seed for reproducibility.

- `stratify=y` preserves the class proportions of `y` in both the train and test sets, so each split has approximately the same share of past-due clients.

In [127]:
# Split the data into train and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=8, stratify=y)
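
The effect of `stratify` can be verified on a toy target. This is a minimal sketch with made-up labels (`toy_X`, `toy_y`, and the split names are assumptions for illustration) that uses the same `test_size` and `random_state` as above and compares the share of positives in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target with 40% positives (values are made up)
toy_y = np.array([1] * 40 + [0] * 60)
toy_X = np.arange(100).reshape(-1, 1)

# Stratified 75/25 split with the same settings as above
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=8, stratify=toy_y
)

# Share of positives in each split
print(f"Train share of positives: {y_tr.mean():.3f}")
print(f"Test share of positives: {y_te.mean():.3f}")
```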

**2. Standardizing the data**
- We use `StandardScaler` from scikit-learn to standardize our data.

- `fit_transform` on `X_train` computes the mean and standard deviation of each feature and scales the training data with them.

- `transform` on `X_test` scales the test data using the mean and standard deviation computed from `X_train`. This is crucial to prevent data leakage.

In [128]:
# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform both training and test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
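
The fit-on-train, transform-on-test pattern can be traced by hand on a tiny example. In the sketch below (the arrays and the name `toy_scaler` are assumptions for illustration), the test value is scaled with the training mean and standard deviation, so it does not influence the fitted statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (values are made up)
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[4.0]])

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std from train
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the train statistics

# Manual check: (x - train mean) / train std, with the population std (ddof=0)
manual = (toy_test - toy_train.mean()) / toy_train.std()

print(toy_scaler.mean_, toy_scaler.scale_)
print(toy_test_scaled, manual)
```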

---

#### **Task 4: Model Training and Evaluation**

##### **Question 9**

1. **Logistic Regression Classifier**:
   - Train the Logistic Regression model on the standardized data.
   - Calculate and print the training and test accuracy.

In [129]:

# Initialize the Logistic Regression model
lr_model = LogisticRegression(random_state=10)

# Train the model on the standardized training data
lr_model.fit(X_train_scaled, y_train)

# Make predictions on both training and test sets
lr_train_pred = lr_model.predict(X_train_scaled)
lr_test_pred = lr_model.predict(X_test_scaled)

# Compute accuracies
lr_train_accuracy = accuracy_score(y_train, lr_train_pred)
lr_test_accuracy = accuracy_score(y_test, lr_test_pred)

# Print accuracies rounded to three decimal places
print("Logistic Regression Results:")
print(f"Training Accuracy: {lr_train_accuracy:.3f}")
print(f"Test Accuracy: {lr_test_accuracy:.3f}")

Logistic Regression Results:
Training Accuracy: 0.655
Test Accuracy: 0.656


2. **Random Forest Classifier**:
   - Train the Random Forest model on the standardized data.
   - Calculate and print the training and test accuracy.

In [130]:
# Initialize the Random Forest model
rf_model = RandomForestClassifier(random_state=10)

# Train the model on the standardized training data
rf_model.fit(X_train_scaled, y_train)

# Make predictions on both training and test sets
rf_train_pred = rf_model.predict(X_train_scaled)
rf_test_pred = rf_model.predict(X_test_scaled)

# Compute accuracies
rf_train_accuracy = accuracy_score(y_train, rf_train_pred)
rf_test_accuracy = accuracy_score(y_test, rf_test_pred)

# Print accuracies rounded to three decimal places
print("\nRandom Forest Results:")
print(f"Training Accuracy: {rf_train_accuracy:.3f}")
print(f"Test Accuracy: {rf_test_accuracy:.3f}")


Random Forest Results:
Training Accuracy: 0.975
Test Accuracy: 0.903


3. **Model Comparison and Insights**

a. **Comparing training and test accuracies for each classifier**:

**Logistic Regression**:
- **Training Accuracy**: 0.655
- **Test Accuracy**: 0.656

The Logistic Regression model shows very similar accuracies for both training and test sets, with only a 0.001 difference. This suggests that the model is not overfitting. Its overall performance is modest, however: with 1,830 of the 4,500 clients labelled `y = 1`, always predicting the majority class would already score about 0.593.

**Random Forest**:
- **Training Accuracy**: 0.975
- **Test Accuracy**: 0.903

The Random Forest model shows a higher training accuracy (0.975) compared to its test accuracy (0.903). This difference of 0.072 indicates some degree of overfitting. The model has learned the training data very well, including some patterns that don't generalize to the test set.

**Extent of overfitting**:
- Logistic Regression: Minimal to no overfitting. The model's performance on training and test data is nearly identical.
- Random Forest: Moderate overfitting. The model performs notably better on the training data than on the test data, suggesting it has learned some patterns specific to the training set that don't generalize well.

b. **Comparing accuracies across the two classifiers**:

The Random Forest classifier provides better forecasts overall:
- **Logistic Regression Test Accuracy**: 0.656
- **Random Forest Test Accuracy**: 0.903

The Random Forest outperforms the Logistic Regression by a significant margin (0.247 or 24.7 percentage points) on the test set. This suggests that the Random Forest is much better at capturing the underlying patterns in the data that are relevant for prediction.

Despite this overfitting, the Random Forest still generalizes much better to unseen data than the Logistic Regression model. Its test accuracy of 0.903 indicates strong predictive performance.
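
To put both test accuracies in context, it helps to compare them with the accuracy of always predicting the majority class. The arithmetic below is a sketch using the counts reported earlier in this notebook (1,830 past-due clients out of 4,500 rows); it assumes the stratified test split has the same class proportions as the full sample, and the variable names are introduced here for illustration.

```python
# Counts reported earlier in the notebook
n_total = 4500
n_past_due = 1830

# Accuracy of always predicting the majority class (y = 0)
majority_share = (n_total - n_past_due) / n_total

# Reported test accuracies
lr_test_acc = 0.656
rf_test_acc = 0.903

print(f"Majority-class baseline: {majority_share:.3f}")
print(f"Logistic Regression gain over baseline: {lr_test_acc - majority_share:.3f}")
print(f"Random Forest gain over baseline: {rf_test_acc - majority_share:.3f}")
```

Under this assumption, Logistic Regression improves on the baseline by roughly 6 percentage points, while the Random Forest improves on it by roughly 31.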

c. **Presence of nonlinearities in the dataset**:

The significant performance difference between the Logistic Regression (a linear model) and the Random Forest (a nonlinear model) strongly suggests the presence of nonlinearities in the credit default prediction dataset. Here's a detailed explanation considering the context:

1. **Nature of Credit Default Prediction**:
   Credit default prediction is inherently complex, often involving nonlinear relationships between various factors. For instance, the relationship between income (`AMT_INCOME`) and default risk may not be linear - there could be threshold effects or diminishing returns.

2. **Diverse Feature Set**:
   The dataset includes a wide range of features such as income, education level, employment duration, family status, and various binary flags (car ownership, phone ownership, etc.). The interaction between these diverse features is likely to be nonlinear. For example, the effect of income on default risk might vary depending on education level or family status.

3. **Temporal Aspects**:
   The `DAYS_BIRTH` and `DAYS_EMPLOYED` features count backwards from the current day. The impact of age or employment duration on credit risk may not be linear - there could be critical periods or thresholds that significantly affect risk.

4. **Categorical Variables**:
   The dataset includes several categorical variables (`NAME_INCOME_TYPE`, `NAME_EDUCATION_TYPE`, `OCCUPATION_TYPE`). The Random Forest's superior performance suggests it's better at handling the potentially nonlinear effects of these categorical variables on credit risk.

5. **Credit History Complexity**:
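   The `STATUS` variable in the `credit_record.csv` file indicates various levels of payment delays. In this project it is used only to construct the target `y` and is dropped before modeling, so neither classifier sees it as a feature. Because levels 1 to 5 are collapsed into a single past-due class, the target mixes mild and severe delinquency, which may be harder to separate with a single linear boundary.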

6. **Model Performance Comparison**:
   The Logistic Regression model, which assumes linear relationships, achieved only moderate accuracy (0.656). In contrast, the Random Forest model, capable of capturing nonlinear patterns, achieved much higher accuracy (0.903). This substantial improvement (24.7 percentage points) strongly indicates that nonlinear relationships in the data are crucial for predicting credit defaults.

7. **Feature Interactions**:
   In credit risk assessment, the interaction between features can be critical. For instance, the impact of income on default risk might depend on the number of dependents (`CNT_CHILDREN`) or housing type (`NAME_HOUSING_TYPE`). Random Forests can automatically capture such interactions, while Logistic Regression cannot without explicit feature engineering.

8. **Threshold Effects**:
   Credit behavior often exhibits threshold effects. For example, missing a payment by a few days might have a disproportionate impact on future default risk. The Random Forest's ability to create complex decision boundaries allows it to capture such nonlinear threshold effects more effectively than Logistic Regression.
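
The argument above can be illustrated on synthetic data where the target depends purely on an interaction between two features. This is a sketch under stated assumptions, not an analysis of the credit data: the data are randomly generated, and the names `X_demo`, `y_demo`, `linear_acc`, and `forest_acc` are introduced here for illustration. A gap of this kind is consistent with nonlinearity, although on real data other factors can also contribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: the label is 1 when the two features have the same sign,
# a pure interaction effect that no single linear boundary can separate
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(2000, 2))
y_demo = (X_demo[:, 0] * X_demo[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Linear model versus tree ensemble on the same split
linear_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
forest_acc = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

print(f"Logistic Regression test accuracy: {linear_acc:.3f}")
print(f"Random Forest test accuracy: {forest_acc:.3f}")
```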

---

#### **Conclusion**

1. **Summary of Findings**:
Throughout this project, we gained several key insights into predicting credit defaults for a financial institution:

- **Data Integration**: By merging application records with credit histories, we created a comprehensive dataset that captures both client characteristics and their credit behavior over time.

- **Data Preprocessing**: We addressed missing values in key fields like education type and number of children, ensuring a complete dataset for analysis. We also converted categorical variables into dummy variables, making them suitable for machine learning models.

- **Feature Importance**: Feature importances were not computed explicitly, but the gap between the two models suggests that application features such as income, employment duration, age, and family status carry predictive signal for past-due behavior, particularly in combination.

- **Model Performance**: We compared two models:
  - **Logistic Regression** achieved accuracies of 0.655 (training) and 0.656 (test)
  - **Random Forest** achieved accuracies of 0.975 (training) and 0.903 (test)

The Random Forest model emerged as the most effective, outperforming Logistic Regression by a significant margin (24.7 percentage points on the test set). This implies that credit default prediction in this dataset involves complex, nonlinear relationships that the Random Forest can capture more effectively.

The high accuracy of the **Random Forest model** (90.3% on unseen data) suggests it could be a valuable tool for the financial institution in assessing credit risk and making lending decisions. However, the overfitting observed (97.5% training accuracy vs 90.3% test accuracy) indicates there's room for model refinement.


2. **Future Work**:
To further improve this analysis and its applications, we suggest:

- Feature Engineering: Create new features that might capture important aspects of credit risk, such as debt-to-income ratio or credit utilization rate.

- Time Series Analysis: Incorporate more sophisticated analysis of the temporal aspects of credit behavior, possibly using time series models.

- Interpretability: While Random Forest performs well, its decisions can be hard to interpret. Implementing techniques like SHAP (SHapley Additive exPlanations) values could provide insights into feature importance and model decisions.
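
As a starting point for the interpretability item, the sketch below uses permutation importance from scikit-learn rather than SHAP, since SHAP requires a separate library. It runs on synthetic data in which only the first feature determines the label; the data and the names `X_demo`, `y_demo`, `forest`, and `ranking` are assumptions for illustration, not results from this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only feature 0 determines the label, features 1 and 2 are noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the drop in accuracy
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)

# Feature indices ordered from most to least important
ranking = np.argsort(result.importances_mean)[::-1]

print(result.importances_mean)
print(ranking)
```

Measuring importance on held-out data rather than on the training set keeps the ranking from rewarding features the forest has merely memorized.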

This analysis could be extended to similar datasets in the financial sector, such as:
- Predicting early loan repayments, which can affect a bank's expected interest income.
- Estimating the probability of a customer opening new accounts or services.
- Detecting fraudulent transactions by incorporating similar machine learning techniques.

By continually refining these models and expanding their applications, financial institutions can make more informed decisions, manage risk more effectively, and provide better services to their clients.

---