# Assignment 2
In this assignment, I've created a logistic regression model that should predict the loan status based on selected features.

The assignment is split into key parts that contain:
- selecting variables for further analysis
- preparing data
- performing feature engineering, cross-validation, and model evaluation
- dealing with class imbalance
- discovering the insights

In the second part of the assignment, the logistic regression model is optimized by fine-tuning the decision threshold.

In the third part, a linear regression model is built from pre-defined features.

## Part 1. Creating a logistic regression model

### 1. Import libraries, load data
All imports are gathered in one place so that they are not duplicated and can be reviewed together.
The 'df_out_dsif3.csv' file is selected as the data source for the current assignment.

In [25]:
python_material_folder_name = "python-material"

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.preprocessing import LabelEncoder

# Training
from sklearn.model_selection import train_test_split, cross_val_score

# Modelling
from sklearn.linear_model import LogisticRegression, LinearRegression

# Dealing with class imbalance
from imblearn.over_sampling import SMOTE

# Model evaluation
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    confusion_matrix, roc_curve, auc, mean_squared_error, r2_score,
)

path_python_material = ".."

# Data loading
df = pd.read_csv(f"{path_python_material}/data/2-intermediate/df_out_dsif3.csv")

### 2. Selecting variables for analysis
To predict loan status, the following fields were chosen:
- fico_range_low - the lower boundary range the borrower’s FICO at loan origination belongs to
- fico_range_high - the upper boundary range the borrower’s FICO at loan origination belongs to
- last_fico_range_high - the upper boundary range the borrower’s last FICO pulled belongs to
- last_fico_range_low - the lower boundary range the borrower’s last FICO pulled belongs to
- annual_inc - the self-reported annual income provided by the borrower during registration
- sub_grade - LC assigned loan subgrade
- verification_status - indicates if income was verified by LC, not verified, or if the income source was verified
- dti - a ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income

To make the 'dti' values easier to interpret, here is an example of how the ratio is calculated.

#### Very High DTI
- **Monthly Debt Payments (excluding mortgage and the new loan):** 4,500
- **Monthly Income (self-reported):** 5,000
- **DTI Calculation:** 4,500 / 5,000 = 0.90 or 90%

Description: A DTI of 90% is extremely high, indicating that nearly all of the borrower's income goes towards paying off their existing debts. This is a red flag for lenders, as it suggests the borrower has little room to manage new debt or unexpected expenses.
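
The arithmetic above can be reproduced with a small helper. This is only an illustrative sketch: the function name `debt_to_income` is hypothetical, and it returns the ratio as a fraction, whereas the `dti` column in the dataset may be stored in percentage points (worth checking before comparing values).

```python
def debt_to_income(monthly_debt: float, monthly_income: float) -> float:
    """Return the debt-to-income ratio as a fraction (0.90 means 90%)."""
    if monthly_income <= 0:
        raise ValueError("monthly_income must be positive")
    return monthly_debt / monthly_income


# Figures from the "Very High DTI" example
dti_ratio = debt_to_income(4_500, 5_000)
print(f"DTI: {dti_ratio:.2f} ({dti_ratio:.0%})")
```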

**'sub_grade' description:** sub_grade value helps lenders assess the specific risk level within each grade. It can affect the loan’s interest rate, as borrowers with a higher subgrade (e.g., A1) are typically considered lower risk and may receive more favorable loan terms compared to those with a lower subgrade (e.g., A5).

In summary, the sub_grade provides a finer classification of the borrower's credit risk, helping lenders set more accurate loan pricing and risk strategies.

### 3. Feature engineering
Based on the selected variables, the following features were built:
-  loan_default
-  fico_average
-  last_fico_average
-  fico_category
-  last_fico_category

fico_category, last_fico_category, sub_grade, and verification_status were encoded into numeric values so that they can be used by the model.

#### Helper functions

In [26]:
def categorize_fico(fico_avg):
    if fico_avg >= 800:
        return 'Excellent Credit'
    elif fico_avg >= 740:
        return 'Very Good Credit'
    elif fico_avg >= 670:
        return 'Good Credit'
    elif fico_avg >= 580:
        return 'Fair Credit'
    else:
        return 'Poor Credit'

def model_evaluation(model, predictions, y_test, X_test):
    accuracy = accuracy_score(y_test, predictions)
    precision = precision_score(y_test, predictions)
    recall = recall_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)
    roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:,1])
    
    # Display the confusion matrix
    cm = confusion_matrix(y_test, predictions)
    
    print(f'Accuracy: {accuracy}')
    print(f'Precision: {precision}')
    print(f'Recall: {recall}')
    print(f'F1-Score: {f1}')
    print(f'ROC-AUC: {roc_auc}')
    print(f'Confusion Matrix:\n{cm}')
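
As a possible alternative to calling `categorize_fico` row by row with `.apply`, the same thresholds can be expressed as bins for `pd.cut`. This is a sketch, not the method used in this assignment; `FICO_BINS`, `FICO_LABELS`, and `categorize_fico_series` are hypothetical names. One behavioral difference to be aware of: `pd.cut` leaves missing scores as NaN, while `categorize_fico` sends them to 'Poor Credit' because every comparison with NaN is false.

```python
import numpy as np
import pandas as pd

# Bin edges mirror the thresholds in categorize_fico; right=False makes each
# bin closed on the left, so a score of exactly 740 is 'Very Good Credit'.
FICO_BINS = [-np.inf, 580, 670, 740, 800, np.inf]
FICO_LABELS = ['Poor Credit', 'Fair Credit', 'Good Credit',
               'Very Good Credit', 'Excellent Credit']


def categorize_fico_series(fico_avg: pd.Series) -> pd.Series:
    """Vectorized binning of FICO averages into the five credit categories."""
    return pd.cut(fico_avg, bins=FICO_BINS, labels=FICO_LABELS, right=False)


sample = pd.Series([579.5, 580, 669, 670, 740, 799, 800, 850])
print(categorize_fico_series(sample).tolist())
```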

#### Creating new features

In [27]:
df['loan_default'] = df.loan_status == "Charged Off"

df['fico_average'] = (df['fico_range_low'] + df['fico_range_high']) / 2
df['last_fico_average'] = (df['last_fico_range_low'] + df['last_fico_range_high']) / 2


df['fico_category'] = df['fico_average'].apply(categorize_fico)
df['last_fico_category'] = df['last_fico_average'].apply(categorize_fico)

# As the logistic regression model can work with numeric values only, category values are encoded
le = LabelEncoder()
df.loc[:, 'fico_category_encoded'] = le.fit_transform(df['fico_category'])
df.loc[:, 'last_fico_category_encoded'] = le.fit_transform(df['last_fico_category'])
df.loc[:, 'sub_grade_encoded'] = le.fit_transform(df['sub_grade'])
df.loc[:, 'verification_status_encoded'] = le.fit_transform(df['verification_status'])

# Some rows (193 of 100000) have no 'dti' value, so they are dropped.
# .copy() makes df_cleaned an independent DataFrame, so later column assignments do not trigger SettingWithCopyWarning
df_cleaned = df[df['dti'].notna()].copy()
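
One thing worth knowing about `LabelEncoder` is that it assigns codes in sorted (alphabetical) order of the labels, so the encoded FICO categories do not follow credit quality. The sketch below shows that ordering and a possible alternative with an explicit mapping. The `FICO_ORDER` dictionary is a hypothetical choice for illustration and is not part of the assignment code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

categories = pd.Series(['Poor Credit', 'Fair Credit', 'Good Credit',
                        'Very Good Credit', 'Excellent Credit'])

# LabelEncoder sorts the labels, so 'Excellent Credit' becomes 0 and
# 'Poor Credit' becomes 3: the codes carry no ranking information.
le_demo = LabelEncoder()
alphabetical_codes = le_demo.fit_transform(categories)
print(list(le_demo.classes_))

# Hypothetical explicit mapping that preserves the order from worst to best.
FICO_ORDER = {
    'Poor Credit': 0,
    'Fair Credit': 1,
    'Good Credit': 2,
    'Very Good Credit': 3,
    'Excellent Credit': 4,
}
ordinal_codes = categories.map(FICO_ORDER)
print(ordinal_codes.tolist())
```

With an ordered encoding, a single logistic regression coefficient can describe "better category, lower risk"; with alphabetical codes it cannot.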

### 3. Selecting features for the model

In [28]:
features = ['fico_category_encoded', 'annual_inc', 'dti', 'funded_amnt', 'sub_grade_encoded']
X = df_cleaned[features]
y = df_cleaned['loan_default']

In [29]:
# Split the data into training and testing sets (returns pandas dfs)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=60)

In [30]:
for dframe in [X_train, X_test, y_train, y_test]:
    print(f"Shape: {dframe.shape}")

Shape: (79918, 5)
Shape: (19980, 5)
Shape: (79918,)
Shape: (19980,)
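
Because defaults are the minority class, it may help to keep the class ratio identical in the training and test sets. A minimal sketch using the `stratify` argument of `train_test_split` follows. The data here is synthetic, with an assumed positive rate of 12% chosen to resemble the imbalance in this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 12% positives (an assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=bool)
y_demo[:120] = True

# stratify=y_demo keeps the share of positives the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=60, stratify=y_demo
)

print(f"Positive rate in train: {y_tr.mean():.3f}")
print(f"Positive rate in test:  {y_te.mean():.3f}")
```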


### 4. Building and training the model

In [31]:
# max_iter is raised because the solver reached the default limit of 100 iterations
model = LogisticRegression(max_iter=300)

In [32]:
model.fit(X_train, y_train)

In [33]:
predictions = model.predict(X_test)

### 5. Evaluating the model

In [34]:
model_evaluation(model, predictions, y_test, X_test)

Accuracy: 0.8757757757757758
Precision: 0.42857142857142855
Recall: 0.015795868772782502
F1-Score: 0.03046875
ROC-AUC: 0.7073210870010563
Confusion Matrix:
[[17459    52]
 [ 2430    39]]
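
The reported metrics can be checked by hand from the confusion matrix, which scikit-learn lays out as `[[TN, FP], [FN, TP]]`. The sketch below simply recomputes them from the four counts printed above.

```python
# Counts copied from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 17459, 52, 2430, 39

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted defaults that are real
recall = tp / (tp + fn)                      # share of real defaults that are caught
f1 = 2 * tp / (2 * tp + fp + fn)             # harmonic mean of precision and recall

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
```

The high accuracy comes almost entirely from the 17,459 true negatives, which is why accuracy alone is misleading here.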


### 6. Comparison: model_1 vs model_2

**model_1 evaluation** (from lesson)
- Accuracy: 0.87835
- Precision: 0.0
- Recall: 0.0
- F1-Score: 0.0
- ROC-AUC: 0.6173932969589436
- Confusion Matrix:
[[17567     0]
 [ 2433     0]]

**model_2 evaluation** (current)
- Accuracy: 0.8757757757757758
- Precision: 0.42857142857142855
- Recall: 0.015795868772782502
- F1-Score: 0.03046875
- ROC-AUC: 0.7073210870010563
- Confusion Matrix:
[[17459    52]
 [ 2430    39]]

#### Insights
1. Unlike Model 1, Model 2 makes some positive predictions (39 true positives), so it is at least attempting to identify positive cases (defaults).

2. Both models have extremely low recall (0.0 for Model 1 and 0.0158 for Model 2), which means they miss the vast majority of the actual positive cases. Both are biased toward predicting the negative class (non-defaults), most likely because the dataset is imbalanced (many more negatives than positives).

3. The ROC-AUC of Model 2 (0.7073) is higher than Model 1 (0.6174), so Model 2 has a better ability to differentiate between positive and negative classes overall.

To improve the Model 2 performance we'll try to handle class imbalance. 

### 7. Handling imbalanced data

Common ways of handling imbalanced data:
- duplicate samples from the minority class (random oversampling)
- use more advanced techniques that generate synthetic samples similar to the minority class (such as SMOTE)
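
A further option, not used in this assignment, is to leave the data untouched and reweight the classes inside the model with `class_weight="balanced"`. The sketch below compares recall with and without reweighting on synthetic data. The dataset produced by `make_classification` and its roughly 12% positive rate are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data: about 12% positives (an assumption).
X_syn, y_syn = make_classification(
    n_samples=5000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.88, 0.12], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

# Same model twice; the second one penalizes errors on the rare class more.
plain = LogisticRegression(max_iter=300).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=300, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"Recall without reweighting: {recall_plain:.3f}")
print(f"Recall with class_weight='balanced': {recall_weighted:.3f}")
```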

In [35]:
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)  

In [36]:
y_resampled.value_counts()

loan_default
False    87524
True     87524
Name: count, dtype: int64

The default and non-default classes now contain the same number of samples.

In [37]:
model_2 = LogisticRegression(max_iter=300)
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=74)

model_2.fit(X_train, y_train)

predictions_2 = model_2.predict(X_test)

In [38]:
model_evaluation(model_2, predictions_2, y_test, X_test)

Accuracy: 0.6534990002856327
Precision: 0.6516297855395643
Recall: 0.6559958767609667
F1-Score: 0.6538055420792785
ROC-AUC: 0.7217380613534272
Confusion Matrix:
[[11424  6124]
 [ 6007 11455]]


### 8. model_2 vs model_2 after handling imbalanced data

**model_2 evaluation** (before handling imbalanced data)
- Accuracy: 0.8757757757757758
- Precision: 0.42857142857142855
- Recall: 0.015795868772782502
- F1-Score: 0.03046875
- ROC-AUC: 0.7073210870010563
- Confusion Matrix:
[[17459    52]
 [ 2430    39]]

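**model_2 evaluation** (after handling imbalanced data; recorded from a separate run, so the values differ slightly from the cell output above, most likely because `SMOTE()` is not seeded)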
- Accuracy: 0.6514710082833476
- Precision: 0.6505437893531769
- Recall: 0.6508418279693048
- F1-Score: 0.6506927745333791
- ROC-AUC: 0.7185178578006775
- Confusion Matrix:
[[11443  6105]
 [ 6097 11365]]

#### Insights
1. The accuracy has significantly dropped (from 87.58% to 65.15%), which is expected after handling imbalanced data. This is a positive change since the model is now predicting both classes instead of defaulting to the negative class.

2. Recall: Significantly improved, now identifying 65.08% of actual positive cases.

3. F1-Score: The balanced precision and recall lead to a substantially better F1-Score, indicating that the model has improved its ability to correctly predict both classes.

4. The model now makes more true positive predictions (11,365) compared to before (39), showing that handling the class imbalance has greatly improved the model's capacity to identify the positive class.

5. The ROC-AUC increased slightly from 0.7073 to 0.7185, suggesting a better discriminative ability between positive and negative cases across thresholds.

### 9. Performing cross-validation

In [39]:
# Accuracy
cv_accuracy = cross_val_score(model_2, X, y, cv=5, scoring='accuracy')
print(f'Cross-Validation Accuracy Scores: {cv_accuracy}')
print(f'Mean CV Accuracy: {cv_accuracy.mean()}')

# Precision
cv_precision = cross_val_score(model_2, X, y, cv=5, scoring='precision')
print(f'Cross-Validation Precision Scores: {cv_precision}')
print(f'Mean CV Precision: {cv_precision.mean()}')

# Recall
cv_recall = cross_val_score(model_2, X, y, cv=5, scoring='recall')
print(f'Cross-Validation Recall Scores: {cv_recall}')
print(f'Mean CV Recall: {cv_recall.mean()}')

# F1-Score
cv_f1 = cross_val_score(model_2, X, y, cv=5, scoring='f1')
print(f'Cross-Validation F1 Scores: {cv_f1}')
print(f'Mean CV F1-Score: {cv_f1.mean()}')

# ROC-AUC
cv_roc_auc = cross_val_score(model_2, X, y, cv=5, scoring='roc_auc')
print(f'Cross-Validation ROC-AUC Scores: {cv_roc_auc}')
print(f'Mean CV ROC-AUC: {cv_roc_auc.mean()}')

Cross-Validation Accuracy Scores: [0.87557558 0.87592593 0.87512513 0.87541919 0.87476851]
Mean CV Accuracy: 0.8753628647417125
Cross-Validation Precision Scores: [0.43209877 0.47959184 0.40384615 0.42857143 0.37142857]
Mean CV Precision: 0.42310735120258924
Cross-Validation Recall Scores: [0.01414141 0.0189899  0.0169697  0.01818917 0.01575758]
Mean CV Recall: 0.01680955063978508
Cross-Validation F1 Scores: [0.02738654 0.03653323 0.03257076 0.03489725 0.03023256]
Mean CV F1-Score: 0.03232406803209402
Cross-Validation ROC-AUC Scores: [0.71272353 0.69455942 0.69768035 0.697021   0.70639182]
Mean CV ROC-AUC: 0.7016752254792344


#### Summary
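1. The model has a high accuracy but remains biased toward the majority class, leading to very low recall and F1-score. This is expected here: `cross_val_score` refits a fresh copy of the estimator on each fold of the original, imbalanced `X` and `y`, so the SMOTE resampling is not part of what is being validated.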

2. Despite a modest ROC-AUC (around 0.7017), the model struggles to identify positive cases, as shown by the consistently low recall across all cross-validation folds.

3. Further efforts in tuning of hyperparameters and improving feature engineering are needed to enhance the model's performance, especially in terms of recall.

## Part 2. Optimizing the model by fine-tuning thresholds

### 1. Using custom loss functions

The main purpose of using custom loss functions is to assign different levels of importance to different types of errors made by a model. In many real-world scenarios, some mistakes are more costly than others, and a custom loss function helps the model focus on minimizing the most expensive mistakes.

The costs below are applied when choosing the decision threshold rather than during training. By setting these costs for FP and FN, we encode the rule: "It’s okay to make some false alarms (false positives), but missing an actual positive case (false negative) is a bigger mistake and should be avoided as much as possible."
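
As a point of reference, the cost-minimizing threshold can be derived directly, assuming the predicted probability $p$ is well calibrated. Predicting "default" is cheaper in expectation whenever the expected cost of a missed default exceeds the expected cost of a false alarm:

$$
p \cdot C_{FN} > (1 - p) \cdot C_{FP}
\quad\Longleftrightarrow\quad
p > \frac{C_{FP}}{C_{FP} + C_{FN}} = \frac{1}{1 + 5} \approx 0.167
$$

Here $C_{FP} = 1$ and $C_{FN} = 5$ are the costs used in this part. The calibration assumption probably does not hold for `model_2`, because it was trained on SMOTE-balanced data and its probabilities are therefore likely inflated relative to the true default rate. That is one plausible reason for an empirical search on the original data to land on a very different threshold.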

In [40]:
# Predictions
y_pred_prob = model_2.predict_proba(X)[:, 1]
threshold = 0.5  # Default threshold

# Custom thresholding based on custom loss function
FP_cost = 1.0  # Cost for False Positive
FN_cost = 5.0  # Cost for False Negative

def custom_threshold(y_pred_prob, threshold):
    return np.where(y_pred_prob > threshold, 1, 0)

def evaluate_custom_loss(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    loss = FP_cost * fp + FN_cost * fn
    return loss

# Calculate loss for the default threshold
y_pred_default = custom_threshold(y_pred_prob, threshold)
default_loss = evaluate_custom_loss(y, y_pred_default)

print(f"Default Loss: {default_loss}")

Default Loss: 54789.0


Searching for the threshold with the lowest loss

In [41]:
thresholds = np.arange(0.3, 0.6, 0.005)
losses = []

for thresh in thresholds:
    y_pred = custom_threshold(y_pred_prob, thresh)
    loss = evaluate_custom_loss(y, y_pred)
    losses.append((thresh, loss))

# Find the threshold that gives the minimum loss
best_threshold, min_loss = min(losses, key=lambda x: x[1])

print(f"Best Threshold: {best_threshold}")
print(f"Minimum Loss: {min_loss}")


Best Threshold: 0.5950000000000002
Minimum Loss: 53052.0
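
The search above can also be run over the full range of thresholds in a vectorized way, which makes it easy to see whether the minimum sits at the edge of the grid. This is a sketch under stated assumptions: `cost_curve` is a hypothetical helper, and the labels and scores are synthetic, generated only to make the example runnable.

```python
import numpy as np

FP_COST, FN_COST = 1.0, 5.0  # same costs as in the cells above


def cost_curve(y_true, y_prob, thresholds):
    """Return the total misclassification cost at each threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    y_prob = np.asarray(y_prob)
    thresholds = np.asarray(thresholds)
    # Broadcasting: rows are thresholds, columns are samples.
    pred_pos = y_prob[None, :] > thresholds[:, None]
    fp = (pred_pos & ~y_true).sum(axis=1)
    fn = (~pred_pos & y_true).sum(axis=1)
    return FP_COST * fp + FN_COST * fn


# Synthetic labels and scores, for illustration only.
rng = np.random.default_rng(0)
y_demo = rng.random(2000) < 0.12
p_demo = np.clip(rng.normal(0.35 + 0.25 * y_demo, 0.15), 0, 1)

grid = np.linspace(0.0, 1.0, 101)
costs = cost_curve(y_demo, p_demo, grid)
best_idx = int(np.argmin(costs))
at_edge = best_idx in (0, len(grid) - 1)

print(f"Best threshold: {grid[best_idx]:.2f}, loss: {costs[best_idx]:.0f}")
print(f"Minimum at the edge of the grid: {at_edge}")
```

If the minimum falls on the first or last grid point, the grid is probably too narrow and should be widened before the threshold is trusted.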


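Applying the best threshold found above. Note that 0.595 is the last value in the searched range (0.3 to 0.6), so the true minimum may lie above it; widening the range would be a sensible follow-up.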

In [42]:
threshold = 0.595

y_pred_best = custom_threshold(y_pred_prob, threshold)
best_loss = evaluate_custom_loss(y, y_pred_best)

print(f"Loss at best threshold: {best_loss}")

Loss at best threshold: 53052.0


## Part 3. Building linear regression model

The business also wants you to do a PoC to see if they can predict `loan_amnt` based on the following features: `emp_length`, `home_ownership`, `annual_inc`.

In [47]:
# Standardizing the 'emp_length' column (to have numeric format)
df_cleaned.loc[:, 'emp_length_clean'] = df_cleaned['emp_length'].replace({
    '10+ years': '10',
    '< 1 year': '0'
}).str.rstrip(' years').astype('float')

# Keep only rows where 'emp_length_clean' is not NaN; .copy() avoids SettingWithCopyWarning on later assignments
df_cleaned_2 = df_cleaned[df_cleaned['emp_length_clean'].notna()].copy()
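
The `rstrip(' years')` call above works because it strips any trailing run of the characters in `' years'`, not the literal suffix, which is easy to misread. A more explicit alternative is to extract the digits with a regular expression. The helper name `clean_emp_length` and the sample values below are hypothetical; this is a sketch, not the code used in the assignment.

```python
import pandas as pd


def clean_emp_length(emp_length: pd.Series) -> pd.Series:
    """Convert strings such as '10+ years' or '< 1 year' to a float number of years."""
    # '< 1 year' is the only value whose digit does not match its meaning,
    # so it is mapped to zero before the digits are extracted.
    normalized = emp_length.replace({'< 1 year': '0 years'})
    # Keep the first run of digits; missing values stay NaN.
    return normalized.str.extract(r'(\d+)', expand=False).astype('float')


raw = pd.Series(['10+ years', '< 1 year', '1 year', '3 years', None])
cleaned = clean_emp_length(raw)
print(cleaned.tolist())
```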

In [49]:
# Encoding the 'home_ownership' column (to have numeric format)
df_cleaned_2.loc[:, 'home_ownership_encoded'] = le.fit_transform(df_cleaned_2['home_ownership'])

In [50]:
features = ['emp_length_clean', 'annual_inc', 'home_ownership_encoded']
X = df_cleaned_2[features]
y = df_cleaned_2['loan_amnt']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

y_pred = linear_model.predict(X_test)
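
Label-encoding `home_ownership` gives a linear model a single numeric column, which implies an ordering between categories that does not exist. One possible alternative is one-hot encoding with `pd.get_dummies`, sketched below on a tiny made-up frame. The rows and income values are invented for illustration.

```python
import pandas as pd

# Tiny made-up frame, for illustration only.
demo = pd.DataFrame({
    'home_ownership': ['RENT', 'OWN', 'MORTGAGE', 'RENT'],
    'annual_inc': [40_000.0, 65_000.0, 90_000.0, 52_000.0],
})

# One indicator column per category; drop_first removes one level
# (here 'MORTGAGE') to avoid perfectly collinear columns.
encoded = pd.get_dummies(
    demo, columns=['home_ownership'], drop_first=True, dtype=float
)
print(encoded.columns.tolist())
```

Each remaining indicator then gets its own coefficient, interpreted relative to the dropped category.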

Model evaluation

In [51]:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')

Mean Squared Error: 88149943.64932868
R^2 Score: 0.02504042640747528


### Summary
MSE of 88,149,943.65 (an RMSE of about 9,389, in the same units as `loan_amnt`): the model has large prediction errors, meaning its predictions for 'loan_amnt' are significantly off from the actual values.

R² of 0.025: the model explains only 2.5% of the variance in loan_amnt, indicating that it is not capturing meaningful relationships between the features and the target variable.
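
To put both numbers on a more interpretable scale, the MSE can be converted to an RMSE and compared with a baseline that always predicts the mean of the test targets. The sketch below uses only the two values printed above and relies on the definition R² = 1 - MSE_model / MSE_baseline.

```python
import math

# Values copied from the evaluation output above
mse = 88149943.64932868
r2 = 0.02504042640747528

# RMSE is in the same units as loan_amnt
rmse = math.sqrt(mse)

# Rearranging R^2 = 1 - MSE_model / MSE_baseline gives the error of a
# baseline that always predicts the mean of y_test.
baseline_mse = mse / (1 - r2)
baseline_rmse = math.sqrt(baseline_mse)

print(f"Model RMSE:    {rmse:,.0f}")
print(f"Baseline RMSE: {baseline_rmse:,.0f}")
print(f"Reduction vs. baseline: {1 - rmse / baseline_rmse:.1%}")
```

The model's typical error is only marginally smaller than that of the constant baseline, which is another way of reading the low R².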

**Next Steps to Improve the Model:**
1. Add More Features: the current features (emp_length, home_ownership, annual_inc) might not be enough to predict loan amounts accurately.
2. Feature Engineering: explore transforming or combining features to create more meaningful inputs for the model.
3. Examine the Data: check for outliers, missing values, or skewed distributions that might be affecting the model’s performance.