# Loan Default Prediction - Logistic Regression
**Author:** Vidyasagar - Data Scientist

This notebook delivers an end-to-end analysis of LoanTap loan performance:
- EDA and data dictionary alignment
- Feature engineering and preprocessing
- Logistic Regression modeling
- Precision vs Recall tradeoff for default detection


## Problem Statement
Predict whether a loan will **default** (Charged Off) using borrower, loan, and credit
attributes. The business goal is to detect potential defaulters early while minimizing
false positives that block credit-worthy applicants.


## Data Dictionary (Key Fields)
- **loan_amnt:** Listed loan amount
- **term:** Number of payments (36 or 60 months)
- **int_rate:** Interest rate on the loan
- **installment:** Monthly payment owed
- **grade / sub_grade:** LoanTap assigned grade & subgrade
- **emp_title / emp_length:** Employment title and length
- **home_ownership:** Ownership status
- **annual_inc:** Self-reported annual income
- **verification_status:** Income verification state
- **issue_d:** Month funded
- **loan_status:** Target variable (Fully Paid / Charged Off)
- **purpose / title:** Borrower-declared purpose/title
- **dti:** Debt-to-income ratio
- **earliest_cr_line:** Earliest credit line month
- **open_acc / total_acc:** Open and total credit lines
- **pub_rec / pub_rec_bankruptcies:** Public record counts
- **revol_bal / revol_util:** Revolving balance/utilization
- **initial_list_status:** Listing status (W/F)
- **application_type:** Individual or Joint
- **mort_acc:** Mortgage accounts
- **Address:** Borrower address
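
Several of these fields usually arrive as strings and need parsing before they can be used numerically. The sketch below assumes `term` looks like `" 36 months"`, `emp_length` looks like `"10+ years"` or `"< 1 year"`, and the date fields look like `"Jan-2015"`. These formats are assumptions about the raw CSV, and the derived column names are hypothetical, so check them against `df.head()` before reusing the code.

```python
import pandas as pd

# Toy rows; the string formats below are assumed, not read from the real file.
sample = pd.DataFrame(
    {
        "term": [" 36 months", " 60 months"],
        "emp_length": ["10+ years", "< 1 year"],
        "issue_d": ["Jan-2015", "Dec-2014"],
        "earliest_cr_line": ["Jun-1990", "Jul-2004"],
    }
)

# Pull the first run of digits out of each string and convert it to a number.
sample["term_months"] = sample["term"].str.extract(r"(\d+)", expand=False).astype(int)

# Float keeps missing values as NaN. Note "< 1 year" maps to 1 under this rule.
sample["emp_length_years"] = (
    sample["emp_length"].str.extract(r"(\d+)", expand=False).astype(float)
)

# Parse month-year strings with an explicit format.
for col in ["issue_d", "earliest_cr_line"]:
    sample[col] = pd.to_datetime(sample[col], format="%b-%Y")

# Length of credit history, in years, at the time the loan was issued.
sample["credit_history_years"] = (
    sample["issue_d"] - sample["earliest_cr_line"]
).dt.days / 365.25

print(sample[["term_months", "emp_length_years", "credit_history_years"]])
```

Parsed this way, `term` and the credit history length can enter the model as numeric features instead of being one-hot encoded as strings.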


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
    precision_recall_curve,
    average_precision_score,
)

from src.loantap_modeling_utils import (
    compute_outlier_caps,
    apply_outlier_caps,
    build_preprocessor,
)

sns.set(style="whitegrid")
pd.set_option("display.max_columns", 200)


In [None]:
df = pd.read_csv("data/raw/LoanTapData.csv")
df.head()

In [None]:
print("Shape:", df.shape)
df.info()

df.describe(include="all").T.head(20)

In [None]:
categorical_cols = df.select_dtypes(include="object").columns
for col in categorical_cols:
    df[col] = df[col].astype("category")

print("Duplicate rows:", df.duplicated().sum())

In [None]:
missing_summary = df.isna().mean().sort_values(ascending=False)
missing_summary.head(15)

## Univariate Analysis
Distribution plots for continuous variables and count plots for categorical features.
Each plot includes a short interpretation in the output narrative.
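
Histograms show skew visually; a numeric skewness summary can back up that reading. The sketch below uses synthetic data (a log-normal stand-in for `annual_inc` and a normal stand-in for `int_rate`), so the numbers say nothing about the real dataset. It only illustrates how `DataFrame.skew()` and a `log1p` transform could be used.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins; the distributions are assumptions for illustration only.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "annual_inc": rng.lognormal(mean=11, sigma=0.6, size=5000),
        "int_rate": rng.normal(loc=13, scale=4, size=5000),
    }
)

# Sample skewness per column: values well above 0 indicate a long right tail.
skew_raw = toy.skew()

# A log transform often pulls a right-skewed variable toward symmetry.
skew_log = np.log1p(toy["annual_inc"]).skew()

print(skew_raw.round(2))
print("annual_inc skew after log1p:", round(skew_log, 2))
```

On the real data, `df[num_cols].skew().sort_values()` would give the same summary for every numeric column.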


In [None]:
num_cols = df.select_dtypes(include=["int64", "float64"]).columns

df[num_cols].hist(bins=30, figsize=(16, 12))
plt.suptitle("Distribution of Numeric Variables", y=1.02)
plt.tight_layout()

In [None]:
cat_cols = df.select_dtypes(include=["category"]).columns

plot_cols = [
    "loan_status",
    "grade",
    "sub_grade",
    "home_ownership",
    "verification_status",
    "purpose",
    "term",
    "initial_list_status",
    "application_type",
]

for col in plot_cols:
    plt.figure(figsize=(7, 4))
    order = df[col].value_counts().index
    sns.countplot(data=df, x=col, order=order)
    plt.xticks(rotation=45, ha="right")
    plt.title(f"Countplot: {col}")
    plt.tight_layout()

## Bivariate Analysis
We analyze the relationship between the target variable and key predictors using
count plots, box plots, and comparative distributions.
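
For categorical predictors, a default-rate table is often easier to read than side-by-side count plots. The sketch below runs on a hand-made toy frame, so the rates are illustrative only. The helper column `is_default` is a hypothetical name.

```python
import pandas as pd

# Toy data; the values are made up to keep the arithmetic easy to follow.
toy = pd.DataFrame(
    {
        "grade": ["A", "A", "A", "B", "B", "C", "C", "C", "C"],
        "loan_status": [
            "Fully Paid", "Fully Paid", "Fully Paid",
            "Fully Paid", "Charged Off",
            "Charged Off", "Fully Paid", "Charged Off", "Charged Off",
        ],
    }
)

# Encode the target as 1 = Charged Off so that the mean equals the default rate.
toy["is_default"] = (toy["loan_status"] == "Charged Off").astype(int)

# Loan count and default rate per grade, with the riskiest grade first.
default_by_grade = (
    toy.groupby("grade")["is_default"]
    .agg(n_loans="count", default_rate="mean")
    .sort_values("default_rate", ascending=False)
)
print(default_by_grade)
```

The same pattern applies to `term`, `home_ownership`, or `purpose`. Reporting `n_loans` next to the rate helps avoid over-reading small groups.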


In [None]:
plt.figure(figsize=(5, 4))
sns.countplot(data=df, x="loan_status")
plt.title("Target Distribution: Loan Status")
plt.tight_layout()

for col in ["loan_amnt", "int_rate", "installment", "dti", "annual_inc", "revol_util"]:
    plt.figure(figsize=(6, 4))
    sns.boxplot(data=df, x="loan_status", y=col)
    plt.title(f"{col} by Loan Status")
    plt.tight_layout()

In [None]:
plt.figure(figsize=(12, 8))
corr = df[num_cols].corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation Heatmap (Numeric Features)")
plt.tight_layout()

## Feature Engineering
We create binary flags for credit risk signals and prepare the data for modeling.
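
One detail worth checking when building such flags is how missing counts behave: a comparison like `NaN > 0` evaluates to `False`, so a missing value silently becomes flag `0`. The toy sketch below shows that behavior and one possible alternative that keeps the flag missing. The column `mort_acc_flag_keep_na` is a hypothetical name, and whether missing should mean "none" is a modeling decision this example does not settle.

```python
import numpy as np
import pandas as pd

# Toy data with one missing mortgage-account count.
toy = pd.DataFrame({"mort_acc": [0.0, 2.0, np.nan]})

# Plain comparison: NaN > 0 is False, so the missing row is flagged as 0.
toy["mort_acc_flag"] = (toy["mort_acc"] > 0).astype(int)

# Alternative: keep missing values missing so a later imputer can decide.
toy["mort_acc_flag_keep_na"] = (
    (toy["mort_acc"] > 0).astype(float).where(toy["mort_acc"].notna())
)
print(toy)
```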


In [None]:
df["pub_rec_flag"] = (df["pub_rec"] > 0).astype(int)
df["mort_acc_flag"] = (df["mort_acc"] > 0).astype(int)
df["pub_rec_bankruptcies_flag"] = (df["pub_rec_bankruptcies"] > 0).astype(int)

df[["pub_rec_flag", "mort_acc_flag", "pub_rec_bankruptcies_flag"]].head()

## Data Preprocessing & Model Building
We handle missing values, cap outliers, apply scaling, and build a Logistic Regression model.
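
The capping helpers used in this notebook (`compute_outlier_caps`, `apply_outlier_caps`) are imported from `src.loantap_modeling_utils`, and their code is not shown here. One common way to cap outliers is quantile clipping with bounds learned on the training split only. The sketch below illustrates that idea under stated assumptions: the names `sketch_compute_caps` and `sketch_apply_caps` and the quantile choices are hypothetical and are not the project's implementation.

```python
import pandas as pd


def sketch_compute_caps(frame, columns, lower_q=0.01, upper_q=0.99):
    """Return {column: (lower, upper)} quantile bounds learned from `frame`."""
    return {
        col: (frame[col].quantile(lower_q), frame[col].quantile(upper_q))
        for col in columns
    }


def sketch_apply_caps(frame, caps):
    """Clip each capped column to its stored bounds and return a copy."""
    out = frame.copy()
    for col, (low, high) in caps.items():
        out[col] = out[col].clip(lower=low, upper=high)
    return out


# Toy splits with one extreme income in each.
train = pd.DataFrame(
    {"annual_inc": [30_000, 45_000, 52_000, 61_000, 75_000, 2_000_000]}
)
test = pd.DataFrame({"annual_inc": [40_000, 5_000_000]})

# Wide quantiles chosen only so the effect is visible on six rows.
caps = sketch_compute_caps(train, ["annual_inc"], lower_q=0.0, upper_q=0.8)
train_capped = sketch_apply_caps(train, caps)
test_capped = sketch_apply_caps(test, caps)

print(caps)
print(test_capped)
```

Learning the bounds from the training split and reusing them on the test split keeps test-set information out of preprocessing.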


In [None]:
X = df.drop(columns=["loan_status"])
y = (df["loan_status"].astype(str) == "Charged Off").astype(int)

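# Select every numeric dtype so the integer flags are not missed on platforms
# where astype(int) may yield int32; all other columns are treated as categorical.
numeric_features = X.select_dtypes(include="number").columns
categorical_features = X.columns.difference(numeric_features)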

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=42,
)

caps = compute_outlier_caps(X_train, numeric_features)
X_train = apply_outlier_caps(X_train, caps)
X_test = apply_outlier_caps(X_test, caps)

preprocessor = build_preprocessor(numeric_features, categorical_features)
model = LogisticRegression(max_iter=1000, class_weight="balanced", solver="liblinear")

clf = Pipeline(steps=[("preprocess", preprocessor), ("model", model)])
clf.fit(X_train, y_train)


In [None]:
feature_names = clf.named_steps["preprocess"].get_feature_names_out()
coefs = clf.named_steps["model"].coef_[0]

coef_df = pd.DataFrame({"feature": feature_names, "coef": coefs})
coef_df = coef_df.sort_values("coef", ascending=False)

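# Print each table explicitly: a notebook cell renders only its last bare expression.
print("Top positive coefficients (higher default risk):")
print(coef_df.head(15))

print("Top negative coefficients (lower default risk):")
print(coef_df.tail(15))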

In [None]:
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
conf_matrix = confusion_matrix(y_test, y_pred)
conf_matrix

In [None]:
roc_auc = roc_auc_score(y_test, y_proba)
fpr, tpr, _ = roc_curve(y_test, y_proba)

plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, label=f"AUC={roc_auc:.3f}")
plt.plot([0, 1], [0, 1], "--", color="gray")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.tight_layout()

precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
avg_precision = average_precision_score(y_test, y_proba)

plt.figure(figsize=(6, 4))
plt.plot(recall, precision, label=f"AP={avg_precision:.3f}")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend()
plt.tight_layout()

In [None]:
threshold_df = pd.DataFrame(
    {
        "threshold": thresholds,
        "precision": precision[:-1],
        "recall": recall[:-1],
    }
)

# Printed explicitly because only the last expression in a cell is rendered.
print(threshold_df.sort_values("recall", ascending=False).head())

target_recall = 0.8
candidates = threshold_df[threshold_df["recall"] >= target_recall].sort_values(
    "precision", ascending=False
)

candidates.head(10)

## Precision vs Recall Tradeoff (Business View)
- To reduce missed defaulters, lower the decision threshold and prioritize **recall**.
- To reduce false positives (lost revenue), raise the threshold and prioritize **precision**.
- A two-threshold policy (auto-approve / manual review / reject) balances both objectives.
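
The three-way policy above (auto-approve / manual review / reject) can be sketched as two cut-offs on the predicted default probability. The function name `triage` and the cut-offs `0.30` and `0.70` are hypothetical placeholders. In practice they would be chosen from the precision-recall table and the institution's review capacity.

```python
import numpy as np


def triage(default_proba, approve_below=0.30, reject_at_or_above=0.70):
    """Map default probabilities to 'approve', 'review', or 'reject'."""
    proba = np.asarray(default_proba, dtype=float)
    return np.select(
        [proba < approve_below, proba >= reject_at_or_above],
        ["approve", "reject"],
        default="review",  # everything between the two cut-offs
    )


decisions = triage([0.05, 0.30, 0.55, 0.70, 0.92])
print(decisions.tolist())
```

Applied to this notebook's model, the input would be `y_proba`. Widening the gap between the two cut-offs sends more applications to manual review and fewer to automatic decisions.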


## Actionable Insights & Recommendations
- Favor applicants with lower DTI and lower revolving utilization; lower values of
  both are consistently associated with lower default risk.
- For high-risk grades (D-G) and 60-month terms, price risk explicitly and apply
  tighter approval thresholds or manual review.
- Implement early-warning flags for borrowers with any public records or bankruptcy
  signals; these are strong predictors of charge-off.
- Use probability-based decisioning rather than a hard 0.5 threshold to align approvals
  with the institution's risk appetite.
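
One way to turn risk appetite into a threshold is to assign a cost to each error type and pick the cut-off with the lowest total cost on held-out data. The sketch below assumes a missed default costs five times as much as a wrongly blocked good applicant. That ratio, the function name `pick_threshold`, and the toy scores are all hypothetical. Because the model here is fit with `class_weight="balanced"`, its scores overstate default probability relative to the base rate, so the threshold should be tuned empirically like this rather than derived from the cost ratio alone.

```python
import numpy as np


def pick_threshold(y_true, default_proba, cost_missed_default=5.0, cost_blocked_good=1.0):
    """Return (threshold, total_cost) minimizing cost over a fixed grid."""
    y_true = np.asarray(y_true)
    proba = np.asarray(default_proba, dtype=float)
    grid = np.linspace(0.05, 0.95, 19)  # candidate thresholds in steps of 0.05
    costs = []
    for t in grid:
        flagged = proba >= t
        false_neg = np.sum((y_true == 1) & ~flagged)  # defaulters let through
        false_pos = np.sum((y_true == 0) & flagged)   # good applicants blocked
        costs.append(cost_missed_default * false_neg + cost_blocked_good * false_pos)
    best = int(np.argmin(costs))  # first minimum if several thresholds tie
    return float(grid[best]), float(costs[best])


# Toy labels (1 = Charged Off) and scores, made up for illustration.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
proba = [0.12, 0.22, 0.33, 0.41, 0.47, 0.62, 0.71, 0.93]
threshold, cost = pick_threshold(y_true, proba)
print(threshold, cost)
```

With the notebook's objects, the call would be `pick_threshold(y_test, y_proba)`, ideally on a validation split kept separate from the final test set.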


In [None]:
pct_fully_paid = (df["loan_status"].astype(str) == "Fully Paid").mean() * 100
corr_loan_installment = df["loan_amnt"].corr(df["installment"])
home_majority = df["home_ownership"].mode().iloc[0]

grade_a_rate = (df.loc[df["grade"] == "A", "loan_status"].astype(str) == "Fully Paid").mean()
non_a_rate = (df.loc[df["grade"] != "A", "loan_status"].astype(str) == "Fully Paid").mean()

job_titles = df["emp_title"].value_counts().head(2)

print(f"% fully paid: {pct_fully_paid:.2f}%")
print(f"Correlation (loan_amnt vs installment): {corr_loan_installment:.3f}")
print(f"Majority home ownership: {home_majority}")
print(f"Grade A fully paid rate: {grade_a_rate:.3f} vs others: {non_a_rate:.3f}")
print("Top 2 job titles:")
print(job_titles)