## Building a Smart Loan Recovery System

Building a smart loan recovery system involves creating a machine learning model that predicts the likelihood of loan repayment and identifies high-risk borrowers. This step-by-step guide walks through building such a model.
### Problem Definition
•	Define the objective: Predict the probability of loan repayment and identify high-risk borrowers.
•	Identify the target variable: Loan repayment status (e.g., paid, defaulted, or overdue).
### Data Collection
•	Collect relevant data: Loan applications, credit history, payment records, and borrower information.
•	Preprocess the data: Handle missing values, normalize/scale features, and transform variables ¹ ² ³.
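
The preprocessing bullet can be sketched roughly as follows. This is a minimal illustration, not the notebook's own pipeline: the toy frame, its values, and the `Self-Employed` category are made up, and only the column names are borrowed from the dataset used later.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (made-up values) with one missing income
toy = pd.DataFrame({
    "Monthly_Income": [50000.0, np.nan, 120000.0, 80000.0],
    "Loan_Amount": [200000.0, 150000.0, 900000.0, 400000.0],
    "Employment_Type": ["Salaried", "Self-Employed", "Salaried", "Salaried"],
})

numeric_cols = ["Monthly_Income", "Loan_Amount"]
categorical_cols = ["Employment_Type"]

preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the median, then standardize
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Categorical columns: one-hot encode, ignoring unseen categories at predict time
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X_toy = preprocess.fit_transform(toy)
print(X_toy.shape)  # 2 scaled numeric columns + 2 one-hot columns
```

Wrapping the steps in a `ColumnTransformer` means the same imputation and scaling learned on the training data can be reapplied unchanged to new borrowers.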
### Feature Engineering
•	Extract relevant features: Credit score, loan amount, interest rate, payment history, and borrower demographics.
•	Use techniques like correlation analysis and recursive feature elimination to select the most informative features ⁴ ².
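
As a rough sketch of recursive feature elimination, the example below runs scikit-learn's `RFE` on synthetic data. The data set and the choice of keeping three features are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, of which 3 carry signal (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Repeatedly fit the model and drop the weakest feature until 3 remain
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X_demo, y_demo)

selected = selector.support_      # boolean mask of the kept features
print(selected)
print(selector.ranking_)          # 1 = kept; larger numbers were dropped earlier
```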
### Model Selection
•	Choose a suitable algorithm: Logistic Regression, Decision Trees, Random Forest, or Support Vector Machines.
•	Consider using ensemble methods like bagging or boosting to improve model performance.
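
To make the bagging and boosting suggestion concrete, here is a small comparison on synthetic data. It is only a sketch: the data, the estimator settings, and the use of 5-fold cross-validated AUC are assumptions, not choices taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)

ensembles = {
    # Bagging: many decision trees fit on bootstrap samples, predictions averaged
    "Bagging": BaggingClassifier(n_estimators=50, random_state=0),
    # Boosting: trees added one at a time, each correcting the previous ones
    "Boosting": GradientBoostingClassifier(random_state=0),
}

cv_auc = {}
for name, estimator in ensembles.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="roc_auc")
    cv_auc[name] = scores.mean()
    print(f"{name}: mean AUC = {cv_auc[name]:.3f}")
```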
### Model Training and Evaluation
•	Train the model: Use a training dataset to fit the model and tune hyperparameters.
•	Evaluate the model: Use metrics like accuracy, precision, recall, F1-score, and AUC-ROC to assess model performance ⁵ ¹.
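
A tiny worked example may help in reading these metrics. The six labels and scores below are made up; predictions are obtained by thresholding the scores at 0.5.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Made-up example: 4 repaid loans (1) and 2 defaults (0)
y_true = [1, 1, 1, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.2]        # predicted repayment probability
y_pred = [int(s >= 0.5) for s in y_score]       # [1, 1, 1, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)    # 4 of 6 correct -> 0.667
prec = precision_score(y_true, y_pred)  # 3 TP / (3 TP + 1 FP) -> 0.75
rec = recall_score(y_true, y_pred)      # 3 TP / (3 TP + 1 FN) -> 0.75
f1 = f1_score(y_true, y_pred)           # harmonic mean of 0.75 and 0.75 -> 0.75
auc = roc_auc_score(y_true, y_score)    # 7 of 8 (positive, negative) pairs ranked correctly -> 0.875

print(acc, prec, rec, f1, auc)
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, while AUC-ROC depends only on how well the scores rank repaid loans above defaults.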
### Model Deployment
•	Deploy the model: Integrate the model into a loan recovery system to predict the likelihood of loan repayment for new borrowers.
•	Monitor and update: Continuously monitor the model's performance and update it as necessary to ensure optimal results.
Some popular machine learning algorithms for building a smart loan recovery system include ⁵ ²:
•	Logistic Regression: A popular choice for binary classification problems like loan repayment prediction.
•	Decision Trees: Can handle complex interactions between variables and provide interpretable results.
•	Random Forest: An ensemble method that combines multiple decision trees to improve model performance.
•	Support Vector Machines: Can handle high-dimensional data and provide robust predictions.


In [40]:
#import all the libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

In [2]:
#read the csv file
data_path = 'Smart Loan Recovery System.csv' 
df = pd.read_csv(data_path)
df.head()

Unnamed: 0,Borrower_ID,Age,Gender,Employment_Type,Monthly_Income,Num_Dependents,Loan_ID,Loan_Amount,Loan_Tenure,Interest_Rate,...,Collateral_Value,Outstanding_Loan_Amount,Monthly_EMI,Payment_History,Num_Missed_Payments,Days_Past_Due,Recovery_Status,Collection_Attempts,Collection_Method,Legal_Action_Taken
0,BRW_1,59,Male,Salaried,215422,0,LN_1,1445796,60,12.39,...,1727997.0,291413.0,4856.88,On-Time,0,0,Partially Recovered,1,Settlement Offer,No
1,BRW_2,49,Female,Salaried,60893,0,LN_2,1044620,12,13.47,...,1180032.0,665204.2,55433.68,On-Time,0,0,Fully Recovered,2,Settlement Offer,No
2,BRW_3,35,Male,Salaried,116520,1,LN_3,1923410,72,7.74,...,2622540.0,1031372.0,14324.61,Delayed,2,124,Fully Recovered,2,Legal Notice,No
3,BRW_4,63,Female,Salaried,140818,2,LN_4,1811663,36,12.23,...,1145493.0,224973.9,6249.28,On-Time,1,56,Fully Recovered,2,Calls,No
4,BRW_5,28,Male,Salaried,76272,1,LN_5,88578,48,16.13,...,0.0,39189.89,816.46,On-Time,1,69,Fully Recovered,0,Debt Collectors,No


In [3]:
# Convert repayment status into numbers (adjust names to your dataset)
df['target'] = df['Recovery_Status'].map({
    'Fully Recovered': 1,
    'Partially Recovered': 1,
    'Written Off': 0
})

In [4]:
# Drop rows where target is missing
df = df.dropna(subset=['target'])

In [5]:
# Fill missing values
df = df.fillna(df.median(numeric_only=True))

In [6]:
# Select numeric columns only
X = df.select_dtypes(include=['int64', 'float64'])
X = X.drop(columns=['target'], errors='ignore')

y = df['target']

In [7]:
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [9]:
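# Create a few engineered features
# Loan amount relative to monthly income (the +1 guards against division by zero)
df['debt_to_income'] = df['Loan_Amount'] / (df['Monthly_Income'] + 1)

# Outstanding balance relative to collateral; with zero collateral the +1 guard
# makes the ratio equal the outstanding amount, so expect very large values there
df['loan_to_collateral'] = df['Outstanding_Loan_Amount'] / (df['Collateral_Value'] + 1)

# 1 if the interest rate is above the dataset median, else 0
df['high_interest'] = (df['Interest_Rate'] > df['Interest_Rate'].median()).astype(int)

df[['debt_to_income', 'loan_to_collateral', 'high_interest']].head()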

Unnamed: 0,debt_to_income,loan_to_collateral,high_interest
0,6.711428,0.168642,1
1,17.154728,0.563716,1
2,16.506982,0.393272,0
3,12.865189,0.196399,1
4,1.161328,39189.892008,1
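
Row 4 in the output above has no collateral, so its `loan_to_collateral` value (39189.89) is really the outstanding amount rather than a ratio. One possible alternative, sketched here with two rows taken from the sample above, is to record unsecured loans in a separate flag and leave the ratio missing. The `unsecured` column is a hypothetical addition.

```python
import numpy as np
import pandas as pd

# Rows 0 and 4 from the sample output above
loans = pd.DataFrame({
    "Outstanding_Loan_Amount": [291413.0, 39189.89],
    "Collateral_Value": [1727997.0, 0.0],
})

# Flag loans with no collateral (hypothetical extra feature)
loans["unsecured"] = (loans["Collateral_Value"] == 0).astype(int)

# Divide only where collateral exists; zero collateral becomes NaN instead of a huge value
loans["loan_to_collateral"] = (
    loans["Outstanding_Loan_Amount"] / loans["Collateral_Value"].replace(0, np.nan)
)

print(loans)
```

The missing ratios then need to be imputed, or handled by a model that accepts missing values, before training.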


In [17]:
# Split the data into training and test sets
X = df[['debt_to_income', 'loan_to_collateral', 'high_interest']]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
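
When one class is much rarer than the other, a plain random split can leave the test set with very few minority examples. A possible variation is to pass `stratify=` so both splits keep the same class ratio. The sketch below uses made-up labels with an 86/14 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 86 recovered (1), 14 written off (0)
y_demo = np.array([1] * 86 + [0] * 14)
X_demo = np.arange(100).reshape(-1, 1)   # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both splits keep roughly the original share of positives
print(y_tr.mean(), y_te.mean())
```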

In [19]:
# Define the candidate models
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "SVM": SVC(probability=True)
}

In [21]:
# Fit each model on the training data
trained = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    trained[name] = model
    print(f"{name} trained successfully.")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Logistic Regression trained successfully.
Decision Tree trained successfully.
Random Forest trained successfully.
SVM trained successfully.
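
The convergence warning above points at two remedies: scaling the data and raising `max_iter`. A minimal sketch of both, using synthetic features on very different scales (the data is made up, and this is not the notebook's own fix):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic features on very different scales (illustrative only)
X_demo = np.column_stack([
    rng.normal(10, 5, n),       # a ratio around 10
    rng.normal(0.5, 0.2, n),    # a ratio around 0.5
    rng.integers(0, 2, n),      # a 0/1 flag
]).astype(float)
X_demo[::50, 1] = 40000.0       # a few extreme values, like the zero-collateral rows
y_demo = (X_demo[:, 0] + rng.normal(0, 5, n) > 10).astype(int)

# Scaling happens inside the pipeline, so it is learned from the training data only
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_demo, y_demo)

n_iter = clf.named_steps["logisticregression"].n_iter_[0]
print("Iterations used:", n_iter)
```

SVMs are also sensitive to feature scale, so the same `make_pipeline(StandardScaler(), ...)` wrapper may help `SVC` as well.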


In [23]:
#evaluate the models
for name, model in trained.items():
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]

    print(f"\n{name}")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall:", recall_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred))
    print("AUC:", roc_auc_score(y_test, y_proba))


Logistic Regression
Accuracy: 0.86
Precision: 0.86
Recall: 1.0
F1 Score: 0.9247311827956989
AUC: 0.420265780730897

Decision Tree
Accuracy: 0.79
Precision: 0.8494623655913979
Recall: 0.9186046511627907
F1 Score: 0.88268156424581
AUC: 0.45930232558139533

Random Forest
Accuracy: 0.86
Precision: 0.8673469387755102
Recall: 0.9883720930232558
F1 Score: 0.9239130434782609
AUC: 0.4663621262458472

SVM
Accuracy: 0.86
Precision: 0.86
Recall: 1.0
F1 Score: 0.9247311827956989
AUC: 0.5423588039867109
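
These numbers are easier to judge against a trivial baseline. Recall of 1.0 with precision equal to accuracy (0.86) is what a model scores when it predicts "recovered" for every row, and AUC values near or below 0.5 suggest the predicted probabilities rank borrowers no better than chance. The sketch below reproduces that baseline with `DummyClassifier`, assuming a test set of 86 positives and 14 negatives, which is what the reported precision and recall imply.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Assumed test-set composition: 86 recovered (1), 14 written off (0)
y_demo = np.array([1] * 86 + [0] * 14)
X_demo = np.zeros((100, 1))   # features are ignored by the dummy model

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)

base_acc = accuracy_score(y_demo, baseline.predict(X_demo))
base_auc = roc_auc_score(y_demo, baseline.predict_proba(X_demo)[:, 1])

print(base_acc, base_auc)   # 0.86 accuracy, 0.5 AUC
```

A model is only useful for finding high-risk borrowers if it clearly beats this baseline, especially on AUC and on recall for the written-off class.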


In [25]:
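# Identify high-risk borrowers with the chosen model (Random Forest is assumed here)
best = trained["Random Forest"]

# predict_proba(...)[:, 1] is the probability of class 1 (recovered), so despite its
# name 'risk_probability' holds the repayment probability; low values mean high risk
df['risk_probability'] = best.predict_proba(X)[:, 1]
df['high_risk'] = (df['risk_probability'] < 0.4).astype(int)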

df[['risk_probability', 'high_risk']].head()

Unnamed: 0,risk_probability,high_risk
0,0.8,0
1,1.0,0
2,1.0,0
3,0.99,0
4,0.99,0


# MODEL DEPLOYMENT

In [50]:
import joblib

# Save the chosen model; `model` would be whatever the last loop iteration left behind
joblib.dump(best, "loan_model.pkl")
print("Model saved!")

Model saved!


In [52]:
pip install fastapi uvicorn joblib

Note: you may need to restart the kernel to use updated packages.


In [53]:
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI()

# Load the saved model once, when the app starts
model = joblib.load("loan_model.pkl")

# The model was trained on these three engineered features, in this order
EXPECTED_FEATURES = ["debt_to_income", "loan_to_collateral", "high_interest"]

class LoanFeatures(BaseModel):
    debt_to_income: float
    loan_to_collateral: float
    high_interest: int

@app.get("/")
def home():
    return {"message": "Loan Recovery Prediction API is running!"}

@app.post("/predict")
def predict_loan(data: LoanFeatures):
    # Build a single-row feature array in the training column order
    features = np.array([[getattr(data, name) for name in EXPECTED_FEATURES]], dtype=float)

    # Probability of class 1 (repayment)
    prob = model.predict_proba(features)[0][1]

    # Classify borrower
    high_risk = prob < 0.5

    return {
        "repayment_probability": float(prob),
        "high_risk_borrower": bool(high_risk)
    }

In [56]:
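# Revised endpoint with input validation and error handling.
# Routes are matched in registration order, so run this on a fresh app
# (restart the kernel and skip the earlier /predict definition).
MODEL_FEATURES = ["debt_to_income", "loan_to_collateral", "high_interest"]

@app.post("/predict")
def predict_loan(data: dict):
    try:
        # Reject payloads that lack any feature the model was trained on
        missing = [name for name in MODEL_FEATURES if name not in data]
        if missing:
            return {"error": f"Missing features: {missing}"}

        # Convert to a single-row array in the training column order
        features = np.array([[float(data[name]) for name in MODEL_FEATURES]])

        # Predict probability of class 1 (repayment)
        prob = model.predict_proba(features)[0][1]
        high_risk = prob < 0.5

        return {
            "repayment_probability": float(prob),
            "high_risk_borrower": bool(high_risk)
        }
    except Exception as e:
        return {"error": str(e)}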

In [58]:
model = joblib.load(r"C:\Users\USER\Machine Learning\Mlproject2\loan_model.pkl")

In [60]:
print(model.n_features_in_)

3
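
The three inputs reported above are the engineered columns used for training (`debt_to_income`, `loan_to_collateral`, `high_interest`), while an API caller will usually hold raw loan fields. A hypothetical helper for the conversion is sketched below. The name `engineer_features` is an assumption, and `median_interest_rate` has to come from the training data; the `11.0` used here is a made-up placeholder.

```python
def engineer_features(raw, median_interest_rate):
    """Turn raw loan fields into the three engineered features,
    using the same formulas as the feature engineering cell."""
    return {
        "debt_to_income": raw["Loan_Amount"] / (raw["Monthly_Income"] + 1),
        "loan_to_collateral": raw["Outstanding_Loan_Amount"] / (raw["Collateral_Value"] + 1),
        "high_interest": int(raw["Interest_Rate"] > median_interest_rate),
    }

# Raw values of borrower BRW_1 from the sample rows shown earlier
raw_loan = {
    "Loan_Amount": 1445796,
    "Monthly_Income": 215422,
    "Outstanding_Loan_Amount": 291413.0,
    "Collateral_Value": 1727997.0,
    "Interest_Rate": 12.39,
}

# 11.0 is a placeholder; use the median interest rate of the training data
feats = engineer_features(raw_loan, median_interest_rate=11.0)
print(feats)
```

Saving the training median next to the model file keeps the API's `high_interest` flag consistent with the one the model saw during training.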
