# NPCI MLOps Playground Challenge - 4

[Total: 20 Marks]

## Problem Statement

Develop a machine learning model that predicts the likelihood of a borrower defaulting on a loan based on factors such as credit history, repayment capacity, and annual income. This model aims to assist financial institutions in assessing the potential financial impact of credit risk and making informed lending decisions.

### Importing required packages [1 Mark]

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

### Loading the data [1 Mark]

In [2]:
df = pd.read_csv('credit_risk_dataset.csv')
print("Dataset Shape:", df.shape)
df.head()

### EDA [2 Marks]

Explore the data, understand the features, and handle the missing values.

In [3]:
print(df.describe())
print("\nMissing Values:\n", df.isnull().sum())
plt.figure(figsize=(6,4))
sns.countplot(x='loan_status', data=df, palette='coolwarm')
plt.title('Loan Status Distribution')
plt.xlabel('Loan Default (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()

### Handling Missing Values [1 Mark]

In [4]:
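# Assign the filled column back; fillna(inplace=True) on a column selection is deprecated in recent pandas
df['person_emp_length'] = df['person_emp_length'].fillna(df['person_emp_length'].mean())
df['loan_int_rate'] = df['loan_int_rate'].fillna(df['loan_int_rate'].mean())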
print("\nMissing Values after handling:\n", df.isnull().sum())
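
The cell above fills the gaps with the mean of the full dataset, before any train/test split. A variation worth considering is to learn the fill value from the training rows only, so that no information from the test rows influences preprocessing. The following is a minimal sketch under stated assumptions: the `toy` frame and its values are made up for illustration, and the median strategy is a swapped-in choice, not the method used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy frame standing in for the two columns imputed above (values are made up).
toy = pd.DataFrame({
    "person_emp_length": [1.0, np.nan, 3.0, 10.0, 4.0, np.nan, 2.0, 6.0],
    "loan_int_rate": [11.0, 13.5, np.nan, 7.9, 15.2, 10.0, np.nan, 12.1],
})

toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=42)

# Learn the medians from the training rows only, then reuse them on the test rows.
imputer = SimpleImputer(strategy="median")
train_imputed = pd.DataFrame(
    imputer.fit_transform(toy_train), columns=toy.columns, index=toy_train.index
)
test_imputed = pd.DataFrame(
    imputer.transform(toy_test), columns=toy.columns, index=toy_test.index
)

print("Missing after imputation (train):", int(train_imputed.isnull().sum().sum()))
print("Missing after imputation (test):", int(test_imputed.isnull().sum().sum()))
```

The median is less sensitive than the mean to extreme values, which may matter for a column such as employment length; whether it is the better choice here depends on the actual distribution.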

### Handling Categorical Columns [3 Marks]

In [5]:
categorical_cols = ['person_home_ownership', 'loan_intent', 'loan_grade', 'cb_person_default_on_file']
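# Keep one fitted encoder per column so each label-to-integer mapping can be reused at inference time
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])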
df.head()
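
After label encoding, the integer codes are only meaningful if the mapping back to the original labels is known. The sketch below shows one way to recover that mapping from a fitted `LabelEncoder` through its `classes_` attribute. The category values in `toy` are made up for illustration and may not match the labels in `credit_risk_dataset.csv`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical data (hypothetical labels, for illustration only).
toy = pd.DataFrame({
    "person_home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
    "cb_person_default_on_file": ["N", "Y", "N", "N"],
})

# Fit one encoder per column and keep it for later lookups.
toy_encoders = {}
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    toy_encoders[col] = enc

# classes_ is sorted, so each label's position is its integer code.
mappings = {
    col: {str(label): code for code, label in enumerate(enc.classes_)}
    for col, enc in toy_encoders.items()
}
print(mappings)
```

With a mapping like this at hand, a new record can be encoded from its original labels instead of hand-picked integers.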

### Define Target Variable and Features [1 Mark]

In [6]:
X = df.drop(columns=['loan_status'])
y = df['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
print("Training Set Shape:", X_train.shape)
print("Testing Set Shape:", X_test.shape)

### Feature Scaling [1 Mark]

In [7]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
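
The scaler is fitted on the training split and then applied to the test split, so the test rows do not influence the learned mean and standard deviation. The same pattern can be bundled with a model using scikit-learn's `make_pipeline`, which removes the need to track scaled and unscaled copies of the data. This is a minimal sketch on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, and `pipe` are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data (assumption: any numeric feature matrix works here).
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on the training data only and applies it
# automatically before the classifier at prediction time.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

acc = pipe.score(X_te, y_te)
print("Pipeline test accuracy:", round(acc, 3))
```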

### Model Training [3 Marks]

In [8]:
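# Logistic regression is trained on the scaled features; max_iter is raised to help the solver converge
log_model = LogisticRegression(max_iter=1000, random_state=42)
log_model.fit(X_train_scaled, y_train)
# Tree-based models do not need scaling, so they use the unscaled features
dt_model = DecisionTreeClassifier(random_state=42)  # fixed seed for reproducible results
dt_model.fit(X_train, y_train)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)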

### Model Evaluation [2 Marks]

In [9]:
def evaluate_model(model_name, model, X_test, y_test):
    y_pred = model.predict(X_test)
    print(f"\nModel: {model_name}")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

evaluate_model("Logistic Regression", log_model, X_test_scaled, y_test)
evaluate_model("Decision Tree", dt_model, X_test, y_test)
evaluate_model("Random Forest", rf_model, X_test, y_test)
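
Accuracy alone can be misleading when defaults are much rarer than non-defaults, because a model that always predicts "no default" would still score highly. A threshold-independent metric such as ROC AUC, computed from predicted probabilities, is a common complement. The sketch below uses an imbalanced synthetic dataset as a stand-in; the 80/20 class ratio and the names `demo_model` and `proba_default` are assumptions for illustration, not figures from the credit dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data (assumed 80/20 split between the two classes).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

demo_model = RandomForestClassifier(n_estimators=100, random_state=42)
demo_model.fit(X_tr, y_tr)

# Column 1 of predict_proba holds the estimated probability of the positive class.
proba_default = demo_model.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba_default)
print(f"ROC AUC: {auc:.3f}")
```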

### Inference [2 Marks]

In [10]:
sample_input = pd.DataFrame({
    'person_age': [35],
    'person_income': [60000],
    'person_home_ownership': [1],
    'person_emp_length': [5],
    'loan_intent': [2],
    'loan_grade': [1],
    'loan_amnt': [10000],
    'loan_int_rate': [12.5],
    'loan_percent_income': [0.15],
    'cb_person_default_on_file': [0],
    'cb_person_cred_hist_length': [7]
})
prediction = rf_model.predict(sample_input)
print("Sample Prediction (0 = No Default, 1 = Default):", prediction[0])
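
Since the problem statement asks for the likelihood of default, the estimated probability may be more informative than the hard 0/1 label. The sketch below shows how `predict_proba` can provide that probability and how a custom cut-off could turn it into a decision. It runs on synthetic data; the feature names `f0` to `f4` and the `threshold` of 0.3 are hypothetical and would need to be chosen for the real use case.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with hypothetical feature names f0..f4.
X_arr, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_demo, y_demo)

# A single-row DataFrame with the same column names and order as the training data.
sample = X_demo.iloc[[0]]

# Probability of the positive class (class 1) for that one row.
p_default = model.predict_proba(sample)[0, 1]

# Hypothetical cut-off: flag the applicant when the estimated risk reaches 30%.
threshold = 0.3
flag = int(p_default >= threshold)
print(f"Estimated probability: {p_default:.2f}, flagged: {flag}")
```

A cut-off below 0.5 flags more applicants as risky, trading extra false alarms for fewer missed defaults; the right balance depends on the relative cost of each error.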