# Introduction

Welcome to this notebook. I have created some visual representations of the data and tried to explore a few patterns.

**I hope this is useful. Please leave suggestions and feedback in the comments.**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Getting Started

I have used only matplotlib here. Seaborn is also a good option for visualization, but I have explored just one library, and even that one could be explored further.

In [None]:
import matplotlib.pyplot as plt

# Importing the dataset
* Gathering some starter information

In [None]:
# loading the data set
X_train = pd.read_csv('../input/playground-series-s5e11/train.csv')
X_test = pd.read_csv('../input/playground-series-s5e11/test.csv')

In [None]:
# basic information, tells about the data types and non-null count
X_train.info()

In [None]:
X_train.describe()

# About the Data

* id - Unique identification of each record

* annual_income (float64) - Borrower's yearly income.

* debt_to_income_ratio (float64) - Ratio of the borrower's debt to their income. Lower = better.

* credit_score (int64) - Credit bureau score (e.g., FICO). Higher = less risky.

* loan_amount (float64) - Amount of loan taken.

* interest_rate (float64) - Annual interest rate of the loan (%).

* gender (category) - Borrower's gender (Male/Female/Other).

* marital_status (category) - Marital status (Single, Married, Divorced, Widowed).

* education_level (category) - Education level (High School, Bachelor's, Master's, PhD, Other).

* employment_status (category) - Current employment type (Employed, Retired, Self-Employed, Student, Unemployed).

* loan_purpose (category) - Loan purpose (Business, Car, Debt consolidation, Education, Home, Medical, Vacation, Other).

* grade_subgrade (category) - Risk category assigned to the loan (A1, B2, etc.).

* loan_paid_back (int64) - Target variable: 1 → Borrower paid the loan in full, 0 → Borrower defaulted (did not repay fully).

This information is taken from the [original data set](https://www.kaggle.com/datasets/nabihazahid/loan-prediction-dataset-2025/data), with minor updates based on the current train set.

# Let's Visualize

In [None]:
# Creating a heat map using correlation matrix
train_num = X_train.select_dtypes(exclude = object)
label = list(train_num.columns)

plt.figure(figsize=(16, 6))
plt.imshow(train_num.corr(), cmap = 'OrRd')
plt.xticks(ticks=range(len(label)), labels=label, rotation=90)
plt.yticks(ticks=range(len(label)), labels=label)

plt.colorbar()

plt.show()

**Observations**:
* loan_paid_back is noticeably correlated with debt_to_income_ratio and credit_score.
* interest_rate and credit_score have a strong negative correlation.
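
The heat map shows these relationships only through color, so it can help to rank the numbers as well. The sketch below is one possible way to do that, not part of the original analysis: `target_correlations` is a hypothetical helper, and `demo` is a synthetic stand-in for `X_train` so the example runs on its own.

```python
import numpy as np
import pandas as pd

def target_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every numeric column with `target`,
    ordered from strongest to weakest (by absolute value)."""
    corr = df.select_dtypes(include=np.number).corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Synthetic stand-in for X_train (assumption: the real columns behave differently).
rng = np.random.default_rng(0)
n = 1000
credit_score = rng.normal(680, 50, n)
demo = pd.DataFrame({
    "credit_score": credit_score,
    "noise": rng.normal(0, 1, n),
    "loan_paid_back": (credit_score + rng.normal(0, 30, n) > 680).astype(int),
})

ranked = target_correlations(demo, "loan_paid_back")
print(ranked)
```

On the real data, `target_correlations(train_num.drop(columns='id'), 'loan_paid_back')` should give the same ranking the heat map suggests.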

In [None]:
# box plots for all numerical features
i=1
plt.figure(figsize=(16,14))
for col in train_num.drop(['id', 'loan_paid_back'], axis = 1).columns:
    plt.subplot(2,3,i)
    i+=1
    plt.boxplot(train_num[col], patch_artist = True)
    plt.xticks(ticks=[1], labels = [col])
plt.show()

* Every numerical feature shows many points beyond the whiskers (outliers).
* The debt-to-income ratio is mostly less than or equal to 0.1.
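
To put a number on "many outliers", the share of points beyond the box-plot whiskers can be computed directly. This is a minimal sketch assuming matplotlib's default whisker rule (1.5 times the interquartile range); `iqr_outlier_share` is a hypothetical helper and `demo` is made-up data.

```python
import pandas as pd

def iqr_outlier_share(series: pd.Series, whis: float = 1.5) -> float:
    """Fraction of values outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return float(((series < lower) | (series > upper)).mean())

# Made-up values: nine ordinary points and one extreme one.
demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])
share = iqr_outlier_share(demo)
print(f"outlier share: {share:.2f}")
```

Applied to the notebook's data, something like `train_num.drop(['id', 'loan_paid_back'], axis=1).apply(iqr_outlier_share)` would give one share per feature.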

In [None]:
def cat_distribute(col):
    print(f"Plotting over {col}")
    plt.figure(figsize=(16, 4))
    plt.subplot(1,3,1)
    index = X_train.groupby(col)[col].count().index
    plt.bar(index, X_train.groupby(col)[col].count())
    plt.title('Overall Distribution')
    plt.xticks(rotation=40)

    plt.subplot(1,3,2)
    plt.bar(index, X_train[X_train.loan_paid_back==1].groupby(col)[col].count())
    plt.title('Those who paid back their loans')
    plt.xticks(rotation=40)
    
    plt.subplot(1,3,3)
    plt.bar(index, X_train[X_train.loan_paid_back==0].groupby(col)[col].count())
    plt.title('Loans Unpaid')
    plt.xticks(rotation=40)
    plt.show()

cat_distribute('gender')
cat_distribute('marital_status')
cat_distribute('education_level')
cat_distribute('loan_purpose')
cat_distribute('grade_subgrade')

* Gender is evenly distributed.
* The distribution looks the same for those who paid back their loans and those who didn't.
* Most loans are under Debt Consolidation, i.e., combining multiple debts.
* Different distributions can be observed over the grade_subgrade category.
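
Raw counts make it hard to compare groups of different sizes, so a repayment rate per category may be easier to read than three bar charts. The following is a sketch under that assumption; `repayment_rate` is a hypothetical helper and `demo` is a tiny made-up frame using the notebook's column names.

```python
import pandas as pd

def repayment_rate(df: pd.DataFrame, col: str,
                   target: str = "loan_paid_back") -> pd.DataFrame:
    """Mean of the 0/1 target (repayment rate) and group size per category."""
    summary = df.groupby(col)[target].agg(rate="mean", count="size")
    return summary.sort_values("rate", ascending=False)

# Made-up example data, not taken from the competition files.
demo = pd.DataFrame({
    "grade_subgrade": ["A1", "A1", "B2", "B2", "B2", "C3"],
    "loan_paid_back": [1, 1, 1, 0, 0, 0],
})

rates = repayment_rate(demo, "grade_subgrade")
print(rates)
```

Calling `repayment_rate(X_train, 'grade_subgrade')` on the real data would show whether the differing distributions correspond to differing repayment rates.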

In [None]:
def scatter(col1, col2):
    plt.figure(figsize=(12,6))
    plt.subplot(1,2,1)
    plt.scatter(X_train[col1][loan_paid], X_train[col2][loan_paid], color = 'b', label = 'loan paid')
    plt.scatter(X_train[col1][loan_unpaid], X_train[col2][loan_unpaid], color = 'r', label = 'loan not paid')
    plt.xlabel(f'{col1}')
    plt.ylabel(f'{col2}')

    plt.title(f"{col1} VS {col2} (train)")
    plt.legend()

    plt.subplot(1,2,2)
    # same pair of features on the test set, for comparison with the train plot
    plt.scatter(X_test[col1], X_test[col2], color = 'c', label = 'test set')
    plt.xlabel(f'{col1}')
    plt.ylabel(f'{col2}')

    plt.title(f"{col1} VS {col2} (test)")
    plt.legend()
    
    plt.show()

loan_paid = X_train['loan_paid_back']==1
loan_unpaid = X_train['loan_paid_back']==0

In [None]:
scatter('loan_amount', 'annual_income')

A similar graph is produced for annual_income vs debt_to_income_ratio.

In [None]:
scatter('debt_to_income_ratio', 'credit_score')

A slight increase in credit score is observed as the debt-to-income ratio increases.
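
A "slight" trend is hard to judge by eye in a dense scatter plot. One way to check it, sketched here as an assumption rather than the method used above, is to fit a straight line and look at the slope. `linear_trend` is a hypothetical helper and the inputs are made-up numbers.

```python
import numpy as np

def linear_trend(x, y):
    """Least-squares slope and intercept of y against x."""
    slope, intercept = np.polyfit(
        np.asarray(x, dtype=float), np.asarray(y, dtype=float), deg=1
    )
    return float(slope), float(intercept)

# Made-up points lying exactly on y = 2x + 1.
slope, intercept = linear_trend([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

With the notebook's data, `linear_trend(X_train['debt_to_income_ratio'], X_train['credit_score'])` would report how many credit-score points correspond to one unit of the ratio.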

In [None]:
scatter('interest_rate', 'credit_score')

A higher credit score goes with a lower interest rate.

In [None]:
scatter('debt_to_income_ratio', 'annual_income')

People with a high income have a low debt-to-income ratio, which suggests that the rich take on less debt (compared to their income) 😂.

Also, people with an annual income above 200000 appear more likely to pay back their loans.
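
The claim about the 200000 threshold comes from reading the scatter plot, so it is worth checking against the actual repayment rates. Below is a minimal sketch of such a check; `rate_above_below` is a hypothetical helper and `demo` contains made-up incomes and outcomes.

```python
import pandas as pd

def rate_above_below(df: pd.DataFrame, col: str, threshold: float,
                     target: str = "loan_paid_back") -> dict:
    """Repayment rate for rows above the threshold and for the rest."""
    above = df[col] > threshold
    return {
        "above": float(df.loc[above, target].mean()),
        "below_or_equal": float(df.loc[~above, target].mean()),
    }

# Made-up data: two high earners who repaid, three others with mixed outcomes.
demo = pd.DataFrame({
    "annual_income": [50_000, 80_000, 120_000, 250_000, 300_000],
    "loan_paid_back": [0, 1, 0, 1, 1],
})

rates = rate_above_below(demo, "annual_income", 200_000)
print(rates)
```

Running `rate_above_below(X_train, 'annual_income', 200000)` would show how large the gap really is, and how many rows sit above the threshold matters too when reading it.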

*Hopefully it's helpful*

# Prediction

In [None]:
# importing useful libraries

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier

In [None]:
# Extracting categorical data: keep only the non-numeric columns, so that
# debt_to_income_ratio and credit_score are not one-hot encoded by mistake
X_cat = X_train.select_dtypes(exclude = np.number)
test_cat = X_test.select_dtypes(exclude = np.number)
X_cat.head()

In [None]:
# One Hot Encoding

oh = OneHotEncoder(handle_unknown = 'ignore', sparse_output = False)
oh_X = pd.DataFrame(oh.fit_transform(X_cat))
oh_X.columns = oh_X.columns.astype(str)

oh_t = pd.DataFrame(oh.transform(test_cat))
oh_t.columns = oh_t.columns.astype(str)

oh_X.head()

In [None]:
# Extracting Numerical Features
X = X_train.select_dtypes(include = np.number)
test = X_test.select_dtypes(include = np.number)

X.drop('id', axis = 1, inplace = True)
X.head()

# Merging
X = pd.concat([X, oh_X], axis = 1)
X_test = pd.concat([test, oh_t], axis = 1)

In [None]:
y = X.pop('loan_paid_back')
testID = X_test.pop('id')

In [None]:
model = XGBClassifier(random_state=42)

In [None]:
model.fit(X, y)
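
The model above is fitted on the full training set, so there is no local estimate of how well it will score. Since `cross_val_score` and `roc_auc_score` are already imported, a cross-validated ROC AUC is a natural check. The sketch below assumes a 5-fold stratified split; `cv_auc` is a hypothetical helper, and a small decision tree on synthetic data stands in for the XGBoost model so the example runs without the competition files.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def cv_auc(estimator, X, y, n_splits=5, seed=42):
    """Mean and standard deviation of ROC AUC over stratified folds."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(estimator, X, y, cv=cv, scoring="roc_auc")
    return float(scores.mean()), float(scores.std())

# Synthetic stand-in data: the label depends on the first feature plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

# Stand-in estimator; in the notebook, pass the XGBClassifier instead.
mean_auc, std_auc = cv_auc(
    DecisionTreeClassifier(max_depth=3, random_state=0), X_demo, y_demo
)
print(f"ROC AUC: {mean_auc:.3f} +/- {std_auc:.3f}")
```

In this notebook the call would be `cv_auc(XGBClassifier(random_state=42), X, y)`, run before the final fit on all of the training data.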

In [None]:
final = model.predict_proba(X_test)[:,1]

final = pd.DataFrame({'id': testID, 'loan_paid_back' : final})

final.head()

In [None]:
final.to_csv('submission.csv', index = False)