# Business Understanding
## Project Overview
In the financial sector, assessing the creditworthiness of individuals or businesses is critical to minimizing risk while ensuring profitability. This project focuses on credit analysis to predict the likelihood of loan default, enabling financial institutions to make data-driven decisions regarding loan approvals, interest rates, and credit limits.

## Business Objective
The objective of this project is to develop a reliable predictive model that assesses the credit risk of potential borrowers. By analyzing historical loan and borrower data, the model aims to:
- Mitigate Financial Risk: Identify high-risk borrowers to minimize default rates.
- Improve Decision-Making: Enhance the efficiency and accuracy of credit approval processes.
- Optimize Resource Allocation: Prioritize loans to creditworthy individuals or businesses, ensuring better returns.

## Key Questions
To achieve the objectives, the project seeks to answer the following business questions:
- What are the key factors influencing loan defaults?
- How can we categorize borrowers based on their credit risk levels?
- Can we predict the likelihood of default with high accuracy using historical data?

## Stakeholders
The primary stakeholders for this project include:
- Financial Institutions: To enhance their credit evaluation processes.
- Risk Management Teams: To develop strategies for reducing loan defaults.
- Business Analysts: To derive actionable insights from the data.

## Success Metrics
The success of this project will be evaluated based on:
- Model Performance: Achieving high accuracy, precision, recall, and F1-score in predicting defaults.
- Business Impact: Reduction in non-performing loans (NPLs) and improved profitability.
- Operational Efficiency: Faster and more efficient credit approval processes.
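
The model performance metrics listed above can all be derived from the four confusion-matrix counts. The following is a minimal sketch, not the project's evaluation code: the function name `classification_metrics` and the example counts are hypothetical and chosen only for illustration, with "default" treated as the positive class.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 60 defaults caught, 40 missed, 20 false alarms, 380 correct non-defaults
metrics = classification_metrics(tp=60, fp=20, fn=40, tn=380)
print(metrics)
```

With these made-up counts, accuracy is 0.88 while recall is only 0.60, which is why accuracy alone is usually not enough when defaults are the minority class.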

## Scope of the Project
This analysis will involve exploring historical data related to borrowers’ demographics, financial behaviors, and loan performance. The output will be a machine learning model capable of categorizing borrowers into risk groups (low, medium, high) and predicting the probability of default for each applicant.
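
One way to turn a predicted probability of default into the three risk groups mentioned above is simple thresholding. The sketch below is only an illustration: the function name `risk_group` and the cut-offs 0.2 and 0.5 are assumptions, not values taken from this project, and in practice they would be set with the risk management team.

```python
def risk_group(p_default, medium_from=0.2, high_from=0.5):
    """Map a probability of default to 'low', 'medium' or 'high'.

    The thresholds are hypothetical defaults, not calibrated values.
    """
    if not 0.0 <= p_default <= 1.0:
        raise ValueError("p_default must be between 0 and 1")
    if p_default >= high_from:
        return "high"
    if p_default >= medium_from:
        return "medium"
    return "low"


# Hypothetical model outputs for four applicants
groups = [risk_group(p) for p in [0.05, 0.2, 0.35, 0.8]]
print(groups)
```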

## Data Understanding
### Source of the data
The dataset used for this project was retrieved from Kaggle. It is available [here](https://www.kaggle.com/datasets/laotse/credit-risk-dataset).
### Description of the dataset
The dataset has 12 columns, described below:

1) person_age: Age of the person
2) person_income: Annual income earned by the person
3) person_home_ownership: Home ownership status of the person (for example RENT, OWN, MORTGAGE)
4) person_emp_length: How long the person has been employed (in years)
5) loan_intent: Reason for taking the loan
6) loan_grade: Loan grade
7) loan_amnt: Loan amount
8) loan_int_rate: Interest rate on the loan
9) loan_status: Loan status, where 0 is non-default and 1 is default
10) loan_percent_income: Loan amount as a proportion of income, stored as a fraction (for example 0.59)
11) cb_person_default_on_file: Whether the person has a historical default on file (Y or N)
12) cb_person_cred_hist_length: Length of the person's credit history

The dataset contains 32,581 entries in total, which will be used for the task at hand: credit scoring.
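
Before analysis it can help to confirm that a loaded file actually has the 12 columns described above. The sketch below is one possible check, not part of the original notebook: the helper `check_columns` is hypothetical, and the toy frame with a deliberately misspelled column name stands in for the real CSV.

```python
import pandas as pd

# Column names as listed in the dataset description
EXPECTED_COLUMNS = [
    "person_age", "person_income", "person_home_ownership",
    "person_emp_length", "loan_intent", "loan_grade", "loan_amnt",
    "loan_int_rate", "loan_status", "loan_percent_income",
    "cb_person_default_on_file", "cb_person_cred_hist_length",
]


def check_columns(df, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names relative to the expected list."""
    missing = [c for c in expected if c not in df.columns]
    unexpected = [c for c in df.columns if c not in expected]
    return missing, unexpected


# Empty stand-in frame whose last column name is deliberately misspelled
toy = pd.DataFrame(columns=EXPECTED_COLUMNS[:-1] + ["cb_preson_cred_hist_length"])
missing, unexpected = check_columns(toy)
print("missing:", missing)
print("unexpected:", unexpected)
```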


In [2]:
#Import necessary libraries
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

In [3]:
#Import the dataset to be analysed and display it
credit = pd.read_csv('credit_risk_dataset.csv')
credit

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3
1,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.10,N,2
2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
3,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2
4,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4
...,...,...,...,...,...,...,...,...,...,...,...,...
32576,57,53000,MORTGAGE,1.0,PERSONAL,C,5800,13.16,0,0.11,N,30
32577,54,120000,MORTGAGE,4.0,PERSONAL,A,17625,7.49,0,0.15,N,19
32578,65,76000,RENT,3.0,HOMEIMPROVEMENT,B,35000,10.99,1,0.46,N,28
32579,56,150000,MORTGAGE,5.0,PERSONAL,B,15000,11.48,0,0.10,N,26


In [4]:
#show a summary of the dataset
credit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32581 entries, 0 to 32580
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   person_age                  32581 non-null  int64  
 1   person_income               32581 non-null  int64  
 2   person_home_ownership       32581 non-null  object 
 3   person_emp_length           31686 non-null  float64
 4   loan_intent                 32581 non-null  object 
 5   loan_grade                  32581 non-null  object 
 6   loan_amnt                   32581 non-null  int64  
 7   loan_int_rate               29465 non-null  float64
 8   loan_status                 32581 non-null  int64  
 9   loan_percent_income         32581 non-null  float64
 10  cb_person_default_on_file   32581 non-null  object 
 11  cb_person_cred_hist_length  32581 non-null  int64  
dtypes: float64(3), int64(5), object(4)
memory usage: 3.0+ MB


In [5]:
#Give a statistical summary of the dataset
credit.describe()

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_cred_hist_length
count,32581.0,32581.0,31686.0,32581.0,29465.0,32581.0,32581.0,32581.0
mean,27.7346,66074.85,4.789686,9589.371106,11.011695,0.218164,0.170203,5.804211
std,6.348078,61983.12,4.14263,6322.086646,3.240459,0.413006,0.106782,4.055001
min,20.0,4000.0,0.0,500.0,5.42,0.0,0.0,2.0
25%,23.0,38500.0,2.0,5000.0,7.9,0.0,0.09,3.0
50%,26.0,55000.0,4.0,8000.0,10.99,0.0,0.15,4.0
75%,30.0,79200.0,7.0,12200.0,13.47,0.0,0.23,8.0
max,144.0,6000000.0,123.0,35000.0,23.22,1.0,0.83,30.0
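
The summary above shows a maximum person_age of 144 and a maximum person_emp_length of 123 years, both of which look implausible. The sketch below shows one way such rows could be flagged for review. It is an illustration on a toy frame, not a step the notebook performs: the age cap of 100 is an assumption, and the row with age 144 is a made-up stand-in for the real record.

```python
import pandas as pd

# Toy data: the first three rows echo the displayed sample, the last is hypothetical
sample = pd.DataFrame({
    "person_age": [22, 21, 25, 144],
    "person_emp_length": [123.0, 5.0, 1.0, 4.0],
})

MAX_PLAUSIBLE_AGE = 100  # assumption, not a value from the source

# Flag ages above the cap, or employment lasting longer than the person's age
implausible = (
    (sample["person_age"] > MAX_PLAUSIBLE_AGE)
    | (sample["person_emp_length"] > sample["person_age"])
)
flagged = sample[implausible]
print(flagged)
```

Whether flagged rows are dropped, capped, or corrected is a separate decision that depends on how the data was collected.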


In [6]:
#Count the missing values in each column
credit.isnull().sum()

person_age                       0
person_income                    0
person_home_ownership            0
person_emp_length              895
loan_intent                      0
loan_grade                       0
loan_amnt                        0
loan_int_rate                 3116
loan_status                      0
loan_percent_income              0
cb_person_default_on_file        0
cb_person_cred_hist_length       0
dtype: int64
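
In the output above, 895 of the 32,581 values (about 2.7%) are missing for person_emp_length and 3,116 (about 9.6%) for loan_int_rate. Per-column counts do not say how many rows are affected, because a single row can be missing both values. The sketch below illustrates the difference on a toy frame with made-up values.

```python
import pandas as pd

# Toy frame: the last row is missing both values
toy = pd.DataFrame({
    "person_emp_length": [5.0, None, 1.0, None],
    "loan_int_rate": [11.14, 12.87, None, None],
    "loan_amnt": [1000, 5500, 35000, 35000],
})

# Percentage of missing values per column
missing_pct = (toy.isnull().mean() * 100).round(1)

# Number of rows that dropna() would remove (rows with at least one null)
rows_with_any_null = int(toy.isnull().any(axis=1).sum())

print(missing_pct)
print(rows_with_any_null)
```

Here the two columns have four missing values between them, but only three rows contain a null.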

In [7]:
#Drop every row that has at least one missing value
credit = credit.dropna()

In [8]:
credit

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3
1,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.10,N,2
2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
3,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2
4,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4
...,...,...,...,...,...,...,...,...,...,...,...,...
32576,57,53000,MORTGAGE,1.0,PERSONAL,C,5800,13.16,0,0.11,N,30
32577,54,120000,MORTGAGE,4.0,PERSONAL,A,17625,7.49,0,0.15,N,19
32578,65,76000,RENT,3.0,HOMEIMPROVEMENT,B,35000,10.99,1,0.46,N,28
32579,56,150000,MORTGAGE,5.0,PERSONAL,B,15000,11.48,0,0.10,N,26


In [9]:
#Count fully duplicated rows
credit.duplicated().sum()

137

In [10]:
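#Remove the duplicated rows, keeping the first occurrence of each
credit = credit.drop_duplicates()
#Confirm that no duplicates remain
credit.duplicated().sum()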

0

In [11]:
credit

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3
1,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.10,N,2
2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
3,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2
4,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4
...,...,...,...,...,...,...,...,...,...,...,...,...
32576,57,53000,MORTGAGE,1.0,PERSONAL,C,5800,13.16,0,0.11,N,30
32577,54,120000,MORTGAGE,4.0,PERSONAL,A,17625,7.49,0,0.15,N,19
32578,65,76000,RENT,3.0,HOMEIMPROVEMENT,B,35000,10.99,1,0.46,N,28
32579,56,150000,MORTGAGE,5.0,PERSONAL,B,15000,11.48,0,0.10,N,26


Over 87% of the initial data was retained after dropping the rows with null values, which still leaves a substantial amount of data to work with. Removing the 137 duplicate rows lowered this only slightly: the counts above imply that at most 895 + 3,116 + 137 = 4,148 of the 32,581 rows were removed, so at least 87.2% of the original data remains, which is still very useful.
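
The retained share can be computed directly from row counts rather than estimated. The sketch below shows the idea on a small made-up frame, so the variable names and percentages are illustrative only and are not the figures for the credit dataset.

```python
import pandas as pd

# Toy frame: one duplicated row and two rows with a missing value
raw = pd.DataFrame({
    "a": [1, 1, 2, None, 3],
    "b": ["x", "x", "y", "z", None],
})

n_raw = len(raw)
no_nulls = raw.dropna()               # same step as credit.dropna()
deduped = no_nulls.drop_duplicates()  # same step as credit.drop_duplicates()

retained_after_dropna = len(no_nulls) / n_raw * 100
retained_after_dedup = len(deduped) / n_raw * 100

print(f"after dropna: {retained_after_dropna:.1f}%")
print(f"after drop_duplicates: {retained_after_dedup:.1f}%")
```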

In [12]:
#Percentage of the original 32,581 rows retained after cleaning
round(len(credit) / 32581 * 100, 1)