# Credit Scoring Model: Problem Statement and Project Plan

## Problem Statement
The goal of this project is to develop a machine learning model that predicts the creditworthiness of individuals from historical financial data. This is a classification task: given a person's personal and financial attributes, the model predicts their credit score category (`Low`, `Average`, or `High` in this dataset), which lenders can use as an indicator of how likely the person is to repay a loan.

## About the Dataset
This dataset pairs each individual’s credit score with personal and financial attributes: age, gender, income, education level, marital status, number of children, and home ownership status. Lending institutions rely on attributes like these to assess creditworthiness and make informed decisions.

- **Dataset Source**: Kaggle
- **Dataset Type**: Structured/tabular data
- **Task Type**: Classification

## Features of the Dataset
The dataset includes the following features:
- **Age**: The age of the individual.
- **Gender**: Gender of the individual (Male/Female).
- **Income**: Annual income of the individual.
- **Education**: Highest level of education attained.
- **Marital Status**: Marital status (Single/Married/Divorced).
- **Number of Children**: Number of children the individual has.
- **Home Ownership**: Whether the individual owns or rents their home (Owned/Rented).
- **Credit Score**: The individual’s credit score category (Low/Average/High); this is the target variable we aim to predict.

## Why This Dataset?
This dataset is ideal for this project because:
- It includes a variety of financial and personal attributes that can influence an individual's creditworthiness.
- It is a small tabular dataset from Kaggle, which makes it convenient for practicing an end-to-end classification workflow.
- It provides a balanced mix of numerical and categorical features, which makes it suitable for demonstrating data preprocessing, model training, and evaluation.

## When Was This Dataset Collected?
The dataset was collected recently and contains updated information regarding individual financial profiles, making it relevant for modern credit scoring models.

## Steps for the Project

### 1. Data Collection
- Download the dataset from Kaggle.

### 2. Data Exploration (EDA)
- Load the dataset and perform an initial exploration to understand its structure.
- Check for missing values, outliers, and distribution of key features.
- Visualize the distribution of credit scores and other features.
- Analyze correlations between the features.
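
The exploration steps above can be sketched with pandas alone. This is a minimal illustration, not the notebook's actual EDA: the small `sample` frame reuses the five rows shown by `df.head()` further down, and the variable names are assumptions made for the example.

```python
import pandas as pd

# Illustrative sample: the five rows displayed by df.head() in this notebook
# (a subset of columns is enough to show the idea).
sample = pd.DataFrame(
    {
        "Age": [25, 30, 35, 40, 45],
        "Gender": ["Female", "Male", "Female", "Male", "Female"],
        "Income": [50000, 100000, 75000, 125000, 100000],
        "Number of Children": [0, 2, 1, 0, 3],
        "Credit Score": ["High", "High", "High", "High", "High"],
    }
)

# Structure: column dtypes and missing values per column
print(sample.dtypes)
missing_per_column = sample.isnull().sum()

# Distribution of the target and summary statistics of the numeric features
target_counts = sample["Credit Score"].value_counts()
numeric_summary = sample.describe()

# Pairwise Pearson correlations, restricted to the numeric columns
correlations = sample.select_dtypes(include="number").corr()
print(correlations)
```

On the full dataset the same calls would be applied to `df`; plots of `target_counts` or the correlation matrix could then be drawn with seaborn or matplotlib.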

### 3. Data Preprocessing
- Handle missing values and outliers.
- Encode categorical variables (e.g., gender, marital status) using one-hot encoding or label encoding.
- Normalize or standardize numerical features like income and age to ensure consistent scaling.
- Split the data into training and testing sets.
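
One possible way to carry out the encoding, scaling, and splitting steps is sketched below. It assumes scikit-learn is installed and uses 15 rows copied from the outputs shown later in this notebook as a stand-in for the full dataset; the 60/40 split and the `random_state` value are arbitrary choices for the illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

columns = ["Age", "Gender", "Income", "Education", "Marital Status",
           "Number of Children", "Home Ownership", "Credit Score"]
# Rows copied from the df.head() and duplicate-row outputs in this notebook
rows = [
    (25, "Female", 50000, "Bachelor's Degree", "Single", 0, "Rented", "High"),
    (30, "Male", 100000, "Master's Degree", "Married", 2, "Owned", "High"),
    (35, "Female", 75000, "Doctorate", "Married", 1, "Owned", "High"),
    (40, "Male", 125000, "High School Diploma", "Single", 0, "Owned", "High"),
    (45, "Female", 100000, "Bachelor's Degree", "Married", 3, "Owned", "High"),
    (27, "Female", 37500, "High School Diploma", "Single", 0, "Rented", "Low"),
    (32, "Male", 57500, "Associate's Degree", "Single", 0, "Rented", "Average"),
    (28, "Female", 32500, "Associate's Degree", "Single", 0, "Rented", "Low"),
    (33, "Male", 52500, "High School Diploma", "Single", 0, "Rented", "Average"),
    (38, "Female", 67500, "Bachelor's Degree", "Married", 2, "Owned", "High"),
    (43, "Male", 92500, "Master's Degree", "Single", 0, "Owned", "High"),
    (29, "Female", 27500, "High School Diploma", "Single", 0, "Rented", "Low"),
    (34, "Male", 47500, "Associate's Degree", "Single", 0, "Rented", "Average"),
    (39, "Female", 62500, "Bachelor's Degree", "Married", 2, "Owned", "High"),
    (44, "Male", 87500, "Master's Degree", "Single", 0, "Owned", "High"),
]
data = pd.DataFrame(rows, columns=columns)

X = data.drop(columns="Credit Score")
y = data["Credit Score"]

# Split first so that scaling statistics come from the training rows only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)

# One-hot encode the categorical columns; align the test columns to the
# training columns (categories unseen in training are dropped, missing ones
# are filled with 0)
categorical = ["Gender", "Education", "Marital Status", "Home Ownership"]
X_train = pd.get_dummies(X_train, columns=categorical)
X_test = pd.get_dummies(X_test, columns=categorical).reindex(
    columns=X_train.columns, fill_value=0
)

# Standardize the numeric columns using the training mean and variance
numeric = ["Age", "Income", "Number of Children"]
scaler = StandardScaler()
X_train[numeric] = scaler.fit_transform(X_train[numeric])
X_test[numeric] = scaler.transform(X_test[numeric])
```

Fitting the scaler on the training split only avoids leaking information from the test rows into the model.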

### 4. Model Training
- Train multiple classification models, such as Logistic Regression, Random Forest, Decision Trees, and Gradient Boosting (XGBoost).
- Tune hyperparameters using `GridSearchCV` or `RandomizedSearchCV` to optimize model performance.
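
As a rough sketch of training with a hyperparameter search, the example below wraps preprocessing and a Logistic Regression model in a scikit-learn `Pipeline` and tunes the regularization strength `C` with `GridSearchCV`. The 15 rows are copied from outputs shown later in this notebook, and the grid values and `cv=3` are assumptions chosen to suit such a small sample.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

columns = ["Age", "Gender", "Income", "Education", "Marital Status",
           "Number of Children", "Home Ownership", "Credit Score"]
# Rows copied from the df.head() and duplicate-row outputs in this notebook
rows = [
    (25, "Female", 50000, "Bachelor's Degree", "Single", 0, "Rented", "High"),
    (30, "Male", 100000, "Master's Degree", "Married", 2, "Owned", "High"),
    (35, "Female", 75000, "Doctorate", "Married", 1, "Owned", "High"),
    (40, "Male", 125000, "High School Diploma", "Single", 0, "Owned", "High"),
    (45, "Female", 100000, "Bachelor's Degree", "Married", 3, "Owned", "High"),
    (27, "Female", 37500, "High School Diploma", "Single", 0, "Rented", "Low"),
    (32, "Male", 57500, "Associate's Degree", "Single", 0, "Rented", "Average"),
    (28, "Female", 32500, "Associate's Degree", "Single", 0, "Rented", "Low"),
    (33, "Male", 52500, "High School Diploma", "Single", 0, "Rented", "Average"),
    (38, "Female", 67500, "Bachelor's Degree", "Married", 2, "Owned", "High"),
    (43, "Male", 92500, "Master's Degree", "Single", 0, "Owned", "High"),
    (29, "Female", 27500, "High School Diploma", "Single", 0, "Rented", "Low"),
    (34, "Male", 47500, "Associate's Degree", "Single", 0, "Rented", "Average"),
    (39, "Female", 62500, "Bachelor's Degree", "Married", 2, "Owned", "High"),
    (44, "Male", 87500, "Master's Degree", "Single", 0, "Owned", "High"),
]
data = pd.DataFrame(rows, columns=columns)
X = data.drop(columns="Credit Score")
y = data["Credit Score"]

# Scale numeric columns and one-hot encode categorical ones inside the
# pipeline, so each cross-validation fold is preprocessed independently
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Income", "Number of Children"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     ["Gender", "Education", "Marital Status", "Home Ownership"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# Try three regularization strengths with 3-fold stratified cross-validation
param_grid = {"model__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The same pattern extends to the other candidate models by swapping the `"model"` step and its parameter grid.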

### 5. Model Evaluation
- Evaluate model performance using metrics like accuracy, precision, recall, F1-score, and ROC-AUC.
- Compare the performance of different models to select the best-performing one.
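
To make the metrics concrete, here is a small sketch using scikit-learn's metric functions. The `y_true` and `y_pred` lists are hypothetical labels invented for the illustration; they are not results from this project.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Hypothetical true and predicted labels, for illustration only
y_true = ["High", "High", "Low", "Average", "Low", "High"]
y_pred = ["High", "High", "Low", "Low", "Low", "Average"]

# Fraction of predictions that match the true label
accuracy = accuracy_score(y_true, y_pred)

# Macro F1 averages the per-class F1 scores, so every class counts equally
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Per-class precision, recall, and F1 in one table
print(classification_report(y_true, y_pred, zero_division=0))
```

With three classes, accuracy alone can hide poor performance on a minority class, which is why a macro-averaged score is worth reporting alongside it.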

### 6. Model Optimization and Tuning
- Fine-tune the selected model for better accuracy.
- Perform cross-validation to ensure model robustness.
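
A cross-validation check might look like the sketch below. It uses only the three numeric features of 15 rows copied from outputs in this notebook, and the Random Forest settings and fold count are assumptions for the example rather than tuned values.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# [Age, Income, Number of Children] for 15 rows shown in this notebook
X = [
    [25, 50000, 0], [30, 100000, 2], [35, 75000, 1], [40, 125000, 0],
    [45, 100000, 3], [27, 37500, 0], [32, 57500, 0], [28, 32500, 0],
    [33, 52500, 0], [38, 67500, 2], [43, 92500, 0], [29, 27500, 0],
    [34, 47500, 0], [39, 62500, 2], [44, 87500, 0],
]
y = ["High", "High", "High", "High", "High", "Low", "Average", "Low",
     "Average", "High", "High", "Low", "Average", "High", "High"]

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)

# One accuracy score per fold; the spread hints at how stable the model is
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```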

### 7. Model Deployment (Optional)
- Save the trained model using joblib or pickle.
- Deploy the model for real-time predictions if required.
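
Saving and reloading a model with joblib can be sketched as follows. The tiny Decision Tree and the file name `credit_model.joblib` are hypothetical and exist only to show the round trip; the three rows are taken from outputs in this notebook.

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeClassifier

# [Age, Income, Number of Children] for three rows shown in this notebook
X = [[25, 50000, 0], [27, 37500, 0], [32, 57500, 0]]
y = ["High", "Low", "Average"]
model = DecisionTreeClassifier(random_state=42).fit(X, y)

# Write the fitted model to disk (hypothetical file name in a temp directory)
path = os.path.join(tempfile.mkdtemp(), "credit_model.joblib")
joblib.dump(model, path)

# Load it back and confirm it predicts the same labels as the original
restored = joblib.load(path)
print(list(restored.predict(X)))
```

A saved model should be loaded with the same library versions it was trained with, and only from sources you trust, since joblib files are pickle-based.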

### 8. Reporting and Documentation
- Document the entire process and results.
- Create visualizations and present model performance and findings.

---

### Conclusion
Following these steps should produce a model that classifies an individual's credit score category from their personal and financial attributes, giving lenders an indication of default risk. The project demonstrates how machine learning can be applied to real-world problems in the finance sector.


In [1]:
import pandas as pd  # data loading and manipulation
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

import warnings
warnings.filterwarnings('ignore')

In [7]:
# raw string so the backslashes in the Windows path are not treated as escapes
df = pd.read_csv(r'D:\code_alpha\credit_scoring\data\preprocessed.csv')
# checking the first 5 rows
df.head()

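|   | Age | Gender | Income | Education | Marital Status | Number of Children | Home Ownership | Credit Score |
|---|-----|--------|--------|-----------|----------------|--------------------|----------------|--------------|
| 0 | 25 | Female | 50000 | Bachelor's Degree | Single | 0 | Rented | High |
| 1 | 30 | Male | 100000 | Master's Degree | Married | 2 | Owned | High |
| 2 | 35 | Female | 75000 | Doctorate | Married | 1 | Owned | High |
| 3 | 40 | Male | 125000 | High School Diploma | Single | 0 | Owned | High |
| 4 | 45 | Female | 100000 | Bachelor's Degree | Married | 3 | Owned | High |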


In [9]:
df.shape

(164, 8)

In [11]:
df.columns

Index(['Age', 'Gender', 'Income', 'Education', 'Marital Status',
       'Number of Children', 'Home Ownership', 'Credit Score'],
      dtype='object')

In [12]:
### Checking the info of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 164 entries, 0 to 163
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Age                 164 non-null    int64 
 1   Gender              164 non-null    object
 2   Income              164 non-null    int64 
 3   Education           164 non-null    object
 4   Marital Status      164 non-null    object
 5   Number of Children  164 non-null    int64 
 6   Home Ownership      164 non-null    object
 7   Credit Score        164 non-null    object
dtypes: int64(3), object(5)
memory usage: 10.4+ KB


# Observations from the Dataset

## General Overview
The dataset contains a total of **164 entries** with **8 columns**. The columns can be divided into two types:
- **Numerical Columns**: `Age`, `Income`, `Number of Children`
- **Categorical Columns**: `Gender`, `Education`, `Marital Status`, `Home Ownership`, `Credit Score`
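
The numerical/categorical split above can also be derived programmatically instead of being read off `df.info()`. A minimal sketch, using the first two rows from the `df.head()` output as a stand-in for `df`:

```python
import pandas as pd

# First two rows from the df.head() output, used as a stand-in for df
sample = pd.DataFrame(
    {
        "Age": [25, 30],
        "Gender": ["Female", "Male"],
        "Income": [50000, 100000],
        "Education": ["Bachelor's Degree", "Master's Degree"],
        "Marital Status": ["Single", "Married"],
        "Number of Children": [0, 2],
        "Home Ownership": ["Rented", "Owned"],
        "Credit Score": ["High", "High"],
    }
)

# Columns with a numeric dtype versus everything else
numerical_columns = sample.select_dtypes(include="number").columns.tolist()
categorical_columns = sample.select_dtypes(exclude="number").columns.tolist()

print(numerical_columns)
print(categorical_columns)
```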


In [13]:
### Checking the Null Values

df.isnull().sum()

Age                   0
Gender                0
Income                0
Education             0
Marital Status        0
Number of Children    0
Home Ownership        0
Credit Score          0
dtype: int64

In [14]:
# Check for duplicate rows
df.duplicated().sum()

np.int64(62)

In [20]:
df[df.duplicated()].head(20)

|    | Age | Gender | Income | Education | Marital Status | Number of Children | Home Ownership | Credit Score |
|----|-----|--------|--------|-----------|----------------|--------------------|----------------|--------------|
| 73 | 27 | Female | 37500 | High School Diploma | Single | 0 | Rented | Low |
| 74 | 32 | Male | 57500 | Associate's Degree | Single | 0 | Rented | Average |
| 79 | 28 | Female | 32500 | Associate's Degree | Single | 0 | Rented | Low |
| 80 | 33 | Male | 52500 | High School Diploma | Single | 0 | Rented | Average |
| 81 | 38 | Female | 67500 | Bachelor's Degree | Married | 2 | Owned | High |
| 82 | 43 | Male | 92500 | Master's Degree | Single | 0 | Owned | High |
| 85 | 29 | Female | 27500 | High School Diploma | Single | 0 | Rented | Low |
| 86 | 34 | Male | 47500 | Associate's Degree | Single | 0 | Rented | Average |
| 87 | 39 | Female | 62500 | Bachelor's Degree | Married | 2 | Owned | High |
| 88 | 44 | Male | 87500 | Master's Degree | Single | 0 | Owned | High |


In [21]:
df.drop_duplicates(inplace=True)


In [22]:
df.shape

(102, 8)
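
The row count is consistent with the duplicate check: 164 rows minus 62 duplicates leaves 102. The sketch below shows the same bookkeeping on a hypothetical three-row frame (`toy` is invented for the illustration) and adds an optional `reset_index`, since `drop_duplicates` keeps the original index labels and leaves gaps.

```python
import pandas as pd

# Hypothetical frame in which the last row repeats the first
toy = pd.DataFrame(
    {
        "Age": [27, 32, 27],
        "Income": [37500, 57500, 37500],
        "Credit Score": ["Low", "Average", "Low"],
    }
)

# duplicated() marks every repeat after the first occurrence (keep="first")
duplicate_mask = toy.duplicated()
n_duplicates = int(duplicate_mask.sum())

# Drop the repeats and renumber the index from 0
deduplicated = toy.drop_duplicates().reset_index(drop=True)

# Rows remaining = original rows - flagged duplicates
print(len(toy), n_duplicates, len(deduplicated))
```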

In [23]:
df.isnull().sum()

Age                   0
Gender                0
Income                0
Education             0
Marital Status        0
Number of Children    0
Home Ownership        0
Credit Score          0
dtype: int64