# **Title of Project**

**Credit Card Default Prediction**

# **Dataset Information**

**The dataset consists of 2000 samples across two classes: 1717 non-defaulters (`Default` = 0) and 283 defaulters (`Default` = 1). The five variables are:**

1. Income
2. Age
3. Loan
4. Loan to Income (engineered feature)
5. Default

# **Data Source**

**YBI Foundation GitHub Page**

# **Import Library**

In [1]:
import pandas as pd

# **Import Data**

In [2]:
default = pd.read_csv('https://github.com/ybifoundation/Dataset/raw/main/Credit%20Default.csv')

# **Describe Data**

In [3]:
default.head()

| | Income | Age | Loan | Loan to Income | Default |
|---|---|---|---|---|---|
| 0 | 66155.9251 | 59.017015 | 8106.532131 | 0.122537 | 0 |
| 1 | 34415.15397 | 48.117153 | 6564.745018 | 0.190752 | 0 |
| 2 | 57317.17006 | 63.108049 | 8020.953296 | 0.13994 | 0 |
| 3 | 42709.5342 | 45.751972 | 6103.64226 | 0.142911 | 0 |
| 4 | 66952.68885 | 18.584336 | 8770.099235 | 0.13099 | 1 |


# **Data Preprocessing**

In [4]:
default.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Income          2000 non-null   float64
 1   Age             2000 non-null   float64
 2   Loan            2000 non-null   float64
 3   Loan to Income  2000 non-null   float64
 4   Default         2000 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 78.2 KB


In [5]:
default.describe()

| | Income | Age | Loan | Loan to Income | Default |
|---|---|---|---|---|---|
| count | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 |
| mean | 45331.600018 | 40.927143 | 4444.369695 | 0.098403 | 0.1415 |
| std | 14326.327119 | 13.26245 | 3045.410024 | 0.05762 | 0.348624 |
| min | 20014.48947 | 18.055189 | 1.37763 | 4.9e-05 | 0.0 |
| 25% | 32796.45972 | 29.062492 | 1939.708847 | 0.047903 | 0.0 |
| 50% | 45789.11731 | 41.382673 | 3974.719418 | 0.099437 | 0.0 |
| 75% | 57791.28167 | 52.596993 | 6432.410625 | 0.147585 | 0.0 |
| max | 69995.68558 | 63.971796 | 13766.05124 | 0.199938 | 1.0 |


In [6]:
default['Default'].value_counts()

Default
0    1717
1     283
Name: count, dtype: int64
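
The counts above show a marked class imbalance, which matters when reading accuracy figures. A minimal sketch, using only the two counts printed above, of the default rate and of the accuracy a trivial model would reach by always predicting class 0 (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above
n_non_default = 1717
n_default = 283
n_total = n_non_default + n_default

# Share of customers who defaulted (matches the mean of Default in describe())
default_rate = n_default / n_total

# Accuracy of a trivial baseline that always predicts "no default" (class 0)
baseline_accuracy = n_non_default / n_total

print(f"default rate: {default_rate:.4f}")
print(f"always-0 baseline accuracy: {baseline_accuracy:.4f}")
```

Any accuracy reported for this dataset is best read against that baseline of roughly 0.86 rather than against 0.5.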

# **Define Target Variable (Y) and Feature Variables (X)**

In [7]:
default.columns

Index(['Income', 'Age', 'Loan', 'Loan to Income', 'Default'], dtype='object')

In [8]:
Y= default['Default']

In [9]:
Y.shape

(2000,)

In [10]:
X = default.drop(['Default'],axis=1)

In [11]:
X.shape

(2000, 4)

In [12]:
X

| | Income | Age | Loan | Loan to Income |
|---|---|---|---|---|
| 0 | 66155.92510 | 59.017015 | 8106.532131 | 0.122537 |
| 1 | 34415.15397 | 48.117153 | 6564.745018 | 0.190752 |
| 2 | 57317.17006 | 63.108049 | 8020.953296 | 0.139940 |
| 3 | 42709.53420 | 45.751972 | 6103.642260 | 0.142911 |
| 4 | 66952.68885 | 18.584336 | 8770.099235 | 0.130990 |
| ... | ... | ... | ... | ... |
| 1995 | 59221.04487 | 48.518179 | 1926.729397 | 0.032535 |
| 1996 | 69516.12757 | 23.162104 | 3503.176156 | 0.050394 |
| 1997 | 44311.44926 | 28.017167 | 5522.786693 | 0.124636 |
| 1998 | 43756.05660 | 63.971796 | 1622.722598 | 0.037086 |


# **Train Test Split Data**

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,train_size =0.7,random_state=2529)

In [15]:
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((1400, 4), (600, 4), (1400,), (600,))
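
Because only about 14% of rows are defaults, a purely random split can leave the training and test sets with slightly different default rates. One option, not used in this notebook, is to pass `stratify` to `train_test_split`. A minimal sketch on stand-in data (`X_demo` and `y_demo` are hypothetical placeholders; only the class counts come from this dataset):

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in labels with this dataset's class counts (1717 / 283)
y_demo = [0] * 1717 + [1] * 283
# Placeholder single-feature rows; real code would pass X here
X_demo = [[i] for i in range(len(y_demo))]

# stratify=y_demo keeps the class proportions nearly equal in both subsets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=2529, stratify=y_demo
)

train_rate = sum(y_train_s) / len(y_train_s)
test_rate = sum(y_test_s) / len(y_test_s)
print(f"train default rate: {train_rate:.4f}, test default rate: {test_rate:.4f}")
```

With the real data, the equivalent call would be `train_test_split(X, Y, train_size=0.7, random_state=2529, stratify=Y)`. The resulting split, and therefore the metrics, would differ from those reported in this notebook.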

# **Model Selection**

In [16]:
from sklearn.linear_model import LogisticRegression

In [17]:
model = LogisticRegression(max_iter=5000)
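
The features sit on very different scales (Income in the tens of thousands, Loan to Income below 0.2), which is one common reason the solver needs a large `max_iter` to converge. A possible alternative, not what this notebook does, is to standardize the features inside a pipeline first. A minimal sketch on hypothetical stand-in data (the data and the labelling rule are invented for illustration):

```python
import random

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

random.seed(0)

# Hypothetical stand-in rows: [income, age] on very different scales
X_demo = [[random.uniform(20000, 70000), random.uniform(18, 64)] for _ in range(200)]
# Hypothetical labelling rule used only for this demo: younger customers default
y_demo = [1 if age < 30 else 0 for _, age in X_demo]

# StandardScaler rescales each feature to zero mean and unit variance
# before the logistic regression sees it
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_demo, y_demo)

train_accuracy = scaled_model.score(X_demo, y_demo)
print(f"training accuracy on the demo data: {train_accuracy:.3f}")
```

Scaling also changes the coefficients, so values from a scaled model are not directly comparable with the `coef_` output of an unscaled one.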

# **Model Training**

In [18]:
model.fit(X_train,Y_train)

In [19]:
model.intercept_

array([9.39569095])

In [20]:
model.coef_

array([[-2.31410016e-04, -3.43062682e-01,  1.67863323e-03,
         1.51188530e+00]])
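
The intercept and coefficients define the predicted probability of default through the logistic function: p = 1 / (1 + exp(-z)), where z is the intercept plus the sum of each coefficient times its feature. A minimal sketch that applies the printed (rounded) values to two rows from `default.head()`; the helper name `default_probability` is illustrative, and results may differ slightly from `model.predict_proba` because of rounding:

```python
import math

# Values copied from model.intercept_ and model.coef_ above
intercept = 9.39569095
coefficients = [-2.31410016e-04, -3.43062682e-01, 1.67863323e-03, 1.51188530e+00]


def default_probability(features):
    """Logistic model: sigmoid of intercept + dot(coefficients, features)."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


# Feature order: Income, Age, Loan, Loan to Income (rows 0 and 4 of head())
row_0 = [66155.9251, 59.017015, 8106.532131, 0.122537]
row_4 = [66952.68885, 18.584336, 8770.099235, 0.13099]

p_row_0 = default_probability(row_0)
p_row_4 = default_probability(row_4)

# model.predict applies a 0.5 threshold to this probability
print(p_row_0 >= 0.5, p_row_4 >= 0.5)
```

Row 0 falls well below the 0.5 threshold and row 4 above it, which agrees with their recorded `Default` values of 0 and 1. The negative Age coefficient also indicates that, with the other features held fixed, the model assigns older customers a lower probability of default.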

# **Predict Test Data**

In [21]:
Y_pred = model.predict(X_test)

In [22]:
Y_pred

array([0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,

# **Model Accuracy**

In [23]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [24]:
confusion_matrix(Y_test,Y_pred)

array([[506,  13],
       [ 17,  64]])

In [25]:
accuracy_score(Y_test,Y_pred)

0.95

In [26]:
print(classification_report(Y_test,Y_pred))

              precision    recall  f1-score   support

           0       0.97      0.97      0.97       519
           1       0.83      0.79      0.81        81

    accuracy                           0.95       600
   macro avg       0.90      0.88      0.89       600
weighted avg       0.95      0.95      0.95       600
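
The accuracy and the class-1 row of the report follow directly from the four confusion-matrix counts. A minimal sketch of that arithmetic, assuming scikit-learn's convention that rows are actual classes and columns are predicted classes:

```python
# Counts copied from confusion_matrix(Y_test, Y_pred) above
# (rows = actual class, columns = predicted class)
tn, fp = 506, 13
fn, tp = 17, 64

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)  # of predicted defaulters, the share who defaulted
recall = tp / (tp + fn)     # of actual defaulters, the share the model caught
f1 = 2 * precision * recall / (precision + recall)

# prints 0.95 0.83 0.79 0.81
print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```

Recall is the figure to watch here: 17 of the 81 actual defaulters in the test set were missed, even though overall accuracy is 0.95.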



# **Explanation**

**A "Credit Card Default Prediction" machine learning (ML) project develops a model that predicts, from a set of input features, whether a credit card customer will default.**

Here's an explanation of the key steps and components involved in such a project:

**Data Collection and Preprocessing:**

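Gather a dataset that contains information about credit card users. It should include features such as Income, Age, Loan, and Loan to Income. Clean and preprocess the data to handle missing values, outliers, and categorical variables. In this dataset, `default.info()` reports 2000 non-null values in every column and all features are numeric, so no imputation or encoding was needed.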

**Feature Selection/Engineering:**

Identify the features most relevant to predicting defaulters. You may need to transform or engineer features to make them more suitable for modeling; here, Loan to Income is an engineered feature.
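
The Loan to Income column appears to be Loan divided by Income. That definition is an assumption inferred from the printed values, not something the data source states. A minimal sketch that checks it against the first rows of `default.head()`:

```python
# (Income, Loan, Loan to Income) copied from the first rows of default.head()
rows = [
    (66155.9251, 8106.532131, 0.122537),
    (34415.15397, 6564.745018, 0.190752),
    (57317.17006, 8020.953296, 0.13994),
]

for income, loan, listed_ratio in rows:
    computed_ratio = loan / income  # assumed definition of the engineered feature
    print(f"computed={computed_ratio:.6f}  listed={listed_ratio}")

# Largest difference between the computed and the listed ratio
max_gap = max(abs(loan / income - listed) for income, loan, listed in rows)
```

On a DataFrame the same feature would be a single column operation, for example `default['Loan'] / default['Income']`.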

**Data Splitting:**

Split the dataset into training and testing subsets. The training subset is used to fit the model, and the testing subset is used to evaluate its performance on unseen data. This notebook uses a 70/30 split: 1400 training rows and 600 test rows.

**Model Selection:**

Choose an appropriate classification algorithm for the credit card default prediction task. Logistic Regression, despite its name, is a classifier and a common choice for binary outcomes such as default versus no default.

**Model Training:**

Train the selected model using the training data. During training, the model learns the relationship between the input features and the target (Default) by adjusting its internal parameters.

**Model Evaluation:**

Use the testing data to evaluate the model's performance. Common evaluation metrics for classification tasks include the confusion matrix, accuracy score, classification report, and area under the ROC curve (AUC). These metrics show how well the model's predictions match the actual default outcomes. This notebook reports the first three.
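
AUC can be read as the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter. A minimal pure-Python sketch of that definition on hypothetical labels and scores (none of these numbers come from this dataset; `roc_auc` is an illustrative helper):

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted default probabilities
demo_labels = [0, 0, 1, 0, 1, 1]
demo_scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.90]
demo_auc = roc_auc(demo_labels, demo_scores)
print(f"AUC on the demo data: {demo_auc:.4f}")
```

In scikit-learn, the usual route is `roc_auc_score` from `sklearn.metrics`, applied to `Y_test` and the class-1 column of `model.predict_proba(X_test)`.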

**Prediction and Deployment (Optional):**

Once you're satisfied with the model's performance, you can deploy it to make real-world predictions. Users could input customer details, and the model would predict whether that customer is likely to default.

**Iterative Refinement:**

If your model's performance is not satisfactory, you can revisit earlier steps to improve it. This might involve collecting more data, experimenting with different features, or trying different algorithms.

**Communication:**

Finally, communicate your results and findings. This could involve creating visualizations that show how well the model's predictions align with the actual default outcomes. Explain the model's strengths, weaknesses, and potential applications.

**Overall, a Credit Card Default Prediction ML project demonstrates how machine learning can be applied to a real-world banking problem: predicting whether a credit card customer will default.**