## Life Cycle of a Machine Learning Project
- 1. Understanding the Problem Statement
- 2. Data Collection
- 3. Data Checks to Perform
- 4. Exploratory Data Analysis
- 5. Data Pre-Processing
- 6. Model Training

## 1. Problem Statement: Predict credit card default risk.

- `Objective`: Develop a predictive model to assess the likelihood of credit card holders defaulting on payments.

- `Success Criteria`: Achieve high accuracy in predicting credit card default, so that proactive measures can be taken to mitigate risk.
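
Accuracy alone can be hard to interpret when defaults are much rarer than non-defaults. As a minimal sketch, assuming hypothetical labels `y_true` and predictions `y_pred` (illustrative values, not taken from this dataset), the snippet below compares a model's accuracy with the accuracy of always predicting the majority class, and also reports recall on the default class. The helper names are assumptions made for this example.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common label."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Share of actual positives (defaults) that the predictions catch."""
    hits = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual = sum(t == positive for t in y_true)
    return hits / actual if actual else 0.0

# Hypothetical labels: 1 = default, 0 = no default (illustrative only).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

model_acc = accuracy(y_true, y_pred)
baseline_acc = majority_baseline_accuracy(y_true)
default_recall = recall(y_true, y_pred)
print(model_acc, baseline_acc, default_recall)
```

In this toy case the model is only slightly more accurate than the baseline and misses half of the defaults, which is why a success criterion is often stated with a class-specific metric alongside accuracy.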

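## 2. Data Collection
## About the Data
- `data`: The UCI_Credit_Card dataset contains credit card client records, a commonly studied domain in data science and finance. Each row describes one client: credit limit (`LIMIT_BAL`), demographic attributes (`SEX`, `EDUCATION`, `MARRIAGE`, `AGE`), repayment status (`PAY_0`, `PAY_2` to `PAY_6`), bill amounts (`BILL_AMT1` to `BILL_AMT6`), payment amounts (`PAY_AMT1` to `PAY_AMT6`), and a flag for default in the following month (`default.payment.next.month`). It is typically used for studies of credit risk assessment.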

## 2.1 Import Data and Required Packages
- Importing pandas for reading the data

In [38]:
import pandas as pd

In [39]:
df = pd.read_csv("data\\UCI_Credit_Card.csv")
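
The path above uses Windows-style backslashes. As a hedged alternative, the sketch below builds the path with `pathlib` so the same code can run on other operating systems. The helper `load_csv` and the sample file are hypothetical and exist only for this illustration.

```python
from pathlib import Path
import tempfile

import pandas as pd

def load_csv(data_dir, filename):
    """Read a CSV file located in data_dir, using an OS-independent path."""
    return pd.read_csv(Path(data_dir) / filename)

# Demonstration with a temporary directory standing in for "data".
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.csv"
    sample_path.write_text("ID,LIMIT_BAL\n1,20000.0\n2,120000.0\n")
    sample = load_csv(tmp, "sample.csv")

print(sample.shape)
```

In this notebook the equivalent call would be `load_csv("data", "UCI_Credit_Card.csv")`.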

## 2.2 Showing the First 5 Rows

In [40]:
df.head()

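| | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000.0 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 2 | 120000.0 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
| 2 | 3 | 90000.0 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 3 | 4 | 50000.0 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
| 4 | 5 | 50000.0 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |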


## 2.3 Renaming the Dependent (Target) Column

In [44]:
# Rename the target column to a shorter name without dots
df = df.rename(columns={"default.payment.next.month": "default"})

In [45]:
df.head()

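| | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000.0 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 2 | 120000.0 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
| 2 | 3 | 90000.0 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 3 | 4 | 50000.0 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
| 4 | 5 | 50000.0 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |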


## 2.4 Checking for Null Values in the Dataset

In [46]:
df.isnull().sum()

ID           0
LIMIT_BAL    0
SEX          0
EDUCATION    0
MARRIAGE     0
AGE          0
PAY_0        0
PAY_2        0
PAY_3        0
PAY_4        0
PAY_5        0
PAY_6        0
BILL_AMT1    0
BILL_AMT2    0
BILL_AMT3    0
BILL_AMT4    0
BILL_AMT5    0
BILL_AMT6    0
PAY_AMT1     0
PAY_AMT2     0
PAY_AMT3     0
PAY_AMT4     0
PAY_AMT5     0
PAY_AMT6     0
default      0
dtype: int64

## Conclusion
- No missing values in the dataset

## 2.6 Showing the Shape of the Dataset

In [47]:
df.shape

(30000, 25)

## 2.7 Checking for Duplicates in the Data

In [48]:
df.duplicated().sum()

0

## Conclusion
- No duplicated rows in the dataset
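
One caveat: `df.duplicated()` compares entire rows, so if `ID` is a unique row identifier (an assumption; only its first five values are shown above), two clients with otherwise identical records would never be flagged. The following sketch repeats the check while ignoring identifier columns. The helper `count_duplicates` and the toy values are hypothetical.

```python
import pandas as pd

def count_duplicates(df, ignore=("ID",)):
    """Count rows that repeat an earlier row, ignoring identifier columns."""
    cols = [c for c in df.columns if c not in ignore]
    return int(df.duplicated(subset=cols).sum())

# Toy frame (illustrative values): rows 0 and 2 differ only in ID.
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "LIMIT_BAL": [20000.0, 50000.0, 20000.0],
    "AGE": [24, 37, 24],
})

full_row_dupes = int(toy.duplicated().sum())  # check over all columns
no_id_dupes = count_duplicates(toy)           # check without ID
print(full_row_dupes, no_id_dupes)
```

On the real data, `count_duplicates(df)` would show whether any records repeat once `ID` is set aside.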

## 3. Data Checks to Perform

## 3.1 Data Type of Each Column

In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         30000 non-null  int64  
 1   LIMIT_BAL  30000 non-null  float64
 2   SEX        30000 non-null  int64  
 3   EDUCATION  30000 non-null  int64  
 4   MARRIAGE   30000 non-null  int64  
 5   AGE        30000 non-null  int64  
 6   PAY_0      30000 non-null  int64  
 7   PAY_2      30000 non-null  int64  
 8   PAY_3      30000 non-null  int64  
 9   PAY_4      30000 non-null  int64  
 10  PAY_5      30000 non-null  int64  
 11  PAY_6      30000 non-null  int64  
 12  BILL_AMT1  30000 non-null  float64
 13  BILL_AMT2  30000 non-null  float64
 14  BILL_AMT3  30000 non-null  float64
 15  BILL_AMT4  30000 non-null  float64
 16  BILL_AMT5  30000 non-null  float64
 17  BILL_AMT6  30000 non-null  float64
 18  PAY_AMT1   30000 non-null  float64
 19  PAY_AMT2   30000 non-null  float64
 20  PAY_AM... [output truncated]

## 3.2 Converting Each Column to a Suitable Data Type

In [50]:
import numpy as np

def convert_columns(df):
    # Convert columns with int64 dtype to int32 if the values are within the int32 range
    int_cols = df.select_dtypes(include=['int64']).columns
    for col in int_cols:
        if np.iinfo(np.int32).min <= df[col].min() <= df[col].max() <= np.iinfo(np.int32).max:
            df[col] = df[col].astype('int32')

    # Convert columns with float64 dtype to float32 if the values are within the float32 range
    float_cols = df.select_dtypes(include=['float64']).columns
    for col in float_cols:
        if np.finfo(np.float32).min <= df[col].min() <= df[col].max() <= np.finfo(np.float32).max:
            df[col] = df[col].astype('float32')

    return df


In [51]:
df = convert_columns(df)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         30000 non-null  int32  
 1   LIMIT_BAL  30000 non-null  float32
 2   SEX        30000 non-null  int32  
 3   EDUCATION  30000 non-null  int32  
 4   MARRIAGE   30000 non-null  int32  
 5   AGE        30000 non-null  int32  
 6   PAY_0      30000 non-null  int32  
 7   PAY_2      30000 non-null  int32  
 8   PAY_3      30000 non-null  int32  
 9   PAY_4      30000 non-null  int32  
 10  PAY_5      30000 non-null  int32  
 11  PAY_6      30000 non-null  int32  
 12  BILL_AMT1  30000 non-null  float32
 13  BILL_AMT2  30000 non-null  float32
 14  BILL_AMT3  30000 non-null  float32
 15  BILL_AMT4  30000 non-null  float32
 16  BILL_AMT5  30000 non-null  float32
 17  BILL_AMT6  30000 non-null  float32
 18  PAY_AMT1   30000 non-null  float32
 19  PAY_AMT2   30000 non-null  float32
 20  PAY_AM... [output truncated]
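
To make the effect of the conversion concrete, here is a small sketch on a toy frame (illustrative values, not the real dataset) that measures memory before and after downcasting. It also checks that the values survive the cast, since a range check alone does not guarantee precision: `float32` represents integers exactly only up to 2^24 (16,777,216).

```python
import numpy as np
import pandas as pd

# Toy frame with the default 64-bit dtypes (illustrative values).
toy = pd.DataFrame({
    "AGE": np.array([24, 26, 34, 37, 57], dtype="int64"),
    "LIMIT_BAL": np.array([20000.0, 120000.0, 90000.0, 50000.0, 50000.0],
                          dtype="float64"),
})

# Bytes used by the column data (index excluded).
before = int(toy.memory_usage(index=False).sum())

# Downcast to 32-bit types, as convert_columns does when the values fit.
small = toy.astype({"AGE": "int32", "LIMIT_BAL": "float32"})
after = int(small.memory_usage(index=False).sum())

# Confirm the float values are unchanged after the cast.
unchanged = bool((small["LIMIT_BAL"].astype("float64") == toy["LIMIT_BAL"]).all())

print(before, after, unchanged)
```

Halving the width of every numeric column halves the memory used by the column data, which matters more as the number of rows grows.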

## Saving Processed Data

In [52]:
df.to_csv("data\\processed_data.csv", index=False)

## Reading Processed Data

In [53]:
df = pd.read_csv("data\\processed_data.csv")
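
Note that a CSV file stores no dtype information, so reading the file back lets pandas infer 64-bit types again. The sketch below shows this on a toy frame, using an in-memory buffer instead of a file, and how an explicit `dtype` mapping keeps the 32-bit types. The column subset and values are illustrative assumptions.

```python
import io

import pandas as pd

# Toy frame with 32-bit dtypes (illustrative values).
toy = pd.DataFrame({"ID": [1, 2], "LIMIT_BAL": [20000.0, 120000.0]})
toy = toy.astype({"ID": "int32", "LIMIT_BAL": "float32"})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Plain read: pandas infers 64-bit dtypes again.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Read with an explicit dtype mapping to keep the 32-bit types.
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"ID": "int32", "LIMIT_BAL": "float32"})

print(plain.dtypes.astype(str).to_dict())
print(typed.dtypes.astype(str).to_dict())
```

For the full dataset, one option would be to save `df.dtypes` before writing and pass it back through `dtype=` when reading, or to call `convert_columns` again after loading.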

In [54]:
df.head()

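| | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000.0 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 2 | 120000.0 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
| 2 | 3 | 90000.0 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 3 | 4 | 50000.0 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
| 4 | 5 | 50000.0 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |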
