LIMIT_BAL: This feature represents the credit limit assigned to the individual's credit card. It indicates the maximum amount of credit the person can utilize.

SEX: This feature represents the gender of the credit card holder, encoded as 1 or 2 in this data. Gender is not a direct cause of default, but it is a demographic factor that may correlate with creditworthiness.

EDUCATION: This feature indicates the educational background of the credit card holder. It can provide insights into the person's level of education, which might indirectly correlate with their financial stability and ability to manage credit.

MARRIAGE: This feature represents the marital status of the credit card holder. Like gender, it is a demographic factor that may be associated with the likelihood of default.

AGE: This feature denotes the age of the credit card holder. Age can be an important factor in assessing creditworthiness as it often correlates with financial responsibility and stability.

PAY_0, PAY_2, PAY_3, PAY_4, PAY_5, PAY_6: These features represent the repayment status of the credit card in each of the past six months (note that the names skip PAY_1). Following the coding documented for the UCI "default of credit card clients" dataset, whose column names this file matches, -1 means the bill was paid duly, 1 means payment was delayed by one month, 2 means a delay of two months, and so on. The values 0 and -2 also occur in the data, although that description does not define them. These features capture the individual's payment behavior over time.
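The repayment codes are easier to read once translated into labels. The helper below is a minimal sketch, assuming the coding documented for the UCI "default of credit card clients" dataset (-1 = paid duly, positive values = months of delay). The function name `describe_pay_status` is hypothetical, and the codes 0 and -2 are deliberately left as "undocumented" rather than guessed.

```python
def describe_pay_status(code: int) -> str:
    """Translate a PAY_* repayment status code into a readable label."""
    if code >= 1:
        unit = "month" if code == 1 else "months"
        return f"payment delayed by {code} {unit}"
    if code == -1:
        return "paid duly"
    # 0 and -2 occur in the data but are not defined in the dataset
    # description, so they are reported as undocumented rather than guessed.
    return f"undocumented code {code}"


for code in (-2, -1, 0, 1, 2, 8):
    print(code, "->", describe_pay_status(code))
```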

BILL_AMT1, BILL_AMT2, BILL_AMT3, BILL_AMT4, BILL_AMT5, BILL_AMT6: These features represent the bill statement amount for each of the six months. They show the outstanding balance on the credit card at specific points in time.

PAY_AMT1, PAY_AMT2, PAY_AMT3, PAY_AMT4, PAY_AMT5, PAY_AMT6: These features represent the amount of payment made by the credit card holder for the respective months. They indicate the actual payments made to reduce the outstanding balance.

default payment next month: This is the target (dependent) variable. It indicates whether the credit card holder defaulted on their payment in the following month (1 for default, 0 for no default), and it is what the default prediction model aims to predict.

In [1]:
import pandas as pd
data=pd.read_csv("https://raw.githubusercontent.com/sunnysavita10/credit_card_pw_hindi/main/creditCardFraud_28011964_120214.csv")

In [2]:
data

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,50000,1,2,1,57,-1,0,-1,0,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0
1,50000,1,1,2,37,0,0,0,0,0,...,19394,19619,20024,2500,1815,657,1000,1000,800,0
2,500000,1,1,2,29,0,0,0,0,0,...,542653,483003,473944,55000,40000,38000,20239,13750,13770,0
3,100000,2,2,2,23,0,-1,-1,0,0,...,221,-159,567,380,601,0,581,1687,1542,0
4,140000,2,3,1,28,0,0,2,0,0,...,12211,11793,3719,3329,0,432,1000,1000,1000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,100000,1,2,1,29,0,0,0,0,-1,...,-2618,95748,101299,3320,5000,0,100000,7186,0,0
997,200000,2,2,1,28,0,0,0,0,0,...,97041,103541,3632,5000,2000,89000,6500,91,1504,0
998,90000,2,2,1,40,-1,-1,-1,-1,-1,...,657,1332,780,0,2806,2256,2274,780,0,0
999,360000,1,1,2,36,1,-2,-2,-2,-2,...,0,0,0,0,0,0,0,0,0,1


In [3]:
data.columns

Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',
       'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default payment next month'],
      dtype='object')

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 24 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   LIMIT_BAL                   1001 non-null   int64
 1   SEX                         1001 non-null   int64
 2   EDUCATION                   1001 non-null   int64
 3   MARRIAGE                    1001 non-null   int64
 4   AGE                         1001 non-null   int64
 5   PAY_0                       1001 non-null   int64
 6   PAY_2                       1001 non-null   int64
 7   PAY_3                       1001 non-null   int64
 8   PAY_4                       1001 non-null   int64
 9   PAY_5                       1001 non-null   int64
 10  PAY_6                       1001 non-null   int64
 11  BILL_AMT1                   1001 non-null   int64
 12  BILL_AMT2                   1001 non-null   int64
 13  BILL_AMT3                   1001 non-null   int64
 14  BILL_AMT

In [5]:
data.isnull().sum()

LIMIT_BAL                     0
SEX                           0
EDUCATION                     0
MARRIAGE                      0
AGE                           0
PAY_0                         0
PAY_2                         0
PAY_3                         0
PAY_4                         0
PAY_5                         0
PAY_6                         0
BILL_AMT1                     0
BILL_AMT2                     0
BILL_AMT3                     0
BILL_AMT4                     0
BILL_AMT5                     0
BILL_AMT6                     0
PAY_AMT1                      0
PAY_AMT2                      0
PAY_AMT3                      0
PAY_AMT4                      0
PAY_AMT5                      0
PAY_AMT6                      0
default payment next month    0
dtype: int64

In [6]:
data.describe()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
count,1001.0,1001.0,1001.0,1001.0,1001.0,1001.0,1001.0,1001.0,1001.0,1001.0,...,1001.0,1001.0,1001.0,1001.0,1001.0,1001.0,1001.0,1001.0,1001.0,1001.0
mean,167532.467532,1.589411,1.776224,1.604396,34.945055,-0.004995,-0.161838,-0.164835,-0.283716,-0.283716,...,40748.408591,39078.666334,38012.011988,5382.33966,5051.400599,4176.14985,4671.488511,5331.04995,5090.704296,0.213786
std,130587.92132,0.492187,0.750916,0.532298,9.21976,1.173446,1.228732,1.262459,1.184662,1.170224,...,68206.92951,63108.238729,63074.415024,12180.755275,15626.153184,10514.647502,13269.943983,16812.536877,23658.888052,0.410183
min,10000.0,1.0,1.0,0.0,21.0,-2.0,-2.0,-2.0,-2.0,-2.0,...,-3684.0,-28335.0,-339603.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,50000.0,1.0,1.0,1.0,28.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,1423.0,1206.0,830.0,1000.0,390.0,228.0,148.0,189.0,0.0,0.0
50%,140000.0,2.0,2.0,2.0,33.0,0.0,0.0,0.0,0.0,0.0,...,17710.0,17580.0,15846.0,2184.0,1710.0,1206.0,1398.0,1306.0,1250.0,0.0
75%,240000.0,2.0,2.0,2.0,41.0,0.0,0.0,0.0,0.0,0.0,...,48851.0,46404.0,46557.0,5090.0,4500.0,3720.0,4000.0,3745.0,3784.0,0.0
max,700000.0,2.0,6.0,3.0,75.0,8.0,7.0,7.0,7.0,7.0,...,628699.0,484612.0,473944.0,199646.0,285138.0,133657.0,188840.0,195599.0,528666.0,1.0
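Because the target is coded 0/1, its mean in the summary is the fraction of card holders who defaulted. The following is a minimal sketch, using only the count and mean printed in the table, of how that mean translates into class counts. The variable names are illustrative and not part of the notebook.

```python
n_rows = 1001            # "count" row of data.describe()
default_rate = 0.213786  # "mean" of "default payment next month"

# The mean of a 0/1 column times the row count gives the number of ones.
n_default = round(n_rows * default_rate)
n_no_default = n_rows - n_default

print(f"defaults: {n_default}, non-defaults: {n_no_default}")
print(f"share of defaults: {n_default / n_rows:.1%}")
```

Roughly one card holder in five defaults, so the classes are imbalanced and accuracy alone can be a misleading metric later on.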


In [7]:
pip install pandas-profiling

Collecting pandas-profiling
  Downloading pandas_profiling-3.6.6-py2.py3-none-any.whl (324 kB)
Collecting ydata-profiling
  Downloading ydata_profiling-4.5.1-py2.py3-none-any.whl (357 kB)
Collecting multimethod<2,>=1.4
  Downloading multimethod-1.9.1-py3-none-any.whl (10 kB)
Collecting dacite>=1.8
  Downloading dacite-1.8.1-py3-none-any.whl (14 kB)
Collecting phik<0.13,>=0.11.1
  Downloading phik-0.12.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (679 kB)
Collecting visions[type_image_path]==0.7.5
  Downloading visions-0.7.5-py3-none-any.whl (102 kB)

In [8]:
from pandas_profiling import ProfileReport


In [9]:
profile=ProfileReport(data,title="Profile Report of Data ")

In [10]:
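## profile.to_widgets() shows the full EDA report as interactive widgets

## To render the same report inline in a Jupyter notebook:
## (pandas_profiling is deprecated; newer releases ship as ydata_profiling)
from pandas_profiling import ProfileReport

profile = ProfileReport(data, title="Profile Report of data")
profile.to_notebook_iframe()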


In [11]:
data.columns

Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',
       'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default payment next month'],
      dtype='object')

In [12]:
X=data.drop(labels=["default payment next month"],axis=1)

In [13]:
y=data["default payment next month"]

In [14]:
X

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
0,50000,1,2,1,57,-1,0,-1,0,0,...,35835,20940,19146,19131,2000,36681,10000,9000,689,679
1,50000,1,1,2,37,0,0,0,0,0,...,57608,19394,19619,20024,2500,1815,657,1000,1000,800
2,500000,1,1,2,29,0,0,0,0,0,...,445007,542653,483003,473944,55000,40000,38000,20239,13750,13770
3,100000,2,2,2,23,0,-1,-1,0,0,...,601,221,-159,567,380,601,0,581,1687,1542
4,140000,2,3,1,28,0,0,2,0,0,...,12108,12211,11793,3719,3329,0,432,1000,1000,1000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,100000,1,2,1,29,0,0,0,0,-1,...,67782,-2618,95748,101299,3320,5000,0,100000,7186,0
997,200000,2,2,1,28,0,0,0,0,0,...,8441,97041,103541,3632,5000,2000,89000,6500,91,1504
998,90000,2,2,1,40,-1,-1,-1,-1,-1,...,1114,657,1332,780,0,2806,2256,2274,780,0
999,360000,1,1,2,36,1,-2,-2,-2,-2,...,0,0,0,0,0,0,0,0,0,0


In [15]:
y

0       0
1       0
2       0
3       0
4       0
       ..
996     0
997     0
998     0
999     1
1000    1
Name: default payment next month, Length: 1001, dtype: int64

In [16]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=20)
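With imbalanced classes, a plain random split can leave the train and test sets with slightly different default rates. One common option is the `stratify` argument of `train_test_split`, which keeps the class proportions aligned. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` are hypothetical, and the 214 positives are an assumption derived from the target mean in `data.describe()`), not the notebook's actual split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1001 rows, 214 of them positive.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1001, 3))
y_demo = np.array([1] * 214 + [0] * 787)

# stratify=y_demo keeps the positive share nearly identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=20, stratify=y_demo
)

print(f"positive share, train: {y_tr.mean():.3f}")
print(f"positive share, test:  {y_te.mean():.3f}")
```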

In [17]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()

In [18]:
# Fit the scaler on the training data only, then reuse its statistics for the test data
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled=scaler.transform(X_test)
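The cell above calls `fit_transform` on the training data and only `transform` on the test data, so the test set is standardized with the training mean and standard deviation and no information from it leaks into preprocessing. A small sketch with made-up numbers (the arrays and the name `demo_scaler` are illustrative only) shows what that means:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up feature matrices, for illustration only.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean=20, std≈8.165
test_scaled = demo_scaler.transform(test)        # reuses the training statistics

# The same transformation written out by hand.
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(test_scaled, manual)
```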

In [19]:
X_test_scaled,X_train_scaled

(array([[ 0.26645872, -1.22474487, -1.06447243, ..., -0.35566332,
         -0.31969348, -0.19618447],
        [ 0.89072748, -1.22474487, -1.06447243, ..., -0.06014669,
          0.57228329, -0.02901509],
        [-0.90404522,  0.81649658,  1.66494406, ..., -0.13335509,
         -0.25498182, -0.16149403],
        ...,
        [-0.90404522, -1.22474487, -1.06447243, ..., -0.1946815 ,
         -0.32613887, -0.16033768],
        [-0.27977645, -1.22474487,  1.66494406, ...,  0.05828993,
         -0.32613887, -0.20003897],
        [ 1.20286187, -1.22474487, -1.06447243, ..., -0.2100131 ,
         -0.32613887, -0.08440416]]),
 array([[-0.66994443,  0.81649658,  0.30023581, ..., -0.22710784,
         -0.32613887, -0.1229491 ],
        [ 0.26645872,  0.81649658, -1.06447243, ..., -0.05646711,
          0.32052657, -0.03044125],
        [-0.82601162, -1.22474487, -1.06447243, ..., -0.2483421 ,
         -0.20367656, -0.1229491 ],
        ...,
        [-0.74797803,  0.81649658,  0.30023581, ..., -

In [20]:
from sklearn.naive_bayes import GaussianNB

In [21]:
clf=GaussianNB()

In [22]:
clf.fit(X_train_scaled,y_train)

In [23]:
y_pred=clf.predict(X_test_scaled)

In [24]:
y_pred

array([0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1,
       0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0])

In [25]:
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

In [26]:
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[146  54]
 [ 20  31]]
0.7051792828685259
              precision    recall  f1-score   support

           0       0.88      0.73      0.80       200
           1       0.36      0.61      0.46        51

    accuracy                           0.71       251
   macro avg       0.62      0.67      0.63       251
weighted avg       0.77      0.71      0.73       251
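The figures in the report can be reproduced by hand from the confusion matrix, which makes it clearer what each one measures. A minimal sketch, using only the four counts printed above (scikit-learn puts true labels in rows and predictions in columns); the variable names are illustrative:

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
tn, fp = 146, 54
fn, tp = 20, 31

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_1 = tp / (tp + fp)   # of predicted defaults, how many were real
recall_1 = tp / (tp + fn)      # of real defaults, how many were caught
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.4f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
```

The model catches 31 of the 51 actual defaults, but 54 of its 85 default predictions are false alarms, which is why precision for class 1 is low.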



In [27]:
param_grid={"var_smoothing":[0.1, 0.001, 0.5,0.05,0.01,1e-8,1e-7,1e-6,1e-10,1e-11]}

In [28]:
from sklearn.model_selection import GridSearchCV

In [29]:
clfs=GridSearchCV(clf,param_grid,cv=5,verbose=3)

In [30]:
clfs.fit(X_train_scaled,y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END .................var_smoothing=0.1;, score=0.800 total time=   0.0s
[CV 2/5] END .................var_smoothing=0.1;, score=0.720 total time=   0.0s
[CV 3/5] END .................var_smoothing=0.1;, score=0.707 total time=   0.0s
[CV 4/5] END .................var_smoothing=0.1;, score=0.800 total time=   0.0s
[CV 5/5] END .................var_smoothing=0.1;, score=0.547 total time=   0.0s
[CV 1/5] END ...............var_smoothing=0.001;, score=0.793 total time=   0.0s
[CV 2/5] END ...............var_smoothing=0.001;, score=0.713 total time=   0.0s
[CV 3/5] END ...............var_smoothing=0.001;, score=0.633 total time=   0.0s
[CV 4/5] END ...............var_smoothing=0.001;, score=0.667 total time=   0.0s
[CV 5/5] END ...............var_smoothing=0.001;, score=0.447 total time=   0.0s
[CV 1/5] END .................var_smoothing=0.5;, score=0.793 total time=   0.0s
[CV 2/5] END .................var_smoothing=0.5;

In [31]:
clfs.best_params_

{'var_smoothing': 0.5}
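When no `scoring` argument is given, `GridSearchCV` ranks candidates by the estimator's default score, which is accuracy for classifiers. On imbalanced data it may be worth selecting by a different metric. The sketch below is one possible variation, run on synthetic stand-in data from `make_classification` rather than the credit card data; the names `demo_grid` and `demo_search` and the choice of `scoring="f1"` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in data (about 20% positives); not the real dataset.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Same kind of search as above, but candidates are ranked by F1 on class 1.
demo_grid = {"var_smoothing": [0.5, 0.1, 0.01, 1e-8]}
demo_search = GridSearchCV(GaussianNB(), demo_grid, cv=5, scoring="f1")
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_, round(demo_search.best_score_, 3))
```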

In [32]:
from sklearn.naive_bayes import GaussianNB
clf=GaussianNB(var_smoothing=0.5)
clf.fit(X_train_scaled,y_train)
y_pred1=clf.predict(X_test_scaled)
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred1)

0.8047808764940239

In [35]:
from sklearn.ensemble import RandomForestClassifier
rs1=RandomForestClassifier(criterion='gini',n_estimators= 1200,random_state=0)
rs1.fit(X_train_scaled,y_train)
y_pred3=rs1.predict(X_test_scaled)
accuracy_score(y_test,y_pred3)*100

83.26693227091634

In [42]:
pip install Xgboost

Collecting Xgboost
  Downloading xgboost-1.7.6-py3-none-manylinux2014_x86_64.whl (200.3 MB)
Installing collected packages: Xgboost
Successfully installed Xgboost-1.7.6
Note: you may need to restart the kernel to use updated packages.


In [43]:
from xgboost import XGBClassifier
xgb=XGBClassifier()
xgb.fit(X_train_scaled,y_train)
y_pred4=xgb.predict(X_test_scaled)
accuracy_score(y_test,y_pred4)*100

80.47808764940238
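To put these accuracies in context, it helps to compare them with a trivial baseline that always predicts "no default". The sketch below uses only numbers already printed in this notebook (the test-set support of 200 and 51, and the reported accuracies); the dictionary and variable names are illustrative.

```python
# Test-set class counts from the classification report's support column.
n_no_default, n_default = 200, 51
baseline = n_no_default / (n_no_default + n_default)  # always predict "no default"

# Accuracies as reported in the cells above (as fractions).
reported = {
    "GaussianNB (default)": 0.7051792828685259,
    "GaussianNB (var_smoothing=0.5)": 0.8047808764940239,
    "RandomForestClassifier": 0.8326693227091634,
    "XGBClassifier": 0.8047808764940238,
}

print(f"majority-class baseline: {baseline:.4f}")
for name, acc in reported.items():
    print(f"{name}: {acc:.4f} ({acc - baseline:+.4f} vs baseline)")
```

The tuned GaussianNB and XGBClassifier get 202 of 251 test cases right, against 200 for the baseline, so accuracy alone says little about how well they detect defaults. Precision and recall for class 1 are more informative here.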