# Credit Fraud Imbalanced Dataset

This is an imbalanced dataset containing details of credit clients. The target column indicates whether a client has experienced fraud (1) or not (0).
It consists of the following attributes:

1   ID              
2   GENDER          
3   CAR             
4   REALITY         
5   NO_OF_CHILD      
6   INCOME          
7   INCOME_TYPE     
8   EDUCATION_TYPE  
9   FAMILY_TYPE     
10  HOUSE_TYPE     
11  FLAG_MOBIL       
12  WORK_PHONE       
13  PHONE            
14  E_MAIL          
15  FAMILY SIZE     
16  BEGIN_MONTH       
17  AGE              
18  YEARS_EMPLOYED   
19  TARGET

### Objective
The objective of this notebook is to predict whether a given customer will experience fraud, using Logistic Regression first with balanced class weights and then with SMOTE oversampling combined with random undersampling.

### Tools Required
1. Python
2. Jupyter Notebook

### Libraries Used
1. Pandas
2. train_test_split
3. OrdinalEncoder
4. LogisticRegression
5. accuracy_score
6. sklearn
7. confusion_matrix
8. precision_recall_fscore_support
9. SMOTE
10. RandomUnderSampler
11. Pipeline
12. imblearn

In [1]:
import pandas as pd

In [2]:
# Reading the dataset
df = pd.read_csv('C:/Users/admin/Downloads/credit_dataset.csv')

In [3]:
# Having a look at the datatypes of the columns and finding null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25134 entries, 0 to 25133
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      25134 non-null  int64  
 1   ID              25134 non-null  int64  
 2   GENDER          25134 non-null  object 
 3   CAR             25134 non-null  object 
 4   REALITY         25134 non-null  object 
 5   NO_OF_CHILD     25134 non-null  int64  
 6   INCOME          25134 non-null  float64
 7   INCOME_TYPE     25134 non-null  object 
 8   EDUCATION_TYPE  25134 non-null  object 
 9   FAMILY_TYPE     25134 non-null  object 
 10  HOUSE_TYPE      25134 non-null  object 
 11  FLAG_MOBIL      25134 non-null  int64  
 12  WORK_PHONE      25134 non-null  int64  
 13  PHONE           25134 non-null  int64  
 14  E_MAIL          25134 non-null  int64  
 15  FAMILY SIZE     25134 non-null  float64
 16  BEGIN_MONTH     25134 non-null  int64  
 17  AGE             25134 non-null 

In [4]:
# Having a look at the dataset
df.head()

Unnamed: 0.1,Unnamed: 0,ID,GENDER,CAR,REALITY,NO_OF_CHILD,INCOME,INCOME_TYPE,EDUCATION_TYPE,FAMILY_TYPE,HOUSE_TYPE,FLAG_MOBIL,WORK_PHONE,PHONE,E_MAIL,FAMILY SIZE,BEGIN_MONTH,AGE,YEARS_EMPLOYED,TARGET
0,0,5008806,M,Y,Y,0,112500.0,Working,Secondary / secondary special,Married,House / apartment,1,0,0,0,2.0,29,59,3,0
1,1,5008808,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,1,0,1,1,1.0,4,52,8,0
2,2,5008809,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,1,0,1,1,1.0,26,52,8,0
3,3,5008810,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,1,0,1,1,1.0,26,52,8,0
4,4,5008811,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,1,0,1,1,1.0,38,52,8,0


In [5]:
# Verifying that this is an imbalanced dataset
df.TARGET.value_counts()

0    24712
1      422
Name: TARGET, dtype: int64
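
To put the imbalance in numbers, the counts above can be turned into a fraud rate and a class ratio. This is a small arithmetic sketch that only reuses the two counts printed by `value_counts()`; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 24712, 1: 422}

total = sum(counts.values())
fraud_rate = counts[1] / total            # share of fraud cases
imbalance_ratio = counts[0] / counts[1]   # non-fraud rows per fraud row

print(f"total rows      : {total}")
print(f"fraud rate      : {fraud_rate:.2%}")
print(f"imbalance ratio : {imbalance_ratio:.1f} to 1")
```

Fraud cases make up well under 2% of the rows, which is why plain accuracy is a weak yardstick for this problem.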

In [6]:
# Converting FAMILY SIZE from float to int for simplicity
df['FAMILY SIZE'] = df['FAMILY SIZE'].astype(int)

In [7]:
# Choosing features and Target for training and testing
X = df.copy()
y = X.pop('TARGET')
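
Note that `X` as built above still contains the row-identifier columns listed by `df.info()` (`Unnamed: 0`, `ID`). Identifiers usually carry no signal about the client, so one option is to drop them before modelling. The sketch below shows this on a tiny stand-in frame with made-up values; whether it changes the scores in this notebook has not been tested here.

```python
import pandas as pd

# Tiny stand-in frame (hypothetical values) with the same identifier columns.
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "ID": [5008806, 5008808, 5008809],
    "INCOME": [112500.0, 270000.0, 270000.0],
    "TARGET": [0, 0, 1],
})

# errors="ignore" skips any listed column that is not present in the frame.
id_cols = ["Unnamed: 0.1", "Unnamed: 0", "ID"]
features = demo.drop(columns=id_cols, errors="ignore")
target = features.pop("TARGET")

print(list(features.columns))
```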

In [8]:
# Splitting the dataset into train and test
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size = 0.25, random_state = 0)
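
With so few positive rows, a purely random split can leave the train and test sets with somewhat different fraud rates. `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. A minimal sketch on synthetic labels (the data below is an assumption, not the credit dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (assumption): 1000 rows with 2% positives.
y_demo = np.array([1] * 20 + [0] * 980)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the share of positives equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

print("train positive rate:", y_tr.mean())
print("test positive rate :", y_te.mean())
```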

In [9]:
# Ordinal Encoding for Categorical variables
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
s = (xtrain.dtypes == 'object')
object_cols = list(s[s].index)
label_x_train = xtrain.copy()
label_x_test = xtest.copy()
label_x_train[object_cols] = ordinal_encoder.fit_transform(xtrain[object_cols])
label_x_test[object_cols] = ordinal_encoder.transform(xtest[object_cols])
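
With default settings, `OrdinalEncoder.transform` raises an error if the test split contains a category that was not seen during `fit`. Recent scikit-learn versions (0.24 and later, as far as I know) offer `handle_unknown="use_encoded_value"` to map such values to a sentinel instead. A sketch on a toy column, where the unseen value "Student" is only an illustration:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train_demo = pd.DataFrame(
    {"INCOME_TYPE": ["Working", "Commercial associate", "Working"]}
)
# "Student" does not occur in the training frame.
test_demo = pd.DataFrame({"INCOME_TYPE": ["Working", "Student"]})

# Unseen categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_demo)
encoded = encoder.transform(test_demo)

print(encoded.ravel())
```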

In [10]:
# Logistic Regression modelling
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0, class_weight='balanced')
classifier.fit(label_x_train, ytrain)

LogisticRegression(class_weight='balanced', random_state=0)
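
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * count_of_class)`. The sketch below applies that formula to the class counts of the full dataset printed earlier; the model above is fitted on the training split, so its actual weights differ slightly.

```python
# Class counts of the full dataset, from the value_counts() output earlier
# (the training split has slightly different counts).
counts = {0: 24712, 1: 422}

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

for label, w in weights.items():
    print(f"class {label}: weight {w:.3f}")
```

Each fraud row therefore counts roughly as much as several dozen non-fraud rows in the loss, which pushes the model towards catching fraud at the cost of more false alarms.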

In [11]:
# Making predictions
y_pred = classifier.predict(label_x_test)

In [12]:
# Evaluating the model
from sklearn.metrics import accuracy_score
print ("Accuracy : ", accuracy_score(ytest, y_pred))

Accuracy :  0.6871419478039466


In [13]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(ytest, y_pred)
  
print ("Confusion Matrix : \n", cm)

Confusion Matrix : 
 [[4232 1945]
 [  21   86]]
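
In scikit-learn's confusion matrix, rows are the actual classes and columns the predicted ones. The sketch below unpacks the matrix printed above and compares the model's accuracy with that of a trivial model that always predicts "no fraud"; only the four printed numbers are used.

```python
# Confusion matrix copied from the output above:
# rows = actual class, columns = predicted class.
cm = [[4232, 1945],
      [21, 86]]
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
model_accuracy = (tn + tp) / total
# Accuracy of a trivial model that always predicts class 0 ("no fraud").
baseline_accuracy = (tn + fp) / total

print(f"model accuracy   : {model_accuracy:.4f}")
print(f"always-0 accuracy: {baseline_accuracy:.4f}")
```

The trivial model scores higher on accuracy while catching no fraud at all, so accuracy alone says little here.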


In [14]:
# Precision, Recall and F-score (micro-averaged: for single-label data all three equal the accuracy)
from sklearn.metrics import precision_recall_fscore_support
precision_recall_fscore_support(ytest, y_pred, average='micro')

(0.6871419478039466, 0.6871419478039466, 0.6871419478039466, None)
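
The three micro-averaged values above coincide with the accuracy, so they say nothing specific about the fraud class. Metrics for class 1 alone can be derived from the confusion matrix printed earlier, as in this sketch (passing `average='binary'` to `precision_recall_fscore_support` should report the same quantities for the positive class):

```python
# Confusion matrix from the first model: rows = actual, columns = predicted.
cm = [[4232, 1945],
      [21, 86]]
(tn, fp), (fn, tp) = cm

precision = tp / (tp + fp)   # share of predicted frauds that are real
recall = tp / (tp + fn)      # share of real frauds that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision (fraud): {precision:.3f}")
print(f"recall (fraud)   : {recall:.3f}")
print(f"F1 (fraud)       : {f1:.3f}")
```

The model catches most fraud cases in the test split, but only a small fraction of its fraud predictions are correct.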

# SMOTE oversampling and random undersampling, as this is an imbalanced dataset

In [15]:
import imblearn

In [16]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

In [17]:
# Define the resampling pipeline: SMOTE oversampling followed by random undersampling
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)
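
For a binary problem, a float `sampling_strategy` is the desired ratio of minority to majority rows after the step. The arithmetic sketch below applies the two ratios to the test-split class counts taken from the first confusion matrix. Truncating to an integer is an assumption about how imblearn rounds, but the results agree with the row sums of the confusion matrix reported for the resampled test split (1234 and 617).

```python
# Class counts of the test split, read from the rows of the first confusion matrix.
n_majority = 4232 + 1945   # actual class 0
n_minority = 21 + 86       # actual class 1

# SMOTE(sampling_strategy=0.1): grow the minority to 10% of the majority.
minority_after_smote = int(0.1 * n_majority)

# RandomUnderSampler(sampling_strategy=0.5): shrink the majority until
# minority / majority equals 0.5.
majority_after_under = int(minority_after_smote / 0.5)

print("minority rows after SMOTE        :", minority_after_smote)
print("majority rows after undersampling:", majority_after_under)
```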

In [18]:
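X1, y1 = pipeline.fit_resample(label_x_train, ytrain)
# Caution: the test split is resampled too, so the scores below describe a rebalanced, partly synthetic test set
X1_test, y1_test = pipeline.fit_resample(label_x_test, ytest)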

In [19]:
# Logistic Regression modelling
classifier1 = LogisticRegression(random_state = 0)
classifier1.fit(X1, y1)

LogisticRegression(random_state=0)

In [20]:
# Making predictions
y_pred1 = classifier1.predict(X1_test)

In [21]:
# Evaluating the model
print ("Accuracy : ", accuracy_score(y1_test, y_pred1))

Accuracy :  0.7920043219881145


In [22]:
# Confusion Matrix
cm = confusion_matrix(y1_test, y_pred1)
  
print ("Confusion Matrix : \n", cm)

Confusion Matrix : 
 [[1039  195]
 [ 190  427]]


In [23]:
# Precision, Recall and F-score
precision_recall_fscore_support(y1_test, y_pred1, average='weighted')

(0.7924335699341652, 0.7920043219881145, 0.7922128863106175, None)