## Credit Card Fraud Detection
This is an imbalanced dataset. Resampling is a common way to address class imbalance. Although
there are many resampling techniques, here I'll be using the three most popular ones.
#### 1. Random Under Sampling
Randomly eliminates majority-class records until the majority class is balanced with the minority class.  
Disadvantage: because records are removed at random, useful data may be discarded, which can make the model predict less accurately.
#### 2. Random Over Sampling
Randomly replicates minority-class records until their count matches the majority class. With this technique there is no loss of information, but the model may overfit because it sees exact copies of the same records.
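
To make the two random techniques concrete, here is a minimal pure-Python sketch on a toy label list. The helper names (`group_by_class`, `random_under_sample`, `random_over_sample`) are hypothetical and written only for illustration; this is not how `imblearn` implements its samplers, just the same idea in a few lines.

```python
import random
from collections import Counter

def group_by_class(labels):
    """Map each class label to the list of row indices that carry it."""
    groups = {}
    for i, label in enumerate(labels):
        groups.setdefault(label, []).append(i)
    return groups

def random_under_sample(labels, rng):
    """Keep a random subset of every class, sized to the smallest class."""
    groups = group_by_class(labels)
    n_min = min(len(idx) for idx in groups.values())
    kept = []
    for idx in groups.values():
        kept.extend(rng.sample(idx, n_min))      # sampling without replacement
    return sorted(kept)

def random_over_sample(labels, rng):
    """Keep every row and add random copies until all classes match the largest one."""
    groups = group_by_class(labels)
    n_max = max(len(idx) for idx in groups.values())
    kept = []
    for idx in groups.values():
        kept.extend(idx)                                     # original rows are all kept
        kept.extend(rng.choices(idx, k=n_max - len(idx)))    # copies, drawn with replacement
    return sorted(kept)

labels = [0] * 95 + [1] * 5          # toy data: 95 majority rows, 5 minority rows
rng = random.Random(42)

under_idx = random_under_sample(labels, rng)
over_idx = random_over_sample(labels, rng)

under_counts = Counter(labels[i] for i in under_idx)
over_counts = Counter(labels[i] for i in over_idx)
print('under-sampled:', dict(under_counts), '| distinct rows kept:', len(set(under_idx)))
print('over-sampled :', dict(over_counts), '| distinct rows kept:', len(set(over_idx)))
```

Under-sampling keeps only 10 of the 100 distinct rows (information is lost), while over-sampling keeps all 100 but repeats the minority rows many times (risk of overfitting).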
#### 3. Synthetic Minority Oversampling Technique (SMOTE)
Increases the minority class by introducing synthetic examples: each new point is interpolated between a minority-class sample and one of its k (default = 5) nearest minority-class neighbours, with similarity measured in feature space (Euclidean distance).
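
The interpolation step can be sketched in plain Python. This is only a rough illustration of the idea, assuming a tiny made-up set of 2-D minority points; `smote_sketch` is a hypothetical helper, not the `imblearn` implementation.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating towards nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base_i = rng.randrange(len(minority))
        base = minority[base_i]
        # k nearest minority neighbours of the base point (Euclidean distance), excluding itself
        others = [p for j, p in enumerate(minority) if j != base_i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()                       # random position on the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# toy minority class (made-up values)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7), (1.4, 1.0)]
new_points = smote_sketch(minority, n_new=10, k=5, rng=random.Random(42))

for point in new_points[:3]:
    print(point)
```

Every synthetic point lies on a line segment between two real minority points, so the new samples stay inside the region the minority class already occupies instead of being exact copies.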

### Outline:
1. Loading Libraries and Data
2. Summarizing and Visualizing Data
3. Preparing Data:  
   i) Data Cleaning: removing duplicates, marking missing values and even imputing missing values  
   ii) Feature Selection: redundant features may be removed and new features developed  
   iii) Data Transform: attributes are scaled or redistributed in order to best expose the structure of the problem to the learning algorithms
4. Why Accuracy Is Not a Good Performance Metric When Dealing with Imbalanced Datasets
5. Right Way of Resampling
6. Random Under Sampling
7. Random Over Sampling
8. SMOTE Sampling
9. SMOTE with Random Forest Classifier
10. SMOTE: Confusion Matrix, ROC AUC Curve, Classification Report

#### References:
1. Mastering Machine Learning - Manohar Swaminathan  
2. Machine Learning Mastery - Jason Brownlee  
3. Right way of sampling - Nick Becker blog

#### 1. Loading Libraries and Data

In [None]:
#Loading Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import RobustScaler

from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.metrics import accuracy_score,precision_score,recall_score,classification_report,confusion_matrix,roc_curve,auc

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE

import warnings
warnings.filterwarnings("ignore")

In [None]:
# loading data
df = pd.read_csv('../input/creditcard.csv')


#### 2. Summarizing and Visualizing Data

In [None]:
## Analyse the data
#  Descriptive statistics
df.shape       # shape - gives the total number of rows and columns
               # the dataset has 284807 rows and 31 columns

In [None]:
df.head()            # head() - returns the first 5 rows
                     # columns: 'Time', 'Amount', 'Class' and 28 variables (for security reasons the actual names are hidden and shown as V1, V2, ...)
                     # from the table, identify the features (inputs) and the label (output)
                     # 'Class' is our label/output
                         # Class = 0 -- No Fraud
                         # Class = 1 -- Fraud
                     # all remaining columns are used as features (inputs)
                     # check for categorical values; if there are any, convert them to a numerical format,
                     # as most ML algorithms work only with numerical data

In [None]:
# checking datatypes   
df.info()           # all the features are of float datatype except the 'Class' which is of int type

In [None]:
# check for missing values in each feature and label
df.isnull().sum()       # number of missing values in each column
                        # 0 means the column has no missing values
                        # here there are no missing values

In [None]:
# statistical summary of the data 
# mean,standard deviation,count etc.,
df.describe()

#### Data Visualization

In [None]:
# Check the Class distribution of 'Output Class'
# to identify whether our data is 'Balanced' or 'Imbalanced'

print(df['Class'].value_counts() )      # 0 - NonFraud Class
                                        # 1 - Fraud Class

# to get in percentage use 'normalize = True'
print('\nNoFrauds = 0 | Frauds = 1\n')
print(df['Class'].value_counts(normalize = True)*100)

In [None]:
# visualizing through a bar graph
df['Class'].value_counts().plot(kind = 'bar', title = 'Class Distribution\nNoFrauds = 0 | Frauds = 1'); # the trailing semicolon (;) suppresses the '<matplotlib.axes._subplots.AxesSubplot at 0x...>' text in the output

#### 3. Preparing Data

In [None]:
# No missing data (checked above); duplicates are not checked in this notebook
# No feature selection, as the feature names are hidden for security reasons
#
# As the data is PCA-transformed, we assume that the variables V1 - V28 are already scaled
# we scale the remaining 'Time' and 'Amount' features


# visualizing through density plots using seaborn (already imported as sns)
fig, (ax1, ax2,ax3) = plt.subplots(ncols=3, figsize=(20, 5))

ax1.set_title('Variables V1-V28\nAssumed Scaled')      # plotting only a few variables
sns.kdeplot(df['V1'], ax=ax1)                          # kde - kernel density estimate
sns.kdeplot(df['V2'], ax=ax1)
sns.kdeplot(df['V3'], ax=ax1)
sns.kdeplot(df['V25'], ax=ax1)
sns.kdeplot(df['V28'], ax=ax1)

ax2.set_title('Time Before Scaling')
sns.kdeplot(df['Time'], ax=ax2)

ax3.set_title('Amount Before Scaling')            
sns.kdeplot(df['Amount'], ax=ax3)

plt.show()

In [None]:
# Scaling 'Time' and 'Amount' using RobustScaler (imported above)
# note: the scaler is fitted on the full dataset here, before the train/test split
rb = RobustScaler()
df[['Time','Amount']] = rb.fit_transform(df[['Time','Amount']])   # each column is scaled independently
df.head()
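
As a side note on what the scaling step does: with its default settings, `RobustScaler` subtracts the median and divides by the interquartile range (75th minus 25th percentile), which is why a few very large amounts do not dominate the scaling. Below is a small pure-Python sketch of that formula; the `robust_scale` helper and the toy amounts are made up for illustration.

```python
import statistics

def robust_scale(values):
    """Scale values as (x - median) / IQR, using linearly interpolated quartiles."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]

amounts = [1.0, 2.0, 3.0, 4.0, 100.0]    # toy data with one extreme value
scaled = robust_scale(amounts)
print(scaled)    # → [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier stays an outlier after scaling, but it does not change the median or the quartiles, so the ordinary values keep a sensible spread.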

From the above analysis we found that our dataset is imbalanced:
Non Frauds = 99.83%
Frauds = 0.17%

Most of the transactions are non-fraud transactions. If the event to be predicted belongs to the minority class (here, the fraud cases) and the event rate is less than 5%, it is usually referred to as a 'rare event'.

So this dataset needs to be balanced.

If we apply ML algorithms to this dataset before balancing it, they will probably 'overfit' to the majority class, that is, assume that almost all transactions are non-fraud.

 Over Fitting  : good performance on TRAIN DATA, bad performance on TEST DATA/UNSEEN DATA
 Under Fitting : bad performance on both TRAIN & TEST DATA

#### 4. Why Is the Accuracy Metric Misleading When Dealing with Imbalanced Datasets?

In [None]:
# let's analyse why the accuracy is misleading (high)

x = df.drop('Class',axis = 1)
y = df['Class']

#train and test split
xTrain,xTest,yTrain,yTest = train_test_split(x,y,test_size = 0.3,random_state = 42)

# spot check algorithms
classifiers = {"Logistic Regression":LogisticRegression(),
               "DecisionTree":DecisionTreeClassifier(),
               "LDA":LinearDiscriminantAnalysis()}        
# as the dataset is large, computation time will be high,
# which is why I am using only 3 classifiers

for name,clf in classifiers.items():
    accuracy      = cross_val_score(clf,xTrain,yTrain,scoring='accuracy',cv = 5)
    accuracyTest  = cross_val_score(clf,xTest,yTest,scoring='accuracy',cv = 5)
    
    precision     = cross_val_score(clf,xTrain,yTrain,scoring='precision',cv = 5)
    precisionTest = cross_val_score(clf,xTest,yTest,scoring='precision',cv = 5)
    
    recall        = cross_val_score(clf,xTrain,yTrain,scoring='recall',cv= 5)
    recallTest    = cross_val_score(clf,xTest,yTest,scoring='recall',cv = 5)
    
    print(name,'---','Train-Accuracy :%0.2f%%'%(accuracy.mean()*100),
                     'Train-Precision: %0.2f%%'%(precision.mean()*100),
                     'Train-Recall   : %0.2f%%'%(recall.mean()*100))
    
    print(name,'---','Test-Accuracy :%0.2f%%'%(accuracyTest.mean()*100),
                     'Test-Precision: %0.2f%%'%(precisionTest.mean()*100),
                     'Test-Recall   : %0.2f%%'%(recallTest.mean()*100),'\n')


##### Conclusion:
With almost all classifiers the accuracy is around 99.9%,
but the precision and recall scores differ from it.

So the accuracy metric is misleading (very high) when working with imbalanced datasets.
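
A quick worked example shows why. Assume a toy set of 10,000 transactions with the same 0.17% fraud rate, and a "model" that always predicts non-fraud. The numbers are made up to mirror the class ratio above; they are not results from this notebook.

```python
# toy labels: 17 frauds (1) and 9,983 non-frauds (0), i.e. a 0.17% fraud rate
y_true = [1] * 17 + [0] * 9983
y_pred = [0] * len(y_true)            # always predict "non-fraud"

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # frauds caught
tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # non-frauds correctly passed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false alarms
fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # frauds missed

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0      # undefined here, reported as 0.0

print('Accuracy : %0.2f%%' % (accuracy * 100))
print('Recall   : %0.2f%%' % (recall * 100))
print('Precision: %0.2f%%' % (precision * 100))
```

The accuracy comes out at 99.83% even though not a single fraud is detected, which is exactly why precision and recall are the metrics to watch here.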


#### 5. Right Way of Resampling
 1. Split the original train data into train and validation sets
 2. Oversample or undersample only the resulting train split
 3. Fit the model on the oversampled or undersampled train data
 4. Perform prediction on the validation split, which is left unresampled
 5. Finally perform prediction on the original test data
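
The order of these steps matters because resampling before splitting lets copies of the same minority row land on both sides of the split. Here is a small pure-Python sketch of that effect, assuming made-up rows of the form `(row_id, label)`; the `oversample` and `split` helpers are hypothetical and only mimic the idea.

```python
import random

rng = random.Random(42)

# toy data: 10 minority rows (label 1) and 190 majority rows (label 0)
rows = [(i, 1) for i in range(10)] + [(i, 0) for i in range(10, 200)]

def oversample(data, rng):
    """Duplicate random minority rows until both classes have the same count."""
    minority = [r for r in data if r[1] == 1]
    majority = [r for r in data if r[1] == 0]
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return data + extra

def split(data, rng, val_fraction=0.2):
    """Shuffle and split into (train, validation)."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

# Right way: split first, then resample only the train part
train, val = split(rows, rng)
train_balanced = oversample(train, rng)
leak_right = {r[0] for r in train_balanced} & {r[0] for r in val}

# Wrong way: resample first, then split
wrong_train, wrong_val = split(oversample(rows, rng), rng)
leak_wrong = {r[0] for r in wrong_train} & {r[0] for r in wrong_val}

print('rows shared by train and validation (split, then resample):', len(leak_right))
print('rows shared by train and validation (resample, then split):', len(leak_wrong))
```

With the right order the validation rows never appear in the training data, so the validation score is an honest estimate; with the wrong order the model is partly tested on rows it has already seen.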

#### 6. Random Under Sampling

In [None]:
# 1. split the original train data into train & validation
# 2. oversample or undersample the resulting train split
# 3. fit the model on the oversampled or undersampled train data
# 4. perform prediction on the validation split (not resampled)
# 5. finally perform prediction on the original TEST data

#step 1
xTrain_rus,xTest_rus,yTrain_rus,yTest_rus = train_test_split(xTrain,yTrain,test_size = 0.2,random_state = 42)

#step 2
rus = RandomUnderSampler()
x_rus,y_rus = rus.fit_resample(xTrain_rus,yTrain_rus)   # fit_resample replaces the older fit_sample name in imbalanced-learn

#converting it to a DataFrame to visualize in pandas
df_x_rus = pd.DataFrame(x_rus)
df_x_rus['target'] = y_rus
print(df_x_rus['target'].value_counts())
df_x_rus['target'].value_counts().plot(kind = 'bar',title = 'RandomUnderSampling\nNoFrauds = 0 | Frauds = 1');



In [None]:
#step 3
lr = LogisticRegression()
lr.fit(x_rus,y_rus)

#step 4
yPred_rus = lr.predict(xTest_rus)

rus_accuracy = accuracy_score(yTest_rus,yPred_rus)
rus_classReport = classification_report(yTest_rus,yPred_rus)
#print('\nTrain-Accuracy %0.2f%%'%(rus_accuracy*100),
#      '\nTrain-ClassificationReport:\n',rus_classReport,'\n')

#step 5
yPred_actual = lr.predict(xTest)
test_accuracy = accuracy_score(yTest,yPred_actual)
test_classReport = classification_report(yTest,yPred_actual)
print('\nTest-Accuracy %0.2f%%'%(test_accuracy*100),
      '\n\nTest-ClassificationReport:\n',test_classReport)


#### 7. Random Over Sampling

In [None]:
#step 1
xTrain_ros,xTest_ros,yTrain_ros,yTest_ros = train_test_split(xTrain,yTrain,test_size=0.2,random_state=42)

#step 2
ros = RandomOverSampler()
x_ros,y_ros = ros.fit_resample(xTrain_ros,yTrain_ros)   # fit_resample replaces the older fit_sample name in imbalanced-learn

#converting it to a DataFrame to visualize in pandas
df_x_ros = pd.DataFrame(x_ros)
df_x_ros['target'] = y_ros
print(df_x_ros['target'].value_counts())
df_x_ros['target'].value_counts().plot(kind = 'bar',title = 'RandomOverSampling\nNoFrauds = 0 | Frauds = 1');


In [None]:
#step 3
lr = LogisticRegression()
lr.fit(x_ros,y_ros)

#step 4
yPred_ros = lr.predict(xTest_ros)

ros_accuracy = accuracy_score(yTest_ros,yPred_ros)
ros_classReport = classification_report(yTest_ros,yPred_ros)
# scores on the validation split held out in step 1
print('\nValidation-Accuracy %0.2f%%'%(ros_accuracy*100),
      '\nValidation-ClassificationReport:\n',ros_classReport,'\n')

#step 5
yPred_actual = lr.predict(xTest)
test_accuracy = accuracy_score(yTest,yPred_actual)
test_classReport = classification_report(yTest,yPred_actual)
print('\nTest-Accuracy %0.2f%%'%(test_accuracy*100),
      '\n\nTest-ClassificationReport:\n',test_classReport)

#### 8. SMOTE Sampling

In [None]:
#step 1
xTrain_smote,xTest_smote,yTrain_smote,yTest_smote = train_test_split(xTrain,yTrain,test_size = 0.2,random_state = 42 )

#step 2
smote = SMOTE()
x_smote,y_smote = smote.fit_resample(xTrain_smote,yTrain_smote)   # fit_resample replaces the older fit_sample name in imbalanced-learn
#converting it to a DataFrame to visualize in pandas
df_x_smote = pd.DataFrame(x_smote)
df_x_smote['target'] = y_smote
print(df_x_smote['target'].value_counts())
df_x_smote['target'].value_counts().plot(kind = 'bar',title = 'SMOTE\nNoFrauds = 0 | Frauds = 1');



In [None]:
# SMOTE with Random Forest Classifier
rfc = RandomForestClassifier(random_state = 42)
rfc.fit(x_smote,y_smote)                  # fit on the SMOTE-resampled train split
ypred_smote = rfc.predict(xTest_smote)    # predictions on the validation split

# evaluate on the original test data
rfc_prediction=rfc.predict(xTest)
print('RFC-Accuracy',accuracy_score(yTest,rfc_prediction),'\n')
print('Confusion_Matrix:\n',confusion_matrix(yTest,rfc_prediction),'\n')
print('Classification Report:\n',classification_report(yTest,rfc_prediction))

In [None]:
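#auc score
# roc_curve is meant for continuous scores, so use the predicted fraud probability
# rather than the hard 0/1 predictions (which give only one point on the curve)
rfc_proba = rfc.predict_proba(xTest)[:,1]
rfc_fpr,rfc_tpr,_ = roc_curve(yTest,rfc_proba)
rfc_auc = auc(rfc_fpr,rfc_tpr)
print('RandomForestClassifier-auc : %0.2f%%'%(rfc_auc * 100))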

#roc curve
plt.figure()
plt.plot(rfc_fpr,rfc_tpr,label ='RFC(auc = %0.2f%%)'%(rfc_auc *100))
plt.plot([0,1],[0,1],'k--')
plt.legend()
plt.title('Smote with RandomForestClassifier\nROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
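
For intuition about what the ROC curve and its AUC measure, here is a minimal pure-Python sketch with made-up labels and scores. An ROC curve is traced by sweeping a decision threshold over continuous scores; hard 0/1 predictions collapse it to a single operating point and usually a lower area. The `roc_points` and `trapezoid_auc` helpers are hypothetical and are not the scikit-learn implementation.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists from sweeping the threshold over every distinct score."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    fpr, tpr = [0.0], [0.0]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        tpr.append(tp / pos)
        fpr.append(fp / neg)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

y_true = [0, 0, 0, 0, 1, 1]                  # toy labels
scores = [0.1, 0.2, 0.6, 0.3, 0.4, 0.9]      # toy predicted fraud probabilities
hard = [1 if s >= 0.5 else 0 for s in scores]  # the same scores thresholded at 0.5

auc_scores = trapezoid_auc(*roc_points(y_true, scores))
auc_hard = trapezoid_auc(*roc_points(y_true, hard))
print('AUC from scores          : %0.3f' % auc_scores)   # → 0.875
print('AUC from hard predictions: %0.3f' % auc_hard)     # → 0.625
```

The score-based AUC of 0.875 equals the fraction of (fraud, non-fraud) pairs in which the fraud gets the higher score (7 of 8 pairs), while thresholding first throws that ranking information away.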