## Credit Card Fraud Detection Prototype

### Introduction
This notebook develops a prototype credit card fraud detection model. The goal is a solution that identifies fraudulent transactions in real time, using machine learning techniques and cloud infrastructure.

### Dataset Exploration
We begin by analyzing the [Credit Card Fraud Detection dataset](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) from Kaggle, which contains historical credit card transactions. We perform exploratory data analysis (EDA) to examine the distribution of features, the class imbalance, and potential correlations.

### Data Preprocessing
After exploring the dataset, we preprocess it by checking for missing values and under-sampling the majority class; scaling numerical features is an optional further step. All features are numeric (`Time`, `Amount`, and the anonymized components `V1` to `V28`), so no categorical encoding is required. We then split the dataset into training and testing sets so that model performance is measured on data the model has not seen.
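
Scaling is not applied in the cells further down. If it were added, one possible sketch is shown here, assuming `StandardScaler` from scikit-learn and a toy frame with made-up amounts (the `demo_*` names are hypothetical). The scaler is fit on the training split only so that no test-set statistics leak into training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy amounts and labels (hypothetical values, for illustration only).
demo_df = pd.DataFrame({
    "Amount": [5.77, 15.9, 52.37, 1.0, 160.0, 30.0, 21.0, 6.03],
    "Class": [0, 0, 0, 1, 0, 0, 1, 0],
})

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_df[["Amount"]], demo_df["Class"],
    test_size=0.25, stratify=demo_df["Class"], random_state=0,
)

# Fit on the training split only, then reuse the fitted scaler on the test split.
scaler = StandardScaler()
demo_X_train_scaled = scaler.fit_transform(demo_X_train)
demo_X_test_scaled = scaler.transform(demo_X_test)

print(demo_X_train_scaled.mean(), demo_X_train_scaled.std())
```

Tree-based models such as Random Forest split on thresholds and are largely insensitive to feature scale, so this step matters mainly if a scale-sensitive model is tried later.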

### Model Development
We employ a Random Forest classifier to detect fraudulent transactions because it captures nonlinear relationships and copes reasonably well with imbalanced data. The model is trained on the preprocessed training data, using the dataset's 30 feature columns: `Time`, `Amount`, and the anonymized components `V1` to `V28`.
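
As an alternative to under-sampling, scikit-learn's `RandomForestClassifier` accepts `class_weight="balanced"`, which weights each class inversely to its frequency. The sketch below is not what this notebook does; it uses synthetic data from `make_classification` as a stand-in for the fraud data, and all `demo_*` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 5% positives, standing in for the fraud data.
demo_X, demo_y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
demo_X_tr, demo_X_te, demo_y_tr, demo_y_te = train_test_split(
    demo_X, demo_y, test_size=0.25, stratify=demo_y, random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency.
demo_clf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
demo_clf.fit(demo_X_tr, demo_y_tr)

# Recall on the minority class is the metric most affected by imbalance.
demo_recall = recall_score(demo_y_te, demo_clf.predict(demo_X_te))
print("minority-class recall:", demo_recall)
```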

### Model Evaluation
The performance of the trained model is evaluated using the test dataset. We assess key metrics such as precision, recall, and F1-score to measure the model's ability to correctly classify fraudulent transactions while minimizing false positives.

### Deployment on Cloud Platform
To enable real-time fraud detection, we plan to deploy the trained model on Google Cloud using managed, serverless services: BigQuery ML for in-warehouse modelling and AI Platform endpoints for online prediction. This approach is intended to offer scalability, cost-effectiveness, and integration with existing systems, so that fraud alerts can be raised as transactions arrive and acted on promptly.

Additionally, we intend to use Google Data Studio (now Looker Studio) for dashboards and analysis of fraud trends. Combining these analytics with real-time detection should help financial institutions spot fraudulent patterns and adjust their processes accordingly.

### Future Enhancements
In the future, we aim to enhance the model's performance by utilizing larger and more diverse datasets. Additionally, we plan to implement advanced techniques such as anomaly detection and deep learning to further improve the accuracy of fraud detection.

### Conclusion
In summary, this notebook presents a prototype for credit card fraud detection that combines machine learning with a planned cloud deployment for real-time scoring and analytics. The aim is a solution that safeguards financial transactions and gives organizations useful insight for combating fraud.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
# loading the dataset to a Pandas DataFrame
credit_card_data = pd.read_csv('/content/creditcard.csv')

In [None]:
# dataset information
credit_card_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13954 entries, 0 to 13953
Data columns (total 31 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Time    13954 non-null  int64  
 1   V1      13954 non-null  float64
 2   V2      13954 non-null  float64
 3   V3      13954 non-null  float64
 4   V4      13954 non-null  float64
 5   V5      13954 non-null  float64
 6   V6      13954 non-null  float64
 7   V7      13954 non-null  float64
 8   V8      13954 non-null  float64
 9   V9      13954 non-null  float64
 10  V10     13954 non-null  float64
 11  V11     13954 non-null  float64
 12  V12     13954 non-null  float64
 13  V13     13954 non-null  float64
 14  V14     13954 non-null  float64
 15  V15     13954 non-null  float64
 16  V16     13954 non-null  float64
 17  V17     13954 non-null  float64
 18  V18     13954 non-null  float64
 19  V19     13954 non-null  float64
 20  V20     13954 non-null  float64
 21  V21     13954 non-null  float64
 22

In [None]:
# checking the number of missing values in each column
credit_card_data.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    1
Class     1
dtype: int64
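
The counts above show one missing value in both `Amount` and `Class`, which would be consistent with the CSV having been cut off mid-row (an assumption; the source does not say why). A minimal sketch of dropping incomplete rows before modelling, using a toy frame with made-up values in place of `credit_card_data`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for credit_card_data; the values are hypothetical.
toy = pd.DataFrame({
    "Time": [0, 1, 2],
    "V1": [-1.36, 1.19, -1.35],
    "Amount": [149.62, 2.69, np.nan],  # last row is incomplete
    "Class": [0.0, 0.0, np.nan],
})

# Keep only rows where both the amount and the label are present.
clean = toy.dropna(subset=["Amount", "Class"]).copy()

# With no NaN left, the label can be stored as an integer.
clean["Class"] = clean["Class"].astype(int)

print(clean.shape)
print(clean.isnull().sum().sum())
```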

In [None]:
# distribution of legit transactions & fraudulent transactions
credit_card_data['Class'].value_counts()

Class
0.0    13897
1.0       56
Name: count, dtype: int64

**This dataset is highly unbalanced**

**0** --> Normal transaction

**1** --> Fraudulent transaction

In [None]:
# separating the data for analysis
legit = credit_card_data[credit_card_data.Class == 0]
fraud = credit_card_data[credit_card_data.Class == 1]

In [None]:
print(legit.shape)
print(fraud.shape)

(13897, 31)
(56, 31)


In [None]:
# statistical measures of the data
legit.Amount.describe()

count    13897.000000
mean        63.773909
std        177.164503
min          0.000000
25%          5.770000
50%         15.900000
75%         52.370000
max       7712.430000
Name: Amount, dtype: float64

In [None]:
fraud.Amount.describe()

count      56.000000
mean       90.815893
std       310.308450
min         0.000000
25%         1.000000
50%         1.000000
75%         1.025000
max      1809.680000
Name: Amount, dtype: float64

In [None]:
# compare the values for both transactions
credit_card_data.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0.0,10111.481255,-0.217093,0.266252,0.887227,0.276789,-0.10751,0.136745,-0.131462,-0.020385,1.00546,...,0.020758,-0.068686,-0.15823,-0.034834,0.013065,0.117928,0.034349,0.009565,0.002329,63.773909
1.0,12262.107143,-4.727948,4.660436,-9.328536,6.783464,-2.890388,-2.03036,-6.578337,1.028374,-2.981468,...,0.453283,0.116896,0.048132,-0.225526,-0.452514,-0.124184,0.333325,0.793484,-0.046177,90.815893


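**Under-Sampling**

Build a sample dataset with a more even distribution of normal and fraudulent transactions.

**Number of fraudulent transactions** --> 492 in the full Kaggle dataset. The copy loaded here is truncated and contains only 56, so sampling 492 legitimate rows leaves a ratio of roughly 9:1 rather than a balanced set.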

In [None]:
legit_sample = legit.sample(n=492)

Concatenating two DataFrames

In [None]:
new_dataset = pd.concat([legit_sample, fraud], axis=0)

In [None]:
new_dataset.head()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
8422,11287,-0.489945,0.449771,1.706803,-0.400513,0.103308,-0.182073,0.177444,0.064068,1.278853,...,-0.124929,0.004741,-0.214947,0.022329,-0.211501,1.011616,0.00581,0.093065,6.03,0.0
3538,3023,-0.521762,0.626184,-0.599239,-3.271827,2.656585,2.747355,0.584885,0.567661,0.887218,...,-0.021664,0.228997,-0.319731,1.005759,0.285072,-0.819992,0.346156,0.039702,1.0,0.0
6248,7322,1.375224,-0.50124,0.622072,-0.237328,-1.074882,-0.773451,-0.656449,-0.228672,0.900953,...,-0.170137,-0.194478,-0.026502,0.34825,0.517657,-0.282831,-0.017593,0.007945,21.0,0.0
1614,1253,-1.679299,-0.509727,0.140798,-1.945697,-0.117735,-1.770334,0.632436,0.253202,0.129805,...,0.326806,0.383192,0.228425,0.575899,-0.060168,-0.178033,0.195578,-0.005537,160.0,0.0
1668,1295,1.316366,-0.529736,0.106098,-0.615562,-0.998812,-1.05875,-0.284262,-0.163252,-1.190051,...,-0.376226,-0.730324,0.054252,0.537787,0.219676,0.981338,-0.089602,-0.005121,30.0,0.0


In [None]:
new_dataset.tail()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
11880,20451,-15.819179,8.775997,-22.804686,11.864868,-9.092361,-2.386893,-16.560368,0.948349,-6.310658,...,-2.350634,1.036362,1.136051,-1.043414,-0.108923,0.657437,2.136424,-1.411945,1.0,1.0
12070,20931,-16.367923,9.223692,-23.270631,11.844777,-9.462037,-2.450444,-16.925152,1.384208,-6.287736,...,-2.343674,1.004602,1.188212,-1.047184,-0.035573,0.6649,2.122796,-1.416741,1.0,1.0
12108,21046,-16.917468,9.6699,-23.736443,11.82499,-9.830548,-2.514829,-17.290657,1.820408,-6.264903,...,-2.336111,0.972755,1.241866,-1.051086,0.038009,0.672317,2.108471,-1.421243,1.0,1.0
12261,21419,-17.46771,10.114816,-24.202142,11.805469,-10.198046,-2.579938,-17.656788,2.256902,-6.242149,...,-2.328024,0.94083,1.296817,-1.055104,0.111792,0.679695,2.093541,-1.425491,1.0,1.0
12369,21662,-18.018561,10.5586,-24.667741,11.78618,-10.564657,-2.645681,-18.023468,2.693655,-6.219464,...,-2.319479,0.908839,1.352904,-1.059222,0.185751,0.687037,2.078081,-1.429517,1.0,1.0


In [None]:
new_dataset['Class'].value_counts()

Class
0.0    492
1.0     56
Name: count, dtype: int64
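
The sample above still holds about nine legitimate rows for every fraudulent one. If a 1:1 under-sample is the intent, one option is to size the legitimate sample from the fraud count instead of hard-coding it. The sketch below uses a synthetic frame with the same class counts as this notebook's data; the `demo_*` names and the `random_state` value are assumptions, not part of the original cells.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the same class counts as the loaded data.
rng = np.random.default_rng(0)
demo_data = pd.DataFrame({
    "Amount": rng.exponential(scale=60.0, size=13953),
    "Class": np.r_[np.zeros(13897), np.ones(56)],
})

demo_legit = demo_data[demo_data["Class"] == 0]
demo_fraud = demo_data[demo_data["Class"] == 1]

# Draw as many legitimate rows as there are fraud rows;
# random_state makes the draw reproducible.
demo_legit_sample = demo_legit.sample(n=len(demo_fraud), random_state=42)

demo_balanced = pd.concat([demo_legit_sample, demo_fraud], axis=0)
print(demo_balanced["Class"].value_counts())
```

Applying this to the real data would change every downstream shape and score, so the outputs shown in the rest of the notebook correspond to the 492/56 sample, not to a balanced one.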

In [None]:
new_dataset.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0.0,9542.380081,-0.283828,0.290392,0.87144,0.183697,-0.133379,0.090199,-0.027253,0.018222,1.023106,...,0.049552,-0.106494,-0.160579,-0.004095,0.006531,0.108683,0.030812,-0.004218,-0.020588,66.589614
1.0,12262.107143,-4.727948,4.660436,-9.328536,6.783464,-2.890388,-2.03036,-6.578337,1.028374,-2.981468,...,0.453283,0.116896,0.048132,-0.225526,-0.452514,-0.124184,0.333325,0.793484,-0.046177,90.815893


Splitting the data into Features & Targets

In [None]:
# features: every column except the label; target: the label column
X = new_dataset.drop(columns='Class')
Y = new_dataset['Class']

In [None]:
print(X)

In [None]:
print(Y)


**Split the data into Training data & Testing Data**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

In [None]:
print(X.shape, X_train.shape, X_test.shape)

(548, 30) (438, 30) (110, 30)
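
Passing `stratify=Y` keeps the fraud share roughly equal in the training and test splits. The sketch below reproduces the split on a label vector with the same class counts (492 legitimate, 56 fraudulent); the placeholder feature column and the `demo_*` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as new_dataset: 492 legit, 56 fraud.
demo_labels = np.r_[np.zeros(492, dtype=int), np.ones(56, dtype=int)]
demo_features = np.arange(len(demo_labels)).reshape(-1, 1)  # placeholder column

f_train, f_test, l_train, l_test = train_test_split(
    demo_features, demo_labels, test_size=0.2, stratify=demo_labels, random_state=2
)

# Fraud share in each split; both should be close to 56 / 548.
print(len(l_train), l_train.mean())
print(len(l_test), l_test.mean())
```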


**Model Training**

**Random Forest model**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Initialize Random Forest Classifier
rf_clf = RandomForestClassifier()

# Define hyperparameters grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Perform grid search cross-validation
grid_search = GridSearchCV(rf_clf, param_grid, cv=5, scoring='f1', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Get the best model
best_rf_model = grid_search.best_estimator_
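
After a grid search it is usually worth checking which hyperparameters won and what cross-validated score they reached. The sketch below shows this with `best_params_` and `best_score_` on synthetic data; the data, the reduced grid, and the `demo_*` names are assumptions chosen so the example runs quickly, and they do not reproduce the search above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for X_train / y_train.
demo_X, demo_y = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
)

# A deliberately small grid so the sketch finishes in a few seconds.
demo_grid = {"n_estimators": [25, 50], "max_depth": [None, 5]}

demo_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    demo_grid,
    cv=3,
    scoring="f1",
)
demo_search.fit(demo_X, demo_y)

# Winning hyperparameters and their mean cross-validated F1 score.
print(demo_search.best_params_)
print(round(demo_search.best_score_, 3))
```

Fixing `random_state` on the classifier, as done here, makes repeated searches return the same winner; the original cell leaves it unset, so its results can vary between runs.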




**Model Evaluation**

**Classification Report**

In [None]:
# Evaluate the model
y_pred = best_rf_model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99        99
         1.0       1.00      0.82      0.90        11

    accuracy                           0.98       110
   macro avg       0.99      0.91      0.95       110
weighted avg       0.98      0.98      0.98       110
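
To make the report concrete, the fraud-class metrics can be recomputed by hand. The confusion counts below are inferred from the rounded figures in the report (support of 11, recall 0.82, precision 1.00), so treat them as an assumption rather than values printed by the notebook.

```python
# Counts for the fraud class, inferred from the rounded report:
# 9 of 11 frauds caught, no legitimate transaction flagged.
tp, fn, fp, tn = 9, 2, 0, 99

precision = tp / (tp + fp)                            # flagged rows that were fraud
recall = tp / (tp + fn)                               # frauds that were flagged
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# prints: precision=1.00 recall=0.82 f1=0.90 accuracy=0.98
```

Under these counts the model misses 2 of the 11 frauds in the test set, which the 0.98 accuracy alone would hide.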

