## Credit Card Fraud Detection

Credit card fraud detection involves identifying unauthorized and potentially fraudulent transactions within a dataset of credit card transactions. Given the rapid increase in online transactions and digital payments, detecting fraudulent activities has become crucial for financial institutions to prevent financial losses and protect customers.

#### Key Challenges:
1. **Imbalanced Data**: Fraudulent transactions typically make up a very small fraction of the total transactions, often less than 1%. This imbalance makes it challenging for models to accurately detect fraud, as they may be biased towards predicting the majority class (non-fraudulent transactions).
2. **Evolving Fraud Techniques**: Fraudsters continuously develop new techniques to evade detection, requiring models to be adaptable and continuously updated.
3. **High False Positive Rate**: Detecting fraud is a high-stakes problem where false positives (legitimate transactions flagged as fraud) can inconvenience customers and damage trust.

#### Approach:
1. **Data Preprocessing**:
   - **Data Cleaning**: Handling missing values, duplicates, and erroneous data.
   - **Feature Engineering**: Creating new features that might help in distinguishing between fraudulent and non-fraudulent transactions.
   
2. **Initial Model Training**:
   - Train machine learning models on the unbalanced dataset to establish a baseline performance.
   - Common models include Logistic Regression, Decision Trees, Random Forests, and Gradient Boosting Machines.

3. **Evaluation Metrics**:
   - Use appropriate metrics such as Precision, Recall, F1-Score, and AUC-ROC to evaluate model performance, considering the imbalance in the dataset.

4. **Addressing Imbalance**:
   - If initial models do not perform well, implement techniques to balance the dataset.
   - Techniques include oversampling the minority class (fraudulent transactions), undersampling the majority class (non-fraudulent transactions), or using algorithms designed to handle imbalanced data.

5. **Model Improvement and Deployment**:
   - Continuously monitor model performance and update the model with new data to adapt to evolving fraud patterns.
   - Implement the model in a real-time system to flag suspicious transactions for further investigation.
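
The balancing options in step 4 can be sketched as follows. This is a minimal illustration on synthetic data, not the method used later in this notebook: the toy frame `df`, the column `feature`, and the 1,000/20 class split are assumptions made for the example. It shows random oversampling of the minority class with `sklearn.utils.resample`.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Synthetic, imbalanced toy data (assumption: 1,000 valid rows, 20 fraud rows)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature": rng.normal(size=1020),
    "Class": [0] * 1000 + [1] * 20,
})

majority = df[df["Class"] == 0]
minority = df[df["Class"] == 1]

# Random oversampling: draw minority rows with replacement until classes match
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
balanced = pd.concat([majority, minority_upsampled])

print(balanced["Class"].value_counts())
```

Resampling of this kind is normally applied to the training split only, so that the test set keeps the real class ratio.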


In [14]:
# import the necessary packages
import numpy as np
import pandas as pd
import plotly.graph_objs as go
import plotly.figure_factory as ff

First, we import the necessary packages. We use:
- **NumPy** for numerical operations,
- **Pandas** for data manipulation and analysis,
- **Plotly** for creating interactive visualizations.

Next, we mount Google Drive, read the dataset from its CSV file into a DataFrame, and look at the first few rows using the `head()` function. This gives us a quick overview of the data structure and the types of values we are working with.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

# Read the data from the CSV file
data = pd.read_csv('/content/drive/MyDrive/Pic/creditcard.csv')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [21]:
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


Next, we check the dimensions of our dataset using the `shape` attribute. This tells us the number of rows and columns in the dataset, providing a sense of its size and structure.

In [7]:
data.shape

(284807, 31)

Then, we use the `describe()` function to generate summary statistics for the dataset. This includes measures such as mean, standard deviation, and quartiles, helping us understand the distribution and central tendencies of the numerical features.

In [9]:
data.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,1.168375e-15,3.416908e-16,-1.379537e-15,2.074095e-15,9.604066e-16,1.487313e-15,-5.556467e-16,1.213481e-16,-2.406331e-15,...,1.654067e-16,-3.568593e-16,2.578648e-16,4.473266e-15,5.340915e-16,1.683437e-15,-3.660091e-16,-1.22739e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0
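
The data cleaning step listed in the approach (missing values and duplicates) can be checked with two pandas calls. The sketch below runs on a small made-up frame called `toy`, which is an assumption for illustration; on the real dataset the same calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the transactions table (values are made up)
toy = pd.DataFrame({
    "Time": [0.0, 0.0, 1.0, 0.0],
    "Amount": [149.62, 2.69, np.nan, 2.69],
    "Class": [0, 0, 0, 0],
})

missing_per_column = toy.isnull().sum()       # NaN count for each column
n_duplicates = int(toy.duplicated().sum())    # rows identical to an earlier row

print(missing_per_column)
print("Duplicate rows:", n_duplicates)
```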


Next, we determine the number of fraud cases in our dataset. We filter the data into two subsets: fraudulent transactions and valid transactions. Then, we calculate the ratio of fraudulent to valid transactions (`outlierFraction`). Finally, we print this ratio along with the number of fraud cases and valid transactions in the dataset.

In [11]:
# Determine number of fraud cases in dataset
fraud = data[data['Class'] == 1]
valid = data[data['Class'] == 0]
# Ratio of fraud cases to valid transactions (not the share of all transactions)
outlierFraction = len(fraud) / len(valid)
print(outlierFraction)
print('Fraud Cases: {}'.format(len(fraud)))
print('Valid Transactions: {}'.format(len(valid)))

0.0017304750013189597
Fraud Cases: 492
Valid Transactions: 284315


Given that only 0.17% of all transactions are fraudulent, the data is highly imbalanced. For now, we apply our models to the unbalanced dataset to establish a baseline. If the models do not perform satisfactorily, we will then explore methods to balance the dataset.
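
The printed value of `outlierFraction` is the ratio of fraud cases to valid transactions, which differs slightly from the share of fraud among all transactions. The short calculation below uses only the counts printed above to show both figures; the variable names are chosen for this example.

```python
# Counts taken from the notebook output above
n_fraud = 492
n_valid = 284315
n_total = n_fraud + n_valid          # 284807 rows in total

ratio_to_valid = n_fraud / n_valid   # what `outlierFraction` computes
share_of_total = n_fraud / n_total   # fraction of all transactions

print(f"fraud / valid: {ratio_to_valid:.6f}")
print(f"fraud / total: {share_of_total:.6f}")
```

Both values round to about 0.17%, so the distinction does not change the conclusion here, but it matters when the classes are less skewed.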

Following that, we print the details of the amounts involved in fraudulent transactions. This is achieved by using the `describe()` function specifically on the 'Amount' column of the fraudulent transactions subset. It provides summary statistics such as mean, standard deviation, minimum, maximum, and quartiles, offering insights into the distribution of transaction amounts for fraudulent cases.

In [13]:
print("Details of the amounts involved in fraudulent transactions")
fraud.Amount.describe()

Details of the amounts involved in fraudulent transactions


count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64
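
To put the fraud amounts in context, the same summary can be produced for both classes side by side. The following is a small sketch on made-up amounts (the frame `toy` is an assumption, not data from `creditcard.csv`); on the real dataset the equivalent call would be `data.groupby('Class')['Amount'].describe()`.

```python
import pandas as pd

# Made-up amounts for illustration only; not taken from creditcard.csv
toy = pd.DataFrame({
    "Amount": [10.0, 20.0, 30.0, 40.0, 1.0, 99.0],
    "Class":  [0, 0, 0, 0, 1, 1],
})

# One row of summary statistics per class (0 = valid, 1 = fraud)
amount_by_class = toy.groupby("Class")["Amount"].describe()
print(amount_by_class)
```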

In this step, we visualize the correlation matrix as a heatmap using Plotly. Here's a breakdown of the process:

1. **Calculate the correlation matrix**: We compute the correlation matrix of the dataset using the `corr()` function. This matrix quantifies the linear relationships between pairs of variables in the dataset.

2. **Create the heatmap**: Using Plotly's `Heatmap` object, we specify the correlation matrix values (`z`), row and column names (`x` and `y`), and the color scale to represent the correlation coefficients. Here, the 'Viridis' colorscale is used, with the color range fixed to [-1, 1] through `zmin` and `zmax`.

3. **Set up the layout**: We customize the layout settings for the heatmap, including the title, angle for tick labels on the x and y axes, as well as the dimensions of the plot.

4. **Create the figure**: We use Plotly's `Figure` object to combine our heatmap data and layout settings.

5. **Show the figure**: Finally, we display the heatmap figure using Plotly's `show()` function, allowing for an interactive exploration of the correlation matrix heatmap. This visualization aids in identifying patterns and relationships between variables in the dataset.

In [15]:
# Calculate the correlation matrix
corrmat = data.corr()

# Create a heatmap
heatmap = go.Heatmap(
    z=corrmat.values,
    x=corrmat.columns,
    y=corrmat.columns,
    colorscale='Viridis',
    zmin=-1,
    zmax=1
)

# Set up the layout
layout = go.Layout(
    title='Correlation Matrix Heatmap',
    xaxis=dict(tickangle=-45),
    yaxis=dict(tickangle=0),
    width=800,
    height=800
)

# Create the figure
fig = go.Figure(data=[heatmap], layout=layout)

# Show the figure
fig.show()
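
A heatmap with 31 columns can be hard to read at a glance, so it may help to rank features by the strength of their correlation with the target. The sketch below does this on synthetic data; the columns `V_signal` and `V_noise` are hypothetical and exist only for this example. On the real dataset the last two lines would be applied to the `corrmat` computed above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
label = rng.integers(0, 2, size=n)

# Synthetic features (assumption): V_signal tracks the label, V_noise does not
toy = pd.DataFrame({
    "V_signal": label + rng.normal(scale=0.3, size=n),
    "V_noise": rng.normal(size=n),
    "Class": label,
})

corrmat = toy.corr()
# Absolute correlation of every feature with the target, strongest first
ranking = corrmat["Class"].drop("Class").abs().sort_values(ascending=False)
print(ranking)
```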

In this step, we split the dataset into feature variables (X) and the target variable (Y). Here's what we do:

1. **Divide the dataset**: We separate the dataset into two parts: features (X) and the target variable (Y). The features are obtained by dropping the 'Class' column from the dataset, while the target variable 'Class' is assigned to Y.

2. **Print the dimensions**: We print the shapes of the feature matrix (X) and the target vector (Y) to verify the splitting process. This helps ensure that the dimensions of X and Y match our expectations.

3. **Convert to numpy arrays**: To facilitate further processing, we convert the Pandas objects (the DataFrame `X` and the Series `Y`) into numpy arrays using the `values` attribute, which returns the underlying array without row or column labels.

These steps prepare our data for subsequent modeling and analysis.

In [16]:
# dividing the X and the Y from the dataset
X = data.drop(['Class'], axis = 1)
Y = data["Class"]
print(X.shape)
print(Y.shape)
# getting just the values for the sake of processing
# (each is a numpy array with no column labels)
xData = X.values
yData = Y.values

(284807, 30)
(284807,)
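
The split above yields a DataFrame for the features and a Series for the target. As a small aside, the pandas documentation recommends `to_numpy()` over `.values` for the conversion; the sketch below, on a three-row toy frame invented for this example, shows that both give the same array.

```python
import numpy as np
import pandas as pd

# Three made-up rows, for illustration only
toy = pd.DataFrame({
    "V1": [0.1, 0.2, 0.3],
    "Amount": [5.0, 7.5, 1.0],
    "Class": [0, 1, 0],
})

X = toy.drop(columns=["Class"])   # DataFrame of features
Y = toy["Class"]                  # Series holding the target

x_arr = X.to_numpy()              # 2-D array of feature values
y_arr = Y.to_numpy()              # 1-D array of labels

print(type(X).__name__, type(Y).__name__)
print(x_arr.shape, y_arr.shape)
```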


Next, we use Scikit-learn to split our data into training and testing sets:

1. **Import train_test_split**: We import the `train_test_split` function from the `model_selection` module of Scikit-learn. This function is used to split datasets into random train and test subsets.

2. **Split the data**: We split our feature variables (`xData`) and target variable (`yData`) into training and testing sets. The `test_size` parameter specifies the proportion of the dataset to include in the test split (in this case, 20%), and `random_state` ensures reproducibility by fixing the random seed.

3. **Resulting sets**: After splitting, we have four sets: `xTrain` (training features), `xTest` (testing features), `yTrain` (training target), and `yTest` (testing target). These sets are used for training and evaluating machine learning models.

In [17]:
# Using Scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
xTrain, xTest, yTrain, yTest = train_test_split(xData, yData, test_size = 0.2, random_state = 42)
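
With so few fraud cases, a purely random split can leave the train and test sets with slightly different fraud rates. One common option, shown here as a sketch and not as what the notebook does, is to pass `stratify` to `train_test_split` so both splits keep the same class proportions. The synthetic labels below (1,000 rows, 2% positives) are an assumption for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (assumption): 1,000 rows, 2% positives
y = np.array([1] * 20 + [0] * 980)
X = np.arange(len(y)).reshape(-1, 1)   # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```

In the notebook's call this would amount to adding `stratify=yData`, which would also change the reported scores, since the test set would differ.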

Next, we build a Random Forest classifier using Scikit-learn:

1. **Import RandomForestClassifier**: We import the `RandomForestClassifier` class from the `ensemble` module of Scikit-learn. This class implements a random forest classifier, a popular ensemble learning method.

2. **Model creation**: We create an instance of the `RandomForestClassifier` class, named `rfc`.

3. **Model training**: We train the random forest classifier (`rfc`) using the `fit()` method. The training data consists of the features (`xTrain`) and their corresponding labels (`yTrain`).

4. **Make predictions**: We use the trained classifier to make predictions on the test data (`xTest`) using the `predict()` method. The predicted labels are stored in `yPred`.

This gives us a trained Random Forest classifier and its predictions on the test set, which we evaluate next.

In [18]:
# Building the Random Forest Classifier (RANDOM FOREST)
from sklearn.ensemble import RandomForestClassifier
# random forest model creation
rfc = RandomForestClassifier()
rfc.fit(xTrain, yTrain)
# predictions
yPred = rfc.predict(xTest)
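
The classifier above uses default settings, so its results can vary slightly from run to run. The sketch below shows a few options that are often considered for imbalanced problems: a fixed `random_state`, `class_weight="balanced"`, and probability scores from `predict_proba` in place of hard labels. It runs on synthetic data from `make_classification`; the sample size, class weights, and the 0.5 cut-off are assumptions for illustration, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): roughly 5% positives
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = RandomForestClassifier(
    n_estimators=50,          # fewer trees than the default to keep the sketch fast
    class_weight="balanced",  # weight classes inversely to their frequency
    random_state=0,           # fixed seed so repeated runs build the same forest
)
clf.fit(X_tr, y_tr)

fraud_prob = clf.predict_proba(X_te)[:, 1]   # estimated P(class == 1) per row
y_pred = (fraud_prob >= 0.5).astype(int)     # cut-off chosen for illustration
print(y_pred.sum(), "rows flagged out of", len(y_pred))
```

Lowering the cut-off flags more transactions, which typically raises recall at the cost of precision.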

In this section, we evaluate the performance of our classifier using several metrics:

1. **Import metrics**: We import several metrics from Scikit-learn, including `classification_report`, `accuracy_score`, `precision_score`, `recall_score`, `f1_score`, `matthews_corrcoef`, and `confusion_matrix`.

2. **Calculate scores**: We calculate and print different evaluation metrics to assess the classifier's performance:
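   - **Accuracy**: The proportion of correctly classified instances out of the total instances. With so few fraud cases, accuracy alone says little, so it should be read alongside the metrics below.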
   - **Precision**: The proportion of true positive predictions out of all positive predictions.
   - **Recall**: The proportion of true positive predictions out of all actual positive instances.
   - **F1-Score**: The harmonic mean of precision and recall, providing a balance between the two metrics.
   - **Matthews correlation coefficient (MCC)**: Measures the quality of binary classifications, considering both true and false positives and negatives.

3. **Print results**: We print out each metric along with the classifier used (Random Forest classifier).

These metrics provide insights into the classifier's ability to correctly classify fraudulent and valid transactions, aiding in the assessment of its effectiveness.
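
To see why accuracy needs company on this dataset, consider a hypothetical model that never flags fraud. The sketch below builds labels from the class counts reported earlier (492 fraud, 284,315 valid) and scores that trivial model; `y_naive` is an assumption introduced for this example, not something the notebook trains.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts reported earlier in the notebook
n_fraud, n_valid = 492, 284315

y_true = np.array([1] * n_fraud + [0] * n_valid)
y_naive = np.zeros_like(y_true)   # hypothetical model: never flags fraud

naive_acc = accuracy_score(y_true, y_naive)
naive_rec = recall_score(y_true, y_naive)

print(f"accuracy: {naive_acc:.4f}")
print(f"recall:   {naive_rec:.4f}")
```

The trivial model is right on more than 99.8% of rows while catching no fraud at all, which is why precision, recall, F1, and MCC carry most of the information here.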

In [20]:
# Evaluating the classifier
# Import the scoring functions and print each score for the classifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.metrics import confusion_matrix

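# Bookkeeping values (not printed below): fraud cases in the full dataset
# and the number of misclassified test samples
n_outliers = len(fraud)
n_errors = (yPred != yTest).sum()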
print("The model used is Random Forest classifier")

acc = accuracy_score(yTest, yPred)
print("The accuracy is {}".format(acc))

prec = precision_score(yTest, yPred)
print("The precision is {}".format(prec))

rec = recall_score(yTest, yPred)
print("The recall is {}".format(rec))

f1 = f1_score(yTest, yPred)
print("The F1-Score is {}".format(f1))

MCC = matthews_corrcoef(yTest, yPred)
print("The Matthews correlation coefficient is {}".format(MCC))

The model used is Random Forest classifier
The accuracy is 0.9995611109160493
The precision is 0.974025974025974
The recall is 0.7653061224489796
The F1-Score is 0.8571428571428571
The Matthews correlation coefficient is 0.8631826952924256
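
The scores above can be reproduced by hand from four confusion-matrix counts. The counts used below are not printed by the notebook: they are inferred as the integers that match the reported scores, assuming the 20% test split holds 56,962 rows. Treat them as a worked example of the formulas, not as confirmed output.

```python
import math

# Inferred counts (assumption, not notebook output)
tp, fp, fn = 75, 2, 23
n_test = 56962                      # assumed size of the 20% test split
tn = n_test - tp - fp - fn

accuracy = (tp + tn) / n_test
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("F1", f1), ("MCC", mcc)]:
    print(f"{name}: {value:.6f}")
```

Under these counts, the recall of about 0.77 corresponds to 23 missed fraud cases out of 98 in the test set, a detail that the 99.96% accuracy hides.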


Finally, we visualize the confusion matrix of our classifier's predictions:

1. **Calculate the confusion matrix**: We compute the confusion matrix using the `confusion_matrix` function from Scikit-learn. This matrix tabulates the true positive, false positive, true negative, and false negative predictions.

2. **Define labels and ticks**: We define labels for the classes ('Normal' and 'Fraud') and tick labels for the confusion matrix heatmap.

3. **Create the heatmap trace**: Using Plotly's `Heatmap` object, we specify the confusion matrix values (`z`), tick labels for the x and y axes (`x` and `y`), and the colorscale for visualization.

4. **Define the layout**: We customize the layout settings for the confusion matrix plot, including the title and axis labels.

5. **Create the figure**: We use Plotly's `Figure` object to combine our heatmap data and layout settings.

6. **Show the figure**: Finally, we display the confusion matrix heatmap using Plotly's `show()` function, providing an intuitive visual representation of the classifier's performance in classifying normal and fraudulent transactions.

In [23]:
# printing the confusion matrix
LABELS = ['Normal', 'Fraud']

conf_matrix = confusion_matrix(yTest, yPred)

# Define the values for x and y ticks
xticks = ['Predicted ' + label for label in LABELS]
yticks = ['True ' + label for label in LABELS]

# Create the heatmap trace
heatmap = go.Heatmap(z=conf_matrix,
                     x=xticks,
                     y=yticks,
                     colorscale='Viridis',  # You can choose any other colorscale
                     reversescale=False,  # Change to True if needed
                     showscale=True)

# Define the layout
layout = go.Layout(title='Confusion Matrix',
                   xaxis=dict(title='Predicted class'),
                   yaxis=dict(title='True class'))

# Create the figure
fig = go.Figure(data=[heatmap], layout=layout)

# Show the figure
fig.show()
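
When the individual counts are needed as numbers and not as colors, the matrix returned by `confusion_matrix` can be unpacked directly. The sketch below uses six made-up labels (an assumption for illustration) to show the layout: rows are true classes, columns are predicted classes, so for binary labels `ravel()` yields the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration (0 = normal, 1 = fraud)
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
# Row = true class, column = predicted class, so ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```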