# Credit Card Fraud Detection

```{article-info}
:avatar: https://avatars.githubusercontent.com/u/25820201?v=4
:avatar-link: https://github.com/PhotonicGluon/
:author: "[Ryan Kan](https://github.com/PhotonicGluon/)"
:date: "Jul 1, 2024"
:read-time: "{sub-ref}`wordcount-minutes` min read"
```

*This notebook is largely inspired by the Keras code example [Imbalanced classification: credit card fraud detection](https://keras.io/examples/structured_data/imbalanced_classification/) by [fchollet](https://twitter.com/fchollet).*

<center>
    <img alt="Credit Cards" style="width: 75%" src="https://storage.googleapis.com/kaggle-datasets-images/310/684/3503c6c827ca269cc00ffa66f2a9c207/dataset-cover.jpg">
</center>

It is essential that credit card companies can detect fraudulent transactions so that customers are not charged for items they did not buy. This example uses the [Kaggle Credit Card Fraud Detection](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) dataset to demonstrate how to train a classification model on data with highly imbalanced classes.

## Preparing the Data

### Loading the Data

The dataset we will be using is the [Kaggle Credit Card Fraud Detection](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) dataset. To access it, you will need a Kaggle account.

```{button-link} https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
:color: primary
:shadow:

Download Data
```

The dataset contains credit card transactions made by European cardholders over two days in September 2013. Of the 284,807 transactions, 492 are fraudulent, so the dataset is highly imbalanced: fraudulent transactions account for only 0.172% of all transactions. Despite this class imbalance, we will try to create a model that detects fraud.
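
To make the imbalance concrete, the short sketch below recomputes the fraud rate from the two counts quoted above, along with the accuracy of a trivial baseline that labels every transaction as legitimate. The variable names are illustrative and are not part of the original notebook.

```python
# Counts quoted in the dataset description above
n_total = 284_807
n_fraud = 492

# Fraction of all transactions that are fraudulent
fraud_rate = n_fraud / n_total

# Accuracy of a hypothetical "model" that always predicts class 0 (not fraud)
baseline_accuracy = 1 - fraud_rate

print(f"Fraud rate:        {fraud_rate:.3%}")
print(f"Baseline accuracy: {baseline_accuracy:.3%}")
```

Such a baseline is right more than 99.8% of the time while catching no fraud at all, which is why accuracy alone is a poor measure of success on this dataset.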

The dataset contains only numerical input variables, most of which are the result of a [Principal Component Analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis) transformation. The original features are not provided; what we are given are the principal components obtained with PCA. The only columns that were not transformed are `Time` and `Amount`.
- `Time` is the number of seconds elapsed between each transaction and the first transaction in the dataset.
- `Amount` is the transaction amount.

Our aim is to predict the `Class` label, where `1` reflects a fraudulent transaction and `0` otherwise.

The dataset is saved as `credit-card-fraud.csv` in the `data` folder. We will first vectorize the data, that is, parse the CSV file into NumPy arrays of features and targets.

In [None]:
import numpy as np

FILE_NAME = "data/credit-card-fraud.csv"

In [None]:
all_features = []
all_targets = []

with open(FILE_NAME) as f:
    for i, line in enumerate(f):
        # The first line is the header: print it, then skip it
        if i == 0:
            print("HEADER:", line.strip())
            continue
        
        # Get the fields of that row
        fields = line.strip().split(",")
        all_features.append([float(v.replace('"', "")) for v in fields[:-1]])
        all_targets.append([int(fields[-1].replace('"', ""))])
        
        # Print the first data row as an example of what features we have
        if i == 1:
            print("EXAMPLE FEATURES:", all_features[-1])

features = np.array(all_features, dtype="float32")
targets = np.array(all_targets, dtype="uint8")
print("Shape of features:", features.shape)
print("Shape of targets: ", targets.shape)
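
The loop above splits each line on commas and strips quote characters by hand. As a possible alternative, here is a minimal sketch of the same parsing using Python's standard `csv` module, which handles quoted fields itself. The helper name `parse_rows` and the two sample rows are hypothetical: the values are made up for illustration and use fewer columns than the real file.

```python
import csv
import io


def parse_rows(lines):
    """Split CSV lines into a header, feature rows, and integer class labels."""
    reader = csv.reader(lines)  # csv.reader removes the surrounding quotes
    header = next(reader)       # the first row is the header
    features, targets = [], []
    for row in reader:
        if not row:  # skip blank lines
            continue
        # Every column except the last is a feature; the last is the class
        features.append([float(v) for v in row[:-1]])
        targets.append([int(row[-1])])
    return header, features, targets


# Hypothetical sample in the same layout as the dataset (made-up values)
sample = io.StringIO(
    '"Time","V1","Amount","Class"\n'
    '0,-1.5,149.62,"0"\n'
    '1,1.25,2.5,"1"\n'
)
header, features, targets = parse_rows(sample)
print("HEADER:", header)
print("FEATURES:", features)
print("TARGETS:", targets)
```

To use this on the real file, one would pass an open file object instead of the sample (the `csv` documentation recommends opening it with `newline=""`), then convert the resulting lists with `np.array` as in the cell above.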

### Preprocessing the Data

TODO: ADD