# Data Exploration

Execute this notebook to explore the data.

The data used in this exercise contains records of normal and fraudulent credit card payments.

You must use the data to train a machine learning model that detects whether a transaction is a normal payment or a fraudulent one.

First of all, load the data and inspect a few records:

In [None]:
import pandas as pd

df = pd.read_csv('data/creditcard.csv')
df.head()

The dataset contains the following columns:

* **Time**: The transaction time.
* **Columns V1..V28**: Values anonymized for privacy reasons. You can assume that these columns describe the details of each transaction.
* **Amount**: The amount of the transaction.
* **Class**: Whether the transaction is fraudulent (`1`) or not (`0`).

Count the number of rows (samples) of the dataset:

In [None]:
len(df)

You can take a look at descriptive statistics for each column:

In [None]:
df.describe()

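Verify whether the dataset contains null values, and check what share of the transactions belongs to each class:

In [None]:
# Total number of missing values across all columns (0 means no nulls)
print(f"Null values: {df.isnull().sum().sum()}")

# Share of each class, as a percentage of all rows
class_share = df['Class'].value_counts(normalize=True) * 100
print(f"No Frauds: {class_share[0]:.2f}% of the dataset")
print(f"Frauds: {class_share[1]:.2f}% of the dataset")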

As you can see, the vast majority of the transactions are non-fraudulent. If we train our model on this dataset as it is, the model can reach a high accuracy simply by predicting that almost every transaction is normal, and it will misclassify many of the fraudulent ones. We don't want our model to rely on class frequencies; we want it to detect the patterns that signal fraud!
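
To make the problem with class imbalance concrete, here is a minimal sketch that uses made-up labels rather than the real dataset. The 998/2 split and the "always predict non-fraud" model are assumptions chosen only for illustration. The sketch shows that such a model scores a very high accuracy while detecting no fraud at all:

```python
# Hypothetical toy labels: 998 normal payments (0) and 2 frauds (1)
labels = [0] * 998 + [1] * 2

# A "model" that ignores its input and always predicts non-fraud
predictions = [0] * len(labels)

# Accuracy: fraction of predictions that match the true label
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the fraud class: fraction of actual frauds that were flagged
true_frauds = sum(labels)
detected_frauds = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
recall = detected_frauds / true_frauds

print(f"Accuracy: {accuracy:.3f}")
print(f"Fraud recall: {recall:.3f}")
```

This is why accuracy alone can be misleading on a dataset like this one, and why it helps to look at how the model performs on the fraud class specifically.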

The preprocessing step will take care of this problem by balancing the dataset.
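
The preprocessing step itself is not part of this notebook. As a rough idea of what balancing can look like, the sketch below applies random undersampling to a small synthetic DataFrame. The toy data, the variable names, and the choice of undersampling are all assumptions for illustration; the actual preprocessing might use a different technique.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset: 95 normal rows, 5 frauds
toy = pd.DataFrame({
    "Amount": [float(i) for i in range(100)],
    "Class": [0] * 95 + [1] * 5,
})

fraud = toy[toy["Class"] == 1]

# Randomly keep only as many normal rows as there are fraud rows
normal = toy[toy["Class"] == 0].sample(n=len(fraud), random_state=42)

# Combine both classes, shuffle the rows, and rebuild the index
balanced = (
    pd.concat([fraud, normal])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced["Class"].value_counts())
```

Undersampling discards most of the majority-class rows, so it trades information for balance. Oversampling the minority class is a common alternative.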

## References

* The code used for this exercise is inspired by the following Kaggle notebook: https://www.kaggle.com/code/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets

* Data: https://www.kaggle.com/code/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets/input