# Credit Card Fraud Investigation

## Context & Data

We are investigating credit card fraud and whether it can be detected from the available transaction data.

We have credit card transaction records as structured data, consisting of a mix of quantitative and categorical variables including transaction amount, merchant, time and location. Important categorical variables also include transaction type (online or in-store) and customer profile (regular or irregular). This structured data comes in the form of spreadsheets, which can be imported from CSV and analysed as a Pandas dataframe.

There is also unstructured data available from customer complaints and enquiries relating to credit card fraud. This is not directly usable in our initial statistical analysis, but it provides useful context for our situation. We could use ChatGPT to scan large volumes of customer complaints and pick out key words that indicate what we should look for in our analysis.

Some of the available datasets have anonymised their variable names for confidentiality. We must trust that these variables are relevant to the analysis of credit card fraud.


## Packages, Resources & GitHub
Many resources are available: Kaggle for datasets, the Python libraries Pandas and NumPy for data analysis, and scikit-learn for the machine learning models in our later investigation. There is also extensive documentation on fraud detection algorithms that we can draw on, and many GitHub repositories investigate the same scenario. We could build on the analyses in these repositories or use them to help design and implement our own models.

Our aim is to create a successful fraud detection algorithm, so we must use these resources effectively and efficiently. The Python packages will be essential for the initial data analysis and later for building the algorithm. Machine learning methods will likely be helpful: logistic regression for classifying transactions as fraudulent, and k-means clustering for grouping similar transactions.

GitHub allows us to collaborate remotely as a group, which is vital given the project's time constraints. Privacy is less of a concern because the data we will be using is publicly available. The main risk is a GitHub outage, which would prevent us from accessing our repository and working together on the project.

We will perform exploratory data analysis on a particular dataset, using Pandas and NumPy to analyse the data, with Matplotlib and Seaborn to create interesting visualisations.

The dataset that we have chosen to analyse is called [Credit Card Fraud Detection](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud/data), and was found on [Kaggle](https://www.kaggle.com).

## Exploratory Data Analysis

In [2]:
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# load the dataset using pandas (creditcard.csv, downloaded from the Kaggle page
# linked above, must be in the working directory)
df = pd.read_csv('creditcard.csv')

# quick data preview
print(df.head())

In [None]:
# shape of dataset
print(f"Rows: {df.shape[0]}.")
print(f"Columns: {df.shape[1]}.")

In [None]:
# basic info to understand data types and missing values
df.info()  # info() prints its own report and returns None, so no print() is needed

In [None]:
# look for any missing values
print(df.isnull().sum())

There are no null values in the dataset, so we do not need to clean or pre-process missing values and can start looking directly at the data.

In [None]:
# summary statistics
print(df.describe())

We can see that 'Amount' is highly variable (its standard deviation is large) and strongly right-skewed, since the mean is higher than the third quartile. The 'Class' column is heavily imbalanced, with many more non-fraudulent cases than fraudulent ones.

In [None]:
# plot fraud vs non-fraud counts
plt.figure(figsize=(6, 4))
sns.countplot(x='Class', data=df, palette='Set1')
plt.title('Fraud (1) vs Non-Fraud (0)')
plt.xlabel('Class')
plt.ylabel('Count')
plt.show()

Clearly, the classes are heavily imbalanced: there are many more cases of 'non-fraud' than 'fraud'.
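
To put a number on the difference between the two classes, the class proportions can be computed directly. The sketch below uses a small synthetic frame (`demo`, a hypothetical stand-in, not the Kaggle data) so that it runs without `creditcard.csv`; with the real data the same call would be applied to `df['Class']`.

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe: 995 non-fraud rows, 5 fraud rows.
demo = pd.DataFrame({'Class': [0] * 995 + [1] * 5})

# Share of each class as a fraction of all rows.
class_share = demo['Class'].value_counts(normalize=True)
print(class_share)

# Percentage of rows labelled as fraud.
fraud_pct = 100 * class_share.loc[1]
print(f"Fraud share: {fraud_pct:.2f}%")
```

A percentage like this is often easier to communicate than a bar chart in which the smaller bar is barely visible.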

In [None]:
# distribution of transaction amounts by class
plt.figure(figsize=(8, 6))
sns.histplot(df[df['Class'] == 0]['Amount'], bins=40, color='blue', kde=True, label='Non-Fraud', alpha=0.6)
sns.histplot(df[df['Class'] == 1]['Amount'], bins=40, color='red', kde=True, label='Fraud', alpha=0.6)
plt.title('Transaction Amount Distribution by Class')
plt.legend()
plt.show()

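Both classes are concentrated at low 'Amount' values. Fraudulent transactions appear to be spread over a wider range relative to how few of them there are, but because both histograms share a count axis the fraud bars are hard to read, so this impression should be checked against per-class summary statistics.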
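
A per-class summary table is a useful cross-check on what the histogram suggests, since it does not depend on how the bars are scaled. This is a minimal sketch on synthetic log-normal amounts (`demo` is hypothetical and its numbers say nothing about the real transactions); on the real data the same `groupby` would be applied to `df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in data: log-normal amounts, not the real transactions.
demo = pd.DataFrame({
    'Amount': np.concatenate([rng.lognormal(3.0, 1.5, 2000),
                              rng.lognormal(3.5, 1.5, 20)]),
    'Class': [0] * 2000 + [1] * 20,
})

# Count, median, mean and maximum of 'Amount' for each class.
amount_by_class = demo.groupby('Class')['Amount'].agg(['count', 'median', 'mean', 'max'])
print(amount_by_class)
```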

In [None]:
# heatmap of feature correlations
plt.figure(figsize=(12, 8))
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm_r')
plt.title('Feature Correlation Heatmap')
plt.show()

In [None]:
# heatmap of feature correlations, focusing on correlation with 'Class'
plt.figure(figsize=(12, 8))
corr_matrix = df.corr()
sns.heatmap(corr_matrix[['Class']], annot=True, cmap='coolwarm', vmin=-1, vmax=1, center=0)
plt.title('Correlation with Class')
plt.show()

Correlations with 'Class' are generally very low, so no single feature has a strong linear relationship with fraud. Any relationship with 'Class' may instead be non-linear.
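
A low correlation coefficient rules out a strong linear relationship but not a relationship in general. The synthetic example below (the variables `x` and `y` are made up for illustration and are unrelated to the dataset) shows a case where `y` depends entirely on `x`, yet the Pearson correlation is close to zero.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic example: y depends on x only through x squared, plus a little noise.
x = rng.normal(size=5000)
y = x ** 2 + rng.normal(scale=0.1, size=5000)
demo = pd.DataFrame({'x': x, 'y': y})

# Pearson correlation measures linear association only.
pearson_linear = demo['x'].corr(demo['y'])

# Correlating y with x squared recovers the dependence.
pearson_squared = (demo['x'] ** 2).corr(demo['y'])

print(f"corr(x, y)    = {pearson_linear:.3f}")
print(f"corr(x**2, y) = {pearson_squared:.3f}")
```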

In [None]:
# outlier detection for transaction amounts
plt.figure(figsize=(7, 5))
sns.boxplot(x=df['Amount'], color='lightblue')
plt.title('Transaction Amount Outliers')
plt.show()

In [None]:
# outlier detection for transaction amounts by class
plt.figure(figsize=(7, 5))
sns.boxplot(x='Class', y='Amount', data=df, palette='Set1')
plt.title('Transaction Amount Outliers by Class')
plt.show()

Most transaction amounts are very small, but there are a number of very high ones. More of these outliers come from the non-fraudulent class, which may simply reflect how many more non-fraudulent transactions there are.
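
Because the two classes differ so much in size, comparing raw outlier counts can mislead; an outlier rate per class is a fairer comparison. The sketch below counts values above the upper boxplot fence (Q3 + 1.5 × IQR, the default used by the box plots above) on synthetic data. The frame `demo` and the helper `count_upper_outliers` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical stand-in data, not the real transactions.
demo = pd.DataFrame({
    'Amount': rng.lognormal(3.0, 1.5, 3000),
    'Class': [0] * 2950 + [1] * 50,
})

def count_upper_outliers(amounts: pd.Series) -> int:
    """Count values above Q3 + 1.5 * IQR, the upper fence of a default boxplot."""
    q1, q3 = amounts.quantile([0.25, 0.75])
    return int((amounts > q3 + 1.5 * (q3 - q1)).sum())

# Outlier count per class, using each class's own quartiles.
outlier_counts = demo.groupby('Class')['Amount'].apply(count_upper_outliers)

# Divide by class size to get a rate that is comparable across classes.
outlier_rates = outlier_counts / demo['Class'].value_counts().sort_index()
print(pd.DataFrame({'outliers': outlier_counts, 'rate': outlier_rates}))
```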

In [None]:
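# pairplot for the 5 predictors most correlated with 'Class'
top_predictors = corr_matrix['Class'].abs().sort_values(ascending=False).index[1:6]  # index 0 is 'Class' itself

# sample 500 non-fraud rows to speed up plotting, but keep every fraud row:
# fraud is so rare that a plain 500-row sample could contain almost none
sample = pd.concat([df[df['Class'] == 0].sample(500, random_state=42), df[df['Class'] == 1]])
sns.pairplot(sample[list(top_predictors) + ['Class']], hue='Class', palette='cool')
plt.show()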

The pair plots show clusters of points rather than clear linear trends, which suggests that the relationships between the predictors may be non-linear. We may need to consider clustering methods to analyse these relationships further.
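
As a rough idea of what a clustering step could look like, the sketch below runs scikit-learn's `KMeans` on synthetic two-feature data. Everything here is an assumption made for illustration: the column names `feat_1` and `feat_2`, the two well-separated groups, and the choice of two clusters. It is not a result for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical two-feature data with one large and one small, well-separated group.
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(20, 2))
demo = pd.DataFrame(np.vstack([group_a, group_b]), columns=['feat_1', 'feat_2'])

# Standardise so both features contribute equally to the distance.
scaled = StandardScaler().fit_transform(demo)

# k = 2 is an illustrative choice, not a recommendation for the real data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
demo['cluster'] = kmeans.fit_predict(scaled)
print(demo['cluster'].value_counts())
```

On real transactions the groups are unlikely to separate this cleanly, so cluster assignments would need to be compared against 'Class' before drawing conclusions.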

In [None]:
# skewness of transaction amounts
print("Skewness in 'Amount':", df['Amount'].skew())

# skewness of transaction amounts by class
skew_non_fraud = df[df['Class'] == 0]['Amount'].skew()
skew_fraud = df[df['Class'] == 1]['Amount'].skew()
print(f"Skewness in 'Amount' for non-fraud: {skew_non_fraud}")
print(f"Skewness in 'Amount' for fraud: {skew_fraud}")

In [None]:
# count fraud vs non-fraud transactions
fraud_cases = df[df['Class'] == 1].shape[0]
non_fraud_cases = df[df['Class'] == 0].shape[0]

print(f"Number of fraud cases: {fraud_cases}")
print(f"Number of non-fraud cases: {non_fraud_cases}")

Again, we can see that the classes are heavily imbalanced, with many more cases of 'non-fraud' than 'fraud'. The non-fraudulent transaction values ('Amount') are much more skewed than the fraudulent ones. We may need to transform the data to analyse it more successfully.
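
One common transformation for right-skewed, non-negative values is a log transform. The sketch below applies `np.log1p` to synthetic log-normal amounts (a hypothetical stand-in for `df['Amount']`) and compares the skewness before and after. Whether this is the right transform for the real data is an open question for the later modelling work.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed amounts standing in for df['Amount'].
amount = pd.Series(rng.lognormal(3.0, 1.5, 5000), name='Amount')

# log1p computes log(1 + x), so zero-valued amounts remain valid.
log_amount = np.log1p(amount)

skew_before = amount.skew()
skew_after = log_amount.skew()
print(f"Skewness before: {skew_before:.2f}")
print(f"Skewness after log1p: {skew_after:.2f}")
```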

## Conclusion
We can see that the data is heavily imbalanced towards non-fraudulent cases, and we must implement methods to overcome this if we want to train a model to identify fraud. We will use the resources named above to learn more about such methods and models. One way of dealing with imbalanced data is explored in the next document.
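
As a rough illustration of what such a method can look like, the sketch below performs random undersampling of the majority class using Pandas only. This is one simple option chosen for illustration and is not necessarily the approach taken in the next document; the frame `demo` is a hypothetical stand-in for the real data.

```python
import pandas as pd

# Hypothetical imbalanced frame standing in for the real data.
demo = pd.DataFrame({
    'Amount': range(1000),
    'Class': [0] * 980 + [1] * 20,
})

fraud = demo[demo['Class'] == 1]
non_fraud = demo[demo['Class'] == 0]

# Keep every fraud row and draw an equally sized random subset of non-fraud rows.
non_fraud_sample = non_fraud.sample(n=len(fraud), random_state=42)

# Combine and shuffle so the two classes are interleaved.
balanced = pd.concat([fraud, non_fraud_sample]).sample(frac=1, random_state=42)
print(balanced['Class'].value_counts())
```

Undersampling discards most of the non-fraud rows, so it trades information for balance.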

This is an initial analysis of the data which we will expand on in the next project.