# Credit Card Fraud

Fraudulent transactions are rare: they appear as outliers among legitimate ones, and this rarity is one of the defining characteristics of fraud. As the authors of the book *Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection* point out:

“This makes it difficult [because of the outlier characteristic] to both detect fraud, since the fraudulent cases are covered by the nonfraudulent ones, as well as to learn from historical cases to build a powerful fraud-detection system since only few examples are available”

Hence, any fraud-detection approach has to account for the imbalanced nature of the data.

In this project, we use the credit card dataset available at [https://www.kaggle.com/mlg-ulb/creditcardfraud](https://www.kaggle.com/mlg-ulb/creditcardfraud). We work with PySpark, the Python interface for Apache Spark, because it is well suited to processing large datasets.

### Feature Technicalities:

- PCA Transformation: According to the dataset description, all features except Time and Amount went through a PCA transformation (a dimensionality reduction technique).
- Scaling: PCA is sensitive to the scale of its inputs, so features are normally scaled before it is applied. In this case, we assume the original features behind the V columns were scaled beforehand.
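
To make the two points above concrete, here is a minimal sketch of scaling followed by PCA in the two-feature case, written in plain Python. The feature names and values are hypothetical toy data, not taken from the credit card dataset, and the closed-form rotation angle only applies to two dimensions. It illustrates that the resulting components are uncorrelated.

```python
import math

def standardize(values):
    """Rescale a list of numbers to zero mean and unit (population) variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def covariance(a, b):
    """Population covariance of two equally long lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

def pca_2d(x, y):
    """Project two features onto their principal axes (2-D case only)."""
    # Angle of the first principal axis of the 2x2 covariance matrix
    theta = 0.5 * math.atan2(2 * covariance(x, y), covariance(x, x) - covariance(y, y))
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    pc1 = [cos_t * a + sin_t * b for a, b in zip(x, y)]
    pc2 = [-sin_t * a + cos_t * b for a, b in zip(x, y)]
    return pc1, pc2

# Hypothetical raw features on very different scales (toy values)
amount_like = [12.0, 250.0, 31.0, 980.0, 75.0, 410.0]
count_like = [1.0, 4.0, 2.0, 9.0, 2.0, 6.0]

# Scale first, then rotate onto the principal axes
x, y = standardize(amount_like), standardize(count_like)
pc1, pc2 = pca_2d(x, y)

# For standardized data the covariance equals the Pearson correlation
print(f"correlation before PCA: {covariance(x, y):.3f}")
print(f"covariance of the components after PCA: {covariance(pc1, pc2):.3f}")
```

Without the scaling step, the feature with the larger numeric range would dominate the first component, which is why scaling usually comes first.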


# EDA - Exploratory Data Analysis

In [None]:
# Import libraries
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Import PySpark libraries
from pyspark.sql import Window
import pyspark.sql.types as t
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, GBTClassifier, LinearSVC, NaiveBayes
from pyspark.ml.feature import StringIndexer, VectorIndexer, IndexToString, VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator

# Configure Spark Session
spark = SparkSession.builder \
    .appName("Credit Card Fraud") \
    .config("spark.master", "local") \
    .getOrCreate()
sc = spark.sparkContext


In [None]:
# Load file
df = spark.read.csv('datasets/creditcard.csv', header=True, inferSchema=True, sep=",")
# Print Schema
df.printSchema()

In [None]:
# Spark DataFrame Shape
print(f"Number of columns: {len(df.columns)}")
print(f"Number of Records: {df.count()}")

The dataset contains **284,807 records** and consists of **31 columns**. Out of these columns, **28 columns (V1 to V28)** are results of the **Principal Component Analysis (PCA)** technique. These columns are likely to be more abstract in nature since they are generated from linear combinations of the original features. Therefore, it may be difficult to interpret the individual contributions of these features towards the prediction task.

On the other hand, the remaining three columns are:

- **Time**: This column indicates the time elapsed in seconds between each transaction and the first transaction in the dataset.
- **Amount**: This column indicates the transaction amount.
- **Class**: This column indicates the fraud status of the transaction. The value 1 indicates fraud, and 0 indicates a legitimate transaction.

It is important to note that since the **PCA** technique is used, the original feature names and descriptions are not available. Therefore, feature engineering might be necessary to extract meaningful features from the given dataset.


In [None]:
# Inspecting the first 10 records
df.limit(10).toPandas()

In [None]:
#statistics
df.toPandas().describe().T

The dataset contains a total of **284,807 records**. The **Time** column gives the seconds elapsed between each transaction and the first transaction in the dataset. Its maximum value is **172,792 seconds**, which is roughly 2,880 minutes or approximately **48 hours**. In other words, the dataset spans two days: the last transaction was made about two days after the first one.
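
The unit conversion can be checked with a few lines of plain Python. This is only a quick arithmetic sketch; the variable names are illustrative.

```python
max_time_seconds = 172_792  # maximum of the Time column reported above

minutes = max_time_seconds / 60
hours = max_time_seconds / 3600
days = hours / 24

print(f"{minutes:.1f} minutes, {hours:.2f} hours, {days:.2f} days")
```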

The amount column indicates the transaction amount. The biggest transaction recorded in the dataset is worth **$25,691.16**.


## Checking Missing Values

In [None]:
#Check missing and null data
df.select([f.count(f.when(f.isnan(c) | f.col(c).isNull(), c)).alias(c) for c in df.columns]).show()

Upon checking the dataset, it was observed that there are no missing values present.
Each column was inspected for null or NaN values, and it was found that the dataset is complete!


## 1.2 Univariate Analysis

### 1.2.1 Class

#### Q1. HOW MANY ROWS IN THE DATASET REPRESENT CREDIT CARD FRAUD?

In [None]:
#Number of Frauds and non-frauds
classFreq = df.groupBy("Class").count()
classFreq.show()

In [None]:
total = classFreq.select("count").agg({"count": "sum"}).collect().pop()['sum(count)']
result = classFreq.withColumn('percent', f.format_number(classFreq['count']/total * 100, 2))
result.show()

Out of the total of **284,807 records**, only **492** represent credit card fraud, which is approximately **0.17%** of the dataset. Therefore, we have an imbalanced dataset.

In [None]:
# Create a pandas dataframe with only the Class column
fraud_df = df.select("Class").toPandas()

# Count the number of transactions for each class (fraudulent or not)
fraud_counts = fraud_df['Class'].value_counts()

# Calculate the percentage and absolute number of each class
fraud_perc = fraud_counts / fraud_counts.sum() * 100
fraud_abs = fraud_counts.values

# Create a bar plot with logarithmic y-axis
sns.barplot(x=fraud_counts.index, y=fraud_counts.values)
plt.yscale('log')

# Add percentage and absolute value labels to the bars
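for i, v in enumerate(fraud_counts):
    # Use positional indexing and two decimals so the fraud share (about 0.17%) is not rounded away
    plt.text(i, v, f"{fraud_perc.iloc[i]:.2f}% ({fraud_abs[i]:,})", ha='center', va='bottom', fontweight='bold')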

# Set the y-axis labels and formatter
plt.ylabel('Number of Transactions')
plt.gca().yaxis.set_major_formatter(FuncFormatter(lambda x, _: '{:.0f} K'.format(x/1000)))

# Set the axis labels and title
plt.xlabel('Transaction Class')
plt.xticks([0, 1], ['Non-Fraud', 'Fraud'])
plt.title('Distribution of fraudulent transactions')

plt.show()

As shown above, the vast majority of transactions are legitimate. If we train predictive models directly on this dataset, they are likely to be biased toward the majority class: a model can reach very high accuracy simply by "presuming" that every transaction is legitimate. We do not want our model to presume; we want it to identify patterns that indicate fraud.
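
A short arithmetic sketch shows why accuracy alone is misleading here. It uses the class counts reported above and a hypothetical "model" that labels every transaction as legitimate; no real classifier is involved.

```python
total_records = 284_807
fraud_records = 492
legit_records = total_records - fraud_records

# Hypothetical "model" that predicts class 0 (legitimate) for every transaction
true_negatives = legit_records
false_negatives = fraud_records
true_positives = 0

accuracy = (true_positives + true_negatives) / total_records
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy: {accuracy:.4%}")
print(f"fraud recall: {recall:.0%}")
```

The accuracy is above 99.8% even though not a single fraud case is detected, so metrics such as recall or precision on the fraud class are more informative for this problem.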



### 1.2.2 Time and Amount

In [None]:
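# Create a UDF that groups the elapsed time into three buckets
@f.udf(returnType=t.StringType())
def time_udf(time):
    # Guard against nulls: comparing None with a number raises a TypeError
    if time is None:
        return "NA"
    if time < 50000:
        return "Under 50K s"
    elif time <= 100000:
        return "Between 50K and 100K s"
    elif time > 100000:
        return "Over 100K s"
    else:
        # Only reached for NaN, where every comparison is False
        return "NA"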

# apply the function to the "Time" column
df = df.withColumn('time_udf', time_udf('Time'))

df.limit(10).toPandas()

In [None]:
# Define an unpartitioned window spanning all rows, used to compute the grand total
window = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

# Create time group table
time_group_table = df.select("time_udf", "Amount"). \
    groupBy("time_udf"). \
    agg(
    f.count("Amount").alias("UserCount"),
    f.mean("Amount").alias("Amount_Avg"),
    f.min("Amount").alias("Amount_Min"),
    f.max("Amount").alias("Amount_Max")). \
    withColumn("total", f.sum("UserCount").over(window)). \
    withColumn("Percent", f.round(f.col("UserCount")*100 / f.col("total"), 2)). \
    drop("total"). \
    orderBy(f.desc("Percent"))

time_group_table.limit(10).toPandas()


In [None]:
# Convert PySpark dataframe to Pandas dataframe
time_group_df = time_group_table.toPandas()

# Create barplot
ax = sns.barplot(x="time_udf", y="Percent", data=time_group_df)

for p in ax.patches:
    abs_value = p.get_height()
    ax.annotate(f'{abs_value}',
                (p.get_x() + p.get_width() / 2., abs_value),
                ha='center', va='center',
                xytext=(0, 9),
                textcoords='offset points')

plt.title("Percentage of Transactions by Time Group")
plt.xlabel("Time Group")
plt.ylabel("Percentage")
plt.show();

In [None]:
# All Transactions
df_aux = df.select("Class", "Amount").toPandas()

# Define contiguous amount ranges; each one is filtered below as min_amount < Amount <= max_amount
# (raw strings avoid the invalid "\$" escape; -np.inf keeps zero-amount transactions in the first range)
amount_ranges = [
    {"range": r"Transaction Value <= \$100", "min_amount": -np.inf, "max_amount": 100},
    {"range": r"Transaction Value between \$100 and \$2000", "min_amount": 100, "max_amount": 2000},
    {"range": r"Transaction Value between \$2000 and \$5000", "min_amount": 2000, "max_amount": 5000},
    {"range": r"Transaction Value > \$5000", "min_amount": 5000, "max_amount": df_aux["Amount"].max()}
]

# Create four subplots for different amount ranges
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))

for idx, amount_range in enumerate(amount_ranges):
    row_idx = idx // 2
    col_idx = idx % 2

    # Filter transactions within amount range
    df_range = df_aux[(df_aux["Amount"] > amount_range["min_amount"]) & (df_aux["Amount"] <= amount_range["max_amount"])]

    # Plot histogram
    sns.histplot(data=df_range, x="Amount", ax=axes[row_idx, col_idx], hue="Class", kde=True)
    axes[row_idx, col_idx].set_title(amount_range["range"])
    axes[row_idx, col_idx].set_ylabel("Number of Transactions")
    axes[row_idx, col_idx].legend(labels=["Fraud", "Non-Fraud"])

fig.suptitle("All Transactions", fontsize=16)
plt.tight_layout()
plt.show()

In [None]:
# Only fraud transactions
only_fraud = df.filter(df.Class == 1).select("Amount").toPandas()

# Define a color palette
aux_pal = ["#ff7f7f", "#ff3c3c"]

# Create two subplots for different amount ranges
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(18, 6))

# Amount range <= 100
sns.histplot(data=only_fraud[only_fraud["Amount"] <= 100], x="Amount", ax=axes[0], kde=True, color=aux_pal[0])
axes[0].set_title("Transaction Amount <= $100")
axes[0].set_ylabel("Number of Transactions")

# Amount range > 100
sns.histplot(data=only_fraud[(only_fraud["Amount"] > 100)], x="Amount", ax=axes[1], kde=True, color=aux_pal[1])
axes[1].set_title("Transaction Amount > $100")
axes[1].set_ylabel("Number of Transactions")

fig.suptitle("Only Fraudulent Transactions", fontsize=16)
plt.tight_layout()
plt.show()

### 1.2.3 Remaining variables

In [None]:
# create several histograms to show the distributions of the features
fig, axs = plt.subplots(8, 4, figsize=(25, 15))
fig.suptitle("Distribution of Features", fontsize=14)

# Convert to pandas once instead of collecting every column separately
pdf = df.toPandas()

for col, ax in zip(pdf.columns, axs.flatten()):
    ax.hist(pdf[col])
    ax.grid(False)
    ax.tick_params(axis='x', labelrotation=45, labelsize=14)
    ax.tick_params(axis='y', labelsize=15)
    ax.set_title(col.upper(), fontsize=14)

plt.tight_layout()
plt.subplots_adjust(top=0.9, hspace=0.4)
plt.show()

The histograms show that the feature "Time" does not follow a normal distribution, with fewer transactions in the range of 75,000-100,000 seconds. The Vx features, on the other hand, appear approximately centered around 0, taking both positive and negative values.

## 1.3 Imbalanced Data

To address the class imbalance, we apply Random Under-Sampling: we remove majority-class (non-fraud) records until both classes are the same size, so that the models are not dominated by the legitimate transactions.

The following steps will be taken:

- Determine the degree of imbalance by counting the number of instances of each label in the Class column (done above with `groupBy("Class").count()`).
- Bring the number of non-fraud transactions down to the number of fraud transactions (assuming we want a 50/50 ratio), which gives 492 fraud and 492 non-fraud transactions.
- Shuffle the data so that fraud and non-fraud rows are interleaved rather than grouped in blocks, and check that our models maintain a comparable accuracy every time we run the script.

**Note**: Random Under-Sampling may result in a loss of information, since we discard most of the non-fraud instances in the dataset.
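
The steps above can be sketched in plain Python on a toy list of records. This is only an illustration of the technique: the helper name `random_undersample`, the record layout, and the seed are assumptions made for the example, and the notebook itself performs the sampling in Spark.

```python
import random

def random_undersample(rows, label_key="Class", seed=42):
    """Randomly reduce every class to the size of the smallest class, then shuffle."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Step 1: group the rows by label to measure the imbalance
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    minority_size = min(len(group) for group in by_class.values())

    # Step 2: draw a random sample of minority_size rows from each class
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, minority_size))

    # Step 3: shuffle so the classes are interleaved
    rng.shuffle(balanced)
    return balanced

# Toy data: 1000 legitimate records and 5 fraud records
rows = [{"id": i, "Class": 0} for i in range(1000)]
rows += [{"id": 1000 + i, "Class": 1} for i in range(5)]

balanced = random_undersample(rows)
counts = {label: sum(1 for r in balanced if r["Class"] == label) for label in (0, 1)}
print(counts)
```

Note that the majority class is sampled at random rather than truncated; taking only the first rows would bias the sample toward the earliest transactions.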

In [None]:
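# Select the fraud transactions and draw a random sample of non-fraud transactions of the same size.
# A plain limit() is not a random sample (it typically returns the first rows read), so shuffle first.
# The seed value is arbitrary and only makes the sample reproducible.
fraud_df = df.filter(f.col('Class') == 1)
non_fraud_df = df.filter(f.col('Class') == 0).orderBy(f.rand(seed=42)).limit(fraud_df.count())

# Combine fraud and non-fraud transactions and shuffle the data
balanced_df = fraud_df.union(non_fraud_df).orderBy(f.rand(seed=42))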

# Show 10 rows of the shuffled, balanced dataframe
balanced_df.limit(10).toPandas()

In [None]:
# Convert the Spark dataframe to a pandas dataframe
fraud_df = balanced_df.select("Class").toPandas()

# Create a countplot of the two classes
sns.countplot(x='Class', data=fraud_df, palette='Set1', order=[1, 0])

# Calculate the percentage of each class
fraud_counts = fraud_df['Class'].value_counts()
fraud_perc = fraud_counts / fraud_counts.sum() * 100

# Set the y-axis labels and formatter
plt.ylabel('Number of Transactions')

# Set the axis labels and title
plt.xlabel('Transaction Class')
plt.xticks([0, 1], ['Fraud', 'Non-Fraud'])
plt.title('Balanced Distribution of Transactions')

plt.show()

Now that we have our dataframe correctly balanced, we can go further with our analysis and data preprocessing.


## Correlations

Correlation matrices are a key tool for understanding our data. We want to know whether some features are strongly associated with a transaction being fraudulent. It is important to use the balanced dataframe here, so that we can see which features have a high positive or negative correlation with fraud.

Summary and Explanation:

- **Negative correlations**: V17, V14, V12 and V10 are negatively correlated with Class. The lower these values are, the more likely the transaction is fraudulent.
- **Positive correlations**: V2, V4, V11, and V19 are positively correlated with Class. The higher these values are, the more likely the transaction is fraudulent.
- **Boxplots**: We will use boxplots to better understand the distribution of these features in fraudulent and non-fraudulent transactions.

Note: The correlation matrix must be computed on the balanced subsample. On the original dataframe, the high class imbalance masks the correlations with Class.
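
The effect of class imbalance on the correlation with the label can be illustrated with a small synthetic example. The values below are made up for the illustration (they are not taken from the dataset): in both samples the fraud rows sit 3 units below the legitimate rows with the same spread, and only the class ratio changes.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def make_sample(n_legit, n_fraud):
    """Synthetic feature: legitimate rows around 0, fraud rows around -3, same spread."""
    feature = [(-1.0 if i % 2 else 1.0) for i in range(n_legit)]
    feature += [(-4.0 if i % 2 else -2.0) for i in range(n_fraud)]
    label = [0] * n_legit + [1] * n_fraud
    return feature, label

# Same class-conditional distributions, different class ratios
r_imbalanced = pearson(*make_sample(998, 2))   # 0.2% fraud
r_balanced = pearson(*make_sample(500, 500))   # 50% fraud

print(f"imbalanced sample: r = {r_imbalanced:.3f}")
print(f"balanced sample:   r = {r_balanced:.3f}")
```

Although the feature separates the classes equally well in both samples, the correlation with the label is much weaker when fraud is rare, which is why the heatmap on the original dataframe looks almost flat in the Class row.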


In [None]:
df = df.drop("time_udf")
balanced_df = balanced_df.drop("time_udf")

plt.figure(figsize=(20, 10))

plt.subplot(1, 2, 1)
sns.heatmap(df.toPandas().corr(), cmap='coolwarm_r')
plt.title('Correlations with Unbalanced Data')

plt.subplot(1, 2, 2)
sns.heatmap(balanced_df.toPandas().corr(), cmap='coolwarm_r')
plt.title('Correlations with Balanced Data')

plt.show()

In [None]:
# Compute the correlation matrix once and reuse it for both plots
balanced_corr = balanced_df.toPandas().corr()

plt.figure(figsize=(20, 10))

plt.subplot(1, 2, 1)
sns.heatmap(balanced_corr > 0.7, cbar=False)
plt.title('Correlations with Balanced Data > 0.7')

plt.subplot(1, 2, 2)
sns.heatmap(balanced_corr < -0.7, cbar=False)
plt.title('Correlations with Balanced Data < -0.7')

plt.show();


#### Conclusions from Correlation Matrix
Based on our correlation matrix, we have concluded the following:

- The variables produced by PCA show essentially no correlation with each other, as expected, since principal components are uncorrelated by construction. Note that this means no linear relationship, not necessarily independence.
- The variable "Time" seems to be negatively correlated with all of the "Vx" variables, which means that as "Time" increases, the values of "Vx" tend to decrease, and vice versa.
- The variables "Time" and "Class" seem to have no correlation, suggesting that there is no linear relationship between the time at which a transaction occurs and whether it is fraudulent.
- The variable "Class" seems to be negatively correlated with some of the "Vx" variables, indicating that certain values of those variables are more likely to be associated with fraudulent transactions, while it is not correlated at all with others.
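
A caveat worth keeping in mind when reading these conclusions: a correlation near zero only rules out a linear relationship. The toy sketch below uses synthetic values (not taken from the dataset) in which one variable is fully determined by the other, yet their Pearson correlation is zero.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# y depends entirely on x, but the relationship is symmetric, not linear
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson(xs, ys)
print(f"correlation between x and x squared: {r:.3f}")
```

For this reason, a feature with no visible correlation to Class may still carry useful signal for a non-linear model.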


## Distributions: Univariate

In [None]:
# select all columns except the ones to exclude
cols_to_include = [col for col in balanced_df.columns if col not in ['Time']]
df_to_plot = balanced_df.select(*cols_to_include)

# Collect the balanced sample once instead of running a Spark job per column
pdf_to_plot = df_to_plot.toPandas()

for col in pdf_to_plot.columns:
    fig, axs = plt.subplots(1, 2, figsize=(15, 5))
    # plot histogram
    axs[0].hist(pdf_to_plot[col], bins=50)
    axs[0].set_title(f"Distribution of {col} values", size=14)
    axs[0].set_xlabel(col, size=12)
    axs[0].set_ylabel("Count", size=12)

    # plot boxplot (horizontal, so there is no count axis to label)
    sns.boxplot(data=pdf_to_plot, x=col, ax=axs[1])
    axs[1].set_title(f"{col} Boxplot", size=14)
    axs[1].set_xlabel(col, size=12)

plt.show();

## Distributions: Bivariate

In [None]:
#Bivariate Analysis

data = df_to_plot.toPandas()
fig, axes = plt.subplots(ncols=4, figsize=(20,4))

# Negative Correlations with our Class (The lower our feature value the more likely it will be a fraud transaction)
sns.boxplot(x="Class", y="V14", data=data,  ax=axes[0])
axes[0].set_title('V14 vs Class Negative Correlation')

sns.boxplot(x="Class", y="V12", data=data,  ax=axes[1])
axes[1].set_title('V12 vs Class Negative Correlation')


sns.boxplot(x="Class", y="V10", data=data, ax=axes[2])
axes[2].set_title('V10 vs Class Negative Correlation')


sns.boxplot(x="Class", y="V16", data=data, ax=axes[3])
axes[3].set_title('V16 vs Class Negative Correlation')

plt.show()

In [None]:
fig, axes = plt.subplots(ncols=4, figsize=(20,4))

# Positive correlations with our Class (the higher the feature value, the more likely it is a fraud transaction)
sns.boxplot(x="Class", y="V4", data=data,  ax=axes[0])
axes[0].set_title('V4 vs Class Positive Correlation')

sns.boxplot(x="Class", y="V11", data=data,  ax=axes[1])
axes[1].set_title('V11 vs Class Positive Correlation')


sns.boxplot(x="Class", y="V2", data=data, ax=axes[2])
axes[2].set_title('V2 vs Class Positive Correlation')


sns.boxplot(x="Class", y="V19", data=data, ax=axes[3])
axes[3].set_title('V19 vs Class Positive Correlation')

plt.show()

In [None]:
hist_fraud = df_to_plot.filter(f.col('Class') == 1).drop('Class').toPandas()

hist_fraud.hist(bins=30, figsize=(10, 10))
plt.subplots_adjust(left=0.1,
                    bottom=0.1,
                    right=0.9,
                    top=0.9,
                    wspace=0.4,
                    hspace=0.8)
plt.show()