# Exploratory Data Analysis for Scalable-Credit-Card-Fraud-Detection-Behavior-Analysis


This notebook focuses on understanding the structure and content of the datasets used in the project. The objective is to explore the transactions, cards, and users data to identify key variables, data quality issues, and other initial patterns that may be useful for fraud detection.

No feature engineering is performed at this stage; the only cleaning is a first pass on the transactions data in the Data Cleaning section.

The notebook follows the steps of the EDA report.


In [3]:
#Start Spark Session
from pyspark.sql import SparkSession

session = (
    SparkSession.builder
    .appName("Credit Card Fraud Detection")
    .master("local[*]")
    .config("spark.driver.memory", "6g")
    .config("spark.sql.shuffle.partitions", "8")
    .getOrCreate()
)

#### Load the Raw Datasets

In [4]:
# Load Data
transactions_df = session.read.csv("data/raw/transactions_data.csv", header=True, inferSchema=True)
cards_df = session.read.csv("data/raw/cards_data.csv", header=True, inferSchema=True) 
users_df = session.read.csv("data/raw/users_data.csv", header=True, inferSchema=True)

## Transactions Data

The transactions dataset is by far the largest, with more than 13 million records (the `use_chip` counts below sum to 13,305,915). This confirms that it is the main dataset, while cards and users provide contextual information.

In [5]:
transactions_df.columns

['id',
 'date',
 'client_id',
 'card_id',
 'amount',
 'use_chip',
 'merchant_id',
 'merchant_city',
 'merchant_state',
 'zip',
 'mcc',
 'errors']

The transactions dataset includes:
- Identifiers (id, client_id, card_id)
- Temporal information (date)
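- Transaction amount (amount)
- Payment method (use_chip)
- Merchant information (merchant_id, merchant_city, merchant_state, zip)
- mcc (Merchant Category Code, the standard four-digit code describing the merchant's type of business)
- An errors field that may indicate abnormal or failed transactions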

This dataset is the core of the fraud detection task.


### 1. Data Inspection

In [17]:
#Preview Sample Rows
transactions_df.show(10)

+-------+-------------------+---------+-------+-------+------------------+-----------+-------------+--------------+-------+----+------+
|     id|               date|client_id|card_id| amount|          use_chip|merchant_id|merchant_city|merchant_state|    zip| mcc|errors|
+-------+-------------------+---------+-------+-------+------------------+-----------+-------------+--------------+-------+----+------+
|7475327|2010-01-01 00:01:00|     1556|   2972|$-77.00| Swipe Transaction|      59935|       Beulah|            ND|58523.0|5499|  NULL|
|7475328|2010-01-01 00:02:00|      561|   4575| $14.57| Swipe Transaction|      67570|   Bettendorf|            IA|52722.0|5311|  NULL|
|7475329|2010-01-01 00:02:00|     1129|    102| $80.00| Swipe Transaction|      27092|        Vista|            CA|92084.0|4829|  NULL|
|7475331|2010-01-01 00:05:00|      430|   2860|$200.00| Swipe Transaction|      27092|  Crown Point|            IN|46307.0|4829|  NULL|
|7475332|2010-01-01 00:06:00|      848|   3915| 

#### Check Data Types

In [7]:
transactions_df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- client_id: integer (nullable = true)
 |-- card_id: integer (nullable = true)
 |-- amount: string (nullable = true)
 |-- use_chip: string (nullable = true)
 |-- merchant_id: integer (nullable = true)
 |-- merchant_city: string (nullable = true)
 |-- merchant_state: string (nullable = true)
 |-- zip: double (nullable = true)
 |-- mcc: integer (nullable = true)
 |-- errors: string (nullable = true)



All data types are consistent with the column contents, except for `amount`, which is stored as a string rather than a numeric type.
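
One way to avoid type surprises like this is to declare the schema explicitly rather than rely on `inferSchema`. The sketch below is an assumption, not the notebook's method: `TRANSACTION_COLUMNS`, `TRANSACTIONS_DDL`, and `load_transactions` are hypothetical names. It keeps `amount` as a string (it still contains the currency sign) and reads `zip` as a string, since a numeric type drops leading zeros and shows values such as `58523.0`.

```python
# Column names come from transactions_df.columns; the types are an assumed choice.
TRANSACTION_COLUMNS = [
    ("id", "INT"),
    ("date", "TIMESTAMP"),
    ("client_id", "INT"),
    ("card_id", "INT"),
    ("amount", "STRING"),   # still carries the "$" sign, cleaned later
    ("use_chip", "STRING"),
    ("merchant_id", "INT"),
    ("merchant_city", "STRING"),
    ("merchant_state", "STRING"),
    ("zip", "STRING"),      # assumed: keep as text to preserve leading zeros
    ("mcc", "INT"),
    ("errors", "STRING"),
]

# Spark accepts a DDL-formatted string as a schema.
TRANSACTIONS_DDL = ", ".join(f"{name} {dtype}" for name, dtype in TRANSACTION_COLUMNS)


def load_transactions(session, path="data/raw/transactions_data.csv"):
    """Read the transactions CSV with the explicit schema (hypothetical helper)."""
    return session.read.csv(path, header=True, schema=TRANSACTIONS_DDL)


print(TRANSACTIONS_DDL)
```

With an explicit schema Spark also skips the extra pass over the file that `inferSchema=True` needs, which can matter for a file of this size.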

In [8]:
#Transaction Amount Format Check
transactions_df.select("amount").show(10, truncate=False)

+-------+
|amount |
+-------+
|$-77.00|
|$14.57 |
|$80.00 |
|$200.00|
|$46.41 |
|$4.81  |
|$77.00 |
|$26.46 |
|$261.58|
|$10.74 |
+-------+
only showing top 10 rows


Because this column contains a currency sign, the sign must be removed and the values converted to floating-point numbers.
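
The conversion can be sketched as follows. This is a minimal illustration: `parse_amount` and `clean_amount_column` are hypothetical helper names, and the pattern assumes amounts contain only a leading `$`, optional thousands separators, and an optional minus sign, as in the sample rows.

```python
import re

# Characters to strip before casting: the currency sign and thousands separators.
# Assumption: amounts look like "$14.57" or "$-77.00", as in the preview above.
AMOUNT_NOISE = r"[$,]"


def parse_amount(raw):
    """Convert one amount string such as "$-77.00" to a float (plain Python)."""
    if raw is None:
        return None
    return float(re.sub(AMOUNT_NOISE, "", raw).strip())


def clean_amount_column(df, column="amount"):
    """Apply the same rule to a Spark DataFrame column (hypothetical helper)."""
    # Imported lazily so the plain-Python helper above works without Spark.
    from pyspark.sql import functions as F

    return df.withColumn(
        column, F.regexp_replace(F.col(column), AMOUNT_NOISE, "").cast("double")
    )


# Quick check on values taken from the sample output.
print(parse_amount("$-77.00"))
print(parse_amount("$14.57"))
```

Negative amounts such as `$-77.00` are kept as negative numbers; whether they represent refunds or reversals is not established at this stage.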

#### Check Duplicate Transactions

In [9]:
from pyspark.sql.functions import count

duplicate_id_count = (
    transactions_df
    .groupBy("id")
    .agg(count("*").alias("cnt"))
    .filter("cnt > 1")
    .count()
)

duplicate_id_count
# Number of duplicate ids

0

There are no duplicate transaction ids in the dataset.

#### Categories and Counts

In [10]:
transactions_df.groupBy("use_chip").count().show()

+------------------+-------+
|          use_chip|  count|
+------------------+-------+
|Online Transaction|1557912|
| Swipe Transaction|6967185|
|  Chip Transaction|4780818|
+------------------+-------+



The majority of transactions are **Swipe Transactions**, followed by **Chip Transactions**, while **Online Transactions** represent a smaller portion of the total volume.

This distribution is expected in real-world card usage, where most everyday purchases are made using physical cards.  
However, even though online transactions are fewer in number, they are often associated with a higher fraud risk due to the absence of physical card verification.

For this reason, the transaction type (`use_chip`) is considered an important variable for later analysis and modeling.
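
To put the distribution in relative terms, the counts above can be turned into percentages. The snippet below is a small arithmetic sketch that copies the three counts from the output; the variable names are illustrative.

```python
# Counts copied from the use_chip output above.
use_chip_counts = {
    "Swipe Transaction": 6_967_185,
    "Chip Transaction": 4_780_818,
    "Online Transaction": 1_557_912,
}

total = sum(use_chip_counts.values())

# Share of each transaction type as a percentage of all transactions.
shares = {kind: 100 * n / total for kind, n in use_chip_counts.items()}

print(f"Total transactions: {total:,}")
for kind, pct in shares.items():
    print(f"{kind}: {pct:.1f}%")
```

This gives roughly 52% swipe, 36% chip, and 12% online transactions.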


In [11]:
transactions_df.groupBy("merchant_state").count().show()

+--------------------+-------+
|      merchant_state|  count|
+--------------------+-------+
|                  NJ| 322227|
|                  IL| 467930|
|United Arab Emirates|    300|
|        South Africa|    339|
|           Indonesia|    194|
|         South Korea|   1153|
|              Israel|    941|
|           Australia|    569|
|              Brunei|      3|
|                  CA|1427087|
|                  IN| 312468|
|                  OK| 159902|
|                  KY| 170013|
|                  LA| 159719|
|                  KS|  99442|
|                  WY|   8747|
|              Tuvalu|      5|
|            Thailand|    461|
|             Romania|     50|
|    Marshall Islands|     11|
+--------------------+-------+
only showing top 20 rows


In [12]:
transactions_df.groupBy("errors").count().show()

+--------------------+--------+
|              errors|   count|
+--------------------+--------+
|    Technical Glitch|   26271|
|Bad PIN,Insuffici...|     293|
|Bad Card Number,B...|      38|
|             Bad CVV|    6106|
|Bad Card Number,B...|      33|
|Bad PIN,Technical...|      70|
|                NULL|13094522|
|         Bad Zipcode|    1126|
|Insufficient Bala...|     243|
|Bad Expiration,In...|      47|
|Bad CVV,Insuffici...|      57|
|Bad Card Number,T...|      15|
|Insufficient Balance|  130902|
|Bad Expiration,Ba...|      32|
|      Bad Expiration|    6161|
|     Bad Card Number|    7767|
|             Bad PIN|   32119|
|Bad Expiration,Te...|      21|
|Bad Card Number,I...|      71|
|Bad CVV,Technical...|       8|
+--------------------+--------+
only showing top 20 rows


The `errors` column contains information about transactions that encountered issues during processing.
Most transactions have a `NULL` value, meaning they were completed successfully without any reported problem.

However, a small subset of transactions contains specific error types, such as:
- Insufficient balance
- Bad PIN
- Bad card number
- Bad expiration date
- Technical glitches

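Among these, **insufficient balance** (130,902 transactions), **bad PIN** (32,119), and **technical glitches** (26,271) appear most frequently in the rows shown. A single transaction can also carry several comma-separated errors, as in the value displayed as `Bad PIN,Insuffici...` (truncated in the output).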
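
Because one `errors` value can hold several comma-separated labels, counting individual error types requires splitting the string first. The sketch below shows one way to do that; `split_errors` and `explode_errors` are hypothetical helpers, and the sample string is illustrative only, since the combined values in the output above are truncated.

```python
def split_errors(raw):
    """Split a comma-separated errors string into a sorted list of labels."""
    if raw is None:
        return []  # NULL means no error was reported
    return sorted(part.strip() for part in raw.split(",") if part.strip())


def explode_errors(df, column="errors"):
    """Spark version: one row per (transaction, individual error) pair."""
    # Imported lazily so split_errors can be used without a Spark install.
    from pyspark.sql import functions as F

    return (
        df.filter(F.col(column).isNotNull())
        .withColumn("error", F.explode(F.split(F.col(column), ",")))
        .withColumn("error", F.trim(F.col("error")))
    )


# Illustrative value, not copied from the dataset.
print(split_errors("Bad PIN,Insufficient Balance"))
```

A per-error frequency table could then be obtained with something like `explode_errors(transactions_df).groupBy("error").count()`.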


### 2. Handling Missing Values

In [13]:
# Count missing (NULL) values per column
from pyspark.sql.functions import col, sum as spark_sum


def count_missing_values(df):
    """Return a one-row DataFrame with the number of NULLs in each column."""
    return df.select([spark_sum(col(c).isNull().cast("int")).alias(c) for c in df.columns])


print("Missing values in transactions dataset:")
count_missing_values(transactions_df).show()

Missing values in transactions dataset:
+---+----+---------+-------+------+--------+-----------+-------------+--------------+-------+---+--------+
| id|date|client_id|card_id|amount|use_chip|merchant_id|merchant_city|merchant_state|    zip|mcc|  errors|
+---+----+---------+-------+------+--------+-----------+-------------+--------------+-------+---+--------+
|  0|   0|        0|      0|     0|       0|          0|            0|       1563726|1652706|  0|13094522|
+---+----+---------+-------+------+--------+-----------+-------------+--------------+-------+---+--------+



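The missing data is in `merchant_state`, `zip`, and `errors`. Most of the missing `merchant_state` and `zip` values come from online transactions: there are 1,563,726 missing states and 1,652,706 missing zip codes, which is consistent with the 1,557,912 online transactions counted above. These values can therefore be filled with a label that clearly marks the row as an online transaction, which preserves the semantic meaning. A missing `errors` value simply means that no error was reported.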
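
A possible way to fill the online rows is sketched below. It is an assumption, not the notebook's final rule: the `ONLINE` label is chosen to match the `merchant_city` value that the integrity check later in this notebook compares against, and both function names are hypothetical. Only `merchant_state` is filled, because `zip` was inferred as a numeric column and cannot hold a text label without a cast.

```python
ONLINE_LABEL = "ONLINE"  # assumed placeholder value


def fill_online_state(row):
    """Plain-Python rule: label the missing state of an online transaction."""
    if row["use_chip"] == "Online Transaction" and row["merchant_state"] is None:
        return {**row, "merchant_state": ONLINE_LABEL}
    return row  # physical transactions and already-filled rows are untouched


def fill_online_state_spark(df):
    """Same rule on a Spark DataFrame (hypothetical helper)."""
    # Imported lazily so the plain-Python rule works without Spark.
    from pyspark.sql import functions as F

    is_online_gap = (
        (F.col("use_chip") == "Online Transaction")
        & F.col("merchant_state").isNull()
    )
    return df.withColumn(
        "merchant_state",
        F.when(is_online_gap, F.lit(ONLINE_LABEL)).otherwise(F.col("merchant_state")),
    )


sample = {"use_chip": "Online Transaction", "merchant_state": None}
print(fill_online_state(sample))
```

Missing states on non-online rows are deliberately left as NULL here, since their cause has not been investigated yet.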

### 3. Standardizing Categorical Variables

As the previews above show, the categorical columns are already consistently formatted, so no further formatting is required at this stage.

### 4. Data Integrity Check

#### Transactions Consistency

We check whether each `merchant_id` maps consistently to the same merchant location, excluding online transactions.

In [19]:
from pyspark.sql import functions as F

# Filter non-online transactions
physical_txns = transactions_df.filter(
    F.col("use_chip") != "Online Transaction"
)

# Count distinct city/state combinations per merchant
merchant_location_check = (
    physical_txns
    .groupBy("merchant_id")
    .agg(
        F.countDistinct(
            F.struct("merchant_city", "merchant_state")
        ).alias("num_locations")
    )
)

# Merchants with inconsistent locations
inconsistent_merchants = merchant_location_check.filter(
    F.col("num_locations") > 1
)

inconsistent_merchants.show(truncate=False)


+-----------+-------------+
|merchant_id|num_locations|
+-----------+-------------+
|54850      |588          |
|27092      |1049         |
|11468      |368          |
|9041       |7            |
|44919      |744          |
|28395      |398          |
|86438      |1305         |
|31258      |2            |
|36392      |2            |
|18014      |5            |
|7131       |2            |
|58897      |5            |
|95855      |112          |
|83240      |3            |
|57386      |85           |
|78680      |2            |
|24891      |9            |
|11901      |29           |
|94625      |256          |
|30928      |217          |
+-----------+-------------+
only showing top 20 rows


These merchants appear with more than one city/state combination, which may indicate either:
- A data quality issue
- Chain merchants operating in several locations

We also check whether any online transaction carries a physical location, as an integrity check on the transaction details.

To do this, we filter on `use_chip == 'Online Transaction'` and check whether `merchant_state` or `zip` is filled, or whether `merchant_city` is anything other than `ONLINE`.

Any such row is an integrity error.

In [54]:
online_location_violations = transactions_df.filter(
    (F.col("use_chip") == "Online Transaction") &
    (
        (F.col("merchant_city").isNotNull() & (F.col("merchant_city") != "ONLINE")) |
        (F.col("merchant_state").isNotNull()) |
        (F.col("zip").isNotNull())
    )
)

online_location_violations.show(truncate=False)


+---+----+---------+-------+------+--------+-----------+-------------+--------------+---+---+------+
|id |date|client_id|card_id|amount|use_chip|merchant_id|merchant_city|merchant_state|zip|mcc|errors|
+---+----+---------+-------+------+--------+-----------+-------------+--------------+---+---+------+
+---+----+---------+-------+------+--------+-----------+-------------+--------------+---+---+------+



The result is empty: no online transaction carries a physical location, so there are no integrity violations for online transactions.

In [55]:
# The session object is named `session` in this notebook
session.version


'4.0.1'

## Data Cleaning

Based on the analysis above, the transactions dataset is cleaned and exported for further use.

In [49]:
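# Release the copy left over from an earlier run, if one exists
if "transactions_clean" in globals():
    transactions_clean.unpersist()
    del transactions_clean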


In [50]:
from pyspark.sql.functions import (
    col, when, regexp_replace, countDistinct, upper, trim, collect_set, concat_ws, concat, lit
)



# 1. Clean amount
transactions_clean = transactions_df.withColumn(
    "amount",
    regexp_replace(col("amount"), "[$,]", "")
)

transactions_clean = transactions_clean.withColumn(
    "amount",
    col("amount").cast("double")
)

# 2. Handling missing errors
transactions_clean = transactions_clean.withColumn(
    "errors",
    when(col("errors").isNull(), "No Error").otherwise(col("errors"))
)

# 3. Normalize merchant location
transactions_clean = transactions_clean.withColumn(
    "merchant_city",
    upper(trim(col("merchant_city")))
).withColumn(
    "merchant_state",
    upper(trim(col("merchant_state")))
)



# 4. Merchant location consistency check

merchant_location_check = (
    transactions_clean
    .filter(col("use_chip") != "Online Transaction")
    .groupBy("merchant_id")
    .agg(
        countDistinct("merchant_city").alias("city_count"),
        countDistinct("merchant_state").alias("state_count"),
        concat_ws(
            " | ",
            collect_set(
                concat(col("merchant_city"), lit(" "), col("merchant_state"))
            )
        ).alias("all_locations")
    )
    .filter((col("city_count") > 1) | (col("state_count") > 1))
)


# Final verification
print("Transactions data after cleaning")

transactions_clean.printSchema()

transactions_clean.show(10)


print("Merchant data location")
merchant_location_check.show(5)

Transactions data after cleaning
root
 |-- id: integer (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- client_id: integer (nullable = true)
 |-- card_id: integer (nullable = true)
 |-- amount: double (nullable = true)
 |-- use_chip: string (nullable = true)
 |-- merchant_id: integer (nullable = true)
 |-- merchant_city: string (nullable = true)
 |-- merchant_state: string (nullable = true)
 |-- zip: double (nullable = true)
 |-- mcc: integer (nullable = true)
 |-- errors: string (nullable = true)

+-------+-------------------+---------+-------+------+------------------+-----------+-------------+--------------+-------+----+--------+
|     id|               date|client_id|card_id|amount|          use_chip|merchant_id|merchant_city|merchant_state|    zip| mcc|  errors|
+-------+-------------------+---------+-------+------+------------------+-----------+-------------+--------------+-------+----+--------+
|7475327|2010-01-01 00:01:00|     1556|   2972| -77.0| Swipe Transaction

In [None]:
print("Saving cleaned data...")

try:
    # Convert Spark DataFrame to Pandas and save as CSV
    transactions_clean.toPandas().to_csv("data/cleaned/transactions_cleaned.csv", index=False)
    merchant_location_check.toPandas().to_csv("data/cleaned/merchant_location.csv", index=False)
    
    # If no error occurs
    print("Saved cleaned data successfully!")
except Exception as e:
    # If an error occurs
    print("Error saving data:", e)


Saving cleaned data...
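
Collecting the full cleaned dataset on the driver with `toPandas()` may strain the 6g of driver memory configured at the top of the notebook, given the roughly 13 million transaction rows. A possible alternative, sketched below under assumptions, is Spark's own writer, which writes in parallel without collecting the data. `save_with_spark_writer` and the example output path are hypothetical, and Spark writes a directory of part files rather than a single CSV file.

```python
def save_with_spark_writer(df, path, fmt="parquet"):
    """Write a Spark DataFrame without collecting it on the driver."""
    writer = df.write.mode("overwrite")  # replace output left by earlier runs
    if fmt == "csv":
        # Keep the header row so the files can be read back with header=True.
        writer.option("header", True).csv(path)
    else:
        # Parquet stores the schema, so amount stays a double on reload.
        writer.parquet(path)
```

For example, `save_with_spark_writer(transactions_clean, "data/cleaned/transactions_cleaned_parquet")` would write the cleaned transactions as Parquet. The much smaller `merchant_location_check` result can reasonably stay on the `toPandas()` route.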


## Cards and Users Data

In [4]:
cards_df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- client_id: integer (nullable = true)
 |-- card_brand: string (nullable = true)
 |-- card_type: string (nullable = true)
 |-- card_number: long (nullable = true)
 |-- expires: string (nullable = true)
 |-- cvv: integer (nullable = true)
 |-- has_chip: string (nullable = true)
 |-- num_cards_issued: integer (nullable = true)
 |-- credit_limit: string (nullable = true)
 |-- acct_open_date: string (nullable = true)
 |-- year_pin_last_changed: integer (nullable = true)
 |-- card_on_dark_web: string (nullable = true)



In [5]:

users_df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- current_age: integer (nullable = true)
 |-- retirement_age: integer (nullable = true)
 |-- birth_year: integer (nullable = true)
 |-- birth_month: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- address: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- per_capita_income: string (nullable = true)
 |-- yearly_income: string (nullable = true)
 |-- total_debt: string (nullable = true)
 |-- credit_score: integer (nullable = true)
 |-- num_credit_cards: integer (nullable = true)



Inspecting the schemas helps identify:
- Data types (string, numeric, timestamp)
- Columns that may require conversion
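- Potential inconsistencies, such as amounts stored as strings (credit_limit, per_capita_income, yearly_income, total_debt) and dates stored as strings (expires, acct_open_date)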
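
The string-typed money and date columns in the cards and users schemas would need conversions similar to `amount`. The plain-Python sketch below illustrates the intended parsing on values taken from the cards preview; `parse_currency` and `parse_month_year` are hypothetical helpers, and treating a month/year value as the first day of that month is an assumption.

```python
from datetime import datetime


def parse_currency(raw):
    """Convert a string such as "$24295" to a float; None stays None."""
    if raw is None:
        return None
    return float(raw.replace("$", "").replace(",", ""))


def parse_month_year(raw):
    """Convert a string such as "12/2022" to a date on the first of the month."""
    if raw is None:
        return None
    return datetime.strptime(raw, "%m/%Y").date()


# Values copied from the first row of the cards preview.
print(parse_currency("$24295"))
print(parse_month_year("12/2022"))
print(parse_month_year("09/2002"))
```

In Spark, the currency columns could presumably reuse the `regexp_replace` plus cast approach applied to `amount`; the date columns would need a month/year pattern, which is not shown here.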

In [None]:
cards_df.show(5)
users_df.show(5)

+----+---------+----------+---------------+----------------+-------+---+--------+----------------+------------+--------------+---------------------+----------------+
|  id|client_id|card_brand|      card_type|     card_number|expires|cvv|has_chip|num_cards_issued|credit_limit|acct_open_date|year_pin_last_changed|card_on_dark_web|
+----+---------+----------+---------------+----------------+-------+---+--------+----------------+------------+--------------+---------------------+----------------+
|4524|      825|      NULL|          Debit|4344676511950444|12/2022|623|     YES|               2|      $24295|       09/2002|                 2008|              No|
|2731|      825|      NULL|          Debit|4956965974959986|12/2020|393|     YES|               2|      $21968|       04/2014|                 2014|              No|
|3701|      825|      Visa|          Debit|4582313478255491|02/2024|719|     YES|               2|      $46414|       07/2003|                 2004|              No|
|  4

In [19]:
# Check missing values in all three datasets
from pyspark.sql.functions import col, sum as spark_sum
def count_missing_values(df):
    missing_counts = df.select([spark_sum(col(c).isNull().cast("int")).alias(c) for c in df.columns])
    return missing_counts
print("Missing values in transactions dataset:")
count_missing_values(transactions_df).show()
print("Missing values in cards dataset:")
count_missing_values(cards_df).show()
print("Missing values in users dataset:")
count_missing_values(users_df).show()

Missing values in transactions dataset:
+---+----+---------+-------+------+--------+-----------+-------------+--------------+-------+---+--------+
| id|date|client_id|card_id|amount|use_chip|merchant_id|merchant_city|merchant_state|    zip|mcc|  errors|
+---+----+---------+-------+------+--------+-----------+-------------+--------------+-------+---+--------+
|  0|   0|        0|      0|     0|       0|          0|            0|       1563726|1652706|  0|13094522|
+---+----+---------+-------+------+--------+-----------+-------------+--------------+-------+---+--------+

Missing values in cards dataset:
+---+---------+----------+---------+-----------+-------+---+--------+----------------+------------+--------------+---------------------+----------------+
| id|client_id|card_brand|card_type|card_number|expires|cvv|has_chip|num_cards_issued|credit_limit|acct_open_date|year_pin_last_changed|card_on_dark_web|
+---+---------+----------+---------+-----------+-------+---+--------+--------------

In [25]:
#relationship between the datasets
print(f"Number of unique client IDs in transactions dataset: {transactions_df.select('client_id').distinct().count()}")
print(f"Number of unique client IDs in cards dataset: {cards_df.select('client_id').distinct().count()}")
print(f"Number of unique client IDs in users dataset: {users_df.select('id').distinct().count()}")


Number of unique client IDs in transactions dataset: 1219
Number of unique client IDs in cards dataset: 2000
Number of unique client IDs in users dataset: 2000


In [None]:
transactions_df.select("card_id").distinct().count(), cards_df.select("id").distinct().count()


(1219, 2000)

In [28]:

transactions_df.select("client_id").distinct().count(), users_df.select("id").distinct().count()

(1219, 2000)

The number of distinct `card_id` values present in the transactions dataset is **1,219**, while the cards dataset contains **2,000** distinct cards.
Similarly, the number of distinct `client_id` values observed in the transactions is **1,219**, compared to **2,000** users listed in the users dataset.
This indicates that not all cards and users appear in the transactions dataset.
Some cards and users may be inactive or simply did not perform any transaction during the observed time period.
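
To see which users never appear in the transactions, rather than only how many, an anti-join can be used. The sketch below shows the idea with plain sets and with a Spark `left_anti` join; both function names are hypothetical.

```python
def ids_without_transactions(all_ids, transacting_ids):
    """Plain-Python version: ids present in all_ids but never in transacting_ids."""
    return sorted(set(all_ids) - set(transacting_ids))


def users_without_transactions(users_df, transactions_df):
    """Spark version: users whose id never appears as a client_id."""
    active = transactions_df.select("client_id").distinct()
    # left_anti keeps only the users_df rows that have no match in `active`.
    return users_df.join(active, users_df["id"] == active["client_id"], "left_anti")


# Toy ids for illustration only.
print(ids_without_transactions([1, 2, 3, 4], [2, 4, 4]))
```

If every transacting client is present in the users dataset, the Spark version should return 2,000 - 1,219 = 781 users; a different count would point to client ids in the transactions that have no matching user.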


## Conclusion

This exploration phase provided an overview of the datasets and their structure.
The transactions dataset is large and contains the core information required for fraud detection.
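Several data quality issues were identified, such as missing values in `merchant_state`, `zip`, and `errors`, and amounts stored as strings with a currency sign.

A first cleaning pass on the transactions data was applied here. In the next notebook, the remaining issues will be addressed through systematic data cleaning and feature preparation.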
