# EDA

Create your own EDA below. Create as many code blocks as you need.

Once you've completed your EDA, complete the section titled **Reflections** where you will answer questions about your EDA.

**Note**: Since this dataset is large, visualizations may load slowly. Consider using the `sample()` method if the dataset is too large for your computer to process in a reasonable amount of time.
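As a minimal sketch of the sampling idea, the block here draws a reproducible random subset with `DataFrame.sample()`. The small `demo` frame and the 10% fraction are stand-ins chosen for illustration, not values from the assignment; in this notebook the same call would be applied to the loaded data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 1,000 rows with a rare fraud label.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "amount": rng.exponential(scale=1_000.0, size=1_000),
    "isFraud": (rng.random(1_000) < 0.01).astype(int),
})

# Draw 10% of the rows; random_state makes the sample reproducible.
demo_sample = demo.sample(frac=0.10, random_state=42)

print(len(demo), len(demo_sample))
```

Because fraud is rare, a small sample may contain very few fraudulent rows, so it is worth checking `demo_sample["isFraud"].sum()` before drawing conclusions from a sampled plot.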

In [1]:
import pandas as pd 
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# import data 
transactions = pd.read_csv("../data/bank_transactions.csv")

transactions.head()

Unnamed: 0,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,PAYMENT,983.09,C1454812978,36730.24,35747.15,M1491308340,0.0,0.0,0,0
1,PAYMENT,55215.25,C1031766358,99414.0,44198.75,M2102868029,0.0,0.0,0,0
2,CASH_IN,220986.01,C1451868666,7773074.97,7994060.98,C1339195526,924031.48,703045.48,0,0
3,TRANSFER,2357394.75,C458368123,0.0,0.0,C620979654,4202580.45,6559975.19,0,0
4,CASH_OUT,67990.14,C1098978063,0.0,0.0,C142246322,625317.04,693307.19,0,0


In [3]:
# TODO: view the shape of your data

print(f"Number of rows: {transactions.shape[0]}")
print(f"Number of columns: {transactions.shape[1]}")


Number of rows: 1000000
Number of columns: 10


In [6]:
# TODO: Begin your EDA
print("=== Dataset Info ===")
transactions.info()


print("\n=== Summary Statistics ===")
print(transactions.describe())


print("\n=== Missing Values ===")
print(transactions.isnull().sum())


print("\n=== Sample Rows ===")
print(transactions.head())


print("\n=== Column Names ===")
print(transactions.columns)


print("\n=== Unique Values in Each Column ===")
for col in transactions.columns:
    print(f"{col}: {transactions[col].nunique()} unique values")


print("\n=== Value Counts Example ===")
example_col = "type"

# Guard against a missing or misspelled column before counting its values
if example_col in transactions.columns:
    print(transactions[example_col].value_counts())
else:
    print(f"Column '{example_col}' not found. Try another column like: {list(transactions.columns)}")



=== Dataset Info ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   type            1000000 non-null  object 
 1   amount          1000000 non-null  float64
 2   nameOrig        1000000 non-null  object 
 3   oldbalanceOrg   1000000 non-null  float64
 4   newbalanceOrig  1000000 non-null  float64
 5   nameDest        1000000 non-null  object 
 6   oldbalanceDest  1000000 non-null  float64
 7   newbalanceDest  1000000 non-null  float64
 8   isFraud         1000000 non-null  int64  
 9   isFlaggedFraud  1000000 non-null  int64  
dtypes: float64(5), int64(2), object(3)
memory usage: 76.3+ MB

=== Summary Statistics ===
             amount  oldbalanceOrg  newbalanceOrig  oldbalanceDest  \
count  1.000000e+06   1.000000e+06    1.000000e+06    1.000000e+06   
mean   1.796208e+05   8.351184e+05    8.565104e+05    1.102856e+06   
st

## Reflections

Answer each question based on the visualizations that you've generated.

Remember, you must justify your answers with proof.

### Q1

Take a closer look at the numeric features in your dataset. How are these values distributed and what might this tell you about how most transactions behave compared to a few **rare** ones?

**Hint**: Consider using visualizations that highlight frequency across value ranges.

Most transactions in the dataset are for relatively small amounts. In the histograms, most values cluster close to zero and only a few are very large, so the numeric features are strongly right-skewed: typical transactions are small, while a handful of rare, high-value ones form a long tail.
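One way to back up a skewness claim with numbers is to compare the median against the mean and upper quantiles, and to count values per range on a log scale. The sketch here assumes a hypothetical helper `describe_tail` and uses synthetic log-normal amounts as a stand-in for the `amount` column, so its output is illustrative only.

```python
import numpy as np
import pandas as pd

def describe_tail(values: pd.Series) -> pd.Series:
    """Summarize how far the upper tail sits from the typical value."""
    return pd.Series({
        "median": values.median(),
        "mean": values.mean(),
        "p99": values.quantile(0.99),
        "max": values.max(),
        "skew": values.skew(),
    })

# Hypothetical right-skewed amounts (log-normal), standing in for `amount`.
rng = np.random.default_rng(1)
toy_amounts = pd.Series(rng.lognormal(mean=8.0, sigma=1.5, size=5_000), name="amount")

tail_summary = describe_tail(toy_amounts)
print(tail_summary.round(2))

# Frequency across value ranges on a log10 scale: a text version of a histogram.
log_bins = pd.cut(np.log10(toy_amounts), bins=6)
print(log_bins.value_counts().sort_index())
```

A mean well above the median, together with a positive skew, is the numeric signature of the long right tail described above. In the notebook, `describe_tail(transactions["amount"])` would apply the same check to the real data.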

### Q2

When comparing different numerical features against one another, do any interesting patterns emerge for transactions marked as fraudulent? Are there particular regions or ranges where these transactions seem to concentrate?

**Hint**: Try comparing two numeric features at a time while distinguishing between fraud and non-fraud. Use visual cues to spot clusters or anomalies.

When I plotted pairs of numeric features, such as amount against the balance columns, and highlighted the fraudulent transactions, the frauds usually involved larger amounts. The originating balance before the transaction was often very low or even zero, which is suspicious when a large amount is being sent.
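The pattern seen in a scatter plot can also be checked numerically. The sketch here bins amounts into ranges, computes the share of fraud in each range, and compares how often the originating balance is zero for fraud and non-fraud rows. The `toy` rows and the bin edges are invented for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy rows using the dataset's column names.
toy = pd.DataFrame({
    "amount":        [50.0, 120.0, 900.0, 15_000.0, 250_000.0, 400_000.0, 80.0, 310_000.0],
    "oldbalanceOrg": [500.0, 1_000.0, 2_000.0, 0.0, 0.0, 400_000.0, 300.0, 0.0],
    "isFraud":       [0, 0, 0, 0, 1, 1, 0, 1],
})

# Bin amounts into ranges (edges are an arbitrary choice) and
# compute the share of fraudulent rows in each range.
amount_bins = pd.cut(
    toy["amount"],
    bins=[0, 1_000, 100_000, float("inf")],
    labels=["<=1k", "1k-100k", ">100k"],
)
fraud_rate_by_bin = toy.groupby(amount_bins, observed=False)["isFraud"].mean()
print(fraud_rate_by_bin)

# Share of rows whose originating balance was zero, split by fraud label.
zero_balance_share = (
    toy.assign(zero_origin=toy["oldbalanceOrg"].eq(0))
       .groupby("isFraud")["zero_origin"]
       .mean()
)
print(zero_balance_share)
```

On the real data, a fraud rate that rises with the amount range would support the observation that fraud concentrates among large transactions.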

### Q3

How do types of transaction relate to the typical amounts involved? Are some types of transactions consistently larger or smaller than others?

**Hint**: Break the dataset down by transaction type and compare summary statistics.

When I grouped the transactions by type and compared the typical amounts, clear differences emerged. Transfers and cash-outs usually involved larger amounts, while payments and debit transactions were smaller on average.
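A per-type comparison like this can be produced with a single `groupby` and a few aggregations. The sketch here uses a small invented `toy` frame with the dataset's `type` and `amount` columns; the numbers are placeholders, not results from the real data.

```python
import pandas as pd

# Hypothetical toy rows; the type labels match those in the dataset.
toy = pd.DataFrame({
    "type":   ["PAYMENT", "PAYMENT", "DEBIT", "TRANSFER", "TRANSFER",
               "CASH_OUT", "CASH_OUT", "CASH_IN"],
    "amount": [100.0, 300.0, 50.0, 200_000.0, 400_000.0,
               90_000.0, 110_000.0, 20_000.0],
})

# Summary statistics of amount per transaction type, largest median first.
amount_by_type = (
    toy.groupby("type")["amount"]
       .agg(["count", "median", "mean", "max"])
       .sort_values("median", ascending=False)
)
print(amount_by_type)
```

Sorting by the median rather than the mean keeps a few extreme transactions from dominating the ranking, which matters when amounts are heavily skewed.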

### Q4

Do transaction amounts vary when you compare fraudulent and non-fraudulent transactions across different transaction types? What patterns emerge when you look at both fraud status and transaction type together?

**Hint**:  Try summarizing average transaction amounts by both fraud label and transaction type to compare across categories.

Looking at fraud status and transaction type together, most of the fraud occurs in transfer and cash-out transactions. Within those types, fraudulent amounts are usually much higher than legitimate ones.
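The two-way summary suggested by the hint can be sketched with `pivot_table`, alongside a count of fraud cases per type. The `toy` frame is invented for illustration and only borrows the dataset's column names.

```python
import pandas as pd

# Hypothetical toy rows using the dataset's column names.
toy = pd.DataFrame({
    "type":    ["TRANSFER", "TRANSFER", "TRANSFER", "CASH_OUT", "CASH_OUT",
                "PAYMENT", "PAYMENT"],
    "amount":  [10_000.0, 30_000.0, 500_000.0, 20_000.0, 300_000.0, 100.0, 300.0],
    "isFraud": [0, 0, 1, 0, 1, 0, 0],
})

# Mean amount for every (type, fraud label) combination.
# Types with no fraud show NaN in the isFraud == 1 column.
mean_amount = toy.pivot_table(
    index="type", columns="isFraud", values="amount", aggfunc="mean"
)
print(mean_amount)

# Number of fraudulent transactions per type.
fraud_counts = toy.groupby("type")["isFraud"].sum()
print(fraud_counts)
```

Reading the counts next to the means helps separate two questions: which types contain fraud at all, and how fraudulent amounts compare with legitimate ones within a type.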

### Q5

Consider how well the system's built-in fraud flag (`isFlaggedFraud`) aligns with actual fraudulent activity. Are there mismatches? What does this tell you about the system's current performance?

**Hint**: Try organizing the data in a way that shows how often flagged transactions are truly fraudulent and how often fraud goes unflagged.

When I compared the actual fraud column (`isFraud`) with the system's built-in fraud flag (`isFlaggedFraud`), almost none of the real frauds were flagged. The system flagged only a few transactions, and none of them were real fraud. This suggests the current flag misses nearly all fraudulent activity and is not a reliable indicator of fraud.
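A cross-tabulation makes the match and mismatch counts explicit, and precision and recall turn them into two summary numbers. The sketch here runs on an invented `toy` frame, so its values say nothing about the real dataset; only the column names `isFraud` and `isFlaggedFraud` are taken from it.

```python
import pandas as pd

# Hypothetical toy labels using the dataset's column names.
toy = pd.DataFrame({
    "isFraud":        [0, 0, 0, 0, 1, 1, 1, 0],
    "isFlaggedFraud": [0, 0, 1, 0, 0, 0, 1, 0],
})

# Rows: actual label; columns: system flag.
confusion = pd.crosstab(toy["isFraud"], toy["isFlaggedFraud"])
print(confusion)

true_pos = int(((toy["isFraud"] == 1) & (toy["isFlaggedFraud"] == 1)).sum())
flagged = int(toy["isFlaggedFraud"].sum())
actual = int(toy["isFraud"].sum())

# Precision: share of flagged rows that are truly fraud.
# Recall: share of actual fraud that the system flagged.
precision = true_pos / flagged if flagged else float("nan")
recall = true_pos / actual if actual else float("nan")
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Low recall means fraud goes unflagged, while low precision means flagged transactions are mostly legitimate; reporting both from the real data gives concrete proof for the answer above.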