# Data Transform

In this notebook, you will answer a series of questions that evaluate the findings from your EDA. Based on your response and justification, you will then apply the corresponding data transformation.

If you state that you will not apply any data transformation for a step, you must **justify** why your dataset or machine-learning model does not require that preprocessing step.

The bonus step is completely optional, but if you provide a sufficient feature engineering step in this project, we will add `1000` points to your Kahoot leaderboard score.

You will write out this transformed dataframe as a `.csv` file to your `data/` folder.
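
As a minimal sketch of that write-out step: the frame `demo`, the file name `transactions_transformed.csv`, and the temporary directory standing in for `data/` are all assumptions made for illustration, not names from this project.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in frame; in the notebook this would be the transformed dataframe.
demo = pd.DataFrame({"amount": [983.09, 55215.25], "isFraud": [0, 0]})

# Hypothetical output location: a temporary directory stands in for `data/`.
out_dir = Path(tempfile.mkdtemp()) / "data"
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / "transactions_transformed.csv"  # hypothetical file name

# index=False keeps the row index from being written as an extra column,
# which would otherwise reappear as "Unnamed: 0" when the file is read back.
demo.to_csv(out_path, index=False)

# Read the file back to confirm the columns survived the round trip.
reloaded = pd.read_csv(out_path)
```

Passing `index=False` is a design choice rather than a requirement; it is usually convenient when the index carries no information.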

**Note**: This dataset is quite large. If some data operations take too long to complete on your machine, use the `sample()` method to work with a subset of your data.
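
A short sketch of how `sample()` might be used for this, on a synthetic stand-in frame (`demo` and its values are made up for illustration). Note that with a rare target such as `isFraud`, a small random subset may contain very few positive rows.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real transactions frame.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "amount": rng.uniform(1, 1000, size=1_000),
    "isFraud": rng.integers(0, 2, size=1_000),
})

# Fixed-size subset; random_state makes the draw reproducible.
subset_n = demo.sample(n=100, random_state=42)

# Fractional subset: 10% of the rows.
subset_frac = demo.sample(frac=0.1, random_state=42)
```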

In [1]:
import pandas as pd
import numpy as np

In [2]:
# import data 
transactions = pd.read_csv("../data/bank_transactions.csv")

transactions.head()

Unnamed: 0,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,PAYMENT,983.09,C1454812978,36730.24,35747.15,M1491308340,0.0,0.0,0,0
1,PAYMENT,55215.25,C1031766358,99414.0,44198.75,M2102868029,0.0,0.0,0,0
2,CASH_IN,220986.01,C1451868666,7773074.97,7994060.98,C1339195526,924031.48,703045.48,0,0
3,TRANSFER,2357394.75,C458368123,0.0,0.0,C620979654,4202580.45,6559975.19,0,0
4,CASH_OUT,67990.14,C1098978063,0.0,0.0,C142246322,625317.04,693307.19,0,0


In [5]:
# Check for missing values in the dataset
if 'transactions' not in globals():
    transactions = pd.read_csv("../data/bank_transactions.csv")
missing_values = transactions.isnull().sum()
print("Missing values per column:\n", missing_values)

# Columns treated as non-predictive in this experiment:
# account identifiers plus the before/after balance columns
non_predictive_cols = ['nameOrig', 'nameDest', 'oldbalanceOrg', 'oldbalanceDest', 'newbalanceOrig', 'newbalanceDest']

# Drop those columns; errors='ignore' skips any that are absent
transactions_cleaned = transactions.drop(columns=non_predictive_cols, errors='ignore')

# Drop rows with missing values (a no-op here, since none were found)
transactions_cleaned = transactions_cleaned.dropna()

print("Shape after cleaning:", transactions_cleaned.shape)

Missing values per column:
 type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64
Shape after cleaning: (1000000, 4)


In [6]:
# Display the missing-value summary and the non-predictive columns that were dropped
print("Missing values summary:\n", missing_values)
print("Non-predictive columns dropped:", non_predictive_cols)
print("Shape of cleaned transactions:", transactions_cleaned.shape)
transactions_cleaned.head()

## Q1

Does your model contain any missing values or "non-predictive" columns? If so, which adjustments should you take to ensure that your model has good predictive capabilities? Apply your data transformations (if any) in the code-block below.

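Answer here: No column contains missing values (every column reports 0 nulls), so no rows need to be dropped or imputed. The identifier columns `nameOrig` and `nameDest` and the four balance columns were dropped as an experiment in removing non-predictive columns.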


## Q2

Do certain transaction types consistently differ in amount or fraud likelihood? If so, how might you transform the type column to make this pattern usable by a machine learning model? Apply your data transformations (if any) in the code-block below.

Answer here

In [8]:
# Explore transaction types for differences in amount and fraud likelihood
type_stats = transactions_cleaned.groupby('type').agg(
    avg_amount=('amount', 'mean'),
    fraud_rate=('isFraud', 'mean'),
    count=('type', 'size')
)
print(type_stats)

# Transform 'type' column using one-hot encoding for ML usability
transactions_transformed = pd.get_dummies(transactions_cleaned, columns=['type'], prefix='type')

transactions_transformed.head()

             avg_amount  fraud_rate   count
type                                       
CASH_IN   168928.914668    0.000000  219955
CASH_OUT  175584.659320    0.001870  351360
DEBIT       5445.890813    0.000000    6417
PAYMENT    13055.592085    0.000000  338573
TRANSFER  911827.155179    0.007647   83695


Unnamed: 0,amount,isFraud,isFlaggedFraud,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,983.09,0,0,False,False,False,True,False
1,55215.25,0,0,False,False,False,True,False
2,220986.01,0,0,True,False,False,False,False
3,2357394.75,0,0,False,False,False,False,True
4,67990.14,0,0,False,True,False,False,False


In [None]:
# Answer: 
print("Transaction type statistics:\n", type_stats)
print("\nInterpretation:")
print(
    "1. TRANSFER transactions have the highest average amount and the highest fraud rate (0.76%).\n"
    "2. CASH_OUT transactions also have a notable fraud rate (0.19%) and high average amounts.\n"
    "3. PAYMENT, DEBIT, and CASH_IN transactions have very low or zero fraud rates.\n"
    "Conclusion: Fraud is concentrated in TRANSFER and CASH_OUT types, so encoding 'type' is important for ML models."
)

Transaction type statistics:
              avg_amount  fraud_rate   count
type                                       
CASH_IN   168928.914668    0.000000  219955
CASH_OUT  175584.659320    0.001870  351360
DEBIT       5445.890813    0.000000    6417
PAYMENT    13055.592085    0.000000  338573
TRANSFER  911827.155179    0.007647   83695

Interpretation:
1. TRANSFER transactions have the highest average amount and the highest fraud rate (0.76%).
2. CASH_OUT transactions also have a notable fraud rate (0.19%) and high average amounts.
3. PAYMENT, DEBIT, and CASH_IN transactions have very low or zero fraud rates.
Conclusion: Fraud is concentrated in TRANSFER and CASH_OUT types, so encoding 'type' is important for ML models.
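
As a variation on the one-hot encoding used above, the sketch below shows two optional `pd.get_dummies` arguments on a toy frame (the frame `demo` is made up for illustration). Whether either option helps depends on the model chosen later, so treat this as one possibility rather than the required encoding.

```python
import pandas as pd

# Toy frame with one row per transaction type.
demo = pd.DataFrame({
    "type": ["PAYMENT", "CASH_IN", "TRANSFER", "CASH_OUT", "DEBIT"],
    "amount": [983.09, 220986.01, 2357394.75, 67990.14, 5445.89],
})

# dtype=int yields 0/1 columns instead of True/False.
encoded = pd.get_dummies(demo, columns=["type"], prefix="type", dtype=int)

# drop_first=True drops the first category (CASH_IN, alphabetically),
# which then acts as the implicit baseline; this avoids perfectly
# collinear dummy columns in linear models.
encoded_k_minus_1 = pd.get_dummies(
    demo, columns=["type"], prefix="type", dtype=int, drop_first=True
)
```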


## Q3

After exploring your data, you may have noticed that fraudulent transactions are rare compared to non-fraudulent ones. What challenges might this pose when training a machine learning model? What strategies could you use to ensure your model learns meaningful patterns from the minority class? Apply your data transformations (if any) in the code-block below.

Answer here: Fraudulent transactions are rare (class imbalance), which can bias a model toward predicting the majority class (non-fraud). A model can then reach high accuracy while still having poor recall for fraud.

Strategies to address class imbalance:
- Resampling: oversample the minority class or undersample the majority class.
- Class weights: weight the minority class more heavily during model training.
- Evaluation metrics suited to imbalanced data (e.g., ROC-AUC, F1-score, recall) instead of plain accuracy.
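
The class-weight strategy listed above can be sketched as follows. The label vector `y` is synthetic, and the formula is the "balanced" heuristic `n_samples / (n_classes * class_count)`, the same heuristic scikit-learn applies for `class_weight="balanced"`. This is an illustration, not the weighting this project must use.

```python
import pandas as pd

# Synthetic, heavily imbalanced labels: 990 non-fraud, 10 fraud.
y = pd.Series([0] * 990 + [1] * 10, name="isFraud")

counts = y.value_counts().sort_index()
n_samples = len(y)
n_classes = counts.size

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
class_weight = (n_samples / (n_classes * counts)).to_dict()

print(class_weight)
```

A dictionary of this shape can typically be passed to estimators that accept a `class_weight` argument.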



In [None]:
from sklearn.utils import resample


# To separate majority and minority classes
df_majority = transactions_transformed[transactions_transformed.isFraud == 0]
df_minority = transactions_transformed[transactions_transformed.isFraud == 1]

# Upsample minority class
df_minority_upsampled = resample(
    df_minority,
    replace=True,                
    n_samples=len(df_majority),  
    random_state=42
)

# To Combine majority class with upsampled minority class
transactions_balanced = pd.concat([df_majority, df_minority_upsampled])

# To shuffle the resulting dataframe
transactions_balanced = transactions_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

print("Class distribution after balancing:\n", transactions_balanced['isFraud'].value_counts())
transactions_balanced.head()

Class distribution after balancing:
 isFraud
1    998703
0    998703
Name: count, dtype: int64


Unnamed: 0,amount,isFraud,isFlaggedFraud,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,116636.06,1,0,False,True,False,False,False
1,324912.59,0,0,True,False,False,False,False
2,39795.3,0,0,True,False,False,False,False
3,309254.05,1,0,False,True,False,False,False
4,163554.25,0,0,False,True,False,False,False
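
One caveat with oversampling the full dataset: if a train/test split is made afterwards, duplicated fraud rows can land in both splits and inflate the evaluation scores. A hedged alternative is to hold out a test set first and oversample only the training portion. The sketch below uses a synthetic frame and pandas only; the 20% hold-out fraction and all variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 20 fraud rows out of 1,000.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "amount": rng.uniform(1, 1000, size=1_000),
    "isFraud": [1] * 20 + [0] * 980,
})

# Stratified hold-out: take 20% of each class for testing.
test = demo.groupby("isFraud", group_keys=False).sample(frac=0.2, random_state=42)
train = demo.drop(test.index)

# Oversample the minority class in the training split only.
train_majority = train[train["isFraud"] == 0]
train_minority = train[train["isFraud"] == 1]
minority_upsampled = train_minority.sample(
    n=len(train_majority), replace=True, random_state=42
)

# Combine and shuffle; the test split keeps its original class ratio.
train_balanced = (
    pd.concat([train_majority, minority_upsampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
```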


## Bonus (optional)

Are there interaction effects between variables (e.g., fraud and high amount and transaction type) that aren't captured directly in the dataset? Would it be helpful to manually engineer any new features that reflect these interactions? Apply your data transformations (if any) in the code-block below.

Answer Here
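
One possible direction, sketched on a toy frame: since the type statistics above show fraud only in `TRANSFER` and `CASH_OUT`, a combined "risky type" flag can be crossed with the transaction amount. The new column names and the 75th-percentile cutoff are hypothetical choices for illustration, not a prescribed answer.

```python
import numpy as np
import pandas as pd

# Toy frame shaped like the one-hot encoded data (subset of columns).
demo = pd.DataFrame({
    "amount": [983.09, 220986.01, 2357394.75, 67990.14],
    "type_CASH_OUT": [False, False, False, True],
    "type_TRANSFER": [False, False, True, False],
})

# log1p compresses the heavy right tail of transaction amounts.
demo["log_amount"] = np.log1p(demo["amount"])

# Flag the two transaction types in which fraud was observed.
demo["is_risky_type"] = (demo["type_TRANSFER"] | demo["type_CASH_OUT"]).astype(int)

# Hypothetical cutoff: the 75th percentile of amount in this toy frame.
high_cutoff = demo["amount"].quantile(0.75)
demo["is_high_amount"] = (demo["amount"] > high_cutoff).astype(int)

# Interaction features: type-by-amount and type-by-high-amount.
demo["risky_type_x_log_amount"] = demo["is_risky_type"] * demo["log_amount"]
demo["risky_type_and_high_amount"] = demo["is_risky_type"] * demo["is_high_amount"]
```

If a cutoff like this is used on the real data, it would be safer to compute it on the training split only so that no information from held-out rows leaks into the feature.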

In [2]:
# write out newly transformed dataset to your folder
...