# Data Transform

In this notebook, we will ask you a series of questions to evaluate your findings from your EDA. Based on each response and its justification, you will also apply a corresponding data transformation.

If you state that you will not apply any data transformation for a step, you must **justify** why your dataset or machine-learning model does not require that preprocessing step.

The bonus step is completely optional, but if you provide a sufficient feature engineering step in this project we will add `1000` points to your Kahoot leaderboard score.

You will write out this transformed dataframe as a `.csv` file to your `data/` folder.

**Note**: This dataset is quite large. If some data operations take too long to complete on your machine, use the `sample()` method to transform a subset of your data instead.
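As a minimal sketch of the `sample()` approach mentioned in the note, the snippet below draws a reproducible 10% subset. The `toy` DataFrame and its values are made up for illustration; in the notebook you would call `sample()` on the loaded transactions instead.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real transactions table (all values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "amount": rng.uniform(1, 1_000, size=1_000),
    "isFraud": rng.integers(0, 2, size=1_000),
})

# Keep 10% of the rows; random_state makes the subset reproducible.
subset = toy.sample(frac=0.1, random_state=42)
print(len(subset))
```

Because fraud cases are rare in this dataset, a very small random subset may contain only a handful of them, so it is worth checking the class counts of the subset before relying on it.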

In [1]:
import pandas as pd
import numpy as np

## Q1

Does your dataset contain any missing values or "non-predictive" columns? If so, which adjustments should you make to ensure that your model has good predictive capabilities? Apply your data transformations (if any) in the code-block below.

Answer here

In [2]:
# Load the dataset
transactions = pd.read_csv("../data/bank_transactions.csv")

# Check for missing values
missing_values = transactions.isnull().sum()
print("Missing values per column:")
print(missing_values)

# Drop non-predictive identifier columns
transactions = transactions.drop(columns=['nameOrig', 'nameDest'])

# Save the cleaned dataset
transactions.to_csv("../data/cleaned_transactions.csv", index=False)

# Preview
transactions.head()

Missing values per column:
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64


| | type | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|
| 0 | PAYMENT | 983.09 | 36730.24 | 35747.15 | 0.0 | 0.0 | 0 | 0 |
| 1 | PAYMENT | 55215.25 | 99414.0 | 44198.75 | 0.0 | 0.0 | 0 | 0 |
| 2 | CASH_IN | 220986.01 | 7773074.97 | 7994060.98 | 924031.48 | 703045.48 | 0 | 0 |
| 3 | TRANSFER | 2357394.75 | 0.0 | 0.0 | 4202580.45 | 6559975.19 | 0 | 0 |
| 4 | CASH_OUT | 67990.14 | 0.0 | 0.0 | 625317.04 | 693307.19 | 0 | 0 |


There are no missing values in the dataset: the `.isnull().sum()` check returned 0 for every column. However, the columns `nameOrig` and `nameDest` serve only as account identifiers and do not provide useful predictive features for fraud detection, so I dropped them to avoid adding noise during model training. The cleaned dataset was saved to `../data/cleaned_transactions.csv` for further analysis.
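One way to back up the claim that a column is only an identifier is to look at how many distinct values it has relative to the number of rows. The sketch below is illustrative only: the `toy` DataFrame and its values are made up, and the ratios for the real dataset were not computed here.

```python
import pandas as pd

# Toy stand-in with two identifier-like columns and one categorical column.
toy = pd.DataFrame({
    "nameOrig": ["C1", "C2", "C3", "C4"],
    "nameDest": ["M1", "M2", "M1", "M3"],
    "type": ["PAYMENT", "TRANSFER", "PAYMENT", "CASH_OUT"],
})

# Share of distinct values per column; a ratio near 1.0 suggests the
# column behaves like a row identifier rather than a reusable feature.
unique_ratio = toy.nunique() / len(toy)
print(unique_ratio.sort_values(ascending=False))
```

A high ratio is only a hint. A column can still carry signal through derived features, so the decision to drop it remains a judgment call.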

## Q2

Do certain transaction types consistently differ in amount or fraud likelihood? If so, how might you transform the type column to make this pattern usable by a machine learning model? Apply your data transformations (if any) in the code-block below.

Answer here

In [3]:
# Apply one-hot encoding to the 'type' column
transactions_encoded = pd.get_dummies(transactions, columns=['type'], prefix='type')

# Display the first few rows to verify
display(transactions_encoded.head())

# Save the transformed dataset (Optional) 
transactions_encoded.to_csv("../data/encoded_transactions.csv", index=False)


| | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud | type_CASH_IN | type_CASH_OUT | type_DEBIT | type_PAYMENT | type_TRANSFER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 983.09 | 36730.24 | 35747.15 | 0.0 | 0.0 | 0 | 0 | False | False | False | True | False |
| 1 | 55215.25 | 99414.0 | 44198.75 | 0.0 | 0.0 | 0 | 0 | False | False | False | True | False |
| 2 | 220986.01 | 7773074.97 | 7994060.98 | 924031.48 | 703045.48 | 0 | 0 | True | False | False | False | False |
| 3 | 2357394.75 | 0.0 | 0.0 | 4202580.45 | 6559975.19 | 0 | 0 | False | False | False | False | True |
| 4 | 67990.14 | 0.0 | 0.0 | 625317.04 | 693307.19 | 0 | 0 | False | True | False | False | False |


Yes, certain transaction types differ significantly in both average amount and fraud likelihood. The EDA showed that fraudulent transactions occurred mainly in the TRANSFER and CASH_OUT types, which also had the highest average amounts. Other types such as PAYMENT, DEBIT, and CASH_IN showed little or no fraud. To help the model learn these patterns, I applied one-hot encoding to the `type` column, converting it into one binary indicator column per transaction type (`type_CASH_IN`, `type_CASH_OUT`, and so on) that a model can consume directly.
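The per-type comparison behind this answer can be reproduced with a single `groupby`. The sketch below uses a made-up `toy` DataFrame, so its numbers are not the EDA results. It only shows the shape of the summary (row count, mean amount, and fraud rate per type).

```python
import pandas as pd

# Toy stand-in for the transactions table (all values are made up).
toy = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "TRANSFER", "CASH_OUT", "CASH_IN"],
    "amount": [100.0, 300.0, 5000.0, 7000.0, 2000.0, 400.0],
    "isFraud": [0, 0, 1, 0, 1, 0],
})

# One row per transaction type: how many rows, the mean amount,
# and the share of rows labelled as fraud.
summary = toy.groupby("type").agg(
    n=("isFraud", "size"),
    mean_amount=("amount", "mean"),
    fraud_rate=("isFraud", "mean"),
)
print(summary)
```

This has to run on the data before one-hot encoding, since it relies on the original `type` column.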

## Q3

After exploring your data, you may have noticed that fraudulent transactions are rare compared to non-fraudulent ones. What challenges might this pose when training a machine learning model? What strategies could you use to ensure your model learns meaningful patterns from the minority class? Apply your data transformations (if any) in the code-block below.

Answer here

In [4]:
# Check class distribution in 'isFraud'
fraud_counts = transactions_encoded['isFraud'].value_counts()
print("Fraudulent vs Non-Fraudulent:\n", fraud_counts)

# Separate features and target
X = transactions_encoded.drop(columns=['isFraud'])
y = transactions_encoded['isFraud']

# Apply SMOTE for oversampling minority class
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)

X_resampled, y_resampled = smote.fit_resample(X, y)

# Confirm the new class distribution
from collections import Counter
print("New class distribution after SMOTE:\n", Counter(y_resampled))

# Combine back into a single DataFrame
resampled_df = pd.DataFrame(X_resampled, columns=X.columns)
resampled_df['isFraud'] = y_resampled

# Save the resampled dataset (Optional) - Commented out due to Large size of data file
# resampled_df.to_csv("../data/smote_resampled_transactions.csv", index=False)


Fraudulent vs Non-Fraudulent:
 isFraud
0    998703
1      1297
Name: count, dtype: int64
New class distribution after SMOTE:
 Counter({0: 998703, 1: 998703})


In [5]:
# Check how many fraud vs non-fraud transactions exist
fraud_counts = transactions_encoded['isFraud'].value_counts()
fraud_percent = transactions_encoded['isFraud'].value_counts(normalize=True) * 100

print("Class Distribution:")
print(fraud_counts)
print("\nClass Distribution Percentage:")
print(fraud_percent)


Class Distribution:
isFraud
0    998703
1      1297
Name: count, dtype: int64

Class Distribution Percentage:
isFraud
0    99.8703
1     0.1297
Name: proportion, dtype: float64


In [6]:
from sklearn.utils import resample

# Split the fraud and non-fraud transactions
fraud = transactions_encoded[transactions_encoded['isFraud'] == 1]
non_fraud = transactions_encoded[transactions_encoded['isFraud'] == 0]

# Oversample the fraud cases
fraud_oversampled = resample(fraud, 
                             replace=True,               # sample with replacement
                             n_samples=len(non_fraud),   # match number of non-fraud samples
                             random_state=42)

# Combine them into a new balanced dataset
balanced_df = pd.concat([non_fraud, fraud_oversampled])

# Shuffle the dataset
balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)

# Save it (optional) - Commented out due to Large size of data file
# balanced_df.to_csv("../data/balanced_transactions.csv", index=False)

# Show the new class distribution
print("Balanced class distribution:")
print(balanced_df['isFraud'].value_counts())


Balanced class distribution:
isFraud
1    998703
0    998703
Name: count, dtype: int64


*Addressing Class Imbalance:*  
Before training a machine learning model, it is important to address the imbalance between fraudulent and non-fraudulent transactions. The dataset contains 998,703 non-fraud cases and only 1,297 fraud cases (about 0.13%), so a model could reach roughly 99.87% accuracy by always predicting "not fraud" while detecting no fraud at all.

To resolve this, I applied two techniques:
- SMOTE (Synthetic Minority Oversampling Technique): creates synthetic fraud samples by interpolating between existing minority-class samples and their nearest neighbors.
- Random oversampling (using `resample()`): duplicates existing fraud cases by sampling with replacement.

Both methods balanced the class distribution to:
- Non-Fraudulent: 998,703
- Fraudulent: 998,703

Training on these resampled datasets should help the model learn from both classes and become better at detecting fraud.
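One caveat: the cells above resample the full dataset. If a test set is split off afterwards, copies of the same fraud rows (or synthetic neighbors of them) can land in both splits and inflate the evaluation scores. A common way to avoid this is to split first and oversample only the training portion. The sketch below illustrates that order of operations on a made-up `toy` DataFrame. The 80/20 split ratio is an assumption, not something this notebook prescribes.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy stand-in: 95 non-fraud rows and 5 fraud rows (values are made up).
toy = pd.DataFrame({
    "amount": range(100),
    "isFraud": [0] * 95 + [1] * 5,
})

# 1) Hold out a test set first, stratified so both splits keep fraud rows.
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["isFraud"], random_state=42
)

# 2) Oversample the fraud class inside the training split only.
train_fraud = train[train["isFraud"] == 1]
train_non_fraud = train[train["isFraud"] == 0]
fraud_upsampled = resample(
    train_fraud,
    replace=True,                     # sample with replacement
    n_samples=len(train_non_fraud),   # match the majority class size
    random_state=42,
)
train_balanced = pd.concat([train_non_fraud, fraud_upsampled])

# The training split is balanced; the test split keeps the original imbalance.
print(train_balanced["isFraud"].value_counts())
print(test["isFraud"].value_counts())
```

The same ordering applies to SMOTE: fit the resampler on the training split only and leave the test split untouched.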

## Bonus (optional)

Are there interaction effects between variables (e.g., fraud and high amount and transaction type) that aren't captured directly in the dataset? Would it be helpful to manually engineer any new features that reflect these interactions? Apply your data transformations (if any) in the code-block below.

Answer Here

In [7]:
# Create a new feature: high_amount
transactions_encoded['high_amount'] = transactions_encoded['amount'] > 100000

# Create interaction feature: high_amount AND is type_TRANSFER
transactions_encoded['high_amt_transfer'] = (transactions_encoded['high_amount']) & (transactions_encoded['type_TRANSFER'])

# Other possible interactions like:
transactions_encoded['high_amt_cash_out'] = (transactions_encoded['high_amount']) & (transactions_encoded['type_CASH_OUT'])

# Save this new transformed dataset
transactions_encoded.to_csv("../data/engineered_transactions.csv", index=False)

# Preview new features
transactions_encoded[['amount', 'type_TRANSFER', 'high_amount', 'high_amt_transfer', 'high_amt_cash_out']].head()


| | amount | type_TRANSFER | high_amount | high_amt_transfer | high_amt_cash_out |
|---|---|---|---|---|---|
| 0 | 983.09 | False | False | False | False |
| 1 | 55215.25 | False | False | False | False |
| 2 | 220986.01 | False | True | False | False |
| 3 | 2357394.75 | True | True | True | False |
| 4 | 67990.14 | False | False | False | False |


### Feature Engineering for Interaction Effects

To explore potential interaction effects, I manually created new features that combine existing columns. One such example is whether a transaction is both **high in amount** (over 100,000) and of type **TRANSFER**.

These kinds of combinations may help the model detect more subtle fraud patterns that wouldn’t be obvious from individual features alone.

New features created:
- `high_amount`: True if amount > 100,000
- `high_amt_transfer`: True if transaction is both high amount and of type TRANSFER
- `high_amt_cash_out`: True if transaction is high amount and type CASH_OUT

These engineered features were saved into a new file `engineered_transactions.csv` for future use.
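A quick sanity check for an engineered flag is to compare the fraud rate between rows where it is `True` and rows where it is `False`. The sketch below does this on a made-up `toy` DataFrame, so the resulting rates are illustrative only. It reuses the 100,000 threshold from the cell above.

```python
import pandas as pd

# Toy stand-in for the encoded transactions (all values are made up).
toy = pd.DataFrame({
    "amount": [500.0, 250000.0, 180000.0, 90000.0, 300000.0, 1200.0],
    "type_TRANSFER": [False, True, True, True, False, False],
    "isFraud": [0, 1, 0, 0, 0, 0],
})

# Rebuild the interaction flag: high amount AND transfer.
threshold = 100_000
toy["high_amt_transfer"] = (toy["amount"] > threshold) & toy["type_TRANSFER"]

# Mean of a 0/1 label per group is the fraud rate for that group.
fraud_rate = toy.groupby("high_amt_transfer")["isFraud"].mean()
print(fraud_rate)
```

If the two rates are close on the real data, the flag probably adds little beyond `amount` and `type_TRANSFER` themselves. A large gap suggests the interaction is worth keeping.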
