# Data Transform

In this notebook, we will ask you a series of questions to evaluate your findings from your EDA. Based on your response & justification, we will ask you to also apply a subsequent data transformation. 

If you state that you will not apply any data transformations for this step, you must **justify** why your dataset or machine-learning model does not require the mentioned data preprocessing step.

The bonus step is completely optional, but if you provide a sufficient feature engineering step in this project we will add `1000` points to your Kahoot leaderboard score.

You will write out this transformed dataframe as a `.csv` file to your `data/` folder.

**Note**: Again, this dataset is quite large. If you find that some data operations take too long to complete on your machine, simply use the `sample()` method to transform a subset of your data.
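
As a rough illustration of the `sample()` suggestion above, the sketch below draws a reproducible 10% subset. The `toy` frame, its values, and the 10% fraction are made-up stand-ins, not the real transactions data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the full transactions table (all values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "amount": rng.uniform(1, 1000, size=1000),
    "isFraud": rng.integers(0, 2, size=1000),
})

# Keep a random 10% of the rows; random_state makes the draw repeatable.
subset = toy.sample(frac=0.1, random_state=42)
print(subset.shape)  # (100, 2)
```

Because fraud rows are rare in the real data, it is probably worth checking that a subset drawn this way still contains some fraudulent transactions.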

In [31]:
import pandas as pd
import numpy as np

In [32]:
# import data 
transactions = pd.read_csv("../data/bank_transactions.csv")

transform_data = transactions.copy()
transform_data.head()

Unnamed: 0,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,PAYMENT,983.09,C1454812978,36730.24,35747.15,M1491308340,0.0,0.0,0,0
1,PAYMENT,55215.25,C1031766358,99414.0,44198.75,M2102868029,0.0,0.0,0,0
2,CASH_IN,220986.01,C1451868666,7773074.97,7994060.98,C1339195526,924031.48,703045.48,0,0
3,TRANSFER,2357394.75,C458368123,0.0,0.0,C620979654,4202580.45,6559975.19,0,0
4,CASH_OUT,67990.14,C1098978063,0.0,0.0,C142246322,625317.04,693307.19,0,0


## Q1

Does your dataset contain any missing values or "non-predictive" columns? If so, which adjustments should you take to ensure that your model has good predictive capabilities? Apply your data transformations (if any) in the code-block below.

The dataset does not contain any missing values, but it does contain two non-predictive columns: `nameOrig` and `nameDest`. These columns are account identifiers that carry no generalizable pattern for fraud detection, and keeping them risks the model memorizing specific accounts instead of learning patterns that transfer to new transactions.

In [33]:
# Drop non-predictive columns
transform_data.drop(columns=['nameOrig', 'nameDest'], inplace=True)
transform_data.head()

Unnamed: 0,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,PAYMENT,983.09,36730.24,35747.15,0.0,0.0,0,0
1,PAYMENT,55215.25,99414.0,44198.75,0.0,0.0,0,0
2,CASH_IN,220986.01,7773074.97,7994060.98,924031.48,703045.48,0,0
3,TRANSFER,2357394.75,0.0,0.0,4202580.45,6559975.19,0,0
4,CASH_OUT,67990.14,0.0,0.0,625317.04,693307.19,0,0
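
The claim that there are no missing values can be checked with `isna().sum()`. The sketch below runs that check on a small made-up frame (with one missing value planted on purpose) and also flags identifier-like columns. The "non-numeric and unique in every row" rule is only a rough heuristic assumed for illustration; real ID columns may repeat.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of columns as the notebook (values are made up).
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [983.09, np.nan, 67990.14, 120.50],
    "nameOrig": ["C1", "C2", "C3", "C4"],
    "nameDest": ["M1", "C5", "C6", "M2"],
})

# Count missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rough heuristic (an assumption): non-numeric columns whose values are
# all distinct behave like identifiers rather than features.
id_like = [
    col for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col])
    and toy[col].nunique() == len(toy)
]
print(id_like)
```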


## Q2

Do certain transaction types consistently differ in amount or fraud likelihood? If so, how might you transform the type column to make this pattern usable by a machine learning model? Apply your data transformations (if any) in the code-block below.

Yes, certain transaction types such as TRANSFER and CASH_OUT are much more likely to be fraudulent, while types like PAYMENT and DEBIT are almost never fraudulent. This pattern is critical for our model.
To make it usable, we applied one-hot encoding to the `type` column, converting it into numeric indicator features that the model can interpret.
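
One way to back up a statement like this is a per-type summary of fraud rate and typical amount. The sketch below uses a tiny made-up frame, so the numbers it prints say nothing about the real dataset; only the `groupby` pattern is the point.

```python
import pandas as pd

# Made-up transactions; the real fraud rates are far lower than in this toy data.
toy = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER", "CASH_OUT", "CASH_OUT",
             "PAYMENT", "PAYMENT", "DEBIT", "CASH_IN"],
    "amount": [900000.0, 1200.0, 750000.0, 300.0,
               983.09, 55215.25, 410.0, 220986.01],
    "isFraud": [1, 0, 1, 0, 0, 0, 0, 0],
})

# Fraud rate, median amount, and row count for each transaction type.
summary = toy.groupby("type").agg(
    fraud_rate=("isFraud", "mean"),
    median_amount=("amount", "median"),
    n=("isFraud", "size"),
)
print(summary.sort_values("fraud_rate", ascending=False))
```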


In [34]:
transform_data = pd.get_dummies(transform_data, columns=['type'], drop_first=True)

transform_data.to_csv("../data/cleaned_bank_transactions.csv", index=False)
transform_data


Unnamed: 0,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,983.09,36730.24,35747.15,0.00,0.00,0,0,False,False,True,False
1,55215.25,99414.00,44198.75,0.00,0.00,0,0,False,False,True,False
2,220986.01,7773074.97,7994060.98,924031.48,703045.48,0,0,False,False,False,False
3,2357394.75,0.00,0.00,4202580.45,6559975.19,0,0,False,False,False,True
4,67990.14,0.00,0.00,625317.04,693307.19,0,0,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...
999995,13606.07,114122.11,100516.04,0.00,0.00,0,0,False,False,True,False
999996,9139.61,0.00,0.00,0.00,0.00,0,0,False,False,True,False
999997,153650.41,50677.00,0.00,0.00,380368.36,0,0,True,False,False,False
999998,163810.52,0.00,0.00,357850.15,521660.67,0,0,True,False,False,False
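
The output above has no `type_CASH_IN` column because `drop_first=True` drops the first category in sorted order, so a CASH_IN row appears as all `False` (see row 2). A minimal sketch of that behaviour on a made-up five-row frame:

```python
import pandas as pd

# One made-up row per transaction type.
toy = pd.DataFrame({
    "type": ["CASH_IN", "CASH_OUT", "DEBIT", "PAYMENT", "TRANSFER"],
    "amount": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# drop_first=True removes the first category (CASH_IN), which becomes
# the implicit baseline encoded by all indicator columns being False.
encoded = pd.get_dummies(toy, columns=["type"], drop_first=True)
dummy_cols = [c for c in encoded.columns if c.startswith("type_")]

# Number of indicator columns set per row: 0 for CASH_IN, 1 for the rest.
flags_per_row = encoded[dummy_cols].sum(axis=1).tolist()
print(dummy_cols)
print(flags_per_row)  # [0, 1, 1, 1, 1]
```

Dropping one category avoids perfectly collinear indicator columns, which matters mostly for linear models; tree-based models generally work either way.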


## Q3

After exploring your data, you may have noticed that fraudulent transactions are rare compared to non-fraudulent ones. What challenges might this pose when training a machine learning model? What strategies could you use to ensure your model learns meaningful patterns from the minority class? Apply your data transformations (if any) in the code-block below.

The dataset is highly imbalanced: fewer than 0.15% of all transactions are fraudulent. With this imbalance, a model can score high accuracy by predicting non-fraud almost every time while ignoring the fraud cases. To address this, I can adjust class weights in the model and evaluate it with the F1 score, which balances precision and recall and is therefore better suited to imbalanced data than accuracy.
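
To make the two ideas above concrete, the sketch below computes inverse-frequency class weights and compares accuracy with F1 for a model that always predicts non-fraud. The labels are made up (10% fraud, far more than in the real data), `f1` is a hypothetical helper written here for illustration, and the weight formula is the common "balanced" heuristic, which to my understanding is also what scikit-learn's `class_weight="balanced"` uses.

```python
import numpy as np

# Made-up labels: 18 legitimate transactions and 2 fraudulent ones.
y_true = np.array([0] * 18 + [1] * 2)

# "Balanced" heuristic: n_samples / (n_classes * count_per_class).
counts = np.bincount(y_true)
class_weights = len(y_true) / (len(counts) * counts)
print(class_weights)  # the rare class gets the larger weight

def f1(y_true, y_pred):
    """F1 for the positive (fraud) class; 0.0 when it is undefined."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

# A "model" that always predicts non-fraud looks good on accuracy only.
y_pred_majority = np.zeros_like(y_true)
accuracy = float((y_pred_majority == y_true).mean())
f1_majority = f1(y_true, y_pred_majority)
print(accuracy, f1_majority)
```

The always-non-fraud baseline reaches 90% accuracy here but an F1 of zero, which is the failure mode described above.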

## Bonus (optional)

Are there interaction effects between variables (e.g., fraud and high amount and transaction type) that aren't captured directly in the dataset? Would it be helpful to manually engineer any new features that reflect these interactions? Apply your data transformations (if any) in the code-block below.

Yes, fraud is more likely to happen in TRANSFER or CASH_OUT transactions, especially when the amount is unusually high. To help the model learn this, I would create a feature that flags high-value TRANSFER and CASH_OUT transactions.
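
A minimal sketch of such a feature, assuming the one-hot columns `type_TRANSFER` and `type_CASH_OUT` from Q2 are present. The column name `high_value_risky_type` and the median cutoff are hypothetical choices for this toy example; on the real data a higher quantile, computed on the training split only, would likely make more sense.

```python
import pandas as pd

# Made-up rows in the post-encoding layout from Q2.
toy = pd.DataFrame({
    "amount": [500.0, 900000.0, 1200.0, 750000.0, 800000.0, 300.0],
    "type_CASH_OUT": [False, True, False, False, False, True],
    "type_TRANSFER": [False, False, True, True, False, False],
})

# Hypothetical cutoff: the median amount of this toy frame.
threshold = toy["amount"].quantile(0.5)

# Flag rows that are both a risky type and above the cutoff.
is_risky_type = toy["type_TRANSFER"] | toy["type_CASH_OUT"]
toy["high_value_risky_type"] = (
    is_risky_type & (toy["amount"] > threshold)
).astype(int)
print(toy)
```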

In [27]:
# write out newly transformed dataset to your folder
...