# Data Transform

In this notebook, we will ask you a series of questions to evaluate your findings from your EDA. Based on your response and justification, we will also ask you to apply a corresponding data transformation.

If you state that you will not apply any data transformations for a step, you must **justify** why your dataset or machine-learning model does not require that preprocessing step.

The bonus step is completely optional, but if you provide a sufficient feature engineering step in this project we will add `1000` points to your Kahoot leaderboard score.

You will write out this transformed dataframe as a `.csv` file to your `data/` folder.

**Note**: Again, this dataset is quite large. If you find that some data operations take too long to complete on your machine, simply use the `sample()` method to transform a subset of your data.

In [2]:
import pandas as pd
import numpy as np

## Q1

Does your dataset contain any missing values or "non-predictive" columns? If so, which adjustments should you make to ensure that your model has good predictive capabilities? Apply your data transformations (if any) in the code-block below.

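The dataset has no missing values (every column reports 0 in the check below), so no imputation is needed. Drop `nameOrig` and `nameDest` because they are account identifiers: in their raw form they are non-predictive, and a model that memorizes them is likely to overfit.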
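One way to back up the claim that identifier columns are non-predictive is to look at how many distinct values each column has relative to the number of rows. The sketch below uses a tiny made-up table (the values and the `toy` name are assumptions for illustration, not rows from `bank_transactions.csv`); a ratio close to 1 suggests the column is nearly unique per row and behaves like an ID.

```python
import pandas as pd

# Made-up rows that only mimic the shape of the real columns.
toy = pd.DataFrame({
    "nameOrig": ["C1", "C2", "C3", "C4", "C5", "C6"],
    "nameDest": ["M1", "M2", "C7", "C7", "M3", "C8"],
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT", "TRANSFER"],
})

# Share of distinct values per column: near 1.0 means "almost an ID".
unique_ratio = toy.nunique() / len(toy)
print(unique_ratio)
```

On the real data, the same two lines could be run on `transactions` before the `drop` call to confirm the decision.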

In [3]:
transactions = pd.read_csv("../data/bank_transactions.csv")

# Drop non-predictive
transactions_transformed = transactions.drop(['nameOrig', 'nameDest'], axis=1)

# Use a sample for faster processing (adjust n as needed)
transactions_transformed_sample = transactions_transformed.sample(n=50000, random_state=42)

# Verify
print("Sample shape:", transactions_transformed_sample.shape)
print("Missing values:\n", transactions_transformed_sample.isnull().sum())

# Save the transformed sample
transactions_transformed_sample.to_csv("../data/transactions_transformed_sample.csv", index=False)

Sample shape: (50000, 8)
Missing values:
 type              0
amount            0
oldbalanceOrg     0
newbalanceOrig    0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64


## Q2

Do certain transaction types consistently differ in amount or fraud likelihood? If so, how might you transform the type column to make this pattern usable by a machine learning model? Apply your data transformations (if any) in the code-block below.

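Yes, transaction types differ significantly in both amount and fraud rate. Fraud is highly concentrated in the `TRANSFER` and `CASH_OUT` types. Because `type` is a categorical string column, one-hot encode it so the model can use each type as its own feature.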
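The per-type comparison behind this answer can be sketched with a grouped summary. The table below is synthetic (the `toy` frame and its values are assumptions chosen only to illustrate the pattern); the column names `type`, `amount`, and `isFraud` match the dataset.

```python
import pandas as pd

# Synthetic rows; the amounts and labels are invented for illustration.
toy = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER", "CASH_OUT", "CASH_OUT",
             "PAYMENT", "PAYMENT", "CASH_IN", "DEBIT"],
    "amount": [9000.0, 150.0, 7000.0, 300.0, 40.0, 60.0, 500.0, 20.0],
    "isFraud": [1, 0, 1, 0, 0, 0, 0, 0],
})

# Per-type row count, mean amount, and share of fraudulent rows.
summary = (
    toy.groupby("type")
    .agg(
        n=("isFraud", "size"),
        mean_amount=("amount", "mean"),
        fraud_rate=("isFraud", "mean"),
    )
    .sort_values("fraud_rate", ascending=False)
)
print(summary)
```

Running the same `groupby` on the real sample would show whether the fraud rate is indeed nonzero only for a few types.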

In [4]:
# Reload the sample saved in Q1
transactions = pd.read_csv("../data/transactions_transformed_sample.csv")

# One-hot encode the categorical `type` column
transactions_encoded = pd.get_dummies(transactions, columns=['type'], prefix='type')

# Verify 
print("Columns after encoding:", transactions_encoded.columns.tolist())

# Save the new transformed sample
transactions_encoded.to_csv("../data/transactions_transformed_sample_encoded.csv", index=False)

Columns after encoding: ['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest', 'isFraud', 'isFlaggedFraud', 'type_CASH_IN', 'type_CASH_OUT', 'type_DEBIT', 'type_PAYMENT', 'type_TRANSFER']


## Q3

After exploring your data, you may have noticed that fraudulent transactions are rare compared to non-fraudulent ones. What challenges might this pose when training a machine learning model? What strategies could you use to ensure your model learns meaningful patterns from the minority class? Apply your data transformations (if any) in the code-block below.

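The model may simply predict non-fraud for everything and still score a high accuracy. Also, the fraud class may not have enough examples for the model to learn meaningful patterns. To address this, use a stratified train/test split so both sets keep the same fraud ratio, then oversample the minority class with SMOTE on the training data only, leaving the test set untouched.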
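The "predict non-fraud for everything" problem can be made concrete with a small sketch. It assumes a made-up label vector with 5 fraud cases out of 1,000 (an illustrative rate, not the dataset's actual one) and scores a baseline that always predicts 0.

```python
import numpy as np

# Assumed toy labels: 5 fraud cases among 1,000 transactions.
y_true = np.zeros(1000, dtype=int)
y_true[:5] = 1

# Baseline "model" that always predicts non-fraud.
y_pred = np.zeros_like(y_true)

# Accuracy looks excellent, but recall on the fraud class is zero.
accuracy = (y_pred == y_true).mean()
true_positives = ((y_pred == 1) & (y_true == 1)).sum()
recall = true_positives / (y_true == 1).sum()

print(f"accuracy = {accuracy:.3f}")
print(f"recall   = {recall:.3f}")
```

This is why metrics such as recall or precision on the fraud class are usually more informative than accuracy for this kind of problem.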

In [None]:
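# SMOTE oversampling (note: this step may belong in the model-training notebook instead)
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE  # requires the imbalanced-learn package

# Load the one-hot encoded sample from Q2; SMOTE needs all-numeric features
transactions = pd.read_csv("../data/transactions_transformed_sample_encoded.csv")

#  1. Split features and target
X = transactions.drop('isFraud', axis=1)
y = transactions['isFraud']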

#  2. Split into train/test 
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print("Original train fraud count:\n", y_train.value_counts())

#  3. Apply SMOTE to training data only
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print("After SMOTE train fraud count:\n", y_train_smote.value_counts())

# 4. Combine back to DataFrame for inspection
train_smote = pd.concat([X_train_smote, y_train_smote], axis=1)

#5. Save SMOTE version of training data
train_smote.to_csv("../data/transactions_transformed_train_SMOTE.csv", index=False)


## Bonus (optional)

Are there interaction effects between variables (e.g., fraud and high amount and transaction type) that aren't captured directly in the dataset? Would it be helpful to manually engineer any new features that reflect these interactions? Apply your data transformations (if any) in the code-block below.

Answer Here
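As one possible direction for this question, the sketch below builds two interaction-style features on a made-up table. The feature names `orig_balance_gap` and `amount_if_risky_type` are hypothetical, and whether they help would need to be checked on the real data; the input column names follow the encoded dataset from Q2.

```python
import pandas as pd

# Made-up rows using the encoded column names; values are illustrative only.
toy = pd.DataFrame({
    "amount": [1000.0, 200.0, 5000.0, 50.0],
    "oldbalanceOrg": [1000.0, 500.0, 5000.0, 100.0],
    "newbalanceOrig": [0.0, 300.0, 5000.0, 50.0],
    "type_TRANSFER": [True, False, True, False],
    "type_CASH_OUT": [False, True, False, False],
})

# Hypothetical feature: gap between the recorded new balance and the
# balance implied by subtracting the amount from the old balance.
toy["orig_balance_gap"] = toy["oldbalanceOrg"] - toy["amount"] - toy["newbalanceOrig"]

# Hypothetical feature: the amount, kept only for TRANSFER and CASH_OUT rows.
is_risky_type = toy["type_TRANSFER"] | toy["type_CASH_OUT"]
toy["amount_if_risky_type"] = toy["amount"].where(is_risky_type, 0.0)

print(toy[["orig_balance_gap", "amount_if_risky_type"]])
```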

In [2]:
# write out newly transformed dataset to your folder
...