# Data Transform

In this notebook, we will ask you a series of questions to evaluate your findings from your EDA. Based on your response and justification, you will then apply the corresponding data transformation.

If you state that you will not apply any data transformation for a given step, you must **justify** why your dataset or machine learning model does not require that preprocessing step.

The bonus step is completely optional, but if you provide a sufficient feature engineering step in this project we will add `1000` points to your Kahoot leaderboard score.

You will write out this transformed dataframe as a `.csv` file to your `data/` folder.

**Note**: This dataset is quite large. If some data operations take too long to complete on your machine, use the `sample()` method to transform a subset of your data.

In [1]:
import pandas as pd
import numpy as np

## Q1

Does your dataset contain any missing values or "non-predictive" columns? If so, which adjustments should you make to ensure that your model has good predictive capabilities? Apply your data transformations (if any) in the code-block below.

The dataset does not contain any missing values, but it does have "non-predictive" columns. Here we can drop the columns "nameOrig", "nameDest" and "isFlaggedFraud". The names of the accounts are identifiers and do not help with our predictions. Since there is only 1 transaction flagged under "isFlaggedFraud", I do not believe this column will help with our predictions either.
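The reasoning above can be checked with a few quick diagnostics. The sketch below runs on a small made-up frame (the rows are illustrative and not taken from `bank_transactions.csv`); the same calls could be pointed at `transactions` instead.

```python
import pandas as pd

# Toy stand-in for the transactions data; all values are made up for illustration.
toy = pd.DataFrame({
    "nameOrig": ["C1", "C2", "C3", "C4"],
    "nameDest": ["M1", "M2", "C5", "C6"],
    "amount": [100.0, 250.0, 4000.0, 75.0],
    "isFraud": [0, 0, 1, 0],
    "isFlaggedFraud": [0, 0, 0, 0],
})

# Identifier-like columns have close to one unique value per row,
# so a model cannot generalize from them.
unique_ratio = toy[["nameOrig", "nameDest"]].nunique() / len(toy)
print(unique_ratio)

# A flag that is (almost) always 0 carries very little signal.
flag_counts = toy["isFlaggedFraud"].value_counts()
print(flag_counts)

# Missing values per column.
missing = toy.isnull().sum()
print(missing)
```

A unique ratio near 1.0 and a flag column with almost no positive rows both support dropping those columns.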

In [2]:
# import data 
transactions = pd.read_csv("../data/bank_transactions.csv")

transactions.head()

| | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PAYMENT | 983.09 | C1454812978 | 36730.24 | 35747.15 | M1491308340 | 0.0 | 0.0 | 0 | 0 |
| 1 | PAYMENT | 55215.25 | C1031766358 | 99414.0 | 44198.75 | M2102868029 | 0.0 | 0.0 | 0 | 0 |
| 2 | CASH_IN | 220986.01 | C1451868666 | 7773074.97 | 7994060.98 | C1339195526 | 924031.48 | 703045.48 | 0 | 0 |
| 3 | TRANSFER | 2357394.75 | C458368123 | 0.0 | 0.0 | C620979654 | 4202580.45 | 6559975.19 | 0 | 0 |
| 4 | CASH_OUT | 67990.14 | C1098978063 | 0.0 | 0.0 | C142246322 | 625317.04 | 693307.19 | 0 | 0 |


In [3]:
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   type            1000000 non-null  object 
 1   amount          1000000 non-null  float64
 2   nameOrig        1000000 non-null  object 
 3   oldbalanceOrg   1000000 non-null  float64
 4   newbalanceOrig  1000000 non-null  float64
 5   nameDest        1000000 non-null  object 
 6   oldbalanceDest  1000000 non-null  float64
 7   newbalanceDest  1000000 non-null  float64
 8   isFraud         1000000 non-null  int64  
 9   isFlaggedFraud  1000000 non-null  int64  
dtypes: float64(5), int64(2), object(3)
memory usage: 76.3+ MB


In [4]:
transactions.isnull().sum()
# There are no null values in any column.

type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

In [5]:
transactions_clean = transactions.drop(columns=['nameOrig', 'nameDest', 'isFlaggedFraud'])
transactions_clean.head()

| | type | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud |
|---|---|---|---|---|---|---|---|
| 0 | PAYMENT | 983.09 | 36730.24 | 35747.15 | 0.0 | 0.0 | 0 |
| 1 | PAYMENT | 55215.25 | 99414.0 | 44198.75 | 0.0 | 0.0 | 0 |
| 2 | CASH_IN | 220986.01 | 7773074.97 | 7994060.98 | 924031.48 | 703045.48 | 0 |
| 3 | TRANSFER | 2357394.75 | 0.0 | 0.0 | 4202580.45 | 6559975.19 | 0 |
| 4 | CASH_OUT | 67990.14 | 0.0 | 0.0 | 625317.04 | 693307.19 | 0 |


In [15]:
from sklearn.preprocessing import OneHotEncoder

cat_features = ["type"]                              
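# Note: isFraud is the target, not a feature; it is listed here only so it is carried into the output dataframe
num_features = ["amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest", "isFraud"]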

X_cat = transactions_clean[cat_features]
X_num = transactions_clean[num_features]

X_cat.head()

| | type |
|---|---|
| 0 | PAYMENT |
| 1 | PAYMENT |
| 2 | CASH_IN |
| 3 | TRANSFER |
| 4 | CASH_OUT |


In [16]:
ohe = OneHotEncoder()
X_cat_full = ohe.fit_transform(X_cat).toarray()

X_cat_full

array([[0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0.],
       ...,
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.]], shape=(1000000, 5))

In [17]:
# we can also get our new column names
ohe.get_feature_names_out(['type'])

array(['type_CASH_IN', 'type_CASH_OUT', 'type_DEBIT', 'type_PAYMENT',
       'type_TRANSFER'], dtype=object)

In [18]:
cat_names = ohe.get_feature_names_out(['type'])

encoded_df = pd.DataFrame(X_cat_full, columns=cat_names, index=transactions_clean.index)

encoded_df.head()

| | type_CASH_IN | type_CASH_OUT | type_DEBIT | type_PAYMENT | type_TRANSFER |
|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


In [19]:
new_encoded_transactions = pd.concat([X_num, encoded_df], axis=1)

new_encoded_transactions

| | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | type_CASH_IN | type_CASH_OUT | type_DEBIT | type_PAYMENT | type_TRANSFER |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 983.09 | 36730.24 | 35747.15 | 0.00 | 0.00 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 55215.25 | 99414.00 | 44198.75 | 0.00 | 0.00 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 220986.01 | 7773074.97 | 7994060.98 | 924031.48 | 703045.48 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2357394.75 | 0.00 | 0.00 | 4202580.45 | 6559975.19 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 67990.14 | 0.00 | 0.00 | 625317.04 | 693307.19 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 999995 | 13606.07 | 114122.11 | 100516.04 | 0.00 | 0.00 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 999996 | 9139.61 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 999997 | 153650.41 | 50677.00 | 0.00 | 0.00 | 380368.36 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 999998 | 163810.52 | 0.00 | 0.00 | 357850.15 | 521660.67 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


In [20]:
new_encoded_transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 11 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   amount          1000000 non-null  float64
 1   oldbalanceOrg   1000000 non-null  float64
 2   newbalanceOrig  1000000 non-null  float64
 3   oldbalanceDest  1000000 non-null  float64
 4   newbalanceDest  1000000 non-null  float64
 5   isFraud         1000000 non-null  int64  
 6   type_CASH_IN    1000000 non-null  float64
 7   type_CASH_OUT   1000000 non-null  float64
 8   type_DEBIT      1000000 non-null  float64
 9   type_PAYMENT    1000000 non-null  float64
 10  type_TRANSFER   1000000 non-null  float64
dtypes: float64(10), int64(1)
memory usage: 83.9 MB


## Q2

Do certain transaction types consistently differ in amount or fraud likelihood? If so, how might you transform the type column to make this pattern usable by a machine learning model? Apply your data transformations (if any) in the code-block below.

Yes, certain transaction types, such as TRANSFER and CASH_OUT, show significantly higher fraud rates and transaction amounts.
We can use one-hot encoding to make the type column usable by our machine learning model. This allows the model to learn the fraud likelihood and amount behavior associated with each transaction type.
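To make the per-type comparison concrete, here is a minimal sketch on a made-up frame (the rows and resulting rates are illustrative only, not results from the dataset). It computes the fraud rate and mean amount per transaction type, then builds one indicator column per type with `pd.get_dummies`.

```python
import pandas as pd

# Toy data; values are invented purely to demonstrate the calls.
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "TRANSFER", "PAYMENT", "CASH_OUT"],
    "amount": [50.0, 9000.0, 7000.0, 100.0, 20.0, 300.0],
    "isFraud": [0, 1, 1, 0, 0, 0],
})

# Fraud rate and mean amount for each transaction type.
summary = toy.groupby("type").agg(
    fraud_rate=("isFraud", "mean"),
    mean_amount=("amount", "mean"),
)
print(summary)

# One-hot encode: one 0/1 column per transaction type.
dummies = pd.get_dummies(toy["type"], prefix="type", dtype=float)
print(dummies)
```

Each row of `dummies` has exactly one column set to 1.0, so a model can learn a separate effect for each type without assuming any ordering between them.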


In [11]:
# Dummy encode transaction type with pandas (left commented out: the OneHotEncoder cells in Q1 already encode `type`)

#encoded_transactions = pd.get_dummies(transactions_clean, columns=['type'], drop_first=True)

#encoded_transactions

## Q3

After exploring your data, you may have noticed that fraudulent transactions are rare compared to non-fraudulent ones. What challenges might this pose when training a machine learning model? What strategies could you use to ensure your model learns meaningful patterns from the minority class? Apply your data transformations (if any) in the code-block below.

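Because the data is so imbalanced, a machine learning model can reach high accuracy by predicting "not fraud" for every transaction, which means it may effectively ignore the minority class. To correct this, we can apply SMOTE to the training data to balance the classes, while leaving the test data untouched so that evaluation still reflects the real class distribution.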
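The imbalance problem can be illustrated on a toy frame. The sketch below is not SMOTE: it uses plain random oversampling with pandas (duplicating minority rows with replacement) as a simpler stand-in, and the data is made up for illustration. In practice this kind of resampling would be applied to the training split only.

```python
import pandas as pd

# Toy data with an 8:2 class imbalance; values are invented.
toy = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 9000.0, 9500.0],
    "isFraud": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

# Accuracy of a "model" that always predicts not-fraud.
baseline_accuracy = (toy["isFraud"] == 0).mean()
print(baseline_accuracy)

majority = toy[toy["isFraud"] == 0]
minority = toy[toy["isFraud"] == 1]

# Random oversampling: draw minority rows with replacement
# until both classes have the same number of rows.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=37)
balanced = pd.concat([majority, minority_upsampled]).reset_index(drop=True)

class_counts = balanced["isFraud"].value_counts()
print(class_counts)
```

Unlike random oversampling, SMOTE synthesizes new minority points by interpolating between neighbours instead of repeating existing rows; the balancing goal is the same.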

In [12]:
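from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
# Assumption: an 80/20 stratified split; SMOTE is fit on the training portion only
X = new_encoded_transactions.drop(columns=["isFraud"])
y = new_encoded_transactions["isFraud"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=37)
smote = SMOTE(k_neighbors=3, random_state=37)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
print("Class distribution after SMOTE:")
print(y_train_smote.value_counts())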

## Bonus (optional)

Are there interaction effects between variables (e.g., fraud and high amount and transaction type) that aren't captured directly in the dataset? Would it be helpful to manually engineer any new features that reflect these interactions? Apply your data transformations (if any) in the code-block below.

Answer Here
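As one possible starting point for this question, the sketch below builds three hypothetical interaction features on a made-up frame. The feature names (`orig_balance_error`, `emptied_origin`, `risky_type_amount`) and the choice of TRANSFER and CASH_OUT as the "risky" types are assumptions for illustration, not results verified on the dataset.

```python
import pandas as pd

# Toy data; values are invented purely to demonstrate the calls.
toy = pd.DataFrame({
    "type": ["TRANSFER", "PAYMENT", "CASH_OUT"],
    "amount": [1000.0, 50.0, 500.0],
    "oldbalanceOrg": [1000.0, 200.0, 0.0],
    "newbalanceOrig": [0.0, 150.0, 0.0],
})

# Hypothetical feature: gap between the expected and the recorded
# origin balance after the transaction (0 when the books add up).
toy["orig_balance_error"] = toy["oldbalanceOrg"] - toy["amount"] - toy["newbalanceOrig"]

# Hypothetical feature: the transaction emptied a non-empty origin account.
toy["emptied_origin"] = ((toy["oldbalanceOrg"] > 0) & (toy["newbalanceOrig"] == 0)).astype(int)

# Hypothetical interaction: amount counted only for the assumed risky types.
risky_types = ["TRANSFER", "CASH_OUT"]
toy["risky_type_amount"] = toy["amount"] * toy["type"].isin(risky_types)

print(toy)
```

Whether any of these features actually helps would need to be checked against the real data, for example by comparing their distributions between fraudulent and non-fraudulent rows.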

In [21]:
# write out newly transformed dataset to your folder
new_encoded_transactions.to_csv("../data/new_hot_dataframe.csv", index=False)