
In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os

# Base directory where your project folders live
drive_base = '/content/drive/MyDrive'
# Path to the folder containing your EDA notebook
eda_folder = os.path.join(drive_base, 'data', 'raw', 'Capstone 2 - Data Wrangling')
eda_notebook_path = os.path.join(eda_folder, 'EDA new-capstone.ipynb')
print('EDA notebook path:', eda_notebook_path)

EDA notebook path: /content/drive/MyDrive/data/raw/Capstone 2 - Data Wrangling/EDA new-capstone.ipynb


In [None]:
# Step 2: Imports for loading, merging, splitting, and scaling the data
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
import zipfile
zip_path = '/content/drive/MyDrive/Capstone1/Capstone 2 - Data Wrangling/ieee-fraud-detection_project/data/raw/Archive.zip'
extract_path = '/content/data_extracted'
os.makedirs(extract_path, exist_ok=True)

# Extract the zip file
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

# Optional: inspect the extracted files (wrap in print() to display the listing mid-cell)
os.listdir(extract_path)

# Build the CSV paths from extract_path instead of repeating the directory
transaction_path = os.path.join(extract_path, 'train_transaction.csv')
identity_path = os.path.join(extract_path, 'train_identity.csv')

print("train_transaction.csv exists:", os.path.exists(transaction_path))
print("train_identity.csv exists:", os.path.exists(identity_path))

train_transaction.csv exists: True
train_identity.csv exists: True


In [None]:
# pandas, train_test_split, and StandardScaler are already imported above;
# SMOTE is the only new import needed here
from imblearn.over_sampling import SMOTE

To keep memory usage manageable and avoid crashing the Colab environment, I loaded only the first 100,000 rows of each CSV file using the `nrows` parameter of `pd.read_csv()`.

I then merged `train_transaction.csv` and `train_identity.csv` on their shared `TransactionID` column with a left join. A left join keeps every transaction record and adds identity information wherever it is available; transactions without a matching identity row get missing values in the identity columns. The resulting dataframe, `df`, has 100,000 rows and 434 columns covering both transaction and identity features, which gives the fraud detection model a single combined view of each transaction.
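As a small illustration of the left join described above, the sketch below merges two made-up tables (the values and the tiny sizes are assumptions, not the real data). The `indicator=True` option adds a `_merge` column that shows which transactions found an identity row.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (values are made up).
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "TransactionAmt": [50.0, 20.0, 75.5, 10.0],
})
identity = pd.DataFrame({
    "TransactionID": [2, 4, 9],  # ID 9 has no matching transaction
    "DeviceType": ["mobile", "desktop", "mobile"],
})

# Left join: every transaction row is kept; identity rows without a
# matching transaction (ID 9) are dropped.
merged = transactions.merge(
    identity, how="left", on="TransactionID", indicator=True
)

print(merged)
print(merged["_merge"].value_counts())
```

Rows tagged `left_only` are transactions with no identity record; their identity columns are `NaN`, which is why the merged frame needs a missing-value step later.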

In [None]:
# 1. Load and merge data
n_rows = 100000
transaction_df = pd.read_csv(f'{extract_path}/train_transaction.csv', nrows=n_rows)
identity_df = pd.read_csv(f'{extract_path}/train_identity.csv', nrows=n_rows)
df = transaction_df.merge(identity_df, how='left', on='TransactionID')

print("Merged shape:", df.shape)

Merged shape: (100000, 434)


In [None]:
# 3. Separate target variable
y = df['isFraud']
X = df.drop(['isFraud'], axis=1)

# 4. Encode categorical features
categorical_cols = ['ProductCD', 'card4', 'card6', 'M1', 'M2', 'M3', 'M4', 'M5', 'M6',
                    'M7', 'M8', 'M9', 'id_12', 'id_15', 'id_16', 'id_28', 'id_29',
                    'id_35', 'DeviceType']

X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)

Next, I split the data into training and test sets with `train_test_split`:

* 80% of the data goes into `X_train` and `y_train` (for training the model).
* 20% goes into `X_test` and `y_test` (for evaluating how well the model generalizes).

I included two important parameters:

- `stratify=y`: This ensures that the class distribution of the target variable `isFraud` is preserved in both the train and test sets. Since fraud is a highly imbalanced class (only about 2.6% of transactions are fraudulent), stratification is critical.

- `random_state=42`: This ensures that the split is reproducible every time I run the code.
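To see what `stratify=y` does in isolation, here is a minimal sketch on synthetic labels. The 1,000 samples and the 2.5% positive rate are assumptions chosen so the counts divide evenly, and the feature column is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 1,000 samples with exactly 2.5% positives (made-up data).
y_demo = np.array([1] * 25 + [0] * 975)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# With stratification, both splits keep the original positive rate.
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Without `stratify`, the 200-row test set could by chance receive noticeably more or fewer than 5 positives, which makes evaluation metrics on a rare class less stable.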

In [None]:
# 5. Handle missing values (simple version: fill every missing value with the sentinel -999)
X = X.fillna(-999)

# 6. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

In [None]:
# Check the types of columns
print(X_train.dtypes)

TransactionID          int64
TransactionDT          int64
TransactionAmt       float64
card1                  int64
card2                float64
                      ...   
id_16_NotFound          bool
id_28_New               bool
id_29_NotFound          bool
id_35_T                 bool
DeviceType_mobile       bool
Length: 441, dtype: object


In [None]:
# If there's still text data, we need to handle it by encoding it (we'll use get_dummies for all categorical columns)
categorical_cols = X_train.select_dtypes(include=['object']).columns
X_train = pd.get_dummies(X_train, columns=categorical_cols, drop_first=True)
X_test = pd.get_dummies(X_test, columns=categorical_cols, drop_first=True)

# Re-align the columns in X_train and X_test in case there are mismatches after one-hot encoding
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)

# Now we can apply scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
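
The re-alignment step in the cell above can be illustrated with two toy frames. The `device` column and its values are made up for this sketch; the point is only to show what `align(..., join='left', fill_value=0)` does when the two splits produce different dummy columns.

```python
import pandas as pd

# Made-up categorical column: the test split lacks "desktop" and
# contains a category ("tablet") never seen in training.
train_demo = pd.DataFrame({"device": ["mobile", "desktop", "mobile"]})
test_demo = pd.DataFrame({"device": ["mobile", "tablet"]})

train_enc = pd.get_dummies(train_demo, columns=["device"])
test_enc = pd.get_dummies(test_demo, columns=["device"])

# join="left" keeps exactly the training columns: columns missing from the
# test frame are added and filled with 0, and test-only columns are dropped.
train_enc, test_enc = train_enc.align(
    test_enc, join="left", axis=1, fill_value=0
)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

After alignment both frames share the same columns in the same order, which is what `scaler.transform` needs when it is applied to the test set.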

# ✅ Class Distribution in Train and Test Sets
I checked the class distribution of the target variable (isFraud) in both the training and test sets. The output shows that both the training and test sets have a significant class imbalance, with the majority class (non-fraudulent transactions) comprising approximately 97.4%, and the minority class (fraudulent transactions) only around 2.6%.

***Train set class distribution:***

* Non-fraudulent (0): 97.44%
* Fraudulent (1): 2.56%

***Test set class distribution:***

* Non-fraudulent (0): 97.44%
* Fraudulent (1): 2.56%

Given this imbalance, a model can reach high accuracy simply by predicting the majority class (non-fraudulent) almost every time, while detecting very few fraudulent transactions. This highlights the need for techniques like SMOTE to address the class imbalance before training the model.
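
The arithmetic behind that concern can be sketched directly. The test-set size (20% of 100,000 rows) and the 2.56% fraud rate come from the split above; the always-predict-zero baseline is a hypothetical model used only for illustration.

```python
# Test-set size and fraud rate from the split above.
n_test = 20_000
n_fraud = round(n_test * 0.0256)  # number of fraudulent transactions
n_legit = n_test - n_fraud

# Hypothetical baseline that labels every transaction as non-fraudulent.
true_positives = 0
true_negatives = n_legit
false_negatives = n_fraud

accuracy = (true_positives + true_negatives) / n_test
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.4f}")
print(f"recall   = {recall:.4f}")
```

The baseline scores about 97.4% accuracy while catching no fraud at all, so accuracy alone is a poor guide here and recall on the fraud class deserves attention.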

In [None]:
# Check class distribution in train and test sets
print("Train set class distribution:")
print(y_train.value_counts(normalize=True))

print("\nTest set class distribution:")
print(y_test.value_counts(normalize=True))

Train set class distribution:
isFraud
0    0.974387
1    0.025612
Name: proportion, dtype: float64

Test set class distribution:
isFraud
0    0.9744
1    0.0256
Name: proportion, dtype: float64


# 🧼 Balancing the Dataset and Finalizing Preprocessed DataFrames

To address the severe class imbalance in my training dataset, where fraudulent transactions made up only about 2.56% of the data, I applied SMOTE (Synthetic Minority Over-sampling Technique) to the training data only, so the test set keeps its original class distribution for evaluation. The final print statement confirms that the training set is now balanced, with each class making up 50% of the samples. After these steps the data is encoded, scaled, and balanced, ready for model training.
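
For intuition, the core idea behind SMOTE is to create new minority samples by interpolating between an existing minority point and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea on made-up 2-D points; the helper `synthesize` is hypothetical and is not how `imblearn` implements `SMOTE` internally.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up minority-class points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [1.5, 1.2], [0.8, 1.6], [1.2, 0.9]])

def synthesize(points, k=2):
    """Hypothetical helper: build one synthetic point by interpolation."""
    base = points[rng.integers(len(points))]
    # Distances from the base point to every minority point (itself included).
    dists = np.linalg.norm(points - base, axis=1)
    neighbour_idx = np.argsort(dists)[1:k + 1]  # skip position 0, the base itself
    neighbour = points[rng.choice(neighbour_idx)]
    lam = rng.random()  # interpolation weight in [0, 1)
    return base + lam * (neighbour - base)

synthetic = np.array([synthesize(minority) for _ in range(5)])
print(synthetic)
```

Each synthetic point lies on the line segment between two real minority points, so oversampling adds plausible variation instead of exact duplicates.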



In [None]:
# 8. Apply SMOTE to training data only
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

In [None]:
# 9. Wrap everything into DataFrames again (optional, for convenience)
X_train_df = pd.DataFrame(X_train_resampled, columns=X_train.columns)
X_test_df = pd.DataFrame(X_test_scaled, columns=X_train.columns)
y_train_df = pd.DataFrame(y_train_resampled, columns=['isFraud'])
y_test_df = pd.DataFrame(y_test.values, columns=['isFraud'])

# 10. Check final class balance
print("Final class distribution in training set after SMOTE:")
print(y_train_df['isFraud'].value_counts(normalize=True))

Final class distribution in training set after SMOTE:
isFraud
0    0.5
1    0.5
Name: proportion, dtype: float64
