In [None]:
# Import all dependencies
import numpy as np
import pandas as pd

In [None]:
# Read the train_identity.csv file
train_identity_df = pd.read_csv('datasets/train_identity.csv')
train_identity_df.head()

In [None]:
# Read the train_transaction.csv file
train_transaction_df = pd.read_csv('datasets/train_transaction.csv')
train_transaction_df.head()

# Feature Engineering
## Questions
- What does feature engineering contribute to the training of the model?
- What are the different types of feature engineering?
  - Why should you do a specific type of feature engineering?
  - How does it benefit the model's performance?
  - When should you do specific types of feature engineering?

## Answers
Feature engineering matters because of the "garbage in, garbage out" problem: a model can only be as good as the features it is trained on. Refining features can significantly improve model performance. Feature engineering is made up of several processes.

### Processes
1. Feature Creation - Create new features
   1. Domain-Specific: From industry knowledge like business rules
   2. Data-Driven: Derived from recognized patterns
   3. Synthetic: From combining existing features
2. Feature Transformation - Adjusts features
   1. Normalization & Scaling: Adjust the range of features for consistency
   2. Encoding: Convert categorical data to numerical data (e.g. one-hot encoding)
   3. Mathematical Transformations: Like logarithmic transformations for skewed data
3. Feature Extraction - To reduce dimensionality and simplify model
   1. Dimensionality Reduction: Reduce features while preserving important information (PCA technique)
   2. Aggregation & Combination: Summing/averaging features to simplify the model
4. Feature Selection - Choosing a subset of relevant features to use
   1. Filter methods: Based on statistical measures like correlation
   2. Wrapper methods: Train the model on candidate feature subsets and keep the subset that performs best (e.g. recursive feature elimination)
   3. Embedded methods: Selection happens as part of model training, so it is learned rather than manual (e.g. L1/Lasso regularization, tree-based feature importances)
5. Feature Scaling - Ensuring that all features contribute equally to the model
   1. Min-max scaling: Rescales values to a fixed range like 0 to 1
   2. Standard scaling: Standardizes each feature to a mean of 0 and a variance of 1
      - Note: Scaling is applied to every numeric feature so that the model is not biased towards features with inherently larger values, e.g. salary compared to age.
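
The transformation and scaling items above can be sketched with plain pandas and NumPy. This is a minimal illustration, not the approach used later in this notebook: `toy_df` and its `age` and `salary` values are made up, and scikit-learn's `MinMaxScaler` and `StandardScaler` would do the same job.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the competition dataset)
toy_df = pd.DataFrame({
    "age": [22.0, 35.0, 47.0, 58.0],
    "salary": [30_000.0, 52_000.0, 88_000.0, 250_000.0],
})

# Mathematical transformation: log1p compresses the long right tail of salary
toy_df["salary_log"] = np.log1p(toy_df["salary"])

# Min-max scaling: rescale each column to the range [0, 1]
features = toy_df[["age", "salary"]]
min_max_scaled = (features - features.min()) / (features.max() - features.min())

# Standard scaling: subtract the column mean, then divide by the column
# standard deviation (ddof=0, i.e. the population standard deviation)
standard_scaled = (features - features.mean()) / features.std(ddof=0)

print(min_max_scaled)
print(standard_scaled)
```

After either scaling, `age` and `salary` sit on comparable ranges, so neither dominates purely because of its units.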

### Steps
1. Data Cleaning
   - Handling missing values (imputation: fill empty cells with the mean, median, or mode of the other values in the same column)
   - Find outliers and handle them
     - Cap them at a statistical threshold, such as a chosen maximum or minimum
     - Apply a transformation to the feature, like log or square root
     - Drop the outliers from the dataset
     - Note: What is the best approach for handling outliers in our use-case? I feel like we should keep them
2. Data Transformation
   - Encoding categorical variables
     - One-hot encoding: Split the feature into one binary column per category, e.g. gender becomes a male column and a female column
     - Label encoding: Assign an arbitrary integer to each category label
     - Ordinal encoding: Assign integers that follow the natural order of the categories (if applicable)
     - Target encoding: Replace each category with the mean of the target variable over the rows in that category
3. Feature Extraction
4. Feature Selection
5. Feature Iteration
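
A rough sketch of the cleaning and encoding steps above, on made-up data. The column names (`amount`, `card_type`, `is_fraud`) and the 5th/95th percentile caps are assumptions chosen for illustration, not values taken from the dataset or a recommendation for it.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical; column names are invented for this example)
toy_df = pd.DataFrame({
    "amount": [10.0, 12.0, np.nan, 11.0, 500.0, 9.0],
    "card_type": ["debit", "credit", "debit", None, "credit", "debit"],
    "is_fraud": [0, 0, 0, 1, 1, 0],
})

# Imputation: median for the numeric column, mode for the categorical one
toy_df["amount"] = toy_df["amount"].fillna(toy_df["amount"].median())
toy_df["card_type"] = toy_df["card_type"].fillna(toy_df["card_type"].mode()[0])

# Outlier handling: cap values at the 5th and 95th percentiles (assumed cutoffs)
lower, upper = toy_df["amount"].quantile([0.05, 0.95])
toy_df["amount_capped"] = toy_df["amount"].clip(lower=lower, upper=upper)

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy_df["card_type"], prefix="card_type")

# Label encoding: an arbitrary integer code per category
toy_df["card_type_label"] = toy_df["card_type"].astype("category").cat.codes

# Target encoding: replace each category with the mean target of its rows
category_means = toy_df.groupby("card_type")["is_fraud"].mean()
toy_df["card_type_target"] = toy_df["card_type"].map(category_means)

print(toy_df)
print(one_hot)
```

One caveat on target encoding: the category means should be computed on training data only (ideally per cross-validation fold), otherwise the target leaks into the feature.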

Sources:
- [GeeksforGeeks - What is Feature Engineering](https://www.geeksforgeeks.org/machine-learning/what-is-feature-engineering/)
- [DataCamp - Feature Engineering in Machine Learning](https://www.datacamp.com/tutorial/feature-engineering)

In [None]:
# Print the first 20 rows of the transaction data
print(train_transaction_df.head(20))

In [None]:
# Print the first 20 rows of the identity data
print(train_identity_df.head(20))

In [None]:
# Exploring the datasets
# Check whether any TransactionID repeats in train_transaction or train_identity
# (no repeats if the unique count equals the row count)
print(f"Number of unique TransactionID values in train_transaction: {train_transaction_df['TransactionID'].nunique()}")
print(f"Number of unique TransactionID values in train_identity: {train_identity_df['TransactionID'].nunique()}")
print(f"Total rows in train_transaction: {len(train_transaction_df)}")
print(f"Total rows in train_identity: {len(train_identity_df)}")

In [None]:
# Class balance: percentage of rows for each isFraud value
print(train_transaction_df['isFraud'].value_counts(normalize=True) * 100)

# Dataset Exploration

[Kaggle discussion regarding dataset columns](https://www.kaggle.com/competitions/ieee-fraud-detection/discussion/101203)

In [None]:
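# Left-join identity onto transactions so every transaction row is kept;
# identity columns are NaN where a transaction has no identity record
train_df = pd.merge(train_transaction_df, train_identity_df, on='TransactionID', how='left')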
print(train_df.shape)
print(train_df.head())

In [None]:
# Report unique and missing value counts for each column that has missing values
for col in train_df.columns:
    unique_values = train_df[col].nunique()
    missing_values = train_df[col].isnull().sum()
    if missing_values > 0:
        print(f"Column: {col}, Unique Values: {unique_values}, Missing Values: {missing_values}")