# LLM Study Group Notebook: Kaggle Competition - Fraud Detection

This notebook is intended for the LLM Study Group and focuses on the Kaggle competition for fraud detection.

## Overview

The **IEEE-CIS Fraud Detection** competition is hosted on Kaggle by the IEEE Computational Intelligence Society (IEEE-CIS), with transaction data provided by Vesta Corporation. The goal is to build machine learning models that accurately identify fraudulent transactions.

## Competition Details

- **Host**: Kaggle and the IEEE Computational Intelligence Society (IEEE-CIS), with data provided by Vesta Corporation
- **Objective**: Create a machine learning model to detect fraudulent transactions.
- **Data Source**: Kaggle competition dataset

## Sections

### 0. Data Import from Kaggle

- **Description**: Import the dataset provided by the IEEE-CIS Fraud Detection competition. This data will be used for building and evaluating the fraud detection model.
- **Resource**: [Kaggle Competition Data](https://www.kaggle.com/competitions/ieee-fraud-detection/data)


### 1. Exploratory Data Analysis (EDA)

- **Description**: Analyze the dataset to understand its structure, distribution, and any underlying patterns.
- **Tasks**:
  - Data Cleaning
  - Feature Exploration
  - Visualization
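
As a starting point for the data cleaning and feature exploration tasks, the sketch below computes the missing-value rate per column and the share of fraudulent rows. It is a minimal, dependency-free illustration: the sample rows are made up, and the column names (`isFraud`, `TransactionAmt`, `card4`) are assumptions about the competition files rather than something this notebook has verified yet.

```python
import csv
import io

def summarize_columns(rows, target="isFraud"):
    """Return (missing_rate_per_column, positive_rate) for a list of dict rows."""
    if not rows:
        raise ValueError("no rows to summarize")
    n = len(rows)
    missing = {col: 0 for col in rows[0]}
    for row in rows:
        for col in missing:
            # csv.DictReader yields "" for empty cells and None for short rows.
            if row.get(col) in ("", None):
                missing[col] += 1
    missing_rate = {col: count / n for col, count in missing.items()}
    # Assumption: the target column holds "1" for fraud and "0" otherwise.
    positive_rate = sum(1 for row in rows if row[target] == "1") / n
    return missing_rate, positive_rate

# Tiny made-up sample; the real files are far larger.
sample_csv = """TransactionID,isFraud,TransactionAmt,card4
1,0,50.0,visa
2,1,,mastercard
3,0,20.5,
4,0,99.9,visa
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
missing_rate, positive_rate = summarize_columns(rows)
print(missing_rate)
print(positive_rate)
```

On the real data a dataframe library would be the more practical choice; the point here is only to show which two numbers are worth looking at first.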
  
### 2. Model Selection

- **Description**: Select and implement suitable machine learning algorithms to tackle the fraud detection problem. Evaluate different models to determine which best meets the objectives of the competition.

- **Tasks**:
  - **Model Comparison**:
    - **Neural Network (NN) Model**:
      - **Layer Normalization**: Explore the impact of layer normalization on model performance and stability.
    - **Gradient Boosting Decision Trees (GBDT) Model**:
      - **Overview**: Evaluate the effectiveness of GBDT models, such as XGBoost or LightGBM.
    - **Rationale**: Justify the choice of models based on their performance, interpretability, and suitability for the fraud detection task.
  - **Hyperparameter Tuning**:
    - Optimize hyperparameters for each selected model to improve performance and prevent overfitting.
  - **Model Training and Validation**:
    - Train models using the training dataset.
    - Validate models using cross-validation or a validation set to assess their generalization capability.
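
For the cross-validation task, one option with a heavily imbalanced target is to stratify the folds so each one keeps roughly the same fraud ratio. The sketch below is a plain-Python illustration of that idea; the function name, the fold count, and the seed are hypothetical choices, not settings the study group has agreed on.

```python
import random
from collections import defaultdict

def stratified_kfold_indices(labels, n_splits=5, seed=0):
    """Yield (train_idx, valid_idx) pairs that keep the class ratio per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a similar share.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    for k in range(n_splits):
        valid_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        yield train_idx, valid_idx

# Made-up labels: 10 fraudulent rows out of 100.
labels = [1] * 10 + [0] * 90
splits = list(stratified_kfold_indices(labels, n_splits=5))
for train_idx, valid_idx in splits:
    frauds = sum(labels[i] for i in valid_idx)
    print(len(train_idx), len(valid_idx), frauds)
```

If the transactions are ordered in time, a time-based split may give a more realistic estimate than shuffled folds, so this is worth comparing against before settling on a scheme.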

### 3. Metrics & Error Analysis

- **Description**: Assess model performance using various metrics and perform error analysis to identify areas for improvement.

- **Tasks**:
  - **Performance Metrics**:
    - **AUC-ROC**: Evaluate the model’s ability to distinguish between fraudulent and non-fraudulent transactions.
    - **Log Loss**: Measure how closely the predicted probabilities match the true labels; confident wrong predictions are penalized heavily.
    - **Recall and Precision**: Recall is the share of fraudulent transactions the model catches; precision is the share of transactions flagged as fraud that are actually fraudulent.
  - **Comparison of Metrics**:
    - Discuss the advantages and limitations of each metric in the context of the fraud detection task.
  - **Error Analysis**:
    - Analyze misclassified transactions to understand where and why errors occur.
  - **Model Interpretation**:
    - Interpret model predictions and feature importances to gain insights into the model's decision-making process and identify potential areas for improvement.
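
To make the metric definitions above concrete, here is a small from-scratch sketch of precision, recall, log loss, and AUC-ROC. The labels and scores are made up, the 0.5 threshold is an arbitrary assumption, and the pairwise AUC is written for readability rather than speed.

```python
import math

def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels and hard 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

def roc_auc(y_true, y_prob):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    if not pos or not neg:
        raise ValueError("need both classes")
    wins = sum(
        1.0 if p > n else (0.5 if p == n else 0.0) for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Made-up example: two legitimate and two fraudulent transactions.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed threshold of 0.5

print(precision_recall(y_true, y_pred))  # (1.0, 0.5)
print(roc_auc(y_true, y_prob))           # 0.75
print(log_loss(y_true, y_prob))
```

Note how the metrics disagree on the same predictions: precision is perfect, recall is only half, and AUC sits in between because it ignores the threshold entirely. The pairwise AUC loop is quadratic in the number of rows, so a library implementation would be needed on the full dataset.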


### 4. Enhancement Rollout Plan

- **Description**: Plan for enhancing the model and improving data quality. This includes identifying and addressing areas for improvement in both the model and the data, and ensuring a smooth transition to deployment.

- **Tasks**:
  - **Identify Areas for Improvement**:
    - **Model Performance**: Analyze current model performance and identify specific areas for improvement, such as accuracy, precision, recall, or other relevant metrics.
    - **Data Quality**: Evaluate data quality and identify any issues such as missing values, outliers, or data imbalances.
    - **Feature Engineering**: Determine if additional feature engineering or new features could enhance model performance.
  - **Implement Enhancements**:
    - **Model Enhancements**:
      - Apply advanced techniques such as hyperparameter tuning, model ensembling, or exploring alternative algorithms to improve performance.
    - **Data Improvements**:
      - **Data Cleaning**: Address issues in the dataset by handling missing values, correcting inconsistencies, and removing or mitigating outliers.
      - **Feature Engineering**: Develop new features or modify existing ones to provide more meaningful inputs to the model, enhancing its predictive power.
      - **Data Augmentation**: Increase the diversity and volume of the training data through techniques such as oversampling underrepresented classes, generating synthetic samples, or incorporating additional relevant data.
      - **Re-calibration (if needed)**: Adjust the model’s probability outputs or decision thresholds to improve performance metrics or better align with business objectives.
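
As one possible reading of the data augmentation item, the sketch below randomly duplicates minority-class rows until both classes have the same size. It is a plain-Python illustration with made-up rows; the function name and the target 1:1 ratio are assumptions, and synthetic-sample methods are a separate technique not shown here.

```python
import random

def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes")
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Sample with replacement from the minority class to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(labels))) + extra
    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Made-up data: 2 fraudulent rows and 8 legitimate rows.
rows = [[i] for i in range(10)]
labels = [1, 1] + [0] * 8
new_rows, new_labels = random_oversample(rows, labels)
print(len(new_rows), sum(new_labels))  # 16 8
```

Oversampling should be applied to the training fold only, never to validation data. Because it changes the class ratio the model sees, the predicted probabilities tend to be inflated afterwards, which is one reason the re-calibration step may be needed.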


## Section 0: Data Import from Kaggle

In [26]:
# One-time setup (uncomment to run); kaggle.json is read from the working directory below.
# !pip install kaggle
# !mv ~/Downloads/kaggle.json .
# !chmod 600 kaggle.json

In [7]:
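# Set up the Kaggle API credentials: read kaggle.json from the working directory
# and expose its values as the environment variables the Kaggle CLI looks for.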

import os
import json

kaggle_json_path = 'kaggle.json'

with open(kaggle_json_path) as f:
    kaggle_json = json.load(f)

os.environ['KAGGLE_USERNAME'] = kaggle_json['username']
os.environ['KAGGLE_KEY'] = kaggle_json['key']

In [6]:
!kaggle competitions download -c ieee-fraud-detection

Downloading ieee-fraud-detection.zip to /Users/wei/Documents/workspace/studyGroup
100%|████████████████████████████████████████| 118M/118M [00:03<00:00, 37.5MB/s]
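
The download leaves a single zip archive in the working directory, so a likely next step is to unpack it. The helper below is a minimal sketch using only the standard library; the function name and the `data` output folder are hypothetical, and the archive name is taken from the download output above.

```python
import zipfile
from pathlib import Path

def extract_archive(zip_path, out_dir):
    """Extract every member of zip_path into out_dir and return the sorted member names."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return sorted(archive.namelist())

# Usage in the notebook (the "data" folder is an arbitrary choice):
# extract_archive("ieee-fraud-detection.zip", "data")
```

Returning the member names gives a quick check of which CSV files the archive actually contains before any loading code is written.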


## Section 1: Exploratory Data Analysis (EDA)

## Section 2: Model Selection

In [None]:
### What is the definition of batch/layer normalization? Visualization?

In [1]:
### What is layer normalization?


In [None]:
### Comparison of normalization methods
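
One way to see the difference between the two methods is to apply both to the same small batch. The sketch below is a simplified, dependency-free illustration: it covers only the training-time normalization step and leaves out the learnable scale and shift parameters and the running statistics that real layers keep. The batch values are made up.

```python
import math

def _standardize(values, eps=1e-5):
    """Shift to zero mean and scale to (roughly) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(batch, eps=1e-5):
    """Normalize each sample (row) across its own features."""
    return [_standardize(row, eps) for row in batch]

def batch_norm(batch, eps=1e-5):
    """Normalize each feature (column) across the samples in the batch."""
    columns = [_standardize(list(col), eps) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

# Two samples with three features; the second sample is 10x the first.
batch = [
    [1.0, 2.0, 3.0],
    [10.0, 20.0, 30.0],
]
ln = layer_norm(batch)
bn = batch_norm(batch)
print(ln)  # both rows come out nearly identical: scale differences between samples vanish
print(bn)  # each column becomes roughly [-1, 1]: statistics depend on the other sample
```

Layer normalization uses only the sample itself, so its output does not change with batch size or batch composition. Batch normalization depends on which other samples share the batch, which is why it tends to need reasonably large batches to give stable statistics.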

In [None]:
### Without normalization: how and why does it affect training?

In [None]:
### Channels vs. layers: how does normalization work on different layers?

In [None]:
### Why does normalization allow a larger learning rate?

In [None]:
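### Internal Covariate Shift (ICS) problem: during training, the distribution of the data flowing into each layer keeps changing as it passes through activation functions and as earlier layers are updated. The deeper the network, the larger this shift becomes, which makes the model very hard to train, slows convergence, and can lead to vanishing gradients.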

In [None]:
### In what scenarios does BN work better?

In [None]:
### Performance

## Section 3: Metrics & Error Analysis

## Section 4: Enhancement Rollout Plan