## EDA for Credit Card Fraud Detection System (FDS)
##### Gavin Qu, version 11.14.2025

#### Content: 
- The dataset contains transactions made by credit cards in September 2013 by European cardholders.
- This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. 

- It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, …, V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; it can be used, for example, for example-dependent cost-sensitive learning. Feature 'Class' is the response variable: it takes value 1 in case of fraud and 0 otherwise. 

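- Given the class imbalance ratio, we recommend measuring performance using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification: a classifier that labels every transaction as legitimate would already score about 99.83% accuracy while catching no fraud. 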

- *PCA*: The idea is to project the data onto a new set of orthogonal axes, called principal components (PCs), ordered so that the first PC accounts for the largest variance in the data, the second PC accounts for the largest remaining variance orthogonal to the first, and so on. By keeping only the first few PCs (which capture most of the variance), you can simplify the data and reduce noise while preserving most of the relationships in the data. 
- The eigenvectors of the covariance matrix are the principal components.
- The eigenvector with the largest eigenvalue is the first PC, the one with the second-largest eigenvalue is the second PC, and so on. 
- When you discard the p - k principal components that explain the least variance, you are assuming that those directions contain mostly "noise". That is often, but not always, the case: a low-variance direction can still carry signal. 
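
The eigenvector view of PCA described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the transformation that was applied to the fraud dataset (those details are not published); the function name `pca_eig` and the toy `mixing` matrix are assumptions made for the example.

```python
import numpy as np

def pca_eig(X, k):
    """Project X (n_samples, n_features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = np.cov(Xc, rowvar=False)           # (p, p) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric input, eigenvalues ascending
    order = np.argsort(eigvals)[::-1]        # reorder so the largest eigenvalue is first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs[:, :k]             # coordinates on the first k PCs
    explained = eigvals / eigvals.sum()      # fraction of total variance per component
    return scores, eigvecs[:, :k], explained

# Synthetic, correlated data (hypothetical; NOT the fraud dataset):
# 4 observed features driven by 2 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = np.array([[3.0, 0.5, 1.0, 0.0],
                   [0.0, 1.0, 0.5, 0.2]])
X = latent @ mixing + 0.1 * rng.normal(size=(500, 4))

scores, components, explained = pca_eig(X, k=2)
print("explained variance ratio:", explained.round(3))
```

Here the last two components explain almost nothing because the data were built from two factors, which is the situation where dropping the p - k trailing components is safe.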

In [None]:
#import kagglehub

# Download latest version
# path = kagglehub.dataset_download("mlg-ulb/creditcardfraud")

#print("Path to dataset files:", path)

### 1. Objective and Constraints
#### 1.1 Loss of Interpretability and Gain in Dimensionality Reduction
- Since PCA creates new axes that are linear combinations of the original features, the column headers V1, V2, etc. no longer correspond directly to interpretable real-world entities (such as age or ZIP code).
- Because the PCA coefficients (loadings) are not published, we cannot map the components back to the original features, which limits our ability to perform direct, causal analysis based on feature names alone. 
- Since the PCs are orthogonal and uncorrelated with each other, this is actually beneficial for my use of logistic regression, as it eliminates multicollinearity among V1-V28. 
- Features created this way should be ranked by the amount of variance they capture: V1 captures the most information (variance) in the original dataset and V28 the least. This makes further dimensionality reduction possible by dropping the last features. 
- Since the original dataset has to be scaled/normalized before PCA, and the PCs are derived from the covariance matrix, the original feature scales are mixed together in the PCs. As a result, I don't have to worry about the relative scales of the original features behind V1-V28. 'Time' and 'Amount' were not transformed, so they remain on their raw scales. 
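
The claims in this list (uncorrelated components, variance that decreases from the first column to the last, room to drop trailing columns) can be checked directly on the columns. The sketch below is one way to do that; `pc_diagnostics` and the 0.95 threshold are hypothetical choices, and the demo uses a small synthetic stand-in rather than the real V1-V28.

```python
import numpy as np

def pc_diagnostics(V, threshold=0.95):
    """Summarize columns that are assumed to be PCA scores, ordered V1..Vp."""
    variances = V.var(axis=0, ddof=1)
    # Cumulative share of the variance contained in THESE columns
    # (not of the original data, whose discarded components are unknown).
    share = np.cumsum(variances) / variances.sum()
    corr = np.corrcoef(V, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return {
        "variances": variances,
        "sorted_descending": bool(np.all(np.diff(variances) <= 0)),
        "max_abs_corr": float(np.abs(off_diag).max()),
        "n_keep": int(np.searchsorted(share, threshold) + 1),
    }

# Synthetic stand-in for V1..V28 (hypothetical data, 5 components instead of 28).
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))            # random orthogonal rotation
X = (rng.normal(size=(1000, 5)) * [5, 3, 2, 1, 0.5]) @ Q
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
V = Xc @ eigvecs[:, ::-1]                               # scores, largest variance first

report = pc_diagnostics(V)
print(report["sorted_descending"], report["max_abs_corr"], report["n_keep"])
```

On the real data the same function could be called on the array of V1-V28 columns. Note that zero correlation over the full dataset does not imply zero correlation within each class.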

#### 1.2 Objective Formulation
At this stage, I define success as gaining a robust understanding of the dataset and finding the mix of models that best predicts the outcome from the given principal components. Speed is not a major concern, but I'd like to simulate real-world scenarios that reflect model development cycles. Given the class imbalance, the key performance metric should be the Area Under the Precision-Recall Curve (AUPRC). 
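
To make the metric choice concrete, the sketch below compares plain accuracy with a step-wise estimate of the area under the precision-recall curve (average precision), using the class counts quoted in the dataset description. `average_precision` is a hand-written helper for illustration (scikit-learn's `average_precision_score` computes the same kind of summary), and the "model scores" are simulated, not produced by any fitted model.

```python
import numpy as np

def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve: sum of (R_n - R_{n-1}) * P_n."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="mergesort")    # highest score first
    y_sorted, s_sorted = y_true[order], scores[order]
    tp = np.cumsum(y_sorted)
    fp = np.cumsum(1 - y_sorted)
    # One point per distinct threshold: keep the last index of each run of tied scores.
    last = np.r_[np.nonzero(np.diff(s_sorted))[0], len(s_sorted) - 1]
    tp, fp = tp[last], fp[last]
    precision = tp / (tp + fp)
    recall = tp / tp[-1]                             # assumes at least one positive
    return float(np.sum(np.diff(np.r_[0.0, recall]) * precision))

n, n_fraud = 284_807, 492                # counts quoted in the dataset description
y = np.zeros(n, dtype=int)
y[:n_fraud] = 1

always_legit = np.zeros(n, dtype=int)    # predicts "not fraud" for every transaction
accuracy = float((always_legit == y).mean())
baseline_ap = average_precision(y, np.zeros(n))      # uninformative constant score

rng = np.random.default_rng(0)
scores = rng.normal(size=n) + 3.0 * y    # hypothetical scores: frauds shifted upward
model_ap = average_precision(y, scores)
print(f"accuracy={accuracy:.5f}  baseline AP={baseline_ap:.5f}  model AP={model_ap:.3f}")
```

The useless classifier looks excellent on accuracy, while its average precision collapses to the fraud prevalence, which is why the precision-recall view is the more honest baseline here.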
1. Do "Time", "Amount" and "Class" show clustering with respect to the other features? If so, this could introduce leakage. 
2. How are features V1-V28 distributed across the two values of our target variable Class? 
3. Time-series considerations. 
4. What should I do about the class imbalance? Do I need to interpret results (e.g. summary statistics) differently than for other types of datasets? 
5. How do we interpret the principal components V1-V28 if we find feature importance? 
6. Most importantly, where does the model learn to separate signal from noise? 
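
As a starting point for question 2, one simple way to compare a feature's distribution across the two classes is the standardized mean difference. The sketch below is only an illustration on synthetic data: `standardized_mean_diff` is a hypothetical helper, the shifts are made up, and the statistic only detects differences in location, not in shape or spread.

```python
import numpy as np

def standardized_mean_diff(X, y):
    """Per-feature (mean of class 1 - mean of class 0) / pooled standard deviation."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    pooled_std = np.sqrt((pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)) / 2.0)
    return (pos.mean(axis=0) - neg.mean(axis=0)) / pooled_std

# Hypothetical data: 3 features, only the first two differ between classes.
rng = np.random.default_rng(42)
y = (rng.random(20_000) < 0.01).astype(int)     # roughly 1% positives
X = rng.normal(size=(20_000, 3))
X[y == 1, 0] += 2.0                             # strong shift for the positive class
X[y == 1, 1] -= 0.5                             # weak shift for the positive class

smd = standardized_mean_diff(X, y)
ranking = np.argsort(-np.abs(smd))              # most separating feature first
print(smd.round(2), ranking)
```

Because the positive class is tiny, the per-class statistics for Class 1 are much noisier than those for Class 0, which is one reason summary statistics need more careful reading on this kind of dataset.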