In [1]:
import pandas as pd
import numpy as np

# Part 1: Data Exploration & Preprocessing
Deliverables:
1. Exploratory Data Analysis (EDA)

- Statistical summary of all features
- Class distribution analysis (critical for classification!)
    - Bar plot of class counts
    - Calculate class imbalance ratio
- Distribution plots for key numerical features
- Correlation matrix/heatmap
- Identification of outliers
- Missing value analysis (if applicable)
- Categorical feature analysis (unique values, frequency)

2. Data Preprocessing Pipeline

- Handle missing values
- Encode categorical variables (one-hot, label encoding, etc.)
- Feature scaling/normalization
- Address class imbalance (if applicable):
    - SMOTE (Synthetic Minority Over-sampling)
    - Undersampling
    - Class weights
    - Or justify why no balancing is needed
- Train/test split with stratification
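
A minimal sketch of the split, scaling, and class-weight steps listed above. It uses a synthetic dataset from `make_classification` as a stand-in for the real data (an assumption; the roughly 5% positive rate is made up for illustration), so adapt the names and parameters to your own features and target.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_class_weight

# Synthetic stand-in for the real dataset (assumption): about 5% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# Stratify on y so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split,
# so no test-set statistics leak into preprocessing.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# "balanced" weights are n_samples / (n_classes * count_per_class).
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

print("train positive rate:", y_train.mean())
print("test positive rate:", y_test.mean())
print("class weights:", class_weight)
```

The resulting `class_weight` dictionary can be passed to estimators that accept a `class_weight` argument. SMOTE is not shown here because it lives in the separate imbalanced-learn package; if you use any resampling, apply it to the training split only.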

Required Output:
- Jupyter notebook or Python script with markdown/comments
- Minimum 5 visualizations including class distribution
- Written justification for preprocessing choices (3-4 paragraphs)
- Discussion of class imbalance strategy
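
Any discussion of the imbalance strategy needs the class counts and the imbalance ratio first. The sketch below computes both on a toy label column; the column name `Class` and the 95/5 split are hypothetical placeholders, not taken from the real dataset.

```python
import pandas as pd

# Toy labels standing in for the real target column (hypothetical name "Class").
df_demo = pd.DataFrame({"Class": [0] * 95 + [1] * 5})

# Count rows per class and express each class as a share of the total.
class_counts = df_demo["Class"].value_counts().sort_index()
class_share = class_counts / class_counts.sum()

# Imbalance ratio: majority-class count divided by minority-class count.
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_counts.to_dict())   # {0: 95, 1: 5}
print(class_share.to_dict())    # {0: 0.95, 1: 0.05}
print(f"imbalance ratio (majority:minority) = {imbalance_ratio:.1f}:1")  # 19.0:1
```

Calling `class_counts.plot(kind="bar")` (which requires matplotlib) turns the same counts into the required class-distribution bar plot.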

In [None]:
# Getting the data into a pandas dataframe
df = pd.read_csv("./data/creditcard.csv")

# Part 2: Baseline Model
Implement Logistic Regression as your baseline.

Report:
- Training and test performance metrics (see Part 4)
- Confusion Matrix (with visualization)
- Classification Report (precision, recall, F1 per class)
- ROC Curve and AUC (for binary classification)
- Brief analysis of baseline performance
    - Where does the model succeed/fail?
    - Any signs of overfitting or underfitting?
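
One possible shape for the baseline report, sketched on synthetic data (an assumption; swap in your own preprocessed split). Scaling sits inside a pipeline so it is fitted on the training data only, and comparing train and test AUC gives a first hint about overfitting or underfitting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced stand-in for the real dataset (assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling and the classifier are fitted together on the training split.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)

y_pred = baseline.predict(X_test)
y_score = baseline.predict_proba(X_test)[:, 1]  # positive-class probability

cm = confusion_matrix(y_test, y_pred)  # rows = true class, columns = predicted
print(cm)
print(classification_report(y_test, y_pred, digits=3))

# A large gap between train and test AUC would suggest overfitting.
train_auc = roc_auc_score(y_train, baseline.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, y_score)
print(f"train AUC = {train_auc:.3f}, test AUC = {test_auc:.3f}")
```

For the visual versions, scikit-learn provides `ConfusionMatrixDisplay.from_predictions` and `RocCurveDisplay.from_estimator`, both of which need matplotlib.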

# Part 3: Model Implementation & Comparison
Implement at least 4 different classification models, covering a variety of model types:

Required Model Types:

1. At least one ensemble method (Random Forest, Gradient Boosting, XGBoost, AdaBoost)
2. At least one tree-based model (Decision Tree, Extra Trees)
3. At least one model from your group research (SVM, KNN, Naive Bayes, Neural Network, etc.)
4. Your choice for the fourth

For Each Model:
1. Initial Training

- Train with default hyperparameters
- Record training time
- Calculate all metrics (see Part 4)
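
A small sketch of how training time and default hyperparameters might be recorded per model. The two models and the synthetic data are placeholders chosen for illustration; `time.perf_counter` measures wall-clock time around `fit` only.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Models created with default hyperparameters (random_state only fixes the seed).
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X, y)
    elapsed = time.perf_counter() - start
    results[name] = {
        "training_time": elapsed,                 # seconds
        "hyperparameters": model.get_params(),    # defaults actually used
    }
    print(f"{name}: trained in {elapsed:.3f}s")
```

Timings vary between machines and runs, so report them as rough comparisons rather than exact figures.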

2. Document Model Configuration
<details>
<summary>Example:</summary>
model_config = { <br>
    &emsp;'name': 'Random Forest', <br>
    &emsp;'hyperparameters': { <br>
        &emsp;&emsp;'n_estimators': 100, <br>
        &emsp;&emsp;'max_depth': None, <br>
        &emsp;&emsp;'min_samples_split': 2, <br>
        &emsp;&emsp;... <br>
    &emsp;}, <br>
    &emsp;'preprocessing_requirements': 'Standard scaling applied', <br>
    &emsp;'training_time': 2.34,  # seconds <br>
    &emsp;'class_weight': 'balanced'  # if applicable <br>
}
</details>

3. Feature Importance (if applicable)

- Plot top 10 most important features
- Discuss which features drive predictions
- Compare feature importance across models

4. Model-Specific Analysis

- For tree-based: Visualize decision tree (if feasible)
- For SVM: Discuss kernel choice
- For KNN: Justify k value selection
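
One way to justify the choice of k for KNN is to compare cross-validated scores over a few candidate values. The sketch below assumes synthetic data, an arbitrary candidate list, and F1 as the selection metric; none of these come from the assignment itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=1
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# KNN is distance-based, so scaling is part of the pipeline for every fold.
k_scores = {}
for k in [1, 3, 5, 7, 9, 11, 15]:
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    k_scores[k] = cross_val_score(knn, X, y, cv=cv, scoring="f1").mean()

best_k = max(k_scores, key=k_scores.get)
for k, score in k_scores.items():
    print(f"k={k:>2}: mean F1 = {score:.3f}")
print("best k:", best_k)
```

Plotting `k_scores` against k usually makes the justification easier to read than the raw numbers.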

# Part 4: Model Evaluation
Calculate the following metrics for ALL models (baseline + 4):

Required Metrics:

For Binary Classification:

- Accuracy (with discussion of when it's misleading)
- Precision (per class and macro/weighted average)
- Recall/Sensitivity (per class and macro/weighted average)
- F1-Score (per class and macro/weighted average)
- AUC-ROC (Area Under ROC Curve)
- AUC-PR (Area Under Precision-Recall Curve - important for imbalanced data)
- Confusion Matrix
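
The binary metrics above can be collected in one helper, sketched here. `evaluate_binary` is a hypothetical function name, the labels and scores are made-up toy values, and AUC-PR is computed with `average_precision_score`, which summarizes the precision-recall curve as a step-wise sum rather than a trapezoidal area.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate_binary(y_true, y_score, threshold=0.5):
    """Compute the listed binary metrics from positive-class scores."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "f1_macro": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "f1_weighted": f1_score(y_true, y_pred, average="weighted", zero_division=0),
        "auc_roc": roc_auc_score(y_true, y_score),          # threshold-free
        "auc_pr": average_precision_score(y_true, y_score),  # threshold-free
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }


# Toy example: 6 negatives, 4 positives, made-up scores.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.12, 0.22, 0.31, 0.42, 0.61, 0.18, 0.72, 0.83, 0.37, 0.91])

metrics = evaluate_binary(y_true, y_score)
for name, value in metrics.items():
    print(name, value, sep=": ")
```

In this toy example the default threshold yields one false positive and one false negative, so accuracy is 0.8 while precision, recall, and F1 for the positive class are all 0.75.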

For Multiclass Classification:

- All of the above (adapted for multiclass)
- Macro-average and Weighted-average metrics
- Per-class performance analysis

Create Comparison Visualizations:

1. Metrics Comparison Table

| Model            | Accuracy | Precision | Recall | F1    | AUC-ROC | Train Time |
|------------------|----------|-----------|--------|-------|---------|------------|
| Logistic Reg     | 0.92     | 0.85      | 0.78   | 0.81  | 0.94    | 0.12s      |
| Random Forest    | 0.94     | 0.88      | 0.82   | 0.85  | 0.96    | 2.45s      |
| ...              | ...      | ...       | ...    | ...   | ...     | ...        |
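
A table like the one above can be assembled with pandas instead of typed by hand. This is a sketch on synthetic data with two placeholder models; the numbers it prints will differ from the illustrative values in the table.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=1500, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Reg": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start

    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred, zero_division=0),
        "Recall": recall_score(y_test, y_pred, zero_division=0),
        "F1": f1_score(y_test, y_pred, zero_division=0),
        "AUC-ROC": roc_auc_score(y_test, y_score),
        "Train Time (s)": train_time,
    })

comparison = pd.DataFrame(rows).set_index("Model").round(3)
print(comparison)
```

The same DataFrame can feed the bar charts, for example through `comparison.plot(kind="bar")` with matplotlib installed.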

2. Bar Charts: Compare metrics across models

3. Confusion Matrices: For ALL models (use subplots)

4. ROC Curves: Plot all models on same graph for comparison

5. Precision-Recall Curves: Especially important for imbalanced datasets

6. Learning Curves (Optional but recommended): For your top 2 models

- Plot training vs validation score as function of training size
- Helps diagnose overfitting/underfitting
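
A learning curve of this kind can be computed with scikit-learn's `learning_curve`. The sketch assumes synthetic data, a logistic regression model, and F1 scoring; plotting the two mean curves against `train_sizes` is left out to keep it short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, learning_curve

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),  # 20% .. 100% of each training fold
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="f1",
)

# Each row is one training size, each column one CV fold.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"n={n:>4}: train F1 = {tr:.3f}, validation F1 = {va:.3f}")
```

A training score far above the validation score points toward overfitting, while two low scores that have converged point toward underfitting.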

Analysis Requirements:
- Which metric is most important for your problem? Why?
- Discuss the precision-recall tradeoff
- For imbalanced data: Why is accuracy potentially misleading?
- Compare performance on minority vs. majority class
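
To see concretely why accuracy can mislead on imbalanced data, consider a classifier that always predicts the majority class. The sketch uses made-up labels with a 1% positive rate and scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 990 negatives, 10 positives. Features are irrelevant here.
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((1000, 1))

# Always predicts the most frequent class seen during fit.
majority = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = majority.predict(X)

accuracy = accuracy_score(y, y_pred)
minority_recall = recall_score(y, y_pred)  # recall for the positive class

print("accuracy:", accuracy)                # 0.99
print("minority recall:", minority_recall)  # 0.0
```

The model is 99% accurate yet finds none of the minority cases, which is why recall, F1, and AUC-PR belong in the analysis alongside accuracy.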

# Part 5: Hyperparameter Tuning
Select your best performing model from Part 3 and optimize it.

Required Approach:
Use GridSearchCV or RandomizedSearchCV
<details>
<summary>Implementation:</summary>
from sklearn.model_selection import GridSearchCV <br>
<br>
param_grid = { <br>
&emsp;    # Define at least 3 hyperparameters to tune <br>
&emsp;    # Each with at least 3 different values <br>
&emsp;    # Example for Random Forest: <br>
&emsp;    'n_estimators': [100, 200, 300], <br>
&emsp;    'max_depth': [10, 20, 30, None], <br>
&emsp;    'min_samples_split': [2, 5, 10], <br>
&emsp;    'class_weight': ['balanced', None]  # if imbalanced <br>
} <br>
<br>
grid_search = GridSearchCV( <br>
&emsp;    estimator=your_model, <br>
&emsp;    param_grid=param_grid, <br>
&emsp;    cv=5,  # 5-fold stratified cross-validation <br>
&emsp;    scoring='f1',  # or 'roc_auc', 'f1_weighted', etc. <br>
&emsp;    n_jobs=-1, <br>
&emsp;    verbose=1 <br>
) <br>
<br>
grid_search.fit(X_train, y_train)
</details>

Document:
- Initial hyperparameters vs. optimal hyperparameters
- Performance improvement (before/after tuning)
    - Show metrics table comparing both versions
- Cross-validation scores (mean and std)
- Training time comparison
- Scoring metric choice: Why did you choose that metric for optimization?
- Discussion: Was the tuning worth the computational cost?

Additional Considerations:
- If using imbalanced data, ensure CV is stratified
- Consider using multiple scoring metrics
- Discuss any tradeoffs (e.g., precision vs. recall)
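
The considerations above can be combined in one search: an explicit `StratifiedKFold`, several scoring metrics, and `refit` naming the metric that selects the final model. This is a sketch on synthetic data, and the grid is deliberately smaller than the assignment requires so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in data (assumption).
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=0
)

# Explicit stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={
        "n_estimators": [25, 50],
        "max_depth": [3, None],
        "class_weight": ["balanced", None],
    },
    cv=cv,
    scoring={"f1": "f1", "roc_auc": "roc_auc", "recall": "recall"},
    refit="f1",  # metric used to pick best_params_ and refit the final model
)
search.fit(X, y)

# Mean and standard deviation across folds for the selected configuration.
best = search.best_index_
print("best params:", search.best_params_)
for metric in ["f1", "roc_auc", "recall"]:
    mean = search.cv_results_[f"mean_test_{metric}"][best]
    std = search.cv_results_[f"std_test_{metric}"][best]
    print(f"{metric}: {mean:.3f} +/- {std:.3f}")
```

Reading several metrics for the same configuration shows the tradeoff directly, for example when the setting with the best F1 is not the one with the best recall.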

# Part 6: Reflections
Write a comprehensive report addressing:

1. Model Selection Justification
- Why did certain models perform better than others?
- What characteristics of the data favor specific algorithms?
- Were there any surprising results?
- How did class imbalance (if present) affect different models?

2. Feature Analysis
- Which features were most important across models?
- Did feature importance differ between models?
- Any features that were unexpectedly important/unimportant?
- Recommendations for feature engineering

3. Practical Considerations
- Which model would you deploy in production and why?
    - Consider: accuracy, speed, interpretability, maintenance
- Trade-offs:
    - Accuracy vs. interpretability vs. speed
    - Precision vs. recall (what's more important for your use case?)
- Ethical considerations (especially for sensitive domains):
    - Potential biases in the model
    - Fairness across different groups
    - Consequences of false positives vs. false negatives
- Potential limitations of your models

4. Future Improvements
- What would you try next to improve performance?
- Additional data that would be helpful
- Different approaches to consider:
    - Ensemble methods (stacking, voting)
    - Deep learning
    - Different feature engineering
    - Cost-sensitive learning
- How would you monitor model performance in production?
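
As one simple form of the cost-sensitive learning mentioned above, the decision threshold can be chosen to minimize an expected misclassification cost instead of defaulting to 0.5. Everything in this sketch is hypothetical: the helper name `expected_cost`, the toy labels and scores, and the costs (a false negative assumed to be ten times as costly as a false positive).

```python
import numpy as np


def expected_cost(y_true, y_score, threshold, cost_fp, cost_fn):
    """Total cost of the errors made when predicting positive at >= threshold."""
    y_pred = y_score >= threshold
    false_positives = np.sum(y_pred & (y_true == 0))
    false_negatives = np.sum(~y_pred & (y_true == 1))
    return cost_fp * false_positives + cost_fn * false_negatives


# Toy labels and scores (made up for illustration).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.12, 0.22, 0.31, 0.42, 0.61, 0.18, 0.72, 0.83, 0.37, 0.91])

# Hypothetical costs: missing a positive is 10x worse than a false alarm.
COST_FP, COST_FN = 1.0, 10.0

thresholds = np.linspace(0.05, 0.95, 19)
costs = np.array(
    [expected_cost(y_true, y_score, t, COST_FP, COST_FN) for t in thresholds]
)

best_threshold = thresholds[int(np.argmin(costs))]
default_cost = expected_cost(y_true, y_score, 0.5, COST_FP, COST_FN)

print(f"cost at threshold 0.50: {default_cost}")
print(f"best threshold: {best_threshold:.2f} with cost {costs.min()}")
```

With these made-up numbers the cheapest threshold is lower than 0.5, because accepting an extra false positive costs less than missing a positive. In practice the threshold should be chosen on validation data, not on the test set.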