# Assignment 3 – Machine Learning

**Total Weight:** 15% of total course mark  
**Dataset:** [Credit Card Transactions Dataset](https://drive.google.com/file/d/1XW1SwrIGXKyz8Sb0-hCii7KLwvrqehCp/view?usp=drive_link)

---

## Overview

In this assignment, you will apply the complete machine learning pipeline, from data preparation to model tuning, to a realistic dataset of simulated credit card transactions.

You will complete **three parts**:
1. **Task Analysis & Exploration** (5 marks)  
2. **Regression Task** – Predict Transaction Amount (5 marks)  
3. **Classification Task** – Fraud Detection (5 marks)

The goal is to demonstrate your understanding of data preprocessing, feature engineering, exploratory analysis, model development, and performance evaluation.

## Submission Requirements

| File | Description |
|------|--------------|
| `z{studentID}.ipynb` | Notebook showing data analysis, model comparison, and interpretation; includes all outputs |
| `z{studentID}.py` | Python script that trains the final models and outputs predictions; run as `$ python3 z{studentID}.py <train_csv> <test_csv>` |

---

## Dataset Description

Each record represents one credit card transaction with associated customer, merchant, and location data.  
Below is a summary of the key columns:

| Column | Description | Notes |
|---------|--------------|-------|
| `trans_date_trans_time` | Date and time of transaction |   |
| `cc_num` | Credit card number (anonymised) |   |
| `merchant` | Merchant/business name |  |
| `category` | Transaction type (e.g., shopping, travel) |  |
| `amt` | Transaction amount | **Regression target** |
| `first`, `last` | Cardholder names |  |
| `gender` | F = Female, M = Male |   |
| `street`, `city`, `state`, `zip` | Customer address info |  |
| `lat`, `long` | Customer coordinates |   |
| `city_pop` | City population |  |
| `job` | Cardholder occupation |  |
| `dob` | Date of birth |  |
| `trans_num` | Unique transaction ID |  |
| `unix_time` | Transaction timestamp |  |
| `merch_lat`, `merch_long` | Merchant coordinates |   |
| `is_fraud` | 1 = Fraud, 0 = Legitimate | **Classification target** |
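
As an illustration of how these columns can be combined, the sketch below derives three features: hour of day, approximate cardholder age, and customer-to-merchant distance. It is a minimal example, not a recommended feature set. The helper names (`haversine_km`, `add_example_features`) and the new column names are hypothetical, and it assumes the date columns parse cleanly with `pd.to_datetime`.

```python
import numpy as np
import pandas as pd


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two sets of coordinates."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))  # 6371 km = mean Earth radius


def add_example_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with three illustrative derived columns."""
    out = df.copy()
    ts = pd.to_datetime(out["trans_date_trans_time"])
    out["hour"] = ts.dt.hour
    # Approximate age in whole years at the time of the transaction.
    out["age"] = (ts - pd.to_datetime(out["dob"])).dt.days // 365
    out["dist_km"] = haversine_km(out["lat"], out["long"],
                                  out["merch_lat"], out["merch_long"])
    return out
```

Because the function only reads columns of the frame it is given, it can be applied to the training and test data separately.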

---

## Part I – Task Analysis & Exploration (5 marks)

**Deliverable:**  
A Jupyter Notebook (`z{studentID}.ipynb`) that documents your analytical process, that is, the steps you took to build and tune your final machine learning models for Part II and Part III. Run the notebook and keep all cell outputs in the submitted file. The notebook, including its outputs, must be smaller than 3 MB.

### Required Elements

- **Data Cleaning, Preparation & Data Exploration**: Briefly list the key steps you took to clean and preprocess the data. Explore the dataset to uncover meaningful patterns, trends, and anomalies. Demonstrate your analytical reasoning by explaining how these insights informed your modeling decisions or helped shape your conclusions.

- **Feature Engineering & Feature Selection**: Identify and discuss the most important features you created, explaining the reasoning behind their significance. Demonstrate your ability to recognize and engineer features that meaningfully impact model performance. Additionally, explain how you evaluated feature importance—such as through model-based importance scores, correlation analysis, or feature selection techniques—to validate their contribution to predictive accuracy.

- **Model Exploration & Hyperparameter Tuning**: Compare at least two algorithms per task, and document their validation performance and the reasoning behind your final model choice. Demonstrate your skills in fine-tuning machine learning models and avoiding overfitting.
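
One possible way to structure the model comparison is k-fold cross-validation on the training data. The sketch below compares two regressors by cross-validated RMSE. It runs on synthetic data from `make_regression` as a stand-in for the real features, and the two candidate models are illustrative assumptions only; check that any library you use is permitted for this assignment.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor


def compare_models(X, y, cv=5):
    """Return the mean cross-validated RMSE per candidate (lower is better)."""
    candidates = {
        "ridge": Ridge(alpha=1.0),
        "tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    }
    results = {}
    for name, model in candidates.items():
        # scikit-learn maximises scores, so RMSE is reported as a negative value.
        scores = cross_val_score(model, X, y, cv=cv,
                                 scoring="neg_root_mean_squared_error")
        results[name] = -scores.mean()
    return results


# Synthetic stand-in data; replace with features and target from the training CSV.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
print(compare_models(X, y))
```

Comparing the cross-validated score with the score on the training folds is one simple way to spot overfitting.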

---

## Part II – Regression Task (5 marks)

**Objective:** Predict the **transaction amount (`amt`)** using other transaction features. 

**Deliverables**
- A Python script named `z{studentID}.py` (the same script used for Part III).
- The script should generate a CSV file named `z{studentID}_regression.csv` with the following format:

|trans_num|amt|
|---------|--------------|
|12345|39.10|
|12346|0.87|
|...|...|
|12347|1000.00|

- Ensure that your output file contains exactly two columns, including the headers, in the specified order and with the exact column names.
- The `trans_num` column must correspond to the same column in the test dataset.
- The `amt` column should contain your predicted values. There are no strict formatting requirements for this column — values may be in decimal or float format.
- Your output file must include predictions for every `trans_num` present in the test dataset. The order of the rows does not matter.
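
The following is a minimal sketch of a script skeleton that takes both CSV paths from the command line and writes the regression file in the format described above. `STUDENT_ID`, `write_predictions`, and `main` are hypothetical names, and the training-median "model" is only a placeholder showing where a trained regressor would go.

```python
import sys

import pandas as pd

STUDENT_ID = "z1234567"  # hypothetical placeholder; use your own zID


def write_predictions(test_df, predictions, target, student_id=STUDENT_ID):
    """Write a two-column CSV: trans_num plus the predicted target."""
    task = "regression" if target == "amt" else "classification"
    out_path = f"{student_id}_{task}.csv"
    out = pd.DataFrame({"trans_num": test_df["trans_num"].to_numpy(),
                        target: predictions})
    out.to_csv(out_path, index=False)  # index=False keeps exactly two columns
    return out_path


def main(argv):
    train_path, test_path = argv[1], argv[2]  # paths come from the command line
    train_df = pd.read_csv(train_path)
    test_df = pd.read_csv(test_path)
    # Placeholder only: predicts the training median for every test row.
    # A real submission would fit and apply a trained regressor here.
    baseline = train_df["amt"].median()
    write_predictions(test_df, [baseline] * len(test_df), "amt")


# Run only when invoked as: python3 script.py <train_csv> <test_csv>
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

Taking `trans_num` directly from the test frame means the output has one row per test transaction.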

---

## Part III – Classification Task (5 marks)

**Objective:** Detect fraudulent transactions (`is_fraud` = 1).

**Deliverables**
- The same Python script (`z{studentID}.py`) should also output a CSV file named `z{studentID}_classification.csv` in the following format:
  
|trans_num|is_fraud|
|---------|--------------|
|12345|0|
|12346|1|
|...|...|
|12347|0|

- Ensure that your output files each contain exactly two columns, including the headers, in the specified order and with the exact column names.
- The `trans_num` column must correspond to the same column in the test dataset.
- The `is_fraud` column represents your predicted label, where:
    - `1` indicates a fraudulent transaction
    - `0` indicates a legitimate transaction
- Your output file must include predictions for every `trans_num` present in the test dataset. The order of the rows does not matter.
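
The format rules above can be checked programmatically before submission. The sketch below is a hypothetical helper (`check_output` is not part of any marking tool) that returns a list of problems for a prediction frame. It assumes `trans_num` is unique in the test set, as the dataset description states.

```python
import pandas as pd


def check_output(pred_df, test_df, target):
    """Return a list of problems found in a prediction frame (empty if none)."""
    problems = []
    if list(pred_df.columns) != ["trans_num", target]:
        problems.append(f"columns must be exactly ['trans_num', '{target}']")
        return problems  # the remaining checks need the right columns
    if pred_df["trans_num"].duplicated().any():
        problems.append("duplicate trans_num values")
    if set(pred_df["trans_num"]) != set(test_df["trans_num"]):
        problems.append("trans_num values do not match the test set")
    if pred_df[target].isna().any():
        problems.append("missing predictions")
    if target == "is_fraud" and not pred_df[target].isin([0, 1]).all():
        problems.append("is_fraud must contain only 0 or 1")
    return problems
```

For example, `check_output(pd.read_csv("z{studentID}_classification.csv"), test_df, "is_fraud")` should return an empty list for a valid file; the same helper works for the regression file with `target="amt"`.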

---

## Important Rules and Penalties

1. **Dataset Usage Rules**
   - You may use the **training dataset** for model training and validation.  
   - You **must not** use the test dataset for training, feature selection, or oversampling.  
   - Use the **test dataset** only to evaluate your model, for example to check for overfitting or underfitting.  
   - Your model will be evaluated against **a different test dataset** that is available only to tutors; the training dataset stays the same during marking. All datasets share the same schema (same structure, similar quality, similar class balance). However, the marking test set may contain category values that never appear in the training data, so make sure any categorical encoding (for example, a label encoder) handles unknown values.

2. **Target Leakage**
   - Avoid any form of target leakage.  
     - Do **not** use `is_fraud` as a feature for predicting `amt` or `is_fraud`.  
     - Do **not** engineer features that indirectly encode target information.  
     - You can use `amt` as a feature for predicting `is_fraud`.  

3. **Late Submission Penalty**
   - 5% deduction per day (out of total marks).  
   - Submissions after **day 5** will **not be marked**.

4. **Runtime**
   - The evaluation will be conducted on the CSE servers using `Python 3.11`.
   - You may use any third-party libraries that have been officially used or demonstrated in the labs. The use of other external libraries, packages, or tools is not permitted, unless explicitly approved by the instructors. 
   - If your models take **more than 2 minutes** to train and generate outputs on CSE lab machines, you will lose **1 mark per additional minute**.

5. **Automatic Zero (0) Mark Conditions**
   You will receive a **mark of zero** for Assignment 3 if:
   - Your code produces **hard-coded predictions** instead of using a trained model.  
   - You use the **test dataset for training or oversampling**.  
   - Your code **fails to run** on CSE machines with the command:  
     ```
     python3 z{studentID}.py <train_csv> <test_csv>
     ```  
   - You **hard-code dataset names or file paths**.  
   - You **do not produce the required output files properly** (`z{studentID}_regression.csv`, `z{studentID}_classification.csv`). Each output file must use the exact column names specified above and contain the same number of rows as the `<test_csv>` file. 
   - You **merge or concatenate** the train and test datasets at any point.  
     - Especially if oversampling is applied to a combined train+test dataset — this will be considered **cheating** (zero marks).  
     - Oversampling must only be applied to training data.  
     - To avoid this, define a clean preprocessing function:  
       ```python
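       import pandas as pd

       def preprocess(df: pd.DataFrame) -> pd.DataFrame:
           df = df.copy()  # work on a copy so the caller's frame is not mutated
           # your preprocessing steps
           return df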

       train_df = preprocess(train_df)
       test_df = preprocess(test_df)
       ```  
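
The dataset rules above (fit on training data only, handle unknown categories, oversample only the training set) can be sketched as follows. This is one possible approach, not a required one: the helper names are hypothetical, `OrdinalEncoder` is swapped in for a plain label encoder because it can map unseen categories to a sentinel value, and the oversampling is simple random duplication for a binary label.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder


def fit_category_encoder(train_df, columns):
    """Fit an encoder on training data only; unseen categories map to -1."""
    encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
    encoder.fit(train_df[columns])
    return encoder


def oversample_minority(train_df, label="is_fraud", random_state=0):
    """Duplicate minority-class rows (with replacement) until classes balance."""
    counts = train_df[label].value_counts()
    minority = counts.idxmin()
    deficit = counts.max() - counts.min()
    extra = train_df[train_df[label] == minority].sample(
        n=deficit, replace=True, random_state=random_state)
    # Only training rows are concatenated here; the test set is never involved.
    return pd.concat([train_df, extra], ignore_index=True)


# Toy frames standing in for the real train and test CSVs.
train = pd.DataFrame({"category": ["travel", "shopping", "travel", "food"],
                      "is_fraud": [0, 0, 0, 1]})
test = pd.DataFrame({"category": ["shopping", "gas"]})

enc = fit_category_encoder(train, ["category"])
print(enc.transform(test[["category"]]).ravel())  # "gas" is unseen, so it maps to -1
print(oversample_minority(train)["is_fraud"].value_counts().to_dict())
```

The encoder is fitted once on the training frame and then only applied to the test frame, so no information flows from test to train.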

---

## Marking Criteria (15 marks)

**Part I: Task Analysis & Exploration (5 marks)** 

Your mark will be based on how well you demonstrate your skills from cleaning data to building and tuning machine learning models. 

- Data Cleaning, Preparation & Data Exploration (1.5 marks)
- Feature Engineering & Feature Selection (1.5 marks)
- Model Exploration & Hyperparameter Tuning (2 marks)

**Part II: Regression (5 marks)** 

Here is how your mark will be calculated:

$$
\text{regression}_{\text{mark}} =
\begin{cases} 
0 & \text{if } RMSE \ge 180 \text{ or invalid output files (size and format) }  \\[1mm] 
5 & \text{if } RMSE \le 140 \\[1mm]
(1 - \frac{RMSE - 140}{180 - 140}) \times 5 & \text{if } 140 < RMSE < 180
\end{cases}
$$

**RMSE**: Root Mean Squared Error, which will be calculated using your `z{studentID}_regression.csv` output file.
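
As a worked example of the formula above, the sketch below computes RMSE and maps it to a mark. The function names are hypothetical, and the code simply transcribes the piecewise definition; it is not the official marking script.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted amounts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def regression_mark(rmse_value, full=140.0, zero=180.0, max_mark=5.0):
    """Piecewise-linear mark: 5 at or below `full`, 0 at or above `zero`."""
    if rmse_value >= zero:
        return 0.0
    if rmse_value <= full:
        return max_mark
    return (1 - (rmse_value - full) / (zero - full)) * max_mark


print(regression_mark(160.0))  # (1 - 20/40) * 5 = 2.5
```

An RMSE halfway between the two thresholds therefore earns half of the available marks.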

If no student meets the full-mark threshold (RMSE $\le$ 140), the threshold may be relaxed before the marks for Assignment 3 are released.

**Part III: Classification (5 marks)** 

Here is how your mark will be calculated:

$$
\text{classification}_{\text{mark}} =
\begin{cases} 
0 & \text{if } F1 \le 0.85 \text{ or invalid output files (size and format) }  \\[1mm] 
5 & \text{if } F1 \ge 0.97 \\[1mm]
\frac{F1 - 0.85}{0.97 - 0.85} \times 5 & \text{if } 0.85 < F1 < 0.97
\end{cases}
$$

Macro-averaged F1 (F1 Macro) is used for this part. It is calculated from your `z{studentID}_classification.csv` output file.
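
To make the metric concrete, the sketch below computes macro F1 with scikit-learn's `f1_score(..., average="macro")` on a toy example and maps the result to a mark. `classification_mark` is a hypothetical helper that transcribes the formula above; it is not the official marking script.

```python
from sklearn.metrics import f1_score


def classification_mark(f1, zero=0.85, full=0.97, max_mark=5.0):
    """Piecewise-linear mark: 0 at or below `zero`, 5 at or above `full`."""
    if f1 <= zero:
        return 0.0
    if f1 >= full:
        return max_mark
    return (f1 - zero) / (full - zero) * max_mark


# Toy labels: one legitimate transaction is wrongly flagged as fraud.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]

# Macro averaging takes the unweighted mean of the per-class F1 scores,
# so the rare fraud class counts as much as the legitimate class.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(round(macro_f1, 3))                      # 0.829
print(classification_mark(macro_f1))           # 0.0
print(round(classification_mark(0.91), 2))     # 2.5
```

Because both classes are weighted equally, predicting "legitimate" for every row scores poorly even though plain accuracy would be high.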

If no student meets the full-mark threshold (F1 $\ge$ 0.97), the threshold may be relaxed before the marks for Assignment 3 are released.
