# 📌 Machine Learning Assignment 1 - Instructions & Guidelines

### **📝 General Guidelines**
Welcome to Machine Learning Assignment 1! This assignment will test your understanding of **regression and classification models**, including **data preprocessing, hyperparameter tuning, and model evaluation**.

Follow the instructions carefully, and ensure your implementation is **correct, well-structured, and efficient**.

🔹 **Submission Format:**  
- Your submission **must be a single Jupyter Notebook (.ipynb)** file.  
- **File Naming Convention:**  
  - Use **your university email as the filename**, e.g.,  
    ```
    j.doe@innopolis.university.ipynb
    ```
  - **Do NOT modify this format**, or your submission may not be graded.

🔹 **Assignment Breakdown:**
| Task | Description | Points |
|------|------------|--------|
| **Task 1.1** | Linear Regression | 15 |
| **Task 1.2** | Polynomial Regression | 15 |
| **Task 2.1** | Data Preprocessing | 30 |
| **Task 2.2** | Model Comparison | 40 |
| **Total** | - | **100** |

---

### **📂 Dataset & Assumptions**
The dataset files are stored in the `datasets/` folder.  
- **Regression Dataset:** `datasets/task1_data.csv`
- **Classification Dataset:** `datasets/pokemon_modified.csv`

Each dataset is structured as follows:

🔹 **`task1_data.csv` (for regression tasks)**  
- Contains four columns: `X_train`, `y_train`, `X_test`, and `y_test`.  
- The goal is to fit **linear and polynomial regression models** and evaluate their performance.  

🔹 **`pokemon_modified.csv` (for classification tasks)**  
- Contains Pokémon attributes, with `is_legendary` as the **binary target variable (0 or 1)**.  
- Some features contain **missing values** and **categorical variables**, requiring preprocessing.
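
Before preprocessing, it helps to see which columns have missing values and which are non-numeric. The sketch below shows one way to check this with pandas. The frame `toy` and its values are made up for illustration and only mimic the kind of columns described above; it is not the real `pokemon_modified.csv`.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame (hypothetical values, NOT the real dataset).
toy = pd.DataFrame({
    "height_m": [0.7, np.nan, 2.0, 1.1],
    "weight_kg": [6.9, 13.0, np.nan, 30.0],
    "type1": ["grass", "fire", "water", "fire"],
    "is_legendary": [0, 0, 1, 0],
})

# Number of missing values in each column.
missing_counts = toy.isna().sum()

# Non-numeric columns are the candidates for categorical encoding.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_counts)
print(categorical_cols)
```

On the real dataset, the same two checks would be run on the frame returned by `pd.read_csv`.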

---

### **🚀 How to Approach the Assignment**
1. **Start with Regression (Task 1)**
   - Implement **linear regression** and **polynomial regression**.
   - Use **GridSearchCV** for polynomial regression to find the best degree.
   - Evaluate using **MSE, RMSE, MAE, and R² Score**.

2. **Move to Data Preprocessing (Task 2.1)**
   - Load and clean the Pokémon dataset.
   - Handle **missing values** correctly.
   - Encode categorical variables properly.
   - Ensure **no data leakage** during preprocessing: fit imputers, encoders, and scalers on the training split only, then apply them to the test split.

3. **Train and Evaluate Classification Models (Task 2.2)**
   - Train **Logistic Regression, KNN, and Naive Bayes**.
   - Use **GridSearchCV** for hyperparameter tuning.
   - Evaluate models using **Accuracy, Precision, Recall, and F1-score**.
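
The regression metrics named in step 1 can be written out directly from their definitions. The sketch below uses only NumPy. The helper name `regression_metrics` and the sample values are hypothetical; in the assignment itself the corresponding `sklearn.metrics` functions can be used instead.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R^2 from their definitions (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mse = np.mean(residuals ** 2)          # mean of squared errors
    rmse = np.sqrt(mse)                    # same units as the target
    mae = np.mean(np.abs(residuals))       # mean of absolute errors

    ss_res = np.sum(residuals ** 2)                    # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot

    return {"MSE": float(mse), "RMSE": float(rmse), "MAE": float(mae), "R2": float(r2)}

# Made-up values: three perfect predictions and one that is off by 2.
example = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(example)
```

RMSE is just the square root of MSE, so it is reported in the same units as the target, which often makes it easier to interpret.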

---

### **📌 Grading & Evaluation**
- Your notebook will be **autograded**, so ensure:
  - Your function names **exactly match** the given specifications.
  - Your output format matches the expected results.
- Partial credit will be given where applicable.

🔹 **Need Help?**  
- If you have any questions, check the **markdown instructions** in each task before asking for clarification.
- You can post your questions in this [Google sheet](https://docs.google.com/spreadsheets/d/1oyrqXDjT2CeGYx12aZhZ-oDKcQQ-PCgT91wHPhTlBCY/edit?usp=sharing).

🚀 **Good luck! Happy coding!** 🎯

### FAQ

**1) Should we include the lines to import the libraries?**

- **Answer:**  
  It doesn't matter if you include extra import lines, as the grader will only call the specified functions.

**2) Is it okay to submit my file with code outside of the functions?**

- **Answer:**  
  Yes, you can include additional code outside of the functions as long as the entire script runs correctly when converted to a `.py` file.

**Important Clarification:**

- The grader will first convert the Jupyter Notebook (.ipynb) into a Python file (.py) and then run it.
- **Note:** Do not include shell or magic commands such as `!pip install numpy`. They may break the conversion or the execution of the resulting script, in which case the submission will not be graded.
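
The actual grader is not shown in this notebook. As a rough illustration only, the sketch below assumes the converted script is loaded as a module and the required function is then called. It shows why every top-level statement must run without errors: loading the module executes all of them. The names `converted_source`, `task_demo`, and `submission` are hypothetical.

```python
import importlib.util
import os
from tempfile import NamedTemporaryFile

# Stand-in for the .py text produced from a notebook (made up for this sketch).
converted_source = '''
print("top-level code runs when the module is loaded")

def task_demo():
    return {"MSE": 0.0}
'''

# Write the script to a temporary file, as a grader might do after conversion.
with NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
    tmp.write(converted_source)
    module_path = tmp.name

try:
    spec = importlib.util.spec_from_file_location("submission", module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)   # executes every top-level statement
    result = module.task_demo()       # then calls the function by name
finally:
    os.remove(module_path)

print(result)
```

Under this assumption, a failing top-level statement would stop the load before any function could be called, which is consistent with the warning above.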

## Task 1: Linear and Polynomial Regression (30 Points)

### Task 1.1 - Linear Regression (15 Points)
#### **Instructions**
1. Load the dataset from **`datasets/task1_data.csv`**.
2. Extract training and testing data from the following columns:
   - `"X_train"`: Training feature values.
   - `"y_train"`: Training target values.
   - `"X_test"`: Testing feature values.
   - `"y_test"`: Testing target values.
3. Train a **linear regression model** on `X_train` and `y_train`.
4. Use the trained model to predict target values for `X_test`.
5. Compute and return the following **evaluation metrics** as a dictionary:
   - **Mean Squared Error (MSE)**
   - **Root Mean Squared Error (RMSE)**
   - **Mean Absolute Error (MAE)**
   - **R² Score**
6. The function signature should match:
   ```python
   def task1_linear_regression() -> Dict[str, float]:
   ```

Please do not use any libraries other than the ones imported below.

```python
# Standard Library Imports
import os
import importlib.util
from tempfile import NamedTemporaryFile
from typing import Tuple, Dict

# Third-Party Library Imports
import numpy as np
import pandas as pd
import nbformat
from nbconvert import PythonExporter

# Scikit-Learn Imports
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             mean_squared_error, mean_absolute_error, r2_score)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
```

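```python
def task1_linear_regression() -> Dict[str, float]:
    # Load the regression dataset; train and test data are stored as columns.
    file_path = "datasets/task1_data.csv"
    data = pd.read_csv(file_path)

    # scikit-learn expects a 2-D feature matrix, hence reshape(-1, 1).
    X_train = data["X_train"].values.reshape(-1, 1)
    y_train = data["y_train"].values
    X_test = data["X_test"].values.reshape(-1, 1)
    y_test = data["y_test"].values

    # Fit the linear model on the training data only.
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Predict on the held-out test features.
    y_pred = model.predict(X_test)

    # Evaluate on the test set; RMSE is the square root of MSE.
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    return {
        "MSE": mse,
        "RMSE": rmse,
        "MAE": mae,
        "R2": r2
    }
```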

### Task 1.2 - Polynomial Regression (15 Points)

#### **Instructions**
1. Load the dataset from **`datasets/task1_data.csv`**.
2. Extract training and testing data from the following columns:
   - `"X_train"`: Training feature values.
   - `"y_train"`: Training target values.
   - `"X_test"`: Testing feature values.
   - `"y_test"`: Testing target values.
3. Define a **pipeline** that includes:
   - **Polynomial feature transformation** (degree range: **2 to 10**).
   - **Linear regression model**.
4. Use **GridSearchCV** with **8-fold cross-validation** to determine the best polynomial degree.
5. Train the model with the best polynomial degree and **evaluate it on the test set**.
6. Compute and return the following results as a dictionary:
   - **Best polynomial degree** (`best_degree`)
   - **Mean Squared Error (MSE)**

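#### **Function Signature**
```python
def task1_polynomial_regression() -> Dict[str, float]:
```

```python
def task1_polynomial_regression() -> Dict[str, float]:
    # Load the regression dataset; train and test data are stored as columns.
    file_path = "datasets/task1_data.csv"
    data = pd.read_csv(file_path)

    # scikit-learn expects a 2-D feature matrix, hence reshape(-1, 1).
    X_train = data["X_train"].values.reshape(-1, 1)
    y_train = data["y_train"].values
    X_test = data["X_test"].values.reshape(-1, 1)
    y_test = data["y_test"].values

    # Polynomial feature expansion followed by ordinary linear regression.
    pipeline = Pipeline([
        ('poly', PolynomialFeatures()),
        ('linear', LinearRegression())
    ])

    # Candidate degrees 2..10; the "poly__" prefix targets the pipeline step.
    param_grid = {'poly__degree': list(range(2, 11))}

    # 8-fold cross-validation on the training data; the score is negative MSE.
    grid_search = GridSearchCV(pipeline, param_grid, cv=8, scoring='neg_mean_squared_error')
    grid_search.fit(X_train, y_train)

    # GridSearchCV refits the best pipeline on the full training set by default,
    # so predict() uses the model with the selected degree.
    best_degree = grid_search.best_params_['poly__degree']
    y_pred = grid_search.predict(X_test)

    mse = mean_squared_error(y_test, y_pred)

    return {
        "best_degree": best_degree,
        "MSE": mse
    }
```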
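
To see how the cross-validated error varies with the degree, and not only which degree wins, the `cv_results_` attribute of the fitted search can be inspected. The sketch below does this on synthetic data. The cubic data-generating function, the noise level, and the seed are assumptions made for illustration and have nothing to do with `task1_data.csv`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic 1-D regression problem (made up for this sketch).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(80, 1))
y = X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=0.2, size=80)

pipeline = Pipeline([
    ("poly", PolynomialFeatures()),
    ("linear", LinearRegression()),
])
search = GridSearchCV(
    pipeline,
    {"poly__degree": list(range(2, 11))},
    cv=8,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

# The scorer returns *negative* MSE, so flip the sign to get the MSE per degree.
cv_mse = -search.cv_results_["mean_test_score"]
degrees = [params["poly__degree"] for params in search.cv_results_["params"]]

for degree, mse in zip(degrees, cv_mse):
    print(f"degree={degree:2d}  mean CV MSE={mse:.4f}")
print("selected degree:", search.best_params_["poly__degree"])
```

The selected degree is the one with the lowest mean cross-validated MSE, which may differ from the degree with the lowest test-set MSE.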

## Task 2: Classification with Data Preprocessing (70 Points)

### Task 2.1 - Data Preprocessing (30 Points)

#### **Instructions**
1. Load the dataset from **`datasets/pokemon_modified.csv`**.
2. Look at the data and study the provided features.
3. Remove the **two redundant features**.
4. Handle **missing values**:
   - Use **mean imputation** for **"height_m"** and **"weight_kg"**.
   - Use **median imputation** for **"percentage_male"**.
5. Perform **one-hot encoding** for the categorical column **"type1"**.
6. Ensure the **target variable** (`"is_legendary"`) is present.
7. **Split the data into training and testing sets** (80%/20% split). Check whether the target classes are balanced; if they are not, stratify the split on the target.
8. **Apply feature scaling** using **StandardScaler** or **MinMaxScaler**.
9. Return the following:
   - `X_train_scaled`: Processed training features.
   - `X_test_scaled`: Processed testing features.
   - `y_train`: Training labels.
   - `y_test`: Testing labels.
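
The balance question in step 7 can be checked with `value_counts`, and a stratified split keeps the class proportions the same in both parts. The sketch below uses made-up labels with a 10% positive class. The real proportion in `pokemon_modified.csv` is not stated here and should be checked on the actual data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 10 positives out of 100 rows (hypothetical proportion).
y = pd.Series([1] * 10 + [0] * 90, name="is_legendary")
X = pd.DataFrame({"feature": range(100)})

# Share of each class in the full data set.
class_share = y.value_counts(normalize=True)
print(class_share)

# stratify=y keeps the class proportions (nearly) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

train_pos = y_train.mean()   # fraction of positives in the training split
test_pos = y_test.mean()     # fraction of positives in the test split
print(train_pos, test_pos)
```

Without `stratify`, a rare class can end up over- or under-represented in a small test set, which makes precision and recall unstable.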

#### **Function Signature**
```python
def task2_preprocessing() -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
```

```python
def task2_preprocessing() -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    file_path = "datasets/pokemon_modified.csv"
    data = pd.read_csv(file_path)

    # Drop the two redundant features.
    data = data.drop(columns=["name", "classification"])

    X = data.drop(columns=["is_legendary"])
    y = data["is_legendary"]

    # Split first, stratified on the target, so that every later
    # transformation is fitted on the training split only (no leakage).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=91
    )
    # Work on explicit copies so the column assignments below
    # do not trigger pandas' SettingWithCopyWarning.
    X_train = X_train.copy()
    X_test = X_test.copy()

    mean_features = ["height_m", "weight_kg"]
    median_features = ["percentage_male"]
    categorical_features = ["type1"]

    # Mean imputation: statistics come from the training split.
    mean_imputer = SimpleImputer(strategy="mean")
    X_train[mean_features] = mean_imputer.fit_transform(X_train[mean_features])
    X_test[mean_features] = mean_imputer.transform(X_test[mean_features])

    # Median imputation: statistics come from the training split.
    median_imputer = SimpleImputer(strategy="median")
    X_train[median_features] = median_imputer.fit_transform(X_train[median_features])
    X_test[median_features] = median_imputer.transform(X_test[median_features])

    # One-hot encode "type1"; drop="first" avoids a redundant column.
    # Note: sparse_output requires scikit-learn >= 1.2.
    encoder = OneHotEncoder(drop="first", sparse_output=False)
    encoded_train = encoder.fit_transform(X_train[categorical_features])
    encoded_test = encoder.transform(X_test[categorical_features])
    encoded_cols = encoder.get_feature_names_out(categorical_features)

    # Replace the categorical column with its encoded columns.
    # Indices are reset so the two frames align row by row in concat.
    X_train = pd.concat([
        X_train.drop(columns=categorical_features).reset_index(drop=True),
        pd.DataFrame(encoded_train, columns=encoded_cols)
    ], axis=1)

    X_test = pd.concat([
        X_test.drop(columns=categorical_features).reset_index(drop=True),
        pd.DataFrame(encoded_test, columns=encoded_cols)
    ], axis=1)

    # Scale using the mean and standard deviation of the training split.
    scaler = StandardScaler()
    X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
    X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

    return X_train_scaled, X_test_scaled, y_train, y_test
```

### Task 2.2 - Model Comparison (40 Points)

#### **Instructions**
1. **Train three classification models** on the preprocessed dataset:
   - **Logistic Regression**
   - **K-Nearest Neighbors (KNN)**
   - **Gaussian Naive Bayes (GNB)**
2. Use **GridSearchCV** for **hyperparameter tuning** on:
   - **Logistic Regression**: Regularization strength (`C`) and penalty (`l1`, `l2`).
   - **KNN**: Number of neighbors (`n_neighbors`), weight function, and distance metric.
3. Train each model on the **training set** and evaluate on the **test set**.
4. Compute the following **evaluation metrics**:
   - **Accuracy**
   - **Precision**
   - **Recall**
   - **F1 Score**
5. Return a dictionary containing the evaluation metrics for each model.
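
The four metrics in step 4 all follow from the counts of true/false positives and negatives. The sketch below computes them by hand on made-up labels and prints the scikit-learn results next to them for comparison. The label vectors are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up labels: 3 positives, 7 negatives (hypothetical example).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)          # of predicted positives, how many are right
recall = tp / (tp + fn)             # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```

When one class is rare, accuracy alone can look high even for a model that finds few positives, which is why precision, recall, and F1 are reported alongside it.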

#### **Function Signature**
```python
def task2_model_comparison() -> Dict[str, Dict[str, float]]:
```

```python
def task2_model_comparison() -> Dict[str, Dict[str, float]]:
    # Reuse the preprocessing from Task 2.1.
    X_train, X_test, y_train, y_test = task2_preprocessing()

    # liblinear supports both the l1 and l2 penalties
    # (the default lbfgs solver does not support l1).
    param_grid_logreg = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2'], 'solver': ['liblinear']}
    # p=1 is Manhattan distance, p=2 is Euclidean distance.
    param_grid_knn = {'n_neighbors': range(1, 11), 'weights': ['uniform', 'distance'], 'p': [1, 2]}

    # Tune Logistic Regression with 5-fold cross-validation on the training set.
    logreg = GridSearchCV(LogisticRegression(), param_grid_logreg, cv=5)
    logreg.fit(X_train, y_train)
    logreg_best = logreg.best_estimator_

    # Tune KNN with 5-fold cross-validation on the training set.
    knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=5)
    knn.fit(X_train, y_train)
    knn_best = knn.best_estimator_

    # Gaussian Naive Bayes is trained without hyperparameter tuning.
    nb = GaussianNB()
    nb.fit(X_train, y_train)

    models = {"Logistic Regression": logreg_best, "KNN": knn_best, "Naive Bayes": nb}
    results = {}

    # Evaluate every fitted model on the held-out test set.
    for name, model in models.items():
        y_pred = model.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred),
            "recall": recall_score(y_test, y_pred),
            "f1_score": f1_score(y_test, y_pred)
        }

    return results
```