# 📌 Machine Learning Assignment 1 - Instructions & Guidelines

### **📝 General Guidelines**
Welcome to Machine Learning Assignment 1! This assignment will test your understanding of **regression and classification models**, including **data preprocessing, hyperparameter tuning, and model evaluation**.

Follow the instructions carefully, and ensure your implementation is **correct, well-structured, and efficient**.

🔹 **Submission Format:**  
- Your submission **must be a single Jupyter Notebook (.ipynb)** file.  
- **File Naming Convention:**  
  - Use **your university email as the filename**, e.g.,  
    ```
    j.doe@innopolis.university.ipynb
    ```
  - **Do NOT modify this format**, or your submission may not be graded.

🔹 **Assignment Breakdown:**
| Task | Description | Points |
|------|------------|--------|
| **Task 1.1** | Linear Regression | 20 |
| **Task 1.2** | Polynomial Regression | 20 |
| **Task 2.1** | Data Preprocessing | 15 |
| **Task 2.2** | Model Comparison | 45 |
| **Total** | - | **100** |

---

### **📂 Dataset & Assumptions**
The dataset files are stored in the `datasets/` folder.  
- **Regression Dataset:** `datasets/task1_data.csv`
- **Classification Dataset:** `datasets/pokemon_modified.csv`

Each dataset is structured as follows:

🔹 **`task1_data.csv` (for regression tasks)**  
- Contains `X_train`, `y_train`, `X_test`, and `y_test`.  
- The goal is to fit **linear and polynomial regression models** and evaluate their performance.  

🔹 **`pokemon_modified.csv` (for classification tasks)**  
- Contains Pokémon attributes, with `is_legendary` as the **binary target variable (0 or 1)**.  
- Some features contain **missing values** and **categorical variables**, requiring preprocessing.

---

### **🚀 How to Approach the Assignment**
1. **Start with Regression (Task 1)**
   - Implement **linear regression** and **polynomial regression**.
   - Use **GridSearchCV** for polynomial regression to find the best degree.
   - Evaluate using **MSE, RMSE, MAE, and R² Score**.

2. **Move to Data Preprocessing (Task 2.1)**
   - Load and clean the Pokémon dataset.
   - Handle **missing values** correctly.
   - Encode categorical variables properly.
   - Ensure **no data leakage** when doing the preprocessing.

3. **Train and Evaluate Classification Models (Task 2.2)**
   - Train **Logistic Regression, KNN, and Naive Bayes**.
   - Use **GridSearchCV** for hyperparameter tuning.
   - Evaluate models using **Accuracy, Precision, Recall, and F1-score**.

---

### **📌 Grading & Evaluation**
- Your notebook will be **autograded**, so ensure:
  - Your function names **exactly match** the given specifications.
  - Your output format matches the expected results.
- Partial credit will be given where applicable.

🔹 **Need Help?**  
- If you have any questions, refer to the **assignment markdown instructions** in each task before asking for clarifications.
- If your question is still unanswered, post it in this [Google sheet](https://docs.google.com/spreadsheets/d/1oyrqXDjT2CeGYx12aZhZ-oDKcQQ-PCgT91wHPhTlBCY/edit?usp=sharing).

🚀 **Good luck! Happy coding!** 🎯

### FAQ

**1) Should we include the lines to import the libraries?**

- **Answer:**  
  It doesn't matter if you include extra import lines, as the grader will only call the specified functions.

**2) Is it okay to submit my file with code outside of the functions?**

- **Answer:**  
  Yes, you can include additional code outside of the functions as long as the entire script runs correctly when converted to a `.py` file.

**Important Clarification:**

- The grader will first convert the Jupyter Notebook (.ipynb) into a Python file (.py) and then run it.
- **Note:** Do not include shell commands such as `!pip install numpy`. They can break the conversion, in which case the submission will not be graded.
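
Because the notebook is converted to a script before it is run, it can help to scan it for shell escapes and magics before submitting. The sketch below is a hypothetical helper, not part of the grader: the name `find_shell_lines` and the in-memory sample notebook are assumptions for illustration. It relies only on the fact that an `.ipynb` file (nbformat 4) is JSON with a `cells` list.

```python
import json
from typing import List, Tuple


def find_shell_lines(notebook_json: str) -> List[Tuple[int, str]]:
    """Return (cell_index, line) pairs for code lines starting with '!' or '%'.

    Hypothetical pre-submission check; the grader itself may work differently.
    """
    notebook = json.loads(notebook_json)
    hits = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # The notebook format stores source as a list of lines or as one string.
        text = "".join(source) if isinstance(source, list) else source
        for line in text.splitlines():
            if line.lstrip().startswith(("!", "%")):
                hits.append((index, line.strip()))
    return hits


# Tiny in-memory notebook used only for illustration.
sample = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["!pip install numpy\n", "import numpy as np\n"]},
        {"cell_type": "code", "source": "x = 1 % 2\n"},
    ]
})

print(find_shell_lines(sample))  # [(1, '!pip install numpy')]
```

To check a real file, read it with `open(path, encoding="utf-8").read()` and pass the resulting string to the helper.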

## Task 1: Linear and Polynomial Regression (30 Points)

### Task 1.1 - Linear Regression (15 Points)
#### **Instructions**
1. Load the dataset from **`datasets/task1_data.csv`**.
2. Extract training and testing data from the following columns:
   - `"X_train"`: Training feature values.
   - `"y_train"`: Training target values.
   - `"X_test"`: Testing feature values.
   - `"y_test"`: Testing target values.
3. Train a **linear regression model** on `X_train` and `y_train`.
4. Use the trained model to predict `y_test` values.
5. Compute and return the following **evaluation metrics** as a dictionary:
   - **Mean Squared Error (MSE)**
   - **Root Mean Squared Error (RMSE)**
   - **Mean Absolute Error (MAE)**
   - **R² Score**
6. The function signature should match:
   ```python
   def task1_linear_regression() -> Dict[str, float]:
   ```

Please do not use any other libraries except for the ones imported below.

In [1]:
# Standard Library Imports
import os
import importlib.util
import nbformat
from tempfile import NamedTemporaryFile
from typing import Tuple, Dict


# Third-Party Library Imports
import numpy as np
import pandas as pd

from nbconvert import PythonExporter

# Scikit-Learn Imports
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             mean_squared_error, mean_absolute_error, r2_score)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

In [3]:
def task1_linear_regression() -> Dict[str, float]:

  # Load the dataset from datasets/task1_data.csv

  df = pd.read_csv("datasets/task1_data.csv")

  # Extract training and testing data by column name, as specified in the instructions
  X_train = df["X_train"].values.reshape(-1, 1)
  y_train = df["y_train"].values.reshape(-1, 1)
  X_test = df["X_test"].values.reshape(-1, 1)
  y_test = df["y_test"].values.reshape(-1, 1)

  # Train a linear regression model on `X_train, y_train`
  regressor = LinearRegression()
  regressor.fit(X_train, y_train)

  # Use the trained model to predict `y_test` values
  y_pred = regressor.predict(X_test)

  # Compute evaluation metrics: MSE, RMSE, MAE, R² Score
  mse = mean_squared_error(y_test, y_pred)
  rmse = np.sqrt(mse)
  mae = mean_absolute_error(y_test, y_pred)  # y_true first, matching the other metric calls
  r2 = r2_score(y_test, y_pred)

  return {
      "MSE": mse,
      "RMSE": rmse,
      "MAE": mae,
      "R2": r2
  }

In [4]:
linear_model = task1_linear_regression()
linear_model

{'MSE': 0.78105677092199,
 'RMSE': 0.8837741628504365,
 'MAE': 0.7837610302414408,
 'R2': 0.2609450135378707}
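
The four metrics reported above follow directly from their definitions. As a minimal sketch with made-up numbers (the lists below are illustrative and not taken from `task1_data.csv`; `regression_metrics` is a hypothetical helper), they can be computed in plain Python. RMSE is the square root of MSE, and R² compares the model's squared error with the squared error of always predicting the mean of `y_true`.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute MSE, RMSE, MAE and R² from their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "R2": 1.0 - ss_res / ss_tot}


# Illustrative values only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

With these toy values MSE is 0.125, MAE is 0.25, and R² is 0.9. An R² close to 0 means the model does little better than predicting the mean.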

### Task 1.2 - Polynomial Regression (15 Points)

#### **Instructions**
1. Load the dataset from **`datasets/task1_data.csv`**.
2. Extract training and testing data from the following columns:
   - `"X_train"`: Training feature values.
   - `"y_train"`: Training target values.
   - `"X_test"`: Testing feature values.
   - `"y_test"`: Testing target values.
3. Define a **pipeline** that includes:
   - **Polynomial feature transformation** (degree range: **2 to 10**).
   - **Linear regression model**.
4. Use **GridSearchCV** with **8-fold cross-validation** to determine the best polynomial degree.
5. Train the model with the best polynomial degree and **evaluate it on the test set**.
6. Compute and return the following results as a dictionary:
   - **Best polynomial degree** (`best_degree`)
   - **Mean Squared Error (MSE)**
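
In k-fold cross-validation the training rows are split into k folds, and each fold is held out once for validation while the model is fitted on the rest. The sketch below partitions row indices into 8 contiguous folds. It is a simplified stand-in for what the library does internally: `kfold_indices` is a hypothetical helper, shuffling is omitted, and the sample count of 20 is made up.

```python
from typing import List, Tuple


def kfold_indices(n_samples: int, n_folds: int) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, validation_indices) for each contiguous fold."""
    base, extra = divmod(n_samples, n_folds)
    splits = []
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds receive one additional row.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        splits.append((train, validation))
        start = stop
    return splits


splits = kfold_indices(n_samples=20, n_folds=8)
print([len(validation) for _, validation in splits])  # [3, 3, 3, 3, 2, 2, 2, 2]
```

Every row appears in exactly one validation fold, so each candidate degree is scored on data it was not fitted on.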

#### **Function Signature**
```python
def task1_polynomial_regression() -> Dict[str, float]:
```

In [5]:
def task1_polynomial_regression() -> Dict[str, float]:

  # Load the dataset from datasets/task1_data.csv
  df = pd.read_csv("datasets/task1_data.csv")

  # Extract training and testing data by column name, as specified in the instructions
  X_train = df["X_train"].values.reshape(-1, 1)
  y_train = df["y_train"].values.reshape(-1, 1)
  X_test = df["X_test"].values.reshape(-1, 1)
  y_test = df["y_test"].values.reshape(-1, 1)

  # Candidate polynomial degrees: 2 through 10 inclusive
  degrees = np.arange(2, 11)

  # Define a pipeline: polynomial feature transformation followed by linear regression
  polynomial_features = PolynomialFeatures()
  linear_regression = LinearRegression()

  pipeline = Pipeline([
    ("polynomial_features", polynomial_features),
    ("linear_regression", linear_regression)
  ])

  # Set up the parameter grid for GridSearchCV to search over polynomial degrees
  param_grid = {'polynomial_features__degree': degrees}

  # Run 8-fold cross-validation; by default the best pipeline is refitted on the full training set
  grid_search = GridSearchCV(pipeline, param_grid, cv=8, scoring='neg_mean_squared_error')
  grid_search.fit(X_train, y_train)

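  # Cast to a plain int so the returned value is not a NumPy scalar
  best_degree = int(grid_search.best_params_['polynomial_features__degree'])

  # Evaluate the refitted best model on the held-out test set
  y_pred = grid_search.best_estimator_.predict(X_test)

  mse = mean_squared_error(y_test, y_pred)

  return {
      "best_degree": best_degree,
      "MSE": mse
  }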

In [6]:
polynomial_model = task1_polynomial_regression()
polynomial_model

{'best_degree': 2, 'MSE': 0.08205877217937993}
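
For a single input feature, the polynomial transformation in the pipeline amounts to adding powers of `x` as extra columns, after which ordinary linear regression is fitted on those columns. Below is a minimal sketch of that expansion in plain Python, assuming the default `PolynomialFeatures` behaviour of including a constant bias column. The helper name `expand_polynomial` is hypothetical and the inputs are made up.

```python
from typing import List, Sequence


def expand_polynomial(xs: Sequence[float], degree: int) -> List[List[float]]:
    """Map each x to the row [1, x, x**2, ..., x**degree]."""
    return [[x ** power for power in range(degree + 1)] for x in xs]


rows = expand_polynomial([2.0, 3.0], degree=3)
print(rows)  # [[1.0, 2.0, 4.0, 8.0], [1.0, 3.0, 9.0, 27.0]]
```

A higher degree adds more columns and therefore more flexibility, which is why the degree is selected by cross-validated error rather than by training error.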

## Task 2: Classification with Data Preprocessing (70 Points)

### Task 2.1 - Data Preprocessing (30 Points)

#### **Instructions**
1. Load the dataset from **`datasets/pokemon_modified.csv`**.
2. Look at the data and study the provided features.
3. Remove the **two redundant features**.
4. Handle **missing values**:
   - Use **mean imputation** for **"height_m"** and **"weight_kg"**.
   - Use **median imputation** for **"percentage_male"**.
5. Perform **one-hot encoding** for the categorical column **"type1"**.
6. Ensure the **target variable** (`"is_legendary"`) is present.
7. **Split the data into training and testing sets** (`80%-20%` split). Is it balanced?
8. **Apply feature scaling** using **StandardScaler** or **MinMaxScaler**.
9. Return the following:
   - `X_train_scaled`: Processed training features.
   - `X_test_scaled`: Processed testing features.
   - `y_train`: Training labels.
   - `y_test`: Testing labels.
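
To answer whether the target is balanced, it is enough to look at the share of each class. The sketch below does this for a toy label list (the labels are illustrative and not the Pokémon data; `class_shares` is a hypothetical helper).

```python
from collections import Counter
from typing import Dict, Sequence


def class_shares(labels: Sequence[int]) -> Dict[int, float]:
    """Return the fraction of rows belonging to each class label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}


toy_labels = [0] * 9 + [1]       # illustrative only
print(class_shares(toy_labels))  # {0: 0.9, 1: 0.1}
```

When one class is rare, passing `stratify=y` to `train_test_split` keeps the class shares similar in the training and testing splits.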

#### **Function Signature**
```python
def task2_preprocessing() -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
```

In [7]:
def task2_preprocessing() -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:

  # Load the dataset
  data = pd.read_csv('datasets/pokemon_modified.csv')

  # Remove the two redundant columns ("name" and "classification")
  data.drop(columns=["name", "classification"], inplace=True)

  # Handle missing values
  imputer_mean = SimpleImputer(strategy='mean')
  imputer_median = SimpleImputer(strategy='median')

  data['height_m'] = imputer_mean.fit_transform(data[['height_m']])
  data['weight_kg'] = imputer_mean.fit_transform(data[['weight_kg']])
  data['percentage_male'] = imputer_median.fit_transform(data[['percentage_male']])

  # Perform **one-hot encoding** on `"type1"`
  encoder = OneHotEncoder(sparse_output=False, drop='first')
  type1_encoded = encoder.fit_transform(data[['type1']])
  type1_df = pd.DataFrame(type1_encoded, columns=encoder.get_feature_names_out(['type1']))

  # Merge the one-hot encoded columns back to the original dataframe
  data = pd.concat([data, type1_df], axis=1)
  data.drop(columns=['type1'], inplace=True)

  # Split dataset into features X and target y
  X = data.drop(columns=['is_legendary'])
  y = data['is_legendary']

  # Split the dataset into 80% training and 20% testing, stratified on the target to preserve the class ratio
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

  # Apply feature scaling (MinMaxScaler), fitting on the training split only
  scaler = MinMaxScaler()
  scaler.fit(X_train)
  X_train_scaled = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)
  X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

  return X_train_scaled, X_test_scaled, y_train, y_test
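
One way to meet the no-leakage requirement is to learn every preprocessing statistic (imputation means and medians, scaler ranges) from the training split only and then reuse it on the test split. The sketch below shows the idea for mean imputation in plain Python with toy values. `fit_mean` and `impute` are hypothetical helpers, and `None` stands in for a missing value.

```python
from typing import List, Optional


def fit_mean(values: List[Optional[float]]) -> float:
    """Mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)


def impute(values: List[Optional[float]], fill: float) -> List[float]:
    """Replace missing entries with a previously learned fill value."""
    return [fill if v is None else v for v in values]


# Toy columns, illustrative only.
train_height = [1.0, None, 3.0]
test_height = [None, 10.0]

train_mean = fit_mean(train_height)            # learned from training rows only
train_filled = impute(train_height, train_mean)
test_filled = impute(test_height, train_mean)  # the test split reuses the training mean

print(train_mean, train_filled, test_filled)   # 2.0 [1.0, 2.0, 3.0] [2.0, 10.0]
```

Had the mean been computed over both columns together, the test value 10.0 would have pulled the fill value upward, which is exactly the kind of information flow from test to train that leakage refers to. With scikit-learn the same pattern is `fit` on the training split followed by `transform` on both splits.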

### Task 2.2 - Model Comparison (40 Points)

#### **Instructions**
1. **Train three classification models** on the preprocessed dataset:
   - **Logistic Regression**
   - **K-Nearest Neighbors (KNN)**
   - **Gaussian Naive Bayes (GNB)**
2. Use **GridSearchCV** for **hyperparameter tuning** on:
   - **Logistic Regression**: Regularization strength (`C`) and penalty (`l1`, `l2`).
   - **KNN**: Number of neighbors (`n_neighbors`), weight function, and distance metric.
3. Train each model on the **training set** and evaluate on the **test set**.
4. Compute the following **evaluation metrics**:
   - **Accuracy**
   - **Precision**
   - **Recall**
   - **F1 Score**
5. Return a dictionary containing the evaluation metrics for each model.

#### **Function Signature**
```python
def task2_model_comparison() -> Dict[str, Dict[str, float]]:
```

In [8]:
def task2_model_comparison() -> Dict[str, Dict[str, float]]:

  # Load the preprocessed dataset from `task2_preprocessing()`
  X_train, X_test, y_train, y_test = task2_preprocessing()

  # Instantiate the three classifiers
  logistic_regression = LogisticRegression()
  knn = KNeighborsClassifier()
  gnb = GaussianNB()

  # Logistic Regression grid: each penalty is paired only with solvers that support it
  logistic_regression_params = [
    {
      'penalty': ['l1'],
      'solver': ['liblinear', 'saga'],
      'C': [0.001, 0.01, 0.1, 1, 10, 100]
    },
    {
      'penalty': ['l2'],
      'solver': ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'],
      'C': [0.001, 0.01, 0.1, 1, 10, 100]
    },
    {
      # Unpenalized baseline; C is ignored by the model when penalty is None
      'penalty': [None],
      'solver': ['newton-cholesky', 'lbfgs', 'sag', 'saga'],
      'C': [0.001, 0.01, 0.1, 1, 10, 100]
    }
  ]

  # KNN grid: number of neighbors, weight function, and distance metric
  knn_param_grid = {
    'n_neighbors': list(range(1, 11)),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'chebyshev']
  }

  # Hyperparameter search with 8-fold cross-validation
  logistic_regression_grid_search = GridSearchCV(logistic_regression, logistic_regression_params, cv=8, n_jobs=-1)
  knn_grid_search = GridSearchCV(knn, knn_param_grid, cv=8, n_jobs=-1)

  # Train models (Gaussian Naive Bayes is fitted directly, without a search)
  logistic_regression_grid_search.fit(X_train, y_train)
  knn_grid_search.fit(X_train, y_train)
  gnb.fit(X_train, y_train)

  models = {
    "Logistic Regression": logistic_regression_grid_search.best_estimator_,
    "KNN": knn_grid_search.best_estimator_,
    "Naive Bayes": gnb
  }

  evaluation_metrics = {}

  for model_name, model in models.items():
      y_pred = model.predict(X_test)

      # Calculate metrics on the held-out test set
      metrics = {
          "accuracy": accuracy_score(y_test, y_pred),
          "precision": precision_score(y_test, y_pred),
          "recall": recall_score(y_test, y_pred),
          "f1_score": f1_score(y_test, y_pred)
      }
      evaluation_metrics[model_name] = metrics

  return evaluation_metrics

In [None]:
classification_model = task2_model_comparison()
classification_model

{'Logistic Regression': {'accuracy': 0.9875776397515528,
  'precision': 1.0,
  'recall': 0.8571428571428571,
  'f1_score': 0.9230769230769231},
 'KNN': {'accuracy': 0.9751552795031055,
  'precision': 0.9166666666666666,
  'recall': 0.7857142857142857,
  'f1_score': 0.8461538461538461},
 'Naive Bayes': {'accuracy': 0.906832298136646,
  'precision': 0.48148148148148145,
  'recall': 0.9285714285714286,
  'f1_score': 0.6341463414634146}}
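
The four classification metrics can all be derived from the confusion-matrix counts. The sketch below shows the formulas in plain Python. The counts are an assumption: they are one confusion matrix (161 test rows, 14 of them legendary) that reproduces the Naive Bayes row above, inferred from the reported numbers rather than read from the actual run. `classification_metrics` is a hypothetical helper.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for the positive class."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1_score": 2 * precision * recall / (precision + recall),
    }


# Assumed counts, inferred from the reported metrics (not taken from the run itself).
nb = classification_metrics(tp=13, fp=14, fn=1, tn=133)
print(nb)
```

Under these assumed counts, accuracy stays above 0.90 even though more than half of the predicted positives are wrong. With a rare positive class, precision, recall, and F1 are therefore more informative than accuracy alone.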