# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name: Paolo Geronimo

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [2]:
import numpy as np
import pandas as pd

## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected. 

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [3]:
# Import dataset (1 mark)

# The dataset I chose is the "Computer Hardware" dataset from the UCI ML Repo.
# The dataset's target variable is CPU Estimated Relative Performance (ERP),
# and the feature set is a particular CPU's specifications.
# Link to dataset is found below in the answer to question 1.

#using the column names found in machine.names
column_names = ['Vendor', 'Model', 'MYCT', 'MMIN', 'MMAX', 'CACH', 'CHMIN', 'CHMAX', 'PRP', 'ERP']
X = pd.read_csv('machine.data', names = column_names)

# creating feature matrix and target vector
y = X['ERP']
X.drop('ERP', axis = 1, inplace = True)

### Questions (3 marks)

1. (1 mark) What is the source of your dataset?
1. (1 mark) Why did you pick this particular dataset?
1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

*ANSWER HERE*
1. My dataset is sourced from the UCI Machine Learning repository: https://archive.ics.uci.edu/dataset/29/computer+hardware
2. I picked this dataset because one of my hobbies is building PCs, so an analysis on this dataset would be interesting to me. It is also a mix of categorical and continuous features, so I picked it to help me practice scaling and encoding.
3. There wasn't anything challenging in particular about finding a dataset. I looked at previous assignments to see where those datasets were from, so I checked the source to see if there were any other interesting datasets.

## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [4]:
# Clean data (if needed)
print (f"Shape of X: {X.shape}")
print ("Type of X:")
print (X.dtypes)

print (f"Shape of y: {y.shape}")
print ("Type of y:")
print (y.dtypes)

Shape of X: (209, 9)
Type of X:
Vendor    object
Model     object
MYCT       int64
MMIN       int64
MMAX       int64
CACH       int64
CHMIN      int64
CHMAX      int64
PRP        int64
dtype: object
Shape of y: (209,)
Type of y:
int64


In [5]:
# check nulls
print(X.isnull().sum())
print(y.isnull().sum())

Vendor    0
Model     0
MYCT      0
MMIN      0
MMAX      0
CACH      0
CHMIN     0
CHMAX     0
PRP       0
dtype: int64
0


In [6]:
# Since there are potentially lots of unique model names, let's take a look and see if we should drop the column
print(X['Model'].nunique())

209


In [7]:
# There are 209 unique model names, which makes sense since there are 209 samples in the dataset.
# Using this as a categorical feature wouldn't make much sense so I'll drop it.

X.drop('Model', axis = 1, inplace = True)
print(X.shape)
X.head()

(209, 8)


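|   | Vendor | MYCT | MMIN | MMAX | CACH | CHMIN | CHMAX | PRP |
|---|---|---|---|---|---|---|---|---|
| 0 | adviser | 125 | 256 | 6000 | 256 | 16 | 128 | 198 |
| 1 | amdahl | 29 | 8000 | 32000 | 32 | 8 | 32 | 269 |
| 2 | amdahl | 29 | 8000 | 32000 | 32 | 8 | 32 | 220 |
| 3 | amdahl | 29 | 8000 | 32000 | 32 | 8 | 32 | 172 |
| 4 | amdahl | 29 | 8000 | 16000 | 32 | 8 | 16 | 132 |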


In [8]:
# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed

# The Vendor feature is categorical, so I will use One Hot Encoding.
# Two of the models I chose require scaling, so I will use a StandardScaler
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

ct = make_column_transformer(
    (OneHotEncoder(sparse_output = False, handle_unknown = 'ignore'), ['Vendor']),
    (StandardScaler(), ['MYCT', 'MMIN', 'MMAX', 'CACH', 'CHMIN', 'CHMAX', 'PRP']))


### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.
2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?

*ANSWER HERE*
1. There were no missing values in my dataset. If there were, I would probably replace each one with the mean of that column for the same vendor, rather than the mean of the entire column, since machines from the same vendor are more likely to have similar specifications.

2. My data types were 1 string feature, and 7 integers. Based on this, I used one hot encoding on the string, and standard scaling on the integers.
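
The vendor-specific imputation described in answer 1 could look roughly like the sketch below. The toy frame and its values are hypothetical (the real dataset has no missing values), and only the `Vendor` and `MYCT` column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame for illustration; the real dataset has no nulls.
toy = pd.DataFrame({
    "Vendor": ["amdahl", "amdahl", "amdahl", "ibm", "ibm"],
    "MYCT": [29.0, np.nan, 31.0, 100.0, np.nan],
})

# Mean MYCT per vendor, aligned row by row with the original frame.
vendor_mean = toy.groupby("Vendor")["MYCT"].transform("mean")

# Fill each missing value with the mean of the same vendor.
toy["MYCT"] = toy["MYCT"].fillna(vendor_mean)

# Fallback: a vendor with no observed values would still be NaN,
# so anything left is filled with the overall column mean.
toy["MYCT"] = toy["MYCT"].fillna(toy["MYCT"].mean())

print(toy)
```

In a real workflow the vendor means would ideally be computed from the training split only, so that no information from the test set leaks into the imputed values.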

## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [10]:
# Implement pipeline and grid search here. Can add more code blocks if necessary
# I will use the Lasso, SVR, and KNN regression models

from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
                                                    
# Lasso pipeline
las_pipe = Pipeline ([
    ('column transformer', ct),
    ('classifier', Lasso())
])

# SVR pipeline
svr_pipe = Pipeline ([
    ('column transformer', ct),
    ('classifier', SVR())
])

# KNN regression pipeline
knn_pipe = Pipeline ([
    ('column transformer', ct),
    ('classifier', KNeighborsRegressor())
])


In [11]:
# lasso parameter grid
las_param_grid = {
    'classifier__alpha': [0.1, 1, 10],
    'classifier__max_iter': [1000, 2000, 3000]
}

# SVR parameter grid
svr_param_grid = {
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['rbf', 'linear', 'poly']
}

# KNN parameter grid
knn_param_grid = {
    'classifier__n_neighbors': [5, 10, 15],
    'classifier__weights': ['uniform', 'distance']
}


In [12]:
from sklearn.model_selection import GridSearchCV
# GridSearchCV instances
grid_search_las = GridSearchCV(las_pipe, las_param_grid, cv=5, scoring = 'r2')
grid_search_svr = GridSearchCV(svr_pipe, svr_param_grid, cv=5, scoring = 'r2')
grid_search_knn = GridSearchCV(knn_pipe, knn_param_grid, cv=5, scoring = 'r2')

# fitting
grid_search_las.fit(X_train, y_train)
grid_search_svr.fit(X_train, y_train)
grid_search_knn.fit(X_train, y_train)

In [13]:
# results 
las_results = grid_search_las.cv_results_                                
svr_results = grid_search_svr.cv_results_
knn_results = grid_search_knn.cv_results_

print("Lasso model results:")
print(f"Best parameters: {grid_search_las.best_params_}")
print(f"R2 Scores: {las_results['mean_test_score']}")

print("\nSVR model results:")
print(f"Best parameters: {grid_search_svr.best_params_}")
print(f"R2 Scores: {svr_results['mean_test_score']}")

print("\nKNN model results:")
print(f"Best parameters: {grid_search_knn.best_params_}")
print(f"R2 Scores: {knn_results['mean_test_score']}")

Lasso model results:
Best parameters: {'classifier__alpha': 1, 'classifier__max_iter': 1000}
R2 Scores: [0.8795304  0.8795304  0.8795304  0.88619392 0.88619392 0.88619392
 0.88588897 0.88588897 0.88588897]

SVR model results:
Best parameters: {'classifier__C': 10, 'classifier__kernel': 'linear'}
R2 Scores: [-0.17041403  0.23749559  0.26346383 -0.04344277  0.85930605 -0.2891827
  0.39882262  0.91363379  0.09084528]

KNN model results:
Best parameters: {'classifier__n_neighbors': 5, 'classifier__weights': 'distance'}
R2 Scores: [0.84410499 0.87177212 0.76774506 0.82271625 0.70731536 0.7725932 ]


### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
1. (2 marks) Which models did you select for testing and why?
1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

*ANSWER HERE*
1. This was a regression problem, since the target variable, estimated relative performance, is a continuous value.
2. I chose the Lasso, SVR, and KNN regression models. I chose the Lasso model since we haven't worked with it since assignment 2, and I chose the SVR and KNN models since we discussed the classification versions, but not the regression versions.
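3. The SVR model worked best: with a linear kernel and C = 10 it had the highest mean cross-validation R2 (~0.914), ahead of the best Lasso (~0.886) and KNN (~0.872) results. The Lasso model had the most consistently good results across its grid (~0.88 to ~0.89), while the SVR scores varied widely with the kernel and C, from about -0.29 up to ~0.914. This makes sense because SVRs are more flexible, although the winning kernel was linear, which suggests that the relationship between the specifications and ERP is close to linear.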
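
One way to compare several grid searches on equal footing is to tabulate each search's best mean cross-validation score together with its standard deviation across folds. The sketch below is only an illustration: it uses synthetic data from `make_regression` rather than the Computer Hardware dataset, and the step names `scale` and `model` are placeholders, not the names used in the notebook.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the assignment's dataset).
X_demo, y_demo = make_regression(
    n_samples=120, n_features=5, noise=10.0, random_state=0
)

# One grid search per model, all scored with R2 over 5 folds.
searches = {
    "Lasso": GridSearchCV(
        Pipeline([("scale", StandardScaler()), ("model", Lasso())]),
        {"model__alpha": [0.1, 1, 10]},
        cv=5, scoring="r2",
    ),
    "KNN": GridSearchCV(
        Pipeline([("scale", StandardScaler()), ("model", KNeighborsRegressor())]),
        {"model__n_neighbors": [5, 10, 15]},
        cv=5, scoring="r2",
    ),
}

rows = []
for name, search in searches.items():
    search.fit(X_demo, y_demo)
    rows.append({
        "model": name,
        "best_params": search.best_params_,
        # Mean R2 over the folds for the best parameter combination.
        "best_mean_r2": search.best_score_,
        # Fold-to-fold spread for that same combination.
        "std_r2": search.cv_results_["std_test_score"][search.best_index_],
    })

summary = pd.DataFrame(rows).sort_values("best_mean_r2", ascending=False)
print(summary.to_string(index=False))
```

Reading `best_score_` alongside the fold standard deviation makes it easier to tell whether a small gap between two models is meaningful or within the noise of the folds.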

## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [14]:
# Calculate testing accuracy (1 mark)
print("Best model testing accuracy:" , grid_search_svr.score(X_test, y_test))

Best model testing accuracy: 0.9345877481969539



### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose? 
1. (1 mark) How do these results compare to those in part 3? Did this model generalize well?
1. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

*ANSWER HERE*
1. I chose to use R2 score since this is a regression task.
2. The test score is ~0.935, which is fairly close to the best cross-validation score of ~0.914 from Step 3. I would say this model did generalize well.
3. I would say yes, this model did perform "well enough" to be used in the real world. The R2 scores are high on both the cross-validation folds and the test set, and I don't think predicting CPU performance requires perfect accuracy. My suggestion to improve the analysis would be to retrain the model using current hardware as the samples, since this dataset is from 1987.
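
A minimal sketch of the comparison in answer 2, again on synthetic data rather than the assignment's dataset: it reads the best cross-validation R2 from `best_score_`, computes the test R2 from held-out predictions, and also reports mean absolute error as one possible extra metric (an addition for illustration, not something the notebook computes). The names `X_demo`, `search`, and the step names are placeholders.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption, not the assignment's dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, noise=15.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

search = GridSearchCV(
    Pipeline([("scale", StandardScaler()), ("model", SVR(kernel="linear"))]),
    {"model__C": [0.1, 1, 10]},
    cv=5, scoring="r2",
)
search.fit(X_tr, y_tr)

# Predictions from the best pipeline, refit on the whole training split.
y_pred = search.predict(X_te)

cv_r2 = search.best_score_                  # mean R2 over the CV folds
test_r2 = r2_score(y_te, y_pred)            # R2 on the held-out test split
test_mae = mean_absolute_error(y_te, y_pred)  # error in target units

print(f"Best CV R2: {cv_r2:.3f}")
print(f"Test R2:    {test_r2:.3f}")
print(f"Test MAE:   {test_mae:.3f}")
```

A test R2 close to the cross-validation R2 is a sign that the model generalizes; an error metric in the target's own units can make the "well enough" question easier to judge.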

## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*
1. My code is sourced from the encoding example from D2L, and the Lab 6 solution. I also looked at the sklearn documentation for each of the 3 models to look at what hyperparameters I can tune. 

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html

2. I completed the steps in the order they appear in the notebook.
3. I did not use generative AI.
4. I think the biggest challenge I faced was during the preprocessing stage. Specifically, I wasn't exactly sure how to set up the preprocessing steps with the pipeline. At first, I encoded the data as its own step, then put the scaler in my pipeline. Then I realized the best way to do it would be to use a column transformer, and put that in the pipeline.

## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

I liked being able to choose my own dataset. While the datasets we have worked on in the past have been interesting to me, it was fun being able to work with a dataset that involves my own personal interests. It was challenging to figure out the best way to set up the pipeline with two preprocessing methods, but it was rewarding to see the code work and print the results.