# Assignment 4: Pipelines and Hyperparameter Tuning (52 total marks)
### Due: March 19 at 11:59pm

### Name:

The purpose of this assignment is to practice following the grid-search workflow:
- Split data into training and test set
- Use the training portion to find the best model using grid search and cross-validation
- Retrain the best model
- Evaluate the retrained model on the test set
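
The four steps above can be sketched end to end on a small synthetic dataset. This is only an illustration under stated assumptions: `make_classification` stands in for the real data, and the variable names (`X_demo`, `search`, and so on) are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: not the assignment's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# 1. Split data into training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# 2. Grid search with 5-fold cross-validation on the training portion only
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1, 10]},
    cv=5,
)
search.fit(X_tr, y_tr)

# 3. With refit=True (the default), the best model is retrained on all of X_tr
best_model = search.best_estimator_

# 4. Evaluate the retrained model once on the held-out test set
test_accuracy = best_model.score(X_te, y_te)
print(search.best_params_, round(test_accuracy, 3))
```

Because `GridSearchCV` refits by default, the "retrain" step usually needs no extra code; `search.score(...)` and `best_model.score(...)` use the same refitted estimator.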

In [40]:
import numpy as np
import pandas as pd
from yellowbrick.datasets import load_mushroom
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

## Part 1: Classification (21 marks)

### 1.1: Load data (2 marks)
For this task, we will be using the yellowbrick mushroom dataset. This dataset uses physical characteristics of mushrooms to predict whether or not the mushroom is poisonous.

More information on the dataset can be found here:
https://www.scikit-yb.org/en/latest/api/datasets/mushroom.html

#### Prepare the feature matrix and target vector

Using the yellowbrick `load_mushroom()` function, load the mushroom data set into feature matrix `X` and target vector `y`

Print the shape of `X` and `y`

In [41]:
# yellowbrick should already be installed from previous assignments;
# this makes sure it is available in the current environment
!pip install yellowbrick



In [42]:
# TODO: Load the dataset
X, y = load_mushroom()

# TODO: Print the shape of X and y
print(X)
print('-------------------------------')
print(y)
print('-------------------------------')
print('Shape of X:', X.shape)
print('Shape of y:', y.shape)

        shape surface   color
0      convex  smooth  yellow
1        bell  smooth   white
2      convex   scaly   white
3      convex  smooth    gray
4      convex   scaly  yellow
...       ...     ...     ...
8118  knobbed  smooth   brown
8119   convex  smooth   brown
8120     flat  smooth   brown
8121  knobbed   scaly   brown
8122   convex  smooth   brown

[8123 rows x 3 columns]
-------------------------------
0          edible
1          edible
2       poisonous
3          edible
4          edible
          ...    
8118       edible
8119       edible
8120       edible
8121    poisonous
8122       edible
Name: target, Length: 8123, dtype: object
-------------------------------
Shape of X: (8123, 3)
Shape of y: (8123,)


### 1.2: Pre-processing (3 marks)
In this dataset, all the features are categorical, so they need to be encoded. We will use `OneHotEncoder(sparse_output=False)` for this case

In [43]:
# TODO: Create OneHotEncoder object
encoder = OneHotEncoder(sparse_output=False)

The next step is to build a pipeline to combine the encoding with the selected machine learning method. To initialize the pipeline, we will use `LogisticRegression(max_iter=1000)` as a placeholder

In [44]:
# TODO: Build the pipeline
logistic_regression = LogisticRegression(max_iter=1000)
pipeline = Pipeline([
    ('the_encoder', encoder),
    ('classifier', logistic_regression)
])

The next step is to split the data into training and testing sets. Use `test_size=0.1, stratify=y, random_state=42`
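
To see what `stratify` does, the following sketch splits a hypothetical 70/30 label vector with the same `test_size` and `random_state`. The data and the names `labels` and `features` are invented for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 edible, 30 poisonous
labels = pd.Series(["edible"] * 70 + ["poisonous"] * 30)
features = pd.DataFrame({"row_id": range(100)})

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.1, stratify=labels, random_state=42
)

# Both splits keep the original 70/30 class ratio
print(l_train.value_counts(normalize=True))
print(l_test.value_counts(normalize=True))
```

Without `stratify`, a small test set can end up with a noticeably different class balance from the training set, which makes the test accuracy harder to interpret.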

In [45]:
# TODO: Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y, random_state=42)

### 1.3: Grid Search (4 marks)

For the grid search, we would like to test three different models: `LogisticRegression(max_iter=1000)`, `KNeighborsClassifier()` and `SVC()`. Build your parameter grid based on what you think are reasonable values to test
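
One possible way to compare several models in a single search is to pass `GridSearchCV` a list of dictionaries, one per model. The sketch below is an assumption-laden illustration: the data is randomly generated, the step names `encoder` and `classifier` are arbitrary, and the parameter values are just plausible choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Hypothetical categorical data standing in for the mushroom features
rng = np.random.default_rng(0)
X_cat = pd.DataFrame({
    "shape": rng.choice(["convex", "bell", "flat"], size=120),
    "color": rng.choice(["brown", "white", "yellow"], size=120),
})
y_cat = np.where(X_cat["shape"] == "bell", "poisonous", "edible")

pipe = Pipeline([
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ("classifier", LogisticRegression(max_iter=1000)),  # placeholder
])

# One dict per model; each dict lists only parameters valid for that model
combined_grid = [
    {"classifier": [LogisticRegression(max_iter=1000)],
     "classifier__C": [0.1, 1, 10]},
    {"classifier": [KNeighborsClassifier()],
     "classifier__n_neighbors": [3, 5, 7]},
    {"classifier": [SVC()],
     "classifier__C": [0.1, 1, 10],
     "classifier__kernel": ["linear", "rbf"]},
]

combined_search = GridSearchCV(pipe, combined_grid, cv=5)
combined_search.fit(X_cat, y_cat)

print(type(combined_search.best_estimator_.named_steps["classifier"]).__name__)
print(len(combined_search.cv_results_["params"]))  # 3 + 3 + 6 candidates
```

With this layout, `best_params_` and `best_score_` refer to the single best candidate across all three model families; running three separate searches instead gives one best result per family.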

In [46]:
# TODO: Build a parameter grid
param_grid_lr = {
    'classifier': [LogisticRegression(max_iter=1000)],
    'classifier__C': [0.1, 1, 10]
}
param_grid_kn = {
    'classifier': [KNeighborsClassifier()],
    'classifier__n_neighbors': [2,4,6]
}
param_grid_svm = {
    'classifier': [SVC()],
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['linear', 'rbf']
}

In [47]:
# TODO: Implement grid search
grid_search_lr = GridSearchCV(pipeline, param_grid_lr, cv=5)
grid_search_lr.fit(X_train, y_train)
grid_search_kn = GridSearchCV(pipeline, param_grid_kn, cv=5)
grid_search_kn.fit(X_train, y_train)
grid_search_svm = GridSearchCV(pipeline, param_grid_svm, cv=5)
grid_search_svm.fit(X_train, y_train)



### 1.4: Visualize Results (2 marks)

The final step is to print out the results from the grid search. You will need to print out the following items:
- Best parameters
- Best cross-validation train score
- Best cross-validation test score
- Test set accuracy
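
The cross-validation train score is not recorded unless `return_train_score=True` is passed to `GridSearchCV`. A minimal sketch of reading both scores from `cv_results_`, using synthetic data and hypothetical names (`X_syn`, `knn_search`):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not the mushroom dataset)
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

knn_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 5, 15]},
    cv=5,
    return_train_score=True,  # off by default; needed for mean_train_score
)
knn_search.fit(X_syn, y_syn)

best = knn_search.best_index_
train_score = knn_search.cv_results_["mean_train_score"][best]
cv_test_score = knn_search.cv_results_["mean_test_score"][best]

print("Best parameters:", knn_search.best_params_)
print("Best cross-validation train score:", train_score)
print("Best cross-validation test score:", cv_test_score)
```

`mean_test_score` at `best_index_` is the same number as `best_score_`. Comparing it with the train score helps distinguish underfitting (both low) from overfitting (train much higher than test).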

In [48]:
# TODO: Print the results from the grid search
print('Best parameters:')
print('Logistic Regression:\n\t', grid_search_lr.best_params_)
print('KNeighbors Classifier:\n\t', grid_search_kn.best_params_)
print('SVC:\n\t', grid_search_svm.best_params_)

print('\nBest Cross-Validation Score:')
print('Logistic Regression:\t', grid_search_lr.best_score_)
print('KNeighbors Classifier:\t', grid_search_kn.best_score_)
print('SVC:\t\t\t', grid_search_svm.best_score_)

print('\nTest set accuracy:')
test_lr_accuracy = grid_search_lr.score(X_test, y_test)
print('Logistic Regression:\t', test_lr_accuracy)
test_kn_accuracy = grid_search_kn.score(X_test, y_test)
print('KNeighbors Classifier:\t', test_kn_accuracy)
test_svm_accuracy = grid_search_svm.score(X_test, y_test)
print('SVC:\t\t\t', test_svm_accuracy)

Best parameters:
Logistic Regression:
	 {'classifier': LogisticRegression(C=1, max_iter=1000), 'classifier__C': 1}
KNeighbors Classifier:
	 {'classifier': KNeighborsClassifier(n_neighbors=6), 'classifier__n_neighbors': 6}
SVC:
	 {'classifier': SVC(C=10), 'classifier__C': 10, 'classifier__kernel': 'rbf'}

Best Cross-Validation Score:
Logistic Regression:	 0.6693570451436388
KNeighbors Classifier:	 0.6908344733242135
SVC:			 0.7138166894664842

Test set accuracy:
Logistic Regression:	 0.6555965559655597
KNeighbors Classifier:	 0.6691266912669127
SVC:			 0.6912669126691267


### Questions (6 marks)

1. Which model and what parameters produced the best results?
1. Was this model a good fit? Why or why not?
1. Is there anything else we could do to try to improve model performance? Provide two ideas.

### ANSWERS HERE
1. The SVC model with `C=10` and `kernel='rbf'` produced the best results, with a cross-validation score of about 0.714 and a test accuracy of about 0.691.
2. This model is the best of the three we tested, but it is not a good fit. The SVC has a cross-validation score of about 71% and a test accuracy of about 69%. The two scores are close, so the model is consistent, but it still misclassifies about 30% of the samples. That suggests the model is underfitting: it does not capture enough of the relationship between the features and the target in either the training or the testing subset.
3. Yes. First, we could increase the complexity of the model so that it can capture more of the structure in the data, which should raise both the cross-validation and test accuracies. Second, we could tune how hard or soft the SVC margin is (the `C` parameter), so that the model can ignore extreme outliers and focus on the more representative data points. We must be careful not to make the margin too soft, or the model will ignore data points that carry useful information.

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?


1. I sourced most of the code from previous assignments and labs, plus my memory of working on them. For the `OneHotEncoder`, I searched online to get a basic understanding of how to call it and what it does. I used ChatGPT to help me work out the parameter-setting section.
2. I did the assignment in order.
3. Yes, I used ChatGPT to help me figure out how to set the parameters for the `KNeighborsClassifier`. I took the code it gave me and adapted it to my existing code.
4. The challenges I faced were figuring out how to set the parameters of `KNeighborsClassifier` correctly and how to obtain the results for the different requirements (i.e., cross-validation score, test accuracy, and best parameters). To solve them, I went through Lab 6, which I had missed, and was then able to figure out what to do next.

# Part 2: Regression (26 marks)

For this task, we will be using the auto-mpg dataset. The dataset can be found here: https://archive.ics.uci.edu/ml/datasets/Auto%2BMPG

### 2.1: Load data (3 marks)

#### Prepare the feature matrix and target vector

Using the code below, load the dataset and separate it into feature matrix `X` and target vector `y`. Which column represents the target vector?

Print the shape of `X` and `y`

**Note that you will need to download the file from D2L or from the UCI website and store it in the same folder as the code for this to work**

In [49]:
# ucimlrepo should already be installed from previous assignments;
# this makes sure it is available in the current environment
!pip install ucimlrepo
from google.colab import drive



In [50]:
# Code to read in the dataset - DO NOT CHANGE
data = pd.read_csv('auto-mpg.data',
               header=None,
              names=["mpg",
                    "cylinders",
                    "displacement",
                    "horsepower",
                    "weight",
                    "acceleration",
                    "model_year",
                    "origin",
                    "car_name"],
               na_values='?',
               sep=r'\s+')

In [51]:
# TODO: Separate dataset into feature matrix and target vector
y = data['mpg']
X = data.drop(['mpg','car_name'], axis=1)
# TODO: Print shape of X and y
print('Shape of X:', X.shape)
print('Shape of y:', y.shape)

Shape of X: (398, 7)
Shape of y: (398,)


Do we have any missing values in this case?
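
As a small aside, `isna().any()` reports whether a column has gaps, while `isna().sum()` reports how many. The sketch below uses a hypothetical three-row frame, not the real auto-mpg file.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same kind of gap as the horsepower column
cars = pd.DataFrame({
    "mpg": [18.0, 25.0, 21.0],
    "horsepower": [130.0, np.nan, 95.0],
})

has_missing = cars.isna().any()    # True/False per column
missing_count = cars.isna().sum()  # number of missing entries per column

print(has_missing)
print(missing_count)
```

The count is useful when deciding between imputing and dropping rows: a handful of gaps in one column is usually a good case for imputation.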

In [52]:
# TODO: Check if there are any missing values
missing_values = data.isna().any()
print(missing_values)

mpg             False
cylinders       False
displacement    False
horsepower       True
weight          False
acceleration    False
model_year      False
origin          False
car_name        False
dtype: bool


### 2.2: Pre-processing (5 marks)
In this dataset, we have a mixture of categorical and numerical data. This means that we will need to use a `ColumnTransformer()`

If you try to use a ColumnTransformer on the data with all the existing features, you will get an error. This is because there are too many unique feature values in the `car_name` column to capture all possible values in the training set. For this assignment, we will remove the `car_name` column to avoid this problem

In [53]:
# TODO: Remove car_name column
data = data.drop('car_name', axis=1)

For this case, we will use:
- `OneHotEncoder(sparse_output=False)` for any categorical columns
- `StandardScaler()` for any numerical columns
- Minimal information imputation for any missing values

In [54]:
# TODO: Create ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

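numerical_cols = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year']
categorical_cols = ['origin']

numerical_pipeline = Pipeline([
    ('data_type', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
categorical_pipeline = Pipeline([
    ('data_type', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(sparse_output=False))
])

# ColumnTransformer expects column names (or indices), not the DataFrame itself
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_cols),
        ('cat', categorical_pipeline, categorical_cols)
    ]
)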

The next step is to build a pipeline to combine the ColumnTransformer with the selected machine learning method. To initialize the pipeline, we will use `LinearRegression()` as a placeholder

In [55]:
# TODO: Build the pipeline
from sklearn.linear_model import LinearRegression

# pipeline_models = Pipeline(steps=[
#     ('preprocessor', preprocessor),
#     ('regression', LinearRegression())
# ])
pipeline_models = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num',
         Pipeline(steps=[
             ('data_type', SimpleImputer()),
             ('scaler', StandardScaler())]),
         ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year']),
        ('cat',
         Pipeline(steps=[
             ('data_type', SimpleImputer(strategy='most_frequent')),
             ('encoder', OneHotEncoder(sparse_output=False))]),
         ['origin'])
    ])),
    ('regressor', LinearRegression())
])

The next step is to split the data into training and testing sets. Use `test_size=0.1, random_state=0`

In [56]:
# TODO: Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


### 2.3: Grid Search (4 marks)

For the grid search, we would like to test three different models: `LinearRegression()`, `KNeighborsRegressor()` and `RandomForestRegressor(random_state=0)`. Build your parameter grid based on what you think are reasonable values to test

In [57]:
# TODO: Build a parameter grid
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

param_grid_lnr = {
    'regressor': [LinearRegression()],
}
param_grid_knr = {
    'regressor': [KNeighborsRegressor()],
    'regressor__n_neighbors': [2,4,6],
    'regressor__weights': ['uniform', 'distance'],
}
param_grid_rfr = {
    'regressor': [RandomForestRegressor()],
    'regressor__n_estimators': [50, 100, 150],
    'regressor__max_depth': [None, 10, 20],
    'regressor__min_samples_split': [2, 5, 10],
    'regressor__min_samples_leaf': [1, 2, 4],
}

In [58]:
# TODO: Implement Grid Search
grid_search_lnr = GridSearchCV(pipeline_models, param_grid_lnr, cv=5)
grid_search_lnr.fit(X_train, y_train)
grid_search_knr = GridSearchCV(pipeline_models, param_grid_knr, cv=5)
grid_search_knr.fit(X_train, y_train)
grid_search_rfr = GridSearchCV(pipeline_models, param_grid_rfr, cv=5)
grid_search_rfr.fit(X_train, y_train)

### 2.4: Visualize Results (2 marks)

The final step is to print out the results from the grid search. You will need to print out the following items:
- Best parameters
- Best cross-validation train score
- Best cross-validation test score
- Test set accuracy
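
For regressors, the value returned by `score` (and used by `GridSearchCV` by default) is the coefficient of determination R², not a classification accuracy. The sketch below checks this on synthetic data; the names `X_reg`, `lin_model`, and the data itself are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the auto-mpg dataset)
X_reg, y_reg = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.1, random_state=0
)

lin_model = LinearRegression().fit(Xr_train, yr_train)

# .score() on a regressor and r2_score() on its predictions agree
score_from_model = lin_model.score(Xr_test, yr_test)
score_from_metric = r2_score(yr_test, lin_model.predict(Xr_test))
print(score_from_model, score_from_metric)
```

An R² of 1.0 means perfect predictions, 0.0 means no better than always predicting the mean, and negative values are possible for very poor models.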

In [59]:
# TODO: Print the results from the grid search
print('Best parameters:')
print('Linear Regression:\n\t', grid_search_lnr.best_params_)
print('KNeighbors Regression:\n\t', grid_search_knr.best_params_)
print('Random Forest Regression:\n\t', grid_search_rfr.best_params_)

print('\nBest Cross-Validation Score:')
print('Linear Regression:\t\t', grid_search_lnr.best_score_)
print('KNeighbors Regression:\t\t', grid_search_knr.best_score_)
print('Random Forest Regression:\t', grid_search_rfr.best_score_)

print('\nTest set accuracy:')
test_lnr_accuracy = grid_search_lnr.score(X_test, y_test)
print('Linear Regression:\t\t', test_lnr_accuracy)
test_knr_accuracy = grid_search_knr.score(X_test, y_test)
print('KNeighbors Regression:\t\t', test_knr_accuracy)
test_rfr_accuracy = grid_search_rfr.score(X_test, y_test)
print('Random Forest Regression:\t', test_rfr_accuracy)

Best parameters:
Linear Regression:
	 {'regressor': LinearRegression()}
KNeighbors Regression:
	 {'regressor': KNeighborsRegressor(n_neighbors=6, weights='distance'), 'regressor__n_neighbors': 6, 'regressor__weights': 'distance'}
Random Forest Regression:
	 {'regressor': RandomForestRegressor(max_depth=20, n_estimators=50), 'regressor__max_depth': 20, 'regressor__min_samples_leaf': 1, 'regressor__min_samples_split': 2, 'regressor__n_estimators': 50}

Best Cross-Validation Score:
Linear Regression:		 0.8051113323772965
KNeighbors Regression:		 0.8417410474764548
Random Forest Regression:	 0.8510101045331651

Test set accuracy:
Linear Regression:		 0.8449024450695694
KNeighbors Regression:		 0.9058461419712156
Random Forest Regression:	 0.9133858104572892


### Questions (8 marks)

1. Which model and what parameters produced the best results?
1. Was this model a good fit? Why or why not?
1. Is there anything else we could do to try to improve model performance? Provide two ideas (must be different than the two ideas given for the previous part).
1. Comparing the two parts, which one took longer to run the grid search? Why do you think it took longer?

1. The random forest regressor produced the best results, with `max_depth=20`, `min_samples_leaf=1`, `min_samples_split=2`, and `n_estimators=50`. Its best cross-validation score was about 0.851 and its test score was about 0.913.

2. Yes, the random forest regressor is a good fit. Both the cross-validation score and the test score are fairly high (these are R² values, which is what `score` reports for regressors), and they are close to each other, meaning the model fits the data well.

3. One way to improve the cross-validation and test scores is to tune the hyperparameters further so that the model better fits the dataset. Another possibility is to apply additional pre-processing to clean up the data, which would help prevent the random forest regressor from overfitting.

4. The random forest grid search took much longer to run than the linear regression and k-neighbors searches. The reason is the way a random forest works: it trains many decision trees (50 to 150 per model here), and building each tree means evaluating many candidate splits over the data. On top of that, its grid has 3 × 3 × 3 × 3 = 81 parameter combinations, each fitted on 5 folds, for 405 fits, compared with 1 combination for linear regression and 6 for k-neighbors. This approach is more accurate here, but it takes much longer.
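
A rough way to estimate grid-search cost is to count the number of model fits, which is the number of parameter combinations times the number of folds. The sketch below uses `ParameterGrid` with the same value lists as the regression grids in section 2.3; the estimator entries are left out because each contributes a single choice, and the variable names are hypothetical.

```python
from sklearn.model_selection import ParameterGrid

# Same value lists as the k-neighbors and random forest regression grids
knr_grid = {
    "regressor__n_neighbors": [2, 4, 6],
    "regressor__weights": ["uniform", "distance"],
}
rfr_grid = {
    "regressor__n_estimators": [50, 100, 150],
    "regressor__max_depth": [None, 10, 20],
    "regressor__min_samples_split": [2, 5, 10],
    "regressor__min_samples_leaf": [1, 2, 4],
}

cv_folds = 5
knr_fits = len(ParameterGrid(knr_grid)) * cv_folds
rfr_fits = len(ParameterGrid(rfr_grid)) * cv_folds
print(knr_fits, rfr_fits)  # 30 405
```

These counts exclude the final refit on the full training set, and they say nothing about the cost of a single fit, which is also much higher for a forest of 50 to 150 trees than for a k-neighbors model.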

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

1. Most of the code was sourced from previous assignments and labs, some of the examples used during lectures, and my memory. I used the Colab AI to help me debug parts of the code.

2. I completed the code in order.

3. I used the Colab AI feature to help me debug the pipeline part of the code. It helped me take the code from the previous section and put it in the correct format so that it would not cause errors when the code reached the grid search section.

4. Other than struggling to get the pipeline to work, as stated above, the rest of the code went fairly smoothly.

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


A pattern I noticed is that, within each part, the algorithms scored fairly close to one another: test accuracies ranged from about 0.656 to 0.691 in Part 1, and test scores ranged from about 0.845 to 0.913 in Part 2. This made me think carefully about which algorithm is best. In the end, I chose to ignore the time each model takes to run and focus only on the scores, since the datasets we are using are not large. If I had to work with datasets containing more data points, I might have to reconsider which algorithm to choose.



## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


I liked the format of the assignment, as usual. The flow of the structure makes sense and helps simplify each step, rather than overwhelming us with one empty section in which to write all the code at once. I found it challenging to get the pipelines figured out. I had to go back to Lab 6 and learn what I had missed two weeks ago, so it took me longer to complete the assignment.