# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name: Marie Howell

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [33]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.metrics import make_scorer, accuracy_score, f1_score
from sklearn.model_selection import cross_validate


## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected. 

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [34]:
# Import dataset (1 mark)
df = pd.read_csv('/Users/marie/Desktop/ENSF611/Assignment4/penguins.csv',index_col=0)
df.head()


Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
1,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
2,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
3,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
4,Adelie,Torgersen,,,,,,2007
5,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007


### Questions (3 marks)

1. (1 mark) What is the source of your dataset?
    The dataset is from Kaggle and is the called the Clustering Penguins Species. Link: https://www.kaggle.com/datasets/youssefaboelwafa/clustering-penguins-species
1. (1 mark) Why did you pick this particular dataset?
    It seemed like an interesting application of a classification model, seeing if the model can identify different species by the data provided. 
1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?
    It was fairly easy to find a data set, I just went through kaggle datasets and checked to make sure they had data that would work well for the machine learning models we have been using. 



## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [35]:
# Clean data (if needed)
# Checking for null values
print(df.isnull().sum())
# Deleting rows with null values
df.dropna(inplace=True)
df.isnull().sum()
# drop year column since its not impactfull on pengiun species
df = df.drop(columns='year',axis=1)
df



species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
year                  0
dtype: int64


Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
1,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male
2,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
3,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
5,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female
6,Adelie,Torgersen,39.3,20.6,190.0,3650.0,male
...,...,...,...,...,...,...,...
340,Chinstrap,Dream,55.8,19.8,207.0,4000.0,male
341,Chinstrap,Dream,43.5,18.1,202.0,3400.0,female
342,Chinstrap,Dream,49.6,18.2,193.0,3775.0,male
343,Chinstrap,Dream,50.8,19.0,210.0,4100.0,male


In [4]:
# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed
# species is the traget vector need to be encoded with label encoder
label_encoder = LabelEncoder()
df['species'] = label_encoder.fit_transform(df['species'])
# colunms sex and island also need to be encoded
df = pd.get_dummies(df, columns=['sex','island'],dtype=float)
df


Unnamed: 0,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex_female,sex_male,island_Biscoe,island_Dream,island_Torgersen
1,0,39.1,18.7,181.0,3750.0,0.0,1.0,0.0,0.0,1.0
2,0,39.5,17.4,186.0,3800.0,1.0,0.0,0.0,0.0,1.0
3,0,40.3,18.0,195.0,3250.0,1.0,0.0,0.0,0.0,1.0
5,0,36.7,19.3,193.0,3450.0,1.0,0.0,0.0,0.0,1.0
6,0,39.3,20.6,190.0,3650.0,0.0,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...
340,1,55.8,19.8,207.0,4000.0,0.0,1.0,0.0,1.0,0.0
341,1,43.5,18.1,202.0,3400.0,1.0,0.0,0.0,1.0,0.0
342,1,49.6,18.2,193.0,3775.0,0.0,1.0,0.0,1.0,0.0
343,1,50.8,19.0,210.0,4100.0,0.0,1.0,0.0,1.0,0.0


### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.
    There were a select number of null values in the data set. There were 2 rows missing data for the majority of the columns and then there was 11 rows missing a value for sex. Since sex is a categorical column, it is very hard to replace the null values with a value that would make sense in the data. Because of this I decided to drop all the rows containing null values. 
2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?
    This data set is a combination of categorical and measurement data. One hot encoding had to be applied to convert the categorical data into numerical data. 


## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [26]:
# Implement pipeline and grid search here. Can add more code blocks if necessary
# building the pipelines
lr_pipeline = Pipeline([('classifier', LogisticRegression(solver='liblinear'))])

rf_pipeline = Pipeline([('classifier', RandomForestClassifier(random_state=0))])

svm_pipeline = Pipeline([("preprocessing", StandardScaler()),('classifier', SVC(random_state=0))])


In [27]:
# splitting the data
X = df.drop('species',axis=1)
y = df['species']
X_train, X_test, y_train, y_test = train_test_split(
    X, y,test_size=0.2, random_state=0)


In [28]:
# Train the pipelines 
lr_pipeline.fit(X_train, y_train)
rf_pipeline.fit(X_train,y_train)
svm_pipeline.fit(X_train,y_train)


In [29]:
# Defining parameter grids for each pipeline 
param_grid_lr = [{'classifier': [LogisticRegression(solver='liblinear')],
            'classifier__C': [0.01, 0.1, 1.0, 10.0],
            'classifier__fit_intercept': [True, False]
             }]

param_grid_rf = [{'classifier': [RandomForestClassifier()],
            'classifier__n_estimators': [100, 200, 300],
            'classifier__max_depth': [1, 2, 3, 5]
             }]

param_grid_svm = [{'classifier': [SVC()],
            'classifier__C': [0.01, 0.1, 1],
            'classifier__kernel': [ 'rbf']
            }]

In [38]:
# logistic regression grid search results
grid_lr = GridSearchCV(lr_pipeline, param_grid_lr, cv=5)
grid_lr.fit(X_train, y_train)

print("logistical regression best estimator: ")
print(str(grid_lr.best_estimator_) + "\n")
print("logistical regression best parameters: ")
print(str(grid_lr.best_params_) + "\n")
print(f'Cross-Validation accuracy {grid_lr.best_score_:.2f}')
print(f'Test accuracy {grid_lr.score(X_test, y_test):.2f}\n')

logistical regression best estimator: 
Pipeline(steps=[('classifier', LogisticRegression(C=0.1, solver='liblinear'))])

logistical regression best parameters: 
{'classifier': LogisticRegression(C=0.1, solver='liblinear'), 'classifier__C': 0.1, 'classifier__fit_intercept': True}

Cross-Validation accuracy 0.99
Test accuracy 1.00



In [39]:
# random forest grid search results
grid_rf = GridSearchCV(rf_pipeline, param_grid_rf, cv=5)
grid_rf.fit(X_train, y_train)

print("random forest best estimator: ")
print(str(grid_rf.best_estimator_) + "\n")
print("random forest best parameters: ")
print(str(grid_rf.best_params_) + "\n")
print(f'Cross-Validation accuracy {grid_rf.best_score_:.2f}')
print(f'Test accuracy {grid_rf.score(X_test, y_test):.2f}\n')

random forest best estimator: 
Pipeline(steps=[('classifier', RandomForestClassifier(max_depth=5))])

random forest best parameters: 
{'classifier': RandomForestClassifier(max_depth=5), 'classifier__max_depth': 5, 'classifier__n_estimators': 100}

Cross-Validation accuracy 0.99
Test accuracy 1.00



In [40]:
# SVM grid search results
grid_svm = GridSearchCV(svm_pipeline, param_grid_svm, cv=5)
grid_svm.fit(X_train, y_train)

print("random forest best estimator: ")
print(str(grid_svm.best_estimator_) + "\n")
print("random forest best parameters: ")
print(str(grid_svm.best_params_) + "\n")
print(f'Cross-Validation accuracy {grid_svm.best_score_:.2f}')
print(f'Test accuracy {grid_svm.score(X_test, y_test):.2f}\n')

random forest best estimator: 
Pipeline(steps=[('preprocessing', StandardScaler()), ('classifier', SVC(C=1))])

random forest best parameters: 
{'classifier': SVC(C=1), 'classifier__C': 1, 'classifier__kernel': 'rbf'}

Cross-Validation accuracy 0.99
Test accuracy 1.00



In [41]:
# Putting all three models into gird search to determeine the model and parameters with the lest performance 
# Defining parameter grids
param_grid = [{'classifier': [LogisticRegression(solver='liblinear')],
             'classifier__C': [0.01, 0.1, 1.0, 10.0],
             'classifier__fit_intercept': [True, False],
             'preprocessing': [StandardScaler(), None]},

             {'classifier': [RandomForestClassifier()],
             'classifier__n_estimators': [100, 200, 300],
             'classifier__max_depth': [3, 5, 7, 10, 15],
             'preprocessing': [None]},

             {'classifier': [SVC()],
             'classifier__C': [0.01, 0.1, 1, 10, 100],
             'classifier__kernel': ['linear', 'rbf'],
             'preprocessing': [StandardScaler(), None]}]

grid = GridSearchCV(svm_pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)

print("best overall estimator: ")
print(str(grid_svm.best_estimator_) + "\n")
print("best overall parameters: ")
print(grid_svm.best_params_)

best overall estimator: 
Pipeline(steps=[('preprocessing', StandardScaler()), ('classifier', SVC(C=1))])

best overall parameters: 
{'classifier': SVC(C=1), 'classifier__C': 1, 'classifier__kernel': 'rbf'}


### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
    This data set is classifying by species so a classification model was needed. 
1. (2 marks) Which models did you select for testing and why?
    I chose to use a logistical regression model, a random forest model and an SVC model. I chose these models to test a wide variety of models, all three use very different methods to fit the data. I thought this would give me a wide variety of results.  
1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?
    All the models performed very similarly so it was hard to choose which one work the best. When I put all the models in the same parameter grid and used the best estimator method the result was the SVC model. Since there was no difference in the results when I evaluated the models individually, I chose the best model based on the parameter grid. 





## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [48]:
# Calculate testing accuracy (1 mark)
scoring_grid = GridSearchCV(svm_pipeline, param_grid_svm, cv=5, scoring='accuracy')
best_model = scoring_grid.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
acccuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred,  average='weighted')
f1




1.0


### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose? 
    I chose to use f1 score.
1. (1 mark) How do these results compare to those in part 3? Did this model generalize well?
    the model preformed very well in both part 3 and 4. The model was able to generalize very well to the test data with a score of 1.0. 
1. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?
    The model had a prefect test score and an almost prefect training score which indicates that model performed extremely well. The scores were almost too good and could indicate over fitting, however since there isn’t high variance between the test and training scores the high scores may be an indication that the data set is simple and very distinct making it easy to learn and predict. If we look at the scores produced by the other models, we see that every model produced similar scores this also hints that the data set is very easy to learn as different model complexities produced the same results. Before using this model on real world data, I would want to test it on a few more unseen data sets to ensure that the performance is maintained and that the very high scores are not a product of skewed data. Adding more features to the data might make it a more complex and meaningful model.  


*ANSWER HERE*

## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
    I used the class examples and the labs to build my code. 
1. In what order did you complete the steps?
    I completed the steps in the laid-out order in the assignment. 
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
    I did not use any AI. 
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?
    I had some challenges getting the pipelines to work properly but studying the class examples helped me figure it out.


*DESCRIBE YOUR PROCESS HERE*

## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.

I thought it was interesting picking my own data set. It required more thought into what type of data might work well for different models. I found the pipelines confusing at first, especially figuring out how best to do the preprocessing. 