# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name: 

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [83]:
import numpy as np
import pandas as pd

## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected. 

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [84]:
# Import dataset (1 mark)
path = '/Users/robbie/Desktop/ensf611/Assignments/Assignment 4/drug200.csv'

df = pd.read_csv(path)
df.head()

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,DrugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX
4,61,F,LOW,HIGH,18.043,DrugY


### Questions (3 marks)

1. (1 mark) What is the source of your dataset?
1. (1 mark) Why did you pick this particular dataset?
1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

*ANSWER HERE*
1. dataset source: https://www.kaggle.com/datasets/prathamtripathi/drug-classification
2. I picked this particular dataset because it contained a combination of features/target variable that seemed as though it make for a interesting classification exercise.
3. I found it challenging finding a dataset which had an obvious target variable to be used in a machine learning application.

## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [85]:
# Clean data (if needed)
df.isnull().sum()

Age            0
Sex            0
BP             0
Cholesterol    0
Na_to_K        0
Drug           0
dtype: int64

In [86]:
# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

BP_order = ['LOW','NORMAL','HIGH']
Chol_order = ['NORMAL','HIGH']
ord_BP = OrdinalEncoder(categories = [BP_order])
ord_Chol = OrdinalEncoder(categories = [Chol_order])

df['BP'] = ord_BP.fit_transform(df[['BP']])
df['Cholesterol'] = ord_Chol.fit_transform(df[['Cholesterol']])
df = pd.get_dummies(df, columns=['Sex'], drop_first=True)
df['Sex_M'] = df['Sex_M'].astype(int)

features = ['Age', 'BP', 'Cholesterol', 'Na_to_K', 'Sex_M']
target = 'Drug'

df.head()

Unnamed: 0,Age,BP,Cholesterol,Na_to_K,Drug,Sex_M
0,23,2.0,1.0,25.355,DrugY,0
1,47,0.0,1.0,13.093,drugC,1
2,47,0.0,1.0,10.114,drugC,1
3,28,1.0,1.0,7.798,drugX,0
4,61,0.0,1.0,18.043,DrugY,0


In [87]:
from sklearn.model_selection import train_test_split

X = df.drop(target, axis=1)
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=32)

### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.
2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?

*ANSWER HERE*
1. There were no missing or null values is my dataset.  If there were missing values in a small number of rows, I would've have dropped the samples containing null values.  I would choose to drop the samples containing nulls rather than fill values because this is a relatively small dataset, I would be concerned that replacing any nulls might have a significant impact and skew the model.

2. The dataset contains a combination of numerical and text values.  I had to use both ordinal and one hot encoding during the preprocessing of my data. 


## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [88]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Random Forest Pipeline
rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])

# Logistic Regression Pipeline
lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# SVM Pipeline
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC())
])

In [89]:
# Define parameter grids
param_grid_rf = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [5, 10, 15]
}

param_grid_lr = {
    'classifier__C': [0.1, 1, 10],
    'classifier__penalty': ['l1'],  
    'classifier__solver': ['liblinear', 'saga']  
}

param_grid_svm = {
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['linear', 'rbf']
}

In [90]:
# Define the scoring metrics 
from sklearn.metrics import make_scorer, f1_score
scoring = {
    'f1_score': make_scorer(f1_score, average='macro')}

In [91]:

# Create GridSearchCV instances for each algorithm with multiple scoring metrics
grid_search_rf = GridSearchCV(rf_pipeline, param_grid_rf, cv=5, scoring=scoring, refit='f1_score')
grid_search_lr = GridSearchCV(lr_pipeline, param_grid_lr, cv=5, scoring=scoring, refit='f1_score')
grid_search_svm = GridSearchCV(svm_pipeline, param_grid_svm, cv=5, scoring=scoring, refit='f1_score')

# Fit the models
grid_search_rf.fit(X_train, y_train)
grid_search_lr.fit(X_train, y_train)
grid_search_svm.fit(X_train, y_train)

# Get the best parameters based on F1
best_params_rf = grid_search_rf.best_params_
best_params_lr = grid_search_lr.best_params_
best_params_svm = grid_search_svm.best_params_

# Access the results for both scoring metrics
results_rf = grid_search_rf.cv_results_
results_lr = grid_search_lr.cv_results_
results_svm = grid_search_svm.cv_results_



In [92]:
# Print the results for best parameters and F1
print("Random Forest Results:")
print("F1 scores:", [round(n, 3) for n in results_rf['mean_test_f1_score']])
print("Best Parameters for Random Forest based on F1:", best_params_rf)

print("\nLogistic Regression Results:")
print("F1 scores:", [round(n, 3) for n in results_lr['mean_test_f1_score']])
print("Best Parameters for Logistic Regression based on F1:", best_params_lr)

print("\nSVM Results:")
print("F1 scores:", [round(n, 3) for n in results_svm['mean_test_f1_score']])
print("Best Parameters for SVM based on F1:", best_params_svm)

Random Forest Results:
F1 scores: [0.962, 0.976, 0.962, 0.976, 0.976, 0.976, 0.986, 0.976, 0.986]
Best Parameters for Random Forest based on F1: {'classifier__max_depth': 15, 'classifier__n_estimators': 100}

Logistic Regression Results:
F1 scores: [0.327, 0.32, 0.849, 0.958, 0.993, 0.962]
Best Parameters for Logistic Regression based on F1: {'classifier__C': 10, 'classifier__penalty': 'l1', 'classifier__solver': 'liblinear'}

SVM Results:
F1 scores: [0.856, 0.175, 0.968, 0.893, 0.937, 0.943]
Best Parameters for SVM based on F1: {'classifier__C': 1, 'classifier__kernel': 'linear'}


### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
1. (2 marks) Which models did you select for testing and why?
1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

*ANSWER HERE*
1. My dataset requires classification models.
2. I selected Logisitc Regression, SVM, and Random Forest for my dataset because I wanted to see how models of varying complexity would perform on this dataset. 
3. The Random Forest model performed the best which makes sense based on the theory discussed in the course because it combines an ensembles of individual models which can often provide better results when compared against indivdual models like Linear Regression and SVM.

## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [93]:
# Calculate testing accuracy (1 mark)
# Make predictions on the test set
rf_test_predictions = grid_search_rf.predict(X_test)
lr_test_predictions = grid_search_lr.predict(X_test)
svm_test_predictions = grid_search_svm.predict(X_test)

# Calculate F1 scores 
rf_f1 = f1_score(y_test, rf_test_predictions, average='macro')
lr_f1 = f1_score(y_test, rf_test_predictions, average='macro')
svm_f1 = f1_score(y_test, svm_test_predictions, average='macro')

In [94]:
results = pd.DataFrame(index = ['Random Forest','Logistic Regression', 'SVM'], columns=['Test F1 Score'])

results.loc['Random Forest'] = {'Test F1 Score':rf_f1}
results.loc['Logistic Regression'] = {'Test F1 Score':lr_f1}
results.loc['SVM'] = {'Test F1 Score':svm_f1}

results

Unnamed: 0,Test F1 Score
Random Forest,0.987918
Logistic Regression,0.987918
SVM,0.844933



### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose? 
1. (1 mark) How do these results compare to those in part 3? Did this model generalize well?
1. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

*ANSWER HERE*

1. I chose F1 as the accuracy metric. 
2. The Random Forest and Logistic Regression models performed similar to part 3, however the SVM model performed worse than part 3 indicating that it did not generalize as well. 
3. The best models achieved an F1 score over 98% which I believe can be considered "good enough" for the context of this dataset.  A model which can predict which drug a patient is taking with a 98% accuracy I believe would be useful because a model like this could be used to validate the effectness of one drug over another.  This analysis could be improved by seeking out a larger dataset to improve the model further and to validate results on a larger testing set. 

## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

1. My code was mainly sourced from the Lab 6 examples as well as other previous class examples. 
2. I completed the steps in the order laid out in the assignment template. 
3. I did not use any generative AI for this assignment. 
4. I had challenges with understanding the pipeline and grid search setup.  I used the lecture notes and class examples to gain a better understanding. 

## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

I enjoyed that there was more flexibility in this assignment which made it so I had to think harder about what steps were required and why I was doing them, as opposed to following a strict template.  I found it challenging setting up the pipelines and grid search. 