# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name: 

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [81]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import make_scorer, accuracy_score, f1_score



## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected. 

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [82]:
# Import dataset (1 mark)
column_names = ['Variance', 'Skewness', 'curtosis', 'entropy', 'class']
df = pd.read_csv('data_banknote_authentication.txt', names=column_names, header=None)

print(df.head())
print((df['class'] == 1).sum())  # number of class 1 rows (inauthentic banknotes)
print((df['class'] == 0).sum())  # number of class 0 rows (authentic banknotes)

   Variance  Skewness  curtosis  entropy  class
0   3.62160    8.6661   -2.8073 -0.44699      0
1   4.54590    8.1674   -2.4586 -1.46210      0
2   3.86600   -2.6383    1.9242  0.10645      0
3   3.45660    9.5228   -4.0112 -3.59440      0
4   0.32924   -4.4552    4.5718 -0.98880      0
610
762


### Questions (3 marks)

1. (1 mark) What is the source of your dataset?
1. (1 mark) Why did you pick this particular dataset?
1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

*ANSWER HERE*
1. https://archive.ics.uci.edu/dataset/267/banknote+authentication
1. Because it is not too large and would work well with a classifier model.
1. Ensuring the dataset could be imported using pandas easily (such as in a csv or txt format).


## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [83]:
# Clean data (if needed)
df.isnull().sum() / len(df) * 100

Variance    0.0
Skewness    0.0
curtosis    0.0
entropy     0.0
class       0.0
dtype: float64

In [84]:
# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed
y = df['class']
X = df.drop('class', axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit a scaler on the training data only (the pipelines in Step 3 also scale internally)
scaler = StandardScaler()
scaler.fit(X_train)

### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.
2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?

*ANSWER HERE*
1. There were no missing/null values in my dataset. If there had been, I would have dropped rows when only a few values were missing and dropped columns when more than 50% of their values were missing.
1. All of my feature data is numerical. I would have to apply scaling to use SVM and K-NN, since both are sensitive to the scale of the features.
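
The dropping strategy described in the first answer could be sketched roughly as follows. This is only an illustration on a made-up toy frame, not the banknote data; the helper name `drop_sparse_then_incomplete` and the `max_missing_frac` parameter are hypothetical and were not part of the original notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_then_incomplete(df, max_missing_frac=0.5):
    """Drop columns that are mostly missing, then drop rows with any remaining gaps.

    Hypothetical helper for illustration only.
    """
    # Fraction of missing values in each column
    missing_frac = df.isnull().mean()
    # Keep columns whose missing fraction does not exceed the threshold
    keep_cols = missing_frac[missing_frac <= max_missing_frac].index
    # Drop any rows that still contain a missing value
    return df[keep_cols].dropna(axis=0)


# Made-up toy data (assumption: stands in for a dataset with gaps)
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],         # one missing value, so one row is dropped
    "b": [np.nan, np.nan, np.nan, 1.0],   # 75% missing, so the column is dropped
    "c": [0.5, 0.1, 0.2, 0.3],            # complete
})

cleaned = drop_sparse_then_incomplete(toy)
print(cleaned)
```

Dropping is the simplest option; when many rows would be lost, imputing values (for example with the column median) is a common alternative.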

## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [85]:
# Implement pipeline and grid search here. Can add more code blocks if necessary

#K-NN Pipeline
knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', KNeighborsClassifier())
])

# Logistic Regression Pipeline
lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# SVM Pipeline
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC())
])

# Define parameter grids
param_grid_knn = {
    'classifier__n_neighbors': [5, 10, 15],
    'classifier__weights': ['uniform', 'distance']
}

param_grid_lr = {
    'classifier__C': [0.1, 1, 10],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear', 'saga']  # both solvers support l1 and l2 penalties
}

param_grid_svm = {
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['linear', 'poly', 'rbf']
}

# Define the scoring metrics you want to use
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'f1_score': make_scorer(f1_score)
}

# Create GridSearchCV instances for each algorithm with multiple scoring metrics
grid_search_knn = GridSearchCV(knn_pipeline, param_grid_knn, cv=5, scoring=scoring, refit='f1_score')
grid_search_lr = GridSearchCV(lr_pipeline, param_grid_lr, cv=5, scoring=scoring, refit='f1_score')
grid_search_svm = GridSearchCV(svm_pipeline, param_grid_svm, cv=5, scoring=scoring, refit='f1_score')

# Fit the models
grid_search_knn.fit(X_train, y_train)
grid_search_lr.fit(X_train, y_train)
grid_search_svm.fit(X_train, y_train)




In [86]:
# Access the results for both scoring metrics
results_knn = grid_search_knn.cv_results_
results_lr = grid_search_lr.cv_results_
results_svm = grid_search_svm.cv_results_

# Print the results for accuracy and F1
print("\nK-NN Results:")
print("Accuracy scores:", results_knn['mean_test_accuracy'])
print("F1 scores:", results_knn['mean_test_f1_score'])
print("\nBest Parameters for K-NN based on F1:", grid_search_knn.best_params_)
print("\nF1 score for Best Parameters for K-NN:", grid_search_knn.best_score_)

print("\nLogistic Regression Results:")
print("Accuracy scores:", results_lr['mean_test_accuracy'])
print("F1 scores:", results_lr['mean_test_f1_score'])
print("\nBest Parameters for Logistic Regression based on F1:", grid_search_lr.best_params_)
print("\nF1 score for Best Parameters for Logistic Regression:", grid_search_lr.best_score_)

print("\nSVM Results:")
print("Accuracy scores:", results_svm['mean_test_accuracy'])
print("F1 scores:", results_svm['mean_test_f1_score'])
print("\nBest Parameters for SVM based on F1:", grid_search_svm.best_params_)
print("\nF1 score for Best Parameters for SVM:", grid_search_svm.best_score_)


K-NN Results:
Accuracy scores: [0.99817767 0.99817767 0.9945247  0.99909091 0.99270237 0.99909091]
F1 scores: [0.99797975 0.99797975 0.99401985 0.99899497 0.9919996  0.99899497]

Best Parameters for K-NN based on F1: {'classifier__n_neighbors': 10, 'classifier__weights': 'distance'}

F1 score for Best Parameters for K-NN: 0.9989949748743718

Logistic Regression Results:
Accuracy scores: [0.97630967 0.97630967 0.97356995 0.97174761 0.98633873 0.98816521
 0.97904525 0.97904525 0.98725197 0.98816521 0.98907846 0.98907846]
F1 scores: [0.97418605 0.97418605 0.9711684  0.96912962 0.98493331 0.98692306
 0.977123   0.977123   0.98585619 0.98689214 0.98791778 0.98791778]

Best Parameters for Logistic Regression based on F1: {'classifier__C': 10, 'classifier__penalty': 'l2', 'classifier__solver': 'liblinear'}

F1 score for Best Parameters for Logistic Regression: 0.9879177765492285

SVM Results:
Accuracy scores: [0.97995019 0.93344956 0.98814446 0.983599   0.98633873 1.
 0.98816521 0.98815276 1.        ]
F1 s

### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
1. (2 marks) Which models did you select for testing and why?
1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

*ANSWER HERE*
1. Classification models. Needed to predict if banknote was inauthentic or not (class 0 for authentic, class 1 for inauthentic).
1. I selected K-NN because I wanted to see whether more neighbors is always better. I selected Logistic Regression because it is a linear classification model, and the assignment requires one. Lastly, I selected SVM because I was curious how a linear SVM compares with the non-linear SVMs we learned about (rbf and polynomial kernels).
1. The SVM with a C of 1 and an 'rbf' kernel worked the best. This makes sense because SVMs work well with low dimensional data (few features), as in this case, and the rbf kernel can fit a non-linear boundary that Logistic Regression (best cross-validated F1 of about 0.988) cannot.
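
The printed score arrays above are hard to match to parameter combinations by eye. One way to make that comparison easier is to load `cv_results_` into a DataFrame and sort by rank. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification` rather than the banknote dataset, a single `scoring="f1"` metric rather than the two-metric dictionary used in the notebook, and the names `X_demo`, `demo_pipeline`, `demo_grid`, `demo_search` and `summary` are all hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the banknote dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0, random_state=0
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", SVC()),
])

demo_grid = {
    "classifier__C": [0.1, 1, 10],
    "classifier__kernel": ["linear", "rbf"],
}

# With a single metric, the result columns are named mean_test_score / rank_test_score
demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=5, scoring="f1")
demo_search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
summary = (
    pd.DataFrame(demo_search.cv_results_)[
        ["param_classifier__C", "param_classifier__kernel",
         "mean_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(summary)
```

With the multi-metric setup used in the notebook, the corresponding columns would instead be named after the scorer keys, for example `mean_test_f1_score` and `rank_test_f1_score`.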

## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [87]:
# Calculate testing accuracy (1 mark)
print('SVM')

# Make predictions on the test set
test_predictions = grid_search_svm.predict(X_test)

# Calculate accuracy and F1 score on the test set
accuracy = accuracy_score(y_test, test_predictions)
f1 = f1_score(y_test, test_predictions)

# Print test results
print("Test Accuracy:", accuracy)
print("Test F1 Score:", f1)

SVM
Test Accuracy: 1.0
Test F1 Score: 1.0



### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose?
1. (1 mark) How do these results compare to those in part 3? Did this model generalize well?
1. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

*ANSWER HERE*
1. I used accuracy and F1 score, but primarily F1 score (it is the metric passed to the refit parameter).
1. The results were the same as in part 3 (both were 1), so the model appears to have generalized well.
1. It performed well enough to be used in the real world, as the test F1 score was 1. I do not have any suggestions.
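
To make the difference between the two metrics concrete, here is a minimal sketch that computes accuracy and F1 from confusion-matrix counts by hand. The counts are made up for illustration and do not come from the banknote results; the function name `accuracy_and_f1` is hypothetical. The positive class is taken to be class 1, which matches the default `pos_label=1` of scikit-learn's `f1_score`.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Return (accuracy, F1) from confusion-matrix counts for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: of the predicted positives, how many were correct
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the actual positives, how many were found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


# Made-up counts with an imbalanced positive class (assumption, for illustration)
acc, f1 = accuracy_and_f1(tp=8, fp=2, fn=2, tn=88)
print("accuracy:", acc)
print("F1:", f1)
```

In this made-up case accuracy is high mostly because of the many true negatives, while F1 is noticeably lower because it ignores true negatives. When a model makes no errors at all, both metrics equal 1.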

## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

1. I sourced my code primarily from lab 6, along with other labs and class examples.
1. I completed the steps in the order given, although I partly combined steps 2 and 3 (the preprocessing is in the pipeline).
1. I did not use generative AI.
1. I struggled to understand why my logistic regression model gave a warning about not converging even though the results seemed to be good.
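
On the convergence warning: scikit-learn raises a `ConvergenceWarning` when a solver reaches `max_iter` (100 by default for `LogisticRegression`) before meeting its tolerance, so the fitted model can still score well. The notebook does not say which solver triggered it; the sketch below assumes it was `saga` and shows one common remedy, raising `max_iter`. It uses synthetic data, and the names `X_demo`, `y_demo` and `lr_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the banknote dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0, random_state=0
)

MAX_ITER = 5000  # raised from the default of 100 (value chosen arbitrarily for this sketch)

lr_demo = Pipeline([
    ("scaler", StandardScaler()),  # scaling also helps saga converge faster
    ("classifier", LogisticRegression(solver="saga", C=10, max_iter=MAX_ITER)),
])
lr_demo.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used
n_iter = int(lr_demo.named_steps["classifier"].n_iter_[0])
print("iterations used:", n_iter, "of", MAX_ITER)
```

In a grid search, the same setting could be passed when constructing the estimator inside the pipeline, so that every parameter combination gets the larger iteration budget.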

## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

I liked steps 3 and 4. It was nice testing the models after I figured out how pipelines worked.
I found step 2 challenging, as I was unsure what to include given that the preprocessing mostly happens in the pipeline.