# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name: Brandon Lac

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [29]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected. 

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [16]:
# Import dataset (1 mark)
df = pd.read_csv("./fake_bills.csv", delimiter = ";")
df

Unnamed: 0,is_genuine,diagonal,height_left,height_right,margin_low,margin_up,length
0,True,171.81,104.86,104.95,4.52,2.89,112.83
1,True,171.46,103.36,103.66,3.77,2.99,113.09
2,True,172.69,104.48,103.50,4.40,2.94,113.16
3,True,171.36,103.91,103.94,3.62,3.01,113.51
4,True,171.73,104.28,103.46,4.04,3.48,112.54
...,...,...,...,...,...,...,...
1495,False,171.75,104.38,104.17,4.42,3.09,111.28
1496,False,172.19,104.63,104.44,5.27,3.37,110.97
1497,False,171.80,104.01,104.12,5.51,3.36,111.95
1498,False,172.06,104.28,104.06,5.17,3.46,112.25


### Questions (3 marks)

1. (1 mark) What is the source of your dataset?

The dataset is from Kaggle.

1. (1 mark) Why did you pick this particular dataset?

I picked this dataset because I think it would be interesting to see how well ML can predict fraudulent bills from this dataset.

1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

The challenging aspect was finding a dataset that wasn't too large, since a larger dataset would take longer to process.



## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [26]:
# Clean data (if needed)
df.dtypes
print(df.isna().sum())
df.dropna(inplace = True)
print(df.isna().sum())


is_genuine       0
diagonal         0
height_left      0
height_right     0
margin_low      37
margin_up        0
length           0
dtype: int64
is_genuine      0
diagonal        0
height_left     0
height_right    0
margin_low      0
margin_up       0
length          0
dtype: int64


In [49]:
# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data_features = df.drop(columns=["is_genuine"])
X_train, X_val, y_train, y_val = train_test_split(
    data_features, df.is_genuine, random_state=0)

### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.

2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?


There were missing values in this dataset for margin_low. Since only 37 rows were missing this value, dropping those rows was the best choice.
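The check behind this answer can be sketched as follows. This is a minimal illustration on a small made-up frame, not the actual bills data: the `frame` name and its values are assumptions chosen only to show how the missing-value count and the share of rows lost to `dropna` can be computed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
frame = pd.DataFrame({
    "margin_low": [4.52, np.nan, 4.40, 3.62, np.nan, 4.42, 5.27, 5.51],
    "length": [112.83, 113.09, 113.16, 113.51, 112.54, 111.28, 110.97, 111.95],
})

# Missing values per column
missing_per_column = frame.isna().sum()

# Number and share of rows that contain at least one missing value
rows_with_missing = int(frame.isna().any(axis=1).sum())
fraction_lost = rows_with_missing / len(frame)

# Dropping is reasonable when the share of rows lost is small
cleaned = frame.dropna()

print(missing_per_column)
print(f"Rows dropped: {rows_with_missing} of {len(frame)} ({fraction_lost:.1%})")
```

If the share of rows lost were large, imputing the column instead of dropping rows would be the usual alternative.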

The features are all numerical, so the only preprocessing I need to apply is the standard scaler. Since there are no extreme outliers, the standard scaler is a good choice.
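As a small illustration of what the standard scaler does, the sketch below scales a made-up two-column array (the values are illustrative, not the full dataset) and checks that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical sample of two numerical features on very different scales
X = np.array([
    [171.81, 4.52],
    [171.46, 3.77],
    [172.69, 4.40],
    [171.36, 3.62],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, every column has (approximately) zero mean and unit variance
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

print(col_means)
print(col_stds)
```

Putting the scaler inside a `Pipeline`, as in Step 3, means it is fit only on the training folds during cross-validation.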

## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [50]:
# Implement pipeline and grid search here. Can add more code blocks if necessary
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
pipeline_lr = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression(max_iter=1000))])
pipeline_rf = Pipeline([('clf', RandomForestClassifier(random_state=42))])
pipeline_svm = Pipeline([('scaler', StandardScaler()), ('clf', SVC())])

#Implementing grid parameters 
param_grid_lr = {'clf__C': [0.1, 1, 10]}
param_grid_rf = {'clf__max_depth': [5, 10, 15], 'clf__n_estimators': [100, 200, 300]}
param_grid_svm = {'clf__C': [0.1, 1, 10], 'clf__gamma': [0.001, 0.01]}

grid_lr = GridSearchCV(pipeline_lr, param_grid_lr, cv=5, return_train_score=True)
grid_rf = GridSearchCV(pipeline_rf, param_grid_rf, cv=5, return_train_score=True)
grid_svm = GridSearchCV(pipeline_svm, param_grid_svm, cv=5, return_train_score=True)
grid_lr.fit(X_train, y_train)
grid_rf.fit(X_train, y_train)
grid_svm.fit(X_train, y_train)

print("Best params:\n{}\n".format(grid_lr.best_params_))
print("Best cross-validation train score: {:.2f}".format(grid_lr.cv_results_['mean_train_score'][grid_lr.best_index_]))
print("Best cross-validation validation score: {:.2f}".format(grid_lr.best_score_))

print("Best params:\n{}\n".format(grid_rf.best_params_))
print("Best cross-validation train score: {:.2f}".format(grid_rf.cv_results_['mean_train_score'][grid_rf.best_index_]))
print("Best cross-validation validation score: {:.2f}".format(grid_rf.best_score_))

print("Best params:\n{}\n".format(grid_svm.best_params_))
print("Best cross-validation train score: {:.2f}".format(grid_svm.cv_results_['mean_train_score'][grid_svm.best_index_]))
print("Best cross-validation validation score: {:.2f}".format(grid_svm.best_score_))

Best params:
{'clf__C': 0.1}

Best cross-validation train score: 0.99
Best cross-validation validation score: 0.99
Best params:
{'clf__max_depth': 5, 'clf__n_estimators': 100}

Best cross-validation train score: 1.00
Best cross-validation validation score: 0.99
Best params:
{'clf__C': 10, 'clf__gamma': 0.01}

Best cross-validation train score: 0.99
Best cross-validation validation score: 0.99


### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
1. (2 marks) Which models did you select for testing and why?
1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

Classification is needed for this dataset because the target is binary (True or False) rather than a continuous value.

Two non-linear models were required. SVM and random forest were picked because they are easy to set up and I know them better. Logistic regression was picked as the linear model because it is the simplest.

All of the models did a very good job on both the training score and the validation score. Their validation scores are tied at 0.99, so I would pick the random forest because it has the highest training score (1.00). The fact that every model predicts the outcome well means that the features are very indicative of the class. It also makes sense for random forest to be the best because it picks the features that are more useful, thus creating a better model.
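One way to make this comparison explicit is to collect the best cross-validation score of each grid search and rank the models by it. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification` instead of the bills dataset, smaller grids to keep it fast, and the hypothetical names `candidates` and `results`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data (stand-in for the real features)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Each entry: model name -> (pipeline, parameter grid)
candidates = {
    "logistic_regression": (
        Pipeline([("scaler", StandardScaler()),
                  ("clf", LogisticRegression(max_iter=1000))]),
        {"clf__C": [0.1, 1, 10]},
    ),
    "random_forest": (
        Pipeline([("clf", RandomForestClassifier(random_state=42))]),
        {"clf__max_depth": [3, 5], "clf__n_estimators": [25, 50]},
    ),
    "svm": (
        Pipeline([("scaler", StandardScaler()), ("clf", SVC())]),
        {"clf__C": [0.1, 1, 10], "clf__gamma": [0.001, 0.01]},
    ),
}

results = {}
for name, (pipe, grid) in candidates.items():
    search = GridSearchCV(pipe, grid, cv=5)
    search.fit(X, y)
    # best_score_ is the mean cross-validation validation score
    results[name] = (search.best_score_, search.best_params_)

# Rank by validation score; ties go to the first model listed
best_name = max(results, key=lambda n: results[n][0])
for name, (score, params) in results.items():
    print(f"{name}: {score:.3f} {params}")
print("Best by validation score:", best_name)
```

Ranking by the validation score rather than the training score keeps the choice tied to performance on data the model did not see during fitting.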



## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [52]:
# Calculate testing accuracy (1 mark)

from sklearn.metrics import accuracy_score

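best_model = grid_rf.best_estimator_
# GridSearchCV already refits the best estimator on X_train (refit=True by default),
# so this extra fit is redundant but harmless
best_model.fit(X_train, y_train)
pred = best_model.predict(X_val)
accuracy = accuracy_score(y_val, pred)
print(f"Validation Accuracy of Random Forest Model: {accuracy}")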

Validation Accuracy of Random Forest Model: 0.9918032786885246



### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose? 
1. (1 mark) How do these results compare to those in part 3? Did this model generalize well?
1. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

Used classification accuracy (the fraction of correct predictions), computed with `accuracy_score`.
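For reference, `accuracy_score` returns the fraction of predictions that match the true labels. A minimal sketch with made-up labels (the `y_true` and `y_pred` values are illustrative only) compares it with a manual count:

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for five bills
y_true = [True, True, False, False, True]
y_pred = [True, False, False, False, True]

# Accuracy as computed by scikit-learn
acc = accuracy_score(y_true, y_pred)

# The same quantity computed by hand: correct predictions / total predictions
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(acc, manual)
```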

The results were almost identical to those from part 3 (0.99 in both cases), which means the model generalized well and predicted the outcome almost perfectly.

The model performed very well (about 99% accuracy), and I do not see anything that would need to be changed for it to perform well in the real world. Initially, when the previous section produced such high training and validation scores, I suspected overfitting, but the accuracy on the held-out data confirmed that the model generalizes well.

## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

The bulk of the code was taken from the class example "Imputation Example".

Followed the steps from top to bottom. 

No generative AI was used to complete the assignment. The steps were pretty straightforward.

The challenging part of the assignment was applying the grid search, as we had not applied it in our previous labs.

## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


I thoroughly enjoyed using the pipeline to apply the scaling instead of splitting the task up into several lines.

The motivating aspect of this lab was that applying three different models was quite easy and not time-consuming.