# Assignment 5: PCA and Neural Networks (50 marks)
### Due: April 4 at 11:59pm 

### Name: Elias Poitras-Whitecalf
### UCID: 30193066


In [55]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk

## Part 1: Principal Component Analysis (PCA) (20 marks)

### Step 1. Load data (2 marks)

You have been asked by an agricultural company to help them predict the type of wheat based on kernel characteristics. You have been given all the information you need in the `seeds_dataset.txt` file on D2L. The original dataset can be found [here](https://archive.ics.uci.edu/dataset/236/seeds).

The first step is to read in the file using pandas and inspect the first few rows. Note that the data is stored in a text file, so the separator differs from the comma used in a CSV file.

In [56]:
# TO DO: Import dataset and inspect the first few rows (1 mark)
column_names = ["area", "perimeter", "compactness", "length of kernel",
                "width of kernel", "asymmetry coefficient",
                "length of kernel groove", "class"]
df = pd.read_csv("seeds_dataset.txt", delimiter="\t", names=column_names)
# Class labels: 1 is Kama, 2 is Rosa, 3 is Canadian
df.head()

Unnamed: 0,area,perimeter,compactness,length of kernel,width of kernel,asymmetry coefficient,length of kernel groove,class
0,15.26,14.84,0.871,5.763,3.312,2.221,5.22,1
1,14.88,14.57,0.8811,5.554,3.333,1.018,4.956,1
2,14.29,14.09,0.905,5.291,3.337,2.699,4.825,1
3,13.84,13.94,0.8955,5.324,3.379,2.259,4.805,1
4,16.14,14.99,0.9034,5.658,3.562,1.355,5.175,1


Next, you will need to separate the data into your feature matrix and target vector.

In [76]:
# TO DO: Separate the data into feature matrix and target vector. Print the shape of X and y (1 mark)
X = df.iloc[:, :-1]
y = df["class"]

print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")

Shape of X: (210, 7)
Shape of y: (210,)


### Steps 2+3: Preprocessing and Model Selection (3 marks)

The first step is to split the data into training and testing sets.

In [58]:
# TO DO: Split the data into training and testing sets using 10% for test set (1 mark)
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.10, random_state=0)


Now that we have our training data, we need to decide which preprocessing methods to use. Since we do not want any information leaking into the model validation stage, we will need to create a Pipeline. For this case, our model for the Pipeline is `LogisticRegression(max_iter=1000)`.
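
As a rough illustration of why the Pipeline prevents leakage, the sketch below uses synthetic stand-in data from `make_classification` (not the seeds dataset). The names `X_demo`, `demo_pipe`, and `scaler_means` are illustrative assumptions. Fitting the pipeline fits the scaler on the training rows only, so test-set statistics never reach the model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 210 samples, 7 features, 3 classes)
X_demo, y_demo = make_classification(
    n_samples=210, n_features=7, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=0
)

demo_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
demo_pipe.fit(X_tr, y_tr)

# The scaler's learned means come from the training rows only,
# not from the full dataset that also contains the test rows.
scaler_means = demo_pipe.named_steps["scaler"].mean_
print("Matches training means:", np.allclose(scaler_means, X_tr.mean(axis=0)))
print("Matches full-data means:", np.allclose(scaler_means, X_demo.mean(axis=0)))
```

The same logic applies inside cross-validation: each fold refits the scaler on that fold's training portion.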

In [None]:
# TO DO: Setup required preprocessing method(s) and model (1 mark)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
feature_columns = ["area", "perimeter", "compactness", "length of kernel",
                   "width of kernel", "asymmetry coefficient",
                   "length of kernel groove"]
ct = ColumnTransformer([("scaling", StandardScaler(), feature_columns)])



In [60]:
# TO DO: Setup pipeline (1 mark)
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
pipe = Pipeline(steps=[('preprocessor', ct),
                       ('classifier', LogisticRegression(max_iter=1000))])

### Step 4: Validation (2 marks)

To validate the results, we need to use cross-validation. To make sure we are using the best hyperparameters, we can use a grid search. The parameter grid has been provided below.

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = [{'classifier': [LogisticRegression(max_iter=1000)],
              'classifier__C': [0.01, 0.1, 1.0, 10.0, 100],
              'classifier__fit_intercept': [True, False]}]

grid = GridSearchCV(pipe, param_grid, cv=5, return_train_score=True)
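
To get a feel for the size of this search, the short sketch below counts the candidate settings with `ParameterGrid`. The names `demo_grid`, `n_candidates`, and `n_folds` are illustrative. Each candidate is fitted once per fold, and `GridSearchCV` then refits the best candidate on the full training set by default.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid

# Same grid as above, rebuilt here so the sketch is self-contained
demo_grid = [{'classifier': [LogisticRegression(max_iter=1000)],
              'classifier__C': [0.01, 0.1, 1.0, 10.0, 100],
              'classifier__fit_intercept': [True, False]}]

n_candidates = len(ParameterGrid(demo_grid))  # 1 classifier x 5 C values x 2 intercept options
n_folds = 5                                   # matches cv=5
print(f"{n_candidates} candidates x {n_folds} folds = {n_candidates * n_folds} fits")
```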


In [62]:
# TO DO: Implement the grid search and print the best parameters, training score, cross-validation score and test score (2 marks)
grid.fit(Xtrain,ytrain)

print("Best params:\n{}\n".format(grid.best_params_))
print("Best cross-validation train score: {:.2f}".format(grid.cv_results_['mean_train_score'][grid.best_index_]))
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Test-set score: {:.2f}".format(grid.score(Xtest, ytest)))

Best params:
{'classifier': LogisticRegression(max_iter=1000), 'classifier__C': 10.0, 'classifier__fit_intercept': True}

Best cross-validation train score: 0.97
Best cross-validation score: 0.95
Test-set score: 1.00


### Step 5: Visualize using PCA (5 marks)

Now you can visualize the results from Step 4 using PCA. Use the best parameters from the previous step to predict the label for the testing data. For the plot, each data point must be colored based on the class label.
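
One possible shape for these three steps is sketched below on synthetic stand-in data, not the seeds dataset. It assumes a scaler-plus-classifier pipeline similar to the one above. Fitting PCA on the scaled training data and then projecting the test data is one reasonable choice, not the only one. All variable names here are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 210 samples, 7 features, 3 classes)
X_demo, y_demo = make_classification(
    n_samples=210, n_features=7, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=0
)

model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
]).fit(X_tr, y_tr)

# 1. Predict labels for the test data
y_pred = model.predict(X_te)

# 2. Reduce the scaled test data to two principal components
scaler = model.named_steps["scaler"]
pca = PCA(n_components=2).fit(scaler.transform(X_tr))
X_te_2d = pca.transform(scaler.transform(X_te))

# 3. Scatter plot, one colour per predicted class
fig, ax = plt.subplots()
points = ax.scatter(X_te_2d[:, 0], X_te_2d[:, 1], c=y_pred, cmap="viridis")
ax.legend(*points.legend_elements(), title="Predicted class")
ax.set_xlabel("First principal component")
ax.set_ylabel("Second principal component")
plt.show()
```

With a fitted `GridSearchCV` object, `grid.best_estimator_` would take the place of `model` in this sketch.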

In [63]:
# TO DO: Predict target vector labels using best estimator (1 mark)


In [64]:
# TO DO: Reduce dimensions of test data for plotting (2 marks)


In [65]:
# TO DO: Plot test points colored by predicted label (2 marks)


### Questions (4 marks)

1. Did the logistic regression model work well for this dataset? Why would you make this conclusion? 
1. Did PCA work well for this model? How does this relate to the results from the model used? 

*ANSWER HERE*

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

## Part 2: Neural Networks (28 marks)

For this assignment, you will be practicing using scikit-learn and TensorFlow to implement basic neural networks (MLP). The dataset we will be using is the energy dataset from Yellowbrick (https://www.scikit-yb.org/en/latest/api/datasets/energy.html).

### Step 1: Load data (1 mark)

You will need to load the file and split it into the feature matrix and target vector. Note that this dataset has two targets, heating load and cooling load. To retrieve the single target dataset, you need `return_dataset=False`. 

In [66]:
# TO DO: Load dataset into feature matrix and target vector (1 mark)


### Step 2: Process your dataset (6 marks)

In [67]:
# TO DO: Check if there are any missing values - if yes, decide how to fill them (1 mark)


In [68]:
# TO DO: Check data type for each column and the number of unique values - do you need to encode any of them? (1 mark)


In [69]:
# TO DO: Check the range of each feature - do you need to scale your data? (1 mark)


In [70]:
# TO DO: Split your data into training and testing sets using 20% for test set (1 mark)


In [71]:
# TO DO: Implement scaling and/or encoding here if needed (2 marks for preprocessing properly or justifying why it isn't needed)
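
The checks in this step could look roughly like the sketch below. It runs on a small made-up DataFrame (`demo`, with hypothetical column names and random values), not the energy dataset, so the printed values say nothing about the real data. The scaler is fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up stand-in data; column names and values are illustrative only
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "surface area": rng.uniform(500, 800, size=50),
    "overall height": rng.choice([3.5, 7.0], size=50),
    "glazing area": rng.choice([0.0, 0.1, 0.25, 0.4], size=50),
})
target = pd.Series(rng.uniform(6, 43, size=50), name="heating load")

# Missing values per column
print(demo.isna().sum())

# Data type and number of unique values per column
print(pd.DataFrame({"dtype": demo.dtypes, "n_unique": demo.nunique()}))

# Range of each feature
print(demo.describe().loc[["min", "max"]])

# 80/20 split, then scale using statistics from the training split only
X_tr, X_te, y_tr, y_te = train_test_split(
    demo, target, test_size=0.2, random_state=0
)
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
```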


### Step 3: Implement Neural Network (MLP) (8 marks)

### Part 1: Use scikit-learn (3 marks)

For each case, you will need to find the predicted target values.
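
A minimal sketch of the three scikit-learn cases follows, using synthetic regression data from `make_regression` in place of the energy dataset. Using `max_iter=10000` for all three cases and fixing `random_state=0` are assumptions made here for repeatability. The names `layer_setups` and `sk_predictions` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 200 samples, 8 numeric features)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=5.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# (100,) is MLPRegressor's default hidden_layer_sizes
layer_setups = {
    "1 x 100 (default)": (100,),
    "2 x 100": (100, 100),
    "3 x 100": (100, 100, 100),
}

sk_predictions = {}
for name, hidden in layer_setups.items():
    mlp = MLPRegressor(hidden_layer_sizes=hidden, max_iter=10000, random_state=0)
    mlp.fit(X_tr_s, y_tr)
    sk_predictions[name] = mlp.predict(X_te_s)  # predicted target values
```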

In [72]:
# TO DO: Test using default parameters and max_iter = 10000 (1 mark)
                

In [73]:
# TO DO: Test using two hidden layers with 100 nodes each (1 mark)


In [74]:
# TO DO: Test using three hidden layers with 100 nodes each (1 mark)


### Part 2: Use TensorFlow (5 marks)

In [75]:
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

ModuleNotFoundError: No module named 'tensorflow'

Instead of scaling the data using a scikit-learn scaler, you can scale the data using a normalization layer.

In [None]:
# TO DO: Define normalization layer (1 mark)
 

Using `keras.Sequential`, implement an MLP with the same hidden layer setups as above:
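
The sketch below shows one way the normalization layer and the three `keras.Sequential` setups might fit together. It assumes TensorFlow is installed and uses random stand-in data rather than the energy dataset. The helper `build_and_fit` and the dictionary `tf_models` are hypothetical names introduced for illustration.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Random stand-in data (assumption: 160 training rows, 8 numeric features)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(160, 8)).astype("float32")
y_train = (X_train @ rng.normal(size=8)).astype("float32")

# The normalization layer learns per-feature mean and variance from the training data
normalizer = layers.Normalization(axis=-1)
normalizer.adapt(X_train)

def build_and_fit(n_hidden_layers, epochs=100):
    """Build an MLP with n_hidden_layers of 100 relu units and fit it."""
    model = keras.Sequential()
    model.add(normalizer)
    for _ in range(n_hidden_layers):
        model.add(layers.Dense(100, activation="relu"))
    model.add(layers.Dense(1))  # single regression output
    model.compile(loss="mean_absolute_error",
                  optimizer=tf.keras.optimizers.Adam(0.001))
    model.fit(X_train, y_train, validation_split=0.2, verbose=0, epochs=epochs)
    return model

# One model per hidden-layer setup: 1, 2 and 3 hidden layers
tf_models = {n: build_and_fit(n) for n in (1, 2, 3)}
```

Predictions then come from `tf_models[n].predict(...)`, which returns an array with one column that can be flattened before computing errors.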

In [None]:
# TO DO: Implement MLP with one hidden layer with 100 nodes and the relu activation function (2 marks)
# Compile the model with loss='mean_absolute_error' and optimizer=tf.keras.optimizers.Adam(0.001)
# Fit the model using validation_split=0.2, verbose=0 and epochs=100


In [None]:
# TO DO: Repeat with two hidden layers with 100 nodes each and the relu activation function (1 mark)


In [None]:
# TO DO: Repeat with three hidden layers with 100 nodes each and the relu activation function (1 mark)


### Step 4: Compare the accuracy of both methods (4 marks)

For this part, calculate the mean absolute error for each model and print the results in a table using pandas.
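
As a small sketch of the table layout, the example below computes the MAE with `mean_absolute_error` and collects the values in a DataFrame. The arrays are made-up placeholders, not results from any model in this assignment, and the names `predictions` and `results` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Made-up placeholder values, only to show the mechanics
y_true = np.array([10.0, 20.0, 30.0, 40.0])
predictions = {
    "scikit-learn, 1 hidden layer": np.array([12.0, 18.0, 33.0, 41.0]),
    "TensorFlow, 1 hidden layer": np.array([11.0, 21.0, 29.0, 38.0]),
}

# One row per model, with the MAE against the true targets
results = pd.DataFrame(
    [{"model": name, "MAE": mean_absolute_error(y_true, y_hat)}
     for name, y_hat in predictions.items()]
).set_index("model")
print(results)
```

In practice the dictionary would hold the test-set predictions from all six models.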

In [None]:
# TO DO: Calculate the MAE for the three scikit-learn tests (1.5 marks)


In [None]:
# TO DO: Calculate the MAE for the three TensorFlow tests (1.5 marks)


In [None]:
# TO DO: Print the results (1 mark)


### Questions (5 marks)

1. Which model produced the least amount of error?
1. Do the results change if you run your code multiple times? Why or why not?
1. Why are the numbers different between the scikit-learn and TensorFlow methods when we used the same number of hidden layers and hidden units per layer? List two potential reasons.

*ANSWER HERE*

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

## Part 3: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- what you found interesting, confusing, challenging, or motivating

while working on this assignment.


*ADD YOUR THOUGHTS HERE*