# Assignment 3: Non-Linear Models and Validation Metrics (37 total marks)
### Due: October 24 at 11:59pm

### Name: Paolo Geronimo

### In this assignment, you will need to write code that uses non-linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Part 1: Regression (14.5 marks)

For this section, we will be continuing with the concrete example from yellowbrick. You will need to compare these results to the results from the previous assignment. Please use the results from the solution if you were unable to complete Assignment 2.

### Step 1: Data Input (0.5 marks)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

In [2]:
# TO DO: Import concrete dataset from yellowbrick library
from yellowbrick.datasets import load_concrete

# load_concrete() returns the feature matrix and target vector by default
X, y = load_concrete()

### Step 2: Data Processing (0 marks)

Data processing was completed in the previous assignment. No need to repeat here.

### Step 3: Implement Machine Learning Model

1. Import the Decision Tree, Random Forest and Gradient Boosting Machines regression models from sklearn
2. Instantiate the three models with `max_depth = 5`. Are there any other parameters that you will need to set?
3. Implement each machine learning model with `X` and `y`

### Step 4: Validate Model

Calculate the average training and validation accuracy using mean squared error with cross-validation. To do this, you will need to set `scoring='neg_mean_squared_error'` in your `cross_validate` function and negate the results (multiply by -1)

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: DT, RF and GB
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [3]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split, cross_validate

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

decision_tree = DecisionTreeRegressor(max_depth = 5, random_state = 0).fit(X_train, y_train)
random_forest = RandomForestRegressor(max_depth = 5, random_state = 0).fit(X_train, y_train)
gradient_boosting = GradientBoostingRegressor(max_depth = 5, random_state = 0).fit(X_train, y_train)

models = [decision_tree, random_forest, gradient_boosting]
results = pd.DataFrame(index = ["DT", "RF", "GB"], columns = ["Training Accuracy", "Validation Accuracy"])

for label, model in zip(results.index, models):
    # cross_validate refits a clone of the model on each fold (5 folds by default)
    scores = cross_validate(model, X_train, y_train, scoring = "neg_mean_squared_error", return_train_score = True)
    # The scorer returns negative MSE, so negate the mean to get the MSE
    results.loc[label, "Training Accuracy"] = -scores['train_score'].mean()
    results.loc[label, "Validation Accuracy"] = -scores['test_score'].mean()

results
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

| | Training Accuracy | Validation Accuracy |
|---|---|---|
| DT | 47.279761 | 73.447331 |
| RF | 29.577455 | 45.059351 |
| GB | 3.37944 | 22.783221 |


Repeat the step above to print the R2 score instead of the mean-squared error. For this case, you can use `scoring='r2'`

In [4]:
# TO DO: ADD YOUR CODE HERE
for label, model in zip(results.index, models):
    # Overwrites the MSE values stored in `results` with R2 scores
    scores = cross_validate(model, X_train, y_train, scoring = "r2", return_train_score = True)
    results.loc[label, "Training Accuracy"] = scores['train_score'].mean()
    results.loc[label, "Validation Accuracy"] = scores['test_score'].mean()
results

| | Training Accuracy | Validation Accuracy |
|---|---|---|
| DT | 0.834465 | 0.738697 |
| RF | 0.896557 | 0.840927 |
| GB | 0.988171 | 0.919471 |


### Questions (6 marks)
1. How do these results compare to the results using a linear model in the previous assignment? Use values.
1. Out of the models you tested, which model would you select for this dataset and why?
1. If you wanted to increase the accuracy of the tree-based models, what would you do? Provide two suggestions.

*ANSWER HERE*
1. These results are a significant improvement over the linear models in the previous assignment. The linear models had Training MSEs of ~111 and Validation MSEs of ~95, whereas the non-linear models here have Training MSEs between ~3.4 and ~47.3 and Validation MSEs between ~22.8 and ~73.4. The R2 scores tell the same story: the linear models scored ~0.6 on both the Training and Validation sets, while the non-linear models score between ~0.74 and ~0.99.

2. Out of all the models we have tested, the Gradient Boosting model has the best performance. Its MSE is a fraction of the other two models' MSEs, at ~3.4 for the Training set and ~22.8 for the Validation set. It also has the best R2 scores, at ~0.99 for Training and ~0.92 for Validation. Another thing that makes the Gradient Boosting model a good choice is its balance between bias and variance. Looking at the R2 scores, the Random Forest has the smallest gap between the training and validation scores (~0.06), which suggests the lowest variance, but its lower training score points to more bias than Gradient Boosting. The Decision Tree has the largest gap (~0.10), making it the model with the highest variance, and it also has the lowest training score. Gradient Boosting has a gap of ~0.07 while scoring highest on both sets.

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*
1. My code is adapted from my previous assignment, which I used as a reference for implementing the loop. I also looked at the Decision Trees example on D2L to see how to use the `cross_validate()` function.
2. I completed the steps in the order they are listed.
3. I did not use generative AI to complete this section; the examples on D2L and scikit-learn's website were enough material.
4. My biggest challenge was understanding how to use and access the results from `cross_validate()`. From scikit-learn's documentation (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html), I learned that it returns a dictionary of arrays. I then had to access the appropriate keys, take the mean of the values, and store it in the `results` DataFrame.
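
The dictionary returned by `cross_validate()` can be inspected with a small sketch like the one below. It uses synthetic data from `make_regression` as a stand-in (an assumption made only to keep the example self-contained; the notebook itself uses the concrete dataset), and the names `X_demo` and `y_demo` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the concrete dataset used above)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

scores = cross_validate(
    DecisionTreeRegressor(max_depth=5, random_state=0),
    X_demo,
    y_demo,
    scoring="neg_mean_squared_error",
    return_train_score=True,
)

# cross_validate returns a dict of NumPy arrays with one entry per fold
# (5 folds by default); 'train_score' is present only when requested
print(sorted(scores.keys()))

# Negate the mean to turn the negative MSE back into an ordinary MSE
train_mse = -scores["train_score"].mean()
val_mse = -scores["test_score"].mean()
print(f"Training MSE: {train_mse:.2f}, validation MSE: {val_mse:.2f}")
```

Each array holds one value per fold, so taking `.mean()` gives the average score across folds.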

## Part 2: Classification (17.5 marks)

You have been asked to develop code that can help the user classify different wine samples. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 1: Data Input (2 marks)

The data used for this task can be downloaded from UCI: https://archive.ics.uci.edu/dataset/109/wine

Use the pandas library to load the dataset. You must define the column headers if they are not included in the dataset 

You will need to split the dataset into feature matrix `X` and target vector `y`. Which column represents the target vector?

Print the size and type of `X` and `y`

In [9]:
# TO DO: Import wine dataset

# import data, set column names
column_names = ["Class", "Alcohol", "Malicacid", "Ash", "Alcalinity_of_ash", "Magnesium", "Total_phenols", "Flavanoids", "Nonflavanoid_phenols", "Proanthocyanins", "Color_intensity", "Hue", "0D280_0D315_of_diluted_wines", "Proline"]
X = pd.read_csv("wine.data", names = column_names)

# splitting into feature matrix and target vector
y = X["Class"]
X = X.drop("Class", axis = 1)

print("Types in X:")
print(X.dtypes)
print(f"Shape of X: {X.shape}")

print("\nTypes in y:")
print(y.dtypes)
print(f"Shape of y: {y.shape}")

Types in X:
Alcohol                         float64
Malicacid                       float64
Ash                             float64
Alcalinity_of_ash               float64
Magnesium                         int64
Total_phenols                   float64
Flavanoids                      float64
Nonflavanoid_phenols            float64
Proanthocyanins                 float64
Color_intensity                 float64
Hue                             float64
0D280_0D315_of_diluted_wines    float64
Proline                           int64
dtype: object
Shape of X: (178, 13)

Types in y:
int64
Shape of y: (178,)


### Step 2: Data Processing (1.5 marks)

Print the first five rows of the dataset to inspect:

In [10]:
# TO DO: ADD YOUR CODE HERE
X.iloc[0:5]

| | Alcohol | Malicacid | Ash | Alcalinity_of_ash | Magnesium | Total_phenols | Flavanoids | Nonflavanoid_phenols | Proanthocyanins | Color_intensity | Hue | 0D280_0D315_of_diluted_wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |


Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values

In [11]:
# TO DO: ADD YOUR CODE HERE
print(X.isnull().sum())
print(y.isnull().sum())

Alcohol                         0
Malicacid                       0
Ash                             0
Alcalinity_of_ash               0
Magnesium                       0
Total_phenols                   0
Flavanoids                      0
Nonflavanoid_phenols            0
Proanthocyanins                 0
Color_intensity                 0
Hue                             0
0D280_0D315_of_diluted_wines    0
Proline                         0
dtype: int64
0


How many samples do we have of each type of wine?

In [19]:
# TO DO: ADD YOUR CODE HERE

type_1 = (y == 1).sum()
type_2 = (y == 2).sum()
type_3 = (y == 3).sum()

print (f"Number of type 1: {type_1}\n\
Number of type 2: {type_2}\n\
Number of type 3: {type_3}")

# checks out with info found in wine.names

Number of type 1: 59
Number of type 2: 71
Number of type 3: 48
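
As an alternative to counting each class separately, pandas can tally all labels in one call. The sketch below uses a tiny hypothetical Series (`y_demo`) in place of the real target vector, so its counts are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the target vector `y` built in Step 1
y_demo = pd.Series([1, 1, 2, 2, 2, 3], name="Class")

# value_counts() tallies every label at once; sort_index() orders the result by class label
class_counts = y_demo.value_counts().sort_index()
print(class_counts)
```

This scales to any number of classes without adding a line of code per class.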


### Step 3: Implement Machine Learning Model

1. Import `SVC` and `DecisionTreeClassifier` from sklearn
2. Instantiate models as `SVC()` and `DecisionTreeClassifier(max_depth = 3)`
3. Implement the machine learning model with `X` and `y`

### Step 4: Validate Model 

Calculate the average training and validation accuracy using `cross_validate` for the two different models listed in Step 3. For this case, use `scoring='accuracy'`

### Step 5: Visualize Results (4 marks)

#### Step 5.1: Compare Models
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [None]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT
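
One possible sketch of Steps 3-5 is shown below; it is not the submitted solution. It assumes scikit-learn's bundled copy of the wine data (`load_wine`, whose classes are labelled 0-2 rather than 1-3) in place of the `wine.data` file, and it mirrors the train/test split used in Part 1. The names `X_wine` and `y_wine` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled wine data stands in for the UCI "wine.data" file
X_wine, y_wine = load_wine(return_X_y=True, as_frame=True)

# Hold out a test set, as in Part 1; cross-validation runs on the training portion only
X_train, X_test, y_train, y_test = train_test_split(X_wine, y_wine, random_state=0)

models = {
    "SVC": SVC(),
    "DT": DecisionTreeClassifier(max_depth=3, random_state=0),
}

results = pd.DataFrame(
    index=list(models),
    columns=["Training accuracy", "Validation accuracy"],
    dtype=float,
)

for name, model in models.items():
    # Average the per-fold accuracies returned by cross_validate
    scores = cross_validate(model, X_train, y_train, scoring="accuracy", return_train_score=True)
    results.loc[name, "Training accuracy"] = scores["train_score"].mean()
    results.loc[name, "Validation accuracy"] = scores["test_score"].mean()

print(results)
```

Keeping the models in a dictionary lets the same loop fill one row per model, which is what the hint in the cell above asks for.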

#### Step 5.2: Visualize Classification Errors
Which method gave the highest accuracy? Use this method to print the confusion matrix and classification report:

In [None]:
# TO DO: Implement best model

In [None]:
# TO DO: Print confusion matrix using a heatmap

In [None]:
# TO DO: Print classification report
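
A minimal sketch of Step 5.2 follows. It assumes the decision tree turns out to be the more accurate model (the question wording below suggests so, but this should be confirmed against the `results` table). It again uses `load_wine` as a stand-in for the UCI file; `best_model` and `misclassified` are hypothetical names.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled wine data stands in for the UCI "wine.data" file
X_wine, y_wine = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X_wine, y_wine, random_state=0)

# Assumption: the decision tree is the better-performing model
best_model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
y_pred = best_model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])

# Draw the confusion matrix as an annotated heatmap
fig, ax = plt.subplots()
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax)
ax.set_xlabel("Predicted class")
ax.set_ylabel("True class")

print(classification_report(y_test, y_pred))

# Off-diagonal entries of the confusion matrix are the misclassified samples
misclassified = int(cm.sum() - cm.trace())
print(f"Misclassified test samples: {misclassified}")
```

The off-diagonal sum gives the count asked for in the third question below without reading it off the plot by hand.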

### Questions (6 marks)
1. How do the training and validation accuracy change depending on the method used? Explain with values.
1. What are two reasons why the support vector machines model did not work as well as the tree-based model?
1. How many samples were incorrectly classified in step 5.2? 
1. In this case, is maximizing precision or recall more important? Why?

*YOUR ANSWERS HERE*

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

## Part 5: Bonus Question (3 marks)

Repeat Part 2 and compare the support vector machines model used to `LinearSVC(max_iter=5000)`. Does using `LinearSVC` improve the results? Why or why not?

Is `LinearSVC` a good fit for this dataset? Why or why not?

In [1]:
# TO DO: ADD YOUR CODE HERE
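
The comparison could be set up along these lines. This is a sketch under the same assumptions as a Part 2 run on scikit-learn's `load_wine` data rather than the `wine.data` file; `bonus_results` is a hypothetical name, and no outcome is implied here.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.svm import SVC, LinearSVC

# Assumption: scikit-learn's bundled wine data stands in for the UCI "wine.data" file
X_wine, y_wine = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X_wine, y_wine, random_state=0)

models = {
    "SVC": SVC(),
    "LinearSVC": LinearSVC(max_iter=5000, random_state=0),
}

rows = {}
for name, model in models.items():
    # Average the per-fold accuracies for each model
    scores = cross_validate(model, X_train, y_train, scoring="accuracy", return_train_score=True)
    rows[name] = {
        "Training accuracy": scores["train_score"].mean(),
        "Validation accuracy": scores["test_score"].mean(),
    }

# One row per model, one column per metric
bonus_results = pd.DataFrame.from_dict(rows, orient="index")
print(bonus_results)
```

Because the features are not scaled, `LinearSVC` may emit a `ConvergenceWarning` even with `max_iter=5000`; that is a warning rather than an error, and it is worth mentioning in the written answer if it appears.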

*ANSWER HERE*