# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: 

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [1]:
import numpy as np
import pandas as pd

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [2]:
# TO DO: Import spam dataset from yellowbrick library
from yellowbrick.datasets import load_spam
X, y = load_spam()

# TO DO: Print size and type of X and y
print(f"Size of X: {X.size} and y: {y.size}\nType of X: {type(X)}\nType of y: {type(y)}")

Size of X: 262200 and y: 4600
Type of X: <class 'pandas.core.frame.DataFrame'>
Type of y: <class 'pandas.core.series.Series'>
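
A note on the numbers above: `DataFrame.size` counts every cell, so 262200 is rows times columns, and with 4600 targets that works out to 57 feature columns. The sketch below uses a hypothetical toy frame (`X_demo` and `y_demo` are illustrative names, not part of the assignment) to show how `.size` and `.shape` differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 rows (samples) and 3 columns (features).
X_demo = pd.DataFrame(np.zeros((4, 3)), columns=["a", "b", "c"])
y_demo = pd.Series(np.zeros(4))

# .size counts every cell; .shape reports (rows, columns).
print(X_demo.size)   # 12
print(X_demo.shape)  # (4, 3)
print(y_demo.shape)  # (4,)
```

Printing `.shape` alongside `.size` makes the number of samples and features visible at a glance.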


### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [3]:
# TO DO: Check if there are any missing values and fill them in if necessary
if X.isna().sum().sum():
    print("X contains nan values")
    X.fillna(0, inplace=True)  # this step was not necessary, but I added it for practice

if y.isna().any():
    print("y contains nan values")
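
The cell above fills missing values with 0. As a minimal sketch of an alternative, the example below counts missing values per column and fills them with the column median. The toy frame `df_demo` and the choice of the median are assumptions for illustration only, not something the assignment prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in column "a".
df_demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Count missing values per column.
missing_per_column = df_demo.isna().sum()
print(missing_per_column)

# Fill each column's missing values with that column's median
# (the median of [1.0, 3.0] is 2.0).
filled = df_demo.fillna(df_demo.median())
print(filled)
```

Filling with a per-column statistic avoids introducing a value (such as 0) that may lie far outside the feature's usual range.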

For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [4]:
# TO DO: Create X_small and y_small 
from sklearn.model_selection import train_test_split

# 95% of the data is ignored and only 5% is kept 
X_small, _, y_small, _ = train_test_split(X, y, test_size=0.95, random_state=0) 

### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

In [5]:
from sklearn.linear_model import LogisticRegression
import copy

logreg = LogisticRegression(max_iter=2000)
model = []

# Implementing X and y using the standard split (80% data for training and 20% for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  
model.append(copy.deepcopy(logreg).fit(X_train, y_train)) 

# Implementing only the first two columns of X and y
X_subset = X.iloc[:, :2]  # Selecting only the first two columns of X
X_subset_train, X_subset_test, y_subset_train, y_subset_test = train_test_split(X_subset, y, test_size=0.2, random_state=0)
model.append(copy.deepcopy(logreg).fit(X_subset_train, y_subset_train))

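# Implementing X_small and y_small
X_small_train, X_small_test, y_small_train, y_small_test = train_test_split(X_small, y_small, test_size=0.2, random_state=0)
# Note: this model is fit on all of X_small, which includes X_small_test,
# so its validation accuracy in Step 4 is likely optimistic
model.append(copy.deepcopy(logreg).fit(X_small, y_small))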
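
The third model above is fit on `X_small`, which still contains the rows later used for validation. The sketch below shows one way to keep the validation rows out of training, using `sklearn.base.clone` to get a fresh unfitted copy of the estimator instead of `copy.deepcopy`. The data is synthetic (`make_classification`) and the names `X_demo`, `base` and `fitted` are hypothetical.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 200 samples, 5 features.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

base = LogisticRegression(max_iter=2000)

# clone() returns an unfitted estimator with the same parameters;
# fitting only on the training split keeps X_te unseen.
fitted = clone(base).fit(X_tr, y_tr)

print(fitted.score(X_tr, y_tr), fitted.score(X_te, y_te))
```

With this layout, the validation score is computed on rows the model never saw during fitting.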

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

In [6]:
from sklearn.metrics import accuracy_score

training_accuracy = []
validation_accuracy = []

# Validate model 1
y_train_pred = model[0].predict(X_train)
training_accuracy.append(accuracy_score(y_train, y_train_pred))
y_val_pred = model[0].predict(X_test)
validation_accuracy.append(accuracy_score(y_test, y_val_pred))

# Validate model 2
y_train_pred = model[1].predict(X_subset_train)
training_accuracy.append(accuracy_score(y_subset_train, y_train_pred))
y_val_pred = model[1].predict(X_subset_test)
validation_accuracy.append(accuracy_score(y_subset_test, y_val_pred))

# Validate model 3
y_train_pred = model[2].predict(X_small_train)
training_accuracy.append(accuracy_score(y_small_train, y_train_pred))
y_val_pred = model[2].predict(X_small_test)
validation_accuracy.append(accuracy_score(y_small_test, y_val_pred))

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [7]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

results = pd.DataFrame({
    'Data size': [X_train.size, X_subset_train.size, X_small.size],
    'Training accuracy': [*training_accuracy],
    'Validation accuracy': [*validation_accuracy]
}, index = ['X and y', 'First two columns', '5% of data'])

print(results)

                   Data size  Training accuracy  Validation accuracy
X and y               209760           0.927174             0.938043
First two columns       7360           0.614946             0.593478
5% of data             13110           0.945652             0.956522
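
The hint in the cell above suggests a loop. A minimal sketch of that idea, on synthetic data, is shown below: each dataset is split, fit and scored in the same loop body, and the rows are collected into one DataFrame. The names `datasets`, `rows` and `results_demo` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 400 samples, 6 features.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
datasets = {
    "all columns": (X_demo, y_demo),
    "first two columns": (X_demo[:, :2], y_demo),
}

rows = []
for name, (X_d, y_d) in datasets.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    rows.append({
        "Data size": X_tr.size,
        "Training accuracy": accuracy_score(y_tr, clf.predict(X_tr)),
        "Validation accuracy": accuracy_score(y_te, clf.predict(X_te)),
    })

# One row per dataset, indexed by the dataset name.
results_demo = pd.DataFrame(rows, index=list(datasets))
print(results_demo)
```

Because every dataset goes through the same loop body, each model is guaranteed to be scored on the same split it was trained on.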


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

Answers:
1. They change with both the number of samples and the number of features. In general, more samples improve the model, but a small subset can sometimes capture the pattern well enough. That was the case with only 5% of the data: training accuracy was 0.946 and validation accuracy 0.957, compared with 0.927 and 0.938 for the full dataset (the 5% model was fit on all of `X_small`, including its validation rows, so its validation accuracy is likely optimistic). Using only the first two features performed poorly: training accuracy was 0.615 and validation accuracy 0.593. Both are low, which suggests the model is underfit, since two features do not carry enough information.
2. A false positive is an email that the model classified as spam but that is actually not spam. A false negative is an email that the model classified as not spam but that is actually spam. In a real-life scenario a false positive is worse, because you can miss an important email if it is sent to the spam folder.
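
To make the two error types concrete, the sketch below counts them with `confusion_matrix` on a handful of hypothetical labels. It assumes spam is encoded as 1 and not spam as 0; the labels are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels (assumption): 1 = spam, 0 = not spam.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"False positives (real email flagged as spam): {fp}")  # 1
print(f"False negatives (spam that reached the inbox): {fn}")  # 1
```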


### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

I completed all the steps in this assignment in the order they appeared.

I sourced my code mainly from the Binary Classification-filled and Linear Classification-filled notebooks on D2L, with some help from ChatGPT. For example, I asked "I have a dataframe df, how do I obtain the first columns only?" and it suggested using `first_two_columns = df[:, :2]`, but that didn't work. I then looked at the notebooks from ENSF 592 and realized I needed to use `.iloc`, so I made that modification myself.

I don't think I encountered any big challenges in terms of coding, but I had some trouble understanding the confusion matrix, recall, precision-recall curves, etc., so I watched several YouTube videos, specifically from StatQuest, and they were helpful. I like that I can go back and rewatch them as many times as needed.


## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [8]:
# TO DO: Import concrete dataset from yellowbrick library
# TO DO: Print size and type of X and y
from yellowbrick.datasets import load_concrete
X, y = load_concrete()

print(f"Size of X: {X.size} and y: {y.size}\nType of X: {type(X)}\nType of y: {type(y)}")

Size of X: 8240 and y: 1030
Type of X: <class 'pandas.core.frame.DataFrame'>
Type of y: <class 'pandas.core.series.Series'>


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [9]:
# TO DO: Check if there are any missing values and fill them in if necessary
if X.isna().any().any():
    print("X contains nan values")
    
if y.isna().any():
    print("y contains nan values")

### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [10]:
# Note: for any random state parameters, you can use random_state = 0
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
linear = LinearRegression().fit(X_train,y_train)

### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [11]:
from sklearn.metrics import mean_squared_error, r2_score

y_train_pred = linear.predict(X_train)
mse_train = mean_squared_error(y_train, y_train_pred)
r2_train = r2_score(y_train, y_train_pred)

y_val_pred = linear.predict(X_test)
mse_val = mean_squared_error(y_test, y_val_pred)
r2_val = linear.score(X_test, y_test) # the score method also calculates the R2
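
As a worked check of what the two metrics measure, the sketch below computes MSE and R2 by hand on a few made-up values and compares them with the sklearn functions. The arrays `y_true` and `y_hat` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

# MSE: mean of the squared residuals.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R2: 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))  # 0.375 0.375
print(r2_manual, r2_score(y_true, y_hat))             # 0.925 0.925
```

MSE is in squared units of the target (lower is better), while R2 is unitless and compares the model against always predicting the mean (higher is better).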

### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [12]:
# TO DO: ADD YOUR CODE HERE
results = pd.DataFrame({
    "Training accuracy": [mse_train, r2_train],
    "Validation accuracy": [mse_val, r2_val]
    }, index = ["MSE", "R2"], )

results

     Training accuracy  Validation accuracy
MSE         110.345501            95.635335
R2            0.609071             0.636898


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

A linear model is not the best approach for this dataset: the R2 score is only about 0.61 on the training set and 0.64 on the validation set. As described on the yellowbrick concrete dataset website, concrete compressive strength is a highly nonlinear function of age and ingredients. A linear model cannot capture patterns of that complexity accurately, so the resulting model has high bias and underfits.

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

I sourced my code from the Linear Regression and Regression Metrics notebooks on D2L, and I completed the steps in the order they were presented. Although it was not required, I printed the X features and looked at the yellowbrick concrete dataset website to understand the data at hand.

I did not use any generative AI for this exercise, as it was not necessary. I did watch the video on R2 (coefficient of determination) from the StatQuest channel on YouTube.

Again, there were no big code-related challenges. However, the R2 scores I observed were not as good as I expected, so I reviewed my code several times to make sure the low R2 values were not due to a coding mistake.

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.

One unusual pattern is that the validation R2 (0.636898) is slightly better than the training R2 (0.609071), meaning the model generalizes "well" to new data, although the R2 is still poor for both. The same pattern appears in the MSE: 110.35 on the training set versus 95.64 on the validation set, indicating once again that the predictions are, on average, closer to the true values for the new data than for the training set. With a single train/test split, this may partly reflect the particular random split rather than a property of the model.

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.

Something I am finding challenging is following along in the lab sessions, because they feel a little rushed and sometimes the microphone is dead, so I can't hear the explanations, which makes the assignments confusing. But I like that we have several examples on D2L to guide us through the process of implementing and validating the machine learning models we study. I find machine learning interesting overall, and I think we will see a huge increase in the use of these algorithms in the real world.

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.
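
One way to generate such a grid is `np.logspace`, which takes the exponents of the endpoints. This is a minimal sketch; the variable name `alphas` is illustrative.

```python
import numpy as np

# Six values from 10**-3 to 10**2, evenly spaced on a log scale.
alphas = np.logspace(-3, 2, num=6)
print(alphas)  # values: 0.001, 0.01, 0.1, 1, 10, 100
```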

In [13]:
# TO DO: ADD YOUR CODE HERE
from sklearn.linear_model import Ridge


alpha = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
df = pd.DataFrame(columns = alpha, 
    index = ['Training score', 'Validation score', 'Number of features'])
df.columns.name = 'Alpha Values'

for a in alpha:
    ridge = Ridge(alpha=a).fit(X_train, y_train) 
    df[a] = [ridge.score(X_train, y_train), ridge.score(X_test, y_test), 
             np.sum(ridge.coef_ != 0)]

df.transpose()

              Training score  Validation score  Number of features
Alpha Values
0.0001              0.609071          0.636898                 8.0
0.001               0.609071          0.636898                 8.0
0.01                0.609071          0.636898                 8.0
0.1                 0.609071          0.636898                 8.0
1.0                 0.609071          0.636899                 8.0
10.0                0.609071          0.636902                 8.0
100.0               0.609071          0.636937                 8.0


In [14]:
from sklearn.linear_model import Lasso
    
alpha = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
df = pd.DataFrame(columns = alpha, 
    index = ['Training score', 'Validation score', 'Number of features'])
df.columns.name = 'Alpha Values'

for a in alpha:
    lasso = Lasso(alpha=a).fit(X_train, y_train) 
    df[a] = [lasso.score(X_train, y_train), lasso.score(X_test, y_test), 
             np.sum(lasso.coef_ != 0)]

df.transpose()

              Training score  Validation score  Number of features
Alpha Values
0.0001              0.609071          0.636898                 8.0
0.001               0.609071          0.636899                 8.0
0.01                0.609071          0.636912                 8.0
0.1                 0.609069          0.637034                 8.0
1.0                 0.608852          0.638035                 8.0
10.0                0.602880          0.638874                 6.0
100.0               0.463736          0.521070                 5.0


For the Ridge model, I found no significant improvement in R2 across the different values of alpha. The only, very slight, increase was in the validation score at higher values of alpha (0.636898 at alpha = 0.0001 versus 0.636937 at alpha = 100), probably due to the regularization effect.

The Lasso model behaves similarly for small values of alpha. Its best validation score, and the best overall, was 0.638874 at alpha = 10, where the number of features with non-zero coefficients dropped from 8 to 6. At alpha = 100 the number of features dropped to 5, and the training and validation scores fell to 0.463736 and 0.52107, suggesting that stronger regularization underfits the model even further.

In conclusion, the R2 scores are poor regardless of the model and alpha value used, because a linear model is unable to capture the pattern within the concrete dataset; a more complex model is needed.
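
The drop in the number of non-zero Lasso coefficients noted above can be reproduced on synthetic data. The sketch below is not the concrete dataset: it uses `make_regression` with 8 features, and the alpha values go beyond the assignment's range purely to show the effect. All names ending in `_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in data (assumption): 8 features, only 3 of them informative.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, n_informative=3,
                                 noise=10.0, random_state=0)

alphas_demo = [0.01, 1, 100, 10000]
nonzero_counts = []
for a in alphas_demo:
    lasso_demo = Lasso(alpha=a, max_iter=10000).fit(X_demo, y_demo)
    # Count the coefficients that Lasso has not shrunk exactly to zero.
    nonzero_counts.append(int(np.sum(lasso_demo.coef_ != 0)))

print(dict(zip(alphas_demo, nonzero_counts)))
```

As alpha grows, the L1 penalty pushes more coefficients to exactly zero; once every coefficient is zero, the model predicts only the mean of the target.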