# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: Paolo Geronimo (30086289)

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [1]:
import numpy as np
import pandas as pd
import yellowbrick.datasets
from sklearn.model_selection import train_test_split

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [2]:
# TO DO: Import spam dataset from yellowbrick library
X, y = yellowbrick.datasets.loaders.load_spam()

# TO DO: Print size and type of X and y
print(f"Size of X: {X.size}")
print(f"Shape of X: {X.shape}")
print("Type of X:")
print(X.dtypes)

print(f"Size of y: {y.size}")
print(f"Shape of y: {y.shape}")
print("Type of y:")
print(y.dtypes)

Size of X: 262200
Shape of X: (4600, 57)
Type of X:
word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               float64
word_freq_hp                  float6

### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [3]:
# TO DO: Check if there are any missing values and fill them in if necessary
print(X.isnull().sum())
print(y.isnull().sum())

word_freq_make                0
word_freq_address             0
word_freq_all                 0
word_freq_3d                  0
word_freq_our                 0
word_freq_over                0
word_freq_remove              0
word_freq_internet            0
word_freq_order               0
word_freq_mail                0
word_freq_receive             0
word_freq_will                0
word_freq_people              0
word_freq_report              0
word_freq_addresses           0
word_freq_free                0
word_freq_business            0
word_freq_email               0
word_freq_you                 0
word_freq_credit              0
word_freq_your                0
word_freq_font                0
word_freq_000                 0
word_freq_money               0
word_freq_hp                  0
word_freq_hpl                 0
word_freq_george              0
word_freq_650                 0
word_freq_lab                 0
word_freq_labs                0
word_freq_telnet              0
word_fre

For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [4]:
# TO DO: Create X_small and y_small

X_small, X1, y_small, y1 = train_test_split(X, y, train_size = 0.05, random_state = 0)
# X1 and y1 hold the remaining 95% of the data and are not used; X_small and y_small hold the 5% sample

X_small.shape

(230, 57)

### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [5]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

# instantiate model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter = 2000)

# in order to loop through the 3 different datasets, create a list of lists
# where each list is the X and Y DataFrames to be processed
# datasets[0] is the full dataset
# datasets[1] is the first 2 columns of X and Y
# datasets[2] is X_small and Y_small
datasets = [[X, y], [X.iloc[:, :2], y], [X_small, y_small]]

results = pd.DataFrame(index = ["Full Data" , "Two Columns", "Small Data"], columns = ["Size", "Training Accuracy", "Validation Accuracy"])

for index in range(len(datasets)):
    current_dataset = datasets[index] # set current dataset
    X_train, X_test, y_train, y_test = train_test_split(current_dataset[0], current_dataset[1], random_state=0)
    model.fit(X_train, y_train)
    results.iloc[index, 0] = current_dataset[0].shape
    results.iloc[index, 1] = model.score(X_train, y_train) # training accuracy
    results.iloc[index, 2] = model.score(X_test, y_test) # validation accuracy

results

| | Size | Training Accuracy | Validation Accuracy |
|---|---|---|---|
| Full Data | (4600, 57) | 0.928406 | 0.937391 |
| Two Columns | (4600, 2) | 0.608406 | 0.613043 |
| Small Data | (230, 57) | 0.936047 | 0.931034 |
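As an alternative to indexing by integer, the loop can iterate over the datasets directly by pairing each one with a name. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the spam data, and the names `X_demo`, `y_demo`, and `results_demo` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (assumption: 500 samples, 10 features).
X_arr, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])

# Map a readable name to each (features, target) pair.
datasets = {
    "Full Data": (X_demo, y_demo),
    "Two Columns": (X_demo.iloc[:, :2], y_demo),
}

rows = []
for name, (X_cur, y_cur) in datasets.items():
    X_train, X_test, y_train, y_test = train_test_split(X_cur, y_cur, random_state=0)
    model = LogisticRegression(max_iter=2000).fit(X_train, y_train)
    rows.append({
        "Dataset": name,
        "Size": X_cur.shape,
        "Training Accuracy": model.score(X_train, y_train),
        "Validation Accuracy": model.score(X_test, y_test),
    })

# Build the results table in one step from the collected rows.
results_demo = pd.DataFrame(rows).set_index("Dataset")
print(results_demo)
```

Unpacking `name, (X_cur, y_cur)` in the `for` statement avoids the index bookkeeping, and collecting rows in a list keeps the DataFrame construction in one place.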


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

*YOUR ANSWERS HERE*
1. Validation accuracy is highest when the most data is used, and the number of features matters more than the number of samples. With the full dataset (4600 samples, 57 features), the training accuracy is 0.928 and the validation accuracy is 0.937. With the same number of samples but only 2 features, both drop sharply, to 0.608 and 0.613. With all 57 features but only 5% of the samples (230), the training accuracy is 0.936 and the validation accuracy is 0.931, so the validation accuracy falls only slightly compared with the full dataset.

2. A false positive would be an email marked as spam when it's not. A false negative would be a spam email not marked as spam. I think the first case, where a real email is marked as spam, is worse. This is because we don't typically go through our spam folders, so the real email isn't likely to be opened. A spam email marked as real isn't as bad because the user could potentially recognize it as spam upon reading it, then mark the email as spam or even block the sender.
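The two error types described above can be counted with a confusion matrix. The following is a minimal sketch using made-up labels (1 = spam, 0 = not spam), not predictions from the model in this notebook.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only: 1 = spam, 0 = not spam.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels [0, 1], ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print(f"False positives (real email flagged as spam): {fp}")
print(f"False negatives (spam that reached the inbox): {fn}")
print(f"Precision: {tp / (tp + fp):.2f}")  # share of flagged emails that really are spam
print(f"Recall: {tp / (tp + fn):.2f}")     # share of spam emails that were flagged
```

If false positives are considered the worse error, precision is the metric to watch, since it drops whenever a real email is flagged as spam.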


### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*
1. I sourced my code from the examples found on D2L. In particular, ML_ex_solution.ipynb.
2. I completed the steps in chronological order.
3. I did not use generative AI.
4. My biggest challenge came from using the train_test_split function. At first I didn't quite understand how it worked. I had a bit of trouble understanding what exactly it returned, and which parameters to use. To get a better grasp of the function, I went over the documentation on the scikit-learn website: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html. Another challenge I faced was creating the loop to generate the results. At first, I tried iterating through the objects themselves, but that was unsuccessful. The current implementation iterates through integers, which seems to work better when getting specific indexes in the list of datasets.
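To make the return values of `train_test_split` concrete, here is a small sketch on toy arrays. The arrays and variable names are made up for illustration; only the function and its `train_size` and `random_state` parameters come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data for illustration: 20 samples, 2 features.
# Row i of X_demo is [2*i, 2*i + 1] and its target is i.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# The function returns the pieces in this order:
# X_train, X_test, y_train, y_test.
# With no size arguments, 25% of the samples go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)

# Rows are shuffled, but each feature row stays paired with its target.
print(np.array_equal(X_tr[:, 0], 2 * y_tr))

# train_size=0.05 keeps 5% in the first piece and leaves the rest in the second.
X_big = np.zeros((200, 3))
y_big = np.zeros(200)
X_s, X_rest, y_s, y_rest = train_test_split(
    X_big, y_big, train_size=0.05, random_state=0
)
print(X_s.shape, X_rest.shape)
```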

## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [6]:
# TO DO: Import concrete dataset from yellowbrick library
# TO DO: Print size and type of X and y
X, y = yellowbrick.datasets.loaders.load_concrete()

print(f"Size of X: {X.shape}")
print(f"Types in X: {X.dtypes}")
print(f"Size of y: {y.shape}")
print(f"Types in y: {y.dtypes}")

Size of X: (1030, 8)
Types in X: cement    float64
slag      float64
ash       float64
water     float64
splast    float64
coarse    float64
fine      float64
age         int64
dtype: object
Size of y: (1030,)
Types in y: float64


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [7]:
# TO DO: Check if there are any missing values and fill them in if necessary
print(X.isnull().sum())
print(y.isnull().sum())

cement    0
slag      0
ash       0
water     0
splast    0
coarse    0
fine      0
age       0
dtype: int64
0
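Neither dataset in this assignment has missing values, so no filling was needed. For reference, the sketch below shows one common approach (median imputation) on a made-up frame; the values and the name `df_filled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps, for illustration only (the concrete data has none).
df = pd.DataFrame({
    "cement": [540.0, np.nan, 332.5],
    "age": [28.0, 270.0, np.nan],
})

# Count the missing values per column, as in the cell above.
print(df.isnull().sum())

# Fill each column's gaps with that column's median.
df_filled = df.fillna(df.median())
print(df_filled)
```

When a train/test split is involved, the medians would normally be computed from the training portion only, so that no information from the test set leaks into the model.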


### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [8]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0

from sklearn.linear_model import LinearRegression
model = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
model.fit(X_train, y_train)

### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [9]:
# TO DO: ADD YOUR CODE HERE
from sklearn.metrics import mean_squared_error, r2_score

# training accuracy
y_train_pred = model.predict(X_train)
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)

# validation accuracy
y_test_pred = model.predict(X_test)
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [10]:
# TO DO: ADD YOUR CODE HERE
results = pd.DataFrame(index = ["MSE", "R2 Score"], columns = ["Training Accuracy", "Validation Accuracy"])

# Use a single .loc[row, column] call; chained indexing (.loc[row][column]) may assign to a copy
results.loc["MSE", "Training Accuracy"] = train_mse
results.loc["MSE", "Validation Accuracy"] = test_mse
results.loc["R2 Score", "Training Accuracy"] = train_r2
results.loc["R2 Score", "Validation Accuracy"] = test_r2

results

| | Training Accuracy | Validation Accuracy |
|---|---|---|
| MSE | 111.358439 | 95.904136 |
| R2 Score | 0.610823 | 0.623414 |


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

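An R2 score of 1 means a perfect fit, so the validation R2 score of ~0.62 shown above indicates a moderately good model: the linear model explains about 62% of the variance in compressive strength and leaves the rest unexplained. The validation MSE of ~95.9 corresponds to a typical error of about 9.8 (the square root of the MSE) in the units of the target.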
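To show what the two metrics measure, the sketch below computes MSE and R2 by hand on a few made-up values, using the standard definitions (MSE is the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares). The last lines take the square root of the validation MSE from the table above to express the error in the units of the target.

```python
import math

# Hypothetical targets and predictions, for illustration only.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

n = len(y_true)
mean_y = sum(y_true) / n

# Residual sum of squares and total sum of squares.
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_y) ** 2 for t in y_true)

mse = ss_res / n
r2 = 1 - ss_res / ss_tot
print(f"MSE = {mse}, R2 = {r2:.3f}")

# RMSE for the validation MSE reported in the results table (95.904136).
reported_rmse = math.sqrt(95.904136)
print(f"Validation RMSE = {reported_rmse:.2f}")
```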

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*
1. I sourced my code from the examples and lab notebooks on D2L. In particular, ML_ex.ipynb and Linear_Regression_Lab_2.ipynb.
2. I completed the steps in chronological order.
3. I did not use generative AI.
4. The challenge I faced in this section was step 4. I originally misinterpreted the problem statement, and thought that mean squared error and R2 score should have been used in some sort of formula to calculate training and validation accuracy. I realized that it meant to use the training/test data to make predictions, then use that to calculate MSE and R2.

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*

The most obvious pattern I found was in the results in Part 1. The full dataset had the best validation accuracy (~0.937), which is expected. However, I found it interesting that the dataset with 2 columns performed significantly worse than the small dataset. The 2-column dataset had a training accuracy of ~0.608, while the small dataset had a training accuracy of ~0.936. This suggests that, for this dataset, removing features has a much more significant impact than removing samples.

In Part 2, I noticed that the validation results were slightly better than the training results (validation R2 of ~0.623 vs. training R2 of ~0.611, and validation MSE of ~95.9 vs. training MSE of ~111.4). This is interesting to me because I would have guessed that the training results would be better, since that data was used to train the model in the first place.
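One way to probe this observation is to repeat the split with several random seeds and see how the gap between validation and training R2 moves. The sketch below is an illustration on synthetic data from `make_regression`, not on the concrete dataset, so its numbers say nothing about the results above; it only shows the procedure.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: 200 samples, 8 features, noisy target).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=30.0, random_state=0
)

gaps = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X_demo, y_demo, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    # LinearRegression.score returns the R2 score.
    gap = model.score(X_test, y_test) - model.score(X_train, y_train)
    gaps.append(gap)
    print(f"seed={seed}: validation R2 - training R2 = {gap:+.3f}")
```

If the sign of the gap changes from seed to seed, a validation score slightly above the training score is more likely an effect of the particular split than a property of the model.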

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

What I liked about this assignment is there is a clear path to get the results, similar to the ML workflow example. Each step in the workflow is distinct and it's not really possible to do them in the wrong order, making it easy to get the hang of. My biggest challenge was learning how to use the train_test_split() method, which I described in the Part 1 Process Description.

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.

In [11]:
# TO DO: ADD YOUR CODE HERE
from sklearn.linear_model import Ridge, Lasso

# Ridge Regression
ridge_results = pd.DataFrame(index = ["MSE", "R2 Score"], columns = ["Training Accuracy", "Validation Accuracy"])
ridge_model = Ridge(alpha = 100).fit(X_train, y_train)

# training accuracy
ridge_train_pred = ridge_model.predict(X_train)
ridge_results.loc["MSE", "Training Accuracy"] = mean_squared_error(y_train, ridge_train_pred)
ridge_results.loc["R2 Score", "Training Accuracy"] = r2_score(y_train, ridge_train_pred)

# validation accuracy
ridge_test_pred = ridge_model.predict(X_test)
ridge_results.loc["MSE", "Validation Accuracy"] = mean_squared_error(y_test, ridge_test_pred)
ridge_results.loc["R2 Score", "Validation Accuracy"] = r2_score(y_test, ridge_test_pred)

# Lasso Regression
lasso_results = pd.DataFrame(index = ["MSE", "R2 Score"], columns = ["Training Accuracy", "Validation Accuracy"])
lasso_model = Lasso(alpha = 0.1).fit(X_train, y_train)

# training accuracy
lasso_train_pred = lasso_model.predict(X_train)
lasso_results.loc["MSE", "Training Accuracy"] = mean_squared_error(y_train, lasso_train_pred)
lasso_results.loc["R2 Score", "Training Accuracy"] = r2_score(y_train, lasso_train_pred)

# validation accuracy
lasso_test_pred = lasso_model.predict(X_test)
lasso_results.loc["MSE", "Validation Accuracy"] = mean_squared_error(y_test, lasso_test_pred)
lasso_results.loc["R2 Score", "Validation Accuracy"] = r2_score(y_test, lasso_test_pred)

In [12]:
print("Linear results\n", results, "\n")
print("Ridge results\n", ridge_results,"\n" )
print("Lasso results\n", lasso_results, "\n")

Linear results
          Training Accuracy Validation Accuracy
MSE             111.358439           95.904136
R2 Score          0.610823            0.623414 

Ridge results
          Training Accuracy Validation Accuracy
MSE             111.358548           95.894268
R2 Score          0.610823            0.623453 

Lasso results
          Training Accuracy Validation Accuracy
MSE             111.359051           95.866646
R2 Score          0.610821            0.623562 



*ANSWER HERE*

After some trial and error with different alpha values, the best value for the Ridge model was 100. When trying different values, the Training R2 score remained the same as the Linear model, while the Validation R2 score saw small improvements as alpha increased.

For the Lasso model, alpha = 0.1 provided the best balance between the Training R2 score and the Validation R2 score. With alpha = 0.01, the Training R2 score is the same as the Linear model, while the Validation R2 score improves very slightly. With alpha = 0.1, the Training R2 score drops by a negligible amount (0.610823 to 0.610821), while the Validation R2 score sees a bigger improvement (0.623414 to 0.623562). This trend continues as alpha increases.

As shown in the three tables above, the Ridge and Lasso R2 scores are better than, but not significantly different from, those of the Linear model. Since the R2 scores didn't change much, these models are also moderately good.
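The trial and error over alpha can also be written as a loop over a logarithmic grid. The sketch below is one possible way to do this, shown on synthetic data from `make_regression` rather than the concrete dataset; the names `alphas` and `best` are hypothetical, and the scores it prints are not the results reported above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the concrete data (assumption).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=20.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, random_state=0
)

# Six values from 0.001 to 100, evenly spaced on a log scale.
alphas = np.logspace(-3, 2, 6)

best = {}
for name, Model in [("Ridge", Ridge), ("Lasso", Lasso)]:
    # Validation R2 for each alpha (score() returns R2 for regressors).
    scores = [
        Model(alpha=a).fit(X_train, y_train).score(X_test, y_test)
        for a in alphas
    ]
    i = int(np.argmax(scores))
    best[name] = (float(alphas[i]), float(scores[i]))
    print(f"{name}: best alpha = {alphas[i]:g}, validation R2 = {scores[i]:.4f}")
```

Picking alpha by the validation score, as done here and in the answer above, gives a slightly optimistic estimate; a separate held-out set or cross-validation would be a stricter check.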