<font size="+3"><b>Assignment 2: Linear Models and Validation Metrics</b></font>

***
* **Full Name** = Dominic Choi
* **UCID** = 30109955
***

<font color='Blue'>In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.</font>

You can use the Table of Contents on the left side of this notebook to navigate efficiently within this document.

|                **Question**                | **Point** |
|:------------------------------------------:|:---------:|
|         **Part 1: Classification**         |           |
|          Step 0: Import Libraries          |           |
|             Step 1: Data Input             |     1     |
|           Step 2: Data Processing          |    1.5    |
| Step 3: Implement Machine Learning   Model |           |
|           Step 4: Validate Model           |           |
|          Step 5: Visualize Results         |     4     |
|                  Questions                 |     4     |
|             Process Description            |     4     |
|           **Part 2: Regression**           |           |
|             Step 1: Data Input             |     1     |
|           Step 2: Data Processing          |    0.5    |
| Step 3: Implement Machine Learning   Model |     1     |
|           Step 4: Validate Model           |     1     |
|          Step 5: Visualize Results         |     1     |
|                  Questions                 |     2     |
|             Process Description            |     4     |
|  **Part 3:   Observations/Interpretation** |   **3**   |
|           **Part 4: Reflection**           |   **2**   |
|                  **Total**                 |   **30**  |
|                                            |           |
|                  **Bonus**                 |           |
|         **Part 5: Bonus Question**         |   **4**   |

# **Part 1: Classification (14.5 marks total)**

|                **Question**                | **Point** |
|:------------------------------------------:|:---------:|
|         **Part 1: Classification**         |           |
|          Step 0: Import Libraries          |           |
|             Step 1: Data Input             |     1     |
|           Step 2: Data Processing          |    1.5    |
| Step 3: Implement Machine Learning   Model |           |
|           Step 4: Validate Model           |           |
|          Step 5: Visualize Results         |     4     |
|                  Questions                 |     4     |
|             Process Description            |     4     |
|                  **Total**                 |  **14.5** |

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:


## **Step 0:** Import Libraries

In [73]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.metrics import mean_squared_error

## **Step 1:** Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library:
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [74]:
# Import spam dataset from yellowbrick library
from yellowbrick.datasets import load_spam

# Load the spam dataset
X, y = load_spam()

# Print the size and type of X and y
print(f"X is size {len(X)}, and of type {type(X)}")
print(f"y is size {len(y)}, and of type {type(y)}")

X is size 4600, and of type <class 'pandas.core.frame.DataFrame'>
y is size 4600, and of type <class 'pandas.core.series.Series'>
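
As a small aside, `len()` reports only the number of rows. A minimal sketch, using a hypothetical toy DataFrame and Series (`X_demo` and `y_demo` are illustrative names, not the spam data), shows how `.shape` reports rows and columns together:

```python
import pandas as pd

# Hypothetical toy data standing in for X and y (not the spam dataset)
X_demo = pd.DataFrame({"f1": [0.1, 0.2, 0.3], "f2": [1, 0, 1]})
y_demo = pd.Series([0, 1, 0], name="target")

# .shape gives (rows, columns) for a DataFrame and (rows,) for a Series
print(f"X has shape {X_demo.shape} and type {type(X_demo)}")
print(f"y has shape {y_demo.shape} and type {type(y_demo)}")
```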


## **Step 2:** Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [75]:
# Check for missing values in X
missing_values = X.isnull().any().any()

if missing_values:
    print("There are missing values in X")

    # Fill missing values with the mean of each column
    X.fillna(X.mean(), inplace=True)

    print("Missing values filled")
else:
    print("There are no missing values in X")

There are no missing values in X
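
Because the spam data has no missing values, the fill branch above never runs. The following is a minimal sketch of what mean imputation would do, using a hypothetical toy DataFrame (`toy` is an illustrative name, not part of the assignment data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# Count the missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Fill each gap with the mean of its own column;
# assigning the result to a new name leaves the original frame untouched
toy_filled = toy.fillna(toy.mean())
print(toy_filled)
```

Counting per column with `isnull().sum()` can be more informative than a single `any().any()` flag, since it shows where the gaps are.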


For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [76]:
# Split the data into a smaller subset containing 5% of the original data
X_small, _, y_small, _ = train_test_split(X, y, train_size=0.05, random_state=0)

# Print the size of X_small and y_small
print("Size of X_small:", X_small.shape)
print("Size of y_small:", len(y_small))

Size of X_small: (230, 57)
Size of y_small: 230
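
The split above samples 5% of the rows at random. One optional variation, shown here only as a sketch on hypothetical synthetic data (`X_demo`, `y_demo`, and the 40/60 class ratio are assumptions for illustration), is to pass `stratify` so that the small subset keeps the same class proportions as the full data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: 1000 samples, 40% labelled 1 and 60% labelled 0
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 400 + [0] * 600)

# Keep 5% of the rows while preserving the class ratio
X_sub, _, y_sub, _ = train_test_split(
    X_demo, y_demo, train_size=0.05, stratify=y_demo, random_state=0
)

print("Subset shape:", X_sub.shape)
print("Share of class 1 in subset:", y_sub.mean())
```

This is not what the assignment code does; it is one way to reduce the chance that a very small subset ends up with an unrepresentative class balance.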


## **Step 3:** Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets:
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

## **Step 4:** Validate Model
Calculate the training and validation accuracy for the three different tests implemented in Step 3

## **Step 5:** Visualize Results (4 marks for steps 3-5)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [77]:
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

# Create DataFrame to store the results
results = pd.DataFrame(columns=["Data Length", "Training Accuracy", "Validation Accuracy"])

# Set up the X and y args array
    # 1st args: X and y
    # 2nd args: Only first two columns of X and y
    # 3rd args: X_small and y_small
X_arg = [X, X.iloc[:, :2], X_small]
y_arg = [y, y, y_small]

for i in range(0, 3):
    # Instantiate model LogisticRegression(max_iter=2000)
    model = LogisticRegression(max_iter=2000)

    # Split data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X_arg[i], y_arg[i], test_size=0.2, random_state=0)

    # Implement the machine learning model
    model.fit(X_train, y_train)

    # Calculate the training and validation accuracy for the three different tests implemented in Step 3
    y_train_model = model.predict(X_train)
    y_train_accuracy = accuracy_score(y_train, y_train_model)
    
    y_test_model = model.predict(X_test)
    y_test_acc = accuracy_score(y_test, y_test_model)
    
    results.loc[i] = [len(X_arg[i]), y_train_accuracy, y_test_acc]

# Print results
print(results)

   Data Length  Training Accuracy  Validation Accuracy
0       4600.0           0.927174             0.936957
1       4600.0           0.614946             0.593478
2        230.0           0.934783             0.913043


## **Questions (4 marks)**
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

<font color='Green'><b>YOUR ANSWERS HERE</b></font>

1. Training and validation accuracy depend much more on the number of informative features per sample than on the number of samples. The model trained on the full dataset (4600 samples) reached 92.7% training and 93.7% validation accuracy, and the model trained on only 5% of the data (230 samples) still reached 93.5% and 91.3%. In contrast, the model trained on only the first two columns reached just 61.5% training and 59.3% validation accuracy.

2. In this case, a false positive is a legitimate email that was marked as spam, and a false negative is a spam email that was not caught. A false negative would be worse, since a malicious spam email could reach your inbox and compromise your security or personal information.
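
To make the two error types concrete, here is a minimal sketch that counts them with `confusion_matrix` from scikit-learn. The labels are hypothetical (1 = spam, 0 = not spam) and are not predictions from the models above:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = spam, 0 = not spam
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"False positives (legitimate mail flagged as spam): {fp}")
print(f"False negatives (spam that reached the inbox): {fn}")
```

Accuracy alone treats both errors the same, so inspecting the counts separately is one way to check which kind of mistake a model makes more often.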

## **Process Description (4 marks)**
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

<font color='Green'><b>DESCRIBE YOUR PROCESS HERE</b></font>

1. Where did you source your code?
    - Lecture Slides on Model Selection: https://d2l.ucalgary.ca/d2l/le/content/569863/viewContent/6238013/View
    - Source from Lecture Slides: https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html
    - GitHub Copilot

2. In what order did you complete the steps?
    - I followed the order provided in this notebook.

3. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
    - GitHub Copilot uses the code already written to predict what comes next. Its suggestions are not always exactly what I need, so I have to modify them. Copilot is most useful for very common tasks, so it is better to treat its suggestions as a template; human intervention is necessary for more specialized tasks.

4. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?
    - I was confused about what was meant by "loop to store data" until I implemented all three cases and realized that it was the same code three times, just with different datasets. I ended up implementing the loop with a list of the datasets as arguments for the `train_test_split` function.
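
The looping idea described above can also be sketched by collecting one row per dataset in a list and building the DataFrame once at the end. This is an alternative sketch, not the code used in this notebook; the data comes from `make_classification`, and names such as `datasets`, `rows`, and `results_demo` are hypothetical:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the spam dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
datasets = {
    "all features": (X_demo, y_demo),
    "first two features": (X_demo[:, :2], y_demo),
}

rows = []
for name, (X_cur, y_cur) in datasets.items():
    # Same split settings for every dataset so the results are comparable
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_cur, y_cur, test_size=0.2, random_state=0
    )
    clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    rows.append({
        "Dataset": name,
        "Data size": len(X_cur),
        "Training accuracy": accuracy_score(y_tr, clf.predict(X_tr)),
        "Validation accuracy": accuracy_score(y_val, clf.predict(X_val)),
    })

# Build the DataFrame once; integer columns stay integers this way
results_demo = pd.DataFrame(rows)
print(results_demo)
```

Building the frame from a list of dictionaries keeps the data size as an integer, whereas assigning mixed rows with `.loc` tends to convert it to a float.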

# **Part 2: Regression (10.5 marks total)**

| **Question**                               | **Point** |
|--------------------------------------------|-----------|
| **Part 2: Regression**                     |           |
| Step 1: Data Input                         | 1         |
| Step 2: Data Processing                    | 0.5       |
| Step 3: Implement Machine Learning   Model | 1         |
| Step 4: Validate Model                     | 1         |
| Step 5: Visualize Results                  | 1         |
| Questions                                  | 2         |
| Process Description                        | 4         |
| **Total**                                  | **10.5**  |

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

## **Step 1:** Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library:
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [78]:
# Import concrete dataset from yellowbrick library
from yellowbrick.datasets import load_concrete

# Load the concrete dataset
X_con, y_con = load_concrete()

# Print size and type of X and y
print(f"X is size {len(X_con)}, and of type {type(X_con)}")
print(f"y is size {len(y_con)}, and of type {type(y_con)}")

X is size 1030, and of type <class 'pandas.core.frame.DataFrame'>
y is size 1030, and of type <class 'pandas.core.series.Series'>


## **Step 2:** Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [79]:
# Check for missing values in X
missing_values = X_con.isnull().any().any()

if missing_values:
    print("There are missing values in X")

    # Fill missing values with the mean of each column
    X_con.fillna(X_con.mean(), inplace=True)

    print("Missing values filled")
else:
    print("There are no missing values in X")

There are no missing values in X


## **Step 3:** Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [80]:
# Note: for any random state parameters, you can use random_state = 0
# First, split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_con, y_con, train_size=0.75, random_state=0)

# Import LinearRegression from sklearn
from sklearn.linear_model import LinearRegression

# Instantiate model LinearRegression()
model = LinearRegression()

# Implement the machine learning model
model.fit(X_train, y_train)

## **Step 4:** Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [81]:
# Training accuracy
y_train_model = model.predict(X_train)

# Calculate the training and validation accuracy using R^2 score and mean squared error
y_train_R2 = model.score(X_train, y_train)
y_train_mse = mean_squared_error(y_train, y_train_model)

# Print the training accuracy
print("Training R^2 Score:", y_train_R2)
print("Training MSE:", y_train_mse)

# Test accuracy
y_test_model = model.predict(X_test)

# Calculate the training and validation accuracy using R^2 score and mean squared error
y_test_R2 = model.score(X_test, y_test)
y_test_mse = mean_squared_error(y_test, y_test_model)

# Print the validation accuracy
print("Validation R^2 Score:", y_test_R2)
print("Validation MSE", y_test_mse)

Training R^2 Score: 0.6108229424520554
Training MSE: 111.35843861132469
Validation R^2 Score: 0.6234144623633329
Validation MSE 95.90413603680645
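
For reference, both metrics can be computed by hand. The sketch below uses hypothetical toy values (not the concrete data) and checks the manual formulas against `mean_squared_error` and `r2_score` from scikit-learn; `r2_score` computes the same quantity that `LinearRegression.score` returns:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical toy targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.0, 9.5])

residuals = y_true - y_hat

# MSE: average squared residual
mse_manual = np.mean(residuals ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares)
r2_manual = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MSE:", mse_manual, "sklearn:", mean_squared_error(y_true, y_hat))
print("R^2:", r2_manual, "sklearn:", r2_score(y_true, y_hat))
```

Note that MSE is an error (lower is better) measured in squared target units, while R^2 is unitless (higher is better), so the two rows of a results table are not on the same scale.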


## **Step 5:** Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [82]:
# Create a DataFrame to store the results
results = pd.DataFrame(columns=["Training Accuracy", "Validation Accuracy"], index=["MSE", "R^2 score"])

# Add Results
results["Training Accuracy"] = [y_train_mse, y_train_R2]
results["Validation Accuracy"] = [y_test_mse, y_test_R2]

# Print results
print(results)

           Training Accuracy  Validation Accuracy
MSE               111.358439            95.904136
R^2 score           0.610823             0.623414


## **Questions (2 marks)**
1. Did using a linear model produce good results for this dataset? Why or why not?

<font color='Green'><b>YOUR ANSWER HERE</b></font>

- The linear model produced only moderately good results. The R^2 score is about 0.61 for training and 0.62 for validation, so the model explains roughly 60% of the variance in compressive strength. However, the training (111.4) and validation (95.9) MSE values are quite high, which indicates that the model's predictions still contain sizeable errors.
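
MSE is expressed in squared units of the target, which makes it hard to judge directly. A minimal sketch, taking the square root of the MSE values printed in Step 4, gives the root mean squared error (RMSE) in the same units as the target (assumed here to be MPa, per the usual description of this dataset):

```python
import math

# MSE values copied from the Step 4 output
train_mse = 111.35843861132469
val_mse = 95.90413603680645

# RMSE is the square root of MSE, in the units of the target
train_rmse = math.sqrt(train_mse)
val_rmse = math.sqrt(val_mse)

print(f"Training RMSE: {train_rmse:.2f}")
print(f"Validation RMSE: {val_rmse:.2f}")
```

A typical error of roughly 10 units can then be compared against the range of strengths in the data to decide whether it is acceptable.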

## **Process Description (4 marks)**
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

<font color='Green'><b>Explain YOUR PROCESS here:</b></font>

1. Where did you source your code?
    - Same D2L resources as Part 1
    - GitHub Copilot

2. In what order did you complete the steps?
    - Same order as notebook

3. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
    - Copilot created a template from the previously written code

4. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?
    - No challenges were encountered for this part, as it was very similar to Part 1, but shorter.

# **Part 3: Observations/Interpretation (3 marks)**

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.

<font color='Green'><b>ADD YOUR FINDINGS HERE</b></font>

- From both Part 1 and Part 2, it appears that the number of features in the dataset has a large effect on the accuracy of the model trained on it. Part 1 had 57 columns, which helped the model achieve high accuracy (about 93% training and 94% validation). Part 2 had only 8 features, and the MSE was high (111.4 training, 95.9 validation). This suggests a relationship between the number of features and the accuracy of the model trained on the dataset.

- The model in Part 2 seems to have high bias, as discussed in the lectures. While the R^2 scores were moderate (about 0.61 to 0.62), the MSE values ranged from about 96 (validation) to 111 (training), which indicates that linear regression might be too simple to predict accurately. The training and validation scores are also close to each other, which is consistent with underfitting rather than overfitting.
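
The underfitting pattern described above can be illustrated on synthetic data. This sketch is an assumption-laden toy example, not an analysis of the concrete dataset: the target is a hypothetical quadratic function of a single feature, so a straight line scores poorly on both training and validation data, while a model with a squared term fits well:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical nonlinear data: y = x^2 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(300, 1))
y = x[:, 0] ** 2 + rng.normal(scale=0.5, size=300)

X_tr, X_val, y_tr, y_val = train_test_split(x, y, test_size=0.25, random_state=0)

# A straight line is too simple for this target (high bias)
linear = LinearRegression().fit(X_tr, y_tr)

# Adding a squared feature gives the model enough flexibility
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_tr, y_tr)

linear_train_r2 = linear.score(X_tr, y_tr)
linear_val_r2 = linear.score(X_val, y_val)
quadratic_train_r2 = quadratic.score(X_tr, y_tr)
quadratic_val_r2 = quadratic.score(X_val, y_val)

print("Linear    R^2 (train, validation):", linear_train_r2, linear_val_r2)
print("Quadratic R^2 (train, validation):", quadratic_train_r2, quadratic_val_r2)
```

The signature of high bias is that the simple model scores low on training and validation data alike, with little gap between the two.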

# **Part 4: Reflection (2 marks)**
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.

<font color='Green'><b>ADD YOUR THOUGHTS HERE</b></font>

- I liked that this is an application of what we have been doing in the labs, and gives more practice and introduction to Regression and Classification models.
- I disliked some of the repetitiveness in the assignment. It felt like I did the same assignment twice.
- I found learning more about regression and classification models to be interesting.

# **Part 5: Bonus Question (4 marks)**

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.
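
A grid of alpha values on a logarithmic scale can be generated rather than typed by hand. This is a minimal sketch using `np.logspace`; the choice of six points (one per power of ten) is an assumption, and a denser grid would work the same way:

```python
import numpy as np

# Six values evenly spaced on a log scale from 10^-3 to 10^2
alphas = np.logspace(-3, 2, num=6)
print(alphas)
```

The resulting array can be passed to the `alphas` parameter of `RidgeCV` or `LassoCV`.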

In [83]:
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV

# Instantiate models
models = []
models.append(RidgeCV(alphas=[0.001, 0.01, 0.1, 1, 10, 100]))
models.append(LassoCV(alphas=[0.001, 0.01, 0.1, 1, 10, 100]))

# Use the concrete dataset from part 2
X_train, X_test, y_train, y_test = train_test_split(X_con, y_con, train_size=0.8, random_state=0)

# Implement the machine learning model
for model in models:
    model.fit(X_train, y_train)

    # Training accuracy
    y_train_model = model.predict(X_train)
    y_train_R2 = model.score(X_train, y_train)
    y_train_mse = mean_squared_error(y_train, y_train_model)

    # Test accuracy
    y_test_model = model.predict(X_test)
    y_test_R2 = model.score(X_test, y_test)
    y_test_mse = mean_squared_error(y_test, y_test_model)

    results = pd.DataFrame(columns=["Training accuracy", "Validation accuracy"], index=["MSE", "R^2 score"])
    results["Training accuracy"] = [y_train_mse, y_train_R2]
    results["Validation accuracy"] = [y_test_mse, y_test_R2]

    print(f"Chosen alpha for {str(model)[:5]}:", model.alpha_)
    print(results)
    print("\n")

Chosen alpha for Ridge: 100.0
           Training accuracy  Validation accuracy
MSE               110.345597            95.625173
R^2 score           0.609071             0.636937


Chosen alpha for Lasso: 0.001
           Training accuracy  Validation accuracy
MSE               110.345501            95.634971
R^2 score           0.609071             0.636899




<font color='Green'><b>ADD YOUR ANSWER HERE</b></font>

- For Ridge, the best alpha was 100 (validation R^2 of 0.636937).
- For Lasso, the best alpha was 0.001 (validation R^2 of 0.636899).

- The accuracy scores changed very little compared to the linear regression in Part 2 (note that Part 5 uses an 80/20 split rather than the 75/25 split from Part 2, so the scores are not strictly comparable). Ridge and Lasso are good at preventing overfitting but do little to address underfitting. In this case the model was underfitting, so neither Ridge nor Lasso is an appropriate way to improve the scores.