# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: 

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [112]:
import numpy as np
import pandas as pd

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [113]:
# TO DO: Import spam dataset from yellowbrick library
from yellowbrick.datasets import load_spam

X, y = load_spam() # returns features and targets for the spam dataset

# TO DO: Print size and type of X and y
print('Size of X:', X.size)  
print('Type of X:', type(X))   

print('Size of y:', y.size)  
print('Type of y:', type(y))  


Size of X: 262200
Type of X: <class 'pandas.core.frame.DataFrame'>
Size of y: 4600
Type of y: <class 'pandas.core.series.Series'>
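
Note that `.size` counts individual elements (rows times columns), which is why `X` reports 262200 while `y` reports 4600. The following is a minimal sketch of the difference between `.size` and `.shape`. It uses a zero-filled stand-in frame rather than the real data, and the 4600 by 57 dimensions are an assumption inferred from the sizes printed above.

```python
import numpy as np
import pandas as pd

# Stand-in with the dimensions implied by the printed sizes (262200 / 4600 = 57 columns)
X_demo = pd.DataFrame(np.zeros((4600, 57)))
y_demo = pd.Series(np.zeros(4600))

print("X size: ", X_demo.size)    # total number of elements
print("X shape:", X_demo.shape)   # (rows, columns)
print("y size: ", y_demo.size)
print("y shape:", y_demo.shape)
```

Printing `.shape` alongside `.size` makes the number of samples and the number of features visible separately.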


### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [114]:
# TO DO: Check if there are any missing values and fill them in if necessary
print(X.isnull().sum())
print(y.isnull().sum())

word_freq_make                0
word_freq_address             0
word_freq_all                 0
word_freq_3d                  0
word_freq_our                 0
word_freq_over                0
word_freq_remove              0
word_freq_internet            0
word_freq_order               0
word_freq_mail                0
word_freq_receive             0
word_freq_will                0
word_freq_people              0
word_freq_report              0
word_freq_addresses           0
word_freq_free                0
word_freq_business            0
word_freq_email               0
word_freq_you                 0
word_freq_credit              0
word_freq_your                0
word_freq_font                0
word_freq_000                 0
word_freq_money               0
word_freq_hp                  0
word_freq_hpl                 0
word_freq_george              0
word_freq_650                 0
word_freq_lab                 0
word_freq_labs                0
word_freq_telnet              0
word_fre

For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [115]:
# TO DO: Create X_small and y_small 
from sklearn.model_selection import train_test_split

X_train, X_small, y_train, y_small = train_test_split(X, y, test_size=0.05, random_state=0)
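
One possible way to keep only the 5% subset is to unpack the unused 95% into throwaway names. This is a sketch on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical, not the spam dataset), so it runs without downloading anything.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (hypothetical: 4600 rows, 3 features)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.random((4600, 3)), columns=["f0", "f1", "f2"])
y_demo = pd.Series(rng.integers(0, 2, size=4600))

# Keep only the 5% "test" portion; the underscores discard the other 95%
_, X_small, _, y_small = train_test_split(
    X_demo, y_demo, test_size=0.05, random_state=0
)

print(X_small.shape, y_small.shape)
```

The underscore is only a naming convention for values that are not needed; the split itself is identical to the cell above.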

### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [116]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression(max_iter=2000)

results_list = []

for feat, targ, size in [[X, y, "X and y"],[X.iloc[:, :2], y, "Only first two columns of X and y"], [X_small, y_small, "X_small and y_small"]]:

    X_train, X_val, y_train, y_val = train_test_split(feat, targ, test_size=0.20, random_state=0)
    
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    
    y_pred = model.predict(X_val)
    
    train_acc = accuracy_score(y_train, y_train_pred)
    val_acc = accuracy_score(y_val, y_pred)

    results_list.append([size, train_acc, val_acc])

results = pd.DataFrame(results_list, columns=['Data size', 'Training accuracy', 'Validation accuracy'])

print(results)

                           Data size  Training accuracy  Validation accuracy
0                            X and y           0.927446             0.936957
1  Only first two columns of X and y           0.614946             0.593478
2                X_small and y_small           0.956522             0.804348


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

*YOUR ANSWERS HERE*

1. Accuracy generally increases with the amount of data used, especially with the number of informative feature columns, although the gain diminishes as more data is added. The specific behaviour of the models above is described below.

Training accuracy was similar when using 5% of the data (0.96) and all the data (0.93), but the validation accuracy was about 13 percentage points higher when using all the data (0.94 vs 0.80).
This illustrates that when less data is used, the model overfits the training data more easily and is less accurate on the validation data.

Training and validation accuracy were substantially worse, by roughly 20 to 35 percentage points, when only the first two feature columns were used (training accuracy = 0.61, validation accuracy = 0.59) compared to using all columns, whether on all rows or on the 5% subset. This is likely because the first two columns are not the only features that determine whether an email is spam, so ignoring the other columns makes the predictions less accurate.
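
The overfitting argument can be made concrete by computing the gap between training and validation accuracy for each run. This is a small sketch that re-enters the accuracies from the printed results table by hand; the `Gap` column name is my own choice.

```python
import pandas as pd

# Accuracies copied from the printed results table
reported = pd.DataFrame(
    {
        "Training accuracy": [0.927446, 0.614946, 0.956522],
        "Validation accuracy": [0.936957, 0.593478, 0.804348],
    },
    index=["X and y", "Only first two columns of X and y", "X_small and y_small"],
)

# A large positive gap (training much higher than validation) suggests overfitting
reported["Gap"] = reported["Training accuracy"] - reported["Validation accuracy"]
print(reported.round(3))
```

The 5% subset shows by far the largest gap, while the full dataset shows almost none.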


2. A false positive means that a valid email was classified as spam (1), and a false negative means that a spam email was classified as valid (0). A false positive is worse in this case because a model like this is often used to filter emails, so a false positive means a valid, and possibly important, email would be marked as spam.
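
False positives and false negatives can be counted directly with a confusion matrix. The sketch below uses a handful of made-up labels purely for illustration (they are not predictions from the model above), with 1 meaning spam and 0 meaning a valid email.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only: 1 = spam, 0 = valid email
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# For binary labels ordered [0, 1], ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print("False positives (valid email flagged as spam):", fp)
print("False negatives (spam let through as valid):  ", fn)
```

Passing `y_val` and `y_pred` from the loop above instead of the toy lists would give the counts for the actual model.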

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

I started creating my code by following the steps outlined. For step 1, I looked at ML_ex_solution.ipynb on d2l and used previous pandas knowledge to write the code for size and type. For step 2, I looked at labs 1 and 2 and wrote code based on them. For steps 3-5, I looked at ML_ex_solution.ipynb and lab 2, and I also prompted ChatGPT with "Create a pandas DataFrame results with columns: Data size, training accuracy, validation accuracy and a for loop to add three sets of data" and referenced that code to write my own. (I had an initial for loop where I was printing three pairs of training and validation accuracies; I referenced the ChatGPT code to create the DataFrame and add the "size" variable to the loop.)

The biggest challenge I had was finding out how to get the validation accuracy for steps 3-5. I had to ask some friends what to do and then realised I had to call the train_test_split function in each loop iteration. I'm also still unsure how to create only X_small and y_small (without X_train and y_train).

## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [117]:
# TO DO: Import concrete dataset from yellowbrick library
from yellowbrick.datasets import load_concrete

X, y = load_concrete() # returns features and targets for the concrete dataset

# TO DO: Print size and type of X and y
print('Size of X:', X.size)  
print('Type of X:', type(X))   

print('Size of y:', y.size)  
print('Type of y:', type(y))   


Size of X: 8240
Type of X: <class 'pandas.core.frame.DataFrame'>
Size of y: 1030
Type of y: <class 'pandas.core.series.Series'>


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [118]:
# TO DO: Check if there are any missing values and fill them in if necessary
print(X.isnull().sum())
print(y.isnull().sum())

cement    0
slag      0
ash       0
water     0
splast    0
coarse    0
fine      0
age       0
dtype: int64
0
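
No values are missing here, so nothing needs to be filled in. For reference, the sketch below shows one way the check and a fill could look if gaps were present. It uses a tiny made-up frame, and median imputation is only one possible choice, picked here as an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two gaps, for illustration only
df = pd.DataFrame({
    "cement": [100.0, np.nan, 300.0],
    "water": [150.0, 160.0, np.nan],
})

total_missing = int(df.isnull().sum().sum())  # one number for the whole frame
print("Missing values:", total_missing)

if total_missing > 0:
    # Replace each gap with the median of its own column
    df = df.fillna(df.median())

print(df)
```

Summing twice collapses the per-column counts into a single total, which is easier to read than a long per-column listing.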


### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [119]:
# TO DO: ADD YOUR CODE HERE
from sklearn.linear_model import LinearRegression

linear_model = LinearRegression()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
    
linear_model.fit(X_train, y_train)

# Note: for any random state parameters, you can use random_state = 0

### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [120]:
# TO DO: ADD YOUR CODE HERE
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

y_train_pred = linear_model.predict(X_train)
y_test_pred = linear_model.predict(X_test)

mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)

r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [121]:
# TO DO: ADD YOUR CODE HERE
results = pd.DataFrame([[mse_train, mse_test],[r2_train, r2_test]],columns=['Training accuracy', 'Validation accuracy'], index=['MSE', 'R2'])
print(results)

     Training accuracy  Validation accuracy
MSE         110.345501            95.635335
R2            0.609071             0.636898


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

Answer: No, primarily because of the R2 value. R2 was about 0.61 for the training set and 0.64 for the validation set, so the model explains only a little over 60% of the variance in compressive strength, which indicates the goodness of fit was not high. The MSE was about 110.3 for the training set and 95.6 for the validation set. Taking the square root gives a typical error of roughly 10, which is sizeable given that the compressive strength values are between 2.3 and 82.6 according to https://www.scikit-yb.org/en/latest/api/datasets/concrete.html, so the MSE also points to a mediocre fit.
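
Because MSE is in squared units, it is easier to judge against the target range after taking the square root (RMSE). A minimal sketch, using the MSE values from the printed results table and the range quoted in the answer:

```python
import math

# MSE values copied from the printed results table
mse_train = 110.345501
mse_val = 95.635335

# Target range as quoted in the answer above
strength_min, strength_max = 2.3, 82.6

# RMSE is in the same units as the target
rmse_train = math.sqrt(mse_train)
rmse_val = math.sqrt(mse_val)
share_of_range = rmse_val / (strength_max - strength_min)

print(f"Training RMSE:   {rmse_train:.2f}")
print(f"Validation RMSE: {rmse_val:.2f}")
print(f"Validation RMSE as a share of the target range: {share_of_range:.1%}")
```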

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

I started creating my code by following the steps outlined. For steps 1 and 2, I referenced steps 1 and 2 of part 1. For steps 3 and 4, I referenced Regression Metrix.ipynb on d2l, lab 2 and the sklearn.metrics documentation online (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html). For step 5, I referenced my previous creation of a pandas DataFrame in step 5 of part 1.

The biggest challenge I had was naming my variables clearly and keeping track of which parameter comes first (y_pred or y). I tried to be very careful not to mix them up, but I still might have.
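
One way to avoid mixing up the argument order is to pass `y_true` and `y_pred` by keyword. The sketch below uses made-up numbers to show why the order matters for R2 but not for MSE.

```python
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical values for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.5, 2.5, 3.5]

# Keyword arguments make the order explicit
r2_correct = r2_score(y_true=y_true, y_pred=y_pred)
r2_swapped = r2_score(y_true=y_pred, y_pred=y_true)
print("R2, correct order:", r2_correct)
print("R2, swapped order:", r2_swapped)

# MSE is symmetric, so a swap would go unnoticed there
mse_a = mean_squared_error(y_true, y_pred)
mse_b = mean_squared_error(y_pred, y_true)
print("MSE, either order:", mse_a, mse_b)
```

R2 divides by the variance of whatever is passed as `y_true`, so swapping the arguments changes the result.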

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*
The main pattern I saw in the results was that as you increase the amount of data, overfitting tends to decrease and the model's performance on unseen data (validation accuracy) improves, as mentioned in lecture. This pattern is demonstrated by the training and validation accuracy scores when using 5% of the data and all the data. When 5% of the data was used, the training accuracy score was 0.96 and the validation accuracy score was 0.80, which shows that the model was overfit to the training data. When all the data was used, the training accuracy was slightly lower at 0.93, but the validation accuracy score was 0.94, which indicates that overfitting decreased and the model's performance on unseen data increased.
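
The same pattern could be checked at more than two data sizes with a learning curve. This is a rough sketch using scikit-learn's `learning_curve` on synthetic data from `make_classification` (an assumption for illustration, not the spam dataset), so the numbers it prints say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (not the spam dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=2000),
    X_demo,
    y_demo,
    train_sizes=[0.05, 0.25, 1.0],  # fractions of the available training folds
    cv=5,
)

# Average over the 5 cross-validation folds for each training size
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={int(n):4d}  training accuracy={tr:.3f}  validation accuracy={va:.3f}")
```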

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*
I enjoyed how the assignment was split into guided steps.
I didn't like the observations/interpretations portion of the assignment as I felt like most of the observations/interpretations were completed in the questions for each part.

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.
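
A logarithmic grid from 0.001 to 100 can be generated with `np.logspace(-3, 2, num=6)`. The sketch below shows what such a sweep over Ridge and Lasso might look like. It runs on synthetic data from `make_regression` (an assumption for illustration, not the concrete dataset), and names such as `sweep` are my own.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the concrete dataset)
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)

# Six log-spaced values: 0.001, 0.01, 0.1, 1, 10, 100
alphas = np.logspace(-3, 2, num=6)

rows = []
for name, Model in [("Ridge", Ridge), ("Lasso", Lasso)]:
    for alpha in alphas:
        reg = Model(alpha=alpha, max_iter=10000).fit(X_tr, y_tr)
        # .score() returns R^2 for sklearn regressors
        rows.append([name, alpha, reg.score(X_tr, y_tr), reg.score(X_va, y_va)])

sweep = pd.DataFrame(rows, columns=["Model", "alpha", "Training R2", "Validation R2"])
print(sweep)

# Row with the highest validation R^2
print(sweep.loc[sweep["Validation R2"].idxmax()])
```

Swapping in the `X` and `y` from Part 2 would give the comparison the question asks for.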

In [None]:
# TO DO: ADD YOUR CODE HERE

*ANSWER HERE*