# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: Marie Howell

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [1]:
import numpy as np
import pandas as pd


### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [3]:
# TO DO: Import spam dataset from yellowbrick library
from yellowbrick.datasets import load_spam

X, y = load_spam()
# TO DO: Print size and type of X and y
print("Shape of x:")
print(X.shape)
print("\nShape of y:")
print(y.shape)

print("\nType of x:")
print(X.dtypes) 
print("\nType of y:")
print(y.dtypes)


Shape of x:
(4600, 57)

Shape of y:
(4600,)

Type of x:
word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               float64
word_freq_hp                  fl

### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [4]:
# TO DO: Check if there are any missing values and fill them in if necessary
print("Checking for null values in X: ")
print(X.isnull().sum())
print("\nChecking for null values in y: ")
print(y.isnull().sum())


Checking for null values in X: 
word_freq_make                0
word_freq_address             0
word_freq_all                 0
word_freq_3d                  0
word_freq_our                 0
word_freq_over                0
word_freq_remove              0
word_freq_internet            0
word_freq_order               0
word_freq_mail                0
word_freq_receive             0
word_freq_will                0
word_freq_people              0
word_freq_report              0
word_freq_addresses           0
word_freq_free                0
word_freq_business            0
word_freq_email               0
word_freq_you                 0
word_freq_credit              0
word_freq_your                0
word_freq_font                0
word_freq_000                 0
word_freq_money               0
word_freq_hp                  0
word_freq_hpl                 0
word_freq_george              0
word_freq_650                 0
word_freq_lab                 0
word_freq_labs                0
word_fre

For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [6]:
# TO DO: Create X_small and y_small 
from sklearn.model_selection import train_test_split

X_small, _, y_small, _= train_test_split(X,y, train_size=0.05, random_state=43)

### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [10]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

from sklearn.linear_model import LogisticRegression


model = LogisticRegression(max_iter=2000)

# using X and y 
X_train, X_test, y_train, y_test= train_test_split(X,y,test_size=0.2, random_state=0)

model.fit(X_train,y_train)

data_accuracy = np.array([[model.score(X_train, y_train), model.score(X_test, y_test)]])

# using only the first two columns of X and y

X2 = X[['word_freq_make', 'word_freq_address']]

X_train, X_test, y_train, y_test= train_test_split(X2,y, test_size=0.2, random_state=0)

model.fit(X_train,y_train)

data_accuracy = np.append(data_accuracy,[[model.score(X_train, y_train), model.score(X_test, y_test)]], axis=0)


# using X_small and y_small
X_train, X_test, y_train, y_test= train_test_split(X_small,y_small, test_size=0.2, random_state=0)

model.fit(X_train,y_train)

data_accuracy = np.append(data_accuracy,[[model.score(X_train, y_train), model.score(X_test, y_test)]], axis=0)

# creating data frame 
results = pd.DataFrame(data_accuracy, columns=["Training accuracy", "Validation accuracy"])

data_size = [X.shape, X2.shape, X_small.shape]

results.insert(0, "Data Size", data_size, True)

results



    Data Size  Training accuracy  Validation accuracy
0  (4600, 57)           0.927717             0.936957
1   (4600, 2)           0.614946             0.593478
2   (230, 57)           0.934783             0.869565
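
The hint in the cell above suggests a loop for filling `results`. A minimal sketch of that idea follows, assuming a synthetic dataset from `make_classification` in place of the spam data. The names `X_demo`, `y_demo`, `datasets`, and `results_demo` are hypothetical and not part of the assignment code.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (assumption: NOT the real dataset).
X_arr, y_arr = make_classification(n_samples=1000, n_features=10, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])
y_demo = pd.Series(y_arr)

# 5% subset, built the same way as X_small and y_small.
X_demo_small, _, y_demo_small, _ = train_test_split(
    X_demo, y_demo, train_size=0.05, random_state=0
)

# The three feature/target pairs to compare.
datasets = {
    "all rows, all columns": (X_demo, y_demo),
    "all rows, first two columns": (X_demo.iloc[:, :2], y_demo),
    "5% of rows, all columns": (X_demo_small, y_demo_small),
}

rows = []
for label, (X_part, y_part) in datasets.items():
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_part, y_part, test_size=0.2, random_state=0
    )
    # A fresh model per dataset keeps the three fits independent.
    clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    rows.append(
        {
            "Dataset": label,
            "Data size": X_part.shape,
            "Training accuracy": clf.score(X_tr, y_tr),
            "Validation accuracy": clf.score(X_va, y_va),
        }
    )

results_demo = pd.DataFrame(rows)
print(results_demo)
```

Collecting one dictionary per dataset and building the DataFrame once avoids the repeated `np.append` calls and keeps the data size next to the scores it belongs to.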


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.

Case 1: All the data is used 

The training accuracy (0.93) is relatively close to 1, which on its own is an attribute of a well-fitting model. Compared against the validation accuracy (0.94), the two values are very close, with the validation accuracy slightly higher. Based on the validation curve discussed in class, this seems like a peculiar outcome. The ideal situation is for both values to be high and close to one another, so on this basis the model appears to fit well. Before concluding this, however, I would want to see how the model performs on other unseen data.


Case 2: Only the first two columns 

In this case the training accuracy is low (0.61), indicating that the model is not accurate. The training and validation accuracy (0.59) are also very close in value, which indicates that the model has high bias and is underfitting the data. Reducing the number of columns considered did not improve the model. This means that the two columns used are not, on their own, predictive of whether an email is spam.


Case 3: 5% of the data 

In this case the training accuracy (0.93) is close to 1 and even a little higher than in the first case. The validation accuracy (0.87) is lower than in the first case, which may indicate that the peak of the validation curve has been passed. If this is the case, the model is overfitting the data.
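
One way to judge whether these gaps are meaningful is a rough standard error for each validation accuracy. The sketch below assumes each validation prediction is an independent Bernoulli trial and that `test_size=0.2` leaves about 20% of the rows for validation. The helper `accuracy_standard_error` is a hypothetical name, and the accuracies are the values reported in the results table.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


# (validation accuracy, total rows) for the three cases in the results table.
cases = {
    "all data": (0.936957, 4600),
    "first two columns": (0.593478, 4600),
    "5% of data": (0.869565, 230),
}

standard_errors = {}
for name, (val_acc, n_rows) in cases.items():
    n_val = round(0.2 * n_rows)  # assumed validation size for test_size=0.2
    standard_errors[name] = accuracy_standard_error(val_acc, n_val)
    print(f"{name}: n_val={n_val}, accuracy={val_acc:.3f}, "
          f"standard error={standard_errors[name]:.3f}")
```

Under these assumptions the 5% case is scored on about 46 emails, so its validation accuracy carries a much larger uncertainty than the full-data case.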


2. In this case, what do a false positive and a false negative represent? Which one is worse?

Since we are looking for emails that should be classified as spam, the positive class is spam emails and the negative class is non-spam emails. A regular email classified as spam is a false positive, and a spam email that is not classified as spam is a false negative. False positives could prevent users from receiving important emails, so I would say they are worse in this scenario. That being said, spam emails can be harmful, so which error is worse depends on the priorities of the user.
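
The two error types can be counted directly from true and predicted labels. The sketch below uses made-up labels (1 = spam, 0 = not spam) purely for illustration; none of the values come from the spam dataset.

```python
# Toy labels, invented for illustration only: 1 = spam, 0 = not spam.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp = fp = fn = tn = 0
for truth, pred in zip(y_true, y_pred):
    if truth == 1 and pred == 1:
        tp += 1  # spam correctly flagged
    elif truth == 0 and pred == 1:
        fp += 1  # regular email wrongly sent to spam (false positive)
    elif truth == 1 and pred == 0:
        fn += 1  # spam that reached the inbox (false negative)
    else:
        tn += 1  # regular email correctly delivered

precision = tp / (tp + fp)  # share of flagged emails that really are spam
recall = tp / (tp + fn)     # share of spam emails that were flagged

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

If false positives are considered worse, precision is the metric to watch; if false negatives are worse, recall matters more.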

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

I based my code on the lecture examples and the lab content. When I needed a refresher on small concepts (like data frames), I googled them and read through websites until I had enough information to understand the concept. Initially I tried to start writing code immediately with the lab exercises as my guidance, but I soon realized I needed a better understanding of the concepts behind what I was trying to do. I went back, read through the lecture slides, and took notes on the key concepts. Then I went through the examples and made sure I understood what was being done in them. From there it was relatively easy to write the code required to complete the assignment.

## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [74]:
# TO DO: Import concrete dataset from yellowbrick library
from yellowbrick.datasets import load_concrete

X_concrete, y_concrete = load_concrete()
# TO DO: Print size and type of X and y
print("Shape of x:")
print(X_concrete.shape)
print("\nShape of y:")
print(y_concrete.shape)

print("\nType of x:")
print(X_concrete.dtypes) 
print("\nType of y:")
print(y_concrete.dtypes)

Shape of x:
(1030, 8)

Shape of y:
(1030,)

Type of x:
cement    float64
slag      float64
ash       float64
water     float64
splast    float64
coarse    float64
fine      float64
age         int64
dtype: object

Type of y:
float64


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [75]:
# TO DO: Check if there are any missing values and fill them in if necessary
print("Checking for null values in X: ")
print(X_concrete.isnull().sum())
print("\nChecking for null values in y: ")
print(y_concrete.isnull().sum())



Checking for null values in X: 
cement    0
slag      0
ash       0
water     0
splast    0
coarse    0
fine      0
age       0
dtype: int64

Checking for null values in y: 
0


### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [76]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0
from sklearn.linear_model import LinearRegression

X_train, X_val, y_train, y_val= train_test_split(X_concrete ,y_concrete,test_size=0.2, random_state=0)

lr =LinearRegression().fit(X_train,y_train)

### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [77]:
# TO DO: ADD YOUR CODE HERE
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

y_train_pred = lr.predict(X_train)
y_val_pred = lr.predict(X_val)

r2_train = r2_score(y_train, y_train_pred)
r2_val = r2_score(y_val,y_val_pred)

print("Training R2 score: {:.2f}".format(r2_train))
print("Validation R2 score: {:.2f}".format(r2_val))

mse_train = mean_squared_error(y_train, y_train_pred)
mse_val = mean_squared_error(y_val,y_val_pred)

print("\nTraining Mean Squared Error: {:.2f}".format(mse_train))
print("Validation Mean Squared Error: {:.2f}".format(mse_val))



Training R2 score: 0.61
Validation R2 score: 0.64

Training Mean Squared Error: 110.35
Validation Mean Squared Error: 95.64
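
For reference, both metrics can be computed by hand. The sketch below uses small made-up values rather than the concrete data, and the function names `mse` and `r2` are hypothetical. It follows the standard definitions: MSE is the mean of the squared residuals, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is measured around the mean target."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, invented for illustration only.
y_toy = [3.0, 5.0, 7.0, 9.0]
y_toy_pred = [2.5, 5.5, 7.0, 8.0]

mse_toy = mse(y_toy, y_toy_pred)
r2_toy = r2(y_toy, y_toy_pred)
print(f"MSE: {mse_toy:.3f}")
print(f"R2:  {r2_toy:.3f}")
```

MSE is in squared target units and has no upper bound, while R2 is unitless and equals 1 for a perfect fit, which is why the two are reported together.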


### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [87]:
# TO DO: ADD YOUR CODE HERE
results = pd.DataFrame({'Metric' : ['R2','MSE'], 'Training accuracy' : [r2_train, mse_train],'Validation accuracy' : [r2_val, mse_val]})

results.set_index('Metric', inplace=True)

results


        Training accuracy  Validation accuracy
Metric
R2               0.609071             0.636898
MSE            110.345501            95.635335


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

Based on the scores calculated, the linear model did not produce good results. The training R2 value (0.61) is low: the model explains only about 61% of the variability in the target vector. The validation R2 value (0.64) is very close to the training value, which indicates high bias and that the model is underfitting the data. More flexibility is required in the model to capture the underlying patterns in the data. Both mean squared error values (110.35 for training and 95.64 for validation) were high for this data set, indicating that the model was not able to learn or predict the patterns in the data effectively.

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

I used pretty much the same process as in the first part of the assignment. When I started on the second part of the assignment, I had already done a thorough review of the lecture content so it went much quicker than the first half. I was also able to use the code from the first part and just adjust it to fit the specifications of the second part. When it came to interpreting the values, I needed to do a little more reading on MSE and R2 values to ensure I fully understood what they represented and what their value says about the model. I googled the two concepts and did a little reading until I felt I had a firm grasp on the concepts. 

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.

In Part 1 we saw the effect of the size of the data on the model. Using the whole data set produced a training accuracy and validation accuracy of 0.93 and 0.94, using only the first two columns produced values of 0.61 and 0.59, and using only five percent of the data set produced values of 0.93 and 0.87. The poorest performance came from the model that used only two columns of data, which had high bias and underfit the data. This showed that, for this data set, the additional columns were necessary to capture the underlying patterns in the data. The results for the whole data set were a little difficult to interpret: the values reflect very high accuracy but could be too close together. The validation accuracy turned out to be higher than the training accuracy, which may be an indication that the training data and testing data are too similar. Overall, the best performance seemed to be the model using only five percent of the data set, although its validation accuracy (0.87) is lower than that of the whole data set (0.94). Both of its values are reasonably high, which is an indication of a well-fitting model.

In Part 2 we saw a data set that was too complex to be adequately described by a linear model. The training and validation R2 scores were 0.61 and 0.64 respectively. These values are low and close together, indicating that the model has high bias and is underfitting the data set. This tells us that we need to increase the model’s complexity to be able to describe the data well. The linear regression model was not able to capture the underlying trends in the data set. The error values for the model were also high. The maximum compressive strength in the target vector y_concrete was 82.60 MPa, and the mean squared error for the training set was 110.35. Because the mean squared error is expressed in squared units (MPa²), it cannot be compared directly with a strength value; its square root is about 10.5 MPa, which is still a sizeable error relative to strengths of at most 82.60 MPa. The mean squared error therefore also reflects a poorly fitting model. In the bonus question we saw that ridge and lasso regression were also unable to produce an adequately fitting model, even with the ability to tune the hyperparameter alpha.
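
To compare the Part 2 error values with compressive strengths in MPa, one option is to take the square root of each mean squared error (the root mean squared error, RMSE). The sketch below uses the MSE values reported in the Part 2 results table. The variable names are hypothetical, and the comparison against the maximum strength of 82.60 MPa is only a rough scale check.

```python
import math

# MSE values copied from the Part 2 results table (units: MPa squared).
mse_train_reported = 110.345501
mse_val_reported = 95.635335
max_strength_mpa = 82.60  # largest target value quoted in the text

# RMSE is back in the units of the target (MPa).
rmse_train = math.sqrt(mse_train_reported)
rmse_val = math.sqrt(mse_val_reported)

print(f"Training RMSE:   {rmse_train:.2f} MPa")
print(f"Validation RMSE: {rmse_val:.2f} MPa")
print(f"Training RMSE as a share of the maximum strength: "
      f"{rmse_train / max_strength_mpa:.1%}")
```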


## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.

I liked getting the chance to go through the ML workflow myself. I think it helped me solidify the concepts we’ve been going over in class. I would have liked to apply a linear model to a data set that it fits better, to see what that looks like for myself.

At first, I found it confusing trying to figure out how to calculate and interpret the different values related to the different models. But after reviewing the class notes and the examples on d2l I was able to understand the assignment better. 

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.

In [72]:
# TO DO: ADD YOUR CODE HERE
# Ridge Regression 
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from yellowbrick.datasets import load_concrete
from sklearn.model_selection import train_test_split

X_concrete, y_concrete = load_concrete()

X_train, X_val, y_train, y_val= train_test_split(X_concrete ,y_concrete,test_size=0.2, random_state=42)
ridge001 = Ridge(alpha=0.001).fit(X_train, y_train)

print("Ridge Regression with alpha = 0.001")
print(" Training score: {:.2f}".format(ridge001.score(X_train, y_train)))
print(" Validation score: {:.2f}".format(ridge001.score(X_val, y_val)))

ridge100 = Ridge(alpha=100).fit(X_train, y_train)

print("\nRidge Regression with alpha = 100")
print(" Training score: {:.2f}".format(ridge100.score(X_train, y_train)))
print(" Validation score: {:.2f}".format(ridge100.score(X_val, y_val)))

#Lasso Regression 
lasso001 = Lasso(alpha=0.001).fit(X_train, y_train)

print("\nLasso Regression with alpha = 0.001")
print("Training set score: {:.2f}".format(lasso001.score(X_train, y_train)))
print("Validation set score: {:.2f}".format(lasso001.score(X_val, y_val)))

lasso100 = Lasso(alpha=100).fit(X_train, y_train)

print("\nLasso Regression with alpha = 100")
print("Training set score: {:.2f}".format(lasso100.score(X_train, y_train)))
print("Validation set score: {:.2f}".format(lasso100.score(X_val, y_val)))



Ridge Regression with alpha = 0.001
 Training score: 0.61
 Validation score: 0.63

Ridge Regression with alpha = 100
 Training score: 0.61
 Validation score: 0.63

Lasso Regression with alpha = 0.001
Training set score: 0.61
Validation set score: 0.63

Lasso Regression with alpha = 100
Training set score: 0.47
Validation set score: 0.44


Ridge Regression 

Changing alpha from 0.001 to 100 had no effect on the accuracy scores. This means that both lowering and raising the regularization of the model made no improvement. 

Lasso Regression 

Lowering alpha had no effect on the accuracy scores, while raising alpha decreased them. This means that lowering the regularization of the model had no effect and raising the regularization made the model less accurate. Increasing regularization strength forces the model to simplify and generalize. The fact that further regularization had a negative effect indicates that this model is not complex enough to capture the patterns in the data.

Conclusion 

None of the linear models were able to effectively fit this data set. Even tuning the hyperparameter alpha in the Ridge and Lasso regression models did not yield better results.

