# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: Bogdan Constantinescu

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [3]:
# TO DO: Import spam dataset from yellowbrick library
from yellowbrick.datasets import load_spam
X, y = load_spam()
# TO DO: Print size and type of X and y
print("Size of X: " + str(X.size))
print("Type of X: " + str(type(X)))
print("Size of y: " + str(y.size))
print("Type of y: " + str(type(y)))

Size of X: 262200
Type of X: <class 'pandas.core.frame.DataFrame'>
Size of y: 4600
Type of y: <class 'pandas.core.series.Series'>


### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [4]:
# TO DO: Check if there are any missing values and fill them in if necessary
print(X.isna().sum().sum())
print("There are no null values")

0
There are no null values


For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [6]:
# TO DO: Create X_small and y_small 
X_small, _, y_small, _ = train_test_split(X, y, test_size=0.95, random_state=0)


### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [9]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression(max_iter=2000)

# Each entry pairs a label with the feature matrix and target vector to evaluate
datasets = [
    ("X and y", X, y),
    ("First two columns of X and y", X.iloc[:, :2], y),
    ("X_small and y_small", X_small, y_small),
]

rows = []
for name, X_data, y_data in datasets:
    # Hold out 30% of each dataset for validation
    X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=0)
    model.fit(X_train, y_train)
    rows.append({
        "Data size": name,
        "Training accuracy": accuracy_score(y_train, model.predict(X_train)),
        "Validation accuracy": accuracy_score(y_test, model.predict(X_test)),
    })

results = pd.DataFrame(rows, columns=["Data size", "Training accuracy", "Validation accuracy"])

print(results)




                      Data size  Training accuracy  Validation accuracy
0                       X and y           0.927640             0.934783
1  First two columns of X and y           0.611180             0.607246
2           X_small and y_small           0.931677             0.956522


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

*YOUR ANSWERS HERE*

1. The training accuracy tends to improve when more data is used. With only the first two columns of X the training accuracy is 61.1%, whereas with the entire dataset it is 92.8%. A caveat is that when X_small was used, the training accuracy was slightly higher (93.2%) and the validation accuracy was higher as well (95.7%). A possible explanation is that a small sample is easier to fit, and that the validation score is noisy: X_small leaves only about 69 emails for validation, so each email shifts the validation accuracy by roughly 1.4 percentage points. The validation accuracy follows the same overall trend as the training accuracy, going up from 60.7% (first two columns of X) to 93.5% (entire dataset).
The total number of values (rows times columns, before the train/validation split) when only using the first two columns is 4600 * 2 = 9200. The total for X_small is 4600 * 0.05 * 57 = 13110. The total for the entire dataset is 4600 * 57 = 262200. In relative terms, X_small and the first two columns contain a similar number of values, yet the validation accuracy is much higher for X_small (95.7% versus 60.7%). This would seem to indicate that it is more important to have a wide range of features (X_small) than a lot of rows from only two features (two columns of X).
2. In this case (determining whether an email is spam or not) a false positive represents an email that was flagged as spam when in reality it is not. A false negative represents an email that was not flagged as spam when in reality it is. A false positive is worse in this situation because it means an email that is not spam will appear in your spam folder and you might miss important information.
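
To make the two error types concrete, here is a minimal sketch that counts them by hand. The labels below are invented toy data (1 = spam, 0 = not spam), not predictions from the model in this notebook.

```python
# Toy labels, invented for illustration: 1 = spam, 0 = not spam
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1, 0, 1]

pairs = list(zip(y_true, y_pred))

# False positive: a legitimate email (0) that was flagged as spam (1)
false_positives = sum(1 for t, p in pairs if t == 0 and p == 1)

# False negative: a spam email (1) that was not flagged (0)
false_negatives = sum(1 for t, p in pairs if t == 1 and p == 0)

# Accuracy treats both kinds of error the same way
accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

print("False positives:", false_positives)
print("False negatives:", false_negatives)
print("Accuracy:", accuracy)
```

Because accuracy weighs both errors equally, it cannot show on its own whether a model makes the more costly kind of mistake.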

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?


*DESCRIBE YOUR PROCESS HERE*

The process I implemented was to use my current knowledge, previous notebooks from the labs that we have already done, as well as online resources such as the scikit-yb (Yellowbrick) website (https://www.scikit-yb.org/en/latest/api/datasets/spam.html). The majority of my code was sourced from similar examples covered in the labs as well as knowledge I accumulated from ENSF 593. I completed the steps in the exact order they were outlined in the notebook. I did not use generative AI for this assignment. A challenge I had was figuring out how to use train_test_split to keep only 5% of the overall data. I overcame this challenge by using the VS Code prompt that explains exactly what the method is and all of its parameters. In this way I determined that setting test_size=0.95 and keeping only the training portion leaves 5% of the data.
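
The idea of keeping 5% of the rows can be sketched without scikit-learn. The helper `keep_fraction` below is hypothetical and written only for illustration. It is not how `train_test_split` is implemented and will not select the same rows.

```python
import random


def keep_fraction(n_rows, fraction, seed=0):
    """Return sorted indices of a random subset of rows (hypothetical helper)."""
    n_keep = round(n_rows * fraction)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return sorted(rng.sample(range(n_rows), n_keep))


# The spam dataset has 4600 rows; keep 5% of them
kept = keep_fraction(4600, 0.05)
print(len(kept))
```

With `train_test_split` itself, passing `train_size=0.05` is an alternative to `test_size=0.95` that states the intent more directly.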

## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [10]:
# TO DO: Import concrete dataset from yellowbrick library
# TO DO: Print size and type of X and y
from yellowbrick.datasets import load_concrete
X, y = load_concrete()

print("Size of X: " + str(X.size))
print("Type of X: " + str(type(X)))
print("Size of y: " + str(y.size))
print("Type of y: " + str(type(y)))


Size of X: 8240
Type of X: <class 'pandas.core.frame.DataFrame'>
Size of y: 1030
Type of y: <class 'pandas.core.series.Series'>


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [11]:
# TO DO: Check if there are any missing values and fill them in if necessary
print(X.isna().sum().sum())
print("There are no null values")

0
There are no null values


### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [12]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0

from sklearn.linear_model import LinearRegression
model = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
lr = model.fit(X_train, y_train) 
y_train_pred = lr.predict(X_train)   
y_pred = lr.predict(X_test)


### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [13]:
# TO DO: ADD YOUR CODE HERE
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

mse_train = mean_squared_error(y_train, y_train_pred)
r2_train = r2_score(y_train, y_train_pred)
mse_val = mean_squared_error(y_test, y_pred)
r2_val= r2_score(y_test, y_pred)
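
As a sketch of what these two metrics measure, the standard definitions can be written out in plain Python. The functions `mse` and `r2` and the toy values below are illustrative only and are not taken from the concrete dataset.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 minus the ratio of residual variation to total variation."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values, invented for illustration
y_true_toy = [10.0, 20.0, 30.0, 40.0]
y_pred_toy = [12.0, 18.0, 33.0, 37.0]

print("MSE:", mse(y_true_toy, y_pred_toy))
print("R2:", r2(y_true_toy, y_pred_toy))
```

MSE is expressed in squared units of the target, so lower is better. R2 is unitless, and a value of 1 means the predictions match the targets exactly.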


### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [14]:
# TO DO: ADD YOUR CODE HERE
results = pd.DataFrame(columns=["Training accuracy", "Validation accuracy"], index = ["MSE", "R2"])
results["Training accuracy"] = [mse_train, r2_train]
results["Validation accuracy"] = [mse_val, r2_val]
results

     Training accuracy  Validation accuracy
MSE         113.410826            93.624364
R2            0.606594             0.635277


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?
 
### Answer
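A linear model produced decent but not great results for this dataset. The validation R2 score of 0.635 means the model explains only about 64% of the variance in compressive strength, so it is not a very good fit for the data. Furthermore, the validation MSE of 93.6 is quite high: its square root, roughly 9.7 in the units of the target, is the typical size of a prediction error. The training scores are slightly worse than the validation scores (R2 of 0.607, MSE of 113.4), which suggests the model is too simple for this data rather than overfitting it.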
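
The MSE values are easier to interpret after taking the square root, which puts the error back into the same units as the target. A short sketch using the MSE values reported in the results table above:

```python
import math

# MSE values copied from the results table in Step 5
mse_train_reported = 113.410826
mse_val_reported = 93.624364

# Root mean squared error: same units as the target variable
rmse_train = math.sqrt(mse_train_reported)
rmse_val = math.sqrt(mse_val_reported)

print(f"Training RMSE:   {rmse_train:.2f}")
print(f"Validation RMSE: {rmse_val:.2f}")
```

Whether an error of this size is acceptable depends on the range of strengths in the dataset and on the tolerance of the application, neither of which is examined here.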

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

Most of the code was sourced from my own knowledge as well as the material that has been covered so far in the course (lab notebooks covering similar content). I completed the steps in the order that they are outlined. 
For this part of the assignment I did use generative AI (namely ChatGPT). I used it for clarification on how to use the linear regression model to predict my target vectors. I also used it to show me how to calculate MSE and R2. I prompted it using the code I already had, and asked it to perform the MSE and R2 calculations. I did not have to modify the code at all.
I did not encounter any real challenges for this part of the assignment.

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*

For part 2 of the assignment there is no real pattern to determine as we are only using one model (linear regression) on one set of data, so there is nothing to compare.

For part 1, I already touched upon the pattern in the Questions section. I will reiterate it below.

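The first pattern is that the more data used to train the model, the better the validation accuracy tends to be: it rises from 60.7% with the first two columns of X to 93.5% with the entire dataset.
The second pattern I noticed is that a wider range of features improves the accuracy of the model more than a similar number of data points drawn from fewer attributes: X_small reaches a validation accuracy of 95.7%, compared with 60.7% for the first two columns.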
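
The sizes behind this comparison can be checked with a few lines of arithmetic. The sketch below assumes the 30% hold-out of the 230-row subset yields 69 validation rows, which is consistent with the reported validation accuracy of 0.956522 (66 correct out of 69).

```python
n_rows, n_cols = 4600, 57

# Number of values (rows times columns) in each dataset before splitting
values_two_columns = n_rows * 2
n_small = n_rows * 5 // 100          # 5% of the rows
values_small = n_small * n_cols
values_full = n_rows * n_cols

# Assumed validation size for X_small: 30% of 230 rows
n_val_small = 69
accuracy_step = 1 / n_val_small      # change in accuracy per email

print(values_two_columns, values_small, values_full)
print(f"One validation email changes accuracy by {accuracy_step:.4f}")
```

With so few validation rows, a gap of about two percentage points between X_small and the full dataset corresponds to only one or two emails.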

For a more in-depth answer, you can refer back to the Questions section of Part 1.

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

I enjoyed this assignment because by working through all the steps I gained a much better understanding of the models themselves, how they work, and their output. I also learned more about the methods that we have at our disposal (like train_test_split) and what they fundamentally mean. This assignment also helped clarify the distinction between classification (logistic regression), which predicts discrete classes, and linear regression, which predicts continuous values, as well as the metrics we use to evaluate each.
I liked that the reflection questions made me actually think about what I was doing and how the models worked, instead of just plugging in the methods.

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.
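
One way to build such a grid is to raise 10 to integer exponents, which avoids the small floating-point drift that can accumulate when a value is repeatedly multiplied by 10. This is a sketch of an alternative and not the approach used in the cell below.

```python
# Powers of ten from 10^-3 to 10^2, one value per decade
alphas = [10.0 ** exponent for exponent in range(-3, 3)]

for alpha in alphas:
    print(alpha)
```

If NumPy is preferred, `np.logspace(-3, 2, 6)` produces the same six values.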

In [15]:
# TO DO: ADD YOUR CODE HERE

from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

# Reuses X_train, X_test, y_train, y_test from the Part 2 split
value = 0.001   # current alpha, multiplied by 10 on each iteration
val_r = value   # alpha with the best validation R2 for Ridge
val_l = value   # alpha with the best validation R2 for Lasso
max_r = 0       # best validation R2 seen so far for Ridge
max_l = 0       # best validation R2 seen so far for Lasso
while value <= 100:
    model_r = Ridge(alpha=value).fit(X_train, y_train)
    model_l = Lasso(alpha=value).fit(X_train, y_train)

    # Score each model on the validation split and keep the best alpha
    r2_val_r = r2_score(y_test, model_r.predict(X_test))
    if r2_val_r > max_r:
        max_r = r2_val_r
        val_r = value
    r2_val_l = r2_score(y_test, model_l.predict(X_test))
    if r2_val_l > max_l:
        max_l = r2_val_l
        val_l = value

    value *= 10

print("Ridge highest R2 value: " + str(max_r) + " with alpha value of: " + str(val_r))
print("Lasso highest R2 value: " + str(max_l) + " with alpha value of: " + str(val_l))




Ridge highest R2 value: 0.6353495924197798 with alpha value of: 100.0
Lasso highest R2 value: 0.6370126170149091 with alpha value of: 10.0


*ANSWER HERE*

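The lasso method gave me the highest validation R2 value of 0.637 with an alpha value of 10. Ridge peaked at 0.635 with an alpha value of 100. Both are essentially the same as the 0.635 obtained with plain linear regression in Part 2, so regularization did not meaningfully improve the model. This score is not very good, as only about 64% of the variability of the data is explained by the regression model. It might be good enough, as this data concerns concrete mix parameters and there is probably some error tolerance allowed, but overall it is not a great model.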