# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: <span style="color:red"> Christopher Proc</span>

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [8]:
import numpy as np
import pandas as pd

In [9]:
# TO DO: Import spam dataset from yellowbrick library
# TO DO: Print size and type of X and y
from yellowbrick.datasets.loaders import load_spam
x,y = load_spam()
print(x.shape)
print(x.dtypes)
print(y.shape)
print(y.dtypes)

(4600, 57)
word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               float64
word_freq_hp                  float64
word_freq_hpl                 float64
w

### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [10]:
print(f'Check for Nulls in x: {x.isnull().sum()}')

print(f'Check for Null values in y: {y.isnull().sum()}')

Check for Nulls in x: word_freq_make                0
word_freq_address             0
word_freq_all                 0
word_freq_3d                  0
word_freq_our                 0
word_freq_over                0
word_freq_remove              0
word_freq_internet            0
word_freq_order               0
word_freq_mail                0
word_freq_receive             0
word_freq_will                0
word_freq_people              0
word_freq_report              0
word_freq_addresses           0
word_freq_free                0
word_freq_business            0
word_freq_email               0
word_freq_you                 0
word_freq_credit              0
word_freq_your                0
word_freq_font                0
word_freq_000                 0
word_freq_money               0
word_freq_hp                  0
word_freq_hpl                 0
word_freq_george              0
word_freq_650                 0
word_freq_lab                 0
word_freq_labs                0
word_freq_telnet  

For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [11]:
# TO DO: Create X_small and y_small 

from sklearn.model_selection import train_test_split
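# Each entry stores [[X_train, X_val], [y_train, y_val]].
# x and y are split in separate calls; the shared random_state keeps their rows aligned.
dataset = {}

# Small: keep 5% of the rows, then split that subset into training and validation sets.
X_small = train_test_split(x, train_size=0.05, random_state=0)
y_small = train_test_split(y, train_size=0.05, random_state=0)
X_small = train_test_split(X_small[0], random_state=0)
y_small = train_test_split(y_small[0], random_state=0)
dataset["Small"] = [X_small, y_small]

# Full: all rows and all columns.
X_full = train_test_split(x, random_state=0)
y_full = train_test_split(y, random_state=0)
dataset["Full"] = [X_full, y_full]

# Partial: a column subset of x. Note that 0:1 selects only the first column;
# the first two columns would be x.iloc[:, 0:2].
X_col = train_test_split(x.iloc[:, 0:1], random_state=0)
y_col = train_test_split(y, random_state=0)
dataset["Partial"] = [X_col, y_col]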

for data in dataset:
    print(dataset.get(data)[0][0])

      word_freq_make  word_freq_address  word_freq_all  word_freq_3d  \
2008            0.00               0.00           0.52           0.0   
2813            0.00               0.00           0.00           0.0   
4180            0.00               0.00           0.00           0.0   
1472            0.10               0.00           0.70           0.0   
2850            0.16               0.00           0.00           0.0   
...              ...                ...            ...           ...   
4051            0.00               0.00           0.00           0.0   
2176            0.00               0.00           0.12           0.0   
180             0.34               0.26           0.26           0.0   
2389            0.00               0.00           0.00           0.0   
1589            0.00               0.00           0.00           0.0   

      word_freq_our  word_freq_over  word_freq_remove  word_freq_internet  \
2008           0.52            0.00              0.00     
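
Splitting `x` and `y` in separate calls works only because both calls share the same `random_state`. scikit-learn's `train_test_split` also accepts several arrays in one call, as Part 2 of this notebook does with `train_test_split(x, y, random_state=0)`. The idea behind keeping rows aligned can be sketched with the standard library alone. The helper `split_together` and the toy data are hypothetical, and this is not scikit-learn's implementation:

```python
import random


def split_together(features, labels, train_fraction=0.75, seed=0):
    """Shuffle one list of row indices and apply it to both inputs."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_fraction)
    train_idx, val_idx = indices[:cut], indices[cut:]
    X_train = [features[i] for i in train_idx]
    X_val = [features[i] for i in val_idx]
    y_train = [labels[i] for i in train_idx]
    y_val = [labels[i] for i in val_idx]
    return X_train, X_val, y_train, y_val


# Toy data: each label equals the first feature, so alignment is easy to check.
features = [[i, i * 2] for i in range(20)]
labels = list(range(20))
X_train, X_val, y_train, y_val = split_together(features, labels)
print(len(X_train), len(X_val))
```

Because one index list drives both selections, a feature row can never be paired with another row's label.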

### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3
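
For a scikit-learn classifier, `model.score(X, y)` returns the mean accuracy, i.e. the fraction of predictions that match the true labels. A minimal standard-library sketch of that calculation follows; the function name `accuracy` and the toy labels are illustrative assumptions, not part of the assignment data:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 1 = spam, 0 = not spam.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
acc = accuracy(y_true, y_pred)
print(acc)  # 6 of the 8 predictions match, so 0.75
```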

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [12]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(max_iter=2000)


results = pd.DataFrame(columns=['Size', 'Training Accuracy', 'Validation Accuracy'])

for modelName in dataset:
    data = dataset.get(modelName)
    X_train, X_val = data[0]
    y_train, y_val = data[1]

    model = log_reg.fit(X_train, y_train)
    results.loc[modelName] = {'Size': X_train.size, 'Training Accuracy': model.score(X_train, y_train), 'Validation Accuracy': model.score(X_val, y_val)}

print(results)


           Size  Training Accuracy  Validation Accuracy
Small      9804           0.936047             0.931034
Full     196650           0.928116             0.938261
Partial    3450           0.609565             0.612174
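
One way to read the table above is to look at the gap between training and validation accuracy for each dataset. The sketch below only re-uses the numbers printed in `results`; the dictionary name `recorded` is a hypothetical helper:

```python
# (training accuracy, validation accuracy) copied from the printed results.
recorded = {
    "Small": (0.936047, 0.931034),
    "Full": (0.928116, 0.938261),
    "Partial": (0.609565, 0.612174),
}

# A positive gap means the model scored higher on the data it was trained on.
gaps = {name: round(train - val, 6) for name, (train, val) in recorded.items()}

for name, gap in gaps.items():
    print(f"{name:8s} train - validation = {gap:+.6f}")
```

Only the Small model has a positive gap. The gaps are all small, though, and the Small validation split holds only 58 rows (230 sampled rows minus 172 training rows), so a single email moves its validation accuracy by about 1.7 percentage points.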


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.

    <span style="color:red"> The results show two general trends:</span>

    <span style="color:red">- Limiting the model to a small subset of the features, as in the "Partial" model, greatly reduces its ability to fit both the training and validation data. Its training and validation accuracies (0.610 and 0.612) are far lower than those of the other models. This indicates that more features will generally produce better results, as long as those features are relevant.</span>
    
    <span style="color:red">- Training on a larger training set ("Full" vs "Small") gave a lower training accuracy (0.928 vs 0.936) but a higher validation accuracy (0.938 vs 0.931). The larger training set appears to represent the overall dataset better and is less prone to overfitting, which is consistent with its better validation performance. The small training set was more prone to overfitting and may not have captured all the characteristics of the full dataset, so it performed slightly worse on its validation set. Direct comparison is difficult because the two models were validated against validation sets of different sizes. When evaluated on the same validation set as the Full model, the model trained on the small dataset has a lower validation accuracy (0.913, not shown in the code above).</span>

2. In this case, what do a false positive and a false negative represent? Which one is worse?

    <span style="color:red"> In this example, a false positive is a valid e-mail that is marked as spam and filtered out. A false negative is a spam e-mail that is marked as valid and allowed through the filter. False positives are much worse in this case: missing a valid e-mail may cause significant harm or inconvenience, whereas the penalty for a false negative (receiving and then deleting a spam e-mail) is relatively low. </span>
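
The two error types described above can be counted directly from predictions. This is a minimal sketch that assumes spam is encoded as the positive class `1`; the function `confusion_counts` and the toy labels are hypothetical:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # spam correctly filtered
        elif pred == positive:
            fp += 1  # valid e-mail wrongly filtered (false positive)
        elif truth == positive:
            fn += 1  # spam let through (false negative)
        else:
            tn += 1  # valid e-mail correctly delivered
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


# Hypothetical labels: 1 = spam, 0 = valid e-mail.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)

# Precision: of the e-mails flagged as spam, how many really were spam.
precision = counts["tp"] / (counts["tp"] + counts["fp"])
print(counts, precision)
```

Since false positives are the costlier error here, precision is a more telling metric than accuracy alone: it falls whenever a valid e-mail is filtered out.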


### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?

    <span style="color:red"> Code was sourced from the course notes and examples, as well as documentation from scikit-learn.org. A few syntax-related questions were answered by reading threads on StackOverflow.com and Geeksforgeeks.org.</span>
1. In what order did you complete the steps?

    <span style="color:red"> Steps were completed in order, 1-5. For the models, all data was split first, then all the models were fit, then the appending of results.</span> 
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?

    <span style="color:red"> I did not use any Generative AI for this assignment.</span>
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

    <span style="color:red"> Creating and validating the models was straightforward, based on the course content. This exercise was similar to ones we did during the lab and in class - attending helped. </span>

## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [13]:
# TO DO: Import concrete dataset from yellowbrick library
# TO DO: Print size and type of X and y
from yellowbrick.datasets.loaders import load_concrete
x,y = load_concrete()
print(x.shape)
print(x.dtypes)
print(y.shape)
print(y.dtypes)

(1030, 8)
cement    float64
slag      float64
ash       float64
water     float64
splast    float64
coarse    float64
fine      float64
age         int64
dtype: object
(1030,)
float64


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [14]:
# TO DO: Check if there are any missing values and fill them in if necessary
print(f'Check for Nulls in x: {x.isnull().sum()}')

print(f'Check for Null values in y: {y.isnull().sum()}')

Check for Nulls in x: cement    0
slag      0
ash       0
water     0
splast    0
coarse    0
fine      0
age       0
dtype: int64
Check for Null values in y: 0
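
No missing values were found, so nothing needs to be filled in here. For a case where the check does turn up gaps, one common option is mean imputation, which in pandas can be written as `x.fillna(x.mean())`. The standard-library sketch below shows the same idea on a single column; the function name and the sample values are hypothetical and are not taken from the concrete dataset:

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_missing_with_mean(column):
    """Replace missing entries with the mean of the observed entries."""
    present = [v for v in column if not is_missing(v)]
    if not present:
        raise ValueError("column has no observed values to average")
    mean = sum(present) / len(present)
    return [mean if is_missing(v) else v for v in column]


# Hypothetical column with two gaps.
water = [162.0, None, 228.0, float("nan"), 192.0]
filled = fill_missing_with_mean(water)
print(filled)
```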


### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [15]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0
from sklearn.linear_model import LinearRegression
X_train, X_val, y_train, y_val = train_test_split(x,y,random_state=0)
lin_reg = LinearRegression().fit(X_train, y_train)

### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [16]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
y_train_pred = lin_reg.predict(X_train)
y_pred = lin_reg.predict(X_val)

train_r2 = r2_score(y_train, y_train_pred)
train_mse = mean_squared_error(y_train, y_train_pred)
print(f'Train R2: {train_r2}')
print(f'Train MSE: {train_mse}')

val_r2 = r2_score(y_val, y_pred)
val_mse = mean_squared_error(y_val, y_pred)

print(f'Val R2: {val_r2}')
print(f'Val MSE: {val_mse}')

Train R2: 0.6108229424520553
Train MSE: 111.35843861132471
Val R2: 0.6234144623633329
Val MSE: 95.90413603680645
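
To make the two metrics concrete: MSE is the mean of the squared residuals, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the targets. A minimal standard-library sketch of both follows; the names `mse` and `r2` and the toy values are illustrative, and this is not scikit-learn's code:

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is measured around the target mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values chosen so the arithmetic is easy to follow by hand.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
toy_mse = mse(y_true, y_pred)
toy_r2 = r2(y_true, y_pred)
print(toy_mse, round(toy_r2, 4))

# A model that always predicts the mean of the targets scores an R2 of 0.
mean_only = [sum(y_true) / len(y_true)] * len(y_true)
baseline_r2 = r2(y_true, mean_only)
```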


### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [17]:
results = pd.DataFrame(columns=['Training Accuracy', 'Validation Accuracy'])
results.loc['MSE'] = [train_mse, val_mse]
results.loc['R2'] = [train_r2, val_r2]
print(results)

     Training Accuracy  Validation Accuracy
MSE         111.358439            95.904136
R2            0.610823             0.623414


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

<span style="color:red"> Based on the R2 scores, the linear model did not produce good results for this dataset. Only about 61-62% of the variability in the target is explained by the linear model (R2 of 0.611 on the training data and 0.623 on the validation data). This suggests that a more complex model is required to properly fit the data. </span>

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

<span style="color:red"> As in part 1, all of the code used in this example was written by me, with examples and syntax questions answered by looking through the course notes, as well as the SKlearn resources. </span>

<span style="color:red">Sequencing of this part of the exercise was slightly different. I had trouble developing a framework that would let me loop over multiple models and store their results. I ended up creating training and validation datasets for each model, building the three models separately, and analyzing the results. Once this was complete, I had a better understanding of the data structures necessary to build the loops. I therefore went back and refactored the entire process to split the data first and then use a single loop to fit the models and store the results. </span>

<span style="color:red">I did not use generative AI for any part of this assignment.</span>

<span style="color:red">The content of the exercises was straightforward and was completed easily based on the course notes and lab exercises. Most of my challenges still stem from the learning curve of pandas DataFrame syntax and slicing, but I am improving quickly. </span>

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


<span style="color:red">Overall Observations and Conclusions:</span>

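<span style="color:red">For classification:</span>

<span style="color:red">- Using more data for training generally results in a better-fitting model: the Full model reached a validation accuracy of 0.938, compared with 0.931 for the Small model.</span>

<span style="color:red">- Using more features generally results in a better-fitting model, as long as those features are relevant: the Partial model, restricted to a subset of the columns, reached a validation accuracy of only 0.612.</span>

<span style="color:red">For regression:</span>

<span style="color:red">- The R2 score is a useful indicator of model fit that can be used to compare different models.</span>

<span style="color:red">- MSE gives a general idea of how well the predictions fit the actual data, but it is expressed in squared units of the target, so on its own it is harder to interpret than R2.</span>

<span style="color:red">- Regularization can be used to alter the fit of a model, but it cannot make up for a poor choice of model (e.g. a linear model for a non-linear dataset).</span>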
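
One way to make the MSE values easier to interpret is to take their square root (RMSE), which is expressed in the same units as the strength target rather than in squared units. The sketch below simply re-uses the MSE values printed in Part 2:

```python
import math

# MSE values copied from the Part 2 output.
train_mse = 111.35843861132471
val_mse = 95.90413603680645

# RMSE is the typical size of a prediction error, in the target's own units.
train_rmse = math.sqrt(train_mse)
val_rmse = math.sqrt(val_mse)

print(f"Train RMSE: {train_rmse:.2f}")
print(f"Val RMSE:   {val_rmse:.2f}")
```

Read this way, a typical validation error is a little under 10 units of compressive strength.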
    



## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.

<span style="color:red"> In general, I enjoyed the assignment as an opportunity to go back and re-implement things that we already learned in the lab and in the class. I felt it was a good way to refresh and cement my understanding of linear models, and to get more practice with Pandas and SKLearn modules. Some of the Pandas syntax continues to be challenging, but I'm motivated by the fact that I am able to work through it and find solutions on my own. I found that some of the questions seemed a bit vague or repetitive - I found myself repeating some of the same insights as in previous sections. I really preferred the questions which were formulated as specific numbered points to talk about. </span>

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.
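
A logarithmic grid from 0.001 to 100 can be built from integer exponents, which avoids any rounding error that repeated multiplication by 10 might accumulate. This is a minimal sketch with an illustrative variable name; `numpy.logspace(-3, 2, 6)` would produce the same six values:

```python
# Exponents -3 through 2 give 0.001, 0.01, 0.1, 1, 10, 100.
alphas = [10.0 ** exponent for exponent in range(-3, 3)]

for alpha in alphas:
    print(alpha)
```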

In [18]:
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

#Lasso

X_train, X_val, y_train, y_val= train_test_split(x, y, random_state = 0)

a = 0.001
l_results = pd.DataFrame(columns = ['Training Accuracy (R2)', 'Validation Accuracy(R2)'])
l_results.index.name = 'Alpha'

while a <= 1000:
    lm = Lasso(alpha=a)
    lm.fit(X_train, y_train)
    y_train_pred = lm.predict(X_train)
    y_val_pred = lm.predict(X_val)
    l_results.loc[a] = [r2_score(y_train, y_train_pred), r2_score(y_val, y_val_pred)]
    a *= 10

print("Lasso Testing:")
print(l_results)

# Ridge
r_results = pd.DataFrame(columns=['Training Accuracy (R2)', 'Validation Accuracy (R2)'])
r_results.index.name = 'Alpha'
a = 0.001
while a<= 1000:
    rm = Ridge(alpha=a)
    rm.fit(X_train, y_train)
    y_train_pred = rm.predict(X_train)
    y_val_pred = rm.predict(X_val)
    r_results.loc[a] = [r2_score(y_train, y_train_pred), r2_score(y_val, y_val_pred)]
    a *= 10

print("\nRidge Testing:")
print(r_results)





Lasso Testing:
          Training Accuracy (R2)  Validation Accuracy(R2)
Alpha                                                    
0.001                   0.610823                 0.623416
0.010                   0.610823                 0.623429
0.100                   0.610821                 0.623562
1.000                   0.610609                 0.624669
10.000                  0.604314                 0.626774
100.000                 0.467576                 0.507413
1000.000                0.000000                -0.011576

Ridge Testing:
          Training Accuracy (R2)  Validation Accuracy (R2)
Alpha                                                     
0.001                   0.610823                  0.623414
0.010                   0.610823                  0.623414
0.100                   0.610823                  0.623415
1.000                   0.610823                  0.623415
10.000                  0.610823                  0.623418
100.000                 0.610823  

<span style="color:red"> Both methods reach essentially the same training R2 (0.6108) at small alphas. The highest validation R2 in the tables above is Lasso at alpha = 10 (0.6268), only marginally above Ridge (about 0.6234 for alphas from 0.001 to 10). None of these scores would be considered good enough to constitute a well-fit model. Scores that start low at small alphas and do not improve meaningfully with regularization are indicative of underfitting, and are a good indicator that a linear model may not be sufficient for this dataset. Lasso R2 scores fall sharply once alpha exceeds 10, likely due to over-regularization of the few features that correlate with the target; at alpha = 1000 the training R2 is 0, which is what a model that predicts only the mean would score. The Ridge model maintains consistent accuracy across the alphas shown, indicating that its penalty has little effect on the fit over this range, and it only begins to suffer from over-regularization once alpha climbs to near 100. </span>
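
The best alpha can be read off a results table by taking the row with the highest validation R2. The sketch below does this for the Lasso validation scores printed above (the Ridge table is cut off in the output, so it is left out); the variable names are hypothetical:

```python
# Lasso validation R2 by alpha, copied from the printed table.
lasso_val_r2 = {
    0.001: 0.623416,
    0.01: 0.623429,
    0.1: 0.623562,
    1.0: 0.624669,
    10.0: 0.626774,
    100.0: 0.507413,
    1000.0: -0.011576,
}

# Validation R2 of the unregularized linear model from Part 2.
linear_val_r2 = 0.6234144623633329

best_alpha = max(lasso_val_r2, key=lasso_val_r2.get)
improvement = lasso_val_r2[best_alpha] - linear_val_r2

print(f"Best Lasso alpha: {best_alpha}")
print(f"Gain over plain linear regression: {improvement:.4f}")
```

The gain is in the third decimal place. Because alpha is chosen on the same validation set that reports the score, even that small gain is likely to be slightly optimistic.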