# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: 

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [50]:
import numpy as np
import pandas as pd

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [51]:
# TO DO: Import spam dataset from yellowbrick library
from yellowbrick.datasets import load_spam
X,y = load_spam()

# TO DO: Print size and type of X and y
print('The size of X is:', X.size)
print('The size of y is:', y.size)
print('The type of X is:', type(X))
print('The type of y is:', type(y))

The size of X is: 262200
The size of y is: 4600
The type of X is: <class 'pandas.core.frame.DataFrame'>
The type of y is: <class 'pandas.core.series.Series'>
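
Note that `size` reports the total number of elements (4600 rows × 57 columns = 262200 for `X`), while `shape` reports rows and columns separately. A minimal sketch of the difference, using a hypothetical toy frame rather than the spam data:

```python
import pandas as pd

# Hypothetical toy data standing in for the feature matrix and target vector
toy_X = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
toy_y = pd.Series([0, 1, 0])

n_elements = toy_X.size    # total number of cells: rows * columns
dimensions = toy_X.shape   # (rows, columns)

print("size:", n_elements)
print("shape:", dimensions)
print("target shape:", toy_y.shape)
```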


### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [52]:
# TO DO: Check if there are any missing values and fill them in if necessary
X.isnull().sum()

word_freq_make                0
word_freq_address             0
word_freq_all                 0
word_freq_3d                  0
word_freq_our                 0
word_freq_over                0
word_freq_remove              0
word_freq_internet            0
word_freq_order               0
word_freq_mail                0
word_freq_receive             0
word_freq_will                0
word_freq_people              0
word_freq_report              0
word_freq_addresses           0
word_freq_free                0
word_freq_business            0
word_freq_email               0
word_freq_you                 0
word_freq_credit              0
word_freq_your                0
word_freq_font                0
word_freq_000                 0
word_freq_money               0
word_freq_hp                  0
word_freq_hpl                 0
word_freq_george              0
word_freq_650                 0
word_freq_lab                 0
word_freq_labs                0
word_freq_telnet              0
word_fre

In [53]:
y.isnull().sum()

0
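
Neither `X` nor `y` has missing values, so no filling is needed here. As a hedged sketch of what the check and one possible fill could look like when values are missing, the toy frame and the choice of median imputation below are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing entries (not the spam data)
toy = pd.DataFrame({
    "f1": [540.0, np.nan, 332.5],
    "f2": [162.0, 228.0, np.nan],
})

# Collapse the per-column counts into a single total
total_missing = int(toy.isnull().sum().sum())

# One possible strategy: replace each gap with its column median
filled = toy.fillna(toy.median())
remaining = int(filled.isnull().sum().sum())

print(total_missing, remaining)
```

Median imputation is only one option; the right choice depends on the feature and on how the values went missing.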

For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [54]:
# TO DO: Create X_small and y_small 
from sklearn.model_selection import train_test_split

X_small, X_large, y_small, y_large = train_test_split(X, y, train_size=0.05, random_state=0)
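
With `train_size=0.05`, the split keeps 230 of the 4600 rows. The sketch below reproduces that row count on hypothetical placeholder arrays; the `stratify` argument is an optional addition that is not used in the cell above, shown here because it keeps the class proportions of the small subset close to those of the full data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data with the same number of rows as the spam set
toy_X = np.arange(4600 * 2).reshape(4600, 2)
toy_y = np.arange(4600) % 2  # balanced 0/1 labels

# Keep 5% of the rows; stratify preserves the label balance (optional)
X_small, X_rest, y_small, y_rest = train_test_split(
    toy_X, toy_y, train_size=0.05, random_state=0, stratify=toy_y
)

print(X_small.shape, X_rest.shape, y_small.mean())
```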


### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [55]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

from sklearn.linear_model import LogisticRegression 

indices = ['Total Dataset', 'First 2 Columns', 'Small Dataset']

X_2Columns = X.iloc[:,[0,1]]

results = pd.DataFrame(index = indices, columns=['Data Size', 'Training Accuracy', 'Validation Accuracy'])

features = [X, X_2Columns, X_small]
target = [y, y, y_small]

for i in range(len(features)):
    X_train, X_test, y_train, y_test = train_test_split(features[i], target[i], random_state=0)
    model = LogisticRegression(random_state=0, max_iter=2000).fit(X_train, y_train)
    training_score = model.score(X_train, y_train) 
    validation_score = model.score(X_test, y_test)
    results.loc[indices[i]] = {'Data Size':features[i].shape, 'Training Accuracy':training_score, 'Validation Accuracy':validation_score}

results

| | Data Size | Training Accuracy | Validation Accuracy |
|---|---|---|---|
| Total Dataset | (4600, 57) | 0.928696 | 0.936522 |
| First 2 Columns | (4600, 2) | 0.608406 | 0.613043 |
| Small Dataset | (230, 57) | 0.936047 | 0.931034 |


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

*YOUR ANSWERS HERE*

1. The training and validation accuracies for the Total Dataset (0.929 training, 0.937 validation) and the Small Dataset (0.936 training, 0.931 validation) are similar, whereas the model trained on only the first 2 columns performed significantly worse (0.608 training, 0.613 validation). This indicates that the features in the remaining columns are important in predicting whether an email should be classified as spam. Although the model trained on the total dataset slightly outperformed the model trained on the small dataset in validation accuracy, the scores are close, which suggests that the small dataset is a good representation of the total dataset.


2. In this case, a false positive represents the model classifying a non-spam email as spam.  A false negative represents the model failing to identify a spam email as such.  A false positive is likely worse because it may prevent a user from receiving an important email because the model has labeled it as spam. 
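
The counts of false positives and false negatives can be read from a confusion matrix. The following is a minimal sketch with hypothetical labels, assuming 1 means spam and 0 means not spam:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels (assumption: 1 = spam, 0 = not spam)
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print("false positives (legitimate email flagged as spam):", fp)
print("false negatives (spam that reached the inbox):", fn)
```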

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

1. I used a combination of the class and lab examples for the logistic regression code, as well as Google searches for general Python syntax.

2. I completed the steps in order as they were laid out in the template. 

3. I did not use any generative AI for this assignment. 

4. I did not run into any challenges during the first part of this assignment.


## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [56]:
# TO DO: Import concrete dataset from yellowbrick library
from yellowbrick.datasets import load_concrete
X,y = load_concrete()

# TO DO: Print size and type of X and y
print('The size of X is:', X.size)
print('The size of y is:', y.size)
print('The type of X is:', type(X))
print('The type of y is:', type(y))

The size of X is: 8240
The size of y is: 1030
The type of X is: <class 'pandas.core.frame.DataFrame'>
The type of y is: <class 'pandas.core.series.Series'>


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [57]:
# TO DO: Check if there are any missing values and fill them in if necessary
X.isnull().sum()

cement    0
slag      0
ash       0
water     0
splast    0
coarse    0
fine      0
age       0
dtype: int64

In [58]:
y.isnull().sum()

0

### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [59]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0
from sklearn.linear_model import LinearRegression 
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_train, y_train)

### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [60]:
# TO DO: ADD YOUR CODE HERE
from sklearn.metrics import mean_squared_error

R2_Train = lr.score(X_train, y_train)
R2_Test = lr.score(X_test, y_test)

linear_predictions_train = lr.predict(X_train)
linear_predictions_test = lr.predict(X_test)

# mean_squared_error expects (y_true, y_pred); MSE is symmetric, so the values are unchanged
MSE_Train = mean_squared_error(y_train, linear_predictions_train)
MSE_Test = mean_squared_error(y_test, linear_predictions_test)
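
For reference, MSE is the mean of the squared residuals, and the R2 score is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. A small sketch on hypothetical numbers (not the concrete data) that computes both by hand and compares them with the sklearn functions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions, for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

# MSE: average squared residual
mse_manual = np.mean((y_true - y_pred) ** 2)

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Because MSE is in squared units of the target, it is easiest to interpret alongside R2, which is unitless.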

### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [61]:
# TO DO: ADD YOUR CODE HERE

results = pd.DataFrame(index = ['MSE','R2 score'], columns=['Training Accuracy', 'Validation Accuracy'])

results.loc['R2 score'] = {'Training Accuracy':R2_Train, 'Validation Accuracy':R2_Test}
results.loc['MSE'] = {'Training Accuracy':MSE_Train, 'Validation Accuracy':MSE_Test}

results

| | Training Accuracy | Validation Accuracy |
|---|---|---|
| MSE | 111.358439 | 95.904136 |
| R2 score | 0.610823 | 0.623414 |


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

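A linear model did not produce good results for this dataset. The training and validation R2 scores (0.611 and 0.623) are both low and close to each other, which indicates that the linear model is underfitting: it explains only about 61-62% of the variance in compressive strength.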


### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

1. I used the class examples/labs for the code in the 2nd part of the assignment. 
2. I completed the steps in the order laid out in the template. 
3. I did not use any generative AI for this assignment.
4. I did not have any challenges for this part of the assignment. 







## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*

A pattern that presented itself in the classification part of the assignment is that the model trained on the complete dataset produced the best validation score (0.937, compared with 0.931 for the small dataset and 0.613 for the first two columns). This was to be expected, as the other models, trained on subsets of the total dataset, excluded information that was helpful to the classification prediction.

Part 2 of the assignment highlighted the limitations of using a linear model, as the model was not able to accurately fit the dataset. Both the training and validation R2 scores for this portion (0.611 and 0.623) were below what would generally be considered useful.


## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

I enjoyed interpreting the results between different models and trying to understand why certain models would be a better choice for different scenarios.   



## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.

In [62]:
# TO DO: ADD YOUR CODE HERE
from sklearn.linear_model import Ridge, Lasso

ridge_model = Ridge(alpha=100)  
ridge_model.fit(X_train, y_train)
ridge_train_score = ridge_model.score(X_train, y_train)
ridge_test_score = ridge_model.score(X_test, y_test)

lasso_model = Lasso(alpha=10)  
lasso_model.fit(X_train, y_train)
lasso_train_score = lasso_model.score(X_train, y_train)
lasso_test_score = lasso_model.score(X_test, y_test)

results = pd.DataFrame(index = ['Ridge','Lasso'], columns=['Training Accuracy', 'Validation Accuracy'])
results.loc['Ridge'] = {'Training Accuracy':ridge_train_score, 'Validation Accuracy':ridge_test_score}
results.loc['Lasso'] = {'Training Accuracy':lasso_train_score, 'Validation Accuracy':lasso_test_score}
results

| | Training Accuracy | Validation Accuracy |
|---|---|---|
| Ridge | 0.610823 | 0.623453 |
| Lasso | 0.604314 | 0.626774 |


*ANSWER HERE*

The Lasso regression model with an alpha value of 10 produced the highest validation R2 score; however, this was still only ~0.627, which indicates it is still not a very good predictive model. A linear model does not appear to be a very good choice for this dataset. A more complex model is likely required.
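
The cell above tests a single alpha per model. A sweep over the full range from 0.001 to 100 on a logarithmic scale could be sketched as follows. The synthetic data from `make_regression` is an assumption so that the example runs on its own, so its scores will not match the concrete results; substituting the `X_train`, `X_test`, `y_train`, and `y_test` from Part 2 would apply the same loop to the concrete data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); replace with the concrete split from Part 2
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Six values evenly spaced on a log scale: 0.001, 0.01, 0.1, 1, 10, 100
alphas = np.logspace(-3, 2, 6)

rows = []
for name, Model in [("Ridge", Ridge), ("Lasso", Lasso)]:
    for alpha in alphas:
        model = Model(alpha=alpha, max_iter=10000).fit(X_tr, y_tr)
        rows.append({
            "Model": name,
            "alpha": alpha,
            "Training R2": model.score(X_tr, y_tr),
            "Validation R2": model.score(X_te, y_te),
        })

sweep = pd.DataFrame(rows)

# Row with the highest validation R2 across both models and all alphas
best = sweep.loc[sweep["Validation R2"].idxmax()]
print(sweep)
print(best)
```

Selecting alpha on the same split that is used to report the final score tends to give a slightly optimistic estimate, so a separate validation set or cross-validation would be a more careful design.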