# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: Romil Dhagat

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [78]:
import numpy as np
import pandas as pd
import yellowbrick
from yellowbrick.datasets import load_spam, load_concrete


### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [79]:
# TO DO: Import spam dataset from yellowbrick library
X,y = load_spam()
# TO DO: Print size and type of X and y
print("Type of X",type(X))
print("Shape of X",X.shape)
print("Size of X",X.size)
print("Type of y",type(y))
print("Shape of y",y.shape)
print("Size of y",y.size)

Type of X <class 'pandas.core.frame.DataFrame'>
Shape of X (4600, 57)
Size of X 262200
Type of y <class 'pandas.core.series.Series'>
Shape of y (4600,)
Size of y 4600


### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [80]:
# TO DO: Check if there are any missing values and fill them in if necessary
missing = np.isnan(X).sum().sum()
missing

0
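
The check above found no missing values, so nothing had to be filled in. As a minimal sketch of what the fill-in step could look like if gaps were present, the toy frame below (hypothetical column names and values, not the spam data) counts missing entries with `isna()` and fills each column with its median. Median imputation is only one reasonable choice and is an assumption here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing entries (NOT the spam data)
X_demo = pd.DataFrame({
    "word_freq_a": [0.1, np.nan, 0.3, 0.4],
    "word_freq_b": [1.0, 2.0, np.nan, 4.0],
})

# Count missing values per column, then in total
missing_per_column = X_demo.isna().sum()
missing_total = int(missing_per_column.sum())
print("Missing values:", missing_total)

# Fill each column's gaps with that column's median
X_filled = X_demo.fillna(X_demo.median())
print("Missing after fill:", int(X_filled.isna().sum().sum()))
```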

For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [81]:
import sklearn 
from sklearn.model_selection import train_test_split
# Create X_small and y_small 
X_waste,X_small,y_waste,y_small = train_test_split(X,y,test_size = 0.05, random_state=0)
X_small_train,X_small_test,y_small_train,y_small_test = train_test_split(X_small,y_small,random_state=0)
# Creating X and y test train split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=0)
# Creating the two column test train split
X_col = X.iloc[:,0:2]
X_col_train,X_col_test,y_col_train,y_col_test = train_test_split(X_col,y)

### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

### Step 3: Implementing Logistic Regression Model

In [82]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT
from sklearn.linear_model import LogisticRegression

model0 = LogisticRegression(max_iter=2000)
model1 = LogisticRegression(max_iter=2000)
model2 = LogisticRegression(max_iter=2000)

# Creating ML Logistic Regression prediction for X and y
model0.fit(X_train,y_train)
y_pred = model0.predict(X_test)
y_train_pred = model0.predict(X_train)

# Creating ML Logistic Regression prediction for X_small and y_small
model1.fit(X_small_train,y_small_train)
y_small_pred = model1.predict(X_small_test)
y_small_train_pred = model1.predict(X_small_train)

# Creating ML Logistic Regression prediction for X_col and y_col
model2.fit(X_col_train,y_col_train)
y_col_pred = model2.predict(X_col_test)
y_col_train_pred = model2.predict(X_col_train)



### Step 4: Validating dataset predictions 

In [83]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score

# Validating X and y dataset predictions
# X_y_scores = cross_validate(model0,X_train,y_train,cv=5,return_train_score=True)
# X_y_train_score = X_y_scores['train_score'].mean()
# X_y_val_score = X_y_scores['test_score'].mean()
X_y_train_acc_score = accuracy_score(y_train,y_train_pred)
X_y_val_acc_score = accuracy_score(y_test,y_pred)

# Validating X_small and y_small dataset predictions 
# X_y_small_scores = cross_validate(model1,X_small_train,y_small_train,cv=5,return_train_score=True)
# X_y_small_train_score = X_y_small_scores['train_score'].mean()
# X_y_small_val_score = X_y_small_scores['test_score'].mean()
X_y_small_train_acc_score = accuracy_score(y_small_train,y_small_train_pred)
X_y_small_val_acc_score = accuracy_score(y_small_test,y_small_pred)

# Validating the first two columns of X and y dataset predictions 
# col_scores = cross_validate(model2,X_col_train,y_col_train,cv=5,return_train_score=True)
# col_train_score = col_scores['train_score'].mean()
# col_val_score = col_scores['test_score'].mean()
col_train_acc_score = accuracy_score(y_col_train,y_col_train_pred)
col_val_acc_score = accuracy_score(y_col_test,y_col_pred)

print("\n Using Accuracy scores:\n")
print("The full data set (X and y) has the following results: ")
print("Training score: {:.3f}".format(X_y_train_acc_score))
print("Validation score: {:.3f}".format( X_y_val_acc_score))

print("\nThe small data set (X_small and y_small) has the following results: ")
print("Training score: {:.3f}".format(X_y_small_train_acc_score))
print("Validation score: {:.3f}".format(X_y_small_val_acc_score))


print("\nThe two column data set (X_col and y_col) has the following results: ")
print("Training score: {:.3f}".format(col_train_acc_score))
print("Validation score: {:.3f}".format(col_val_acc_score))


# print("\n Using Cross-Validation: (I initially used Cross-Validation until I saw that I wasn't supposed to, so disregard as needed)\n")
# print("The full data set (X and y) has the following cross-validation results: ")
# print("Training score: {:.3f}".format(X_y_train_score))
# print("Validation score: {:.3f}".format(X_y_val_score))

# print("\nThe small data set (X_small and y_small) has the following cross-validation results: ")
# print("Training score: {:.3f}".format(X_y_small_train_score))
# print("Validation score: {:.3f}".format(X_y_small_val_score))


# print("\nThe two column data set (X_col and y_col) has the following cross-validation results: ")
# print("Training score: {:.3f}".format(col_train_score))
# print("Validation score: {:.3f}".format(col_val_score))




 Using Accuracy scores:

The full data set (X and y) has the following results: 
Training score: 0.928
Validation score: 0.938

The small data set (X_small and y_small) has the following results: 
Training score: 0.965
Validation score: 0.793

The two column data set (X_col and y_col) has the following results: 
Training score: 0.619
Validation score: 0.595


### Step 5: Visual Results

In [84]:
### 1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
### 2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
### 3. Print `results`

results = pd.DataFrame(columns=['Data Size', 'Training Accuracy', 'Validation Accuracy'])

X_y_row = {'Data Size':X.size,'Training Accuracy':X_y_train_acc_score,'Validation Accuracy':X_y_val_acc_score}
X_y_small_row = {'Data Size':X_small.size,'Training Accuracy':X_y_small_train_acc_score,'Validation Accuracy':X_y_small_val_acc_score}
col_row = {'Data Size':X_col.size,'Training Accuracy':col_train_acc_score,'Validation Accuracy':col_val_acc_score}

results.loc[0] = X_y_row
results.loc[1] = X_y_small_row
results.loc[2] = col_row

print(results)


   Data Size  Training Accuracy  Validation Accuracy
0     262200           0.928406             0.938261
1      13110           0.965116             0.793103
2       9200           0.618841             0.594783
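
The hint in the cell for Steps 3-5 suggests using a loop to fill the `results` DataFrame. A minimal sketch of that idea is shown below. It runs on synthetic data from `make_classification` as a stand-in for the spam data (an assumption, so the numbers will not match the table above), and the names `X_demo`, `y_demo`, `datasets` and `results_demo` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (assumption: the real X, y come from load_spam())
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
X_demo = pd.DataFrame(X_demo)
y_demo = pd.Series(y_demo)

# 5% subset, taken from the "test" side of a split as in Step 2
_, X_demo_small, _, y_demo_small = train_test_split(
    X_demo, y_demo, test_size=0.05, random_state=0
)

datasets = {
    "full": (X_demo, y_demo),
    "first two columns": (X_demo.iloc[:, 0:2], y_demo),
    "5% subset": (X_demo_small, y_demo_small),
}

rows = []
for name, (X_d, y_d) in datasets.items():
    # Same hold-out split, model and metric for every dataset
    X_tr, X_va, y_tr, y_va = train_test_split(X_d, y_d, random_state=0)
    clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    rows.append({
        "Data size": X_d.size,
        "Training accuracy": clf.score(X_tr, y_tr),   # mean accuracy on training split
        "Validation accuracy": clf.score(X_va, y_va),  # mean accuracy on held-out split
    })

results_demo = pd.DataFrame(rows, index=list(datasets))
print(results_demo)
```

Collecting one dictionary per dataset and building the DataFrame once at the end avoids having a separate set of variable names for every dataset.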


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

*YOUR ANSWERS HERE*

1. 
The amount of data used has a large impact on both the accuracy and the gap between training and validation accuracy. As the results DataFrame shows, the full dataset gives a training accuracy of 92.8% and a validation accuracy of 93.8%, a gap of about 1 percentage point. This means the model has high accuracy and low variance; whatever error remains comes mostly from bias, so if anything the model leans toward underfitting rather than overfitting. When only 5% of the data is used, the training accuracy rises to 96.5% but the validation accuracy drops significantly to 79.3%, a gap of about 17.2 percentage points. The smaller dataset therefore produces high variance and low bias, which is an indication of overfitting. Finally, keeping only the first two columns significantly reduces both the training accuracy (61.9%) and the validation accuracy (59.5%). The gap is only about 2.4 percentage points, so the model retains low variance but has high bias.

2. 
A false positive is a non-spam email being categorized as spam, and a false negative is a spam email being categorized as non-spam. A false positive is much worse because you could miss an important email that was categorized as spam.
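
As a small illustration of these two error types, the sketch below counts them with `confusion_matrix` from sklearn. The labels are made up for the example, and the encoding (1 = spam, 0 = not spam) is an assumption.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, assuming 1 = spam and 0 = not spam
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0]

# With labels=[0, 1] the flattened matrix is ordered tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

print("False positives (real email sent to spam):", int(fp))
print("False negatives (spam reaching the inbox):", int(fn))
```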

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
2. In what order did you complete the steps?
3. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
4. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

1.
I mostly used the example Jupyter notebooks and googled whenever I got confused or lost. The websites I used the most when I was lost were geeksforgeeks.org and scikit-learn.org, as they have very good resources for learning.

2. 
I did everything in the order the steps were written: data input and processing, then training and validating the models, and finally visualizing the results.

3.
I did not use generative AI, as I don't learn much by using it.

4.
I was confused during the validation process, as I had assumed I was meant to use cross-validation; however, I soon learned from peers that this was wrong. I still did not understand why the training and validation scores for the two-column dataset were so much higher using cross-validation than with the method I ended up using.
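
When comparing cross-validation scores across datasets, it helps to keep each dataset's scores in its own entry so they cannot be mixed up. The sketch below does this for two feature sets with `cross_validate`. It uses synthetic data from `make_classification` (an assumption, not the spam data), and `feature_sets` and `cv_summary` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the spam data (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_demo = pd.DataFrame(X_demo)

feature_sets = {
    "all columns": X_demo,
    "first two columns": X_demo.iloc[:, 0:2],
}

cv_summary = {}
for name, X_d in feature_sets.items():
    # cross_validate returns a dict; 'test_score' holds the validation-fold scores
    scores = cross_validate(
        LogisticRegression(max_iter=2000), X_d, y_demo, cv=5, return_train_score=True
    )
    cv_summary[name] = {
        "train": scores["train_score"].mean(),
        "validation": scores["test_score"].mean(),
    }

print(pd.DataFrame(cv_summary).T)
```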

## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [85]:
# TO DO: Import concrete dataset from yellowbrick library

X,y = load_concrete()
# TO DO: Print size and type of X and y
print("Size of X",X.shape)
print("Type of X", type(X))
print("Size of y", y.shape)
print("Type of y", type(y))

Size of X (1030, 8)
Type of X <class 'pandas.core.frame.DataFrame'>
Size of y (1030,)
Type of y <class 'pandas.core.series.Series'>


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [86]:
# TO DO: Check if there are any missing values and fill them in if necessary
missing = np.isnan(X).sum().sum()
missing

0

### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [87]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0
from sklearn.linear_model import LinearRegression

X_train,X_test,y_train,y_test = train_test_split(X,y)

model = LinearRegression()
model.fit(X_train,y_train)
y_val_pred = model.predict(X_test)
y_train_pred = model.predict(X_train)

### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [88]:
# TO DO: ADD YOUR CODE HERE
from sklearn.metrics import r2_score, mean_squared_error

val_r2 = r2_score(y_test,y_val_pred)
val_mse = mean_squared_error(y_test,y_val_pred)

train_r2 = r2_score(y_train,y_train_pred)  
train_mse = mean_squared_error(y_train,y_train_pred)


print("The Training data's R^2 score is {:.3f}".format(train_r2))
print("The Training data's mean squared error is {:.3f}".format(train_mse))

print("The Validation data's R^2 score is {:.3f}".format(val_r2))
print("The Validation data's mean squared error is {:.3f}".format(val_mse))


The Training data's R^2 score is 0.594
The Training data's mean squared error is 110.945
The Validation data's R^2 score is 0.672
The Validation data's mean squared error is 96.631


### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [89]:
# TO DO: ADD YOUR CODE HERE
results = pd.DataFrame(columns=['Training Accuracy', 'Validation Accuracy'],index=['Mean Squared Error','R2 Score'])

results.at['Mean Squared Error','Training Accuracy'] = train_mse
results.at['R2 Score','Training Accuracy'] = train_r2
results.at['Mean Squared Error','Validation Accuracy'] = val_mse
results.at['R2 Score','Validation Accuracy'] = val_r2

print(results)

                   Training Accuracy Validation Accuracy
Mean Squared Error        110.945138           96.631335
R2 Score                    0.594377            0.671732


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

The linear model produced mediocre results. The model does not completely fail, but it does not perform well: the R2 score is only 0.594 on the training set and 0.672 on the validation set, with mean squared errors of 110.9 and 96.6. The gap between the training and validation R2 scores is about 0.08, which shows some bias and variance, but the gap is not too large. I would not say that this is a good result, mainly because of the low R2 scores.

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
2. In what order did you complete the steps?
3. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
4. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

1. 
I got my code and methodology from the lecture notes and the example Jupyter notebooks.

2.
I did the steps in the general order of data collection and processing, then model training and validation, and finally visualization.

3.
I did not use generative AI.

4.
I was not challenged by this part. However, I was confused by what was meant by getting the training and validation accuracy from R2 and MSE, as I thought there was an accuracy score that could be derived from them; there is not.
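
For reference, MSE and R2 are error and goodness-of-fit measures rather than accuracy scores. The sketch below computes both by hand with NumPy on a few made-up values (hypothetical numbers, not the concrete data), using MSE = mean((y - y_hat)^2) and R2 = 1 - SS_res / SS_tot.

```python
import numpy as np

# Hypothetical true and predicted values (NOT the concrete data)
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 8.0])

# Mean squared error: average squared residual
residuals = y_true_demo - y_pred_demo
mse_demo = np.mean(residuals ** 2)

# R2: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1.0 - ss_res / ss_tot

print("MSE:", mse_demo)
print("R2:", r2_demo)
```

MSE is in squared units of the target and has no upper bound, while R2 is at most 1 and can be negative for a model that fits worse than predicting the mean.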

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*

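The pattern I noticed is that more data results in better validation accuracy and lower variance: the full spam dataset reached 93.8% validation accuracy with a gap of about 1 percentage point from the training accuracy, which leaves bias as the main source of error. When there is less data there is higher variance: with 5% of the data the training accuracy was 96.5% but the validation accuracy was only 79.3%. When there are fewer features (columns) the accuracy is lower: with only the first two columns the training and validation accuracies fell to 61.9% and 59.5%.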

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

I really liked learning how to do proper validation for binary classification models; however, this led to some confusion about how to do that validation. Having so many variables makes the code harder to track, especially since I committed to a confusing-looking naming convention. I found the linear regression part motivating because I realized I could do it easily compared to the binary classification.

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.

### Ridge and Lasso Regression

In [97]:
# TO DO: ADD YOUR CODE HERE
from sklearn.linear_model import Ridge, Lasso

X,y = load_concrete()

X_train,X_test,y_train,y_test = train_test_split(X,y)

ridge_model = Ridge(alpha=0.001)
ridge_model.fit(X_train,y_train)
ridge_val_pred = ridge_model.predict(X_test)
ridge_train_pred = ridge_model.predict(X_train)

lasso_model = Lasso(alpha=0.001)
lasso_model.fit(X_train,y_train)
lasso_val_pred = lasso_model.predict(X_test)
lasso_train_pred = lasso_model.predict(X_train)

ridge_val_r2 = r2_score(y_test,ridge_val_pred)
ridge_val_mse = mean_squared_error(y_test,ridge_val_pred)
ridge_train_r2 = r2_score(y_train,ridge_train_pred)  
ridge_train_mse = mean_squared_error(y_train,ridge_train_pred)

lasso_val_r2 = r2_score(y_test,lasso_val_pred)
lasso_val_mse = mean_squared_error(y_test,lasso_val_pred)
lasso_train_r2 = r2_score(y_train,lasso_train_pred)  
lasso_train_mse = mean_squared_error(y_train,lasso_train_pred)

ridgeResults = pd.DataFrame(columns=['Training Accuracy', 'Validation Accuracy'],index=['Mean Squared Error','R2 Score'])

ridgeResults.at['Mean Squared Error','Training Accuracy'] = ridge_train_mse
ridgeResults.at['R2 Score','Training Accuracy'] = ridge_train_r2
ridgeResults.at['Mean Squared Error','Validation Accuracy'] = ridge_val_mse
ridgeResults.at['R2 Score','Validation Accuracy'] = ridge_val_r2

lassoResults = pd.DataFrame(columns=['Training Accuracy', 'Validation Accuracy'],index=['Mean Squared Error','R2 Score'])

lassoResults.at['Mean Squared Error','Training Accuracy'] = lasso_train_mse
lassoResults.at['R2 Score','Training Accuracy'] = lasso_train_r2
lassoResults.at['Mean Squared Error','Validation Accuracy'] = lasso_val_mse
lassoResults.at['R2 Score','Validation Accuracy'] = lasso_val_r2

print("Lasso accuracy results:")
print(lassoResults)
print("\nRidge accuracy results:")
print(ridgeResults)



Lasso accuracy results:
                   Training Accuracy Validation Accuracy
Mean Squared Error        105.381711          114.753285
R2 Score                    0.635541            0.537013

Ridge accuracy results:
                   Training Accuracy Validation Accuracy
Mean Squared Error        105.381711          114.752904
R2 Score                    0.635541            0.537015


*ANSWER HERE*

Alpha in both Lasso and Ridge determines how strongly the L1 or L2 regularization affects the model. Lasso uses L1 regularization: if alpha is zero the model is the same as Linear Regression, and a higher alpha applies the L1 penalty more aggressively, until all the coefficients become exactly zero. Ridge Regression uses L2 regularization: when alpha is zero the model is again the same as Linear Regression, and a higher alpha applies the L2 penalty more aggressively, shrinking the coefficients toward zero without making them exactly zero. In the run above only alpha = 0.001 was tested, and Ridge and Lasso gave nearly identical validation R^2 scores (0.537015 and 0.537013).
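
A minimal sketch of how the full range of alpha values from 0.001 to 100 could be tested on a logarithmic scale is shown below. It uses synthetic data from `make_regression` as a stand-in for the concrete data (an assumption, so the scores say nothing about the real dataset), and names such as `sweep` and `best` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the concrete data (assumption)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)

# Six values on a log scale: 0.001, 0.01, 0.1, 1, 10, 100
alphas = np.logspace(-3, 2, 6)

rows = []
for model_class in (Ridge, Lasso):
    for alpha in alphas:
        model = model_class(alpha=alpha, max_iter=10000)
        model.fit(X_tr, y_tr)
        rows.append({
            "Model": model_class.__name__,
            "alpha": alpha,
            "Training R2": r2_score(y_tr, model.predict(X_tr)),
            "Validation R2": r2_score(y_va, model.predict(X_va)),
        })

sweep = pd.DataFrame(rows)
print(sweep)

# Pick the row with the highest validation R2
best = sweep.loc[sweep["Validation R2"].idxmax()]
print("Best:", best["Model"], "with alpha =", best["alpha"])
```

Selecting on the validation score rather than the training score matters here, because the training R2 can only get worse as alpha grows.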