# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: Sieu Eric Diep

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [45]:
import numpy as np
import pandas as pd

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [46]:
# TO DO: Import spam dataset from yellowbrick library

from yellowbrick.datasets.loaders import load_spam
(X,y) = load_spam()

# TO DO: Print size and type of X and y

print("Size of X: ", X.shape)
print("Type of X: ")
print(X.dtypes)
type(X)
print("");

print("Size of y: ", y.shape)
print("Type of y: ", y.dtype)
type(y)

Size of X:  (4600, 57)
Type of X: 
word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               float64
word_freq_hp                  float64
word_freq_hpl  

pandas.core.series.Series

### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [47]:
# TO DO: Check if there are any missing values and fill them in if necessary
print("# of missing values in: ")
print(" X: ", X.isnull().sum())
print(" y: ", y.isnull().sum())

# of missing values in: 
 X:  word_freq_make                0
word_freq_address             0
word_freq_all                 0
word_freq_3d                  0
word_freq_our                 0
word_freq_over                0
word_freq_remove              0
word_freq_internet            0
word_freq_order               0
word_freq_mail                0
word_freq_receive             0
word_freq_will                0
word_freq_people              0
word_freq_report              0
word_freq_addresses           0
word_freq_free                0
word_freq_business            0
word_freq_email               0
word_freq_you                 0
word_freq_credit              0
word_freq_your                0
word_freq_font                0
word_freq_000                 0
word_freq_money               0
word_freq_hp                  0
word_freq_hpl                 0
word_freq_george              0
word_freq_650                 0
word_freq_lab                 0
word_freq_labs                0
word_freq_

For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [48]:
# TO DO: Create X_small and y_small 

from sklearn.model_selection import train_test_split

X_small, X_small_test, y_small, y_small_test = train_test_split(X,y,train_size = 0.05, test_size = 0.95, random_state = 0)
print(X_small.shape)
print(X_small_test.shape)
print(y_small.shape)
print(y_small_test.shape)


(230, 57)
(4370, 57)
(230,)
(4370,)


### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [49]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
model = LogisticRegression(max_iter = 2000)

#extracting the first two columns of X
X_2col = X.iloc[:,0:2]
#print(X_2col.head(5))

#putting all the X and y into a list 
X_list = [X,X_2col,X_small]
y_list = [y,y,y_small]

#create an empty dataframe with 3 columns: Data size, training accuracy, validation accuracy
results = pd.DataFrame(columns=["Data size", "Training Accuracy", "Validation Accuracy"])

i = 0

for xdata, ydata in zip(X_list, y_list):
    results.loc[i,"Data size"]= xdata.shape
    X_train, X_test, y_train, y_test = train_test_split(xdata, ydata, random_state=0)
    lr = model.fit(X_train,y_train)
    results.loc[i,"Training Accuracy"] = lr.score(X_train, y_train)
    y_pred = model.predict(X_test)
    results.loc[i,"Validation Accuracy"]= accuracy_score(y_test, y_pred)
    i += 1

print(results)

    Data size Training Accuracy Validation Accuracy
0  (4600, 57)          0.928696            0.938261
1   (4600, 2)          0.608406            0.613043
2   (230, 57)          0.936047            0.931034


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

*YOUR ANSWERS HERE*
1. In this dataset, the sample numbers (number of rows in the dataset) doesn't have a big impact on the accuracy score. The accuracy scores gained from 4600 rows of data is very similar (within 0.01 difference) to the scores gained from 5% of the origial dataset (230 rows). The models obtain from both of these dataset perform quite well, with the scores above 0.9, very close to 1. 

However, the number of features has a big impact on the accuracy scores. The model performs far worse with few features. The dataset with two features only acheives a score of 0.6 while a 57 features dataset acheives a score above 0.9. The model is underfit with less features provided, which oversimplifies the model.

2. A true positive means is_spam = 1, in other words, it is spam. So false positive represents not_spam and false negative represents spam. The false negative means an important email would be classified as spam, and could be lost in the spam folder. Lost of information can create real damages. A false positive means a spam email would be classified as not_spam. This may cause some inconvinient, but not significant damage (assumming the email reader would excercises proper caution and measurement regarding cybersecurity). So a false negative is worse in this case.

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*
1. First, I reviewed all the lecture notes to reinforce my understanding of the materials, then I reviewed the codes in the examples provided in the lectures. I find most of the codes in the jupiter notebook on D2L. I also checked the scikit website to read about how to load the dataset and how to use the train_test_split function
https://www.scikit-yb.org/en/latest/api/datasets/index.html
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

2. I complete the steps according to the order of the instruction. I also checked and mirrored the order in the ML_ex_solution Jupyter notebook. I recognize that the order that I follow is actually the five steps in the Machine Learning Workflow stated in the lecture. 

3. I didn't use genertive AI to complete the assignment.

4. I found that the lecture notes are quite comprehensive and straight forward. The concept is easy to understand, however the coding part is a little overwhelming although there's not a lot of them. The learning curve is a little steep that I went through the jupiter notebook many times to locate the correct codes that I need. I also looked up the codes in the user guide to understand each parameters.

## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [50]:
# TO DO: Import spam dataset from yellowbrick library
import yellowbrick
X,y = yellowbrick.datasets.loaders.load_concrete(data_home=None, return_dataset=False)
# TO DO: Print size and type of X and y
print("Size of X: ", X.shape)
print("Type of X: ")
print(X.dtypes)

print("")
print("Size of y: ", y.shape)
print("Type of y: ", y.dtype)

print(type(X))
print(type(y))


Size of X:  (1030, 8)
Type of X: 
cement    float64
slag      float64
ash       float64
water     float64
splast    float64
coarse    float64
fine      float64
age         int64
dtype: object

Size of y:  (1030,)
Type of y:  float64
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [51]:
# TO DO: Check if there are any missing values and fill them in if necessary
print("Total # missing values in")
print("X: ", X.isnull().sum())
print("y: ", y.isnull().sum())

Total # missing values in
X:  cement    0
slag      0
ash       0
water     0
splast    0
coarse    0
fine      0
age       0
dtype: int64
y:  0


### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [52]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state = 0)
lr = model.fit(X_train, y_train)

print("linear regression coefficient: ", lr.coef_)
print("intercept: ", lr.intercept_)

linear regression coefficient:  [ 0.12185954  0.11060501  0.0953879  -0.1419938   0.31529263  0.02485841
  0.02486899  0.11270849]
intercept:  -36.541098199910955


### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [53]:
# TO DO: ADD YOUR CODE HERE
training_score = lr.score(X_train, y_train)
val_score = lr.score(X_test, y_test)

### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [54]:
# TO DO: ADD YOUR CODE HERE
from sklearn.metrics import mean_squared_error as MSE

results = pd.DataFrame(columns=["Training Accuracy", "Validation Accuracy"], index=["MSE", "R2 Score"])
results.iloc[0,0] = MSE(y_train,model.predict(X_train))
results.iloc[0,1] = MSE(y_test, model.predict(X_test))
results.iloc[1,:] = (training_score, val_score)
print(results)


         Training Accuracy Validation Accuracy
MSE             111.358439           95.904136
R2 Score          0.610823            0.623414


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?
#
For this dataset, linear model doesn't produce a good result because the R2 score is only 0.6, which is far away (40% off) from 1. Although the variance here is quite low, but we want to achieve an R2 score as close to 1 as possible. There are non-linear feature in the dataset, that a linear model does not do well in predicting the result in this case.

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?


After completing part 1 of the assigment, I found this part is a lot easier. I can found the code in the lectures and the examples provided in D2L. I didn't need to use the internet nor AI to source the code. 

I follow the steps exactly as per the assignment instructions. 

I don't find any challenges in this part because I have done enough study and practice in part 1 of the assignment. Practicing to get familiar with the code help a lot, and while doing the problems, the concept becomes clearer to me.



## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


In the lectures, it was mentioned that more data does not neccessary result in better model. It is confirmed in the part 1 that 5% of the dataset (about 230 rows) obtain more or less the same accuracy as a full dataset (over 4000 rows). This means that a well-diversed small sample can represent the model quite well. So starting with a small sample and increasing the datasize as we go would be a good practice. 

However, the accuracy score seems to be better, the more features we have in the dataset. A dataset with only 2 features performs significant worse than a dataset with 57 features. Noting that different features may have different weight on the performance, some may be a lot more important than the other. But missing too many features (57 vs 2) seems too have a very bad impact on the overall score. 


## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challangeing, motivating
while working on this assignment.


#My thought:
It was quite confusing at first trying to understand the concept of underfitting/overfitting fully. But after carefully reading the materials, it's not bad and becomes interesting. I like the fact that the assignment gives me enough practice to reinforce the understanding of the materials.

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logorithmic scale.

In [55]:
#Ridge
from sklearn.linear_model import Ridge
training_score = []
validation_score = []
alpha = [0.001, 0.01, 0.1, 1, 10, 100]
i = 0

print("Ridge\nalpha training_score validation_score")

for a in alpha :
    ridge = Ridge(a).fit(X_train, y_train)
    training_score.append(ridge.score(X_train, y_train))
    validation_score.append(ridge.score(X_test, y_test))
    print("{} \t {:0.6f} \t {:0.6f}".format(a, training_score[i], validation_score[i]))
    i +=1

#Lasso
from sklearn.linear_model import Lasso

training_score_Lasso = []
val_score_Lasso = []

i = 0
print("\nLasso\nalpha training_score validation_score")
for a in alpha:
    lasso= Lasso(a, max_iter=100000).fit(X_train, y_train)
    training_score_Lasso.append(lasso.score(X_train, y_train))
    val_score_Lasso.append(lasso.score(X_train, y_train))
    print("{} \t {:0.6f} \t {:0.6f}".format(a, training_score_Lasso[i], val_score_Lasso[i]))
    i +=1



Ridge
alpha training_score validation_score
0.001 	 0.610823 	 0.623414
0.01 	 0.610823 	 0.623414
0.1 	 0.610823 	 0.623415
1 	 0.610823 	 0.623415
10 	 0.610823 	 0.623418
100 	 0.610823 	 0.623453

Lasso
alpha training_score validation_score
0.001 	 0.610823 	 0.610823
0.01 	 0.610823 	 0.610823
0.1 	 0.610821 	 0.610821
1 	 0.610609 	 0.610609
10 	 0.604314 	 0.604314
100 	 0.467576 	 0.467576



*ANSWER HERE*
The accuracy scores didn't improve much with both the Lasso and Ridge method. They remain more or less the same 
around 0.6 even with different value of alpha. In Ridge, the training score doesn't change regardless of alpha, while the 
validation score improves very insignificantly when increase alpha. In Lasso, the scores remain more or less the same
when alpha is between 0.001, 0.01 and 0.1. As alpha is increasing over 1, the scores decreases quite significantly (over 20%)
from 0.61 (alpha = 1) to 0.47 (alpha = 100).

The Ridge model seems to be a better model with a higher R score. But this score is still not good enough as it is still 
about 40% off from the max value of 1. Again, this means that the dataset is not linear in nature that a linear model simply 
is not suitable for this dataset.