# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: Moaz Barakat

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [6]:
import numpy as np
import pandas as pd
from yellowbrick.datasets import load_spam
from sklearn.model_selection import train_test_split


### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [34]:
# TO DO: Import spam dataset from yellowbrick library
# Note: regular import did not work due to a bug with yellowbrick. I downloaded the dataset (loadspam) and used the relative path
X, y = load_spam(data_home="Lab2/spambase.data")
# TO DO: Print size and type of X and y
print("X Size: ", X.size, ", Shape: ", X.shape)
print("y Size: ", y.size, ", Shape: ", y.shape)
print(X.dtypes)
print(y.dtypes)

X Size:  262200 , Shape:  (4600, 57)
y Size:  4600 , Shape:  (4600,)
word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               float64
word_freq_hp       

### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [35]:
# TO DO: Check if there are any missing values and fill them in if necessary
nullsX = X.isnull().sum().sort_values(ascending=False)
nullsy = y.isnull().sum() # Note: sum() must be called to get the count of nulls
print(nullsX) #Note: no nulls confirmed
print(nullsy) #Note: no nulls confirmed

word_freq_make                0
word_freq_labs                0
word_freq_857                 0
word_freq_data                0
word_freq_415                 0
word_freq_85                  0
word_freq_technology          0
word_freq_1999                0
word_freq_parts               0
word_freq_pm                  0
word_freq_direct              0
word_freq_cs                  0
word_freq_meeting             0
word_freq_original            0
word_freq_project             0
word_freq_re                  0
word_freq_edu                 0
word_freq_table               0
word_freq_conference          0
char_freq_;                   0
char_freq_(                   0
char_freq_[                   0
char_freq_!                   0
char_freq_$                   0
char_freq_#                   0
capital_run_length_average    0
capital_run_length_longest    0
word_freq_telnet              0
word_freq_lab                 0
word_freq_address             0
word_freq_650                 0
word_fre

For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [36]:
# TO DO: Create X_small and y_small 
X_small, X_test, y_small, y_test = train_test_split(X, y, train_size=0.05) #note test sets will be used in step 4
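
A reproducible variant of the 5% split can be sketched as follows. This is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the spam features (an assumption, not the assignment data), fixes `random_state=0` as the notebook comments suggest, and adds `stratify`, which is optional and keeps the class proportions in the small sample. All `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (assumption: 1000 rows, 10 features)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)

# Keep 5% as the "small" training set; the remaining 95% is held out
X_small_demo, X_rest_demo, y_small_demo, y_rest_demo = train_test_split(
    X_demo, y_demo, train_size=0.05, stratify=y_demo, random_state=0
)

print(X_small_demo.shape, X_rest_demo.shape)
```

Without a fixed `random_state`, each run draws a different 5%, so the accuracies reported for `X_small` would change from run to run.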


### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

In [37]:
# Importing LogisticRegression from sklearn
from sklearn.linear_model import LogisticRegression
# Instantiating the model with max_iter = 2000
#3.1
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(X, y, test_size=0.2)
logreg_Xy = LogisticRegression(max_iter=2000).fit(X_train_full, y_train_full)

#3.2, only first two columns
x_sample = X.loc[:, [X.columns[0], X.columns[1]]]
X_train_sample, X_test_sample, y_train_sample, y_test_sample = train_test_split(x_sample, y, test_size=0.2)
logreg_Xysample = LogisticRegression(max_iter=2000).fit(X_train_sample, y_train_sample)

#3.3
logreg_Xysmall = LogisticRegression(max_iter=2000).fit(X_small, y_small)


### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

In [38]:
from sklearn.model_selection import cross_validate

def print_accuracy_validation(name, model, X_train, y_train, X_test, y_test) : #Summarizes the different scores
    print("\nScore for", name)

    scores = cross_validate(model, X_train, y_train, cv=5, 
                        scoring='accuracy',
                       return_train_score=True)
    print('training accuracy (all data) {:.3f}'.format(model.score(X_train, y_train)))
    print('training accuracy (cross-validation) {:.3f}'.format(scores['train_score'].mean()))
    print('validation accuracy (cross-validation) {:.3f}'.format(scores['test_score'].mean()))
    print('test accuracy (new data) {:.3f}'.format(model.score(X_test, y_test)))
    return model.score(X_train, y_train), model.score(X_test, y_test) #Note: return values used in step 5


ta_1, va_1 = print_accuracy_validation("X and y", model=logreg_Xy, X_train=X_train_full, y_train=y_train_full, X_test=X_test_full, y_test=y_test_full)
ta_2, va_2 = print_accuracy_validation("X and y (first two columns)", model=logreg_Xysample, X_train=X_train_sample, y_train=y_train_sample, X_test=X_test_sample, y_test=y_test_sample)
ta_3, va_3 = print_accuracy_validation("X_small and y_small", model=logreg_Xysmall, X_train=X_small, y_train=y_small, X_test=X_test, y_test=y_test)



Score for X and y
training accuracy (all data) 0.928
training accuracy (cross-validation) 0.930
validation accuracy (cross-validation) 0.922
test accuracy (new data) 0.922

Score for X and y (first two columns)
training accuracy (all data) 0.616
training accuracy (cross-validation) 0.618
validation accuracy (cross-validation) 0.618
test accuracy (new data) 0.623

Score for X_small and y_small
training accuracy (all data) 0.935
training accuracy (cross-validation) 0.946
validation accuracy (cross-validation) 0.878
test accuracy (new data) 0.914
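
One detail worth keeping in mind when reading these numbers: `cross_validate` fits fresh clones of the estimator on each fold, so the cross-validation rows come from separately fitted models, while the "all data" and "new data" rows come from the model fitted in Step 3. A minimal sketch of this, using synthetic data as a stand-in (an assumption) and hypothetical names `clf` and `cv_scores`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the spam data (assumption: 300 rows, 10 features)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

clf = LogisticRegression(max_iter=2000)  # deliberately left unfitted

# Five folds: each fold fits a clone of clf on 4/5 of the data
cv_scores = cross_validate(
    clf, X_demo, y_demo, cv=5, scoring="accuracy", return_train_score=True
)

print(f"mean training accuracy:   {cv_scores['train_score'].mean():.3f}")
print(f"mean validation accuracy: {cv_scores['test_score'].mean():.3f}")
print("clf fitted by cross_validate:", hasattr(clf, "coef_"))
```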


### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [40]:
# Note: Steps 3-4 are above
# Note: for any random state parameters, you can use random_state = 0
#5.1
results = pd.DataFrame(columns=["Data size","Training Accuracy", "Validation Accuracy"])

#5.2, HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT
for size, ta, va in zip([X_train_full.size, X_train_sample.size, X_small.size],[ta_1, ta_2, ta_3],[va_1, va_2, va_3]):
     results.loc[len(results)] = [size,ta,va]

results.sort_values("Data size", inplace = True)
results


| | Data size | Training Accuracy | Validation Accuracy |
|---|---|---|---|
| 1 | 7360.0 | 0.616033 | 0.622826 |
| 2 | 13110.0 | 0.934783 | 0.913501 |
| 0 | 209760.0 | 0.928261 | 0.921739 |


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

*YOUR ANSWERS HERE*
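1. As shown in the table above, the validation accuracy increases with the data size: ~0.62 at 7360 (first two columns only), ~0.91 at 13110 (`X_small`), and ~0.92 at 209760 (full `X`).
The training accuracy rises from 0.616 to 0.935 between the two smallest datasets and is slightly lower (0.928) on the full dataset, where the gap between training and validation accuracy is also smaller (~0.007 versus ~0.021 for `X_small`). Note that "Data size" here is `X.size` (rows × columns), so the smallest entry has fewer features rather than fewer emails.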
2. This is a binary classification, hence we have two classes. *positive* class and *negative* class.

In the spam dataset, we are trying to identify emails that are spam. 
- The *positive* class would be 'Spam'
- The *negative* class would be 'Non-spam'.

**False positive**: We falsely label a sample as *positive* when in reality it is *negative*. 
- We say that an email is spam when in reality it is not.

**False negative**: We falsely label a sample as *negative* when in reality it is *positive*. 
- We fail to identify a spam email as spam.

Which one is worse typically depends on the application and priorities. For me, a *false positive* is much worse than a *false negative* in this case: marking a legitimate email as spam means the user may never see it (e.g. what if it is an important email?), whereas a missed spam email still reaches the inbox, where the user can simply delete it.
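
To make the two error types concrete, the sketch below counts them on a small set of made-up labels (hypothetical values, not taken from the spam dataset), with 1 = spam (positive) and 0 = non-spam (negative). Precision and recall are included because they are the standard metrics that penalize false positives and false negatives respectively.

```python
# Hypothetical labels: 1 = spam (positive), 0 = non-spam (negative)
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 1]

pairs = list(zip(y_true, y_pred))
true_positives = sum(1 for t, p in pairs if t == 1 and p == 1)
false_positives = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate email flagged as spam
false_negatives = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that reached the inbox

# Precision drops with false positives, recall drops with false negatives
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

print(f"FP={false_positives}, FN={false_negatives}")
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

If false positives are considered the worse error, precision is the metric to watch alongside accuracy.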

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

1. Where did you source your code?
 - Code is based on a combination of lecture examples and documentation from scikit-learn
  - Lecture examples adopted from: <cite>Introduction to Machine Learning with Python, Müller and Guido, 1st ed, 2016 https://github.com/amueller/introduction_to_ml_with_python</cite>
  - Scikit learn general documentation available at: <cite> https://scikit-learn.org/stable/ </cite>
  - Spam dataset: <cite> https://www.scikit-yb.org/en/latest/api/datasets/spam.html </cite>
2. In what order did you complete the steps?
  - The steps were completed in the order given in the assignment (steps 1-5), as that seemed the logical order
3. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
  - Similar to assignment 1, I explicitly avoided using any generative AI throughout this process so I can learn while doing the assignment
4. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?
  - Yes, I did have challenges. One of them was dealing with the yellowbrick bug that prevented me from loading the spam dataset normally. I overcame this by downloading the dataset and loading it from a relative path. Otherwise, everything was straightforward, and the examples on D2L really helped me a lot. 

## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [50]:
# TO DO: Import concrete dataset from yellowbrick library
from yellowbrick.datasets import load_concrete
X, y = load_concrete() #Note: concrete dataset worked, spam dataset did not work and required manually downloading
# TO DO: Print size and type of X and y
print("X Size: ", X.size, ", Shape: ", X.shape)
print("y Size: ", y.size, ", Shape: ", y.shape)
print(X.dtypes)
print(y.dtypes)

X Size:  8240 , Shape:  (1030, 8)
y Size:  1030 , Shape:  (1030,)
cement    float64
slag      float64
ash       float64
water     float64
splast    float64
coarse    float64
fine      float64
age         int64
dtype: object
float64


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [51]:
# TO DO: Check if there are any missing values and fill them in if necessary
nullsX = X.isnull().sum().sort_values(ascending=False)
nullsy = y.isnull().sum() # Note: sum() must be called to get the count of nulls
print(nullsX) #Note: no nulls confirmed
print(nullsy) #Note: no nulls confirmed

cement    0
slag      0
ash       0
water     0
splast    0
coarse    0
fine      0
age       0
dtype: int64
0


### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [52]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0
# Importing LinearRegression from sklearn
from sklearn.linear_model import LinearRegression
# 3.1, instantiate the model
model = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0) # Note: data split into train/test sets
# 3.2, implement the machine learning model
model.fit(X_train, y_train)


### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [10]:
# TO DO: ADD YOUR CODE HERE
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)
rt = r2_score(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
mae = mean_absolute_error(y_test, predictions)

print("Linear Regression:")
print(f"Mean Squared Error: {mse:.3f}")
print(f"R2 Score: {rt:.3f}")
print(f"Root Mean Squared Error: {rmse:.3f}")
print(f"Mean Absolute Error: {mae:.3f}")

Linear Regression:
Mean Squared Error: 95.635
R2 Score: 0.637
Root Mean Squared Error: 9.779
Mean Absolute Error: 7.865


### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [29]:
# TO DO: ADD YOUR CODE HERE
#5.1
results = pd.DataFrame(columns=["Training Accuracy", "Validation Accuracy"], index = ['MSE', 'R2'])

#5.2
predictions_trn = model.predict(X_train)
predictions_tst = model.predict(X_test)

for pred, act, i in zip([predictions_trn,predictions_tst], [y_train, y_test], [0,1]):
     mse_r = mean_squared_error(act, pred)
     rt_r = r2_score(act, pred)
     results[results.columns[i]] = [round(mse_r,3), round(rt_r,3)]

#5.3
results

| | Training Accuracy | Validation Accuracy |
|---|---|---|
| MSE | 110.346 | 95.635 |
| R2 | 0.609 | 0.637 |


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?
 - In my opinion, the model can be further improved. Take for example the R2 value of 0.64: the remaining 36% of the variability is unaccounted for, which is not very promising. This makes sense, since compressive strength is complex and nonlinear in nature. The training and validation scores are similar (low variance) but both are far from the ideal value (high bias), so it looks like the model is underfitting the data. 
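
The "36% unaccounted for" reading follows from the definition R2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the targets from their mean. The sketch below computes MSE and R2 from these definitions. The four target/prediction pairs are made-up illustration values, not concrete-strength data, and the helper names are hypothetical.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_manual(y_true, y_pred):
    """1 - SS_res / SS_tot: the share of target variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

mse_value = mean_squared_error_manual(y_true, y_pred)
r2_value = r2_score_manual(y_true, y_pred)
print(f"MSE={mse_value:.3f}, R2={r2_value:.3f}, unexplained share={1 - r2_value:.3f}")
```

Applied to the notebook's result, an R2 of 0.637 means SS_res / SS_tot is about 0.363 on the test set.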

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

1. Where did you source your code?
 - Code is based on a combination of lecture examples and documentation from scikit-learn
  - Lecture slides adopted from: <cite>Introduction to Machine Learning with Python, Müller and Guido, 1st ed, 2016 https://github.com/amueller/introduction_to_ml_with_python</cite>
  - Scikit learn general documentation available at: <cite> https://scikit-learn.org/stable/ </cite>
  - Concrete dataset: <cite> https://www.scikit-yb.org/en/latest/api/datasets/concrete.html </cite>
2. In what order did you complete the steps?
  - The steps were completed in the order given in the assignment (steps 1-5), as that seemed the logical order
3. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
  - Similar to assignment 1, I explicitly avoided using any generative AI throughout this process so I can learn while doing the assignment
4. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?
  - No major challenges, everything was straightforward and the examples on D2L really helped me a lot. 

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.

*ADD YOUR FINDINGS HERE*

General:
The ML process for both classification and regression is very similar and follows the machine learning workflow taught in the lectures.

The ML process:
- Data input: loading the data with `load_spam` and `load_concrete`
- Data processing: checking for nulls and filling them in if required
- Choosing an ML model: `LogisticRegression` for spam and `LinearRegression` for concrete
- Validation: both models were validated, using train and test scores plus cross-validation for spam, and MSE and R2 for concrete
- Visualization: for both cases the results were summarized in a table
 

The spam dataset:
- The model accuracy increases as the size of the dataset increases, as shown by the table in Step 5 of Part 1
- In the best case we correctly classify ~92% of unseen emails (~8% misclassification)
- The accuracy is decently high, but the ~8% misclassification is not a small issue
- Training and validation scores in the results table were similar in all cases (largest gap ~0.02, for `X_small`), which indicates low variance and that we are not overfitting
- As shown by the code below, the two features most correlated with the spam label are `word_freq_your` and `word_freq_000`


The concrete dataset:
- The linear model produced a validation R2 score of 0.64 (ideal value = 1) and an MSE of 95.64 (ideal value = 0)
- The remaining 36% of the variability is unaccounted for
- Training and validation R2 scores were similar (0.61 and 0.64), so we have low variance
- Both scores are far from the maximum (1), so we have high bias
- This suggests the model underfits the data
- As shown by the code below, the two features most correlated with the compressive strength (labelled "Yield Strength" in the code) are cement and splast
 

In [49]:
import matplotlib.pyplot as plt
import seaborn as sns

data1 = pd.DataFrame(load_concrete()[0])
data1["Yield Strength"] = load_concrete()[1]
cols1 = data1.corr().abs().nlargest(3, 'Yield Strength')['Yield Strength'].index
print(cols1)

data2 = pd.DataFrame(load_spam(data_home="Lab2/spambase.data")[0])
data2["Is Spam"] = load_spam(data_home="Lab2/spambase.data")[1]
cols2 = data2.corr().abs().nlargest(3, 'Is Spam')['Is Spam'].index
print(cols2)

Index(['Yield Strength', 'cement', 'splast'], dtype='object')
Index(['Is Spam', 'word_freq_your', 'word_freq_000'], dtype='object')


## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

what you liked or disliked: 
- I liked that we went through both examples of classification and regression 

found interesting, confusing, challenging, motivating:
- It was interesting how the ML process is similar for classification and regression. Additionally, the example datasets were interesting and motivating


## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.

In [129]:
# TO DO: ADD YOUR CODE HERE
from sklearn.linear_model import Ridge, Lasso

#Ridge ----------------------------------------------------
ridge_model = Ridge(alpha=100)  
ridge_model.fit(X_train, y_train)

# Make predictions
ridge_predictions = ridge_model.predict(X_test)

# Evaluate the model
ridge_rt = r2_score(y_test, ridge_predictions)
ridge_mse = mean_squared_error(y_test, ridge_predictions)


print("\nRidge Regression: ")
print(f"R2 Score: {ridge_rt:.8f}")
print(f"Mean Squared Error: {ridge_mse:.8f}")


#Lasso ----------------------------------------------------
lasso_model = Lasso(alpha=0.001)  
lasso_model.fit(X_train, y_train)

# Make predictions
lasso_predictions = lasso_model.predict(X_test)

# Evaluate the model
lasso_rt = r2_score(y_test, lasso_predictions)
lasso_mse = mean_squared_error(y_test, lasso_predictions)


print("\nLasso Regression:")
print(f"R2 Score: {lasso_rt:.8f}")
print(f"Mean Squared Error: {lasso_mse:.8f}")


Ridge Regression: 
R2 Score: 0.63693669
Mean Squared Error: 95.62517337

Lasso Regression:
R2 Score: 0.63689949
Mean Squared Error: 95.63497052


*ANSWER HERE*

 Which method and what value of alpha gave you the best R^2 score?
 - Ridge with alpha = 100 gave the highest R2 score (0.63694, versus 0.63690 for Lasso with alpha = 0.001)

 Is this score "good enough"? Explain why or why not
 - No. The score is essentially unchanged from plain linear regression (0.637), so regularization did not move the model away from underfitting the data. I still think a linear model will not suffice for this problem; we will need a different model to achieve higher accuracy 
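
The cell above tries a single alpha per method. A sweep over the full range named in the question (0.001 to 100 on a logarithmic scale) could be sketched as below. This is an illustration under stated assumptions, not the notebook's code: it uses synthetic data from `make_regression` as a stand-in for the concrete dataset, and the names `X_demo`, `X_val`, `scores`, and `fitted` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the concrete data (assumption: 8 features, noisy linear target)
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Six log-spaced values: 0.001, 0.01, 0.1, 1, 10, 100
alphas = np.logspace(-3, 2, num=6)

scores = []
for name, Estimator in [("Ridge", Ridge), ("Lasso", Lasso)]:
    for alpha in alphas:
        fitted = Estimator(alpha=alpha, max_iter=10000).fit(X_tr, y_tr)
        scores.append((name, alpha, r2_score(y_val, fitted.predict(X_val))))

# Pick the (method, alpha) pair with the highest held-out R2
best_name, best_alpha, best_r2 = max(scores, key=lambda s: s[2])
print(f"best: {best_name}, alpha={best_alpha:g}, R2={best_r2:.4f}")
```

Choosing alpha on the same split that is used to report the final score makes that score slightly optimistic; running the sweep with cross-validation on the training set would avoid this.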