# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: 

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [1]:
import numpy as np
import pandas as pd
import yellowbrick as yb

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [69]:
# TO DO: Import spam dataset from yellowbrick library
from yellowbrick.datasets import load_spam

X, y = load_spam()

# TO DO: Print size and type of X and y

print("Size of X:", X.shape)
print("Type of X:", type(X))
print(X.dtypes)
print("Size of y:", y.shape)
print("Type of y:", type(y))
print(y.dtypes)

Size of X: (4600, 57)
Type of X: <class 'pandas.core.frame.DataFrame'>
word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               float64
word_freq_hp     

### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [70]:
# TO DO: Check if there are any missing values and fill them in if necessary
missing_values_X = X.isnull().sum().sum()
print(missing_values_X)
missing_x_values = X.isna().sum()
print("", missing_x_values)


missing_values_y = y.isnull().sum().sum()
print(missing_values_y)
missing_y_values = y.isna().sum()
print("", missing_y_values)


0
 word_freq_make                0
word_freq_address             0
word_freq_all                 0
word_freq_3d                  0
word_freq_our                 0
word_freq_over                0
word_freq_remove              0
word_freq_internet            0
word_freq_order               0
word_freq_mail                0
word_freq_receive             0
word_freq_will                0
word_freq_people              0
word_freq_report              0
word_freq_addresses           0
word_freq_free                0
word_freq_business            0
word_freq_email               0
word_freq_you                 0
word_freq_credit              0
word_freq_your                0
word_freq_font                0
word_freq_000                 0
word_freq_money               0
word_freq_hp                  0
word_freq_hpl                 0
word_freq_george              0
word_freq_650                 0
word_freq_lab                 0
word_freq_labs                0
word_freq_telnet              0
word_
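
The spam data has no missing values, so no fill step was needed. As a minimal sketch of what the fill step could look like when gaps do exist, the example below uses a small toy DataFrame (the column names and values are illustrative assumptions, not part of the spam dataset) and fills each gap with the column median.

```python
import numpy as np
import pandas as pd

# Toy feature matrix with two missing entries (illustrative data only).
toy = pd.DataFrame({
    "word_freq_a": [0.0, 0.5, np.nan, 1.2],
    "word_freq_b": [0.1, np.nan, 0.3, 0.4],
})

# Count missing values per column, then in total.
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

# Fill each gap with that column's median, which is less sensitive to outliers than the mean.
toy_filled = toy.fillna(toy.median())

print(missing_per_column)
print("Total missing before:", total_missing)
print("Total missing after:", int(toy_filled.isna().sum().sum()))
```

The median is only one reasonable choice; the mean or a constant such as 0 may suit other datasets better.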

For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [23]:
# TO DO: Create X_small and y_small 
from sklearn.model_selection import train_test_split

# Keep 5% of the rows as X_small and y_small; test_size=0.95 sends the other 95% to the discarded split
X_small, _, y_small, _ = train_test_split(X, y, test_size=0.95, random_state=42)

# Print the size and type of X_small and y_small
print("Size of X_small:", X_small.shape)
print("Type of X_small:", type(X_small))
print("Size of y_small:", y_small.shape)
print("Type of y_small:", type(y_small))


Size of X_small: (230, 57)
Type of X_small: <class 'pandas.core.frame.DataFrame'>
Size of y_small: (230,)
Type of y_small: <class 'pandas.core.series.Series'>
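
A possible variation on the 5% split, sketched on synthetic data (the `_demo` names and the random data are assumptions for illustration): passing `train_size=0.05` states the intent directly, and `stratify` keeps the spam/not-spam ratio of the subset close to that of the full data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data: 1000 rows, roughly 40% positive class.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.random((1000, 5)), columns=[f"f{i}" for i in range(5)])
y_demo = pd.Series((rng.random(1000) < 0.4).astype(int), name="is_spam")

# train_size=0.05 keeps 5% of the rows; stratify=y_demo preserves the class ratio.
X_small_demo, _, y_small_demo, _ = train_test_split(
    X_demo, y_demo, train_size=0.05, stratify=y_demo, random_state=0
)

print(X_small_demo.shape, y_small_demo.shape)
print("Positive rate, full:", round(y_demo.mean(), 3))
print("Positive rate, subset:", round(y_small_demo.mean(), 3))
```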


### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [85]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# STEP 3

from sklearn.linear_model import LogisticRegression

# Create a validation dataset for each model
X_train_original, X_val_original, y_train_original, y_val_original = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_first_two_columns, X_val_first_two_columns, y_train_first_two_columns, y_val_first_two_columns = train_test_split(X.iloc[:, :2], y, test_size=0.2, random_state=42)
X_train_small, X_val_small, y_train_small, y_val_small = train_test_split(X_small, y_small, test_size=0.2, random_state=42)

# Instantiate a new Logistic Regression model with a maximum of 2000 iterations for each dataset
model_original = LogisticRegression(max_iter=2000)
model_first_two_columns = LogisticRegression(max_iter=2000)
model_small = LogisticRegression(max_iter=2000)

# Fit each model on its full dataset (these fits include the rows held out above for validation)
model_original.fit(X.values, y.values)
model_first_two_columns.fit(X.iloc[:, :2].values, y.values)
model_small.fit(X_small.values, y_small.values)

# STEP 4

# Calculate the training accuracy for each model (scored on the same data used for fitting)
train_accuracy_original = model_original.score(X.values, y.values)
train_accuracy_first_two_columns = model_first_two_columns.score(X.iloc[:, :2].values, y.values)
train_accuracy_small = model_small.score(X_small.values, y_small.values)

# Calculate the validation accuracy for each model on its held-out split
val_accuracy_original = model_original.score(X_val_original.values, y_val_original.values)
val_accuracy_first_two_columns = model_first_two_columns.score(X_val_first_two_columns.values, y_val_first_two_columns.values)
val_accuracy_small = model_small.score(X_val_small.values, y_val_small.values)

# STEP 5

# Create a list to store the results for each dataset
data_sizes = ["Full Dataset", "First Two Columns of X", "X_small"]
data_shapes = [X_train_original.shape, X_train_first_two_columns.shape, X_train_small.shape]
training_accuracies = [train_accuracy_original, train_accuracy_first_two_columns, train_accuracy_small]
validation_accuracies = [val_accuracy_original, val_accuracy_first_two_columns, val_accuracy_small]

# Create the results DataFrame
results = pd.DataFrame({
    "Data Size": data_sizes,
    "Data Shape": data_shapes,
    "Training Accuracy": training_accuracies,
    "Validation Accuracy": validation_accuracies
})

# Print the results DataFrame
print(results)


# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

                Data Size  Data Shape  Training Accuracy  Validation Accuracy
0            Full Dataset  (3680, 57)           0.932391             0.917391
1  First Two Columns of X   (3680, 2)           0.616304             0.590217
2                 X_small   (184, 57)           0.960870             0.934783
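
The hint about using a loop could be followed roughly as below. This is a sketch on synthetic data from `make_classification` (an assumption, so that it runs without yellowbrick), and all `_demo` names are hypothetical. Unlike the cell above, it fits each model on the training split only, so the validation rows stay unseen during fitting.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (assumption: not the yellowbrick dataset).
X_arr, y_arr = make_classification(
    n_samples=2000, n_features=20, n_informative=8, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(20)])
y_demo = pd.Series(y_arr)

# A 5% subset, mirroring X_small and y_small.
X_small_demo, _, y_small_demo, _ = train_test_split(
    X_demo, y_demo, train_size=0.05, random_state=0
)

datasets = {
    "Full dataset": (X_demo, y_demo),
    "First two columns": (X_demo.iloc[:, :2], y_demo),
    "5% subset": (X_small_demo, y_small_demo),
}

rows = []
for name, (X_part, y_part) in datasets.items():
    # Hold out 20% of each dataset for validation.
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_part, y_part, test_size=0.2, random_state=0
    )
    model = LogisticRegression(max_iter=2000)
    model.fit(X_tr, y_tr)  # fit on the training split only
    rows.append({
        "Dataset": name,
        "Data size": X_tr.shape[0],
        "Training accuracy": model.score(X_tr, y_tr),
        "Validation accuracy": model.score(X_va, y_va),
    })

results_demo = pd.DataFrame(rows)
print(results_demo)
```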


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

*YOUR ANSWERS HERE*

1. Training and Validation Accuracy vs. Amount of Data:

Full Dataset: Trained on all 4,600 emails and 57 features, the model reaches a training accuracy of 0.932391 and a validation accuracy of 0.917391. The gap between the two is small (about 0.015), which indicates that the model performs well not only on the training data but also on the validation data.

First Two Columns of X: The "First Two Columns of X" dataset has much lower training and validation accuracy than the "Full Dataset." This shows that when only a subset of features is used (the first two columns of X), the model's predictive power is reduced. The training and validation accuracy values for this dataset are 0.616304 and 0.590217, respectively.

X_small: The "X_small" dataset achieves the highest training and validation accuracy in the table, 0.960870 and 0.934783 respectively, even though it contains only 5% of the data (230 emails). With so few rows, the validation score rests on just 46 emails, so it is a noisier estimate than the full-dataset score.

Explanation: In general, more training data tends to improve validation accuracy, because the model has more information to learn from, while training accuracy often drops slightly because a larger dataset is harder to fit exactly. The results here follow the second part of that pattern (training accuracy falls from 0.960870 on X_small to 0.932391 on the full dataset), but validation accuracy does not improve with more data (0.934783 versus 0.917391). Increasing the amount of data does not guarantee improvement, and there can be diminishing returns beyond a certain point. The specific behavior also depends on the complexity of the model and the quality of the data.



2. False Positive and False Negative:

In the context of a binary classification problem (e.g., spam classification), a false positive occurs when the model incorrectly predicts a positive (spam) class when the actual class is negative (not spam). In other words, it's a situation where the model incorrectly identifies something as spam when it's not.

A false negative occurs when the model incorrectly predicts a negative (not spam) class when the actual class is positive (spam). In this case, the model fails to identify actual spam.

Which One Is Worse?:

False Positives: False positives can be annoying or inconvenient for users because legitimate emails are classified as spam, and a message that lands in the spam folder may be overlooked. However, they generally have less severe consequences than false negatives.

False Negatives: False negatives can be more problematic because they allow actual spam to reach the inbox, exposing users to potentially harmful content such as phishing links or malicious attachments.

In many spam classification scenarios, minimizing false negatives (ensuring that spam is accurately detected) is often considered more critical. However, the balance between false positives and false negatives can be adjusted based on user preferences and the specific goals of the application.
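
To make the two error types concrete, here is a small sketch that counts them with scikit-learn's `confusion_matrix`. The labels are toy values chosen for illustration (1 = spam, 0 = not spam), not predictions from the models above.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = spam, 0 = not spam (illustrative values, not model output).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

# For binary labels ordered [0, 1], ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print("False positives (legitimate mail flagged as spam):", fp)
print("False negatives (spam delivered to the inbox):", fn)
print("Precision:", tp / (tp + fp))
print("Recall:", tp / (tp + fn))
```

Precision drops as false positives increase, and recall drops as false negatives increase, so the two metrics track the trade-off described above.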


### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [61]:
# TO DO: Import concrete dataset from yellowbrick library
from yellowbrick.datasets import load_concrete

X, y = load_concrete()

# TO DO: Print size and type of X and y

print("Size of X:", X.shape)
print("Type of X:", type(X))
print("Size of y:", y.shape)
print("Type of y:", type(y))

Size of X: (1030, 8)
Type of X: <class 'pandas.core.frame.DataFrame'>
Size of y: (1030,)
Type of y: <class 'pandas.core.series.Series'>


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [62]:
# TO DO: Check if there are any missing values and fill them in if necessary
missing_values_X = X.isnull().sum().sum()
print(missing_values_X)

missing_values_y = y.isnull().sum().sum()
print(missing_values_y)

0
0


### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [115]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0

# STEP 3

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Hold out 20% of the data as a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate a Linear Regression model (LinearRegression has no max_iter parameter)
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train.values, y_train.values)

# Predict on the training and validation sets
y_train_pred = model.predict(X_train.values)
y_val_pred = model.predict(X_val.values)

### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [118]:
# TO DO: ADD YOUR CODE HERE
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# Mean squared error for the training and validation sets

mse_train = mean_squared_error(y_train, y_train_pred)
mse_val = mean_squared_error(y_val, y_val_pred)

# R2 score for the training and validation sets

r2_train = r2_score(y_train, y_train_pred)
r2_val = r2_score(y_val, y_val_pred)

# Print the results
print("Training Mean Squared Error: ", mse_train)
print("Training R-squared: ", r2_train)
print("Validation Mean Squared Error: ", mse_val)
print("Validation R-squared: ", r2_val)

Training Mean Squared Error:  0.10334473427036059
Training R-squared:  0.5641265095933534
Validation Mean Squared Error:  0.11476906473081404
Validation R-squared:  0.5300409463562602
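
For reference, both metrics can be reproduced by hand. MSE is the mean of the squared residuals, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this against scikit-learn on four toy values (illustrative numbers, not taken from the concrete data).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy targets and predictions (illustrative values only).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 10.0])

# MSE: average of the squared differences between actual and predicted values.
mse_manual = np.mean((y_true - y_pred) ** 2)

# R2: 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE manual vs sklearn:", mse_manual, mean_squared_error(y_true, y_pred))
print("R2 manual vs sklearn:", r2_manual, r2_score(y_true, y_pred))
```

Because MSE is in squared units of the target, its size only makes sense relative to the scale of `y`, whereas R2 is unitless.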




### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [117]:
# TO DO: ADD YOUR CODE HERE

# STEP 5

results_data = {
    "MSE": [mse_train, mse_val],
    "R2": [r2_train, r2_val]
}

results = pd.DataFrame(results_data)
print(results)

        MSE        R2
0  0.103345  0.564127
1  0.114769  0.530041
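
Step 5 asks for columns named for training and validation accuracy with MSE and R2 score as the index. One way that layout could be built is sketched below, reusing the metric values printed in Step 4; the name `results_labeled` is a hypothetical choice for this example.

```python
import pandas as pd

# Metric values copied from the Step 4 printed output (rounded to six decimals).
mse_train, mse_val = 0.103345, 0.114769
r2_train, r2_val = 0.564127, 0.530041

# One column per data split, one row per metric.
results_labeled = pd.DataFrame(
    {
        "Training accuracy": [mse_train, r2_train],
        "Validation accuracy": [mse_val, r2_val],
    },
    index=["MSE", "R2 score"],
)
print(results_labeled)
```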


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

MSE: The Mean Squared Error measures the average squared difference between the actual target values and the predicted values. In this case, a lower MSE indicates better model performance. The training MSE is 0.103345, and the validation MSE is 0.114769. The training MSE is slightly lower, suggesting that the model fits the training data better. However, the difference between the training and validation MSE is relatively small, which is a positive sign.

R2: The R-squared (R2) score measures the proportion of the variance in the target variable that is predictable from the independent variables. A higher R2 score indicates better model fit. The training R2 is 0.564127, and the validation R2 is 0.530041, so the model explains roughly 53% to 56% of the variance in the target, with the training set scoring slightly higher.

In summary, the linear regression model seems to perform reasonably well on this dataset. The training and validation metrics are close, indicating that the model generalizes well to unseen data. However, the goodness of fit can be further assessed by comparing these metrics to other models and conducting cross-validation.
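
The cross-validation mentioned above could look roughly like this. The sketch uses synthetic data from `make_regression` as a stand-in (an assumption, so that it runs without the concrete dataset) and reports the R2 score for each of five folds.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data with 8 features (stand-in for the concrete dataset).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, noise=20.0, random_state=0
)

# Five shuffled folds; each fold serves as the validation set exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print("R2 per fold:", scores)
print("Mean R2:", scores.mean(), "Std:", scores.std())
```

A small spread across folds would support the claim that the model generalizes consistently, while a large spread would suggest that a single train/validation split is not representative.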

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
Code was sourced primarily from lab exercises, as the example shown was similar. Generative AI tool ChatGPT was utilized in some troubleshooting of the code.
1. In what order did you complete the steps?
a. Step 1: Loading the data using the Yellowbrick library.
b. Step 2: Checking for missing values in the dataset.
c. Step 3: Implementing a machine learning model (LinearRegression) using scikit-learn.
d. Step 4: Evaluating the model's performance by calculating MSE and R-squared for both training and validation sets.
e. Step 5: Creating a DataFrame to present the results.
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
ChatGPT assisted with troubleshooting of my code. Specifically, I kept encountering the following error in most of my code:
/Users/satchy/anaconda3/envs/ensf-ml/lib/python3.11/site-packages/sklearn/utils/validation.py:605: FutureWarning: is_sparse is deprecated and will be removed in a future version. Check `isinstance(dtype, pd.SparseDtype)` instead.
ChatGPT explained that this warning could be ignored, but it also suggested appending `.values` to my variables when passing them to the models to avoid the warning message, which I ended up implementing throughout the code.
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?
I found it difficult to know exactly what needed to be done in each step, but following the lab examples helped tremendously. It would have been very difficult to complete this assignment otherwise.
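
As a sketch of the two workarounds for the `FutureWarning` described above: converting a DataFrame to a NumPy array before passing it to scikit-learn, or silencing only that warning category within a limited block. The DataFrame here is a toy example, and the warning raised inside the block is a stand-in message, not the actual scikit-learn warning.

```python
import warnings

import numpy as np
import pandas as pd

# Toy DataFrame (illustrative values only).
X_demo = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# .to_numpy() returns the same data as .values as a plain NumPy array.
X_array = X_demo.to_numpy()
print(type(X_array), X_array.shape)

# Silence only FutureWarning, and only inside this block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    warnings.warn("example deprecation notice", FutureWarning)  # suppressed

print("Arrays match:", np.array_equal(X_array, X_demo.values))
```

Scoping the filter this way avoids hiding unrelated warnings elsewhere in the notebook.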

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*

MSE Comparison:
The training MSE (Mean Squared Error) is 0.103345.
The validation MSE is 0.114769.
Interpretation: The MSE measures the average squared difference between the actual and predicted values. A lower MSE indicates a better model fit. In this case, the training MSE is slightly lower than the validation MSE, which is a common pattern. The model typically fits the training data better than the validation data.

R2 Comparison:
The training R2 (R-squared) is 0.564127.
The validation R2 is 0.530041.
Interpretation: R2 measures the proportion of the variance in the target variable that is predictable from the independent variables. A higher R2 indicates a better model fit. Both the training and validation R2 values are relatively close to each other, suggesting that the model generalizes reasonably well to new, unseen data.

Training vs. Validation:
The training MSE and R2 are slightly better than the validation MSE and R2.
Interpretation: This is a typical pattern. The training set is the data the model was trained on, so it tends to fit this data better. The validation set is used to assess how well the model generalizes to new, unseen data. The relatively small difference between the training and validation metrics suggests that the model's performance is consistent and does not show signs of severe overfitting.

Overall, these results align with what is often discussed in machine learning lectures:
Generalization: The model appears to generalize well, as indicated by the similarity in performance between the training and validation sets. This suggests that the model is not overfitting the training data.
Model Fit: The model shows reasonable performance in explaining the variance in the target variable. However, the results may vary depending on the specific dataset and problem. Further analysis, such as comparing this linear model with more complex models or conducting cross-validation, would provide a more comprehensive assessment of the model's fit to the data.
Model Evaluation: It's crucial to evaluate regression models using appropriate metrics like MSE and R2. These metrics provide insights into how well the model predicts the target variable.

In summary, the observed pattern in the results aligns with the principles discussed during machine learning lectures, emphasizing the importance of model evaluation, generalization, and the choice of appropriate metrics for assessing model performance.

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

While assisting with this assignment, I enjoyed the clarity of the tasks and questions provided. It made it straightforward to create the code and provide explanations. However, I also noticed that some of the provided code contained inconsistencies, such as references to "Logistic Regression" when working with a linear regression model, which could potentially lead to confusion. Ensuring that the code and terminology are consistent with the task description is essential for clarity and understanding. Overall, it was motivating to help with this assignment and provide guidance on data processing, modeling, and evaluation, which are fundamental concepts in machine learning.

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.

In [120]:
# TO DO: ADD YOUR CODE HERE
from sklearn.linear_model import Ridge, Lasso
import numpy as np
import warnings
warnings.filterwarnings("ignore")

# Define a range of alpha values on a logarithmic scale
alphas = np.logspace(-3, 2, 100)

best_r2 = -np.inf  # Initialize to negative infinity
best_alpha = None
best_model = None

# Loop through different alpha values
for alpha in alphas:
    # Create and fit a Ridge model
    ridge_model = Ridge(alpha=alpha)
    ridge_model.fit(X_train, y_train)

    # Make predictions on the validation set
    y_val_pred_ridge = ridge_model.predict(X_val)

    # Calculate R-squared for Ridge
    r2_val_ridge = r2_score(y_val, y_val_pred_ridge)

    if r2_val_ridge > best_r2:
        best_r2 = r2_val_ridge
        best_alpha = alpha
        best_model = ridge_model

print("Best Ridge R-squared:", best_r2)
print("Best Ridge Alpha:", best_alpha)

# Repeat the process for Lasso
best_r2 = -np.inf
best_alpha = None
best_model = None

for alpha in alphas:
    # Create and fit a Lasso model
    lasso_model = Lasso(alpha=alpha)
    lasso_model.fit(X_train, y_train)

    # Make predictions on the validation set
    y_val_pred_lasso = lasso_model.predict(X_val)

    # Calculate R-squared for Lasso
    r2_val_lasso = r2_score(y_val, y_val_pred_lasso)

    if r2_val_lasso > best_r2:
        best_r2 = r2_val_lasso
        best_alpha = alpha
        best_model = lasso_model

print("Best Lasso R-squared:", best_r2)
print("Best Lasso Alpha:", best_alpha)


Best Ridge R-squared: 0.5308491236372105
Best Ridge Alpha: 55.90810182512222
Best Lasso R-squared: 0.5287688594784852
Best Lasso Alpha: 0.001
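
An alternative to the two hand-written loops, sketched on synthetic data (an assumption, so that it runs without the concrete dataset): `GridSearchCV` searches the same logarithmic alpha range with 5-fold cross-validation instead of a single validation split, and a `StandardScaler` step is added because Ridge and Lasso penalties are sensitive to feature scale. The scaling step and the six-point alpha grid are choices made for this example, not part of the cell above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with 8 features (stand-in for the concrete dataset).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=15.0, random_state=0
)

# Six alpha values from 0.001 to 100 on a logarithmic scale.
alphas = np.logspace(-3, 2, 6)

best = {}
for name, estimator in {"Ridge": Ridge(), "Lasso": Lasso(max_iter=10000)}.items():
    # Scale features first so the penalty treats all coefficients comparably.
    pipeline = make_pipeline(StandardScaler(), estimator)
    # make_pipeline names each step after its lowercased class name.
    param_name = f"{name.lower()}__alpha"
    search = GridSearchCV(pipeline, {param_name: alphas}, scoring="r2", cv=5)
    search.fit(X_demo, y_demo)
    best[name] = (search.best_params_[param_name], search.best_score_)
    print(f"{name}: best alpha = {best[name][0]}, mean CV R2 = {best[name][1]:.4f}")
```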


*ANSWER HERE*