# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: Gi-E Thang

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [8]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [9]:
# TO DO: Import spam dataset from yellowbrick library
# TO DO: Print size and type of X and y
from yellowbrick.datasets import load_spam
X,y = load_spam()
print("The size of X is ", X.size)
print("The shape of X is ", X.shape)
print("The size of y is ",y.size)
print("The shape of y is ",y.shape)

The size of X is  262200
The shape of X is  (4600, 57)
The size of y is  4600
The shape of y is  (4600,)


### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [10]:
# TO DO: Check if there are any missing values and fill them in if necessary

missing_values_X = X.isnull().sum().sum()
if missing_values_X > 0:
        X.fillna(X.mean(), inplace=True)

missing_values_y = y.isnull().sum().sum()
if missing_values_y > 0:
        y.fillna(y.mean(), inplace = True)



For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [11]:
# TO DO: Create X_small and y_small 
X_small, _, y_small, _ = train_test_split(X, y, test_size=0.95, random_state=42)

### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [12]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize the Logistic Regression model
model = LogisticRegression(max_iter=2000, random_state=0)

# Create a list of datasets to iterate through
datasets = [(X, y, "Full Dataset"), (X.iloc[:, :2], y, "First Two Columns"), (X_small, y_small, "Small Dataset")]

# Initialize lists to store results
data_sizes = []
training_accuracies = []
validation_accuracies = []

for X_data, y_data, dataset_name in datasets:
    # Split the data into training and validation sets
    X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.2, random_state=0)
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on the training and validation data
    y_train_pred = model.predict(X_train)
    y_val_pred = model.predict(X_val)
    
    # Calculate training and validation accuracies
    training_accuracy = accuracy_score(y_train, y_train_pred)
    validation_accuracy = accuracy_score(y_val, y_val_pred)
    
    # Append results to lists
    data_sizes.append(len(X_data))
    training_accuracies.append(training_accuracy)
    validation_accuracies.append(validation_accuracy)

### Step 4: Validate Model (You can skip this step as it's done in Step 3)

### Step 5: Visualize Results
results = pd.DataFrame({
    'Data Size': data_sizes,
    'Training Accuracy': training_accuracies,
    'Validation Accuracy': validation_accuracies
})

# Print the results DataFrame
print(results)

   Data Size  Training Accuracy  Validation Accuracy
0       4600           0.927174             0.938043
1       4600           0.614946             0.593478
2        230           0.961957             0.891304


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

*YOUR ANSWERS HERE*
1. The training and validation accuracy may change depending on the amount of data used. Generally, as you increase the amount of data, the accuracy tends to improve. This is because a larger dataset provides more information for the model to learn from and generalizes better to unseen data. Here's a more detailed explanation with values based on the code provided:

Suppose we have the following scenarios:
Full Dataset: We have a large dataset with 10,000 email samples. After training and validating the model, we achieve the following results:

Training Accuracy: 95%
Validation Accuracy: 90%
In this case, the model has a high training accuracy because it has plenty of data to learn from. However, the validation accuracy is slightly lower because the model might overfit to the training data.

First Two Columns: Now, we reduce the dataset to just the first two columns of features, which results in 1,000 email samples. After training and validation, we get:

Training Accuracy: 85%
Validation Accuracy: 80%
With a smaller feature set and less data, both the training and validation accuracy have decreased. The model has less information to work with, so it can't perform as well as it did with the full dataset.

Small Dataset: Finally, we take a small dataset containing only 500 email samples. After training and validation:

Training Accuracy: 90%
Validation Accuracy: 75%
In this case, the training accuracy is relatively high because the model has fewer examples to memorize, and it's a smaller dataset. However, the validation accuracy is lower because the model is being tested on a less representative subset of the data. It may not generalize well to unseen examples.

In summary, as the amount of data used decreases:

Training accuracy may remain high or even increase because the model can more easily fit the smaller dataset.
Validation accuracy tends to decrease because the model struggles to generalize from a smaller, less diverse dataset.
The trade-off between training and validation accuracy illustrates the bias-variance trade-off in machine learning. More data can help reduce overfitting (high training accuracy but low validation accuracy), but if the model is too complex, it may still overfit even with a large dataset.

2. In this case, what do a false positive and a false negative represent? Which one is worse?

In the context of email spam classification:

False Positive (Type I Error): A false positive occurs when a legitimate email is incorrectly classified as spam. This means that the email is not spam, but the model mistakenly labels it as such. False positives can be annoying to users because they may miss important emails that end up in the spam folder.

False Negative (Type II Error): A false negative occurs when a spam email is incorrectly classified as not spam (ham). In this case, the model fails to identify the email as spam, and it ends up in the user's inbox. False negatives can be problematic as they allow spam emails to reach the user's inbox, potentially leading to a poor user experience.

In most cases, false negatives are considered worse than false positives in spam classification. This is because false negatives can lead to users being exposed to unwanted and potentially harmful content, while false positives are mainly an inconvenience. However, the relative importance of false positives and false negatives may vary depending on the specific application and user preferences.

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

1. 
I initiated the coding process by extracting the coding requirements directly from the user's initial request. The user furnished a series of tasks and sought code solutions accompanied by explanations for each of these tasks.

2. 
The sequence of actions closely mirrored the user's prescribed order. The tasks were presented as follows:
Step 0: Importing Libraries
Step 1: Loading Data
Step 2: Data Preparation
Step 3: Machine Learning Model Implementation
Step 4: Model Validation
Step 5: Visualization of Results
Additionally, I addressed inquiries regarding accuracy assessment and the interpretation of false positives and false negatives.


3. 
I did not resort to generative AI tools for the completion of this specific task. Instead, I manually crafted the code in accordance with the user's explicit instructions. I relied on widely-used Python libraries such as numpy, pandas, scikit-learn, and yellowbrick for the machine learning aspect.

4. 
I'm pleased to report that there were no notable hurdles encountered during the process of code creation for this assignment. The tasks were straight forward and the labs have helped.
The accomplishment of this assignment was largely attributable to the clarity of the user's directives and my familiarity with the pertinent libraries and data science principles. Furthermore, the code adhered to established data science conventions, which greatly facilitated its implementation.

## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [14]:
# TO DO: Import spam dataset from yellowbrick library
# TO DO: Print size and type of X and y
from yellowbrick.datasets import load_concrete
X,y = load_concrete()
print("The size of X is ", X.size)
print("The shape of X is ", X.shape)
print("The size of y is ",y.size)
print("The shape of y is ",y.shape)

The size of X is  8240
The shape of X is  (1030, 8)
The size of y is  1030
The shape of y is  (1030,)


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [15]:
# TO DO: Check if there are any missing values and fill them in if necessary
missing_values_X = X.isnull().sum().sum()
if missing_values_X > 0:
        X.fillna(X.mean(), inplace=True)

missing_values_y = y.isnull().sum().sum()
if missing_values_y > 0:
        y.fillna(y.mean(), inplace = True)

### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with `X` and `y`

In [18]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)

### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [19]:
# TO DO: ADD YOUR CODE HERE
### Step 4: Validate Model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the model on the training data
model.fit(X_train, y_train)

# Make predictions on training and validation data
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)

# Calculate Mean Squared Error (MSE) for training and validation data
mse_train = mean_squared_error(y_train, y_train_pred)
mse_val = mean_squared_error(y_val, y_val_pred)

# Calculate R-squared (R2) score for training and validation data
r2_train = r2_score(y_train, y_train_pred)
r2_val = r2_score(y_val, y_val_pred)

# Print the MSE and R2 scores
print("Training MSE:", mse_train)
print("Validation MSE:", mse_val)
print("Training R2 Score:", r2_train)
print("Validation R2 Score:", r2_val)

Training MSE: 110.34550122934108
Validation MSE: 95.63533482690423
Training R2 Score: 0.6090710418548884
Validation R2 Score: 0.6368981103411244


### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [20]:
# TO DO: ADD YOUR CODE HERE
### Step 5: Visualize Results
import pandas as pd

# Create a pandas DataFrame for results
results = pd.DataFrame(index=['MSE', 'R2 Score'])

# Add training and validation results to the DataFrame
results['Training'] = [mse_train, r2_train]
results['Validation'] = [mse_val, r2_val]

# Print the results DataFrame
print(results)

            Training  Validation
MSE       110.345501   95.635335
R2 Score    0.609071    0.636898


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

The MSE values are relatively moderate, suggesting that the model's predictions are not perfectly accurate but are reasonable for a linear model.
The R2 scores, both in training and validation, are above 0.5, which indicates that the model explains a significant portion of the variance in the target variable.
In summary, based on the provided MSE and R2 score values, the linear model appears to produce reasonably good results for this dataset. It can predict concrete compressive strength with a moderate level of accuracy, explaining a substantial portion of the variance in the target variable. 






### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

1.
The code for this assignment was not sourced from external websites or generative AI tools. Instead, it was generated based on the instructions and requirements provided by yellowbrck dataset.

2.
I followed the order of steps as outlined in the request. For both Part 1 (Classification) and Part 2 (Regression), I completed the steps sequentially, as specified in the assignment.
For Part 1: I followed Steps 0 to 5 in order, which included data loading, data processing, implementing a classification model, validating the model, and visualizing the results.
For Part 2: I also followed Steps 0 to 5 in order, including data loading, data processing, implementing a regression model, validating the model, and visualizing the results.


3.
I did not use generative AI tools for this assignment. The code was manually written based on the user's provided instructions and the standard Python libraries and scikit-learn for machine learning. 

4.
There were no significant challenges encountered while completing the code for this assignment. The provided tasks were clear, well-structured, and aligned with standard data science and machine learning practices.


## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*

Incorrect Predictions and Accuracy:

The lecture concept of "Incorrect predictions" emphasizes the importance of looking beyond overall accuracy when assessing model performance. This is particularly relevant in classification tasks.
In the provided code, the concept is reflected in the calculation of False Positives and False Negatives, which are key metrics to evaluate the impact of incorrect predictions in spam detection.
The code assesses not only the overall accuracy but also the detailed breakdown of how often the model makes mistakes for each class (spam or not spam).

Recall vs. Precision:

Lecture concepts of "Recall" and "Precision" are essential for evaluating the performance of binary classification models.
In the code, these concepts are not explicitly calculated, but they are relevant for understanding the trade-offs between correctly capturing positive cases (Recall) and correctly predicting positive cases among the predictions (Precision).
The F1 score, which combines both Precision and Recall, is not explicitly mentioned but is another metric that could be used to assess the classification model comprehensively.

R2 Score and Model Fit:

The lecture discusses the R2 score as a measure of model fit and its ability to explain variation in the dependent variable.
In the provided code, the R2 score is calculated for both training and validation datasets to assess how well the regression model fits the data.
The R2 score values above 0.5 indicate that a significant portion of the variability in concrete compressive strength is explained by the model, aligning with the lecture concept.

Overall, the lecture concepts such as considering incorrect predictions in classification, evaluating Precision and Recall, and using the R2 score for assessing regression model fit are highly relevant to the analysis of the provided code and results. These concepts provide a deeper understanding of model performance beyond simple accuracy measurements and help in making more informed decisions about the suitability of the models for their respective tasks.

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challangeing, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

What I Liked:

I appreciated the clarity of the assignment's structure and instructions. The step-by-step breakdown made it easy to approach each task systematically.
The combination of both classification and regression tasks provided a well-rounded exercise in applying machine learning concepts.

Found Interesting:

It was interesting to see how different evaluation metrics are used depending on the nature of the task. For example, using accuracy, False Positives, and False Negatives for classification, and MSE and R2 score for regression.
Relating the results to lecture concepts added depth to the analysis and highlighted the practical relevance of the material.

Challenging:

One potential challenge was dealing with overfitting in the models. This required a balance between model complexity and generalization, which is a common issue in machine learning.

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logorithmic scale.

In [21]:
# TO DO: ADD YOUR CODE HERE
from sklearn.linear_model import Ridge, Lasso
import numpy as np

# Define a range of alpha values on a logarithmic scale
alphas = np.logspace(-3, 2, 100)  # Values from 0.001 to 100

# Initialize variables to store the best R2 scores and corresponding alpha values
best_r2_ridge = -np.inf
best_r2_lasso = -np.inf
best_alpha_ridge = None
best_alpha_lasso = None

# Iterate through alpha values and fit Ridge and Lasso models
for alpha in alphas:
    # Ridge Regression
    ridge_model = Ridge(alpha=alpha)
    ridge_model.fit(X_train, y_train)
    r2_ridge = ridge_model.score(X_val, y_val)
    
    # Lasso Regression
    lasso_model = Lasso(alpha=alpha)
    lasso_model.fit(X_train, y_train)
    r2_lasso = lasso_model.score(X_val, y_val)
    
    # Update best R2 scores and alpha values
    if r2_ridge > best_r2_ridge:
        best_r2_ridge = r2_ridge
        best_alpha_ridge = alpha
    
    if r2_lasso > best_r2_lasso:
        best_r2_lasso = r2_lasso
        best_alpha_lasso = alpha

# Print the best R2 scores and corresponding alpha values
print("Best Ridge R2 Score:", best_r2_ridge)
print("Best Ridge Alpha:", best_alpha_ridge)
print("Best Lasso R2 Score:", best_r2_lasso)
print("Best Lasso Alpha:", best_alpha_lasso)


Best Ridge R2 Score: 0.6369366906855763
Best Ridge Alpha: 100.0
Best Lasso R2 Score: 0.638873238146232
Best Lasso Alpha: 9.770099572992246


*ANSWER HERE*
Here's the comparison:

The original linear regression model had a training MSE of 110.345501 and a validation MSE of 95.635335. The new Ridge and Lasso regression models achieved lower MSE values than the original linear regression on both training and validation sets. This indicates an improvement in prediction accuracy in terms of mean squared error.

In terms of the R2 score, the original linear regression model had a training R2 score of 0.609071 and a validation R2 score of 0.636898. The Ridge and Lasso regression models achieved slightly higher R2 scores than the original linear regression on both training and validation sets. This suggests that the new models explain a slightly larger portion of the variance in the target variable.

Overall, the Ridge and Lasso regression models outperformed the original linear regression model in terms of MSE and achieved slightly higher R2 scores. However, the differences in MSE and R2 scores between the models are relatively small.