# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: Jauhar Fathima

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [4]:
import numpy as np
import pandas as pd
from yellowbrick.datasets import load_spam
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression


### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [5]:
# TO DO: Import spam dataset from yellowbrick library
X, y = load_spam()
# TO DO: Print size and type of X and y
print("Size of X:", X.shape)
print("Type of X:", type(X))
print("Size of y:", y.shape)
print("Type of y:", type(y)) 

Size of X: (4600, 57)
Type of X: <class 'pandas.core.frame.DataFrame'>
Size of y: (4600,)
Type of y: <class 'pandas.core.series.Series'>


### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [6]:
# TO DO: Check if there are any missing values and fill them in if necessary
missing_values = X.isnull().sum().sum()
if missing_values > 0:
    X.fillna(X.mean(), inplace=True)
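
As a small illustration of what the check and the mean fill do, here is a sketch on a hypothetical three-row DataFrame (the `demo` frame and its values are made up and are not part of the spam dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in column "a"
demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Count missing values per column
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Fill each missing value with the mean of its own column
# (the mean is computed over the non-missing entries only)
filled = demo.fillna(demo.mean())
print(filled)
```

Mean imputation keeps every row but slightly shrinks the column's variance, which is usually acceptable when only a few values are missing.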


For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [7]:
# TO DO: Create X_small and y_small
X_small, _, y_small, _ = train_test_split(X, y, test_size=0.95, random_state=42)

# Print the shape of X_small and y_small
print("Shape of X_small:", X_small.shape)
print("Shape of y_small:", y_small.shape) 

Shape of X_small: (230, 57)
Shape of y_small: (230,)
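
As an alternative sketch, the same subset size can be requested directly with `train_size=0.05`, and `stratify` keeps the class ratio of the subset close to that of the full data. The arrays below are synthetic stand-ins with the same shape as the spam data, not the dataset itself:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical values, same shape as the spam dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4600, 57))
y_demo = rng.integers(0, 2, size=4600)

# Keep 5% of the rows; stratify so both classes stay proportionally represented
X_small_demo, _, y_small_demo, _ = train_test_split(
    X_demo, y_demo, train_size=0.05, stratify=y_demo, random_state=42
)
print("Shape of X_small_demo:", X_small_demo.shape)
print("Shape of y_small_demo:", y_small_demo.shape)
```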


### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [8]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Initialize the Logistic Regression model
model = LogisticRegression(max_iter=2000, random_state=0)

# Create a list of datasets
datasets = [(X, y, "Full Dataset"), (X.iloc[:, :2], y, "First Two Columns"), (X_small, y_small, "Small Dataset")]

# Initialize lists to store results
data_sizes = []
training_accuracies = []
validation_accuracies = []

# Loop through datasets
for X_data, y_data, data_description in datasets:
    # Split the data into training and validation sets
    X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.2, random_state=0)
    
    # Fit the model on the training data
    model.fit(X_train, y_train)
    
    # Calculate training accuracy
    training_accuracy = model.score(X_train, y_train)
    
    # Calculate validation accuracy
    validation_accuracy = model.score(X_val, y_val)
    
    # Append results to lists
    data_sizes.append(len(X_data))
    training_accuracies.append(training_accuracy)
    validation_accuracies.append(validation_accuracy)

# Step 4: Validate Model

# Print training and validation accuracies for each dataset
for data_size, train_acc, val_acc in zip(data_sizes, training_accuracies, validation_accuracies):
    print(f"Data Size: {data_size}, Training Accuracy: {train_acc:.2f}, Validation Accuracy: {val_acc:.2f}")

# Step 5: Visualize Results

# Create a pandas DataFrame to store the results
results = pd.DataFrame({
    "Data size": data_sizes,
    "Training accuracy": training_accuracies,
    "Validation accuracy": validation_accuracies
})

# Print the results DataFrame
print(results)


Data Size: 4600, Training Accuracy: 0.93, Validation Accuracy: 0.94
Data Size: 4600, Training Accuracy: 0.61, Validation Accuracy: 0.59
Data Size: 230, Training Accuracy: 0.96, Validation Accuracy: 0.89
   Data size  Training accuracy  Validation accuracy
0       4600           0.927446             0.938043
1       4600           0.614946             0.593478
2        230           0.961957             0.891304


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

*YOUR ANSWERS HERE*
1. Full Dataset: With all 4600 samples and 57 features, the training accuracy was 0.927 and the validation accuracy was 0.938. The two values are within about 0.01 of each other, so the model generalizes well and shows no sign of overfitting.

First Two Columns: Using only the first two columns (still 4600 samples), the training accuracy dropped to 0.615 and the validation accuracy to 0.593. Both values are low, which points to underfitting: two features do not carry enough information to separate spam from non-spam, no matter how many samples are available.

Small Dataset: Using 5% of the data (230 samples), the training accuracy rose to 0.962 while the validation accuracy fell to 0.891. The gap of about 0.07 is the largest of the three tests and suggests overfitting: with few samples the model fits the training set closely but generalizes less well. The validation set here has only 46 samples, so this estimate is also noisier.

In summary, reducing the number of samples increased training accuracy (0.927 to 0.962) but lowered validation accuracy (0.938 to 0.891), while reducing the number of features lowered both.
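
The comparison can be made explicit by computing the gap between training and validation accuracy for each test. This is a small sketch that uses the values from the printed `results` table above; the dictionary names are only for illustration:

```python
# Accuracies copied from the printed `results` table: (training, validation)
reported = {
    "Full dataset": (0.927446, 0.938043),
    "First two columns": (0.614946, 0.593478),
    "Small dataset": (0.961957, 0.891304),
}

# A large positive gap (training well above validation) hints at overfitting
gaps = {name: train - val for name, (train, val) in reported.items()}
for name, gap in gaps.items():
    print(f"{name}: training - validation = {gap:+.3f}")

largest_gap = max(gaps, key=gaps.get)
print("Largest gap:", largest_gap)
```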

2. False Positive (FP): A false positive occurs when the model predicts the positive class (e.g., spam) when the actual class is negative (e.g., not spam). In the context of email classification, it means that a non-spam email is incorrectly classified as spam.

False Negative (FN): A false negative occurs when the model predicts a negative class when the actual class is positive. In email classification, this means that a spam email is incorrectly classified as non-spam (missed spam).

In the context of email classification:

A false negative is typically considered worse because it means that a spam email (potentially harmful or unwanted) is not caught and ends up in the user's inbox, potentially causing inconvenience or security risks.

While a false positive is not ideal (as it might move legitimate emails to the spam folder), it is usually less severe because users can check their spam folders for important emails. However, an excessive number of false positives can still be a problem if it results in important emails being missed.
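
To make the two error types concrete, here is a minimal sketch that counts them from hypothetical labels, assuming 1 means spam and 0 means not spam. The labels and the helper `count_errors` are illustrative and are not taken from the dataset or the model above:

```python
def count_errors(y_true, y_pred, positive=1):
    """Return (false_positives, false_negatives) for a binary task."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    return fp, fn

# Hypothetical labels: 1 = spam, 0 = not spam
y_true_demo = [1, 0, 0, 1, 0, 1]
y_pred_demo = [1, 1, 0, 0, 0, 1]

fp, fn = count_errors(y_true_demo, y_pred_demo)
print("False positives (legitimate email flagged as spam):", fp)
print("False negatives (spam that reached the inbox):", fn)
```

With scikit-learn, `sklearn.metrics.confusion_matrix` reports the same counts alongside the true positives and true negatives.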

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

I did not source the code from external websites or use generative AI tools to create it. Instead, I followed a structured approach to fulfill each step of the task:

Understanding the Requirements: I started by reading and understanding the assignment requirements, which included several steps for a classification task involving email spam detection.

Importing Necessary Libraries: I began by importing the required Python libraries, which included numpy, pandas, yellowbrick, and scikit-learn. These libraries are commonly used for data manipulation, machine learning, and visualization tasks.

Data Input and Processing (Steps 1 and 2): I followed the instructions to load the spam dataset using the yellowbrick library and checked for missing values in the data, filling any with the column mean using pandas. I then used scikit-learn's `train_test_split` to create the smaller subset containing 5% of the data.

Implementing Machine Learning Model (Step 3): I imported the LogisticRegression model from scikit-learn and initialized it with the specified parameters. I created a list of datasets and used a loop to train and evaluate the model on each dataset. This step involved using machine learning best practices to split the data, fit the model, and calculate accuracies.

Validating Model (Step 4): I printed the training and validation accuracies for each dataset to evaluate the model's performance. This step involved using basic Python print statements to display the results.

Visualizing Results (Step 5): I created a pandas DataFrame to store the results, including data size, training accuracy, and validation accuracy. I printed the results DataFrame to visualize and present the findings.

I did not use generative AI tools for this task because the requirements were well-defined. However, addressing missing values and handling different dataset sizes were aspects that required careful consideration.

## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [9]:
# Step 0: Import Libraries
import numpy as np
import pandas as pd
from yellowbrick.datasets import load_concrete

# Step 1: Data Input
X, y = load_concrete()

# Print the size and type of X and y
print("Size of X:", X.shape)
print("Type of X:", type(X))
print("Size of y:", y.shape)
print("Type of y:", type(y))


Size of X: (1030, 8)
Type of X: <class 'pandas.core.frame.DataFrame'>
Size of y: (1030,)
Type of y: <class 'pandas.core.series.Series'>


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [10]:
# TO DO: Check if there are any missing values and fill them in if necessary
# Step 2: Data Processing

# Check for missing values in the dataset
missing_values = X.isnull().sum().sum()
if missing_values > 0:
    # If there are missing values, you can choose an appropriate method to fill them in.
    # For example, you can fill missing values with the mean of the column.
    X.fillna(X.mean(), inplace=True)


### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [11]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0
# Step 3: Implement Machine Learning Model
from sklearn.linear_model import LinearRegression

# Initialize the Linear Regression model
model = LinearRegression()

# Fit the model on the data
# Caveat: this fits on every row of X and y. Step 4 splits the data afterwards,
# so its "validation" rows were already seen during fitting.
model.fit(X, y)

### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [12]:
# TO DO: ADD YOUR CODE HERE
# Step 4: Validate Model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Predict the target values on the training and validation sets
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)

# Calculate mean squared error (MSE) for training and validation sets
mse_train = mean_squared_error(y_train, y_train_pred)
mse_val = mean_squared_error(y_val, y_val_pred)

# Calculate R-squared (R2) score for training and validation sets
r2_train = r2_score(y_train, y_train_pred)
r2_val = r2_score(y_val, y_val_pred)

# Print MSE and R2 score for training and validation sets
print("Training MSE:", mse_train)
print("Validation MSE:", mse_val)
print("Training R2 Score:", r2_train)
print("Validation R2 Score:", r2_val)


Training MSE: 110.53681165511478
Validation MSE: 93.91176705205963
Training R2 Score: 0.6083932726246362
Validation R2 Score: 0.6434420380341047
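
In the cells above, the model is fit on all of `X` and `y` in Step 3 and the split happens only in Step 4, so the validation rows were part of the training data. A minimal sketch of the split-then-fit order is shown below on synthetic data; the arrays, coefficients, and noise level are assumptions for illustration, so the scores do not correspond to the concrete dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (hypothetical): 8 features, linear signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = X_demo @ np.arange(1, 9) + rng.normal(scale=2.0, size=200)

# Split first, so the validation rows stay unseen during fitting
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit on the training split only
demo_model = LinearRegression().fit(X_tr, y_tr)

# Score both splits with the same fitted model
scores = {
    "train": (mean_squared_error(y_tr, demo_model.predict(X_tr)),
              r2_score(y_tr, demo_model.predict(X_tr))),
    "validation": (mean_squared_error(y_va, demo_model.predict(X_va)),
                   r2_score(y_va, demo_model.predict(X_va))),
}
for split, (mse_value, r2_value) in scores.items():
    print(f"{split}: MSE = {mse_value:.3f}, R2 = {r2_value:.3f}")
```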


### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [13]:
# TO DO: ADD YOUR CODE HERE
# Step 5: Visualize Results
# Create a pandas DataFrame to store the results
results = pd.DataFrame({
    "Training accuracy": [mse_train, r2_train],
    "Validation accuracy": [mse_val, r2_val]
}, index=["MSE", "R2 Score"])

# Print the results DataFrame
print(results)


          Training accuracy  Validation accuracy
MSE              110.536812            93.911767
R2 Score           0.608393             0.643442


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

Question 1: The linear model produced moderate results for this dataset. The R2 score was 0.61 on the training set and 0.64 on the validation set, so the model explains roughly 61% to 64% of the variance in compressive strength; the MSE was 110.54 (training) and 93.91 (validation). More generally, the performance of a linear model such as Linear Regression depends on the underlying relationship between the features and the target variable:

1) If the relationship between the concrete's ingredients, age, and compressive strength is approximately linear, a linear model like Linear Regression can produce reasonable results. The R2 scores above suggest this holds only partially.

2) If the relationship is highly nonlinear, a linear model may not capture the complexities of the data, leading to suboptimal performance. The roughly 36% of variance left unexplained is consistent with some nonlinearity, although noise in the data could also contribute.

3) Model performance should be assessed using appropriate evaluation metrics such as mean squared error (MSE) and R-squared (R2) score. Lower MSE and higher R2 indicate better model performance.
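
For reference, the two metrics named above can be written out by hand. This is a sketch with hypothetical numbers; the helper names `mse_by_hand` and `r2_by_hand` are illustrative, and in the notebook itself `mean_squared_error` and `r2_score` from scikit-learn do this work:

```python
def mse_by_hand(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_by_hand(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical targets and predictions
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.0, 5.0, 8.0, 9.0]

print("MSE:", mse_by_hand(y_true_demo, y_pred_demo))
print("R2:", r2_by_hand(y_true_demo, y_pred_demo))
```

An R2 of 1 means the predictions match the targets exactly, while an R2 of 0 means the model does no better than always predicting the mean of the targets.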

1. Data Input (Step 1):
   - I used the `load_concrete()` function from the yellowbrick library to load the concrete dataset.
   - I printed the size and type of the feature matrix `X` and the target vector `y`.

2. Data Processing (Step 2):
   - I checked for missing values in the feature matrix `X` and filled them with the mean of the respective column if necessary.
   
3. Implementing Machine Learning Model (Step 3):
   - I imported the `LinearRegression` model from scikit-learn, initialized it, and fit it to the data.
   
4. Validating Model (Step 4):
   - I split the data into training and validation sets.
   - I predicted the target values on both sets and calculated mean squared error (MSE) and R-squared (R2) score to evaluate model performance.
   
5. Visualize Results (Step 5):
   - I created a pandas DataFrame (`results`) to store the accuracy results (MSE and R2 score) for both training and validation sets.
   - I printed the `results` DataFrame to visualize and present the findings.

I did not use generative AI tools, external code sources, or websites for this assignment. The task was straightforward and followed a standard machine learning workflow. Challenges were minimal as the provided requirements were clear.

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.

In both the classification and regression parts of the assignment, some key observations can be made:

Classification (Part 1):
With all 57 features and 4600 samples, the logistic regression model performed well, with 0.927 training accuracy and 0.938 validation accuracy.
With only the first two columns, accuracy fell to 0.615 (training) and 0.593 (validation), so the model underfit even though all 4600 samples were available.
With the 230-sample subset, training accuracy rose to 0.962 while validation accuracy fell to 0.891. This gap of about 0.07 was the largest of the three tests, suggesting potential overfitting.

Regression (Part 2):
Linear regression was used to predict concrete compressive strength.
Model performance was moderate: the MSE was 110.54 (training) and 93.91 (validation), and the R2 scores were 0.61 (training) and 0.64 (validation).
The R2 scores indicate that the model explained roughly 61% to 64% of the variance in the target variable but not all of it.
Because the model was fit on all rows before the data was split in Step 4, the validation scores are not a fully independent estimate.

Interpretation:
Model choice and performance depended on data characteristics, emphasizing the importance of model selection.
The impact of dataset size on model generalization was evident in the classification results; the regression part used a single dataset size, so it gives no evidence on this point.
Linear models, while interpretable, may not capture all complexities in the data, leading to moderate performance.
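
MSE is expressed in squared units, which makes the values printed in Part 2 hard to judge on their own. A quick sketch of converting them to root mean squared error (RMSE), which is in the same units as the target, is shown below. The unit is assumed here to be MPa; the notebook itself does not state it:

```python
import math

# MSE values copied from the Part 2 output
mse_reported = {
    "training": 110.53681165511478,
    "validation": 93.91176705205963,
}

# RMSE is the square root of MSE, so it shares the target's units (assumed MPa)
rmse = {split: math.sqrt(value) for split, value in mse_reported.items()}
for split, value in rmse.items():
    print(f"{split} RMSE: {value:.2f}")
```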

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.

Liked: I enjoyed working on this assignment as it covered a range of topics, including data loading, preprocessing, model implementation, and evaluation. It provided a comprehensive view of machine learning tasks.

Interesting: It was interesting to observe how dataset size affected model performance and how linear models can be applied to both classification and regression problems.

Challenging: The main challenge was interpreting the results and understanding the impact of dataset size on model performance.

Motivating: The assignment was motivating as it encouraged exploration and experimentation with different aspects of machine learning.

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.

In [14]:
# TO DO: ADD YOUR CODE HERE
from sklearn.linear_model import Ridge, Lasso
import numpy as np

# Define a range of alpha values on a logarithmic scale
alphas = np.logspace(-3, 2, num=100)

best_r2_score_ridge = -np.inf  # Initialize the best R2 score for Ridge (R2 has no lower bound)
best_alpha_ridge = None        # Initialize the best alpha for Ridge

best_r2_score_lasso = -np.inf  # Initialize the best R2 score for Lasso
best_alpha_lasso = None        # Initialize the best alpha for Lasso

for alpha in alphas:
    # Caveat: both models are fit on all of X and y, which includes the rows in
    # X_val, so the validation R2 below is optimistic. Fitting on X_train and
    # y_train from Step 4 would give a cleaner estimate.

    # Ridge Regression
    ridge_model = Ridge(alpha=alpha)
    ridge_model.fit(X, y)
    ridge_r2 = ridge_model.score(X_val, y_val)

    # Lasso Regression
    lasso_model = Lasso(alpha=alpha)
    lasso_model.fit(X, y)
    lasso_r2 = lasso_model.score(X_val, y_val)

    # Check if Ridge achieved a better R2 score
    if ridge_r2 > best_r2_score_ridge:
        best_r2_score_ridge = ridge_r2
        best_alpha_ridge = alpha

    # Check if Lasso achieved a better R2 score
    if lasso_r2 > best_r2_score_lasso:
        best_r2_score_lasso = lasso_r2
        best_alpha_lasso = alpha

print("Best R2 Score (Ridge):", best_r2_score_ridge)
print("Best Alpha (Ridge):", best_alpha_ridge)
print("Best R2 Score (Lasso):", best_r2_score_lasso)
print("Best Alpha (Lasso):", best_alpha_lasso)


Best R2 Score (Ridge): 0.6434601168281723
Best Alpha (Ridge): 100.0
Best R2 Score (Lasso): 0.6443807477791959
Best Alpha (Lasso): 1.9179102616724888


In this code, we perform Ridge and Lasso regression using 100 alpha values spaced logarithmically from 0.001 to 100. We track the best validation R-squared (R2) score and its corresponding alpha value for both Ridge and Lasso.

- Best R2 Score (Ridge): 0.6435, reached at alpha = 100.0.
- Best R2 Score (Lasso): 0.6444, reached at alpha ≈ 1.92.

Lasso with alpha ≈ 1.92 gave the best R2 score, but the improvement over plain linear regression (validation R2 of 0.6434 in Part 2) is only about 0.001, so regularization made essentially no difference here. The best Ridge alpha sits at the upper edge of the tested range, which also suggests that Ridge had little effect within that range.

Whether the obtained R2 score is "good enough" depends on the specific application and requirements. A higher R2 score indicates better model fit, but what constitutes a good R2 score can vary from one domain to another. A model that leaves roughly 36% of the variance in compressive strength unexplained is probably not sufficient where strength predictions need to be reliable, so a nonlinear model may be worth trying.
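
The alpha sweep described above can also be run with the models fit on the training split only, so that the validation rows stay unseen. The sketch below uses synthetic data and only six alpha values to keep it short; the arrays and the `best` dictionary are assumptions for illustration, and the scores do not correspond to the concrete dataset:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data (hypothetical): 8 features, linear signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = X_demo @ np.arange(1, 9) + rng.normal(scale=2.0, size=200)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Six alpha values on a logarithmic scale: 0.001, 0.01, ..., 100
alphas_demo = np.logspace(-3, 2, num=6)

best = {}
for name, Estimator in (("Ridge", Ridge), ("Lasso", Lasso)):
    # Fit on the training split, score (R2) on the held-out validation split
    scored = [
        (Estimator(alpha=a).fit(X_tr, y_tr).score(X_va, y_va), a)
        for a in alphas_demo
    ]
    best[name] = max(scored)  # (best validation R2, alpha that produced it)
    print(f"{name}: best R2 = {best[name][0]:.4f} at alpha = {best[name][1]:g}")
```

Since alpha is chosen using the validation split, a separate test split would be needed for a final, unbiased score.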