# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: Mohammed Atifkhan Pathan

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [1]:
import numpy as np
import pandas as pd

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library:
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [2]:
# TO DO: Import spam dataset from yellowbrick library
from yellowbrick.datasets import load_spam
X, y = load_spam()
# TO DO: Print size and type of X and y
print("Size of X is", X.size)
print("Shape of X is", X.shape)
print("Type of X is", type(X).__name__)
print("\n")
print("Size of y is", y.size)
print("Shape of y is", y.shape)
print("Type of y is", type(y).__name__)

Size of X is 262200
Shape of X is (4600, 57)
Type of X is DataFrame


Size of y is 4600
Shape of y is (4600,)
Type of y is Series


### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [3]:
# TO DO: Check if there are any missing values and fill them in if necessary
print("The total number of missing values in X are", X.isnull().sum().sum())
print("The total number of missing values in y are", y.isnull().sum())

The total number of missing values in X are 0
The total number of missing values in y are 0


Therefore, there is no need to fill in any values, as the dataset has no missing values.

For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [4]:
# TO DO: Create X_small and y_small
from sklearn.model_selection import train_test_split

_, X_small, _, y_small = train_test_split(X, y,
                                          test_size=0.05,
                                          stratify=y,
                                          random_state=0)

print("Length of X is", len(X), ", 5% of that is", 0.05*len(X))
print("Shape of X_small is", X_small.shape)
print("Shape of y_small is", y_small.shape)

Length of X is 4600 , 5% of that is 230.0
Shape of X_small is (230, 57)
Shape of y_small is (230,)


The shapes of `X_small` and `y_small` match 5% of the data (230 rows).
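To illustrate what `stratify=y` contributes when drawing the 5% subset, here is a minimal pure-Python sketch of stratified sampling. It is not scikit-learn's implementation: the helper `stratified_sample` and the 600/400 toy labels are made up for illustration, and the rounding rule is a simplifying assumption.

```python
import random
from collections import Counter

def stratified_sample(labels, fraction, seed=0):
    """Pick `fraction` of the indices from each class separately (hypothetical helper)."""
    rng = random.Random(seed)
    # Group row indices by their class label
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # Sample the same share from every class so the class ratio is preserved
    chosen = []
    for indices in by_class.values():
        k = round(len(indices) * fraction)
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)

# Toy labels (assumption): 600 non-spam (0) and 400 spam (1)
labels = [0] * 600 + [1] * 400
subset = stratified_sample(labels, 0.05)
counts = Counter(labels[i] for i in subset)
print(len(subset), dict(counts))
```

The subset keeps the same 60/40 class ratio as the full toy data, which is the property the stratified split is meant to give `X_small` and `y_small`.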

### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets:
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [27]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

from sklearn.linear_model import LogisticRegression
# Instantiate model
model = LogisticRegression(max_iter=2000)

# Create an empty dataframe to store the results
results = pd.DataFrame(columns=['Data size', 'Training accuracy', 'Validation accuracy'])

# make another copy of X with just first 2 columns
X_two_col = X.iloc[:, :2]

# Implement machine learning models with different datasets
for X_vals, y_vals in zip([X, X_two_col, X_small],[y, y, y_small]):
    X_train, X_test, y_train, y_test = train_test_split(X_vals, y_vals, stratify=y_vals, random_state=0)
    logreg = model.fit(X_train, y_train)
    val_acc = logreg.score(X_test, y_test)
    train_acc = logreg.score(X_train, y_train)
    new_row = pd.DataFrame({
        'Data size': [X_vals.size],
        'Training accuracy': [train_acc],
        'Validation accuracy': [val_acc]
    })
    results = pd.concat([results, new_row], ignore_index=True)
print("Results:\n")
print(results)

Results:

  Data size  Training accuracy  Validation accuracy
0    262200           0.933043             0.930435
1      9200           0.619420             0.605217
2     13110           0.930233             0.913793
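Note that the `Data size` column is `X_vals.size`, i.e. rows × columns, not the number of emails. The following arithmetic sketch uses the shapes printed earlier and assumes the default 25% test share of `train_test_split` (the loop does not pass `test_size`) to show how many rows each validation score rests on. The helper name `split_summary` is hypothetical.

```python
import math

def split_summary(rows, cols, test_share=0.25):
    """Return element count and split sizes for a dataset shape (hypothetical helper)."""
    n_val = math.ceil(test_share * rows)  # scikit-learn rounds the test count up
    return {
        "size": rows * cols,                # what DataFrame.size reports
        "n_train": rows - n_val,
        "n_val": n_val,
        "points_per_error": 100 / n_val,    # accuracy change (in %) per misclassified row
    }

shapes = {"full": (4600, 57), "two columns": (4600, 2), "5% subset": (230, 57)}
summaries = {name: split_summary(r, c) for name, (r, c) in shapes.items()}
for name, s in summaries.items():
    print(f"{name:12s} size={s['size']:>6d} train={s['n_train']:>4d} "
          f"val={s['n_val']:>4d} one error = {s['points_per_error']:.2f} points")
```

Under these assumptions, the 5% subset is validated on 58 emails, so a single misclassification moves its validation accuracy by about 1.7 percentage points, compared with under 0.1 points on the full data.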


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

Answer:
1. As the printed results show, the largest dataset (`262,200` values) has the highest accuracy (0.933 training, 0.930 validation), while the smallest (`9,200` values) has the lowest (0.619 training, 0.605 validation). This suggests that using less data can significantly lower both the training and validation accuracy.

  Looking further into the values, let's compare the full dataset with the 5% subset. The accuracies are similar but still _lower for the smaller set_ (training 0.930 vs 0.933). Interestingly, the validation score drops even further in comparison (0.914 vs 0.930). This may be because there is even less _new_ data to test the accuracy against, so each wrong prediction matters more in percentage terms. In addition, with such a small training set, the test set could be more challenging for the model to predict.

  The `5% dataset` and the `2 columns` dataset differ so much in accuracy, even though their apparent _size_ is similar (`13,110` vs `9,200` values), because of how the data was reduced. The 5% subset keeps all the columns and `5% of the total instances/rows`, so it samples the data without losing features. The sampling reduces the data, but in this case the subset was still big enough to generalize and predict with high accuracy. Using only 2 columns from X is very different: _only `2 features`_ are taken into account. This simplifies the data by removing features (words that are used to filter spam) and, in turn, removes information that the model could use to make accurate predictions. Basically, instead of using many different words to predict spam, the model used only two. Hence, the accuracy of the model suffers significantly.

2. A false positive means that a legitimate email was identified as spam. A false negative means that a spam email was not filtered (not identified as spam). In this case the false positive is worse: filtering out a good email by wrongly classifying it as spam could hide an important message. With false negatives, it may always be hard to predict and filter out every spam email, but a human can act as a second line of defense by looking at the email and discarding spam that was not filtered initially.
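As a small illustration of the false positive / false negative distinction, the sketch below counts the four outcomes by hand. The labels are made up (`True` = spam) and the helper `confusion_counts` is hypothetical; it is not part of the assignment code.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, treating True (spam) as positive."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted and actual:
            counts["tp"] += 1   # spam correctly filtered
        elif predicted and not actual:
            counts["fp"] += 1   # good email wrongly filtered
        elif not predicted and actual:
            counts["fn"] += 1   # spam that reached the inbox
        else:
            counts["tn"] += 1   # good email correctly delivered
    return counts

# Made-up labels for six emails (True = spam)
y_true = [True, True, False, False, False, True]
y_pred = [True, False, False, True, False, True]

counts = confusion_counts(y_true, y_pred)
precision = counts["tp"] / (counts["tp"] + counts["fp"])  # penalizes false positives
recall = counts["tp"] / (counts["tp"] + counts["fn"])     # penalizes false negatives
print(counts, round(precision, 3), round(recall, 3))
```

If false positives are the costlier error, precision is the metric to watch alongside plain accuracy, since it drops whenever a good email is flagged as spam.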

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

1. The process was fairly simple. The code was sourced mainly from my own knowledge and from the notes/examples given in class. There were instances when I had to use generative AI for help, mainly with syntax-related issues.
2. I think the steps outlined were well organized, so I tried to follow them as they are. I did realise after I got to Step 5 that I had to make a loop, so I had to go back and essentially fit Steps 3, 4, and 5 into one.
3. The prompts I used were mainly about explaining things or snippets of code. For example: `"Create a pandas DataFrame results with columns: Data size, training accuracy, validation accuracy"` or `"how can I use only the first two columns of a dataframe"`. I find it very helpful for things like syntax and explaining snippets of code. I did have to adjust the code to fit my problem and variables.
4. The biggest challenge with using ChatGPT is that even with context it is hard to explain exactly what you need. And since it is generative AI, it predicts text from what it has seen, which is code that does not always apply to what I want. The most success I have found is on smaller-scale things. If I have a solution in mind and I know the overall steps, I can break down the steps, ask it to help with one independent piece at a time, and then combine it all myself. That is mainly how I have started to use ChatGPT, as it often struggles if you copy-paste a whole code snippet plus the problem question and all the required deliverables at once. Its answers are sometimes not detailed, not relevant, incomplete, etc.

Tools and websites used:
- ChatGPT
- BingAI

## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library:
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [36]:
# TO DO: Import concrete dataset from yellowbrick library
from yellowbrick.datasets import load_concrete
X, y = load_concrete()
# TO DO: Print size and type of X and y
print("Size of X is", X.size)
print("Shape of X is", X.shape)
print("Type of X is", type(X).__name__)
print("\n")
print("Size of y is", y.size)
print("Shape of y is", y.shape)
print("Type of y is", type(y).__name__)

Size of X is 8240
Shape of X is (1030, 8)
Type of X is DataFrame


Size of y is 1030
Shape of y is (1030,)
Type of y is Series


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [37]:
# TO DO: Check if there are any missing values and fill them in if necessary
print("The total number of missing values in X are", X.isnull().sum().sum())
print("The total number of missing values in y are", y.isnull().sum())

The total number of missing values in X are 0
The total number of missing values in y are 0


Therefore, there is no need to fill in any values, as the dataset has no missing values.

### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [43]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0

from sklearn.linear_model import LinearRegression

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Initialize and train a linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [45]:
# TO DO: ADD YOUR CODE HERE
from sklearn.metrics import mean_squared_error, r2_score

# Make predictions
train_ypred = linear_model.predict(X_train)
val_ypred = linear_model.predict(X_test)

### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [49]:
# TO DO: ADD YOUR CODE HERE
# Create a dataframe to store the results
results = pd.DataFrame({'Training accuracy': [0, 0],
                        'Validation accuracy': [0, 0]},
                        index=['MSE', 'R2 score'])

# calculate accuracy results and update results
results.loc['MSE'] = [mean_squared_error(y_train, train_ypred), mean_squared_error(y_test, val_ypred)]
results.loc['R2 score'] = [r2_score(y_train, train_ypred), r2_score(y_test, val_ypred)]

display(results)

          Training accuracy  Validation accuracy
MSE              111.358439            95.904136
R2 score           0.610823             0.623414


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

I don't think the results were that good, since the R2 score was `~0.6` (a perfect fit would be 1) and the MSE was `~100`, which is not very close to 0. They could still be considered decent. However, I think a linear model is not a good choice for this dataset because, as described on the Yellowbrick page, the relationship between the features (ingredients and age) and the target (concrete compressive strength) is highly nonlinear. That is why linear regression is not the best fit here and why the results were okay at best.
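To make the two metrics easier to read, here is a minimal from-scratch sketch of MSE and R2, plus the root of the reported MSE values. Because MSE is in squared units of the target, its square root (RMSE) gives a typical error in the same units as the strength values. The function names and toy numbers are made up for illustration; only the two MSE values come from the results table.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average squared difference between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot; 1 is a perfect fit, 0 matches always predicting the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy targets (assumption) with two reference predictions
toy_true = [10.0, 20.0, 30.0, 40.0]
toy_mean = [25.0] * 4                     # always predict the mean
r2_perfect = r2(toy_true, toy_true)
r2_baseline = r2(toy_true, toy_mean)
print(r2_perfect, r2_baseline, mse(toy_true, toy_mean))

# RMSE from the MSE values reported in the results table
rmse_train = math.sqrt(111.358439)
rmse_val = math.sqrt(95.904136)
print(round(rmse_train, 2), round(rmse_val, 2))
```

Read this way, an R2 of about 0.6 means the model explains roughly 60% of the variance in strength, and the typical prediction error is around 10 units of strength.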

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

1. The code was sourced mainly from my own knowledge and from the notes/examples given in class. There were instances when I had to use generative AI for help in understanding the different accuracy metrics.
2. I think the steps outlined were well organized, so I followed them as they are.
3. The prompts I used were mainly about explaining the accuracy metrics and how to interpret them. For example: `"What does R2 score and MSE of a model describe"` or `"My scores were xx and xx, how do I interpret them in this context"`. Since no code was being generated, I did not have to adjust anything. It explained what I needed to understand to answer the questions.
4. The biggest challenge with using ChatGPT is that even with context it is hard to explain exactly what you need. And since it is generative AI, it predicts text from what it has seen, which is code that does not always apply to what I want. The most success I have found is on smaller-scale things. In this case, I used it to explain concepts or work through an example to make them easier to understand. Then I could look at my own results and apply that understanding to them.

Tools and websites used:
- ChatGPT
- BingAI

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.

The only pattern I see is the whole machine learning framework itself. As mentioned in class as well, the steps for these analyses and creating models and getting the data ready are relatively the same. They can all be reduced to 4-5 parts as described in the two different problems above.

But in terms of quantitative patterns, the classification problem showed a pattern of `more data = higher accuracy`. For example, comparing the whole dataset with 5% of the data, the validation score was ~0.91 for the 5% subset and ~0.93 for the whole dataset.

Another observation was that a problem requires not only sufficient data but _also_ a relevant model. We saw that in Part 2, where the model was not well suited to the data and the relationships within it. Even with over 1,000 data points available, a linear model is not the best choice for a very non-linear relationship. The scores (R^2 of ~0.6 and MSE of ~100) were not indicative of a highly accurate model, especially in the context of determining concrete compressive strength based on the age and ingredients of that concrete.

This shows that the challenge in machine learning lies more in selecting models, experimenting with and fine-tuning parameters, and feeding in larger and more complete data. This ultimately depends even more on the problem itself, as different approaches yield different results for different problems. Even after getting everything "right", the results can be interpreted differently based on the context of the problem: a high accuracy score (92%, for example) may not be good enough in some cases, while in others even a lower score might do.


## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


I liked that this assignment is very structured. That makes it less confusing and easier to get through the code and focus on what the results entail.

I found the datasets interesting. They were simple and made the calculations easy, but I could still see what a real-world example could look like, which was cool.

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.
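One way to read "along the logarithmic scale" is one candidate per power of ten. The sketch below builds such a grid in plain Python. The choice of six points (one per decade) is an assumption, since the number of values to test is not specified; `np.logspace(-3, 2, 6)` would produce the same values with NumPy.

```python
# Candidate alpha values from 0.001 to 100, one per power of ten (assumed spacing)
alphas = [10.0 ** exponent for exponent in range(-3, 3)]

# On a log scale the steps are even: each value is 10x the previous one
ratios = [later / earlier for earlier, later in zip(alphas, alphas[1:])]
print(alphas)
print(ratios)
```

Each of these values could then be passed as the `alpha` argument when instantiating `Ridge` or `Lasso`, comparing the validation R^2 across the grid.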

In [None]:
# TO DO: ADD YOUR CODE HERE

*ANSWER HERE*