# SDS 3386 Assignment 5

2023 Fall, Tanya Schmah. Due **Monday November 20th** at 11:59pm.

Total Marks: 20

**Joe Zhang, Mohamed Shakir, Rahul Atre**

This assignment may be done individually or in a group of 2 or 3 (maximum); it's your choice.

Complete the assignment by editing this file. If working in a group, combine your files into one clean copy. Before submission, check everything by selecting Kernel->"Restart and Run All". Submit:

1) a **single** Notebook file, named in the format:

    SDS3386_A5_Name1_Name2_Name3.ipynb
   
    where "Name1" etc. is the last name of a group member. 

2) the corresponding single HTML file. (Check that it shows all output.)

3) **Optionally, if you prefer:** you may submit your answer to Question 3 in a separate PDF, using a similar naming convention.

Each student should submit a copy of the same assignment to Brightspace, with a comment (added in the box on the assignment submission page) that lists the names of your group members.

In [1]:
import pandas as pd
import numpy as np
import altair as alt

In [None]:
# Only run these lines if you need to
import warnings
warnings.filterwarnings('ignore')

The first two questions are about regression, using Scikit-learn. See:

- Week 7 Lab
- https://jakevdp.github.io/PythonDataScienceHandbook/05.06-linear-regression.html .


## Question 1: Simple linear regression (7 marks)

The aim of this question is to predict penguin body mass from flipper length.

### A)

Load the penguins dataset and drop all rows with any missing values. Plot flipper length vs. body mass, choosing axis scales so that the data points are easy to see and the chart doesn't have lots of white space.

*Note: it would in general be best practice to first select only the variables we are using and then drop rows with missing data. But doing it this way ensures that we consider the same penguins as in Question 2.*
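The difference between the two orders of operations in the note above can be sketched on a small made-up table. The dataframe `toy` and its values are hypothetical and are not taken from the penguins file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is missing only `sex`, row 2 is missing flipper length.
toy = pd.DataFrame({
    "flipper_length_mm": [181.0, 186.0, np.nan, 193.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0],
    "sex": ["MALE", None, "FEMALE", "FEMALE"],
})

cols = ["flipper_length_mm", "body_mass_g"]

# Order used in this assignment: drop rows with ANY missing value, then select columns.
drop_first = toy.dropna()[cols]

# Order suggested as general best practice: select columns, then drop rows.
select_first = toy[cols].dropna()

print(len(drop_first), len(select_first))
```

Dropping first also discards row 1, whose only gap is in a column that the simple regression never uses. Selecting first keeps it.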

### B)

Create two dataframes: `X` containing only the flipper length column, and `y` containing only the body mass column.

Split the data randomly into training and test sets, containing 2/3 and 1/3 of the data respectively, using `sklearn.model_selection.train_test_split`, as follows:

`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)`
     
Notes: 

- `random_state` is the "seed" of the random number generator. If you choose a different seed (or don't specify it, so that a default is used), you will probably get slightly different results. Using the specified random state makes it easier to compare methods, and makes the assignment easier to mark.

- Because `X` and `y` are dataframes, the output of `train_test_split` is four Pandas dataframes.
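As a rough illustration of the two notes above, the sketch below splits a small made-up dataset twice with the same seed. The names `X_toy` and `y_toy` are hypothetical. It assumes Scikit-learn is installed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 9 rows, one predictor column and one target column.
X_toy = pd.DataFrame({"x": range(9)})
y_toy = pd.DataFrame({"y": [10 * v for v in range(9)]})

# Two independent calls with the same seed.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=1)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=1)

# Type and sizes of the pieces.
print(type(X_train_a).__name__, len(X_train_a), len(X_test_a))

# Same seed, same rows in the test set.
print(X_test_a.index.equals(X_test_b.index))

# X and y are split along the same rows.
print(X_test_a.index.equals(y_test_a.index))
```

Because the split depends only on the number of rows and the seed, two different `X` dataframes with the same rows, split with the same `random_state`, should give the same `y_train` and `y_test`.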

### C)

Using `sklearn.linear_model.LinearRegression`, fit a linear model, using only the training data. Print the coefficients of the fitted model, which are: the slope `model.coef_[0]` and the intercept `model.intercept_`. Do a sanity check on this model, by calculating "by hand" the `y` value predicted by this model if `X` equals $200$, using only arithmetic (i.e. python but not Scikit-learn).

*Using your chart above, check that your predicted y value is a plausible body mass for a penguin with flipper length 200 mm. (No need to write anything about this.)*
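The "by hand" check can be sketched on made-up data that lies exactly on the line $y = 2x + 1$, so the expected answer is known in advance. The data and the names `X_toy`, `y_toy` are hypothetical. Note that when `y` is a one-column dataframe, as in this assignment, the fitted coefficients come back as small arrays, so extra indexing may be needed to get plain numbers.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data lying exactly on y = 2x + 1 (not penguin measurements).
X_toy = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0]})
y_toy = pd.DataFrame({"y": [1.0, 3.0, 5.0, 7.0]})

model = LinearRegression().fit(X_toy, y_toy)

# With a one-column dataframe as y, coef_ has shape (1, 1) and intercept_ has shape (1,).
print(model.coef_.shape, model.intercept_.shape)
slope = model.coef_[0][0]
intercept = model.intercept_[0]

# Prediction by arithmetic only.
x_new = 200
by_hand = slope * x_new + intercept

# The same prediction from Scikit-learn, for comparison.
from_sklearn = model.predict(pd.DataFrame({"x": [x_new]}))[0][0]

print(by_hand, from_sklearn)
```

Both numbers should agree up to floating-point rounding. A large disagreement usually means the slope and intercept were mixed up or indexed incorrectly.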

### D)

Evaluate the model on the test dataset, i.e. for every penguin in the test set (only), use your linear model from (C) to predict body mass from flipper length, storing results in an array called `y_test_pred`. Make a scatter plot of true vs predicted mass.

### E)

Write your own function to calculate the Mean Squared Error between true and predicted body mass:
$$
MSE = \frac{1}{N} \sum_{i=1}^N (y_i - \hat y_i)^2,
$$
where $y_i$ and $\hat y_i$ are the true and predicted values of $y$ (i.e. body mass) for the $i^{th}$ observation (i.e. $i^{th}$ penguin here).

Evaluate this function on the test dataset. *You may need to use `.to_numpy()` .*
The result is called the **Test MSE**. This is one of the most common metrics for evaluating regression methods.

Check that you get the same result as given by `sklearn.metrics.mean_squared_error`.

*Note: a common alternative is Root Mean Squared Error (RMSE), which is just $\sqrt{\textrm{MSE}}$, the advantage being that it has the same units as the dependent variable $y$.*
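A minimal sketch of the MSE formula on four made-up numbers, small enough to verify by hand. The function name `mse` and the toy values are illustrative choices only. Both inputs are flattened first, since subtracting an array of shape `(N, 1)` from one of shape `(N,)` would broadcast to an `(N, N)` matrix rather than give `N` differences.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float).ravel()  # flatten to shape (N,)
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical toy values: errors are 1, 0, -1, -2, so squared errors are 1, 0, 1, 4.
y_true_toy = [3.0, 5.0, 7.0, 9.0]
y_pred_toy = [2.0, 5.0, 8.0, 11.0]

toy_mse = mse(y_true_toy, y_pred_toy)   # (1 + 0 + 1 + 4) / 4 = 1.5
toy_rmse = np.sqrt(toy_mse)             # same units as y

print(toy_mse, toy_rmse)
```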

### F)

Write your own function to calculate $R^2$, the coefficient of determination, defined as:
$$
R^2 = 1 - \frac{RSS}{TSS},
$$
where: 
- RSS is the *residual sum of squares*, which is the same thing as the Sum of Squared Errors (SSE), i.e. 
$\sum_i(y_i - \hat y_i)^2$, where $y_i$ are the true values and $\hat y_i$ are the 
predicted values; and
- TSS is the *total sum of squares*, which is the sum of $(y_i - \bar y)^2$, where
$y_i$ are the true values and $\bar y$ is the mean of those true values.

Apply your function to the *test data*. The result is called **Predicted $R^2$** ("predicted" meaning calculated from test data not used in the fitting of the model).

Check that your answer is the same as the one given by  `sklearn.metrics.r2_score`.

*Note: $R^2$ is easier to interpret since it has a maximum value of $1$, attained when predictions are perfect. But on a fixed dataset it is just a decreasing linear function of MSE, namely $R^2 = 1 - N \cdot \textrm{MSE} / TSS$, since TSS does not depend on the predictions. MSE is a more common metric in machine learning.*
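The definition of $R^2$ and its link to MSE can be checked on the same kind of tiny made-up example. The function name `r_squared` and the toy values are illustrative only. Since $RSS = N \cdot MSE$, the two expressions below should agree.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - RSS / TSS."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1 - rss / tss

# Hypothetical toy values: RSS = 1 + 0 + 1 + 4 = 6, mean of y is 6, TSS = 9 + 1 + 1 + 9 = 20.
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.0, 5.0, 8.0, 11.0])

r2 = r_squared(y_true_toy, y_pred_toy)            # 1 - 6/20 = 0.7

# The same value obtained from MSE, using RSS = N * MSE.
n = len(y_true_toy)
mse_toy = np.mean((y_true_toy - y_pred_toy) ** 2)
tss_toy = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_from_mse = 1 - n * mse_toy / tss_toy

print(r2, r2_from_mse, r_squared(y_true_toy, y_true_toy))
```

The last printed value is the $R^2$ of perfect predictions, which is the maximum value of $1$.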

### G)

Apply your model to the *training* data. Calculate MSE and $R^2$.

*Usually, a model fitted to training data fits that training data better than it fits independent test data. So Training MSE is usually lower than Test MSE, and $R^2$ (which by default is calculated on the training data) is usually higher than Predicted $R^2$ (calculated on test data). This is the case here, with this random split, though not for all possible splits.*

## Question 2: Multiple linear regression (7 marks)

The aim of this question is to predict penguin body mass from other variables in the dataset.

There are four continuous variables, and three categorical ones (species, island, sex). To use the latter, we first have to encode them as numerical values, as shown below or alternatively using e.g. `pandas.get_dummies` or `sklearn.preprocessing.OneHotEncoder`. You might want to try these functions and/or use all columns, but for this assignment it's optional, and we instead propose a simple model using only three predictors.
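For the optional route, here is a small sketch of one-hot encoding with `pandas.get_dummies` on a made-up table. The dataframe `toy` is hypothetical. Using `drop_first=True` drops one level per variable, a common choice in linear models to avoid redundant columns, though other choices are possible.

```python
import pandas as pd

# Hypothetical toy rows using category labels like those in the penguins data.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
})

# One 0/1 column per category level, except the first level of each variable.
encoded = pd.get_dummies(toy, columns=["species", "sex"], drop_first=True, dtype=int)

print(encoded)
```

The generated column for sex is named `sex_MALE` here, following the `get_dummies` convention of joining the column name and the level. It carries the same information as the `sex_male` column built by hand in the cell for this question.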

Run the following cell to load the penguins dataset, drop all rows with any missing values, and then encode the `sex` variable as a binary numerical variable `sex_male`. 

In [None]:
df = pd.read_csv('penguins.csv').dropna()

df['sex_male'] = df['sex'].apply(lambda x: 1 if x == 'MALE' else 0)

Run the following cell to create two dataframes: X containing the columns 'flipper_length_mm', 'bill_length_mm' and 'sex_male' and y containing only the body mass column; and then split the data randomly into training and test sets.

*Note: `y_train` and `y_test` will be the same as in Question 1.*

In [None]:
from sklearn.model_selection import train_test_split  # in case it was not imported in Question 1

X = df[['flipper_length_mm', 'bill_length_mm', 'sex_male']]
y = df[['body_mass_g']]

X_train2, X_test2, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

### A)

Fit a linear model to the training data using `sklearn.linear_model.LinearRegression`. Print the coefficients of the fitted model, which are stored in `model.coef_` (the slopes, one per predictor) and `model.intercept_`.

Are any of these the same as the simple linear regression coefficients found in Question 1?

### B)

Evaluate this model on the test dataset, storing results in an array called `y_test_pred2`. Calculate the Test MSE and Predicted $R^2$, and compare these with the results from the simple linear regression in Question 1.

This multiple regression model performs better than simple regression on flipper length only, by both of these metrics.

*Remark: adding "relevant" predictor variables usually improves results, as here, but not always. Regularization often helps, as in Part (C).*

### C)

One of the great things about Scikit-learn is that it offers hundreds of classification and regression methods with the same API, so once you've learned how to use one method, you can try many others without even needing to know how they work.

Run the next cell to initialize Ridge Regression, which can be interpreted as multiple linear regression with a prior belief that regression coefficients should be small.

In [None]:
from sklearn.linear_model import Ridge

ridge_model = Ridge(alpha=10.0)  # higher alpha means stronger regularization

Fit a ridge regression model using the same training data as above, and evaluate it on the same test data. Print the Test MSE and Predicted $R^2$. 
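To see what "a prior belief that regression coefficients should be small" does in practice, here is a NumPy-only sketch of ridge regression in closed form on synthetic data. It assumes the usual ridge objective $\|y - Xw\|^2 + \alpha \|w\|^2$ with an unpenalized intercept. The helper `ridge_fit` and the synthetic data are hypothetical and are not part of Scikit-learn.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge fit; the intercept is left unpenalized by centering the data."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the coefficient vector w.
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept

# Synthetic data (an assumption for illustration): 50 rows, 3 predictors, small noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([3.0, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=50)

# Length of the coefficient vector for increasing regularization strength.
alphas_toy = [0.0, 10.0, 1000.0]
norms = [np.linalg.norm(ridge_fit(X_toy, y_toy, a)[0]) for a in alphas_toy]

for a, nrm in zip(alphas_toy, norms):
    print(a, nrm)
```

The coefficient vector shrinks towards zero as alpha grows, and with alpha equal to zero the fit coincides with ordinary least squares.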

### D)

Try ridge regression with at least 10 different alpha values including $0.01$ and $1000$. Plot Test MSE vs. alpha. Use a log scale for alpha and start the MSE axis at a nonzero value.

*Note: alpha = 0 makes ridge regression equivalent to linear regression, as you can check. **However,** don't use alpha = 0 here because log(0) is undefined, and the question asks for a log scale.*
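One possible way to choose the alpha values is a grid that is evenly spaced on a log scale, sketched below with `numpy.logspace`. The grid size of 11 is an arbitrary choice that satisfies "at least 10" and includes both required endpoints.

```python
import numpy as np

# 11 values from 10**-2 to 10**3, evenly spaced in log10 (steps of half a decade).
alphas = np.logspace(-2, 3, num=11)

print(alphas)
print(np.log10(alphas))
```

On a log-scaled axis these points appear equally spaced. In Altair, a log axis can be requested with `alt.Scale(type='log')` and an axis that does not start at zero with `alt.Scale(zero=False)`.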

## Question 3 (6 marks)

List three ways one can try to prevent, detect or mitigate algorithmic bias.
Briefly describe and discuss each of the three, in at most three sentences each. As a starting point, see:

- https://www.technologyreview.com/2019/02/04/137602

- https://en.wikipedia.org/wiki/Algorithmic_bias

- https://youtu.be/fMym_BKWQzk (especially from 47:00)

- Luk Arbuckle's CANSSI Lecture shown in Week 8.

Your answer to this question should be at most one page long, including references.