In [None]:
# initializing otter-grader
import otter
grader = otter.Notebook()

# Lab 8: Multiple Linear Regression

In this lab, you will work with the diamond dataset. You will fit a linear model to predict the price of a diamond from its characteristics. You will get experience extracting and creating features, using techniques such as one-hot encoding and log transformation, to improve the accuracy of your model. At the end, you will get a chance to create your own features for the linear model!

**This lab should be completed and submitted by 11:59 PM on Friday May 22, 2020.**

### Collaboration Policy

Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually** and do not copy them from others. 

By submitting your work in this course, whether it is homework, a lab assignment, or a quiz/exam, you agree and acknowledge that **this submission is your own work and that you have read the policies regarding Academic Integrity**: https://studentconduct.sa.ucsb.edu/academic-integrity. The Office of Student Conduct has policies, tips, and resources for proper citation use, recognizing actions considered to be cheating or other forms of academic theft, and students’ responsibilities. You are required to read the policies and to abide by them.

*List collaborators here*

In [1]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sklearn
import altair as alt

## Preliminary

First, we load the diamond dataset and look at the fields in this dataset.

In [2]:
data = pd.read_csv("diamonds.csv.zip", index_col=0)
data.head()

Each record in the dataset corresponds to a single diamond.  The fields are

1. **carat**: The weight of the diamond, in carats.
2. **cut**: The quality of the cut. This is an *ordinal* variable which takes on a value in the set: {`Fair`, `Good`, `Very Good`, `Premium`, and `Ideal`}.
3. **color**: The color of the diamond. This is an *ordinal* variable which takes on a value from the set of characters between `J` (worst) and `D` (best).
4. **clarity**: How obvious inclusions are within the diamond. This is an *ordinal* variable that takes on a value from the set: {`I1` (worst), `SI2`, `SI1`, `VS2`, `VS1`, `VVS2`, `VVS1`, `IF` (best)}.
5. **depth**: The height of a diamond, measured from the culet to the table, divided by its average girdle diameter.
6. **table**: The width of the diamond's table expressed as a percentage of its average diameter.
7. **price**: Price of the diamond in USD.
8. **x**: Length of the diamond measured in mm.
9. **y**: Width of the diamond measured in mm.
10. **z**: Depth of the diamond measured in mm.

We are interested in **predicting the price of a diamond given its characteristics**.  Mathematically, we would like to fit a linear model with parameters $\theta$ corresponding to features $\textbf{x}$ to best capture the price of the diamonds:

$$
f_{\theta} (\textbf{x}) \rightarrow \text{Price}.
$$

## Part 1

For the first part of the lab, we will focus on a diamond's **carat**, **depth**, and **table** characteristics. Hence $\textbf{x} = [$ **carat**, **depth**, **table** $]$ for a given diamond.

We are interested in using a linear model with a bias term as our model. We could express the model mathematically as:

$$
f_\theta(\textbf{x}) = f_\theta\left(\textbf{carat}, \textbf{depth}, \textbf{table}\right)
=
\theta_0 + 
\theta_1 * \textbf{carat} +
\theta_2 * \textbf{depth} +
\theta_3 * \textbf{table}.
$$
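To make the formula concrete, here is a minimal sketch that evaluates it for a single diamond. The values in `theta` are hypothetical, chosen only for illustration and not fitted to any data; the feature values are illustrative as well.

```python
import numpy as np

# Hypothetical parameters [theta_0, theta_1, theta_2, theta_3]; NOT fitted values.
theta = np.array([100.0, 5000.0, -10.0, -20.0])

# Illustrative diamond: carat=0.23, depth=61.5, table=55.0.
# The leading 1.0 multiplies the bias term theta_0.
x = np.array([1.0, 0.23, 61.5, 55.0])

# Term-by-term evaluation of the model equation.
prediction = theta[0] + theta[1] * x[1] + theta[2] * x[2] + theta[3] * x[3]
print(prediction)  # roughly -465: an implausible price, since theta was picked arbitrarily
```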

### Question 1a
Set the variable `data1` to be a subset of the original dataframe `data` such that `data1` only contains the columns `carat`, `depth`, `table` and `price`. (Note that the order of the columns in dataframe `data1` should follow the order `carat`, `depth`, `table`, `price` in order to pass the autograder test.)

<!--
BEGIN QUESTION
name: q1a
manual: false
points: 3
gradescope: show
-->

In [3]:
data1 = ...

In the following code, we split `data1` into two variables:

(1) Target values `Y`: this consists of the prices of the diamonds.

(2) Set of features `X_features`: this is a dataframe where each row is a feature vector consisting of features $[$ **carat**, **depth**, **table** $]$ (without the bias term).

In [7]:
Y = data1['price']
X_features = data1[['carat', 'depth', 'table']]

### Question 1b 
We define a function `add_bias` that takes in a dataframe and inserts a column of 1's as its leftmost column. This function should modify the input dataframe in place. Please fill in this function with your solution. **Please name this extra column 'ones' in the dataframe.** After calling the function on a copy of `X_features`, the first five rows of the resulting dataframe will look like the following:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>ones</th>
      <th>carat</th>
      <th>depth</th>
      <th>table</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1.0</td>
      <td>0.23</td>
      <td>61.5</td>
      <td>55.0</td>
    </tr>
    <tr>
      <th>1</th>
      <td>1.0</td>
      <td>0.21</td>
      <td>59.8</td>
      <td>61.0</td>
    </tr>
    <tr>
      <th>2</th>
      <td>1.0</td>
      <td>0.23</td>
      <td>56.9</td>
      <td>65.0</td>
    </tr>
    <tr>
      <th>3</th>
      <td>1.0</td>
      <td>0.29</td>
      <td>62.4</td>
      <td>58.0</td>
    </tr>
    <tr>
      <th>4</th>
      <td>1.0</td>
      <td>0.31</td>
      <td>63.3</td>
      <td>58.0</td>
    </tr>
  </tbody>
</table>

**Hint: You might find the `DataFrame.insert` method useful, as it lets you specify the column index for the newly added column: https://www.geeksforgeeks.org/python-pandas-dataframe-insert/**
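As a rough illustration of the hint, the sketch below calls `DataFrame.insert` on a small made-up dataframe. The names `toy` and `flag` are hypothetical and unrelated to the lab data.

```python
import pandas as pd

# A made-up dataframe, unrelated to the diamond data.
toy = pd.DataFrame({"a": [0.5, 1.5], "b": [10, 20]})

# insert(loc, column, value) modifies the dataframe in place and returns None.
# loc=0 places the new column at the far left; a scalar value is broadcast to every row.
toy.insert(0, "flag", 1.0)

print(list(toy.columns))  # ['flag', 'a', 'b']
```

Because `insert` works in place, there is no need to reassign the result.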

<!--
BEGIN QUESTION
name: q1b
manual: false
points: 5
gradescope: show
-->

In [8]:
def add_bias(data):
    ...
    
X = X_features.copy() 
add_bias(X)
X.head()

### Question 1c ###
We need a loss function to evaluate how well our model approximates the prices of the diamonds. In the cell below, complete the function `avg_squared_loss`, which returns the average squared loss between the true target values `y` and our predictions `y_hat`. Note that both inputs to the function, `y` and `y_hat`, are arrays. You can assume that they have the same length.

Recall that the average squared loss is defined as:
$$
\text{Avg Squared Loss}(y, \hat{y}) = \frac{1}{n} \sum\limits_{i=1}^n (y_i - \hat{y}_i)^2
$$
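As a small worked instance of this formula, the sketch below evaluates it step by step on made-up numbers. The arrays `y_toy` and `y_hat_toy` are hypothetical.

```python
import numpy as np

# Made-up targets and predictions.
y_toy = np.array([3.0, 5.0, 7.0])
y_hat_toy = np.array([2.0, 5.0, 10.0])

residuals = y_toy - y_hat_toy   # [ 1.,  0., -3.]
squared = residuals ** 2        # [ 1.,  0.,  9.]
toy_loss = squared.mean()       # (1 + 0 + 9) / 3

print(toy_loss)
```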

<!--
BEGIN QUESTION
name: q1c
manual: false
points: 3
gradescope: show
-->

In [14]:
def avg_squared_loss(y, y_hat):
    ...

Now we are ready to build our linear model. We saw that the **predictions** for the entire data set, $\hat{\mathbb{Y}}$, with a linear model can be computed as:

$$
\hat{\mathbb{Y}} = \mathbb{X} \theta  
$$

The **covariate matrix** $\mathbb{X} \in \mathbb{R}^{n \times (d+1)}$ consists of $n$ rows where each row corresponds to a record in the dataset and the $d+1$ columns correspond to the $d$ features extracted from the data plus an additional bias term.

The following function `linear_model` computes the prediction $\hat{\mathbb{Y}}$ given parameters $\theta$ and covariate matrix $\mathbb{X}$.

In [18]:
def linear_model(theta, X): 
    return X @ theta # The @ symbol is matrix multiplication

Here the `@` symbol is the matrix multiply operation and is equivalent to writing `X.dot(theta)`.
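The equivalence can be checked on a tiny example. This is only a sketch; `X_toy` and `theta_toy` are made-up values.

```python
import numpy as np

# Three records, each with a bias entry (1.0) and a single feature.
X_toy = np.array([[1.0, 2.0],
                  [1.0, 3.0],
                  [1.0, 4.0]])
theta_toy = np.array([0.5, 2.0])

pred_at = X_toy @ theta_toy       # matrix-vector product via the @ operator
pred_dot = X_toy.dot(theta_toy)   # the same product via the dot method

print(pred_at)   # [4.5 6.5 8.5]
print(np.array_equal(pred_at, pred_dot))
```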

### Question 1d
In the cell below, choose any `theta` you would like (please note that the dimension of the `theta` you choose should match the number of columns of the covariate matrix) and make predictions for `Y` using the linear model defined above given the `theta` you chose. Assign the variable `Y_hat` with the predictions and the variable `loss` with the average squared loss of your predictions based on the `theta` you chose.

<!--
BEGIN QUESTION
name: q1d
manual: false
points: 3
gradescope: show
-->

In [19]:
...

You might notice the loss of the predictions for an arbitrary choice of `theta` is quite big. We can find the optimal `theta` by minimizing the mean square loss:

\begin{align}
L(\theta) &= \frac{1}{n}\sum_{i=1}^n \left( \mathbb{Y}_i - \left(\mathbb{X} \theta\right)_i \right)^2 \\
&= \frac{1}{n}\sum_{i=1}^n \left( \mathbb{Y}_i - \mathbb{X}_i \theta \right)^2 \\
&= \frac{1}{n} || \mathbb{Y} - \mathbb{X} \theta ||_2^2 \\
&= \frac{1}{n}\left( \mathbb{Y} - \mathbb{X}\theta \right)^T \left( \mathbb{Y} - \mathbb{X}\theta \right)
\end{align}

Taking the derivative with respect to $\theta$ and setting it equal to 0, we obtain the normal equation:

$$
(\mathbb{X}^T \mathbb{X}) \hat{\theta} = \mathbb{X}^T \mathbb{Y}
$$

Solving for $\hat{\theta}$ in the above equation gives us the minimizer of the squared loss with respect to our data.  

If $\mathbb{X}^T \mathbb{X}$ is invertible (full rank), $\hat{\theta}$ can be computed analytically as:

$$
 \hat{\theta} = \left( \mathbb{X}^T \mathbb{X} \right)^{-1} \mathbb{X}^T \mathbb{Y}.
$$
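For intuition only, here is a minimal sketch of the analytic approach on made-up data that lie exactly on the line $y = 1 + 2x$. It solves the normal equation with `np.linalg.solve` rather than forming the inverse explicitly, which is a choice made for this sketch.

```python
import numpy as np

# Made-up data: a column of ones (bias) and a single feature x = 0, 1, 2, 3.
X_toy = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0]])
Y_toy = np.array([1.0, 3.0, 5.0, 7.0])  # exactly 1 + 2 * x

# Solve (X^T X) theta = X^T Y for theta.
theta_hat = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ Y_toy)

print(theta_hat)  # approximately [1. 2.]
```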

We will not use the above analytic approach for solving $\hat{\theta}$ in this lab. Instead, we will use the `sklearn` library to fit our model and find the optimal $\theta$.

In [23]:
# Import the LinearRegression model from sklearn
from sklearn.linear_model import LinearRegression

### Question 1e
In lab7, we learned how to use the `sklearn` package to create a linear regression model, fit it to data, and get the predicted values. Today we are going to use it again. If you are not familiar with the syntax, review lab7 to refresh your memory! In the cell below,
1. Fit a linear model `model1` using `X` and `Y` defined earlier in the lab. 
2. Make predictions `Y_hat1` for `Y` using the fitted model.
3. Calculate the average squared loss `loss1` of your prediction.
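As a reminder of the general `sklearn` workflow, here is a sketch on made-up data rather than the lab's `X` and `Y`. Passing `fit_intercept=False` is an assumption of this sketch, made because the toy matrix already contains a column of ones; the settings intended for the lab are not stated here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on y = 1 + 2 * x, with a leading column of ones.
X_toy = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0]])
Y_toy = np.array([1.0, 3.0, 5.0, 7.0])

# Assumption: the ones column plays the role of the intercept, so sklearn's own is disabled.
toy_model = LinearRegression(fit_intercept=False)
toy_model.fit(X_toy, Y_toy)            # estimate the parameters
toy_pred = toy_model.predict(X_toy)    # predictions for the same rows

# Average squared loss of the predictions.
toy_loss = np.mean((Y_toy - toy_pred) ** 2)
print(toy_model.coef_, toy_loss)
```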

<!--
BEGIN QUESTION
name: q1e
manual: false
points: 6
gradescope: show
-->

In [24]:
...
loss1

In the cell below, we create a scatter plot by plotting (`Y`, `Y_hat1`). The red line is the identity line where each point on the line has the same values for the variables representing the x-axis and y-axis. If our model is very accurate, we would expect $Y \approx Y_{hat1}$, and thus all the points should be very close to the identity line. However, what do you observe in the current plot? 

In [28]:
alt.data_transformers.disable_max_rows()

source = pd.DataFrame({
    'Y': Y,
    'Y_hat1': Y_hat1
})

layer1 = alt.Chart(source).mark_circle(size=1).encode(
    x='Y',
    y='Y_hat1'
).properties(
    title='Y VS Y_hat1'
)

layer2 = alt.Chart(source).mark_line(size=1).encode(
    x='Y',
    y='Y',
    color = alt.value("red")
)

layer1 + layer2

## Part 2
For part 1, we only used the quantitative features `carat`, `depth`, `table`. As you can see from Question 1(e), the loss seems to be big. Is there a way to fit a better model by incorporating other features?

In this second part of the lab, we explore incorporating qualitative features into our model.

Recall our dataframe looks like the following:


In [29]:
data.head()

We only incorporated information about `carat`, `depth`, `table` in our previous features. Do `cut`, `color`, and `clarity` matter when it comes to predicting the prices of the diamonds?

Based on this online article https://www.pricescope.com/diamond-prices, these characteristics should matter! So let's try to incorporate these into our model!

Recall from lecture that to include qualitative variables as features, we may use one-hot encoding. The idea of one-hot encoding is to represent each possible value of the variable as its own column of 1's and 0's. For example, suppose we have a qualitative variable `smoking` that can take on either 'smoker' or 'non-smoker', as shown below:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
          <th></th>
      <th>smoking</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <th>smoker</th>
    </tr>
    <tr>
      <th>1</th>
      <th>non-smoker</th>
    </tr>
    <tr>
      <th>2</th>
      <th>smoker</th>
    </tr>
    <tr>
      <th>3</th>
      <th>non-smoker</th>
    </tr>
    <tr>
      <th>4</th>
      <th>non-smoker</th>
    </tr>
  </tbody>
</table>

After one-hot encoding, the resulting dataframe will look like:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
          <th></th>
      <th>smoker</th>
      <th>non-smoker</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <th>1</th>
      <th>0</th>
    </tr>
    <tr>
      <th>1</th>
      <th>0</th>
      <th>1</th>
    </tr>
    <tr>
      <th>2</th>
      <th>1</th>
      <th>0</th>
    </tr>
    <tr>
      <th>3</th>
      <th>0</th>
      <th>1</th>
    </tr>
    <tr>
      <th>4</th>
      <th>0</th>
      <th>1</th>
    </tr>
  </tbody>
</table>

For this lab, we will use the [`DictVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) class from the `sklearn` package to implement one-hot encoding.

Let's first examine how the model will behave if we include the `cut` feature. In the cell below, we create a new dataframe `X_features_with_cut`, which adds one more column, `cut`, to the features defined in Part 1.

In [30]:
X_features_with_cut = data[['carat', 'cut', 'depth', 'table']].copy()  # copy so add_bias does not act on a slice of data
add_bias(X_features_with_cut)
X_features_with_cut

### Question 2a 
In the cell below, complete the code so that `X_with_cut` is the new feature matrix after one-hot encoding. There are a few things we need to do:
1. Review the notebook example in the lecture on how to convert a categorical variable into a one-hot encoding matrix.
2. Adjust the index issue (Done for you already).
3. Combine the other features (except cut), and the one-hot encoding matrix together to form `X_with_cut`. You can use `pd.concat`.
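The index adjustment in step 2 matters because `pd.concat` with `axis=1` aligns rows by index label, not by position. The sketch below shows the effect on two made-up dataframes whose indices differ; the names `left` and `right` are hypothetical.

```python
import pandas as pd

# `left` has index labels 1 and 2; `right` has the default labels 0 and 1.
left = pd.DataFrame({"carat": [0.2, 0.3]}, index=[1, 2])
right = pd.DataFrame({"flag": [1.0, 0.0]})

# Labels do not match, so concat takes the union of the labels and fills gaps with NaN.
misaligned = pd.concat([left, right], axis=1)

# After resetting the index, the rows line up by position.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape, aligned.shape)  # (3, 2) (2, 2)
```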

<!--
BEGIN QUESTION
name: q2a
manual: false
points: 5
gradescope: show
-->

In [31]:
from sklearn.feature_extraction import DictVectorizer

# one-hot encoding 


cuts = ...

encoder = DictVectorizer(sparse=False)
cuts_df = pd.DataFrame(
    data = ...,
    columns = ...
)


# adjusting the index inconsistency issue
X_features_with_cut.reset_index(drop=True, inplace=True)
cuts_df.reset_index(drop=True, inplace=True)

# Combine the features together with pd.concat
X_with_cut = ...

### Question 2b
Now please fit a linear model using our new covariate matrix `X_with_cut`. Compute the average squared loss of the predictions, storing the predictions in `Y_hat2` and the loss in `loss2` (the cells below use these names). Compare this loss with the loss you computed in Question 1(e) without using the `cut` feature.

<!--
BEGIN QUESTION
name: q2b
manual: false
points: 6
gradescope: show
-->

In [37]:
...

Let us check whether the loss decreases after incorporating the `cut` feature.

In [41]:
loss2 < loss1

In [42]:
alt.data_transformers.disable_max_rows()

source = pd.DataFrame({
    'Y': Y,
    'Y_hat2': Y_hat2
})

layer1 = alt.Chart(source).mark_circle(size=1).encode(
    x='Y',
    y='Y_hat2'
).properties(
    title='Y VS Y_hat2'
)

layer2 = alt.Chart(source).mark_line(size=1).encode(
    x='Y',
    y='Y',
    color = alt.value("red")
)

layer1 + layer2

The plot looks similar to the earlier one. Can we do better?

### Question 2c
In the cell below, we consider adding `color` and `clarity` as features. Please fill in the relevant code below to one-hot encode the qualitative columns of `X_features_with_cut_color_clarity` and fit a model with the resulting covariate matrix `X_with_cut_color_clarity`.

<!--
BEGIN QUESTION
name: q2c
manual: false
points: 6
gradescope: show
-->

In [43]:
# extract the columns 'carat', 'cut', 'color', 'clarity', 'depth', 'table'
X_features_with_cut_color_clarity = data[['carat', 'cut', 'color', 'clarity', 'depth', 'table']] # Do not change this line
add_bias(X_features_with_cut_color_clarity)

cut_color_clarity = ...
encoder = ...

cut_color_clarity_df = pd.DataFrame(
    data = ...,
    columns = ...
)

# adjusting the index inconsistency issue (done for you already)
X_features_with_cut_color_clarity.reset_index(drop=True, inplace=True)
cut_color_clarity_df.reset_index(drop=True, inplace=True)

# Combine the features together 
X_with_cut_color_clarity = ...
...

model3 = ...
...
Y_hat3 = ...
loss3 = ...

loss3

Compare `loss3` with `loss2` to check if our model works better. 

In [47]:
loss3 < loss2

In [48]:
alt.data_transformers.disable_max_rows()

source = pd.DataFrame({
    'Y': Y,
    'Y_hat3': Y_hat3
})

layer1 = alt.Chart(source).mark_circle(size=1).encode(
    x='Y',
    y='Y_hat3'
).properties(
    title='Y VS Y_hat3'
)

layer2 = alt.Chart(source).mark_line(size=1).encode(
    x='Y',
    y='Y',
    color = alt.value("red")
)

layer1 + layer2

### Question 3 
Try coming up with more features to make the model perform even better! Some suggestions are: include a `log(carat)` feature with the logarithmic values of `carat` or the characteristics `x`, `y`, `z` in the feature set. Write your code in the cell below. 
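As one possible starting point, the sketch below adds a log-transformed column to a small made-up dataframe. The column name `log_carat` is hypothetical, and whether such a feature actually lowers the loss on the diamond data is something to check yourself.

```python
import numpy as np
import pandas as pd

# Made-up carat values; the real data would come from the `carat` column.
toy = pd.DataFrame({"carat": [0.25, 0.5, 1.0, 2.0]})

# Natural log of carat as an additional feature (requires strictly positive values).
toy["log_carat"] = np.log(toy["carat"])

print(toy)
```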

<!--
BEGIN QUESTION
name: q3
manual: true
points: 10
gradescope: show
-->
<!-- EXPORT TO PDF -->

In [49]:
# write your code here

### Congratulations! You have completed this assignment. Hope you enjoyed it!

# Running Built-in Tests
1. All tests are in `tests` directory
1. Each python file in `tests` is a test
1. `grader.check('testname')` runs test `'testname'`, e.g. `'q1'`
1. `grader.check_all()` runs all visible tests

In [None]:
# Run built-in checks
grader.check_all()

In [None]:
# Generate pdf in classic notebook (does not work in JupyterLab)
import nb2pdf
nb2pdf.convert('lab8.ipynb')

# To generate pdf using command-line, run in terminal,
# nb2pdf lab8.ipynb

# Submission Checklist
1. Check filename is 'lab8.ipynb'
1. Save file to confirm all changes are on disk
1. Run *Kernel > Restart & Run All* to execute all code from top to bottom
1. Check `grader.check_all()` output
1. Save file again to write any new output to disk
1. Check generated pdf that all responses are displayed correctly
1. Submit to Gradescope