## Self-Study Colab Activity 7.2: Defining, Computing, and Optimizing Loss

**Expected Time = 60 Minutes**


This activity focuses on computing the L2 loss, measured here as the mean squared error (MSE), for a range of values of $\theta$ and identifying the value of $\theta$ that minimizes it.

## Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)
- [Problem 5](#Problem-5)

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

### The Dataset

The geyser data from seaborn is loaded below. You will build a model that uses the waiting time to predict the duration of the geyser eruption.

Note that this model will not have an intercept term.  

In [2]:
geyser = sns.load_dataset('geyser')

In [3]:
geyser.head()

|   | duration | waiting | kind  |
|---|----------|---------|-------|
| 0 | 3.6      | 79      | long  |
| 1 | 1.8      | 54      | short |
| 2 | 3.333    | 74      | long  |
| 3 | 2.283    | 62      | short |
| 4 | 4.533    | 85      | long  |


[Back to top](#Index:) 

## Problem 1

### Creating an array of $\theta$'s


Below, create an array of 100 equally spaced values between -1 and 1. Use the `np.linspace` function demonstrated in the lectures and assign your answer as a numpy array to `thetas` below.

In [7]:

thetas = np.linspace(-1, 1, 100)


# Answer check
print(type(thetas)) # This will print the type of the object (should be <class 'numpy.ndarray'>)
print(thetas.shape) # This will print the shape of the array (should be (100,))

<class 'numpy.ndarray'>
(100,)
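
As a quick check on what this grid contains, the sketch below inspects the endpoints and the spacing of the same `np.linspace` call. The variable name `step` is introduced here only for illustration.

```python
import numpy as np

# Same call as in the exercise: 100 points from -1 to 1, both endpoints included.
thetas = np.linspace(-1, 1, 100)

# 100 points leave 99 gaps between -1 and 1, so the spacing is 2/99.
step = thetas[1] - thetas[0]

print(thetas[0], thetas[-1])  # first and last grid values
print(step)                   # distance between neighbouring thetas
```

Because the grid has an even number of points, it does not contain 0 itself. Any minimizer found on this grid is therefore only accurate to within about half a step.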


[Back to top](#Index:) 

## Problem 2

### The Model



In this assignment, our model takes the form:

$$\text{duration} = \text{waiting} \times \theta$$

Multiply the values in the `waiting` column of the `geyser` dataset by `0.8` to create a prediction for the case of $\theta = 0.8$. Assign them as a Series to the variable `prediction` below.

In [9]:

theta = 0.8
prediction = geyser['waiting'] * theta

# Answer check
print(type(prediction))  # This will print <class 'pandas.core.series.Series'>
print(prediction.shape)  # This will print the shape of the resulting series

<class 'pandas.core.series.Series'>
(272,)
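
To see what the multiplication does row by row, here is a minimal sketch on the first three waiting times from the table above. It builds a small Series by hand instead of loading the dataset, so it is an illustration, not the graded answer.

```python
import pandas as pd

# First three waiting times from the table above (not the full dataset).
waiting = pd.Series([79, 54, 74], name="waiting")

theta = 0.8
# Multiplying a Series by a scalar is element-wise and keeps the index.
prediction = waiting * theta

print(prediction)
```

The result is again a Series with the same index as `waiting`, which is why the answer check above reports one prediction per row of the dataset.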


[Back to top](#Index:) 

## Problem 3

### Determining Mean Squared Error




Use the `mean_squared_error` function to calculate the MSE between the `duration` column of the `geyser` DataFrame and the predictions `0.8 * geyser['waiting']`.

Use the `float` function to convert your result to a float.

Assign your result as a float to `mse` below.
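
For reference, the quantity being computed can be written out for the no-intercept model above. The index $i$ and the row count $n$ are notation introduced here; $n = 272$ for this dataset, matching the shape printed in Problem 2.

$$\text{MSE}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(\text{duration}_i - \theta \times \text{waiting}_i\right)^2$$

In this problem $\theta = 0.8$, so each term is the squared difference between an observed duration and 0.8 times the corresponding waiting time.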

In [11]:
# Define theta and calculate the predictions
theta = 0.8
predictions = theta * geyser['waiting']

# Calculate the Mean Squared Error (MSE) with the imported scikit-learn function
mse = mean_squared_error(geyser['duration'], predictions)

# Convert MSE to a float
mse = float(mse)

# Answer check
print(type(mse)) # This will print <class 'float'>
print(mse) # This will print the calculated MSE

<class 'float'>
2930.2861285845593
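
An MSE in the thousands may look surprising for durations that are single-digit numbers. The sketch below works through the first five rows of the table above by hand to show where the size comes from. It uses only those five rows, so its mean will not equal the full-dataset value.

```python
import numpy as np

# First five rows of the table shown earlier (not the full 272-row dataset).
duration = np.array([3.6, 1.8, 3.333, 2.283, 4.533])
waiting = np.array([79, 54, 74, 62, 85], dtype=float)

theta = 0.8
predictions = theta * waiting        # each prediction is 0.8 times a waiting time
residuals = duration - predictions   # observed minus predicted
squared_errors = residuals ** 2

print(predictions)
print(residuals)
print(squared_errors.mean())         # MSE over these five rows only
```

With $\theta = 0.8$ the predictions are in the range of the waiting times, far above the observed durations, so every residual is large and negative and the squared errors dominate the average.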


[Back to top](#Index:) 

## Problem 4

### Computing the Mean Squared Error for `thetas`



Use a `for` loop over `thetas` to compute the MSE between the column `geyser['duration']` and the column `geyser['waiting']` multiplied by each value of `theta`. Append these values, in order, to the list `mses` below.

In [12]:
# Example array of thetas (equally spaced values between -1 and 1)
thetas = np.linspace(-1, 1, 100)

# Initialize an empty list to store the MSE values
mses = []

# Loop over each value of theta and compute the MSE
for theta in thetas:
    # Compute predictions as geyser['waiting'] multiplied by theta
    predictions = geyser['waiting'] * theta
    
    # Calculate the MSE between geyser['duration'] and predictions
    mse = np.mean((geyser['duration'] - predictions) ** 2)
    
    # Append the computed MSE to the mses list
    mses.append(mse)


# Answer check
print(type(mses)) # Should print <class 'list'>
print(len(mses)) # Should print 100 (as there are 100 thetas)
print(mses[:10]) # Print the first 10 MSE values for verification

<class 'list'>
100
[5746.399297702205, 5527.445557830223, 5312.744883371734, 5102.29727432674, 4896.102730695238, 4694.161252477228, 4496.472839672713, 4303.037492281691, 4113.855210304161, 3928.925993740124]
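
The exercise asks for a `for` loop, but the same numbers can also be produced in one step with NumPy broadcasting. The sketch below compares both approaches on the first five rows of the table above. The names `mses_loop` and `mses_vec` are introduced here for illustration.

```python
import numpy as np

# First five rows of the table shown earlier, as a small stand-in for the dataset.
duration = np.array([3.6, 1.8, 3.333, 2.283, 4.533])
waiting = np.array([79, 54, 74, 62, 85], dtype=float)
thetas = np.linspace(-1, 1, 100)

# Loop version, mirroring the exercise.
mses_loop = [np.mean((duration - waiting * theta) ** 2) for theta in thetas]

# Broadcast version: thetas[:, None] has shape (100, 1) and waiting has shape (5,),
# so the product has shape (100, 5), one row of predictions per theta.
predictions = thetas[:, None] * waiting
mses_vec = np.mean((duration - predictions) ** 2, axis=1)

print(mses_vec.shape)
print(np.allclose(mses_loop, mses_vec))
```

The loop is easier to read and matches the instructions; the broadcast form is mainly useful when the grid of thetas becomes large.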


[Back to top](#Index:) 

## Problem 5

### Which $\theta$ minimizes Mean Squared Error



Using the list `mses`, determine the value of $\theta$ that minimizes the mean squared error. You may want to amend your loop above to check for the smallest value as the loop proceeds. Assign your answer as a float to `theta_min` below.

In [13]:
# Example array of thetas (equally spaced values between -1 and 1)
thetas = np.linspace(-1, 1, 100)

# Initialize an empty list to store the MSE values
mses = []

# Initialize variables to track the minimum MSE and corresponding theta
min_mse = float('inf')  # Start with infinity so any computed MSE will be smaller
theta_min = None  # Placeholder for the theta that minimizes MSE

# Loop over each value of theta and compute the MSE
for theta in thetas:
    # Compute predictions as geyser['waiting'] multiplied by theta
    predictions = geyser['waiting'] * theta
    
    # Calculate the MSE between geyser['duration'] and predictions
    mse = np.mean((geyser['duration'] - predictions) ** 2)
    
    # Append the computed MSE to the mses list
    mses.append(mse)
    
    # Check if this mse is the smallest we've seen so far
    if mse < min_mse:
        min_mse = mse
        theta_min = theta



# Answer check
print(type(theta_min))
print(min_mse)
print(theta_min)

<class 'numpy.float64'>
0.3695626511606713
0.05050505050505061
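
Tracking the minimum inside the loop works, and an alternative is to look up the position of the smallest MSE afterwards with `np.argmin`. The sketch below does this on the first five rows of the table above, so the numbers it prints describe that small sample and not the full dataset. The name `best` is introduced here for illustration.

```python
import numpy as np

# First five rows of the table shown earlier, as a small stand-in for the dataset.
duration = np.array([3.6, 1.8, 3.333, 2.283, 4.533])
waiting = np.array([79, 54, 74, 62, 85], dtype=float)

thetas = np.linspace(-1, 1, 100)
mses = [float(np.mean((duration - waiting * t) ** 2)) for t in thetas]

best = int(np.argmin(mses))       # index of the smallest MSE in the list
theta_min = float(thetas[best])   # float() turns numpy.float64 into a built-in float

print(type(theta_min))
print(best, theta_min, mses[best])
```

Note that `numpy.float64`, the type printed in the answer check above, is a subclass of Python's `float`; the explicit `float()` call only matters if a check compares types exactly.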


Note that, again, the Mean Squared Error traces out a parabola as a function of $\theta$. The plot below shows the values in `thetas` against their mean squared errors.

<center>
    <img src = 'images/mse_min.png' >
</center>
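
The parabolic shape follows directly from the form of the loss. As a sketch, write $x_i$ for the waiting times and $y_i$ for the durations (shorthand introduced here) and expand the square:

$$\text{MSE}(\theta) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \theta x_i)^2 = \theta^2 \cdot \frac{1}{n}\sum_{i=1}^{n} x_i^2 \;-\; 2\theta \cdot \frac{1}{n}\sum_{i=1}^{n} x_i y_i \;+\; \frac{1}{n}\sum_{i=1}^{n} y_i^2$$

This is a quadratic in $\theta$ with a positive leading coefficient, so it has a single minimum. Setting the derivative to zero gives

$$\frac{d\,\text{MSE}}{d\theta} = -\frac{2}{n}\sum_{i=1}^{n} x_i (y_i - \theta x_i) = 0 \quad\Longrightarrow\quad \theta^{*} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$$

The grid search above can only return one of the 100 values in `thetas`, so its answer should be the grid point closest to $\theta^{*}$, which is within half a grid step (about 0.0101) of the exact minimizer.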

In [14]:
# Code for the plot above (uncomment to regenerate the image)
# plt.plot(thetas, mses)
# plt.plot(thetas[np.argmin(mses)], min(mses), 'ro', label = 'Minimum')
# plt.legend()
# plt.title('Minimum MSE')
# plt.xlabel(r'$\theta$')
# plt.ylabel('MSE')
# plt.grid();
# plt.savefig('images/mse_min.png')
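
As a cross-check on the grid search, the minimizer of a no-intercept squared loss has the closed form $\theta^{*} = \sum x_i y_i / \sum x_i^2$, where $x_i$ are the waiting times and $y_i$ the durations. The sketch below compares it with the grid result on the first five rows of the table above. The names `theta_star` and `theta_grid` are introduced here, and the printed values describe that small sample only.

```python
import numpy as np

# First five rows of the table shown earlier, as a small stand-in for the dataset.
duration = np.array([3.6, 1.8, 3.333, 2.283, 4.533])
waiting = np.array([79, 54, 74, 62, 85], dtype=float)

# Closed-form minimizer of mean((duration - theta * waiting) ** 2).
theta_star = float((waiting @ duration) / (waiting @ waiting))

# Grid search over the same 100 thetas used in the activity.
thetas = np.linspace(-1, 1, 100)
mses = np.mean((duration - thetas[:, None] * waiting) ** 2, axis=1)
theta_grid = float(thetas[np.argmin(mses)])

spacing = thetas[1] - thetas[0]
print(theta_star, theta_grid)
print(abs(theta_grid - theta_star) <= spacing / 2)
```

A finer grid, or the closed form itself, would be the natural next step if more precision than the grid spacing is needed.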