In [1]:
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
# Use the full option name; the bare 'precision' is ambiguous in newer pandas
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

# Exercises

- Another loss function called the Huber loss combines the absolute and
  squared loss to create a loss function that is both smooth and robust
  to outliers. The Huber loss accomplishes this by behaving like the squared loss
  for $\theta$ values close to the minimum and switching to absolute loss for
  $\theta$ values far from the minimum. Below is a formula for a simplified
  version of Huber loss. Use this definition of Huber loss to:
   - Write a function called `mhe` to compute the mean Huber error.
   - Plot the smooth `mhe` curve for the bus times data where $\theta$ ranges from -2
     to 8.
   - Use trial and error to find the minimizing $\hat \theta$ for bus times.

$$
\begin{aligned}
l(\theta, y)
&= \frac{1}{2} (y - \theta)^2  &\textrm{for}~ |y-\theta| \leq 2\\
&= 2(|y - \theta| - 1)  &\textrm{otherwise.}\\
\end{aligned}
$$
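
As a starting point, the piecewise definition above can be translated almost line for line into code. The following is a minimal sketch in plain Python, not a definitive implementation: the helper name `huber_loss` and the argument name `y_vals` are assumptions made for illustration, and the switch point of 2 is hard-coded to match the simplified formula.

```python
def huber_loss(theta, y):
    """Simplified Huber loss for a single observation y (switch point fixed at 2)."""
    err = abs(y - theta)
    if err <= 2:
        return 0.5 * err ** 2   # squared-loss branch, used close to theta
    return 2 * (err - 1)        # absolute-loss branch, used far from theta


def mhe(theta, y_vals):
    """Mean Huber error of the constant theta over the observations in y_vals."""
    losses = [huber_loss(theta, y) for y in y_vals]
    return sum(losses) / len(losses)
```

Both branches give a loss of 2 when $|y - \theta| = 2$, which is why the resulting curve has no jump at the switch point. To plot the curve, evaluate `mhe` over a grid of $\theta$ values and pass the results to `plt.plot`.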

- Continue with Huber loss and the function `mhe` in the previous problem:
   - Plot the smooth `mhe` for the five data points $[-2, 0, 1, 5, 10]$.
   - Describe the curve. 
   - For these five points, what is the minimizing $\hat \theta$? 
   - What happens when the data point 10 is swapped for 100? Compare the minimizer to the
     mean and median of the five points.

- Consider a loss function that assigns 0 loss to negative errors and linear (or quadratic) loss to positive errors.
    - Write a function called `mLe` that computes the average of this loss over the data.
    - Plot the `mLe` curve for many values of $\theta$, given the data $\mathbf{y} = [-2, 0, 1, 5, 10]$.
    - Use trial and error to find the minimizing $\hat \theta$.
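
One way to set up this one-sided loss is sketched below. The exercise does not fix a sign convention, so the sketch assumes the error is $y - \theta$; the helper name `one_sided_loss` and the `power` argument (1 for linear, 2 for quadratic) are likewise assumptions made for illustration.

```python
def one_sided_loss(theta, y, power=1):
    """Zero loss for non-positive errors; err**power for positive errors."""
    err = y - theta             # assumed sign convention: error = y - theta
    return err ** power if err > 0 else 0.0


def mLe(theta, y_vals, power=1):
    """Average one-sided loss of the constant theta over y_vals."""
    return sum(one_sided_loss(theta, y, power) for y in y_vals) / len(y_vals)


y = [-2, 0, 1, 5, 10]
thetas = [t / 2 for t in range(-8, 29)]   # grid from -4.0 to 14.0 in steps of 0.5
curve = [mLe(t, y) for t in thetas]       # values to plot against thetas
```

Under the opposite convention (error $= \theta - y$), the curve is mirrored, so it is worth stating which convention a solution uses.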

- In this exercise, we again show that the mean minimizes the mean squared error, this time using calculus.
   - Take the derivative of the average loss with respect to $\theta$.
   - Set the derivative to 0 and solve for $\hat{\theta}$.
   - To be thorough, take the second derivative to confirm that $\bar{y}$ is a minimizer. (Recall that if the second derivative is positive, then the quadratic is convex, so its critical point is a minimum.)
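
The calculus result can be sanity-checked numerically. The sketch below approximates the first and second derivatives of the mean squared error at $\bar{y}$ with central finite differences; the function name `mse`, the step size `h`, and the reuse of the five data points from the earlier exercise are assumptions made for illustration.

```python
def mse(theta, y_vals):
    """Mean squared error of the constant theta over y_vals."""
    return sum((y - theta) ** 2 for y in y_vals) / len(y_vals)


y = [-2, 0, 1, 5, 10]
y_bar = sum(y) / len(y)

h = 1e-3  # finite-difference step size (an arbitrary small value)

# Central-difference approximation to the first derivative at y_bar
slope = (mse(y_bar + h, y) - mse(y_bar - h, y)) / (2 * h)

# Central-difference approximation to the second derivative at y_bar
curvature = (mse(y_bar + h, y) - 2 * mse(y_bar, y) + mse(y_bar - h, y)) / h ** 2

print(slope, curvature)
```

A slope near zero together with a positive curvature is what the derivation should predict at the minimizer.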

- Follow the steps below to establish that MAE is minimized by the median.
   - Split the summation $\frac{1}{n} \sum_{i = 1}^{n}|y_i - \theta|$ into
     three terms, for when $y_i - \theta$ is negative, 0, and positive.
   - Set the middle term to 0 so that the equations are easier to work with.
     Use the fact that the derivative of the absolute value is -1 or +1 to
     differentiate the remaining two terms with respect to $\theta$. 
   - Set the derivative to 0 and simplify terms. Explain why, when there is an
     odd number of points, the solution is the median.
   - Explain why, when there is an even number of points, the minimizing
     $\theta$ is not uniquely defined (just as with the median).
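
Before writing out the argument, it can help to see the odd and even cases numerically. The sketch below searches a grid of $\theta$ values for every minimizer of the mean absolute error. The names `mae` and `minimizers`, the grid, and the four-point data set (the five points from the earlier exercise with 10 dropped) are assumptions made for illustration.

```python
def mae(theta, y_vals):
    """Mean absolute error of the constant theta over y_vals."""
    return sum(abs(y - theta) for y in y_vals) / len(y_vals)


def minimizers(y_vals, thetas):
    """Return every grid value of theta that attains the smallest MAE."""
    errors = [mae(t, y_vals) for t in thetas]
    best = min(errors)
    return [t for t, e in zip(thetas, errors) if abs(e - best) < 1e-12]


thetas = [t / 4 for t in range(-8, 41)]   # grid from -2.0 to 10.0 in steps of 0.25

odd_points = [-2, 0, 1, 5, 10]            # odd number of points
even_points = [-2, 0, 1, 5]               # even number of points (illustrative)

print(minimizers(odd_points, thetas))
print(minimizers(even_points, thetas))
```

With five points the search returns a single grid value, while with four points it returns every grid value between the two middle observations, which is the behavior the last two steps ask you to explain.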