This is a simple notebook to generate linear data with some (non-Gaussian) scatter, and do linear fits with different loss functions.

It accompanies Chapter 5 of the book (1 of 5).

Author: Viviana Acquaviva, with contributions by Jake Postiglione and Olga Privman.

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib
import matplotlib.pyplot as plt
import sklearn
from sklearn import metrics
from sklearn.model_selection import train_test_split, cross_validate, cross_val_predict
from sklearn.model_selection import KFold
from sklearn import linear_model #New!

%matplotlib inline

font = {'size'   : 16}
matplotlib.rc('font', **font)
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14) 
matplotlib.rcParams.update({'figure.autolayout': False})
matplotlib.rcParams['figure.dpi'] = 300

#### We begin by generating some data.

In [None]:
np.random.seed(16) #set seed for reproducibility purposes

x = np.arange(100) 

yp = 3*x + 3 + 5*(np.random.poisson(3*x+3,100)-(3*x+3)) #generate some data with scatter following Poisson distribution 
                                                    #with exp value = y from linear model, centered around 0

In [None]:
#Let's take a look!

plt.scatter(x, yp);

#### Here comes the linear regression model ;) 

In [None]:
model = linear_model.LinearRegression()

In [None]:
model

I can fit the model (right now, I will do it using the entire data set just to compare with the analytic solution). When only one predictor is present, I need to reshape it to column form.
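
As a quick illustration of the reshaping step (a minimal sketch; the array names `x_demo` and `X_demo` are made up for this example): scikit-learn estimators expect a 2D feature array of shape `(n_samples, n_features)`, so a 1D array of 100 values has to become 100 rows and 1 column.

```python
import numpy as np

x_demo = np.arange(100)          # hypothetical 1D predictor, shape (100,)
X_demo = x_demo.reshape(-1, 1)   # -1 lets NumPy infer the number of rows

print(x_demo.shape, X_demo.shape)
```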

In [None]:
model.fit(x.reshape(-1,1),yp) 

The fitted model has the attributes `coef_` (an array with one slope per predictor) and `intercept_`.

In [None]:
slope, intercept = model.coef_, model.intercept_

In [None]:
print(slope, intercept)

We can plot the original and the fitted line.

In [None]:
plt.figure(figsize = (10,6))
plt.scatter(x,yp, s = 20, c = 'gray', label = 'Data')
plt.plot(x, slope*x + intercept, c ='k', label = 'Ordinary least squares fit')
plt.plot(x, 3*x + 3, c = 'r', label = 'True regression line')
plt.legend(fontsize = 14)
plt.xlabel('X')
plt.ylabel('Y')

What are the analytic predictions for the coefficients?
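
For reference, the closed-form least-squares solution for a single predictor, written with $\theta_0$ for the intercept and $\theta_1$ for the slope, and with $\bar{x}$ and $\bar{y}$ denoting the sample means:

$$
\theta_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \theta_0 = \bar{y} - \theta_1 \bar{x}
$$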

In [None]:
#Predictions - fill in the analytic formula

theta1 = np.sum((x - np.mean(x))*(yp - np.mean(yp)))/np.sum((x - np.mean(x))*(x - np.mean(x)))

theta0 = np.mean(yp) - theta1*np.mean(x)

In [None]:
print('Theta_0, Theta_1:', theta0, theta1)

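I can also obtain the second one (the slope) in the variance/covariance notation. Note: `np.cov` uses 1/(n-1) by default while `np.var` uses 1/n, so I pass `bias=True` to make the two definitions consistent; without it there would be a small difference.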
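
The equivalence can be checked numerically on a small synthetic data set (a sketch; the names `x_demo`, `y_demo` and the noise level are made up for the example). It also shows what happens when the normalizations of `np.cov` and `np.var` are mixed.

```python
import numpy as np

rng = np.random.default_rng(0)
x_demo = np.arange(50, dtype=float)
y_demo = 2.0 * x_demo + 1.0 + rng.normal(0, 3, size=50)  # made-up linear data

# Slope from the sums-of-deviations formula
dx = x_demo - x_demo.mean()
dy = y_demo - y_demo.mean()
slope_sums = np.sum(dx * dy) / np.sum(dx ** 2)

# Same slope from covariance / variance, with consistent normalization
slope_biased = np.cov(x_demo, y_demo, bias=True)[0, 1] / np.var(x_demo)   # both 1/n
slope_unbiased = np.cov(x_demo, y_demo)[0, 1] / np.var(x_demo, ddof=1)    # both 1/(n-1)

# Mixing the two conventions is off by a factor n/(n-1)
slope_mixed = np.cov(x_demo, y_demo)[0, 1] / np.var(x_demo)

print(slope_sums, slope_biased, slope_unbiased, slope_mixed)
```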

In [None]:
print('Sample Cov / Sample var:', np.cov(x,yp, bias=True)[0,1]/np.var(x))

#### We can (and should!) do cross validation and all the nice things we have learned to do for classification problems.

In [None]:
cv = KFold(n_splits = 5 , shuffle = True , random_state = 10)

In [None]:
scores = cross_validate(model, x.reshape(-1,1), yp, cv = cv, return_train_score = True)

In [None]:
scores

In [None]:
print('{:.3f}'.format(scores['test_score'].mean()), '{:.3f}'.format(scores['test_score'].std()))
print('{:.3f}'.format(scores['train_score'].mean()), '{:.3f}'.format(scores['train_score'].std()))

### Questions: 

- What are the scores that are being printed out?

- How are the scores? 

- Does the model suffer from high variance? High bias?

- What would happen to the scores if we increased the scatter (noise)?
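
As a hint for the first question: when no `scoring` argument is given, `cross_validate` falls back on the estimator's own `score` method, which for `LinearRegression` is the coefficient of determination R². A minimal check on synthetic data (the names `X_demo`, `y_demo`, `cv_demo` and the noise level are made up for this sketch):

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(1)
X_demo = np.arange(100, dtype=float).reshape(-1, 1)
y_demo = 3 * X_demo.ravel() + 3 + rng.normal(0, 20, size=100)  # made-up noisy line

cv_demo = KFold(n_splits=5, shuffle=True, random_state=10)
reg = linear_model.LinearRegression()

# No scoring argument: the estimator's default score is used
default_scores = cross_validate(reg, X_demo, y_demo, cv=cv_demo)['test_score']

# Explicitly asking for R^2 on the same folds
r2_scores = cross_validate(reg, X_demo, y_demo, cv=cv_demo, scoring='r2')['test_score']

print(default_scores)
print(r2_scores)
```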

### <font color='green'> Scoring in regression problems. </font>

### Here is a way to visualize all the available scorers.

In [None]:
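#get_scorer_names() lists all built-in scorer names (it replaces the older sklearn.metrics.SCORERS dictionary)
print(sorted(sklearn.metrics.get_scorer_names()))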

### Do you recognize some of them?

Let's see if we can find the MSE.
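
Note the `neg_` prefix: scikit-learn scorers follow a "greater is better" convention, so errors are reported with a minus sign. A small sketch on made-up data (all names ending in `_demo` are assumptions for the example) showing how to recover the usual MSE and RMSE:

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(2)
X_demo = np.arange(100, dtype=float).reshape(-1, 1)
y_demo = 3 * X_demo.ravel() + 3 + rng.normal(0, 20, size=100)  # made-up noisy line

cv_demo = KFold(n_splits=5, shuffle=True, random_state=10)
neg_mse = cross_validate(linear_model.LinearRegression(), X_demo, y_demo,
                         cv=cv_demo, scoring='neg_mean_squared_error')['test_score']

mse_per_fold = -neg_mse                 # flip the sign to get the mean squared error
rmse_per_fold = np.sqrt(mse_per_fold)   # same units as y

print(mse_per_fold.mean(), rmse_per_fold.mean())
```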

In [None]:
scores = cross_validate(model, x.reshape(-1,1), yp, cv = cv, scoring = 'neg_mean_squared_error', return_train_score = True)

In [None]:
print('{:.3f}'.format(scores['test_score'].mean()), '{:.3f}'.format(scores['test_score'].std()))
print('{:.3f}'.format(scores['train_score'].mean()), '{:.3f}'.format(scores['train_score'].std()))

We can also try the MAE.

In [None]:
scores = cross_validate(model, x.reshape(-1,1), yp, cv = cv, scoring = 'neg_mean_absolute_error', return_train_score = True)

In [None]:
print('{:.3f}'.format(scores['test_score'].mean()), '{:.3f}'.format(scores['test_score'].std()))
print('{:.3f}'.format(scores['train_score'].mean()), '{:.3f}'.format(scores['train_score'].std()))

By plotting the residuals, we can see that they are not independent of x: their spread grows with x, so the assumptions of the probabilistic linear model are not satisfied. But that doesn't mean we can't create a model.
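
One way to check how the residuals depend on x is to compare their spread at small and large x. The sketch below regenerates the data with the same recipe and seed used earlier in this notebook and fits a line with `np.polyfit` (used here only to keep the example self-contained; the `_demo` names are made up). For Poisson scatter the variance grows with the mean, so the spread in the upper half of x is expected to be larger.

```python
import numpy as np

np.random.seed(16)
x_demo = np.arange(100)
mean_line = 3 * x_demo + 3
y_demo = mean_line + 5 * (np.random.poisson(mean_line, 100) - mean_line)

# Ordinary least squares fit of a straight line
slope_demo, intercept_demo = np.polyfit(x_demo, y_demo, 1)
residuals = y_demo - (slope_demo * x_demo + intercept_demo)

# Spread of the residuals in the lower and upper half of x
std_low = residuals[x_demo < 50].std()
std_high = residuals[x_demo >= 50].std()

print(std_low, std_high)
```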

In [None]:
plt.scatter(x, slope*x + intercept - yp, color = 'b', label = 'Residuals')

plt.legend();

### Custom scores

We might like to implement a scorer where we care about percentage error instead. Here is how to do a custom scorer:
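
To illustrate the mechanics with a different metric (a sketch; the function `max_abs_error` and the synthetic data are made up for this example): `make_scorer` wraps any function of the form `f(y_true, y_pred)`, and `greater_is_better = False` tells scikit-learn that the quantity is a loss, so the resulting scorer returns its negative.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import make_scorer

def max_abs_error(true, pred):  # hypothetical metric: largest absolute residual
    return np.max(np.abs(true - pred))

max_abs_scorer = make_scorer(max_abs_error, greater_is_better=False)

rng = np.random.default_rng(3)
X_demo = np.arange(100, dtype=float).reshape(-1, 1)
y_demo = 3 * X_demo.ravel() + 3 + rng.normal(0, 20, size=100)  # made-up noisy line

reg = linear_model.LinearRegression().fit(X_demo, y_demo)

raw_value = max_abs_error(y_demo, reg.predict(X_demo))
scorer_value = max_abs_scorer(reg, X_demo, y_demo)  # scorers are called as (estimator, X, y)

print(raw_value, scorer_value)  # the scorer reports the negative of the metric
```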

In [None]:
from sklearn.metrics import make_scorer

### Learning Check-in
    
How would you implement a scorer? Please fill in the code.

```python
def mape(...,...): #Mean Absolute Percentage Error
    return ....

mape_scorer = make_scorer(mape, greater_is_better = False)
```

</br>

<details>
<summary style="display: list-item;">Click here for the answer!</summary>
<p>
    
```python
def mape(true,pred): #Mean Absolute Percentage Error
    return np.mean(np.abs(true-pred)/np.abs(true)) #absolute value in the denominator, so negative targets don't flip the sign

mape_scorer = make_scorer(mape, greater_is_better = False)
```
    
</p>
</details>
</br>


We'll try it with Modified Mean Absolute Percentage Error, instead, to avoid zeros.

In [None]:
def mape(true,pred): #Modified Mean Absolute Percentage Error
    return np.mean(np.abs(true-pred)/(0.5*(np.abs(true)+np.abs(pred)))) #denominator is zero only if both values are zero

mape_scorer = make_scorer(mape, greater_is_better = False)

In [None]:
scores = cross_validate(model, x.reshape(-1,1), yp, cv = cv, scoring = mape_scorer, return_train_score = True)

In [None]:
scores

In [None]:
print('{:.3f}'.format(scores['test_score'].mean()), '{:.3f}'.format(scores['test_score'].std()))
print('{:.3f}'.format(scores['train_score'].mean()), '{:.3f}'.format(scores['train_score'].std()))

#### Note: as we already discussed, so far we have not changed the loss function (MSE), or the coefficients of the model. We have only looked at different evaluation metrics.

#### <font color = 'green'> Question 1: would the best fit line change if we optimize a different loss function? </font>

Yes!

#### <font color = 'green'> Question 2: How can we implement that without an analytic solution? </font>

Grid Search
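
The idea: evaluate the loss on a grid of candidate (intercept, slope) pairs and keep the pair with the smallest value. The sketch below does this for the MAE on made-up data, using NumPy broadcasting instead of nested loops; the grid ranges, the noise level and all variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
x_demo = np.arange(100, dtype=float)
y_demo = 3 * x_demo + 3 + rng.normal(0, 10, size=100)  # made-up noisy line

t0_grid = np.linspace(-5, 5, 101)   # candidate intercepts
t1_grid = np.linspace(-5, 5, 101)   # candidate slopes

# Predictions for every (intercept, slope) pair: shape (101, 101, 100)
pred = t0_grid[:, None, None] + t1_grid[None, :, None] * x_demo[None, None, :]

# Mean absolute error for each pair: shape (101, 101)
mae_grid = np.mean(np.abs(pred - y_demo), axis=-1)

# Location of the smallest loss on the grid
i_best, j_best = np.unravel_index(mae_grid.argmin(), mae_grid.shape)
print('Best intercept, slope:', t0_grid[i_best], t1_grid[j_best])
```

The resolution of the answer is limited by the grid spacing (0.1 here), and the cost grows quickly with the number of parameters, which is why grids are practical mainly for low-dimensional problems like this one.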

This is an example using the Mean Square Error.

In [None]:
theta0 = np.linspace(-5,5,200)
theta1 = np.linspace(-5,5,200)

In [None]:
mse = np.empty((200,200))

for i,t0 in enumerate(theta0):
    for j,t1 in enumerate(theta1):
        mse[i,j] = np.sum((t0 + t1*x - yp)**2)/len(yp)

To get the indices of the 2D array, I need to unravel it
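
`argmin` on a 2D array returns a single index into the flattened array; `np.unravel_index` converts it back to a (row, column) pair. A tiny example with a made-up array:

```python
import numpy as np

grid = np.array([[5.0, 4.0, 3.0],
                 [2.0, 9.0, 0.5],
                 [7.0, 8.0, 6.0]])

flat_index = grid.argmin()                            # position in the flattened array
row, col = np.unravel_index(flat_index, grid.shape)   # back to 2D indices

print(flat_index, (row, col), grid[row, col])
```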

In [None]:
np.unravel_index(mse.argmin(), mse.shape)

I can now find the minimum MSE (not very informative, TBH) and the best fit coefficients:

In [None]:
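mse.min()

In [None]:
i_min, j_min = np.unravel_index(mse.argmin(), mse.shape) #indices of the minimum, computed rather than hard-coded
theta0[i_min], theta1[j_min]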

#### Question: How do they compare to the ones found by the Linear Model / analytic ones?

It will be interesting to see what happens to the parameters if we use a different loss function (MAE, MAPE, Huber loss).

However, because these data are so regular, it's kind of boring, so before trying the different losses let's inject some outliers in the data.

### What happens when we add outliers?
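
One detail worth knowing before injecting outliers: `np.random.choice(100, 15)` samples with replacement by default, so the 15 draws may contain repeated indices and the number of distinct outliers can be smaller than 15. A sketch of the difference (passing `replace=False` is shown as an alternative, not as what this notebook does; the variable names are made up):

```python
import numpy as np

np.random.seed(12)
idx_with_repl = np.random.choice(100, 15)                    # default: replace=True
idx_without_repl = np.random.choice(100, 15, replace=False)  # 15 distinct indices

print(len(np.unique(idx_with_repl)), len(np.unique(idx_without_repl)))
```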

In [None]:
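np.random.seed(12) #set seed for reproducibility
out = np.random.choice(100,15) #select 15 outlier indices (sampled with replacement, so some may repeat)
yp_wo = np.copy(yp)
np.random.seed(12) #reset the seed before drawing the outlier amplitudes
yp_wo[out] = yp_wo[out] + 5*np.random.rand(15)*yp[out] #add up to 5 times the original value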

In [None]:
plt.scatter(x,yp_wo, label = 'Data + outliers')
plt.scatter(x,yp, label = 'Original data')
plt.legend();

We can see the effect for the MSE loss right away:

In [None]:
model.fit(x.reshape(-1,1),yp_wo)

slope, intercept = model.coef_, model.intercept_

print(slope, intercept)

### Learning Check-in
    
What can we expect when we increase the number of outliers to 30?

<details>
<summary style="display: list-item;">Click here for the answer!</summary>
<p>
    
```
The new values are visibly skewed by the outliers.
```
    
</p>
</details>

</br>


What are the new values for the slope and intercept? 

<details>
<summary style="display: list-item;">Click here for the answer!</summary>
<p>
    
```
[4.39426943] 5.633663366336549
```
    
</p>
</details>
</br>

### Exercise: 

1. Calculate the best-fitting coefficients (e.g. using a grid, like the one we made in the previous example) for the MSE, MAE, modified MAPE, and Huber losses.

2. Plot the data and the four best fits.

3. Explain the results by commenting on the differences.

Note: the Huber loss is a hybrid between MSE and MAE: it behaves like MSE for small errors and like MAE when the error is larger than a certain threshold, often called delta, so it's less sensitive to outliers than MSE. One possibility is to use the std of the y values to set delta.
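
A minimal sketch of the Huber loss as a function of the residual (the function name `huber` and the parameter name `delta` are chosen for this example; the solution below calls the threshold `c`). The two pieces match at |r| = delta, so the loss is continuous.

```python
import numpy as np

def huber(residual, delta):
    """Huber loss per point: 0.5*r**2 if |r| <= delta, else delta*(|r| - 0.5*delta)."""
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

r_demo = np.array([0.5, 1.0, 3.0, -3.0])  # made-up residuals
print(huber(r_demo, delta=1.0))
```

Large residuals are penalized linearly rather than quadratically, which is what limits the pull of outliers on the fit.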

### Solution

In [None]:
#from https://www.astroml.org/book_figures/chapter8/fig_huber_loss.html

# Define the log-likelihood via the Huber loss function
def huber_loss(m, b, x, y, dy, c=2):
    y_fit = m * x + b
    t = abs((y - y_fit) / dy)
    flag = t > c
    return np.sum((~flag) * (0.5 * t ** 2) - (flag) * c * (0.5 * c - t), -1)

In [None]:
b0 = np.linspace(-5,5,200)
b1 = np.linspace(-5,5,200)

losses = ['MSE', 'MAE', 'MAPE', 'Huber']

mse = np.empty((200,200))
mae = np.empty((200,200))
mape = np.empty((200,200))
huber = np.empty((200,200))

c = 209 #Huber threshold (the "delta" of the note above)

coeff = {}

for i,beta0 in enumerate(b0):
    for j,beta1 in enumerate(b1):
        
        #MSE
        mse[i,j] = np.sum((beta0 + beta1*x - yp_wo)**2)/len(yp_wo)
        
        #MAE
        mae[i,j] = np.sum(np.abs(beta0 + beta1*x - yp_wo))/len(yp_wo)
            
        #MAPE (absolute value in the denominator, so negative y values don't flip the sign)
        mape[i,j] = np.sum(np.abs(beta0 + beta1*x - yp_wo)/np.abs(yp_wo))/len(yp_wo)
        
        #Huber
        t = np.abs(beta0 + beta1*x - yp_wo)
        flag = (t > c)
        huber[i,j] = np.sum((~flag) * (0.5 * t ** 2) - (flag) * c * (0.5 * c - t))/len(yp_wo)

for i,loss in enumerate([mse, mae, mape, huber]):
        
    ind = np.unravel_index(loss.argmin(), loss.shape)
    
    coeff[losses[i]] = b0[ind[0]], b1[ind[1]]

    print('Intercept, slope for loss:', losses[i], b0[ind[0]], b1[ind[1]])