## ![BTS](img/Logo-BTS.jpg)

# Session 22: Regression

### Juan Luis Cano Rodríguez <juan.cano@bts.tech> - Data Science Foundations (2019-01-08)

Open this notebook in Google Colaboratory: [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Juanlu001/bts-mbds-data-science-foundations/blob/master/sessions/22-Regression.ipynb)

## Outline

* OLS with dummy data
* The problem of overfitting
* Learning curves
* Regularization to rescue: Ridge
* Few features are important: Lasso
* Robustness against outliers: RANSAC

## Exercise 2: Predicting bike counts

1. Read the `data/fremont_bridge.csv` and `data/bicycle_weather.csv` datasets, using proper indexes and parsing date columns
2. Build a feature matrix to predict the **daily total** (`East + West`) number of bikes. Features:
  - **Mean temperature** (assume it's the mean of the minimum and the maximum; originally in _tenths of a degree Celsius_, convert to _degrees Celsius_)
  - **Weekday** (hint: use `pd.get_dummies` or `sklearn.preprocessing.OneHotEncoder` to transform the categorical variable into ones and zeroes)
  - **Holiday** (hint: use the code below)
  - **Hours of daylight** (hint: use the code below)
  - **Precipitation** (originally in _tenths of a millimeter_, convert to _millimeters_)
  - **Dry day** (1 if the precipitation is zero, 0 otherwise)
3. Use `sklearn.linear_model.LinearRegression` to predict the **daily total** bikes (no need to `train_test_split`, as we don't want to generalize to new data)
  - Print the coefficients of the model and reason about them
  - Visualize the real and predicted values in a single plot
  - Plot the difference
4. Use scikit-plot to display the learning curve of the model (remember to `plt.gca().set_ylim(0, 1)`). Do we need more training data?
5. Can you improve these results using `Ridge` or `SVR`?
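
As a hint for step 2, here is a minimal sketch of the unit conversions, the dry-day flag and the weekday encoding on a tiny synthetic table. The column names (`TMIN`, `TMAX`, `PRCP`) and the feature names are assumptions for illustration; check the real CSV headers before reusing any of this.

```python
import pandas as pd

# Synthetic stand-in for the weather data. The column names are
# assumptions for illustration, not necessarily those in the CSV files.
index = pd.date_range("2019-01-07", periods=7, freq="D")  # Monday to Sunday
weather = pd.DataFrame(
    {
        "TMIN": [10, 20, 30, 40, 50, 60, 70],    # tenths of a degree Celsius
        "TMAX": [50, 60, 70, 80, 90, 100, 110],  # tenths of a degree Celsius
        "PRCP": [0, 25, 0, 130, 0, 0, 5],        # tenths of a millimeter
    },
    index=index,
)

features = pd.DataFrame(index=index)
# Mean of minimum and maximum, converted to degrees Celsius
features["temp_mean_c"] = (weather["TMIN"] + weather["TMAX"]) / 2 / 10
# Precipitation converted to millimeters
features["precip_mm"] = weather["PRCP"] / 10
# 1 if there was no precipitation at all, 0 otherwise
features["dry_day"] = (weather["PRCP"] == 0).astype(int)

# One-hot encode the weekday (0 = Monday, ..., 6 = Sunday)
weekday = pd.Series(index.dayofweek, index=index)
weekday_dummies = pd.get_dummies(weekday, prefix="weekday").astype(int)
features = features.join(weekday_dummies)

print(features)
```

Keeping all seven weekday columns together with an intercept makes the columns linearly dependent, so the individual weekday coefficients are only meaningful relative to each other.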

<small>[When to train test split](https://stats.stackexchange.com/a/309117/37074)</small>
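
For steps 3 and 5, the fitting pattern can be sketched on synthetic data, so that it does not give away the bike results. The data, the coefficient values and `alpha=1.0` are arbitrary assumptions; only the `fit` / `coef_` / `score` / `predict` calls carry over to the exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression problem with known coefficients (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y = X @ true_coef + 10.0 + rng.normal(scale=0.1, size=200)

# Ordinary least squares and a ridge fit on the same data
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # alpha chosen arbitrarily

# Coefficients, ready to be printed and reasoned about
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)

# R^2 on the training data, and the residuals to plot as "the difference"
r2_ols = ols.score(X, y)
r2_ridge = ridge.score(X, y)
residuals = y - ols.predict(X)
```

On the training data itself, OLS cannot score worse than Ridge, because it minimizes the squared error directly; any benefit of regularization only shows up on data the model has not seen.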

In [None]:
# Holidays
from pandas.tseries.holiday import USFederalHolidayCalendar
cal = USFederalHolidayCalendar()
holidays = cal.holidays('2012', '2016')
holidays  # And now what?
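
One possible next step, sketched under the assumption that the feature matrix is indexed by day: test each day for membership in `holidays` and store the result as a 0/1 column. The short date range and the name `holiday_flag` are illustrative only.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

cal = USFederalHolidayCalendar()
holidays = cal.holidays("2012", "2016")

# A few days around Independence Day 2012 (illustrative range)
days = pd.date_range("2012-07-02", "2012-07-06", freq="D")

# 1 if the day is a US federal holiday, 0 otherwise
holiday_flag = pd.Series(days.isin(holidays).astype(int), index=days, name="holiday")
print(holiday_flag)
```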

In [None]:
# Hours of daylight
import numpy as np
import pandas as pd

def hours_of_daylight(date, axis=23.44, latitude=47.61):
    """Compute the hours of daylight for the given date"""
    # Days elapsed since a reference winter solstice
    # (pd.datetime was removed in pandas 2.0, so use pd.Timestamp)
    days = (date - pd.Timestamp(2000, 12, 21)).days
    m = (1. - np.tan(np.radians(latitude))
         * np.tan(np.radians(axis) * np.cos(days * 2 * np.pi / 365.25)))
    return 24. * np.degrees(np.arccos(1 - np.clip(m, 0, 2))) / 180.

hours_of_daylight(pd.Timestamp(2019, 1, 8))  # And now what?
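
One way to continue, again assuming a daily `DatetimeIndex`: evaluate the function for every day and keep the result as a feature column. The function is repeated here (with `pd.Timestamp`) so the sketch runs on its own; the year 2019 and the name `daylight_hrs` are arbitrary choices.

```python
import numpy as np
import pandas as pd

def hours_of_daylight(date, axis=23.44, latitude=47.61):
    """Compute the hours of daylight for the given date"""
    days = (date - pd.Timestamp(2000, 12, 21)).days
    m = (1. - np.tan(np.radians(latitude))
         * np.tan(np.radians(axis) * np.cos(days * 2 * np.pi / 365.25)))
    return 24. * np.degrees(np.arccos(1 - np.clip(m, 0, 2))) / 180.

# Evaluate the function once per day of an illustrative year
days = pd.date_range("2019-01-01", "2019-12-31", freq="D")
daylight = pd.Series(
    [hours_of_daylight(day) for day in days], index=days, name="daylight_hrs"
)

# Shortest and longest days should fall near the solstices
print(daylight.idxmin(), daylight.min())
print(daylight.idxmax(), daylight.max())
```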