### This exercise is designed to pair with [this tutorial](https://www.kaggle.com/rtatman/bigquery-machine-learning-tutorial). If you haven't taken a look at it yet, head over and check it out first. (Otherwise these exercises will be pretty confusing!) -- Rachael 

# Stocking rental bikes

![bike rentals](https://upload.wikimedia.org/wikipedia/commons/thumb/a/a0/Bay_Area_Bike_Share_launch_in_San_Jose_CA.jpg/640px-Bay_Area_Bike_Share_launch_in_San_Jose_CA.jpg)

You stock bikes for a bike rental company in Austin, ensuring stations have enough bikes for all their riders. You decide to build a model to predict how many riders will start from each station during each hour, capturing patterns in seasonality, time of day, day of the week, etc.

To get started, create a project in GCP and connect to it by running the code cell below. Make sure you have connected the kernel to your GCP account in Settings.

In [None]:
# Set your own project id here
PROJECT_ID = 'bqml-tutorial-249309' # a string, like 'kaggle-bigquery-240818'

from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID, location="US")
dataset = client.create_dataset('model_dataset', exists_ok=True)

from google.cloud.bigquery import magics
from kaggle.gcp import KaggleKernelCredentials
magics.context.credentials = KaggleKernelCredentials()
magics.context.project = PROJECT_ID

In [None]:
%load_ext google.cloud.bigquery

## Linear Regression

Your dataset is quite large. BigQuery is especially efficient with large datasets, so you'll use BigQuery-ML (called BQML) to build your model. BQML uses a "linear regression" model when predicting numeric outcomes, like the number of riders.
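To make "linear regression" concrete, here is a minimal sketch of fitting a straight line to a single numeric feature with ordinary least squares. The data and the `fit_line` helper are made up for illustration; this is not how BQML trains internally, just the basic idea of a linear fit.

```python
# Minimal sketch: fit y = intercept + slope * x by ordinary least squares.
# fit_line and the data below are hypothetical, for illustration only.

def fit_line(xs, ys):
    """Return (intercept, slope) minimizing squared error for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: hour of day vs. rides observed in that hour
hours = [6, 7, 8, 9, 10]
rides = [2, 4, 6, 8, 10]

intercept, slope = fit_line(hours, rides)
predicted_at_11 = intercept + slope * 11  # extrapolate one hour ahead
print(intercept, slope, predicted_at_11)
```

A model like this can only extend the straight-line trend it saw in training, which is worth keeping in mind when you evaluate on later data.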

## 1) Training vs testing

You'll want to test your model on data it hasn't seen before (for reasons described in the [Intro to Machine Learning Micro-Course](https://www.kaggle.com/learn/intro-to-machine-learning)). What do you think is a good approach to splitting the data? What data should we use to train the model, and what data should we use to test it?
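One option worth considering for time-stamped data is a time-based split: train on everything before a cutoff and test on everything after it. The sketch below assumes a handful of made-up trip records and a cutoff of 2018-01-01 (the boundary used later in this exercise); the station names are placeholders.

```python
from datetime import datetime

# Hypothetical (start_time, station) records, for illustration only
trips = [
    (datetime(2016, 5, 1, 8, 15), "Station A"),
    (datetime(2017, 11, 3, 17, 40), "Station B"),
    (datetime(2018, 2, 14, 9, 5), "Station A"),
    (datetime(2018, 7, 4, 12, 30), "Station B"),
]

# Everything strictly before the cutoff is training data;
# everything on or after it is held out for testing.
CUTOFF = datetime(2018, 1, 1)
train = [t for t in trips if t[0] < CUTOFF]
test = [t for t in trips if t[0] >= CUTOFF]

print(len(train), "training rows,", len(test), "test rows")
```

Splitting by time means the model is always evaluated on data from after the period it was trained on, which is how it would be used in practice.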

In [None]:
# You can write your notes here
# What do you think is a good approach to splitting the data?
# If I understand the question correctly, a good approach would be to split the data randomly into training and evaluation datasets.
# Moreover, if we need to study seasonality, I think it's worth first separating the data by season and then drawing training and evaluation datasets from each season.


In [None]:
# create a reference to our table
table = client.get_table("bigquery-public-data.austin_bikeshare.bikeshare_trips")

# look at five rows from our dataset
client.list_rows(table, max_results=5).to_dataframe()

## Training data

First, you'll write a query to get the data for model-building. You can use the public Austin bike share dataset from the `bigquery-public-data.austin_bikeshare.bikeshare_trips` table. You'll predict the number of rides based on the station where the trip starts and the hour when the trip started. Use the `TIMESTAMP_TRUNC` function to truncate the start time to the hour.
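If it helps to picture what truncating to the hour and counting rides does, here is a rough Python equivalent on a few made-up trips. The `truncate_to_hour` helper and the "Station B" name are assumptions for illustration; the real work happens in SQL.

```python
from collections import Counter
from datetime import datetime

def truncate_to_hour(ts):
    """Drop minutes, seconds, and microseconds (truncate, don't round)."""
    return ts.replace(minute=0, second=0, microsecond=0)

# Hypothetical (station, start_time) trips, for illustration only
trips = [
    ("22nd & Pearl", datetime(2017, 3, 1, 8, 5)),
    ("22nd & Pearl", datetime(2017, 3, 1, 8, 59)),
    ("22nd & Pearl", datetime(2017, 3, 1, 9, 0)),
    ("Station B", datetime(2017, 3, 1, 8, 30)),
]

# Count rides per (station, hour), like GROUP BY station, start_hour
num_rides = Counter((station, truncate_to_hour(ts)) for station, ts in trips)

for (station, start_hour), count in sorted(num_rides.items()):
    print(station, start_hour, count)
```

Note that 8:59 lands in the 8:00 bucket, not 9:00: truncation always rounds down to the start of the hour.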

## 2) Exercise: Query the training data

Write the query to retrieve your training data. The fields should be:
1. The start_station_name
2. The time the trip starts, truncated to the hour. Get this with `TIMESTAMP_TRUNC(start_time, HOUR) as start_hour`
3. The number of rides starting at the station during the hour. Call this `num_rides`.
Select only the data before 2018-01-01 (so we can save data from 2018 as testing data.)

Write your query below:

In [None]:
%%bigquery bike_rides
SELECT
  COUNT(bikeid) AS num_rides,
  IFNULL(start_station_name, "") AS start_station_name,
  TIMESTAMP_TRUNC(start_time, HOUR) AS start_hour
FROM
  `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE
  start_time < TIMESTAMP "2018-01-01 00:00:00"
GROUP BY start_station_name, start_hour
ORDER BY start_hour

You'll want to inspect your data to ensure it looks like what you expect. Run the line below to get a quick view of the data, and feel free to explore it more if you'd like (if you don't know how to do that, the [Pandas micro-course](https://www.kaggle.com/learn/pandas) might be helpful).

In [None]:
bike_rides

## Model creation

Now it's time to turn this data into a model. You'll use the `CREATE MODEL` statement that has a structure like: 

```sql
CREATE OR REPLACE MODEL `model_dataset.bike_trips`
OPTIONS(model_type='linear_reg') AS 
-- training data query goes here
SELECT ...
    column_with_labels AS label,
    column_with_data_1,
    column_with_data_2
FROM ... 
WHERE ... -- (optional)
GROUP BY ... -- (optional)
```

The `model_type` shown here is a good choice in general for predicting numeric outcomes with BQML.

**Tip:** Using ```CREATE OR REPLACE MODEL``` rather than just ```CREATE MODEL``` ensures you don't get an error if you want to run this command again without first deleting the model you've created.

## 3) Exercise: Create and train the model

Below, write your query to create and train a linear regression model on the training data.

Write your query below:

In [None]:
%%bigquery
CREATE OR REPLACE MODEL `model_dataset.sample_model`
OPTIONS(model_type='LINEAR_REG', INPUT_LABEL_COLS = ['num_rides']) AS
SELECT
  COUNT(bikeid) AS num_rides,
  IFNULL(start_station_name, "") AS start_station_name,
  TIMESTAMP_TRUNC(start_time, HOUR) AS start_hour
FROM
  `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE
  start_time < TIMESTAMP "2018-01-01 00:00:00"
GROUP BY start_station_name, start_hour

In [None]:
%%bigquery
SELECT
  *
FROM
  ML.TRAINING_INFO(MODEL `model_dataset.sample_model`)
ORDER BY iteration 

## 4) Exercise: Model evaluation

Now that you have a model, evaluate its performance on data from 2018.


> Note that the ML.EVALUATE function will return different metrics depending on what's appropriate for your specific model. You can just use the regular ML.EVALUATE function here. (ROC curves are generally used to evaluate binary problems, not linear regression, so there's no reason to plot one here.)

Write your query below:

In [None]:
%%bigquery
SELECT
  *
FROM ML.EVALUATE(MODEL `model_dataset.sample_model`, (
  SELECT
  COUNT(bikeid) AS num_rides,
  IFNULL(start_station_name, "") AS start_station_name,
  TIMESTAMP_TRUNC(start_time, HOUR) AS start_hour
FROM
  `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE
  start_time >= TIMESTAMP "2018-01-01 00:00:00"
GROUP BY start_station_name, start_hour))

You should see that the r^2 score here is negative. Negative values indicate that the model is worse than just predicting the mean rides for each example.
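To see how an r^2 score can go negative, here is a small worked example using the standard definition, r^2 = 1 - SS_res / SS_tot. The numbers are made up, and `r2_score` here is a hand-written helper, not a BigQuery function.

```python
def r2_score(actual, predicted):
    """Standard R^2: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical ride counts, for illustration only
actual = [10, 12, 14, 16]            # mean is 13
low_predictions = [2, 3, 2, 3]       # a model that badly underestimates
mean_predictions = [13, 13, 13, 13]  # always predict the mean

r2_low = r2_score(actual, low_predictions)
r2_mean = r2_score(actual, mean_predictions)

print("underestimating model:", r2_low)   # negative
print("predict-the-mean baseline:", r2_mean)  # zero
```

Always predicting the mean gives a score of exactly zero, so any model whose squared errors are larger than that baseline's ends up below zero.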

## 5) Theories for poor performance

Why would your model be doing worse than making the simplest prediction based on historical data?

In [None]:
## Thought question answer here
## Maybe we chose the wrong parameters or grouping for the prediction

## 6) Exercise: Looking at predictions

A good way to figure out where your model is going wrong is to look closer at a small set of predictions. Use your model to predict the number of rides for the 22nd & Pearl station in 2018. Compare the mean values of predicted vs actual riders.

Write your query below:

In [None]:
%%bigquery
SELECT
  start_station_name,
  AVG(num_rides) AS avg_rides,
  AVG(ROUND(predicted_num_rides)) AS avg_predicted_rides
FROM ML.PREDICT(MODEL `model_dataset.sample_model`, (
    SELECT
      COUNT(bikeid) AS num_rides,
      IFNULL(start_station_name, "") AS start_station_name,
      TIMESTAMP_TRUNC(start_time, HOUR) AS start_hour
    FROM
      `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    WHERE
      start_time BETWEEN TIMESTAMP "2018-01-01 00:00:00" AND TIMESTAMP "2018-12-31 23:59:59" AND start_station_name = "22nd & Pearl"
    GROUP BY start_station_name, start_hour))
GROUP BY start_station_name
ORDER BY avg_predicted_rides DESC

What you should see here is that the model is underestimating the number of rides by quite a bit. 

## 7) Exercise: Average daily rides per station

Either something is wrong with the model or something surprising is happening in the 2018 data. 

What could be happening in the data? Write a query to get the average number of daily riders per station for each year in the dataset and order by the year so you can see the trend. You can use the `EXTRACT` function to get the day and year from the start time timestamp. (You can read up on EXTRACT [in this lesson in the Intro to SQL course](https://www.kaggle.com/dansbecker/order-by)).
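One thing to watch when extracting "the day": in BigQuery, `EXTRACT(DAY FROM ...)` returns the day of the month (1 to 31), while `EXTRACT(DAYOFYEAR FROM ...)` returns the day of the year (1 to 366). The Python sketch below shows the difference on two made-up timestamps; grouping by day of month would lump January 15 and February 15 into the same bucket.

```python
from datetime import datetime

# Two hypothetical trips a month apart, for illustration only
ts_jan = datetime(2018, 1, 15, 9, 0)
ts_feb = datetime(2018, 2, 15, 9, 0)

# Day of month, analogous to EXTRACT(DAY FROM start_time)
day_jan = ts_jan.day
day_feb = ts_feb.day

# Day of year, analogous to EXTRACT(DAYOFYEAR FROM start_time)
doy_jan = ts_jan.timetuple().tm_yday
doy_feb = ts_feb.timetuple().tm_yday

print("day of month:", day_jan, day_feb)  # same value for both trips
print("day of year:", doy_jan, doy_feb)   # distinct values
```

For a per-day count within a year, the day of year (or the calendar date) is the grouping key that keeps each day separate.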

Write your query below:

In [None]:
%%bigquery
WITH year_rides AS ( 
    SELECT
      COUNT(bikeid) AS num_rides,
      start_station_name,
      EXTRACT(YEAR FROM start_time) AS year,
      -- DAYOFYEAR (1-366) keeps each calendar day separate; DAY would give day of month (1-31)
      EXTRACT(DAYOFYEAR FROM start_time) AS day
    FROM
      `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    GROUP BY start_station_name, year, day) 

SELECT 
    avg(num_rides) AS daily_rides, year
FROM year_rides
GROUP BY year
ORDER BY year

## 8) What do your results tell you?

Given the daily average riders per station over the years, does it make sense that the model is failing?

In [None]:
## Thought question answer here

## 9) Next steps

Given what you've learned, what improvements do you think you could make to your model? Share your ideas on the [Kaggle Learn Forums](https://www.kaggle.com/learn-forum)! (I'll pick a couple of my favorite ideas & send the folks who shared them a Kaggle t-shirt. :)

In [None]:
%%bigquery
CREATE OR REPLACE MODEL `model_dataset.better_sample_model`
OPTIONS(model_type='LINEAR_REG', INPUT_LABEL_COLS = ['num_rides']) AS
SELECT
  COUNT(bikeid) AS num_rides,
  IFNULL(start_station_name, "") AS start_station_name,
  TIMESTAMP_TRUNC(start_time, HOUR) AS start_hour,
  EXTRACT(DAY FROM start_time) AS day
FROM
  `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE
  start_time < TIMESTAMP "2018-01-01 00:00:00"
GROUP BY start_station_name, start_hour, day