# Stocking rental bikes

![bike rentals](https://upload.wikimedia.org/wikipedia/commons/thumb/a/a0/Bay_Area_Bike_Share_launch_in_San_Jose_CA.jpg/640px-Bay_Area_Bike_Share_launch_in_San_Jose_CA.jpg)

You stock bikes for a bike rental company in Austin, ensuring stations have enough bikes for all their riders. You decide to build a model to predict how many riders will start from each station during each hour, capturing patterns in seasonality, time of day, day of the week, etc.

To get started, create a project in GCP and connect to it by running the code cell below. Make sure you have connected the kernel to your GCP account in Settings.

In [None]:
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.bqml.ex1 import *

In [None]:
# Set your own project id here
PROJECT_ID = ____ # a string, like 'kaggle-bigquery-240818'

from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID, location="US")
dataset = client.create_dataset('model_dataset', exists_ok=True)

from google.cloud.bigquery import magics
from kaggle.gcp import KaggleKernelCredentials
magics.context.credentials = KaggleKernelCredentials()
magics.context.project = PROJECT_ID

## Linear Regression

Your dataset is quite large. BigQuery is especially efficient with large datasets, so you'll use BigQuery ML (BQML) to build your model. For numeric outcomes, like the number of riders, you'll use BQML's linear regression model.

## 1) Training vs testing

You'll want to test your model on data it hasn't seen before (for reasons described in the [Intro to Machine Learning Micro-Course](https://www.kaggle.com/learn/intro-to-machine-learning)). What do you think is a good approach to splitting the data? Which data should you use to train the model, and which data should you use to test it?

In [None]:
# Uncomment the following line to check the solution once you've thought about the answer
# q_1.solution()

## Training data

First, you'll write a query to get the data for model-building. You can use the public Austin bike share dataset from the `bigquery-public-data.austin_bikeshare.bikeshare_trips` table. You predict the number of rides based on the station where the trip starts and the hour when the trip started. Use the `TIMESTAMP_TRUNC` function to truncate the start time to the hour.
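
To make the aggregation concrete, here is a minimal pure-Python sketch of what truncating to the hour and counting per station does. The trip records and the station name `Example Station` are made up for illustration; only the logic mirrors `TIMESTAMP_TRUNC(start_time, HOUR)` followed by a `GROUP BY`.

```python
from collections import Counter
from datetime import datetime

# Hypothetical trips: (start_station_name, start_time). Illustrative values only.
trips = [
    ("22nd & Pearl", datetime(2017, 6, 1, 8, 5)),
    ("22nd & Pearl", datetime(2017, 6, 1, 8, 47)),
    ("22nd & Pearl", datetime(2017, 6, 1, 9, 10)),
    ("Example Station", datetime(2017, 6, 1, 8, 30)),
]


def truncate_to_hour(ts):
    """Zero out minutes, seconds, and microseconds, like TIMESTAMP_TRUNC(ts, HOUR)."""
    return ts.replace(minute=0, second=0, microsecond=0)


# Count rides per (station, hour) pair, like COUNT(...) with GROUP BY.
num_rides = Counter((station, truncate_to_hour(ts)) for station, ts in trips)

for (station, start_hour), count in sorted(num_rides.items()):
    print(station, start_hour.isoformat(), count)
```

Note that truncation always rounds down: a trip starting at 8:47 is counted in the 8:00 hour, not the 9:00 hour.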

## 2) Exercise: Query the training data

Write the query to retrieve your training data. The fields should be:
1. The `start_station_name`
2. The hour in which trips start. Get this with `TIMESTAMP_TRUNC(start_time, HOUR) as start_hour`, which truncates the start time to the hour
3. The number of rides starting at the station during the hour. Call this `num_rides`.

Select only the data before 2018-01-01, so you can save the data from 2018 as testing data.
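
The date cutoff amounts to a time-based train/test split. A minimal sketch of the idea in plain Python, using made-up `(start_hour, num_rides)` rows rather than the real table:

```python
from datetime import datetime

CUTOFF = datetime(2018, 1, 1)  # same boundary as in the exercise

# Hypothetical (start_hour, num_rides) rows; values are illustrative only.
rows = [
    (datetime(2016, 5, 3, 14), 3),
    (datetime(2017, 12, 31, 23), 1),
    (datetime(2018, 1, 1, 0), 2),
    (datetime(2018, 7, 4, 10), 5),
]

# Everything strictly before the cutoff is training data; the rest is test data.
train = [row for row in rows if row[0] < CUTOFF]
test = [row for row in rows if row[0] >= CUTOFF]

print(len(train), "training rows;", len(test), "test rows")
```

Using `<` for training and `>=` for testing keeps the two sets disjoint, so no hour appears in both.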

In [None]:
# Write your query to retrieve the training data
query = ____

# Create the query job. No changes needed below this line
query_job = client.query(query) 

# API request - run the query, and return DataFrame. No changes needed
model_data = query_job.to_dataframe() 

q_2.check()

In [None]:
# uncomment the lines below to get a hint or solution
# q_2.hint()
# q_2.solution()

In [None]:
## My solution code
query = """
        SELECT start_station_name, 
               TIMESTAMP_TRUNC(start_time, HOUR) as start_hour,
               COUNT(bikeid) as num_rides
        FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
        WHERE start_time < "2018-01-01"
        GROUP BY start_station_name, start_hour
        """

query_job = client.query(query)
model_data = query_job.to_dataframe()

You'll want to inspect your data to ensure it looks like what you expect. Run the line below to get a quick view of the data, and feel free to explore it more if you'd like. If you don't know how to do that, the [Pandas micro-course](https://www.kaggle.com/learn/pandas) might be helpful.

In [None]:
model_data.head(20)

## Model creation

Now it's time to turn this data into a model. You'll use the `CREATE MODEL` statement that has a structure like: 

```sql
CREATE OR REPLACE MODEL `model_dataset.bike_trips`
OPTIONS(model_type='linear_reg', 
        input_label_cols=['label_col'],
        optimize_strategy='batch_gradient_descent') AS 
-- training data query goes here
SELECT ...
FROM ... 
WHERE ...
GROUP BY ...
```

The `model_type` and `optimize_strategy` shown here are good parameters to use in general for predicting numeric outcomes with BQML.

**Tip:** Using ```CREATE OR REPLACE MODEL``` rather than just ```CREATE MODEL``` ensures you don't get an error if you want to run this command again without first deleting the model you've created.
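
As an illustration of this structure, the sketch below wraps a training query in a `CREATE OR REPLACE MODEL` statement. The helper `build_create_model_query` is hypothetical (it is not part of BigQuery or `learntools`) and only formats a string; nothing is sent to BigQuery.

```python
def build_create_model_query(model_name, label_col, training_query):
    """Wrap a training SELECT in a CREATE OR REPLACE MODEL statement.

    Hypothetical helper for illustration; it only builds the SQL text.
    """
    return f"""
CREATE OR REPLACE MODEL `{model_name}`
OPTIONS(model_type='linear_reg',
        input_label_cols=['{label_col}'],
        optimize_strategy='batch_gradient_descent') AS
{training_query.strip()}
"""


# The training query from the previous exercise, with the label column first.
training_query = """
SELECT COUNT(bikeid) as num_rides,
       start_station_name,
       TIMESTAMP_TRUNC(start_time, HOUR) as start_hour
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE start_time < "2018-01-01"
GROUP BY start_station_name, start_hour
"""

create_model_query = build_create_model_query(
    "model_dataset.bike_trips", "num_rides", training_query
)
print(create_model_query)
```

The resulting string could then be run the same way as the other queries in this notebook, with `client.query(...)` followed by `.result()`.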

## 3) Exercise: Create and train the model

Below, write your query to create and train a linear regression model on the training data.

In [None]:
# Write your query to create and train the model
query = ____

# Create the query job. No changes needed below this line
query_job = client.query(query) 

# API request - run the query. Models return an empty table. No changes needed
query_job.result()

In [None]:
## My solution

query = """
        CREATE OR REPLACE MODEL `model_dataset.bike_trips`
        OPTIONS(model_type='linear_reg', 
                input_label_cols=['num_rides'],
                optimize_strategy='batch_gradient_descent') AS
        SELECT COUNT(bikeid) as num_rides, 
               start_station_name, 
               TIMESTAMP_TRUNC(start_time, HOUR) as start_hour
        FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
        WHERE start_time < "2018-01-01"
        GROUP BY start_station_name, start_hour
        """

query_job = client.query(query)

# API request - run the query. Models return an empty table
query_job.result()

q_3.check()

In [None]:
# q_3.solution()

## 4) Exercise: Model evaluation

Now that you have a model, evaluate its performance on data from 2018. If you need help with 

In [None]:
# Write your query to evaluate the model
query = "____"

query_job = client.query(query)

# API request - run the query
evaluation_results = query_job.to_dataframe()
evaluation_results

q_4.check()

In [None]:
## My solution

query = """
        SELECT *
        FROM
        ML.EVALUATE(MODEL `model_dataset.bike_trips`, (
        SELECT COUNT(bikeid) as num_rides, 
               start_station_name, 
               TIMESTAMP_TRUNC(start_time, HOUR) as start_hour
        FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
        WHERE start_time >= "2018-01-01"
        GROUP BY start_station_name, start_hour
        ))
        """

query_job = client.query(query)

# API request - run the query
evaluation_results = query_job.to_dataframe()
evaluation_results

You should see that the r^2 score (the `r2_score` column) is negative. A negative value means the model does worse than simply predicting the mean number of rides for every example.
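
To see how a negative score can arise, here is a small worked example in plain Python. It assumes the standard definition of r^2 (one minus the ratio of the residual sum of squares to the total sum of squares); the ride counts and predictions are made up.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up hourly ride counts, and predictions that are consistently too low.
actual = [4, 6, 5, 7, 8]
predicted = [1, 2, 1, 2, 2]

# Baseline: always predict the mean of the actual values.
mean_baseline = [sum(actual) / len(actual)] * len(actual)

print("model r^2:   ", r2_score(actual, predicted))
print("baseline r^2:", r2_score(actual, mean_baseline))
```

The mean baseline scores exactly zero by construction, so any model whose squared errors are larger than the baseline's ends up below zero.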

## 5) Theories for poor performance

Why would your model be doing worse than making the most simple prediction?

**Answer:** It's possible there's something broken in the model algorithm. Or the data for 2018 is much different than the historical data before it.

In [None]:
## Thought question answer here

## 6) Exercise: Looking at predictions

A good way to figure out where your model is going wrong is to look closer at a small set of predictions. Use your model to predict the number of rides for the 22nd & Pearl station in 2018. Compare the mean values of predicted vs actual riders.

In [None]:
# Write the query here
query = "____"

query_job = client.query(query)

# API request - run the query
evaluation_results = query_job.to_dataframe()
evaluation_results

In [None]:
## My solution

query = """
        SELECT AVG(ROUND(predicted_num_rides)) as predicted_avg_riders, 
               AVG(num_rides) as true_avg_riders
        FROM
        ML.PREDICT(MODEL `model_dataset.bike_trips`, (
        SELECT COUNT(bikeid) as num_rides,
               start_station_name,
               TIMESTAMP_TRUNC(start_time, HOUR) as start_hour
        FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
        WHERE start_time >= "2018-01-01"
          AND start_station_name = "22nd & Pearl"
        GROUP BY start_station_name, start_hour
        ))
        -- ORDER BY start_hour
        """

query_job = client.query(query)

# API request - run the query
evaluation_results = query_job.to_dataframe()
evaluation_results

What you should see here is that the model is underestimating the number of rides by quite a bit. 
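
The same comparison can be sketched in plain Python on a few made-up rows shaped like `ML.PREDICT` output, where the prediction column is named `predicted_` plus the label name. One detail worth knowing: BigQuery's `ROUND` sends halfway cases away from zero, while Python's built-in `round()` sends them to the nearest even integer, so the sketch uses its own rounding helper (a hypothetical name, defined here).

```python
import math


def round_half_away_from_zero(x):
    """Round to the nearest integer, sending halfway cases away from zero."""
    return math.copysign(math.floor(abs(x) + 0.5), x)


# Hypothetical ML.PREDICT output rows; the numbers are made up.
rows = [
    {"predicted_num_rides": 1.5, "num_rides": 3},
    {"predicted_num_rides": 2.5, "num_rides": 4},
    {"predicted_num_rides": 0.9, "num_rides": 2},
]

predicted_avg_riders = sum(
    round_half_away_from_zero(r["predicted_num_rides"]) for r in rows
) / len(rows)
true_avg_riders = sum(r["num_rides"] for r in rows) / len(rows)

print("predicted:", predicted_avg_riders, "actual:", true_avg_riders)
```

In this toy data the predicted average is below the actual average, which is the same pattern of underestimation described above.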

## 7) Exercise: Average daily rides per station

Either something is wrong with the model or something surprising is happening in the 2018 data. 

What could be happening in the data? Write a query to get the average number of daily riders per station for each year in the dataset, and order by the year so you can see the trend. You can use the `EXTRACT` function to get the day and year from the start time timestamp.
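
The aggregation has three steps: count rides per station per day, average those daily counts per station and year, then average the station averages per year. A minimal pure-Python sketch of that logic, using made-up trips and the placeholder station names `A` and `B`:

```python
from collections import Counter, defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical trips: (start_station_name, start_time). Illustrative values only.
trips = [
    ("A", datetime(2017, 3, 1, 8)),
    ("A", datetime(2017, 3, 1, 9)),
    ("A", datetime(2017, 3, 2, 8)),
    ("B", datetime(2017, 3, 1, 12)),
    ("A", datetime(2018, 3, 1, 8)),
    ("A", datetime(2018, 3, 1, 9)),
    ("A", datetime(2018, 3, 1, 17)),
    ("B", datetime(2018, 3, 1, 12)),
]

# Step 1: rides per (station, year, day of year).
daily_rides = Counter(
    (station, ts.year, ts.timetuple().tm_yday) for station, ts in trips
)

# Step 2: average daily rides per (station, year).
per_station = defaultdict(list)
for (station, year, _doy), n in daily_rides.items():
    per_station[(station, year)].append(n)
station_averages = {key: mean(counts) for key, counts in per_station.items()}

# Step 3: average of the station averages per year, ordered by year.
per_year = defaultdict(list)
for (_station, year), avg in station_averages.items():
    per_year[year].append(avg)
daily_rides_per_station = {
    year: mean(avgs) for year, avgs in sorted(per_year.items())
}

print(daily_rides_per_station)
```

As in a `GROUP BY` on the trips table, days on which a station had no rides never appear in step 1, so they do not pull the averages down.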

In [None]:
# Write the query here
query = "____"

# Create the query job
query_job = ____

# API request - run the query and return a pandas DataFrame
evaluation_results = ____
evaluation_results

In [None]:
## My solution

query = """
        WITH daily_rides AS (
            SELECT COUNT(bikeid) AS num_rides,
                   start_station_name,
                   EXTRACT(DAYOFYEAR from start_time) AS doy,
                   EXTRACT(YEAR from start_time) AS year
            FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
            GROUP BY start_station_name, doy, year
            ORDER BY year
        ), 
        station_averages AS (
            SELECT avg(num_rides) AS avg_riders, start_station_name, year
            FROM daily_rides
            GROUP BY start_station_name, year)
        
        SELECT avg(avg_riders) AS daily_rides_per_station, year
        FROM station_averages
        GROUP BY year
        ORDER BY year
        """

query_job = client.query(query)

# API request - run the query
evaluation_results = query_job.to_dataframe()
evaluation_results

## 8) What do your results tell you?

Given the daily average riders per station over the years, does it make sense that the model is failing?

**Answer:** The daily average riders went from around 10 in 2017 to over 16 in 2018. This change in the bikesharing program caused your model to underestimate the number of riders in 2018. Unexpected things can happen when you predict the future in an ever-changing area. Knowledge of a topic can be helpful here, and if you knew enough about the program, you might be able to predict (or at least explain) these types of changes over time.

In [None]:
## Thought question answer here

## 9) A Better Scenario

It's disappointing that your model was so inaccurate on 2018 data. Fortunately, this issue of the world changing over time is the exception rather than the rule. 

To check this, you can train a second model on data that goes through the end of 2016, and then see how it performs on data from 2017. First, create the model.

In [None]:
# Write your query to create and train the model
query = "____"

# Create the query job
query_job = ____ # Your code goes here

# API request - run the query. Models return an empty table
____ # Your code goes here

In [None]:
## My solution

query = """
        CREATE OR REPLACE MODEL `model_dataset.bike_trips_2017`
        OPTIONS(model_type='linear_reg', 
                input_label_cols=['num_rides'],
                optimize_strategy='batch_gradient_descent') AS
        SELECT COUNT(bikeid) as num_rides, 
               start_station_name, 
               TIMESTAMP_TRUNC(start_time, HOUR) as start_hour
        FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
        WHERE start_time < "2017-01-01"
        GROUP BY start_station_name, start_hour
        """

query_job = client.query(query)

# API request - run the query. Models return an empty table
query_job.result()

Now write the query to evaluate your model using data from 2017.

In [None]:
# Write your query to evaluate the model
query = "____"

query_job = client.query(query)

# API request - run the query and return the evaluation metrics as a DataFrame
evaluation_results = query_job.to_dataframe()
evaluation_results

In [None]:
query = """
        SELECT *
        FROM
        ML.EVALUATE(MODEL `model_dataset.bike_trips_2017`, (
        SELECT COUNT(bikeid) as num_rides, 
               start_station_name, 
               TIMESTAMP_TRUNC(start_time, HOUR) as start_hour
        FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
        WHERE start_time >= "2017-01-01" AND start_time < "2018-01-01"
        GROUP BY start_station_name, start_hour
        ))
        """

query_job = client.query(query)

# API request - run the query
evaluation_results = query_job.to_dataframe()
evaluation_results