# Feature Engineering in BQ and BQML

In this Notebook we will do two exercises. First we will use standard ansi SQL to generate some basic new features. For the second exercise will then use BigQuery ML to train a model and us the build in transformations. BigQuery ML enables users to create and execute machine learning models in BigQuery using SQL queries. The goal is to democratize machine learning by enabling SQL practitioners to build models using their existing tools and to increase development speed by eliminating the need for data movement.

Weâ€™ll use Airline Ontime Performance data, a 70 million row data set from the U.S. Bureau of Transportation statistics, that is available to all users in BigQuery as the airline_ontime_data.flights data set.

## Objectives
In this tutorial, you will:

+ Basic Feature Engineering BQ
+ Feature Engineering BQML 
+ BigQuery ML to prepare a dataset that can be used for analytics and modeling. 
+ Create and train your first BigQuery model. 
+ Evaluate Model

## Feature generation using BQ. 

We have already seen in our previous Data Analysis exercises that we can generate a new feature using:

+ EXTRACT(DAY FROM CAST(date AS DATE)) AS flight_day
+ FORMAT_DATE('%a',  CAST(date AS DATE)) AS flight_day_of_week

You can also use SQL to do things like updating missing values. 

+ IFNULL(airline, 'N/A') AS NA

You can see how it works running the following example.   

In [1]:
%%bigquery --verbose

SELECT 
    CAST(date AS DATE) AS flight_date, 
    EXTRACT(DAY FROM CAST(date AS DATE)) AS flight_day, 
    FORMAT_DATE('%a',  CAST(date AS DATE)) AS flight_day_of_week, 
    airline,
    departure_delay,
    arrival_delay,
    IFNULL(airline, 'N/A') AS NA,
    IFNULL(arrival_delay, 0) AS label
FROM 
    `bigquery-samples.airline_ontime_data.flights`
LIMIT 5

Executing query with job ID: 952f2f67-c7d4-42e6-91df-ab14f3c32575
Query executing: 0.35s
Query complete after 1.28s


Unnamed: 0,flight_date,flight_day,flight_day_of_week,airline,departure_delay,arrival_delay,NA,label
0,2012-06-09,9,Sat,F9,21.0,35.0,F9,35.0
1,2012-08-01,1,Wed,F9,9.0,16.0,F9,16.0
2,2012-07-11,11,Wed,F9,-6.0,-15.0,F9,-15.0
3,2012-06-29,29,Fri,F9,12.0,8.0,F9,8.0
4,2012-06-13,13,Wed,F9,14.0,19.0,F9,19.0


Now if we want to build a classification model that pridicts if a plane will have a delay, meaning departure or arrival delay, then we need to generate a new feature: delayed (yes / no). Let's try to write a query that generates a new target feature. 

To do:
+ Write a query generates a new feature called delayed that can have the value yes / no. 
+ Yes if any delay ( > 0) for departure_delay or arrival_delay. 
+ Feel free to select any features you like as long as you generate the new feature. 

In [2]:
%%bigquery --verbose

SELECT 
    CAST(date AS DATE) AS flight_date, 
    EXTRACT(DAY FROM CAST(date AS DATE)) AS flight_day, 
    FORMAT_DATE('%a',  CAST(date AS DATE)) AS flight_day_of_week, 
    airline,
    departure_delay,
    arrival_delay,
    IF(arrival_delay > 0 OR departure_delay > 0, "Delay", "No_delay") AS Delayed
FROM 
    `bigquery-samples.airline_ontime_data.flights`
LIMIT 10

Executing query with job ID: 8ad499f1-7b8d-401a-a568-5a1e6982e544
Query executing: 1.37s
Query complete after 1.58s


Unnamed: 0,flight_date,flight_day,flight_day_of_week,airline,departure_delay,arrival_delay,Delayed
0,2006-08-20,20,Sun,KH,-2.0,-3.0,No_delay
1,2006-08-19,19,Sat,KH,-6.0,-9.0,No_delay
2,2006-08-18,18,Fri,KH,-1.0,-6.0,No_delay
3,2006-08-17,17,Thu,KH,-3.0,-4.0,No_delay
4,2006-08-16,16,Wed,KH,-6.0,-12.0,No_delay
5,2006-08-15,15,Tue,KH,-2.0,-7.0,No_delay
6,2006-08-14,14,Mon,KH,-5.0,-7.0,No_delay
7,2006-08-13,13,Sun,KH,-7.0,-7.0,No_delay
8,2006-08-12,12,Sat,KH,3.0,2.0,Delay
9,2006-08-11,11,Fri,KH,3.0,0.0,Delay


## Please stop here :)

In [None]:
# Only run if needed

from google.cloud import bigquery

client = bigquery.Client(location="US")

## Let's build our first BigQuery ML model 

Let's build a model using just the raw data. It's not going to be very good, but sometimes it is good to actually experience this. The model will take a bit of time to train. When it comes to ML, this is very fast.

To do:
+ Write your own model use the code below as a starting point. 
+ Tip: Use the [console](https://console.cloud.google.com/) to write, run and debug your query. 
+ Have look at the [BQML documentation](https://cloud.google.com/bigquery-ml/docs/bigqueryml-transform). 
+ Use as your target: departure_delay or arrival_delay.
+ Do not worry to much about performance. Keep things simple for now. 
+ Use a random sample. 

In [4]:
%%bigquery --verbose

CREATE OR REPLACE MODEL bqml_tutorial.model_basic
OPTIONS(input_label_cols=['arrival_delay'], model_type='linear_reg') AS
SELECT 
    airline,
    CAST(date AS DATE) AS flight_date, 
    FORMAT_DATE('%b',  CAST(date AS DATE)) AS flight_month, 
    EXTRACT(DAY FROM CAST(date AS DATE)) AS flight_day, 
    EXTRACT(YEAR FROM CAST(date AS DATE)) AS flight_year,
    departure_delay,
    arrival_delay,
    departure_airport,
    arrival_airport
FROM `bigquery-samples.airline_ontime_data.flights`
WHERE RAND() < 0.001

Executing query with job ID: e3e78528-4953-40d2-a6b9-c3ca29fd7669
Query executing: 40.42s
Query complete after 42.43s


## Model two

Now it's time to build a more complex model and add a BigQuery ML native transformation. 

To do:
+ Copy and past the first BigQuery Model you created and add:
    + Add extra BQML options to the model. 
    + Implement BigQuery ML native transformation (choose any). Have a look at the [documentation](https://cloud.google.com/bigquery-ml/docs/bigqueryml-transform)
+ Use as your target: departure_delay or arrival_delay.
+ Use a random sample. 
+ Train on data from: 2006 AND 2011
+ Tip: Use the [console](https://console.cloud.google.com/) to write, run and debug your query. 

In [5]:
%%bigquery --verbose

CREATE OR REPLACE MODEL bqml_tutorial.model_transform
TRANSFORM(
          flight_date,
          flight_day,
          airline, 
          flight_month,
          flight_year,
          departure_airport,
          arrival_airport,
          arrival_delay,
          ML.MIN_MAX_SCALER(departure_delay) OVER() as buckets
          )
OPTIONS(input_label_cols=['arrival_delay'],  DATA_SPLIT_METHOD='AUTO_SPLIT', OPTIMIZE_STRATEGY='BATCH_GRADIENT_DESCENT', model_type='linear_reg') AS
SELECT 
    flight_date,
    airline,
    flight_day, 
    flight_month,
    flight_year,
    departure_delay,
    arrival_delay,
    departure_airport,
    arrival_airport
FROM 
    (SELECT 
    CAST(date AS DATE) AS flight_date, 
    FORMAT_DATE('%b',  CAST(date AS DATE)) AS flight_month, 
    EXTRACT(DAY FROM CAST(date AS DATE)) AS flight_day, 
    EXTRACT(YEAR FROM CAST(date AS DATE)) AS flight_year,
    FORMAT_DATE('%a',  CAST(date AS DATE)) AS flight_day_of_week,
    airline, 
    departure_airport,
    arrival_airport,
    departure_delay,
    arrival_delay
    FROM `bigquery-samples.airline_ontime_data.flights`)
WHERE flight_year BETWEEN 2006 AND 2011
AND RAND() < 0.001

Executing query with job ID: f8ca7c4b-b0e9-4191-b82a-0917778ce30e
Query executing: 45.04s
Query complete after 46.98s


## Model evaluation. 

Now it's time to evaluate the model you just trained. 

To do:
+ Write the full code for doing model evalution.
+ Have look at the [BQML documentation](https://cloud.google.com/bigquery-ml/docs/bigqueryml-web-ui-start). 
+ Make sure to use the same features as that you trained on. 
+ Use data from a different time period then wehere you trained on (Tip: use the where clause). 
+ Use a random sample. 
+ Tip: Use the [console](https://console.cloud.google.com/) to write, run and debug your query. 

In [7]:
%%bigquery --verbosity

SELECT SQRT(mean_squared_error) AS rmse FROM ML.EVALUATE(MODEL bqml_tutorial.model1,
(
SELECT 
    flight_date,
    airline,
    flight_day, 
    flight_month,
    flight_year,
    departure_delay,
    arrival_delay,
    departure_airport,
    arrival_airport
FROM 
    (SELECT 
    CAST(date AS DATE) AS flight_date, 
    FORMAT_DATE('%b',  CAST(date AS DATE)) AS flight_month, 
    EXTRACT(DAY FROM CAST(date AS DATE)) AS flight_day, 
    EXTRACT(YEAR FROM CAST(date AS DATE)) AS flight_year,
    FORMAT_DATE('%a',  CAST(date AS DATE)) AS flight_day_of_week,
    airline, 
    departure_airport,
    arrival_airport,
    departure_delay,
    arrival_delay
    FROM `bigquery-samples.airline_ontime_data.flights`)
WHERE flight_year=2012
AND RAND() < 0.001
))

Unnamed: 0,rmse
0,20.723608


Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
