**This notebook is an exercise in the [Advanced SQL](https://www.kaggle.com/learn/advanced-sql) course.  You can reference the tutorial at [this link](https://www.kaggle.com/alexisbcook/analytic-functions).**

---


# Introduction

Here, you'll use window functions to answer questions about the [Chicago Taxi Trips](https://www.kaggle.com/chicago/chicago-taxi-trips-bq) dataset.

Before you get started, run the code cell below to set everything up.

In [9]:
# Get most recent checking code
!pip install -U -t /kaggle/working/ git+https://github.com/Kaggle/learntools.git
# Set up feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.sql_advanced.ex2 import *
print("Setup Complete")

Collecting git+https://github.com/Kaggle/learntools.git
  Cloning https://github.com/Kaggle/learntools.git to /tmp/pip-req-build-54073uu6
  Running command git clone --filter=blob:none --quiet https://github.com/Kaggle/learntools.git /tmp/pip-req-build-54073uu6
  Resolved https://github.com/Kaggle/learntools.git to commit a0c08ac1fa73dd42ba4aecf2cce045752f14f72b
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: learntools
  Building wheel for learntools (setup.py) ... done
  Created wheel for learntools: filename=learntools-0.3.5-py3-none-any.whl size=269685 sha256=9c49f7f92a62bd452056d1ab1a6af6033d1cdfb5a237c78fe33ee82eaa4195a0
  Stored in directory: /tmp/pip-ephem-wheel-cache-8ggtjkn3/wheels/50/16/ba/3e0ec276f3238de9e1c59751f34c7c3ea4a3f561af10c4fd0c
Successfully built learntools
Installing collected packages: learntools
Successfully installed learntools-0.3.5
Setup Complete


The following code cell fetches the `taxi_trips` table from the `chicago_taxi_trips` dataset. We also preview the first five rows of the table.  You'll use the table to answer the questions below.

In [10]:
from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "chicago_taxi_trips" dataset
dataset_ref = client.dataset("chicago_taxi_trips", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "taxi_trips" table
table_ref = dataset_ref.table("taxi_trips")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five rows of the table
client.list_rows(table, max_results=5).to_dataframe()

Using Kaggle's public dataset BigQuery integration.




Unnamed: 0,unique_key,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,pickup_community_area,dropoff_community_area,...,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,pickup_location,dropoff_latitude,dropoff_longitude,dropoff_location
0,7dfc5e435eebecdcff5c5e34743c8eb5e5f9bc71,6e24d43a7ccff7c16428c8997b99ed9c15dccb81f85f4c...,2015-08-17 23:00:00+00:00,2015-08-17 23:00:00+00:00,0,0.0,,,,,...,,,Credit Card,,,,,,,
1,ddfe58f156b3ce48100cc1997b2b7cb217640628,6e24d43a7ccff7c16428c8997b99ed9c15dccb81f85f4c...,2015-08-17 23:00:00+00:00,2015-08-17 23:00:00+00:00,60,0.0,,,,,...,,,Credit Card,,,,,,,
2,30544de3182e5183eb98f34686d054f0e14d09b2,209204ff3fb18a2e31603ef9cf6d4c0fa21d5dddd332a4...,2015-09-16 18:00:00+00:00,2015-09-16 18:00:00+00:00,360,1.8,,,,,...,,,Credit Card,,,,,,,
3,7a9e0c07ddc65110cd90ccbfc3d3c2cb54ebd86e,e4318a8db4098de8acda060ba7fe7ef05c240a1c81de3f...,2015-09-04 15:30:00+00:00,2015-09-04 15:45:00+00:00,900,3.3,,,,,...,,,Credit Card,,,,,,,
4,be984c2c2c472d4363da100d9ca864f341a5f333,218100bdb8cbbbde4b4edf363cf5890c6bb7e3cb749f7f...,2015-09-08 10:30:00+00:00,2015-09-08 10:30:00+00:00,0,0.93,,17031081401.0,,8.0,...,,,Credit Card,,,,,41.895033,-87.619711,POINT (-87.6197106717 41.8950334495)


# Exercises

### 1) How can you predict the demand for taxis?

Say you work for a taxi company, and you're interested in predicting the demand for taxis.  Towards this goal, you'd like to create a plot that shows a rolling average of the daily number of taxi trips.  Amend the (partial) query below to return a DataFrame with two columns:
- `trip_date` - contains one entry for each date from January 1, 2016, to March 31, 2016.
- `avg_num_trips` - shows the average number of daily trips, calculated over a window including the value for the current date, along with the values for the preceding 3 days and the following 3 days, as long as the days fit within the three-month time frame.  For instance, when calculating the value in this column for January 3, 2016, the window will include the number of trips for the preceding 2 days, the current date, and the following 3 days.

This query is partially completed for you, and you need only write the part that calculates the `avg_num_trips` column.  Note that this query uses a common table expression (CTE); if you need to review how to use CTEs, you're encouraged to check out [this tutorial](https://www.kaggle.com/dansbecker/as-with) in the [Intro to SQL](https://www.kaggle.com/learn/intro-to-sql) course.

In [11]:
# Fill in the blank below
avg_num_trips_query = """
                      WITH trips_by_day AS
                      (
                      SELECT DATE(trip_start_timestamp) AS trip_date,
                          COUNT(*) as num_trips
                      FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                      WHERE trip_start_timestamp > '2016-01-01' AND trip_start_timestamp < '2016-04-01'
                      GROUP BY trip_date
                      ORDER BY trip_date
                      )
                      SELECT trip_date,
                          AVG(num_trips)
                          OVER (
                               ORDER BY trip_date
                               ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
                               ) AS avg_num_trips
                      FROM trips_by_day
                      """

# Check your answer
q_1.check()

In [12]:
# Lines below will give you a hint or solution code
q_1.hint()
q_1.solution()

<span style="color:#3366cc">Hint:</span> Use the **AVG()** function. Write an **OVER** clause that orders the rows by the `trip_date` column and uses a window that includes the 3 preceding rows, the current row, and the following 3 rows.

<span style="color:#33cc99">Solution:</span> 
```python

avg_num_trips_query = """
                      WITH trips_by_day AS
                      (
                      SELECT DATE(trip_start_timestamp) AS trip_date,
                          COUNT(*) as num_trips
                      FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                      WHERE trip_start_timestamp > '2016-01-01' AND trip_start_timestamp < '2016-04-01'
                      GROUP BY trip_date
                      )
                      SELECT trip_date,
                          AVG(num_trips) 
                          OVER (
                               ORDER BY trip_date
                               ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
                               ) AS avg_num_trips
                      FROM trips_by_day
                      """

```
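
To see how the `ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING` frame shrinks near the edges of the date range, the same window clause can be tried on a small local table. The sketch below is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in for BigQuery (assuming the bundled SQLite is version 3.25 or newer, which is when window functions were added), and the eight rows of trip counts are made up.

```python
import sqlite3

# Toy data (made up for illustration): one row per day with a trip count.
rows = [
    ("2016-01-01", 10),
    ("2016-01-02", 20),
    ("2016-01-03", 30),
    ("2016-01-04", 40),
    ("2016-01-05", 50),
    ("2016-01-06", 60),
    ("2016-01-07", 70),
    ("2016-01-08", 80),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips_by_day (trip_date TEXT, num_trips INTEGER)")
conn.executemany("INSERT INTO trips_by_day VALUES (?, ?)", rows)

# Same frame as in the solution: up to 3 rows before, the current row,
# and up to 3 rows after. Rows outside the table are simply left out.
toy_query = """
    SELECT trip_date,
           AVG(num_trips) OVER (
               ORDER BY trip_date
               ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
           ) AS avg_num_trips
    FROM trips_by_day
    ORDER BY trip_date
"""
rolling = conn.execute(toy_query).fetchall()
conn.close()

for trip_date, avg_num_trips in rolling:
    print(trip_date, avg_num_trips)
```

In this toy table, the value for January 1 averages only four rows (the current day and the three that follow), while January 4 is the first date whose window holds all seven rows.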

### 2) Can you separate and order trips by community area?

The query below returns a DataFrame with three columns from the table: `pickup_community_area`, `trip_start_timestamp`, and `trip_end_timestamp`.  

Amend the query to return an additional column called `trip_number` which shows the order in which the trips were taken from their respective community areas.  So, the first trip of the day originating from community area 1 should receive a value of 1; the second trip of the day from the same area should receive a value of 2.  Likewise, the first trip of the day from community area 2 should receive a value of 1, and so on.

Note that there are many numbering functions that can be used to solve this problem (depending on how you want to deal with trips that started at the same time from the same community area); to answer this question, please use the **RANK()** function.

In [13]:
# Amend the query below
trip_number_query = """
                    SELECT pickup_community_area,
                        trip_start_timestamp,
                        trip_end_timestamp
                    FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                    WHERE DATE(trip_start_timestamp) = '2013-10-03'
                    """

# Check your answer
q_2.check()



Unnamed: 0,pickup_community_area,trip_start_timestamp,trip_end_timestamp
0,58,2013-10-03 01:15:00+00:00,2013-10-03 01:30:00+00:00
1,32,2013-10-03 02:15:00+00:00,2013-10-03 02:45:00+00:00
2,32,2013-10-03 04:15:00+00:00,2013-10-03 04:30:00+00:00
3,12,2013-10-03 01:00:00+00:00,2013-10-03 01:15:00+00:00
4,8,2013-10-03 05:30:00+00:00,2013-10-03 05:45:00+00:00


<span style="color:#cc3333">Incorrect:</span> There are many different numbering functions that enumerate the rows in the input. For this exercise, please use the **RANK()** function.

In [14]:
# Lines below will give you a hint or solution code
#q_2.hint()
#q_2.solution()
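
One possible way to build the `trip_number` column is to add a `RANK()` call with an `OVER` clause that partitions by `pickup_community_area` and orders by `trip_start_timestamp`. The sketch below is not the official solution. It runs that clause on a made-up five-row table with Python's built-in `sqlite3` module (assuming SQLite 3.25 or newer) rather than on BigQuery, mainly to show how `RANK()` handles trips that start at the same time.

```python
import sqlite3

# Toy data (made up for illustration): community area and trip start time.
trips = [
    (1, "2013-10-03 00:15:00"),
    (1, "2013-10-03 00:15:00"),  # same start time as the row above (a tie)
    (1, "2013-10-03 00:45:00"),
    (2, "2013-10-03 00:30:00"),
    (2, "2013-10-03 01:00:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxi_trips (pickup_community_area INTEGER, trip_start_timestamp TEXT)"
)
conn.executemany("INSERT INTO taxi_trips VALUES (?, ?)", trips)

# RANK() restarts at 1 in every partition (community area) and gives tied
# rows the same number, leaving a gap after the tie.
rank_query = """
    SELECT pickup_community_area,
           trip_start_timestamp,
           RANK() OVER (
               PARTITION BY pickup_community_area
               ORDER BY trip_start_timestamp
           ) AS trip_number
    FROM taxi_trips
    ORDER BY pickup_community_area, trip_start_timestamp
"""
ranked = conn.execute(rank_query).fetchall()
conn.close()

trip_numbers = [row[2] for row in ranked]
print(trip_numbers)
```

The two tied trips from area 1 share the value 1 and the next trip gets 3. `ROW_NUMBER()` or `DENSE_RANK()` would number those rows differently, which is why the exercise names a specific function.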

### 3) How much time elapses between trips?

The (partial) query in the code cell below shows, for each trip in the selected time frame, the corresponding `taxi_id`, `trip_start_timestamp`, and `trip_end_timestamp`. 

Your task in this exercise is to edit the query to include an additional `prev_break` column that shows the length of the break (in minutes) that the driver had before each trip started (this corresponds to the time between `trip_start_timestamp` of the current trip and `trip_end_timestamp` of the previous trip).  Partition the calculation by `taxi_id`, and order the results within each partition by `trip_start_timestamp`.

Some sample results are shown below, where all rows correspond to the same driver (or `taxi_id`).  Take the time now to make sure that the values in the `prev_break` column make sense to you!

![first_commands](https://storage.googleapis.com/kaggle-media/learn/images/qjvQzg8.png)

Note that the first trip of the day for each driver should have a value of **NaN** (not a number) in the `prev_break` column.

In [20]:
# Fill in the blanks below
break_time_query = """
                   SELECT taxi_id,
                       trip_start_timestamp,
                       trip_end_timestamp,
                       TIMESTAMP_DIFF(
                           trip_start_timestamp, 
                           LAG(trip_end_timestamp,1) 
                               OVER (
                                    PARTITION BY taxi_id
                                    ORDER BY trip_start_timestamp), 
                           MINUTE) as prev_break
                   FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                   WHERE DATE(trip_start_timestamp) = '2013-10-03' 
                   """

# Check your answer
q_3.check()



Unnamed: 0,taxi_id,trip_start_timestamp,trip_end_timestamp,prev_break
0,1f2e1481c3358ba234b875b1b0ba26bb61e2f02fa4c463...,2013-10-03 13:45:00+00:00,2013-10-03 14:00:00+00:00,240
1,2f1df8a630d47fa448585b72a26287f920aa3349fb90e6...,2013-10-03 14:30:00+00:00,2013-10-03 15:00:00+00:00,255
2,3bbda357f5d7b4f86d15cd574862ba30dc992a696061f1...,2013-10-03 18:45:00+00:00,2013-10-03 18:45:00+00:00,225
3,49692f79df4c1c5856e2568d485fd41a63acc6e2b16b5c...,2013-10-03 18:45:00+00:00,2013-10-03 18:45:00+00:00,-30
4,4e6624b0e280711a981d800a90d6fb2a923e6ad55fa872...,2013-10-03 16:15:00+00:00,2013-10-03 16:30:00+00:00,900


<span style="color:#33cc33">Correct</span>

In [19]:
# Lines below will give you a hint or solution code
q_3.hint()
q_3.solution()

<span style="color:#3366cc">Hint:</span> The `TIMESTAMP_DIFF()` function takes three arguments, where the first (`trip_start_timestamp`) and the last (`MINUTE`) are provided for you.  This function provides the time difference (in minutes) of the timestamps in the first two arguments. You need only fill in the second argument, which should use the **LAG()** function to pull the timestamp corresponding to the end of the previous trip (for the same `taxi_id`).

<span style="color:#33cc99">Solution:</span> 
```python

break_time_query = """
                   SELECT taxi_id,
                       trip_start_timestamp,
                       trip_end_timestamp,
                       TIMESTAMP_DIFF(
                           trip_start_timestamp, 
                           LAG(trip_end_timestamp, 1) OVER (PARTITION BY taxi_id ORDER BY trip_start_timestamp), 
                           MINUTE) as prev_break
                   FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                   WHERE DATE(trip_start_timestamp) = '2013-10-03' 
                   """

break_time_result = client.query(break_time_query).result().to_dataframe()

```
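
The behavior of `LAG()` within each `taxi_id` partition can also be checked on a small local table. The sketch below is an illustration only: it uses Python's built-in `sqlite3` module (assuming SQLite 3.25 or newer) with made-up trips, and because SQLite has no `TIMESTAMP_DIFF()`, it computes the difference in minutes from Unix seconds via `strftime('%s', ...)` instead.

```python
import sqlite3

# Toy data (made up for illustration): taxi id, trip start, trip end.
trips = [
    ("taxi_a", "2013-10-03 08:00:00", "2013-10-03 08:15:00"),
    ("taxi_a", "2013-10-03 08:30:00", "2013-10-03 09:00:00"),
    # This trip starts before the previous one is recorded as ending.
    ("taxi_a", "2013-10-03 08:45:00", "2013-10-03 09:15:00"),
    ("taxi_b", "2013-10-03 10:00:00", "2013-10-03 10:30:00"),
    ("taxi_b", "2013-10-03 12:30:00", "2013-10-03 12:45:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxi_trips "
    "(taxi_id TEXT, trip_start_timestamp TEXT, trip_end_timestamp TEXT)"
)
conn.executemany("INSERT INTO taxi_trips VALUES (?, ?, ?)", trips)

# LAG(..., 1) pulls trip_end_timestamp from the previous row of the same
# taxi_id; the first row of each partition has no previous row, so it is NULL.
lag_query = """
    SELECT taxi_id,
           trip_start_timestamp,
           (CAST(strftime('%s', trip_start_timestamp) AS INTEGER)
            - CAST(strftime('%s',
                   LAG(trip_end_timestamp, 1) OVER (
                       PARTITION BY taxi_id
                       ORDER BY trip_start_timestamp
                   )) AS INTEGER)
           ) / 60 AS prev_break
    FROM taxi_trips
    ORDER BY taxi_id, trip_start_timestamp
"""
lagged = conn.execute(lag_query).fetchall()
conn.close()

breaks = [row[2] for row in lagged]
print(breaks)
```

The first trip of each taxi comes back as `None` (shown as NaN once the BigQuery result is loaded into a DataFrame). A negative value appears when a trip starts before the previous trip's recorded end, as in the `-30` row of the output above.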

# Keep going

Move on to learn how to query **[nested and repeated data](https://www.kaggle.com/alexisbcook/nested-and-repeated-data)**.

---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/advanced-sql/discussion) to chat with other learners.*