# Introduction

So far, you've worked with many types of data, including numeric types (integers, floating-point values), strings, and the [DATETIME](https://www.kaggle.com/dansbecker/order-by) type.  In this tutorial, you'll learn how to query nested and repeated data.  These are the most complex data types that you can find in BigQuery datasets!

# Nested data 

Consider a hypothetical dataset containing information about pets and their toys.  We could organize this information in two different tables (a `pets` table and a `toys` table).  The `toys` table could contain a "Pet_ID" column that could be used to match each toy to the pet that owns it.

Another option in BigQuery is to organize all of the information in a single table, similar to the `pets_and_toys` table below.

![nested data](https://i.imgur.com/ZC61pQs.png)

In this case, all of the information from the `toys` table is collapsed into a single column (the "Toy" column in the `pets_and_toys` table) containing multiple types of information.

We can refer to the "Toy" column in the `pets_and_toys` table as a **nested** column, and say that the "Color" and "Name" fields are nested inside of it.  Nested columns have datatype **RECORD** in the table schema (the same type is called **STRUCT** in BigQuery's standard SQL).

If the fields within a RECORD have one value for each row, they are relatively straightforward to query.  (This is the case here, where both the "Name" field and the "Color" field under the "Toy" column have one value for each row.)  An example query is shown below.

![nested data](https://i.imgur.com/pxorrTh.png)

We need only identify each field in the context of the column that contains it:
- `Toy.Name` refers to the "Name" field in the "Toy" column, and
- `Toy.Color` refers to the "Color" field in the "Toy" column.  

Otherwise, our usual rules remain the same - we need not change anything else about our queries.
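To make the dotted notation concrete, here is a minimal pure-Python sketch that mimics what such a query does.  It does not touch BigQuery: the rows and their values are invented for illustration, `get_field` is a hypothetical helper, and the SQL in the comment is only an assumed shape for a query over `pets_and_toys`.

```python
# Hypothetical stand-in for the `pets_and_toys` table (values are invented).
# Each row is a dict, and the nested "Toy" column is itself a dict.
pets_and_toys = [
    {"ID": 1, "Name": "Moon", "Animal": "Dog", "Toy": {"Name": "McFly", "Color": "Blue"}},
    {"ID": 2, "Name": "Ripley", "Animal": "Cat", "Toy": {"Name": "Fluffy", "Color": "Green"}},
    {"ID": 3, "Name": "Napoleon", "Animal": "Fish", "Toy": {"Name": "Eddy", "Color": "Yellow"}},
]


def get_field(row, dotted_path):
    """Resolve a dotted path such as "Toy.Name" against a nested row."""
    value = row
    for part in dotted_path.split("."):
        value = value[part]
    return value


# Rough equivalent of (assumed query shape):
#   SELECT Name AS Pet_Name, Toy.Name AS Toy_Name, Toy.Color AS Toy_Color
#   FROM pets_and_toys
selected = [
    {
        "Pet_Name": get_field(row, "Name"),
        "Toy_Name": get_field(row, "Toy.Name"),
        "Toy_Color": get_field(row, "Toy.Color"),
    }
    for row in pets_and_toys
]

for row in selected:
    print(row)
```

Note that "Name" and "Toy.Name" are different things: the first is the pet's name (a top-level column), and the second is a field inside the "Toy" column.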

# Repeated data 

Now consider the (more realistic!) case where each pet can have multiple toys.  In this case, to collapse all of the information into a single table, we need to leverage another datatype.

![repeated data](https://i.imgur.com/USsRSBH.png)

In the (new!) `pets_and_toys` table above, the "Toys" column is both nested and **repeated**.  The data is repeated because both the "Toys.Name" and "Toys.Color" fields permit more than one value for each row.  (In the table schema, a repeated column has mode REPEATED.)

When querying repeated data, the syntax is slightly more complex.

![repeated data](https://i.imgur.com/ogLy2cp.png)

As you can see above, when writing a query, we need to put the name of the column containing the repeated data inside an **UNNEST()** function.  This essentially flattens the repeated data so that each element gets its own row, alongside the other columns of the row it came from.  For an illustration of this, check out the image below.

![repeated data](https://i.imgur.com/TXoNRtK.png)
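If it helps, the flattening can also be imitated in a few lines of plain Python.  This is only a sketch of the idea, not how BigQuery implements it: the rows are invented, `unnest` is a hypothetical helper, and the alias `t` and the SQL in the comment are assumptions made for illustration.

```python
# Hypothetical stand-in for the (new) `pets_and_toys` table (values are invented).
# The repeated "Toys" column is a list of dicts.
pets_and_toys = [
    {"ID": 1, "Name": "Moon", "Toys": [{"Name": "McFly", "Color": "Blue"},
                                       {"Name": "Scully", "Color": "Blue"}]},
    {"ID": 2, "Name": "Ripley", "Toys": [{"Name": "Fluffy", "Color": "Green"}]},
    {"ID": 3, "Name": "Napoleon", "Toys": []},  # a pet with no toys
]


def unnest(rows, repeated_column, alias):
    """Pair every row with each element of its repeated column."""
    flattened = []
    for row in rows:
        for element in row[repeated_column]:
            # Keep the original columns and expose the element under `alias`
            flattened.append({**row, alias: element})
    return flattened


# Rough equivalent of (assumed query shape):
#   SELECT Name AS Pet_Name, t.Name AS Toy_Name, t.Color AS Toy_Color
#   FROM pets_and_toys, UNNEST(Toys) AS t
result = [
    {"Pet_Name": row["Name"], "Toy_Name": row["t"]["Name"], "Toy_Color": row["t"]["Color"]}
    for row in unnest(pets_and_toys, "Toys", alias="t")
]

for row in result:
    print(row)
```

The pet with two toys appears twice in the output, once per toy.  The pet with an empty list does not appear at all, which matches how a comma join with `UNNEST()` drops rows whose array is empty.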

If this doesn't make complete sense to you yet, you're not alone!  It will start to feel clearer when we apply these ideas to a real dataset in the section below.

# Example

We'll work with the [Google Analytics Sample](https://www.kaggle.com/bigquery/google-analytics-sample) dataset.  It contains information tracking the behavior of visitors to the Google Merchandise Store, an e-commerce website that sells Google-branded items.

We begin by printing the first few rows of the `ga_sessions_20170801` table, which tracks visits to the website on August 1, 2017.

In [None]:
from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "google_analytics_sample" dataset
# (DatasetReference is used here in place of the deprecated client.dataset() helper)
dataset_ref = bigquery.DatasetReference("bigquery-public-data", "google_analytics_sample")

# Construct a reference to the "ga_sessions_20170801" table
table_ref = dataset_ref.table("ga_sessions_20170801")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the table
client.list_rows(table, max_results=5).to_dataframe()

The "totals" and "device" columns appear to be nested fields (with datatype RECORD). We can verify this by looking at the corresponding entries in the table schema. 

> Recall that we refer to the structure of a table as its **schema**.  If you need to review how to interpret a table schema, feel free to check out [this lesson](https://www.kaggle.com/dansbecker/getting-started-with-sql-and-bigquery) from the Intro to SQL micro-course.

In [None]:
print("SCHEMA field for the 'totals' column:\n")
print(table.schema[5])

print("\nSCHEMA field for the 'device' column:\n")
print(table.schema[7])

There are several `SchemaField`s nested inside the field for each column.  For instance, `'device'` contains both `'browser'` and `'browserVersion'`.  
> Take a quick look at the table preview above.  Do you see "browser" and "browserVersion" nested under the "device" column?  

As you can see in the schema entries, both the "device" column and the "totals" column have datatype RECORD.

![repeated data](https://i.imgur.com/I8OblyK.png)
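Reading long schema entries by eye gets tedious, so it can help to list every field by its dotted path.  The sketch below does this for any objects that expose `name`, `field_type`, `mode`, and `fields` attributes, which `SchemaField` does.  To keep the example runnable without a BigQuery connection, it uses a small hand-built stand-in for the schema; the `field` and `describe` helpers are hypothetical, and the stand-in lists only a handful of the real table's fields.

```python
from types import SimpleNamespace


def field(name, field_type, mode="NULLABLE", fields=()):
    """Build a stand-in with the attributes this sketch reads from a SchemaField."""
    return SimpleNamespace(name=name, field_type=field_type, mode=mode, fields=tuple(fields))


# Hand-built, heavily abbreviated stand-in for `table.schema`
schema = [
    field("visitId", "INTEGER"),
    field("totals", "RECORD", fields=[field("transactions", "INTEGER")]),
    field("device", "RECORD", fields=[field("browser", "STRING"),
                                      field("browserVersion", "STRING")]),
    field("hits", "RECORD", mode="REPEATED", fields=[
        field("hitNumber", "INTEGER"),
        field("page", "RECORD", fields=[field("pagePath", "STRING")]),
        field("type", "STRING"),
    ]),
]


def describe(fields, prefix=""):
    """Return (dotted_path, field_type, mode) for every field, depth first."""
    described = []
    for f in fields:
        path = prefix + f.name
        described.append((path, f.field_type, f.mode))
        # Nested fields (if any) are listed under the parent's path
        described.extend(describe(f.fields, prefix=path + "."))
    return described


for path, field_type, mode in describe(schema):
    print(f"{path:25} {field_type:8} {mode}")
```

With a live connection, calling `describe(table.schema)` on the table fetched earlier should produce the same kind of listing for the full schema.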

We refer to the "browser" field (which is nested in the "device" column) and the "transactions" field (which is nested inside the "totals" column) as `device.browser` and `totals.transactions` in the query below:

In [None]:
# Query to count the number of transactions per browser
query = """
        SELECT device.browser AS device_browser,
            SUM(totals.transactions) AS total_transactions
        FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
        GROUP BY device_browser
        ORDER BY total_transactions DESC
        """

# Run the query, and return a pandas DataFrame
result = client.query(query).result().to_dataframe()
result.head()
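Some browsers may show a missing value for `total_transactions` in a result like this.  The pure-Python sketch below imitates the query on a few invented sessions to show why, under the assumption that sessions without a purchase store NULL (here `None`) in `totals.transactions`.  The `sql_sum` helper is hypothetical and mirrors two SQL rules: `SUM` skips NULL inputs, and it returns NULL when every input is NULL.

```python
from collections import defaultdict

# Hypothetical sessions (values are invented); None plays the role of NULL
sessions = [
    {"device": {"browser": "Chrome"}, "totals": {"transactions": 2}},
    {"device": {"browser": "Chrome"}, "totals": {"transactions": None}},
    {"device": {"browser": "Safari"}, "totals": {"transactions": 1}},
    {"device": {"browser": "Firefox"}, "totals": {"transactions": None}},
]

# GROUP BY device.browser
groups = defaultdict(list)
for session in sessions:
    groups[session["device"]["browser"]].append(session["totals"]["transactions"])


def sql_sum(values):
    """Sum like SQL: skip NULLs, and return NULL if nothing is left."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None


totals = {browser: sql_sum(values) for browser, values in groups.items()}

# ORDER BY total_transactions DESC (NULL totals sort last in descending order)
ordered = sorted(totals.items(), key=lambda item: (item[1] is None, -(item[1] or 0)))

for browser, total in ordered:
    print(browser, total)
```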

Now we'll work with the "hits" column as an example of data that is both nested and repeated.  The `SchemaField` for the "hits" column is quite long, so we won't print it here.  Instead, we show a small snapshot below (where many fields are not shown):

![repeated data](https://i.imgur.com/83f8HpL.png)

Since:
- "hits" is a RECORD (contains nested data) and is repeated,
- "hitNumber", "page", and "type" are all nested inside the "hits" column, and
- "pagePath" is nested inside the "page" field,

we can query these fields with the following syntax:

In [None]:
# Query to determine most popular landing point on the website
query = """
        SELECT hits.page.pagePath AS path,
            COUNT(hits.page.pagePath) AS counts
        FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`,
            UNNEST(hits) AS hits
        WHERE hits.type = "PAGE" AND hits.hitNumber = 1
        GROUP BY path
        ORDER BY counts DESC
        """

# Run the query, and return a pandas DataFrame
result = client.query(query).result().to_dataframe()
result.head()

In this case, more users land on the website through the `"/home"` page than through any other page.
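To see how the pieces of this query fit together, here is a pure-Python sketch that walks through the same steps on a few invented sessions: flatten "hits", keep only the first page hit of each session, then count by page path.  The sessions and page paths are made up for illustration, and the code only imitates the logic; it is not how BigQuery executes the query.

```python
from collections import Counter

# Hypothetical sessions (values are invented); "hits" is nested and repeated
sessions = [
    {"hits": [
        {"hitNumber": 1, "type": "PAGE", "page": {"pagePath": "/home"}},
        {"hitNumber": 2, "type": "PAGE", "page": {"pagePath": "/basket.html"}},
    ]},
    {"hits": [
        {"hitNumber": 1, "type": "PAGE", "page": {"pagePath": "/home"}},
        {"hitNumber": 2, "type": "EVENT", "page": {"pagePath": "/home"}},
    ]},
    {"hits": [
        {"hitNumber": 1, "type": "PAGE", "page": {"pagePath": "/store.html"}},
    ]},
]

# FROM sessions, UNNEST(hits) AS hits: one row per hit
unnested = [hit for session in sessions for hit in session["hits"]]

# WHERE hits.type = "PAGE" AND hits.hitNumber = 1: keep each session's landing hit
landing_hits = [hit for hit in unnested if hit["type"] == "PAGE" and hit["hitNumber"] == 1]

# GROUP BY path, COUNT(...), ORDER BY counts DESC
counts = Counter(hit["page"]["pagePath"] for hit in landing_hits).most_common()

for path, count in counts:
    print(path, count)
```

Without the filter on `hitNumber`, every page view in a session would be counted, so the result would rank the most viewed pages rather than the most common landing pages.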

# Your turn 

...