# Nested and Repeated Data

Consider a hypothetical dataset containing information about pets and their toys. We could organize this information in two different tables (a `pets` table and a `toys` table). The `toys` table could contain a `Pet_ID` column that could be used to match each toy to the pet that owns it.

Another option in BigQuery is to organize all of the information in a single table, similar to the `pets_and_toys` table below.

![https://i.imgur.com/wxuogYA.png](https://i.imgur.com/wxuogYA.png)

In this case, all of the information from the `toys` table is collapsed into a single column (the "Toy" column in the `pets_and_toys` table). We refer to the "Toy" column in the `pets_and_toys` table as a nested column, and say that the "Name" and "Type" fields are nested inside of it.
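
To make the idea of collapsing two tables into one concrete, here is a minimal pure-Python sketch (it does not touch BigQuery). The row values and the "ID" and "Animal" column names are hypothetical, chosen only for illustration; the point is that each toy's "Name" and "Type" end up nested inside a single "Toy" field of the pet's row.

```python
# Hypothetical rows for the two separate tables (values are made up).
pets = [
    {"ID": 1, "Name": "Moon", "Animal": "Dog"},
    {"ID": 2, "Name": "Ripley", "Animal": "Cat"},
]
toys = [
    {"ID": 10, "Name": "McFly", "Type": "Frisbee", "Pet_ID": 1},
    {"ID": 11, "Name": "Fluffy", "Type": "Feather", "Pet_ID": 2},
]

# Index toys by the pet that owns them (exactly one toy per pet here).
toy_by_pet = {toy["Pet_ID"]: toy for toy in toys}

# Build the single-table form: each pet row carries a nested "Toy" record.
pets_and_toys = []
for pet in pets:
    toy = toy_by_pet[pet["ID"]]
    row = dict(pet)
    row["Toy"] = {"Name": toy["Name"], "Type": toy["Type"]}
    pets_and_toys.append(row)

for row in pets_and_toys:
    print(row)
```

Once the toy lives inside the pet's row, the `Pet_ID` key is no longer needed to match the two.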

Nested columns have type `STRUCT` (or type `RECORD`). This is reflected in the table schema below.

![https://i.imgur.com/epXFXdb.png](https://i.imgur.com/epXFXdb.png)

To query a column with nested data, we need to identify each field in the context of the column that contains it:

1. `Toy.Name` refers to the "Name" field in the "Toy" column.
2. `Toy.Type` refers to the "Type" field in the "Toy" column.

![https://i.imgur.com/eE2Gt62.png](https://i.imgur.com/eE2Gt62.png)
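
The dotted notation can be mimicked in plain Python by treating a nested column as a dictionary inside each row. This is only a sketch: the rows are hypothetical, `get_field` is a helper invented for this illustration, and the SQL in the comment is an assumed query shape rather than the exact one pictured above.

```python
# Hypothetical rows standing in for pets_and_toys (not read from BigQuery).
pets_and_toys = [
    {"Name": "Moon", "Animal": "Dog", "Toy": {"Name": "McFly", "Type": "Frisbee"}},
    {"Name": "Ripley", "Animal": "Cat", "Toy": {"Name": "Fluffy", "Type": "Feather"}},
]

def get_field(row, path):
    """Follow a dotted path such as "Toy.Name" through nested dictionaries."""
    value = row
    for part in path.split("."):
        value = value[part]
    return value

# Rough equivalent of:
#   SELECT Name AS Pet_Name, Toy.Name AS Toy_Name, Toy.Type AS Toy_Type
#   FROM pets_and_toys
result = [
    {
        "Pet_Name": get_field(row, "Name"),
        "Toy_Name": get_field(row, "Toy.Name"),
        "Toy_Type": get_field(row, "Toy.Type"),
    }
    for row in pets_and_toys
]

for row in result:
    print(row)
```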

## Repeated data

Now consider the (more realistic!) case where each pet can have multiple toys. In this case, to collapse this information into a single table, we need to leverage a different datatype.

![https://i.imgur.com/S93FJTE.png](https://i.imgur.com/S93FJTE.png)

We say that the "Toys" column contains **repeated data**, because it permits more than one value for each row. This is reflected in the table schema below, where the mode of the "Toys" column appears as `REPEATED`.

![https://i.imgur.com/KlrjpDM.png](https://i.imgur.com/KlrjpDM.png)

Each entry in a repeated field is an `ARRAY`, or an ordered list of (zero or more) values with the same datatype. For instance, the entry in the "Toys" column for Moon the Dog is `[Frisbee, Bone, Rope]`, which is an `ARRAY` with three values.

When querying repeated data, we need to put the name of the column containing the repeated data inside an `UNNEST()` function.

![https://i.imgur.com/p3fXPxY.png](https://i.imgur.com/p3fXPxY.png)

This essentially flattens the repeated data (which is then appended to the right side of the table) so that we have one element on each row. For an illustration of this, check out the image below.

![https://i.imgur.com/8j4XK8f.png](https://i.imgur.com/8j4XK8f.png)
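
The flattening step can also be sketched in plain Python. The `unnest` helper below is hypothetical (it is not part of any BigQuery client library), and apart from Moon's toys the rows are made up. It pairs each row with every element of its array, which is roughly what happens when a table is combined with `UNNEST()` using a comma in the `FROM` clause.

```python
# Moon's toys come from the text above; the other rows are hypothetical.
pets_and_toys_type = [
    {"Name": "Moon", "Animal": "Dog", "Toys": ["Frisbee", "Bone", "Rope"]},
    {"Name": "Ripley", "Animal": "Cat", "Toys": ["Feather", "Ball"]},
    {"Name": "Napoleon", "Animal": "Fish", "Toys": []},
]

def unnest(rows, column, alias):
    """Return one output row per array element, keeping the original columns."""
    flattened = []
    for row in rows:
        for element in row[column]:
            new_row = dict(row)       # copy the original columns
            new_row[alias] = element  # append the flattened value as a new column
            flattened.append(new_row)
    return flattened

flat = unnest(pets_and_toys_type, "Toys", "Toy_Type")
for row in flat:
    print(row["Name"], row["Toy_Type"])
```

Note that a row whose array is empty (Napoleon here) contributes no rows to the flattened result, which matches how the comma (cross join) form behaves in BigQuery.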

## Nested and repeated data

Now, what if pets can have multiple toys, and we'd like to keep track of both the name and type of each toy? In this case, we can make the "Toys" column both **nested** and **repeated**.

![https://i.imgur.com/psKtza2.png](https://i.imgur.com/psKtza2.png)

In the `more_pets_and_toys` table above, "Name" and "Type" are both fields contained within the "Toys" `STRUCT`, and each entry in both `Toys.Name` and `Toys.Type` is an `ARRAY`.

![https://i.imgur.com/fO5OymI.png](https://i.imgur.com/fO5OymI.png)

Let's look at a sample query.

![https://i.imgur.com/DiMCZaO.png](https://i.imgur.com/DiMCZaO.png)

Since the "Toys" column is repeated, we flatten it with the `UNNEST()` function. And, since we give the flattened column an alias of `t`, we can refer to the "Name" and "Type" fields in the "Toys" column as `t.Name` and `t.Type`, respectively.
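
As a rough plain-Python analogy, a column that is both nested and repeated is a list of dictionaries, and the alias `t` plays the role of a loop variable over that list. The rows below are hypothetical, and the SQL in the comment is an assumed query shape rather than a transcription of the image.

```python
# Hypothetical rows: "Toys" is a list (repeated) of dicts (nested).
more_pets_and_toys = [
    {
        "Name": "Moon",
        "Toys": [
            {"Name": "McFly", "Type": "Frisbee"},
            {"Name": "Scully", "Type": "Bone"},
        ],
    },
    {"Name": "Ripley", "Toys": [{"Name": "Fluffy", "Type": "Feather"}]},
]

# Rough equivalent of:
#   SELECT Name AS Pet_Name, t.Name AS Toy_Name, t.Type AS Toy_Type
#   FROM more_pets_and_toys, UNNEST(Toys) AS t
result = [
    {"Pet_Name": pet["Name"], "Toy_Name": t["Name"], "Toy_Type": t["Type"]}
    for pet in more_pets_and_toys
    for t in pet["Toys"]  # one output row per toy
]

for row in result:
    print(row)
```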

## Example

In [1]:
from google.cloud import bigquery
# Create a client and a reference to the public Google Analytics sample dataset
client = bigquery.Client()
dataset_ref = client.dataset("google_analytics_sample", project="bigquery-public-data")
# Fetch one day's sessions table and preview its first five rows
table_ref = dataset_ref.table("ga_sessions_20170801")
table = client.get_table(table_ref)
client.list_rows(table, max_results=5).to_dataframe()

Unnamed: 0,visitorId,visitNumber,visitId,visitStartTime,date,totals,trafficSource,device,geoNetwork,customDimensions,hits,fullVisitorId,userId,clientId,channelGrouping,socialEngagementType
0,,1,1501583974,1501583974,20170801,"{'visits': 1, 'hits': 1, 'pageviews': 1, 'time...","{'referralPath': None, 'campaign': '(not set)'...","{'browser': 'Chrome', 'browserVersion': 'not a...","{'continent': 'Americas', 'subContinent': 'Car...",[],"[{'hitNumber': 1, 'time': 0, 'hour': 3, 'minut...",2248281639583218707,,,Organic Search,Not Socially Engaged
1,,1,1501616585,1501616585,20170801,"{'visits': 1, 'hits': 1, 'pageviews': 1, 'time...","{'referralPath': None, 'campaign': '(not set)'...","{'browser': 'Chrome', 'browserVersion': 'not a...","{'continent': 'Americas', 'subContinent': 'Nor...","[{'index': 4, 'value': 'North America'}]","[{'hitNumber': 1, 'time': 0, 'hour': 12, 'minu...",8647436381089107732,,,Organic Search,Not Socially Engaged
2,,1,1501583344,1501583344,20170801,"{'visits': 1, 'hits': 1, 'pageviews': 1, 'time...","{'referralPath': None, 'campaign': '(not set)'...","{'browser': 'Chrome', 'browserVersion': 'not a...","{'continent': 'Asia', 'subContinent': 'Souther...","[{'index': 4, 'value': 'APAC'}]","[{'hitNumber': 1, 'time': 0, 'hour': 3, 'minut...",2055839700856389632,,,Organic Search,Not Socially Engaged
3,,1,1501573386,1501573386,20170801,"{'visits': 1, 'hits': 1, 'pageviews': 1, 'time...","{'referralPath': None, 'campaign': '(not set)'...","{'browser': 'Chrome', 'browserVersion': 'not a...","{'continent': 'Europe', 'subContinent': 'Weste...","[{'index': 4, 'value': 'EMEA'}]","[{'hitNumber': 1, 'time': 0, 'hour': 0, 'minut...",750846065342433129,,,Direct,Not Socially Engaged
4,,8,1501651467,1501651467,20170801,"{'visits': 1, 'hits': 1, 'pageviews': 1, 'time...","{'referralPath': None, 'campaign': '(not set)'...","{'browser': 'Chrome', 'browserVersion': 'not a...","{'continent': 'Americas', 'subContinent': 'Nor...","[{'index': 4, 'value': 'North America'}]","[{'hitNumber': 1, 'time': 0, 'hour': 22, 'minu...",573427169410921198,,,Organic Search,Not Socially Engaged


We refer to the "browser" field (which is nested in the "device" column) and the "transactions" field (which is nested inside the "totals" column) as `device.browser` and `totals.transactions` in the query below:

In [3]:
query = """
        SELECT device.browser AS device_browser,
            SUM(totals.transactions) as total_transactions
        FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
        GROUP BY device_browser
        ORDER BY total_transactions DESC
        """
result = client.query(query).result().to_dataframe()
result.head()

Unnamed: 0,device_browser,total_transactions
0,Chrome,41.0
1,Safari,3.0
2,Firefox,1.0
3,Internet Explorer,
4,UC Browser,


By storing the information in the "device" and "totals" columns as `STRUCT`s (as opposed to separate tables), we avoid expensive `JOIN`s. This improves query performance and spares us from having to keep track of `JOIN` keys (and of which tables hold the exact data we need).

Now we'll work with the "hits" column as an example of data that is both nested and repeated. Since:

1. "hits" is a STRUCT (contains nested data) and is repeated,
2. "hitNumber", "page", and "type" are all nested inside the "hits" column, and
3. "pagePath" is nested inside the "page" field,

we can query these fields with the following syntax:

In [4]:
query = """
        SELECT hits.page.pagePath AS path,
            COUNT(hits.page.pagePath) AS counts
        FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`,
            UNNEST(hits) AS hits
        WHERE hits.type = "PAGE" AND hits.hitNumber = 1
        GROUP BY path
        ORDER BY counts DESC
        """
result = client.query(query).result().to_dataframe()
result.head()

Unnamed: 0,path,counts
0,/home,1257
1,/google+redesign/shop+by+brand/youtube,587
2,/google+redesign/apparel/mens/mens+t+shirts,117
3,/signin.html,78
4,/basket.html,35
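
To see what the query above is doing step by step, here is a small plain-Python sketch that applies the same logic to a handful of hypothetical sessions. The session data is invented for illustration and has no missing paths. The loop over `session["hits"]` stands in for `UNNEST(hits)`, the `if` clause for `WHERE`, and `Counter` for `GROUP BY` with `COUNT`.

```python
from collections import Counter

# Hypothetical sessions; each "hits" entry is a nested record with a nested "page".
sessions = [
    {"fullVisitorId": "A", "hits": [
        {"hitNumber": 1, "type": "PAGE", "page": {"pagePath": "/home"}},
        {"hitNumber": 2, "type": "PAGE", "page": {"pagePath": "/basket.html"}},
    ]},
    {"fullVisitorId": "B", "hits": [
        {"hitNumber": 1, "type": "PAGE", "page": {"pagePath": "/home"}},
    ]},
    {"fullVisitorId": "C", "hits": [
        {"hitNumber": 1, "type": "EVENT", "page": {"pagePath": "/signin.html"}},
        {"hitNumber": 2, "type": "PAGE", "page": {"pagePath": "/signin.html"}},
    ]},
    {"fullVisitorId": "D", "hits": [
        {"hitNumber": 1, "type": "PAGE", "page": {"pagePath": "/signin.html"}},
    ]},
]

counts = Counter(
    hit["page"]["pagePath"]                             # hits.page.pagePath
    for session in sessions
    for hit in session["hits"]                          # UNNEST(hits) AS hits
    if hit["type"] == "PAGE" and hit["hitNumber"] == 1  # WHERE clause
)

# most_common() sorts by count, largest first (ORDER BY counts DESC).
top = counts.most_common()
for path, count in top:
    print(path, count)
```

Only the first hit of each session survives the filter, and only when it is a page view, so the result counts landing pages.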


## Exercises

In [5]:
from google.cloud import bigquery
# Create a client and fetch the public "github_repos" dataset
client = bigquery.Client()
dataset_ref = client.dataset("github_repos", project="bigquery-public-data")
dataset = client.get_dataset(dataset_ref)
# Fetch the "sample_commits" table and preview its first five rows
table_ref = dataset_ref.table("sample_commits")
sample_commits_table = client.get_table(table_ref)
client.list_rows(sample_commits_table, max_results=5).to_dataframe()

Unnamed: 0,commit,tree,parent,author,committer,subject,message,trailer,difference,difference_truncated,repo_name,encoding
0,3eca86e75ec7a7d4b9a9c8091b11676f7bd2a39f,8e1b4380409a85a922ee0d3f622b5dd4d16bcfad,[104a0c02e8b1936c049e18a6d4e4ab040fb61213],"{'name': 'Mark Rutland', 'email': '1db9dd262be...","{'name': 'Catalin Marinas', 'email': '15ce75b2...",arm64: Remove fixmap include fragility,arm64: Remove fixmap include fragility\n\nThe ...,"[{'key': 'Signed-off-by', 'value': 'Mark Rutla...","[{'old_mode': 33188, 'new_mode': 33188, 'old_p...",,torvalds/linux,
1,7158627686f02319c50c8d9d78f75d4c8d126ff2,3b4d781bd966f07cad1b67b137f0ff8b89430e9a,[66aa8d6a145b6a66566b4fce219cc56c3d0e01c3],"{'name': 'Will Deacon', 'email': 'b913f13ef92a...","{'name': 'Catalin Marinas', 'email': '15ce75b2...",arm64: percpu: implement optimised pcpu access...,arm64: percpu: implement optimised pcpu access...,"[{'key': 'Signed-off-by', 'value': 'Will Deaco...","[{'old_mode': 33188, 'new_mode': 33188, 'old_p...",,torvalds/linux,
2,9732cafd9dc0206479be919baf0067239f0a63ca,c8878035ac9cb6dce592957f12dc1723a583989d,[f3c003f72dfb2497056bcbb864885837a1968ed5],"{'name': 'Jiang Liu', 'email': 'c745fa7b96fe79...","{'name': 'Catalin Marinas', 'email': '15ce75b2...","arm64, jump label: optimize jump label impleme...","arm64, jump label: optimize jump label impleme...","[{'key': 'Reviewed-by', 'value': 'Will Deacon ...","[{'old_mode': 33188, 'new_mode': 33188, 'old_p...",,torvalds/linux,
3,4702abd3f9728893ad5b0f4389e1902588510459,32926e7c55ef585d9b9c174a0e5f9ed13ed6bf7e,[ddf28352b80c86754a6424e3a61e8bdf9213b3c7],"{'name': 'Nicolas Pitre', 'email': '408789a210...","{'name': 'Arnd Bergmann', 'email': 'f2c659f019...",ARM: mach-nuc93x: delete,ARM: mach-nuc93x: delete\n\nThis architecture ...,"[{'key': 'Signed-off-by', 'value': 'Nicolas Pi...","[{'old_mode': 33188, 'new_mode': 33188, 'old_p...",,torvalds/linux,
4,57bd4b91a6cfc5bad4c5d829ef85293ea63643ea,2ffc2066eb7638e185663e9d849663403229d4e5,[f74c95c20bad8e183e41283475f68a3e7b247af4],"{'name': 'Ben Dooks', 'email': '1177f64998f284...","{'name': 'Ben Dooks', 'email': '1177f64998f284...",[ARM] S3C24XX: Movev udc headers to arch/arm/p...,[ARM] S3C24XX: Movev udc headers to arch/arm/p...,"[{'key': 'Signed-off-by', 'value': 'Ben Dooks ...","[{'old_mode': 33188, 'new_mode': 33188, 'old_p...",,torvalds/linux,


### 1) Who had the most commits in 2016?

GitHub is the most popular place to collaborate on software projects. A GitHub **repository** (or repo) is a collection of files associated with a specific project, and a GitHub **commit** is a change that a user has made to a repository.  We refer to the user as a **committer**.

The `sample_commits` table contains a small sample of GitHub commits, where each row corresponds to a different commit. The code cell above fetches the table and shows its first five rows.

Write a query to find the individuals with the most commits in this table in 2016.  Your query should return a table with two columns:
- `committer_name` - contains the name of each individual with a commit (from 2016) in the table
- `num_commits` - shows the number of commits the individual has in the table (from 2016)

Sort the table, so that people with more commits appear first.

**NOTE**: You can find the name of each committer and the date of the commit under the "committer" column, in the "name" and "date" child fields, respectively.

In [6]:
max_commits_query = """
                    SELECT committer.name AS committer_name, COUNT(*) AS num_commits
                    FROM `bigquery-public-data.github_repos.sample_commits`
                    WHERE committer.date >= '2016-01-01' AND committer.date < '2017-01-01'
                    GROUP BY committer_name
                    ORDER BY num_commits DESC
                    """

### 2) Look at languages!

In [7]:
table_ref = dataset_ref.table("languages")
languages_table = client.get_table(table_ref)
client.list_rows(languages_table, max_results=5).to_dataframe()

Unnamed: 0,repo_name,language
0,JoaoPedroToledo/C,"[{'name': 'C', 'bytes': 4919}]"
1,brantr/grid-fft,"[{'name': 'C', 'bytes': 100796}]"
2,plkid/demo,"[{'name': 'C', 'bytes': 33}]"
3,digitalmediacenter/zabbix_dns,"[{'name': 'C', 'bytes': 7092}]"
4,AlbandeCrevoisier/trajectoryctc,"[{'name': 'C', 'bytes': 4255}]"


In [8]:
languages_table.schema

[SchemaField('repo_name', 'STRING', 'NULLABLE', None, ()),
 SchemaField('language', 'RECORD', 'REPEATED', None, (SchemaField('name', 'STRING', 'NULLABLE', None, ()), SchemaField('bytes', 'INTEGER', 'NULLABLE', None, ())))]

Assume for the moment that you have access to a table called `sample_languages` that contains only a very small subset of the rows from the `languages` table: in fact, it contains only three rows!  This table is depicted in the image below.

![](https://i.imgur.com/qAb5lZ2.png)

How many rows are in the table returned by the query below?

![](https://i.imgur.com/Q5qYAtz.png)

Fill in your answer in the next code cell.

In [9]:
num_rows = 6

### 3) What's the most popular programming language?

Write a query to leverage the information in the `languages` table to determine which programming languages appear in the most repositories.  The table returned by your query should have two columns:
- `language_name` - the name of the programming language
- `num_repos` - the number of repositories in the `languages` table that use the programming language

Sort the table so that languages that appear in more repos are shown first.

In [10]:
pop_lang_query = """
                 SELECT l.name AS language_name, COUNT(*) AS num_repos
                 FROM `bigquery-public-data.github_repos.languages`, 
                     UNNEST(language) as l
                 GROUP BY language_name
                 ORDER BY num_repos DESC
                 """

### 4) Which languages are used in the repository with the most languages?

For this question, you'll restrict your attention to the repository with name `'polyrabbit/polyglot'`.

Write a query that returns a table with one row for each language in this repository.  The table should have two columns:
- `name` - the name of the programming language
- `bytes` - the total number of bytes of that programming language

Sort the table by the `bytes` column so that programming languages that take up more space in the repo appear first.

In [11]:
all_langs_query = """
                  SELECT l.name, l.bytes
                  FROM `bigquery-public-data.github_repos.languages`,
                      UNNEST(language) as l
                  WHERE repo_name = 'polyrabbit/polyglot'
                  ORDER BY l.bytes DESC
                  """