# Teradata to BigQuery SQL Translation

## Introduction

Both BigQuery and Teradata Database conform to the [ANSI/ISO SQL:2011](https://wikipedia.org/wiki/SQL:2011) standard. In addition, Teradata has created some extensions to the SQL standard to enable Teradata-specific functionalities.

In contrast, BigQuery does not support these proprietary extensions, so some of your queries might need to be refactored during migration from Teradata to BigQuery. Restricting your queries to the ANSI/ISO SQL that BigQuery supports has the added benefit of keeping them portable and agnostic to the underlying data warehouse.

## Teradata SQL differences

This notebook discusses notable differences between Teradata SQL and the BigQuery standard SQL, and some strategies for translating between the two dialects. The list of differences presented in this notebook is not exhaustive. For additional information, see the [Teradata-to-BigQuery SQL translation reference](https://cloud.google.com/solutions/migration/dw2bq/td2bq/td-bq-sql-translation-reference-tables).

## Data Types

BigQuery supports a more concise set of data types than
Teradata, with groups of Teradata types mapping into a single standard SQL data
type. For instance:

-   `INTEGER`, `SMALLINT`, `BYTEINT`, and `BIGINT` all map to `INT64`.
-   `CLOB`, `JSON`, `XML`, `UDT` and other types that contain large
    character fields map to `STRING`.
-   `BLOB`, `BYTE`, and `VARBYTE` types that contain binary information map
    to `BYTES`.

For dates, the main types (`DATE`, `TIME`, and `TIMESTAMP`) are equivalent in
Teradata and BigQuery. However, other specialized date types from
Teradata need to be mapped, such as the following:

-   `TIME_WITH_TIME_ZONE` to `TIME`.
-   `TIMESTAMP_WITH_TIME_ZONE` to `TIMESTAMP`.
-   `INTERVAL_HOUR`, `INTERVAL_MINUTE`, and other `INTERVAL_*` types map to
    `INT64` in BigQuery.
-   `PERIOD(DATE)`, `PERIOD(TIME)`, and other `PERIOD(*)` types map to `STRING`.

[Multi-dimensional arrays](https://docs.teradata.com/reader/S0Fw2AVH8ff3MDA0wDOHlQ/D3QuBsLccP9JObIH8f4yJA)
are not directly supported in BigQuery. Instead, you create an
[array of structs](https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#building_arrays_of_arrays),
with each struct containing a field of type `ARRAY`.
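
As a minimal sketch of this pattern, the following query models a 2x3 matrix as an array of structs, where each struct wraps one row in an `ARRAY<INT64>` field. The field name `vals` and the CTE name `m` are illustrative assumptions, not names taken from any Teradata schema.

```sql
-- Each STRUCT holds one row of the matrix in its `vals` array field.
WITH m AS (
  SELECT
    [
      STRUCT([1, 2, 3] AS vals),
      STRUCT([4, 5, 6] AS vals)
    ] AS matrix
)
SELECT
  -- OFFSET is zero-based: second row, third column.
  matrix[OFFSET(1)].vals[OFFSET(2)] AS element
FROM m;
```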

## Data Types - Exercise

In this exercise, you will examine several of the `TIMESTAMP` and `TIME` functions and data types available to you. You will be using a public BigQuery dataset that contains rental records from the London bike share program.

Use the `bq` command-line tool to preview a few rows of the table.

`bq head` and the `Preview` tab in the BigQuery UI are much more efficient than a `SELECT * ... LIMIT 1` query, because the query triggers a full table scan and the preview does not.

In [None]:
!bq head -n 5  --selected_fields rental_id,duration,bike_id,end_date,end_station_id,start_date,start_station_id bigquery-public-data:london_bicycles.cycle_hire

We can similarly see table-level data, such as the number of rows and the schema of the table. Notice the `TIMESTAMP` fields.

In [None]:
!bq show bigquery-public-data:london_bicycles.cycle_hire

Run a query to return the most recent 5 rentals by end_date:

In [None]:
%%bigquery

SELECT
  rental_id,
  duration,
  bike_id,
  end_date,
  end_station_id,
  end_station_name,
  start_date,
  start_station_id,
  start_station_name
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
ORDER BY
  end_date DESC
LIMIT
  5

__#TODO(you):__ Modify this query to print the `end_date` and `start_date` fields in UNIX seconds as well.

[Hint: Use UNIX_SECONDS().](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions#unix_seconds)

In [None]:
%%bigquery

SELECT
  rental_id,
  duration,
  bike_id,
  end_date,
  UNIX_SECONDS(end_date) AS end_date_unix,
  end_station_id,
  end_station_name,
  start_date,
  UNIX_SECONDS(start_date) AS start_date_unix,
  start_station_id,
  start_station_name
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
ORDER BY
  end_date DESC
LIMIT
  5

__#TODO(you):__ Modify this query to print the time portion of the `end_date` and `start_date` fields in the Pacific time zone.

[Hint: Use EXTRACT( ... AT TIME ZONE ... ).](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions#extract)

In [None]:
%%bigquery

SELECT
  rental_id,
  duration,
  bike_id,
  end_date,
  EXTRACT(TIME FROM end_date AT TIME ZONE "America/Los_Angeles") AS end_time_california,
  end_station_id,
  end_station_name,
  start_date,
  EXTRACT(TIME FROM start_date AT TIME ZONE "America/Los_Angeles") AS start_time_california,
  start_station_id,
  start_station_name
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
ORDER BY
  end_date DESC
LIMIT
  5

## The SELECT Statement


The syntax of the `SELECT` statement is generally compatible between Teradata and
BigQuery. This section notes differences that often must be
addressed during migration.

### Identifiers

BigQuery lets you use the following as
[identifiers](https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical#identifiers): projects, datasets, tables or views, and columns.

As a serverless product, BigQuery does not have a concept of a
cluster, environment, or fixed endpoint. Instead, the project specifies the dataset's
[resource hierarchy](https://cloud.google.com/resource-manager/docs/cloud-platform-resource-hierarchy).


In a `SELECT` statement in Teradata, fully qualified column names can be used.
BigQuery always references column names from tables or aliases,
and never from projects or datasets.

For example, here are some options to address identifiers in BigQuery.

Columns implicitly inferred from the table:

```sql
SELECT
 c
FROM
 project.dataset.table
```






Or by using an explicit table reference:

```sql
SELECT
 table.c
FROM
 project.dataset.table
```

Or by using an explicit table alias:

```sql
SELECT
 t.c
FROM
 project.dataset.table t
```

__#TODO(you):__ Run the following queries showing the different identifier options.

In [None]:
%%bigquery

SELECT
  rental_id,
  duration,
  bike_id
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
LIMIT
  1

In [None]:
%%bigquery

SELECT
  cycle_hire.rental_id,
  cycle_hire.duration,
  cycle_hire.bike_id
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
LIMIT
  1

In [None]:
%%bigquery

SELECT
  r.rental_id,
  r.duration,
  r.bike_id
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire r
LIMIT
  1

### Alias references

In a `SELECT` statement in Teradata, aliases can be defined and referenced
within the same query. For instance, in the following snippet, `flag` is defined
as a column alias, and then immediately referred to in the enclosed `CASE`
statement.

```sql
SELECT
 F AS flag,
 CASE WHEN flag = 1 THEN ...
```




In standard SQL, references between columns *within the same query* are not
allowed. To translate, you move the logic into a nested query:

```sql
SELECT
 q.*,
 CASE WHEN q.flag = 1 THEN ...
FROM (
 SELECT
   F AS flag,
   ...
) AS q
```

The sample placeholder `F` could itself be a nested query that returns a single
column.
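
A common table expression (`WITH` clause) is an equivalent way to express the same nesting, and some readers find it easier to follow than an inline subquery. The sketch below assumes a hypothetical table `t1` with a column `f`; the labels `'yes'` and `'no'` are placeholders for whatever the `CASE` expression should return.

```sql
-- Define the alias in a CTE, then reference it in the outer query.
WITH q AS (
  SELECT
    f AS flag
  FROM t1
)
SELECT
  q.flag,
  CASE WHEN q.flag = 1 THEN 'yes' ELSE 'no' END AS flag_label
FROM q;
```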


__#TODO(you):__ Run the following query, notice the error caused by the alias reference, and rewrite it with a nested query to conform to standard SQL.

_Note:_ You could just move the EXTRACT() function but for the purposes of the exercise use a nested query.

In [None]:
%%bigquery
-- Query should fail

SELECT
  rental_id,
  duration,
  bike_id,
  start_date,
  EXTRACT(HOUR FROM start_date) AS start_hour,
  CASE
    WHEN start_hour < 12 THEN TRUE
    ELSE FALSE
  END AS morning_ride
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
LIMIT
  1

In [None]:
%%bigquery
-- Query should succeed

SELECT
  *,
  CASE
    WHEN start_hour < 12 THEN TRUE
    ELSE FALSE
  END AS morning_ride
FROM (
  SELECT
    rental_id,
    duration,
    bike_id,
    start_date,
    EXTRACT(HOUR FROM start_date) AS start_hour
  FROM
    `bigquery-public-data`.london_bicycles.cycle_hire ) AS rentals
LIMIT
  1

### Filtering with LIKE

In Teradata, the `LIKE ANY` operator is used to filter the results to a given
set of possible options. For example:

```sql
SELECT *
FROM t1
WHERE a LIKE ANY ('string1', 'string2')
```

To translate statements that have this operator to standard SQL, you can split
the list after `ANY` into several `OR` predicates:

```sql
SELECT *
FROM t1
WHERE a LIKE 'string1' OR a LIKE 'string2'
```
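
When the list after `ANY` is long and every pattern is a simple substring match (`'%...%'`), a single regular expression can be more compact than a chain of `OR` predicates. This is only a sketch: it assumes the patterns contain no regular-expression metacharacters (those would need escaping), and it reuses the placeholder names `t1` and `a` from above.

```sql
SELECT *
FROM t1
-- Same rows as: a LIKE '%string1%' OR a LIKE '%string2%'
WHERE REGEXP_CONTAINS(a, r'string1|string2')
```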

__#TODO(you):__ Rewrite this query with `OR` predicates so that it succeeds.

In [None]:
%%bigquery
-- Query should fail

SELECT
  rental_id,
  duration,
  bike_id,
  start_date,
  start_station_name
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
WHERE
  start_station_name LIKE ANY ('%Hyde Park%', '%Soho%')
LIMIT
  5

In [None]:
%%bigquery
-- Query should succeed

SELECT
  rental_id,
  duration,
  bike_id,
  start_date,
  start_station_name
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
WHERE
  start_station_name LIKE '%Hyde Park%'
  OR start_station_name LIKE '%Soho%'
LIMIT
  5

### The QUALIFY clause



Teradata's
[QUALIFY](https://docs.teradata.com/reader/2_MC9vCtAJRlKle2Rpb0mA/19NnI91neorAi7LX6SJXBw) clause is a conditional clause in the `SELECT` statement that filters results of a previously computed, ordered [analytic function](https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts) according to user‑specified search conditions. Its syntax consists of the `QUALIFY` clause followed by the analytic function, such as [`ROW_NUMBER`](https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/8AEiTSe3nkHWox93XxcLrg) or [`RANK`](https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/8Ex9CS5XErnUTmh7zcrOPg), and the values you want to find:

```sql
SELECT a, b
FROM t1
QUALIFY ROW_NUMBER() OVER (PARTITION BY a ORDER BY b) = 1
```



Teradata users commonly use this clause as a shorthand way to rank and
return results without the need for an additional subquery.

The `QUALIFY` clause is translated to BigQuery by adding a
`WHERE` condition to an enclosing query:

```sql
SELECT a, b
FROM (
 SELECT a, b,
 ROW_NUMBER() OVER (PARTITION BY a ORDER BY b) AS row_num
 FROM t1
) WHERE row_num = 1
```

__#TODO(you):__ Rewrite this query such that it succeeds without a `QUALIFY` clause.

This query returns the very first completed rental for each unique `bike_id`, ordered by `end_date`.

In [None]:
%%bigquery
-- Query should fail

SELECT
  rental_id,
  duration,
  bike_id,
  end_date
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
  QUALIFY ROW_NUMBER() OVER (PARTITION BY bike_id ORDER BY end_date ASC) = 1
LIMIT
  5

In [None]:
%%bigquery
-- Query should succeed

SELECT
  rental_id,
  duration,
  bike_id,
  end_date
FROM (
  SELECT
    rental_id,
    duration,
    bike_id,
    end_date,
    ROW_NUMBER() OVER (PARTITION BY bike_id ORDER BY end_date ASC) rental_num
  FROM
    `bigquery-public-data`.london_bicycles.cycle_hire )
WHERE
  rental_num = 1
LIMIT
  5

### Notes on Scalable Analytic and Aggregate Functions


Many of the analytic functions and aggregate functions in BigQuery have been implemented in a distributed, scalable manner, so it is now harder to overload a single worker. Previously, highly skewed data (for example, a single `bike_id` that accounts for 95% of rides) or a sort over a very large dataset would be processed on a single BigQuery worker.

That said, it is still important to apply BigQuery best practices wherever possible, for example filtering early and often and applying `LIMIT` clauses to aggregate functions like `ARRAY_AGG()`.

The 'latest record' use case has a particularly fast implementation using `ARRAY_AGG(... LIMIT 1)[OFFSET(0)]`, which can run more efficiently because the `ORDER BY` is allowed to drop everything except the top record in each group.

In this example query, we are no longer partitioning by `bike_id`, so we are asking BigQuery to sort the entire dataset _and_ assign a row number to all 24 million rows before picking only the first one:

In [None]:
%%bigquery --verbose
-- Query should succeed, but will take a bit

SELECT
  rental_id,
  duration,
  bike_id,
  end_date
FROM (
  SELECT
    rental_id,
    duration,
    bike_id,
    end_date,
    
    -- NOTE: we removed the 'PARTITION BY bike_id' clause
    ROW_NUMBER() OVER (ORDER BY end_date ASC) rental_num
    
  FROM
    `bigquery-public-data`.london_bicycles.cycle_hire )
WHERE
  rental_num = 1

We can apply the `ARRAY_AGG(.... LIMIT 1)[offset(0)]` trick to this query to speed it up greatly.

In [None]:
%%bigquery --verbose
-- Query should succeed more quickly

SELECT
  rental.*
FROM (
  SELECT
    ARRAY_AGG( rentals
    ORDER BY rentals.end_date ASC LIMIT 1)[OFFSET(0)] rental
  FROM (
    SELECT
      rental_id,
      duration,
      bike_id,
      end_date
    FROM
      `bigquery-public-data`.london_bicycles.cycle_hire) rentals )


The first rule of BigQuery optimization is if your query runs in an acceptable amount of time and with acceptable resources, don't fix it! BigQuery has lots of intelligent (and brute-force) tricks under the hood to optimize your query for you.

For example, applying this `ARRAY_AGG()` trick to the original query, which had a `PARTITION BY bike_id` clause, will greatly slow it down, mostly because this dataset is too small to benefit from the trick.

__Bonus:__ Try this trick on the previous query and see if it's faster or slower.

### CSUM (Cumulative Sum)

`CSUM()` is a Teradata extension to standard SQL and is not supported in BigQuery. The same effect can be achieved by using `SUM()` as a window function, like this:

In [None]:
%%bigquery

SELECT
  rental_id,
  bike_id,
  end_date,
  duration,
  SUM(duration) OVER (PARTITION BY bike_id ORDER BY end_date ASC) AS running_sum
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
LIMIT 10
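
One detail worth checking during translation: when a window has an `ORDER BY` but no explicit frame, BigQuery defaults to `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, so rows that tie on the ordering column all receive the same running total. If the original `CSUM()` was expected to accumulate row by row, a sketch with an explicit `ROWS` frame might look like the following. The secondary sort key `rental_id` is an assumption added here to make the row order deterministic.

```sql
SELECT
  rental_id,
  bike_id,
  end_date,
  duration,
  SUM(duration) OVER (
    PARTITION BY bike_id
    -- rental_id breaks ties between rentals that share an end_date.
    ORDER BY end_date ASC, rental_id ASC
    -- Accumulate one physical row at a time instead of by value range.
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_sum
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
LIMIT 10
```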

### TIMESTAMP() and Time Zones

A `TIMESTAMP` value in BigQuery represents an absolute point in time and is displayed in UTC by default, no matter where in the world you process your queries. Because of that, you can't assume a local time zone such as America/Los_Angeles; you have to specify it explicitly. Here are some examples of applying a time zone.

In [2]:
%%bigquery

SELECT
  CAST('2020-12-01 14:30:00' AS TIMESTAMP) incoming_time_as_ts,
  CAST('2020-12-01 14:30:00' AS DATETIME) incoming_time_as_dt,
  DATETIME(TIMESTAMP('2020-12-01 14:30:00', 'America/Los_Angeles'),
    'US/Central') PST_TO_CST;


Unnamed: 0,incoming_time_as_ts,incoming_time_as_dt,PST_TO_CST
0,2020-12-01 14:30:00+00:00,2020-12-01 14:30:00,2020-12-01 16:30:00


In [None]:
%%bigquery

SELECT
  CAST('2020-12-01 14:30:00' AS TIMESTAMP) incoming_time_as_ts,
  CAST('2020-12-01 14:30:00' AS DATETIME) incoming_time_as_dt,
  DATETIME(TIMESTAMP('2020-12-01 14:30:00-08'),
    'US/Central') PST_TO_CST;

To see the current time using `CURRENT_TIMESTAMP` and `AT TIME ZONE`:

In [None]:
%%bigquery

SELECT
  EXTRACT(DATETIME
  FROM
    CURRENT_TIMESTAMP() AT TIME ZONE "America/Los_Angeles")
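
If you need the result as a formatted string rather than a `DATETIME`, `FORMAT_TIMESTAMP` also accepts an optional time zone argument. The following is a small sketch; the format string and the column aliases are illustrative choices, not part of the exercise.

```sql
SELECT
  -- Displayed in UTC, as all TIMESTAMP values are by default.
  CURRENT_TIMESTAMP() AS now_utc,
  -- The same instant rendered as text in the Los Angeles time zone.
  FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S %Z', CURRENT_TIMESTAMP(), 'America/Los_Angeles') AS now_los_angeles
```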


## Data Manipulation Language (DML)

The [Data Manipulation Language (DML)](https://wikipedia.org/wiki/Data_manipulation_language) is used to list, add, delete, and modify data in a database. It includes the
`SELECT`, `INSERT`, `DELETE`, and `UPDATE` statements.

While the basic forms of these statements are the same between Teradata SQL and
standard SQL, Teradata includes additional, non-standard clauses and special
statement constructs that you need to convert when you migrate. The following
sections present a non-exhaustive list of the most common statements, the main
differences, and the recommended translations.

### The INSERT statement

BigQuery is an enterprise data warehouse that focuses on Online
Analytical Processing (OLAP). Using point-specific DML statements, such as
executing a script with many `INSERT` statements, is an attempt to treat
BigQuery like an Online Transaction Processing (OLTP) system,
which is not a correct approach.



BigQuery DML statements are intended for bulk updates, and
each DML statement that modifies data
[initiates an implicit transaction](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-manipulation-language#limitations).
You should therefore group your DML statements whenever possible to avoid unnecessary
transaction overhead.

As an example, if you have the following set of statements from Teradata,
running them as is in BigQuery is an anti-pattern:

```sql
INSERT INTO t1 (...) VALUES (...);
INSERT INTO t1 (...) VALUES (...);
```

You can translate the previous script into a single `INSERT` statement, which
performs a bulk operation instead:

```sql
INSERT INTO t1 VALUES (...), (...)
```



A typical scenario where a large number of `INSERT` statements is used is when
you create a new table from an existing table. In BigQuery,
instead of using multiple `INSERT` statements, create a new table and insert all
the rows in one operation using the
[`CREATE TABLE ... AS SELECT`](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#creating_a_new_table_from_an_existing_table)
statement.
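
As a minimal sketch, a single `CREATE TABLE ... AS SELECT` replaces the row-by-row copy. The names `mydataset`, `t1`, `t1_copy`, and the columns `a` and `b` are placeholders, and the `WHERE` filter is only there to show that the copy can be selective.

```sql
-- Create and populate the new table in one operation.
CREATE TABLE mydataset.t1_copy AS
SELECT
  a,
  b
FROM mydataset.t1
WHERE b IS NOT NULL;
```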


For the next example, we first create a copy of the data in your own project so that we have write permissions:

In [None]:
%%bash

# Create a dataset in your project
bq mk --location eu my_london_bicycles_dataset

# Copy the public dataset to your project
bq cp bigquery-public-data:london_bicycles.cycle_hire my_london_bicycles_dataset.cycle_hire
bq cp bigquery-public-data:london_bicycles.cycle_stations my_london_bicycles_dataset.cycle_stations

In [None]:
%%bash

# Examine your local table
bq show my_london_bicycles_dataset.cycle_hire

__#TODO(you):__ Rewrite this `INSERT INTO` query so that it executes only one DML statement.

In [None]:
%%bigquery
-- Rows before

SELECT COUNT(*) FROM my_london_bicycles_dataset.cycle_hire

In [None]:
%%bigquery
-- Query to be edited
INSERT INTO
  my_london_bicycles_dataset.cycle_hire
VALUES
  (47469109, 3180, 7054, '2015-09-03 12:45:00 UTC', 111, 'Park Lane, Hyde Park', '2015-09-03 11:52:00 UTC', 300, 'Serpentine Car Park, Hyde Park', NULL, NULL, NULL);
INSERT INTO
  my_london_bicycles_dataset.cycle_hire
VALUES
  (46915469, 7380, 3792, '2015-08-16 11:59:00 UTC', 407, 'Speakers\' Corner 1, Hyde Park', '2015-08-16 09:56:00 UTC', 407, 'Speakers\' Corner 1, Hyde Park', NULL, NULL, NULL);

In [None]:
%%bigquery
-- Fixed query
INSERT INTO
  my_london_bicycles_dataset.cycle_hire
VALUES
  (47469109, 3180, 7054, '2015-09-03 12:45:00 UTC', 111, 'Park Lane, Hyde Park', '2015-09-03 11:52:00 UTC', 300, 'Serpentine Car Park, Hyde Park', NULL, NULL, NULL),
  (46915469, 7380, 3792, '2015-08-16 11:59:00 UTC', 407, 'Speakers\' Corner 1, Hyde Park', '2015-08-16 09:56:00 UTC', 407, 'Speakers\' Corner 1, Hyde Park', NULL, NULL, NULL);

In [None]:
%%bigquery
-- Rows after

SELECT COUNT(*) FROM my_london_bicycles_dataset.cycle_hire

### The UPDATE statement

`UPDATE` statements in Teradata are similar to `UPDATE` statements in standard
SQL. The important differences are:

-   The order of the `SET` and `FROM` clauses is reversed.
-   Any
    [Teradata correlation names](https://docs.teradata.com/reader/huc7AEHyHSROUkrYABqNIg/k6fC7ozmhIZZXa315VjJAw)
    used as table aliases in the `UPDATE` must be removed.
-   In Standard SQL, each `UPDATE` statement must include the `WHERE` keyword,
    followed by a condition. To update all rows in the table, use `WHERE true`.

The following example shows an `UPDATE` statement from Teradata that uses
joins:

```sql
UPDATE t1
FROM t1, t2
SET
 b = t2.b
WHERE a = t2.a;
```

The equivalent statement in standard SQL is the following:

```sql
UPDATE t1
SET
 b = t2.b
FROM t2
WHERE t1.a = t2.a;
```

The considerations from the previous section about executing large numbers of
DML statements in BigQuery also apply in this case. We recommend
using a single
[`MERGE`](https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#merge_statement)
statement instead of multiple `UPDATE` statements.
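
For instance, the join-based update above could be written as one `MERGE`, sketched here with the same placeholder tables `t1` and `t2`. Additional `WHEN` clauses can then fold related inserts or deletes into the same statement.

```sql
MERGE t1
USING t2
ON t1.a = t2.a
-- Update every row of t1 that has a matching key in t2.
WHEN MATCHED THEN
  UPDATE SET b = t2.b;
```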


__#TODO(you):__ Rewrite this UPDATE statement so that it executes in BigQuery

In [None]:
%%bigquery --verbose
-- Query will fail
UPDATE
  my_london_bicycles_dataset.cycle_hire
FROM my_london_bicycles_dataset.cycle_hire t1, `bigquery-public-data`.london_bicycles.cycle_hire t2 
SET
  bike_id = t2.bike_id 
WHERE
  t1.rental_id = t2.rental_id

In [None]:
%%bigquery --verbose
-- Query will succeed but take a bit of time
UPDATE
  my_london_bicycles_dataset.cycle_hire t1
SET
  bike_id = t2.bike_id
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire t2
WHERE
  t1.rental_id = t2.rental_id

### The DELETE statement

Standard SQL requires `DELETE` statements to have a `WHERE` clause. In
Teradata, `WHERE` clauses are
[optional in `DELETE` statements](https://docs.teradata.com/reader/huc7AEHyHSROUkrYABqNIg/z8eO9bdxtjFRveHdDwwYPQ)
if you're deleting all the rows in a table. (If specific rows are being deleted,
the Teradata DML also requires a `WHERE` clause.) During translation, any
missing `WHERE` clauses must be added to scripts. This change is necessary only
when all the rows in a table will be deleted.

For instance, the following statement in Teradata SQL deletes all the rows from
a table. The `ALL` clause is optional:

```sql
DELETE t1 ALL;
```

The translation into standard SQL is as follows:

```sql
DELETE FROM t1 WHERE TRUE;
```

__#TODO(you):__ Rewrite this `DELETE` statement so that it executes in BigQuery.

In [None]:
%%bigquery
--Query will fail

DELETE my_london_bicycles_dataset.cycle_hire ALL;

In [None]:
%%bigquery
--Query will succeed

DELETE my_london_bicycles_dataset.cycle_hire WHERE TRUE;

## Data Definition Language (DDL)

The
[Data Definition Language](https://wikipedia.org/wiki/Data_definition_language)
(DDL) is used to define your database schema. It includes a subset of SQL
statements such as `CREATE`, `ALTER`, and `DROP`.

For the most part, these statements are equivalent between Teradata SQL and
standard SQL. Here is a non-exhaustive list of notable exceptions:

-   Index manipulation options are not supported in
    BigQuery, such as `CREATE INDEX` and `PRIMARY INDEX`.
    BigQuery does not use indexes when querying your data. It
    produces fast results thanks to its underlying model using
    [Dremel](https://ai.google/research/pubs/pub36632),
    its storage techniques using
    [Capacitor](https://cloud.google.com/blog/products/gcp/inside-capacitor-bigquerys-next-generation-columnar-storage-format),
    and its massively parallel architecture.
-   [Constraints](https://docs.teradata.com/reader/rgAb27O_xRmMVc_aQq2VGw/_X6axAFdllKMCoVKT9~hHg),
    which are checks applied to individual columns or an entire table.
    BigQuery supports only `NOT NULL` constraints.
-   [`MULTISET`](https://docs.teradata.com/reader/VrFCOAaniAIfrJsA51oQJA/3vKnwH1vZNoJpZZmuKCsGg),
    which is used to allow duplicate rows in Teradata.
-   [`CASESPECIFIC`](https://docs.teradata.com/reader/S0Fw2AVH8ff3MDA0wDOHlQ/CrmHZxipG~s_PP3s~5Wg4w),
    which specifies case for character data comparisons and collations.

### Indexing for consistency (UNIQUE, PRIMARY INDEX)


In Teradata, a unique index can be used to prevent rows with non-unique keys in a table. If a process tries to insert or update data that has a value that's already in the index, the operation either fails with an index violation (`MULTISET` tables) or silently ignores it (`SET` tables).

Because BigQuery doesn't provide explicit indexes, other strategies can be employed to achieve a similar effect. A `MERGE` statement can be used to insert only unique records into a target table from a staging table while discarding duplicate records. However, there is no way to prevent a user with edit permissions from inserting a duplicate record, because BigQuery never locks during `INSERT` operations. The following example uses a `MERGE` statement from a staging table to insert only the records whose key is not already present in the target table.

In [None]:
%%bash
# Re-insert cycle_hire data that was deleted

bq cp -f bigquery-public-data:london_bicycles.cycle_hire my_london_bicycles_dataset.cycle_hire

In [None]:
%%bigquery
-- Create a loading table with some duplicate rows and a new unique row. `rental_id` will be the unique key.

CREATE OR REPLACE TABLE
  my_london_bicycles_dataset.temp_loading_table AS (

  --Grab 5 duplicate rows
  SELECT
    *
  FROM
    my_london_bicycles_dataset.cycle_hire
  LIMIT
    5)
UNION ALL (
  
  --Add a new unique row
  SELECT
    111147469109,
    3180,
    7054,
    '2015-09-03 12:45:00 UTC',
    111,
    'Park Lane, Hyde Park',
    '2015-09-03 11:52:00 UTC',
    300,
    'Serpentine Car Park, Hyde Park',
    NULL,
    NULL,
    NULL)

In [None]:
%%bigquery --verbose
--Number of Rows in base table

SELECT COUNT(*) FROM my_london_bicycles_dataset.cycle_hire;

In [None]:
%%bigquery --verbose
--Number of Rows in loading table

SELECT COUNT(*) FROM my_london_bicycles_dataset.temp_loading_table;

We will now use a `MERGE` statement to insert and dedupe rows to the main table:

In [None]:
%%bigquery

MERGE
  my_london_bicycles_dataset.cycle_hire rentals
USING
  my_london_bicycles_dataset.temp_loading_table temp
ON
  temp.rental_id = rentals.rental_id
  WHEN NOT MATCHED
  THEN
    INSERT ROW

We can now see that the 1 unique row has been inserted:

In [None]:
%%bigquery --verbose
--Number of rows now in base table

SELECT COUNT(*) FROM my_london_bicycles_dataset.cycle_hire;
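
The `MERGE` above silently discards duplicates. If you would rather raise an error when a duplicate key is encountered, one option is to call the `ERROR()` function from a `WHEN MATCHED` clause, as sketched below. The error message text is an illustrative choice. Note that running this against the tables above would fail on purpose, because every staged `rental_id` now exists in the base table.

```sql
MERGE
  my_london_bicycles_dataset.cycle_hire rentals
USING
  my_london_bicycles_dataset.temp_loading_table temp
ON
  temp.rental_id = rentals.rental_id
-- A match means the key already exists: abort the statement with an error.
WHEN MATCHED THEN
  UPDATE SET rental_id = ERROR(CONCAT('Duplicate rental_id: ', CAST(temp.rental_id AS STRING)))
WHEN NOT MATCHED THEN
  INSERT ROW
```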

More often, users prefer to remove duplicates independently in order to find errors in downstream systems.
BigQuery does not support `DEFAULT` and `IDENTITY` (sequences) columns.

Here you will insert the 5 redundant values into the base table and use a `ROW_NUMBER()` function and `SELECT * EXCEPT()` to create a unique set of the data. `DISTINCT rental_id, * EXCEPT(rental_id)` is another option but is often not as fast.

In [None]:
%%bigquery

INSERT INTO
  my_london_bicycles_dataset.cycle_hire
SELECT
  *
FROM
  my_london_bicycles_dataset.temp_loading_table

In [None]:
%%bigquery --verbose
--Number of rows now in base table

SELECT COUNT(*) FROM my_london_bicycles_dataset.cycle_hire;

In [None]:
%%bigquery

-- Number of rows in a unique 'view' of the data:
SELECT
  COUNT(*)
FROM (
  SELECT
    * EXCEPT(row_number)
  FROM (
    SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY rental_id) row_number
    FROM
      `my_london_bicycles_dataset.cycle_hire`)
  WHERE
    row_number = 1 )

### True Row Uniqueness in BigQuery

If you need true uniqueness and don't have a unique key, you can use `SELECT DISTINCT *`. This is not ideal for performance reasons.

Here you will create a de-duped view and examine the performance impact of calling `SELECT DISTINCT *` each time. Consider regularly re-materializing your data if you have this use case.

In [None]:
%%bash
# Re-insert cycle_hire data that was deleted

bq cp -f bigquery-public-data:london_bicycles.cycle_hire my_london_bicycles_dataset.cycle_hire

Create a View using `SELECT DISTINCT *`.

In [None]:
%%bigquery
CREATE OR REPLACE VIEW
  my_london_bicycles_dataset.cycle_hire_dedupe AS
SELECT
  DISTINCT *
FROM
  my_london_bicycles_dataset.cycle_hire

In [None]:
%%bigquery --verbose
--Number of Rows in base table

SELECT COUNT(*) FROM my_london_bicycles_dataset.cycle_hire;

In [None]:
%%bigquery --verbose
--Number of Rows in de-duped View

SELECT COUNT(*) FROM my_london_bicycles_dataset.cycle_hire_dedupe;

Next, you'll create duplicates of 5 rows in the table and examine the underlying number of rows:

In [None]:
%%bigquery

INSERT INTO
  my_london_bicycles_dataset.cycle_hire
SELECT
  *
FROM
  my_london_bicycles_dataset.cycle_hire
LIMIT 5

In [None]:
%%bigquery --verbose
--Number of Rows in base table

SELECT COUNT(*) FROM my_london_bicycles_dataset.cycle_hire;

In [None]:
%%bigquery --verbose
--Number of Rows in de-duped View

SELECT COUNT(*) FROM my_london_bicycles_dataset.cycle_hire_dedupe;

`SELECT DISTINCT *` is essentially performing a `GROUP BY` on every field in the table. BigQuery can perform this scalably, but when using a view like this, you are asking it to perform the de-duplication upon every query call. Consider periodically rematerializing deduped views or building some de-duplication into your ETL pipelines.

You can see this in the query algebra:
<img src="img/select_distinct_query_algebra.png">


### IPython Magic Hints

As a helpful hint, you can parameterize your `%%bigquery` cells as follows:

In [3]:
%%bigquery --params {"bike_id": 5}

SELECT
  MAX(duration) AS max_duration,
  bike_id
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
WHERE
  bike_id=@bike_id
GROUP BY
  bike_id

Unnamed: 0,max_duration,bike_id
0,158340,5


## Stored Procedures

[Stored procedures](https://docs.teradata.com/reader/zzfV8dn~lAaKSORpulwFMg/qGy9u~3hCZ7HjA6Q51CVtA)
in Teradata are a combination of SQL and control statements. Stored procedures
can take parameters that let you build a customized interface to the Teradata
Database.

Stored procedures are supported as part of BigQuery
[Scripting](https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting).

However, there are some cases where other features might be more appropriate.
These alternatives depend on how your stored procedures are being used.
For example:

-   Replace triggers that are used to run periodic queries with
    [scheduled queries](https://cloud.google.com/bigquery/docs/scheduling-queries).
-   Replace stored procedures that control the complex execution of queries
    and their interdependencies with workflows defined in
    [Cloud Composer](https://cloud.google.com/composer) (managed Apache Airflow).
-   Refactor stored procedures that are used as an API into your data
    warehouse with
    [parameterized queries](https://cloud.google.com/bigquery/docs/parameterized-queries)
    and using the
    [BigQuery API](https://cloud.google.com/bigquery/docs/reference).
    This change implies that you must rebuild the logic from the stored
    procedure in a different programming language such as Java or Go, and that
    you then call SQL queries with parameters from the code.



### BigQuery scripting

BigQuery scripting enables you to send multiple statements to
BigQuery in one request, to use variables, and to use control flow
statements such as `IF` and `WHILE`. For example, you can
declare a variable, assign a value to it, and then reference it in a third
statement.

In BigQuery, a script is a SQL statement list to be executed in
sequence. A SQL statement list is a list of any valid BigQuery
statements that are separated by semicolons.

For example:

In [None]:
%%bigquery

-- Declare a variable to hold names as an array.
DECLARE top_names ARRAY<STRING>;
-- Build an array of the top 100 names from the year 2017.
SET top_names = (
  SELECT ARRAY_AGG(name ORDER BY number DESC LIMIT 100)
  FROM `bigquery-public-data`.usa_names.usa_1910_current
  WHERE year = 2017
);
-- Which names appear as words in Shakespeare's plays?
SELECT
  name AS shakespeare_name
FROM UNNEST(top_names) AS name
WHERE name IN (
  SELECT word
  FROM `bigquery-public-data`.samples.shakespeare
);
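
The script above uses variables but no control flow. As a complementary sketch, the following script combines `WHILE` and `IF` to add up the even numbers from 1 to 5. The variable names `i` and `total` are arbitrary.

```sql
DECLARE i INT64 DEFAULT 0;
DECLARE total INT64 DEFAULT 0;

-- Loop over 1..5 and accumulate only the even values.
WHILE i < 5 DO
  SET i = i + 1;
  IF MOD(i, 2) = 0 THEN
    SET total = total + i;
  END IF;
END WHILE;

-- Returns 6 (2 + 4).
SELECT total AS sum_of_even_numbers;
```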

Scripts are executed in BigQuery using
[`jobs.insert`](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/insert),
similar to any other query, with the multi-statement script specified as the
query text. When a script executes, additional jobs, known as child jobs,
are created for each statement in the script. You can enumerate the child jobs
of a script by calling
[`jobs.list`](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/list),
passing in the script's job ID as the `parentJobId` parameter.

When
[`jobs.getQueryResults`](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/getQueryResults)
is invoked on a script, it returns the query results for the last `SELECT`,
DML, or DDL statement to execute in the script, with no query results if none of
those statements have executed. To obtain the results of all statements in
the script, enumerate the child jobs and call `jobs.getQueryResults`
on each of them.

BigQuery interprets any request with multiple statements as a script,
unless the statements consist of `CREATE TEMP FUNCTION` statement(s), with a
single final query statement. For example, the following would not be considered
a script:

In [None]:
%%bigquery

CREATE TEMP FUNCTION Add(x INT64, y INT64) AS (x + y);

SELECT Add(3, 4);

### Stored Procedures

Unlike temporary functions, which persist only for the length of the query statement, stored procedures can be created and used over time. They are associated with a dataset, just like tables and views:

In [None]:
%%bigquery

CREATE PROCEDURE my_london_bicycles_dataset.AddDelta(INOUT x INT64, delta INT64)
BEGIN
  SET x = x + delta;
END;

In [None]:
%%bigquery

DECLARE accumulator INT64 DEFAULT 0;
CALL my_london_bicycles_dataset.AddDelta(accumulator, 5);
CALL my_london_bicycles_dataset.AddDelta(accumulator, 3);
SELECT accumulator;

## Next Steps

This was a sample of the most common SQL translations required when moving from Teradata to BigQuery. For an exhaustive list, consult the [SQL Translation Reference Page](https://cloud.google.com/solutions/migration/dw2bq/td2bq/td-bq-sql-translation-reference-tables).

Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.