In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Timeseries Insights API Demonstration

---

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/notebook_template.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

**_NOTE_**: This notebook has been tested using Python version 3.9.


## Overview

The [Timeseries Insights API](https://cloud.google.com/timeseries-insights) is
designed for gathering insights in real time from your time series
datasets.

This tutorial is a step-by-step guide to using the API to set up a real-world system that continuously detects spiking n-grams in live news data.

### Objective

You will learn how to use the [Timeseries Insights](https://cloud.google.com/timeseries-insights) API with publicly available real-time data from the [GDELT](https://www.gdeltproject.org/data.html) project. You create an initial dataset from historical data, establish a pipeline that continuously ingests real-time data, and configure a query that detects anomalies in the news transcript time series.

This tutorial uses the following Google Cloud services and resources:

* Dataflow: runs the recurring data ingestion and anomaly detection
  requests.
* Cloud Storage: stores the final detected anomalies as a text file.

The steps involved in this tutorial:

1. Create a dataset.
2. Set up a recurring append request.
3. Set up a recurring query.

### Dataset

This demo uses public real-time data from the
[GDELT](https://www.gdeltproject.org/data.html) project to feed the system.

GDELT data is available in the BigQuery project `gdelt-bq`. This demo uses unigrams and bigrams from the [Television News N-grams](https://blog.gdeltproject.org/announcing-the-television-news-ngram-2-0-dataset) data stored in the `gdeltv2.iatv_1gramsv2` and `gdeltv2.iatv_2gramsv2` tables.

### Set up a Colab instance
If you plan to run the Colab instance on Google Cloud, use the [GCP Deployment Manager](https://console.cloud.google.com/dm/deployments) to create it.

### Costs

This tutorial uses billable components of Google Cloud:

* Dataflow
* BigQuery
* Cloud Storage
* Timeseries Insights API

Learn about [Dataflow pricing](https://cloud.google.com/dataflow/pricing),
[BigQuery pricing](https://cloud.google.com/bigquery#pricing-module),
and [Cloud Storage pricing](https://cloud.google.com/storage/pricing),
then use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

Install the following packages required to execute this notebook.

In [None]:
# Install the packages

! pip3 install --upgrade apache-beam[gcp] \
                         apache-beam[interactive] \
                         google-cloud-bigquery[pandas]

### Colab only: Uncomment the following cell to restart the kernel.

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get $300 free credit towards your compute and storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. Enable the necessary APIs for your project.

4. If you are running this notebook locally, you must install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

If you don't know your project ID, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page, [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = '[your-project-id]'  # @param {type:"string"}

# Set the project id
! gcloud config set project $PROJECT_ID

#### Region

You can also change the `REGION` variable used by Timeseries Insights. Learn more about [Timeseries Insights API Regions](https://cloud.google.com/timeseries-insights/docs/locations).

In [2]:
REGION = 'us-central1'  # @param {type: "string"}

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you might have to authenticate differently. Follow the instructions below.

Note: If you don't have an existing Jupyter environment, use the [GCP Deployment Manager](https://console.cloud.google.com/dm/deployments) to create a Colab instance on Google Cloud.

**1. Dataflow Workbench**
* Because your account is authenticated, there is nothing for you to do.

**2. If you are using a local JupyterLab instance, uncomment and run.**

In [None]:
# ! gcloud auth login

**3. If you are using a Colab instance, uncomment and run.**

In [3]:
# from google.colab import auth
# auth.authenticate_user(project_id=PROJECT_ID)

**4. Enable the following APIs for your project:**
   * [Dataflow
     API](https://console.cloud.google.com/apis/api/dataflow.googleapis.com)
   * [Compute Engine
     API](https://console.cloud.google.com/apis/api/compute.googleapis.com)
   * [BigQuery
     API](https://console.cloud.google.com/apis/api/bigquery.googleapis.com)
   * [Timeseries Insights
     API](https://console.cloud.google.com/apis/api/timeseriesinsights.googleapis.com)
   * [Data pipelines
     API](https://console.cloud.google.com/apis/api/datapipelines.googleapis.com)
   * [Cloud Logging
     API](https://console.cloud.google.com/apis/api/logging.googleapis.com)
   * [Cloud Notebook
     API](https://console.cloud.google.com/apis/api/notebooks.googleapis.com)
   * [BigQuery Storage
     API](https://console.cloud.google.com/apis/api/bigquerystorage.googleapis.com)
   * [Cloud Scheduler
     API](https://console.cloud.google.com/apis/api/cloudscheduler.googleapis.com)

In [4]:
! gcloud services enable bigquery.googleapis.com
! gcloud services enable bigquerystorage.googleapis.com
! gcloud services enable cloudscheduler.googleapis.com
! gcloud services enable compute.googleapis.com
! gcloud services enable dataflow.googleapis.com
! gcloud services enable datapipelines.googleapis.com
! gcloud services enable logging.googleapis.com
! gcloud services enable notebooks.googleapis.com
! gcloud services enable timeseriesinsights.googleapis.com

**5. Grant permissions to the service account**

Set the service account to be used by this demo. If you are using [Compute Engine](https://cloud.google.com/compute), you can use the [Compute Engine default service account](https://cloud.google.com/iam/docs/service-account-types#default). Choose an existing service account from [Credentials](https://console.cloud.google.com/apis/credentials), or create a new one.

In [58]:
SERVICE_ACCOUNT = '[your-service-account]'  # @param {type:"string"}

In [None]:
import sys

IS_COLAB = 'google.colab' in sys.modules

if SERVICE_ACCOUNT in ('', None, '[your-service-account]'):
    if IS_COLAB:
        # Colab: fall back to the Compute Engine default service account,
        # which is named after the project number.
        shell_output = ! gcloud projects describe $PROJECT_ID --format='value(projectNumber)' 2>/dev/null
        project_number = shell_output[0].strip()
        SERVICE_ACCOUNT = f'{project_number}-compute@developer.gserviceaccount.com'
    else:
        # Elsewhere: use the account that is currently active in gcloud.
        shell_output = ! gcloud auth list --filter=status:ACTIVE --format='value(account)' 2>/dev/null
        SERVICE_ACCOUNT = shell_output[0].strip()

    print('Service Account:', SERVICE_ACCOUNT)

Grant the following role permissions to your service account.

* BigQuery Data Editor
* BigQuery Job User
* BigQuery Read Session User
* Dataflow Developer
* Dataflow Worker
* GCS Storage Bucket Owner
* Storage Object Creator
* Storage Object Viewer
* Storage Object User
* Timeseries Insights DataSet Editor
* Data pipelines Invoker

Run the following cell to grant the roles, or open the [IAM & Admin](https://console.cloud.google.com/iam-admin) page in the Google Cloud console and assign the roles there.

See the role details from the following documents:

* https://cloud.google.com/storage/docs/access-control/iam-roles
* https://cloud.google.com/bigquery/docs/access-control
* https://cloud.google.com/dataflow/docs/concepts/access-control
* https://cloud.google.com/iam/docs/understanding-roles#timeseriesinsights.datasetsEditor
* https://cloud.google.com/iam/docs/understanding-roles#datapipelines.invoker

See https://cloud.google.com/sdk/gcloud/reference/projects/add-iam-policy-binding for command-line usage.

In [None]:
! gcloud projects add-iam-policy-binding {PROJECT_ID} \
  --member=serviceAccount:{SERVICE_ACCOUNT} --role=roles/bigquery.dataEditor

! gcloud projects add-iam-policy-binding {PROJECT_ID} \
  --member=serviceAccount:{SERVICE_ACCOUNT} --role=roles/bigquery.jobUser

! gcloud projects add-iam-policy-binding {PROJECT_ID} \
  --member=serviceAccount:{SERVICE_ACCOUNT} \
  --role=roles/bigquery.readSessionUser

! gcloud projects add-iam-policy-binding {PROJECT_ID} \
  --member=serviceAccount:{SERVICE_ACCOUNT} --role=roles/dataflow.developer

! gcloud projects add-iam-policy-binding {PROJECT_ID} \
  --member=serviceAccount:{SERVICE_ACCOUNT} --role=roles/dataflow.worker

! gcloud projects add-iam-policy-binding {PROJECT_ID} \
  --member=serviceAccount:{SERVICE_ACCOUNT} --role=roles/storage.objectCreator

! gcloud projects add-iam-policy-binding {PROJECT_ID} \
  --member=serviceAccount:{SERVICE_ACCOUNT} --role=roles/storage.objectUser

! gcloud projects add-iam-policy-binding {PROJECT_ID} \
  --member=serviceAccount:{SERVICE_ACCOUNT} --role=roles/storage.objectViewer

! gcloud projects add-iam-policy-binding {PROJECT_ID} \
  --member=serviceAccount:{SERVICE_ACCOUNT} \
  --role=roles/timeseriesinsights.datasetsEditor

! gcloud projects add-iam-policy-binding {PROJECT_ID} \
  --member=serviceAccount:{SERVICE_ACCOUNT} --role=roles/datapipelines.invoker
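The repeated commands above differ only in the role ID, so they can also be generated from a list. The following is a minimal sketch that only builds and prints the command strings; `build_binding_commands`, the project ID, and the service account address are hypothetical names used for illustration, while the role IDs are copied from the cell above.

```python
# Role IDs copied from the gcloud commands in this notebook.
ROLES = [
    'roles/bigquery.dataEditor',
    'roles/bigquery.jobUser',
    'roles/bigquery.readSessionUser',
    'roles/dataflow.developer',
    'roles/dataflow.worker',
    'roles/storage.objectCreator',
    'roles/storage.objectUser',
    'roles/storage.objectViewer',
    'roles/timeseriesinsights.datasetsEditor',
    'roles/datapipelines.invoker',
]


def build_binding_commands(project_id, service_account, roles=ROLES):
    """Returns one gcloud command string per role (hypothetical helper)."""
    template = (
        'gcloud projects add-iam-policy-binding {p} '
        '--member=serviceAccount:{s} --role={r}'
    )
    return [
        template.format(p=project_id, s=service_account, r=role)
        for role in roles
    ]


# Placeholder values; substitute PROJECT_ID and SERVICE_ACCOUNT.
commands = build_binding_commands(
    'my-project', 'demo@my-project.iam.gserviceaccount.com'
)
for command in commands:
    print(command)
```

Printing the commands first makes it easy to review the bindings before running them.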

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

In [None]:
# Use an f-string so that {PROJECT_ID} is actually substituted.
BUCKET_URI = f'gs://your-bucket-name-{PROJECT_ID}-unique'  # @param {type:"string"}

print(BUCKET_URI)

**If you don't have an existing bucket, run the following cell to create your Cloud Storage bucket.**

In [12]:
! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}
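A Cloud Storage URI such as `BUCKET_URI` consists of a bucket name followed by an optional object path. As a rough sketch of how such a URI can be split into those two parts, the hypothetical helper `split_gcs_uri` below uses a regular expression; the example URI is a placeholder, not a path used by this demo.

```python
import re


def split_gcs_uri(uri):
    """Splits 'gs://bucket/path' into (bucket, path) (hypothetical helper)."""
    match = re.match(r'^gs://([^/]+)/?(.*)$', uri)
    if not match:
        raise ValueError('Not a gs:// URI: {}'.format(uri))
    return match.group(1), match.group(2)


# Placeholder URI for illustration only.
bucket_name, object_path = split_gcs_uri(
    'gs://my-bucket-unique/tsi-demo/timestamp.txt'
)
print(bucket_name)
print(object_path)
```

Raising an error on a malformed URI surfaces configuration mistakes early instead of failing later inside a pipeline step.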

### Import libraries

In [36]:
import argparse
import json
import logging
import os
import re
import time
from datetime import datetime
from typing import List

import apache_beam as beam
import google.auth
import google.auth.transport.requests
import matplotlib.pyplot as plt
import pandas
from apache_beam.options import pipeline_options
from apache_beam.options.pipeline_options import GoogleCloudOptions
from google.cloud import storage

This is a library to build [Event](https://cloud.google.com/timeseries-insights/docs/reference/rest/v1/projects.locations.datasets/appendEvents#event) objects from the GDELT BigQuery dataset.

In [37]:
class RowToPropertiesFn(beam.DoFn):
    """Converts BigQuery row to the properties."""

    def process(self, row):
        """Converts BigQuery row to properties tuple"""
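        # Build a deterministic key: Python's built-in hash() of strings is
        # randomized per process, so it can differ between workers.
        group_id = '{}|{}|{}'.format(
            row['TIMESTAMP'], row['STATION'], row['SHOW']
        )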

        timestamp = row['TIMESTAMP']
        properties = (
            row['STATION'],
            row['HOUR'],
            row['SHOW'],
        )
        ngram = (row['NGRAM'].replace("'", '').replace('"', ''), row['COUNT'])

        return [(group_id, timestamp, properties, ngram)]


class ConvertPropertyToEventFn(beam.DoFn):
    """Converts single property to event."""

    def process(self, element):
        """Converts property to event"""
        _, timestamp, properties, ngram = element
        station, hour, show = properties
        word, count = ngram

        if show is None:
            show = 'None'

        event = {
            'eventTime': timestamp.isoformat(),
            'dimensions': [
                {'name': 'station', 'stringVal': station},
                {'name': 'hour', 'stringVal': str(hour)},
                {'name': 'show', 'stringVal': show},
                {'name': 'ngram', 'stringVal': word},
                {'name': 'count', 'doubleVal': int(count)},
            ],
        }

        return [event]


class CombinePropertiesByGroupFn(beam.CombineFn):
    """Combines properties to single event by group id."""

    def create_accumulator(self):
        """Creates an empty accumulator to track the event."""

        # Timestamp, event id, properties, ngrams.
        return {'t': None, 'e': 0, 'p': (), 'n': []}

    def add_input(self, mutable_accumulator, element):
        """Process the incoming value."""

        _, timestamp, properties, ngram = element
        mutable_accumulator['t'] = timestamp
        mutable_accumulator['p'] = properties
        mutable_accumulator['n'].append(ngram)

        return mutable_accumulator

    def merge_accumulators(self, accumulators):
        """Merge several accumulators into a single one."""
        mutable_accumulator = self.create_accumulator()

        for accumulator in accumulators:
            if mutable_accumulator['t'] is None:
                mutable_accumulator = accumulator
            else:
                mutable_accumulator['n'].extend(accumulator['n'])

        return mutable_accumulator

    def extract_output(self, accumulator):
        """Exports the json event."""

        timestamp = accumulator['t']
        properties = accumulator['p']
        ngrams = accumulator['n']
        station, hour, show = properties

        # Match ConvertPropertyToEventFn: the show can be missing.
        if show is None:
            show = 'None'

        event = {
            'eventTime': timestamp.isoformat(),
            'dimensions': [
                {'name': 'station', 'stringVal': station},
                {'name': 'hour', 'stringVal': str(hour)},
                {'name': 'show', 'stringVal': show},
            ],
        }

        for ngram in ngrams:
            word, _ = ngram
            event['dimensions'].append({'name': 'ngram', 'stringVal': word})

        return event


class ConvertToJsonFn(beam.DoFn):
    """Converts events dictionary to json string."""

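    def process(self, event):
        """Converts events dictionary to json string."""
        import json

        # json.dumps escapes quotes and serializes None and booleans
        # correctly, unlike replacing quotes in str(event).
        event_json = json.dumps(event)

        return [event_json]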


class GdeltClient:
    """Gdelt client to build events from gdelt data in BigQuery."""

    @staticmethod
    def to_bq_request(read_config):
        """Converts read config to BQ query."""

        import apache_beam as beam

        timestamp, duration, limit, table = read_config

        # iatv_1gramsv2
        query_tmpl = """
            SELECT
                *
            FROM
                `gdelt-bq.gdeltv2.{table}`
            WHERE
                TIMESTAMP_SECONDS({timestamp} - {duration}) < TIMESTAMP AND
                TIMESTAMP <= TIMESTAMP_SECONDS({timestamp})
            """

        query = query_tmpl.format(timestamp=timestamp, duration=duration,
                                  table=table)
        if limit > 0:
            query = '{query} LIMIT {limit}'.format(query=query, limit=limit)

        return beam.io.ReadFromBigQueryRequest(query=query)

    def read_ngrams_from_configs(
        self, read_config: beam.PCollection
    ) -> beam.PCollection:
        """Reads BigQuery using the given read config."""

        ngrams = (
            read_config
            | 'readConfigToBq' >> beam.Map(GdeltClient.to_bq_request)
            | 'readFromBq' >> beam.io.ReadAllFromBigQuery()
        )

        return ngrams

    def build_combined_events(
        self, ngrams: beam.PCollection
    ) -> beam.PCollection:
        """Builds combined events json from the given ngrams.

        Args:
          ngrams: Apache beam pcollection of ngrams.

        Returns:
          Returns Apache beam pcollection of events.
        """

        events = (
            ngrams
            | 'convertToProperties' >> beam.ParDo(RowToPropertiesFn())
            | 'groupByEventId' >> beam.Map(lambda x: (x[0], x))
            | 'combineToEvent' >> beam.CombinePerKey(
                CombinePropertiesByGroupFn())
            | 'getValue' >> beam.MapTuple(lambda k, v: v)
            | 'convertToJson' >> beam.ParDo(ConvertToJsonFn())
        )

        return events

    def build_events(self, ngrams: beam.PCollection) -> beam.PCollection:
        """Builds events json from the given ngrams.

        Args:
          ngrams: Apache beam pcollection of ngrams.

        Returns:
          Returns Apache beam pcollection of events.
        """

        events = (
            ngrams
            | 'convertToProperties' >> beam.ParDo(RowToPropertiesFn())
            | 'propertyToEvent' >> beam.ParDo(ConvertPropertyToEventFn())
            | 'convertToJson' >> beam.ParDo(ConvertToJsonFn())
        )

        return events

    def write_events(
        self, events: beam.PCollection, file_path_prefix: str,
        num_shards: int
    ):
        """Writes the events to the storage.

        Args:
          events: Apache beam pcollection of events.
          file_path_prefix: GCP storage file path prefix.
          num_shards: Number of file shards.
        """

        events | 'write' >> beam.io.WriteToText(
            file_path_prefix=file_path_prefix, num_shards=num_shards
        )

        return
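To make the event format concrete, here is a small sketch that mirrors the per-row conversion done by `RowToPropertiesFn` and `ConvertPropertyToEventFn`, but in plain Python without Beam. The function name `row_to_event` and the sample row values are made up for illustration; the column names and dimension names follow the classes above.

```python
import json
from datetime import datetime, timezone


def row_to_event(row):
    """Builds one event dict from one n-gram row (illustrative helper)."""
    show = row['SHOW'] if row['SHOW'] is not None else 'None'
    # Quotes are stripped from the n-gram, as in RowToPropertiesFn.
    ngram = row['NGRAM'].replace("'", '').replace('"', '')
    return {
        'eventTime': row['TIMESTAMP'].isoformat(),
        'dimensions': [
            {'name': 'station', 'stringVal': row['STATION']},
            {'name': 'hour', 'stringVal': str(row['HOUR'])},
            {'name': 'show', 'stringVal': show},
            {'name': 'ngram', 'stringVal': ngram},
            {'name': 'count', 'doubleVal': int(row['COUNT'])},
        ],
    }


# Invented sample row; real rows come from the GDELT BigQuery tables.
sample_row = {
    'TIMESTAMP': datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc),
    'STATION': 'CNN',
    'HOUR': 12,
    'SHOW': None,
    'NGRAM': "today's",
    'COUNT': 3,
}

event = row_to_event(sample_row)
print(json.dumps(event, indent=2))
```

Each row becomes one event whose dimensions carry the station, hour, show, n-gram, and count, which is the shape the append and dataset-creation steps expect as JSON.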

This is a library for managing the timestamp for a recurring pipeline.

In [38]:
class BuildReadConfigFn(beam.DoFn):
    """Builds read config from the timestamp file."""

    def process(
        self,
        element,
        timestamp_filepath: str,
        min_duration: int,
        max_duration: int,
        delay: int,
        timestamp: int,
        write: bool
    ):
        import re
        from datetime import datetime

        from google.cloud import storage

        match = re.search(r'^gs://([a-zA-Z0-9-_]+)/(.*)', timestamp_filepath)
        if not match:
            return None

        bucket_name = match.group(1)
        blob_name = match.group(2)

        storage_client = storage.Client()
        bucket = storage_client.bucket(bucket_name)
        blob = bucket.blob(blob_name)

        with blob.open('r') as f:
            timestamp = int(f.read())

        now = int(datetime.now().timestamp())
        now = now - now % min_duration - delay
        timestamp = timestamp - timestamp % min_duration
        now = min(now, timestamp + max_duration)
        duration = now - timestamp
        if duration > 0:
            duration = max(duration, min_duration)

        if write:
            with blob.open('w') as f:
                f.write(str(now))

        configs = [(now, duration, -1, 'iatv_1gramsv2'),
                   (now, duration, -1, 'iatv_2gramsv2')]

        return configs


class TimestampOptions(pipeline_options.PipelineOptions):

    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument(
            '--min_duration',
            default='3600',
            help='Minimum duration in seconds to build append events.',
        )
        parser.add_argument(
            '--max_duration',
            default='86400',
            help='Maximum duration in seconds to build append events.',
        )
        parser.add_argument(
            '--delay',
            default='43200',
            help='Duration in seconds to wait until data arrives.',
        )
        parser.add_argument(
            '--timestamp_filepath',
            default='gs://_/',
            help='Timestamp metadata filepath.',
        )
        parser.add_argument(
            '--timestamp',
            default='0',
            help='Timestamp used for timeseries requests.',
        )


class TimestampManager:
    """Utility class to handle timestamp file."""

    def __init__(self, options: TimestampOptions):
        self.timestamp_filepath = options.timestamp_filepath
        self.min_duration = int(options.min_duration)
        self.max_duration = int(options.max_duration)
        self.delay = int(options.delay)
        self.timestamp = int(options.timestamp)

        match = re.search(r'^gs://([a-zA-Z0-9-_]+)/(.*)',
                          self.timestamp_filepath)
        if not match:
            raise Exception(
                'Bad timestamp filepath format: {}'.format(
                    options.timestamp_filepath
                )
            )

        self.bucket_name = match.group(1)
        self.blob_name = match.group(2)

    def write_timestamp(self, timestamp: int):
        """Writes the timestamp to the file."""

        storage_client = storage.Client()
        bucket = storage_client.bucket(self.bucket_name)
        blob = bucket.blob(self.blob_name)

        with blob.open('w') as f:
            f.write(str(timestamp))

    def read_timestamp(self) -> int:
        """Returns the latest timestamp from the file."""

        if self.timestamp > 0:
            return self.timestamp

        try:
            storage_client = storage.Client()
            bucket = storage_client.bucket(self.bucket_name)
            blob = bucket.blob(self.blob_name)

            with blob.open('r') as f:
                return int(f.read())
        except Exception:
            # Fall back to 0 when the timestamp file is missing or unreadable.
            return 0

    def get_now_timestamp(self) -> int:
        """Returns timestamp for most recent events."""

        now = int(datetime.now().timestamp())
        return now - now % self.min_duration - self.delay

    def read_timerange_config(
        self, init: beam.PCollection, write: bool
    ) -> beam.PCollection:
        """Builds the tuple (timestamp, duration, limit, table) for beam."""

        timerange = init | 'buildTimerange' >> beam.ParDo(
            BuildReadConfigFn(),
            timestamp_filepath=self.timestamp_filepath,
            min_duration=self.min_duration,
            max_duration=self.max_duration,
            delay=self.delay,
            timestamp=self.timestamp,
            write=write
        )

        return timerange
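The time window arithmetic in `BuildReadConfigFn` can be easier to follow in isolation. The sketch below reproduces the same steps as a pure function; `compute_window` is a hypothetical name, its defaults are taken from `TimestampOptions`, and the input timestamps are arbitrary example values.

```python
def compute_window(last_timestamp, now, min_duration=3600,
                   max_duration=86400, delay=43200):
    """Returns (window_end, duration) in seconds (illustrative helper)."""
    # Align "now" to a bucket boundary and step back to allow late data.
    window_end = now - now % min_duration - delay
    # Align the previously processed timestamp to a bucket boundary.
    window_start = last_timestamp - last_timestamp % min_duration
    # Never read more than max_duration in a single run.
    window_end = min(window_end, window_start + max_duration)
    duration = window_end - window_start
    if duration > 0:
        duration = max(duration, min_duration)
    return window_end, duration


# Recent checkpoint: the window covers 16 hours of data.
print(compute_window(last_timestamp=1_699_900_000, now=1_700_000_000))
# (1699956000, 57600)

# Very old checkpoint: the window is capped at max_duration.
print(compute_window(last_timestamp=0, now=1_700_000_000))
# (86400, 86400)
```

The cap means a pipeline that has fallen far behind catches up one `max_duration` chunk per run rather than reading the whole backlog at once.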

This is a library for printing and plotting the Timeseries API results.

In [39]:
class Plotter:
    """Utility class to print and plot the detected anomalies."""

    def __init__(self, dataset: str):
        self.dataset = dataset

    def read_query_from_blob(self, blob):
        """Reads query content from the blob."""

        match = re.search(
            r'.*/([0-9]{4})/([0-9]+)/([0-9]+)/([0-9]+).json', blob.name
        )
        if match:
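            content = blob.download_as_string().decode('utf8')
            try:
                query = json.loads(content)
            except json.JSONDecodeError:
                # The pipeline stores the response as a Python dict repr
                # (single-quoted), so parse it as a Python literal instead.
                import ast
                query = ast.literal_eval(content)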

            if query and 'name' in query and query['name'].endswith(self.dataset):
                return query

    def print_slices(self, slices):
        """Builds a dataframe from the given slices and displays it."""
        dates = []
        names = []
        scores = []
        values = []
        forecasts = []
        status = []

        sorted_slices = sorted(
            slices.items(), key=lambda x: x[1]['forecast']['point'][0]['time']
        )

        for ngram, slice in sorted_slices:
            date_str = slice['forecast']['point'][0]['time']
            date = datetime.strptime(date_str, '%Y-%m-%dT%H:%M:%SZ')
            dates.append(date)
            names.append(ngram)
            scores.append(slice['anomalyScore'])
            values.append(slice['detectionPointActual'])
            forecasts.append(slice['forecast']['point'][0]['value'])
            status.append(slice['status'])

        df = pandas.DataFrame({
            'Date (UTC)': dates,
            'Name': names,
            'Score': scores,
            'Value': values,
            'Forecast': forecasts,
            'Status': status,
        })

        display(df)

    def plot_slices(self, slices, ngrams):
        """Plots the timeseries of slices."""

        if not ngrams:
            return

        plots = list()
        for ngram in filter(lambda x: x in slices.keys(), ngrams):
            s = slices[ngram]
            for r in ['history', 'forecast']:
                p = dict()
                for t in s[r]['point']:
                    timestamp = int(
                        datetime.strptime(t['time'],
                                          '%Y-%m-%dT%H:%M:%SZ').timestamp()
                    )
                    p[timestamp] = t['value']
                plots.append(p)

        start = min(min(p.keys()) for p in plots)
        end = max(max(p.keys()) for p in plots)
        width_list = []
        for p in plots:
            tl = list(p.keys())
            width_list.append(
                min(
                    tl[i + 1] - tl[i] for i in range(len(tl) - 1)
                )
            )
        width = min(width_list)
        size = int((end - start) / width) + 1

        y = [[None] * len(plots) for i in range(size)]
        for i in range(len(plots)):
            for t in plots[i].keys():
                y[int((t - start) / width)][i] = plots[i][t]

        plt.figure(figsize=(30, 5))
        plt.plot(y, linestyle='-')
        plt.title('Anomalies')

    def read_queries(self, output_prefix: str, num: int, min_score: float):
        """Reads query outputs."""

        match = re.search(r'^gs://([a-zA-Z0-9-_]+)/(.*)', output_prefix)
        if not match:
            raise ValueError(
                'Bad output prefix format: {}'.format(output_prefix)
            )

        bucket_name = match.group(1)
        dir = match.group(2)

        storage_client = storage.Client()
        bucket = storage_client.bucket(bucket_name)
        blobs = bucket.list_blobs(prefix=dir)
        blob_list = list(blobs)

        slices = dict()
        for blob in blob_list[-num:]:
            query = self.read_query_from_blob(blob)
            if query:
                for slice in query['slices']:
                    if 'anomalyScore' in slice:
                        ngram = slice['dimensions'][0]['stringVal']
                        score = slice['anomalyScore']
                        value = slice['detectionPointActual']
                        forecast = slice['forecast']['point'][0]['value']
                        if (
                            ngram not in slices.keys()
                            and score >= min_score
                            and value > forecast
                        ):
                            slices[ngram] = slice

        return slices
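The selection rule inside `read_queries` keeps a slice only when it has an anomaly score of at least `min_score` and its actual value is above the forecast, so only upward spikes survive. The following sketch applies the same rule to an in-memory list; `select_spikes` and the sample slices are invented for illustration, and the field names follow the code above.

```python
def select_spikes(slices, min_score):
    """Keeps the first qualifying slice per n-gram (illustrative helper)."""
    selected = {}
    for s in slices:
        if 'anomalyScore' not in s:
            continue
        ngram = s['dimensions'][0]['stringVal']
        score = s['anomalyScore']
        value = s['detectionPointActual']
        forecast = s['forecast']['point'][0]['value']
        if ngram not in selected and score >= min_score and value > forecast:
            selected[ngram] = s
    return selected


def make_slice(ngram, score, actual, forecast):
    """Builds a made-up slice with only the fields used here."""
    return {
        'dimensions': [{'name': 'ngram', 'stringVal': ngram}],
        'anomalyScore': score,
        'detectionPointActual': actual,
        'forecast': {
            'point': [{'time': '2023-01-01T12:00:00Z', 'value': forecast}]
        },
    }


sample_slices = [
    make_slice('election', 5.0, 40.0, 10.0),  # spike: kept
    make_slice('weather', 0.5, 12.0, 10.0),   # score too low: dropped
    make_slice('sports', 7.0, 3.0, 9.0),      # below forecast: dropped
    {'dimensions': [{'name': 'ngram', 'stringVal': 'traffic'}]},  # no score
]

spikes = select_spikes(sample_slices, min_score=1.0)
print(sorted(spikes))
```

Requiring the actual value to exceed the forecast filters out anomalies caused by a sudden drop in mentions.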

This is a library for sending create, delete, list, get, and append requests to the [Timeseries Insights API](https://cloud.google.com/timeseries-insights).

In [40]:
class BeamFn(beam.DoFn):
    """Base beam class for timeseries insights."""

    def __init__(self, domain: str, region: str, dataset: str):
        self.domain = domain
        self.region = region
        self.dataset = dataset

    def setup(self):
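        import google.auth
        import google.auth.transport.requests

        # google.auth.default expects a sequence of scopes, not a string.
        credentials, self.project_id = google.auth.default(
            scopes=['https://www.googleapis.com/auth/cloud-platform']
        )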

        self.url = '{d}/projects/{p}/locations/{r}/datasets/{s}'.format(
            d=self.domain, p=self.project_id, r=self.region, s=self.dataset
        )

        self.authed_session = google.auth.transport.requests.AuthorizedSession(
            credentials
        )


class AppendEventsFn(BeamFn):
    """Append events to timeseries insights dataset."""

    def process(self, event_json):
        """Process the event"""
        import json

        url = self.url + ':appendEvents'

        event = json.loads(event_json)
        events = {'events': [event]}
        response = self.authed_session.post(url=url, json=events)

        return [str(response.json())]


class QueryFn(BeamFn):
    """Anomaly detection query."""

    def process(self, element, bucket, post):
        """Sends the query."""
        from datetime import datetime, timezone

        timestamp, _, _, _ = element
        url = self.url + ':query'

        # Format the detection time in UTC: the string carries a 'Z' suffix.
        t = datetime.fromtimestamp(
            timestamp - bucket, tz=timezone.utc
        ).strftime('%Y-%m-%dT%H:%M:%SZ')

        post['detectionTime'] = t
        response = self.authed_session.post(url=url, json=post)

        return [(timestamp, str(response.json()))]


class WriteResponseFn(beam.DoFn):
    """Write the response to file."""

    def __init__(self, output_prefix: str):
        self.output_prefix = output_prefix

    def setup(self):
        import re

        match = re.search(r'^gs://([a-zA-Z0-9-_]+)/(.*)', self.output_prefix)
        if not match:
            raise Exception(
                'Bad output filepath format: {}'.format(self.output_prefix)
            )

        self.bucket_name = match.group(1)
        self.dir = match.group(2)

    def process(self, element):
        from datetime import datetime

        from google.cloud import storage

        timestamp, response = element
        dt = datetime.utcfromtimestamp(timestamp)
        storage_client = storage.Client()
        bucket = storage_client.bucket(self.bucket_name)
        filepath = '{b}/{y}/{m}/{d}/{h}.json'.format(
            b=self.dir, y=dt.year, m=dt.month, d=dt.day, h=dt.hour
        )
        blob = bucket.blob(filepath)

        with blob.open('w') as f:
            f.write(response)

        return ['gs://' + self.bucket_name + '/' + filepath]


class TimeseriesClient:
    """Timeseries insights client."""

    def __init__(self, region: str, dataset: str):
        self.domain = 'https://timeseriesinsights.googleapis.com/v1'
        self.region = region
        self.dataset = dataset

        # google.auth.default expects a sequence of scopes, not a string.
        credentials, self.project_id = google.auth.default(
            scopes=['https://www.googleapis.com/auth/cloud-platform']
        )

        self.authed_session = google.auth.transport.requests.AuthorizedSession(
            credentials
        )

    def list_datasets(self):
        """Lists the datasets in the given region."""

        url = '{d}/projects/{p}/locations/{r}/datasets'.format(
            d=self.domain, p=self.project_id, r=self.region
        )
        response = self.authed_session.get(url)
        return response

    def append_events(self, events):
        """Appends events using beam."""

        result = events | 'append' >> beam.ParDo(
            AppendEventsFn(
                domain=self.domain, region=self.region, dataset=self.dataset
            )
        )

        return result

    def create_dataset(self, events_filepath: str):
        """Creates timeseries insights dataset."""

        post = {
            'name': self.dataset,
            'dataSources': [{'uri': events_filepath}],
            'ttl': '2592000s',
        }
        url = '{d}/projects/{p}/locations/{r}/datasets'.format(
            d=self.domain, p=self.project_id, r=self.region
        )

        print(post)

        return self.authed_session.post(url=url, json=post)

    def get_dataset(self) -> dict:
        """Gets the given dataset, or an empty dict if it doesn't exist."""

        datasets = self.list_datasets().json()

        if 'datasets' in datasets:
            name = '/{}'.format(self.dataset)
            for dataset in datasets['datasets']:
                if dataset['name'].endswith(name):
                    return dataset

        return {}

    def delete_dataset(self):
        """Deletes the dataset."""

        url = '{d}/projects/{p}/locations/{r}/datasets/{s}'.format(
            d=self.domain, p=self.project_id, r=self.region, s=self.dataset
        )

        response = self.authed_session.delete(url)
        return response

    def build_query(
        self, metric: str, bucket: int, history: int = 999, forecast: int = 24
    ):
        """Builds query from the given parameters."""

        post = {
            'detectionTime': '',
            'slicingParams': {
                'dimensionNames': ['ngram'],
            },
            'timeseriesParams': {
                'forecastHistory': '{}s'.format(bucket * history),
                'granularity': '{}s'.format(bucket),
                'metric': metric,
            },
            'forecastParams': {
                'horizonDuration': '{}s'.format(bucket * forecast),
                'noiseThreshold': 10,
            },
            'returnTimeseries': True,
            'numReturnedSlices': 100,
        }

        return post

    def query(self,
              metric: str,
              timestamp: int,
              bucket: int,
              history: int = 999,
              forecast: int = 24):
        """Queries anomaly request."""

        url = '{d}/projects/{p}/locations/{r}/datasets/{s}:query'.format(
            d=self.domain, p=self.project_id, r=self.region, s=self.dataset
        )

        # Format in UTC so that the trailing 'Z' is accurate.
        t = datetime.utcfromtimestamp(timestamp - bucket).strftime(
            '%Y-%m-%dT%H:%M:%SZ'
        )

        post = self.build_query(metric, bucket, history, forecast)
        post['detectionTime'] = t

        response = self.authed_session.post(url=url, json=post)
        return response

    def beam_query(self,
                   timerange,
                   metric: str,
                   bucket: int,
                   history: int = 999,
                   forecast: int = 24):
        """Queries anomaly request using beam."""

        post = self.build_query(metric, bucket, history, forecast)

        response = timerange | 'query' >> beam.ParDo(
            QueryFn(domain=self.domain,
                    region=self.region,
                    dataset=self.dataset),
            bucket,
            post,
        )

        return response

    def evaluate(self,
                 slice: str,
                 timestamp: int,
                 bucket: int,
                 metric: str,
                 history: int = 180,
                 forecast: int = 10):
        """Sends evaluate request."""

        # Parse 'name=value,name=value' into a {name: value} mapping.
        dim = dict(x.split('=', 1) for x in slice.split(','))
        dim = list(map(lambda x: {'name': x, 'stringVal': dim[x]}, dim))

        url = (
            '{d}/projects/{p}/locations/{r}/datasets/{s}' + ':evaluateSlice'
        ).format(d=self.domain,
                 p=self.project_id,
                 r=self.region,
                 s=self.dataset)

        # Format in UTC so that the trailing 'Z' is accurate.
        t = datetime.utcfromtimestamp(timestamp - bucket).strftime(
            '%Y-%m-%dT%H:%M:%SZ'
        )

        post = {
            'detectionTime': t,
            'pinnedDimensions': dim,
            'timeseriesParams': {
                'forecastHistory': '{}s'.format(bucket * history),
                'granularity': '{}s'.format(bucket),
                'metric': metric,
            },
            'forecastParams': {'horizonDuration': '{}s'.format(
                bucket * forecast)},
            'omitTimeseries': False,
        }

        print(post)

        response = self.authed_session.post(url=url, json=post)
        return response

    def beam_write_response(self, response: beam.PCollection,
                            output_prefix: str):
        """Writes the response to dated file."""

        filepath = response | 'write' >> beam.ParDo(
            WriteResponseFn(output_prefix))

        return filepath
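
To make the duration arithmetic in `build_query` concrete, the following standalone sketch rebuilds only the duration strings of the request. The helper name `duration_fields` is hypothetical; the field names are the ones used in the class above.

```python
def duration_fields(bucket: int, history: int = 999, forecast: int = 24) -> dict:
    """Returns the duration strings derived from the bucket size.

    Hypothetical helper that mirrors the arithmetic in build_query.
    """
    return {
        # Total history window: one bucket per history point.
        'forecastHistory': '{}s'.format(bucket * history),
        # Width of a single timeseries point.
        'granularity': '{}s'.format(bucket),
        # How far ahead the forecast extends.
        'horizonDuration': '{}s'.format(bucket * forecast),
    }


fields = duration_fields(bucket=3600, history=1000)
print(fields)
```

Every duration scales with the bucket size, so moving from an hourly to a daily bucket widens the history and forecast windows by the same factor unless `history` is reduced.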

This is a library to write template metadata.

In [41]:
def write_metadata(filename: str):
    """Writes pipeline template metadata."""

    storage_client = storage.Client()
    bucket = storage_client.bucket(BUCKET_URI[5:])
    blob = bucket.blob('templates/' + filename)

    with blob.open('w') as f:
        f.write('{}\n')

    print(
        (
            'Visit https://console.cloud.google.com/storage/browser/{b}' +
            '/templates?project={p} to check the pipeline ' +
            'template.'
        ).format(b=BUCKET_URI[5:], p=PROJECT_ID)
    )

This is a library to create or delete a data pipeline.

In [42]:
def create_pipeline_auth_session():
    scope = 'https://www.googleapis.com/auth/cloud-platform'
    domain = 'https://datapipelines.googleapis.com/v1'

    path = 'projects/{p}/locations/{r}/pipelines'.format(p=PROJECT_ID, r=REGION)
    url = '{d}/{p}'.format(d=domain, p=path)

    credentials, _ = google.auth.default(scope)
    authed_session = google.auth.transport.requests.AuthorizedSession(
        credentials
    )

    return path, url, authed_session


def create_pipeline(name: str, template_name: str, schedule: str):
    path, url, authed_session = create_pipeline_auth_session()

    pipeline = {
        'name': path + '/' + name,
        'displayName': name,
        'type': 'PIPELINE_TYPE_BATCH',
        'state': 'STATE_ACTIVE',
        'workload': {
            'dataflowLaunchTemplateRequest': {
                'projectId': PROJECT_ID,
                'gcsPath': BUCKET_URI + '/templates/' + template_name,
                'launchParameters': {
                    'jobName': name,
                    'environment': {
                        'tempLocation': BUCKET_URI + '/temp/',
                    }
                },
                'location': REGION
            }
        },
        'scheduleInfo': {
            'schedule': schedule,
            'timeZone': 'America/Los_Angeles',
        },
    }
    response = authed_session.post(url=url, json=pipeline)
    return response


def delete_pipeline(name: str):
    _, url, authed_session = create_pipeline_auth_session()
    response = authed_session.delete(url=(url + '/' + name))
    return response


def list_pipelines():
    _, url, authed_session = create_pipeline_auth_session()
    response = authed_session.get(url=url)
    return response

This is a library to build an events JSON file.

In [43]:
class BuildEventsOptions(pipeline_options.PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument(
            '--duration',
            default='0',
            help='Duration seconds of the events.'
        )
        parser.add_argument(
            '--output',
            default='gs://',
            help='Output filepath for events json files. Use @N for num shards'
        )
        parser.add_argument(
            '--combine',
            default='false',
            help='Set true to combine properties into single event.'
        )


def build_event_run(options: pipeline_options.PipelineOptions):
    """Pipeline to build timeseries dataset."""

    events_options = options.view_as(BuildEventsOptions)
    timestamp_options = options.view_as(TimestampOptions)
    timestamp_manager = TimestampManager(timestamp_options)

    output = events_options.output
    duration = int(events_options.duration)
    combine = False
    if events_options.combine == 'true':
        combine = True
    timestamp = timestamp_manager.get_now_timestamp()

    pcs = output.split('@')
    num_shards = 1
    events_filepath = output
    if len(pcs) > 1:
        events_filepath = pcs[0]
        num_shards = int(pcs[1])

    pipeline = beam.Pipeline(options=options)

    gdelt_client = GdeltClient()

    now = int(datetime.now().timestamp())
    configs = pipeline | 'init' >> beam.Create(
        [(now, duration, -1, 'iatv_1gramsv2'),
         (now, duration, -1, 'iatv_2gramsv2')])

    ngrams = gdelt_client.read_ngrams_from_configs(configs)
    if combine:
        events = gdelt_client.build_combined_events(ngrams)
    else:
        events = gdelt_client.build_events(ngrams)
    gdelt_client.write_events(events, events_filepath, num_shards)

    result = pipeline.run()
    result.wait_until_finish()

    # Writes timestamp metadata.
    timestamp_manager.write_timestamp(timestamp)


def build_event_main(args: List[str]):
    """Main function to parse the arg and run the build event pipeline."""
    logging.getLogger().setLevel(logging.INFO)

    print('\n'.join(args))

    parser = argparse.ArgumentParser()
    _, beam_args = parser.parse_known_args(args)
    options = pipeline_options.PipelineOptions(beam_args)

    standard_options = options.view_as(pipeline_options.StandardOptions)
    standard_options.runner = 'DataflowRunner'

    build_event_run(options=options)
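
The `--output` flag accepts an optional `@N` suffix for the number of shards. A minimal sketch of that parsing, mirroring the logic in `build_event_run`, is shown here. The helper name `split_output` and the bucket name `my-bucket` are placeholders for illustration.

```python
from typing import Tuple


def split_output(output: str) -> Tuple[str, int]:
    """Splits 'prefix@N' into (prefix, N); defaults to a single shard.

    Hypothetical helper that mirrors the parsing in build_event_run.
    """
    pcs = output.split('@')
    if len(pcs) > 1:
        return pcs[0], int(pcs[1])
    return output, 1


print(split_output('gs://my-bucket/output/demo-events@10'))
print(split_output('gs://my-bucket/output/demo-events'))
```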

This is a library to make Timeseries Insights API calls.

In [44]:
class TimeseriesOptions(pipeline_options.PipelineOptions):

    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument(
            '--input',
            default='gs://',
            help='Input gcp storage path for events input.',
        )
        parser.add_argument(
            '--dataset',
            default='',
            help='Dataset name used for timeseries insights api.',
        )
        parser.add_argument(
            '--command',
            default='',
            help=('Dataset command: list, create, delete, query, evaluate '
                  'or wait. Any other value gets the dataset.'),
        )
        parser.add_argument(
            '--slice', default='ngram=today', help='Slice name to evaluate.'
        )
        parser.add_argument(
            '--metric', default='count', help='Metric to aggregate timeseries.'
        )
        parser.add_argument(
            '--bucket', default='3600', help='Bucket size to create timeseries.'
        )
        parser.add_argument(
            '--history', default='999', help='Number of history points to use.'
        )


def dataset_run(options: TimeseriesOptions):
    """Command line utility to send timeseries requests."""

    cloud_options = options.view_as(GoogleCloudOptions)
    region = cloud_options.region

    timeseries_options = options.view_as(TimeseriesOptions)
    timestamp_options = options.view_as(TimestampOptions)

    input = timeseries_options.input
    dataset = timeseries_options.dataset
    command = timeseries_options.command
    slice = timeseries_options.slice
    metric = timeseries_options.metric
    bucket = int(timeseries_options.bucket)
    history = int(timeseries_options.history)

    timestamp = TimestampManager(timestamp_options).read_timestamp()

    timeseries_client = TimeseriesClient(region, dataset)

    print('Command:', command, ' timestamp:', timestamp)

    if command == 'create':
        response = timeseries_client.create_dataset(input)
        print(json.dumps(response.json(), indent=2))
    elif command == 'delete':
        response = timeseries_client.delete_dataset()
        print(json.dumps(response.json(), indent=2))
    elif command == 'list':
        response = timeseries_client.list_datasets()
        print(json.dumps(response.json(), indent=2))
    elif command == 'query':
        response = timeseries_client.query(
            metric=metric, timestamp=timestamp, bucket=bucket, history=history
        )
        plotter = Plotter(dataset)
        slices = dict()
        for slice in response.json()['slices']:
            if 'anomalyScore' in slice:
                ngram = slice['dimensions'][0]['stringVal']
                slices[ngram] = slice
        plotter.print_slices(slices)
    elif command == 'evaluate':
        response = timeseries_client.evaluate(
            slice=slice,
            metric=metric,
            timestamp=timestamp,
            bucket=bucket,
            history=history,
            forecast=10,
        )
        slice = response.json()
        name = slice['dimensions'][0]['stringVal']
        slices = {name: slice}

        plotter = Plotter(dataset)
        plotter.print_slices(slices)
        plotter.plot_slices(slices, slices.keys())
    elif command == 'wait':
        finished = False
        while not finished:
            response = timeseries_client.get_dataset()
            if 'state' not in response:
                print('Dataset doesn\'t exist.')
                print(json.dumps(response, indent=2))
                sys.exit(1)
            state = response['state']
            if state != 'LOADING' and state != 'PENDING':
                finished = True
                if state == 'LOADED':
                    print('Dataset loaded.')
                else:
                    print(json.dumps(response, indent=2))
                    sys.exit(1)
            else:
                time.sleep(10)
    else:
        response = timeseries_client.get_dataset()
        print(json.dumps(response, indent=2))


def dataset_main(args: List[str]):
    """Main function to parse the arg and send timeseries requests."""
    logging.getLogger().setLevel(logging.INFO)

    print('\n'.join(args))

    parser = argparse.ArgumentParser()
    _, beam_args = parser.parse_known_args(args)
    options = pipeline_options.PipelineOptions(beam_args)

    dataset_run(options=options)
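
The `query` and `evaluate` commands set `detectionTime` to one bucket before the stored timestamp. The sketch below shows that calculation with a timezone-aware `datetime`, so the `Z` suffix always refers to UTC regardless of the machine's local time zone. The helper name `detection_time` is hypothetical.

```python
from datetime import datetime, timezone


def detection_time(timestamp: int, bucket: int) -> str:
    """Formats (timestamp - bucket) as a UTC time string.

    Hypothetical helper; timestamp and bucket are both in seconds.
    """
    dt = datetime.fromtimestamp(timestamp - bucket, tz=timezone.utc)
    return dt.strftime('%Y-%m-%dT%H:%M:%SZ')


print(detection_time(timestamp=1700000000, bucket=3600))
```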

This is a library to run append events pipeline.

In [45]:
class AppendEventsOptions(pipeline_options.PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument(
            '--write', default='true', help='Updates the timestamp information.'
        )


def append_event_run(options: pipeline_options.PipelineOptions):
    """Pipeline to build timeseries dataset and append events."""

    append_options = options.view_as(AppendEventsOptions)
    events_options = options.view_as(BuildEventsOptions)
    timestamp_options = options.view_as(TimestampOptions)
    timeseries_options = options.view_as(TimeseriesOptions)
    timestamp_manager = TimestampManager(timestamp_options)

    output = events_options.output
    write = False
    if append_options.write == 'true':
        write = True
    combine = False
    if events_options.combine == 'true':
        combine = True

    dataset = timeseries_options.dataset

    cloud_options = options.view_as(GoogleCloudOptions)
    region = cloud_options.region

    gdelt_client = GdeltClient()

    p = beam.Pipeline(options=options)
    init = p | 'init' >> beam.Create(['init'])
    read_configs = timestamp_manager.read_timerange_config(init, write=write)
    ngrams = gdelt_client.read_ngrams_from_configs(read_configs)
    if combine:
        events = gdelt_client.build_combined_events(ngrams)
    else:
        events = gdelt_client.build_events(ngrams)

    timeseries_client = TimeseriesClient(region, dataset)
    appends = timeseries_client.append_events(events)
    appends | 'write' >> beam.io.WriteToText(
        file_path_prefix=output, num_shards=10
    )

    result = p.run()
    result.wait_until_finish()


def append_event_main(args: List[str]):
    """Main function to parse the arg and run the append events pipeline."""
    logging.getLogger().setLevel(logging.INFO)

    print('\n'.join(args))

    parser = argparse.ArgumentParser()
    _, beam_args = parser.parse_known_args(args)
    options = pipeline_options.PipelineOptions(beam_args)

    standard_options = options.view_as(pipeline_options.StandardOptions)
    standard_options.runner = 'DataflowRunner'

    append_event_run(options=options)

This is a library to run a Timeseries Insights query pipeline.

In [46]:
class QueryOptions(pipeline_options.PipelineOptions):

    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument(
            '--deleteme',
            default='count',
            help=(
                'Use this metric to build timeseries. '
                + 'Set empty to use 1 per group.'
            ),
        )


def query_run(options: pipeline_options.PipelineOptions):
    """Pipeline to run anomaly detection query."""

    cloud_options = options.view_as(GoogleCloudOptions)
    region = cloud_options.region

    events_options = options.view_as(BuildEventsOptions)
    timeseries_options = options.view_as(TimeseriesOptions)
    timestamp_options = options.view_as(TimestampOptions)
    timestamp_manager = TimestampManager(timestamp_options)

    dataset = timeseries_options.dataset
    output = events_options.output
    metric = timeseries_options.metric
    bucket = int(timeseries_options.bucket)
    history = int(timeseries_options.history)

    timeseries_client = TimeseriesClient(region, dataset)

    p = beam.Pipeline(options=options)
    init = p | 'init' >> beam.Create(['init'])
    read_configs = timestamp_manager.read_timerange_config(init, write=False)
    response = timeseries_client.beam_query(read_configs,
                                            metric,
                                            bucket,
                                            history)
    timeseries_client.beam_write_response(response, output)

    result = p.run()
    result.wait_until_finish()


def query_main(args: List[str]):
    """Main function to parse the arg and run the pipeline."""
    logging.getLogger().setLevel(logging.INFO)

    print('\n'.join(args))

    parser = argparse.ArgumentParser()
    _, beam_args = parser.parse_known_args(args)
    options = pipeline_options.PipelineOptions(beam_args)

    standard_options = options.view_as(pipeline_options.StandardOptions)
    standard_options.runner = 'DataflowRunner'

    query_run(options=options)

### Set up the Dataflow package

Copy the `setup.py` file to the local directory. Dataflow uses it to package the libraries that the pipelines depend on. See [Specify dependencies in Python](https://cloud.google.com/functions/docs/writing/specifying-dependencies-python).

In [None]:
! gsutil cp gs://timeseries-insights-samples/tsi-demo/v1/setup.py ./
! cat setup.py

## Create a dataset
Build the event JSON files from the historical data, then create a Timeseries Insights dataset from those files.

### Test dataset
Print sample unigram and bigram data from the GDELT dataset.

In [None]:
import apache_beam.runners.interactive.interactive_beam as ib

options = pipeline_options.PipelineOptions()
standard_options = options.view_as(pipeline_options.StandardOptions)
standard_options.runner = 'InteractiveRunner'

cloud_options = options.view_as(pipeline_options.GoogleCloudOptions)
cloud_options.project = PROJECT_ID

cloud_options.region = REGION
cloud_options.staging_location = '{}/staging'.format(BUCKET_URI)
cloud_options.temp_location = '{}/temp'.format(BUCKET_URI)

p = beam.Pipeline(options=options)

gdelt_client = GdeltClient()

now = int(datetime.now().timestamp())
read_configs = p | beam.Create([
    (int(now / 3600 - 12) * 3600, 3600, 10, 'iatv_1gramsv2'),
    (int(now / 3600 - 12) * 3600, 3600, 10, 'iatv_2gramsv2')
])

ngrams = gdelt_client.read_ngrams_from_configs(read_configs)
events = gdelt_client.build_events(ngrams)

result = p.run()
result.wait_until_finish()

ib.show(events)

### Build events

Run the cell to create the event JSON files from six months of historical data. Then check [Dataflow Jobs](https://console.cloud.google.com/dataflow/jobs) for the status of the job. A successful job graph looks like this:
<table align="left">
  <td>
    <img src="https://storage.googleapis.com/timeseries-insights-samples/tsi-demo/v1/images/build-events.png" alt="Build events" width="505px" height="832px">
  </td>
</table>

In [None]:
build_event_main(args=[
    '--project=' + PROJECT_ID,
    '--setup_file=' + os.getcwd() + '/setup.py',
    '--region=' + REGION,
    '--job_name=' + DATASET + '-build-events',
    '--staging_location=' + BUCKET_URI + '/staging',
    '--temp_location=' + BUCKET_URI + '/temp',
    '--delay=86400',
    '--duration=15552000',
    '--output=' + BUCKET_URI + '/output/' + DATASET + '-events@10',
    ('--timestamp_filepath=' + BUCKET_URI + '/output/' + DATASET +
     '-timestamp.txt')
])
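
The `--delay` and `--duration` values in the cell above are given in seconds. As a quick sanity check, this small sketch converts them to days with `timedelta`:

```python
from datetime import timedelta

# Flag values copied from the cell above, in seconds.
flags = {'--delay': 86400, '--duration': 15552000}

for name, seconds in flags.items():
    print(name, '=', timedelta(seconds=seconds))
```

The duration works out to 180 days, which is the "six months" of history mentioned above, and the delay is one day.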

In [None]:
# Prints the link to the output storage.

print(('Visit the bucket to check the output: ' +
       'https://console.cloud.google.com/storage/browser/{b}/' +
       'output?project={p}').format(
          b=BUCKET_URI[5:],
          p=PROJECT_ID))

Check the generated files in the Cloud Storage bucket. You should find event JSON files and a timestamp text file similar to those in the image. The file names may differ from the ones shown.
<table align="left">
  <td>
    <img src="https://storage.googleapis.com/timeseries-insights-samples/tsi-demo/v1/images/events.png" alt="Event json files" width="561px" height="439px">
  </td>
</table>

### Set dataset name

Set the dataset name to be used by the Timeseries Insights API.

In [None]:
# List the existing datasets to check.

dataset_main(args=[
    '--region=' + REGION,
    '--command=list'
])

In [50]:
DATASET = '[dataset_name]'  # @param {type:"string"}

In [14]:
from datetime import datetime

if (
    DATASET == ''
    or DATASET is None
    or DATASET == '[dataset_name]'
):
    DATASET = 'gdelt-' + datetime.strftime(datetime.now(), '%Y-%m-%d-%H%M%S')

print('Using dataset name: ' + DATASET)

### Create a dataset
Create a dataset from the generated event files in Cloud Storage.

In [None]:
dataset_main(args=[
    '--input=' + BUCKET_URI + '/output/' + DATASET + '-events-*',
    '--region=' + REGION,
    (
        '--timestamp_filepath=' + BUCKET_URI + '/output/' + DATASET +
        '-timestamp.txt'
    ),
    '--dataset=' + DATASET,
    '--command=create'
])

Check the status of the created dataset. Wait until the state changes from 'LOADING' to 'LOADED'.

In [None]:
dataset_main(args=[
    '--region=' + REGION,
    '--dataset=' + DATASET,
    '--command=wait'
])
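
The `wait` command polls the dataset state until it is no longer `LOADING` or `PENDING`. The sketch below shows the same polling pattern in isolation. It is not the notebook's implementation: `wait_until_loaded`, the `get_state` callable, and the `max_polls` limit are all assumptions added for illustration, and the API is replaced by a stand-in iterator.

```python
import time
from typing import Callable


def wait_until_loaded(get_state: Callable[[], str],
                      interval: float = 10.0,
                      max_polls: int = 360) -> str:
    """Polls get_state until it leaves LOADING/PENDING or max_polls is hit."""
    for _ in range(max_polls):
        state = get_state()
        if state not in ('LOADING', 'PENDING'):
            return state
        time.sleep(interval)
    raise TimeoutError(
        'Dataset was still loading after {} polls.'.format(max_polls))


# Stand-in for the API: reports PENDING, then LOADING, then LOADED.
states = iter(['PENDING', 'LOADING', 'LOADED'])
final_state = wait_until_loaded(lambda: next(states), interval=0)
print(final_state)
```

Bounding the number of polls keeps a notebook cell from hanging indefinitely if the dataset never leaves the loading state.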

### Evaluate loaded timeseries
Send an evaluate request to the API to check the loaded dataset. You can change the bucket size, history, and slice name to plot a different timeseries. You can also add the `--timestamp` flag to evaluate at a specific timestamp.

In [None]:
dataset_main(args=[
    '--region=' + REGION,
    (
        '--timestamp_filepath=' + BUCKET_URI + '/output/' + DATASET +
        '-timestamp.txt'
    ),
    '--dataset=' + DATASET,
    '--command=evaluate',
    '--history=180',
    '--bucket=' + str(3600 * 24),
    '--slice=ngram=today',
])
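
The `--slice` value is a comma-separated list of `name=value` pairs that `evaluate` converts into the request's `pinnedDimensions`. A minimal sketch of that conversion follows; the helper name `pinned_dimensions` is hypothetical.

```python
from typing import Dict, List


def pinned_dimensions(slice_spec: str) -> List[Dict[str, str]]:
    """Converts 'name=value,...' into a pinnedDimensions list.

    Hypothetical helper that mirrors the parsing in evaluate.
    """
    # Split on the first '=' only, so values may contain '=' themselves.
    pairs = dict(item.split('=', 1) for item in slice_spec.split(','))
    return [{'name': name, 'stringVal': value}
            for name, value in pairs.items()]


print(pinned_dimensions('ngram=today'))
```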

## Append real-time data
Set up a pipeline that fetches data from the GDELT BigQuery dataset and streams it to the Timeseries Insights dataset. This pipeline runs every hour.

### Build the pipeline template
Build the pipeline template that fetches the data and streams it to the dataset. The flag `--delay=43200` sets the data delay in seconds. This tutorial uses a 12-hour delay to allow for possible lag in the GDELT dataset.

In [None]:
import os

append_event_main(args=[
    '--project=' + PROJECT_ID,
    '--setup_file=' + os.getcwd() + '/setup.py',
    '--region=' + REGION,
    '--job_name=' + DATASET + '-append-events',
    '--staging_location=' + BUCKET_URI + '/staging',
    (
        '--template_location=' + BUCKET_URI + '/templates/' + DATASET +
        '-append-events'
    ),
    '--temp_location=' + BUCKET_URI + '/temp',
    '--output=' + BUCKET_URI + '/appends/responses',
    (
        '--timestamp_filepath=' + BUCKET_URI + '/output/' + DATASET +
        '-timestamp.txt'
    ),
    '--min_duration=3600',
    '--delay=43200',
    '--dataset=' + DATASET,
    '--autoscaling_algorithm=NONE',
    '--num_workers=4',
    '--number_of_worker_harness_threads=50',
    '--write=true',
])

### Set up a recurring append pipeline

Set up a pipeline that streams new event data every hour.

In [None]:
# Build the pipeline metadata file.

write_metadata(DATASET + '-append-events_metadata')


After running the cell, go to the Cloud Storage console to confirm that the template file has been generated. You should see it inside the folder `{BUCKET_URI}/templates`.

You can set up the hourly append pipeline either by running the code cell that follows or by following these steps in the Google Cloud console.

To use the console, go to the [Dataflow > Pipelines](https://console.cloud.google.com/dataflow/pipelines) menu and create a new pipeline with the following configuration:

* Pipeline name: `{DATASET}-append-events`
* Dataflow template: Custom Template
* Template path: `{BUCKET_URI}/templates/{DATASET}-append-events`
* Pipeline type: Batch
* Temporary location: `{BUCKET_URI}/temp`
* Repeat: Hourly (at minute 0)

The pipeline is then created and scheduled to run hourly.

Note: The append-events pipeline reads the data from BigQuery. The timestamp of the latest streamed data is tracked in the file `{BUCKET_URI}/output/{DATASET}-timestamp.txt`.

In [None]:
create_pipeline(name=DATASET + '-append-events',
                template_name=DATASET + '-append-events',
                schedule='0 * * * *')
response = list_pipelines()
print(json.dumps(response.json(), indent=2))
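
The `schedule` argument is a cron expression, and `0 * * * *` fires at minute 0 of every hour. The sketch below computes the next firing time for a schedule of the form `M * * * *`. It is illustrative only: `next_hourly_run` is a hypothetical helper that handles just this hourly form, not general cron syntax, and the sample time is made up.

```python
from datetime import datetime, timedelta


def next_hourly_run(now: datetime, minute: int) -> datetime:
    """Returns the next time an 'M * * * *' cron schedule fires after now."""
    candidate = now.replace(minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # This hour's slot has already passed, so use the next hour.
        candidate += timedelta(hours=1)
    return candidate


now = datetime(2023, 7, 4, 9, 12)
for minute in (0, 10, 15):
    print(minute, next_hourly_run(now, minute))
```

Offsetting the minute field between pipelines staggers their runs within the hour.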

### Evaluate the real-time timeseries
Send an evaluate timeseries request to check the real-time data. You can change the n-gram value of the `--slice` flag to evaluate a different timeseries.

In [None]:
dataset_main(args=[
    '--region=' + REGION,
    (
        '--timestamp_filepath=' + BUCKET_URI + '/output/' + DATASET +
        '-timestamp.txt'
    ),
    '--dataset=' + DATASET,
    '--bucket=3600',
    '--history=1000',
    '--metric=count',
    '--command=evaluate',
    '--slice=ngram=today',
])

## Query dataset
Set up multiple query pipelines to detect anomalies with different bucket sizes. The anomaly results are stored in the Cloud Storage bucket.

First, test the query API.

In [None]:
dataset_main(args=[
    '--region=' + REGION,
    (
        '--timestamp_filepath=' + BUCKET_URI + '/output/' + DATASET +
        '-timestamp.txt'
    ),
    '--dataset=' + DATASET,
    '--bucket=3600',
    '--history=1000',
    '--metric=count',
    '--command=query'
])

### Build a query with a 1-hour time bucket
Set up an hourly pipeline that sends an anomaly detection request with a 1-hour time bucket. The detection time is read from the `timestamp.txt` file, which is updated as new events are appended.
The query results are stored in the Cloud Storage path set by the `--output` flag.

In [None]:
write_metadata(DATASET + '-query01_metadata')

In [None]:
query_main(args=[
    '--project=' + PROJECT_ID,
    '--setup_file=' + os.getcwd() + '/setup.py',
    '--region=' + REGION,
    '--job_name=' + DATASET + '-query01',
    '--staging_location=' + BUCKET_URI + '/staging',
    '--template_location=' + BUCKET_URI + '/templates/' + DATASET + '-query01',
    '--temp_location=' + BUCKET_URI + '/temp',
    '--output=' + BUCKET_URI + '/' + DATASET + '-query01',
    (
        '--timestamp_filepath=' + BUCKET_URI + '/output/' + DATASET +
        '-timestamp.txt'
    ),
    '--dataset=' + DATASET,
    '--bucket=3600',
    '--history=1000',
    '--metric=count',
])

In [None]:
create_pipeline(name=DATASET + '-query01',
                template_name=DATASET + '-query01',
                schedule='10 * * * *')
response = list_pipelines()
print(json.dumps(response.json(), indent=2))
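
Each run of the query pipeline writes its response under the `--output` prefix, at a path that encodes the UTC date and hour of the queried timestamp. The sketch below reproduces the format string used in `WriteResponseFn`. The helper name `dated_path` and the bucket name `my-bucket` are placeholders.

```python
from datetime import datetime, timezone


def dated_path(prefix: str, timestamp: int) -> str:
    """Builds '<prefix>/<year>/<month>/<day>/<hour>.json' in UTC.

    Hypothetical helper; month, day and hour are not zero-padded,
    matching the format string in WriteResponseFn.
    """
    dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return '{p}/{y}/{m}/{d}/{h}.json'.format(
        p=prefix, y=dt.year, m=dt.month, d=dt.day, h=dt.hour
    )


print(dated_path('gs://my-bucket/demo-query01', 1700000000))
```

Because the path has hourly resolution, two runs for the same hour write to the same object.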

### Build a query with a 24-hour time bucket
Set up the same pipeline, but with a 24-hour time bucket.

In [None]:
write_metadata(DATASET + '-query24_metadata')

In [None]:
query_main(args=[
    '--project=' + PROJECT_ID,
    '--setup_file=' + os.getcwd() + '/setup.py',
    '--region=' + REGION,
    '--job_name=' + DATASET + '-query24',
    '--staging_location=' + BUCKET_URI + '/staging',
    '--template_location=' + BUCKET_URI + '/templates/' + DATASET + '-query24',
    '--temp_location=' + BUCKET_URI + '/temp',
    '--output=' + BUCKET_URI + '/' + DATASET + '-query24',
    (
        '--timestamp_filepath=' + BUCKET_URI + '/output/' + DATASET +
        '-timestamp.txt'
    ),
    '--dataset=' + DATASET,
    '--bucket=86400',
    '--history=180',
    '--metric=count',
])

In [None]:
create_pipeline(name=DATASET + '-query24',
                template_name=DATASET + '-query24',
                schedule='15 * * * *')
response = list_pipelines()
print(json.dumps(response.json(), indent=2))

## Display a query result
Display the anomalies detected in the 24 most recent query results. Results appear only after the query pipelines have completed several successful runs. You can check the pipeline runs in [Dataflow Jobs](https://console.cloud.google.com/dataflow/jobs).

In [55]:
plotter = Plotter(DATASET)

In [None]:
slices01 = plotter.read_queries(
    output_prefix=BUCKET_URI + '/' + DATASET + '-query01', num=24,
    min_score=1.0)

plotter.print_slices(slices01)
plotter.plot_slices(slices01, slices01.keys())

In [None]:
slices24 = plotter.read_queries(
    output_prefix=BUCKET_URI + '/' + DATASET + '-query24', num=24,
    min_score=1.0)

plotter.print_slices(slices24)
plotter.plot_slices(slices24, slices24.keys())
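
The `Plotter.read_queries` implementation is not repeated here, so the following is only a sketch of how slices might be filtered by anomaly score. It assumes `min_score` is an inclusive lower bound on `anomalyScore`. The helper name `top_anomalies` and the sample response are made up for illustration; the `slices`, `dimensions`, `stringVal` and `anomalyScore` keys are the ones read by the `query` command earlier.

```python
def top_anomalies(response: dict, min_score: float = 1.0) -> dict:
    """Maps n-gram to slice for slices scoring at least min_score."""
    slices = {}
    for item in response.get('slices', []):
        # Slices without an anomalyScore are treated as score 0.
        if item.get('anomalyScore', 0.0) >= min_score:
            ngram = item['dimensions'][0]['stringVal']
            slices[ngram] = item
    return slices


# Made-up response for illustration; real responses carry more fields.
sample = {
    'slices': [
        {'dimensions': [{'name': 'ngram', 'stringVal': 'today'}],
         'anomalyScore': 2.5},
        {'dimensions': [{'name': 'ngram', 'stringVal': 'news'}],
         'anomalyScore': 0.3},
        {'dimensions': [{'name': 'ngram', 'stringVal': 'sports'}]},
    ]
}
print(sorted(top_anomalies(sample)))
```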

## Clean up

To clean up all of the Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) that you used for the tutorial.

Otherwise, you can delete the individual resources you created.


### Stop all running pipelines

Check the existing pipelines in the [pipeline console](https://console.cloud.google.com/dataflow/pipelines), then delete the pipelines you created using the following commands:

In [None]:
# Delete the Cloud Pipelines that were created
delete_pipeline(DATASET + '-query01')
delete_pipeline(DATASET + '-query24')
delete_pipeline(DATASET + '-append-events')

### Delete the GCS bucket

In [None]:
# Delete Cloud Storage objects that were created
! gsutil -m rm -r $BUCKET_URI

### Delete the dataset

In [None]:
# Delete the Timeseries Insights dataset that was created
dataset_main(args=[
    '--region=' + REGION,
    '--dataset=' + DATASET,
    '--command=delete'
])