

In this lab, you own a fleet of New York City taxi cabs and want to
monitor how well your business is doing in real time. You will build a
streaming data pipeline that captures taxi revenue, passenger count, ride
status, and more, and visualize the results in a management
dashboard.

![structure](streamingdata.png)

### Create the dataset and a partitioned table with a defined schema using Cloud Shell
``` shell
# create the taxirides dataset.
bq mk taxirides
# create the taxirides.realtime table
bq mk \
--time_partitioning_field timestamp \
--schema ride_id:string,point_idx:integer,latitude:float,longitude:float,\
timestamp:timestamp,meter_reading:float,meter_increment:float,ride_status:string,\
passenger_count:integer -t taxirides.realtime
```

### Create a Cloud Storage bucket
Create a Cloud Storage bucket with the following settings: <br>
• Name: paste in your GCP Project ID <br>
• Location type: click Multi-region <br>
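
The same bucket can also be created from Cloud Shell instead of the console. The following is a minimal sketch, assuming the bucket is named after the project ID and placed in the `US` multi-region. `my-project-id` is a hypothetical placeholder, and the `run` helper is not part of the lab: it only prints the command unless `DRY_RUN=0` is set.

```shell
# Hypothetical placeholder: replace with your own GCP project ID.
PROJECT_ID="my-project-id"
# The lab names the bucket after the project ID.
BUCKET="gs://${PROJECT_ID}"

# Helper (an assumption, not part of the lab): print the command by default,
# and execute it only when DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# -l US requests the US multi-region location.
run gsutil mb -l US "${BUCKET}"
```

Review the printed command first, then rerun with `DRY_RUN=0` to create the bucket.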


### Set up a Dataflow Pipeline

1. Enter `streaming-taxi-pipeline` as the Job name for your Dataflow job.
2. Under Dataflow template, select the Pub/Sub Topic to BigQuery template.
3. Under Input Pub/Sub topic, enter `projects/pubsub-public-data/topics/taxirides-realtime`.
4. Under BigQuery output table, enter `<myprojectid>:taxirides.realtime`.
5. Under Temporary location, enter `gs://<mybucket>/tmp/`.
6. Click Show Optional Parameters and enter the following values: <br>
   • Max workers: 2 <br>
   • Number of workers: 2
7. Click the RUN JOB button.
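
The console steps above can be approximated from Cloud Shell with `gcloud dataflow jobs run`. This is a sketch under stated assumptions, not the lab's own method: `my-project-id` is a hypothetical placeholder, the region `us-central1` is a guess (the steps above do not name one), the template path is assumed to be the Google-provided classic template `PubSub_to_BigQuery`, and the console's Temporary location is mapped to `--staging-location`. The `run` helper only prints the command unless `DRY_RUN=0` is set.

```shell
# Hypothetical placeholders and assumptions; adjust to your environment.
PROJECT_ID="my-project-id"
BUCKET="gs://${PROJECT_ID}"      # assumes the bucket is named after the project
REGION="us-central1"             # assumption: the lab text does not name a region
# Assumed path of the Google-provided Pub/Sub Topic to BigQuery template.
TEMPLATE="gs://dataflow-templates/latest/PubSub_to_BigQuery"
# Template parameters: public taxi topic in, taxirides.realtime table out.
PARAMS="inputTopic=projects/pubsub-public-data/topics/taxirides-realtime,outputTableSpec=${PROJECT_ID}:taxirides.realtime"

# Helper (not part of the lab): print by default, execute when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Launch the streaming job with two workers, as in step 6.
run gcloud dataflow jobs run streaming-taxi-pipeline \
  --gcs-location "${TEMPLATE}" \
  --region "${REGION}" \
  --staging-location "${BUCKET}/tmp/" \
  --max-workers 2 \
  --num-workers 2 \
  --parameters "${PARAMS}"
```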

``` sql
-- explore data
WITH streaming_data AS (
  SELECT
    timestamp,
    TIMESTAMP_TRUNC(timestamp, HOUR, 'UTC') AS hour,
    TIMESTAMP_TRUNC(timestamp, MINUTE, 'UTC') AS minute,
    TIMESTAMP_TRUNC(timestamp, SECOND, 'UTC') AS second,
    ride_id,
    latitude,
    longitude,
    meter_reading,
    ride_status,
    passenger_count
  FROM
    taxirides.realtime
  WHERE ride_status = 'dropoff'
  ORDER BY timestamp DESC
  LIMIT 1000)
-- calculate aggregations on stream for reporting:
SELECT
  ROW_NUMBER() OVER() AS dashboard_sort,
  minute,
  COUNT(DISTINCT ride_id) AS total_rides,
  SUM(meter_reading) AS total_revenue,
  SUM(passenger_count) AS total_passengers
FROM streaming_data
GROUP BY minute, timestamp
```


### Stop the Dataflow job

1. Navigate back to Dataflow.
2. Click the streaming-taxi-pipeline or the new job name.
3. Click STOP and select Cancel > STOP JOB.
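
The job can also be cancelled from Cloud Shell. The sketch below assumes the job runs in `us-central1` (the steps above do not name a region) and uses `your-job-id` as a hypothetical placeholder for the ID reported by the list command. The `run` helper is not part of the lab and only prints the commands unless `DRY_RUN=0` is set.

```shell
REGION="us-central1"    # assumption: region the job was launched in
JOB_ID="your-job-id"    # hypothetical placeholder: copy the ID from the list output

# Helper (not part of the lab): print by default, execute when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# List active jobs to find the ID of streaming-taxi-pipeline.
run gcloud dataflow jobs list --region "${REGION}" --status active

# Cancel the streaming job by ID.
run gcloud dataflow jobs cancel "${JOB_ID}" --region "${REGION}"
```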

### Google Data Studio
Specify the following settings: <br>
• Chart type: Combo chart <br>
• Date range Dimension: dashboard_sort <br>
• Dimension: dashboard_sort <br>
• Drill Down: dashboard_sort (make sure that the Drill down option is turned ON) <br>
• Metric: SUM() total_rides, SUM() total_passengers, SUM() total_revenue <br>
• Sort: dashboard_sort, Ascending (latest rides first) <br>

![dashboard_1](dashboard_1.png)

### Data source (custom query)
1. Select the data source and go to more options.
2. Under CUSTOM QUERY, click qwiklabs-gcp-xxxxxxx > Enter Custom Query, and add the following query.
``` sql
SELECT
*
FROM
taxirides.realtime
WHERE
ride_status='dropoff'
```
3. Add a time series chart and change the type of the timestamp field to Date & Time > Date Hour Minute (YYYYMMDDhhmm).
4. In the Data panel on the right, change the following: <br>
• Dimension: timestamp <br>
• Metric: meter_reading (SUM)

![dashboard_2](dashboard_2.png)