Streaming is data processing for unbounded data sets

Bounded data (batch) is a finite data set with a defined start and end.
Unbounded data (streaming) arrives continuously and has no defined start or end.
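
A rough illustration of the difference, in plain Python rather than any GCP API (the function names and the endless `sensor_readings` source are made up for this sketch): a bounded data set can be reduced in one pass, while an unbounded source can only be consumed incrementally, producing an updated result as each element arrives.

```python
import itertools

def batch_total(records):
    """Bounded input: the whole data set is available, so one pass gives the final answer."""
    return sum(records)

def running_totals(stream):
    """Unbounded input: there is no final answer, so yield an updated total per element."""
    total = 0
    for value in stream:
        total += value
        yield total

def sensor_readings():
    """Hypothetical endless source: 1, 2, 3, ..."""
    return itertools.count(1)

bounded_result = batch_total([1, 2, 3, 4])
# Only a finite prefix of the endless stream can ever be inspected.
first_four = list(itertools.islice(running_totals(sensor_readings()), 4))
print(bounded_result, first_four)
```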

Stream analytics has many applications:

**Data integration** (10 sec - 10 min)
* data warehouses become real time
* take load off source databases with change data capture (CDC)
* microservices require databases and caches

**Online decisions** (10 ms - 10 sec)
* fraud detection
* real time recommendations
* gaming events
* financial trading

Google Cloud Platform (GCP) has several services for stream analytics:
* Pub/Sub: messaging service (handles changing and variable volumes of data)
* Dataflow: stream processing engine (processing data)
* BigQuery: data warehouse (storing data)

![streamAnalytics](Media/streamAnalytics.png)


Dataflow offers the following features, which make it easier to create resilient streaming pipelines when working with unbounded data:
* Ability to flexibly reason about time
* Controls to ensure correctness

* Pub/Sub: global message queue
* Dataflow: controls to handle late-arriving and out-of-order data
* BigQuery: query data as it arrives from streaming pipelines
* Bigtable: latency on the order of milliseconds when querying against very large volumes of data


![pubsub](Media/pub_sub_1.png)

Subscriptions can filter Pub/Sub messages on their message attributes. <br>
Configure the filter via the Cloud Console, the gcloud command-line tool, or the Pub/Sub API.
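
To make the effect of attribute filtering concrete, here is a minimal plain-Python sketch. It does not call Pub/Sub (a real filter is configured on the subscription and evaluated by the service); the `matches` helper and the message dictionaries are assumptions made for illustration only.

```python
def matches(attributes, required):
    """Return True when every required key/value pair is present in the attributes."""
    return all(attributes.get(key) == value for key, value in required.items())

# Hypothetical messages, each with a payload and a dict of string attributes.
messages = [
    {"data": b"My first message!", "attributes": {"author": "bakro"}},
    {"data": b"another message", "attributes": {"author": "someone-else"}},
    {"data": b"no attributes", "attributes": {}},
]

# Keep only messages whose 'author' attribute equals 'bakro'.
delivered = [m for m in messages if matches(m["attributes"], {"author": "bakro"})]
print([m["data"] for m in delivered])
```

Messages that do not match are simply never delivered to that subscription's subscribers, which is why filtering can reduce downstream load.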


![pub_sub_patterns](Media/pub_sub_patterns.png)

![pushAndPullDelivery](Media/pushAndPullDelivery.png)

![subscribers_work_method](Media/subscribers_work_method.png)

In [None]:
# create a topic and publish a message
gcloud pubsub topics create sandiego
gcloud pubsub topics publish sandiego --message "hello"

In [None]:
import os 
from google.cloud import pubsub_v1
publisher = pubsub_v1.PublisherClient()
topic_name = 'projects/{project_id}/topics/{topic}'.format(
    project_id=os.getenv('GOOGLE_CLOUD_PROJECT'),
    topic='my-new-topic',  # Set this to something appropriate.
)
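publisher.create_topic(name=topic_name)
# extra keyword arguments (here: author) are sent as message attributes
future = publisher.publish(topic_name, b'My first message!', author='bakro')
print(future.result())  # blocks until the publish completes, then prints the message ID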

In [None]:
import os 
from google.cloud import pubsub_v1
subscriber = pubsub_v1.SubscriberClient()
subscription_name = 'projects/{project_id}/subscriptions/{sub}'.format(
    project_id=os.getenv('GOOGLE_CLOUD_PROJECT'),
    sub='my-new-subscription',  # Set this to something appropriate.
)
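# topic_name is the topic path built in the publisher cell above
subscriber.create_subscription(name=subscription_name, topic=topic_name)

# pull delivery: callback invoked for each received message
def callback(message):
    print(message.data)
    message.ack()  # acknowledge so the message is not redelivered

# subscribe() returns a future; messages are handled on background threads
future = subscriber.subscribe(subscription_name, callback=callback)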

```shell
gcloud pubsub subscriptions create mySubscription --topic sandiego mysub1
gcloud pubsub subscriptions pull --auto-ack mySubscription
```

In [None]:
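import os
import time
from google.cloud import pubsub_v1
subscriber = pubsub_v1.SubscriberClient()
project_id = os.getenv('GOOGLE_CLOUD_PROJECT')
# subscription_path() expects the short subscription ID, not the full resource name
subscription_path = subscriber.subscription_path(project_id, 'my-new-subscription')

NUM_MESSAGES = 2
ACK_DEADLINE = 30
SLEEP_TIME = 10
# synchronous pull: returns at most NUM_MESSAGES messages
response = subscriber.pull(subscription=subscription_path, max_messages=NUM_MESSAGES)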

In [None]:
from google.cloud import pubsub
from google.cloud.pubsub import types
# change batch settings
client = pubsub.PublisherClient(batch_settings=types.BatchSettings(max_messages=500))
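
The idea behind the `max_messages` batch setting can be sketched without the client library. The `BatchingBuffer` class below is a hypothetical stand-in, not the Pub/Sub implementation: it only flushes on message count, whereas the real `BatchSettings` also has `max_bytes` and `max_latency` thresholds.

```python
class BatchingBuffer:
    """Collects messages and sends them as one batch once max_messages is reached."""

    def __init__(self, max_messages, send):
        self.max_messages = max_messages
        self.send = send  # callable that receives a list of messages
        self.pending = []

    def publish(self, message):
        self.pending.append(message)
        if len(self.pending) >= self.max_messages:
            self.flush()

    def flush(self):
        """Send whatever is pending, even if the batch is not full."""
        if self.pending:
            self.send(self.pending)
            self.pending = []

sent_batches = []
buffer = BatchingBuffer(max_messages=3, send=sent_batches.append)
for i in range(7):
    buffer.publish(f"msg-{i}")
buffer.flush()  # send the final partial batch
print([len(batch) for batch in sent_batches])
```

Larger batches mean fewer requests and higher throughput, at the cost of each message waiting longer before it is sent.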

Duplication will happen: Pub/Sub may deliver the same message more than once. <br>
If messages have the same ordering key and are in the same region, you can enable message ordering. <br>
• To receive the messages in order, set the message ordering property on the subscription you receive messages from using the Cloud Console, the gcloud command-line tool, or the Pub/Sub API. <br>
• Receiving messages in order might increase latency. <br>
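
A small plain-Python sketch of what ordering keys promise: order is preserved among messages that share a key, with no guarantee across keys. The `seq` field is a hypothetical publish sequence number used only to simulate the behaviour; Pub/Sub tracks this internally and the code below does not call its API.

```python
from collections import defaultdict

def order_by_key(arrivals):
    """Group arrivals by ordering key and restore publish order within each key."""
    by_key = defaultdict(list)
    for message in arrivals:
        by_key[message["ordering_key"]].append(message)
    return {
        key: [m["data"] for m in sorted(group, key=lambda m: m["seq"])]
        for key, group in by_key.items()
    }

# Messages as they might arrive without ordering enabled.
arrivals = [
    {"ordering_key": "user-1", "seq": 2, "data": "update"},
    {"ordering_key": "user-2", "seq": 1, "data": "create"},
    {"ordering_key": "user-1", "seq": 1, "data": "create"},
    {"ordering_key": "user-1", "seq": 3, "data": "delete"},
]
ordered = order_by_key(arrivals)
print(ordered)
```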

![pubsub2](Media/pubsub_2.png)

Pub/Sub lets you configure an exponential backoff policy for better flow control. <br>
• The idea behind exponential backoff is to add progressively longer delays between retry attempts. <br>
• To create a new subscription with an exponential backoff retry policy, run the `gcloud pubsub subscriptions create` command with the `--min-retry-delay` and `--max-retry-delay` flags, or use the Cloud Console. <br>
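
The "progressively longer delays" idea can be sketched in a few lines. This is an illustration under assumptions, not Pub/Sub's actual schedule: the doubling factor is assumed, and the 10 s minimum and 600 s maximum are example bounds.

```python
def backoff_delays(attempts, min_delay=10.0, max_delay=600.0):
    """Delay before each retry: doubles every attempt, capped at max_delay."""
    return [min(max_delay, min_delay * 2 ** attempt) for attempt in range(attempts)]

delays = backoff_delays(8)
print(delays)
```

The cap matters: without it, the delay would keep doubling and a long-failing subscriber would effectively stop retrying.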

Summary <br>
Latency, out-of-order delivery, and duplication will happen. <br>
Pub/Sub with Dataflow: exactly-once, ordered processing. <br>

Pub/Sub simplifies systems by removing the need for every component to speak to every other component. <br>
Pub/Sub connects applications and services through a messaging infrastructure. <br>
Pub/Sub doesn't guarantee that messages are delivered in the order they were received. <br>

Which of the following about Pub/Sub topics and subscriptions are true? <br>
* 1 or more publisher(s) can write to the same topic
* 1 or more subscriber(s) can request from the same subscription

Which of the following delivery methods is ideal for subscribers needing close to real time performance? <br>
* Push delivery


There are challenges with processing unbounded data:
* scalability : streaming data is continuous and can be very large
* fault tolerance : maintain fault tolerance despite increasing data volumes
* Model : is it streaming or batch data?
* Timing : what if data arrives late or out of order?

How do you aggregate data in a stream? <br>
Windowing: divide the stream into finite sets of data.
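
As a minimal sketch of what "dividing the stream into finite sets" means, the plain-Python code below assigns each event to a fixed window by its timestamp and counts events per window. It is not Beam code; the helper names and sample events are made up for illustration.

```python
from collections import defaultdict

def fixed_window_start(timestamp, size):
    """Start of the fixed window [start, start + size) that contains timestamp."""
    return timestamp - (timestamp % size)

def count_per_window(events, size):
    """Count events per fixed window, keyed by window start time."""
    counts = defaultdict(int)
    for timestamp, _value in events:
        counts[fixed_window_start(timestamp, size)] += 1
    return dict(counts)

# (event time in seconds, payload)
events = [(5, "a"), (42, "b"), (61, "c"), (119, "d"), (120, "e")]
window_counts = count_per_window(events, size=60)
print(window_counts)
```

Each window is finite, so an aggregate such as a count or sum becomes well defined even though the stream itself never ends.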

![pubsub_dataflow](Media/pubsub_dataflow.png)

![dataflow2](Media/dataflow2.png)

Code to set the timestamp (event time) of an element:
```python
unix_timestamp = extract_timestamp_from_log_entry(element)
yield beam.window.TimestampedValue(element, unix_timestamp)
```
```java
c.outputWithTimestamp(element, timestamp );
```
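
The snippets above rely on a helper that pulls the event time out of each element. One possible shape for it is sketched below in plain Python; it assumes, purely for illustration, that every log entry starts with an ISO 8601 timestamp followed by a space. The real log format is not specified in these notes.

```python
from datetime import datetime

def extract_timestamp_from_log_entry(entry):
    """Hypothetical helper: parse the leading ISO 8601 timestamp into Unix seconds."""
    iso_text, _, _rest = entry.partition(" ")
    return datetime.fromisoformat(iso_text).timestamp()

def with_event_time(entries):
    """Pair each entry with its event time, as a pipeline step would before windowing."""
    for entry in entries:
        yield entry, extract_timestamp_from_log_entry(entry)

log = [
    "2024-01-01T00:00:00+00:00 GET /index.html",
    "2024-01-01T00:01:00+00:00 GET /about.html",
]
stamped = list(with_event_time(log))
print(stamped)
```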

Duplication will happen: exactly-once processing with Pub/Sub and Dataflow

```python
# Pub/Sub publisher code: attach a unique ID as a message attribute
publisher.publish(topic_name, event, myid="23tfkdjg")
```
```java
// dataflow pipeline code
p.apply(PubsubIO.readStrings().fromTopic(t).withIdAttribute("myid"));
```
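
Conceptually, deduplicating on an ID attribute works like the plain-Python sketch below. This is not how Dataflow implements it; the `deduplicate` function and the second ID value are made up, and a real pipeline would bound how long IDs are remembered instead of keeping them forever.

```python
def deduplicate(messages, id_attribute):
    """Keep the first message seen for each ID and drop later redeliveries."""
    seen = set()
    unique = []
    for message in messages:
        message_id = message["attributes"][id_attribute]
        if message_id in seen:
            continue  # already processed: this is a duplicate delivery
        seen.add(message_id)
        unique.append(message)
    return unique

received = [
    {"data": "event-1", "attributes": {"myid": "23tfkdjg"}},
    {"data": "event-1", "attributes": {"myid": "23tfkdjg"}},  # redelivered duplicate
    {"data": "event-2", "attributes": {"myid": "98zxcvbn"}},
]
unique_events = deduplicate(received, "myid")
print([m["data"] for m in unique_events])
```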

![windows](Media/windows.png)

Setting time windows
```python
# fixed time windows of 60 seconds
import apache_beam as beam
from apache_beam import window
fixed_windowed_items = (item | 'window' >> beam.WindowInto(window.FixedWindows(60)))
```

```python
# sliding time windows: 60 seconds long, a new window starts every 30 seconds
import apache_beam as beam
from apache_beam import window
sliding_windowed_items = (item | 'window' >> beam.WindowInto(window.SlidingWindows(60, 30)))
```
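
Unlike fixed windows, sliding windows overlap, so one element can belong to several windows. The plain-Python sketch below (not Beam code; `sliding_window_starts` is a made-up helper) lists the windows that contain a given timestamp for the same 60-second size and 30-second period.

```python
def sliding_window_starts(timestamp, size, period):
    """Start times of every sliding window [start, start + size) containing timestamp."""
    start = timestamp - (timestamp % period)  # latest window that has already started
    starts = []
    while start > timestamp - size:  # window must not have ended yet
        starts.append(start)
        start -= period
    return sorted(starts)

starts_at_75 = sliding_window_starts(75, size=60, period=30)
starts_at_90 = sliding_window_starts(90, size=60, period=30)
print(starts_at_75, starts_at_90)
```

With a size of 60 and a period of 30, every element lands in two windows, which is why aggregates over sliding windows count each element more than once overall.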

```python
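# session windows: a session closes after a 10-minute gap with no elements
import apache_beam as beam
from apache_beam import window
session_windowed_items = (item | 'window' >> beam.WindowInto(window.Sessions(10 * 60)))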
```
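
Session windows are data-driven: they are defined by gaps in activity, not by the clock. A simplified plain-Python sketch of the grouping rule follows; it is not Beam's implementation, which also groups per key, and the timestamps are invented.

```python
def session_windows(timestamps, gap):
    """Group timestamps into sessions; a new session starts after a pause of at least gap."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)  # still within the gap: same session
        else:
            sessions.append([ts])  # pause was long enough: new session
    return sessions

# Event times in seconds, grouped with a 10-minute (600 s) gap.
sessions = session_windows([0, 120, 500, 1500, 1700, 4000], gap=600)
print(sessions)
```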

Remember: you can apply windows to batch data, although you may need to generate the timestamp metadata on which windows operate.

![window1](Media/window1.png)

![window2](Media/window2.png)

![window3](Media/window3.png)

![watermark](Media/watermark.png)

![window_customtrigger](Media/window_customtrigger.png)

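Some example triggers:
```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import AccumulationMode, AfterProcessingTime, AfterWatermark
from apache_beam.utils.timestamp import Duration

pcollection | beam.WindowInto(
    window.SlidingWindows(60, 5),  # 60-second windows, a new one every 5 seconds
    trigger=AfterWatermark(  # main firing when the watermark passes the end of the window
        early=AfterProcessingTime(30),  # early firings: 30 s after the first element in a pane
        late=AfterProcessingTime(1)),  # late firings: 1 s after late data arrives
    accumulation_mode=AccumulationMode.ACCUMULATING,  # each firing includes earlier elements
    allowed_lateness=Duration(seconds=2 * 24 * 60 * 60))  # 2 days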
```
```python
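import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import AccumulationMode, AfterAny, AfterCount, AfterProcessingTime, Repeatedly

pcollection | beam.WindowInto(
    window.FixedWindows(60),  # 60 seconds
    trigger=Repeatedly(
        AfterAny(
            AfterCount(10),  # fire once 10 elements have arrived ...
            AfterProcessingTime(30))),  # ... or 30 s after the first element, whichever is first
    accumulation_mode=AccumulationMode.DISCARDING)  # each firing contains only new records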
```
**You can allow late data past the watermark**
```java
PCollection<String> items = ...;
PCollection<String> windowed_items = items.apply(
    Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
        .withAllowedLateness(Duration.standardMinutes(1)));
```
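
How the watermark and allowed lateness interact can be sketched as a simple classification. This is a simplified plain-Python model with made-up names, not Beam's exact boundary semantics: an element is on time while the watermark has not passed the end of its window, late but accepted within the allowed lateness, and dropped after that.

```python
def classify(event_time, watermark, window_size, allowed_lateness):
    """Classify an arriving element relative to the watermark (all values in seconds)."""
    window_end = event_time - (event_time % window_size) + window_size
    if watermark < window_end:
        return "on time"
    if watermark < window_end + allowed_lateness:
        return "late but accepted"
    return "dropped"

# One-minute fixed windows with one minute of allowed lateness.
# The element has event time 30, so its window ends at 60.
results = [classify(30, watermark, 60, 60) for watermark in (50, 90, 130)]
print(results)
```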

```python
pc = ...  # initial PCollection
pc | beam.WindowInto(window.FixedWindows(60),
    trigger=trigger_fn,
    accumulation_mode=accumulation_mode,
    timestamp_combiner=timestamp_combiner,
    allowed_lateness=Duration(seconds=2*24*60*60))  # 2 days
```

![Accumulation_modes](Media/Accumulation_modes.png)
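
The difference between the two accumulation modes shows up in what each trigger firing (pane) contains. The plain-Python sketch below uses a made-up `fire_panes` helper and a trigger that fires every two elements, chosen only for illustration; it is not Beam code.

```python
def fire_panes(elements, fire_every, accumulating):
    """Return the contents of each pane emitted when the trigger fires."""
    panes = []
    buffer = []
    for count, element in enumerate(elements, start=1):
        buffer.append(element)
        if count % fire_every == 0:
            panes.append(list(buffer))
            if not accumulating:
                buffer = []  # discarding mode: forget what was already emitted
    return panes

accumulating_panes = fire_panes([1, 2, 3, 4, 5, 6], fire_every=2, accumulating=True)
discarding_panes = fire_panes([1, 2, 3, 4, 5, 6], fire_every=2, accumulating=False)
print(accumulating_panes)
print(discarding_panes)
```

Accumulating mode suits consumers that overwrite a previous result with the latest one; discarding mode suits consumers that add up the panes themselves.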