In [149]:
from random import normalvariate, random
from itertools import count, groupby, islice
from datetime import date
import math

# 1. `read_fake_data`
Let's explore how the `read_fake_data` function works.
* Calling it returns a generator.
* The generator is advanced by calling `__next__()` on it (equivalently, the built-in `next()`).
* Another way to advance it repeatedly is via `islice`, which behaves like a `for` loop that calls `__next__()` on the generator a fixed number of times.

In [396]:
def read_fake_data(filename, debug=False):
    """
    This function returns a generator (filename is ignored; the data is random).
    Example usage:
       gen = read_fake_data("test_file_name")
       a = gen.__next__()
       => a = (day, value), e.g. (datetime.date(1969, 12, 31), 0.98234)

    Note: docs on count -> https://docs.python.org/3/library/itertools.html#itertools.count
    count() is essentially a wrapper around a basic while True loop that yields 0, 1, 2, ...
    """
    for i in count():
        if debug:
            print(f"Yielding data point #{i} from read_fake_data generator.")
        sigma = random() * 10
        day = date.fromtimestamp(i)
        value = normalvariate(0, sigma)
        yield (day, value)

In [129]:
data = read_fake_data("test")

In [130]:
data_point_1 = data.__next__()
print(data_point_1)

(datetime.date(1969, 12, 31), 0.15449132434539767)


In [131]:
data_point_2 = data.__next__()
print(data_point_2)

(datetime.date(1969, 12, 31), 2.7335094042676014)


In [136]:
data_points_3_through_7 = islice(data, 5)
for point in list(data_points_3_through_7):
    print(point)

(datetime.date(1969, 12, 31), 0.013996346667972032)
(datetime.date(1969, 12, 31), -0.8741304471887233)
(datetime.date(1969, 12, 31), -1.0111260338016816)
(datetime.date(1969, 12, 31), -0.19077098115058688)
(datetime.date(1969, 12, 31), 2.6459114426509207)


In [138]:
for i in range(0, 5):
    print(data.__next__())

(datetime.date(1969, 12, 31), 1.7800160686913784)
(datetime.date(1969, 12, 31), 11.044291525815439)
(datetime.date(1969, 12, 31), 1.148786215637162)
(datetime.date(1969, 12, 31), -21.859771492126264)
(datetime.date(1969, 12, 31), 3.51119802571488)
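
The three ways of advancing a generator used above can be compared side by side. This is a minimal sketch; `squares` is a hypothetical stand-in for `read_fake_data` (it is not part of the original code), chosen so the values are predictable:

```python
from itertools import count, islice


def squares():
    """Hypothetical infinite generator, standing in for read_fake_data."""
    for i in count():
        yield i * i


gen = squares()

first = next(gen)                   # built-in next(); calls gen.__next__()
second = gen.__next__()             # the explicit form used in this notebook
next_three = list(islice(gen, 3))   # pulls the following three values

print(first, second, next_three)    # 0 1 [4, 9, 16]
```

All three share the same generator state, so each call picks up where the previous one stopped.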


# 2. `day_grouper`

In [198]:
def day_grouper(iterable):
    """
    The key lambda takes a data point of the form (day, value) and returns the day.
    Returns an itertools.groupby object, which is an iterator (it has a __next__ method).

    Note: date.fromtimestamp() uses the local time zone. On the machine that produced these
    outputs it returns (1969, 12, 31) for any value in range [0, 25199].
    Hence, the first 25200 pieces of read fake data will be grouped together for day (1969, 12, 31)
    """
    key = lambda timestamp_value: timestamp_value[0]
    return groupby(iterable, key)
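
The `[0, 25199]` range in the docstring depends on the local time zone, because `date.fromtimestamp()` converts using local time. Since 25200 seconds is exactly 7 hours, the outputs in this notebook are consistent with a machine set to UTC-7; that offset is an inference, not something the notebook states. A small sketch that makes the boundary explicit with a fixed offset:

```python
from datetime import datetime, timedelta, timezone

# Assumed offset: UTC-7, inferred from 25200 s = 7 h.
utc_minus_7 = timezone(timedelta(hours=-7))

epoch_in_utc = datetime.fromtimestamp(0, tz=timezone.utc).date()
last_second_of_1969 = datetime.fromtimestamp(25199, tz=utc_minus_7).date()
first_second_of_1970 = datetime.fromtimestamp(25200, tz=utc_minus_7).date()

print(epoch_in_utc)          # 1970-01-01
print(last_second_of_1969)   # 1969-12-31
print(first_second_of_1970)  # 1970-01-01
```

On a machine in a different time zone, the first group produced by `day_grouper` will have a different size.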

Let's just look at the standard `itertools.groupby` to start (note: `groupby` only groups consecutive items, so the contents must be presorted by the grouping key):

In [203]:
iterable = ["a", "a", "a", "a", "b", "b", "b", "b", "c", "c"]
grouped = groupby(iterable)

for group_name, group in grouped:
    print("Group Name:", group_name, "  Group contents:", list(group))

Group Name: a   Group contents: ['a', 'a', 'a', 'a']
Group Name: b   Group contents: ['b', 'b', 'b', 'b']
Group Name: c   Group contents: ['c', 'c']
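
To see why presorting matters, here is a short sketch with a deliberately unsorted list (the list itself is made up for illustration). `groupby` starts a new group every time the key changes, so the same key can appear more than once:

```python
from itertools import groupby

unsorted_items = ["a", "b", "a", "a", "b"]

# Without sorting, each run of equal items becomes its own group.
runs = [(key, list(group)) for key, group in groupby(unsorted_items)]
print(runs)
# [('a', ['a']), ('b', ['b']), ('a', ['a', 'a']), ('b', ['b'])]

# After sorting, each key appears exactly once.
merged = [(key, list(group)) for key, group in groupby(sorted(unsorted_items))]
print(merged)
# [('a', ['a', 'a', 'a']), ('b', ['b', 'b'])]
```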


Above, `groupby` takes in an iterable that is simply a precomputed list. Could it take in a generator? Yes! Let's look at a generator that yields a very similar sequence (four `a`s, four `b`s, and three `c`s, one more `c` than the list above):

In [337]:
def simple_generator(debug=True):
    i = 0
    while True:
        if debug:
            print("Generator hit and yielding a value")
        if i < 4:
            yield "a"
        elif i < 8:
            yield "b"
        elif i < 11:
            yield "c"
        else: 
            break
        i += 1

In [338]:
iterable = simple_generator()

In [339]:
first_iterable_value = iterable.__next__()
print(first_iterable_value)

Generator hit and yielding a value
a


In [340]:
iterable = simple_generator()
print(
    list(islice(iterable, 10))
)

Generator hit and yielding a value
Generator hit and yielding a value
Generator hit and yielding a value
Generator hit and yielding a value
Generator hit and yielding a value
Generator hit and yielding a value
Generator hit and yielding a value
Generator hit and yielding a value
Generator hit and yielding a value
Generator hit and yielding a value
['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c']


Can `groupby` handle a generator as input (as opposed to a precomputed list)? The answer is yes! We can see this in a variety of ways. To start, note that iterating over a `groupby` yields pairs of two items: the group `key` and an iterator over that group's values. This means that `groupby` does _not_ precompute the groups and their contents! Each group must be advanced via `__next__()`. This is great since it gives us more control and avoids loading the generator's entire contents into memory unless we want to. Let's take a look at how it can be accessed:

In [359]:
iterable = simple_generator(debug=True)
grouped = groupby(iterable)

In [360]:
for group_name, group in grouped:
    print("Group Name:", group_name)
    while True:
        try:
            print(f"Group {group_name} value: ", group.__next__())
        except StopIteration:
            print("\n")
            break

Generator hit and yielding a value
Group Name: a
Group a value:  a
Generator hit and yielding a value
Group a value:  a
Generator hit and yielding a value
Group a value:  a
Generator hit and yielding a value
Group a value:  a
Generator hit and yielding a value


Group Name: b
Group b value:  b
Generator hit and yielding a value
Group b value:  b
Generator hit and yielding a value
Group b value:  b
Generator hit and yielding a value
Group b value:  b
Generator hit and yielding a value


Group Name: c
Group c value:  c
Generator hit and yielding a value
Group c value:  c
Generator hit and yielding a value
Group c value:  c
Generator hit and yielding a value




We can also grab _all_ of the contents by advancing the iterator entirely to the end via `list`. This looks like:

In [361]:
iterable = simple_generator(debug=True)
grouped = groupby(iterable)

for group_name, group in grouped:
    print("Group Name:", group_name, "  Group contents:", list(group))

Generator hit and yielding a value
Generator hit and yielding a value
Generator hit and yielding a value
Generator hit and yielding a value
Generator hit and yielding a value
Group Name: a   Group contents: ['a', 'a', 'a', 'a']
Generator hit and yielding a value
Generator hit and yielding a value
Generator hit and yielding a value
Generator hit and yielding a value
Group Name: b   Group contents: ['b', 'b', 'b', 'b']
Generator hit and yielding a value
Generator hit and yielding a value
Generator hit and yielding a value
Group Name: c   Group contents: ['c', 'c', 'c']
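
Calling `list(group)` holds one whole group in memory at a time. When only a summary per group is needed, the group iterator can be consumed without building a list at all. A minimal sketch, where `letters` is a hypothetical generator that mirrors the output above:

```python
from itertools import groupby


def letters():
    """Hypothetical generator yielding four a's, four b's, three c's."""
    yield from "aaaabbbbccc"


# Count the members of each group in a single pass, without storing them.
counts = {key: sum(1 for _ in group) for key, group in groupby(letters())}
print(counts)  # {'a': 4, 'b': 4, 'c': 3}
```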


How does this work in the context of our example? Let's take a look via the `day_grouper` function and our `read_fake_data` function:

In [369]:
data = read_fake_data("test")
data_day = day_grouper(data)

In [370]:
iter(data_day)

<itertools.groupby at 0x102d50e08>

Remember, `data` is a generator and `data_day` is an `itertools.groupby` object that takes `data` as its input iterable. In other words, `data_day` is a groupby that has taken a generator as input. A `groupby` object is itself an iterator, so calling `iter()` on it returns an object of the same type (in fact, the very same object):

In [371]:
type(iter(data_day)) == type(data_day)

True
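
The distinction between an iterator and a plain iterable can be checked directly. In this small sketch (the `numbers` list is made up for illustration), `iter()` hands back the `groupby` object itself, whereas for a list it creates a separate iterator:

```python
from itertools import groupby

numbers = [1, 1, 2]
grouped = groupby(numbers)

# An iterator returns itself from iter(); a list does not.
print(iter(grouped) is grouped)  # True
print(iter(numbers) is numbers)  # False

# Only the iterator has a __next__ method.
print(hasattr(grouped, "__next__"), hasattr(numbers, "__next__"))  # True False
```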

So, the `groupby` will be successively calling `__next__()` on the generator (an iterable), which will be yielding day/value tuple pairs:

In [372]:
group_name, group_iterable = data_day.__next__()

In [373]:
print(group_name)
print(group_iterable)

1969-12-31
<itertools._grouper object at 0x102cec400>


In [374]:
print(group_iterable.__next__())

(datetime.date(1969, 12, 31), -1.2140982362813253)


In [375]:
print(group_iterable.__next__())

(datetime.date(1969, 12, 31), -8.435549216466036)


In [376]:
print(group_iterable.__next__())

(datetime.date(1969, 12, 31), 2.5887059614629875)


In [377]:
print(group_iterable.__next__())

(datetime.date(1969, 12, 31), 8.36185095178034)


In [379]:
for pair in list(islice(group_iterable, 10)): 
    print(pair)

(datetime.date(1969, 12, 31), 12.858924260744535)
(datetime.date(1969, 12, 31), -8.102339986813885)
(datetime.date(1969, 12, 31), 1.1281637632800638)
(datetime.date(1969, 12, 31), 2.616516481219527)
(datetime.date(1969, 12, 31), 0.5825416034689577)
(datetime.date(1969, 12, 31), -5.475462044963013)
(datetime.date(1969, 12, 31), 1.734426388223523)
(datetime.date(1969, 12, 31), 2.8339230931639454)
(datetime.date(1969, 12, 31), -0.4247739211417971)
(datetime.date(1969, 12, 31), 9.731967617917219)


Above, `group_iterable` yields all of the day/value tuples for the day (`group_name`) `datetime.date(1969, 12, 31)`. By enabling `debug`, we can see that calling `group_iterable.__next__()` causes `read_fake_data` to yield a value:

In [397]:
data = read_fake_data("test", debug=True)
data_day = day_grouper(data)

In [398]:
group_name, group_iterable = data_day.__next__()

Yielding data point #0 from read_fake_data generator.


In [400]:
group_iterable.__next__()

Yielding data point #1 from read_fake_data generator.


(datetime.date(1969, 12, 31), -3.0773053343360584)

In [401]:
for pair in list(islice(group_iterable, 10)): 
    print(pair)

Yielding data point #2 from read_fake_data generator.
Yielding data point #3 from read_fake_data generator.
Yielding data point #4 from read_fake_data generator.
Yielding data point #5 from read_fake_data generator.
Yielding data point #6 from read_fake_data generator.
Yielding data point #7 from read_fake_data generator.
Yielding data point #8 from read_fake_data generator.
Yielding data point #9 from read_fake_data generator.
Yielding data point #10 from read_fake_data generator.
Yielding data point #11 from read_fake_data generator.
(datetime.date(1969, 12, 31), -2.438625884470149)
(datetime.date(1969, 12, 31), 1.4012717712493605)
(datetime.date(1969, 12, 31), 4.99099942687332)
(datetime.date(1969, 12, 31), -10.853344799696204)
(datetime.date(1969, 12, 31), -4.363478914024846)
(datetime.date(1969, 12, 31), 1.157289121329716)
(datetime.date(1969, 12, 31), 0.8902768894172846)
(datetime.date(1969, 12, 31), -2.7388773546196497)
(datetime.date(1969, 12, 31), -1.4049422890990317)
(datetim

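One of the most important things to keep in mind with `groupby` is that there are technically 2 iterators that we are dealing with.
1. `data_day` is an iterator over groups. Every time we call `data_day.__next__()`, it makes the generator yield values until the `key` no longer matches the current group, then returns the next group.
2. Each group comes with its own iterator (`group_iterable` above), which yields the values belonging to that group.

Again, advancing the outer iterator looks like: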

In [445]:
iterable = simple_generator(debug=True)
grouped = groupby(iterable)

In [446]:
a_group = grouped.__next__()
print(a_group)

Generator hit and yielding a value
('a', <itertools._grouper object at 0x102b5f1d0>)


We are now in the `a` group. `groupby` essentially allows `__next__()` to be called on `grouped`, and it will keep requesting values from the generator until the group no longer matches. At that point, it will return the new group, in this case `b`:

In [447]:
grouped.__next__()

Generator hit and yielding a value
Generator hit and yielding a value
Generator hit and yielding a value
Generator hit and yielding a value


('b', <itertools._grouper at 0x102d5f128>)

As expected, `b` is returned. If we call it again, `groupby` will continue requesting values from the generator until the `b` group is no longer matched (we expect four more values: the three remaining `b`s plus the first `c`), in this case returning the `c` group:

In [448]:
grouped.__next__()

Generator hit and yielding a value
Generator hit and yielding a value
Generator hit and yielding a value
Generator hit and yielding a value


('c', <itertools._grouper at 0x102b5f7f0>)

Calling `__next__()` a final time will consume the rest of the `c` group from the generator. We should see the message 3 times (the two remaining `c` values, plus the final request on which the generator ends). Nothing is returned at this point, not even `None`; because the generator has been exhausted, a `StopIteration` is raised instead:

In [449]:
grouped.__next__()

Generator hit and yielding a value
Generator hit and yielding a value
Generator hit and yielding a value


StopIteration: 
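
If a sentinel value is more convenient than an exception, the built-in `next()` accepts a default as its second argument. The sketch below uses a short made-up string as input; `None` appears only because it is passed explicitly as the default:

```python
from itertools import groupby

grouped = groupby("aab")
keys = []

while True:
    item = next(grouped, None)  # returns None instead of raising StopIteration
    if item is None:
        break
    key, _group = item
    keys.append(key)

print(keys)  # ['a', 'b']
```

A plain `for` loop does the same thing implicitly: it catches `StopIteration` and ends the loop.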

Note, to get a better understanding of generator _state_, let's look at the `a_group` tuple that was returned earlier: `('a', <itertools._grouper object at 0x102b5f1d0>)`. Can we call `__next__()` on the iterator that was returned, `a_group[1]`?

In [452]:
a_group[1].__next__()

StopIteration: 

No, we cannot! By this point `grouped` has been advanced past the `a` group (and `simple_generator` has been exhausted), so the `a` group's iterator has nothing left to yield. `groupby` works by pulling one value from the underlying generator/iterable and starting a new group from it. This looks like:

```
grouped = groupby(iterable)
a_group = grouped.__next__()
```

The tuple that is returned, `a_group` above, holds the group key and an iterator over that group's values. Each call to `a_group[1].__next__()` pulls the next value from the underlying generator/iterable as long as its key matches the group, and raises `StopIteration` once a value with a different key appears, which is what ends a `for` loop over the group. Note that **this is exhausting values from the generator**. In other words, the generator's state is advanced and we cannot read these values again at a later point in time. This can be seen in detail below:

In [453]:
iterable = simple_generator(debug=True)
grouped = groupby(iterable)

In [454]:
a_group = grouped.__next__()

Generator hit and yielding a value


In [456]:
a_group[1].__next__()

'a'

In [457]:
a_group[1].__next__()

Generator hit and yielding a value


'a'

In [458]:
a_group[1].__next__()

Generator hit and yielding a value


'a'

In [459]:
a_group[1].__next__()

Generator hit and yielding a value


'a'

In [460]:
a_group[1].__next__()

Generator hit and yielding a value


StopIteration: 

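`Generator hit and yielding a value` is printed five times in total. The first four prints (one during `grouped.__next__()` and three during the calls on `a_group[1]`) correspond to the four `a` values; the first `a` is handed back from `groupby`'s buffer, so that call prints nothing. On the final call the generator is hit again, but the value is `b` and our group is `a`, so a `StopIteration` error is raised.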
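
A practical consequence of this shared state is that a group iterator is only usable until the outer `groupby` is advanced. The sketch below uses a short made-up input and assumes CPython 3.7 or later, where a stale group iterator yields nothing:

```python
from itertools import groupby

letters = iter("aabbb")  # an iterator, so values can only be read once
grouped = groupby(letters)

key_a, group_a = next(grouped)  # the 'a' group
key_b, group_b = next(grouped)  # advancing skips the rest of the 'a' group

stale = list(group_a)     # the 'a' iterator is no longer usable
current = list(group_b)   # the 'b' iterator is still live
print(stale, current)     # [] ['b', 'b', 'b']

# Safe pattern: consume each group before asking for the next one.
materialized = {key: list(group) for key, group in groupby("aabbb")}
print(materialized)       # {'a': ['a', 'a'], 'b': ['b', 'b', 'b']}
```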

# 3. `check_anomaly`
How does `check_anomaly` fit into all of this? Well, let's start by passing it a single `day_data_tuple` as input:

In [461]:
def check_anomaly(day_data_tuple):
    """
    Find mean, std, and maximum values for the day. Using a single pass (online)
    mean/std algorithm allows us to only read through the day's data once.

    Note: M2 = running sum of squared deviations from the mean;
               the sample variance is M2 / (n - 1)
          day_data is an iterable, returned from groupby, and we request values via for loop
    """
    (day, day_data) = day_data_tuple

    n = 0
    mean = 0
    M2 = 0
    max_value = float("-inf")  # start below any real value so all-negative data works
    for timestamp, value in day_data:
        n += 1
        delta = value - mean
        mean = mean + (delta / n)
        M2 += delta * (value - mean)
        max_value = max(max_value, value)
    if n < 2:
        # Sample variance is undefined for fewer than two points
        return False
    variance = M2 / (n - 1)
    standard_deviation = math.sqrt(variance)

    # Check if day's data is anomalous, if True return day
    if max_value > mean + 6 * standard_deviation:
        return day
    return False
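
The single-pass update used in `check_anomaly` is commonly known as Welford's online algorithm. As a sanity check, the sketch below pulls that update out into a standalone helper and compares it against the `statistics` module. The helper name `online_mean_std` and the sample values are made up for illustration:

```python
import math
import statistics


def online_mean_std(values):
    """Return (mean, sample standard deviation) in a single pass."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for value in values:
        n += 1
        delta = value - mean
        mean += delta / n
        m2 += delta * (value - mean)
    if n < 2:
        raise ValueError("need at least two values for a sample variance")
    return mean, math.sqrt(m2 / (n - 1))


sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Passing an iterator shows that a single pass is enough.
mean, std = online_mean_std(iter(sample))

print(math.isclose(mean, statistics.mean(sample)))   # True
print(math.isclose(std, statistics.stdev(sample)))   # True
```

Because the helper never indexes or rewinds its input, it works equally well on the group iterators returned by `groupby`.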

What does `day_data_tuple` look like? It is essentially a single group: the key is the day, paired with all of its associated values. Specifically, it is a tuple, shown below as `day_1_tuple`, taken from the iterator that `day_grouper` returns. It looks like:

In [485]:
data = read_fake_data("test", debug=True)
data_day = day_grouper(data)

In [486]:
day_1_tuple = data_day.__next__()

Yielding data point #0 from read_fake_data generator.


In [489]:
day_1_tuple

(datetime.date(1969, 12, 31), <itertools._grouper at 0x102b5fb70>)

At index 1 we see that `day_1_tuple` has an iterator. This iterator will request values from our fake data generator, continually yielding them until the day no longer matches the group date (`datetime.date(1969, 12, 31)` above). These values can be accessed via a `for` loop, shown below. Note, `i` is only used to avoid printing 25000+ values:

In [488]:
i = 0
for timestamp, value in day_1_tuple[1]:
    if i > 10:
        break
    print("Day:", timestamp, "Value:", value, "\n")
    i += 1

Day: 1969-12-31 Value: -8.928451135335273 

Yielding data point #1 from read_fake_data generator.
Day: 1969-12-31 Value: -2.6842722720032413 

Yielding data point #2 from read_fake_data generator.
Day: 1969-12-31 Value: 1.988017533134936 

Yielding data point #3 from read_fake_data generator.
Day: 1969-12-31 Value: -4.318303288053731 

Yielding data point #4 from read_fake_data generator.
Day: 1969-12-31 Value: 1.2756273714792294 

Yielding data point #5 from read_fake_data generator.
Day: 1969-12-31 Value: -3.6213015089109053 

Yielding data point #6 from read_fake_data generator.
Day: 1969-12-31 Value: -5.295121946930241 

Yielding data point #7 from read_fake_data generator.
Day: 1969-12-31 Value: 4.571744935215041 

Yielding data point #8 from read_fake_data generator.
Day: 1969-12-31 Value: -0.9216798216394321 

Yielding data point #9 from read_fake_data generator.
Day: 1969-12-31 Value: -5.16559247678084 

Yielding data point #10 from read_fake_data generator.
Day: 1969-12-31 Val

A minor implementation detail: the first `timestamp` and `value` printed, `Day: 1969-12-31 Value: -8.928451135335273`, are not accompanied by `Yielding data point #0 from read_fake_data generator.`. This is because that value was already yielded from the generator (when `data_day.__next__()` was called) and stored temporarily by `groupby` in the background. Because a generator cannot yield the same value twice, `groupby` hands back its stored copy, so there is no call to the fake data generator here.
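
This buffering can be observed by recording every item the source hands out. The sketch below is an illustration on CPython; the `source` generator and the `pulled` list are hypothetical helpers, not part of the original code:

```python
from itertools import groupby

pulled = []  # records every item requested from the source


def source():
    """Hypothetical generator that logs each item it yields."""
    for item in "aab":
        pulled.append(item)
        yield item


grouped = groupby(source())
print(pulled)               # [] : creating the groupby consumes nothing

key, group = next(grouped)
print(pulled)               # ['a'] : one item was fetched to learn the key

first = next(group)
print(first, pulled)        # a ['a'] : served from groupby's buffer

second = next(group)
print(second, pulled)       # a ['a', 'a'] : now the source is hit again
```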

So, `check_anomaly` will be using a `for` loop to essentially call `__next__()` on the iterator contained in `day_data_tuple` (`day_data_tuple[1]`). We can see this in action below:

In [503]:
data = read_fake_data("test", debug=False)
data_day = day_grouper(data)

day_1_tuple = data_day.__next__()

check_anomaly(day_1_tuple)

False

In [506]:
data = read_fake_data("test", debug=False)
data_day = day_grouper(data)

day_tuple = data_day.__next__()

check_anomaly(day_tuple)

datetime.date(1969, 12, 31)

Above we have one run where an anomalous date was found and one where it was not. The two cells run identical code; the results differ only because `read_fake_data` produces random values. In both cases, we are passing in a tuple produced by `groupby`, which again has the shape: `(<group key>, <iterator which we can call __next__() on to get group contents>)`.
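
If repeatable results are wanted while experimenting, the random module can be seeded before the data is generated. This is a minimal sketch of the idea; the seed value 42 is arbitrary, and the original notebook does not seed its data:

```python
import random

random.seed(42)
first_run = [random.normalvariate(0, 1) for _ in range(3)]

random.seed(42)  # resetting the seed replays the same sequence
second_run = [random.normalvariate(0, 1) for _ in range(3)]

print(first_run == second_run)  # True
```

Since `read_fake_data` imports `random` and `normalvariate` from the same module-level generator, calling `random.seed(...)` before creating the data would make a given run reproducible.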

# 4. `filter` and `map` application