<a href="https://colab.research.google.com/github/BBVA/mercury-reels/blob/master/notebooks/reels_walkthrough_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Walk through most of mercury-dynamics REELS functionality

<img style="float: right;" src="https://raw.githubusercontent.com/BBVA/mercury-reels/master/notebooks/images/reels_logo.jpg">

**Reels** is a **mercury** library to analyze sequences of events extracted from transactional data. These events can be automatically discovered or manually defined. Reels identifies events by assigning them **event codes** and creates **clips**, which are sequences of codes and times for each client. Using these clips, a model can be generated to predict the time at which similar events may occur in the future, based on new data.

This notebook walks through most of the library's functionality and explains how the prediction process works.


## Imports

We will use some standard packages in this notebook, which we import here.

In [1]:
import random, time

import pandas as pd

from sklearn import metrics

Now, we import reels.

In [2]:
try:
    import reels

except ModuleNotFoundError:
    !pip install mercury-reels

    import reels

Additionally, we can verify that we are using the right version. (This notebook requires version 1.3.1 or later.)

In [3]:
reels.__version__

'1.4.1'

## The dataset

The dataset is synthetic. The following code generates it for a given number of clients, with many configurable parameters.

### The simulation

The dataset simulates a random number of events for each client. At random time intervals, the client drinks either a placebo (which does nothing), a poison or an antidote, chosen at random. Once the client has taken the poison, a fatal countdown starts. If the next event arrives after the fatal time (`dead_in`) has elapsed, the client dies instead, which generates the target event (the client's death) at a random point in time later than the poison time plus `dead_in`. If the client drinks the antidote after the poison and before dying, the effect of the poison is cancelled and the client survives unless he is poisoned again. If the client leaves the study poisoned, but before the fatal time has expired, we will not observe his death. This is a common situation in survival analysis and is why we treat the data as censored, i.e., the outcome is not accessible to the observer.

In [4]:
#@title
def new_sequence(max_events, time_ori, min_delay, max_delay, dead_in, poison, antidote, n_placebos):
    
    N = random.randrange(max_events) + 1
    
    poisoned = False
    target   = None
    
    codes = []
    times = []
    
    ct = time.mktime(time.strptime(time_ori, '%Y-%m-%d %H:%M:%S')) + random.randrange(min_delay, max_delay)
    
    for i in range(N):
        ct = ct + random.randrange(min_delay, max_delay)
    
        if (poisoned and ct - poison_t > dead_in):
            target = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(ct))
            break
            
        codes.append(random.randrange(n_placebos + 1) + 1)
        times.append(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(ct)))
        
        if codes[-1] > n_placebos:
            if poisoned:
                codes[-1] = antidote
                poisoned = False
            else:
                codes[-1] = poison
                poisoned = True
                poison_t = ct
                
    return codes, times, target


def create_dataframes(num_clients = 1000, max_events = 15, time_ori = '2022-06-01 00:00:01', min_delay = 60, max_delay = 72*3600, dead_in = 96*3600, poison = 200, antidote = 201, n_placebos = 5):
    
    data    = None
    targets = None

    for cli in range(1, num_clients + 1):
        codes, times, target = new_sequence(max_events, time_ori, min_delay, max_delay, dead_in, poison, antidote, n_placebos)
    
        emi    = ['click' for _ in range(len(codes))]
        weight = [1 for _ in range(len(codes))]
        client = [cli for _ in range(len(codes))]
    
        chunk = pd.DataFrame(list(zip(emi, codes, weight, client, times)), columns = ['emitter', 'description', 'weight', 'client', 'time'])

        data = chunk if data is None else pd.concat([data, chunk])
    
        if target is not None:
            chunk   = pd.DataFrame([[cli, target]], columns = ['client', 'time'])
            targets = chunk if targets is None else pd.concat([targets, chunk])
            
    return data, targets

In [5]:
data, targs = create_dataframes()

In general, all Reels datasets are transactional: each row represents an event that happens at a time that is given explicitly. Each row carries a client, identified by some string that is needed to join the data with the targets dataset, and an event. An event is defined by the triple `(emitter, description, weight)`. This is handy for representing many situations. E.g., in a navigation dataset, the emitter could be the source webpage and the description the destination webpage. In e-commerce, the emitter could be the seller, the description the product and the weight the price, which helps to distinguish different products.

In this case, we only need one field, `description`, to identify what the client drinks: a code from 1 to 5 (placebo), 200 (poison) or 201 (antidote). The other two fields are not used and contain constant values.

In [6]:
data

Unnamed: 0,emitter,description,weight,client,time
0,click,2,1,1,2022-06-03 05:57:22
1,click,2,1,1,2022-06-04 12:43:19
2,click,3,1,1,2022-06-06 17:57:49
3,click,4,1,1,2022-06-08 03:26:43
0,click,4,1,2,2022-06-06 01:26:22
...,...,...,...,...,...
5,click,4,1,1000,2022-06-10 15:16:34
6,click,2,1,1000,2022-06-10 19:16:57
7,click,2,1,1000,2022-06-11 09:06:40
8,click,4,1,1000,2022-06-12 05:34:24


The other dataset is much simpler and only contains the clients who died. It holds the client id, in the same format as in the previous dataset, and the date of the target event.

In [7]:
targs

Unnamed: 0,client,time
0,4,2022-06-11 05:45:56
0,13,2022-06-17 20:23:22
0,18,2022-06-09 17:13:00
0,23,2022-06-09 01:32:17
0,26,2022-06-10 06:25:23
...,...,...
0,981,2022-06-14 21:34:36
0,984,2022-06-17 05:32:41
0,985,2022-06-12 18:15:26
0,986,2022-06-10 13:30:19


## The Intake class

The Intake class is a data management utility. It applies the row-level loading methods of any reels object to a whole dataframe. E.g., to feed a complete dataframe to the `Events.insert_row()` method, the Intake has an `.insert_rows()` (plural) method. The same holds for any reels object that requires data loading.

The dataframe can be either a pandas or a pyspark dataframe. An Intake object can be used as many times as required. 

Now, we create two Intake objects for both dataframes to be used each time we need to populate an object.

In [8]:
intake_data  = reels.Intake(data)
intake_targs = reels.Intake(targs)

In [9]:
intake_data

reels.Intake object using Pandas

Column names: emitter, description, weight, client, time

Shape: 5954 x 5

In [10]:
intake_targs

reels.Intake object using Pandas

Column names: client, time

Shape: 381 x 2

## The Events Class

<img style="right;" src="https://raw.githubusercontent.com/BBVA/mercury-reels/master/notebooks/images/events.png">

An Events object is a container that keeps the definitions of every event we are interested in. Each event is stored as a mapping of `(emitter, description, weight)` into a unique integer code. Transactions that do not correspond to events stored in this object are ignored. There are many reasons to ignore transactions: for example, we may only be interested in some manually specified events, or there may be millions of events and we only want to focus on, say, the ten thousand most frequent.

We create an empty Events object.

In [11]:
events = reels.Events()

<div class="alert alert-block alert-success">
<b>Note that:</b> This object will only store up to 1000 events, which is the default value for `max_num_events`. To store more, we would pass that argument to the constructor.
</div>

In [12]:
events.num_events()

0

Now, we apply the whole dataset to it using the Intake object.

In [13]:
intake_data.insert_rows(events)

<div class="alert alert-block alert-success">
<b>Note that:</b> The column names <b>`emitter`, `description`, `weight`</b> in the dataframe are the default names and, therefore, it is not necessary to give them. 
<br>In general, this call would have been: <b>intake_data.insert_rows(events, columns = ['x', 'y', 'z'])</b>.
</div>

We have scanned nearly 6K rows (5954) to find the 7 unique events and assign codes to them in the order in which they were first found.

In [14]:
events

reels.Events object with 7 events

In [15]:
for ev in events.describe_events():
    print(ev)

('click', '2', 1.0, 1)
('click', '201', 1.0, 7)
('click', '3', 1.0, 2)
('click', '200', 1.0, 5)
('click', '1', 1.0, 6)
('click', '5', 1.0, 4)
('click', '4', 1.0, 3)


<div class="alert alert-block alert-info">
<b>Note 1:</b> We have done event discovery by scanning the whole dataset and were not limited in size. There were only 7 different events and we had capacity for 1000. In general, we may have to use the <b>max_num_events</b> argument of the constructor to decide how many events we want to learn.
</div>

<div class="alert alert-block alert-info">
<b>Note 2:</b> Besides letting the Events object learn by scanning the whole dataset, we could have defined the events manually by either pushing them one by one using the <b>.define_event()</b> method of the events object or using the Intake's <b>.define_events()</b> method.
</div>
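To make the discovery step more concrete, here is a minimal pure-Python sketch of the idea: each distinct `(emitter, description, weight)` triple receives the next free integer code the first time it is seen. The function `discover_events` and its handling of a full container (new events are simply skipped) are assumptions for illustration, not the library's implementation.

```python
def discover_events(rows, max_num_events=1000):
    """Map each distinct (emitter, description, weight) to an integer code.

    Hypothetical helper: codes start at 1 and follow the order of first
    appearance. Once the capacity is reached, unseen events are skipped
    (an assumption about the behaviour, made for this sketch).
    """
    codes = {}
    for emitter, description, weight in rows:
        key = (emitter, str(description), float(weight))
        if key not in codes and len(codes) < max_num_events:
            codes[key] = len(codes) + 1
    return codes


# The first events of client 1 plus one poison row, in the notebook's format.
rows = [('click', 2, 1), ('click', 2, 1), ('click', 3, 1), ('click', 200, 1)]
event_codes = discover_events(rows)

for key, code in event_codes.items():
    print(key + (code,))
```

Repeated rows do not create new codes, which is why thousands of rows can collapse into a handful of events.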

## The Clients class

<img style="right;" src="https://raw.githubusercontent.com/BBVA/mercury-reels/master/notebooks/images/clients.png">

The Clients class is a filter that can be used in the following classes (Clips and Targets) to limit processing to a set of clients. This may be necessary for memory reasons since the whole processing is in-memory. We can use this class to contain a subset of all the clients and compute everything over each subset.

In [16]:
clients = reels.Clients()

In [17]:
clients

Empty reels.Clients object (Empty objects select ALL clients.)

<div class="alert alert-block alert-success">
<b>Note that:</b> An empty Clients object means "for every client". We are not going to populate this object. This could be done using the object's <b>add_client_id()</b> method.
</div>
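If memory forces us to process clients in subsets, the ids first have to be split into batches. The sketch below does only that splitting in plain Python; `client_batches` is a hypothetical helper, and how each batch is then loaded into a Clients object (for instance with the `add_client_id()` method mentioned above) is left out because it depends on the library.

```python
def client_batches(client_ids, batch_size):
    """Yield lists of unique client ids (as strings), batch_size at a time."""
    ids = sorted(set(str(c) for c in client_ids))
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


# Ten clients split into batches of at most four ids.
batches = list(client_batches(range(1, 11), 4))
for batch in batches:
    print(batch)
```

Each batch could then drive one pass of the Clips/Targets pipeline, so that only one subset of clients is held in memory at a time.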

## The Clips class

<img style="right;" src="https://raw.githubusercontent.com/BBVA/mercury-reels/master/notebooks/images/clips.png">

A **clip** is the sequence of events of one client. In a clip, the time at which each event happens and the integer code representing the event are stored.

The **clips** object is a collection of clips for all the clients (or possibly a subset of them filtered using the Clients object).

We create an empty object.

In [18]:
clips = reels.Clips(clients, events)

<div class="alert alert-block alert-success">
<b>Note that:</b> We will give the time of the events to this object. If the time format is not the default <b>'%Y-%m-%d %H:%M:%S'</b>, we have to use the <b>time_format</b> argument.
</div>

In [19]:
clips.num_clips()

0

And we populate it just as before using the Intake object.

In [20]:
intake_data.scan_events(clips)

<div class="alert alert-block alert-success">
<b>Note that:</b> Again, the column names, in this case, <b>`emitter`, `description`, `weight`, `client`, `time`</b> in the dataframe are the default names and we are not using the argument <b>columns</b>.
</div>

And we can see the clips by client id.

In [21]:
clips

reels.Clips object with 1000 clips totalling 5954 events

In [22]:
clips.num_clips()

1000

In [23]:
clips.describe_clip('1')

[1, 1, 2, 3]

In [24]:
clips.describe_clip('1000')

[2, 6, 4, 3, 1, 3, 1, 1, 3, 4]

In [25]:
clips.describe_clip('1001')
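As a rough mental model of what a Clips object holds, the following pure-Python sketch groups transactions by client and keeps `(time, code)` pairs for the rows that match a known event. The function `build_clips` and the dictionary layout are assumptions made for illustration; the library's internal storage is not shown in this notebook.

```python
from collections import defaultdict


def build_clips(rows, event_codes):
    """Group rows into per-client lists of (time, code), sorted by time."""
    clips = defaultdict(list)
    for emitter, description, weight, client, time_str in rows:
        code = event_codes.get((emitter, str(description), float(weight)))
        if code is None:
            continue  # the transaction is not a known event, so it is ignored
        clips[str(client)].append((time_str, code))
    for seq in clips.values():
        seq.sort()  # '%Y-%m-%d %H:%M:%S' strings sort chronologically
    return dict(clips)


# Event codes as discovered above for descriptions 2, 3 and 4.
event_codes = {('click', '2', 1.0): 1, ('click', '3', 1.0): 2, ('click', '4', 1.0): 3}

# The four rows of client 1 shown in the data sample.
rows = [
    ('click', 2, 1, 1, '2022-06-03 05:57:22'),
    ('click', 2, 1, 1, '2022-06-04 12:43:19'),
    ('click', 3, 1, 1, '2022-06-06 17:57:49'),
    ('click', 4, 1, 1, '2022-06-08 03:26:43'),
]
clips_by_client = build_clips(rows, event_codes)
print([code for _, code in clips_by_client['1']])  # [1, 1, 2, 3]
```

In this sketch a client that never appears in the data, such as `'1001'`, simply has no entry in the dictionary.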

## The Targets class

<img style="right;" src="https://raw.githubusercontent.com/BBVA/mercury-reels/master/notebooks/images/targets.png">

A targets object does mainly two things:

  - Aggregating all the clips in a Clips object as a tree in reverse order. Clips that do not lead to a target event are stored entirely; for clips that do, all the events after the target event are removed.
  - Predicting the expected time to a hypothetical target event based on a clip using that tree learned from all the data in a Clips object.
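The reverse-order tree can be pictured with a small sketch. It only counts visits and targets per node and ignores times and the trimming of events after the target. The class `Node` and the function `build_tree` are hypothetical names, so this is an illustration of the data structure rather than the library's code.

```python
class Node:
    """One tree node: how many clips passed through it and how many were targets."""

    def __init__(self):
        self.visits = 0
        self.targets = 0
        self.children = {}


def build_tree(clips, target_clients):
    """Insert every clip from its last event backwards (reverse order)."""
    root = Node()
    for client, codes in clips.items():
        hit = 1 if client in target_clients else 0
        node = root
        node.visits += 1
        node.targets += hit
        for code in reversed(codes):
            node = node.children.setdefault(code, Node())
            node.visits += 1
            node.targets += hit
    return root


# Three toy clips; clients 'a' and 'b' reached the target event.
toy_clips = {'a': [1, 5], 'b': [2, 5], 'c': [5, 1]}
root = build_tree(toy_clips, target_clients={'a', 'b'})

print(root.visits, root.targets)                          # 3 2
print(root.children[5].visits, root.children[5].targets)  # 2 2
print(sorted(root.children[5].children))                  # [1, 2]
```

Because the most recent event is the first level below the root, clips that end in the same way share a branch regardless of how they started.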
  
We create a Targets object giving its constructor the populated Clips object we just created.

In [26]:
targets = reels.Targets(clips)

<div class="alert alert-block alert-success">
<b>Note that:</b> We will give the time of the targets to this object. If the time format is not the default <b>'%Y-%m-%d %H:%M:%S'</b>, we have to use the <b>time_format</b> argument.
</div>

In [27]:
targets

reels.Targets object with 1000 clips

Has no targets defined.

Is not fitted.

It still does not contain the targets.

In [28]:
targets.num_targets()

0

And we populate it again using an Intake, this time the one built from the targets dataframe.

In [29]:
intake_targs.insert_targets(targets)

<div class="alert alert-block alert-success">
<b>Note that:</b> Again, the column names, in this case, <b>'client', 'time'</b> in the dataframe are the default names and we are not using the argument <b>columns</b>.
</div>

In [30]:
targets.num_targets()

381

We have everything, but we have not yet fitted the model. The tree has just one root node with no clip information.

In [31]:
targets

reels.Targets object with 1000 clips

Has 381 targets.

Is not fitted.

In [32]:
targets.describe_tree_node(0)

(0, 0, 0.0, 0)

To build the tree, we call (just once per Targets object) the .fit() method.

In [33]:
targets.fit()

True

<div class="alert alert-block alert-success">
<b>Note that:</b> We are using default hyperparameters. The arguments <b>x_form, agg, p, depth</b> and <b>as_states</b> are described below under "How prediction works".
</div>

Now, we have a root tree node containing information.

The first value, 1000, is the number of visits (the total number of clips). The second, 381, is the number of targets (the number of rows in the `targs` dataset). The third is a global time-to-target aggregation (not tied to any particular sequence, since this is the root node). The final value, 7, is the number of children the node has (one for each event code).

In [34]:
targets

reels.Targets object with 1000 clips

Has 381 targets.

Is fitted with 1000 clips.

In [35]:
targets.describe_tree_node(0)

(1000, 381, 4568.595376, 7)
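When reading many nodes, it can help to give the four positions of the tuple names. The field names below (`visits`, `targets`, `time_agg`, `num_children`) are our own labels based on the description above, not names defined by the library.

```python
from collections import namedtuple

# Hypothetical labels for the four values returned by describe_tree_node().
TreeNode = namedtuple('TreeNode', ['visits', 'targets', 'time_agg', 'num_children'])

# The root node values printed above.
root_node = TreeNode(1000, 381, 4568.595376, 7)

# Raw proportion of clips that ended in a target event.
target_rate = root_node.targets / root_node.visits
print(f'{target_rate:.3f}')  # 0.381
```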

If we wanted to see the codes corresponding to its children we could:

In [36]:
targets.tree_node_children(0)

[1, 2, 3, 4, 5, 6, 7]

To find which code corresponds to the antidote, we look it up in the events. Codes are assigned in order of first appearance and the dataset is random, so the code changes between runs. We store it in `code_antidote`:

In [37]:
for ev_tup in events.describe_events():
    _, ev, _, code = ev_tup
    if ev == '201':
        code_antidote = code

We can access its node by asking the tree for the index of the child `code_antidote` of the root node, i.e., `targets.tree_node_idx(0, code_antidote)`, and confirm that it has no targets. The way the simulation works, it is impossible to die immediately after taking the antidote.

In [38]:
targets.describe_tree_node(targets.tree_node_idx(0, code_antidote))

(31, 0, 0.0, 5)

### Making predictions

To get predictions as the expected time to the target event, bear in mind that `targets.predict_clips(clips)` returns an iterator.

You can just convert them into a list.

In [39]:
T = list(targets.predict_clips(clips))

And see the first 5:

In [40]:
T[0:5]

[273987.4233416462,
 443072.02851064265,
 25800.3122625789,
 344351.2589951287,
 191015.1586986647]

A more realistic approach is to treat the predicted time as a score for the event happening (the lower the time, the more likely the event). We want to verify that this prediction carries a strong signal, despite being built on top of very random sequences from a fairly small dataset.

To verify that, we need a ground truth to compare against. Let's do it without cross-validation first and then with cross-validation.

We create a simple utility function that we will re-use later.

In [41]:
def analyze(targets, clips, targs):
    t_hashes = set([clients.hash_client_id(str(id)) for id in targs.client])    # This is the set of all the clients who are targets
    Y_obs = [int(hh in t_hashes) for hh in clips.clips_client_hashes()]         # This is the observed target/no_target for all the clients
    
    T = [t for t in targets.predict_clips(clips)]                               # These are the predicted times
    
    t_copy = T.copy()
    t_copy.sort()
    t_cut = t_copy[sum(Y_obs)]                                                  # t_cut is a cutting time that generates the same number of targets.
    
    Y_pred = [int(t <= t_cut) for t in T]                                       # This is the predicted target/no_target for all the clients
    
    x_tab = pd.crosstab(pd.array(Y_obs), 
                        pd.array(Y_pred), 
                        rownames = ['Obs'], 
                        colnames = ['Pred'])
    
    acc  = metrics.accuracy_score(Y_obs, Y_pred)                                # We compute basic metrics
    prec = metrics.precision_score(Y_obs, Y_pred)
    f1   = metrics.f1_score(Y_obs, Y_pred)

    print(x_tab)
    print('Accuracy: %.3f, precision: %.3f, f1-score: %.3f' % (acc, prec, f1))

We observe that the model captures a lot of structure from small and random data.

In [42]:
analyze(targets, clips, targs)

Pred    0    1
Obs           
0     530   89
1      78  303
Accuracy: 0.833, precision: 0.773, f1-score: 0.784


### Making predictions over a different dataset (cross-validation)

Now, we build two new smaller testing datasets.

In [43]:
test_data, test_targs = create_dataframes(500)

We need a new Clips object that builds new clips using the test dataset.

In [44]:
test_clips = reels.Clips(clients, events)

<div class="alert alert-block alert-warning">
<b>Important:</b> We need the Events object to be the same as before because it contains the conversion from events to codes. If we learned a new one, the events would be discovered in a different order and the codes would change. That would make the model useless.
</div>

We populate it with a one-use-only Intake.

In [45]:
reels.Intake(test_data).scan_events(test_clips)

And we obtain a prediction that is, as expected, worse than the previous one, but still captures structure on a small random dataset.

In [46]:
analyze(targets, test_clips, test_targs)

Pred    0    1
Obs           
0     227   75
1      73  125
Accuracy: 0.704, precision: 0.625, f1-score: 0.628


## How prediction works

Now we will have a closer look at how the prediction works and explain the different parameters that can be passed to `fit()`.

### Aggregating the different possible matches

Suppose we are predicting a sequence that ends with the events `NF, w, x, y, z`, where NF means **not found**. The tree (stored in reverse order, remember) contains a root node that has a child `z`, which has a child `y`, which has a child `x`, which has a child `w`, which does not have a child `NF`. Therefore, we can match the sequences `..., w, x, y, z`, `..., x, y, z`, `..., y, z` and `..., z`. Each of them has data on how many times it was seen, how many times a target event followed and how long after the last event the target event happened.
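The matching just described can be sketched in a few lines. Here the tree is a plain nested dictionary keyed by event, and `matched_suffixes` is a hypothetical helper that walks it from the last event backwards until an event is not found; it illustrates the idea and is not the library's traversal code.

```python
def matched_suffixes(tree, clip):
    """Return every suffix of clip that exists as a path from the root of tree."""
    matches = []
    node = tree
    for i in range(len(clip) - 1, -1, -1):
        code = clip[i]
        if code not in node:
            break  # this event was never seen at this depth: stop matching
        node = node[code]
        matches.append(clip[i:])
    return matches


# Reverse-order tree holding the single path z -> y -> x -> w.
tree = {'z': {'y': {'x': {'w': {}}}}}

for suffix in matched_suffixes(tree, ['NF', 'w', 'x', 'y', 'z']):
    print(suffix)
```

The walk stops at `NF`, so the four suffixes from `['z']` up to `['w', 'x', 'y', 'z']` are the candidates that can each contribute a prediction.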

Each of these sequences can make a prediction. So, how do we combine those predictions into a single value?

This is what the argument `agg` does.

In [47]:
new_targets = reels.Targets(clips)
intake_targs.insert_targets(new_targets)

In two lines, we created a populated new_targets object.

Let's take the prediction from the longest possible sequence (which is the most specific, but also the one with smallest sample size).

In [48]:
new_targets.fit(agg = 'longest')

True

<div class="alert alert-block alert-success">
    <b>Note that:</b> The argument <b>agg</b> can also be set to <b>'mean'</b> or <b>'minimax'</b>. The former averages all the predictions and the latter returns the min() of all of them. The latter is intended to be combined with an upper confidence bound based on evidence (sample size); the value is called 'minimax' as a reminder that it should be combined with an upper bound.
</div>
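The three `agg` options can be pictured with a small sketch that combines one prediction per matched sequence. The function `aggregate` and the `(matched_length, predicted_time)` input format are assumptions for illustration; the actual computation inside `fit()` and the prediction methods is not shown in this notebook.

```python
def aggregate(predictions, agg='longest'):
    """Combine per-sequence predictions given as (matched_length, predicted_time)."""
    if not predictions:
        raise ValueError('at least one matched sequence is required')
    if agg == 'longest':
        # Most specific match: the one with the greatest matched length.
        return max(predictions, key=lambda p: p[0])[1]
    if agg == 'mean':
        return sum(t for _, t in predictions) / len(predictions)
    if agg == 'minimax':
        return min(t for _, t in predictions)
    raise ValueError('unknown agg: %r' % agg)


# Made-up predictions (in seconds) for matches of length 1 to 4.
matches = [(1, 9000.0), (2, 7000.0), (3, 2000.0), (4, 4000.0)]

print(aggregate(matches, 'longest'))  # 4000.0
print(aggregate(matches, 'mean'))     # 5500.0
print(aggregate(matches, 'minimax'))  # 2000.0
```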

In [49]:
analyze(targets, clips, targs)

Pred    0    1
Obs           
0     530   89
1      78  303
Accuracy: 0.833, precision: 0.773, f1-score: 0.784


And we can compare the previous result (repeated above) with that of the new model. We should observe some improvement because, in this dataset, the longer clips are more informative: they are more likely to contain the poison/antidote events.

In [50]:
analyze(new_targets, clips, targs)

Pred    0    1
Obs           
0     573   46
1      34  347
Accuracy: 0.920, precision: 0.883, f1-score: 0.897


### Zooming into the prediction details 

There are two fundamental ideas that we have to keep in mind.

#### We only have the "time to target" for the targets

This is obvious, but still important to bear in mind. In survival analysis when we observe failures of lightbulbs, the lightbulbs that do not fail during our study are **not expected to work forever**.

**The Poisson distribution** expresses the probability of a given number of events occurring in a fixed interval of time when events happen at a known constant mean rate. Our predictions will assign time estimates to the **non targets** according to this principle, because it is the simplest possible assumption.

This translates to: if 40% of the clips were targets, with the target happening 1 hour after the sequence, the non targets would get a predicted time longer than one hour, but shorter than if only 4% had been targets instead of 40%.
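A small worked example shows the direction of this effect under a constant-rate assumption. It is not the formula Reels uses, only an illustration: we assume that a fraction of the clients reaches the target within a given time, solve `P(target by t) = 1 - exp(-rate * t)` for the rate, and report the mean waiting time `1 / rate`. The function name `expected_time` is hypothetical.

```python
import math


def expected_time(fraction_targets, observed_time):
    """Mean waiting time under a constant event rate (exponential waiting times)."""
    rate = -math.log(1.0 - fraction_targets) / observed_time
    return 1.0 / rate


# 40% vs. 4% of clients reaching the target within 1 hour.
t_40 = expected_time(0.40, 1.0)
t_04 = expected_time(0.04, 1.0)

print(round(t_40, 2))  # about 1.96 hours
print(round(t_04, 2))  # about 24.5 hours
```

Both estimates are longer than one hour, and the rarer the target, the longer the expected time.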

#### "Targets vs. non targets" is a binomial proportion

If we observe 3 targets out of 7, we have much less evidence than if we observe 300 out of 700, even if the maximum likelihood estimate (3/7) is the same in both cases. A better estimate, if we consider that the target is a risk, would be an upper confidence bound for the binomial proportion. Reels implements the Agresti-Coull upper bound for a given probability `p`. With that bound, the estimate is much higher for 3 of 7 than for 300 of 700.

<center><img src="https://raw.githubusercontent.com/BBVA/mercury-reels/master/notebooks/images/bounds_small.png"/></center>

**The parameter `p`** controls the upper bound (UB in the picture) of the Agresti-Coull confidence interval. The resulting proportion (targets vs. non targets) controls how the prediction blends the non targets (Poisson assumption) with the targets (observed times).
<div class="alert alert-block alert-info">
    <b>Note:</b> By setting <b>p = 0</b> we remove the confidence interval and the estimate will be exactly <b>(number of observed targets)/(number of observed)</b>.
</div>
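For reference, here is a sketch of the textbook Agresti-Coull upper bound, written in terms of a normal quantile `z`. How Reels maps its parameter `p` to `z` is not shown in this notebook, so the function `agresti_coull_upper` and its `z` argument are assumptions for illustration only.

```python
import math


def agresti_coull_upper(successes, trials, z=1.96):
    """Upper end of the Agresti-Coull interval for a binomial proportion."""
    n_adj = trials + z * z                      # adjusted number of trials
    p_adj = (successes + z * z / 2.0) / n_adj   # adjusted proportion
    return min(1.0, p_adj + z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj))


ub_small = agresti_coull_upper(3, 7)       # little evidence
ub_large = agresti_coull_upper(300, 700)   # same ratio, much more evidence
ub_none = agresti_coull_upper(3, 7, z=0.0) # no interval at all

print(round(ub_small, 3))  # about 0.75
print(round(ub_large, 3))  # about 0.47
print(round(ub_none, 3))   # 0.429, i.e. 3/7
```

With `z = 0` the bound collapses to the plain ratio of observed targets, which mirrors the effect described in the note above for `p = 0`.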

In [51]:
new_targets = reels.Targets(clips)
intake_targs.insert_targets(new_targets)

Let's play with `p`. Again, we have populated a new targets object.

In [52]:
new_targets.fit(agg = 'longest', p = 0)

True

And we notice even more improvement. 

(Note that we are overfitting to the training set. This is just a play on the parameters. Removing the confidence intervals is giving us the original times more accurately.)

In [53]:
analyze(new_targets, clips, targs)

Pred    0    1
Obs           
0     592   27
1      26  355
Accuracy: 0.947, precision: 0.929, f1-score: 0.931


### More fit() parameters: x_form

This parameter controls whether a `log()` transformation is applied to the times (the default, `x_form = 'log'`) or not (`x_form = 'linear'`). If it is applied, the corresponding `exp()` transformation is applied to the predicted times so that they are expressed in the same units.
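To see why the transformation matters, the sketch below averages a few times either directly or in log space followed by `exp()`. It assumes that the aggregation is a simple mean, which is our simplification; `mean_time` is a hypothetical helper and not part of the library.

```python
import math


def mean_time(times, x_form='log'):
    """Average times directly ('linear') or in log space, then map back with exp()."""
    if x_form == 'log':
        return math.exp(sum(math.log(t) for t in times) / len(times))
    return sum(times) / len(times)


# One minute, ten minutes and one hundred minutes, in seconds.
times = [60.0, 600.0, 6000.0]

log_mean = mean_time(times, 'log')        # geometric mean, close to 600
linear_mean = mean_time(times, 'linear')  # arithmetic mean, 2220.0

print(round(log_mean, 1), linear_mean)
```

The log-space average is a geometric mean, so a single very long time pulls the estimate up much less than it does in the linear case.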

In [54]:
new_targets = reels.Targets(clips)
intake_targs.insert_targets(new_targets)

In [55]:
new_targets.fit(agg = 'longest', p = 0, x_form = 'linear')

True

Again, we see some improvement from returning the overfitted times with more accuracy.

In [56]:
analyze(new_targets, clips, targs)

Pred    0    1
Obs           
0     596   23
1      20  361
Accuracy: 0.957, precision: 0.940, f1-score: 0.944


### More fit() parameters: depth

This parameter sets the maximum depth of the tree and, therefore, the maximum length of the sequences that are learnt.
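A short sketch shows what a depth limit hides from the model. We assume here that `depth` counts the events below the root, so only the last `depth` events of a clip can be matched; `visible_suffix` is a hypothetical helper, and the example uses code 5 for the poison, as in the event listing above.

```python
def visible_suffix(clip, depth):
    """Events that a tree of the given depth (>= 1) can match, most recent first."""
    return list(reversed(clip[-depth:]))


POISON = 5  # code assigned to description '200' in the run shown above

clip = [4, POISON, 1, 2, 3]  # the poison was taken four events ago

shallow = visible_suffix(clip, 3)
deep = visible_suffix(clip, 8)

print(shallow, POISON in shallow)  # [3, 2, 1] False
print(deep, POISON in deep)        # [3, 2, 1, 5, 4] True
```

With a shallow tree, a poison taken more than `depth` events ago is invisible, so this clip becomes indistinguishable from a harmless one.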


In [57]:
new_targets = reels.Targets(clips)
intake_targs.insert_targets(new_targets)

In [58]:
new_targets.fit(agg = 'longest', p = 0, x_form = 'linear', depth = 3)

True

By limiting the sequence length to 3, we see a large drop in performance.

In [59]:
analyze(new_targets, clips, targs)

Pred    0    1
Obs           
0     516  103
1      99  282
Accuracy: 0.798, precision: 0.732, f1-score: 0.736


### More fit() parameters: as_states

In some cases, it may be good to treat the events as states. I.e., **removing consecutive identical events**. Instead of an event in time, we consider the event as a state that does not change until a **different** event changes the state.

Reels supports doing this automatically. Just set `as_states = True` and events will be treated as states.
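The effect on the event codes can be reproduced with `itertools.groupby`, which groups consecutive equal items. This sketch only handles the codes; which timestamp the library keeps for a collapsed run is not covered here, and `as_states` is our own helper name.

```python
from itertools import groupby


def as_states(codes):
    """Collapse runs of consecutive identical codes into a single code."""
    return [code for code, _ in groupby(codes)]


# The two clips displayed earlier in this notebook.
print(as_states([1, 1, 2, 3]))                    # [1, 2, 3]
print(as_states([2, 6, 4, 3, 1, 3, 1, 1, 3, 4]))  # [2, 6, 4, 3, 1, 3, 1, 3, 4]
```

Only consecutive repetitions are removed: a code that reappears after a different event still counts as a new state.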

In [60]:
new_targets = reels.Targets(clips)
intake_targs.insert_targets(new_targets)

In [61]:
new_targets.fit(agg = 'longest', p = 0, x_form = 'linear', as_states = True)

True

Again, we created a new targets object and fitted it, this time setting `as_states = True`. The previous best performance is not reached this time.

In [62]:
analyze(new_targets, clips, targs)

Pred    0    1
Obs           
0     536   83
1      66  315
Accuracy: 0.851, precision: 0.791, f1-score: 0.809


<div class="alert alert-block alert-info">
    <b>Note:</b> In all our examples we have been using the <b>predict_clips()</b> method that makes predictions based on clips. Besides that method, there is a 
<b>predict_clients()</b> method that makes predictions based on client ids stored in a Clients object. For it to work, the clients must be part of the Clips object passed to the constructor, where their clips can be found.
</div>

## What's next?

<img style="float: left;" src="https://raw.githubusercontent.com/BBVA/mercury-reels/master/notebooks/images/reels_small.png"> Now, you can also check the **reels_event_optimization** tutorial to dive deeper into reels!
