# First steps

In [None]:
!git clone https://github.com/retentioneering/retentioneering-tools.git
!python retentioneering-tools/setup.py install
!pip install shap==0.29.1
!pip install -I shap

In [None]:
PROJECT_PATH = 'retentioneering-tools'
from importlib.machinery import SourceFileLoader
somemodule = SourceFileLoader('retentioneering', 'retentioneering-tools/retentioneering/__init__.py').load_module()

In [None]:
from retentioneering import init_config
import pandas as pd

First, we need to initialize the config.

In [None]:
init_config(
    experiments_folder='experiments', # folder for saving experiment results: graph visualizations, heatmaps, etc.
    index_col='user_pseudo_id', # column by which we split users / sessions / whatever
    event_col='event_name', # column that holds the event name
    event_time_col='event_timestamp', # column that holds the event timestamp
    positive_target_event='passed', # name of positive target event
    negative_target_event='lost', # name of negative target event
    pos_target_definition={ # how to define the positive event; empty means 'passed' is added for every user who was not 'lost'

    },
    neg_target_definition={ # how to define the negative event, e.g. users who were inactive for 600 seconds
        'time_limit': 600
    },
#     neg_target_definition={ # you can also define the target event as a list of other events
#         'event_list': ['lost']
#     }
)
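
To make the `time_limit` rule more concrete, here is a minimal plain-Python sketch of one plausible reading of it: a user is marked `lost` if more than 600 seconds pass after their last event before the observation window ends, and `passed` otherwise. The function `label_users`, the `observation_end` argument and the toy timestamps are assumptions made for illustration; the library may apply the limit differently.

```python
def label_users(last_event_time, observation_end, time_limit=600):
    """Assign a target event to each user from their last event time (seconds)."""
    labels = {}
    for user, last_seen in last_event_time.items():
        inactive_for = observation_end - last_seen
        # Hypothetical rule: silence longer than time_limit means the user is lost.
        labels[user] = 'lost' if inactive_for > time_limit else 'passed'
    return labels


# Toy data: time of the last event per user, in seconds.
last_event_time = {'u1': 100, 'u2': 950, 'u3': 390}
labels = label_users(last_event_time, observation_end=1000)
print(labels)
```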

Next, we load our data into a pandas DataFrame.

In [None]:
import os
os.chdir('retentioneering-tools')
data = pd.read_csv('examples/data/train.csv')
data = data.sort_values('event_timestamp')

In [None]:
data = data.retention.prepare()

In [None]:
data.head()

In [None]:
edgelist = data.retention.get_edgelist()
edgelist.head()

You can use any columns as the edge source and target via the `cols` parameter. By default it is a list of `event_col` and the automatically created `next_event` column (a shift of `event_col`).
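
To show what "event and shifted next event" means, the sketch below builds an edge list by hand: each event is paired with the one that follows it in the same user's trajectory, and the pairs are counted. It is a plain-Python illustration on toy data, not the library's implementation; `build_edge_counts` and the event names are made up for the example.

```python
from collections import Counter

# Toy event log: (user, event), already sorted by time within each user.
events = [
    ('u1', 'welcome'), ('u1', 'login'), ('u1', 'passed'),
    ('u2', 'welcome'), ('u2', 'login'), ('u2', 'lost'),
    ('u3', 'welcome'), ('u3', 'lost'),
]


def build_edge_counts(events):
    """Count transitions (event -> next event) within each user's trajectory."""
    by_user = {}
    for user, event in events:
        by_user.setdefault(user, []).append(event)
    counts = Counter()
    for trajectory in by_user.values():
        # Pair every event with the one that follows it (the "shift").
        counts.update(zip(trajectory, trajectory[1:]))
    return counts


edge_counts = build_edge_counts(events)
for (source, target), n in sorted(edge_counts.items()):
    print(source, '->', target, n)
```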

You can also use any column and any aggregation.
For example, the next cell calculates the number of unique users who passed through each edge.

In [None]:
data.head()

In [None]:
edgelist = data.retention.get_edgelist(edge_col='user_pseudo_id', edge_attributes='users_nunique', norm=False)
edgelist.sort_values('users_nunique', ascending=False).head()
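
The difference between counting transitions and counting unique users can be seen in a small plain-Python sketch. Here `u1` goes through `welcome -> login` twice but is counted only once for that edge. The helper `unique_users_per_edge` and the toy log are assumptions for illustration only.

```python
from collections import defaultdict

# Toy event log: (user, event), sorted by time within each user.
events = [
    ('u1', 'welcome'), ('u1', 'login'), ('u1', 'welcome'),
    ('u1', 'login'), ('u1', 'passed'),
    ('u2', 'welcome'), ('u2', 'login'), ('u2', 'lost'),
]


def unique_users_per_edge(events):
    """Return the number of distinct users who passed through each edge."""
    by_user = defaultdict(list)
    for user, event in events:
        by_user[user].append(event)
    users_on_edge = defaultdict(set)
    for user, trajectory in by_user.items():
        for edge in zip(trajectory, trajectory[1:]):
            # A set keeps each user at most once per edge.
            users_on_edge[edge].add(user)
    return {edge: len(users) for edge, users in users_on_edge.items()}


users_nunique = unique_users_per_edge(events)
print(users_nunique)
```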

We can also compute an adjacency matrix.

In [None]:
data.retention.get_adjacency()

The same parameters work for the adjacency matrix calculation.

In [None]:
data.retention.get_adjacency(edge_col='user_pseudo_id', edge_attributes='users_nunique', norm=False)
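
An adjacency matrix holds the same information as the edge list, arranged with sources as rows and targets as columns. The sketch below converts toy edge weights into such a matrix using nested dictionaries; `to_adjacency` and the weights are hypothetical and only illustrate the layout.

```python
# Toy edge weights: (source, target) -> weight.
edge_weights = {
    ('welcome', 'login'): 2,
    ('login', 'passed'): 1,
    ('login', 'lost'): 1,
    ('welcome', 'lost'): 1,
}


def to_adjacency(edge_weights):
    """Build a square matrix (dict of dicts) with zeros for missing edges."""
    nodes = sorted({node for edge in edge_weights for node in edge})
    matrix = {src: {dst: 0 for dst in nodes} for src in nodes}
    for (src, dst), weight in edge_weights.items():
        matrix[src][dst] = weight
    return nodes, matrix


nodes, matrix = to_adjacency(edge_weights)
for src in nodes:
    print(src, [matrix[src][dst] for dst in nodes])
```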

We can also simply visualize the graph.

By default, the edge weight in the visualization is the share of unique users who pass through the edge.
You can change it to the share of all event occurrences by setting `user_based` to `False`.

In [None]:
data.retention.plot_graph(thresh=0.05, width=800, height=800)

If you change node positions and want to save the resulting layout, click the download button, then load the saved file into the graph visualizer as follows.

In [None]:
data.retention.plot_graph(layout_dump='node_params.json', width=800, height=800)

You can also use other data columns and aggregation functions from the `retention.get_edgelist()` method (make sure that `user_based=False` in this case).

For example, we can visualize the mean time between events.

First, we add a column with the time difference between event timestamps.

In [None]:
data['seconds_between_events'] = (data.next_timestamp - data.event_timestamp) * 1e-9
# use show_percent=False to visualize absolute value
data.retention.plot_graph(user_based=False, edge_col='seconds_between_events', edge_attributes='time_mean', thresh=0.01, width=800, height=800, show_percent=False)
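
The aggregation behind this plot can be sketched in plain Python: for every transition we take the gap between the two timestamps and then average the gaps per edge. The toy log uses timestamps in seconds, and `mean_seconds_per_edge` is a made-up helper, not a library function.

```python
from collections import defaultdict

# Toy event log: (user, event, timestamp in seconds).
events = [
    ('u1', 'welcome', 0), ('u1', 'login', 10), ('u1', 'passed', 40),
    ('u2', 'welcome', 5), ('u2', 'login', 25), ('u2', 'lost', 35),
]


def mean_seconds_per_edge(events):
    """Average time between consecutive events, grouped by (source, target)."""
    by_user = defaultdict(list)
    # Sort by user and timestamp so consecutive rows are consecutive events.
    for user, event, ts in sorted(events, key=lambda row: (row[0], row[2])):
        by_user[user].append((event, ts))
    gaps = defaultdict(list)
    for trajectory in by_user.values():
        for (src, t0), (dst, t1) in zip(trajectory, trajectory[1:]):
            gaps[(src, dst)].append(t1 - t0)
    return {edge: sum(values) / len(values) for edge, values in gaps.items()}


time_mean = mean_seconds_per_edge(events)
print(time_mean)
```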

# Temporal funnel

Let's plot the temporal funnel. Rows correspond to events and columns correspond to the step number in the user trajectory; each value is the fraction of all users who had that event at that step. For example, you can see that all users in the analysis start from "welcome_screen" (step 1) and end up as passed (~0.6) or lost (~0.4) after 24 steps.

In [None]:
desc_table = data.retention.get_step_matrix(max_steps=30)
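
As a rough illustration of how such a table is filled, the sketch below computes, for each event and step, the fraction of users who had that event at that step. It is a simplified plain-Python version on toy trajectories: `step_matrix` is a hypothetical helper, and unlike the library's table it does not carry terminal events forward to later steps.

```python
from collections import defaultdict

# Toy trajectories: ordered events per user.
trajectories = {
    'u1': ['welcome', 'login', 'passed'],
    'u2': ['welcome', 'login', 'lost'],
    'u3': ['welcome', 'lost'],
    'u4': ['welcome', 'login', 'passed'],
}


def step_matrix(trajectories, max_steps):
    """Map each event to a list of user fractions, one entry per step."""
    n_users = len(trajectories)
    matrix = defaultdict(lambda: [0.0] * max_steps)
    for events in trajectories.values():
        for step, event in enumerate(events[:max_steps]):
            matrix[event][step] += 1 / n_users
    return dict(matrix)


table = step_matrix(trajectories, max_steps=3)
for event, row in table.items():
    print(event, row)
```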

We can also calculate the temporal funnel difference between two groups.

In [None]:
# create group filter based on target events
diff_filter = data.retention.create_filter()

# calculate difference table between two groups
diff_table = data.retention.get_step_matrix_difference(diff_filter, max_steps=30)
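
Conceptually, the difference table is one step table subtracted from another, cell by cell. The sketch below shows this for two toy groups (users who passed and users who were lost). The helpers `step_fractions` and `step_difference` are assumptions for illustration; the library's exact normalization may differ.

```python
def step_fractions(trajectories, max_steps):
    """Fraction of the group's users having each event at each step."""
    fractions = {}
    for events in trajectories.values():
        for step, event in enumerate(events[:max_steps]):
            row = fractions.setdefault(event, [0.0] * max_steps)
            row[step] += 1 / len(trajectories)
    return fractions


def step_difference(group_a, group_b, max_steps):
    """Cell-by-cell difference (group_a minus group_b) of the step tables."""
    a = step_fractions(group_a, max_steps)
    b = step_fractions(group_b, max_steps)
    zero = [0.0] * max_steps
    return {
        event: [x - y for x, y in zip(a.get(event, zero), b.get(event, zero))]
        for event in sorted(set(a) | set(b))
    }


passed_group = {
    'u1': ['welcome', 'login', 'passed'],
    'u4': ['welcome', 'tutorial', 'passed'],
}
lost_group = {
    'u2': ['welcome', 'login', 'lost'],
    'u3': ['welcome', 'lost'],
}
diff = step_difference(passed_group, lost_group, max_steps=3)
for event, row in diff.items():
    print(event, row)
```

Positive values mark events that are more common in the first group at that step, negative values mark events more common in the second group.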

# Clustering

We can use clustering with different visualizations.

A clustermap shows how important different events are for clustering.
For example, we can see that `onboarding_welcome_screen` is always equal, so it does not affect clustering, whereas `onboarding_chooseLoginType` varies across users and creates some clusters.

In [None]:
data.retention.get_clusters(plot_type='cluster_heatmap');

Next, it is useful to visualize a projection of user trajectories to understand how many clusters we have.

In [None]:
data.retention.learn_tsne(plot_type='targets');

We can see that the projection is poor, so it is worth tuning. To refit the t-SNE projection, set the `refit` parameter to `True`.

Any parameter from `sklearn.manifold.TSNE` can be used, e.g. `perplexity` can help to obtain better visualization.

In [None]:
data.retention.learn_tsne(perplexity=10, plot_type='targets', refit=True);

Now we can see two dense circular clusters.

Any parameters from `sklearn.cluster.KMeans` can be used.

In [None]:
data.retention.get_clusters(n_clusters=8, plot_type='cluster_tsne', refit_cluster=True);

We do not use target events in clustering, so we can compare different groups in terms of what target event is likely to occur in them.

In [None]:
data.retention.get_clusters(plot_type='cluster_pie');

We can see that clusters `0` and `1` are pretty interesting, so we can visualize graph for them.

In [None]:
(data
 .retention
 .filter_cluster(0)
 .retention
 .plot_graph(width=800, height=800))

In [None]:
(data
 .retention
 .filter_cluster(1)
 .retention
 .plot_graph(width=800, height=800))

In [None]:
(data
 .retention
 .filter_cluster(4)
 .retention
 .plot_graph(width=800, height=800))

In [None]:
(data
 .retention
 .filter_cluster(5)
 .retention
 .plot_graph(width=800, height=800))

In [None]:
(data
 .retention
 .filter_cluster(7)
 .retention
 .plot_graph(width=800, height=800))

# Supervised classifier

Supervised learning is usually better than clustering when target events are available.

In [None]:
model = data.retention.create_model()

To understand which features are meaningful, we can visualize a graph of weights.

The larger the node or edge, the larger its effect on the probability of the target event.
Green nodes mean a positive effect, red nodes a negative one.

In [None]:
features = data.retention.extract_features(ngram_range=(1,2))
target = features.index.isin(data.retention.get_positive_users())
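
With `ngram_range=(1,2)`, the features cover single events and pairs of consecutive events. The sketch below shows the idea with plain counts on toy trajectories; `ngram_counts` is a hypothetical helper, and the library may weight or normalize the features differently.

```python
from collections import Counter


def ngram_counts(trajectory, ngram_range=(1, 2)):
    """Count event n-grams of every length in ngram_range (inclusive)."""
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        for i in range(len(trajectory) - n + 1):
            # Join consecutive events into one feature name.
            counts[' '.join(trajectory[i:i + n])] += 1
    return counts


# Toy trajectories: ordered events per user.
trajectories = {
    'u1': ['welcome', 'login', 'main'],
    'u2': ['welcome', 'main'],
}
toy_features = {user: ngram_counts(events) for user, events in trajectories.items()}
for user, counts in toy_features.items():
    print(user, dict(counts))
```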

In [None]:
model.permutation_importance(features, target, thresh=0.)
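
The idea of permutation importance is to shuffle one feature column at a time and measure how much the model's score drops. Below is a minimal plain-Python sketch of that general technique with a toy rule-based "model"; it is not the library's `permutation_importance` method, and all names and data here are made up for illustration.

```python
import random


def accuracy(predict, rows, target):
    """Share of rows where the prediction matches the target."""
    hits = sum(predict(row) == y for row, y in zip(rows, target))
    return hits / len(target)


def permutation_importance(predict, rows, target, n_repeats=20, seed=0):
    """Mean accuracy drop after shuffling each feature column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, target)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this column permuted.
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, target))
        importances.append(sum(drops) / n_repeats)
    return importances


def predict(row):
    # Toy "model": uses feature 0 only and ignores feature 1.
    return row[0] > 0


rows = [[1, 5], [1, 2], [1, 7], [1, 1], [0, 5], [0, 2], [0, 7], [0, 1]]
target = [True, True, True, True, False, False, False, False]
importances = permutation_importance(predict, rows, target)
print(importances)
```

Because the toy model ignores the second feature, shuffling it never changes the score, so its importance is zero, while shuffling the first feature lowers accuracy.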

You can use any other model with an sklearn-style API (not only the sklearn package has it, e.g. `lightgbm` can be used too).

You can also pass parameters to it.

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = data.retention.create_model(RandomForestClassifier, n_estimators=25)

In [None]:
features = data.retention.extract_features(ngram_range=(1,2))
target = features.index.isin(data.retention.get_positive_users())

In [None]:
model.permutation_importance(features, target, thresh=0.)