Update anonymous-explainer.md #281

Draft
wants to merge 1 commit into
base: main
103 changes: 74 additions & 29 deletions docs/guides/anonymous-explainer.md
@@ -4,62 +4,90 @@ sidebar_position: 5

# Analyzing anonymous user experiments

In the field of A/B testing, it is common to assign unauthenticated users to experiments using anonymous identifiers such as Cookie IDs. However, it is necessary to link these assignments to user-level events in order to calculate metrics for these unauthenticated users. In some cases, certain events may have an associated User ID, while others may not, due to users transitioning between authenticated and unauthenticated states.
Some websites only allow logged-in traffic: Facebook, for instance. However, most websites welcome any traffic to their homepage. When running A/B tests on those pages, some visitors will be logged in, but most will not. In that case, it is common to assign unauthenticated users to experiments using anonymous identifiers such as a Session ID, a Cookie ID, a browser fingerprint, or an IP address. However, those visitors can create an account or log in, and so transition between authenticated and unauthenticated states. If an experiment is looking at actions that only a user can take, like making a purchase or renewing a subscription, the assignments on `anonymous_id` need to be matched to user-level events.

## Anonymous assignment means anonymous analysis

Using a User ID to analyse the data is tempting. However, if the change might discourage users from logging in, you would want to measure that. It’s preferable to look at the conversion rate per visitor than to assume that the number of log-ins is the same for all variants.


## Track `anonymous_id` throughout the session

The easiest way to do that is to send the anonymous identifier along with the User ID in all your logs. This is commonly done if you use a session ID to assign home-page changes; it’s less common with other identifiers, but useful when running A/B tests.

However, if a user sees an ad on social media, arrives at your service through an embedded mobile browser, logs in there, and later switches to a desktop browser to type in their credit card number, you might not be able to track them with the session alone. In that case, you might have to track UTM links across platforms, match IP addresses, or match on their User ID.

--xx
we need to calculate metrics for these unauthenticated users.

In some cases, certain events may have an associated User ID, while others may not, due to users transitioning between authenticated and unauthenticated states.

For instance, consider a typical E-commerce company that maintains a table of events tracking user clicks and their progress through the purchase funnel (e.g., add-to-cart, checkout, and purchase events). Prior to making their first purchase and authenticating, any events generated by a user will not be associated with a User ID. Once the user authenticates, their events can be linked to a User ID as long as they remain logged in. However, if the user logs out, subsequent events will not be associated with a User ID until they authenticate again. This situation can result in time periods where important events are not linked to a User ID.
-- xx

To address this issue, attribution models can be developed within the transformation layer of the data warehouse. Although these models may vary, a common approach is to assume that the association between a cookie and a specific user remains valid until proven otherwise. Specifically, once an association between an Anonymous ID and a User ID is established using data collected post-authentication, all events prior to that authentication moment can be assumed to have been performed by the same user. This relationship remains valid until the Anonymous ID is associated with a new User ID in the data. At that point, the observed relationship between the Anonymous ID and User ID can be assumed until another relationship is identified.
## Match `anonymous_id` to later log-in events

To address this issue, attribution models can be developed within the transformation layer of the data warehouse. These models may vary but a common approach is to assume that the association between a browser, a cookie, or an IP and a specific user remains valid until proven otherwise. Once an association between an Anonymous ID and a User ID is established using data collected post-authentication, all events prior to that authentication moment can be assumed to have been performed by the same user.

This relationship remains valid until the Anonymous ID is associated with a new User ID in the data. This can happen if a computer is shared, for instance. At that point, the observed relationship between the Anonymous ID and the later User ID can be assumed until another relationship is identified.

# Anonymous User Attribution

By performing a small amount of data transformation within the data warehouse and a few minutes of setup within Eppo, you can analyze these types of anonymized user experiments effectively in Eppo.

## Warehouse Setup

To build an anonymous visitor-to-user attribution model, begin with a table similar to the one described above. From this table, build a model that identifies the minimum and maximum time in which an Anonymous ID was associated with a specific User ID. A template query to do this can be viewed below.
To build an anonymous visitor-to-user attribution model, begin with a table similar to the one described above. From this table, build a model that identifies the minimum and maximum time in which an Anonymous ID was associated with a specific User ID.

```sql
with
A template query to do this can be viewed below.

users_lag as (
SELECT
```sql
WITH
users_lag as (
SELECT
user_id
, anonymous_id
, lag(user_id) OVER (PARTITION BY anonymous_id ORDER BY ts) as last_user_id
, lag(ts) OVER (PARTITION BY anonymous_id ORDER BY ts) as last_ts
, lead(ts) OVER (PARTITION BY anonymous_id ORDER BY ts) as next_ts
, anonymous_id
, LAG(user_id) OVER (PARTITION BY anonymous_id ORDER BY ts) as last_user_id
, LAG(ts) OVER (PARTITION BY anonymous_id ORDER BY ts) as last_ts
, LEAD(ts) OVER (PARTITION BY anonymous_id ORDER BY ts) as next_ts
FROM event_table
)

, user_switch as (
SELECT
*
, SUM(IF(last_user_id != user_id, 1, 0)) OVER (PARTITION BY anonymous_id ORDER BY ts) as cumulative_switch
FROM users_lag
SELECT *
, SUM(IF(last_user_id != user_id, 1, 0))
OVER (PARTITION BY anonymous_id ORDER BY ts) as cumulative_switch
FROM users_lag
)

, user_login_windows_collapsed as (
select
anonymous_id
, user_id
, cumulative_switch
, LOGICAL_OR(last_ts IS NULL) as is_first
, LOGICAL_OR(next_ts IS NULL) as is_last
, min(ts) as ts_min
, max(next_ts) as ts_max
from user_switch
group by 1,2,3
SELECT
anonymous_id
, user_id
, cumulative_switch
, LOGICAL_OR(last_ts IS NULL) as is_first
, LOGICAL_OR(next_ts IS NULL) as is_last
, MIN(ts) as ts_min
, MAX(next_ts) as ts_max
FROM user_switch
-- WHERE cumulative_switch = 0 -- Uncomment to exclude users sharing a device
GROUP BY 1,2,3
)

SELECT
anonymous_id
, user_id -- needed to join this model to user-level fact tables
, IF(is_first, TIMESTAMP("0001-01-01 00:00:00"), ts_min) as ts_start_window
, IF(is_last, TIMESTAMP("9999-12-31 23:59:59"), ts_max) as ts_end_window
, IF(is_first, TIMESTAMP("0001-01-01 00:00:00"), ts_min) as ts_start_window
, IF(is_last, TIMESTAMP("9999-12-31 23:59:59"), ts_max) as ts_end_window
FROM user_login_windows_collapsed
order by anonymous_id, ts_min;

ORDER BY anonymous_id, ts_min;
```

-- Multiple inconsistent anonymous exposures

For the first identified relationship between an Anonymous ID and a User ID, a timestamp far into the past is used for `ts_start_window`, so that events prior to the user’s first moment of authentication get an inferred User ID. Similarly, for the last identified relationship between an Anonymous ID and a User ID, a timestamp far into the future is used for `ts_end_window`, so that any later events created in an unauthenticated state also get an inferred User ID. This association is used for all unauthenticated events until a new relationship between the Anonymous ID and a different User ID is identified. At that point, a new `ts_start_window` is defined for the given Anonymous ID.

-- Drawing?

Once this model is built, it can be joined to any fact table within the data warehouse. It should be joined onto these fact tables by User ID wherever a fact event’s timestamp is between a given user’s `ts_start_window` and `ts_end_window`. By doing this, all fact tables at the user level can now have an inferred Anonymous ID. This inferred Anonymous ID can then be used by Eppo to link Assignment SQL definitions at the Anonymous ID level to these fact tables.
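
As a minimal sketch of that join, assume a hypothetical fact table `purchases` (with `user_id`, `ts`, and `revenue` columns), and assume the model above is saved as a hypothetical table `anonymous_user_windows` that exposes `user_id` alongside `anonymous_id` and the two window columns. None of these names come from Eppo; adapt them to your warehouse.

```sql
-- Sketch only: `purchases` and `anonymous_user_windows` are hypothetical names.
SELECT
    w.anonymous_id   -- inferred Anonymous ID for this user-level event
  , p.user_id
  , p.ts
  , p.revenue
FROM purchases AS p
JOIN anonymous_user_windows AS w
  ON p.user_id = w.user_id
  -- keep only events that fall inside the window where this
  -- Anonymous ID is attributed to this User ID
  AND p.ts BETWEEN w.ts_start_window AND w.ts_end_window
```

Because the join is on User ID, a user who authenticated on several devices has one window per Anonymous ID, so the same purchase can appear once per device in the result. Whether that is acceptable depends on how you want to treat visitors who may have seen more than one variant.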

```sql
@@ -90,3 +118,20 @@ Finally, create a Fact SQL definition that utilizes the anonymous visitor-to-use

With this setup, all metrics derived from this fact will successfully link back to assignments with Anonymous IDs.
Follow the same pattern described above for other facts associated with metrics that need to be added to the metric repository.


# Alternative approaches

Matching visitors and users after the fact isn’t the only way. There are other patterns that we can recommend, but they go beyond Eppo’s remit.

## Double exposure

One possible issue when letting visitors see a meaningful change before they log in is that they might use different devices, browsers, or connections, and so could be exposed to multiple variants. If the change is not obvious, this could be overlooked; however, if it’s a large promotion, users might remember it and seek to redeem it. In that case, you are better off excluding those users if you can match them.

## Soft login

Another practice that is common among more mature e-commerce platforms is to _assume_ who users are based on an anonymous identifier. They let users see targeted recommendations, register favourites, and fill their basket; but to confirm a sale, see previous orders, or edit a delivery address, users need to log in. This graduated approach to security allows them to match most returning visitors to user accounts without forcing users into log-in flows before they are ready.

## Same user, multiple accounts (SUMA)

In some cases, a user might use multiple accounts: either in a professional context, where an assistant can act on behalf of their manager, or through disingenuous behaviour such as evading detection or abusing a first-order promotion. It’s up to you, depending on the circumstances, to decide whether those actions should be excluded from your testing results, or whether you would get better insights by defining a User entity, distinct from an Account, that pools multiple accounts used by the same person.