# Importing the databases

In [1]:
import pandas as pd

df = pd.read_csv("../data/df_final_demo.txt")
final_demo_df = df.copy()

final_demo_df

Unnamed: 0,client_id,clnt_tenure_yr,clnt_tenure_mnth,clnt_age,gendr,num_accts,bal,calls_6_mnth,logons_6_mnth
0,836976,6.0,73.0,60.5,U,2.0,45105.30,6.0,9.0
1,2304905,7.0,94.0,58.0,U,2.0,110860.30,6.0,9.0
2,1439522,5.0,64.0,32.0,U,2.0,52467.79,6.0,9.0
3,1562045,16.0,198.0,49.0,M,2.0,67454.65,3.0,6.0
4,5126305,12.0,145.0,33.0,F,2.0,103671.75,0.0,3.0
...,...,...,...,...,...,...,...,...,...
70604,7993686,4.0,56.0,38.5,U,3.0,1411062.68,5.0,5.0
70605,8981690,12.0,148.0,31.0,M,2.0,101867.07,6.0,6.0
70606,333913,16.0,198.0,61.5,F,2.0,40745.00,3.0,3.0
70607,1573142,21.0,255.0,68.0,M,3.0,475114.69,4.0,4.0


In [6]:
import pandas as pd

df = pd.read_csv("../data/df_final_web_data_pt_1.txt")
final_web_data_1_df = df.copy()

final_web_data_1_df

Unnamed: 0,client_id,visitor_id,visit_id,process_step,date_time
0,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:27:07
1,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:26:51
2,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:19:22
3,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:19:13
4,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:18:04
...,...,...,...,...,...
343136,2443347,465784886_73090545671,136329900_10529659391_316129,confirm,2017-03-31 15:15:46
343137,2443347,465784886_73090545671,136329900_10529659391_316129,step_3,2017-03-31 15:14:53
343138,2443347,465784886_73090545671,136329900_10529659391_316129,step_2,2017-03-31 15:12:08
343139,2443347,465784886_73090545671,136329900_10529659391_316129,step_1,2017-03-31 15:11:37


In [7]:
import pandas as pd

df = pd.read_csv("../data/df_final_web_data_pt_2.txt")
final_web_data_2_df = df.copy()

final_web_data_2_df

Unnamed: 0,client_id,visitor_id,visit_id,process_step,date_time
0,763412,601952081_10457207388,397475557_40440946728_419634,confirm,2017-06-06 08:56:00
1,6019349,442094451_91531546617,154620534_35331068705_522317,confirm,2017-06-01 11:59:27
2,6019349,442094451_91531546617,154620534_35331068705_522317,step_3,2017-06-01 11:58:48
3,6019349,442094451_91531546617,154620534_35331068705_522317,step_2,2017-06-01 11:58:08
4,6019349,442094451_91531546617,154620534_35331068705_522317,step_1,2017-06-01 11:57:58
...,...,...,...,...,...
412259,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:46:10
412260,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:45:29
412261,9668240,388766751_9038881013,922267647_3096648104_968866,step_1,2017-05-24 18:44:51
412262,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:44:34


In [8]:
import pandas as pd

df = pd.read_csv("../data/df_final_experiment_clients.txt")
experiment_clients_df = df.copy()

experiment_clients_df

Unnamed: 0,client_id,Variation
0,9988021,Test
1,8320017,Test
2,4033851,Control
3,1982004,Test
4,9294070,Control
...,...,...
70604,2443347,
70605,8788427,
70606,266828,
70607,1266421,


# Sanity check of the databases

### Important notes

Things to pay attention to while merging:
- Make sure that each client_id is in either the control group or the test group, but not both.
- One client_id can have multiple visitor_ids, but not the other way around: each visitor_id should belong to exactly one client_id.
- We merge dataframes on client_id, which is the key column shared by the three tables.

How to define the time spent per step:
- Each row provides the timestamp at which the client initiated a step.
- Therefore, we need to group by client_id and sort the timestamps in chronological order.
- Then we need to create two more columns: "duration", and "success", a boolean column that specifies whether the client proceeded or took a step back (error) at each step.
- Lastly, we need to create a column for our most important KPI, "conversion": the customer has gone through all steps and completed the confirmation.
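
The three columns described above can be sketched on a toy log. This is only an illustration under stated assumptions: the funnel order in `STEP_ORDER` is inferred from the process_step labels visible in the web data, the rows are made up, and "success" is defined here as "the event moves forward relative to the client's previous event", which is one possible reading of the rule.

```python
import pandas as pd

# Assumed funnel order, inferred from the process_step labels in the web data.
STEP_ORDER = {"start": 0, "step_1": 1, "step_2": 2, "step_3": 3, "confirm": 4}

# Toy log with made-up values, same column names as the web data.
log = pd.DataFrame({
    "client_id": [1, 1, 1, 1, 1, 2, 2, 2],
    "process_step": ["start", "step_1", "step_2", "step_3", "confirm",
                     "start", "step_1", "start"],
    "date_time": pd.to_datetime([
        "2017-04-01 10:00:00", "2017-04-01 10:00:30", "2017-04-01 10:01:10",
        "2017-04-01 10:02:00", "2017-04-01 10:02:45",
        "2017-04-02 09:00:00", "2017-04-02 09:00:20", "2017-04-02 09:01:00",
    ]),
})

log = log.sort_values(["client_id", "date_time"]).reset_index(drop=True)
log["step_num"] = log["process_step"].map(STEP_ORDER)

# duration: time until the same client's next event (NaT on the last event).
log["duration"] = log.groupby("client_id")["date_time"].shift(-1) - log["date_time"]

# success: True when the event moves forward relative to the previous one.
# The first event of a client counts as a success by convention.
prev_step = log.groupby("client_id")["step_num"].shift(1)
log["success"] = prev_step.isna() | (log["step_num"] > prev_step)

# conversion: the client has visited every step, including "confirm".
conversion = log.groupby("client_id")["step_num"].nunique().eq(len(STEP_ORDER))
log["conversion"] = log["client_id"].map(conversion)

print(log[["client_id", "process_step", "duration", "success", "conversion"]])
```

Client 2 goes back from step_1 to start, so that last row gets success = False and the client is not counted as converted.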

However, here are some important biases we need to account for:
- Session fragmentation -> Create a column for "SESSIONS" per client

A single client producing several visitor_id values within the same experiment window may represent broken sessions rather than distinct attempts. The simple check is: count visitor_id per client_id. If most clients have one and a few have many, inspect their time ordering. If multiple visitor_id values overlap in time, they likely represent a single attempt. A simple rule:
If two visitor_ids within the same client_id occur less than ~5 minutes apart and both start at step 1, you can treat them as the same attempt. If not, leave them separate.
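
A minimal sketch of this rule on made-up rows follows. Two things are assumptions rather than facts from the data: `FIRST_STEP = "start"` stands in for "step 1" (swap in `"step_1"` if that is the intended label), and each visitor_id is compared only with the previous visitor_id of the same client, so nearby starts chain into one attempt.

```python
import pandas as pd

FIRST_STEP = "start"               # assumption: label of the first funnel step
MAX_GAP = pd.Timedelta(minutes=5)  # the ~5 minute threshold from the rule

# Toy web log with made-up values.
web = pd.DataFrame({
    "client_id": [1, 1, 1, 1, 2, 2],
    "visitor_id": ["a", "a", "b", "c", "d", "e"],
    "process_step": ["start", "step_1", "start", "start", "start", "step_2"],
    "date_time": pd.to_datetime([
        "2017-04-01 10:00:00", "2017-04-01 10:01:00", "2017-04-01 10:03:00",
        "2017-04-01 15:00:00", "2017-04-02 09:00:00", "2017-04-02 09:02:00",
    ]),
})

# First event of every (client_id, visitor_id) pair, ordered in time per client.
first_events = (
    web.sort_values("date_time")
       .groupby(["client_id", "visitor_id"], as_index=False)
       .first()
       .sort_values(["client_id", "date_time"])
       .reset_index(drop=True)
)
first_events["starts_first"] = first_events["process_step"].eq(FIRST_STEP).astype(int)

by_client = first_events.groupby("client_id")
gap = by_client["date_time"].diff()                       # NaT for a client's first visitor
prev_starts_first = by_client["starts_first"].shift(1).eq(1)

# Same attempt: close in time, and both this and the previous visitor start at FIRST_STEP.
same_attempt = (gap < MAX_GAP) & first_events["starts_first"].eq(1) & prev_starts_first
first_events["attempt_id"] = (~same_attempt).cumsum()

# Carry the attempt label back to the event-level log.
web = web.merge(
    first_events[["client_id", "visitor_id", "attempt_id"]],
    on=["client_id", "visitor_id"],
    how="left",
)
print(web)
```

Here visitors "a" and "b" of client 1 start three minutes apart and are merged into one attempt, while "c" (five hours later) and "e" (does not begin at the first step) stay separate.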

- Inconsistent step ordering.

For each client_id, sort by date_time and verify that process_step never jumps backward by more than one. Small backward moves usually indicate page refreshes. Large jumps indicate noise.
We should flag sequences where process_step is not monotonically increasing and either exclude them or report them as noisy.
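
One way to flag these sequences is sketched below on made-up rows. The step order in `STEP_ORDER` and the three labels ("monotonic", "minor_backstep", "noisy") are assumptions introduced for the example; a repeated step (delta of 0) is treated as harmless here.

```python
import pandas as pd

# Assumed funnel order, inferred from the process_step labels in the web data.
STEP_ORDER = {"start": 0, "step_1": 1, "step_2": 2, "step_3": 3, "confirm": 4}

# Toy log: one event per minute, made-up values.
t0 = pd.Timestamp("2017-04-01 10:00:00")
web = pd.DataFrame({
    "client_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "process_step": ["start", "step_1", "step_2",            # clean sequence
                     "start", "step_1", "start",             # one step back
                     "step_1", "step_2", "step_3", "step_1"],  # two steps back
    "date_time": t0 + pd.to_timedelta(list(range(10)), unit="min"),
})

web = web.sort_values(["client_id", "date_time"]).reset_index(drop=True)
web["step_num"] = web["process_step"].map(STEP_ORDER)

# Change in step position between consecutive events of the same client.
web["step_delta"] = web.groupby("client_id")["step_num"].diff()

def classify(min_delta: float) -> str:
    """Label a client by its largest backward move (hypothetical labels)."""
    if min_delta >= 0:
        return "monotonic"
    if min_delta == -1:
        return "minor_backstep"
    return "noisy"

# Clients with a single event have no delta at all; treat them as monotonic.
worst_move = web.groupby("client_id")["step_delta"].min().fillna(0)
labels = worst_move.apply(classify)
print(labels)
```

The "noisy" clients can then be excluded or reported separately, depending on which option we choose.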

- Temporal truncation.

Our experiment has a fixed end date. Any visit_id whose final event occurs near that boundary might not have had time to finish. Compute the time difference between the last observed step and the experiment end. If the gap is very small, treat the session as incomplete by truncation rather than failure. You can either exclude them or keep them but acknowledge the ambiguity.
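
The check can be sketched as follows. Both `EXPERIMENT_END` and `CENSOR_WINDOW` are hypothetical placeholders, not values taken from the data, and the rows are made up.

```python
import pandas as pd

EXPERIMENT_END = pd.Timestamp("2017-06-20 23:59:59")  # hypothetical end of the experiment
CENSOR_WINDOW = pd.Timedelta(minutes=30)              # hypothetical "very small gap"

# Toy web log with made-up values.
web = pd.DataFrame({
    "visit_id": ["v1", "v1", "v2", "v2", "v3"],
    "process_step": ["start", "confirm", "start", "step_1", "start"],
    "date_time": pd.to_datetime([
        "2017-06-10 10:00:00", "2017-06-10 10:05:00",
        "2017-06-20 23:40:00", "2017-06-20 23:50:00",
        "2017-06-15 12:00:00",
    ]),
})

per_visit = (
    web.assign(is_confirm=web["process_step"].eq("confirm"))
       .groupby("visit_id")
       .agg(last_seen=("date_time", "max"), converted=("is_confirm", "any"))
)
per_visit["time_to_end"] = EXPERIMENT_END - per_visit["last_seen"]

# Default is "abandoned"; unfinished visits close to the boundary become "censored".
per_visit["status"] = "abandoned"
per_visit.loc[per_visit["time_to_end"] < CENSOR_WINDOW, "status"] = "censored"
per_visit.loc[per_visit["converted"], "status"] = "converted"
print(per_visit)
```

Keeping a separate "censored" status lets us report results with and without these visits instead of silently counting them as failures.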

- Arm misclassification.

Each client_id should appear exactly once, and in exactly one group, in the experiment file. If duplicates appear, or if a client_id in the web logs is missing from the experiment file, flag and exclude it.




### Step 1: Experiment data - Load and sanity-check each table.

Verify row counts, missing client_id values, duplicate client_id values in the experiment roster, and visitor_id values that map to more than one client_id in the web logs. This establishes whether the dataset is suitable for merging.
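
A small helper can make these checks repeatable across tables. The function name `sanity_report` and the toy rows are hypothetical. The visitor_id check is interpreted here as "one visitor_id appearing under several client_ids", since a visitor_id naturally repeats across the rows of a session.

```python
import pandas as pd

def sanity_report(df: pd.DataFrame, key: str = "client_id") -> dict:
    """Row count, missing keys and repeated keys for one table (hypothetical helper)."""
    return {
        "rows": len(df),
        "missing_key": int(df[key].isna().sum()),
        "duplicate_key_rows": int(df[key].dropna().duplicated().sum()),
    }

# Toy tables with made-up values.
roster = pd.DataFrame({
    "client_id": [1, 2, 2, None],
    "Variation": ["Test", "Control", "Control", "Test"],
})
web = pd.DataFrame({
    "client_id": [1, 1, 2, 3],
    "visitor_id": ["a", "a", "b", "b"],
})

roster_report = sanity_report(roster)
print(roster_report)

# visitor_id values that are attached to more than one client_id.
clients_per_visitor = web.groupby("visitor_id")["client_id"].nunique()
shared_visitors = clients_per_visitor[clients_per_visitor > 1]
print(shared_visitors)
```

On the real data the same call would be `sanity_report(experiment_clients_df)`, and likewise for the other dataframes loaded above.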

### Step 2: Experiment data - Validate absence of arm misclassification.

Confirm that each client_id appears exactly once in the experiment file and belongs to exactly one group. Flag and exclude any duplicated client_id, and any client_id that appears in the web logs but not in the experiment file.
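
The validation could look like the sketch below, run on made-up rows. Treating an empty Variation as "unassigned" is an assumption added here, motivated by the blank Variation values at the tail of the experiment table preview.

```python
import pandas as pd

# Toy roster and web log with made-up values.
roster = pd.DataFrame({
    "client_id": [1, 2, 3, 3, 4],
    "Variation": ["Test", "Control", "Test", "Control", None],
})
web = pd.DataFrame({"client_id": [1, 1, 2, 3, 5]})

# Clients assigned to more than one arm.
arms_per_client = roster.groupby("client_id")["Variation"].nunique()
in_both_arms = set(arms_per_client[arms_per_client > 1].index)

# Clients listed more than once in the roster.
listed_twice = set(roster.loc[roster["client_id"].duplicated(keep=False), "client_id"])

# Clients with no arm at all (assumption: empty Variation means unassigned).
unassigned = set(roster.loc[roster["Variation"].isna(), "client_id"])

# Clients seen in the web logs but absent from the roster.
not_in_roster = set(web["client_id"]) - set(roster["client_id"])

excluded = in_both_arms | listed_twice | unassigned | not_in_roster
clean_web = web[~web["client_id"].isin(excluded)]

print(f"Excluded clients: {sorted(excluded)}")
print(clean_web)
```

Keeping the four sets separate makes it easy to report how many clients were dropped for each reason.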


### Step 3: Web logs data - Inspect session multiplicity

Group web logs by client_id and count distinct visitor_id. If most clients have one and a minority have many, keep visitor_id as the unit of analysis. Only collapse visitor_id when two IDs begin a step-1 sequence within minutes of each other.

In [19]:
# Group by client_id and count unique visitor_id
visitors_per_client = final_web_data_1_df.groupby('client_id')['visitor_id'].nunique()

# Convert to DataFrame
visitors_per_client_df = visitors_per_client.reset_index()
visitors_per_client_df.columns = ['client_id', 'num_visitors']

print(f"\nTotal number of clients: {len(visitors_per_client_df)}")


Total number of clients: 58391


In [20]:
# Count: how many clients have 1 visitor, 2 visitors,...
distribution = visitors_per_client_df['num_visitors'].value_counts().sort_index()

print("Distribution of visitors per client:")
print(distribution)



Distribution of visitors per client:
num_visitors
1     54079
2      3742
3       423
4        92
5        22
6        17
7        13
8         2
12        1
Name: count, dtype: int64



In [21]:
# Calculate percentages
total_clients = len(visitors_per_client_df)

print("Percentages")

for num_visitors, count in distribution.items():
    percentage = (count / total_clients) * 100
    print(f"{num_visitors} visitor(s): {count} clients ({percentage:.2f}%)")


Percentages
1 visitor(s): 54079 clients (92.62%)
2 visitor(s): 3742 clients (6.41%)
3 visitor(s): 423 clients (0.72%)
4 visitor(s): 92 clients (0.16%)
5 visitor(s): 22 clients (0.04%)
6 visitor(s): 17 clients (0.03%)
7 visitor(s): 13 clients (0.02%)
8 visitor(s): 2 clients (0.00%)
12 visitor(s): 1 clients (0.00%)


In [12]:
# Count clients with exactly 1 visitor
clients_with_one = (visitors_per_client_df['num_visitors'] == 1).sum()
percentage_one = (clients_with_one / total_clients) * 100

# Count clients with multiple visitors
clients_with_many = (visitors_per_client_df['num_visitors'] > 1).sum()
percentage_many = (clients_with_many / total_clients) * 100

print(f"Clients with exactly 1 visitor: {clients_with_one} ({percentage_one:.2f}%)")
print(f"Clients with multiple visitors: {clients_with_many} ({percentage_many:.2f}%)")


Clients with exactly 1 visitor: 54079 (92.62%)
Clients with multiple visitors: 4312 (7.38%)


In [13]:
# Get clients with more than 1 visitor
clients_multiple = visitors_per_client_df[visitors_per_client_df['num_visitors'] > 1]

print(f"Total clients with multiple visitors: {len(clients_multiple)}")
print("\nTop 20 clients with most visitors:")
print(clients_multiple.sort_values('num_visitors', ascending=False).head(20))

Total clients with multiple visitors: 4312

Top 20 clients with most visitors:
       client_id  num_visitors
52685    9008485            12
55586    9511606             8
45772    7818040             8
27214    4668884             7
26425    4538983             7
44294    7569607             7
10129    1758062             7
24065    4125715             7
34476    5907556             7
12303    2128341             7
41120    7033293             7
48336    8256491             7
42184    7216759             7
36820    6305830             7
32652    5601303             7
55500    9497505             7
40219    6879419             6
43767    7475602             6
43897    7499381             6
33523    5756542             6


### Step 4: Web logs data - Account for Session Fragmentation - Build "session" timelines

For each (client_id, visitor_id), sort by date_time and check the monotonicity of process_step. Minor regressions can be tolerated; major reversals get flagged and excluded.

### Step 5: Web logs data - Compute step durations and success-level outcomes.

Within each sorted sequence, compute time between steps, derive a success flag per step, and classify the visitor_id as converted or not.

### Step 6: Temporal truncation

Any attempt whose last timestamp sits very close to the experiment’s end can be marked ambiguous. We may exclude them or keep them with a clear note that their status is censored.

# Merging the databases

Merge on client_id, using the experiment client_id as the base, since these are the only customers that matter.

We prefer a "left" merge over an "inner" merge because how="left" keeps all experiment clients, including:
- those who never visited (web columns NaN)
- those missing demographics (demo columns NaN)

An inner merge would silently drop:
- assigned clients with no web activity
- assigned clients missing demographics

For an A/B test, dropping assigned-but-inactive clients biases completion rates, so left is preferable. We can always later filter to “clients with web activity and complete demographics” explicitly, instead of letting the join hide them.
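
The difference is easy to see on a toy example. The rows below are made up; `indicator=True` is a standard option of `DataFrame.merge` that adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Toy roster and web log with made-up values; client 2 never visited.
experiment = pd.DataFrame({
    "client_id": [1, 2, 3],
    "Variation": ["Test", "Control", "Test"],
})
web = pd.DataFrame({
    "client_id": [1, 1, 3],
    "process_step": ["start", "confirm", "start"],
})

left = experiment.merge(web, on="client_id", how="left", indicator=True)
inner = experiment.merge(web, on="client_id", how="inner")

# Assigned clients with no web activity survive the left merge only.
inactive = left.loc[left["_merge"] == "left_only", "client_id"].tolist()

print(f"Clients after left merge:  {left['client_id'].nunique()}")
print(f"Clients after inner merge: {inner['client_id'].nunique()}")
print(f"Assigned but inactive:     {inactive}")
```

With the left merge, the later filter to "clients with web activity" becomes an explicit line such as `left[left["_merge"] == "both"]` rather than a side effect of the join.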

In [7]:
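# Planned merge order, using the dataframes loaded above.

# 1. Attach demographics to the experiment roster
exp_demo = experiment_clients_df.merge(final_demo_df, on="client_id", how="left")

# 2. Stack the two web-log files into one table
web = pd.concat([final_web_data_1_df, final_web_data_2_df], ignore_index=True)

# 3. Attach web activity to every experiment client
full = exp_demo.merge(web, on="client_id", how="left")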
