# Importing the databases

In [2]:
import pandas as pd

df = pd.read_csv("../data/df_final_demo.txt")
final_demo_df = df.copy()

final_demo_df

Unnamed: 0,client_id,clnt_tenure_yr,clnt_tenure_mnth,clnt_age,gendr,num_accts,bal,calls_6_mnth,logons_6_mnth
0,836976,6.0,73.0,60.5,U,2.0,45105.30,6.0,9.0
1,2304905,7.0,94.0,58.0,U,2.0,110860.30,6.0,9.0
2,1439522,5.0,64.0,32.0,U,2.0,52467.79,6.0,9.0
3,1562045,16.0,198.0,49.0,M,2.0,67454.65,3.0,6.0
4,5126305,12.0,145.0,33.0,F,2.0,103671.75,0.0,3.0
...,...,...,...,...,...,...,...,...,...
70604,7993686,4.0,56.0,38.5,U,3.0,1411062.68,5.0,5.0
70605,8981690,12.0,148.0,31.0,M,2.0,101867.07,6.0,6.0
70606,333913,16.0,198.0,61.5,F,2.0,40745.00,3.0,3.0
70607,1573142,21.0,255.0,68.0,M,3.0,475114.69,4.0,4.0


In [3]:
import pandas as pd

df = pd.read_csv("../data/df_final_web_data_pt_1.txt")
final_web_data_1_df = df.copy()

final_web_data_1_df

Unnamed: 0,client_id,visitor_id,visit_id,process_step,date_time
0,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:27:07
1,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:26:51
2,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:19:22
3,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:19:13
4,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:18:04
...,...,...,...,...,...
343136,2443347,465784886_73090545671,136329900_10529659391_316129,confirm,2017-03-31 15:15:46
343137,2443347,465784886_73090545671,136329900_10529659391_316129,step_3,2017-03-31 15:14:53
343138,2443347,465784886_73090545671,136329900_10529659391_316129,step_2,2017-03-31 15:12:08
343139,2443347,465784886_73090545671,136329900_10529659391_316129,step_1,2017-03-31 15:11:37


In [4]:
import pandas as pd

df = pd.read_csv("../data/df_final_web_data_pt_2.txt")
final_web_data_2_df = df.copy()

final_web_data_2_df

Unnamed: 0,client_id,visitor_id,visit_id,process_step,date_time
0,763412,601952081_10457207388,397475557_40440946728_419634,confirm,2017-06-06 08:56:00
1,6019349,442094451_91531546617,154620534_35331068705_522317,confirm,2017-06-01 11:59:27
2,6019349,442094451_91531546617,154620534_35331068705_522317,step_3,2017-06-01 11:58:48
3,6019349,442094451_91531546617,154620534_35331068705_522317,step_2,2017-06-01 11:58:08
4,6019349,442094451_91531546617,154620534_35331068705_522317,step_1,2017-06-01 11:57:58
...,...,...,...,...,...
412259,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:46:10
412260,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:45:29
412261,9668240,388766751_9038881013,922267647_3096648104_968866,step_1,2017-05-24 18:44:51
412262,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:44:34


In [5]:
import pandas as pd

df = pd.read_csv("../data/df_final_experiment_clients.txt")
experiment_clients_df = df.copy()

experiment_clients_df

Unnamed: 0,client_id,Variation
0,9988021,Test
1,8320017,Test
2,4033851,Control
3,1982004,Test
4,9294070,Control
...,...,...
70604,2443347,
70605,8788427,
70606,266828,
70607,1266421,


# Sanity check of the databases

### Important notes

Things to pay attention to while merging:
- Make sure that each client_id is in either the control group or the test group, but not both.
- One client_id can have multiple visitor_id values, but not the other way around: each visitor_id should map to exactly one client_id.
- We merge dataframes on client_id, which is the column common to the three tables.
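
The visitor_id rule above can be checked directly. This is a minimal sketch on a made-up toy frame (the values and the name `web` are illustrative, not taken from our files): it counts how many distinct client_id values sit behind each visitor_id and lists the visitor_id values shared by more than one client.

```python
import pandas as pd

# Toy stand-in for the web log (values are made up for illustration).
web = pd.DataFrame({
    "client_id": [1, 1, 2, 3],
    "visitor_id": ["a", "b", "c", "c"],
})

# Distinct client_id values behind each visitor_id; more than 1 breaks the rule.
clients_per_visitor = web.groupby("visitor_id")["client_id"].nunique()
shared_visitors = clients_per_visitor[clients_per_visitor > 1].index.tolist()

print(shared_visitors)
```

On the real data, the same two lines would run against the concatenated web logs; an empty list means the assumption holds.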

How to define the time spent per step:
- Each row provides the timestamp at which the client initiated a step.
- Therefore, we need to group by client_id and sort the timestamps in chronological order.
- Then we need to create two more columns: one for "duration" and one for "success", a boolean column that specifies whether the client proceeded or took a step back (error) at each step.
- Lastly, we need to create a column for our most important KPI, "conversion": a customer converted if they proceeded through all steps and finalized the confirmation.

However, here are some important biases we need to account for:
- Session fragmentation -> Create a column for "SESSIONS" per client

A single client producing several visitor_id values within the same experiment window may represent broken sessions rather than distinct attempts. The simple check is: count visitor_id per client_id. If most clients have one and a few have many, inspect their time ordering. If multiple visitor_id values overlap in time, they likely represent a single attempt. A simple rule:
If two visitor_ids within the same client_id occur less than ~5 minutes apart and both start at step 1, you can treat them as the same attempt. If not, leave them separate.
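
A minimal sketch of this rule on made-up data follows. Two things are assumptions rather than facts from our files: we treat the `start` event as the first step of the funnel, and we use a strict 5-minute threshold (`MAX_GAP`). The column name `attempt` is hypothetical.

```python
import pandas as pd

# Toy log for one client with three visitor_ids (values are made up).
logs = pd.DataFrame({
    "client_id": [1, 1, 1, 1, 1],
    "visitor_id": ["a", "a", "b", "b", "c"],
    "process_step": ["start", "step_1", "start", "step_1", "start"],
    "date_time": pd.to_datetime([
        "2017-04-01 10:00:00", "2017-04-01 10:01:00",
        "2017-04-01 10:03:00", "2017-04-01 10:04:00",
        "2017-04-02 09:00:00",
    ]),
})

FIRST_STEP = "start"                 # assumption: first step of the funnel
MAX_GAP = pd.Timedelta(minutes=5)    # assumption: threshold from the rule above

# First event of each (client_id, visitor_id) pair, ordered in time per client.
firsts = (
    logs.sort_values("date_time")
        .groupby(["client_id", "visitor_id"], as_index=False)
        .first()
        .sort_values(["client_id", "date_time"])
)

# Compare each visitor_id with the previous one of the same client.
gap = firsts.groupby("client_id")["date_time"].diff()
prev_first_step = firsts.groupby("client_id")["process_step"].shift()
same_attempt = (
    (gap < MAX_GAP)
    & (firsts["process_step"] == FIRST_STEP)
    & (prev_first_step == FIRST_STEP)
)

# A new attempt starts whenever the rule does not link to the previous visitor_id.
firsts["attempt"] = (~same_attempt).astype(int).groupby(firsts["client_id"]).cumsum()
print(firsts[["client_id", "visitor_id", "attempt"]])
```

Here visitor_ids `a` and `b` start three minutes apart and share attempt 1, while `c` starts the next day and becomes attempt 2.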

- Inconsistent step ordering.

For each client_id, sort by date_time and verify that process_step never jumps backward by more than one. Small backward moves usually indicate page refreshes. Large jumps indicate noise.
We should flag sequences where process_step is not monotonically increasing and either exclude them or report them as noisy.

- Temporal truncation.

Our experiment has a fixed end date. Any visit_id whose final event occurs near that boundary might not have had time to finish. Compute the time difference between the last observed step and the experiment end. If the gap is very small, treat the session as incomplete by truncation rather than failure. You can either exclude them or keep them but acknowledge the ambiguity.

- Arm misclassification.

Each client_id should appear exactly once, and in exactly one group, in the experiment file. If duplicates appear, or if a client_id in the web logs is missing from the experiment file, flag and exclude it.




### Step 1: Experiment data - Load and sanity-check each table.

Verify row counts, missing client_id, duplicate client_id in the experiment roster. This establishes whether the dataset is even suitable for merging.

In [6]:
#Creating copies
final_demo_v1 = final_demo_df.copy(deep=True)
final_web1_v1 = final_web_data_1_df.copy(deep=True)
final_web2_v1 = final_web_data_2_df.copy(deep=True)
experiment_clients_v1  = experiment_clients_df.copy(deep=True)

In [27]:
#final_demo_v1 Table
print(final_demo_v1.shape)
print(final_demo_v1.columns)

(70609, 9)
Index(['client_id', 'clnt_tenure_yr', 'clnt_tenure_mnth', 'clnt_age', 'gendr',
       'num_accts', 'bal', 'calls_6_mnth', 'logons_6_mnth'],
      dtype='object')


In [29]:
final_demo_v1.isna().sum()

client_id            0
clnt_tenure_yr      14
clnt_tenure_mnth    14
clnt_age            15
gendr               14
num_accts           14
bal                 14
calls_6_mnth        14
logons_6_mnth       14
dtype: int64

In [30]:
final_demo_v1['client_id'].duplicated().sum()

np.int64(0)

In [31]:
final_demo_v1['gendr'].value_counts(dropna=False)

gendr
U      24122
M      23724
F      22746
NaN       14
X          3
Name: count, dtype: int64

In [32]:
final_demo_v1.describe()

Unnamed: 0,client_id,clnt_tenure_yr,clnt_tenure_mnth,clnt_age,num_accts,bal,calls_6_mnth,logons_6_mnth
count,70609.0,70595.0,70595.0,70594.0,70595.0,70595.0,70595.0,70595.0
mean,5004992.0,12.05295,150.659367,46.44224,2.255528,147445.2,3.382478,5.56674
std,2877278.0,6.871819,82.089854,15.591273,0.534997,301508.7,2.23658,2.353286
min,169.0,2.0,33.0,13.5,1.0,13789.42,0.0,1.0
25%,2519329.0,6.0,82.0,32.5,2.0,37346.83,1.0,4.0
50%,5016978.0,11.0,136.0,47.0,2.0,63332.9,3.0,5.0
75%,7483085.0,16.0,192.0,59.0,2.0,137544.9,6.0,7.0
max,9999839.0,62.0,749.0,96.0,8.0,16320040.0,7.0,9.0


In [33]:
final_demo_v1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70609 entries, 0 to 70608
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   client_id         70609 non-null  int64  
 1   clnt_tenure_yr    70595 non-null  float64
 2   clnt_tenure_mnth  70595 non-null  float64
 3   clnt_age          70594 non-null  float64
 4   gendr             70595 non-null  object 
 5   num_accts         70595 non-null  float64
 6   bal               70595 non-null  float64
 7   calls_6_mnth      70595 non-null  float64
 8   logons_6_mnth     70595 non-null  float64
dtypes: float64(7), int64(1), object(1)
memory usage: 4.8+ MB


In [7]:
#Row counts
print("final_demo_v1:", final_demo_v1.shape)
print("final_web1_v1:", final_web1_v1.shape)
print("final_web2_v1:", final_web2_v1.shape)
print("experiment_clients_v1 :", experiment_clients_v1.shape)

final_demo_v1: (70609, 9)
final_web1_v1: (343141, 5)
final_web2_v1: (412264, 5)
experiment_clients_v1 : (70609, 2)


In [8]:
#Tables' columns
print("demo columns:", final_demo_v1.columns)
print("web1 columns:", final_web1_v1.columns)
print("web2 columns:", final_web2_v1.columns)
print("experiment columns:", experiment_clients_v1.columns)

demo columns: Index(['client_id', 'clnt_tenure_yr', 'clnt_tenure_mnth', 'clnt_age', 'gendr',
       'num_accts', 'bal', 'calls_6_mnth', 'logons_6_mnth'],
      dtype='object')
web1 columns: Index(['client_id', 'visitor_id', 'visit_id', 'process_step', 'date_time'], dtype='object')
web2 columns: Index(['client_id', 'visitor_id', 'visit_id', 'process_step', 'date_time'], dtype='object')
experiment columns: Index(['client_id', 'Variation'], dtype='object')


In [9]:
print("demo data types:", "\n", final_demo_v1.dtypes)

demo data types: 
 client_id             int64
clnt_tenure_yr      float64
clnt_tenure_mnth    float64
clnt_age            float64
gendr                object
num_accts           float64
bal                 float64
calls_6_mnth        float64
logons_6_mnth       float64
dtype: object


In [10]:
print("web1 data types:", "\n", final_web1_v1.dtypes)

web1 data types: 
 client_id        int64
visitor_id      object
visit_id        object
process_step    object
date_time       object
dtype: object


In [11]:
print("web2 data types:", "\n", final_web2_v1.dtypes)

web2 data types: 
 client_id        int64
visitor_id      object
visit_id        object
process_step    object
date_time       object
dtype: object


In [12]:
print("experiment data types:", "\n", experiment_clients_v1.dtypes)

experiment data types: 
 client_id     int64
Variation    object
dtype: object


In [13]:
#They all have client_id column, next:
#Checking for missing values in client_id column
print("demo missing client_id:", final_demo_v1["client_id"].isna().sum())
print("web1 missing client_id:", final_web1_v1["client_id"].isna().sum())
print("web2 missing client_id:", final_web2_v1["client_id"].isna().sum())
print("experiment missing client_id:", experiment_clients_v1["client_id"].isna().sum())

demo missing client_id: 0
web1 missing client_id: 0
web2 missing client_id: 0
experiment missing client_id: 0


In [14]:
#Duplicated values in client_id column
experiment_clients_v1["client_id"].duplicated().sum()

np.int64(0)

In [15]:
#Demo Table
final_demo_v1["client_id"].duplicated().sum()

np.int64(0)

In [26]:
final_demo_v1["client_id"].nunique()

70609

In [16]:
#Web tables (non-zero is normal: each client logs one row per event)
print(final_web1_v1["client_id"].duplicated().sum())
print(final_web2_v1["client_id"].duplicated().sum())

284750
344834


In [17]:
#Check the experiment groups (important preview)
experiment_clients_v1["Variation"].value_counts(dropna=False)

Variation
Test       26968
Control    23532
NaN        20109
Name: count, dtype: int64

In [18]:
experiment_clients_v1["Variation"].unique()

array(['Test', 'Control', nan], dtype=object)

In [19]:
# The two web files share the same columns, so we stack them with concat rather than merge
final_web_v2 = pd.concat([final_web1_v1, final_web2_v1], ignore_index=True).copy(deep=True)

print("final_web1_v1:", final_web1_v1.shape)
print("final_web2_v1:", final_web2_v1.shape)
print("final_web_v2 :", final_web_v2.shape)

final_web1_v1: (343141, 5)
final_web2_v1: (412264, 5)
final_web_v2 : (755405, 5)


In [20]:
#column check
print("Columns:", final_web_v2.columns.tolist())

Columns: ['client_id', 'visitor_id', 'visit_id', 'process_step', 'date_time']


### Step 2: Experiment data - Validate absence of arm misclassification.

Each client_id should appear exactly once, and in exactly one group, in the experiment file. If duplicates appear, or if a client_id in the web logs is missing from the experiment file, flag and exclude it.
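
One way this validation could look, sketched on toy frames (`roster` and `web` are hypothetical names with made-up values). Excluding clients with a missing `Variation` is our own assumption here, since the rule above only names duplicates and unknown client_id values.

```python
import pandas as pd

# Toy experiment roster and web log (values are made up).
roster = pd.DataFrame({
    "client_id": [1, 2, 3, 3, 4],
    "Variation": ["Test", "Control", "Test", "Control", None],
})
web = pd.DataFrame({"client_id": [1, 2, 5]})

# client_id values listed more than once in the roster.
duplicated_ids = set(roster.loc[roster["client_id"].duplicated(keep=False), "client_id"])
# client_id values with no arm assigned (assumption: we exclude these too).
unassigned_ids = set(roster.loc[roster["Variation"].isna(), "client_id"])
# client_id values seen in the web logs but absent from the roster.
unknown_ids = set(web["client_id"]) - set(roster["client_id"])

excluded = duplicated_ids | unassigned_ids | unknown_ids
clean_roster = roster[~roster["client_id"].isin(list(excluded))]
clean_web = web[web["client_id"].isin(clean_roster["client_id"])]

print(sorted(excluded))
```

Keeping the three sets separate makes it easy to report how many clients were dropped for each reason.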


### Step 3: Web logs data - Inspect session multiplicity

Group web logs by client_id and count distinct visitor_id. If most clients have one and a minority have many, keep visitor_id as the unit of analysis. Only collapse visitor_id when two IDs begin a step-1 sequence within minutes of each other.
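
The count itself is a short groupby. The sketch below uses a made-up toy frame; on our data the same lines would presumably run against the concatenated web logs.

```python
import pandas as pd

# Toy web log (values are made up).
web = pd.DataFrame({
    "client_id": [1, 1, 2, 2, 2, 3],
    "visitor_id": ["a", "a", "b", "c", "d", "e"],
})

# Distinct visitor_id values per client.
visitors_per_client = web.groupby("client_id")["visitor_id"].nunique()

# How many clients have 1, 2, 3, ... visitor_ids.
distribution = visitors_per_client.value_counts().sort_index()

# Clients worth inspecting by hand.
multi_visitor_clients = visitors_per_client[visitors_per_client > 1].index.tolist()

print(distribution)
print(multi_visitor_clients)
```

If `distribution` is dominated by the value 1, keeping visitor_id as the unit of analysis is reasonable.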

### Step 4: Web logs data - Account for Session Fragmentation - Build "session" timelines

For each (client_id, visitor_id), sort by date_time and check the monotonicity of process_step. Minor regressions can be tolerated; major reversals get flagged and excluded.
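
A minimal sketch of the monotonicity check, on made-up data. The numeric ranking in `STEP_ORDER` is our assumption about the funnel order (start, step_1, step_2, step_3, confirm), and the labels `minor_regression` and `major_reversal` are hypothetical names: a backward move of one step counts as minor, anything larger as major.

```python
import pandas as pd

# Assumed funnel order; the step names come from process_step.
STEP_ORDER = {"start": 0, "step_1": 1, "step_2": 2, "step_3": 3, "confirm": 4}

# Toy log with three visitors (values are made up).
rows = [
    (1, "a", "start",  "2017-04-01 10:00:00"),
    (1, "a", "step_1", "2017-04-01 10:00:30"),
    (1, "a", "start",  "2017-04-01 10:01:00"),
    (1, "a", "step_1", "2017-04-01 10:01:30"),
    (2, "b", "start",  "2017-04-01 11:00:00"),
    (2, "b", "step_3", "2017-04-01 11:00:30"),
    (2, "b", "start",  "2017-04-01 11:01:00"),
    (3, "c", "start",  "2017-04-01 12:00:00"),
    (3, "c", "step_1", "2017-04-01 12:00:30"),
]
logs = pd.DataFrame(rows, columns=["client_id", "visitor_id", "process_step", "date_time"])
logs["date_time"] = pd.to_datetime(logs["date_time"])

logs = logs.sort_values(["client_id", "visitor_id", "date_time"])
logs["step_rank"] = logs["process_step"].map(STEP_ORDER)

# Change in rank between consecutive events of the same visitor.
keys = ["client_id", "visitor_id"]
logs["step_change"] = logs.groupby(keys)["step_rank"].diff()

# The most negative change decides how a sequence is labelled.
worst_backward = logs.groupby(keys)["step_change"].min()

def classify(worst_change):
    if worst_change < -1:
        return "major_reversal"
    if worst_change < 0:
        return "minor_regression"
    return "monotonic"

status = worst_backward.apply(classify)
print(status)
```

Sequences labelled `major_reversal` are the ones we would exclude or report as noisy.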

In [21]:
# starting by concatenating the two web data dataframes
import pandas as pd

final_web_data_df = pd.concat(
    [final_web_data_1_df, final_web_data_2_df],
    ignore_index=True
)

# we use the drop_duplicates method to drop rows that have exactly the same values across all columns.
final_web_data_df = final_web_data_df.drop_duplicates()

final_web_data_df.shape

(744641, 5)

In [22]:
# drop_duplicates removed roughly 1.4% of the rows (755,405 -> 744,641), which is expected when logs are split across files or exported twice. It indicates repeated rows, not a loss of behavioral data.

In [23]:
final_web_data_df.head()

Unnamed: 0,client_id,visitor_id,visit_id,process_step,date_time
0,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:27:07
1,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:26:51
2,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:19:22
3,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:19:13
4,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:18:04


In [24]:
final_web_data_df["process_step"].value_counts()

process_step
start      234999
step_1     162797
step_2     132750
step_3     111589
confirm    102506
Name: count, dtype: int64

### Step 5: Web logs data - Compute step durations and success-level outcomes.

Within each sorted sequence, compute time between steps, derive a success flag per step, and classify the visitor_id as converted or not.
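
The sketch below shows one possible shape for these columns on made-up data. It rests on several simplifying assumptions: the duration of a step is the time until the visitor's next event, `success` is True only when that next event is a later step under the assumed `STEP_ORDER`, and a visitor_id counts as converted as soon as it has a `confirm` event. The column names `duration_s` and `success` are hypothetical.

```python
import pandas as pd

# Assumed funnel order; the step names come from process_step.
STEP_ORDER = {"start": 0, "step_1": 1, "step_2": 2, "step_3": 3, "confirm": 4}

# Toy log: visitor "a" completes the funnel, visitor "b" steps back (values are made up).
logs = pd.DataFrame({
    "visitor_id": ["a"] * 5 + ["b"] * 3,
    "process_step": ["start", "step_1", "step_2", "step_3", "confirm",
                     "start", "step_1", "start"],
    "date_time": pd.to_datetime([
        "2017-04-01 10:00:00", "2017-04-01 10:00:30", "2017-04-01 10:01:30",
        "2017-04-01 10:03:00", "2017-04-01 10:04:00",
        "2017-04-01 11:00:00", "2017-04-01 11:00:20", "2017-04-01 11:01:00",
    ]),
})

logs = logs.sort_values(["visitor_id", "date_time"]).reset_index(drop=True)
logs["step_rank"] = logs["process_step"].map(STEP_ORDER)

# Next event of the same visitor (missing for the last event of each sequence).
grouped = logs.groupby("visitor_id")
next_time = grouped["date_time"].shift(-1)
next_rank = grouped["step_rank"].shift(-1)

# Seconds spent on a step before the next event.
logs["duration_s"] = (next_time - logs["date_time"]).dt.total_seconds()

# True if the visitor moved forward, False if not, missing when there is no next event.
logs["success"] = (next_rank > logs["step_rank"]).where(next_rank.notna())

# Simplified conversion flag: the visitor reached "confirm" at least once.
converted = logs["process_step"].eq("confirm").groupby(logs["visitor_id"]).any()

print(logs[["visitor_id", "process_step", "duration_s", "success"]])
print(converted)
```

A stricter conversion definition (every step visited in order before `confirm`) could replace the last flag once the ordering check from Step 4 is in place.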

### Step 6: Temporal truncation

Any attempt whose last timestamp sits very close to the experiment’s end can be marked ambiguous. We may exclude them or keep them with a clear note that their status is censored.
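
A minimal sketch of the censoring flag, grouped by visit_id. Both `EXPERIMENT_END` and `CENSOR_WINDOW` are hypothetical placeholders: the actual end date and a sensible window still need to be decided, and the toy rows are made up.

```python
import pandas as pd

# Hypothetical values, not taken from the data.
EXPERIMENT_END = pd.Timestamp("2017-06-20 23:59:59")
CENSOR_WINDOW = pd.Timedelta(minutes=30)

# Toy log with three visits (values are made up).
logs = pd.DataFrame({
    "visit_id": ["v1", "v1", "v2", "v2", "v3"],
    "process_step": ["start", "confirm", "start", "step_1", "start"],
    "date_time": pd.to_datetime([
        "2017-06-01 10:00:00", "2017-06-01 10:05:00",
        "2017-06-20 23:40:00", "2017-06-20 23:50:00",
        "2017-06-10 08:00:00",
    ]),
})

# Last observed event and conversion flag per visit.
last_event = logs.groupby("visit_id")["date_time"].max()
converted = logs["process_step"].eq("confirm").groupby(logs["visit_id"]).any()
summary = pd.DataFrame({"last_event": last_event, "converted": converted})

# Unfinished visits that end close to the boundary are ambiguous, not failures.
summary["gap_to_end"] = EXPERIMENT_END - summary["last_event"]
summary["censored"] = ~summary["converted"] & (summary["gap_to_end"] <= CENSOR_WINDOW)

print(summary)
```

Only visit `v2` is flagged: it has no `confirm` event and its last event falls inside the window before the assumed end date.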

# Merging the databases

Merge on client_id, using the experiment client_id as the base, since these are the only customers that matter.

We prefer "left" merge instead of "inner" because:
how="left" keeps all experiment clients, including:
- those who never visited (web columns NaN)
- those missing demographics (demo columns NaN)

An inner merge would silently drop:
- assigned clients with no web activity
- assigned clients missing demographics

For an A/B test, dropping assigned-but-inactive clients biases completion rates, so left is preferable. We can always later filter to “clients with web activity and complete demographics” explicitly, instead of letting the join hide them.
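
The difference is easy to see on a toy example (frames and values below are made up). The `validate` argument makes pandas raise an error if client_id is not unique where we expect it to be, and `indicator=True` adds a `_merge` column that marks experiment clients with no web activity.

```python
import pandas as pd

# Toy tables: client 2 never visited, client 3 has no demographics (values are made up).
experiment = pd.DataFrame({"client_id": [1, 2, 3], "Variation": ["Test", "Control", "Test"]})
demo = pd.DataFrame({"client_id": [1, 2], "clnt_age": [60.5, 32.0]})
web = pd.DataFrame({"client_id": [1, 1, 3], "process_step": ["start", "confirm", "start"]})

# Roster and demographics should both have one row per client.
exp_demo = experiment.merge(demo, on="client_id", how="left", validate="one_to_one")

# Left merge keeps every experiment client; _merge shows who has no web rows.
left = exp_demo.merge(web, on="client_id", how="left",
                      validate="one_to_many", indicator=True)
no_web_activity = left.loc[left["_merge"] == "left_only", "client_id"].tolist()

# Inner merge silently drops client 2.
inner = exp_demo.merge(web, on="client_id", how="inner")

print(left["client_id"].nunique(), inner["client_id"].nunique())
print(no_web_activity)
```

The left merge keeps all three clients while the inner merge keeps two, which is exactly the silent drop we want to avoid.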

In [25]:
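# Attach demographics to the experiment roster (one row per client_id in both tables)
exp_demo = experiment_clients_v1.merge(final_demo_v1, on="client_id", how="left")

# The web logs were already concatenated and de-duplicated above as final_web_data_df

# Attach web events; clients with no web activity keep NaN in the web columns
full = exp_demo.merge(final_web_data_df, on="client_id", how="left")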


# Data Cleaning