Digital Footprints (df_final_web_data): A detailed trace of client interactions online, divided into two parts: pt_1 and pt_2. It’s recommended to merge these two files prior to a comprehensive data analysis.

Metadata
The fields below describe each client and their web activity. The web data files loaded in this notebook contain only client_id, visitor_id, visit_id, process_step, and date_time.

client_id: Every client’s unique ID.
variation: Indicates if a client was part of the experiment.
visitor_id: A unique ID for each client-device combination.
visit_id: A unique ID for each web visit/session.
process_step: Marks each step in the digital process.
date_time: Timestamp of each web activity.
clnt_tenure_yr: Represents how long the client has been with Vanguard, measured in years.
clnt_tenure_mnth: Further breaks down the client’s tenure with Vanguard in months.
clnt_age: Indicates the age of the client.
gendr: Specifies the client’s gender.
num_accts: Denotes the number of accounts the client holds with Vanguard.
bal: Gives the total balance spread across all accounts for a particular client.
calls_6_mnth: Records the number of times the client reached out over a call in the past six months.
logons_6_mnth: Reflects the frequency with which the client logged onto Vanguard’s platform over the last six months.

In [5]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

pd.set_option('display.max_rows', 500)

%matplotlib inline 

In [6]:
pt_1 = "Datasets/df_final_web_data_pt_1.txt"

In [7]:
pt_2 = "Datasets/df_final_web_data_pt_2.txt"

In [8]:
df_1 = pd.read_csv(pt_1)

In [16]:
df_1.head(20)

Unnamed: 0,client_id,visitor_id,visit_id,process_step,date_time
0,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:27:07
1,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:26:51
2,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:19:22
3,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:19:13
4,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:18:04
5,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:17:15
6,9988021,580560515_7732621733,781255054_21935453173_531117,step_1,2017-04-17 15:17:01
7,9988021,580560515_7732621733,781255054_21935453173_531117,start,2017-04-17 15:16:22
8,8320017,39393514_33118319366,960651974_70596002104_312201,confirm,2017-04-05 13:10:05
9,8320017,39393514_33118319366,960651974_70596002104_312201,step_3,2017-04-05 13:09:43


In [9]:
df_2 = pd.read_csv(pt_2)

In [13]:
df_2.head(20)

Unnamed: 0,client_id,visitor_id,visit_id,process_step,date_time
0,763412,601952081_10457207388,397475557_40440946728_419634,confirm,2017-06-06 08:56:00
1,6019349,442094451_91531546617,154620534_35331068705_522317,confirm,2017-06-01 11:59:27
2,6019349,442094451_91531546617,154620534_35331068705_522317,step_3,2017-06-01 11:58:48
3,6019349,442094451_91531546617,154620534_35331068705_522317,step_2,2017-06-01 11:58:08
4,6019349,442094451_91531546617,154620534_35331068705_522317,step_1,2017-06-01 11:57:58
5,6019349,442094451_91531546617,154620534_35331068705_522317,start,2017-06-01 11:57:54
6,4726500,934350987_45569789638,467318052_88159801968_565608,confirm,2017-06-05 17:38:52
7,4726500,934350987_45569789638,467318052_88159801968_565608,step_3,2017-06-05 17:38:33
8,4726500,934350987_45569789638,467318052_88159801968_565608,step_2,2017-06-05 17:37:31
9,4726500,934350987_45569789638,467318052_88159801968_565608,step_1,2017-06-05 17:37:24


In [18]:
df_merged = pd.concat([df_1, df_2], axis=0)

In [20]:
df_merged.shape

(755405, 5)
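Note that `pd.concat` keeps each part's own row labels by default, which is why the `info()` output further down reports an index running "0 to 412263" for more than 700,000 rows. A minimal sketch on toy frames (the names `part_1` and `part_2` are illustrative stand-ins, not the real files) shows the difference that `ignore_index=True` makes:

```python
import pandas as pd

# Toy stand-ins for the two parts; the real frames come from pd.read_csv above.
part_1 = pd.DataFrame({"client_id": [1, 2], "process_step": ["start", "step_1"]})
part_2 = pd.DataFrame({"client_id": [3], "process_step": ["confirm"]})

# Default: each part keeps its own labels, so label 0 appears twice.
stacked = pd.concat([part_1, part_2], axis=0)

# ignore_index=True renumbers the rows 0..n-1.
renumbered = pd.concat([part_1, part_2], axis=0, ignore_index=True)

print(stacked.index.tolist())     # [0, 1, 0]
print(renumbered.index.tolist())  # [0, 1, 2]
```

Unique labels matter mainly when selecting rows with `.loc`; counts, grouping, and de-duplication are unaffected.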

In [60]:
df_merged

Unnamed: 0,client_id,visitor_id,visit_id,process_step,date_time
0,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:27:07
1,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:26:51
2,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:19:22
3,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:19:13
4,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:18:04
...,...,...,...,...,...
412259,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:46:10
412260,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:45:29
412261,9668240,388766751_9038881013,922267647_3096648104_968866,step_1,2017-05-24 18:44:51
412262,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:44:34


In [34]:
df_merged.isnull().sum()

client_id       0
visitor_id      0
visit_id        0
process_step    0
date_time       0
dtype: int64

In [36]:
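# The first run of this check found 10,764 fully duplicated rows (755,405 - 744,641 = 10,764).
# The 0 shown below reflects a run after the duplicates had already been removed.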

df_merged.duplicated().sum()

0

In [37]:
df_merged.drop_duplicates(inplace=True)
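Because re-running the check after the drop always prints 0, it can help to record the count in the same cell that removes the rows. A small sketch on toy data (the frame `toy` and its values are made up for illustration):

```python
import pandas as pd

# Toy event log with one exact duplicate row (rows 0 and 1 are identical).
toy = pd.DataFrame({
    "visit_id": ["a", "a", "a", "b"],
    "process_step": ["start", "start", "step_1", "start"],
    "date_time": [
        "2017-04-17 15:16:22",
        "2017-04-17 15:16:22",
        "2017-04-17 15:17:01",
        "2017-04-05 13:09:43",
    ],
})

rows_before = len(toy)
n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
deduped = toy.drop_duplicates()

# The number of rows removed should match the number of duplicates counted.
rows_removed = rows_before - len(deduped)
print(n_duplicates, rows_removed)  # 1 1
```

Applied to this notebook, the same arithmetic gives 755,405 - 744,641 = 10,764 rows removed.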

In [56]:
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Index: 744641 entries, 0 to 412263
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   client_id     744641 non-null  int64 
 1   visitor_id    744641 non-null  object
 2   visit_id      744641 non-null  object
 3   process_step  744641 non-null  object
 4   date_time     744641 non-null  object
dtypes: int64(1), object(4)
memory usage: 34.1+ MB


In [None]:
# Convert date_time from object (string) to datetime

In [58]:
df_merged["date_time"] = pd.to_datetime(df_merged.date_time)

In [59]:
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Index: 744641 entries, 0 to 412263
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   client_id     744641 non-null  int64         
 1   visitor_id    744641 non-null  object        
 2   visit_id      744641 non-null  object        
 3   process_step  744641 non-null  object        
 4   date_time     744641 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 34.1+ MB
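The conversion above lets pandas infer the timestamp layout. As a hedged alternative, the sketch below passes an explicit format matching the values shown in the tables (`YYYY-MM-DD HH:MM:SS`) and uses `errors="coerce"` so that malformed entries become `NaT` instead of raising. The series `raw` and its bad value are invented for the example:

```python
import pandas as pd

# Two timestamps in the layout seen above, plus one deliberately bad value.
raw = pd.Series(["2017-04-17 15:27:07", "2017-04-05 13:10:05", "not a date"])

# An explicit format avoids per-value inference; coerce turns failures into NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

n_unparsed = int(parsed.isna().sum())
print(n_unparsed)  # 1
```

Checking `n_unparsed` after the conversion is a quick way to confirm that no rows were silently lost.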


In [133]:
df_merged.nunique()

# 120,157 unique client_id values.
# 130,236 unique visitor_id values: some clients used more than one device to access the platform.
# 158,095 unique visit_id values: some clients visited the platform more than once.

client_id       120157
visitor_id      130236
visit_id        158095
process_step         5
date_time       629363
year                 1
month                4
day                 31
time             77640
dtype: int64
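The totals above suggest that some clients have several devices or visits, but they do not say how many clients. One way to check is to count distinct IDs per client, sketched here on a toy frame (the values are illustrative only):

```python
import pandas as pd

# Toy data: client 1 uses two devices across three visits, client 2 has one of each.
toy = pd.DataFrame({
    "client_id":  [1, 1, 1, 2, 2],
    "visitor_id": ["v1", "v1", "v2", "v3", "v3"],
    "visit_id":   ["s1", "s2", "s3", "s4", "s4"],
})

# Distinct visitor_id and visit_id values for each client.
per_client = toy.groupby("client_id")[["visitor_id", "visit_id"]].nunique()

multi_device = int((per_client["visitor_id"] > 1).sum())
multi_visit = int((per_client["visit_id"] > 1).sum())
print(multi_device, multi_visit)  # 1 1
```

On the real data, `per_client` would be computed from `df_merged` with the same `groupby` call.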

In [119]:
df_merged.process_step.unique()

array(['step_3', 'step_2', 'step_1', 'start', 'confirm'], dtype=object)

In [120]:
df_merged["year"] = df_merged["date_time"].dt.year

In [121]:
df_merged

Unnamed: 0,client_id,visitor_id,visit_id,process_step,date_time,year,month,day,hour,time
0,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:27:07,2017,4,17,15:27:07,15:27:07
1,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:26:51,2017,4,17,15:26:51,15:26:51
2,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:19:22,2017,4,17,15:19:22,15:19:22
3,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:19:13,2017,4,17,15:19:13,15:19:13
4,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:18:04,2017,4,17,15:18:04,15:18:04
...,...,...,...,...,...,...,...,...,...,...
744636,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:46:10,2017,5,24,18:46:10,18:46:10
744637,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:45:29,2017,5,24,18:45:29,18:45:29
744638,9668240,388766751_9038881013,922267647_3096648104_968866,step_1,2017-05-24 18:44:51,2017,5,24,18:44:51,18:44:51
744639,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:44:34,2017,5,24,18:44:34,18:44:34


In [122]:
df_merged.shape

(744641, 10)

In [123]:
df_merged["year"] = df_merged["date_time"].dt.year
df_merged["month"] = df_merged["date_time"].dt.month
df_merged["day"] = df_merged["date_time"].dt.day
df_merged["time"] = df_merged["date_time"].dt.time

In [124]:
df_merged

Unnamed: 0,client_id,visitor_id,visit_id,process_step,date_time,year,month,day,hour,time
0,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:27:07,2017,4,17,15:27:07,15:27:07
1,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:26:51,2017,4,17,15:26:51,15:26:51
2,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:19:22,2017,4,17,15:19:22,15:19:22
3,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:19:13,2017,4,17,15:19:13,15:19:13
4,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:18:04,2017,4,17,15:18:04,15:18:04
...,...,...,...,...,...,...,...,...,...,...
744636,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:46:10,2017,5,24,18:46:10,18:46:10
744637,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:45:29,2017,5,24,18:45:29,18:45:29
744638,9668240,388766751_9038881013,922267647_3096648104_968866,step_1,2017-05-24 18:44:51,2017,5,24,18:44:51,18:44:51
744639,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:44:34,2017,5,24,18:44:34,18:44:34
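In the table above, the `hour` column holds the same time-of-day values as `time`. If an integer hour is what the analysis needs, the `.dt` accessor provides it directly. A minimal sketch, assuming a parsed datetime series (the name `stamps` is illustrative):

```python
import pandas as pd

# Two timestamps taken from the rows displayed above.
stamps = pd.to_datetime(pd.Series(["2017-04-17 15:27:07", "2017-05-24 18:44:34"]))

parts = pd.DataFrame({
    "date_time": stamps,
    "hour": stamps.dt.hour,           # integer hour of day, 0-23
    "weekday": stamps.dt.day_name(),  # e.g. "Monday"
})

print(parts["hour"].tolist())  # [15, 18]
```

An integer hour groups cleanly with `groupby("hour")`, which is harder to do with full `datetime.time` values.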


In [129]:
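# level_0 and index are columns typically left behind by reset_index();
# errors="ignore" skips them when absent instead of raising a KeyError.
df_merged = df_merged.drop(columns=["level_0", "index"], errors="ignore")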

In [132]:
df_merged = df_merged.drop(columns = ["hour"])

In [130]:
df_merged

Unnamed: 0,client_id,visitor_id,visit_id,process_step,date_time,year,month,day,hour,time
0,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:27:07,2017,4,17,15:27:07,15:27:07
1,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:26:51,2017,4,17,15:26:51,15:26:51
2,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:19:22,2017,4,17,15:19:22,15:19:22
3,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:19:13,2017,4,17,15:19:13,15:19:13
4,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:18:04,2017,4,17,15:18:04,15:18:04
...,...,...,...,...,...,...,...,...,...,...
744636,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:46:10,2017,5,24,18:46:10,18:46:10
744637,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:45:29,2017,5,24,18:45:29,18:45:29
744638,9668240,388766751_9038881013,922267647_3096648104_968866,step_1,2017-05-24 18:44:51,2017,5,24,18:44:51,18:44:51
744639,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:44:34,2017,5,24,18:44:34,18:44:34


In [131]:
df_merged.year.unique()

array([2017], dtype=int32)

In [128]:
df_merged.month.unique()

array([4, 3, 6, 5], dtype=int32)

Next, carry out a client behaviour analysis to answer any additional relevant questions you think are important.
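As one possible starting point for that analysis, the sketch below computes two simple measures per visit: the seconds elapsed since the previous event, and whether the visit ever reached `confirm`. It runs on a toy event log; the column name `secs_since_prev` and the choice of these two measures are assumptions, not part of the brief.

```python
import pandas as pd

# Toy event log: visit "a" reaches confirm, visit "b" stops at step_1.
events = pd.DataFrame({
    "visit_id": ["a", "a", "a", "b", "b"],
    "process_step": ["start", "step_1", "confirm", "start", "step_1"],
    "date_time": pd.to_datetime([
        "2017-04-17 15:16:22", "2017-04-17 15:17:01", "2017-04-17 15:18:04",
        "2017-04-05 13:09:43", "2017-04-05 13:10:05",
    ]),
})

# Order events chronologically within each visit before differencing.
events = events.sort_values(["visit_id", "date_time"])

# Seconds since the previous event in the same visit (NaN for the first event).
events["secs_since_prev"] = (
    events.groupby("visit_id")["date_time"].diff().dt.total_seconds()
)

# True for visits that contain at least one confirm event.
reached_confirm = events["process_step"].eq("confirm").groupby(events["visit_id"]).any()
completion_rate = float(reached_confirm.mean())

print(events["secs_since_prev"].tolist())  # [nan, 39.0, 63.0, nan, 22.0]
print(completion_rate)                     # 0.5
```

Since the rows shown earlier include visits that move backwards (for example step_3 followed by step_2), any per-step timing on the real data should account for repeated and out-of-order steps.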