# Project Brief
The Digital Challenge<br>
The digital world is evolving, and so are Vanguard’s clients. Vanguard believed that a more intuitive and modern User Interface (UI), coupled with timely in-context prompts (cues, messages, hints, or instructions provided to users directly within the context of their current task or action), could make the online process smoother for clients. The critical question was: Would these changes encourage more clients to complete the process?
The team ran an A/B test from 3/15/2017 to 6/20/2017.

Control Group: Clients interacted with Vanguard’s traditional online process.

Test Group: Clients experienced the new, spruced-up digital interface.

Both groups navigated through an identical process sequence: an initial page, three subsequent steps, and finally, a confirmation page signaling process completion.<br>
The goal is to see if the new design leads to a better user experience and higher process completion rates.
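As a rough illustration of the metric this brief is after, the sketch below computes a completion rate per group on a tiny made-up event log. The toy rows and the names `paths`, `events`, `groups`, `completed` and `rates` are assumptions for illustration only. A client is counted as completing if any of their events reaches `confirm`, which is one possible definition, not necessarily the one used in the project.

```python
import pandas as pd

# Made-up clients and the steps they reached (not project data).
full_path = ["start", "step_1", "step_2", "step_3", "confirm"]
paths = {1: full_path, 2: full_path, 3: full_path, 4: ["start", "step_1"]}

# One row per page hit, mirroring the shape of the web data.
events = pd.DataFrame(
    [(cid, step) for cid, steps in paths.items() for step in steps],
    columns=["client_id", "process_step"],
)

# Made-up group assignment.
groups = pd.DataFrame({
    "client_id": [1, 2, 3, 4],
    "Variation": ["Test", "Test", "Control", "Control"],
})

# A client "completes" if at least one of their events is the confirm page.
completed = (
    events.assign(is_confirm=events["process_step"].eq("confirm"))
          .groupby("client_id")["is_confirm"].any()
          .rename("completed")
          .reset_index()
)

# Completion rate = share of clients in each group who completed.
# The inner merge keeps only clients that appear in both tables.
rates = (
    groups.merge(completed, on="client_id")
          .groupby("Variation")["completed"].mean()
)
print(rates)
```

Counting by client is a design choice; counting by `visit_id` instead would answer a slightly different question (how often a session ends in completion).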


Answer the following questions about demographics:<br>
Who are the primary clients using this online process?<br>
Are the primary clients younger or older, new or long-standing?<br>
Next, carry out a client behaviour analysis to answer any additional relevant questions you think are important.

### Columns
<br>client_id: Every client’s unique ID.
<br>Variation: The experiment group a client was assigned to (`Test` or `Control`); empty for clients who were not part of the experiment.
<br>visitor_id: A unique ID for each client-device combination.
<br>visit_id: A unique ID for each web visit/session.
<br>process_step: Marks each step in the digital process.
<br>date_time: Timestamp of each web activity.
<br>clnt_tenure_yr: Represents how long the client has been with Vanguard, measured in years.
<br>clnt_tenure_mnth: Further breaks down the client’s tenure with Vanguard in months.
<br>clnt_age: Indicates the age of the client.
<br>gendr: Specifies the client’s gender.
<br>num_accts: Denotes the number of accounts the client holds with Vanguard.
<br>bal: Gives the total balance spread across all accounts for a particular client.
<br>calls_6_mnth: Records the number of times the client reached out over a call in the past six months.
<br>logons_6_mnth: Reflects the frequency with which the client logged onto Vanguard’s platform over the last six months.
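The two tenure columns describe the same quantity at different resolutions, so it may be worth checking that they agree. The sketch below uses a few `(clnt_tenure_yr, clnt_tenure_mnth)` pairs copied from rows displayed later in this notebook and tests the assumption that the year value is the month value divided by 12 and rounded down. The names `tenure`, `expected_years` and `consistent` are illustrative.

```python
import pandas as pd

# Pairs copied from rows displayed later in this notebook.
tenure = pd.DataFrame({
    "clnt_tenure_yr":   [21.0, 3.0, 12.0, 11.0, 9.0, 31.0],
    "clnt_tenure_mnth": [262.0, 46.0, 151.0, 143.0, 109.0, 376.0],
})

# Assumption being tested: years = months // 12 (rounded down).
expected_years = tenure["clnt_tenure_mnth"] // 12
consistent = expected_years.eq(tenure["clnt_tenure_yr"])

print(consistent.all())
```

If the same check holds on the full dataset, one of the two columns is redundant for most analyses.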


## Setup

In [165]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [166]:
print(os.getcwd())

/Users/leilajavanmardi/Desktop/Leila/Coding_IronHack/Data_Analytics_Bootcamp/week5/Project/notebooks


In [167]:
# Relative paths to the data files
path1 = "df_final_demo.txt"
path2 = "df_final_experiment_clients.txt"
path3 = "df_final_web_data_pt_1.txt"
path4 = "df_final_web_data_pt_2.txt"

df_demo = pd.read_csv(path1)
df_exp = pd.read_csv(path2)
df_web_1 = pd.read_csv(path3)
df_web_2 = pd.read_csv(path4)

In [168]:
print(df_web_1.shape)
df_web_1.head(5)

(343141, 5)


Unnamed: 0,client_id,visitor_id,visit_id,process_step,date_time
0,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:27:07
1,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:26:51
2,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:19:22
3,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:19:13
4,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:18:04


In [169]:
print(df_web_2.shape)
df_web_2.tail(5)

(412264, 5)


Unnamed: 0,client_id,visitor_id,visit_id,process_step,date_time
412259,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:46:10
412260,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:45:29
412261,9668240,388766751_9038881013,922267647_3096648104_968866,step_1,2017-05-24 18:44:51
412262,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:44:34
412263,674799,947159805_81558194550,86152093_47511127657_716022,start,2017-06-03 12:17:09


#### Merging the Web datasets

In [170]:
# Stack the two web-data parts row-wise (same columns in both)
df_web = pd.concat([df_web_1, df_web_2])

Unnamed: 0,client_id,visitor_id,visit_id,process_step,date_time
285512,169,201385055_71273495308,749567106_99161211863_557568,step_3,2017-04-12 20:22:05
285511,169,201385055_71273495308,749567106_99161211863_557568,confirm,2017-04-12 20:23:09
285513,169,201385055_71273495308,749567106_99161211863_557568,step_2,2017-04-12 20:20:31
285514,169,201385055_71273495308,749567106_99161211863_557568,step_1,2017-04-12 20:19:45
285515,169,201385055_71273495308,749567106_99161211863_557568,start,2017-04-12 20:19:36
...,...,...,...,...,...
305392,9999875,738878760_1556639849,931268933_219402947_599432,step_1,2017-06-01 22:40:08
305388,9999875,738878760_1556639849,931268933_219402947_599432,confirm,2017-06-01 22:48:39
305389,9999875,738878760_1556639849,931268933_219402947_599432,step_3,2017-06-01 22:44:58
305391,9999875,738878760_1556639849,931268933_219402947_599432,step_1,2017-06-01 22:41:28
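`pd.concat` stacks rows and, by default, keeps each part's original index labels, which is why the index values above are not consecutive. A minimal sketch on made-up frames (`part_1`, `part_2` and `stacked` are hypothetical names) shows how `ignore_index=True` gives a clean index and how to confirm that no rows were lost. For the real data, the two reported shapes add up as expected: 343141 + 412264 = 755405.

```python
import pandas as pd

# Two toy parts with the same columns (made-up rows).
part_1 = pd.DataFrame({"client_id": [1, 2], "process_step": ["start", "step_1"]})
part_2 = pd.DataFrame({"client_id": [3], "process_step": ["start"]})

# ignore_index=True builds a fresh 0..n-1 index instead of
# repeating the labels 0, 1, ... from each part.
stacked = pd.concat([part_1, part_2], ignore_index=True)

# Row counts should add up and index labels should be unique.
print(len(stacked) == len(part_1) + len(part_2))
print(stacked.index.is_unique)
```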


# Initial Exploration

### Demo dataset

In [171]:
print(f'The data set has {df_demo.shape[0]} rows and {df_demo.shape[1]} columns with the following types:')
print(df_demo.dtypes)
df_demo.sort_values(by = 'client_id', inplace = True)
df_demo.reset_index( drop=True, inplace= True)
df_demo.sample(5)

The data set has 70609 rows and 9 columns with the following types:
client_id             int64
clnt_tenure_yr      float64
clnt_tenure_mnth    float64
clnt_age            float64
gendr                object
num_accts           float64
bal                 float64
calls_6_mnth        float64
logons_6_mnth       float64
dtype: object


Unnamed: 0,client_id,clnt_tenure_yr,clnt_tenure_mnth,clnt_age,gendr,num_accts,bal,calls_6_mnth,logons_6_mnth
66766,9445989,9.0,109.0,25.0,M,2.0,33954.78,7.0,7.0
6503,927516,10.0,128.0,30.5,M,2.0,132528.76,2.0,5.0
45042,6375571,10.0,126.0,68.0,M,3.0,144354.16,4.0,7.0
33466,4754533,14.0,175.0,73.0,M,2.0,172699.49,5.0,8.0
12170,1746014,31.0,376.0,71.0,M,2.0,56384.17,3.0,6.0


In [173]:
print(f'The number of unique values in Demo dataset:')

for column in df_demo.columns:
    print(f'column {column} has {df_demo[column].nunique()}')

print(f'\nUnique values in Demo dataset:\n')

for column in df_demo.columns:
    unique_values_demo = df_demo[column].unique()
    print(f'column {column}: {unique_values_demo}')

The number of unique values in Demo dataset:
column client_id has 70609
column clnt_tenure_yr has 54
column clnt_tenure_mnth has 482
column clnt_age has 165
column gendr has 4
column num_accts has 8
column bal has 70328
column calls_6_mnth has 8
column logons_6_mnth has 9

Unique values in Demo dataset:

column client_id: [    169     555     647 ... 9999729 9999832 9999839]
column clnt_tenure_yr: [21.  3. 12. 11.  9.  5.  8.  7. 48. 14. 19. 23. 13.  4. 15.  6. 16. 30.
 27. 18. 20. 22. 17. 10. 24. 26. 25. 28. 29. 43. 32. 31. 34. 36. 55. 33.
 35.  2. 51. 37. 38. nan 62. 40. 45. 39. 50. 52. 47. 44. 42. 41. 46. 54.
 49.]
column clnt_tenure_mnth: [262.  46. 151. 143. 109. 145.  66.  99.  85. 576. 177.  60. 150. 139.
 229. 280. 260. 110. 157.  98.  63. 179. 231. 142.  58. 117. 116. 189.
 190.  75.  94. 172. 199. 154.  86.  77. 253. 361.  57. 173.  92. 329.
  72.  81. 149. 155.  89. 195. 141. 106. 252. 218. 364. 140. 241. 257.
 226. 269. 170. 164.  78. 108.  56. 105.  88. 211. 205.  73.

In [179]:
df_demo.isna().sum()

client_id            0
clnt_tenure_yr      14
clnt_tenure_mnth    14
clnt_age            15
gendr               14
num_accts           14
bal                 14
calls_6_mnth        14
logons_6_mnth       14
dtype: int64
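The counts above (14 missing in most columns, 15 in `clnt_age`) suggest, though they do not prove, that a handful of rows are empty apart from `client_id`. The sketch below shows one way to separate fully empty rows from partially empty ones. It runs on a made-up three-row frame, and the names `demo`, `value_cols`, `all_missing` and `some_missing` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 complete, row 1 fully empty, row 2 missing only the age.
demo = pd.DataFrame({
    "client_id": [1, 2, 3],
    "clnt_age": [47.5, np.nan, np.nan],
    "bal": [501570.72, np.nan, 25454.66],
})

# Every column except the identifier.
value_cols = demo.columns.drop("client_id")

# Rows where every value column is missing vs. only some of them.
all_missing = demo[value_cols].isna().all(axis=1)
some_missing = demo[value_cols].isna().any(axis=1) & ~all_missing

print(all_missing.sum(), some_missing.sum())
```

Fully empty rows carry no demographic information and are natural candidates for removal, while partially empty rows may still be usable for analyses that do not need the missing field.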

In [190]:
demo_num_col = df_demo.select_dtypes(include=['number']).drop(columns = ['client_id'])
demo_num_col.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70609 entries, 0 to 70608
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   clnt_tenure_yr    70595 non-null  float64
 1   clnt_tenure_mnth  70595 non-null  float64
 2   clnt_age          70594 non-null  float64
 3   num_accts         70595 non-null  float64
 4   bal               70595 non-null  float64
 5   calls_6_mnth      70595 non-null  float64
 6   logons_6_mnth     70595 non-null  float64
dtypes: float64(7)
memory usage: 3.8 MB


In [192]:
df_demo.select_dtypes(include=['number']).drop(columns = 'client_id').describe().round(2)

Unnamed: 0,clnt_tenure_yr,clnt_tenure_mnth,clnt_age,num_accts,bal,calls_6_mnth,logons_6_mnth
count,70595.0,70595.0,70594.0,70595.0,70595.0,70595.0,70595.0
mean,12.05,150.66,46.44,2.26,147445.24,3.38,5.57
std,6.87,82.09,15.59,0.53,301508.71,2.24,2.35
min,2.0,33.0,13.5,1.0,13789.42,0.0,1.0
25%,6.0,82.0,32.5,2.0,37346.84,1.0,4.0
50%,11.0,136.0,47.0,2.0,63332.9,3.0,5.0
75%,16.0,192.0,59.0,2.0,137544.9,6.0,7.0
max,62.0,749.0,96.0,8.0,16320040.15,7.0,9.0


### Experiment dataset

In [174]:
print(f'The data set has {df_exp.shape[0]} rows and {df_exp.shape[1]} columns with the following types:')
print(df_exp.dtypes)
df_exp.sort_values(by = 'client_id', inplace = True)
df_exp.reset_index(drop = True, inplace = True)
df_exp.sample(5)

The data set has 70609 rows and 2 columns with the following types:
client_id     int64
Variation    object
dtype: object


Unnamed: 0,client_id,Variation
10494,1505133,Control
4837,682590,
59179,8362528,
10474,1502999,
12548,1805030,


In [175]:
print(f'The number of unique values in Experiment dataset:')

for column in df_exp.columns:
    print(f'column {column} has {df_exp[column].nunique()}')

print(f'\nUnique values in Experiment dataset:\n')

for column in df_exp.columns:
    unique_values_exp = df_exp[column].unique()
    print(f'column {column}: {unique_values_exp}')

The number of unique values in Experiment dataset:
column client_id has 70609
column Variation has 2

Unique values in Experiment dataset:

column client_id: [    169     555     647 ... 9999729 9999832 9999839]
column Variation: [nan 'Test' 'Control']


In [181]:
df_exp.isna().sum()

client_id        0
Variation    20109
dtype: int64
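With 20109 of 70609 clients lacking a `Variation` value, 70609 - 20109 = 50500 clients have an assigned group. A quick way to see the split, including the unassigned clients, is `value_counts(dropna=False)`. The sketch below uses a made-up frame; `exp`, `group_sizes` and `group_shares` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy assignment table (made-up rows).
exp = pd.DataFrame({
    "client_id": [1, 2, 3, 4, 5],
    "Variation": ["Test", "Control", np.nan, "Test", np.nan],
})

# dropna=False keeps the unassigned clients as their own NaN bucket.
group_sizes = exp["Variation"].value_counts(dropna=False)

# Shares among assigned clients only (NaN is excluded by default).
group_shares = exp["Variation"].value_counts(normalize=True)

print(group_sizes)
print(group_shares)
```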

### Merging Demo and Experiment

In [198]:
df = pd.merge(df_demo, df_exp, on='client_id')
df

Unnamed: 0,client_id,clnt_tenure_yr,clnt_tenure_mnth,clnt_age,gendr,num_accts,bal,calls_6_mnth,logons_6_mnth,Variation
0,169,21.0,262.0,47.5,M,2.0,501570.72,4.0,4.0,
1,555,3.0,46.0,29.5,U,2.0,25454.66,2.0,6.0,Test
2,647,12.0,151.0,57.5,M,2.0,30525.80,0.0,4.0,Test
3,722,11.0,143.0,59.5,F,2.0,22466.17,1.0,1.0,
4,934,9.0,109.0,51.0,F,2.0,32522.88,0.0,3.0,Test
...,...,...,...,...,...,...,...,...,...,...
70604,9999400,7.0,86.0,28.5,U,2.0,51787.04,0.0,3.0,Test
70605,9999626,9.0,113.0,35.0,M,2.0,36642.88,6.0,9.0,Test
70606,9999729,10.0,124.0,31.0,F,3.0,107059.74,6.0,9.0,Test
70607,9999832,23.0,281.0,49.0,F,2.0,431887.61,1.0,4.0,Test


### Web dataset

In [176]:
df_web.sort_values(by='client_id', inplace = True)
df_web.reset_index(drop= True, inplace= True)
df_web.head(5)

In [177]:
print(f'The data set has {df_web.shape[0]} rows and {df_web.shape[1]} columns with the following types:')
print(df_web.dtypes)

The data set has 755405 rows and 5 columns with the following types:
client_id        int64
visitor_id      object
visit_id        object
process_step    object
date_time       object
dtype: object


Unnamed: 0,client_id,visitor_id,visit_id,process_step,date_time
0,169,201385055_71273495308,749567106_99161211863_557568,step_3,2017-04-12 20:22:05
1,169,201385055_71273495308,749567106_99161211863_557568,confirm,2017-04-12 20:23:09
2,169,201385055_71273495308,749567106_99161211863_557568,step_2,2017-04-12 20:20:31
3,169,201385055_71273495308,749567106_99161211863_557568,step_1,2017-04-12 20:19:45
4,169,201385055_71273495308,749567106_99161211863_557568,start,2017-04-12 20:19:36
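The `date_time` column is stored as `object` (plain strings), and the rows within a visit are not in chronological order. A possible next step, sketched here on the five rows shown above for client 169, is to parse the timestamps and order events within each visit. The column `secs_since_prev` and the name `web` are illustrative additions, not part of the dataset.

```python
import pandas as pd

# The five events of one visit, in the order displayed above.
web = pd.DataFrame({
    "visit_id": ["749567106_99161211863_557568"] * 5,
    "process_step": ["step_3", "confirm", "step_2", "step_1", "start"],
    "date_time": [
        "2017-04-12 20:22:05",
        "2017-04-12 20:23:09",
        "2017-04-12 20:20:31",
        "2017-04-12 20:19:45",
        "2017-04-12 20:19:36",
    ],
})

# Parse strings into real timestamps so they can be sorted and subtracted.
web["date_time"] = pd.to_datetime(web["date_time"], format="%Y-%m-%d %H:%M:%S")

# Chronological order within each visit.
web = web.sort_values(["visit_id", "date_time"]).reset_index(drop=True)

# Seconds elapsed since the previous event of the same visit.
web["secs_since_prev"] = (
    web.groupby("visit_id")["date_time"].diff().dt.total_seconds()
)
print(web[["process_step", "date_time", "secs_since_prev"]])
```

The first event of each visit has no predecessor, so its `secs_since_prev` is `NaN`.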


In [210]:

print(f'The number of unique values in Web dataset:')

for column in df_web.columns:
    print(f'column {column} has {df_web[column].nunique()}')

unique_values_web = df_web.process_step.unique()
print(f'\ncolumn process_step has : {unique_values_web} as Unique values')

The number of unique values in Web dataset:
column client_id has 120157
column visitor_id has 130236
column visit_id has 158095
column process_step has 5
column date_time has 629363

column process_step has : ['step_3' 'confirm' 'step_2' 'step_1' 'start'] as Unique values
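The five `process_step` values have a natural order (initial page, three steps, confirmation) that plain strings do not capture: sorted alphabetically, `confirm` would come first. One option, sketched below, is an ordered categorical. The names `step_order`, `steps` and `steps_cat` are illustrative.

```python
import pandas as pd

# Order of the process as described in the brief.
step_order = ["start", "step_1", "step_2", "step_3", "confirm"]

# The unique values in the order they appear in the output above.
steps = pd.Series(["step_3", "confirm", "step_2", "step_1", "start"])

# An ordered categorical makes comparisons, sorting and max() follow
# the process order instead of alphabetical order.
steps_cat = steps.astype(pd.CategoricalDtype(categories=step_order, ordered=True))

print(steps_cat.sort_values().tolist())
print(steps_cat.max())
print(steps_cat.cat.codes.tolist())
```

The integer codes (0 for `start` through 4 for `confirm`) can be handy for detecting backward moves, where the code decreases between consecutive events of a visit.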


In [180]:
df_web.isna().sum()

client_id       0
visitor_id      0
visit_id        0
process_step    0
date_time       0
dtype: int64

## Data cleaning

In [213]:
print(f'The data set has {df.shape[0]} rows and {df.shape[1]} columns with the following types:')
print(df.dtypes)

print(f'The number of null values in Demo_Experiment dataset:')
df.isna().sum()

The data set has 70609 rows and 10 columns with the following types:
client_id             int64
clnt_tenure_yr      float64
clnt_tenure_mnth    float64
clnt_age            float64
gendr                object
num_accts           float64
bal                 float64
calls_6_mnth        float64
logons_6_mnth       float64
Variation            object
dtype: object
The number of null values in Demo_Experiment dataset:


client_id               0
clnt_tenure_yr         14
clnt_tenure_mnth       14
clnt_age               15
gendr                  14
num_accts              14
bal                    14
calls_6_mnth           14
logons_6_mnth          14
Variation           20109
dtype: int64

In [212]:
print(f'The number of unique values in Demo_Experiment dataset:')

for column in df.columns:
    print(f'column {column} has {df[column].nunique()}')

df_col_unique = ['gendr',
       'num_accts', 'bal', 'calls_6_mnth', 'logons_6_mnth', 'Variation']
print(f'\nUnique values in Demo_Experiment:')

for column in df_col_unique:
    unique_values = df[column].unique()
    print(f'column {column}: {unique_values}')

The number of unique values in Demo_Experiment dataset:
column client_id has 70609
column clnt_tenure_yr has 54
column clnt_tenure_mnth has 482
column clnt_age has 165
column gendr has 4
column num_accts has 8
column bal has 70328
column calls_6_mnth has 8
column logons_6_mnth has 9
column Variation has 2

Unique values in Demo_Experiment:
column gendr: ['M' 'U' 'F' nan 'X']
column num_accts: [ 2.  3.  5.  4.  6.  8. nan  7.  1.]
column bal: [501570.72  25454.66  30525.8  ... 107059.74 431887.61  67425.35]
column calls_6_mnth: [ 4.  2.  0.  1.  6.  5.  3.  7. nan]
column logons_6_mnth: [ 4.  6.  1.  3.  9.  5.  8.  7.  2. nan]
column Variation: [nan 'Test' 'Control']
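The checks above only inspect the merged table. A possible cleaning pass is sketched below under stated assumptions: clients without a `Variation` value are outside the experiment and can be set aside, rows with no demographic values at all are dropped, and whole-number columns stored as floats are cast to pandas' nullable integer type. The first three rows of `df_toy` are copied from the merged output shown earlier; the fourth row (client 999) is hypothetical, as are the names `df_toy`, `demo_cols` and `cleaned`.

```python
import numpy as np
import pandas as pd

df_toy = pd.DataFrame({
    "client_id": [169, 555, 647, 999],
    "clnt_age": [47.5, 29.5, 57.5, np.nan],
    "gendr": ["M", "U", "M", np.nan],
    "num_accts": [2.0, 2.0, 2.0, np.nan],
    "Variation": [np.nan, "Test", "Test", "Control"],
})

# Assumption: clients without a group took no part in the experiment.
cleaned = df_toy.dropna(subset=["Variation"])

# Drop rows in which every demographic column is missing.
demo_cols = ["clnt_age", "gendr", "num_accts"]
cleaned = cleaned.dropna(subset=demo_cols, how="all")

# Account counts are whole numbers; Int64 also tolerates missing values.
cleaned = cleaned.astype({"num_accts": "Int64"})

print(cleaned)
```

Whether to drop or keep the unassigned clients depends on the question: they are irrelevant to the A/B comparison but may still be useful for describing the overall client base.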


### Client Behavior Analysis
Who are the primary clients using this online process?<br>
Are the primary clients younger or older, new or long-standing?<br>
Next, carry out a client behaviour analysis to answer any additional relevant questions you think are important.
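One way to approach the "younger or older, new or long-standing" question is to bin age and tenure into bands and count clients per band. The summary statistics earlier in this notebook put the median age at 47 and the median tenure at 11 years, which gives a rough idea of where the bulk of clients sit. The sketch below uses five rows copied from the merged output; the band edges and labels are arbitrary choices for illustration, and the names `clients`, `age_band`, `tenure_band` and `band_counts` are hypothetical.

```python
import pandas as pd

# Age and tenure of five clients copied from the merged output above.
clients = pd.DataFrame({
    "client_id": [169, 555, 647, 722, 934],
    "clnt_tenure_yr": [21.0, 3.0, 12.0, 11.0, 9.0],
    "clnt_age": [47.5, 29.5, 57.5, 59.5, 51.0],
})

# Hypothetical band edges; right=False makes each band [low, high).
clients["age_band"] = pd.cut(
    clients["clnt_age"],
    bins=[0, 30, 45, 60, 120],
    labels=["<30", "30-44", "45-59", "60+"],
    right=False,
)
clients["tenure_band"] = pd.cut(
    clients["clnt_tenure_yr"],
    bins=[0, 5, 10, 20, 100],
    labels=["<5y", "5-9y", "10-19y", "20y+"],
    right=False,
)

# Number of clients in each age band / tenure band combination.
band_counts = pd.crosstab(clients["age_band"], clients["tenure_band"])
print(band_counts)
```

On the full dataset the same table, optionally split by `Variation`, would also show whether the Test and Control groups are demographically comparable.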
