# VANGUARD AB TEST


## METADATA HELP

The fields below describe each client and each recorded web interaction. They are the inputs to the analysis of client behavior in this notebook.

- **client_id**: Every client’s unique ID.
- **variation**: The experiment group the client was assigned to (`Test` or `Control`).
- **visitor_id**: A unique ID for each client-device combination.
- **visit_id**: A unique ID for each web visit/session.
- **process_step**: Marks each step in the digital process.
- **date_time**: Timestamp of each web activity.
- **clnt_tenure_yr**: Represents how long the client has been with Vanguard, measured in years.
- **clnt_tenure_mnth**: The client’s tenure with Vanguard, measured in months.
- **clnt_age**: Indicates the age of the client.
- **gendr**: Specifies the client’s gender.
- **num_accts**: Denotes the number of accounts the client holds with Vanguard.
- **bal**: The client’s total balance across all accounts.
- **calls_6_mnth**: The number of times the client called Vanguard in the past six months.
- **logons_6_mnth**: The number of times the client logged on to Vanguard’s platform in the past six months.


In [244]:
%load_ext autoreload
%autoreload 2 

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [245]:
import pandas as pd
from cleaning import *
from mining import *
from dotenv import load_dotenv
import os
import yaml

In [246]:
# Load environment variables
load_dotenv()

True

In [247]:
config = parse_config()

{'tables': {'clients': {'paths': ['data/df_final_demo.txt'], 'separator': ',', 'columns': [{'client_id': {'original_name': 'client_id', 'data_type': 'INTEGER', 'primary_key': True, 'pandas_dtype': 'int64'}}, {'clnt_tenure_yr': {'original_name': 'clnt_tenure_yr', 'data_type': 'FLOAT', 'pandas_dtype': 'float64'}}, {'clnt_tenure_mnth': {'original_name': 'clnt_tenure_mnth', 'data_type': 'FLOAT', 'pandas_dtype': 'float64'}}, {'clnt_age': {'original_name': 'clnt_age', 'data_type': 'FLOAT', 'pandas_dtype': 'float64'}}, {'gendr': {'original_name': 'gendr', 'data_type': 'CHAR(1)', 'pandas_dtype': 'category', 'valid_categories': ['U', 'M', 'F']}}, {'num_accts': {'original_name': 'num_accts', 'data_type': 'FLOAT', 'pandas_dtype': 'float64'}}, {'bal': {'original_name': 'bal', 'data_type': 'FLOAT', 'pandas_dtype': 'float64'}}, {'calls_6_mnth': {'original_name': 'calls_6_mnth', 'data_type': 'FLOAT', 'pandas_dtype': 'float64'}}, {'logons_6_mnth': {'original_name': 'logons_6_mnth', 'data_type': 'FLOAT
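The config printed above stores each column as a single-key dictionary that carries a `pandas_dtype`. A minimal sketch of how that structure could be turned into a dtype mapping for `pd.read_csv` follows. `dtypes_from_config` is a hypothetical helper written for illustration, not a function from `cleaning` or `mining`, and the config fragment copies only three of the columns shown in the output.

```python
import io

import pandas as pd

# Fragment of the 'clients' table config, copied from the output above
clients_config = {
    "separator": ",",
    "columns": [
        {"client_id": {"original_name": "client_id", "pandas_dtype": "int64"}},
        {"clnt_age": {"original_name": "clnt_age", "pandas_dtype": "float64"}},
        {"gendr": {"original_name": "gendr", "pandas_dtype": "category"}},
    ],
}


def dtypes_from_config(table_config):
    """Hypothetical helper: map each original column name to its pandas dtype."""
    dtypes = {}
    for column in table_config["columns"]:
        for spec in column.values():
            dtypes[spec["original_name"]] = spec["pandas_dtype"]
    return dtypes


# Small in-memory sample standing in for data/df_final_demo.txt
sample = io.StringIO("client_id,clnt_age,gendr\n836976,60.5,U\n1562045,49.0,M\n")
df = pd.read_csv(
    sample,
    sep=clients_config["separator"],
    dtype=dtypes_from_config(clients_config),
)
print(df.dtypes)
```

Passing the dtypes at read time keeps the typing rules in the config file rather than scattered across later cells.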

## Data Mining

In [248]:
#TODO: find a way to import and merge web data 2
#TODO: adapt function to remote url + save to sql to prevent large repo


In [249]:
# Import data
demo_df = import_data_from_config(config, 'clients')
experiment_df = import_data_from_config(config, 'experiment')
web_data_1_df = import_data_from_config(config, 'visits')

{'paths': ['data/df_final_demo.txt'], 'separator': ',', 'columns': [{'client_id': {'original_name': 'client_id', 'data_type': 'INTEGER', 'primary_key': True, 'pandas_dtype': 'int64'}}, {'clnt_tenure_yr': {'original_name': 'clnt_tenure_yr', 'data_type': 'FLOAT', 'pandas_dtype': 'float64'}}, {'clnt_tenure_mnth': {'original_name': 'clnt_tenure_mnth', 'data_type': 'FLOAT', 'pandas_dtype': 'float64'}}, {'clnt_age': {'original_name': 'clnt_age', 'data_type': 'FLOAT', 'pandas_dtype': 'float64'}}, {'gendr': {'original_name': 'gendr', 'data_type': 'CHAR(1)', 'pandas_dtype': 'category', 'valid_categories': ['U', 'M', 'F']}}, {'num_accts': {'original_name': 'num_accts', 'data_type': 'FLOAT', 'pandas_dtype': 'float64'}}, {'bal': {'original_name': 'bal', 'data_type': 'FLOAT', 'pandas_dtype': 'float64'}}, {'calls_6_mnth': {'original_name': 'calls_6_mnth', 'data_type': 'FLOAT', 'pandas_dtype': 'float64'}}, {'logons_6_mnth': {'original_name': 'logons_6_mnth', 'data_type': 'FLOAT', 'pandas_dtype': 'flo

In [250]:
# Display data
display('demo', demo_df.shape, demo_df.head())
display('experiment', experiment_df.shape, experiment_df.head())
display('web data', web_data_1_df.shape, web_data_1_df.head())

'demo'

(70609, 9)

| | client_id | clnt_tenure_yr | clnt_tenure_mnth | clnt_age | gendr | num_accts | bal | calls_6_mnth | logons_6_mnth |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 836976 | 6.0 | 73.0 | 60.5 | U | 2.0 | 45105.3 | 6.0 | 9.0 |
| 1 | 2304905 | 7.0 | 94.0 | 58.0 | U | 2.0 | 110860.3 | 6.0 | 9.0 |
| 2 | 1439522 | 5.0 | 64.0 | 32.0 | U | 2.0 | 52467.79 | 6.0 | 9.0 |
| 3 | 1562045 | 16.0 | 198.0 | 49.0 | M | 2.0 | 67454.65 | 3.0 | 6.0 |
| 4 | 5126305 | 12.0 | 145.0 | 33.0 | F | 2.0 | 103671.75 | 0.0 | 3.0 |


'experiment'

(70609, 2)

| | client_id | Variation |
|---|---|---|
| 0 | 9988021 | Test |
| 1 | 8320017 | Test |
| 2 | 4033851 | Control |
| 3 | 1982004 | Test |
| 4 | 9294070 | Control |


'web data'

(412264, 10)

| | client_id | visitor_id | visit_id | process_step | date_time | client_id.1 | visitor_id.1 | visit_id.1 | process_step.1 | date_time.1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9988021.0 | 580560515_7732621733 | 781255054_21935453173_531117 | step_3 | 2017-04-17 15:27:07 | 763412 | 601952081_10457207388 | 397475557_40440946728_419634 | confirm | 2017-06-06 08:56:00 |
| 1 | 9988021.0 | 580560515_7732621733 | 781255054_21935453173_531117 | step_2 | 2017-04-17 15:26:51 | 6019349 | 442094451_91531546617 | 154620534_35331068705_522317 | confirm | 2017-06-01 11:59:27 |
| 2 | 9988021.0 | 580560515_7732621733 | 781255054_21935453173_531117 | step_3 | 2017-04-17 15:19:22 | 6019349 | 442094451_91531546617 | 154620534_35331068705_522317 | step_3 | 2017-06-01 11:58:48 |
| 3 | 9988021.0 | 580560515_7732621733 | 781255054_21935453173_531117 | step_2 | 2017-04-17 15:19:13 | 6019349 | 442094451_91531546617 | 154620534_35331068705_522317 | step_2 | 2017-06-01 11:58:08 |
| 4 | 9988021.0 | 580560515_7732621733 | 781255054_21935453173_531117 | step_3 | 2017-04-17 15:18:04 | 6019349 | 442094451_91531546617 | 154620534_35331068705_522317 | step_1 | 2017-06-01 11:57:58 |
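The web data preview carries every column twice (`client_id` and `client_id.1`, and so on), which suggests the two web files ended up side by side rather than stacked. A minimal sketch of stacking the parts row-wise with `pd.concat` follows, assuming both parts share the same five columns. The frames `part_1` and `part_2` are small stand-ins built from one preview row each, not the real files.

```python
import pandas as pd

columns = ["client_id", "visitor_id", "visit_id", "process_step", "date_time"]

# Stand-ins for the two web data parts, one row each taken from the preview
part_1 = pd.DataFrame(
    [[9988021, "580560515_7732621733", "781255054_21935453173_531117",
      "step_3", "2017-04-17 15:27:07"]],
    columns=columns,
)
part_2 = pd.DataFrame(
    [[763412, "601952081_10457207388", "397475557_40440946728_419634",
      "confirm", "2017-06-06 08:56:00"]],
    columns=columns,
)

# Stack the parts vertically and rebuild a clean 0..n-1 index
web_data_df = pd.concat([part_1, part_2], axis=0, ignore_index=True)

# Parse timestamps so steps can later be ordered within a visit
web_data_df["date_time"] = pd.to_datetime(web_data_df["date_time"])

print(web_data_df.shape)
```

Stacking keeps a single `client_id` column, so the result can be joined to the demo and experiment tables on one key.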


## Data Cleaning

In [251]:

#TODO: don't impose categories?
#TODO: consider binning / pd.cut / qcut for numerical data
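The binning TODO above could be approached as in the following sketch: `pd.cut` for fixed age bands and `pd.qcut` for balance quantiles. The age edges and labels are assumptions chosen for illustration, and the sample values are the five rows from the demo preview.

```python
import pandas as pd

# Ages and balances from the five demo rows previewed earlier
demo_sample = pd.DataFrame({
    "clnt_age": [60.5, 58.0, 32.0, 49.0, 33.0],
    "bal": [45105.3, 110860.3, 52467.79, 67454.65, 103671.75],
})

# Assumed age bands; right=False makes each band include its lower edge
age_edges = [0, 30, 45, 60, 120]
age_labels = ["<30", "30-44", "45-59", "60+"]
demo_sample["age_group"] = pd.cut(
    demo_sample["clnt_age"], bins=age_edges, labels=age_labels, right=False
)

# Split balances at the median into two equal-frequency groups
demo_sample["bal_half"] = pd.qcut(
    demo_sample["bal"], q=2, labels=["lower", "upper"]
)

print(demo_sample)
```

`pd.cut` suits bands with a business meaning, while `pd.qcut` suits skewed columns such as `bal`, where equal-width bins would leave most clients in the first bin.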

In [252]:
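# Rename columns
def rename_columns(df, column_names_dict):
    """Return a copy of df with columns renamed according to column_names_dict."""
    # DataFrame.rename returns a new frame, so it must be returned to the caller
    return df.rename(columns=column_names_dict)

demo_df_2 = rename_columns(demo_df, {'client_id': 'id'})
display(demo_df_2)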

In [253]:
# Data formatting

In [254]:
# Data Typing

In [255]:
# Handle duplicates

In [256]:
# Handle missing values
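One possible shape for the duplicate and missing-value steps is sketched below. The duplicated row and the missing values are invented for illustration and do not come from the dataset. Dropping rows that lack `clnt_age` is an assumption, and imputing could be equally valid depending on the analysis.

```python
import numpy as np
import pandas as pd

# Toy frame: the repeated client and the NaN/None values are made up
demo_sample = pd.DataFrame({
    "client_id": [836976, 836976, 2304905, 1439522],
    "clnt_age": [60.5, 60.5, 58.0, np.nan],
    "gendr": ["U", "U", "U", None],
})

# Keep the first row for each client_id
deduplicated = demo_sample.drop_duplicates(subset="client_id", keep="first")

# Count missing values per column before deciding how to treat them
missing_per_column = deduplicated.isna().sum()
print(missing_per_column)

# Assumed policy: drop rows that have no age
cleaned = deduplicated.dropna(subset=["clnt_age"])
print(cleaned)
```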

## Data Exploration

In [257]:
# Handle outliers
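A common starting point for the outlier step is the interquartile range (IQR) rule, sketched here on `bal`. The first five values come from the demo preview. The last value is invented to show a flagged point, and the 1.5 multiplier is the conventional default rather than a choice made by this notebook.

```python
import pandas as pd

# Five balances from the demo preview plus one made-up extreme value
bal = pd.Series(
    [45105.3, 110860.3, 52467.79, 67454.65, 103671.75, 5_000_000.0],
    name="bal",
)

# Quartiles and interquartile range
q1 = bal.quantile(0.25)
q3 = bal.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = bal[(bal < lower_fence) | (bal > upper_fence)]

print(outliers)
```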

In [258]:
# Frequency tables
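Frequency tables for a categorical column can be built with `Series.value_counts`, as in this sketch. The values are the five `Variation` entries from the experiment preview, so the counts describe the preview only, not the full table.

```python
import pandas as pd

# The five Variation values shown in the experiment preview
variation = pd.Series(
    ["Test", "Test", "Control", "Test", "Control"], name="Variation"
)

# Absolute and relative frequencies
counts = variation.value_counts()
shares = variation.value_counts(normalize=True)

frequency_table = pd.DataFrame({"count": counts, "share": shares})
print(frequency_table)
```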

## Analysis

## Visualizations

## Conclusions

In [259]:
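example_dict = {"key":"Value"}
example_list = ['value','other value']

# YAML needs a space after the colon for `key: value` to parse as a mapping
example_yaml = '''
key: value
'''

def read_yaml_from_string(yaml_str):
    """Parse a YAML string; return the parsed data, or None if it is invalid."""
    try:
        return yaml.safe_load(yaml_str)
    except yaml.YAMLError as e:
        print(f"Error loading YAML string: {e}")
        return None

# Example usage
yaml_data = read_yaml_from_string(example_yaml)

# Print the parsed data rather than the raw string
if yaml_data is not None:
    print(yaml_data)

{'key': 'value'}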


