# VANGUARD AB TEST


## METADATA HELP

The fields below describe the client, experiment, and web-activity data used in this analysis.

- **client_id**: Every client’s unique ID.
- **variation**: The experiment group the client was assigned to (test or control); missing for clients who were not part of the experiment.
- **visitor_id**: A unique ID for each client-device combination.
- **visit_id**: A unique ID for each web visit/session.
- **process_step**: Marks each step in the digital process.
- **date_time**: Timestamp of each web activity.
- **clnt_tenure_yr**: Represents how long the client has been with Vanguard, measured in years.
- **clnt_tenure_mnth**: Further breaks down the client’s tenure with Vanguard in months.
- **clnt_age**: Indicates the age of the client.
- **gendr**: Specifies the client’s gender.
- **num_accts**: Denotes the number of accounts the client holds with Vanguard.
- **bal**: Gives the total balance spread across all accounts for a particular client.
- **calls_6_mnth**: Records the number of times the client reached out over a call in the past six months.
- **logons_6_mnth**: Reflects the frequency with which the client logged onto Vanguard’s platform over the last six months.


In [None]:
%load_ext autoreload
%autoreload 2 

In [1078]:
from cleaning import *
from mining import *
from db_handling import *
import pandas as pd
from dotenv import load_dotenv
import os


In [None]:
# Load environment variables
load_dotenv()

### Load Configuration

In [None]:
# Load config.yaml
config = parse_config()

## Data Mining

In [1081]:
# Creates a dictionary of all imported dataframes
dataframes = { name:import_data_from_config(config, name) for name in config['tables']}

## Data Cleaning

In [1082]:
#TODO: don't impose categories?

In [1083]:
# Rename columns
dataframes = rename_columns(dataframes, config)

In [1084]:
# Select columns
dataframes = select_columns(dataframes, config)

In [None]:
display_dataFrames(dataframes,'head')

## Separation

In [None]:
client_df = dataframes['clients']
experiment_df = dataframes['experiment']
visits_df = dataframes['visits']
display(experiment_df['variation'].isna().sum())
display(client_df, experiment_df, visits_df)

In [None]:
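# Attach each client's variation to their visit events (inner join on client_id)
variation_visits = visits_df.merge(experiment_df, on='client_id')
display(variation_visits['variation'].value_counts())
display(variation_visits)

# Keep only the events that reached the final 'confirm' step
confirmed_steps = variation_visits[variation_visits['process_step'] == 'confirm']
display(confirmed_steps)

# One row per visit that confirmed at least once
unique_visit_ids = confirmed_steps.drop_duplicates(subset='visit_id')
display(unique_visit_ids)

# Disabled: count events per variation and process step
# visits = variation_visits.groupby(['variation', 'process_step']).agg({'process_step': 'count'})
# visits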


In [None]:

display(unique_visit_ids['variation'].value_counts())
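
The confirmed-visit counts above are easier to compare once they are expressed as a rate per variation. The following is a minimal sketch, assuming a visit counts as completed when it contains at least one `confirm` event. The helper name `completion_rate_by_variation` and the toy frame are illustrative and are not part of the project modules.

```python
import pandas as pd


def completion_rate_by_variation(visits: pd.DataFrame) -> pd.DataFrame:
    """Share of visits per variation that contain a 'confirm' event."""
    # One row per (variation, visit_id), True if any event is 'confirm'
    reached = (
        visits.assign(confirmed=visits["process_step"].eq("confirm"))
        .groupby(["variation", "visit_id"])["confirmed"]
        .any()
        .reset_index()
    )
    # Count all visits and confirmed visits per variation
    summary = reached.groupby("variation")["confirmed"].agg(
        visits="count", confirmed_visits="sum"
    )
    summary["completion_rate"] = summary["confirmed_visits"] / summary["visits"]
    return summary


# Toy data (hypothetical values, only to exercise the helper)
toy_visits = pd.DataFrame({
    "visit_id": ["v1", "v1", "v2", "v5", "v3", "v3", "v4"],
    "variation": ["Test", "Test", "Test", "Test", "Control", "Control", "Control"],
    "process_step": ["start", "confirm", "start", "confirm", "start", "confirm", "start"],
})
rates = completion_rate_by_variation(toy_visits)
print(rates)
```

On the real data the same call could be made on `variation_visits`, provided `variation` has no missing values at that point.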

In [None]:
# drop the nulls from clients, but keep the list of the drops

nulls_client_id = client_df[client_df.isna().any(axis=1)]['client_id']
nulls_client_id

In [None]:
client_df = client_df.dropna(axis=0)
client_df

In [None]:
display(client_df['gender'].value_counts(dropna = False))
# Planned: map 'X' to 'U'; keep the 'U' rows for everything except the gender statistics

In [1092]:
#client_df['gender'] = client_df['gender'].replace(to_replace=r'.*X.*', value ="U", regex=True)

In [None]:
display(experiment_df['variation'].value_counts(dropna = False))
# keep NaN for general analysis of clients, but drop them from everywhere for test analysis

In [None]:
# client_df, experiment_df, visits_df -> for general analysis
# new_client_df, new_experiment_df, new_visits_df -> for test/control analysis
nulls_in_experiment = experiment_df[experiment_df.isna().any(axis=1)]['client_id']
nulls_in_experiment

In [None]:
# New frame without the client IDs that have a null variation in experiment

new_experiment_df = experiment_df[~experiment_df['client_id'].isin(nulls_in_experiment)]
display(new_experiment_df.count())

# Also remove the clients that were dropped from client_df because of nulls
new_experiment_df = new_experiment_df[~new_experiment_df['client_id'].isin(nulls_client_id)]
display(new_experiment_df.count())



In [None]:
new_visits_df = visits_df[~visits_df['client_id'].isin(nulls_in_experiment)]
new_visits_df

In [None]:
new_client_df = client_df[~client_df['client_id'].isin(nulls_in_experiment)]
new_client_df
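
In the cells above, `new_experiment_df` is filtered on both null lists, while `new_visits_df` is filtered only on `nulls_in_experiment`. If the three frames are meant to cover the same clients, one option is to build a single exclusion set and apply it everywhere. This is a sketch under that assumption; `drop_clients` is a hypothetical helper and the toy frames only mimic the column layout.

```python
import pandas as pd


def drop_clients(frames: dict, excluded_ids) -> dict:
    """Return copies of each frame without rows whose client_id is excluded."""
    excluded = set(excluded_ids)
    return {
        name: df[~df["client_id"].isin(excluded)].copy()
        for name, df in frames.items()
    }


# Toy frames (hypothetical values)
toy_frames = {
    "clients": pd.DataFrame({"client_id": [1, 2, 3], "age": [30.0, None, 45.0]}),
    "experiment": pd.DataFrame(
        {"client_id": [1, 2, 3], "variation": ["Test", "Control", None]}
    ),
    "visits": pd.DataFrame(
        {"client_id": [1, 1, 2, 3],
         "process_step": ["start", "confirm", "start", "start"]}
    ),
}

# Collect the IDs with missing values in either source, as in the cells above
clients = toy_frames["clients"]
experiment = toy_frames["experiment"]
null_clients = clients.loc[clients.isna().any(axis=1), "client_id"]
null_experiment = experiment.loc[experiment.isna().any(axis=1), "client_id"]
excluded_ids = set(null_clients) | set(null_experiment)

# Apply the same exclusion set to every frame
filtered = drop_clients(toy_frames, excluded_ids)
```

Keeping the exclusion set in one place makes it harder for the frames to drift apart when another drop rule is added later.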

## End separation

In [1098]:
dataframes['clients'] = new_client_df.copy()
dataframes['experiment'] = new_experiment_df.copy()
dataframes['visits'] = new_visits_df.copy()

In [1099]:
# Data Categorizing
dataframes = clean_categorical_data(dataframes, config)

In [1100]:
#Convert types
dataframes = convert_types(dataframes, config)

In [None]:
display_dataFrames(dataframes, 'head', 'dtypes', 'cat_count')

In [1102]:
client_df = dataframes['clients']
experiment_df = dataframes['experiment']
visits_df = dataframes['visits']


### SQL EXPORT

In [1103]:
if config['refresh_db']:

    db_password = os.getenv('SQL_PASSWORD')

    # Create database if it doesn't exist
    engine = create_db(db_password, config)

    # Export tables to database if refresh is set to true
    export_dataframes_to_sql(engine, dataframes)

    # Import data from database
    dataframes = import_all_tables_from_sql(engine)

### Local Caching

In [None]:
# Disabled: save files locally in an untracked folder
# export_dataframes_to_csv(dataframes)

In [None]:
# TODO: Careful, data reloaded from CSV will not be properly categorized/typed;
# run convert_types(dataframes, config) afterwards.
# client_df = pd.read_csv('data/cleaned/clients.csv')
# experiment_df = pd.read_csv('data/cleaned/experiment.csv')
# visits_df = pd.read_csv('data/cleaned/visits.csv')
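
The warning above follows from CSV storing plain text only: categorical and datetime dtypes are not recorded in the file. The sketch below shows the effect on an in-memory buffer and one way to restore the schema at load time. The column names and values are toy assumptions, not the cleaned project files.

```python
import io

import pandas as pd

# Toy frame with a categorical and a datetime column (hypothetical values)
original = pd.DataFrame({
    "client_id": [1, 2],
    "variation": pd.Categorical(["Test", "Control"]),
    "date_time": pd.to_datetime(["2017-04-01 10:00:00", "2017-04-01 10:05:00"]),
})

buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Naive reload: the category and datetime columns come back as strings
buffer.seek(0)
naive = pd.read_csv(buffer)

# Reload with explicit dtypes and date parsing to restore the schema
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    dtype={"variation": "category"},
    parse_dates=["date_time"],
)

print(naive.dtypes)
print(restored.dtypes)
```

Calling `convert_types(dataframes, config)` after the reload, as the TODO suggests, serves the same purpose. `DataFrame.to_pickle` is another option when the cache only needs to be read back by pandas, since it keeps dtypes intact.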

## CLEAN FRAMES

In [None]:
display('clients :', client_df, 'experiment :', experiment_df, 'visits :', visits_df)

experiment_df['variation'].value_counts()


In [1107]:
# client_since_year: redundant, drop
# client_since_month: hypothesis: the longer someone has been a client, the more valuable they are to us
# age (clnt_age): hypothesis: the older the client is, the more valuable they are to us
# gender: hypothesis: men hold higher balances
# number_of_accounts: hypothesis: clients with more accounts hold higher balances
# calls + logons: hypothesis: active clients are more valuable to us

# process steps + time:
    # - SUCCESS: all the steps, in order, in a reasonable amount of time for each step

    # - ERROR: path does not start with 'start': drop
    # - ERROR: path does not complete: analyse
    # - ERROR: path does not complete in order: analyse
    # - ERROR: all the steps in order, but the visit took very long
    # - ERROR: unusual amount of time between steps
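
The order-related categories listed above can be sketched as a small classifier over each visit's chronologically sorted steps. This is an illustration only: the intermediate step names in `STEP_ORDER` are an assumption (this notebook confirms only `start` and `confirm`), the label names and helper functions are hypothetical, and the timing-based error cases are not covered.

```python
import pandas as pd

# Assumed step order; only 'start' and 'confirm' are confirmed by this notebook
STEP_ORDER = ["start", "step_1", "step_2", "step_3", "confirm"]


def classify_path(steps: list) -> str:
    """Label one visit from its chronologically ordered steps."""
    if not steps or steps[0] != "start":
        return "no_start"
    if "confirm" not in steps:
        return "incomplete"
    if steps == STEP_ORDER:
        return "success"
    return "out_of_order"


def classify_visits(visits: pd.DataFrame) -> pd.Series:
    """Return one label per visit_id."""
    ordered = visits.sort_values(["visit_id", "date_time"])
    return ordered.groupby("visit_id")["process_step"].apply(
        lambda s: classify_path(s.tolist())
    )


# Toy paths (hypothetical), one event per minute within each visit
paths = {
    "a": ["start", "step_1", "step_2", "step_3", "confirm"],
    "b": ["step_1", "step_2"],
    "c": ["start", "step_1"],
    "d": ["start", "step_1", "step_2", "step_1", "step_2", "step_3", "confirm"],
}
rows = [
    {
        "visit_id": visit_id,
        "process_step": step,
        "date_time": pd.Timestamp("2017-04-01") + pd.Timedelta(minutes=i),
    }
    for visit_id, steps in paths.items()
    for i, step in enumerate(steps)
]
labels = classify_visits(pd.DataFrame(rows))
print(labels)
```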

## Data Exploration

In [1108]:
# Handle outliers

In [1109]:
#frequency tables

## Analysis

In [1110]:
#TODO: consider binning / pd.cut / qcut for numerical data
#TODO: correlation matrix
#TODO: tukeys_test_outliers
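
Two of the TODOs above, Tukey's outlier test and quantile binning, can be sketched with plain pandas. The function name `tukey_outlier_mask`, the bin labels, and the toy balances are assumptions for illustration; the conventional fence multiplier of 1.5 is used as the default.

```python
import pandas as pd


def tukey_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy balances (hypothetical values) with one obvious outlier
balances = pd.Series([10, 12, 13, 15, 16, 18, 20, 500], name="balance")
mask = tukey_outlier_mask(balances)
print(balances[mask])

# Quantile binning: four bins with (roughly) equal numbers of observations
bins = pd.qcut(balances, q=4, labels=["low", "mid_low", "mid_high", "high"])
print(bins.value_counts())
```

`pd.cut` would be the counterpart when the bin edges should be fixed values rather than quantiles.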

In [1111]:
# Check for back-and-forth movement between steps: are clients getting lost?
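
One way to quantify back-and-forth movement is to count, per visit, how often a client moves to an earlier step than the one before. The sketch below assumes the step order shown in `STEP_ORDER` (only `start` and `confirm` are confirmed by this notebook) and string-typed `process_step` values; `count_backward_moves` is a hypothetical helper.

```python
import pandas as pd

# Assumed step order; only 'start' and 'confirm' are confirmed by this notebook
STEP_ORDER = ["start", "step_1", "step_2", "step_3", "confirm"]
STEP_RANK = {step: rank for rank, step in enumerate(STEP_ORDER)}


def count_backward_moves(visits: pd.DataFrame) -> pd.Series:
    """Number of transitions to an earlier step, per visit_id."""
    ordered = visits.sort_values(["visit_id", "date_time"])
    rank = ordered["process_step"].map(STEP_RANK)
    # Change in step rank between consecutive events of the same visit
    delta = rank.groupby(ordered["visit_id"]).diff()
    # A negative change means the client went back to an earlier step
    return (delta < 0).groupby(ordered["visit_id"]).sum()


# Toy paths (hypothetical), one event per minute within each visit
paths = {
    "a": ["start", "step_1", "step_2"],
    "b": ["start", "step_1", "start", "step_1", "step_2", "step_1"],
}
rows = [
    {
        "visit_id": visit_id,
        "process_step": step,
        "date_time": pd.Timestamp("2017-04-01") + pd.Timedelta(minutes=i),
    }
    for visit_id, steps in paths.items()
    for i, step in enumerate(steps)
]
backward = count_backward_moves(pd.DataFrame(rows))
print(backward)
```

Comparing the distribution of this count between the two variations would show whether one design sends clients backwards more often.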

## Visualizations

## Conclusions