# VANGUARD AB TEST


## METADATA HELP

The fields below describe the client, experiment, and web-activity data used throughout this analysis.

- **client_id**: Every client’s unique ID.
- **variation**: Indicates if a client was part of the experiment.
- **visitor_id**: A unique ID for each client-device combination.
- **visit_id**: A unique ID for each web visit/session.
- **process_step**: Marks each step in the digital process.
- **date_time**: Timestamp of each web activity.
- **clnt_tenure_yr**: Represents how long the client has been with Vanguard, measured in years.
- **clnt_tenure_mnth**: The client’s tenure with Vanguard, measured in months.
- **clnt_age**: Indicates the age of the client.
- **gendr**: Specifies the client’s gender.
- **num_accts**: Denotes the number of accounts the client holds with Vanguard.
- **bal**: Gives the total balance spread across all accounts for a particular client.
- **calls_6_mnth**: Records the number of times the client reached out over a call in the past six months.
- **logons_6_mnth**: Reflects the frequency with which the client logged onto Vanguard’s platform over the last six months.


In [None]:
# Automatically reload edited project modules before each cell runs
%load_ext autoreload
%autoreload 2

In [121]:
from cleaning import *
from mining import *
from db_handling import *
import pandas as pd
from dotenv import load_dotenv
import os


### Load Configuration

In [None]:
# Load config.yaml
config = parse_config()

## Data Mining

In [123]:
# TODO: adapt the import function for remote URLs
# TODO: add local caching
# TODO: look for local source files first; if not found, fetch from source, clean, and save
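
The caching TODOs above could be approached with a small load-or-fetch helper. The following is only a sketch under stated assumptions: `load_with_cache` and the `data/raw` cache folder are hypothetical names that are not part of the project's `mining` module, and the sources are assumed to be CSV files.

```python
from pathlib import Path

import pandas as pd


def load_with_cache(source, cache_dir="data/raw"):
    """Return a DataFrame for `source`, reusing a local copy when one exists.

    Hypothetical helper: `source` may be a local path or a URL to a CSV file.
    """
    cache_path = Path(cache_dir) / Path(str(source)).name
    if cache_path.exists():
        # Cache hit: read the local copy instead of the original source
        return pd.read_csv(cache_path)
    # Cache miss: pandas.read_csv accepts both local paths and URLs
    df = pd.read_csv(source)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(cache_path, index=False)
    return df
```

The cache key here is just the file name, so two sources sharing a file name would collide; a real implementation might hash the full source string instead.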

In [124]:
# Build a dictionary mapping each table name in the config to its imported DataFrame
dataframes = {name: import_data_from_config(config, name) for name in config['tables']}

## Data Cleaning

In [125]:
# TODO: don't impose categories?

In [126]:
# Rename columns
dataframes = rename_columns(dataframes, config)

In [127]:
# Select columns
dataframes = select_columns(dataframes, config)

In [128]:
# Clean categorical data
dataframes = clean_categorical_data(dataframes, config)

In [129]:
# Convert types
dataframes = convert_types(dataframes, config)

In [None]:
# Preview each DataFrame: first rows, dtypes, and category counts
display_dataFrames(dataframes, 'head', 'dtypes', 'cat_count')

In [131]:
# Handle duplicates
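
One possible way to fill this cell is sketched below. It assumes that `dataframes` is a dictionary mapping table names to pandas DataFrames, as built in the Data Mining step, and that fully identical rows are safe to drop. `drop_exact_duplicates` is a hypothetical helper and not part of the `cleaning` module; the toy table stands in for the real data.

```python
import pandas as pd


def drop_exact_duplicates(frames):
    """Drop fully duplicated rows from every DataFrame in a dict (hypothetical helper)."""
    cleaned = {}
    for name, df in frames.items():
        n_dupes = int(df.duplicated().sum())
        print(f"{name}: {n_dupes} duplicated rows")
        cleaned[name] = df.drop_duplicates().reset_index(drop=True)
    return cleaned


# Toy data standing in for the real tables
toy_frames = {
    "visits": pd.DataFrame({
        "visit_id": ["a", "a", "b"],
        "process_step": ["start", "start", "start"],
    })
}
deduplicated = drop_exact_duplicates(toy_frames)
```

Whether partial duplicates (for example, the same `visit_id` and `process_step` with different timestamps) should also be removed is a separate analytical decision that this sketch does not make.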

In [132]:
# Handle missing values
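
Before choosing between dropping and imputing, it may help to quantify what is missing. The sketch below assumes the same dictionary-of-DataFrames layout as above; `missing_report` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd


def missing_report(frames):
    """Return one row per (table, column) with missing counts and shares (hypothetical helper)."""
    rows = []
    for name, df in frames.items():
        for col in df.columns:
            n_missing = int(df[col].isna().sum())
            share = n_missing / len(df) if len(df) else 0.0
            rows.append({"table": name, "column": col,
                         "n_missing": n_missing, "share": share})
    return pd.DataFrame(rows)


# Toy data standing in for the real tables
toy_tables = {
    "clients": pd.DataFrame({
        "client_id": [1, 2, 3, 4],
        "clnt_age": [30.0, None, 45.5, None],
    })
}
report = missing_report(toy_tables)
```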

### SQL Export

In [133]:
# Load environment variables
load_dotenv()
db_password = os.getenv('SQL_PASSWORD')

In [134]:
# Create database if it doesn't exist
engine = create_db(db_password, config)

In [None]:
# Export tables to database if refresh is set to true
export_dataframes_to_sql(engine, dataframes, config)

## Data Re-import

In [136]:
# Import data from database
cleaned_dfs = import_all_tables_from_sql(engine)

### Local Caching

In [137]:
# Save files locally in an untracked folder
export_dataframes_to_csv(cleaned_dfs)

In [None]:
# Reload the cleaned tables from the local CSV cache
clients_df = import_data(['data/cleaned/clients.csv'])
experiment_df = import_data(['data/cleaned/experiment.csv'])
visits_df = import_data(['data/cleaned/visits.csv'])

In [None]:
display('clients:', clients_df, 'experiment:', experiment_df, 'visits:', visits_df)

## Data Exploration

In [22]:
# Handle outliers
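
A common starting point for this cell is Tukey's fences, which flag values lying more than `k` times the interquartile range beyond the quartiles. The sketch below is an illustration only: `tukey_outlier_mask` is a hypothetical helper, and the balances are made-up numbers rather than values from the `bal` column.

```python
import pandas as pd


def tukey_outlier_mask(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Made-up balances with one extreme value
balances = pd.Series([100.0, 120.0, 130.0, 110.0, 125.0, 10_000.0], name="bal")
outliers = balances[tukey_outlier_mask(balances)]
```

For heavily skewed columns such as account balances, it may be worth inspecting flagged rows before removing them, since large values can be legitimate.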

In [23]:
# Frequency tables
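
Frequency tables for categorical fields can be built with `value_counts` and `pd.crosstab`. The sketch below uses column names from the metadata list, but the rows and the `Test`/`Control` and `F`/`M` labels are assumptions made for illustration; it also assumes the experiment and client tables have already been joined on `client_id`.

```python
import pandas as pd

# Toy data; values and category labels are made up
toy_df = pd.DataFrame({
    "variation": ["Test", "Control", "Test", "Test", "Control"],
    "gendr": ["F", "M", "M", "F", "F"],
})

# One-way frequency table: absolute counts and proportions
counts = toy_df["variation"].value_counts()
proportions = toy_df["variation"].value_counts(normalize=True)

# Two-way frequency table (contingency table)
crosstab = pd.crosstab(toy_df["variation"], toy_df["gendr"])
```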

## Analysis

In [24]:
# TODO: consider binning / pd.cut / qcut for numerical data
# TODO: correlation matrix
# TODO: tukeys_test_outliers
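
The binning and correlation TODOs could look roughly like the sketch below. The column names come from the metadata list; the rows, bin edges, and labels are assumptions chosen only to make the example small.

```python
import pandas as pd

# Toy data; values are made up
toy = pd.DataFrame({
    "clnt_age": [22, 35, 47, 58, 63, 71],
    "bal": [1_000.0, 5_000.0, 20_000.0, 40_000.0, 80_000.0, 120_000.0],
    "num_accts": [1, 2, 2, 3, 3, 4],
})

# Fixed bins with explicit edges (the right edge is included by default)
toy["age_group"] = pd.cut(
    toy["clnt_age"],
    bins=[0, 30, 50, 70, 120],
    labels=["under_30", "30_to_50", "50_to_70", "over_70"],
)

# Quantile-based bins: roughly the same number of rows in each bin
toy["bal_tercile"] = pd.qcut(toy["bal"], q=3, labels=["low", "mid", "high"])

# Pearson correlation matrix over the numeric columns only
corr = toy[["clnt_age", "bal", "num_accts"]].corr()
```

`pd.cut` suits bins with a business meaning (age bands), while `pd.qcut` suits skewed variables such as balances, where equal-width bins would leave most rows in a single bin.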

In [None]:
# TODO: check whether clients move back and forth between steps (a possible sign of being lost)
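
One way to quantify back-and-forth navigation is to count, within each visit, how often a client moves to an earlier step. The sketch below is an assumption-heavy illustration: the step labels in `STEP_ORDER` and their order are guesses that should be checked against the actual values in `process_step`, `count_backward_moves` is a hypothetical helper, and the timestamps are made up.

```python
import pandas as pd

# Assumed step order; adjust to the labels actually present in `process_step`
STEP_ORDER = {"start": 0, "step_1": 1, "step_2": 2, "step_3": 3, "confirm": 4}


def count_backward_moves(visits):
    """Count, per visit, how often a client moves to an earlier step (hypothetical helper)."""
    ordered = visits.sort_values(["visit_id", "date_time"]).copy()
    ordered["step_rank"] = ordered["process_step"].map(STEP_ORDER)
    # Difference between consecutive step ranks within the same visit
    delta = ordered.groupby("visit_id")["step_rank"].diff()
    return (delta < 0).groupby(ordered["visit_id"]).sum()


# Toy visits: v1 goes back to "start" once, v2 only moves forward
toy_visits = pd.DataFrame({
    "visit_id": ["v1", "v1", "v1", "v1", "v2", "v2"],
    "process_step": ["start", "step_1", "start", "step_1", "start", "step_1"],
    "date_time": pd.to_datetime([
        "2017-04-01 10:00", "2017-04-01 10:01", "2017-04-01 10:02",
        "2017-04-01 10:03", "2017-04-02 09:00", "2017-04-02 09:05",
    ]),
})
backward = count_backward_moves(toy_visits)
```

Comparing the distribution of these counts between the two values of `variation` would be one way to test whether one design leaves clients lost more often.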

## Visualizations

## Conclusions