# VANGUARD AB TEST


## METADATA HELP

The fields below describe the client, session, and experiment data used in this analysis.

- **client_id**: Every client’s unique ID.
- **variation**: Indicates if a client was part of the experiment.
- **visitor_id**: A unique ID for each client-device combination.
- **visit_id**: A unique ID for each web visit/session.
- **process_step**: Marks each step in the digital process.
- **date_time**: Timestamp of each web activity.
- **clnt_tenure_yr**: Represents how long the client has been with Vanguard, measured in years.
- **clnt_tenure_mnth**: The client’s tenure with Vanguard, measured in months.
- **clnt_age**: Indicates the age of the client.
- **gendr**: Specifies the client’s gender.
- **num_accts**: Denotes the number of accounts the client holds with Vanguard.
- **bal**: Gives the total balance spread across all accounts for a particular client.
- **calls_6_mnth**: Records the number of times the client reached out over a call in the past six months.
- **logons_6_mnth**: Reflects the frequency with which the client logged onto Vanguard’s platform over the last six months.


In [None]:
%load_ext autoreload
%autoreload 2

In [224]:
import pandas as pd
from cleaning import *
from mining import *
from dotenv import load_dotenv
import os
import yaml

In [None]:
# Load environment variables
load_dotenv()

In [None]:
# Load config
config = parse_config()

## Data Mining

In [227]:
#TODO: adapt function to remote url + save to sql to prevent large repo

In [228]:
dataFrames = { name:import_data_from_config(config, name) for name in config['tables']}

In [None]:
display_dataFrames(dataFrames)

## Data Cleaning

In [230]:
#TODO: don't impose categories?
#TODO: consider binning / pd.cut / qcut for numerical data
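
The binning TODO above could look roughly like the sketch below. The ages, bin edges, and labels are made up for illustration and are not taken from the project config; `pd.cut` uses fixed edges, while `pd.qcut` splits on quantiles so each bin holds a similar number of rows.

```python
import pandas as pd

# Hypothetical ages; the real values would come from the clnt_age column.
ages = pd.Series([22, 35, 41, 58, 67, 73], name="clnt_age")

# pd.cut: fixed, hand-picked bin edges (intervals are right-inclusive by default).
age_bands = pd.cut(
    ages,
    bins=[0, 30, 50, 70, 120],
    labels=["<=30", "31-50", "51-70", "71+"],
)

# pd.qcut: quantile-based bins with roughly equal numbers of observations.
age_terciles = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"clnt_age": ages, "band": age_bands, "tercile": age_terciles}))
```

Fixed edges are easier to explain in a report; quantile bins adapt to skewed columns such as `bal`, at the cost of edges that change with the data.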

In [231]:
# Rename columns
dataFrames = rename_columns(dataFrames, config)

In [None]:
# select columns
dataFrames = select_columns(dataFrames, config)
display_dataFrames(dataFrames)

In [None]:
# Data Categorizing
def clean_categorical_data(dataFrames, config):
    """Cast configured columns to categoricals restricted to their valid categories."""
    for table in config['tables']:
        for column in config['tables'][table]['columns']:

            column_config = config['tables'][table]['columns'][column]
            valid_categories = column_config.get('valid_categories')

            # Columns without valid_categories are left untouched
            #TODO : default values for categories if no valid_categories
            if valid_categories:
                # Values outside valid_categories become NaN
                dataFrames[table][column] = dataFrames[table][column].astype('category')
                dataFrames[table][column] = dataFrames[table][column].cat.set_categories(valid_categories)

                # Fill NaN with the configured fallback, or 'unknown' if none is set
                fallback = column_config.get('fallback_category') or 'unknown'
                if fallback not in valid_categories:
                    # fillna raises if the fill value is not an existing category
                    dataFrames[table][column] = dataFrames[table][column].cat.add_categories([fallback])
                dataFrames[table][column] = dataFrames[table][column].fillna(fallback)
    return dataFrames



dataFrames = clean_categorical_data(dataFrames, config)

display_categorical_value_counts(dataFrames)
display_dataFrames(dataFrames)
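
To make the categorical cleaning step easier to follow, here is a minimal, self-contained sketch on toy data. The config fragment, the `demo` table name, and the category values are assumptions for illustration only; the real structure lives in the project's config file, which is not shown here.

```python
import pandas as pd

# Hypothetical config fragment; table name and categories are illustrative only.
config = {
    "tables": {
        "demo": {
            "columns": {
                "gendr": {"valid_categories": ["M", "F", "U"], "fallback_category": "U"},
                "variation": {"valid_categories": ["Test", "Control"]},
            }
        }
    }
}

raw = pd.DataFrame({
    "gendr": ["M", "F", "X", None],
    "variation": ["Test", "Control", None, "Test"],
})

cleaned = raw.copy()
for column, column_config in config["tables"]["demo"]["columns"].items():
    valid = column_config["valid_categories"]
    fallback = column_config.get("fallback_category", "unknown")
    # The fallback must itself be a category before it can be used as a fill value.
    categories = valid if fallback in valid else [*valid, fallback]
    # Values outside `categories` (such as "X") become NaN, then take the fallback.
    cleaned[column] = pd.Categorical(cleaned[column], categories=categories)
    cleaned[column] = cleaned[column].fillna(fallback)

print(cleaned)
```

Note that an unexpected value such as `"X"` is silently folded into the fallback, so it is worth checking value counts on the raw data before relying on the cleaned column.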


In [None]:
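# Convert types
def convert_types(dataFrames, config):
    """Cast each configured column to the pandas_dtype given in the config."""
    for table in config['tables']:
        for column in config['tables'][table]['columns']:
            column_config = config['tables'][table]['columns'][column]
            dtype = column_config.get('pandas_dtype')
            if dtype is None:
                # No target dtype configured: leave the column as-is
                continue
            if dtype == 'int64':
                # int64 cannot hold NaN, so missing values are replaced with 0 first
                dataFrames[table][column] = dataFrames[table][column].fillna(0)  # or use dropna()
            dataFrames[table][column] = dataFrames[table][column].astype(dtype)
    return dataFrames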

dataFrames = convert_types(dataFrames, config)
display_dataFrames(dataFrames)
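
Filling missing integers with `0` before casting to `int64` makes a missing value indistinguishable from a real zero. One possible alternative, sketched below on made-up values, is pandas' nullable `Int64` dtype (capital "I"), which keeps missing entries as `<NA>`. Whether this fits the project depends on how downstream code handles `<NA>`, so treat it as an option rather than a recommendation.

```python
import pandas as pd

# Hypothetical values with one missing entry.
num_accts = pd.Series([2.0, None, 3.0], name="num_accts")

# Current approach: the missing value turns into a real-looking 0.
filled = num_accts.fillna(0).astype("int64")

# Nullable integer dtype: the missing value is preserved as <NA>.
nullable = num_accts.astype("Int64")

print(filled.tolist())
print(nullable.tolist())
print("missing after conversion:", nullable.isna().sum())
```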

In [235]:
# Data formatting

In [236]:
# Data Typing

In [237]:
# Handle duplicates

In [238]:
# Handle missing values

## Data Exploration

In [239]:
# Handle outliers

In [240]:
#frequency tables

## Analysis

## Visualizations

## Conclusions