## Summary 

Thank you for your interest in Vira Health's data science team!  

This jupyter notebook contains instructions for a short task which will give you an insight into some of what Vira Health is working on.   

The task is structured around building a simple 1 page dashboard that summarises what you think are the key characteristics of the datasets provided.  

Data provided is courtesy of the Study of Women's Health Across the Nation (SWAN) and is publicly available from their [website](https://www.swanstudy.org/).      

Additional documentation can be found in the ICPSR data repository [here](https://www.icpsr.umich.edu/web/ICPSR/series/00253) and may be helpful to support completion of the task.  

## Step 1: Data exploration 

The /data folder includes data from a questionnaire collected at baseline ("swan_1996_97_baseline.csv") and two annual follow-up visits ("swan_1997_99_visit1.csv", "swan_1998_00_visit2.csv").  

Documentation for the baseline visit with details of the questionnaire and variables referenced is also included in the /data folder as "baseline-visit-codebook-PI.pdf".  

As a first step, please load in the data in this python notebook and conduct whatever exploration you need to decide - **"What are the key characteristics of these datasets?"**.   

To help focus, remember that the overall aim of the task is to **build a simple 1 page dashboard that summarises the key characteristics of the datasets**.    

Example exploration could include answering sub-questions such as, what is the size of each sample? how many participants have data in all follow-up visits?  

Please include inline code comments or markdown to explain your approach.  

Note, this exploration is not expected to be comprehensive, but if there are further analyses you would conduct to help you understand these datasets please include them in your commentary and explain what you would do.    

# Summary of response


Overall, my strategy is:
1. Clean the data into a state suitable for interrogation by data consumers (e.g. nulls cleaned, datetimes converted), who can then come up with detailed requirements for what to do with this dataset
2. Produce a data quality dashboard for data consumers to use in conjunction with the data itself

The idea is that data consumers would then be in a position to iteratively set data engineering requirements for this dataset.

# 0. Install dependencies

In [1]:
!pip install -r requirements.txt

Collecting plotly
  Using cached plotly-5.13.1-py2.py3-none-any.whl (15.2 MB)
Collecting jupyter-dash
  Using cached jupyter_dash-0.4.2-py3-none-any.whl (23 kB)
Collecting tenacity>=6.2.0
  Using cached tenacity-8.2.1-py3-none-any.whl (24 kB)
Collecting flask
  Using cached Flask-2.2.3-py3-none-any.whl (101 kB)
Collecting dash
  Using cached dash-2.8.1-py3-none-any.whl (9.9 MB)
Collecting ansi2html
  Using cached ansi2html-1.8.0-py3-none-any.whl (16 kB)
Collecting retrying
  Using cached retrying-1.3.4-py3-none-any.whl (11 kB)
Collecting dash-core-components==2.0.0
  Using cached dash_core_components-2.0.0-py3-none-any.whl (3.8 kB)
Collecting dash-table==5.0.0
  Using cached dash_table-5.0.0-py3-none-any.whl (3.9 kB)
Collecting dash-html-components==2.0.0
  Using cached dash_html_components-2.0.0-py3-none-any.whl (4.1 kB)
Collecting itsdangerous>=2.0
  Using cached itsdangerous-2.1.2-py3-none-any.whl (15 kB)
Collecting click>=8.0
  Using cached click-8.1.3-py3-none-any.whl (96 kB)
Coll

# 1. Import packages

In [2]:
import pandas as pd
import re
import numpy as np

# 2. Load data


In [3]:
baseline_df = pd.read_csv("data/swan_1996_97_baseline.csv")
visit1_df = pd.read_csv("data/swan_1997_99_visit1.csv")
visit2_df = pd.read_csv("data/swan_1998_00_visit2.csv")
dfs = {'baseline_df': baseline_df, 'visit1_df': visit1_df,'visit2_df': visit2_df}



# 3. Get idea of scale of data

In [4]:
for name, df in dfs.items():
    print(f'{name} dims: {df.shape}')

baseline_df dims: (3302, 741)
visit1_df dims: (2881, 576)
visit2_df dims: (2748, 551)


A couple of things are apparent here:
   1. The data is _very_ wide. Because of this, it's easiest to get a feel for what's going on outside of the notebook; Excel suits fine for CSVs on disk. If the location differed, we could choose a different solution (e.g. AWS Athena for S3 data, an RDBMS for data in a database, etc.)
   2. The number of rows decreases from baseline to the 2nd visit. Whether this is due to inherent drop-off in the study or a potential data issue is unknown at this stage
   3. The number of columns decreases similarly.
    
Action points at this stage, after getting eyes on the data in Excel:
1. Each table has column suffixes which we may or may not want to remove
2. We have a lot of text data that needs transforming into binary/category variables
3. We have a lot of 'NA' strings that need cleaning

### Handling text data

Given my current lack of understanding of the data, I will leave multi-category survey results (more than two response options) as they are until a requirement is given to change them. However, we can safely convert binary yes/no text columns to boolean and save some memory.

In [5]:
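def _convert_bools(df):
    # Convert columns whose only non-null values are '(1) No' / '(2) Yes' to nullable booleans
    out_df = df.copy()
    yes_no = {'(1) No': False, '(2) Yes': True}
    for col in df.columns:
        # Compare as a set so the order in which the values first appear does not matter
        if set(df[col].dropna().unique()) == set(yes_no):
            out_df[col] = out_df[col].map(yes_no).astype("boolean")
    return out_df

def convert_bools(dict_of_dfs):
    out = {}
    # Iterate over the argument rather than the global `dfs`
    for name, df in dict_of_dfs.items():
        out[name] = _convert_bools(df)
    return out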
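
As a quick illustration of the yes/no conversion described above, here is a minimal sketch on a toy frame. The column names `SMOKER` and `AGEGRP` and their values are made up for the example and are not taken from the SWAN codebook.

```python
import pandas as pd

def yes_no_to_boolean(series: pd.Series) -> pd.Series:
    """Map '(1) No' / '(2) Yes' labels to a nullable boolean column."""
    mapping = {"(1) No": False, "(2) Yes": True}
    return series.map(mapping).astype("boolean")

# Hypothetical columns, for illustration only
toy = pd.DataFrame({
    "SMOKER": ["(2) Yes", "(1) No", None, "(2) Yes"],
    "AGEGRP": ["(1) 42-45", "(2) 46-49", "(1) 42-45", None],
})

# A column qualifies only if its non-null values are exactly the two labels
is_yes_no = {
    col: set(toy[col].dropna().unique()) == {"(1) No", "(2) Yes"}
    for col in toy.columns
}
converted = yes_no_to_boolean(toy["SMOKER"])
```

The nullable `"boolean"` dtype keeps missing answers as `<NA>` instead of silently turning them into `False`.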

### Handling missing values

Luckily pandas assumes 'NA' = NaN. But we still have other examples of missing data:

In [6]:
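def strip_whitespace(dict_of_dfs):
    out = {}
    # Iterate over the argument rather than the global `dfs`
    for name, df in dict_of_dfs.items():
        out[name] = _strip_whitespace(df)
    return out

def _strip_whitespace(df):
    # Strip leading/trailing spaces from every text (object) column.
    # Note: .str methods return NaN for non-string elements in mixed-type columns.
    out_df = df.copy()
    for col in out_df.loc[:, out_df.dtypes == object].columns:
        out_df[col] = out_df[col].str.strip(' ')
    return out_df

def clean_nans(dict_of_dfs):
    out = {}
    for name, df in dict_of_dfs.items():
        out[name] = _clean_nans(df)
    return out

def _clean_nans(df):
    # Replace the text sentinels '.', '' and '-1' with NaN in text (object) columns only
    out_df = df.copy()
    for col in out_df.loc[:, out_df.dtypes == object].columns:
        if out_df[col].isin(['.', '', '-1']).any():
            out_df[col] = out_df[col].replace({'.': np.nan, '': np.nan, '-1': np.nan})
    return out_df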

NB _clean_nans()_ should be safe for cols like LMPDAY0 which have negative integer values, and are not read as object.
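
To make the note above concrete, the sketch below applies the same strip-then-replace idea to a toy frame. `TEXTCOL` and `DAYS` are hypothetical column names; `DAYS` stands in for a numeric column such as LMPDAY0, where a negative number is a real value.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "TEXTCOL": ["  .", "-1", " (2) Yes ", ""],  # hypothetical text column
    "DAYS": [-1, 3, -7, 12],                     # numeric, so -1 is left alone
})

cleaned = toy.copy()
# Only touch non-numeric columns so genuine negative numbers survive
text_cols = [c for c in cleaned.columns
             if not pd.api.types.is_numeric_dtype(cleaned[c])]
for col in text_cols:
    stripped = cleaned[col].str.strip()
    cleaned[col] = stripped.replace({".": np.nan, "": np.nan, "-1": np.nan})
```

Stripping has to happen before the replacement, otherwise padded sentinels such as `'  .'` would not match.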

### Converting datetimes

Do any columns contain dates written with '/'?

In [7]:
for name,df in dfs.items():
    for col in df.columns:
        if df[col].dtype == 'object':
            if df[col].str.contains('/').any():
                print(df[col])


0       (1) Yes, as per protocol
1       (1) Yes, as per protocol
2       (1) Yes, as per protocol
3       (1) Yes, as per protocol
4       (1) Yes, as per protocol
                  ...           
3297    (1) Yes, as per protocol
3298    (1) Yes, as per protocol
3299       (3) Yes, Last attempt
3300    (1) Yes, as per protocol
3301    (1) Yes, as per protocol
Name: BLDRWAT0, Length: 3302, dtype: object
0       (3) Occasionally/Mod Amt Of The Time (3-4 Days)
1              (2) Some/A Little Of The Time (1-2 Days)
2              (2) Some/A Little Of The Time (1-2 Days)
3              (2) Some/A Little Of The Time (1-2 Days)
4              (2) Some/A Little Of The Time (1-2 Days)
                             ...                       
3297           (2) Some/A Little Of The Time (1-2 Days)
3298              (1) Rarely/None Of The Time (< 1 Day)
3299              (1) Rarely/None Of The Time (< 1 Day)
3300              (1) Rarely/None Of The Time (< 1 Day)
3301              (1) Rarely/None

0       (1) None Or Less Than 1 Hr/Wk
1       (1) None Or Less Than 1 Hr/Wk
2       (1) None Or Less Than 1 Hr/Wk
3       (1) None Or Less Than 1 Hr/Wk
4               (3) 20 Hrs Or More/Wk
                    ...              
3297            (3) 20 Hrs Or More/Wk
3298    (1) None Or Less Than 1 Hr/Wk
3299    (1) None Or Less Than 1 Hr/Wk
3300    (1) None Or Less Than 1 Hr/Wk
3301    (1) None Or Less Than 1 Hr/Wk
Name: CHLDCAR0, Length: 3302, dtype: object
0       (2) Between 1 And 2 Hrs/Day
1           (3) More Than 2 Hrs/Day
2           (3) More Than 2 Hrs/Day
3        (1) 1 Hour Or Less Per Day
4       (2) Between 1 And 2 Hrs/Day
                   ...             
3297        (3) More Than 2 Hrs/Day
3298    (2) Between 1 And 2 Hrs/Day
3299     (1) 1 Hour Or Less Per Day
3300    (2) Between 1 And 2 Hrs/Day
3301        (3) More Than 2 Hrs/Day
Name: PREPMEA0, Length: 3302, dtype: object
0                       (3) Daily Or More
1                       (3) Daily Or More
2             

0                (458) Hairdressers & cosmetologists
1                        (379) General office clerks
2                                                NaN
3           (033) Purchasing agents & buyers, n.e.c.
4                            (095) Registered nurses
                            ...                     
3297                                             NaN
3298    (037) Management related occupations, n.e.c.
3299               (156) Teachers, elementary school
3300         (022) Managers & administrators, n.e.c.
3301                                             NaN
Name: OCCUP0, Length: 3302, dtype: object
0                           (5) Hispanic
1           (2) Chinese/Chinese American
2       (4) Caucasian/White Non-Hispanic
3       (4) Caucasian/White Non-Hispanic
4             (1) Black/African American
                      ...               
3297          (1) Black/African American
3298      (3) Japanese/Japanese American
3299    (4) Caucasian/White Non-Hispanic
3300  

0                                     NaN
1                                     NaN
2                                     NaN
3                                     NaN
4                                     NaN
                      ...                
2876                                  NaN
2877                                  NaN
2878                                  NaN
2879    (3) Infrequently (several x/year)
2880             (5) Rarely (< once/year)
Name: CONTCT11, Length: 2881, dtype: object
0             (7) Have not moved
1                      (6) Never
2       (5) Rarely (< once/year)
3                      (6) Never
4                     (4) Yearly
                  ...           
2876          (7) Have not moved
2877                  (1) Weekly
2878          (7) Have not moved
2879                         NaN
2880                         NaN
Name: CONTCT21, Length: 2881, dtype: object
0       (4) Infrequently(several x/year)
1                              (1) Daily
2    

0           (3) About once/week
1       (4) More than once/week
2       (4) More than once/week
3       (2) Once or twice/month
4                (1) Not at all
                 ...           
2876    (2) Once or twice/month
2877    (4) More than once/week
2878    (2) Once or twice/month
2879        (3) About once/week
2880    (2) Once or twice/month
Name: DESIRSE1, Length: 2881, dtype: object
0                     (5) Daily
1                           NaN
2                     (5) Daily
3           (3) About once/week
4                           NaN
                 ...           
2876                        NaN
2877                  (5) Daily
2878                        NaN
2879    (4) More than once/week
2880                        NaN
Name: KISSING1, Length: 2881, dtype: object
0                     (5) Daily
1                           NaN
2       (4) More than once/week
3       (4) More than once/week
4                           NaN
                 ...           
2876            

0              (2) Some/a little of the time (1-2 days)
1                 (1) Rarely/none of the time (< 1 day)
2       (3) Occasionally/mod amt of the time (3-4 days)
3                 (1) Rarely/none of the time (< 1 day)
4                 (1) Rarely/none of the time (< 1 day)
                             ...                       
2743           (2) Some/a little of the time (1-2 days)
2744              (1) Rarely/none of the time (< 1 day)
2745           (2) Some/a little of the time (1-2 days)
2746              (1) Rarely/none of the time (< 1 day)
2747              (1) Rarely/none of the time (< 1 day)
Name: TALKLES2, Length: 2748, dtype: object
0                 (1) Rarely/none of the time (< 1 day)
1                 (1) Rarely/none of the time (< 1 day)
2       (3) Occasionally/mod amt of the time (3-4 days)
3                 (1) Rarely/none of the time (< 1 day)
4                 (1) Rarely/none of the time (< 1 day)
                             ...                       
2743

0               (4) Yes, 3-4 times/wk
1               (3) Yes, 1-2 times/wk
2       (1) No, not in the past 2 wks
3       (1) No, not in the past 2 wks
4       (1) No, not in the past 2 wks
                    ...              
2743    (1) No, not in the past 2 wks
2744    (1) No, not in the past 2 wks
2745             (2) Yes, < once a wk
2746    (1) No, not in the past 2 wks
2747             (2) Yes, < once a wk
Name: TRBLSLE2, Length: 2748, dtype: object
0         (5) Yes, 5 or more times/wk
1               (3) Yes, 1-2 times/wk
2               (4) Yes, 3-4 times/wk
3               (3) Yes, 1-2 times/wk
4       (1) No, not in the past 2 wks
                    ...              
2743    (1) No, not in the past 2 wks
2744            (3) Yes, 1-2 times/wk
2745            (3) Yes, 1-2 times/wk
2746    (1) No, not in the past 2 wks
2747            (3) Yes, 1-2 times/wk
Name: WAKEUP2, Length: 2748, dtype: object
0         (5) Yes, 5 or more times/wk
1                (2) Yes, < once a wk
2

The '/' matches shown above are survey labels rather than dates. Are there any times with ':'? Yes! Let's convert them for downstream consumption.

In [8]:
for name,df in dfs.items():
    for col in df.columns:
        if df[col].dtype == 'object':
            if df[col].str.contains(':').any():
                print(f'{name},{col}, {df[col].unique()[:10]}')

baseline_df,SPSCTIM0, ['           .' '     0:09:03' '     0:15:07' '     0:18:29'
 '     0:13:51' '     0:09:08' '     0:10:31' '     0:09:31'
 '     0:11:10' '     0:10:02']
baseline_df,HPSCTIM0, ['           .' '     0:08:53' '     0:15:17' '     0:18:13'
 '     0:14:01' '     0:09:17' '     0:10:41' '     0:09:35'
 '     0:10:56' '     0:10:08']
baseline_df,HPSCMOD0, [nan '(05) 5: 2000 machine' '(11) 11: 4500 machine']
visit1_df,STRTIM11, ['           .' '     9:00:00' '     8:00:00' '     8:30:00'
 '     7:00:00' '    17:00:00' '     7:30:00' '     6:00:00'
 '     9:30:00' '     8:15:00']
visit1_df,STPTIM11, ['           .' '    17:00:00' '    15:00:00' '    15:15:00'
 '    14:00:00' '    11:30:00' '     1:30:00' '    17:30:00'
 '    16:00:00' '    14:30:00']
visit1_df,SPSCTIM1, ['     0:10:08' '     0:12:55' '     0:18:01' '           .'
 '     0:10:47' '     0:11:36' '     0:10:03' '     0:12:44'
 '     0:12:59' '     0:16:12']
visit1_df,HPSCTIM1, ['     0:09:52' '     0:13:03' 

In [9]:
def _convert_dt(df_name, col, dict_of_dfs, _format):
    # Parse a cleaned text column in place; missing values become NaT.
    # With a time-only format, pandas fills in the default date 1900-01-01.
    dict_of_dfs[df_name][col] = pd.to_datetime(dict_of_dfs[df_name][col], format=_format)

def mutate_dfs_convert_dts(dict_of_dfs):
    # Columns holding values like '9:00:00' or '17:30:00'
    format_1_cols = [
        ('visit1_df', 'STRTIM11'),
        ('visit1_df', 'STPTIM11'),
        ('visit2_df', 'STRTIM12'),
        ('visit2_df', 'STPTIM12'),
        ('visit2_df', 'STRTIM22'),
        ('visit2_df', 'STPTIM22'),
        # ('visit2_df', 'STRTIM32'),
        # ('visit2_df', 'STPTIM32'),
    ]
    # Columns holding values like '0:09:03'
    format_2_cols = [
        ('baseline_df', 'SPSCTIM0'),
        ('baseline_df', 'HPSCTIM0'),
        ('visit1_df', 'SPSCTIM1'),
        ('visit1_df', 'HPSCTIM1'),
        ('visit2_df', 'SPSCTIM2'),
    ]

    for coltuple in format_1_cols:
        _convert_dt(*coltuple, dict_of_dfs, '%H:%M:%S')
    for coltuple in format_2_cols:
        # Note: this format reads the last two fields as hour and minute
        _convert_dt(*coltuple, dict_of_dfs, '0:%H:%M')
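
For comparison, here is a small sketch of two ways these cleaned strings could be parsed. Treating the `0:09:03`-style values as elapsed times is an assumption on my part that the codebook would need to confirm; the toy Series below only mimic the shapes seen in the output above.

```python
import pandas as pd

# Toy values shaped like the cleaned (whitespace-stripped) columns
clock = pd.Series(["9:00:00", "17:30:00", None])   # STRTIM-style clock times
elapsed = pd.Series(["0:09:03", "0:15:07", None])  # SPSCTIM-style values

# Clock times: parsed as datetimes on the default date 1900-01-01
clock_parsed = pd.to_datetime(clock, format="%H:%M:%S")

# Assumption: if these are durations, a timedelta keeps hours:minutes:seconds intact
elapsed_parsed = pd.to_timedelta(elapsed)
```

A timedelta column can then be turned into seconds with `.dt.total_seconds()`, which is convenient for plotting or summary statistics.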



# 4. Thinking about data quality

## 4a. Looking at data richness by column

Let's assume that a column that is over 95% null is less useful for analytics because of its small effective sample size. Quantifying them:

In [10]:
def get_nulls_ratios(dict_of_dfs): 
    out = {}
    for name, df in dict_of_dfs.items():  # use the argument rather than the global `dfs`
        nulls = (df.isnull().mean() *100).round(2)
        out[name + 'nulls_stats'] = (
            pd.DataFrame(
                nulls[nulls>=95]
            )
            .reset_index()
            .rename(columns = {'index':'colname',0:'percent_null'})
        )
    return out
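
The null-ratio calculation can be checked on a toy frame. This is a minimal sketch with made-up column names, using the same 95% threshold as above.

```python
import numpy as np
import pandas as pd

# Hypothetical columns with known proportions of missing values
toy = pd.DataFrame({
    "MOSTLY_EMPTY": [np.nan] * 19 + [1.0],  # 95% null
    "HALF_EMPTY": [np.nan, 2.0] * 10,       # 50% null
    "FULL": range(20),                      # 0% null
})

# Mean of the boolean null mask is the fraction of nulls per column
percent_null = (toy.isnull().mean() * 100).round(2)

# Keep only the sparse columns, as a tidy two-column table
sparse = (
    percent_null[percent_null >= 95]
    .rename_axis("colname")
    .reset_index(name="percent_null")
)
```

Only `MOSTLY_EMPTY` reaches the threshold here, so `sparse` has a single row.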

We may want to remove any columns over 97.5% null to reduce the dimensions of our data.

In [11]:
def filter_out_nans_mutates(dict_of_dfs):
    names = ['baseline_df','visit1_df','visit2_df']
    for name in names:
        cols_to_drop = dict_of_dfs[name + 'nulls_stats']
        cols_to_drop = list(cols_to_drop[cols_to_drop.percent_null > 97.5].colname)
        dict_of_dfs[name] = dict_of_dfs[name].drop(columns=cols_to_drop)


## 4b. Quantify how columns are shared between tables

In [12]:
def get_clean_column_names(dfname, df,df_suffix_dict):
    
    suffix = df_suffix_dict[dfname]
    return [re.sub(f'{suffix}$','',col) for col in df.columns]

def get_columns_in_a_not_in_b(dict_of_dfs,dfname_a,dfname_b):
    df_suffix_dict = {'baseline_df' : '0', 'visit1_df' : '1','visit2_df': '2'}
    cleaned_columns_a = get_clean_column_names(dfname_a, dict_of_dfs[dfname_a], df_suffix_dict)
    cleaned_columns_b = get_clean_column_names(dfname_b, dict_of_dfs[dfname_b], df_suffix_dict)
    set_difference = set(cleaned_columns_a).difference(set(cleaned_columns_b))
    out = pd.DataFrame({'columns':[col+df_suffix_dict[dfname_a] for col in set_difference]})
    return out
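
The suffix handling can be illustrated on a handful of names. In this sketch `AGE1` and `AGE2` are hypothetical columns; the other names appear in the outputs above.

```python
import re

def strip_visit_suffix(column: str, suffix: str) -> str:
    """Remove a trailing visit suffix (e.g. '1' for visit 1) from a column name."""
    return re.sub(f"{re.escape(suffix)}$", "", column)

visit1_cols = ["SWANID", "AGE1", "CONTCT11", "STRTIM11"]  # AGE1 is hypothetical
visit2_cols = ["SWANID", "AGE2", "STRTIM12", "STRTIM22"]  # AGE2 is hypothetical

base1 = {strip_visit_suffix(c, "1") for c in visit1_cols}
base2 = {strip_visit_suffix(c, "2") for c in visit2_cols}

# Columns asked at visit 1 with no counterpart at visit 2
only_in_visit1 = sorted(base1 - base2)
```

One caveat of this approach: a column whose base name happens to end in the same digit as the visit suffix would be truncated one character too far, so the result is worth spot-checking against the codebook.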


Comparing the raw column names gives a much smaller intersection than expected. After interrogating the column names, there are clearly suffixes relating to the stage of the study, so we strip them before comparing, as we have that information at the df level.

In [13]:
get_columns_in_a_not_in_b(dfs,'visit1_df','visit2_df')

Unnamed: 0,columns
0,CVRDAY1
1,GOWRONG1
2,OLDMOVE1
3,LESSPRE1
4,TPARESU1
...,...
61,IMAGNOL1
62,FAILUREA1
63,GOODOLD1
64,APOARES1


This looks more reasonable (in conjunction with the documentation provided, I am happy that no info was lost at this stage). Let's see how that compares to baseline:

In [14]:
get_columns_in_a_not_in_b(dfs,'visit1_df','baseline_df')

Unnamed: 0,columns
0,SEX81
1,RELAT41
2,MOMBORN1
3,BCORIEN1
4,SIDEEFF1
...,...
361,BCPTWI21
362,FARTHER1
363,CHILDRE1
364,REGPERI1


Accounting for the suffixes, visit 1 contains 365 columns with no counterpart in the baseline table, so the baseline and follow-up questionnaires differ substantially, which is not unexpected.

# 5. Check state of SWANID

First, make sure that every SWANID in the visit tables is also present in baseline:

In [15]:
set(visit1_df['SWANID']).difference(set(baseline_df['SWANID']))

set()

In [16]:
set(visit2_df['SWANID']).difference(set(baseline_df['SWANID']))

set()

Good. Let's also make sure they are always unique per table:


In [17]:
for name, df in dfs.items():
    print(f'{name}: {df.SWANID.nunique()/df.SWANID.shape[0]}')

baseline_df: 1.0
visit1_df: 1.0
visit2_df: 1.0


Next, see how SWANIDs are conserved throughout the study, i.e. how many participants drop off at each stage:

In [18]:
def get_cohort_funnel_stats(dfs):
    total = len(dfs['baseline_df']['SWANID'])
    vis1 = len(set(dfs['baseline_df']['SWANID']).intersection(set(dfs['visit1_df']['SWANID'])))
    vis2=len(set(dfs['baseline_df']['SWANID']).intersection(set(dfs['visit1_df']['SWANID'])).intersection(set(dfs['visit2_df']['SWANID'])))
    cohort = pd.DataFrame({'stage':['baseline','visit1','visit2'], 'subjects':[total,vis1,vis2]})
    return cohort
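
The funnel logic is easiest to see with a few made-up IDs. The sketch below uses hypothetical participant IDs and adds a retention percentage, which is an extra column not produced by the function above.

```python
import pandas as pd

# Hypothetical SWANIDs per stage
baseline_ids = {101, 102, 103, 104, 105}
visit1_ids = {101, 102, 104, 105}
visit2_ids = {101, 103, 105}

# A participant counts at a stage only if present at every stage so far
in_visit1 = baseline_ids & visit1_ids
in_all = in_visit1 & visit2_ids

funnel = pd.DataFrame({
    "stage": ["baseline", "visit1", "visit2"],
    "subjects": [len(baseline_ids), len(in_visit1), len(in_all)],
})
funnel["percent_retained"] = (
    funnel["subjects"] / funnel["subjects"].iloc[0] * 100
).round(1)
```

Note that participant 103 misses visit 1 but returns for visit 2, and is therefore not counted in the final stage; a dashboard exhibit may want to show such returners separately.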

## Joining datasets

In [19]:
def set_swanid_index_mutates(dfs):
    for name in ['baseline_df','visit1_df','visit2_df']:
        dfs[name].set_index('SWANID',inplace=True)
        
def drop_redundant_cols_mutates(dfs):
    drop_dict = {'baseline_df': ['VISIT'],'visit1_df': ['VISIT','RACE'], 'visit2_df': ['VISIT','RACE']}
    for dfname, cols in drop_dict.items():
        dfs[dfname] = dfs[dfname].drop(columns=cols)


In [20]:
def join_visits_mutates_(dfs):
    full_data = dfs['baseline_df'].join(dfs['visit1_df'],how = 'left',).join(dfs['visit2_df'], how='left')
    dfs['full_data'] = full_data


NB - running the above left joins, even after filtering out columns over 97.5% null, exceeds data limits in Postgres (most likely the limit of 1,600 columns per table). Some kind of data lake (e.g. S3) or data warehouse (e.g. Redshift) might be preferable for storing this joined data in reality. However, sticking to the limits of my Postgres container, I will load the tables separately into the database.
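
A quick width check before joining can flag this kind of problem early. The sketch below assumes the limit being hit is PostgreSQL's 1,600 columns per table and uses empty frames with the raw column counts reported earlier (741, 576, 551); the helper `joined_width` and the generated column names are hypothetical.

```python
import pandas as pd

POSTGRES_MAX_COLUMNS = 1600  # PostgreSQL's per-table column limit

def joined_width(frames, key="SWANID"):
    """Number of columns a key-based join of the frames would produce."""
    non_key = sum(len(df.columns.difference([key])) for df in frames)
    return non_key + 1  # the key column is kept once

def make_frame(prefix, n_cols):
    # Empty stand-in frame with the key plus generated (made-up) column names
    return pd.DataFrame(columns=["SWANID"] + [f"{prefix}{i}" for i in range(n_cols - 1)])

frames = [make_frame("B", 741), make_frame("V", 576), make_frame("W", 551)]
width = joined_width(frames)
fits_in_postgres = width <= POSTGRES_MAX_COLUMNS
```

With the raw column counts the joined table would be 1,866 columns wide, which does not fit; the same check could be rerun after dropping sparse columns.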

## Testing

Now we have some data processing functions, we ideally want 100% unit test coverage of them. Given the confines of this task, and the fact that we are developing in Jupyter, let's sketch out what some of our unit tests might look like (when included in a pytest testing suite).

In [21]:
def test_strip_whitespace():
    # Text column is stripped; numeric column is left untouched
    _in = pd.DataFrame({
        'col1': [' one', '   . ', 'two', ' '],
        'col2': [1, 2, 3, 4]
    })
    target = pd.DataFrame({
        'col1': ['one', '.', 'two', ''],
        'col2': [1, 2, 3, 4]
    })
    out = _strip_whitespace(_in)
    pd.testing.assert_frame_equal(out, target)

def test_clean_nans():
    # The sentinels '-1', '' and '.' are all replaced with NaN
    _in = pd.DataFrame({
        'col1': ['one', '-1', 'two', ''],
        'col2': ['three', '-1', 'four', '.'],
    })
    target = pd.DataFrame({
        'col1': ['one', np.nan, 'two', np.nan],
        'col2': ['three', np.nan, 'four', np.nan],
    })
    out = _clean_nans(_in)
    pd.testing.assert_frame_equal(out, target)

test_strip_whitespace()
test_clean_nans()

## Step 2: Data aggregation and presentation

Now you have a basic understanding of the data, the next step is to **build a simple 1 page dashboard that summarises the key characteristics of the datasets**.  

For this task, you will need to consider the following areas: 
- How to aggregate the 3 datasets into one data structure (e.g. a database) that can be queried (with your chosen programming language) 
- What exhibits to display
- How to transform the data for the selected exhibits
- How to host the dashboard (note: the dashboard does not require a public URL, you can demo on your local machine)

##### To help simplify this task, please use the following guidelines:
- Stick to a 1 page layout and do not try and prepare more than 6 exhibits 
- Don't worry about 'perfect' styling - we understand this can take a lot of time and we're most interested in your overall approach and the core components of your implementation 

Your implementation will likely require writing code outside of the Jupyter notebook environment.  
So, in this notebook please just provide a summary of your approach and add written detail for what else you would do that you haven't included.    
As in step 1, in your written code, please include inline code comments or markdown to explain your approach.  

# Summary

Through Section 1, we have some metadata related info that we can assume is of some use to our data consumers, as well as some simple data processing that can be the first step towards an ETL pipeline!

So, my data output will be a metadata/QA related dashboard, and some slightly more production-ready code to transform the data.

For the dashboard, I will be using JupyterDash, which allows you to develop Dash apps within a Jupyter environment (it seemed like the ideal tool for this problem).

## Extract

In [None]:
baseline_df = pd.read_csv("data/swan_1996_97_baseline.csv")
visit1_df = pd.read_csv("data/swan_1997_99_visit1.csv")
visit2_df = pd.read_csv("data/swan_1998_00_visit2.csv")
dfs = {'baseline_df': baseline_df, 'visit1_df': visit1_df,'visit2_df': visit2_df}

## Transform

In [None]:
dfs = convert_bools(dfs)
dfs = strip_whitespace(dfs)
dfs = clean_nans(dfs)
dfs = {**dfs,**get_nulls_ratios(dfs)}
filter_out_nans_mutates(dfs)
mutate_dfs_convert_dts(dfs)
drop_redundant_cols_mutates(dfs)
dfs = {**dfs, **{'cols_in_visit1_not_in_baseline':get_columns_in_a_not_in_b(dfs,'visit1_df','baseline_df')}}
dfs = {**dfs, **{'cols_in_visit2_not_in_visit1':get_columns_in_a_not_in_b(dfs,'visit2_df','visit1_df')}}
dfs = {**dfs, **{'cohort_funnel_stats': get_cohort_funnel_stats(dfs)}}
# set_swanid_index_mutates(dfs)
# join_visits_mutates_(dfs)
#Create insights


## Load

In [None]:
from sqlalchemy import create_engine
engine = create_engine('postgresql://postgres:postgres@db:5432/postgres')
for name, df in dfs.items():
    df.to_sql(name,engine,if_exists ='replace',index=False)
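
A simple post-load check is to read the row counts back from the database. The sketch below swaps in an in-memory SQLite connection for the Postgres container so that it runs without a database service; the tables and their contents are toy stand-ins.

```python
import sqlite3
import pandas as pd

# Toy stand-ins for the processed tables
tables = {
    "cohort_funnel_stats": pd.DataFrame(
        {"stage": ["baseline", "visit1"], "subjects": [5, 4]}
    ),
    "toy_visit": pd.DataFrame({"SWANID": [101, 102], "FLAG": [True, False]}),
}

# Assumption: SQLite in place of Postgres, purely to keep the sketch self-contained
conn = sqlite3.connect(":memory:")
row_counts = {}
for name, df in tables.items():
    df.to_sql(name, conn, if_exists="replace", index=False)
    # Read the count back to confirm every row arrived
    loaded = pd.read_sql_query(f'SELECT COUNT(*) AS n FROM "{name}"', conn)
    row_counts[name] = int(loaded["n"].iloc[0])
conn.close()
```

Comparing `row_counts` with `len(df)` for each table gives a cheap sanity check that could also feed the data quality dashboard.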