**Note**: The `concat_dfs` and `create_diff` functions are not optimal for MGRA series 14 because they were designed for the anticipated MGRA series 15+. We have therefore created temporary versions (`concat_dfs_temp` and `create_diff_temp`) to use with MGRA series 14. When series 15 is released, these functions will need to be switched manually in the `GUI Implementation` section.

# Imports

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import yaml
import matplotlib.pyplot as plt
import seaborn as sns
# System, database, and GUI libraries
import os
import pyodbc
import glob
import copy
import PySimpleGUI as sg
import traceback

# Comparison Functions

## MGRA Level Data

### Concatenate both DS dataframes

In [2]:
def concat_dfs(comparison_first_ID_processed_data, comparison_second_ID_processed_data):
    """
    Merges two mgra-level dataframes (generated by download_DS_data function) horizontally.
    Returns a comparison table grouped by mgra and year.
    """
    # Added geozone to merge keys to account for mgra's in multiple jurisdictions (or other geographical levels)
    first_second_ID_comparison = comparison_first_ID_processed_data.merge(
        comparison_second_ID_processed_data,
        how='outer',
        left_on=[f'mgra_{first_ID}',
                 f'year_{first_ID}',
                 f'geozone_{first_ID}'],
        right_on=[f'mgra_{second_ID}',
                 f'year_{second_ID}',
                 f'geozone_{second_ID}'])
    
    # Clean up the combined dataframe: drop the duplicate second_ID key columns and rename the first_ID keys
    first_second_ID_comparison = first_second_ID_comparison.drop([f'mgra_{second_ID}', f'year_{second_ID}', f'geozone_{second_ID}'], axis=1)
    first_second_ID_comparison = first_second_ID_comparison.rename(columns={f'mgra_{first_ID}': 'mgra', f'year_{first_ID}': 'year', f'geozone_{first_ID}': 'geozone'})
    
    # Because we're summing, if using series 14 data, mgra's in multiple jurisdictions will be counted multiple times
    first_second_ID_comparison = first_second_ID_comparison.groupby(['mgra', 'year']).sum()
        
    return first_second_ID_comparison
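
The effect of the merge keys described in the comments above can be seen on a toy example. This is a minimal sketch, not the project's data: the column `pop`, the geozone labels `A` and `B`, and the suffixes `_first` and `_second` are all hypothetical. It assumes an MGRA that is split across two jurisdictions appears once per geozone in each input.

```python
import pandas as pd

# Hypothetical toy data: MGRA 1 sits in two jurisdictions, so it appears
# twice in each dataset (once per geozone).
first = pd.DataFrame({
    "mgra": [1, 1, 2],
    "year": ["2020", "2020", "2020"],
    "geozone": ["A", "B", "A"],
    "pop": [10, 10, 5],
})
second = first.assign(pop=[12, 12, 6])

# Merging on mgra and year only pairs every MGRA 1 row in `first`
# with every MGRA 1 row in `second` (2 x 2 = 4 rows).
loose = first.merge(second, how="outer", on=["mgra", "year"],
                    suffixes=("_first", "_second"))

# Adding geozone to the merge keys keeps the rows one-to-one.
strict = first.merge(second, how="outer", on=["mgra", "year", "geozone"],
                     suffixes=("_first", "_second"))

value_cols = ["pop_first", "pop_second"]
loose_sum = loose.groupby(["mgra", "year"])[value_cols].sum()
strict_sum = strict.groupby(["mgra", "year"])[value_cols].sum()

print(loose_sum)
print(strict_sum)
```

Even with geozone in the keys, the summed value for MGRA 1 is still double the per-row value, which is the multiple counting the comment before the `groupby` warns about.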

## CPA level Data

In [3]:
def cpa_aggregation(first_ID_df, second_ID_df, cpa_level):
    """
    Joins Community Planning Area (CPA) information onto MGRA-level dataframes (generated by download_DS_data function).
    Drops MGRA values that aren't in a CPA.
    Returns a comparison table grouped by CPA and year.
    """
    # Adding SQL Data (CPA) to first_id_df
    comparison_first_ID_processed_data_cpa = first_ID_df.merge(cpa_level, how='left', on='mgra')
    comparison_first_ID_processed_data_cpa = comparison_first_ID_processed_data_cpa[comparison_first_ID_processed_data_cpa['geozone'] != '*Not in a CPA*']

    # Adding SQL Data (CPA) to second_id_df
    comparison_second_ID_processed_data_cpa = second_ID_df.merge(cpa_level, how='left', on='mgra')
    comparison_second_ID_processed_data_cpa = comparison_second_ID_processed_data_cpa[comparison_second_ID_processed_data_cpa['geozone'] != '*Not in a CPA*']

    # Merge first_id_df and second_id_df together on mgra, year, and geozone
    comparison_processed_data_cpa = comparison_first_ID_processed_data_cpa.merge(comparison_second_ID_processed_data_cpa, how='outer', on=['mgra', 'year', 'geozone'], suffixes=[f'_{first_ID}', f'_{second_ID}'])

    # Drop the MGRA column because it isn't really a quantitative value
    comparison_processed_data_cpa = comparison_processed_data_cpa.drop('mgra', axis=1)

    # Aggregate the sum of features by geozone and year
    comparison_processed_data_cpa = comparison_processed_data_cpa.groupby(['geozone', 'year']).sum()

    # Rename index (geozone -> cpa)
    comparison_processed_data_cpa.index.names = ['cpa', 'year']
    
    return comparison_processed_data_cpa
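
The join, filter, and aggregate steps in `cpa_aggregation` can be followed on a single toy dataframe. This is a rough sketch under assumptions: the `pop` column and the CPA name `Uptown` are made up for illustration, and only the `*Not in a CPA*` label is taken from the function above.

```python
import pandas as pd

# Hypothetical MGRA-level values and a hypothetical MGRA-to-CPA lookup.
mgra_df = pd.DataFrame({
    "mgra": [1, 2, 3],
    "year": ["2020", "2020", "2020"],
    "pop": [10, 20, 30],
})
cpa_lookup = pd.DataFrame({
    "mgra": [1, 2, 3],
    "geozone": ["Uptown", "Uptown", "*Not in a CPA*"],
})

# Attach the CPA name to each MGRA row, then drop MGRAs outside any CPA.
joined = mgra_df.merge(cpa_lookup, how="left", on="mgra")
joined = joined[joined["geozone"] != "*Not in a CPA*"]

# The mgra identifier is not a quantity, so remove it before summing.
by_cpa = joined.drop("mgra", axis=1).groupby(["geozone", "year"]).sum()
by_cpa.index.names = ["cpa", "year"]

print(by_cpa)
```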

## Jurisdiction level Data

In [4]:
def jur_aggregation(first_ID_df, second_ID_df, jur_level):
    """
    Joins Jurisdiction information onto MGRA-level dataframes (generated by download_DS_data function).
    Returns a comparison table grouped by Jurisdiction and year.
    """
    # Adding SQL Data (Jurisdiction) to first_id_df
    comparison_first_ID_processed_data_jur = first_ID_df.merge(jur_level, how='left', on='mgra')
    
    # Adding SQL Data (Jurisdiction) to second_id_df
    comparison_second_ID_processed_data_jur = second_ID_df.merge(jur_level, how='left', on='mgra')
    
    # Merge first_id_df and second_id_df together on mgra, year, and geozone
    comparison_processed_data_jur = comparison_first_ID_processed_data_jur.merge(comparison_second_ID_processed_data_jur, how='outer', on=['mgra', 'year', 'geozone'], suffixes=[f'_{first_ID}', f'_{second_ID}'])
    
    # Drop the MGRA column because it isn't really a quantitative value
    comparison_processed_data_jur = comparison_processed_data_jur.drop('mgra', axis=1)
    
    # Aggregate the sum of features by geozone and year
    comparison_processed_data_jur = comparison_processed_data_jur.groupby(['geozone', 'year']).sum()
    
    # Rename index (geozone -> jurisdiction)
    comparison_processed_data_jur.index.names = ['jurisdiction', 'year']
        
    return comparison_processed_data_jur

## Creating Diff File for all Geo Levels

In [5]:
def non_shared_features(features_first_ID, features_second_ID):
    """
    (Comparison only)
    Identifies non-shared features between two different DS_ID's.
    """
    # Display non-shared features
    return list(set(features_first_ID) ^ set(features_second_ID))

In [6]:
def non_shared_years(first_ID_df, second_ID_df):
    """
    (Comparison only)
    Identifies non-shared years between two different DS_ID's.
    """
    # Display non-shared years
    return set(list(first_ID_df['year'].unique())) ^ set(list(second_ID_df['year'].unique()))
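
Both helpers above rely on the set symmetric difference operator `^`, which keeps the items found in exactly one of the two sets. A small illustration with made-up feature names and years (none of these values come from the real DS files):

```python
# Hypothetical feature lists for two DS_IDs.
features_a = ["pop", "hh", "gq_mil"]
features_b = ["pop", "hh", "emp"]

# Items that appear in exactly one of the two lists.
unshared_features = sorted(set(features_a) ^ set(features_b))

# The same idea for years, stored as strings as in download_DS_data.
years_a = {"2016", "2020"}
years_b = {"2020", "2025"}
unshared_years = sorted(years_a ^ years_b)

print(unshared_features)
print(unshared_years)
```

Sets are unordered, so sorting the result, as done here, gives a stable order for display.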

In [7]:
def create_diff(features_first_ID, features_second_ID, first_second_ID_comparison):
    """
    (Comparison only)
    Returns a comparison table where the second_ID values are subtracted from the first_ID values.
    """
    # Finding features common to both DSID data frames
    first_ID_unique = set(features_first_ID)
    intersection = first_ID_unique.intersection(features_second_ID)
    shared_features = list(intersection)
    
    # Calculate diff values between the two DS_ID's
    diff_df = pd.DataFrame()

    # NOTE: Subtracts second DS ID from first DS ID. If negative, then second DS ID was greater than first DS ID.
    for column in [col for col in features_first_ID if col in features_second_ID]:
        diff_df[f'{column}_diff'] = first_second_ID_comparison[f'{column}_{first_ID}'] - first_second_ID_comparison[f'{column}_{second_ID}']
        
    return diff_df
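
The sign convention in `create_diff` (first DS ID minus second DS ID) is easy to check on a toy comparison table. This is a sketch under assumptions: the IDs `DS41` and `DS35` and the columns `pop` and `emp` are placeholders, and local variables stand in for the `first_ID` and `second_ID` globals the function reads.

```python
import pandas as pd

first_id, second_id = "DS41", "DS35"  # hypothetical DS_IDs

# Hypothetical comparison table, indexed by mgra as after the groupby.
comparison = pd.DataFrame(
    {"pop_DS41": [100, 50], "pop_DS35": [90, 60], "emp_DS41": [10, 20]},
    index=pd.Index([1, 2], name="mgra"),
)

features_first = ["pop", "emp"]
features_second = ["pop"]

# Only features present in both datasets can be differenced.
shared = [col for col in features_first if col in features_second]

diff = pd.DataFrame({
    f"{col}_diff": comparison[f"{col}_{first_id}"] - comparison[f"{col}_{second_id}"]
    for col in shared
})

print(diff)
```

A negative value in a `_diff` column means the second DS ID was larger for that row.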

In [8]:
def concat_dfs_temp(comparison_first_ID_processed_data, comparison_second_ID_processed_data):
    """
    Merges two mgra-level dataframes (generated by download_DS_data function) horizontally.
    Returns a comparison table grouped by mgra and year.
    """
    # Merge on mgra and year only (this temporary version does not use geozone as a merge key)
    first_second_ID_comparison = comparison_first_ID_processed_data.merge(
        comparison_second_ID_processed_data,
        how='outer',
        on=['mgra', 'year'],
        suffixes=[f'_{first_ID}', f'_{second_ID}'])
    
    #print(first_second_ID_comparison)
    
    # Clean green combined
    #first_second_ID_comparison = first_second_ID_comparison.drop([f'mgra_DS41', f'year_DS41'], axis=1)
    #first_second_ID_comparison = first_second_ID_comparison.rename(columns={f'mgra_DS35': 'mgra', f'year_DS35': 'year'})
    
    # Because we're summing, if using series 14 data, mgra's in multiple jurisdictions will be counted multiple times
    first_second_ID_comparison = first_second_ID_comparison.groupby(['mgra', 'year']).sum()
        
    return first_second_ID_comparison

In [9]:
def create_diff_temp(features_first_ID, features_second_ID, first_second_ID_comparison):
    """
    (Comparison only)
    Returns a comparison table where the second_ID values are subtracted from the first_ID values.
    """
    # Finding features common to both DSID data frames
    #first_ID_unique = set(features_first_ID)
    #intersection = first_ID_unique.intersection(features_second_ID)
    
    shared_feats = [col for col in features_first_ID if col in features_second_ID]
    
    #shared_features = list(intersection)
    
    # Calculate diff values between the two DS_ID's
    diff_df = pd.DataFrame()

    # NOTE: Subtracts second DS ID from first DS ID. If negative, then second DS ID was greater than first DS ID.
    for column in shared_feats:
        diff_df[f'{column}_diff'] = first_second_ID_comparison[f'{column}_{first_ID}'] - first_second_ID_comparison[f'{column}_{second_ID}']
        
    return diff_df

In [10]:
#a, b, c = download_DS_data('DS35', jur_level)

In [11]:
#d, e, f = download_DS_data('DS41', jur_level)

In [12]:
#g = concat_dfs_temp(b, e)

In [13]:
#h = create_diff_temp(c, f, g)

In [14]:
#h

## Region level Data

In [15]:
def region_aggregation(first_ID_df, second_ID_df):
    """
    Sums the entire MGRA-level dataframes (generated by download_DS_data function) by column to get region values.
    Returns a comparison table grouped by year.
    """
    # Merge first_id_df and second_id_df together on mgra and year
    comparison_processed_data_reg = first_ID_df.merge(second_ID_df, how='outer', on=['mgra', 'year'], suffixes=[f'_{first_ID}', f'_{second_ID}'])
    
    # Aggregate the sum of features by year
    comparison_processed_data_reg = comparison_processed_data_reg.groupby('year').sum()
    
    # Drop the MGRA column because it isn't really a quantitative value
    comparison_processed_data_reg = comparison_processed_data_reg.drop('mgra', axis=1)
        
    return comparison_processed_data_reg

# Individual Functions

In [16]:
# maybe config argument?
def download_DS_data(ds_ID, jur_level):
    """
    Downloads DS_ID csv data from SANDAG's T drive, formatted for data that is not MGRA series 14.
    Returns processed data (merged with jurisdiction data and DS labeled), unprocessed data, and the features in the
    dataset.
    """
    datafiles = config[ds_ID].items()
    
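    # Read each year's CSV and stack the results.
    # DataFrame.append was removed in pandas 2.0, so collect the frames and use pd.concat.
    yearly_dfs = []
    for year, file_name in datafiles:
        working_df = pd.read_csv(file_name)
        working_df['year'] = year[-4:]
        yearly_dfs.append(working_df)
    comparison_no_geozone_df = pd.concat(yearly_dfs) if yearly_dfs else pd.DataFrame()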
    
    # rename housing columns from sql data
    comparison_no_geozone_df = comparison_no_geozone_df.rename(columns=housing_key)
    
    # Save the features_first_ID for future use (Used when creating the diff file)
    features = comparison_no_geozone_df.drop(['mgra', 'year'], axis=1).columns
    
    comparison_no_geozone = copy.deepcopy(comparison_no_geozone_df)
    
    # Adding SQL Data to first_id_df
    comparison_processed_data = comparison_no_geozone.merge(jur_level, how='left', on='mgra')
    
    # making it original
    comparison_processed_data.columns = [x + f'_{ds_ID}' for x in comparison_processed_data.columns]
        
    return comparison_processed_data, comparison_no_geozone_df, features

In [17]:
def download_series14_data(ds_ID, jur_level_14):
    """
    Downloads DS_ID csv data from SANDAG's T drive, formatted for MGRA series 14 data (using mgra_id).
    Returns processed data (merged with jurisdiction data and DS labeled), unprocessed data, and the features in the
    dataset.
    """
    datafiles = config[ds_ID].values()
    
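    # Read each CSV and stack the results; the year is parsed from the file name.
    # DataFrame.append was removed in pandas 2.0, so collect the frames and use pd.concat.
    yearly_dfs = []
    for file_name in datafiles:
        working_df = pd.read_csv(file_name)
        working_df['year'] = f"{file_name[-11:-7]}"
        yearly_dfs.append(working_df)
    comparison_no_geozone_df = pd.concat(yearly_dfs) if yearly_dfs else pd.DataFrame()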

    # Save the features_first_ID for future use (Used when creating the diff file)
    features = comparison_no_geozone_df.drop(['mgra_id', 'year'], axis=1).columns

    comparison_no_geozone = copy.deepcopy(comparison_no_geozone_df)

    # Adding SQL Data to first_id_df
    comparison_processed_data = comparison_no_geozone.merge(jur_level_14, how='left', on='mgra_id')
    # rename() without columns= relabels the index, so name the axis explicitly
    comparison_processed_data = comparison_processed_data.rename(columns={'mgra_id': 'mgra'})

    # making it original
    comparison_processed_data.columns = [x + f'_{ds_ID}' for x in comparison_processed_data.columns]

    return comparison_processed_data, comparison_no_geozone_df, features
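
Two pandas details matter for the download functions above: `DataFrame.append` was removed in pandas 2.0 (with `pd.concat` as the replacement), and `DataFrame.rename` called with a bare mapping relabels the index rather than the columns. The sketch below shows both on made-up data; the in-memory frames stand in for the CSV reads, and `pop` is a hypothetical column.

```python
import pandas as pd

# Stand-in for reading one CSV per year (hypothetical values).
frames = []
for year in ["2016", "2020"]:
    working = pd.DataFrame({"mgra_id": [1, 2], "pop": [10, 20]})
    working["year"] = year
    frames.append(working)

# Stack the yearly frames in one call instead of appending in the loop.
combined = pd.concat(frames, ignore_index=True)

# A bare mapping targets the index labels, so the column is left untouched.
index_renamed = combined.rename({"mgra_id": "mgra"})

# Passing columns= renames the column as intended.
column_renamed = combined.rename(columns={"mgra_id": "mgra"})

print(list(index_renamed.columns))
print(list(column_renamed.columns))
```

`ignore_index=True` is a choice made for this sketch; leaving it out keeps each frame's original row labels, which is what repeated appends would have produced.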

## CPA Aggregation

In [18]:
def cpa_aggregation_ind(first_ID_df, cpa_level):
    """
    Joins Community Planning Area (CPA) information onto an MGRA-level dataframe (generated by download_DS_data function).
    Drops MGRA values that aren't in a CPA.
    Returns a table containing aggregated CPA values grouped by CPA and year.
    """
    # Adding SQL Data (CPA) to first_id_df
    comparison_first_ID_processed_data_cpa = first_ID_df.merge(cpa_level, how='left', on='mgra')
    comparison_first_ID_processed_data_cpa = comparison_first_ID_processed_data_cpa[comparison_first_ID_processed_data_cpa['geozone'] != '*Not in a CPA*']

    # Drop the MGRA column because it isn't really a quantitative value
    comparison_processed_data_cpa = comparison_first_ID_processed_data_cpa.drop('mgra', axis=1)

    # Aggregate the sum of features by geozone and year
    comparison_processed_data_cpa = comparison_processed_data_cpa.groupby(['geozone', 'year']).sum()

    # Rename index (geozone -> cpa)
    comparison_processed_data_cpa.index.names = ['cpa', 'year']
    
    return comparison_processed_data_cpa

## Jurisdiction level Data

In [19]:
def jur_aggregation_ind(first_ID_df, jur_level):
    """
    Joins Jurisdiction information onto an MGRA-level dataframe (generated by download_DS_data function).
    Returns a table containing aggregated jurisdiction values grouped by jurisdiction and year.
    """
    # Adding SQL Data (Jurisdiction) to first_id_df
    comparison_first_ID_processed_data_jur = first_ID_df.merge(jur_level, how='left', on='mgra')
    
    # Drop the MGRA column because it isn't really a quantitative value
    comparison_processed_data_jur = comparison_first_ID_processed_data_jur.drop('mgra', axis=1)
    
    # Aggregate the sum of features by geozone and year
    comparison_processed_data_jur = comparison_processed_data_jur.groupby(['geozone', 'year']).sum()
    
    # Rename index (geozone -> jurisdiction)
    comparison_processed_data_jur.index.names = ['jurisdiction', 'year']
        
    return comparison_processed_data_jur

## Region level Data

In [20]:
def region_aggregation_ind(first_ID_df):
    """
    Sums the entire MGRA-level dataframe (generated by download_DS_data function) by column to get region values.
    Returns a table containing aggregated mgra values grouped by year.
    """
    # Aggregate the sum of features by year
    comparison_processed_data_reg = first_ID_df.groupby('year').sum()
    
    # Drop the MGRA column because it isn't really a quantitative value
    comparison_processed_data_reg = comparison_processed_data_reg.drop('mgra', axis=1)
        
    return comparison_processed_data_reg

# Environment Setup

## Pulling Info From YML File

In [21]:
# Localise with . files 
# config_filename = 'C:/Users/cra/OneDrive - San Diego Association of Governments/DS41_42/ds41_42_config.yml'
config_filename = './ds_config.yml'

In [22]:
with open(config_filename, "r") as yml_file:
    config = yaml.safe_load(yml_file)
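
The rest of the notebook reads `config` as a mapping from DS_ID to a mapping of year-labelled keys to CSV paths, takes the last four characters of each key as the year, and skips the last top-level key when listing DS_IDs. The sketch below mimics that shape with a plain dict. Every key name and path in it is a placeholder rather than the real contents of `ds_config.yml`, and what the skipped last key actually holds is not shown in this notebook.

```python
# Hypothetical parsed config, i.e. the kind of dict yaml.safe_load might return.
# All keys and paths below are placeholders.
example_config = {
    "DS35": {
        "mgra_2016": "path/to/DS35/mgra_2016.csv",
        "mgra_2020": "path/to/DS35/mgra_2020.csv",
    },
    "DS41": {
        "mgra_2016": "path/to/DS41/mgra_2016.csv",
        "mgra_2025": "path/to/DS41/mgra_2025.csv",
    },
    "placeholder_last_key": {},
}

# The GUI listboxes offer every top-level key except the last one.
ds_ids = list(example_config.keys())[:-1]

# Years come from the last four characters of each key.
years_by_ds = {ds: [key[-4:] for key in example_config[ds]] for ds in ds_ids}

# The output notes report the first listed year as the base year.
base_years = {ds: years[0] for ds, years in years_by_ds.items()}

print(ds_ids)
print(base_years)
```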

## Downloading SQL Data

In [23]:
# TODO: Decide when/how we read mgra_series value (we can solely identify series 14 using range of DS values i think)
# Future mgra_series should not need mgra_id implementation

In [24]:
#mgra_series = value 
# maybe from yml config?

In [25]:
conn = pyodbc.connect('Driver={ODBC Driver 17 for SQL Server};'
                      'Server=DDAMWSQL16.sandag.org;'
                      'Database=demographic_warehouse;'
                      'Trusted_Connection=yes;')

In [26]:
# if mgra_series == 14:
#     query_all = "SELECT mgra_id, geotype, geozone FROM demographic_warehouse.dim.mgra WHERE series = 14 AND (geotype='cpa' OR geotype='jurisdiction')" #Remove the last and part when I do this for real
# else:
#     # replace with bottom code once mgra_series gets implemented
#     query_all = "SELECT mgra, geotype, geozone FROM demographic_warehouse.dim.mgra WHERE series = 14 AND (geotype='cpa' OR geotype='jurisdiction')" #Remove the last and part when I do this for real
#     #query_all = f"SELECT mgra, geotype, geozone FROM demographic_warehouse.dim.mgra WHERE series = {mgra_series} AND (geotype='cpa' OR geotype='jurisdiction')" #Remove the last and part when I do this for real

In [27]:
query_all = "SELECT mgra, geotype, geozone FROM demographic_warehouse.dim.mgra WHERE series = 14 AND (geotype='cpa' OR geotype='jurisdiction')" #Remove the last and part when I do this for real
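
One possible way to make the series configurable, as the TODO comments above suggest, is a parameterized query rather than a hard-coded `14`. The sketch below is only an illustration: it uses an in-memory `sqlite3` database as a stand-in for the SQL Server connection, and the table name `dim_mgra` and its rows are invented. The `?` placeholder style is shared by `sqlite3` and `pyodbc`, but the query would still need testing against the real warehouse.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for the warehouse (hypothetical table and rows).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE dim_mgra (mgra INTEGER, series INTEGER, geotype TEXT, geozone TEXT)"
)
demo_conn.executemany(
    "INSERT INTO dim_mgra VALUES (?, ?, ?, ?)",
    [
        (1, 14, "jurisdiction", "Carlsbad"),
        (1, 14, "cpa", "*Not in a CPA*"),
        (1, 13, "jurisdiction", "Carlsbad"),
    ],
)

# The series value is passed as a parameter instead of being written into the SQL.
demo_query = (
    "SELECT mgra, geotype, geozone FROM dim_mgra "
    "WHERE series = ? AND geotype IN ('cpa', 'jurisdiction')"
)
mgra_series = 14
demo_df = pd.read_sql_query(demo_query, demo_conn, params=(mgra_series,))

print(demo_df)
demo_conn.close()
```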

In [28]:
sql_query = pd.read_sql_query(query_all,conn)
sql_df_all = pd.DataFrame(sql_query)

In [29]:
# SQL Data at different levels
jur_level = sql_df_all[sql_df_all['geotype']=='jurisdiction'].drop('geotype', axis=1).drop_duplicates()
cpa_level = sql_df_all[sql_df_all['geotype']=='cpa'].drop('geotype', axis=1).drop_duplicates()

In [30]:
# hh, gq_mil, gq_college, and gq_other sql query
housing_query = "SELECT short_name, long_name FROM demographic_warehouse.dim.housing_type"

In [31]:
housing_info = pd.read_sql_query(housing_query,conn)
sql_housing_info = pd.DataFrame(housing_info)

In [32]:
housing_key = sql_housing_info.set_index('short_name').to_dict()['long_name']

In [33]:
for key, value in housing_key.items():
    housing_key[key] = f'{value} ({key})'

In [34]:
housing_key

{'hh': 'Household Population (hh)',
 'gq_mil': 'Group Quarters - Military (gq_mil)',
 'gq_college': 'Group Quarters - College (gq_college)',
 'gq_other': 'Group Quarters - Other (gq_other)'}

# GUI Implementation

In [35]:
# Declare desired output options
comparison_selection_list = ['mgra_both', 'cpa_both', 'jur_both', 'region_both', 'mgra_diff', 'cpa_diff', 'jur_diff', 'region_diff']
individual_selection_list = ['mgra_ind', 'cpa_ind', 'jur_ind', 'region_ind']

## Base window

In [36]:
def base_window():
    """
    Creates PySimpleGUI window that enables user to select output path and output option (comparison or individual).
    Returns click event as well as selected values (click event will indicate output option and values will indicate 
    output path).
    """
    layout_first = [ 
        [sg.Text('Please Designate An Output Path (or leave blank to use local outputs folder)')],
        [sg.Text('Output Path', size =(15, 1)), sg.FolderBrowse(key='output-path')],
        [sg.Text('Select An Output Option')],
        [sg.Button(button_text='Comparison', key='comparison-select'),
         sg.Button(button_text='Individual', key='individual-select'),
         sg.Cancel()]
    ]
    
    window = sg.Window('Base window', layout_first, element_justification='c')
    event, values = window.read()
    window.close()

    return event, values

## Comparison window

In [37]:
def assert_inputs(event, values, output_path, output_notes):
    """Assert that inputs are compatible and formatted correctly!"""
    
    if output_path == '':
        if not os.path.exists('outputs'):
            os.makedirs('outputs')
        output_path = './outputs'
    globals()['output_path'] = output_path
    
    output_notes.append(f'Output files are located in: {output_path}')
    input_list = values['input_list']
    
    # Check to make sure there's at least one desired output
    assert len(values['input_list']) >= 1, 'Please select at least one output.'

    if event == 'comparison':
        # check that there are exactly 2 ds_ids selected
        assert len(values['DS_IDs']) == 2, 'Incorrect number of DS_IDs selected.'

        ds_selection = values['DS_IDs']
        ds_selection.sort(reverse=True)
        
        first_ID, second_ID = ds_selection[0], ds_selection[1]
        globals()['first_ID'] = first_ID
        globals()['second_ID'] = second_ID
        return

    return

In [40]:
def generate_outputs(event, created_dfs):
    """
    Function that converts created dataframes into csv files.
    """ 
    # comparison output
    if event == 'comparison':
        for df_name, df in created_dfs.items():
            if 'diff' in df_name:
                df.to_csv(f"{output_path}/{df_name}_{first_ID}_minus_{second_ID}.csv")
            else:
                df.to_csv(f"{output_path}/{df_name}_{first_ID}_{second_ID}.csv")
        return
    
    # individual output
    for df_name, df in created_dfs.items():
        df.to_csv(f"{output_path}/{df_name}_{individual_ID}.csv")
    return

In [41]:
def create_comparison_dfs(event, first_ID, second_ID, input_list, output_notes):
    """
    (Comparison Only)
    Function that runs through desired outputs and creates dataframes based on selected desired outputs. This function
    also saves any notes that need to be displayed to the user.
    """
    # download data for each ds_id
    first_ID_processed, first_ID_unprocessed, first_ID_features = download_DS_data(first_ID, jur_level)
    second_ID_processed, second_ID_unprocessed, second_ID_features = download_DS_data(second_ID, jur_level)
    
    unshared_features = non_shared_features(first_ID_features, second_ID_features)
    if len(unshared_features) > 0:
        output_notes.append(f'Unshared features: {", ".join(unshared_features)}')
    else:
        output_notes.append('All features are shared.')
              
    unshared_years = non_shared_years(first_ID_unprocessed, second_ID_unprocessed)
    if len(unshared_years) > 0:
        output_notes.append(f'Unshared years: {", ".join(unshared_years)}')
    else:
        output_notes.append('All years are shared.')
                            
    if any(df[-4:] == 'diff' for df in input_list):
        output_notes.append(f'Differences in diff files were generated by calculating: {first_ID} values - {second_ID} values.')
    
    output_notes.append(f'Base year for {first_ID} is {[item[-4:] for item in config[first_ID].keys()][0]}.')
    output_notes.append(f'Base year for {second_ID} is {[item[-4:] for item in config[second_ID].keys()][0]}.')

    created = {}
    if 'mgra_both' in input_list:
        mgra_both = concat_dfs_temp(first_ID_unprocessed, second_ID_unprocessed)
        created['mgra_both'] = mgra_both
    if 'cpa_both' in input_list:
        cpa_both = cpa_aggregation(first_ID_unprocessed, second_ID_unprocessed, cpa_level)
        created['cpa_both'] = cpa_both
    if 'jur_both' in input_list: 
        jur_both = jur_aggregation(first_ID_unprocessed, second_ID_unprocessed, jur_level)
        created['jur_both'] = jur_both
    if 'region_both' in input_list:
        region_both = region_aggregation(first_ID_unprocessed, second_ID_unprocessed)
        created['region_both'] = region_both
    if 'mgra_diff' in input_list:
        if 'mgra_both' not in input_list:
            mgra_both = concat_dfs_temp(first_ID_unprocessed, second_ID_unprocessed)
        mgra_diff = create_diff_temp(first_ID_features, second_ID_features, mgra_both)
        created['mgra_diff'] = mgra_diff
    if 'cpa_diff' in input_list:
        if 'cpa_both' not in input_list:
            cpa_both = cpa_aggregation(first_ID_unprocessed, second_ID_unprocessed, cpa_level)
        cpa_diff = create_diff(first_ID_features, second_ID_features, cpa_both)
        created['cpa_diff'] = cpa_diff
    if 'jur_diff' in input_list:
        if 'jur_both' not in input_list:
            jur_both = jur_aggregation(first_ID_unprocessed, second_ID_unprocessed, jur_level)
        jur_diff = create_diff(first_ID_features, second_ID_features, jur_both)
        created['jur_diff'] = jur_diff
    if 'region_diff' in input_list:
        if 'region_both' not in input_list:
            region_both = region_aggregation(first_ID_unprocessed, second_ID_unprocessed)
        region_diff = create_diff(first_ID_features, second_ID_features, region_both)
        created['region_diff'] = region_diff

    generate_outputs(event, created)
                            
    print(f"{first_ID} & {second_ID} {', '.join(created.keys())} outputs generated successfully!")
                            
    return

In [42]:
def comparison_window(output_path):
    """
    Creates PySimpleGUI window that enables user to select multiple DS_ID's along with desired outputs. The window will
    also have a console section where any output notes or errors will be displayed.
    Returns click event as well as selected values (might remove return values since no purpose as of now).
    """
    lb = sg.Listbox(values=comparison_selection_list, select_mode='multiple', size=(30, len(comparison_selection_list)+1), key='input_list')
    
    def select_all():
        lb.set_value(comparison_selection_list)
        return
    def deselect_all():
        lb.set_value([])
        return
    
    layout_comparison = [
        [sg.Button('Back', key='Back')],
        [sg.Text('Please Select 2 DS_IDs')],
        [sg.Listbox(values=(list(config.keys())[:-1]), select_mode='multiple', size=(30, len(config.keys())), key='DS_IDs')],
        [sg.Text('Please Select Desired Outputs')],
        [[sg.Button('Select All', target='input_list', key='select_all'), sg.Button('Clear All', target='input_list', key='clear_all')], lb],
        [sg.Submit(key='comparison'), sg.Button('Cancel/Close', key='Cancel')],
        [sg.Output(size=(100,20), key='output')]
    ]
    
    window = sg.Window('Comparison window', layout_comparison, element_justification='c')
    
    while True: # Event Loop
        event, values = window.Read()
        if event in (None, 'Cancel', 'Back'):
            break
        if event == 'select_all':
            select_all()
        if event == 'clear_all':
            deselect_all()
        if event == 'comparison':
            try:
                output_notes = []
                assert_inputs(event, values, output_path, output_notes)
                print('Creating dataframes...')
                create_comparison_dfs(event, first_ID, second_ID, values['input_list'], output_notes)
                print()
                print('\n'.join(output_notes))
                print()
            except FileNotFoundError as f:
                print('Please connect to the VPN. If connected, please check YML file datapaths.')
            except Exception as e:
                print(traceback.format_exc())
            
    window.Close()
    window['output'].__del__()
    
    if event == 'Back':
        initiate_window()
    
    return event, values

## Individual window

In [43]:
def create_individual_dfs(event, individual_IDs, input_list, output_notes):
    """
    (Individual Only)
    Function that runs through desired outputs and creates dataframes based on selected desired outputs. This function
    also saves any notes that need to be displayed to the user.
    """
    
    for individual_ID in individual_IDs:
        
        globals()['individual_ID'] = individual_ID
        
        # download data for the ds_id
        individual_ID_processed, individual_ID_unprocessed, individual_ID_features = download_DS_data(individual_ID, jur_level)

        output_notes.append(f'Base year for {individual_ID} is {[item[-4:] for item in config[individual_ID].keys()][0]}.')

        created = {}
        if 'mgra_ind' in input_list:
            mgra_ind = individual_ID_unprocessed.set_index('mgra')
            created['mgra_ind'] = mgra_ind
        if 'cpa_ind' in input_list:
            cpa_ind = cpa_aggregation_ind(individual_ID_unprocessed, cpa_level)
            created['cpa_ind'] = cpa_ind
        if 'jur_ind' in input_list:
            jur_ind = jur_aggregation_ind(individual_ID_unprocessed, jur_level)
            created['jur_ind'] = jur_ind
        if 'region_ind' in input_list:
            region_ind = region_aggregation_ind(individual_ID_unprocessed)
            created['region_ind'] = region_ind

        generate_outputs(event, created)
        print(f"{individual_ID} {', '.join(created.keys())} outputs generated successfully!")
    return

In [44]:
def individual_window(output_path):
    """
    Creates PySimpleGUI window that enables user to select a single DS_ID along with desired outputs. The window will
    also have a console section where any output notes or errors will be displayed.
    Returns click event as well as selected values (might remove return values since no purpose as of now).
    """
    lb_options = sg.Listbox(values=individual_selection_list, select_mode='multiple', size=(30, len(individual_selection_list)+1), key='input_list')
    lb_ds = sg.Listbox(values=(list(config.keys())[:-1]), select_mode='multiple', size=(30, len(config.keys())), key='individual_ID')
    
    def select_all_options():
        lb_options.set_value(individual_selection_list)
        return
    def deselect_all_options():
        lb_options.set_value([])
        return
    
    def select_all_ds():
        lb_ds.set_value(list(config.keys())[:-1])
        return
    def deselect_all_ds():
        lb_ds.set_value([])
        return
        
    layout_individual = [
        [sg.Button('Back', key='Back')],
        [sg.Text('Please Select DS_ID(s)')],
        [[sg.Button('Select All', target='individual_ID', key='select_all_ds'), sg.Button('Clear All', target='individual_ID', key='clear_all_ds')], lb_ds],
        [sg.Text('Please Select Desired Outputs')],
        [[sg.Button('Select All', target='input_list', key='select_all'), sg.Button('Clear All', target='input_list', key='clear_all')], lb_options],
        [sg.Submit(key='individual'), sg.Button('Cancel/Close', key='Cancel')],
        [sg.Output(size=(100,20), key='output')]
    ]
    
    window = sg.Window('Individual window', layout_individual, element_justification='c')
    
    while True: # Event Loop
        event, values = window.Read()
        if event in (None, 'Cancel', 'Back'):
            break
        if event == 'select_all':
            select_all_options()
        if event == 'clear_all':
            deselect_all_options()
        if event == 'select_all_ds':
            select_all_ds()
        if event == 'clear_all_ds':
            deselect_all_ds()
        if event == 'individual':
            try:
                output_notes = []
                assert_inputs(event, values, output_path, output_notes)
                print('Creating dataframes...')
                create_individual_dfs(event, values['individual_ID'], values['input_list'], output_notes)
                print()
                print('\n'.join(output_notes))
                print()
            except FileNotFoundError as f:
                print('Please connect to the VPN. If connected, please check YML file datapaths.')
            except Exception as e:
                print(traceback.format_exc())

    window.Close()
    window['output'].__del__()
    
    if event == 'Back':
        initiate_window()
    
    return event, values

## Initialize GUI

In [45]:
def initiate_window():
    """
    Function that initiates the window flow process starting with the base window. Helps coordinate transfer from base
    window to either comparison or individual window based on click event returned from base window.
    """
    sg.theme('SandyBeach')
    
    event, values = base_window()
    output_path = values['output-path']
    while True:
        if event in [None, 'Cancel']:
            return
        if event == 'comparison-select':
            event, values = comparison_window(output_path)
            return
        if event == 'individual-select':
            event, values = individual_window(output_path)
            return

In [47]:
initiate_window()

TODO List:
- Figure out what other output notes we need (especially for individual comparisons)
- consider csv outputs as inputs to power bi


- add mgra_id grouping for series 14 ds_ids (do we have csv files for series 14 we can test out? I think the ones we have been using are series 13.)
- use outer join for the comparisons (**done but need to check if it works as intended**)
- adjust sql query to be any series (currently queries only series 14 i think)

Series 14 MGRA to jurisdiction: even if an MGRA falls into multiple jurisdictions, there are scenarios where one jurisdiction reports 0 to account for duplication.

IDEAS to consider:

- should we rename columns based on sql tables (for clarity on column meanings)? **Already done for housing cols but maybe there's more we can do**
- generate outputs to different folders for comparison or individual?
- should we order the DS_ID's in the selection list?
- should we always make diff tables newer - older? or older - newer? and we pick that by the DS_ID number right?
- how are mgra csv files released? If there's a convention for folder file paths, maybe we could automate selection of new filepaths instead of using the yml file
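
On the questions above about ordering DS_IDs and choosing the diff direction by DS_ID number: `assert_inputs` currently sorts the selected IDs as strings, and string order only matches numeric order while all IDs have the same number of digits. The sketch below shows one possible numeric sort key. It assumes every DS_ID ends in an integer (as in `DS35` and `DS41`); the helper name `ds_number` and the ID `DS100` are hypothetical.

```python
import re


def ds_number(ds_id):
    """Return the trailing integer of a DS_ID such as 'DS41' (assumed format)."""
    match = re.search(r"(\d+)$", ds_id)
    if match is None:
        raise ValueError(f"No trailing number in DS_ID: {ds_id!r}")
    return int(match.group(1))


selection = ["DS41", "DS100"]

# String sort compares character by character, so 'DS41' ranks above 'DS100'.
lexicographic = sorted(selection, reverse=True)

# Numeric sort puts the higher-numbered dataset first.
numeric = sorted(selection, key=ds_number, reverse=True)

print(lexicographic)
print(numeric)
```

With a key like this, the first element of the sorted selection would consistently be the higher-numbered DS_ID, so diff tables would always be computed as newer minus older.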