# Data Preparation User Notebook
This notebook creates forecast outputs from the CSV files produced by the Estimates and Forecast team. You must be connected to SANDAG servers to run this code.

The outputs of these functions (if you choose to push them to the J drive) are saved at:   
J:\DataScience\DataQuality\QAQC\Forecast QC Automation

All code below is for MGRA Series 15 data.

# Important Note
Before running any of the functions in this notebook, the ds_config_2.yml file must be updated. For the expected format, see the prior-year inputs.

In [1]:
import pandas as pd
import numpy as np
import pyodbc
from data_prep_functions import *
import warnings
warnings.filterwarnings('ignore')

# Individual File Creation
Before any other function is run, the MGRA-level output for the given DSID must be created. All other outputs generally pull from the MGRA output, so it needs to be run and pushed to the J drive first.
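
Because the later steps read the MGRA output from disk, a quick existence check before running them can save a failed run. The following is a minimal sketch, assuming the MGRA file follows the `mgra_DS{dsid}_ind_QA.csv` naming that appears later in this notebook. `mgra_output_exists` and `output_dir` are hypothetical names for illustration and are not part of `data_prep_functions`.

```python
from pathlib import Path


def mgra_output_exists(dsid, output_dir):
    """Return True if the MGRA-level CSV for this DSID is already on disk.

    Assumption: the file is named mgra_DS{dsid}_ind_QA.csv, matching the
    path read later in this notebook. output_dir is whichever folder the
    MGRA output was pushed to.
    """
    return (Path(output_dir) / f"mgra_DS{dsid}_ind_QA.csv").is_file()
```

If the check returns False, run `mgra_output` with `to_jdrive=True` for that DSID before calling any of the rollup, both, or diff functions.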

## MGRA Level

In [2]:
mgra_ind_df = mgra_output(dsid='99', to_jdrive=True)
mgra_ind_df

outputting


Unnamed: 0,mgra,year,taz,LUZ,pop,hhp,hs,hs_sf,hs_mf,hs_mh,...,hotelroomtotal,parkactive,openspaceparkpreserve,beachactive,district27,milestocoast,acre,landacre,effective_acres,truckregiontype
0,1,2022,3010,10,440,440,176,84,92,0,...,0.0,0.0,0.000000,0.0,9,4.35,18.837621,18.837621,18.837621,1
1,2,2022,1797,28,130,68,56,0,56,0,...,0.0,0.0,0.000000,0.0,15,0.64,2.872330,2.872330,2.872330,1
2,3,2022,4361,239,549,549,200,23,177,0,...,0.0,0.0,0.000000,0.0,13,12.22,25.713898,25.713898,25.713898,1
3,4,2022,340,151,5,5,3,3,0,0,...,0.0,0.0,0.000000,0.0,2,0.17,2.678374,2.678374,2.678374,1
4,5,2022,388,151,90,90,43,43,0,0,...,0.0,0.0,0.000000,0.0,2,0.47,4.057765,4.057765,4.057765,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24316,24317,2029,3691,111,2,2,2,2,0,0,...,,0.0,0.000000,0.0,9,7.76,0.648685,0.648685,0.648685,1
24317,24318,2029,3683,212,130,130,46,46,0,0,...,,0.0,0.000000,0.0,3,14.73,50.854434,50.854434,50.514407,1
24318,24319,2029,4943,225,0,0,0,0,0,0,...,,0.0,39188.407437,0.0,14,55.40,39632.965922,39632.965922,1892.868273,1
24319,24320,2029,4940,227,0,0,0,0,0,0,...,,0.0,47858.175303,0.0,14,68.04,47919.200993,47919.200993,0.000000,1


## Roll up the data

In [6]:
for geography in ['cpa', 'luz', 'census_tract', 'sra','jurisdiction','region']:
    rollup_data(dsid='99', geo_level=geography, to_jdrive=True)
    print(f"{geography} is complete")


cpa is complete
luz is complete
census_tract is complete
sra is complete
jurisdiction is complete
region is complete
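
Conceptually, a rollup aggregates MGRA rows up to a coarser geography for each forecast year. The sketch below illustrates that idea on toy data; it is not the actual `rollup_data` implementation. It assumes the MGRA table carries a column for the target geography (here `jurisdiction`) and that every value column is simply summed. `rollup_sketch` is a hypothetical name and the numbers are made up.

```python
import pandas as pd

# Toy MGRA-level rows; column names mirror the outputs above, values are made up.
mgra_df = pd.DataFrame({
    "mgra": [1, 2, 3, 1, 2, 3],
    "year": [2022, 2022, 2022, 2029, 2029, 2029],
    "jurisdiction": ["Carlsbad", "Carlsbad", "Coronado"] * 2,
    "pop": [440, 130, 549, 450, 140, 560],
    "hs": [176, 56, 200, 180, 60, 205],
})


def rollup_sketch(df, geo_level):
    """Sum every value column within each (geography, year) group."""
    # Everything except the identifiers is treated as a value column.
    value_cols = df.columns.difference(["mgra", "year", geo_level])
    return df.groupby([geo_level, "year"], as_index=False)[list(value_cols)].sum()


rolled_sketch = rollup_sketch(mgra_df, "jurisdiction")
```

Summing is a reasonable default for counts such as `pop` and `hs`, but it is worth checking whether a sum is meaningful for columns such as `milestocoast` or `district27` before relying on their rolled-up values.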


In [3]:
rolled = rollup_data(dsid='99', geo_level='jurisdiction', to_jdrive=False)
rolled

Unnamed: 0,jurisdiction,year,pop,hhp,hs,hs_sf,hs_mf,hs_mh,hh,hh_sf,...,hotelroomtotal,parkactive,openspaceparkpreserve,beachactive,district27,milestocoast,acre,landacre,effective_acres,truckregiontype
0,Carlsbad,2022,115585,114619,48104,32197,14590,1317,44991,30437,...,5082.0,192.365439,6193.801,3.888166,859,1701.03,25052.69,24213.66,18811.637627,758
1,Carlsbad,2029,120733,119790,52974,34020,17637,1317,48918,31858,...,0.0,192.365439,6286.655,3.888166,859,1701.03,25052.34,25052.34,17251.629005,758
2,Chula Vista,2022,276785,275188,88143,53106,31099,3938,85412,51709,...,1920.0,585.71285,7250.759,0.0,12401,4579.59,33565.0,31907.62,29174.937433,1075
3,Chula Vista,2029,280365,278811,93700,56689,33073,3938,90478,54957,...,0.0,676.298002,7263.771,0.0,12401,4579.59,33565.11,33565.11,27896.944189,1075
4,Coronado,2022,22277,17325,9665,5526,4139,0,7455,4497,...,1717.0,87.934961,103.6729,148.088965,2243,71.61,10090.06,5425.038,8916.190879,223
5,Coronado,2029,21686,16892,9934,5371,4563,0,7518,4259,...,0.0,91.166858,103.6729,148.088964,2243,71.61,10096.22,10096.22,8709.476457,223
6,Del Mar,2022,3929,3929,2629,1903,726,0,1974,1370,...,350.0,4.065899,82.56417,0.0,148,17.74,1136.536,1080.632,1024.284528,66
7,Del Mar,2029,3800,3800,2737,1940,797,0,1965,1326,...,0.0,4.065899,82.56417,0.0,148,17.74,1136.928,1136.928,933.798145,66
8,El Cajon,2022,105638,103018,36590,15521,18873,2196,35396,15077,...,1044.0,76.908958,364.3771,0.0,7644,7427.47,9277.101,9277.101,9099.665456,588
9,El Cajon,2029,106425,103878,38705,15532,21061,2112,37257,15002,...,0.0,76.995044,364.3771,0.0,7644,7427.47,9277.138,9277.138,8977.614552,588


### Region Level
- Requires MGRA output to have already been created

In [3]:
region_ind = region_foldup(dsid='38', to_jdrive=True)
region_ind

region,year,hs,hs_sf,hs_mf,hs_mh,hh,hh_sf,hh_mf,hh_mh,gq_civ,gq_mil,...,hotelroomtotal,truckregiontype,district27,milestocoast,acres,effective_acres,land_acres,MicroAccessTime,remoteAVParking,refueling_stations
San Diego,2016,1190554,717626,430716,42212,1134848,687509,407158,40181,63014,43285,...,56646,23002,221100,245307.3086,2727204.0,1251249.0,2698589.0,2333475,5,699
San Diego,2018,1205852,721197,442548,42107,1147635,690686,416960,39989,69372,46797,...,61917,23002,221100,245307.3086,2727204.0,1251249.0,2698589.0,2333475,5,699
San Diego,2020,1226462,722838,461517,42107,1166240,692867,433349,40024,72056,48307,...,61917,23002,221100,245307.3086,2727204.0,1251249.0,2698589.0,2333475,5,699
San Diego,2023,1262957,726724,494126,42107,1197072,696698,460237,40137,73875,50572,...,63707,23002,221100,245307.3086,2727204.0,1251249.0,2698589.0,2049556,5,699
San Diego,2025,1288217,728371,517739,42107,1219745,698590,480988,40167,74447,51327,...,64889,23002,221100,245307.3086,2727204.0,1251249.0,2698589.0,1688220,5,699
San Diego,2026,1300847,729354,529386,42107,1231007,699444,491395,40168,74733,51327,...,65379,23002,221100,245307.3086,2727204.0,1251249.0,2698589.0,1688220,5,699
San Diego,2029,1338737,730943,565687,42107,1264151,701165,522809,40177,75591,51327,...,66938,23002,221100,245307.3086,2727204.0,1251249.0,2698589.0,1688220,5,699
San Diego,2030,1351367,731412,577848,42107,1274948,701619,533151,40178,75877,51327,...,67428,23002,221100,245307.3086,2727204.0,1251249.0,2698589.0,1688220,5,699
San Diego,2032,1376162,732534,601521,42107,1296193,702532,553482,40179,76043,51327,...,68233,23002,221100,245307.3086,2727204.0,1251249.0,2698589.0,1688220,5,699
San Diego,2035,1409867,734934,632826,42107,1327588,704236,583173,40179,76292,51327,...,69454,23002,221100,245307.3086,2727204.0,1251249.0,2698589.0,1688220,5,699


# Both File Creation
- Last run took: 2m 3.1s

In [8]:
both_df = create_both_df(dsid_1='99', dsid_2='36R', geo_level='mgra', to_jdrive=True)
both_df

year,mgra,taz_36,hs_36,hs_sf_36,hs_mf_36,hs_mh_36,hh_36,hh_sf_36,hh_mf_36,hh_mh_36,gq_civ_36,...,luz_id_36R,truckregiontype_36R,district27_36R,milestocoast_36R,acres_36R,effective_acres_36R,land_acres_36R,MicroAccessTime_36R,remoteAVParking_36R,refueling_stations_36R
2016,1,3331,19,19,0,0,18,18,0,0,0,...,95,1,27,3.7997,16.615444,12.961482,16.615444,10,0,0
2016,2,3331,35,35,0,0,34,34,0,0,0,...,95,1,27,3.9761,19.519185,19.519185,19.519185,10,0,0
2016,3,3358,52,52,0,0,52,52,0,0,0,...,95,1,27,4.1939,27.845124,26.867938,27.845124,10,0,0
2016,4,3358,30,30,0,0,30,30,0,0,0,...,95,1,27,4.2782,7.976178,7.976178,7.976178,10,0,0
2016,5,3358,28,28,0,0,28,28,0,0,0,...,95,1,27,4.0062,7.072502,7.063693,7.072502,10,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2050,22998,1290,90,90,0,0,87,87,0,0,0,...,14,1,1,2.6193,41.241522,26.863578,41.241522,120,0,0
2050,22999,1290,0,0,0,0,0,0,0,0,0,...,14,1,1,2.3703,35.842780,30.635095,35.842780,120,0,0
2050,23000,1290,131,131,0,0,126,126,0,0,0,...,14,1,1,2.1721,28.735275,16.109816,28.735275,120,0,0
2050,23001,1254,83,83,0,0,81,81,0,0,0,...,14,1,1,2.7063,41.006144,28.126056,41.006144,120,0,0
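
The "both" output places two datasets side by side, with each column tagged by the DSID it came from. A minimal sketch of that layout, assuming both inputs are indexed by (year, mgra); `both_sketch` is a hypothetical helper, not the `create_both_df` implementation, and the values are made up.

```python
import pandas as pd

# Two toy datasets sharing a (year, mgra) index; values are made up.
index = pd.MultiIndex.from_product([[2016, 2050], [1, 2]], names=["year", "mgra"])
ds_a = pd.DataFrame({"hs": [19, 35, 20, 36], "hh": [18, 34, 19, 35]}, index=index)
ds_b = pd.DataFrame({"hs": [19, 35, 22, 36], "hh": [18, 34, 21, 35]}, index=index)


def both_sketch(df_1, df_2, dsid_1, dsid_2):
    """Join two datasets on their shared index, suffixing columns with the DSID."""
    left = df_1.add_suffix(f"_{dsid_1}")
    right = df_2.add_suffix(f"_{dsid_2}")
    # An outer join keeps index entries that appear in only one dataset.
    return left.join(right, how="outer")


both_sketch_df = both_sketch(ds_a, ds_b, "36", "36R")
```

An outer join is used here so that a row present in only one dataset shows up with missing values rather than silently disappearing; whether the real function does the same is not shown in this notebook.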


# Diff File Creation
- Computes dsid_1 minus dsid_2
- Last run took 4m 21.8s

In [4]:
diff_df = create_diff_df(dsid_1='36', dsid_2='38', geo_level='mgra', to_jdrive=True)
diff_df

mgra,year,hs,hs_sf,hs_mf,hs_mh,hh,hh_sf,hh_mf,hh_mh,gq_civ,gq_mil,...,hotelroomtotal,truckregiontype,district27,milestocoast,acres,effective_acres,land_acres,MicroAccessTime,remoteAVParking,refueling_stations
1,2016,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0.0,0.0,0.0,0.0,0,0,0
2,2016,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0.0,0.0,0.0,0.0,0,0,0
3,2016,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0.0,0.0,0.0,0.0,0,0,0
4,2016,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0.0,0.0,0.0,0.0,0,0,0
5,2016,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0.0,0.0,0.0,0.0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22998,2050,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0.0,0.0,0.0,0.0,0,0,0
22999,2050,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0.0,0.0,0.0,0.0,0,0,0
23000,2050,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0.0,0.0,0.0,0.0,0,0,0
23001,2050,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0.0,0.0,0.0,0.0,0,0,0
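
The subtraction itself can be illustrated with an index-aligned pandas difference. This is a sketch on toy data, assuming both datasets share a (mgra, year) index and the same column names; it is not the `create_diff_df` implementation.

```python
import pandas as pd

# Two toy datasets sharing a (mgra, year) index; values are made up.
index = pd.MultiIndex.from_product([[1, 2], [2016, 2050]], names=["mgra", "year"])
ds_1 = pd.DataFrame({"hs": [19, 20, 35, 36], "hh": [18, 19, 34, 35]}, index=index)
ds_2 = pd.DataFrame({"hs": [19, 22, 35, 36], "hh": [18, 21, 34, 35]}, index=index)

# Subtraction aligns on the (mgra, year) index and on column names.
diff_sketch = ds_1.sub(ds_2)

# Keep only the rows where at least one column differs between the datasets.
changed = diff_sketch[(diff_sketch != 0).any(axis=1)]
```

Because the displayed head and tail of a diff table are often all zeros, filtering to the changed rows is usually the quickest way to see where two datasets actually differ.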


# Population comparison (between household csv and input files)
- Output name is: mgra_households_dataset_population_comparison_{output_type}_DS{dsid}_QA.csv
- Last run took: 3m 16.3s

In [2]:
def population_comparison_households_and_input_files(dsid, gq_only, no_gq, to_jdrive):
    '''Compare MGRA population data to household dataset population data based on gq preference.'''
    # Input files: MGRA-level output for this DSID (only the columns needed here)
    mgra_data = pd.read_csv(
        rf'J:\DataScience\DataQuality\QAQC\forecast_automation\mgra_series_13_outputs_CSV_data\aggregated_data\mgra_DS{dsid}_ind_QA.csv',
        usecols=['year', 'mgra', 'pop', 'hhp'])
    # Group quarters population = total population - household population
    mgra_data['gq_pop_input_files'] = mgra_data['pop'] - mgra_data['hhp']

    # Pick the population column from the flags (gq_only wins if both are True)
    if not gq_only and not no_gq:
        mgra_data = mgra_data[['mgra', 'year', 'pop']]
        output_type = 'all'
    elif gq_only:
        mgra_data = mgra_data[['mgra', 'year', 'gq_pop_input_files']]
        output_type = 'GQ_only'
    else:
        mgra_data = mgra_data[['mgra', 'year', 'hhp']]
        output_type = 'no_GQ'

    # NOTE: output_type and to_jdrive are not used in this version of the function;
    # it only returns the selected input-file columns.
    return mgra_data

In [3]:
mgra_data = population_comparison_households_and_input_files('36R', gq_only=False, no_gq=True, to_jdrive=False)
mgra_data

Unnamed: 0,mgra,year,hhp
0,1,2016,41
1,2,2016,81
2,3,2016,111
3,4,2016,73
4,5,2016,63
...,...,...,...
299021,22998,2050,242
299022,22999,2050,0
299023,23000,2050,323
299024,23001,2050,204


# Household number comparison (between households csv file and input files)
- Output file looks like: mgra_households_dataset_hh_count_comparison_no_GQ_DS{dsid}_QA.csv
- Compares the number of households between the input files and the households dataset, at the 'No GQ' level only.
- Last run took: 3m 33.8s

In [2]:
hs_comp_households_input_files = household_number_comparison_houseolds_and_input_files(dsid='36R', to_jdrive=True)
hs_comp_households_input_files

Unnamed: 0,mgra,year,house_count_household_file,hh_count_input_files,Diff
0,1,2016,18,18,0
1,2,2016,34,34,0
2,3,2016,52,52,0
3,4,2016,30,30,0
4,5,2016,28,28,0
...,...,...,...,...,...
237675,22995,2050,37,37,0
237676,22996,2050,103,103,0
237677,22998,2050,87,87,0
237678,23000,2050,126,126,0
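
The comparison above can be thought of as counting household records per MGRA and year, then lining that count up against the `hh` value from the input files. The sketch below shows this on toy data. It assumes one row per household in the households file, an inner join on (mgra, year), and a Diff defined as household file minus input files; none of these details are confirmed by this notebook, and the `hhid` column name is hypothetical.

```python
import pandas as pd

# Toy households file: one row per household (values are made up).
households = pd.DataFrame({
    "hhid": [1, 2, 3, 4, 5],
    "mgra": [1, 1, 2, 2, 2],
    "year": [2016] * 5,
})
# Toy input-file rows with the household count per MGRA and year.
input_files = pd.DataFrame({"mgra": [1, 2, 3], "year": [2016] * 3, "hh": [2, 4, 0]})

# Count household records per (mgra, year).
counts = (
    households.groupby(["mgra", "year"])
    .size()
    .reset_index(name="house_count_household_file")
)
# Inner join: MGRAs with no household records drop out of the result.
hh_comparison = counts.merge(
    input_files.rename(columns={"hh": "hh_count_input_files"}),
    on=["mgra", "year"],
    how="inner",
)
# Assumed sign convention: household file minus input files.
hh_comparison["Diff"] = (
    hh_comparison["house_count_household_file"] - hh_comparison["hh_count_input_files"]
)
```

In the real output above, MGRA 22999 is absent for 2050, and the MGRA-level table shows zero households for it, which is consistent with MGRAs that have no household records dropping out of the comparison.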


# Population comparison (between households csv and persons csv)
- Output looks like: DS{dsid}_persons_household_population_comparison_{gq_status}_QA.csv
- Last run took: 9m 48.7s
- Data completed around: 8m 15s

In [7]:
pop_comp_households_persons = aggregate_persons_households_population_comparison(dsid='36R', gq_only=True, to_jdrive=True)
pop_comp_households_persons

2016 is complete
2018 is complete
2020 is complete
2023 is complete
2025 is complete
2026 is complete
2029 is complete
2030 is complete
2032 is complete
2035 is complete
2040 is complete
2045 is complete
2050 is complete


Unnamed: 0,hhid,Persons_Dataset_Pop,Households_Dataset_Pop,Diff_P_minus_H,year
0,886,1,,,2016
1,2000,1,,,2016
2,2554,1,,,2016
3,2608,1,,,2016
4,2852,1,,,2016
...,...,...,...,...,...
84340,1453334,1,1.0,0.0,2050
84341,1453335,1,1.0,0.0,2050
84342,1453336,1,1.0,0.0,2050
84343,1453337,1,1.0,0.0,2050
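
One way to read this output: persons are counted per `hhid` in the persons file and compared with the household size recorded in the households file. The sketch below reproduces that shape on toy data. It assumes a left join from the persons counts, which would explain the missing `Households_Dataset_Pop` values above, but the actual join used by `aggregate_persons_households_population_comparison` is not shown here. The input column names `perid` and `persons` are hypothetical.

```python
import pandas as pd

# Toy persons file: one row per person (values are made up).
persons = pd.DataFrame({"hhid": [10, 10, 11, 12], "perid": [1, 2, 3, 4]})
# Toy households file with a recorded household size; hhid 12 is missing.
households = pd.DataFrame({"hhid": [10, 11], "persons": [2, 2]})

# Count persons per household ID in the persons file.
persons_pop = persons.groupby("hhid").size().reset_index(name="Persons_Dataset_Pop")

comparison = persons_pop.merge(
    households.rename(columns={"persons": "Households_Dataset_Pop"}),
    on="hhid",
    how="left",  # keep household IDs that appear only in the persons file
)
# Persons-file count minus households-file count, as the column name suggests.
comparison["Diff_P_minus_H"] = (
    comparison["Persons_Dataset_Pop"] - comparison["Households_Dataset_Pop"]
)
```

Rows where `Households_Dataset_Pop` is missing are worth reviewing separately from rows with a non-zero difference, since they point to an unmatched `hhid` rather than a mismatched count.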


# MGRA Series 14 Outputs
Outputs from Series 14 forecast used MGRA Series 13 values. MGRA series 
