## PART I. Preparation:

1. Create a working directory with the following folder structure:
        <working_dir_name>/
                          input/
                          output/
        
2. Install [doppelganger](https://github.com/sidewalklabs/doppelganger.git):
    
        `pip install doppelganger`
        
3. Download household and population level PUMS data for your state from [this link](https://www.census.gov/programs-surveys/acs/data/pums.html) and save them to your `input` folder. You should get two files in `csv` format: `ss15pxx.csv` and `ss15hxx.csv`, where `xx` stands for your state's abbreviation and `p` and `h` denote the population-level and household-level PUMS, respectively.

4. Download the PUMA 2010 shapefile for your state from [this link](https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2010&layergroup=Public+Use+Microdata+Areas) and save to your `input` folder. 
    - Note that you may use this shapefile in a GIS suite to filter the data creation step to PUMAs within a certain area of interest (such as a metropolitan region).
    
5. Now `cd` to your working dir, and we will begin preprocessing.
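
As a minimal sketch of step 1, the directory layout can also be created from Python. The helper name `make_working_dir` is a placeholder chosen for illustration and is not part of Doppelganger.

```python
from pathlib import Path


def make_working_dir(root):
    """Create <root>/input and <root>/output and return both paths."""
    root = Path(root)
    input_dir = root / "input"
    output_dir = root / "output"
    # exist_ok=True makes the call safe to repeat on an existing layout
    input_dir.mkdir(parents=True, exist_ok=True)
    output_dir.mkdir(parents=True, exist_ok=True)
    return input_dir, output_dir
```

Calling `make_working_dir('<working_dir_name>')` with your own directory name produces the structure shown in step 1.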


## PART II. Preprocessing:

### Imports

In [1]:

from __future__ import (
    absolute_import, division, print_function, unicode_literals
)
import builtins
import pandas as pd
import dask.dataframe as dd
from doppelganger import (allocation,
                          inputs,
                          Configuration,
                          HouseholdAllocator,
                          PumsData,
                          SegmentedData,
                          BayesianNetworkModel,
                          Population,
                          Preprocessor,
                          Marginals)
from download_allocate_generate import *
import logging
import os  # used by os.listdir when checking for already-generated PUMAs


logging.basicConfig(filename='logs', filemode='a', level=logging.INFO)

%load_ext autoreload
%autoreload 2

### AOI Loading
This step loads the PUMAs in your AOI. The list should already have been produced in GIS software from the shapefile downloaded in Part I.

In [7]:
# Load pumas (these will be based on the pumas you actually want to generate data for.)
puma_df = pd.read_csv('input/sfbay_puma_from_intersection.csv',dtype=str)
# Select 2010 PUMA column for filtering
puma_df_clean = puma_df.PUMACE10

### Filter PUMS data

__NOTE__: If you've previously completed this task, uncomment appropriate parts below to load intermediate data outputs for further processing

#### If filtering PUMS data for the first time
Run the following to filter PUMS data (only if it's your first time running this task):

In [14]:
# Read in population level PUMS (may take a while...)
person_pums_df = pd.read_csv('input/ss14pca.csv', na_values=['N.A'],na_filter=True)
# Read in household level PUMS (may take a while...)
household_pums_df = pd.read_csv('input/ss14hca.csv', na_values=['N.A'],na_filter=True)



In [44]:
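# Filter household and population data to the AOI.
# .copy() avoids pandas' SettingWithCopyWarning when the 'puma' column is added.
person_df_in_aoi = person_pums_df[person_pums_df['PUMA10'].isin(puma_df_clean.values)].copy()
person_df_in_aoi['puma'] = person_df_in_aoi['PUMA10']
household_df_in_aoi = household_pums_df[household_pums_df['PUMA10'].isin(puma_df_clean.values)].copy()
# Take the household PUMA from the household table itself, not the person table
household_df_in_aoi['puma'] = household_df_in_aoi['PUMA10']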
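
One thing worth checking at this step: `puma_df` was read with `dtype=str`, so the AOI codes are strings, whereas pandas may parse the `PUMA10` column of the PUMS files as integers, in which case `isin` matches nothing. The sketch below uses made-up codes and a hypothetical helper, `normalize_puma`, to show one way of bringing both sides to five-character, zero-padded strings before filtering. It assumes the codes are whole numbers with no missing values.

```python
import pandas as pd


def normalize_puma(series):
    """Return PUMA codes as zero-padded 5-character strings (hypothetical helper)."""
    return series.astype(int).astype(str).str.zfill(5)


# Made-up example data: AOI codes read as strings, PUMS codes parsed as integers
aoi_codes = pd.Series(['00101', '07502'])
pums = pd.DataFrame({'PUMA10': [101, 7502, 3301], 'AGEP': [34, 51, 8]})

# Comparing integers against strings finds no rows
naive = pums[pums['PUMA10'].isin(aoi_codes.values)]

# Normalizing both sides first gives the intended match
pums['puma'] = normalize_puma(pums['PUMA10'])
filtered = pums[pums['puma'].isin(normalize_puma(aoi_codes).values)]
```

If the filtered frames in the notebook come back empty, a dtype mismatch of this kind is one plausible cause to rule out.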


In [54]:
# Save for later use
person_df_in_aoi.to_csv('input/sfbay_person_pums_data.csv',index_label='index')
household_df_in_aoi.to_csv('input/sfbay_household_pums_data.csv',index_label='index')

#### Otherwise, load Previously Filtered PUMS Data:

Uncomment below if starting from outputs of previous step:

In [2]:
# household_df_in_aoi=pd.read_csv('input/sfbay_household_pums_data.csv')
# person_df_in_aoi=pd.read_csv('input/sfbay_person_pums_data.csv')

### Define Constants for Analysis  

In [9]:
STATE = '06'  # change to your state as appropriate
PUMA = puma_df_clean.iloc[0]
TABLENAME='acs_2015_5yr_pums'
output_dir = 'output'
census_api_key = '4d1ff8f7278171c404244dbe3055addfb97757c7'

### Load the Doppelganger example configuration file
This file does the following four things:
1. Defines person-specific variables in `person_fields`. In the example, you'll see `age`, `sex`, and `individual_income`. These variables are mapped to the PUMS variables in `inputs.py`. For example, `age` in Doppelganger is mapped to the PUMS variable `agep`. To use other variables from the PUMS with Doppelganger, you'll need to map their relationships in `inputs.py` and specify them here. 
2. Defines household-specific variables in `household_fields`. In the example, you'll see `household_income` and `num_vehicles`. As with the person-specific variables, you'll need to modify `inputs.py` to use other variables in Doppelganger.
3. Defines procedures to process input variables into bins in `preprocessing`.
4. Defines the structure of the household and person Bayesian Networks in `network_config_files`.  
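
To make the list above concrete, the following sketch checks that a configuration file contains the four sections named there. It assumes those names appear as top-level JSON keys and uses only the standard `json` module. The helper name `missing_config_keys` is hypothetical, and the schema beneath each key is defined by Doppelganger and is not reproduced here.

```python
import json

# Section names taken from the list above; assumed to be top-level JSON keys
EXPECTED_KEYS = ('person_fields', 'household_fields',
                 'preprocessing', 'network_config_files')


def missing_config_keys(path):
    """Return the expected top-level keys absent from the JSON file at `path`."""
    with open(path) as f:
        config = json.load(f)
    return [key for key in EXPECTED_KEYS if key not in config]
```

For example, `missing_config_keys('input/config.json')` returning an empty list would mean all four sections are present.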

In [10]:
configuration = Configuration.from_file('input/config.json')
household_fields = tuple(set(
    field.name for field in allocation.DEFAULT_HOUSEHOLD_FIELDS).union(
        set(configuration.household_fields)
))
persons_fields = tuple(set(
    field.name for field in allocation.DEFAULT_PERSON_FIELDS).union(
        set(configuration.person_fields)
))

In [11]:
gen_pumas = ['state_06_puma_{}_generated.csv'.format(puma) for puma in puma_df_clean]
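
Because each PUMA's output is written to a predictable filename, an interrupted run can be resumed by skipping PUMAs whose file already exists. A minimal sketch, using a hypothetical helper `remaining_pumas` and the filename pattern from the cell above:

```python
import os


def remaining_pumas(pumas, directory='.', state='06'):
    """Return the PUMA codes that have no generated CSV in `directory` yet."""
    # List the directory once instead of once per PUMA
    existing = set(os.listdir(directory))
    pattern = 'state_{}_puma_{}_generated.csv'
    return [p for p in pumas if pattern.format(state, p) not in existing]
```

Listing the directory a single time and testing membership in a set keeps the check cheap even when the AOI contains many PUMAs.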

In [13]:
pumas_to_go = set(puma_df_clean.values.tolist())

for puma in puma_df_clean:
    gen_puma ='state_06_puma_{}_generated.csv'.format(puma)
    if gen_puma in os.listdir('.'):
        print (puma)
        pumas_to_go.remove(puma)
        

puma_tract_mappings='input/2010_puma_tract_mapping.txt'
configuration = Configuration.from_file('input/config.json')
preprocessor = Preprocessor.from_config(configuration.preprocessing_config)

for puma_id in pumas_to_go:
    households_data = PumsData.from_csv('input/sfbay_household_pums_data.csv').clean(household_fields, preprocessor, state=STATE, puma=puma_id)
    persons_data = PumsData.from_csv('input/sfbay_person_pums_data.csv').clean(persons_fields, preprocessor,state=STATE, puma=puma_id)
    # Named person_segmenter to match the argument passed to create_bayes_net below
    person_segmenter = lambda x: None
    household_segmenter = lambda x: None
    print("loaded")
    
    household_model, person_model = create_bayes_net(
            STATE, puma_id, output_dir,
            households_data, persons_data, configuration,
            person_segmenter, household_segmenter
        )

    marginals, allocator = download_tract_data(
                STATE, puma_id, output_dir, census_api_key, puma_tract_mappings,
                households_data, persons_data
            )
    
    print('Allocated {}'.format(puma_id))
    population = generate_synthetic_people_and_households(
                STATE, puma_id, output_dir, allocator,
                person_model, household_model
            )
    
    print('Generated {}'.format(puma_id))
    accuracy = Accuracy.from_doppelganger(
            cleaned_data_persons=persons_data,
            cleaned_data_households=households_data,
            marginal_data=marginals,
            population=population
        )
    
    logging.info('Absolute Percent Error for state {}, and puma {}: {}'.format(STATE, puma_id,
                     accuracy.absolute_pct_error().mean()))


loaded


UnboundLocalError: local variable 'allocator' referenced before assignment

In [None]:
df_comb=dd.read_csv('state_06_puma_*_generated.csv')
df_next=df_comb[['tract','num_people','num_vehicles','household_id_x','serial_number','repeat_index']]
df_next.compute().to_csv('combined_pop.csv')
df=pd.read_csv('combined_pop.csv')
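df.num_people = df.num_people.replace('4+', 4).astype(int)
# Per-tract standard deviation of household size and vehicle count.
# The original indexing (df['num_people']['std']) implies grouped statistics;
# grouping by tract is an assumption based on the merge on 'tract' below.
tract_sd = df.groupby('tract')[['num_people', 'num_vehicles']].std()
tract_sd.columns = ['hh_sd', 'car_sd']

# Turn the tract index into an ordinary column so it can serve as the merge key
tract_sd_df = tract_sd.reset_index()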
merged_gdf=merged_gdf.merge(tract_sd_df,on='tract')
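# 'person_id' is not among the columns selected above, so ignore missing labels
df = df.drop(['Unnamed: 0', 'serial_number', 'repeat_index', 'person_id'], axis=1, errors='ignore')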
df.drop_duplicates(inplace=True)

def compute_row(i):
    row = df.iloc[i]
    tract_no = row.tract
    pt = get_random_point_in_polygon(tract_gdf[(tract_no == tract_gdf.tract)].geometry.values[0])
    return np.array([str(row.household_id_x),int(row.num_people),int(row.num_vehicles),float(pt.x),float(pt.y)])

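res = []
# tnrange is tqdm's notebook-aware range; it must be imported before this cell runs
for i in tnrange(df.shape[0], desc='1st loop'):
    res.append(compute_row(i))
# res is a plain list of NumPy arrays, so build the frame with pandas directly
out_df = pd.DataFrame(res)
out_df.to_csv("output/hhOut.csv", header=False, index=False)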
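
`get_random_point_in_polygon` is not defined in this notebook. One common way to implement such a helper is rejection sampling: draw uniform points from the polygon's bounding box until one falls inside. The sketch below is a standard-library illustration of that idea, not the author's implementation. It represents a polygon as a list of `(x, y)` vertices and returns an `(x, y)` tuple, whereas the notebook's version evidently takes a geometry object and returns a point with `.x` and `.y` attributes. It assumes a simple polygon with non-zero area.

```python
import random


def point_in_polygon(x, y, vertices):
    """Ray-casting test: True if (x, y) lies inside the simple polygon."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def random_point_in_polygon(vertices, rng=random):
    """Rejection-sample a uniformly distributed point inside the polygon."""
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    while True:
        x = rng.uniform(min_x, max_x)
        y = rng.uniform(min_y, max_y)
        if point_in_polygon(x, y, vertices):
            return x, y
```

Rejection sampling keeps the points uniform over the polygon's area, at the cost of extra draws for shapes that fill only a small part of their bounding box.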