# Assign PUMS a ZIP via GeoCorr


## Purpose

I hope to accomplish the following:
<p></p>

* assign a zip code to each PUMS (wrangled) record

Each wrangled PUMS record, representing an individual house, contains a PUMA.
In order to assign a realistic street & address to each record, we will need to have an associated zip code.

We can link PUMA to ZIP via a Geocorr mapping (see instructions below).
We assign a zip code to each record based on this mapping.
Because some zip codes contain drastically more houses than others, we use the number of houses within each zip code & PUMA combination as a probabilistic weight for each zip code contained within a PUMA.

-------------------------

<b>Ensure that you have already accomplished the following before running the script.</b>
<p></p>

1. Navigate to the [Geocorr website](https://mcdc.missouri.edu/applications/geocorr2022.html)
2. Select State you wish to link to (ex: Washington) 
3. Select at least 1 source geography 
   * Other Geographies > PUMA (2012)
4. Select at least 1 target geography.  CTRL+click for multiple 
   * 2020 Geographies > ZIP/ZCTA  
5. Select weighting variable of <b>"Housing units (2020 Census)"</b>
6. Generate output 
7. Click on .csv link to download
8. Save/move output as: 
   * "{root_directory}/SupportingDocs/Housing/01_Raw/GeoCorr_mapping<b>_hus</b>.csv"
   
### ALSO DO THIS, note the difference bolded

1. Navigate to the [Geocorr website](https://mcdc.missouri.edu/applications/geocorr2022.html)
2. Select State you wish to link to (ex: Washington) 
3. Select at least 1 source geography 
   * Other Geographies > PUMA (2012)
4. Select at least 1 target geography.  CTRL+click for multiple 
   * 2020 Geographies > ZIP/ZCTA  
5. Select weighting variable of <b>"Population (2020 Census)"</b>
6. Generate output 
7. Click on .csv link to download
8. Save/move output as: 
   * "{root_directory}/SupportingDocs/Housing/01_Raw/GeoCorr_mapping<b>_pop</b>.csv"

----------------------

<p>Author: PJ Gibson</p>
<p>Create Date: 2022-07-03</p>
<p>Contact: peter.gibson@doh.wa.gov</p>

## 1.  Load in variables, libraries

In [0]:
import pandas as pd
import numpy as np

# Set random seed
rng = np.random.default_rng( 42 )

## 2.  Read & Clean

### 2.1 PUMS

This data has been wrangled so that each row represents an individual house.
Each record (row) contains information on the house type, number of bedrooms, and many other relevant fields.

Most importantly for this script, each record has an associated PUMA.
A PUMA represents a geographical area.
Depending on the type of census survey you use to pull the PUMS data, PUMAs can represent different population sizes.
For the 2019 ACS 5-year survey for Washington State, each PUMA contains between 40,205 and 46,959 housing units.
<p></p>
* More information on PUMS / the PUMA field [linked here](https://www.census.gov/content/dam/Census/library/publications/2021/acs/acs_pums_handbook_2021.pdf)

In [0]:
# Read in our data
df_PUMS_housing = pd.read_csv(f'../../SupportingDocs/Housing/02_Wrangled/PUMS_housing_1to1.csv')

# Convert PUMA to string
df_PUMS_housing['PUMA'] = df_PUMS_housing['PUMA'].astype(str).str.zfill(5)

# Sort by PUMA
df_PUMS_housing = df_PUMS_housing.sort_values(['PUMA', 'house_id'], ascending=True).reset_index(drop=True).reset_index()

### 2.2 Geocorr

This is the PUMA-to-ZIP mapping downloaded from Geocorr (see the instructions at the top of this notebook).
Each row represents one PUMA & ZIP combination.

Most importantly for this script, each row carries the number of housing units in that combination (`hus20`) alongside the `afact` value.
We use the housing counts to weight the ZIP codes within each PUMA.

In [0]:
# Read in our dataset
df_Geocorr = pd.read_csv(f'../../SupportingDocs/Housing/01_Raw/GeoCorr_mapping_hus.csv')\
               .query('zcta != " "')

# The first data row of the Geocorr CSV holds column labels rather than data, so we drop it manually
df_Geocorr = df_Geocorr[1:].reset_index()

# Convert afact to float
df_Geocorr['afact'] = df_Geocorr['afact'].astype('float64')

# Make number of houses in zip an integer
df_Geocorr['hus20'] = df_Geocorr['hus20'].astype(int)

# Remove records with no houses
df_Geocorr = df_Geocorr.query('hus20 > 0')
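
Since this cell uses plain pandas, the label row could also be skipped at read time instead of being sliced off afterwards. A minimal sketch, assuming the export keeps its column names on the first line and human-readable labels on the second; the inline CSV is made-up sample data, not real Geocorr output, and `sample_csv` / `df_sample` are hypothetical names:

```python
import io
import pandas as pd

# Hypothetical CSV that mimics a file with a second "labels" row under the header
sample_csv = io.StringIO(
    "puma12,zcta,hus20,afact\n"
    "PUMA (2012),ZIP code area,Housing units,Allocation factor\n"
    "10100,98220,1200,0.6\n"
    "10100,98225,800,0.4\n"
)

# skiprows=[1] drops only the second physical line (the labels).
# Reading the codes as strings preserves any leading zeros.
df_sample = pd.read_csv(
    sample_csv,
    skiprows=[1],
    dtype={"puma12": str, "zcta": str},
)

print(df_sample.dtypes)
```

With the label row gone before parsing, the numeric columns are inferred as numbers, so the later `astype` conversions would likely become optional.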

## 3. Wrangle Data

### 3.1 Assign weighted probabilities

We group by each PUMA to see a list of all available ZIPs within a PUMA.
Using the housing counts from each ZIP code and the sum of these counts for all ZIP codes within the PUMA, we can get weights summing to 1.

Because some ZIPs overlap with multiple PUMAs, this approach is not perfect.
It may result in overestimates for highly populated areas & underestimates for sparsely populated areas.

In [0]:
def normalize_afact(x):
  '''
  Returns a column with probabilities that sum to 1
  
  This function is necessary because the sampling step (originally numpy.random.choice, via its "p" parameter) requires probabilities that sum to 1.
  The raw afact values mostly sum to 1, but occasionally are just above or below 1.
  We need them normalized.
  
  Args:
    x (pandas Series): non-negative weights of numeric type (afact values or housing counts)
    
  Returns:
    pandas Series: of length equal to input "x" containing values that sum to 1
  '''
  
  return x / np.sum(x)

# Apply the normalization within each PUMA; transform keeps the result aligned to the original rows
df_Geocorr['proba_zip_within_puma'] = df_Geocorr.groupby('puma12')['hus20'].transform(normalize_afact)
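
As a worked illustration of the weighting, here is a minimal sketch with made-up housing counts. The PUMA and ZIP values and the name `df_toy` are hypothetical. Each ZIP's weight is its housing count divided by the total count of its PUMA:

```python
import numpy as np
import pandas as pd

# Made-up counts: two PUMAs, each overlapping a few ZIP codes
df_toy = pd.DataFrame({
    "puma12": ["10100", "10100", "10100", "10200", "10200"],
    "zcta":   ["98001", "98002", "98003", "98004", "98005"],
    "hus20":  [500, 300, 200, 900, 100],
})

# Weight = housing count in the ZIP / total housing count in the PUMA
df_toy["proba_zip_within_puma"] = (
    df_toy.groupby("puma12")["hus20"].transform(lambda x: x / x.sum())
)

print(df_toy)

# Weights within each PUMA sum to 1 (up to floating-point rounding)
print(df_toy.groupby("puma12")["proba_zip_within_puma"].sum())
```

In this toy case a house in PUMA 10100 would land in ZIP 98001 about half the time, since that ZIP holds 500 of the PUMA's 1000 housing units.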

## 4. Assign ZIP codes to PUMS

### 4.1 Compile ZIPs

For each PUMS record row, we can use the PUMA-ZIP mapping & corresponding probabilistic weights to randomly select a zip code.
Initially I wanted to approach this problem on a record-by-record basis, but that proved to be inefficient and problematic.
I was using numpy.random.choice and kept experiencing an error where my probabilities did not sum to *exactly* 1.  
Sums were off by 0.0000000001 or less.
The solution was using the [numpy.random.multinomial](https://numpy.org/doc/stable/reference/random/generated/numpy.random.multinomial.html) function.  
It tolerates probabilities whose sum is very close to, but not exactly, 1.
By aggregating to PUMA and using this function, I could randomly assign a zip code to each row containing the specified PUMA.
This approach is vectorized and therefore reasonably efficient in my eyes.

See code comments below for more detailed view of the methodology.

In [0]:
# Filter exclusively to PUMAs found in both the Geocorr file and our 1:1 file
set_common_pumas = set(df_Geocorr.puma12.unique()) & set(df_PUMS_housing.PUMA.unique())

# We'll loop through each PUMA in order of ascending values (sort after the set operation, which does not preserve order)
list_pumas_sorted_asc = np.array(sorted(set_common_pumas))

# Get empty list to append to
list_assigned_zips = []

# For each puma....
for puma in list_pumas_sorted_asc:
  
  # See how many houses we will need to assign a zip code to
  num_applied_houses = len( df_PUMS_housing.query(f'PUMA == "{puma}"') )
  
  # Find out supporting geography, house count by zip information for our puma 
  df_puma_specific_geographies = df_Geocorr.query(f'puma12 == "{puma}"')
  df_puma_specific_geographies = df_puma_specific_geographies.sort_values('zcta', ascending=True)
  
  # Get list of zip codes & their associated probabilities for our puma
  list_available_zips = df_puma_specific_geographies['zcta'].to_numpy()
  list_available_probs = df_puma_specific_geographies['proba_zip_within_puma'].to_numpy()
  
  # For each house (within PUMS) with our specified puma, use our probabilities to see which choice our random number generator makes.  Convert to bool
  df_random_choices = rng.multinomial( 1 , pvals = list_available_probs , size = num_applied_houses)
  df_random_choices = df_random_choices.astype(bool)
  
  # For each house (row in our random output)...
  for i in np.arange(0,len(df_random_choices)):
    
    # ...append the related zip code that was chosen to our formerly initialized list
    list_assigned_zips.append( [puma, list_available_zips[df_random_choices[i]][0]] )
    
# Format into pandas dataframe with proper columns
df_assigned_zips = pd.DataFrame(list_assigned_zips, columns=['PUMA','assigned_zip'])
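
Because each row returned by `multinomial(1, ...)` contains a single 1, the position of that 1 can be read off with `argmax` rather than a Python loop. This is a small sketch on made-up ZIPs and weights, not the code path used in this notebook; `rng_toy`, `toy_zips`, `toy_probs`, and `toy_assigned` are hypothetical names:

```python
import numpy as np

rng_toy = np.random.default_rng(42)

# Made-up ZIP codes and weights for a single PUMA
toy_zips = np.array(["98001", "98002", "98003"])
toy_probs = np.array([0.5, 0.3, 0.2])
num_houses = 10

# One row per house; exactly one entry in each row equals 1
draws = rng_toy.multinomial(1, pvals=toy_probs, size=num_houses)

# argmax returns the column of that 1, which indexes the chosen ZIP
toy_assigned = toy_zips[draws.argmax(axis=1)]

print(toy_assigned)
```

The result is an array with one ZIP per house, which could be paired with the PUMA value to build the same two-column frame.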

### 4.2 Join ZIP codes to PUMS

Our output has one row per PUMS record, provided every PUMA in the PUMS data also appears in the Geocorr file.
Both are ordered by ascending PUMA.
We probably could just assign the new ZIP assignments as a new column, but I decided to join them via a merge statement.

In [0]:
# Compile a "link_id" that we will use to join the datasets: within each PUMA, number the rows by zip (assignments) and by house_id (houses)
df_assigned_zips['link_id'] = df_assigned_zips.groupby('PUMA')['assigned_zip'].rank(method='first')
df_PUMS_housing['link_id'] = df_PUMS_housing.groupby('PUMA')['house_id'].rank(method='first')


# Join the data together
df_PUMS_with_ZIP = df_PUMS_housing.merge(df_assigned_zips, how = 'inner', on = ['PUMA','link_id'])
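
One way to guard a join like this is to let pandas verify that the key is unique on both sides and then compare row counts. A minimal sketch on toy frames with made-up values; note that it swaps in `groupby(...).cumcount()` as a stand-in for the link ID rather than the `rank` call used in this notebook, and that `toy_houses`, `toy_zip_draws`, and `toy_joined` are hypothetical names:

```python
import pandas as pd

toy_houses = pd.DataFrame({
    "PUMA": ["10100", "10100", "10200"],
    "house_id": [7, 3, 5],
})
toy_zip_draws = pd.DataFrame({
    "PUMA": ["10100", "10100", "10200"],
    "assigned_zip": ["98002", "98001", "98004"],
})

# Number the rows within each PUMA (0, 1, 2, ...) on both sides
toy_houses["link_id"] = toy_houses.groupby("PUMA").cumcount()
toy_zip_draws["link_id"] = toy_zip_draws.groupby("PUMA").cumcount()

# validate="one_to_one" raises a MergeError if either side repeats a key
toy_joined = toy_houses.merge(
    toy_zip_draws, how="inner", on=["PUMA", "link_id"], validate="one_to_one"
)

# An inner join silently drops unmatched rows, so compare lengths explicitly
assert len(toy_joined) == len(toy_houses)
print(toy_joined)
```

If a PUMA were missing from one side, the length check would fail instead of quietly producing a shorter output file.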

## 5. Save

In [0]:
df_PUMS_with_ZIP.to_csv(f'../../SupportingDocs/Housing/03_Complete/PUMS_with_zip.csv', index=False)