# Using the Random Library to Generate Synthetic Data
### Randomly generating values for synthetic InfoUSA data
The InfoUSA dataset provided to us was purchased by Duke University, and we signed privacy agreements not to share the data publicly. For this reason, we must generate our own synthetic data that mimics the original InfoUSA data. The synthetic InfoUSA data created in this notebook and used throughout this repository is located in ```/data/source_files/infousa_files```.

### Import statements

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd
import random
import geopandas as gpd
import os
import numpy as np

from shapely.geometry import Polygon
from shapely.geometry import box
from shapely.geometry import Point

### Setting ```DATA_DIR```
To read files from this repository, ```DATA_DIR``` must point to the data folder within this repository. The cell below assumes that ```os.getcwd()``` returns the path to the processing folder of this repository, i.e. ```xxx/codeplus-celine-dcc-package/processing```, where ```xxx``` is the path to where you cloned this repository. If it does not, call ```os.chdir(path)``` with ```path``` set to ```xxx/codeplus-celine-dcc-package/processing``` before running ```DATA_DIR = os.getcwd()```.

In [3]:
DATA_DIR = os.getcwd()
DATA_DIR = DATA_DIR.replace('processing', 'data')
DATA_DIR

'/hpc/home/at341/ondemand/codeplus-celine-dcc-package/data'
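
One caveat with the string replacement above is that ```str.replace``` substitutes every occurrence of ```processing``` in the path, not only the last folder name. A minimal sketch of an alternative using ```pathlib``` is shown below; the helper name ```resolve_data_dir``` and the example path are hypothetical and are not part of this repository.

```python
from pathlib import Path


def resolve_data_dir(cwd: Path) -> Path:
    """Return the sibling 'data' folder of a 'processing' working directory."""
    if cwd.name != "processing":
        # Fail early instead of silently building a wrong path
        raise ValueError(f"expected to run from the processing folder, got: {cwd}")
    return cwd.parent / "data"


# Made-up clone location for illustration; in the notebook one would pass Path.cwd()
example = resolve_data_dir(Path("/home/user/codeplus-celine-dcc-package/processing"))
print(example.as_posix())
```

Only the final path component is swapped, so a parent folder whose name happens to contain ```processing``` is left untouched.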

### Generating synthetic InfoUSA data

The original InfoUSA data has a column giving the state of each household. To make the synthetic data as realistic as possible, we create a list of all the state abbreviations used in the original InfoUSA data, ```states```, and use the random library to select from it at random. We do the same with the ```age_codes``` list, which holds the age codes defined by the original data.

In [4]:
states = [ 'AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
           'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
           'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
           'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
           'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY']
age_codes = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M']
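
As a small illustration of selecting from these lists, the sketch below uses ```random.choice```, which picks one element uniformly at random without a hard-coded index bound. The shortened lists are placeholders for illustration only; the notebook itself indexes the full lists with ```random.randint```.

```python
import random

states_demo = ['NC', 'SC', 'TX']          # shortened list, for illustration only
age_codes_demo = ['A', 'B', 'C', 'D']     # shortened list, for illustration only

random.seed(1)  # fixed seed so repeated runs give the same draws

# One uniform draw per synthetic household
sample_states = [random.choice(states_demo) for _ in range(5)]
sample_ages = [random.choice(age_codes_demo) for _ in range(5)]
print(sample_states)
print(sample_ages)
```

Because ```random.choice``` reads the list length itself, the code keeps working if a state or age code is added to or removed from the list.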

Next, we create a dataframe from a dictionary that maps each column name to the list of values for that column. Hence, we must first build one list per column. To do so, we use an inner for loop that appends 15,000 random values to each list, drawn from the range of values observed in the original InfoUSA data. The outer for loop creates 5 synthetic InfoUSA files, because the original InfoUSA data is split across 38,000 different files. We do this so that this repository uses fake data mimicking the real data as closely as possible.

In these for loops, we use the function ```generate_random_location_within_us```. This function generates ```num_pt``` latitude and longitude coordinates, ensuring that each of them lies within the US. This is done so that we mimic the InfoUSA data as closely as possible and only use 'valid' coordinates. The function first uses the random library's ```uniform``` function to generate a random ```lat_point``` and ```lon_point```, which are latitude and longitude coordinates within the approximate latitude and longitude boundaries of the US. Then, it creates a GeoDataFrame from these two coordinates, transforming them into a Point geometry. This lets us use GeoPandas' ```.sjoin()``` method to determine whether the Point lies within the geometry of the US. If it does, ```lat_point``` and ```lon_point``` are added to the ```lat``` and ```lon``` lists, which are used in the dictionary to create the synthetic dataframe as described above. If it does not, the point is discarded and a new one is drawn.

After all this is done, the last line saves each synthetic dataframe to ```/data/source_files/infousa_files```, mimicking the naming system of the InfoUSA files.

To get the geometry of the US, we import a shapefile with the geometries of all countries in the world, taken from ArcGIS and available [here](https://hub.arcgis.com/datasets/2b93b06dc0dc4e809d3c8db5cb96ba69_0/explore?location=-0.591520%2C0.000000%2C2.17).

In [5]:
df_world = gpd.read_file(DATA_DIR + '/source_files/world_shapefiles/world.shp')
df_us = df_world[(df_world['COUNTRYAFF'] == 'United States') & (df_world['COUNTRY'] == 'United States')]
df_us

Unnamed: 0,FID,COUNTRY,ISO,COUNTRYAFF,AFF_ISO,SHAPE_Leng,SHAPE_Area,geometry
154,155,United States,US,United States,US,726.106056,1116.670604,"MULTIPOLYGON (((-76.39501 39.22999, -76.38695 ..."


In [6]:
def generate_random_location_within_us(num_pt, polygon):
    """
    Generate num_pt random location coordinates that lie within a polygon.
    :param num_pt INT number of random location coordinates
    :param polygon geopandas.GeoDataFrame the geometry of the region
    :return lat, lon lists of location coordinates, latitude and longitude
    """
    i = 0
    lat = []
    lon = []
    
    while i < num_pt:
        ## generate random location coordinates
        lat_point = random.uniform(25, 50) ## these are approximate latitude boundaries of the US
        lon_point = random.uniform(-125, -65) ## these are approximate longitude boundaries of the US
                
        # print(lat_point)
        # print(lon_point)
        
        ## create a GeoDataFrame with the lat/lon coordinates as Point geometry
        d = {'point': ['point1'], 'geometry': [Point(lon_point, lat_point)]}
        gdf = gpd.GeoDataFrame(d, crs="EPSG:4326")
        
        ## append to list only if Point is within polygon geometry    
        if len(gdf.sjoin(polygon, predicate = 'within')) == 1:
            lat.append(lat_point)
            lon.append(lon_point)
            i += 1
    
    return lat, lon
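
The accept-or-redraw idea used above is a form of rejection sampling. The sketch below illustrates the same idea with the standard library only, replacing the GeoPandas spatial join with a hand-written ray-casting point-in-polygon test. The triangular region, its bounding box, and the helper names ```point_in_polygon``` and ```sample_within``` are all hypothetical and chosen for illustration; this is not the method used to produce the files in this repository.

```python
import random


def point_in_polygon(x, y, vertices):
    """Ray-casting test: True if (x, y) lies inside the polygon given by vertices."""
    inside = False
    n = len(vertices)
    for k in range(n):
        x1, y1 = vertices[k]
        x2, y2 = vertices[(k + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def sample_within(num_pt, vertices, lon_range, lat_range, rng):
    """Draw points in the bounding box and keep only those inside the polygon."""
    lat, lon = [], []
    while len(lat) < num_pt:
        lon_point = rng.uniform(*lon_range)
        lat_point = rng.uniform(*lat_range)
        if point_in_polygon(lon_point, lat_point, vertices):
            lat.append(lat_point)
            lon.append(lon_point)
    return lat, lon


# Hypothetical triangular region, vertices given in (lon, lat) order
triangle = [(-80.0, 32.0), (-79.0, 32.0), (-79.5, 33.0)]
rng = random.Random(1)
lat, lon = sample_within(100, triangle, (-80.0, -79.0), (32.0, 33.0), rng)
print(len(lat), len(lon))
```

The number of draws needed grows as the region fills less of its bounding box, which is why the choice of bounding box matters for run time.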

In [10]:
%%time

random.seed(1)

for i in range(5):  
    zipcode = random.randint(10000, 99999)
    ZIP = []
    census_county_2010 = []
    census_state_2010 = []
    ChildrenHHCount = []
    length_of_residence = []
    children_ind = []
    STATE = []
    head_hh_age_code = []
    GE_LATITUDE_2010, GE_LONGITUDE_2010 = generate_random_location_within_us(15000, df_us) # generate 15,000 coordinates within the US
    
    for _ in range(15000):  # separate loop variable so the outer loop's i is not shadowed
        ZIP.append(random.randint(10000, 99999))
        census_county_2010.append(str(random.randint(0, 5)) + str(random.randint(0, 9)) + str(random.randint(0, 9)))
        census_state_2010.append(str(random.randint(0, 5)) + str(random.randint(0, 9)))                                          
        STATE.append(states[random.randint(0, 50)])
        ChildrenHHCount.append(random.randint(0, 15))
        length_of_residence.append(random.randint(0, 70))
        children_ind.append(random.randint(0,1))
        head_hh_age_code.append(age_codes[random.randint(0, 12)])

    # print(len(ZIP))
    # print(len(census_county_2010))
    # print(len(ChildrenHHCount))
    # print(len(length_of_residence))
    # print(len(children_ind))
    # print(len(GE_LONGITUDE_2010))
    # print(len(GE_LATITUDE_2010))
    # print(len(STATE))
    # print(len(head_hh_age_code))
   
    d = {'ZIP': ZIP, 'census_county_2010': census_county_2010, 'census_state_2010': census_state_2010, 'STATE': STATE, 
         'ChildrenHHCount': ChildrenHHCount, 'length_of_residence': length_of_residence, 
         'children_ind': children_ind, 'head_hh_age_code': head_hh_age_code, 'GE_LATITUDE_2010': GE_LATITUDE_2010, 
         'GE_LONGITUDE_2010': GE_LONGITUDE_2010}
    df_synthetic = pd.DataFrame(d)
    df_synthetic
    
    df_synthetic.to_csv(DATA_DIR + '/source_files/infousa_files/Household_Ethnicity_zip_' + str(zipcode) + '_year_2020.txt', sep = '\t', index = False)

CPU times: user 28min 45s, sys: 9.42 s, total: 28min 55s
Wall time: 29min 5s
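
The county and state codes above are built digit by digit, which yields three-character strings from '000' to '599' and two-character strings from '00' to '59'. As a hedged alternative sketch, the same uniform distribution can be produced with one draw per code plus zero padding. Note that this consumes the random number generator differently, so it would not reproduce the exact values generated above under the same seed.

```python
import random

random.seed(1)

# Three-character county codes between '000' and '599'
county_codes = [f"{random.randint(0, 599):03d}" for _ in range(5)]

# Two-character state codes between '00' and '59'
state_codes = [f"{random.randint(0, 59):02d}" for _ in range(5)]

print(county_codes)
print(state_codes)
```

The ```:03d``` and ```:02d``` format specifiers keep the leading zeros, so the codes stay fixed-width strings.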


### Generating synthetic InfoUSA case study data
The original InfoUSA data has real county and state values for each household, but the synthetic data above is created by randomly selecting state abbreviations and randomly generating county FIPS codes. Our visualizations filter for households in Charleston County, county FIPS 45019, and in Harris County, county FIPS 48201. However, it is extremely unlikely that these exact county FIPS codes were generated often enough when creating the synthetic data, and we need enough observations with each code to make a meaningful visualization. Thus, we created two more synthetic InfoUSA datasets specifically for Charleston and Harris County, as needed by the visualizations in this repository.

#### Charleston County
We follow the same steps as above, but the state column is always 'SC', as Charleston County is in South Carolina, the state FIPS is set to 45, and the county FIPS is set to 019, all the correct values for Charleston County. We also restrict the latitude and longitude values to points within Charleston County, using a function with the same structure as ```generate_random_location_within_us``` from the section above. However, instead of passing a GeoDataFrame with the geometry of the US into the ```polygon``` parameter, we pass in ```df_charleston```, a GeoDataFrame with the geometry of Charleston County. In addition, ```generate_random_location_within_charleston``` generates random coordinates within the approximate latitude and longitude boundaries of Charleston County, which makes it more efficient. ```generate_random_location_within_us``` would also work, but the probability that a randomly generated point lies within Charleston County would be significantly lower.

We get this GeoDataFrame by filtering through a shapefile with the geometries for all counties in the US, taken from the United States Census Bureau's Cartographic Boundary Files (available [here](https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html)).

In [14]:
df_counties = gpd.read_file(DATA_DIR + '/source_files/county_shapefiles/counties.shp')
df_charleston = df_counties[(df_counties['NAME'] == 'Charleston')]
df_charleston

Unnamed: 0,STATEFP,COUNTYFP,COUNTYNS,AFFGEOID,GEOID,NAME,NAMELSAD,STUSPS,STATE_NAME,LSAD,ALAND,AWATER,geometry
918,45,19,1252740,0500000US45019,45019,Charleston,Charleston County,SC,South Carolina,6,2377554561,1139619648,"MULTIPOLYGON (((-79.50795 33.02008, -79.50713 ..."


In [16]:
def generate_random_location_within_charleston(num_pt, polygon):
    """
    Generate num_pt random location coordinates that lie within a polygon.
    :param num_pt INT number of random location coordinates
    :param polygon geopandas.GeoDataFrame the geometry of the region
    :return lat, lon lists of location coordinates, latitude and longitude
    """
    i = 0
    lat = []
    lon = []
    
    while i < num_pt:
        ## generate random location coordinates
        lat_point = random.uniform(32.650, 33.080) ## these are approximate latitude boundaries of charleston
        lon_point = random.uniform(-80.370, -79.460) ## these are approximate longitude boundaries of charleston
                
        # print(lat_point)
        # print(lon_point)
        
        ## create a GeoDataFrame with the lat/lon coordinates as Point geometry
        d = {'point': ['point1'], 'geometry': [Point(lon_point, lat_point)]}
        gdf = gpd.GeoDataFrame(d, crs="EPSG:4326")
    
        ## append to list only if Point is within polygon geometry
        if len(gdf.sjoin(polygon, predicate = 'within')) == 1:
            lat.append(lat_point)
            lon.append(lon_point)
            i += 1
    
    return lat, lon
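
The efficiency argument for the tighter bounding box can be checked with rough arithmetic. The sketch below compares the area of the two bounding boxes in square degrees, using the bounds from the two generator functions. It assumes points are uniform in degree space and treats the Charleston box as fully covered by the county, so the result is only an upper bound on the hit probability.

```python
# Bounding boxes (in degrees) taken from the two generator functions
us_box_area = (50 - 25) * (-65 - (-125))                          # 25 x 60 degrees
charleston_box_area = (33.080 - 32.650) * (-79.460 - (-80.370))   # about 0.43 x 0.91 degrees

# Upper bound on the chance that a point drawn in the US box lands in Charleston County
max_hit_probability = charleston_box_area / us_box_area

# Lower bound on the expected number of draws needed for 5,000 accepted points
min_expected_draws = 5000 / max_hit_probability

print(f"{max_hit_probability:.6f}")
print(f"{min_expected_draws:,.0f}")
```

Under these assumptions the hit probability is at most about 0.00026, so drawing from the US-wide box would take on the order of 19 million draws or more to collect 5,000 points, each with its own spatial join.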

In [17]:
%%time

age_codes = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M']

random.seed(1)

ZIP = []
census_county_2010 = []
census_state_2010 = []
ChildrenHHCount = []
length_of_residence = []
children_ind = []
STATE = []
head_hh_age_code = []
GE_LATITUDE_2010, GE_LONGITUDE_2010 = generate_random_location_within_charleston(5000, df_charleston) # generate 5,000 coordinates within charleston

    
for i in range (0, 5000):
    ZIP.append(random.randint(29400, 29500)) ## restrict zipcode values for zipcodes within Charleston County
    census_county_2010.append('019') ## use charleston county fips 019
    census_state_2010.append('45') ## use south carolina state fips 45                                         
    STATE.append('SC') ## use south carolina state abbreviation SC
    ChildrenHHCount.append(random.randint(0, 15))
    length_of_residence.append(random.randint(0, 70))
    children_ind.append(random.randint(0,1))
    head_hh_age_code.append(age_codes[random.randint(0, 12)])
# print(len(ZIP))
# print(len(census_county_2010))
# print(len(ChildrenHHCount))
# print(len(length_of_residence))
# print(len(children_ind))
# print(len(GE_LONGITUDE_2010))
# print(len(GE_LATITUDE_2010))
# print(len(STATE))
# print(len(head_hh_age_code))
   
d = {'ZIP': ZIP, 'census_county_2010': census_county_2010, 'census_state_2010': census_state_2010, 'STATE': STATE, 
     'ChildrenHHCount': ChildrenHHCount, 'length_of_residence': length_of_residence, 
     'children_ind': children_ind, 'head_hh_age_code': head_hh_age_code, 'GE_LATITUDE_2010': GE_LATITUDE_2010, 
     'GE_LONGITUDE_2010': GE_LONGITUDE_2010}
df_synthetic_charleston = pd.DataFrame(d)
df_synthetic_charleston

CPU times: user 2min 32s, sys: 948 ms, total: 2min 33s
Wall time: 2min 34s


Unnamed: 0,ZIP,census_county_2010,census_state_2010,STATE,ChildrenHHCount,length_of_residence,children_ind,head_hh_age_code,GE_LATITUDE_2010,GE_LONGITUDE_2010
0,29445,019,45,SC,0,45,0,J,32.978423,-80.137887
1,29440,019,45,SC,10,51,0,K,32.863037,-79.960963
2,29428,019,45,SC,9,44,0,A,32.930185,-79.652262
3,29476,019,45,SC,5,61,0,K,32.690360,-80.344204
4,29444,019,45,SC,14,68,0,F,32.841516,-79.713399
...,...,...,...,...,...,...,...,...,...,...
4995,29412,019,45,SC,8,70,1,E,32.845994,-80.030114
4996,29427,019,45,SC,8,9,0,G,32.719291,-79.941533
4997,29408,019,45,SC,3,41,0,I,32.668219,-80.197994
4998,29443,019,45,SC,7,70,0,B,32.675563,-80.187002


##### Processing
Since this data will be used directly in visualization notebook **04_charleston_dist**, we must process it the same way we would process the synthetic InfoUSA data from above in processing notebook **01_merging_files**.

In [18]:
%%time
df_synthetic_charleston['county_fips'] = df_synthetic_charleston['census_state_2010'] + df_synthetic_charleston['census_county_2010']

df_synthetic_charleston = df_synthetic_charleston[['ZIP', 'county_fips', 'STATE', 'ChildrenHHCount', 'children_ind', 'head_hh_age_code', 
                                           'GE_LATITUDE_2010', 'GE_LONGITUDE_2010']]
df_synthetic_charleston

CPU times: user 6.42 ms, sys: 4 µs, total: 6.42 ms
Wall time: 5.75 ms


Unnamed: 0,ZIP,county_fips,STATE,ChildrenHHCount,children_ind,head_hh_age_code,GE_LATITUDE_2010,GE_LONGITUDE_2010
0,29445,45019,SC,0,0,J,32.978423,-80.137887
1,29440,45019,SC,10,0,K,32.863037,-79.960963
2,29428,45019,SC,9,0,A,32.930185,-79.652262
3,29476,45019,SC,5,0,K,32.690360,-80.344204
4,29444,45019,SC,14,0,F,32.841516,-79.713399
...,...,...,...,...,...,...,...,...
4995,29412,45019,SC,8,1,E,32.845994,-80.030114
4996,29427,45019,SC,8,0,G,32.719291,-79.941533
4997,29408,45019,SC,3,0,I,32.668219,-80.197994
4998,29443,45019,SC,7,0,B,32.675563,-80.187002
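
The concatenation above works because both code columns hold strings, so '45' and '019' join into '45019'. A minimal sketch of a more defensive variant is shown below; the helper name ```county_fips``` is hypothetical. It zero-pads each part first, which would matter if the codes were ever stored as integers and lost their leading zeros.

```python
def county_fips(state_code, county_code):
    """Combine a state and county FIPS code into a 5-character county FIPS string."""
    return str(state_code).zfill(2) + str(county_code).zfill(3)


print(county_fips('45', '019'))   # Charleston County -> 45019
print(county_fips(48, 201))       # Harris County -> 48201
print(county_fips(45, 19))        # integer inputs are padded -> 45019
```

With integer inputs, plain addition would instead give 45 + 19 = 64, which is why the codes are kept as strings throughout.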


**Renaming columns**: 
We rename the columns in our dataset for standardization purposes.

In [19]:
df_synthetic_charleston.rename(columns = {'ZIP': 'zip', 'STATE': 'state', 'ChildrenHHCount': 'child_num', 
                           'children_ind': 'has_child', 'head_hh_age_code': 'age_code', 'GE_LATITUDE_2010': 'lat_h_4326', 
                            'GE_LONGITUDE_2010': 'lon_h_4326'}, inplace = True)
df_synthetic_charleston

Unnamed: 0,zip,county_fips,state,child_num,has_child,age_code,lat_h_4326,lon_h_4326
0,29445,45019,SC,0,0,J,32.978423,-80.137887
1,29440,45019,SC,10,0,K,32.863037,-79.960963
2,29428,45019,SC,9,0,A,32.930185,-79.652262
3,29476,45019,SC,5,0,K,32.690360,-80.344204
4,29444,45019,SC,14,0,F,32.841516,-79.713399
...,...,...,...,...,...,...,...,...
4995,29412,45019,SC,8,1,E,32.845994,-80.030114
4996,29427,45019,SC,8,0,G,32.719291,-79.941533
4997,29408,45019,SC,3,0,I,32.668219,-80.197994
4998,29443,45019,SC,7,0,B,32.675563,-80.187002


**Transforming household latitude and longitude coordinates from EPSG 4326 to EPSG 3857**.
Many of our visualizations need coordinates in EPSG 3857, but these coordinates are in EPSG 4326. Therefore, we use pyproj, the Python interface to the PROJ coordinate transformation software, to transform our EPSG 4326 coordinates to EPSG 3857. This adds two new columns with the transformed coordinates to our dataset.

In [20]:
from pyproj import Proj, Transformer

In [21]:
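# Apply transformation
# Note: without always_xy=True, pyproj follows each CRS's axis order, so the EPSG:4326 input
# is (lat, lon) and the EPSG:3857 output is (x, y). As a result, 'lat_h_3857' holds the
# easting (x) and 'lon_h_3857' holds the northing (y).
transform_4326_to_3857 = Transformer.from_crs('epsg:4326', 'epsg:3857')
df_synthetic_charleston['lat_h_3857'], df_synthetic_charleston['lon_h_3857'] = transform_4326_to_3857.transform(
                                                df_synthetic_charleston['lat_h_4326'], df_synthetic_charleston['lon_h_4326'])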

df_synthetic_charleston

Unnamed: 0,zip,county_fips,state,child_num,has_child,age_code,lat_h_4326,lon_h_4326,lat_h_3857,lon_h_3857
0,29445,45019,SC,0,0,J,32.978423,-80.137887,-8.920909e+06,3.892440e+06
1,29440,45019,SC,10,0,K,32.863037,-79.960963,-8.901214e+06,3.877139e+06
2,29428,45019,SC,9,0,A,32.930185,-79.652262,-8.866849e+06,3.886041e+06
3,29476,45019,SC,5,0,K,32.690360,-80.344204,-8.943876e+06,3.854276e+06
4,29444,45019,SC,14,0,F,32.841516,-79.713399,-8.873655e+06,3.874287e+06
...,...,...,...,...,...,...,...,...,...,...
4995,29412,45019,SC,8,1,E,32.845994,-80.030114,-8.908912e+06,3.874880e+06
4996,29427,45019,SC,8,0,G,32.719291,-79.941533,-8.899051e+06,3.858104e+06
4997,29408,45019,SC,3,0,I,32.668219,-80.197994,-8.927600e+06,3.851348e+06
4998,29443,45019,SC,7,0,B,32.675563,-80.187002,-8.926376e+06,3.852319e+06
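
For reference, the conversion from EPSG 4326 to EPSG 3857 is the spherical Web Mercator projection, which can be written out by hand. The sketch below is a pure-Python illustration checked against the first row of the table above; the helper name ```to_web_mercator``` is hypothetical and is not part of pyproj. It returns the easting (x) first and the northing (y) second. In the table above, the easting appears in the ```lat_h_3857``` column and the northing in the ```lon_h_3857``` column.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, the sphere radius used by EPSG:3857


def to_web_mercator(lat_deg, lon_deg):
    """Project a latitude/longitude pair in degrees to Web Mercator metres (x, y)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# First household in the table above: lat 32.978423, lon -80.137887
x, y = to_web_mercator(32.978423, -80.137887)
print(f"x = {x:.6e}, y = {y:.6e}")
```

The result agrees with the table to within a few metres, the difference coming from the rounding of the displayed coordinates.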


**Exporting final dataframe**. Finally, we export this dataframe to ```/data/source_files/infousa_files``` for use in our visualizations.

In [22]:
%%time
df_synthetic_charleston.to_parquet(DATA_DIR + '/source_files/infousa_files/charleston_households.parquet')

CPU times: user 32.4 ms, sys: 8.95 ms, total: 41.4 ms
Wall time: 45.7 ms


#### Harris County
We follow the same steps as above, but the state column is always 'TX', as Harris County is in Texas, the state FIPS is set to 48, and the county FIPS is set to 201, all the correct values for Harris County. We also restrict the latitude and longitude values to points within Harris County.

In [24]:
df_harris = df_counties[(df_counties['NAME'] == 'Harris') & (df_counties['STUSPS'] == 'TX')]
df_harris

Unnamed: 0,STATEFP,COUNTYFP,COUNTYNS,AFFGEOID,GEOID,NAME,NAMELSAD,STUSPS,STATE_NAME,LSAD,ALAND,AWATER,geometry
1160,48,201,1383886,0500000US48201,48201,Harris,Harris County,TX,Texas,6,4421068052,182379558,"MULTIPOLYGON (((-94.97839 29.68365, -94.97743 ..."


In [26]:
def generate_random_location_within_harris(num_pt, polygon):
    """
    Generate num_pt random location coordinates that lie within a polygon.
    :param num_pt INT number of random location coordinates
    :param polygon geopandas.GeoDataFrame the geometry of the region
    :return lat, lon lists of location coordinates, latitude and longitude
    """
    i = 0
    lat = []
    lon = []
    
    while i < num_pt:
        ## generate random location coordinates
        lat_point = random.uniform(29.530, 30.120) ## these are approximate latitude boundaries of harris
        lon_point = random.uniform(-95.820, -94.960) ## these are approximate longitude boundaries of harris
        
        # print(lat_point)
        # print(lon_point)
        
        ## create a GeoDataFrame with the lat/lon coordinates as Point geometry
        d = {'point': ['point1'], 'geometry': [Point(lon_point, lat_point)]}
        gdf = gpd.GeoDataFrame(d, crs="EPSG:4326")
        
        ## append to list only if Point is within polygon geometry
        if len(gdf.sjoin(polygon, predicate = 'within')) == 1:
            lat.append(lat_point)
            lon.append(lon_point)
            i += 1
    
    return lat, lon

In [28]:
%%time

age_codes = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M']

random.seed(1)

ZIP = []
census_county_2010 = []
census_state_2010 = []
ChildrenHHCount = []
length_of_residence = []
children_ind = []
STATE = []
head_hh_age_code = []
GE_LATITUDE_2010, GE_LONGITUDE_2010 = generate_random_location_within_harris(10000, df_harris) # generate 10,000 coordinates within harris
    
for i in range (0, 10000):
    ZIP.append(random.randint(77000, 77300)) ## restrict zipcode values for zipcodes within Harris County
    census_county_2010.append('201') ## use harris county fips 201
    census_state_2010.append('48') ## use texas state fips 48                                      
    STATE.append('TX') ## use texas abbreviation TX
    ChildrenHHCount.append(random.randint(0, 15))
    length_of_residence.append(random.randint(0, 70))
    children_ind.append(random.randint(0,1))
    head_hh_age_code.append(age_codes[random.randint(0, 12)])
    
# print(len(ZIP))
# print(len(census_county_2010))
# print(len(ChildrenHHCount))
# print(len(length_of_residence))
# print(len(children_ind))
# print(len(GE_LONGITUDE_2010))
# print(len(GE_LATITUDE_2010))
# print(len(STATE))
# print(len(head_hh_age_code))
   
d = {'ZIP': ZIP, 'census_county_2010': census_county_2010, 'census_state_2010': census_state_2010, 'STATE': STATE, 
     'ChildrenHHCount': ChildrenHHCount, 'length_of_residence': length_of_residence, 
     'children_ind': children_ind, 'head_hh_age_code': head_hh_age_code, 'GE_LATITUDE_2010': GE_LATITUDE_2010, 
     'GE_LONGITUDE_2010': GE_LONGITUDE_2010}
df_synthetic_harris = pd.DataFrame(d)
df_synthetic_harris

CPU times: user 2min 48s, sys: 1.11 s, total: 2min 49s
Wall time: 2min 50s


Unnamed: 0,ZIP,census_county_2010,census_state_2010,STATE,ChildrenHHCount,length_of_residence,children_ind,head_hh_age_code,GE_LATITUDE_2010,GE_LONGITUDE_2010
0,77291,201,48,TX,2,51,0,D,29.609275,-95.091207
1,77173,201,48,TX,7,67,1,B,29.980627,-95.600641
2,77235,201,48,TX,15,1,0,D,29.822307,-95.433438
3,77243,201,48,TX,6,48,0,L,29.914440,-95.141698
4,77229,201,48,TX,11,46,1,H,30.023101,-95.447820
...,...,...,...,...,...,...,...,...,...,...
9995,77092,201,48,TX,14,31,1,H,29.914909,-95.722899
9996,77147,201,48,TX,14,17,0,B,30.070496,-95.794543
9997,77019,201,48,TX,0,0,1,F,29.846029,-95.535133
9998,77108,201,48,TX,1,53,1,I,29.759730,-95.619028


##### Processing
Since this data will be used directly in visualization notebook **05_harris_dist**, we must process it the same way we would process the synthetic InfoUSA data from above in processing notebook **01_merging_files**.

In [29]:
%%time
df_synthetic_harris['county_fips'] = df_synthetic_harris['census_state_2010'] + df_synthetic_harris['census_county_2010']

df_synthetic_harris = df_synthetic_harris[['ZIP', 'county_fips', 'STATE', 'ChildrenHHCount', 'children_ind', 'head_hh_age_code', 
                                           'GE_LATITUDE_2010', 'GE_LONGITUDE_2010']]
df_synthetic_harris

CPU times: user 6.99 ms, sys: 989 µs, total: 7.98 ms
Wall time: 7.26 ms


Unnamed: 0,ZIP,county_fips,STATE,ChildrenHHCount,children_ind,head_hh_age_code,GE_LATITUDE_2010,GE_LONGITUDE_2010
0,77291,48201,TX,2,0,D,29.609275,-95.091207
1,77173,48201,TX,7,1,B,29.980627,-95.600641
2,77235,48201,TX,15,0,D,29.822307,-95.433438
3,77243,48201,TX,6,0,L,29.914440,-95.141698
4,77229,48201,TX,11,1,H,30.023101,-95.447820
...,...,...,...,...,...,...,...,...
9995,77092,48201,TX,14,1,H,29.914909,-95.722899
9996,77147,48201,TX,14,0,B,30.070496,-95.794543
9997,77019,48201,TX,0,1,F,29.846029,-95.535133
9998,77108,48201,TX,1,1,I,29.759730,-95.619028


**Renaming columns**: 
We rename the columns in our dataset for standardization purposes.

In [30]:
df_synthetic_harris.rename(columns = {'ZIP': 'zip', 'STATE': 'state', 'ChildrenHHCount': 'child_num', 
                           'children_ind': 'has_child', 'head_hh_age_code': 'age_code', 'GE_LATITUDE_2010': 'lat_h_4326', 
                            'GE_LONGITUDE_2010': 'lon_h_4326'}, inplace = True)
df_synthetic_harris

Unnamed: 0,zip,county_fips,state,child_num,has_child,age_code,lat_h_4326,lon_h_4326
0,77291,48201,TX,2,0,D,29.609275,-95.091207
1,77173,48201,TX,7,1,B,29.980627,-95.600641
2,77235,48201,TX,15,0,D,29.822307,-95.433438
3,77243,48201,TX,6,0,L,29.914440,-95.141698
4,77229,48201,TX,11,1,H,30.023101,-95.447820
...,...,...,...,...,...,...,...,...
9995,77092,48201,TX,14,1,H,29.914909,-95.722899
9996,77147,48201,TX,14,0,B,30.070496,-95.794543
9997,77019,48201,TX,0,1,F,29.846029,-95.535133
9998,77108,48201,TX,1,1,I,29.759730,-95.619028


**Transforming household latitude and longitude coordinates from EPSG 4326 to EPSG 3857**.
Many of our visualizations need coordinates in EPSG 3857, but these coordinates are in EPSG 4326. Therefore, we use pyproj, the Python interface to the PROJ coordinate transformation software, to transform our EPSG 4326 coordinates to EPSG 3857. This adds two new columns with the transformed coordinates to our dataset.

In [31]:
from pyproj import Proj, Transformer

In [32]:
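# Apply transformation
# Note: without always_xy=True, pyproj follows each CRS's axis order, so the EPSG:4326 input
# is (lat, lon) and the EPSG:3857 output is (x, y). As a result, 'lat_h_3857' holds the
# easting (x) and 'lon_h_3857' holds the northing (y).
transform_4326_to_3857 = Transformer.from_crs('epsg:4326', 'epsg:3857')
df_synthetic_harris['lat_h_3857'], df_synthetic_harris['lon_h_3857'] = transform_4326_to_3857.transform(
                                                df_synthetic_harris['lat_h_4326'], df_synthetic_harris['lon_h_4326'])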

df_synthetic_harris

Unnamed: 0,zip,county_fips,state,child_num,has_child,age_code,lat_h_4326,lon_h_4326,lat_h_3857,lon_h_3857
0,77291,48201,TX,2,0,D,29.609275,-95.091207,-1.058550e+07,3.453424e+06
1,77173,48201,TX,7,1,B,29.980627,-95.600641,-1.064221e+07,3.501060e+06
2,77235,48201,TX,15,0,D,29.822307,-95.433438,-1.062360e+07,3.480729e+06
3,77243,48201,TX,6,0,L,29.914440,-95.141698,-1.059113e+07,3.492557e+06
4,77229,48201,TX,11,1,H,30.023101,-95.447820,-1.062520e+07,3.506520e+06
...,...,...,...,...,...,...,...,...,...,...
9995,77092,48201,TX,14,1,H,29.914909,-95.722899,-1.065582e+07,3.492617e+06
9996,77147,48201,TX,14,0,B,30.070496,-95.794543,-1.066380e+07,3.512615e+06
9997,77019,48201,TX,0,1,F,29.846029,-95.535133,-1.063492e+07,3.483774e+06
9998,77108,48201,TX,1,1,I,29.759730,-95.619028,-1.064426e+07,3.472703e+06


**Exporting final dataframe**. Finally, we export this dataframe to ```/data/source_files/infousa_files``` for use in our visualizations.

In [33]:
%%time
df_synthetic_harris.to_parquet(DATA_DIR + '/source_files/infousa_files/harris_households.parquet')

CPU times: user 13 ms, sys: 13.8 ms, total: 26.8 ms
Wall time: 31.8 ms
