# Using the Random Library to Generate Synthetic Data
### Randomly generating values for synthetic InfoUSA data
The InfoUSA dataset provided to us was purchased by Duke University, and we signed privacy agreements not to share the data publicly. For this reason, we must generate our own synthetic data that mimics the original InfoUSA data. The synthetic InfoUSA data created in this notebook and used throughout this repository is located in ```/data/source_files/infousa_files```.

### Import statements

In [1]:
import pandas as pd
import random
import os

### Setting ```DATA_DIR```
To read in files from this repository, we must set ```DATA_DIR``` to the data folder within this repository. This requires ```os.getcwd()``` to return the path to the processing folder of this repository, ```xxx/codeplus-celine-dcc-package/processing```, where ```xxx``` is the path to where you cloned this repository. If it does not, call ```os.chdir(path)```, with ```path``` set to ```xxx/codeplus-celine-dcc-package/processing```, before running ```DATA_DIR = os.getcwd()```.

In [2]:
DATA_DIR = os.getcwd()
DATA_DIR = DATA_DIR.replace('processing', 'data')
DATA_DIR

'/hpc/home/at341/ondemand/codeplus-celine-dcc-package/data'
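
Because `str.replace` substitutes every occurrence of `'processing'` in the path, the cell above would also alter a parent folder whose name happens to contain that word. A minimal sketch of an alternative, assuming the `processing` and `data` folders are siblings as described above; the helper name `data_dir_from` and the example path are hypothetical and not part of this repository:

```python
from pathlib import Path

def data_dir_from(processing_dir):
    """Return the sibling 'data' folder of the given 'processing' folder."""
    processing_dir = Path(processing_dir)
    if processing_dir.name != 'processing':
        raise ValueError(f"expected a path ending in 'processing', got {processing_dir}")
    # Swap only the final path component, leaving parent folders untouched
    return str(processing_dir.with_name('data'))

# Example with a made-up clone location whose parent folder also contains 'processing'
print(data_dir_from('/home/user/processing_projects/codeplus-celine-dcc-package/processing'))
```

In the notebook, this could be called as `data_dir_from(os.getcwd())`.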

### Generating synthetic InfoUSA data

The original InfoUSA data has a column giving the state of each household. To make the synthetic data as realistic as possible, we created a list of all the state abbreviations used in the original InfoUSA data, ```states```, and use the random library to select from this list at random. We do the same with the ```age_codes``` list, which holds the age codes defined by the original data.

In [None]:
states = [ 'AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
           'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
           'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
           'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
           'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY']
age_codes = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M']

Next, to create a dataframe, we can use a dictionary that maps each column name to the list of values for that column. Hence, we must first create a list of values for each column. To do so, we coded an inner for loop that appends 100,000 random values to each list, drawn from the range of values observed for that column in the original InfoUSA data. The outer for loop creates 10 synthetic InfoUSA files, since the original InfoUSA data is also split across many files (38,000 of them). We do this so that this repository uses fake data mimicking the real data as closely as possible.

The last line saves each synthetic dataframe to ```/data/source_files/infousa_files```.

In [28]:
%%time

random.seed(1)

for i in range(10):
    # Random five-digit ZIP code, used only in the name of this synthetic file
    zipcode = random.randint(10000, 99999)
    print(zipcode)
    ZIP = []
    census_county_2010 = []
    census_state_2010 = []
    ChildrenHHCount = []
    length_of_residence = []
    children_ind = []
    GE_LONGITUDE_2010 = []
    GE_LATITUDE_2010 = []
    STATE = []
    head_hh_age_code = []
    
    # Inner loop: one synthetic household per iteration (uses _ to avoid shadowing the outer i)
    for _ in range(0, 100000):
        ZIP.append(random.randint(10000, 99999))
        census_county_2010.append(str(random.randint(0, 5)) + str(random.randint(0, 9)) + str(random.randint(0, 9)))
        census_state_2010.append(str(random.randint(0, 5)) + str(random.randint(0, 9)))                                          
        STATE.append(states[random.randint(0, 50)])
        ChildrenHHCount.append(random.randint(0, 15))
        length_of_residence.append(random.randint(0, 70))
        children_ind.append(random.randint(0,1))
        head_hh_age_code.append(age_codes[random.randint(0, 12)])
        GE_LONGITUDE_2010.append(random.uniform(-125, -65))
        GE_LATITUDE_2010.append(random.uniform(35, 50))

    # print(len(ZIP))
    # print(len(census_county_2010))
    # print(len(ChildrenHHCount))
    # print(len(length_of_residence))
    # print(len(children_ind))
    # print(len(GE_LONGITUDE_2010))
    # print(len(GE_LATITUDE_2010))
    # print(len(STATE))
    # print(len(head_hh_age_code))
   
    d = {'ZIP': ZIP, 'census_county_2010': census_county_2010, 'census_state_2010': census_state_2010, 'STATE': STATE, 
         'ChildrenHHCount': ChildrenHHCount, 'length_of_residence': length_of_residence, 
         'children_ind': children_ind, 'head_hh_age_code': head_hh_age_code, 'GE_LATITUDE_2010': GE_LATITUDE_2010, 
         'GE_LONGITUDE_2010': GE_LONGITUDE_2010}
    df_synthetic = pd.DataFrame(d)
    # Write this synthetic file as tab-separated text
    df_synthetic.to_csv(DATA_DIR + '/source_files/infousa_files/Household_Ethnicity_zip_' + str(zipcode) + '_year_2020.txt', sep = '\t', index = False)

27611
26238
43975
44452
46782
63537
87821
37619
48736
17556
CPU times: user 14 s, sys: 132 ms, total: 14.1 s
Wall time: 14.5 s
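
The per-column generation above can be sketched at a smaller scale. The snippet below is an illustration rather than the code used to build the repository's files: it swaps in `random.choice` and a zero-padded format string in place of index lookups and digit-by-digit concatenation, uses a shortened `states` list, and the helper name `make_rows` is hypothetical. Because these calls may consume random numbers differently, it would not necessarily reproduce the files above even with the same seed.

```python
import random

states = ['AK', 'AL', 'AR']            # shortened list, for illustration only
age_codes = list('ABCDEFGHIJKLM')      # the 13 age codes used above

def make_rows(n, rng):
    """Build a dict of equal-length column lists holding n random rows."""
    columns = {'ZIP': [], 'census_county_2010': [], 'STATE': [], 'head_hh_age_code': []}
    for _ in range(n):
        columns['ZIP'].append(rng.randint(10000, 99999))
        # Three-character string from '000' to '599', matching the digit ranges above
        columns['census_county_2010'].append(f"{rng.randint(0, 599):03d}")
        columns['STATE'].append(rng.choice(states))
        columns['head_hh_age_code'].append(rng.choice(age_codes))
    return columns

rng = random.Random(1)                 # separate generator, so global random state is untouched
cols = make_rows(5, rng)
print({name: len(values) for name, values in cols.items()})
```

A dictionary like `cols` can be passed directly to `pd.DataFrame`, as in the cell above.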


### Generating synthetic InfoUSA case study data
The original InfoUSA data has real county and state values for each household, but the synthetic data above randomly selects state abbreviations and randomly generates county fips codes. When we filter for households in Charleston County for our visualizations, we look for county fips 45019 (and likewise 48201 for Harris County). However, any given synthetic row is extremely unlikely to receive this exact county fips, and we need enough observations with that county fips to make a meaningful visualization. Thus, we created two more synthetic InfoUSA datasets specifically for Charleston County and Harris County, which the visualizations in this repository use.
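
To quantify how rare a matching row is, the arithmetic can be worked out from the digit ranges in the generation cell above. This is a back-of-the-envelope sketch that assumes every digit is drawn independently and uniformly, which is what `random.randint` is designed to provide:

```python
from fractions import Fraction

# State code: first digit 0-5 (6 options), second digit 0-9 (10 options)
state_codes = 6 * 10
# County code: first digit 0-5, then two digits 0-9
county_codes = 6 * 10 * 10

# Chance that one synthetic row gets state '45' and county '019'
p_row = Fraction(1, state_codes * county_codes)

# 10 files of 100,000 rows each
total_rows = 10 * 100_000
expected_rows = float(total_rows * p_row)

print(p_row)                    # 1/36000
print(round(expected_rows, 1))  # 27.8
```

Roughly 28 matching rows out of a million, scattered across random states and coordinates, would be far too sparse for a county-level visualization.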

#### Charleston County
We follow the same steps as above, but the state column is always 'SC', as Charleston County is in South Carolina, the state fips is set to 45, and the county fips is set to 019, the correct values for Charleston County. We also restrict the latitude and longitude to values roughly within Charleston County.

In [61]:
%%time

age_codes = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M']

random.seed(1)

ZIP = []
census_county_2010 = []
census_state_2010 = []
ChildrenHHCount = []
length_of_residence = []
children_ind = []
GE_LONGITUDE_2010 = []
GE_LATITUDE_2010 = []
STATE = []
head_hh_age_code = []
    
for i in range (0, 100000):
    ZIP.append(random.randint(29400, 29500))
    census_county_2010.append('019')
    census_state_2010.append('45')                                          
    STATE.append('SC')
    ChildrenHHCount.append(random.randint(0, 15))
    length_of_residence.append(random.randint(0, 70))
    children_ind.append(random.randint(0,1))
    head_hh_age_code.append(age_codes[random.randint(0, 12)])
    GE_LONGITUDE_2010.append(random.uniform(-80.370, -79.460))
    GE_LATITUDE_2010.append(random.uniform(32.650, 33.080))
# print(len(ZIP))
# print(len(census_county_2010))
# print(len(ChildrenHHCount))
# print(len(length_of_residence))
# print(len(children_ind))
# print(len(GE_LONGITUDE_2010))
# print(len(GE_LATITUDE_2010))
# print(len(STATE))
# print(len(head_hh_age_code))
   
d = {'ZIP': ZIP, 'census_county_2010': census_county_2010, 'census_state_2010': census_state_2010, 'STATE': STATE, 
     'ChildrenHHCount': ChildrenHHCount, 'length_of_residence': length_of_residence, 
     'children_ind': children_ind, 'head_hh_age_code': head_hh_age_code, 'GE_LATITUDE_2010': GE_LATITUDE_2010, 
     'GE_LONGITUDE_2010': GE_LONGITUDE_2010}
df_synthetic_charleston = pd.DataFrame(d)
df_synthetic_charleston

CPU times: user 577 ms, sys: 16.9 ms, total: 593 ms
Wall time: 594 ms


Unnamed: 0,ZIP,census_county_2010,census_state_2010,STATE,ChildrenHHCount,length_of_residence,children_ind,head_hh_age_code,GE_LATITUDE_2010,GE_LONGITUDE_2010
0,29417,019,45,SC,2,32,0,H,32.853065,-79.677524
1,29448,019,45,SC,6,12,1,A,32.817618,-79.557081
2,29477,019,45,SC,0,57,1,L,32.904196,-79.640338
3,29413,019,45,SC,10,3,0,A,32.653958,-79.778913
4,29448,019,45,SC,6,54,0,I,32.838292,-80.168261
...,...,...,...,...,...,...,...,...,...,...
99995,29451,019,45,SC,4,34,1,G,32.859898,-79.983927
99996,29415,019,45,SC,0,37,1,I,32.976982,-80.108903
99997,29472,019,45,SC,9,23,1,C,33.027838,-79.829499
99998,29431,019,45,SC,2,18,0,M,32.805037,-79.584401


##### Processing
Since this data will be used directly in visualization notebook **04_charleston_dist**, we must then process it the same way we would process the synthetic InfoUSA data from above in processing notebook **01_merging_files**.

In [62]:
%%time
df_synthetic_charleston['county_fips'] = df_synthetic_charleston['census_state_2010'] + df_synthetic_charleston['census_county_2010']

df_synthetic_charleston = df_synthetic_charleston[['ZIP', 'county_fips', 'STATE', 'ChildrenHHCount', 'children_ind', 'head_hh_age_code', 
                                           'GE_LATITUDE_2010', 'GE_LONGITUDE_2010']]
df_synthetic_charleston

CPU times: user 15.4 ms, sys: 1.05 ms, total: 16.4 ms
Wall time: 14.9 ms


Unnamed: 0,ZIP,county_fips,STATE,ChildrenHHCount,children_ind,head_hh_age_code,GE_LATITUDE_2010,GE_LONGITUDE_2010
0,29417,45019,SC,2,0,H,32.853065,-79.677524
1,29448,45019,SC,6,1,A,32.817618,-79.557081
2,29477,45019,SC,0,1,L,32.904196,-79.640338
3,29413,45019,SC,10,0,A,32.653958,-79.778913
4,29448,45019,SC,6,0,I,32.838292,-80.168261
...,...,...,...,...,...,...,...,...
99995,29451,45019,SC,4,1,G,32.859898,-79.983927
99996,29415,45019,SC,0,1,I,32.976982,-80.108903
99997,29472,45019,SC,9,1,C,33.027838,-79.829499
99998,29431,45019,SC,2,0,M,32.805037,-79.584401
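
The concatenation above works because both codes are stored as strings, so leading zeros such as the one in `'019'` survive. A small sketch of the same idea for the case where the codes arrive as integers instead; the function name `make_county_fips` is hypothetical:

```python
def make_county_fips(state_code, county_code):
    """Join a state code and a county code into a five-character county FIPS string."""
    # zfill restores leading zeros that are lost when codes are stored as integers
    state = str(state_code).zfill(2)
    county = str(county_code).zfill(3)
    return state + county

print(make_county_fips('45', '019'))  # 45019
print(make_county_fips(45, 19))       # 45019
print(make_county_fips(1, 1))         # 01001
```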


**Renaming columns**: 
We rename the columns in our dataset for standardization purposes.

In [63]:
df_synthetic_charleston.rename(columns = {'ZIP': 'zip', 'STATE': 'state', 'ChildrenHHCount': 'child_num', 
                           'children_ind': 'has_child', 'head_hh_age_code': 'age_code', 'GE_LATITUDE_2010': 'lat_h_4326', 
                            'GE_LONGITUDE_2010': 'lon_h_4326'}, inplace = True)
df_synthetic_charleston

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_synthetic_charleston.rename(columns = {'ZIP': 'zip', 'STATE': 'state', 'ChildrenHHCount': 'child_num',


Unnamed: 0,zip,county_fips,state,child_num,has_child,age_code,lat_h_4326,lon_h_4326
0,29417,45019,SC,2,0,H,32.853065,-79.677524
1,29448,45019,SC,6,1,A,32.817618,-79.557081
2,29477,45019,SC,0,1,L,32.904196,-79.640338
3,29413,45019,SC,10,0,A,32.653958,-79.778913
4,29448,45019,SC,6,0,I,32.838292,-80.168261
...,...,...,...,...,...,...,...,...
99995,29451,45019,SC,4,1,G,32.859898,-79.983927
99996,29415,45019,SC,0,1,I,32.976982,-80.108903
99997,29472,45019,SC,9,1,C,33.027838,-79.829499
99998,29431,45019,SC,2,0,M,32.805037,-79.584401
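
The `SettingWithCopyWarning` above appears because selecting a list of columns can leave pandas unsure whether the result is a view of the original dataframe or a copy, and `rename(..., inplace = True)` then modifies that result. One common way to avoid the warning is to take an explicit copy and use the non-inplace form of `rename`. A minimal sketch on a two-row toy dataframe (the values are made up, and this is not the code used in this notebook):

```python
import pandas as pd

df = pd.DataFrame({'ZIP': [29417, 29448],
                   'STATE': ['SC', 'SC'],
                   'length_of_residence': [32, 12]})

# .copy() makes the column selection an independent dataframe
subset = df[['ZIP', 'STATE']].copy()

# rename without inplace returns a new dataframe
subset = subset.rename(columns={'ZIP': 'zip', 'STATE': 'state'})

# Adding a column to the explicit copy does not touch the original dataframe
subset['zip_str'] = subset['zip'].astype(str)

print(list(subset.columns))  # ['zip', 'state', 'zip_str']
print(list(df.columns))      # ['ZIP', 'STATE', 'length_of_residence']
```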


**Transforming household latitude and longitude coordinates from EPSG 4326 to EPSG 3857**.
Many of our visualizations need coordinates in EPSG 3857, but these coordinates are in EPSG 4326. Therefore, we use pyproj, a Python interface to the PROJ coordinate transformation software, to transform our EPSG 4326 coordinates to EPSG 3857. This adds two new columns with the transformed coordinates to our dataset.

In [64]:
from pyproj import Proj, Transformer

In [65]:
# Apply transformation
transform_4326_to_3857 = Transformer.from_crs('epsg:4326', 'epsg:3857')
df_synthetic_charleston['lat_h_3857'], df_synthetic_charleston['lon_h_3857'] = transform_4326_to_3857.transform(
                                                df_synthetic_charleston['lat_h_4326'], df_synthetic_charleston['lon_h_4326'])

df_synthetic_charleston

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_synthetic_charleston['lat_h_3857'], df_synthetic_charleston['lon_h_3857'] = transform_4326_to_3857.transform(


Unnamed: 0,zip,county_fips,state,child_num,has_child,age_code,lat_h_4326,lon_h_4326,lat_h_3857,lon_h_3857
0,29417,45019,SC,2,0,H,32.853065,-79.677524,-8.869661e+06,3.875817e+06
1,29448,45019,SC,6,1,A,32.817618,-79.557081,-8.856254e+06,3.871121e+06
2,29477,45019,SC,0,1,L,32.904196,-79.640338,-8.865522e+06,3.882594e+06
3,29413,45019,SC,10,0,A,32.653958,-79.778913,-8.880948e+06,3.849462e+06
4,29448,45019,SC,6,0,I,32.838292,-80.168261,-8.924290e+06,3.873860e+06
...,...,...,...,...,...,...,...,...,...,...
99995,29451,45019,SC,4,1,G,32.859898,-79.983927,-8.903770e+06,3.876722e+06
99996,29415,45019,SC,0,1,I,32.976982,-80.108903,-8.917682e+06,3.892249e+06
99997,29472,45019,SC,9,1,C,33.027838,-79.829499,-8.886579e+06,3.899000e+06
99998,29431,45019,SC,2,0,M,32.805037,-79.584401,-8.859295e+06,3.869454e+06
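
For reference, EPSG 3857 is the spherical Web Mercator projection, so the transformed values can be checked by hand. The sketch below uses only the standard library and the textbook spherical Mercator formulas (it is not a description of how pyproj works internally); the function name `to_web_mercator` is hypothetical. It takes the latitude and longitude of row 0 in the table above as input:

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, used as the sphere radius in EPSG 3857

def to_web_mercator(lat_deg, lon_deg):
    """Convert EPSG 4326 latitude/longitude in degrees to EPSG 3857 (x, y) in metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

# Row 0 of the Charleston table: latitude 32.853065, longitude -79.677524
x, y = to_web_mercator(32.853065, -79.677524)
print(f"x (easting)  = {x:.6e}")
print(f"y (northing) = {y:.6e}")
```

The easting `x` is derived from longitude and the northing `y` from latitude. With `Transformer.from_crs('epsg:4326', 'epsg:3857')` and no `always_xy=True`, pyproj takes input in the CRS axis order (latitude, longitude) and returns `(x, y)`, so in the table above the column named `lat_h_3857` holds the easting (about -8.87e6 for row 0) and `lon_h_3857` holds the northing (about 3.88e6). This is worth keeping in mind when plotting these columns.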


**Exporting final dataframe**. Finally, we export this dataframe to ```/data/source_files/infousa_files``` for use in our visualizations.

In [66]:
%%time
df_synthetic_charleston.to_parquet(DATA_DIR + '/source_files/infousa_files/charleston_households.parquet')

CPU times: user 73 ms, sys: 17.7 ms, total: 90.7 ms
Wall time: 112 ms


#### Harris County
We follow the same steps as above, but the state column is always 'TX', as Harris County is in Texas, the state fips is set to 48, and the county fips is set to 201, the correct values for Harris County. We also restrict the latitude and longitude to values roughly within Harris County.

In [55]:
%%time

age_codes = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M']

random.seed(1)

ZIP = []
census_county_2010 = []
census_state_2010 = []
ChildrenHHCount = []
length_of_residence = []
children_ind = []
GE_LONGITUDE_2010 = []
GE_LATITUDE_2010 = []
STATE = []
head_hh_age_code = []
    
for i in range (0, 500000):
    ZIP.append(random.randint(77000, 77300))
    census_county_2010.append('201')
    census_state_2010.append('48')                                          
    STATE.append('TX')
    ChildrenHHCount.append(random.randint(0, 15))
    length_of_residence.append(random.randint(0, 70))
    children_ind.append(random.randint(0,1))
    head_hh_age_code.append(age_codes[random.randint(0, 12)])
    GE_LONGITUDE_2010.append(random.uniform(-95.820, -94.960))
    GE_LATITUDE_2010.append(random.uniform(29.530, 30.120))
    
# print(len(ZIP))
# print(len(census_county_2010))
# print(len(ChildrenHHCount))
# print(len(length_of_residence))
# print(len(children_ind))
# print(len(GE_LONGITUDE_2010))
# print(len(GE_LATITUDE_2010))
# print(len(STATE))
# print(len(head_hh_age_code))
   
d = {'ZIP': ZIP, 'census_county_2010': census_county_2010, 'census_state_2010': census_state_2010, 'STATE': STATE, 
     'ChildrenHHCount': ChildrenHHCount, 'length_of_residence': length_of_residence, 
     'children_ind': children_ind, 'head_hh_age_code': head_hh_age_code, 'GE_LATITUDE_2010': GE_LATITUDE_2010, 
     'GE_LONGITUDE_2010': GE_LONGITUDE_2010}
df_synthetic_harris = pd.DataFrame(d)
df_synthetic_harris

CPU times: user 2.66 s, sys: 54.5 ms, total: 2.71 s
Wall time: 2.72 s


Unnamed: 0,ZIP,census_county_2010,census_state_2010,STATE,ChildrenHHCount,length_of_residence,children_ind,head_hh_age_code,GE_LATITUDE_2010,GE_LONGITUDE_2010
0,77068,201,48,TX,2,32,0,H,29.808625,-95.165572
1,77194,201,48,TX,6,12,1,A,29.759987,-95.051747
2,77001,201,48,TX,14,34,0,J,30.061842,-95.007067
3,77015,201,48,TX,0,3,0,G,30.101734,-95.229624
4,77014,201,48,TX,7,56,1,I,29.666211,-95.619547
...,...,...,...,...,...,...,...,...,...,...
499995,77191,201,48,TX,5,31,1,E,29.905114,-95.675370
499996,77050,201,48,TX,10,60,1,M,30.028496,-95.755345
499997,77242,201,48,TX,8,46,1,A,29.913911,-95.466206
499998,77238,201,48,TX,10,23,1,F,30.069060,-95.759268
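
The Charleston and Harris cells differ only in their constants (state, fips codes, ZIP range, bounding box, and row count). A hedged sketch of how they could be folded into one function; `make_county_households` and its parameters are hypothetical names, not part of this repository, and the order in which random numbers are drawn differs from the cells above, so its output would not match the tables shown here row for row:

```python
import random

AGE_CODES = list('ABCDEFGHIJKLM')

def make_county_households(n, state, state_fips, county_fips,
                           zip_range, lon_range, lat_range, seed=1):
    """Return a dict of column lists for n synthetic households in one county."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return {
        'ZIP': [rng.randint(*zip_range) for _ in range(n)],
        'census_county_2010': [county_fips] * n,
        'census_state_2010': [state_fips] * n,
        'STATE': [state] * n,
        'ChildrenHHCount': [rng.randint(0, 15) for _ in range(n)],
        'length_of_residence': [rng.randint(0, 70) for _ in range(n)],
        'children_ind': [rng.randint(0, 1) for _ in range(n)],
        'head_hh_age_code': [rng.choice(AGE_CODES) for _ in range(n)],
        'GE_LATITUDE_2010': [rng.uniform(*lat_range) for _ in range(n)],
        'GE_LONGITUDE_2010': [rng.uniform(*lon_range) for _ in range(n)],
    }

# Constants taken from the Harris County cell above, with a small n for illustration
harris = make_county_households(1000, 'TX', '48', '201',
                                zip_range=(77000, 77300),
                                lon_range=(-95.820, -94.960),
                                lat_range=(29.530, 30.120))
print(len(harris['ZIP']), harris['STATE'][0],
      harris['census_state_2010'][0] + harris['census_county_2010'][0])
```

Calling the same function with the Charleston constants would produce the second dataset, and either dictionary can be passed to `pd.DataFrame` as before.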


##### Processing
Since this data will be used directly in visualization notebook **05_harris_dist**, we must then process it the same way we would process the synthetic InfoUSA data from above in processing notebook **01_merging_files**.

In [56]:
%%time
df_synthetic_harris['county_fips'] = df_synthetic_harris['census_state_2010'] + df_synthetic_harris['census_county_2010']

df_synthetic_harris = df_synthetic_harris[['ZIP', 'county_fips', 'STATE', 'ChildrenHHCount', 'children_ind', 'head_hh_age_code', 
                                           'GE_LATITUDE_2010', 'GE_LONGITUDE_2010']]
df_synthetic_harris

CPU times: user 95.3 ms, sys: 13.8 ms, total: 109 ms
Wall time: 107 ms


Unnamed: 0,ZIP,county_fips,STATE,ChildrenHHCount,children_ind,head_hh_age_code,GE_LATITUDE_2010,GE_LONGITUDE_2010
0,77068,48201,TX,2,0,H,29.808625,-95.165572
1,77194,48201,TX,6,1,A,29.759987,-95.051747
2,77001,48201,TX,14,0,J,30.061842,-95.007067
3,77015,48201,TX,0,0,G,30.101734,-95.229624
4,77014,48201,TX,7,1,I,29.666211,-95.619547
...,...,...,...,...,...,...,...,...
499995,77191,48201,TX,5,1,E,29.905114,-95.675370
499996,77050,48201,TX,10,1,M,30.028496,-95.755345
499997,77242,48201,TX,8,1,A,29.913911,-95.466206
499998,77238,48201,TX,10,1,F,30.069060,-95.759268


**Renaming columns**: 
We rename the columns in our dataset for standardization purposes.

In [57]:
df_synthetic_harris.rename(columns = {'ZIP': 'zip', 'STATE': 'state', 'ChildrenHHCount': 'child_num', 
                           'children_ind': 'has_child', 'head_hh_age_code': 'age_code', 'GE_LATITUDE_2010': 'lat_h_4326', 
                            'GE_LONGITUDE_2010': 'lon_h_4326'}, inplace = True)
df_synthetic_harris

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_synthetic_harris.rename(columns = {'ZIP': 'zip', 'STATE': 'state', 'ChildrenHHCount': 'child_num',


Unnamed: 0,zip,county_fips,state,child_num,has_child,age_code,lat_h_4326,lon_h_4326
0,77068,48201,TX,2,0,H,29.808625,-95.165572
1,77194,48201,TX,6,1,A,29.759987,-95.051747
2,77001,48201,TX,14,0,J,30.061842,-95.007067
3,77015,48201,TX,0,0,G,30.101734,-95.229624
4,77014,48201,TX,7,1,I,29.666211,-95.619547
...,...,...,...,...,...,...,...,...
499995,77191,48201,TX,5,1,E,29.905114,-95.675370
499996,77050,48201,TX,10,1,M,30.028496,-95.755345
499997,77242,48201,TX,8,1,A,29.913911,-95.466206
499998,77238,48201,TX,10,1,F,30.069060,-95.759268


**Transforming household latitude and longitude coordinates from EPSG 4326 to EPSG 3857**.
Many of our visualizations need coordinates in EPSG 3857, but these coordinates are in EPSG 4326. Therefore, we use pyproj, a Python interface to the PROJ coordinate transformation software, to transform our EPSG 4326 coordinates to EPSG 3857. This adds two new columns with the transformed coordinates to our dataset.

In [58]:
from pyproj import Proj, Transformer

In [59]:
# Apply transformation
transform_4326_to_3857 = Transformer.from_crs('epsg:4326', 'epsg:3857')
df_synthetic_harris['lat_h_3857'], df_synthetic_harris['lon_h_3857'] = transform_4326_to_3857.transform(
                                                df_synthetic_harris['lat_h_4326'], df_synthetic_harris['lon_h_4326'])

df_synthetic_harris

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_synthetic_harris['lat_h_3857'], df_synthetic_harris['lon_h_3857'] = transform_4326_to_3857.transform(


Unnamed: 0,zip,county_fips,state,child_num,has_child,age_code,lat_h_4326,lon_h_4326,lat_h_3857,lon_h_3857
0,77068,48201,TX,2,0,H,29.808625,-95.165572,-1.059378e+07,3.478974e+06
1,77194,48201,TX,6,1,A,29.759987,-95.051747,-1.058111e+07,3.472736e+06
2,77001,48201,TX,14,0,J,30.061842,-95.007067,-1.057614e+07,3.511502e+06
3,77015,48201,TX,0,0,G,30.101734,-95.229624,-1.060091e+07,3.516634e+06
4,77014,48201,TX,7,1,I,29.666211,-95.619547,-1.064432e+07,3.460716e+06
...,...,...,...,...,...,...,...,...,...,...
499995,77191,48201,TX,5,1,E,29.905114,-95.675370,-1.065053e+07,3.491359e+06
499996,77050,48201,TX,10,1,M,30.028496,-95.755345,-1.065944e+07,3.507213e+06
499997,77242,48201,TX,8,1,A,29.913911,-95.466206,-1.062725e+07,3.492489e+06
499998,77238,48201,TX,10,1,F,30.069060,-95.759268,-1.065987e+07,3.512430e+06


**Exporting final dataframe**. Finally, we export this dataframe to ```/data/source_files/infousa_files``` for use in our visualizations.

In [60]:
%%time
df_synthetic_harris.to_parquet(DATA_DIR + '/source_files/infousa_files/harris_households.parquet')

CPU times: user 306 ms, sys: 31.8 ms, total: 338 ms
Wall time: 422 ms
