# Census 2020

Retrieves data from the Census Bureau's 2020 Census API for ZCTAs (not available in the public redistricting data, plrd), county subdivisions, and tracts. A specific list of census variables is passed into the script; these are retrieved from the public redistricting tables or the profile table (the latter coming at the end of 2023). Variables must be retrieved in chunks because only 50 can be passed to the API at a time, and each URL varies by geography and returns a different number of identifier columns. After some processing, the output is written to a SQLite database. An option to create a metadata table appears at the bottom, but it should be run only once for a given extract (socialecon and pophousing), not for each individual geography.

https://www.census.gov/data/developers/data-sets/decennial-census.2020.html

## Variables

In [1]:
import os, requests, json, sqlite3, random, pandas as pd, numpy as np
from IPython.display import clear_output

In [2]:
keyfile='census_key.txt'

#API variables - UPDATE THE YEAR AND GEO
year='2020'
geo='tract' # 'zip code tabulation area' or 'county subdivision' or 'tract'
state='44'
dsource='dec'
dname='pl' # public law redistricting data

#Variables to read in from spreadsheet - UPDATE WORKSHEET
worksheet='plrd' # 'plrd'
geoexcelsheet={'zip code tabulation area':'zctas', 'county subdivision':'county_subdivs', 'tract':'tracts'}
geotype=geoexcelsheet.get(geo)

#SQL output
tabname='{}_census{}_{}'.format(geotype,year,worksheet)
dbname=os.path.join('outputs','testdb.sqlite')

#Dump files for api data storage
jsonpath=os.path.join('outputs', tabname+'_retrieved_data.json')

## Variable Lists
Get the full list of variables from the API, read in our retrieval list, and compare the variable IDs and names to make sure nothing is missing and that nothing has changed since the last iteration. *Don't move on to the next block until both lists match.* Lastly, read in the list of geographies.

In [3]:
datadict={}
vars_url = f'https://api.census.gov/data/{year}/{dsource}/{dname}/variables.json'
response=requests.get(vars_url)
var_data=response.json()
datadict.update(var_data['variables'])
random.sample(list(datadict.items()), 2)

[('P1_003N',
  {'label': ' !!Total:!!Population of one race:!!White alone',
   'concept': 'RACE',
   'predicateType': 'int',
   'group': 'P1',
   'limit': 0,
   'attributes': 'P1_003NA'}),
 ('SUMLEVEL',
  {'label': 'Summary Level code',
   'predicateType': 'string',
   'group': 'N/A',
   'limit': 0})]

In [4]:
dfexcel = pd.read_excel(os.path.join('inputs','dec2020_variables.xlsx'),sheet_name=worksheet)
dfexcel.head()

Unnamed: 0,census_var,census_label,dtype
0,H1_001N,OCCUPANCY STATUS!!Total:,int
1,H1_002N,OCCUPANCY STATUS!!Total:!!Occupied,int
2,H1_003N,OCCUPANCY STATUS!!Total:!!Vacant,int
3,P1_001N,RACE!!Total:,int
4,P1_003N,RACE!!Total:!!Population of one race:!!White a...,int


In [5]:
dfvars = pd.DataFrame.from_dict(datadict,columns=['label'],orient='index')
dfvars_selected=dfvars.loc[dfvars.index.isin(dfexcel['census_var'])]
dfvars_count=len(dfvars_selected)
dfexcel_count=len(dfexcel['census_var'])

if dfvars_count==dfexcel_count:
    print('There are an equal number of variables in both lists:', dfvars_count)
else:
    print('There is a mismatch in the number of variables; the api has,', dfvars_count, 
          'while the original list has',dfexcel_count,'. Missing:')
    nomatch=dfexcel[~dfexcel['census_var'].isin(dfvars_selected.index)]
    print(nomatch)

There are an equal number of variables in both lists: 49
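
The check above compares only the variable IDs. A minimal sketch of also comparing the labels follows. It assumes, judging from the sample outputs shown earlier, that a spreadsheet label equals the API `concept` followed by the API `label` with its leading space removed; that relationship is an inference, not something the API documents here. The function name `find_label_mismatches` and the sample data are hypothetical.

```python
def find_label_mismatches(api_vars, sheet_labels):
    """Return {var_id: (sheet_label, api_label)} for every label that differs.

    api_vars     -- dict shaped like the 'variables' entry of variables.json
    sheet_labels -- dict of variable ID -> label as stored in the spreadsheet
    """
    mismatches = {}
    for var_id, sheet_label in sheet_labels.items():
        entry = api_vars.get(var_id)
        if entry is None:
            # Variable is missing from the API altogether
            mismatches[var_id] = (sheet_label, None)
            continue
        # Assumption: spreadsheet label = concept + label without leading space
        api_label = entry.get('concept', '') + entry['label'].strip()
        if api_label != sheet_label:
            mismatches[var_id] = (sheet_label, api_label)
    return mismatches


# Made-up sample data in the same shape as the notebook's inputs
api_vars = {
    'H1_001N': {'label': ' !!Total:', 'concept': 'OCCUPANCY STATUS'},
    'P1_001N': {'label': ' !!Total:', 'concept': 'RACE'},
}
sheet_labels = {
    'H1_001N': 'OCCUPANCY STATUS!!Total:',
    'P1_001N': 'RACE!!Total',      # colon dropped on purpose
    'P9_999N': 'NOT IN THE API',   # hypothetical ID
}
mismatches = find_label_mismatches(api_vars, sheet_labels)
print(sorted(mismatches))
```

An empty result would mean both the IDs and the labels still agree with the previous iteration.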


In [6]:
# Geographic identifiers: zctas to retrieve, pumas to filter by, and counties containing tracts to retrieve
excelgeo = pd.read_excel(os.path.join('inputs','dec2020_variables.xlsx'),sheet_name=geotype, dtype=object)
geoids = excelgeo['GEO'].tolist()
print('Number of geographic identifiers:',len(geoids))

Number of geographic identifiers: 5


## Retrieve Data
Because the API accepts at most 50 variables per call, variables must be passed to the URL in separate blocks or chunks. The first chunk that's captured is written to an empty datalist: the header row, then one row for each geography. Each subsequent chunk is iterated through by row, so that its values are appended to the matching row in datalist. In all cases the trailing values, which are geographic identifiers automatically returned with each API call, are not appended.

In [7]:
def chunks(l, n):
    # For item i in a range that is a length of l,
    for i in range(0, len(l), n):
        # Create an index range for l of n items:
        yield l[i:i+n]

In [8]:
reqvars=list(chunks(dfvars_selected.index.tolist(),46))
reqvars[0].insert(0,'NAME')
reqvars[0].insert(0,'GEO_ID')
print('Number of chunks:',len(reqvars))

Number of chunks: 2
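
To make the chunk-and-merge logic concrete, here is a small self-contained sketch that runs without calling the API. The variable IDs and the two responses are made up, and `merge_responses` is a hypothetical helper. Like the retrieval function in this notebook, it assumes every response lists the geographies in the same order.

```python
def chunks(seq, n):
    """Yield successive slices of seq that hold at most n items."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]


def merge_responses(responses, n_identifiers):
    """Stitch per-chunk API responses into one table, row by row.

    Each response is a list of rows (header first). The last n_identifiers
    values of every row are geography codes the API adds automatically,
    so they are dropped before merging.
    """
    merged = []
    for chunk_number, rows in enumerate(responses):
        for row_number, row in enumerate(rows):
            values = row[:-n_identifiers]
            if chunk_number == 0:
                merged.append(list(values))        # first chunk creates the rows
            else:
                merged[row_number].extend(values)  # later chunks widen them
    return merged


# 49 placeholder IDs split into batches of 46, as in the block above
variables = ['V{:02d}'.format(i) for i in range(1, 50)]
batch_sizes = [len(b) for b in chunks(variables, 46)]
print(batch_sizes)

# Two made-up responses for one tract; the last 3 values are identifiers
first = [['GEO_ID', 'A', 'state', 'county', 'tract'],
         ['1400000US1', '10', '44', '001', '1']]
second = [['B', 'state', 'county', 'tract'],
          ['20', '44', '001', '1']]
table = merge_responses([first, second], n_identifiers=3)
print(table)
```

A batch size of 46 leaves room for `GEO_ID` and `NAME` in the first batch (48 columns in total), which stays under the limit of 50.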


In [9]:
with open(keyfile) as key:
    api_key=key.read().strip()

base_url = f'https://api.census.gov/data/{year}/{dsource}/{dname}'
base_url

'https://api.census.gov/data/2020/dec/pl'

In [10]:
#Function for retrieving data; running this block loads it into memory
#Different geographies have different urls, 
#and a different number of identifiers tacked on to the end of each request

def getdata():
    dlist=[]
    for i, v in enumerate(reqvars):
        batchcols=','.join(v)
        if geotype=='zctas':
            data_url = f'{base_url}?get={batchcols}&for={geo}:{g}&key={api_key}'
            dropvar=-1
        elif geotype=='county_subdivs':
            data_url = f'{base_url}?get={batchcols}&for={geo}:*&in=state:{state}&in=county:{county}&key={api_key}'
            dropvar=-3
        elif geotype=='tracts':
            data_url = f'{base_url}?get={batchcols}&for={geo}:*&in=state:{state}&in=county:{county}&key={api_key}'
            dropvar=-3
        else:
            print('Appropriate geography not specified in variables block')
            break  
        response=requests.get(data_url)
        if response.status_code==200:
            clear_output(wait=True)
            data=response.json()
            for i2, v2 in enumerate(data):
                if i == 0:
                    dlist.append(v2[:dropvar])
                else:
                    for item in v2[:dropvar]:
                        dlist[i2].append(item)
        else:
            print('***Problem with retrieval***, response code',response.status_code)
    return dlist
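
The geography branching in getdata() can be isolated into a function that takes its inputs as arguments rather than reading `g`, `county`, and the other notebook-level variables. The following is a sketch, not a replacement for the block above: `build_request` is a hypothetical name, and the URL patterns and identifier counts are copied from getdata().

```python
def build_request(base_url, variables, geo, api_key,
                  state=None, county=None, zcta=None):
    """Return (url, n_identifiers) for one batch of variables.

    n_identifiers is the number of trailing geography codes the API
    appends to each row, which the caller should drop.
    """
    batchcols = ','.join(variables)
    if geo == 'zip code tabulation area':
        # One ZCTA per request; one trailing identifier
        url = f'{base_url}?get={batchcols}&for={geo}:{zcta}&key={api_key}'
        return url, 1
    if geo in ('county subdivision', 'tract'):
        # All units within one county; state, county, and unit codes trail
        url = (f'{base_url}?get={batchcols}&for={geo}:*'
               f'&in=state:{state}&in=county:{county}&key={api_key}')
        return url, 3
    raise ValueError(f'Unsupported geography: {geo!r}')


# 'KEY' is a placeholder, not a real API key
url, n_identifiers = build_request(
    'https://api.census.gov/data/2020/dec/pl',
    ['GEO_ID', 'NAME', 'P1_001N'],
    'tract', 'KEY', state='44', county='009')
print(url)
print(n_identifiers)
```

Raising an exception for an unknown geography stops the run immediately, whereas printing a message and breaking out of the loop returns an empty list to the caller.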

##### ***THIS BLOCK IS A REQUESTS BLOCK!***
*NOTE: ZCTA retrieval takes a long time - 5 mins for 80 ZCTAs*

In [11]:
#If this block was run successfully for a given table and geography don't rerun - next block pulls from saved json
datalist=[]
if geotype=='zctas':
    for g in geoids:
        georecord=getdata()
        print('Retrieved data for',g)
        if len(datalist)==0:
            datalist.append(georecord[0])
            datalist.append(georecord[1])
        else:
            datalist.append(georecord[1])
else:
    for county in geoids:
        georecord=getdata()
        print('Retrieved data for',county)
        if len(datalist)==0:
            for geog in georecord:
                datalist.append(geog)
        else:
            for geog in georecord[1:]:
                datalist.append(geog)
    
dlrows=len(datalist)
dlitems=sum(len(x) for x in datalist)
dlbyrow=dlitems / dlrows
print('Retrieved', dlrows, 'records and', dlitems,'data points with', dlbyrow, 'points for each record...')
        
with open(jsonpath, 'w') as f:
    json.dump(datalist, f)
print('Done - Data dumped to json file')

Retrieved data for 009
Retrieved 251 records and 12801 data points with 51.0 points for each record...
Done - Data dumped to json file
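
The summary line is a quick sanity check: 251 records (one header plus 250 tracts) with 51 values each gives 12,801 data points. An average can hide a short row, though, so a stricter check is to confirm that every row has the same length. A minimal sketch, with `summarize_rows` as a hypothetical helper and made-up sample rows:

```python
def summarize_rows(datalist):
    """Return (row_count, row_length), raising if the rows are ragged."""
    lengths = {len(row) for row in datalist}
    if len(lengths) != 1:
        raise ValueError(f'Rows have differing lengths: {sorted(lengths)}')
    return len(datalist), lengths.pop()


# Header plus two records, three values each
sample = [['GEO_ID', 'NAME', 'P1_001N'],
          ['1400000US1', 'Tract 1', '100'],
          ['1400000US2', 'Tract 2', '200']]
summary = summarize_rows(sample)
print(summary)
```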


## Process Data
Load the saved JSON into a dataframe, rename the identifier columns to GEOIDLONG and GEOLABEL, and create a new GEOIDSHORT column from the long identifier.

In [12]:
with open(jsonpath, 'r') as f:
    jsondata=json.load(f)
alldata = pd.DataFrame(jsondata[1:],columns=jsondata[0],dtype=object).rename(columns={
    'GEO_ID':'GEOIDLONG','NAME':'GEOLABEL'}).set_index('GEOIDLONG')
alldata.info()
# Index and column entries should be 1 row and 1 column less than previous count (excludes header row and index column) 

<class 'pandas.core.frame.DataFrame'>
Index: 250 entries, 1400000US44001030100 to 1400000US44009990200
Data columns (total 50 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   GEOLABEL  250 non-null    object
 1   H1_001N   250 non-null    object
 2   H1_002N   250 non-null    object
 3   H1_003N   250 non-null    object
 4   P1_001N   250 non-null    object
 5   P1_003N   250 non-null    object
 6   P1_004N   250 non-null    object
 7   P1_005N   250 non-null    object
 8   P1_006N   250 non-null    object
 9   P1_007N   250 non-null    object
 10  P1_008N   250 non-null    object
 11  P1_009N   250 non-null    object
 12  P2_001N   250 non-null    object
 13  P2_002N   250 non-null    object
 14  P2_003N   250 non-null    object
 15  P2_005N   250 non-null    object
 16  P2_006N   250 non-null    object
 17  P2_007N   250 non-null    object
 18  P2_008N   250 non-null    object
 19  P2_009N   250 non-null    object
 20  P2_010N   250 non-null 

In [13]:
alldata.head(3)

GEOIDLONG,GEOLABEL,H1_001N,H1_002N,H1_003N,P1_001N,P1_003N,P1_004N,P1_005N,P1_006N,P1_007N,...,P5_001N,P5_002N,P5_003N,P5_004N,P5_005N,P5_006N,P5_007N,P5_008N,P5_009N,P5_010N
1400000US44001030100,"Census Tract 301, Bristol County, Rhode Island",1859,1775,84,4801,4217,66,0,188,0,...,110,103,0,0,103,0,7,0,0,7
1400000US44001030200,"Census Tract 302, Bristol County, Rhode Island",1322,1263,59,3580,2922,50,5,287,0,...,100,0,0,0,0,0,100,94,0,6
1400000US44001030300,"Census Tract 303, Bristol County, Rhode Island",1764,1684,80,4775,4137,38,0,268,0,...,12,0,0,0,0,0,12,0,0,12


In [14]:
idxgeoid2={'zctas':-5, 'county_subdivs':-10,'tracts':-11}
alldata.insert(loc=0, column='GEOIDSHORT',value=alldata.index.str[idxgeoid2.get(geotype):])
alldata.head(3)

GEOIDLONG,GEOIDSHORT,GEOLABEL,H1_001N,H1_002N,H1_003N,P1_001N,P1_003N,P1_004N,P1_005N,P1_006N,...,P5_001N,P5_002N,P5_003N,P5_004N,P5_005N,P5_006N,P5_007N,P5_008N,P5_009N,P5_010N
1400000US44001030100,44001030100,"Census Tract 301, Bristol County, Rhode Island",1859,1775,84,4801,4217,66,0,188,...,110,103,0,0,103,0,7,0,0,7
1400000US44001030200,44001030200,"Census Tract 302, Bristol County, Rhode Island",1322,1263,59,3580,2922,50,5,287,...,100,0,0,0,0,0,100,94,0,6
1400000US44001030300,44001030300,"Census Tract 303, Bristol County, Rhode Island",1764,1684,80,4775,4137,38,0,268,...,12,0,0,0,0,0,12,0,0,12
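
The short identifier is simply the tail of the long one, with the slice length depending on the geography. A plain-Python sketch of the same slicing, using the lengths from the `idxgeoid2` dictionary above and a tract ID from the output (`short_geoid` is a hypothetical helper):

```python
# Number of trailing characters that make up the short ID, per geography
SHORT_ID_LENGTH = {'zctas': 5, 'county_subdivs': 10, 'tracts': 11}


def short_geoid(geoidlong, geotype):
    """Return the trailing portion of a long GEO_ID for the given geography."""
    return geoidlong[-SHORT_ID_LENGTH[geotype]:]


tract_id = short_geoid('1400000US44001030100', 'tracts')
print(tract_id)  # state (44) + county (001) + tract (030100)
```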


In [15]:
# PUMAs are retrieved for the whole state, so filter them down to the local areas listed in geoids
if geotype == 'pumas':
    censusdata=alldata.loc[alldata.GEOIDSHORT.isin(geoids)].copy().astype(object).sort_index()
else:
    censusdata=alldata.copy().astype(object).sort_index()
censusdata.shape

(250, 51)

In [16]:
censusdata.head(3)

GEOIDLONG,GEOIDSHORT,GEOLABEL,H1_001N,H1_002N,H1_003N,P1_001N,P1_003N,P1_004N,P1_005N,P1_006N,...,P5_001N,P5_002N,P5_003N,P5_004N,P5_005N,P5_006N,P5_007N,P5_008N,P5_009N,P5_010N
1400000US44001030100,44001030100,"Census Tract 301, Bristol County, Rhode Island",1859,1775,84,4801,4217,66,0,188,...,110,103,0,0,103,0,7,0,0,7
1400000US44001030200,44001030200,"Census Tract 302, Bristol County, Rhode Island",1322,1263,59,3580,2922,50,5,287,...,100,0,0,0,0,0,100,94,0,6
1400000US44001030300,44001030300,"Census Tract 303, Bristol County, Rhode Island",1764,1684,80,4775,4137,38,0,268,...,12,0,0,0,0,0,12,0,0,12


## Write to Database
Update list of variables and data types, build create table string, create datatable in temporary database.


In [17]:
dfexcel.replace({'dtype': {'int': 'INTEGER', 'float': 'REAL'}},inplace=True)
dfexcel.census_label.replace({'!!': ' - '},inplace=True, regex=True)
dfexcel.head()

Unnamed: 0,census_var,census_label,dtype
0,H1_001N,OCCUPANCY STATUS - Total:,INTEGER
1,H1_002N,OCCUPANCY STATUS - Total: - Occupied,INTEGER
2,H1_003N,OCCUPANCY STATUS - Total: - Vacant,INTEGER
3,P1_001N,RACE - Total:,INTEGER
4,P1_003N,RACE - Total: - Population of one race: - Whit...,INTEGER


In [18]:
vardict=dfexcel.set_index('census_var').T.to_dict('list')
random.sample(list(vardict.items()), 2)

[('P3_007N',
  ['RACE FOR THE POPULATION 18 YEARS AND OVER - Total: - Population of one race: - Native Hawaiian and Other Pacific Islander alone',
   'INTEGER']),
 ('P2_011N',
  ['HISPANIC OR LATINO, AND NOT HISPANIC OR LATINO BY RACE - Total: - Not Hispanic or Latino: - Population of two or more races:',
   'INTEGER'])]

In [19]:
con = sqlite3.connect(dbname) 
cur = con.cursor()

In [20]:
cur.execute('DROP TABLE IF EXISTS {}'.format(tabname))
dbstring="""
CREATE TABLE {} (
GEOIDLONG TEXT NOT NULL PRIMARY KEY,
GEOIDSHORT TEXT,
GEOLABEL TEXT,
""".format(tabname)

for k,v in vardict.items():
    dbstring=dbstring+k+' '+v[1]+', \n'
    
dbstring=dbstring[:-3]
dbstring=dbstring+');'
print(dbstring)


CREATE TABLE tracts_census2020_plrd (
GEOIDLONG TEXT NOT NULL PRIMARY KEY,
GEOIDSHORT TEXT,
GEOLABEL TEXT,
H1_001N INTEGER, 
H1_002N INTEGER, 
H1_003N INTEGER, 
P1_001N INTEGER, 
P1_003N INTEGER, 
P1_004N INTEGER, 
P1_005N INTEGER, 
P1_006N INTEGER, 
P1_007N INTEGER, 
P1_008N INTEGER, 
P1_009N INTEGER, 
P2_001N INTEGER, 
P2_002N INTEGER, 
P2_003N INTEGER, 
P2_005N INTEGER, 
P2_006N INTEGER, 
P2_007N INTEGER, 
P2_008N INTEGER, 
P2_009N INTEGER, 
P2_010N INTEGER, 
P2_011N INTEGER, 
P3_001N INTEGER, 
P3_003N INTEGER, 
P3_004N INTEGER, 
P3_005N INTEGER, 
P3_006N INTEGER, 
P3_007N INTEGER, 
P3_008N INTEGER, 
P3_009N INTEGER, 
P4_001N INTEGER, 
P4_002N INTEGER, 
P4_003N INTEGER, 
P4_005N INTEGER, 
P4_006N INTEGER, 
P4_007N INTEGER, 
P4_008N INTEGER, 
P4_009N INTEGER, 
P4_010N INTEGER, 
P4_011N INTEGER, 
P5_001N INTEGER, 
P5_002N INTEGER, 
P5_003N INTEGER, 
P5_004N INTEGER, 
P5_005N INTEGER, 
P5_006N INTEGER, 
P5_007N INTEGER, 
P5_008N INTEGER, 
P5_009N INTEGER, 
P5_010N INTEGER);
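
An alternative way to assemble the statement is to collect the column definitions in a list and join them, which avoids trimming the trailing comma afterwards. The sketch below uses a two-variable stand-in for `vardict` and an in-memory database so it can be run without touching the output file; `build_create_table` is a hypothetical helper.

```python
import sqlite3


def build_create_table(tabname, vardict):
    """Build a CREATE TABLE statement from {var_id: [label, sql_type]}."""
    fixed = ['GEOIDLONG TEXT NOT NULL PRIMARY KEY',
             'GEOIDSHORT TEXT',
             'GEOLABEL TEXT']
    variable_cols = [f'{var} {meta[1]}' for var, meta in vardict.items()]
    return 'CREATE TABLE {} (\n{}\n);'.format(
        tabname, ',\n'.join(fixed + variable_cols))


# Stand-in for the notebook's vardict
sample_vardict = {
    'H1_001N': ['OCCUPANCY STATUS - Total:', 'INTEGER'],
    'P1_001N': ['RACE - Total:', 'INTEGER'],
}
sql = build_create_table('tracts_census2020_plrd', sample_vardict)

# Run the statement against a throwaway in-memory database
con = sqlite3.connect(':memory:')
con.execute(sql)
# PRAGMA table_info returns one row per column; field 1 is the column name
columns = [row[1] for row in
           con.execute('PRAGMA table_info(tracts_census2020_plrd)')]
con.close()
print(columns)
```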


In [21]:
cur.execute(dbstring)

<sqlite3.Cursor at 0x2d72d146490>

In [22]:
censusdata.to_sql(name=tabname, if_exists='append', index=True, con=con)

In [23]:
cur.execute('SELECT COUNT(*) FROM {};'.format(tabname))
rows = cur.fetchone()
print(rows[0], 'records written to', tabname)

250 records written to tracts_census2020_plrd


In [24]:
cur.execute('SELECT * FROM {} LIMIT 1;'.format(tabname))
col_names = [cn[0] for cn in cur.description]
print(len(col_names), 'columns written to', tabname)
#Number should be one more than the column count of the censusdata dataframe, since the index is not included in the dataframe's count

52 columns written to tracts_census2020_plrd


In [25]:
con.close()

## Metadata Table
DO NOT RERUN THIS SECTION FOR MULTIPLE GEOGRAPHIES. In the RI Geodatabase there is only one metadata table per census table series (one for plrd, one for profile), shared by all geographies. For whichever geography is processed first, set the action variable to 'create' and run this entire series of blocks. To add the variables from a second table, set the action variable to 'append' and skip the table creation and identifier insertion blocks.

In [None]:
#Change table name and specify an action: 'create' builds the table for the first time with the first set of variables,
#'append' adds a second set of variables to the existing table

metatab='census2020_plrd_lookup'
action='create' # 'create' or 'append'

In [None]:
con = sqlite3.connect(dbname) 
cur = con.cursor()

In [None]:
#Only run this block when creating initial table
if action=='create':
    mdstring="""
    CREATE TABLE {} (
    var_id TEXT,
    var_value TEXT);
    """.format(metatab)
    cur.execute(mdstring)
else:
    print('Block not executed because "create" not selected as an action in earlier block')

In [None]:
#Only run this block when creating initial table
if action=='create':
    exstring="""
        INSERT INTO {} VALUES('GEOIDLONG','Id');
        INSERT INTO {} VALUES('GEOIDSHORT','Id2');
        INSERT INTO {} VALUES('GEOLABEL','Geography');
        """.format(metatab,metatab,metatab)
    cur.executescript(exstring)
    con.commit()
else:
    print('Block not executed because "create" not selected as an action in earlier block')

In [None]:
#Run when creating table or when appending records
#Keys and values - db ids and labels
if action in ('create','append'):
    for mk, mv in vardict.items():
        cur.execute("INSERT INTO {} values(?,?)".format(metatab),(mk,mv[0]))
    con.commit()
else:
    print('Block not executed because action not specified in earlier block')

In [None]:
cur.execute('SELECT COUNT(*) FROM {};'.format(metatab))
rows = cur.fetchone()
print(rows[0], 'records in', metatab)

In [None]:
action=''
con.close()
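
The create and append paths above can be exercised end to end against an in-memory database before running them on the real one. This is a sketch under stated assumptions: `write_lookup` is a hypothetical helper that folds the separate blocks into one function, and the two small dictionaries stand in for `vardict` from two different tables.

```python
import sqlite3


def write_lookup(con, metatab, vardict, action):
    """Create or extend the lookup table and return its row count."""
    if action not in ('create', 'append'):
        raise ValueError("action must be 'create' or 'append'")
    cur = con.cursor()
    if action == 'create':
        cur.execute(f'CREATE TABLE {metatab} (var_id TEXT, var_value TEXT)')
        # Identifier rows are inserted once, when the table is first built
        cur.executemany(
            f'INSERT INTO {metatab} VALUES (?,?)',
            [('GEOIDLONG', 'Id'), ('GEOIDSHORT', 'Id2'),
             ('GEOLABEL', 'Geography')])
    # Variable IDs and labels are inserted for both actions
    cur.executemany(
        f'INSERT INTO {metatab} VALUES (?,?)',
        [(var_id, meta[0]) for var_id, meta in vardict.items()])
    con.commit()
    cur.execute(f'SELECT COUNT(*) FROM {metatab}')
    return cur.fetchone()[0]


con = sqlite3.connect(':memory:')
first_table = {'H1_001N': ['OCCUPANCY STATUS - Total:', 'INTEGER'],
               'P1_001N': ['RACE - Total:', 'INTEGER']}
second_table = {'X1_001N': ['Hypothetical second-table variable', 'INTEGER']}

after_create = write_lookup(con, 'census2020_plrd_lookup', first_table, 'create')
after_append = write_lookup(con, 'census2020_plrd_lookup', second_table, 'append')
con.close()
print(after_create, after_append)
```

Running the 'create' action a second time against the same database would fail because the table already exists, which is the behavior the warning at the top of this section is guarding against.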