# Census 2020

Retrieves data from the Census Bureau's 2020 Census API for ZCTAs (not available in the plrd table), county subdivisions, and tracts. A specific list of census variables is passed into the script and retrieved from either the public law redistricting table or the demographic profile table. Variables must be retrieved in chunks because only 50 can be passed to the API at a time; the request url varies by geography, and each geography returns a different number of trailing identifiers. After some processing, the output is written to a SQLite database. An option to create a metadata table appears at the bottom, but it should only be run once for a given extract (count and pct) and not for each individual geography.

https://www.census.gov/data/developers/data-sets/decennial-census.2020.html

## Variables

In [1]:
import os, requests, json, sqlite3, random, pandas as pd, numpy as np
from IPython.display import clear_output

In [2]:
keyfile='census_key.txt'

#API variables - UPDATE THE YEAR AND GEO
year='2020'
geo='zip code tabulation area' # 'zip code tabulation area' or 'county subdivision' or 'tract'
state='44'
dsource='dec'
dname='dp' # pl = public law redistricting data, dp = demographic profile

#Variables to read in from spreadsheet - UPDATE WORKSHEET
worksheet='pct' # plrd = redistricting file, count = demographic profile counts, pct = demographic profile % totals
geoexcelsheet={'zip code tabulation area':'zctas', 'county subdivision':'county_subdivs', 'tract':'tracts'}
geotype=geoexcelsheet.get(geo)

#SQL output
tabname='{}_census{}_{}'.format(geotype,year,worksheet)
dbname=os.path.join('outputs','testdb.sqlite')

#Dump files for api data storage
jsonpath=os.path.join('outputs', tabname+'_retrieved_data.json')

## Variable Lists
Get the full list of variables from the API, read in our retrieval list, and compare the variable IDs and names to make sure nothing is missing and that nothing has changed since the last iteration. *Don't move on to the next block until both lists match.* Lastly, read in the list of geographies.

In [3]:
datadict={}
vars_url = f'https://api.census.gov/data/{year}/{dsource}/{dname}/variables.json'
response=requests.get(vars_url)
var_data=response.json()
datadict.update(var_data['variables'])
random.sample(list(datadict.items()), 2)

[('DP1_0100P',
  {'label': 'Percent!!HISPANIC OR LATINO BY RACE!!Total population!!Hispanic or Latino!!Asian alone',
   'concept': 'PROFILE OF GENERAL POPULATION AND HOUSING CHARACTERISTICS',
   'predicateType': 'float',
   'group': 'DP1',
   'limit': 0,
   'attributes': 'DP1_0100PA'}),
 ('DP1_0131P',
  {'label': 'Percent!!RELATIONSHIP!!Total population!!In group quarters!!Noninstitutionalized population:!!Female',
   'concept': 'PROFILE OF GENERAL POPULATION AND HOUSING CHARACTERISTICS',
   'predicateType': 'float',
   'group': 'DP1',
   'limit': 0,
   'attributes': 'DP1_0131PA'})]

In [4]:
dfexcel = pd.read_excel(os.path.join('inputs','dec2020_variables.xlsx'),sheet_name=worksheet)
dfexcel.head()

Unnamed: 0,census_var,census_label,dtype
0,DP1_0001P,Percent!!SEX AND AGE!!Total population,float
1,DP1_0002P,Percent!!SEX AND AGE!!Total population!!Under ...,float
2,DP1_0003P,Percent!!SEX AND AGE!!Total population!!5 to 9...,float
3,DP1_0004P,Percent!!SEX AND AGE!!Total population!!10 to ...,float
4,DP1_0005P,Percent!!SEX AND AGE!!Total population!!15 to ...,float


In [5]:
dfvars = pd.DataFrame.from_dict(datadict,columns=['label'],orient='index')
dfvars_selected=dfvars.loc[dfvars.index.isin(dfexcel['census_var'])]
dfvars_count=len(dfvars_selected)
dfexcel_count=len(dfexcel['census_var'])

if dfvars_count==dfexcel_count:
    print('There are an equal number of variables in both lists:', dfvars_count)
else:
    print('There is a mismatch in the number of variables; the api has,', dfvars_count, 
          'while the original list has',dfexcel_count,'. Missing:')
    nomatch=dfexcel[~dfexcel['census_var'].isin(dfvars_selected.index)]
    print(nomatch)

There are an equal number of variables in both lists: 160
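
The block above compares how many variable IDs match, but it does not compare the labels. A minimal sketch of checking both follows, using plain dictionaries so it runs without the API. The function name `find_label_changes` and the sample IDs and labels are hypothetical and are not part of the original notebook.

```python
def find_label_changes(api_vars, retrieval_labels):
    """Return IDs missing from the API and IDs whose label differs.

    api_vars: dict shaped like the 'variables' entry of variables.json,
              i.e. {var_id: {'label': ...}}
    retrieval_labels: dict of {var_id: label} from the retrieval list
    """
    # IDs in our list that the API does not know about
    missing = sorted(v for v in retrieval_labels if v not in api_vars)
    # IDs present in both lists whose labels are not identical
    changed = {
        v: (label, api_vars[v]['label'])
        for v, label in retrieval_labels.items()
        if v in api_vars and api_vars[v]['label'] != label
    }
    return missing, changed

# Illustrative data only; these are not real census variables
api_vars = {
    'VAR_A': {'label': 'Percent!!Group!!Total'},
    'VAR_B': {'label': 'Percent!!Group!!Total!!Part one'},
}
retrieval_labels = {
    'VAR_A': 'Percent!!Group!!Total',
    'VAR_B': 'Percent!!Group!!Total!!Part 1',
    'VAR_C': 'Percent!!Group!!Not in the API',
}

missing, changed = find_label_changes(api_vars, retrieval_labels)
print('Missing from API:', missing)
print('Labels that differ:', list(changed))
```

In the notebook, `api_vars` would correspond to `datadict`, and the retrieval dictionary could be built from the `census_var` and `census_label` columns of `dfexcel`.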


In [6]:
# Geographic identifiers: zctas to retrieve, pumas to filter by, and counties containing tracts to retrieve
excelgeo = pd.read_excel(os.path.join('inputs','dec2020_variables.xlsx'),sheet_name=geotype, dtype=object)
geoids = excelgeo['GEO'].tolist()
print('Number of geographic identifiers:',len(geoids))

Number of geographic identifiers: 81


## Retrieve Data
Given the large number of variables and the limits of the API, variables must be passed to the url in separate blocks or chunks. The first chunk captured is written to an empty list as a header row followed by one row for each geography. Each subsequent chunk is iterated through by row, so its values are appended to the matching row in that list. In all cases the trailing values, which are geographic identifiers automatically returned with each API call, are not appended.

In [7]:
def chunks(l, n):
    # For item i in a range that is a length of l,
    for i in range(0, len(l), n):
        # Create an index range for l of n items:
        yield l[i:i+n]

In [8]:
reqvars=list(chunks(dfvars_selected.index.tolist(),46))
reqvars[0].insert(0,'NAME')
reqvars[0].insert(0,'GEO_ID')
print('Number of chunks:',len(reqvars))

Number of chunks: 4
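
The chunk size of 46 leaves room for the two identifiers that are inserted into the first chunk. The sketch below repeats the chunking on 160 placeholder IDs and confirms that no request would exceed the 50-variable limit mentioned in the introduction. The placeholder IDs and the `API_LIMIT` name are assumptions for illustration.

```python
API_LIMIT = 50  # maximum variables per request, per the introduction

def chunks(l, n):
    # Yield successive slices of l, each holding at most n items
    for i in range(0, len(l), n):
        yield l[i:i+n]

# 160 placeholder variable IDs standing in for the real list
var_ids = [f'VAR_{i:04d}' for i in range(1, 161)]

reqvars = list(chunks(var_ids, 46))
# The first chunk also carries the two identifier columns
reqvars[0].insert(0, 'NAME')
reqvars[0].insert(0, 'GEO_ID')

sizes = [len(c) for c in reqvars]
print('Chunk sizes:', sizes)
print('All within limit:', all(s <= API_LIMIT for s in sizes))
```

With 160 variables this gives four chunks, the first holding 48 items once `GEO_ID` and `NAME` are added.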


In [9]:
with open(keyfile) as key:
    api_key=key.read().strip()

base_url = f'https://api.census.gov/data/{year}/{dsource}/{dname}'
base_url

'https://api.census.gov/data/2020/dec/dp'
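
The request urls in the next block are assembled with f-strings, and the geography name for ZCTAs contains spaces. As a hedged alternative, the sketch below builds the same kind of query string with the standard library so the encoding is explicit. The helper `build_url`, the key `YOUR_KEY`, and the sample geography values are placeholders for illustration.

```python
from urllib.parse import urlencode, quote

def build_url(base_url, variables, geo_for, geo_in=(), api_key=''):
    """Assemble a data request url from its parts.

    geo_in is a sequence of 'in' clauses such as 'state:44'; each one
    becomes its own 'in' parameter, mirroring the notebook's urls.
    """
    params = [('get', ','.join(variables)), ('for', geo_for)]
    params += [('in', clause) for clause in geo_in]
    params.append(('key', api_key))
    # Keep ':', ',' and '*' readable; spaces are encoded as %20
    return base_url + '?' + urlencode(params, safe=':,*', quote_via=quote)

base_url = 'https://api.census.gov/data/2020/dec/dp'

zcta_url = build_url(base_url, ['GEO_ID', 'NAME'],
                     'zip code tabulation area:02802', api_key='YOUR_KEY')
tract_url = build_url(base_url, ['GEO_ID', 'NAME'], 'tract:*',
                      ['state:44', 'county:001'], 'YOUR_KEY')
print(zcta_url)
print(tract_url)
```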

In [10]:
#Function for retrieving data; running this block loads it into memory
#Different geographies have different urls, 
#and a different number of identifiers tacked on to the end of each request

def getdata():
    dlist=[]
    for i, v in enumerate(reqvars):
        batchcols=','.join(v)
        if geotype=='zctas':
            data_url = f'{base_url}?get={batchcols}&for={geo}:{g}&key={api_key}'
            dropvar=-1
        elif geotype=='county_subdivs':
            data_url = f'{base_url}?get={batchcols}&for={geo}:*&in=state:{state}&in=county:{county}&key={api_key}'
            dropvar=-3
        elif geotype=='tracts':
            data_url = f'{base_url}?get={batchcols}&for={geo}:*&in=state:{state}&in=county:{county}&key={api_key}'
            dropvar=-3
        else:
            print('Appropriate geography not specified in variables block')
            break  
        response=requests.get(data_url)
        if response.status_code==200:
            clear_output(wait=True)
            data=response.json()
            for i2, v2 in enumerate(data):
                if i == 0:
                    dlist.append(v2[:dropvar])
                else:
                    for item in v2[:dropvar]:
                        dlist[i2].append(item)
        else:
            print('***Problem with retrieval***, response code',response.status_code)
    return dlist
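
The row-by-row merge inside `getdata` can be tried without calling the API. The sketch below applies the same logic to two made-up responses shaped like the API's output, a list of rows with the header first. The function name `merge_chunks` and the sample values are hypothetical. It assumes, as the notebook does, that every chunk returns its rows in the same order.

```python
def merge_chunks(responses, dropvar):
    """Combine chunked responses into one table of rows.

    responses: list of API-style results, each a list of rows with the
               header row first
    dropvar:   negative index; trailing identifier columns from this
               position onward are dropped from every row
    """
    merged = []
    for i, data in enumerate(responses):
        for row_num, row in enumerate(data):
            values = row[:dropvar]
            if i == 0:
                # First chunk creates the rows
                merged.append(list(values))
            else:
                # Later chunks extend the row in the same position
                merged[row_num].extend(values)
    return merged

# Made-up responses for a single ZCTA, split across two chunks
chunk1 = [['GEO_ID', 'NAME', 'VAR_A', 'zip code tabulation area'],
          ['860Z200US02802', 'ZCTA5 02802', '100.0', '02802']]
chunk2 = [['VAR_B', 'zip code tabulation area'],
          ['2.6', '02802']]

merged = merge_chunks([chunk1, chunk2], dropvar=-1)
for row in merged:
    print(row)
```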

##### ***THIS BLOCK IS A REQUESTS BLOCK!***
*NOTE: ZCTA retrieval takes a long time - 5 mins for 80 ZCTAs*

In [11]:
#If this block was run successfully for a given table and geography don't rerun - next block pulls from saved json
datalist=[]
if geotype=='zctas':
    for g in geoids:
        georecord=getdata()
        print('Retrieved data for',g)
        if len(datalist)==0:
            datalist.append(georecord[0])
            datalist.append(georecord[1])
        else:
            datalist.append(georecord[1])
else:
    for county in geoids:
        georecord=getdata()
        print('Retrieved data for',county)
        if len(datalist)==0:
            for geog in georecord:
                datalist.append(geog)
        else:
            for geog in georecord[1:]:
                datalist.append(geog)
    
dlrows=len(datalist)
dlitems=sum(len(x) for x in datalist)
dlbyrow=dlitems / dlrows
print('Retrieved', dlrows, 'records and', dlitems,'data points with', dlbyrow, 'points for each record...')
        
with open(jsonpath, 'w') as f:
    json.dump(datalist, f)
print('Done - Data dumped to json file')

Retrieved data for 02921
Retrieved 82 records and 13284 data points with 162.0 points for each record...
Done - Data dumped to json file
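
The summary above reports the average number of data points per record. If a chunk fails for one geography, that row ends up shorter than the others, and an average can hide which row is affected. A minimal sketch of a stricter check follows; `check_rectangular` and the sample rows are hypothetical.

```python
def check_rectangular(rows):
    """Return the positions of rows whose length differs from the header."""
    expected = len(rows[0])
    return [i for i, row in enumerate(rows) if len(row) != expected]

# Made-up data: the second table has a short final row
good = [['GEO_ID', 'NAME', 'VAR_A'],
        ['id1', 'name1', '1.0'],
        ['id2', 'name2', '2.0']]
bad = [['GEO_ID', 'NAME', 'VAR_A'],
       ['id1', 'name1', '1.0'],
       ['id2', 'name2']]

good_result = check_rectangular(good)
bad_result = check_rectangular(bad)
print('Problem rows in good table:', good_result)
print('Problem rows in bad table:', bad_result)
```

Running a check like this on `datalist` before the json dump would flag incomplete rows before they reach the DataFrame step.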


## Process Data
Replace footnote codes with nulls and create a new GEOIDSHORT column; census variable names are kept as the column names.

In [12]:
with open(jsonpath, 'r') as f:
    jsondata=json.load(f)
alldata = pd.DataFrame(jsondata[1:],columns=jsondata[0],dtype=object).rename(columns={
    'GEO_ID':'GEOIDLONG','NAME':'GEOLABEL'}).set_index('GEOIDLONG')
alldata.info()
# Index and column entries should be 1 row and 1 column less than previous count (excludes header row and index column) 

<class 'pandas.core.frame.DataFrame'>
Index: 81 entries, 860Z200US02802 to 860Z200US02921
Columns: 161 entries, GEOLABEL to DP1_0160P
dtypes: object(161)
memory usage: 102.5+ KB


In [13]:
alldata.head(3)

GEOIDLONG,GEOLABEL,DP1_0001P,DP1_0002P,DP1_0003P,DP1_0004P,DP1_0005P,DP1_0006P,DP1_0007P,DP1_0008P,DP1_0009P,...,DP1_0151P,DP1_0152P,DP1_0153P,DP1_0154P,DP1_0155P,DP1_0156P,DP1_0157P,DP1_0158P,DP1_0159P,DP1_0160P
860Z200US02802,ZCTA5 02802,100.0,2.6,4.8,4.2,3.6,2.8,3.7,2.3,8.8,...,0.0,0.4,2.1,0.8,2.1,-888888888,-888888888,100.0,67.3,32.7
860Z200US02804,ZCTA5 02804,100.0,4.7,6.1,5.6,5.7,5.1,5.6,4.7,6.8,...,0.1,1.7,1.0,2.2,0.7,-888888888,-888888888,100.0,84.8,15.2
860Z200US02806,ZCTA5 02806,100.0,4.8,7.0,9.0,8.2,4.7,2.7,3.4,5.6,...,0.0,0.9,0.5,1.8,1.4,-888888888,-888888888,100.0,87.0,13.0


In [14]:
#This is a lousy solution, come up with something better in the future
footnotes=['-999999999','-999999999.0', '-999999999.00',
           '-888888888','-888888888.0', '-888888888.00',
           '-666666666','-666666666.0', '-666666666.00',
           '-555555555','-555555555.0', '-555555555.00',
           '-333333333','-333333333.0', '-333333333.00',
           '-222222222','-222222222.0', '-222222222.00']
alldata.replace(footnotes,np.nan,inplace=True)
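
The list above spells out each footnote code in three string formats. One possible alternative, offered as a sketch rather than as the notebook's method, is to parse each value as a number and compare it against a single set of codes. The names `FOOTNOTE_CODES` and `null_footnote` are hypothetical.

```python
# The six footnote codes from the list above, stored once as numbers
FOOTNOTE_CODES = {-999999999, -888888888, -666666666,
                  -555555555, -333333333, -222222222}

def null_footnote(value):
    """Return None for footnote codes and the original value otherwise."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        # Not numeric (labels, missing values): leave untouched
        return value
    return None if number in FOOTNOTE_CODES else value

sample = ['100.0', '-888888888', '-888888888.0', '-888888888.00',
          'ZCTA5 02802', '02802', None]
cleaned = [null_footnote(v) for v in sample]
print(cleaned)
```

Because the comparison is numeric, the same set covers `-888888888`, `-888888888.0`, and `-888888888.00`, while identifiers with leading zeros are returned unchanged as strings. In the notebook this function would need to be applied element-wise to `alldata`.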

In [15]:
idxgeoid2={'zctas':-5, 'county_subdivs':-10,'tracts':-11}
alldata.insert(loc=0, column='GEOIDSHORT',value=alldata.index.str[idxgeoid2.get(geotype):])
alldata.head(3)

GEOIDLONG,GEOIDSHORT,GEOLABEL,DP1_0001P,DP1_0002P,DP1_0003P,DP1_0004P,DP1_0005P,DP1_0006P,DP1_0007P,DP1_0008P,...,DP1_0151P,DP1_0152P,DP1_0153P,DP1_0154P,DP1_0155P,DP1_0156P,DP1_0157P,DP1_0158P,DP1_0159P,DP1_0160P
860Z200US02802,02802,ZCTA5 02802,100.0,2.6,4.8,4.2,3.6,2.8,3.7,2.3,...,0.0,0.4,2.1,0.8,2.1,,,100.0,67.3,32.7
860Z200US02804,02804,ZCTA5 02804,100.0,4.7,6.1,5.6,5.7,5.1,5.6,4.7,...,0.1,1.7,1.0,2.2,0.7,,,100.0,84.8,15.2
860Z200US02806,02806,ZCTA5 02806,100.0,4.8,7.0,9.0,8.2,4.7,2.7,3.4,...,0.0,0.9,0.5,1.8,1.4,,,100.0,87.0,13.0
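
The short identifier is taken from the end of the long one, using a fixed number of characters per geography. A hedged alternative is to split on the `US` marker, which avoids a per-geography lookup. The sketch assumes every long identifier contains `US` exactly once before the short code. The helper `short_geoid` and the tract-style identifier are illustrative and not taken from the notebook's data.

```python
def short_geoid(geoid_long):
    """Return the part of a long GEO_ID that follows the 'US' marker."""
    prefix, sep, code = geoid_long.partition('US')
    if not sep:
        raise ValueError('No US marker in {!r}'.format(geoid_long))
    return code

# Character counts used for slicing in the block above
idxgeoid2 = {'zctas': -5, 'county_subdivs': -10, 'tracts': -11}

zcta_long = '860Z200US02802'          # from the notebook output
tract_long = '1400000US44001030100'   # illustrative tract-style ID

zcta_short = short_geoid(zcta_long)
tract_short = short_geoid(tract_long)
print(zcta_short, tract_short)

# Both approaches agree on these examples
print(zcta_short == zcta_long[idxgeoid2['zctas']:])
print(tract_short == tract_long[idxgeoid2['tracts']:])
```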


In [16]:
# For PUMAs, filter the statewide records down to the local areas listed in geoids
if geotype == 'pumas':
    censusdata=alldata.loc[alldata.GEOIDSHORT.isin(geoids)].copy().astype(object).sort_index()
else:
    censusdata=alldata.copy().astype(object).sort_index()
censusdata.shape

(81, 162)

In [17]:
censusdata.head(3)

GEOIDLONG,GEOIDSHORT,GEOLABEL,DP1_0001P,DP1_0002P,DP1_0003P,DP1_0004P,DP1_0005P,DP1_0006P,DP1_0007P,DP1_0008P,...,DP1_0151P,DP1_0152P,DP1_0153P,DP1_0154P,DP1_0155P,DP1_0156P,DP1_0157P,DP1_0158P,DP1_0159P,DP1_0160P
860Z200US02802,02802,ZCTA5 02802,100.0,2.6,4.8,4.2,3.6,2.8,3.7,2.3,...,0.0,0.4,2.1,0.8,2.1,,,100.0,67.3,32.7
860Z200US02804,02804,ZCTA5 02804,100.0,4.7,6.1,5.6,5.7,5.1,5.6,4.7,...,0.1,1.7,1.0,2.2,0.7,,,100.0,84.8,15.2
860Z200US02806,02806,ZCTA5 02806,100.0,4.8,7.0,9.0,8.2,4.7,2.7,3.4,...,0.0,0.9,0.5,1.8,1.4,,,100.0,87.0,13.0


## Write to Database
Update list of variables and data types, build create table string, create datatable in temporary database.


In [18]:
dfexcel.replace({'dtype': {'int': 'INTEGER', 'float': 'REAL'}},inplace=True)
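# Assign the result back; an in-place replace on a selected column may not update dfexcel in newer pandas
dfexcel['census_label']=dfexcel['census_label'].replace({'!!': ' - '},regex=True)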
dfexcel.head()

Unnamed: 0,census_var,census_label,dtype
0,DP1_0001P,Percent - SEX AND AGE - Total population,REAL
1,DP1_0002P,Percent - SEX AND AGE - Total population - Und...,REAL
2,DP1_0003P,Percent - SEX AND AGE - Total population - 5 t...,REAL
3,DP1_0004P,Percent - SEX AND AGE - Total population - 10 ...,REAL
4,DP1_0005P,Percent - SEX AND AGE - Total population - 15 ...,REAL


In [19]:
vardict=dfexcel.set_index('census_var').T.to_dict('list')
random.sample(list(vardict.items()), 2)

[('DP1_0085P',
  ['Percent - TOTAL RACES TALLIED [1] - Total races tallied', 'REAL']),
 ('DP1_0132P', ['Percent - HOUSEHOLDS BY TYPE - Total households', 'REAL'])]

In [20]:
con = sqlite3.connect(dbname) 
cur = con.cursor()

In [21]:
cur.execute('DROP TABLE IF EXISTS {}'.format(tabname))
dbstring="""
CREATE TABLE {} (
GEOIDLONG TEXT NOT NULL PRIMARY KEY,
GEOIDSHORT TEXT,
GEOLABEL TEXT,
""".format(tabname)

for k,v in vardict.items():
    dbstring=dbstring+k+' '+v[1]+', \n'
    
dbstring=dbstring[:-3]
dbstring=dbstring+');'
print(dbstring)


CREATE TABLE zctas_census2020_pct (
GEOIDLONG TEXT NOT NULL PRIMARY KEY,
GEOIDSHORT TEXT,
GEOLABEL TEXT,
DP1_0001P REAL, 
DP1_0002P REAL, 
DP1_0003P REAL, 
DP1_0004P REAL, 
DP1_0005P REAL, 
DP1_0006P REAL, 
DP1_0007P REAL, 
DP1_0008P REAL, 
DP1_0009P REAL, 
DP1_0010P REAL, 
DP1_0011P REAL, 
DP1_0012P REAL, 
DP1_0013P REAL, 
DP1_0014P REAL, 
DP1_0015P REAL, 
DP1_0016P REAL, 
DP1_0017P REAL, 
DP1_0018P REAL, 
DP1_0019P REAL, 
DP1_0020P REAL, 
DP1_0021P REAL, 
DP1_0022P REAL, 
DP1_0023P REAL, 
DP1_0024P REAL, 
DP1_0025P REAL, 
DP1_0026P REAL, 
DP1_0027P REAL, 
DP1_0028P REAL, 
DP1_0029P REAL, 
DP1_0030P REAL, 
DP1_0031P REAL, 
DP1_0032P REAL, 
DP1_0033P REAL, 
DP1_0034P REAL, 
DP1_0035P REAL, 
DP1_0036P REAL, 
DP1_0037P REAL, 
DP1_0038P REAL, 
DP1_0039P REAL, 
DP1_0040P REAL, 
DP1_0041P REAL, 
DP1_0042P REAL, 
DP1_0043P REAL, 
DP1_0044P REAL, 
DP1_0045P REAL, 
DP1_0046P REAL, 
DP1_0047P REAL, 
DP1_0048P REAL, 
DP1_0049P REAL, 
DP1_0050P REAL, 
DP1_0051P REAL, 
DP1_0052P REAL, 
DP1_0053P 
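
The CREATE TABLE string above is built by appending each column with a trailing separator and then trimming the last three characters. A sketch of an alternative that joins the column definitions follows, tested against an in-memory SQLite database. The function name `build_create_table`, the table name, and the sample variables are hypothetical.

```python
import sqlite3

def build_create_table(tabname, vardict):
    """Build a CREATE TABLE statement from {var_id: [label, sql_type]}."""
    columns = ['GEOIDLONG TEXT NOT NULL PRIMARY KEY',
               'GEOIDSHORT TEXT',
               'GEOLABEL TEXT']
    # One 'name TYPE' definition per variable, in dictionary order
    columns += ['{} {}'.format(var, meta[1]) for var, meta in vardict.items()]
    # join() places separators only between items, so nothing is trimmed
    return 'CREATE TABLE {} (\n{}\n);'.format(tabname, ',\n'.join(columns))

# Made-up variables in the same shape as vardict
vardict = {'VAR_A': ['Label A', 'REAL'], 'VAR_B': ['Label B', 'INTEGER']}
sql = build_create_table('demo_table', vardict)
print(sql)

# Confirm SQLite accepts the statement
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute(sql)
cur.execute('SELECT * FROM demo_table LIMIT 1;')
col_names = [cn[0] for cn in cur.description]
con.close()
print(col_names)
```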

In [22]:
cur.execute(dbstring)

<sqlite3.Cursor at 0x151e4d8bb20>

In [23]:
censusdata.to_sql(name=tabname, if_exists='append', index=True, con=con)

In [24]:
cur.execute('SELECT COUNT(*) FROM {};'.format(tabname))
rows = cur.fetchone()
print(rows[0], 'records written to', tabname)

81 records written to zctas_census2020_pct


In [25]:
cur.execute('SELECT * FROM {} LIMIT 1;'.format(tabname))
col_names = [cn[0] for cn in cur.description]
print(len(col_names), 'columns written to', tabname)
#Number should be the same as the column count of the censusdata DataFrame plus 1, since the index is not included in the DataFrame count

163 columns written to zctas_census2020_pct
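
The check above compares column counts. A sketch of comparing column names instead follows, using SQLite's `PRAGMA table_info` against an in-memory table. The helper `compare_columns`, the table `demo`, and the sample column names are hypothetical.

```python
import sqlite3

def compare_columns(con, tabname, expected):
    """Return expected columns missing from the table, and unexpected extras."""
    # Each table_info row is (cid, name, type, notnull, dflt_value, pk)
    table_cols = [row[1] for row in
                  con.execute('PRAGMA table_info({})'.format(tabname))]
    missing = [c for c in expected if c not in table_cols]
    extra = [c for c in table_cols if c not in expected]
    return missing, extra

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE demo (GEOIDLONG TEXT PRIMARY KEY, '
            'GEOIDSHORT TEXT, GEOLABEL TEXT, VAR_A REAL)')

# Index name plus the column names we expected to write
expected = ['GEOIDLONG', 'GEOIDSHORT', 'GEOLABEL', 'VAR_A', 'VAR_B']
missing, extra = compare_columns(con, 'demo', expected)
con.close()
print('Missing from table:', missing)
print('Not expected:', extra)
```

In the notebook, the expected list would be the index name followed by the columns of `censusdata`.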


In [26]:
con.close()

## Metadata Table
DO NOT RERUN THIS SECTION FOR MULTIPLE GEOGRAPHIES. In the RI Geodatabase there is only one metadata table per census table series (one for plrd, one for profile) for all geographies. For whichever geography is processed first, set the action variable to 'create' and run this entire series of blocks for the table. For a second table in the same series, set the action variable to 'append' and skip the table creation and identifier insertion blocks.

In [None]:
#Change table name and specify an action - you're either creating the table for the first time with the first set of variables,
#or appending a second set of variables to the existing table

metatab='census2020_lookup'
action='' # 'create' or 'append'

In [None]:
con = sqlite3.connect(dbname) 
cur = con.cursor()

In [None]:
#Only run this block when creating initial table
if action=='create':
    mdstring="""
    CREATE TABLE {} (
    var_id TEXT,
    var_value TEXT);
    """.format(metatab)
    cur.execute(mdstring)
else:
    print('Block not executed because "create" not selected as an action in earlier block')

In [None]:
#Only run this block when creating initial table
if action=='create':
    exstring="""
        INSERT INTO {} VALUES('GEOIDLONG','Id');
        INSERT INTO {} VALUES('GEOIDSHORT','Id2');
        INSERT INTO {} VALUES('GEOLABEL','Geography');
        """.format(metatab,metatab,metatab,metatab)
    cur.executescript(exstring)
    con.commit()
else:
    print('Block not executed because "create" not selected as an action in earlier block')

In [None]:
#Run when creating table or when appending records
#Keys and values - db ids and labels
if action in ('create','append'):
    for mk, mv in vardict.items():
        cur.execute("INSERT INTO {} values(?,?)".format(metatab),(mk,mv[0]))
    con.commit()
else:
    print('Block not executed because action not specified in earlier block')

In [None]:
cur.execute('SELECT COUNT(*) FROM {};'.format(metatab))
rows = cur.fetchone()
print(rows[0], 'records in', metatab)
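
The create and append paths above are spread across several blocks. The sketch below gathers them into one function and runs both paths against an in-memory database, which may help when checking the expected record counts. The function name `load_metadata`, the table name, and the sample variables are hypothetical.

```python
import sqlite3

def load_metadata(con, metatab, vardict, action):
    """Create or append to a lookup table and return its record count."""
    if action not in ('create', 'append'):
        raise ValueError("action must be 'create' or 'append'")
    cur = con.cursor()
    if action == 'create':
        # First run only: build the table and add the identifier rows
        cur.execute('CREATE TABLE {} (var_id TEXT, var_value TEXT)'.format(metatab))
        cur.executemany('INSERT INTO {} VALUES (?,?)'.format(metatab),
                        [('GEOIDLONG', 'Id'),
                         ('GEOIDSHORT', 'Id2'),
                         ('GEOLABEL', 'Geography')])
    # Both actions: one row per variable, holding its ID and label
    cur.executemany('INSERT INTO {} VALUES (?,?)'.format(metatab),
                    [(k, v[0]) for k, v in vardict.items()])
    con.commit()
    cur.execute('SELECT COUNT(*) FROM {}'.format(metatab))
    return cur.fetchone()[0]

con = sqlite3.connect(':memory:')
first = {'VAR_A': ['Label A', 'REAL'], 'VAR_B': ['Label B', 'REAL']}
second = {'VAR_C': ['Label C', 'INTEGER']}

count_after_create = load_metadata(con, 'demo_lookup', first, 'create')
count_after_append = load_metadata(con, 'demo_lookup', second, 'append')
con.close()
print(count_after_create, count_after_append)
```

After the create step the table holds the three identifier rows plus one row per variable; the append step adds only the new variables.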

In [None]:
action=''
con.close()