# Housing TIGER Scraping

This script scrapes census data from the TIGER/Line Shapefiles provided by the United States Census Bureau for a specified year and state. You can find the data source [here](https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html).

### Runtime

Approximately 3 minutes for Washington State. Runtime may differ by state.

### High Level Overview
This script downloads, processes, and merges geographic data from the US Census Bureau's Topologically Integrated Geographic Encoding and Referencing (TIGER) system. The script is configurable for a specific US state and will handle data for all counties within that state. The output is a comprehensive dataset containing selected geographic features and their attributes for the specified state.

### Detailed Overview
The script is set to scrape data for a specified state (in this case, "Washington") and all of its counties. For each county, it downloads three relationship files: the Address Range Relationship File ('ARRF'), the Address Range-Feature Name Relationship File ('ARFNRF'), and the Feature Names Relationship File ('FNRF'). These are saved as the 'addr', 'addrfn', and 'featnames' files, respectively.

These files are then combined. The 'addr' and 'addrfn' files are joined on the 'ARID' field, and the result is joined to the 'featnames' file on the 'LINEARID' field. After joining, the following fields are extracted: 'FULLNAME', 'FROMHN', 'TOHN', 'SIDE', 'ZIP', 'PLUS4', 'MTFCC'. These operations are performed for each county.
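The two-step join described above can be sketched on toy data. This is a minimal illustration, not the script itself: the three small dataframes and every value in them are invented stand-ins for the real 'addr', 'addrfn', and 'featnames' tables, and the 'X0000' code is a placeholder rather than a real MTFCC value.

```python
import pandas as pd

# Toy stand-ins for the three per-county tables (all values are invented).
addr = pd.DataFrame({
    "ARID": ["A1", "A2"],
    "FROMHN": ["100", "200"],
    "TOHN": ["198", "298"],
    "SIDE": ["L", "R"],
    "ZIP": ["98501", "98502"],
    "PLUS4": [None, None],
    "MTFCC": ["X0000", "X0000"],  # placeholder code, dropped before joining
})
addrfn = pd.DataFrame({"ARID": ["A1", "A2"], "LINEARID": ["F1", "F2"]})
featnames = pd.DataFrame({
    "LINEARID": ["F1", "F2"],
    "FULLNAME": ["Main St", "State Hwy 8"],
    "MTFCC": ["S1400", "S1200"],
})

wanted = ["FULLNAME", "FROMHN", "TOHN", "SIDE", "ZIP", "PLUS4", "MTFCC"]

# Drop MTFCC from 'addr' so the MTFCC column that survives is the one from 'featnames'.
merged = addr.drop(columns="MTFCC").merge(addrfn, on="ARID", how="outer")
merged = merged.merge(featnames, on="LINEARID", how="outer")

# Keep only the fields of interest and remove exact duplicate rows.
street_ranges = merged[wanted].drop_duplicates().reset_index(drop=True)
print(street_ranges)
```

Dropping 'MTFCC' from one side before merging avoids pandas creating suffixed 'MTFCC_x' and 'MTFCC_y' columns, which would otherwise make the final column selection fail.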

At the end of the script, the per-county results are concatenated into a single dataset for the state and filtered to rows with MTFCC 'S1400'.

A list of counties for the given state is also generated at the start of the script and used for a quality check at the end to verify that data has been extracted for all counties.
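The completeness check amounts to comparing two lists of county names. A minimal sketch follows, assuming plain Python lists; the helper name `find_missing_counties` is hypothetical and not part of the script.

```python
def find_missing_counties(expected, extracted):
    """Return the expected county names that are absent from the extracted ones, in order."""
    extracted_set = set(extracted)
    return [name for name in expected if name not in extracted_set]


# Example: three counties were expected, but only two appear in the output.
expected = ["Adams", "Asotin", "Benton"]
extracted = ["Adams", "Benton"]

missing = find_missing_counties(expected, extracted)
if missing:
    print(f"WARNING: Data for the following counties were not extracted: {missing}")
```

Because the script runs its check on the final filtered dataset, a county whose rows were all removed by the MTFCC filter would also be reported as missing.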

The script also performs a check to ensure the required columns exist in the downloaded data files before performing any merging operations. If they do not exist, a warning message is printed.
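One way such a column check could look is sketched below. The mapping of columns to file types is an assumption based on the join keys and extracted fields listed above, and the names `REQUIRED_COLUMNS`, `missing_columns`, and `check_columns` are hypothetical.

```python
# Assumed mapping of required columns per file type (an illustration, not taken from the script).
REQUIRED_COLUMNS = {
    "addr": ["ARID", "FROMHN", "TOHN", "SIDE", "ZIP", "PLUS4"],
    "addrfn": ["ARID", "LINEARID"],
    "featnames": ["LINEARID", "FULLNAME", "MTFCC"],
}


def missing_columns(file_type, columns):
    """Return the required columns for file_type that are not in columns."""
    present = set(columns)
    return [col for col in REQUIRED_COLUMNS[file_type] if col not in present]


def check_columns(file_type, columns, county):
    """Print a warning if required columns are absent; return True when all are present."""
    missing = missing_columns(file_type, columns)
    if missing:
        print(f"WARNING: {county} '{file_type}' file is missing columns: {missing}")
    return not missing


# Usage with a dataframe would pass df.columns; a plain list works the same way.
ok = check_columns("addrfn", ["ARID"], "Adams")
```

Returning a boolean lets the caller decide whether to skip the merge for that county or stop the run.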

### Notes
* The TIGER/Line Shapefiles website mentions that users should not download large amounts of data during their peak usage hours of 8 AM to 4 PM Eastern time.
* The columns specified might not be present in all the shapefiles depending on the state and county.
* The script downloads each TIGER file separately for each county. Depending on the amount of data you plan to download, you may want to consider downloading the data in bulk and then extracting the files for each county.
* The TIGER files for the chosen year might not be available at the given URL, as the URL structure can change with the year and the version of the dataset. If so, adjust the URL to the correct path.
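Keeping the URL pattern in one small function makes it easier to adjust when the path changes. The sketch below mirrors the pattern the script uses for 2019; the function name `tiger_file_url` is hypothetical, and other years may need a different layout.

```python
def tiger_file_url(year, geoid, file_type):
    """Build the download URL for one county-level TIGER file.

    year      -- dataset year as a string, e.g. "2019"
    geoid     -- five-digit county GEOID (state FIPS followed by county FIPS)
    file_type -- one of "addr", "addrfn", "featnames"
    """
    base_url = f"https://www2.census.gov/geo/tiger/TIGER{year}"
    return f"{base_url}/{file_type.upper()}/tl_{year}_{geoid}_{file_type}.zip"


# Example: the 'addr' file for GEOID 53001 (a Washington county) in 2019.
url = tiger_file_url("2019", "53001", "addr")
print(url)
```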

--------- 

### About

<p>Author: PJ Gibson</p>
<p>Created Date: 2023-06-30</p>
<p>Contact: peter.gibson@doh.wa.gov</p>
<p>Assistance in the generation of this script was provided by GPT-4, a model developed by OpenAI.</p>

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import os
import requests
import zipfile
import geopandas as gpd
from simpledbf import Dbf5
from tqdm import tqdm
import pickle

# Specify the state you want
state = "Washington"
fips = "53"

year = "2019"

# Create a directory for the downloaded files if it does not exist
folder_name = f'TIGERfiles_{state}'
os.makedirs(folder_name, exist_ok=True)

# Load the key from the pickle file
with open('secrets.pkl', 'rb') as f:
    api_key = pickle.load(f)

print("API key loaded successfully.")


In [None]:
# Base URL of the US Census TIGER file server (accessed over HTTPS)
base_url = f'https://www2.census.gov/geo/tiger/TIGER{year}'

params = {"key": api_key}

# List of counties
counties_url = f'{base_url}/COUNTY/tl_{year}_us_county.zip'
counties_file = f'{folder_name}/tl_{year}_us_county.zip'

# Download the zipped county shapefile, stopping early on an HTTP error
response = requests.get(counties_url, params=params)
response.raise_for_status()
with open(counties_file, 'wb') as output:
    output.write(response.content)

# Unzip the county file
with zipfile.ZipFile(counties_file, 'r') as zip_ref:
    zip_ref.extractall(f'{folder_name}/')

# Read in which counties belong to our state of interest
counties = gpd.read_file(f'{folder_name}/tl_{year}_us_county.shp')
state_counties = counties[counties['STATEFP'] == fips]

# Define the dataframe of county names used for the completeness check at the end
counties_list = state_counties['NAME'].to_frame()

# Begin extracting, combining, and saving data for each county.
for i in tqdm(np.arange(0,len(state_counties))):

    # Define row iteration within our for-loop
    row = state_counties.iloc[i]

    # Define the county of interest. Used primarily for file naming
    county = row['NAME']

    # For each file type below, we will append the output dataframe to this empty list
    files_to_combine = []

    for file_type in ['addr', 'featnames', 'addrfn']:
        file_url = f'{base_url}/{file_type.upper()}/tl_{year}_{row["GEOID"]}_{file_type}.zip'
        file_name = f'{folder_name}/tl_{year}_{row["GEOID"]}_{file_type}.zip'

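        # Download the file, stopping early on an HTTP error instead of saving an error page as a zip
        response = requests.get(file_url, params=params)
        response.raise_for_status()
        with open(file_name, 'wb') as output:
            output.write(response.content)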

        # Unzip the file
        with zipfile.ZipFile(file_name, 'r') as zip_ref:
            zip_ref.extractall(f'{folder_name}/{county}_{file_type}/')

        # Read in the .dbf file, one of the unzipped contents
        dbf = Dbf5(f'{folder_name}/{county}_{file_type}/tl_{year}_{row["GEOID"]}_{file_type}.dbf')

        # Convert to pandas dataframe
        df = dbf.to_dataframe()

        # Append the dataframe to the list so it can be merged after the inner loop
        files_to_combine.append(df)

    # Merge the files
    merged = pd.merge(files_to_combine[0].drop('MTFCC',axis=1), files_to_combine[2], on='ARID', how='outer')
    merged = pd.merge(merged, files_to_combine[1], on='LINEARID', how='outer')

    # Extract the required fields and drop duplicates
    merged = merged[['FULLNAME', 'FROMHN', 'TOHN', 'SIDE', 'ZIP', 'PLUS4', 'MTFCC']].drop_duplicates()

    # Define fips value
    merged['COUNTY_FIPS'] = int(row["GEOID"])

    # Define county name
    merged['COUNTY_NAME'] = county

    # Save the result
    merged.to_csv(f'{folder_name}/{county}_combined.csv', index=False)

# Merge all county data
all_counties = pd.concat([pd.read_csv(f'{folder_name}/{county}_combined.csv') for county in state_counties['NAME']])

# Keep only roads designed for housing; this excludes highways, bike paths, 4x4-only access roads, etc.
# See https://www2.census.gov/geo/pdfs/maps-data/data/tiger/tgrshp2019/TGRSHP2019_TechDoc.pdf (end of paper) for detail on MTFCC options.
all_counties = all_counties.query('MTFCC == "S1400"')

# Save the result
all_counties.to_csv('../../SupportingDocs/Housing/03_Complete/TIGER_state_streets.csv', index=False)

# Quality check: Check if all counties' data have been extracted
extracted_counties = all_counties['COUNTY_NAME'].unique().tolist()
counties_list['Extracted'] = counties_list['NAME'].isin(extracted_counties)
missing_counties = counties_list[~counties_list['Extracted']]['NAME'].tolist()
if missing_counties:
    print(f'WARNING: Data for the following counties were not extracted: {missing_counties}')