# DTSA 5509 Final Project
## Predicting Walkability of U.S. Census Block Groups

#### Rahul Cheeniyil
#### 12-11-2024

In this project, I use the Walkability Index dataset ([published by the U.S. Environmental Protection Agency](https://catalog.data.gov/dataset/walkability-index7)) to build and test supervised learning models that predict the walkability of U.S. census block groups. Predicting walkability matters for allocating infrastructure funding to areas with low pedestrian accessibility. Improving walkability can promote public health and safety, reduce automobile traffic, and improve environmental health through lower emissions.

The models I use are multiple linear regression and gradient boosting. This report walks through the data, how I intend to use it, data cleaning, and exploratory data analysis before building the models themselves. Model performance is assessed at the end of the report.

### Data
The data consists of various metrics for Census 2019 block groups and is available through a public REST API. The `fetch_data` code cell below pages through that API and loads the results into a pandas DataFrame. The resulting DataFrame has 220,134 rows and 182 columns. Each row corresponds to one census block group (called a "block" in the rest of this report) and each column corresponds to a metric associated with that block. The column "NatWalkInd" is the National Walkability Index score assigned to the block. This is the metric I will predict with machine learning models. The other 181 columns hold various metrics and statistics for the block and can be grouped into several categories.
- Geographic Identifiers: Spatial and administrative context for identifying the blocks.
- Population and Demographics: Population of the block and various demographic distributions.
- Automobile Ownership Statistics: Car ownership/dependency of residents including percentage of households owning different numbers of vehicles.
- Employment and Wages: Job distributions and wage levels.
- Land Use and Density: Metrics related to the land itself.
- Transportation: Proximity to destinations and transit.
- Environmental Metrics: Assessments of emissions and vehicle usage.
- Geographic Measurements: Geometric characterization of the land.
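
Since "NatWalkInd" is the prediction target and the remaining columns are candidate predictors, the split between the two can be sketched on a toy table. This is only an illustration under stated assumptions: every column name other than "NatWalkInd" is a made-up placeholder rather than a confirmed name from the dataset, and `split_features_target` is a hypothetical helper, not part of the analysis that follows.

```python
import pandas as pd

TARGET = "NatWalkInd"  # Target column named in the text above

# Toy stand-in for the real table. Every column except NatWalkInd is a
# placeholder name chosen for illustration only.
toy = pd.DataFrame({
    "block_id": ["a", "b", "c"],                   # identifier (non-numeric)
    "total_population": [1200, 850, 2300],         # demographics
    "pct_zero_car_households": [0.10, 0.02, 0.35], # automobile ownership
    "transit_distance_m": [400.0, None, 150.0],    # transportation, with a gap
    "NatWalkInd": [12.5, 4.3, 17.8],               # target
})

def split_features_target(df, target=TARGET):
    """Return (X, y): numeric feature columns and the target series."""
    y = df[target]
    # Drop the target, then keep numeric columns only; identifiers stored
    # as strings fall out at this step.
    X = df.drop(columns=[target]).select_dtypes(include="number")
    return X, y

X, y = split_features_target(toy)
print(X.columns.tolist())
print(y.name)
```

Keeping only numeric columns is a deliberate simplification here; identifier columns stored as numbers would still need to be removed by hand, and missing values are left in place for the cleaning step.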

In [None]:
import requests
import pandas as pd

# Publicly accessible Esri REST API endpoint
api_url = "https://geodata.epa.gov/arcgis/rest/services/OA/WalkabilityIndex/MapServer/0/query"

params = {
    "where": "1=1",  # Select all records
    "outFields": "*",  # Retrieve all fields
    "f": "json",  # Response format
    "resultRecordCount": 10000,  # Limit to 10000 records per request to avoid rate limits
    "returnGeometry": True,  # Include geographic data such as area/perimeter of census blocks
}

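# Fetch the data in chunks
def fetch_data(api_url, params):
    """Page through the query endpoint and return all attribute rows as a DataFrame."""
    query = dict(params)  # Copy so the caller's dict is not modified
    offset = 0
    all_data = []

    while True:
        query["resultOffset"] = offset
        response = requests.get(api_url, params=query)

        if response.status_code != 200:
            # Stop early; rows collected so far are still returned
            print(f"Error: {response.status_code} at offset {offset}")
            break

        features = response.json().get("features", [])
        if not features:
            break  # No more records

        # Keep only the attribute table; geometry objects are discarded
        all_data.extend(feature["attributes"] for feature in features)
        offset += len(features)

    return pd.DataFrame(all_data)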

walkability_data = fetch_data(api_url, params) # This took me about 7 minutes to run.

walkability_data.info()  # info() prints its own summary; no print() needed

# Report memory usage in MB
memory_B = walkability_data.memory_usage(deep=True).sum()
memory_MB = round(memory_B/1024/1024, 3)
print(f"The DataFrame size is {memory_MB:,} MB.")
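
The paging loop in `fetch_data` stops as soon as a page comes back empty. A minimal sketch of the same idea, separated from the HTTP call so it can be exercised offline, might look like the following. The names `collect_pages` and `fake_fetch_page`, the `OBJECTID` field, and the toy records are hypothetical. The check for an `error` key assumes the service can report a failure inside an otherwise successful response; that assumption should be verified against this endpoint before relying on it.

```python
import pandas as pd

def collect_pages(fetch_page, page_size):
    """Collect attribute dicts from successive pages until one is empty.

    fetch_page(offset, page_size) is a hypothetical callable that returns
    the decoded JSON payload (a dict) for a single page.
    """
    rows = []
    offset = 0
    while True:
        payload = fetch_page(offset, page_size)
        if "error" in payload:
            # Assumed failure shape: {"error": {...}} in the response body
            raise RuntimeError(f"Query failed at offset {offset}: {payload['error']}")
        features = payload.get("features", [])
        if not features:
            break  # Empty page means no more records
        rows.extend(feature["attributes"] for feature in features)
        offset += len(features)
    return pd.DataFrame(rows)

# Fake in-memory "service" holding five records, used only for this demo
RECORDS = [{"OBJECTID": i, "NatWalkInd": float(i)} for i in range(5)]

def fake_fetch_page(offset, page_size):
    chunk = RECORDS[offset:offset + page_size]
    return {"features": [{"attributes": record} for record in chunk]}

demo = collect_pages(fake_fetch_page, page_size=2)
print(len(demo))
```

Raising on an error payload, rather than printing and returning a partial table, makes an incomplete download harder to miss, which matters when the row count is later compared against the expected 220,134.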
