## General data rules


- We try to preserve as many FIPS codes as possible. Some exclusions are hard to avoid, for which we apologize, but we use datasets that are as complete as we can find.
- Each processed dataset has the same first two columns, followed by float columns with features.
- Each processed dataset focuses on one feature group.
- Raw datasets are stored in the `data/raw` folder.
- Processed datasets are stored in the `data/processed` folder.
- The `_sdt_` versions of processed data result from standardizing all float columns to have mean 0 and standard deviation 1, and then rescaling to fit between -1 and 1. This is done to make the data more amenable to fair similarity computations and machine learning algorithms.
- Exclusions forced by NaNs in the accepted incoming datasets are aggregated in `exclusions.pkl`, which is used in the dataset pipeline.
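
The standardize-then-rescale step behind the `_sdt_` files can be sketched as follows. This is a minimal illustration in plain Python, not the pipeline's actual code: the function name is hypothetical, and dividing by the largest absolute z-score is only one possible reading of "rescaling to fit between -1 and 1".

```python
from statistics import mean, stdev


def standardize_and_rescale(values):
    """Z-score a column of floats, then shrink it into [-1, 1].

    Assumption: the rescaling divides by the largest absolute z-score,
    which keeps the column mean at 0. The real pipeline may rescale
    differently.
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation; assumed non-zero
    z_scores = [(v - mu) / sigma for v in values]
    largest = max(abs(z) for z in z_scores)
    return [z / largest for z in z_scores]


# Hypothetical feature column for three locations.
column = [1.0, 2.0, 6.0]
scaled = standardize_and_rescale(column)
print(scaled)
```

A constant column has zero standard deviation and would need separate handling before a step like this.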


## Outcome variables

These are the time-series variables that the data advocates we interacted with identified as being of interest (or at least those we were able to incorporate; some were left out for various reasons).


### GDP


- **Definition** - Chain-type GDP is a method for calculating Gross Domestic Product (GDP) that adjusts for changes in the composition and prices of goods and services over time.

- **Time restrictions** - 2001 to 2021

- **Source** - The dataset was obtained from the [Bureau of Economic Analysis](https://www.bea.gov/), and it can be downloaded via [this link](https://apps.bea.gov/iTable/?reqid=70&step=1&isuri=1&acrdn=5#eyJhcHBpZCI6NzAsInN0ZXBzIjpbMSwyOSwyNSwzMSwyNiwyNywzMF0sImRhdGEiOltbIlRhYmxlSWQiLCI1MzMiXSxbIk1ham9yX0FyZWEiLCI0Il0sWyJTdGF0ZSIsWyIwIl1dLFsiQXJlYSIsWyIwMDAwMCJdXSxbIlN0YXRpc3RpYyIsWy-1Il1dLFsiVW5pdF9vZl9tZWFzdXJlIiwiTGV2ZWxzIl0sWyJZZWFyIiwi-1Il1dLFsiWWVhciJdXSxb-1Il1dLFsiWWVhcmJlZ2luIiwi-1Il1dLFsiWWVhcmJlZ2luIiwi-1Il1dfQ==).

- **Notes** - The dataset is missing values for the year 2012.

- The GDP data contained in `CAGDP1_2001_2021.csv` was downloaded on Oct 4, 2023. The following locations were removed due to NaNs:

|   GeoFIPS | GeoName                                    |
|----------:|--------------------------------------------|
|     02063 | Chugach Census Area, AK*                  |
|     02066 | Copper River Census Area, AK*            |
|     02105 | Hoonah-Angoon Census Area, AK*           |
|     02195 | Petersburg Borough, AK*                  |
|     02198 | Prince of Wales-Hyder Census Area, AK*   |
|     02201 | Prince of Wales-Outer Ketchikan Census Area, AK* |
|     02230 | Skagway Municipality, AK*                |
|     02261 | Valdez-Cordova Census Area, AK*          |
|     02275 | Wrangell City and Borough, AK*           |
|     08014 | Broomfield, CO*                          |




- Furthermore, the processed datasets do not include the counties listed in the `exclusions.pkl` file. This ensures that the GeoFIPS values are consistent across all datasets.
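
To illustrate how NaN-driven exclusions keep GeoFIPS values consistent, here is a minimal sketch in plain Python. The helper names and the toy rows are hypothetical, and the set `excluded` merely stands in for whatever `exclusions.pkl` actually stores, whose structure is not described here.

```python
import math


def fips_with_nans(rows):
    """Return the GeoFIPS of every row with a NaN in any float column."""
    return {
        row["GeoFIPS"]
        for row in rows
        if any(isinstance(v, float) and math.isnan(v) for v in row.values())
    }


def drop_excluded(rows, excluded):
    """Keep only the rows whose GeoFIPS is not in the exclusion set."""
    return [row for row in rows if row["GeoFIPS"] not in excluded]


nan = float("nan")

# Toy rows with made-up values; only the column layout mirrors the datasets.
gdp = [
    {"GeoFIPS": "01001", "GeoName": "Autauga, AL", "2020": 1.5, "2021": 1.6},
    {"GeoFIPS": "08014", "GeoName": "Broomfield, CO*", "2020": nan, "2021": 2.0},
]
population = [
    {"GeoFIPS": "01001", "GeoName": "Autauga, AL", "2020": 100.0, "2021": 101.0},
    {"GeoFIPS": "08014", "GeoName": "Broomfield, CO*", "2020": 50.0, "2021": 51.0},
    {"GeoFIPS": "01003", "GeoName": "Baldwin, AL", "2020": nan, "2021": 3.0},
]

# Aggregate the exclusions from all datasets, then apply them to each one.
excluded = fips_with_nans(gdp) | fips_with_nans(population)
gdp_clean = drop_excluded(gdp, excluded)
population_clean = drop_excluded(population, excluded)
```

Because every dataset is filtered with the same aggregated set, a location dropped for missing values in one dataset disappears from all of them.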

## Background variables


These are variables that are not available as time series, but can nevertheless be used to evaluate similarity between locations and to build predictive models.
Since there will be many of them, we grouped them into categories for ease of use.

### Demographic variables

#### Population

- **Definition** - Demographic variables, including a time series for population, were obtained from the CAINC30 dataset created by the Bureau of Economic Analysis.

- **Time restrictions** - 1992 to 2021

- **Source** - The dataset was obtained from [this website](https://www.bea.gov/) via [this link](https://apps.bea.gov/itable/?ReqID=70&step=1#eyJhcHBpZCI6NzAsInN0ZXBzIjpbMSwyOSwyNSwzMSwyNiwyNywzMF0sImRhdGEiOltbIlRhYmxlSWQiLCIxMiJdLFsiTWFqb3JfQXJlYSIsIjQiXSxbIlN0YXRlIixbIjAiXV0sWyJBcmVhIixbIjAwMDAwIl1dLFsiU3RhdGlzdGljIixbIi0xIl1dLFsiVW5pdF9vZl9tZWFzdXJlIiwiTGV2ZWxzIl0sWyJZZWFyIixbIi0xIl1dLFsiWWVhckJlZ2luIiwiLTEiXSxbIlllYXJfRW5kIiwiLTEiXV19).

- **Notes** - Data from before 1992 was removed because of missing values in the earlier years. 58 counties were excluded from the dataset because their FIPS codes do not appear in the GDP dataset.

### Transportation

- **Definition** - The following transportation variables (Road Network Density and National Walkability Index) were extracted from the Smart Location Database (SLD).

- **Time restrictions** - The dataset was compiled in 2021. The variables vary in their sources and dates, as explained in the variable descriptions.

- **Source** - The Smart Location Database (version 3.0) was obtained on Oct 10, 2023 from [this website](https://www.epa.gov/smartgrowth/smart-location-mapping) through [this link](https://edg.epa.gov/EPADataCommons/public/OA/).

- **Notes** - `smartLocationSmall.csv` is a preprocessed and condensed version of the main dataset. The original data is provided for small subregions of counties, while for practical reasons we focus on counties/cities. We therefore grouped the data by county: some values were aggregated by summing them, and others were averaged. More details are available in the variable descriptions. The main dataset contains more transportation variables, but many of them have missing values and had to be excluded, given our interest in providing consistent information and analysis for as many locations as possible.
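
The county-level grouping described in the notes above can be sketched as follows. This is an illustration in plain Python under stated assumptions: the column names `county_fips`, `land_area` and `rate` and all the values are hypothetical, and which variables are summed and which are averaged is decided per variable in the real preprocessing.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_county(rows, sum_cols, mean_cols):
    """Collapse subregion rows to one record per county.

    Columns in sum_cols are added up; columns in mean_cols are averaged
    with a plain (unweighted) mean.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["county_fips"]].append(row)

    result = {}
    for fips, members in grouped.items():
        record = {col: sum(m[col] for m in members) for col in sum_cols}
        record.update({col: mean(m[col] for m in members) for col in mean_cols})
        result[fips] = record
    return result


# Hypothetical subregion rows: two subregions in one county, one in another.
subregions = [
    {"county_fips": "01001", "land_area": 2.0, "rate": 0.5},
    {"county_fips": "01001", "land_area": 3.0, "rate": 1.5},
    {"county_fips": "01003", "land_area": 4.0, "rate": 2.0},
]

counties = aggregate_by_county(subregions, sum_cols=["land_area"], mean_cols=["rate"])
```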


#### Road Network Density

`D3A`

This variable indicates road network density at the county level, sourced from the HERE Maps NAVSTREETS databases collected in 2018. It represents the miles of road per square mile of land, calculated at the county level.



#### National Walkability Index

`WeightAvgNatWalkInd`

A National Walkability Index (NWI) was created in 2015 following the release of SLD version 2.0, aimed at aiding transportation planning and facilitating comparisons of places' suitability for walking as a form of travel.

The National Walkability Index takes values between 1 (lowest walkability) and 20 (highest walkability). Scores are categorized into the following basic levels of walkability: 1) least walkable (1.0-5.75), 2) below average walkable (5.76-10.5), 3) above average walkable (10.51-15.25) and 4) most walkable (15.26-20.0).
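
As a small illustration, the four levels can be looked up with a helper like the one below. The function name is hypothetical. Treating each upper bound as an inclusive cut point is an assumption, made so that averaged county scores falling between two published ranges (for example 5.755) still receive a level.

```python
def walkability_level(score: float) -> str:
    """Map a National Walkability Index score (1-20) to its basic level."""
    if not 1.0 <= score <= 20.0:
        raise ValueError(f"score must be between 1 and 20, got {score}")
    if score <= 5.75:
        return "least walkable"
    if score <= 10.5:
        return "below average walkable"
    if score <= 15.25:
        return "above average walkable"
    return "most walkable"


# Hypothetical county score.
print(walkability_level(12.3))
```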

In our abbreviated dataset (`smartLocationSmall.csv`), the index value for each county was calculated as the population-weighted average over the subregions within that county.
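
A population-weighted average of this kind can be sketched as below. The function name and the numbers are hypothetical, and the preprocessing that produced `smartLocationSmall.csv` may differ in detail.

```python
def population_weighted_mean(values, populations):
    """Average the subregion values, weighting each by its population."""
    total_population = sum(populations)
    if total_population == 0:
        raise ValueError("total population must be positive")
    weighted_sum = sum(v * p for v, p in zip(values, populations))
    return weighted_sum / total_population


# Hypothetical county with two subregions: index 4.0 with 100 residents
# and index 10.0 with 300 residents.
county_index = population_weighted_mean([4.0, 10.0], [100, 300])
print(county_index)  # 8.5
```

The more populous subregion pulls the county value toward its own score, which an unweighted mean (7.0 here) would not do.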

## Intervention variables

### USA Spending datasets

- **Definition** - The datasets `spending_transportation`, `spending_commerce`, and `spending_HHS` contain information on grant expenditures for counties in the United States. These grants were awarded by the following government departments: Transportation, Commerce, and Health and Human Services.

- **Time restrictions** - The data covers the period from 2010 to 2021.

- **Source** - This dataset was obtained from [USA Spending](https://www.usaspending.gov/) and was accessed in October 2023. The data was collected through a search on the platform's custom award data download center, and you can access it [here](https://www.usaspending.gov/download_center/custom_award_data).

- **Notes** - The raw dataset was filtered to include only the columns relevant to our analysis and then grouped by FIPS code and year, which reduces the size of the datasets and makes analysis more efficient. During the initial filtering we also exclude all negative values found in the `total_obligated_amount` column, as they were of no use to us. Each spending dataset has a companion file recording the negative values deleted for each year. These files are named in the format `info_negative..._val.pkl`.
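
The filtering and grouping described in the notes above can be sketched as follows. This is plain Python under stated assumptions: the field names `fips` and `year` are hypothetical (only `total_obligated_amount` comes from the raw data), the amounts are made up, and counting the dropped values per year is just one guess at what the companion files record.

```python
from collections import defaultdict


def summarize_spending(records):
    """Drop negative obligations, then total the rest by (FIPS, year).

    Returns the totals and, as an assumed stand-in for the companion
    info files, the number of negative values dropped in each year.
    """
    totals = defaultdict(float)
    dropped_per_year = defaultdict(int)
    for record in records:
        amount = record["total_obligated_amount"]
        if amount < 0:
            dropped_per_year[record["year"]] += 1
            continue
        totals[(record["fips"], record["year"])] += amount
    return dict(totals), dict(dropped_per_year)


# Hypothetical grant records.
grants = [
    {"fips": "01001", "year": 2010, "total_obligated_amount": 100.0},
    {"fips": "01001", "year": 2010, "total_obligated_amount": 50.0},
    {"fips": "01001", "year": 2011, "total_obligated_amount": -20.0},
    {"fips": "01003", "year": 2010, "total_obligated_amount": 70.0},
]

totals, dropped = summarize_spending(grants)
```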

#### Spending on Transportation

- Some FIPS values did not match those found in the GDP dataset. Many of them had only 3 digits. Rather than delete all of them, we compared them against the names found in `spending_transportation_names`, which restored more than 90% of the 3-digit FIPS codes. In the end, 181 FIPS codes were excluded.
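
One way such a name comparison could work is sketched below. Everything here is an assumption made for illustration: the text only says that names were compared, so the normalization rule, the function names, and the toy data are hypothetical rather than a description of the actual procedure.

```python
def normalize(name):
    """Lowercase a location name and collapse punctuation and whitespace."""
    return " ".join(name.lower().replace(",", " ").split())


def restore_fips(spending_entries, gdp_names):
    """Recover 5-digit FIPS codes for short codes by matching on names.

    spending_entries: list of (code, name) pairs from the spending data.
    gdp_names: dict mapping 5-digit GeoFIPS to GeoName from the GDP data.
    """
    lookup = {normalize(name): fips for fips, name in gdp_names.items()}
    resolved, unmatched = [], []
    for code, name in spending_entries:
        if code in gdp_names:
            resolved.append((code, name, code))  # already a known GeoFIPS
            continue
        match = lookup.get(normalize(name))
        if match is not None:
            resolved.append((code, name, match))
        else:
            unmatched.append((code, name))
    return resolved, unmatched


# Hypothetical inputs.
gdp_names = {"01001": "Autauga, AL", "01003": "Baldwin, AL"}
spending_entries = [
    ("001", "Autauga,  AL"),
    ("01003", "Baldwin, AL"),
    ("999", "Nowhere, ZZ"),
]

resolved, unmatched = restore_fips(spending_entries, gdp_names)
```

Entries that end up in `unmatched` correspond to the codes that would still have to be excluded.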

#### Spending on Commerce

- 73 FIPS codes were removed because they do not appear in the GDP dataset.

#### Spending on HHS

- 99 FIPS codes were removed because they do not appear in the GDP dataset.