## General data rules


- We try to preserve as many FIPS codes as possible. Some exclusions are hard to avoid, for which we apologize, but we do our best to use datasets that are as complete as possible.
- Each processed dataset has the same first two columns, followed by float columns with features.
- Each processed dataset focuses on one feature group.
- Raw datasets are stored in the `data/raw` folder.
- Processed datasets are stored in the `data/processed` folder.
- Each dataset is processed into four formats, all of which are needed at different stages: long and wide, each with original values and with standardized and scaled values.
- All datasets need to contain the same FIPS codes, in exactly the same order. The `gdp` dataset is the source of truth for this.
- Cleaning is supposed to be achieved by a separate function, similar to `clean_population.py` or `clean_spending_transportation.py`. The function is then supposed to be run within `cleaning_pipeline.py`.
- The `_sdt_` versions of processed data result from standardizing all float columns to have mean 0 and standard deviation 1, and then rescaling to fit between -1 and 1. This is done to make the data more amenable to fair similarity computations and machine learning algorithms.
- Exclusions enforced by NaNs in incoming accepted datasets are aggregated in `exclusions.pkl`, which is used in the dataset pipeline.
- Each included feature group is to be described in `data_sources.ipynb`. It already includes some descriptions; please make yours similar.
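
The standardization and reshaping rules above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the `GeoFIPS`/`GeoName` identifier columns, and the choice to rescale by dividing each column by its largest absolute value are assumptions.

```python
import pandas as pd

# Assumed names of the two identifier columns shared by all processed datasets.
ID_COLS = ["GeoFIPS", "GeoName"]


def standardize_and_rescale(wide: pd.DataFrame) -> pd.DataFrame:
    """Standardize feature columns, then rescale them to lie within [-1, 1]."""
    out = wide.copy()
    feature_cols = [c for c in out.columns if c not in ID_COLS]
    # Step 1: mean 0, standard deviation 1 (pandas default: sample std, ddof=1).
    z = (out[feature_cols] - out[feature_cols].mean()) / out[feature_cols].std()
    # Step 2 (assumed method): divide by the largest absolute value per column.
    # A constant column has std 0 and would turn into NaN here.
    out[feature_cols] = z / z.abs().max()
    return out


def wide_to_long(wide: pd.DataFrame, var_name: str = "Year") -> pd.DataFrame:
    """Turn one-column-per-feature (wide) data into one-row-per-value (long)."""
    return wide.melt(id_vars=ID_COLS, var_name=var_name, value_name="Value")
```

Calling `wide_to_long(standardize_and_rescale(df))` on a wide dataset would then give the long, standardized variant; the other three variants follow by skipping one or both steps.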


## Outcome variables

These are time-series variables that the data advocates we interacted with identified as being of interest (or at least those we were able to incorporate; some had to be left out for various reasons).


### GDP


- **Definition** - Chain-type GDP is a method for calculating Gross Domestic Product (GDP) that adjusts for changes in the composition and prices of goods and services over time.

- **Time restrictions** - 2001 to 2021

- **Source** - The dataset was obtained from the [Bureau of Economic Analysis](https://www.bea.gov/), and it can be downloaded via [this link](https://apps.bea.gov/iTable/?reqid=70&step=1&isuri=1&acrdn=5#eyJhcHBpZCI6NzAsInN0ZXBzIjpbMSwyOSwyNSwzMSwyNiwyNywzMF0sImRhdGEiOltbIlRhYmxlSWQiLCI1MzMiXSxbIk1ham9yX0FyZWEiLCI0Il0sWyJTdGF0ZSIsWyIwIl1dLFsiQXJlYSIsWyIwMDAwMCJdXSxbIlN0YXRpc3RpYyIsWy-1Il1dLFsiVW5pdF9vZl9tZWFzdXJlIiwiTGV2ZWxzIl0sWyJZZWFyIiwi-1Il1dLFsiWWVhciJdXSxb-1Il1dLFsiWWVhcmJlZ2luIiwi-1Il1dLFsiWWVhcmJlZ2luIiwi-1Il1dfQ==).

- **Notes** - The dataset is missing values for the year 2012.

- The GDP data contained in `CAGDP1_2001_2021.csv` was downloaded on Oct 4, 2023. The following locations were removed due to NaNs:

|   GeoFIPS | GeoName                                    |
|----------:|--------------------------------------------|
|     02063 | Chugach Census Area, AK*                  |
|     02066 | Copper River Census Area, AK*            |
|     02105 | Hoonah-Angoon Census Area, AK*           |
|     02195 | Petersburg Borough, AK*                  |
|     02198 | Prince of Wales-Hyder Census Area, AK*   |
|     02201 | Prince of Wales-Outer Ketchikan Census Area, AK* |
|     02230 | Skagway Municipality, AK*                |
|     02261 | Valdez-Cordova Census Area, AK*          |
|     02275 | Wrangell City and Borough, AK*           |
|     08014 | Broomfield, CO*                          |




- Furthermore, the processed datasets do not include the counties listed in `exclusions.pkl`. This was necessary to keep `GeoFIPS` values consistent across all datasets.
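
As a rough sketch of how a dataset could be brought in line with the GDP FIPS codes: the helper below drops excluded codes and reorders rows to follow the GDP order. The function name and the assumption that the exclusions are available as a plain collection of FIPS codes are illustrative; the actual structure of `exclusions.pkl` is not described here.

```python
import pandas as pd


def align_to_gdp(df: pd.DataFrame, gdp_fips, excluded_fips) -> pd.DataFrame:
    """Keep only non-excluded GDP FIPS codes, in the same order as in GDP."""
    excluded = set(excluded_fips)
    target = [fips for fips in gdp_fips if fips not in excluded]
    # Fail loudly if the dataset lacks a code it is expected to have;
    # such codes would have to be added to the exclusions first.
    missing = set(target) - set(df["GeoFIPS"])
    if missing:
        raise ValueError(f"FIPS codes missing from dataset: {sorted(missing)}")
    return df.set_index("GeoFIPS").loc[target].reset_index()
```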

## Background variables


These are variables that are not available as time series but can nevertheless be used to evaluate similarity between locations and to build predictive models.
Since there will be many of them, we grouped them into categories for ease of use.

### Demographic variables

#### Population

- **Definition** - Demographic variables, including a time series for population, were obtained from the CAINC30 dataset created by the Bureau of Economic Analysis.

- **Time restrictions** - 1992 to 2021

- **Source** - The dataset was obtained from [this website](https://www.bea.gov/) via [this link](https://apps.bea.gov/itable/?ReqID=70&step=1#eyJhcHBpZCI6NzAsInN0ZXBzIjpbMSwyOSwyNSwzMSwyNiwyNywzMF0sImRhdGEiOltbIlRhYmxlSWQiLCIxMiJdLFsiTWFqb3JfQXJlYSIsIjQiXSxbIlN0YXRlIixbIjAiXV0sWyJBcmVhIixbIjAwMDAwIl1dLFsiU3RhdGlzdGljIixbIi0xIl1dLFsiVW5pdF9vZl9tZWFzdXJlIiwiTGV2ZWxzIl0sWyJZZWFyIixbIi0xIl1dLFsiWWVhckJlZ2luIiwiLTEiXSxbIlllYXJfRW5kIiwiLTEiXV19).

- **Notes** - Data from before 1992 was removed because of missing values in the earlier years. 58 counties were excluded from the dataset because their FIPS codes do not appear in the GDP dataset.

### Transportation

- **Definition** - The following transportation variables (Road Network Density and National Walkability Index) were extracted from the Smart Location Database (SLD).

- **Time restrictions** - The dataset was compiled in 2021; the variables vary in their sources and dates, as explained in the variable descriptions.

- **Source** - The Smart Location Database (version 3.0) was obtained on 10th October 2023 from [this website](https://www.epa.gov/smartgrowth/smart-location-mapping) through [this link](https://edg.epa.gov/EPADataCommons/public/OA/).

- **Notes** - `smartLocationSmall.csv` is a preprocessed and condensed version of the main dataset. Its size was reduced mainly because the information was provided for small subregions of counties, while for practical reasons we focus on counties/cities. This required grouping the data: some values were aggregated by summing them over each county, and others were averaged. More details are available in the variable descriptions. The main dataset contains more transportation variables, but many of them had to be excluded because of missing values, given our interest in providing consistent information and analysis for as many locations as possible.


### Ethnic composition

- **Definition:** This dataset contains demographic information extracted from the American Community Survey (ACS). The raw dataset is a subset of the full set of demographic and housing estimates (DP05), and it includes selected variables, as listed in the "Notes" section. The original dataset contained absolute counts; our final dataset provides proportions of various racial and ethnic groups.

- **Time Restrictions:** Data is based on the 2021 ACS version and contains 5-year estimates.

- **Source:** Data was obtained from the [American Community Survey DP05](https://data.census.gov/) via the [Census Data Platform](https://data.census.gov/table/ACSDP5Y2021.DP05?g=010XX00US$0500000).

- **Notes:** The `ethnic_composition_nominal` dataset contains absolute counts rather than proportions. It includes several groups whose names start with "other"; after transforming the counts into proportions, the groups in the original set do not add up to 100%, presumably because some respondents marked themselves as belonging to multiple `other` categories. To address this, variables `DP05_0082E` (other race) and `DP05_0083E` (two or more races) were combined into a new variable called `other_race_races`. After this revision, the proportions add up to 100%.

The following variables were extracted from the raw dataset (column IDs and their descriptions):

- `DP05_0070E`: Total population
- `DP05_0071E`: Hispanic or Latino (of any race) sum
- `DP05_0072E`: Mexican
- `DP05_0073E`: Puerto Rican
- `DP05_0074E`: Cuban
- `DP05_0075E`: Other Hispanic or Latino
- `DP05_0076E`: Not Hispanic or Latino sum
- `DP05_0077E`: White
- `DP05_0078E`: Black
- `DP05_0079E`: American Indian and Alaska Native
- `DP05_0080E`: Asian
- `DP05_0081E`: Native Hawaiian and other Pacific Islander
- `DP05_0082E`: Other race
- `DP05_0083E`: Two or more races sum
- `DP05_0084E`: Two races, including some other race
- `DP05_0085E`: Some other race and three or more races
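
The conversion from counts to proportions could look roughly like the sketch below. It is an illustration under assumptions: the function name is made up, and the selection of group columns (Hispanic or Latino plus the non-Hispanic race groups) is our guess at a set that sums to the total population, not a confirmed description of the final dataset.

```python
import pandas as pd

TOTAL = "DP05_0070E"
# Assumed selection: Hispanic or Latino plus the non-Hispanic single-race groups.
GROUP_COLS = ["DP05_0071E", "DP05_0077E", "DP05_0078E",
              "DP05_0079E", "DP05_0080E", "DP05_0081E"]


def counts_to_proportions(counts: pd.DataFrame) -> pd.DataFrame:
    """Divide group counts by total population, merging the two 'other' groups."""
    # Combine "other race" and "two or more races" into one column.
    merged = counts["DP05_0082E"] + counts["DP05_0083E"]
    groups = counts[GROUP_COLS].assign(other_race_races=merged)
    # Row-wise division by the total population of each county.
    return groups.div(counts[TOTAL], axis=0)
```

A quick sanity check on the output is that every row sums to 1 (up to floating-point error).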





### Industry composition

- **Definition:** This dataset contains industry information extracted from the American Community Survey (ACS). The raw dataset is a subset of the full set of Selected Economic Characteristics (DP03), and it includes selected variables, as listed in the "Notes" section. The original dataset contained absolute counts; our final dataset provides proportions of various industry areas.

- **Time Restrictions:** Data is based on the 2021 ACS version and contains 5-year estimates.

- **Source:** Data was obtained from the [American Community Survey DP03](https://data.census.gov/) via the [Census Data Platform](https://data.census.gov/table/ACSDP5Y2021.DP03?t=Industry&g=010XX00US$0500000).

- **Notes:** The `industry_absolute` dataset contains absolute counts rather than proportions. The following variables were extracted from the raw dataset:

  - `DP03_0004E`: Employed Population
  - `DP03_0005E`: Unemployed Population
  - `DP03_0032E`: Employed Population Sum
  - `DP03_0033E`: Agriculture, Forestry, Fishing, and Mining Industry Employment
  - `DP03_0034E`: Construction Industry Employment
  - `DP03_0035E`: Manufacturing Industry Employment
  - `DP03_0036E`: Wholesale Trade Industry Employment
  - `DP03_0037E`: Retail Trade Industry Employment
  - `DP03_0038E`: Transportation and Warehousing, and Utilities Industry Employment
  - `DP03_0039E`: Information Industry Employment
  - `DP03_0040E`: Finance and Insurance, Real Estate, and Rental and Leasing Industry Employment
  - `DP03_0041E`: Professional, Scientific, Management, Administrative, and Waste Management Services Employment
  - `DP03_0042E`: Educational Services, and Health Care and Social Assistance Employment
  - `DP03_0043E`: Arts, Entertainment, Recreation, and Accommodation and Food Services Employment
  - `DP03_0044E`: Other Services, Except Public Administration Employment
  - `DP03_0045E`: Public Administration Employment


### Urbanization level

- **Definition:** This dataset comprises variables representing urban and rural areas within counties, including population, land, and housing characteristics.

- **Time Restrictions:** 2020

- **Source:** Data was obtained from the [United States Census Bureau](https://www.census.gov/en.html) via [this link](https://www.census.gov/programs-surveys/geography/guidance/geo-areas/urban-rural.html).

- **Notes:** The variables in the final datasets have the following interpretation:


  - `HOUDEN_RUR`: 2020 rural housing unit density of the county (per square mile)
  - `POPDEN_RUR`: 2020 rural population density of the county (per square mile)
  - `POPDEN_URB`: 2020 urban population density of the county (per square mile)
  - `HOUDEN_URB`: 2020 urban housing unit density of the county (per square mile)
  - `ALAND_PCT_RUR`: Percent of 2020 land within the county that is classified as rural


#### Road Network Density

`D3A`

This variable indicates road network density at the county level, sourced from Maps NAVSTREETS databases collected in 2018. It represents miles of road per square mile of land, calculated at the county level.



#### National Walkability Index

`WeightAvgNatWalkInd`

A National Walkability Index (NWI) was created in 2015 following the release of SLD version 2.0, aimed at aiding transportation planning and facilitating comparisons of places' suitability for walking as a form of travel.

The National Walkability Index takes values between 1 (lowest walkability) and 20 (highest walkability). Scores are categorized into the following basic levels of walkability: 1) least walkable (1.0-5.75), 2) below average walkable (5.76-10.5), 3) above average walkable (10.51-15.25) and 4) most walkable (15.26-20.0).
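
For illustration, the four levels can be assigned with a small helper like the one below. The function name is hypothetical, and treating each upper bound as inclusive (so that an averaged score such as 5.755 still gets a level) is our assumption.

```python
def walkability_level(score: float) -> str:
    """Map a National Walkability Index score to its basic walkability level."""
    if not 1.0 <= score <= 20.0:
        raise ValueError(f"score must be between 1 and 20, got {score}")
    # Upper bounds are treated as inclusive (assumption, see lead-in).
    if score <= 5.75:
        return "least walkable"
    if score <= 10.5:
        return "below average walkable"
    if score <= 15.25:
        return "above average walkable"
    return "most walkable"
```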

In our abbreviated dataset (`smartLocationSmall.csv`), the index value for each county was calculated as the population-weighted average over the county's subregions.
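
A population-weighted county average can be computed along the following lines. This is a sketch only: the function name and the input column names (`GeoFIPS`, `population`, `NatWalkInd`) are placeholders and may not match the raw file.

```python
import pandas as pd


def county_weighted_average(
    subregions: pd.DataFrame,
    value_col: str = "NatWalkInd",   # placeholder name for the subregion index
    weight_col: str = "population",  # placeholder name for subregion population
    county_col: str = "GeoFIPS",     # placeholder name for the county FIPS code
) -> pd.Series:
    """Average subregion values per county, weighting by subregion population."""
    parts = pd.DataFrame({
        "weighted": subregions[value_col] * subregions[weight_col],
        "weight": subregions[weight_col],
    })
    sums = parts.groupby(subregions[county_col]).sum()
    # Weighted mean = sum(value * weight) / sum(weight) within each county.
    return (sums["weighted"] / sums["weight"]).rename("WeightAvgNatWalkInd")
```

Unlike a plain mean, this keeps sparsely populated subregions from dominating the county score.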

## Intervention Variables

### USA Spending datasets

- **Definition** - The datasets `spending_transportation`, `spending_commerce`, and `spending_HHS` contain information on grant expenditures for counties in the United States. These grants were awarded by the following government departments: Transportation, Commerce, and Health and Human Services.
- **Time Restrictions** - The data covers the period from 2010 to 2021.

- **Source** - These datasets were obtained from [USA Spending](https://www.usaspending.gov/) and were accessed in October 2023. The data were collected through a search on the platform's custom award data download center, and you can access it [here](https://www.usaspending.gov/download_center/custom_award_data).

- **Notes** - The raw datasets were filtered to include only the columns relevant to our analysis (we also dropped some potentially interesting columns that had too many missing values) and were then grouped by FIPS code and year. This was done to reduce the size of the datasets for more efficient analysis. At the initial filtering stage we also exclude all negative values found in the `total_obligated_amount` column. Every spending dataset has a corresponding file with information on the negative values deleted for each year; the file names follow the pattern `info_negative..._val.pkl`.
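
The filtering and grouping step might be sketched as follows. Only `total_obligated_amount` is a column name taken from the data; the function name, the `GeoFIPS` and `year` column names, and the choice to record a per-year count of removed negative values are assumptions made for illustration.

```python
import pandas as pd


def filter_and_group(
    raw: pd.DataFrame,
    fips_col: str = "GeoFIPS",  # assumed column name
    year_col: str = "year",     # assumed column name
    amount_col: str = "total_obligated_amount",
):
    """Drop negative amounts, then sum amounts per FIPS code and year."""
    negative = raw[raw[amount_col] < 0]
    # Assumed bookkeeping: how many negative values were removed in each year.
    negatives_per_year = negative.groupby(year_col).size()
    # Note: rows with a missing amount are also dropped by this comparison.
    kept = raw[raw[amount_col] >= 0]
    grouped = kept.groupby([fips_col, year_col], as_index=False)[amount_col].sum()
    return grouped, negatives_per_year
```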

#### Spending on Transportation

- Some FIPS values did not match the values found in the `gdp` dataset. Many of them had only 3 digits and did not match any existing FIPS code, usually because they were missing zeroes in the middle. We identified them by the location names found in `spending_transportation_names`. As a result, more than 90% of the 3-digit FIPS codes were restored. In the end, we excluded 181 FIPS codes.

#### Spending on Commerce

- Number of FIPS codes deleted because they are not present in the `gdp` dataset: 73

#### Spending on HHS

- Number of FIPS codes deleted because they are not present in the `gdp` dataset: 99