# Enhancing the Data with Census FIPS and Ecosystem Data
### Purpose
In this notebook I will add columns to the working data set that contain 1) the census-declared block FIPS and county FIPS, 2) the USGS-declared ecosystem for each CBC location, and 3) the USGS-declared ecosystem for each NOAA station location.

The census block FIPS and county FIPS codes are the unique identifiers the Census Bureau uses to identify an area. To learn more you can visit: https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html
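
As a rough illustration of how these identifiers are structured: a 15-digit census block FIPS code is the concatenation of state (2 digits), county (3), tract (6), and block (4), so the county FIPS is its first five digits. The helper name `split_block_fips` is hypothetical and the sample code is only an illustrative value, not one taken from this data set.

```python
def split_block_fips(block_fips):
    """Split a 15-digit census block FIPS code into its component parts."""
    # zfill restores a leading zero that is lost if the code was read as an integer
    code = str(block_fips).zfill(15)
    if len(code) != 15 or not code.isdigit():
        raise ValueError(f"expected a 15-digit block FIPS, got {block_fips!r}")
    return {
        "state": code[:2],
        "county": code[2:5],
        "tract": code[5:11],
        "block": code[11:],
        "county_fips": code[:5],  # state + county
    }


# Illustrative value only
parts = split_block_fips("060530125021004")
print(parts)
```

Because the codes can start with a zero, it is usually safer to keep them as strings rather than integers.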


### Author: 
Ren C'deBaca
### Date: 
2020-04-21
### Update Date: 
2020-04-26

### Inputs 
1.0-rec-initial-data-cleaning.txt - Tab separated file of cleaned Christmas Bird Count events. Each row represents a single count in a given year. The data dictionary can be found here: http://www.audubon.org/sites/default/files/documents/cbc_report_field_definitions_2013.pdf

np-circles-to-ecosys_data.csv - Comma separated file from Nathan Pavlovic (nathan.pavlovic@gmail.com). This file was produced by first passing Nathan a file of the approximately 4000 unique lat lons present in the clean data file.

Nathan then used the 2008 USGS raster ecosystem dataset. Info here https://rmgsc.cr.usgs.gov/outgoing/ecosystems/USdata/  

He used the Extract Values to Points tool in ArcGIS to find the raster value at each point. 

unique_stations_latlong_ecosys.csv - Comma separated file from Nathan Pavlovic (nathan.pavlovic@gmail.com). This file was produced by first passing Nathan a file of the unique NOAA station lat lons that were present in the file 1.1-circles_to_many_stations_usa_weather_data_20200424213015.csv. See the above notes on his process.

1.2-ijd-fetch-circle-elevations_20200502155633.csv - CSV file of cbc circles matched with NOAA stations and elevation data. Each row is a cbc circle matched to a NOAA station. A cbc location can appear on multiple rows if it is matched to multiple stations.




### Output Files
1.3-rec-connecting-fips-data.csv -- CSV file of the unique lat lons present in cbc data. Each Lat lon is matched to a Block FIPS and County FIPS. (This is the file that was shared with Nathan) 

1.3-rec-connecting-fips-ecosystem-data -- File of the station matched cbc data with added columns for ecosystem data for cbc circles and NOAA stations and census FIPS data. The final cell writes it as a gzip-compressed, tab-separated .txt file.


## Steps or Procedures in the notebook 
1. Load in the cleaned data 
2. Identify the unique Lat Lons present in the cbc circle locations 
3. Get the census FIPS data using one of two options:
    OPTION 1: Load in the saved census FIPS data
    OPTION 2: Run the data through the census API (Note: Takes a few hours) 
4. Load in Ecological Data from Nathan
5. Create a key based on the lat lon of the cbc circles to merge the station matched data with the ecological data
6. Merge in the census FIPS data, the cbc ecological data, and the noaa station ecological data 


## Where the Data will Be Saved 
The raw ecosystem data and the output data will be saved in the Google Drive Folder
https://drive.google.com/drive/folders/1Nlj9Nq-_dPFTDbrSDf94XMritWYG6E2I

The path should look like this: 
audubon-cbc/data/Cloud_Data/<DATA FILE>

## Reference
    https://geo.fcc.gov/api/census/#!/block/get_block_find


In [75]:
# Imports
import os
from datetime import datetime
# Version .24.0
from google.cloud import bigquery
import pandas as pd
import requests
import time
import numpy as np

pd.set_option('display.max_columns', 500)

In [27]:
# ALL File Paths should be declared at the TOP of the notebook
PATH_TO_CLEAN_CBC_DATA = "../data/Cloud_Data/1.0-rec-initial-data-cleaning.txt"
PATH_TO_WORKING_DATA = "../data/Cloud_Data/1.2-ijd-fetch-circle-elevations_20200617023815.txt"


PATH_TO_CBC_ECO_DATA = "../data/np-circles-to-ecosys_data.csv" 
PATH_TO_NOAA_ECO_DATA = "../data/unique_stations_latlong_ecosys.csv"

USE_CENSUS_BACKUP_FILE = True

## Load in the Clean Data

In [28]:
clean_data = pd.read_csv(PATH_TO_CLEAN_CBC_DATA, encoding = "ISO-8859-1", sep="\t")



In [29]:
clean_data.shape

(89215, 48)

In [30]:
clean_data.head()

Unnamed: 0,circle_name,country_state,lat,lon,count_year,count_date,n_field_counters,n_feeder_counters,min_field_parties,max_field_parties,...,max_snow_imperial,min_temp_imperial,max_temp_imperial,min_temp_metric,max_temp_metric,min_wind_metric,max_wind_metric,min_wind_imperial,max_wind_imperial,ui
0,Pacific Grove,US-CA,36.6167,-121.9167,1901,12/25/00,1.0,,,,...,,,,,,,,,,36.6167-121.9167_1901
1,Pueblo,US-CO,38.175251,-104.519575,1901,12/25/00,1.0,,,,...,,,,,,,,,,38.175251-104.519575_1901
2,Bristol,US-CT,41.6718,-72.9495,1901,12/25/00,2.0,,,,...,,,,,,,,,,41.6718-72.9495_1901
3,Norwalk,US-CT,41.1167,-73.4,1901,12/25/00,1.0,,,,...,,,,,,,,,,41.1167-73.4_1901
4,Glen Ellyn,US-IL,41.8833,-88.0667,1901,12/25/00,1.0,,,,...,,,,,,,,,,41.8833-88.0667_1901


### Create a string key to represent a unique lat lon combination 

In [31]:
clean_data['temp_key_str'] = clean_data['lat'].astype(str) + clean_data['lon'].astype(str)

In [32]:
clean_data['temp_key_str'].nunique()

4531
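
A key built by concatenating the default string form of two floats depends on the exact digits each file happens to store, so two sources that round coordinates differently will produce different keys for the same place. The sketch below shows one possible alternative that formats both coordinates to a fixed number of decimals with a separator. The helper name `make_latlon_key` and the choice of 4 decimals are assumptions made for illustration, not part of this notebook.

```python
import pandas as pd


def make_latlon_key(df, lat_col="lat", lon_col="lon", decimals=4):
    """Build a 'lat_lon' string key with a fixed number of decimals."""
    lat = df[lat_col].round(decimals).map(lambda v: f"{v:.{decimals}f}")
    lon = df[lon_col].round(decimals).map(lambda v: f"{v:.{decimals}f}")
    # The separator keeps the two numbers from running together
    return lat + "_" + lon


# Two rows that differ only by floating point noise end up with the same key
demo = pd.DataFrame({"lat": [36.6167, 36.61670001], "lon": [-121.9167, -121.9167]})
demo["key"] = make_latlon_key(demo)
print(demo)
```

The trade-off is that points closer together than the chosen precision collapse into one key, so the number of decimals should be picked with the data in mind.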

In [33]:
clean_data['country_state'].value_counts()

US-CA    5866
US-TX    4997
US-NY    4835
US-PA    4120
US-OH    4018
US-WI    3866
US-FL    3280
US-IL    3078
US-MI    2925
US-VA    2418
US-MN    2378
US-NJ    2170
US-MA    2157
US-NC    2089
US-IN    2085
US-CO    2026
US-OR    1882
US-IA    1709
US-WA    1629
US-AZ    1494
US-MO    1481
US-MD    1477
US-KS    1457
US-TN    1438
US-ME    1391
US-AK    1374
US-MT    1342
US-NM    1327
US-GA    1254
US-CT    1236
US-OK    1168
US-AR    1115
US-LA    1030
US-ND    1006
US-ID    1000
US-SC     990
US-SD     927
US-VT     923
US-NH     904
US-WV     903
US-KY     886
US-UT     845
US-WY     823
US-MS     792
US-NE     681
US-AL     635
US-NV     509
US-HI     434
US-DE     382
US-RI     370
US-DC      93
Name: country_state, dtype: int64

## Census Data 
There are two options here
OPTION 1: Load in the saved census FIPS data
OPTION 2: Send the unique lat lons through a census API to find the block and county FIPS

### Option 1: Load in the saved census FIPS data 

In [34]:
## Option: Set USE_CENSUS_BACKUP_FILE to True to use the file from backup
if USE_CENSUS_BACKUP_FILE:
    census_prep_df = pd.read_csv("1.3-rec-connecting-fips-data.csv")
    census_prep_df = census_prep_df[["lat", "lon", "block_fips", "county_fips"]]
    census_prep_df['temp_key_str'] = census_prep_df['lat'].astype(str) + census_prep_df['lon'].astype(str)
    print(clean_data.shape)
    census_prep_df.head()

(89215, 49)


### Option 2: Run the data through the census API (Note: Takes a few hours) 

In [35]:
if not USE_CENSUS_BACKUP_FILE:
    # Create a small dataframe of unique lat lon locations to use with census data
    # (.copy() helps avoid chained-assignment warnings when columns are added later)
    census_prep_df = clean_data[['temp_key_str', 'lat', 'lon']].copy()

In [36]:
if not USE_CENSUS_BACKUP_FILE:
    census_prep_df.shape

In [37]:
if not USE_CENSUS_BACKUP_FILE:
    # Drop duplicate rows 
    census_prep_df = census_prep_df.drop_duplicates(subset=['lat', 'lon'], keep= 'first') 
    print(census_prep_df.shape)

### Create a test call to the API to see how the data comes back 

In [38]:
if not USE_CENSUS_BACKUP_FILE:
    # Test Lat and Lon
    lat = 51.409713
    lon = 179.284881

    BASE_URL = "https://geo.fcc.gov/api/census/block/find?format=json&latitude=%s&longitude=%s"
    url = BASE_URL % (lat, lon)

    payload = {}
    headers= {}

    response = requests.request("GET", url, headers=headers, data = payload)

    print(response.text.encode('utf8'))

### Build a loop that collects results from the census API to get the block FIPS code and county FIPS code

In [39]:
if not USE_CENSUS_BACKUP_FILE:
    result_list = []
    county_result_list = []

    BASE_URL = "https://geo.fcc.gov/api/census/block/find?format=json&latitude=%s&longitude=%s"

    TIME_DELAY = 2

    for index, row in census_prep_df.iterrows():
        block_fips = ''
        county_fips = ''

        lat = row['lat']
        lon = row['lon']

        url = BASE_URL % (lat, lon)
        payload = {}
        headers= {}
        response = requests.request("GET", url, headers=headers, data = payload)

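        try:
            # Parse the JSON body once and pull out both FIPS codes
            result = response.json()
            block_fips = result['Block']['FIPS']
            county_fips = result['County']['FIPS']
        except (KeyError, TypeError, ValueError):
            # Leave both codes as empty strings when the lookup fails
            print("Could not get FIPS for", lat, lon)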

        result_list.append(block_fips)
        county_result_list.append(county_fips)

        time.sleep(TIME_DELAY)
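
One way to keep the parsing step testable without calling the API is to pull it into a small helper. The sketch below assumes the decoded response has the `Block` -> `FIPS` and `County` -> `FIPS` shape that the loop relies on. The name `extract_fips` is hypothetical and the sample payload is made up for illustration.

```python
def extract_fips(payload):
    """Return (block_fips, county_fips) from a decoded response, using '' when a value is missing."""
    payload = payload or {}
    block = payload.get("Block") or {}
    county = payload.get("County") or {}
    return block.get("FIPS") or "", county.get("FIPS") or ""


# Made-up payload in the shape the loop expects
sample = {"Block": {"FIPS": "060530125021004"}, "County": {"FIPS": "06053"}}
print(extract_fips(sample))

# A payload with no block (for example a point outside census coverage) yields empty strings
print(extract_fips({"Block": {"FIPS": None}}))
```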


In [40]:
if not USE_CENSUS_BACKUP_FILE:
    print(len(result_list))
    print(len(county_result_list))


In [41]:
if not USE_CENSUS_BACKUP_FILE:
    # Turn the result list into arrays 
    result_arry = pd.Series(result_list)
    county_array = pd.Series(county_result_list)

In [42]:
if not USE_CENSUS_BACKUP_FILE:
    # Add the series into the data frame 
    census_prep_df['block_fips'] = result_arry.values
    census_prep_df['county_fips'] = county_array.values

In [44]:
if not USE_CENSUS_BACKUP_FILE:
    print(census_prep_df.head())

### Choose to save the data to a file

In [45]:
## Save the data to a file 
# if not USE_CENSUS_BACKUP_FILE:
#    census_prep_df.to_csv('1.3-rec-connecting-fips-data.csv')

# Add Ecosystem Data to the Working Dataset

### Notes: The file 1.3-rec-connecting-fips-data.csv is the file I passed to Nathan for ecosystem processing. He then returned a dataset with the ecosystem data added as columns. The next section adds in the ecosystem data.

## Load in Ecosystem data for the CBC Circles 

In [46]:
eco_data = pd.read_csv(PATH_TO_CBC_ECO_DATA)

In [47]:
eco_data.shape

(4531, 15)

### Notes On Definitions 
Ecosys - The numeric code for an ecosystem provided by USGS https://www.arcgis.com/home/item.html?id=8e8015c1e60b431fb191b5ed0de97b33. Translates into the Usgsid_sys human readable value 
Usgsid_sys - Human readable ecosystem label 
Nlcd_code - The numeric code for a National Land Cover Database class provided by USGS https://www.arcgis.com/home/item.html?id=8e8015c1e60b431fb191b5ed0de97b33. Translates into the Nlcd human readable value 
Nlcd - Human readable label of the National Land Cover Database code
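
In the previews in this notebook, each `Usgsid_sys` value appears to carry the numeric code as a prefix before an underscore. If that holds for the whole file, the code and the label can be separated as in the sketch below. This is an assumption based on the displayed rows, not a documented format.

```python
import pandas as pd

# Labels copied from outputs shown in this notebook
labels = pd.Series([
    "274_Western Great Plains Shortgrass Prairie",
    "600_Water",
    "0_n/a",
])

# Split on the first underscore only, so underscores inside a label are preserved
parts = labels.str.split("_", n=1, expand=True)
parts.columns = ["code", "label"]
parts["code"] = parts["code"].astype(int)
print(parts)
```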

In [48]:
# Take the Columns we Need
eco_data = eco_data[["lat","lon","Ecosys", "Usgsid_sys", "Nlcd_code", "Nlcd"]]

In [49]:
eco_data.head()

Unnamed: 0,lat,lon,Ecosys,Usgsid_sys,Nlcd_code,Nlcd
0,36.6167,-121.9167,66.0,66_California Coastal Live Oak Woodland and Sa...,3.0,Steppe/Savanna
1,38.175251,-104.519575,274.0,274_Western Great Plains Shortgrass Prairie,4.0,Herbaceous
2,41.6718,-72.9495,300.0,300_Appalachian (Hemlock)-Northern Hardwood Fo...,1.0,Forest and Woodland
3,41.1167,-73.4,487.0,487_Northern Atlantic Coastal Plain Pitch Pine...,1.0,Forest and Woodland
4,41.8833,-88.0667,254.0,254_North-Central Interior Beech-Maple Forest,1.0,Forest and Woodland


In [50]:
# Create a temporary key to merge on
eco_data['temp_key_str'] = eco_data['lat'].astype(str) + eco_data['lon'].astype(str)


In [51]:
eco_data.head()

Unnamed: 0,lat,lon,Ecosys,Usgsid_sys,Nlcd_code,Nlcd,temp_key_str
0,36.6167,-121.9167,66.0,66_California Coastal Live Oak Woodland and Sa...,3.0,Steppe/Savanna,36.6167-121.9167
1,38.175251,-104.519575,274.0,274_Western Great Plains Shortgrass Prairie,4.0,Herbaceous,38.175251-104.519575
2,41.6718,-72.9495,300.0,300_Appalachian (Hemlock)-Northern Hardwood Fo...,1.0,Forest and Woodland,41.6718-72.9495
3,41.1167,-73.4,487.0,487_Northern Atlantic Coastal Plain Pitch Pine...,1.0,Forest and Woodland,41.1167-73.4
4,41.8833,-88.0667,254.0,254_North-Central Interior Beech-Maple Forest,1.0,Forest and Woodland,41.8833-88.0667


## Now Load and Merge in the Station Eco Data
We won't need a temporary key for this file because the station ids are unique, so the merge can use the id column directly.

In [52]:
station_eco_data = pd.read_csv(PATH_TO_NOAA_ECO_DATA)

In [53]:
station_eco_data.head()

Unnamed: 0,X,id,latitude,longitude,RASTERVALU,Red,Green,Blue,Opacity,Ecosys,Usgsid_sys,Nlcd_code,Nlcd
0,0,USC00500252,51.3833,179.2833,,,,,,,,,
1,2,USW00014607,46.8706,-68.0172,324.0,0.504556,0.623333,0.369333,1.0,324.0,324_Laurentian-Acadian Northern Hardwoods Forest,1.0,Forest and Woodland
2,9,USC00176937,46.6539,-68.0089,325.0,0.469988,0.594037,0.333426,1.0,325.0,325_Laurentian-Acadian Pine-Hemlock-Hardwood F...,1.0,Forest and Woodland
3,49,US1MEAR0015,46.6796,-68.0127,324.0,0.504556,0.623333,0.369333,1.0,324.0,324_Laurentian-Acadian Northern Hardwoods Forest,1.0,Forest and Woodland
4,56,USC00171833,45.6611,-67.8614,324.0,0.504556,0.623333,0.369333,1.0,324.0,324_Laurentian-Acadian Northern Hardwoods Forest,1.0,Forest and Woodland


In [54]:
station_eco_data.shape

(11652, 13)

## Merge in the FIPS census data, the CBC circle Ecosystem data, and the NOAA station data with the Station Matched Data 

In [57]:
# Load in the file of noaa matched cbc circles
full_working_df = pd.read_csv(PATH_TO_WORKING_DATA, compression = "gzip", sep = "\t")

In [58]:
full_working_df.head()

Unnamed: 0,circle_name,country_state,lat,lon,count_year,count_date,n_field_counters,n_feeder_counters,min_field_parties,max_field_parties,...,temp_max_value,precipitation_value,temp_avg,snow,snwd,am_rain,pm_rain,am_snow,pm_snow,circle_elev
0,Hawai'i: Volcano,US-HI,19.517,-155.3,1973,1972-12-30,14.0,0.0,5.0,5.0,...,244.0,18.0,,0.0,0.0,2,2,3,3,1551.44
1,Hawai'i: Volcano,US-HI,19.517,-155.3,1973,1972-12-30,14.0,0.0,5.0,5.0,...,,3.0,,0.0,0.0,2,2,3,3,1551.44
2,Hawai'i: Volcano,US-HI,19.517,-155.3,1973,1972-12-30,14.0,0.0,5.0,5.0,...,,,,0.0,0.0,2,2,3,3,1551.44
3,Hawai'i: Volcano,US-HI,19.517,-155.3,1973,1972-12-30,14.0,0.0,5.0,5.0,...,167.0,10.0,,0.0,0.0,2,2,3,3,1551.44
4,Hawai'i: Volcano,US-HI,19.517,-155.3,1973,1972-12-30,14.0,0.0,5.0,5.0,...,,86.0,,0.0,0.0,2,2,3,3,1551.44


In [59]:
full_working_df.shape

(108957, 67)

In [60]:
full_working_df['temp_key_str'] = full_working_df['lat'].astype(str) + full_working_df['lon'].astype(str)

In [61]:
full_working_df['temp_key_str'].nunique()

2920

In [62]:
# Merge in the FIPS data with the full station data
full_working_df = pd.merge(full_working_df, census_prep_df[["temp_key_str", "block_fips", "county_fips"]], how="left", left_on="temp_key_str", right_on="temp_key_str")




In [63]:
full_working_df.shape

(108957, 70)

In [64]:
# Merge in the CBC Circle eco data 
full_working_df = pd.merge(full_working_df, eco_data[["temp_key_str","Ecosys", "Usgsid_sys", "Nlcd_code", "Nlcd"]], how="left", left_on= "temp_key_str", right_on = "temp_key_str")


In [65]:
full_working_df.shape

(108957, 74)

In [66]:
full_working_df.head()

Unnamed: 0,circle_name,country_state,lat,lon,count_year,count_date,n_field_counters,n_feeder_counters,min_field_parties,max_field_parties,...,am_snow,pm_snow,circle_elev,temp_key_str,block_fips,county_fips,Ecosys,Usgsid_sys,Nlcd_code,Nlcd
0,Hawai'i: Volcano,US-HI,19.517,-155.3,1973,1972-12-30,14.0,0.0,5.0,5.0,...,3,3,1551.44,19.517-155.3,,,,,,
1,Hawai'i: Volcano,US-HI,19.517,-155.3,1973,1972-12-30,14.0,0.0,5.0,5.0,...,3,3,1551.44,19.517-155.3,,,,,,
2,Hawai'i: Volcano,US-HI,19.517,-155.3,1973,1972-12-30,14.0,0.0,5.0,5.0,...,3,3,1551.44,19.517-155.3,,,,,,
3,Hawai'i: Volcano,US-HI,19.517,-155.3,1973,1972-12-30,14.0,0.0,5.0,5.0,...,3,3,1551.44,19.517-155.3,,,,,,
4,Hawai'i: Volcano,US-HI,19.517,-155.3,1973,1972-12-30,14.0,0.0,5.0,5.0,...,3,3,1551.44,19.517-155.3,,,,,,


In [67]:
# check that the merge went through
full_working_df['Usgsid_sys'].value_counts()

254_North-Central Interior Beech-Maple Forest                               162
487_Northern Atlantic Coastal Plain Pitch Pine Barrens                      101
600_Water                                                                    98
295_Central Appalachian Alkaline Glade and Woodland                          94
250_North-Central Interior Oak Savanna                                       87
                                                                           ... 
261_Northern Tallgrass Prairie                                                1
0_n/a                                                                         1
274_Western Great Plains Shortgrass Prairie                                   1
53_California Central Valley Alkali Sink                                      1
560_Northern Rocky Mountain Lower Montane, Foothill and Valley Grassland      1
Name: Usgsid_sys, Length: 75, dtype: int64
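
`value_counts()` only counts rows where the merge found a match, so it does not show how many rows were left without ecosystem data. A minimal sketch of a stricter check, using toy frames rather than the real data, is to ask pandas for a merge indicator and compute the share of matched rows. The frame contents and the names `match_rate` and `unmatched_keys` are illustrative.

```python
import pandas as pd

# Toy stand-ins for the working data and the ecosystem lookup
left = pd.DataFrame({"temp_key_str": ["a", "a", "b", "c"], "value": [1, 2, 3, 4]})
right = pd.DataFrame({"temp_key_str": ["a", "b"], "Usgsid_sys": ["600_Water", "0_n/a"]})

merged = pd.merge(
    left,
    right,
    how="left",
    on="temp_key_str",
    validate="many_to_one",  # raises if the lookup has duplicate keys
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

match_rate = (merged["_merge"] == "both").mean()
unmatched_keys = merged.loc[merged["_merge"] == "left_only", "temp_key_str"].unique()
print(match_rate, list(unmatched_keys))
```

A low match rate on the real data would suggest the lat lon keys in the two files are not formatted identically.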

In [68]:
# Merge in the NOAA Station Eco data 
full_working_df = pd.merge(full_working_df, station_eco_data[["id","Ecosys", "Usgsid_sys", "Nlcd_code", "Nlcd"]], how="left", left_on= "id", right_on = "id", suffixes = ("_circle", "_station"))


In [69]:
full_working_df.shape

(108957, 78)
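
The `suffixes` argument only renames columns whose names appear in both frames, apart from the merge key. Because the circle ecosystem columns were merged in earlier under their plain names, this merge renames both sets at once. The toy frames below are illustrative only.

```python
import pandas as pd

# Left frame already holds circle-level Ecosys; right frame holds station-level Ecosys
circles = pd.DataFrame({"id": ["S1", "S2"], "Ecosys": [66.0, 274.0], "circle_elev": [10.0, 20.0]})
stations = pd.DataFrame({"id": ["S1", "S2"], "Ecosys": [324.0, 325.0]})

merged = pd.merge(circles, stations, how="left", on="id", suffixes=("_circle", "_station"))

# Only the overlapping column gets suffixes; circle_elev keeps its name
print(list(merged.columns))
```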

In [74]:
full_working_df['ui'].nunique()

52740

In [76]:
full_working_df.head()

Unnamed: 0,circle_name,country_state,lat,lon,count_year,count_date,n_field_counters,n_feeder_counters,min_field_parties,max_field_parties,field_hours,feeder_hours,nocturnal_hours,field_distance,nocturnal_distance,distance_units,min_temp,max_temp,temp_unit,min_wind,max_wind,wind_unit,min_snow,max_snow,snow_unit,am_cloud,pm_cloud,field_distance_imperial,field_distance_metric,nocturnal_distance_imperial,nocturnal_distance_metric,min_snow_imperial,min_snow_metric,max_snow_metric,max_snow_imperial,min_temp_imperial,max_temp_imperial,min_temp_metric,max_temp_metric,min_wind_metric,max_wind_metric,min_wind_imperial,max_wind_imperial,ui,geohash_circle,circle_id,id,latitude,longitude,elevation,state,name,gsn_flag,hcn_crn_flag,wmoid,geohash_station,temp_min_value,temp_max_value,precipitation_value,temp_avg,snow,snwd,am_rain,pm_rain,am_snow,pm_snow,circle_elev,block_fips,county_fips,Ecosys_circle,Usgsid_sys_circle,Nlcd_code_circle,Nlcd_circle,Ecosys_station,Usgsid_sys_station,Nlcd_code_station,Nlcd_station
0,Hawai'i: Volcano,US-HI,19.517,-155.3,1973,1972-12-30,14.0,0.0,5.0,5.0,57.0,0.0,0.0,169.0,0.0,Miles,37.0,75.0,2.0,0.0,5.0,1.0,0.0,0.0,2.0,1.0,2.0,169.0,271.966527,0.0,0.0,0.0,0.0,0.0,0.0,37.0,75.0,2.777778,23.888889,0.0,8.046347,0.0,5.0,19.516651-155.299965_1973,8e3x,8e3x40f,USC00516552,19.5486,-155.11,466.3,HI,MTN VIEW 91,,,,8e3x,144.0,244.0,18.0,,0.0,0.0,2,2,3,3,1551.44,,,,,,,,,,
1,Hawai'i: Volcano,US-HI,19.517,-155.3,1973,1972-12-30,14.0,0.0,5.0,5.0,57.0,0.0,0.0,169.0,0.0,Miles,37.0,75.0,2.0,0.0,5.0,1.0,0.0,0.0,2.0,1.0,2.0,169.0,271.966527,0.0,0.0,0.0,0.0,0.0,0.0,37.0,75.0,2.777778,23.888889,0.0,8.046347,0.0,5.0,19.516651-155.299965_1973,8e3x,8e3x40f,USC00511487,19.6833,-155.1667,487.7,HI,HILO COUNTRY CLUB 86,,,,8e3x,,,3.0,,0.0,0.0,2,2,3,3,1551.44,,,,,,,,,,
2,Hawai'i: Volcano,US-HI,19.517,-155.3,1973,1972-12-30,14.0,0.0,5.0,5.0,57.0,0.0,0.0,169.0,0.0,Miles,37.0,75.0,2.0,0.0,5.0,1.0,0.0,0.0,2.0,1.0,2.0,169.0,271.966527,0.0,0.0,0.0,0.0,0.0,0.0,37.0,75.0,2.777778,23.888889,0.0,8.046347,0.0,5.0,19.516651-155.299965_1973,8e3x,8e3x40f,USC00515021,19.5833,-155.3333,1748.3,HI,KULANI SCHOOL SITE 78,,,,8e3x,,,,,0.0,0.0,2,2,3,3,1551.44,,,,,,,,,,
3,Hawai'i: Volcano,US-HI,19.517,-155.3,1973,1972-12-30,14.0,0.0,5.0,5.0,57.0,0.0,0.0,169.0,0.0,Miles,37.0,75.0,2.0,0.0,5.0,1.0,0.0,0.0,2.0,1.0,2.0,169.0,271.966527,0.0,0.0,0.0,0.0,0.0,0.0,37.0,75.0,2.777778,23.888889,0.0,8.046347,0.0,5.0,19.516651-155.299965_1973,8e3x,8e3x40f,USC00515011,19.5494,-155.3011,1575.8,HI,KULANI CAMP 79,,,,8e3x,83.0,167.0,10.0,,0.0,0.0,2,2,3,3,1551.44,,,,,,,,,,
4,Hawai'i: Volcano,US-HI,19.517,-155.3,1973,1972-12-30,14.0,0.0,5.0,5.0,57.0,0.0,0.0,169.0,0.0,Miles,37.0,75.0,2.0,0.0,5.0,1.0,0.0,0.0,2.0,1.0,2.0,169.0,271.966527,0.0,0.0,0.0,0.0,0.0,0.0,37.0,75.0,2.777778,23.888889,0.0,8.046347,0.0,5.0,19.516651-155.299965_1973,8e3x,8e3x40f,USC00519025,19.6581,-155.1325,320.0,HI,WAIAKEA SCD 88.2,,,,8e3x,,,86.0,,0.0,0.0,2,2,3,3,1551.44,,,,,,,,,,


In [71]:
# Drop the temporary key 
full_working_df = full_working_df.drop("temp_key_str",axis=1)

In [73]:
full_working_df.to_csv('1.3-rec-connecting-fips-ecosystem-data.txt', compression = "gzip", sep="\t", index=False)
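
Because the output has a .txt extension, pandas cannot infer the compression from the file name, so a reader has to pass it explicitly. The sketch below writes a tiny made-up frame to a temporary folder and reads it back. It also reads the FIPS column as a string, which is an assumption about how a downstream user might want it, so that leading zeros are preserved.

```python
import os
import tempfile

import pandas as pd

# Made-up one-row frame standing in for the real output
demo = pd.DataFrame({"circle_name": ["Pueblo"], "county_fips": ["08101"]})
path = os.path.join(tempfile.mkdtemp(), "demo-output.txt")
demo.to_csv(path, compression="gzip", sep="\t", index=False)

# compression and sep must match what was used when writing
restored = pd.read_csv(path, compression="gzip", sep="\t", dtype={"county_fips": str})
print(restored)
```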