# Enhancing the Data with Census FIPS and Ecosystem Data
### Purpose
In this notebook I add columns to the working data set that contain 1) the Census-declared block FIPS and county FIPS, 2) the USGS-declared ecosystem for each CBC location, and 3) the USGS-declared ecosystem for each NOAA station location.


### Author: 
Ren C'deBaca
### Date: 
2020-04-21
### Update Date: 
2020-04-26

### Inputs 
1.0-rec-initial-data-cleaning.txt - Tab-separated file of cleaned Christmas Bird Count events. Each row represents a single count in a given year. The data dictionary can be found here: http://www.audubon.org/sites/default/files/documents/cbc_report_field_definitions_2013.pdf

np-circles-to-ecosys_data.csv - Comma-separated file from Nathan Pavlovic (nathan.pavlovic@gmail.com). This file was produced by first passing Nathan a file of the approximately 4000 unique lat lons present in the clean data file.

Nathan then used the 2008 USGS raster ecosystem dataset. Info here https://rmgsc.cr.usgs.gov/outgoing/ecosystems/USdata/  

He used the Extract Values to Points tool in ArcGIS to find the raster value at each point. 

unique_stations_latlong_ecosys.csv - Comma-separated file from Nathan Pavlovic (nathan.pavlovic@gmail.com). This file was produced by first passing Nathan a file of the unique NOAA station lat lons that were present in the file 1.1-circles_to_many_stations_usa_weather_data_20200424213015.csv. See the above notes on his process.

1.1-circles_to_many_stations_usa_weather_data_20200424213015.csv - CSV file of cbc circles matched with NOAA stations. Each row is a cbc circle matched to a NOAA station. A cbc location can appear on multiple rows if it is matched to multiple stations.




### Output Files
1.3-rec-connecting-fips-data.csv -- CSV file of the unique lat lons present in cbc data. Each Lat lon is matched to a Block FIPS and County FIPS. (This is the file that was shared with Nathan) 

1.3-rec-connecting-fips-ecosystem-data.csv -- CSV file of the station matched cbc data with added columns for ecosystem data for cbc circles and NOAA stations and census FIPS data


## Steps or Procedures in the notebook 
1. Load in the cleaned data
2. Identify the unique lat lons present in the cbc circle locations
3. Get the census FIPS data using one of two options:
    - OPTION 1: Load in the saved census FIPS data
    - OPTION 2: Send the unique lat lons through a census API to find the block and county FIPS
4. Load in ecological data from Nathan
5. Create a key based on the lat lon of the cbc circles to merge the station matched data with the ecological data
6. Merge in the census FIPS data, the cbc ecological data, and the NOAA station ecological data


## Where the Data will Be Saved 
The raw ecosystem data and the output data will be saved in the Google Drive Folder
https://drive.google.com/drive/folders/1Nlj9Nq-_dPFTDbrSDf94XMritWYG6E2I

The path should look like this: 
audubon-cbc/data/Cloud_Data/<DATA FILE>

## Reference
    https://geo.fcc.gov/api/census/#!/block/get_block_find


In [1]:
# Imports
import os
from datetime import datetime
# Version .24.0
from google.cloud import bigquery
import pandas as pd
import requests
import time
import numpy as np

In [2]:
# ALL File Paths should be declared at the TOP of the notebook
PATH_TO_CLEAN_CBC_DATA = "../data/Cloud_Data/1.0-rec-initial-data-cleaning.txt"
PATH_TO_CBC_DATA_WITH_STATIONS = "../data/Cloud_Data/1.1-circles_to_many_stations_usa_weather_data_20200424213015.csv"


PATH_TO_CBC_ECO_DATA = "../data/np-circles-to-ecosys_data.csv" 
PATH_TO_NOAA_ECO_DATA = "../data/unique_stations_latlong_ecosys.csv"



## Load in the Clean Data

In [3]:
clean_data = pd.read_csv(PATH_TO_CLEAN_CBC_DATA, encoding = "ISO-8859-1", sep="\t")



In [4]:
clean_data.shape

(89568, 48)

In [5]:
clean_data.head()

Unnamed: 0.1,Unnamed: 0,circle_name,country_state,lat,lon,count_year,count_date,n_field_counters,n_feeder_counters,min_field_parties,...,max_snow_metric,max_snow_imperial,min_temp_imperial,max_temp_imperial,min_temp_metric,max_temp_metric,min_wind_metric,max_wind_metric,min_wind_imperial,max_wind_imperial
0,2,Pacific Grove,US-CA,36.6167,-121.9167,1901,12/25/00,1.0,,,...,,,,,,,,,,
1,3,Pueblo,US-CO,38.175251,-104.519575,1901,12/25/00,1.0,,,...,,,,,,,,,,
2,4,Bristol,US-CT,41.6718,-72.9495,1901,12/25/00,2.0,,,...,,,,,,,,,,
3,5,Norwalk,US-CT,41.1167,-73.4,1901,12/25/00,1.0,,,...,,,,,,,,,,
4,6,Glen Ellyn,US-IL,41.8833,-88.0667,1901,12/25/00,1.0,,,...,,,,,,,,,,


### Create a string key to represent a unique lat lon combination 

In [6]:
clean_data['temp_key_str'] = clean_data['lat'].astype(str) + clean_data['lon'].astype(str)

In [7]:
clean_data['temp_key_str'].nunique()

4531
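
The key above joins the two strings with no separator. For negative longitudes the minus sign happens to act as one, but for positive longitudes two different points could in principle produce the same string. The sketch below illustrates the issue on made-up coordinates and shows one possible alternative; `make_latlon_key` and the `_` separator are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

def make_latlon_key(df, lat_col="lat", lon_col="lon", sep="_"):
    """Return one string key per row, with a separator between lat and lon."""
    return df[lat_col].astype(str) + sep + df[lon_col].astype(str)

# Two made-up points chosen so that plain concatenation collides
demo = pd.DataFrame({"lat": [1.2, 1.23], "lon": [34.5, 4.5]})

plain_keys = demo["lat"].astype(str) + demo["lon"].astype(str)
safe_keys = make_latlon_key(demo)

print(plain_keys.nunique())  # both rows share one key
print(safe_keys.nunique())   # each row gets its own key
```

If a separator were adopted, the same function would have to be used for every data frame that is merged on the key.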

## Census Data 
There are two options here:

- OPTION 1: Load in the saved census FIPS data
- OPTION 2: Send the unique lat lons through a census API to find the block and county FIPS

### Option 1: Load in the saved census FIPS data 

In [8]:
## Option 1: Uncomment the next section to load the data from file
census_prep_df = pd.read_csv("1.3-rec-connecting-fips-data.csv")
census_prep_df = census_prep_df[["lat", "lon", "block_fips", "county_fips"]]
census_prep_df['temp_key_str'] = census_prep_df['lat'].astype(str) + census_prep_df['lon'].astype(str)
print(clean_data.shape)
census_prep_df.head()

(89568, 49)


Unnamed: 0,lat,lon,block_fips,county_fips,temp_key_str
0,36.6167,-121.9167,60530120000000.0,6053.0,36.6167-121.9167
1,38.175251,-104.519575,81010030000000.0,8101.0,38.175251-104.519575
2,41.6718,-72.9495,90034060000000.0,9003.0,41.6718-72.9495
3,41.1167,-73.4,90010440000000.0,9001.0,41.1167-73.4
4,41.8833,-88.0667,170438400000000.0,17043.0,41.8833-88.0667
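
In the output above, pandas has read the FIPS columns as floats, which drops leading zeros and displays the 15-digit block code in rounded form. A minimal sketch of reading the codes as text instead is shown below, assuming the saved file stores the codes as digit strings. The inline CSV and its FIPS values are made up for illustration and stand in for the real file.

```python
import io
import pandas as pd

# Made-up sample standing in for 1.3-rec-connecting-fips-data.csv
sample_csv = io.StringIO(
    "lat,lon,block_fips,county_fips\n"
    "36.6167,-121.9167,060530120001000,06053\n"
)

# Ask pandas to keep the two code columns as strings
fips_df = pd.read_csv(sample_csv, dtype={"block_fips": str, "county_fips": str})
print(fips_df.dtypes)
```

Keeping the codes as strings would change the displayed values in later cells, so this is an option to consider rather than a drop-in replacement.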


### Option 2: Run the data through the census API (Note: Takes a few hours) 

In [None]:
# Create a small dataframe of unique lat lon locations to use with census data 
census_prep_df = clean_data[['temp_key_str', 'lat', 'lon']]

In [None]:
census_prep_df.shape

In [None]:
# Drop duplicate rows 
census_prep_df = census_prep_df.drop_duplicates(subset=['lat', 'lon'], keep= 'first') 

In [None]:
census_prep_df.shape

### Create a test call to the API to see how the data comes back 

In [None]:
# Test Lat and Lon
lat = 51.409713
lon = 179.284881

BASE_URL = "https://geo.fcc.gov/api/census/block/find?format=json&latitude=%s&longitude=%s"
url = BASE_URL % (lat, lon)

payload = {}
headers= {}

response = requests.request("GET", url, headers=headers, data = payload)

print(response.text.encode('utf8'))
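
Once the shape of the response is known, the two codes can be pulled out of the decoded JSON with a small helper. The sketch below only assumes the `Block` and `County` entries with a `FIPS` field that the loop in the next section reads; `extract_fips` is a hypothetical name and the payloads are made up.

```python
def extract_fips(payload):
    """Return (block_fips, county_fips) from a decoded block/find response.

    Missing or null entries come back as None instead of raising.
    """
    payload = payload or {}
    block = payload.get("Block") or {}
    county = payload.get("County") or {}
    return block.get("FIPS"), county.get("FIPS")

# Made-up payloads shaped like the keys this notebook reads
found = {"Block": {"FIPS": "060530120001000"}, "County": {"FIPS": "06053"}}
not_found = {"Block": {"FIPS": None}, "County": {"FIPS": None}}

print(extract_fips(found))
print(extract_fips(not_found))
```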

### Build a loop that collects the block FIPS code and county FIPS code from the census API for each location

In [None]:
result_list = []
county_result_list = []

BASE_URL = "https://geo.fcc.gov/api/census/block/find?format=json&latitude=%s&longitude=%s"

TIME_DELAY = 2

for index, row in census_prep_df.iterrows():
    block_fips = ''
    county_fips = ''
    
    lat = row['lat']
    lon = row['lon']
    
    url = BASE_URL % (lat, lon)
    payload = {}
    headers= {}
    response = requests.request("GET", url, headers=headers, data = payload)

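    try:
        # Parse the JSON body once and pull out both FIPS codes
        result = response.json()
        block_fips = result['Block']['FIPS']
        county_fips = result['County']['FIPS']
    except (KeyError, TypeError, ValueError):
        # Leave the codes at their defaults when the lookup fails
        print("Could not get FIPS for", lat, lon)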
        
    result_list.append(block_fips)
    county_result_list.append(county_fips)
    
    time.sleep(TIME_DELAY)
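
Because the full run takes hours, it can help to exercise the loop logic without touching the network. The sketch below is one possible restructuring, not the code used to produce the saved file: `collect_fips` and `fake_lookup` are hypothetical names, and the lookup is passed in as a function so that a stand-in can replace the real API call.

```python
import time
import pandas as pd

def collect_fips(points, lookup, delay_seconds=0.0):
    """Call lookup(lat, lon) once per unique point and return a DataFrame."""
    records = []
    unique_points = points[["lat", "lon"]].drop_duplicates()
    for row in unique_points.itertuples(index=False):
        try:
            block_fips, county_fips = lookup(row.lat, row.lon)
        except Exception as err:
            # Keep going, but report which point failed
            print(f"Could not get FIPS for ({row.lat}, {row.lon}): {err}")
            block_fips, county_fips = None, None
        records.append({"lat": row.lat, "lon": row.lon,
                        "block_fips": block_fips, "county_fips": county_fips})
        time.sleep(delay_seconds)
    return pd.DataFrame(records, columns=["lat", "lon", "block_fips", "county_fips"])

def fake_lookup(lat, lon):
    """Stand-in for the real API call; fails for one point on purpose."""
    if lat > 50:
        raise ValueError("no block found")
    return "060530120001000", "06053"

# Made-up input with one duplicate point and one point that fails
points = pd.DataFrame({"lat": [36.6167, 36.6167, 51.409713],
                       "lon": [-121.9167, -121.9167, 179.284881]})
fips_table = collect_fips(points, fake_lookup)
print(fips_table)
```

Returning the coordinates alongside the codes means the result can be merged back by lat and lon rather than relying on row order.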


In [None]:
print(len(result_list))
print(len(county_result_list))


In [None]:
# Turn the result lists into pandas Series 
result_arry = pd.Series(result_list)
county_array = pd.Series(county_result_list)

In [None]:
# Add the series into the data frame 
census_prep_df['block_fips'] = result_arry.values
census_prep_df['county_fips'] = county_array.values

In [None]:
census_prep_df.head()

### Choose to save the data to a file

In [None]:
## Save the data to a file 
#census_prep_df.to_csv('1.3-rec-connecting-fips-data.csv')

# Add Ecosystem Data to the Working Dataset

### Notes: The file 1.3-rec-connecting-fips-data.csv is the file I passed to Nathan for ecosystem processing. He then returned a dataset with the ecosystem data added as columns. The next section adds in the ecosystem data.

## Load in Ecosystem data for the CBC Circles 

In [9]:
eco_data = pd.read_csv(PATH_TO_CBC_ECO_DATA)

In [10]:
eco_data.shape

(4531, 15)

In [11]:
# Take the Columns we Need
eco_data = eco_data[["lat","lon","Ecosys", "Usgsid_sys", "Nlcd_code", "Nlcd"]]

In [12]:
eco_data.head()

Unnamed: 0,lat,lon,Ecosys,Usgsid_sys,Nlcd_code,Nlcd
0,36.6167,-121.9167,66.0,66_California Coastal Live Oak Woodland and Sa...,3.0,Steppe/Savanna
1,38.175251,-104.519575,274.0,274_Western Great Plains Shortgrass Prairie,4.0,Herbaceous
2,41.6718,-72.9495,300.0,300_Appalachian (Hemlock)-Northern Hardwood Fo...,1.0,Forest and Woodland
3,41.1167,-73.4,487.0,487_Northern Atlantic Coastal Plain Pitch Pine...,1.0,Forest and Woodland
4,41.8833,-88.0667,254.0,254_North-Central Interior Beech-Maple Forest,1.0,Forest and Woodland


In [13]:
# Create a temporary key to merge on
eco_data['temp_key_str'] = eco_data['lat'].astype(str) + eco_data['lon'].astype(str)


In [14]:
eco_data.head()

Unnamed: 0,lat,lon,Ecosys,Usgsid_sys,Nlcd_code,Nlcd,temp_key_str
0,36.6167,-121.9167,66.0,66_California Coastal Live Oak Woodland and Sa...,3.0,Steppe/Savanna,36.6167-121.9167
1,38.175251,-104.519575,274.0,274_Western Great Plains Shortgrass Prairie,4.0,Herbaceous,38.175251-104.519575
2,41.6718,-72.9495,300.0,300_Appalachian (Hemlock)-Northern Hardwood Fo...,1.0,Forest and Woodland,41.6718-72.9495
3,41.1167,-73.4,487.0,487_Northern Atlantic Coastal Plain Pitch Pine...,1.0,Forest and Woodland,41.1167-73.4
4,41.8833,-88.0667,254.0,254_North-Central Interior Beech-Maple Forest,1.0,Forest and Woodland,41.8833-88.0667


## Now Load and Merge in the Station Eco Data
We won't need a temporary key for this file because the station IDs are unique

In [15]:
station_eco_data = pd.read_csv(PATH_TO_NOAA_ECO_DATA)

In [16]:
station_eco_data.head()

Unnamed: 0,X,id,latitude,longitude,RASTERVALU,Red,Green,Blue,Opacity,Ecosys,Usgsid_sys,Nlcd_code,Nlcd
0,0,USC00500252,51.3833,179.2833,,,,,,,,,
1,2,USW00014607,46.8706,-68.0172,324.0,0.504556,0.623333,0.369333,1.0,324.0,324_Laurentian-Acadian Northern Hardwoods Forest,1.0,Forest and Woodland
2,9,USC00176937,46.6539,-68.0089,325.0,0.469988,0.594037,0.333426,1.0,325.0,325_Laurentian-Acadian Pine-Hemlock-Hardwood F...,1.0,Forest and Woodland
3,49,US1MEAR0015,46.6796,-68.0127,324.0,0.504556,0.623333,0.369333,1.0,324.0,324_Laurentian-Acadian Northern Hardwoods Forest,1.0,Forest and Woodland
4,56,USC00171833,45.6611,-67.8614,324.0,0.504556,0.623333,0.369333,1.0,324.0,324_Laurentian-Acadian Northern Hardwoods Forest,1.0,Forest and Woodland


In [17]:
station_eco_data.shape

(11652, 13)

## Merge in the FIPS census data, the CBC circle Ecosystem data, and the NOAA station data with the Station Matched Data 

In [18]:
# Load in the file of noaa matched cbc circles
full_station_df = pd.read_csv(PATH_TO_CBC_DATA_WITH_STATIONS, compression = "gzip")

In [19]:
full_station_df.head()

Unnamed: 0.1,Unnamed: 0,int64_field_0,circle_name,country_state,lat,lon,count_year,count_date,n_field_counters,n_feeder_counters,...,gsn_flag,hcn_crn_flag,wmoid,geohash_station,temp_min_value,temp_max_value,precipitation_value,temp_avg,snow,snwd
0,7,32617,Amchitka Island,US-AK,51.409713,179.284881,1980,1979-12-18,4.0,,...,,,,zcpk,-17.0,17.0,5.0,,3.0,0.0
1,8,52625,Amchitka Island,US-AK,51.409713,179.284881,1993,1992-12-20,2.0,0.0,...,,,,zcpk,,,,,0.0,0.0
2,28,90930,Caribou,US-ME,46.912573,-67.947428,2012,2011-12-28,10.0,3.0,...,GSN,,72712.0,f2rd,-83.0,78.0,71.0,,8.0,25.0
3,30,93245,Caribou,US-ME,46.912573,-67.947428,2013,2012-12-29,10.0,4.0,...,GSN,,72712.0,f2rd,-139.0,-61.0,0.0,,0.0,229.0
4,32,95653,Caribou,US-ME,46.912573,-67.947428,2014,2014-01-01,7.0,5.0,...,GSN,,72712.0,f2rd,-282.0,-155.0,0.0,,3.0,460.0


In [20]:
full_station_df.shape

(109390, 67)

In [21]:
full_station_df['temp_key_str'] = full_station_df['lat'].astype(str) + full_station_df['lon'].astype(str)

In [22]:
# Merge in the FIPS data with the full station data
full_station_df = pd.merge(
    full_station_df,
    census_prep_df[["temp_key_str", "block_fips", "county_fips"]],
    how="left",
    on="temp_key_str",
)




In [23]:
full_station_df.shape

(109390, 70)

In [24]:
# Merge in the CBC Circle eco data 
full_station_df = pd.merge(
    full_station_df,
    eco_data[["temp_key_str", "Ecosys", "Usgsid_sys", "Nlcd_code", "Nlcd"]],
    how="left",
    on="temp_key_str",
)


In [25]:
full_station_df.shape

(109390, 74)

In [26]:
full_station_df.head()

Unnamed: 0.1,Unnamed: 0,int64_field_0,circle_name,country_state,lat,lon,count_year,count_date,n_field_counters,n_feeder_counters,...,temp_avg,snow,snwd,temp_key_str,block_fips,county_fips,Ecosys,Usgsid_sys,Nlcd_code,Nlcd
0,7,32617,Amchitka Island,US-AK,51.409713,179.284881,1980,1979-12-18,4.0,,...,,3.0,0.0,51.409713179.284881,20160000000000.0,2016.0,,,,
1,8,52625,Amchitka Island,US-AK,51.409713,179.284881,1993,1992-12-20,2.0,0.0,...,,0.0,0.0,51.409713179.284881,20160000000000.0,2016.0,,,,
2,28,90930,Caribou,US-ME,46.912573,-67.947428,2012,2011-12-28,10.0,3.0,...,,8.0,25.0,46.912573-67.947428,,,,,,
3,30,93245,Caribou,US-ME,46.912573,-67.947428,2013,2012-12-29,10.0,4.0,...,,0.0,229.0,46.912573-67.947428,,,,,,
4,32,95653,Caribou,US-ME,46.912573,-67.947428,2014,2014-01-01,7.0,5.0,...,,3.0,460.0,46.912573-67.947428,,,,,,


In [27]:
# check that the merge went through
full_station_df['Usgsid_sys'].value_counts()

600_Water                                                       5497
254_North-Central Interior Beech-Maple Forest                   4722
300_Appalachian (Hemlock)-Northern Hardwood Forest              2724
301_Northeastern Interior Dry-Mesic Oak Forest                  2357
324_Laurentian-Acadian Northern Hardwoods Forest                2259
                                                                ... 
160_Rocky Mountain Alpine Fell-Field                               2
53_California Central Valley Alkali Sink                           1
530_Columbia Plateau Low Sagebrush Steppe                          1
132_Boreal White Spruce Forest and Woodland                        1
84_California Montane Jeffrey Pine-(Ponderosa Pine) Woodland       1
Name: Usgsid_sys, Length: 224, dtype: int64
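
`value_counts()` skips missing values, so it does not show how many rows found no match; the `head()` output above has rows whose merged columns are all NaN. One way to count them is the `indicator` option of `pd.merge`, sketched below on made-up frames. The `validate` argument additionally raises an error if the right-hand keys are not unique, which would otherwise inflate the row count.

```python
import pandas as pd

# Made-up frames: key "b" has no match on the right
left = pd.DataFrame({"temp_key_str": ["a", "a", "b"],
                     "count_year": [1980, 1993, 2012]})
right = pd.DataFrame({"temp_key_str": ["a"],
                      "Usgsid_sys": ["600_Water"]})

checked = pd.merge(left, right, how="left", on="temp_key_str",
                   indicator=True, validate="many_to_one")

# "both" means a match was found, "left_only" means no match
match_counts = checked["_merge"].value_counts()
unmatched_keys = checked.loc[checked["_merge"] == "left_only", "temp_key_str"].unique()

print(match_counts)
print(unmatched_keys)
```

The `_merge` column would need to be dropped before any further merge that also uses `indicator=True`.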

In [28]:
# Merge in the NOAA Station Eco data 
full_station_df = pd.merge(
    full_station_df,
    station_eco_data[["id", "Ecosys", "Usgsid_sys", "Nlcd_code", "Nlcd"]],
    how="left",
    on="id",
    suffixes=("_circle", "_station"),
)


In [29]:
full_station_df.shape

(109390, 78)

In [30]:
full_station_df.head()

Unnamed: 0.1,Unnamed: 0,int64_field_0,circle_name,country_state,lat,lon,count_year,count_date,n_field_counters,n_feeder_counters,...,block_fips,county_fips,Ecosys_circle,Usgsid_sys_circle,Nlcd_code_circle,Nlcd_circle,Ecosys_station,Usgsid_sys_station,Nlcd_code_station,Nlcd_station
0,7,32617,Amchitka Island,US-AK,51.409713,179.284881,1980,1979-12-18,4.0,,...,20160000000000.0,2016.0,,,,,,,,
1,8,52625,Amchitka Island,US-AK,51.409713,179.284881,1993,1992-12-20,2.0,0.0,...,20160000000000.0,2016.0,,,,,,,,
2,28,90930,Caribou,US-ME,46.912573,-67.947428,2012,2011-12-28,10.0,3.0,...,,,,,,,324.0,324_Laurentian-Acadian Northern Hardwoods Forest,1.0,Forest and Woodland
3,30,93245,Caribou,US-ME,46.912573,-67.947428,2013,2012-12-29,10.0,4.0,...,,,,,,,324.0,324_Laurentian-Acadian Northern Hardwoods Forest,1.0,Forest and Woodland
4,32,95653,Caribou,US-ME,46.912573,-67.947428,2014,2014-01-01,7.0,5.0,...,,,,,,,324.0,324_Laurentian-Acadian Northern Hardwoods Forest,1.0,Forest and Woodland


In [32]:
# Drop the temporary key 
full_station_df = full_station_df.drop("temp_key_str",axis=1)

In [33]:
full_station_df.to_csv('1.3-rec-connecting-fips-ecosystem-data.csv', compression = "gzip")
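
The `Unnamed: 0` and `Unnamed: 0.1` columns seen in the data frames above are what pandas produces when a file is written with its index and then read back. A small round-trip sketch on a made-up frame and a temporary directory shows the difference that `index=False` makes; it is offered as an option, since downstream notebooks may already expect the extra column.

```python
import os
import tempfile
import pandas as pd

# Made-up frame standing in for full_station_df
demo = pd.DataFrame({"circle_name": ["Caribou"], "count_year": [2012]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Written without the index
    path = os.path.join(tmp_dir, "demo.csv.gz")
    demo.to_csv(path, index=False, compression="gzip")
    round_trip = pd.read_csv(path, compression="gzip")

    # Written with the default index, as in the cell above
    with_index_path = os.path.join(tmp_dir, "demo_with_index.csv.gz")
    demo.to_csv(with_index_path, compression="gzip")
    round_trip_with_index = pd.read_csv(with_index_path, compression="gzip")

print(list(round_trip.columns))
print(list(round_trip_with_index.columns))
```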