# Fetch Circle Elevations
### Purpose
In this notebook I query the USGS to get elevation data for all of the circles.
This notebook addresses one of the tasks in GitHub issue #35.

### Author: 
Ian Davis
### Date: 
2020-03-31
### Update Date: 
2020-07-09

### Inputs 
1.1-circles_to_many_stations_usa_weather_data_20200424213015.txt - Tab separated file of the Christmas Bird Count and matches to 1 or more NOAA weather stations.
- Data dictionary can be found here: http://www.audubon.org/sites/default/files/documents/cbc_report_field_definitions_2013.pdf

1.0-rec-initial-data-cleaning.txt - Tab-separated file of cleaned CBC data where each row holds a circle's count information for a given count year.

1.2.1-ijd-fetch-circle-elevations-OFFLINE.csv - Previously generated elevation data. Use this file when you want to load the elevation data from an offline source and avoid 100,000+ queries.

### Output Files
1.2-ijd-fetch-circle-elevations_20200502155633.csv - Only 1 column is added to the dataset, 'circle_elev'. This column is the elevation in meters for a given latitude and longitude of the circle centroid.

## Steps or Procedures in the Notebook
- Set runtime options
    - Set option to retrieve elevations from offline source, or through the USGS queries
    - Set option to only test the USGS query (NOTE: running the query function for the whole dataset will take 24+ hours)
- Create a function to make a remote request to the USGS API
- Create a function to supply inputs to the remote request and return the elevation value
- Main sequence
    - Read in dataset
    - Create a list of unique lat lon combinations 
    - Loop through the unique lat lons to get elevation data from usgs
    - (Optional) Retrieve elevations from offline data source instead of queries
    - Merge in the unique lat lon data with the full paired data file
    - Write new dataset .txt file

## References
- elevation query: https://stackoverflow.com/questions/58350063/obtain-elevation-from-latitude-longitude-coordinates-with-a-simple-python-script
- lambda functions: https://thispointer.com/python-how-to-use-if-else-elif-in-lambda-functions/
- apply on Nulls: https://stackoverflow.com/questions/26614465/python-pandas-apply-function-if-a-column-value-is-not-null

In [1]:
# Imports 
import pandas as pd
import numpy as np
import requests
import urllib.parse
import urllib3
import time
import gzip
import logging
import sys
from datetime import datetime

In [2]:
# Check to see if you are running 32-bit Python (output would be False)
# 32-bit Python could result in Memory Error when reading in large dataset
import sys
sys.maxsize > 2**32

True
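
A complementary way to confirm the interpreter's word size is to look at the pointer size directly. The sketch below is illustrative only and is not part of the original notebook; `python_bits` is a hypothetical helper name.

```python
import struct
import sys


def python_bits() -> int:
    """Return the pointer width of the running interpreter in bits."""
    # struct.calcsize("P") is the size in bytes of a C pointer (void *)
    return struct.calcsize("P") * 8


bits = python_bits()
print(f"Running {bits}-bit Python")
```

On a 64-bit interpreter this agrees with `sys.maxsize > 2**32` evaluating to `True`.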

## Set File Paths and Runtime Options

In [3]:
# Timestamp used to give the log and output files unique names
time_now = datetime.today().strftime('%Y%m%d%H%M%S')

# File paths and script options
PATH_TO_PAIRED_DATA = "../data/Cloud_Data/1.1-circles_to_many_stations_usa_weather_data_20200709102406.txt"
PATH_TO_CLEAN_CBC_DATA = "../data/Cloud_Data/1.0-rec-initial-data-cleaning.txt"
PATH_TO_OFFLINE_ELEVATION_DATA = "../data/Cloud_Data/1.2.1-ijd-fetch-circle-elevations-OFFLINE.csv"
PATH_TO_LOG_FILE = "../data/Cloud_Data/1.2-ijd-fetch_circle_elevations_"+time_now+".log"

# option to load offline elevation data (PATH_TO_OFFLINE_ELEVATION_DATA) instead of running the queries
get_offline_data = True

# option to run a simple test of the query; only 1000 rows are queried instead of full dataset
test_query = True
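
The `test_query` flag is defined here, but the query cell shown later in this notebook does not appear to consult it. One possible way to honor it is sketched below; `select_rows_to_query` and `n_test_rows` are hypothetical names, and the demo frame is made-up data.

```python
import pandas as pd


def select_rows_to_query(locations: pd.DataFrame,
                         test_query: bool,
                         n_test_rows: int = 1000) -> pd.DataFrame:
    """Return only the first n_test_rows rows in test mode, otherwise every row."""
    if test_query:
        return locations.head(n_test_rows)
    return locations


# Made-up demo data standing in for the unique lat/lon table
demo = pd.DataFrame({"lat": [36.6, 38.2, 41.7], "lon": [-121.9, -104.5, -72.9]})
subset = select_rows_to_query(demo, test_query=True, n_test_rows=2)
print(len(subset))
```

If something like this were used, any later step that assigns the results back to the full table would need to be restricted to the same subset of rows.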

## Load in the Clean Data to Create a List of Unique Locations 

In [4]:
clean_data = pd.read_csv(PATH_TO_CLEAN_CBC_DATA, encoding = "ISO-8859-1", sep="\t")

print(clean_data.shape)

clean_data['ui'].nunique()

clean_data.head()

(90411, 48)




Unnamed: 0,circle_name,country_state,lat,lon,count_year,count_date,n_field_counters,n_feeder_counters,min_field_parties,max_field_parties,...,max_snow_imperial,min_temp_imperial,max_temp_imperial,min_temp_metric,max_temp_metric,min_wind_metric,max_wind_metric,min_wind_imperial,max_wind_imperial,ui
0,Pacific Grove,US-CA,36.6167,-121.9167,1901,1900-12-25,1.0,,,,...,,,,,,,,,,36.6167-121.9167_1901
1,Pueblo,US-CO,38.175251,-104.519575,1901,1900-12-25,1.0,,,,...,,,,,,,,,,38.175251-104.519575_1901
2,Bristol,US-CT,41.6718,-72.9495,1901,1900-12-25,2.0,,,,...,,,,,,,,,,41.6718-72.9495_1901
3,Norwalk,US-CT,41.1167,-73.4,1901,1900-12-25,1.0,,,,...,,,,,,,,,,41.1167-73.4_1901
4,Glen Ellyn,US-IL,41.8833,-88.0667,1901,1900-12-25,1.0,,,,...,,,,,,,,,,41.8833-88.0667_1901


In [5]:
# Create a temporary string key to merge on (lat and lon rounded to 3 decimals)
clean_data['temp_key_str'] = round(clean_data['lat'],3).astype(str) + round(clean_data['lon'],3).astype(str)
    
print("The number of unique Lat Lon combos in the dataset is: ")   
clean_data['temp_key_str'].nunique()

The number of unique Lat Lon combos in the dataset is: 


4584

In [6]:
clean_data_unique = clean_data[["lat", "lon", "temp_key_str"]]
clean_data_unique.shape

(90411, 3)

In [7]:
clean_data_unique = clean_data_unique.drop_duplicates("temp_key_str")
clean_data_unique.shape

(4584, 3)

In [8]:
clean_data_unique.head()

Unnamed: 0,lat,lon,temp_key_str
0,36.6167,-121.9167,36.617-121.917
1,38.175251,-104.519575,38.175-104.52
2,41.6718,-72.9495,41.672-72.95
3,41.1167,-73.4,41.117-73.4
4,41.8833,-88.0667,41.883-88.067
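
The key construction and de-duplication above can be reproduced on a tiny example. This is a minimal sketch, assuming the same rounding to 3 decimals; `make_latlon_key` is a hypothetical helper, and the second row is a made-up near-duplicate of the first.

```python
import pandas as pd


def make_latlon_key(df: pd.DataFrame, decimals: int = 3) -> pd.Series:
    """Concatenate rounded latitude and longitude into one string key."""
    return df["lat"].round(decimals).astype(str) + df["lon"].round(decimals).astype(str)


demo = pd.DataFrame({
    "lat": [36.6167, 36.61671, 38.175251],
    "lon": [-121.9167, -121.91668, -104.519575],
})
demo["temp_key_str"] = make_latlon_key(demo)

# Rows that round to the same key collapse into one location
unique_locations = demo.drop_duplicates("temp_key_str")
print(unique_locations)
```

Note that the key has no separator between the two numbers. For the negative longitudes shown here the minus sign acts as a delimiter, but positive longitudes could in principle produce ambiguous keys, so adding an explicit separator may be worth considering.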


In [9]:
# Save the data if necessary
# clean_data_unique.to_csv('../data/Cloud_data/1.2-ijd-fetch-circle-elevations_usgs.csv', index=False)

## Create a Log File

In [10]:
# if not get_offline_data:
#     logging.basicConfig(filename=PATH_TO_LOG_FILE, 
#                         filemode='w', 
#                         format='%(message)s', 
#                         level=logging.INFO)
#     logging.info('This log file shows the row index, lat, lon\n')

## Create a function to make a remote request to the USGS API

In [11]:
def make_remote_request(url: str, params: dict):
    """
    Makes the remote request
    Continues making attempts until it succeeds
    """

    count = 1
    while True:
        try:
            response = requests.get((url + urllib.parse.urlencode(params)))
            time.sleep(1)
        except (OSError, urllib3.exceptions.ProtocolError) as error:
            logging.info('\n')
            logging.info('%s Error Occurred %s', '*' * 20, '*' * 20)
            logging.info(f'Number of tries: {count}')
            logging.info(f'URL: {url}')
            logging.info(error)
            logging.info('\n')
            count += 1
            time.sleep(0.5)
            continue
        break

    return response
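
The function above retries forever with a fixed pause. A bounded variant with exponential backoff is sketched below as one possible alternative, not as the notebook's method. `retry_with_backoff` and `flaky` are hypothetical names, and the network call is replaced by a stand-in so the example runs offline.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(func: Callable[[], T],
                       max_tries: int = 5,
                       base_delay: float = 0.5,
                       exceptions: Tuple[Type[BaseException], ...] = (OSError,)) -> T:
    """Call func until it succeeds or max_tries is reached, doubling the wait each time."""
    for attempt in range(1, max_tries + 1):
        try:
            return func()
        except exceptions:
            if attempt == max_tries:
                raise  # give up and surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_tries must be at least 1")


# Stand-in for the remote request: fails twice, then succeeds
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("simulated connection error")
    return "ok"


result = retry_with_backoff(flaky, max_tries=5, base_delay=0.01)
print(result, calls["n"])
```

Catching `OSError` is enough to cover connection failures raised by `requests`, since its exception classes derive from `IOError`, which is an alias of `OSError` in Python 3.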

## Create a function to supply inputs to the remote request and return the elevation value

In [12]:
def elevation_function(x):
    """
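    x - row (pandas Series) holding latitude at position 0 and longitude at position 1
        (longitude is sent to the service as 'x', latitude as 'y')
    returns elevation in meters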
    """
    
    url = 'https://nationalmap.gov/epqs/pqs.php?'
    params = {'x': x[1],
              'y': x[0],
              'units': 'Meters',
              'output': 'json'}
    logging.info(str(x.name)+'\t\t'+str(x[0])+'\t\t'+str(x[1]))   # print row index, lat, lon
    result = make_remote_request(url, params)
    
    return result.json()['USGS_Elevation_Point_Query_Service']['Elevation_Query']['Elevation']
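
The return statement above assumes the response always contains the nested `Elevation` field. A more defensive parse is sketched below under that same key layout; `parse_elevation` and `NO_DATA_THRESHOLD` are hypothetical names, and the sample payload is made up for illustration rather than captured from the service.

```python
import math

# Same cutoff that the screening cell later in this notebook uses for bad values
NO_DATA_THRESHOLD = -10000.0


def parse_elevation(payload: dict) -> float:
    """Extract the elevation from a decoded response, or NaN if missing or invalid."""
    try:
        value = payload["USGS_Elevation_Point_Query_Service"]["Elevation_Query"]["Elevation"]
        elevation = float(value)
    except (KeyError, TypeError, ValueError):
        return math.nan
    # Treat implausibly low values as "no data" rather than real elevations
    return math.nan if elevation < NO_DATA_THRESHOLD else elevation


# Made-up payload that mirrors the key path used in elevation_function
sample = {"USGS_Elevation_Point_Query_Service": {"Elevation_Query": {"Elevation": 12.34}}}
print(parse_elevation(sample))
```

Returning NaN instead of raising keeps a single malformed response from interrupting a long run of queries.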

## Collect Data From USGS for the Unique Lat Lon Locations

In [None]:
if not get_offline_data:
    # Work on a copy so the assignment below does not trigger SettingWithCopyWarning
    temp = clean_data_unique[['lat', 'lon']].copy()
    temp['circle_elev'] = np.nan

    res_list = []

    # Smoke test: query the first location once before starting the full loop
    elevation_function(temp.loc[0])

    for index, row in temp.iterrows():
        try:
            print(row)
            # query the USGS for this row's lat/lon and store the elevation
            res_list.append(elevation_function(row))
        except Exception:
            # on occasion the query completely fails and crashes the function call;
            # record an empty string so res_list stays aligned with the rows
            # (logging the stack trace would print it to the notebook, see
            # https://gist.github.com/wassname/d17325f36c36fa663dd7de3c09a55e74)
            print("Exception occurred:")
            print(row)
            res_list.append("")
            continue

        time.sleep(2)



In [None]:
# check on the results 
if not get_offline_data:
    print("The length of the result list is: " + str(len(res_list)))
    print("The shape of the unique lat lons in the clean data was: " + str(clean_data_unique.shape))
    print(res_list[0:25])
    print("The number of elevations that got an error was: " + str(sum(1 for i in res_list if i == "")))

In [None]:
# Add the result list as circle_elev
if not get_offline_data:
    clean_data_unique["circle_elev"] = res_list

In [None]:
# Save this as the offline data
# if not get_offline_data:
#     clean_data_unique.to_csv('../data/Cloud_Data/1.2.1-ijd-fetch-circle-elevations-OFFLINE.csv', index=False)

## Merge in the data with the Paired dataset. Use offline data for the merge if the variable get_offline_data above is True

In [13]:
# Load in the full dataset 
paired_df = pd.read_csv(PATH_TO_PAIRED_DATA, encoding = "ISO-8859-1", compression='gzip', sep="\t")

print("The shape of the paired dataframe is: " + str(paired_df.shape))

# Create a key for the paired data to merge on (lat and lon rounded to 3 decimals, as for the clean data)
paired_df['temp_key_str'] = round(paired_df['lat'],3).astype(str) + round(paired_df['lon'],3).astype(str)

# Count the number of unique lat lon combinations
print("The number of unique latlon combos is " + str(paired_df["temp_key_str"].nunique()))


# Merge on either the clean data collected from USGS or the offline data
if not get_offline_data:
    # Merge in on the key 
    paired_df_ = pd.merge(paired_df, clean_data_unique[['temp_key_str', 'circle_elev']], how="left", on="temp_key_str")

else:
    offline_data = pd.read_csv(PATH_TO_OFFLINE_ELEVATION_DATA)
    # Merge in on the key 
    paired_df_ = pd.merge(paired_df, offline_data[['temp_key_str', 'circle_elev']], how="left", on="temp_key_str")


The shape of the paired dataframe is: (127174, 66)
The number of unique latlon combos is 3500
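
A left merge only preserves the row count if each key appears once in the elevation table. The sketch below shows one way to have pandas enforce that and report unmatched rows, using the `validate` and `indicator` arguments of `pd.merge`; the two small frames are made-up stand-ins for the paired data and the elevation table.

```python
import pandas as pd

# Made-up stand-ins for the paired data and the elevation lookup
paired = pd.DataFrame({
    "temp_key_str": ["a", "a", "b", "c"],
    "count_year": [1901, 1902, 1901, 1901],
})
elevations = pd.DataFrame({"temp_key_str": ["a", "b"], "circle_elev": [10.0, 250.5]})

merged = pd.merge(
    paired,
    elevations,
    how="left",
    on="temp_key_str",
    validate="many_to_one",  # raises MergeError if a key repeats in the right table
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Rows whose key had no elevation in the lookup table
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)
```

The `_merge` column would need to be dropped again before saving if this check were added to the real pipeline.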


In [None]:
# Check the merge 
print("The shape of the merged data is : " + str(paired_df_.shape))
print("The number of NAs in the merged data: " + str(paired_df_['circle_elev'].isna().sum()))
print("The number of circles with circle_elev: " + str(paired_df_['circle_elev'].notna().sum()))
print("Value Counts")
print(paired_df_['circle_elev'].value_counts())

In [None]:
paired_df_.head(50)

## Screen Elevation Data

In [None]:
# Remove bad elevation values: anything below -10,000 m is treated as missing rather than a real elevation
paired_df_.loc[paired_df_['circle_elev'] < -10000.0, 'circle_elev'] = np.nan 

In [None]:
paired_df_[['lat', 'lon', 'count_date', 'circle_elev']].head()

In [None]:
# Create histogram of elevations
paired_df_.hist(column='circle_elev')

In [None]:
# Same number of rows? Should equal the row count of the paired dataframe printed above
len(paired_df_.index)

In [None]:
# sort dataframe by the 'ui' column
paired_df_.sort_values(['ui'], ascending=[True], inplace=True)

In [None]:
# Drop the temporary merge key
paired_df_ = paired_df_.drop("temp_key_str",axis=1)

In [None]:
paired_df_.head()

In [None]:
### NEED A NEW WAY TO DO THIS TEST 
if get_offline_data:
    print('If from an offline source, check to make sure circle elevations are not being lost during merge:\n')
    print('NA in Merged:\n', paired_df_['circle_elev'].isna().value_counts())
    print('\n')
    print('NA in Offline:\n', offline_data['circle_elev'].isna().value_counts())

In [None]:
print('Missing elevations:')
paired_df_['circle_elev'].isna().value_counts()

In [None]:
print('How many elevations at sea level?')
paired_df_.loc[paired_df_['circle_elev'] == 0.0].shape

## Save the output

In [None]:
# But first, some QA checks (these should be compared against the input file and previous notebooks)
paired_df_.shape

In [None]:
paired_df_['ui'].nunique()

In [None]:
paired_df_.to_csv("../data/Cloud_Data/1.2.1-ijd-fetch-circle-elevations_"+time_now+".txt", 
                     sep='\t', 
                     compression='gzip',
                     index=False)
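
Because the output is gzip-compressed but carries a `.txt` extension, pandas cannot infer the compression from the file name when reading it back, so it has to be stated explicitly. The round trip below is a minimal sketch using a temporary directory and a made-up one-row frame; the file name `demo_elevations.txt` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up one-row frame standing in for the merged dataset
demo = pd.DataFrame({"ui": ["36.6167-121.9167_1901"], "circle_elev": [3.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "demo_elevations.txt")
    # Write a gzip-compressed, tab-separated file, as in the cell above
    demo.to_csv(out_path, sep="\t", compression="gzip", index=False)
    # Read it back; compression must be given because the extension is .txt
    restored = pd.read_csv(out_path, sep="\t", compression="gzip")

print(restored)
```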