# This is The Title of Notebook
### Purpose
This notebook compares the usability of temperature readings reported by Christmas Bird Count volunteers with readings from NOAA weather stations.

### Author: 
Jacob Ellena
### Date: 
2020-07-30
### Update Date: 
2020-07-30

### Inputs 
1.3-rec-connecting-fips-ecosystem-data.txt -  
Example
cbc_effort_weather_1900-2018.txt - Tab-separated file of Christmas Bird Count events going back to 1900. Each row represents a single count in a given year. The data dictionary can be found here: http://www.audubon.org/sites/default/files/documents/cbc_report_field_definitions_2013.pdf

### Output Files
None

## Steps or Procedures in the notebook 
Comparisons are split into the following sections:
- Data Import and Formatting
- User Variables
- Distance, Elevation, and Ecosystem Checks
- Missing Data 
- Out of Bounds Data
- Temperature Goodness Rating
- Ecosystem comparison


## Where the Data will Be Saved 
All data for this project will be saved in Google Drive. To start experimenting with the data, download the folder here and put it into your data folder.
https://drive.google.com/drive/folders/1Nlj9Nq-_dPFTDbrSDf94XMritWYG6E2I

The path should look like this: 
audubon-cbc/data/Cloud_Data/<DATA FILE>

---

## Importing and Formatting

In [1]:
# Imports (unused math and sklearn imports removed; nothing below references them)
import pandas as pd
import numpy as np
import plotly.graph_objects as go

#Options
pd.set_option("display.max_columns", 100)

---

## User Variables

In [2]:
# Drop all stations farther than the defined threshold in meters
distance_threshold = 15000

# Drop all stations whose elevation differs from the circle's by more than the defined threshold in meters
elevation_threshold = 50

# Maximum and minimum temperature thresholds for comparing temperature readings
# Temperatures are in Fahrenheit and pulled from https://en.wikipedia.org/wiki/U.S._state_and_territory_temperature_extremes
max_temp_check = 134 # Death Valley, California
min_temp_check = -80 # Prospect Creek, Alaska

# Categories for temperature goodness metric
excellent_score = 5
good_score      = 10
fair_score      = 15
poor_score      = 20

---
## Dataframe Generation

In [3]:
# ALL File Paths should be declared at the TOP of the notebook
PATH_TO_RAW_CBC_DATA = "../data/Cloud_Data/1.3-rec-connecting-fips-ecosystem-data.txt"

In [4]:
raw_data = pd.read_csv(PATH_TO_RAW_CBC_DATA, encoding = "ISO-8859-1", sep="\t", compression='gzip')


In [5]:
# Pulling out temperature data and renaming columns for clarification
temp_df_raw = raw_data[['count_year',
                    'circle_name', 
                    'circle_id',
                    'Usgsid_sys_circle',
                    'Nlcd_circle',
                    'circle_elev',
                    'lat',
                    'lon',
                    'min_temp',
                    'max_temp',
                    'id',
                    'Usgsid_sys_station',
                    'Nlcd_station',
                    'elevation',
                    'latitude',
                    'longitude',
                    'temp_min_value',
                    'temp_max_value']]

# Setting temp_df to be a copy to avoid indexing errors
temp_df = temp_df_raw.copy()

temp_df.rename(columns={
    'circle_elev':'circle_elevation',
    'Usgsid_sys_circle':'specific_circle_ecosystem',
    'Nlcd_circle':'macro_circle_ecosystem',
    'lat':'circle_lat',
    'lon':'circle_lon',
    'min_temp':'circle_min_temp',
    'max_temp':'circle_max_temp',
    'temp_unit':'circle_temp_unit',
    'id':'noaa_id',
    'Usgsid_sys_station':'specific_station_ecosystem',
    'Nlcd_station':'macro_station_ecosystem',
    'elevation':'noaa_elevation',
    'latitude':'noaa_lat',
    'longitude':'noaa_lon',
    'temp_min_value':'noaa_min_temp',
    'temp_max_value':'noaa_max_temp'},
    inplace=True
              )
#Setting number of rows for comparison of how much data is lost after cleaning
row_count = temp_df.shape[0]
temp_df.head()

| | count_year | circle_name | circle_id | specific_circle_ecosystem | macro_circle_ecosystem | circle_elevation | circle_lat | circle_lon | circle_min_temp | circle_max_temp | noaa_id | specific_station_ecosystem | macro_station_ecosystem | noaa_elevation | noaa_lat | noaa_lon | noaa_min_temp | noaa_max_temp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1955 | Hawai'i: Volcano N.P. | 8e3wd3w | NaN | NaN | 1228.18 | 19.4333 | -155.2833 | NaN | NaN | USC00511303 | NaN | NaN | 1210.4 | 19.4297 | -155.2561 | 100.0 | 161.0 |
| 1 | 1956 | Hawai'i: Volcano N.P. | 8e3wd3w | NaN | NaN | 1228.18 | 19.4333 | -155.2833 | NaN | NaN | USC00511303 | NaN | NaN | 1210.4 | 19.4297 | -155.2561 | 117.0 | 189.0 |
| 2 | 1968 | Hawai'i: Volcano N.P. | 8e3wd3w | NaN | NaN | 1228.18 | 19.4333 | -155.2833 | 54.0 | 66.0 | US1HIHI0013 | NaN | NaN | 1059.2 | 19.4391 | -155.2156 | NaN | NaN |
| 3 | 1968 | Hawai'i: Volcano N.P. | 8e3wd3w | NaN | NaN | 1228.18 | 19.4333 | -155.2833 | 54.0 | 66.0 | US1HIHI0071 | NaN | NaN | 1194.8 | 19.4414 | -155.2487 | NaN | NaN |
| 4 | 1968 | Hawai'i: Volcano N.P. | 8e3wd3w | NaN | NaN | 1228.18 | 19.4333 | -155.2833 | 54.0 | 66.0 | USC00514563 | NaN | NaN | 1079.87 | 19.4094 | -155.2608 | NaN | NaN |


#### Calculating Temperature Averages

In [6]:
temp_df['circle_average_temp'] = temp_df[['circle_min_temp', 'circle_max_temp']].mean(axis=1)
temp_df['noaa_average_temp'] = temp_df[['noaa_min_temp', 'noaa_max_temp']].mean(axis=1)
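
One detail worth keeping in mind about the averages above: `DataFrame.mean` skips missing values by default, so a row with only one of the two readings gets that single reading as its "average". The sketch below uses made-up toy values (not project data) to show the default behavior next to a stricter `skipna=False` variant. In this notebook the distinction only matters before the missing-data rows are removed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: complete, min-only, and fully missing readings
toy = pd.DataFrame({
    "circle_min_temp": [20.0, 30.0, np.nan],
    "circle_max_temp": [40.0, np.nan, np.nan],
})
temp_cols = ["circle_min_temp", "circle_max_temp"]

# Default behavior (skipna=True): a lone reading becomes the average
toy["circle_average_temp"] = toy[temp_cols].mean(axis=1)

# Stricter variant: the average is NaN whenever either reading is missing
toy["strict_average_temp"] = toy[temp_cols].mean(axis=1, skipna=False)

print(toy)
```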

---

## Distance and Elevation

#### Distance Calculations

In [7]:
# Formula from noaa.py found in '../scripts' folder
def haversine_formula(coord1, coord2):
    """Haversine formula for calculating distance between two
    coordinates in meters.

    Distance is similar to the GeoPy distance formulas except
    the geopy formula uses Vincenty’s formula. At longer distances,
    the difference is much more pronounced, however, since we are trying
    to find the closest one, the Haversine formula is a suitable
    approximation for our purposes.

    :param tuple coord1:
        A (lat, lon) pair for the first location, in degrees
    :param tuple coord2:
        A (lat, lon) pair for the second location, in degrees

    :return: distance between the two locations in meters
    :rtype: float
    """
    R = 6372800  # Earth radius in meters
    lat1, lon1 = coord1
    lat2, lon2 = coord2

    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlambda = np.radians(lon2 - lon1)

    a = np.sin(dphi / 2)**2 + \
        np.cos(phi1) * np.cos(phi2) * np.sin(dlambda / 2)**2

    return 2*R*np.arctan2(np.sqrt(a), np.sqrt(1 - a))

In [8]:
# Adding distance column based on haversine distance
temp_df['distance_diff'] = haversine_formula((temp_df['circle_lat'], temp_df['circle_lon']), (temp_df['noaa_lat'], temp_df['noaa_lon']))
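
As a quick sanity check on the distance calculation, the sketch below restates the same haversine formula under a hypothetical name (`haversine_m`, not part of the project scripts) and checks two things: one degree of latitude along a meridian equals `R * pi / 180`, and the function works element-wise on pandas Series. The coordinates are taken from the sample rows shown earlier, with one identical pair added for illustration.

```python
import numpy as np
import pandas as pd

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6372800):
    """Great-circle distance in meters (same formula as haversine_formula)."""
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlambda = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlambda / 2) ** 2
    return 2 * radius_m * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# One degree of latitude along a meridian should equal radius * pi / 180
one_degree = haversine_m(0.0, 0.0, 1.0, 0.0)
expected = 6372800 * np.pi / 180

# The same function applied element-wise to columns of coordinates
pairs = pd.DataFrame({
    "circle_lat": [19.4333, 19.4333],
    "circle_lon": [-155.2833, -155.2833],
    "noaa_lat": [19.4297, 19.4333],
    "noaa_lon": [-155.2561, -155.2833],
})
pairs["distance_diff"] = haversine_m(
    pairs["circle_lat"], pairs["circle_lon"], pairs["noaa_lat"], pairs["noaa_lon"]
)
print(one_degree, expected)
print(pairs)
```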

#### Elevation Calculations

In [9]:
# Calculating difference in elevations between circles and stations
temp_df['elevation_diff'] = np.abs(temp_df['circle_elevation'] - temp_df['noaa_elevation'])

----

## Missing Data

#### Checking number of rows without a CBC Circle or NOAA station

In [10]:
print(f" Number of rows without a CBC Circle is:   {temp_df['circle_id'].isna().sum()}")
print(f" Number of rows without a NOAA Station is: {temp_df['noaa_id'].isna().sum()}")

 Number of rows without a CBC Circle is:   0
 Number of rows without a NOAA Station is: 0


#### Counting number of temperature measurements that are missing

In [11]:
print(f"Number of missing CBC Min Temps  : {temp_df['circle_min_temp'].isna().sum()}")
print(f"Number of missing CBC Max Temps  : {temp_df['circle_max_temp'].isna().sum()}")
print(f"Number of missing NOAA Min Temps : {temp_df['noaa_min_temp'].isna().sum()}")
print(f"Number of missing NOAA Max Temps : {temp_df['noaa_max_temp'].isna().sum()}")

Number of missing CBC Min Temps  : 26942
Number of missing CBC Max Temps  : 26960
Number of missing NOAA Min Temps : 675297
Number of missing NOAA Max Temps : 675285


In [12]:
print(f"Number of CBC rows missing both Min and Max Temps  : {temp_df.loc[temp_df['circle_min_temp'].isna() & temp_df['circle_max_temp'].isna()].shape[0]}")
print(f"Number of NOAA rows missing both Min and Max Temps : {temp_df.loc[temp_df['noaa_min_temp'].isna() & temp_df['noaa_max_temp'].isna()].shape[0]}")
print()
print(f"Number of rows missing all temperature data        : {temp_df.loc[temp_df['circle_min_temp'].isna() & temp_df['circle_max_temp'].isna() & temp_df['noaa_min_temp'].isna() & temp_df['noaa_max_temp'].isna()].shape[0]}")

Number of CBC rows missing both Min and Max Temps  : 26884
Number of NOAA rows missing both Min and Max Temps : 675076

Number of rows missing all temperature data        : 7621


#### Removing rows without temperature data for either CBC Circles or NOAA stations.

In [13]:
temp_df.dropna(axis=0, subset=['circle_min_temp', 'circle_max_temp', 'noaa_min_temp', 'noaa_max_temp'], inplace=True)
print(f"Number of rows before: {row_count}")
print(f"Number of rows after:  {temp_df.shape[0]}")
print(f"Total removed:         {row_count - temp_df.shape[0]}")

Number of rows before: 756378
Number of rows after:  61777
Total removed:         694601
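
The size of the drop follows from how `dropna` treats the `subset` argument: with the default `how="any"`, a row is removed if any one of the four temperature columns is missing, not only when all of them are. A minimal sketch on made-up toy rows (not project data):

```python
import numpy as np
import pandas as pd

temp_cols = ["circle_min_temp", "circle_max_temp", "noaa_min_temp", "noaa_max_temp"]

# Hypothetical toy rows covering the main missing-data patterns
toy = pd.DataFrame(
    [
        [54.0, 66.0, 50.0, 70.0],      # complete: kept
        [54.0, 66.0, np.nan, np.nan],  # no NOAA readings: dropped
        [np.nan, np.nan, 50.0, 70.0],  # no CBC readings: dropped
        [54.0, np.nan, 50.0, 70.0],    # a single missing value: dropped
    ],
    columns=temp_cols,
)

# how="any" is the default, so one missing value is enough to drop a row
kept = toy.dropna(axis=0, subset=temp_cols)
missing_per_column = toy[temp_cols].isna().sum()

print(missing_per_column)
print(kept)
```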


-----

## Out of Bounds Data 

### Temperature Data
There are a number of outliers in the data set that could heavily skew the analysis. Any row with a temperature outside the minimum or maximum temperature ever recorded in the United States will be dropped.

To be conservative in dropping data, we'll use only one maximum and one minimum for the entire country rather than values by state or other locality. Additionally, we'll check each min/max temperature separately for circles and stations to get an idea of whether one source is more error prone than the other.

Data: https://en.wikipedia.org/wiki/U.S._state_and_territory_temperature_extremes

In [14]:
# Creating variables for each drop condition
circle_over_max_temp  = temp_df.loc[temp_df["circle_max_temp"]>max_temp_check]
circle_under_min_temp = temp_df.loc[temp_df["circle_min_temp"]<min_temp_check]

noaa_over_max_temp    = temp_df.loc[temp_df["noaa_max_temp"]>max_temp_check]
noaa_under_min_temp   = temp_df.loc[temp_df["noaa_min_temp"]<min_temp_check]


print(f'Number of CBC measurements outside max  : {circle_over_max_temp.shape[0]}')
print(f'Number of NOAA measurements outside max : {noaa_over_max_temp.shape[0]}')
print()
print(f'Number of CBC measurements outside min  : {circle_under_min_temp.shape[0]}')
print(f'Number of NOAA measurements outside min : {noaa_under_min_temp.shape[0]}')
print()
print(f'Number of NOAA stations with both outside : {temp_df.loc[(temp_df["noaa_max_temp"] > max_temp_check) & (temp_df["noaa_min_temp"] < min_temp_check)].shape[0]}')

# Setting list of indices to drop
index_drop_list = list(circle_over_max_temp.index) + list(circle_under_min_temp.index) + list(noaa_over_max_temp.index) + list(noaa_under_min_temp.index)

# Dropping all out-of-bounds rows
temp_df.drop(index_drop_list, inplace=True)

Number of CBC measurements outside max  : 3
Number of NOAA measurements outside max : 14207

Number of CBC measurements outside min  : 0
Number of NOAA measurements outside min : 17098

Number of NOAA stations with both outside : 125
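
An alternative to concatenating index lists is a single boolean mask, which counts a row once even when it trips more than one condition. This is a sketch on made-up toy rows, reusing the threshold values from the User Variables section; it is not a drop-in replacement for the cell above.

```python
import pandas as pd

max_temp_check = 134
min_temp_check = -80

# Hypothetical toy rows: one valid, one outside both bounds, one above the max
toy = pd.DataFrame({
    "circle_min_temp": [20.0, 25.0, 30.0],
    "circle_max_temp": [40.0, 45.0, 50.0],
    "noaa_min_temp": [18.0, -150.0, 100.0],
    "noaa_max_temp": [42.0, 300.0, 161.0],
})

# A row is out of bounds if any of the four checks fails
out_of_bounds = (
    (toy["circle_max_temp"] > max_temp_check)
    | (toy["circle_min_temp"] < min_temp_check)
    | (toy["noaa_max_temp"] > max_temp_check)
    | (toy["noaa_min_temp"] < min_temp_check)
)

cleaned = toy.loc[~out_of_bounds]
rows_dropped = int(out_of_bounds.sum())
print(f"Rows dropped: {rows_dropped}")
```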


#### Distance Data

In [15]:
# Dropping rows with distance differences larger than the set threshold
temp_df.drop(temp_df[temp_df['distance_diff'] > distance_threshold].index, inplace=True)
print(f'Number of rows remaining within distance threshold: {temp_df.shape[0]}')

Number of rows remaining within distance threshold: 23364


#### Elevation Data

In [16]:
# Dropping rows where the circle and station elevations differ by more than the elevation threshold
temp_df.drop(temp_df[temp_df['elevation_diff'] > elevation_threshold].index, inplace=True)

# Dropping rows with no elevation data
temp_df.dropna(subset=['circle_elevation', 'noaa_elevation'], inplace=True)
print(f'Number of rows remaining within elevation threshold: {temp_df.shape[0]}')

Number of rows remaining within elevation threshold: 16986
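
To report how many rows each threshold removes, one option is to compare row counts before and after each filter. The sketch below uses made-up toy values and the thresholds from the User Variables section. One difference from the cells above is an assumption of this sketch: filtering with `<=` also removes rows whose value is missing, because comparisons with NaN evaluate to False.

```python
import numpy as np
import pandas as pd

distance_threshold = 15000
elevation_threshold = 50

# Hypothetical toy rows; the last one has no elevation data
toy = pd.DataFrame({
    "distance_diff": [1200.0, 20000.0, 9000.0, 3000.0],
    "elevation_diff": [10.0, 5.0, 75.0, np.nan],
})

rows_before = len(toy)

# Keep rows within the distance threshold and count what was removed
within_distance = toy[toy["distance_diff"] <= distance_threshold]
dropped_for_distance = rows_before - len(within_distance)

# Keep rows within the elevation threshold (NaN differences are removed too)
within_elevation = within_distance[within_distance["elevation_diff"] <= elevation_threshold]
dropped_for_elevation = len(within_distance) - len(within_elevation)

print(f"Dropped for distance:  {dropped_for_distance}")
print(f"Dropped for elevation: {dropped_for_elevation}")
print(f"Rows remaining:        {len(within_elevation)}")
```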


#### Checking how many CBC Circle temperature records are within the bounds of the NOAA Station records

In [17]:
temp_df['temp_check'] = temp_df['circle_average_temp'].between(temp_df['noaa_min_temp'], temp_df['noaa_max_temp'])

In [18]:
# Counting number of circles whose average temperature is within the NOAA bounds
temp_true = sum(temp_df['temp_check'])
temp_false = temp_df.shape[0] - temp_true
print(f"Number of CBC Circles whose temperature is in the bounds of the corresponding NOAA station:     {temp_true}")
print(f"Number of CBC Circles whose temperature is not in the bounds of the corresponding NOAA station: {temp_false}")
print()
print(f"{round((temp_true/temp_df.shape[0])*100)}% of circle temperatures lie within the NOAA station bounds")

Number of CBC Circles whose temperature is in the bounds of the corresponding NOAA station:     10544
Number of CBC Circles whose temperature is not in the bounds of the corresponding NOAA station: 6442

62% of circle temperatures lie within the NOAA station bounds
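
For reference, `Series.between` accepts Series as bounds and compares row by row, with both ends inclusive by default. A small sketch on made-up toy values shows the check and a share computed directly from the boolean column:

```python
import pandas as pd

# Hypothetical toy rows: inside, below, and exactly on the upper bound
toy = pd.DataFrame({
    "circle_average_temp": [60.0, 45.0, 70.0],
    "noaa_min_temp": [50.0, 50.0, 50.0],
    "noaa_max_temp": [70.0, 70.0, 70.0],
})

# Row-wise check; the bounds are inclusive, so 70.0 counts as within
toy["temp_check"] = toy["circle_average_temp"].between(
    toy["noaa_min_temp"], toy["noaa_max_temp"]
)

# The mean of a boolean column is the fraction of True values
share_within = toy["temp_check"].mean()
print(f"{round(share_within * 100)}% within bounds")
```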


---
## Temperature Measurement Goodness

The goodness metric is the Euclidean distance between the NOAA station (min, max) temperature pair and the CBC circle (min, max) temperature pair, so lower values mean closer agreement.

### Goodness Metric
temp_goodness = sqrt( (noaa_min_temp - circle_min_temp)^2 + (noaa_max_temp - circle_max_temp)^2 )

In [19]:
temp_df['temp_goodness'] = round(np.sqrt(((temp_df['noaa_min_temp'] - temp_df['circle_min_temp'])**2) + ((temp_df['noaa_max_temp'] - temp_df['circle_max_temp'])**2)),2)
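
A worked example of the metric on made-up toy readings: a circle that is 3 degrees off on the minimum and 4 degrees off on the maximum scores sqrt(3^2 + 4^2) = 5, while identical readings score 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy readings in degrees Fahrenheit
toy = pd.DataFrame({
    "noaa_min_temp": [50.0, 50.0, 50.0],
    "noaa_max_temp": [70.0, 70.0, 70.0],
    "circle_min_temp": [53.0, 50.0, 40.0],
    "circle_max_temp": [66.0, 70.0, 90.0],
})

# Euclidean distance between the (min, max) pairs, rounded to 2 decimals
toy["temp_goodness"] = np.sqrt(
    (toy["noaa_min_temp"] - toy["circle_min_temp"]) ** 2
    + (toy["noaa_max_temp"] - toy["circle_max_temp"]) ** 2
).round(2)

print(toy["temp_goodness"].tolist())
```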

### Categories
The category thresholds are set in the User Variables section; change them there and re-run the cells below to apply them to the dataframe.

In [20]:
# Function to assign grade scores
def assign_grade(metric_score):
    if metric_score <= excellent_score:
        return 'excellent'
    elif metric_score <= good_score:
        return 'good'
    elif metric_score <= fair_score:
        return 'fair'
    else:
        return 'poor'

In [21]:
# Applying the scores
temp_df['goodness_grade'] = temp_df['temp_goodness'].apply(assign_grade)
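
A vectorized alternative to the row-wise function is `pd.cut`, sketched below with the thresholds from the User Variables section and made-up scores. With the default right-closed bins, a score equal to a threshold falls in the better category, matching the `<=` comparisons in `assign_grade`. One difference to be aware of: `pd.cut` leaves a missing score as NaN, whereas `assign_grade` would label it 'poor' because every comparison with NaN is False.

```python
import numpy as np
import pandas as pd

excellent_score, good_score, fair_score = 5, 10, 15

# Hypothetical scores, including boundary values and a missing score
scores = pd.Series([0.0, 5.0, 5.01, 10.0, 15.0, 22.36, np.nan])

# Bins are right-closed by default: (-inf, 5], (5, 10], (10, 15], (15, inf)
grades = pd.cut(
    scores,
    bins=[-np.inf, excellent_score, good_score, fair_score, np.inf],
    labels=["excellent", "good", "fair", "poor"],
)

print(grades.tolist())
```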

---

## Ecosystem Split
Creating two dataframes, one filtered on the specific ecosystem columns and one filtered on the macro ecosystem columns

#### Specific Ecosystem Match

In [22]:
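# Note: this compares missingness, so it keeps rows where both ecosystem values are present or both are missing; it does not test whether the two values are equal
temp_df_specific_ecosystems = temp_df.loc[temp_df['specific_circle_ecosystem'].isna() == temp_df['specific_station_ecosystem'].isna()]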
print(f'Number of rows before specific ecosystem match: {temp_df.shape[0]}')
print(f'Number of rows after specific ecosystem match:  {temp_df_specific_ecosystems.shape[0]}')
print()
print(f'Number of rows lost: {temp_df.shape[0] - temp_df_specific_ecosystems.shape[0]}')

Number of rows before specific ecosystem match: 16986
Number of rows after specific ecosystem match:  16160

Number of rows lost: 826


#### Macro Ecosystem Match

In [23]:
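# Note: this compares missingness, so it keeps rows where both ecosystem values are present or both are missing; it does not test whether the two values are equal
temp_df_macro_ecosystems = temp_df.loc[temp_df['macro_circle_ecosystem'].isna() == temp_df['macro_station_ecosystem'].isna()]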
print(f'Number of rows before macro ecosystem match: {temp_df.shape[0]}')
print(f'Number of rows after macro ecosystem match:  {temp_df_macro_ecosystems.shape[0]}')
print()
print(f'Number of rows lost: {temp_df.shape[0] - temp_df_macro_ecosystems.shape[0]}')

Number of rows before macro ecosystem match: 16986
Number of rows after macro ecosystem match:  14798

Number of rows lost: 2188
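
The two filters above compare whether the ecosystem values are missing, which is different from comparing the values themselves. The sketch below contrasts the two on toy rows. The ecosystem labels ("Forest", "Wetland") are made up for illustration and are not taken from the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: same value, different values, both missing, one missing
toy = pd.DataFrame({
    "macro_circle_ecosystem": ["Forest", "Forest", np.nan, np.nan],
    "macro_station_ecosystem": ["Forest", "Wetland", np.nan, "Forest"],
})

# Missingness match: True when both are present or both are missing
same_missingness = (
    toy["macro_circle_ecosystem"].isna() == toy["macro_station_ecosystem"].isna()
)

# Value match: True only when both are present and equal (NaN never equals NaN)
same_value = toy["macro_circle_ecosystem"] == toy["macro_station_ecosystem"]

print(pd.DataFrame({"same_missingness": same_missingness, "same_value": same_value}))
```

Which comparison is appropriate depends on whether rows lacking ecosystem data should be kept in the ecosystem-specific dataframes.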


---

## Ecosystem Grade Mapping

In [24]:
# Defining a function that builds a US map of circles colored by goodness grade
def grade_figure(fig_df):
    # Setting text for mouse overlay
    fig_df['text'] = 'Circle Name: ' + fig_df['circle_name'] + '<br>Goodness Grade: ' + fig_df['goodness_grade']
    
    # Generating figure
    fig = go.Figure(go.Scattergeo())

    # Trace layer for poor grading
    fig_poor = go.Scattergeo(
    locationmode = 'USA-states',
            lon = fig_df.loc[fig_df['goodness_grade'] == 'poor']['circle_lon'],
            lat = fig_df.loc[fig_df['goodness_grade'] == 'poor']['circle_lat'],
            text = fig_df.loc[fig_df['goodness_grade'] == 'poor']['text'], # Used for interactive map
            mode = 'markers',
            marker = dict(
                size = 2,
                opacity = .2,
                color = 'red'
            ),
            )
    
    # Trace layer for fair grading
    fig_fair = go.Scattergeo(
    locationmode = 'USA-states',
            lon = fig_df.loc[fig_df['goodness_grade'] == 'fair']['circle_lon'],
            lat = fig_df.loc[fig_df['goodness_grade'] == 'fair']['circle_lat'],
            text = fig_df.loc[fig_df['goodness_grade'] == 'fair']['text'], # Used for interactive map
            mode = 'markers',
            marker = dict(
                size = 4,
                opacity = .4,
                color = 'yellow'
            ),
            )
    
    # Trace layer for good grading
    fig_good = go.Scattergeo(
    locationmode = 'USA-states',
            lon = fig_df.loc[fig_df['goodness_grade'] == 'good']['circle_lon'],
            lat = fig_df.loc[fig_df['goodness_grade'] == 'good']['circle_lat'],
            text = fig_df.loc[fig_df['goodness_grade'] == 'good']['text'], # Used for interactive map
            mode = 'markers',
            marker = dict(
                size = 6,
                opacity = .8,
                color = 'blue'
            ),
            )
    
    # Trace layer for excellent grading
    fig_excellent = go.Scattergeo(
    locationmode = 'USA-states',
            lon = fig_df.loc[fig_df['goodness_grade'] == 'excellent']['circle_lon'],
            lat = fig_df.loc[fig_df['goodness_grade'] == 'excellent']['circle_lat'],
            text = fig_df.loc[fig_df['goodness_grade'] == 'excellent']['text'], # Used for interactive map
            mode = 'markers',
            marker = dict(
                size = 8,
                opacity = 1,
                color = 'chartreuse'
            ),
            )

    fig.add_trace(fig_poor)
    fig.add_trace(fig_fair)
    fig.add_trace(fig_good)
    fig.add_trace(fig_excellent)

    fig.update_layout(
            geo_scope='usa',
            showlegend=False
        )
    # Figures can slow down notebook so commenting out for review
    # fig.show()
    
    # Comment out below if you don't want to save the image
    # fig.write_image(f'{[x for x in globals() if globals()[x] is fig_df][0]}_temp_grade_geoscatter.png', scale = 5) # List comprehension pulls matching dataframe name from global objects list

#### Specific Ecosystems

In [None]:
grade_figure(temp_df_specific_ecosystems)

#### Macro Ecosystems

In [None]:
grade_figure(temp_df_macro_ecosystems)