# Comparing CBC Volunteer and NOAA Station Temperature Readings
### Purpose
This notebook compares temperature readings recorded by Christmas Bird Count (CBC) volunteers with readings from NOAA weather stations, to assess how usable the volunteer readings are.

### Author: 
Jacob Ellena
### Date: 
2020-07-30
### Update Date: 
2020-07-30

### Inputs 
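1.3-rec-connecting-fips-ecosystem-data.txt - Tab-separated, gzip-compressed file read in this notebook. Each row pairs a CBC count with a matched NOAA station (see the preview in the Importing and Formatting section).

cbc_effort_weather_1900-2018.txt - Tab-separated file of Christmas Bird Count events going back to 1900. Each row represents a single count in a given year. The data dictionary can be found here: http://www.audubon.org/sites/default/files/documents/cbc_report_field_definitions_2013.pdf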

### Output Files
None

## Steps or Procedures in the Notebook
Comparisons are split into the following sections:
- Data Import and Formatting
- User Variables
- Distance, Elevation, and Ecosystem Checks
- Missing Data 
- Out of Bounds Data


## Where the Data will Be Saved 
All data for this project is saved in Google Drive. To start experimenting with the data, download the folder here and put it into your data folder.
https://drive.google.com/drive/folders/1Nlj9Nq-_dPFTDbrSDf94XMritWYG6E2I

The path should look like this: 
audubon-cbc/data/Cloud_Data/<DATA FILE>

---

## Importing and Formatting

In [1]:
# Imports 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

#Options
pd.set_option("display.max_columns", 100)

In [2]:
# ALL File Paths should be declared at the TOP of the notebook
PATH_TO_RAW_CBC_DATA = "../data/Cloud_Data/1.3-rec-connecting-fips-ecosystem-data.txt"

In [3]:
raw_data = pd.read_csv(PATH_TO_RAW_CBC_DATA, encoding = "ISO-8859-1", sep="\t", compression='gzip')



In [4]:
raw_data.head()

Unnamed: 0,circle_name,country_state,lat,lon,count_year,count_date,n_field_counters,n_feeder_counters,min_field_parties,max_field_parties,field_hours,feeder_hours,nocturnal_hours,field_distance,nocturnal_distance,distance_units,min_temp,max_temp,temp_unit,min_wind,max_wind,wind_unit,min_snow,max_snow,snow_unit,am_cloud,pm_cloud,field_distance_imperial,field_distance_metric,nocturnal_distance_imperial,nocturnal_distance_metric,min_snow_imperial,min_snow_metric,max_snow_metric,max_snow_imperial,min_temp_imperial,max_temp_imperial,min_temp_metric,max_temp_metric,min_wind_metric,max_wind_metric,min_wind_imperial,max_wind_imperial,ui,geohash_circle,circle_id,id,latitude,longitude,elevation,state,name,gsn_flag,hcn_crn_flag,wmoid,geohash_station,temp_min_value,temp_max_value,precipitation_value,temp_avg,snow,snwd,am_rain,pm_rain,am_snow,pm_snow,circle_elev,elevation_source,block_fips,county_fips,Ecosys_circle,Usgsid_sys_circle,Nlcd_code_circle,Nlcd_circle,Ecosys_station,Usgsid_sys_station,Nlcd_code_station,Nlcd_station
0,Hawai'i: Volcano N.P.,US-HI,19.4333,-155.2833,1955,1955-01-01,11.0,,,,23.0,,,45.0,,Miles,,,,,,,,,,,,45.0,72.417123,,,,,,,,,,,,,,,19.4333-155.2833_1955,8e3w,8e3wd3w,USC00511303,19.4297,-155.2561,1210.4,HI,HAWAII VOL NP HQ 54,,,,8e3w,100.0,161.0,180.0,,0.0,0.0,,,,,1228.18,ghcn_d,,,,,,,,,,
1,Hawai'i: Volcano N.P.,US-HI,19.4333,-155.2833,1956,1955-12-31,11.0,,,,32.0,,,104.0,,Miles,,,,,,,,,,,,104.0,167.364017,,,,,,,,,,,,,,,19.4333-155.2833_1956,8e3w,8e3wd3w,USC00511303,19.4297,-155.2561,1210.4,HI,HAWAII VOL NP HQ 54,,,,8e3w,117.0,189.0,290.0,,0.0,0.0,,,,,1228.18,ghcn_d,,,,,,,,,,
2,Hawai'i: Volcano N.P.,US-HI,19.4333,-155.2833,1968,1967-12-30,2.0,,,,14.0,,,62.0,,Miles,54.0,66.0,,3.0,6.0,,0.0,0.0,,2.0,2.0,62.0,99.774702,,,0.0,0.0,0.0,0.0,,,12.222222,18.888889,4.827808,9.655616,1.8642,3.7284,19.4333-155.2833_1968,8e3w,8e3wd3w,US1HIHI0013,19.4391,-155.2156,1059.2,HI,VOLCANO 4.3 SSE,,,,8e3w,,,,,,,3.0,2.0,3.0,3.0,1228.18,ghcn_d,,,,,,,,,,
3,Hawai'i: Volcano N.P.,US-HI,19.4333,-155.2833,1968,1967-12-30,2.0,,,,14.0,,,62.0,,Miles,54.0,66.0,,3.0,6.0,,0.0,0.0,,2.0,2.0,62.0,99.774702,,,0.0,0.0,0.0,0.0,,,12.222222,18.888889,4.827808,9.655616,1.8642,3.7284,19.4333-155.2833_1968,8e3w,8e3wd3w,US1HIHI0071,19.4414,-155.2487,1194.8,HI,VOLCANO 4.0 S,,,,8e3w,,,,,,,3.0,2.0,3.0,3.0,1228.18,ghcn_d,,,,,,,,,,
4,Hawai'i: Volcano N.P.,US-HI,19.4333,-155.2833,1968,1967-12-30,2.0,,,,14.0,,,62.0,,Miles,54.0,66.0,,3.0,6.0,,0.0,0.0,,2.0,2.0,62.0,99.774702,,,0.0,0.0,0.0,0.0,,,12.222222,18.888889,4.827808,9.655616,1.8642,3.7284,19.4333-155.2833_1968,8e3w,8e3wd3w,USC00514563,19.4094,-155.2608,1079.87,HI,KILAUEA CAMP,,,,8e3w,,,,,,,3.0,2.0,3.0,3.0,1228.18,usgs_api,,,,,,,,,,


In [5]:
# Pulling out temperature data and renaming columns for clarification
temp_df_raw = raw_data[['count_year',
                    'circle_name', 
                    'circle_id',
                    'lat',
                    'lon',
                    'min_temp',
                    'max_temp',
                    'id',
                    'latitude',
                    'longitude',
                    'temp_min_value',
                    'temp_max_value']]

# Setting temp_df to be a copy to avoid indexing errors
temp_df = temp_df_raw.copy()

temp_df.rename(columns={
    'lat':'circle_lat',
    'lon':'circle_lon',
    'min_temp':'circle_min_temp',
    'max_temp':'circle_max_temp',
    'temp_unit':'circle_temp_unit',
    'id':'noaa_id',
    'latitude':'noaa_lat',
    'longitude':'noaa_lon',
    'temp_min_value':'noaa_min_temp',
    'temp_max_value':'noaa_max_temp'},
    inplace=True
              )
#Setting number of rows for comparison of how much data is lost after cleaning
row_count = temp_df.shape[0]
temp_df.head()

Unnamed: 0,count_year,circle_name,circle_id,circle_lat,circle_lon,circle_min_temp,circle_max_temp,noaa_id,noaa_lat,noaa_lon,noaa_min_temp,noaa_max_temp
0,1955,Hawai'i: Volcano N.P.,8e3wd3w,19.4333,-155.2833,,,USC00511303,19.4297,-155.2561,100.0,161.0
1,1956,Hawai'i: Volcano N.P.,8e3wd3w,19.4333,-155.2833,,,USC00511303,19.4297,-155.2561,117.0,189.0
2,1968,Hawai'i: Volcano N.P.,8e3wd3w,19.4333,-155.2833,54.0,66.0,US1HIHI0013,19.4391,-155.2156,,
3,1968,Hawai'i: Volcano N.P.,8e3wd3w,19.4333,-155.2833,54.0,66.0,US1HIHI0071,19.4414,-155.2487,,
4,1968,Hawai'i: Volcano N.P.,8e3wd3w,19.4333,-155.2833,54.0,66.0,USC00514563,19.4094,-155.2608,,


---

## User Variables

In [None]:
# Drop all stations farther than the defined distance threshold
distance_threshold = 15000

# Drop all stations whose elevation differs from the circle's by more than the defined threshold
elevation_threshold = 50

# Maximum and minimum temperature thresholds for comparing temperature readings
# Temperatures are in Fahrenheit
max_temp_check = 134 # Death Valley California
min_temp_check = -80 # Fort Yukon Alaska

---

## Distance, Elevation, and Ecosystem Checks
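
A minimal sketch of what the distance check could look like, assuming `distance_threshold` is expressed in meters and that great-circle (haversine) distance is an acceptable measure. The helper `haversine_m` and the `distance_m` column are hypothetical names that do not appear elsewhere in this notebook; the coordinates are taken from the first rows of the `temp_df` preview.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


# Toy rows shaped like temp_df (one circle paired with two stations)
demo = pd.DataFrame({
    "circle_lat": [19.4333, 19.4333],
    "circle_lon": [-155.2833, -155.2833],
    "noaa_lat": [19.4297, 19.4391],
    "noaa_lon": [-155.2561, -155.2156],
})

# Distance from each circle center to its paired station
demo["distance_m"] = haversine_m(
    demo["circle_lat"], demo["circle_lon"], demo["noaa_lat"], demo["noaa_lon"]
)

distance_threshold = 15000  # assumed to be meters
within_range = demo[demo["distance_m"] <= distance_threshold]
```

The same pattern would apply to the elevation check: compute the absolute elevation difference per row and keep rows at or below `elevation_threshold`.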

----

## Missing Data

#### Checking number of rows without a CBC Circle or NOAA station

In [8]:
print(f" Number of rows without a CBC Circle is:   {temp_df['circle_id'].isna().sum()}")
print(f" Number of rows without a NOAA Station is: {temp_df['noaa_id'].isna().sum()}")

 Number of rows without a CBC Circle is:   0
 Number of rows without a NOAA Station is: 0


#### Counting number of temperature measurements that are missing

In [9]:
print(f"Number of missing CBC Min Temps  : {temp_df['circle_min_temp'].isna().sum()}")
print(f"Number of missing CBC Max Temps  : {temp_df['circle_max_temp'].isna().sum()}")
print(f"Number of missing NOAA Min Temps : {temp_df['noaa_min_temp'].isna().sum()}")
print(f"Number of missing NOAA Max Temps : {temp_df['noaa_max_temp'].isna().sum()}")

Number of missing CBC Min Temps  : 26942
Number of missing CBC Max Temps  : 26960
Number of missing NOAA Min Temps : 675297
Number of missing NOAA Max Temps : 675285


In [27]:
print(f"Number of CBC rows missing both Min and Max Temps  : {temp_df.loc[temp_df['circle_min_temp'].isna() & temp_df['circle_max_temp'].isna()].shape[0]}")
print(f"Number of NOAA rows missing both Min and Max Temps : {temp_df.loc[temp_df['noaa_min_temp'].isna() & temp_df['noaa_max_temp'].isna()].shape[0]}")
print()
print(f"Number of rows missing all temperature data        : {temp_df.loc[temp_df['circle_min_temp'].isna() & temp_df['circle_max_temp'].isna() & temp_df['noaa_min_temp'].isna() & temp_df['noaa_max_temp'].isna()].shape[0]}")

Number of CBC rows missing both Min and Max Temps  : 26884
Number of NOAA rows missing both Min and Max Temps : 675076

Number of rows missing all temperature data        : 7621


#### Removing rows that are missing any temperature value for either CBC Circles or NOAA stations

In [28]:
temp_df.dropna(axis=0, subset=['circle_min_temp', 'circle_max_temp', 'noaa_min_temp', 'noaa_max_temp'], inplace=True)
print(f"Number of rows before: {row_count}")
print(f"Number of rows after:  {temp_df.shape[0]}")
print(f"Total removed:         {row_count - temp_df.shape[0]}")

Number of rows before: 756378
Number of rows after:  61777
Total removed:         694601
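
Only 7,621 rows are missing all four temperature values, yet 694,601 rows are removed. That follows from `dropna` defaulting to `how="any"`: a row is dropped if any column in `subset` is missing. The toy frame below (made-up values, not notebook data) illustrates the difference between the two modes.

```python
import numpy as np
import pandas as pd

cols = ["circle_min_temp", "circle_max_temp", "noaa_min_temp", "noaa_max_temp"]

# Four illustrative rows covering each missing-data pattern
toy = pd.DataFrame(
    [
        [54.0, 66.0, 100.0, 161.0],        # complete row
        [54.0, 66.0, np.nan, np.nan],      # NOAA values missing
        [np.nan, np.nan, 117.0, 189.0],    # CBC values missing
        [np.nan, np.nan, np.nan, np.nan],  # everything missing
    ],
    columns=cols,
)

# Default how="any": keep only rows with all four values present
kept_any = toy.dropna(subset=cols)

# how="all": drop only rows where all four values are missing
kept_all = toy.dropna(subset=cols, how="all")

print(len(kept_any), len(kept_all))
```

For a circle-to-station comparison the stricter `how="any"` behavior used in this notebook is the relevant one, since a row is only comparable when both sources report both values.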


-----

## Out of Bounds Data 
There are a number of outliers in the data set that could heavily skew the analysis. Any row with a temperature outside the minimum or maximum temperature ever recorded in the United States will be dropped.

To be conservative in dropping data, we'll use only one max and one min for the entire country rather than bounds by state or other locality. Additionally, we'll check each min/max temperature separately for circles and stations to get an idea of whether one source is more disparate than the other.

Data: https://en.wikipedia.org/wiki/U.S._state_and_territory_temperature_extremes
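
The thresholds in the User Variables section are in Fahrenheit, while the NOAA values in the `temp_df` preview (for example 100.0 and 161.0 for a Hawaiian station) look more like GHCN-Daily's convention of tenths of degrees Celsius. If that assumption holds for this file, the NOAA columns would need converting before the bounds check is meaningful. The sketch below uses a hypothetical helper `tenths_celsius_to_fahrenheit` and hypothetical `*_f` columns on made-up rows.

```python
import pandas as pd


def tenths_celsius_to_fahrenheit(values):
    """Convert tenths of degrees Celsius to degrees Fahrenheit."""
    return values / 10 * 9 / 5 + 32


max_temp_check = 134  # Fahrenheit, as in the User Variables section
min_temp_check = -80

# Illustrative rows; the last one is deliberately out of bounds
toy = pd.DataFrame({
    "noaa_min_temp": [100.0, 117.0, 700.0],
    "noaa_max_temp": [161.0, 189.0, 900.0],
})

# Assumption: the raw NOAA columns hold tenths of degrees Celsius
toy["noaa_min_temp_f"] = tenths_celsius_to_fahrenheit(toy["noaa_min_temp"])
toy["noaa_max_temp_f"] = tenths_celsius_to_fahrenheit(toy["noaa_max_temp"])

# Boolean mask of rows whose converted readings fall inside the bounds
in_bounds = (
    toy["noaa_min_temp_f"].ge(min_temp_check)
    & toy["noaa_max_temp_f"].le(max_temp_check)
)
cleaned = toy[in_bounds]
```

Filtering with a single boolean mask also avoids building and concatenating separate index lists for each drop condition.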

In [None]:
# Creating variables for each drop condition
circle_over_max_temp  = temp_df.loc[temp_df["circle_max_temp"]>max_temp_check]
circle_under_min_temp = temp_df.loc[temp_df["circle_min_temp"]<min_temp_check]
noaa_over_max_temp    = temp_df.loc[temp_df["noaa_max_temp"]>max_temp_check]
noaa_under_min_temp   = temp_df.loc[temp_df["noaa_min_temp"]<min_temp_check]


print(f'Number of CBC measurements outside max  : {circle_over_max_temp.shape[0]}')
print(f'Number of NOAA measurements outside max : {noaa_over_max_temp.shape[0]}')
print()
print(f'Number of CBC measurements outside min  : {circle_under_min_temp.shape[0]}')
print(f'Number of NOAA measurements outside min : {noaa_under_min_temp.shape[0]}')
print()
print(f'Number of NOAA stations with both outside : {temp_df.loc[(temp_df["noaa_max_temp"] > max_temp_check) & (temp_df["noaa_min_temp"] < min_temp_check)].shape[0]}')

In [None]:
# Setting list of indices to drop
index_drop_list = list(circle_over_max_temp.index) + list(circle_under_min_temp.index) + list(noaa_over_max_temp.index) + list(noaa_under_min_temp.index)

In [None]:
# Dropping Rows
temp_df.drop(index_drop_list, inplace=True)

In [None]:
temp_df.shape

-----

## EDA Notes
In order to compare temperatures between CBC Circles and NOAA Stations, several cleaning steps are needed:
1. Compare only CBC Circles with NOAA Stations attached
2. Remove rows with no data for either CBC Circle or NOAA Station
3. Create averages for both Circles and Stations for comparisons
4. Remove all rows with temperatures outside of the min/max temperature bounds

#### EDA Note 3 
Generating Averages

In [None]:
temp_df['circle_average'] = (temp_df['circle_min_temp'] + temp_df['circle_max_temp'])/2
temp_df['noaa_average'] = (temp_df['noaa_min_temp'] + temp_df['noaa_max_temp'])/2

### Checking how many CBC Circle average temperatures fall within the bounds of the NOAA Station records

In [None]:
temp_df['temp_check'] = temp_df['circle_average'].between(temp_df['noaa_min_temp'], temp_df['noaa_max_temp'])

In [None]:
# Counting the number of rows where the check is true
temp_true = temp_df['temp_check'].sum()
temp_false = temp_df.shape[0] - temp_true
print(f"Number of CBC Circles whose temperature is in the bounds of the corresponding NOAA station:     {temp_true}")
print(f"Number of CBC Circles whose temperature is not in the bounds of the corresponding NOAA station: {temp_false}")
print()
print(f"{round((temp_true/temp_df.shape[0])*100)}% of CBC Circle averages lie between the NOAA min and max")
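
As a small illustration of how the check above behaves, `Series.between` accepts per-row bounds and is inclusive on both ends by default, and the mean of the resulting boolean column gives the share of rows inside the bounds. The values below are made up.

```python
import pandas as pd

# Three illustrative rows: inside, outside, and exactly on the upper bound
toy = pd.DataFrame({
    "circle_average": [60.0, 40.0, 75.0],
    "noaa_min_temp": [50.0, 45.0, 70.0],
    "noaa_max_temp": [70.0, 55.0, 75.0],
})

# Row-wise check: is the circle average within [noaa_min, noaa_max]?
toy["temp_check"] = toy["circle_average"].between(
    toy["noaa_min_temp"], toy["noaa_max_temp"]
)

# Fraction of rows that pass the check
share_within = toy["temp_check"].mean()
print(f"{share_within:.0%} of rows fall within the NOAA bounds")
```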

---
## Plotting over time
Circles have multiple matching stations per year.

In [None]:
#Finding top 10 most common circles to compare to NOAA data over time.
most_active_circles_list = temp_df['circle_id'].value_counts()[:10].index.tolist()

In [None]:
# Finding matching NOAA stations
temp_df.loc[temp_df['circle_id'] == most_active_circles_list[0]]
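
Because each circle can match several stations in a given year, a time series needs one station per circle and year. One possible approach, sketched here on made-up rows, is to keep the closest station within each group. The `distance_m` column and its values are hypothetical; the notebook does not compute them.

```python
import pandas as pd

# Illustrative circle-station pairs; distances are invented for the example
toy = pd.DataFrame({
    "circle_id": ["8e3wd3w", "8e3wd3w", "8e3wd3w", "8e3wd3w"],
    "count_year": [1968, 1968, 1968, 1969],
    "noaa_id": ["US1HIHI0013", "US1HIHI0071", "USC00514563", "USC00511303"],
    "distance_m": [7100.0, 3700.0, 3500.0, 2900.0],
})

# Index label of the smallest distance within each (circle, year) group
closest_idx = toy.groupby(["circle_id", "count_year"])["distance_m"].idxmin()

# One row per circle and year, holding only the closest station
closest = toy.loc[closest_idx].sort_values("count_year").reset_index(drop=True)
print(closest)
```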

---
## Plotting out CBC Circle temperature data to NOAA Station data

In [None]:
plt.figure(figsize=(20, 6))
sns.scatterplot(x=temp_df['circle_average'], y=temp_df['noaa_average'])
plt.title('CBC Average Temp to NOAA Station Average Temp', fontsize=20)
plt.xlabel('Average CBC Circle Temp', fontsize=10)
plt.ylabel('Average NOAA Station Temp', fontsize=10)
plt.show()

In [None]:
# Inspect the NOAA averages plotted above
temp_df['noaa_average']

---
## Temperature Measurement Goodness with Interchangeable Metric

temp_metric = sqrt( (noaa_min_temp - circle_min_temp)^2 + (noaa_max_temp - circle_max_temp)^2 )

We compare two different metrics to get an idea of how much they can differ.

### Metric 1
temp_metric_1 = sqrt( (noaa_min_temp - circle_min_temp)^2 + (noaa_max_temp - circle_max_temp)^2 )

In [None]:
temp_df['temp_metric_1'] = round(np.sqrt(((temp_df['noaa_min_temp'] - temp_df['circle_min_temp'])**2) + ((temp_df['noaa_max_temp'] - temp_df['circle_max_temp'])**2)),2)

### Metric 2
temp_metric_2 = sqrt( (noaa_average - circle_average)^2 )

In [None]:
temp_df['temp_metric_2'] = np.sqrt((temp_df['noaa_average'] - temp_df['circle_average'])**2)
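
A worked example on made-up numbers may help show how the two metrics differ. Metric 1 is the Euclidean distance between the (min, max) pairs, while Metric 2 reduces to the absolute difference of the averages, since `sqrt(x^2) = |x|`. As a result, opposite-signed errors in min and max partly cancel in Metric 2 but not in Metric 1.

```python
import numpy as np
import pandas as pd

# Row 0: both NOAA readings are lower; row 1: errors have opposite signs
toy = pd.DataFrame({
    "circle_min_temp": [54.0, 30.0],
    "circle_max_temp": [66.0, 40.0],
    "noaa_min_temp": [50.0, 33.0],
    "noaa_max_temp": [63.0, 36.0],
})

d_min = toy["noaa_min_temp"] - toy["circle_min_temp"]
d_max = toy["noaa_max_temp"] - toy["circle_max_temp"]

# Metric 1: sqrt(d_min^2 + d_max^2)
metric_1 = np.hypot(d_min, d_max)

# Metric 2: |noaa_average - circle_average| = |(d_min + d_max) / 2|
metric_2 = ((d_min + d_max) / 2).abs()

print(pd.DataFrame({"metric_1": metric_1, "metric_2": metric_2}))
```

Both rows score the same on Metric 1, but the second row scores much lower on Metric 2 because its min and max errors offset each other.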

In [None]:
temp_df.sort_values(by=['temp_metric_1'])

### Categories
The category thresholds can be changed and then applied to the dataframe.

In [None]:
# Creating Categories
excellent = 5
good      = 10
fair      = 15
poor      = 20

In [None]:
# Function to assign grade scores
def assign_grade(metric_score):
    if metric_score <= excellent:
        return 'excellent'
    elif metric_score <= good:
        return 'good'
    elif metric_score <= fair:
        return 'fair'
    else:
        return 'poor'

In [None]:
# Applying the scores
temp_df['metric_grade'] = temp_df['temp_metric_1'].apply(assign_grade)
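
As an alternative sketch, the same binning can be expressed with `pd.cut`, which is vectorized and yields an ordered categorical. The bin edges below reuse the `excellent`, `good`, and `fair` thresholds; the sample scores are made up.

```python
import numpy as np
import pandas as pd

excellent, good, fair = 5, 10, 15  # same thresholds as in the cells above

scores = pd.Series([0.0, 5.0, 7.5, 15.0, 42.0])

# Right-inclusive bins: (-inf, 5], (5, 10], (10, 15], (15, inf)
grades = pd.cut(
    scores,
    bins=[-np.inf, excellent, good, fair, np.inf],
    labels=["excellent", "good", "fair", "poor"],
)
print(grades.tolist())
```

One behavioral difference to keep in mind: `pd.cut` leaves missing scores as missing, whereas `assign_grade` would label them `'poor'` because every comparison with NaN is false.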

In [None]:
temp_df

---
## Plotting Circle Metric Scores by Lat/Lon

In [None]:
plt.figure(figsize=(20, 6))
sns.scatterplot(x='circle_lat', y='circle_lon', data=temp_df, hue='metric_grade')
plt.title('Circle Locations by Metric Grade', fontsize=20)
plt.xlabel('Circle Lat', fontsize=10)
plt.ylabel('Circle Lon', fontsize=10)
plt.show()

## Next Steps
- Clean Data
- Plot time series using closest station