## Assigning circles to weather stations
### Purpose
A custom table, created by uploading the cleaned CSV to BigQuery (this table is called `cleaned_bird_counts_gstorage`), is joined with the view that contains the flattened NOAA data.

### Author: 
Francisco Vannini
### Date: 
2020-04-02
### Update Date: 
2020-04-02

### Inputs
<ol>
<li> Google credential auth JSON </li>
<li> noaa_from_1900_to_present view in BQ</li>
<li> flatten_noaa_from_1900_to_present in BQ</li>
<li> cleaned_bird_count data</li>
</ol>

### Output Files
This notebook produces <strong>1.1-circles-to-many-noaa-stations-usa-weather-data-[date_this_process_was_run].csv.gzip</strong>. This file contains the non-empty weather measurements for the NOAA stations that are in close proximity (matched using geohashes) to our CDC bird count circles.

## Steps or Procedures in the Notebook
This notebook builds a query that takes the CDC bird count data, matches each count circle with the NOAA stations in close proximity to it, and then extracts the NOAA station weather measurements pertinent to the count dates. After the data is extracted, the rows that have a NULL `temp_min_value` are pruned and only USA weather measurements are kept.

To prepare for the query, the notebook loads the cleaned data and uploads it to BigQuery so that the query has access to it.
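
The steps above can be sketched in plain Python on a few toy records. This is only an illustration of the join and filter logic, not the BigQuery query itself: the records, the `geohash4` field name and the `match_and_filter` helper are all made up for this example.

```python
# Toy stand-ins for the three inputs. Every value here is invented for illustration.
circles = [
    {"circle_name": "Circle A", "geohash4": "9q9p", "count_date": "1950-12-25"},
    {"circle_name": "Circle B", "geohash4": "dr5r", "count_date": "1950-12-26"},
]
stations = [
    {"id": "US0000001", "geohash4": "9q9p"},
    {"id": "CA0000002", "geohash4": "9q9p"},
    {"id": "US0000003", "geohash4": "dr5r"},
]
# Weather keyed by (station id, date); US0000003 has no measurement on the count date.
weather = {
    ("US0000001", "1950-12-25"): {"temp_min_value": -1.5},
    ("CA0000002", "1950-12-25"): {"temp_min_value": 2.0},
}


def match_and_filter(circles, stations, weather):
    """Hypothetical helper that mirrors the notebook's steps on in-memory data."""
    rows = []
    for circle in circles:
        for station in stations:
            # Inner join: keep only circle/station pairs in the same geohash cell.
            if circle["geohash4"] != station["geohash4"]:
                continue
            # Left join: the pair is kept even when no measurement exists.
            measurement = weather.get((station["id"], circle["count_date"]), {})
            rows.append({**circle, **station,
                         "temp_min_value": measurement.get("temp_min_value")})
    # Prune rows without a minimum temperature, then keep US stations only.
    kept = [r for r in rows
            if r["temp_min_value"] is not None and r["id"].startswith("US")]
    return rows, kept


all_rows, kept_rows = match_and_filter(circles, stations, weather)
print(len(all_rows), "matched pairs,", len(kept_rows), "kept after filtering")
```

One circle can match several stations, which is why the output is a "circles to many stations" table.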

## Where the Data will Be Saved 
The output file is written to the same directory as this notebook.

## NOTES on Running This Notebook
If you get strange errors from the BigQuery module, try completely stopping your notebook kernel and restarting it. Some odd errors can occur when running BigQuery from a notebook.

In [18]:
# Imports
import os
from datetime import datetime
# Version .24.0
from google.cloud import bigquery
import pandas as pd

In [19]:
# Set up the environment

# The path to your json credentials file. Replace with your corresponding file.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "your_path_to_google_auth_keys.json"

# Timestamp used to tag the output file name
time_now = datetime.today().strftime('%Y%m%d%H%M%S')

client = bigquery.Client()
project = 'birdproject-2020'
source_dataset_id = 'audubon_cdc'
# source_table_id = 'us_states'
shared_dataset_ref = client.dataset(source_dataset_id)

In [20]:
client

<google.cloud.bigquery.client.Client at 0x104833be0>

## Load in the Most Recent Data File 
This step is not required, but it is good practice to confirm that the file is there and can be read correctly.
The next section loads the data as part of the upload to BigQuery.

In [21]:
# ALL File Paths should be declared at the TOP of the notebook
PATH_TO_CLEAN_CBC_DATA = "../data/Cloud_Data/1.0-rec-initial-data-cleaning.txt"

In [22]:
clean_data = pd.read_csv(PATH_TO_CLEAN_CBC_DATA, encoding = "ISO-8859-1", sep="\t")



In [23]:
clean_data.head()

Unnamed: 0.1,Unnamed: 0,circle_name,country_state,lat,lon,count_year,count_date,n_field_counters,n_feeder_counters,min_field_parties,...,max_snow_metric,max_snow_imperial,min_temp_imperial,max_temp_imperial,min_temp_metric,max_temp_metric,min_wind_metric,max_wind_metric,min_wind_imperial,max_wind_imperial
0,2,Pacific Grove,US-CA,36.6167,-121.9167,1901,12/25/00,1.0,,,...,,,,,,,,,,
1,3,Pueblo,US-CO,38.175251,-104.519575,1901,12/25/00,1.0,,,...,,,,,,,,,,
2,4,Bristol,US-CT,41.6718,-72.9495,1901,12/25/00,2.0,,,...,,,,,,,,,,
3,5,Norwalk,US-CT,41.1167,-73.4,1901,12/25/00,1.0,,,...,,,,,,,,,,
4,6,Glen Ellyn,US-IL,41.8833,-88.0667,1901,12/25/00,1.0,,,...,,,,,,,,,,


## Push this data up to BigQuery

In [24]:
# Set up the table name
table_id = 'rec_initial_data_cleaning'

# shared_dataset_ref was created in the environment setup cell above
table_ref = shared_dataset_ref.table(table_id)

table_full = project + "." + source_dataset_id + "." + table_id

In [25]:
# Delete the existing table if it exists so we can replace it with new data
client.delete_table(table_full, not_found_ok=True)  # Make an API request.
print("Deleted table '{}'.".format(table_full))

Deleted table 'birdproject-2020.audubon_cdc.rec_initial_data_cleaning'.


In [26]:
# Push our file up to BigQuery
filename = PATH_TO_CLEAN_CBC_DATA

# Build the Job Config
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1
job_config.autodetect = True


with open(filename, "rb") as source_file:
    job = client.load_table_from_file(source_file, table_ref, job_config=job_config)
job.result()  # Waits for table load to complete.
print("Loaded {} rows into {}:{}.".format(job.output_rows, source_dataset_id, table_id))

Loaded 89568 rows into audubon_cdc:rec_initial_data_cleaning.


## Build the Query and Submit it 
This query takes the CDC bird count data, matches each count circle with the NOAA stations in close proximity to it, and then extracts the NOAA station weather measurements pertinent to the count dates. After the data is extracted, the rows that have a NULL `temp_min_value` are pruned and only USA weather measurements are kept.
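
To make the proximity matching concrete, here is a minimal pure-Python sketch of the standard geohash encoding. It is an independent illustration, not BigQuery's `ST_GEOHASH` implementation, and the `geohash` function name and the sample coordinates are assumptions made for this example. It shows how a 4-character hash (the matching key) groups nearby points into one cell, while a 7-character hash (used as `circle_id`) still tells them apart.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash(lat, lon, precision):
    """Encode a latitude/longitude pair as a geohash string of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bit_count = 0
    value = 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = value * 2 + 1
                lon_lo = mid
            else:
                value = value * 2
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = value * 2 + 1
                lat_lo = mid
            else:
                value = value * 2
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            bit_count = 0
            value = 0
    return "".join(chars)


# Two nearby sample points (illustrative coordinates only).
point_a = (57.64911, 10.40744)
point_b = (57.60, 10.30)

cell_a = geohash(*point_a, 4)
cell_b = geohash(*point_b, 4)
print(cell_a, cell_b, cell_a == cell_b)                # same 4-character cell
print(geohash(*point_a, 7) == geohash(*point_b, 7))    # distinct 7-character ids
```

One limitation of matching on an exact geohash prefix is that two points sitting close together on opposite sides of a cell boundary get different hashes and are not matched.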

In [None]:
query = f"""
WITH circles_hash AS (
    SELECT x.*,
           ST_GEOHASH(ST_GEOGPOINT(x.lon, x.lat), 4) AS geohash_circle,
           ST_GEOHASH(ST_GEOGPOINT(x.lon, x.lat), 7) AS circle_id
    FROM `{project}.audubon_cdc.rec_initial_data_cleaning` x
),
stations_hash AS (
    SELECT y.*,
           ST_GEOHASH(ST_GEOGPOINT(y.longitude, y.latitude), 4) AS geohash_station
    FROM `bigquery-public-data`.ghcn_d.ghcnd_stations y
),
circle_with_matched_stations AS (
    SELECT *
    FROM circles_hash x
    INNER JOIN stations_hash y ON x.geohash_circle = y.geohash_station
)
SELECT x.*, y.temp_min_value, y.temp_max_value, y.precipitation_value, y.temp_avg, y.snow, y.snwd
FROM circle_with_matched_stations x
LEFT JOIN `{project}.audubon_cdc.flatten_noaa_from_1900_to_present` y
    ON x.id = y.id AND x.count_date = y.date
ORDER BY circle_id DESC, count_date ASC
"""

# Submit the query to BigQuery. This returns a QueryJob;
# it is converted to a DataFrame in the next cell.
df_circles_to_stations_weather_data = client.query(query)


In [None]:
df_circles_to_stations_weather_data = df_circles_to_stations_weather_data.to_dataframe()

In [None]:
df_circles_to_stations_weather_data.shape

In [None]:
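# Save the unfiltered query result as a gzip-compressed CSV.
# Note: the file name ends in .csv even though the content is gzip, and the
# final save at the end of the notebook reuses this name and overwrites this file.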
df_circles_to_stations_weather_data.to_csv(r'1.1-circles_to_many_stations_usa_weather_data_' + str(time_now) +  '.csv', compression = "gzip")



## Top 5 records
The first 5 records of the data returned by the query above.

In [None]:
df_circles_to_stations_weather_data.head()

## Statistics on dataset
Count how many records are missing each of the weather measurements.

In [None]:
record_count = len(df_circles_to_stations_weather_data.index)
print('Total rows in dataset (including rows with missing values): ', record_count)

temp_min_nas = df_circles_to_stations_weather_data.temp_min_value.isna().sum()
print("Missing min temperature: " + str(temp_min_nas))

temp_max_nas = df_circles_to_stations_weather_data.temp_max_value.isna().sum()
print("Missing max temperature: " + str(temp_max_nas))

temp_avg_nas = df_circles_to_stations_weather_data.temp_avg.isna().sum()
print("Missing avg temperature: " + str(temp_avg_nas))

snow_nas = df_circles_to_stations_weather_data.snow.isna().sum()
print("Missing snow: " + str(snow_nas))

## Remove empty min temperature
Create a new data frame that keeps only the rows with a non-null `temp_min_value`.

In [None]:
ref = df_circles_to_stations_weather_data.temp_min_value
paired_data = df_circles_to_stations_weather_data[ref.notna()]
paired_data.head()

In [None]:
paired_data.shape

## Size of dataframe

In [None]:
print("The total number of records in this data set is: ", len(paired_data.circle_name))

## Only Data in the USA
Create a new data frame that keeps only the stations located in the USA (station ids that start with "US").

In [None]:
paired_data_usa = paired_data[paired_data.id.str.slice(stop=2)=="US"]
print("The number of rows of USA station weather data: ", len(paired_data_usa))

In [None]:
# Save the USA-only station weather data as a gzip-compressed CSV
# (note: the file name ends in .csv even though the content is gzip)
paired_data_usa.to_csv(r'1.1-circles_to_many_stations_usa_weather_data_' + str(time_now) +  '.csv', compression = "gzip")
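
Because the saved file is gzip-compressed but its name ends in `.csv`, a reader that guesses the format from the extension may treat it as plain text. The following is a small standard-library sketch of how such a file can be checked and read back. The file name and the two sample rows are hypothetical and only mimic the naming pattern used above.

```python
import csv
import gzip
import os
import tempfile

# Hypothetical file that mimics the notebook's pattern: gzip content, ".csv" suffix.
path = os.path.join(tempfile.mkdtemp(), "example_weather_data.csv")

rows = [["id", "temp_min_value"], ["US0000001", "-1.5"]]  # made-up sample rows
with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
    csv.writer(handle).writerows(rows)

# A gzip stream always starts with the two magic bytes 0x1f 0x8b.
with open(path, "rb") as handle:
    is_gzip = handle.read(2) == b"\x1f\x8b"

# Read the file back by opening it explicitly as gzip.
with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
    read_back = list(csv.reader(handle))

print(is_gzip, read_back == rows)
```

With pandas, the equivalent is to pass `compression="gzip"` explicitly to `pd.read_csv`, since the default inference relies on the file extension.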

