# NOAA Data via BigQuery

**[NOAA](https://data.noaa.gov/dataset/dataset/global-surface-summary-of-the-day-gsod) Global Surface Summary of the Day**

**Prerequisites:**
1. Install the following in your dev environment:<br>
    a. google-cloud-bigquery: pip install google-cloud-bigquery<br>
    b. db-dtypes: pip install db-dtypes<br>
2. Install the gcloud CLI<br>
    a. Install directions (with download link): https://cloud.google.com/sdk/docs/install<br>
    > i. Pay attention to where it installs; you need that location for step b.<br>
    > ii. The installer says to leave all the shortcut and open-terminal options checked. I received errors when it ran "gcloud info --run-diagnostics", and I ignored them for now...<br>
    
    b. Add the install location to your PATH environment variable (for me this was C:\Users\vt_be\AppData\Local\Google\Cloud SDK\google-cloud-sdk)<br>
    c. Reboot!<br>
    d. Open Git Bash and switch to your dev environment<br>
    > i. "gcloud info --run-diagnostics" now ran without issue<br>
    > ii. Add authentication (this opens a browser to connect your Google account): gcloud auth application-default login<br>
    
    e. I also needed to set up a BigQuery project; I mostly followed https://cloud.google.com/bigquery/docs/sandbox<br>
    > i. I didn't see the stuff mentioned in #3, but otherwise it worked<br>
    > ii. Note that when you create the project, an ID is generated in the form project-name-#### (for me, BootCamp-Weather became bootcamp-weather-400118)<br>
    
    f. Set the project as the default quota project. Back in Git Bash: gcloud auth application-default set-quota-project <project-id><br>
    g. In the downloaded notebook, add the project ID to client = bigquery.Client("project-id") in the first cell<br>
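
Before opening the notebook, it can help to confirm that the pieces above are in place. The following is a minimal sketch and not part of the original notebook: the helper names `module_available` and `preflight` are made up for illustration, and the check only confirms that the two packages can be found and that `gcloud` is on PATH. It does not verify that authentication succeeded.

```python
import importlib.util
import shutil


def module_available(name):
    """Return True if the named module can be located, without fully importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package (e.g. "google") raises instead of returning None.
        return False


def preflight():
    """Report which prerequisites from the list above appear to be installed."""
    return {
        "google-cloud-bigquery": module_available("google.cloud.bigquery"),
        "db-dtypes": module_available("db_dtypes"),
        "gcloud on PATH": shutil.which("gcloud") is not None,
    }


for item, ok in preflight().items():
    print(f"{item}: {'found' if ok else 'MISSING'}")
```

If `gcloud on PATH` reports missing right after editing PATH, a new terminal session (or the reboot in step c) may be needed before the change is visible.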
    

**Credit:**
* BigQuery calls adapted from https://www.kaggle.com/code/crained/noaa-dataset-with-google-bigquery
* SQL calls adapted from GitHub BigQuery documentation: https://github.com/googleapis/python-bigquery

In [1]:
# My project ID (I don't think it can be shared across people) is stored in a config.py file as "google_project"
# Since this is unique to each user, I added config.py to the gitignore. You must create your own config.py file with your project ID
from config import google_project
# bigquery and pandas work well together for dataframes!
import pandas as pd
import os
# Follow the prerequisite instructions to get bigquery going
from google.cloud import bigquery
# Create a Client object that references a Google project for which your system has been authenticated
client = bigquery.Client(project=google_project)
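
Because `config.py` is git-ignored, it has to be created by hand. A minimal sketch of what it might contain is shown here; the placeholder `"your-project-id"` and the fallback to the `GOOGLE_CLOUD_PROJECT` environment variable are assumptions for illustration, not something the original notebook does.

```python
# config.py (hypothetical contents; keep this file out of version control)
import os

# Use the GOOGLE_CLOUD_PROJECT environment variable when it is set and non-empty;
# otherwise fall back to a placeholder that you must replace with your own project ID.
google_project = os.environ.get("GOOGLE_CLOUD_PROJECT") or "your-project-id"
```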


## Cold-only query: minimum temperature data by station and by state ##
This section produces two outputs:
* a single file by station with just the absolute min temp, in ascending order within each state
* a single file by state with the stats for the absolute min temp: min, max, mean, and std deviation of the min temps read across all stations within a state

In [2]:
QUERY = (
    'SELECT s.state, s.name, '
    'MIN(g.min) AS min_temp '
    'FROM `bigquery-public-data.noaa_gsod.gsod2022` AS g '
    'INNER JOIN `bigquery-public-data.noaa_gsod.stations` AS s ON g.stn = s.usaf '
    'WHERE s.country = "US" AND s.state <> "None" '
    # The line below removes the 'not a reading' placeholder (9999.9) so we can run stats on this column
    'AND g.min <> 9999.9 '
    'GROUP BY s.state, s.name '
    'ORDER BY s.state, min_temp'
    )
# API request
state_cold_result = client.query(QUERY)  

# Waits for query to finish
state_cold_data = state_cold_result.result()  

# Put the query results into a DataFrame
state_cold_station = state_cold_result.to_dataframe()

# and export
state_cold_station.to_json("GEOJSON_data/state_station_cold.json", orient="records")
# state_cold_station.to_csv("GEOJSON_data/state_station_cold.csv")
state_cold_station.to_json("GEOJSON_data/state_station_cold.js", orient="records")

state_cold_station

| | state | name | min_temp |
|---|---|---|---|
| 0 | AK | EAGLE AIRPORT | -59.1 |
| 1 | AK | NORTHWAY AIRPORT | -59.1 |
| 2 | AK | PUNTILLA LAKE | -57.1 |
| 3 | AK | PRUDHOE BAY | -57.1 |
| 4 | AK | CAPE DECISION | -57.1 |
| ... | ... | ... | ... |
| 4957 | WY | KEMMERER MUNICIPAL AIRPORT | -9.4 |
| 4958 | WY | RCK SRINGS-SWETWTER CO APT | -9.0 |
| 4959 | WY | FORT BRIDGER AIRPORT | -7.8 |
| 4960 | WY | EVANSTON UINTA CO BU | -6.0 |


In the table above, rows within each state are already sorted by min_temp in ascending order, so the first ten rows for a state are its ten coldest stations.
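
Since the export uses `orient="records"`, anything that reads the JSON file sees a list of dictionaries already grouped and sorted by state. As a rough sketch of how the top rows per state could be pulled from such records (the function name `top_n_per_state` is hypothetical, and the sample rows are copied from the table above):

```python
from itertools import groupby, islice
from operator import itemgetter


def top_n_per_state(records, n=10):
    """Keep the first n records of each state, preserving the existing order.

    Assumes records for the same state are contiguous, as they are after
    ORDER BY s.state in the query.
    """
    return {
        state: list(islice(rows, n))
        for state, rows in groupby(records, key=itemgetter("state"))
    }


# A few rows in the same shape as the exported JSON records
records = [
    {"state": "AK", "name": "EAGLE AIRPORT", "min_temp": -59.1},
    {"state": "AK", "name": "NORTHWAY AIRPORT", "min_temp": -59.1},
    {"state": "AK", "name": "PUNTILLA LAKE", "min_temp": -57.1},
    {"state": "WY", "name": "FORT BRIDGER AIRPORT", "min_temp": -7.8},
    {"state": "WY", "name": "EVANSTON UINTA CO BU", "min_temp": -6.0},
]

top_two = top_n_per_state(records, n=2)
print([row["name"] for row in top_two["AK"]])
```

Inside the notebook itself, `state_cold_station.groupby("state").head(10)` should give the equivalent result directly on the DataFrame.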

In [3]:
# Get the min, max, mean, and std of the stations' absolute minimum temperatures by state, to use in hover text
state_cold_summary = state_cold_station[["state", "min_temp"]].groupby("state").agg(["min", "max", "mean", "std"])

# # and export
# state_cold_summary.to_json("GEOJSON_data/State_cold_summary.json", orient="index")
# state_cold_summary.to_csv("GEOJSON_data/State_cold_summary.csv")

print(len(state_cold_summary))
state_cold_summary.head()

53


| state | min_temp min | min_temp max | min_temp mean | min_temp std |
|---|---|---|---|---|
| AK | -59.1 | 53.6 | -28.521727 | 26.105537 |
| AL | -57.1 | 46.9 | -11.885714 | 34.861816 |
| AR | -57.1 | 19.0 | -7.362319 | 22.610019 |
| AZ | -57.1 | 37.0 | -7.793478 | 38.674726 |
| CA | -57.1 | 45.5 | 3.082334 | 39.807844 |


Across Alaska's stations, the absolute minimum temperature for 2022 ranged from -59.1 to 53.6, with a mean of -28.5 and a standard deviation of 26.1.
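
One thing worth checking in the summary above: the same minimum, -57.1, appears for AL, AR, AZ, and CA, which looks unlikely for those states. A possible cause (an assumption, not verified against the live tables) is that the join matches on `usaf` alone while several stations share a placeholder `usaf` value, so readings from one station get attributed to others. The sketch below reproduces that effect with made-up rows in an in-memory SQLite database. The `wban` column, the placeholder value `'999999'`, and every row are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stations (usaf TEXT, wban TEXT, name TEXT, state TEXT);
CREATE TABLE gsod (stn TEXT, wban TEXT, min REAL);
-- Two toy stations that share the same usaf value but differ in wban
INSERT INTO stations VALUES
  ('999999', '00001', 'TOY COLD STATION', 'AK'),
  ('999999', '00002', 'TOY WARM STATION', 'AL');
INSERT INTO gsod VALUES
  ('999999', '00001', -57.1),
  ('999999', '00002', 20.0);
""")


def station_minimums(join_condition):
    """Run the notebook-style aggregation with the given join condition."""
    sql = f"""
        SELECT s.state, s.name, MIN(g.min) AS min_temp
        FROM gsod AS g
        INNER JOIN stations AS s ON {join_condition}
        GROUP BY s.state, s.name
        ORDER BY s.state, min_temp
    """
    return conn.execute(sql).fetchall()


# Joining on usaf alone lets every reading match both toy stations
usaf_only = station_minimums("g.stn = s.usaf")
# Joining on both keys keeps each reading with its own station
usaf_and_wban = station_minimums("g.stn = s.usaf AND g.wban = s.wban")

print(usaf_only)
print(usaf_and_wban)
```

With the single-key join, the toy AL station inherits the AK reading. If the real tables behave the same way, adding `AND g.wban = s.wban` to the notebook's join would be the thing to try, and the tables shown here would change as a result.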

## Hot-only query: maximum temperature data by station and by state ##
This section produces two outputs:
* a single file by station with just the absolute max temp, in descending order within each state
* a single file by state with the stats for the absolute max temp: min, max, mean, and std deviation of the max temps read across all stations within a state
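
The hot query in this section is the cold query with `MAX` in place of `MIN`, the `max` column in place of `min`, and a descending sort. As a sketch of how the two could share one definition (the function name `build_extreme_query` and its parameters are hypothetical, not part of the original notebook):

```python
def build_extreme_query(kind, year=2022):
    """Return the per-station extreme-temperature SQL for kind 'cold' or 'hot'."""
    settings = {
        "cold": ("MIN", "min", "min_temp", "ASC"),
        "hot": ("MAX", "max", "max_temp", "DESC"),
    }
    if kind not in settings:
        raise ValueError(f"kind must be 'cold' or 'hot', got {kind!r}")
    func, column, alias, direction = settings[kind]
    year = int(year)  # only whole years make sense in the table name
    return (
        f"SELECT s.state, s.name, {func}(g.{column}) AS {alias} "
        f"FROM `bigquery-public-data.noaa_gsod.gsod{year}` AS g "
        "INNER JOIN `bigquery-public-data.noaa_gsod.stations` AS s ON g.stn = s.usaf "
        'WHERE s.country = "US" AND s.state <> "None" '
        # 9999.9 marks 'not a reading' and must be excluded before aggregating
        f"AND g.{column} <> 9999.9 "
        "GROUP BY s.state, s.name "
        f"ORDER BY s.state, {alias} {direction}"
    )


print(build_extreme_query("hot"))
```

The returned string can be passed to `client.query(...)` in the same way as `QUERY` in the notebook cells.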

In [4]:
QUERY = (
    'SELECT s.state, s.name, '
    'MAX(g.max) AS max_temp '
    'FROM `bigquery-public-data.noaa_gsod.gsod2022` AS g '
    'INNER JOIN `bigquery-public-data.noaa_gsod.stations` AS s ON g.stn = s.usaf '
    'WHERE s.country = "US" AND s.state <> "None" '
    # The line below removes the 'not a reading' placeholder (9999.9) so we can run stats on this column
    'AND g.max <> 9999.9 '
    'GROUP BY s.state, s.name '
    'ORDER BY s.state, max_temp DESC'
    )
# API request
state_hot_result = client.query(QUERY)  

# Waits for query to finish
state_hot_data = state_hot_result.result()  

# Put the query results into a DataFrame
state_hot_station = state_hot_result.to_dataframe()

# and export
state_hot_station.to_json("GEOJSON_data/state_station_hot.json", orient="records")
# state_hot_station.to_csv("GEOJSON_data/state_station_hot.csv")
state_hot_station.to_json("GEOJSON_data/state_station_hot.js", orient="records")

state_hot_station

| | state | name | max_temp |
|---|---|---|---|
| 0 | AK | CAPE DECISION | 122.4 |
| 1 | AK | CAPE SPENCER | 122.4 |
| 2 | AK | GUSTAVUS | 122.4 |
| 3 | AK | HAINES | 122.4 |
| 4 | AK | KETCHIKAN TONGASS | 122.4 |
| ... | ... | ... | ... |
| 4957 | WY | BOYSEN THERMOPOL | 87.8 |
| 4958 | WY | BOYSEN | 87.8 |
| 4959 | WY | DIXON AIRPORT | 86.5 |
| 4960 | WY | YELLOWSTONE | 82.9 |


In the table above, rows within each state are already sorted by max_temp in descending order, so the first ten rows for a state are its ten hottest stations.

In [5]:
# Get the min, max, mean, and std of the stations' absolute maximum temperatures by state, to use in hover text
state_hot_summary = state_hot_station[["state", "max_temp"]].groupby("state").agg(["min", "max", "mean", "std"])

# # and export
# state_hot_summary.to_json("GEOJSON_data/State_hot_summary.json", orient="index")
# state_hot_summary.to_csv("GEOJSON_data/State_hot_summary.csv")

print(len(state_hot_summary))
state_hot_summary.head()

53


| state | max_temp min | max_temp max | max_temp mean | max_temp std |
|---|---|---|---|---|
| AK | 50.0 | 122.4 | 87.950139 | 22.01016 |
| AL | 81.3 | 122.4 | 107.445714 | 11.627512 |
| AR | 98.1 | 122.4 | 106.37971 | 7.470483 |
| AZ | 78.1 | 122.4 | 111.845652 | 10.837059 |
| CA | 57.2 | 122.4 | 109.352366 | 12.2001 |


Across Alaska's stations, the absolute maximum temperature for 2022 ranged from 50.0 to 122.4, with a mean of 87.95 and a standard deviation of 22.0.
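
The summary statistics are meant for hover text, so a sentence like the one above can be generated per state rather than written by hand. A minimal sketch, assuming each state's stats are available as a dictionary with `min`, `max`, `mean`, and `std` keys (the function name `hover_text` and the wording of the string are made up for illustration):

```python
def hover_text(state, stats, label="maximum"):
    """Format one state's summary stats as a single hover-text line."""
    return (
        f"{state}: station {label}s ranged from {stats['min']:.1f} "
        f"to {stats['max']:.1f} (mean {stats['mean']:.1f}, std {stats['std']:.1f})"
    )


# Values taken from the AK row of the summary table above
ak_hot = {"min": 50.0, "max": 122.4, "mean": 87.950139, "std": 22.01016}
print(hover_text("AK", ak_hot))
```

Passing `label="minimum"` with a row from the cold summary gives the matching text for the cold map.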