# NOAA Data via BigQuery

**[NOAA](https://data.noaa.gov/dataset/dataset/global-surface-summary-of-the-day-gsod) Global Surface Summary of the Day**

**Prerequisites:**
1. Install the following in your dev environment:<br>
    a. google-cloud-bigquery: pip install google-cloud-bigquery<br>
    b. db-dtypes: pip install db-dtypes<br>
2. Install gcloud CLI <br>
    a. Install directions (with download link): https://cloud.google.com/sdk/docs/install<br>
    > i. Pay attention to where it installs!<br>
    > ii. The installer says to leave all the shortcut and open-terminal options checked. I received errors when it ran "gcloud info --run-diagnostics" and ignored them for now...<br>
    
    b. Add the install location to your PATH environment variable (for me this was C:\Users\vt_be\AppData\Local\Google\Cloud SDK\google-cloud-sdk)<br>
    c. Reboot!<br>
    d. Open Git Bash and switch to your dev environment<br>
    > i. "gcloud info --run-diagnostics" now ran without issue<br>
    > ii. Add authentication (this opens a browser to connect your Google account): gcloud auth application-default login<br>
    
    e. I also needed to set up a BigQuery project; I mostly followed https://cloud.google.com/bigquery/docs/sandbox<br>
    > i. I didn't see the items mentioned in #3, but everything else worked<br>
    > ii. Note that when you create the project, an ID is generated in the form "project name-####" (for me, BootCamp-Weather became bootcamp-weather-400118)<br>
    
    f. Set the project as the default quota project. Back in Git Bash: gcloud auth application-default set-quota-project <project-id><br>
    g. In the downloaded notebook, add the project ID to client = bigquery.Client("project-id") in the first cell<br>
    

**Credit:**
* BigQuery calls adapted from https://www.kaggle.com/code/crained/noaa-dataset-with-google-bigquery
* SQL calls adapted from GitHub BigQuery documentation: https://github.com/googleapis/python-bigquery

In [1]:
# My project ID (I don't think it can be shared across people) is stored in a config.py file as "google_project"
# Since it is unique to each user, config.py is listed in .gitignore. You must create your own config.py file with your project ID
from config import google_project
# bigquery and pandas work well together for dataframes!
import pandas as pd
import os
# Follow the prerequisite instructions to get bigquery going
from google.cloud import bigquery
# Create a "Client" object
client = bigquery.Client(project=google_project)
# Construct a reference to the dataset
# (client.dataset() is deprecated in recent google-cloud-bigquery releases)
dataset_ref = bigquery.DatasetReference("bigquery-public-data", "noaa_gsod")
# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)
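
The cell above imports google_project from a config.py file that is not shipped with the notebook. A minimal sketch of what that file might contain is shown below. The environment variable name GOOGLE_PROJECT and the placeholder ID are assumptions made for illustration, not part of the original notebook.

```python
# config.py (kept out of version control via .gitignore)
import os

# Hypothetical placeholder; replace it with the ID generated for your own project.
_PLACEHOLDER_PROJECT_ID = "your-project-id"

# Assumption: an environment variable named GOOGLE_PROJECT may override the
# value, so the ID does not have to be typed into any shared file.
google_project = os.environ.get("GOOGLE_PROJECT", _PLACEHOLDER_PROJECT_ID)
```

With this file next to the notebook, `from config import google_project` should work as written in the first cell.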

In [2]:
# List all the tables in the dataset
tables = list(client.list_tables(dataset))

# Print the names of the first 5 and last 5 tables in the dataset
print('These are the first 5 tables in the DB')
for table in tables[0:5]:  
    print(table.table_id)
print('These are the last 5 tables in the DB')
for table in tables[-5:]:  
    print(table.table_id)
print(f'There are a total of {len(tables)} tables in the DB')

These are the first 5 tables in the DB
gsod1929
gsod1930
gsod1931
gsod1932
gsod1933
These are the last 5 tables in the DB
gsod2020
gsod2021
gsod2022
gsod2023
stations
There are a total of 96 tables in the DB
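
The output above suggests one table per year, named gsodYYYY, plus a stations table. As a rough sketch, the years covered can be pulled out of the table IDs like this. The helper name yearly_table_years is hypothetical, and the sample list only contains the ten IDs printed above.

```python
import re

def yearly_table_years(table_ids):
    """Return the sorted years found in table IDs of the form gsodYYYY."""
    years = []
    for table_id in table_ids:
        match = re.fullmatch(r"gsod(\d{4})", table_id)
        if match:  # skips non-yearly tables such as "stations"
            years.append(int(match.group(1)))
    return sorted(years)

# Table IDs copied from the output above (first five and last five).
sample_ids = ["gsod1929", "gsod1930", "gsod1931", "gsod1932", "gsod1933",
              "gsod2020", "gsod2021", "gsod2022", "gsod2023", "stations"]
years = yearly_table_years(sample_ids)
print(years[0], years[-1])  # 1929 2023
```

In the notebook, the same helper could be called as `yearly_table_years(t.table_id for t in tables)`. If every year from 1929 to 2023 is present, that gives 95 yearly tables, which together with stations would match the total of 96.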


In [3]:
# Construct a reference to the gsod2020 table
table_ref = dataset_ref.table("gsod2020")

# API request - fetch the table
table = client.get_table(table_ref)

In [4]:
# Print information on all the columns
table.schema

[SchemaField('stn', 'STRING', 'NULLABLE', None, 'Cloud - GSOD NOAA', (), None),
 SchemaField('wban', 'STRING', 'NULLABLE', None, 'WBAN number where applicable--this is the historical "Weather Bureau Air Force Navy" number - with WBAN being the acronym', (), None),
 SchemaField('date', 'DATE', 'NULLABLE', None, 'Date of the weather observations', (), None),
 SchemaField('year', 'STRING', 'NULLABLE', None, 'The year', (), None),
 SchemaField('mo', 'STRING', 'NULLABLE', None, 'The month', (), None),
 SchemaField('da', 'STRING', 'NULLABLE', None, 'The day', (), None),
 SchemaField('temp', 'FLOAT', 'NULLABLE', None, 'Mean temperature for the day in degrees Fahrenheit to tenths. Missing = 9999.9', (), None),
 SchemaField('count_temp', 'INTEGER', 'NULLABLE', None, 'Number of observations used in calculating mean temperature', (), None),
 SchemaField('dewp', 'FLOAT', 'NULLABLE', None, 'Mean dew point for the day in degreesm Fahrenheit to tenths.  Missing = 9999.9', (), None),
 SchemaField('cou

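Each SchemaField tells us about a specific column (which we also refer to as a field). Reading each entry from left to right, the information includes:

* The name of the column
* The field type (or datatype) in the column
* The mode of the column ('NULLABLE' means that a column allows NULL values, and is the default)
* A description of the data in that column

The other entries in each SchemaField (None, (), and None in the output above) are optional settings that are not used by this table.

The first field has the SchemaField:

SchemaField('stn', 'STRING', 'NULLABLE', None, 'Cloud - GSOD NOAA', (), None)

This tells us:

* the field (or column) is called stn,
* the data in this field is strings,
* NULL values are allowed, and
* its description is 'Cloud - GSOD NOAA', which is not very informative. The queries later in this notebook match stn against the usaf column of the stations table, so it appears to hold the station identifier.
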
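
As a rough sketch, the four pieces of information listed above can be collected into one tuple per column. The helper name summarize_schema and the FakeField stand-in are hypothetical; FakeField only exists so the example runs without a BigQuery connection, and it models just the name, field_type, mode and description attributes.

```python
from collections import namedtuple

# Stand-in for bigquery.SchemaField so the sketch runs offline; only the
# four attributes used below are modeled.
FakeField = namedtuple("FakeField", ["name", "field_type", "mode", "description"])

def summarize_schema(schema):
    """Return one (name, type, mode, description) tuple per column."""
    return [(f.name, f.field_type, f.mode, f.description) for f in schema]

# Values copied from the first and third entries of the schema output above.
demo_schema = [
    FakeField("stn", "STRING", "NULLABLE", "Cloud - GSOD NOAA"),
    FakeField("date", "DATE", "NULLABLE", "Date of the weather observations"),
]
for row in summarize_schema(demo_schema):
    print(row)
```

In the notebook, `summarize_schema(table.schema)` should give the same kind of list for the real table.
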
We can use the list_rows() method to check just the first five rows of the gsod2020 table to make sure this is right. (Sometimes databases have outdated descriptions, so it's good to check.) This returns a BigQuery RowIterator object that can quickly be converted to a pandas DataFrame with the to_dataframe() method.

In [5]:
# Preview the first five rows of the gsod2020 table
client.list_rows(table, max_results=5).to_dataframe()

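| | stn | wban | date | year | mo | da | temp | count_temp | dewp | count_dewp | ... | flag_min | prcp | flag_prcp | sndp | fog | rain_drizzle | snow_ice_pellets | hail | thunder | tornado_funnel_cloud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10030 | 99999 | 2020-12-28 | 2020 | 12 | 28 | 25.5 | 4 | 20.5 | 4 | ... | | 99.99 | | 999.9 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 10070 | 99999 | 2020-09-11 | 2020 | 9 | 11 | 42.8 | 4 | 38.1 | 4 | ... | | 0.0 | I | 999.9 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 10070 | 99999 | 2020-12-06 | 2020 | 12 | 6 | 16.8 | 4 | 8.6 | 4 | ... | | 0.0 | I | 999.9 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 10150 | 99999 | 2020-10-10 | 2020 | 10 | 10 | 47.1 | 4 | 9999.9 | 0 | ... | | 0.0 | I | 999.9 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 10350 | 99999 | 2020-10-10 | 2020 | 10 | 10 | 41.6 | 4 | 9999.9 | 0 | ... | | 0.0 | I | 999.9 | 0 | 0 | 0 | 0 | 0 | 0 |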
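
The schema descriptions state "Missing = 9999.9" for temp and dewp, and the preview shows that value in the dewp column. The sketch below replaces such markers with None before any analysis. Treating 99.99 in prcp and 999.9 in sndp as missing markers is an assumption based on the preview rows, so check it against the schema descriptions. The names MISSING_MARKERS and clean_row are hypothetical.

```python
# Marker values per column. The temp/dewp markers come from the schema
# descriptions; the prcp/sndp markers are assumptions based on the preview.
MISSING_MARKERS = {"temp": 9999.9, "dewp": 9999.9, "prcp": 99.99, "sndp": 999.9}

def clean_row(row):
    """Return a copy of a row dict with missing-value markers replaced by None."""
    cleaned = dict(row)
    for column, marker in MISSING_MARKERS.items():
        if cleaned.get(column) == marker:
            cleaned[column] = None
    return cleaned

# Row 3 of the preview, reduced to the relevant columns.
row = {"stn": "10150", "temp": 47.1, "dewp": 9999.9, "prcp": 0.0, "sndp": 999.9}
print(clean_row(row))
```

Without a step like this, a marker such as 9999.9 would be averaged in as if it were a real dew point.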


In [6]:
# Construct a reference to the stations table to see what it has
stations_ref = dataset_ref.table("stations")

# API request - fetch the table
stations_table = client.get_table(stations_ref)

# Print information on all the columns
stations_table.schema

[SchemaField('usaf', 'STRING', 'NULLABLE', None, '', (), None),
 SchemaField('wban', 'STRING', 'NULLABLE', None, '', (), None),
 SchemaField('name', 'STRING', 'NULLABLE', None, '', (), None),
 SchemaField('country', 'STRING', 'NULLABLE', None, '', (), None),
 SchemaField('state', 'STRING', 'NULLABLE', None, '', (), None),
 SchemaField('call', 'STRING', 'NULLABLE', None, '', (), None),
 SchemaField('lat', 'FLOAT', 'NULLABLE', None, '', (), None),
 SchemaField('lon', 'FLOAT', 'NULLABLE', None, '', (), None),
 SchemaField('elev', 'STRING', 'NULLABLE', None, '', (), None),
 SchemaField('begin', 'STRING', 'NULLABLE', None, '', (), None),
 SchemaField('end', 'STRING', 'NULLABLE', None, '', (), None)]

In [7]:
# Preview the first five lines of the stations table
client.list_rows(stations_table, max_results=5).to_dataframe()

| | usaf | wban | name | country | state | call | lat | lon | elev | begin | end |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7018 | 99999 | WXPOD 7018 | | | | 0.0 | 0.0 | 7018.0 | 20110309 | 20130730 |
| 1 | 7026 | 99999 | WXPOD 7026 | AF | | | 0.0 | 0.0 | 7026.0 | 20120713 | 20170822 |
| 2 | 7070 | 99999 | WXPOD 7070 | AF | | | 0.0 | 0.0 | 7070.0 | 20140923 | 20150926 |
| 3 | 8268 | 99999 | WXPOD8278 | AF | | | 32.95 | 65.567 | 1156.7 | 20100519 | 20120323 |
| 4 | 8307 | 99999 | WXPOD 8318 | AF | | | 0.0 | 0.0 | 8318.0 | 20100421 | 20100421 |


In [8]:
# Perform a filtering query to the stations table
# The trailing space on each line matters: Python joins these adjacent strings into one query with nothing in between
QUERY = (
    'SELECT usaf, name, country, state, lat, lon, elev FROM `bigquery-public-data.noaa_gsod.stations` '
    'WHERE country = "US" AND state = "TX" '
    'ORDER BY usaf DESC '
    'LIMIT 10')
query_job = client.query(QUERY)  # API request
tx_stations = query_job.result()  # Waits for query to finish

for row in tx_stations:
    print(row)

Row(('A05735', 'BOWIE MUNICIPAL AIRPORT', 'US', 'TX', 33.6, -97.783, '+0336.2'), {'usaf': 0, 'name': 1, 'country': 2, 'state': 3, 'lat': 4, 'lon': 5, 'elev': 6})
Row(('A00019', 'WILBARGER COUNTY AIRPORT', 'US', 'TX', 34.226, -99.284, '+0385.6'), {'usaf': 0, 'name': 1, 'country': 2, 'state': 3, 'lat': 4, 'lon': 5, 'elev': 6})
Row(('A00008', 'FAYETTE RGNL AIR CNTR ARP', 'US', 'TX', 29.908, -96.95, '+0098.8'), {'usaf': 0, 'name': 1, 'country': 2, 'state': 3, 'lat': 4, 'lon': 5, 'elev': 6})
Row(('A00002', 'BRENHAM MUNICIPAL AIRPORT', 'US', 'TX', 30.219, -96.374, '+0093.9'), {'usaf': 0, 'name': 1, 'country': 2, 'state': 3, 'lat': 4, 'lon': 5, 'elev': 6})
Row(('999999', 'ATHENS MUNICIPAL AIRPORT', 'US', 'TX', 32.164, -95.828, '+0135.3'), {'usaf': 0, 'name': 1, 'country': 2, 'state': 3, 'lat': 4, 'lon': 5, 'elev': 6})
Row(('999999', 'LAREDO AFB', 'US', 'TX', 27.533, -99.467, '+0154.8'), {'usaf': 0, 'name': 1, 'country': 2, 'state': 3, 'lat': 4, 'lon': 5, 'elev': 6})
Row(('999999', 'PALO PINTO
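
The comment in the cell above points out that the trailing spaces matter. The sketch below shows what happens when one is forgotten, and one possible alternative: a triple-quoted string, where line breaks act as whitespace in SQL. The table and column names are taken from the cell above; the variable name fused is only for illustration.

```python
# Adjacent string literals are concatenated as-is, so a missing trailing
# space fuses two SQL tokens together.
fused = (
    'SELECT usaf, name FROM `bigquery-public-data.noaa_gsod.stations`'
    'WHERE country = "US" AND state = "TX"'
)
print("stations`WHERE" in fused)  # True

# A triple-quoted string keeps the line breaks, which SQL treats as whitespace.
QUERY = """
    SELECT usaf, name, country, state, lat, lon, elev
    FROM `bigquery-public-data.noaa_gsod.stations`
    WHERE country = "US" AND state = "TX"
    ORDER BY usaf DESC
    LIMIT 10
"""
print("stations`WHERE" in QUERY)  # False
```

The triple-quoted QUERY can be passed to client.query() in the same way as the concatenated version.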

In [9]:
# Perform a query that pulls station string, min, mean, max temp, year, month, day from stations in TX
QUERY = (
    'SELECT stn, min, temp AS mean_temp, max, year, mo, da FROM `bigquery-public-data.noaa_gsod.gsod2022` '
    'WHERE stn IN (SELECT usaf FROM `bigquery-public-data.noaa_gsod.stations` WHERE country = "US" AND state = "TX") '
    'ORDER BY stn DESC '
    'LIMIT 10')
query_job = client.query(QUERY)  # API request
tx_station_measurement_data = query_job.result()  # Waits for query to finish

for row in tx_station_measurement_data:
    print(row)

Row(('A05735', 71.8, 79.2, 91.4, '2022', '05', '17'), {'stn': 0, 'min': 1, 'mean_temp': 2, 'max': 3, 'year': 4, 'mo': 5, 'da': 6})
Row(('A05735', 67.1, 81.1, 91.4, '2022', '09', '06'), {'stn': 0, 'min': 1, 'mean_temp': 2, 'max': 3, 'year': 4, 'mo': 5, 'da': 6})
Row(('A05735', 39.2, 44.0, 49.8, '2022', '12', '20'), {'stn': 0, 'min': 1, 'mean_temp': 2, 'max': 3, 'year': 4, 'mo': 5, 'da': 6})
Row(('A05735', 79.0, 90.7, 102.9, '2022', '08', '04'), {'stn': 0, 'min': 1, 'mean_temp': 2, 'max': 3, 'year': 4, 'mo': 5, 'da': 6})
Row(('A05735', 34.0, 53.1, 73.2, '2022', '03', '01'), {'stn': 0, 'min': 1, 'mean_temp': 2, 'max': 3, 'year': 4, 'mo': 5, 'da': 6})
Row(('A05735', 68.9, 73.9, 78.4, '2022', '10', '07'), {'stn': 0, 'min': 1, 'mean_temp': 2, 'max': 3, 'year': 4, 'mo': 5, 'da': 6})
Row(('A05735', 46.6, 66.3, 86.7, '2022', '03', '27'), {'stn': 0, 'min': 1, 'mean_temp': 2, 'max': 3, 'year': 4, 'mo': 5, 'da': 6})
Row(('A05735', 24.8, 36.5, 55.2, '2022', '02', '27'), {'stn': 0, 'min': 1, 'mean_t
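
The year, mo and da values in the rows above come back as strings. A small sketch for combining them into a single date is shown below; the helper name observation_date is hypothetical.

```python
from datetime import date

def observation_date(year, mo, da):
    """Combine the year, mo and da strings of a row into a datetime.date."""
    return date(int(year), int(mo), int(da))

# Values copied from the first row of the output above.
print(observation_date("2022", "05", "17"))  # 2022-05-17
```

The gsod2020 schema shown earlier also lists a date column of type DATE. Assuming gsod2022 has the same column, selecting it directly in the query would avoid this conversion.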

In [10]:
# Perform a query that pulls data from both the measurement and stations table
QUERY4 = (
    'SELECT s.name, g.min, g.temp AS mean_temp, g.max, g.year, g.mo, g.da FROM `bigquery-public-data.noaa_gsod.gsod2022` AS g '
    'INNER JOIN `bigquery-public-data.noaa_gsod.stations` AS s ON g.stn = s.usaf '
    'WHERE s.country = "US" AND s.state = "TX" '
    'LIMIT 10')
query_job4 = client.query(QUERY4)  # API request
tx_measurement_and_station_data = query_job4.result()  # Waits for query to finish

for row in tx_measurement_and_station_data:
    print(row)

Row(('EAGLE POINT', 71.4, 74.4, 77.0, '2022', '10', '22'), {'name': 0, 'min': 1, 'mean_temp': 2, 'max': 3, 'year': 4, 'mo': 5, 'da': 6})
Row(('MORGANS POINT', 68.4, 73.6, 79.5, '2022', '10', '22'), {'name': 0, 'min': 1, 'mean_temp': 2, 'max': 3, 'year': 4, 'mo': 5, 'da': 6})
Row(('PACKERY CHANNEL', 58.8, 63.8, 73.2, '2022', '10', '20'), {'name': 0, 'min': 1, 'mean_temp': 2, 'max': 3, 'year': 4, 'mo': 5, 'da': 6})
Row(('MUSTANG ISLAND A85A          ', 6.6, 8.8, 10.0, '2022', '01', '02'), {'name': 0, 'min': 1, 'mean_temp': 2, 'max': 3, 'year': 4, 'mo': 5, 'da': 6})
Row(('ATHENS MUNICIPAL AIRPORT', 6.6, 8.8, 10.0, '2022', '01', '02'), {'name': 0, 'min': 1, 'mean_temp': 2, 'max': 3, 'year': 4, 'mo': 5, 'da': 6})
Row(('LONGVIEW GREGG COUNTY AP', 6.6, 8.8, 10.0, '2022', '01', '02'), {'name': 0, 'min': 1, 'mean_temp': 2, 'max': 3, 'year': 4, 'mo': 5, 'da': 6})
Row(('COLLEGE STATION EASTERWOOD FL', 6.6, 8.8, 10.0, '2022', '01', '02'), {'name': 0, 'min': 1, 'mean_temp': 2, 'max': 3, 'year': 4, 

In [11]:
# Put the last query into a dataframe
query_job4.to_dataframe()

| | name | min | mean_temp | max | year | mo | da |
|---|---|---|---|---|---|---|---|
| 0 | EAGLE POINT | 71.4 | 74.4 | 77.0 | 2022 | 10 | 22 |
| 1 | MORGANS POINT | 68.4 | 73.6 | 79.5 | 2022 | 10 | 22 |
| 2 | PACKERY CHANNEL | 58.8 | 63.8 | 73.2 | 2022 | 10 | 20 |
| 3 | MUSTANG ISLAND A85A | 6.6 | 8.8 | 10.0 | 2022 | 1 | 2 |
| 4 | ATHENS MUNICIPAL AIRPORT | 6.6 | 8.8 | 10.0 | 2022 | 1 | 2 |
| 5 | LONGVIEW GREGG COUNTY AP | 6.6 | 8.8 | 10.0 | 2022 | 1 | 2 |
| 6 | COLLEGE STATION EASTERWOOD FL | 6.6 | 8.8 | 10.0 | 2022 | 1 | 2 |
| 7 | MINERAL WELLS FT WOLTERS AF | 6.6 | 8.8 | 10.0 | 2022 | 1 | 2 |
| 8 | PALO PINTO DEMPSEY AF | 6.6 | 8.8 | 10.0 | 2022 | 1 | 2 |
| 9 | STEPHENVILLE CLARK FIELD | 6.6 | 8.8 | 10.0 | 2022 | 1 | 2 |
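
Rows 3 to 9 of the result above show identical measurements for different station names. One possible explanation, which has not been verified against the data, is that several Texas stations share the usaf value '999999' (as the stations query output suggested), so a join on usaf alone matches a single measurement to each of them. Both tables also list a wban column, so joining on both columns may avoid this. The sketch below illustrates the idea with toy data. The wban values and the temperature are made up, and the SQL at the end is untested.

```python
# Toy stations: both names appear with usaf '999999' in the stations output;
# the wban values here are made up for illustration.
stations = [
    {"usaf": "999999", "wban": "00001", "name": "ATHENS MUNICIPAL AIRPORT"},
    {"usaf": "999999", "wban": "00002", "name": "LAREDO AFB"},
]
# One toy measurement row that belongs to the first station only.
measurements = [{"stn": "999999", "wban": "00001", "temp": 8.8}]

def join(on_wban):
    """Inner-join measurements to stations on usaf, optionally also on wban."""
    return [
        (s["name"], m["temp"])
        for m in measurements
        for s in stations
        if m["stn"] == s["usaf"] and (not on_wban or m["wban"] == s["wban"])
    ]

print(len(join(on_wban=False)))  # 2: the single measurement is duplicated
print(len(join(on_wban=True)))   # 1: only the matching station remains

# The same idea in the notebook's SQL (untested against BigQuery here).
QUERY = """
    SELECT s.name, g.min, g.temp AS mean_temp, g.max, g.year, g.mo, g.da
    FROM `bigquery-public-data.noaa_gsod.gsod2022` AS g
    INNER JOIN `bigquery-public-data.noaa_gsod.stations` AS s
        ON g.stn = s.usaf AND g.wban = s.wban
    WHERE s.country = "US" AND s.state = "TX"
    LIMIT 10
"""
```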
