# NOAA Data via BigQuery

**[NOAA](https://data.noaa.gov/dataset/dataset/global-surface-summary-of-the-day-gsod) Global Surface Summary of the Day**

**Prerequisites:**
1. Install the following in your dev environment:<br>
    a. google-cloud-bigquery: pip install google-cloud-bigquery<br>
    b. db-dtypes: pip install db-dtypes<br>
2. Install gcloud CLI <br>
    a. Install directions (with download link): https://cloud.google.com/sdk/docs/install<br>
    > i. Pay attention to where it installs!<br>
    > ii. The installer says to leave the shortcut and open-terminal options checked. I received errors when it ran "gcloud info --run-diagnostics" and ignored them for now.<br>
    
    b. Add the install location to your PATH environment variable (for me this was C:\Users\vt_be\AppData\Local\Google\Cloud SDK\google-cloud-sdk)<br>
    c. Reboot!<br>
    d. Open Git Bash and switch to your dev environment<br>
    > i. "gcloud info --run-diagnostics" now ran without issue<br>
    > ii. Add authentication (this opens a browser to connect your Google account): gcloud auth application-default login<br>
    
    e. I also needed to set up a BigQuery project: I mostly followed https://cloud.google.com/bigquery/docs/sandbox<br>
    > i. I didn't see the items mentioned in #3, but otherwise it worked<br>
    > ii. Note that when you create the project, an ID is generated in the form project name - #### (for me BootCamp-Weather: bootcamp-weather-400118)<br>
    
    f. Set the project as the default quota project (back in Git Bash): gcloud auth application-default set-quota-project <project-id><br>
    g. In the downloaded notebook, supply your project ID to client = bigquery.Client("project-id") in the first cell<br>
    

**Credit:**
* BigQuery calls adapted from https://www.kaggle.com/code/crained/noaa-dataset-with-google-bigquery
* SQL calls adapted from GitHub BigQuery documentation: https://github.com/googleapis/python-bigquery

In [1]:
# My project ID (I don't think it can be shared across people) is stored in a config.py file as "google_project"
# Since it is unique to each user, config.py is listed in .gitignore. You must create your own config.py with your project ID
from config import google_project
# bigquery and pandas work well together for dataframes!
import pandas as pd
import os
# Follow the prerequisite instructions to get bigquery going
from google.cloud import bigquery
# Create a "Client" object
client = bigquery.Client(google_project)
# Construct a reference to the dataset (Client.dataset() is deprecated in favor of DatasetReference)
dataset_ref = bigquery.DatasetReference("bigquery-public-data", "noaa_gsod")
# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)
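
The first cell imports `google_project` from a `config.py` that is not in the repository. A minimal sketch of what that file could contain is below. The environment-variable fallback and the `GOOGLE_PROJECT` variable name are assumptions for illustration, not part of the original notebook.

```python
# config.py -- a hypothetical sketch; keep this file out of version control
import os

# Replace the placeholder with your own project ID (for example, the
# "name-######" ID generated when you created the project).
# GOOGLE_PROJECT is an assumed environment variable name, used only as an optional override.
google_project = os.environ.get("GOOGLE_PROJECT", "your-project-id")
```

With this file next to the notebook, `from config import google_project` in the first cell should pick up the value.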

In [2]:
# List all the tables in the dataset
tables = list(client.list_tables(dataset))

# Print names of the first 5 and last 5 tables in the dataset, then the total count
print('These are the first 5 tables in the DB')
for table in tables[0:5]:  
    print(table.table_id)
print('These are the last 5 tables in the DB')
for table in tables[-5:]:  
    print(table.table_id)
print(f'There are a total of {len(tables)} tables in the DB')

These are the first 5 tables in the DB
gsod1929
gsod1930
gsod1931
gsod1932
gsod1933
These are the last 5 tables in the DB
gsod2020
gsod2021
gsod2022
gsod2023
stations
There are a total of 96 tables in the DB
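
The listing above suggests one table per year, named `gsod` plus the four-digit year, alongside a single `stations` table. Assuming that pattern holds for every year from 1929 through 2023, a small helper can build the fully qualified table IDs used in the queries later in this notebook. The function name `gsod_table_id` is hypothetical.

```python
PROJECT = "bigquery-public-data"
DATASET = "noaa_gsod"

def gsod_table_id(year: int) -> str:
    """Return the fully qualified ID of a yearly GSOD table (assumed naming: gsodYYYY)."""
    return f"{PROJECT}.{DATASET}.gsod{year}"

# 1929 and 2023 are the first and last yearly tables printed above
yearly_tables = [gsod_table_id(year) for year in range(1929, 2024)]
print(len(yearly_tables))  # 95 yearly tables; the 96th table is "stations"
print(yearly_tables[0])
```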


In [3]:
# Construct a reference to a "full" table
table_ref = dataset_ref.table("gsod2020")

# API request - fetch the table
table = client.get_table(table_ref)

In [4]:
# Print information on all the columns
table.schema

[SchemaField('stn', 'STRING', 'NULLABLE', None, 'Cloud - GSOD NOAA', (), None),
 SchemaField('wban', 'STRING', 'NULLABLE', None, 'WBAN number where applicable--this is the historical "Weather Bureau Air Force Navy" number - with WBAN being the acronym', (), None),
 SchemaField('date', 'DATE', 'NULLABLE', None, 'Date of the weather observations', (), None),
 SchemaField('year', 'STRING', 'NULLABLE', None, 'The year', (), None),
 SchemaField('mo', 'STRING', 'NULLABLE', None, 'The month', (), None),
 SchemaField('da', 'STRING', 'NULLABLE', None, 'The day', (), None),
 SchemaField('temp', 'FLOAT', 'NULLABLE', None, 'Mean temperature for the day in degrees Fahrenheit to tenths. Missing = 9999.9', (), None),
 SchemaField('count_temp', 'INTEGER', 'NULLABLE', None, 'Number of observations used in calculating mean temperature', (), None),
 SchemaField('dewp', 'FLOAT', 'NULLABLE', None, 'Mean dew point for the day in degreesm Fahrenheit to tenths.  Missing = 9999.9', (), None),
 SchemaField('cou

Each SchemaField tells us about a specific column (which we also refer to as a field). The information includes, in order:

* The name of the column
* The field type (or datatype) in the column
* The mode of the column ('NULLABLE' means that a column allows NULL values, and is the default)
* A description of the data in that column (printed after a None placeholder in the output above)

The first field has the SchemaField:

SchemaField('stn', 'STRING', 'NULLABLE', None, 'Cloud - GSOD NOAA', (), None)

This tells us:

* the field (or column) is called stn,
* the data in this field is strings,
* NULL values are allowed, and
* its description is 'Cloud - GSOD NOAA', which says little about the contents (the queries below match stn against the usaf column of the stations table).

We can use the list_rows() method to check just the first five lines of the full table to make sure this is right. (Sometimes databases have outdated descriptions, so it's good to check.) This returns a BigQuery RowIterator object that can quickly be converted to a pandas DataFrame with the to_dataframe() method.

In [5]:
# Preview the first five lines of the "full" table
client.list_rows(table, max_results=5).to_dataframe()

Unnamed: 0,stn,wban,date,year,mo,da,temp,count_temp,dewp,count_dewp,...,flag_min,prcp,flag_prcp,sndp,fog,rain_drizzle,snow_ice_pellets,hail,thunder,tornado_funnel_cloud
0,10030,99999,2020-12-28,2020,12,28,25.5,4,20.5,4,...,,99.99,,999.9,0,0,1,0,0,0
1,10070,99999,2020-09-11,2020,9,11,42.8,4,38.1,4,...,,0.0,I,999.9,0,0,0,0,0,0
2,10070,99999,2020-12-06,2020,12,6,16.8,4,8.6,4,...,,0.0,I,999.9,0,0,0,0,0,0
3,10150,99999,2020-10-10,2020,10,10,47.1,4,9999.9,0,...,,0.0,I,999.9,0,0,0,0,0,0
4,10350,99999,2020-10-10,2020,10,10,41.6,4,9999.9,0,...,,0.0,I,999.9,0,0,0,0,0,0


In [6]:
# Construct a reference to the stations table to see what it has
stations_ref = dataset_ref.table("stations")

# API request - fetch the table
stations_table = client.get_table(stations_ref)

# Print information on all the columns
stations_table.schema

[SchemaField('usaf', 'STRING', 'NULLABLE', None, '', (), None),
 SchemaField('wban', 'STRING', 'NULLABLE', None, '', (), None),
 SchemaField('name', 'STRING', 'NULLABLE', None, '', (), None),
 SchemaField('country', 'STRING', 'NULLABLE', None, '', (), None),
 SchemaField('state', 'STRING', 'NULLABLE', None, '', (), None),
 SchemaField('call', 'STRING', 'NULLABLE', None, '', (), None),
 SchemaField('lat', 'FLOAT', 'NULLABLE', None, '', (), None),
 SchemaField('lon', 'FLOAT', 'NULLABLE', None, '', (), None),
 SchemaField('elev', 'STRING', 'NULLABLE', None, '', (), None),
 SchemaField('begin', 'STRING', 'NULLABLE', None, '', (), None),
 SchemaField('end', 'STRING', 'NULLABLE', None, '', (), None)]
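
Both schema listings are long, so it can help to print one line per column. The sketch below relies only on the `name`, `field_type`, `mode`, and `description` attributes of each schema entry. The `summarize_schema` helper is hypothetical, and the stand-in objects exist only so the example runs without a BigQuery connection.

```python
from types import SimpleNamespace

def summarize_schema(schema):
    """Return one 'name | type | mode | description' line per schema entry."""
    lines = []
    for field in schema:
        description = field.description or "(no description)"
        lines.append(f"{field.name} | {field.field_type} | {field.mode} | {description}")
    return lines

# Stand-ins mirroring two entries printed above; in the notebook, pass table.schema instead
demo_schema = [
    SimpleNamespace(name="date", field_type="DATE", mode="NULLABLE",
                    description="Date of the weather observations"),
    SimpleNamespace(name="lat", field_type="FLOAT", mode="NULLABLE", description=""),
]
for line in summarize_schema(demo_schema):
    print(line)
```

In the notebook, `summarize_schema(table.schema)` or `summarize_schema(stations_table.schema)` would be the intended calls.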

In [7]:
# Preview the first five lines of the stations table
client.list_rows(stations_table, max_results=5).to_dataframe()

Unnamed: 0,usaf,wban,name,country,state,call,lat,lon,elev,begin,end
0,7018,99999,WXPOD 7018,,,,0.0,0.0,7018.0,20110309,20130730
1,7026,99999,WXPOD 7026,AF,,,0.0,0.0,7026.0,20120713,20170822
2,7070,99999,WXPOD 7070,AF,,,0.0,0.0,7070.0,20140923,20150926
3,8268,99999,WXPOD8278,AF,,,32.95,65.567,1156.7,20100519,20120323
4,8307,99999,WXPOD 8318,AF,,,0.0,0.0,8318.0,20100421,20100421


In [8]:
# Perform a filtering query to the stations table
# The spaces at the end of the lines are very important since this just joins each line for the full query
QUERY = (
    'SELECT usaf, name, country, state, lat, lon, elev FROM `bigquery-public-data.noaa_gsod.stations` '
    'WHERE country = "US" AND state = "TX" '
    'ORDER BY usaf DESC '
    'LIMIT 10')
query_job = client.query(QUERY)  # API request
tx_stations = query_job.result()  # Waits for query to finish

for row in tx_stations:
    print(row)

Row(('A05735', 'BOWIE MUNICIPAL AIRPORT', 'US', 'TX', 33.6, -97.783, '+0336.2'), {'usaf': 0, 'name': 1, 'country': 2, 'state': 3, 'lat': 4, 'lon': 5, 'elev': 6})
Row(('A00019', 'WILBARGER COUNTY AIRPORT', 'US', 'TX', 34.226, -99.284, '+0385.6'), {'usaf': 0, 'name': 1, 'country': 2, 'state': 3, 'lat': 4, 'lon': 5, 'elev': 6})
Row(('A00008', 'FAYETTE RGNL AIR CNTR ARP', 'US', 'TX', 29.908, -96.95, '+0098.8'), {'usaf': 0, 'name': 1, 'country': 2, 'state': 3, 'lat': 4, 'lon': 5, 'elev': 6})
Row(('A00002', 'BRENHAM MUNICIPAL AIRPORT', 'US', 'TX', 30.219, -96.374, '+0093.9'), {'usaf': 0, 'name': 1, 'country': 2, 'state': 3, 'lat': 4, 'lon': 5, 'elev': 6})
Row(('999999', 'ATHENS MUNICIPAL AIRPORT', 'US', 'TX', 32.164, -95.828, '+0135.3'), {'usaf': 0, 'name': 1, 'country': 2, 'state': 3, 'lat': 4, 'lon': 5, 'elev': 6})
Row(('999999', 'LAREDO AFB', 'US', 'TX', 27.533, -99.467, '+0154.8'), {'usaf': 0, 'name': 1, 'country': 2, 'state': 3, 'lat': 4, 'lon': 5, 'elev': 6})
Row(('999999', 'PALO PINTO
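
The query in this cell depends on a trailing space at the end of each string fragment. One alternative, sketched here, is a triple-quoted string, where the line breaks act as the whitespace between clauses. The variable name `TX_STATIONS_QUERY` is illustrative.

```python
TX_STATIONS_QUERY = """
    SELECT usaf, name, country, state, lat, lon, elev
    FROM `bigquery-public-data.noaa_gsod.stations`
    WHERE country = "US" AND state = "TX"
    ORDER BY usaf DESC
    LIMIT 10
"""

# Collapsing whitespace shows the clauses stay separated without manual trailing spaces
print(" ".join(TX_STATIONS_QUERY.split()))
```

The string can be passed to `client.query(...)` in the same way as the concatenated version.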

In [9]:
# Perform a query that pulls station string, min, mean, max temp, year, month, day from stations in TX
QUERY = (
    'SELECT stn, min, temp AS mean_temp, max, year, mo, da FROM `bigquery-public-data.noaa_gsod.gsod2022` '
    'WHERE stn IN (SELECT usaf FROM `bigquery-public-data.noaa_gsod.stations` WHERE country = "US" AND state = "TX") '
    'ORDER BY stn DESC '
    'LIMIT 10')
query_job = client.query(QUERY)  # API request
tx_station_measurement_data = query_job.result()  # Waits for query to finish

for row in tx_station_measurement_data:
    print(row)

Row(('A05735', 71.8, 79.2, 91.4, '2022', '05', '17'), {'stn': 0, 'min': 1, 'mean_temp': 2, 'max': 3, 'year': 4, 'mo': 5, 'da': 6})
Row(('A05735', 67.1, 81.1, 91.4, '2022', '09', '06'), {'stn': 0, 'min': 1, 'mean_temp': 2, 'max': 3, 'year': 4, 'mo': 5, 'da': 6})
Row(('A05735', 39.2, 44.0, 49.8, '2022', '12', '20'), {'stn': 0, 'min': 1, 'mean_temp': 2, 'max': 3, 'year': 4, 'mo': 5, 'da': 6})
Row(('A05735', 79.0, 90.7, 102.9, '2022', '08', '04'), {'stn': 0, 'min': 1, 'mean_temp': 2, 'max': 3, 'year': 4, 'mo': 5, 'da': 6})
Row(('A05735', 34.0, 53.1, 73.2, '2022', '03', '01'), {'stn': 0, 'min': 1, 'mean_temp': 2, 'max': 3, 'year': 4, 'mo': 5, 'da': 6})
Row(('A05735', 68.9, 73.9, 78.4, '2022', '10', '07'), {'stn': 0, 'min': 1, 'mean_temp': 2, 'max': 3, 'year': 4, 'mo': 5, 'da': 6})
Row(('A05735', 46.6, 66.3, 86.7, '2022', '03', '27'), {'stn': 0, 'min': 1, 'mean_temp': 2, 'max': 3, 'year': 4, 'mo': 5, 'da': 6})
Row(('A05735', 24.8, 36.5, 55.2, '2022', '02', '27'), {'stn': 0, 'min': 1, 'mean_t

In [10]:
# Perform a query that pulls data from both the measurement and stations table
QUERY4 = (
    'SELECT s.name, s.lat, s.lon, g.min, g.temp AS mean_temp, g.max, g.year, g.mo, g.da FROM `bigquery-public-data.noaa_gsod.gsod2022` AS g '
    'INNER JOIN `bigquery-public-data.noaa_gsod.stations` AS s ON g.stn = s.usaf '
    'WHERE s.country = "US" AND s.state = "TX" '
    'LIMIT 10')
query_job4 = client.query(QUERY4)  # API request
tx_measurement_and_station_data = query_job4.result()  # Waits for query to finish

for row in tx_measurement_and_station_data:
    print(row)

Row(('EAGLE POINT', 29.48, -94.92, 56.8, 62.0, 74.3, '2022', '10', '20'), {'name': 0, 'lat': 1, 'lon': 2, 'min': 3, 'mean_temp': 4, 'max': 5, 'year': 6, 'mo': 7, 'da': 8})
Row(('PORT ARANSAS', 27.833, -97.067, 62.4, 65.5, 71.6, '2022', '10', '20'), {'name': 0, 'lat': 1, 'lon': 2, 'min': 3, 'mean_temp': 4, 'max': 5, 'year': 6, 'mo': 7, 'da': 8})
Row(('ROCKPORT', 28.017, -97.05, 80.8, 81.1, 81.5, '2022', '10', '15'), {'name': 0, 'lat': 1, 'lon': 2, 'min': 3, 'mean_temp': 4, 'max': 5, 'year': 6, 'mo': 7, 'da': 8})
Row(('MUSTANG ISLAND A85A          ', 27.733, -96.183, 43.0, 53.0, 61.0, '2022', '02', '16'), {'name': 0, 'lat': 1, 'lon': 2, 'min': 3, 'mean_temp': 4, 'max': 5, 'year': 6, 'mo': 7, 'da': 8})
Row(('ATHENS MUNICIPAL AIRPORT', 32.164, -95.828, 43.0, 53.0, 61.0, '2022', '02', '16'), {'name': 0, 'lat': 1, 'lon': 2, 'min': 3, 'mean_temp': 4, 'max': 5, 'year': 6, 'mo': 7, 'da': 8})
Row(('LONGVIEW GREGG COUNTY AP', 32.385, -94.712, 43.0, 53.0, 61.0, '2022', '02', '16'), {'name': 0, 'la
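
In the output above, three different stations show identical readings for 2022-02-16, and the earlier stations query returned several Texas stations sharing the usaf value '999999'. That suggests joining on `usaf` alone can match one measurement row to more than one station. Both tables also have a `wban` column, so a possible refinement, offered as an assumption rather than a verified fix, is to join on both columns. `build_tx_join_query` is a hypothetical helper.

```python
def build_tx_join_query(year: int = 2022, limit: int = 10) -> str:
    """Build the Texas join query, matching stations on usaf AND wban (assumed composite key)."""
    return (
        "SELECT s.name, s.lat, s.lon, g.min, g.temp AS mean_temp, g.max, g.year, g.mo, g.da "
        f"FROM `bigquery-public-data.noaa_gsod.gsod{year}` AS g "
        "INNER JOIN `bigquery-public-data.noaa_gsod.stations` AS s "
        "ON g.stn = s.usaf AND g.wban = s.wban "
        'WHERE s.country = "US" AND s.state = "TX" '
        f"LIMIT {limit}"
    )

print(build_tx_join_query())
```

Comparing row counts from this query and the original join would show whether the duplication goes away.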

In [11]:
# Put the last query into a dataframe 
sample_df = query_job4.to_dataframe()

# and export to json
sample_df.to_json("sample_data/sample.json", orient="records")

# display df header
sample_df.head()

Unnamed: 0,name,lat,lon,min,mean_temp,max,year,mo,da
0,EAGLE POINT,29.48,-94.92,56.8,62.0,74.3,2022,10,20
1,PORT ARANSAS,27.833,-97.067,62.4,65.5,71.6,2022,10,20
2,ROCKPORT,28.017,-97.05,80.8,81.1,81.5,2022,10,15
3,MUSTANG ISLAND A85A,27.733,-96.183,43.0,53.0,61.0,2022,2,16
4,ATHENS MUNICIPAL AIRPORT,32.164,-95.828,43.0,53.0,61.0,2022,2,16


In [12]:
# Perform a query that pulls data from both the measurement and stations table by state
aggregate_query = (
    'SELECT s.state, '
    'AVG(g.min) AS avg_daily_min, MIN(g.min) AS absolute_daily_min, '
    'AVG(g.temp) AS avg_daily_mean, '
    'MAX(g.max) AS absolute_daily_max, AVG(g.max) AS avg_daily_max, '
    'SUM(g.prcp) AS total_precipitation, '
    'SUM(g.sndp) AS total_snow '
    'FROM `bigquery-public-data.noaa_gsod.gsod2022` AS g '
    'INNER JOIN `bigquery-public-data.noaa_gsod.stations` AS s ON g.stn = s.usaf '
    'WHERE s.country = "US" '
    'GROUP BY s.state')
state_temp_result = client.query(aggregate_query)  # API request
state_temp_data = state_temp_result.result()  # Waits for query to finish

# Put the last query into a dataframe
state_temp_result.to_dataframe()

# This result shows we need to replace the missing-value placeholders (99.99, 999.9, 9999.9) with 0 or ignore them when doing aggregates

Unnamed: 0,state,avg_daily_min,absolute_daily_min,avg_daily_mean,absolute_daily_max,avg_daily_max,total_precipitation,total_snow
0,AK,46.255926,-59.1,52.526213,9999.9,88.464909,42207340.0,7545190000.0
1,TX,46.727277,-57.1,53.125069,9999.9,89.323022,26456600.0,4781377000.0
2,NV,46.470817,-57.1,52.793654,9999.9,89.007231,8836059.0,1590563000.0
3,WA,46.387788,-57.1,52.738978,9999.9,88.696474,15257330.0,2737690000.0
4,FL,46.772363,-57.1,53.050482,9999.9,88.89107,22210320.0,3971207000.0
5,NJ,46.457949,-57.1,52.789173,9999.9,88.665231,5498056.0,991551100.0
6,NM,46.440218,-57.1,52.784849,9999.9,89.092572,16844380.0,3031091000.0
7,WY,47.375206,-57.1,52.631399,9999.9,89.055739,6875780.0,1220866000.0
8,NE,46.345308,-57.1,52.709461,9999.9,90.228895,7311901.0,1307714000.0
9,MT,46.371436,-57.1,52.69446,9999.9,88.905297,10988970.0,1971335000.0


In [13]:
# SchemaField('prcp', 'FLOAT', 'NULLABLE', None, "Total precipitation (rain and/or melted snow) reported during the day 
# in inches and hundredths; will usually not end with the midnight observation--i.e., may include latter part of 
# previous day.  .00 indicates no measurable precipitation (includes a trace). Missing = 99.99 Note: Many stations do 
# not report '0' on days with no precipitation--therefore, '99.99' will often appear on these days. Also, for example, 
# a station may only report a 6-hour amount for the period during which rain fell. See Flag field for source of data", (), None)

# Perform a query that pulls data from both the measurement and stations table by state
aggregate_query = (
    'SELECT s.state, '
    'SUM(g.prcp) AS total_precipitation '
    'FROM `bigquery-public-data.noaa_gsod.gsod2022` AS g '
    'INNER JOIN `bigquery-public-data.noaa_gsod.stations` AS s ON g.stn = s.usaf '
    'WHERE s.country = "US" AND g.prcp <> 99.9 '
    'GROUP BY s.state')
state_prcp_result = client.query(aggregate_query)  # API request
state_prcp_data = state_prcp_result.result()  # Waits for query to finish

# Put the last query into a dataframe
state_prcp_result.to_dataframe()

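# The totals match the previous cell (e.g. AK is still 42207340.0): the filter compares against 99.9, but the
# missing-value placeholder for prcp is 99.99, so no rows are excluded. This filter is not refined enough for precipitation totals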

Unnamed: 0,state,total_precipitation
0,AK,42207340.0
1,AR,4659832.0
2,IA,5721028.0
3,VA,11231410.0
4,NV,8836059.0
5,ND,6225631.0
6,LA,11482730.0
7,,1270356.0
8,FL,22210320.0
9,CA,39609710.0
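
The schema descriptions give 99.99 as the missing-value placeholder for `prcp`, 999.9 for `sndp`, and 9999.9 for the temperature columns. One way to keep those placeholders out of aggregates is to turn them into NULL with `NULLIF`, since SQL aggregate functions such as `SUM` and `AVG` skip NULLs. This is a sketch that has not been run against the dataset; the `MISSING` mapping and the `null_if_missing` helper are hypothetical.

```python
# Placeholders quoted in the schema descriptions and filters in this notebook
MISSING = {"prcp": 99.99, "sndp": 999.9, "temp": 9999.9, "min": 9999.9, "max": 9999.9}

def null_if_missing(column: str, alias: str = "g") -> str:
    """Wrap a column so its missing-value placeholder becomes NULL."""
    return f"NULLIF({alias}.{column}, {MISSING[column]})"

aggregate_query = (
    "SELECT s.state, "
    f"SUM({null_if_missing('prcp')}) AS total_precipitation, "
    f"AVG({null_if_missing('temp')}) AS avg_daily_mean "
    "FROM `bigquery-public-data.noaa_gsod.gsod2022` AS g "
    "INNER JOIN `bigquery-public-data.noaa_gsod.stations` AS s ON g.stn = s.usaf "
    'WHERE s.country = "US" '
    "GROUP BY s.state"
)
print(aggregate_query)
```

Unlike a `WHERE` filter, this keeps a row's valid temperature reading even when its precipitation value is missing.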


In [14]:
# SchemaField('sndp', 'FLOAT', 'NULLABLE', None, "Snow depth in inches to tenths--last report for the day 
# if reported more than once. Missing = 999.9 Note: Most stations do not report '0' ondays with no snow on 
# the ground--therefore, '999.9' will often appear on these days", (), None)

# Perform a query that pulls data from both the measurement and stations table by state
aggregate_query = (
    'SELECT s.state, '
    'SUM(g.sndp) AS total_snow '
    'FROM `bigquery-public-data.noaa_gsod.gsod2022` AS g '
    'INNER JOIN `bigquery-public-data.noaa_gsod.stations` AS s ON g.stn = s.usaf '
    'WHERE s.country = "US" AND g.sndp <> 999.9 '
    'GROUP BY s.state '
    'ORDER BY total_snow DESC')
state_snow_result = client.query(aggregate_query)  # API request
state_snow_data = state_snow_result.result()  # Waits for query to finish

# Put the last query into a dataframe
state_snow_result.to_dataframe()

# This result shows that snow depth (and precipitation) should be summed by station, not by state!
# A spot in Alaska did not receive 70,345 inches of snow in a year

Unnamed: 0,state,total_snow
0,AK,70345.8
1,WI,7389.2
2,ND,6965.0
3,MI,6440.4
4,MN,5876.4
5,SD,4439.0
6,NH,3920.0
7,NY,3824.5
8,MT,3657.9
9,CA,2693.6


In [15]:
# Perform a query that pulls data from both the measurement and stations table by state and station
aggregate_query = (
    'SELECT s.state, s.name, '
    'SUM(g.sndp) AS total_snow '
    'FROM `bigquery-public-data.noaa_gsod.gsod2022` AS g '
    'INNER JOIN `bigquery-public-data.noaa_gsod.stations` AS s ON g.stn = s.usaf '
    'WHERE s.country = "US" AND g.sndp <> 999.9 '
    'GROUP BY s.state, s.name '
    'ORDER BY total_snow DESC')
state_snow_result = client.query(aggregate_query)  # API request
state_snow_data = state_snow_result.result()  # Waits for query to finish

# Put the last query into a dataframe
state_snow_result.to_dataframe()

Unnamed: 0,state,name,total_snow
0,AK,CHULITNA AIRPORT,7360.6
1,AK,CHULITNA,7360.6
2,AK,BETTLES AIRPORT,5199.6
3,AK,FAIRBANKS INTERNATIONAL,4368.2
4,AK,FAIRBANKS/EIELSON A,4311.3
...,...,...,...
371,NM,ROSWELL INTERNATIONAL AIR CEN,1.2
372,WA,GRAY AAF,1.2
373,WA,GRAY AFF AIRPORT,1.2
374,WI,LANGLADE CO,1.2


**Taking a deeper dive into processing**<br>
Bekah does NOT recommend running this section<br>
The immediate cell below takes as long or longer to run than everything else combined

In [16]:
# Perform a query that pulls data from both the measurement and stations table
QUERY4 = (
    'SELECT s.name, s.lat, s.lon, g.min, g.temp AS mean_temp, g.max, g.prcp, g.sndp, g.year, g.mo, g.da '
    'FROM `bigquery-public-data.noaa_gsod.gsod2022` AS g '
    'INNER JOIN `bigquery-public-data.noaa_gsod.stations` AS s ON g.stn = s.usaf '
    'WHERE s.country = "US" AND s.state = "DE" '
    # The line below removes missing readings (9999.9) so we can run stats on those columns
    'AND g.min <> 9999.9 AND g.max <> 9999.9 '
    # 'LIMIT 10000'
    )
query_job4 = client.query(QUERY4)  # API request
de_measurement_and_station_data = query_job4.result()  # Waits for query to finish

# Put the last query into a dataframe 
DEsample_df = query_job4.to_dataframe()

# Replace the 999.9 sndp and 99.99 prcp placeholders with 0 to clean up the missing readings
DEsample_df["sndp"] = DEsample_df["sndp"].replace(999.9, 0)
DEsample_df["prcp"] = DEsample_df["prcp"].replace(99.99, 0)

# and export to json
DEsample_df.to_json("sample_data/DEsample.json", orient="records")
DEsample_df.to_csv("sample_data/DEsample.csv")

# display df header
print(len(DEsample_df))
DEsample_df.head()

78240


Unnamed: 0,name,lat,lon,min,mean_temp,max,prcp,sndp,year,mo,da
0,DELAWARE RESERVE,39.083,-75.433,54.7,57.2,58.8,0.0,0.0,2022,4,20
1,DELAWARE RESERVE,39.083,-75.433,45.5,48.3,51.8,0.0,0.0,2022,3,23
2,WILMINGTON DUPONT AP,39.673,-75.601,36.7,38.2,38.8,0.0,0.0,2022,2,18
3,DELAWARE RESERVE,39.083,-75.433,22.5,31.9,42.3,0.0,0.0,2022,1,9
4,DELAWARE RESERVE,39.083,-75.433,62.6,64.3,66.2,0.0,0.0,2022,10,31


In [17]:
DEsample_temp_df = DEsample_df[["min", "mean_temp", "max"]]

DEsample_temp_df.aggregate(func=["min", "max", "mean", "std"], axis="index")

Unnamed: 0,min,mean_temp,max
min,-57.1,-50.5,-45.9
max,100.9,110.0,122.4
mean,43.159178,53.007944,65.066692
std,21.642708,21.705697,22.756226


In [18]:
DEstation_snow = DEsample_df[["name","sndp"]].groupby(["name"]).sum()
DEstation_snow

Unnamed: 0_level_0,sndp
name,Unnamed: 1_level_1
DELAWARE CITY,0.0
DELAWARE RESERVE,0.0
DOVER AFB,20.0
DOVER AFB AIRPORT,20.0
LEWES,0.0
NEW CASTLE COUNTY AIRPORT,17.2
REEDY POINT,0.0
SUSSEX CO,0.0
SUSSEX COUNTY AIRPORT,0.0
WILMINGTON DUPONT AP,0.0


In [19]:
DEsample_df[["name","prcp"]].groupby(["name"]).sum()

Unnamed: 0_level_0,prcp
name,Unnamed: 1_level_1
DELAWARE CITY,0.0
DELAWARE RESERVE,0.0
DOVER AFB,36.33
DOVER AFB AIRPORT,36.33
LEWES,0.0
NEW CASTLE COUNTY AIRPORT,43.46
REEDY POINT,0.0
SUSSEX CO,40.9
SUSSEX COUNTY AIRPORT,40.9
WILMINGTON DUPONT AP,5949.35


**Trying a different approach to reduce processing time...**<br>
Aggregating inside the query works much better than pulling every row into pandas, as the previous section did

In [20]:
# Perform a query that pulls data from both the measurement and stations table by state and station
aggregate_query = (
    'SELECT s.state, s.name, '
    'SUM(g.sndp) AS total_snow '
    'FROM `bigquery-public-data.noaa_gsod.gsod2022` AS g '
    'INNER JOIN `bigquery-public-data.noaa_gsod.stations` AS s ON g.stn = s.usaf '
    'WHERE s.country = "US" AND g.sndp <> 999.9 '
    'GROUP BY s.state, s.name '
    'ORDER BY total_snow DESC')
state_snow_result = client.query(aggregate_query)  # API request
state_snow_data = state_snow_result.result()  # Waits for query to finish

# Put the last query into a dataframe
state_snow_station = state_snow_result.to_dataframe()
state_snow_station

Unnamed: 0,state,name,total_snow
0,AK,CHULITNA AIRPORT,7360.6
1,AK,CHULITNA,7360.6
2,AK,BETTLES AIRPORT,5199.6
3,AK,FAIRBANKS INTERNATIONAL,4368.2
4,AK,FAIRBANKS/EIELSON A,4311.3
...,...,...,...
371,NM,ROSWELL INTERNATIONAL AIR CEN,1.2
372,WA,GRAY AAF,1.2
373,WA,GRAY AFF AIRPORT,1.2
374,WI,LANGLADE CO,1.2


In [21]:
# state_snow_avg = state_snow_station[["state", "total_snow"]].groupby(["state"]).mean()
# state_snow_avg

state_snow_stats = state_snow_station[["state", "total_snow"]].groupby(["state"]).agg(["min", "max", "mean", "std"])
state_snow_stats.head()

Unnamed: 0_level_0,total_snow,total_snow,total_snow,total_snow
Unnamed: 0_level_1,min,max,mean,std
state,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
AK,3.9,7360.6,1635.948837,2079.042702
AL,1.2,13.6,6.45,5.179768
AR,6.7,43.9,16.37,14.802931
AZ,19.9,245.8,132.85,130.423426
CA,2.4,1345.6,897.866667,775.496882


In [22]:
QUERY = (
    'SELECT s.state, s.name, '
    'MIN(g.min) AS min_temp, '
    'AVG(g.temp) AS mean_temp, '
    'MAX(g.max) AS max_temp '
    'FROM `bigquery-public-data.noaa_gsod.gsod2022` AS g '
    'INNER JOIN `bigquery-public-data.noaa_gsod.stations` AS s ON g.stn = s.usaf '
    'WHERE s.country = "US" '
    # The line below removes missing readings (9999.9) so we can run stats on those columns
    'AND g.min <> 9999.9 AND g.max <> 9999.9 '
    'GROUP BY s.state, s.name '
    )
state_temp_result = client.query(QUERY)  # API request
state_temp_data = state_temp_result.result()  # Waits for query to finish

# Put the last query into a dataframe
state_temp_station = state_temp_result.to_dataframe()


# and export
state_temp_station.to_json("sample_data/Station_temp_sample.json", orient="records")
state_temp_station.to_csv("sample_data/Station_temp_sample.csv")

state_temp_station

Unnamed: 0,state,name,min_temp,mean_temp,max_temp
0,AK,GAMBELL (AWOS),-5.1,30.590909,55.9
1,AK,GAMBELL AIRPORT,-5.1,30.590909,55.9
2,AK,KOYUK AIRPORT,-22.0,31.764780,81.0
3,AK,PUNTILLA,-27.4,32.636119,77.0
4,AK,PUNTILLA LAKE,-57.1,52.727017,122.4
...,...,...,...,...,...
5001,CA,BIG BEAR CITY AIRPORT,-0.4,46.237293,86.0
5002,ND,COOPERSTOWN MUNICIPAL AIRPORT,-29.2,38.689474,98.6
5003,AL,MIDDLE BAY LIGHT,46.9,65.450000,81.3
5004,CA,TRINITY CENTER AIRPORT,17.6,34.047619,57.2


In [23]:
# cold map data
state_mintemp_stats = state_temp_station[["state", "min_temp"]].groupby(["state"]).agg(["min", "max", "mean", "std"])

# and export
state_mintemp_stats.to_json("sample_data/COLD_state_sample.json", orient="records")
state_mintemp_stats.to_csv("sample_data/COLD_state_sample.csv")

state_mintemp_stats.head()

Unnamed: 0_level_0,min_temp,min_temp,min_temp,min_temp
Unnamed: 0_level_1,min,max,mean,std
state,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
AK,-59.1,53.6,-28.521727,26.105537
AL,-57.1,46.9,-11.595238,35.01315
AR,-57.1,19.0,-7.362319,22.610019
AZ,-57.1,37.0,-7.793478,38.674726
CA,-57.1,45.5,3.156151,39.759413


In [24]:
# hot map data
state_maxtemp_stats = state_temp_station[["state", "max_temp"]].groupby(["state"]).agg(["min", "max", "mean", "std"])

# and export
state_maxtemp_stats.to_json("sample_data/HOT_state_sample.json", orient="records")
state_maxtemp_stats.to_csv("sample_data/HOT_state_sample.csv")

state_maxtemp_stats.head()

Unnamed: 0_level_0,max_temp,max_temp,max_temp,max_temp
Unnamed: 0_level_1,min,max,mean,std
state,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
AK,50.0,122.4,87.950139,22.01016
AL,81.3,122.4,107.445714,11.627512
AR,98.1,122.4,106.37971,7.470483
AZ,78.1,122.4,111.845652,10.837059
CA,57.2,122.4,109.352366,12.2001
