`ghcnd-stations.txt` is a freely available dataset about weather stations used in US government climate data. A direct download link can be found at that linked site.

`ghcn-stations-processed.csv` is generated from the `ghcnd-stations.txt` text file. To generate this file yourself, run `python data_processing_helper.py` from this directory.

`holidays.csv` is derived from data found on the US Open Government Federal Holiday Webpage. It was converted from iCalendar format to CSV following instructions found in this blog post.
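The stations file is fixed-width rather than delimited, which is why a helper script is needed to produce a CSV. The conversion is handled by `data_processing_helper.py`; as an illustration only, a minimal sketch of that kind of fixed-width parsing might look like the following (the column offsets follow the published GHCN-Daily `ghcnd-stations.txt` layout, but the actual helper's output columns are assumptions):

```python
import csv

# Fixed-width column spans (0-indexed, end-exclusive) per the GHCN-Daily
# stations file layout. The real data_processing_helper.py may keep a
# different subset of columns.
FIELDS = {
    "ID": (0, 11),
    "LATITUDE": (12, 20),
    "LONGITUDE": (21, 30),
    "ELEVATION": (31, 37),
    "STATE": (38, 40),
    "NAME": (41, 71),
}

def parse_station_line(line: str) -> dict:
    """Slice one fixed-width line into a dict of stripped field values."""
    return {name: line[start:end].strip() for name, (start, end) in FIELDS.items()}

def convert(txt_path: str, csv_path: str) -> None:
    """Read the fixed-width stations file and write it back out as CSV."""
    with open(txt_path) as src, open(csv_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=list(FIELDS))
        writer.writeheader()
        for line in src:
            writer.writerow(parse_station_line(line))
```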
Go through the Run a data analytics DAG in Google Cloud tutorial, skipping the cleanup steps.
This directory has a DAG similar to the data analytics DAG found in the Run a data analytics DAG in Google Cloud tutorial, but it includes a more complicated data processing step with Dataproc. Instead of answering the question, "How warm was it in Chicago on Thanksgiving for the past 25 years?" you will answer the question, "How have the rainfall patterns changed over the past 25 years in the western part of the US and in Phoenix, AZ?" For this example, the western part of the US is defined as the census-defined West region. Phoenix is used in this example because it is a city that has been affected by climate change in recent years, especially with respect to water.
The Dataproc Serverless job uses the arithmetic mean to calculate precipitation and snowfall across the western states, and uses distance weighting to focus on the Phoenix-specific area.
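The actual weighting lives in the Dataproc job; purely as a sketch of the distance-weighting idea, an inverse-distance-weighted mean toward a target point like Phoenix could look like this (the coordinates, power exponent, and function names are illustrative assumptions, not the job's real parameters):

```python
import math

PHOENIX = (33.4484, -112.0740)  # approximate Phoenix lat/lon, illustrative

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def idw_mean(stations, target=PHOENIX, power=2.0):
    """Inverse-distance-weighted mean of station values around `target`.

    `stations` is an iterable of ((lat, lon), value) pairs; stations
    closer to the target contribute more to the result.
    """
    num = den = 0.0
    for coords, value in stations:
        d = max(haversine_km(coords, target), 1e-6)  # avoid divide-by-zero
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den
```

The choice of `power=2.0` is a common default for inverse-distance weighting; a larger exponent concentrates the estimate more tightly on nearby stations.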
The DAG has three steps:

- Ingest the data about the weather stations from Cloud Storage into BigQuery
- Use BigQuery to join the weather station data with the data used in the prior tutorial (the GHCN data) and write the results to a table
- Run a Dataproc Serverless job that processes the data by
  - Removing any data points that are not from weather stations located in the Western US
  - Removing any data points that are not about snow or other precipitation (data where `ELEMENT` is not `SNOW` or `PRCP`)
  - Converting the measurement values for the remaining rows (where `ELEMENT` is `SNOW` or `PRCP`) to be in mm, instead of tenths of a mm
  - Extracting the year from the date so the `Date` column is left only with the year
  - Calculating the arithmetic mean of precipitation and of snowfall
  - Calculating the distance weighting for Phoenix
  - Writing the results to tables in BigQuery
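The filtering and conversion steps above can be sketched in plain Python. This is a minimal illustration only, not the actual Dataproc Serverless job (which would run on Spark); the column names (`STATE`, `ELEMENT`, `DATE`, `VALUE`) and the row layout are assumptions for the sketch:

```python
# Census-defined West region states (assumption: the job filters on a
# state code column; the real job may filter on station coordinates).
WESTERN_STATES = {
    "AZ", "CA", "CO", "ID", "MT", "NM", "NV",
    "OR", "UT", "WA", "WY", "AK", "HI",
}

def process(rows):
    """Apply the filter/convert steps: Western US only, SNOW/PRCP only,
    tenths of mm -> mm, and reduce the date to just the year.

    Each input row is a dict with STATE, ELEMENT, DATE ('YYYY-MM-DD'),
    and VALUE (in tenths of a mm).
    """
    out = []
    for row in rows:
        if row["STATE"] not in WESTERN_STATES:
            continue  # drop stations outside the Western US
        if row["ELEMENT"] not in ("SNOW", "PRCP"):
            continue  # keep only snowfall and precipitation readings
        out.append({
            "STATE": row["STATE"],
            "ELEMENT": row["ELEMENT"],
            "YEAR": row["DATE"][:4],          # keep just the year
            "VALUE_MM": row["VALUE"] / 10.0,  # tenths of a mm -> mm
        })
    return out

def yearly_mean(rows, element):
    """Arithmetic mean of VALUE_MM per year for one element type."""
    totals = {}
    for row in rows:
        if row["ELEMENT"] == element:
            s, n = totals.get(row["YEAR"], (0.0, 0))
            totals[row["YEAR"]] = (s + row["VALUE_MM"], n + 1)
    return {year: s / n for year, (s, n) in totals.items()}
```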
- Add `data_analytics_dag_expansion.py` to the Composer environment you used in the previous tutorial
- Add `data_analytics_process_expansion.py` and `ghcn-stations-processed.csv` to the Cloud Storage bucket you created in the previous tutorial
- Create an empty BigQuery dataset called `precipitation_changes`
You do not need to add any additional Airflow variables, add any additional permissions, or create any other resources.