# Accessing GHCN-D in Databricks

### This notebook gives a quick overview of reading GHCN-Daily (GHCN-D) data from an Azure Blob Storage URL in Databricks. It then shows example queries against the resulting tables and the visualizations built from them.

[References for the GHCN-D metadata](https://github.com/awslabs/open-data-docs/tree/main/docs/noaa/noaa-ghcn)

### Requirements:
- Run `station_metadata_processing.ipynb` first to create the `station_metadata` table in the `ghcn` schema

#### 1. Add `start_year` and `end_year` widget parameters so the notebook can be run from job workflows

In [None]:
dbutils.widgets.text("start_year", "")
dbutils.widgets.text("end_year", "")
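Both widgets default to an empty string, and widget values are always returned as text, so calling `int()` on an unset widget raises a `ValueError`. A minimal sketch of validating the two values before use is shown below. The helper name `parse_year_range` and the `min_year` lower bound of 1763 are assumptions made for illustration; they are not part of the notebook or of Databricks.

```python
def parse_year_range(start_text, end_text, min_year=1763):
    """Convert two widget strings into an inclusive range of years.

    `min_year` is an assumed lower bound for the by_year files.
    """
    if not start_text.strip() or not end_text.strip():
        raise ValueError("start_year and end_year must both be set")
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"start_year {start} is after end_year {end}")
    if start < min_year:
        raise ValueError(f"start_year {start} is before {min_year}")
    # end + 1 because range() excludes its upper bound
    return range(start, end + 1)


years = parse_year_range("2022", "2023")
print(list(years))
```

In the notebook, the two arguments would come from `dbutils.widgets.get("start_year")` and `dbutils.widgets.get("end_year")`.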

#### 2. Read GHCN-Daily data from the Azure Blob Storage URL into a pandas DataFrame, parse the date column, and write it to the `ghcn.ghcn_{year}` table


In [None]:
import pandas as pd

# Column names for the by_year CSV files, which have no header row
new_columns = ['ID', 'Time', 'Element', 'Value', 'M-Flag', 'Q-Flag', 'S-Flag', 'OBS-Time']

start_year = int(dbutils.widgets.get("start_year"))
end_year = int(dbutils.widgets.get("end_year"))

for year in range(start_year, end_year + 1):
    URL = f'https://ghcn.blob.core.windows.net/ghcn/csv/daily/by_year/{year}.csv'

    df = pd.read_csv(URL, names=new_columns)

    # Convert YYYYMMDD values into timestamps
    df['Time'] = pd.to_datetime(df.Time, format='%Y%m%d')

    # Replace any existing table for this year
    spark.createDataFrame(df).write.mode("overwrite").saveAsTable(f"ghcn.ghcn_{year}")
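The parsing step can be tried without downloading a full year of data. The sketch below reads three made-up rows in the same headerless layout from an in-memory buffer; the station ID and values are invented for illustration. Reading `OBS-Time` as a string is an optional choice, not something the notebook does: it keeps the leading zero in an HHMM value such as `0700`.

```python
import io

import pandas as pd

columns = ['ID', 'Time', 'Element', 'Value', 'M-Flag', 'Q-Flag', 'S-Flag', 'OBS-Time']

# Three invented rows in the by_year layout (no header row)
sample_csv = io.StringIO(
    "USW00000001,20230101,TMAX,56,,,W,\n"
    "USW00000001,20230101,TMIN,-12,,,W,\n"
    "USW00000001,20230102,PRCP,25,,,W,0700\n"
)

# dtype=str on OBS-Time preserves values like "0700"; empty fields become NaN
sample_df = pd.read_csv(sample_csv, names=columns, dtype={'OBS-Time': str})

# The date column is read as an integer, so cast to text before parsing YYYYMMDD
sample_df['Time'] = pd.to_datetime(sample_df['Time'].astype(str), format='%Y%m%d')

print(sample_df[['ID', 'Time', 'Element', 'Value', 'OBS-Time']])
```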


#### Query and process the results into the `ghcn.ghcn_pivot_{end_year}` table for dynamic dashboard use

In [None]:
end_year = dbutils.widgets.get("end_year")

# Daily mean of selected climate elements across US stations
merge_df = spark.sql(f"""
    select date(g.Time) as Time, g.Element, mean(g.Value) as Value
    from ghcn.ghcn_{end_year} g
    join ghcn.station_metadata s
    on g.ID = s.ID
    where s.FIPS = 'US' and g.Element in ('PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'TAVG', 'AWND')
    group by g.Time, g.Element
    order by g.Time
""")

# Use `pivot_table()` to reshape from long to wide: one row per date, one column per element
merge_df = merge_df.toPandas()
pivot_df = merge_df.pivot_table(index='Time', columns='Element', values='Value').reset_index()
spark.createDataFrame(pivot_df).write.mode("overwrite").saveAsTable(f'ghcn.ghcn_pivot_{end_year}')
pivot_df


Element,Time,AWND,PRCP,SNOW,SNWD,TAVG,TMAX,TMIN
0,2023-01-01,30.341571,38.608815,4.114291,163.480394,10.893327,87.219584,-0.663062
1,2023-01-02,29.601762,16.615280,5.752842,163.848451,-6.660801,79.502864,-5.078730
2,2023-01-03,33.545215,60.318453,8.591561,165.747986,-7.374890,73.878717,-5.901605
3,2023-01-04,34.746678,69.100191,8.563363,169.238464,-0.521356,76.695178,-10.295297
4,2023-01-05,30.766990,41.318529,4.923307,171.269034,7.154489,75.146978,-22.133913
...,...,...,...,...,...,...,...,...
360,2023-12-27,27.823839,38.640980,2.220655,65.116453,7.039578,73.251189,-14.364476
361,2023-12-28,28.728147,42.078535,0.749653,63.099165,13.276165,75.248744,-16.066061
362,2023-12-29,29.498236,15.808059,1.057905,61.236553,14.364355,74.082328,-21.630974
363,2023-12-30,27.469496,14.230746,0.460726,59.977047,10.763828,72.514484,-27.665112
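The long-to-wide reshape is easier to follow on a tiny example. The sketch below applies the same `pivot_table()` call to four invented rows. The `Element` label in the top-left corner of the output above is the name pandas gives to the column index after pivoting; clearing it, as done here, is optional and is not something the notebook does.

```python
import pandas as pd

# Long format: one row per (date, element) pair, values invented for illustration
long_df = pd.DataFrame({
    'Time': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02']),
    'Element': ['TMAX', 'TMIN', 'TMAX', 'TMIN'],
    'Value': [80.0, -10.0, 70.0, -20.0],
})

# Wide format: one row per date, one column per element
wide_df = long_df.pivot_table(index='Time', columns='Element', values='Value').reset_index()

# Drop the leftover "Element" label on the column index
wide_df.columns.name = None

print(wide_df)
```

Note that `pivot_table()` averages duplicate (date, element) pairs by default, which is harmless here because the query has already aggregated to one value per pair.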


#### 3. In SQL, join `ghcn.station_metadata` and `ghcn.ghcn_{year}` tables to query the results you are interested in
- Aggregated monthly precipitation data from weather stations in Central and South America

In [None]:
%sql
-- PRCP is stored in tenths of mm, so dividing by 10 gives mm
SELECT month(g.Time) as Month, round(mean(g.Value) / 10, 2) as Precipitation
FROM ghcn.ghcn_${end_year} g
JOIN ghcn.station_metadata s
ON g.ID = s.ID
WHERE s.Region IN ('Central America', 'South America') AND g.Element = 'PRCP'
GROUP BY Month
ORDER BY Month

Month,Precipitation
1,3.64
2,3.4
3,4.22
4,4.33
5,4.92
6,2.63
7,3.44
8,3.34
9,3.13
10,4.33


Databricks visualization. Run in Databricks to view.
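The division by 10 in the SQL queries reflects how GHCN-D stores values: `PRCP` is in tenths of a millimetre, the temperature elements are in tenths of a degree Celsius, and `AWND` is in tenths of a metre per second, while `SNOW` and `SNWD` are already in millimetres. The pivot table earlier in this notebook keeps the raw stored values. A minimal sketch of a per-element conversion follows; the names `ELEMENT_UNITS` and `to_physical` are hypothetical helpers, not part of the notebook.

```python
# (divisor, unit) for the elements used in this notebook
ELEMENT_UNITS = {
    'PRCP': (10, 'mm'),
    'SNOW': (1, 'mm'),
    'SNWD': (1, 'mm'),
    'TMAX': (10, 'degC'),
    'TMIN': (10, 'degC'),
    'TAVG': (10, 'degC'),
    'AWND': (10, 'm/s'),
}


def to_physical(element, raw_value):
    """Convert a raw stored value into (value, unit) for a known element."""
    if element not in ELEMENT_UNITS:
        raise ValueError(f"no unit information for element {element!r}")
    divisor, unit = ELEMENT_UNITS[element]
    return raw_value / divisor, unit


print(to_physical('PRCP', 364))
print(to_physical('SNWD', 163))
```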

- Aggregated daily average temperature data from weather stations in North America

In [None]:
%sql
-- TAVG is stored in tenths of a degree Celsius, so dividing by 10 gives degrees Celsius
SELECT date(g.Time) as Date, round(mean(g.Value) / 10, 2) as AvgTemperature
FROM ghcn.ghcn_${end_year} g
JOIN ghcn.station_metadata s
ON g.ID = s.ID
WHERE s.Region = 'North America' AND g.Element = 'TAVG'
GROUP BY Date
ORDER BY Date

Date,AvgTemperature
2023-01-01,-0.87
2023-01-02,-2.33
2023-01-03,-2.56
2023-01-04,-2.2
2023-01-05,-1.98
2023-01-06,-1.78
2023-01-07,-1.93
2023-01-08,-1.71
2023-01-09,-1.06
2023-01-10,-1.2


Databricks visualization. Run in Databricks to view.
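The SQL cells take the table year from the `end_year` widget through `${end_year}`, while the Python cells build the table name with an f-string. When building query text in Python, it can help to check the inputs first, since the year ends up inside a table name. The sketch below only assembles the SQL string; `build_daily_mean_query` is a hypothetical helper, and the character checks are one possible policy rather than a requirement of Spark.

```python
def build_daily_mean_query(year, region, element):
    """Return SQL text for the daily mean of one element in one region.

    The division by 10 assumes an element stored in tenths (e.g. TAVG, PRCP).
    """
    year = int(year)  # raises ValueError for anything that is not a whole number
    if not element.isalnum():
        raise ValueError(f"unexpected element code: {element!r}")
    if not region or not all(ch.isalpha() or ch == ' ' for ch in region):
        raise ValueError(f"unexpected region name: {region!r}")
    return (
        f"SELECT date(g.Time) AS Date, round(mean(g.Value) / 10, 2) AS AvgValue\n"
        f"FROM ghcn.ghcn_{year} g\n"
        f"JOIN ghcn.station_metadata s\n"
        f"ON g.ID = s.ID\n"
        f"WHERE s.Region = '{region}' AND g.Element = '{element}'\n"
        f"GROUP BY Date\n"
        f"ORDER BY Date"
    )


query = build_daily_mean_query("2023", "North America", "TAVG")
print(query)
```

In a Databricks Python cell, the resulting string could then be passed to `spark.sql(query)`.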

- Temperature map for May of `end_year` in the United States

In [None]:
%sql
-- month() returns an integer, so compare against 5 (May) rather than the string '5'
SELECT s.State, round(mean(g.Value) / 10, 2) as AvgTemperature
FROM ghcn.ghcn_${end_year} g
JOIN ghcn.station_metadata s
ON g.ID = s.ID
WHERE s.FIPS = 'US' AND g.Element = 'TAVG' AND month(g.Time) = 5
GROUP BY s.State
ORDER BY s.State

State,AvgTemperature
AK,6.23
AL,21.37
AR,20.73
AZ,15.64
CA,13.89
CO,7.95
CT,15.47
FL,24.46
GA,20.38
HI,20.37


Databricks visualization. Run in Databricks to view.

- World temperature map for January of `end_year`

In [None]:
%sql
-- month() returns an integer, so compare against 1 (January) rather than the string '1'
SELECT s.`ISO-2`, round(mean(g.Value) / 10, 2) as AvgTemperature
FROM ghcn.ghcn_${end_year} g
JOIN ghcn.station_metadata s
ON g.ID = s.ID
WHERE g.Element = 'TAVG' AND month(g.Time) = 1
GROUP BY s.`ISO-2`

ISO-2,AvgTemperature
MM,24.29
DZ,9.73
LT,0.49
CI,26.8
PM,0.83
SC,26.75
AZ,2.52
UA,0.66
RO,3.6
KI,28.15


Databricks visualization. Run in Databricks to view.