Read GHCN-Daily data from an Azure Blob Storage URL into a pandas DataFrame,\
convert the date column to a datetime type, and write the result to the `ghcn_{year}` table
- Modify `start_year` and `end_year` to fetch other years or a range of years

In [0]:
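import pandas as pd

# The by_year CSV files have no header row, so column names are supplied here.
new_columns = ['ID', 'Time', 'Element', 'Value', 'M-Flag', 'Q-Flag', 'S-Flag', 'OBS-Time']

# Inclusive range of years to load.
start_year = 2022
end_year = 2022

for year in range(start_year, end_year + 1):
    URL = f'https://ghcn.blob.core.windows.net/ghcn/csv/daily/by_year/{year}.csv'

    df = pd.read_csv(URL, names=new_columns)

    # Dates arrive as YYYYMMDD values; convert them to datetimes.
    df['Time'] = pd.to_datetime(df['Time'], format='%Y%m%d')

    # `spark` is the session that Databricks provides in the notebook.
    spark.createDataFrame(df).write.mode("overwrite").saveAsTable(f"ghcn.ghcn_{year}")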
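
The parsing step can be tried without Spark or network access. The following is a minimal sketch, assuming two hand-written rows in the headerless by_year layout; the flag values in these rows are illustrative and not taken from the dataset.

```python
import io
import pandas as pd

new_columns = ['ID', 'Time', 'Element', 'Value', 'M-Flag', 'Q-Flag', 'S-Flag', 'OBS-Time']

# Hand-written rows for illustration only (flag values are assumptions).
sample_csv = io.StringIO(
    "AR000000011,20220101,TMIN,232,,,S,\n"
    "AR000000011,20220101,TAVG,308,H,,S,\n"
)

sample_df = pd.read_csv(sample_csv, names=new_columns)

# The date is read as an integer such as 20220101; cast it to text and
# parse it with an explicit format to get a datetime column.
sample_df['Time'] = pd.to_datetime(sample_df['Time'].astype(str), format='%Y%m%d')

print(sample_df[['ID', 'Time', 'Element', 'Value']])
```

Passing an explicit `format` avoids ambiguity in how the eight-digit date is interpreted.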


In SQL queries, join the GHCN table with `station_metadata` and filter for the countries you are interested in
- Daily data for 2022 from weather stations in Central and South America

In [0]:
%sql
SELECT g.Time, g.ID, s.Country, s.StationName, g.Element, g.Value
FROM ghcn.ghcn_2022 g
JOIN ghcn.station_metadata s
ON g.ID = s.ID 
WHERE s.Country IN ("Belize", "Costa Rica", "El Salvador", "Guatemala", "Honduras", "Mexico", "Nicaragua", "Panama", "Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador", "French Guiana", "Guyana", "Paraguay", "Peru", "Suriname", "Uruguay", "Venezuela")
-- ORDER BY gives a total ordering; SORT BY only sorts within each partition
ORDER BY g.Time

Time,ID,Country,StationName,Element,Value
2022-01-01T00:00:00Z,AR000000011,Argentina,MONTE CASEROS AERO,TMIN,232
2022-01-01T00:00:00Z,AR000000011,Argentina,MONTE CASEROS AERO,TAVG,308
2022-01-01T00:00:00Z,AR000087007,Argentina,LA QUIACA OBSERVATO,TMIN,77
2022-01-01T00:00:00Z,AR000087007,Argentina,LA QUIACA OBSERVATO,TAVG,166
2022-01-01T00:00:00Z,AR000087078,Argentina,LAS LOMITAS,TMIN,246
2022-01-01T00:00:00Z,AR000087078,Argentina,LAS LOMITAS,TAVG,372
2022-01-01T00:00:00Z,AR000087129,Argentina,SANTIAGO DEL ESTERO,TMIN,279
2022-01-01T00:00:00Z,AR000087129,Argentina,SANTIAGO DEL ESTERO,TAVG,363
2022-01-01T00:00:00Z,AR000087155,Argentina,RESISTENCIA AERO,TMIN,217
2022-01-01T00:00:00Z,AR000087155,Argentina,RESISTENCIA AERO,TAVG,323
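
The same join and filter can be sketched in pandas for local experimentation. This is a rough equivalent under stated assumptions: the `ghcn` and `stations` frames are small stand-ins for the two tables, the `XX000000001` station is hypothetical, and the country list is shortened.

```python
import pandas as pd

# Stand-in for ghcn.ghcn_2022; the XX000000001 row is hypothetical.
ghcn = pd.DataFrame({
    'ID': ['AR000000011', 'AR000000011', 'XX000000001'],
    'Time': pd.to_datetime(['2022-01-01', '2022-01-01', '2022-01-01']),
    'Element': ['TMIN', 'TAVG', 'TMIN'],
    'Value': [232, 308, -50],
})

# Stand-in for ghcn.station_metadata.
stations = pd.DataFrame({
    'ID': ['AR000000011', 'XX000000001'],
    'Country': ['Argentina', 'Elsewhere'],
    'StationName': ['MONTE CASEROS AERO', 'HYPOTHETICAL STATION'],
})

countries = ['Argentina', 'Brazil', 'Chile']  # shortened list for the sketch

# Inner join on the station ID, keep the selected countries, order by time.
joined = ghcn.merge(stations, on='ID', how='inner')
result = (
    joined[joined['Country'].isin(countries)]
    .sort_values('Time')
    [['Time', 'ID', 'Country', 'StationName', 'Element', 'Value']]
)
print(result)
```

Temperature values in GHCN-Daily are stored in tenths of a degree Celsius, so a `Value` of 232 corresponds to 23.2 °C.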


Mean daily precipitation by month (in mm; GHCN stores PRCP in tenths of mm) across weather stations in Central and South America

In [0]:
%sql
SELECT month(g.Time) as Month, round(mean(g.Value) / 10, 2) as Precipitation
FROM ghcn.ghcn_2022 g
JOIN ghcn.station_metadata s
ON g.ID = s.ID 
WHERE s.Country IN ("Belize", "Costa Rica", "El Salvador", "Guatemala", "Honduras", "Mexico", "Nicaragua", "Panama", "Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador", "French Guiana", "Guyana", "Paraguay", "Peru", "Suriname", "Uruguay", "Venezuela") AND g.Element = 'PRCP'
GROUP BY Month
ORDER BY Month

Month,Precipitation
1,3.6
2,3.81
3,4.56
4,4.39
5,3.44
6,4.29
7,3.7
8,4.15
9,5.08
10,4.48
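
The aggregation can also be checked on a small frame in pandas. This is a minimal sketch with made-up precipitation readings; the `prcp` frame and its values are assumptions for illustration, not data from the table.

```python
import pandas as pd

# Made-up readings in tenths of a millimetre, as GHCN stores PRCP.
prcp = pd.DataFrame({
    'Time': pd.to_datetime(['2022-01-05', '2022-01-20', '2022-02-03']),
    'Element': ['PRCP', 'PRCP', 'PRCP'],
    'Value': [30, 50, 12],
})

monthly = (
    prcp[prcp['Element'] == 'PRCP']
    .assign(Month=lambda d: d['Time'].dt.month)  # calendar month 1..12
    .groupby('Month')['Value'].mean()            # mean of the raw readings
    .div(10)                                     # tenths of mm -> mm
    .round(2)
    .rename('Precipitation')
    .reset_index()
)
print(monthly)
```

Because each month appears once after grouping, no separate de-duplication step is needed.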


Daily average temperature (in °C; GHCN stores temperatures in tenths of °C) averaged across weather stations in the US

In [0]:
%sql
SELECT date(g.Time) as Date, round(mean(g.Value) / 10, 2) as AvgTemperature
FROM ghcn.ghcn_2022 g
JOIN ghcn.station_metadata s
ON g.ID = s.ID 
WHERE s.Country_code = 'US' AND g.Element = 'TAVG'
GROUP BY Date
ORDER BY Date

Date,AvgTemperature
2022-01-01,-6.42
2022-01-02,-4.72
2022-01-03,-3.1
2022-01-04,-2.39
2022-01-05,-1.55
2022-01-06,-0.66
2022-01-07,-0.65
2022-01-08,-2.28
2022-01-09,-1.91
2022-01-10,-1.08

