# Accessing Free Hourly Weather Data
NOAA publishes free hourly weather data at: ftp://ftp.ncdc.noaa.gov/pub/data/noaa/

It is my understanding that this is the data underlying most of the websites that charge for historical weather data at the hourly level (at least when needed in bulk).

The most detailed data is in a very cumbersome format, but an easy-to-parse subset can be found at: ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite/

To make this topic approachable to a large audience, I've added some notes on how one could access this data simply using Excel at the end of this notebook.

In [1]:
# Column Names were determined from ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite/isd-lite-format.pdf
# That pdf describes what data is contained in the subset of data that I'll focus on.
isd_fwf_cols = ['year', 'month', 'day', 'hour', 'air_temp_c', 'dew_pt_temp_c',
                 'sea_lvl_press_hectoPa', 'wnd_dir_360', 'wnd_spd_mtrpersec',
                 'sky_condition', 'precip_hrly', 'precip_6hr_accum']
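
To make the column list concrete, here is a minimal sketch of how ISD-Lite style lines map onto those names. The two rows are synthetic: the script formats them itself to mimic the fixed-width layout (assumed here to be a date block followed by 6-character fields), reusing values that appear later in this notebook. The helper `format_row` is hypothetical and exists only for this illustration.

```python
import io

import numpy as np
import pandas as pd

isd_fwf_cols = ['year', 'month', 'day', 'hour', 'air_temp_c', 'dew_pt_temp_c',
                'sea_lvl_press_hectoPa', 'wnd_dir_360', 'wnd_spd_mtrpersec',
                'sky_condition', 'precip_hrly', 'precip_6hr_accum']

# Synthetic rows in raw units: scaled fields are in tenths, -9999 marks a missing value.
raw_rows = [
    (2019, 9, 19, 3, 189, 133, 10250, 130, 36, -9999, 0, -9999),
    (2019, 9, 19, 4, 178, 128, 10251, 999, 15, -9999, 0, -9999),
]


def format_row(row):
    """Lay out one row as fixed-width text (assumed layout, for illustration only)."""
    date_part = "{:4d} {:02d} {:02d} {:02d}".format(*row[:4])
    value_part = "".join("{:6d}".format(value) for value in row[4:])
    return date_part + value_part


sample_text = "\n".join(format_row(row) for row in raw_rows)

# read_fwf infers the column boundaries from the whitespace between fields.
sample_df = pd.read_fwf(io.StringIO(sample_text), header=None, names=isd_fwf_cols)
sample_df = sample_df.replace(-9999, np.nan)
print(sample_df)
```

At this point the scaled columns still hold raw tenths (189 rather than 18.9); the download function further down divides them by 10.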

In [2]:
# Importing the python libraries that I use.
import pandas as pd
import numpy as np

In [3]:
# Importing the table defining the available data. 
# There is a row for each station and it includes the begin and end date of available data.
isd_stations_data = pd.read_csv('ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-history.csv')
isd_stations_data.head()

Unnamed: 0,USAF,WBAN,STATION NAME,CTRY,STATE,ICAO,LAT,LON,ELEV(M),BEGIN,END
0,7018,99999,WXPOD 7018,,,,0.0,0.0,7018.0,20110309,20130730
1,7026,99999,WXPOD 7026,AF,,,0.0,0.0,7026.0,20120713,20170822
2,7070,99999,WXPOD 7070,AF,,,0.0,0.0,7070.0,20140923,20150926
3,8260,99999,WXPOD8270,,,,0.0,0.0,0.0,20050101,20100731
4,8268,99999,WXPOD8278,AF,,,32.95,65.567,1156.7,20100519,20120323
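
The BEGIN and END columns are integers in YYYYMMDD form. As a rough sketch of how they could be turned into real dates for filtering, the snippet below uses a small hand-typed frame copied from three of the rows above (not the full NOAA file). The `_DATE` column names and the choice of 2012 as the target year are my own, picked only for illustration.

```python
import pandas as pd

# Three rows copied by hand from the output above.
stations = pd.DataFrame({
    'USAF': [7018, 7026, 8268],
    'STATION NAME': ['WXPOD 7018', 'WXPOD 7026', 'WXPOD8278'],
    'BEGIN': [20110309, 20120713, 20100519],
    'END': [20130730, 20170822, 20120323],
})

# Parsing the YYYYMMDD integers into datetimes.
for col in ['BEGIN', 'END']:
    stations[col + '_DATE'] = pd.to_datetime(stations[col].astype(str), format='%Y%m%d')

# Keeping stations whose record spans all of 2012.
covers_2012 = stations.loc[(stations['BEGIN_DATE'] <= pd.Timestamp('2012-01-01'))
                           & (stations['END_DATE'] >= pd.Timestamp('2012-12-31'))]
print(covers_2012[['STATION NAME', 'BEGIN_DATE', 'END_DATE']])
```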


In [4]:
# I want data for DC, so I've chosen to search for the local airport, Reagan National Airport (DCA).
# Note that all of the Station Names are uppercase.
# na=False treats stations with a missing name as non-matches.
DCA_search = isd_stations_data.loc[isd_stations_data['STATION NAME'].str.contains('REAGAN', na=False)]

In [5]:
# Slicing out the BEGIN and END years to create the range of years for which I'll download data.
start_year = str(list(DCA_search.BEGIN)[0])[0:4]
end_year = str(list(DCA_search.END)[0])[0:4]
year_range = range(int(start_year), int(end_year)+1)
year_range

range(1936, 2020)

In [6]:
# Creating the station ID by which the ftp site is organized.
# Note that it is the concatenation of two columns separated by a hyphen.
station_id = str(list(DCA_search.USAF)[0])+'-'+str(list(DCA_search.WBAN)[0])
station_id

'724050-13743'
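
One caveat, sketched below under an assumption: I believe USAF identifiers are six characters wide and WBAN identifiers five, so a station whose USAF code was read as a small integer (like 7018 in the table above) would need its leading zeros restored before building a file name. The helpers `build_station_id` and `isd_lite_url` are hypothetical names for this illustration, and nothing here checks that a file actually exists at the resulting URL.

```python
def build_station_id(usaf, wban):
    """Join USAF and WBAN with a hyphen, left-padding with zeros (assumed widths 6 and 5)."""
    return str(usaf).zfill(6) + '-' + str(wban).zfill(5)


def isd_lite_url(station_id, year):
    """Build the ISD-Lite file URL using the same pattern as the download function."""
    return ('ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite/'
            + str(year) + '/' + station_id + '-' + str(year) + '.gz')


print(build_station_id(724050, 13743))   # already full width, so unchanged
print(build_station_id(7018, 99999))     # padded to six characters
print(isd_lite_url(build_station_id(724050, 13743), 2019))
```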

In [15]:
# Function to loop through a given station ID for a given range of years.
def download_isd_lite(station_id, year_range):
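    yearly_frames = []
    for year in year_range:
        # There can be gaps of missing years in the data, so try and except are required.
        # The gaps that I've seen are only from decades ago.
        try:
            new_isd_df = pd.read_fwf('ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite/'
                                     +str(year)+'/'
                                     +station_id+'-'
                                     +str(year)
                                     +'.gz',
                                     header=None)
            yearly_frames.append(new_isd_df)
        except Exception:
            # Skipping years with no readable file, without swallowing KeyboardInterrupt.
            continue

    # Concatenating once at the end avoids re-copying the accumulated data on every iteration.
    isd_df = pd.concat(yearly_frames)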
    
    # Resetting the index of the concatenated DataFrame
    isd_df.reset_index(inplace=True, drop=True)
    
    # Setting the column names that I've derived from the format guide
    isd_df.columns = isd_fwf_cols
   
    # NOAA populates missing values with -9999, but I've chosen to replace them with NaN's.
    isd_df.replace({-9999: np.nan}, inplace=True)
    
    # Some of the columns are scaled by a factor of 10 to eliminate decimal points,
    # which would complicate the fixed width format that NOAA has chosen to utilize
    scaled_columns = ['air_temp_c', 'dew_pt_temp_c', 'sea_lvl_press_hectoPa', 
                  'wnd_spd_mtrpersec', 'precip_hrly', 'precip_6hr_accum']
    scaling_factor = 10
    # Resolving the scaling factor
    isd_df[scaled_columns] = isd_df[scaled_columns] / scaling_factor
    
    # Creating a date_time column from the various time-based columns NOAA provides.
    # The first step builds a day/month/year/hour string for each row; the second parses it with pandas.
    isd_df['date_time'] = isd_df.day.astype('int').astype('str').str.zfill(2)+'/'\
                         +isd_df.month.astype('int').astype('str').str.zfill(2)+'/'\
                         +isd_df.year.astype('int').astype('str')+'/'\
                         +isd_df.hour.astype('int').astype('str').str.zfill(2)
    isd_df['date_time'] = pd.to_datetime(isd_df['date_time'], format='%d/%m/%Y/%H')
    
    return isd_df
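
As an aside, the string-building step for `date_time` could probably be replaced by letting pandas assemble the timestamp directly from the component columns. This is a minimal sketch on a two-row synthetic frame, not a change to the function above; the frame `parts` is made up for the illustration.

```python
import pandas as pd

# Synthetic stand-in for the year/month/day/hour columns (floats, as in the real frame).
parts = pd.DataFrame({
    'year': [2019.0, 2019.0],
    'month': [9.0, 9.0],
    'day': [19.0, 19.0],
    'hour': [3.0, 4.0],
})

# pd.to_datetime accepts a DataFrame whose columns are named year, month, day, hour.
assembled = pd.to_datetime(parts[['year', 'month', 'day', 'hour']].astype(int))

# The string-based approach used in the function, for comparison.
as_text = (parts.day.astype('int').astype('str').str.zfill(2) + '/'
           + parts.month.astype('int').astype('str').str.zfill(2) + '/'
           + parts.year.astype('int').astype('str') + '/'
           + parts.hour.astype('int').astype('str').str.zfill(2))
from_text = pd.to_datetime(as_text, format='%d/%m/%Y/%H')

print(assembled)
print((assembled == from_text).all())
```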

In [16]:
# Running the function for DCA for all years
isd_df = download_isd_lite(station_id, year_range)

In [17]:
# Inspecting the results
isd_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555526 entries, 0 to 555525
Data columns (total 13 columns):
year                     555526 non-null float64
month                    555526 non-null float64
day                      555526 non-null float64
hour                     555526 non-null float64
air_temp_c               555521 non-null float64
dew_pt_temp_c            554144 non-null float64
sea_lvl_press_hectoPa    494056 non-null float64
wnd_dir_360              553465 non-null float64
wnd_spd_mtrpersec        555473 non-null float64
sky_condition            453694 non-null float64
precip_hrly              459914 non-null float64
precip_6hr_accum         33787 non-null float64
date_time                555526 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(12)
memory usage: 55.1 MB
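
The non-null counts above can be hard to compare by eye. One possible way to summarize them is the share of missing values per column, sketched here on a tiny synthetic frame (the values are made up; on the real data the same expression would be applied to `isd_df`).

```python
import numpy as np
import pandas as pd

# Synthetic four-row frame standing in for isd_df.
sample = pd.DataFrame({
    'air_temp_c': [18.9, 17.8, np.nan, 16.7],
    'precip_6hr_accum': [np.nan, np.nan, np.nan, 0.0],
    'date_time': pd.to_datetime(['2019-09-19 03:00', '2019-09-19 04:00',
                                 '2019-09-19 05:00', '2019-09-19 06:00']),
})

# isna() gives True/False per cell; the mean of those is the fraction missing.
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```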


In [18]:
isd_df.tail()

Unnamed: 0,year,month,day,hour,air_temp_c,dew_pt_temp_c,sea_lvl_press_hectoPa,wnd_dir_360,wnd_spd_mtrpersec,sky_condition,precip_hrly,precip_6hr_accum,date_time
555521,2019.0,9.0,19.0,3.0,18.9,13.3,1025.0,130.0,3.6,,0.0,,2019-09-19 03:00:00
555522,2019.0,9.0,19.0,4.0,17.8,12.8,1025.1,999.0,1.5,,0.0,,2019-09-19 04:00:00
555523,2019.0,9.0,19.0,5.0,17.2,12.8,1025.4,80.0,2.1,,0.0,,2019-09-19 05:00:00
555524,2019.0,9.0,19.0,6.0,16.7,12.2,1025.5,70.0,2.6,,,,2019-09-19 06:00:00
555525,2019.0,9.0,19.0,7.0,16.7,12.8,1025.4,60.0,2.6,,,,2019-09-19 07:00:00


In [19]:
# Not much missing temperature data, and the missing values aren't recent.
isd_df.loc[isd_df.air_temp_c.isna()].tail()

Unnamed: 0,year,month,day,hour,air_temp_c,dew_pt_temp_c,sea_lvl_press_hectoPa,wnd_dir_360,wnd_spd_mtrpersec,sky_condition,precip_hrly,precip_6hr_accum,date_time
278625,1988.0,1.0,4.0,8.0,,,,340.0,3.1,8.0,1.3,,1988-01-04 08:00:00
281732,1988.0,5.0,17.0,5.0,,,,999.0,99.9,,0.0,,1988-05-17 05:00:00
282002,1988.0,5.0,28.0,17.0,,11.7,1020.3,180.0,1.5,,0.0,0.0,1988-05-28 17:00:00
283239,1988.0,7.0,20.0,22.0,,,,310.0,10.8,8.0,9.9,,1988-07-20 22:00:00
283256,1988.0,7.0,21.0,17.0,,,,999.0,99.9,,0.0,,1988-07-21 17:00:00


In [20]:
# More missing dew point (humidity) data, but not recently.
isd_df.loc[isd_df.dew_pt_temp_c.isna()].tail()

Unnamed: 0,year,month,day,hour,air_temp_c,dew_pt_temp_c,sea_lvl_press_hectoPa,wnd_dir_360,wnd_spd_mtrpersec,sky_condition,precip_hrly,precip_6hr_accum,date_time
283239,1988.0,7.0,20.0,22.0,,,,310.0,10.8,8.0,9.9,,1988-07-20 22:00:00
283256,1988.0,7.0,21.0,17.0,,,,999.0,99.9,,0.0,,1988-07-21 17:00:00
283450,1988.0,7.0,30.0,0.0,33.3,,1019.6,270.0,3.6,0.0,0.0,0.0,1988-07-30 00:00:00
283452,1988.0,7.0,30.0,2.0,33.3,,1019.6,270.0,3.6,0.0,0.0,0.0,1988-07-30 02:00:00
284763,1988.0,9.0,24.0,10.0,13.3,,1014.9,30.0,6.2,8.0,0.0,,1988-09-24 10:00:00


# Excel Users
You could build the ftp URLs in Excel and then click each link by hand. With one column holding the station_id and another holding the year of each file you want, the URLs can be constructed as follows:

=HYPERLINK("ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite/"&reference_year_cell&"/"&reference_station_id_cell&"-"&reference_year_cell&".gz")

You can then unzip each file and open it in Excel. With Excel's "Text to Columns" feature and the "Original data type" option set to "Fixed width", Excel will correctly separate the data into columns. You can then add the column headers by hand, save the data as an Excel file, and manually combine the data for multiple years and/or stations as needed.
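
For anyone comfortable running a single script, the unzip and "Text to Columns" steps could be collapsed into one conversion that writes a CSV Excel opens directly. This is only a sketch: `isd_lite_gz_to_csv` is a hypothetical helper, and the demo runs on a tiny synthetic `.gz` file created in a temporary folder rather than on a real NOAA download.

```python
import gzip
import os
import tempfile

import pandas as pd

isd_fwf_cols = ['year', 'month', 'day', 'hour', 'air_temp_c', 'dew_pt_temp_c',
                'sea_lvl_press_hectoPa', 'wnd_dir_360', 'wnd_spd_mtrpersec',
                'sky_condition', 'precip_hrly', 'precip_6hr_accum']


def isd_lite_gz_to_csv(gz_path, csv_path, column_names):
    """Read a gzipped fixed-width file and save it as a CSV with headers."""
    with gzip.open(gz_path, 'rt') as handle:
        df = pd.read_fwf(handle, header=None, names=column_names)
    df.to_csv(csv_path, index=False)
    return df


# Synthetic lines imitating the fixed-width layout (not real NOAA records).
sample_lines = [
    "2019 09 19 03   189   133 10250   130    36 -9999     0 -9999",
    "2019 09 19 04   178   128 10251   999    15 -9999     0 -9999",
]

with tempfile.TemporaryDirectory() as folder:
    gz_path = os.path.join(folder, 'sample.gz')
    csv_path = os.path.join(folder, 'sample.csv')
    with gzip.open(gz_path, 'wt') as handle:
        handle.write("\n".join(sample_lines) + "\n")
    isd_lite_gz_to_csv(gz_path, csv_path, isd_fwf_cols)
    round_trip = pd.read_csv(csv_path)

print(round_trip)
```

The values in the CSV are still in NOAA's raw form (tenths and -9999), so the scaling and missing-value steps from the download function would still apply.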

In [21]:
import folium

In [40]:
# Dropping stations whose latitude or longitude is 0 (apparently used as a placeholder) or missing.
stations_with_coordinates = isd_stations_data.loc[((isd_stations_data.LAT != 0) 
                                                  & (isd_stations_data.LON != 0))
                                                  & isd_stations_data.LAT.notna()
                                                  & isd_stations_data.LON.notna()].copy()

In [46]:
stations_with_coordinates['COORDINATES'] = list(zip(round(stations_with_coordinates.LAT,3), 
                                                    round(stations_with_coordinates.LON,3)))

In [96]:
# Reducing the volume so a reasonable map can be rendered: US stations with an ICAO code (airports)
# that were still reporting after 2019-09-17, so only active sites are included.
us_airports_with_coordinates = stations_with_coordinates.loc[(stations_with_coordinates.CTRY == 'US')
                                                             & stations_with_coordinates.ICAO.notna()
                                                             & (stations_with_coordinates.END > 20190917)].copy()

In [98]:
us_airports_with_coordinates['IDENTIFIER'] = (us_airports_with_coordinates['STATION NAME']+'; Station ID: '+
                                              us_airports_with_coordinates['USAF'].astype('str')+'-'
                                              +us_airports_with_coordinates['WBAN'].astype('str'))

In [99]:
# Sorting by BEGIN so that trimming the entries with .head() preserves the stations with the longest history of data
us_airports_with_coordinates.sort_values(by=['BEGIN'], ascending=True, inplace=True)

In [103]:
base_map = folium.Map([38.6270, -90.1994], zoom_start=4)
# Also limiting it to 1000 locations. The positional index from enumerate lines up with .iloc
# because both come from the same sorted DataFrame.
for index, coord in enumerate(us_airports_with_coordinates['COORDINATES'].head(1000)):
    folium.Marker([coord[0],coord[1]], popup=us_airports_with_coordinates.iloc[index]['IDENTIFIER']).add_to(base_map)
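
A possible alternative to pairing coordinates and labels through a positional index is to walk both columns together with `zip`. The sketch below uses plain pandas so it runs without folium; `marker_specs` is a hypothetical helper, and the two stations in the demo frame are placeholders, not real sites.

```python
import pandas as pd


def marker_specs(stations, limit=1000):
    """Return (lat, lon, label) tuples for the first `limit` rows, in row order."""
    subset = stations.head(limit)
    return [(lat, lon, label)
            for (lat, lon), label in zip(subset['COORDINATES'], subset['IDENTIFIER'])]


# Placeholder frame with a deliberately non-sequential index.
demo = pd.DataFrame({
    'COORDINATES': [(10.0, 20.0), (30.5, -40.25)],
    'IDENTIFIER': ['PLACEHOLDER A; Station ID: 000001-00001',
                   'PLACEHOLDER B; Station ID: 000002-00002'],
}, index=[42, 7])

for lat, lon, label in marker_specs(demo):
    print(lat, lon, label)
```

Each tuple could then be handed to `folium.Marker([lat, lon], popup=label).add_to(base_map)` in the same way as the loop above.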

In [104]:
# base_map = folium.Map([38.6270, -90.1994], zoom_start=4)
base_map