# Accessing Free Hourly Weather Data
NOAA publishes free hourly weather data at: ftp://ftp.ncdc.noaa.gov/pub/data/noaa/

As I understand it, this is the data behind most of the websites that charge for hourly historical weather data (at least when it is needed in bulk).

The most detailed data is in a very cumbersome format, but an easy-to-parse subset can be found at: ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite/

To make this topic approachable to a large audience, I've added some notes at the end of this notebook on how to access this data using only Excel.

In [1]:
# Column Names were determined from ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite/isd-lite-format.pdf
# That pdf describes what data is contained in the subset of data that I'll focus on.
isd_fwf_cols = ['year', 'month', 'day', 'hour', 'air_temp_c', 'dew_pt_temp_c',
                 'sea_lvl_press_hectoPa', 'wnd_dir_360', 'wnd_spd_mtrpersec',
                 'sky_condition', 'precip_hrly', 'precip_6hr_accum']
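
To make the column mapping concrete, here is a minimal sketch that parses a single record and labels it with the names above. The sample line is not copied from a NOAA file: I rebuilt it from the 2019-09-24 03:00 row shown in the `tail()` output further down, assuming the raw file stores 26.1 °C as 261 and missing values as -9999. Splitting on whitespace is a stand-in for `pd.read_fwf`, which is what the notebook actually uses.

```python
import io
import pandas as pd

# Same column names as in the cell above.
isd_fwf_cols = ['year', 'month', 'day', 'hour', 'air_temp_c', 'dew_pt_temp_c',
                'sea_lvl_press_hectoPa', 'wnd_dir_360', 'wnd_spd_mtrpersec',
                'sky_condition', 'precip_hrly', 'precip_6hr_accum']

# Reconstructed record (an assumption, see the note above): values are still
# scaled by 10 and -9999 marks a missing observation.
sample_line = "2019 09 24 03   261   189 10108   310    57 -9999     0 -9999\n"

# Split on runs of whitespace and attach the column names.
sample_df = pd.read_csv(io.StringIO(sample_line), sep=r"\s+",
                        header=None, names=isd_fwf_cols)

# Transposed so that each field prints on its own line.
print(sample_df.T)
```

Each of the twelve fields lands in one named column, which is all the download function later relies on.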

In [2]:
# Importing the python libraries that I use.
import pandas as pd
import numpy as np

In [3]:
# Importing the table defining the available data. 
# There is a row for each station and it includes the begin and end date of available data.
isd_stations_data = pd.read_csv('ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-history.csv')
# isd_stations_data.head()

In [4]:
# I want data for DC, so I search for the local airport: Reagan National Airport (DCA).
# Note that all of the station names are uppercase.
# na=False treats rows with a missing station name as non-matches.
DCA_search = isd_stations_data.loc[isd_stations_data['STATION NAME'].str.contains('REAGAN', na=False)]
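
The same search can be wrapped in a small helper so that other stations are easy to look up. This is only a sketch: `find_stations` is a name I made up, and the toy table below is a hypothetical stand-in for `isd-history.csv` (only the DCA identifiers come from this notebook; the other rows and the exact station name are illustrative).

```python
import pandas as pd

# Hypothetical stand-in for the station table; not real NOAA rows.
toy_stations = pd.DataFrame({
    "USAF": ["724050", "999999", "000000"],
    "WBAN": ["13743", "00001", "00002"],
    "STATION NAME": ["RONALD REAGAN WASHINGTON NATL AP", None, "EXAMPLE FIELD"],
})

def find_stations(stations, pattern):
    """Return rows whose station name contains `pattern`.

    case=False makes the match case-insensitive, na=False drops rows with a
    missing name, and regex=False treats `pattern` as plain text.
    """
    mask = stations["STATION NAME"].str.contains(
        pattern, case=False, na=False, regex=False)
    return stations.loc[mask]

matches = find_stations(toy_stations, "reagan")
print(matches)
```

Because the match ignores case, the search term no longer has to be typed in uppercase.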

In [5]:
# The range of years for which I'll download data.
# It could be derived by slicing the years out of the station's BEGIN and END values
# (commented out below), but here I hard-code 2001 through 2018.
# start_year = str(list(DCA_search.BEGIN)[0])[0:4]
# end_year = str(list(DCA_search.END)[0])[0:4]
year_range = range(2001,2019)
year_range

range(2001, 2019)
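
If you would rather derive the range from the station table, a sketch along these lines should work. It assumes that BEGIN and END are YYYYMMDD-style values, which is what the slicing in the commented-out lines implies; `years_from_begin_end` is a hypothetical helper name, and the two dates below are made up to reproduce the hard-coded range.

```python
def years_from_begin_end(begin, end):
    """Build an inclusive range of years from YYYYMMDD-style BEGIN and END values."""
    start_year = int(str(begin)[:4])
    end_year = int(str(end)[:4])
    # range() excludes its stop value, so add 1 to include the final year.
    return range(start_year, end_year + 1)

# Hypothetical BEGIN/END values, chosen to match the hard-coded range.
year_range_example = years_from_begin_end(20010101, 20181231)
print(year_range_example)  # prints range(2001, 2019)
```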

In [6]:
# Creating the station ID by which the ftp site is organized.
# Note that it is the concatenation of two columns separated by a hyphen.
station_id = str(list(DCA_search.USAF)[0])+'-'+str(list(DCA_search.WBAN)[0])
station_id

'724050-13743'
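
One thing to watch for with other stations: the file names on the FTP site appear to use a six-character USAF code and a five-digit WBAN code, so an ID that pandas has read as an integer can lose its leading zeros. The sketch below pads both parts. `build_station_id` is a hypothetical helper, and the WBAN value 3017 in the second call is only an illustration.

```python
def build_station_id(usaf, wban):
    """Join USAF and WBAN codes with a hyphen, restoring any leading zeros."""
    # zfill works on strings, so it also copes with USAF codes that contain letters.
    usaf_part = str(usaf).strip().zfill(6)
    wban_part = str(wban).strip().zfill(5)
    return usaf_part + "-" + wban_part

print(build_station_id(724050, 13743))    # the DCA station used in this notebook
print(build_station_id("724050", 3017))   # illustrative WBAN that needs padding
```

Reading the two ID columns as strings in the first place (for example with the `dtype` argument of `pd.read_csv`) would avoid the problem at the source.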

In [7]:
# Function to loop through a given station ID for a given range of years.
def download_isd_lite(station_id, year_range):
    isd_df = pd.DataFrame()
    for year in year_range:
        # There can be gaps of missing years in the data, so try and except were required. 
        # The gaps that I've seen are only from decades ago.
        try:
            new_isd_df = pd.read_fwf('ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite/'
                                     +str(year)+'/'
                                     +station_id+'-'
                                     +str(year)
                                     +'.gz',
                                     header=None)
            isd_df = pd.concat([isd_df, new_isd_df])
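        except Exception:
            # The file for this year could not be read (typically because it does not exist); skip it.
            continue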
    
    # Resetting the index of the concatenated DataFrame
    isd_df.reset_index(inplace=True, drop=True)
    
    # Setting the column names that I've derived from the format guide
    isd_df.columns = isd_fwf_cols
   
    # NOAA populates missing values with -9999, but I've chosen to replace them with NaN's.
    isd_df.replace({-9999: np.nan}, inplace=True)
    
    # Some of the columns are scaled by a factor of 10 to eliminate decimal points,
    # which would complicate the fixed-width format that NOAA has chosen to use.
    scaled_columns = ['air_temp_c', 'dew_pt_temp_c', 'sea_lvl_press_hectoPa',
                      'wnd_spd_mtrpersec', 'precip_hrly', 'precip_6hr_accum']
    scaling_factor = 10
    # Resolving the scaling factor
    isd_df[scaled_columns] = isd_df[scaled_columns] / scaling_factor
    
    # Creating a date_time column from the various time-based columns NOAA provides.
    # The first step builds a properly formatted string; the second step has pandas parse it.
    isd_df['date_time'] = isd_df.day.astype('int').astype('str').str.zfill(2)+'/'\
                         +isd_df.month.astype('int').astype('str').str.zfill(2)+'/'\
                         +isd_df.year.astype('int').astype('str')+'/'\
                         +isd_df.hour.astype('int').astype('str').str.zfill(2)
    isd_df['date_time'] = pd.to_datetime(isd_df['date_time'], format='%d/%m/%Y/%H')
    
    return isd_df
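
As an aside, the `date_time` column built inside the function can also be assembled without any string formatting, because `pd.to_datetime` accepts a DataFrame whose columns are named `year`, `month`, `day` and `hour`. This is a sketch of that alternative, not the approach the function uses; the two rows are taken from the `tail()` output shown later.

```python
import pandas as pd

# Two rows of time components, stored as floats just like in isd_df.
parts = pd.DataFrame({
    "year": [2019.0, 2019.0],
    "month": [9.0, 9.0],
    "day": [24.0, 24.0],
    "hour": [3.0, 4.0],
})

# Cast to int first, then let pandas assemble one timestamp per row.
date_time = pd.to_datetime(parts[["year", "month", "day", "hour"]].astype(int))
print(date_time)
```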

In [8]:
# Running the function for DCA for all years
isd_df = download_isd_lite(station_id, year_range)
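
For long year ranges it can help to know which years were skipped. Here is a sketch of a variant of the download loop that collects the yearly frames in a list, concatenates them once, and reports the years that failed. The names `isd_lite_url`, `collect_years` and the `reader` parameter are my own additions for illustration; the `reader` hook exists only so that the loop can be tried without touching the FTP site.

```python
import pandas as pd

ISD_LITE_ROOT = "ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite/"

def isd_lite_url(station_id, year):
    """Build the URL of one station-year file, following the layout used above."""
    return f"{ISD_LITE_ROOT}{year}/{station_id}-{year}.gz"

def collect_years(station_id, year_range, reader=None):
    """Read every available year and return (combined frame, list of skipped years)."""
    if reader is None:
        # Default: read the fixed-width file straight from the FTP site.
        reader = lambda url: pd.read_fwf(url, header=None)
    frames, missing_years = [], []
    for year in year_range:
        try:
            frames.append(reader(isd_lite_url(station_id, year)))
        except Exception:
            # Remember the year instead of skipping it silently.
            missing_years.append(year)
    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return combined, missing_years

# Offline demonstration with a stand-in reader that pretends 2002 is missing.
def fake_reader(url):
    if "2002" in url:
        raise OSError("no such file")
    return pd.DataFrame([[1, 2]])

combined, missing = collect_years("724050-13743", range(2001, 2004), reader=fake_reader)
print(len(combined), missing)
```

Calling `collect_years(station_id, year_range)` without a `reader` would hit the real FTP site, just like `download_isd_lite` does.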

In [9]:
isd_df.to_csv('DC_weather_data_2001-2018.csv', index=False)
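
When the CSV is loaded again later, `date_time` comes back as plain text unless pandas is told to parse it. A minimal sketch of the round trip, using an in-memory buffer in place of the file and two rows taken from the `tail()` output below:

```python
import io
import pandas as pd

small_df = pd.DataFrame({
    "air_temp_c": [26.1, 26.1],
    "date_time": pd.to_datetime(["2019-09-24 03:00:00", "2019-09-24 04:00:00"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
small_df.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype on the way back in.
reloaded = pd.read_csv(buffer, parse_dates=["date_time"])
print(reloaded.dtypes)
```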

In [146]:
# Inspecting the results
isd_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 564220 entries, 0 to 564219
Data columns (total 13 columns):
year                     564220 non-null float64
month                    564220 non-null float64
day                      564220 non-null float64
hour                     564220 non-null float64
air_temp_c               564215 non-null float64
dew_pt_temp_c            562838 non-null float64
sea_lvl_press_hectoPa    502737 non-null float64
wnd_dir_360              562159 non-null float64
wnd_spd_mtrpersec        564167 non-null float64
sky_condition            462316 non-null float64
precip_hrly              468604 non-null float64
precip_6hr_accum         35259 non-null float64
date_time                564220 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(12)
memory usage: 56.0 MB
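
The non-null counts above are easier to compare as percentages. Here is a small sketch that reports the share of missing values per column; `missing_share` is a hypothetical helper, and the toy frame reuses a few values from the `tail()` output rather than the full dataset.

```python
import numpy as np
import pandas as pd

def missing_share(df):
    """Percentage of missing values per column, largest first."""
    return df.isna().mean().mul(100).round(1).sort_values(ascending=False)

# A few values borrowed from the tail() output below.
toy = pd.DataFrame({
    "air_temp_c": [26.1, 26.1, 25.6, 24.4],
    "precip_hrly": [0.0, 0.0, 0.0, np.nan],
    "precip_6hr_accum": [np.nan] * 4,
})

share = missing_share(toy)
print(share)
```

On the real data, `missing_share(isd_df)` should make it obvious that `precip_6hr_accum` is the sparsest column.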


In [147]:
isd_df.tail()

,year,month,day,hour,air_temp_c,dew_pt_temp_c,sea_lvl_press_hectoPa,wnd_dir_360,wnd_spd_mtrpersec,sky_condition,precip_hrly,precip_6hr_accum,date_time
564215,2019.0,9.0,24.0,3.0,26.1,18.9,1010.8,310.0,5.7,,0.0,,2019-09-24 03:00:00
564216,2019.0,9.0,24.0,4.0,26.1,18.3,1010.5,300.0,4.6,,0.0,,2019-09-24 04:00:00
564217,2019.0,9.0,24.0,5.0,25.6,17.8,1010.6,300.0,3.6,,0.0,,2019-09-24 05:00:00
564218,2019.0,9.0,24.0,6.0,24.4,17.8,1010.8,310.0,4.6,,,,2019-09-24 06:00:00
564219,2019.0,9.0,24.0,7.0,23.9,17.8,1010.6,330.0,4.1,,,,2019-09-24 07:00:00


In [148]:
# Not much missing temperature data, and the missing values aren't recent.
isd_df.loc[isd_df.air_temp_c.isna()].tail()

,year,month,day,hour,air_temp_c,dew_pt_temp_c,sea_lvl_press_hectoPa,wnd_dir_360,wnd_spd_mtrpersec,sky_condition,precip_hrly,precip_6hr_accum,date_time
287199,1988.0,1.0,4.0,8.0,,,,340.0,3.1,8.0,1.3,,1988-01-04 08:00:00
290306,1988.0,5.0,17.0,5.0,,,,999.0,99.9,,0.0,,1988-05-17 05:00:00
290576,1988.0,5.0,28.0,17.0,,11.7,1020.3,180.0,1.5,,0.0,0.0,1988-05-28 17:00:00
291813,1988.0,7.0,20.0,22.0,,,,310.0,10.8,8.0,9.9,,1988-07-20 22:00:00
291830,1988.0,7.0,21.0,17.0,,,,999.0,99.9,,0.0,,1988-07-21 17:00:00
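
Two of the rows above show a wind direction of 999.0 and a wind speed of 99.9. My guess, and it is only a guess that I have not confirmed in the format guide, is that these are older missing-value codes that the -9999 replacement did not catch: 999 is outside the 0 to 360 range and 99.9 m/s is not a plausible wind speed. Under that assumption, a sketch for masking them could look like this (`mask_wind_sentinels` is a hypothetical helper):

```python
import numpy as np
import pandas as pd

def mask_wind_sentinels(df):
    """Return a copy with implausible wind values replaced by NaN.

    Assumption: directions above 360 and speeds of 99.9 m/s or more are
    missing-value codes rather than real observations.
    """
    cleaned = df.copy()
    cleaned.loc[cleaned["wnd_dir_360"] > 360, "wnd_dir_360"] = np.nan
    cleaned.loc[cleaned["wnd_spd_mtrpersec"].round(1) >= 99.9, "wnd_spd_mtrpersec"] = np.nan
    return cleaned

# Wind values from the first two rows of the output above.
toy_wind = pd.DataFrame({
    "wnd_dir_360": [340.0, 999.0],
    "wnd_spd_mtrpersec": [3.1, 99.9],
})

cleaned_wind = mask_wind_sentinels(toy_wind)
print(cleaned_wind)
```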


In [149]:
# More missing dew point values (my proxy for humidity), but none of them are recent.
isd_df.loc[isd_df.dew_pt_temp_c.isna()].tail()

,year,month,day,hour,air_temp_c,dew_pt_temp_c,sea_lvl_press_hectoPa,wnd_dir_360,wnd_spd_mtrpersec,sky_condition,precip_hrly,precip_6hr_accum,date_time
291813,1988.0,7.0,20.0,22.0,,,,310.0,10.8,8.0,9.9,,1988-07-20 22:00:00
291830,1988.0,7.0,21.0,17.0,,,,999.0,99.9,,0.0,,1988-07-21 17:00:00
292024,1988.0,7.0,30.0,0.0,33.3,,1019.6,270.0,3.6,0.0,0.0,0.0,1988-07-30 00:00:00
292026,1988.0,7.0,30.0,2.0,33.3,,1019.6,270.0,3.6,0.0,0.0,0.0,1988-07-30 02:00:00
293337,1988.0,9.0,24.0,10.0,13.3,,1014.9,30.0,6.2,8.0,0.0,,1988-09-24 10:00:00
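
Since the dataset has dew point rather than humidity itself, relative humidity has to be estimated from the two temperature columns. The sketch below uses the Magnus approximation with one commonly quoted pair of coefficients (17.625 and 243.04 °C). That choice of formula and constants is my assumption, not something taken from the NOAA documentation, and `relative_humidity_pct` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def relative_humidity_pct(air_temp_c, dew_pt_temp_c):
    """Approximate relative humidity (%) from air and dew point temperature in Celsius."""
    a, b = 17.625, 243.04  # Magnus coefficients (assumed, see the note above)
    # Ratio of saturation vapor pressure at the dew point to that at the air temperature.
    exponent = a * dew_pt_temp_c / (b + dew_pt_temp_c) - a * air_temp_c / (b + air_temp_c)
    return 100 * np.exp(exponent)

# Air and dew point temperatures from the tail() output, plus one row with a missing dew point.
toy_temps = pd.DataFrame({
    "air_temp_c": [26.1, 23.9, 33.3],
    "dew_pt_temp_c": [18.9, 17.8, np.nan],
})

toy_temps["rel_humidity_pct"] = relative_humidity_pct(
    toy_temps["air_temp_c"], toy_temps["dew_pt_temp_c"])
print(toy_temps)
```

Rows with a missing dew point simply produce a missing humidity value, so no extra handling is needed for the gaps listed above.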
