Week 01 (W46 Nov16) Global Climate Dataset
This week we downloaded and explored our datasets. We focused mainly on our main dataset, the Global Climate Daily Dataset. We examined the features and their types, and worked out efficient ways to collect the data, organize it in data structures, pre-process it and visualize it. In particular, we examined one major feature of the dataset (Maximum Temperature), identified problems, and worked out transformations that solve some of them by visualizing the feature. Additionally, we filtered the data to try to fit an appropriate distribution.
Global Climate Data (GCD) : Main Dataset
- Number of files: 100.791
- Format: .dly files (GHCN-Daily fixed-width plain-text records)
- Size: 26.5 GB
- Features: 46
- Source Date: 1763 - 2015
World Bank (WB) : Complementary Dataset
- Number of files: 1
- Format: .csv
- Size: ~15 MB
- Features: 82
- Source Date: 1960 - 2015
- ID : Nominal (station identification code)
- YEAR : Integer (year of the record)
- MONTH : Integer (month of the record)
- PRCP : ratio-scaled (Precipitation (tenths of mm))
- SNOW : ratio-scaled (Snowfall (mm))
- SNWD : ratio-scaled (Snow depth (mm))
- TMAX : interval-scaled (Maximum temperature (tenths of degrees C))
- TMIN : interval-scaled (Minimum temperature (tenths of degrees C))
- MFLAG : Nominal,Integer (measurement flag)
- QFLAG : Nominal,Integer (quality flag)
- SFLAG : Nominal,Integer (source flag)
The rest of the features also belong to the Integer, Nominal, interval-scaled and ratio-scaled types. We do not list them here due to space restrictions, but may use them for analysis in future work.
Our dataset consists of many .dly files. Each .dly file contains data for one station. The name of the file corresponds to a station's identification code. For example, "USC00026481.dly" contains the data for the station with the identification code USC00026481.
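In order to make use of our dataset we decided to merge the .dly files into one using cat *.dly > ../merged_dataset.dly (the output is written outside the input directory so that a later run does not pick up the merged file as an input).
Then, using data frames, we organized the data so that every entry contains all the measurements of a station for all the available dates.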
Additionally, we decided to create a script to directly download the desired dataset for any station from the source through ftp.
The format of a typical entry looks like the following:
'X1': 'CA002303986198503PRCP 90 C 0 C 0 C 0 C 4 C 0 C 2 C 8 C 0 C 80 C 0 C 0 C 0 C 0 C 0 C 12 C 122 C 186 C 0 C 0 C 0T C 0 C 0 C 57 C 13 C 0 C 0 C 0 C 0 C 26 C 0 C'
where CA002303986198503PRCP is the record header. The first 11 characters are the station ID, whose first two letters declare the country (here Canada); the next 4 characters are the year of the record, the next 2 the month, and the last 4 the element measured. Such a record is repeated for every available feature and always holds 31 daily entries (!), whether measurements were taken or not. The rest of the columns contain values and flags. Each daily entry has the format VVVVVF1F2F3: a 5-character value followed by three 1-character flags, each of which may be blank.
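The layout described above can be read with fixed character offsets. The following is a minimal sketch, assuming the GHCN-Daily column layout (11-character station ID, 4-character year, 2-character month, 4-character element, then 31 fields of 8 characters each). The names `DlyRecord`, `parse_dly_line` and `make_field` are hypothetical helpers, and the sample line is synthetic, built only to have the correct widths.

```python
from typing import List, NamedTuple


class DlyRecord(NamedTuple):
    station_id: str     # 11 characters; the first two are the country code
    year: int
    month: int
    element: str        # e.g. "PRCP", "TMAX"
    values: List[int]   # 31 daily values, -9999 marks a missing value
    flags: List[str]    # 31 strings of up to 3 characters: MFLAG, QFLAG, SFLAG


def parse_dly_line(line: str) -> DlyRecord:
    """Split one .dly record by position instead of by whitespace."""
    line = line.rstrip("\n")
    values, flags = [], []
    for day in range(31):
        start = 21 + day * 8                       # header occupies 21 characters
        values.append(int(line[start:start + 5]))  # 5-character value
        flags.append(line[start + 5:start + 8])    # 3 one-character flags
    return DlyRecord(line[0:11], int(line[11:15]), int(line[15:17]),
                     line[17:21], values, flags)


def make_field(value: int, mflag: str = " ", qflag: str = " ", sflag: str = " ") -> str:
    """Build one 8-character daily field (used only for the synthetic sample)."""
    return f"{value:>5}{mflag}{qflag}{sflag}"


# Synthetic record: two real-looking days followed by 29 missing days.
sample = ("CA002303986" + "1985" + "03" + "PRCP"
          + make_field(90, sflag="C")
          + make_field(0, mflag="T", sflag="C")
          + make_field(-9999) * 29)
record = parse_dly_line(sample)
```

Slicing by position keeps values and flags apart even when a flag is a digit, which a whitespace split cannot guarantee.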
- There are missing values for many days and there are missing values for meteorological elements. The missing values are marked with -9999. There are obvious outliers for our data.
- Flags are not important for our purposes, as they describe the conditions and quality of the measurements. They can be integers, letters, or blank if there are none, and they need to be filtered out. An issue arises when they are integers (e.g. 0, 6, 7), as they may be confused with the actual element values and lead to faulty or noisy data.
- There are values of the elements for every day of the month in each record. However, there are 31 values each time, and depending on the number of days each month has, the remaining are missing. This creates an issue in data structures and corrupts the actual date representation in the visualization.
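One possible way to handle the missing-value marker and the padding days together is sketched below. It assumes the 31 daily values of a record are already available as integers; `clean_month` is a hypothetical helper name, and the sample values are the first days of the AGM00060425 record for February 1943 followed by missing days.

```python
import calendar
import math
from datetime import date

MISSING = -9999  # marker used in the .dly files for a missing value


def clean_month(year, month, values):
    """Return (date, value) pairs for the days that exist in the month.

    Entries beyond the real month length are dropped, and the missing
    marker is replaced by NaN so that it cannot be mistaken for data.
    """
    n_days = calendar.monthrange(year, month)[1]  # number of days in the month
    cleaned = []
    for day, raw in enumerate(values[:n_days], start=1):
        value = math.nan if raw == MISSING else float(raw)
        cleaned.append((date(year, month, day), value))
    return cleaned


# February 1943 has 28 days, so the last 3 of the 31 entries are padding.
feb_1943 = [MISSING, 148, 96] + [MISSING] * 28
series = clean_month(1943, 2, feb_1943)
```

Dropping the padding before building dates avoids invalid dates such as 30 February in the plotted time axis.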
For the first week, to obtain a better view of our dataset, we decided to analyze and visualize one of the main features, the maximum temperature. In particular, we examined this feature for all the measurements of the station with ID AGM00060425, across all years, months and days. We used Python for all coding and processing this week. As mentioned above, a record in the data structure has the following format:
'AGM00060425194302TAVG-9999 148H S 96H S-9999 -9999 -9999 -9999 -9999 93H S-9999 -9999 -9999 -9999 -9999 126H S-9999 -9999 -9999 -9999 -9999 133H S 79H S-9999 -9999 -9999 -9999 112H S 106H S-9999 -9999 -9999 \n'
Our goal is to isolate the date of each measurement and the measurement values, create a new data structure and visualize it. We executed the following steps:
1. Extracted the years in an array "[1957, 1957, 1957, 1967...]"
2. Extracted the months in an array "[1, 2, 2, 3...]"
3. Estimated the number of days of each month for any year with Python's "calendar" library, and stored the sequence of day numbers in another array "[1..31]"
4. Concatenated the three arrays above and converted the datatype from string to datetime
5. Extracted the maximum temperature values per day
6. Filtered out flags with the use of regex (regular expressions)
7. Visualized the data [Figure 1]
8. We observe that, due to the outliers, the data are not clear and do not provide actual information. Thus, we replaced the outliers (values = -9999) with nan (not a number) and revisualized [Figure 2].
9. We now observe several missing and scattered values.
10. Finally, we created a histogram of our data to estimate a possible distribution [Figure 3]. It resembles a cosine pattern over the months, which makes sense given the seasonal effect on temperatures.
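A simple numerical check of the seasonal pattern is to average the cleaned values per calendar month. The sketch below is an illustration rather than the code used for the figures: `monthly_means_celsius` is a hypothetical helper, it assumes the series is a list of (date, value) pairs with values in tenths of degrees C and NaN for missing days, and the toy values are made up.

```python
import math
from collections import defaultdict
from datetime import date


def monthly_means_celsius(series):
    """Average temperature per calendar month, in degrees C, skipping NaN."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day, tenths in series:
        if math.isnan(tenths):
            continue                       # missing measurement
        sums[day.month] += tenths / 10.0   # tenths of degrees C -> degrees C
        counts[day.month] += 1
    return {month: sums[month] / counts[month] for month in sorted(counts)}


# Toy series (hypothetical values, not taken from the dataset).
toy = [
    (date(1957, 1, 1), 150.0),
    (date(1957, 1, 2), 170.0),
    (date(1957, 7, 1), 330.0),
    (date(1957, 7, 2), math.nan),
]
means = monthly_means_celsius(toy)
```

If the twelve monthly means rise towards summer and fall towards winter, that supports the seasonal, cosine-like reading of the plot.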

things to improve
Our complementary dataset from the World Bank (WB) contains a combination of Integer, Nominal, interval-scaled and ratio-scaled types of features. It contains useful information about the environment, pollution and emissions statistics, population statistics, etc., which can be combined with our climate data to derive fruitful results and correlate effects. For example, the percentage of population growth can be correlated with increased CO2 emissions and, further, with the rise of average temperature. The WB dataset contains data per year and per country from 1960 to 2015. The GCD dataset contains data per day/month/year and per station/city/country from 1763 to 2016. Our merging steps for the two datasets are the following:
- Extract data from GCD and organize it country-wise for the years 1960-2015.
- Compute yearly average values for the data.
- Convert .dly to .csv.
- Missing values are marked as -9999 in GCD and as blank in WB. Normalize both to a single representation.
- Merge into one .csv file.
- Filter out countries and years that contain limited data, then continue with further processing.
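The merging steps above could be sketched as follows. This is a minimal illustration under several assumptions: the function names and the column names `country`, `year`, `co2` and `tmax_mean` are hypothetical, the toy numbers are made up, and the WB rows are assumed to already carry the same two-letter country code as the first two characters of the station ID (in practice a mapping between country code schemes would probably be needed).

```python
from collections import defaultdict

MISSING = -9999  # missing-value marker in GCD


def yearly_country_means(daily_rows):
    """daily_rows: iterable of (station_id, year, value).

    Returns {(country_code, year): mean of the non-missing values}.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for station_id, year, value in daily_rows:
        if value == MISSING:
            continue
        key = (station_id[:2], year)        # country code = first two letters
        totals[key][0] += value
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def merge_with_wb(gcd_means, wb_rows):
    """Join WB rows (dicts read from the .csv) with the yearly GCD means.

    Blank WB cells become None; rows without GCD data are dropped.
    """
    merged = []
    for row in wb_rows:
        key = (row["country"], int(row["year"]))
        if key not in gcd_means:
            continue
        out = {name: (None if cell == "" else cell) for name, cell in row.items()}
        out["tmax_mean"] = gcd_means[key]
        merged.append(out)
    return merged


# Toy inputs (hypothetical values).
daily = [
    ("CA002303986", 1985, 100),
    ("CA002303986", 1985, 200),
    ("CA002303986", 1985, MISSING),
    ("AGM00060425", 1985, 300),
]
wb = [
    {"country": "CA", "year": "1985", "co2": ""},
    {"country": "AG", "year": "1985", "co2": "2.9"},
    {"country": "FR", "year": "1985", "co2": "6.5"},
]
merged = merge_with_wb(yearly_country_means(daily), wb)
```

Here missing values are normalized to one convention (dropped before averaging on the GCD side, None on the WB side) before the two sources are joined on country and year.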
- Menne, M.J., I. Durre, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: An overview of the Global Historical Climatology Network-Daily Database. Journal of Atmospheric and Oceanic Technology, 29, 897-910, doi:10.1175/JTECH-D-11-00103.1.
- Menne, M.J., I. Durre, B. Korzeniewski, S. McNeal, K. Thomas, X. Yin, S. Anthony, R. Ray, R.S. Vose, B.E.Gleason, and T.G. Houston, 2012: Global Historical Climatology Network - Daily (GHCN-Daily), Version 3. [indicate subset used following decimal, e.g. Version 3.12]. NOAA National Climatic Data Center. http://doi.org/10.7289/V5D21VHZ