# Reading Precipitation Data

In this notebook, we'll read the HDF5 files containing monthly precipitation data and combine them into a single dataset covering January 2016 to December 2017.

## Import libraries

We'll need to use the `h5py` library to read the HDF5 files and then `pandas` to create a data frame and save it as a CSV file.

In [1]:
import h5py
import pandas as pd

## Explore the file components

We'll first explore the items in the first dataset and select the data we want to keep. All original files are inside the `raw_data` folder.

In [6]:
file = h5py.File('raw_data/3B-MO.MS.MRG.3IMERG.20160101-S000000-E235959.01.V06B.HDF5.SUB.hdf5', 'r')

for item in file.items():
    print(item)

('gaugeRelativeWeighting', <HDF5 dataset "gaugeRelativeWeighting": shape (1, 232, 364), type "<i2">)
('lat', <HDF5 dataset "lat": shape (364,), type "<f4">)
('lon', <HDF5 dataset "lon": shape (232,), type "<f4">)
('precipitation', <HDF5 dataset "precipitation": shape (1, 232, 364), type "<f4">)
('precipitationQualityIndex', <HDF5 dataset "precipitationQualityIndex": shape (1, 232, 364), type "<f4">)
('probabilityLiquidPrecipitation', <HDF5 dataset "probabilityLiquidPrecipitation": shape (1, 232, 364), type "<i2">)
('randomError', <HDF5 dataset "randomError": shape (1, 232, 364), type "<f4">)
('time', <HDF5 dataset "time": shape (1,), type "<f8">)


As we can see, the file includes latitude and longitude values as `lat` and `lon`, respectively, and the precipitation values are available in `precipitation`. We'll use only these three datasets.

## Define arrays and append data

We'll create five lists, one per column of the final table:
- **year:** The year of precipitation data.
- **month:** The month of precipitation data.
- **latitude:** The latitude value.
- **longitude:** The longitude value.
- **precipitation:** The actual value of precipitation for that month.
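
To make the target layout concrete, here is a minimal sketch of how one month's grid turns into rows with these five fields. The tiny 2 × 3 grid and its values are made up for illustration; the real files hold a much larger grid.

```python
# Toy stand-ins for one monthly file (values are illustrative, not real data).
lons = [-75.35, -75.25]              # 2 longitudes
lats = [-56.55, -56.45, -56.35]      # 3 latitudes
grid = [[0.1, 0.2, 0.3],             # grid[i][j] -> value at (lons[i], lats[j])
        [0.4, 0.5, 0.6]]

rows = []
for i, lon in enumerate(lons):           # outer loop: longitude
    for j, lat in enumerate(lats):       # inner loop: latitude
        rows.append(("2016", "01", lat, lon, grid[i][j]))

# Each row is (year, month, latitude, longitude, precipitation).
for row in rows[:3]:
    print(row)
```

Each grid cell becomes one row, so a grid with 2 longitudes and 3 latitudes yields 6 rows for that month.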

In [16]:
years = ["2016", "2017"]
months = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]
year = []
month = []
latitude = []
longitude = []
precipitation = []

In [17]:
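for yr in years:
    for mnth in months:
        # Each monthly file is named after the first day of the month,
        # followed by the month number.
        path = ("raw_data/3B-MO.MS.MRG.3IMERG." + yr + mnth +
                "01-S000000-E235959." + mnth + ".V06B.HDF5.SUB.hdf5")

        # Open read-only; the context manager closes the file once the
        # arrays have been loaded into memory.
        with h5py.File(path, "r") as file:
            lats = file["lat"][:]
            lons = file["lon"][:]
            # Drop the leading time axis of length 1: (1, lon, lat) -> (lon, lat).
            grid = file["precipitation"][0]

        # Append one row per (longitude, latitude) grid cell.
        for i, lon in enumerate(lons):
            for j, lat in enumerate(lats):
                year.append(yr)
                month.append(mnth)
                latitude.append(lat)
                longitude.append(lon)
                precipitation.append(grid[i][j])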
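
The file name pattern used in the loop above can also be isolated in a small helper, which makes it easy to check a path before opening anything. This is only a sketch: `monthly_path` is a hypothetical name, and the pattern is inferred from the single file name shown earlier, so it assumes every month follows the same convention.

```python
from pathlib import Path

def monthly_path(year, month, folder="raw_data"):
    """Build the path of one monthly file (hypothetical helper).

    Assumes the naming pattern of the January 2016 file holds for all months.
    """
    mm = f"{int(month):02d}"   # zero-padded month, e.g. 1 -> "01"
    name = (f"3B-MO.MS.MRG.3IMERG.{year}{mm}01-S000000-E235959."
            f"{mm}.V06B.HDF5.SUB.hdf5")
    return Path(folder) / name

print(monthly_path(2016, 1).name)
```

Inside the loop, `str(monthly_path(yr, mnth))` could replace the string concatenation, and `monthly_path(yr, mnth).exists()` gives a simple way to detect a missing month before reading.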

## Create dataset and save

Now that we've compiled all our data into separate lists, we can combine them into a data frame and save it as a CSV file in the `data` folder.

In [20]:
df = pd.DataFrame({"year": year, 
                   "month": month, 
                   "latitude": latitude, 
                   "longitude": longitude, 
                   "precipitation": precipitation})

In [21]:
df.head(5)

| | year | month | latitude | longitude | precipitation |
|---|---|---|---|---|---|
| 0 | 2016 | 1 | -56.549999 | -75.349998 | 0.255904 |
| 1 | 2016 | 1 | -56.450001 | -75.349998 | 0.237106 |
| 2 | 2016 | 1 | -56.349998 | -75.349998 | 0.23077 |
| 3 | 2016 | 1 | -56.25 | -75.349998 | 0.224248 |
| 4 | 2016 | 1 | -56.150002 | -75.349998 | 0.234647 |


In [25]:
df.to_csv("data/combined_data.csv", index=False)
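
As a quick sanity check, the number of rows in the combined table can be worked out ahead of time. The sketch below assumes that every monthly file has the same 232 × 364 grid as the January 2016 file explored earlier; that is an assumption, not something verified in this notebook.

```python
n_years, n_months = 2, 12
n_lon, n_lat = 232, 364   # grid size of the first file (assumed for all files)

rows_per_file = n_lon * n_lat
expected_rows = n_years * n_months * rows_per_file

print(rows_per_file)   # 84448
print(expected_rows)   # 2026752
```

Comparing `len(df)` with `expected_rows` would flag a missing month or a file with a different grid shape.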