# Clean precipitation data

In this notebook, I'll use a spatial join to extract only the data points that lie inside Argentina and save them.

## Import libraries

I'll use the `shapely` and `geopandas` libraries to work with the spatial data.

In [1]:
import pandas as pd

import geopandas as gpd
from geopandas.tools import sjoin
from shapely.geometry import Point

## Load data

I'll load the precipitation data, which was previously converted from HDF5 to CSV, along with the shapefile for Argentina.

In [2]:
dataset = pd.read_csv("data/combined_data.csv")
argentina = gpd.read_file("shapefiles/country/ARG_adm0.shp")
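
Before filtering, it can help to confirm that the CSV has the columns the later cells rely on. The sketch below is only an illustration: `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, and the one-row `sample` frame is made up so the block runs without the real file.

```python
import pandas as pd

# Columns that the filtering cells in this notebook read from the dataset.
REQUIRED_COLUMNS = {"year", "month", "latitude", "longitude", "precipitation"}


def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df, sorted."""
    return sorted(set(required) - set(df.columns))


# Made-up sample that deliberately lacks the precipitation column.
sample = pd.DataFrame(
    {"year": [2016], "month": [3], "latitude": [-34.6], "longitude": [-58.4]}
)
missing = missing_columns(sample)
print(missing)
```

On the real data, `missing_columns(dataset)` should return an empty list. For the shapefile, `argentina.crs` reports the coordinate reference system, which should match the EPSG:4326 assigned to the points later.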

## Retrieve correct points

Next, I'll select only the data points that lie within Argentina.
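
To make the idea concrete, here is a minimal sketch of an inner spatial join on toy data. The square region and the three points are made up for illustration, and the `predicate` keyword assumes geopandas 0.10 or later (older releases call it `op`).

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon

# Toy region: a 2x2 degree square standing in for the country polygon.
region = gpd.GeoDataFrame(
    {"name": ["toy_region"]},
    geometry=[Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])],
    crs="EPSG:4326",
)

# Three made-up points: two inside the square, one outside.
toy = pd.DataFrame(
    {
        "longitude": [1.0, 1.5, 3.0],
        "latitude": [1.0, 0.5, 3.0],
        "precipitation": [10.0, 20.0, 30.0],
    }
)
toy_points = gpd.GeoDataFrame(
    toy,
    geometry=gpd.points_from_xy(toy["longitude"], toy["latitude"]),
    crs="EPSG:4326",
)

# An inner join keeps only the points that intersect the region.
inside = gpd.sjoin(toy_points, region, how="inner", predicate="intersects")
print(inside[["longitude", "latitude", "precipitation"]])
```

The point at (3.0, 3.0) falls outside the square, so the inner join drops it. The cells below apply the same operation with Argentina's boundary as the region.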

### March 2016

In [3]:
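# Keep only the rows for March 2016
temp_df = dataset[(dataset["year"] == 2016) & (dataset["month"] == 3)].reset_index(drop = True)

# Turn each (longitude, latitude) pair into a point in WGS 84 (EPSG:4326)
geometry = [Point(xy) for xy in zip(temp_df["longitude"], temp_df["latitude"])]
points = gpd.GeoDataFrame(temp_df, crs = "EPSG:4326", geometry = geometry)

# Inner spatial join: keep only the points that intersect Argentina
# (`predicate` replaces the older `op` keyword in geopandas 0.10 and later)
final_df = sjoin(points, argentina, how = 'inner', predicate = 'intersects')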
final_df = final_df[["year", "month", "latitude", "longitude", "precipitation", "geometry"]].reset_index(drop = True)
final_df.to_csv("data/precipitation_3_2016.csv", index = False)

### March 2017

In [4]:
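# Keep only the rows for March 2017
temp_df = dataset[(dataset["year"] == 2017) & (dataset["month"] == 3)].reset_index(drop = True)

# Turn each (longitude, latitude) pair into a point in WGS 84 (EPSG:4326)
geometry = [Point(xy) for xy in zip(temp_df["longitude"], temp_df["latitude"])]
points = gpd.GeoDataFrame(temp_df, crs = "EPSG:4326", geometry = geometry)

# Inner spatial join: keep only the points that intersect Argentina
# (`predicate` replaces the older `op` keyword in geopandas 0.10 and later)
final_df = sjoin(points, argentina, how = 'inner', predicate = 'intersects')
final_df = final_df[["year", "month", "latitude", "longitude", "precipitation", "geometry"]].reset_index(drop = True)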
final_df.to_csv("data/precipitation_3_2017.csv", index = False)
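
The two cells above differ only in the year, so the steps could be wrapped in one function. The sketch below is one possible refactoring, not part of the original notebook: `points_in_region` and `COLUMNS` are hypothetical names, the toy data at the end is made up so the block runs on its own, and `predicate` assumes geopandas 0.10 or later.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Columns kept in the saved files, as in the cells above.
COLUMNS = ["year", "month", "latitude", "longitude", "precipitation", "geometry"]


def points_in_region(data, region, year, month):
    """Return the rows of data for one year/month that fall inside region."""
    mask = (data["year"] == year) & (data["month"] == month)
    subset = data[mask].reset_index(drop=True)
    points = gpd.GeoDataFrame(
        subset,
        geometry=gpd.points_from_xy(subset["longitude"], subset["latitude"]),
        crs="EPSG:4326",
    )
    joined = gpd.sjoin(points, region, how="inner", predicate="intersects")
    return joined[COLUMNS].reset_index(drop=True)


# Toy check: one March 2016 point lies inside the 2x2 square, one outside.
toy_region = gpd.GeoDataFrame(geometry=[box(0, 0, 2, 2)], crs="EPSG:4326")
toy_data = pd.DataFrame(
    {
        "year": [2016, 2016, 2017],
        "month": [3, 3, 3],
        "latitude": [1.0, 5.0, 1.0],
        "longitude": [1.0, 5.0, 1.0],
        "precipitation": [4.2, 0.0, 7.1],
    }
)
demo_result = points_in_region(toy_data, toy_region, 2016, 3)

# In this notebook the call would look like:
# for year in (2016, 2017):
#     result = points_in_region(dataset, argentina, year, 3)
#     result.to_csv(f"data/precipitation_3_{year}.csv", index=False)
```

Passing `index=False` in every `to_csv` call keeps the saved files consistent, with no extra index column in either of them.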