# Part 2 - Vector Data

In this second notebook we will focus on vector data.

This notebook was designed for a session as part of the UKCEH Summer School. It does not cover all aspects of vector data use by the Python scientific communities. Additional resources can be found throughout the notebook.

## Contents

- Shapefiles
    - Plotting and exploring shapefiles using Geopandas
    - Using Cartopy to enhance shapefile plotting
    - Interactive maps via Folium and Geopandas 'explore' method
- GeoJSON data
- Textual vector data

In [None]:
%%capture
!pip install cartopy s3fs mapclassify

In [None]:
import numpy as np
import pandas as pd
import geopandas as gpd
import cartopy as cp
import xarray as xr
from dask.diagnostics import ProgressBar
import matplotlib.pyplot as plt
import folium
import s3fs

## Shapefiles

Shapefiles are files that contain vectors (shapes) geo-referenced to a particular coordinate system.

They are often used in hydrology to define catchment areas, river channels and hydrological administrative regions. 

Each 'shape' in a shapefile will typically have a set of attributes that tell you more information about the shape, such as the name of the river or catchment it defines. 

The easiest way to see what's in a shapefile in Python is to use a package called [Geopandas](https://geopandas.org/en/stable/). It is essentially an extension of the [Pandas](https://pandas.pydata.org/) package, which is used to work with tabular data and which we saw in Part 1.

**Sidenote:** Shapefiles often confusingly come with several ancillary files. The main shapefile will have a **'.shp'** ending, with ancillaries ending with some or all of **'.shx', '.sbx', '.sbn', '.dbf', '.cpg', '.prj'**, which provide additional information about the vectors/shapes contained in the **'.shp'** file. Most shapefile packages in Python and elsewhere will read in the information they need from these ancillary files automatically if you provide the path to the **'.shp'** file, and therefore most of the time you can ignore them! If reading the shapefiles from an S3 object storage filesystem, like we are doing below, the ancillary files need to be zipped up together with the **'.shp'** file and the zip archive read in for the ancillaries to be automatically loaded. 

In [None]:
# Set up the S3 (object storage) filesystem object
s3 = s3fs.S3FileSystem(anon=True, endpoint_url="https://fdri-o.s3-ext.jc.rl.ac.uk")

In [None]:
# Read from the filesystem - note the zip as explained above
shapefile = gpd.read_file(s3.open('s3://example-data/gb_catchments.zip'))

In [None]:
shapefile

The tabular representation of the shapefile shows each shape in the shapefile as a separate row. Each column shows an attribute of the shapes, with the actual vector geometry stored in the final column.

We can select out one of the rows and plot it to see what it looks like, much like we could do with Pandas:

In [None]:
shapefile.iloc[[0],:].plot()

Hmm, that's not wonderfully helpful. Let's see if we can find out some more information about it. 

**Sidenote:** Note the extra '[]' around the 0 in the commands above and below, which isn't usually necessary. Here they are needed to ensure that the output of the command remains a table (a Geopandas 'GeoDataFrame') rather than a plain Pandas 'Series' holding a single row. A single row returned as a Series loses the geographical capabilities of Geopandas, such as map plotting, so to keep them we need to retain the table! 
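The difference is easy to check on a small made-up table. This is only a sketch: the `toy` table and its values are invented, but the two ways of selecting a row are the same as those used on the shapefile.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# A made-up two-row table with point geometries
toy = gpd.GeoDataFrame({"ID": [1, 2]}, geometry=[Point(0, 0), Point(1, 1)])

row_as_series = toy.iloc[0, :]   # single brackets: one row comes back as a Series
row_as_table = toy.iloc[[0], :]  # double brackets: a one-row table comes back

print(type(row_as_series).__name__)
print(type(row_as_table).__name__)
```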

In [None]:
shapefile.iloc[[0],:]

The attributes aren't wonderfully helpful either. All we've got to go on are some numeric identifiers.

Sometimes the 'attrs' command below can produce some more information:

In [None]:
shapefile.attrs

But not this time. If this shapefile was following [FAIR](https://www.go-fair.org/fair-principles/)\* guidelines it should have some more useful metadata!

\*Findable, Accessible, Interoperable, Reusable

Fortunately for us, I know what this dataset is, and it's a collection of all the catchments in the UK. 

Perhaps we can get an impression of this if we plot the entire file, i.e. all the catchments it contains:

In [None]:
shapefile.plot()

We can now see that it looks like a map of the UK, but the overlapping nature of the catchments makes it hard to make out much detail. We can fix that!

In [None]:
shapefile.plot(facecolor='None')

We've now made the shapes transparent and just drawn their borders instead, but it's still hard to make out any detail because there are too many overlapping catchments! Let's focus on an individual catchment instead.

I happen to know that the [Thames](https://nrfa.ceh.ac.uk/data/station/spatial/39001) has catchment ID 39001. Given the catchment ID is one of the attributes listed in the table, we can use it to select this specific catchment from the table.

In [None]:
thames = shapefile.loc[shapefile['ID_STRING'] == '39001']

In [None]:
thames

Let's plot it to check if it looks the right shape:

In [None]:
thames.plot()

I also happen to know that all the sub-catchments of the Thames basin will begin with '39', followed by a three-digit 0-padded number. Let's select all those out:

In [None]:
all_thames = shapefile.loc[lambda ds: ds['ID'] > 39000].loc[lambda ds: ds['ID'] < 40000]

**Note** how we are now using the 'ID' attribute instead of the 'ID_STRING' to allow numerical comparisons.
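As a rough illustration of the two kinds of selection, the sketch below uses a small made-up Pandas table with the same two column names. The ID values are invented, and a GeoDataFrame can be filtered in exactly the same way. `between` is a Pandas shorthand for the pair of comparisons used above (it includes both end values), and `str.startswith` is the text-based equivalent, which relies on every ID having the same five-character layout.

```python
import pandas as pd

# Made-up identifiers, stored both as numbers and as text
toy = pd.DataFrame(
    {
        "ID": [38001, 39001, 39105, 40003],
        "ID_STRING": ["38001", "39001", "39105", "40003"],
    }
)

# Numeric comparison on the 'ID' column
by_number = toy.loc[toy["ID"].between(39001, 39999)]

# Text comparison on the 'ID_STRING' column
by_prefix = toy.loc[toy["ID_STRING"].str.startswith("39")]

print(by_number["ID"].tolist())
print(by_prefix["ID"].tolist())
```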

In [None]:
all_thames

In [None]:
all_thames.plot()

Hmm that doesn't look any different.

That's because we need to set facecolor to 'None' again so that the shapes are not filled with colour:

In [None]:
all_thames.plot(facecolor='None')

Now we're starting to get a better picture of what the Thames basin looks like!

In [None]:
shapefile

Notice that for some reason the attribute 'SHAPE_AREA' is 0.0 for all catchments. That's annoying, as this is a genuinely useful catchment property we might want to use.

Fortunately, Geopandas is able to calculate the area for us, based on the geometries in the 'geometry' column. 

In [None]:
shapefile.area

Let's assign these area values to a new column:

In [None]:
shapefile['catchment_areas'] = shapefile.area

In [None]:
shapefile
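One thing to keep in mind is that the areas come out in the squared units of whatever coordinate system the shapes are defined on. The sketch below uses two made-up rectangles and assumes coordinates in metres (as on the British National Grid), so the areas are in square metres and need dividing by a million to get square kilometres. The `area_km2` column name is just an illustrative choice.

```python
import geopandas as gpd
from shapely.geometry import box

# Two made-up rectangles with coordinates in metres (EPSG:27700)
toy = gpd.GeoDataFrame(
    {"name": ["a", "b"]},
    geometry=[box(0, 0, 1000, 2000), box(0, 0, 500, 500)],
    crs="EPSG:27700",
)

# .area is in square metres here, so convert to square kilometres
toy["area_km2"] = toy.area / 1e6
print(toy[["name", "area_km2"]])
```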

Now that we have this information, let's do something with it. 

Perhaps we're interested in finding the largest and smallest catchments within the Thames basin.

In [None]:
all_thames = shapefile.loc[lambda df: df['ID'] > 39000].loc[lambda df: df['ID'] < 40000]

In [None]:
all_thames_sorted = all_thames.sort_values('catchment_areas')

In [None]:
all_thames_sorted

Looks like the smallest is at the top, largest is at the bottom.

In [None]:
smallest_thames = all_thames_sorted.iloc[[0],:]

**Note:** We want the second largest in this table, as the largest is the whole Thames basin itself, and we want the largest *within* this.

In [None]:
largest_thames = all_thames_sorted.iloc[[-2],:]
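The sort-then-pick pattern can be tried out on a small made-up table first. The IDs and areas below are invented purely for illustration. Positive positions in `iloc` count from the top of the sorted table and negative positions count from the bottom, so `[-2]` is the second-to-last row.

```python
import pandas as pd

# Made-up catchment areas; the first row plays the role of the whole basin
toy = pd.DataFrame(
    {
        "ID": [39001, 39002, 39003, 39004],
        "catchment_areas": [100.0, 5.0, 40.0, 20.0],
    }
)

ordered = toy.sort_values("catchment_areas")  # ascending by default

smallest = ordered.iloc[[0], :]         # first row after sorting
second_largest = ordered.iloc[[-2], :]  # second-to-last row after sorting

print(smallest["ID"].tolist(), second_largest["ID"].tolist())
```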

Now let's see them on the map:

In [None]:
all_thames_sorted.plot(facecolor='None')
current_axes = plt.gca() # we want all of these commands to plot on the same set of axes, this retrieves a 'handle' to the axes
largest_thames.plot(facecolor='green', ax=current_axes, zorder=0) # note we specify the axes 
smallest_thames.plot(facecolor='blue', ax=current_axes)

Can you spot the blue catchment? What do you think the 'zorder' parameter is doing?

Now let's spruce up our plots a bit. We can use the [Cartopy](https://scitools.org.uk/cartopy/docs/latest/) package to produce good plots.

In [None]:
gbax = plt.axes(projection=cp.crs.OSGB())
gbax.set_global()
gbax.coastlines(resolution='10m')
thames.plot(ax=gbax, facecolor='green')
rivers = cp.feature.NaturalEarthFeature('physical', 'rivers_lake_centerlines', '10m', edgecolor='blue', facecolor='none', lw=0.5)
gbax.add_feature(rivers)

Now we can see where the Thames catchment sits in the UK!

There's a lot to unpack in the commands we used though, so let's go through that:

```gbax = plt.axes(projection=cp.crs.OSGB())```

Here we are creating a set of axes that we'll be using for the plot. The ```projection``` argument defines what map projection to use for the plotting. In this case we are using the Ordnance Survey's grid, which approximates the UK as a flat plane with x/y coordinates. 

**Note:** The OSGB projection matches the coordinate system that the shapefile is defined on. In the UK this is typical for hydrological data. Datasets covering other geographical areas are more likely to be on a lonlat grid, or sometimes on a cartesian grid such as UTM (like the OSGB grid, UTM is based on a transverse Mercator projection). For lonlat grids a good default for the projection argument would be ```projection=cp.crs.PlateCarree()```. Sometimes you can find out which coordinate system the shapefile is using from the crs attribute, e.g.: ```shapefile.crs```.
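To get a feel for the difference between the two kinds of coordinates, the sketch below takes a single made-up point given in lon/lat (roughly central London) and converts it to the OSGB grid with Geopandas' `to_crs` method. The point and its name are invented for illustration; 4326 and 27700 are the EPSG codes for lon/lat and the British National Grid respectively.

```python
import geopandas as gpd
from shapely.geometry import Point

# One made-up point in lon/lat (note the order: x = longitude, y = latitude)
pt = gpd.GeoDataFrame(
    {"name": ["example"]},
    geometry=[Point(-0.1276, 51.5072)],
    crs="EPSG:4326",
)

# Convert the coordinates to the OSGB grid (eastings/northings in metres)
pt_osgb = pt.to_crs(epsg=27700)

print(pt.crs.to_epsg(), pt_osgb.crs.to_epsg())
print(pt_osgb.geometry.x.iloc[0], pt_osgb.geometry.y.iloc[0])
```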

```gbax.set_global()```

This forces the axes to their maximum possible extent for the given projection.

What happens if we remove this?

```gbax.coastlines(resolution='10m')```

This adds coastlines to the axes.

```thames.plot(ax=gbax, facecolor='green')```

This is the same plotting function we've been using throughout, with the addition of specifying the axes on which we wish to plot, and the colour of the shape(s) we are plotting.

```rivers = cp.feature.NaturalEarthFeature('physical', 'rivers_lake_centerlines', '10m', edgecolor='blue', facecolor='none', lw=0.5)```

This one looks complicated but really it's just accessing a built-in shapefile that cartopy has access to. ```physical``` is the category, ```rivers_lake_centerlines``` is the name of the dataset, ```10m``` is the resolution we want to use, then ```edgecolor``` and ```facecolor``` are the same as in the plotting command and ```lw=0.5``` sets the line-width of the plotted shapes.

**Further info:** Cartopy is actually accessing the shapefile datasets from the [Natural Earth website](https://www.naturalearthdata.com/downloads/). See what other datasets you can make use of for free!

**Further info:** There is also a UK Rivers shapefile available on the object store at 's3://example-data/main_uk_river_1km.zip', see if you can plot it instead of the cartopy built-in.

In [None]:
rivers_shapefile = gpd.read_file(s3.open('s3://example-data/main_uk_river_1km.zip'))

In [None]:
gbax = plt.axes(projection=cp.crs.OSGB())
fig = plt.gcf()
fig.set_size_inches(18.5, 10.5, forward=True)
gbax.set_global()
gbax.coastlines(resolution='10m')
thames.plot(ax=gbax, facecolor='green')
rivers_shapefile.plot(ax=gbax, facecolor='None', edgecolor='blue', lw=0.5)

We can go one step further and easily plot the shapefile data on a zoomable interactive map:

In [None]:
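# Record that the coordinates are on the British National Grid (EPSG:27700)
thames = thames.set_crs(27700, allow_override=True)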

m = thames.explore(style_kwds={'color': 'black', 'fill': False})

In [None]:
m

To plot multiple shapefiles on one map, we have to first set up the base map manually. The ```explore()``` function uses a package called [folium](https://python-visualization.github.io/folium/latest/) under the hood, so we use this to set up our base map. To get the map to open with a specific location centred and at a specific zoom level, the ```location``` and ```zoom_start``` parameters can be specified. The ```location``` parameter is specified as [lat,lon] and picking the right zoom level is trial and error, but as a rough guide small numbers are the most zoomed out and larger numbers zoom further in. 

In [None]:
mapplot = folium.Map(location=[51.5, -1], zoom_start=8)
thames.explore(m=mapplot, style_kwds={'color': 'black', 'fill': False})
rivers_shapefile = rivers_shapefile.set_crs(27700, allow_override=True)
rivers_shapefile.explore(m=mapplot, style_kwds={'fill': False, 'color': 'blue', 'opacity': 0.5})

The ```explore()``` function is very customisable, see below for all the options. Have a play around and see what more you can do!

**Note:** The 'crs' (coordinate reference system) needs to be set in order to be plotted with the ```explore()``` function. This specifies the coordinate system the shapefile is using, so that the ```explore()``` function knows how to interpret the coordinates of the shapes in the shapefile and where to put them on a map.

In [None]:
thames.explore?
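On the note above about the 'crs': it is worth knowing that Geopandas separates *labelling* a dataset with its coordinate system from *converting* it to a different one. The sketch below uses one made-up point to show the difference. `set_crs` only records which system the existing numbers are in, whereas `to_crs` calculates new coordinates. The point itself is invented for illustration.

```python
import geopandas as gpd
from shapely.geometry import Point

# A made-up point with OSGB-style coordinates but no coordinate system recorded
unlabelled = gpd.GeoDataFrame(geometry=[Point(530000, 180000)])
print(unlabelled.crs)  # nothing recorded yet

# Label the existing coordinates as British National Grid; the numbers stay the same
labelled = unlabelled.set_crs(epsg=27700)

# Convert to lon/lat; the numbers change
lonlat = labelled.to_crs(epsg=4326)

print(labelled.geometry.x.iloc[0], labelled.geometry.y.iloc[0])
print(lonlat.geometry.x.iloc[0], lonlat.geometry.y.iloc[0])
```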

One cool thing you can do, to finish this section, is save this map to an HTML file which you can then open with any internet browser:

In [None]:
mapplot.save('thames_cat_map.html')

## Other vector data formats

You may come across [GeoJSON](https://geojson.org/) files as a popular alternative to shapefiles. Geopandas can work with these too, and the functionality is exactly the same as if you were working with shapefiles. For example:

In [None]:
scotland = gpd.read_file(s3.open("s3://example-data/scotland_boundaries.geojson"))

In [None]:
scotland.plot()
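Unlike shapefiles, GeoJSON is a single plain-text file, so it is easy to look inside. As a rough sketch, the example below converts a made-up one-point table to GeoJSON text, inspects it with Python's built-in `json` module and then rebuilds the table from it. The point and its name are invented for illustration.

```python
import json

import geopandas as gpd
from shapely.geometry import Point

# A made-up one-row table in lon/lat
toy = gpd.GeoDataFrame(
    {"name": ["a"]},
    geometry=[Point(-3.2, 55.95)],
    crs="EPSG:4326",
)

# Convert to GeoJSON text and look at its top-level structure
as_text = toy.to_json()
parsed = json.loads(as_text)
print(parsed["type"], len(parsed["features"]))

# Rebuild a table from the list of GeoJSON features
rebuilt = gpd.GeoDataFrame.from_features(parsed["features"], crs="EPSG:4326")
print(rebuilt)
```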

Sometimes you may find geographical information has been stored in text-based CSV files. Once again Geopandas can work with these in the same way as the other formats mentioned, though reading them in involves slightly different commands:

In [None]:
buoys = pd.read_csv(s3.open("s3://example-data/buoy_data.csv"))

In [None]:
buoys

Note that we have used *Pandas* instead of *Geo*pandas to read in this file as it is a text file, not a file containing vectors. The location information in this CSV file is stored in the latitude and longitude columns. *Geo*pandas can be told to make these into vector points:

In [None]:
buoys_geo = gpd.GeoDataFrame(buoys, geometry=gpd.points_from_xy(buoys.longitude, buoys.latitude),
                             crs="EPSG:4326")

Note we use the [4326 EPSG code](https://epsg.io/4326) for the coordinate reference system, as this is the standard one for longitude and latitude coordinate systems.

In [None]:
buoys_geo
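The same conversion can be tried without the object store by reading a few lines of made-up CSV text from memory. The buoy names and positions below are invented for illustration. The main thing to watch is the argument order of `points_from_xy`: x (longitude) comes first, then y (latitude).

```python
import io

import geopandas as gpd
import pandas as pd

# A few lines of made-up CSV text standing in for a file
csv_text = "name,latitude,longitude\nbuoy_a,50.0,-4.0\nbuoy_b,57.5,1.5\n"
table = pd.read_csv(io.StringIO(csv_text))

# Build point geometries: x = longitude, y = latitude
points = gpd.GeoDataFrame(
    table,
    geometry=gpd.points_from_xy(table["longitude"], table["latitude"]),
    crs="EPSG:4326",
)
print(points)
```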

In [None]:
gbax = plt.axes(projection=cp.crs.PlateCarree())
gbax.coastlines(resolution='10m')
buoys_geo.plot(ax=gbax, color='red')

**Note:** we are using the [PlateCarree projection](https://scitools.org.uk/cartopy/docs/v0.15/crs/projections.html#platecarree), which is a standard one to use for plotting data in lon/lat coordinates.

## Further Resources

- [Geopandas documentation](https://geopandas.org/en/stable/)
- [Wikipedia over-detailed description of Shapefiles](https://en.wikipedia.org/wiki/Shapefile)
- [geoJSON](https://geojson.org/)
- [EPSG Codes](https://epsg.org/home.html)
- [Cartopy projections](https://scitools.org.uk/cartopy/docs/v0.15/crs/projections.html)