# Overview

In this lesson, we will learn how to handle spatial data in Python using [GeoPandas](https://geopandas.org/en/stable/). GeoPandas combines the power of pandas for analysing tabular data with Shapely for handling geometries.

We will learn how to read and write spatial data from and to files, how to manipulate geometries, and how to transform data between different coordinate reference systems (CRS).
 
## Learning Goals

After this week, you should be able to:
- Read and write spatial data from and to common file formats,
- Filter and re-group data by spatial and non-spatial characteristics, and
- Manage and transform a data set’s coordinate reference system.

## Key Concepts

### Check Your Understanding

Before diving into this week’s Python lesson, you should already be familiar with some basic spatial data file formats and projection definitions, such as these:
- Shapefile
- GeoPackage
- CRS
- Datum
- EPSG

### Definitions

**Shapefile**: a vector data format for storing location information and related attributes. A shapefile consists of several files with a common prefix that need to be stored in the same directory. Files with the extensions `.shp`, `.shx`, and `.dbf` are required. Other files are optional, although the `.prj` file, which stores the coordinate reference system, is often essential in practice. More information about shapefile file extensions can be found [here](https://www.esri.com/en-us/home). The shapefile format was developed by ESRI.
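
Because a shapefile is spread over several files, it can be useful to check that all parts are present before reading it. The following is a minimal sketch using only the standard library (the `pathlib` module is introduced later in this lesson). The helper name `missing_shapefile_parts` and the file name `example.shp` are hypothetical, and the sketch assumes lower-case file extensions.

```python
import pathlib
import tempfile

# Sidecar extensions from the definition above: the first two must sit next
# to the .shp file, .prj is optional but usually needed to know the CRS.
REQUIRED_SIDECARS = (".shx", ".dbf")
RECOMMENDED_SIDECARS = (".prj",)


def missing_shapefile_parts(shp_path):
    """Return (missing required, missing recommended) sidecar extensions."""
    shp_path = pathlib.Path(shp_path)
    missing_required = [
        ext for ext in REQUIRED_SIDECARS
        if not shp_path.with_suffix(ext).exists()
    ]
    missing_recommended = [
        ext for ext in RECOMMENDED_SIDECARS
        if not shp_path.with_suffix(ext).exists()
    ]
    return missing_required, missing_recommended


# Demonstration with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    shp = pathlib.Path(tmp, "example.shp")
    for ext in (".shp", ".shx"):
        shp.with_suffix(ext).touch()
    demo_required, demo_recommended = missing_shapefile_parts(shp)

print("Missing required parts:", demo_required)
print("Missing recommended parts:", demo_recommended)
```

In the demonstration only `example.shp` and `example.shx` are created, so the helper reports the `.dbf` and `.prj` files as missing.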

**GeoPackage**: an open format for storing and transferring geospatial information. GeoPackages are able to store both vector data and raster data. In more detail, a GeoPackage is an SQLite database container stored in a single file with a `.gpkg` extension (all in one file!). The GeoPackage format is governed by the Open Geospatial Consortium (OGC). More information at: [https://www.geopackage.org/](https://www.geopackage.org/)

**CRS**: Coordinate reference systems define how coordinates relate to real locations on the Earth. Geographic coordinate reference systems commonly express locations as latitude and longitude in degrees. Projected coordinate reference systems use x and y coordinates to represent locations on a flat surface. You will learn more about coordinate reference systems during this lesson!

**Datum**: defines the center point, orientation, and scale of the reference surface related to a coordinate reference system. The same coordinates can refer to different locations depending on the datum! For example, WGS84 is a widely used global datum, and ETRS89 is a datum used in Europe. Coordinate reference systems are often named after the datum they use.

**EPSG**: EPSG codes refer to specific coordinate reference systems. EPSG stands for “European Petroleum Survey Group”, the organisation that originally published a database of spatial reference systems. For example, EPSG:3006 refers to SWEREF 99 TM, a projected coordinate reference system that is part of SWEREF 99. SWEREF 99 is the Swedish national reference system and is based on ETRS89 (European Terrestrial Reference System 1989). EPSG:4326 refers to WGS84. You can search for EPSG codes at: [https://spatialreference.org/](https://spatialreference.org/)
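
To make the role of EPSG codes more concrete, the following sketch builds a tiny data set in memory, declares it to be in WGS84 (EPSG:4326) and reprojects it to SWEREF 99 TM (EPSG:3006). The point coordinates and the column name `name` are made up for illustration. The sketch assumes that `geopandas` and `shapely` are installed.

```python
import geopandas
from shapely.geometry import Point

# One made-up point in geographic coordinates (longitude, latitude in degrees)
points_wgs84 = geopandas.GeoDataFrame(
    {"name": ["example point"]},
    geometry=[Point(18.07, 59.33)],
    crs="EPSG:4326",
)

# Reproject to SWEREF 99 TM: the coordinates are now expressed in metres
points_sweref = points_wgs84.to_crs("EPSG:3006")

print(points_wgs84.crs.to_epsg())
print(points_sweref.crs.to_epsg())
print(points_sweref.geometry.iloc[0])
```

The geometry values change from degrees to metres, while the location on the Earth that they describe stays the same.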

## Managing File Paths

When working with data, it is important to keep track of where input files are stored and where output files should be written. This is especially important when moving between computers or between virtual machines.

In the past, file paths were often hard-coded as strings (text values). If, for instance, an output file name had to be derived from an input file name, all kinds of slicing and other string manipulation methods were used. Later, Python's `os.path` module became popular. It makes it possible to split a path into directories, and file names into base names and file extensions. However, manipulating file paths still required knowledge about the computer a script would ultimately run on. For instance, on all Unix-based operating systems, such as Linux or macOS, directories are separated by forward slashes (`/`), while Microsoft Windows uses backslashes (`\`). This particular problem can be worked around with `os.sep` and `os.path.join`, but not in a very convenient way.
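
As a point of comparison for the rest of this section, here is a short sketch of the `os.path` approach described above: an output file name is derived from an input file name. The file names `roads.shp` and `roads_reprojected.gpkg` are hypothetical.

```python
import os.path

# Hypothetical input file inside a "data" directory; os.path.join inserts
# the separator that is correct for the current operating system
input_path = os.path.join("data", "roads.shp")

# Split the path into directory and file name, then into base name and extension
directory, file_name = os.path.split(input_path)
base_name, extension = os.path.splitext(file_name)

# Build the output path from the pieces
output_path = os.path.join(directory, base_name + "_reprojected.gpkg")

print(file_name)
print(base_name, extension)
print(output_path)
```

This works on every operating system, but it takes several function calls and intermediate variables for a simple task.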

Since Python 3.4, the standard library has included a module that eases much of the hassle of managing file paths: `pathlib`. It provides an abstraction layer on top of the actual operating system file paths that is consistent across computers. A `pathlib.Path()` object can be initialised with a file path (as a `str`). When created without an argument, it refers to the current working directory, which for a notebook is usually the directory that contains the notebook file.

When importing spatial data sets in Python, it is recommended to define the path to the data directory explicitly. More information on how to define and manage paths in Python can be found [here](https://www.pythoncheatsheet.org/cheatsheet/file-directory-path).

In [None]:
import pathlib
pathlib.Path()

So far, this path is not checked against the actual directory structure, but we can `resolve()` it to convert it into an absolute path:

In [None]:
path = pathlib.Path()
path = path.resolve()
path
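
The difference between the two forms can be inspected directly. The sketch below relies only on the standard library; the variable names are chosen for illustration.

```python
import pathlib

# Without arguments, the path is the relative path "." (the current working directory)
relative_path = pathlib.Path()

# resolve() turns it into an absolute path
absolute_path = relative_path.resolve()

print(str(relative_path))
print(relative_path.is_absolute())
print(absolute_path.is_absolute())

# The resolved path is the same as the current working directory
print(absolute_path == pathlib.Path.cwd().resolve())
```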

This path object now has a number of properties and methods. For instance, we can test whether the path exists in the file system, or whether it is a directory:

In [None]:
path.exists()

In [None]:
path.is_dir()
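
The following sketch shows how these checks behave for a directory, an existing file, and a file that does not exist. It uses a temporary directory so that it does not depend on any particular folder layout; the file names are hypothetical.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    directory = pathlib.Path(tmp)

    # Create one small file and define a second path that is never created
    existing_file = pathlib.Path(tmp, "notes.txt")
    existing_file.write_text("hello")
    missing_file = pathlib.Path(tmp, "does_not_exist.txt")

    checks = {
        "directory is a directory": directory.is_dir(),
        "file exists": existing_file.exists(),
        "file is a file": existing_file.is_file(),
        "file is a directory": existing_file.is_dir(),
        "missing file exists": missing_file.exists(),
    }

for description, result in checks.items():
    print(f"{description}: {result}")
```

Note that a `Path` object can describe a location that does not exist (yet), which is what makes it suitable for defining output files before they are written.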

Finally, to traverse within this path, you don’t have to think about whether you are running the script on Windows or Linux, and you definitely don’t have to use string manipulation. To refer to a directory inside `path`, use the `/` (division) operator to append another path component (which can be a string). For instance, to refer to a folder `data` within the same directory as this notebook, write the following:

In [None]:
data_directory = path / "data"
data_directory

In [None]:
# `parent` goes the other way: it is the directory that contains `path`
path.parent
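
The sketch below illustrates two points from this section. First, the same `/` expression yields the correct separator for each operating system; `PurePosixPath` and `PureWindowsPath` only manipulate text and never touch the file system, so both can be tried on any computer. Second, path properties replace string slicing when an output file name is derived from an input file name. All file and directory names are hypothetical.

```python
import pathlib

# The same expression, rendered for Unix-like systems and for Windows
posix_path = pathlib.PurePosixPath("project") / "data" / "roads.shp"
windows_path = pathlib.PureWindowsPath("project") / "data" / "roads.shp"
print(posix_path)
print(windows_path)

# Properties of a path object replace manual string slicing
input_file = pathlib.Path("data") / "roads.shp"
print(input_file.parent)  # the containing directory
print(input_file.name)    # file name including the extension
print(input_file.stem)    # file name without the extension
print(input_file.suffix)  # the extension, including the dot

# Derive an output file name in the same directory
output_file = input_file.with_name(input_file.stem + "_reprojected.gpkg")
print(output_file)
```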

`Path()` objects can be used (almost) anywhere a file path is expected as a variable of type `str`, because functions that accept file paths typically convert them to a suitable type automatically.
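
As a small illustration, the sketch below passes a `Path` object to the built-in `open()` function and to a function from `os.path`, and then converts it to a string explicitly for the rare cases in which a library insists on `str`. The file name `example.txt` is hypothetical.

```python
import os
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    text_file = pathlib.Path(tmp) / "example.txt"

    # The built-in open() accepts the Path object directly
    with open(text_file, "w", encoding="utf-8") as f:
        f.write("hello")

    # Functions from os.path accept it as well
    file_size = os.path.getsize(text_file)

    # Explicit conversion, in case a function only accepts strings
    as_string = str(text_file)
    as_fspath = os.fspath(text_file)

print(file_size)
print(type(as_string).__name__)
print(as_string == as_fspath)
```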

In data science projects, it is a good habit to define a constant at the beginning of each notebook that points to the data directory, or multiple constants to point to, for instance, input and output directories. In today’s exercises, we use different sample data sets from files stored in the same data directory. At the top of the notebooks, we thus define a constant `DATA_DIRECTORY` that we can later use to find the sample data set files:

In [None]:
# location (directory) of the notebook
import pathlib
NOTEBOOK_PATH = pathlib.Path().resolve()
DATA_DIRECTORY = NOTEBOOK_PATH / "data"

In [None]:
print(DATA_DIRECTORY)

In [None]:
# this can then be used, for instance, in `geopandas.read_file()` (see next section):
import geopandas
data_set = geopandas.read_file(DATA_DIRECTORY / "UGA_adm1_2011.shp")
data_set.plot()

The same constants can also be defined with the help of the `os` module:

In [None]:
import os
import pathlib

# Get the current working directory (for a notebook, this is usually the
# directory that contains the notebook file)
WORKING_DIRECTORY = os.getcwd()

# Convert it into an absolute `pathlib.Path` object
NOTEBOOK_PATH = pathlib.Path(WORKING_DIRECTORY).resolve()

# Define the data directory as a subdirectory of the notebook directory
DATA_DIRECTORY = NOTEBOOK_PATH / "data"

In [None]:
# this can then be used, for instance, in `geopandas.read_file()` (see next section):
import geopandas
data_set = geopandas.read_file(DATA_DIRECTORY / "UGA_adm1_2011.shp")
data_set.plot()

In the examples above, we used a path that we `resolve()`d earlier on. Resolving turns the path into an absolute one, which further improves compatibility and consistency across operating systems and local installations.

Especially when starting from the current working directory (as in `pathlib.Path()` without arguments), we recommend resolving the path before traversing into any other directory.
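
The habit of defining directory constants can be extended to separate input and output directories, as mentioned earlier in this section. The following is a minimal sketch under stated assumptions: the directory layout (`data/input`, `data/output`) and the file names are hypothetical, and a temporary directory stands in for the notebook directory so that the example does not create folders in your project.

```python
import pathlib
import tempfile

# In a real notebook, NOTEBOOK_PATH would be pathlib.Path().resolve()
with tempfile.TemporaryDirectory() as tmp:
    NOTEBOOK_PATH = pathlib.Path(tmp).resolve()
    INPUT_DIRECTORY = NOTEBOOK_PATH / "data" / "input"
    OUTPUT_DIRECTORY = NOTEBOOK_PATH / "data" / "output"

    # Create the directories, including missing parents; do nothing if they exist
    INPUT_DIRECTORY.mkdir(parents=True, exist_ok=True)
    OUTPUT_DIRECTORY.mkdir(parents=True, exist_ok=True)

    # Empty placeholder files that stand in for real data sets
    for file_name in ("a.shp", "b.shp", "readme.txt"):
        (INPUT_DIRECTORY / file_name).touch()

    # List all shapefiles in the input directory
    shapefile_names = sorted(p.name for p in INPUT_DIRECTORY.glob("*.shp"))

    # Plan one GeoPackage output path per input shapefile
    output_paths = [
        (OUTPUT_DIRECTORY / name).with_suffix(".gpkg") for name in shapefile_names
    ]
    output_names = [p.name for p in output_paths]

print(shapefile_names)
print(output_names)
```

Keeping input and output locations in separate constants makes it harder to overwrite source data by accident.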


## Sources

This lesson is inspired by the [Programming in Python lessons](http://swcarpentry.github.io/python-novice-inflammation/) from the [Software Carpentry organization](http://software-carpentry.org) and has adapted or reused material from the University of Helsinki [Automating GIS processes course](https://autogis-site.readthedocs.io/en/latest/course-info/license.html) under a [Creative Commons Attribution-ShareAlike 4.0 International licence](https://creativecommons.org/licenses/by-sa/4.0/deed.en).