# `GeoPandas`: Geospatial Tabular Data Analysis

We present the roadmap for the notebook:

1. **Introduction**
    - Briefly introduce `GeoPandas` and its role in geospatial analysis
    - Provide the installation instructions
2. **Geospatial Data Structures in GeoPandas**
    - Introduce `GeoSeries` and `GeoDataFrame`
    - Create `GeoSeries` and `GeoDataFrame` from scratch
    - Read and write geospatial data (e.g., `shapefiles`, `GeoJSON`)
3. **Explore and Visualize Geospatial Data**
    - Explore GeoDataFrames: `head`, `tail`, `info`, and `describe` methods
    - Coordinate Reference Systems (CRS) in GeoPandas
    - Basic geospatial data visualization using the `plot()` method
4. **Geometric Operations**
    - Geometry manipulation with GeoPandas and Shapely (e.g., buffer, centroid, area)
    - Spatial joins and overlays (e.g., intersection, union)
5. **Spatial Relationships and Predicates**
    - Point-in-polygon analysis with GeoPandas (e.g., sjoin)
    - Spatial relationships and predicates (e.g., contains, intersects, within)
6. **Coordinate Reference Systems and Transformations**
    - Understanding Coordinate Reference Systems (CRS)
    - Setting and transforming CRS in GeoPandas
7. **Practical Examples and Use Cases**
    - Real-world examples applying concepts and techniques from the notebook
8. **Additional Resources and Further Reading**
    - Links to GeoPandas documentation, tutorials, and other resources for learners to explore further

## Introduction

`GeoPandas` is a powerful Python library designed to make working with geospatial data in Python easier and more efficient. It extends the functionality of `Pandas`, a popular data analysis library, by introducing two new data structures: `GeoSeries` and `GeoDataFrame`. These data structures are built on top of `Shapely` geometries and can efficiently store and manipulate geospatial data.

`GeoPandas` combines the capabilities of `Shapely`, `Fiona`, and `Pyproj`, making it an essential tool for many geospatial analysis tasks, such as **reading and writing geospatial data**, performing **geometric operations**, and **visualizing geospatial data**. GeoPandas focuses on vector data (points, lines, and polygons); raster data is typically handled with other libraries, such as `rasterio`.

### Installation

To install GeoPandas, you can use either pip or conda. It's highly recommended to install GeoPandas and its dependencies within a virtual environment. Run one of the following commands in a terminal (inside a notebook cell, prefix the command with `!`):

```bash
# Installation using pip
pip install geopandas

# Installation using conda
conda install -c conda-forge geopandas
```

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from shapely.geometry import Point, Polygon, LineString
from shapely import unary_union
import geopandas as gpd
import contextily as ctx
ctx.set_cache_dir("/Users/akramz/.cache/")

## Geospatial Data Structures in GeoPandas

- `GeoPandas` provides two main data structures: 
    - `GeoSeries`: a one-dimensional array that can store and manipulate Shapely geometric objects. It is similar to a `Pandas` `Series` but designed specifically for handling geospatial data.
    - `GeoDataFrame`: a two-dimensional data structure that can store and manipulate tabular data with a **GeoSeries column** for storing geometries. It is similar to a Pandas DataFrame but with additional geospatial functionality.
- The two data structures are designed to handle geospatial data and are built on top of `Pandas` and `Shapely`.

### Creating `GeoSeries` and `GeoDataFrame` from Scratch

To create a GeoSeries or GeoDataFrame from scratch, you can use the following methods:

In [None]:
# Create a GeoSeries
geom_series = gpd.GeoSeries([Point(2.3, 48.8), Point(2.4, 48.9), Point(2.5, 48.7)])
geom_series

In [None]:
# Create a GeoDataFrame from a DataFrame and a GeoSeries
data = {
    'name': ['A', 'B', 'C'],
    'population': [1000, 2000, 3000]
}

# Use the dictionary to create a Pandas dataframe
df = pd.DataFrame(data)

# Use the pandas dataframe and the GeoSeries to create a GeoDataframe
gdf = gpd.GeoDataFrame(df, geometry=geom_series)
gdf.head()
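
Point data often arrives as plain longitude/latitude columns rather than as ready-made geometries. As a minimal sketch (the table and its column names are made up for illustration), `gpd.points_from_xy()` can build the geometry column directly, and the `crs` argument records the assumption that the coordinates are WGS 84 longitude/latitude:

```python
import pandas as pd
import geopandas as gpd

# Hypothetical table with coordinates stored as plain numeric columns
df_coords = pd.DataFrame({
    "name": ["A", "B", "C"],
    "lon": [2.3, 2.4, 2.5],
    "lat": [48.8, 48.9, 48.7],
})

# Build point geometries from the x (longitude) and y (latitude) columns
gdf_from_xy = gpd.GeoDataFrame(
    df_coords,
    geometry=gpd.points_from_xy(df_coords["lon"], df_coords["lat"]),
    crs="EPSG:4326",  # assumption: the coordinates are WGS 84 lon/lat
)

print(type(gdf_from_xy.geometry))  # the geometry column is a GeoSeries
print(gdf_from_xy.crs)
```

Declaring the CRS at construction time avoids having to attach it later, before any reprojection.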

### Reading and Writing Geospatial Data

- `GeoPandas` makes it easy to read and write geospatial data, such as `shapefiles` and `GeoJSON` files. 

Here's how to read and write geospatial data using GeoPandas:

In [None]:
# Read GeoJSON file
gdf_paris_districts = gpd.read_file("./data/vector/paris_districts_utm.geojson")
gdf_paris_districts.head()

In [None]:
# Write GeoJSON file
gdf_paris_districts.to_file("./data/vector/output_paris_districts.geojson", driver="GeoJSON")

In [None]:
# Read shapefile (assuming it's in a zip file)
gdf_countries = gpd.read_file("zip://./data/vector/ne_110m_admin_0_countries.zip")
gdf_countries.head()

In [None]:
# Write shapefile
gdf_countries.to_file("./data/vector/output_countries.shp")

## Exploring and Visualizing Geospatial Data

- We will explore and visualize geospatial data using `GeoPandas`. 
- We will cover methods to inspect `GeoDataFrame`s and discuss Coordinate Reference Systems (CRS). 
- We will demonstrate how to create basic geospatial visualizations using the `plot()` method.

### Exploring GeoDataFrames

You can use methods like `head()`, `tail()`, `info()`, and `describe()` to inspect and explore GeoDataFrames, similar to how you would use them with Pandas DataFrames.

In [None]:
# Inspect the first 5 rows of the GeoDataFrame
print("Head:")
gdf_paris_districts.head()

In [None]:
# Inspect the last 5 rows of the GeoDataFrame
print("Tail:")
gdf_paris_districts.tail()

In [None]:
# Get a summary of the GeoDataFrame's structure
print("Info:")
gdf_paris_districts.info()

In [None]:
# Generate descriptive statistics of the GeoDataFrame's columns
print("Describe:")
gdf_paris_districts.describe()
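
Besides the Pandas-style methods, a few geometry-aware attributes are handy during a first look at a dataset. A small sketch on made-up geometries (not the Paris data):

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Toy GeoDataFrame mixing a polygon and a point
gdf_mixed = gpd.GeoDataFrame(
    {"name": ["square", "dot"]},
    geometry=[Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]), Point(3, 1)],
)

print(gdf_mixed.geom_type)     # geometry type of each row
print(gdf_mixed.total_bounds)  # [minx, miny, maxx, maxy] of the whole dataset
print(len(gdf_mixed))          # number of features (rows)
```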

### Coordinate Reference Systems (CRS) in `GeoPandas`

Coordinate Reference Systems (CRS) define how coordinates are related to the Earth's surface. `GeoPandas` allows you to set, inspect, and transform the CRS of a GeoDataFrame. 

The CRS information is stored in the `crs` attribute of a GeoDataFrame.

In [None]:
# Inspect the CRS of the GeoDataFrame
print("CRS of Paris Districts GeoDataFrame:")
print(gdf_paris_districts.crs)

In [None]:
# Inspect the CRS of the Countries GeoDataFrame
print("CRS of Countries GeoDataFrame:")
print(gdf_countries.crs)
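
The `crs` attribute holds a `pyproj` CRS object, so it can be queried beyond simply printing it. A minimal sketch on a single made-up point, assuming WGS 84 coordinates:

```python
import geopandas as gpd
from shapely.geometry import Point

# One point with an explicitly declared geographic CRS
gs_wgs84 = gpd.GeoSeries([Point(2.35, 48.85)], crs="EPSG:4326")

crs = gs_wgs84.crs
print(crs.to_epsg())               # EPSG code as an integer, if one can be identified
print(crs.is_geographic)           # True when coordinates are angular (lon/lat)
print(crs.is_projected)            # True when coordinates are planar (e.g., meters)
print(crs.axis_info[0].unit_name)  # unit of the first axis
```

Checking `is_projected` before computing areas or buffers is a cheap way to catch data that is still in degrees.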

### Basic Geospatial Data Visualization with the `plot()` Method

`GeoPandas` provides a simple `plot()` method for visualizing `GeoDataFrame`s. This method creates a matplotlib plot of the geometries in the GeoDataFrame.

In [None]:
# Plot Paris Districts
gdf_paris_districts.plot(figsize=(10, 10), column="population")
plt.title("Paris Districts")
plt.show()

In [None]:
# Plot Countries
gdf_countries.plot(figsize=(10, 6))
plt.title("Countries")
plt.show()

In [None]:
# Plot Africa
ax = gdf_countries[gdf_countries["continent"] == "Africa"].plot(figsize=(10, 5), column="gdp_md_est", legend=True)
ax.axis("off")
plt.title("Africa")
plt.show()

- The above examples demonstrate how to create basic geospatial visualizations using the `plot()` method in `GeoPandas`. 
- You can customize the appearance of these plots by passing additional arguments to the `plot()` method and using matplotlib functions.
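
As an illustration of such customization, here is a sketch on toy squares. The data and the chosen styling values are arbitrary; the keywords shown are forwarded to matplotlib by `plot()`.

```python
import geopandas as gpd
from shapely.geometry import box

# Toy data: three adjacent unit squares with a made-up numeric attribute
gdf_squares = gpd.GeoDataFrame(
    {"value": [10, 20, 30]},
    geometry=[box(i, 0, i + 1, 1) for i in range(3)],
)

ax = gdf_squares.plot(
    column="value",     # color polygons by this column
    cmap="viridis",     # matplotlib colormap name
    edgecolor="black",  # polygon outline color
    legend=True,        # add a color bar
    figsize=(6, 3),
)
ax.set_title("Toy squares colored by value")
ax.set_axis_off()
```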

## Geometric Operations

In this section, we will explore geometric operations in `GeoPandas`. We'll cover geometry manipulation using both `GeoPandas` and `Shapely`, as well as spatial joins and overlays.

### Geometry Manipulation with `GeoPandas` and `Shapely`

- **`GeoPandas` provides easy access to geometric operations from the `Shapely` library**. 
- You can perform operations such as buffering, computing centroids, and calculating areas directly on `GeoDataFrames` and `GeoSeries`.

In [None]:
# Get a copy of the district dataset
d = gdf_paris_districts.copy()

# Transform each district's polygon into its centroid
d["geometry"] = d.centroid

# Plot the districts and centroids
fig, ax = plt.subplots()
_ = gdf_paris_districts.plot(ax=ax)
_ = d.plot(ax=ax, color="red")
ax.axis("off")
plt.show()

Let's calculate the area of each district:

In [None]:
# Area operation
gdf_paris_districts["area"] = gdf_paris_districts.area
print("Area of geometries:")
print(gdf_paris_districts[['geometry', 'area']].head())
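
To make the meaning of these operations concrete, the sketch below applies them to a single 2 x 2 square, where the results can be checked by hand. The coordinates are arbitrary planar units, not real-world data.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# A 2 x 2 square in an arbitrary planar coordinate system
gs_square = gpd.GeoSeries([Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])])

print(gs_square.area)      # area of the square: 4 squared units
print(gs_square.centroid)  # center of the square: the point (1, 1)
print(gs_square.length)    # perimeter of the square: 8 units

# Buffering by 1 unit grows the square outward with rounded corners
buffered = gs_square.buffer(1)
print(buffered.area)
```

The buffered area is slightly below 4 + 8 + π, because the rounded corners are approximated by short straight segments.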

### Spatial Joins and Overlays

- Spatial joins and overlays are essential operations for combining and analyzing geospatial data from different sources. 
- `GeoPandas` provides `sjoin()` for spatial joins and `overlay()` for spatial overlays.

In [None]:
# Create a buffer of 500 CRS units around the first district
# (meters, assuming the projected UTM CRS that the file name suggests)
first_district = gdf_paris_districts.iloc[0]
buffered_first_district = first_district['geometry'].buffer(500)

In [None]:
# Create a new GeoDataFrame with the buffered geometry
gdf_buffer = gpd.GeoDataFrame(geometry=[buffered_first_district], crs=gdf_paris_districts.crs)

In [None]:
# Perform a spatial join to find districts that intersect the buffer
intersecting_districts = gpd.sjoin(gdf_paris_districts.iloc[1:], gdf_buffer, predicate='intersects')

In [None]:
# Plot the buffer and the districts that intersect it
fig, ax = plt.subplots()
_ = gdf_buffer.plot(ax=ax)
_ = intersecting_districts.plot(ax=ax, color="red", alpha=.66)
ax.axis("off")
plt.show()

In [None]:
# Perform an overlay to compute the union of intersecting districts
union = gpd.overlay(intersecting_districts, gdf_buffer, how='union')
ax = union.plot()
ax.axis("off")
plt.show()

- In this example, we first create a `buffer` around the first district in the `gdf_paris_districts` GeoDataFrame. 
- We then use spatial join and overlay operations to find intersecting districts and compute the union of those districts.
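
The difference between the two operations is easier to see on toy data: `sjoin()` keeps the original geometries and attaches attributes, whereas `overlay()` computes new geometries. A minimal sketch, with invented squares and column names:

```python
import geopandas as gpd
from shapely.geometry import box

# Two toy layers in the same (arbitrary planar) coordinate system
left = gpd.GeoDataFrame({"left_id": [1, 2]}, geometry=[box(0, 0, 2, 2), box(5, 5, 6, 6)])
right = gpd.GeoDataFrame({"right_id": ["a"]}, geometry=[box(1, 1, 3, 3)])

# Spatial join: keep rows of `left` that intersect a geometry in `right`,
# and attach the attributes of the matching `right` row
joined = gpd.sjoin(left, right, predicate="intersects")
print(joined[["left_id", "right_id"]])

# Overlay: compute new geometries, here the area shared by both layers
shared = gpd.overlay(left, right, how="intersection")
print(shared.area)
```

Only the first square of `left` overlaps `right`, and the shared region is the unit square between (1, 1) and (2, 2).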

## Spatial Relationships and Predicates

- In this section, we'll explore spatial relationships and predicates in GeoPandas. 
- We'll cover point-in-polygon analysis and discuss various spatial relationships, such as `contains`, `intersects`, and `within`.

### Point-in-Polygon Analysis with GeoPandas

- Point-in-polygon analysis is a common geospatial operation used to determine if a point is inside a polygon. 
- You can perform point-in-polygon analysis in `GeoPandas` using the `sjoin()` function.

In [None]:
# Create sample points
point_data = {
    'name': ['Point 1', 'Point 2', 'Point 3'],
    'geometry': [Point(2.35, 48.85), Point(2.4, 48.87), Point(2.45, 48.83)]
}
gdf_points = gpd.GeoDataFrame(point_data, crs="EPSG:4326")
gdf_points = gdf_points.to_crs(gdf_paris_districts.crs)

# Perform point-in-polygon analysis using sjoin()
point_in_polygon = gpd.sjoin(gdf_points, gdf_paris_districts, predicate="within")
point_in_polygon

In this example, we create a GeoDataFrame with sample points and perform a point-in-polygon analysis using sjoin() to find which Paris district each point belongs to.
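
One detail worth knowing is what happens to points that fall outside every polygon. The sketch below uses invented zones and points to contrast the default inner join with `how="left"`:

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two toy zones side by side, and three points (the last lies outside both)
zones = gpd.GeoDataFrame({"zone": ["west", "east"]}, geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)])
pts = gpd.GeoDataFrame(
    {"name": ["p1", "p2", "p3"]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5, 5)],
)

# Default how="inner": points outside every zone are dropped
inside = gpd.sjoin(pts, zones, predicate="within")
print(inside[["name", "zone"]])

# how="left" keeps all points; unmatched ones get NaN in the zone columns
all_pts = gpd.sjoin(pts, zones, how="left", predicate="within")
print(all_pts[["name", "zone"]])
```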

### Spatial Relationships and Predicates

`GeoPandas` provides various spatial relationships and predicates to analyze the relationships between geometries in a GeoDataFrame. Some of the most common spatial relationships are `contains`, `intersects`, and `within`.

In [None]:
# Let's visualize the districts and points
fig, ax = plt.subplots()
_ = gdf_paris_districts.plot(ax=ax)
_ = gdf_points.plot(ax=ax, color="red", alpha=.66)
ax.axis("off")
plt.show()

Let's filter the districts that contain the points:

In [None]:
# Create a geometry that has the 3 different points
geom = unary_union(gdf_points["geometry"])

# Filter for districts that intersect it
ds = gdf_paris_districts[gdf_paris_districts.intersects(geom)]

# Plot
fig, ax = plt.subplots()
_ = ds.plot(ax=ax)
_ = gpd.GeoDataFrame(geometry=[geom]).plot(ax=ax, color="red", alpha=.66)
ax.axis("off")
plt.show()

Let's pick one point and check which district contains it:

In [None]:
# Get the first point and filter for the district that contains it
point_geom = gdf_points.iloc[0:1]
district = gdf_paris_districts[gdf_paris_districts.contains(point_geom.geometry.iloc[0])]

# Plot
fig, ax = plt.subplots()
_ = district.plot(ax=ax)
_ = point_geom.plot(ax=ax, color="red", alpha=.66)
ax.axis("off")
plt.show()

In these examples, we used spatial predicates such as `within`, `intersects`, and `contains` to analyze the relationships between the sample points and the districts in the `gdf_paris_districts` `GeoDataFrame`.
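
The semantics of the individual predicates can be checked on a few hand-made geometries. This is a sketch with arbitrary coordinates; the same method names work on single Shapely geometries and element-wise on a `GeoSeries`.

```python
import geopandas as gpd
from shapely.geometry import LineString, Point, box

square = box(0, 0, 2, 2)
inner_point = Point(1, 1)
crossing_line = LineString([(-1, 1), (3, 1)])  # passes through the square

# Shapely predicates on single geometries
print(square.contains(inner_point))      # True
print(inner_point.within(square))        # True
print(square.intersects(crossing_line))  # True
print(square.contains(crossing_line))    # False: the line extends outside

# The same predicates work element-wise on a GeoSeries
gs = gpd.GeoSeries([box(0, 0, 2, 2), box(10, 10, 12, 12)])
print(gs.contains(inner_point))
```

Note that `contains` and `within` are mirror images of each other, while `intersects` is the loosest test: any shared point is enough.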

## Coordinate Reference Systems and Transformations

In this section, we will discuss Coordinate Reference Systems (CRS) and how to set and transform them in `GeoPandas`.

### Coordinate Reference Systems (CRS)

- A Coordinate Reference System (`CRS`) defines how coordinates are related to the Earth's surface. 
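- It consists of a **datum** (a model of the Earth's shape and position), a **coordinate system** (e.g., geographic longitude/latitude or Cartesian), and, for projected CRSs, a **projection** that maps points from the Earth's curved surface onto a flat plane.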
- A CRS can be represented using an **EPSG code**, a **PROJ string**, or a **WKT string**.

When working with geospatial data from different sources, **it is essential to ensure that the data is in the same CRS**. Otherwise, spatial operations and calculations may produce inaccurate results.
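
The different CRS representations mentioned above can be converted into one another with `pyproj`. A brief sketch, using EPSG:4326 (WGS 84) as an arbitrary example:

```python
from pyproj import CRS

crs_from_epsg = CRS.from_epsg(4326)                  # from an EPSG code
wkt_text = crs_from_epsg.to_wkt()                    # export as a WKT string
crs_from_wkt = CRS.from_wkt(wkt_text)                # parse the WKT string back
crs_from_string = CRS.from_user_input("EPSG:4326")   # from an "authority:code" string

print(crs_from_epsg.name)
print(wkt_text[:60], "...")
print(crs_from_wkt.to_epsg(), crs_from_string.to_epsg())
```

GeoPandas accepts the same kinds of input wherever a `crs` argument is expected, so any of these forms can be used to declare or compare CRSs.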

### Setting and Transforming CRS in GeoPandas

`GeoPandas` allows you to set, inspect, and transform the CRS of a `GeoDataFrame`. The CRS information is stored in the `crs` attribute of a GeoDataFrame. You can declare the CRS of data that lacks one with the `set_crs()` method, and reproject the data into another CRS with the `to_crs()` method.

In [None]:
# Inspect the CRS of the Paris Districts GeoDataFrame
print("Original CRS of Paris Districts GeoDataFrame:")
print(gdf_paris_districts.crs)

In [None]:
gdf_paris_districts["geometry"].head(3)

In [None]:
# Transform the CRS of the Paris Districts GeoDataFrame to EPSG:3857
gdf_paris_districts_mercator = gdf_paris_districts.to_crs(epsg=3857)

print("Transformed CRS of Paris Districts GeoDataFrame (EPSG:3857):")
print(gdf_paris_districts_mercator.crs)

In [None]:
gdf_paris_districts_mercator["geometry"].head(3)

In [None]:
# Inspect the CRS of the Countries GeoDataFrame
print("Original CRS of Countries GeoDataFrame:")
print(gdf_countries.crs)

In [None]:
# Transform the CRS of the Countries GeoDataFrame to match Paris Districts GeoDataFrame
gdf_countries_transformed = gdf_countries.to_crs(gdf_paris_districts.crs)
print("Transformed CRS of Countries GeoDataFrame:")
print(gdf_countries_transformed.crs)
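
Setting a CRS and transforming to a CRS are easy to confuse. The sketch below, on a single made-up point, shows that `set_crs()` only declares what the existing coordinates mean, while `to_crs()` recomputes them:

```python
import geopandas as gpd
from shapely.geometry import Point

# A point created without any CRS information
gs = gpd.GeoSeries([Point(2.35, 48.85)])
print(gs.crs)  # None

# set_crs() only labels the data; the coordinates are unchanged
gs_wgs84 = gs.set_crs("EPSG:4326")
print(gs_wgs84.iloc[0])

# to_crs() reprojects: the coordinates themselves change
gs_mercator = gs_wgs84.to_crs(epsg=3857)
print(gs_mercator.iloc[0])
```

Calling `to_crs()` on data without a declared CRS raises an error, since there is no source CRS to transform from.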

## Practical Examples and Use Cases

We will work with several datasets about the city of Paris:
- The administrative districts of Paris: [`paris_districts_utm.geojson`](https://opendata.paris.fr/explore/dataset/quartier_paris/).
- A snapshot of the real-time availability data (as of the time of download) for the public bicycle sharing system in Paris: [`data/paris_bike_stations_mercator.gpkg`](https://opendata.paris.fr/explore/dataset/stations-velib-disponibilites-en-temps-reel/information/).

Both datasets are provided as spatial datasets using a GIS file format. 

Let's start exploring the data:

<div class="alert alert-success">

**EXERCISE**:

We will start with exploring the bicycle station dataset (available as a GeoPackage file: `data/paris_bike_stations_mercator.gpkg`)
    
* Read the stations dataset into a GeoDataFrame called `stations`.
* Check the type of the returned object.
* Check the first rows of the dataframe. What kind of geometries does this dataset contain?
* How many features are there in the dataset? 
    
<details><summary>Hints</summary>

* Use `type(..)` to check any Python object type
* The `geopandas.read_file()` function can read different geospatial file formats. You pass the file name as first argument.
* Use the `.shape` attribute to get the number of features

</details>
    
    
</div>

In [None]:
stations = gpd.read_file("./data/vector/paris_bike_stations_mercator.gpkg")
type(stations)

In [None]:
stations.head()

In [None]:
stations.shape

<div class="alert alert-success">

**EXERCISE**:

* Make a quick plot of the `stations` dataset.
* Make the plot a bit larger by setting the figure size to (12, 6) (hint: the `plot` method accepts a `figsize` keyword).
 
</div>

In [None]:
_ = stations.plot(figsize=(12, 6))

A plot with points can be hard to interpret without any spatial context. Therefore, we will learn how to add a background map.

We are going to make use of the [contextily](https://github.com/darribas/contextily) package. The `add_basemap()` function of this package makes it easy to add a background web map to our plot. We begin by plotting our data, then pass the matplotlib axes object to the `add_basemap()` function. `contextily` will then download the web tiles needed for the geographical extent of the plot.

<div class="alert alert-success">

**EXERCISE**:

* Import `contextily`.
* Re-do the figure of the previous exercise: make a plot of all the points in `stations`, but assign the result to an `ax` variable.
* Set the marker size equal to 5 to reduce the size of the points (use the `markersize` keyword of the `plot()` method for this).
* Use the `add_basemap()` function of `contextily` to add a background map: the first argument is the matplotlib axes object `ax`.

</div>

In [None]:
# Plot the stations with a supporting basemap 
ax = stations.plot(figsize=(12, 6), markersize=5)
ctx.add_basemap(ax)

<div class="alert alert-success">

**EXERCISE**:

* Make a histogram showing the distribution of the number of bike stands in the stations.

<details>
  <summary>Hints</summary>

* Selecting a column can be done with the square brackets: `df['col_name']`
* A single column has a `hist()` method to plot a histogram of its values.
    
</details>
    
</div>

In [None]:
_ = stations["bike_stands"].hist(figsize=(5,2), bins=100)

<div class="alert alert-success">

**EXERCISE**:

Let's now visualize where the available bikes are actually stationed:
    
* Make a plot of the `stations` dataset (also with a (12, 6) figsize).
* Use the `'available_bikes'` column to determine the color of the points. For this, use the `column=` keyword.
* Use the `legend=True` keyword to show a color bar.
 
</div>

In [None]:
_ = stations.plot(figsize=(12, 6), column='available_bikes', legend=True)

<div class="alert alert-success">

**EXERCISE**:

Next, we will explore the dataset on the administrative districts of Paris (available as a GeoJSON file: "data/paris_districts_utm.geojson")

* Read the dataset into a GeoDataFrame called `districts`.
* Check the first rows of the dataframe. What kind of geometries does this dataset contain?
* How many features are there in the dataset? (hint: use the `.shape` attribute)
* Make a quick plot of the `districts` dataset (set the figure size to (12, 6)).
    
</div>

In [None]:
districts = gpd.read_file("./data/vector/paris_districts_utm.geojson")
districts.head()

In [None]:
_ = districts.plot(figsize=(12, 6))

<div class="alert alert-success">

**EXERCISE**:
    
What are the largest districts (biggest area)?

* Calculate the area of each district.
* Add this area as a new column to the `districts` dataframe.
* Sort the dataframe by this area column from largest to smallest values (descending).

<details><summary>Hints</summary>

* Adding a column can be done by assigning values to a column using the same square brackets syntax: `df['new_col'] = values`
* To sort the rows of a DataFrame, use the `sort_values()` method, specifying the column to sort on with the `by='col_name'` keyword. Check the help of this method to see how to sort ascending or descending.

</details>

</div>

In [None]:
districts.geometry.area

In [None]:
# dividing by 10^6 for showing km²
districts["area"] = districts.geometry.area / 1e6

In [None]:
districts.sort_values(by='area', ascending=False)

<div class="alert alert-success">

**EXERCISE**:

* Add a column `'population_density'` representing the number of inhabitants per square kilometer (Note: The area is given in square meters, so you will need to multiply the result by `10**6`).
* Plot the districts using the `'population_density'` to color the polygons. For this, use the `column=` keyword.
* Use the `legend=True` keyword to show a color bar.

</div>

In [None]:
# Add a population density column
districts['population_density'] = districts['population'] / districts.geometry.area * 10**6

# Make a plot of the districts colored by the population density
_ = districts.plot(column='population_density', figsize=(8, 6), legend=True)

In [None]:
# As comparison, the misleading plot when not turning the population number into a density
_ = districts.plot(column="population", figsize=(12, 6), legend=True)
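
Why the raw population map is misleading can be seen with two invented districts. This is a sketch under stated assumptions: the geometries and populations are made up, and a UTM CRS is assumed so that areas come out in square meters.

```python
import geopandas as gpd
from shapely.geometry import box

# Toy districts in a metric CRS (coordinates in meters); populations are made up
toy = gpd.GeoDataFrame(
    {"name": ["small", "large"], "population": [10_000, 20_000]},
    geometry=[box(0, 0, 1_000, 1_000), box(1_000, 0, 5_000, 1_000)],
    crs="EPSG:32631",  # assumption: a UTM zone, so areas are in square meters
)

# Convert m² to km², then compute inhabitants per km²
toy["area_km2"] = toy.geometry.area / 1e6
toy["population_density"] = toy["population"] / toy["area_km2"]

print(toy[["name", "population", "area_km2", "population_density"]])
```

The "large" district has twice the population of the "small" one but only half its density, so a map colored by raw counts would rank them the wrong way round.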

## Additional Resources and Further Reading

Here are some resources to help you learn more about GeoPandas and further develop your geospatial analysis skills in Python:

- [GeoPandas Documentation](https://geopandas.org/en/stable/): The official documentation is an excellent resource to learn more about GeoPandas, its features, and API.
- [GeoPandas Gallery](https://geopandas.org/en/stable/gallery/index.html): The GeoPandas Gallery contains various examples and use cases to help you understand the capabilities of the library.
- [Automating GIS-processes course](https://autogis-site.readthedocs.io/en/latest/): This course, offered by the University of Helsinki, covers several Python libraries for geospatial analysis, including GeoPandas.
- [Introduction to Geospatial Data Analysis with Python](https://www.datacamp.com/tutorial/geospatial-data-python): This tutorial by DataCamp provides an introduction to geospatial data analysis using Python and GeoPandas.

---