<center><img src="https://i.imgur.com/zRrFdsf.png" width="700"></center>



<a target="_blank" href="https://colab.research.google.com/github/CienciaDeDatosEspacial/Operations-onGeoDF/blob/main/index.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Basic Spatial Operations on GeoDataFrames

We will review some important operations for GeoDataFrames. This is a basic set of tools for a social scientist, but the results depend a lot on the quality of the maps you have.

A spatial operation is a way of doing arithmetic with geometries! Our inputs, the maps (polygons, lines, or points), will be summed, subtracted, filtered, dissected, and so on.

Keep in mind that these are basic operations; we will use them later for practical applications in the coming weeks.

## Getting ready

The links to our maps on GitHub are here:

In [1]:
linkWorldMap="https://github.com/CienciaDeDatosEspacial/dataSets/raw/refs/heads/main/WORLD/worldMaps.gpkg"
LinkBrazil="https://github.com/CienciaDeDatosEspacial/dataSets/raw/refs/heads/main/BRAZIL/brazil_5880.gpkg"
linkIndicators="https://github.com/CienciaDeDatosEspacial/dataSets/raw/refs/heads/main/WORLD/worldindicators.json"


Let's get some maps (this takes a minute):

In [2]:
import geopandas as gpd

#world
world_rivers=gpd.read_file(linkWorldMap,layer='rivers')
#brazil
brazil5880=gpd.read_file(LinkBrazil,layer='country')
airports_brazil5880=gpd.read_file(LinkBrazil,layer='airports')
states_brazil5880=gpd.read_file(LinkBrazil,layer='states')
municipalities_brazil5880=gpd.read_file(LinkBrazil,layer='municipalities')
#some indicators
indicators=gpd.read_file(linkIndicators)



Now, let's see some important spatial operations.


<a class="anchor" id="1"></a>

# Filtering

## Slicing with **iloc** and **loc**

You can keep some elements by *filtering* (subsetting), as we do with regular pandas data frames.

In [None]:
states_brazil5880.head()

In [None]:
# as in DF
states_brazil5880.iloc[:10,1:]

In [None]:
# as DF
states_brazil5880.loc[:8,'state_code':]

Keep in mind that if you do not include the geometry column, you will get a DataFrame (DF) back, not a GeoDF.

In [None]:
#geodf
type(states_brazil5880.loc[:8,'state_code':])

In [None]:
# df
type(states_brazil5880.loc[:8,:'state_code'])

Also remember this detail:

In [None]:
# you lose the spatial structure when keeping ONE row (you get a pandas Series)!
type(states_brazil5880.loc[8,:])

In [None]:
# you keep the spatial structure if the row index is a list
type(states_brazil5880.loc[[8],:])

## Filtering with **cx**

But as a GeoDF, you can also filter using a coordinate point via __cx__. 

Now, let me get Brazil's centroid:

In [None]:
brazil5880.centroid

Here, I recover each coordinate value:

In [None]:
mid_x,mid_y=brazil5880.centroid.x[0],brazil5880.centroid.y[0]
mid_x,mid_y

Let me select airports north of the centroid:

In [None]:
airports_brazil5880.cx[:,mid_y:]

In [None]:
# the viz
base=brazil5880.plot(color='yellow')
airports_brazil5880.cx[:,mid_y:].plot(ax=base)
brazil5880.centroid.plot(color='red',ax=base)

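Notice __cx__ gives a cleaner result when the spatial elements are points: it keeps every geometry that intersects the selected range, so a polygon or line that straddles the limit is returned whole.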

Let me split the states (polygons) using the centroid so you see a less clean result:

In [None]:
# the north
N_brazil=states_brazil5880.cx[:,mid_y:]
# the south
S_brazil=states_brazil5880.cx[:,:mid_y]
# the west
W_brazil=states_brazil5880.cx[:mid_x,:]
# the east
E_brazil=states_brazil5880.cx[mid_x:,:]

Notice that the selection does not cut the polygons at the centroid:

In [None]:
base=N_brazil.plot()
brazil5880.centroid.plot(color='red',ax=base)

In [None]:
base=W_brazil.plot()
brazil5880.centroid.plot(color='red',ax=base)
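
To see why, here is a minimal sketch with three made-up squares (the `toy` GeoDataFrame, its names and its coordinates are hypothetical, not part of our maps). As far as I understand it, **cx** keeps every geometry that intersects the requested range, so a polygon that straddles the limit shows up, whole, on both sides:

```python
import geopandas as gpd
from shapely.geometry import box

# Three hypothetical "states": one below y=5, one above, one straddling it
toy = gpd.GeoDataFrame(
    {"name": ["south", "north", "straddling"]},
    geometry=[box(0, 0, 4, 4), box(0, 6, 4, 10), box(5, 3, 9, 7)],
)

cut_y = 5
north_side = toy.cx[:, cut_y:]   # geometries reaching the band y >= 5
south_side = toy.cx[:, :cut_y]   # geometries reaching the band y <= 5

print(sorted(north_side["name"]))   # ['north', 'straddling']
print(sorted(south_side["name"]))   # ['south', 'straddling']

# the straddling square keeps its full area (4 x 4): it was selected, not cut
print(north_side.loc[north_side["name"] == "straddling"].area.iloc[0])   # 16.0
```
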

## Clipping

Pay attention to this GDF:

In [None]:
world_rivers

As you see, this GDF has no Country column. But since it has geometry, you can keep the rivers, or their sections, that serve a country:

In [None]:
rivers_brazil5880 = gpd.clip(gdf=world_rivers.to_crs(5880),
                             mask=brazil5880)

Then, you can plot the clipped version:

In [None]:
base = brazil5880.plot(facecolor="greenyellow", edgecolor='black', linewidth=0.4,figsize=(5,5))
rivers_brazil5880.plot(edgecolor='blue', linewidth=0.5,
                    ax=base)

We can create our own mask for clipping:

Let me get the **bounding box** of the map (the smallest possible rectangle that completely encloses a geometric shape or set of shapes):

In [None]:
brazil5880.total_bounds #[minx, miny, maxx, maxy]

In [None]:
# or
minx, miny, maxx, maxy=brazil5880.total_bounds
minx, miny, maxx, maxy

I will combine those coordinates with the centroid to create a BOX of the north and south of Brazil:

In [None]:
north_mask = [minx, mid_y, maxx, maxy]
south_mask = [minx, miny, maxx, mid_y]

# split Brazil
states_brazil5880.clip(north_mask).plot(edgecolor="yellow")

In [None]:
states_brazil5880.clip(south_mask).plot(edgecolor="yellow")

As you see, with clip we can cut polygons.
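
A quick way to convince yourself is to compare areas before and after clipping. This sketch uses two made-up squares and a made-up rectangular mask (all hypothetical), written in the same `[minx, miny, maxx, maxy]` order we used for our masks:

```python
import geopandas as gpd
from shapely.geometry import box

# Two hypothetical polygons: A lies below y=5, B straddles y=5
toy = gpd.GeoDataFrame(
    {"name": ["A", "B"]},
    geometry=[box(1, 1, 4, 4), box(5, 3, 9, 7)],
)

north_mask = (0, 5, 10, 10)      # minx, miny, maxx, maxy (assumed values)
north = toy.clip(north_mask)

print(north["name"].tolist())    # ['B']  (A is completely outside the mask)
print(toy.area.tolist())         # [9.0, 16.0]
print(north.area.iloc[0])        # 8.0   (only the part of B above y=5 survives)
```

Unlike **cx**, the polygon that straddles the limit comes back with a new, smaller geometry.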

## Spatial Joins

We’re familiar with **merging**, which joins tables using common keys. Spatial joins, by contrast, rely solely on **geometry columns** to perform various types of filtering. 

Let me keep the large airports:

In [None]:
# keep only the large airports

large_airports=airports_brazil5880[airports_brazil5880.airport_type=='large_airport']
large_airports.head()

...and:

In [None]:
states_brazil5880.head()

Let's keep (filter):
> The large airports whose geometries are within the borders of a state in Brazil.

In [None]:
airports_within_states = gpd.sjoin(
    large_airports,         # LEFT: airports we want to filter/keep
    states_brazil5880,      # RIGHT: spatial boundaries to check against
    how='inner',            # return geometries that match in both LEFT/RIGHT (jointype)
    predicate='within'      # spatial condition: LEFT geometry within RIGHT geometry
)

# these are:
airports_within_states

We just performed a point-to-polygon spatial join.
Notice that the result preserves the original geometries from the LEFT GeoDataFrame — specifically, only those features whose spatial relationship satisfied both the predicate (e.g., 'within') and the join type ('inner').
The non-geometric attributes (columns) from the RIGHT GeoDataFrame are joined to the matching rows.

Importantly, if the LEFT GeoDataFrame contains polygons and the RIGHT contains points (a polygon-to-point join), you’ll typically need to use a different predicate — such as 'contains' — to express the spatial relationship correctly.

In [None]:
states_containing_LargeAirports = gpd.sjoin(states_brazil5880,large_airports,how='inner',
                                            predicate='contains')

states_containing_LargeAirports

'Contains' is literally strict: Any airport located exactly on a state boundary — whether due to data precision, snapping, or real geography — will be excluded, even if it’s “practically” inside the state.

To keep airports that may lie on the border, use the predicate 'intersects':

In [None]:
gpd.sjoin(states_brazil5880,large_airports,
          how='inner', predicate='intersects')
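
The difference between these predicates is easier to see with one made-up square and two made-up points (all hypothetical). The sketch calls the shapely predicates directly on single geometries; these are the same spatial relationships the join predicates refer to:

```python
from shapely.geometry import Point, box

state = box(0, 0, 10, 10)         # a hypothetical square "state"
inside_airport = Point(5, 5)      # clearly inside
border_airport = Point(10, 5)     # exactly on the boundary

print(state.contains(inside_airport))     # True
print(state.contains(border_airport))     # False: the boundary is not the interior
print(border_airport.within(state))       # False: 'within' mirrors 'contains'
print(state.intersects(border_airport))   # True: one shared point is enough
```
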

**Intersects** needs at least one point in common between the geometries being compared.

In [None]:
# Neighbors of Bahia?
gpd.sjoin(N_brazil.loc[N_brazil.state_name=='Bahia',:],N_brazil,how='inner', predicate='intersects').shape

That is, Bahia seems to share borders with 5 states:

In [None]:
base=gpd.sjoin(N_brazil,N_brazil.loc[N_brazil.state_name=='Bahia',:],
               how='inner', 
               predicate='intersects').plot(color='yellow',edgecolor='red')
N_brazil.loc[N_brazil.state_name=='Bahia',:].plot(ax=base, color='red')

We also have 'touches', a more stringent predicate than 'intersects'. It returns geometries that:
 - Share a border (for polygons or lines), or
 - Contact at exactly one point (for points or endpoints).

However, because many free GeoDataFrames — especially those sourced as Shapefiles — contain topological imperfections like gaps, overlaps, or misaligned vertices, 'touches' often fails to detect what should be adjacent features. Ironically, this “failure” can be useful: 'touches' acts as a diagnostic tool — highlighting where boundaries are not perfectly aligned.

In [None]:
gpd.sjoin(N_brazil.loc[N_brazil.state_name=='Bahia',:],N_brazil,how='inner', predicate='touches').shape

See the neighbor that disappears:

In [None]:
base=gpd.sjoin(N_brazil,N_brazil.loc[N_brazil.state_name=='Bahia',:],
               how='inner', 
               predicate='touches').plot(color='yellow',edgecolor='red')
N_brazil.loc[N_brazil.state_name=='Bahia',:].plot(ax=base, color='red')
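
The three situations (a clean shared border, a gap, and an overlap) can be reproduced with made-up squares. The data are hypothetical, and the sketch uses the shapely predicates directly:

```python
from shapely.geometry import box

a = box(0, 0, 1, 1)
neighbors = {
    "clean":   box(1, 0, 2, 1),       # shares the edge x = 1 exactly
    "gap":     box(1.001, 0, 2, 1),   # tiny gap between the two squares
    "overlap": box(0.999, 0, 2, 1),   # tiny overlap between the two squares
}

for label, other in neighbors.items():
    print(label, a.touches(other), a.intersects(other))
# clean True True
# gap False False
# overlap False True
```

So a neighbor found by 'intersects' but missed by 'touches' points to an overlap, while a neighbor missed by both points to a gap.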

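The predicates also apply between polygons and lines. Let me keep the rivers of the Amazon system and compare 'intersects' with 'crosses':

In [None]:
amazonSystem=rivers_brazil5880[rivers_brazil5880.SYSTEM=='Amazon']
amazonSystem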

In [None]:
gpd.sjoin(states_brazil5880,amazonSystem,how='inner', predicate='intersects').shape

In [None]:
gpd.sjoin(states_brazil5880,amazonSystem,how='inner', predicate='crosses').shape

In [None]:
# save the intersects result, then see it
intersects_result = gpd.sjoin(states_brazil5880,amazonSystem,how='inner', predicate='intersects')
intersects_result

In [None]:
# Get intersects result
intersects_result = gpd.sjoin(states_brazil5880, amazonSystem, how='inner', predicate='intersects')

# Get crosses result
crosses_result = gpd.sjoin(states_brazil5880, amazonSystem, how='inner', predicate='crosses')

# Find the one that's in intersects but not in crosses
notInCrosses=list(set(intersects_result.index_right)-set(crosses_result.index_right))
stateToWatch=intersects_result[intersects_result.index_right.isin(notInCrosses)].index

# see
states_brazil5880.loc[stateToWatch,'state_name'],amazonSystem.loc[notInCrosses,"RIVER"]

In [None]:
list(set(intersects_result.index_right)-set(crosses_result.index_right))

In [None]:
base=states_brazil5880.loc[stateToWatch,:].plot(color='w',edgecolor='k',figsize=(15, 10))
amazonSystem.plot(ax=base)
amazonSystem.loc[notInCrosses,:].plot(color='red',ax=base)




_____________


<a class="anchor" id="3"></a>

# UNARY Operations on GeoDF



In [None]:
#see
municipalities_brazil5880.head(20)

Then, this is Rondônia:

In [None]:
muniRondonia=municipalities_brazil5880[municipalities_brazil5880.state_name=='Rondônia']

In [None]:
muniRondonia.plot(edgecolor='yellow')

## I. Operations that combine

Let's see the options to combine:

### Unary UNION

We can combine all these polygons into one:

In [None]:
muniRondonia.union_all()

Let's save that result:

In [None]:
Rondonia_union=muniRondonia.union_all()

In [None]:
# what do we have?
type(Rondonia_union)

You can turn that _shapely_ object into a GeoDF like this:

In [None]:
gpd.GeoDataFrame(geometry=[Rondonia_union]) # the recent union

Even better:

In [None]:
gpd.GeoDataFrame(index=[0], # one element
                 data={'state':'Rondonia'}, # the column and the value
                 geometry=[Rondonia_union]) # the recent union

<a class="anchor" id="21"></a>

### Dissolve

#### a. Dissolve as Union
Using  **dissolve** is an alternative to _UNION_:

In [None]:
muniRondonia.dissolve().plot()

Let me save the result, and see the type :

In [None]:
Rondonia_dissolved=muniRondonia.dissolve()

# we got?
type(Rondonia_dissolved)

You got a GEOdf this time:

In [None]:
## see
Rondonia_dissolved

In [None]:
# keeping what is relevant
Rondonia_dissolved.drop(columns=['municipality_name','municipality_code'],inplace=True)

# then
Rondonia_dissolved

#### b. Dissolve for groups

Using _dissolve()_ with no arguments returns the union of the polygons as above, AND also you get a GEOdf.
However, if you have a column that represents a grouping (as we do), you can dissolve by that column:

In [None]:
# dissolving
municipalities_brazil5880.dissolve(by='state_code').plot(facecolor='lightgrey', edgecolor='black',linewidth=0.2)

Again, let me save this result:

In [None]:
Brazil_adm1_diss=municipalities_brazil5880.dissolve(by='state_code')

We know we have a GeoDF; let's see contents:

In [None]:
Brazil_adm1_diss.head()

Again, we can drop columns that do not bring important information:

In [None]:
Brazil_adm1_diss.drop(columns=['municipality_name','municipality_code'],inplace=True)
Brazil_adm1_diss.reset_index(inplace=True)
Brazil_adm1_diss.info()

#### c. Dissolve and aggregate

In pandas, you can aggregate data using some statistics. Let me open the map with indicators we created in a previous session:

In [None]:

indicators.head()

You can compute the mean of the countries by region, using a DF approach like this:

In [None]:
indicators.groupby('region').agg({'fragility':'mean'}) 


You do not see a "geometry" column. It got lost when using **groupby().agg()**.

The appropriate operation to conserve spatial information is also **dissolve**:

In [None]:
indicatorsByRegion=indicators.dissolve(
    by="region", #groupby()
    aggfunc={"fragility": "mean"}, #agg()
    )

## see the GeoDF
indicatorsByRegion

Without renaming, you can request a choropleth:

In [None]:
# !pip install mapclassify

In [None]:
indicatorsByRegion.plot(column ='fragility',edgecolor='white',
                        figsize=(15, 10))

Keep in mind that combining objects via UNION_ALL and DISSOLVE is a one-way street: the result no longer stores the internal boundaries, so we can not undo it. We have operations like EXPLODE that work in the reverse direction (splitting), but even that function can not recover the original rows from the output of UNION_ALL and DISSOLVE. Both operations return a new object and leave the original GeoDataFrame untouched, so always keep the original (do not overwrite it) when using them.
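
Here is a minimal sketch of that one-way behavior, with three made-up "municipalities" in two made-up "states" (all names and coordinates are hypothetical):

```python
import geopandas as gpd
from shapely.geometry import box

munis = gpd.GeoDataFrame(
    {"muni": ["m1", "m2", "m3"], "state": ["X", "X", "Y"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(5, 0, 6, 1)],
)

states = munis.dissolve(by="state")   # m1 and m2 merge into state X
print(len(munis), len(states))        # 3 2  (the original still has its 3 rows)

# explode splits multi-part geometries, but X is now ONE polygon:
pieces = states.explode()
print(len(pieces))                    # 2  (m1 and m2 do not come back)
```
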

_____________


<a class="anchor" id="4"></a>

## II. The convex hull

Sometimes you may have the need to create a polygon that serves as an envelope to a set of points.

For this example, let me use the large airports:

In [None]:
large_airports.plot()

Can I use **convex_hull** now?

In [None]:
## you see no difference: the hull of a single point is the point itself!!
large_airports.convex_hull.plot()

The objects to be enveloped need to be **combined first**:

In [None]:
# hull of the union
large_airports.union_all().convex_hull

The structure we got is:

In [None]:
# this is a geometry, not a GeoDF...yet
type(large_airports.union_all().convex_hull)
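
The same contrast (one hull per row versus one hull for the combined set) can be sketched with five made-up points. The data are hypothetical, and the sketch uses shapely directly, with `unary_union` playing the role that `union_all()` plays for a GeoDF:

```python
from shapely.geometry import Point
from shapely.ops import unary_union

# four corners of a 4 x 3 rectangle plus one interior point
points = [Point(0, 0), Point(4, 0), Point(4, 3), Point(0, 3), Point(2, 1)]

# hull of each point on its own: still a point
print({p.convex_hull.geom_type for p in points})       # {'Point'}

# hull of the combined points: the enclosing polygon
combined_hull = unary_union(points).convex_hull
print(combined_hull.geom_type, combined_hull.area)     # Polygon 12.0
```
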

Let's turn this geometry into a GDF:

In [None]:
LargeAirports_hull= gpd.GeoDataFrame(index=[0],
                                     data={'hull':'Large airports'}, # the column and the value
                                     crs=large_airports.crs, # keep the projection of the source
                                     geometry=[large_airports.union_all().convex_hull])

# then

LargeAirports_hull

Let's use the GDF in plotting:

In [None]:

base=brazil5880.plot(facecolor='yellow')
large_airports.plot(ax=base)
LargeAirports_hull.plot(ax=base,facecolor='green',
                       edgecolor='white',alpha=0.4,
                       hatch='X')

You can get a convex hull of lines or polygons:

In [None]:
rivers_brazil5880.union_all().convex_hull

You can use it for dissolved polygons:

In [None]:
Rondonia_dissolved.convex_hull.plot()

Remember that **union_all** and **dissolve()** give different outputs:

In [None]:
# you got a series, not just a geometry 
type(Rondonia_dissolved.convex_hull)

In [None]:
# a simple "to_frame" does the job
Rondonia_dissolved.convex_hull.to_frame()

In [None]:
# more details
Rondonia_hull=Rondonia_dissolved.convex_hull.to_frame()
Rondonia_hull.rename(columns={0:"geometry"},inplace=True)
Rondonia_hull.set_geometry('geometry',inplace=True)
Rondonia_hull["name"]="Rondonia"
Rondonia_hull

In [None]:
# notice the crs was inherited
Rondonia_hull.crs

Unless you need a hull per row, you need to union/dissolve the polygons (rows) of a GeoDF, see:

In [None]:
#original not COMBINED:
Brazil_adm1_diss.plot(edgecolor="yellow")

In [None]:
# hull of Non combined
Brazil_adm1_diss.convex_hull.plot(edgecolor="yellow")

In [None]:
# the hull of Brazil
Brazil_adm1_diss.dissolve().convex_hull.plot(edgecolor="yellow")

## III. The Buffer

The buffer creates a polygon that covers everything within a given distance of the original vector (line, polygon, point), so it follows the shape of the original.

Let me buffer the Brazil rivers:

In [None]:
# this is the original
rivers_brazil5880.plot()

But, verify crs as we are going to use distances:

In [None]:
rivers_brazil5880.crs

Now I can use the rivers to create a buffer of 50000 meters:

In [None]:
# 50000 at each side (radius)
rivers_brazil5880.buffer(50000).plot(facecolor='yellow', edgecolor='black',linewidth=0.2)

The resulting buffer is:

In [None]:
type(rivers_brazil5880.buffer(50000))

Then:

In [None]:
base=rivers_brazil5880.buffer(50000).plot(facecolor='yellow',edgecolor='black',linewidth=0.2)
rivers_brazil5880.plot(ax=base)

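Notice you can also buffer just one side of a line. With `single_sided=True`, a positive distance buffers the left-hand side of the line (following the direction in which it was drawn), and a negative distance buffers the right-hand side:

In [None]:
riv_buf_left = rivers_brazil5880.buffer(distance = 50000, single_sided = True)
riv_buf_right = rivers_brazil5880.buffer(distance = -25000, single_sided = True)

base =riv_buf_left.plot(color='green')
riv_buf_right.plot(ax=base, color='purple')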
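
The side rule is easy to check on a made-up straight "river" drawn from west to east (hypothetical data, using shapely directly). As I read the shapely documentation, a positive distance with `single_sided=True` buffers the left-hand side of the line, which here is the north, and a negative distance buffers the right-hand side:

```python
from shapely.geometry import LineString

river = LineString([(0, 0), (10, 0)])   # drawn from west to east

positive_side = river.buffer(2, single_sided=True)    # positive distance
negative_side = river.buffer(-1, single_sided=True)   # negative distance

# a centroid above the line (y > 0) means the buffer is on the left/north side
print(positive_side.centroid.y > 0)   # True
# a centroid below the line (y < 0) means the buffer is on the right/south side
print(negative_side.centroid.y < 0)   # True
```

For real rivers the side depends on the direction in which each line was digitized, so it is worth plotting the result before trusting it.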

Let me save the reprojected rivers in a GeoJSON file:

In [None]:
rivers_brazil5880.to_file("rivers_brazil5880.geojson", driver="GeoJSON")


_____________

<a class="anchor" id="5"></a>
# BINARY Operations: Spatial Overlay

We might need to create or find some geometries from the geometries we already have. Using a set theory approach, we will see the use of _intersection_, _union_, _difference_, and _symmetric difference_.

Let's remember these results:

In [None]:
N_brazil

In [None]:
S_brazil

Let me plot both of them:

In [None]:
base= N_brazil.plot(facecolor='black', edgecolor='white',linewidth=0.2, alpha=0.6)
S_brazil.plot(facecolor='white', edgecolor='black',linewidth=0.2,ax=base, alpha=0.6)

Notice that the coordinates we used to split the states did not give us a clean cut. Here you see the states in common:

In [None]:
set(S_brazil.state_name) & set(N_brazil.state_name)

The same happened in East vs West:

In [None]:
set(E_brazil.state_name) & set(W_brazil.state_name)

In [None]:
# visualizing
base= E_brazil.plot(facecolor='black', edgecolor='white',linewidth=0.2, alpha=0.6)
W_brazil.plot(facecolor='white', edgecolor='black',linewidth=0.2,ax=base, alpha=0.6)

## Intersection

We keep what is common between GeoDFs:

In [None]:
NS_brazil=N_brazil.overlay(S_brazil, how="intersection",keep_geom_type=True)
# see results
NS_brazil

Notice we got more rows than when we did this operation:

```
set(S_brazil.state_name) & set(N_brazil.state_name)
```
We have three more polygons:

In [None]:
NS_brazil[NS_brazil.state_name_1!= NS_brazil.state_name_2]

In fact, we are NOT intersecting state names, we are intersecting geometries. Then, the input maps have some topological issues.

This is the amount of area that is in fact a topological problem:

In [None]:
NS_brazil[NS_brazil.state_name_1!= NS_brazil.state_name_2].geometry.area.sum()

This represents the area with topologically valid boundaries:

In [None]:
NS_brazil[NS_brazil.state_name_1== NS_brazil.state_name_2].geometry.area.sum()

A way to measure the weight of the low quality area (relative to the clean area):

In [None]:
NS_brazil[NS_brazil.state_name_1!= NS_brazil.state_name_2].geometry.area.sum()/  \
NS_brazil[NS_brazil.state_name_1== NS_brazil.state_name_2].geometry.area.sum() #continues from above

So, spatial overlay operations do their best to give you true results; but unfortunately, as the quality of the sources is not perfect, you may get messy results. It is our job to detect these problems and make decisions. Let's keep two GeoDFs, one with the imperfect result, and another with the true output.

In [None]:
NS_brazil_messy=NS_brazil.copy()
NS_brazil=NS_brazil[NS_brazil.state_name_1== NS_brazil.state_name_2]

This should be what we expected to see:

In [None]:
NS_brazil

The clean data has minor things to improve: delete redundant columns, rename columns, and reset the index so it is a consecutive sequence.

In [None]:
# avoid redundancy
keep=['state_name_1','state_code_1','geometry']
NS_brazil=NS_brazil.loc[:,keep]
NS_brazil.rename(columns={'state_name_1':'state_name','state_code_1':'state_code'},inplace=True)

# reset for correlative sequence
NS_brazil.reset_index(drop=True, inplace=True)

Based on the previous case, we may expect a similar situation here:

In [None]:
# keeping the overlay
WE_brazil=W_brazil.overlay(E_brazil, how="intersection",keep_geom_type=True)
WE_brazil[WE_brazil.state_name_1!= WE_brazil.state_name_2]

Let's do the same as before:

In [None]:
WE_brazil_messy=WE_brazil.copy()
WE_brazil=WE_brazil[WE_brazil.state_name_1== WE_brazil.state_name_2]

keep=['state_name_1','state_code_1','geometry']
WE_brazil=WE_brazil.loc[:,keep]
WE_brazil.rename(columns={'state_name_1':'state_name','state_code_1':'state_code'},inplace=True)
WE_brazil.reset_index(drop=True, inplace=True)

## Union

Different from UNION_ALL (which acts as DISSOLVE), here we will combine two GeoDFs. 

In [None]:
NS_brazil.info()

In [None]:
WE_brazil.info()

In [None]:
# now
NS_brazil.overlay(WE_brazil,how="union",keep_geom_type=True)

As you see, geometries are fine, but missing values were created where no intersection exists. Notice that simply appending the rows (as below) does not identify the intersection; it just pastes one GeoDF on top of the other:

In [None]:
# appending
import pandas as pd

pd.concat([NS_brazil,WE_brazil],ignore_index=True)

Let me save the overlay union, dissolved into a single polygon:

In [None]:
MidBrazil=NS_brazil.overlay(WE_brazil,how="union",keep_geom_type=True).dissolve()
MidBrazil

In [None]:
# some cleaning

MidBrazil['country']='Brazil'
MidBrazil['region']='center'
# reordering
MidBrazil=MidBrazil.loc[:,['country','region','geometry']]

MidBrazil

In [None]:
# see it
base=brazil5880.plot(facecolor='yellow')
MidBrazil.plot(ax=base)

## Difference

Here, you keep what belongs to the GeoDF on the left that is not in the GeoDF on the right:

In [None]:
# we keep northern states that are not in the 'S_brazil' region
N_brazil.overlay(S_brazil, how='difference')

In [None]:
# using set operations:
set(N_brazil.state_name)- set(S_brazil.state_name)

We got a clean result. Let's plot it:

In [None]:
base=N_brazil.plot(color='yellow', edgecolor='black',alpha=0.1)
N_brazil.overlay(S_brazil, how='difference').plot(ax=base)

Keep in mind that **difference** is not commutative:

In [None]:
S_brazil.overlay(N_brazil, how='difference')

In [None]:
base=N_brazil.plot(color='yellow', edgecolor='black',alpha=0.1)
S_brazil.overlay(N_brazil, how='difference').plot(ax=base)

## Symmetric Difference

This is the opposite of *intersection*: you keep what is not in the intersection. Notice that this operation is commutative!

In [None]:
N_brazil.overlay(S_brazil, how='symmetric_difference')

This operation gave a clean result again. Let's plot it:

In [None]:
N_brazil.overlay(S_brazil, how='symmetric_difference').plot()
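
Both properties (difference depends on the order, symmetric difference does not) can be checked with two made-up rectangles that overlap in a thin band. The data are hypothetical, and the sketch uses the shapely set operations on single geometries:

```python
from shapely.geometry import box

north = box(0, 4, 10, 10)   # covers 4 <= y <= 10
south = box(0, 0, 10, 5)    # covers 0 <= y <= 5, overlapping the north in 4 <= y <= 5

# difference: the order matters
print(north.difference(south).area)   # 50.0  (what is left of the north)
print(south.difference(north).area)   # 40.0  (what is left of the south)

# symmetric difference: same result in both orders
ns = north.symmetric_difference(south)
sn = south.symmetric_difference(north)
print(ns.area, sn.area)               # 90.0 90.0

# it is everything except the shared band
print(north.union(south).area - north.intersection(south).area)   # 90.0
```
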


_____________

<a class="anchor" id="6"></a>

# Validity of Geometry

Geometries are created in a way that some issues may appear, especially in (multi) polygons.
Let's check if one of our recent maps of states is valid:

In [None]:
# non valid

S_brazil[~S_brazil.is_valid]

In [None]:
# see the invalid:
S_brazil[~S_brazil.is_valid].plot()

It is difficult to see what is wrong. Let's get some information:

In [None]:
# what is wrong?

from shapely.validation import explain_validity, make_valid

explain_validity(S_brazil[~S_brazil.is_valid].geometry)

In [None]:
explain_validity(S_brazil.geometry).str.split("[",expand=True)[0].value_counts()

In [None]:
S_brazil_valid=S_brazil.copy()

S_brazil_valid['geometry'] = [make_valid(row)  if not row.is_valid else row for row in S_brazil['geometry'] ]
#any invalid?
S_brazil_valid[~S_brazil_valid.is_valid]

Let's verify we have not created **collections**:

In [None]:
pd.Series([type(x) for x in S_brazil_valid.geometry]).value_counts()

## Buffers and Validity

Buffering helps clean simple invalidities:

In [None]:
S_brazil_valid=S_brazil.copy()

S_brazil_valid['geometry'] = S_brazil_valid['geometry'].buffer(0)

#any invalid?
S_brazil_valid[~S_brazil_valid.is_valid]
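
Both repairs can be compared on a tiny invalid polygon. This sketch uses a made-up "bowtie" (a hypothetical polygon whose outline crosses itself). Depending on the GEOS version behind shapely, `buffer(0)` may keep only part of such a shape, while `make_valid` keeps both lobes, so it is worth comparing the areas after any repair:

```python
from shapely.geometry import Polygon
from shapely.validation import explain_validity, make_valid

# the outline crosses itself at (1, 1), forming two triangles of area 1 each
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
print(bowtie.is_valid)             # False
print(explain_validity(bowtie))    # reports a self-intersection

fixed_mv = make_valid(bowtie)
fixed_buf = bowtie.buffer(0)

print(fixed_mv.is_valid, fixed_mv.geom_type, fixed_mv.area)   # True MultiPolygon 2.0
# valid too, but the area may be smaller than 2.0 if a lobe was dropped
print(fixed_buf.is_valid, fixed_buf.geom_type, fixed_buf.area)
```
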

This 'buffer trick' may not always work:

In [None]:
# previously
indicatorsByRegion.plot(column =indicatorsByRegion.index,
                        edgecolor='white',
                        figsize=(15, 10))

The worst cases seem to be AFRICA and EAST AND SOUTHEAST ASIA, as both show some lines that should have disappeared after the dissolving we did a while ago.

Did the dissolving process create invalid geometries?

In [None]:
indicatorsByRegion.geometry.is_valid.value_counts()

Since we do not have invalid geometries, we know the dissolving created some gaps, so the goal is to snap the boundaries together to eliminate these microscopic gaps.
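
Here is a minimal sketch of that situation, with two made-up squares separated by a tiny gap (hypothetical data, using shapely directly). The union is perfectly valid, it just has two parts; a zero buffer leaves the gap alone, and only a buffer wider than half the gap merges the parts:

```python
from shapely.geometry import box
from shapely.ops import unary_union

left = box(0, 0, 1, 1)
right = box(1.0001, 0, 2, 1)          # a gap of 0.0001 units between the squares

raw = unary_union([left, right])
print(raw.is_valid, raw.geom_type, len(raw.geoms))   # True MultiPolygon 2

zero = raw.buffer(0)                  # nothing to repair, the gap stays
print(zero.geom_type)                 # MultiPolygon

closed = raw.buffer(0.001)            # wider than half the gap: the parts merge
print(closed.geom_type)               # Polygon

# the price: the whole outline moved outwards, so the area grew a little
print(closed.area > raw.area)         # True
```
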

We could try the buffer(0) trick again:

In [None]:
indicatorsByRegion_prjd=indicatorsByRegion.to_crs("ESRI:54052").copy()
indicatorsByRegion_prjd['geometry'] = indicatorsByRegion_prjd.buffer(0)

# after buffer(0)
indicatorsByRegion_prjd.plot(column =indicatorsByRegion_prjd.index,
                        edgecolor='white',
                        figsize=(15, 10))

It did not work either. We may increase the buffer:

In [None]:
indicatorsByRegion_prjd['geometry'] = indicatorsByRegion_prjd.buffer(1)

indicatorsByRegion_prjd.plot(column =indicatorsByRegion_prjd.index,
                        edgecolor='white',
                        figsize=(15, 10))

The last version did get rid of the gaps. Let's just check the count of parts in each case:

In [None]:
[(r,len(g.geoms)) for r,g in zip(indicatorsByRegion.index,indicatorsByRegion.geometry) if g.geom_type.startswith('Multi')]

In [None]:
[(r,len(g.geoms)) for r,g in zip(indicatorsByRegion_prjd.index,indicatorsByRegion_prjd.geometry)  if g.geom_type.startswith('Multi')]

It seems the AFRICA issue was solved, but not EAST AND SOUTHEAST ASIA. There seems to be a really big issue in those borders (Mongolia and China). Let's explore:

In [None]:
china=indicators[indicators.Country.isin(['CHINA'])]
mongolia=indicators[indicators.Country.isin(['MONGOLIA'])]

china.overlay(mongolia, how='intersection',keep_geom_type=False).geometry

So, we have a really bad situation:

- There is an intersection between two countries, and there should be none.
- The intersection includes objects other than polygons: 'GEOMETRYCOLLECTION'

See:


In [None]:
# Quick count of objects in the GeometryCollection
result_geom = china.overlay(mongolia, how='intersection',keep_geom_type=False).geometry.iloc[0]
if result_geom.geom_type == 'GeometryCollection':
    print(f"Objects in collection: {len(result_geom.geoms)}")
    from collections import Counter
    print(dict(Counter(g.geom_type for g in result_geom.geoms)))

In [None]:
## see the intersection:
base=china.plot(color='lightgrey')
mongolia.plot(color='yellow',ax=base)
china.overlay(mongolia, how='intersection',keep_geom_type=False).plot(ax=base)

The solution to this, believe me, will not be trivial: the border is not continuous, and creating a 'new frontier' between China and Mongolia will demand more functions than the ones we have taught so far. Situations like this require smart decisions, like getting a new map with [better quality](https://www.naturalearthdata.com/downloads/110m-cultural-vectors/).