<center><img src="https://i.imgur.com/zRrFdsf.png" width="700"></center>

## The Geo Dataframe

Let's look at a case of **maps represented by polygons**. I visited this page from the [World Bank](https://datacatalog.worldbank.org/search/dataset/0038272/World-Bank-Official-Boundaries) and downloaded the World Boundaries Low Resolution file in GeoJSON format.

In [None]:
import os, geopandas as gpd

worldmap=gpd.read_file(os.path.join("maps","WB_countries_Admin0_lowres.geojson"))

In [None]:
worldmap.plot()

In [None]:
# see the structure
worldmap.info()

You know you have a geodataframe when one of the columns has the **geometry** type (look for `geometry` in the Dtype listing above).

In [None]:
worldmap.geometry

This data frame has several columns. Using **head()** is important to detect salient problems, but pandas or geopandas may hide some columns:

In [None]:
worldmap.head()

A strategy would be to view the head as **HTML**:

In [None]:
from IPython.display import HTML, display  # public IPython display API

display(HTML(worldmap.head().to_html()))
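Another option, sketched here on a made-up wide table (the `wide` frame and its `col_*` names are hypothetical), is to lift the column limit temporarily with `pd.option_context`, so that the plain-text output lists every column:

```python
import pandas as pd

# Hypothetical wide table: 30 columns, 3 rows.
wide = pd.DataFrame({f"col_{i}": range(3) for i in range(30)})

# Inside the context manager pandas prints every column;
# the previous display settings are restored on exit.
with pd.option_context("display.max_columns", None):
    full_repr = repr(wide.head())

print(full_repr)
```

The output wraps across several blocks of lines, so the HTML view is usually easier to read in a notebook.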

The geometry column does not allow a clean view of the data. Let's omit it from the display (this time for every row):

In [None]:
display(HTML(worldmap.drop(columns='geometry').to_html())) # drop() returns a copy: worldmap keeps its geometry

There are so many columns that it may be very hard to check each one for particular problems. In general, you need a clean map to which you can add other data, and the most important information will be in the data you add.
So, in this case, pay attention to the columns needed to combine this map with other data; that is, verify that the **keys** or **identifiers** are clean. Let's see:

In [None]:
# any missing values?
worldmap[worldmap.isna().any(axis=1)]

From the data frame head, you may use these as identifiers of the rows:

In [None]:
identifiers=['TYPE','FORMAL_EN','FIPS_10_','ISO_A2', 'ISO_A3',"ISO_A3_EH"]
worldmap.loc[:,identifiers]

In [None]:
# any repeated values in country name ('FORMAL_EN')?

worldmap[worldmap.duplicated(subset=['FORMAL_EN'],keep=False)].drop(columns='geometry')

This is a nice case of repeated values. Why would a map of countries repeat country names?
Our first **guess** is that each row represents a single polygon (a surface on the planet), so a country made of several pieces would need several rows. However, we already saw **multipolygons** stored in a single row, so that cannot be the whole story. It is more likely that some polygons are kept apart for reasons of international politics.
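The `keep=False` argument used in the check above matters: it flags every member of a repeated group, not only the later occurrences. A minimal sketch on made-up names (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical names; 'Republic A' appears twice.
toy = pd.DataFrame({"FORMAL_EN": ["Republic A", "Kingdom B", "Republic A", "State C"]})

# Default keep='first' flags only the later occurrence.
later_only = toy.duplicated(subset=["FORMAL_EN"])
# keep=False flags every member of a repeated group.
all_members = toy.duplicated(subset=["FORMAL_EN"], keep=False)

print(later_only.tolist())   # [False, False, True, False]
print(all_members.tolist())  # [True, False, True, False]
```

Seeing all members side by side is what lets you compare the rows that share a name.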

To find out the nature of these findings we need a closer look at the data. Let's use [dtale](https://github.com/man-group/dtale):

In [None]:
from dtale import show as dtshow


dtshow(worldmap[worldmap.duplicated(subset=['FORMAL_EN'],keep=False)].drop(columns='geometry'))

As you can see, the map data is not perfect because of the complexity of international law and the like. Notice the presence of "-99" and "-099": those values represent missing data. You may want to keep them that way, as some map formats may not work as expected with missing values.
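Because these codes are ordinary strings, `isna()` will not report them. A minimal sketch, on a made-up table (the `codes` frame and its values are hypothetical), of counting them per column without altering the stored data:

```python
import pandas as pd

# Hypothetical codes; "-99" and "-099" stand in for missing values.
codes = pd.DataFrame({
    "ISO_A2": ["PE", "-99", "CL"],
    "ISO_A3": ["PER", "-99", "CHL"],
    "ISO_N3": ["604", "-099", "152"],
})

sentinels = ["-99", "-099"]

# Count sentinel cells per column; the data itself is not modified.
sentinel_counts = codes.isin(sentinels).sum()
print(sentinel_counts)

# For analysis only: a separate copy where sentinels become missing values.
codes_na = codes.mask(codes.isin(sentinels))
print(codes_na.isna().sum())
```

The original `codes` table still holds the text codes, so it can be written to a map file unchanged.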

Let's update our identifiers and create a new map:

In [None]:
identifiers=['TYPE','FORMAL_EN','WB_NAME','NAME_EN','FIPS_10_','ISO_A2', 'ISO_A3',"ISO_A3_EH",'ISO_N3','UN_A3',"WB_A2",'WB_A3','REGION_UN','geometry']
mapWorld=worldmap.loc[:,identifiers].copy() # independent copy, so later edits do not touch worldmap

#then
mapWorld[mapWorld.duplicated(subset=['FORMAL_EN'],keep=False)].drop(columns='geometry')

Now that we have a simpler map, it would be good to have ISO code columns with no repeated values (choosing the replacement codes requires some research):

In [None]:
mapWorld.loc[[234,235,236,249],'ISO_A2']=['BQ','BQ','BQ','TK']
mapWorld.loc[[234,235,236,249],'ISO_A3']=['BQ1','BQ2','BQ3','TKL']
mapWorld.loc[[234,235,236,249],'ISO_A3_EH']=['BQ1','BQ2','BQ3','TKL']
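Row labels such as 234 or 249 are only valid for this particular file and row order. As a possible alternative, and assuming some name column is unique, the fixes could be keyed by name instead. The sketch below uses a made-up table; the names and codes are placeholders, not World Bank values:

```python
import pandas as pd

# Hypothetical rows; names and codes are placeholders.
toy = pd.DataFrame({
    "WB_NAME": ["Island X", "Island Y", "Mainland Z"],
    "ISO_A3": ["-99", "-99", "ZZZ"],
})

# Map a unique name to its corrected code.
fixes = {"Island X": "XX1", "Island Y": "XX2"}

# Overwrite only the rows whose name appears in the mapping.
to_fix = toy["WB_NAME"].isin(list(fixes))
toy.loc[to_fix, "ISO_A3"] = toy.loc[to_fix, "WB_NAME"].map(fixes)
print(toy)
```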

It seems **WB_NAME** would be the best candidate for unique names. Let's verify:

In [None]:
mapWorld[mapWorld.duplicated(subset=['WB_NAME'],keep=False)].drop(columns='geometry')
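An empty result above means the column has no repeats. A quicker yes/no check is `Series.is_unique`; here is a minimal sketch on a made-up table (the `toy` frame and its columns are hypothetical):

```python
import pandas as pd

# Hypothetical table: 'code' repeats, 'name' does not.
toy = pd.DataFrame({
    "code": ["AA", "AA", "BB"],
    "name": ["Alpha North", "Alpha South", "Beta"],
})

# is_unique is True when no value repeats in the column.
uniqueness = {col: bool(toy[col].is_unique) for col in toy.columns}
print(uniqueness)  # {'code': False, 'name': True}
```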

Let's check the missing values:

In [None]:
# which rows carry a missing-value code ("-99" or "-099") in the ISO columns?
mapWorld[(mapWorld.loc[:,['ISO_A2','ISO_A3','ISO_A3_EH']].isin(["-99","-099"])).any(axis=1)]

The cases of France, Norway and Kosovo may catch your attention. These are clearly mistakes we need to fix (the Kosovo ISO3 code came from [here](https://knowledgecenter.zuora.com/Quick_References/Country%2C_State%2C_and_Province_Codes/A_Country_Names_and_Their_ISO_Codes)):

In [None]:
mapWorld.loc[[20,50,232,238,239],'ISO_A2']=['FR','NO','CP','CC','CX']
mapWorld.loc[[20,50,62,232,238,239],'ISO_A3']=['FRA','NOR','XKX','CPT','CCK','CXR']
mapWorld.loc[[50,62,232,238,239],'ISO_A3_EH']=['NOR','XKX','CPT','CCK','CXR']

Let's recheck the repeated ISOs:

In [None]:
mapWorld[mapWorld.duplicated(subset=['ISO_A3'],keep=False)].drop(columns='geometry')

Notice that the only repeats left are the **UMI** rows (the ISO code for the United States Minor Outlying Islands). Are these polygons neighbors?

In [None]:
mapWorld[mapWorld.ISO_A3.isin(['UMI'])].explore(location=[10, -130],
                                                zoom_start=3,tiles='CartoDB positron',
                                                color='red',style_kwds={'weight':5})

It is good that each row has a different polygon. If they were neighbors you could try to *glue* them into one geometry, but this is not the case.

We know there are no empty cells, but there might be cells with just one space (or more):

In [None]:
CellOnlySpaces=lambda x:x.str.fullmatch(r'\s+') # True only when the whole cell is whitespace
where=['TYPE','FORMAL_EN','WB_NAME','NAME_EN']
mapWorld[(mapWorld.loc[:,where].apply(CellOnlySpaces)).any(axis=1)]
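For this kind of check the difference between `str.match` and `str.fullmatch` matters: the first only anchors the pattern at the start of the string, while the second requires the whole string to match. A minimal sketch on made-up cells (the `cells` series is hypothetical):

```python
import pandas as pd

# Hypothetical cells: blank, blank, padded text, clean text.
cells = pd.Series([" ", "  ", " Peru", "Chile"])

# match: anchored at the start only, so leading spaces are enough.
starts_with_space = cells.str.match(r"\s+")
# fullmatch: the whole cell must be whitespace.
only_spaces = cells.str.fullmatch(r"\s+")

print(starts_with_space.tolist())  # [True, True, True, False]
print(only_spaces.tolist())        # [True, True, False, False]
```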

We have a column that could be used to replace those values:

In [None]:
mapWorld.loc[mapWorld.FORMAL_EN.str.fullmatch(r'\s+'),'WB_NAME']

We can complete those cells with:

In [None]:
# getting rid of any text \w+. within parentheses \(\), with one space before, and possibly some after
mapWorld.loc[mapWorld.FORMAL_EN.str.fullmatch(r'\s+'),'WB_NAME'].str.replace(r'\s\(\w+.\)\s*',"",regex=True)
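To see what this pattern does and does not remove, here is a small sketch using Python's `re` module on made-up names (they are hypothetical, not taken from the World Bank file):

```python
import re

# Same pattern as in the cell above.
pattern = r'\s\(\w+.\)\s*'

# Hypothetical names for illustration only.
names = ["Example Island (Fr.)", "Sample Reef (US)", "Plainland", "Two Words (Two Words)"]
cleaned = [re.sub(pattern, "", name) for name in names]
print(cleaned)
# ['Example Island', 'Sample Reef', 'Plainland', 'Two Words (Two Words)']
```

Because `\w+` does not match spaces, a parenthesis holding more than one word is left untouched, so it is worth inspecting the result before overwriting any column.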

In [None]:
# then,
blankFormal=mapWorld.FORMAL_EN.str.fullmatch(r'\s+') # rows whose FORMAL_EN is only whitespace
newNames=mapWorld.loc[blankFormal,'WB_NAME'].str.replace(r'\s\(\w+.\)\s*',"",regex=True)
mapWorld.loc[blankFormal,'FORMAL_EN']=newNames

In [None]:
# confirming
mapWorld[(mapWorld.loc[:,where].apply(CellOnlySpaces)).any(axis=1)]

Let's save the file:

In [None]:
mapWorld.to_file(os.path.join("maps","mapWorld.gpkg"), layer='countries', driver="GPKG")