Admin2 Import #1492

Open
geohacker opened this issue Aug 11, 2022 · 17 comments

@geohacker

After importing admin1 data and building a workflow to update geometries, attributes, and Mapbox vector tilesets, we will now do much the same for admin2. #470

Workflow

The workflow will remain largely the same. We'll write management commands that read a shapefile to create new admin2 geometries or update existing ones from a new dataset. This gives us an easy way to fix incorrect or disputed geoms. The geometries will be stored in a separate table (similar to admin1), not in the districts table, so the performance of existing GO API endpoints won't be affected.

We'll also add a query param to the API for fetching geometries, and write a script to update Mapbox tiles when needed.
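
As a rough illustration of the create-or-update behaviour described above, here is a minimal sketch in plain Python. It is not the actual management command: `Admin2Row`, `import_admin2`, the `update_geom` flag and the sample pcode are all hypothetical, and a dict stands in for the separate admin2 geometry table.

```python
from dataclasses import dataclass


@dataclass
class Admin2Row:
    """Hypothetical stand-in for a row in the separate admin2 geometry table."""
    code: str
    name: str
    admin1_id: int
    geom_wkt: str


def import_admin2(features, table, update_geom=False):
    """Create rows for unseen pcodes; refresh geometry of known ones on request.

    `features` is an iterable of dicts as they might be read from a shapefile,
    `table` is a dict keyed by pcode that plays the role of the database table.
    """
    created, updated = [], []
    for feat in features:
        code = feat["code"]
        if code not in table:
            table[code] = Admin2Row(code, feat["name"], feat["admin1_id"], feat["geom_wkt"])
            created.append(code)
        elif update_geom:
            # Only the geometry is replaced, so a disputed boundary can be fixed in place.
            table[code].geom_wkt = feat["geom_wkt"]
            updated.append(code)
    return created, updated


table = {}
features = [{"code": "XX0101", "name": "Sample A", "admin1_id": 1,
             "geom_wkt": "POLYGON((0 0,1 0,1 1,0 0))"}]
print(import_admin2(features, table))                    # first run creates the row
features[0]["geom_wkt"] = "POLYGON((0 0,2 0,2 2,0 0))"
print(import_admin2(features, table, update_geom=True))  # second run updates the geometry
```

Keying on the pcode means re-running an import with a corrected dataset touches only the rows that already exist under that code.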

Base data source

We think the admin boundaries from FEWS are a good baseline. FEWS combines the best of FAO-GAUL, GADM, and the Humanitarian Data Exchange (HDX), which uses the UN OCHA datasets. FEWS also incorporates standard names from the GEONet Names database.

It's not perfect, but with a workflow that makes updates easy, we should be able to fix issues as they are reported. We have had good experiences using FEWS in a few different projects.

@tovari In our dev catchup call the other day, you mentioned a few cases where FEWS wasn't reliable. Do you mind outlining them here? We can probably catch these early on and look for alternatives for those countries.

cc @batpad @LukeCaley @frozenhelium @szabozoltan69

@tovari

tovari commented Aug 12, 2022

The issue with FEWS is that it doesn't contain local names. E.g. in the case of Ukraine, no Cyrillic names are available, only the English transliterations.
The other problem is the coverage. It has good coverage in Africa, but not on other continents. A list of admin2 layers per country is shared here

@geohacker

At the GO Sprint in Kathmandu we decided to go ahead with the OCHA CODs for admin2 that are published on geoboundaries.org. Relying on CODs allows us to import progressively, without making drastic changes to the data all at once. We decided to start with an inspection of the data and how it lines up with the existing admin0 and admin1 data in GO. We also decided to consider importing countries in the Caribbean to start with.

I started looking at OCHA COD admin2 data for importing to GO. Here are some findings for Haiti and Kosovo:

  • The OCHA admin boundaries almost never line up well with the existing boundaries in GO. This means we’ll have to import admin1 and admin2 from the same data source.

  • Some small areas are missing from the OCHA admin1 polygons compared to what’s already in GO from ICRC

  • For Kosovo, looking at admin0, there are some shifts in the boundary

  • Looks like Kosovo admin2 is actually what we use as admin1 in the GO database. But then this is not part of the OCHA COD.

@geohacker

The issues illustrated above are not particularly surprising, but we needed to look at them with good examples. I think we should work towards getting reliable admin2 boundaries into the database without removing the admin1 and admin0 data that came from ICRC. Some thoughts:

  1. Changing the ICRC admin0 and admin1 data means we'll have to start from scratch in terms of the boundaries, disputed and overseas territories managed in GO right now
  2. This will have an impact on existing field reports as we change the admin1 dataset. We may have to manually remap or deprecate some of the old admin1s
  3. From what I can see, CODs aren't available consistently for all countries (see stats here https://cod.unocha.org/). This means we'll be mixing admin1s and admin2 which will lead to a lot of issues similar to Kosovo above. This will be misleading.

For the GO API and Risk Module use cases, I think we can do the following:

  1. Stick to admin2 from OCHA CODs
  2. Import admin2s country by country without replacing existing admin1s
  3. Ensure there's a proper mapping from admin2s to admin1s. This could be done manually through workflows using QGIS or tools used around DEEP
  4. Support admin2 map based selection tool on GO
  5. Visualize admin2 and admin1 in a mutually exclusive way in the style. This will still have some edge cases but will be largely ok
  6. Add a disclaimer about mixed data sources and documentation users can read to understand why this is the case.

cc @batpad @tovari @LukeCaley @justinginnetti

@geohacker

Thanks for the productive discussion today @tovari @LukeCaley @justinginnetti @batpad. We are in agreement to move forward with the above approach: we won't replace all admin1s, only those where it's absolutely necessary, for reasons like:

  • admin1 in GO is outdated (potential example, Hungary)
  • admin1 level data in GO is actually admin2 as per OCHA in which case we have to import new admin1 from OCHA

In terms of next steps:

  • We'll work on getting 5-10 countries with admin2 data in GO
  • This will focus on the data and on documenting the workflows for spatial and name matching to identify admin1 ids in GO, so the new admin2s show up under the correct admin1s.

Over the next couple days, I'll update this ticket with progress.

@geohacker

geohacker commented Oct 19, 2022

I'm continuing this work in PR #1557.

Haiti

image

This lines up pretty well with the admin1 data that's already in GO, so we don't need to replace it.

Now to get the admin1 ID from GO into the Haiti admin2 shapefile, this is my workflow:

  • Step 1: Open the Haiti admin2 shapefile from geoboundaries (COD via OCHA) in QGIS
  • Step 2: Open the admin1 layer from GO by connecting the GO database locally with QGIS (could also be done with a remote staging database)
  • Step 3: Create a centroid layer of the polygon layer using Vector > Geometry Tools > Centroids

image

  • Step 4: Use the Join Attributes by Location tool to add the admin1 ID (district_id) to the centroid layer

image

    • Select the base layer as the Centroid layer
    • Select the join layer as api_districtgeoms
    • Select the predicate as intersects
    • Select the field as district_id
    • Run; this will create another Centroid layer with `district_id` as a new attribute

image

  • Step 5: Use attribute join to combine the new Centroid layer with the Haiti admin2 shapefile, following the instructions above but using admin2 as the base layer, the Centroid layer as the join layer, and district_id as the field
  • Step 6: Save the new layer as a shapefile to import into GO
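
For readers who want to see what the centroid join in steps 3 to 5 amounts to outside QGIS, here is a small sketch in plain Python. It is only an approximation of the idea: the helper names and the toy square geometries are made up for illustration, polygons are single rings without holes, and QGIS itself may handle edge cases differently.

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    area2 = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:] + ring[:1]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3 * area2), cy / (3 * area2)


def point_in_polygon(point, ring):
    """Ray-casting test: True if the point lies inside the ring."""
    x, y = point
    inside = False
    for (x0, y0), (x1, y1) in zip(ring, ring[1:] + ring[:1]):
        if (y0 > y) != (y1 > y):
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


def join_admin1_id(admin2_layer, admin1_layer):
    """Copy district_id from the admin1 polygon containing each admin2 centroid."""
    joined = []
    for feat in admin2_layer:
        centre = polygon_centroid(feat["ring"])
        match = next((a1["district_id"] for a1 in admin1_layer
                      if point_in_polygon(centre, a1["ring"])), None)
        joined.append({**feat, "district_id": match})
    return joined


# Toy data: two square admin1s side by side, one admin2 inside each.
admin1_layer = [
    {"district_id": 10, "ring": [(0, 0), (4, 0), (4, 4), (0, 4)]},
    {"district_id": 20, "ring": [(4, 0), (8, 0), (8, 4), (4, 4)]},
]
admin2_layer = [
    {"code": "XX0101", "ring": [(0, 0), (2, 0), (2, 2), (0, 2)]},
    {"code": "XX0201", "ring": [(5, 1), (7, 1), (7, 3), (5, 3)]},
]
for row in join_admin1_id(admin2_layer, admin1_layer):
    print(row["code"], row["district_id"])
```

An admin2 whose centroid falls in no admin1 polygon gets `None`, which is the same situation as a NULL in the QGIS attribute table.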

@geohacker

geohacker commented Oct 20, 2022

Colombia

There are CODs available for Colombia. This is the workflow I used. The goal is to have an admin2 shapefile for Colombia with the following attributes: shapeName, pcode, and admin1_id (which needs to be derived, as above, from the GO admin1 data).

Inspect the admin2 and admin1 data against existing GO admin1

image
image

It all looks good in terms of territories, with some minor issues likely due to different geometry simplifications. So we don't need to change the admin1 data.

Match admin1 id to admin2 COD

Create centroids
Centroids won't work really well for this matching due to geometries like the one below
image

For this admin2 polygon, the centroid is actually outside the geometry. One could use geometric center instead of centroid but it might be better to prepare random points inside the geometry for the matching.

Create random points inside polygon
image

Set number of points as 1 in the dialog and create a new temporary layer.
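
To make the centroid problem concrete, here is a sketch in plain Python using a made-up U-shaped polygon. It shows the centroid landing outside the shape and then draws one random point that is guaranteed to be inside, using simple rejection sampling within the bounding box. This is an illustration of the idea only, not a description of how the QGIS tool is implemented; all names here are hypothetical.

```python
import random


def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    area2 = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:] + ring[:1]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3 * area2), cy / (3 * area2)


def point_in_polygon(point, ring):
    """Ray-casting test: True if the point lies inside the ring."""
    x, y = point
    inside = False
    for (x0, y0), (x1, y1) in zip(ring, ring[1:] + ring[:1]):
        if (y0 > y) != (y1 > y):
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


def random_point_inside(ring, rng, max_tries=10_000):
    """Draw points in the bounding box until one falls inside the polygon."""
    xs, ys = zip(*ring)
    for _ in range(max_tries):
        candidate = (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
        if point_in_polygon(candidate, ring):
            return candidate
    raise RuntimeError("no interior point found")


# A U-shaped polygon: the centroid sits in the gap between the two arms.
u_shape = [(0, 0), (6, 0), (6, 6), (5, 6), (5, 1), (1, 1), (1, 6), (0, 6)]

centroid = polygon_centroid(u_shape)
print(centroid, point_in_polygon(centroid, u_shape))  # the centroid is not inside

point = random_point_inside(u_shape, random.Random(0))
print(point, point_in_polygon(point, u_shape))        # the sampled point is inside
```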

Join the random points layer with admin1 layer to add district_id
Follow steps outlined previously by using the Join Attributes by Location option. In the new joined random points layer, inspect the attribute table.
image

Check if there are any NULL values by clicking the district_id column to sort it. In this case we can see there are two NULLs, meaning that for two admin2s we couldn't find an admin1 match. To inspect why, select the row and then click on 'Zoom map to selected rows'
image

image

Now we can see that the point didn't get a match because it sits outside the admin1 boundary, due to the minor geometry issue. In this case, it's easier to look up the admin1 geom and then edit the id column manually.

image

The ID is 642. To update, follow the steps below.

  • Go back to the attribute table
  • Use the Edit button
    image
  • And add the 642 id for the missing cell as we identified
  • Repeat this for the next NULL item.
  • Once done, click on the Edit button again and Save
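
The NULL check and manual fix above could also be scripted once the joined table is exported. The following is a minimal sketch under that assumption: rows are plain dicts, `find_unmatched` and `apply_manual_ids` are hypothetical helpers, and the admin2 codes are placeholders. Only the id 642 comes from the example above.

```python
def find_unmatched(rows, field="district_id"):
    """Return the codes of rows where the spatial join produced no admin1 id."""
    return [row["code"] for row in rows if row.get(field) is None]


def apply_manual_ids(rows, overrides, field="district_id"):
    """Fill missing ids from a hand-checked {admin2 code: admin1 id} mapping."""
    fixed = []
    for row in rows:
        if row.get(field) is None and row["code"] in overrides:
            row = {**row, field: overrides[row["code"]]}
        fixed.append(row)
    return fixed


rows = [
    {"code": "XX001", "district_id": 640},
    {"code": "XX002", "district_id": None},  # point fell just outside the admin1
]
print(find_unmatched(rows))                   # codes that need a manual look
rows = apply_manual_ids(rows, {"XX002": 642})
print(find_unmatched(rows))                   # empty once every NULL is resolved
```

Keeping the overrides in a small mapping also leaves a record of which ids were assigned by hand.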

Now join this random points layer with the admin2 polygon layer using the Join Attributes by Location tool.
In the end, it's important to make sure all the joined layers have the same feature count.
image

Finally, rename district_id to admin1_id and save as shapefile.
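
The feature-count check and the final rename are easy to express in code as well. This is a small sketch with hypothetical helper names, assuming each layer has been loaded as a list of dict rows.

```python
def check_feature_counts(**layers):
    """Raise if the named layers do not all have the same number of features."""
    counts = {name: len(features) for name, features in layers.items()}
    if len(set(counts.values())) > 1:
        raise ValueError(f"feature counts differ: {counts}")
    return counts


def rename_field(rows, old, new):
    """Return copies of the rows with one attribute renamed."""
    return [{(new if key == old else key): value for key, value in row.items()}
            for row in rows]


admin2_polygons = [{"code": "XX001"}, {"code": "XX002"}]
joined = [{"code": "XX001", "district_id": 640}, {"code": "XX002", "district_id": 642}]

check_feature_counts(admin2=admin2_polygons, joined=joined)
final_rows = rename_field(joined, "district_id", "admin1_id")
print(final_rows)
```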

image

@geohacker

geohacker commented Oct 21, 2022

I thought I'd look at the Ukraine admin2 boundaries, which are getting a lot of movement on the HDX page: https://data.humdata.org/dataset/cod-ab-ukr. The data I'm looking at was updated on October 11, 2022 (ukr_adm_sspe_20221005.zip, SHP).

Looking at admin1 and admin2

image
All good. Some minor polygon simplification issues but we can stick to our existing admin1 data.

image

admin2 also looks good.

image
The column names are different, so we have to make sure to rename them.

I followed the same steps as above

  1. Create random points
  2. Join that with admin1 to add district_id to the random points layer, inspect
  3. Join the new joined random points layer to the admin2 polygon layer to add district_id to polygons, inspect
  4. Rename the columns to name, code and admin1_id
  5. Export to shapefile
  6. 🎉

Checking an admin2 in the GO Admin
image

@geohacker

Same workflow as above for Venezuela
image

@tovari

tovari commented Oct 27, 2022

@geohacker, would you mind listing the mandatory fields, with their types, for the admin2 geo files? Should it be a shp, geojson, or something else?

@geohacker

@tovari sure! Currently we support only shapefiles, with three mandatory fields: name, the name of the admin2 (shapeName in CODs); code, the pcode (pcode in CODs); and admin1_id, the admin1 ID from the GO database.
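
As an illustration of the mandatory fields, a pre-import check on each record might look like the sketch below. This is not the import script itself: `normalise_record` and the two constants are hypothetical, and the record is a plain dict as it might be read from a shapefile's attribute table.

```python
# COD column names mapped to the names expected for import.
COD_TO_GO = {"shapeName": "name", "pcode": "code"}
MANDATORY = ("name", "code", "admin1_id")


def normalise_record(record):
    """Rename COD columns and fail early if a mandatory field is missing or empty."""
    renamed = {COD_TO_GO.get(key, key): value for key, value in record.items()}
    missing = [field for field in MANDATORY if renamed.get(field) in (None, "")]
    if missing:
        raise ValueError(f"missing mandatory fields: {missing}")
    return renamed


print(normalise_record({"shapeName": "Sample A", "pcode": "XX0101", "admin1_id": 640}))
```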

@tovari

tovari commented Oct 27, 2022

Thanks @geohacker! What optional fields can be added? I'm thinking about e.g. local_name and LN_lang_code, alternate_name and AN_lang_code.
I'm not sure if it makes sense to add an option for a local admin ID or for population data.

@geohacker

At the moment, we don't have any other fields (https://github.com/IFRCGo/go-api/blob/develop/api/models.py#L265-L272), but we can certainly add some to account for names in other languages. We should stay consistent with how we handle languages for admin1 and regions, though, with columns called name, name_es, name_ar, name_fr, name_en.

I think we should not store population data in the admin2 table, because it probably needs to be updated more regularly. Ideally that data should live in a different table with a pcode mapping, so we don't have to touch the geometries when we update population data. And we should add it only if there's an immediate use case.
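
To illustrate the separate-table idea, here is a small sketch using Python's built-in sqlite3. The table names, columns and values are all made up for the example and do not reflect the GO schema; the point is only that population rows can be added or replaced by pcode without touching the geometry row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE admin2 (code TEXT PRIMARY KEY, name TEXT, geom_wkt TEXT);
CREATE TABLE admin2_population (
    code TEXT REFERENCES admin2(code),
    year INTEGER,
    population INTEGER,
    PRIMARY KEY (code, year)
);
""")

# One geometry row, imported once.
conn.execute("INSERT INTO admin2 VALUES ('XX0101', 'Sample district', 'POLYGON((0 0,1 0,1 1,0 0))')")

# Population figures (invented numbers) are added independently of the geometry.
conn.execute("INSERT INTO admin2_population VALUES ('XX0101', 2021, 1000)")
conn.execute("INSERT INTO admin2_population VALUES ('XX0101', 2022, 1100)")

latest = conn.execute("""
    SELECT a.name, p.year, p.population
    FROM admin2 AS a
    JOIN admin2_population AS p ON p.code = a.code
    ORDER BY p.year DESC
    LIMIT 1
""").fetchone()
print(latest)
```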

@tovari

tovari commented Oct 27, 2022

Ok, agree on not including the population data.

I think names in the local language matter at lower admin levels, as mostly there won't be en, es, fr, ar versions of the names. There might be transliterations to Latin from other alphabets, but I think we should still preserve the local names written in the local alphabet. An alternate name and alphabet might be relevant as well in multilingual countries. Thus we will have an option to store 2 versions of the names in 2 languages.
I assume name is the name transliterated to Latin in this case.

@geohacker

@tovari ok makes sense! I've just added local_name, local_name_code, alternate_name and alternate_name_code as optional fields. The import script will also look for the presence of these columns in the shapefile and import accordingly.
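
Roughly, picking up the optional columns only when they are present could look like the sketch below. The four field names are the ones listed above; `build_admin2_kwargs`, the sample record, and the assumption that local_name_code holds a language code such as "uk" are illustrative only.

```python
OPTIONAL_FIELDS = ("local_name", "local_name_code", "alternate_name", "alternate_name_code")


def build_admin2_kwargs(record):
    """Collect mandatory fields, plus any optional name fields that are present and non-empty."""
    kwargs = {
        "name": record["name"],
        "code": record["code"],
        "admin1_id": record["admin1_id"],
    }
    for field in OPTIONAL_FIELDS:
        if record.get(field):
            kwargs[field] = record[field]
    return kwargs


# A record with a local name (placeholder values) and one without.
with_local = {"name": "Pryklad", "code": "XX0101", "admin1_id": 640,
              "local_name": "Приклад", "local_name_code": "uk"}
plain = {"name": "Sample A", "code": "XX0102", "admin1_id": 640}

print(build_admin2_kwargs(with_local))
print(build_admin2_kwargs(plain))
```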

Just to note that the OCHA COD shapefiles we are importing do not have local name fields, so currently all of them have only the default name field.

@geohacker

The PR #1557 is now ready for review. So far, we have prepared and imported (locally):

  • Colombia
  • El Salvador
  • Guatemala
  • Haiti
  • Honduras
  • Ukraine
  • Venezuela

Once the PR is merged, we can import these on staging to test. cc @batpad

@geohacker

A workflow to import admin2 is now merged to develop. This also includes methods to create, update and publish Mapbox tilesets. At the moment, there's a sample Mapbox map style with some admin2.

image
image

The process is documented in the README

@tovari

tovari commented Dec 7, 2022

I did the admin1-2 matching a bit differently, to make sure we link admin2s to the correct admin1 even when there are significant deviations between the OCHA and ICRC admin1s:

  1. Create random points inside the OCHA admin1 polygons.
  2. Spatially join those points with the GO admin1 layer to add district_id to the random points layer.
  3. Join (1:many) the joined random points layer to the admin2 polygon layer on pcodes to add district_id to the polygons, then inspect.
  4. Rename the columns to name, code and admin1_id.
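
The pcode-based transfer in steps 2 to 4 can be sketched as a lookup followed by a 1:many join. This assumes the admin2 attribute table carries the pcode of its parent OCHA admin1; the column name `adm1_pcode`, the helper names and the sample values are hypothetical.

```python
def admin1_lookup(ocha_admin1_points):
    """Map each OCHA admin1 pcode to the GO district_id found by the spatial join."""
    return {pt["adm1_pcode"]: pt["district_id"] for pt in ocha_admin1_points}


def attach_admin1_id(admin2_rows, lookup):
    """1:many join: every admin2 inherits the id of its parent admin1 pcode."""
    return [{**row, "admin1_id": lookup.get(row["adm1_pcode"])} for row in admin2_rows]


# One random point per OCHA admin1, already joined to the GO admin1 layer.
points = [{"adm1_pcode": "XX01", "district_id": 640},
          {"adm1_pcode": "XX02", "district_id": 642}]
admin2_rows = [{"code": "XX0101", "adm1_pcode": "XX01"},
               {"code": "XX0102", "adm1_pcode": "XX01"},
               {"code": "XX0201", "adm1_pcode": "XX02"}]

for row in attach_admin1_id(admin2_rows, admin1_lookup(points)):
    print(row["code"], row["admin1_id"])
```

Because the link goes through the pcode rather than through admin2 geometry, an admin2 keeps its parent even where the OCHA and ICRC admin1 borders disagree.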

Check:

  1. Create random points inside the admin2 polygons.
  2. Spatially join those points with the GO admin1 layer to add a second district_id to the random points.
  3. Check whether this second district_id matches the district_id from the process above. If they don't match, the admin2's inside point falls outside the admin1 that should cover it.
  4. List the admin2s with significant discrepancies and inspect their polygon borders.
  5. Export admin2 to shapefile.

The check method may not find all discrepancies, but it is likely to find them when a large part of the admin2 lies outside the ICRC admin1.

One example of a detected discrepancy:
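
The comparison in step 3 of the check boils down to lining up two ids per admin2 and listing the mismatches. A minimal sketch, with hypothetical names and placeholder values:

```python
def find_discrepancies(admin2_rows, spatial_ids):
    """Return admin2 codes whose pcode-derived admin1_id differs from the spatial one.

    `spatial_ids` maps admin2 code -> district_id obtained from a random point
    inside that admin2 (None if the point fell in no admin1 at all).
    """
    return [row["code"] for row in admin2_rows
            if spatial_ids.get(row["code"]) != row["admin1_id"]]


admin2_rows = [{"code": "XX0101", "admin1_id": 640},
               {"code": "XX0102", "admin1_id": 640},
               {"code": "XX0201", "admin1_id": 642}]
spatial_ids = {"XX0101": 640, "XX0102": 642, "XX0201": None}

print(find_discrepancies(admin2_rows, spatial_ids))  # codes to inspect by hand
```

Since each admin2 is tested with a single random point, a clean result does not prove the borders agree; it only flags the cases where the sampled point happens to land in a different admin1.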
image
Admin1 update should follow in such cases.

cc: @geohacker, David, @jhenshall
