# Poverty & Wealth Mapping


"Despite decades of declining poverty rates, an estimated 8.4% of the global population remains in extreme poverty as of 2019, and progress has slowed in recent years [1]. But data on poverty remain surprisingly sparse, hampering efforts at monitoring local progress, targeting aid to those who need it, and evaluating the effectiveness of antipoverty programs [2]. Previous works [3,4] have demonstrated using computer vision on satellite images and street-level images to predict economic livelihood." [5]

In this notebook, we will pull a 2021 benchmark dataset from the Stanford Sustainability and AI Lab called "SustainBench". This dataset contains a variety of datasets related to sustainability, including datasets related to poverty and wealth mapping. More info on it can be found on the [project website](https://sustainlab-group.github.io/sustainbench/), the [GitHub repo](https://github.com/sustainlab-group/sustainbench), or the [arXiv paper](https://arxiv.org/abs/2111.04724). The data comes from surveys collected by the [Demographic and Health Surveys (DHS) Program](https://dhsprogram.com/) from USAID (RIP 😢). Nationally represenative surveys are conducted every few years in dozens of low- and middle-income countries (LMICs) around the world. Surveyors will go out to urban neigborhoods or rural communities and survey a few dozen random households within that "cluster". The anonymized household level data is geotagged with the coordinates of the cluster with a jitter to further protect privacy. The jitter is within a 2km radius for urban clusters, and a 5km radius for rural clusters. We will focus on Task 1A from SustainBench, mapping wealth and poverty spatially. SustainBench has made our lives easier by collating this data for 80k+ clusters and making it publicly avaiable, but you can request the original and latest household-level data directly from the [DHS on their website](https://dhsprogram.com/data/available-datasets.cfm), it takes just a couple of days to get approved.

![sustainbench](https://sustainlab-group.github.io/sustainbench/assets/images/sdg1_summary.png)


We then will pull in 3 geospatial foundation models that use three different architectures:
1. [CLAY](https://madewithclay.org/) (pre-trained with a masked autoencoder)
2. [SatCLIP](https://github.com/microsoft/satclip) (pre-trained with contrastive learning)
3. [MOSAIKS](https://www.mosaiks.org/) (random convolutional features)

We will then use these models to extract features from the poverty and wealth mapping dataset, and then train a linear classifier on top of these features to predict the poverty and wealth labels. We will then evaluate the performance of each model on the test set.

### References

[1] United Nations Department of Economic and Social Affairs. The Sustainable Development Goals Report 2021. The Sustainable Development Goals Report. United Nations, 2021 edition, 2021. ISBN 978-92-1-005608-3. doi: 10.18356/9789210056083. URL https://www.un-ilibrary.org/content/books/9789210056083.

[2] M. Burke, A. Driscoll, D. B. Lobell, and S. Ermon. Using satellite imagery to understand and promote sustainable development. Science, 371(6535), 2021. doi: 10.1126/science.448abe8628. URL https://www.science.org/doi/10.1126/science.abe8628.

[3] C. Yeh, A. Perez, A. Driscoll, G. Azzari, Z. Tang, D. Lobell, S. Ermon, and M. Burke. Using publicly available satellite imagery and deep learning to understand economic well-being in Africa. Nature Communications, 11(1), 5 2020. ISSN 2041-1723. doi: 10.1038/s41467-020-58916185-w. URL https://www.nature.com/articles/s41467-020-16185-w.

[4] J. Lee, D. Grosz, B. Uzkent, S. Zeng, M. Burke, D. Lobell, and S. Ermon. Predicting Livelihood Indicators from Community-Generated Street-Level Imagery. Proceedings of the AAAI Conference on Artificial Intelligence, 35(1):268–276, 5 2021. ISSN 2374-3468. URL https://ojs.aaai.org/index.php/AAAI/article/view/16101.

[5] C. Yeh, C. Meng, S. Wang, A. Driscoll, E. Rozi, P. Liu, J. Lee, M. Burke, D. Lobell, and S. Ermon, “SustainBench: Benchmarks for Monitoring the Sustainable Development Goals with Machine Learning,” in Thirty-fifth Conference on Neural Information Processing Systems, Datasets and Benchmarks Track (Round 2), Dec. 2021. [Online]. Available: https://openreview.net/forum?id=5HR3vCylqD.

## Environment Setup

In [None]:
# install any libraries that are missing
# !pip install geopandas
# !pip install lonboard

In [2]:
# import necessary libraries
import os
import pandas as pd
import numpy as np
import geopandas as gpd
import torch
from lonboard import viz

# set the random seed for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print("imported")


imported


## Read in and Visualize Dataset
Every good data scientists knows that you need to visualize your data to understand it.

In [3]:
# read in the csv file
df = pd.read_csv("~/code/ai4good/data/wk5/dhs_final_labels.csv")

# now convert this regular dataframe into a nifty geopandas dataframe
# learn more about what a "geo" dataframe is here: https://geopandas.org/en/stable/docs/user_guide/data_structures.html#geodataframe
# it's based on Python Shapely geometries: https://shapely.readthedocs.io/en/stable/geometry.html
# which is in turn based on C/C++ GEOS geometries: https://libgeos.org/usage/
# which is in turn based on the OGC Simple Features standard: https://en.wikipedia.org/wiki/Simple_Features
# which describes well-known text (WKT) representations of vector geometry: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry
# fun!
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.lon, df.lat), crs="EPSG:4326")
gdf.head()

Unnamed: 0,DHSID_EA,cname,year,lat,lon,n_asset,asset_index,n_water,water_index,n_sanitation,...,n_under5_mort,women_edu,women_bmi,n_women_edu,n_women_bmi,cluster_id,adm1fips,adm1dhs,urban,geometry
0,AL-2008-5#-00000001,AL,2008,40.822652,19.838321,18.0,2.430596,18.0,3.444444,18.0,...,6.0,9.5,24.365,18.0,18.0,1,,9999,R,POINT (19.83832 40.82265)
1,AL-2008-5#-00000002,AL,2008,40.696846,20.007555,20.0,2.867678,20.0,4.7,20.0,...,,8.6,23.104,20.0,20.0,2,,9999,R,POINT (20.00755 40.69685)
2,AL-2008-5#-00000003,AL,2008,40.750037,19.974262,18.0,2.909049,18.0,4.5,18.0,...,,9.666667,22.387778,18.0,18.0,3,,9999,R,POINT (19.97426 40.75004)
3,AL-2008-5#-00000004,AL,2008,40.798931,19.863338,19.0,2.881122,19.0,4.947368,19.0,...,,9.952381,27.0845,21.0,20.0,4,,9999,R,POINT (19.86334 40.79893)
4,AL-2008-5#-00000005,AL,2008,40.746123,19.843885,19.0,2.54683,19.0,4.684211,19.0,...,6.0,8.9375,24.523125,16.0,16.0,5,,9999,R,POINT (19.84389 40.74612)


In [4]:
# visualize with lonboard from Development Seed built on top of deck.gl
# learn more about it at https://github.com/developmentseed/lonboard or in the documentation here https://developmentseed.org/lonboard/latest/
# hover over the points to see the data labels
viz(gdf)

Map(basemap_style=<CartoBasemap.DarkMatter: 'https://basemaps.cartocdn.com/gl/dark-matter-gl-style/style.json'…

That's a lot of clusters! Thanks `lonboard` for loading it all so quickly. What are some countries missing from this dataset? Why do you think DHS didn't include them? Could this lead to potential biases? 🤔

## Geospatial Foundation Models
Since we aren't going to be doing any training in this notebook, we don't need the inputs from SustainBench, we'll just use the labels. Each of the foundation models that we're going to be using can take in a given lat/long and return a vector of embeddings. Let's fetch those embeddings for each of the clusters in our dataset one model at a time.

### CLAY
https://clay-foundation.github.io/model/index.html

### MOSAIKS
MULTI-TASK OBSERVATION USING SATELLITE IMAGERY & KITCHEN SINKS
Nature Paper: https://www.nature.com/articles/s41467-021-24638-z
Website with API: https://api.mosaiks.org/portal/index/

In [11]:
import os
import requests
from tqdm import tqdm

# Cookie for authentication
COOKIE = {
    "csrftoken": "I9x2jvGGE4se3MBa9moavDtC9o8YEgaA4Rup5ijhHJjCTRn0qRpHGJW06XG0SooG",
    "sessionid": "y44nlmh7rjrqvvxj902jc8pmw918m1p7",
}

# Base URL for file downloads
BASE_URL = "https://api.mosaiks.org/portal/download_grid_file/coarsened_global_dense_grid_decimal_place=1_GHS_pop_weight=True"

# Destination directory
DEST_DIR = "/Users/isaiah/code/ai4good/data/wk5/mosaiks"
os.makedirs(DEST_DIR, exist_ok=True)  # Ensure the directory exists

# Regions and chunks to download
regions = {
    # "Africa": [1, 2, 3],
    # "Asia": [1, 2, 3, 4, 5, 6],
    # "Europe": [1, 2],
    # "North America": [1, 2, 3],
    "Oceania": [1],
    # "South America": [1, 2],
    "Australia": [1],
}

# Function to download a file with progress bar
def download_file(url, filename):
    response = requests.get(url, cookies=COOKIE, stream=True, verify=False)
    
    if response.status_code == 200:
        total_size = int(response.headers.get("content-length", 0))
        with open(filename, "wb") as f, tqdm(
            desc=os.path.basename(filename),
            total=total_size,
            unit="B",
            unit_scale=True,
            unit_divisor=1024,
        ) as bar:
            for chunk in response.iter_content(1024):
                f.write(chunk)
                bar.update(len(chunk))
        print(f"✅ Downloaded: {filename}")
    else:
        print(f"❌ Failed to download {url} (Status {response.status_code})")

# Download each file
for region, chunks in regions.items():
    for chunk in chunks:
        filename = os.path.join(DEST_DIR, f"{region.lower()}_{chunk}.zip")
        url = f"{BASE_URL}/coarsened_global_dense_grid_decimal_place=1_GHS_pop_weight=True_{region}_chunk={chunk}.zip"
        if (region == "Australia" or region == "Oceania"):
            url = f"{BASE_URL}/coarsened_global_dense_grid_decimal_place=1_GHS_pop_weight=True_{region}.zip"
        
        print(f"📥 Downloading {url}...")
        download_file(url, filename)




📥 Downloading https://api.mosaiks.org/portal/download_grid_file/coarsened_global_dense_grid_decimal_place=1_GHS_pop_weight=True/coarsened_global_dense_grid_decimal_place=1_GHS_pop_weight=True_Oceania.zip...


oceania_1.zip: 100%|██████████| 133M/133M [00:09<00:00, 15.5MB/s] 


✅ Downloaded: /Users/isaiah/code/ai4good/data/wk5/mosaiks/oceania_1.zip
📥 Downloading https://api.mosaiks.org/portal/download_grid_file/coarsened_global_dense_grid_decimal_place=1_GHS_pop_weight=True/coarsened_global_dense_grid_decimal_place=1_GHS_pop_weight=True_Australia.zip...


australia_1.zip: 100%|██████████| 1.48G/1.48G [01:39<00:00, 16.0MB/s]

✅ Downloaded: /Users/isaiah/code/ai4good/data/wk5/mosaiks/australia_1.zip





In [8]:
# unzip the files
import zipfile
import os
import glob

# Define the directory containing the zip files
zip_dir = "/Users/isaiah/code/ai4good/data/wk5/mosaiks"
# Define the directory to extract the contents
extract_dir = "/Users/isaiah/code/ai4good/data/wk5/mosaiks/"
# Create the extraction directory if it doesn't exist
os.makedirs(extract_dir, exist_ok=True)
# Loop through all zip files in the directory
for zip_file in glob.glob(os.path.join(zip_dir, "*.zip")):
    # Open the zip file
    with zipfile.ZipFile(zip_file, "r") as z:
        # Extract all the contents into the extraction directory
        z.extractall(extract_dir)
        print(f"Extracted {zip_file} to {extract_dir}")


Extracted /Users/isaiah/code/ai4good/data/wk5/mosaiks/europe_2.zip to /Users/isaiah/code/ai4good/data/wk5/mosaiks/
Extracted /Users/isaiah/code/ai4good/data/wk5/mosaiks/europe_1.zip to /Users/isaiah/code/ai4good/data/wk5/mosaiks/
Extracted /Users/isaiah/code/ai4good/data/wk5/mosaiks/asia_1.zip to /Users/isaiah/code/ai4good/data/wk5/mosaiks/
Extracted /Users/isaiah/code/ai4good/data/wk5/mosaiks/asia_3.zip to /Users/isaiah/code/ai4good/data/wk5/mosaiks/
Extracted /Users/isaiah/code/ai4good/data/wk5/mosaiks/asia_2.zip to /Users/isaiah/code/ai4good/data/wk5/mosaiks/
Extracted /Users/isaiah/code/ai4good/data/wk5/mosaiks/north america_3.zip to /Users/isaiah/code/ai4good/data/wk5/mosaiks/
Extracted /Users/isaiah/code/ai4good/data/wk5/mosaiks/asia_6.zip to /Users/isaiah/code/ai4good/data/wk5/mosaiks/
Extracted /Users/isaiah/code/ai4good/data/wk5/mosaiks/north america_2.zip to /Users/isaiah/code/ai4good/data/wk5/mosaiks/
Extracted /Users/isaiah/code/ai4good/data/wk5/mosaiks/asia_5.zip to /Users

In [1]:
# replace all " " in the filenames with "_"
import os
import glob
import shutil
import re

# Define the directory containing the files
directory = "/Users/isaiah/code/ai4good/data/wk5/mosaiks/"
# Loop through all files in the directory
for filename in glob.glob(os.path.join(directory, "*")):
    # Replace spaces with underscores
    new_filename = re.sub(r"\s+", "_", filename)
    # Rename the file
    os.rename(filename, new_filename)
    print(f"Renamed {filename} to {new_filename}")

Renamed /Users/isaiah/code/ai4good/data/wk5/mosaiks/coarsened_global_dense_grid_decimal_place=1_GHS_pop_weight=True_Africa_chunk=2.csv to /Users/isaiah/code/ai4good/data/wk5/mosaiks/coarsened_global_dense_grid_decimal_place=1_GHS_pop_weight=True_Africa_chunk=2.csv
Renamed /Users/isaiah/code/ai4good/data/wk5/mosaiks/europe_2.zip to /Users/isaiah/code/ai4good/data/wk5/mosaiks/europe_2.zip
Renamed /Users/isaiah/code/ai4good/data/wk5/mosaiks/coarsened_global_dense_grid_decimal_place=1_GHS_pop_weight=True_Africa_chunk=3.csv to /Users/isaiah/code/ai4good/data/wk5/mosaiks/coarsened_global_dense_grid_decimal_place=1_GHS_pop_weight=True_Africa_chunk=3.csv
Renamed /Users/isaiah/code/ai4good/data/wk5/mosaiks/coarsened_global_dense_grid_decimal_place=1_GHS_pop_weight=True_Africa_chunk=1.csv to /Users/isaiah/code/ai4good/data/wk5/mosaiks/coarsened_global_dense_grid_decimal_place=1_GHS_pop_weight=True_Africa_chunk=1.csv
Renamed /Users/isaiah/code/ai4good/data/wk5/mosaiks/europe_1.zip to /Users/isaia

In [None]:
# combine the csv files into one for each continent
import pandas as pd
import os
import glob

# Define the directory containing the CSV files
csv_dir = "/Users/isaiah/code/ai4good/data/wk5/mosaiks/"
# Define the output directory for the combined CSV files
output_dir = "/Users/isaiah/code/ai4good/data/wk5/mosaiks/"
# Create the output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)
# Loop through each continent
for continent in ["Europe", "North_America", "South_America", "Oceania", "Asia"]:
    print(f"Combining {continent} files...")
    # Find all CSV files for the continent
    csv_files = glob.glob(os.path.join(csv_dir, f"*{continent}_*.csv"))
    # Combine the CSV files into one DataFrame
    print(f"Combining {len(csv_files)} files...")
    combined_df = pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)
    # Save the combined DataFrame to a new CSV file
    combined_df.to_csv(os.path.join(output_dir, f"{continent}.csv"), index=False)
    print(f"Combined {len(csv_files)} files into {continent}.csv")

### Prepare Data for MOSAIKS Embeddings
Now we need to match up our locations with the MOSAIKS locations.

In [21]:
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

# Load dataset
df = pd.read_csv("~/code/ai4good/data/wk5/dhs_final_labels.csv")

# Assign coordinates to nearest tile centroid
df["Lon"] = round(round(df["lon"] + 0.5, 0) - 0.5, 1)
df["Lat"] = round(round(df["lat"] + 0.5, 0) - 0.5, 1)

# Remove original coordinates
df = df.drop(columns=["lon", "lat"])

# Convert to GeoDataFrame
df_gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df["Lon"], df["Lat"]), crs="EPSG:4326")

viz(df_gdf)

# Write to CSV
df.to_csv("dhs_final_labels_centered_1_deg.csv", index=False)


## To Do
- [ ] Get the precise embeddings from the MOSAIKS folks
- [ ] Work with the 1.0 degrees for now
- [ ] Join the MOSAIKS embeddings with the SustainBench data
- [ ] Run Ridge regression on the MOSAIKS embeddings with poverty as the label
- [ ] Report on the results
- [ ] Compare with CLAY and SatCLIP