# Scraping images from the Mapbox satellite view  
The Mapbox free tier allows 50,000 requests per month: log how many images have already been scraped so that this limit is not exceeded.  
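
One way to keep track of the quota is a small counter persisted to disk. The sketch below is only an illustration: the file name `mapbox_usage.json` and the helpers `load_usage` and `register_requests` are hypothetical names, and the count is only as accurate as the calls made to it (it does not query Mapbox for the real usage).

```python
import json
from datetime import date
from pathlib import Path

MONTHLY_LIMIT = 50_000  # free-tier figure quoted above
COUNTER_FILE = Path("mapbox_usage.json")  # hypothetical file name


def load_usage(path=COUNTER_FILE):
    """Return the usage dict {"YYYY-MM": count}, or {} if no file exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {}


def register_requests(n=1, path=COUNTER_FILE, limit=MONTHLY_LIMIT, today=None):
    """Add n requests to the current month's count and save it.

    Raises RuntimeError, without touching the file, if the limit would be exceeded.
    """
    today = today or date.today()
    month = today.strftime("%Y-%m")
    usage = load_usage(path)
    current = usage.get(month, 0)
    if current + n > limit:
        raise RuntimeError(f"Monthly limit reached: {current}/{limit}")
    usage[month] = current + n
    path.write_text(json.dumps(usage, indent=2))
    return usage[month]
```

Calling `register_requests()` just before each `requests.get(...)` keeps the count on the safe side, since failed requests are counted too.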


Pipeline:
- Define the area of interest (bounding box) from coordinates (lat, lon)
- Get the satellite view of the bounding box from the Mapbox API
- Crop the images to remove the watermark
- Save the images to disk
- Get the bounding boxes of buildings from OSM for the area of interest
- Save the bounding boxes to disk
- Clean overlapping bounding boxes (remove the smaller ones or merge them)
- Save the cleaned bounding boxes to disk
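
The cleaning step can be sketched as follows. This is one possible reading of "remove the smaller ones", not necessarily the rule the project ends up using: boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples, and a box is dropped when more than `threshold` of its area is covered by a larger box that was already kept. The function names and the 0.5 threshold are assumptions; merging is not covered here.

```python
def area(box):
    """Area of an (x_min, y_min, x_max, y_max) box; 0 for empty or inverted boxes."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def intersection_area(a, b):
    """Area of the overlap between boxes a and b (0 if they do not overlap)."""
    return area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))


def drop_overlapping(boxes, threshold=0.5):
    """Keep the largest boxes first and drop smaller boxes mostly covered by a kept one."""
    kept = []
    for box in sorted(boxes, key=area, reverse=True):
        box_area = area(box)
        if box_area == 0:
            continue  # skip degenerate boxes
        if all(intersection_area(box, k) / box_area <= threshold for k in kept):
            kept.append(box)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 4, 4), (9, 9, 20, 20), (30, 30, 31, 31)]
print(drop_overlapping(boxes))
# (1, 1, 4, 4) lies entirely inside (0, 0, 10, 10), so it is removed
```

The comparison is quadratic in the number of boxes, which should be acceptable for the few hundred buildings visible in a single image.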

The areas of interest were chosen by hand:
- The 20 most populous cities in France
- 100 rural locations, with coordinates picked randomly across France
- Possibly some forest areas, to get images without buildings (negative samples)

For the cities:
- We define a fixed-size bounding box around the city center (e.g., 10 km x 10 km)
- We then draw random points inside this bounding box and scrape an image at each one

Then, we can use the pipeline with these points to scrape images and get bounding boxes of buildings.
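
The city sampling described above could look like the sketch below. It assumes a locally flat Earth, which is reasonable for a 10 km box, and draws points uniformly in latitude and longitude. `square_bbox` and `random_points_in_bbox` are hypothetical helper names, not functions defined elsewhere in this notebook.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def square_bbox(lat, lon, side_km=10.0):
    """Return (min_lat, min_lon, max_lat, max_lon) for a square centered on (lat, lon)."""
    half = side_km / 2
    dlat = math.degrees(half / EARTH_RADIUS_KM)
    # A degree of longitude shrinks with cos(latitude)
    dlon = math.degrees(half / (EARTH_RADIUS_KM * math.cos(math.radians(lat))))
    return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)


def random_points_in_bbox(bbox, n, seed=None):
    """Draw n (lat, lon) points uniformly inside the bounding box."""
    rng = random.Random(seed)
    min_lat, min_lon, max_lat, max_lon = bbox
    return [
        (rng.uniform(min_lat, max_lat), rng.uniform(min_lon, max_lon))
        for _ in range(n)
    ]


lyon_bbox = square_bbox(45.764043, 4.835659, side_km=10)
for lat, lon in random_points_in_bbox(lyon_bbox, 3, seed=42):
    print(f"{lat:.6f}, {lon:.6f}")
```

Passing a `seed` makes the sampled points reproducible, which helps when an interrupted scraping run has to be resumed.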

Importing packages

In [None]:
import os
import random
import requests
from pathlib import Path
from dotenv import load_dotenv

CONFIG

In [None]:
# Load environment variables from .env file
load_dotenv()

# MAPBOX
MAPBOX_ACCESS_TOKEN = os.getenv("MAPBOX_ACCESS_TOKEN")

# Image parameters
NB_IMAGES = 2
WIDTH = 512
HEIGHT = 512 + 30  # 30 extra pixels at the bottom, cropped later to remove the watermark
ZOOM = 18  # zoom 18-19 shows buildings and swimming pools clearly

BASE_DIR = Path.cwd() / "data"
RAW_DIR = BASE_DIR / "raw_images"
CROPPED_DIR = BASE_DIR / "cropped_images"

RAW_POLYGON_DIR = BASE_DIR / "raw_polygons"
CLEANED_POLYGON_DIR = BASE_DIR / "cleaned_polygons"


SETUP

In [None]:
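# ======================
# SETUP
# ======================
# OUTPUT_DIR is inferred from the paths printed below (data/scraped_images).
OUTPUT_DIR = BASE_DIR / "scraped_images"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


def random_coord_france():
    """Return a random (lat, lon) pair.

    Assumption: the original helper is not shown in this notebook. This
    stand-in samples a rough bounding box around metropolitan France, so
    some points may fall in the sea or in neighbouring countries.
    """
    return random.uniform(41.3, 51.1), random.uniform(-5.2, 9.6)


# ======================
# DOWNLOAD LOOP
# ======================
for i in range(NB_IMAGES):
    lat, lon = random_coord_france()

    # The Static Images API expects the center as {lon},{lat},{zoom}
    url = (
        "https://api.mapbox.com/styles/v1/mapbox/satellite-v9/static/"
        f"{lon},{lat},{ZOOM}/"
        f"{WIDTH}x{HEIGHT}"
        f"?access_token={MAPBOX_ACCESS_TOKEN}"
    )

    filename = OUTPUT_DIR / f"sat_{i:04d}.jpg"

    response = requests.get(url, timeout=30)

    if response.status_code == 200:
        with open(filename, "wb") as f:
            f.write(response.content)
        print(f"[OK] {filename}")
    else:
        print(f"[ERROR] Image {i} - status {response.status_code}")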



[OK] c:\Users\Prout\Documents\GitHub\SISE_satelitar_identifier\notebooks\data\scraped_images\sat_0000.jpg
[OK] c:\Users\Prout\Documents\GitHub\SISE_satelitar_identifier\notebooks\data\scraped_images\sat_0001.jpg


Trying on a city: Lyon

In [15]:
LYON_COORDS = (45.764043, 4.835659)

In [16]:
lat, lon = LYON_COORDS

# The Static Images API expects the center as {lon},{lat},{zoom}
url = (
    "https://api.mapbox.com/styles/v1/mapbox/satellite-v9/static/"
    f"{lon},{lat},{ZOOM}/"
    f"{WIDTH}x{HEIGHT}"
    f"?access_token={MAPBOX_ACCESS_TOKEN}"
)

# Same output folder as the download loop (see the path printed below)
filename = BASE_DIR / "scraped_images" / "sat_lyon.jpg"
filename.parent.mkdir(parents=True, exist_ok=True)

response = requests.get(url, timeout=30)

if response.status_code == 200:
    with open(filename, "wb") as f:
        f.write(response.content)
    print(f"[OK] {filename}")
else:
    print(f"[ERROR] Lyon image - status {response.status_code}")

[OK] c:\Users\Prout\Documents\GitHub\SISE_satelitar_identifier\notebooks\data\scraped_images\sat_lyon.jpg
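
To fetch the OSM buildings for an image such as the Lyon one, the image first has to be turned back into a geographic bounding box. The sketch below uses standard Web Mercator arithmetic and rests on one assumption that should be checked against the Mapbox documentation: zoom levels are defined on 512 px tiles, and the image is requested without `@2x`. The helper names and the `crop_bottom` parameter are hypothetical. The last function only builds an Overpass QL query string; it does not send it.

```python
import math

TILE_SIZE = 512  # assumption: Mapbox zoom levels are defined on 512 px tiles


def lonlat_to_pixel(lon, lat, zoom):
    """Project (lon, lat) to Web Mercator pixel coordinates at the given zoom."""
    world = TILE_SIZE * 2 ** zoom
    x = (lon + 180.0) / 360.0 * world
    siny = math.sin(math.radians(lat))
    y = (0.5 - math.log((1 + siny) / (1 - siny)) / (4 * math.pi)) * world
    return x, y


def pixel_to_lonlat(x, y, zoom):
    """Inverse of lonlat_to_pixel."""
    world = TILE_SIZE * 2 ** zoom
    lon = x / world * 360.0 - 180.0
    lat = math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * y / world))))
    return lon, lat


def image_bbox(lat, lon, zoom, width, height, crop_bottom=0):
    """Return (south, west, north, east) for an image centered on (lat, lon).

    crop_bottom is the number of pixel rows removed from the bottom,
    e.g. the 30 px watermark strip.
    """
    cx, cy = lonlat_to_pixel(lon, lat, zoom)
    west, north = pixel_to_lonlat(cx - width / 2, cy - height / 2, zoom)
    east, south = pixel_to_lonlat(cx + width / 2, cy + height / 2 - crop_bottom, zoom)
    return south, west, north, east


def overpass_buildings_query(bbox, timeout=25):
    """Build an Overpass QL query for building ways in (south, west, north, east)."""
    south, west, north, east = bbox
    return (
        f"[out:json][timeout:{timeout}];"
        f'way["building"]({south:.6f},{west:.6f},{north:.6f},{east:.6f});'
        "out geom;"
    )


bbox = image_bbox(45.764043, 4.835659, zoom=18, width=512, height=542, crop_bottom=30)
print(bbox)
print(overpass_buildings_query(bbox))
```

Because the watermark strip is cut from the bottom only, the cropped image is no longer centered on the requested coordinates, which is why `crop_bottom` shifts the southern edge rather than shrinking the box symmetrically.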
