<a href="https://colab.research.google.com/github/AditPradana36/mapillary_streetviewimage_scraper/blob/main/mapillary_streetviewimage_scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🛰️ Mapillary Image Scraper

This notebook lets you fetch **Mapillary street-level images** by point coordinates using three different methods:
1. 📸 The **newest image** per coordinate
2. 📅 Filter by a **specific year**
3. 🔁 Filter by a **range of years**

---

### 🎯 Goals:
- To **retrieve at least one representative image per location**
- To enable **temporal comparison** of urban environments using year-based filters
- To support spatial analysis or visual datasets for GIS and urban research

---

### 🔧 What You Need:
- A **Mapillary access token**, which you can obtain by signing up at [mapillary.com](https://www.mapillary.com/).
- An **Excel file (.xlsx)** with three required columns:
  - `ID`: unique identifier
  - `X`: longitude
  - `Y`: latitude

✅ You can either:
- Upload the file manually into Google Colab
- Or access it from your Google Drive (e.g. `/content/drive/MyDrive/foldername/filename.xlsx`)

---
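
Before running the rest of the notebook, it can help to confirm that the required columns listed above are actually present. The snippet below is a minimal sketch of such a check; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
# Hypothetical helper: report which required columns are absent.
REQUIRED_COLUMNS = ("ID", "X", "Y")

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that do not appear in `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]

print(missing_columns(["ID", "X", "Y"]))      # []
print(missing_columns(["ID", "lon", "lat"]))  # ['X', 'Y']
```

Once the Excel file has been loaded with `pd.read_excel`, the same check could be run as `missing_columns(df.columns)`. The comparison is exact, so a header such as `x` or ` X` would be reported as missing.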

### 📁 Sample Excel Format

| ID | X             | Y            |
|----|------------------|-----------------|
| 1  | 106.9023591000    | -6.3759857000   |
| 2  | 106.8292590000    | -6.3672050000   |
| 3  | 106.8293310000    | -6.3659200000   |
| 4  | 106.8340730000    | -6.3673730000   |
| 5  | 106.8339340000    | -6.3656490000   |
| 6  | 106.8321280000    | -6.3667600000   |
| ...| ...               | ...             |

> ℹ️ Save the file as `.xlsx` and refer to it using the `input_excel_path` parameter in the next cell.

---

*By Mohammad Raditia Pradana, 2025*


## 📦 Requirements & Dependencies

Before using this notebook, make sure all the necessary Python packages are installed and imported.

In [None]:
# ✅ Install if not yet available
!pip install mercantile mapbox-vector-tile vt2geojson openpyxl requests

# 📚 Imports
import mercantile, requests, json, os, math
from vt2geojson.tools import vt_bytes_to_geojson
import pandas as pd
from google.colab import drive

In [15]:
# Define the vector tile endpoints
tile_coverage = 'mly1_public'
tile_layer = "image"

## 💾 Optional: Mount Google Drive

If your Excel file is stored in **Google Drive** or you want to **save the downloaded images and metadata** directly to a Drive folder, you need to mount your Google Drive.

> ✅ Skip this step if you're uploading files manually via the Colab file browser or only saving to `/content/`.


In [None]:
from google.colab import drive
drive.mount('/content/drive')


## 🔧 User Input Parameters

Fill in the form fields below to set the required parameters for scraping Mapillary imagery.

### Required Parameters:
- **access_token**: Your Mapillary API access token (starts with `MLY|...`)
- **input_excel_path**: Filepath to your `.xlsx` file containing coordinates  
  👉 Format: must contain columns `ID`, `X` (longitude), and `Y` (latitude)  
  📌 You can upload the file manually or access it from Google Drive.
- **output_dir**: Folder where downloaded images and metadata will be saved  
  (e.g., `/content/` for local Colab or `/content/drive/MyDrive/...` for Google Drive)

In [None]:
# =========================================================
# 🔧 USER INPUT SECTION — Customize these parameters below
# =========================================================

access_token = "your_mapillary_token"  #@param {type:"string"}
input_excel_path = "/content/your_file.xlsx"  #@param {type:"string"}
output_dir = "/content/"  #@param {type:"string"}



print("📥 Access Token:", "set" if access_token else "missing")  # avoid echoing the secret into saved notebook output
print("📄 Input File:", input_excel_path)
print("📂 Output Folder:", output_dir)


## 📍 Search Radius (Buffer in Meters)

This section loads the Excel file and defines the **search radius** for finding nearby Mapillary images.

- **`buffer_meters`** determines how far around each point (in meters) the script will look for images.


> 🔔 **The larger the buffer, the longer the process will take**, especially for areas with lots of image tiles.

> 💡 If you're working with urban areas, a buffer of **10–20 meters** is usually sufficient.

In [13]:
# Load Excel
df = pd.read_excel(input_excel_path)
buffer_meters = 10  #@param {type:"number"}

import math  # Make sure math is imported

def create_bbox(lon, lat):
    lat_conversion = buffer_meters / 111000
    lon_conversion = buffer_meters / (111000 * math.cos(math.radians(lat)))
    return [lon - lon_conversion, lat - lat_conversion, lon + lon_conversion, lat + lat_conversion]



---
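
To get a feel for what the buffer means in degrees, the sketch below repeats the same conversion as `create_bbox` (about 111,000 meters per degree, with longitude scaled by the cosine of the latitude), but takes the buffer as an argument so it can be run on its own. The name `bbox_around` is hypothetical, and the sample point is the second row of the sample table.

```python
import math

METERS_PER_DEGREE = 111_000  # same rough constant used by create_bbox

def bbox_around(lon, lat, buffer_m):
    """Return [west, south, east, north] extending buffer_m meters from the point."""
    dlat = buffer_m / METERS_PER_DEGREE
    # A degree of longitude shrinks with the cosine of the latitude.
    dlon = buffer_m / (METERS_PER_DEGREE * math.cos(math.radians(lat)))
    return [lon - dlon, lat - dlat, lon + dlon, lat + dlat]

west, south, east, north = bbox_around(106.829259, -6.367205, 10)
print(f"width:  {east - west:.6f} degrees")
print(f"height: {north - south:.6f} degrees")
```

A 10 meter buffer corresponds to roughly 0.00009 degrees of latitude on each side of the point. The approximation breaks down very close to the poles, where the cosine approaches zero.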



## 📸 Section 1 — Newest Image Crawling per Point

This section performs **automated crawling** to retrieve the **most recent Mapillary image** around each coordinate point from your Excel file.

### 🔍 How it works:
- Searches within the **buffer distance** (in meters) defined earlier
- Finds and downloads the **newest available image** per location
- Uses the Mapillary vector tile API and image metadata API

### 💾 Output:
- Each image is saved to `{output_dir}/Photo_NEWEST_{ID}.jpg`
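- Metadata (ID, lat/lon, capture date) is saved to `{output_dir}/mapillary_metadata_newest.xlsx`. Points without an image still get a row, with the input coordinates and an empty date.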


In [None]:
metadata_list = []

for index, row in df.iterrows():
    ID = row['ID']
    lon = row['X']
    lat = row['Y']
    image_url, captured_at, actual_lat, actual_lon = None, None, None, None

    west, south, east, north = create_bbox(lon, lat)
    tiles = list(mercantile.tiles(west, south, east, north, 14))

    newest_image, newest_date = None, None

    for tile in tiles:
        tile_url = f'https://tiles.mapillary.com/maps/vtp/{tile_coverage}/2/{tile.z}/{tile.x}/{tile.y}?access_token={access_token}'
        response = requests.get(tile_url)
        data = vt_bytes_to_geojson(response.content, tile.x, tile.y, tile.z, layer=tile_layer)

        for feature in data['features']:
            lng, lat_tmp = feature['geometry']['coordinates']
            if west < lng < east and south < lat_tmp < north:
                image_id = feature['properties']['id']
                header = {'Authorization': f'OAuth {access_token}'}
                url = f'https://graph.mapillary.com/{image_id}?fields=thumb_2048_url,captured_at'
                r = requests.get(url, headers=header)

                if r.status_code == 200:
                    meta = r.json()  # separate name so the tile GeoJSON in `data` is not overwritten
                    image_url_tmp = meta.get('thumb_2048_url')
                    captured_at_tmp = meta.get('captured_at')

                    if image_url_tmp and captured_at_tmp and (newest_date is None or captured_at_tmp > newest_date):
                        newest_image = image_url_tmp
                        newest_date = captured_at_tmp
                        actual_lat = lat_tmp
                        actual_lon = lng

    if newest_image:
        img_path = os.path.join(output_dir, f'Photo_NEWEST_{ID}.jpg')
        img_data = requests.get(newest_image).content
        with open(img_path, 'wb') as handler:
            handler.write(img_data)
        print(f"✅ Saved image for {ID}")
    else:
        print(f"⚠️ No image found for ID {ID}")

    metadata_list.append({
        'id': ID,
        'lat': actual_lat if actual_lat is not None else lat,  # explicit None check so a coordinate of 0.0 is kept
        'long': actual_lon if actual_lon is not None else lon,
        'date': newest_date
    })

# Save metadata
df_result = pd.DataFrame(metadata_list)
df_result['date'] = pd.to_datetime(df_result['date'], unit='ms', errors='coerce')
df_result.to_excel(os.path.join(output_dir, "mapillary_metadata_newest.xlsx"), index=False)




---
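
The crawl above begins by listing the zoom-14 tiles that cover each bounding box, which the notebook does with `mercantile.tiles`. As a rough illustration of what that lookup involves, the sketch below applies the standard Web Mercator (XYZ) tile formula in plain Python. It is a simplified stand-in, not mercantile's implementation: it ignores the antimeridian and does not clamp latitude, and the names `lonlat_to_tile` and `tiles_for_bbox` are hypothetical.

```python
import math

def lonlat_to_tile(lon, lat, zoom):
    """Return (x, y) of the XYZ tile that contains the given point."""
    n = 2 ** zoom
    x = int(math.floor((lon + 180.0) / 360.0 * n))
    lat_rad = math.radians(lat)
    y = int(math.floor((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n))
    return x, y

def tiles_for_bbox(west, south, east, north, zoom=14):
    """List (x, y, zoom) for every tile touched by the bounding box."""
    # Tile y grows southward, so the north-west corner has the smallest x and y.
    x_min, y_min = lonlat_to_tile(west, north, zoom)
    x_max, y_max = lonlat_to_tile(east, south, zoom)
    return [(x, y, zoom)
            for x in range(x_min, x_max + 1)
            for y in range(y_min, y_max + 1)]

print(lonlat_to_tile(0.0, 0.0, 1))  # (1, 1)
print(len(tiles_for_bbox(-1.0, -1.0, 1.0, 1.0, zoom=1)))  # 4
```

With a buffer of a few meters, the bounding box usually falls inside a single zoom-14 tile, so most points need only one tile request. A box that straddles a tile edge needs two or four.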



## 📅 Section 2 — Filter by Single Year

This section allows you to **retrieve Mapillary images captured in a specific year** for each point.

### 🎯 Goal:
- To get **one image per coordinate** that was captured in the **target year**
- This is useful when you need to analyze data from a **specific period**

### 🧩 How it works:
- Searches within the same **buffer distance** (defined earlier)
- Loops through nearby tiles and filters images **by year**
- Selects the **latest image** from that year (if available)

### 📥 Input:
- `target_year`: set the desired year (e.g. `2022`)

### 💾 Output:
- Images are saved to `{output_dir}/Photo_{ID}_{target_year}.jpg`
- Metadata is saved to: `{output_dir}/mapillary_filtered_{target_year}.xlsx`


> ⚠️ If no image from the selected year is found for a location, that point is skipped and does not appear in the metadata file.


In [None]:

target_year = 2022  #@param {type:"number"}
metadata_list = []

for index, row in df.iterrows():
    ID = row['ID']
    lon = row['X']
    lat = row['Y']
    west, south, east, north = create_bbox(lon, lat)

    tiles = list(mercantile.tiles(west, south, east, north, 14))
    best_image, best_date = None, None

    for tile in tiles:
        tile_url = f'https://tiles.mapillary.com/maps/vtp/{tile_coverage}/2/{tile.z}/{tile.x}/{tile.y}?access_token={access_token}'
        r = requests.get(tile_url)
        data = vt_bytes_to_geojson(r.content, tile.x, tile.y, tile.z, layer=tile_layer)

        for feat in data['features']:
            lng, lat_tmp = feat['geometry']['coordinates']
            if west < lng < east and south < lat_tmp < north:
                image_id = feat['properties']['id']
                api_url = f'https://graph.mapillary.com/{image_id}?fields=thumb_2048_url,captured_at'
                headers = {'Authorization': f'OAuth {access_token}'}
                resp = requests.get(api_url, headers=headers)

                if resp.status_code == 200:
                    meta = resp.json()
                    ts = pd.to_datetime(meta.get('captured_at', None), unit='ms', errors='coerce')
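                    # Require a thumbnail URL too, so the lookup below cannot raise KeyError
                    if pd.notna(ts) and ts.year == target_year and meta.get('thumb_2048_url'):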
                        if best_date is None or ts > best_date:
                            best_image = {
                                'id': image_id,
                                'lat': lat_tmp,
                                'long': lng,
                                'date': ts,
                                'image_url': meta['thumb_2048_url']
                            }
                            best_date = ts

    if best_image:
        img_path = os.path.join(output_dir, f'Photo_{ID}_{target_year}.jpg')
        with open(img_path, 'wb') as f:
            f.write(requests.get(best_image['image_url']).content)
        print(f"✅ Saved {ID}")
        metadata_list.append({
            'id': ID,
            'lat': best_image['lat'],
            'long': best_image['long'],
            'date': best_image['date']
        })
    else:
        print(f"⚠️ No image found for {ID} in year {target_year}")

df_result = pd.DataFrame(metadata_list)
df_result.to_excel(os.path.join(output_dir, f"mapillary_filtered_{target_year}.xlsx"), index=False)




---
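
The year filter comes down to two steps: convert each `captured_at` value (a Unix timestamp in milliseconds) to a calendar year, then keep the most recent candidate whose year falls inside the window. The sketch below shows that logic on its own using only the standard library; the helper names and the sample records are hypothetical, and the notebook itself does the conversion with `pd.to_datetime(..., unit='ms')`.

```python
from datetime import datetime, timezone

def ms(year, month, day):
    """Build a millisecond UTC timestamp for the sample data."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp() * 1000)

def capture_year(captured_at_ms):
    """Convert a millisecond Unix timestamp to its UTC calendar year."""
    return datetime.fromtimestamp(captured_at_ms / 1000, tz=timezone.utc).year

def latest_in_years(records, year_start, year_end):
    """Return the newest record captured in [year_start, year_end], or None."""
    candidates = [
        r for r in records
        if r.get("captured_at") is not None
        and year_start <= capture_year(r["captured_at"]) <= year_end
    ]
    return max(candidates, key=lambda r: r["captured_at"], default=None)

# Made-up records standing in for image metadata responses.
records = [
    {"id": "a", "captured_at": ms(2020, 5, 1)},
    {"id": "b", "captured_at": ms(2022, 3, 1)},
    {"id": "c", "captured_at": ms(2022, 9, 1)},
]
print(latest_in_years(records, 2022, 2022)["id"])  # c
print(latest_in_years(records, 2021, 2021))        # None
```

Passing the same year as both bounds gives the single-year behavior, while a wider window gives the range behavior.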



## 🔁 Section 3 — Filter by Range of Years

This section lets you retrieve **the most recent image** captured **within a specified year range** for each coordinate point.

### 🎯 Goal:
- To get one image per point, captured in **any year from `year_start` through `year_end`** (both inclusive)
- This is helpful for broader temporal analysis (e.g., images from the past 3 years)

### 🧩 How it works:
- Uses the same buffer-based bounding box as before
- Searches Mapillary tiles and filters images by the **year range**
- If multiple images are found in the range, the most recent one is selected

### 📥 Input:
- `year_start`: beginning year (e.g. 2021)
- `year_end`: ending year (e.g. 2023)

### 💾 Output:
- Images are saved to `{output_dir}/Photo_{ID}_{year}.jpg`, where `year` is the capture year of the selected image
- Metadata is saved to: `{output_dir}/mapillary_filtered_{year_start}_{year_end}.xlsx`

> ⚠️ If no image within the year range is found for a point, that point is skipped and does not appear in the metadata file.


In [None]:
year_start = 2021  #@param {type:"number"}
year_end = 2023  #@param {type:"number"}
target_years = range(year_start, year_end + 1)

metadata_list = []

for index, row in df.iterrows():
    ID = row['ID']
    lon = row['X']
    lat = row['Y']
    west, south, east, north = create_bbox(lon, lat)

    tiles = list(mercantile.tiles(west, south, east, north, 14))
    best_image, best_date = None, None

    for tile in tiles:
        tile_url = f'https://tiles.mapillary.com/maps/vtp/{tile_coverage}/2/{tile.z}/{tile.x}/{tile.y}?access_token={access_token}'
        r = requests.get(tile_url)
        data = vt_bytes_to_geojson(r.content, tile.x, tile.y, tile.z, layer=tile_layer)

        for feat in data['features']:
            lng, lat_tmp = feat['geometry']['coordinates']
            if west < lng < east and south < lat_tmp < north:
                image_id = feat['properties']['id']
                api_url = f'https://graph.mapillary.com/{image_id}?fields=thumb_2048_url,captured_at'
                headers = {'Authorization': f'OAuth {access_token}'}
                resp = requests.get(api_url, headers=headers)

                if resp.status_code == 200:
                    meta = resp.json()
                    ts = pd.to_datetime(meta.get('captured_at', None), unit='ms', errors='coerce')
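                    # Require a thumbnail URL too, so the lookup below cannot raise KeyError
                    if pd.notna(ts) and ts.year in target_years and meta.get('thumb_2048_url'):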
                        if best_date is None or ts > best_date:
                            best_image = {
                                'id': image_id,
                                'lat': lat_tmp,
                                'long': lng,
                                'date': ts,
                                'image_url': meta['thumb_2048_url']
                            }
                            best_date = ts

    if best_image:
        img_path = os.path.join(output_dir, f'Photo_{ID}_{best_image["date"].year}.jpg')
        with open(img_path, 'wb') as f:
            f.write(requests.get(best_image['image_url']).content)
        print(f"✅ Saved image for {ID}")
        metadata_list.append({
            'id': ID,
            'lat': best_image['lat'],
            'long': best_image['long'],
            'date': best_image['date']
        })
    else:
        print(f"⚠️ No image found for ID {ID} in years {list(target_years)}")

df_result = pd.DataFrame(metadata_list)
df_result.to_excel(os.path.join(output_dir, f"mapillary_filtered_{year_start}_{year_end}.xlsx"), index=False)
