## Meta-Visualization Framework for Spatiotemporal Analytics:
### From Data Generation to Advanced Visualization on Maps

## **Stage 1: Data Acquisition and Preparation Tools**

**Overview**

This notebook implements the core data acquisition and preparation methodologies described in our research article, focusing on three key techniques for gathering heterogeneous data sources:

1.   Map Scan Processing (Geospatial)
2.   Web Scraping (Tabular/Text)
3.   IoT Sensor Integration (Real-time)

Each section provides executable code examples, best practices, and ethical considerations. Note that the validation and quality of the obtained datasets are discussed in the Case_Study repository.


In [24]:
from IPython.display import Image
Image(url='https://github.com/AnonymAuthors2025/Revised_Meta_Visualization/blob/main/Stage1/Fig1_pipeline.png?raw=true', width=500)



---


1. **Map Scan Processing**

Archival maps (PNG/JPEG/PDF) contain valuable geospatial data but are unstructured and require conversion to machine-readable formats (e.g., GeoJSON).


To handle this type of data source, we propose a new technique based on segmentation and georeferencing (see the next figure).

In [22]:
Image(url='https://github.com/AnonymAuthors2025/Revised_Meta_Visualization/blob/main/Stage1/Fig2.png?raw=true', width=200)

Here is an example script that extracts features from a map using our pipeline:

We start by selecting a map scan and layering it on the map through georeferencing. We then extract the labels from the legend and compare the color of each segment against them. Finally, we save the extracted properties as a list in a CSV file.

In [None]:
# Import libraries
from io import BytesIO

import folium
import requests
from PIL import Image

# Load the map image (replace the URL with the location of your own map scan)
url = "https://raw.githubusercontent.com/AnonymAuthors2025/Revised_Meta_Visualization/main/Stage1/soil.png"
response = requests.get(url)
response.raise_for_status()  # Stop early if the download failed
map_image = Image.open(BytesIO(response.content))

In [None]:
# Georeferencing
soil = folium.Map(location=[33, 0], zoom_start=4)

# Add a new image layer
folium.raster_layers.ImageOverlay(
    image='https://raw.githubusercontent.com/AnonymAuthors2025/Revised_Meta_Visualization/main/Stage1/soil.png',
    bounds=[[9.9, -16.1], [39.1, 16.8]],
    opacity=0.7,
    name='My Image Layer'
).add_to(soil)

# Display the map (in a notebook, the last expression of a cell is rendered)
soil
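
The overlay bounds above imply a linear mapping between Web Mercator coordinates and image pixels. As a minimal sketch of that mapping, the conversion can be written with the standard spherical Mercator formulas. The function names `mercator_xy` and `latlon_to_pixel` and the 1000 x 1000 image size are illustrative assumptions, not part of the pipeline.

```python
import math

EARTH_RADIUS = 6378137.0  # WGS84 semi-major axis in metres, as used by EPSG:3857


def mercator_xy(lat, lon):
    """Project latitude/longitude in degrees to spherical Web Mercator metres."""
    x = EARTH_RADIUS * math.radians(lon)
    y = EARTH_RADIUS * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


def latlon_to_pixel(lat, lon, bounds, width, height):
    """Map a coordinate to (column, row) in an image stretched over `bounds`.

    bounds follows the folium convention: [[south, west], [north, east]].
    Row 0 is the top of the image, so the y axis is flipped.
    """
    (south, west), (north, east) = bounds
    x_min, y_min = mercator_xy(south, west)
    x_max, y_max = mercator_xy(north, east)
    x, y = mercator_xy(lat, lon)
    col = int((x - x_min) / (x_max - x_min) * width)
    row = int((y_max - y) / (y_max - y_min) * height)
    # Clamp so that points on the east/south edge stay inside the image
    return min(max(col, 0), width - 1), min(max(row, 0), height - 1)


bounds = [[9.9, -16.1], [39.1, 16.8]]
print(latlon_to_pixel(39.1, -16.1, bounds, 1000, 1000))  # north-west corner: (0, 0)
print(latlon_to_pixel(9.9, 16.8, bounds, 1000, 1000))    # south-east corner: (999, 999)
```

Columns are linear in longitude, while rows are linear in the Mercator y coordinate rather than in latitude, which is why the projection step matters for maps spanning many degrees of latitude.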

In [None]:
# Label extraction + map segmentation
import math
import warnings

from pyproj import Proj, transform

warnings.filterwarnings("ignore")  # Keep library warnings out of the notebook output

def euclidean_distance(color1, color2):
    r1, g1, b1 = color1
    r2, g2, b2 = color2
    return math.sqrt((r1 - r2) ** 2 + (g1 - g2) ** 2 + (b1 - b2) ** 2)

def find_most_similar_color(rgb_color, color_list):
    min_distance = float('inf')
    most_similar_color = None

    for color_name, color in color_list.items():
        distance = euclidean_distance(rgb_color, color)
        if distance < min_distance:
            min_distance = distance
            most_similar_color = (color, color_name)

    return most_similar_color

# Dictionary of colors with their names
color_dict = {
    'AR': (237, 232, 171),
    'CL': (255, 255, 0),
    'CM': (255, 166, 0),
    'FL': (0, 191, 255),
    'GY': (255, 250, 204),
    'LP': (212, 212, 212),
    'LV': (250, 128, 115),
    'NT': (255, 161, 122),
    'RG': (245, 222, 197),
    'SC': (255, 0 , 255),
    'VR': (148, 112, 219),
    'WR': (135, 207, 250)
}


color_list = []

# Reload the map image so that this cell can run on its own
url = "https://raw.githubusercontent.com/AnonymAuthors2025/Revised_Meta_Visualization/main/Stage1/soil.png"
response = requests.get(url)
# Convert to RGB so every pixel is an (r, g, b) tuple, even for RGBA or palette PNGs
map_image = Image.open(BytesIO(response.content)).convert("RGB")

# Bounding box of the map in degrees:
# (min_latitude, min_longitude, max_latitude, max_longitude)
# These values approximate the overlay bounds [[9.9, -16.1], [39.1, 16.8]] used for georeferencing
map_bbox = (9.9, -16.1, 39.09, 16.79)

# Determine the size of the map image in pixels
map_width, map_height = map_image.size

# Define the projection for the map (Web Mercator).
# Proj('EPSG:3857') replaces the deprecated Proj(init='epsg:3857') syntax.
map_projection = Proj('EPSG:3857')

# Project the bounding-box corners once instead of on every call
x_min, y_min = map_projection(map_bbox[1], map_bbox[0])  # south-west corner
x_max, y_max = map_projection(map_bbox[3], map_bbox[2])  # north-east corner

# Convert latitude and longitude coordinates to pixel coordinates
def lat_lon_to_pixel(lat, lon):
    # Project longitude/latitude (degrees) to Web Mercator metres
    x, y = map_projection(lon, lat)
    # Scale to the image size; row 0 is the top of the image, so y is flipped
    pixel_x = int((x - x_min) / (x_max - x_min) * map_width)
    pixel_y = int((y_max - y) / (y_max - y_min) * map_height)
    # Clamp to valid indices so that edge coordinates stay inside the image
    pixel_x = min(max(pixel_x, 0), map_width - 1)
    pixel_y = min(max(pixel_y, 0), map_height - 1)
    return pixel_x, pixel_y


# Define intervals for latitude and longitude
latitude_interval = (9.9, 39.1)  # Example latitude interval
longitude_interval = (-16.1, 16.79)  # Example longitude interval

# Define segment size for precision e.g., 0.1 for high precision, and 0.5 for medium precision ...
latitude_step = 0.1
longitude_step = 0.1


# Generate latitude and longitude coordinates within the intervals
locations = []
for lat in range(int(latitude_interval[0] * 100), int(latitude_interval[1] * 100), int(latitude_step * 100)):
    for lon in range(int(longitude_interval[0] * 100), int(longitude_interval[1] * 100), int(longitude_step * 100)):
        locations.append((lat / 100, lon / 100))

# Loop over each location and get the color
for latitude, longitude in locations:
    # Convert latitude and longitude to pixel coordinates
    pixel_x, pixel_y = lat_lon_to_pixel(latitude, longitude)

    # Get the color of the pixel at the specified location
    pixel_color = map_image.getpixel((pixel_x, pixel_y))

    # Find the most similar color from the dictionary
    c, n = find_most_similar_color(pixel_color, color_dict)
    color_list.append({"latitude": latitude, "longitude": longitude, "color": pixel_color, "most similar": c, "type": n})

#print(color_list)
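
Scanned maps usually contain large regions of identical pixel colors, so the nearest-legend-color search repeats the same computation many times. One possible optimization, sketched here under the assumption that pixels are RGB tuples, is to memoize the lookup with `functools.lru_cache`. The `classify_pixel` name and the `max_distance` threshold are illustrative additions; the threshold flags pixels (borders, text) that match no legend entry closely.

```python
import math
from functools import lru_cache

# Subset of the legend used above: soil code -> RGB color
LEGEND = {
    'AR': (237, 232, 171),
    'CL': (255, 255, 0),
    'FL': (0, 191, 255),
}


@lru_cache(maxsize=None)
def classify_pixel(rgb, max_distance=60.0):
    """Return the legend code closest to `rgb`, or None if nothing is close.

    Results are cached, so each distinct color is classified only once.
    """
    best_code, best_distance = None, float('inf')
    for code, reference in LEGEND.items():
        distance = math.dist(rgb, reference)  # Euclidean distance in RGB space
        if distance < best_distance:
            best_code, best_distance = code, distance
    return best_code if best_distance <= max_distance else None


print(classify_pixel((250, 250, 10)))  # close to the 'CL' yellow
print(classify_pixel((0, 0, 0)))       # black line work matches nothing: None
```

The value 60.0 is an arbitrary starting point; a suitable threshold depends on the scan quality and on how close the legend colors are to each other.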

In [None]:
import pandas as pd

df = pd.DataFrame(color_list)
# to_csv needs a writable local path; a raw.githubusercontent.com URL is read-only
df.to_csv('extracted_soil_types.csv', index=True)

**The output of this process is a list of (lat, long, property) tuples covering the map, where each (lat, long) pair carries its corresponding properties from the legend. Such a list can be stored as a CSV file or in any other format to feed different types of databases and GIS.**
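
As one way to feed a GIS, the extracted records can be serialized as GeoJSON point features. The following is a minimal sketch using only the standard library. The function name `records_to_geojson` and the two sample records are illustrative; the record keys mirror the entries of `color_list` built above.

```python
import json


def records_to_geojson(records):
    """Convert extracted (lat, lon, type) records to a GeoJSON FeatureCollection."""
    features = []
    for record in records:
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude]
            "geometry": {
                "type": "Point",
                "coordinates": [record["longitude"], record["latitude"]],
            },
            "properties": {"type": record["type"]},
        })
    return {"type": "FeatureCollection", "features": features}


# Two made-up records in the same shape as the entries of color_list
sample = [
    {"latitude": 33.0, "longitude": 0.0, "type": "CL"},
    {"latitude": 33.1, "longitude": 0.0, "type": "AR"},
]
collection = records_to_geojson(sample)
print(json.dumps(collection["features"][0]))
```

Writing `json.dump(collection, f)` to a `.geojson` file would produce output that most GIS tools can open directly.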

Note that higher precision requires significantly more extraction time than medium or low precision.
The extraction step above takes about 70 minutes.
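
The cost grows with the number of sampled points, which is inversely proportional to the product of the two step sizes. A rough way to estimate the workload before launching an extraction is to count the grid points, mirroring the integer `range` logic of the sampling loop above. The helper name `count_samples` is an assumption made for this sketch.

```python
def count_samples(lat_interval, lon_interval, lat_step, lon_step, scale=100):
    """Count the grid points produced by the integer-scaled sampling ranges."""
    def axis_count(interval, step):
        start = round(interval[0] * scale)
        stop = round(interval[1] * scale)
        return len(range(start, stop, round(step * scale)))
    return axis_count(lat_interval, lat_step) * axis_count(lon_interval, lon_step)


high = count_samples((9.9, 39.1), (-16.1, 16.79), 0.1, 0.1)
medium = count_samples((9.9, 39.1), (-16.1, 16.79), 0.5, 0.5)
print(high, medium, round(high / medium, 1))  # 96068 3894 24.7
```

With the intervals used above, a 0.1 degree step samples about 96,000 points, roughly 25 times more than a 0.5 degree step.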

Other tools exist for more sophisticated pattern extraction, such as Rasterio/GDAL and QGIS (with manual validation), but they are often costly and complex for some user profiles.

Here is a simplified example of extraction with OpenCV and Rasterio:



```
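import cv2, geojson
from rasterio.transform import from_bounds

# Step 1: Preprocess scan (findContours expects a binary image)
image = cv2.imread("historical_map.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Step 2: Feature extraction (simplified)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Step 3: Georeference and export
# Placeholder extent (west, south, east, north) in degrees: replace with your map's bounds
height, width = gray.shape
affine = from_bounds(-16.1, 9.9, 16.8, 39.1, width, height)

def contour_to_geojson(contour):
    # Map (column, row) pixel vertices to (lon, lat) and close the ring
    ring = [list(affine * (float(c), float(r))) for c, r in contour[:, 0, :]]
    return geojson.Feature(geometry=geojson.Polygon([ring + ring[:1]]))

features = [contour_to_geojson(c) for c in contours if len(c) >= 3]
with open("output.geojson", "w") as f:
    geojson.dump(geojson.FeatureCollection(features), f)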
```




---


2. **Web Scraping**

Web scraping is increasingly used in data science projects, supported by many free libraries and APIs. Scrapy, BeautifulSoup and Selenium are Python libraries dedicated to web scraping; other APIs and tools include Overpass Turbo and Google Maps scrapers.

Here is an example script that scrapes a website (FAO/AQUASTAT, for weather data) using Python libraries:



```
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# Constants
BASE_URL = "http://www.fao.org/aquastat/statistics/query/index.html"
USER_AGENT = "ResearchBot (your-email@example.com)"  # Identify yourself ethically
DELAY = 2  # Seconds between requests

def fetch_aquastat_data(country_code="EGY", parameter="Precipitation"):
    """
    Scrapes AQUASTAT for country-specific weather data.
    
    Args:
        country_code (str): 3-letter ISO country code (e.g., "EGY" for Egypt)
        parameter (str): Target parameter (e.g., "Precipitation", "Temperature")
    
    Returns:
        pd.DataFrame: Extracted data table
    """
    try:
        # 1. Open a session and identify the client
        session = requests.Session()
        headers = {"User-Agent": USER_AGENT}
        
        # 2. Submit the search form (placeholder field names: adapt the payload to the actual form)
        payload = {
            "country": country_code,
            "parameter": parameter,
            "submit": "Submit"
        }
        
        # A timeout prevents the request from hanging indefinitely
        response = session.post(BASE_URL, data=payload, headers=headers, timeout=30)
        response.raise_for_status()  # Raise on HTTP errors
        
        # 3. Parse HTML table
        soup = BeautifulSoup(response.text, 'html.parser')
        table = soup.find("table", {"class": "data"})
        
        if not table:
            raise ValueError("No data table found - check parameters or website structure")
        
        # 4. Convert to DataFrame
        rows = []
        for tr in table.find_all("tr")[1:]:  # Skip header
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:
                rows.append(cells)
        
        df = pd.DataFrame(rows, columns=["Year", parameter])
        return df
        
    except Exception as e:
        print(f"Error: {e}")
        return pd.DataFrame()

# Example Usage
if __name__ == "__main__":
    # Fetch precipitation data for Egypt
    df = fetch_aquastat_data(country_code="EGY", parameter="Precipitation")
    if not df.empty:
        print(df.head())
        df.to_csv("aquastat_precipitation_egypt.csv", index=False)
    time.sleep(DELAY)  # Respect crawl delay
```


**The output of this script is a CSV (tabular) file that can feed different types of databases and information systems.**
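
Scraped cells arrive as strings, and real tables often contain malformed rows. As a hedged sketch, rows can be validated and typed before they are turned into a DataFrame or written to CSV. The helper `clean_rows` and the sample rows are hypothetical and do not represent actual AQUASTAT output.

```python
def clean_rows(rows, expected_columns=2):
    """Keep rows with the expected number of cells and convert them to numbers.

    Returns (records, rejected) so that discarded rows can be inspected.
    """
    records, rejected = [], []
    for cells in rows:
        if len(cells) != expected_columns:
            rejected.append(cells)
            continue
        try:
            year = int(cells[0])
            # Drop thousands separators before converting the value
            value = float(cells[1].replace(",", ""))
        except ValueError:
            rejected.append(cells)
            continue
        records.append((year, value))
    return records, rejected


# Made-up rows in the shape produced by the table-parsing loop
raw_rows = [["2015", "18.1"], ["2016", "1,020.5"], ["2017"], ["n/a", "12"]]
records, rejected = clean_rows(raw_rows)
print(records)   # [(2015, 18.1), (2016, 1020.5)]
print(rejected)  # [['2017'], ['n/a', '12']]
```

Returning the rejected rows, rather than silently dropping them, makes it easier to notice when the website structure has changed.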


**Ethical Considerations:**

✔ Check privacy and intellectual property aspects before data extraction

✔ Limit request rate (e.g., time.sleep())

✔ Use APIs when available
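
The considerations above can be partly automated. The following sketch uses the standard library's `urllib.robotparser` to check whether a URL may be fetched and which crawl delay the site requests. The robots.txt content, the example.com URLs and the helper names are hypothetical; in practice the parser would load the site's real file with `set_url(...)` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally for illustration
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "ResearchBot"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def polite_delay(default=1.0):
    """Return the crawl delay requested by the site, or a default in seconds."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


def allowed(url):
    """Check whether robots.txt permits fetching this URL."""
    return parser.can_fetch(USER_AGENT, url)


for url in ["https://example.com/data/table.html",
            "https://example.com/private/report.html"]:
    if allowed(url):
        # A real crawler would request the page here, then wait
        # polite_delay() seconds (e.g., with time.sleep) before the next one
        print("fetch", url)
    else:
        print("skip", url)
```

robots.txt covers only crawling permissions; privacy and intellectual property questions still need to be checked separately.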



```
+--------------+---------------------+---------+------------+
| Library      | Use Case            | Speed   | Complexity |
+--------------+---------------------+---------+------------+
| BeautifulSoup| Static HTML         | Medium  | Low        |
| Scrapy       | Large-scale scraping| High    | Medium     |
| Selenium     | Dynamic content     | Low     | High       |
+--------------+---------------------+---------+------------+
```






---


3. **IoT Devices**

IoT technology is becoming omnipresent in environmental monitoring, urban planning and many other projects related to spatiotemporal analytics. Sensors, GPS trackers and similar devices generate data and feed databases and applications in real time. In soil sampling, for instance, IoT devices measure soil properties and send them to the cloud.

Common Protocols:

1.   MQTT (Lightweight messaging)
2.   REST APIs (Sensor gateways)
3.   WebSocket (Real-time streams)


Example: Sensor Data Pipeline


```
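import json
import paho.mqtt.client as mqtt

def save_to_database(record):
    print(record)  # Placeholder: replace with your own storage logic

def on_message(client, userdata, msg):
    payload = json.loads(msg.payload)
    save_to_database(payload)

# paho-mqtt 2.x requires a callback API version; with paho-mqtt 1.x use mqtt.Client()
client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.on_message = on_message  # Register the callback
client.connect("iot_gateway.example.com", 1883)
client.subscribe("sensors/temperature")
client.loop_forever()  # Block and process incoming messages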
```
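
Messages from field devices can be malformed, so it is usually worth validating a payload before storing it. The following is a minimal sketch, assuming a hypothetical JSON payload with `sensor_id`, `timestamp`, `lat`, `lon` and `value` fields; the actual schema depends on the device.

```python
import json

# Hypothetical schema: adapt the field names to your devices
REQUIRED_FIELDS = ("sensor_id", "timestamp", "lat", "lon", "value")


def parse_sensor_payload(raw):
    """Decode and validate one sensor message; return a dict or None."""
    try:
        data = json.loads(raw)
    except (ValueError, UnicodeDecodeError):
        return None  # Not valid JSON
    if not isinstance(data, dict):
        return None
    if any(field not in data for field in REQUIRED_FIELDS):
        return None
    try:
        lat, lon = float(data["lat"]), float(data["lon"])
        data["value"] = float(data["value"])
    except (TypeError, ValueError):
        return None
    # Reject coordinates outside the valid geographic range
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    data["lat"], data["lon"] = lat, lon
    return data


good = b'{"sensor_id": "s1", "timestamp": "2025-01-01T00:00:00Z", "lat": 33.0, "lon": 0.5, "value": "21.4"}'
bad = b'{"sensor_id": "s1", "lat": 120, "lon": 0.5, "value": 3}'
print(parse_sensor_payload(good))
print(parse_sensor_payload(bad))  # None: the timestamp field is missing
```

A function like this could be called at the start of an MQTT `on_message` callback, so that only validated records reach the database.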



---


## **Next Steps**

- Run cells sequentially to generate sample datasets (CSV).

- Validate outputs with different stakeholders and domain experts.

- Proceed to Stage 2 (Data Integration).