Afonso Fonseca - 20241781

Martinho ...

...

...

We start by importing the necessary libraries for data handling, web scraping, regex operations, concurrent execution, and interactive visualization.

In [11]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import plotly.express as px
import re
from concurrent.futures import ThreadPoolExecutor

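We load the city dataset, normalize the separators in the City column (some entries use a period or semicolon instead of a comma), fix the one entry that lists the country first, split city and country into separate columns, and drop duplicates so that no city is scraped twice.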

In [12]:
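# Read the pipe-separated file; the column names are on the second line.
data = pd.read_csv("city_data.csv", sep="|", header=1)
# Strip stray whitespace from the column names before referring to them.
data.columns = data.columns.str.strip()
# Normalize the separators in "City" so that every entry reads "City, Country".
data["City"] = data["City"].str.replace(".", ",", regex=False).str.replace(";", ",", regex=False)
# One entry lists the country first; swap it into the usual order.
data.loc[data["City"] == "Greece, Athens", "City"] = "Athens, Greece"
# Split on the comma and trim the surrounding spaces.
data["City Only"] = data["City"].str.split(",").str[0].str.strip()
data["Country"] = data["City"].str.split(",").str[1].str.strip()
data = data.drop_duplicates(subset=["City Only", "Country"]).reset_index(drop=True)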
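
To make the cleaning step easier to follow, here is a minimal sketch of the same normalization on plain Python strings, without pandas. The function name `split_city_country` and the sample rows are hypothetical and are not taken from `city_data.csv`.

```python
def split_city_country(raw):
    """Return (city, country) from a raw "City, Country" string."""
    # Treat "." and ";" as mistyped commas, mirroring the cleaning step.
    normalized = raw.replace(".", ",").replace(";", ",")
    parts = [part.strip() for part in normalized.split(",")]
    city = parts[0]
    # Entries without a separator have no country part.
    country = parts[1] if len(parts) > 1 else None
    return city, country


# Hypothetical sample rows; the last one duplicates the first after normalization.
samples = ["Lisbon, Portugal", "Porto; Portugal", "Madrid. Spain", "Lisbon. Portugal"]

# dict.fromkeys keeps the first occurrence of each pair and preserves order.
unique_pairs = list(dict.fromkeys(split_city_country(s) for s in samples))
print(unique_pairs)
```

Removing duplicates only after normalization matters here: "Lisbon, Portugal" and "Lisbon. Portugal" differ as raw strings but refer to the same city.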

We initialize a requests session with a User-Agent header to simulate a browser and reduce the risk of being blocked.

In [13]:
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})

We define a helper that pulls the first two numbers out of a Wikipedia coordinate string and returns them as decimal latitude and longitude.

In [14]:
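def _extract_decimal_coords(text):
    # Match signed integers or decimals, e.g. "38.7253" or "-9.15".
    nums = re.findall(r"[-+]?\d+(?:\.\d+)?", text)
    # The first two numbers are taken as latitude and longitude.
    if len(nums) >= 2:
        return float(nums[0]), float(nums[1])
    return None, None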
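
As a quick check of how this helper behaves, the sketch below runs a standalone copy of it (renamed `extract_decimal_coords` so that the block is self-contained) on a few hypothetical input strings. It assumes the input is already in decimal form; a degrees-and-minutes string is not converted, and its first two numbers are simply returned as they are.

```python
import re


def extract_decimal_coords(text):
    # Standalone copy of the helper: find signed integers or decimals.
    nums = re.findall(r"[-+]?\d+(?:\.\d+)?", text)
    if len(nums) >= 2:
        return float(nums[0]), float(nums[1])
    return None, None


# Hypothetical sample strings, not scraped from Wikipedia.
decimal_pair = extract_decimal_coords("38.7253; -9.1500")
single_value = extract_decimal_coords("48.8566")
dms_string = extract_decimal_coords("38°43′N 9°09′W")

print(decimal_pair)   # a decimal "lat; lon" pair is parsed as expected
print(single_value)   # fewer than two numbers gives (None, None)
print(dms_string)     # degrees and minutes are misread as latitude and longitude
```

The last case shows why the helper should only be given decimal coordinates.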

Then, we create a function that searches Wikipedia starting from the main page, finds the first search result for the city, and extracts its coordinates.

In [None]:
def get_coordinates(city, country):
    try:
        # Visit the main page first, then run a search for "city country".
        session.get("https://en.wikipedia.org/wiki/Main_Page", timeout=10)
        search_url = "https://en.wikipedia.org/w/index.php?search=" + requests.utils.quote(f"{city} {country}")
        r = session.get(search_url, timeout=10)
        soup = BeautifulSoup(r.content, "html.parser")

        # Follow the first search result, if there is one.
        first_link = soup.select_one("ul.mw-search-results li a")
        if first_link:
            city_url = "https://en.wikipedia.org" + first_link["href"]
            r2 = session.get(city_url, timeout=10)
            soup2 = BeautifulSoup(r2.content, "html.parser")
            # The article's coordinates, when present, sit in a <span class="geo"> element.
            geo = soup2.find("span", {"class": "geo"})
            if geo:
                return _extract_decimal_coords(geo.text)
    except Exception:
        # A network error or an unexpected page layout leaves the city without coordinates.
        pass
    return None, None
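
The search URL is built by percent-encoding the query string. The sketch below shows that step on its own, using `urllib.parse.quote` from the standard library, which to our understanding is the function that `requests.utils.quote` re-exports. The names `SEARCH_BASE` and `build_search_url` and the example cities are hypothetical.

```python
from urllib.parse import quote

SEARCH_BASE = "https://en.wikipedia.org/w/index.php?search="


def build_search_url(city, country):
    # quote() percent-encodes spaces and non-ASCII characters as UTF-8 bytes.
    return SEARCH_BASE + quote(f"{city} {country}")


plain_url = build_search_url("Porto", "Portugal")
accented_url = build_search_url("Málaga", "Spain")
print(plain_url)
print(accented_url)
```

Encoding the query this way keeps city names with spaces or accents from producing a malformed URL.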

Afterwards, we use a thread pool to scrape all cities concurrently for speed, while still starting from the Wikipedia main page for each request.

In [16]:
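def scrape_city(row):
    # Look up one DataFrame row by its city and country.
    return get_coordinates(row["City Only"], row["Country"])

# executor.map returns results in input order, so they line up with the rows of data.
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(scrape_city, [row for _, row in data.iterrows()]))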
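
The results can later be attached to the dataset by position because `executor.map` yields results in the order of its inputs, regardless of which task finishes first. Here is a minimal sketch of that property, with a hypothetical `fake_lookup` function that sleeps instead of making a network request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_lookup(item):
    # Stand-in for a network request: wait, then return a value derived from the input.
    name, delay = item
    time.sleep(delay)
    return name.upper()


# Hypothetical inputs; the first task is the slowest to finish.
items = [("lisbon", 0.05), ("madrid", 0.01), ("paris", 0.03)]

with ThreadPoolExecutor(max_workers=3) as executor:
    ordered = list(executor.map(fake_lookup, items))

print(ordered)
```

Even though "madrid" completes first, it still appears second in the output, matching its position in `items`.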

We append the scraped latitude and longitude to the dataset, remove cities without coordinates, and rename columns for clarity in visualization.

In [17]:
data["Latitude"] = [r[0] for r in results]
data["Longitude"] = [r[1] for r in results]

data_map = data.dropna(subset=["Latitude", "Longitude"]).copy()
data_map = data_map.rename(columns={
    "Average Monthly Salary": "Average monthly salary",
    "Average Cost of Living": "Average cost of living"
})

We use Plotly to create an interactive map of Europe. Users can hover over city markers to see country, population, average monthly salary, and average cost of living.

In [21]:
fig = px.scatter_mapbox(
    data_map,
    lat="Latitude",
    lon="Longitude",
    hover_name="City Only",
    hover_data={
        "Country": True,
        "Population": True,
        "Average monthly salary": True,
        "Average cost of living": True,
        "Latitude": False,
        "Longitude": False
    },
    color="Country",
    zoom=3,
    center={"lat": 50.0, "lon": 10.0},
    height=700,
    title="Where Should I Live? - European City Map"
)
fig.update_layout(mapbox_style="open-street-map", margin={"r":0,"t":40,"l":0,"b":0})
fig.show()