Afonso Fonseca - 20241781

Martinho ...

...

...

We start by importing the necessary libraries for data handling, web scraping, regex operations, and interactive visualization.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import plotly.express as px
import time
import re

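We load the city dataset and clean the City column: periods and semicolons used as separators are replaced with commas, and the one entry written country-first ("Greece, Athens") is swapped into city-first order. We then split city and country into distinct columns and remove duplicate entries.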

In [2]:
data = pd.read_csv("city_data.csv", sep="|", header=1)
# Strip stray whitespace from the column names before any column is accessed by name.
data.columns = data.columns.str.strip()
# Normalize the city/country separator: "." and ";" both become ",".
data["City"] = data["City"].str.replace(".", ",", regex=False).str.replace(";", ",", regex=False)
# One entry lists the country first; put it in "City, Country" order.
data.loc[data["City"] == "Greece, Athens", "City"] = "Athens, Greece"
data["City Only"] = data["City"].str.split(",").str[0].str.strip()
data["Country"] = data["City"].str.split(",").str[1].str.strip()
# Keep one row per (city, country) pair.
data = data.drop_duplicates(subset=["City Only", "Country"]).reset_index(drop=True)
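The cleaning steps can be checked on a few toy rows before running them on the real file. This is a minimal sketch: the rows below are invented for illustration and are not taken from city_data.csv, and `toy` is a hypothetical name.

```python
import pandas as pd

# Invented rows that show the same separator and ordering problems.
toy = pd.DataFrame(
    {"City": ["Lisbon. Portugal", "Greece, Athens", "Lisbon, Portugal", "Paris; France"]}
)

# Normalize "." and ";" separators to ",".
toy["City"] = (
    toy["City"].str.replace(".", ",", regex=False).str.replace(";", ",", regex=False)
)
# Fix the single country-first entry.
toy.loc[toy["City"] == "Greece, Athens", "City"] = "Athens, Greece"

# Split once on the first comma into city and country.
parts = toy["City"].str.split(",", n=1, expand=True)
toy["City Only"] = parts[0].str.strip()
toy["Country"] = parts[1].str.strip()

# The two Lisbon rows collapse into one.
toy = toy.drop_duplicates(subset=["City Only", "Country"]).reset_index(drop=True)
print(toy[["City Only", "Country"]])
```

After cleaning, the toy frame holds three rows (Lisbon, Athens, Paris), which confirms that the duplicate only becomes visible once the separators are normalized.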

We initialize a requests session with a User-Agent header to simulate a browser and reduce the risk of being blocked.

In [3]:
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
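If transient failures turn out to be a problem, one option is to let the session retry automatically. The sketch below is an assumption, not part of the original notebook: it mounts a `requests` `HTTPAdapter` configured with a urllib3 `Retry` policy, and the retry count, backoff factor, and status codes are illustrative values. `retrying_session` is a hypothetical name.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Illustrative retry policy: up to 3 retries with exponential backoff
# on rate limiting (429) and common server errors.
retry_policy = Retry(
    total=3,
    backoff_factor=0.5,
    status_forcelist=[429, 500, 502, 503, 504],
)

retrying_session = requests.Session()
retrying_session.headers.update(
    {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
)
# Apply the policy to every HTTPS request made through this session.
retrying_session.mount("https://", HTTPAdapter(max_retries=retry_policy))
```

No request is sent when the session is created; the policy only takes effect on later calls such as `retrying_session.get(...)`.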

We now create a function that extracts decimal latitude and longitude from the text of a Wikipedia coordinate element, which is expected to hold two numbers such as "38.7253; -9.1500".

In [4]:
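def _extract_decimal_coords(text):
    # Collect every signed integer or decimal in the text, e.g. "38.7253; -9.1500".
    nums = re.findall(r"[-+]?\d+(?:\.\d+)?", text)
    if len(nums) >= 2:
        try:
            # The first two numbers are taken as latitude and longitude.
            return float(nums[0]), float(nums[1])
        except ValueError:
            return None, None
    return None, None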
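To see what this kind of extraction does with different inputs, here is a small standalone sketch. `parse_lat_lon` is a hypothetical stand-in that uses the same regex idea as the helper above, and the sample strings are invented.

```python
import re


def parse_lat_lon(text):
    """Return the first two numbers in text as floats, or (None, None)."""
    nums = re.findall(r"[-+]?\d+(?:\.\d+)?", text)
    if len(nums) < 2:
        return None, None
    return float(nums[0]), float(nums[1])


decimal_pair = parse_lat_lon("38.7253; -9.1500")     # decimal "lat; lon" text
no_numbers = parse_lat_lon("no coordinates here")    # nothing to extract
dms_pair = parse_lat_lon("38°43′N 9°09′W")           # degrees/minutes text

print(decimal_pair)  # (38.7253, -9.15)
print(no_numbers)    # (None, None)
print(dms_pair)      # (38.0, 43.0)
```

The last case is a caveat rather than a feature: text in degrees and minutes is read as degrees followed by minutes, so this approach is only reliable when the input is already in decimal form.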

Next, we define a function that searches Wikipedia for a city, follows the first search result, and extracts the geographical coordinates from that page.

In [5]:
def get_coordinates_from_main(city, country):
    main_url = "https://en.wikipedia.org/wiki/Main_Page"
    try:
        r = session.get(main_url, timeout=10)
        soup = BeautifulSoup(r.content, "html.parser")
        search_url = "https://en.wikipedia.org/w/index.php?search=" + requests.utils.quote(f"{city} {country}")
        r2 = session.get(search_url, timeout=10)
        soup2 = BeautifulSoup(r2.content, "html.parser")
        first_link = soup2.select_one("ul.mw-search-results li a")
        if first_link:
            page_url = "https://en.wikipedia.org" + first_link["href"]
            r3 = session.get(page_url, timeout=10)
            soup3 = BeautifulSoup(r3.content, "html.parser")
            geo = soup3.find("span", {"class": "geo"})
            if geo:
                return _extract_decimal_coords(geo.text)
    except requests.RequestException:
        # Network errors and timeouts: treat the city as not found.
        return None, None
    return None, None
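The two parsing steps can be tried offline on static HTML. The fragments below are invented and only mimic the page shapes the selectors assume (a `ul.mw-search-results` list and a `span` with class `geo`); the real Wikipedia markup may differ or change over time.

```python
from bs4 import BeautifulSoup

# Invented fragment shaped like a search results list.
SEARCH_HTML = """
<ul class="mw-search-results">
  <li><div><a href="/wiki/Lisbon" title="Lisbon">Lisbon</a></div></li>
  <li><div><a href="/wiki/Portugal" title="Portugal">Portugal</a></div></li>
</ul>
"""

# Invented fragment shaped like an article's coordinate markup.
ARTICLE_HTML = (
    '<span class="geo-dec">38.7253°N 9.1500°W</span>'
    '<span class="geo">38.7253; -9.1500</span>'
)

# Step 1: take the first link inside the search results list.
search_soup = BeautifulSoup(SEARCH_HTML, "html.parser")
first_link = search_soup.select_one("ul.mw-search-results li a")
page_url = "https://en.wikipedia.org" + first_link["href"]

# Step 2: find the element whose class is exactly "geo" (not "geo-dec").
article_soup = BeautifulSoup(ARTICLE_HTML, "html.parser")
geo = article_soup.find("span", {"class": "geo"})
geo_text = geo.text if geo else None

print(page_url)
print(geo_text)
```

Keeping a pair of saved fragments like this makes it easier to notice when a selector stops matching, which would otherwise show up only as missing coordinates.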

We loop through all cities in the dataset, scrape their coordinates, and store them in lists. A short pause between requests avoids overwhelming the server.

In [6]:
lats = []
lons = []
for _, row in data.iterrows():
    lat, lon = get_coordinates_from_main(row["City Only"], row["Country"])
    lats.append(lat)
    lons.append(lon)
    time.sleep(0.25)
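Before running the loop against the live site, it can be rehearsed with a stub in place of the scraper. This is a sketch under stated assumptions: `collect_coordinates` and `fake_lookup` are hypothetical names, the coordinates are invented, and the cache is an optional extra that skips repeated (city, country) pairs.

```python
import time


def collect_coordinates(pairs, lookup, pause=0.25):
    """Call lookup(city, country) once per distinct pair, pausing after each call."""
    cache = {}
    results = []
    for pair in pairs:
        if pair not in cache:
            cache[pair] = lookup(*pair)
            time.sleep(pause)  # be polite to the server between real requests
        results.append(cache[pair])
    return results


calls = []


def fake_lookup(city, country):
    """Stub scraper: records the call and returns invented coordinates."""
    calls.append((city, country))
    return {"Lisbon": (38.7253, -9.15)}.get(city, (None, None))


pairs = [("Lisbon", "Portugal"), ("Nowhere", "Atlantis"), ("Lisbon", "Portugal")]
coords = collect_coordinates(pairs, fake_lookup, pause=0)

demo_lats = [lat for lat, _ in coords]
demo_lons = [lon for _, lon in coords]
print(demo_lats, demo_lons, len(calls))
```

Passing the lookup function as an argument means the same loop can later be given the real scraper without any other change.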

We append the scraped latitude and longitude to the dataset, remove cities without coordinates, and rename columns for clarity in visualization.

In [12]:
data["Latitude"] = lats
data["Longitude"] = lons
data_map = data.dropna(subset=["Latitude", "Longitude"]).copy()
data_map = data_map.rename(columns={
    "Average Monthly Salary": "Average monthly salary",
    "Average Cost of Living": "Average cost of living"
})
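Since rows without coordinates are dropped silently, it can help to list which cities were lost. The sketch below uses an invented two-row frame (the salary figures are placeholders, not real data) and hypothetical names `toy`, `toy_map`, and `missing`.

```python
import pandas as pd

# Invented rows: one city with coordinates, one where scraping returned nothing.
toy = pd.DataFrame(
    {
        "City Only": ["Lisbon", "Nowhere"],
        "Average Monthly Salary": [1000, 2000],  # placeholder values
        "Latitude": [38.7253, None],
        "Longitude": [-9.15, None],
    }
)

# Record the cities that are about to be dropped.
missing = toy.loc[toy["Latitude"].isna() | toy["Longitude"].isna(), "City Only"].tolist()

# Same drop-and-rename pattern as in the cell above.
toy_map = (
    toy.dropna(subset=["Latitude", "Longitude"])
    .rename(columns={"Average Monthly Salary": "Average monthly salary"})
    .copy()
)

print(missing)
print(list(toy_map.columns))
```

A non-empty `missing` list is a prompt to check those cities by hand, for example because the first search result was not the city's own article.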

We use Plotly to create an interactive map of Europe. Users can hover over city markers to see country, population, salary, and cost of living information.

In [None]:
fig = px.scatter_mapbox(
    data_map,
    lat="Latitude",
    lon="Longitude",
    hover_name="City Only",
    hover_data={
        "Country": True,
        "Population": True,
        "Average monthly salary": True,
        "Average cost of living": True,
        "Latitude": False,
        "Longitude": False
    },
    color="Country",
    zoom=3,
    center={"lat": 50.0, "lon": 10.0},
    height=700,
    title="Where Should I Live? - European City Map"
)
fig.update_layout(mapbox_style="open-street-map", margin={"r":0,"t":40,"l":0,"b":0})
fig.show()
