## Introduction
In this article we will discuss a very common *decision problem*: how to travel from a starting point `A` to some target destination `Z` in the *"best"* manner possible. As we discussed in the [last article](https://diogenesanalytics.com/blog/2025/02/26/reasoning-about-utility), a *utility* must be defined before we can say *quantitatively* what is best. Most of the article presents the *mathematical* definition of the utility function $U(x)$ for the *travel problem*, the reasoning behind that definition, and finally a simple application.

## Utility Function Defined
Before we really get into explaining the *reasoning* behind our choice of *function definition*, we will first simply show the mathematical formula in all its glory:

$$
U(x) = \frac{D_{\text{ideal}}}{M \cdot T \cdot S \cdot D_{\text{actual}}}
$$

Where:

- $D_{\text{ideal}}$ is the ideal (straight-line) distance between the start and end points of the journey.
- $D_{\text{actual}}$ is the actual traveled distance.
- $M$ is the monetary cost.
- $T$ is the total time.
- $S$ is the stress.
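
To make the formula concrete, here is a minimal sketch of it as a Python function. The function name `travel_utility`, its argument names, and the numbers passed to it are illustrative assumptions rather than anything defined elsewhere in the article:

```python
def travel_utility(d_ideal, d_actual, money, time, stress):
    """Travel utility U = D_ideal / (M * T * S * D_actual)."""
    # every term in the denominator must be strictly positive
    if min(d_actual, money, time, stress) <= 0:
        raise ValueError("d_actual, money, time and stress must be positive")
    return d_ideal / (money * time * stress * d_actual)


# hypothetical numbers, chosen only for illustration
direct = travel_utility(d_ideal=4600, d_actual=4600, money=400, time=6, stress=20)
detour = travel_utility(d_ideal=4600, d_actual=5500, money=400, time=6, stress=20)

print(direct > detour)  # True: the detour scores lower at identical cost
```

With every cost held fixed, the only difference between the two calls is the extra distance flown, and that alone lowers the utility.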

In the next section we will explore the *"reasoning"* behind this equation.

## Reasoning About Travel Utility
To understand how we arrived at such an equation, we must ask ourselves a simple question: what is the purpose of travel? Simply put, *travel* is about getting from one point to another. That could be called the *"reward"* or *"benefit."* But the cost of said travel does not include just money $M$, but also time $T$, and even something more abstract, which is stress $S$.
So we can think about the *"utility"* of such a problem as simply:

$$
U(x) = \frac{\text{What You Want}}{\text{What It Costs}}
$$

In this case we want to travel some distance $D$ between two points, but we will be forced to pay some cost term that includes *money* $M$, *time* $T$, and stress $S$:

$$
U(x) = \frac{D}{M \cdot T \cdot S}
$$

## Distance Efficiency
But there is something else to consider: should utility increase or decrease as we are *"forced"* to travel **more** distance than the straight line path?

In [None]:
# libs for downloading earth map data
import pathlib
import tempfile
import urllib.request
import zipfile
from tqdm import tqdm

# natural earth 10m cultural shape file
shape_file_url = (
    "https://naciscdn.org/naturalearth/10m/cultural/10m_cultural.zip"
)

# directory to save the downloaded file
data_dir = pathlib.Path("./data/geo/")

# directory to extract the contents of the ZIP file
extract_dir = data_dir / "natural_earth_10m_cultural"

def download_with_progress(url: str, filename: pathlib.Path) -> None:
    """Utility function to download files with progress bar."""
    # set headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }

    # create request
    request = urllib.request.Request(url, headers=headers)

    # open URL
    response = urllib.request.urlopen(request)

    # get total file size
    total = int(response.getheader('Content-Length', 0))

    # set block size for updating
    block_size = 1024

    # create progress bar
    tqdm_bar = tqdm(total=total, unit='B', unit_scale=True, desc=filename.name)

    # open new file
    with open(filename, 'wb') as file:
        # loop over blocks
        while True:
            # get next data chunk
            buffer = response.read(block_size)

            # quit if finished
            if not buffer:
                break

            # write out data chunk
            file.write(buffer)

            # update progress bar
            tqdm_bar.update(len(buffer))

    # finished
    tqdm_bar.close()

# check for previous download
if not extract_dir.exists():
    # create the parent tree (in case it doesn't exist)
    data_dir.mkdir(parents=True, exist_ok=True)
    
    # create a temporary directory to download the file
    with tempfile.TemporaryDirectory() as temp_dir:
        # filename for the downloaded ZIP file
        zip_filename = pathlib.Path(temp_dir) / "10m_cultural.zip"
        
        # download the ZIP file with progress bar
        download_with_progress(shape_file_url, zip_filename)
    
        # extract the contents of the ZIP file
        with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
            zip_ref.extractall(extract_dir)

In [None]:
import geopandas as gpd

# get world geo data
world_map = gpd.read_file(extract_dir / "10m_cultural")

# filter for South American countries
south_america = world_map[(world_map["CONTINENT"] == "South America")]

In [None]:
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import Point, LineString

# set initial figure counter to 1
fig_count = 1

# set the style to a dark theme
plt.style.use("dark_background")

# match website background
plt.rcParams["figure.facecolor"] = "#181818"
plt.rcParams["axes.facecolor"] = "#181818"
plt.rcParams["axes.edgecolor"] = "#181818"

# get subplot for South America
fig, ax = plt.subplots(figsize=(8, 8))

# set the view for only the mainland
ax.set_xlim([-85, -30])
ax.set_ylim([-60, 15])

# turn off axes for the South America plot only
ax.axis("off")

# plot all countries in a single fill color (a green from the viridis colormap)
south_america.plot(
    ax=ax,
    color=plt.cm.viridis(0.67),
    edgecolor="#181818",
    linewidth=0.2,
);

# define airport coordinates (longitude, latitude)
bogota = (-74.0721, 4.7110)          # Bogotá, Colombia (BOG)
buenos_aires = (-58.3816, -34.6037)  # Buenos Aires, Argentina (EZE)
sao_paulo = (-46.6333, -23.5505)     # São Paulo, Brazil (GRU)
lima = (-77.0428, -12.0464)          # Lima, Peru (LIM)
santiago = (-70.6483, -33.4489)      # Santiago, Chile (SCL)
santa_cruz = (-63.1805, -17.7892)    # Santa Cruz de la Sierra, Bolivia (VVI)
asuncion = (-57.5759, -25.2637)      # Asunción, Paraguay (ASU)

# create GeoDataFrame for airports
airports = gpd.GeoDataFrame(
    {
        "City": [
            "Bogotá (BOG)", "São Paulo (GRU)", "Buenos Aires (EZE)", 
            "Lima (LIM)", "Santiago (SCL)", "Santa Cruz (VVI)",
            "Asunción (ASU)"
        ],
        "geometry": [
            Point(bogota), Point(sao_paulo), Point(buenos_aires),
            Point(lima), Point(santiago), Point(santa_cruz),
            Point(asuncion)
        ]
    },
    crs="EPSG:4326",
)

# define different flight paths (corrected)
routes = {
    "Direct Flight": [bogota, buenos_aires],
    "Via São Paulo": [bogota, sao_paulo, buenos_aires],
    "Via Lima": [bogota, lima, buenos_aires],
    "Via Santiago": [bogota, santiago, buenos_aires],
    "Via Santa Cruz": [bogota, santa_cruz, buenos_aires],
    "Via Lima & Santiago": [bogota, lima, santiago, buenos_aires],
    "Via Lima & Asunción": [bogota, lima, asuncion, buenos_aires],
    "Via Lima & Santa Cruz": [bogota, lima, santa_cruz, buenos_aires],
    "Via Santa Cruz & Santiago": [bogota, santa_cruz, santiago, buenos_aires],
}

# plot flight paths with a specific label for each route
for route, path in routes.items():
    # Set color to red for Direct Flight, else a different color
    if "Direct" in route:
        color = "red"
        style = "solid"

    else:
        color = plt.cm.inferno(1.0)
        style = "dashed"
    
    # create a LineString for the route and plot it with the specified color
    line = LineString(path)
    ax.plot([point[0] for point in line.coords], [point[1] for point in line.coords], 
            color=color, linewidth=2, linestyle=style, alpha=0.7)

# plot airports
airports.plot(ax=ax, color="red", markersize=50, edgecolor="black")

# add labels
for x, y, label in zip(airports.geometry.x, airports.geometry.y, airports["City"]):
    ax.text(x, y, label, fontsize=8, ha="right", color="black", bbox=dict(facecolor="white", alpha=0.7))

# Manually add the legend for "Direct Flight" and "Non-Direct Flight"
from matplotlib.lines import Line2D

legend_elements = [
    Line2D([0], [0], color="red", lw=2, linestyle="solid", label="Direct Flight"),
    Line2D([0], [0], color=plt.cm.inferno(1.0), lw=2, linestyle="dashed", label="Non-Direct Flight")
]

# Add legend
ax.legend(handles=legend_elements, loc="upper right", fontsize=8, frameon=False)

# set title
plt.suptitle(
    f"Figure {fig_count}. Bogota to Buenos Aires Flight Paths.", y=0.0001, fontsize=10,
)

# increment fig count
fig_count += 1

# adjust padding
plt.tight_layout(pad=0.5)  # Adjust the padding as needed

# display
plt.show()

In *Figure 1* above we can see the various flight paths possible from *Bogota, Colombia* to *Buenos Aires, Argentina*. Clearly, any flight path that is not a *direct flight* will have a longer total distance. Should our *utility* function reward or penalize this? It would not make sense to *"reward"* excess distance, because excess distance is simply *"wasted"* distance. This can be understood in terms of **accuracy** or **efficiency**: we do not want to simply travel from point `A` to `Z` (e.g. *Bogota* to *Buenos Aires*) by any path; we want to get there by the path of *minimal waste*. So how do we account for this *"waste distance"*? We clearly must, otherwise our *utility function* would *increase* as the distance grows **beyond** the *"ideal distance."* The solution comes from understanding that the *"benefit"* gained from traveling is not an absolute distance but a *distance efficiency* ($E_D$):

$$
E_D = \frac{D_{\text{ideal}}}{D_{\text{actual}}}
$$

We are not *"buying"* raw distance, but instead an *"efficiency"* or *"accuracy"* of distance from our origin `A` to our target destination `Z`. Now we have the correct behavior:

- **$D_{\text{actual}} = D_{\text{ideal}}; E_D = 1$**: utility is maximized, since the denominator is just
  $M \cdot T \cdot S$, and there is no penalty from the actual distance.
- **$D_{\text{actual}} > D_{\text{ideal}}; 0 < E_D < 1$**: utility decreases as the actual distance grows larger.

We can now simplify the *utility function* even further:

$$
U(x) = \frac{E_D}{M \cdot T \cdot S}
$$
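
As a rough illustration of $E_D$, the sketch below compares the direct route with the route via Lima, using the airport coordinates from the map and a great-circle (haversine) distance as the "ideal" distance. The helper names `haversine_km` and `distance_efficiency` are assumptions made for this example:

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(p1, p2):
    """Great-circle distance in km between two (longitude, latitude) points."""
    lon1, lat1 = map(math.radians, p1)
    lon2, lat2 = map(math.radians, p2)
    a = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_efficiency(path):
    """E_D = D_ideal / D_actual for a path given as a list of waypoints."""
    d_ideal = haversine_km(path[0], path[-1])
    d_actual = sum(haversine_km(a, b) for a, b in zip(path, path[1:]))
    return d_ideal / d_actual


# (longitude, latitude), as in the plotting code
bogota = (-74.0721, 4.7110)
lima = (-77.0428, -12.0464)
buenos_aires = (-58.3816, -34.6037)

e_direct = distance_efficiency([bogota, buenos_aires])
e_via_lima = distance_efficiency([bogota, lima, buenos_aires])
print(f"direct: {e_direct:.3f}, via Lima: {e_via_lima:.3f}")
```

The direct route has an efficiency of exactly 1, while any route with a stop falls strictly between 0 and 1.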

## Mathematics of Stress
The final term of the equation that we need to discuss is *stress* $S$. The *money* $M$ and *time* $T$ terms are relatively simple:

+ $M$ represents all the financial costs spent on the trip (e.g. ticket cost, baggage fees, etc.)
+ $T$ represents all the *temporal* costs spent on the trip (e.g. flight time, layovers, traffic, delays, etc.)

But *stress* $S$ is a bit more abstract and complex:

+ *Stress accumulates over time*, meaning that longer trips generally result in higher stress.
+ However, stress is not **linearly** dependent on time.

We *"intuitively"* feel this: a non-stop, 12-hour flight might feel more stressful than two 6-hour flights with a comfortable layover in between. Conversely, too many layovers can introduce additional stressors such as airport transfers, security checks, and unpredictability.

So how do we go about deriving the *mathematical model* of stress? The *"key"* insight here is that *stress* behaves like [continuous compounding](https://en.wikipedia.org/wiki/Compound_interest), which is just a type of [exponential growth](https://en.wikipedia.org/wiki/Exponential_growth):

$$
S = S_0 e^{kT}
$$

where:  
- $S_0$ is the **baseline stress** (initial stress level).  
- $k$ is a **stress growth rate**, which could depend on factors like discomfort, unpredictability, or psychological burden.  
- $T$ is the **total time** spent on the trip.

Let's look at an example scenario that really shows what this *mathematical model* captures about *stress* $S$. For a **12-hour** direct flight:

$$
S_{\text{long-haul}} = S_0 e^{k(12)}
$$

For a trip broken into **three 4-hour segments** with **two layovers** (each reducing stress by a factor of $\alpha$, where $0 < \alpha < 1$):

1. **First Flight (4 hours)**:

$$
S_1 = S_0 e^{k(4)}
$$

2. **First Layover (partial stress reset)**:

$$
S_2 = \alpha S_1 = \alpha S_0 e^{k(4)}
$$

3. **Second Flight (4 hours)**:

$$
S_3 = S_2 e^{k(4)} = \alpha S_0 e^{k(4)} e^{k(4)} = \alpha S_0 e^{k(8)}
$$

4. **Second Layover (another stress reset)**:

$$
S_4 = \alpha S_3 = \alpha^2 S_0 e^{k(8)}
$$

5. **Final Flight (4 hours)**:

$$
S_5 = S_4 e^{k(4)} = \alpha^2 S_0 e^{k(8)} e^{k(4)} = \alpha^2 S_0 e^{k(12)}
$$

Thus, the final stress for the segmented trip is:

$$
S_{\text{segmented}} = \alpha^2 S_0 e^{k(12)}
$$
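
The same pattern extends beyond three legs. As a sketch under the assumptions of this example (stress grows only in flight, and every layover applies the same factor $\alpha$), a trip with $n$ layovers and flight legs of durations $t_1, \dots, t_{n+1}$ that sum to $T$ ends with:

$$
S_{\text{segmented}} = \alpha^{n} S_0 \prod_{j=1}^{n+1} e^{k t_j} = \alpha^{n} S_0 e^{k \sum_{j} t_j} = \alpha^{n} S_0 e^{kT}
$$

Setting $n = 2$ and $t_j = 4$ recovers the result above, and $n = 0$ recovers the long-haul case.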

If we compare the two *travel strategies*:

1. **Long-haul flight stress**:

$$
S_{\text{long-haul}} = S_0 e^{k(12)}
$$

2. **Segmented flight stress (with layovers reducing stress by $\alpha^2$)**:

$$
S_{\text{segmented}} = \alpha^2 S_0 e^{k(12)}
$$

Since $0 < \alpha < 1$, we have:

$$
S_{\text{segmented}} < S_{\text{long-haul}}
$$

indicating that **layovers help reduce accumulated stress**, despite the total time being the same. Certainly this depends on $\alpha < 1$, otherwise we would not see such *"benefit"* from *interrupting stress growth*. To make this clearer, let's look at the same example with actual values:

- **Baseline stress level**: $S_0 = 10$ (arbitrary unit)
- **Stress growth rate**: $k = 0.1$ per hour
- **Total travel time**: $T = 12$ hours  
- **Layover stress reduction factor**: $\alpha = 0.5$ (each layover halves the accumulated stress)

First we calculate the continuous *long-haul* flight:

$$
S_{\text{long-haul}} = 10 \cdot e^{0.1 \times 12}
$$

$$
S_{\text{long-haul}} = 10 \cdot e^{1.2}
$$

$$
S_{\text{long-haul}} \approx 10 \cdot 3.32 = 33.2
$$

Now the *multi-segment* flight:

1. **First Flight (4 hours)**:

$$
S_1 = 10 \cdot e^{0.1 \times 4} = 10 \cdot e^{0.4}
$$

$$
S_1 \approx 10 \cdot 1.49 = 14.9
$$

2. **First Layover (partial stress reset)**:

$$
S_2 = 0.5 \times 14.9 = 7.45
$$

3. **Second Flight (4 hours)**:

$$
S_3 = 7.45 \cdot e^{0.1 \times 4} = 7.45 \cdot e^{0.4}
$$

$$
S_3 \approx 7.45 \cdot 1.49 = 11.1
$$

4. **Second Layover (another stress reset)**:

$$
S_4 = 0.5 \times 11.1 = 5.55
$$

5. **Final Flight (4 hours)**:

$$
S_5 = 5.55 \cdot e^{0.1 \times 4} = 5.55 \cdot e^{0.4}
$$

$$
S_5 \approx 5.55 \cdot 1.49 = 8.3
$$

The final comparison:
- **Long-haul stress**: $S_{\text{long-haul}} \approx 33.2$
- **Segmented flight stress**: $S_{\text{segmented}} \approx 8.3$

So, despite both flights covering **12 hours**, breaking it into *3 segments with 2 layovers* **reduced stress from 33.2 to 8.3**, which is a **75% reduction in final stress**! Key insights:

- A direct flight accumulates **continuous** stress exponentially.
- Layovers **reset** stress by a factor $\alpha$, preventing excessive buildup.
- Even though layovers might be inconvenient, they can significantly reduce total travel stress.

This illustrates why **ultra-long-haul flights** might feel much more exhausting compared to breaking the journey into shorter, more manageable segments.
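
The worked example can be checked with a few lines of Python. This is only a sketch of the model as described in this section: the function name `segmented_stress` is an assumption, and stress is taken to grow only during flight legs, with each layover multiplying it by $\alpha$:

```python
import math


def segmented_stress(s0, k, leg_hours, alpha):
    """Final stress after flight legs separated by layovers.

    Stress grows as exp(k * t) during each leg and is multiplied by
    alpha at each layover, mirroring the worked example in the text.
    """
    stress = s0
    for i, hours in enumerate(leg_hours):
        if i > 0:
            stress *= alpha  # layover: partial reset before the next leg
        stress *= math.exp(k * hours)  # in-flight exponential growth
    return stress


long_haul = segmented_stress(s0=10, k=0.1, leg_hours=[12], alpha=0.5)
segmented = segmented_stress(s0=10, k=0.1, leg_hours=[4, 4, 4], alpha=0.5)

print(round(long_haul, 1), round(segmented, 1))  # 33.2 8.3
print(round(1 - segmented / long_haul, 2))       # 0.75
```

Because the reduction depends only on $\alpha^2$, changing $k$ or $S_0$ changes both totals but leaves the 75% figure untouched.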

## Growing Stress
One aspect of the *mathematics of stress* that we have not yet discussed is what *influences* the *growth rate* $k$. For our purposes, let us consider some *"travel examples"* in an attempt to reason about what causes *"stress"* in travel. Consider the following two scenarios:

**Scenario 1**: Empty Train
- You board a **nearly empty** train car.  
- There is **ample seating**, personal space, and no loud distractions.  
- No one is **blocking the aisles**, slowing you down, or invading your personal space.  
- The ride is **quiet**, predictable, and uninterrupted.

**Scenario 2**: Packed Subway
- You board a **rush-hour subway**, standing shoulder to shoulder with strangers.  
- Noise levels are high: **announcements, conversations, crying babies**.  
- The **air is stuffy**, and you’re forced to hold onto a metal bar with limited mobility.  
- At each stop, **more people cram in**, pushing against you.  
- The train **suddenly stops** due to delays, adding uncertainty.

There are many different ways we can *"describe"* the difference between the two scenarios: *noise*, *threat*, *attack*, *interruption*, *diversion*, *chaos*, *distraction*, etc. But ultimately one word encapsulates the difference: **stimulation**. **Scenario 1** is less stimulating because the environment supplies fewer stimuli: fewer people, less noise, fewer distractions, and fewer interruptions. **Scenario 2**, by contrast, is **total chaos**.

But how might we represent this *mathematically*? We simply need to account for both the *frequency* and the *amplification* of *each stimulus*:

$$
k = k_0 + \sum_{i} \sigma_i \lambda_i
$$

where:
- $k_0$ = **intrinsic stress growth rate** (baseline stress accumulation).
- $i$ = index for different types of stimuli.
- $\sigma_i$ = **stress amplification factor** for stimulus type $i$.
- $\lambda_i$ = **stimulus frequency** (events per minute, per hour, etc.).
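
A small sketch of this sum in Python follows. The function name `stress_growth_rate` and the cabin stimuli are hypothetical, chosen only to show how the terms combine:

```python
def stress_growth_rate(stimuli, k0=0.0):
    """k = k0 + sum(sigma_i * lambda_i) over (sigma, lambda) pairs."""
    return k0 + sum(sigma * lam for sigma, lam in stimuli)


# hypothetical stimuli: (amplification sigma, frequency lambda per hour)
quiet_cabin = [(0.01, 2), (0.02, 1)]
noisy_cabin = [(0.01, 20), (0.02, 10)]

k_quiet = stress_growth_rate(quiet_cabin, k0=0.05)
k_noisy = stress_growth_rate(noisy_cabin, k0=0.05)
print(k_quiet, k_noisy)
```

Each stimulus contributes in proportion to how often it occurs, so a tenfold increase in every frequency multiplies the stimulus-driven part of $k$ by ten while leaving $k_0$ unchanged.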

Now we can return to the *two scenarios* above:

**Scenario 1**: Empty Train
- **Very few passengers** → low interaction frequency.  
- **Minimal announcements**.  
- **Smooth ride, few disturbances**.  

We assign reasonable values for the stress equation:  

$$
S = S_0 e^{kT}
$$

where, taking $k_0 = 0$ for simplicity, the growth rate $k$ is determined by each stimulus's **amplification** and **frequency**:

$$
k = \sigma_{\text{people}} \cdot \lambda_{\text{people}} + \sigma_{\text{announcements}} \cdot \lambda_{\text{announcements}} + \sigma_{\text{movement}} \cdot \lambda_{\text{movement}}
$$

Using estimated values:  

| **Stimulus**    | **Impact Factor** $\sigma$ | **Frequency** $\lambda$ (per min) |
|-----------------|-------------------|----------------|
| People moving   | $0.2$             | $1$            |
| Announcements   | $0.3$             | $0.2$          |
| Train movement  | $0.1$             | $1$            |

$$
k_{\text{empty}} = (0.2 \times 1) + (0.3 \times 0.2) + (0.1 \times 1) = 0.2 + 0.06 + 0.1 = 0.36
$$

For a **30-minute ride**, assuming $S_0 = 1$:

$$
S_{\text{empty}} = 1 \cdot e^{0.36 \times 30} = e^{10.8} \approx 49,021
$$  

**Scenario 2**: Packed Subway
 
- **Crowded conditions** → high people density.  
- **Frequent loud announcements**.  
- **Jostling, noise, and disruptions**.  

Using estimated values:  

| **Stimulus**    | **Impact Factor** $\sigma$ | **Frequency** $\lambda$ (per min) |
|------------------|------------------|----------------|
| People moving   | $0.5$         | $10$      |
| Announcements   | $0.4$         | $2$       |
| Train movement  | $0.2$         | $5$       |

$$
k_{\text{packed}} = (0.5 \times 10) + (0.4 \times 2) + (0.2 \times 5) = 5 + 0.8 + 1 = 6.8
$$

For the same **30-minute ride**, with $S_0 = 1$:

$$
S_{\text{packed}} = 1 \cdot e^{6.8 \times 30} = e^{204} \approx 3.95 \times 10^{88}
$$

The stress growth rate in a **packed subway** ($k = 6.8$) is roughly 19 times higher than in an **empty train** ($k = 0.36$), and because $k$ sits in the exponent, this produces an *astronomical difference* in total stress over time. This highlights how a crowded, noisy environment rapidly compounds stress, while a quiet, low-stimulus setting keeps it comparatively low. Ultimately, **stimulus frequency** (the rate of encounters with people, noise, and movement) emerges as the key driver of stress accumulation.
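
The two scenarios can be recomputed directly from the table values. The sketch below assumes $k_0 = 0$ and $S_0 = 1$ as in the calculations above; the dictionary layout and variable names are illustrative only:

```python
import math

# (sigma, lambda per minute) pairs taken from the two tables above
empty_train = {"people": (0.2, 1), "announcements": (0.3, 0.2), "movement": (0.1, 1)}
packed_subway = {"people": (0.5, 10), "announcements": (0.4, 2), "movement": (0.2, 5)}

RIDE_MINUTES = 30
S_0 = 1.0


def growth_rate(stimuli):
    """Sum of sigma * lambda over all stimuli (k_0 assumed to be zero)."""
    return sum(sigma * lam for sigma, lam in stimuli.values())


k_empty = growth_rate(empty_train)
k_packed = growth_rate(packed_subway)
s_empty = S_0 * math.exp(k_empty * RIDE_MINUTES)
s_packed = S_0 * math.exp(k_packed * RIDE_MINUTES)

# log10 puts both results on one comparable scale
print(f"empty:  k = {k_empty:.2f}, log10(S) = {math.log10(s_empty):.1f}")
print(f"packed: k = {k_packed:.2f}, log10(S) = {math.log10(s_packed):.1f}")
print(f"ratio of growth rates: {k_packed / k_empty:.1f}")
```

Comparing the results on a log scale makes it clear that a modest multiple in $k$ becomes a gap of many orders of magnitude in $S$.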

## Price Utility
One final point needs to be considered before we actually apply the *travel utility function* in the next section, and that is to discuss the *optional* usage of *price utility* as an alternative to *total price*. Instead of summing all absolute costs:  

$$
M = P_{\text{total}} = P_{\text{ticket}} + P_{\text{fees}} + P_{\text{hotel}} + \dots
$$

we redefine price in terms of *price utility* by normalizing costs relative to wealth $W$:  

$$
M = \frac{P_{\text{total}}}{W} = \frac{P_{\text{ticket}} + P_{\text{fees}} + P_{\text{hotel}} + \dots}{W}
$$

Substituting for $M$ in the *travel utility function*:

$$
\begin{aligned}
U(x) &= \frac{D_{\text{ideal}}}{\frac{P_{\text{total}}}{W} \cdot T \cdot S \cdot D_{\text{actual}}} \\
U(x) &= \frac{D_{\text{ideal}} \cdot W}{P_{\text{total}} \cdot T \cdot S \cdot D_{\text{actual}}}
\end{aligned}
$$

This formulation acknowledges that the same monetary cost has a vastly different impact depending on the traveler’s financial situation. A $500$ expense may be trivial for one person yet burdensome for another. This *"optional"* way of calculating the $M$ value will be useful when we want to consider the *optimal travel option* for individuals with differing values of *wealth* $W$.
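
To see the effect of the normalization, here is a minimal sketch. The function name `price_utility` and the two wealth levels are hypothetical:

```python
def price_utility(total_price, wealth):
    """Monetary cost as a fraction of the traveler's wealth (M = P_total / W)."""
    if wealth <= 0:
        raise ValueError("wealth must be positive")
    return total_price / wealth


# hypothetical travelers facing the same total price of 500
m_student = price_utility(500, wealth=5_000)
m_executive = price_utility(500, wealth=500_000)

print(m_student, m_executive)  # 0.1 0.001
```

Since $M$ sits in the denominator of $U(x)$, the same itinerary scores 100 times higher for the wealthier traveler in this example, all else being equal.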

## Synthetic Travel Data
Now we can actually apply the *travel utility function* to the previous *travel plan* (i.e. *Bogota* -> *Buenos Aires*). But before we get into *applying* the function to determine the *optimal travel option*, we need to generate some *data* that approximates the *travel options* that will be encountered in reality. Below is a *sample* of *synthetic travel data*:

In [None]:
import numpy as np

# set seed for reproducibility
np.random.seed(42)

# for flipping lat/long coordinates
def flip_coords(coord):
    return (coord[1], coord[0])

# function to calculate the great-circle distance (in km) between two points
def calculate_distance(p1, p2):
    lat1, lon1 = np.radians(p1)
    lat2, lon2 = np.radians(p2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return 6371 * c  # radius of Earth in km

# define constants
D_ideal = calculate_distance(flip_coords(bogota), flip_coords(buenos_aires))
FLIGHT_SPEED_KMH = 800
BASE_COST_PER_1000KM = 150

# airlines that fly each route
route_airlines = {
    "Direct Flight": ["Avianca", "LATAM", "Aerolineas Argentinas"],
    "Via São Paulo": ["LATAM", "Gol", "Avianca + Gol", "Azul"],
    "Via Lima": ["LATAM", "Sky Airline", "Viva Air + LATAM"],
    "Via Santiago": ["LATAM", "Sky Airline", "Jetsmart"],
    "Via Santa Cruz": ["BoA", "Amaszonas", "LATAM"],
    "Via Lima & Santiago": ["LATAM", "Sky Airline + LATAM"],
    "Via Lima & Asunción": ["LATAM + Paranair", "Copa + Paranair"],
    "Via Lima & Santa Cruz": ["LATAM + BoA", "Sky Airline + BoA"],
    "Via Santa Cruz & Santiago": ["BoA + LATAM", "Amaszonas + Sky Airline"],
}

# travel class impact on stress growth rate (lower k means less stress)
class_k_modifier = {
    "Economy": 1.0,
    "Business": 0.7,
    "First Class": 0.3
}

# day of week impact (weekends are less stressful for travel)
day_k_modifier = {
    "Mon": 1.2, "Tue": 1.1, "Wed": 1.1, "Thu": 1.0,
    "Fri": 1.3, "Sat": 0.9, "Sun": 0.8
}

# month impact on stress (holidays and peak travel months like December or July increase stress)
month_k_modifier = {
    "Jan": 1.1, "Feb": 1.0, "Mar": 1.0, "Apr": 1.0, "May": 1.0, "Jun": 1.1,
    "Jul": 1.3, "Aug": 1.0, "Sep": 1.0, "Oct": 1.0, "Nov": 1.1, "Dec": 1.3
}

# time of day impact (red-eye flights = less stress)
time_k_modifier = {
    "Morning": 1.2, "Afternoon": 1.1, "Evening": 1.0, "Red-Eye": 0.7
}

# weather conditions and their impact on stress
weather_k_modifier = {
    "Clear": 0.8,  # ideal conditions
    "Cloudy": 1.0, # neutral
    "Rainy": 1.2,  # increased stress
    "Stormy": 1.5, # very stressful
    "Snowy": 1.4   # increased stress, especially for delays and logistics
}

# lounge access types and stress modifiers
lounge_access = {
    "None": 1.0,        # no reduction in stress
    "Basic": 0.9,       # mild reduction in stress
    "Premium": 0.7,     # moderate reduction in stress
    "VIP": 0.5,         # significant reduction in stress
    "Private Room": 0.2 # near-zero stress
}

# class multipliers (how much more expensive each class is compared to Economy)
class_price_multiplier = {
    "Economy": 1.0,
    "Business": 2.0,
    "First Class": 4.0
}

# lounge access cost multipliers
lounge_price_multiplier = {
    "None": 1.0,
    "Basic": 1.1,
    "Premium": 1.3,
    "VIP": 1.6,
    "Private Room": 2.0
}

# stress tier definitions (S_0)
stress_tiers = {
    "low": 1,
    "mild": 3,
    "moderate": 5,
    "high": 8,
    "extreme": 10
}

# stress growth rate modifiers for each airline
airline_k_modifier = {
    "Avianca": 0.9,
    "LATAM": 1.0,
    "Aerolineas Argentinas": 1.1,
    "Gol": 1.2,
    "Sky Airline": 1.3,
    "Viva Air": 1.4,
    "Jetsmart": 1.5,
    "BoA": 1.1,
    "Amaszonas": 1.3,
    "Paranair": 1.4,
    "Copa": 1.0,
    "Azul": 1.2
}

# price multipliers for each airline
airline_price_modifier = {
    "Avianca": 1.1,
    "LATAM": 1.2,
    "Aerolineas Argentinas": 1.3,
    "Gol": 1.0,
    "Sky Airline": 0.9,
    "Viva Air": 0.8,
    "Jetsmart": 0.85,
    "BoA": 1.0,
    "Amaszonas": 1.05,
    "Paranair": 1.2,
    "Copa": 1.15,
    "Azul": 1.1
}

In [None]:
import itertools
from typing import Generator, List, Dict, Any, Tuple

# define realistic layover times (as numbers)
valid_layover_times = [1, 2, 3]  # Representing hours for layovers

def get_max_layovers(routes: Dict[str, Any]) -> int:
    """Finds the maximum number of layovers in the given routes dictionary."""
    # set initial count
    max_layovers_count = 0

    # loop over routes
    for path in routes.values():
        # calculate layover number
        layovers = len(path) - 2

        # update max
        max_layovers_count = max(max_layovers_count, layovers)
    
    return max_layovers_count

def generate_layover_combinations(max_layovers: int) -> Generator[List[Tuple[int, str]], None, None]:
    """Generate all possible valid layover combinations based on the specified maximum number of layovers."""    
    # yield the special case for direct flights (no layovers, no lounge)
    yield [(0, "None")]  # representing a direct flight with no layovers or lounge access
    
    # generate all possible combinations of layover times and lounge access
    layover_combinations = list(itertools.product(valid_layover_times, lounge_access.keys()))
    
    # generate combinations for every number of layovers from 1 to max_layovers
    for num_layovers in range(1, max_layovers + 1):
        # generate combinations for the given number of layovers
        for combination in itertools.product(layover_combinations, repeat=num_layovers):
            # flatten the combination and yield it as a list of tuples
            yield list(combination)


def generate_valid_flight_parameters() -> Generator[Dict[str, Any], None, None]:    
    """
    Generate valid flight combinations based on predefined rules and parameters.

    This function yields all valid combinations of flight parameters including flight path,
    airline, month, day, time of day, lounge access, and layover time (if applicable).
    It respects the following rules:
    - Only airlines that are allowed for a specific flight path will be included.
    - Direct flights must have a layover time of 0.
    - For flights with layovers, the layover time must be one of the valid options (1, 2, 3 hours).
    - The function uses a lazy generator, meaning combinations are computed one at a time to 
      minimize memory usage.
    """
    
    # generate all combinations of flight parameters (using itertools.product to create the Cartesian product)
    for flight_path, airline, travel_class, day, month, time_of_day, weather, layover in itertools.product(
        routes.keys(),
        airline_k_modifier.keys(),
        class_k_modifier.keys(),
        day_k_modifier.keys(),
        month_k_modifier.keys(),
        time_k_modifier.keys(),
        weather_k_modifier.keys(),
        list(generate_layover_combinations(get_max_layovers(routes)))
    ):
        # calculate the number of layovers based on the flight path (count the stops)
        num_layovers = len(routes[flight_path]) - 2
        
        # Check if the airline is allowed for the current flight path
        if airline not in route_airlines.get(flight_path, []):
            continue  # Skip this combination if the airline is not valid for the flight path

        # direct flight check: if it's a direct flight, layover must be [(0, 'None')]
        if 'Direct Flight' in flight_path and layover != [(0, 'None')]:
            continue  # Skip direct flight with any layovers

        # for non-direct flights, the combination must have one entry per stop
        # and must not be the direct-flight placeholder [(0, 'None')]
        if 'Via' in flight_path and (len(layover) != num_layovers or layover == [(0, 'None')]):
            continue  # Skip combinations that don't match the route's layovers

        # extract layover_times and lounges from the layover combination (which is a list of tuples)
        layover_times = [time for time, _ in layover]
        lounges = [lounge for _, lounge in layover]
        
        # yield the valid combination of parameters
        yield {
            'Flight Path': flight_path,
            'Airline': airline,
            'Class': travel_class,
            'Day': day,
            'Month': month,
            'Time of Day': time_of_day,
            'Weather': weather,
            'Layover Times': layover_times,
            'Lounges': lounges,
        }

In [None]:
import pandas as pd
from typing import List, Dict, Tuple, Optional

def generate_flight_data(
    routes: Dict[str, List[Tuple[float, float]]],
    num_samples: int,
    base_stress: Optional[str] = None,
    wealth: Optional[float] = None,
    utility_scale: float = 1.0
)-> pd.DataFrame:
    """
    Generates synthetic flight data based on various travel factors such as route, class, 
    time of travel, weather conditions, and stress modifiers.
    
    Args:
        routes (Dict[str, List[Tuple[float, float]]]):
            A dictionary where keys are route names and values are lists of (longitude, latitude) coordinates.
        num_samples (int):
            The number of synthetic flight records to generate.
        base_stress (Union[str, None], optional):
            If set, the stress level will be based on the given tier ('low', 'mild', 'moderate', 'high', 'extreme');
            If None (default), the stress level will be randomized.
        wealth (Optional[float], optional):
            If provided, an additional price utility metric (`U_p = M / W`) is calculated.
        utility_scale (float):
            A scaling factor applied to the final travel utility (`U_t`).
            Defaults to 1.0 (no scaling).
    
    Returns:
        pd.DataFrame: A DataFrame containing the generated flight data, including:
            - Route
            - Airline
            - Class
            - Day
            - Month
            - Time of Day
            - Weather
            - D_actual (Total Distance)
            - T_flight (Total Flight Time)
            - T_layover (Total Layover Time)
            - M (Cost $)
            - Lounges (Airport Lounges Purchased)
            - W (Wealth, if provided)
            - U_p (Price Utility, if W is provided)
            - S_0 (Base Stress)
            - S_flight (Flight Stress)
            - S_layover (Layover Stress)
            - S_total (Total Stress)
            - U_t (Scaled Travel Utility)
    """
    # temp travel option data storage
    data = []

    # generate data
    for _ in range(num_samples):
        # randomly pick route, class, and travel conditions
        route = np.random.choice(list(routes.keys()))
        airlines = np.random.choice(route_airlines[route])
        travel_class = np.random.choice(list(class_k_modifier.keys()))
        day_of_week = np.random.choice(list(day_k_modifier.keys()))
        month = np.random.choice(list(month_k_modifier.keys()))
        time_of_day = np.random.choice(list(time_k_modifier.keys()))
        weather = np.random.choice(list(weather_k_modifier.keys()))

        # get city coordinates for the route
        cities = routes[route]
        distance = 0
        flight_time_per_leg = []

        # calculate total distance and flight times
        for i in range(len(cities) - 1):
            city1, city2 = cities[i], cities[i + 1]
            leg_distance = calculate_distance(flip_coords(city1), flip_coords(city2))
            flight_time_per_leg.append(leg_distance / FLIGHT_SPEED_KMH)
            distance += leg_distance

        # generate random layover times
        layover_times = np.random.uniform(1, 5, len(cities) - 2)  # random layovers per stop
        T_layover = sum(layover_times)  # total layover time

        # control flow for picking lounges
        if len(layover_times) == 0:
            # no lounge for direct flights
            lounge_types = ["None"]
        else:
            # different lounge type for each layover
            lounge_types = [np.random.choice(list(lounge_access.keys())) for _ in range(len(layover_times))]

        # handle composite airlines by splitting and averaging
        airline_parts = airlines.split(" + ")
        airline_k = np.mean([airline_k_modifier[airline] for airline in airline_parts])
        airline_price = np.mean([airline_price_modifier[airline] for airline in airline_parts])

        # compute stress growth rate modifiers
        k_flight = sum((
            class_k_modifier[travel_class], 
            day_k_modifier[day_of_week],
            month_k_modifier[month], 
            time_k_modifier[time_of_day],
            weather_k_modifier[weather], 
            airline_k
        ))

        k_layover = sum((
            day_k_modifier[day_of_week], 
            month_k_modifier[month], 
            time_k_modifier[time_of_day]
        ))

        # set stress tier (or randomize)
        initial_stress_lvl = np.random.choice(list(stress_tiers.keys())) if not base_stress else base_stress

        # get initial stress level
        S_0 = stress_tiers[initial_stress_lvl]

        # compute flight stress
        S_flight = sum(S_0 * np.exp(k_flight * t) for t in flight_time_per_leg)

        # compute layover(s) stress
        S_layover = 0
        for i, layover_time in enumerate(layover_times):
            # get type of lounge for the i-th layover
            lounge_type = lounge_types[i]

            # get stress modifier
            lounge_modifier = lounge_access[lounge_type]

            # modify base layover stress rate
            k_layover_adjusted = k_layover * lounge_modifier

            # update sum of all layover stress
            S_layover += S_0 * np.exp(k_layover_adjusted * layover_time)


        # total them
        S_total = S_flight + S_layover

        # get base cost
        base_cost = (distance / 1000) * BASE_COST_PER_1000KM

        # get lounge costs
        lounge_price_modifiers = [lounge_price_multiplier[lounge] for lounge in lounge_types]
        total_lounge_modifier = np.mean(lounge_price_modifiers)  # averaging across all lounges used

        # get total cost
        M = base_cost * class_price_multiplier[travel_class] * total_lounge_modifier * airline_price

        # compute travel utility
        U_t = D_ideal / (M * (sum(flight_time_per_leg) + T_layover) * S_total * distance)

        # scale travel utility
        U_t *= utility_scale

        # create row dictionary
        row = {
            "Route": route, "Airline": airlines, "Class": travel_class, "Day": day_of_week,
            "Month": month, "Time of Day": time_of_day, "Weather": weather, "D_actual": distance,
            "T_flight": sum(flight_time_per_leg), "T_layover": T_layover, "M ($)": M, "Lounges": " + ".join(lounge_types),
            "S_0": initial_stress_lvl, "S_flight": S_flight, "S_layover": S_layover, "S_total": S_total, "U_t": U_t
        }

        # If wealth is given, compute U_p and scale U_t
        if wealth is not None:
            row["W"] = wealth
            row["U_p"] = M / wealth
            row["U_t"] *= wealth

        data.append(row)

    # define column order
    columns = [
        "Route", "Airline", "Class", "Day", "Month", "Time of Day", "Weather",
        "D_actual", "T_flight", "T_layover", "M ($)", "Lounges"
    ]

    # append wealth columns after "Lounges" (before the stress columns)
    if wealth is not None:
        columns.extend(["W", "U_p"])

    # default
    columns.extend(["S_0", "S_flight", "S_layover", "S_total", "U_t"])

    # finished product
    return pd.DataFrame(data, columns=columns)

In [None]:
# get some sample flight data
sample_flights_df = generate_flight_data(routes, 1000, base_stress="high", utility_scale=1e12)
sample_best_routes = sample_flights_df.loc[sample_flights_df.groupby('Route')['U_t'].idxmax()]
sorted_best_routes = sample_best_routes.sort_values(by='U_t', ascending=False)[:10]

# split the DataFrame into two halves
halfway = len(sorted_best_routes.columns) // 2
first_half = sorted_best_routes.iloc[:, :halfway]
second_half = sorted_best_routes.iloc[:, halfway:]

# Print the two halves separately
print(first_half.round(2).to_string())
print("\n")
print(second_half.round(2).to_string())

Here is a breakdown of what each column in the above *sample data* represents:

- **`Route`**: The specific flight route taken, including possible layovers or connections.
  <br><br>
- **`Airline`**: The airline or airlines providing the travel service. If multiple airlines are used, they are separated by a
  `+` symbol.
  <br><br>
- **`Class`**: The travel class for the majority of the trip, which influences stress and comfort (e.g., Economy, Business,
  First Class).
  <br><br>
- **`Day`**: The day of the week on which the journey begins (e.g., Mon, Tue, Wed).
  <br><br>
- **`Month`**: The month in which the journey begins (e.g., Jan, Feb, Mar).
  <br><br>
- **`Time of Day`**: The approximate time of day when the flight begins (e.g., Morning, Afternoon, Evening, Red-Eye).
  <br><br>
- **`Weather`**: Weather conditions during takeoff or landing, which can influence stress (e.g., Clear, Cloudy, Stormy,
  Snowy).
  <br><br>
- **`D_actual`**: The actual distance covered by the route in kilometers, which may be greater than the ideal distance due to
  layovers or deviations.
  <br><br>
- **`T_flight`**: Total inflight time in hours.
  <br><br> 
- **`T_layover`**: Total layover time in hours.
  <br><br>
- **`M ($)`**: Total monetary cost of the journey in U.S. dollars.
  <br><br>
- **`Lounges`**: The lounge purchased for each layover, which affects both layover stress and cost (e.g., VIP, Basic, Premium,
  Private Room). Multiple lounges are separated by a `+` symbol, and direct flights show `None`.
  <br><br>
- **`S_0`**: Initial stress tier, the baseline from which stress grows during flights and layovers. In this sample it is fixed to `high` for all options via `base_stress`; otherwise it is chosen at random.
  <br><br>
- **`S_flight`**: Stress accumulated during the flight, influenced by factors such as class, flight duration, and weather.
  <br><br>  
- **`S_layover`**: Stress accumulated during layovers, influenced by factors such as lounge access, layover duration, and crowd
  density (modeled through the day, month, and time of day).
  <br><br>
- **`S_total`**: Total stress for the entire journey, calculated as the sum of `S_flight` and `S_layover`.
  <br><br>
- **`U_t`**: The calculated travel utility score for the route, where higher values represent better travel options.  
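
To make the stress columns more concrete, here is a minimal sketch of how `S_flight`, `S_layover`, and `S_total` fit together. It assumes the same exponential form used in the generator above, $S_0 \cdot e^{k \cdot t}$ per leg or layover. The helper name `journey_stress` and every number below are hypothetical and chosen only for illustration; they are not taken from the article's lookup tables.

```python
import math


def journey_stress(s_0, k_flight, k_layover, flight_times, layover_times, lounge_modifiers):
    """Return (S_flight, S_layover, S_total) for one journey.

    Hypothetical helper: each leg and each layover starts from the same
    base stress s_0 and grows exponentially with its duration in hours.
    """
    # stress from every flight leg
    s_flight = sum(s_0 * math.exp(k_flight * t) for t in flight_times)

    # stress from every layover, with the lounge scaling the growth rate
    s_layover = sum(
        s_0 * math.exp(k_layover * modifier * t)
        for t, modifier in zip(layover_times, lounge_modifiers)
    )

    return s_flight, s_layover, s_flight + s_layover


# made-up inputs: two legs (2 h and 3 h) with one 1.5 h layover
no_lounge = journey_stress(2.0, 0.10, 0.05, [2.0, 3.0], [1.5], [1.0])
with_lounge = journey_stress(2.0, 0.10, 0.05, [2.0, 3.0], [1.5], [0.5])

print("S_layover without lounge:", round(no_lounge[1], 3))
print("S_layover with lounge:   ", round(with_lounge[1], 3))
```

In this sketch a lounge modifier below 1 slows the layover growth rate, so the same layover adds less stress, while a direct flight (no layovers) has an `S_layover` of zero.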

The $U_t$ column is calculated using the travel utility function:

$$
U_t = \frac{D_\text{ideal}}{M \times (T_\text{flight} + T_\text{layover}) \times S_\text{total} \times D_\text{actual}}
$$

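Higher utility values indicate more favorable travel options once distance, cost, stress, and time are all taken into account. In the sample above, the result is also multiplied by `utility_scale=1e12`, which rescales the scores without changing their ranking. The $D_{\text{actual}}$ values vary with the stopovers and connecting routes, which can increase the total distance traveled.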
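
As a rough illustration of the formula, the sketch below compares a direct flight with a one-stop alternative between the same two cities. The function `travel_utility` is a hypothetical helper that mirrors the equation above, and all input values are made up for the example rather than drawn from the generated data.

```python
def travel_utility(d_ideal, d_actual, cost, t_flight, t_layover, s_total, scale=1.0):
    """Travel utility: ideal distance divided by the product of all cost terms."""
    total_time = t_flight + t_layover
    return scale * d_ideal / (cost * total_time * s_total * d_actual)


# made-up direct flight: actual distance equals the ideal distance
direct = travel_utility(
    d_ideal=5000, d_actual=5000, cost=800, t_flight=6.0, t_layover=0.0, s_total=4.0
)

# made-up one-stop option: same price, but longer, slower, and more stressful
one_stop = travel_utility(
    d_ideal=5000, d_actual=5600, cost=800, t_flight=6.7, t_layover=2.0, s_total=7.0
)

print("direct is better:", direct > one_stop)
```

Because every cost term sits in the denominator, the one-stop option loses utility three times over: through the extra distance, the extra time, and the extra stress, even though its price is the same.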