# Updated analysis with helper functions

This notebook reworks the original exploratory analysis to use the reusable helpers for supplier name normalization and geolocation lookups. It demonstrates a lightweight pipeline on a small sample of shipments so the logic can be exercised without large datasets.

In [None]:
import pandas as pd

from utils_data_cleansing import load_rules
from utils_geoloc import get_geoloc, load_data
from utils_supplier_name import (
    guess_supplier_name,
    guess_supplier_name_from_priority,
    load_supplier_names,
)


## Load rule and reference data

The cleansing and geolocation helpers read YAML and CSV reference files stored in the repository. `utils_data_cleansing.load_rules` requires PyYAML, whereas `utils_geoloc.load_data` falls back to a minimal parser when PyYAML is not installed. `load_data` also returns the warnings it collects for missing files so they can be surfaced downstream.
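
Since `load_rules` depends on PyYAML, it can help to check for the package before loading anything. The snippet below is a minimal sketch and not part of the helper modules: `has_pyyaml` is a hypothetical name, and how `load_rules` behaves when PyYAML is absent is not shown in this notebook.

```python
import importlib.util


def has_pyyaml() -> bool:
    """Return True when PyYAML (import name `yaml`) can be imported."""
    # find_spec returns None for a top-level module that is not installed.
    return importlib.util.find_spec("yaml") is not None


if has_pyyaml():
    print("PyYAML is available.")
else:
    # Assumption: without PyYAML only the geolocation loader remains usable.
    print("PyYAML is missing; install it before loading cleansing rules.")
```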


In [None]:
rules = load_rules("rules.yml")
known_suppliers = load_supplier_names("CONSIGNEE_NAME.csv")
geoloc_rules, geoloc_warnings = load_data("geoloc.yml")

print(f"Loaded {len(rules)} cleansing rules and {len(known_suppliers)} supplier names.")
print(f"Geolocation rules: {len(geoloc_rules)} entries; warnings: {geoloc_warnings}")


## Build a small shipment sample

The sample mixes clean and noisy supplier names along with imperfect location spellings. The priority-based helper shows how multiple raw columns can be evaluated without duplicating logic.


In [None]:
raw_shipments = pd.DataFrame(
    [
        {
            "tms_id": 1,
            "consignee_name": "Huebner GmbH",
            "fallback_name": "Hubner",
            "loading_city": "Amsterdam",
            "consignee_city": "Hannover",
        },
        {
            "tms_id": 2,
            "consignee_name": "DB",
            "fallback_name": "Deutsche Bahn AG",
            "loading_city": "Paris",
            "consignee_city": "Berlin",
        },
        {
            "tms_id": 3,
            "consignee_name": "   ALSTOM  transport  ",
            "fallback_name": "Alstom",
            "loading_city": "Lisboa",
            "consignee_city": "Madrid",
        },
        {
            "tms_id": 4,
            "consignee_name": None,
            "fallback_name": "Knorr-Bremse",
            "loading_city": "Munich",
            "consignee_city": "Vienna",
        },
    ]
)
raw_shipments


## Normalize supplier names

`guess_supplier_name_from_priority` evaluates the preferred and fallback supplier name columns in priority order. When rules are provided, it applies them through `utils_data_cleansing.apply_rule` under the hood.
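
To illustrate the idea of priority-based matching, here is a rough sketch built on the standard library's `difflib`. It is not the implementation in `utils_supplier_name`: the names `best_match` and `match_from_priority`, the similarity measure, and the whitespace and case cleanup are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def best_match(raw, known, min_score=0.65):
    """Return the closest known name, or None if no score reaches min_score."""
    if not isinstance(raw, str) or not raw.strip():
        return None
    # Collapse repeated whitespace and ignore case before comparing.
    cleaned = " ".join(raw.lower().split())
    best_name, best_score = None, 0.0
    for name in known:
        score = SequenceMatcher(None, cleaned, name.lower()).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_score else None


def match_from_priority(candidates, known, min_score=0.65):
    """Try each candidate in order and return the first acceptable match."""
    for raw in candidates:
        match = best_match(raw, known, min_score)
        if match is not None:
            return match
    return None


known = ["Deutsche Bahn AG", "Knorr-Bremse", "Alstom"]
print(match_from_priority([None, "Knorr-Bremse"], known))
print(match_from_priority(["DB", "Deutsche Bahn AG"], known))
```

In this sketch a missing or poorly matching preferred value, such as `None` or the abbreviation `"DB"`, is skipped and the fallback column decides the result.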


In [None]:
raw_shipments["normalized_consignee"] = raw_shipments.apply(
    lambda row: guess_supplier_name_from_priority(
        [row["consignee_name"], row["fallback_name"]],
        known_suppliers,
        rules=rules,
        min_score=0.65,
    ),
    axis=1,
)
raw_shipments[["tms_id", "consignee_name", "fallback_name", "normalized_consignee"]]


## Geolocate shipment endpoints

`get_geoloc` returns coordinates for a normalized place name. When a city is missing from `geoloc.yml`, the calls below return the default `[0.0, 0.0]`. Warnings from the loader can be logged or inspected separately.
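
Because `[0.0, 0.0]` is a real point (in the Gulf of Guinea), rows that received the default would distort any distance computed later. The sketch below flags such rows on a tiny stand-in frame. The column `pickup_missing`, the made-up city "Atlantis", and the approximate Amsterdam coordinates are assumptions for illustration only.

```python
import pandas as pd

DEFAULT_COORDS = [0.0, 0.0]

# Stand-in for the shipment table after geolocation; values are illustrative.
sample = pd.DataFrame(
    {
        "loading_city": ["Amsterdam", "Atlantis"],
        "pickup_coords": [[52.37, 4.90], DEFAULT_COORDS],
    }
)

# Mark rows whose coordinates equal the fallback value.
sample["pickup_missing"] = sample["pickup_coords"].apply(
    lambda coords: list(coords) == DEFAULT_COORDS
)

unresolved = sample.loc[sample["pickup_missing"], "loading_city"].tolist()
print(unresolved)
```

Unresolved cities can then be reviewed or excluded before any routing step.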


In [None]:
raw_shipments["pickup_coords"] = raw_shipments["loading_city"].apply(
    lambda name: get_geoloc(name, geoloc_rules, default_return=[0.0, 0.0])
)
raw_shipments["delivery_coords"] = raw_shipments["consignee_city"].apply(
    lambda name: get_geoloc(name, geoloc_rules, default_return=[0.0, 0.0])
)

# Split the coordinate pairs into separate columns for easier analysis
raw_shipments[["pickup_lat", "pickup_lon"]] = pd.DataFrame(
    raw_shipments["pickup_coords"].tolist(), index=raw_shipments.index
)
raw_shipments[["delivery_lat", "delivery_lon"]] = pd.DataFrame(
    raw_shipments["delivery_coords"].tolist(), index=raw_shipments.index
)

raw_shipments[
    [
        "tms_id",
        "loading_city",
        "pickup_lat",
        "pickup_lon",
        "consignee_city",
        "delivery_lat",
        "delivery_lon",
    ]
]


## Next steps

Use the normalized consignee names and coordinates as the basis for the milkrun loop designs. The same pattern scales to full datasets once distance calculations, clustering, and vehicle constraints are added. The remaining section keeps the notebook focused on how the cleaned data feeds routing heuristics.


## Prototype milkrun loop sketch

The table now contains normalized consignee names alongside pickup and delivery coordinates. A simple nearest-neighbor heuristic can draft loop candidates so planners can review the ordering before applying more robust optimization. The example below builds separate pickup and delivery routes and reports, for each stop, the kilometers from the previous stop together with the route total. Distances are great-circle (haversine) kilometers, and each route is an open path that does not return to its first stop.


In [None]:
import math

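def haversine(lat1, lon1, lat2, lon2):
    """Return the great-circle distance in km between two points given in degrees."""
    radius_km = 6371  # mean Earth radius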
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.atan2(math.sqrt(a), math.sqrt(1 - a))

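def build_loop(points):
    """Order stops with a nearest-neighbor heuristic.

    Returns positional indices into `points` and the total kilometers along
    the open path (no return leg to the first stop).
    """
    if points.empty:
        return [], 0.0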

    remaining = list(range(len(points)))
    route = [remaining.pop(0)]  # start from the first stop

    while remaining:
        current = route[-1]
        next_idx = min(remaining, key=lambda idx: haversine(
            points.iloc[current]["lat"], points.iloc[current]["lon"], points.iloc[idx]["lat"], points.iloc[idx]["lon"]
        ))
        route.append(next_idx)
        remaining.remove(next_idx)

    km_total = 0.0
    for prev, nxt in zip(route, route[1:]):
        km_total += haversine(
            points.iloc[prev]["lat"], points.iloc[prev]["lon"], points.iloc[nxt]["lat"], points.iloc[nxt]["lon"]
        )
    return route, km_total

pickup_points = raw_shipments[["loading_city", "pickup_lat", "pickup_lon"]].rename(
    columns={"pickup_lat": "lat", "pickup_lon": "lon"}
)
delivery_points = raw_shipments[["consignee_city", "delivery_lat", "delivery_lon"]].rename(
    columns={"delivery_lat": "lat", "delivery_lon": "lon"}
)

pickup_route, pickup_km = build_loop(pickup_points)
delivery_route, delivery_km = build_loop(delivery_points)

pickup_plan = pickup_points.iloc[pickup_route].reset_index(drop=True)
pickup_plan["km_from_prev"] = [0.0] + [
    haversine(
        pickup_plan.iloc[i - 1]["lat"],
        pickup_plan.iloc[i - 1]["lon"],
        pickup_plan.iloc[i]["lat"],
        pickup_plan.iloc[i]["lon"]
    )
    for i in range(1, len(pickup_plan))
]

delivery_plan = delivery_points.iloc[delivery_route].reset_index(drop=True)
delivery_plan["km_from_prev"] = [0.0] + [
    haversine(
        delivery_plan.iloc[i - 1]["lat"],
        delivery_plan.iloc[i - 1]["lon"],
        delivery_plan.iloc[i]["lat"],
        delivery_plan.iloc[i]["lon"]
    )
    for i in range(1, len(delivery_plan))
]

print("Pickup loop (nearest-neighbor)")
print(pickup_plan.assign(km_total=pickup_km))
print("\nDelivery loop (nearest-neighbor)")
print(delivery_plan.assign(km_total=delivery_km))
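
The cell above sums only the legs between consecutive stops, so each route is an open path. If a vehicle has to return to its starting point, the closing leg needs to be added. The following is a self-contained sketch of that variant on plain coordinate tuples. The helper names `haversine_km`, `nearest_neighbor_order`, and `route_km` are hypothetical, and the city coordinates are approximate values chosen for illustration.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    radius_km = 6371  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.atan2(math.sqrt(a), math.sqrt(1 - a))


def nearest_neighbor_order(coords):
    """Visit stops greedily, always moving to the closest unvisited stop."""
    if not coords:
        return []
    route = [0]  # start from the first stop
    remaining = list(range(1, len(coords)))
    while remaining:
        current = coords[route[-1]]
        nxt = min(remaining, key=lambda i: haversine_km(*current, *coords[i]))
        route.append(nxt)
        remaining.remove(nxt)
    return route


def route_km(coords, route, closed=False):
    """Sum leg distances; with closed=True, add the leg back to the start."""
    legs = list(zip(route, route[1:]))
    if closed and len(route) > 1:
        legs.append((route[-1], route[0]))
    return sum(haversine_km(*coords[a], *coords[b]) for a, b in legs)


# Approximate coordinates: Paris, Berlin, Amsterdam.
stops = [(48.8566, 2.3522), (52.52, 13.405), (52.3676, 4.9041)]
route = nearest_neighbor_order(stops)
print(route)
print(round(route_km(stops, route)), round(route_km(stops, route, closed=True)))
```

Nearest-neighbor ordering is greedy, so the result depends on the starting stop and is generally not the shortest possible tour. It is best treated as a draft for planners to review.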
