# Updated analysis with helper functions

This notebook reworks the original exploratory analysis to use the reusable helpers for supplier name normalization and geolocation lookups. It demonstrates a lightweight pipeline on a small sample of shipments so the logic can be exercised without large datasets.

In [None]:
import pandas as pd

from utils_data_cleansing import load_rules
from utils_geoloc import get_geoloc, load_data
from utils_supplier_name import (
    guess_supplier_name,
    guess_supplier_name_from_priority,
    load_supplier_names,
)


## Load rule and reference data

The cleansing and geolocation helpers read YAML and CSV reference files stored in the repository. `utils_data_cleansing.load_rules` requires PyYAML, whereas `utils_geoloc.load_data` falls back to a minimal parser when PyYAML is not installed. `load_data` also returns the warnings it collects for missing files, so they can be surfaced downstream.
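
The fallback behaviour described above can be pictured with a minimal sketch. This is not the code in `utils_geoloc`; the names `parse_flat_mapping` and `load_mapping_sketch` are hypothetical, and the fallback parser is assumed to handle only flat `key: value` lines.

```python
try:
    import yaml  # PyYAML is treated as optional in this sketch
except ImportError:
    yaml = None


def parse_flat_mapping(text):
    """Hypothetical fallback: parse flat 'key: value' lines into a dict of strings."""
    result = {}
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comments
        if not stripped or stripped.startswith("#"):
            continue
        key, sep, value = stripped.partition(":")
        if not sep:
            continue  # not a 'key: value' line
        result[key.strip()] = value.strip()
    return result


def load_mapping_sketch(path):
    """Return (data, warnings); a missing file gives an empty dict plus one warning."""
    try:
        with open(path, encoding="utf-8") as handle:
            text = handle.read()
    except FileNotFoundError:
        return {}, [f"file not found: {path}"]
    if yaml is not None:
        return yaml.safe_load(text) or {}, []
    # Without PyYAML, values stay plain strings and nesting is not supported
    return parse_flat_mapping(text), []
```

Returning the warnings next to the data, rather than raising, lets the notebook keep running on a partial setup and report what was missing afterwards.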


In [None]:
rules = load_rules("rules.yml")
known_suppliers = load_supplier_names("CONSIGNEE_NAME.csv")
geoloc_rules, geoloc_warnings = load_data("geoloc.yml")

print(f"Loaded {len(rules)} cleansing rules and {len(known_suppliers)} supplier names.")
print(f"Geolocation rules: {len(geoloc_rules)} entries; warnings: {geoloc_warnings}")


## Build a small shipment sample

The sample mixes clean and noisy supplier names along with imperfect location spellings. The priority-based helper shows how multiple raw columns can be evaluated without duplicating logic.


In [None]:
raw_shipments = pd.DataFrame(
    [
        {
            "tms_id": 1,
            "consignee_name": "Huebner GmbH",
            "fallback_name": "Hubner",
            "loading_city": "Amsterdam",
            "consignee_city": "Hannover",
        },
        {
            "tms_id": 2,
            "consignee_name": "DB",
            "fallback_name": "Deutsche Bahn AG",
            "loading_city": "Paris",
            "consignee_city": "Berlin",
        },
        {
            "tms_id": 3,
            "consignee_name": "   ALSTOM  transport  ",
            "fallback_name": "Alstom",
            "loading_city": "Lisboa",
            "consignee_city": "Madrid",
        },
        {
            "tms_id": 4,
            "consignee_name": None,
            "fallback_name": "Knorr-Bremse",
            "loading_city": "Munich",
            "consignee_city": "Vienna",
        },
    ]
)
raw_shipments


## Normalize supplier names

`guess_supplier_name_from_priority` evaluates the candidate names in priority order: the preferred `consignee_name` first, then `fallback_name`. When `rules` are provided, the function applies them through `utils_data_cleansing.apply_rule` under the hood.
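
To make the priority idea concrete, here is a rough sketch that walks the candidates in order and returns the first one that matches a known name closely enough. It is an illustration only: `match_by_priority` and `best_match` are hypothetical names, the similarity measure (`difflib.SequenceMatcher`) is swapped in for whatever the real helper uses, and cleansing rules are left out entirely.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse repeated whitespace."""
    return " ".join(text.lower().split())


def best_match(candidate, known_names, min_score):
    """Return the most similar known name, or None if it scores below min_score."""
    best_name, best_score = None, 0.0
    for name in known_names:
        score = SequenceMatcher(None, normalize(candidate), normalize(name)).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_score else None


def match_by_priority(candidates, known_names, min_score=0.65):
    """Try candidates in priority order; the first acceptable match wins."""
    for candidate in candidates:
        # Skip missing or empty values such as None or ""
        if not isinstance(candidate, str) or not candidate.strip():
            continue
        match = best_match(candidate, known_names, min_score)
        if match is not None:
            return match
    return None


known = ["Alstom", "Knorr-Bremse", "Deutsche Bahn AG"]
print(match_by_priority([None, "Knorr-Bremse"], known))
print(match_by_priority(["DB", "Deutsche Bahn AG"], known))
```

In this sketch, a higher `min_score` pushes more rows to the fallback column or to `None`, while a lower one risks accepting a wrong supplier from the preferred column.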


In [None]:
raw_shipments["normalized_consignee"] = raw_shipments.apply(
    lambda row: guess_supplier_name_from_priority(
        [row["consignee_name"], row["fallback_name"]],
        known_suppliers,
        rules=rules,
        min_score=0.65,
    ),
    axis=1,
)
raw_shipments[["tms_id", "consignee_name", "fallback_name", "normalized_consignee"]]


## Geolocate shipment endpoints

The geolocation helper looks up coordinates for a place name. Passing `default_return=[0.0, 0.0]` gives a placeholder pair when a city is missing from `geoloc.yml`, and warnings from the loader can be logged or inspected separately.
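
Because `[0.0, 0.0]` is also a real point (in the Gulf of Guinea), it can help to flag unresolved rows before doing any spatial analysis. The sketch below shows one way to do that. The `lookup_coords` and `is_unresolved` helpers and the one-entry table are hypothetical and do not reflect how `get_geoloc` or `geoloc.yml` are actually structured.

```python
DEFAULT_COORDS = [0.0, 0.0]

# Hypothetical lookup table keyed by lowercase city name: [latitude, longitude]
SAMPLE_TABLE = {"berlin": [52.52, 13.405]}


def lookup_coords(city, table, default=DEFAULT_COORDS):
    """Return a copy of the coordinates for city, or a copy of the default."""
    key = city.strip().lower() if isinstance(city, str) else None
    coords = table.get(key)
    return list(coords) if coords is not None else list(default)


def is_unresolved(coords, default=DEFAULT_COORDS):
    """True when the coordinates are just the placeholder default."""
    return list(coords) == list(default)


for city in ["Berlin", "Atlantis", None]:
    coords = lookup_coords(city, SAMPLE_TABLE)
    print(city, coords, "unresolved" if is_unresolved(coords) else "ok")
```

Returning copies avoids sharing a single mutable default list between rows.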


In [None]:
raw_shipments["pickup_coords"] = raw_shipments["loading_city"].apply(
    lambda name: get_geoloc(name, geoloc_rules, default_return=[0.0, 0.0])
)
raw_shipments["delivery_coords"] = raw_shipments["consignee_city"].apply(
    lambda name: get_geoloc(name, geoloc_rules, default_return=[0.0, 0.0])
)

# Split the coordinate pairs into separate columns for easier analysis
raw_shipments[["pickup_lat", "pickup_lon"]] = pd.DataFrame(
    raw_shipments["pickup_coords"].tolist(), index=raw_shipments.index
)
raw_shipments[["delivery_lat", "delivery_lon"]] = pd.DataFrame(
    raw_shipments["delivery_coords"].tolist(), index=raw_shipments.index
)

raw_shipments[
    [
        "tms_id",
        "loading_city",
        "pickup_lat",
        "pickup_lon",
        "consignee_city",
        "delivery_lat",
        "delivery_lon",
    ]
]


## Next steps

This scaffolded pipeline can be expanded to include distance calculations, clustering, and cost analytics from the original analysis by reusing the normalized names and coordinates produced above.
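
As a starting point for the distance step, a great-circle distance can be computed from the latitude and longitude columns with the haversine formula. This is a sketch under the assumption that coordinates are in decimal degrees; `haversine_km` is a hypothetical helper, not part of the existing utilities, and rows that still carry the `[0.0, 0.0]` placeholder should be excluded first.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    # Clamp to guard against tiny floating-point overshoot above 1.0
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
```

Applied row by row to `pickup_lat`, `pickup_lon`, `delivery_lat`, and `delivery_lon`, this would give a straight-line distance per shipment, which is a lower bound on the actual road or rail distance.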
