# Analysis v2: scikit-learn-based supplier cleansing

This notebook aligns the exploratory workflow with the TF-IDF matcher in `utils_supplier_name`. It shows how the scikit-learn matcher cleans supplier names, falls back across prioritized name columns, and feeds the normalized data into simple routing and geolocation steps.


In [None]:
import pandas as pd

from utils_data_cleansing import load_rules
from utils_geoloc import get_geoloc, load_data
from utils_supplier_name import (
    batch_guess_supplier_names,
    build_tfidf_matcher,
    guess_supplier_name_from_priority,
    load_supplier_names,
)

pd.set_option('display.max_columns', 100)


## Load cleansing assets

The helpers depend on three files: the rule set in `rules.yml`, the canonical supplier list in `CONSIGNEE_NAME.csv`, and the geolocation mapping in `geoloc.yml`.


In [None]:
rules = load_rules('rules.yml')
known_suppliers = load_supplier_names('CONSIGNEE_NAME.csv')
geoloc_rules, geoloc_warnings = load_data('geoloc.yml')

print(f"Loaded {len(rules)} rules and {len(known_suppliers)} canonical supplier names.")
print(f"Geolocation rules: {len(geoloc_rules)} entries; warnings: {geoloc_warnings}")


## Sample shipments

A compact dataset mixes clean and noisy consignee spellings along with a secondary booking name column. The cities are kept simple so the focus stays on the name cleansing behavior.


In [None]:
raw_shipments = pd.DataFrame(
    [
        {
            'tms_id': 1,
            'consignee_name': 'Huebner GmbH',
            'booking_name': 'Hubner',
            'loading_city': 'Amsterdam',
            'consignee_city': 'Hannover',
        },
        {
            'tms_id': 2,
            'consignee_name': 'DB cargo',
            'booking_name': 'Deutsche Bahn AG',
            'loading_city': 'Paris',
            'consignee_city': 'Berlin',
        },
        {
            'tms_id': 3,
            'consignee_name': 'Kable Technik Polska',
            'booking_name': 'Kabel Technik Polska',
            'loading_city': 'Aachen',
            'consignee_city': 'Wroclaw',
        },
        {
            'tms_id': 4,
            'consignee_name': 'Knorr-Brems',
            'booking_name': '',
            'loading_city': 'Vienna',
            'consignee_city': 'Munich',
        },
    ]
)

raw_shipments


## Normalize consignee names with TF-IDF

The scikit-learn flow fits a character-level TF-IDF model once via `build_tfidf_matcher`, then reuses it inside `batch_guess_supplier_names` and `guess_supplier_name_from_priority`.
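
To make the idea concrete, here is a minimal sketch of character-level TF-IDF matching written directly against scikit-learn. It is not the code in `utils_supplier_name`: the helper names `fit_char_tfidf` and `best_match`, the `char_wb` analyzer, the `(2, 4)` n-gram range, and the four canonical names are all assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


def fit_char_tfidf(canonical_names):
    """Fit a character n-gram TF-IDF model once (hypothetical helper)."""
    # Analyzer and n-gram range are assumed values, not the project's settings.
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), lowercase=True)
    matrix = vectorizer.fit_transform(canonical_names)
    return vectorizer, matrix


def best_match(name, vectorizer, matrix, canonical_names, min_score=0.7):
    """Return (canonical_name, score); the name is None below min_score."""
    query = vectorizer.transform([name])
    # TF-IDF rows are L2-normalized by default, so the dot product
    # equals the cosine similarity.
    scores = linear_kernel(query, matrix).ravel()
    best = int(scores.argmax())
    score = float(scores[best])
    return (canonical_names[best] if score >= min_score else None), score


# Illustrative canonical list, not the contents of CONSIGNEE_NAME.csv.
canonical = ["Huebner GmbH", "DB Cargo AG", "Kabel Technik Polska", "Knorr-Bremse AG"]
vectorizer, matrix = fit_char_tfidf(canonical)

match, score = best_match("Kable Technik Polska", vectorizer, matrix, canonical, min_score=0.0)
print(match, round(score, 3))
```

Character n-grams tolerate transposed or missing letters better than word tokens, because a single typo changes only a few of the n-grams that make up a name.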


In [None]:
# Fit a reusable matcher on the canonical supplier list.
vectorizer, tfidf_matrix, canonical_suppliers = build_tfidf_matcher(known_suppliers)

raw_shipments['normalized_consignee'] = batch_guess_supplier_names(
    raw_shipments['consignee_name'],
    canonical_suppliers,
    rules=rules,
    min_score=0.7,
)

raw_shipments['priority_normalized'] = raw_shipments.apply(
    lambda row: guess_supplier_name_from_priority(
        [row['consignee_name'], row['booking_name']],
        canonical_suppliers,
        rules=rules,
        min_score=0.7,
    ),
    axis=1,
)

raw_shipments[['tms_id', 'consignee_name', 'booking_name', 'normalized_consignee', 'priority_normalized']]
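
The priority lookup can be pictured as a loop over candidate columns that stops at the first confident match. The sketch below assumes that behavior; it is not the implementation of `guess_supplier_name_from_priority`. To stay dependency-free it swaps the TF-IDF score for `difflib.SequenceMatcher`, and `similarity` and `first_confident_match` are hypothetical names.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in scorer in [0, 1]; the notebook's helpers use TF-IDF instead."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def first_confident_match(candidates, canonical_names, min_score=0.7):
    """Walk candidates in priority order and return the first canonical
    name whose score reaches min_score, or None if no candidate qualifies."""
    for candidate in candidates:
        if not candidate or not candidate.strip():
            continue  # skip empty cells such as booking_name == ''
        best = max(canonical_names, key=lambda name: similarity(candidate, name))
        if similarity(candidate, best) >= min_score:
            return best
    return None


# Illustrative canonical list, not the contents of CONSIGNEE_NAME.csv.
canonical = ["Deutsche Bahn AG", "Knorr-Bremse AG"]

# 'DB cargo' scores below the threshold, so the booking name decides.
print(first_confident_match(["DB cargo", "Deutsche Bahn AG"], canonical))
# An empty booking name is skipped; the consignee name alone is enough here.
print(first_confident_match(["Knorr-Brems", ""], canonical))
```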


## Geolocate shipment endpoints

Once names are normalized, `get_geoloc` attaches coordinates to each shipment's loading city and consignee city. Unknown cities default to `[0.0, 0.0]` so downstream routing logic can flag them.


In [None]:
raw_shipments['pickup_coords'] = raw_shipments['loading_city'].apply(
    lambda name: get_geoloc(name, geoloc_rules, default_return=[0.0, 0.0])
)
raw_shipments['delivery_coords'] = raw_shipments['consignee_city'].apply(
    lambda name: get_geoloc(name, geoloc_rules, default_return=[0.0, 0.0])
)

raw_shipments[['pickup_lat', 'pickup_lon']] = pd.DataFrame(
    raw_shipments['pickup_coords'].tolist(), index=raw_shipments.index
)
raw_shipments[['delivery_lat', 'delivery_lon']] = pd.DataFrame(
    raw_shipments['delivery_coords'].tolist(), index=raw_shipments.index
)

raw_shipments[
    ['tms_id', 'normalized_consignee', 'pickup_lat', 'pickup_lon', 'delivery_lat', 'delivery_lon']
]
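
One way to flag the `[0.0, 0.0]` defaults before routing is a pair of boolean columns. This is a sketch under assumptions: the `*_unresolved` column names and the `is_default` helper are hypothetical, and the coordinates are rough illustrative values rather than output from `geoloc.yml`.

```python
import pandas as pd

DEFAULT_COORDS = (0.0, 0.0)  # sentinel returned for unknown cities

# Illustrative frame with the same lat/lon columns as above.
shipments = pd.DataFrame(
    {
        "tms_id": [1, 2, 3],
        "pickup_lat": [52.37, 0.0, 50.78],
        "pickup_lon": [4.90, 0.0, 6.08],
        "delivery_lat": [52.38, 52.52, 0.0],
        "delivery_lon": [9.73, 13.40, 0.0],
    }
)


def is_default(frame, prefix):
    """True where both latitude and longitude equal the sentinel."""
    return (frame[f"{prefix}_lat"] == DEFAULT_COORDS[0]) & (
        frame[f"{prefix}_lon"] == DEFAULT_COORDS[1]
    )


shipments["pickup_unresolved"] = is_default(shipments, "pickup")
shipments["delivery_unresolved"] = is_default(shipments, "delivery")

# Keep only shipments whose two endpoints both resolved.
routable = shipments[~(shipments["pickup_unresolved"] | shipments["delivery_unresolved"])]
routable[["tms_id", "pickup_lat", "pickup_lon", "delivery_lat", "delivery_lon"]]
```

Because `(0.0, 0.0)` is a real point in the Gulf of Guinea, a `None` or NaN sentinel would be a safer choice if the dataset could ever contain locations near it.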


## Next steps

The normalized names and coordinates can now feed the routing prototype (e.g., nearest-neighbor loops or clustering). Because the TF-IDF matcher is pre-fit, the same vectorizer can be reused for larger datasets without changing the cleansing behavior.
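
As a starting point for the routing prototype, a greedy nearest-neighbor ordering over `(lat, lon)` pairs could look like the sketch below. It is an illustration only: `haversine_km` and `nearest_neighbor_order` are hypothetical names, the Earth radius of 6371 km is the usual mean-radius approximation, and the stops are made-up points.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def nearest_neighbor_order(start, stops):
    """Greedy tour: from the current point, always visit the closest
    remaining stop. `stops` maps a label to a (lat, lon) pair."""
    remaining = dict(stops)
    order, current = [], start
    while remaining:
        name = min(remaining, key=lambda key: haversine_km(current, remaining[key]))
        order.append(name)
        current = remaining.pop(name)
    return order


# Made-up stops along the equator, chosen so the order is easy to check.
stops = {"far": (0.0, 10.0), "near": (0.0, 1.0), "mid": (0.0, 5.0)}
print(nearest_neighbor_order((0.0, 0.0), stops))
```

The greedy heuristic is quick to write and gives a usable baseline, but it does not guarantee the shortest tour.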
