## Supplier Risk MVP for Schmidt Medical Technology 

- Goal: build a minimal, explainable quantitative risk score per supplier from the customer's ERP export.
- Inputs: 5 CSVs (orders, suppliers, articles, addresses, indices)
- Outputs: tidy result table with "supplier_id", "country", "sum_of_order_volume", "industry", "quantitative_risk_score", plus some visuals
- Remark: all assumptions and weights are set in the CONFIG cell to allow fast iteration

In [5]:
# Imports
import sys
from dataclasses import asdict
from pathlib import Path

import numpy as np
import pandas as pd

# Make the project root importable before importing from src/
sys.path.append(str(Path("..").resolve()))

from src.utils import Config

# Paths
PROJ = Path("..").resolve() # project root: one level up from the notebooks folder
DATA = PROJ / 'data' / 'raw' # portable path to input files
REPORTS = PROJ / 'reports' # results/figures are saved here
REPORTS.mkdir(parents=True, exist_ok=True) # create the folder if missing so saving figures/results does not fail

## CONFIG 
Preset A — LkSG & ESG emphasis 
- w_country = 0.45
- w_concentration = 0.25
- w_industry = 0.20
- w_compliance = 0.10

Why: The brief says Schmidt is indirectly affected by the LkSG and wants help “to calculate and monitor the risk” rather than rely on gut feeling. Tacto’s risk flow highlights outside-in country risk and industry risk, plus spend (our proxy for concentration), and uses certificates to mitigate risk, so we weight the four subscores accordingly.
(See “Intuitive risk analysis… External (outside-in) analysis: country risk, industry risk, spend” and the certificates in supplier management.)
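The preset weights could be combined into a single score as a weighted sum of the four subscores. The sketch below assumes exactly that (the aggregation rule is not spelled out above); `composite_score` and the demo values are hypothetical and only illustrate the arithmetic.

```python
import pandas as pd

# Preset A weights from the CONFIG section
WEIGHTS = {"country": 0.45, "concentration": 0.25, "industry": 0.20, "compliance": 0.10}

def composite_score(subscores: pd.DataFrame, weights: dict) -> pd.Series:
    """Weighted sum of subscores in [0, 1]; weights are renormalized to sum to 1."""
    w = pd.Series(weights, dtype=float)
    w = w / w.sum()
    # Multiply each subscore column by its weight, then add up per supplier
    return subscores[w.index].mul(w, axis=1).sum(axis=1)

# Hypothetical subscores for two suppliers (illustrative values, not real data)
demo = pd.DataFrame(
    {
        "country": [0.2, 0.8],
        "concentration": [0.5, 0.9],
        "industry": [0.7, 0.4],
        "compliance": [0.0, 1.0],
    },
    index=pd.Index([1001, 1002], name="supplier_id"),
)
demo_scores = composite_score(demo, WEIGHTS)
print(demo_scores.round(3))
```

Renormalizing the weights inside the function keeps the score in [0, 1] even if a preset is edited and no longer sums to exactly 1.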

How each subscore is computed (quick plan)
- Country risk (0–1): combine the normalized human_rights_index, inverted because a higher index means lower risk (risk = 1 − norm), with environmental_risk (spelled "enivronmental_risk" in indices.csv), used as-is if higher = worse and inverted otherwise. Starting mix: 70% human rights, 30% environmental.

- Concentration risk (0–1): Herfindahl (HHI) over supplier spend by article (or top-1 spend share). Also add a boost if top-1 article share ≥ top_share_single_source (e.g., 0.8).

- Industry risk (0–1): simple baseline mapping by article industry (initially electronics > filters > plastics > metals). This is the main judgment call: justify it with volatility, complexity, and regulatory exposure, and keep it editable.

- Compliance risk (0–1): certificates_valid lowers risk; missing/invalid increases it; status != 'active' adds a penalty. Mirrors Tacto’s certificate/document focus. 

- We’ll clip extreme values with clip_low=0.05, clip_high=0.95 so one noisy field doesn’t dominate.
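A minimal sketch of the country and concentration subscores from the plan above. It assumes min-max normalization, that a higher environmental value means higher risk, and an additive `boost` of 0.1 for single-source cases; the function names and the `boost` size are hypothetical, and the demo frames are made up.

```python
import pandas as pd

def minmax(s: pd.Series) -> pd.Series:
    """Scale to [0, 1]; NaN stays NaN. A constant series maps to 0."""
    rng = s.max() - s.min()
    return (s - s.min()) / rng if rng > 0 else s * 0.0

def country_risk(indices, hri_weight=0.7, env_weight=0.3, clip_low=0.05, clip_high=0.95):
    hri_risk = 1.0 - minmax(indices["human_rights_index"])  # higher index = lower risk
    env_risk = minmax(indices["enivronmental_risk"])        # assumption: higher = worse
    risk = hri_weight * hri_risk + env_weight * env_risk
    return risk.clip(clip_low, clip_high)

def concentration_risk(orders, top_share_single_source=0.8, boost=0.1):
    # Spend per supplier and article, then each article's share of the supplier's spend
    spend = orders.groupby(["supplier_id", "article_id"])["order_value"].sum()
    shares = spend / spend.groupby(level="supplier_id").transform("sum")
    hhi = (shares ** 2).groupby(level="supplier_id").sum()  # Herfindahl index
    top1 = shares.groupby(level="supplier_id").max()
    # Add the boost where one article dominates the supplier's spend
    return (hhi + boost * (top1 >= top_share_single_source)).clip(0, 1)

# Made-up demo data using the column names from the CSV previews
demo_indices = pd.DataFrame(
    {"human_rights_index": [10.0, 60.0, 110.0], "enivronmental_risk": [20.0, 20.0, 70.0]},
    index=["AAA", "BBB", "CCC"],
)
demo_orders = pd.DataFrame(
    {
        "supplier_id": [1, 1, 2, 2],
        "article_id": [10, 11, 10, 11],
        "order_value": [90.0, 10.0, 50.0, 50.0],
    }
)
demo_country = country_risk(demo_indices)
demo_conc = concentration_risk(demo_orders)
```

Countries with a missing index keep a NaN risk here; how to impute them (for example with a neutral 0.5) is a separate decision.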


In [7]:
sys.path.append(str(Path("..").resolve()))  

CONFIG = Config(
    w_country=0.45, w_concentration=0.25, w_industry=0.20, w_compliance=0.10,
    top_share_single_source=0.80, clip_low=0.05, clip_high=0.95,
    country_hri_weight=0.7, country_env_weight=0.3,
    industry_risk_map={
        "electronics": 0.70,
        "filters":     0.60,
        "plastics":    0.50,
        "metals":      0.40,
        "other":       0.50
    }
)
asdict(CONFIG)

TypeError: Config.__init__() got an unexpected keyword argument 'w_industry'
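The TypeError indicates that the Config class in src/utils.py does not declare a w_industry field. Its actual definition is not shown here, so the following is only a sketch of a dataclass that would accept the keywords used in the cell above. It assumes Config is a dataclass (the cell calls asdict on it); the weight check in `__post_init__` is an added suggestion, not part of the original.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class Config:
    # Hypothetical field list, mirroring the keyword arguments used in the notebook
    w_country: float
    w_concentration: float
    w_industry: float
    w_compliance: float
    top_share_single_source: float
    clip_low: float
    clip_high: float
    country_hri_weight: float
    country_env_weight: float
    industry_risk_map: dict = field(default_factory=dict)

    def __post_init__(self):
        # Guard against presets whose weights do not sum to 1
        total = self.w_country + self.w_concentration + self.w_industry + self.w_compliance
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"subscore weights must sum to 1, got {total}")

cfg = Config(
    w_country=0.45, w_concentration=0.25, w_industry=0.20, w_compliance=0.10,
    top_share_single_source=0.80, clip_low=0.05, clip_high=0.95,
    country_hri_weight=0.7, country_env_weight=0.3,
    industry_risk_map={"electronics": 0.70, "other": 0.50},
)
cfg_dict = asdict(cfg)
```

After changing src/utils.py, the notebook kernel needs a restart (or an importlib.reload of the module) before the new fields are picked up.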

## Step 1: Load & preview CSVs

In [12]:
# Define file paths for all 5 datasets (stored in data/raw/)
FILES = {
    'addresses': DATA / 'addresses.csv',
    'articles': DATA / 'articles.csv',
    'indices': DATA / 'indices.csv',
    'orders': DATA / 'orders.csv',
    'suppliers': DATA / 'suppliers.csv',
}

# Load each existing CSV into a pandas dataframe (df)
# -> key = dataset name (string)
# -> value = corresponding df
dfs = {k: pd.read_csv(v) for k, v in FILES.items() if v.exists()}

# Preview the first 5 rows of each dataset and basic structure info
for name, df in dfs.items():
    print(f"\n--- {name.upper()} ---")
    display(df.head(5))  
    print(df.info())     # column dtypes, non-null counts, memory usage


--- ADDRESSES ---


Unnamed: 0,supplier_id,country
0,1003,Sweden
1,1084,Germany
2,1076,germany
3,1093,Germany
4,1037,germany


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117 entries, 0 to 116
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   supplier_id  117 non-null    int64 
 1   country      117 non-null    object
dtypes: int64(1), object(1)
memory usage: 2.0+ KB
None

--- ARTICLES ---


Unnamed: 0,Article_ID,Article,Industry
0,"""1000001""",Standardteile aus Aluminium,Metals
1,"""1000002""",Blechteile,metals
2,"""1000003""",Rohmetalle,Metals
3,"""1000004""",Elektronische Standardteile,Electronics
4,"""1000005""",Technische Produkte,electronics


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Article_ID  11 non-null     object
 1   Article     11 non-null     object
 2   Industry    11 non-null     object
dtypes: object(3)
memory usage: 392.0+ bytes
None

--- INDICES ---


Unnamed: 0,country_id,human_rights_index,enivronmental_risk
0,AFG,16.0,91.0
1,ALB,94.0,91.0
2,DZA,39.0,91.0
3,ASM,,
4,AND,,91.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249 entries, 0 to 248
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   country_id          249 non-null    object 
 1   human_rights_index  142 non-null    float64
 2   enivronmental_risk  220 non-null    float64
dtypes: float64(2), object(1)
memory usage: 6.0+ KB
None

--- ORDERS ---


Unnamed: 0,order_id,price_per_quantity,quantity,order_value,supplier_id,article_id
0,O00107,470.33,92,43270.36,"""1001""",1000001
1,O00109,241.4,163,39348.2,"""1001""",1000001
2,O00112,357.63,7,2503.41,"""1001""",1000002
3,O00278,865.39,125,108173.75,"""1001""",1000002
4,O00300,889.69,58,51602.02,"""1001""",1000003


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   order_id            500 non-null    object 
 1   price_per_quantity  500 non-null    float64
 2   quantity            500 non-null    int64  
 3   order_value         500 non-null    float64
 4   supplier_id         500 non-null    object 
 5   article_id          500 non-null    int64  
dtypes: float64(2), int64(2), object(2)
memory usage: 23.6+ KB
None

--- SUPPLIERS ---


Unnamed: 0,supplier_id,total_company_revenue,domain,status,certificates_valid
0,1003,1377955000.0,,inactive,no
1,1084,1074170000.0,,active,yes
2,1076,903055200.0,,active,yes
3,1093,745053100.0,,active,yes
4,1037,1700478000.0,,inactive,yes


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106 entries, 0 to 105
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   supplier_id            106 non-null    int64  
 1   total_company_revenue  106 non-null    float64
 2   domain                 0 non-null      float64
 3   status                 106 non-null    object 
 4   certificates_valid     106 non-null    object 
dtypes: float64(2), int64(1), object(2)
memory usage: 4.3+ KB
None


## Step 2: Data Audit

In [15]:
# Is each primary key (pk) unique?
# Count duplicated pk values; duplicates must be deduplicated or aggregated before joining
pk_checks = {
    "addresses": "supplier_id",
    "orders": "order_id",
    "articles": "Article_ID",
    "suppliers": "supplier_id",
    "indices": "country_id",
}
uniqueness = {}
for name, pk in pk_checks.items():
    dupe = dfs[name][pk].duplicated().sum()
    uniqueness[name] = {"pk": pk, "duplicates": int(dupe)}
uniqueness

{'addresses': {'pk': 'supplier_id', 'duplicates': 17},
 'orders': {'pk': 'order_id', 'duplicates': 0},
 'articles': {'pk': 'Article_ID', 'duplicates': 0},
 'suppliers': {'pk': 'supplier_id', 'duplicates': 6},
 'indices': {'pk': 'country_id', 'duplicates': 0}}
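The duplicate counts for addresses and suppliers do not say whether the repeated rows are identical or contradict each other. One way to separate the two cases is sketched below; `split_duplicates` is a hypothetical helper and the demo frame is made up, with mixed country casing like the addresses preview.

```python
import pandas as pd

def split_duplicates(df: pd.DataFrame, key: str):
    """Return (exact, conflicts): verbatim repeated rows vs. rows whose key repeats with different values."""
    exact = df[df.duplicated(keep="first")]
    deduped = df.drop_duplicates()
    conflicts = deduped[deduped.duplicated(subset=key, keep=False)]
    return exact, conflicts

# Made-up sample: supplier 1 is repeated verbatim, supplier 2 differs only in casing
demo_addresses = pd.DataFrame(
    {
        "supplier_id": [1, 1, 2, 2, 3],
        "country": ["Germany", "Germany", "Germany", "germany", "Sweden"],
    }
)
exact_raw, conflicts_raw = split_duplicates(demo_addresses, "supplier_id")

# Normalizing the casing first turns the apparent conflict into an exact duplicate
normalized = demo_addresses.assign(country=demo_addresses["country"].str.strip().str.title())
exact_norm, conflicts_norm = split_duplicates(normalized, "supplier_id")
```

Exact duplicates can simply be dropped, while remaining conflicts need a rule (for example, keep the most frequent value) or manual review.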

In [17]:
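# Referential integrity: fraction of child rows whose key has no match in the parent table
orders = dfs["orders"]; articles = dfs["articles"]; suppliers = dfs["suppliers"]; addresses = dfs["addresses"]

def norm_key(s):
    # IDs are int64 in some files and quoted strings (e.g. '"1001"') in others, so compare as clean strings
    return s.astype(str).str.strip().str.strip('"')

ri = {}
# orders names the column "article_id"; articles names it "Article_ID"
ri["orders->articles"]  = (~norm_key(orders["article_id"]).isin(norm_key(articles["Article_ID"]))).mean()
ri["orders->suppliers"] = (~norm_key(orders["supplier_id"]).isin(norm_key(suppliers["supplier_id"]))).mean()
ri["addresses->suppliers"] = (~norm_key(addresses["supplier_id"]).isin(norm_key(suppliers["supplier_id"]))).mean()

# fraction of missing references (0.0 = perfect), shown as a percentage
{k: round(v*100,2) for k,v in ri.items()}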
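Beyond the fraction of missing references, it can help to list which keys are orphaned. The sketch below uses a left merge with indicator=True as an anti-join; `clean_key` and `orphan_keys` are hypothetical helpers, and the demo frames are made up to mimic the mixed ID formats in the previews (quoted strings in orders, integers in suppliers).

```python
import pandas as pd

def clean_key(s: pd.Series) -> pd.Series:
    """Compare IDs as strings without surrounding whitespace or literal quotes."""
    return s.astype(str).str.strip().str.strip('"')

def orphan_keys(child, child_key, parent, parent_key):
    """Return the sorted child key values that have no match in the parent table."""
    left = pd.DataFrame({"key": clean_key(child[child_key])})
    right = pd.DataFrame({"key": clean_key(parent[parent_key])}).drop_duplicates()
    merged = left.merge(right, on="key", how="left", indicator=True)
    return sorted(merged.loc[merged["_merge"] == "left_only", "key"].unique())

# Made-up sample data
demo_orders = pd.DataFrame({"supplier_id": ['"1001"', '"1001"', '"1002"', '"9999"']})
demo_suppliers = pd.DataFrame({"supplier_id": [1001, 1002, 1003]})

missing = orphan_keys(demo_orders, "supplier_id", demo_suppliers, "supplier_id")
print(missing)
```

The same call works for orders against articles by passing "article_id" and "Article_ID" as the two key names.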