# Project B: Supply Chain Data Cleaning Pipeline 🚛🧹
**Author:** Kilian Sender
**Status:** Work in Progress


## 1. Setup & Datenimport


### Data Source
The dataset **"Supply Chain Shipment Pricing Data"** comes from **Pushpit Kamboj**.
* **Original link:** https://www.kaggle.com/datasets/pushpitkamboj/logistics-data-containing-real-world-data?resource=download
* **License:** CC0: Public Domain

This cell loads the CSV along with the required libraries.

In [4]:
import pandas as pd

# Attempt 1: load directly from the current folder
try:
    df_rohdaten = pd.read_csv('incom2024_delay_example_dataset.csv')
    print("✅ File loaded successfully!")
except FileNotFoundError:
    print("❌ File still not found. Check the path!")

    # Detective help: show where we are and what is lying around here
    import os
    print(f"\nI am here: {os.getcwd()}")
    print("The following files are here:", os.listdir())

✅ File loaded successfully!


## 2. Exploratory Analysis (The Detective's View)
We get an overview of data types, missing values, and obvious errors.

In [6]:
df_rohdaten.head()

Unnamed: 0,payment_type,profit_per_order,sales_per_customer,category_id,category_name,customer_city,customer_country,customer_id,customer_segment,customer_state,...,order_region,order_state,order_status,product_card_id,product_category_id,product_name,product_price,shipping_date,shipping_mode,label
0,DEBIT,34.448338,92.49099,9.0,Cardio Equipment,Caguas,Puerto Rico,12097.683,Consumer,PR,...,Western Europe,Vienna,COMPLETE,191.0,9.0,Nike Men's Free 5.0+ Running Shoe,99.99,2015-08-13 00:00:00+01:00,Standard Class,-1
1,TRANSFER,91.19354,181.99008,48.0,Water Sports,Albuquerque,EE. UU.,5108.1045,Consumer,CA,...,South America,Buenos Aires,PENDING,1073.0,48.0,Pelican Sunstream 100 Kayak,199.99,2017-04-09 00:00:00+01:00,Standard Class,-1
2,DEBIT,8.313806,89.96643,46.0,Indoor/Outdoor Games,Amarillo,Puerto Rico,4293.4478,Consumer,PR,...,Western Europe,Nord-Pas-de-Calais-Picardy,COMPLETE,1014.0,46.0,O'Brien Men's Neoprene Life Vest,49.98,2015-03-18 00:00:00+00:00,Second Class,1
3,TRANSFER,-89.463196,99.15065,17.0,Cleats,Caguas,Puerto Rico,546.5306,Consumer,PR,...,Central America,Santa Ana,PROCESSING,365.0,17.0,Perfect Fitness Perfect Rip Deck,59.99,2017-03-18 00:00:00+00:00,Second Class,0
4,DEBIT,44.72259,170.97824,48.0,Water Sports,Peabody,EE. UU.,1546.398,Consumer,CA,...,Central America,Illinois,COMPLETE,1073.0,48.0,Pelican Sunstream 100 Kayak,199.99,2015-03-30 00:00:00+01:00,Standard Class,1
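
`head()` only shows the first rows. The overview of data types and missing values mentioned above could be sketched roughly as follows. The frame `df_demo` is a made-up stand-in for `df_rohdaten`, and the name `overview` is only illustrative.

```python
import pandas as pd

# Toy stand-in for df_rohdaten; the values are invented for illustration.
df_demo = pd.DataFrame({
    "payment_type": ["DEBIT", "TRANSFER", None],
    "customer_id": [12097.683, 5108.1045, 4293.4478],
    "shipping_date": ["2015-08-13 00:00:00+01:00", None, "2015-03-18 00:00:00+00:00"],
})

# One row per column: dtype, number of missing values, share of missing values
overview = pd.DataFrame({
    "dtype": df_demo.dtypes.astype(str),
    "missing": df_demo.isna().sum(),
    "missing_pct": (df_demo.isna().mean() * 100).round(1),
})
print(overview)
```

On the real data, the same three lines would run on `df_rohdaten` instead of `df_demo`.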


## 3. Data Cleaning (The Cleaning Machine)
Here we apply our logic to repair "dirty data".

### 3.1 Fix: Customer Zip Codes
Remove the decimal places and left-pad to 5 digits.
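
As a small, hedged variant of this step: the sketch below pads only values that could actually be parsed and leaves everything else missing, instead of filling gaps with a placeholder. The sample values and the names `zips` and `zip_clean` are assumptions for illustration.

```python
import pandas as pd

# Made-up sample values, including a missing and a non-numeric entry
zips = pd.Series([725.0, 92745.16, 2457.7297, None, "n/a"])

# Non-numeric entries become NaN
numeric = pd.to_numeric(zips, errors="coerce")

# Truncate, convert to text, pad to 5 digits; rows that were NaN stay missing
zip_clean = (
    numeric.dropna()
    .astype(int)
    .astype(str)
    .str.zfill(5)
    .reindex(zips.index)
)
print(zip_clean)
```

Whether missing zip codes should stay missing or get a placeholder is a design choice that depends on the later analysis.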

### 3.2 Fix: ID Columns (Integer Conversion)
We clean the ID columns (customer_id, department_id, category_id) by removing the decimal places.
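
One possible alternative for the ID columns is the nullable `Int64` dtype in pandas, which stores integers and still allows missing values. This is only a sketch with made-up values, and `ids` and `ids_clean` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up sample IDs with decimal noise and one missing value
ids = pd.Series([12097.683, 5108.1045, None, 546.5306])

# np.trunc drops the fractional part; "Int64" keeps the missing value as <NA>
ids_clean = np.trunc(pd.to_numeric(ids, errors="coerce")).astype("Int64")
print(ids_clean)
```

With this approach a missing ID cannot be confused with a real ID of 0.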

### 3.3 Fix: Date Formats
Convert text to a true datetime format.
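
A minimal sketch of such a conversion, assuming the date strings look like the `shipping_date` values in the preview (ISO timestamps with mixed UTC offsets). Normalizing to UTC is an assumption here, not a decision taken in this notebook; `raw` and `shipping_date_clean` are illustrative names.

```python
import pandas as pd

# Two timestamps in the format seen in shipping_date, plus one invalid entry
raw = pd.Series([
    "2015-08-13 00:00:00+01:00",
    "2015-03-18 00:00:00+00:00",
    "not a date",
])

# utc=True brings the mixed offsets onto one time zone;
# errors="coerce" turns unparseable text into NaT instead of raising
shipping_date_clean = pd.to_datetime(raw, utc=True, errors="coerce")
print(shipping_date_clean)
```

Note that the conversion to UTC can shift a local midnight to the previous calendar day, which matters if only the date part is used later.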

In [5]:
df = df_rohdaten.copy()


# Convert to 'int' first (drops the decimals), then to 'str', because only strings can keep a leading '0', e.g. for US zip codes.
# Note: fillna(0) turns missing or unparseable values into the placeholder '00000'.
df['Zip_Clean'] = pd.to_numeric(df['customer_zipcode'], errors='coerce').fillna(0).astype(int).astype(str)

# Anything shorter than 5 digits is left-padded with '0'
df['Zip_Clean'] = df['Zip_Clean'].str.zfill(5)

df['profit_per_order_clean'] = pd.to_numeric(df['profit_per_order'], errors='coerce').round(2)

df['sales_per_customer_clean'] = pd.to_numeric(df['sales_per_customer'], errors='coerce').round(2)

df['category_id_clean'] = pd.to_numeric(df['category_id'], errors='coerce').fillna(0).astype(int)

df['department_id_clean'] = pd.to_numeric(df['department_id'], errors='coerce').fillna(0).astype(int)

df['customer_id_clean'] = pd.to_numeric(df['customer_id'], errors='coerce').fillna(0).astype(int)

# Show the original and cleaned columns side by side
df[[
    'customer_zipcode', 'Zip_Clean',
    'category_id', 'category_id_clean',
    'profit_per_order', 'profit_per_order_clean',
    'sales_per_customer', 'sales_per_customer_clean',
    'department_id', 'department_id_clean',
    'customer_id', 'customer_id_clean',
]].head()


Unnamed: 0,customer_zipcode,Zip_Clean,category_id,category_id_clean,profit_per_order,profit_per_order_clean,sales_per_customer,sales_per_customer_clean,department_id,department_id_clean,customer_id,customer_id_clean
0,725.0,725,9.0,9,34.448338,34.45,92.49099,92.49,3.0,3,12097.683,12097
1,92745.16,92745,48.0,48,91.19354,91.19,181.99008,181.99,7.0,7,5108.1045,5108
2,2457.7297,2457,46.0,46,8.313806,8.31,89.96643,89.97,7.0,7,4293.4478,4293
3,725.0,725,17.0,17,-89.463196,-89.46,99.15065,99.15,4.0,4,546.5306,546
4,95118.6,95118,48.0,48,44.72259,44.72,170.97824,170.98,7.0,7,1546.398,1546


In [6]:
df.head()

Unnamed: 0,payment_type,profit_per_order,sales_per_customer,category_id,category_name,customer_city,customer_country,customer_id,customer_segment,customer_state,...,product_price,shipping_date,shipping_mode,label,Zip_Clean,profit_per_order_clean,sales_per_customer_clean,category_id_clean,department_id_clean,customer_id_clean
0,DEBIT,34.448338,92.49099,9.0,Cardio Equipment,Caguas,Puerto Rico,12097.683,Consumer,PR,...,99.99,2015-08-13 00:00:00+01:00,Standard Class,-1,725,34.45,92.49,9,3,12097
1,TRANSFER,91.19354,181.99008,48.0,Water Sports,Albuquerque,EE. UU.,5108.1045,Consumer,CA,...,199.99,2017-04-09 00:00:00+01:00,Standard Class,-1,92745,91.19,181.99,48,7,5108
2,DEBIT,8.313806,89.96643,46.0,Indoor/Outdoor Games,Amarillo,Puerto Rico,4293.4478,Consumer,PR,...,49.98,2015-03-18 00:00:00+00:00,Second Class,1,2457,8.31,89.97,46,7,4293
3,TRANSFER,-89.463196,99.15065,17.0,Cleats,Caguas,Puerto Rico,546.5306,Consumer,PR,...,59.99,2017-03-18 00:00:00+00:00,Second Class,0,725,-89.46,99.15,17,4,546
4,DEBIT,44.72259,170.97824,48.0,Water Sports,Peabody,EE. UU.,1546.398,Consumer,CA,...,199.99,2015-03-30 00:00:00+01:00,Standard Class,1,95118,44.72,170.98,48,7,1546


Decision: The customer_zipcode column contains synthetic noise (decimal places) and inconsistencies. It is ignored for the analysis. Instead, we use order_country for geographic aggregation.
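
A geographic aggregation by `order_country` might look like the sketch below. Only the column names `order_country` and `profit_per_order` come from the dataset. The rows, the choice of metrics, and the names `df_demo` and `by_country` are assumptions.

```python
import pandas as pd

# Invented rows for illustration
df_demo = pd.DataFrame({
    "order_country": ["Austria", "Argentina", "France", "Austria"],
    "profit_per_order": [34.45, 91.19, 8.31, -89.46],
})

# Number of orders and summed profit per country, highest profit first
by_country = (
    df_demo.groupby("order_country")["profit_per_order"]
    .agg(orders="count", total_profit="sum")
    .sort_values("total_profit", ascending=False)
)
print(by_country)
```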

## 4. Final Check & Export
We check the result and save the clean file.
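
Since this step is still open, here is one hedged sketch of what a check and export could look like. The file name `supply_chain_clean.csv`, the temporary folder, and the two checks are assumptions, and `df_demo` stands in for the cleaned `df`.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the cleaned frame
df_demo = pd.DataFrame({
    "Zip_Clean": ["00725", "92745"],
    "customer_id_clean": [12097, 5108],
})

# Simple sanity checks before exporting
assert df_demo["Zip_Clean"].str.len().eq(5).all()
assert pd.api.types.is_integer_dtype(df_demo["customer_id_clean"])

# Hypothetical output location
out_path = os.path.join(tempfile.gettempdir(), "supply_chain_clean.csv")
df_demo.to_csv(out_path, index=False)

# Read back with Zip_Clean as text, otherwise the leading zeros are lost again
check = pd.read_csv(out_path, dtype={"Zip_Clean": str})
print(check)
```

The `dtype` argument on reload is the important detail: without it, pandas parses the zip codes as integers and `00725` becomes `725`.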