In this notebook we flatten the star schema dataset into a single table so that it can be used for the model prediction task.

The process consists of merging the shipment fact table with each dimension in turn, until only one table is left.

### Library and dataset imports

In this section we import the required libraries and the star schema dataset.

In [1]:
import pandas as pd
import numpy as np

In [2]:
excel_file = '../../../00-Project/datasets/star_schema_dataset_1.xlsx'

In [3]:
# Here we read fact and dimension tables
fact_shipment = pd.read_excel(excel_file, sheet_name='fact_shipment')
dim_customer = pd.read_excel(excel_file, sheet_name='dim_customer')
dim_delivery_address = pd.read_excel(excel_file, sheet_name='dim_delivery_address')
dim_pickup_address = pd.read_excel(excel_file, sheet_name='dim_pickup_address')
dim_date = pd.read_excel(excel_file, sheet_name='dim_date')
dim_service = pd.read_excel(excel_file, sheet_name='dim_service')
dim_carrier = pd.read_excel(excel_file, sheet_name='dim_carrier')
dim_country = pd.read_excel(excel_file, sheet_name='dim_country')

Here we check that everything was imported correctly.

In [4]:
# Print dimensions of each table for verification
print("Initial table dimensions:")
print(f"Fact Shipment: {fact_shipment.shape}")
print(f"Customer: {dim_customer.shape}")
print(f"Delivery Address: {dim_delivery_address.shape}")
print(f"Pickup Address: {dim_pickup_address.shape}")
print(f"Date: {dim_date.shape}")
print(f"Service: {dim_service.shape}")
print(f"Carrier: {dim_carrier.shape}")
print(f"Country: {dim_country.shape}\n")

Initial table dimensions:
Fact Shipment: (711458, 21)
Customer: (7935, 9)
Delivery Address: (712272, 6)
Pickup Address: (712272, 6)
Date: (627, 5)
Service: (2119, 7)
Carrier: (237, 4)
Country: (200, 5)
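The shapes confirm that every sheet was loaded, but they do not show whether each dimension key is unique, which the left merges below rely on to keep the number of shipment rows unchanged. A minimal sketch of such a check on toy data; `key_is_unique` and `toy_dim` are hypothetical names used only for illustration and are not part of the original notebook:

```python
import pandas as pd

# Toy dimension with a deliberately duplicated key (illustrative data only)
toy_dim = pd.DataFrame({
    "customer_id": [1, 2, 2],
    "segmentation": ["A", "B", "B"],
})


def key_is_unique(table: pd.DataFrame, key: str) -> bool:
    """Return True when `key` has no missing values and no repeated values."""
    return bool(table[key].notna().all() and table[key].is_unique)


print(key_is_unique(toy_dim, "customer_id"))                                  # False
print(key_is_unique(toy_dim.drop_duplicates("customer_id"), "customer_id"))  # True
```

In the notebook the same function could be called on each dimension with its own key, for example `dim_customer` with `customer_id`.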



### Some preprocessing

Since the `name` column of the country dimension would otherwise have to be renamed at every merge, we rename it once on the original country table.

In [5]:
dim_country = dim_country.rename(columns={'name': 'name_country'})

In [6]:
dim_country.columns

Index(['country_id', 'name_country', 'iso_country_code', 'continent', 'EU'], dtype='object')

### Shipment and customer merging

In this section we merge the shipment and customer tables.

In [7]:
# We check the columns of both tables as reference
print(f"Shipment columns: {fact_shipment.columns}")
print(f"customer columns: {dim_customer.columns}")

Shipment columns: Index(['shipment_id', 'customer_price', 'expected_carrier_price',
       'final_carrier_price', 'weight', 'shipment_type', 'insurance_type',
       'customer_id', 'pickup_address_id', 'delivery_address_id', 'service_id',
       'domain_name', 'booking_state', 'lms_plus', 'exworks_id', 'margin',
       'created_date_id', 'pickup_date_id', 'real_pickup_date_id',
       'delivery_date_id', 'real_delivery_date_id'],
      dtype='object')
customer columns: Index(['customer_id', 'created_date', 'domain_name', 'main_industry_name',
       'industry_sector_name', 'segmentation', 'sequence_number',
       'structure_number', 'is_master'],
      dtype='object')


Here we rename the customer columns before merging the tables. Letting the merge rename overlapping columns such as `domain_name` can lead to confusion and errors, so we prefer doing it beforehand.

In [8]:
# Here we create a dictionary to rename all customer columns except customer_id
customer_rename = {
    col: f"{col}_customer" 
    for col in dim_customer.columns 
    if col != 'customer_id'
}

# Here we rename the customer columns (rename returns a new DataFrame, so dim_customer is untouched)
dim_customer_renamed = dim_customer.rename(columns=customer_rename)

# Here we merge it with fact_shipment
df = fact_shipment.merge(
    dim_customer_renamed,
    on='customer_id',
    how='left'
)

print(f"Shape after customer merge: {df.shape}\n")

Shape after customer merge: (711458, 29)
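Comparing the shape before and after the merge shows that no rows were duplicated, but it does not show how many shipments found no matching customer. As a hedged sketch on toy data (the frames and values below are made up), pandas' `validate` and `indicator` options of `merge` can make both checks explicit:

```python
import pandas as pd

# Toy tables; customer_id 99 has no match in the dimension on purpose
toy_fact = pd.DataFrame({"shipment_id": [10, 11, 12], "customer_id": [1, 2, 99]})
toy_customer = pd.DataFrame({
    "customer_id": [1, 2],
    "segmentation_customer": ["A", "B"],
})

merged = toy_fact.merge(
    toy_customer,
    on="customer_id",
    how="left",
    validate="many_to_one",  # raises MergeError if customer_id repeats in toy_customer
    indicator=True,          # adds a `_merge` column with 'both' or 'left_only'
)

# Count the fact rows that found no customer
unmatched = int((merged["_merge"] == "left_only").sum())
print(f"Rows: {len(merged)}, unmatched customer_id values: {unmatched}")

# Remove the helper column once the check is done
merged = merged.drop(columns="_merge")
```

The same two arguments could be added to any of the merges in this notebook without changing their result, apart from the temporary `_merge` column.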



Here we check that everything went as expected.

In [9]:
df.columns

Index(['shipment_id', 'customer_price', 'expected_carrier_price',
       'final_carrier_price', 'weight', 'shipment_type', 'insurance_type',
       'customer_id', 'pickup_address_id', 'delivery_address_id', 'service_id',
       'domain_name', 'booking_state', 'lms_plus', 'exworks_id', 'margin',
       'created_date_id', 'pickup_date_id', 'real_pickup_date_id',
       'delivery_date_id', 'real_delivery_date_id', 'created_date_customer',
       'domain_name_customer', 'main_industry_name_customer',
       'industry_sector_name_customer', 'segmentation_customer',
       'sequence_number_customer', 'structure_number_customer',
       'is_master_customer'],
      dtype='object')
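For comparison, this is what pandas does by default when two tables share a column name and nothing is renamed beforehand. The example uses toy frames with made-up values; only the column names mirror the notebook:

```python
import pandas as pd

# Both toy tables carry a `domain_name` column, as fact_shipment and dim_customer do
left = pd.DataFrame({"customer_id": [1], "domain_name": ["shop.example"]})
right = pd.DataFrame({"customer_id": [1], "domain_name": ["corp.example"]})

# Default behaviour: pandas appends the generic suffixes _x and _y
default_merge = left.merge(right, on="customer_id", how="left")
print(list(default_merge.columns))
# ['customer_id', 'domain_name_x', 'domain_name_y']

# Renaming beforehand keeps the fact column intact and labels the dimension column
renamed_merge = left.merge(
    right.rename(columns={"domain_name": "domain_name_customer"}),
    on="customer_id",
    how="left",
)
print(list(renamed_merge.columns))
# ['customer_id', 'domain_name', 'domain_name_customer']
```

The generic `_x` and `_y` suffixes say nothing about which table a column came from, which becomes harder to follow as more dimensions are merged in.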

### Delivery, country and shipment merging

In this section we merge the delivery and country dimensions, and then merge the resulting table with the shipment table. We avoid doing all of this in a single chain of merges because several tables share column names (such as `created_date` and `domain_name`), which would cause naming conflicts.

In [10]:
# Here we print the country and delivery columns to have as reference
print(f"Country columns: {dim_country.columns}")
print(f"Delivery columns: {dim_delivery_address.columns}")

Country columns: Index(['country_id', 'name_country', 'iso_country_code', 'continent', 'EU'], dtype='object')
Delivery columns: Index(['delivery_address_id', 'created_date', 'domain_name', 'country_id',
       'postal_code', 'city'],
      dtype='object')


As the first step, we merge delivery address with country.

In [11]:
# Here we merge the delivery address with country
delivery_with_country = dim_delivery_address.merge(
    dim_country,
    on='country_id',
    how='left'
)

# Here we drop 'country_id' since it's not needed
delivery_with_country = delivery_with_country.drop('country_id', axis=1)

In [12]:
# We check that everything went as expected
delivery_with_country.columns

Index(['delivery_address_id', 'created_date', 'domain_name', 'postal_code',
       'city', 'name_country', 'iso_country_code', 'continent', 'EU'],
      dtype='object')

In this step we rename the relevant columns.

In [13]:
# Here we create a dictionary to rename all columns except delivery_address_id for the same reasons as in the above section
delivery_rename = {
    col: f"{col}_delivery" 
    for col in delivery_with_country.columns 
    if col != 'delivery_address_id'
}

# Here we rename the columns
delivery_with_country = delivery_with_country.rename(columns=delivery_rename)

Here we merge the resulting table with the shipment table.

In [14]:
df = df.merge(
    delivery_with_country,
    on='delivery_address_id',
    how='left'
)

# Here we drop 'delivery_address_id' since it's not needed anymore
df = df.drop('delivery_address_id', axis=1)

print(f"Shape after delivery address merges: {df.shape}\n")

Shape after delivery address merges: (711458, 36)



### Pickup, country and shipment merging

In this section, as in the previous one, we first merge the pickup and country dimensions, then merge the resulting table with the shipment table.

In [15]:
# Here we check the columns
print(f"Country columns: {dim_country.columns}")
print(f"Pickup columns: {dim_pickup_address.columns}")

Country columns: Index(['country_id', 'name_country', 'iso_country_code', 'continent', 'EU'], dtype='object')
Pickup columns: Index(['pickup_address_id', 'created_date', 'domain_name', 'country_id',
       'postal_code', 'city'],
      dtype='object')


As the first step, we merge the country with the pickup dimension.

In [16]:
pickup_with_country = dim_pickup_address.merge(
    dim_country,
    on='country_id',
    how='left'
)

# Here we drop the country ID
pickup_with_country = pickup_with_country.drop('country_id', axis=1)

In [17]:
pickup_with_country.columns

Index(['pickup_address_id', 'created_date', 'domain_name', 'postal_code',
       'city', 'name_country', 'iso_country_code', 'continent', 'EU'],
      dtype='object')

Next, we create a dictionary in order to change the column names.

In [18]:
pickup_rename = {
    col: f"{col}_pickup" 
    for col in pickup_with_country.columns 
    if col != 'pickup_address_id'
}

# Here we rename the columns
pickup_with_country = pickup_with_country.rename(columns=pickup_rename)


In [19]:
print(pickup_with_country.columns)

Index(['pickup_address_id', 'created_date_pickup', 'domain_name_pickup',
       'postal_code_pickup', 'city_pickup', 'name_country_pickup',
       'iso_country_code_pickup', 'continent_pickup', 'EU_pickup'],
      dtype='object')


Finally, we merge the resulting table with the shipment table.

In [20]:
df = df.merge(
    pickup_with_country,
    on='pickup_address_id',
    how='left'
)

# Here we drop the pickup address ID as it's not needed anymore
df = df.drop('pickup_address_id', axis=1)
print(f"Shape after pickup address merges: {df.shape}\n")

Shape after pickup address merges: (711458, 43)
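The delivery and pickup sections follow the same pattern: rename every non-key column with a suffix, left-merge on the key, then drop the key. A minimal sketch of how that pattern could be factored into one function; `merge_dimension` and the toy tables are hypothetical and not part of the original notebook:

```python
import pandas as pd


def merge_dimension(fact: pd.DataFrame, dim: pd.DataFrame, key: str, suffix: str) -> pd.DataFrame:
    """Left-merge `dim` into `fact` on `key`, suffixing every other dimension column."""
    rename_map = {col: f"{col}_{suffix}" for col in dim.columns if col != key}
    return fact.merge(dim.rename(columns=rename_map), on=key, how="left")


# Toy data with the same key names as the notebook (values are made up)
toy_fact = pd.DataFrame({
    "shipment_id": [1, 2],
    "pickup_address_id": [10, 11],
    "delivery_address_id": [20, 21],
})
toy_pickup = pd.DataFrame({"pickup_address_id": [10, 11], "city": ["Milan", "Lyon"]})
toy_delivery = pd.DataFrame({"delivery_address_id": [20, 21], "city": ["Berlin", "Madrid"]})

flat = merge_dimension(toy_fact, toy_pickup, "pickup_address_id", "pickup")
flat = merge_dimension(flat, toy_delivery, "delivery_address_id", "delivery")

# The address keys are no longer needed once their attributes are attached
flat = flat.drop(columns=["pickup_address_id", "delivery_address_id"])
print(list(flat.columns))
# ['shipment_id', 'city_pickup', 'city_delivery']
```

Keeping the steps written out, as the notebook does, is easier to inspect cell by cell; a helper like this mainly reduces the risk of copy-paste slips between near-identical sections.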



In [21]:
df.columns

Index(['shipment_id', 'customer_price', 'expected_carrier_price',
       'final_carrier_price', 'weight', 'shipment_type', 'insurance_type',
       'customer_id', 'service_id', 'domain_name', 'booking_state', 'lms_plus',
       'exworks_id', 'margin', 'created_date_id', 'pickup_date_id',
       'real_pickup_date_id', 'delivery_date_id', 'real_delivery_date_id',
       'created_date_customer', 'domain_name_customer',
       'main_industry_name_customer', 'industry_sector_name_customer',
       'segmentation_customer', 'sequence_number_customer',
       'structure_number_customer', 'is_master_customer',
       'created_date_delivery', 'domain_name_delivery', 'postal_code_delivery',
       'city_delivery', 'name_country_delivery', 'iso_country_code_delivery',
       'continent_delivery', 'EU_delivery', 'created_date_pickup',
       'domain_name_pickup', 'postal_code_pickup', 'city_pickup',
       'name_country_pickup', 'iso_country_code_pickup', 'continent_pickup',
       'EU_pickup'],
  

### Service and carrier dimensions merging

In this section we merge the service and carrier dimensions, and then merge the resulting table into the shipment table.

In [22]:
# Here we check the columns
print(f"Service columns: {dim_service.columns}")
print(f"carrier columns: {dim_carrier.columns}")

Service columns: Index(['service_id', 'created_date', 'name', 'service_type', 'transport_type',
       'carrier_id', 'domain_name'],
      dtype='object')
carrier columns: Index(['carrier_id', 'name', 'created_date', 'domain_name'], dtype='object')


First, we rename the columns of the tables.

In [23]:
# Here we rename the columns of carrier for clarity
carrier_rename = {
   col: f"{col}_carrier" 
   for col in dim_carrier.columns 
   if col != 'carrier_id'
}

dim_carrier = dim_carrier.rename(columns=carrier_rename)

In [24]:
# Here we do the same for the service
service_rename = {
   col: f"{col}_service" 
   for col in dim_service.columns 
   if col not in ['service_id', 'service_type', 'transport_type', 'carrier_id']
}

dim_service = dim_service.rename(columns=service_rename)

Now we can merge the carrier table into the service table.

In [25]:
service_carrier = dim_service.merge(
   dim_carrier,
   on='carrier_id',
   how='left'
)

# Here we drop 'carrier_id' as it's not needed
service_carrier = service_carrier.drop('carrier_id', axis=1)

Finally, we merge the resulting table into the shipment table.

In [26]:
df = df.merge(
   service_carrier,
   on='service_id',
   how='left'
)

# Here we drop 'service_id'
df = df.drop('service_id', axis=1)

print(f"Shape after service and carrier merges: {df.shape}\n")

Shape after service and carrier merges: (711458, 50)



### Date dimensions merging

In this section we merge the date dimension into the shipment table once for each of its date ID columns.

In [27]:
# Here we review the columns
print(dim_date.columns)

Index(['full_date', 'year', 'month', 'quarter', 'date_id'], dtype='object')


In [28]:
# For each date ID column, we merge the date dimension into the table
for date_type in ['created_date', 'pickup_date', 'delivery_date', 'real_pickup_date', 'real_delivery_date']:
    date_id_col = f'{date_type}_id'
    if date_id_col in df.columns:
        df = df.merge(
            dim_date,
            left_on=date_id_col,
            right_on='date_id',
            how='left',
            suffixes=('', f'_{date_type}')
        )
        # Here we rename the date columns to avoid confusion
        df = df.rename(columns={
            'year': f'year_{date_type}',
            'month': f'month_{date_type}',
            'quarter': f'quarter_{date_type}',
            'full_date': f'full_date_{date_type}'
        })
        
print(f"Shape after date merges: {df.shape}\n")

Shape after date merges: (711458, 75)
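The loop above leaves one `date_id` column per merge, which is why they are dropped in a later cell. An alternative sketch, on toy data, renames the date dimension for each role before merging so that no leftover `date_id` columns are created. The toy frames and the names `role_dates` and `id_col` are illustrative assumptions, not the notebook's code:

```python
import pandas as pd

# Toy date dimension with the same column names as dim_date (values are made up)
toy_dates = pd.DataFrame({
    "date_id": [1, 2],
    "full_date": pd.to_datetime(["2024-01-15", "2024-04-02"]),
    "year": [2024, 2024],
    "month": [1, 4],
    "quarter": [1, 2],
})
toy_fact = pd.DataFrame({
    "shipment_id": [100],
    "created_date_id": [1],
    "pickup_date_id": [2],
})

flat = toy_fact
for date_type in ["created_date", "pickup_date"]:
    id_col = f"{date_type}_id"
    # Rename date_id to the fact's key and suffix every other column with the role
    role_dates = toy_dates.rename(
        columns=lambda col: id_col if col == "date_id" else f"{col}_{date_type}"
    )
    # Merge on the shared key name, then drop the key since it is no longer needed
    flat = flat.merge(role_dates, on=id_col, how="left").drop(columns=id_col)

print(list(flat.columns))
```

With this variant the resulting column names match the ones produced by the notebook's loop (for example `year_created_date`), while the ID columns are removed inside the loop instead of in a separate cleanup step.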



In [29]:
# Here we check the result
print(df.columns)

Index(['shipment_id', 'customer_price', 'expected_carrier_price',
       'final_carrier_price', 'weight', 'shipment_type', 'insurance_type',
       'customer_id', 'domain_name', 'booking_state', 'lms_plus', 'exworks_id',
       'margin', 'created_date_id', 'pickup_date_id', 'real_pickup_date_id',
       'delivery_date_id', 'real_delivery_date_id', 'created_date_customer',
       'domain_name_customer', 'main_industry_name_customer',
       'industry_sector_name_customer', 'segmentation_customer',
       'sequence_number_customer', 'structure_number_customer',
       'is_master_customer', 'created_date_delivery', 'domain_name_delivery',
       'postal_code_delivery', 'city_delivery', 'name_country_delivery',
       'iso_country_code_delivery', 'continent_delivery', 'EU_delivery',
       'created_date_pickup', 'domain_name_pickup', 'postal_code_pickup',
       'city_pickup', 'name_country_pickup', 'iso_country_code_pickup',
       'continent_pickup', 'EU_pickup', 'created_date_service', 

In [30]:
# Here we remove the id columns that aren't needed anymore
columns_to_drop = [
    'date_id',
    'created_date_id', 'pickup_date_id', 'delivery_date_id',
    'real_pickup_date_id', 'real_delivery_date_id',
    'date_id_pickup_date', 'date_id_delivery_date', 
    'date_id_real_pickup_date', 'date_id_real_delivery_date'
]

df = df.drop(columns=columns_to_drop, errors='ignore')

In [31]:
print(df.columns)

Index(['shipment_id', 'customer_price', 'expected_carrier_price',
       'final_carrier_price', 'weight', 'shipment_type', 'insurance_type',
       'customer_id', 'domain_name', 'booking_state', 'lms_plus', 'exworks_id',
       'margin', 'created_date_customer', 'domain_name_customer',
       'main_industry_name_customer', 'industry_sector_name_customer',
       'segmentation_customer', 'sequence_number_customer',
       'structure_number_customer', 'is_master_customer',
       'created_date_delivery', 'domain_name_delivery', 'postal_code_delivery',
       'city_delivery', 'name_country_delivery', 'iso_country_code_delivery',
       'continent_delivery', 'EU_delivery', 'created_date_pickup',
       'domain_name_pickup', 'postal_code_pickup', 'city_pickup',
       'name_country_pickup', 'iso_country_code_pickup', 'continent_pickup',
       'EU_pickup', 'created_date_service', 'name_service', 'service_type',
       'transport_type', 'domain_name_service', 'name_carrier',
       'created_

### Saving the dataset

In this section we save the processed dataframe to a CSV file.

In [32]:
output_file = '../../../00-Project/datasets/flattened_dataset.csv'

# Save to CSV
df.to_csv(output_file, index=False)
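CSV files do not store column types, so whoever reads the flattened file back will get pandas' inferred types. The sketch below shows one consequence on toy data, under the assumption (not confirmed by the source) that postal codes are text values that may start with zeros. It writes to an in-memory buffer instead of the real output path:

```python
import io

import pandas as pd

# Toy frame; the postal code value is made up for illustration
toy = pd.DataFrame({"shipment_id": [1], "postal_code_pickup": ["00184"]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading with default settings lets pandas infer an integer column
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Passing dtype keeps the column as text
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"postal_code_pickup": str})

print(inferred["postal_code_pickup"].iloc[0])  # 184
print(as_text["postal_code_pickup"].iloc[0])   # 00184
```

The same applies to the `full_date_*` columns, which come back as plain strings unless they are parsed again, for example with the `parse_dates` argument of `pd.read_csv`.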