In this notebook, our goal is to transform the original dataset provided by LMS into the star schema that we designed.

### Library and dataset imports

Here we import pandas and install openpyxl with pip from within the notebook, since openpyxl gave us problems locally.

In [None]:
import pandas as pd
# We install openpyxl into the kernel's environment; pip.main() is no longer part of pip's public API
%pip install openpyxl
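
As a plain-Python alternative, the install can be made conditional so that pip runs only when the package is missing. This is a minimal sketch; `is_installed` and `ensure_package` are hypothetical helper names that are not part of the original notebook.

```python
import importlib.util
import subprocess
import sys


def is_installed(module_name: str) -> bool:
    """Return True if the module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


def ensure_package(module_name: str) -> None:
    """Install the package with pip only when it is missing (hypothetical helper)."""
    if not is_installed(module_name):
        # Use the interpreter that runs this kernel so the package lands in the right environment
        subprocess.check_call([sys.executable, "-m", "pip", "install", module_name])
```

Calling `ensure_package("openpyxl")` before loading the dataset would then skip the install when openpyxl is already available.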

Here we load the Excel dataset.

In [None]:
file_path = "../../../00-Project/datasets/2024-08-01_LMS_data_2023.xlsx"
xls = pd.ExcelFile(file_path)
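
Before reading the individual sheets, it can help to confirm that the workbook contains every sheet the notebook expects. The sketch below assumes the sheet names used in the loading cells of this notebook; `missing_sheets` is a hypothetical helper and the example input is made up.

```python
# Sheet names read by the loading cells of this notebook
EXPECTED_SHEETS = [
    "shipment", "carrier", "domain", "country", "service", "customer",
    "pickupaddress", "deliveryaddress", "branchcode",
    "branchcode_customer_translation",
]


def missing_sheets(available, expected=EXPECTED_SHEETS):
    """Return the expected sheet names that are not in the available ones."""
    available_set = set(available)
    return [name for name in expected if name not in available_set]


# Made-up example: a workbook that contains only three of the sheets
example_available = ["shipment", "carrier", "domain"]
example_missing = missing_sheets(example_available)
```

With the real workbook, the check would be `missing_sheets(xls.sheet_names)`, which should return an empty list.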

### Load the different sheets into their respective data frames

Here we load each sheet into its own dataframe. We split the loading across several cells for testing purposes. \
Shipment is our fact table, and we identified the following dimensions:
<ul>
  <li>Carrier</li>
  <li>Domain</li>
  <li>Country</li>
  <li>Service</li>
  <li>Customer</li>
  <li>Pickup address</li>
  <li>Delivery address</li>
</ul>

In [None]:
shipment_df = pd.read_excel(xls, 'shipment')

In [None]:
carrier_df = pd.read_excel(xls, 'carrier')
domain_df = pd.read_excel(xls, 'domain')
country_df = pd.read_excel(xls, 'country')

In [None]:
service_df = pd.read_excel(xls, 'service')

In [None]:
customer_df = pd.read_excel(xls, 'customer')

In [None]:
pickupaddress_df = pd.read_excel(xls, 'pickupaddress')
deliveryaddress_df = pd.read_excel(xls, 'deliveryaddress')

In [None]:
branchcode_df = pd.read_excel(xls, 'branchcode')
branchcode_customer_translation_df = pd.read_excel(xls, 'branchcode_customer_translation')

### Preprocessing of shipment

In the shipment dataframe, we run some quality checks on the shipment id.

In [None]:
# We convert 'shipment_id' to numeric, turning invalid (non-numeric) values into NaN
shipment_df['shipment_id'] = pd.to_numeric(shipment_df['shipment_id'], errors='coerce')

# We drop rows where 'shipment_id' is NaN; the copy avoids SettingWithCopyWarning in later assignments
shipment_df_clean = shipment_df.dropna(subset=['shipment_id']).copy()
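
To see what the coercion does, here is a small sketch on made-up data (the values are illustrative and not taken from the LMS dataset). Counting the coerced values before dropping them shows how many rows the check removes.

```python
import pandas as pd

# Made-up rows: two valid ids, one non-numeric id, one missing id
toy = pd.DataFrame({
    "shipment_id": [101, "102", "abc", None],
    "weight": [1.0, 2.5, 3.0, 4.0],
})

# Non-numeric values become NaN instead of raising an error
toy["shipment_id"] = pd.to_numeric(toy["shipment_id"], errors="coerce")

# Count the rows that will be dropped, then drop them
n_invalid = int(toy["shipment_id"].isna().sum())
toy_clean = toy.dropna(subset=["shipment_id"]).copy()
```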

Next, we run quality checks on the date columns.

In [None]:
# With this function we check if the date is valid and within reasonable bounds
def fix_and_check_date(date_str):
    try:
        date = pd.to_datetime(date_str, errors='raise')  # We attempt to convert the date
        if date.year > 2026:  # If the year is out of bounds (beyond 2026)
            return pd.NaT  # We mark it as NaT (invalid)
        return date
    except Exception:  # We avoid a bare except so that KeyboardInterrupt is not swallowed
        return pd.NaT  # We mark as NaT if the conversion fails

# We apply the function to our date columns and we discard the time
for col in ['created_date', 'real_delivery_date', 'real_pickup_date']:
    shipment_df_clean[col] = shipment_df_clean[col].apply(fix_and_check_date).dt.date

# We drop the rows where any date column contains NaT (reassigning instead of using inplace)
shipment_df_clean = shipment_df_clean.dropna(subset=['created_date', 'real_delivery_date', 'real_pickup_date'])
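
The same check can also be written in a vectorized way, which is usually faster than a row-by-row `apply`. This is a sketch on made-up values and not a drop-in replacement: it assumes all dates in a column share one format, whereas the function above parses each value independently. `MAX_YEAR` is an illustrative name.

```python
import pandas as pd

MAX_YEAR = 2026  # same upper bound as in the date check of this notebook

# Made-up values: one valid date, one beyond the bound, one unparsable, one missing
toy_dates = pd.Series(["2023-03-15 10:30:00", "2030-05-01 08:00:00", "not a date", None])

# Unparsable and missing values become NaT
parsed = pd.to_datetime(toy_dates, errors="coerce")

# Keep only dates within the bound; everything else becomes NaT
valid = parsed.where(parsed.dt.year <= MAX_YEAR)
```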


### Creation of the shipment fact

In this section, we merge shipment with the domain in order to create the shipment fact.

In [None]:
# We merge the domain and the shipment, using 'domain_id' as key
fact_shipment = shipment_df_clean.merge(domain_df[['domain_id', 'name']], on='domain_id', how='left')

# We rename 'name' and 'bookingstate' for easier interpretation
fact_shipment = fact_shipment.rename(columns={'name': 'domain_name',
                                              'bookingstate': 'booking_state'})
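
A left merge keeps every shipment, but it can silently duplicate rows if a `domain_id` appears more than once in the domain table, and it leaves the domain name empty for unknown ids. The sketch below shows how the `validate` and `indicator` parameters of `merge` could surface both cases; the data is made up.

```python
import pandas as pd

# Made-up data: the third shipment points to a domain that does not exist
toy_shipments = pd.DataFrame({"shipment_id": [1, 2, 3], "domain_id": [10, 20, 99]})
toy_domains = pd.DataFrame({"domain_id": [10, 20], "name": ["NL", "BE"]})

merged = toy_shipments.merge(
    toy_domains,
    on="domain_id",
    how="left",
    validate="many_to_one",  # raises MergeError if domain_id is not unique in toy_domains
    indicator=True,          # adds a '_merge' column describing where each row came from
)

# Shipments without a matching domain
unmatched = merged[merged["_merge"] == "left_only"]
```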

We choose the relevant columns to keep, in accordance with the star schema model.

In [None]:
fact_shipment = fact_shipment[['shipment_id', 'customer_price', 'expected_carrier_price', 
                               'final_carrier_price', 'weight', 'shipment_type', 
                               'insurance_type', 'customer_id', 'pickupaddress_id', 
                               'deliveryaddress_id', 'service_id', 'domain_name', 
                               'pickup_date', 'delivery_date', 'real_pickup_date', 
                               'real_delivery_date', 'booking_state', 'lms_plus', 
                               'exworks_id', 'created_date']].copy()

Here we calculate the margin by subtracting the final carrier price from the customer price.

In [None]:
# We convert both columns to numeric, coercing errors (invalid parsing will be set to NaN)
fact_shipment['customer_price'] = pd.to_numeric(fact_shipment['customer_price'], errors='coerce')
fact_shipment['final_carrier_price'] = pd.to_numeric(fact_shipment['final_carrier_price'], errors='coerce')

# We calculate margin, leaving it as NaN where values are missing
fact_shipment['margin'] = fact_shipment['customer_price'] - fact_shipment['final_carrier_price']
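
The following sketch, on made-up prices, illustrates how missing or invalid prices propagate: the margin is NaN whenever either of the two prices is NaN.

```python
import pandas as pd

# Made-up prices: a complete row, a row without carrier price, a row with an invalid customer price
toy_prices = pd.DataFrame({
    "customer_price": ["100.0", 80, "n/a"],
    "final_carrier_price": [70, None, 50],
})

for col in ["customer_price", "final_carrier_price"]:
    toy_prices[col] = pd.to_numeric(toy_prices[col], errors="coerce")

# Subtraction with NaN on either side yields NaN
toy_prices["margin"] = toy_prices["customer_price"] - toy_prices["final_carrier_price"]
```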

### Process pickup and delivery address data frames

In this section we create the pickup and delivery address dimensions, considering the important columns and tying them to the shipment fact.

In [None]:
# We create the pickup dimension and we merge the domain name into it
dim_pickup_address = pickupaddress_df.merge(domain_df[['domain_id', 'name']], on='domain_id', how='left')
dim_pickup_address = dim_pickup_address.rename(columns={'name': 'domain_name'})

# We keep the columns present in the defined star schema
dim_pickup_address = dim_pickup_address[['pickupaddress_id', 'created_date', 'domain_name', 'country_id', 'postal_code', 'city']]
dim_pickup_address = dim_pickup_address.rename(columns={'pickupaddress_id': 'pickup_address_id'})

# We create the delivery dimension and we merge the domain name into it
dim_delivery_address = deliveryaddress_df.merge(domain_df[['domain_id', 'name']], on='domain_id', how='left')
dim_delivery_address = dim_delivery_address.rename(columns={'name': 'domain_name'})

# We keep the columns present in the defined star schema
dim_delivery_address = dim_delivery_address[['deliveryaddress_id', 'created_date', 'domain_name', 'country_id', 'postal_code', 'city']]
dim_delivery_address = dim_delivery_address.rename(columns={'deliveryaddress_id': 'delivery_address_id'})

# We rename the columns in the shipment fact table for better understanding
fact_shipment = fact_shipment.rename(columns={'pickupaddress_id': 'pickup_address_id', 
                                                'deliveryaddress_id': 'delivery_address_id'})

# We ensure both dimensions contain only the date parts
dim_delivery_address['created_date'] = pd.to_datetime(dim_delivery_address['created_date'], errors='coerce').dt.date
dim_pickup_address['created_date'] = pd.to_datetime(dim_pickup_address['created_date'], errors='coerce').dt.date
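
Since the dimensions are tied to the fact table through their ids, one possible sanity check is to look for ids that appear in the fact table but not in the dimension. The sketch below uses made-up data; `orphan_keys` is a hypothetical helper that is not part of the original notebook.

```python
import pandas as pd


def orphan_keys(fact, dim, key):
    """Return the distinct key values present in the fact table but absent from the dimension."""
    fact_keys = fact[key].dropna()
    return sorted(fact_keys[~fact_keys.isin(dim[key])].unique().tolist())


# Made-up data: address 7 is referenced by a shipment but missing from the dimension
toy_fact = pd.DataFrame({"shipment_id": [1, 2, 3], "pickup_address_id": [5, 6, 7]})
toy_dim = pd.DataFrame({"pickup_address_id": [5, 6], "city": ["A", "B"]})

orphans = orphan_keys(toy_fact, toy_dim, "pickup_address_id")
```

In this notebook the check could be run as `orphan_keys(fact_shipment, dim_pickup_address, 'pickup_address_id')`, and likewise for the delivery addresses.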

### Process customer data frame

In this section we create the customer dimension; we handle its relationships with the branchcodes, in addition to the master relationships.

Customers are organized in a hierarchical structure where:
<ul>
  <li>Master Accounts are identified when a customer's sequence number matches their structure number</li>
  <li>Industry Classifications are assigned through branchcodes:</li>
  <ul>
    <li>Each customer can have multiple branchcodes</li>
    <li>Only the first (primary) branchcode is used as the main industry, meaning the customer's main activity. We keep only this one so that the relationship stays 1 to n</li>
    <li>Each branchcode has both a specific industry name and a broader sector classification</li>
</ul>
</ul>

First we handle the customer and the industries; we merge the branchcode and the root branch (which we refer to using the sector). \
We get the main industry for each customer by using the first branchcode.

In [None]:
# We join the customer with the branchcode table to get both the specific industry name and its root info
customer_industries = (
    customer_df.merge(
        # First we get the translation table to link customers to branchcodes
        branchcode_customer_translation_df,
        on='customer_id',
        how='left'
    )
    # Then we get the branchcode information, including the branch name
    .merge(
        branchcode_df[['branchcode_id', 'branch_name', 'root_branch_id']],
        on='branchcode_id',
        how='left'
    )
    # At last, we get the root branch name by joining branchcode table again
    .merge(
        branchcode_df[['branchcode_id', 'branch_name']],
        left_on='root_branch_id',
        right_on='branchcode_id',
        how='left',
        suffixes=('', '_root')
    )
    # We sort by translation ID to ensure that the main industry comes first
    .sort_values('branchcode_customer_id')
)

Next, we keep only the main industry name and its root branch name for each customer.

In [None]:
customer_industry_info = customer_industries.groupby('customer_id').agg({
    'branch_name': 'first',          # We get the main industry name
    'branch_name_root': 'first'      # We get the root branch name
}).reset_index()

# Now we can create the customer dimension with all the information 
dim_customer = (
    customer_df
    # Here we merge the industry information
    .merge(
        customer_industry_info[['customer_id', 'branch_name', 'branch_name_root']], 
        on='customer_id',
        how='left'
    )
    # Here we merge the domain information
    .merge(
        domain_df[['domain_id', 'name']],
        on='domain_id',
        how='left'
    )
    # Lastly we rename some of the columns in order to be more descriptive
    .rename(columns={
        'name': 'domain_name',
        'sequencenumber': 'sequence_number',
        'structurenumber': 'structure_number',
        'branch_name': 'main_industry_name',
        'branch_name_root': 'industry_sector_name'
    })
)
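
The "first branchcode per customer" selection can be illustrated on made-up data. One detail worth knowing is that the `'first'` aggregation of `groupby` skips missing values by default, so it returns the first non-null entry per column, which may differ from the first row. The sketch compares it with `drop_duplicates`, which keeps the first row as it is; the branch names are invented.

```python
import pandas as pd

# Made-up links between customers and branchcodes
toy_links = pd.DataFrame({
    "branchcode_customer_id": [3, 1, 2, 4],
    "customer_id": [1, 1, 2, 2],
    "branch_name": ["Retail", "Logistics", None, "Food"],
})

# Sort so that the earliest link of each customer comes first
ordered = toy_links.sort_values("branchcode_customer_id")

# First non-null branch name per customer (behaviour of the 'first' aggregation)
first_non_null = ordered.groupby("customer_id")["branch_name"].first()

# Branch name of the first row per customer, even if it is missing
first_row = (
    ordered.drop_duplicates("customer_id", keep="first")
    .set_index("customer_id")["branch_name"]
)
```

For customer 2 the two approaches disagree: the first link has no branch name, so `first_non_null` falls back to the second link while `first_row` stays empty.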

As the last step, we add a column for each customer indicating whether it is the 'master', meaning the main entity of the sequence and structure number hierarchy.

In [None]:
dim_customer['is_master'] = dim_customer['sequence_number'] == dim_customer['structure_number']
# We keep the columns of the star schema; the copy avoids SettingWithCopyWarning on the assignment below
dim_customer = dim_customer[[
    'customer_id', 'created_date', 'domain_name', 
    'main_industry_name', 'industry_sector_name',
    'segmentation', 'sequence_number', 'structure_number', 
    'is_master'
]].copy()
dim_customer['created_date'] = pd.to_datetime(dim_customer['created_date'], errors='coerce').dt.date
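
A small sketch of the master flag on made-up numbers: a customer is flagged as master when its sequence number equals its structure number.

```python
import pandas as pd

# Made-up hierarchy: customers 1 and 3 are masters, customer 2 belongs to the structure of customer 1
toy_customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "sequence_number": [100, 101, 102],
    "structure_number": [100, 100, 102],
})

toy_customers["is_master"] = (
    toy_customers["sequence_number"] == toy_customers["structure_number"]
)

masters = toy_customers.loc[toy_customers["is_master"], "customer_id"].tolist()
```

Note that rows where either number is missing compare as not equal, so they are flagged as non-master.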

### Process dates data frames

In this section we create the date dimension; first we do some data quality checks, after which we create the dimension and tie it to the fact table.

In [None]:
# Here we extract the relevant date columns from shipment
date_columns = ['created_date', 'pickup_date', 'real_pickup_date', 'delivery_date', 'real_delivery_date']

# Here we remove time from dates that include time
for col in ['created_date', 'real_delivery_date', 'real_pickup_date']:
    fact_shipment[col] = pd.to_datetime(fact_shipment[col], errors='coerce').dt.date  

# Here we process each date column separately in order to avoid memory overload and crashing
date_dim = pd.DataFrame()

for col in date_columns:
    # Here we convert each date column to datetime and remove invalid dates
    fact_shipment[col] = pd.to_datetime(fact_shipment[col], errors='coerce')
    
    # Here we combine the current date column into the date dimension, avoiding duplication
    new_dates = fact_shipment[[col]].drop_duplicates().dropna().rename(columns={col: 'full_date'})
    date_dim = pd.concat([date_dim, new_dates]).drop_duplicates().reset_index(drop=True)

# Here we ensure full_date is only date and we create month, quarter, and year columns
date_dim['full_date'] = pd.to_datetime(date_dim['full_date'], errors='coerce').dt.date  
date_dim['year'] = pd.to_datetime(date_dim['full_date'], errors='coerce').dt.year      
date_dim['month'] = pd.to_datetime(date_dim['full_date'], errors='coerce').dt.month    
date_dim['quarter'] = pd.to_datetime(date_dim['full_date'], errors='coerce').dt.quarter
date_dim['date_id'] = date_dim.index + 1  # Create incremental date IDs

# Here we replace date columns in the shipment table with corresponding date IDs, processing them one at a time
for col in date_columns:
    # Here we ensure both the fact_shipment column and the full_date column are in the same datetime format
    fact_shipment[col] = pd.to_datetime(fact_shipment[col], errors='coerce').dt.date
    date_dim['full_date'] = pd.to_datetime(date_dim['full_date'], errors='coerce').dt.date
    
    # Here we merge fact_shipment with the date dimension to assign date IDs
    fact_shipment = fact_shipment.merge(date_dim[['full_date', 'date_id']], left_on=col, right_on='full_date', how='left')
    
    # Here we rename the new column and drop the redundant 'full_date' column
    fact_shipment = fact_shipment.rename(columns={'date_id': f'{col}_id'}).drop(columns=['full_date'])

# Lastly, we drop the original date columns as we now have the date IDs in place
fact_shipment = fact_shipment.drop(columns=date_columns)
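
The same idea can be sketched more compactly on made-up data: collect the distinct dates of all date columns, number them, and map each date column to its id. This is only an illustration under simplifying assumptions. It sorts the dates before numbering them, so the ids would differ from the ones produced by the cell above, and names such as `toy_date_dim` and `date_to_id` are hypothetical.

```python
import pandas as pd

# Made-up fact table with two date columns
toy_fact = pd.DataFrame({
    "shipment_id": [1, 2],
    "created_date": ["2023-01-05", "2023-01-06"],
    "pickup_date": ["2023-01-06", "2023-04-01"],
})
toy_date_columns = ["created_date", "pickup_date"]

# Parse the dates and set the time part to midnight
for col in toy_date_columns:
    toy_fact[col] = pd.to_datetime(toy_fact[col], errors="coerce").dt.normalize()

# Collect the distinct dates of all columns in one pass
all_dates = (
    pd.concat([toy_fact[col] for col in toy_date_columns])
    .dropna()
    .drop_duplicates()
    .sort_values()
)

# Build the date dimension with incremental ids
toy_date_dim = pd.DataFrame({"full_date": all_dates.to_numpy()})
toy_date_dim["year"] = toy_date_dim["full_date"].dt.year
toy_date_dim["month"] = toy_date_dim["full_date"].dt.month
toy_date_dim["quarter"] = toy_date_dim["full_date"].dt.quarter
toy_date_dim["date_id"] = toy_date_dim.index + 1

# Replace each date column with the matching date id
date_to_id = toy_date_dim.set_index("full_date")["date_id"]
for col in toy_date_columns:
    toy_fact[f"{col}_id"] = toy_fact[col].map(date_to_id)
toy_fact = toy_fact.drop(columns=toy_date_columns)
```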

### Process service dataframe

In this section we create the service dimension; we merge the domain information into the service data and handle the naming alongside the important columns.

In [None]:
# Here we merge domain name into the service dimension table, adding suffixes to distinguish between the columns
dim_service = service_df.merge(domain_df[['domain_id', 'name']], on='domain_id', how='left', suffixes=('_service', '_domain'))

# Here we rename the different columns for clarity and formatting
dim_service = dim_service.rename(columns={'name_domain': 'domain_name',
                                            'name_service': 'name',
                                            'servicetype': 'service_type',
                                            'transporttype': 'transport_type'})

# Here we keep only the relevant columns; the copy avoids SettingWithCopyWarning on the assignment below
dim_service = dim_service[['service_id', 'created_date', 'name', 'service_type', 'transport_type', 'carrier_id', 'domain_name']].copy()

# Here we ensure the date doesn't contain the time
dim_service['created_date'] = pd.to_datetime(dim_service['created_date'], errors='coerce').dt.date
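
The role of `suffixes` can be seen on a made-up example: when both tables have a `name` column, the merge renames them to `name_service` and `name_domain` so that they can be told apart.

```python
import pandas as pd

# Made-up tables that both contain a 'name' column
toy_services = pd.DataFrame({"service_id": [1], "name": ["Express"], "domain_id": [10]})
toy_domains = pd.DataFrame({"domain_id": [10], "name": ["NL"]})

merged = toy_services.merge(
    toy_domains,
    on="domain_id",
    how="left",
    suffixes=("_service", "_domain"),  # applied only to overlapping column names
)

columns = merged.columns.tolist()
```

Suffixes are applied only when a column name exists on both sides. If one table had no `name` column, the suffixed names would not be created and a later rename of `name_service` would silently do nothing.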

### Process carrier and country dimensions

In this section we create the carrier and country dimensions, merging the carrier with the domain information and keeping the relevant columns.

In [None]:
# Here we merge the domain name into the carrier dimension table, adding suffixes to distinguish between columns
dim_carrier = carrier_df.merge(domain_df[['domain_id', 'name']], on='domain_id', how='left', suffixes=('_carrier', '_domain'))

# Here we rename some of the columns for clarity and formatting reasons
dim_carrier = dim_carrier.rename(columns={'name_carrier': 'name',
                                            'name_domain': 'domain_name'})

# Here we filter the important columns; the copy avoids SettingWithCopyWarning on the assignment below
dim_carrier = dim_carrier[['carrier_id', 'name', 'created_date', 'domain_name']].copy()

# Here we ensure that created_date contains only the date
dim_carrier['created_date'] = pd.to_datetime(dim_carrier['created_date'], errors='coerce').dt.date  

In [None]:
# Here we create the country dimension, filtering the needed columns
dim_country = country_df[['country_id', 'name', 'isocountrycode', 'continent', 'EU']]

# Here we rename the 'isocountrycode' column for clarity
dim_country = dim_country.rename(columns={'isocountrycode': 'iso_country_code'})

### Save new star schema dataset

The last remaining step is to write the fact table and the different dimensions together into a single Excel file.

In [None]:
with pd.ExcelWriter('../../../00-Project/datasets/star_schema_dataset.xlsx', engine='xlsxwriter') as writer:
    # The Fact table
    fact_shipment.to_excel(writer, sheet_name='fact_shipment', index=False)
    
    # The Dimension tables
    dim_customer.to_excel(writer, sheet_name='dim_customer', index=False)
    dim_delivery_address.to_excel(writer, sheet_name='dim_delivery_address', index=False)
    dim_pickup_address.to_excel(writer, sheet_name='dim_pickup_address', index=False)
    date_dim.to_excel(writer, sheet_name='dim_date', index=False)
    dim_service.to_excel(writer, sheet_name='dim_service', index=False)
    dim_carrier.to_excel(writer, sheet_name='dim_carrier', index=False)
    dim_country.to_excel(writer, sheet_name='dim_country', index=False)

print("Star schema transformation with domain names included completed!")