# Analyzing Logistics Data using Snowpark Connect for Apache Spark

## Overview
This notebook demonstrates how to analyze logistics and supply chain data using Snowpark Connect for Apache Spark™. We'll work with carrier performance metrics and freight bill data to identify delivery risks and performance patterns.

## Key Features
- **Low Migration Overhead**: Bring existing Spark code to Snowflake with minimal changes
- **Snowflake Execution**: Run Spark DataFrame workloads on Snowflake's engine instead of a separately managed Spark cluster
- **Native DataFrame APIs**: Use familiar PySpark DataFrame operations on Snowflake data

## Dataset Description
We'll be analyzing two main datasets:
1. **Carrier Performance Metrics**: Historical performance data for different shipping carriers
2. **Freight Bills**: Detailed shipping transaction records including costs, routes, and delivery information

## Objectives
- Load and analyze carrier performance data
- Examine freight bill details and delivery confirmations
- Identify shipments at risk of delays
- Create integrated views for operational insights

Let's start by setting up our Spark session and connecting to Snowflake.

In [None]:
import warnings
warnings.filterwarnings('ignore')

from snowflake import snowpark_connect
from snowflake.snowpark.context import get_active_session

# Note: importing sum from pyspark.sql.functions shadows Python's built-in sum() in later cells
from pyspark.sql.functions import col, avg, sum

session = get_active_session()
print(session)

spark = snowpark_connect.server.init_spark_session()


In [None]:
use schema stratos_dynamics_scm.data;

In [None]:
CREATE OR REPLACE FILE FORMAT csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1 -- Assumes the first row is a header
  NULL_IF = ('', 'NULL')
  EMPTY_FIELD_AS_NULL = TRUE
  COMPRESSION = 'AUTO';
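
To make the file format options above concrete, here is a minimal plain-Python sketch of how `SKIP_HEADER = 1` and `NULL_IF = ('', 'NULL')` treat a small CSV. It only approximates Snowflake's behavior for illustration; the helper name `parse_like_csv_format` and the sample rows are hypothetical and not part of the dataset.

```python
import csv
import io

# Values that NULL_IF = ('', 'NULL') maps to SQL NULL (None in this sketch).
NULL_MARKERS = {"", "NULL"}


def parse_like_csv_format(text, skip_header=1):
    """Approximate SKIP_HEADER and NULL_IF handling for a small CSV string."""
    rows = list(csv.reader(io.StringIO(text)))
    data_rows = rows[skip_header:]  # SKIP_HEADER = 1 drops the first line
    return [
        [None if field in NULL_MARKERS else field for field in row]
        for row in data_rows
    ]


# Hypothetical sample; the real files live in the S3 stage.
sample = "metric_id,carrier_name,damage_claims\nM1,Acme Freight,\nM2,NULL,3\n"
parsed = parse_like_csv_format(sample)
print(parsed)
```

The header row is dropped, and both the empty field and the literal `NULL` come back as missing values, which is why the target tables need nullable columns for any field that can be blank in the source files.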

In [None]:
CREATE OR REPLACE STAGE stratos_public_s3_stage
  URL = 's3://stratos-logistics-operations/'
  FILE_FORMAT = csv_format;

In [None]:
CREATE OR REPLACE TABLE carrier_performance_metrics (
    metric_id                     VARCHAR,
    carrier_name                  VARCHAR,
    reporting_period              VARCHAR,
    period_start_date             DATE,
    period_end_date               DATE,
    total_shipments               INT,
    on_time_deliveries            INT,
    on_time_percentage            FLOAT,
    total_weight_lbs              FLOAT,
    damage_claims                 INT,
    damage_rate_percentage        FLOAT,
    total_damage_cost             NUMERIC(18, 2),  -- Use NUMERIC for currency
    average_transit_days          FLOAT,
    customer_satisfaction_score   FLOAT,
    total_freight_cost            NUMERIC(18, 2),  -- Use NUMERIC for currency
    cost_per_shipment             NUMERIC(18, 2),  -- Use NUMERIC for currency
    load_timestamp                TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()
);

In [None]:
CREATE OR REPLACE TABLE freight_bills (
    bill_id                 VARCHAR,
    pro_number              VARCHAR,
    po_number               VARCHAR,
    carrier_name            VARCHAR,
    ship_date               DATE,
    delivery_date           DATE,
    origin_city             VARCHAR,
    origin_state            VARCHAR,
    origin_country          VARCHAR,
    origin_zip              INT,
    destination_city        VARCHAR,
    destination_state       VARCHAR,
    destination_country     VARCHAR,
    destination_zip         INT,
    destination_facility    VARCHAR,
    component_code          VARCHAR,
    component_name          VARCHAR,
    quantity                INT,
    weight_lbs              FLOAT,
    declared_value          INT,
    freight_class           FLOAT,
    base_charge             NUMERIC(18, 2),  -- Use NUMERIC for currency
    weight_charge           NUMERIC(18, 2),  -- Use NUMERIC for currency
    fuel_surcharge          NUMERIC(18, 2),  -- Use NUMERIC for currency
    accessorial_charges     NUMERIC(18, 2),  -- Use NUMERIC for currency
    total_charge            NUMERIC(18, 2),  -- Use NUMERIC for currency
    payment_terms           VARCHAR,
    payment_status          VARCHAR,
    invoice_date            DATE,
    load_timestamp          TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()
);

In [None]:
COPY INTO carrier_performance_metrics
FROM @stratos_public_s3_stage/carrier_performance_metrics.csv
FILE_FORMAT = (FORMAT_NAME = csv_format)
ON_ERROR = 'ABORT_STATEMENT' -- Stops the load if any error is encountered
;

In [None]:
COPY INTO freight_bills
FROM @stratos_public_s3_stage/freight_bills.csv
FILE_FORMAT = (FORMAT_NAME = csv_format)
ON_ERROR = 'ABORT_STATEMENT' -- Stops the load if any error is encountered
;

## 🚀 Setup Complete

**Quick Setup Overview:** The cells above accomplish our initial setup in 4 key steps:
1. **Environment**: Initialize Snowpark & Spark sessions for Snowflake execution
2. **Data Infrastructure**: Create CSV file format and S3 external stage for data loading  
3. **Table Schema**: Define `carrier_performance_metrics` and `freight_bills` tables
4. **Data Loading**: Import CSV data from S3 into Snowflake tables

Now let's analyze the data using Spark DataFrames!


In [None]:
db_name = "stratos_dynamics_scm"
schema_name = "data"

In [None]:
carrier_performance = f"{db_name}.{schema_name}.carrier_performance_metrics"
carrier_performance_df = spark.sql(f"select * from {carrier_performance}")
carrier_performance_df.show(5)

### 📊 Historic Shipment Analysis by Carrier

Let's analyze key performance indicators across carriers to identify patterns in shipment volume, damage rates, costs, and customer satisfaction. This aggregation will help us understand which carriers perform best across different metrics:

In [None]:
carrier_shipment_delays_df = carrier_performance_df.groupBy("carrier_name").agg(
    avg("total_shipments").alias("avg_total_shipments"),  # average per reporting period, not a total
    avg("damage_claims").alias("avg_damage_claims"),
    sum("total_damage_cost").alias("total_damage_cost"),
    avg("customer_satisfaction_score").alias("avg_customer_satisfaction_score")
)
carrier_shipment_delays_df.show()
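
When rolling reporting periods up to one row per carrier, averaging a per-period percentage weights every period equally, regardless of how many shipments it covered. The sketch below contrasts that with a volume-weighted rate on a few in-memory rows. The rows and the helper `on_time_rates` are hypothetical; the field names mirror `total_shipments` and `on_time_deliveries` from the table definition. It uses `statistics.fmean` rather than the built-in `sum`, since this notebook imports `sum` from `pyspark.sql.functions`.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical rows: (carrier_name, total_shipments, on_time_deliveries)
periods = [
    ("Carrier A", 100, 90),
    ("Carrier A", 900, 450),
    ("Carrier B", 200, 180),
]


def on_time_rates(rows):
    """Return {carrier: (mean_of_period_pcts, volume_weighted_pct)}."""
    pcts = defaultdict(list)
    shipments = defaultdict(int)
    on_time = defaultdict(int)
    for carrier, total, delivered_on_time in rows:
        pcts[carrier].append(100.0 * delivered_on_time / total)
        shipments[carrier] += total
        on_time[carrier] += delivered_on_time
    return {
        carrier: (
            fmean(pcts[carrier]),                           # every period counts equally
            100.0 * on_time[carrier] / shipments[carrier],  # every shipment counts equally
        )
        for carrier in pcts
    }


rates = on_time_rates(periods)
for carrier, (simple, weighted) in rates.items():
    print(f"{carrier}: mean of periods={simple:.1f}%, volume-weighted={weighted:.1f}%")
```

For Carrier A the two figures diverge (70.0% versus 54.0%) because the weaker period carried nine times the volume. Which figure is appropriate depends on the question being asked; the same distinction applies to `damage_rate_percentage`.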

### 📋 Delivery Confirmations Analysis

Now let's examine delivery confirmation data to track actual vs. scheduled delivery dates. This data will help us identify potential delivery delays and at-risk shipments:

In [None]:
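# delivery_confirmations is not created by the setup cells above;
# this assumes it already exists in build25_de_keynote.data
deliveries = "build25_de_keynote.data.delivery_confirmations"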
deliveries_df = spark.sql(f"select * from {deliveries}")
deliveries_df.show(5)

### 💰 Freight Bill Details Analysis

Let's examine our freight bill data to understand shipping costs, routes, and transaction details. This information will provide insights into cost patterns and shipping logistics:

In [None]:
freight_bills = f"{db_name}.{schema_name}.freight_bills"  # the table loaded from S3 during setup
freight_bills_df = spark.sql(f"select * from {freight_bills}")
freight_bills_df.show(5)

### 🔄 Joining Freight Bills and Delivery Confirmations

This is a crucial step where we combine freight bill data with delivery confirmations to create a comprehensive view of our shipments. The join will help us:

1. **Identify At-Risk Deliveries**: Compare scheduled vs. actual delivery dates
2. **Cost Analysis**: Associate costs with delivery performance
3. **Route Analytics**: Understand shipping patterns and potential bottlenecks
4. **Operational Insights**: Create actionable data for logistics optimization

The join uses `bill_id` as the common key between both datasets:

In [None]:
dc = deliveries_df.alias("dc")
fb = freight_bills_df.alias("fb")

# Join with aliases
deliveries_at_risk = dc.join(fb, on="bill_id", how="inner")

# Now you can reference specific columns using aliases
deliveries_at_risk = deliveries_at_risk.select(
    "bill_id",
    col("dc.pro_number").alias("pro_number"),
    col("dc.po_number").alias("po_number"),
    col("dc.carrier_name").alias("carrier_name"),
    col("dc.scheduled_delivery_date").alias("scheduled_delivery_date"),
    col("dc.actual_delivery_date").alias("actual_delivery_date"),
    col("fb.destination_city"),
    col("fb.destination_state"),
    col("fb.destination_country"),
    col("fb.destination_zip"),
    col("fb.destination_facility"),
    col("fb.origin_city"),
    col("fb.origin_state"),
    col("fb.origin_country"),
    col("fb.origin_zip"),
    col("fb.component_code"),
    col("fb.component_name"),
    col("fb.quantity"),
    col("fb.weight_lbs"),
    col("fb.declared_value"),
    col("fb.total_charge"),
    col("fb.payment_terms"),
    col("fb.payment_status"),
    col("fb.invoice_date"),
    col("fb.quantity").alias("product_quantity"),
    col("fb.freight_class")
)

deliveries_at_risk.show()
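
The join above brings `scheduled_delivery_date` and `actual_delivery_date` together with the freight details, but it does not yet mark which shipments are late. One possible rule is sketched below in plain Python on hypothetical rows. It assumes that a missing `actual_delivery_date` means the shipment has not been delivered yet, and that a shipment counts as at risk once it is at least `threshold_days` past its scheduled date; both are illustrative assumptions, not definitions taken from the dataset.

```python
from datetime import date


def days_late(scheduled, actual, as_of):
    """Days past the scheduled date; undelivered shipments are measured against as_of."""
    reference = actual if actual is not None else as_of
    return max((reference - scheduled).days, 0)


def is_at_risk(scheduled, actual, as_of, threshold_days=1):
    """Assumed rule: at risk once the shipment is threshold_days or more late."""
    return days_late(scheduled, actual, as_of) >= threshold_days


as_of = date(2025, 1, 10)

# Hypothetical rows: (bill_id, scheduled_delivery_date, actual_delivery_date)
shipments = [
    ("B-1", date(2025, 1, 5), date(2025, 1, 5)),  # delivered on time
    ("B-2", date(2025, 1, 5), date(2025, 1, 8)),  # delivered 3 days late
    ("B-3", date(2025, 1, 9), None),              # undelivered, 1 day overdue
    ("B-4", date(2025, 1, 12), None),             # undelivered, not due yet
]

flagged = [
    bill_id
    for bill_id, scheduled, actual in shipments
    if is_at_risk(scheduled, actual, as_of)
]
print(flagged)
```

In the notebook, the same rule would be expressed as a column expression on the joined DataFrame so that the comparison runs in Snowflake rather than in local Python.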

### ⚠️ Creating Deliveries at Risk Table

Now we'll persist the joined data as a new Snowflake table called `deliveries_at_risk`. Logistics teams can use this table as the data source for an operational dashboard to monitor potential delivery issues and act on them. Because the write uses `append` mode, re-running the cell adds the same rows to the table again:

In [None]:
deliveries_at_risk.write.mode("append").saveAsTable(f"{db_name}.{schema_name}.deliveries_at_risk")

## 🎉 Analysis Complete!

**Success!** We've built a logistics analytics workflow using Snowpark Connect for Apache Spark.

### What We Accomplished:

1. ✅ **Data Infrastructure**: Set up file formats, external stages, and table schemas in Snowflake
2. ✅ **Data Loading**: Imported carrier performance and freight bill data from S3
3. ✅ **Spark Analytics**: Used familiar PySpark DataFrames on Snowflake data
4. ✅ **Data Integration**: Joined multiple datasets to create operational insights
5. ✅ **Actionable Results**: Created a `deliveries_at_risk` table for ongoing monitoring

The `deliveries_at_risk` table is now available for business intelligence tools, reporting dashboards, and operational workflows!