<a href="https://colab.research.google.com/github/Requenamar3/Azure-Data-Studio-Project/blob/main/ShippingCostAnalysis_Synthetic_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Azure Data Studio Project: Synthetic Data Generation for Shipping and Distribution**

**Objective**
This project generates a synthetic dataset for a hypothetical e-commerce business specializing in shipping perishable goods. The generated data simulates a shipping and distribution network, allowing for analysis of key metrics like shipping costs, delivery efficiency, and carrier performance.

**Background**
The dataset supports the analysis and visualization of critical business metrics, providing a realistic framework for decision-making. It is designed to answer questions like:

* What are the cost drivers in the shipping network?
* Which carriers or routes are most efficient?
* How do different events (e.g., holidays, weddings) impact demand and costs?


# 1. Project Setup


## 1.1 Install Required Libraries
Install the Faker library for generating realistic synthetic data:

In [1]:
# Install the Faker library
!pip install faker

Collecting faker
  Downloading Faker-33.0.0-py3-none-any.whl.metadata (15 kB)
Downloading Faker-33.0.0-py3-none-any.whl (1.9 MB)
Installing collected packages: faker
Successfully installed faker-33.0.0


## 1.2 Import Libraries

Library Descriptions:
* datetime.timedelta: Manipulates dates to calculate shipping timelines.
* pandas: Creates and manages tabular data structures (DataFrames).
* faker: Generates realistic synthetic data like ZIP codes, states, and dates.
* random: Adds variability to synthetic data generation.

In [2]:
# libraries needed
from datetime import timedelta
import pandas as pd
from faker import Faker
import random




## 1.3 Initialize Variables
Set up the number of data points and initialize Faker:


In [3]:
# Initialize Faker
fake = Faker()

# Define the number of data points for each table. Adjust as needed
num_orders = 10000  # Number of orders.
num_routes = 300  # Number of routes.
num_zip_codes = 1000  # Number of unique ZIP codes
num_transactions = 10000  # Number of shipping transactions
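
The generators in this notebook are not seeded, so every run produces a different dataset and the printed outputs will not match a re-run. The following is a minimal sketch of one way to make the `random`-based columns repeatable. The helper name `make_costs` and the seed value 42 are assumptions for illustration and are not part of the original notebook. Faker offers a comparable class-level call, `Faker.seed(...)`, which would need to run before `fake` generates any values.

```python
import random

def make_costs(n, seed=None):
    """Return n shipping costs between 10 and 300, rounded to cents.

    `make_costs` and `seed` are illustrative names, not part of the notebook.
    """
    rng = random.Random(seed)  # private generator; the global random state is untouched
    return [round(rng.uniform(10, 300), 2) for _ in range(n)]

first = make_costs(5, seed=42)
second = make_costs(5, seed=42)
print(first == second)  # the same seed yields the same sequence
```

Using a private `random.Random` instance keeps the seeded sequence isolated from any other code that draws from the global generator.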

# 2. Data Generation

## 2.1 Orders Table
The **Orders Table** captures details about customer orders. This includes the cost of shipping, the distribution center fulfilling the order, the type of delivery, and the purpose of the order (e.g., holidays or events).

**Purpose**

This table is essential for analyzing demand patterns, cost drivers, and event-based shipping trends. It helps answer:
*   How much does it cost to ship orders for different delivery types?
*   Which events or seasons generate higher demand?

**Key Columns**
*   **Order_ID**: A unique identifier for each order.
*   **Distribution_Center_ID**: Links the order to a specific distribution center.
*   **Delivery_Type & Service_Type**: The transport mode ("Ground" or "Air") and the service level ("Ground", "Standard Overnight", or "Priority Overnight").
*   **Cost**: The shipping cost for the order.
*   **Event_Type & Holiday_Flag**: Provides context for seasonal or event-driven demand.
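
In the generation code for this table, `Delivery_Type` and `Service_Type` are drawn independently, so a row can pair an "Air" delivery with a "Ground" service. The sketch below shows one possible way to keep the two columns consistent by drawing the service type first and deriving the delivery type from it. The mapping `SERVICE_TO_DELIVERY` and the helper `make_service_columns` are assumptions made for this example, not rules taken from the notebook.

```python
import random

# Assumed mapping (not from the notebook): overnight services travel by air.
SERVICE_TO_DELIVERY = {
    'Ground': 'Ground',
    'Standard Overnight': 'Air',
    'Priority Overnight': 'Air',
}

def make_service_columns(n, rng=random):
    """Draw a service type per order and derive a matching delivery type."""
    service_names = list(SERVICE_TO_DELIVERY)
    services = [rng.choice(service_names) for _ in range(n)]
    deliveries = [SERVICE_TO_DELIVERY[s] for s in services]
    return services, deliveries

services, deliveries = make_service_columns(5)
for service, delivery in zip(services, deliveries):
    print(f'{service:<20} -> {delivery}')
```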

In [4]:
# Generate Orders Table Data
# We're creating a dictionary that will simulate customer orders with fields like Order ID, Distribution Center, ZIP codes, etc.
orders_data = {
    # Generate unique order IDs that follow a consistent format (e.g., ORD0001, ORD0002).
    'Order_ID': [f'ORD{str(i).zfill(4)}' for i in range(1, num_orders + 1)],

    # Randomly assign one of the six distribution centers to each order.
    'Distribution_Center_ID': [random.choice(['DC1', 'DC2', 'DC3', 'DC4', 'DC5', 'DC6']) for _ in range(num_orders)],

    # Use the Faker library to create random ZIP codes for customer locations.
    'Customer_ZIP': [fake.zipcode() for _ in range(num_orders)],

    # Randomly assign a delivery type for each order (Ground or Air).
    'Delivery_Type': [random.choice(['Ground', 'Air']) for _ in range(num_orders)],

    # Choose a shipping service type (e.g., Ground, Standard Overnight, or Priority Overnight) for each order.
    'Service_Type': [random.choice(['Ground', 'Standard Overnight', 'Priority Overnight']) for _ in range(num_orders)],

    # Generate a random shipping cost for each order, ranging from $10 to $300.
    # Costs are rounded to two decimal places to make them realistic.
    'Cost': [round(random.uniform(10, 300), 2) for _ in range(num_orders)],

    # Assign a random event type for the order (e.g., Birthday, Holiday) or "None" if there's no event.
    'Event_Type': [random.choice(['Birthday', 'Holiday', 'Wedding', 'Graduation', 'Bridal Shower',
                                  'Retirement', 'Gender Reveal', 'First Communion', 'None']) for _ in range(num_orders)],

    # Add a flag to indicate whether the order falls on a holiday (1) or not (0).
    'Holiday_Flag': [random.choice([0, 1]) for _ in range(num_orders)],

    # Leave a placeholder for a carrier reliability score, which we might add in the future.
    'Carrier_Reliability_Score': None
}

# Turn the dictionary into a pandas DataFrame so we can work with the data more easily.
orders_df = pd.DataFrame(orders_data)

# Report: Orders Table Overview
report_title = "Orders Table Overview: Summary of Customer Orders"
print("\n" + "=" * len(report_title))
print(report_title)
print("=" * len(report_title))

# Display the first few rows of the Orders DataFrame to get a quick look at the data.
print("\nPreview of the first 5 rows:")
display(orders_df.head())

# Show some details about the DataFrame, like the column names, data types, and whether there are any missing values.
print("\nStructure and Info:")
orders_df.info()

# Print out statistics for the numerical columns, like the average cost and the range of costs.
print("\nDescriptive Statistics for Numerical Data:")
print(orders_df.describe())



Orders Table Overview: Summary of Customer Orders

Preview of the first 5 rows:


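|   | Order_ID | Distribution_Center_ID | Customer_ZIP | Delivery_Type | Service_Type | Cost | Event_Type | Holiday_Flag | Carrier_Reliability_Score |
|---|----------|------------------------|--------------|---------------|--------------|------|------------|--------------|---------------------------|
| 0 | ORD0001 | DC6 | 34853 | Ground | Standard Overnight | 96.39 | Graduation | 0 | |
| 1 | ORD0002 | DC3 | 59407 | Air | Priority Overnight | 99.92 | Retirement | 1 | |
| 2 | ORD0003 | DC2 | 87122 | Air | Priority Overnight | 222.91 | Retirement | 1 | |
| 3 | ORD0004 | DC3 | 2396 | Air | Ground | 79.03 | First Communion | 1 | |
| 4 | ORD0005 | DC4 | 59526 | Ground | Ground | 146.97 | | 0 | |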



Structure and Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Order_ID                   10000 non-null  object 
 1   Distribution_Center_ID     10000 non-null  object 
 2   Customer_ZIP               10000 non-null  object 
 3   Delivery_Type              10000 non-null  object 
 4   Service_Type               10000 non-null  object 
 5   Cost                       10000 non-null  float64
 6   Event_Type                 10000 non-null  object 
 7   Holiday_Flag               10000 non-null  int64  
 8   Carrier_Reliability_Score  0 non-null      object 
dtypes: float64(1), int64(1), object(7)
memory usage: 703.2+ KB

Descriptive Statistics for Numerical Data:
               Cost  Holiday_Flag
count  10000.000000  10000.000000
mean     155.846756      0.500300
std       83.732721      0.500025
min       10

## 2.2 Distribution Centers Table
The Distribution Centers Table outlines the locations and volume allocations of the distribution hubs.

**Purpose**

This table provides insights into the logistics network and volume management across distribution centers. It helps identify:
*   Which centers are handling the most volume?
*   Are the centers' allocations optimal based on geographic demand?

**Key Columns**
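*   Center_ID: Unique identifier for each distribution center.
*   City and State: The location of the center, stored as two separate columns.
*   ZIP_Code: A ZIP code in the center's state, generated with Faker.
*   Volume_Allocation: The percentage of total shipments managed by the center. The generation code in this section does not create this column.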

**Insights**
*   Evaluate the allocation of shipping volumes to different distribution centers to determine if they align with regional demand.
*   Identify underutilized or overburdened centers by comparing volume allocation percentages to actual shipping data.
*   Analyze the geographical distribution of centers to ensure optimal coverage and identify potential gaps in the network.
*   Assess if certain centers are handling orders for states far from their location, increasing costs unnecessarily.
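
The volume comparison described in these insights can be sketched with a simple count of orders per center. The helper `volume_share` and the hand-written sample list are assumptions for illustration; in the notebook the input would be the `Distribution_Center_ID` column of the Orders Table.

```python
from collections import Counter

def volume_share(center_ids):
    """Return each center's share of orders as a percentage (one decimal)."""
    counts = Counter(center_ids)
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing to allocate
    return {cid: round(100 * n / total, 1) for cid, n in sorted(counts.items())}

# Tiny hand-written sample standing in for the Distribution_Center_ID column.
sample = ['DC1', 'DC2', 'DC1', 'DC3', 'DC1', 'DC2', 'DC6', 'DC1']
print(volume_share(sample))
# {'DC1': 50.0, 'DC2': 25.0, 'DC3': 12.5, 'DC6': 12.5}
```

Centers that never appear in the input (DC4 and DC5 in this sample) are absent from the result, which is itself a signal of an unused center.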


In [5]:
# Generate Distribution Centers Table

# Create a dictionary to hold the distribution center data.
distribution_centers_data = {
    # Unique identifiers for each distribution center.
    'Center_ID': ['DC1', 'DC2', 'DC3', 'DC4', 'DC5', 'DC6'],

    # The city and state locations of each distribution center (separated into two columns).
    'City': ['Los Angeles', 'Miami', 'Dallas', 'Bronx', 'Hatfield', 'Chicago'],
    'State': ['CA', 'FL', 'TX', 'NY', 'PA', 'IL'],

    # Generate realistic ZIP codes for the states where the centers are located using the Faker library.
    'ZIP_Code': [
        fake.zipcode_in_state('CA'), fake.zipcode_in_state('FL'), fake.zipcode_in_state('TX'),
        fake.zipcode_in_state('NY'), fake.zipcode_in_state('PA'), fake.zipcode_in_state('IL'),
    ]
}

# Convert the dictionary into a pandas DataFrame for easier data manipulation and analysis.
distribution_centers_df = pd.DataFrame(distribution_centers_data)

# Report: Distribution Centers Table Overview
report_title = "Distribution Centers Table Overview: Summary of Distribution Network"
print("\n" + "=" * len(report_title))
print(report_title)
print("=" * len(report_title))

# Display the first few rows of the DataFrame to give an overview of the distribution center data.
print("\nPreview of the first 5 rows:")
display(distribution_centers_df.head())

# Provide a summary of the DataFrame structure, including column names, data types, and non-null counts.
print("\nStructure and Info:")
distribution_centers_df.info()

# Print summary statistics. This table has no numerical columns, so describe() reports count, unique, top, and freq.
print("\nDescriptive Statistics for Numerical Data:")
print(distribution_centers_df.describe())



Distribution Centers Table Overview: Summary of Distribution Network

Preview of the first 5 rows:


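|   | Center_ID | City | State | ZIP_Code |
|---|-----------|------|-------|----------|
| 0 | DC1 | Los Angeles | CA | 92387 |
| 1 | DC2 | Miami | FL | 32279 |
| 2 | DC3 | Dallas | TX | 77757 |
| 3 | DC4 | Bronx | NY | 14120 |
| 4 | DC5 | Hatfield | PA | 17970 |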



Structure and Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Center_ID  6 non-null      object
 1   City       6 non-null      object
 2   State      6 non-null      object
 3   ZIP_Code   6 non-null      object
dtypes: object(4)
memory usage: 320.0+ bytes

Descriptive Statistics for Numerical Data:
       Center_ID         City State ZIP_Code
count          6            6     6        6
unique         6            6     6        6
top          DC1  Los Angeles    CA    92387
freq           1            1     1        1


## 2.3 Carriers Table
The Carriers Table defines the shipping providers and their associated costs and service types.

**Purpose**

This table is crucial for understanding the cost-effectiveness of different carriers. It supports:

*   Identifying the most cost-efficient carriers for specific service types.
*   Analyzing carrier reliability and delivery times.

**Key Columns**

*   **Carrier_ID**: Unique identifier for each carrier.
*   **Name**: The name of the carrier (e.g., "UPS Ground").
*   **Flat_Rate**: The fixed cost per service provided by the carrier.
*   **Service_Type**: Specifies the mode of transport (e.g., "Ground" or "Air").

**Insights**
*   Compare flat rate costs across carriers to determine the most cost-effective options for ground and air services.
*   Evaluate the balance of carrier usage (e.g., are certain carriers handling more shipments than others?).
*   Analyze carrier specialization to match the best carrier with specific service types (e.g., ground vs. air).
*   Track how flat rate costs impact total shipping costs when combined with route distances and delivery types.
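
The flat-rate comparison in the first insight can be sketched as a single pass over the carrier rows. The rows below are copied from the Carriers Table in this section; the helper name `cheapest_by_service` is an assumption for illustration.

```python
# (Carrier_ID, Name, Flat_Rate, Service_Type) rows from the Carriers Table.
carriers = [
    ('C1', 'UPS Ground', 2.2, 'Ground'),
    ('C2', 'FedEx Ground', 2.8, 'Ground'),
    ('C3', 'OnTrac', 2.5, 'Ground'),
    ('C4', 'FedEx Standard Overnight', 5.0, 'Air'),
    ('C5', 'FedEx Priority Overnight', 7.0, 'Air'),
    ('C6', 'UPS Air', 6.5, 'Air'),
]

def cheapest_by_service(rows):
    """Return the lowest flat-rate carrier for each service type."""
    best = {}
    for carrier_id, name, rate, service in rows:
        # Keep the first carrier seen, then replace it only on a strictly lower rate.
        if service not in best or rate < best[service][2]:
            best[service] = (carrier_id, name, rate)
    return best

for service, (carrier_id, name, rate) in cheapest_by_service(carriers).items():
    print(f'{service}: {carrier_id} ({name}) at {rate}')
```

With the rates listed in this section, the result is C1 for "Ground" and C4 for "Air".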


In [6]:
# Generate Carriers Table Data
# This table contains information about the carriers, including their service types, flat rates, and unique identifiers.

# Create a dictionary to store carrier information.
carriers_data = {
    # Unique identifiers for each carrier.
    'Carrier_ID': ['C1', 'C2', 'C3', 'C4', 'C5', 'C6'],

    # Names of the carriers and their services.
    'Name': ['UPS Ground', 'FedEx Ground', 'OnTrac', 'FedEx Standard Overnight', 'FedEx Priority Overnight', 'UPS Air'],

    # Flat rate shipping costs associated with each carrier (these rates are static and can be adjusted as needed).
    'Flat_Rate': [2.2, 2.8, 2.5, 5.0, 7.0, 6.5],

    # Service type offered by each carrier (e.g., Ground shipping or Air shipping).
    'Service_Type': ['Ground', 'Ground', 'Ground', 'Air', 'Air', 'Air']
}

# Convert the dictionary into a pandas DataFrame for easier manipulation and analysis.
carriers_df = pd.DataFrame(carriers_data)

# Report: Carriers Table Overview
report_title = "Carriers Table Overview: Summary of Shipping Providers"
print("\n" + "=" * len(report_title))
print(report_title)
print("=" * len(report_title))

# Display the first few rows of the Carriers DataFrame to get a quick look at the data.
print("\nPreview of the first 5 rows:")
display(carriers_df.head())

# Show the structure and details of the DataFrame, including data types and non-null counts for each column.
print("\nStructure and Info:")
carriers_df.info()

# Print out statistics for numerical columns (e.g., flat rates for shipping services).
print("\nDescriptive Statistics for Numerical Data:")
print(carriers_df.describe())



Carriers Table Overview: Summary of Shipping Providers

Preview of the first 5 rows:


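|   | Carrier_ID | Name | Flat_Rate | Service_Type |
|---|------------|------|-----------|--------------|
| 0 | C1 | UPS Ground | 2.2 | Ground |
| 1 | C2 | FedEx Ground | 2.8 | Ground |
| 2 | C3 | OnTrac | 2.5 | Ground |
| 3 | C4 | FedEx Standard Overnight | 5.0 | Air |
| 4 | C5 | FedEx Priority Overnight | 7.0 | Air |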



Structure and Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Carrier_ID    6 non-null      object 
 1   Name          6 non-null      object 
 2   Flat_Rate     6 non-null      float64
 3   Service_Type  6 non-null      object 
dtypes: float64(1), object(3)
memory usage: 320.0+ bytes

Descriptive Statistics for Numerical Data:
       Flat_Rate
count   6.000000
mean    4.333333
std     2.121949
min     2.200000
25%     2.575000
50%     3.900000
75%     6.125000
max     7.000000


## 2.4 Routes Details Table
The Routes Details Table captures information about the shipping routes connecting distribution centers to customer locations. It includes geographical details, transport modes, and carrier assignments for each route.

**Purpose**

This table is essential for analyzing the logistics network and optimizing routes. It helps answer:

*    Which routes are the most cost-effective based on distance and preferred transport mode?
*    How are carriers allocated to specific routes, and how does that impact efficiency?
*    What trends exist in preferred modes of transport for long versus short routes?

**Key Columns**

*    **Route_ID**: A unique identifier for each route.
*    **Center_ID**: Links the route to a specific distribution center.
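*    **From_ZIP & From_State**: The starting location of the route. Both values are drawn at random in this dataset, so they do not necessarily match the ZIP code and state of the assigned distribution center.
*    **To_ZIP & To_State**: The destination location (customer ZIP code and state), also drawn at random and independently of each other.
*    **Distance**: The distance (in miles) between the distribution center and the customer location, generated as a random value between 50 and 3000.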
*    **Preferred_Mode**: Indicates the preferred transport mode ("Ground" or "Air") for the route.
*    **Carrier_ID**: Links the route to a specific carrier for execution.

**Insights**
*    Analyze the cost implications of long-distance versus short-distance routes and the preference for air or ground transport.
*    Identify high-cost routes and opportunities to optimize transport modes (e.g., shifting air to ground for cost savings).
*    Evaluate carrier usage patterns to ensure balanced allocation and avoid over-reliance on specific carriers.
*    Use geographical trends (e.g., frequently serviced states or ZIP codes) to plan for expanding distribution centers or addressing inefficiencies.
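
Because the route generator draws the origin state independently of `Center_ID`, geographic analysis of origins is only meaningful if the two agree. The sketch below shows one possible way to derive `From_State` from the assigned center instead. The lookup values come from the Distribution Centers Table; the helper name `make_route_origins` is an assumption for illustration.

```python
import random

# Center-to-state lookup copied from the Distribution Centers Table.
CENTER_STATE = {
    'DC1': 'CA', 'DC2': 'FL', 'DC3': 'TX',
    'DC4': 'NY', 'DC5': 'PA', 'DC6': 'IL',
}

def make_route_origins(n, rng=random):
    """Pick a center per route and take From_State from that center."""
    center_ids = list(CENTER_STATE)
    rows = []
    for i in range(1, n + 1):
        center = rng.choice(center_ids)
        rows.append({
            'Route_ID': f'R{i}',
            'Center_ID': center,
            'From_State': CENTER_STATE[center],  # always consistent with the center
        })
    return rows

for row in make_route_origins(3):
    print(row)
```

The same lookup approach could be extended to `From_ZIP` once each center's ZIP code is available as a dictionary.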


In [7]:
# Generate Routes Details Table
# This table defines the routes between distribution centers and customer ZIP codes, including distances and preferred shipping modes.

# Create a dictionary to hold route details data.
routes_details_data = {
    # Unique identifier for each route.
    'Route_ID': [f'R{i}' for i in range(1, num_routes + 1)],

    # Randomly assign a distribution center to the route.
    'Center_ID': [random.choice(['DC1', 'DC2', 'DC3', 'DC4', 'DC5', 'DC6']) for _ in range(num_routes)],

    # Generate random ZIP codes for the starting location of the route (distribution center).
    'From_ZIP': [fake.zipcode() for _ in range(num_routes)],

    # Generate random state abbreviations for the starting location.
    'From_State': [fake.state_abbr() for _ in range(num_routes)],

    # Generate random ZIP codes for the destination (customer ZIP code).
    'To_ZIP': [fake.zipcode() for _ in range(num_routes)],

    # Generate random state abbreviations for the destination.
    'To_State': [fake.state_abbr() for _ in range(num_routes)],

    # Randomly generate distances in miles between 50 and 3000.
    'Distance': [round(random.uniform(50, 3000), 2) for _ in range(num_routes)],

    # Assign a preferred mode of transport (Ground is preferred 70% of the time).
    'Preferred_Mode': ['Ground' if random.random() < 0.7 else 'Air' for _ in range(num_routes)],

    # Randomly assign a carrier ID for the route.
    'Carrier_ID': [random.choice(['C1', 'C2', 'C3', 'C4', 'C5', 'C6']) for _ in range(num_routes)]
}

# Convert the dictionary into a pandas DataFrame.
routes_details_df = pd.DataFrame(routes_details_data)

# Report: Routes Details Table Overview
report_title = "Routes Details Table Overview: Summary of Distribution Routes"
print("\n" + "=" * len(report_title))
print(report_title)
print("=" * len(report_title))

# Display the first few rows of the Routes Details DataFrame.
print("\nPreview of the first 5 rows:")
display(routes_details_df.head())

# Show the structure and details of the DataFrame.
print("\nStructure and Info:")
routes_details_df.info()

# Print out statistics for numerical columns like distances.
print("\nDescriptive Statistics for Numerical Data:")
print(routes_details_df.describe())



Routes Details Table Overview: Summary of Distribution Routes

Preview of the first 5 rows:


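|   | Route_ID | Center_ID | From_ZIP | From_State | To_ZIP | To_State | Distance | Preferred_Mode | Carrier_ID |
|---|----------|-----------|----------|------------|--------|----------|----------|----------------|------------|
| 0 | R1 | DC1 | 97392 | ND | 54305 | SD | 1963.02 | Air | C2 |
| 1 | R2 | DC5 | 68974 | PA | 87743 | MD | 1065.99 | Air | C6 |
| 2 | R3 | DC5 | 94455 | SD | 81643 | HI | 1529.25 | Ground | C1 |
| 3 | R4 | DC4 | 86707 | NV | 95927 | MH | 2422.81 | Ground | C2 |
| 4 | R5 | DC2 | 13687 | MA | 25056 | SC | 2480.75 | Air | C4 |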



Structure and Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Route_ID        300 non-null    object 
 1   Center_ID       300 non-null    object 
 2   From_ZIP        300 non-null    object 
 3   From_State      300 non-null    object 
 4   To_ZIP          300 non-null    object 
 5   To_State        300 non-null    object 
 6   Distance        300 non-null    float64
 7   Preferred_Mode  300 non-null    object 
 8   Carrier_ID      300 non-null    object 
dtypes: float64(1), object(8)
memory usage: 21.2+ KB

Descriptive Statistics for Numerical Data:
          Distance
count   300.000000
mean   1543.897833
std     843.078385
min      59.760000
25%     821.567500
50%    1560.920000
75%    2258.767500
max    2996.570000


## 2.5 Shipping Transactions Table

The Shipping Transactions Table provides a detailed log of all shipping events, including costs, timelines, and carrier assignments. It ties together data from the orders, routes, and carriers.

**Purpose**

This table is critical for tracking and analyzing the performance of the shipping process. It helps answer:

*    How do shipping costs vary based on carrier, route, or delivery type?
*    What percentage of shipments are delivered on time?
*    Are there any seasonal trends in shipping activity and associated costs?

**Key Columns**
*    **Transaction_ID**: A unique identifier for each shipping transaction.
*    **Order_ID**: Links the transaction to a specific order from the Orders Table.
*    **Carrier_ID**: Indicates the carrier responsible for the shipment.
*    **Route_ID**: Links the transaction to a specific route from the Routes Details Table.
*    **Shipping_Cost**: The total cost of the shipping transaction.
*    **Transaction_Date**: The date the transaction was created.
*    **Scheduled_Delivery_Date**: The expected delivery date for the shipment.
*    **Actual_Delivery_Date**: The date the shipment was delivered (assumed to match the scheduled date in this dataset).
*    **Shipping_Date**: The date the shipment was sent, typically 1 to 3 days before the scheduled delivery date.

**Insights**

*    Evaluate on-time delivery performance by comparing scheduled and actual delivery dates.
*    Identify high-cost transactions and trends in shipping costs by carrier, route, or season to enable cost optimization.
*    Track spikes in shipping activity and costs during peak periods (holidays or seasonal events) to prepare for increased demand.
*    Assess the impact of shipping date proximity to the transaction date on carrier performance and customer satisfaction.
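
Two properties of the generated dates are worth checking before drawing conclusions. The actual delivery date is copied from the scheduled date, so the on-time rate is 100% by construction. In addition, the scheduled date falls 1 to 5 days after the transaction while the shipping date falls 1 to 3 days before the scheduled date, so a shipping date can land before its transaction date. The sketch below checks both on two small records. The helper names are assumptions, the first record mirrors the first preview row of this table, and the second record is made up to show a failing case.

```python
from datetime import date

def on_time_rate(rows):
    """Fraction of rows delivered on or before the scheduled date."""
    if not rows:
        return 0.0
    on_time = sum(r['Actual_Delivery_Date'] <= r['Scheduled_Delivery_Date'] for r in rows)
    return on_time / len(rows)

def shipped_before_ordered(rows):
    """Transaction IDs whose shipping date precedes the transaction date."""
    return [r['Transaction_ID'] for r in rows if r['Shipping_Date'] < r['Transaction_Date']]

rows = [
    {   # mirrors the first preview row of the Shipping Transactions Table
        'Transaction_ID': 'TXN0001',
        'Transaction_Date': date(2024, 2, 18),
        'Shipping_Date': date(2024, 2, 19),
        'Scheduled_Delivery_Date': date(2024, 2, 21),
        'Actual_Delivery_Date': date(2024, 2, 21),
    },
    {   # hypothetical record: shipped before ordered, delivered one day late
        'Transaction_ID': 'TXN_EXAMPLE',
        'Transaction_Date': date(2024, 3, 10),
        'Shipping_Date': date(2024, 3, 8),
        'Scheduled_Delivery_Date': date(2024, 3, 11),
        'Actual_Delivery_Date': date(2024, 3, 12),
    },
]

print(on_time_rate(rows))            # 0.5
print(shipped_before_ordered(rows))  # ['TXN_EXAMPLE']
```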

In [8]:
# Generate Shipping Transactions Table
# This table records shipping transactions, including costs, carrier selection, and delivery timelines.

# Create a dictionary to hold shipping transaction data.
shipping_transactions_data = {
    # Unique identifier for each transaction.
    'Transaction_ID': [f'TXN{str(i).zfill(4)}' for i in range(1, num_transactions + 1)],

    # Randomly assign an order ID to the transaction.
    # random.choices samples with replacement from a plain Python list in a single call.
    'Order_ID': random.choices(orders_df['Order_ID'].tolist(), k=num_transactions),

    # Randomly assign a carrier ID to the transaction.
    'Carrier_ID': random.choices(carriers_df['Carrier_ID'].tolist(), k=num_transactions),

    # Randomly assign a route ID to the transaction.
    'Route_ID': random.choices(routes_details_df['Route_ID'].tolist(), k=num_transactions),

    # Generate random shipping costs between $10 and $300.
    'Shipping_Cost': [round(random.uniform(10, 300), 2) for _ in range(num_transactions)],

    # Generate random transaction dates within the current year using the Faker library.
    'Transaction_Date': [fake.date_this_year() for _ in range(num_transactions)]
}

# Convert the dictionary into a pandas DataFrame.
shipping_transactions_df = pd.DataFrame(shipping_transactions_data)

# Add columns for delivery schedules.
# Generate scheduled delivery dates (1 to 5 days after the transaction date).
shipping_transactions_df['Scheduled_Delivery_Date'] = pd.to_datetime(shipping_transactions_df['Transaction_Date']) + pd.to_timedelta(
    [random.randint(1, 5) for _ in range(num_transactions)], unit='D'
)

# Assume the actual delivery date matches the scheduled date (on-time delivery).
shipping_transactions_df['Actual_Delivery_Date'] = shipping_transactions_df['Scheduled_Delivery_Date']

# Generate shipping dates (1 to 3 days before the scheduled delivery date).
shipping_transactions_df['Shipping_Date'] = shipping_transactions_df['Scheduled_Delivery_Date'] - pd.to_timedelta(
    [random.randint(1, 3) for _ in range(num_transactions)], unit='D'
)

# Report: Shipping Transactions Table Overview
report_title = "Shipping Transactions Table Overview: Summary of Shipping Events"
print("\n" + "=" * len(report_title))
print(report_title)
print("=" * len(report_title))

# Display the first few rows of the Shipping Transactions DataFrame.
print("\nPreview of the first 5 rows:")
display(shipping_transactions_df.head())

# Show the structure and details of the DataFrame.
print("\nStructure and Info:")
shipping_transactions_df.info()

# Print out statistics for numerical columns like shipping costs.
print("\nDescriptive Statistics for Numerical Data:")
print(shipping_transactions_df.describe())



Shipping Transactions Table Overview: Summary of Shipping Events

Preview of the first 5 rows:


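|   | Transaction_ID | Order_ID | Carrier_ID | Route_ID | Shipping_Cost | Transaction_Date | Scheduled_Delivery_Date | Actual_Delivery_Date | Shipping_Date |
|---|----------------|----------|------------|----------|---------------|------------------|-------------------------|----------------------|---------------|
| 0 | TXN0001 | ORD0440 | C3 | R82 | 266.87 | 2024-02-18 | 2024-02-21 | 2024-02-21 | 2024-02-19 |
| 1 | TXN0002 | ORD8298 | C5 | R68 | 55.83 | 2024-09-03 | 2024-09-08 | 2024-09-08 | 2024-09-06 |
| 2 | TXN0003 | ORD5085 | C2 | R225 | 196.04 | 2024-07-04 | 2024-07-08 | 2024-07-08 | 2024-07-06 |
| 3 | TXN0004 | ORD5868 | C2 | R63 | 36.66 | 2024-01-12 | 2024-01-16 | 2024-01-16 | 2024-01-13 |
| 4 | TXN0005 | ORD2867 | C1 | R159 | 88.94 | 2024-10-10 | 2024-10-13 | 2024-10-13 | 2024-10-11 |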



Structure and Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Transaction_ID           10000 non-null  object        
 1   Order_ID                 10000 non-null  object        
 2   Carrier_ID               10000 non-null  object        
 3   Route_ID                 10000 non-null  object        
 4   Shipping_Cost            10000 non-null  float64       
 5   Transaction_Date         10000 non-null  object        
 6   Scheduled_Delivery_Date  10000 non-null  datetime64[ns]
 7   Actual_Delivery_Date     10000 non-null  datetime64[ns]
 8   Shipping_Date            10000 non-null  datetime64[ns]
dtypes: datetime64[ns](3), float64(1), object(5)
memory usage: 703.2+ KB

Descriptive Statistics for Numerical Data:
       Shipping_Cost Scheduled_Delivery_Date Actual_Delivery_Date  \
count   10000.0

## 2.6 Generate ZIP Codes Table

The ZIP Codes Table provides a mapping of ZIP codes to specific routes in the logistics network. This table complements the Routes Details Table by adding granularity, enabling detailed analysis of service areas and their connection to routes.

**Purpose**

This table is essential for understanding the geographical distribution of routes and optimizing coverage. It helps answer:

*    Which ZIP codes are associated with specific routes?
*    Are there clusters of ZIP codes serviced by the same route, and do they indicate inefficiencies?
*    How can the data be used to plan for expanding or reallocating distribution resources?

**Key Columns**

*    **ZIP_Code:** A ZIP code representing a service area. Values are generated at random, so occasional duplicates are possible.
*    **Route_ID:** Links the ZIP code to a specific route from the Routes Details Table.

**Insights**
*    Visualize the geographical reach of each route and identify underserved areas.
*    Analyze clusters of ZIP codes assigned to the same route to improve efficiency.
*    Combine this table with shipping and order data to track demand patterns by region.
*    Use this data to identify opportunities for reassigning ZIP codes to more cost-effective or efficient routes.
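
If each ZIP code should map to exactly one route, the generated values need to be distinct, which independent random draws do not guarantee. The sketch below shows one way to draw distinct values using sampling without replacement. The helper `unique_zip_codes` is an assumption, and the 5-digit strings it returns are stand-ins rather than validated ZIP codes. Inside the notebook, Faker's `unique` proxy (for example `fake.unique.zipcode()`) may be a closer fit.

```python
import random

def unique_zip_codes(n, rng=random):
    """Draw n distinct 5-digit strings (stand-ins for real ZIP codes)."""
    if n > 100_000:
        raise ValueError('only 100,000 distinct 5-digit strings exist')
    # sample() draws without replacement, so no value can repeat.
    return [f'{value:05d}' for value in rng.sample(range(100_000), n)]

codes = unique_zip_codes(1000)
print(len(codes), len(set(codes)))  # both counts are equal
```

Formatting with `:05d` keeps leading zeros, so every value stays five characters long.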



In [9]:
# Generate ZIP Codes Table Data
# This table maps ZIP codes to the routes they are serviced by.

# Create a dictionary to store ZIP code data.
zip_codes_data = {
    # Generate a list of ZIP codes using the Faker library.
    # Values are drawn independently, so a few duplicates can occur.
    # The number of ZIP codes is defined by the variable 'num_zip_codes'.
    'ZIP_Code': [fake.zipcode() for _ in range(num_zip_codes)],

    # Assign each ZIP code to a random Route_ID from the Routes Details Table.
    # This establishes a relationship between ZIP codes and specific routes.
    'Route_ID': [random.choice(routes_details_df['Route_ID'].tolist()) for _ in range(num_zip_codes)]
}

# Convert the dictionary into a pandas DataFrame for analysis.
zip_codes_df = pd.DataFrame(zip_codes_data)

# Display the sample ZIP Codes Table
# This section provides a preview and basic information about the data.
print("\nSample ZipCodes Table")  # Title for the report section
display(zip_codes_df.head())  # Display the first 5 rows of the DataFrame to preview the data.

# Show information about the DataFrame's structure and data types.
zip_codes_df.info()

# Print descriptive statistics for numerical columns, if any.
# Since this table is primarily categorical, statistics may not apply to all columns.
print("\nDescriptive Statistics for ZIP Codes Data:")
print(zip_codes_df.describe())



Sample ZipCodes Table


|   | ZIP_Code | Route_ID |
|---|----------|----------|
| 0 | 22159 | R23 |
| 1 | 34145 | R31 |
| 2 | 28129 | R17 |
| 3 | 48087 | R179 |
| 4 | 16859 | R243 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   ZIP_Code  1000 non-null   object
 1   Route_ID  1000 non-null   object
dtypes: object(2)
memory usage: 15.8+ KB

Descriptive Statistics for ZIP Codes Data:
       ZIP_Code Route_ID
count      1000     1000
unique      995      281
top       48987      R22
freq          2        9


# 3. Export and Download Data

## 3.1 Export DataFrames

In [10]:
# Export each DataFrame to a CSV file for external use and analysis.
# The CSV files can be imported into tools like Azure Data Studio or Excel.

# Save the Orders Table
orders_df.to_csv('Orders.csv', index=False)  # The index is excluded to avoid unnecessary columns.
print("Orders Table has been exported to 'Orders.csv'.")

# Save the Distribution Centers Table
distribution_centers_df.to_csv('Distribution_Centers.csv', index=False)
print("Distribution Centers Table has been exported to 'Distribution_Centers.csv'.")

# Save the Carriers Table
carriers_df.to_csv('Carriers.csv', index=False)
print("Carriers Table has been exported to 'Carriers.csv'.")

# Save the Routes Details Table
routes_details_df.to_csv('Routes_Details.csv', index=False)
print("Routes Details Table has been exported to 'Routes_Details.csv'.")

# Save the Shipping Transactions Table
shipping_transactions_df.to_csv('Shipping_Transactions.csv', index=False)
print("Shipping Transactions Table has been exported to 'Shipping_Transactions.csv'.")

# Save the ZIP Codes Table
zip_codes_df.to_csv('Zip_Codes.csv', index=False)
print("ZIP Codes Table has been exported to 'Zip_Codes.csv'.")


Orders Table has been exported to 'Orders.csv'.
Distribution Centers Table has been exported to 'Distribution_Centers.csv'.
Carriers Table has been exported to 'Carriers.csv'.
Routes Details Table has been exported to 'Routes_Details.csv'.
Shipping Transactions Table has been exported to 'Shipping_Transactions.csv'.
ZIP Codes Table has been exported to 'Zip_Codes.csv'.
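
ZIP codes are stored as strings in the DataFrames, but a CSV file carries no type information, so a reader that infers types can turn a ZIP code such as `02396` into the integer `2396`. The sketch below illustrates the effect with `pandas.read_csv` and shows the `dtype` argument as one way to keep the text form when reading the file back. The two sample rows are hypothetical and use an in-memory string instead of the exported file.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same layout as Orders.csv (columns trimmed).
csv_text = 'Order_ID,Customer_ZIP\nORD0004,02396\nORD0001,34853\n'

# Default type inference parses the ZIP column as integers.
default_read = pd.read_csv(io.StringIO(csv_text))

# Declaring the column as str preserves the leading zero.
string_read = pd.read_csv(io.StringIO(csv_text), dtype={'Customer_ZIP': str})

print(default_read['Customer_ZIP'].tolist())  # [2396, 34853]
print(string_read['Customer_ZIP'].tolist())   # ['02396', '34853']
```

Other import tools have their own column-type settings, so the same check is worth repeating wherever the CSV files are loaded.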


## 3.2 Download CSV Files

In [11]:
# Enable downloading of the CSV files for local use.
# This is useful for quickly transferring data generated in the notebook to a local machine.

from google.colab import files  # Import the library to download files from Colab.

# Trigger download for each exported CSV file.
files.download('Orders.csv')  # Download Orders Table.
files.download('Distribution_Centers.csv')  # Download Distribution Centers Table.
files.download('Carriers.csv')  # Download Carriers Table.
files.download('Routes_Details.csv')  # Download Routes Details Table.
files.download('Shipping_Transactions.csv')  # Download Shipping Transactions Table.
files.download('Zip_Codes.csv') # Download Zip Codes Table.
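
The `google.colab` module is only available inside Colab, so the download cell fails with an import error when the notebook runs elsewhere. A minimal sketch of a guard follows, assuming a hypothetical helper named `download_if_colab`; outside Colab it does nothing and the CSV files simply remain in the working directory.

```python
def download_if_colab(paths):
    """Trigger browser downloads in Colab; return False when not running there."""
    try:
        from google.colab import files  # present only in the Colab runtime
    except ImportError:
        return False
    for path in paths:
        files.download(path)
    return True

exported = [
    'Orders.csv', 'Distribution_Centers.csv', 'Carriers.csv',
    'Routes_Details.csv', 'Shipping_Transactions.csv', 'Zip_Codes.csv',
]
if not download_if_colab(exported):
    print('Not running in Colab; files were left in the working directory.')
```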

