# Workshop: Workspace Setup, Data Import & Exploration

**Training Objective:** Practical mastery of workspace configuration, data import from various formats, and basic exploration operations.

**Topics covered:**
- Workspace and cluster configuration
- Loading various data formats (CSV, JSON, Parquet)
- Basic data exploration
- Manual schema construction
- Analysis of missing data and unique values

**Duration:** 30 minutes

## Context and Requirements

- **Training Day**: Day 1 - Fundamentals & Exploration
- **Notebook Type**: Workshop
- **Technical Requirements**:
  - Databricks Runtime 13.0+ (recommended: 14.3 LTS)
  - Unity Catalog enabled
  - Permissions: CREATE TABLE, CREATE SCHEMA, SELECT, MODIFY
  - Cluster: Standard with minimum 2 workers

## Workshop Introduction

In this workshop you will work with real KION data:
- **Customers** (customers.csv) - customer data
- **Orders** (orders_batch.json) - orders
- **Products** (products.parquet) - products

### Tasks to complete:
1. Environment and variable configuration
2. Data import from CSV, JSON, Parquet
3. Manual schema construction
4. Data exploration: statistics, missing values, unique values
5. Data quality analysis

### Success criteria:
- All 3 datasets correctly loaded
- Schemas defined manually and applied
- Complete exploratory analysis conducted
- Data quality issues identified

## Theoretical Introduction

**Section Objective:** Understanding the basics of working with data in Databricks Lakehouse

**Basic Concepts:**
- **Workspace**: Databricks environment containing notebooks, clusters, and data
- **Cluster**: Set of virtual machines processing data
- **DataFrame**: Distributed collection of data organized in columns
- **Schema**: Data structure defining column names and data types
- **Data Format**: CSV (text), JSON (semi-structured), Parquet (columnar binary)

**Why is this important?**
Correct data loading and exploration is the foundation of every ETL pipeline. Understanding schemas, formats, and exploration methods allows early detection of data quality issues.

## Environment Initialization

Run the initialization script for per-user catalog and schema isolation:

In [0]:
%run ../00_setup

## Configuration

Define workshop-specific variables:

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.types import *

# Paths to data files (DATASET_BASE_PATH is already defined in 00_setup)
CUSTOMERS_CSV = f"{DATASET_BASE_PATH}/customers/customers.csv"
ORDERS_JSON = f"{DATASET_BASE_PATH}/orders/orders_batch.json"
PRODUCTS_PARQUET = f"{DATASET_BASE_PATH}/products/products.parquet"

**Configuration Context**

Paths to actual data files have been configured:
- **Customers CSV**: customers.csv (~10,000 customer records)
- **Orders JSON**: orders_batch.json (~100,000 orders) 
- **Products Parquet**: products.parquet (product catalog)

**Actual data structure:**
- **Customers**: `customer_id`, `first_name`, `last_name`, `email`, `phone`, `city`, `state`, `country`, `registration_date`, `customer_segment`
- **Orders**: `order_id`, `customer_id`, `product_id`, `store_id`, `order_datetime`, `quantity`, `unit_price`, `discount_percent`, `total_amount`, `payment_method`

---

## Task 1: CSV Data Import

### Objective:
Load customer data from CSV file, first with automatic schema detection, then with manually defined schema.

### Instructions:
1. Load `customers.csv` with `inferSchema=True` option
2. Display schema and first 5 records
3. Count the number of records
4. Define schema manually (StructType)
5. Reload using manual schema
6. Compare schemas

### Expected result:
- DataFrame with customer data
- Schema containing columns: customer_id (string), first_name (string), last_name (string), email (string), phone (string), city (string), state (string), country (string), registration_date (timestamp), customer_segment (string)

### Hints:
- Use `spark.read.format("csv").option("header", "true")`
- For manual schema use: `StructType`, `StructField`, `IntegerType`, `StringType`, `TimestampType`

In [0]:
# Load CSV file with customers (automatic schema detection)
customers_df = (
 spark.read
 .format("csv")
 .option("header", "true")
 .option("inferSchema", "true")
 .load(CUSTOMERS_CSV)
)

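# Display schema and sample data
customers_df.printSchema()
display(customers_df.limit(5))

# Count the number of records (instruction 3)
print(f"[INFO] Customer records: {customers_df.count()}")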

**Customer data analysis after loading**

Spark automatically detected the schema. Note the actual columns in the CSV file:
- **Identification**: `customer_id` (CUST000001...)
- **Personal data**: `first_name`, `last_name`, `email`, `phone`
- **Location**: `city`, `state`, `country`
- **Metadata**: `registration_date`, `customer_segment`

**Next step**: Define schema manually for better data type control and performance.

In [0]:
# Define schema manually based on actual data
# Complete data types for each field
customers_schema = StructType([
 StructField("customer_id", ____, False),
 StructField("first_name", ____, True),
 StructField("last_name", StringType(), True),
 StructField("email", ____, True),
 StructField("phone", StringType(), True),
 StructField("city", ____, True),
 StructField("state", StringType(), True),
 StructField("country", ____, True),
 StructField("registration_date", ____, True),
 StructField("customer_segment", ____, True)
])

# Reload with manual schema
customers_df_manual = (
 spark.read
 .format("____")
 .option("header", "____")
 .schema(____)
 .load(CUSTOMERS_CSV)
)

# Display schema
print("[INFO] Schema with manual definition:")
customers_df_manual.printSchema()

# Display sample data
display(customers_df_manual.limit(5))

**Manual customers schema defined**

**Best practice**: In production pipelines, define the schema manually instead of relying on `inferSchema=True`.

**Benefits of manual schema:**
- **Performance**: No need to scan entire file to determine types
- **Predictability**: Control over data types (String vs Integer)
- **Safety**: Validation of data conformance to schema
- **Documentation**: Schema serves as data structure documentation

---

## Task 2: JSON Data Import

### Objective:
Load order data from JSON file and define schema manually.

### Instructions:
1. Load `orders_batch.json` without a schema (the JSON reader infers the schema by default, so no `inferSchema` option is needed)
2. Examine data structure (schema, types)
3. Define schema manually
4. Reload with manual schema

### Expected result:
- DataFrame with orders
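- Schema: order_id (string), customer_id (string), product_id (string), store_id (string), order_datetime (timestamp), quantity (int), unit_price (double), discount_percent (int), total_amount (double), payment_method (string)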

### Hints:
- JSON doesn't require `header` option
- Use `DoubleType` for monetary amounts

In [0]:
# Load JSON file with orders (automatic schema detection)
orders_df = (
 spark.read
 .format("json")
 .option("multiLine", "true")
 .load(ORDERS_JSON)
)

display(orders_df.limit(5))

**Order data analysis after loading from JSON**

Note the actual order data structure:
- **Identifiers**: `order_id`, `customer_id`, `product_id`, `store_id`
- **Transaction details**: `order_datetime`, `quantity`, `unit_price`, `discount_percent`
- **Summary**: `total_amount`, `payment_method`

**Data quality issues:**
- Some records have `NULL` in `order_id` or `order_datetime`
- Future dates in `order_datetime` (2026)
- Data consistency check required

In [0]:
# Define orders schema based on actual data
orders_schema = StructType([
 StructField("order_id", ____, True),
 StructField("customer_id", ____, True),
 StructField("product_id", StringType(), True),
 StructField("store_id", ____, True),
 StructField("order_datetime", ____, True),
 StructField("quantity", ____, True),
 StructField("unit_price", DoubleType(), True),
 StructField("discount_percent", ____, True),
 StructField("total_amount", ____, True),
 StructField("payment_method", ____, True)
])

# Load with manual schema
orders_df_manual = (
 spark.read
 .format("____")
 .option("multiLine", "____")
 .schema(____)
 .load(ORDERS_JSON)
)

orders_df_manual.printSchema()
display(orders_df_manual.limit(5))

---

## Task 3: Parquet Data Import

### Objective:
Load product data from Parquet file (schema is embedded).

### Instructions:
1. Load `products.parquet`
2. Check schema (Parquet contains embedded schema)
3. Display data
4. Count records

### Expected result:
- DataFrame with products
- Schema automatically loaded from Parquet file

### Hints:
- Parquet doesn't require `inferSchema` or manual schema

In [0]:
# Load products.parquet
products_df = (
 spark.read
 .format("____")
 .load(PRODUCTS_PARQUET)
)

# Display schema (Parquet contains embedded schema)
products_df.printSchema()

# Display data and count records
display(products_df.limit(5))
product_count = products_df.count()
print(f"[INFO] Product records: {product_count}")

**Parquet - format with embedded schema**

**Benefits of Parquet format:**
- **Embedded schema**: No need to define schema manually
- **Columnar compression**: Space savings and faster analytical queries
- **Performance**: Best format for Big Data and analytics in lakehouse
- **Compatibility**: Universal standard for analytical systems

---

## Save Data to Delta Lake

### Objective:
Save loaded DataFrames to Delta Lake tables for further use.

### Instructions:
1. Save customers to table `bronze.customers_workshop`
2. Save orders to table `bronze.orders_workshop`
3. Save products to table `bronze.products_workshop`

In [0]:
# Example: Save customers to Delta Lake
# This cell is ready - analyze the code and run it

customers_table = f"{BRONZE_SCHEMA}.customers_workshop"

(
 customers_df_manual
 .write
 .format("delta")
 .mode("overwrite")
 .option("overwriteSchema", "true")
 .saveAsTable(customers_table)
)

# Check saved table structure
spark.sql(f"DESCRIBE TABLE {customers_table}").show(truncate=False)

**Customers saved to Delta Lake**

**Delta Lake - ACID format for the lakehouse:**
- **ACID transactions**: Atomic data operations
- **Schema evolution**: Ability to change schema over time
- **Time travel**: Access to historical data versions
- **Optimize & Vacuum**: Performance optimization

Table saved: `{BRONZE_SCHEMA}.customers_workshop` (the value of `customers_table`)

In [0]:
# Save orders to Delta Lake
orders_table = f"{BRONZE_SCHEMA}.orders_workshop"

(
 orders_df_manual
 .write
 .format("____")
 .mode("____")
 .option("overwriteSchema", "true")
 .saveAsTable(____)
)

**Orders saved to Delta Lake**

After completing the blanks above, the orders table will be saved in Delta Lake format. 
Note the use of parameters:
- **format("delta")**: Specifies Delta Lake format
- **mode("overwrite")**: Replaces existing data
- **overwriteSchema**: Allows table schema changes

In [0]:
# Save products to Delta Lake
products_table = f"{BRONZE_SCHEMA}.products_workshop"

(
 ____
 .write
 .format("____")
 .mode("____")
 .option("overwriteSchema", "true")
 .saveAsTable(____)
)

**Products saved to Delta Lake**

Complete the blanks above to save the products table. All three tables (customers, orders, products) will be available in the bronze schema for further processing.

---

## Task 4: Data Exploration - Customers

### Objective:
Conduct detailed customer data exploration.

### Instructions:
1. Display list of columns and their types
2. Count unique customers
3. Find customer count by country
4. Check for NULL values in columns
5. Generate descriptive statistics (`describe()`)

### Expected result:
- Complete exploratory analysis
- Identified data gaps
- Geographic distribution of customers

### Hints:
- Use: `columns`, `dtypes`, `count()`, `distinct()`, `groupBy()`, `describe()`
- To check NULL: `.filter(col("column_name").isNull())`

In [0]:
# Basic information about data structure
# Only the last expression of a cell is displayed, so print explicitly
print(customers_df_manual.columns)
print(customers_df_manual.dtypes)

# Count unique customer_id values
unique_customers = customers_df_manual.select("____").distinct().count()
total_customers = customers_df_manual.count()

# Display statistics
display(spark.createDataFrame([
 (total_customers, unique_customers, total_customers - unique_customers)
], ["total_records", "unique_customers", "duplicates"]))

**Basic customers statistics**

We checked:
- **Columns and types**: Data structure and types of each column
- **Duplicates**: Whether `customer_id` is unique (primary key)
- **Completeness**: Total record count vs unique identifiers

**Next step**: Check geographic distribution of customers

In [0]:
# Count customers by country
customers_by_country = (
 customers_df_manual
 .groupBy("____")
 .count()
 .orderBy("count", ascending=____)
)

display(customers_by_country)

**Geographic distribution of customers**

The analysis shows customer distribution by country, sorted descending.
This allows identifying:
- **Main markets**: Countries with the largest customer base
- **Expansion potential**: Countries with few customers
- **Geographic concentration**: Does the company have global reach?

In [0]:
# Check NULL values in each column
from pyspark.sql.functions import col, sum as spark_sum

null_counts = customers_df_manual.select([
 spark_sum(col(c).____().____("int")).alias(c)
 for c in customers_df_manual.columns
])

display(null_counts)

**NULL value analysis**

Checking for missing data in each column:
- **0 = Complete data** in column
- **>0 = Missing data** requiring handling

**Important for data quality**: Missing values in key fields (`customer_id`, `email`) are critical for business.

In [0]:
# Generate descriptive statistics
display(customers_df_manual.____())

**Descriptive statistics for customers**

The `describe()` method shows:
- **count**: Number of non-null values
- **mean/stddev**: Mean and standard deviation (for numeric columns)
- **min/max**: Extreme values
- For string columns: `count` plus lexicographic `min`/`max`; `mean` and `stddev` are NULL

**Usage**: Identifying outliers and general data distribution

---

## Task 5: Data Exploration - Orders

### Objective:
Conduct order analysis.

### Instructions:
1. Count orders by payment method
2. Calculate total order value
3. Find average, min, max order value
4. Check for missing data
5. Find top 10 most expensive orders

### Expected result:
- Order business statistics
- Data issues identified
- Top 10 orders

### Hints:
- Use `.agg()` with functions: `sum`, `avg`, `min`, `max`
- For sorting: `.orderBy(col("column").desc())`

In [0]:
# Count orders by payment method
orders_by_payment = (
 orders_df_manual
 .groupBy("____")
 .____()
 .orderBy("count", ascending=False)
)

display(orders_by_payment)

**Payment method distribution**

The analysis shows customer payment preferences:
- **Cash, Credit Card, Debit Card, PayPal**: Main methods
- **Business insights**: Which methods are most popular?
- **Planning**: Are additional payment methods needed?

In [0]:
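# Calculate statistics for total_amount
# Note: this import shadows Python's built-in sum, min, max and round-like names
# for the rest of the notebook; later cells rely on the Spark versions
from pyspark.sql.functions import sum, avg, min, max, count, round as spark_round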

orders_stats = orders_df_manual.select(
 count("*").alias("total_orders"),
 spark_round(____("total_amount"), 2).alias("total_revenue"),
 spark_round(____("total_amount"), 2).alias("avg_order_value"),
 ____("total_amount").alias("min_order"),
 ____("total_amount").alias("max_order")
)

display(orders_stats)

**Key orders business metrics**

Calculated statistics show:
- **Total Orders**: Total number of orders
- **Total Revenue**: Sum of all orders (revenue)
- **Average Order Value (AOV)**: Average order value
- **Min/Max Order**: Order value range

**Business application**: Benchmarking, budget planning, trend analysis

In [0]:
# Find top 10 most expensive orders
top_orders = (
 orders_df_manual
 .orderBy(col("total_amount").____())
 .limit(____)
)

display(top_orders)

**Top 10 most expensive orders**

Analysis of largest transactions shows:
- **VIP customers**: Who generates the highest revenue?
- **Premium products**: Which products in the most expensive orders?
- **Patterns**: Do high-value orders share common characteristics?

**Business application**: Customer segmentation, retention strategies, cross-selling

In [0]:
# Check missing data in orders
# (col and spark_sum were imported in the Task 4 NULL-check cell)
null_counts_orders = orders_df_manual.select([
 spark_sum(col(c).isNull().cast("int")).alias(c) 
 for c in orders_df_manual.columns
])

display(null_counts_orders)

**Missing data analysis in orders**

Key gaps to check:
- **order_id NULL**: Orders without identifier (system issue?)
- **order_datetime NULL**: Missing transaction time (affects reporting)
- **customer_id NULL**: Cannot associate with customer

**Remediation actions**: Delete or fill missing values in subsequent ETL steps

---

## Task 6: Data Exploration - Products

### Objective:
Conduct product analysis.

### Instructions:
1. Check schema and columns
2. Count products by category (if column exists)
3. Find price statistics (if price column exists)
4. Display top 10 most expensive products

### Expected result:
- Complete product analysis
- Category distribution
- Price statistics

### Hints:
- Check available columns before analysis: `products_df.columns`

In [0]:
# Check schema and columns
print(products_df.columns)
products_df.printSchema()

# Display sample data
display(products_df.limit(5))

**Products structure analysis**

Checking the schema helps understand:
- **Available columns**: What product information is available?
- **Data types**: Parquet automatically preserves proper types
- **Sample data**: Content and value format

**Next steps**: Check if the `subcategory_code` and `price` columns exist for further analysis

In [0]:
display(products_df)

In [0]:
# Check if 'subcategory_code' column exists, then count products by category
if "subcategory_code" in products_df.columns:
 products_by_category = (
 products_df
 .groupBy("____")
 .count()
 .orderBy("____", ascending=False)
 )
 
 display(products_by_category)
else:
 display(spark.createDataFrame([("Column 'subcategory_code' does not exist in data",)], ["info"]))

**Product category distribution**

If `subcategory_code` column exists:
- **Product portfolio**: What categories are available?
- **Concentration**: Which categories dominate the offering?
- **Cross-selling opportunities**: Related categories

If it doesn't exist - data enrichment with product categorization may be needed.

In [0]:
# Check if 'price' column exists, calculate statistics
if "price" in products_df.columns:
 products_stats = products_df.select(
 ____("*").alias("total_products"),
 spark_round(avg("____"), 2).alias("avg_price"),
 min("price").alias("min_price"),
 ____("price").alias("max_price")
 )
 
 display(products_stats)
else:
 display(spark.createDataFrame([("Column 'price' does not exist in data",)], ["info"]))

**Product price statistics**

If `price` column exists, we analyze:
- **Average price**: Overall price level
- **Price range**: min/max for portfolio understanding
- **Segmentation**: Budget vs premium products

**Application**: Pricing strategy, competitive analysis, product segmentation

---

## Summary

### What was achieved:
- You configured the Databricks environment and per-user variables
- You loaded data from three different formats (CSV, JSON, Parquet)
- You defined schemas manually (best practice)
- You explored all three datasets in detail
- You identified data quality issues
- You collected the inputs for a data quality report (NULL counts, duplicates, descriptive statistics)
- You saved the data to Delta Lake tables

### Key takeaways:
1. **Manual schemas > inferSchema**: Faster, safer, and predictable data types
2. **Exploration before transformation**: Always analyze data before processing
3. **Quality requires monitoring**: NULL, duplicates, outliers must be detected systematically
4. **Parquet is the lakehouse standard**: Embedded schema, best compression and performance

### Quick Reference - Most important commands:

| Operation | PySpark | Notes |
|----------|---------|---------|
| Load CSV | `spark.read.format("csv").option("header","true").schema(schema).load(path)` | Always use manual schema |
| Load JSON | `spark.read.format("json").schema(schema).load(path)` | `multiLine=true` option for JSON arrays |
| Load Parquet | `spark.read.format("parquet").load(path)` | Embedded schema |
| Save to Delta | `df.write.format("delta").mode("overwrite").saveAsTable(table)` | mode: overwrite/append |
| Check NULL | `df.select([sum(col(c).isNull().cast("int")).alias(c) for c in df.columns])` | For each column; `sum` is `pyspark.sql.functions.sum`, and the boolean must be cast before summing |
| Find duplicates | `df.count() - df.distinct().count()` | Difference = duplicate count |
| Statistics | `df.describe()` | Numeric and string columns (`mean`/`stddev` are NULL for strings) |
| Grouping | `df.groupBy("col").count()` | Aggregations |

---

## 📋 Solutions

Below are the complete solutions for all workshop tasks. Use them to verify your work or if you get stuck.

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, IntegerType, DoubleType

In [0]:
# =============================================================================
# SOLUTIONS - Task 1: CSV Data Import
# =============================================================================

# Define schema manually based on actual data
customers_schema = StructType([
    StructField("customer_id", StringType(), False),
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("phone", StringType(), True),
    StructField("city", StringType(), True),
    StructField("state", StringType(), True),
    StructField("country", StringType(), True),
    StructField("registration_date", TimestampType(), True),
    StructField("customer_segment", StringType(), True)
])

# Reload with manual schema
customers_df_manual = (
    spark.read
    .format("csv")
    .option("header", "true")
    .schema(customers_schema)
    .load(CUSTOMERS_CSV)
)

# =============================================================================
# SOLUTIONS - Task 2: JSON Data Import
# =============================================================================

# Define orders schema based on actual data
orders_schema = StructType([
    StructField("order_id", StringType(), True),
    StructField("customer_id", StringType(), True),
    StructField("product_id", StringType(), True),
    StructField("store_id", StringType(), True),
    StructField("order_datetime", TimestampType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("unit_price", DoubleType(), True),
    StructField("discount_percent", IntegerType(), True),
    StructField("total_amount", DoubleType(), True),
    StructField("payment_method", StringType(), True)
])

# Load with manual schema
orders_df_manual = (
    spark.read
    .format("json")
    .option("multiLine", "true")
    .schema(orders_schema)
    .load(ORDERS_JSON)
)

# =============================================================================
# SOLUTIONS - Task 3: Parquet Data Import
# =============================================================================

# Load products.parquet
products_df = (
    spark.read
    .format("parquet")
    .load(PRODUCTS_PARQUET)
)

# =============================================================================
# SOLUTIONS - Save Data to Delta Lake
# =============================================================================

# Save orders to Delta Lake
orders_table = f"{BRONZE_SCHEMA}.orders_workshop"
(
    orders_df_manual
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(orders_table)
)

# Save products to Delta Lake
products_table = f"{BRONZE_SCHEMA}.products_workshop"
(
    products_df
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(products_table)
)

# =============================================================================
# SOLUTIONS - Task 4: Data Exploration - Customers
# =============================================================================

# Count unique customer_id values
unique_customers = customers_df_manual.select("customer_id").distinct().count()
total_customers = customers_df_manual.count()

# Count customers by country
customers_by_country = (
    customers_df_manual
    .groupBy("country")
    .count()
    .orderBy("count", ascending=False)
)

# Check NULL values in each column
from pyspark.sql.functions import col, sum as spark_sum
null_counts = customers_df_manual.select([
    spark_sum(col(c).isNull().cast("int")).alias(c)
    for c in customers_df_manual.columns
])

# Generate descriptive statistics
display(customers_df_manual.describe())

# =============================================================================
# SOLUTIONS - Task 5: Data Exploration - Orders
# =============================================================================

# Count orders by payment method
orders_by_payment = (
    orders_df_manual
    .groupBy("payment_method")
    .count()
    .orderBy("count", ascending=False)
)

# Calculate statistics for total_amount
from pyspark.sql.functions import sum, avg, min, max, count, round as spark_round
orders_stats = orders_df_manual.select(
    count("*").alias("total_orders"),
    spark_round(sum("total_amount"), 2).alias("total_revenue"),
    spark_round(avg("total_amount"), 2).alias("avg_order_value"),
    min("total_amount").alias("min_order"),
    max("total_amount").alias("max_order")
)

# Find top 10 most expensive orders
top_orders = (
    orders_df_manual
    .orderBy(col("total_amount").desc())
    .limit(10)
)

# =============================================================================
# SOLUTIONS - Task 6: Data Exploration - Products
# =============================================================================

# Count products by category
if "subcategory_code" in products_df.columns:
    products_by_category = (
        products_df
        .groupBy("subcategory_code")
        .count()
        .orderBy("count", ascending=False)
    )

# Calculate price statistics
if "price" in products_df.columns:
    products_stats = products_df.select(
        count("*").alias("total_products"),
        spark_round(avg("price"), 2).alias("avg_price"),
        min("price").alias("min_price"),
        max("price").alias("max_price")
    )

## Resource Cleanup

Optional - delete tables created during the workshop:

In [0]:
# WARNING: Run only if you want to delete all created tables
# These tables may be needed in subsequent workshops!

# Uncomment the lines below to delete tables:
# spark.sql(f"DROP TABLE IF EXISTS {BRONZE_SCHEMA}.customers_workshop")
# spark.sql(f"DROP TABLE IF EXISTS {BRONZE_SCHEMA}.orders_workshop") 
# spark.sql(f"DROP TABLE IF EXISTS {BRONZE_SCHEMA}.products_workshop")
# spark.catalog.clearCache()

display(spark.createDataFrame([
 ("Resource cleanup is commented out",),
 ("Uncomment the code above to delete tables",),
 ("WARNING: Tables may be needed in subsequent workshops!",)
], ["Info"]))