# Databricks ML Workshop 1: Fundamental Data Processing

## Introduction to Data Processing for Machine Learning

**What we'll learn**: Core data processing techniques essential for ML in Databricks

**Tools we'll use**: 
- PySpark DataFrames for data manipulation
- Databricks SQL for data exploration
- Built-in visualization tools
- Data quality checks and cleaning

**Time**: 60 minutes | **Difficulty**: Beginner

---

## Learning Objectives
1. Load and explore datasets in Databricks
2. Perform basic data cleaning and transformation
3. Handle missing values and data types
4. Create data quality reports
5. Prepare data for machine learning

**Let's start processing data!** 

# Step 1: Getting Started 

## Your Task: Import Essential Libraries
First, let's import the libraries we need for data processing.

**Try this:**
- Import `pyspark.sql.functions` as `F`
- Import `pyspark.sql.types` 
- Import `pandas` as `pd` for some operations

*Write your code in the cell below, then check the solution!*

In [0]:
# 🔨 YOUR CODE HERE

# Import PySpark functions
# from pyspark.sql import functions as F

# Import PySpark types
# from pyspark.sql import types

# Import pandas
# import pandas as pd

In [0]:
# ✅ SOLUTION: Import Essential Libraries

from pyspark.sql import functions as F
from pyspark.sql import types
import pandas as pd

# Success! Libraries imported 
print("Libraries imported successfully!")

# Step 2: Loading Sample Data

## Your Task: Create a Sample Customer Dataset
Let's create a sample dataset to practice data processing techniques.

**Try this:**
- Create a DataFrame with customer data including: id, name, age, city, purchase_amount, last_login
- Include some missing values and data quality issues to practice cleaning
- Use `spark.createDataFrame()` with a list of tuples

*Hint: Include at least 10 customers with some NULL values and inconsistent data!*

In [0]:
# ✅ SOLUTION: Create Sample Customer Dataset

# Sample data with quality issues
data = [
    (1, "John Doe", 25, "New York", 1500.0, "2024-01-15"),
    (2, "Jane Smith", None, "Los Angeles", 2300.0, "2024-01-14"),
    (3, "Bob Johnson", 35, None, 850.0, "2024-01-13"),
    (4, "Alice Brown", 28, "Chicago", None, "2024-01-12"),
    (5, "Charlie Davis", 42, "Houston", 3200.0, None),
    (6, "Eva Wilson", None, "Phoenix", 1750.0, "2024-01-10"),
    (7, "Frank Miller", 31, "Philadelphia", 2100.0, "2024-01-09"),
    (8, "Grace Lee", 29, None, 1900.0, "2024-01-08"),
    (9, "Henry Taylor", 38, "San Antonio", None, "2024-01-07"),
    (10, "Ivy Chen", 26, "San Diego", 2800.0, "2024-01-06")
]

# Create DataFrame
df = spark.createDataFrame(data, ["id", "name", "age", "city", "purchase_amount", "last_login"])

# Show the data
df.display()
print(f"Dataset created with {df.count()} customers!")

# Step 3: Basic Data Exploration

## Your Task: Explore the Dataset
Let's understand our data better by exploring its structure and content.

**Try this:**
- Check the schema using `.printSchema()`
- Get basic statistics using `.describe()`
- Count total rows and columns

*Understanding your data is the first step in any ML project!*
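
To make the `.describe()` output easier to read, here is a minimal plain-Python sketch (not PySpark) of the statistics it reports for a numeric column. It assumes NULLs are skipped and that the standard deviation is the sample one. `describe_column` is a hypothetical helper, and the ages are copied from the Step 2 sample data.

```python
import statistics

# Ages copied from the Step 2 sample data; None marks a missing value.
ages = [25, None, 35, 28, 42, None, 31, 29, 38, 26]


def describe_column(values):
    """Hypothetical helper: summary statistics over the non-null values only."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),                # non-null rows only
        "mean": statistics.mean(present),
        "stddev": statistics.stdev(present),  # sample standard deviation
        "min": min(present),
        "max": max(present),
    }


age_summary = describe_column(ages)
print(age_summary)
```

Note that `count` here is 8, not 10: a count lower than the number of rows is often the first sign that a column has missing values.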

In [0]:
# 🔨 YOUR CODE HERE

# Check the schema
# df.printSchema()

# Get basic statistics
# df.describe().show()

# Count rows and columns
# print(f"Rows: {df.count()}, Columns: {len(df.columns)}")

In [0]:
# ✅ SOLUTION: Basic Data Exploration

print("Dataset Schema:")
df.printSchema()

print("Basic Statistics:")
df.describe().display()

print(f"Dataset Size: {df.count()} rows, {len(df.columns)} columns")

print("Column Names:")
for col in df.columns:
    print(f"  - {col}")

print("Exploration complete!")

# Step 4: Data Quality Assessment

## Your Task: Check for Missing Values
Data quality is crucial for ML. Let's identify missing values in our dataset.

**Try this:**
- Count NULL values in each column
- Calculate the percentage of missing data per column
- Use `F.col().isNull()` and `F.sum()` functions

*Missing data can significantly impact your model's performance!*
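
If you want to check your Spark results by hand, the same counting logic can be sketched in plain Python. This is only an illustration under the assumption that the rows match the Step 2 sample data; `null_report` is a hypothetical helper, not a Spark API.

```python
# Column names and rows copied from the Step 2 sample data.
columns = ["id", "name", "age", "city", "purchase_amount", "last_login"]
rows = [
    (1, "John Doe", 25, "New York", 1500.0, "2024-01-15"),
    (2, "Jane Smith", None, "Los Angeles", 2300.0, "2024-01-14"),
    (3, "Bob Johnson", 35, None, 850.0, "2024-01-13"),
    (4, "Alice Brown", 28, "Chicago", None, "2024-01-12"),
    (5, "Charlie Davis", 42, "Houston", 3200.0, None),
    (6, "Eva Wilson", None, "Phoenix", 1750.0, "2024-01-10"),
    (7, "Frank Miller", 31, "Philadelphia", 2100.0, "2024-01-09"),
    (8, "Grace Lee", 29, None, 1900.0, "2024-01-08"),
    (9, "Henry Taylor", 38, "San Antonio", None, "2024-01-07"),
    (10, "Ivy Chen", 26, "San Diego", 2800.0, "2024-01-06"),
]


def null_report(rows, columns):
    """Hypothetical helper: map each column to (missing count, missing percent)."""
    total = len(rows)
    report = {}
    for index, name in enumerate(columns):
        missing = sum(1 for row in rows if row[index] is None)
        report[name] = (missing, 100.0 * missing / total)
    return report


report = null_report(rows, columns)
for name, (missing, pct) in report.items():
    print(f"{name}: {missing} missing ({pct:.1f}%)")
```

Your Spark output should agree with these numbers: two missing values each for `age`, `city`, and `purchase_amount`, and one for `last_login`.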

In [0]:
# 🔨 YOUR CODE HERE

# Count NULL values for each column
# null_counts = df.select([
#     F.sum(F.col(c).isNull().cast("int")).alias(c) 
#     for c in df.columns
# ])

# Show the results
# null_counts.show()

# Calculate percentage of missing values
# total_rows = df.count()
# for col in df.columns:
#     null_count = df.filter(F.col(col).isNull()).count()
#     percentage = (null_count / total_rows) * 100
#     print(f"{col}: {null_count} missing ({percentage:.1f}%)")

In [0]:
# ✅ SOLUTION: Check for Missing Values

print("🔍 Missing Value Analysis:")

# Count NULL values for each column
null_counts = df.select([
    F.sum(F.col(c).isNull().cast("int")).alias(c) 
    for c in df.columns
])

print("NULL Count by Column:")
null_counts.display()

# Calculate percentage of missing values
total_rows = df.count()
print(f"Missing Value Percentages (Total rows: {total_rows}):")

for col in df.columns:
    null_count = df.filter(F.col(col).isNull()).count()
    percentage = (null_count / total_rows) * 100
    print(f"  {col}: {null_count} missing ({percentage:.1f}%)")

print("Missing value analysis complete!")

# Step 5: Handle Missing Values

## Your Task: Clean Missing Data
Now let's handle the missing values using different strategies.

**Try this:**
- Fill missing ages with the mean age
- Fill missing cities with "Unknown"
- Fill missing purchase amounts with median
- Use `.fillna()` (an alias for `.na.fill()`) with a dictionary of per-column replacement values

*Different strategies work better for different types of data!*
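
To see what values the imputation should produce, here is a plain-Python sketch of mean and median filling on the Step 2 sample values. It rests on two assumptions worth verifying on your cluster: that Spark casts the replacement to the column's type (so a fractional mean is truncated in an integer column), and that `approxQuantile` returns one of the observed values rather than interpolating. `fill_missing` is a hypothetical helper.

```python
import statistics

# Values copied from the Step 2 sample data; None marks a missing value.
ages = [25, None, 35, 28, 42, None, 31, 29, 38, 26]
purchases = [1500.0, 2300.0, 850.0, None, 3200.0, 1750.0, 2100.0, 1900.0, None, 2800.0]


def fill_missing(values, replacement):
    """Hypothetical helper: replace every None with the given value."""
    return [replacement if v is None else v for v in values]


known_ages = [v for v in ages if v is not None]
known_purchases = [v for v in purchases if v is not None]

mean_age = statistics.mean(known_ages)

# With an even number of values there are two common "medians":
interpolated_median = statistics.median(known_purchases)  # average of the two middle values
low_median = statistics.median_low(known_purchases)       # an actual observed value

# Assumption: an integer age column keeps only the whole part of the mean.
ages_filled = fill_missing(ages, int(mean_age))
purchases_filled = fill_missing(purchases, low_median)

print(f"mean age: {mean_age}, stored as {int(mean_age)}")
print(f"interpolated median: {interpolated_median}, low median: {low_median}")
```

With eight known purchase amounts, the two median definitions disagree, so don't be surprised if the value Spark prints is not the textbook average of the two middle amounts.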

In [0]:
# 🔨 YOUR CODE HERE

# Calculate mean age for missing values
# mean_age = df.select(F.avg("age")).collect()[0][0]

# Calculate median purchase amount
# median_purchase = df.approxQuantile("purchase_amount", [0.5], 0.0)[0]

# Fill missing values
# df_clean = df.fillna({
#     "age": mean_age,
#     "city": "Unknown",
#     "purchase_amount": median_purchase,
#     "last_login": "2024-01-01"  # Default date
# })

# Show the cleaned data
# df_clean.show()

In [0]:
# ✅ SOLUTION: Handle Missing Values

print("Cleaning Missing Values:")

# Calculate statistics for imputation
mean_age = df.select(F.avg("age")).collect()[0][0]
median_purchase = df.approxQuantile("purchase_amount", [0.5], 0.0)[0]

print(f"Mean age for imputation: {mean_age:.1f}")
print(f"Median purchase amount for imputation: {median_purchase:.2f}")

# Fill missing values with appropriate strategies
df_clean = df.fillna({
    "age": mean_age,
    "city": "Unknown",
    "purchase_amount": median_purchase,
    "last_login": "2024-01-01"  # Default date
})

print("Cleaned Dataset:")
df_clean.display()

# Verify no missing values remain
print("🔍 Verification - Missing values after cleaning:")
df_clean.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df_clean.columns]).display()

print("Data cleaning complete!")

# Step 6: Data Type Conversion

## Your Task: Convert Data Types
Proper data types are essential for ML algorithms. Let's convert our columns to appropriate types.

**Try this:**
- Convert `last_login` from string to date
- Ensure `age` is integer type
- Ensure `purchase_amount` is double type
- Use `.withColumn()` together with `.cast()` (and `F.to_date()` for the date column)

*Correct data types prevent errors and improve performance!*
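
The conversions in this step can be pictured with a small plain-Python sketch. It assumes dates arrive as `YYYY-MM-DD` strings, as in the sample data, and that a value that cannot be parsed should become missing rather than raise an error (Spark's `to_date` typically behaves this way unless ANSI mode is enabled). Both helpers are hypothetical names.

```python
from datetime import date


def to_date_or_none(text):
    """Hypothetical helper: parse 'YYYY-MM-DD', returning None when parsing fails."""
    if text is None:
        return None
    try:
        return date.fromisoformat(text)
    except ValueError:
        return None


def to_int_or_none(value):
    """Hypothetical helper: cast to int (dropping any fraction), keeping None as None."""
    if value is None:
        return None
    return int(value)


print(to_date_or_none("2024-01-15"))  # a real date object
print(to_date_or_none("2024-13-01"))  # month 13 does not exist, so None
print(to_int_or_none(31.75))          # fraction is dropped
```

After converting types in Spark, it is worth re-running the missing value check from Step 4: a silent parse failure shows up as a new NULL.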

In [0]:
# 🔨 YOUR CODE HERE

# Convert data types
# df_typed = df_clean.withColumn("last_login", F.to_date("last_login")) \
#                    .withColumn("age", F.col("age").cast("integer")) \
#                    .withColumn("purchase_amount", F.col("purchase_amount").cast("double"))

# Check the new schema
# df_typed.printSchema()

# Show sample data
# df_typed.show(5)

In [0]:
# ✅ SOLUTION: Convert Data Types

print("Converting Data Types:")

# Convert data types appropriately
df_typed = df_clean.withColumn("last_login", F.to_date("last_login")) \
                   .withColumn("age", F.col("age").cast("integer")) \
                   .withColumn("purchase_amount", F.col("purchase_amount").cast("double"))

print("Updated Schema:")
df_typed.printSchema()

print("Sample Data with Correct Types:")
df_typed.limit(5).display()

print("Data type conversion complete!")

# Step 7: Feature Engineering

## Your Task: Create New Features
Feature engineering can improve model performance. Let's create some useful features.

**Try this:**
- Create an `age_group` column (Young: <30, Middle: 30-40, Senior: >40)
- Create a `days_since_login` column
- Create a `purchase_category` column (Low: <1500, Medium: 1500-2500, High: >2500)
- Use `F.when()` for conditional logic

*Good features can make the difference between a mediocre and excellent model!*
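
The bucket boundaries above are easy to get wrong by one, so here is a plain-Python sketch of the same rules that you can test at the edges. The function names are hypothetical, and `today` is passed in explicitly so the result is reproducible. One deliberate difference: in an `F.when()` chain, a NULL fails every comparison and falls through to `.otherwise()`, while this sketch returns `None` for missing input.

```python
from datetime import date


def age_group(age):
    """Young: under 30, Middle: 30 to 40 inclusive, Senior: over 40."""
    if age is None:
        return None
    if age < 30:
        return "Young"
    if age <= 40:
        return "Middle"
    return "Senior"


def purchase_category(amount):
    """Low: under 1500, Medium: 1500 to 2500 inclusive, High: over 2500."""
    if amount is None:
        return None
    if amount < 1500:
        return "Low"
    if amount <= 2500:
        return "Medium"
    return "High"


def days_since_login(last_login, today):
    """Whole days between the last login date and the reference date."""
    return (today - last_login).days


print(age_group(30), purchase_category(2500.0))
print(days_since_login(date(2024, 1, 15), date(2024, 1, 20)))
```

Because `days_since_login` depends on the day you run the notebook, consider fixing a reference date if you need the same feature values across experiments.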

In [0]:
# 🔨 YOUR CODE HERE

# Create age groups
# df_features = df_typed.withColumn(
#     "age_group",
#     F.when(F.col("age") < 30, "Young")
#      .when(F.col("age") <= 40, "Middle")
#      .otherwise("Senior")
# )

# Create days since last login
# current_date = F.current_date()
# df_features = df_features.withColumn(
#     "days_since_login",
#     F.datediff(current_date, F.col("last_login"))
# )

# Create purchase categories
# df_features = df_features.withColumn(
#     "purchase_category",
#     F.when(F.col("purchase_amount") < 1500, "Low")
#      .when(F.col("purchase_amount") <= 2500, "Medium")
#      .otherwise("High")
# )

# Show the enhanced dataset
# df_features.show()

In [0]:
# ✅ SOLUTION: Create New Features

print("🛠️ Creating New Features:")

# Create age groups
df_features = df_typed.withColumn(
    "age_group",
    F.when(F.col("age") < 30, "Young")
     .when(F.col("age") <= 40, "Middle")
     .otherwise("Senior")
)

# Create days since last login
current_date = F.current_date()
df_features = df_features.withColumn(
    "days_since_login",
    F.datediff(current_date, F.col("last_login"))
)

# Create purchase categories
df_features = df_features.withColumn(
    "purchase_category",
    F.when(F.col("purchase_amount") < 1500, "Low")
     .when(F.col("purchase_amount") <= 2500, "Medium")
     .otherwise("High")
)

print("Enhanced Dataset with New Features:")
df_features.display()

print("Feature engineering complete!")

# Step 8: Data Analysis and Insights

## Your Task: Analyze Data Patterns
Let's analyze our processed data to understand patterns and relationships.

**Try this:**
- Calculate average purchase amount by city
- Count customers in each age group
- Analyze purchase patterns by category
- Use `.groupBy()`, `.agg()`, and aggregation functions

*Data analysis helps validate your processing and discover insights!*
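
Conceptually, `.groupBy().agg()` collects rows that share a key and reduces each group to summary values. A minimal plain-Python sketch of that idea follows. `average_by_key` is a hypothetical helper and the input pairs are made-up illustration values, not the workshop dataset.

```python
from collections import defaultdict


def average_by_key(pairs):
    """Hypothetical helper: return (key, average, count) tuples, highest average first."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, row count]
    for key, amount in pairs:
        totals[key][0] += amount
        totals[key][1] += 1
    summary = [(key, total / count, count) for key, (total, count) in totals.items()]
    return sorted(summary, key=lambda item: item[1], reverse=True)


# Made-up (city, purchase_amount) pairs for illustration only.
sample = [("Chicago", 100.0), ("Houston", 50.0), ("Chicago", 300.0)]
result = average_by_key(sample)
print(result)
```

Reporting the count next to the average, as the exercise does with `customer_count`, helps you spot averages that rest on a single customer.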

In [0]:
# 🔨 YOUR CODE HERE

# Average purchase by city
# avg_by_city = df_features.groupBy("city") \
#                         .agg(F.avg("purchase_amount").alias("avg_purchase"),
#                              F.count("*").alias("customer_count")) \
#                         .orderBy(F.desc("avg_purchase"))

# Customer count by age group
# age_distribution = df_features.groupBy("age_group") \
#                               .count() \
#                               .orderBy("age_group")

# Show results
# avg_by_city.show()
# age_distribution.show()

In [0]:
# ✅ SOLUTION: Data Analysis and Insights

print("Analyzing Data Patterns:")

# Average purchase by city
print("Average Purchase Amount by City:")
avg_by_city = df_features.groupBy("city") \
                        .agg(F.avg("purchase_amount").alias("avg_purchase"),
                             F.count("*").alias("customer_count")) \
                        .orderBy(F.desc("avg_purchase"))
avg_by_city.display()

# Customer distribution by age group
print("Customer Distribution by Age Group:")
age_distribution = df_features.groupBy("age_group") \
                              .count() \
                              .orderBy("age_group")
age_distribution.display()

# Purchase category distribution
print("Purchase Category Distribution:")
purchase_distribution = df_features.groupBy("purchase_category") \
                                  .agg(F.count("*").alias("customer_count"),
                                       F.avg("purchase_amount").alias("avg_amount")) \
                                  .orderBy("purchase_category")
purchase_distribution.display()

print("Data analysis complete!")

# Step 9: Save Processed Data

## Your Task: Save Your Processed Dataset
Let's save our cleaned and enhanced dataset for future use.

**Try this:**
- Create a temporary view of the processed data
- Save as Delta table for ACID transactions
- Create a summary report of the processing steps

*Saving processed data allows you to reuse it in multiple ML experiments!*

In [0]:
# 🔨 YOUR CODE HERE

# Create a temporary view
# df_features.createOrReplaceTempView("processed_customers")

# Save as Delta table
# df_features.write.format("delta").mode("overwrite").saveAsTable("demo.processed_customers")

# Create summary report
# print("Processing Summary:")
# print(f"Original rows: {df.count()}")
# print(f"Final rows: {df_features.count()}")
# print(f"Columns added: {len(df_features.columns) - len(df.columns)}")

In [0]:
# ✅ SOLUTION: Save Processed Data

print("Saving Processed Data:")

# Create a temporary view for SQL access
df_features.createOrReplaceTempView("processed_customers")

# Save as Delta table for ACID transactions
# (creating the database or the table may fail without the right permissions)
try:
    # Ensure demo database exists
    spark.sql("CREATE DATABASE IF NOT EXISTS demo")
    df_features.write.format("delta").mode("overwrite").saveAsTable("demo.processed_customers")
    print("Data saved to Delta table: demo.processed_customers")
except Exception as e:
    print(f"Note: {e}")
    print("Data processing completed (table creation may require permissions)")

# Create comprehensive summary report
print("\nData Processing Summary Report:")
print("="*50)
print("Dataset Size:")
print(f"  - Original rows: {df.count()}")
print(f"  - Final rows: {df_features.count()}")
print(f"  - Original columns: {len(df.columns)}")
print(f"  - Final columns: {len(df_features.columns)}")
print(f"  - New features added: {len(df_features.columns) - len(df.columns)}")

# Count NULLs left in the final dataset so the quality claim is backed by a number
remaining_nulls = sum(df_features.filter(F.col(c).isNull()).count() for c in df_features.columns)

print("Data Quality:")
print(f"  - Missing values remaining: {remaining_nulls}")
print("  - Data types corrected: last_login (date), age (integer), purchase_amount (double)")
print("  - Features engineered: age_group, days_since_login, purchase_category")

print("Ready for ML!")
print("Data processing pipeline complete!")

# Workshop 1 Complete!

## What We Accomplished
- **Data Loading** - Created sample customer dataset with realistic issues
- **Data Exploration** - Analyzed schema, statistics, and structure
- **Quality Assessment** - Identified and quantified missing values
- **Data Cleaning** - Applied appropriate imputation strategies
- **Type Conversion** - Ensured proper data types for ML
- **Feature Engineering** - Created valuable new features
- **Data Analysis** - Discovered patterns and relationships
- **Data Persistence** - Saved processed data for reuse

## Key Skills Learned
- **Data Quality Assessment** - How to identify and measure data issues
- **Data Cleaning Techniques** - Multiple strategies for handling missing values
- **Feature Engineering** - Creating useful features from existing data
- **Data Analysis** - Using aggregations to understand patterns
- **Data Management** - Saving and organizing processed datasets

## Next Steps
- **Workshop 2**: Advanced data transformations and vector search
- **Workshop 3**: End-to-end ML pipeline with Feature Store and AutoML
- Apply these techniques to your own datasets
- Experiment with different imputation strategies
- Try creating more complex features

**Excellent work! You've mastered fundamental data processing in Databricks! 🚀**

---

*Ready for more advanced data processing? Continue to Workshop 2!*