# Part 4: Real-Time Prediction

**Objective**: Use the batch model saved in Part 3 to score a simulated live stream. Spark's built-in `rate` source generates the data, so no external streaming infrastructure is needed.


In [None]:
# Setup: import only the functions this notebook uses
# (a wildcard import would shadow Python built-ins such as sum, min, and max)
from pyspark.sql.functions import col, rand, when
import mlflow

## Module 4.1: The "Live Order" Simulator

**Goal**: Use the rate source (a built-in simulator) to manufacture a new stream of orders.


In [None]:
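# Load the trained model from Part 3
# Ask for the most recent run explicitly instead of relying on default ordering
runs = mlflow.search_runs(order_by=["attributes.start_time DESC"], max_results=1)
if runs.empty:
    raise RuntimeError("No MLflow runs found. Run Part 3 first to train and log the model.")
run_id = runs.iloc[0]["run_id"]

model_uri = f"runs:/{run_id}/my_tpch_order_value_model"
loaded_model = mlflow.spark.load_model(model_uri)

print(f"✓ Model loaded from {model_uri}")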


### Create a Simulated Stream

The `rate` source emits rows at a fixed rate, each with a `timestamp` column and an increasing `value` counter, which makes it convenient for simulating live data.


In [None]:
# Create rate stream (generates timestamps)
rate_stream = spark.readStream \
    .format("rate") \
    .option("rowsPerSecond", 1) \
    .load()

rate_stream.printSchema()
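Besides `rowsPerSecond`, the rate source also accepts `rampUpTime` and `numPartitions` options. The sketch below wraps them in a helper; the function name `make_rate_stream` and its defaults are illustrative rather than part of the workshop code, and `spark` is assumed to be an active SparkSession.

```python
def make_rate_stream(spark, rows_per_second=1, ramp_up="0s", partitions=1):
    """Build a streaming DataFrame from Spark's built-in rate source.

    The result has two columns: `timestamp` and `value` (a row counter).
    """
    return (
        spark.readStream
        .format("rate")
        .option("rowsPerSecond", rows_per_second)  # rows generated per second
        .option("rampUpTime", ramp_up)             # time to reach full speed, e.g. "5s"
        .option("numPartitions", partitions)       # partitions for the generated rows
        .load()
    )

# Usage (hypothetical values): five orders per second after a short ramp-up
# rate_stream = make_rate_stream(spark, rows_per_second=5, ramp_up="5s")
```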


In [None]:
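# Transform rate stream into order data (simple simulation)
# Draw one random number per row for the segment, so the five branches are equally likely
orders_stream = rate_stream.withColumn("segment_draw", rand()).select(
    col("timestamp").alias("order_time"),
    # Simulate the 3 features our model needs
    (rand() * 12 + 1).cast("int").alias("month"),
    # Approximates the TPC-H account balance range (-999.99 to 9999.99)
    (rand() * 11000 - 1000).cast("double").alias("c_acctbal"),
    # Market segment: five equal 20% bands of the same draw
    when(col("segment_draw") > 0.8, "AUTOMOBILE")
    .when(col("segment_draw") > 0.6, "BUILDING")
    .when(col("segment_draw") > 0.4, "MACHINERY")
    .when(col("segment_draw") > 0.2, "HOUSEHOLD")
    .otherwise("FURNITURE").alias("c_mktsegment")
)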

orders_stream.printSchema()
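One detail of simulating categories with a chained `when` is worth spelling out: every `rand()` call is an independent draw. A chain that calls `rand()` in each branch therefore does not split the segments evenly, whereas a chain that tests one shared draw against the same thresholds does. The plain-Python arithmetic below (no Spark needed) computes both distributions for the thresholds 0.8, 0.6, 0.4, and 0.2; the function names are illustrative.

```python
SEGMENTS = ["AUTOMOBILE", "BUILDING", "MACHINERY", "HOUSEHOLD", "FURNITURE"]
THRESHOLDS = [0.8, 0.6, 0.4, 0.2]  # branch i fires when its draw is > THRESHOLDS[i]

def independent_draw_probs(thresholds):
    """Each branch uses its own uniform draw: P(branch) = P(reached) * P(draw > t)."""
    probs, reached = [], 1.0
    for t in thresholds:
        probs.append(reached * (1.0 - t))
        reached *= t        # the chain continues only if this branch did not fire
    probs.append(reached)   # the otherwise() branch takes what is left
    return probs

def shared_draw_probs(thresholds):
    """All branches test one uniform draw: each owns the band between thresholds."""
    probs, upper = [], 1.0
    for t in thresholds:
        probs.append(upper - t)
        upper = t
    probs.append(upper)
    return probs

rows = zip(SEGMENTS, independent_draw_probs(THRESHOLDS), shared_draw_probs(THRESHOLDS))
for name, p_independent, p_shared in rows:
    print(f"{name:<11} independent={p_independent:.4f}  shared={p_shared:.4f}")
```

With independent draws, FURNITURE appears in under 4% of rows, while a shared draw gives every segment 20%.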


## Module 4.2: Apply Model & Display Live

**Goal**: See live predictions in the notebook.


### Prepare Features for the Model

The stream must go through the same feature engineering that was used during training. Because the model was saved as a full pipeline, those steps run inside the model itself.


In [None]:
# The loaded model is a full pipeline that includes feature engineering!
# It will automatically apply StringIndexer and VectorAssembler
# We just need to provide the raw features matching the pipeline input
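A quick guard can confirm that the stream carries the raw columns the pipeline expects before `transform()` is called. This sketch assumes the three inputs are `month`, `c_acctbal`, and `c_mktsegment`, as simulated above; the helper name `missing_columns` is hypothetical.

```python
REQUIRED_FEATURES = ["month", "c_acctbal", "c_mktsegment"]  # assumed pipeline inputs

def missing_columns(available, required=REQUIRED_FEATURES):
    """Return the required column names that are absent from `available`."""
    present = set(available)
    return [name for name in required if name not in present]

# Usage in the notebook (DataFrame.columns also works on streaming DataFrames):
# missing = missing_columns(orders_stream.columns)
# if missing:
#     raise ValueError(f"Stream is missing model inputs: {missing}")

print(missing_columns(["order_time", "month", "c_acctbal"]))  # ['c_mktsegment']
```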


### Apply Model to Stream

The fitted pipeline's `transform()` accepts a streaming DataFrame the same way it accepts a batch DataFrame, so no streaming-specific model code is needed.


In [None]:
# Apply model to stream (pipeline handles feature engineering automatically!)
predictions_stream = loaded_model.transform(orders_stream)


### Display Live Predictions

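In Databricks, `display()` renders a streaming DataFrame with live updates. Elsewhere, write the stream to a sink such as `console`. Note that the console sink prints to the driver's standard output, which in a local Jupyter session is usually the terminal that launched the kernel rather than the notebook cell.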


In [None]:
# Display live predictions
# In Databricks: use display() for live updates
# In local: use console sink
query = predictions_stream.select(
    "order_time",
    "c_mktsegment",
    "c_acctbal",
    "prediction"
).writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

# In Databricks, use: display(predictions_stream.select("order_time", "c_mktsegment", "c_acctbal", "prediction"))

print("Streaming started! Check console for predictions.")
print("To stop: query.stop()")
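When console output is not visible in the notebook, the `memory` sink is a common alternative for small demos: it keeps results in an in-memory table on the driver that can be queried with `spark.sql`. The helper below is a sketch and not part of the workshop code; `start_memory_sink` and the table name `live_predictions` are illustrative.

```python
PREDICTION_COLUMNS = ["order_time", "c_mktsegment", "c_acctbal", "prediction"]

def start_memory_sink(predictions_stream, table_name="live_predictions"):
    """Start a streaming query that appends predictions to an in-memory table.

    The memory sink holds every row on the driver, so it suits demos only.
    """
    return (
        predictions_stream.select(*PREDICTION_COLUMNS)
        .writeStream
        .format("memory")
        .queryName(table_name)   # the table name to use in spark.sql(...)
        .outputMode("append")
        .start()
    )

# Usage in the notebook:
# memory_query = start_memory_sink(predictions_stream)
# spark.sql("SELECT * FROM live_predictions ORDER BY order_time DESC LIMIT 10").show()
# memory_query.stop()
```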


### Live Predictions!

**What you're seeing**:
- A stream producing about one prediction per second (the `rowsPerSecond` setting)
- Each row is a new simulated order being scored in real time
- The `prediction` column shows predicted order value

**In production**: The source would be real data from Kafka or Kinesis, with the query running continuously.


### 🎯 Key Takeaways

1. **Streaming**: Use `rate` for simulation, Kafka/Kinesis for production
2. **Models**: A saved Spark ML pipeline scores streaming DataFrames through the same `transform()` call used for batch data
3. **Real-Time**: Same batch model scores live data
4. **Display**: Use `display()` in Databricks for live updates


### 💡 Production Considerations

**For real production streaming**:
- Use Kafka, Kinesis, or Azure Event Hubs as the source
- Write predictions to Delta Lake or a database
- Enable checkpointing for fault tolerance
- Monitor with Spark UI and alerting

**Example production code**:
```python
query = predictions_stream.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/checkpoint/path") \
    .outputMode("append") \
    .start("dbfs:/predictions/")
```
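On the source side, a Kafka-backed version of the orders stream might look like the sketch below. It assumes the Spark Kafka connector (`spark-sql-kafka`) is available to the session and that each message value is a JSON object with the same four fields as the simulated stream. The broker address, topic name, and helper name are placeholders, not values from this workshop.

```python
# Assumed message layout, written as a DDL schema string
ORDER_SCHEMA_DDL = "order_time TIMESTAMP, month INT, c_acctbal DOUBLE, c_mktsegment STRING"

def read_orders_from_kafka(spark, bootstrap_servers, topic):
    """Return a streaming DataFrame of orders parsed from JSON Kafka messages."""
    # Imported here so this module loads even where PySpark is not installed
    from pyspark.sql.functions import col, from_json

    raw = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .option("startingOffsets", "latest")
        .load()
    )
    # Kafka delivers the payload as bytes in the `value` column
    parsed = raw.select(
        from_json(col("value").cast("string"), ORDER_SCHEMA_DDL).alias("order")
    )
    return parsed.select("order.*")

# Usage (placeholder broker and topic):
# orders_stream = read_orders_from_kafka(spark, "broker:9092", "orders")
# predictions_stream = loaded_model.transform(orders_stream)
```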


In [None]:
# Stop the streaming query when done
# (a streaming query keeps running until it is stopped explicitly)
query.stop()
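For unattended demos it can help to let a stream run for a fixed time and then stop it. The helper below is a small sketch that calls `awaitTermination` with a timeout in seconds and then `stop()`; the name `run_for` is illustrative.

```python
def run_for(query, seconds):
    """Let a streaming query run for up to `seconds`, then stop it.

    Returns the query's last progress report, or None if no batch ran.
    """
    try:
        # Returns early if the query terminates on its own
        query.awaitTermination(timeout=seconds)
    finally:
        query.stop()
    return query.lastProgress

# Usage (on a running query, instead of calling query.stop() directly):
# progress = run_for(query, 30)
# print(progress)
```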
