# Pipeline Basics

An **AI pipeline** is a sequence of processing steps that transform raw input into a final output. Each step performs a discrete task — data loading, preprocessing, inference, post-processing — and passes its result to the next step.

Pipelines make complex workflows:

- **Reproducible**: given the same input, configuration, and random seeds, a run produces the same output.
- **Composable**: steps can be swapped, reordered, or extended independently.
- **Observable**: each step's input and output can be inspected for debugging.
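
To make these properties concrete, here is a minimal sketch of a pipeline runner. The helper `run_pipeline` and the three toy string steps are assumptions chosen for illustration; they are not part of any library or of the Document Q&A pipeline defined later.

```python
from typing import Any, Callable

# A step is a (name, function) pair. Both are hypothetical examples.
Step = tuple[str, Callable[[Any], Any]]


def run_pipeline(steps: list[Step], data: Any) -> tuple[Any, list[tuple[str, Any, Any]]]:
    """Run each step in order and record (name, input, output) for inspection."""
    trace = []
    for name, func in steps:
        result = func(data)
        trace.append((name, data, result))
        data = result  # the output of one step becomes the input of the next
    return data, trace


steps = [
    ("strip_whitespace", str.strip),
    ("lowercase", str.lower),
    ("count_words", lambda text: len(text.split())),
]

final_output, trace = run_pipeline(steps, "  Hello Pipeline World  ")
for name, step_input, step_output in trace:
    print(f"{name}: {step_input!r} -> {step_output!r}")
```

Because the steps live in an ordinary list, swapping or reordering them is a list edit, and the recorded trace exposes every intermediate value. The run is repeatable here only because each toy step is a pure function.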

## Defining a Pipeline

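A pipeline can be represented as a dictionary holding a pipeline name and an ordered list of steps, each with a name and a description. The expected input and output types of each step are covered in the Pipeline Stage Summary section.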

In [1]:
pipeline = {
    "name": "Document Q&A",
    "steps": [
        {"name": "load_documents",   "description": "Ingest raw documents from source directory."},
        {"name": "chunk_text",       "description": "Split documents into overlapping text chunks."},
        {"name": "embed_chunks",     "description": "Generate vector embeddings for each chunk."},
        {"name": "retrieve_context", "description": "Find the most relevant chunks for a query."},
        {"name": "generate_answer",  "description": "Produce an answer using retrieved context."},
    ],
}

print(f"Pipeline: {pipeline['name']}")
print("Steps:")
for i, step in enumerate(pipeline["steps"], 1):
    print(f"  {i}. {step['name']:<17s} — {step['description']}")

Pipeline: Document Q&A
Steps:
  1. load_documents    — Ingest raw documents from source directory.
  2. chunk_text        — Split documents into overlapping text chunks.
  3. embed_chunks      — Generate vector embeddings for each chunk.
  4. retrieve_context  — Find the most relevant chunks for a query.
  5. generate_answer   — Produce an answer using retrieved context.
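
Of the steps listed above, `chunk_text` is the easiest to illustrate in isolation. The sketch below is one possible implementation under stated assumptions: the step description does not specify a chunk size, an overlap, or a splitting unit, so the character-based windows and the default values here are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 20, overlap: int = 5) -> list[str]:
    """Split text into character windows; consecutive windows share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, overlap=3)
for chunk in chunks:
    print(chunk)
```

The overlap means text that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some characters twice. A real pipeline might split on tokens or sentences instead of characters.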


## Pipeline Stage Summary

The table below summarizes each stage's role, expected input, and output.

In [2]:
import pandas as pd

stages = pd.DataFrame([
    {"Stage": "load_documents",   "Input": "Directory path",        "Output": "List[Document]", "Description": "Read files from disk into Document objects"},
    {"Stage": "chunk_text",       "Input": "List[Document]",        "Output": "List[Chunk]",    "Description": "Split documents into fixed-size overlapping chunks"},
    {"Stage": "embed_chunks",     "Input": "List[Chunk]",           "Output": "List[Vector]",   "Description": "Convert text chunks to vector embeddings"},
    {"Stage": "retrieve_context", "Input": "Query + List[Vector]",  "Output": "List[Chunk]",    "Description": "Rank and select top-k relevant chunks"},
    {"Stage": "generate_answer",  "Input": "Query + List[Chunk]",   "Output": "String",         "Description": "Prompt LLM with context to produce final answer"},
])

stages

|   | Stage | Input | Output | Description |
|---|-------|-------|--------|-------------|
| 0 | load_documents | Directory path | List[Document] | Read files from disk into Document objects |
| 1 | chunk_text | List[Document] | List[Chunk] | Split documents into fixed-size overlapping chunks |
| 2 | embed_chunks | List[Chunk] | List[Vector] | Convert text chunks to vector embeddings |
| 3 | retrieve_context | Query + List[Vector] | List[Chunk] | Rank and select top-k relevant chunks |
| 4 | generate_answer | Query + List[Chunk] | String | Prompt LLM with context to produce final answer |
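
The stages in this table are meant to connect: each stage's output type should appear in the next stage's input. The following sketch checks that using plain tuples copied from the table. The helper `find_mismatches` is hypothetical, and comparing type names by substring is a simplification that works only because the table uses consistent labels.

```python
# (stage name, input type, output type), copied from the stage summary table
stage_io = [
    ("load_documents", "Directory path", "List[Document]"),
    ("chunk_text", "List[Document]", "List[Chunk]"),
    ("embed_chunks", "List[Chunk]", "List[Vector]"),
    ("retrieve_context", "Query + List[Vector]", "List[Chunk]"),
    ("generate_answer", "Query + List[Chunk]", "String"),
]


def find_mismatches(stages):
    """Return (producer, consumer) pairs where the producer's output is not in the consumer's input."""
    mismatches = []
    for (prev_name, _, prev_out), (next_name, next_in, _) in zip(stages, stages[1:]):
        if prev_out not in next_in:
            mismatches.append((prev_name, next_name))
    return mismatches


print(find_mismatches(stage_io))
```

An empty result means every adjacent pair lines up. A check like this can catch a reordered or replaced step before any data is processed.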


## Next Steps

See [Pipeline Results](02-pipeline-results.ipynb) to learn how to inspect the output of a pipeline run.