KEP ingests PDFs / plain text, splits them into chunks, and calls an LLM for either:

- Classification – tag each paragraph (e.g. “relevant” / “irrelevant”), or
- Extraction – return JSON that follows a user-defined schema.

The pipeline runs in three stages:

1. Convert – every PDF is rendered to Markdown and split into paragraphs.
2. Classify – each paragraph is labelled relevant / irrelevant with an LLM.
3. Extract – a second LLM call turns every relevant paragraph into JSON that follows your schema.
Everything is fully streamed: no intermediate database, no GPUs required.
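To make the flow concrete, here is a minimal sketch of the three stages with hypothetical stand-ins (the real pipeline classes live in the repository; `classify` and `extract` here merely stand in for the two LLM calls):

```python
# Illustrative stand-ins for KEP's convert -> classify -> extract stages.
def convert(document):
    # KEP renders PDFs to Markdown; here we simply split on blank lines.
    return [p.strip() for p in document.split("\n\n") if p.strip()]

def classify(paragraph):
    # Stand-in for the first LLM call ("relevant" / "irrelevant").
    return "relevant" if "phosphate" in paragraph else "irrelevant"

def extract(paragraph):
    # Stand-in for the second LLM call (schema-conforming JSON).
    return {"materials": ["lithium phosphate"], "source_text": paragraph}

document = "Lithium phosphate is highly conductive.\n\nWe thank the funding agency."
relevant = [p for p in convert(document) if classify(p) == "relevant"]
print([extract(p) for p in relevant])
```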
New to KEP? Follow our interactive Quick Start journey for the best learning experience:
```bash
# Navigate to Quick Start directory
cd "Quick Start"

# 1️⃣ Learn about KEP (5 minutes)
python "01_hello_kep.py"

# 2️⃣ Check your environment (2 minutes)
python "02_environment_check.py"

# 3️⃣ Test LLM connections (3 minutes)
python "03_test_connections.py"

# 4️⃣ Understand schemas (10 minutes)
python "04_understanding_schemas.py"

# 5️⃣ Run your first pipeline (15 minutes)
python "05_pipeline_demo.py"

# 6️⃣ Explore results (5 minutes)
python "06_results_explorer.py"

# 7️⃣ Create custom schemas (20 minutes)
python "07_custom_schemas.py"

# 8️⃣ Troubleshooting & support (as needed)
python "08_troubleshooting.py"
```

Total time: ~60 minutes to become proficient with KEP!
For experienced users or production environments:
```bash
# --pdf-dir      folder with PDFs
# --cls-schema   classification schema (relevant / irrelevant)
# --ext-schema   your JSON extraction schema
# --provider     watsonx or rits
# --prompt-mode  zero | few
# --debug-io     dump every prompt + raw reply
python run_pipeline.py \
  --pdf-dir ./pdfs \
  --cls-schema ./schemas/pfas_classification.json \
  --ext-schema ./schemas/pfas_extraction.json \
  --work-dir ./runs/demo \
  --provider watsonx \
  --model-id mistralai/mistral-large \
  --prompt-mode few \
  --debug-io
```

Before running KEP, ensure you have:
- Python 3.8+ installed
- Required dependencies: `pip install -r requirements.txt`
- LLM provider access (WatsonX or RITS)
- API credentials configured
- PDF files and schemas ready
📋 For complete multi-repository setup including Data_llm_extractor, Docling, and Jupyter Lab, see UNIFIED_SETUP.md
🎯 For guided setup and learning, use the Quick Start journey above!
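For a quick sanity check before running the full `02_environment_check.py` script, a few lines of Python cover the basics (the variable names match the environment overrides documented in the configuration section below):

```python
import os
import sys

# Python 3.8+ is required.
assert sys.version_info >= (3, 8), f"Python 3.8+ required, found {sys.version}"

# Warn about missing WatsonX credentials (override names from the config section).
for var in ("WATSONX_URL", "WATSONX_APIKEY", "WATSONX_PROJECT_ID"):
    if not os.environ.get(var):
        print(f"warning: {var} is not set")
```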
Resulting tree:

```
runs/demo/
├─ ingest/all_paragraphs.json
├─ classified_full.json
├─ classified_relevant.json
├─ structured.json
├─ classification/ …      # per-paragraph logs
├─ extraction/ …          # per-paragraph logs
├─ general_metadata.json
├─ llm_metadata.json
└─ run.log
```

Use one of the following commands:
```bash
# Using Git:
git clone -b master git@github.ibm.com:brl-kbe/Knowledge-Extraction-Pipeline.git
cd Knowledge-Extraction-Pipeline
```

Alternatively, using the GitHub CLI:

```bash
gh repo clone brl-kbe/Knowledge-Extraction-Pipeline -- -b master
cd Knowledge-Extraction-Pipeline
```

Set up your environment and install the required packages:
```bash
# Create virtual environment (recommended)
python -m venv kep-env
source kep-env/bin/activate   # On Windows: kep-env\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

All provider settings live inside the repository under `llm/<provider>/config.yaml`.
```yaml
# llm/watsonx/config.yaml
url: "https://us-south.ml.cloud.ibm.com"
apikey: "YOUR_WATSONX_API_KEY"
project_id: "YOUR_PROJECT_ID"
parameters:
  decoding_method: greedy
  max_new_tokens: 2000
  temperature: 0.7
  repetition_penalty: 1.1
  stop_sequences: ["\n\n"]
```

```yaml
# llm/rits/config.yaml (create this file if using RITS)
api_url: "YOUR_RITS_ENDPOINT"
rits_api_key: "YOUR_RITS_API_KEY"
request_defaults:
  temperature: 0.7
  max_tokens: 2000
  top_p: 0.9
```

Every config key can be overridden by environment variables:
- WatsonX: `WATSONX_URL`, `WATSONX_APIKEY`, `WATSONX_PROJECT_ID`
- RITS: `RITS_API_URL`, `RITS_API_KEY`
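How such an override might be resolved, as a minimal sketch (KEP's actual loader may differ; the env-var-to-key mapping and the use of PyYAML are assumptions based on the names above):

```python
import os
import yaml  # PyYAML, assumed available since the provider configs are YAML

# Hypothetical mapping from environment variable to config key.
ENV_OVERRIDES = {
    "WATSONX_URL": "url",
    "WATSONX_APIKEY": "apikey",
    "WATSONX_PROJECT_ID": "project_id",
}

def load_config(path="llm/watsonx/config.yaml"):
    with open(path) as fh:
        cfg = yaml.safe_load(fh)
    # Environment variables take precedence over file values.
    for env_var, key in ENV_OVERRIDES.items():
        if env_var in os.environ:
            cfg[key] = os.environ[env_var]
    return cfg
```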
```bash
# Example: Override API key via environment
export WATSONX_APIKEY="your-actual-key-here"
python run_pipeline.py --provider watsonx ...
```

To add a custom provider:

- Create `llm/<provider_name>/client.py` implementing `LLMClient`
- Add a matching `config.yaml`
- Register it with the `@register_provider("provider_name")` decorator
```python
# llm/custom/client.py
from llm.base import LLMClient
from llm.factory import register_provider


@register_provider("custom")
class CustomClient(LLMClient):
    def inference(self, messages):
        # Receive the pipeline's chat messages and return the model's reply.
        # Your implementation goes here.
        pass
```
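Once registered, the provider should be selectable by the name passed to the decorator, e.g. `--provider custom` (assuming the factory resolves `--provider` through the registry).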
The KEP is currently in its Beta phase, and access to the code-base is strictly controlled. If you would like to test or contribute, please e-mail one of the contacts below with a short description of your use-case and the GitHub (or IBM Git) username you would like to whitelist.
| Contact | Role | E‑mail |
|---|---|---|
| Viviane Torres | Senior Research Scientist, Manager | vivianet@br.ibm.com |
| Marcelo Archanjo | Senior Research Scientist | marcelo.archanjo@ibm.com |
| Anaximandro Souza | PhD Researcher | anaximandrosouza@ibm.com |
Tip: please use your corporate e‑mail (or the e‑mail associated with the Git account you plan to use) so that the team can locate your profile quickly.
The `Quick Start/` directory contains 8 interactive Python scripts designed to teach you KEP progressively:
| Script | Purpose | Time | Description |
|---|---|---|---|
| 01_hello_kep.py | Introduction | 5 min | Overview, architecture, use cases, success stories |
| 02_environment_check.py | Environment | 2 min | Python, dependencies, KEP installation validation |
| 03_test_connections.py | Connectivity | 3 min | LLM providers, authentication, model discovery |
| 04_understanding_schemas.py | Schemas | 10 min | Classification vs extraction, few-shot vs zero-shot |
| 05_pipeline_demo.py | Execution | 15 min | Complete pipeline run with guided configuration |
| 06_results_explorer.py | Analysis | 5 min | Results exploration, quality assessment, export |
| 07_custom_schemas.py | Customization | 20 min | Domain-specific schema creation workshop |
| 08_troubleshooting.py | Support | As needed | Diagnostics, automated fixes, help resources |
After completing the Quick Start journey, you'll be able to:
- ✅ Understand KEP architecture and the three-stage pipeline
- ✅ Configure authentication for WatsonX and RITS providers
- ✅ Design effective schemas for your research domain
- ✅ Run complete pipelines from PDFs to structured JSON
- ✅ Analyze and export results for further research
- ✅ Troubleshoot issues and optimize performance
- ✅ Create custom schemas for specialized domains
```bash
# Navigate to Quick Start
cd "Quick Start"

# Interactive mode (recommended for learning)
python "01_hello_kep.py"

# Command-line options for automation
python "02_environment_check.py" --fix           # Auto-fix issues
python "03_test_connections.py" --verbose        # Detailed output
python "04_understanding_schemas.py" --validate  # Check schemas
python "05_pipeline_demo.py" --auto              # Auto-run with defaults
python "06_results_explorer.py" --stats          # Statistics only
python "07_custom_schemas.py" --template bio     # Use biomedical template
python "08_troubleshooting.py" --quick           # Quick health check
```

Tips:

- Start with the numbered sequence - each script builds on previous knowledge
- Use interactive mode first - then explore command-line options
- Run diagnostics early - `02_environment_check.py` and `03_test_connections.py`
- Practice with examples - use provided schemas before creating custom ones
- Keep scripts handy - they're useful for ongoing troubleshooting and analysis
Schemas both define the extraction structure and provide examples for few-shot prompting.
A classification schema, for example:

```json
{
  "PERSONA": "You are a scientific assistant...",
  "TASK": "Classify the paragraph as relevant or irrelevant",
  "INSTRUCTIONS": [
    "Return one valid JSON object",
    "Field classification must be exactly 'relevant' or 'irrelevant'"
  ],
  "SCHEMAS": {
    "classification": "relevant or irrelevant"
  },
  "EXAMPLE": [
    {
      "text": "Sample paragraph text...",
      "classification": "relevant"
    }
  ]
}
```

An extraction schema:

```json
{
  "PERSONA": "You are a scientific information-extraction assistant",
  "TASK": "Extract structured data from paragraphs",
  "INSTRUCTIONS": [
    "Strict schema compliance",
    "Valid JSON only",
    "Empty lists for missing data"
  ],
  "SCHEMAS": {
    "materials": ["Material"],
    "properties": ["Property"],
    "applications": ["Application"]
  },
  "EXAMPLE": [
    {
      "text": "Sample paragraph...",
      "output": {
        "materials": ["lithium phosphate"],
        "properties": ["conductive"],
        "applications": ["battery electrolyte"]
      }
    }
  ]
}
```
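Because few-shot mode requires the `EXAMPLE` array (see the prompting modes below), it can be worth validating schema files before a run. A minimal sketch, assuming the top-level keys shown in the examples above:

```python
import json

REQUIRED_KEYS = {"PERSONA", "TASK", "INSTRUCTIONS", "SCHEMAS"}  # assumed from the examples above

def check_schema(path, few_shot=False):
    with open(path) as fh:
        schema = json.load(fh)
    missing = REQUIRED_KEYS - schema.keys()
    if missing:
        raise ValueError(f"{path}: missing keys {sorted(missing)}")
    if few_shot and not schema.get("EXAMPLE"):
        raise ValueError(f"{path}: few-shot mode needs a non-empty EXAMPLE array")
    return schema

check_schema("./schemas/pfas_extraction.json", few_shot=True)
```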
WatsonX Models:

- `mistralai/mistral-large`
- `meta-llama/llama-3-70b-instruct`
- `ibm/granite-13b-chat-v2`
- `google/flan-ul2`
Model Parameters:
```yaml
parameters:
  decoding_method: greedy      # greedy | sample
  max_new_tokens: 2000         # 1-4000
  temperature: 0.7             # 0.0-2.0 (only for sample)
  top_p: 0.9                   # 0.0-1.0 (only for sample)
  top_k: 50                    # 1-100 (only for sample)
  repetition_penalty: 1.1      # 1.0-2.0
  stop_sequences: ["\n\n"]     # Array of stop strings
```

Chunking strategies:

```bash
# Paragraph-based (default)
--chunk-strategy paragraph
# Sentence-based with custom size
--chunk-strategy sentence --chunk-size 5 --chunk-overlap 1
# Fixed character chunks
--chunk-strategy fixed --chunk-size 500 --chunk-overlap 50
```
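Under the hood, a fixed-size strategy is essentially a sliding window; a sketch of the idea (illustrative only, not KEP's actual chunker):

```python
def fixed_chunks(text, chunk_size=500, chunk_overlap=50):
    # Slide a chunk_size window over the text, stepping back by the overlap.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = fixed_chunks("some long document text " * 100)
print(len(chunks), len(chunks[0]))   # number of chunks, first chunk length
```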
Debug options:

```bash
# Enable detailed I/O logging
--debug-io
# Set deterministic seed
export PIPELINE_SEED=123
# Custom work directory
--work-dir ./custom-output
```

Zero-shot mode:

- Uses only instructions and schema
- Faster execution
- Good for simple extraction tasks
```bash
python run_pipeline.py --prompt-mode zero ...
```

Few-shot mode:

- Requires examples in schema JSON
- Better accuracy for complex tasks
- Examples must be in the `EXAMPLE` array
```bash
python run_pipeline.py --prompt-mode few ...
```

Important: few-shot mode will fail if schemas lack examples.
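The few-shot contract is easiest to see as prompt assembly. A sketch, assuming an extraction-style schema whose examples carry an `output` field (KEP's real prompt template is internal and may differ):

```python
import json

def build_messages(schema, paragraph):
    # System prompt from the schema's PERSONA, TASK and INSTRUCTIONS.
    system = "\n".join([schema["PERSONA"], schema["TASK"], *schema["INSTRUCTIONS"]])
    messages = [{"role": "system", "content": system}]
    # Few-shot: each schema EXAMPLE becomes a user/assistant exchange.
    for ex in schema.get("EXAMPLE", []):
        messages.append({"role": "user", "content": ex["text"]})
        messages.append({"role": "assistant", "content": json.dumps(ex["output"])})
    messages.append({"role": "user", "content": paragraph})
    return messages
```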
Each pipeline run creates the following structure:
```
runs/demo/
├── ingest/
│   └── all_paragraphs.json     # Converted PDF paragraphs
├── classification/             # Per-paragraph classification logs
├── extraction/                 # Per-paragraph extraction logs
├── classified_full.json        # All paragraphs with classifications
├── classified_relevant.json    # Only relevant paragraphs
├── structured.json             # Final extracted JSON
├── general_metadata.json       # Processing metadata
├── llm_metadata.json           # Model and prompt metadata
└── run.log                     # Complete execution log
```
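A few lines of Python are enough to inspect the final output (this assumes `structured.json` is a JSON array of records whose fields follow your extraction schema, as in the example above; adjust the field names to your own schema):

```python
import json

with open("runs/demo/structured.json") as fh:
    records = json.load(fh)

print(f"{len(records)} extracted records")
for rec in records[:3]:
    # Field names follow the example extraction schema above.
    print(rec.get("materials"), rec.get("properties"), rec.get("applications"))
```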
- Quick Start learning path added – 8 interactive Python scripts for guided learning
- Progressive tutorial system – from introduction to advanced schema creation
- Comprehensive diagnostics – automated environment checking and troubleshooting
- Interactive schema builders – domain-specific templates and guided creation
- Results exploration tools – quality assessment and export capabilities
- Robust error handling – improved authentication testing and network diagnostics
- Evaluation framework removed – the repository is now inference-only
- Strict few-shot contract: examples must reside inside the schema JSON
- New `ExtractorPipeline` wraps convert → classify → extract in ~100 LOC
- `utils/runner.py` simplified to call the new pipeline
- Logging makeover – single file per run (`run.log`) plus rich progress bars
Happy extracting 🎉