## Chapter 2

[Claude Chat Link](https://claude.ai/share/e0031351-df93-4a08-b853-c7372dd05a14)

```bash
# Step 1: Complete pyenv setup in Codespaces
# Check if the configuration was added
cat ~/.bashrc | grep pyenv
# Restart your shell session
exec bash
# OR reload the configuration
source ~/.bashrc
# Now check pyenv
pyenv --version

# Step 2: Install Python 3.11.8 and Poetry
# Install Python 3.11.8
pyenv install 3.11.8
# List installed versions
pyenv versions

# In the new terminal, check if Poetry is installed
poetry --version
# Install Poetry (this doesn't require sudo)
curl -sSL https://install.python-poetry.org | python3 -
# Add Poetry to PATH
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
# Check if Poetry works
poetry --version
# Set Python 3.11.8 as the global default
pyenv global 3.11.8
# Verify it's now active
python --version
python3 --version

# Step 3: Navigate to your project and set the local version
# Go to your project directory
cd /workspaces/llm-twin-replicate
# To create the .python-version file, you must run
pyenv local 3.11.8
# Verify
python --version

# Step 4: Install Poe the Poet
# Install it as a Poetry plugin
poetry self add 'poethepoet[poetry_plugin]'
```

## What is pyproject.toml?
`pyproject.toml` is a configuration file that defines:

- **Project metadata** (name, version, description, author)
- **Dependencies** (what Python packages your project needs)
- **Development dependencies** (tools for testing, formatting, etc.)
- **Build system** (how to package your project)
- **Tool configurations** (settings for various development tools)

It's the modern standard for Python project configuration and, with Poetry, it takes over the roles of older files such as `setup.py` and `requirements.txt`.

You need to create the `pyproject.toml` file yourself (by hand or with `poetry init`); it's a crucial part of any Python project that uses Poetry for dependency management.

### Understanding the `pyproject.toml` file for the llm-twin project

```toml
[tool.poetry]
name = "llm-engineering"
version = "0.1.0"
description = ""
authors = ["iusztinpaul <p.b.iusztin@gmail.com>"]
license = "MIT"
readme = "README.md"

[tool.poetry.dependencies]
python = "~3.11"
zenml = { version = "0.74.0", extras = ["server"] }
pymongo = "^4.6.2"
click = "^8.0.1"
loguru = "^0.7.2"
rich = "^13.7.1"
numpy = "^1.26.4"
poethepoet = "0.29.0"
datasets = "^3.0.1"
torch = "2.2.2"

# Digital data ETL
selenium = "^4.21.0"
webdriver-manager = "^4.0.1"
beautifulsoup4 = "^4.12.3"
html2text = "^2024.2.26"
jmespath = "^1.0.1"
chromedriver-autoinstaller = "^0.6.4"

# Feature engineering
qdrant-client = "^1.8.0"
langchain = "^0.2.11"
sentence-transformers = "^3.0.0"

# RAG
langchain-openai = "^0.1.3"
jinja2 = "^3.1.4"
tiktoken = "^0.7.0"
fake-useragent = "^1.5.1"
langchain-community = "^0.2.11"

# Inference
fastapi = ">=0.100,<=0.110"
uvicorn = "^0.30.6"
opik = "^0.2.2"


[tool.poetry.group.dev.dependencies]
ruff = "^0.4.9"
pre-commit = "^3.7.1"
pytest = "^8.2.2"


[tool.poetry.group.aws.dependencies]
sagemaker = ">=2.232.2"
s3fs = ">2022.3.0"
aws-profile-manager = "^0.7.3"
kubernetes = "^30.1.0"
sagemaker-huggingface-inference-toolkit = "^2.4.0"


[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

# ----------------------------------
# --- Poe the Poet Configuration ---
# ----------------------------------

[tool.poe.tasks]
# Data pipelines
run-digital-data-etl-alex = "echo 'It is not supported anymore.'"
run-digital-data-etl-maxime = "poetry run python -m tools.run --run-etl --no-cache --etl-config-filename digital_data_etl_maxime_labonne.yaml"
run-digital-data-etl-paul = "poetry run python -m tools.run --run-etl --no-cache --etl-config-filename digital_data_etl_paul_iusztin.yaml"
run-digital-data-etl = [
    "run-digital-data-etl-maxime",
    "run-digital-data-etl-paul",
]
run-feature-engineering-pipeline = "poetry run python -m tools.run --no-cache --run-feature-engineering"
run-generate-instruct-datasets-pipeline = "poetry run python -m tools.run --no-cache --run-generate-instruct-datasets"
run-generate-preference-datasets-pipeline = "poetry run python -m tools.run --no-cache --run-generate-preference-datasets"
run-end-to-end-data-pipeline = "poetry run python -m tools.run --no-cache --run-end-to-end-data"

# Utility pipelines
run-export-artifact-to-json-pipeline = "poetry run python -m tools.run --no-cache --run-export-artifact-to-json"
run-export-data-warehouse-to-json = "poetry run python -m tools.data_warehouse --export-raw-data"
run-import-data-warehouse-from-json = "poetry run python -m tools.data_warehouse --import-raw-data"

# Training pipelines
run-training-pipeline = "poetry run python -m tools.run --no-cache --run-training"
run-evaluation-pipeline = "poetry run python -m tools.run --no-cache --run-evaluation"

# Inference
call-rag-retrieval-module = "poetry run python -m tools.rag"

run-inference-ml-service = "poetry run uvicorn tools.ml_service:app --host 0.0.0.0 --port 8000 --reload"
call-inference-ml-service = "curl -X POST 'http://127.0.0.1:8000/rag' -H 'Content-Type: application/json' -d '{\"query\": \"My name is Paul Iusztin. Could you draft a LinkedIn post discussing RAG systems? I am particularly interested in how RAG works and how it is integrated with vector DBs and LLMs.\"}'"

# Infrastructure
## Local infrastructure
local-docker-infrastructure-up = "docker compose up -d"
local-docker-infrastructure-down = "docker compose stop"
local-zenml-server-down = "poetry run zenml logout --local"
local-infrastructure-up = [
    "local-docker-infrastructure-up",
    "local-zenml-server-down",
    "local-zenml-server-up",
]
local-infrastructure-down = [
    "local-docker-infrastructure-down",
    "local-zenml-server-down",
]
set-local-stack = "poetry run zenml stack set default"
set-aws-stack = "poetry run zenml stack set aws-stack"
set-asynchronous-runs = "poetry run zenml orchestrator update aws-stack --synchronous=False"
zenml-server-disconnect = "poetry run zenml disconnect"

## Settings
export-settings-to-zenml = "poetry run python -m tools.run --export-settings"
delete-settings-zenml = "poetry run zenml secret delete settings"

## SageMaker
create-sagemaker-role = "poetry run python -m llm_engineering.infrastructure.aws.roles.create_sagemaker_role"
create-sagemaker-execution-role = "poetry run python -m llm_engineering.infrastructure.aws.roles.create_execution_role"
deploy-inference-endpoint = "poetry run python -m llm_engineering.infrastructure.aws.deploy.huggingface.run"
test-sagemaker-endpoint = "poetry run python -m llm_engineering.model.inference.test"
delete-inference-endpoint = "poetry run python -m llm_engineering.infrastructure.aws.deploy.delete_sagemaker_endpoint"

## Docker
build-docker-image = "docker buildx build --platform linux/amd64 -t llmtwin -f Dockerfile ."
run-docker-end-to-end-data-pipeline = "docker run --rm --network host --shm-size=2g --env-file .env llmtwin poetry poe --no-cache --run-end-to-end-data"
bash-docker-container = "docker run --rm -it --network host --env-file .env llmtwin bash"

# QA
lint-check = "poetry run ruff check ."
format-check = "poetry run ruff format --check ."
lint-check-docker = "sh -c 'docker run --rm -i hadolint/hadolint < Dockerfile'"
gitleaks-check = "docker run -v .:/src zricethezav/gitleaks:latest dir /src/llm_engineering"
lint-fix = "poetry run ruff check --fix ."
format-fix = "poetry run ruff format ."

[tool.poe.tasks.local-zenml-server-up]
control.expr = "sys.platform"

[[tool.poe.tasks.local-zenml-server-up.switch]]
case = "darwin"
env = { OBJC_DISABLE_INITIALIZE_FORK_SAFETY = "YES" }
cmd = "poetry run zenml login --local"

[[tool.poe.tasks.local-zenml-server-up.switch]]
case = "win32"
cmd = "poetry run zenml login --local --blocking"

[[tool.poe.tasks.local-zenml-server-up.switch]]
cmd = "poetry run zenml login --local"

# Tests
[tool.poe.tasks.test]
cmd = "poetry run pytest tests/"
env = { ENV_FILE = ".env.testing" }
```

### Build System Configuration
```toml
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```

**What this does**:
- Tells Python how to **build/package** your project when you want to distribute it
- `poetry-core` is the tool that handles the building process
- `poetry.core.masonry.api` is the specific backend that creates wheel files (.whl) and source distributions
- This is required by Python's PEP 517/518 standards for modern packaging

**When you need it**:
- When you want to publish your package to PyPI
- When someone wants to install your project with pip install
- For proper project structure and standards compliance

### Tool Configurations
```toml
[tool.black]
line-length = 88
target-version = ['py311']

[tool.isort]
profile = "black"
line_length = 88

[tool.mypy]
python_version = "3.11"
warn_return_any = true
warn_unused_configs = true
```

**What these do**:
- **[tool.black]**: Configures the Black code formatter
    - `line-length = 88`: Maximum characters per line
    - `target-version = ['py311']`: Format code for Python 3.11
- **[tool.isort]**: Configures import statement sorting
    - `profile = "black"`: Make it compatible with Black formatting
    - `line_length = 88`: Match Black's line length
- **[tool.mypy]**: Configures type checking
    - `python_version = "3.11"`: Target Python version for type checking
    - `warn_return_any = true`: Warn when a function with a declared return type returns a value of type `Any`
    - `warn_unused_configs = true`: Warn about mypy config sections that don't match any checked files

**When you need them**:
- Only when you actually use these tools
- They provide consistent settings across your team
- You can delete these sections if you don't use the tools

### Poetry Add vs Pre-defining Dependencies
We have two approaches:

1. **Approach 1: Start minimal and add as needed**
```toml
# Start with just basic info
[tool.poetry]
name = "llm-twin-replicate"
version = "0.1.0"
description = "..."
authors = ["..."]

[tool.poetry.dependencies]
python = "^3.11"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```

Then add packages one by one:
```bash
poetry add fastapi
poetry add uvicorn
poetry add openai
poetry add --group dev pytest
poetry add --group dev black
```

2. **Approach 2: Pre-define common dependencies**
- List packages you know you'll need
- Faster initial setup
- Good for following tutorials/courses where dependencies are known

### Key components explained:
- **[tool.poetry]**: Basic project info
- **[tool.poetry.dependencies]**: Packages needed to run your app
- **[tool.poetry.group.dev.dependencies]**: Development tools (testing, formatting)
- **[tool.poetry.group.aws.dependencies]**: AWS-specific dependencies, kept in their own group
- **[tool.poe.tasks]**: Custom commands you can run with `poe <task-name>`

Let's break down each section and explain what it does:

### 📋 Basic Project Information
```toml
[tool.poetry]
name = "llm-engineering"
version = "0.1.0"
description = ""
authors = ["iusztinpaul <p.b.iusztin@gmail.com>"]
license = "MIT"
readme = "README.md"
```

- **Project metadata**: Name, version, author info
- **MIT license**: Open source license allowing commercial use
- This is the foundation of any Python project

### 🐍 Core Dependencies
```toml
[tool.poetry.dependencies]
python = "~3.11"  # Exactly Python 3.11.x (not 3.12+)
```
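
The dependency entries in this file use Poetry's caret (`^`) and tilde (`~`) constraints. As a rough illustration of the version ranges they stand for, here is a simplified sketch. The `expand` helper is hypothetical, handles only plain numeric versions, and is not how Poetry itself is implemented.

```python
def _pad(parts: list[int]) -> str:
    """Render version parts as MAJOR.MINOR.PATCH, padding with zeros."""
    padded = parts + [0] * (3 - len(parts))
    return ".".join(str(part) for part in padded)


def _caret_upper(parts: list[int]) -> list[int]:
    """Caret: bump the left-most non-zero part (or the last part if all are zero)."""
    for index, part in enumerate(parts):
        if part != 0:
            return parts[:index] + [part + 1]
    return parts[:-1] + [parts[-1] + 1]


def _tilde_upper(parts: list[int]) -> list[int]:
    """Tilde: bump the minor part if one is given, otherwise the major part."""
    if len(parts) >= 2:
        return [parts[0], parts[1] + 1]
    return [parts[0] + 1]


def expand(constraint: str) -> str:
    """Translate a '^x.y.z' or '~x.y' constraint into an explicit range."""
    operator, version = constraint[0], constraint[1:]
    parts = [int(piece) for piece in version.split(".")]
    if operator == "^":
        upper = _caret_upper(parts)
    elif operator == "~":
        upper = _tilde_upper(parts)
    else:
        raise ValueError(f"unsupported constraint: {constraint}")
    return f">={_pad(parts)},<{_pad(upper)}"


for spec in ("~3.11", "^4.6.2", "^0.7.2", "^0.2.11"):
    print(f"{spec:10} -> {expand(spec)}")
```

Under these rules `langchain = "^0.2.11"` allows only 0.2.x releases, because for 0.x versions the caret treats the minor number as the breaking boundary.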

### MLOps & Orchestration
```toml
zenml = { version = "0.74.0", extras = ["server"] }  # ML pipeline orchestration
```

- **ZenML**: Manages ML pipelines, tracks experiments, handles model deployment

### Data Processing & ML
```toml
datasets = "^3.0.1"      # Hugging Face datasets
torch = "2.2.2"          # PyTorch for deep learning
numpy = "^1.26.4"        # Numerical computing
```

### Web Scraping & Data Collection
```toml
selenium = "^4.21.0"                    # Browser automation
webdriver-manager = "^4.0.1"           # Manages browser drivers
beautifulsoup4 = "^4.12.3"             # HTML parsing
html2text = "^2024.2.26"               # HTML to text conversion
chromedriver-autoinstaller = "^0.6.4"  # Auto Chrome driver setup
```

- These tools scrape digital content (LinkedIn posts, articles, etc.) to build your personal dataset

### Vector Database & Embeddings
```toml
qdrant-client = "^1.8.0"           # Vector database client
sentence-transformers = "^3.0.0"   # Text embeddings
```

- **Qdrant**: Stores vector embeddings of your content
- **Sentence Transformers**: Converts text to numerical vectors

### LLM & RAG (Retrieval-Augmented Generation)
```toml
langchain = "^0.2.11"           # LLM application framework
langchain-openai = "^0.1.3"     # OpenAI integration
langchain-community = "^0.2.11" # Community extensions
tiktoken = "^0.7.0"             # OpenAI tokenizer
jinja2 = "^3.1.4"               # Template engine
```

- **LangChain**: Framework for building LLM applications
- **RAG**: Retrieves relevant pieces of your personal data and passes them to the LLM as context for its response

### API & Web Service
```toml
fastapi = ">=0.100,<=0.110"  # Modern web API framework
uvicorn = "^0.30.6"          # ASGI server
```

### Utilities
```toml
click = "^8.0.1"        # Command-line interface
loguru = "^0.7.2"       # Advanced logging
rich = "^13.7.1"        # Beautiful terminal output
poethepoet = "0.29.0"   # Task runner
opik = "^0.2.2"         # ML observability
```

### 🛠️ Development Dependencies
```toml
[tool.poetry.group.dev.dependencies]
ruff = "^0.4.9"        # Fast Python linter & formatter
pre-commit = "^3.7.1"  # Git hooks for code quality
pytest = "^8.2.2"      # Testing framework
```

- **Ruff**: Super fast replacement for Black, isort, flake8
- **Pre-commit**: Runs checks before Git commits
- **Pytest**: Industry-standard testing

### ☁️ AWS Cloud Dependencies
```toml
[tool.poetry.group.aws.dependencies]
sagemaker = ">=2.232.2"  # AWS ML platform
s3fs = ">2022.3.0"       # S3 filesystem interface
kubernetes = "^30.1.0"   # Container orchestration
```

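- **Separate group**: Keeps AWS packages apart from the core dependencies. Because the group is not marked `optional = true`, `poetry install` still installs it by default; use `poetry install --without aws` to skip it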
- **SageMaker**: AWS's managed ML platform for training and inference

### 🎯 Poe the Poet Tasks (The Magic!)
This is where the real power lies - **automated workflows:**

### Data Collection Pipelines
```toml
run-digital-data-etl-maxime = "poetry run python -m tools.run --run-etl --no-cache --etl-config-filename digital_data_etl_maxime_labonne.yaml"
```

- Scrapes content published by specific people (Maxime Labonne, Paul Iusztin)
- Creates personalized datasets

### Feature Engineering
```toml
run-feature-engineering-pipeline = "poetry run python -m tools.run --no-cache --run-feature-engineering"
```

- Processes raw data into features for ML training
- Creates embeddings and vector representations

### Training Pipelines
```toml
run-training-pipeline = "poetry run python -m tools.run --no-cache --run-training"
run-evaluation-pipeline = "poetry run python -m tools.run --no-cache --run-evaluation"
```

- Trains your personalized LLM
- Evaluates model performance

### Inference & API
```toml
run-inference-ml-service = "poetry run uvicorn tools.ml_service:app --host 0.0.0.0 --port 8000 --reload"
call-inference-ml-service = "curl -X POST 'http://127.0.0.1:8000/rag' ..."
```

- Starts the API server
- Tests the RAG system with sample queries

### Infrastructure Management
```toml
local-infrastructure-up = [
    "local-docker-infrastructure-up",
    "local-zenml-server-down", 
    "local-zenml-server-up",
]
```

- **Composite tasks**: Runs multiple commands in sequence
- Manages Docker containers, ZenML server, databases

### Cloud Deployment
```toml
deploy-inference-endpoint = "poetry run python -m llm_engineering.infrastructure.aws.deploy.huggingface.run"
delete-inference-endpoint = "poetry run python -m llm_engineering.infrastructure.aws.deploy.delete_sagemaker_endpoint"
```

- Deploys your model to AWS SageMaker
- Manages cloud infrastructure

### 🔍 Key Insights
This project structure shows:
1. **`Complete MLOps pipeline`**: Data → Processing → Training → Deployment
2. **`Production-ready`**: Proper testing, linting, CI/CD setup
3. **`Multi-environment`**: Local development + AWS cloud deployment
4. **`Automated workflows`**: Everything scriptable with `poe <task-name>`
5. **`Modern Python practices`**: Poetry, Ruff, proper dependency management

### 🚀 How You'd Use This
```bash
# Set up the project
poetry install

# Collect your digital data
poe run-digital-data-etl-paul

# Process the data
poe run-feature-engineering-pipeline

# Train your LLM twin
poe run-training-pipeline

# Start the API
poe run-inference-ml-service

# Deploy to cloud
poe deploy-inference-endpoint
```

## `ruff.toml` replaces the tool configurations in `pyproject.toml`
**What's happening here**:
1. This project uses Ruff instead of `Black/isort/flake8`
2. `Ruff` is a modern, all-in-one tool that replaces multiple separate tools
3. The configuration is in `ruff.toml` instead of `pyproject.toml`

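**Old approach (the separate-tool setup shown earlier):**
```toml
# pyproject.toml
[tool.black]         # Code formatter
[tool.isort]         # Import sorter
[tool.mypy]          # Type checker
# flake8 (linter) does not read pyproject.toml natively;
# it is configured in .flake8, setup.cfg, or tox.ini
```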

- 4 separate tools
- Slower (each tool runs separately)
- More configuration needed

**Modern approach (this project):**
```toml
# ruff.toml  
line-length = 120    # Replaces Black
[lint]               # Replaces flake8
[lint.isort]         # Replaces isort
```

- **1 tool (Ruff)** does formatting + linting + import sorting
- **Much faster** (written in `Rust`)
- **Simpler configuration**

### Current `ruff.toml` file
```toml
line-length = 120
target-version = "py311"
extend-exclude = [
    ".github",
    "graphql_client",
    "graphql_schemas"
]

[lint]
extend-select = [
  "I",
  "B",
  "G",
  "T20",
  "PTH",
  "RUF"
]

[lint.isort]
case-sensitive = true

[lint.pydocstyle]
convention = "google"
```

**Basic Configuration:**
```toml
line-length = 120        # Max characters per line (like Black)
target-version = "py311" # Target Python version
extend-exclude = [...]   # Folders to ignore
```

**Linting Rules:**
```toml
[lint]
extend-select = [
  "I",     # isort - import sorting
  "B",     # flake8-bugbear - bug detection
  "G",     # flake8-logging-format - logging best practices
  "T20",   # flake8-print - detect print statements
  "PTH",   # flake8-use-pathlib - use pathlib instead of os.path
  "RUF"    # Ruff-specific rules
]
```

**Import Sorting (replaces isort):**
```toml
[lint.isort]
case-sensitive = true  # Sort imports case-sensitively
```

**Documentation Style:**
```toml
[lint.pydocstyle]
convention = "google"  # Google-style docstrings; only affects the pydocstyle "D" rules, which are not in extend-select above
```

**Why this project doesn't need the `pyproject.toml` tool configs:**

Looking back at the dev dependencies:
```toml
[tool.poetry.group.dev.dependencies]
ruff = "^0.4.9"        # ✅ Only Ruff (no Black, isort, flake8)
pre-commit = "^3.7.1"  
pytest = "^8.2.2"
```

- Notice: No Black, isort, or flake8 - just Ruff!

**Running the tools:**
```bash
# With the old approach:
black .
isort .
flake8 .
mypy .

# With Ruff (this project):
ruff format .    # Replaces Black
ruff check .     # Replaces flake8 + isort
# Still need mypy separately for type checking
```

**Poe tasks that use these:**
```toml
lint-check = "poetry run ruff check ."      # Check for issues
format-check = "poetry run ruff format --check ."  # Check formatting
lint-fix = "poetry run ruff check --fix ."   # Auto-fix issues  
format-fix = "poetry run ruff format ."     # Auto-format code
```


**Check `Poe the poet` installed version with Poetry**
```bash
####### Method 1: Check installed version with Poetry #######
# If installed as a Poetry plugin
poetry self show poethepoet

# Or check all Poetry plugins
poetry self show

####### Method 2: Direct command line check #######
# Check version directly
poe --version

# Or
poetry poe --version

####### Method 3: List installed Poetry plugins #######
poetry self show plugins
```

The following steps walk through creating the `ruff.toml` file:

## Step 1: Create the ruff.toml file
```bash
# In your project root directory (same level as pyproject.toml)
touch ruff.toml
```

## Step 2: Add the configuration content
Open the `ruff.toml` file in VS Code and paste the following content:
```toml
# Ruff configuration for LLM Twin project
line-length = 120
target-version = "py311"

# Exclude common directories that don't need linting
extend-exclude = [
    ".github",
    ".venv",
    "__pycache__",
    "*.egg-info",
    ".mypy_cache",
    ".pytest_cache"
]

[lint]
# Enable specific rule sets
extend-select = [
  "I",    # isort - import sorting
  "B",    # flake8-bugbear - bug detection  
  "G",    # flake8-logging-format - logging best practices
  "T20",  # flake8-print - detect print statements (avoid print in production)
  "PTH",  # flake8-use-pathlib - use pathlib instead of os.path
  "RUF"   # Ruff-specific rules
]

[lint.isort]
case-sensitive = true

[lint.pydocstyle]
convention = "google"  # Use Google-style docstrings
```

## Step 3: Ensure Ruff is in your pyproject.toml
Make sure your pyproject.toml has Ruff in the dev dependencies:
```toml
[tool.poetry.group.dev.dependencies]
ruff = "^0.4.9"
pytest = "^8.2.2"
pre-commit = "^3.7.1"
```

## Step 4: Install Ruff
```bash
# Install all dependencies including dev group
poetry install

# Or specifically add Ruff if it's not in your pyproject.toml yet
poetry add --group dev ruff
```

## Step 5: Test Ruff installation
```bash
# Check if Ruff is installed
poetry run ruff --version

# Check your configuration
poetry run ruff check --show-settings
```

## Step 6: Add Poe the Poet tasks to pyproject.toml
Add these task configurations to your `pyproject.toml`:
```toml
[tool.poe.tasks]
# Code quality tasks
lint-check = "poetry run ruff check ."
format-check = "poetry run ruff format --check ."
lint-fix = "poetry run ruff check --fix ."
format-fix = "poetry run ruff format ."

# Combined tasks
lint = ["lint-check", "format-check"]
fix = ["lint-fix", "format-fix"]

# Development setup
install = "poetry install"
install-dev = "poetry install --with dev"
```

## Step 7: Create a sample Python file to test
```bash
# Create a test file containing some intentionally "bad" code
cat > test_ruff.py << 'EOF'
import os
import sys
import json


def hello_world():
    print("Hello World")
    x=1+2
    return x
EOF
```

## Step 8: Run Ruff on the test file
```bash
# Check for issues
poe lint-check

# Fix formatting
poe format-fix

# Check the file after formatting
cat test_ruff.py
```

## File structure should look like:
```bash
your-project/
├── pyproject.toml
├── ruff.toml          # ← New file you just created
├── README.md
├── test_ruff.py       # ← Test file
└── .gitignore
```

## Pro tip: Add to your workflow
```bash
# Before committing code, always run:
poe fix        # Auto-fix and format
poe lint       # Final check

# Or create a pre-commit hook (advanced)
poetry add --group dev pre-commit
```

## Step 1: Check if poethepoet is installed
```bash
# Check if it's in your dependencies
poetry show | grep poet
```

## Step 2: Install poethepoet
```bash
# Add poethepoet to your project
poetry add --group dev poethepoet

# Or install it as a Poetry plugin (recommended)
poetry self add 'poethepoet[poetry_plugin]'
```

## Step 3: Verify installation
```bash
# Check version (installed as a project dependency)
poetry run poe --version

# Or, if installed as a Poetry plugin
poetry poe --version
```

## Step 4: Add poe tasks to your pyproject.toml
You need to add the `[tool.poe.tasks]` section to your `pyproject.toml`:
```toml
# Add this to your pyproject.toml file
[tool.poe.tasks]
# Code quality tasks
lint-check = "poetry run ruff check ."
format-check = "poetry run ruff format --check ."
lint-fix = "poetry run ruff check --fix ."
format-fix = "poetry run ruff format ."

# Combined tasks
lint = ["lint-check", "format-check"]
fix = ["lint-fix", "format-fix"]
```

## Step 5: Test with direct poetry commands first
Before using `poe`, let's test ruff directly:
```bash
# Test ruff directly
poetry run ruff --version

# Check your test file
poetry run ruff check test_ruff.py

# Format your test file
poetry run ruff format test_ruff.py
```

## Step 6: Test poe commands
```bash
# List available tasks
poe

# Run lint check
poe lint-check
```

## What `poetry shell` does:
`poetry shell` activates the virtual environment created by Poetry by starting a new shell inside it. The before/after comparison below shows the effect:
```bash
poetry shell

# Before:
$ python --version
Python 3.9.7  # System Python

$ which python
/usr/bin/python  # System Python path

# After `poetry shell`:
$ python --version
Python 3.11.0  # Project's Python version

$ which python
/home/user/.cache/pypoetry/virtualenvs/llm-twin-abc123/bin/python  # Virtual env Python
```

## Two ways to run commands:
```bash
# Method 1: Using poetry run (without shell)
# Every command needs 'poetry run'
poetry run python script.py
poetry run ruff check .
poetry run pytest
poetry run poe lint-check

# Method 2: Using poetry shell (activate environment)
# Activate the environment once
poetry shell

# Now run commands directly (no 'poetry run' needed)
python script.py
ruff check .
pytest  
poe lint-check

# Exit the shell when done
exit
```
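
Whichever method you use, it can be helpful to confirm from inside Python which interpreter is actually running. A minimal sketch using only the standard library (the `in_virtualenv` helper name is an illustrative choice):

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a virtual environment sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


version = ".".join(str(part) for part in sys.version_info[:3])
print(f"Interpreter: {sys.executable}")
print(f"Version:     {version}")
print(f"Virtual env: {in_virtualenv()}")
```

Saved to a file and run with `poetry run python <file>`, this should report the project's environment; run with the system Python, it should report `False` for the virtual environment check.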

> Note: `poetry shell` is not available in Poetry 2.0.0 by default

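Poetry 2.0 moved the `shell` command out of the core tool and into a plugin. You have two options:

**Option 1: Use `poetry env activate` (recommended)**
```bash
# This is the new recommended way.
# The command only prints the activation command for your shell,
# so evaluate its output to actually activate the environment:
eval $(poetry env activate)
```
This activates the virtual environment in your current shell. Unlike `poetry shell`, it does not start a subshell, so leave it with `deactivate` rather than `exit`.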

**Option 2: Install the shell plugin**
```bash
# Install the shell plugin
poetry self add poetry-plugin-shell

# Check if shell command is now available
poetry shell

# Your prompt should change to show the virtual environment
# Something like: (llm-twin-replicate-py3.11) $

# Once in the shell, test these commands
poe --version
ruff --version
python --version

# Exit when done
exit
```

## What is pre-commit?
**Pre-commit** is a tool that runs automated checks (hooks) on your code before you commit it to Git. It helps catch issues early and ensures code quality.

### How it works:
1. You type: `git commit -m "my changes"`
2. Pre-commit runs automatically BEFORE the commit
3. It checks your code (linting, formatting, security, etc.)
4. If checks pass ✅ → commit proceeds
5. If checks fail ❌ → commit is blocked, issues shown
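
The pass/fail gate described in these steps can be sketched in a few lines. This is a toy model, not pre-commit's implementation: the two check functions and `run_hooks` are hypothetical stand-ins for real hooks, and they operate on in-memory file contents instead of staged Git files.

```python
from typing import Callable

Check = Callable[[str], bool]


def no_trailing_whitespace(text: str) -> bool:
    """Pass when no line ends in spaces or tabs."""
    return all(line == line.rstrip() for line in text.splitlines())


def ends_with_newline(text: str) -> bool:
    """Pass when the file is empty or ends with a newline."""
    return text == "" or text.endswith("\n")


HOOKS: dict[str, Check] = {
    "trailing-whitespace": no_trailing_whitespace,
    "end-of-file-fixer": ends_with_newline,
}


def run_hooks(staged: dict[str, str]) -> list[str]:
    """Run every hook on every staged file and collect the failures."""
    failures = []
    for hook_name, check in HOOKS.items():
        for filename, text in staged.items():
            if not check(text):
                failures.append(f"{hook_name}: {filename}")
    return failures


staged_files = {
    "good.py": "x = 1\n",
    "bad.py": "x = 1   \ny = 2",  # trailing spaces and no final newline
}
failures = run_hooks(staged_files)
if failures:
    print("commit blocked:", failures)
else:
    print("commit proceeds")
```

An empty failure list corresponds to step 4 (the commit proceeds); any entry corresponds to step 5 (the commit is blocked).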

## Setting up pre-commit:
**Step 1: Install pre-commit**
```bash
# Add to your project
poetry add --group dev pre-commit
```
**Step 2: Create `.pre-commit-config.yaml`:** This file is not generated automatically, so create it manually in the project root. A template for this project:
```yaml
# Pre-commit configuration for LLM Twin project
repos:
  # Ruff linting and formatting
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.9  # Use latest version
    hooks:
      - id: ruff
        name: ruff (linter)
        args: [--fix]  # Auto-fix issues when possible
      - id: ruff-format
        name: ruff (formatter)

  # Security checks with Gitleaks
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4  # Use latest version
    hooks:
      - id: gitleaks

  # Basic file checks
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
        name: Remove trailing whitespace
      - id: end-of-file-fixer
        name: Fix end of files
      - id: check-yaml
        name: Check YAML syntax
      - id: check-toml
        name: Check TOML syntax
      - id: check-json
        name: Check JSON syntax
      - id: check-added-large-files
        name: Check for large files
        args: ['--maxkb=1000']
      - id: check-merge-conflict
        name: Check for merge conflicts

  # Python-specific checks (optional: Ruff's "I" rules already sort imports, so this hook duplicates the ruff hook above)
  - repo: https://github.com/pycqa/isort
    rev: 5.13.2
    hooks:
      - id: isort
        name: Sort imports
        args: ["--profile", "black", "--line-length", "120"]

# Optional: Configure which files to include/exclude
exclude: |
  (?x)^(
    \.git/.*|
    \.venv/.*|
    __pycache__/.*|
    .*\.egg-info/.*|
    build/.*|
    dist/.*
  )$
```
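
pre-commit treats `exclude` as a Python regular expression matched against each file path relative to the repository root. Assuming that behavior, the pattern above can be tried out directly with the `re` module; the sample paths are made up for illustration.

```python
import re

# The same verbose-mode pattern as in the YAML configuration
EXCLUDE = re.compile(
    r"""(?x)^(
        \.git/.*|
        \.venv/.*|
        __pycache__/.*|
        .*\.egg-info/.*|
        build/.*|
        dist/.*
    )$"""
)


def is_excluded(path: str) -> bool:
    """Return True when the path matches the exclude pattern."""
    return EXCLUDE.search(path) is not None


SAMPLE_PATHS = [
    ".venv/lib/python3.11/site.py",
    "build/lib/module.py",
    "llm_engineering/settings.py",
    ".github/workflows/ci.yaml",
    "src/__pycache__/mod.pyc",
]
for path in SAMPLE_PATHS:
    print(f"{path}: {'excluded' if is_excluded(path) else 'checked'}")
```

Two details stand out: `\.git/` does not match `.github/`, and because the pattern is anchored with `^`, only a top-level `__pycache__/` directory is excluded, not nested ones.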

**Step 3: Install the pre-commit hooks**
```bash
# This installs the hooks into your .git directory
poetry run pre-commit install
```

**Step 4: Test the setup**
```bash
# Test on all files (first time)
poetry run pre-commit run --all-files

# Or test on specific files
poetry run pre-commit run --files test_ruff.py
```

**Step 5: Make a test commit**
```bash
# Make some changes to test
echo "print('test')" >> test_ruff.py

# Add and commit (pre-commit will run automatically)
git add test_ruff.py
git commit -m "test pre-commit"
```

#### What each hook does:
*Ruff hooks*:
- `ruff`: Lints your Python code, finds bugs and style issues
- `ruff-format`: Formats your code consistently

*Gitleaks*:
- `gitleaks`: Scans for secrets, API keys, passwords in your code

*Pre-commit-hooks*:
- `trailing-whitespace`: Removes extra spaces at end of lines
- `end-of-file-fixer`: Ensures files end with newline
- `check-yaml/toml/json`: Validates file syntax
- `check-added-large-files`: Prevents committing huge files
- `check-merge-conflict`: Finds unresolved merge conflicts

*isort*:
- `isort`: Sorts and organizes your Python imports

**Example workflow:**
```bash
# 1. Make changes to your code
printf "import sys\nimport os\nprint('hello')\n" > bad_code.py

# 2. Try to commit
git add bad_code.py
git commit -m "add bad code"

# 3. Pre-commit runs and might show:
# - Ruff found style issues ❌
# - Fixed imports with isort ✅
# - Formatted code with ruff-format ✅

# 4. If any hooks fail, commit is blocked
# 5. Fix issues and try again
git add bad_code.py  # Add the fixed files
git commit -m "add bad code"  # Now it should pass ✅
```

- Skip pre-commit (emergency only):
```bash
git commit -m "urgent fix" --no-verify
```

- Update hooks to latest versions:
```bash
poetry run pre-commit autoupdate
```

- Add pre-commit to your `poe` tasks: Add this to your `pyproject.toml`
```toml
[tool.poe.tasks]
pre-commit-install = "poetry run pre-commit install"
pre-commit-run = "poetry run pre-commit run --all-files"
pre-commit-update = "poetry run pre-commit autoupdate"
```

The `.pre-commit-config.yaml` file is like a recipe that tells pre-commit exactly what checks to run and how to run them. You definitely need to create it yourself, but now you have a solid template to start with!

### What are Ruff and pre-commit for, when should you use each, and how do they differ?
What each tool does:

**Ruff (The Tool)**
- What: A Python linter and formatter
- Purpose: Finds bugs, style issues, and formats code
- When it runs: When YOU decide to run it

**Pre-commit (The Automation)**
- What: A framework that runs tools automatically
- Purpose: Ensures code quality checks happen consistently
- When it runs: Automatically before commits (or when triggered)

**Think of it this way**:
```markdown
Ruff = The Worker 👷‍♂️
Pre-commit = The Manager 👔

Pre-commit tells Ruff: "Hey, check this code before it gets committed!"
```

**Visual comparison: Manual Ruff workflow**
```bash
# You have to remember to run these manually
poetry run ruff check .        # Check for issues
poetry run ruff format .       # Format code
git add .
git commit -m "fix code"       # Nothing stops bad code!
```

**Pre-commit + Ruff workflow:**
```bash
# You just try to commit
git add .
git commit -m "fix code"

# Pre-commit automatically runs:
# 1. Ruff check
# 2. Ruff format  
# 3. Other checks
# If anything fails → commit blocked ❌
# If everything passes → commit succeeds ✅
```

**When to use each**:
1. Use Ruff directly when:
```bash
# Development/testing - quick feedback
poetry run ruff check file.py
poe lint-check

# CI/CD pipelines
poetry run ruff check . --diff

# IDE integration (real-time checking)
# VS Code Ruff extension
```
2. Use Pre-commit when:
```bash
# Enforce consistency across team
git commit -m "new feature"  # Auto-runs checks

# Prevent bad code from entering repo
# No more "oops, forgot to run linter"

# Standardize workflow for all developers
```

### The power of combining them

**Ruff vs Pre-commit: Complete Guide**

#### Quick Summary
- **Ruff** = The tool that does the work
- **Pre-commit** = The automation that runs the tools

#### When to Use Each

##### Use Ruff Directly 🔧

**During Development:**
```bash
# Quick feedback while coding
poetry run ruff check file.py
poetry poe lint-check

# Fix specific issues
poetry run ruff check --fix .

# Format specific files
poetry run ruff format src/
```

**In CI/CD:**
```bash
# Check code quality in pipelines
poetry run ruff check . --output-format=github
```

**IDE Integration:**
- VS Code Ruff extension
- PyCharm Ruff plugin
- Real-time linting as you type

##### Use Pre-commit 🤖

**Enforce Team Standards:**
- Runs automatically before every commit
- Blocks commits that fail the configured checks
- Consistent workflow for all developers

**Multiple Tool Coordination:**
- Runs Ruff + other tools together
- Installs each hook's tool in its own isolated environment
- Provides unified pass/fail status

#### Workflow Examples

##### Solo Developer
```bash
# Development
poetry run ruff check .     # Manual checking
poetry run ruff format .    # Manual formatting

# Commit
git commit -m "changes"     # Pre-commit runs automatically
```

##### Team Project
```bash
# Everyone gets the same checks automatically
git commit -m "feature"

# Pre-commit runs whichever hooks are configured, for example:
# - Ruff linting
# - Ruff formatting
# - Import sorting
# - Security scanning
# - File validation
```

#### Configuration Comparison

##### Ruff Configuration (pyproject.toml)
```toml
[tool.ruff]
line-length = 120
target-version = "py311"

[tool.ruff.lint]
select = ["E", "F", "I", "N"]
ignore = ["E501"]
```

##### Pre-commit Configuration (.pre-commit-config.yaml)
```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.9
    hooks:
      - id: ruff
        args: [--fix]
      - id: ruff-format
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4
    hooks:
      - id: gitleaks
```

#### Benefits of Each Approach

##### Ruff Only
✅ Fast feedback during development  
✅ Granular control  
✅ IDE integration  
❌ Easy to forget  
❌ Inconsistent usage across team  

##### Pre-commit + Ruff
✅ Automatic enforcement  
✅ Team consistency  
✅ Multiple tools coordination  
✅ Prevents bad commits  
❌ Slower commit process  
❌ Can be bypassed with `--no-verify`  

#### Best Practice: Use Both! 🎯

##### Development Phase
```bash
# Quick checks while coding
poetry run ruff check file.py
```

##### Commit Phase
```bash
# Automatic comprehensive checks
git commit -m "changes"  # Pre-commit runs
```

##### CI/CD Phase
```bash
# Double-check in pipeline
poetry run ruff check . --diff
```

#### Common Patterns

##### Poe Tasks (pyproject.toml)
```toml
[tool.poe.tasks]
# Manual development checks
lint = "poetry run ruff check ."
format = "poetry run ruff format ."
fix = "poetry run ruff check --fix ."

# Pre-commit management
pre-commit-install = "poetry run pre-commit install"
pre-commit-run = "poetry run pre-commit run --all-files"
```

##### Git Hooks Integration
```bash
# Install once per repository
poetry run pre-commit install

# Runs automatically on every commit
git commit -m "any message"
```

#### Troubleshooting

##### Pre-commit Too Slow?
```bash
# Skip for urgent commits (emergency only)
git commit --no-verify -m "urgent fix"
```

##### Want Both Speed and Safety?
```bash
# Quick check during development
poetry run ruff check file.py

# Comprehensive check before commit
git commit -m "changes"  # Pre-commit runs all checks
```

### Key takeaways:
1. Ruff is the tool that does the actual linting and formatting
2. Pre-commit is the automation that runs Ruff (and other tools) automatically
3. Use both together for the best developer experience:
    - Ruff for quick feedback during development
    - Pre-commit for automatic enforcement before commits

**Your current setup: both are configured. Here's how to use them:**
```bash
# Development: Quick manual checks
poetry poe lint-check    # Uses Ruff via poe
poetry poe format-fix    # Uses Ruff via poe

# Commit: Automatic comprehensive checks
git commit -m "my changes"  # Pre-commit runs Ruff + other tools
```

Think of pre-commit as your quality gate: it blocks commits that fail the configured checks, while Ruff is your tool for quick feedback while coding. Used together, they give you fast feedback during development and automatic enforcement at commit time.

## Local Development Setup and Deployment Setup
After you have installed all the dependencies, create a `.env` file and fill it with your credentials so the project can interact with the external services it depends on. Keeping sensitive credentials in a `.env` file is good security practice, provided the file is listed in `.gitignore` so it is never committed to GitHub or shared with anyone else.
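
How a `.env` file turns into settings can be sketched with the standard library alone. The parser below is a simplified illustration, not the project's actual loader: `parse_env_text` and `load_env_file` are hypothetical helper names, and the sketch handles only plain `KEY=VALUE` lines, `#` comments, and optional surrounding quotes. In practice a library such as `python-dotenv` or `pydantic-settings` would likely do this job.

```python
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes, e.g. DATABASE_HOST="mongodb://..."
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Read a .env file if it exists; return an empty dict otherwise."""
    env_path = Path(path)
    if not env_path.exists():
        return {}
    return parse_env_text(env_path.read_text(encoding="utf-8"))


if __name__ == "__main__":
    settings = load_env_file()
    # Print only the key names so that secrets never end up in terminal output.
    print(sorted(settings))
```

Printing only the key names is a deliberate choice: it confirms the file was found and parsed without echoing any credential.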

Start by collecting API keys from [OpenAI](https://platform.openai.com/docs/quickstart), [HuggingFace](https://huggingface.co/docs/hub/en/security-tokens), and [Comet ML & Opik](https://www.comet.com/docs/opik/?utm_source=llm_handbook&utm_medium=github&utm_campaign=opik). To access Opik's dashboard, use [this link](https://www.comet.com/opik?utm_source=llm_handbook&utm_medium=github&utm_content=opik).

When deploying the project to the cloud, we must configure additional settings for MongoDB, Qdrant, and AWS. If you are only working locally, the default values of these env vars work out of the box.

- Set the `DATABASE_HOST` env var to the URL of your cloud MongoDB cluster ([tutorial](https://www.mongodb.com/resources/products/fundamentals/mongodb-cluster-setup)).

```bash
# --- Required settings even when working locally. ---

# OpenAI API Config
OPENAI_MODEL_ID=gpt-4o-mini
OPENAI_API_KEY=str

# Huggingface API Config
HUGGINGFACE_ACCESS_TOKEN=str

# Comet ML (during training and inference) 
COMET_API_KEY=str

# --- Required settings when deploying the code. ---
# --- Otherwise, default values work fine. ---

# MongoDB database
DATABASE_HOST="mongodb://llm_engineering:llm_engineering@127.0.0.1:27017"

# Qdrant vector database
USE_QDRANT_CLOUD=false
QDRANT_CLOUD_URL=str
QDRANT_APIKEY=str

# AWS Authentication
AWS_ARN_ROLE=str
AWS_REGION=eu-central-1
AWS_ACCESS_KEY=str
AWS_SECRET_KEY=str
```
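
The split between "required even locally" and "defaults work fine" in the template above can be expressed as a small settings object. This is a sketch under stated assumptions rather than the project's real settings class: `Settings`, `missing_required`, and `settings_from_env` are hypothetical names, the defaults are copied from the template, and the literal text `str` is treated as an unfilled placeholder.

```python
from dataclasses import dataclass
from typing import Mapping

PLACEHOLDER = "str"  # the template marks unfilled values with the literal text `str`
REQUIRED_KEYS = ("OPENAI_API_KEY", "HUGGINGFACE_ACCESS_TOKEN", "COMET_API_KEY")
DEFAULT_DATABASE_HOST = "mongodb://llm_engineering:llm_engineering@127.0.0.1:27017"


@dataclass(frozen=True)
class Settings:
    # Required even when working locally: there is no usable default.
    openai_api_key: str
    huggingface_access_token: str
    comet_api_key: str
    # Local defaults copied from the template; override them when deploying.
    openai_model_id: str = "gpt-4o-mini"
    database_host: str = DEFAULT_DATABASE_HOST
    use_qdrant_cloud: bool = False
    aws_region: str = "eu-central-1"


def missing_required(env: Mapping[str, str]) -> list[str]:
    """Return the required keys that are absent, empty, or still set to the placeholder."""
    return [key for key in REQUIRED_KEYS if env.get(key, "") in ("", PLACEHOLDER)]


def settings_from_env(env: Mapping[str, str]) -> Settings:
    """Build Settings from a mapping such as os.environ, falling back to local defaults."""
    missing = missing_required(env)
    if missing:
        raise ValueError("Fill in these .env values first: " + ", ".join(missing))
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        huggingface_access_token=env["HUGGINGFACE_ACCESS_TOKEN"],
        comet_api_key=env["COMET_API_KEY"],
        openai_model_id=env.get("OPENAI_MODEL_ID", "gpt-4o-mini"),
        database_host=env.get("DATABASE_HOST", DEFAULT_DATABASE_HOST),
        use_qdrant_cloud=env.get("USE_QDRANT_CLOUD", "false").lower() == "true",
        aws_region=env.get("AWS_REGION", "eu-central-1"),
    )
```

Calling `settings_from_env(os.environ)` at startup would fail fast with a readable message when a key is still a placeholder, instead of failing later inside an API call.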