# Chapter 42: Full-Text Search

PostgreSQL's Full-Text Search (FTS) capability extends far beyond the simplistic `LIKE '%term%'` pattern matching that crushes performance on large datasets. By implementing linguistic analysis, stemming, ranking algorithms, and relevance scoring directly in the database engine, PostgreSQL eliminates network round-trips to external search services for many use cases. However, effective deployment requires understanding text search configurations, vectorization strategies, and the architectural boundaries where dedicated search engines become necessary.

## 42.1 FTS Architecture and Core Concepts

Full-text search in PostgreSQL operates on a fundamentally different paradigm than substring matching. It transforms natural language into normalized lexemes and builds inverted indexes for rapid retrieval.

### 42.1.1 Documents and Lexemes

A **document** is the unit of search, typically the text fields of one row concatenated together. A **lexeme** is a normalized word form produced by linguistic processing (stemming, stopword removal).

```sql
-- Basic vectorization
SELECT to_tsvector('english', 'The quick brown foxes jumped over the lazy dogs');
-- Returns: 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2

-- Note the transformations:
-- 1. Stop words removed ('the', 'over')
-- 2. Stemming applied ('foxes' -> 'fox', 'jumped' -> 'jump', 'lazy' -> 'lazi')
-- 3. Position information preserved (numbers indicate word positions)
```

**tsvector Structure**:
```sql
-- Anatomy of a tsvector
SELECT to_tsvector('english', 'PostgreSQL is a powerful database');
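-- 'databas':5 'postgresql':1 'power':4
-- (lexemes are stored sorted; 'database' stems to 'databas', and 'is' and 'a' are stop words)

-- Format: 'lexeme':position[weight][,position[weight]...]
-- Weights: A, B, C, D (by convention title, subtitle, body, meta) - default is D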
```

### 42.1.2 The tsquery Type

While `tsvector` represents the document, `tsquery` represents the search predicate with Boolean operators.

```sql
-- Basic query construction
SELECT to_tsquery('english', 'postgresql & database');
-- Returns: 'postgresql' & 'databas'

-- Phrase search (order matters)
SELECT to_tsquery('english', 'postgresql <-> database');
-- <-> is "followed by" (adjacent positions)

-- Complex Boolean logic
SELECT to_tsquery('english', '(quick | fast) & (fox | wolf) & !dog');
-- Returns: ('quick' | 'fast') & ('fox' | 'wolf') & !'dog'

-- Prefix matching
SELECT to_tsquery('english', 'postgres:*');
-- The prefix itself is stemmed to 'postgr':*, so it matches 'postgres' and 'postgresql'
-- but also unrelated lexemes such as 'postgradu' (from 'postgraduate')
```

**Query Operators**:
- `&` : AND (both terms must exist)
- `|` : OR (either term)
- `!` : NOT (exclude term)
- `<->` : Followed by (the second term is exactly 1 position after the first)
- `<N>` : Distance operator (the second term is exactly N positions after the first, not "within N")
- `:*` : Prefix matching
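
The operators above only take effect once a `tsquery` is matched against a `tsvector` with the `@@` operator. A minimal sketch, assuming the built-in `english` configuration; the sample sentence is made up for illustration, and `websearch_to_tsquery` assumes PostgreSQL 11 or later:

```sql
-- The sample text vectorizes to: 'brown':3 'fox':4 'quick':2

SELECT to_tsvector('english', 'The quick brown fox')
       @@ to_tsquery('english', 'quick & fox');      -- true: both lexemes present

SELECT to_tsvector('english', 'The quick brown fox')
       @@ to_tsquery('english', 'quick <-> fox');    -- false: positions 2 and 4 are not adjacent

SELECT to_tsvector('english', 'The quick brown fox')
       @@ to_tsquery('english', 'quick <2> fox');    -- true: exactly 2 positions apart

SELECT to_tsvector('english', 'The quick brown fox')
       @@ to_tsquery('english', 'fox & !dog');       -- true: 'dog' is absent

-- For raw user input, websearch_to_tsquery accepts search-box syntax
-- (quoted phrases, leading minus for exclusion) without raising syntax errors
SELECT websearch_to_tsquery('english', '"quick fox" -dog');
-- 'quick' <-> 'fox' & !'dog'
```

`to_tsquery` raises an error on malformed input, so it is usually reserved for queries the application builds itself.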

## 42.2 Text Search Configurations

Configurations determine how raw text transforms into tsvectors through parsing and dictionary processing.

### 42.2.1 Built-in Configurations

```sql
-- Available configurations
SELECT cfgname FROM pg_ts_config;
-- Common: english, spanish, french, german, simple

-- Examine configuration components
SELECT 
    c.cfgname,
    p.prsname AS parser,
    m.maptokentype AS token_type,
    d.dictname AS dictionary
FROM pg_ts_config c
JOIN pg_ts_config_map m ON m.mapcfg = c.oid
JOIN pg_ts_parser p ON c.cfgparser = p.oid
JOIN pg_ts_dict d ON m.mapdict = d.oid
WHERE c.cfgname = 'english'
LIMIT 5;
```

**Parser Functions**:
The parser breaks text into tokens (words, numbers, email addresses, URLs, etc.):

```sql
-- See how parser tokenizes text
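SELECT * FROM ts_parse('default', 'foo-bar@example.com https://postgresql.org 12345');
-- ts_parse returns only (tokid, token); type names come from ts_token_type('default').
-- Approximate output, with blank tokens (tokid 12) omitted:
-- tokid | token               | token type
-- 4     | foo-bar@example.com | email
-- 14    | https://            | protocol
-- 6     | postgresql.org      | host
-- 22    | 12345               | uint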
```

### 42.2.2 Dictionary Processing and Stemming

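Dictionaries process tokens to produce lexemes. Each token type is mapped to an ordered list of dictionaries, and the first dictionary that recognizes a token decides what happens to it. In the built-in `english` configuration:

1. Word tokens go to **english_stem**, a Snowball stemming dictionary.
2. **Stop words** are removed by `english_stem` itself, using its stop word list.
3. Other token types (numbers, URLs, email addresses) go to the **simple** dictionary, which only lowercases them.

No **ispell** dictionary is configured by default; one can be placed ahead of the stemmer once its dictionary files are installed.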

```sql
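-- Stop words are not exposed in a catalog; they live in
-- $SHAREDIR/tsearch_data/english.stop. Test a word with ts_lexize:
SELECT ts_lexize('english_stem', 'the');    -- {} (empty array: stop word)
SELECT ts_lexize('english_stem', 'stars');  -- {star}

-- Dictionary that only filters stop words: with ACCEPT = false, non-stop words
-- are passed on to the next dictionary in the mapping
-- (point STOPWORDS at your own .stop file for a domain-specific list)
CREATE TEXT SEARCH DICTIONARY english_custom_stops (
    TEMPLATE = simple,
    STOPWORDS = english,
    ACCEPT = false
);

-- Test stemming variations (ts_lexize takes one token at a time)
SELECT word, ts_lexize('english_stem', word) AS lexemes
FROM unnest(ARRAY['running', 'runs', 'ran', 'runner']) AS word;
-- 'running' and 'runs' stem to {run}; the irregular 'ran' and the noun 'runner' are left unchanged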
```
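
To see the parser and the dictionaries working together, `ts_debug` reports, for each token, its type, the dictionaries consulted, and the resulting lexemes. A small sketch, assuming the built-in `english` configuration and an arbitrary sample phrase:

```sql
-- One row per token; rows with alias 'blank' are the whitespace between words
SELECT alias, token, dictionary, lexemes
FROM ts_debug('english', 'The foxes jumped');
-- 'The'    is recognized by english_stem as a stop word: lexemes = {}
-- 'foxes'  is stemmed by english_stem:                   lexemes = {fox}
-- 'jumped' is stemmed by english_stem:                   lexemes = {jump}
```

This is usually the quickest way to find out why a term does or does not match under a given configuration.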

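**Snowball vs. Porter Stemming**:
The built-in `english_stem` dictionary uses Snowball's English stemmer (often called Porter2). No dictionary for the original Porter algorithm is predefined. Assuming your build's Snowball template accepts the `porter` language, you can define one and map it into a copy of the configuration:

```sql
-- Assumption: the snowball template supports LANGUAGE = porter on this server
CREATE TEXT SEARCH DICTIONARY english_porter_stem (
    TEMPLATE = snowball,
    LANGUAGE = porter,
    STOPWORDS = english
);

CREATE TEXT SEARCH CONFIGURATION public.english_porter (
    COPY = pg_catalog.english
);

ALTER TEXT SEARCH CONFIGURATION public.english_porter
    ALTER MAPPING FOR asciiword WITH english_porter_stem;
```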

### 42.2.3 Custom Configurations for Domain-Specific Search

For specialized vocabularies (medical, legal, technical), create custom dictionaries:

```sql
-- Create synonym dictionary
CREATE TEXT SEARCH DICTIONARY tech_synonyms (
    TEMPLATE = synonym,
    SYNONYMS = tech_terms
);

-- Contents of $SHAREDIR/tsearch_data/tech_terms.syn
-- (one "word synonym" pair per line):
-- postgres postgresql
-- pg postgresql
-- js javascript
-- py python

-- Create configuration incorporating synonyms
CREATE TEXT SEARCH CONFIGURATION tech_english (
    COPY = pg_catalog.english
);

ALTER TEXT SEARCH CONFIGURATION tech_english
    ALTER MAPPING FOR asciiword 
    WITH tech_synonyms, english_stem;

-- Now 'postgres' matches 'postgresql', 'js' matches 'javascript'
SELECT to_tsvector('tech_english', 'I love js and postgres');
-- Returns: 'javascript':3 'love':2 'postgresql':5
```

## 42.3 Indexing Strategies (GIN vs. GiST)

Efficient FTS requires specialized indexes. PostgreSQL offers two access methods with distinct trade-offs.

### 42.3.1 GIN Indexes (Preferred for Static Data)

Generalized Inverted Index (GIN) is optimized for containment queries and scales well for read-heavy workloads.

```sql
-- Standard GIN index for FTS
CREATE INDEX idx_articles_search ON articles 
USING GIN (to_tsvector('english', title || ' ' || content));

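-- For multilingual content: the configuration must come from a column of type regconfig
-- (search_config is a hypothetical column name; a text::regconfig cast is not
-- IMMUTABLE, so it is rejected in index expressions)
CREATE INDEX idx_articles_multilingual ON articles 
USING GIN (to_tsvector(search_config, content));

-- Weighted search vector as a stored generated column (PostgreSQL 12+)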
ALTER TABLE articles ADD COLUMN search_vector tsvector 
GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(summary, '')), 'B') ||
    setweight(to_tsvector('english', coalesce(content, '')), 'C')
) STORED;

CREATE INDEX idx_articles_weighted ON articles USING GIN (search_vector);
```

**GIN Characteristics**:
- **Fast lookup**: O(log n) lexeme retrieval
- **Slow updates**: One row produces an index entry per distinct lexeme, so inserts and updates touch many index pages
- **Bloat sensitive**: Frequent updates cause significant index bloat
- **Fast scan**: Since PostgreSQL 9.4, multi-term lookups can skip entries that cannot match, which helps queries combining rare and common terms

### 42.3.2 GiST Indexes (For Dynamic Data)

Generalized Search Tree (GiST) supports lossy indexing with signature compression, better for frequently updated documents.

```sql
-- GiST index for high-churn tables
CREATE INDEX idx_logs_search ON logs 
USING GiST (to_tsvector('english', message));

-- Signature length tuning (PostgreSQL 13+; default 124 bytes, maximum 2024)
-- Shorter signatures shrink the index but produce more false positives
CREATE INDEX idx_logs_search_sig ON logs 
USING GiST (to_tsvector('english', message) tsvector_ops(siglen=100));
```

**GiST Characteristics**:
- **Update friendly**: Faster insert/update than GIN
- **Lossy results**: May return false positives requiring heap recheck (slower exact queries)
- **No ranking in the index**: Like GIN, a GiST index on `tsvector` cannot return rows in rank order; `ts_rank` is computed from the heap row for every match

**Decision Matrix**:
- **Read-heavy, batch-loaded**: GIN (preferred default)
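- **Write-heavy, real-time**: GIN with `fastupdate=on` (the default) and aggressive autovacuum, or GiST when write latency matters more than query speed
- **Huge documents (>10KB)**: GIN; GiST signatures saturate when a document has many distinct lexemes, so false positives and rechecks increase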
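
One way to check this trade-off on your own data is to run the same query under each index type and compare the plans. A sketch, assuming the `articles` table and `search_vector` column from the earlier examples:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM articles
WHERE search_vector @@ to_tsquery('english', 'postgresql & database');
-- Things to look for in the plan:
--   * an index or bitmap index scan on the GIN/GiST index rather than a Seq Scan
--   * "Rows Removed by Index Recheck": false positives discarded after reading the heap
--     (typically higher with GiST because its signatures are lossy)
--   * "Buffers": how many pages each node touched
```

For expression indexes, the query must repeat the indexed expression exactly (same configuration name and same column expression) for the planner to consider the index.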

### 42.3.3 Index Maintenance and Fast Update

GIN's `fastupdate` storage parameter buffers changes for deferred index merging:

```sql
-- Create GIN with fastupdate disabled (predictable query latency, slower writes)
CREATE INDEX idx_docs_search ON documents 
USING GIN (search_vector) 
WITH (fastupdate = off);

-- With fastupdate=on (default), the pending list grows until vacuum or
-- reaching gin_pending_list_limit flushes it into the main index
-- Queries scan the pending list sequentially, so they slow down when it is large
SELECT 
    indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) as index_size,
    pg_size_pretty(pg_relation_size(relid)) as table_size
FROM pg_stat_user_indexes
WHERE indexrelname = 'idx_docs_search';
```
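
Index size alone does not reveal how much work is waiting in the pending list. A sketch of inspecting and flushing it directly, assuming the `pgstattuple` extension is available and reusing the `idx_docs_search` index name from above (with `fastupdate = off` the pending list stays empty):

```sql
-- pgstatginindex is provided by the pgstattuple extension
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Pages and tuples currently waiting in the pending list
SELECT pending_pages, pending_tuples
FROM pgstatginindex('idx_docs_search');

-- Flush the pending list into the main index without a full VACUUM
SELECT gin_clean_pending_list('idx_docs_search');

-- Cap the pending list for this index (value in kB; 1024 is an illustrative choice)
ALTER INDEX idx_docs_search SET (gin_pending_list_limit = 1024);
```

A smaller limit keeps query latency steadier at the cost of more frequent flushes during writes.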

## 42.4 Ranking and Relevance

Raw matching is insufficient; ranking quantifies relevance using statistical and positional algorithms.

### 42.4.1 ts_rank and ts_rank_cd

PostgreSQL provides two ranking functions with different algorithms:

```sql
-- ts_rank: ranking based on the frequency of matching lexemes (and their weights)
SELECT 
    id,
    title,
    ts_rank(search_vector, query) as rank
FROM articles, 
    plainto_tsquery('english', 'postgresql database') query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT 10;

-- ts_rank_cd: Cover Density (higher weight to close proximity)
SELECT 
    id,
    title,
    ts_rank_cd(search_vector, query, 32 /* normalization mask */) as rank
FROM articles,
    to_tsquery('english', 'postgresql <-> database') query
WHERE search_vector @@ query
ORDER BY rank DESC;
```

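**Normalization Masks** (combine with | operator):
- 0: No normalization (document length is ignored; the default)
- 1: Divide rank by 1 + logarithm of document length
- 2: Divide rank by document length
- 4: Divide rank by the mean harmonic distance between extents (`ts_rank_cd` only)
- 8: Divide rank by number of unique words in document
- 16: Divide rank by 1 + logarithm of number of unique words
- 32: Divide rank by itself + 1 (scales the result into the range 0 to 1)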

**Weighting Strategy**:
```sql
-- Custom weight array (D=0.1, C=0.2, B=0.4, A=1.0 default)
SELECT ts_rank(
    '{0.1, 0.2, 0.4, 1.0}',  -- D, C, B, A weights
    search_vector,
    query,
    16 | 2  -- Normalization flags
) FROM ...;
```
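
Weights can also restrict matching, not only ranking: a `tsquery` term may carry weight labels, and it then matches only lexemes stored with one of those weights. A minimal sketch using literal vectors, followed by the same idea applied to the `search_vector` column assumed from Section 42.3.1:

```sql
-- 'postgresql' carries weight A, 'database' carries weight C
SELECT 'postgresql:1A database:2C'::tsvector @@ 'postgresql:A'::tsquery;  -- true
SELECT 'postgresql:1A database:2C'::tsvector @@ 'database:A'::tsquery;    -- false: wrong weight
SELECT 'postgresql:1A database:2C'::tsvector @@ 'database:AC'::tsquery;   -- true: A or C accepted

-- Title-only search against the weighted column (titles were given weight A)
SELECT id, title
FROM articles
WHERE search_vector @@ to_tsquery('english', 'postgresql:A');
```

This allows a single weighted column to serve both "search everywhere" and "search titles only" without a second index.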

### 42.4.2 Highlighting Search Results

`ts_headline` generates search result snippets with hit highlighting:

```sql
SELECT 
    id,
    title,
    ts_headline(
        'english',
        content,
        query,
        'StartSel=<mark>, StopSel=</mark>, MaxWords=35, MinWords=15, MaxFragments=3, FragmentDelimiter=" ... "'
    ) as snippet
FROM articles,
    plainto_tsquery('english', 'postgresql performance') query
WHERE search_vector @@ query
ORDER BY ts_rank(search_vector, query) DESC
LIMIT 5;
```

**Performance Warning**:
`ts_headline` is CPU-intensive (reparses original text). For high-traffic search, cache highlighted results or use application-layer highlighting.
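
One way to bound that cost inside the database is to rank and limit in a subquery, then build headlines only for the rows that survive. A sketch, assuming the same `articles` table and `search_vector` column as above:

```sql
SELECT
    id,
    title,
    ts_headline('english', content, query,
                'StartSel=<mark>, StopSel=</mark>, MaxWords=35, MinWords=15') AS snippet
FROM (
    -- Inner query: match, rank, and keep only the top 5 rows
    SELECT id, title, content, query,
           ts_rank(search_vector, query) AS rank
    FROM articles,
        plainto_tsquery('english', 'postgresql performance') query
    WHERE search_vector @@ query
    ORDER BY rank DESC
    LIMIT 5
) AS top_hits
ORDER BY rank DESC;
```

Written this way, `ts_headline` runs at most five times per search regardless of how many documents match.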

## 42.5 Advanced Search Patterns

### 42.5.1 Multilingual Search Architecture

Supporting multiple languages requires careful schema design:

```sql
-- Language-specific vector columns
ALTER TABLE articles ADD COLUMN search_vector_english tsvector;
ALTER TABLE articles ADD COLUMN search_vector_spanish tsvector;
ALTER TABLE articles ADD COLUMN search_vector_german tsvector;

-- Populate via trigger based on language column
CREATE OR REPLACE FUNCTION articles_search_update() RETURNS trigger AS $$
BEGIN
    IF NEW.language = 'en' THEN
        NEW.search_vector_english := to_tsvector('english', NEW.content);
        NEW.search_vector_spanish := NULL;
        NEW.search_vector_german := NULL;
    ELSIF NEW.language = 'es' THEN
        NEW.search_vector_spanish := to_tsvector('spanish', NEW.content);
        -- etc...
    END IF;
    RETURN NEW;
END;
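$$ LANGUAGE plpgsql;

-- Attach the function (trigger name is illustrative; EXECUTE FUNCTION needs PostgreSQL 11+)
CREATE TRIGGER articles_search_update_trg
    BEFORE INSERT OR UPDATE ON articles
    FOR EACH ROW EXECUTE FUNCTION articles_search_update();

-- Query all languages: build each tsquery with the configuration of the column it targets
SELECT * FROM articles 
WHERE search_vector_english @@ plainto_tsquery('english', 'database')
   OR search_vector_spanish @@ plainto_tsquery('spanish', 'database');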
```

**Alternative: Language-agnostic simple search**:
```sql
-- Use 'simple' configuration for code, identifiers, or mixed-language content
CREATE INDEX idx_code_search ON snippets 
USING GIN (to_tsvector('simple', code));

-- No stemming, no stop words; tokens are lowercased, so 'getUserById' is stored and matched as 'getuserbyid'
SELECT * FROM snippets 
WHERE to_tsvector('simple', code) @@ to_tsquery('simple', 'getUserById');
```

### 42.5.2 Partial Match and Fuzzy Search

FTS does not support fuzzy matching natively. Combine with pg_trgm for typo tolerance:

```sql
-- Hybrid approach: FTS for relevance, trigram for typos
SELECT 
    id,
    title,
    ts_rank(search_vector, to_tsquery('english', 'database')) as rank
FROM articles
WHERE 
    -- Exact FTS match (fast, uses GIN)
    search_vector @@ to_tsquery('english', 'database')
    -- OR fuzzy match (slower, uses GiST/GIN trigram index)
    OR title % 'database'  -- trigram similarity
ORDER BY rank DESC, similarity(title, 'database') DESC
LIMIT 10;

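-- The % operator and similarity() come from the pg_trgm extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Index for trigram (separate from FTS)
CREATE INDEX idx_title_trgm ON articles USING GIN (title gin_trgm_ops);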
```

### 42.5.3 Faceted Search (Category Aggregation)

Combine FTS with GROUP BY for faceted navigation:

```sql
-- Count matches per category
SELECT 
    category,
    count(*) as match_count,
    -- LIMIT is not allowed inside an aggregate; slice the ordered array instead
    (array_agg(id ORDER BY ts_rank(search_vector, query) DESC))[1:5] as top_ids
FROM articles,
    plainto_tsquery('english', 'postgresql') query
WHERE search_vector @@ query
GROUP BY category;

-- With window functions for "top N per category"
WITH ranked AS (
    SELECT 
        id, 
        category, 
        ts_rank(search_vector, query) as rank,
        row_number() OVER (PARTITION BY category ORDER BY ts_rank(search_vector, query) DESC) as rn
    FROM articles,
        plainto_tsquery('english', 'postgresql') query
    WHERE search_vector @@ query
)
SELECT * FROM ranked WHERE rn <= 3;  -- Top 3 per category
```

## 42.6 Operational Considerations

### 42.6.1 Vacuum and Bloat Management

FTS indexes on frequently updated text columns suffer bloat:

```sql
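-- Track size and usage of GIN/GiST indexes; steady growth without matching
-- table growth suggests bloat (this query does not measure bloat directly)
SELECT 
    s.schemaname,
    s.relname as table_name,
    s.indexrelname as index_name,
    pg_size_pretty(pg_relation_size(s.indexrelid)) as index_size,
    s.idx_scan as usage,
    s.idx_tup_read
FROM pg_stat_user_indexes s
JOIN pg_class c ON c.oid = s.indexrelid
JOIN pg_am a ON a.oid = c.relam
WHERE a.amname IN ('gin', 'gist')
ORDER BY pg_relation_size(s.indexrelid) DESC;

-- Reindex strategy (CONCURRENTLY avoids blocking writes; PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_articles_search;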
```

### 42.6.2 Query Performance Optimization

**Limiting Scan Scope**:
```sql
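-- Combine FTS with selective filters to reduce the scan set
-- (the planner chooses the evaluation order, not the order of the WHERE clauses)
SELECT * FROM articles 
WHERE created_at > now() - interval '1 year'  -- Selective range filter (can use a btree index)
  AND search_vector @@ to_tsquery('english', 'postgresql')  -- FTS predicate (can use GIN)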
ORDER BY created_at DESC
LIMIT 20;
```

**Pagination Pitfalls**:
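Deep pagination with `ORDER BY ts_rank()` and `OFFSET` ranks and sorts every matching row again for each page, then throws away the skipped rows. Keyset pagination on the rank, with a unique tie-breaker, avoids returning skipped rows, although every match is still ranked because the rank is not stored in the index:

```sql
-- Page 2: continue after the last (rank, id) pair seen on page 1
-- (0.85 and 1042 are placeholder values; pass the exact values returned for page 1)
SELECT articles.*, ts_rank(search_vector, query) AS rank
FROM articles,
    plainto_tsquery('english', 'postgresql') query
WHERE search_vector @@ query
  AND (ts_rank(search_vector, query), id) < (0.85, 1042)
ORDER BY ts_rank(search_vector, query) DESC, id DESC
LIMIT 20;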
```

## 42.7 Architectural Decision: PostgreSQL vs. Dedicated Search

**Use PostgreSQL FTS when**:
- Data volume < 10-50GB of searchable text
- Query rate < 1000 QPS (sustainable on single instance)
- Search is secondary to transactional consistency (ACID requirements)
- Result ranking is simple (text relevance only, no click-through rates or ML)
- Real-time indexing is critical (no replication lag tolerance)

**Migrate to Elasticsearch/OpenSearch/Sphinx when**:
- Distributed scaling required (horizontal partitioning)
- Complex faceting with millions of unique terms
- Need for ML-tuned relevance (learning to rank)
- Geographic/distributed search with sharding
- Dedicated search team managing cluster
- Document sizes routinely > 1MB (a `tsvector`, lexemes plus positions, must stay under 1MB)

**Hybrid Architecture** (Eventual Consistency):
```sql
-- Write to PostgreSQL (source of truth)
INSERT INTO articles (title, content) VALUES ('New Post', 'Content...') RETURNING id;

-- Async replication to Elasticsearch via logical decoding
-- A change-data-capture tool (e.g. Debezium or pgstream) captures changes and pushes them to the search index

-- Read from PostgreSQL for exact matches, Elasticsearch for fuzzy/relevance
```

## Chapter Summary

In this chapter, you learned:

1. **Core Concepts**: FTS operates on `tsvector` (normalized document representation) and `tsquery` (search predicates). The process involves parsing, stopword removal, and stemming to produce lexemes with position information.

2. **Configurations**: Text search configurations define parsers and dictionaries. The default `english` configuration uses Snowball stemming and standard stopwords. Custom configurations accommodate domain synonyms and multilingual requirements.

3. **Indexing**: GIN indexes provide fast lookups for read-heavy workloads but suffer update overhead. GiST indexes handle frequent updates better but return lossy results requiring recheck. Use GIN with `fastupdate=off` for predictable query latency, or keep `fastupdate` on with aggressive autovacuum when write throughput matters more.

4. **Ranking**: `ts_rank()` calculates relevance from the frequency of matching lexemes and their weights. `ts_rank_cd()` (Cover Density) also rewards terms that appear close together. Use normalization flags to prevent long documents from dominating results solely due to term frequency.

5. **Highlighting**: `ts_headline()` generates contextual snippets with marked search terms but consumes significant CPU—cache results for high-traffic scenarios.

6. **Advanced Patterns**: Implement multilingual search via language-specific columns or the `simple` configuration for code. Combine FTS with `pg_trgm` for fuzzy typo tolerance. Use faceted search with GROUP BY for category navigation.

7. **Operational Management**: Monitor FTS index size and bloat carefully; GIN indexes on high-churn text columns may need a periodic `REINDEX CONCURRENTLY`. Combine FTS predicates with selective criteria (date ranges) so the planner can minimize scan scope.

8. **Architecture Decisions**: PostgreSQL FTS excels for moderate scale (<50GB text, <1000 QPS) requiring transactional consistency. Migrate to dedicated search engines when horizontal scaling, ML relevance tuning, or massive faceting is required.

---

**Next:** In Chapter 43, we will explore Geospatial data handling via PostGIS—covering spatial types (geometry, geography), spatial indexing with R-trees (GiST), coordinate reference systems, and practical patterns for location-based queries within the PostgreSQL ecosystem.