In [10]:
!pip install pyvespa learntorank -qqqq


In [5]:
!git clone --depth 1 https://github.com/vespa-engine/pyvespa.git

Cloning into 'pyvespa'...


In [8]:
# jupyter notebook --notebook-dir pyvespa/docs/sphinx/source



### Understanding Documents and Document IDs in Vespa

**Documents** are the backbone of how Vespa organizes, stores, and retrieves data. Grasping how documents and their unique identifiers work is essential for effectively leveraging Vespa's powerful search capabilities. Let's dive into the essentials.

#### What is a Document in Vespa?

A **document** in Vespa is a structured piece of data that the system indexes, stores, and searches. Think of it as a record or an object containing various **fields** such as attributes, text, or metadata relevant to your application. Each document represents a single entity like a product, article, or user profile.

**Components of a Document:**
- **Fields:** These are key-value pairs where the key is the field name (e.g., `title`, `price`) and the value is the corresponding data.
- **Document ID:** A unique string identifier that distinguishes each document.
- **Schema:** Defines the structure of the document, specifying the types and properties of its fields (e.g., string, integer).

**Example:**
Imagine you have an online store. A product document might look like this:

```json
{
  "fields": {
    "title": "Smartphone X",
    "description": "Latest model with AI-powered features",
    "price": 699.99,
    "availability": true
  },
  "id": "id:store:product::12345"
}
```

#### Document Configuration and Routing

When setting up Vespa clusters, you specify which **document types** each cluster will store. This configuration plays a crucial role in:

1. **Garbage Collection:** Helps manage the deletion and cleanup of documents if garbage collection is enabled.
2. **Default Routing:** Determines the default routes for incoming documents. By default, a document is sent to all clusters that store its type.

**Example Configuration:**
```xml
<documents>
    <document type="product" selection="product.timestamp > now() - 86400" />
</documents>
```
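
The default routing rule can be illustrated with a small Python sketch. The cluster names and the `default_route` helper are hypothetical and only model the rule stated above (a document goes to every cluster that stores its type); this is not Vespa's routing code.

```python
def default_route(document_type, clusters):
    """Return the names of all clusters configured to store `document_type`."""
    return sorted(name for name, types in clusters.items() if document_type in types)


# Hypothetical cluster configuration: cluster name -> document types it stores.
clusters = {
    "catalog": {"product"},
    "archive": {"product", "review"},
}

print(default_route("product", clusters))  # both clusters store "product"
print(default_route("review", clusters))   # only "archive" stores "review"
```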

#### Document Distribution and Clustering

Vespa uses the **Document ID** to distribute documents across different nodes in a cluster. This ensures efficient storage and quick retrieval.

- **Numeric Location:** Derived from the Document ID, it determines where the document is stored.
- **Buckets:** Logical containers grouping documents with similar numeric locations. Documents in the same bucket share certain bits in their numeric location.
- **Co-localized Storage:** Ensures related documents are stored together in the same bucket, enhancing retrieval speed.

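**Example:**
Two products with IDs `id:store:product:g=electronics:1001` and `id:store:product:g=electronics:1002` share the group `electronics`, so their numeric locations are derived from the same group name and the documents are stored together. Without a `g=` or `n=` modifier, the location is derived from the full Document ID, so related documents are not necessarily co-located.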
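
To make the idea of locations and buckets concrete, here is a toy model in Python. It is only a sketch: the hash function (MD5), the 32-bit location, and the 8 "used bits" are assumptions chosen for illustration and do not reproduce Vespa's actual algorithm. It shows the principle that documents sharing a `g=` group get the same location and therefore the same bucket.

```python
import hashlib


def toy_location(doc_id: str) -> int:
    """Toy 32-bit location: hash the group name if present, else the whole ID."""
    parts = doc_id.split(":", 4)
    modifier = parts[3] if len(parts) == 5 else ""
    key = modifier[2:] if modifier.startswith("g=") else doc_id
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def toy_bucket(doc_id: str, used_bits: int = 8) -> int:
    """Keep only the lowest `used_bits` bits of the location as the bucket number."""
    return toy_location(doc_id) & ((1 << used_bits) - 1)


a = toy_bucket("id:store:product:g=electronics:1001")
b = toy_bucket("id:store:product:g=electronics:1002")
print(a == b)  # same group, so same toy bucket
```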

#### Document IDs

**Document IDs** are unique identifiers that Vespa uses to manage documents. They follow a specific format and play a vital role in how documents are stored and retrieved.

**Structure of a Document ID:**
```
id:<namespace>:<document-type>:<key/value-pair>:<user-specified>
```

**Breakdown:**

1. **Namespace (Required):** A string for logical separation of documents. It doesn’t affect how Vespa processes documents but helps avoid ID conflicts.
   - **Example:** In `id:store:products::12345`, `store` is the namespace.

2. **Document-Type (Required):** The document type as defined in your schema.
   - **Example:** In `id:store:products::12345`, `products` is the document type. The empty slot between the two adjacent colons is the omitted key/value pair.

3. **Key/Value Pair (Optional):** Helps control document distribution across buckets. Only relevant for **streaming** or **store-only** document types.
   - **`n=<number>`:** Used for testing bucket distributions.
   - **`g=<groupname>`:** Groups documents by a hashed group name for co-localized storage.
   - **Example:** `id:store:products:g=electronics:12345`

4. **User-Specified (Required):** The unique part of the ID, often the main identifier like a product ID.
   - **Example:** `12345` in `id:store:products:g=electronics:12345`

**Full Example:**
```
id:store:products:g=electronics:12345
```
- **Namespace:** `store`
- **Document Type:** `products`
- **Group Modifier:** `g=electronics`
- **User-Specified ID:** `12345`

**Usage in APIs:**
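To add a document with this ID through the `/document/v1` API, the `g=electronics` modifier becomes a `group/electronics` path segment (an ID without a modifier uses `docid` instead):
```bash
curl -X POST http://localhost:8080/document/v1/store/products/group/electronics/12345 \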
     -H "Content-Type: application/json" \
     -d '{
           "fields": {
             "title": "Smartphone X",
             "description": "AI-powered smartphone",
             "price": 699.99,
             "availability": true
           }
         }'
```
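
The ID scheme and its mapping to `/document/v1` paths can be sketched in Python. The `DocumentId` type and both helper functions are hypothetical names introduced for illustration, not part of pyvespa or any Vespa client; the sketch assumes the five-part format shown above and the `docid`, `group`, and `number` path selectors of the `/document/v1` API.

```python
from typing import NamedTuple, Optional
from urllib.parse import quote


class DocumentId(NamedTuple):
    namespace: str
    document_type: str
    modifier: Optional[str]  # "g=<group>", "n=<number>", or None when omitted
    user_specified: str


def parse_document_id(doc_id: str) -> DocumentId:
    """Split 'id:<namespace>:<document-type>:<key/value-pair>:<user-specified>'."""
    # maxsplit=4 keeps any colons inside the user-specified part intact.
    parts = doc_id.split(":", 4)
    if len(parts) != 5 or parts[0] != "id":
        raise ValueError(f"not a valid document ID: {doc_id!r}")
    _, namespace, document_type, modifier, user_specified = parts
    if not (namespace and document_type and user_specified):
        raise ValueError(f"missing required part in: {doc_id!r}")
    if modifier and not modifier.startswith(("g=", "n=")):
        raise ValueError(f"unknown modifier: {modifier!r}")
    return DocumentId(namespace, document_type, modifier or None, user_specified)


def document_v1_path(doc: DocumentId) -> str:
    """Build the /document/v1 path that addresses a parsed ID."""
    if doc.modifier is None:
        selector = "docid"
    elif doc.modifier.startswith("g="):
        selector = "group/" + quote(doc.modifier[2:], safe="")
    else:
        selector = "number/" + doc.modifier[2:]
    user_part = quote(doc.user_specified, safe="")
    return f"/document/v1/{doc.namespace}/{doc.document_type}/{selector}/{user_part}"


print(document_v1_path(parse_document_id("id:store:products:g=electronics:12345")))
print(document_v1_path(parse_document_id("id:store:products::12345")))
```

An ID that leaves out the key/value slot entirely, such as `id:store:products:12345`, has only four colon-separated parts and is rejected by this parser.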

#### How Vespa Uses Document IDs

Vespa leverages Document IDs to efficiently distribute and manage documents across the cluster:

- **Numeric Location Calculation:** Vespa derives a numeric location from the Document ID. The location determines the document's bucket, and buckets are in turn distributed across the nodes of the cluster.
- **Bucket Assignment:** Documents with similar numeric locations are grouped into the same bucket.
- **Efficient Retrieval:** Storing related documents together in buckets enhances retrieval speed and performance.

**Important Tips:**
- **Modifiers Usage:** Only use key/value pair modifiers (`n` or `g`) for streaming or store-only document types. Avoid them for regular indexed documents to prevent performance issues.
- **Unique Identifiers:** Ensure the user-specified part of the Document ID is unique to avoid conflicts.

#### Document Expiry

Vespa can automatically expire documents based on certain criteria using **garbage collection**. This helps manage storage by removing outdated or irrelevant documents.

**Setting Up Expiry:**
1. **Define a Timestamp Field:**
   ```yaml
   field timestamp type long {
       indexing: attribute
       attribute {
           fast-access
       }
   }
   ```
2. **Configure Garbage Collection:**
   ```xml
   <documents garbage-collection="true">
       <document type="product" selection="product.timestamp > now() - 86400" />
   </documents>
   ```
   - **Selection Expression:** Determines which documents to keep. Documents not matching `product.timestamp > now() - 86400` (i.e., older than one day) will be expired.

**Caution:**
- **Testing Selections:** Use tools like `vespa visit` to ensure your selection expressions correctly identify documents for expiration.
- **Performance Impact:** Ensure fields used in selections are indexed as attributes with fast access to avoid slowing down garbage collection.

**Example Command:**
To see which documents would be preserved:
```bash
vespa visit --selection 'product.timestamp > now() - 86400' --field-set "product.timestamp"
```
To see which documents would be removed:
```bash
vespa visit --selection 'not (product.timestamp > now() - 86400)' --field-set "product.timestamp"
```
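
The effect of the selection expression can be checked on paper with a short Python sketch. It only mirrors the comparison `product.timestamp > now() - 86400` on in-memory dictionaries; the document IDs and the fixed `now` value are made up for the example, and no Vespa instance is involved.

```python
ONE_DAY_SECONDS = 86400


def is_kept(timestamp: int, now: int) -> bool:
    """Mirror the selection `product.timestamp > now() - 86400`."""
    return timestamp > now - ONE_DAY_SECONDS


def partition_by_expiry(documents, now):
    """Split document IDs into (kept, expired) according to the selection."""
    kept, expired = [], []
    for doc in documents:
        target = kept if is_kept(doc["fields"]["timestamp"], now) else expired
        target.append(doc["id"])
    return kept, expired


now = 1_700_000_000  # fixed "current time" in seconds, so the example is reproducible
docs = [
    {"id": "id:store:product::fresh", "fields": {"timestamp": now - 3600}},
    {"id": "id:store:product::stale", "fields": {"timestamp": now - 2 * ONE_DAY_SECONDS}},
]
kept, expired = partition_by_expiry(docs, now)
print("kept:", kept)
print("expired:", expired)
```

Because the comparison is strict, a document whose timestamp is exactly one day old no longer matches the selection and would be expired.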

#### Fieldsets

**Fieldsets** allow you to specify which fields should be returned during read operations, optimizing performance by limiting the data fetched.

**Types of Fieldsets:**
1. **Built-in Fieldsets:**
   - `[all]`: Returns all fields and the Document ID.
   - `[document]`: Returns original fields and the Document ID.
   - `[id]`: Returns only the Document ID.
   - `[none]`: Returns no fields (internal use).

2. **Custom Fieldsets:**
   - Specify a list of fields to retrieve.
   - **Example:** `product:title,price`

**Using Fieldsets:**
- When performing a read operation like `get` or `visit`, specify the desired fieldset to control which fields are returned.

**Example:**
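To fetch only the `title` and `price` of products whose title contains "Smartphone" (the document selection language has no `contains` operator; `=` with a glob pattern is used instead):
```bash
vespa visit --selection 'product.title = "*Smartphone*"' --field-set "product:title,price"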
```
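
What a custom fieldset does to a returned document can be sketched as a simple projection in Python. The `apply_fieldset` helper is hypothetical and handles only the `<doctype>:<field>,<field>` form and `[id]`; it does not check the document type or model the other built-in fieldsets.

```python
def apply_fieldset(document: dict, fieldset: str) -> dict:
    """Return a copy of `document` restricted to the fields named in `fieldset`."""
    if fieldset == "[id]":
        return {"id": document["id"], "fields": {}}
    doc_type, sep, field_list = fieldset.partition(":")
    if not sep or not doc_type:
        raise ValueError(f"expected '<doctype>:<field>,...', got {fieldset!r}")
    wanted = [name.strip() for name in field_list.split(",") if name.strip()]
    fields = document["fields"]
    # Fields that the document does not have are silently skipped.
    return {
        "id": document["id"],
        "fields": {name: fields[name] for name in wanted if name in fields},
    }


product = {
    "id": "id:store:product::12345",
    "fields": {
        "title": "Smartphone X",
        "description": "Latest model with AI-powered features",
        "price": 699.99,
        "availability": True,
    },
}
print(apply_fieldset(product, "product:title,price"))
```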

#### Summary

- **Documents** in Vespa are structured data units comprising fields, a unique Document ID, and a schema.
- **Document IDs** follow a specific scheme that helps Vespa distribute and manage documents efficiently across clusters.
- **Configuration** of document types in clusters ensures proper routing and garbage collection.
- **Buckets** and **co-localized storage** enhance performance by grouping related documents.
- **Fieldsets** optimize data retrieval by allowing selective fetching of fields.
- **Document expiry** helps maintain storage efficiency by automatically removing outdated documents.

Understanding these fundamentals equips you to effectively structure, manage, and query your data in Vespa, ensuring high performance and relevance in your search applications.

### Understanding Schemas in Vespa

Schemas are essential in Vespa as they define the structure and behavior of your data. They determine how documents are stored, indexed, and searched. Let's explore what schemas are, how to create and manage them, and the key components involved.

#### What is a Schema in Vespa?

A **schema** in Vespa defines a **document type** and specifies what computations (like ranking profiles) you want to perform on it. Think of a schema as a blueprint for your documents, outlining the fields they contain and how those fields are processed during indexing and searching.

**Key Points:**
- **Document Type:** Represents a category of documents, such as `music`, `books`, or `products`.
- **Fields:** Attributes or properties of the document, like `title`, `artist`, or `price`.
- **Rank Profiles:** Define how documents are scored and ranked in search results.

#### Defining a Schema

Schemas are stored in files with a `.sd` extension within the `schemas/` directory of your application package. Each schema file typically contains definitions for documents, fields, fieldsets, and rank profiles.

**Example Schema (`music.sd`):**
```yaml
schema music {
    document music {
        field artist type string {
            indexing: summary | index
        }

        field artistId type string {
            indexing: summary | attribute
            match: word
            rank: filter
        }

        field title type string {
            indexing: summary | index
        }

        field album type string {
            indexing: index
        }

        field duration type int {
            indexing: summary
        }

        field year type int {
            indexing: summary | attribute
        }

        field popularity type int {
            indexing: summary | attribute
        }
    }

    fieldset default {
        fields: artist, title, album
    }

    rank-profile song inherits default {
        first-phase {
            expression {
                nativeRank(artist,title) +
                if(isNan(attribute(popularity)) == 1, 0, attribute(popularity))
            }
        }
    }
}
```

**Components Explained:**
- **Document Block:** Defines the `music` document type with its fields.
- **Fields:** Each field has a name, type, and indexing instructions.
- **Fieldset:** Groups specific fields (`artist`, `title`, `album`) for easier querying.
- **Rank Profile:** Specifies how documents of this type are ranked during searches.
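
The arithmetic of the `song` first-phase expression can be mirrored in plain Python. This is only a sketch of the formula: `native_rank` stands in for the value Vespa computes for `nativeRank(artist,title)`, and a missing `popularity` attribute is modeled as NaN, which is what the `isNan` guard in the expression protects against.

```python
import math


def song_first_phase(native_rank: float, popularity: float) -> float:
    """nativeRank(artist,title) + popularity, treating a missing popularity as 0."""
    return native_rank + (0.0 if math.isnan(popularity) else popularity)


print(song_first_phase(0.25, 95))            # text score plus popularity
print(song_first_phase(0.25, float("nan")))  # popularity missing: text score only
```

Since `popularity` is added unscaled, large popularity values can outweigh the text-match score; whether that is desirable depends on the application.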

#### Key Components of a Schema

1. **Document Types and Fields:**
   - **Document:** The main entity, such as a song in the `music` schema.
   - **Field:** An attribute of the document. Each field has a type (e.g., `string`, `int`) and indexing options.

2. **Fieldsets:**
   - **Purpose:** A `fieldset` in a schema groups several fields so that a single query term can search all of them. This is a separate mechanism from the fieldsets used by document read operations (`get`, `visit`), which limit the fields returned.
   - **Built-in Fieldsets (read operations):**
     - `[all]`: Returns all fields and the Document ID.
     - `[document]`: Returns original fields and the Document ID.
     - `[id]`: Returns only the Document ID.
   - **Custom Fieldsets (schema):**
     - Defined by listing specific fields.
     - **Example:** `fieldset myset { fields: artist, title, album }`

3. **Rank Profiles:**
   - **Definition:** Determine how documents are scored and ranked in search results.
   - **Inheritance:** Rank profiles can inherit from other profiles, allowing for reusable ranking logic.
   - **Example:**
     ```yaml
     rank-profile song inherits default {
         first-phase {
             expression {
                 nativeRank(artist,title) +
                 if(isNan(attribute(popularity)) == 1, 0, attribute(popularity))
             }
         }
     }
     ```

4. **Structs and Struct-Fields:**
   - **Struct:** A composite type grouping multiple fields into a single unit.
   - **Struct-Field:** Defines how specific fields within a struct are indexed and searched.
   - **Example:**
     ```yaml
     struct email {
         field sender type string {}
         field recipient type string {}
         field subject type string {}
         field content type string {}
     }

     field emails type array<email> {
         indexing: summary
         struct-field content {
             indexing: attribute
             attribute: fast-search
         }
     }
     ```

#### Inheritance in Schemas and Document Types

Inheritance allows you to create a hierarchy of schemas and document types, promoting reusability and reducing duplication.

**Schema Inheritance:**
- A schema can inherit another, gaining all its definitions.
- **Example:**
  ```yaml
  schema books inherits items {
      document books inherits items {
          field author type string {
              indexing: summary | index
          }
      }
  }
  ```

**Document Type Inheritance:**
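- A document type can inherit the fields of another document type without the schema itself inheriting. Schema-level definitions such as rank profiles are inherited only through schema inheritance.
- **Example:**
  ```yaml
  schema music {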
      document music inherits items {
          field artist type string {
              indexing: summary | index
          }
      }
  }
  ```

**Benefits:**
- **Reusability:** Common fields and rank profiles can be defined once and inherited by multiple document types.
- **Consistency:** Ensures that similar document types share the same structure and ranking logic.

#### Managing Multiple Schemas and Content Clusters

Vespa allows you to define multiple schemas within an application, each potentially mapped to different content clusters. This flexibility helps in scaling and optimizing performance based on different document types.

**Single Content Cluster with Multiple Schemas:**
- All schemas are stored within a single content cluster.
- **Example:**
  ```xml
  <content id="maincluster" version="1.0">
      <documents>
          <document type="albums" mode="index" />
          <document type="lyrics" mode="index" />
          <document type="tracks" mode="index" />
      </documents>
  </content>
  ```

**Multiple Content Clusters:**
- Each schema is mapped to its own content cluster.
- **Example:**
  ```xml
  <content id="musiccluster" version="1.0">
      <documents>
          <document type="albums" mode="index" />
          <document type="tracks" mode="index" />
      </documents>
  </content>

  <content id="lyricscluster" version="1.0">
      <documents>
          <document type="lyrics" mode="index" />
      </documents>
  </content>
  ```

**Querying Multiple Schemas:**
- Use the `restrict` parameter to limit queries to specific schemas.
- **Example:**
  ```bash
  vespa query 'select * from sources * where title contains "bob"' restrict=music,books
  ```

#### Indexing and Match Modes

**Indexing:**
- Determines how field data is processed and stored for efficient searching.
- **Options:**
  - **index:** Creates a text index for full-text search.
  - **attribute:** Keeps the field in memory for sorting, grouping, and filtering.
  - **summary:** Includes the field in search result summaries.
- **Example:**
  ```yaml
  field title type string {
      indexing: summary | index
  }
  ```

**Match Modes:**
- Define how query terms match field data.
- **Examples:**
  - **exact:** Matches the entire term.
  - **prefix:** Matches terms that start with the query string.
- **Special Operators:** When dealing with arrays or maps, operators like `sameElement()` ensure matches occur within the same struct element.
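
The difference that `sameElement()` makes can be shown with a toy Python model. It uses plain equality on dictionaries rather than Vespa's matching, and the sample `emails` data is invented; the point is only the contrast between conditions that may be satisfied by different array elements and conditions that must all hold within one element.

```python
def matches_anywhere(elements, conditions):
    """Each condition may be satisfied by a different element."""
    return all(
        any(element.get(field) == value for element in elements)
        for field, value in conditions.items()
    )


def matches_same_element(elements, conditions):
    """All conditions must hold within a single element."""
    return any(
        all(element.get(field) == value for field, value in conditions.items())
        for element in elements
    )


emails = [
    {"sender": "alice", "subject": "invoice"},
    {"sender": "bob", "subject": "lunch"},
]
conditions = {"sender": "bob", "subject": "invoice"}

print(matches_anywhere(emails, conditions))      # bob and invoice both occur, in different elements
print(matches_same_element(emails, conditions))  # no single email is from bob about an invoice
```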

#### Handling Field Sizes

While Vespa doesn't set a maximum size for fields, large fields (like long strings or large arrays) can impact memory and performance. To manage this:
- **Use Summary Classes:** Limit which fields are returned in query responses.
- **Set Limits:** Use `limit` or `hits` to control the size of result sets.
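- **Cap Matched Length:** Use `max-length` in a `match` block to limit how much of a long field is used for matching.
- **Example:**
  ```yaml
  field description type string {
      indexing: summary | index
      match {
          max-length: 10000
      }
  }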
  ```

#### Schema Modifications and Best Practices

Vespa supports safe schema modifications, allowing you to evolve your data structure without disrupting existing data.

**Adding Fields:**
- Non-destructive and straightforward.
- **Example:**
  ```yaml
  field newField type string {
      indexing: summary
  }
  ```

**Changing Indexing Modes:**
- Destructive changes require validation overrides.
- **Example:**
  To change a field from `index` to `attribute`, update your schema and add a validation override:
  ```xml
  <validation-overrides>
      <allow until="2024-12-31">indexing-change</allow>
  </validation-overrides>
  ```

**Renaming Fields:**
- Not directly supported. Use one of the following methods:
  1. **Drop and Refeed:**
     - Remove the old field.
     - Add the new field.
     - Refeed data with updates.
  2. **Partial Updates:**
     - Add the new field.
     - Update documents with the new field.
     - Transition queries to use the new field.
  3. **Use Aliases:** Create aliases for fields if applicable.

**Example Workflow:**
```bash
# Drop the old field and add the new one
# Update your application to use the new field
# Refeed the data with the new field
```

#### Practical Example: Creating a Music Schema

Let's walk through creating a simple `music` schema with documents representing songs.

1. **Define the Schema File (`music.sd`):**
   ```yaml
   schema music {
       document music {
           field artist type string {
               indexing: summary | index
           }

           field artistId type string {
               indexing: summary | attribute
               match: word
               rank: filter
           }

           field title type string {
               indexing: summary | index
           }

           field album type string {
               indexing: index
           }

           field duration type int {
               indexing: summary
           }

           field year type int {
               indexing: summary | attribute
           }

           field popularity type int {
               indexing: summary | attribute
           }
       }

       fieldset default {
           fields: artist, title, album
       }

       rank-profile song inherits default {
           first-phase {
               expression {
                   nativeRank(artist,title) +
                   if(isNan(attribute(popularity)) == 1, 0, attribute(popularity))
               }
           }
       }
   }
   ```

2. **Add the Schema to Your Application:**
   Place `music.sd` in the `schemas/` directory of your Vespa application package.

3. **Deploy the Application:**
   Use the Vespa CLI or API to deploy your application with the new schema.

4. **Indexing Documents:**
   Add documents to Vespa using the defined schema.
   ```bash
   curl -X POST http://localhost:8080/document/v1/music/music/docid/1 \
        -H "Content-Type: application/json" \
        -d '{
              "fields": {
                "artist": "The Beatles",
                "artistId": "beatles123",
                "title": "Hey Jude",
                "album": "Hey Jude",
                "duration": 431,
                "year": 1968,
                "popularity": 95
              }
            }'
   ```

5. **Querying Documents:**
   Perform searches based on the indexed fields.
   ```bash
   curl "http://localhost:8080/search/?query=Hey+Jude&ranking=song"
   ```
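
The same feed and query requests can be prepared from Python with the standard library. This is a minimal sketch that only builds the requests without sending them; the endpoint is the local one assumed by the curl examples, the helper names are hypothetical, and the rank profile is selected with the Query API's `ranking` parameter.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://localhost:8080"  # assumed local endpoint, as in the curl examples


def build_feed_request(namespace: str, doc_type: str, doc_id: str, fields: dict) -> Request:
    """Prepare (but do not send) a POST to /document/v1 for one document."""
    url = f"{ENDPOINT}/document/v1/{namespace}/{doc_type}/docid/{doc_id}"
    body = json.dumps({"fields": fields}).encode("utf-8")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})


def build_query_url(query: str, rank_profile: str) -> str:
    """Build a /search/ URL with URL-encoded parameters."""
    return f"{ENDPOINT}/search/?" + urlencode({"query": query, "ranking": rank_profile})


feed = build_feed_request("music", "music", "1", {
    "artist": "The Beatles",
    "title": "Hey Jude",
    "year": 1968,
})
print(feed.get_method(), feed.full_url)
print(build_query_url("Hey Jude", "song"))
```

Passing either object to `urllib.request.urlopen` would send it, which requires a running Vespa instance with the `music` schema deployed.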

#### Summary

- **Schemas** define the structure and behavior of your documents in Vespa.
- **Document Types** and **Fields** outline the attributes of your data.
- **Fieldsets** group fields for efficient querying and data retrieval.
- **Rank Profiles** control how documents are scored and ranked in search results.
- **Inheritance** allows for reusable and hierarchical schema definitions.
- **Indexing** and **Match Modes** optimize how data is stored and searched.
- **Managing Schemas** involves safely modifying and evolving your data structure as needed.



### Understanding Schemas and Parent/Child Relationships in Vespa

Schemas play a crucial role in Vespa by defining how your data is structured, indexed, and searched. Additionally, understanding parent/child relationships allows you to model complex data structures efficiently. Let’s explore these concepts in detail.

#### What is a Schema in Vespa?

A **schema** in Vespa defines the structure of your documents and how they should be processed. It specifies the **document types**, **fields**, **fieldsets**, and **rank profiles**. Essentially, a schema acts as a blueprint for your data, ensuring consistency and enabling efficient search and ranking operations.

**Key Components of a Schema:**

- **Document Types:** Categories of documents, such as `music`, `books`, or `products`.
- **Fields:** Attributes within a document, like `title`, `artist`, or `price`.
- **Fieldsets:** Groups of fields that can be queried together.
- **Rank Profiles:** Rules that determine how documents are scored and ranked in search results.

#### Parent/Child Relationships

Parent/child relationships allow you to model complex data structures by creating hierarchical links between documents. This is useful for applications with structured data, such as e-commerce platforms or advertising systems.

**Benefits:**
- **Simplify Operations:** Update related data with a single write operation.
- **Avoid De-normalization:** No need to duplicate data across multiple documents.
- **Efficient Searches:** Search child documents based on parent properties and vice versa.

**Key Features:**
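- **Parent References:** Use document references (`reference<parent-type>` fields) to establish parent/child links. Parent document types must be declared with `global="true"` in `services.xml` so that they are available on every content node.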
- **Imported Fields:** Import fields from parent documents into child schemas for enhanced querying and ranking.
- **No Cascade Delete:** Deleting a parent doesn’t automatically delete children.
- **No Self or Cyclic References:** A document cannot reference itself or create cycles.

**Use Cases:**
- **E-commerce:** Products with multiple sellers.
- **Advertising:** Advertisers with campaigns and ads needing real-time updates.

**Implementing Parent/Child Relationships:**

1. **Define Parent Document:**
   ```yaml
   schema advertiser {
       document advertiser {
           field name type string {
               indexing: attribute
           }
       }
   }
   ```

2. **Define Child Document with Reference:**
   ```yaml
   schema campaign {
       document campaign {
           field advertiser_ref type reference<advertiser> {
               indexing: attribute
           }
           field budget type int {
               indexing: attribute
           }
       }
       import field advertiser_ref.name as advertiser_name {}
   }
   ```

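3. **Define Grandchild Document** (the `salesperson` schema referenced here is not shown; it is assumed to be defined like `advertiser`, with a `name` attribute):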
   ```yaml
   schema ad {
       document ad {
           field campaign_ref type reference<campaign> {
               indexing: attribute
           }
           field other_campaign_ref type reference<campaign> {
               indexing: attribute
           }
           field salesperson_ref type reference<salesperson> {
               indexing: attribute
           }
       }

       import field campaign_ref.budget as budget {}
       import field salesperson_ref.name as salesperson_name {}
       import field campaign_ref.advertiser_name as advertiser_name {}

       document-summary my_summary {
           summary budget {}
           summary salesperson_name {}
           summary advertiser_name {}
       }
   }
   ```

**Example Documents:**

1. **Advertiser Document:**
   ```json
   {
       "put": "id:test:advertiser::cool",
       "fields": {
           "name": "Cool Advertiser"
       }
   }
   ```

2. **Campaign Documents:**
   ```json
   [
       {
           "put": "id:test:campaign::thebest",
           "fields": {
               "advertiser_ref": "id:test:advertiser::cool",
               "budget": 20
           }
       },
       {
           "put": "id:test:campaign::nextbest",
           "fields": {
               "advertiser_ref": "id:test:advertiser::cool",
               "budget": 10
           }
       }
   ]
   ```

3. **Salesperson Document:**
   ```json
   {
       "put": "id:test:salesperson::johndoe",
       "fields": {
           "name": "John Doe"
       }
   }
   ```

4. **Ad Document:**
   ```json
   {
       "put": "id:test:ad::1",
       "fields": {
           "campaign_ref": "id:test:campaign::thebest",
           "other_campaign_ref": "id:test:campaign::nextbest",
           "salesperson_ref": "id:test:salesperson::johndoe"
       }
   }
   ```

**Explanation:**
- The **Ad** document references two **Campaign** documents and one **Salesperson** document.
- Imported fields like `budget`, `salesperson_name`, and `advertiser_name` are used in the ad’s summary, enabling enriched search results without duplicating data.
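
To make the lookup path concrete, the following sketch resolves the ad's imported fields over the example documents held in a plain dictionary. It only mimics the idea of following references; the `imports` table and the `get_field` helper are illustrative names, not part of Vespa.

```python
docs = {
    "id:test:advertiser::cool": {"name": "Cool Advertiser"},
    "id:test:campaign::thebest": {"advertiser_ref": "id:test:advertiser::cool", "budget": 20},
    "id:test:campaign::nextbest": {"advertiser_ref": "id:test:advertiser::cool", "budget": 10},
    "id:test:salesperson::johndoe": {"name": "John Doe"},
    "id:test:ad::1": {
        "campaign_ref": "id:test:campaign::thebest",
        "other_campaign_ref": "id:test:campaign::nextbest",
        "salesperson_ref": "id:test:salesperson::johndoe",
    },
}

# Per document type: imported field name -> (reference field, field in the parent)
imports = {
    "campaign": {"advertiser_name": ("advertiser_ref", "name")},
    "ad": {
        "budget": ("campaign_ref", "budget"),
        "salesperson_name": ("salesperson_ref", "name"),
        "advertiser_name": ("campaign_ref", "advertiser_name"),
    },
}


def doc_type(doc_id):
    # "id:test:ad::1" -> "ad"
    return doc_id.split(":")[2]


def get_field(doc_id, field):
    """Read a field, following imports through reference fields when needed."""
    doc = docs.get(doc_id)
    if doc is None:
        return None  # referenced document is missing
    if field in doc:
        return doc[field]
    imported = imports.get(doc_type(doc_id), {}).get(field)
    if imported is None:
        return None
    ref_field, parent_field = imported
    parent_id = doc.get(ref_field)
    return get_field(parent_id, parent_field) if parent_id else None


summary = {f: get_field("id:test:ad::1", f)
           for f in ("budget", "salesperson_name", "advertiser_name")}
print(summary)
# {'budget': 20, 'salesperson_name': 'John Doe', 'advertiser_name': 'Cool Advertiser'}
```

Note that `advertiser_name` is resolved in two hops: the ad imports it from the campaign, which in turn imports it from the advertiser.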

#### Multivalue Fields

Instead of parent/child relationships, you can use **multivalue fields** (arrays or maps of structs) to represent one-to-many relationships within a single document.

**Example:**
```yaml
struct address {
    field street type string {}
    field city type string {}
    field zip type string {}
}

field addresses type array<address> {
    indexing: summary
}
```

**Use Cases:**
- **Products with Multiple Properties:** Like different sizes or colors.
- **Users with Multiple Contacts:** Like multiple email addresses or phone numbers.
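
A document using the `addresses` field above could be fed with an operation like the one built here. This is a small sketch: the document type `user`, the ID, and the address values are assumptions made for illustration.

```python
import json


def address(street, city, zip_code):
    """One element of the array<address> field, as a JSON object."""
    return {"street": street, "city": city, "zip": zip_code}


put_op = {
    "put": "id:test:user::alice",  # hypothetical document ID
    "fields": {
        "addresses": [
            address("1 Main St", "Springfield", "12345"),
            address("22 Side Rd", "Shelbyville", "67890"),
        ]
    },
}

print(json.dumps(put_op, indent=2))
```

All addresses live inside the one document, so adding or removing one means updating that document rather than writing a separate child document.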

#### Choosing Between Parent/Child and Multivalue Fields

**Parent/Child Relationships:**
- Best for complex, hierarchical data where child documents need to reference parent documents and import fields from them.
- Ideal when child documents are numerous and share common parent attributes.

**Multivalue Fields:**
- Suitable for simpler one-to-many relationships within a single document.
- Easier to manage when the relationships are straightforward and don't require extensive querying based on parent attributes.

**Rule of Thumb:**
- **Use Parent/Child** when you need to reference shared attributes across many documents.
- **Use Multivalue Fields** for flexible structures with an unlimited set of properties per document.

#### Summary

- **Schemas** define the structure and behavior of your documents in Vespa, including fields, fieldsets, and rank profiles.
- **Inheritance** in schemas and document types promotes reusability and consistency.
- **Parent/Child Relationships** enable efficient modeling of hierarchical data without de-normalization.
- **Multivalue Fields** offer an alternative for simpler one-to-many relationships within documents.
- **Indexing and Match Modes** optimize how data is stored and searched.
- **Managing Schemas** involves safe modifications and best practices to maintain data integrity and performance.



### Parent/Child Relationships in Vespa

Modeling complex data structures often requires establishing relationships between different pieces of data. In Vespa, **parent/child relationships** allow you to create hierarchical links between documents, enabling more efficient data management and querying. This approach is particularly useful for applications with structured data, such as e-commerce platforms or advertising systems.

#### Benefits of Parent/Child Relationships

- **Simplified Operations:** A value stored in a parent is updated with a single write, and every child that imports it sees the change.
- **No De-normalization Needed:** Avoid duplicating data across multiple documents, simplifying updates and ensuring consistency.
- **Enhanced Search Capabilities:** Search child documents based on properties imported from their parent documents.
- **Efficient Garbage Collection:** Use imported fields as part of visiting and garbage collection with document selection expressions.

**Note:** Parent/child relationships are not supported in streaming search.

#### Use Cases

- **E-commerce:** Products with multiple sellers.
- **Advertising:** Advertisers with campaigns and ads that have budgets requiring real-time updates.

#### Implementing Parent/Child Relationships

1. **Define Parent and Child Schemas:**

   **Parent Document (Advertiser):**
   ```yaml
   schema advertiser {
       document advertiser {
           field name type string {
               indexing: attribute
           }
       }
   }
   ```

   **Child Document (Campaign):**
   ```yaml
   schema campaign {
       document campaign {
           field advertiser_ref type reference<advertiser> {
               indexing: attribute
           }
           field budget type int {
               indexing: attribute
           }
       }
       import field advertiser_ref.name as advertiser_name {}
   }
   ```

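   **Grandchild Document (Ad):** The `salesperson` schema referenced here is not shown; it is assumed to be defined like `advertiser`, with a `name` attribute.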
   ```yaml
   schema ad {
       document ad {
           field campaign_ref type reference<campaign> {
               indexing: attribute
           }
           field other_campaign_ref type reference<campaign> {
               indexing: attribute
           }
           field salesperson_ref type reference<salesperson> {
               indexing: attribute
           }
       }

       import field campaign_ref.budget as budget {}
       import field salesperson_ref.name as salesperson_name {}
       import field campaign_ref.advertiser_name as advertiser_name {}

       document-summary my_summary {
           summary budget {}
           summary salesperson_name {}
           summary advertiser_name {}
       }
   }
   ```

2. **Add Documents with References:**

   **Advertiser Document:**
   ```json
   {
       "put": "id:test:advertiser::cool",
       "fields": {
           "name": "Cool Advertiser"
       }
   }
   ```

   **Campaign Documents:**
   ```json
   [
       {
           "put": "id:test:campaign::thebest",
           "fields": {
               "advertiser_ref": "id:test:advertiser::cool",
               "budget": 20
           }
       },
       {
           "put": "id:test:campaign::nextbest",
           "fields": {
               "advertiser_ref": "id:test:advertiser::cool",
               "budget": 10
           }
       }
   ]
   ```

   **Salesperson Document:**
   ```json
   {
       "put": "id:test:salesperson::johndoe",
       "fields": {
           "name": "John Doe"
       }
   }
   ```

   **Ad Document:**
   ```json
   {
       "put": "id:test:ad::1",
       "fields": {
           "campaign_ref": "id:test:campaign::thebest",
           "other_campaign_ref": "id:test:campaign::nextbest",
           "salesperson_ref": "id:test:salesperson::johndoe"
       }
   }
   ```

   **Explanation:**
   - The **Ad** document references two **Campaign** documents and one **Salesperson** document.
   - Imported fields like `budget`, `salesperson_name`, and `advertiser_name` are used in the ad’s summary, enabling enriched search results without duplicating data.
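
Because each reference is just a document ID string, a feed can be sanity-checked before it is sent. The check below is an optional helper of my own, not something Vespa provides: it reports reference fields whose target ID does not appear in the same feed. The feed here deliberately omits the `nextbest` campaign.

```python
def dangling_references(feed, reference_fields):
    """Return (document id, field, target id) for references to IDs not in the feed."""
    known = {op["put"] for op in feed}
    missing = []
    for op in feed:
        for field, value in op["fields"].items():
            if field in reference_fields and value not in known:
                missing.append((op["put"], field, value))
    return missing


feed = [
    {"put": "id:test:advertiser::cool", "fields": {"name": "Cool Advertiser"}},
    {"put": "id:test:campaign::thebest",
     "fields": {"advertiser_ref": "id:test:advertiser::cool", "budget": 20}},
    {"put": "id:test:salesperson::johndoe", "fields": {"name": "John Doe"}},
    {"put": "id:test:ad::1",
     "fields": {
         "campaign_ref": "id:test:campaign::thebest",
         "other_campaign_ref": "id:test:campaign::nextbest",  # not in this feed
         "salesperson_ref": "id:test:salesperson::johndoe",
     }},
]

refs = {"advertiser_ref", "campaign_ref", "other_campaign_ref", "salesperson_ref"}
print(dangling_references(feed, refs))
# [('id:test:ad::1', 'other_campaign_ref', 'id:test:campaign::nextbest')]
```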

#### Performance Considerations

- **Global Documents:** Parent documents must be global, which means they are replicated to every content node and each write to a parent is applied on all nodes. Keep the number of parent documents well below the number of child documents to maintain performance.
- **Memory Usage:** Each content node holds all global documents plus its share of regular documents, so a large set of parents raises the memory requirement of every node.
- **Query Performance:** Reading an imported field goes through the reference field, which adds one memory indirection; the effect on query performance is small.

**Important:** Avoid cyclic or self-references to prevent issues within the document hierarchy.
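
The memory point can be made concrete with a rough count of documents per node. This is back-of-the-envelope arithmetic under stated assumptions: regular documents are spread evenly, `copies` stands for how many copies of each regular document the cluster keeps, and document counts are used as a stand-in for memory.

```python
def docs_per_node(global_docs, regular_docs, nodes, copies=1):
    """Estimate documents held by one content node.

    Every node holds all global documents plus an even share of the regular ones.
    """
    regular_share = regular_docs * copies / nodes
    return global_docs + regular_share


# Few parents: 1,000 advertisers and 10 million ads on 10 nodes, 2 copies
few_parents = docs_per_node(1_000, 10_000_000, nodes=10, copies=2)

# Many parents: 5 million global documents with the same child set
many_parents = docs_per_node(5_000_000, 10_000_000, nodes=10, copies=2)

print(few_parents)   # 2001000.0
print(many_parents)  # 7000000.0
```

Adding nodes shrinks the regular share but never the global part, which is why the parent set should stay small.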

#### Alternatives to Parent/Child Relationships

Instead of using parent/child relationships, you can use **multivalue fields** (arrays or maps of structs) to represent one-to-many relationships within a single document.

**Example:**

```yaml
struct address {
    field street type string {}
    field city type string {}
    field zip type string {}
}

field addresses type array<address> {
    indexing: summary
}
```

**Use Cases:**
- **Products with Multiple Properties:** Different sizes or colors.
- **Users with Multiple Contacts:** Multiple email addresses or phone numbers.

**Choosing Between Parent/Child and Multivalue Fields:**

- **Parent/Child:** Best for complex, hierarchical data where child documents need to reference shared parent attributes.
- **Multivalue Fields:** Suitable for simpler one-to-many relationships within a single document.

**Rule of Thumb:**
- **Use Parent/Child** when you need to reference shared attributes across many documents.
- **Use Multivalue Fields** for flexible structures with an unlimited set of properties per document.

---

### Annotations API in Vespa

The **Annotations API** in Vespa allows you to add metadata to specific parts of your text data. This feature is useful for scenarios where you need to label or structure parts of your text, such as marking up HTML content or adding semantic information for natural language processing.

#### Annotating Text with Spans

**Basic Concepts:**
- **Span:** A segment of text identified by a start index and length.
- **Span Tree:** A hierarchical structure of spans that represents the annotations over the text.

**Use Case: Adding Simple Labels to Text**

Imagine you want to add metadata to an HTML document to identify different parts like headers, titles, and body content.

**Example: Annotating HTML Content**

1. **Define Annotation Types in Schema:**

   ```yaml
   schema example {
       annotation text {}
       annotation markup {}
   }
   ```

2. **Annotate Text Using the API:**

   ```java
   StringFieldValue text = new StringFieldValue("<html><head><title>Diary</title></head><body>I live in San Francisco</body></html>");

   // Look up the annotation types declared in the schema
   AnnotationTypeRegistry atr = processing.getService().getDocumentTypeManager().getAnnotationTypeRegistry();
   AnnotationType textType = atr.getType("text");
   AnnotationType markup = atr.getType("markup");

   // Each Span is (start offset, length). Create every span once, so the
   // object that is annotated is the same object that sits in the tree.
   Span span1 = new Span(0, 19);   // "<html><head><title>"
   Span span2 = new Span(19, 5);   // "Diary"
   Span span3 = new Span(24, 21);  // "</title></head><body>"
   Span span4 = new Span(45, 23);  // "I live in San Francisco"
   Span span5 = new Span(68, 14);  // "</body></html>"

   SpanList root = new SpanList();
   root.add(span1)
       .add(span2)
       .add(span3)
       .add(span4)
       .add(span5);

   SpanTree tree = new SpanTree("html", root);

   // Label each span as markup or text
   tree.annotate(span1, markup);
   tree.annotate(span2, textType);
   tree.annotate(span3, markup);
   tree.annotate(span4, textType);
   tree.annotate(span5, markup);

   // Attach the finished tree to the field value
   text.setSpanTree(tree);
   ```

   **Explanation:**
   - **Span Definitions:** Each span marks a specific part of the HTML text.
   - **Annotations:** Each span is labeled either as `markup` or `text`, providing semantic information about that section.
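
The start and length values in the example are easy to get wrong when counted by hand. The following Python sketch (a checking aid, not part of the Vespa API) derives them from the string by splitting it into alternating runs of markup and text.

```python
import re


def markup_text_spans(html):
    """Return (start, length, kind) for alternating runs of markup and text."""
    spans = []
    for match in re.finditer(r"<[^>]+>|[^<]+", html):
        token = match.group()
        kind = "markup" if token.startswith("<") else "text"
        if spans and spans[-1][2] == kind:
            # Merge with the previous run of the same kind
            start, length, _ = spans[-1]
            spans[-1] = (start, length + len(token), kind)
        else:
            spans.append((match.start(), len(token), kind))
    return spans


html = "<html><head><title>Diary</title></head><body>I live in San Francisco</body></html>"
for span in markup_text_spans(html):
    print(span)
# (0, 19, 'markup')
# (19, 5, 'text')
# (24, 21, 'markup')
# (45, 23, 'text')
# (68, 14, 'markup')
```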

#### Building Annotation Trees

For more complex annotations, such as creating a structured tree that represents nested elements or multiple interpretations, you can build **annotation trees** using `SpanList` and `SpanTree`.

**Example: Creating a Structured Annotation Tree**

1. **Define Extended Annotation Types:**

   ```yaml
   schema example {
       annotation text {}
       annotation begintag {}
       annotation endtag {}
       annotation body {}
       annotation header {}
   }
   ```

2. **Annotate with Nested Structures:**

   ```java
   StringFieldValue text = new StringFieldValue("<html><head><title>Diary</title></head><body>I live in San Francisco</body></html>");

   SpanList root = new SpanList();
   SpanTree tree = new SpanTree("html", root);

   // atr is the AnnotationTypeRegistry obtained as in the previous example
   AnnotationType textType = atr.getType("text");
   AnnotationType beginTag = atr.getType("begintag");
   AnnotationType endTag = atr.getType("endtag");
   AnnotationType bodyType = atr.getType("body");
   AnnotationType headerType = atr.getType("header");

   // Leaf spans (start, length), created once so that the annotated
   // object is the same object that is added to the tree
   Span htmlOpen = new Span(0, 6);     // "<html>"
   Span headOpen = new Span(6, 6);     // "<head>"
   Span titleOpen = new Span(12, 7);   // "<title>"
   Span title = new Span(19, 5);       // "Diary"
   Span titleClose = new Span(24, 8);  // "</title>"
   Span headClose = new Span(32, 7);   // "</head>"
   Span bodyOpen = new Span(39, 6);    // "<body>"
   Span bodyText = new Span(45, 23);   // "I live in San Francisco"
   Span bodyClose = new Span(68, 7);   // "</body>"
   Span htmlClose = new Span(75, 7);   // "</html>"

   // Header subtree
   SpanList header = new SpanList();
   header.add(headOpen).add(titleOpen).add(title).add(titleClose).add(headClose);
   tree.annotate(header, headerType);
   tree.annotate(headOpen, beginTag);
   tree.annotate(titleOpen, beginTag);
   tree.annotate(title, textType);
   tree.annotate(titleClose, endTag);
   tree.annotate(headClose, endTag);

   // Body subtree
   SpanList body = new SpanList();
   body.add(bodyOpen).add(bodyText).add(bodyClose);
   tree.annotate(body, bodyType);
   tree.annotate(bodyOpen, beginTag);
   tree.annotate(bodyText, textType);
   tree.annotate(bodyClose, endTag);

   // Root: opening tag, header, body, closing tag
   root.add(htmlOpen).add(header).add(body).add(htmlClose);
   tree.annotate(htmlOpen, beginTag);
   tree.annotate(htmlClose, endTag);

   text.setSpanTree(tree);
   ```

   **Explanation:**
   - **Nested Spans:** The header and body sections are further broken down into smaller spans, each annotated appropriately.
   - **Structured Tree:** The `SpanTree` now represents a hierarchical structure of the HTML document, allowing for more nuanced queries and operations.
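
The nesting can be modelled in a few lines of Python to confirm that the leaf spans tile the string without gaps. The `Span` and `SpanList` classes below are plain Python stand-ins that borrow the Vespa names for readability; they are not the Java classes.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    start: int
    length: int

    @property
    def end(self):
        return self.start + self.length


@dataclass
class SpanList:
    children: list = field(default_factory=list)

    @property
    def start(self):
        return min(child.start for child in self.children)

    @property
    def end(self):
        return max(child.end for child in self.children)

    @property
    def length(self):
        return self.end - self.start


def leaves(node):
    """Yield the leaf spans of a tree in document order."""
    if isinstance(node, Span):
        yield node
    else:
        for child in node.children:
            yield from leaves(child)


html = "<html><head><title>Diary</title></head><body>I live in San Francisco</body></html>"
header = SpanList([Span(6, 6), Span(12, 7), Span(19, 5), Span(24, 8), Span(32, 7)])
body = SpanList([Span(39, 6), Span(45, 23), Span(68, 7)])
root = SpanList([Span(0, 6), header, body, Span(75, 7)])

pieces = [html[s.start:s.end] for s in leaves(root)]
print(pieces)
print((header.start, header.length), (body.start, body.length))  # (6, 33) (39, 36)
```

Joining `pieces` reproduces the original string, which shows that the offsets in the example cover the text exactly once.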

#### Adding Annotations with Values

Sometimes, you need to attach more information to a span beyond simple labels. For example, annotating a city name with its geographical coordinates.

1. **Extend Schema with Value Fields:**

   ```yaml
   schema example {
       annotation text {}
       annotation begintag {}
       annotation endtag {}
       annotation body {}
       annotation header {}
       annotation city {
           field latitude type double {}
           field longitude type double {}
       }
       struct position {
           field latitude type double {}
           field longitude type double {}
       }
   }
   ```

2. **Annotate with Values:**

   ```java
   StringFieldValue text = new StringFieldValue("<html><head><title>Diary</title></head><body>I live in San Francisco</body></html>");

   SpanList root = new SpanList();
   SpanTree tree = new SpanTree("html", root);

   // atr is the AnnotationTypeRegistry obtained as in the first example
   AnnotationType textType = atr.getType("text");
   AnnotationType beginTag = atr.getType("begintag");
   AnnotationType endTag = atr.getType("endtag");
   AnnotationType bodyType = atr.getType("body");
   AnnotationType headerType = atr.getType("header");
   AnnotationType cityType = atr.getType("city");

   // Value carried by the city annotation: a struct with latitude and longitude
   Struct position = (Struct) cityType.getDataType().createFieldValue();
   position.setValue("latitude", 37.774929);
   position.setValue("longitude", -122.419415);
   Annotation city = new Annotation(cityType, position);

   // Leaf spans (start, length), created once and reused below
   Span htmlOpen = new Span(0, 6);        // "<html>"
   Span headOpen = new Span(6, 6);        // "<head>"
   Span titleOpen = new Span(12, 7);      // "<title>"
   Span title = new Span(19, 5);          // "Diary"
   Span titleClose = new Span(24, 8);     // "</title>"
   Span headClose = new Span(32, 7);      // "</head>"
   Span bodyOpen = new Span(39, 6);       // "<body>"
   Span iLiveIn = new Span(45, 10);       // "I live in "
   Span sanFrancisco = new Span(55, 13);  // "San Francisco"
   Span bodyClose = new Span(68, 7);      // "</body>"
   Span htmlClose = new Span(75, 7);      // "</html>"

   // Header annotations
   SpanList header = new SpanList();
   header.add(headOpen).add(titleOpen).add(title).add(titleClose).add(headClose);
   tree.annotate(header, headerType);
   tree.annotate(headOpen, beginTag);
   tree.annotate(titleOpen, beginTag);
   tree.annotate(title, textType);
   tree.annotate(titleClose, endTag);
   tree.annotate(headClose, endTag);

   // Text node: the city annotation goes on "San Francisco" only
   SpanList textNode = new SpanList();
   textNode.add(iLiveIn).add(sanFrancisco);
   tree.annotate(sanFrancisco, city);
   tree.annotate(textNode, textType);

   // Body annotations
   SpanList body = new SpanList();
   body.add(bodyOpen).add(textNode).add(bodyClose);
   tree.annotate(body, bodyType);
   tree.annotate(bodyOpen, beginTag);
   tree.annotate(bodyClose, endTag);

   // Root annotations
   root.add(htmlOpen).add(header).add(body).add(htmlClose);
   tree.annotate(htmlOpen, beginTag);
   tree.annotate(htmlClose, endTag);

   text.setSpanTree(tree);
   ```

   **Explanation:**
   - **Structs in Annotations:** The `city` annotation carries a struct value holding `latitude` and `longitude`; in the code this value is the `position` variable.
   - **Annotating with Values:** The span covering "San Francisco" carries the `city` annotation with its coordinates, while the enclosing text node carries the `text` label.
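
The span for "San Francisco" does not have to be counted by hand. As a small Python sketch (a checking aid only, with the annotation represented as a plain dictionary), the offset can be looked up in the string and paired with the value:

```python
html = "<html><head><title>Diary</title></head><body>I live in San Francisco</body></html>"

city_name = "San Francisco"
start = html.index(city_name)        # offset of the first occurrence
span = (start, len(city_name))

# Plain-dictionary stand-in for an annotation that carries a value
city_annotation = {
    "type": "city",
    "span": span,
    "value": {"latitude": 37.774929, "longitude": -122.419415},
}

print(span)  # (55, 13)
print(html[span[0]:span[0] + span[1]])  # San Francisco
```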

#### Working with Alternate Span Trees

Sometimes, a single interpretation of the text isn't sufficient, especially in natural language processing where multiple interpretations exist.

**Example: Multiple Interpretations of a Sentence**

Consider the sentence: "I saw the girl with the telescope."

1. **Define Annotation Types for Multiple Interpretations:**

   ```yaml
   schema example {
       annotation text {}
       annotation begintag {}
       annotation endtag {}
       annotation body {}
       annotation header {}
       annotation city {
           field latitude type double {}
           field longitude type double {}
       }
   }
   ```

2. **Build an Alternate Span Tree:**

   ```java
   StringFieldValue text = new StringFieldValue("<body><p>I saw the girl with the telescope</p></body>");

   SpanList root = new SpanList();
   SpanTree tree = new SpanTree("html", root);

   // atr is the AnnotationTypeRegistry obtained as in the first example
   AnnotationType textType = atr.getType("text");
   AnnotationType beginTag = atr.getType("begintag");
   AnnotationType endTag = atr.getType("endtag");
   AnnotationType bodyType = atr.getType("body");

   // Markup spans (start, length)
   Span bodyOpen = new Span(0, 6);    // "<body>"
   Span pOpen = new Span(6, 3);       // "<p>"
   Span pClose = new Span(42, 4);     // "</p>"
   Span bodyClose = new Span(46, 7);  // "</body>"

   // Interpretation 0: "with the telescope" describes how I saw
   Span iSawTheGirl = new Span(9, 14);         // "I saw the girl"
   Span withTheTelescope = new Span(24, 18);   // "with the telescope"

   // Interpretation 1: "with the telescope" describes the girl
   Span iSaw = new Span(9, 5);                 // "I saw"
   Span girlWithTelescope = new Span(15, 27);  // "the girl with the telescope"

   // AlternateSpanList keeps several subtrees over the same text.
   // Check these method signatures against the Javadoc for your Vespa version.
   AlternateSpanList sentence = new AlternateSpanList();
   sentence.add(iSawTheGirl);                   // subtree 0
   sentence.add(withTheTelescope);
   List<SpanNode> alternative = new ArrayList<>();
   alternative.add(iSaw);
   alternative.add(girlWithTelescope);
   sentence.addChildren(1, alternative, 0.4);   // subtree 1, illustrative probability
   sentence.setProbability(0, 0.6);             // illustrative probability

   tree.annotate(iSawTheGirl, textType);
   tree.annotate(withTheTelescope, textType);
   tree.annotate(iSaw, textType);
   tree.annotate(girlWithTelescope, textType);

   // Paragraph and root
   SpanList paragraph = new SpanList();
   paragraph.add(pOpen).add(sentence).add(pClose);
   tree.annotate(paragraph, bodyType);
   tree.annotate(pOpen, beginTag);
   tree.annotate(pClose, endTag);

   root.add(bodyOpen).add(paragraph).add(bodyClose);
   tree.annotate(bodyOpen, beginTag);
   tree.annotate(bodyClose, endTag);

   text.setSpanTree(tree);
   ```

   **Explanation:**
   - **Alternate Interpretations:** An `AlternateSpanList` holds several subtrees over the same stretch of text, each with a probability. Here subtree 0 attaches "with the telescope" to the verb, and subtree 1 attaches it to "the girl". The probabilities are made up for the example.
   - **Annotation References:** Annotations can also reference other annotations to build a more complex graph, enabling richer metadata and relationships; that is not shown in this example.
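
The idea of keeping two readings of the same sentence side by side can be sketched in plain Python, independent of the Vespa classes. The probabilities are invented for illustration, and the offsets here are relative to the bare sentence rather than to the HTML string.

```python
sentence = "I saw the girl with the telescope"


def spans_for(phrases, text):
    """Locate consecutive phrases in text and return (start, length) pairs."""
    spans, cursor = [], 0
    for phrase in phrases:
        start = text.index(phrase, cursor)
        spans.append((start, len(phrase)))
        cursor = start + len(phrase)
    return spans


alternatives = [
    # "with the telescope" describes how I saw
    {"probability": 0.6, "phrases": ["I saw the girl", "with the telescope"]},
    # "with the telescope" describes the girl
    {"probability": 0.4, "phrases": ["I saw", "the girl with the telescope"]},
]
for alt in alternatives:
    alt["spans"] = spans_for(alt["phrases"], sentence)

best = max(alternatives, key=lambda alt: alt["probability"])

for alt in alternatives:
    print(alt["probability"], alt["spans"])
# 0.6 [(0, 14), (15, 18)]
# 0.4 [(0, 5), (6, 27)]
```

Both readings cover the same characters; only the segmentation differs, which is the situation alternate subtrees are meant to capture.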

#### Manipulating Span Trees

Often, you'll need to modify existing span trees, such as updating annotations or restructuring the tree based on new information.

**Example: Removing Specific Annotations**

Suppose you want to remove all `markup` annotations from a span tree.

1. **Iterate and Remove Annotations:**

   ```java
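   // `text` is assumed to be the StringFieldValue annotated in the first example,
   // and `atr` the AnnotationTypeRegistry used there. A freshly constructed
   // StringFieldValue has no span tree, so getSpanTree("html") would return null.
   SpanTree tree = text.getSpanTree("html");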
   SpanList root = (SpanList) tree.getRoot();

   ListIterator<SpanNode> nodeIt = root.childIterator();

   AnnotationType markup = atr.getType("markup");

   while (nodeIt.hasNext()) {
       SpanNode node = nodeIt.next();
       Iterator<Annotation> annotationIt = tree.iterator(node);
       
       boolean hasMarkup = false;
       while (annotationIt.hasNext()) {
           Annotation annotation = annotationIt.next();
           if (annotation.getType().equals(markup)) {
               annotationIt.remove();
               hasMarkup = true;
           }
       }
       
       if (hasMarkup) {
           nodeIt.remove();
           // Optionally, add replacement spans or annotations
       }
   }
   ```

   **Explanation:**
   - **Iterating Over SpanNodes:** Traverse each node in the span tree.
   - **Removing Annotations:** Identify and remove annotations of the `markup` type.
   - **Updating the Tree:** Optionally, modify the tree structure after removing annotations.
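
The same filtering logic can be sketched in Python over a flat list of `(span, type)` pairs. This is a simplified model rather than the Vespa API; it builds a new list instead of removing items while iterating.

```python
html = "<html><head><title>Diary</title></head><body>I live in San Francisco</body></html>"

# (start, length) spans with their annotation type, as in the first example
annotations = [
    ((0, 19), "markup"),
    ((19, 5), "text"),
    ((24, 21), "markup"),
    ((45, 23), "text"),
    ((68, 14), "markup"),
]


def without_type(annotations, unwanted):
    """Return the annotations whose type differs from `unwanted`."""
    return [(span, kind) for span, kind in annotations if kind != unwanted]


kept = without_type(annotations, "markup")
remaining_text = [html[start:start + length] for (start, length), _ in kept]

print(kept)            # [((19, 5), 'text'), ((45, 23), 'text')]
print(remaining_text)  # ['Diary', 'I live in San Francisco']
```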

#### Inheritance of Annotations

Annotations can inherit from one another, allowing you to extend existing annotations with additional information.

**Example: Extending an Annotation Type**

1. **Define Base Annotation (`person`):**

   ```yaml
   schema example {
       annotation person {
           field birthdate type int {}
           field firstname type string {}
           field lastname type string {}
       }
   }
   ```

2. **Extend Annotation (`employee`):**

   ```yaml
   schema example2 {
       annotation employee inherits person {
           field employeeid type int {}
       }
   }
   ```

   **Explanation:**
   - The `employee` annotation inherits all fields from `person` and adds an `employeeid` field.
   - This allows `employee` to be used wherever `person` is applicable, promoting reusability and consistency.
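
How the inherited fields combine can be sketched with a small lookup table. The dictionary layout and helper names below are illustrative only; they model the two annotation definitions above and are not a Vespa API.

```python
annotation_types = {
    "person": {
        "inherits": None,
        "fields": {"birthdate": "int", "firstname": "string", "lastname": "string"},
    },
    "employee": {
        "inherits": "person",
        "fields": {"employeeid": "int"},
    },
}


def all_fields(name):
    """Collect the fields of an annotation type, including inherited ones."""
    spec = annotation_types[name]
    inherited = all_fields(spec["inherits"]) if spec["inherits"] else {}
    return {**inherited, **spec["fields"]}


def is_a(name, ancestor):
    """True if `name` is `ancestor` or inherits from it, directly or not."""
    while name is not None:
        if name == ancestor:
            return True
        name = annotation_types[name]["inherits"]
    return False


print(sorted(all_fields("employee")))
# ['birthdate', 'employeeid', 'firstname', 'lastname']
print(is_a("employee", "person"), is_a("person", "employee"))  # True False
```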

---

### Summary

- **Schemas** define the structure and behavior of your documents, including fields, fieldsets, and rank profiles.
- **Parent/Child Relationships** enable hierarchical data modeling, simplifying operations and enhancing search capabilities without data duplication.
- **Annotations API** allows you to add metadata to specific parts of your text, supporting complex data structures and multiple interpretations.
- **Inheritance** in schemas and annotations promotes reusability and consistency, reducing duplication and simplifying schema management.
- **Performance Considerations** are crucial when designing schemas and relationships to ensure efficient storage and query performance.
- **Best Practices:**
  - Use clear and descriptive field names.
  - Leverage inheritance to minimize duplication.
  - Choose between parent/child relationships and multivalue fields based on your data complexity.
  - Optimize field indexing and match modes for your search requirements.
  - Regularly test and validate your schemas and annotations to maintain data integrity and performance.


# Create application based on CORD19 sample data

## First Step
Create Vespa ApplicationPackage

In [11]:
from vespa.package import ApplicationPackage

app_package = ApplicationPackage(name="cord19") 

## Fields and Schema
In Vespa, a field is a fundamental unit of data within a document. Think of a document as a record or an item in your database, and each field as a specific attribute or piece of information about that document.

