# How to connect with running Vespa instances

> Connect and interact with CORD-19 search app.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vespa-engine/pyvespa/blob/master/docs/sphinx/source/connect-to-vespa-instance.ipynb)

This self-contained tutorial shows how to connect to an existing Vespa instance, using the https://cord19.vespa.ai/ app as an example. You can run it yourself in Google Colab by clicking the badge at the top of the page.

## Install

The library is available on PyPI and can therefore be installed with `pip`.

In [None]:
!pip install pyvespa

## Connect to a running Vespa application

We can connect to a running Vespa application by creating an instance of [Vespa](reference-api.rst#vespa.application.Vespa) with the appropriate URL. The resulting `app` is then used to communicate with the application.

In [2]:
from vespa.application import Vespa

app = Vespa(url = "https://api.cord19.vespa.ai")

## Define a Query model

> Easily define matching and ranking criteria

When building a search application, we usually want to experiment with different query models. A [Query](reference-api.rst#vespa.query.Query) model consists of a match phase and a ranking phase. The match phase defines how to match documents based on the query sent, and the ranking phase defines how to rank the matched documents. Both phases can get quite complex, so being able to express and experiment with them easily is very valuable.

In the example below we define the match phase to be the [Union](reference-api.rst#vespa.query.Union) of the [WeakAnd](reference-api.rst#vespa.query.WeakAnd) and the [ANN](reference-api.rst#vespa.query.ANN) operators. The `WeakAnd` will match documents based on query terms while the Approximate Nearest Neighbor (`ANN`) operator will match documents based on the distance between the query and document embeddings. This is an illustration of how easy it is to combine term and semantic matching in Vespa.  

In [3]:
from vespa.query import Union, WeakAnd, ANN
from random import random

match_phase = Union(
    WeakAnd(hits = 10), 
    ANN(
        doc_vector="title_embedding", 
        query_vector="title_vector", 
        # Placeholder model: ignores the query text and returns 768 random floats.
        embedding_model=lambda x: [random() for _ in range(768)],
        hits = 10,
        label="title"
    )
)
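Conceptually, the union keeps any document that either operator matches. The sketch below illustrates that idea in plain Python on a toy corpus. It is not how Vespa implements `WeakAnd` or `ANN`: the term matcher is a rough stand-in, the nearest-neighbor search is exact rather than approximate, and the documents, two-dimensional embeddings, and helper names are all made up for illustration.

```python
import math

# Toy corpus (hypothetical): doc id -> (title, 2-d embedding).
docs = {
    1: ("remdesivir trial results", [0.9, 0.1]),
    2: ("antiviral drug efficacy", [0.8, 0.2]),
    3: ("hospital staffing levels", [0.1, 0.9]),
}


def term_match(query, docs):
    """Ids of documents whose title shares at least one term with the query."""
    terms = set(query.lower().split())
    return {
        doc_id
        for doc_id, (title, _) in docs.items()
        if terms & set(title.lower().split())
    }


def nearest_neighbors(query_vector, docs, hits):
    """Ids of the `hits` documents closest to the query embedding (exact search)."""
    by_distance = sorted(
        docs, key=lambda doc_id: math.dist(query_vector, docs[doc_id][1])
    )
    return set(by_distance[:hits])


term_hits = term_match("remdesivir treatment", docs)
semantic_hits = nearest_neighbors([0.9, 0.1], docs, hits=2)

# The union contains documents found by either strategy.
matched = term_hits | semantic_hits
print(sorted(matched))
```

Document 2 shares no term with the query, yet it is still matched because its embedding lies close to the query embedding.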

We then define the ranking to be done by the `bm25` rank-profile that is already defined in the application package. We set `list_features=True` to be able to collect ranking-features later in this tutorial. After defining the `match_phase` and the `rank_profile` we can instantiate the `Query` model.

In [4]:
from vespa.query import Query, RankProfile

rank_profile = RankProfile(name="bm25", list_features=True)

query_model = Query(match_phase=match_phase, rank_profile=rank_profile)

## Query the Vespa app

> Send queries via the query API. See the [query page](query.ipynb) for more examples.

We can use the `query_model` that we just defined to issue queries to the application via the `query` method.

In [5]:
query_result = app.query(
    query="Is remdesivir an effective treatment for COVID-19?", 
    query_model=query_model
)

We can see the number of documents that were retrieved by Vespa:

In [6]:
query_result.number_documents_retrieved

1046

And the number of documents that were returned to us:

In [7]:
len(query_result.hits)

10
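The two numbers differ because Vespa reports how many documents matched the query but returns only the top-ranked hits. In the JSON produced by Vespa's query API, the matched count sits under `root.fields.totalCount` and the returned hits under `root.children`. The sketch below walks a hand-written response of that shape; the ids, relevance values, and titles are made up, and the idea that `number_documents_retrieved` and `hits` read exactly these fields is an assumption here.

```python
# Hand-written stand-in for a Vespa query response (all values are made up).
response = {
    "root": {
        "fields": {"totalCount": 3},
        "children": [
            {"id": "doc-a", "relevance": 0.42, "fields": {"title": "First title"}},
            {"id": "doc-b", "relevance": 0.17, "fields": {"title": "Second title"}},
        ],
    }
}

# Number of documents matched by the query.
total_matched = response["root"]["fields"]["totalCount"]

# Hits actually returned; "children" can be absent when nothing matches.
hits = response["root"].get("children", [])

print(total_matched, len(hits))
for hit in hits:
    print(hit["id"], hit["relevance"], hit["fields"]["title"])
```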

## Labelled data

> How to structure labelled data

We often need to either evaluate query models or collect data to improve query models through ML. In both cases we usually need labelled data. Let's create some labelled data to illustrate the expected format and its usage in the library.

Each data point contains a `query_id`, a `query` and `relevant_docs` associated with the query.

In [8]:
labelled_data = [
    {
        "query_id": 0, 
        "query": "Intrauterine virus infections and congenital heart disease",
        "relevant_docs": [{"id": 0, "score": 1}, {"id": 3, "score": 1}]
    },
    {
        "query_id": 1, 
        "query": "Clinical and immunologic studies in identical twins discordant for systemic lupus erythematosus",
        "relevant_docs": [{"id": 1, "score": 1}, {"id": 5, "score": 1}]
    }
]

Non-relevant documents are assigned `"score": 0` by default. Relevant documents will be assigned `"score": 1` by default if the field is missing from the labelled data. The defaults for both relevant and non-relevant documents can be modified on the appropriate methods.
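To make the default for relevant documents concrete, here is a minimal sketch of filling in a missing `"score"`. The helper `with_default_scores` is hypothetical and not part of pyvespa; it only mirrors the behaviour described above.

```python
def with_default_scores(labelled_data, default_score=1):
    """Return a copy in which every relevant doc carries an explicit "score"."""
    return [
        {
            **point,
            # An existing "score" in `doc` overrides the default.
            "relevant_docs": [
                {"score": default_score, **doc} for doc in point["relevant_docs"]
            ],
        }
        for point in labelled_data
    ]


example = [
    {
        "query_id": 0,
        "query": "some query",
        "relevant_docs": [{"id": 0}, {"id": 3, "score": 2}],
    }
]
filled = with_default_scores(example)
print(filled[0]["relevant_docs"])
```

The input list is left untouched, so the same labelled data can be reused with a different default.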

## Collect training data

> Collect training data to analyse and/or improve ranking functions. See the [collect training data page](collect-training-data.ipynb) for more examples.

We can collect training data with the [collect_training_data](reference-api.rst#vespa.application.Vespa.collect_training_data) method according to a specific [Query](reference-api.rst#vespa.query.Query) model. Below we will collect two documents for each query in addition to the relevant ones.

In [9]:
training_data_batch = app.collect_training_data(
    labelled_data = labelled_data,
    id_field = "id",
    query_model = query_model,
    number_additional_docs = 2
)
training_data_batch

| | attributeMatch(authors.first) | attributeMatch(authors.first).averageWeight | attributeMatch(authors.first).completeness | attributeMatch(authors.first).fieldCompleteness | attributeMatch(authors.first).importance | attributeMatch(authors.first).matches | attributeMatch(authors.first).maxWeight | attributeMatch(authors.first).normalizedWeight | attributeMatch(authors.first).normalizedWeightedWeight | attributeMatch(authors.first).queryCompleteness | ... | textSimilarity(results).queryCoverage | textSimilarity(results).score | textSimilarity(title).fieldCoverage | textSimilarity(title).order | textSimilarity(title).proximity | textSimilarity(title).queryCoverage | textSimilarity(title).score | document_id | query_id | relevant |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0625 | 0.0 | 0.0 | 0.142857 | 0.055357 | 0 | 0 | 1 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 213690 | 0 | 0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.285714 | 0.666667 | 0.739583 | 0.571429 | 0.587426 | 225739 | 0 | 0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.142857 | 0.0 | 0.4375 | 0.142857 | 0.224554 | 3 | 0 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 213690 | 0 | 0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.285714 | 0.666667 | 0.739583 | 0.571429 | 0.587426 | 225739 | 0 | 0 |
| 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.111111 | 0.0 | 0.0 | 0.083333 | 0.047222 | 1 | 1 | 1 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 176163 | 1 | 0 |
| 8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.1875 | 1.0 | 1.0 | 0.25 | 0.6125 | 13597 | 1 | 0 |
| 9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.083333 | 0.0 | 0.0 | 0.083333 | 0.041667 | 5 | 1 | 1 |


## Evaluating a query model

> Define metrics and evaluate query models. See the [evaluation page](evaluation.ipynb) for more examples.

We will define the following evaluation metrics:
* % of documents retrieved per query
* recall @ 10 per query
* MRR @ 10 per query

In [10]:
from vespa.evaluation import MatchRatio, Recall, ReciprocalRank

eval_metrics = [MatchRatio(), Recall(at=10), ReciprocalRank(at=10)]
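For reference, recall @ k and reciprocal rank @ k can be sketched in plain Python using their standard textbook definitions. The function names and the toy ranking are illustrative assumptions, and pyvespa's own implementation may differ in details. MRR is the mean of the per-query reciprocal ranks.

```python
def recall_at_k(relevant_ids, ranked_ids, k):
    """Fraction of the relevant documents found among the first k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


def reciprocal_rank_at_k(relevant_ids, ranked_ids, k):
    """1 / position of the first relevant document in the top k, else 0."""
    relevant = set(relevant_ids)
    for position, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


# Toy ranking (made up): document ids in the order they were returned.
ranked = [7, 3, 9, 0, 4]
relevant = [0, 3]

print(recall_at_k(relevant, ranked, k=3))           # one of two relevant docs in top 3
print(reciprocal_rank_at_k(relevant, ranked, k=3))  # first relevant doc at position 2
```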

Evaluate:

In [11]:
evaluation = app.evaluate(
    labelled_data = labelled_data,
    eval_metrics = eval_metrics, 
    query_model = query_model, 
    id_field = "id",
)
evaluation

| | query_id | match_ratio_retrieved_docs | match_ratio_docs_available | match_ratio_value | recall_10_value | reciprocal_rank_10_value |
|---|---|---|---|---|---|---|
| 0 | 0 | 1254 | 233281 | 0.005375 | 0.0 | 0 |
| 1 | 1 | 1003 | 233281 | 0.0043 | 0.0 | 0 |
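The `match_ratio_value` column appears to be the number of retrieved documents divided by the number of documents available. The quick check below reproduces the reported values from the other two columns; the dictionary keys are shortened names chosen for this sketch.

```python
# Counts copied from the evaluation output above.
rows = [
    {"query_id": 0, "retrieved": 1254, "available": 233281},
    {"query_id": 1, "retrieved": 1003, "available": 233281},
]

# Match ratio: share of the corpus matched by each query.
match_ratios = {
    row["query_id"]: row["retrieved"] / row["available"] for row in rows
}

for query_id, ratio in match_ratios.items():
    print(query_id, round(ratio, 6))
```

A recall and reciprocal rank of zero for both queries indicate that none of the labelled relevant documents appeared among the top 10 hits for this query model.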
