# Access a Term Vector


A term vector holds information and statistics about the terms in the fields of a particular document. Elasticsearch returns term vectors from the index when they are stored there, and generates them on the fly when they are not.


 
## Getting Started

In this example, we will use the Elasticsearch Python API. First, we import and set up the Python modules and variables used later on. If you prefer `curl` to the Python API, the equivalent command is given in a comment above each API request.

In [1]:
from elasticsearch import Elasticsearch
import pandas as pd
es = Elasticsearch(hosts=['localhost'], port=9200)

Let's examine what a document in this index looks like. (This operation may take a few seconds.)

In [3]:
# This query matches every document in the index
# (by default, the search API returns only the first 10 hits).
query = {
    'query': {
        'match_all': {}
    }
}

# Send a search request to Elasticsearch.
# curl -X GET localhost:9200/goma/_search -H 'Content-Type: application/json' -d @query.json
res = es.search(index='goma', body=query)

# The response is a JSON object; the list of hits is nested inside it.
# Here we are accessing the first hit in the listing.
res['hits']['hits'][0]

{'_index': 'goma',
 '_type': 'doc',
 '_id': 'QSlkSGgBOPedV1qMWTBT',
 '_score': 1.0,
 '_source': {'qagoma_events': [{'id': '118628',
    'title': 'Australian Art Collection',
    'description': 'An exciting reimagining of the Australian Art Collection has recently opened. Our curators, along with Director Chris Saines, have taken this rare opportunity to re-present the Gallery&rsquo;s Australian art holdings, collected for more than 120 years, in new and innovative ways.',
    'start_time': '2017-09-30 10:00:00',
    'end_time': '2028-12-31 17:00:00',
    'stop_date': '2028-12-31',
    'thumbnail': 'https://www.qagoma.qld.gov.au/__data/assets/image/0005/118733/APP_sleeping-bride.jpg',
    'location': 'QAG: Gallery 10; Gallery 11; Gallery 12; Gallery 13; Josephine Ulrick &amp; Win Schubert Galleries',
    'entry': 'Free',
    'available': 'Yes',
    'category': '',
    'link': 'https://www.qagoma.qld.gov.au/whats-on/exhibitions/australian-collection',
    'sessions': [{'session_count': '
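The hit above nests each event record in a list under `qagoma_events`, and its text still carries HTML entities such as `&amp;`. As a small sketch, assuming that layout holds for every hit, a field can be pulled out and decoded with the standard library. The `event_field` helper is hypothetical, and `sample_hit` is a hand-trimmed copy of the output above rather than a live response.

```python
import html

# Hand-trimmed copy of the hit shown above (most fields omitted).
sample_hit = {
    '_id': 'QSlkSGgBOPedV1qMWTBT',
    '_source': {'qagoma_events': [{
        'title': 'Australian Art Collection',
        'location': 'QAG: Gallery 10; Gallery 11; Gallery 12; Gallery 13; '
                    'Josephine Ulrick &amp; Win Schubert Galleries',
    }]},
}

def event_field(hit, field, position=0):
    """Return one field of the event at `position`, with HTML entities decoded."""
    events = hit['_source'].get('qagoma_events', [])
    if position >= len(events):
        return None  # no event at that position
    value = events[position].get(field)
    return html.unescape(value) if isinstance(value, str) else value

print(event_field(sample_hit, 'location'))
```

With a live response, the same helper could be applied to `res['hits']['hits'][0]`.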

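Using the term vector API, let's investigate the term vector for the `description` field of a document. Note that the ID used below (`AV19Sgi4jk6MoKTLfifp`) and the type `event` differ from the `_id` and `_type` of the hit shown above, so substitute values that exist in your own index; when no document matches, the response contains no `term_vectors` key. In the call to the `termvectors` method, we explicitly request the term statistics with `term_statistics=True`.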

In [7]:
# The URL is quoted so that the shell does not interpret the `&`.
# curl -X GET 'localhost:9200/goma/event/AV19Sgi4jk6MoKTLfifp/_termvectors?term_statistics=true&fields=description'
res = es.termvectors(index='goma', doc_type='event', id='AV19Sgi4jk6MoKTLfifp',
                     fields=['description'], term_statistics=True)

# The rest of the response is metadata; keep only the term vector of the description field.
tv = res['term_vectors']['description']
tv

KeyError: 'term_vectors'
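A `KeyError` like the one above typically means the document was not found: in that case the response carries `'found': False` and no `term_vectors` key. A minimal sketch of a guarded lookup follows. `extract_term_vector` is a hypothetical helper, and the two response dictionaries are hand-written stand-ins, not real API output.

```python
def extract_term_vector(response, field):
    """Return the term vector for `field`, or None if it is unavailable."""
    if not response.get('found', False):
        return None  # the document ID (or type) did not match anything
    return response.get('term_vectors', {}).get(field)

# Hand-written stand-ins for API responses (not real output).
missing = {'_index': 'goma', '_id': 'no-such-id', 'found': False}
present = {
    '_index': 'goma', '_id': 'some-id', 'found': True,
    'term_vectors': {'description': {'field_statistics': {}, 'terms': {}}},
}

print(extract_term_vector(missing, 'description'))  # None
print(extract_term_vector(present, 'description'))
```

Calling `extract_term_vector(res, 'description')` in place of the direct lookup would return `None` rather than raise when the ID does not exist.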

When the request succeeds, the response is a big JSON object, so let's break it down into some digestible tables. First, let's take a look at the field statistics.

 - `doc_count`: document count (how many documents contain this field)
 - `sum_doc_freq`: sum of document frequencies (the sum of document frequencies for all terms in this field)
 - `sum_ttf`: sum of total term frequencies (the sum of total term frequencies of each term in this field)

In [8]:
pd.DataFrame(tv['field_statistics'], index=['count'])

NameError: name 'tv' is not defined
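The three field statistics can be combined into per-document averages, which are sometimes easier to read than the raw sums. The sketch below follows directly from the definitions listed above: `sum_ttf / doc_count` is the average number of tokens per document in the field, and `sum_doc_freq / doc_count` is the average number of distinct terms per document. The `field_averages` helper is hypothetical and the numbers are made up for illustration.

```python
def field_averages(field_statistics):
    """Derive per-document averages from a `field_statistics` dictionary."""
    docs = field_statistics['doc_count']
    return {
        'avg_tokens_per_doc': field_statistics['sum_ttf'] / docs,
        'avg_distinct_terms_per_doc': field_statistics['sum_doc_freq'] / docs,
    }

# Made-up numbers for illustration only.
example_stats = {'doc_count': 30, 'sum_doc_freq': 1200, 'sum_ttf': 1500}
print(field_averages(example_stats))
# {'avg_tokens_per_doc': 50.0, 'avg_distinct_terms_per_doc': 40.0}
```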

More importantly, we can also see the term statistics for each term in the document. These tables omit the `tokens` field, but it can be used to extract the location of each occurrence of a term in the document.

 - `term_freq`: term frequency in the field
 - `doc_freq`: document frequency (the number of documents containing the current term)
 - `ttf`: total term frequency (how often the current term occurs across all documents)

In [5]:
# Build one row per term, dropping the bulky `tokens` list.
terms = []
for term, stats in tv['terms'].items():
    term_info = stats.copy()
    term_info.pop('tokens', None)
    term_info['term'] = term
    terms.append(term_info)
df = pd.DataFrame(terms).set_index('term')
df[0:10]

| term | doc_freq | term_freq | ttf |
| --- | --- | --- | --- |
| 22 | 1 | 1 | 1 |
| a | 28 | 1 | 37 |
| aboriginal | 2 | 1 | 3 |
| absence | 1 | 1 | 1 |
| adornments | 1 | 1 | 1 |
| and | 23 | 5 | 73 |
| are | 3 | 1 | 3 |
| as | 9 | 1 | 12 |
| associated | 1 | 1 | 1 |
| banumbirr | 1 | 1 | 1 |
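The `tokens` entry that the table leaves out records where each occurrence of a term sits in the field. As a rough sketch, assuming positions and offsets were returned with the response, the locations can be listed as follows. `term_locations` is a hypothetical helper, and `example_tv` is a hand-written stand-in for the text "new works and new ideas", not output from the index.

```python
def term_locations(term_vector, term):
    """List (position, start_offset, end_offset) for each occurrence of `term`."""
    tokens = term_vector['terms'].get(term, {}).get('tokens', [])
    return [(t['position'], t['start_offset'], t['end_offset']) for t in tokens]

# Hand-written stand-in for the text "new works and new ideas".
example_tv = {'terms': {
    'new': {'term_freq': 2, 'tokens': [
        {'position': 0, 'start_offset': 0, 'end_offset': 3},
        {'position': 3, 'start_offset': 14, 'end_offset': 17}]},
    'works': {'term_freq': 1, 'tokens': [
        {'position': 1, 'start_offset': 4, 'end_offset': 9}]},
}}

print(term_locations(example_tv, 'new'))  # [(0, 0, 3), (3, 14, 17)]
```

Here `position` counts tokens, while the offsets are character indices into the original text.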


In [6]:
# Sorted by doc_freq
df.sort_values(by='doc_freq', ascending=False)[0:10]

| term | doc_freq | term_freq | ttf |
| --- | --- | --- | --- |
| a | 28 | 1 | 37 |
| the | 25 | 3 | 85 |
| of | 24 | 2 | 54 |
| and | 23 | 5 | 73 |
| for | 14 | 1 | 14 |
| to | 14 | 1 | 24 |
| from | 11 | 3 | 20 |
| as | 9 | 1 | 12 |
| with | 7 | 2 | 12 |
| works | 6 | 2 | 8 |


In [7]:
# Sorted by term_freq
df.sort_values(by='term_freq', ascending=False)[0:10]

| term | doc_freq | term_freq | ttf |
| --- | --- | --- | --- |
| and | 23 | 5 | 73 |
| the | 25 | 3 | 85 |
| from | 11 | 3 | 20 |
| works | 6 | 2 | 8 |
| of | 24 | 2 | 54 |
| cultures | 1 | 2 | 2 |
| with | 7 | 2 | 12 |
| lucent | 1 | 1 | 1 |
| majesty | 1 | 1 | 1 |
| metre | 1 | 1 | 1 |


In [8]:
# Sorted by ttf
df.sort_values(by='ttf', ascending=False)[0:10]

| term | doc_freq | term_freq | ttf |
| --- | --- | --- | --- |
| the | 25 | 3 | 85 |
| and | 23 | 5 | 73 |
| of | 24 | 2 | 54 |
| a | 28 | 1 | 37 |
| to | 14 | 1 | 24 |
| from | 11 | 3 | 20 |
| for | 14 | 1 | 14 |
| with | 7 | 2 | 12 |
| as | 9 | 1 | 12 |
| works | 6 | 2 | 8 |


#### Exercise 1

Repeat the exploration of the term vector for a document using the ClueWeb12 sample index you built in previous activities.

#### Exercise 2 -- advanced

Using the ClueWeb12 sample index, identify two documents that contain a query term of your choice (suggestion: after choosing a term, query the index to retrieve the top 2 documents that satisfy the query). Then compute the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between the two term vectors.
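One possible starting point for the similarity computation, offered as a sketch rather than a reference solution: represent each document as a dictionary mapping terms to weights, then divide the dot product by the product of the vector lengths. Using raw `term_freq` values as weights is an assumption made here for simplicity, and other weightings are equally valid. The helper names and the two toy vectors are hypothetical.

```python
import math

def term_freqs(term_vector):
    """Map each term of a term vector to its `term_freq`."""
    return {term: info['term_freq'] for term, info in term_vector['terms'].items()}

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    shared = set(vec_a) & set(vec_b)  # only shared terms contribute to the dot product
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Toy vectors for illustration only.
doc_a = {'art': 2, 'gallery': 1}
doc_b = {'art': 1, 'museum': 2}
print(round(cosine_similarity(doc_a, doc_b), 3))  # 0.4
```

With real data, `term_freqs` would be applied to the `description` entry (or whichever field you chose) of each `termvectors` response before calling `cosine_similarity`.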