# 0 List all indices

In [129]:
!curl -X GET "localhost:9200/_cat/indices?v"

health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   news   pUcnw5GGQAaeEr00fLQuSg   1   1      20417            0     35.2mb         35.2mb
yellow open   novels cUrnZJIzRKaFFh3AIiOfEw   1   1         33            0     18.3mb         18.3mb
yellow open   test   m_GdMZOqTxesTyykLXxeOA   1   1          3            0      4.2kb          4.2kb


# 1 Delete all indices

In [2]:
!curl -X DELETE "localhost:9200/*"

{"acknowledged":true}


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    21  100    21    0     0     21      0  0:00:01 --:--:--  0:00:01    42


# 2 Reloading the index

In [5]:
%run IndexFilesPreprocess.py --index news  --path 20_newsgroups/**/[0-9]* --token letter --filter lowercase asciifolding

Indexing 20417 files
Reading files ...


DELETE http://localhost:9200/news [status:404 request:0.055s]


Index settings= {'news': {'settings': {'index': {'number_of_shards': '1', 'provided_name': 'news', 'creation_date': '1539639196314', 'analysis': {'analyzer': {'default': {'filter': ['lowercase', 'asciifolding'], 'type': 'custom', 'tokenizer': 'letter'}}}, 'number_of_replicas': '1', 'uuid': 'Y7kUjQfeTxiPH-5w_TdfRw', 'version': {'created': '6040299'}}}}}
Indexing ...


What is the most frequent word in the English language?

"the". The word count over the `novels` index below confirms it.

In [1]:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
from elasticsearch.exceptions import NotFoundError

def count_words(index, alpha):
    """Return (term, frequency) pairs for every term in `index`.

    Sorted alphabetically if `alpha` is true, otherwise by ascending frequency.
    """
    lpal = []
    voc = {}

    try:
        client = Elasticsearch()

        # Iterate over every document in the index
        sc = scan(client, index=index, doc_type='document', query={"query" : {"match_all": {}}})

        for s in sc:
            # Fetch the term vector of the 'text' field and accumulate term frequencies
            tv = client.termvectors(index=index, doc_type='document', id=s['_id'], fields=['text'])
            if 'text' in tv['term_vectors']:
                terms = tv['term_vectors']['text']['terms']
                for t in terms:
                    voc[t] = voc.get(t, 0) + terms[t]['term_freq']
        for v in voc:
            lpal.append((v.encode("utf8", "ignore"), voc[v]))

        print('%s Words' % len(lpal))
    except NotFoundError:
        print('Index %s does not exist' % index)

    # Sort by term (tuple position 0) or by frequency (tuple position 1)
    lpal.sort(key=lambda x: x[0 if alpha else 1])

    return lpal

In [4]:
lpal = count_words("novels", False)

54365 Words


In [5]:
lpal[-1]

(b'the', 206706)

# 3 tf-idf and cos similarity

The implementation is in the code (`TFIDFViewer.py`, which is run in the cells of this section).

## 3.1 Experimenting
First we compare two different documents, then we compare a document with itself. The second similarity should be exactly 1.

In [70]:
%run TFIDFViewer.py --index novels --files novels/DickensAChristmasCarol.txt novels/DickensGreatExpectations.txt
%run TFIDFViewer.py --index novels --files novels/DickensAChristmasCarol.txt novels/DickensAChristmasCarol.txt

Similarity = 0.01715
Similarity = 1.00000


A simple test we created with three documents:

* d1: a b c d e
* d2: a c e f g
* d3: b g f e c

By hand, the similarity between d2 and d3 should be 2/3, about 0.667.
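The hand calculation can be reproduced with a few lines of Python. This is only a sketch: it assumes the weighting `(tf / max tf) * log2(N / df)` followed by normalization to unit length, which may differ from what `TFIDFViewer.py` actually does. The names `docs`, `tfidf_vector` and `cosine` are made up for this example. Terms that occur in all three documents (`c` and `e`) get an idf of 0 and drop out, so d2 and d3 share only `f` and `g`.

```python
import math
from collections import Counter

# The three toy documents from the text
docs = {
    "d1": "a b c d e".split(),
    "d2": "a c e f g".split(),
    "d3": "b g f e c".split(),
}

def tfidf_vector(name):
    """Unit-length tf-idf vector for one document, as a {term: weight} dict.

    Assumed weighting: (tf / max tf) * log2(N / df).
    """
    n_docs = len(docs)
    tf = Counter(docs[name])
    max_tf = max(tf.values())
    weights = {}
    for term, freq in tf.items():
        # Document frequency: number of documents containing the term
        df = sum(1 for words in docs.values() if term in words)
        weights[term] = (freq / max_tf) * math.log2(n_docs / df)
    norm = math.sqrt(sum(w * w for w in weights.values()))
    # Keep only terms with a non-zero weight
    return {t: w / norm for t, w in weights.items() if w > 0}

def cosine(v1, v2):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * v2[t] for t, w in v1.items() if t in v2)

v2, v3 = tfidf_vector("d2"), tfidf_vector("d3")
print(v2)
print("Similarity = %.5f" % cosine(v2, v3))
```

Each of the three surviving terms in a document gets weight 1/√3 ≈ 0.577, and the two shared terms contribute 1/3 each, giving 2/3.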

In [45]:
%run IndexFilesPreprocess.py --index test  --path mytest/* --token letter

Indexing 3 files
Reading files ...
Index settings= {'test': {'settings': {'index': {'number_of_shards': '1', 'provided_name': 'test', 'creation_date': '1539738959094', 'analysis': {'analyzer': {'default': {'filter': ['lowercase'], 'type': 'custom', 'tokenizer': 'letter'}}}, 'number_of_replicas': '1', 'uuid': 'm_GdMZOqTxesTyykLXxeOA', 'version': {'created': '6020499'}}}}}
Indexing ...


In [72]:
%run TFIDFViewer.py --index test --print --files mytest/file2.txt mytest/file3.txt

TFIDF FILE mytest/file2.txt
a 0.5773502691896258
f 0.5773502691896258
g 0.5773502691896258
---------------------
TFIDF FILE mytest/file3.txt
b 0.5773502691896258
f 0.5773502691896258
g 0.5773502691896258
---------------------
Similarity = 0.66667


## A final question, have you noticed that we are searching the documents using the path name? 

Yes. We use the path as the key to look up each document.

## Was the path tokenized by the index? 

No. We configure the `path` field so that it is not tokenized.

## What did we do differently when indexing the documents so we can look for an exact match in the path field?

We add a property to the index mapping that declares `path` as a `keyword` field, so the whole path is indexed as a single, untokenized term.
```python
    client.indices.put_mapping(doc_type='document', index=index, body= {
        "document" : {
            "properties": {
                "path": {
                    "type": "keyword",
                }
            }
        }
    })
```
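With `path` mapped as `keyword`, a document can be retrieved by an exact match on its path. The sketch below shows one way to do this with a `term` query; `exact_path_query` and `find_by_path` are hypothetical helpers written for this example and are not taken from the lab scripts, which may build the query differently.

```python
def exact_path_query(path):
    """Build a search body that matches a document by its exact path."""
    # A term query compares against the whole, unanalyzed value,
    # which is why the field needs the keyword type.
    return {"query": {"term": {"path": path}}}

def find_by_path(client, index, path):
    """Return the hits whose path field equals `path` (hypothetical helper).

    `client` is expected to be an elasticsearch.Elasticsearch instance.
    """
    response = client.search(index=index, body=exact_path_query(path))
    return response["hits"]["hits"]

print(exact_path_query("novels/DickensAChristmasCarol.txt"))
```

If `path` were an analyzed `text` field instead, the letter tokenizer would split it into pieces such as `novels` and `txt`, and a term query on the full path would not match.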

# 4 Document relevance

In [29]:
%run SearchIndexWeight.py --index news --nhits 5 --query toronto nyc

['toronto', 'nyc']
ID= zJrwd2YBIcpsWJdpM8ce SCORE=9.122151
PATH= 20_newsgroups/alt.atheism/0000574
TEXT: In article <1r1mr8$eov@aurora.engr.LaTech.edu>, ra
-----------------------------------------------------------------
ID= G5rwd2YBIcpsWJdpPvk7 SCORE=4.8560805
PATH= 20_newsgroups/sci.med/0013128
TEXT: Here is a press release from the Natural Resources
-----------------------------------------------------------------
ID= Ypvwd2YBIcpsWJdpRg95 SCORE=4.0302267
PATH= 20_newsgroups/talk.politics.misc/0018667
TEXT: v140pxgt@ubvmsb.cc.buffalo.edu (Daniel B Case) wri
-----------------------------------------------------------------
ID= Vpvwd2YBIcpsWJdpQgTD SCORE=3.5608654
PATH= 20_newsgroups/talk.politics.guns/0015998
TEXT: Jim De Arras (jmd@cube.handheld.com) wrote:
: > La
-----------------------------------------------------------------
4 Documents


In [30]:
%run SearchIndexWeight.py --index news --nhits 5 --query toronto^2 nyc

['toronto^2', 'nyc']
ID= zJrwd2YBIcpsWJdpM8ce SCORE=12.817256
PATH= 20_newsgroups/alt.atheism/0000574
TEXT: In article <1r1mr8$eov@aurora.engr.LaTech.edu>, ra
-----------------------------------------------------------------
ID= G5rwd2YBIcpsWJdpPvk7 SCORE=6.8231306
PATH= 20_newsgroups/sci.med/0013128
TEXT: Here is a press release from the Natural Resources
-----------------------------------------------------------------
ID= Ypvwd2YBIcpsWJdpRg95 SCORE=6.744345
PATH= 20_newsgroups/talk.politics.misc/0018667
TEXT: v140pxgt@ubvmsb.cc.buffalo.edu (Daniel B Case) wri
-----------------------------------------------------------------
ID= Vpvwd2YBIcpsWJdpQgTD SCORE=5.0032635
PATH= 20_newsgroups/talk.politics.guns/0015998
TEXT: Jim De Arras (jmd@cube.handheld.com) wrote:
: > La
-----------------------------------------------------------------
4 Documents


In [99]:
%run SearchIndexWeight.py --index news --nhits 5 --query toronto nyc^2

['toronto^1', 'nyc^2']
ID= zJrwd2YBIcpsWJdpM8ce SCORE=14.549197
PATH= 20_newsgroups/alt.atheism/0000574
TEXT: In article <1r1mr8$eov@aurora.engr.LaTech.edu>, ra
-----------------------------------------------------------------
ID= G5rwd2YBIcpsWJdpPvk7 SCORE=7.745111
PATH= 20_newsgroups/sci.med/0013128
TEXT: Here is a press release from the Natural Resources
-----------------------------------------------------------------
ID= Vpvwd2YBIcpsWJdpQgTD SCORE=5.679333
PATH= 20_newsgroups/talk.politics.guns/0015998
TEXT: Jim De Arras (jmd@cube.handheld.com) wrote:
: > La
-----------------------------------------------------------------
ID= Ypvwd2YBIcpsWJdpRg95 SCORE=5.3463345
PATH= 20_newsgroups/talk.politics.misc/0018667
TEXT: v140pxgt@ubvmsb.cc.buffalo.edu (Daniel B Case) wri
-----------------------------------------------------------------
4 Documents
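The `^` suffix in these queries is the boost operator of the Lucene query syntax: `nyc^2` makes matches on `nyc` count twice as much as an unboosted term. The parsing done inside `SearchIndexWeight.py` is not shown here, so the following is only a sketch of how terms could be normalized to the `term^weight` form printed by the last run; `parse_query` and `compose_query` are hypothetical names.

```python
def parse_query(terms):
    """Split each 'term^weight' item into (term, weight); weight defaults to 1."""
    parsed = []
    for item in terms:
        term, sep, weight = item.partition("^")
        parsed.append((term, float(weight) if sep else 1.0))
    return parsed

def compose_query(parsed):
    """Rebuild the list of 'term^weight' strings from (term, weight) pairs."""
    return ["%s^%g" % (term, weight) for term, weight in parsed]

print(compose_query(parse_query(["toronto", "nyc^2"])))
```

The three runs return the same four documents. The boost changes only the scores and, in the last run, the order of the third and fourth hits.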


# 5 Rocchio

In [162]:
%run Rocchio.py --k 5 --nrounds 5 --index news --query nice city

Receive query ['nice', 'city']
******************************************************
iteration 1
Term vector : ['nice', 'city']
Weight vector : [1.0, 1.0]
Composed query : ['nice^1.0', 'city^1.0']
******************************************************
iteration 2
Term vector : ['nice', 'city']
Weight vector : [1.0497523249337901, 1.121322570620743]
Composed query : ['nice^1.0497523249337901', 'city^1.121322570620743']
******************************************************
iteration 3
Term vector : ['nice', 'city']
Weight vector : [1.0995046498675802, 1.242645141241486]
Composed query : ['nice^1.0995046498675802', 'city^1.242645141241486']
******************************************************
iteration 4
Term vector : ['nice', 'city']
Weight vector : [1.1492569748013703, 1.363967711862229]
Composed query : ['nice^1.1492569748013703', 'city^1.363967711862229']
******************************************************
iteration 5
Term vector : ['nice', 'city']
Weight vector : [1.1990092997