# 0 List all indices

In [3]:
!curl -X GET "localhost:9200/_cat/indices?v"

health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   news   pUcnw5GGQAaeEr00fLQuSg   1   1      20417            0     35.2mb         35.2mb
yellow open   novels cUrnZJIzRKaFFh3AIiOfEw   1   1         33            0     18.3mb         18.3mb


# 1 Delete all indices

In [2]:
!curl -X DELETE "localhost:9200/*"

{"acknowledged":true}




# 2 Reload the index

In [5]:
%run IndexFilesPreprocess.py --index news  --path 20_newsgroups/**/[0-9]* --token letter --filter lowercase asciifolding

Indexing 20417 files
Reading files ...


DELETE http://localhost:9200/news [status:404 request:0.055s]


Index settings= {'news': {'settings': {'index': {'number_of_shards': '1', 'provided_name': 'news', 'creation_date': '1539639196314', 'analysis': {'analyzer': {'default': {'filter': ['lowercase', 'asciifolding'], 'type': 'custom', 'tokenizer': 'letter'}}}, 'number_of_replicas': '1', 'uuid': 'Y7kUjQfeTxiPH-5w_TdfRw', 'version': {'created': '6040299'}}}}}
Indexing ...


What is the most frequent word in the English language?

the

In [1]:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
from elasticsearch.exceptions import NotFoundError

def count_words(index, alpha):
    """Return (term, frequency) pairs for every term in `index`.

    Sorted alphabetically if `alpha` is true, otherwise by ascending frequency.
    """
    lpal = []
    voc = {}  # term -> total frequency over all documents

    try:
        client = Elasticsearch()

        # Iterate over every document in the index
        sc = scan(client, index=index, doc_type='document', query={"query" : {"match_all": {}}})

        for s in sc:
            # Fetch the term vector of the 'text' field and accumulate its term frequencies
            tv = client.termvectors(index=index, doc_type='document', id=s['_id'], fields=['text'])
            if 'text' in tv['term_vectors']:
                terms = tv['term_vectors']['text']['terms']
                for t in terms:
                    voc[t] = voc.get(t, 0) + terms[t]['term_freq']
        for v in voc:
            lpal.append((v.encode("utf8", "ignore"), voc[v]))

        print('%s Words' % len(lpal))
    except NotFoundError:
        print('Index %s does not exist' % index)

    # Sort by term (alpha) or by frequency, both ascending
    lpal.sort(key=lambda x: x[0 if alpha else 1])

    return lpal

In [4]:
lpal = count_words("novels", False)

54365 Words


In [5]:
lpal[-1]

(b'the', 206706)

# 3 tf-idf and cosine similarity

The implementation is in the code (TFIDFViewer.py, used in the cells below).

## 3.1 Experimenting
We compare one document against a different one, and then one document against itself.

In [70]:
%run TFIDFViewer.py --index novels --files novels/DickensAChristmasCarol.txt novels/DickensGreatExpectations.txt
%run TFIDFViewer.py --index novels --files novels/DickensAChristmasCarol.txt novels/DickensAChristmasCarol.txt

Similarity = 0.01715
Similarity = 1.00000


We also created a simple test with three documents.

* d1: a b c d e
* d2: a c e f g
* d3: b g f e c

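By hand, the similarity between d2 and d3 should be about 0.667 (2/3). The terms c and e occur in all three documents, so their idf is 0. That leaves d2 with a, f, g and d3 with b, f, g, all equally weighted, so each normalized weight is 1/sqrt(3) and the two shared terms (f and g) contribute 1/3 each.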

In [45]:
%run IndexFilesPreprocess.py --index test  --path mytest/* --token letter

Indexing 3 files
Reading files ...
Index settings= {'test': {'settings': {'index': {'number_of_shards': '1', 'provided_name': 'test', 'creation_date': '1539738959094', 'analysis': {'analyzer': {'default': {'filter': ['lowercase'], 'type': 'custom', 'tokenizer': 'letter'}}}, 'number_of_replicas': '1', 'uuid': 'm_GdMZOqTxesTyykLXxeOA', 'version': {'created': '6020499'}}}}}
Indexing ...


In [72]:
%run TFIDFViewer.py --index test --print --files mytest/file2.txt mytest/file3.txt

TFIDF FILE mytest/file2.txt
a 0.5773502691896258
f 0.5773502691896258
g 0.5773502691896258
---------------------
TFIDF FILE mytest/file3.txt
b 0.5773502691896258
f 0.5773502691896258
g 0.5773502691896258
---------------------
Similarity = 0.66667
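
The hand calculation for the three test documents can be reproduced without Elasticsearch. The following is a minimal sketch, assuming tf is the term count divided by the maximum count in the document and idf is log2(N/df); the actual weighting inside TFIDFViewer.py is not shown here, and the helper names `tfidf_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter

# The three test documents from the text above
docs = {
    "d1": "a b c d e".split(),
    "d2": "a c e f g".split(),
    "d3": "b g f e c".split(),
}

def tfidf_vector(tokens, df, n_docs):
    """L2-normalized tf-idf weights (assumed scheme: tf = f / max_f, idf = log2(N / df))."""
    counts = Counter(tokens)
    max_f = max(counts.values())
    weights = {t: (f / max_f) * math.log2(n_docs / df[t]) for t, f in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {}
    # Drop zero-weight terms (those that occur in every document)
    return {t: w / norm for t, w in weights.items() if w > 0}

def cosine(v1, v2):
    """Dot product of two already normalized sparse vectors."""
    return sum(w * v2[t] for t, w in v1.items() if t in v2)

# Document frequency: in how many documents each term occurs
df = Counter(t for tokens in docs.values() for t in set(tokens))

v2 = tfidf_vector(docs["d2"], df, len(docs))
v3 = tfidf_vector(docs["d3"], df, len(docs))
similarity = cosine(v2, v3)
print("Similarity = %.5f" % similarity)
```

Because the vectors are normalized, the base of the logarithm does not affect the result for this example.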


## A final question: have you noticed that we are searching the documents using the path name?

Yes. We use the path field as the key to look up each document.

## Was the path tokenized by the index?

No. The path field is configured so that it is not tokenized.

## What did we do differently when indexing the documents so we can look for an exact match in the path field?

We added a property to the index mapping that declares `path` as a `keyword` field, so its value is indexed verbatim instead of being analyzed:
```python
    client.indices.put_mapping(doc_type='document', index=index, body= {
        "document" : {
            "properties": {
                "path": {
                    "type": "keyword",
                }
            }
        }
    })
```
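
With `path` mapped as `keyword`, an exact lookup can be expressed as a `term` query. The following is a small sketch; the helper name `exact_path_query` is hypothetical, and the scripts used in this notebook may build their queries differently.

```python
def exact_path_query(path):
    """Build a search body that matches documents whose `path` field equals `path` exactly."""
    return {"query": {"term": {"path": path}}}

body = exact_path_query("novels/DickensAChristmasCarol.txt")
print(body)

# With a running cluster the body could be sent with the client, for example:
#   from elasticsearch import Elasticsearch
#   Elasticsearch().search(index="novels", body=body)
```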

# 4 Document relevance

In [29]:
%run SearchIndexWeight.py --index news --nhits 5 --query toronto nyc

['toronto', 'nyc']
ID= zJrwd2YBIcpsWJdpM8ce SCORE=9.122151
PATH= 20_newsgroups/alt.atheism/0000574
TEXT: In article <1r1mr8$eov@aurora.engr.LaTech.edu>, ra
-----------------------------------------------------------------
ID= G5rwd2YBIcpsWJdpPvk7 SCORE=4.8560805
PATH= 20_newsgroups/sci.med/0013128
TEXT: Here is a press release from the Natural Resources
-----------------------------------------------------------------
ID= Ypvwd2YBIcpsWJdpRg95 SCORE=4.0302267
PATH= 20_newsgroups/talk.politics.misc/0018667
TEXT: v140pxgt@ubvmsb.cc.buffalo.edu (Daniel B Case) wri
-----------------------------------------------------------------
ID= Vpvwd2YBIcpsWJdpQgTD SCORE=3.5608654
PATH= 20_newsgroups/talk.politics.guns/0015998
TEXT: Jim De Arras (jmd@cube.handheld.com) wrote:
: > La
-----------------------------------------------------------------
4 Documents


In [30]:
%run SearchIndexWeight.py --index news --nhits 5 --query toronto^2 nyc

['toronto^2', 'nyc']
ID= zJrwd2YBIcpsWJdpM8ce SCORE=12.817256
PATH= 20_newsgroups/alt.atheism/0000574
TEXT: In article <1r1mr8$eov@aurora.engr.LaTech.edu>, ra
-----------------------------------------------------------------
ID= G5rwd2YBIcpsWJdpPvk7 SCORE=6.8231306
PATH= 20_newsgroups/sci.med/0013128
TEXT: Here is a press release from the Natural Resources
-----------------------------------------------------------------
ID= Ypvwd2YBIcpsWJdpRg95 SCORE=6.744345
PATH= 20_newsgroups/talk.politics.misc/0018667
TEXT: v140pxgt@ubvmsb.cc.buffalo.edu (Daniel B Case) wri
-----------------------------------------------------------------
ID= Vpvwd2YBIcpsWJdpQgTD SCORE=5.0032635
PATH= 20_newsgroups/talk.politics.guns/0015998
TEXT: Jim De Arras (jmd@cube.handheld.com) wrote:
: > La
-----------------------------------------------------------------
4 Documents


In [99]:
%run SearchIndexWeight.py --index news --nhits 5 --query toronto nyc^2

['toronto^1', 'nyc^2']
ID= zJrwd2YBIcpsWJdpM8ce SCORE=14.549197
PATH= 20_newsgroups/alt.atheism/0000574
TEXT: In article <1r1mr8$eov@aurora.engr.LaTech.edu>, ra
-----------------------------------------------------------------
ID= G5rwd2YBIcpsWJdpPvk7 SCORE=7.745111
PATH= 20_newsgroups/sci.med/0013128
TEXT: Here is a press release from the Natural Resources
-----------------------------------------------------------------
ID= Vpvwd2YBIcpsWJdpQgTD SCORE=5.679333
PATH= 20_newsgroups/talk.politics.guns/0015998
TEXT: Jim De Arras (jmd@cube.handheld.com) wrote:
: > La
-----------------------------------------------------------------
ID= Ypvwd2YBIcpsWJdpRg95 SCORE=5.3463345
PATH= 20_newsgroups/talk.politics.misc/0018667
TEXT: v140pxgt@ubvmsb.cc.buffalo.edu (Daniel B Case) wri
-----------------------------------------------------------------
4 Documents
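
In the three runs above the same four documents are returned, but boosting `nyc` moves the talk.politics.guns post above the talk.politics.misc one. The echoed query `['toronto^1', 'nyc^2']` suggests that a term without an explicit weight is treated as having weight 1. A minimal sketch of parsing such `term^weight` arguments follows; the function names are hypothetical and this is not necessarily how SearchIndexWeight.py does it.

```python
def parse_weighted_terms(terms, default_weight=1.0):
    """Split each 'term^weight' string into (term, weight); a missing weight defaults to 1."""
    parsed = []
    for term in terms:
        word, sep, weight = term.partition("^")
        parsed.append((word, float(weight) if sep else default_weight))
    return parsed

def compose_query(weighted_terms):
    """Rebuild the 'term^weight' strings from (term, weight) pairs."""
    return ["%s^%s" % (word, weight) for word, weight in weighted_terms]

weighted = parse_weighted_terms(["toronto", "nyc^2"])
print(weighted)
print(compose_query(weighted))
```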


# 5 Rocchio

In [125]:
%run Rocchio.py --k 5 --nrounds 1 --index novels --query toronto

Receive query ['toronto']
******************************************************
iteration 1
Term vector : ['toronto']
Weight vector : [1.0]
Composed query : ['toronto^1.0']
378 1088 1 1
a 0.0
1 1088 1 1
abated 0.0
1 1088 1 1
abide 0.0
1 1088 1 1
abjuring 0.0
1 1088 1 1
abolished 0.0
8 1088 1 1
about 0.0
...
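
For reference, one round of the standard Rocchio update (without a negative-feedback term) can be sketched on plain term-weight dictionaries. This is a minimal sketch under stated assumptions: `alpha`, `beta` and `r` (the number of terms kept) are illustrative parameters, the document vectors are made-up toy values, and the function `rocchio_round` is hypothetical rather than the code in Rocchio.py.

```python
def rocchio_round(query, docs, alpha=1.0, beta=1.0, r=5):
    """Return the r heaviest terms of alpha * query + beta * mean(docs).

    query: dict mapping term -> weight
    docs:  list of dicts mapping term -> weight (e.g. tf-idf vectors of the top-k hits)
    """
    new_query = {t: alpha * w for t, w in query.items()}
    for doc in docs:
        for t, w in doc.items():
            # Each document contributes its weight divided by the number of documents
            new_query[t] = new_query.get(t, 0.0) + beta * w / len(docs)
    top = sorted(new_query.items(), key=lambda kv: kv[1], reverse=True)[:r]
    return dict(top)

# Toy example: two made-up document vectors for the query 'toronto'
query = {"toronto": 1.0}
docs = [
    {"toronto": 0.5, "canada": 0.4},
    {"ontario": 0.2, "canada": 0.2},
]
expanded = rocchio_round(query, docs, alpha=1.0, beta=1.0, r=2)
print(expanded)
```

Keeping only the `r` heaviest terms stops the query from growing with every round.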