# Query expansion

Reformulate or expand the query to find more and better results.

1. Morphological
   * Also find different inflections of the search word (especially important for highly inflected languages like Finnish)
   * If our search word is 'koiralle', maybe we should also return documents containing koira, koiran, koiraa, koiralla, koiralta, koirilla, koirilta, koirille...
   * Stemming or lemmatization
2. Synonyms or related terms
   * Expand the search with synonyms or other related terms
3. Spelling errors
   * If a word is mistyped in the search, it's unlikely we will find good results --> spelling correction or 'Did you mean'

## 1. Morphology

### Stemming

- Reduce an inflected word to its stem by dropping all inflectional affixes: open, opens, opening, opener --> open
- Word stem: the part of the word that is common to all its inflected variants (not necessarily the same as the base form)
- koira, koiran, koiria --> stem is koir
- By default, Solr uses stemming in text fields: `text_en` uses an English stemming model and `text_fi` uses a Finnish one
- Snowball stemmer: removes known affixes from a word so that only the stem remains
- A bit too brute-force for Finnish, e.g. it removes 'na' from 'peruna' because -na is a known essive suffix
- Solr also has other, less aggressive stemmers...
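
To see why blind suffix stripping can go wrong, here is a deliberately naive toy stemmer. It is not the Snowball algorithm: the suffix list `SUFFIXES` and the minimum stem length `MIN_STEM` are made up for this illustration.

```python
# Toy suffix stripper: NOT the Snowball algorithm, just an illustration.
# SUFFIXES and MIN_STEM are assumptions chosen for this example.
SUFFIXES = ["lle", "lla", "lta", "na", "n"]  # tried in this order
MIN_STEM = 3  # never strip a word below this many characters


def toy_stem(word):
    """Strip the first matching suffix, as long as enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word


for w in ["koiralle", "koiran", "koirana", "koira", "peruna"]:
    print(w, "->", toy_stem(w))
```

The rule that correctly handles the essive 'koirana' also cuts 'peruna' down to 'peru', which is the kind of over-stemming described above.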

### Lemmatization

- Determine the base form of the word
- Maybe not a big deal in English? ([UDPipe](http://lindat.mff.cuni.cz/services/udpipe/run.php) lemma accuracy 97%)
- ...but a harder problem in Finnish: UDPipe lemma accuracy is 86.8%, though it can be improved by including a rule-based morphological analyzer (Omorfi)
- Solr does not have ready-made lemmatizers, but you can plug in your own

### ...in Solr

* **Where** does this happen? How do we tell Solr to do all this?
* When adding a field to Solr you are asked for a `field type`, which essentially tells Solr how to process that field
* Field types are defined in `solr/server/solr/CORENAME/conf/managed-schema`, where much of the configuration of a single core lives
* The defaults are reasonable, and you can of course change them
* Here's `text_fi`
  1. Tokenize
  2. Lowercase
  3. Apply stop word list
  4. Stem

```
  <fieldType name="text_fi" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_fi.txt" ignoreCase="true"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Finnish"/>
    </analyzer>
  </fieldType>
```
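
The four steps of the analyzer chain can be mimicked in plain Python to get a feel for what happens to a piece of text. This is only a rough sketch: the stop word set and the suffix list are tiny made-up stand-ins, and the regular expression is much cruder than Solr's standard tokenizer.

```python
import re

# Tiny stand-ins for the real resources; both are assumptions for this sketch.
STOP_WORDS = {"ja", "on", "ei"}        # a few Finnish stop words
SUFFIXES = ["lle", "lla", "lta", "n"]  # toy "stemmer", not Snowball


def tokenize(text):
    """Extract runs of word characters (rough tokenizer stand-in)."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def stem(tokens):
    """Strip at most one known suffix per token, keeping at least 3 characters."""
    out = []
    for t in tokens:
        for suffix in SUFFIXES:
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        out.append(t)
    return out


def analyze(text):
    """Run the four steps in the same order as the text_fi analyzer."""
    tokens = tokenize(text)
    for step in (lowercase, remove_stop_words, stem):
        tokens = step(tokens)
    return tokens


print(analyze("Koiralla on luu ja kissalla ei"))
```

The order matters: lowercasing comes before the stop word filter here, so a capitalized "On" would still be removed.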

* These tokenizers and filters (and many others) are documented [here](https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-SynonymFilter)
* To change the schema: edit `managed-schema` and restart Solr, or at least reload the core
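
Reloading a core can be done over HTTP with Solr's CoreAdmin API. The sketch below only builds the request URL. The base address `http://localhost:8983/solr` is an assumption (the usual default for a local install) and `core_reload_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def core_reload_url(core, base="http://localhost:8983/solr"):
    """Build the CoreAdmin RELOAD request URL for one core.

    `base` is an assumed local default; adjust it to your installation.
    """
    return base + "/admin/cores?" + urlencode({"action": "RELOAD", "core": core})


url = core_reload_url("CORENAME")
print(url)

# To actually send the request (needs a running Solr):
# import urllib.request
# print(urllib.request.urlopen(url).read().decode())
```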


## 2. Synonym expansion

### WordNet

- Lexical database where words are grouped into synonym sets (synsets), which are linked by other relations (antonyms, hyponyms, hypernyms)
- English: http://wordnetweb.princeton.edu/perl/webwn
- Finnish: http://www.ling.helsinki.fi/cgi-bin/fiwn/search

- Also available through the Python NLTK package

In [4]:
#import nltk ; nltk.download() # download wordnet and other nltk material (get at least omw and wordnet in corpora)

from nltk.corpus import wordnet as wn


print("fire:",wn.synsets("fire")) # all synsets for a word
print("fire as verb", wn.synsets("fire", pos=wn.VERB)) # define part-of-speech

# synset definitions
print("Definition of first fire:", wn.synsets("fire")[0],wn.synsets("fire")[0].definition())
print("Definition of first fire as verb:", wn.synsets("fire", pos=wn.VERB)[0],wn.synsets("fire", pos=wn.VERB)[0].definition())

# all lemmas for a given synset
lemmas=[lemma.name() for lemma in wn.synsets("fire", pos=wn.VERB)[0].lemmas()]
print("Lemmas", lemmas, "\n")

# List all languages available
print("Languages available:",wn.langs(), "\n")

# and Finnish lemmas for the same synset, note that we are still using the same 'English' synset
lemmas=[lemma for lemma in wn.synsets("fire", pos=wn.VERB)[0].lemma_names("fin")]
print("Finnish lemmas:", lemmas, "\n")
# ...but we can also use Finnish words
print("Looked up in Finnish", wn.synsets("tuli", lang="fin")[0].lemma_names("fin"), "\n")

# how many words does WordNet have?
all_lemmas=[l for l in wn.all_lemma_names(lang="eng")]
print("First few lemmas", all_lemmas[:5])
print("Total words English:",len(all_lemmas), "\n")


# Finnish
all_lemmas=[l for l in wn.all_lemma_names(lang="fin")]
print("First few lemmas in Finnish", all_lemmas[:5])
print("Total words Finnish:",len(all_lemmas))


fire: [Synset('fire.n.01'), Synset('fire.n.02'), Synset('fire.n.03'), Synset('fire.n.04'), Synset('fire.n.05'), Synset('ardor.n.03'), Synset('fire.n.07'), Synset('fire.n.08'), Synset('fire.n.09'), Synset('open_fire.v.01'), Synset('fire.v.02'), Synset('fire.v.03'), Synset('displace.v.03'), Synset('fire.v.05'), Synset('fire.v.06'), Synset('arouse.v.01'), Synset('burn.v.01'), Synset('fuel.v.02')]
fire as verb [Synset('open_fire.v.01'), Synset('fire.v.02'), Synset('fire.v.03'), Synset('displace.v.03'), Synset('fire.v.05'), Synset('fire.v.06'), Synset('arouse.v.01'), Synset('burn.v.01'), Synset('fuel.v.02')]
Definition of first fire: Synset('fire.n.01') the event of something burning (often destructive)
Definition of first fire as verb: Synset('open_fire.v.01') start firing a weapon
Lemmas ['open_fire', 'fire'] 

Languages available: ['eng', 'als', 'arb', 'bul', 'cat', 'cmn', 'dan', 'ell', 'eus', 'fas', 'fin', 'fra', 'glg', 'heb', 'hrv', 'ind', 'ita', 'jpn', 'nno', 'nob', 'pol', 'por', 'qcn

### Synonym lists

- Now we can use these WordNet synsets to collect a list of synonyms for each word
- ...and these synonyms can be used to expand queries
- But keep in mind that these are lemmas, not word forms

In [5]:
from nltk.corpus import wordnet as wn

def expand_words(words, lang):
    # function to expand a list of words using wordnet synsets
    synonyms=[]
    synonyms+=words
    for w in words:
        for s in wn.synsets(w, lang=lang):
            synonyms+=s.lemma_names(lang)
    return set(synonyms)

print("Finnish:", "\n")
search_words=["kissa", "maukua"]
print("Original query:", search_words, "\n")
expanded=expand_words(search_words,"fin")
print("Expanded query:", sorted(expanded), "\n")
print("")

print("English:", "\n")
search_words=["house", "flames"]
print("Original query:", search_words, "\n")
expanded=expand_words(search_words,"eng")
print("Expanded query:", sorted(expanded), "\n")
print("")


Finnish: 

Original query: ['kissa', 'maukua'] 

Expanded query: ['Felis_catus', 'Felis_domesticus', 'iso_kissaeläin', 'kilpikonnakuvioinen_kissa', 'kissa', 'kissaeläin', 'kotikissa', 'maukua', 'naukua'] 


English: 

Original query: ['house', 'flames'] 

Expanded query: ['business_firm', 'domiciliate', 'family', 'fire', 'firm', 'flame', 'flames', 'flaming', 'flare', 'home', 'house', 'household', 'mansion', 'menage', 'planetary_house', 'put_up', 'sign', 'sign_of_the_zodiac', 'star_sign', 'theater', 'theatre'] 
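
One way to use such an expanded set is to turn it into a single OR query. The sketch below assumes a field called `text` (hypothetical) and does no escaping of query-syntax special characters, so it is only suitable for plain words like the ones above.

```python
def to_solr_query(terms, field="text"):
    """Join expanded terms with OR; multiword lemmas become quoted phrases.

    The field name is an assumption; no special characters are escaped.
    """
    parts = []
    for term in sorted(terms):
        term = term.replace("_", " ")  # WordNet joins multiword lemmas with "_"
        parts.append('"%s"' % term if " " in term else term)
    return "%s:(%s)" % (field, " OR ".join(parts))


print(to_solr_query({"kissa", "kotikissa", "iso_kissaeläin"}))
# prints: text:("iso kissaeläin" OR kissa OR kotikissa)
```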




### Synonym expansion in Solr

* https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-SynonymFilter
* We need a file with one synset per line, comma-separated, like this: *car,vehicle,automobile*

In [28]:
with open("synonyms_fin.txt","w") as f:
    for s in wn.all_synsets():
        finnish_lemmas=s.lemma_names("fin")
        if len(finnish_lemmas)>1: # at least two, would make no sense otherwise
            #print them with commas in between, and underscores replaced with spaces
            print(",".join(l.replace("_"," ") for l in finnish_lemmas),file=f)

### In Solr

* Here's the default definition of `text_en`

```
  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>
```

* `text_en` has a different pipeline for *query* and *index*
* **Index time**
  1. Tokenize
  2. Filter stop words
  3. Lowercase
  4. Remove possessive markers
  5. Protect some words from stemming
  6. Stem
* **Query time**
  1. Tokenize
  2. **Expand synonyms**
  3. Filter stop words
  4. Lowercase
  5. Remove possessive markers
  6. Protect some words from stemming
  7. Stem
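
The effect of having two different chains can be sketched with a toy inverted index. This is a simplification under stated assumptions: only tokenizing, lowercasing and synonym expansion are modelled, the synonym group is made up, and lowercasing happens before expansion here (the real chain expands first and relies on `ignoreCase`).

```python
from collections import defaultdict

# Hypothetical synonym group, in the spirit of a line in synonyms.txt.
SYNONYMS = [{"car", "automobile", "vehicle"}]


def index_analyzer(text):
    """Index-time chain (simplified): tokenize and lowercase only."""
    return [t.lower() for t in text.split()]


def query_analyzer(text):
    """Query-time chain (simplified): same steps plus synonym expansion."""
    expanded = set()
    for token in index_analyzer(text):
        expanded.add(token)
        for group in SYNONYMS:
            if token in group:
                expanded |= group
    return expanded


docs = {1: "Red automobile for sale", 2: "Bicycle repair shop"}

# Build the inverted index once, with the index-time chain.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in index_analyzer(text):
        index[token].add(doc_id)


def search(query):
    """Return ids of documents matching any analyzed query term."""
    hits = set()
    for term in query_analyzer(query):
        hits |= index.get(term, set())
    return sorted(hits)


print(search("car"))      # document 1 is found through the synonym "automobile"
print(search("bicycle"))
```

The index itself never sees the synonyms; they are added to each query just before lookup.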

### Query time vs index time is an important distinction

* Index time: processing carried out when indexing the data to search
* Query time: processing carried out on every query
* Certain steps are only needed at query time, others are done at index time
* Synonym expansion: maybe not too smart at index time. Why not?

### Word2vec

- "Similar words appear in similar contexts"
- https://github.com/tmikolov/word2vec

In [7]:
import lwvlib # https://github.com/fginter/wvlib_light

model=lwvlib.load("/home/jmnybl/pb34_wf_200_v2_skgram.bin",100000,500000)
print("kissa:", model.nearest("kissa"), "\n")
print("maukuu:", model.nearest("maukuu"), "\n") # note that these do not have to be base forms


kissa: [(0.88850498, 'kani'), (0.8579123, 'kissanpentu'), (0.85663795, 'pentu'), (0.8336221, 'marsu'), (0.80795115, 'katti'), (0.80793911, 'hamsteri'), (0.8023504, 'koira'), (0.798271, 'kisu'), (0.78334367, 'kirppu'), (0.77792805, 'susi')] 

maukuu: [(0.8782531, 'naukui'), (0.8746469, 'maukui'), (0.86840028, 'naukaisi'), (0.80436617, 'sähisi'), (0.80063224, 'kolli'), (0.79289144, 'sähähti'), (0.78519303, 'murisi'), (0.76773942, 'murahti'), (0.75686067, 'naaras'), (0.75560999, 'sihahti')] 
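
Conceptually, a nearest-neighbour lookup like this ranks all other words by vector similarity, typically cosine similarity. Whether `lwvlib` computes it exactly this way is an assumption; the sketch below uses made-up three-dimensional vectors just to show the idea.

```python
import math

# Made-up 3-dimensional vectors; real models have hundreds of dimensions.
VECTORS = {
    "kissa": [0.9, 0.8, 0.1],
    "kisu": [0.85, 0.75, 0.2],
    "koira": [0.7, 0.9, 0.3],
    "auto": [0.1, 0.2, 0.95],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest(word, n=2):
    """Return the n most similar other words as (similarity, word) pairs."""
    scored = [(cosine(VECTORS[word], vec), other)
              for other, vec in VECTORS.items() if other != word]
    return sorted(scored, reverse=True)[:n]


print(nearest("kissa"))
```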



## 3. Spelling error expansion

* We all know and love Google's "Did you mean?" corrections
* User query logs are a goldmine here: spelling errors can be gathered from logs by looking for queries submitted one after another with only a tiny difference
* To my understanding, this is how it is actually done in practice
* We don't have query logs, so let us try to achieve something similar with the means we have

* Head to http://bionlp-www.utu.fi/wv_demo/ and try a few Finnish typos
* The correct form is often nearby
* Word2vec models trained on plenty of data for dozens of languages can nowadays be downloaded: https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989
* We could systematically search for words whose top-N nearest-neighbour list contains a word 1-2 characters away, such that the query word is not recognized as a word of the language while the similar word is
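
The search criterion can be sketched as follows. This is not the linked script, just a minimal illustration: it uses plain Levenshtein distance, and the vocabulary and the neighbour list are hypothetical stand-ins for a real word list and a real top-N list from the model.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def typo_candidates(word, neighbours, vocabulary, max_dist=2):
    """Known words among the neighbours within max_dist edits of an unknown word."""
    if word in vocabulary:
        return []  # the word itself is recognized, nothing to correct
    return [n for n in neighbours
            if n in vocabulary and 0 < edit_distance(word, n) <= max_dist]


# Hypothetical data: a tiny vocabulary and a made-up nearest-neighbour list.
vocabulary = {"vielä", "kaikki", "vaikka"}
print(typo_candidates("viellä", ["vielä", "vaikka", "viela"], vocabulary))
```

Note that plain Levenshtein distance counts a swap of two adjacent letters (as in 'olltu' for 'ollut') as two edits, which still falls within the 1-2 character window.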

* Code: https://github.com/fginter/wv_spellcheck/blob/master/sc.py
* The file `pairs_initial.filtered` has the right stuff:

```
1.0     vielä   viellä  5339641 14143
1.0     vielä   veilä   5339641 656
1.0     ollut   olllut  5674748 1305
1.0     ollut   olltu   5674748 333
1.0     ollut   ollt    5674748 189
1.0     mitä    mtä     4623994 410
1.0     kaikki  kaiki   4287031 2037
1.0     kaikki  kaikkki 4287031 365
1.0     olisi   olsi    5395948 1482
1.0     olisi   oisi    5395948 6384
1.0     olisi   oilisi  5395948 209
1.0     mukaan  mukan   4753605 1581
1.0     mukaan  mu­kaan 4753605 436
1.0     siitä   siittä  4756536 9938
1.0     siitä   siintä  4756536 1896
1.0     siitä   siiitä  4756536 604
1.0     jotka   joitka  4593675 336
1.0     jotka   jotaka  4593675 410
1.0     jotka   jotak   4593675 369
1.0     jotka   jokta   4593675 302
1.0     kuitenkin       kuitekin        4130244 3134
1.0     tulee   tuleee  4261769 502
1.0     jälkeen jäkeen  4281104 1759
1.0     jälkeen jäljeen 4281104 414
1.0     jälkeen jäleen  4281104 410
```

* Now we can filter it and turn it into a spelling error dictionary like this:

```
paluttaa => paluttaa,palauttaa
Cantin => Canthin,Cantin,Canth,Canthia
tun- => taan,tun-
ollukka => ollukaan,ollukka
nuosi => nuosi,nousi
Aikasempi => Aikasempi,aikaisempi
saahaa => saahaa,saatas
otakkaan => otakaan,otakkaan
kertoopi => kertoo,kertookin,kertoopi
yhdssä => yhdessä,yhdssä
niitteen => niitteen,niitten
Baselissa => Baselissa,Badenissa
Australissa => Australiassa,Australissa
selkäesti => selkeästi,selvästi,selkäesti
tietyyn => tiettyyn,tietyyn
etko => etkö,etko,enkö
täytee => täyteen,täytee
Oylle => oy:lle,Oy:lle,Oylle
miän => miän,meikän
pienetää => pienetää,pienentää
muussina => muussina,muusina
rävellystä => rävellystä,räpellystä
roopan => roopan,Euroopan
niis => jois,niis
yhta => yhtä,yhta
ajatelen => ajattelen,ajatelen
wappuna => wappuna,vappuna,Vappuna
TSOP:n => TSOP:n,SOK:n
kute => kuten,kute
tehosekottimeen => tehosekottimeen,tehosekoittimeen
käyti => käyty,käyti,käytti
```
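
A minimal sketch of how pair rows could be turned into rules of this shape, and what applying them to a query does. The column interpretation (score, correct form, misspelling, two frequency counts) is an assumption, no filtering is done, and the function names are hypothetical.

```python
from collections import defaultdict

# A few rows in the layout of pairs_initial.filtered. The column meaning
# (score, correct form, misspelling, two frequency counts) is an assumption.
ROWS = """\
1.0\tvielä\tviellä\t5339641\t14143
1.0\tollut\tolllut\t5674748\t1305
1.0\tkaikki\tkaiki\t4287031\t2037
"""


def rows_to_rules(rows):
    """Map each misspelling to itself plus its correction(s)."""
    rules = defaultdict(list)
    for line in rows.splitlines():
        _score, correct, typo, _n_correct, _n_typo = line.split()
        if typo not in rules:
            rules[typo].append(typo)  # keep the original form as well
        rules[typo].append(correct)
    return rules


def format_rules(rules):
    """Render rules as 'typo => typo,correction' lines."""
    return ["%s => %s" % (typo, ",".join(forms)) for typo, forms in rules.items()]


def expand_query(tokens, rules):
    """Replace each token with its rule output, leaving unknown tokens unchanged."""
    out = []
    for t in tokens:
        out.extend(rules.get(t, [t]))
    return out


rules = rows_to_rules(ROWS)
print("\n".join(format_rules(rules)))
print(expand_query(["viellä", "kaikki"], rules))
```

Keeping the misspelled form on the right-hand side means documents that contain the same typo can still match.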

...and then we can point Solr to the spelling error dictionary in its config like so:

```
  <fieldType name="text_fi" class="solr.TextField" positionIncrementGap="1">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Finnish"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="spelling_fi.txt" />
      <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms_fi_wordnet.txt" />
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Finnish"/>
    </analyzer>
  </fieldType>
```