# Query expansion

- Reformulate or expand the query in order to find more and better results

1) Morphological
- Also find other inflections of the search word (especially important for morphologically rich languages like Finnish)
- If the search word is 'koiralle', we should probably also return documents containing koira, koiran, koiraa, koiralla, koiralta, koirilla, koirilta, koirille...
- Done with stemming or lemmatization

2) Spelling errors
- If a search word is mistyped, we are unlikely to find good results --> spelling correction or 'Did you mean' suggestions

3) Synonyms or related terms
- Expand the search with synonyms or other related terms
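
As a rough illustration of point 2, a 'Did you mean' suggestion can be sketched with Python's standard `difflib` module. The vocabulary below is a made-up stand-in for the terms of a real index, and the function name `did_you_mean` and the 0.8 similarity cutoff are assumptions chosen for this example only.

```python
import difflib

# Hypothetical vocabulary; in practice this would come from the indexed terms.
VOCABULARY = ["koira", "kissa", "peruna", "talo"]

def did_you_mean(word, vocabulary, cutoff=0.8):
    """Return the closest vocabulary word, or None if nothing is needed or found."""
    if word in vocabulary:
        return None  # already a known word, nothing to suggest
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(did_you_mean("koirra", VOCABULARY))  # misspelling of 'koira'
print(did_you_mean("kisa", VOCABULARY))    # misspelling of 'kissa'
```

A lower cutoff yields more suggestions but also more false corrections, so the value has to be tuned against real queries.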


## 1. Morphology

### Stemming

- Reduce an inflected word to its stem by dropping affixes: open, opens, opening, opener --> open
- Word stem: the part of the word that is common to all its inflected variants (not necessarily the same as the base form)
- koira, koiran, koiria --> stem is koir
- By default, Solr uses stemming in text fields: text_en uses an English stemming model and text_fi uses a Finnish stemming model
- Snowball stemmer: removes known affixes from the word so that only the stem remains
- A bit too brute-force for Finnish, e.g. it removes 'na' from 'peruna' because -na is a known suffix (the essive case ending)
- Solr also has other, less aggressive stemmers...
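
The suffix-stripping idea, and why it over-stems 'peruna', can be sketched with a toy stemmer. This is not the Snowball algorithm: the function `toy_stem` and the two tiny suffix lists are assumptions made up for illustration, and a real stemmer applies many more rules and conditions.

```python
def toy_stem(word, suffixes, min_stem_len=3):
    """Strip the longest matching suffix, keeping at least min_stem_len characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

# Tiny hand-picked suffix lists (assumptions for illustration only).
ENGLISH_SUFFIXES = ["ing", "er", "s"]
FINNISH_SUFFIXES = ["lle", "lla", "lta", "na", "n"]

for w in ["open", "opens", "opening", "opener"]:
    print(w, "-->", toy_stem(w, ENGLISH_SUFFIXES))

for w in ["koiran", "koiralle", "koirana", "peruna"]:
    print(w, "-->", toy_stem(w, FINNISH_SUFFIXES))
```

With these toy rules the inflected forms of 'koira' collapse to one string, but 'peruna' is cut to 'peru' even though it is not an essive form: the rule only looks at the surface ending. Note also that the toy rules leave 'koira' where the stem given above is 'koir'.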

### Lemmatization

- Determine the base form (lemma) of the word
- Probably less of a problem in English (UDPipe lemma accuracy 97%)
- ...but a harder problem in Finnish: UDPipe lemma accuracy is 86.8%, though it can be improved by including a rule-based morphological analyzer (Omorfi)
- Solr does not have ready-made lemmatizers, but you can plug in your own lemmatizer
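
To make the contrast with stemming concrete, here is a minimal sketch of lemmatization as a dictionary lookup. The table `LEMMA_TABLE` is a hand-written assumption covering only a few word forms; a real lemmatizer such as UDPipe or Omorfi derives the lemma from a trained model or morphological rules instead of a fixed table.

```python
# Hypothetical word form -> lemma table (hand-written for illustration).
LEMMA_TABLE = {
    "koiran": "koira",
    "koiralle": "koira",
    "koiria": "koira",
    "perunan": "peruna",
    "perunaa": "peruna",
}

def lemmatize(tokens, table):
    """Map each token to its lemma; unknown tokens pass through lowercased."""
    return [table.get(t.lower(), t.lower()) for t in tokens]

print(lemmatize(["Koiralle", "perunaa"], LEMMA_TABLE))
print(lemmatize(["peruna"], LEMMA_TABLE))
```

The same function would have to be applied both to the documents at index time and to the query at search time, so that the two sides meet at the same base form. Unlike suffix stripping, the base form 'peruna' stays intact.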

## 3. Synonym expansion

### WordNet

- Lexical database where words are grouped into synonym sets (synsets), which are linked to each other by further relations (antonyms, hyponyms, hypernyms)
- English: http://wordnetweb.princeton.edu/perl/webwn
- Finnish: http://www.ling.helsinki.fi/cgi-bin/fiwn/search

- Also available in the Python NLTK package

In [8]:
# import nltk ; nltk.download() # download wordnet and other nltk material

from nltk.corpus import wordnet as wn


print(wn.synsets("fire")) # all synsets for a word

print(wn.synsets("fire", pos=wn.VERB)) # define part-of-speech

# synset definitions
print(wn.synsets("fire")[0],wn.synsets("fire")[0].definition())
print(wn.synsets("fire", pos=wn.VERB)[0],wn.synsets("fire", pos=wn.VERB)[0].definition())

# all lemmas for a given synset
lemmas=[lemma.name() for lemma in wn.synsets("fire", pos=wn.VERB)[0].lemmas()]
print(lemmas, "\n")

# List all languages available
print("Languages available:",wn.langs(), "\n")

# Finnish lemmas for the same synset; note that we are still using the same 'English' synset
lemmas=wn.synsets("fire", pos=wn.VERB)[0].lemma_names("fin")
print(lemmas, "\n")
# ...but we can also use Finnish words
print(wn.synsets("tuli", lang="fin")[0].lemma_names("fin"), "\n")

# how many words does WordNet have?
all_lemmas=list(wn.all_lemma_names(lang="eng"))
print(all_lemmas[:5])
print("Total words English:",len(all_lemmas), "\n")


# Finnish
all_lemmas=[l for l in wn.all_lemma_names(lang="fin")]
print(all_lemmas[:5])
print("Total words Finnish:",len(all_lemmas))


[Synset('fire.n.01'), Synset('fire.n.02'), Synset('fire.n.03'), Synset('fire.n.04'), Synset('fire.n.05'), Synset('ardor.n.03'), Synset('fire.n.07'), Synset('fire.n.08'), Synset('fire.n.09'), Synset('open_fire.v.01'), Synset('fire.v.02'), Synset('fire.v.03'), Synset('displace.v.03'), Synset('fire.v.05'), Synset('fire.v.06'), Synset('arouse.v.01'), Synset('burn.v.01'), Synset('fuel.v.02')]
[Synset('open_fire.v.01'), Synset('fire.v.02'), Synset('fire.v.03'), Synset('displace.v.03'), Synset('fire.v.05'), Synset('fire.v.06'), Synset('arouse.v.01'), Synset('burn.v.01'), Synset('fuel.v.02')]
Synset('fire.n.01') the event of something burning (often destructive)
Synset('open_fire.v.01') start firing a weapon
['open_fire', 'fire'] 

Languages available: ['eng', 'als', 'arb', 'bul', 'cat', 'cmn', 'dan', 'ell', 'eus', 'fas', 'fin', 'fra', 'glg', 'heb', 'hrv', 'ind', 'ita', 'jpn', 'nno', 'nob', 'pol', 'por', 'qcn', 'slv', 'spa', 'swe', 'tha', 'zsm'] 

['ampua', 'avata_tuli'] 

['tuli', 'tulitus'] 

### Synonym lists

- Now we can use these WordNet synsets to collect a list of synonyms for each word
- ...and these synonyms can be used to expand queries
- Keep in mind that these are lemmas, not word forms, so inflected forms in the documents will not match unless the text is lemmatized too

In [9]:
from nltk.corpus import wordnet as wn

def expand_words(words, lang):
    # function to expand a list of words using wordnet synsets
    synonyms=[]
    synonyms+=words
    for w in words:
        for s in wn.synsets(w, lang=lang):
            synonyms+=s.lemma_names(lang)
    return set(synonyms)

print("Finnish:", "\n")
search_words=["kissa", "maukua"]
print("Original query:", search_words, "\n")
expanded=expand_words(search_words,"fin")
print("Expanded query:", sorted(expanded), "\n")
print("")

print("English:", "\n")
search_words=["house", "flames"]
print("Original query:", search_words, "\n")
expanded=expand_words(search_words,"eng")
print("Expanded query:", sorted(expanded), "\n")
print("")


Finnish: 

Original query: ['kissa', 'maukua'] 

Expanded query: ['Felis_catus', 'Felis_domesticus', 'iso_kissaeläin', 'kilpikonnakuvioinen_kissa', 'kissa', 'kissaeläin', 'kotikissa', 'maukua', 'naukua'] 


English: 

Original query: ['house', 'flames'] 

Expanded query: ['business_firm', 'domiciliate', 'family', 'fire', 'firm', 'flame', 'flames', 'flaming', 'flare', 'home', 'house', 'household', 'mansion', 'menage', 'planetary_house', 'put_up', 'sign', 'sign_of_the_zodiac', 'star_sign', 'theater', 'theatre'] 
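
One way to use such synonyms in an actual search is to turn them into a boolean query string. The sketch below assumes a Lucene-style syntax with `OR`, `AND` and quoted phrases; the function `to_query` is a hypothetical helper, and the per-word synonym groups are picked by hand from the Finnish output above, since `expand_words()` itself returns one flat set rather than groups.

```python
def to_query(expansions):
    """Build a boolean query: OR within each synonym group, AND between groups."""
    groups = []
    for original, synonyms in expansions.items():
        terms = [original] + [s for s in sorted(synonyms) if s != original]
        # WordNet joins multiword lemmas with "_"; render them as quoted phrases.
        rendered = ['"' + t.replace("_", " ") + '"' if "_" in t else t for t in terms]
        groups.append("(" + " OR ".join(rendered) + ")")
    return " AND ".join(groups)

# Hand-picked synonym groups (an assumption for illustration).
expansions = {
    "kissa": {"kotikissa", "Felis_catus"},
    "maukua": {"naukua"},
}
print(to_query(expansions))
```

Keeping one group per original word means the expanded query still requires every original concept to be present. Joining the flat set with `OR` instead would increase recall but let documents match on a single expanded term.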




### Word2vec

- "Similar words appear in similar contexts"
- https://github.com/tmikolov/word2vec

In [7]:
import lwvlib # https://github.com/fginter/wvlib_light

model=lwvlib.load("/home/jmnybl/pb34_wf_200_v2_skgram.bin",100000,500000)
print("kissa:", model.nearest("kissa"), "\n")
print("maukuu:", model.nearest("maukuu"), "\n") # note that these do not have to be base forms


kissa: [(0.88850498, 'kani'), (0.8579123, 'kissanpentu'), (0.85663795, 'pentu'), (0.8336221, 'marsu'), (0.80795115, 'katti'), (0.80793911, 'hamsteri'), (0.8023504, 'koira'), (0.798271, 'kisu'), (0.78334367, 'kirppu'), (0.77792805, 'susi')] 

maukuu: [(0.8782531, 'naukui'), (0.8746469, 'maukui'), (0.86840028, 'naukaisi'), (0.80436617, 'sähisi'), (0.80063224, 'kolli'), (0.79289144, 'sähähti'), (0.78519303, 'murisi'), (0.76773942, 'murahti'), (0.75686067, 'naaras'), (0.75560999, 'sihahti')] 
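
What a nearest-neighbour lookup like the one above computes can be sketched with cosine similarity over toy vectors. How `lwvlib` implements `nearest` internally is not shown here, so cosine similarity is an assumption, and the three-dimensional vectors are invented for illustration (the real model above is loaded from a trained binary file).

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors, n=2):
    """Return the n most similar words as (similarity, word) pairs."""
    query = vectors[word]
    scored = [(cosine(query, vec), other)
              for other, vec in vectors.items() if other != word]
    return sorted(scored, reverse=True)[:n]

# Invented toy vectors; real embeddings are learned from text.
TOY_VECTORS = {
    "kissa": [0.9, 0.1, 0.0],
    "kani": [0.8, 0.2, 0.1],
    "koira": [0.7, 0.3, 0.0],
    "peruna": [0.0, 0.1, 0.9],
}
print(nearest("kissa", TOY_VECTORS))
```

For query expansion one would typically keep only neighbours above some similarity threshold, since, as the real output above shows, the nearest words are related terms (kani, koira) rather than strict synonyms.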

