# Examples Chaining search
This notebook contains a number of examples of chaining linguistic resources: corpora, lexica and treebanks. Try the examples, or copy the code and customize the examples in the [Sandbox](Sandbox.ipynb).
 * For a tutorial, refer to our [Quickstart](Quickstart.pdf).
 * For reference, our library *chaininglib* is described in the documentation ([local](doc/_build/html/index.html) or [online](https://chaining-search.readthedocs.io/en/latest/)).
 * If you encounter any bugs or errors, please let us know via our [GitHub issue tracker](https://github.com/INL/chaining-search/issues) or send an e-mail to servicedesk@ivdnt.org.


## List of examples
### Corpora
 * [Corpus search](#corpus-search)
 * [Frequency of *zeker*+verb and *vast*+verb compared in a historical corpus](#freq-puur-zuiver)
 * [Frequency of *leuk*+verb and *fijn*+verb compared in a modern corpus](#freq-leuk-fijn)
 * [Search in corpus and filter on metadata](#corpus-filter-metadata)
 * [Visualizing h-dropping](#visualizing-h-dropping)
 * [Generate a lexicon from a corpus](#lexicon-from-corpus)
 * [Train a POS tagger on an annotated historical corpus](#pos-tagger)

### Lexica
 * [Lexicon search](#lexicon-search)

### Corpus + lexicon
 * [Build a frequency list of the lemma of some corpus output](#freq-lemma-corpus)
 * [Find occurrences of attributive adjectives not ending with -e, even though they are preceded by a definite article](#adjective-e)
 * [Look up inflected forms and spelling variants for a given lemma in a corpus](#inflected-spelling-corpus)
 * [Corpus frequency list of lemmata from lexicon with given lemma](#corpus-frequency-lemma-pos)
 * [Build a frequency table of some corpus, based on lemmata of a given lexicon](#freqtable-lemmalist)
 * [Search corpus for wordforms of lemma not included in lexicon](#corpus-wordforms-not-lexicon)
 
### Treebanks
 * [Treebank search](#treebank-search)
 * [Which objects of verb *geven* occur?](#treebank-objects-geven)
 ---

## Corpus

### Corpus search <a class="anchor" id="corpus-search"></a>

* Run the cell below to show the UI, and fill in your search query
* Choose one of the corpora:
  * **zeebrieven**: The Brieven als Buit (Letters as Loot) corpus, consisting of 17th and 18th century letters from Dutch sailors
  * **gysseling**: Corpus Gysseling, 13th century Dutch
  * **openchn**: Externally accessible part of the Corpus Hedendaags Nederlands. Corpus of contemporary Dutch from the Dutch Antilles and Suriname, retrieved from newspapers and websites.
  * **opus**: OPUS corpus of Dutch subtitles

In [1]:
from chaininglib.ui.search import create_corpus_ui
from chaininglib.ui.dfui import display_df

# Create corpus UI, creates references to field contents
corpusQueryField, corpusField = create_corpus_ui()



In [None]:


from chaininglib.search.CorpusQuery import *

corpus_name = corpusField.value
# BEWARE: we limit the results to 500 records here; not limiting may cause a search to take quite some time
df_corpus = create_corpus(corpus_name).pattern(r'[lemma="boek"]').max_results(500).search().kwic()
display_df(df_corpus)

 * Click the cell below and press Run to perform the given query

In [18]:
from chaininglib.search.CorpusQuery import *

query= corpusQueryField.value
corpus_name = corpusField.value

# BEWARE: we limit the results to 500 records here; not limiting may cause a search to take quite some time
df_corpus = create_corpus(corpus_name).pattern(query).max_results(500).search().kwic()
display_df(df_corpus, labels="Results")






### Frequency of *zeker*+verb and *vast*+verb compared <a class="anchor" id="freq-puur-zuiver"></a>
* Below cell searches for *zeker*+verb and for *vast*+verb in the Letters as Loot (zeebrieven) corpus
* Compare frequencies
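
The two queries in the cell below are built by string concatenation and differ only in the lemma. A minimal sketch of a helper that produces the same CQL pattern; `lemma_plus_verb_pattern` is a hypothetical name used for illustration and is not part of *chaininglib*:

```python
def lemma_plus_verb_pattern(lemma: str, within_sentence: bool = False) -> str:
    """Build a CQL pattern matching `lemma` directly followed by a verb."""
    pattern = f'[lemma="{lemma}"][pos="VRB.*"]'
    if within_sentence:
        # Restrict each match to a single sentence
        pattern += " within <s/>"
    return pattern


print(lemma_plus_verb_pattern("zeker"))
print(lemma_plus_verb_pattern("leuk", within_sentence=True))
```

The returned string can be passed to `.pattern(...)` in the same way as the hand-built strings.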

In [5]:
#from chaininglib import search
from IPython.display import display, HTML
from chaininglib.search.CorpusQuery import *
from chaininglib.ui.dfui import display_df
from chaininglib.utils.dfops import column_difference

corpus_name = "zeebrieven"

# Word 1: zeker
word1 = "zeker"
cq1 = create_corpus(corpus_name).pattern(r'[lemma="' + word1 + r'"][pos="VRB.*"]')
df_corpus1 = cq1.search().kwic()
display_df(df_corpus1, word1)

# Word 2: vast
word2 = "vast"
cq2 = create_corpus(corpus_name).pattern(r'[lemma="' + word2 + r'"][pos="VRB.*"]')
df_corpus2 = cq2.search().kwic()
display_df(df_corpus2, word2)

# Compute difference
diff_left, diff_right, intersec = column_difference(df_corpus1["lemma 1"], df_corpus2["lemma 1"])
# Elements of 1 that are not in 2
display(HTML('Verbs after <b>' + word1 + '</b>, but not after <b>' + word2 + '</b>: ' + ", ".join(diff_left)))
# Elements of 2 that are not in 1
display(HTML('Verbs after <b>' + word2 + '</b>, but not after <b>' + word1 + '</b>: ' + ", ".join(diff_right)))
# Elements both in 1 and 2
display(HTML('Verbs after both <b>' + word1 + '</b> and <b>' + word2 + '</b>: ' + ", ".join(intersec)))


| | left context | lemma 0 | pos 0 | word 0 | lemma 1 | pos 1 | word 1 | right context |
|---|---|---|---|---|---|---|---|---|
| 0 | onder weeg is UE sal | zeker | ADV | seeker | kunnen | VRB | kunne | sien hoi haer ge stael |
| 1 | maken soo salt maer alte | zeker | ADV | seckr | voortgaan | VRB | voort | gaen daerom meene ick het |
| 2 | dit zoo zijnde, dat niet | zeker | ADJ | zeker | weten | VRB | weet, | kom in t geval van |
| 3 | oope dat Ick daer van | zeker | ADJ | seeker | zullen | VRB | sal | weese maer het sal noch |
| 4 | doch ick en kan nietmendal | zeker | ADJ | seekers | schrijven | VRB | schrijuen, | ende verhoope UE in corte |
| 5 | niet en kan men niet | zeker | ADV | seecker | schrijven | VRB | Schrijven | God gun en geeft ons |
| 6 | Nes UWEGb genoegen, hie sal | zeker | ADV | seeker | melden | VRB | gemeld | hebben de doot van suster |
| 7 | kan ik uw ook niets | zeker | ADV | seeker | melden | VRB | melden, | want die producten dat wij |
| 8 | mijn vertrek kan ik niets | zeker | ADJ | zeeker | melden | VRB | melde | denke in ’t Laast van |





| | left context | lemma 0 | pos 0 | word 0 | lemma 1 | pos 1 | word 1 | right context |
|---|---|---|---|---|---|---|---|---|
| 0 | M.Dalij, die heeft mij voor | vast | ADV | vast | beloven | VRB | beloofd, | de helfte van sijne Reekening |
| 1 | tog kan ik het niet | vast | ADV | vast | schrijven | VRB | schryven | want daris somtys gen stat |
| 2 | ghe sicht dat wij voor | vast | ADV | vast | vertrouwen | VRB | vertrouden | dat altemael turcken waeren wij |
| 3 | bessten koomen dan gij kend | vast | ADV | vast | geloven | VRB | gelooven | dat gij mij niet half |
| 4 | ik kan het niet voor | vast | ADV | vast | schrijven | VRB | schryve | Dog hoop ik dat Got |
| 5 | Vertrek Kan ik nog niet | vast | ADV | fast | melden | VRB | Melden | want Het is Hier Zeer |
| 6 | godt vartrout dij heft soo | vast | ADV | vast | bouwen | VRB | gebout | hijer mede wens jck mij |
| 7 | en dan sal wy wel | vast | ADV | vas | afdanken | VRB | afgedak | worden moeder doet de groetenis |
| 8 | daar zyn E myn soo | vast | ADV | vast | beloven | VRB | belooft | heeft geen schip te laten |
| 9 | en dat agter ’t schip | vast | ADV | vast | zijn | VRB | was | hebben wy verlooren, een groote |
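
The comparison step can be illustrated with plain Python sets. This is a minimal sketch, assuming `column_difference` returns the two set differences and the intersection, as the variable names `diff_left`, `diff_right` and `intersec` suggest; the library's actual implementation may differ, for example in ordering. The input lists are the `lemma 1` values of the two result tables above, and `column_sets` is a hypothetical name.

```python
def column_sets(left, right):
    """Return (only in left, only in right, in both) as sorted lists."""
    left_set, right_set = set(left), set(right)
    return (
        sorted(left_set - right_set),
        sorted(right_set - left_set),
        sorted(left_set & right_set),
    )


# "lemma 1" values from the zeker and vast result tables
verbs_after_zeker = ["kunnen", "voortgaan", "weten", "zullen", "schrijven",
                     "schrijven", "melden", "melden", "melden"]
verbs_after_vast = ["beloven", "schrijven", "vertrouwen", "geloven", "schrijven",
                    "melden", "bouwen", "afdanken", "beloven", "zijn"]

only_zeker, only_vast, both = column_sets(verbs_after_zeker, verbs_after_vast)
print("Only after zeker:", only_zeker)
print("Only after vast:", only_vast)
print("After both:", both)  # ['melden', 'schrijven']
```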




### Frequency of *leuk*+verb and *fijn*+verb compared <a class="anchor" id="freq-leuk-fijn"></a>
* Same as above, but now in a modern corpus (opus), with modern words
* This case study gathers much more data, so it may take more time to compute results

In [16]:
#from chaininglib import search
from IPython.display import display, HTML
from chaininglib.search.CorpusQuery import *
from chaininglib.ui.dfui import display_df
from chaininglib.utils.dfops import column_difference

corpus_name = "opus"
# Word 1: leuk
word1 = "leuk"
cq1 = create_corpus(corpus_name).max_results(10000).pattern(r'[lemma="' + word1 + r'"][pos="VRB.*"] within <s/>')
df_corpus1 = cq1.search().kwic()
display_df(df_corpus1, word1)
# Word 2: fijn
word2 = "fijn"
cq2 = create_corpus(corpus_name).max_results(10000).pattern(r'[lemma="' + word2 + r'"][pos="VRB.*"] within <s/>')
df_corpus2 = cq2.search().kwic()
display_df(df_corpus2, word2)
# Compute difference
diff_left, diff_right, intersec = column_difference(df_corpus1["lemma 1"], df_corpus2["lemma 1"])
# Elements of 1 that are not in 2
display(HTML('Verbs after <b>' + word1 + '</b>, but not after <b>' + word2 + '</b>: ' + ", ".join(diff_left)))
# Elements of 2 that are not in 1
display(HTML('Verbs after <b>' + word2 + '</b>, but not after <b>' + word1 + '</b>: ' + ", ".join(diff_right)))
# Elements both in 1 and 2
display(HTML('Verbs after both <b>' + word1 + '</b> and <b>' + word2 + '</b>: ' + ", ".join(intersec)))


Unnamed: 0,left context,lemma 0,pos 0,word 0,lemma 1,pos 1,word 1,right context
0,en twee keer voor ja,leuk,"AA(degree=pos,position=adv|pred)",Leuk,horen,"VRB(finiteness=fin,mood=imp|ind,tense=pres,number=sg)",hoor,Ik ken die man Welke
1,precies boven ons Dit kan,leuk,"AA(degree=pos,position=adv|pred)",leuk,worden,VRB(finiteness=ger|inf),worden,Ik denk eerder juist heel
2,Geef terug Willen jullie iets,leuk,"AA(degree=pos,position=postnom,case=gen,formal=infl-s)",leuks,zien,VRB(finiteness=ger|inf),zien,Hij vindt t klote Gaaf
3,je je zal ze niet,leuk,"AA(degree=pos,position=adv|pred)",leuk,vinden,VRB(finiteness=ger|inf),vinden,Wanneer komt Sherry hier Vijf
4,je je zal ze niet,leuk,"AA(degree=pos,position=adv|pred)",leuk,vinden,VRB(finiteness=ger|inf),vinden,Wanneer komt Sherry hier Vijf
...,...,...,...,...,...,...,...,...
9995,het rondneuzen Ze vindt je,leuk,"AA(degree=pos,position=adv|pred)",leuk,weten,"VRB(finiteness=fin,mood=ind,tense=past,number=sg)",wist,je dat Pardon Iona Kom
9996,mevrouw U zult het hier,leuk,"AA(degree=pos,position=adv|pred)",leuk,gaan,VRB(finiteness=ger|inf),gaan,vinden Ik had gedacht dat
9997,ik weet dat je dit,leuk,"AA(degree=pos,position=adv|pred)",leuk,vinden,"VRB(finiteness=fin,mood=ind,tense=pres,number=sg,person=2|3,formal=infl-t)",vindt,Uw briefje was zorgelijk Voelt
9998,weet dat het niet erg,leuk,"AA(degree=pos,position=adv|pred)",leuk,zijn,"VRB(finiteness=fin,mood=imp|ind,tense=pres,number=sg)",is,Dat weet ik wel Maar





Unnamed: 0,left context,lemma 0,pos 0,word 0,lemma 1,pos 1,word 1,right context
0,oudste zoon Ik zou het,fijn,"AA(degree=pos,position=adv|pred)",fijn,vinden,VRB(finiteness=ger|inf),vinden,als jij wat met me
1,Ik zal even achter kijken,fijn,"AA(degree=pos,position=adv|pred)",Fijn,bedanken,"VRB(finiteness=part,tense=past)",bedankt,Een moment alstublieft Ik kan
2,Ik zal even achter kijken,fijn,"AA(degree=pos,position=adv|pred)",Fijn,bedanken,"VRB(finiteness=part,tense=past)",bedankt,Een moment alstublieft Ik kan
3,Ik zal even achter kijken,fijn,"AA(degree=pos,position=adv|pred)",Fijn,bedanken,"VRB(finiteness=part,tense=past)",bedankt,Een moment alstublieft Ik kan
4,komen vele broeders Het zou,fijn,"AA(degree=pos,position=adv|pred)",fijn,zijn,VRB(finiteness=ger|inf),zijn,als je zou huilen Het
...,...,...,...,...,...,...,...,...
9995,rij zelf mee Het zou,fijn,"AA(degree=pos,position=adv|pred)",fijn,zijn,VRB(finiteness=ger|inf),zijn,als we je naam wisten
9996,rij zelf mee Het zou,fijn,"AA(degree=pos,position=adv|pred)",fijn,zijn,VRB(finiteness=ger|inf),zijn,als we je naam wisten
9997,mormel Weet je wat ik,fijn,"AA(degree=pos,position=adv|pred)",fijn,vinden,"VRB(finiteness=fin,mood=imp|ind,tense=pres,number=sg)",vind,aan vrouwen Perfectie Lange benen
9998,zou ik hem niet zo,fijn,"AA(degree=pos,position=adv|pred)",fijn,hebben,"VRB(finiteness=fin,mood=ind,tense=pres,number=pl)",hebben,kunnen beledigen O ja en




### Train a POS tagger on an annotated historical corpus and tag a historical sentence <a class="anchor" id="pos-tagger"></a>

In [None]:
from chaininglib.ui.dfui import display_df
from chaininglib.process.corpus import get_tagger
from chaininglib.search.CorpusQuery import *
from chaininglib.search.LexiconQuery import *

import pandas as pd

# gather some pattern including our word, out of annotated corpora

dfs_all_corpora = pd.DataFrame()
some_lemma = "lopen"

for one_corpus in ['zeebrieven', 'gysseling']:
    print('querying '+one_corpus+'...')
    c = create_corpus(one_corpus).lemma(some_lemma).detailed_context(True).search()
    df_corpus = c.kwic() 
    
    # store the results
    dfs_all_corpora = pd.concat( [dfs_all_corpora, df_corpus] )


# get a tagger trained with our corpus data
tagger = get_tagger(dfs_all_corpora, pos_key = 'pos') 

# Use the trained tagger to tag unknown sentences
# The input must be like: tagger.tag(['today','is','a','beautiful','day'])

sentence = 'De menschen lopen wel haesteliken op enen bome ghelike een ape'
tagged_sentence = tagger.tag( sentence.split() )

print(tagged_sentence)


# Now we can lemmatize each occurrence of our lemma in the new sentences
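
The cell above uses `get_tagger` as a black box. To illustrate the general idea of training a tagger on annotated tokens, here is a minimal unigram-tagger sketch in plain Python. It is not necessarily what `get_tagger` does internally; the class `UnigramTagger`, the training pairs and the `ADP` and `UNKNOWN` tags are hypothetical and chosen for illustration only.

```python
from collections import Counter, defaultdict


class UnigramTagger:
    """Tag each word with the tag it carried most often in the training data."""

    def __init__(self, tagged_tokens, default_tag="UNKNOWN"):
        counts = defaultdict(Counter)
        for word, tag in tagged_tokens:
            counts[word.lower()][tag] += 1
        # Keep only the most frequent tag per word
        self.best_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        self.default_tag = default_tag

    def tag(self, words):
        """Return a list of (word, tag) pairs for a tokenized sentence."""
        return [(w, self.best_tag.get(w.lower(), self.default_tag)) for w in words]


# Hypothetical training pairs (word, POS); real data would come from the corpus
training = [("de", "PD"), ("menschen", "NOU"), ("lopen", "VRB"),
            ("lopen", "VRB"), ("lopen", "NOU"), ("op", "ADP")]
toy_tagger = UnigramTagger(training)
toy_result = toy_tagger.tag("De menschen lopen snel".split())
print(toy_result)
```

Words that never occurred in the training data receive the default tag, which is why a larger and more varied training set generally gives better coverage.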

### Search in corpus and filter on metadata <a class="anchor" id="corpus-filter-metadata"></a>
First, we request all available metadata fields of the corpus. Then, we issue a search query, and request all metadata fields for the result. Finally, we filter on metadata values.

In [None]:
from chaininglib.search.metadata import get_available_metadata
from chaininglib.utils.dfops import df_filter, property_freq
from chaininglib.ui.dfui import display_df
from chaininglib.search.CorpusQuery import *


corpus_name="zeebrieven"
query=r'[lemma="boek"]'
# Request all metadata fields from corpus
fields = get_available_metadata(corpus_name)
# Perform query and ask all metadata
c = create_corpus(corpus_name).pattern(query).extra_fields_doc(fields["document"]).search()
df_corpus = c.kwic()

# Filter on year: > 1700
df_filter_year = df_corpus[df_corpus["witnessYear_from"].astype('int32') > 1700] 
display_df(df_filter_year, labels="After 1700")

# Filter on sender birth place Amsterdam
condition = df_filter(df_corpus["afz_geb_plaats"], pattern="Amsterdam")
df_filter_place = df_corpus[ condition ]
display_df(df_filter_place, labels="Sender born in Amsterdam")


# Group by sender location
df = property_freq(df_corpus,"afz_loc_plaats")
display_df(df, labels="Most frequent sender locations")
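
The year filter above converts `witnessYear_from` with `astype('int32')`, which raises a `ValueError` as soon as one value cannot be parsed as an integer, for instance an empty or whitespace-only field. A minimal sketch of a more forgiving filter in plain Python; the records are hypothetical toy data, only the field names are taken from the cell above, and `parse_year` is a hypothetical helper:

```python
def parse_year(value):
    """Return the year as an int, or None if the field is empty or not a number."""
    try:
        return int(str(value).strip())
    except ValueError:
        return None


# Hypothetical records; only the field names come from the notebook cell
records = [
    {"witnessYear_from": "1664", "afz_geb_plaats": "Amsterdam"},
    {"witnessYear_from": "1780", "afz_geb_plaats": "Vlissingen"},
    {"witnessYear_from": "\n        ", "afz_geb_plaats": ""},
    {"witnessYear_from": "1705", "afz_geb_plaats": "Amsterdam"},
]

# Keep records with a usable year after 1700; unparseable years are skipped
after_1700 = [
    r for r in records
    if (year := parse_year(r["witnessYear_from"])) is not None and year > 1700
]
print([r["witnessYear_from"] for r in after_1700])  # ['1780', '1705']
```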

### Visualizing h-dropping  <a class="anchor" id="visualizing-h-dropping"></a>

In [6]:

from chaininglib.search.CorpusQuery import *
from chaininglib.search.metadata import get_available_metadata
from chaininglib.ui.dfui import display_df
from chaininglib.utils.dfops import join_df
 
corpus_to_search="zeebrieven"
group_by_column = 'afz_geb_plaats'

fields = get_available_metadata(corpus_to_search)

print('Get both normal and h-dropping language data')

df_corpus1 = create_corpus(corpus_to_search).pattern(r'[lemma="h[aeo].*" & word="h[aeo].*"]').extra_fields_doc(fields["document"]).search().kwic()
df1_filtered = df_corpus1[df_corpus1.apply(lambda x: len(x[group_by_column]) > 0, axis=1)]
df_corpus2 = create_corpus(corpus_to_search).pattern(r'[lemma="h[aeo].*" & word="[aeo].*"]').extra_fields_doc(fields["document"]).search().kwic()
df2_filtered = df_corpus2[df_corpus2.apply(lambda x: len(x[group_by_column]) > 0, axis=1)]

print('Get neutral language data')

df_corpus1and2 = create_corpus(corpus_to_search).pattern(r'[lemma="h[aeo].*"]').extra_fields_doc(fields["document"]).search().kwic()
df_corpus1and2.rename(columns={'lemma 0':'total'}, inplace=True)
df1and2_filtered = df_corpus1and2[df_corpus1and2.apply(lambda x: len(x[group_by_column]) > 0, axis=1)]

print('Counting...')

group1 = df1_filtered[['lemma 0', group_by_column]].groupby(group_by_column).count()
group2 = df2_filtered[['lemma 0', group_by_column]].groupby(group_by_column).count()
group1and2 = df1and2_filtered[['total', group_by_column]].groupby(group_by_column).count() 

# compute relative counts
df1 = join_df( [ group1, group1and2['total'] ] )
df2 = join_df( [ group2, group1and2['total'] ] )
df1['lemma 0'] = df1['lemma 0'] / df1['total']
df2['lemma 0'] = df2['lemma 0'] / df2['total']

# clean up data
df1 = df1.dropna()
df2 = df2.dropna()
df1 = df1.drop(['total'], axis=1)
df2 = df2.drop(['total'], axis=1)

print('Draw charts showing geographic differences between normal language and h-dropping')

display_df( df1.sort_values(ascending=False,by=['lemma 0']).head(25), labels="normal", mode='chart') 
display_df( df2.sort_values(ascending=False,by=['lemma 0']).head(25), labels="h-dropping", mode='chart')


Get both normal and h-dropping language data

ValueError: An error occured when searching corpus zeebrieven: invalid literal for int() with base 10: '\n        '
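
The counting logic of this example can be sketched without the corpus backend. The snippet below is a minimal illustration on hypothetical tokens: it keeps lemmata matching `h[aeo].*`, skips tokens without a birth place, and computes the share of word forms that lack the initial *h* per place. It assumes that the regular expressions in the CQL queries match the whole token, which is mirrored here with `re.fullmatch`.

```python
import re
from collections import Counter

# Hypothetical tokens: (lemma, word form, sender birth place)
tokens = [
    ("hebben", "hebben", "Amsterdam"),
    ("hebben", "ebben", "Vlissingen"),
    ("hopen", "hoope", "Vlissingen"),
    ("hart", "art", "Vlissingen"),
    ("huis", "huys", "Amsterdam"),   # lemma does not match h[aeo].*, ignored
    ("horen", "hoort", ""),          # no birth place, ignored
]

H_LEMMA = re.compile(r"h[aeo].*")
H_DROPPED = re.compile(r"[aeo].*")

total, dropped = Counter(), Counter()
for lemma, word, place in tokens:
    if not place or not H_LEMMA.fullmatch(lemma):
        continue
    total[place] += 1
    if H_DROPPED.fullmatch(word):
        dropped[place] += 1

# Relative frequency of h-dropping per birth place
h_dropping_rate = {place: dropped[place] / total[place] for place in total}
print(h_dropping_rate)
```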

### Generate a lexicon from a corpus <a class="anchor" id="lexicon-from-corpus"></a>

In [None]:
from chaininglib.ui.dfui import display_df
from chaininglib.process.corpus import extract_lexicon
from chaininglib.search.CorpusQuery import *

corpus_name = "zeebrieven"

print('Querying '+corpus_name+'...')
c = create_corpus(corpus_name).pos("NOU").detailed_context(True).search()
df_corpus = c.kwic() 
    
# extract lexicon and show the result
extracted_lexicon = extract_lexicon(df_corpus, posColumnName="pos")
display(extracted_lexicon)

## Lexicon

### Lexicon search <a class="anchor" id="lexicon-search"></a>

* Run the cell below to show the UI, and fill in your search query in the UI
* Choose one of the lexica:
  * **anw**: Algemeen Nederlands Woordenboek, dictionary of contemporary Dutch
  * **celex**: Lexical database of the Dutch language
  * **duelme**: Dutch Electronic Lexicon of Multiword Expressions
  * **molex**: A lexicon of modern Dutch, containing spelling variants per lemma
  * **mnwlex**: Lexicon service access to Middelnederlands Woordenboek, a lexicon of Middle Dutch (ca. 1250 - 1550)
  * **lexicon_service_db**: Lexicon service access to a lexicon of early modern and modern Dutch (1500 - 1976)

In [7]:
from chaininglib.ui.search import create_lexicon_ui

#from chaininglib import ui
searchWordField, lexiconField = create_lexicon_ui()


 * Click the cell below and press Run to perform the given query

In [16]:
from chaininglib.search.LexiconQuery import *
from chaininglib.ui.dfui import display_df

search_word = searchWordField.value
lexicon_name = lexiconField.value
# You can replace this with your own custom query
lex = create_lexicon(lexicon_name).lemma(search_word).search()
df_lexicon = lex.kwic()
display_df(df_lexicon)
#df_columns_list = list(df_lexicon.columns.values)
#df_lexicon_in_columns = df_lexicon[df_columns_list]
#display(df_lexicon_in_columns)


Unnamed: 0,lemEntryId,lemma,lemPos,wordformId,wordform,hyphenation,wordformPos,Gender,Number
0,http://rdf.ivdnt.org/lexica/diamant/entry/molex/10919,boek,http://universaldependencies.org/u/pos/NOUN,http://rdf.ivdnt.org/lexica/diamant/wordform/molex/179523,boek,boek,http://universaldependencies.org/u/pos/NOUN,http://universaldependencies.org/u/feat/Gender.html#Neut,http://universaldependencies.org/u/feat/Number.html#Sing
1,http://rdf.ivdnt.org/lexica/diamant/entry/molex/10919,boek,http://universaldependencies.org/u/pos/NOUN,http://rdf.ivdnt.org/lexica/diamant/wordform/molex/39435,boeken,boe|ken,http://universaldependencies.org/u/pos/NOUN,http://universaldependencies.org/u/feat/Gender.html#Neut,http://universaldependencies.org/u/feat/Number.html#Plur
2,http://rdf.ivdnt.org/lexica/diamant/entry/molex/104054,boeken,http://universaldependencies.org/u/pos/VERB,http://rdf.ivdnt.org/lexica/diamant/wordform/molex/365579,boek,boek,http://universaldependencies.org/u/pos/VERB,,http://universaldependencies.org/u/feat/Number.html#Sing
3,http://rdf.ivdnt.org/lexica/diamant/entry/molex/104054,boeken,http://universaldependencies.org/u/pos/VERB,http://rdf.ivdnt.org/lexica/diamant/wordform/molex/803734,boek,boek,http://universaldependencies.org/u/pos/VERB,,http://universaldependencies.org/u/feat/Number.html#Sing



## Corpus + lexicon

###  Build a frequency list of the lemma of some corpus output <a class="anchor" id="freq-lemma-corpus"></a>

In [None]:
from chaininglib.search.CorpusQuery import *
from chaininglib.process.corpus import *
from chaininglib.ui.dfui import *

# max number of records to retrieve (speed!)
max_nr_of_records = 50000

# do some corpus search
print('This can take a while... please wait!')

corpus_to_search="openchn"
df_corpus = create_corpus(corpus_to_search).detailed_context(True).pos("NOU.*").max_results(max_nr_of_records).search().kwic()

# compute and display a table of the frequencies of the lemmata
freq_df = get_frequency_list(df_corpus)
display_df(freq_df)
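
What a lemma frequency list amounts to can be shown with `collections.Counter`. This is only a sketch on a hypothetical lemma column; `get_frequency_list` may return a differently shaped result (for example a DataFrame with extra columns), and `frequency_list` is a hypothetical name.

```python
from collections import Counter


def frequency_list(lemmata):
    """Return (lemma, count) pairs, most frequent first."""
    return Counter(lemmata).most_common()


# Hypothetical lemma column of a corpus result
lemmata = ["boek", "huis", "boek", "man", "boek", "huis"]
print(frequency_list(lemmata))  # [('boek', 3), ('huis', 2), ('man', 1)]
```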

### Find occurrences of attributive adjectives not ending with -e, even though they are preceded by a definite article <a class="anchor" id="adjective-e"></a>

In [10]:
from chaininglib.search.CorpusQuery import *
from chaininglib.search.LexiconQuery import *
from chaininglib.utils.dfops import df_filter
from chaininglib.ui.dfui import display_df

corpus_to_search="openchn"
lexicon_to_search="molex"

# CORPUS: get [article + attributive adjective + noun] combinations in which the adjective (restricted here to words starting with g-) does not end with -e
print('Get occurences of attributive adjectives not ending with -e')
cq = create_corpus(corpus_to_search).pattern(r'[lemma="de|het"][word="^g(.+)[^e]$" & pos="AA.*degree=pos.*"][pos="NOU.*"]')
df_corpus = cq.search().kwic()
display(df_corpus)

# LEXICON: get adjectives the lemma of which does not end with -e 
#lq = create_lexicon(lexicon_to_search).lemma('^g(.+)[^e]$').pos('ADJ').search()
# Build the query first; search() is called only once, when retrieving the results
lq = create_lexicon(lexicon_to_search).lemma('^g(.+)[^e]$').pos('ADJ(degree=pos)')
df_lexicon = lq.search().kwic()

display(df_lexicon)

# LEXICON: get adjectives having a final -e in definite attributive use
print('Filtering lexicon results')
final_e_condition = df_filter(df_lexicon["wordform"], 'e$')
df_lexicon_form_e = df_lexicon[ final_e_condition ]

# RESULT: get the records out of our first list in which the -e-less-adjectives match the lemma form of our last list
print('List of attributive adjectives not ending with -e even though they should have a final -e:')
e_forms = list(df_lexicon_form_e.lemma)
no_final_e_condition = df_filter(df_corpus["word 1"], pattern=set(e_forms), method="isin")
result_df = df_corpus[ no_final_e_condition ]
display_df( result_df )

Get occurences of attributive adjectives not ending with -e

Unnamed: 0,left context,lemma 0,pos 0,word 0,lemma 1,pos 1,word 1,lemma 2,pos 2,word 2,right context
0,van Duijneveldt met The Unborn,het,"PD(type=d-p,subtype=art-def)",het,gouden,"AA(degree=pos,position=prenom)",gouden,ei,"NOU-C(gender=n,number=sg)",ei,symbool van hoop en belofte
1,s get wild edition heet,het,"PD(type=pers,person=3,gender=n,number=sg,position=pron)",het,gratis,"AA(degree=pos,position=adv|pred)",gratis,flip,NOU-P(part-of-multiword=true),Flip,Flop-muziekevenement dat op zondag 9
2,lekker snoepen van alles wat,de,"PD(type=d-p,subtype=art-def)",de,goedheilig,"AA(degree=pos,position=prenom)",goedheilig,man,"NOU-C(gender=f|m,number=sg)",man,uitdeelt Oh ja en dan
3,Freeport een geldbedrag uitlooft voor,de,"PD(type=d-p,subtype=art-def)",de,gouden,"AA(degree=pos,position=prenom)",gouden,tip,"NOU-C(gender=f|m,number=sg)",tip,zo vertelt eigenaar Mahesh Mukti
4,plaatste vervolgens een foto van,het,"PD(type=d-p,subtype=art-def)",het,vinden,"AA(position=prenom,degree=pos)",gevonden,toestel,"NOU-C(gender=n,number=sg)",toestel,op Twitter in de hoop
...,...,...,...,...,...,...,...,...,...,...,...
9553,onderzoekers een kijkje nemen bij,de,"PD(type=d-p,subtype=art-def)",de,gezonken,"AA(position=prenom,degree=pos)",gezonken,titanic,NOU-P(number=sg),Titanic,Tijdens de twintig dagen durende
9554,het uitzoeken en registreren van,de,"PD(type=d-p,subtype=art-def)",de,gegeven,"AA(position=prenom,degree=pos)",gegeven,collectie,"NOU-C(gender=f|m,number=sg)",collectie,
9555,van een kampleider en tegen,de,"PD(type=d-p,subtype=art-def)",de,gedwongen,"AA(position=prenom,degree=pos)",gedwongen,deportatie,"NOU-C(gender=f|m,number=sg)",deportatie,naar LaosErkend vluchtelingDe Amerikaanse regering
9556,en verrassende voorbeelden zoals van,de,"PD(type=d-p,subtype=art-def)",de,gouden,"AA(degree=pos,position=prenom)",gouden,koets,"NOU-C(gender=f|m,number=sg)",koets,en een haast honderdjarige boom



Unnamed: 0,lemEntryId,lemma,lemPos,wordformId,wordform,hyphenation,wordformPos
0,http://rdf.ivdnt.org/lexica/diamant/entry/molex/27460,gewafeld,http://universaldependencies.org/u/pos/ADJ,http://rdf.ivdnt.org/lexica/diamant/wordform/molex/102144,gewafelde,ge|wa|fel|de,http://universaldependencies.org/u/pos/ADJ
1,http://rdf.ivdnt.org/lexica/diamant/entry/molex/28905,granaten,http://universaldependencies.org/u/pos/ADJ,http://rdf.ivdnt.org/lexica/diamant/wordform/molex/102404,granaten,gra|na|ten,http://universaldependencies.org/u/pos/ADJ
2,http://rdf.ivdnt.org/lexica/diamant/entry/molex/26787,gereglementeerd,http://universaldependencies.org/u/pos/ADJ,http://rdf.ivdnt.org/lexica/diamant/wordform/molex/102662,gereglementeerde,ge|re|gle|men|teer|de,http://universaldependencies.org/u/pos/ADJ
3,http://rdf.ivdnt.org/lexica/diamant/entry/molex/28966,grasrijk,http://universaldependencies.org/u/pos/ADJ,http://rdf.ivdnt.org/lexica/diamant/wordform/molex/104989,grasrijke,gras|rij|ke,http://universaldependencies.org/u/pos/ADJ
4,http://rdf.ivdnt.org/lexica/diamant/entry/molex/25753,geil,http://universaldependencies.org/u/pos/ADJ,http://rdf.ivdnt.org/lexica/diamant/wordform/molex/105604,geilst,geilst,http://universaldependencies.org/u/pos/ADJ
...,...,...,...,...,...,...,...
7942,http://rdf.ivdnt.org/lexica/diamant/entry/molex/846376,grondrechtelijk,http://universaldependencies.org/u/pos/ADJ,http://rdf.ivdnt.org/lexica/diamant/wordform/molex/870801,grondrechtelijks,grond|rech|te|lijks,http://universaldependencies.org/u/pos/ADJ
7943,http://rdf.ivdnt.org/lexica/diamant/entry/molex/846376,grondrechtelijk,http://universaldependencies.org/u/pos/ADJ,http://rdf.ivdnt.org/lexica/diamant/wordform/molex/870802,grondrechtelijke,grond|rech|te|lij|ke,http://universaldependencies.org/u/pos/ADJ
7944,http://rdf.ivdnt.org/lexica/diamant/entry/molex/846376,grondrechtelijk,http://universaldependencies.org/u/pos/ADJ,http://rdf.ivdnt.org/lexica/diamant/wordform/molex/870803,grondrechtelijk,grond|rech|te|lijk,http://universaldependencies.org/u/pos/ADJ
7945,http://rdf.ivdnt.org/lexica/diamant/entry/molex/846376,grondrechtelijk,http://universaldependencies.org/u/pos/ADJ,http://rdf.ivdnt.org/lexica/diamant/wordform/molex/870804,grondrechtelijkste,grond|rech|te|lijk|ste,http://universaldependencies.org/u/pos/ADJ


Filtering lexicon results
List of attributive adjectives not ending with -e even though they should have a final -e:


Unnamed: 0,left context,lemma 0,pos 0,word 0,lemma 1,pos 1,word 1,lemma 2,pos 2,word 2,right context
2,lekker snoepen van alles wat,de,"PD(type=d-p,subtype=art-def)",de,goedheilig,"AA(degree=pos,position=prenom)",goedheilig,man,"NOU-C(gender=f|m,number=sg)",man,uitdeelt Oh ja en dan
10,onvoorwaardelijk gesteund door hulpverleners in,het,"PD(type=d-p,subtype=art-def)",het,gesloten,"AA(position=prenom,degree=pos)",gesloten,instituut,"NOU-C(gender=n,number=sg)",instituut,voor jongeren met gedragsproblemen waarin
17,vier spirituele wetten zijn vergiffenis,het,"PD(type=d-p,subtype=art-def)",het,goddelijk,"AA(degree=pos,position=prenom)",goddelijk,doel,"NOU-C(gender=n,number=sg)",doel,doelen stellen in het leven
18,het museum ter ere van,de,"PD(type=d-p,subtype=art-def)",de,geestelijk,"AA(degree=pos,position=prenom)",geestelijk,vader,"NOU-C(gender=f|m,number=sg)",vader,van de fictieve indiaan Winnetou
26,van de jouvers die daarmee,de,"PD(type=d-p,subtype=art-def)",de,gebroken,"AA(position=prenom,degree=pos)",gebroken,nacht,"NOU-C(gender=f|m,number=sg)",nacht,hadden overleefd Tegen acht uur
...,...,...,...,...,...,...,...,...,...,...,...
9542,hadden gezworen Zij hadden met,het,"PD(type=d-p,subtype=art-def)",het,gewelddadig,"AA(degree=pos,position=prenom)",gewelddadig,omverwerpen,"NOU-C(gender=n,number=sg)",omverwerpen,van de rechtsorde en krijgstucht
9548,Helemaal geen probleem vinden zij,het,"PD(type=d-p,subtype=art-def)",Het,gezond,"AA(degree=pos,position=adv|pred)",gezond,houden,"NOU-C(gender=n,number=sg)",houden,van mensen in het binnenland
9549,tot stand wordt gebracht zal,het,"PD(type=pers,person=3,gender=n,number=sg,position=pron)",het,gewoon,"AA(degree=pos,position=prenom)",gewoon,onderdeel,"NOU-C(gender=n,number=sg)",onderdeel,worden van het nationaal zekerheidsstelsel
9550,op 29 oktober in NIS,het,"PD(type=d-p,subtype=art-def)",Het,gevarieerd,"AA(position=prenom,degree=pos)",gevarieerd,programma,"NOU-C(gender=n,number=sg)",programma,zal uit zang bestaan van



### Look up inflected forms and spelling variants for a given lemma in a corpus <a class="anchor" id="inflected-spelling-corpus"></a>

In [None]:
from chaininglib.ui.dfui import display_df
from chaininglib.search.CorpusQuery import *
from chaininglib.search.LexiconQuery import *

# Corpus Gysseling and lexicon mnwlex are from the same period: 1250-1550
lexicon_to_search="mnwlex"
corpus_to_search="gysseling"

##############################################
# TODO: do the same with multiple lemmata, grouped
##############################################

lemma_to_look_for="denken"

# LEXICON: Search for the inflected forms of a lemma in a morphosyntactic lexicon
lq = create_lexicon(lexicon_to_search).lemma(lemma_to_look_for).search()
df_lexicon = lq.kwic()
display_df(df_lexicon)

# Put all inflected forms into a list
inflected_wordforms = list(df_lexicon.wordform)

# CORPUS: Look up the inflected forms in a (possibly unannotated) corpus
# Beware: if the corpus is not annotated, all we can do is search for the inflected wordforms.
#         If the corpus is lemmatized, we also specify the lemma, to make sure we retrieve the right data.
annotated_corpus = True
query = r'[lemma="'+lemma_to_look_for+r'" & word="'+r"|".join(inflected_wordforms)+r'"]' if annotated_corpus else r'[word="'+r"|".join(inflected_wordforms)+r'"]'
cq = create_corpus(corpus_to_search).pattern(query).search()
df_corpus = cq.kwic() 
display_df(df_corpus)
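
The query string in the cell above is assembled by plain string concatenation. As a minimal sketch, the same pattern could be built by a small helper that also drops duplicate wordforms and escapes regular-expression metacharacters. The helper name `build_cql_query` is hypothetical and not part of chaininglib, the sample wordforms are illustrative, and whether the corpus backend needs this escaping is an assumption.

```python
import re

def build_cql_query(wordforms, lemma=None):
    """Hypothetical helper: build a one-token CQL pattern matching any of the wordforms."""
    # Drop empty strings and duplicates, keeping the original order
    unique_forms = list(dict.fromkeys(w for w in wordforms if w))
    # Escape regex metacharacters (assumption: attribute values are read as regular expressions)
    alternation = "|".join(re.escape(w) for w in unique_forms)
    if lemma is not None:
        # Annotated corpus: restrict the hits to the requested lemma as well
        return '[lemma="' + re.escape(lemma) + '" & word="' + alternation + '"]'
    # Unannotated corpus: only the wordforms themselves can be matched
    return '[word="' + alternation + '"]'

annotated_query = build_cql_query(["denkt", "dacht", "denkt"], lemma="denken")
unannotated_query = build_cql_query(["denkt", "dacht"])
print(annotated_query)
print(unannotated_query)
```

Note that `re.escape` leaves double quotes untouched, so a wordform containing `"` would still need separate handling before it is placed in the pattern.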

### Corpus frequency list of lemmata from lexicon with given lemma <a class="anchor" id="corpus-frequency-lemma-pos"></a>
Build a function with which we can gather all lemmata of a lexicon, and build a frequency list of those lemmata in a corpus.

In [None]:
from chaininglib.search.LexiconQuery import *
from chaininglib.search.CorpusQuery import *
from chaininglib.process.corpus import get_frequency_list
from chaininglib.ui.dfui import display_df
import numpy as np
import pandas as pd  # needed for pd.DataFrame and pd.concat below


# build a function as required. We will run it afterwards

def get_frequency_list_given_a_corpus(lexicon, pos, corpus):
    
    # LEXICON: get a lemmata list to work with

    # query the lexicon for lemma with a given part-of-speech
    lq = create_lexicon(lexicon).pos(pos).search()
    df_lexicon = lq.kwic()

    # Put the results into a list, so we can loop through the found lemmata
    # (the slice keeps only the last 200 lemmata)
    lexicon_lemmata_arr = [w.lower() for w in df_lexicon["writtenForm"]][-200:]
    # Instantiate a DataFrame, in which we will gather all single lemma occurrences
    df_full_list = pd.DataFrame()


    # CORPUS: loop through the lemmata list, query the corpus with each lemma, and count the results

    # It's a good idea to query more than one lemma at a time,
    # but not too many, otherwise the server will get overloaded!
    nr_of_lemmata_to_query_atonce = 100

    # loop over lemma list 
    for i in range(0, len(lexicon_lemmata_arr), nr_of_lemmata_to_query_atonce):
        
        print('Lemmata processed: '+str(i)+'/'+str(len(lexicon_lemmata_arr)))
        
        # slice to small array of lemmata to query at once
        small_lemmata_arr = lexicon_lemmata_arr[i : i+nr_of_lemmata_to_query_atonce] 

        # join set of lemmas to send them in a query all at once
        # beware: single quotes need escaping
        lemmata_list = "|".join(small_lemmata_arr).replace("'", "\\\\'")
        cq = create_corpus(corpus).pattern(r'[lemma="' + lemmata_list + r'"]').search()
        df_corpus = cq.kwic()
        
        # add the results to the full list
        if "lemma 0" in df_corpus.columns:
            df_full_list = pd.concat( [df_full_list, df_corpus["lemma 0"]] )     
        

    # make sure the column that contains the lemmata has the same name as the one passed to get_frequency_list
    column_name="lemma"
    df_full_list.columns = [column_name]

    # we're done with querying, build the frequency list now
    print('Done.')
    freq_df = get_frequency_list(df_full_list, column_name=column_name)
    

    return freq_df

    
# run it!
lexicon="molex"
# TODO: Maybe too much to ask for all nouns? Maybe take a random sample?
corpus_to_search="openchn"
pos="NOU.*"

freq_df = get_frequency_list_given_a_corpus(lexicon, pos, corpus_to_search)

display_df(freq_df)
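
The function above combines two steps: splitting the lemma list into batches, and counting the lemmata that come back from the corpus. A minimal sketch of just that logic, without any server queries, could look as follows. The names `make_batches` and `count_lemmata` are hypothetical, the sample data is made up, and the ranking (1 for the most frequent lemma) is an assumption rather than a description of what `get_frequency_list` does.

```python
from collections import Counter

def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def count_lemmata(lemma_hits):
    """Return (lemma, token count, rank) tuples, most frequent lemma first."""
    counts = Counter(lemma_hits)
    return [(lemma, count, rank)
            for rank, (lemma, count) in enumerate(counts.most_common(), start=1)]

# Illustrative data only: in the notebook, the hits come from the corpus queries
lemmata = ["huis", "boom", "fiets", "kat", "hond"]
batched = make_batches(lemmata, 2)

hits = ["huis", "boom", "huis", "kat", "huis", "boom"]
frequencies = count_lemmata(hits)

print(batched)
print(frequencies)
```

Keeping the batching separate from the counting makes it easy to tune the batch size to what the server can handle without touching the frequency computation.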

### Build a frequency table of some corpus, based on lemma list of a given lexicon <a class="anchor" id="freqtable-lemmalist"></a>
In this case study, we compare lemma frequencies for corpora from different time periods: CHN extern (contemporary Dutch Antilles & Suriname) and Letters as Loot (sailors' letters, 17th and 18th century).

*For this case study, you need to run the previous case study first, because it generates a function we need here.*

In [None]:
from chaininglib.utils.dfops import get_rank_diff
from chaininglib.ui.dfui import display_df

# For this case study, you need to run the previous case study first, because it generates a function we need here

# Lexicon and corpus for each of the two frequency lists (the corpora are from different periods)
base_lexicon1="molex"
corpus_to_search1="openchn"

base_lexicon2="molex"
corpus_to_search2="zeebrieven"

# ADJ gives interesting comparison

pos="ADJ.*"

# build frequency tables of two corpora

df_frequency_list1 = get_frequency_list_given_a_corpus(base_lexicon1, pos, corpus_to_search1)
# sort and display
df_top25_descending = df_frequency_list1.sort_values(ascending=False,by=['token count']).head(25)
df_top25_ascending =  df_frequency_list1.sort_values(ascending=True, by=['rank']).head(25)
display_df( df_top25_descending[['lemmas', 'token count']].set_index('lemmas'), labels='df1 chart '+corpus_to_search1, mode='chart' )

df_frequency_list2 = get_frequency_list_given_a_corpus(base_lexicon2, pos, corpus_to_search2)
# sort and display
df_top25_descending = df_frequency_list2.sort_values(ascending=False,by=['token count']).head(25)
df_top25_ascending =  df_frequency_list2.sort_values(ascending=True, by=['rank']).head(25)
display_df( df_top25_descending[['lemmas', 'token count']].set_index('lemmas'), labels='df2 chart '+corpus_to_search2, mode='chart' )


# TODO: show lemmata that are missing from list 1 or list 2

# compute the rank diff of lemmata in frequency tables

# sort and display
df_rankdiffs = get_rank_diff(df_frequency_list1, df_frequency_list2, index='lemmas')

display_df(df_rankdiffs.sort_values(by=['rank_diff']).head(25), labels='Differences in ranks')

df_top25_descending = df_rankdiffs.sort_values(ascending=False, by=['rank_diff']).head(25)
display_df( df_top25_descending['rank_diff'], labels='chart large diff', mode='chart' )

df_top25_ascending = df_rankdiffs.sort_values(ascending=True, by=['rank_diff']).head(25)
display_df( df_top25_ascending['rank_diff'], labels='chart small diff', mode='chart' )
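
How `get_rank_diff` computes its result is not shown in this notebook. As a rough sketch of the idea, assuming the rank difference is the absolute difference between the ranks a lemma has in the two frequency lists, the comparison could be written with plain dictionaries. The function name `compare_ranks` and the sample ranks are made up for illustration; the sketch also reports lemmata that occur in only one of the two lists.

```python
def compare_ranks(ranks_1, ranks_2):
    """Compare two {lemma: rank} mappings (hypothetical helper)."""
    shared = sorted(set(ranks_1) & set(ranks_2))
    # Assumption: rank difference = absolute difference of the two ranks
    rank_diff = {lemma: abs(ranks_1[lemma] - ranks_2[lemma]) for lemma in shared}
    # Lemmata that appear in only one of the two lists
    only_in_1 = sorted(set(ranks_1) - set(ranks_2))
    only_in_2 = sorted(set(ranks_2) - set(ranks_1))
    return rank_diff, only_in_1, only_in_2

# Illustrative ranks only, not real corpus results
ranks_1 = {"goed": 1, "groot": 2, "leuk": 3}
ranks_2 = {"goed": 2, "groot": 1, "lief": 3}

rank_diff, only_in_1, only_in_2 = compare_ranks(ranks_1, ranks_2)
print(rank_diff)
print(only_in_1, only_in_2)
```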

### Search corpus for wordforms of lemma not included in lexicon <a class="anchor" id="corpus-wordforms-not-lexicon"></a>

In [None]:
from chaininglib.search.LexiconQuery import *
from chaininglib.search.CorpusQuery import *
from chaininglib.ui.dfui import display_df
import pandas as pd  # needed for pd.DataFrame below

# Let's build a function to do the job:
# The function requires a lexicon name, a lexicon part-of-speech tag to limit the search to,
# the name of a corpus to be searched, and the corresponding part-of-speech tag of that corpus.
# It returns a Pandas DataFrame associating lemmata with their paradigms ('known_wordforms' column) and
# with missing wordforms found in the corpus ('unknown_wordforms' column).

def get_missing_wordforms(lexicon_name, lexicon_postag, corpus, corpus_postag):    
    
    print('Finding missing wordforms in a lexicon can take some time...')
    
    # LEXICON: 
    # get a lemmata list having a given part-of-speech
    # MoLex is a convenient source to get a list of lemmata
    
    lq = create_lexicon("molex").pos(lexicon_postag).search()
    df_lexicon = lq.kwic()
    
    # Put the results into a list, so we can loop through the list of lemmata
    # (the slice keeps only the last 50 lemmata)
    lexicon_lemmata_arr = [w.lower() for w in df_lexicon["writtenForm"]][-50:]
    
    # Test array, instead of querying Molex
    #lexicon_lemmata_arr = ["denken", "doen", "hebben", "maken"]
    
    # Prepare the output:
    # instantiate a DataFrame for storing lemmata and missing wordforms
    df_enriched_lexicon = pd.DataFrame(index=lexicon_lemmata_arr, columns=['lemma', 'pos', 'known_wordforms', 'unknown_wordforms'])
    df_enriched_lexicon.index.name = 'lemmata'
    
    # CORPUS: 
    # loop through the lemmata list, query the corpus for occurrences of each lemma,
    # and compute the paradigm differences between lexicon and corpus
    
    # It's a good idea to work with more than one lemma at a time (speed)!
    nr_of_lemmata_to_query_atonce = 100
    
    for i in range(0, len(lexicon_lemmata_arr), nr_of_lemmata_to_query_atonce):
        
        # slice to small array of lemmata to query at once
        small_lemmata_arr = lexicon_lemmata_arr[i : i+nr_of_lemmata_to_query_atonce]
        
        # join set of lemmata to send them in a query all at once
        # beware: single quotes need escaping
        lemmata_list = "|".join(small_lemmata_arr).replace("'", "\\\\'")
        print("Querying lemmata %i-%i of %i from corpus." % (i, i+nr_of_lemmata_to_query_atonce, len(lexicon_lemmata_arr) ))
        cq = create_corpus(corpus).pattern(r'[lemma="' + lemmata_list + r'" & pos="'+corpus_postag+'"]').search()
        df_corpus = cq.kwic()
        
        # if the corpus gave results,
        # query the lexicon for the same lemmata
        # and compare the paradigms!
        
        if (len(df_corpus)>0):
            small_lemmata_set = set(small_lemmata_arr)
            for one_lemma in small_lemmata_set: 
                
                # look up the known wordforms in the lexicon
                ql = create_lexicon(lexicon_name).lemma(one_lemma).search()
                df_known_wordforms = ql.kwic()
                
                # we have a lexicon paradigm to compare, do the job now
                if (len(df_known_wordforms) != 0):
                    
                    # gather the lexicon wordforms in a set
                    known_wordforms = set( df_known_wordforms['wordform'].str.lower() )
                    
                    # gather the corpus wordforms (of the same lemma) in a set too
                    corpus_lemma_filter = (df_corpus['lemma 0'] == one_lemma)
                    corpus_wordforms = set( (df_corpus[ corpus_lemma_filter ])['word 0'].str.lower() )
                    
                    # Now compute the differences:
                    # gather in a set all the corpus wordforms that cannot be found in the lexicon wordforms 
                    unknown_wordforms = corpus_wordforms.difference(known_wordforms)

                    # If we found some missing wordforms, add the results to the output!
                    
                    if (len(unknown_wordforms) !=0):                        
                        # The index of our results will be a key consisting of lemma + part-of-speech
                        # Part-of-speech is needed to distinguish homonyms with different grammatical categories.
                        # Of course, we would need to take glosses into account too to do a truly correct job,
                        # but we didn't do that here
                        key = one_lemma + lexicon_postag
                        df_enriched_lexicon.at[key, 'lemma'] = one_lemma
                        df_enriched_lexicon.at[key, 'pos'] = lexicon_postag
                        df_enriched_lexicon.at[key, 'known_wordforms'] = known_wordforms
                        df_enriched_lexicon.at[key, 'unknown_wordforms'] = unknown_wordforms
                
    # return non-empty results, i.e. cases in which we found some unknown wordforms
    return df_enriched_lexicon[ df_enriched_lexicon['unknown_wordforms'].notnull() ]


# Run the function!
#
# ask the lexicon which wordforms it knows, and try to find new unknown wordforms in the corpus

lexicon_name="mnwlex"
corpus_to_search="zeebrieven"

# beware: lexicon and corpus may have different parts-of-speech sets in use
df = get_missing_wordforms(lexicon_name, "VERB", corpus_to_search, "VRB")

# After such a heavy process, it's a good idea to save the results

df.to_csv( "missing_wordforms.csv", index=False)

display_df(df, labels='Missing wordforms')
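
The core of `get_missing_wordforms` is a set difference per lemma: the wordforms attested in the corpus minus the wordforms listed in the lexicon. A stripped-down sketch of that step, using plain Python data instead of lexicon and corpus queries, is given below. The function name `find_unknown_wordforms` and the sample wordforms are illustrative only.

```python
def find_unknown_wordforms(lexicon_paradigms, corpus_hits):
    """Return {lemma: set of corpus wordforms that the lexicon does not list}.

    lexicon_paradigms: dict mapping a lemma to its known wordforms
    corpus_hits: iterable of (lemma, word) pairs found in the corpus
    """
    # Group the corpus wordforms by lemma, lowercased as in the notebook code
    corpus_forms = {}
    for lemma, word in corpus_hits:
        corpus_forms.setdefault(lemma, set()).add(word.lower())

    unknown = {}
    for lemma, known in lexicon_paradigms.items():
        known_forms = {w.lower() for w in known}
        # Corpus wordforms that cannot be found among the lexicon wordforms
        missing = corpus_forms.get(lemma, set()) - known_forms
        if missing:
            unknown[lemma] = missing
    return unknown

# Illustrative data only
lexicon_paradigms = {"denken": ["denken", "denkt", "dacht"], "doen": ["doen", "doet"]}
corpus_hits = [("denken", "Denkt"), ("denken", "dencken"), ("doen", "doet")]

unknown = find_unknown_wordforms(lexicon_paradigms, corpus_hits)
print(unknown)
```

Lemmata for which the corpus adds nothing new, such as *doen* here, are left out of the result, which mirrors the final filter on `unknown_wordforms` in the function above.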


## Treebanks

### Treebank search <a class="anchor" id="treebank-search"></a>

In [None]:
from chaininglib.search.TreebankQuery import *


print ("search...")
tbq = create_treebank("cgn").pattern("//node[@cat='pp' and node[@cat='ap' and node[@cat='np']]]").search()

print ("get XML...")

xml = tbq.xml()
print(xml)

print ("get trees and their string representations...")

trees = tbq.trees()

for tree in trees:
    display(tree.toString())

df = tbq.kwic()
    
display(df)

### Which kinds of nouns are used in a prepositional complement of the verb *geven*? <a class="anchor" id="treebank-objects-geven"></a>

In [None]:
from chaininglib.search.TreebankQuery import *


print ("search...")

tbq = create_treebank("cgn").pattern('//node[node[@rel="hd" and @pt="ww" and @root="geven"] and node[@rel="obj1" and @pt="n"]]').search()


print ("get list of nouns which are part of a PP, as argument of predicate 'geven'...")

trees = tbq.trees()

list_of_nouns = []
for tree in trees:
    nouns = tree.extract(['pp', 'np'])
    list_of_nouns = list_of_nouns + nouns
    

display(list_of_nouns)
    
df = tbq.kwic(align_lemma='geven')
display(df)
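
The XPath pattern in the cell above selects nodes that have both a head child with root *geven* and a noun child with relation `obj1`. As a sketch of what that condition means, the same test can be spelled out with Python's standard `xml.etree.ElementTree` on a tiny hand-made tree. The XML below is a simplified, invented stand-in for treebank output, and `direct_object_nouns` is a hypothetical helper, not part of chaininglib. ElementTree's XPath subset does not cover nested predicates like the ones in the pattern, so the sketch checks the child nodes in a loop.

```python
import xml.etree.ElementTree as ET

# Invented, simplified example tree for "ik geef boeken"
SAMPLE_XML = """
<alpino_ds>
  <node cat="top">
    <node cat="smain">
      <node rel="su" pt="vnw" root="ik" word="ik"/>
      <node rel="hd" pt="ww" root="geven" word="geef"/>
      <node rel="obj1" pt="n" root="boek" word="boeken"/>
    </node>
  </node>
</alpino_ds>
"""

def direct_object_nouns(tree_root, verb_root):
    """Collect roots of noun obj1 nodes whose sibling head is the given verb."""
    found = []
    for parent in tree_root.iter("node"):
        children = list(parent)
        # Condition 1: a head child that is a verb with the requested root
        has_verb_head = any(
            c.get("rel") == "hd" and c.get("pt") == "ww" and c.get("root") == verb_root
            for c in children
        )
        if not has_verb_head:
            continue
        # Condition 2: noun children with relation obj1
        found.extend(
            c.get("root") for c in children
            if c.get("rel") == "obj1" and c.get("pt") == "n"
        )
    return found

nouns = direct_object_nouns(ET.fromstring(SAMPLE_XML.strip()), "geven")
print(nouns)
```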