# Examples Chaining search
This notebook contains a number of examples of chaining linguistic resources: corpora, lexica and treebanks. Try the examples, or copy the code and customize the examples in the [Sandbox](Sandbox.ipynb).


## List of examples
### Corpora
 * [Corpus search](#corpus-search)
 * [Frequency of *zeker*+verb and *vast*+verb compared](#freq-puur-zuiver)
 * [Train a POS tagger on an annotated corpus](#pos-tagger)
 * [Search in corpus and filter on metadata](#corpus-filter-metadata)
 * [Visualizing h-dropping](#visualizing-h-dropping)
 * [Generate lexicon from several corpora](#lexicon-several-corpora)

### Lexica
 * [Lexicon search](#lexicon-search)

### Corpus + lexicon
 * [Retrieve synonyms from DiaMaNT, look up in Gysseling](#synonyms-diamant-gysseling)
 * [Build a frequency list of the lemma of some corpus output](#freq-lemma-corpus)
 * [Find occurrences of attributive adjectives not ending with -e, even though they are preceded by a definite article](#adjective-e)
 * [Look up inflected forms and spelling variants for a given lemma in a corpus](#inflected-spelling-corpus)
 * [Corpus frequency list of lemmata from lexicon with given part of speech](#corpus-frequency-lemma-pos)
 * [Build a frequency table of some corpus, based on lemmata of a given lexicon](#freqtable-lemmalist)
 * [Search corpus for wordforms of lemma not included in lexicon](#corpus-wordforms-not-lexicon)
 
### Treebanks
 * [Treebank search](#treebank-search)
 * [Which objects of verb *geven* occur?](#treebank-objects-geven)
 ---

## Corpus

### Corpus search <a class="anchor" id="corpus-search"></a>

* Run the cell below to show the UI, and fill in your search query

In [None]:
from chaininglib.ui.search import create_corpus_ui
from chaininglib.ui.dfui import display_df, get_uploader

# Create corpus UI, creates references to field contents
corpusQueryField, corpusField = create_corpus_ui()


 * Click the cell below and press Run to perform the given query

In [None]:
from chaininglib.search.CorpusQuery import *

#from chaininglib import search
query= corpusQueryField.value
corpus_name = corpusField.value
df_corpus = create_corpus(corpus_name).pattern(query).search().kwic()
#df_corpus = load_dataframe('mijn_resultaten.csv')
display_df(df_corpus, labels="Results")



### Frequency of *zeker*+verb and *vast*+verb compared <a class="anchor" id="freq-puur-zuiver"></a>
* The cell below searches for *zeker*+verb and for *vast*+verb in the Letters as Loot (zeebrieven) corpus
* Compare frequencies

In [None]:
#from chaininglib import search
from IPython.display import display, HTML
from chaininglib.search.CorpusQuery import *
from chaininglib.ui.dfui import display_df
from chaininglib.utils.dfops import column_difference

corpus_name = "zeebrieven"

# Word 1: zeker
word1= "zeker"
cq1 = create_corpus(corpus_name).pattern(r'[lemma="' + word1 + r'"][pos="VRB.*"]')
df_corpus1 = cq1.search().kwic()
display_df(df_corpus1, word1)

# Word 2: vast
word2 = "vast"
cq2 = create_corpus(corpus_name).pattern(r'[lemma="' + word2 + r'"][pos="VRB.*"]')
df_corpus2 = cq2.search().kwic()
display_df(df_corpus2, word2)

# Compute difference
diff_left, diff_right, intersec = column_difference(df_corpus1["word 1"], df_corpus2["word 1"])
# Elements of 1 that are not in 2
display(HTML('Verbs for <b>' + word1 + '</b> not in <b>' + word2 + '</b>: ' + ", ".join(diff_left)))
# Elements of 2 that are not in 1
display(HTML('Verbs for <b>' + word2 + '</b> not in <b>' + word1 + '</b>: ' + ", ".join(diff_right)))
# Elements both in 1 and 2
display(HTML('Verbs for both <b>' + word1 + '</b> and <b>' + word2 + '</b>: ' + ", ".join(intersec)))
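
The comparison step above boils down to three set operations. The following is a minimal sketch in plain Python, assuming that `column_difference` behaves like an ordinary set comparison (its actual implementation in chaininglib is not shown in this notebook). The function name `three_way_difference` and the sample verbs are hypothetical.

```python
def three_way_difference(left, right):
    """Return (only in left, only in right, in both) as sorted lists."""
    left_set, right_set = set(left), set(right)
    only_left = sorted(left_set - right_set)
    only_right = sorted(right_set - left_set)
    both = sorted(left_set & right_set)
    return only_left, only_right, both


# Hypothetical verbs found after each adverb (duplicates are collapsed)
verbs_after_zeker = ["weten", "zijn", "komen", "zijn"]
verbs_after_vast = ["houden", "zijn", "staan"]

only_zeker, only_vast, shared = three_way_difference(verbs_after_zeker, verbs_after_vast)
print(only_zeker)
print(only_vast)
print(shared)
```

Sorting the results is a design choice made here so that the output is stable between runs; plain sets have no guaranteed order.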

### Train a POS tagger on an annotated corpus <a class="anchor" id="pos-tagger"></a>

In [None]:
from chaininglib.ui.dfui import display_df
from chaininglib.process.corpus import get_tagger
from chaininglib.search.CorpusQuery import *
from chaininglib.search.LexiconQuery import *

import pandas as pd

# gather occurrences of our word from annotated corpora
some_word = "boek"  # example value (assumption); replace with the word you are interested in

dfs_all_corpora = pd.DataFrame()

for one_corpus in ['zeebrieven', 'gysseling']:
    print('querying '+one_corpus+'...')
    c = create_corpus(one_corpus).word(some_word).detailed_context(True).search()
    df_corpus = c.kwic() 
    
    # store the results
    dfs_all_corpora = pd.concat( [dfs_all_corpora, df_corpus] )


# get a tagger trained with our corpus data
tagger = get_tagger(dfs_all_corpora, pos_key = 'pos') 

# Use the trained tagger to tag unknown sentences
# The input must be like: tagger.tag(['today','is','a','beautiful','day'])

sentence = 'Het is mooi weer, dus gaan we in het bos lopen'
tagged_sentence = tagger.tag( sentence.split() )

print(tagged_sentence)


# Now we can lemmatize each occurrence of our lemma in the new sentences
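
How `get_tagger` trains its model is not shown in this notebook. As an illustration of the general idea only, here is a minimal unigram tagger in plain Python: each word receives the tag it carried most often in the training data. The function `train_unigram_tagger`, the training pairs, and the tag labels `ART` and `PRN` are hypothetical and are not part of chaininglib.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_tokens, default_tag="UNKNOWN"):
    """Build a tagger that assigns each word its most frequent training tag."""
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        tag_counts[word.lower()][tag] += 1
    best_tag = {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}

    def tag(tokens):
        # Words never seen in training fall back to default_tag
        return [(token, best_tag.get(token.lower(), default_tag)) for token in tokens]

    return tag


# Hypothetical (word, pos) pairs, as they might be read from corpus search results
training_data = [("het", "ART"), ("bos", "NOU"), ("is", "VRB"), ("het", "ART"), ("het", "PRN")]
tag = train_unigram_tagger(training_data)
tagged = tag("Het bos is mooi".split())
print(tagged)
```

A unigram tagger ignores context, so it cannot disambiguate a word such as *het* between article and pronoun; it simply picks the majority tag.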

### Search in corpus and filter on metadata <a class="anchor" id="corpus-filter-metadata"></a>
First, we request all available metadata fields of the corpus. Then, we issue a search query, and request all metadata fields for the result. Finally, we filter on metadata values.

In [None]:
from chaininglib.search.metadata import get_available_metadata
from chaininglib.utils.dfops import df_filter, property_freq
from chaininglib.ui.dfui import display_df
from chaininglib.search.CorpusQuery import *


corpus_name="zeebrieven"
query=r'[lemma="boek"]'
# Request all metadata fields from corpus
fields = get_available_metadata(corpus_name)
# Perform query and ask all metadata
c = create_corpus(corpus_name).pattern(query).extra_fields_doc(fields["document"]).search()
df_corpus = c.kwic()

# Filter on year: > 1700
df_filter_year = df_corpus[df_corpus["witnessYear_from"].astype('int32') > 1700] 
display_df(df_filter_year, labels="After 1700")

# Filter on sender birth place Amsterdam
condition = df_filter(df_corpus["afz_geb_plaats"], pattern="Amsterdam")
df_filter_place = df_corpus[ condition ]
display_df(df_filter_place, labels="Sender born in Amsterdam")


# Group by sender location
df = property_freq(df_corpus,"afz_loc_plaats")
display_df(df, labels="Most frequent sender locations")

### Visualizing h-dropping  <a class="anchor" id="visualizing-h-dropping"></a>

In [None]:

from chaininglib.search.CorpusQuery import *
from chaininglib.search.metadata import get_available_metadata
from chaininglib.ui.dfui import display_df
 
corpus_to_search="zeebrieven"
group_by_column = 'afz_geb_plaats'

fields = get_available_metadata(corpus_to_search)

df_corpus1 = create_corpus(corpus_to_search).pattern(r'[lemma="h[aeo].*" & word="h[aeo].*"]').extra_fields_doc(fields["document"]).search().kwic()
df_corpus2 = create_corpus(corpus_to_search).pattern(r'[lemma="h[aeo].*" & word="[aeo].*"]').extra_fields_doc(fields["document"]).search().kwic()

print('Draw charts showing geographic differences between normal language and h-dropping')

display_df( df_corpus1[['lemma 0', group_by_column]].groupby(group_by_column).count().sort_values(ascending=False,by=['lemma 0']).head(25), labels="normal", mode='chart') 
display_df( df_corpus2[['lemma 0', group_by_column]].groupby(group_by_column).count().sort_values(ascending=False,by=['lemma 0']).head(25), labels="h-dropping", mode='chart')
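
To see which tokens each of the two queries selects, the patterns can be tried locally with Python's `re` module. This sketch assumes that the corpus engine matches a pattern against the whole token (hence `fullmatch`) and is case-sensitive; both are assumptions, and the sample pairs are hypothetical.

```python
import re

LEMMA_WITH_H = re.compile(r"h[aeo].*")
WORD_WITH_H = re.compile(r"h[aeo].*")
WORD_WITHOUT_H = re.compile(r"[aeo].*")


def classify_h(lemma, word):
    """Return 'normal', 'h-dropping', or None for tokens outside both queries."""
    if not LEMMA_WITH_H.fullmatch(lemma):
        return None
    if WORD_WITH_H.fullmatch(word):
        return "normal"
    if WORD_WITHOUT_H.fullmatch(word):
        return "h-dropping"
    return None


# Hypothetical (lemma, word) pairs
samples = [("hebben", "hebben"), ("hebben", "ebben"), ("huis", "huis"), ("hand", "and"), ("eten", "eten")]
labels = [classify_h(lemma, word) for lemma, word in samples]
print(labels)
```

Note that a lemma such as *huis* falls outside both queries, because the patterns only cover lemmata starting with *ha-*, *he-* or *ho-*.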



### Generate lexicon from several corpora <a class="anchor" id="lexicon-several-corpora"></a>

In [None]:
from chaininglib.ui.dfui import display_df
from chaininglib.process.corpus import extract_lexicon
from chaininglib.search.CorpusQuery import *

dfs_all_corpora = []
for one_corpus in get_available_corpora(exclude=["nederlab"]):
    print('querying '+one_corpus+'...')
    c = create_corpus(one_corpus).pos("NOU").detailed_context(True).search()
    df_corpus = c.kwic() 
    # store the results
    dfs_all_corpora.append(df_corpus)

    
# extract lexicon and show the result
extracted_lexicon = extract_lexicon(dfs_all_corpora, posColumnName="pos") # For FCS: posColumnName=universal_dependency
display(extracted_lexicon)

## Lexicon

### Lexicon search <a class="anchor" id="lexicon-search"></a>

* Run the cell below to show the UI, and fill in your search query in the UI

In [None]:
from chaininglib.ui.search import create_lexicon_ui

#from chaininglib import ui
searchWordField, lexiconField = create_lexicon_ui()

 * Click the cell below and press Run to perform the given query

In [None]:
from chaininglib.search.LexiconQuery import *
from chaininglib.ui.dfui import display_df

search_word = searchWordField.value
lexicon_name = lexiconField.value
# USER: can replace this by own custom query
lex = create_lexicon(lexicon_name).lemma(search_word).search()
df_lexicon = lex.kwic()
display_df(df_lexicon)
#df_columns_list = list(df_lexicon.columns.values)
#df_lexicon_in_columns = df_lexicon[df_columns_list]
#display(df_lexicon_in_columns)

## Corpus + lexicon

### Retrieve synonyms from DiaMaNT, look up in Gysseling <a class="anchor" id="synonyms-diamant-gysseling"></a>
* The cell below searches for the term "boek" in DiaMaNT, and looks up all its synonyms in Gysseling

In [None]:
from chaininglib.search.CorpusQuery import *
from chaininglib.search.LexiconQuery import *
from IPython.display import display, HTML
import pandas as pd
from chaininglib.search.corpusQueries import corpus_query
from chaininglib.process.lexicon import get_diamant_synonyms
from chaininglib.ui.dfui import display_df

search_word = "boek"
lexicon_name = "diamant"
corpus= "gysseling"

# First, look up synonyms in DiaMaNT
lq = create_lexicon(lexicon_name).lemma(search_word).search()
df_lexicon = lq.kwic()

syns = get_diamant_synonyms(df_lexicon)
syns.add(search_word) # Also add search word itself
display(HTML('Synonyms for <b>' + search_word + '</b>: ' + ", ".join(syns)))

# Search for all synonyms in corpus
## Create queries: search by lemma
syns_queries = [corpus_query(lemma=syn) for syn in syns]

## Run one query per synonym and collect the results
results = []
for one_pattern in syns_queries:
    cq = create_corpus(corpus).pattern(one_pattern).search()
    results.append(cq.kwic())
# DataFrame.append was removed in pandas 2.0, so concatenate the collected frames instead
df = pd.concat(results)
display_df(df)
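
Issuing one request per synonym can be slow. A possible alternative is to send a single query with a regular-expression alternation, as the frequency-list example in this notebook does for batches of lemmata. The helper `alternation_query` below is hypothetical, and `re.escape` follows Python's escaping rules, which may not match the regex dialect of the corpus engine exactly.

```python
import re


def alternation_query(lemmata):
    """Build one CQL-style token pattern that matches any of the given lemmata."""
    escaped = sorted(re.escape(lemma) for lemma in set(lemmata))
    return '[lemma="' + "|".join(escaped) + '"]'


# Hypothetical synonym list (the duplicate is removed by the set conversion)
query = alternation_query(["boek", "geschrift", "boek"])
print(query)
```

The resulting string could be passed to `.pattern(...)` in place of the per-synonym patterns; whether that is faster depends on the server.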



###  Build a frequency list of the lemma of some corpus output <a class="anchor" id="freq-lemma-corpus"></a>

In [None]:
from chaininglib.search.CorpusQuery import *
from chaininglib.process.corpus import *
from chaininglib.ui.dfui import *

# do some corpus search
print('This can take a few seconds... please wait!')

corpus_to_search="zeebrieven"
df_corpus = create_corpus(corpus_to_search).detailed_context(True).pos("NOU.*").search().kwic()

# compute and display a table of the frequencies of the lemmata

freq_df = get_frequency_list(df_corpus)
display_df(freq_df)
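
What a frequency list amounts to can be shown without any corpus access. This is a minimal sketch with `collections.Counter`, not the implementation of `get_frequency_list`; the keys `lemmas`, `token count` and `rank` mirror the column names used elsewhere in this notebook, and the sample lemmata are hypothetical.

```python
from collections import Counter


def frequency_list(lemmata):
    """Count lemmata and rank them, most frequent first (ties broken alphabetically)."""
    counts = Counter(lemmata)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        {"lemmas": lemma, "token count": count, "rank": rank}
        for rank, (lemma, count) in enumerate(ordered, start=1)
    ]


# Hypothetical lemma column of a search result
rows = frequency_list(["brief", "schip", "brief", "god", "brief", "schip"])
for row in rows:
    print(row)
```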

### Find occurrences of attributive adjectives not ending with -e, even though they are preceded by a definite article <a class="anchor" id="adjective-e"></a>

In [None]:
from chaininglib.search.CorpusQuery import *
from chaininglib.search.LexiconQuery import *
from chaininglib.utils.dfops import df_filter
from chaininglib.ui.dfui import display_df

corpus_to_search="chn-extern"
lexicon_to_search="molex"

# CORPUS: get [article + attributive adjective + noun] combinations in which the adjective does not end with -e
# (the pattern below only matches adjectives starting with g-)
print('Get occurrences of attributive adjectives not ending with -e')
cq = create_corpus(corpus_to_search).pattern(r'[lemma="de|het"][word="^g(.+)[^e]$" & pos="AA.*"][pos="NOU.*"]')
df_corpus = cq.search().kwic()
display(df_corpus)

# LEXICON: get adjectives the lemma of which does not end with -e
lq = create_lexicon(lexicon_to_search).lemma('^g(.+)[^e]$').pos('ADJ').search()
df_lexicon = lq.kwic()

# LEXICON: get adjectives having a final -e in definite attributive use
print('Filtering lexicon results')
final_e_condition = df_filter(df_lexicon["wordform"], 'e$')
df_lexicon_form_e = df_lexicon[ final_e_condition ]

# RESULT: get the records out of our first list in which the -e-less-adjectives match the lemma form of our last list
print('List of attributive adjectives not ending with -e even though they should have a final -e:')
e_forms = list(df_lexicon_form_e.lemma)
no_final_e_condition = df_filter(df_corpus["word 1"], pattern=set(e_forms), method="isin")
result_df = df_corpus[ no_final_e_condition ]
display_df( result_df )
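
The two filtering steps above can be traced on a toy example. This sketch uses plain tuples instead of DataFrames; the function `adjectives_missing_e` and all data are hypothetical, and it only mirrors the logic of the cell (keep lemmata that have an inflected form in -e, then keep corpus hits whose adjective equals one of those lemmata).

```python
def adjectives_missing_e(corpus_hits, lexicon_entries):
    """Keep corpus hits whose adjective is a lemma that has an inflected form in -e."""
    lemmata_with_e_form = {
        lemma for lemma, wordform in lexicon_entries if wordform.endswith("e")
    }
    return [hit for hit in corpus_hits if hit[1] in lemmata_with_e_form]


# Hypothetical data: (article, adjective, noun) hits and (lemma, wordform) lexicon pairs
corpus_hits = [("de", "groot", "man"), ("het", "gek", "idee"), ("de", "gratis", "krant")]
lexicon_entries = [("groot", "groot"), ("groot", "grote"), ("gek", "gekke"), ("gratis", "gratis")]

flagged = adjectives_missing_e(corpus_hits, lexicon_entries)
print(flagged)
```

An adjective such as *gratis*, which has no form in -e in this toy lexicon, is not flagged.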

### Look up inflected forms and spelling variants for a given lemma in a corpus <a class="anchor" id="inflected-spelling-corpus"></a>

In [None]:
from chaininglib.ui.dfui import display_df
from chaininglib.search.CorpusQuery import *
from chaininglib.search.LexiconQuery import *

# The Gysseling corpus and the mnwlex lexicon are from the same period: 1250-1550
lexicon_to_search="mnwlex"
corpus_to_search="gysseling"

##############################################
# TODO: do the same with multiple lemmata, grouped
##############################################

lemma_to_look_for="denken"

# LEXICON: Search for the inflected forms of a lemma in a morphosyntactic lexicon
lq = create_lexicon(lexicon_to_search).lemma(lemma_to_look_for).search()
df_lexicon = lq.kwic()
display_df(df_lexicon)

# Put all inflected forms into a list
inflected_wordforms = list(df_lexicon.wordform)

# CORPUS: Look up the inflected forms in a (possibly unannotated) corpus
# beware: If the corpus is not annotated, all we can do is search for the inflected words.
#         But if the corpus is lemmatized, we have to make sure we're retrieving correct data by specifying the lemma as well
annotated_corpus = True
query = r'[lemma="'+lemma_to_look_for+r'" & word="'+r"|".join(inflected_wordforms)+r'"]' if annotated_corpus else r'[word="'+r"|".join(inflected_wordforms)+r'"]'
cq = create_corpus(corpus_to_search).pattern(query).search()
df_corpus = cq.kwic() 
display_df(df_corpus)
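
The one-line conditional that builds the query is dense, so here is the same string construction spelled out as a small function. The name `build_wordform_query` and the wordforms are hypothetical; the output format follows the pattern strings used in the cell above.

```python
def build_wordform_query(lemma, wordforms, annotated=True):
    """Build a token pattern; constrain on the lemma only if the corpus is annotated."""
    word_part = 'word="' + "|".join(wordforms) + '"'
    if annotated:
        return '[lemma="' + lemma + '" & ' + word_part + "]"
    return "[" + word_part + "]"


# Hypothetical inflected forms
forms = ["denken", "dachte", "gedacht"]
annotated_query = build_wordform_query("denken", forms)
plain_query = build_wordform_query("denken", forms, annotated=False)
print(annotated_query)
print(plain_query)
```

Wordforms containing regex metacharacters would need escaping before being joined; that is left out here, as in the original cell.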

### Corpus frequency list of lemmata from lexicon with given part of speech <a class="anchor" id="corpus-frequency-lemma-pos"></a>
Build a function with which we can gather all lemmata of a lexicon, and build a frequency list of those lemmata in a corpus.

In [None]:
from chaininglib.search.LexiconQuery import *
from chaininglib.search.CorpusQuery import *
from chaininglib.process.corpus import get_frequency_list
from chaininglib.ui.dfui import display_df
import pandas as pd


# build a function as required. We will run it afterwards

def get_frequency_list_given_a_corpus(lexicon, pos, corpus):
    
    # LEXICON: get a lemmata list to work with

    # query the lexicon for lemma with a given part-of-speech
    lq = create_lexicon(lexicon).pos(pos).search()
    df_lexicon = lq.kwic()

    # Put the results into a list, so we can loop through the found lemmata
    # (only the last 200 lemmata are kept)
    lexicon_lemmata_arr = [w.lower() for w in df_lexicon["writtenForm"]][-200:]
    # Instantiate a DataFrame, in which we will gather all single lemmata occurrences
    df_full_list = pd.DataFrame()


    # CORPUS: loop through the lemmata list, query the corpus with each lemma, and count the results

    # It's a good idea to query more than one lemma at a time,
    # but not too many, otherwise the server will get overloaded!
    nr_of_lemmata_to_query_atonce = 100

    # loop over lemma list 
    for i in range(0, len(lexicon_lemmata_arr), nr_of_lemmata_to_query_atonce):
        
        print('Lemmata processed: '+str(i)+'/'+str(len(lexicon_lemmata_arr)))
        
        # slice to small array of lemmata to query at once
        small_lemmata_arr = lexicon_lemmata_arr[i : i+nr_of_lemmata_to_query_atonce] 

        # join set of lemmas to send them in a query all at once
        # beware: single quotes need escaping
        lemmata_list = "|".join(small_lemmata_arr).replace("'", "\\\\'")
        cq = create_corpus(corpus).pattern(r'[lemma="' + lemmata_list + r'"]').search()
        df_corpus = cq.kwic()
        
        # add the results to the full list
        if "lemma 0" in df_corpus.columns:
            df_full_list = pd.concat( [df_full_list, df_corpus["lemma 0"]] )     
        

    # make sure the column that contains the lemmata has the same name as the one passed to get_frequency_list
    column_name="lemma"
    df_full_list.columns = [column_name]

    # we're done with querying, build the frequency list now
    print('Done.')
    freq_df = get_frequency_list(df_full_list, column_name=column_name)
    

    return freq_df

    
# run it!
lexicon="molex"
# TODO: Maybe too much to ask for all nouns? Maybe take a random sample?
corpus_to_search="chn-extern"
pos="NOU.*"

freq_df = get_frequency_list_given_a_corpus(lexicon, pos, corpus_to_search)

display_df(freq_df)
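
The batching loop inside the function can be isolated as a small generator. This is a sketch of the slicing and joining only; the name `batched_lemma_queries` and the lemmata are hypothetical, and no escaping is applied.

```python
def batched_lemma_queries(lemmata, batch_size):
    """Yield one alternation query per batch of at most batch_size lemmata."""
    for start in range(0, len(lemmata), batch_size):
        batch = lemmata[start:start + batch_size]
        yield '[lemma="' + "|".join(batch) + '"]'


# Hypothetical lemma list, queried two lemmata at a time
queries = list(batched_lemma_queries(["huis", "boom", "schip", "zee", "brief"], 2))
for query in queries:
    print(query)
```

The last batch is simply shorter when the list length is not a multiple of the batch size, which is also how the slice in the function above behaves.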

### Build a frequency table of some corpus, based on lemma list of a given lexicon <a class="anchor" id="freqtable-lemmalist"></a>
In this case study, we compare lemma frequencies for corpora from different time periods: CHN extern (contemporary Dutch Antilles & Suriname) and Letters as Loot (sailors' letters, 17th and 18th century).

*For this case study, you need to run the previous case study first, because it generates a function we need here.*

In [None]:
from chaininglib.utils.dfops import get_rank_diff
from chaininglib.ui.dfui import display_df

# For this case study, you need to run the previous case study first, because it generates a function we need here

# Use lexica and corpora from same period
base_lexicon1="molex"
corpus_to_search1="chn-extern"

base_lexicon2="molex"
corpus_to_search2="zeebrieven"

# ADJ gives interesting comparison

pos="ADJ.*"

# build frequency tables of two corpora

df_frequency_list1 = get_frequency_list_given_a_corpus(base_lexicon1, pos, corpus_to_search1)
# sort and display
df_top25_descending = df_frequency_list1.sort_values(ascending=False,by=['token count']).head(25)
df_top25_ascending =  df_frequency_list1.sort_values(ascending=True, by=['rank']).head(25)
display_df( df_top25_descending[['lemmas', 'token count']].set_index('lemmas'), labels='df1 chart '+corpus_to_search1, mode='chart' )

df_frequency_list2 = get_frequency_list_given_a_corpus(base_lexicon2, pos, corpus_to_search2)
# sort and display
df_top25_descending = df_frequency_list2.sort_values(ascending=False,by=['token count']).head(25)
df_top25_ascending =  df_frequency_list2.sort_values(ascending=True, by=['rank']).head(25)
display_df( df_top25_descending[['lemmas', 'token count']].set_index('lemmas'), labels='df2 chart '+corpus_to_search2, mode='chart' )


# TODO: show lemmata that are missing from corpus 1 or 2

# compute the rank diff of lemmata in frequency tables

# sort and display
df_rankdiffs = get_rank_diff(df_frequency_list1, df_frequency_list2, index='lemmas')

display_df(df_rankdiffs.sort_values(by=['rank_diff']).head(25), labels='Differences in ranks')

df_top25_descending = df_rankdiffs.sort_values(ascending=False, by=['rank_diff']).head(25)
display_df( df_top25_descending['rank_diff'], labels='chart large diff', mode='chart' )

df_top25_ascending = df_rankdiffs.sort_values(ascending=True, by=['rank_diff']).head(25)
display_df( df_top25_ascending['rank_diff'], labels='chart small diff', mode='chart' )
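
How `get_rank_diff` defines the difference is not shown in this notebook. One plausible reading, sketched here as an assumption, is the absolute difference between a lemma's frequency rank in each table, computed only for lemmata present in both. All names and counts below are hypothetical.

```python
def rank_by_frequency(counts):
    """Map each lemma to its rank (1 = most frequent, ties broken alphabetically)."""
    ordered = sorted(counts, key=lambda lemma: (-counts[lemma], lemma))
    return {lemma: rank for rank, lemma in enumerate(ordered, start=1)}


def rank_differences(counts_a, counts_b):
    """Absolute rank difference for lemmata present in both frequency tables."""
    ranks_a = rank_by_frequency(counts_a)
    ranks_b = rank_by_frequency(counts_b)
    shared = ranks_a.keys() & ranks_b.keys()
    return {lemma: abs(ranks_a[lemma] - ranks_b[lemma]) for lemma in shared}


# Hypothetical adjective counts in two corpora
counts_modern = {"groot": 50, "goed": 40, "lief": 5}
counts_letters = {"lief": 60, "goed": 30, "groot": 10, "waard": 8}

diffs = rank_differences(counts_modern, counts_letters)
print(sorted(diffs.items()))
```

Ranks rather than raw counts are compared because the two corpora differ in size, so absolute frequencies are not directly comparable.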

### Search corpus for wordforms of lemma not included in lexicon <a class="anchor" id="corpus-wordforms-not-lexicon"></a>

In [None]:
from chaininglib.search.LexiconQuery import *
from chaininglib.search.CorpusQuery import *
from chaininglib.ui.dfui import display_df
import pandas as pd

# Let's build a function to do the job:
# The function will require a lexicon name and a part-of-speech to limit the search to, and the name of a corpus to be searched.
# It will return a Pandas DataFrame associating lemmata to their paradigms ('known_wordforms' column) and
# missing wordforms found in the corpus ('unknown_wordforms' column).

def get_missing_wordforms(lexicon_name, lexicon_postag, corpus, corpus_postag):    
    
    print('Finding missing wordforms in a lexicon can take some time...')
    
    # LEXICON: 
    # get a lemmata list having a given part-of-speech
    # MoLex is a convenient source to get a list of lemmata
    
    lq = create_lexicon("molex").pos(lexicon_postag).search()
    df_lexicon = lq.kwic()
    
    # Put the results into a list, so we can loop through the list of lemmata
    # (only the last 50 lemmata are kept)
    lexicon_lemmata_arr = [w.lower() for w in df_lexicon["writtenForm"]][-50:]
    
    # Test array, instead of querying Molex
    #lexicon_lemmata_arr = ["denken", "doen", "hebben", "maken"]
    
    # Prepare the output:
    # instantiate a DataFrame for storing lemmata and missing wordforms
    df_enriched_lexicon = pd.DataFrame(index=lexicon_lemmata_arr, columns=['lemma', 'pos', 'known_wordforms', 'unknown_wordforms'])
    df_enriched_lexicon.index.name = 'lemmata'
    
    # CORPUS: 
    # loop through the lemmata list, query the corpus for each lemma, 
    # and compute paradigms differences between both

    
    # loop through the lemmata list
    # and query the corpus for occurrences of the lemmata
    
    # It's a good idea to work with more than one lemma at a time (speed)!
    nr_of_lemmata_to_query_atonce = 100
    
    for i in range(0, len(lexicon_lemmata_arr), nr_of_lemmata_to_query_atonce):
        
        # slice to small array of lemmata to query at once
        small_lemmata_arr = lexicon_lemmata_arr[i : i+nr_of_lemmata_to_query_atonce]
        
        # join set of lemmata to send them in a query all at once
        # beware: single quotes need escaping
        lemmata_list = "|".join(small_lemmata_arr).replace("'", "\\\\'")
        print("Querying lemmata %i-%i of %i from corpus." % (i, i+nr_of_lemmata_to_query_atonce, len(lexicon_lemmata_arr) ))
        cq = create_corpus(corpus).pattern(r'[lemma="' + lemmata_list + r'" & pos="'+corpus_postag+'"]').search()
        df_corpus = cq.kwic()
        
        # if the corpus gave results,
        # query the lexicon for the same lemmata
        # and compare the paradigms!
        
        if (len(df_corpus)>0):
            small_lemmata_set = set(small_lemmata_arr)
            for one_lemma in small_lemmata_set: 
                
                # look up the known wordforms in the lexicon
                ql = create_lexicon(lexicon_name).lemma(one_lemma).search()
                df_known_wordforms = ql.kwic()
                
                # we have a lexicon paradigm to compare, do the job now
                if (len(df_known_wordforms) != 0):
                    
                    # gather the lexicon wordforms in a set
                    known_wordforms = set( df_known_wordforms['wordform'].str.lower() )
                    
                    # gather the corpus wordforms (of the same lemma) in a set too
                    corpus_lemma_filter = (df_corpus['lemma 0'] == one_lemma)
                    corpus_wordforms = set( (df_corpus[ corpus_lemma_filter ])['word 0'].str.lower() )
                    
                    # Now compute the differences:
                    # gather in a set all the corpus wordforms that cannot be found in the lexicon wordforms 
                    unknown_wordforms = corpus_wordforms.difference(known_wordforms)

                    # If we found some missing wordforms, add the results to the output!
                    
                    if (len(unknown_wordforms) !=0):                        
                        # The index of our results will be a key consisting of lemma + part-of-speech
                        # Part-of-speech is needed to distinguish homonyms with different grammatical categories.
                        # Of course, we need to take glosses into account too to do a truly correct job
                        # But we didn't do it here
                        key = one_lemma + lexicon_postag
                        df_enriched_lexicon.at[key, 'lemma'] = one_lemma
                        df_enriched_lexicon.at[key, 'pos'] = lexicon_postag
                        df_enriched_lexicon.at[key, 'known_wordforms'] = known_wordforms
                        df_enriched_lexicon.at[key, 'unknown_wordforms'] = unknown_wordforms
                
    # return non-empty results, i.e. cases in which we found some wordforms
    return df_enriched_lexicon[ df_enriched_lexicon['unknown_wordforms'].notnull() ]


# Run the function!
#
# ask the lexicon which wordforms it knows, and try to find new unknown wordforms in the corpus

lexicon_name="mnwlex"
corpus_to_search="zeebrieven"

# beware: lexicon and corpus may have different parts-of-speech sets in use
df = get_missing_wordforms(lexicon_name, "VERB", corpus_to_search, "VRB")

# After such a heavy process, it's a good idea to save the results

df.to_csv( "missing_wordforms.csv", index=False)

display_df(df, labels='Missing wordforms')
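
The core of the function above is a per-lemma set difference between corpus wordforms and lexicon wordforms. The sketch below shows that step on toy data with plain dictionaries and sets; `missing_wordforms` and all sample forms are hypothetical.

```python
from collections import defaultdict


def missing_wordforms(corpus_hits, lexicon_paradigms):
    """Map each lemma to the corpus wordforms that are absent from its lexicon paradigm."""
    seen = defaultdict(set)
    for lemma, wordform in corpus_hits:
        seen[lemma].add(wordform.lower())

    missing = {}
    for lemma, forms in seen.items():
        known = {form.lower() for form in lexicon_paradigms.get(lemma, ())}
        if not known:
            continue  # no lexicon paradigm to compare against, so skip this lemma
        unknown = forms - known
        if unknown:
            missing[lemma] = unknown
    return missing


# Hypothetical (lemma, wordform) corpus hits and lexicon paradigms
corpus_hits = [("denken", "Dencken"), ("denken", "denkt"), ("doen", "doet"), ("zeilen", "seylen")]
lexicon_paradigms = {"denken": ["denken", "denkt", "dacht"], "doen": ["doen", "doet"]}

result = missing_wordforms(corpus_hits, lexicon_paradigms)
print(result)
```

Grouping the corpus hits by lemma once, as done here, avoids filtering the full result table again for every lemma.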


## Treebanks

### Treebank search <a class="anchor" id="treebank-search"></a>

In [1]:
from chaininglib.search.TreebankQuery import *


print ("search...")
#tbq = create_treebank("cgn").pattern('//node[@cat="smain" and node[@rel="su" and @pt="vnw"] and node[@rel="hd" and @pt="ww"] and node[@rel="predc" and @cat="np" and node[@rel="det" and @pt="lid"] and node[@rel="hd" and @pt="n"]]]').search()
tbq = create_treebank("cgn").pattern("//node[@cat='pp' and node[@cat='ap' and node[@cat='np']]]").search()
#tbq = create_treebank("cgn").pattern("//node[@cat='np']").search()

print ("get XML...")

xml = tbq.xml()
print(xml)

print ("get trees and their string representations...")

trees = tbq.trees()

for tree in trees:
    display(tree.toString())

df = tbq.kwic()
    
display(df)

search...
get XML...
<node begin="11" cat="pp" end="16" id="19" rel="mod">
    <node begin="11" end="12" id="20" lcat="VZ(init)" lemma="van" pos="T701" postag="VZ(init)" pt="vz" rel="hd" root="van" vztype="init" word="van"/>
    <node begin="12" cat="ap" end="16" id="21" rel="obj1">
      <node begin="12" cat="np" end="15" id="22" rel="mod">
        <node begin="12" cat="np" end="14" id="23" rel="det">
          <node begin="12" end="13" id="24" lcat="LID(onbep,stan,agr)" lemma="een" lwtype="onbep" naamval="stan" npagr="agr" pos="U608" postag="LID(onbep,stan,agr)" pt="lid" rel="det" root="een" word="een"/>
          <node begin="13" end="14" genus="onz" getal="ev" graad="basis" id="25" lcat="N(soort,ev,basis,onz,stan)" lemma="paar" naamval="stan" ntype="soort" pos="T102" postag="N(soort,ev,basis,onz,stan)" pt="n" rel="hd" root="paar" word="paar"/>
        </node>
        <node begin="14" end="15" getal="mv" graad="basis" id="

' [ van  [  [  [ een paar ]/np huizen ]/np verder ]/ap ]/pp'

' [ van  [  [  [ vijftig zestig zeventig tachtig ]/list jaar ]/np oud ]/ap ]/pp'

' [ van  [  [ drieduizend jaar ]/np oud ]/ap ]/pp'

' [ van  [  [  [ elf dertien en veertien ]/conj jaar ]/np oud ]/ap ]/pp'

' [ van  [  [ meter  [  [ een of ]/mwu dertig ]/detp ]/np hoog ]/ap ]/pp'

' [  [  [  [ een paar ]/np metertjes ]/np verder ]/ap naar voren ]/pp'

' [ van  [  [ een beetje ]/np blasé ]/ap ]/pp'

' [ van  [  [ vijftien jaar ]/np oud ]/ap ]/pp'

' [ van  [  [ zevenhonderd meter ]/np lang ]/ap ]/pp'

" [ tot  [  [ 's avonds ]/np laat ]/ap ]/pp"

' [ van  [  [ zeven centimeter ]/np dik ]/ap ]/pp'

' [ van  [  [ dertig jaar ]/np oud ]/ap ]/pp'

' [ met  [  [ je mond ]/np vol ]/ap ]/pp'

" [ van  [  [ 's morgens ]/np vroeg  [ tot  [  [ 's avonds ]/np laat ]/ap ]/pp ]/ap ]/pp"

" [ tot  [  [ 's avonds ]/np laat ]/ap ]/pp"

' [ met  [  [ je ogen ]/np dicht ]/ap ]/pp'

Unnamed: 0,lemma 0,pos 0,wordform 0,lemma 1,pos 1,wordform 1,lemma 2,pos 2,wordform 2,lemma 3,...,wordform 4,lemma 5,pos 5,wordform 5,lemma 6,pos 6,wordform 6,lemma 7,pos 7,wordform 7
0,van,VZ(init),van,een,"LID(onbep,stan,agr)",een,paar,"N(soort,ev,basis,onz,stan)",paar,huis,...,verder,,,,,,,,,
1,van,VZ(init),van,vijftig,"TW(hoofd,prenom,stan)",vijftig,zestig,"TW(hoofd,prenom,stan)",zestig,zeventig,...,tachtig,jaar,"N(soort,ev,basis,onz,stan)",jaar,oud,"ADJ(vrij,basis,zonder)",oud,,,
2,van,VZ(init),van,drieduizend,"TW(hoofd,prenom,stan)",drieduizend,jaar,"N(soort,ev,basis,onz,stan)",jaar,oud,...,,,,,,,,,,
3,van,VZ(init),van,elf,"TW(hoofd,prenom,stan)",elf,dertien,"TW(hoofd,prenom,stan)",dertien,en,...,veertien,jaar,"N(soort,ev,basis,onz,stan)",jaar,oud,"ADJ(vrij,basis,zonder)",oud,,,
4,van,VZ(init),van,meter,"N(soort,ev,basis,zijd,stan)",meter,een,"LID(onbep,stan,agr)",een,of,...,dertig,hoog,"ADJ(postnom,basis,zonder)",hoog,,,,,,
5,een,"LID(onbep,stan,agr)",een,paar,"N(soort,ev,basis,onz,stan)",paar,meter,"N(soort,mv,dim)",metertjes,ver,...,naar,voren,BW(),voren,,,,,,
6,van,VZ(init),van,een,"LID(onbep,stan,agr)",een,beetje,"N(soort,ev,dim,onz,stan)",beetje,blasé,...,,,,,,,,,,
7,van,VZ(init),van,vijftien,"TW(hoofd,prenom,stan)",vijftien,jaar,"N(soort,ev,basis,onz,stan)",jaar,oud,...,,,,,,,,,,
8,van,VZ(init),van,zevenhonderd,"TW(hoofd,prenom,stan)",zevenhonderd,meter,"N(soort,ev,basis,zijd,stan)",meter,lang,...,,,,,,,,,,
9,tot,VZ(init),tot,de,"LID(bep,gen,evmo)",'s,avond,"N(soort,ev,basis,gen)",avonds,laat,...,,,,,,,,,,
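
The structure that the XPath query looks for (a `pp` with a child `ap` that itself has a child `np`) can be explored offline with Python's standard `xml.etree.ElementTree`. That module supports only a subset of XPath, so the nested predicate is unrolled into a loop here. The XML below is a hand-written, simplified stand-in, not real treebank output, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for treebank XML (not real corpus output)
SAMPLE = """
<node cat="top">
  <node cat="pp" rel="mod">
    <node pt="vz" rel="hd" word="van"/>
    <node cat="ap" rel="obj1">
      <node cat="np" rel="mod">
        <node pt="tw" rel="det" word="dertig"/>
        <node pt="n" rel="hd" word="jaar"/>
      </node>
      <node pt="adj" rel="hd" word="oud"/>
    </node>
  </node>
  <node cat="pp" rel="mod">
    <node pt="vz" rel="hd" word="in"/>
    <node cat="np" rel="obj1">
      <node pt="lid" rel="det" word="het"/>
      <node pt="n" rel="hd" word="bos"/>
    </node>
  </node>
</node>
"""


def pp_with_ap_np(root):
    """Find pp nodes that have a child ap which itself has a child np."""
    matches = []
    for pp in root.iter("node"):
        if pp.get("cat") != "pp":
            continue
        for ap in pp.findall("node[@cat='ap']"):
            if ap.find("node[@cat='np']") is not None:
                matches.append(pp)
                break
    return matches


def words(node):
    """Join the words of all leaf nodes below node, in document order."""
    return " ".join(leaf.get("word") for leaf in node.iter("node") if leaf.get("word"))


root = ET.fromstring(SAMPLE.strip())
phrases = [words(pp) for pp in pp_with_ap_np(root)]
print(phrases)
```

Only the first `pp` matches; the second one contains an `np` directly, without an intervening `ap`.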


### Which kinds of nouns are used in a prepositional complement of the verb *geven*? <a class="anchor" id="treebank-objects-geven"></a>

In [4]:
from chaininglib.search.TreebankQuery import *


print ("search...")

tbq = create_treebank("cgn").pattern('//node[node[@rel="hd" and @pt="ww" and @root="geven"] and node[@rel="obj1" and @pt="n"]]').search()


print ("get list of nouns which are part of an PP, as argument of predicate 'geven'...")

trees = tbq.trees()

list_of_nouns = []
for tree in trees:
    nouns = tree.extract(['pp', 'np'])
    list_of_nouns = list_of_nouns + nouns
    

display(list_of_nouns)
    
df = tbq.kwic(align_lemma='geven')
display(df)

search...
get list of nouns which are part of an PP, as argument of predicate 'geven'...


['haar woede',
 'de verdediging',
 "'t conservatorium als basis",
 'de glastelers',
 'de kampen die  komen te in Amsterdam Groningen Rotterdam en Tilburg staan',
 'een Miele',
 'veranderingen in de school',
 'de Davondsfonds-leden',
 'de -bestuursleden',
 'de ontwikkeling van lokale Nederlandstalige educatieve software',
 'het jaar',
 'ene kamp',
 'één stuk',
 'de computer']

Unnamed: 0,left context,lemma 0,pos 0,wordform 0,right context
0,uiting,geven,"WW(inf,vrij,zonder)",geven,aan haar woede
1,handschoenen er bij,geven,"WW(inf,vrij,zonder)",geven,
2,elkaar verrijking,geven,"WW(inf,vrij,zonder)",geven,
3,Gods liefde gestalte,geven,"WW(inf,vrij,zonder)",geven,
4,kwaadheid woorden,geven,"WW(inf,vrij,zonder)",geven,zonder te in de verdediging gaan
5,hoop,geven,"WW(inf,vrij,zonder)",geven,aan rechtelozen
6,ook concerten,geven,"WW(inf,vrij,zonder)",geven,vanuit 't conservatorium als basis
7,overbruggingskredieten,geven,"WW(inf,vrij,zonder)",geven,aan de glastelers
8,toelating,geven,"WW(inf,vrij,zonder)",geven,
9,medewerkers van Artsen Zonder Grenzen rondleidingen,geven,"WW(pv,tgw,mv)",geven,in de kampen die komen te in Amsterdam Groningen Rotterdam en Tilburg staan
