# Groene Boekje part 2: wordform links

In [the first Groene Boekje notebook](groene_boekje.ipynb) we cleaned the data for wordforms only. Now we extend that with wordform links based on the relations within each row. For this we will probably also need to modify `dbutils`.

In [1]:
%load_ext autoreload

In [284]:
%autoreload

In [3]:
import ticclat.dbutils
import pandas as pd
import numpy as np

In [4]:
# Read information to connect to the database and put it in environment variables
import os
with open('ENVVARS.txt') as f:
    for line in f:
        # split on the first '=' only, so values (e.g. passwords) may contain '='
        parts = line.split('=', 1)
        if len(parts) == 2:
            os.environ[parts[0]] = parts[1].strip()

In [5]:
db_name = 'ticclat'
os.environ['dbname'] = db_name

In [6]:
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy_utils import database_exists, create_database

engine = create_engine("mysql://{}:{}@localhost/{}".format(os.environ['user'], 
                                                           os.environ['password'], 
                                                           os.environ['dbname']))
if not database_exists(engine.url):
    create_database(engine.url)

print(database_exists(engine.url))

Session = sessionmaker(bind=engine)

True


In [7]:
from sqlalchemy import inspect

inspector = inspect(engine)

In [8]:
# Get table information
print(inspector.get_table_names())

['anahashes', 'corpora', 'corpusId_x_documentId', 'documents', 'lexica', 'lexical_source_wordform', 'source_x_wordform_link', 'text_attestations', 'wordform_links', 'wordforms']


# Load Groene Boekje data into Pandas

In [9]:
GB_basepath = "/Users/pbos/projects/ticclat/data/GB/"

In [10]:
GB1914_path = GB_basepath + "1914/22722-8.txt"
GB1995_path = GB_basepath + "1995-2005/1995/GB95_002.csv"
GB2005_path = GB_basepath + "1995-2005/2005/GB05_002.csv"

In [11]:
df_GB1995 = pd.read_csv(GB1995_path, sep=';', names=["word", "syllables", "see also", "disambiguation",
                                                     "grammatical tag", "article",
                                                     "plural/past/attrib", "plural/past/attrib syllables",
                                                     "diminu/compara/past plural", "diminu/compara/past plural syllables",
                                                     "past perfect/superla", "past perfect/superla syllables"],
                        encoding='utf8') # encoding necessary for later loading into sqlalchemy!

In [43]:
# df_GB1995[~df_GB1995["see also"].isnull()].sample(10)
# df_GB1995[~df_GB1995["disambiguation"].isnull()].sample(10)
df_GB1995[~df_GB1995["disambiguation"].isnull() & df_GB1995["disambiguation"].str.contains(' ')].sample(10)

Unnamed: 0,word,syllables,see also,disambiguation,grammatical tag,article,plural/past/attrib,plural/past/attrib syllables,diminu/compara/past plural,diminu/compara/past plural syllables,past perfect/superla,past perfect/superla syllables
102500,vorst2,vorst,,(andere bett.),znw.,de[m.],vorsten,vor/sten,,,,
63764,oor,oor,,(andere bett. dan afstammeling),znw.,het,oren,oren,oortje,oor/tje,,
85589,stand2,stand,,(andere bett.),znw.,de[m.],standen,stan/den,,,,
36423,hol3,hol,,(andere bett.),znw.,het,holen,ho/len,holletje,hol/le/tje,,
79581,schoffel1,schof/fel,,(het schoffelen),znw.,de[m.],,,,,,
47535,kuch2,kuch,,(andere bett.),znw.,de[m.],kuchen,ku/chen,,,,
75138,rek1,rek,,"(het rekken, veerkracht)",znw.,de[m.],,,,,,
104863,weer2,weer,zie ook weder2,"(luchtgestelheid, bederf)",znw.,het,,,weertje,weer/tje,,
66995,pand2,pand,,(andere bett.),znw.,het,panden,pan/den,,,,
13794,brik1,brik,,"(steen, voorwerpsnaam)",znw.,de[m.],brikken,brik/ken,,,,


## Clean-up
The clean-up differs from the first notebook now that we also need links.
- We need to retain the columns while cleaning them, because the rows define the links.
- The cleaning of the wordform columns is the same as before, but is now done per column instead of all in one go.
- Because we must retain rows, operations like splitting on commas now mean that the entire row must be duplicated.

Some notes picked up from random sample checking:
- The disambiguation words can also carry duplicate-word numbers! These are irrelevant to us, since we only deal with word forms, not semantics, so we have to remove them there as well.

### Links
The different types of links we could extract from this table are:

- "see also" (column 3): this is usually a **spelling variant**, so very relevant in our case. Both are **correct** words.
- "disambiguation" (column 4):
    + This column is always between parentheses.
    + Usually it is a semantically very similar word.
    + Sometimes the disambiguation consists of multiple words. Unlike multi-word entries in the wordform columns, these are usually genuinely separate words, like `slechte waar` (for `kamelot2`) or `het slopen` (for `sloop1`), rather than a single "multi-word wordform". We probably can't rely on this, but we could look up the separate words. Even then, such an entry is closer to a short explanatory phrase than to a wordform: it cannot be entered into the database as one wordform, and its words cannot be entered separately either, because the meaning lies in the combination and is lost when the words are taken apart.
    + There are even entries that have multiple disambiguating words, separated by a comma. These can easily be used separately.
    + Many of them are `(andere bett.)` or, more rarely, `(andere bet.)`, which I guess means "other meaning(s)", though I am not sure.
    + So to sum up, what we can do with this column:
        * Remove `(andere bett.)`
        * Strip parentheses
        * Split by comma
        * **First approximation**: remove multi-word entries
        * *Use remaining words as **semantic** links*
        * Again, both are **correct** words.
- Columns 7, 9 and 11 give us **morphological links** of different types. Again, all **correct** words.
    + We could in principle deduce the type from the grammatical tag of the word in column 5, but will not do so for now.
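
The plan for the disambiguation column can be sketched on single entries in plain Python. This is only an illustration of the rules listed above, not the pandas code used further down; the helper name `clean_disambiguation` is made up for this sketch, and the sample entries are taken from the table sample shown earlier.

```python
def clean_disambiguation(entry):
    """Return the single-word disambiguation terms of one raw column entry."""
    if entry is None:
        return []
    # Strip the surrounding parentheses.
    text = entry.strip('()')
    # Drop the "other meaning(s)" placeholder entirely.
    if text in ('andere bett.', 'andere bet.'):
        return []
    # Split on commas, then keep only single-word terms (first approximation).
    terms = [term.strip() for term in text.split(',')]
    return [term for term in terms if term and ' ' not in term]


print(clean_disambiguation('(andere bett.)'))            # []
print(clean_disambiguation('(het schoffelen)'))          # []
print(clean_disambiguation('(steen, voorwerpsnaam)'))    # ['steen', 'voorwerpsnaam']
print(clean_disambiguation('(het rekken, veerkracht)'))  # ['veerkracht']
```

Returning a list per entry keeps the "split by comma" step explicit; the pandas version below instead keeps the surviving terms as one comma-separated string and splits them in a later pass.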

# Clean up links

This time we clean up the dataframe as a table, not as a single row of wordforms.

We can drop a few columns though.

In [175]:
link_data = df_GB1995.drop(["syllables", "grammatical tag", "article",
                            "plural/past/attrib syllables", "diminu/compara/past plural syllables", "past perfect/superla syllables"], axis=1)

In [70]:
link_data.sample(10)

Unnamed: 0,word,see also,disambiguation,plural/past/attrib,diminu/compara/past plural,past perfect/superla
78325,satelliettelevisie,,,,,
91214,tijdsruimte,zie ook tijdruimte,,"tijdsruimten, tijdsruimtes",,
50372,lias,,"(veter, bundel)",liassen,,
90644,testen,,,testte,,getest
30515,gespuug,,,,,
58775,neteldier,,,neteldieren,,
105984,werkploeg,,,werkploegen,,
57526,muziekfestival,,,muziekfestivals,,
70663,popstation,,,popstations,,
22424,dropping,,,droppings,,


## Clean columns 2 and 3

We have to remove `zie ook ` ("see also ") from column 2. Column numbers now refer to `link_data`, in which the `syllables` column has been dropped.

For column 3 we will:
- strip the parentheses
- remove multi-word entries (including especially `andere bett.`).

### Clean "see also" (column 2)

In [176]:
link_data['see also'] = link_data['see also'].str.replace('zie ook ', '')

In [177]:
link_data[~link_data['see also'].isnull()].sample(5)

Unnamed: 0,word,see also,disambiguation,plural/past/attrib,diminu/compara/past plural,past perfect/superla
81109,@@s-Hertogenbosch,Den Bosch,,,,
53909,medewerken,meewerken,,werkte mede,,medegewerkt
32725,grief,grieve,,grieven,,
56058,minstreel,"meistreel, menestreel",,minstrelen,,
68205,"pensioenaanvraag, pensioenaanvrage",pensioensaanvraag,,pensioenaanvragen,,


### Clean "disambiguation" (column 3)

In [178]:
link_data['disambiguation'] = link_data['disambiguation'].str.strip('()')

In [179]:
# use .loc instead of chained indexing, which may silently assign to a copy
link_data.loc[link_data['disambiguation'].isin(['andere bett.', 'andere bet.']), 'disambiguation'] = None

In [180]:
link_data[~link_data['disambiguation'].isnull()].sample(5)

Unnamed: 0,word,see also,disambiguation,plural/past/attrib,diminu/compara/past plural,past perfect/superla
48139,kweek1,,"het kweken, het gekweekte",,,
55092,metselsteen2,,stofnaam,,,
70092,pof2,,slag,poffen,,
31316,gier1,,mestvocht,,,
96229,veer3,,"overvaart, beurtvaart",veren,,


1. First remove the isolated multi-words, i.e. the ones without commas.
2. Then, if necessary (turns out, it's only 20-30 rows left), let's make a nice regex to replace multi-words, both isolated ones and ones in comma separated lists.

In [181]:
link_data.loc[link_data['disambiguation'].str.contains(" ", na=False)
              & ~link_data['disambiguation'].str.contains(",", na=False), 'disambiguation'] = None

In [182]:
# rows where a space follows something other than a comma, i.e. a multi-word term
multi_word = link_data['disambiguation'].str.contains("[^,] ", na=False)
link_data.loc[multi_word, 'disambiguation'] = (
    link_data.loc[multi_word, 'disambiguation']
     .str.split(', ')
     .map(lambda x: [i for i in x if ' ' not in i])
     .map(lambda x: None if len(x) == 0 else ', '.join(x))
)

In [183]:
link_data['disambiguation'][link_data['disambiguation'].str.contains("[^,] ", na=False)]

Series([], Name: disambiguation, dtype: object)

In [184]:
link_data['disambiguation'][link_data['disambiguation'].str.contains(" ", na=False)].sample(5)

20970         vermomming, persoon
106291                spel, drank
94052        katoengaren, weefsel
104863    luchtgestelheid, bederf
77875                 bont, zwart
Name: disambiguation, dtype: object

One remaining pesky thing: `kippenloop, -korf`. Is this a pattern?

In [185]:
link_data['disambiguation'][link_data['disambiguation'].str.contains(", -", na=False)]

75359    kippenloop, -korf
Name: disambiguation, dtype: object

Nope, so let's just get rid of it here and now.

In [186]:
link_data.loc[link_data['disambiguation'].str.contains(", -", na=False), 'disambiguation'] = None

In [187]:
link_data.sample(10)

Unnamed: 0,word,see also,disambiguation,plural/past/attrib,diminu/compara/past plural,past perfect/superla
15094,carrier,,,carriers,,
72602,pruik,,,pruiken,,
13315,bovenkledingmarkt,,,,,
56523,moedertaal,,,moedertalen,,
80965,serie,,,"series, serie@\n""",,
83456,spaarduit,,,spaarduiten,,
5382,attaque,,,attaques,,
18358,dagsucces,,,dagsuccessen,,
98621,verstandhouding,,,,,
10544,"bie@\nnale""",,,,,


## Remove empty lines

For links, we don't need rows that only have a `word` entry, since such a row by definition contains no links. The word may still be linked to another word through that other word's `see also` column, but then the link is already present in that other row, so the word's own row can safely be removed.

In [188]:
link_data = link_data.dropna(how='all', subset=["see also", "disambiguation", "plural/past/attrib", "diminu/compara/past plural", "past perfect/superla"])

# Convert to links!

Actually, for now, we can keep this rather simple: we just make a table with two columns, where the first column has the wordforms in the `word` column and the second has the wordforms in the other columns.

We will then split comma separated words in a second pass to keep things simple.
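
The two passes can be sketched in plain Python on a single row. This is an illustration only (the notebook itself does this with pandas); `row_to_links` is a made-up helper name, and the sample rows are taken from the outputs above.

```python
from itertools import product


def row_to_links(word, related):
    """Expand one table row into (wordform_1, wordform_2) pairs.

    `word` is the content of the `word` column, `related` holds the
    non-empty cells of the other columns in that row.
    """
    links = []
    for cell in related:
        # Every comma-separated alternative on the left is paired with
        # every comma-separated alternative on the right.
        lefts = word.split(', ')
        rights = cell.split(', ')
        links.extend(product(lefts, rights))
    return links


for pair in row_to_links('tijdsruimte', ['tijdruimte', 'tijdsruimten, tijdsruimtes']):
    print(pair)
for pair in row_to_links('pensioenaanvraag, pensioenaanvrage',
                         ['pensioensaanvraag', 'pensioenaanvragen']):
    print(pair)
```

A row with two alternatives in `word` and two in another column thus yields four links, which matches the blocks of four rows visible for entries such as `zuiveringschap` in the result.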

In [193]:
link_df = (link_data.set_index('word').stack().reset_index().drop('level_1', axis=1)
                    .rename({'word': 'wordform_1', 0: 'wordform_2'}, axis=1))

In [195]:
link_df.head()

Unnamed: 0,wordform_1,wordform_2
0,a,a@@s
1,a,a@@tje
2,aagt,aagten
3,aai,aaien
4,aai,aaitje


In [197]:
has_comma1 = link_df['wordform_1'].str.contains(',')
link_df = pd.concat((link_df[~has_comma1],) + tuple(pd.DataFrame({'wordform_1': row['wordform_1'].split(', '),
                                                                  'wordform_2': (row['wordform_2'],) * len(row['wordform_1'].split(', '))})
                                                     for ix, row in link_df[has_comma1].iterrows()))

has_comma2 = link_df['wordform_2'].str.contains(',')
link_df = pd.concat((link_df[~has_comma2],) + tuple(pd.DataFrame({'wordform_1': (row['wordform_1'],) * len(row['wordform_2'].split(', ')),
                                                                  'wordform_2': row['wordform_2'].split(', ')})
                                                     for ix, row in link_df[has_comma2].iterrows()))

In [207]:
link_df.tail(12)

Unnamed: 0,wordform_1,wordform_2
0,zuiveringschap,zuiveringschappen
1,zuiveringschap,zuiveringsschappen
0,zuiveringsschap,zuiveringschappen
1,zuiveringsschap,zuiveringsschappen
0,zweetpoeder2,zweetpoeders
1,zweetpoeder2,zweetpoeiers
0,zweetpoeier,zweetpoeders
1,zweetpoeier,zweetpoeiers
0,zwoerd,zwoerden
1,zwoerd,zwoorden


# Clean wordforms

Almost like in [the first Groene Boekje notebook](groene_boekje.ipynb), but now per column. The main changes are:
- We do it per column
- We must not do `.unique()` at the end, because duplicate words in a column may be linked to different words in the other column! Unique should be row-based.
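
To make the last point concrete, here is a small plain-Python sketch of row-based uniqueness. The helper name `unique_links` is made up, and the repeated final pair is added purely for illustration. In pandas, `DataFrame.drop_duplicates()` on the two-column link table should give the same row-based behaviour.

```python
def unique_links(links):
    """Drop repeated (wordform_1, wordform_2) pairs, keeping first-seen order."""
    return list(dict.fromkeys(links))


links = [
    ('zwoerd', 'zwoerden'),
    ('zwoerd', 'zwoorden'),
    ('zwoord', 'zwoerden'),
    ('zwoord', 'zwoorden'),
    ('zwoerd', 'zwoerden'),  # illustrative repeat of the first pair
]

for pair in unique_links(links):
    print(pair)
```

Note that `zwoerd` still occurs twice on the left after deduplication. Taking unique values per column would collapse it to one entry and break its pairing with `zwoerden` and `zwoorden`.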

We also redo the above in one go in the next cell.

In [218]:
link_data = df_GB1995.drop(["syllables", "grammatical tag", "article",
                            "plural/past/attrib syllables", "diminu/compara/past plural syllables", "past perfect/superla syllables"], axis=1)

# clean link_data
link_data['see also'] = link_data['see also'].str.replace('zie ook ', '')
link_data['disambiguation'] = link_data['disambiguation'].str.strip('()')
# use .loc instead of chained indexing, which may silently assign to a copy
link_data.loc[link_data['disambiguation'].isin(['andere bett.', 'andere bet.']), 'disambiguation'] = None
link_data.loc[link_data['disambiguation'].str.contains(" ", na=False)
              & ~link_data['disambiguation'].str.contains(",", na=False), 'disambiguation'] = None
multi_word = link_data['disambiguation'].str.contains("[^,] ", na=False)
link_data.loc[multi_word, 'disambiguation'] = (
    link_data.loc[multi_word, 'disambiguation']
     .str.split(', ')
     .map(lambda x: [i for i in x if ' ' not in i])
     .map(lambda x: None if len(x) == 0 else ', '.join(x))
)
link_data.loc[link_data['disambiguation'].str.contains(", -", na=False), 'disambiguation'] = None
link_data = link_data.dropna(how='all', subset=["see also", "disambiguation", "plural/past/attrib", "diminu/compara/past plural", "past perfect/superla"])

# convert to link_df
link_df = (link_data.set_index('word').stack().reset_index().drop('level_1', axis=1)
                    .rename({'word': 'wordform_1', 0: 'wordform_2'}, axis=1))
has_comma1 = link_df['wordform_1'].str.contains(',')
link_df = pd.concat((link_df[~has_comma1],) + tuple(pd.DataFrame({'wordform_1': row['wordform_1'].split(', '),
                                                                  'wordform_2': (row['wordform_2'],) * len(row['wordform_1'].split(', '))})
                                                     for ix, row in link_df[has_comma1].iterrows()))

has_comma2 = link_df['wordform_2'].str.contains(',')
link_df = pd.concat((link_df[~has_comma2],) + tuple(pd.DataFrame({'wordform_1': (row['wordform_1'],) * len(row['wordform_2'].split(', ')),
                                                                  'wordform_2': row['wordform_2'].split(', ')})
                                                     for ix, row in link_df[has_comma2].iterrows()))

link_df = link_df.reset_index(drop=True)

In [228]:
def clean_wordform_series(wordform_series, remove_duplicates=False):
    # remove colons
    wordform_series = wordform_series.str.replace(':', '', regex=False)
    # strip whitespace
    wordform_series = wordform_series.str.strip()
    # remove trailing duplicate-word numbers (e.g. "aak1" -> "aak")
    wordform_series = wordform_series.str.replace(r'[0-9]$', '', regex=True)
    # remove parentheses around some words
    wordform_series = wordform_series.sort_values().str.strip("()")
    # remove abbreviations (entries ending in a period)
    abbreviation = wordform_series.str.contains(r'\.$')
    wordform_series = wordform_series[~abbreviation]

    if remove_duplicates:
        wordform_series = pd.Series(wordform_series.unique())
    return wordform_series
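
As a worked example, the same cleaning rules can be applied to single strings in plain Python. This is a sketch for checking the rules by hand, not a replacement for the pandas function above; `clean_wordform` is a made-up name, and the sample values are taken from the cleanliness checks further down.

```python
import re


def clean_wordform(wordform):
    """Clean one wordform; return None if it should be dropped."""
    # Remove colons and surrounding whitespace.
    cleaned = wordform.replace(':', '').strip()
    # Remove a trailing duplicate-word number, e.g. "aak1" -> "aak".
    cleaned = re.sub(r'[0-9]$', '', cleaned)
    # Remove parentheses around the word.
    cleaned = cleaned.strip('()')
    # Entries ending in a period are treated as abbreviations and dropped.
    if cleaned.endswith('.'):
        return None
    return cleaned


for raw in ['AOW@@er:', 'aak1', '(werken)', 'ganzenvederen ', 'znw.', 'st.-jakobsschelp']:
    print(repr(raw), '->', repr(clean_wordform(raw)))
```

Only a period at the very end marks an abbreviation, which is why `st.-jakobsschelp` survives while `znw.` is dropped.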

In [245]:
link_clean_df = link_df.copy()
link_clean_df['wordform_1'] = clean_wordform_series(link_clean_df['wordform_1'])
link_clean_df['wordform_2'] = clean_wordform_series(link_clean_df['wordform_2'])
# some links will be removed (abbreviations), so drop those rows
link_clean_df = link_clean_df.dropna()

In [248]:
len(link_df) - len(link_clean_df)

10

In [230]:
def check_cleanliness_wordform_series(wordform_series, head=5):
    print("Random sample:")
    display(wordform_series.sample(10))
    print("Colons, periods:")
    display(wordform_series[wordform_series.str.contains(':')].head(head))
    display(wordform_series[wordform_series.str.contains('.', regex=False)].head(head))
    print("White space padding:")
    display(wordform_series[wordform_series.str.contains('^ | $')].head(head))
    print("Trailing numbers:")
    display(wordform_series[wordform_series.str.contains('[0-9]$', regex=True)].head(head))
    print("Parentheses:")
    display(wordform_series[wordform_series.str.contains(r'\(|\)', regex=True)].head(head))
    print("Abbreviations:")
    display(wordform_series[wordform_series.str.contains(r'\.$')].head(head))
    
    print("Finally, just the first entries of sorted df:")
    display(wordform_series.sort_values().head(head))
    display(wordform_series.sort_values().tail(head))

In [231]:
check_cleanliness_wordform_series(link_df['wordform_1'])
check_cleanliness_wordform_series(link_df['wordform_2'])

Random sample:


93818                  hormon
5345                 balsport
50832             opensnijden
32118                 isobaar
28796               holocaust
46475    niet-gouvernementeel
62240               scalperen
6160              bee@\digen"
31141                inlossen
9440         bioscoopbezoeker
Name: wordform_1, dtype: object

Colons, periods:


3940         AOW@@er:
12875    chocolaatje:
13305    collegaatje:
19322       EHBO@@er:
20633      extraatje:
Name: wordform_1, dtype: object

4960                   B.
5801                  bc.
41423                  M.
68778    st.-jakobsschelp
Name: wordform_1, dtype: object

White space padding:


Series([], Name: wordform_1, dtype: object)

Trailing numbers:


8     aak1
9     aak1
10    aak2
11    aak2
12    aal1
Name: wordform_1, dtype: object

Parentheses:


76330    van zijn positieve(n) zijn
84759                      (werken)
84760                      (werken)
86155                      (werken)
86156                      (werken)
Name: wordform_1, dtype: object

Abbreviations:


4960      B.
5801     bc.
41423     M.
Name: wordform_1, dtype: object

Finally, just the first entries of sorted df:


84759          (werken)
86155          (werken)
86156          (werken)
84760          (werken)
64468    @@s-Gravenhage
Name: wordform_1, dtype: object

88280     zwoel
94873    zwoerd
94874    zwoerd
94875    zwoord
94876    zwoord
Name: wordform_1, dtype: object

Random sample:


60944          rolwisselingen
18119              drijfassen
76422            vastgegroeid
52705            overschoenen
43518        middagmaaltijden
77231             verdoemenis
12118            burgerhuizen
59344         regeringszetels
74614               uitgebijt
13634    computerprogramma@@s
Name: wordform_2, dtype: object

Colons, periods:


1115      abc@@tje:
1205    acaciaatje:
1275      accuutje:
3095    agendaatje:
3561    amforaatje:
Name: wordform_2, dtype: object

3386        znw.
7060     eigenn.
7271        znw.
20249       znw.
30231       znw.
Name: wordform_2, dtype: object

White space padding:


93553    ganzenvederen 
93556    ganzenvederen 
Name: wordform_2, dtype: object

Trailing numbers:


29       aal1
7609    best3
7844     bes2
8064    beet1
8796    beet1
Name: wordform_2, dtype: object

Parentheses:


57881                       spits);
59415                     koor(zang
76330    bij zijn positieve(n) zijn
Name: wordform_2, dtype: object

Abbreviations:


3386        znw.
7060     eigenn.
7271        znw.
20249       znw.
30231       znw.
Name: wordform_2, dtype: object

Finally, just the first entries of sorted df:


15478       @@s-Gravenhage
15473    @@s-Hertogenbosch
1135           A-biljetten
1142              A-bommen
3938            A-omroepen
Name: wordform_2, dtype: object

94874                   zwoorden
94876                   zwoorden
3085       zworen af zweerden af
75776    zworen uit zweerden uit
88194            zworen zweerden
Name: wordform_2, dtype: object

In [246]:
check_cleanliness_wordform_series(link_clean_df['wordform_1'])
check_cleanliness_wordform_series(link_clean_df['wordform_2'])

Random sample:


28268     hersenspoelen
15526           dentaal
65340        sloopwagen
28673              hoep
49180    onheilsprofeet
74344            tutten
72940        toneellamp
44803            mormel
88856           ballade
16953       doodernstig
Name: wordform_1, dtype: object

Colons, periods:


Series([], Name: wordform_1, dtype: object)

68778    st.-jakobsschelp
Name: wordform_1, dtype: object

White space padding:


Series([], Name: wordform_1, dtype: object)

Trailing numbers:


Series([], Name: wordform_1, dtype: object)

Parentheses:


76330    van zijn positieve(n) zijn
Name: wordform_1, dtype: object

Abbreviations:


Series([], Name: wordform_1, dtype: object)

Finally, just the first entries of sorted df:


64468       @@s-Gravenhage
64489    @@s-Hertogenbosch
1135              A-biljet
1142                 A-bom
3938              A-omroep
Name: wordform_1, dtype: object

88280     zwoel
94873    zwoerd
94874    zwoerd
94875    zwoord
94876    zwoord
Name: wordform_1, dtype: object

Random sample:


84983           West-Indische
12350    camouflagetechnieken
15054              decimeerde
67865           stageplaatsen
38975                laterale
77380                verengde
83007       waarnemingsvelden
83840                weegbare
40470             liquidaties
74139              tuinbedden
Name: wordform_2, dtype: object

Colons, periods:


Series([], Name: wordform_2, dtype: object)

68778    st.-jakobsschelpen
76051       vademen. vadems
Name: wordform_2, dtype: object

White space padding:


Series([], Name: wordform_2, dtype: object)

Trailing numbers:


Series([], Name: wordform_2, dtype: object)

Parentheses:


57881                       spits);
59415                     koor(zang
76330    bij zijn positieve(n) zijn
Name: wordform_2, dtype: object

Abbreviations:


Series([], Name: wordform_2, dtype: object)

Finally, just the first entries of sorted df:


15478       @@s-Gravenhage
15473    @@s-Hertogenbosch
1135           A-biljetten
1142              A-bommen
3938            A-omroepen
Name: wordform_2, dtype: object

94874                   zwoorden
94876                   zwoorden
3085       zworen af zweerden af
75776    zworen uit zweerden uit
88194            zworen zweerden
Name: wordform_2, dtype: object

### Normalizing diacritics

In [249]:
# note that these are regex formatted, i.e. with special characters escaped;
# the keys are raw strings so the backslashes reach the regex engine unchanged
diacritic_markers = {r'@`': '\u0300',    # accent grave
                     r"@\'": '\u0301',   # accent aigu
                     r'@\\': '\u0308',   # trema
                     r'@\+': '\u0327',   # cedilla
                     r'@\^': '\u0302',   # accent circumflex
                     r'@=': '\u0303',    # tilde
                     r'@@': "'",         # apostrophe (not actually a diacritic)
                     r'@2': '\u2082',    # subscript 2
                     r'@n': '\u0308n'    # trema followed by n
                    }

In [251]:
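for column in link_clean_df:
    for marker, umarker in diacritic_markers.items():
        # the markers are regex patterns, so regex=True must be explicit
        # (newer pandas versions default to literal replacement)
        link_clean_df[column] = link_clean_df[column].str.replace(marker, umarker, regex=True)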
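
The marker replacement can also be sketched on single strings with only the standard library. Two things here are assumptions rather than part of the notebook: the markers are replaced in a single left-to-right pass (so the order of the dictionary entries cannot matter), and the result is composed with `unicodedata.normalize('NFC', ...)`, which turns a letter plus combining mark into one precomposed character where one exists. Whether the database should store composed or decomposed forms would need to be checked. The names `MARKERS`, `MARKER_RE` and `normalize_diacritics` are made up for this sketch.

```python
import re
import unicodedata

# Literal (unescaped) versions of the markers in the dict above.
MARKERS = {
    '@`': '\u0300',   # accent grave
    "@'": '\u0301',   # accent aigu
    '@\\': '\u0308',  # trema
    '@+': '\u0327',   # cedilla
    '@^': '\u0302',   # accent circumflex
    '@=': '\u0303',   # tilde
    '@@': "'",        # apostrophe
    '@2': '\u2082',   # subscript 2
    '@n': '\u0308n',  # trema followed by n
}
# One alternation of all markers, each escaped for use in a regex.
MARKER_RE = re.compile('|'.join(re.escape(marker) for marker in MARKERS))


def normalize_diacritics(wordform):
    """Replace the @-markers and compose the result (NFC)."""
    replaced = MARKER_RE.sub(lambda match: MARKERS[match.group(0)], wordform)
    return unicodedata.normalize('NFC', replaced)


print(normalize_diacritics('bee@\\digen'))
print(normalize_diacritics('computerprogramma@@s'))
```

NFC leaves the subscript 2 and the apostrophe untouched; a compatibility form such as NFKC would rewrite the subscript into a plain digit, which is probably not wanted here.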

In [271]:
link_clean_df.sort_values(by=link_clean_df.columns.tolist()).sample(10)

Unnamed: 0,wordform_1,wordform_2
25256,godheid,godheden
41597,maatstaf,maatstaven
44022,minnelijk,minnelijke
1641,"adenoïde""","adenoïden"""
44581,moloch,molochs
76935,verbaasd,verbaasde
6874,belastingopbrengst,belastingopbrengsten
86096,wriggelen,gewriggeld
42243,masker,maskers
26585,haageik,haageiken


# Load Groene Boekje DataFrames into TICCLAT database

In [285]:
with ticclat.dbutils.session_scope(Session) as session:
    for ix, row in link_clean_df.iterrows():
        w1 = session.query(ticclat.ticclat_schema.Wordform).filter_by(wordform=row["wordform_1"]).one()
        w2 = session.query(ticclat.ticclat_schema.Wordform).filter_by(wordform=row["wordform_2"]).one()
        w1.links.append(w2)
        print(w1.links)
        if ix >= 10:
            break

InvalidRequestError: One or more mappers failed to initialize - can't proceed with initialization of other mappers. Triggering mapper: 'Mapper|WordformLink|wordform_links'. Original exception was: Could not determine join condition between parent/child tables on relationship WordformLink.wordform_1 - there are multiple foreign key paths linking the tables.  Specify the 'foreign_keys' argument, providing a list of those columns which should be counted as containing a foreign key reference to the parent table.

In [281]:
with ticclat.dbutils.session_scope(Session) as session:
    ticclat.dbutils.add_lexicon(session, "Groene Boekje 1995", link_clean_df.to_frame(name='wordform'))

AttributeError: 'DataFrame' object has no attribute 'to_frame'