What's going on with the Unicode words in the Groene Boekje (see [Groene Boekje links notebook](ticclat_db_ingestion/groene_boekje_2-links.ipynb))? Why do queries for them against the database return `None`?

In [None]:
%load_ext autoreload

In [None]:
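# Mode 2 reloads all modules before each cell runs; mode 1 only reloads modules marked with %aimport
%autoreload 2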

In [None]:
import os
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy_utils import database_exists, create_database
import numpy as np

In [None]:
# Read the database connection settings and export them as environment variables
with open('ticclat_db_ingestion/ENVVARS.txt') as f:
    for line in f:
        # Split on the first '=' only, so values that contain '=' stay intact
        parts = line.split('=', 1)
        if len(parts) == 2:
            os.environ[parts[0].strip()] = parts[1].strip()

In [None]:
db_name = 'ticclat'
os.environ['dbname'] = db_name

In [None]:
engine = create_engine("mysql://{}:{}@localhost/{}?charset=utf8mb4".format(os.environ['user'], 
                                                           os.environ['password'], 
                                                           os.environ['dbname']))

print(database_exists(engine.url))

Session = sessionmaker(bind=engine)

In [None]:
import ticclat.ingest.groene_boekje as ingestGB
import ticclat
import tqdm

In [None]:
gb95 = ingestGB.load_GB95("/Users/pbos/projects/ticclat/data/GB/1995-2005/1995/GB95_002.csv")

In [None]:
link_df = ingestGB.create_GB95_link_df(gb95)

In [None]:
# Short aliases for the schema classes keep the queries readable
Lexicon = ticclat.ticclat_schema.Lexicon
Wordform = ticclat.ticclat_schema.Wordform

with ticclat.dbutils.session_scope(Session) as session:
    lexicon = session.query(Lexicon).filter(Lexicon.lexicon_name == 'Groene Boekje 1995').first()
    if lexicon is None:
        raise Exception("No lexicon found!")
    # Look up both wordforms of every link and report the ones missing from the database
    for idx, row in tqdm.tqdm(link_df.iterrows(), total=link_df.shape[0]):
        wf = session.query(Wordform).filter(Wordform.wordform == row['wordform_1']).first()
        corr = session.query(Wordform).filter(Wordform.wordform == row['wordform_2']).first()
        if wf is None:
            print("wordform_1 gives None: ", row['wordform_1'])
        if corr is None:
            print("wordform_2 gives None: ", row['wordform_2'])

Hm, this doesn't look like a unicode issue at all...
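
A quick way to back up that impression, as a rough sketch: check which of the reported words contain any non-ASCII character at all. The word list below is illustrative (`café` is made up to show a positive case) and `non_ascii_words` is a hypothetical helper, not part of `ticclat`.

```python
# Illustrative sample; 'café' is a made-up example of a word with a non-ASCII character.
missing_words = ["kaartenboek", "paard", "café"]


def non_ascii_words(words):
    """Return the words that contain at least one non-ASCII character."""
    return [w for w in words if not w.isascii()]


# If this prints an empty list for the real output, Unicode is not the culprit.
print(non_ascii_words(missing_words))
```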

Indeed, querying for these words yields nothing from the table:

```mysql
select * from wordforms where wordform like "%kaartenboek%";
```

Grepping the CSV file shows that these words all come from the disambiguation column, where they appear between parentheses. None of them has a separate entry of its own in the table. Why didn't they end up in the wordforms DataFrame?
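
To make the "between parentheses" observation concrete, here is a minimal sketch of pulling such words out of a text cell with a regular expression. The cell contents are invented for illustration, and the real layout of `GB95_002.csv` may well differ; `parenthesised` is a hypothetical helper.

```python
import re

# Hypothetical cell contents; the actual CSV columns may look different.
cells = ["atlas (kaartenboek)", "paard", "bank (zitmeubel)"]

# Match whatever sits between an opening and the next closing parenthesis.
PAREN = re.compile(r"\(([^)]*)\)")


def parenthesised(cell):
    """Return all parenthesised fragments found in one cell."""
    return PAREN.findall(cell)


for cell in cells:
    print(cell, "->", parenthesised(cell))
```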

In [None]:
wordforms = ingestGB.create_GB95_wordform_df(gb95)

In [None]:
any(wordforms.str.contains('kaartenboek'))

In [None]:
any(wordforms.str.contains('paard'))

In [None]:
ingestGB.contains_in_any_column(gb95, 'kaartenboek')

Ok, so at least it's actually in there... but why is it filtered out? And when?

Ohhh, wait! The disambiguation column is not even added!

After modifying `create_GB95_wordform_df` so that it also includes the disambiguation column:

In [None]:
wordforms = ingestGB.create_GB95_wordform_df(gb95)

In [None]:
wordforms

Right, that's a problem: the disambiguation column sometimes contains multiple words. We took care of that in the links data, but not in the wordforms data. Let's handle it there as well!
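
The idea of splitting multi-word cells can be sketched as below. This is not the actual change made in `ticclat.ingest.groene_boekje`; the separator pattern (commas, semicolons, or whitespace) and the sample cells are assumptions, and `split_disambiguation` is a hypothetical helper.

```python
import re


def split_disambiguation(cell, sep=r"[,;\s]+"):
    """Split one disambiguation cell into individual words.

    The separator pattern is an assumption about the data, not a known fact.
    """
    return [w for w in re.split(sep, cell.strip()) if w]


# Hypothetical cells: one with a single word, one with two words.
cells = ["kaartenboek", "zitmeubel, geldinstelling"]

# Flatten all cells into one sorted, de-duplicated list of wordforms.
flat_wordforms = sorted({w for cell in cells for w in split_disambiguation(cell)})
print(flat_wordforms)
```

With a pandas Series, the same effect could presumably be reached with `Series.str.split` followed by `Series.explode`.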

Another try:

In [None]:
wordforms = ingestGB.create_GB95_wordform_df(gb95)

In [None]:
any(wordforms.str.contains('kaartenboek'))

There we go.