# Groene boekje
* Use `dbutils.add_lexicon`!
* Use `bulk_add_anahashes` and `connect_anahases_to_wordforms`

INL takes about 3 minutes (without anahashes)

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import ticclat.dbutils
import pandas as pd
import numpy as np

## To do

* Add relationships between models (should make processing an xml file faster)
* Use sessions better: https://docs.sqlalchemy.org/en/latest/orm/session_basics.html#when-do-i-construct-a-session-when-do-i-commit-it-and-when-do-i-close-it
* Add multiple documents
* Extract vocabulary

In [None]:
# Read information to connect to the database and put it in environment variables
import os
with open('ENVVARS.txt') as f:
    for line in f:
        parts = line.split('=', 1)  # split on the first '=' only, so values may contain '='
        if len(parts) == 2:
            os.environ[parts[0]] = parts[1].strip()

In [None]:
db_name = 'ticclat'
os.environ['dbname'] = db_name

In [None]:
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy_utils import database_exists, create_database

engine = create_engine("mysql://{}:{}@localhost/{}".format(os.environ['user'], 
                                                           os.environ['password'], 
                                                           os.environ['dbname']))
if not database_exists(engine.url):
    create_database(engine.url)

print(database_exists(engine.url))

Session = sessionmaker(bind=engine)

In [None]:
# from ticclat.lexicon_schema import AnalyzedWordform, Document, Lemmata, TokenAttestation, Wordform, Base
import ticclat.ticclat_schema as schema

In [None]:
# create tables
schema.Base.metadata.create_all(engine)

In [None]:
from sqlalchemy import inspect

inspector = inspect(engine)

In [None]:
# Get table information
print(inspector.get_table_names())

# Load Groene Boekje data into Pandas

In [None]:
GB_basepath = "/Users/pbos/projects/ticclat/data/GB/"

In [None]:
GB1914_path = GB_basepath + "1914/22722-8.txt"
GB1995_path = GB_basepath + "1995-2005/1995/GB95_002.csv"
GB2005_path = GB_basepath + "1995-2005/2005/GB05_002.csv"

In [None]:
df_GB1995 = pd.read_csv(GB1995_path, sep=';', names=["word", "syllables", "see also", "disambiguation",
                                                     "grammatical tag", "article",
                                                     "plural/past/attrib", "plural/past/attrib syllables",
                                                     "diminu/compara/past plural", "diminu/compara/past plural syllables",
                                                     "past perfect/superla", "past perfect/superla syllables"],
                        encoding='utf8') # encoding necessary for later loading into sqlalchemy!

In [None]:
# df_GB1995
# df_GB1995['see also'].dropna().map(lambda x: x[:8]).unique()

There's a lot of stuff in there... clean up time.

Multiple columns:
- Some entries have "@" in them. They seem to be general rules for some kinds of words. Not sure what to do with these, so will just filter out for now.

First column:
- Has entries with multiple values (e.g. "Zwols, Zwoller")
- Sometimes has multiple rows with the same word, probably with different meanings, e.g. "aal1", "aal2" and "aal3".
    + In this case, the fourth column has a note to disambiguate the meanings.

Second column: Splits words according to syllables, we can disregard this.

Third: a "See also \[other word\]" note (they all start with "Zie ook")

Fourth: disambiguation of a duplicate word (see first column)

Fifth: grammatical tag, i.e. noun, adjective, verb, etc.

Sixth: the proper definite article for the noun ("de" or "het").

Seventh: first inflection form. Sometimes left empty.
- If a noun: plural form
- If a verb: past tense singular
- If an adjective (or pronoun (`vnw.`)): attributive form

Ninth: second inflection form. Often left empty, probably for regular or easy forms.
- If a noun: diminutive
- If an adjective: comparative
- If a verb: past tense plural

Eleventh:
- If a verb: past perfect (voltooid verleden tijd?)
- If an adjective: superlative

8, 10, 12: Syllables of the preceding.

## Clean up

### Wordforms
- 1 needs most work:
  + Split up if it has a comma and two (or more) words --> Make it two (or more) rows; check that the other columns also have **the same number** of comma separated words and match those, otherwise just duplicate other columns.
  + Make a new column for rows with a "duplicate entry number" postfixed; put the number there, remove it from the first row.
    - Note that column 3 also uses the postfix number to refer to specific duplicates!
- 2, 8, 10 and 12 can go, we have no need for pronunciation information currently.
- We have no place in the current schema for 5, so we drop it as well for now.
- Column 3 is for links only.
- Columns 7, 9 and 11 also contain wordforms, so we give these their own rows as well. For now, we don't need links, so we just extract the words and worry about efficient linking later.
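
The first two steps for column 1 could look roughly like the sketch below. `split_entry` is a hypothetical helper written only for illustration (it is not part of ticclat), and it assumes that entries are separated by `", "` and that a trailing number is always a duplicate-entry number.

```python
import re

# a wordform followed by a trailing duplicate-entry number, e.g. "aal2"
TRAILING_NUMBER = re.compile(r'^(.*?)(\d+)$')


def split_entry(cell):
    """Split a first-column cell into (wordform, duplicate_number) pairs."""
    pairs = []
    for part in cell.split(', '):
        part = part.strip()
        match = TRAILING_NUMBER.match(part)
        if match:
            # move the number to its own field
            pairs.append((match.group(1), int(match.group(2))))
        else:
            pairs.append((part, None))
    return pairs


print(split_entry("Zwols, Zwoller"))  # [('Zwols', None), ('Zwoller', None)]
print(split_entry("aal2"))            # [('aal', 2)]
```

Keeping the number as a separate field leaves it available for resolving the column 3 references later on.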

### Links
This lexicon also contains a lot of word links.

- Obviously, the three inflection columns, 7 9 and 11, give clear immediate links between wordforms, i.e. grammatical ones.
- Column 3 also gives a link, of words with similar meanings, i.e. a semantic link.

### Grammatical function
Column 5 may at some point be used to add grammatical information, if we ever plan on using this.

## Actual wordforms

In [None]:
df_GB1995_wordforms = df_GB1995.drop(["syllables", "see also", "disambiguation", "grammatical tag", "article",
                                      "plural/past/attrib syllables", "diminu/compara/past plural syllables", "past perfect/superla syllables"], axis=1)

In [None]:
df_GB1995_wordforms.sample(10)

## Categories

In [None]:
df_GB1995["grammatical tag"].unique()

In [None]:
df_GB1995[df_GB1995["grammatical tag"] == 'tw.']

We should probably skip `uitdr.` (sayings). Maybe also `eigenn.` (proper names). Will leave them in for now.

# Wordform clean up

## Mysterious "@" rows
Let's take out the weird "@" rows first to see what we can do with them.

In [None]:
def nd_any(*args):
    """Element-wise logical OR over any number of boolean Series/arrays."""
    result = args[0]
    for arg in args[1:]:
        result = result | arg
    return result

In [None]:
df_GB1995_at = df_GB1995[nd_any(*tuple(df_GB1995[col].str.contains('@') for col in df_GB1995.columns))]

In [None]:
df_GB1995_at.sample(4)

In [None]:
x = df_GB1995_at.values.flatten()
x = pd.Series(x).dropna().to_frame()
x = x[x[0].str.contains('@')]

In [None]:
ats = x[0].str.extract(r'(@[^\w]*)')  # raw string avoids invalid escape sequences

In [None]:
ats[0].unique()

Ahhh, they seem to be diacritic markers!
- @\` is accent grave on the previous character
- @\' is accent aigu on the previous character
- @@ seems to be apostrophe

But then... a whole zoo of quite rare ones. Let's check out examples.

In [None]:
for at in ats[0].unique():
    print(x[ats[0].str.contains(at, regex=False)].head())  # literal match, no manual escaping needed

Ok, so the three-character ones seem to be wrong: the `/` characters are all just syllable separators, the ones followed by a `-` are just hyphens (koppeltekens), the ones followed by a space are just spaces, and so on. So let's cut it down to a two-character filter. Actually, a one-character match is needed as well, because there's a naked @ too. But is that one actually naked, or does the following character there also have a special meaning? **Make sure**:

In [None]:
ats = x[0].str.extract(r'(@[^\w]?)')  # raw string avoids invalid escape sequences

In [None]:
ats[0].unique()

In [None]:
for at in ats[0].unique():
    print(at)
    print(x[ats[0].str.contains(at, regex=False)].head())  # literal match, no manual escaping needed

- @\` is accent grave on the previous character
- @\' is accent aigu
- @\\ is trema
- @+ is cedilla
- @^ is accent circumflex
- @= is a tilde
- @@ is apostrophe between the characters

The naked @ requires a bit more work:

In [None]:
naked_at = x[~nd_any(*tuple(ats[0].str.contains(at, regex=False) for at in ats[0].unique() if at != '@'))]

In [None]:
naked_at[~naked_at[0].str.contains('/')]

Ok, so the presumed naked @ is actually two possible things I wrongly filtered out:
- @2 is 2 in subscript, as in CO$_2$ (and only that)
- @n comes exclusively before an e that should have a trema on it; the backslash that's normally used is probably omitted to avoid confusion with newline-character \\n

We will normalize the diacritics below when we have gathered all wordforms in one array, for easier processing.

## Gather wordforms

Now to gather all actual wordforms, i.e. those in columns 1, 7, 9 and 11, also splitting by comma.

In [None]:
wordform_df = pd.concat((df_GB1995["word"], df_GB1995[df_GB1995.columns[6]], df_GB1995[df_GB1995.columns[8]], df_GB1995[df_GB1995.columns[10]]))\
                .dropna()

In [None]:
wordform_df[wordform_df.str.contains(', ')].head()

In [None]:
wordform_df[wordform_df.str.contains(':')].sample(10)

ffs, some words have colons in them, what do those mean then?

In [None]:
df_GB1995_colon = df_GB1995[nd_any(*tuple(df_GB1995[col].str.contains(':') for col in df_GB1995.columns))]

In [None]:
df_GB1995_colon.head()

I have no idea what these colons mean, so I will just strip them out.

Also, there are "words" that are actually several words... arrggh!

In [None]:
has_comma = wordform_df.str.contains(', ')
wordform_df = pd.concat((wordform_df[~has_comma],) + tuple(pd.Series(row.split(', ')) for row in wordform_df[has_comma]))

In [None]:
wordform_df[wordform_df.str.contains(', ')].head()

### Filter out crap
- Colons behind some words: we remove the colons and leave the rest of the word in.
- Words that are in fact several words, like "aan weerszijden" or "schoof vooruit". Remove those entries, split the words and append them to the end.
- Strip whitespace from either end of words (some apparently have it).
- Remove "footnote" numbers postfixed to duplicate words: like with colons.
- Retain only unique words after the above procedures.

In [None]:
# draw some samples to check for remaining weird shit
wordform_df.sample(10)

#### Remove colons

In [None]:
wordform_df = wordform_df.str.replace(':', '')

In [None]:
wordform_df[wordform_df.str.contains(':')].head()

#### Split multiple word entries

In [None]:
# surround space with any-character, because some single words also just have space padding
multi_word = wordform_df.str.contains('. .', regex=True)
wordform_df = pd.concat((wordform_df[~multi_word],) + tuple(pd.Series(row.split(' ')) for row in wordform_df[multi_word]))

In [None]:
wordform_df[wordform_df.str.contains('. .')]

#### Strip whitespace
Ok, so it's actually just one word, but still.

In [None]:
wordform_df[wordform_df.str.contains(' ')]

In [None]:
wordform_df = wordform_df.str.strip()

In [None]:
wordform_df[wordform_df.str.contains(' ')]

#### Remove duplicate word footnote numbers

In [None]:
duplicates = wordform_df.str.contains('[0-9]$', regex=True)
wordform_df = pd.concat((wordform_df[~duplicates], wordform_df[duplicates].str.replace('[0-9]$', '', regex=True)))

In [None]:
wordform_df.tail(), wordform_df[wordform_df.str.contains('[0-9]$', regex=True)]

#### Retain only unique wordforms

In [None]:
wordform_df = pd.Series(wordform_df.unique())

In [None]:
wordform_df.sort_values().head()

#### Extra: remove stuff between parentheses
This turned up when sorting. We should remove "etc.", which is an abbreviation, but we could keep the others.

In [None]:
wordform_df = wordform_df.sort_values().str.strip("()")

In [None]:
wordform_df.head()

#### Extra 2: abbreviations
Are there any more of them?

In [None]:
wordform_df[wordform_df.str.contains('.', regex=False)]

Ok, we remove the ones that are "pure" abbreviations, i.e. that end in a period.

In [None]:
abbreviation = wordform_df.str.contains(r'\.$')
wordform_df = wordform_df[~abbreviation]

In [None]:
wordform_df[wordform_df.str.contains('.', regex=False)]

#### Extra 3: Retain only unique wordforms... again

In [None]:
wordform_df = pd.Series(wordform_df.unique())

In [None]:
wordform_df.sort_values().head()

Ok, this is a problem. "'s" is not a separate word, it really belongs to some other word, that we split it off from.

### Aaaand again

After discussion, we decided to keep in the multi-word wordforms after all. We will see how to deal with them in TICCL later.

For the occasion, let's also just put everything in one function.

In [None]:
wordform_df = pd.concat((df_GB1995["word"],
                         df_GB1995["plural/past/attrib"],
                         df_GB1995["diminu/compara/past plural"],
                         df_GB1995["past perfect/superla"]))\
                .dropna()
has_comma = wordform_df.str.contains(', ')
wordform_df = pd.concat((wordform_df[~has_comma],) + tuple(pd.Series(row.split(', ')) for row in wordform_df[has_comma]))

In [None]:
def clean_wordform_df(wordform_df):
    # remove colons
    wordform_df = wordform_df.str.replace(':', '')
    # strip whitespace
    wordform_df = wordform_df.str.strip()
    # remove duplicate word footnote numbers
    duplicates = wordform_df.str.contains('[0-9]$', regex=True)
    wordform_df = pd.concat((wordform_df[~duplicates], wordform_df[duplicates].str.replace('[0-9]$', '', regex=True)))
    # remove parentheses around some words
    wordform_df = wordform_df.sort_values().str.strip("()")
    # remove abbreviations
    abbreviation = wordform_df.str.contains(r'\.$')
    wordform_df = wordform_df[~abbreviation]
    # remove duplicates
    wordform_df = pd.Series(wordform_df.unique())
    return wordform_df

In [None]:
wordform_clean_df = clean_wordform_df(wordform_df)

In [None]:
def check_cleanliness_wordform_df(wordform_df, head=5):
    print("Random sample:")
    display(wordform_df.sample(10))
    print("Colons, periods:")
    display(wordform_df[wordform_df.str.contains(':')].head(head))
    display(wordform_df[wordform_df.str.contains('.', regex=False)].head(head))
    print("White space padding:")
    display(wordform_df[wordform_df.str.contains('^ | $')].head(head))
    print("Trailing numbers:")
    display(wordform_df[wordform_df.str.contains('[0-9]$', regex=True)].head(head))
    print("Parentheses:")
    display(wordform_df[wordform_df.str.contains(r'\(|\)', regex=True)].head(head))
    print("Abbreviations:")
    display(wordform_df[wordform_df.str.contains(r'\.$')].head(head))
    
    print("Finally, just the first entries of sorted df:")
    display(wordform_df.sort_values().head(head))
    display(wordform_df.sort_values().tail(head))    

In [None]:
check_cleanliness_wordform_df(wordform_df)

In [None]:
check_cleanliness_wordform_df(wordform_clean_df)

So, there are some remaining issues here that should at least be duly noted (though ideally dealt with):

- Some words have `(etc.)` in them, like `ten tweede (etc.) male`. This should be expanded into `ten derde male`, and so forth for all counting words.
- Other parenthesized words like `op de(n) duur` should be considered as two words, `op de duur` and `op den duur`.
- I'm not sure what the period in `vademen. vadems` means, though I suspect the period is a mistyped comma.

**We will leave these things as they are for now.**
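
Should this be picked up later, the second case (optional letters in parentheses) could be handled along these lines. `expand_optional` is a hypothetical helper, not part of ticclat, and it assumes the parentheses only ever wrap plain word characters, so `(etc.)` is deliberately left untouched.

```python
import re

# parentheses wrapping word characters only, e.g. "(n)"; "(etc.)" does not match
OPTIONAL = re.compile(r'\((\w+)\)')


def expand_optional(wordform):
    """Return all variants of a wordform with its optional parts left out or kept."""
    match = OPTIONAL.search(wordform)
    if match is None:
        return [wordform]
    before, after = wordform[:match.start()], wordform[match.end():]
    # recurse so that several optional parts in one wordform are all expanded
    return (expand_optional(before + after)
            + expand_optional(before + match.group(1) + after))


print(expand_optional("op de(n) duur"))           # ['op de duur', 'op den duur']
print(expand_optional("ten tweede (etc.) male"))  # ['ten tweede (etc.) male']
```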

One more pressing remaining issue is that there are apparently "multiple words" that are not separated by comma at all! See `zworen af zweerde af` which should be two terms. Let's check what this looks like in the original table:

In [None]:
df_GB1995[df_GB1995['word'].str.contains('zweren', na=False)]

No idea how to fix this... **yet**. Leave it as is for now.

### Normalizing diacritics

TICCLAT has unicode wordforms, so we can just replace the diacritic markers with actual diacritics. How to do that?

Apparently, there is such a thing as "combining characters" in Unicode: https://stackoverflow.com/questions/34755556/how-do-i-add-accents-to-a-letter. Nice! Here's a table of them: https://en.wikipedia.org/wiki/Combining_character
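
A quick look at how a combining character behaves, using only the standard library. The input word is made up for illustration; it is not taken from the Groene Boekje file.

```python
import unicodedata

# Made-up input for illustration: "@`" marks a grave accent on the preceding letter.
raw = "cre@`me"
word = raw.replace("@`", "\u0300")   # plain (non-regex) string replacement

print(word)                          # should render like "crème" in most fonts
print(len(word))                     # 6: the accent is its own code point
print(unicodedata.name("\u0300"))    # COMBINING GRAVE ACCENT
```

So the combining character follows the letter it modifies, which matches how the markers in the file follow their letter.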

In [None]:
# note that these are regex formatted, i.e. with special characters escaped
diacritic_markers = {'@`': '\u0300',    # accent grave
                     "@\\'": '\u0301',  # accent aigu
                     '@\\\\': '\u0308', # trema
                     r'@\+': '\u0327',  # cedilla (raw string avoids an invalid escape sequence)
                     r'@\^': '\u0302',  # accent circumflex
                     '@=': '\u0303',    # tilde
                     '@@': "'",         # apostrophe (not actually a diacritic)
                     '@2': '\u2082',    # subscript 2
                     '@n': '\u0308n'    # trema followed by n
                    }

In [None]:
for marker, umarker in diacritic_markers.items():
    # the markers are regex patterns, so ask for regex matching explicitly
    wordform_clean_df = wordform_clean_df.str.replace(marker, umarker, regex=True)

In [None]:
wordform_clean_df.sort_values().head(10)

# Load Groene Boekje DataFrames into TICCLAT database

In [None]:
with ticclat.dbutils.session_scope(Session) as session:
    ticclat.dbutils.add_lexicon(session, "Groene Boekje 1995", wordform_clean_df.to_frame(name='wordform'))

Ok, this is giving me an error:

```python-traceback
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-93-22e27abe721c> in <module>
      1 with ticclat.dbutils.session_scope(Session) as session:
----> 2     ticclat.dbutils.add_lexicon(session, "Groene Boekje 1995", wordform_clean_df.to_frame(name='wordform'))

~/projects/ticclat/ticclat/ticclat/dbutils.py in add_lexicon(session, lexicon_name, wfs, num)
     97     in this case just "wordform"
     98     """
---> 99     bulk_add_wordforms(session, wfs, num=num)
    100 
    101     lexicon = Lexicon(lexicon_name=lexicon_name)

~/projects/ticclat/ticclat/ticclat/dbutils.py in bulk_add_wordforms(session, wfs, num)
     72 
     73         q = session.query(Wordform)
---> 74         result = q.filter(Wordform.wordform.in_(wordforms)).all()
     75 
     76         existing_wfs = [wf.wordform for wf in result]

~/sw/miniconda3/envs/ticclat2/lib/python3.7/site-packages/sqlalchemy/orm/query.py in all(self)
   2923 
   2924         """
-> 2925         return list(self)
   2926 
   2927     @_generative(_no_clauseelement_condition)

~/sw/miniconda3/envs/ticclat2/lib/python3.7/site-packages/sqlalchemy/orm/query.py in __iter__(self)
   3079         if self._autoflush and not self._populate_existing:
   3080             self.session._autoflush()
-> 3081         return self._execute_and_instances(context)
   3082 
   3083     def __str__(self):

~/sw/miniconda3/envs/ticclat2/lib/python3.7/site-packages/sqlalchemy/orm/query.py in _execute_and_instances(self, querycontext)
   3104         )
   3105 
-> 3106         result = conn.execute(querycontext.statement, self._params)
   3107         return loading.instances(querycontext.query, result, querycontext)
   3108 

~/sw/miniconda3/envs/ticclat2/lib/python3.7/site-packages/sqlalchemy/engine/base.py in execute(self, object_, *multiparams, **params)
    978             raise exc.ObjectNotExecutableError(object_)
    979         else:
--> 980             return meth(self, multiparams, params)
    981 
    982     def _execute_function(self, func, multiparams, params):

~/sw/miniconda3/envs/ticclat2/lib/python3.7/site-packages/sqlalchemy/sql/elements.py in _execute_on_connection(self, connection, multiparams, params)
    271     def _execute_on_connection(self, connection, multiparams, params):
    272         if self.supports_execution:
--> 273             return connection._execute_clauseelement(self, multiparams, params)
    274         else:
    275             raise exc.ObjectNotExecutableError(self)

~/sw/miniconda3/envs/ticclat2/lib/python3.7/site-packages/sqlalchemy/engine/base.py in _execute_clauseelement(self, elem, multiparams, params)
   1097             distilled_params,
   1098             compiled_sql,
-> 1099             distilled_params,
   1100         )
   1101         if self._has_events or self.engine._has_events:

~/sw/miniconda3/envs/ticclat2/lib/python3.7/site-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
   1238         except BaseException as e:
   1239             self._handle_dbapi_exception(
-> 1240                 e, statement, parameters, cursor, context
   1241             )
   1242 

~/sw/miniconda3/envs/ticclat2/lib/python3.7/site-packages/sqlalchemy/engine/base.py in _handle_dbapi_exception(self, e, statement, parameters, cursor, context)
   1458                 util.raise_from_cause(sqlalchemy_exception, exc_info)
   1459             else:
-> 1460                 util.reraise(*exc_info)
   1461 
   1462         finally:

~/sw/miniconda3/envs/ticclat2/lib/python3.7/site-packages/sqlalchemy/util/compat.py in reraise(tp, value, tb, cause)
    275         if value.__traceback__ is not tb:
    276             raise value.with_traceback(tb)
--> 277         raise value
    278 
    279 

~/sw/miniconda3/envs/ticclat2/lib/python3.7/site-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
   1234                 if not evt_handled:
   1235                     self.dialect.do_execute(
-> 1236                         cursor, statement, parameters, context
   1237                     )
   1238         except BaseException as e:

~/sw/miniconda3/envs/ticclat2/lib/python3.7/site-packages/sqlalchemy/engine/default.py in do_execute(self, cursor, statement, parameters, context)
    534 
    535     def do_execute(self, cursor, statement, parameters, context=None):
--> 536         cursor.execute(statement, parameters)
    537 
    538     def do_execute_no_params(self, cursor, statement, context=None):

~/sw/miniconda3/envs/ticclat2/lib/python3.7/site-packages/MySQLdb/cursors.py in execute(self, query, args)
    237                 args = dict((key, db.literal(item)) for key, item in args.items())
    238             else:
--> 239                 args = tuple(map(db.literal, args))
    240             if not PY2 and isinstance(query, (bytes, bytearray)):
    241                 query = query.decode(db.encoding)

~/sw/miniconda3/envs/ticclat2/lib/python3.7/site-packages/MySQLdb/connections.py in literal(self, o)
    319             s = self._tuple_literal(o)
    320         else:
--> 321             s = self.escape(o, self.encoders)
    322         # Python 3(~3.4) doesn't support % operation for bytes object.
    323         # We should decode it before using %.

~/sw/miniconda3/envs/ticclat2/lib/python3.7/site-packages/MySQLdb/connections.py in unicode_literal(u, dummy)
    227             # unicode_literal() is called for arbitrary object.
    228             def unicode_literal(u, dummy=None):
--> 229                 return db.string_literal(str(u).encode(db.encoding))
    230 
    231         def bytes_literal(obj, dummy=None):

UnicodeEncodeError: 'latin-1' codec can't encode character '\u0308' in position 14: ordinal not in range(256)
```

It seems like the `db.encoding` for some reason is latin-1 there, even though we set the ticclat database to be utf8 in MySQL using

```mysql
CREATE DATABASE ticclat CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
```

This is a guide that should set everything to utf8: https://mathiasbynens.be/notes/mysql-utf8mb4#mysql-utf8mb4

The important part there that we're still missing is that `character_set_client` may still be non-utf8, i.e. clients could still read the data as latin-1, even though the database is encoded as utf8 (see e.g. https://nicj.net/mysql-converting-an-incorrect-latin1-column-to-utf8/). In my case it seems like it's neither latin-1 nor utf8mb4, but actually "utf8", which is an alias for utf8mb3 (see https://stackoverflow.com/a/30074553/1199693), as I found out by running `SHOW VARIABLES LIKE 'character_set_client';`.

We will set the client character set to utf8mb4 by creating a configuration file. To find out which file your client reads from, run this:
```sh
mysql --help | grep -A 1 "Default options are read from the following files"
```

For me this gives
```sh
Default options are read from the following files in the given order:
/etc/my.cnf /etc/mysql/my.cnf /Users/pbos/sw/miniconda3/envs/ticclat2/etc/my.cnf ~/.my.cnf
```

I'm using MySQL from Conda, so I'll use the miniconda location and put this in the file:

```ini
[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
```

Then restart the server, and delete and recreate the database (I don't care much about saving data at this point).

```sh
mysqld restart
```

Indeed, now `SHOW VARIABLES LIKE 'character_set_client';` gives utf8mb4.

.......

Ok, never mind, I was using the "groene_boekje" database, not the "ticclat" database. D'oh!

.......

Ok, but with that fixed, the problem still persists. Even with the whole configuration file above added... crap.

In [None]:
with ticclat.dbutils.session_scope(Session) as session:
    ticclat.dbutils.add_lexicon(session, "Groene Boekje 1995", wordform_clean_df.str.encode('utf8').to_frame(name='wordform'))

In [None]:
# wordform_clean_df.to_frame(name='wordform')['wordform']

Yaaay, that seems to have worked! The crucial addition there is `.str.encode('utf-8')`. Actually, we already discussed this, but I got lost in the MySQL settings and forgot.
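
A possibly cleaner alternative, untested on this setup: the traceback shows that the MySQLdb connection itself encodes parameters as latin-1, and the SQLAlchemy MySQL dialect documentation describes a `charset` query parameter on the connection URL for setting the connection encoding. The sketch below only builds such a URL. `build_mysql_url` is a hypothetical helper, not part of ticclat, and the credentials are placeholders.

```python
from urllib.parse import quote


def build_mysql_url(user, password, dbname, host="localhost", charset="utf8mb4"):
    """Hypothetical helper: build a MySQL URL that requests a connection charset."""
    return "mysql://{}:{}@{}/{}?charset={}".format(
        quote(user, safe=""),      # percent-encode special characters
        quote(password, safe=""),
        host, dbname, charset)


url = build_mysql_url("someuser", "p@ss word", "ticclat")
print(url)  # mysql://someuser:p%40ss%20word@localhost/ticclat?charset=utf8mb4
# The URL would then be passed to create_engine(url) instead of the one built earlier.
```

If that works, the wordforms could stay ordinary `str` values instead of being encoded to bytes by hand.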

This is also useful to check the encoding of tables themselves:
```mysql
SELECT T.table_name, CCSA.character_set_name FROM information_schema.`TABLES` T,
information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` CCSA
WHERE CCSA.collation_name = T.table_collation AND T.table_schema = "ticclat";
```

We also have to make sure that the data is stored correctly. Just browsing through the data in mysql with

```mysql
SELECT * FROM wordforms;
```

shows that the unicode words are not displayed correctly, in the terminal at least, e.g. `zooÌˆloog"` instead of `zoöloog`.
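
The garbled form is what one would expect if the bytes are stored as UTF-8 but decoded for display with a single-byte Western code page. A minimal check of that idea, with Windows-1252 as a guess for the display encoding (an assumption, not something read from the terminal settings):

```python
stored = "zoo\u0308loog"  # o followed by the combining trema, as built above

# encode the way the database should store it, decode the way the display seems to
garbled = stored.encode("utf-8").decode("cp1252")
print(garbled)  # zooÌˆloog

# the round trip is lossless, so nothing is destroyed by the wrong display
print(garbled.encode("cp1252").decode("utf-8") == stored)  # True
```

If this is what is happening, the data in the table may be fine and only the display is off.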

How about if we query from here?

In [None]:
'zoo{}loog'.format(diacritic_markers['@\\\\'])

In [None]:
# again gives long utf-8/latin-1 UnicodeEncodeError:
# with ticclat.dbutils.session_scope(Session) as session:
#     for wordform in session.query(ticclat.ticclat_schema.Wordform).filter_by(wordform='zoo{}loog'.format(diacritic_markers['@\\\\'])):
#          print(wordform)

# this gives no results at all
with ticclat.dbutils.session_scope(Session) as session:
    for wordform in session.query(ticclat.ticclat_schema.Wordform).filter_by(wordform='zooÌˆloog"'.encode('utf-8')):
         print(wordform)

# ... neither does this, when I manually type in zoöloog with alt-U O (on macOS) for the ö
with ticclat.dbutils.session_scope(Session) as session:
    for wordform in session.query(ticclat.ticclat_schema.Wordform).filter_by(wordform='zoöloog"'.encode('utf-8')):
         print(wordform)

# ... but when I copy-paste the output of the cell above it works!
with ticclat.dbutils.session_scope(Session) as session:
    for wordform in session.query(ticclat.ticclat_schema.Wordform).filter_by(wordform='zoöloog"'.encode('utf-8')):
         print(wordform)

We'll leave this unicode retrieval problem for later (made an issue).
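
One possible cause, not confirmed against the database: a typed `ö` is usually a single precomposed code point, while the wordforms built above use `o` followed by the combining trema U+0308. The two look identical but are different strings:

```python
import unicodedata

typed = "zo\u00f6loog"     # precomposed ö, as a keyboard would typically produce
stored = "zoo\u0308loog"   # o + combining trema, as produced by the marker replacement

print(typed == stored)                                  # False
print(typed.encode("utf-8") == stored.encode("utf-8"))  # False
print(unicodedata.normalize("NFC", stored) == typed)    # True
print(unicodedata.normalize("NFD", typed) == stored)    # True
```

Normalizing both the stored wordforms and the query string to the same form (NFC or NFD) before comparing would remove this difference.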

However, in the links notebook we ran into a related issue, namely the double quotes at the ends of these words.

In [None]:
[w for w in wordform_clean_df if '"' in w][:20]

This actually seems to be a bigger problem! In some cases, there are apparently rows where cells have been misidentified, like `'België";Bel/gië""'` above. This is really a mystery; why is that `;` not identified as a separator in the csv? Is it really a different unicode character?

In [None]:
[w for w in wordform_clean_df if '"' in w][10]

In [None]:
[w for w in wordform_clean_df if '"' in w][10][8] == ';'

Nope, just a regular `;`... Let's see what this line looks like in the file, here's the paste from grep:

```
"Belgie@\"";"Bel/gie@\"";;;"eigenn.";;;;;;;
```

AHA, I've been identifying trema incorrectly! It's not `@\` as I assumed, it's `@\"`, which makes way more sense anyway. Also, are those escaped double quotes causing some issues?

Let's add `escapechar` for starters.
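
To see what `escapechar` changes, here is the grepped line run through Python's built-in `csv` module. This is only a stand-in for `pd.read_csv`; the assumption is that both parsers treat this case alike.

```python
import csv
import io

line = r'"Belgie@\"";"Bel/gie@\"";;;"eigenn.";;;;;;;'

# without escapechar, \"" is read as a backslash plus a doubled (literal) quote
plain = next(csv.reader(io.StringIO(line), delimiter=';'))
# with escapechar, \" is an escaped quote and the next " closes the cell
escaped = next(csv.reader(io.StringIO(line), delimiter=';', escapechar='\\'))

print(plain[0])      # the first two cells run together, separator included
print(escaped[0])    # Belgie@"
print(len(escaped))  # 12, one field per column
```

Note that the escape character itself is dropped, so the trema marker that ends up in the data is `@"`.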

In [None]:
df_GB1995 = pd.read_csv(GB1995_path, sep=';', escapechar='\\',
                        names=["word", "syllables", "see also", "disambiguation",
                               "grammatical tag", "article",
                               "plural/past/attrib", "plural/past/attrib syllables",
                               "diminu/compara/past plural", "diminu/compara/past plural syllables",
                               "past perfect/superla", "past perfect/superla syllables"],
                        encoding='utf8') # encoding necessary for later loading into sqlalchemy!

In [None]:
# note that these are regex formatted, i.e. with special characters escaped
diacritic_markers = {'@`': '\u0300',    # accent grave
                     "@\\'": '\u0301',  # accent aigu
                     '@\\"': '\u0308',  # trema
                     r'@\+': '\u0327',  # cedilla (raw string avoids an invalid escape sequence)
                     r'@\^': '\u0302',  # accent circumflex
                     '@=': '\u0303',    # tilde
                     '@@': "'",         # apostrophe (not actually a diacritic)
                     '@2': '\u2082',    # subscript 2
                     '@n': '\u0308n'    # trema followed by n
                    }

In [None]:
wordform_df = pd.concat((df_GB1995["word"],
                         df_GB1995["plural/past/attrib"],
                         df_GB1995["diminu/compara/past plural"],
                         df_GB1995["past perfect/superla"]))\
                .dropna()
has_comma = wordform_df.str.contains(', ')
wordform_df = pd.concat((wordform_df[~has_comma],) + tuple(pd.Series(row.split(', ')) for row in wordform_df[has_comma]))

wordform_clean_df = clean_wordform_df(wordform_df)

for marker, umarker in diacritic_markers.items():
    # the markers are regex patterns, so ask for regex matching explicitly
    wordform_clean_df = wordform_clean_df.str.replace(marker, umarker, regex=True)

In [None]:
[w for w in wordform_clean_df if '"' in w], [w for w in wordform_clean_df if ';' in w], [w for w in wordform_clean_df if '/' in w]

In [None]:
[w for w in wordform_clean_df if 'Belgi' in w]

That's better! Except that Ilias... The grep:

```
"Ilias";;;"znw.";"de";;"Ili/as";;;;;"5.2.5;5.3.5a"
```

Ah, this is a really weird line: there's a missing `;` after the first cell, and the syllables cell, which should be second, ends up in the place of the plural/past/attrib cell. Ok, so we clean that one manually and we're done.
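
To see where each cell of that line lands, it can be mapped onto the column names with the built-in `csv` module (again a stand-in for `pd.read_csv`, assuming both split this line the same way):

```python
import csv
import io

columns = ["word", "syllables", "see also", "disambiguation",
           "grammatical tag", "article",
           "plural/past/attrib", "plural/past/attrib syllables",
           "diminu/compara/past plural", "diminu/compara/past plural syllables",
           "past perfect/superla", "past perfect/superla syllables"]

line = '"Ilias";;;"znw.";"de";;"Ili/as";;;;;"5.2.5;5.3.5a"'
fields = next(csv.reader(io.StringIO(line), delimiter=';', escapechar='\\'))

# print only the non-empty cells, next to the column they end up in
for name, value in zip(columns, fields):
    if value:
        print("{:<32} {}".format(name, value))
```

If pandas splits the line the same way, `znw.` and `de` also sit one column to the left of where they belong. That only starts to matter once the grammatical tag or article columns are used.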

In [None]:
df_GB1995 = pd.read_csv(GB1995_path, sep=';', escapechar='\\',
                        names=["word", "syllables", "see also", "disambiguation",
                               "grammatical tag", "article",
                               "plural/past/attrib", "plural/past/attrib syllables",
                               "diminu/compara/past plural", "diminu/compara/past plural syllables",
                               "past perfect/superla", "past perfect/superla syllables"],
                        encoding='utf8') # encoding necessary for later loading into sqlalchemy!

df_GB1995 = df_GB1995.where(df_GB1995 != "Ili/as", other=None)

wordform_df = pd.concat((df_GB1995["word"],
                         df_GB1995["plural/past/attrib"],
                         df_GB1995["diminu/compara/past plural"],
                         df_GB1995["past perfect/superla"]))\
                .dropna()
has_comma = wordform_df.str.contains(', ')
wordform_df = pd.concat((wordform_df[~has_comma],) + tuple(pd.Series(row.split(', ')) for row in wordform_df[has_comma]))

wordform_clean_df = clean_wordform_df(wordform_df)

for marker, umarker in diacritic_markers.items():
    # the markers are regex patterns, so ask for regex matching explicitly
    wordform_clean_df = wordform_clean_df.str.replace(marker, umarker, regex=True)

In [None]:
[w for w in wordform_clean_df if '"' in w], [w for w in wordform_clean_df if ';' in w], [w for w in wordform_clean_df if '/' in w]

In [None]:
[w for w in wordform_clean_df if 'Belgi' in w]

In [None]:
[[w for w in wordform_clean_df if d in w][:3] for d in diacritic_markers.values()]