# Groene boekje part 3: 2005

The first two Groene Boekje notebooks dealt with the wordforms and links between wordforms for the 1995 corpus only.

In this notebook we try the methods developed there on the 2005 version, see what blows up, and work out how to fix it.

In [None]:
%load_ext autoreload

In [None]:
%autoreload

In [None]:
import ticclat.dbutils
import ticclat.ticclat_schema
import pandas as pd
import numpy as np
import tqdm

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy_utils import database_exists, create_database

import ticclat.ingest.groene_boekje

In [None]:
pd.options.display.max_columns = None

In [None]:
# Read information to connect to the database and put it in environment variables
import os
with open('ENVVARS.txt') as f:
    for line in f:
        # Split on the first '=' only, so values that contain '=' are kept intact
        parts = line.split('=', 1)
        if len(parts) == 2:
            os.environ[parts[0].strip()] = parts[1].strip()

In [None]:
db_name = 'ticclat'
os.environ['dbname'] = db_name

# Load Groene Boekje data into Pandas

In [None]:
GB_basepath = "/Users/pbos/projects/ticclat/data/GB/"

In [None]:
GB1914_path = GB_basepath + "1914/22722-8.txt"
GB1995_path = GB_basepath + "1995-2005/1995/GB95_002.csv"
GB2005_path = GB_basepath + "1995-2005/2005/GB05_002.csv"

In [None]:
gb2005 = ticclat.ingest.groene_boekje.load_GB95(GB2005_path)

In [None]:
gb2005.head()

Ok, that's not quite right.

In [None]:
df_GB2005 = pd.read_csv(GB2005_path, sep=';')

In [None]:
df_GB2005.head()

There are a lot more columns this time. Let's figure it out.

First, there's an extra first column that seems to be some index, probably relating the entries to the 1995 table rows.

The other extra columns all seem to be empty for the first set of entries, i.e. those in the range up to 200,000. It looks like these are the same words as in the 1995 version.

After 200,000, the first columns are instead empty and only the latter columns are used. Maybe they are using a different categorization there. If so, it would probably make more sense to split the table in two.

First let's check whether indeed the above statements are true:

1. Up to 200,000, the last columns are empty.
2. Up to 200,000 are the same words as in the 1995 version. N.B.: it does not really matter whether words were already in the 1995 version; if they are in 2005 as well, that is extra information that needs to go into the database (i.e. that the wordforms also occur in this lexicon).
3. After 200,000 are new words.
4. After 200,000 the first columns are empty.
5. The columns after 200,000 are different from those before.
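A minimal sketch of how statements 1 and 4 can be checked, using a made-up toy table instead of the real data. The column names, the words and the `threshold` variable are assumptions for illustration only; the idea is simply to count non-missing values per column on either side of the split.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 2005 table (hypothetical data): the index plays the
# role of the entry number, "old" columns are filled for low entry numbers
# and "new" columns for high ones.
toy = pd.DataFrame(
    {
        "old_word": ["aal", "aap", np.nan, np.nan],
        "new_word": [np.nan, np.nan, "blog", "chatten"],
    },
    index=[1, 2, 200001, 200002],
)

threshold = 200000  # assumed split point, taken from the observations above

# DataFrame.count() gives the number of non-missing values per column.
counts_low = toy.loc[toy.index <= threshold].count()
counts_high = toy.loc[toy.index > threshold].count()

print(counts_low)
print(counts_high)
```

A column that is truly unused in one of the two ranges shows up with a count of zero there. Using a boolean mask on the index also avoids counting the boundary entry in both halves.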

In [None]:
df_GB2005 = pd.read_csv(GB2005_path, sep=';', escapechar="\\", index_col=0,
                        names=["word", "syllables", "see also", "disambiguation",
                               "grammatical tag", "article",
                               "plural/past/attrib", "plural/past/attrib syllables",
                               "diminu/compara/past plural", "diminu/compara/past plural syllables",
                               "past perfect/superla", "past perfect/superla syllables"]
                              # 25 total columns: 1 index column, 12 known, 12 unknown.
                              # The unknown columns get unique placeholder names, because
                              # recent pandas versions reject duplicate entries in `names`.
                              + ["??? {}".format(i) for i in range(12)]
                       )

In [None]:
print([len(df_GB2005.loc[:200000][col].dropna()) for col in df_GB2005.columns])

Ok, so they aren't empty at all, really. But this is because, as we'll see below, the words are actually repeated in those columns for the (supposedly) 1995 words.

What about the first columns for the later words?

In [None]:
print([len(df_GB2005.loc[200000:][col].dropna()) for col in df_GB2005.columns])

There our hypothesis is almost confirmed...

...except that the last two of the first columns do contain values. This seems to be because three columns from the 1995 data have been removed: "see also", "disambiguation" and "article". That leaves us with 9 known columns, 1 index column and 15 unknown ones. Judging from the counts, the last four are some special kind of column, so the 9 columns after the first 9 are probably the same ones again, but for the 2005 set. Actually, there appears to be one mysterious column of numbers between the two sets of 9.

This is all nice speculation, but let's just check the CSV file itself; we do that below.
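One way to inspect the raw file without committing to column names is to count the fields per line with the standard `csv` module. The sketch below runs on a few made-up lines rather than the real file; the sample content is an assumption, while the `;` separator and the backslash escape character match the `read_csv` calls in this notebook.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a few raw lines of the CSV file.
sample = io.StringIO(
    "1;aap;aap;N\n"
    "2;noot;noot;N\n"
    "3;a\\;b;a-b;N\n"  # contains an escaped separator: a\;b
)

# QUOTE_NONE disables quote handling; the escape character makes the
# separator that follows it part of the field instead of a field boundary.
reader = csv.reader(sample, delimiter=";", escapechar="\\", quoting=csv.QUOTE_NONE)
rows = list(reader)

# How many lines have how many fields?
field_counts = Counter(len(row) for row in rows)
print(field_counts)
print(rows[2])
```

If every line has the same number of fields, the column layout is at least consistent, and a handful of printed rows is usually enough to guess what each position holds.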

First to check the rest of the above assumptions...

In [None]:
df_GB2005.head(5)

We can see immediately that the columns have been reordered and modified. We will just examine the CSV and use the correct order below.

In [None]:
names = ["word", "syllables", "grammatical tag",
         "plural/past/attrib", "plural/past/attrib syllables",
         "diminu/compara/past plural", "diminu/compara/past plural syllables",
         "past perfect/superla", "past perfect/superla syllables"]
# Column layout as found in the CSV: the 1995 set, a column of unexplained
# numbers, the 2005 set (with 'article' inserted after the first three columns)
# and four new verb-form columns.
actual_names = (
    [n + " 95" for n in names]
    + ["MYSTERIOUS NUMBERS"]
    + [n + " 05" for n in names[:3]]
    + ['article']
    + [n + " 05" for n in names[3:]]
    + ["1st person present singular", "1st person present singular syllables",
       "2nd/3rd person present singular", "2nd/3rd person present singular syllables"]
)
df_GB2005 = pd.read_csv(GB2005_path, sep=';', escapechar="\\", index_col=0, names=actual_names)

In [None]:
df_GB2005[~df_GB2005["MYSTERIOUS NUMBERS"].isna()].sample(6)
# df_GB2005.sample(10)

## Peculiarities

- Some inflections in the 05 columns of the 95 words contain `/schrappen/`, Dutch for "strike out" (i.e. delete). This probably means the inflection was meant to be removed from the 2005 version for some reason.
- The last four columns are new, but also used for some of the 95 words. These are first and second/third person present tense singular verb forms for loanwords (mostly from English).
- I have no idea what the mysterious numbers stand for...
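A possible way to deal with the `/schrappen/` markers before ingesting is to treat them as missing values. This is only a sketch on made-up rows; the column names and words are hypothetical, and whether struck inflections should really be dropped is an assumption, not something the data confirms.

```python
import pandas as pd

# Hypothetical rows: one 2005 inflection is marked for removal.
toy = pd.DataFrame(
    {
        "plural 95": ["cd's", "foto's"],
        "plural 05": ["/schrappen/", "foto's"],
    }
)

marker = "/schrappen/"

# Literal (non-regex) substring match; missing values count as "no marker".
is_struck = toy["plural 05"].str.contains(marker, regex=False, na=False)

# Series.mask() replaces the entries where the condition is True with a
# missing value and leaves the others untouched.
cleaned = toy["plural 05"].mask(is_struck)

print(is_struck.sum())
print(cleaned)
```

Keeping `is_struck` around as a separate boolean column would preserve the information that an inflection was explicitly removed, rather than never present.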

In [None]:
# # df_GB1995[~df_GB1995["see also"].isnull()].sample(10)
# # df_GB1995[~df_GB1995["disambiguation"].isnull()].sample(10)
# df_GB1995[~df_GB1995["disambiguation"].isnull() & df_GB1995["disambiguation"].str.contains(' ')].sample(10)