Let's see what we can do with my first TICCL runs! Done on 4 SoNaR subcorpora:

1. Newspapers
2. Periodicals/magazines
3. Websites
4. Wikipedia

Using two pipelines: with indexer and with indexerNT.

In this notebook we'll focus on the indexer (non-NT) data.

First step: load the data and get it in a format we can work with. Let's do that with `pandas`.

In [None]:
%matplotlib inline

In [None]:
import pandas as pd
import functools  # lru_cache
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize as scop
import csv
import seaborn as sns
import corner

In [None]:
class Corpus():
    """
    For a bit of order and structure, let's make this dummy class,
    which may be extended later if necessary.
    """
    pass

In [None]:
websites = Corpus()

# Freq
_Computed with `FoLiA-stats`, but in this case they were already provided._

In [None]:
websites.freq = pd.read_csv('sonar_ticcl/WR-P-E-I_web_sites.wordfreqlist.tsv', sep='\t', names=['word', 'number', 'other_number', 'decimal_number'])

In [None]:
# websites.freq

What are the columns really? Perhaps the decimal number is the (cumulative) percentage of that word in the total corpus?

In [None]:
# (websites.freq.number/websites.freq.number.sum()).cumsum()

No, that doesn't really fit at all... The third column (`other_number`) may be a running sum of the second one (`number`)...

In [None]:
# websites.freq.number.cumsum() == websites.freq.other_number

Ok, that seems to be close...

# Clean

_Computed with `TICCL-unk`_

Anyway, it doesn't seem to be used by TICCL. In the clean files, we only see the first numeric column being used as frequency/count, with artifrq added to it for words that are in the lexicon:

In [None]:
websites.clean = pd.read_csv('sonar_ticcl/WR-P-E-I_web_sites.wordfreqlist.tsv.clean', sep='\t', names=['word', 'counts'])

In [None]:
# websites.clean
len(websites.clean)

So this list is a LOT longer than the original one. I guess a lot of lexicon words were added. Those should have count exactly 100000000 (the artifrq value)...

In [None]:
sum(websites.clean.counts == 100000000), len(websites.clean) - (sum(websites.clean.counts == 100000000) + len(websites.freq))

Right, so 11811 "words" out of 72160 were filtered out by clean and 1015180 were added from the lexicon.

The punct file should contain punctuation "corrected" words and the unk file should contain "unknown" words. Not sure what that meant anymore...

NO, THERE'S ALSO OVERLAP
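
The bookkeeping above can be made explicit with a small sketch. It assumes, following the description in this section, that lexicon words get artifrq (100000000) added to their corpus count, so a count of exactly artifrq means "lexicon only" and a larger count means the word is in both the corpus and the lexicon. The helper `split_by_origin` and the word lists are made up for illustration; they are not part of TICCL.

```python
ARTIFRQ = 100_000_000  # value used in the cells above


def split_by_origin(freq_words, clean_counts, artifrq=ARTIFRQ):
    """Classify words by where they come from (hypothetical helper).

    freq_words: iterable of words in the original frequency list
    clean_counts: dict mapping word -> count from the .clean file
    """
    freq_words = set(freq_words)
    clean_words = set(clean_counts)
    return {
        # in the corpus list but gone after cleaning
        "filtered_out": freq_words - clean_words,
        # count is exactly artifrq: lexicon word never seen in the corpus
        "lexicon_only": {w for w, c in clean_counts.items() if c == artifrq},
        # count above artifrq: lexicon word that also occurs in the corpus
        "overlap": {w for w, c in clean_counts.items() if c > artifrq},
        # count below artifrq: corpus word that is not in the lexicon
        "corpus_only": {w for w, c in clean_counts.items() if c < artifrq},
    }


# Toy data, not taken from the SoNaR files
freq = ["de", "huis", "x)y", "zzzq"]
clean = {"de": ARTIFRQ + 50, "huis": ARTIFRQ + 3, "zzzq": 2, "fiets": ARTIFRQ}
groups = split_by_origin(freq, clean)
print({name: sorted(words) for name, words in groups.items()})
```

In this toy example one word is filtered out, one is lexicon-only, one is corpus-only and two are in both lists.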

In [None]:
websites.punct = pd.read_csv('sonar_ticcl/WR-P-E-I_web_sites.wordfreqlist.tsv.punct', sep='\t', names=['word', 'correction'])

In [None]:
len(websites.punct)

In [None]:
# websites.punct

In [None]:
websites.unk = pd.read_csv('sonar_ticcl/WR-P-E-I_web_sites.wordfreqlist.tsv.unk', sep='\t', names=['word', 'counts'])

In [None]:
len(websites.unk)

In [None]:
# websites.unk

Most filtered out words are not saved anywhere, it seems.

In [None]:
# `in` on a Series checks the index labels, so test against the values instead
"Nino')Vieira" in websites.clean.word.values

So what words does punct contain then? The ones that even after trimming punctuation were not found to be clean? Even adding those we still have about 11k words unaccounted for, so indeed most cleaned-out words are gone.
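
One pitfall when checking whether a word is in a column: `in` on a pandas Series tests the index labels, not the values. A minimal sketch with toy data (the words are made up):

```python
import pandas as pd

words = pd.Series(["huis", "fiets", "boom"], name="word")  # toy data

in_index = "fiets" in words            # checks the index labels 0, 1, 2
in_values = "fiets" in words.values    # checks the actual strings
via_isin = words.isin(["fiets"]).any() # vectorized alternative

print(in_index, in_values, via_isin)
```

So a check for a specific word should go through `.values` or `.isin(...)`.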

Anyway...
## Nice plot time

In [None]:
websites.clean.plot(logy=True)

Right, that doesn't work because of artifrq. Let's correct for that:

In [None]:
websites.clean[:10].apply(lambda x: (x['word'], x.counts), axis=1, result_type='broadcast')

In [None]:
# THIS IS SOOOO SLOOOOW!
# websites.clean_no_artifrq = websites.clean.apply(lambda x: (x['word'], x['counts']-100000000 if x['counts']>=100000000 else x['counts']), axis=1, broadcast=True)
# better:
websites.clean_no_artifrq = websites.clean[['word', 'counts']].copy()
websites.clean_no_artifrq.loc[websites.clean_no_artifrq['counts'] >= 100000000, 'counts'] -= 100000000

In [None]:
websites.clean_no_artifrq.plot(logx=True, logy=True)

Let's see if we can compare that / fit to a Zipf curve.

In [None]:
@functools.lru_cache(maxsize=10)
def zipf_normalization(N, s):
    return sum(1 / np.arange(1, N + 1)**s)


def zipf_from_ranks(ranks, *, s=1):
    return 1/ranks**s / zipf_normalization(len(ranks), s)


def zipf(N, *, s=1):
    ranks = np.arange(1, N + 1)
    return zipf_from_ranks(ranks, s=s)


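def zipf_mandelbrot(N, *, q=0, s=1):
    # Note: zipf_normalization only depends on N and s, so for q != 0
    # the returned values do not sum to exactly 1.
    ranks_plus_q = np.arange(1, N + 1) + q
    return zipf_from_ranks(ranks_plus_q, s=s)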

As the plot below shows, the fit is remarkably good with no parameter tweaking at all!

Actually, it makes sense to have an index starting from 1 here, to make plotting in log-log nicer.

In [None]:
websites.clean_no_artifrq.index += 1

In [None]:
fig, ax = plt.subplots(1, 1)
websites.clean_no_artifrq.plot(logx=True, logy=True, ax=ax, legend=False)
N = len(websites.clean_no_artifrq)
ax.plot(np.arange(1, N + 1), websites.clean_no_artifrq.counts.sum() * zipf_mandelbrot(N),
       label='zipf-mandelbrot')
ax.legend()

What about the fraction?

In [None]:
plt.semilogx(1 - websites.clean_no_artifrq.counts/(websites.clean_no_artifrq.counts.sum() * zipf_mandelbrot(N)))
plt.ylim(-1, 1)

And the KL-divergence (in **bits**, i.e. using `log2`) of the data compared to the theoretical Zipf-curve?

In [None]:
def KLdiv(data, model):
    data_masked = np.ma.masked_array(data, mask=data <= 0)
    return -np.sum((data_masked * np.log2(model / data_masked)))

In [None]:
KLdiv(websites.clean_no_artifrq.counts/websites.clean_no_artifrq.counts.sum(), zipf_mandelbrot(N))

0.5 bits, is that good? Should compare to the entropy of the data:

In [None]:
def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # terms with p == 0 contribute nothing to the entropy
    return -np.sum(p * np.log2(p))

In [None]:
entropy(websites.clean_no_artifrq.counts/websites.clean_no_artifrq.counts.sum())

So the Zipf approximation costs only about 5%: encoding the observed data (the counts) with a code optimized for the Zipf curve takes about 5% more bits per word than a code optimized for the data's own distribution. This seems pretty good to me.
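
To make the comparison of KL-divergence and entropy concrete, here is a small self-contained sketch on a made-up four-word distribution. It uses the identity that the cross-entropy (average code length when data from `p` is encoded with a code built for `q`) equals the entropy of `p` plus the KL-divergence. The function names are illustrative and independent of the `KLdiv` and `entropy` cells above.

```python
import math


def entropy_bits(p):
    """Shannon entropy of distribution p in bits; zero entries contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def kl_bits(p, q):
    """KL-divergence D(p || q) in bits, skipping entries where p is zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy_bits(p, q):
    """Average code length in bits when data from p uses a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


# Toy "observed" distribution and a Zipf (s=1) model over four ranks
p = [0.5, 0.25, 0.15, 0.10]
norm = sum(1 / k for k in range(1, 5))
q = [1 / k / norm for k in range(1, 5)]

h = entropy_bits(p)
d = kl_bits(p, q)
print(f"entropy = {h:.3f} bits, KL = {d:.3f} bits, overhead = {d / h:.1%}")
```

The printed overhead is the same kind of ratio as the 5% quoted above: extra bits relative to the entropy of the data.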

In principle, you could try to fit the parameters on a minimum KLdiv. Let's try, why not.

In [None]:
scop.minimize(lambda parameter_array: KLdiv(websites.clean_no_artifrq.counts/websites.clean_no_artifrq.counts.sum(), zipf_mandelbrot(N, q=parameter_array[0], s=parameter_array[1])),
              x0=[0, 1], bounds=[(0, None), (0, None)])

Ok, nice, `q` indeed stays at zero, `s` is only a bit higher than 1 and the KL-divergence is only very slightly lower. So indeed, the "default" Zipf curve with power 1 is a very good fit already.
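
One caveat for this fit: `zipf_normalization(N, s)` does not depend on `q`, so for `q != 0` the model handed to `KLdiv` is not a normalized distribution, which may influence where the optimizer ends up. Whether that changes the fitted values here has not been checked. A sketch of a variant that normalizes over the shifted ranks (the name `zipf_mandelbrot_normalized` is hypothetical):

```python
import numpy as np


def zipf_mandelbrot_normalized(N, *, q=0.0, s=1.0):
    """Zipf-Mandelbrot probabilities for ranks 1..N that always sum to 1.

    The normalization is computed over the shifted ranks (k + q), so the
    result is a proper distribution for any q >= 0.
    """
    weights = 1.0 / (np.arange(1, N + 1) + q) ** s
    return weights / weights.sum()


probs = zipf_mandelbrot_normalized(1000, q=2.7, s=1.1)
print(probs.sum())
```

For `q=0` this gives the same values as the plain Zipf curve, so it could be swapped into the `scop.minimize` call to compare the results.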

In [None]:
fig, ax = plt.subplots(1, 1)
websites.clean_no_artifrq.plot(logx=True, logy=True, ax=ax, legend=False)
N = len(websites.clean_no_artifrq)
ax.plot(np.arange(1, N + 1), websites.clean_no_artifrq.counts.sum() * zipf_mandelbrot(N, q=0, s=1.08957528),
       label='zipf-mandelbrot')
ax.legend()

Yeah, that looks slightly better by eye, but nothing amazing.

# Back to data loading

Still have a few things to load: anahash, confuslist.index, short.ldcalc, ldcalc.ambi, ldcalc and ldcalc.ranked for the non-NT run and also corpusfoci (part of anahash) for the NT run.

## Anahash

In [None]:
websites.anahash = pd.read_csv('sonar_ticcl/WR-P-E-I_web_sites.wordfreqlist.tsv.clean.anahash', sep='~',
                               index_col=0, names=['anahash', 'words'])
websites.anahash.head()

Hmm, how do we load such a data file into Pandas efficiently? Asked data SIG. In the meantime, let's try this (https://stackoverflow.com/questions/17116814/pandas-how-do-i-split-text-in-a-column-into-multiple-rows):

In [None]:
def load_anahash_first_try():
    anahash = pd.read_csv('sonar_ticcl/WR-P-E-I_web_sites.wordfreqlist.tsv.clean.anahash', sep='~',
                                   index_col=0, names=['anahash', 'words'],
                                   converters={'words': lambda w: tuple(w.split('#'))})
    anahash['words'][:20].apply(pd.Series, 1).stack()
    anahash['words'][:20]
    return anahash

In [None]:
# websites.anahash = load_anahash_first_try()

Really weird, some `\n`s aren't read as newlines! Sublime Text has no problem with them... What's up? Wrong encoding?

In [None]:
# websites.anahash = pd.read_csv('sonar_ticcl/WR-P-E-I_web_sites.wordfreqlist.tsv.clean.anahash',
#                                sep='~', quoting=csv.QUOTE_NONE,
#                                index_col=0, names=['anahash', 'words'],
#                                converters={'words': lambda w: tuple(w.split('#'))}, encoding='utf-8')
# websites.anahash.head(20)

That's better. Try again:

In [None]:
# websites.anahash[:20]['words'].apply(pd.Series, 1).stack()

Awesome. Now in one go?

In [None]:
# websites.anahash = pd.read_csv('sonar_ticcl/WR-P-E-I_web_sites.wordfreqlist.tsv.clean.anahash',
#                                sep='~', quoting=csv.QUOTE_NONE,
#                                index_col=0, names=['anahash', 'words'],
#                                converters={'words': lambda w: pd.Series(w.split('#'))}, encoding='utf-8')
# websites.anahash.head(20)

Ok that doesn't work. Let's stick to two (or actually four) steps then.

In [None]:
def load_anahash():
    anahash_tuples_df = pd.read_csv('sonar_ticcl/WR-P-E-I_web_sites.wordfreqlist.tsv.clean.anahash',
                                    sep='~', quoting=csv.QUOTE_NONE,
                                    index_col=0, names=['anahash', 'words'],
                                    converters={'words': lambda w: tuple(w.split('#'))}, encoding='utf-8')
    anahash = anahash_tuples_df['words'].apply(pd.Series, 1).stack().to_frame()
    anahash.index.rename(["anahash", "variant_id"], inplace=True)
    anahash.rename({0: 'word'}, axis='columns', inplace=True)
    return anahash
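
A possibly faster alternative to `.apply(pd.Series).stack()` is `Series.explode`, which exists in pandas 0.25 and later. The sketch below runs on two made-up lines in the `anahash~word#word` format; it has not been timed on the real file, and the variable names are illustrative.

```python
import csv
import io

import pandas as pd

raw = "100~ab#ba\n205~abc\n"  # toy lines in the anahash file format

wide = pd.read_csv(io.StringIO(raw), sep="~", quoting=csv.QUOTE_NONE,
                   names=["anahash", "words"])

# One row per word: split on '#', then explode the resulting lists
long_df = (
    wide.set_index("anahash")["words"]
        .str.split("#")
        .explode()
        .to_frame("word")
)
# Number the variants within each anahash: 0, 1, 2, ...
long_df["variant_id"] = long_df.groupby(level="anahash").cumcount().values
long_df = long_df.set_index("variant_id", append=True)
print(long_df)
```

The result has the same `(anahash, variant_id)` MultiIndex layout as the frame built by `load_anahash`.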

In [None]:
# websites.anahash = load_anahash()

In [None]:
%time load_anahash()

This takes some time to load, so let's save the result.

In [None]:
# websites.anahash.to_msgpack('sonar_ticcl/WR-P-E-I_web_sites.wordfreqlist.tsv.clean.anahash.msgpack')

In [None]:
websites.anahash = pd.read_msgpack('sonar_ticcl/WR-P-E-I_web_sites.wordfreqlist.tsv.clean.anahash.msgpack')
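
Note that `to_msgpack` and `read_msgpack` were deprecated in pandas 0.25 and removed in pandas 1.0. On newer pandas versions, pickle is one possible replacement for this kind of local cache. A minimal sketch with a toy frame and a hypothetical file name:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {"word": ["ab", "ba"]},
    index=pd.MultiIndex.from_tuples([(100, 0), (100, 1)],
                                    names=["anahash", "variant_id"]),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "anahash.pkl")  # hypothetical cache file name
    df.to_pickle(path)                       # save the parsed frame
    restored = pd.read_pickle(path)          # load it back later

print(restored.equals(df))
```

Pickle files should only be loaded from trusted sources, which is fine for a cache written by the same notebook.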

In [None]:
websites.anahash.head()

Ok, great. Let's see what's in there.

In [None]:
len(websites.anahash)

Odd, that's two more words than the original clean list...

In [None]:
len(websites.clean)

Ok, must be some empty line or some wrong handling of a comma or newline... anyway...

In [None]:
websites.anahash.groupby('anahash').count()

In [None]:
websites.anahash.groupby('anahash').count().mean()

Yeah, ok, some anagrams have more variants than others, nothing surprising. Let's see some more interesting statistics, like word length vs variants.

ACCORDING TO MARTIN IT SHOULD BE ABOUT 1.3 ON AVERAGE.

Assuming that all anagrams have an equal number of characters, we can just use the 0 variant_ids to count the string lengths.

In [None]:
df = websites.anahash\
      .groupby('anahash').count()\
      .rename({'word': 'variant_count'}, axis='columns')\
      .join(websites.anahash
            .loc[(slice(None), 0), :]['word']
            .str.len()
            .reset_index(level='variant_id', drop=True)
            .rename('word_length'))

In [None]:
corner.corner(df);

Also not surprising: longer words, fewer variants.

In [None]:
websites.anahash.groupby('variant_id').count()

# confuslist.index

This file is huge, 1.8G, so we need some other sort of handling, pandas will surely crash on anything but a supercomputer.

In [None]:
websites.confuslist_index = pd.read_csv('sonar_ticcl/WR-P-E-I_web_sites.wordfreqlist.tsv.clean.confuslist.index',
                                        memory_map=True, sep='#', nrows=10, index_col=0,
                                        names=['confusion', 'word_hashes'])

In [None]:
websites.confuslist_index

If we were to load this in similarly to the anahashes, with a MultiIndex, how much memory would that take? It would increase the number of columns to three (even though the second index could probably be a small int), so $3 \times 8 = 24$ bytes per word hash. The number of word hashes is the number of commas plus the number of newlines in the file. This counts commas:
```sh
tr -cd ',' < WR-P-E-I_web_sites.wordfreqlist.tsv.clean.confuslist.index | wc -c
```
This counts newlines:
```sh
wc -l WR-P-E-I_web_sites.wordfreqlist.tsv.clean.confuslist.index
```

In [None]:
commas = 148610673
newlines = 250747
commas + newlines, (commas + newlines) * 24, (commas + newlines) * 14

Oi, 3.5G, more than expected from just the text file size... Actually, we can probably just use uint16 for the second index and uint32 for the confusion, only the actual hashes must be uint64, so that would sum to just $2 + 4 + 8 = 14$ bytes per word hash, a total of about 2G then.

In [None]:
commas/newlines

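That average of roughly 590 hashes per line is far below 65535, so a uint16 for the second index looks possible, provided that no single line holds more than 65536 hashes (the average alone does not guarantee that).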
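
Since the ratio above is only an average, a direct check of the longest line is one way to confirm that a uint16 list index is enough. The sketch below streams over lines in the assumed `confusion#hash,hash,...` format; the function name is hypothetical and the sample lines are made up.

```python
def max_hashes_per_line(lines):
    """Return the largest number of comma-separated hashes on any line.

    `lines` is any iterable of strings of the form 'confusion#h1,h2,...',
    for example an open file object.
    """
    longest = 0
    for line in lines:
        _, sep, hashes = line.rstrip("\n").partition("#")
        if not sep or not hashes:
            continue  # skip blank or malformed lines
        longest = max(longest, hashes.count(",") + 1)
    return longest


sample = ["12#101,102,103\n", "34#201\n", "\n"]  # toy lines
longest = max_hashes_per_line(sample)
# The largest list index is longest - 1, which must fit in a uint16
print(longest, longest - 1 <= 65535)
```

On the real file this could be called as `max_hashes_per_line(open(path))`, which reads line by line and keeps memory use low.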

Ok, let's try it out then, developed this in about 1.5 hours:

In [None]:
import ticcl_output_reader

In [None]:
%timeit confusion_array, confusion_word_index_array, word_anahash_array = ticcl_output_reader.load_confuslist_index("sonar_ticcl/WR-P-E-I_web_sites.wordfreqlist.tsv.clean.confuslist.index.head")

In [None]:
def get_confuslist_index_head():
    df_tuples = pd.read_csv('sonar_ticcl/WR-P-E-I_web_sites.wordfreqlist.tsv.clean.confuslist.index.head',
                        sep='#', index_col=0, names=['confusion', 'word_hashes'],
                        converters={'word_hashes': lambda w: tuple(w.split(','))})
    df = df_tuples['word_hashes'].apply(pd.Series, 1).stack().to_frame()
    df.index.rename(["confusion", "list_index"], inplace=True)
    df.rename({0: 'word_hash'}, axis='columns', inplace=True)
    return df

In [None]:
%timeit get_confuslist_index_head()

In [None]:
200/4

Yay! Is it correct though?

Not immediately, had to fix some bugs.

In [None]:
cpp_index = ticcl_output_reader.load_confuslist_index("sonar_ticcl/WR-P-E-I_web_sites.wordfreqlist.tsv.clean.confuslist.index.head")

In [None]:
cpp_index

In [None]:
cpp_index_df = pd.DataFrame.from_records({"confusion": cpp_index[0],
                                          "list_index": cpp_index[1],
                                          "word_hash": cpp_index[2]}, index=["confusion", "list_index"])

In [None]:
pandas_index = get_confuslist_index_head()

In [None]:
all(cpp_index_df.index == pandas_index.index)

In [None]:
cpp_index_df.values == pandas_index.values

In [None]:
cpp_index_df.head(5)

In [None]:
pandas_index.head(5)

Odd, they seem equal...

In [None]:
cpp_index_df.dtypes

In [None]:
pandas_index.dtypes

Ahhh, yeah ok. That may also explain the slowness...

In [None]:
def get_confuslist_index_head2():
    df_tuples = pd.read_csv('sonar_ticcl/WR-P-E-I_web_sites.wordfreqlist.tsv.clean.confuslist.index.head',
                        sep='#', index_col=0, names=['confusion', 'word_hashes'],
                        converters={'word_hashes': lambda w: tuple(w.split(','))})
    df = df_tuples['word_hashes'].apply(pd.Series, 1).stack().astype('uint64').to_frame()
    df.index.rename(["confusion", "list_index"], inplace=True)
    df.rename({0: 'word_hash'}, axis='columns', inplace=True)
    return df

In [None]:
pandas_index = get_confuslist_index_head2()

In [None]:
pandas_index.equals(cpp_index_df)

Whoohoo!

And timing on this?

In [None]:
%timeit get_confuslist_index_head2()

Same, good.

A Redditor came up with the suggestion to do it in pure Python, let's try that:

In [None]:
def get_confuslist_index_head_pure_python():
    with open('sonar_ticcl/WR-P-E-I_web_sites.wordfreqlist.tsv.clean.confuslist.index.head') as f:
        dc = {}
        for line in f:
            if not line.strip():  # lines keep their "\n", so strip before testing
                continue
            key, _, value = line.partition("#")
            values = value.rstrip("\n").split(",")
            dc[int(key)] = values
    return dc

In [None]:
%timeit get_confuslist_index_head_pure_python()

Holy crap, but what does it look like?

In [None]:
pure_py = get_confuslist_index_head_pure_python()

In [None]:
# pure_py

Right, so I'll need to still convert to three columns here as well:

In [None]:
def df_confuslist_index_head_pure_python():
    with open('sonar_ticcl/WR-P-E-I_web_sites.wordfreqlist.tsv.clean.confuslist.index.head') as f:
        dc = {}
        for line in f:
            if not line.strip():  # lines keep their "\n", so strip before testing
                continue
            key, _, value = line.partition("#")
            values = value.rstrip("\n").split(",")
            dc[int(key)] = values
    df = pd.DataFrame.from_dict(dc, orient='index').stack().astype('uint64').to_frame()
    df.index.rename(["confusion", "list_index"], inplace=True)
    df.rename({0: 'word_hash'}, axis='columns', inplace=True)
    return df

In [None]:
pure_py_df = df_confuslist_index_head_pure_python()

In [None]:
pandas_index.equals(pure_py_df)

Ok, then we time that...

In [None]:
%timeit df_confuslist_index_head_pure_python()
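
For the full 1.8G file, a dict of string lists will likely use far more memory than the 14 bytes per word hash estimated earlier. One pure-Python option is to fill three parallel typed arrays from the standard `array` module, mirroring the three arrays the compiled reader returns. This is a sketch under the assumed `confusion#hash,hash,...` format, not a tested replacement; the function name is hypothetical.

```python
from array import array


def parse_confuslist_lines(lines):
    """Parse 'confusion#h1,h2,...' lines into three parallel typed arrays.

    Type codes: 'I' is typically 4 bytes, 'H' 2 bytes and 'Q' 8 bytes
    (check `.itemsize` on your platform). A value that does not fit raises
    OverflowError instead of wrapping around silently.
    """
    confusion = array("I")
    list_index = array("H")
    word_hash = array("Q")
    for line in lines:
        key, sep, value = line.strip().partition("#")
        if not sep or not value:
            continue  # skip blank or malformed lines
        for i, h in enumerate(value.split(",")):
            confusion.append(int(key))
            list_index.append(i)
            word_hash.append(int(h))
    return confusion, list_index, word_hash


sample = ["12#101,102\n", "34#18446744073709551615\n", "\n"]  # toy lines
confusion, list_index, word_hash = parse_confuslist_lines(sample)
print(list(confusion), list(list_index), list(word_hash))
```

The three arrays support the buffer protocol, so they can be passed to `np.frombuffer` or used as DataFrame columns without an intermediate list of Python objects.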

# short.ldcalc

# ldcalc.ambi

# ldcalc

# ldcalc.ranked