# Preparation

Use the following cell if the test version is required.

In [None]:
%pip install biopython
%pip install Levenshtein
%pip install gensim
%pip install pyjarowinkler

%pip install --index-url https://test.pypi.org/simple/ --no-deps --force-reinstall corpus_distance 

Use the following cell if the release version is required.

In [None]:
%pip install corpus_distance

For reproducibility, the next step is to set two random seeds, as some of the tools used rely on `random.seed` and others on `numpy.random.RandomState`. In addition, in Python 3 the `PYTHONHASHSEED` environment variable must be set to 0 for `FastText` character-based embeddings and `LDA` topic modelling to be reproducible.

In [None]:
SEED = 42

In [None]:
import random
random.seed(SEED)

In [None]:
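import numpy as np
# seed NumPy's global legacy RandomState; a bare RandomState(SEED) call
# would create a generator and immediately discard it
np.random.seed(SEED)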

In [None]:
%env PYTHONHASHSEED=0
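
Python reads `PYTHONHASHSEED` when the interpreter starts, so the setting applies to processes launched after the variable is set. The following is a minimal sketch for checking this in a fresh process; it uses only the standard library, and the helper name `string_hash_in_fresh_interpreter` is hypothetical and not part of `corpus_distance`.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_interpreter(hash_seed: str) -> str:
    """Start a new Python process with PYTHONHASHSEED set and return hash('lect')."""
    env = {**os.environ, "PYTHONHASHSEED": hash_seed}
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('lect'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# With a fixed hash seed, two separate interpreter runs agree on string hashes.
first_run = string_hash_in_fresh_interpreter("0")
second_run = string_hash_in_fresh_interpreter("0")
print(first_run == second_run)
```

If the two values differ in your setup, the variable is probably not reaching the new process.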

The next step is to configure the logger, which gives the user a better understanding of what is going on. By default, the logger reports only warnings (`logging.WARNING` level); for more verbose output, set the level to `logging.INFO`. In debug mode, the `logging.DEBUG` level is preferable.

In [None]:
import logging
logger = logging.getLogger()
logging.basicConfig(format='%(asctime)s %(name)s:%(levelname)s:%(message)s', level=logging.WARNING, datefmt='%Y-%m-%d %H:%M:%S')
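
To see how the level filters messages, here is a small standalone sketch. It writes to an in-memory stream instead of the notebook output, and the logger name `corpus_distance_demo` is made up for the illustration.

```python
import io
import logging

# Capture log output in memory so the effect of each level is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(name)s:%(levelname)s:%(message)s"))

demo_logger = logging.getLogger("corpus_distance_demo")  # hypothetical logger name
demo_logger.addHandler(handler)
demo_logger.propagate = False  # keep the demo separate from the root logger

demo_logger.setLevel(logging.WARNING)
demo_logger.info("hidden at WARNING level")
demo_logger.warning("shown at WARNING level")

demo_logger.setLevel(logging.INFO)
demo_logger.info("shown at INFO level")

captured = stream.getvalue().splitlines()
print(captured)
```

Only the warning and the second `info` call end up in `captured`; the first `info` call is dropped because the level was still `logging.WARNING`.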

# Data loading

The first step is to set the content directory.

In [None]:
CONTENT_DIR = "/home/ilia/coding/Research/datasets/raw_small_corpora/ES"

The next step is to create (or set) the directory where the package will store its files.

In [None]:
from corpus_distance.pipeline import create_and_set_storage_directory

In [None]:
STORAGE_DIR = "exp_1"

In [None]:
create_and_set_storage_directory(STORAGE_DIR)

Texts (or collections of texts) should be pre-tokenised single strings, optionally stored in separate files. Each filename should contain the lect name immediately before the extension, with the fields separated by `.`. For example, in `Akimov.Belogornoje.txt`, *Akimov* is the text name, *Belogornoje* is the lect name, and *txt* is the extension.

Texts become dictionary keys, and lect names become their values.
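
The naming convention can be sketched as follows. This is only an illustration of the `<text>.<lect>.<extension>` pattern; the helper `split_text_and_lect` is hypothetical, and `load_data` may parse filenames differently.

```python
from pathlib import Path


def split_text_and_lect(filename: str) -> tuple[str, str]:
    """Return (text name, lect name) for names like 'Akimov.Belogornoje.txt'."""
    parts = Path(filename).name.split(".")
    if len(parts) < 3:
        raise ValueError(f"expected '<text>.<lect>.<extension>', got {filename!r}")
    # The lect name is the last dot-separated field before the extension;
    # everything before it is treated as the text name.
    return ".".join(parts[:-2]), parts[-2]


print(split_text_and_lect("Akimov.Belogornoje.txt"))
```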

The `SPLIT` variable regulates the share of the data taken (from `0` to `1`).

In [None]:
SPLIT = 1

In [None]:
from corpus_distance.data_preprocessing.data_loading import load_data
df = load_data(CONTENT_DIR, SPLIT)

In [None]:
df.head(10)

The next stage is the transformation of the dictionary into a dataframe of the following format:

| index | text | lect |
| -------- | ------- |------- |
| 0 | text1 | lect1 |
| 1 | text2 | lect1 |
| 2 | text1 | lect2 |
| ... | ... | ... |
| m | textN | lectK |

Here *m* is the overall number of texts, *K* is the overall number of lects, and *N* is the number of texts in lect *K*.

In [None]:
df.head()

# Data processing

Here we get lect names.

In [None]:
from corpus_distance.cdutils import get_lects_from_dataframe

In [None]:
lects = get_lects_from_dataframe(df)

In [None]:
lects

## Topic antimodelling

Topic antimodelling removes topic words, which reflect the content of the texts rather than the language. To enable it, set `TOPIC_NORMALISATION = 'substitute'`; to disable it, set `TOPIC_NORMALISATION = 'not_substitute'`. To use topic modelling instead, set `TOPIC_NORMALISATION = 'topic_words_only'`.

In [None]:
TOPIC_NORMALISATION = 'substitute'

In [None]:
from corpus_distance.data_preprocessing.topic_modelling import get_topic_words_for_lects, add_topic_modelling

In [None]:
topic_words = get_topic_words_for_lects(df, lects)

In [None]:
df_without_topics = add_topic_modelling(df, STORAGE_DIR, topic_words, TOPIC_NORMALISATION)

In [None]:
df_without_topics.head()
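
The three modes can be pictured with a toy function. This is a sketch under assumed semantics, not the package's implementation: here `'substitute'` simply drops topic words (the package may replace them with something else), and `apply_topic_normalisation` is a hypothetical name.

```python
def apply_topic_normalisation(tokens, topic_words, mode):
    """Toy version of the three TOPIC_NORMALISATION modes (assumed semantics)."""
    if mode == "substitute":
        # antimodelling: drop words that describe the topic, not the language
        return [t for t in tokens if t not in topic_words]
    if mode == "topic_words_only":
        # topic modelling: keep only the topic words
        return [t for t in tokens if t in topic_words]
    if mode == "not_substitute":
        # leave the text untouched
        return list(tokens)
    raise ValueError(f"unknown mode: {mode!r}")


tokens = "the miller went to the mill".split()
topic_words = {"miller", "mill"}  # made-up topic words for the example

for mode in ("substitute", "not_substitute", "topic_words_only"):
    print(mode, apply_topic_normalisation(tokens, topic_words, mode))
```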

## Vectorisation

I start by creating a model that represents the key properties of each lect:

* Its name
* The text it contains, lowercased
* Its alphabet (with the obligatory CLS `^` and EOS `$` symbols)
* The entropy of its alphabet
* A vector for each symbol of the alphabet
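
As a rough illustration of the alphabet and entropy properties, the sketch below builds an alphabet with the boundary symbols and computes Shannon entropy over character frequencies. Both the choice of Shannon entropy and the function names are assumptions; the package may compute these values differently.

```python
import math
from collections import Counter


def alphabet_with_boundaries(text):
    """Sorted characters of a lowercased text plus the '^' and '$' markers."""
    chars = {c for c in text.lower() if not c.isspace()}
    return sorted(chars | {"^", "$"})


def shannon_entropy(text):
    """Shannon entropy (in bits) of the character distribution of a text."""
    counts = Counter(c for c in text.lower() if not c.isspace())
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


print(alphabet_with_boundaries("Ab ba"))
print(shannon_entropy("abab"))
```

Two equally frequent symbols carry one bit of entropy, so an alphabet with more symbols that are used evenly scores higher.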

In [None]:
from corpus_distance.data_preprocessing.vectorisation import create_vectors_for_lects, gather_vector_information, FastTextParams

In [None]:
vectors_for_lects = create_vectors_for_lects(df_without_topics, STORAGE_DIR, FastTextParams(seed=SEED))

In [None]:
from pprint import pprint

In [None]:
pprint(vectors_for_lects)

# Data preprocessing

The first stage of data preprocessing is splitting tokens into character 3-grams. Character n-grams make it easier to find coinciding sequences than tokens or token n-grams do. 3-grams in particular help to pinpoint the exact places where a change happens, as they provide minimal left and right context for each symbol within the sequence. Adding the special symbols *^* and *$* to the start and the end of each sequence extends this to its first and last symbols.
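
A minimal sketch of this splitting, assuming each token is padded with `^` and `$` before being cut into overlapping 3-grams (the function name `char_n_grams` is illustrative and not the package's API):

```python
def char_n_grams(token, n=3):
    """Split a token into overlapping character n-grams with boundary markers."""
    padded = f"^{token}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_n_grams("bar"))
```

Every symbol of `bar` now appears with a left and a right neighbour in at least one 3-gram, including the first and the last one.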

In [None]:
from corpus_distance.data_preprocessing.shingle_processing import split_lects_by_n_grams

In [None]:
df_with_n_grams = split_lects_by_n_grams(df_without_topics)

The new dataframe has the following format:

| index | lect | n-gram array |
| -------- | ------- |------- |
| 0 | lect1 | n-grams of lect1 |
| 1 | lect2 | n-grams of lect2 |
| ... | ... | ... |
| k | lectK | n-grams of lectK |

Here, *k* is the overall number of lects.

In [None]:
df_with_n_grams.head()

The next step is to rank n-grams by frequency. The results form *frequency_arranged_n_grams* column of the dataframe.
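
Ranking by frequency can be sketched with `collections.Counter`. This is an illustration only: the tie-breaking rule (alphabetical) is an assumption, and `rank_n_grams_by_frequency` is not the package's function.

```python
from collections import Counter


def rank_n_grams_by_frequency(n_grams):
    """Return distinct n-grams ordered from most to least frequent."""
    counts = Counter(n_grams)
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(counts, key=lambda gram: (-counts[gram], gram))


sample = ["^ba", "bar", "ar$", "^ba", "bar", "^ba"]
print(rank_n_grams_by_frequency(sample))
```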

In [None]:
from corpus_distance.data_preprocessing.frequency_scoring import count_n_grams_frequencies

In [None]:
df_new = count_n_grams_frequencies(df_with_n_grams)

In [None]:
# add information on letter vectors and alphabet entropy to dataframe

df_new = gather_vector_information(df_new, vectors_for_lects)

In [None]:
df_new.head()

# Metrics

The first step is to introduce a measure for hybridisation.

One possible measure is the Euclidean distance between the sums of the letter vectors of each n-gram. Summation discards the order of symbols within an n-gram, so anagrams become indistinguishable (*bra* = *bar*), which is a disadvantage when the measure is used alone; combined with DistRank and Jaro distance, it will hopefully yield better results.
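
The loss of order is easy to reproduce with toy data. In the sketch below, the two-dimensional letter vectors are invented for the example (real ones come from the FastText model), and the function names are illustrative.

```python
import math

# Made-up 2-dimensional letter vectors, for illustration only.
toy_vectors = {
    "b": (1.0, 0.0),
    "r": (0.0, 1.0),
    "a": (1.0, 1.0),
    "o": (0.5, 2.0),
}


def n_gram_vector(n_gram, vectors):
    """Sum the per-letter vectors of an n-gram component-wise."""
    return tuple(sum(component) for component in zip(*(vectors[ch] for ch in n_gram)))


def euclidean_n_gram_distance(first, second, vectors):
    """Euclidean distance between the summed letter vectors of two n-grams."""
    return math.dist(n_gram_vector(first, vectors), n_gram_vector(second, vectors))


print(euclidean_n_gram_distance("bra", "bar", toy_vectors))  # same letters, different order
print(euclidean_n_gram_distance("bra", "bro", toy_vectors))  # one letter differs
```

The first distance is zero because summation is order-independent, which is why this measure is paired with order-sensitive ones.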

Optional normalisation uses the alphabet entropy difference, calculated by subtracting the entropy of the second alphabet from that of the first. This compensates for cases where a letter from one alphabet has multiple correspondences in the other, depending on the context. The direct measure (rather than the reversed one, `1 - X`) is preferable: the more information one alphabet carries relative to the other, the more one-to-many correspondences are possible, the more the vectors are distorted, and the more normalisation is needed.

The final normalisation is the traditional division by the maximal length of the two strings, introduced in Holman et al. (2008).
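
Length normalisation of this kind can be illustrated with plain Levenshtein distance as a stand-in for the string measure; the package applies the division to its own measures, so the functions below are only a sketch with illustrative names.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b), # substitution or match
            ))
        previous = current
    return previous[-1]


def length_normalised_distance(a, b):
    """Divide the edit distance by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(length_normalised_distance("kitten", "sitting"))
```

Dividing by the longer length keeps the result between 0 and 1, so pairs of short and long strings become comparable.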

In [None]:
from corpus_distance.distance_measurement.string_similarity import *
from corpus_distance.distance_measurement.hybridisation import HybridisationParameters

In [None]:
# assigning global values
# group of languages and its outgroup
GROUP = "East Slavic"
OUTGROUP = "Zialionka"

# whether a hybrid metric complements DistRank
HYBRIDISATION = True
# whether hybrid values join DistRank values in a single array,
# or both are independent values contributing equally to the final metric
HYBRIDISATION_AS_ARRAY = True

# whether DistRank normalisation includes the Soerensen coefficient
SOERENSEN_NORMALISATION = True

# choose a metric for hybridisation
HYBRID = weighted_jaro_winkler_wrapper

# whether the string similarity measure includes a correction by
# the difference in alphabet entropies
ALPHABET_NORMALISATION = True

# metric description
METRICS = f"{GROUP}-{SPLIT}-{TOPIC_NORMALISATION}-DistRank-{SOERENSEN_NORMALISATION}-{HYBRIDISATION}-{HYBRIDISATION_AS_ARRAY}-{HYBRID.__name__}-{ALPHABET_NORMALISATION}"

In [None]:
hybridisation_parameters = HybridisationParameters(HYBRIDISATION, SOERENSEN_NORMALISATION, HYBRIDISATION_AS_ARRAY, HYBRID, ALPHABET_NORMALISATION)

In [None]:
METRICS

In [None]:
from corpus_distance.distance_measurement.metrics_pipeline import score_metrics_for_corpus_dataset

In [None]:
# calculate distances for each pair of lects
overall_results = score_metrics_for_corpus_dataset(df_new, GROUP, STORAGE_DIR, METRICS, hybridisation_parameters)

In [None]:
overall_results

In [None]:
from pandas import DataFrame

In [None]:
# map each unordered pair of lects to its distance (first match wins)
pair_distances = {}
for d in overall_results:
    pair_distances.setdefault(frozenset((d[0][0], d[0][1])), d[1])

# fix one lect order and reuse it for both rows and columns,
# so that column labels always match the matrix content
ordered_lects = sorted(set(lects))

# build the full symmetric matrix with zeros on the diagonal
final_matrix = [
    [0 if a == b else pair_distances[frozenset((a, b))] for b in ordered_lects]
    for a in ordered_lects
]
df_res = DataFrame(final_matrix, columns=ordered_lects)

In [None]:
import fastnntpy as fn
n = fn.run_neighbour_net(df_res)

In [None]:
import networkx as nx
import matplotlib.pyplot as plt
import os

In [None]:
def plot_fast_nnt_networkx(nx_obj, out_path="test/plots/fast_nnt_nx.png",
                           shift=0, node_size=10, font_size=7,
                           scale_width_by_weight=False, dpi=300):
    # -- data from PyNexus --
    labels = {i + shift: s for i, s in nx_obj.get_node_translations()}
    pos    = {i + shift: (x, y) for i, x, y in nx_obj.get_node_positions()}
    # corrected parsing order: (edge_id, u, v, sid, w)
    edges_raw = [ (u + shift, v + shift, w)
                  for (_eid, u, v, _sid, w) in nx_obj.get_graph_edges() ]

    # only keep edges whose endpoints have positions
    edges = [(u, v, w) for (u, v, w) in edges_raw if u in pos and v in pos]
    if not edges:
        raise ValueError("No drawable edges (endpoints missing positions).")

    # -- build graph --
    G = nx.Graph()
    for u, v, w in edges:
        G.add_edge(u, v, weight=w)

    # leaves only (degree == 1)
    leaves = [n for n, d in G.degree() if d == 1]

    # edge widths (optional)
    if scale_width_by_weight:
        ws = [G[u][v].get("weight", 1.0) for u, v in G.edges()]
        wmax = max(ws) if ws else 1.0
        widths = [0.5 + 2.5 * (w / wmax) for w in ws]
    else:
        widths = 0.8

    # -- draw (no layout) --
    os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
    plt.figure(figsize=(8, 8), dpi=dpi)

    nx.draw_networkx_edges(G, pos, width=widths, edge_color="black", alpha=0.9)
    nx.draw_networkx_nodes(G, pos, nodelist=leaves, node_size=node_size, node_color="black")

    leaf_labels = {n: labels.get(n, str(n)) for n in leaves}
    nx.draw_networkx_labels(G, pos, labels=leaf_labels, font_size=font_size)

    plt.axis("equal")
    plt.axis("off")
    plt.tight_layout(pad=0.02)
    plt.savefig(out_path, dpi=dpi, bbox_inches="tight", pad_inches=0.01)
    base, _ = os.path.splitext(out_path)
    plt.savefig(base + ".svg", bbox_inches="tight", pad_inches=0.01)
    plt.close()
    return out_path

plot_fast_nnt_networkx(n, out_path="1.png")

In [None]:
print("Labels")
print(len(n.get_labels()))
print("Splits Records")
print(len(n.get_splits_records()))
print("Node Translations")
print(len(n.get_node_translations()))
print("Node Positions")
print(len(n.get_node_positions()))
print("Graph Edges")
print(len(n.get_graph_edges()))

# Clusterisation

The final step is to cluster the lects into groups and to decide whether the method works correctly.

In [None]:
from corpus_distance.clusterisation.clusterisation import ClusterisationParameters, clusterise_lects_from_distance_matrix
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor

In [None]:
cluster_params = ClusterisationParameters(lects, OUTGROUP, GROUP, METRICS, DistanceTreeConstructor().upgma, STORAGE_DIR)

In [None]:
clusterise_lects_from_distance_matrix(overall_results, cluster_params)