# Imports

Use the following cell if the test version is required.

In [None]:
%pip install biopython
%pip install Levenshtein
%pip install gensim
%pip install pyjarowinkler

%pip install --index-url https://test.pypi.org/simple/ --no-deps --force-reinstall corpus_distance 

Use the following cell if the release version is required.

In [None]:
%pip install corpus_distance

In [None]:
import corpus_distance

In [None]:
import random
random.seed(6)

# Data loading

In [None]:
CONTENT_DIR = "/texts/OES_texts_only/"
STORAGE_DIR = "exp_1"
TOPIC_NORMALISATION = True
SPLIT = 1

In [None]:
import os

In [None]:
# create the storage directory if it does not exist yet
os.makedirs(STORAGE_DIR, exist_ok=True)

Texts (or collections of texts) should be pre-tokenised single strings, optionally stored in separate files. Each filename should contain the lect name immediately before the extension, separated by '.'. For example, in 'Akimov.Belogornoje.txt', *Akimov* is the text name, *Belogornoje* is the lect name, and *txt* is the extension.

Texts become dictionary keys, and lect names become their values.
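
The naming convention can be illustrated with a minimal sketch. The helper `parse_lect_filename` is hypothetical and is not part of `corpus_distance`; it only shows how a filename of the form described above splits into a text name and a lect name.

```python
from pathlib import Path


def parse_lect_filename(filename: str) -> tuple[str, str]:
    """Split 'TextName.LectName.ext' into (text name, lect name).

    Hypothetical helper for illustration; not the package's own loader.
    """
    parts = Path(filename).name.split(".")
    if len(parts) < 3:
        raise ValueError(f"expected 'text.lect.ext', got {filename!r}")
    # The lect name is the last component before the extension.
    lect_name = parts[-2]
    # Everything before the lect name is treated as the text name.
    text_name = ".".join(parts[:-2])
    return text_name, lect_name


print(parse_lect_filename("Akimov.Belogornoje.txt"))
```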

In [None]:
from corpus_distance.data_preprocessing.data_loading import load_data
df = load_data(CONTENT_DIR, SPLIT)

In [None]:
df.head(10)

The next stage is the transformation of the dictionary into a dataframe of the following format:

| index | text | lect |
| -------- | ------- |------- |
| 0 | text1 | lect1 |
| 1 | text2 | lect1 |
| 2 | text1 | lect2 |
| ... | ... | ... |
| m | textN | lectK |

*m* here represents the overall number of texts, *K* is the overall number of lects, and *N* is the number of texts in lect *K*.
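
As a rough illustration of this layout, the sketch below turns a text-to-lect mapping into indexed rows using plain Python. The sample data and the variable names are assumptions made for the example; the package itself builds a dataframe.

```python
# Hypothetical input: each text (a pre-tokenised string) maps to its lect name.
texts_to_lects = {
    "text1 of lect1": "lect1",
    "text2 of lect1": "lect1",
    "text1 of lect2": "lect2",
}

# One row per text, numbered from 0, mirroring the table above.
rows = [
    {"index": i, "text": text, "lect": lect}
    for i, (text, lect) in enumerate(texts_to_lects.items())
]

for row in rows:
    print(row)
```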

In [None]:
df.head()

# Data processing

Here we get lect names.

In [None]:
from corpus_distance.cdutils import get_lects_from_dataframe

In [None]:
lects = get_lects_from_dataframe(df)

In [None]:
lects

## Topic modelling

Topic modelling is used to remove topic words, which reflect features of the texts rather than of the language.
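
The idea of neutralising topic words can be sketched as follows. This is an assumption about what a substitution step might look like, not the behaviour of `add_topic_modelling`; the function `substitute_topic_words` and the placeholder `TOPIC` are hypothetical.

```python
def substitute_topic_words(
    text: str, topic_words: set[str], placeholder: str = "TOPIC"
) -> str:
    """Replace every token found in topic_words with a placeholder.

    Hypothetical helper; the placeholder value is an assumption.
    """
    return " ".join(
        placeholder if token in topic_words else token
        for token in text.split()
    )


print(substitute_topic_words("the prince went to kiev", {"prince", "kiev"}))
```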

In [None]:
from corpus_distance.data_preprocessing.topic_modelling import get_topic_words_for_lects, add_topic_modelling

In [None]:
topic_words = get_topic_words_for_lects(df, lects)

In [None]:
df_without_topics = add_topic_modelling(df, STORAGE_DIR, topic_words, 'substitute')

In [None]:
df_without_topics.head()

## Vectorisation

I start by creating a model that represents the key properties of each lect:

* Its name
* The text it contains, lowercased
* Its alphabet (with obligatory CLS `^` and EOS `$` symbols)
* The entropy of its alphabet
* A vector for each symbol of the alphabet
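
To make the entropy property more concrete, here is a minimal sketch that computes the Shannon entropy of the character distribution of a text. Whether the package defines alphabet entropy in exactly this way is an assumption; `alphabet_entropy` is a hypothetical name.

```python
import math
from collections import Counter


def alphabet_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the character distribution of a text.

    Hypothetical helper; whitespace is ignored and the text is lowercased.
    """
    counts = Counter(ch for ch in text.lower() if not ch.isspace())
    total = sum(counts.values())
    return -sum(
        (count / total) * math.log2(count / total)
        for count in counts.values()
    )


# Two equally frequent symbols carry one bit; four carry two bits.
print(alphabet_entropy("abab"))
print(alphabet_entropy("abcd"))
```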

In [None]:
from corpus_distance.data_preprocessing.vectorisation import create_vectors_for_lects, gather_vector_information, FastTextParams

In [None]:
vectors_for_lects = create_vectors_for_lects(df_without_topics, FastTextParams(seed=42))

In [None]:
from pprint import pprint

In [None]:
pprint(vectors_for_lects)

# Data preprocessing

The first stage of data preprocessing is splitting tokens into character 3-grams. Character n-grams make it easier to find coinciding sequences than tokens or token n-grams do. 3-grams in particular help to pinpoint the exact places where a change happens, because they provide a minimal left and right context for each symbol within the sequence. Adding the special symbols *^* and *$* to the start and the end of each sequence provides this context for the first and the last symbol as well.
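
The splitting described above can be sketched in a few lines. The function `char_n_grams` is hypothetical and only illustrates the padding and windowing; it is not the implementation of `split_lects_by_n_grams`.

```python
def char_n_grams(token: str, n: int = 3) -> list[str]:
    """Pad a token with '^' and '$', then slide a window of size n over it."""
    padded = f"^{token}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_n_grams("bra"))  # ['^br', 'bra', 'ra$']
```

Every symbol of the token, including the first and the last one, appears in the middle of exactly one 3-gram, so each has a left and a right neighbour.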

In [None]:
from corpus_distance.data_preprocessing.shingle_processing import split_lects_by_n_grams

In [None]:
df_with_n_grams = split_lects_by_n_grams(df_without_topics)

The new dataframe has the following format:

| index | lect | n-gram array |
| -------- | ------- |------- |
| 0 | lect1 | n-grams of lect1 |
| 1 | lect2 | n-grams of lect2 |
| ... | ... | ... |
| k | lectK | n-grams of lectK |

Here, *k* is the overall number of lects.

In [None]:
df_with_n_grams.head()

The next step is to rank n-grams by frequency. The results form the *frequency_arranged_n_grams* column of the dataframe.
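
A frequency ranking of this kind can be sketched with `collections.Counter`. The function name is hypothetical, and the exact structure that `count_n_grams_frequencies` stores in the column is not shown here, so treat this as an illustration of the idea only.

```python
from collections import Counter


def rank_n_grams_by_frequency(n_grams: list[str]) -> list[str]:
    """Return unique n-grams ordered from most to least frequent.

    Ties keep the order of first occurrence (Counter.most_common behaviour).
    """
    return [n_gram for n_gram, _ in Counter(n_grams).most_common()]


sample = ["^br", "bra", "ra$", "^br", "bra", "^br"]
print(rank_n_grams_by_frequency(sample))  # ['^br', 'bra', 'ra$']
```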

In [None]:
from corpus_distance.data_preprocessing.frequency_scoring import count_n_grams_frequencies

In [None]:
df_new = count_n_grams_frequencies(df_with_n_grams)

In [None]:
# add letter vectors and alphabet information to the dataframe

df_new = gather_vector_information(df_new, vectors_for_lects)

In [None]:
df_new.head()

# Metrics

The first step is to introduce a measure for hybridisation.

One possible measure is the Euclidean distance between the sums of letter vectors of two n-grams. Summation discards the order of symbols within an n-gram, which is a disadvantage when the measure is used alone (*bra* and *bar* become identical); however, combined with DistRank and Jaro distance, it will hopefully yield better results.
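
The loss of order can be demonstrated with a small sketch. The two-dimensional letter vectors below are made up for the example (real vectors come from the vectorisation step), and the function names are hypothetical.

```python
import math

# Hypothetical 2-dimensional letter vectors, chosen only for illustration.
letter_vectors = {
    "b": (1.0, 0.0),
    "r": (0.0, 1.0),
    "a": (1.0, 1.0),
    "t": (2.0, 0.0),
}


def n_gram_vector(n_gram: str, vectors: dict) -> tuple:
    """Sum the letter vectors of an n-gram, dimension by dimension."""
    return tuple(sum(dim) for dim in zip(*(vectors[ch] for ch in n_gram)))


def euclidean_distance(first: str, second: str, vectors: dict) -> float:
    """Euclidean distance between the summed vectors of two n-grams."""
    return math.dist(n_gram_vector(first, vectors), n_gram_vector(second, vectors))


# Same letters in a different order: the sums coincide, so the distance is 0.
print(euclidean_distance("bra", "bar", letter_vectors))  # 0.0
# A different letter changes the sum and gives a non-zero distance.
print(euclidean_distance("bra", "bat", letter_vectors))
```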

Optional normalisation uses the difference in alphabet information, calculated by subtracting the information of the second alphabet from that of the first. This compensates for cases where a letter of one alphabet has multiple correspondences in the other, depending on the context. The direct measure (rather than the reversed one, `1 - X`) is preferable: the more information one alphabet carries relative to the other, the more one-to-many correspondences are possible, the more the vectors are distorted, and the more normalisation is needed.

The final normalisation is the traditional division by the length of the longer of the two strings, introduced in Holman et al. (2008).
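
Length normalisation can be sketched as below. Plain Levenshtein distance is used here only as a stand-in for whichever string measure is chosen in the notebook; both function names are hypothetical.

```python
def levenshtein(first: str, second: str) -> int:
    """Edit distance with insertions, deletions and substitutions."""
    previous = list(range(len(second) + 1))
    for i, ch_first in enumerate(first, start=1):
        current = [i]
        for j, ch_second in enumerate(second, start=1):
            cost = 0 if ch_first == ch_second else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalise_by_max_length(distance: float, first: str, second: str) -> float:
    """Divide a distance by the length of the longer string."""
    longest = max(len(first), len(second))
    return distance / longest if longest else 0.0


raw = levenshtein("kitten", "sitting")
print(raw, normalise_by_max_length(raw, "kitten", "sitting"))
```

Dividing by the longer length keeps the result between 0 and 1 for edit distance, so pairs of short and long strings become comparable.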

In [None]:
from corpus_distance.distance_measurement.string_similarity import *
from corpus_distance.distance_measurement.hybridisation import HybridisationParameters

In [None]:
# assigning global values
# group of languages and its outgroup
GROUP = "Old East Slavic"
OUTGROUP = "Novgorod"

# whether a hybrid metric aids DistRank
HYBRIDISATION = True
# whether hybrid values join DistRank values in a single array, or both are
# independent values that contribute equally to the final metric
HYBRIDISATION_AS_ARRAY = True

# whether DistRank normalisation includes the Sørensen coefficient
SOERENSEN_NORMALISATION = True

# choose a metric for hybridisation
HYBRID = weighted_jaro_winkler_wrapper

# whether the string similarity measure is corrected by
# the difference in the information that the alphabets carry
ALPHABET_NORMALISATION = True

# metric description
METRICS = f"{GROUP}-{SPLIT}-{TOPIC_NORMALISATION}-DistRank-{SOERENSEN_NORMALISATION}-{HYBRIDISATION}-{HYBRIDISATION_AS_ARRAY}-{HYBRID.__name__}-{ALPHABET_NORMALISATION}"

In [None]:
hybridisation_parameters = HybridisationParameters(HYBRIDISATION, SOERENSEN_NORMALISATION, HYBRIDISATION_AS_ARRAY, HYBRID, ALPHABET_NORMALISATION)

In [None]:
METRICS

In [None]:
from corpus_distance.distance_measurement.metrics_pipeline import score_metrics_for_corpus_dataset

In [None]:
# calculate distances for each pair of lects
overall_results = score_metrics_for_corpus_dataset(df_new, STORAGE_DIR, METRICS, hybridisation_parameters)

# Clusterisation

The final step is to cluster the lects into groups and to decide whether the method works correctly.

In [None]:
from corpus_distance.clusterisation.clusterisation import ClusterisationParameters, clusterise_lects_from_distance_matrix
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor

In [None]:
cluster_params = ClusterisationParameters(lects, OUTGROUP, GROUP, METRICS, DistanceTreeConstructor().upgma, STORAGE_DIR)

In [None]:
clusterise_lects_from_distance_matrix(overall_results, cluster_params)