# Intro

In this notebook, we explore a collection of ancient Akkadian and ancient Egyptian texts using the vector space model approach described by [Karsdorp et al. in the chapter "Exploring Texts using the Vector Space Model"](https://www.humanitiesdataanalysis.org/vector-space-model/notebook.html). By representing the texts as numeric vectors capturing word frequencies, we can quantify the lexical similarities and differences between corpora in each of these two ancient languages. The vector space model allows us to reason about texts spatially and apply geometric concepts like distance metrics to assess how "close" texts are to each other based on shared vocabulary.

We preprocess the texts by tokenizing them into words, constructing a document-term matrix recording word frequencies per text, and analyzing the matrix using tools from the Python scientific computing stack, including NumPy, SciPy and scikit-learn. Through techniques like t-SNE (t-distributed Stochastic Neighbor Embedding) and aggregation by text metadata such as script type, language or genre, we explore patterns in the Akkadian and Egyptian corpora and showcase how the vector space model can yield quantitative insights into ancient textual data. The notebook serves as an example application of the concepts and methods covered in depth by Karsdorp et al. in their chapter.
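
To make the idea of "texts as vectors" concrete, here is a minimal sketch using two invented toy documents (they are not taken from either corpus). Each text becomes a vector of word counts over a shared vocabulary, and the cosine distance measures how differently the two vectors are oriented. The helper names `count_vector` and `cosine_distance` are illustrative only and are not used elsewhere in this notebook.

```python
import math
from collections import Counter

def count_vector(text, vocabulary):
    """Return the word counts of `text`, in the order given by `vocabulary`."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# invented toy documents
doc_a = "king palace king city"
doc_b = "king city temple"

# the shared vocabulary defines the dimensions of the vector space
vocabulary = sorted(set(doc_a.split()) | set(doc_b.split()))

vec_a = count_vector(doc_a, vocabulary)
vec_b = count_vector(doc_b, vocabulary)

print(vocabulary)
print(vec_a, vec_b)
print(round(cosine_distance(vec_a, vec_b), 3))
```

A distance of 0 would mean the two texts use their vocabulary in identical proportions; a distance of 1 would mean they share no words at all.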

This notebook has been prepared by **Avital Romach** and is based on her research. It should be cited accordingly (see citation information at the bottom).

# Preprocessing the corpus

## Imports

In [None]:
import os
import re
import numpy as np
import pandas as pd
import requests
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from scipy.spatial.distance import pdist, squareform
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import plotly.express as px

## Functions

### To upload corpus and metadata from GitHub

#### Functions and import for the **Akkadian** corpus

The Akkadian corpus consists of part of the _[Royal Inscriptions of the Neo-Assyrian Period (RINAP)](https://colab.research.google.com/drive/14hTZCg-9XyiireusajDQqc9k2GAbc82e#scrollTo=qUcbzacX0kJy&line=3&uniqifier=1)_, licensed CC-BY-SA, and was taken from the Open Richly Annotated Cuneiform Corpus (ORACC).

In [None]:
def create_corpus_from_github_api(url):
  # URL on the Github where the csv files are stored
  github_url = url
  response = requests.get(github_url)

  corpus = []
  # Check if the request was successful
  if response.status_code == 200:
    files = response.json()
    for file in files:
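      # directories in the listing have no download_url, so skip empty values
      if file["download_url"] and file["download_url"].endswith(".csv"):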
        corpus.append(pd.read_csv(file["download_url"], encoding="utf-8", index_col="Unnamed: 0").fillna(""))
        # For Egyptian adapt like this:
        #corpus.append(pd.read_csv(file["download_url"], encoding="utf-8").fillna(""))
  else:
    print('Failed to retrieve files:', response.status_code)

  return corpus

def get_metadata_from_raw_github(url):
  metadata = pd.read_csv(url, encoding="utf-8", index_col="Unnamed: 0").fillna("")
  return metadata

In [None]:
# Prepare Akkadian corpus (list of dataframes)

corpus = create_corpus_from_github_api('https://api.github.com/repos/DigitalPasts/ALP-course/contents/course_notebooks/data/rinap01')
corpus.extend(create_corpus_from_github_api('https://api.github.com/repos/DigitalPasts/ALP-course/contents/course_notebooks/data/rinap05'))


In [None]:
# Prepare Akkadian metadata
metadata = get_metadata_from_raw_github("https://raw.githubusercontent.com/DigitalPasts/ALP-course/master/course_notebooks/data/rinap1_5_metadata.csv")


#### Functions and import for the **Egyptian** corpus

The Egyptian corpus is an extract of the database of the _[Thesaurus Linguae Aegyptiae (TLA)](https://thesaurus-linguae-aegyptiae.de)_, containing literary (and, optionally, medical) texts. This export from the database is not published under a free license. Therefore, we access it from a private GitHub repository using an access token.

In [None]:
def create_corpus_from_private_github_api(url, token):
    # URL on GitHub where the csv files are stored
    headers = {
        "Authorization": f"token {token}"
    }
    github_url = url
    response = requests.get(github_url, headers=headers)

    dtype_dict = {"lemma_id": "str"}

    corpus = []
    # Check if the request was successful
    if response.status_code == 200:
        files = response.json() # Github API provides information about the data in the repository, e.g. the download_url
        for file in files:
            # skip entries without a download_url (e.g. directories)
            if file["download_url"] and (file["download_url"].endswith(".csv") or ".csv?token=" in file["download_url"]):
                corpus.append(pd.read_csv(file["download_url"], encoding="utf-8", sep = ',', dtype=dtype_dict).fillna(""))
    else:
        print('Failed to retrieve files:', response.status_code)

    return corpus

from io import StringIO

def get_metadata_from_raw_private_github(url, token):
    headers = {
        "Authorization": f"token {token}"
    }
    github_url = url
    response = requests.get(github_url, headers=headers)

    # Check if the request was successful
    if response.status_code == 200:
        csv_data = StringIO(response.text)
        metadata = pd.read_csv(csv_data, encoding="utf-8", sep = ',', index_col="text_id").fillna("")
        return metadata
    else:
        raise Exception(f"Failed to retrieve metadata: {response.status_code}")

In [None]:
# only if corpus is not yet loaded
# Prepare Egyptian corpus (lists of dataframes)

#if False:

# NB: This token will expire at the end of the year (2025)
tla_access_token = "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"

  ## TLA Literature
corpus = create_corpus_from_private_github_api('https://api.github.com/repos/thesaurus-linguae-aegyptiae/test-rawdata/contents/alp-course-2024/TLA_literature/erzaehlungen', tla_access_token)

corpus.extend(create_corpus_from_private_github_api('https://api.github.com/repos/thesaurus-linguae-aegyptiae/test-rawdata/contents/alp-course-2024/TLA_literature/reden', tla_access_token))

corpus.extend(create_corpus_from_private_github_api('https://api.github.com/repos/thesaurus-linguae-aegyptiae/test-rawdata/contents/alp-course-2024/TLA_literature/lehren', tla_access_token))

  ## TLA Medical
  #corpus = create_corpus_from_private_github_api('https://api.github.com/repos/thesaurus-linguae-aegyptiae/test-rawdata/contents/alp-course-2024/TLA_medical/TLA_pEbers', tla_access_token)

  #corpus.extend(create_corpus_from_private_github_api('https://api.github.com/repos/thesaurus-linguae-aegyptiae/test-rawdata/contents/alp-course-2024/TLA_medical/TLA_pEdwinSmith', tla_access_token))


In [None]:
# Egyptian metadata
metadata = get_metadata_from_raw_private_github("https://raw.githubusercontent.com/thesaurus-linguae-aegyptiae/test-rawdata/master/alp-course-2024/TLA_literature/TLA_metadata.csv", tla_access_token)


In [None]:
## Check if data is loaded
corpus[0].head()

In [None]:
# Prepare text_ids (list of unique ids), and metadata

text_ids = []
for text in corpus:
  text_ids.append(text["text"].iloc[0])


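# report texts that have no entry in the metadata (avoid shadowing the built-in `id`)
for text_id in text_ids:
  if text_id not in metadata.index:
    print(f"Text {text_id} missing from metadata")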

metadata = metadata[metadata.index.isin(text_ids)]

metadata

### To convert dataframe to string

This is necessary because `TfidfVectorizer`, which we will use for the tf-idf calculations, requires a list of strings as input. Each string is an entire text (document).

**Function to split the text dataframes according to a column**. Used to separate a text into lines:
* param df: dataframe containing one word in each row.
* param column: the column by which to split the dataframe, preferably `text` or `line`.
* return: a list of dataframes split according to the value given to the column parameter.



In [None]:
def split_df_by_column_value(df, column):

    dfs = []
    column_values = df[column].unique()
    for value in column_values:
        split_df = df[df[column]==value]
        dfs.append(split_df)
    return dfs

In [None]:
split_df_by_column_value(corpus[0].head(), "line")
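
As a small illustration of what the split produces, the sketch below applies the same logic to an invented toy dataframe (the column names mirror the corpus, the values do not). It also shows that pandas' built-in `groupby` with `sort=False` yields the same pieces, which may be a convenient alternative; this is a suggestion, not part of the original workflow.

```python
import pandas as pd

# invented toy data: one word per row, as in the corpus dataframes
toy = pd.DataFrame({
    "line": [1, 1, 2, 2, 2],
    "lemma_id": ["a", "b", "c", "d", "e"],
})

# same logic as split_df_by_column_value: one sub-dataframe per unique value
parts = [toy[toy["line"] == value] for value in toy["line"].unique()]

# equivalent grouping with groupby (sort=False keeps the order of first appearance)
parts_groupby = [group for _, group in toy.groupby("line", sort=False)]

for part in parts:
    print(part["lemma_id"].tolist())
```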

**Function to convert the values from the text dataframe to a string of text with or without line breaks and word segmentation**.
* param df: the text dataframe
* param column: the chosen column from the dataframe to construct the text from (preferably unicode_word, cf, or lemma)
* param break_perc: dictates whether to include broken words, depending on how broken they are.
                       This value is compared to the `break_perc` column in the dataframe; words above the threshold are replaced with `X`.
                       Defaults to 1 (i.e. all words, whether broken or not, are included); can be any float between 0 and 1.
* param mask: boolean, whether to mask named entities or not; defaults to True.
* param segmentation: boolean, whether to keep the white space between words; defaults to True. If False, all white space is removed.
* return: a tuple of (1) a string with all the words in the text according to the chosen column, with extra spaces from broken words or empty lines removed; (2) the text length including `UNK` and `X` tokens; (3) the text length excluding them.

In [None]:
def df2str(df, column, break_perc=1, mask=True, segmentation=True):

    # check if column exists in dataframe. If not, return empty text.
    if column not in df.columns:
        return ("", 0, 0)
    else:
        # remove rows that include duplicate values for compound words
        if column not in ["norm", "cf", "sense", "pos"]:
            df = df.drop_duplicates("ref").copy()
        # if column entry is empty string, replace with UNK (can happen with normalization or lemmatization)
        mask_empty = df[column]==""
        df[column] = df[column].where(~mask_empty, other="UNK")
        # mask proper nouns
        if mask and "pos" in df.columns:
            mask_bool = df["pos"].isin(["PN", "RN", "DN", "GN", "MN", "SN", "n"])
            df[column] = df[column].where(~mask_bool, other=df["pos"])

        # change number masking from `n` to `NUM`
        # !comment out for Egyptian
        #if mask:
        #    mask_num = df[column]=="n"
        #    df[column] = df[column].where(~mask_num, other="NUM")

        # remove rows without break_perc (happens with non-Akkadian words)
        if "" in df["break_perc"].unique():
            df = df[df["break_perc"]!=""].copy()
        # filter according to break_perc
        mask_break = df["break_perc"] <= break_perc
        df[column] = df[column].where(mask_break, other="X")
        # calculate text length with and without UNK and x tokens
        text_length_full = df.shape[0]
        mask_partial = df[column].isin(["UNK", "X", "x"])
        text_length_partial = text_length_full - sum(mask_partial)
        # create text lines
        text = ""
        df_lines = split_df_by_column_value(df, "line")
        for line in df_lines:
            word_list = list(filter(None, line[column].to_list()))
            if word_list != []:
                text += " ".join(map(str, word_list)).replace("x", "X").strip() + " " #+ "\n"

        if not segmentation:
            # remove all white spaces (word segmentation and line breaks)
            text = re.sub(r"[\s\u00A0]+", "", text)

        return (text, text_length_full, text_length_partial)

In [None]:
df2str(corpus[0], "lemma_id")
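
The masking steps inside `df2str` rely on `Series.where`, which keeps a value where the condition is true and substitutes `other` where it is false. The following sketch isolates that mechanism on an invented toy dataframe; the words and the column name `masked` are made up for illustration.

```python
import pandas as pd

# invented toy data: a divine name (DN) and one empty lemma
toy = pd.DataFrame({
    "lemma": ["sharru", "Ashur", "", "alu"],
    "pos": ["N", "DN", "V", "N"],
})

# empty entries become UNK
toy["lemma"] = toy["lemma"].where(toy["lemma"] != "", other="UNK")

# named entities are replaced by their part-of-speech tag
is_name = toy["pos"].isin(["PN", "RN", "DN", "GN", "MN", "SN", "n"])
toy["masked"] = toy["lemma"].where(~is_name, other=toy["pos"])

text = " ".join(toy["masked"])
print(text)
```

Masking names this way means that two texts mentioning different gods or kings still share the token `DN` or `RN`, so similarity is driven by phrasing rather than by who is named.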

### To convert to specific word levels and create dictionaries

**Function to convert the dataframes into strings of lemmatized texts**.
* param corpus: a list of dataframes
* param break_perc: dictates whether to include broken words, depending on how broken they are.
                       This value is compared to the `break_perc` column in the dataframe.
                       Defaults to 1 (i.e. all words, whether broken or not, are included); can be any float between 0 and 1.
* param mask: boolean, whether to mask named entities or not; defaults to True.
* return: a dictionary where the keys are the text IDs and the values are tuples of the lemmatized text, its full length, and its length excluding `UNK` and `X` tokens

In [None]:
def get_lemmatized_texts(corpus, break_perc=1, mask=True):

    texts_dict = {}
    for df in corpus:
        # get the text number from the dataframe "text" column
        key = df["text"].iloc[0]
        text, text_length_full, text_length_partial = df2str(df, "lemma_id", break_perc, mask)
        texts_dict[key] = (text, text_length_full, text_length_partial)
    return texts_dict

In [None]:
get_lemmatized_texts((split_df_by_column_value(corpus[0], "text")))

**Function to convert the dataframes into strings of normalized texts**.
* param corpus: a list of dataframes
* param break_perc: dictates whether to include broken words, depending on how broken they are.
                       This value is compared to the `break_perc` column in the dataframe.
                       Defaults to 1 (i.e. all words, whether broken or not, are included); can be any float between 0 and 1.
* param mask: boolean, whether to mask named entities or not; defaults to True.
* return: a dictionary where the keys are the text IDs and the values are tuples of the normalized text, its full length, and its length excluding `UNK` and `X` tokens

In [None]:
def get_normalized_texts(corpus, break_perc=1, mask=True):

    texts_dict = {}
    for df in corpus:
        # get the text number from the dataframe "text" column
        key = df["text"].iloc[0]
        text, text_length_full, text_length_partial = df2str(df, "norm", break_perc, mask)
        texts_dict[key] = (text, text_length_full, text_length_partial)
    return texts_dict

In [None]:
get_normalized_texts((split_df_by_column_value(corpus[0], "text")))

**Function to convert the dataframes into strings of segmented unicode texts**.
* param corpus: a list of dataframes
* param break_perc: dictates whether to include broken words, depending on how broken they are.
                       This value is compared to the `break_perc` column in the dataframe.
                       Defaults to 1 (i.e. all words, whether broken or not, are included); can be any float between 0 and 1.
* param mask: boolean, whether to mask named entities or not; defaults to True.
* return: a dictionary where the keys are the text IDs and the values are tuples of the segmented unicode text, its full length, and its length excluding `UNK` and `X` tokens

In [None]:
def get_segmented_unicode_texts(corpus, break_perc=1, mask=True):

    texts_dict = {}
    for df in corpus:
        # get the text number from the dataframe "text" column
        key = df["text"].iloc[0]
        text, text_length_full, text_length_partial = df2str(df, "unicode_word", break_perc, mask)
        texts_dict[key] = (text, text_length_full, text_length_partial)
    return texts_dict

In [None]:
get_segmented_unicode_texts((split_df_by_column_value(corpus[0], "text")))

### To create the vector space model

#### Vectorizing texts with TfidfVectorizer

🔧 What Does TfidfVectorizer Do?

TfidfVectorizer is a class that:

   * Reads text data
   * Cleans and tokenizes it
   * Builds a vocabulary
   * Calculates TF-IDF values
   * Returns a matrix (it looks similar to a pandas dataframe but isn't one) where each row is a document and each column is a term

**Converts a list of texts into a document-term matrix based on TF-IDF scores**.

For full documentation of the parameters of sklearn's TfidfVectorizer, see: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer
* param corpus: a dataframe in which the texts are in a `"text"` column and the dataframe's index is the text ids.
* param analyzer: whether the feature should be made of word or character n-grams.
                     use `"word"` for word features, `"char_wb"` for character n-grams within word boundaries,
                     or `"char"` for character n-grams without word boundaries.
* param ngram_range: the lower and upper boundary of the range of n-values for different n-grams to be extracted.
* param max_df: threshold to ignore terms that have a document frequency above a certain value.
                   If the threshold is a float, it represents a proportion of the documents.
                   If the threshold is an integer, it represents the absolute number of documents in which the term appears.
* param min_df: threshold to ignore terms that have a document frequency below a certain value.
                   If the threshold is a float, it represents a proportion of the documents.
                   If the threshold is an integer, it represents the absolute number of documents in which the term appears.
* param max_features: if not `None`, build a vocabulary that only considers the top `max_features` terms ordered by term frequency across the corpus.
* param stop_words: if `None`, no stop words are used. Otherwise, a list of words to be removed from the resulting tokens.
* return: `counts` the TF-IDF scores computed by the vectorizer (a NumPy array),
             `counts_df` a dataframe of the same scores where the index is the text ids and the columns are the tokens,
             `stop_words` the list of stop words that was used

![](https://www.humanitiesdataanalysis.org/_images/bow.png)



**Figure 1**. Example of a document-term matrix extracted from a corpus; see Fig. 3 in Karsdorp, F., Kestemont, M., & Riddell, A. (2021). Humanities Data Analysis: Case Studies with Python. Princeton University Press.
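
The effect of the `min_df` and `max_df` thresholds is easiest to see on a tiny example. The sketch below fits `TfidfVectorizer` on three invented single-letter "documents" and compares the resulting vocabularies; the helper `vocabulary_for` is hypothetical and exists only for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# invented toy documents: "a" occurs in all three, "b" in two, the rest in one
docs = ["a b c", "a b d", "a e f"]

def vocabulary_for(**kwargs):
    """Fit a vectorizer on the toy documents and return its sorted vocabulary."""
    # the token pattern accepts single-character tokens, unlike sklearn's default
    vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", lowercase=False, **kwargs)
    vectorizer.fit(docs)
    return sorted(vectorizer.vocabulary_)

all_terms = vocabulary_for()
no_rare = vocabulary_for(min_df=2)      # integer: drop terms found in fewer than 2 documents
no_common = vocabulary_for(max_df=0.7)  # float: drop terms found in more than 70% of documents

print(all_terms)
print(no_rare)
print(no_common)
```

Raising `min_df` removes words that are too rare to help compare texts, while lowering `max_df` removes words so common that they appear nearly everywhere.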

In [None]:
def vectorize(corpus, analyzer="word", ngram_range=(1,1), max_df=1.0, min_df=1, max_features=None, stop_words=["UNK", "X"]):

    vectorizer = TfidfVectorizer(input="content", lowercase=False, analyzer=analyzer,
                                 # RegEx for Akkadian
                                 #token_pattern=r"(?u)\b\w+\b", ngram_range=ngram_range,
                                 # RegEx for Egyptian
                                 token_pattern=r"(?u)\b[\w\.]+\b", ngram_range=ngram_range,
                                 max_df=max_df, min_df=min_df, max_features=max_features, stop_words=stop_words)

    counts = vectorizer.fit_transform(corpus["text"].tolist()).toarray()
    #stop_words = vectorizer.stop_words_ # use when stop_words are not defined in the parameters

    # saving the vocab used for vectorization, and switching the dictionary so that the feature index is the key
    vocab = vectorizer.vocabulary_
    switched_vocab = {value: key for key, value in vocab.items()}
    # adding the vocab words to the counts dataframe for easier viewing (ordered by feature index)
    column_names = [switched_vocab[i] for i in range(len(switched_vocab))]

    counts_df = pd.DataFrame(counts, index=corpus.index, columns=column_names)

    return (counts, counts_df, stop_words)
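
The two `token_pattern` variants in `vectorize` differ only in whether a dot may occur inside a token. The sketch below applies both regular expressions to an invented sample string with Python's `re` module; the sample token containing a dot is hypothetical, and real identifiers in either corpus may look different.

```python
import re

# the two patterns from the vectorize function
akkadian_pattern = re.compile(r"(?u)\b\w+\b")
egyptian_pattern = re.compile(r"(?u)\b[\w\.]+\b")

# invented sample string; "d.123" stands in for any token that contains a dot
sample = "10030 d.123 UNK"

akkadian_tokens = akkadian_pattern.findall(sample)
egyptian_tokens = egyptian_pattern.findall(sample)

print(akkadian_tokens)  # the dot splits the token in two
print(egyptian_tokens)  # the dot is kept inside the token
```

If identifiers with dots were split, their parts would be counted as separate vocabulary entries, which would distort the resulting matrix.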

#### Calculating distances between vectorized documents

**Converts a document-term matrix to a text distance matrix**.
* param counts: the matrix returned by the `vectorize` function.
* param metric: the metric by which to calculate the distances between the texts in the corpus. For an overview of the different types of metrics, see "Computing distances between documents" in [Karsdorp, Kestemont, & Riddell 2021](https://www.humanitiesdataanalysis.org/vector-space-model/notebook.html#computing-distances-between-documents).
                   Valid metrics are:
                   ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’,
                   ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulczynski1’,
                   ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’,
                   ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’.
* param text_ids: list of unique text_ids.
* return: a dataframe matrix of distance between texts.

In [None]:
def distance_calculator(counts, metric, text_ids):

    return pd.DataFrame(squareform(pdist(counts, metric=metric)), index=text_ids, columns=text_ids)
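
`distance_calculator` chains two SciPy calls: `pdist` returns the pairwise distances in a condensed one-dimensional form, and `squareform` expands them into a symmetric matrix. A minimal sketch on three invented two-dimensional vectors shows both shapes:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# invented toy vectors: the first two are orthogonal, the third lies between them
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

condensed = pdist(vectors, metric="cosine")  # one value per pair: (0,1), (0,2), (1,2)
square = squareform(condensed)               # symmetric 3 x 3 matrix with a zero diagonal

print(condensed.shape, square.shape)
print(square.round(3))
```

The square form is larger but convenient, because each row can be labeled with a text id and read as "distances from this text to every other text".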

#### Reducing dimensions with PCA or t-SNE

**Reduces multidimensional data into two dimensions using PCA**.
* param df: dataframe holding the dimensions to reduce. All columns should include numerical values only.
               The dataframe's index should hold the unique text ids.
* param metadata: the rest of the metadata in the corpus, to help visualize the resulting clusters in meaningful ways.
                     The metadata's index should hold the unique text ids.
* return: a dataframe with the coordinates of the two remaining dimensions, joined to all columns from the metadata.

In [None]:
def reduce_dimensions_pca(df, metadata):

    pca = PCA(n_components=2)
    reduced_data = pca.fit_transform(df)
    reduced_df = pd.DataFrame(data=reduced_data, index=df.index, columns=["component 1", "component 2"])
    reduced_df_metadata = metadata.join(reduced_df)
    return reduced_df_metadata
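
The sketch below runs the same PCA-and-join steps on random toy data so the shape of the result is visible without loading a corpus. The text ids, feature names and the `genre` column are invented, and the explained variance ratio is printed as one possible way to judge how much information the two components retain.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# invented toy data: 6 "texts" described by 4 numeric features
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.random((6, 4)),
    index=[f"text_{i}" for i in range(6)],
    columns=["f1", "f2", "f3", "f4"],
)
toy_metadata = pd.DataFrame({"genre": ["a", "a", "a", "b", "b", "b"]}, index=toy.index)

# reduce to two components and attach them to the metadata via the shared index
pca = PCA(n_components=2)
coords = pd.DataFrame(pca.fit_transform(toy), index=toy.index, columns=["component 1", "component 2"])
result = toy_metadata.join(coords)

print(result)
print(pca.explained_variance_ratio_)
```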

**Reduces multidimensional data into two dimensions using t-SNE**.

See full documentation of sklearn's TSNE on: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
* param df: dataframe holding the dimensions to reduce. All columns should include numerical values only.
               The dataframe's index should hold the unique text ids.
* param perplexity: perplexity is a measure that weighs the importance of nearby versus distant points when creating a lower-dimension mapping.
                       t-SNE first converts the distances between points into conditional probabilities that represent similarities,
                       using Gaussian probability distributions.
                       The perplexity parameter influences the variance used to compute these probabilities.
                       A higher perplexity leads to a broader Gaussian that considers a larger number of neighbors when assessing similarity.
                       Lower perplexity puts more focus on the local structure and considers fewer neighbors.
                       A good perplexity depends greatly on dataset size and density.
                       The documentation recommends a value between 5 and 50.
                       We recommend starting with the square root of the number of texts in the corpus.
* param n_iter: maximum number of iterations for the optimization (called `max_iter` in recent scikit-learn releases, where `n_iter` is deprecated).
* param metric: the metric to be used when calculating distances between vectors.
                   Valid metrics are:
                   ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’,
                   ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulczynski1’,
                   ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’,
                   ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’.
* param metadata: the rest of the metadata in the corpus, to help visualize the resulting clusters in meaningful ways.
                     The metadata's index should hold the unique text ids.
* return: a dataframe with the coordinates of the two remaining dimensions, joined to all columns from the metadata.

In [None]:
def reduce_dimensions_tsne(df, perplexity, n_iter, metric, metadata):

    tsne = TSNE(n_components=2, perplexity=perplexity, n_iter=n_iter, metric=metric, init="pca")
    reduced_data = tsne.fit_transform(df)
    reduced_df = pd.DataFrame(data=reduced_data, index=df.index, columns=["component 1", "component 2"])
    reduced_df_metadata = metadata.join(reduced_df)
    return reduced_df_metadata
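
As a rough illustration of the square-root rule of thumb for perplexity, the sketch below embeds 16 random toy points, so the perplexity is 4. The data are invented, `random_state` is fixed only to make the run repeatable, and the iteration count is left at the library default.

```python
import numpy as np
from sklearn.manifold import TSNE

# invented toy data: 16 points in 5 dimensions
rng = np.random.default_rng(0)
data = rng.random((16, 5))

# square root of the number of samples; perplexity must stay below the sample count
perplexity = data.shape[0] ** 0.5

tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
embedded = tsne.fit_transform(data)

print(perplexity)
print(embedded.shape)
```

Because t-SNE is stochastic, runs without a fixed `random_state` can place the same clusters in different positions; only the relative grouping of points is meaningful.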

## Process texts from dataframes and combine results with metadata dataframe

In [None]:
# Function to combine processed texts with metadata

def get_corpus_metadata(texts_dict, metadata):
  texts_df = pd.DataFrame(texts_dict, index=["text", "full_length", "partial_length"]).transpose()
  df = metadata.join(texts_df)
  return df
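
`get_corpus_metadata` turns the dictionary of tuples into a dataframe by treating each tuple as a column and then transposing, so that every text becomes a row. A minimal sketch with two invented texts and an invented `genre` column shows the intermediate and final shapes:

```python
import pandas as pd

# invented toy output in the same shape as get_lemmatized_texts returns
texts_dict = {"T1": ("a b c", 3, 3), "T2": ("d UNK", 2, 1)}
toy_metadata = pd.DataFrame({"genre": ["letter", "hymn"]}, index=["T1", "T2"])

# each tuple becomes a column; transposing makes each text a row
texts_df = pd.DataFrame(texts_dict, index=["text", "full_length", "partial_length"]).transpose()

# join on the shared index (the text ids)
combined = toy_metadata.join(texts_df)
print(combined)
```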

In [None]:
## vectorize lemma forms
corpus_dict = get_lemmatized_texts(corpus, break_perc=0)
## vectorize normalized forms
#corpus_dict = get_normalized_texts(corpus, break_perc=0)
## vectorize Unicode cuneiform
#corpus_dict = get_segmented_unicode_texts(corpus, break_perc=0)

corpus_metadata = get_corpus_metadata(corpus_dict, metadata)

## For Akkadian
## remove texts which have fewer than n words, excluding UNK and X
#n = 10
#print(f"Number of texts before filtering: {corpus_metadata.shape[0]}")
#corpus_metadata = corpus_metadata[corpus_metadata["partial_length"]>=n]
#print(f"Number of texts after filtering: {corpus_metadata.shape[0]}")


# For Egyptian use this instead; it also sets the index to the text_name column
n = 150
print(f"Number of texts before filtering: {corpus_metadata.shape[0]}")
corpus_metadata = corpus_metadata[corpus_metadata["partial_length"]>=n].set_index("text_name")
print(f"Number of texts after filtering: {corpus_metadata.shape[0]}")

In [None]:
corpus_metadata

# Exploring the Akkadian RINAP or Egyptian TLA Corpus using the Vector Space Model

In [None]:
# vectorize corpus
counts, counts_df, stop_words = vectorize(corpus_metadata, max_features=50)

In [None]:
counts_df.head(3)

In [None]:
# calculate distance between vectorized texts
matrix = distance_calculator(counts, "cosine", corpus_metadata.index)
matrix

In [None]:
# visualize matrix
fig = px.imshow(matrix)

# adjust size of the matrix
fig.update_layout(
    autosize=False,
    width=1500,
    height=1500,
)
fig.show()

In [None]:
# reduce matrix dimensions
reduced_tsne = reduce_dimensions_tsne(matrix, perplexity=matrix.shape[0]**0.5, n_iter=5000, metric="euclidean", metadata=corpus_metadata)

In [None]:
# visualize reduced dimensions

# adjust size column for visualization
size_min = 3
size_max = 70
size = (reduced_tsne["partial_length"] / reduced_tsne["partial_length"].max() * (size_max - size_min) + size_min).tolist()

# create figure
# for Akkadian use symbol = "script", color="project",
# for Egyptian
fig = px.scatter(reduced_tsne, x="component 1", y="component 2", size=size, symbol = "corpus_manual", color="language_manual", hover_data=["partial_length", "full_length", reduced_tsne.index])
fig.update_traces(marker=dict(line=dict(width=1, color='black')))
fig.show()
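
The marker sizes in the scatter plot are a linear rescaling of text length: each length is divided by the longest text, stretched to the range between `size_min` and `size_max`, and shifted by `size_min`. The sketch below reproduces that arithmetic on three invented lengths; the function name `scale_sizes` is illustrative only.

```python
def scale_sizes(lengths, size_min=3, size_max=70):
    """Rescale lengths so that the longest text gets size_max."""
    longest = max(lengths)
    return [length / longest * (size_max - size_min) + size_min for length in lengths]

# invented text lengths
sizes = scale_sizes([150, 300, 600])
print(sizes)  # [19.75, 36.5, 70.0]
```

Note that with this formula a text only approaches `size_min` as its length approaches zero, so the shortest text in the corpus is usually drawn somewhat larger than `size_min`.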

# Find Shared Tokens

**Creates a mini dataframe that includes only the chosen texts and the tokens shared by those texts**
  (i.e., all tokens that are non-zero in all chosen texts).
* param df: the counts_df where the index is the text ids and the columns are the tokens.
* param text_ids: a list containing text ids.
* return: a dataframe where the index holds the shared tokens and the columns are the texts.
           The values are the tf-idf scores.

In [None]:
def find_shared_tokens(df, text_ids):

  mini_df = df[df.index.isin(text_ids)].copy()
  mini_df = mini_df.loc[:, (mini_df != 0).all(axis=0)].copy()
  return mini_df.transpose()
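
To see the filter at work, the sketch below applies the same two steps to an invented score table with three texts and three tokens. Only the token that is non-zero in both selected texts survives.

```python
import pandas as pd

# invented tf-idf-like scores: rows are texts, columns are tokens
toy_counts = pd.DataFrame(
    {"king": [0.5, 0.2, 0.1], "city": [0.0, 0.3, 0.4], "god": [0.7, 0.6, 0.0]},
    index=["T1", "T2", "T3"],
)

# step 1: keep only the chosen texts
selected = toy_counts[toy_counts.index.isin(["T1", "T2"])]

# step 2: keep only the columns that are non-zero in every chosen text, then transpose
shared = selected.loc[:, (selected != 0).all(axis=0)].transpose()
print(shared)
```

Here `city` is dropped because it is zero in `T1`, while `god` is kept even though it is zero in `T3`, since `T3` was not among the chosen texts.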

In [None]:
# Akkadian
#shared_tokens = find_shared_tokens(counts_df, ["Q003450", "Q003711", "Q003790"])
# Egyptian
shared_tokens = find_shared_tokens(counts_df, ["pBerlin P 3023 + pAmherst I (Bauer, B1) || Der beredte Bauer (Version B1)", "pBerlin P 3025 + pAmherst II (Bauer, B2) || Der beredte Bauer (Version B2)", "pPetersburg 1116 A || Verso: Die Lehre für Merikare", "pBM EA 10509 (Ptahhotep, Version L2+L2G) || Die Lehre des Ptahhotep (Version L2+L2G)"])#, "3RU7Z4VQ45CYFIQ4PUGQ3HDJFU"])

shared_tokens

In [None]:
px.scatter(shared_tokens)

*This notebook was created by [Avital Romach](https://github.com/ARomach), with additional code and text by [Eliese-Sophia Lincke](https://www.geschkult.fu-berlin.de/e/aegyptologie/personen/Professorinnen-und-Professoren/Lincke/index.html), [Shai Gordin](https://digitalpasts.github.io/) and [Daniel A. Werning](https://www.bbaw.de/die-akademie/mitarbeiterinnen-mitarbeiter/werning-daniel) in Spring 2024 for the course [Ancient Language Processing](https://digitalpasts.github.io/ALP-course/). Code can be reused under a [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.*