# Intro

In this notebook, we explore a collection of ancient Akkadian and ancient Egyptian texts using the vector space model approach described by [Karsdorp et al. in the chapter "Exploring Texts using the Vector Space Model"](https://www.humanitiesdataanalysis.org/vector-space-model/notebook.html). By representing the texts as numeric vectors capturing word frequencies, we can quantify the lexical similarities and differences between corpora in each of these two ancient languages. The vector space model allows us to reason about texts spatially and apply geometric concepts like distance metrics to assess how "close" texts are to each other based on shared vocabulary.
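As a minimal sketch of this idea, assuming two made-up toy "texts" that are not drawn from the corpus, the snippet below turns each text into a word-frequency vector over a shared vocabulary and measures the cosine distance between the two vectors. The helper names `to_vector` and `cosine_distance` are hypothetical and are not used elsewhere in this notebook.

```python
import math
from collections import Counter

# Two toy "texts" (illustrative only, not taken from the corpus)
text_a = "m pr m pr nb"
text_b = "m pr nb nb"

# Shared vocabulary in a fixed (alphabetical) order
vocabulary = sorted(set(text_a.split()) | set(text_b.split()))

def to_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

def cosine_distance(u, v):
    """Return 1 minus the cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

vec_a = to_vector(text_a, vocabulary)
vec_b = to_vector(text_b, vocabulary)
distance = cosine_distance(vec_a, vec_b)
print(vocabulary, vec_a, vec_b, round(distance, 4))
```

A distance of 0 means the two vectors point in the same direction (identical relative word frequencies); larger values mean less shared vocabulary.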

We preprocess the texts by tokenizing them into words, constructing a document-term matrix recording word frequencies per text, and analyzing the matrix using tools from the Python scientific computing stack, including NumPy, SciPy and scikit-learn. Through techniques like t-SNE (t-Distributed Stochastic Neighbor Embedding) and aggregation by text metadata like script type, language or genre, we explore patterns in the Akkadian and Egyptian corpora and showcase how the vector space model can yield quantitative insights into ancient textual data. The notebook serves as an example application of the concepts and methods covered in depth by Karsdorp et al. in their chapter.

This notebook has been prepared by Avital Romach and is based on her research. It should be cited accordingly (see citation information at the bottom).

# Preprocessing the corpus

## Imports

In [44]:
import os
import re
import numpy as np
import pandas as pd
import requests
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from scipy.spatial.distance import pdist, squareform
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import plotly.express as px

## Functions

### To upload corpus and metadata from GitHub

In [45]:
def create_corpus_from_github_api(url):
  # URL on the Github where the csv files are stored
  github_url = url
  response = requests.get(github_url)

  corpus = []
  # Check if the request was successful
  if response.status_code == 200:
    files = response.json()
    for file in files:
      if file["download_url"][-3:] == "csv":
        #corpus.append(pd.read_csv(file["download_url"], encoding="utf-8", index_col="Unnamed: 0").fillna(""))
        # For Egyptian adapt like this:
        corpus.append(pd.read_csv(file["download_url"], encoding="utf-8").fillna(""))
  else:
    print('Failed to retrieve files:', response.status_code)

  return corpus

def get_metadata_from_raw_github(url):
  metadata = pd.read_csv(url, encoding="utf-8", index_col="Unnamed: 0").fillna("")
  return metadata

In [46]:
# Prepare Akkadian corpus (list of dataframes)

#corpus = create_corpus_from_github_api('https://api.github.com/repos/DigitalPasts/ALP-course/contents/course_notebooks/data/rinap01')
#corpus.extend(create_corpus_from_github_api('https://api.github.com/repos/DigitalPasts/ALP-course/contents/course_notebooks/data/rinap05'))

# Prepare Egyptian corpus (list of dataframes)
corpus = create_corpus_from_github_api('https://api.github.com/repos/DigitalPasts/ALP-course/contents/course_notebooks/data/TLA_pEbers')
corpus.extend(create_corpus_from_github_api('https://api.github.com/repos/DigitalPasts/ALP-course/contents/course_notebooks/data/TLA_pEdwinSmith'))


In [47]:
corpus[0].head()

Unnamed: 0,text,text_name,line,word,ref,frag,norm,unicode_original,unicode_word,unicode,lemma_id,cf,pos,mask,sense,inst,reading,break_perc,unicode_splitted,break
0,5FTT5MPBHZC3HK4LCBHO5ILEQY,"Medizinische Texte || pEbers || 1,1-2,6 = Eb 1...",1,3,IBUBd7b7rdeJUUT0nty1mN1QWUM,ḥꜣ.t-ꜥ,ḥꜣ.t-ꜥ,𓄂𓐰𓂝,𓄂𓂝,"['𓄂', '𓂝']",853378,ḥꜣ.t-ꜥ,NOUN,,"Anfang (von etwas, mit versch. Präpositionen)",,,0.0,"['𓄂', '𓂝']","['complete', 'complete']"
1,5FTT5MPBHZC3HK4LCBHO5ILEQY,"Medizinische Texte || pEbers || 1,1-2,6 = Eb 1...",1,4,IBkAd3TWR04hpEXUm7OiyMRF04s,m,m,𓅓,𓅓,['𓅓'],400082,m,PREP,,in (der Art); von (partitiv); als (Eigenschaft...,,,0.0,['𓅓'],['complete']
2,5FTT5MPBHZC3HK4LCBHO5ILEQY,"Medizinische Texte || pEbers || 1,1-2,6 = Eb 1...",1,5,IBUBd03k21v2k0dNtDvMsxLbqT0,rʾ,rʾ,𓂋𓏤,𓂋𓏤,"['𓂋', '𓏤']",92580,rʾ,NOUN,,Spruch; das Sagen,,,0.0,"['𓂋', '𓏤']","['complete', 'complete']"
3,5FTT5MPBHZC3HK4LCBHO5ILEQY,"Medizinische Texte || pEbers || 1,1-2,6 = Eb 1...",1,6,IBUBdyzQkQqkHUDEtJYoxN3GjHE,n,n,𓈖,𓈖,['𓈖'],79800,n.ꞽ,ADJ,,von [Genitiv],,,0.0,['𓈖'],['complete']
4,5FTT5MPBHZC3HK4LCBHO5ILEQY,"Medizinische Texte || pEbers || 1,1-2,6 = Eb 1...",1,7,IBUBdx1WBVRlPERaoRaCiwnroDo,wꜣḥ,wꜣḥ,𓎝𓎛𓑕𓏛,𓎝𓎛𓑕𓏛,"['𓎝', '𓎛', '\U00013455', '𓏛']",43010,wꜣḥ,VERB,,legen; dauern; opfern; zurücklassen,,,0.166667,"['𓎝', '𓎛', '𓏛']","['complete', 'damaged', 'complete']"


In [48]:
# Prepare text_ids (list of unique ids), and metadata

text_ids = []
for text in corpus:
  text_ids.append(text["text"].iloc[0])

# Akkadian metadata (uncomment when working with the Akkadian corpus)
#metadata = get_metadata_from_raw_github("https://raw.githubusercontent.com/DigitalPasts/ALP-course/master/course_notebooks/data/rinap1_5_metadata.csv")
# Egyptian metadata
metadata = get_metadata_from_raw_github("https://raw.githubusercontent.com/DigitalPasts/ALP-course/master/course_notebooks/data/TLA_metadata.csv")

# report any text that has no entry in the metadata
for text_id in text_ids:
  if text_id not in metadata.index:
    print(f"Text {text_id} missing from metadata")

metadata = metadata[metadata.index.isin(text_ids)]

### To convert dataframe to string

**Function to split the text dataframes according to a column**. Used to separate text to lines:
* param df: dataframe containing one word in each row.
* param column: the column by which to split the dfs, preferably `text` or `line`.
* return: a list of dataframes split according to the value given to the column parameter.



In [49]:
def split_df_by_column_value(df, column):

    dfs = []
    column_values = df[column].unique()
    for value in column_values:
        split_df = df[df[column]==value]
        dfs.append(split_df)
    return dfs
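For comparison, here is a hedged sketch of how the same kind of split could be written with pandas' `groupby`. The dataframe `toy_df` and its values are invented for illustration; `sort=False` is used so that the groups come out in order of first appearance, as they do with `unique()` above.

```python
import pandas as pd

# Toy word-level dataframe (illustrative values only)
toy_df = pd.DataFrame({
    "line": [1, 1, 2, 2, 2],
    "cf": ["m", "pr", "nb", "n", "z"],
})

# One dataframe per distinct value in the "line" column,
# in order of first appearance
line_dfs = [group for _, group in toy_df.groupby("line", sort=False)]

for line_df in line_dfs:
    print(line_df["cf"].tolist())
```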

In [50]:
split_df_by_column_value(corpus[0].head(), "line")

[                         text  \
 0  5FTT5MPBHZC3HK4LCBHO5ILEQY   
 1  5FTT5MPBHZC3HK4LCBHO5ILEQY   
 2  5FTT5MPBHZC3HK4LCBHO5ILEQY   
 3  5FTT5MPBHZC3HK4LCBHO5ILEQY   
 4  5FTT5MPBHZC3HK4LCBHO5ILEQY   
 
                                            text_name  line  word  \
 0  Medizinische Texte || pEbers || 1,1-2,6 = Eb 1...     1     3   
 1  Medizinische Texte || pEbers || 1,1-2,6 = Eb 1...     1     4   
 2  Medizinische Texte || pEbers || 1,1-2,6 = Eb 1...     1     5   
 3  Medizinische Texte || pEbers || 1,1-2,6 = Eb 1...     1     6   
 4  Medizinische Texte || pEbers || 1,1-2,6 = Eb 1...     1     7   
 
                            ref    frag    norm unicode_original unicode_word  \
 0  IBUBd7b7rdeJUUT0nty1mN1QWUM  ḥꜣ.t-ꜥ  ḥꜣ.t-ꜥ              𓄂𓐰𓂝           𓄂𓂝   
 1  IBkAd3TWR04hpEXUm7OiyMRF04s       m       m                𓅓            𓅓   
 2  IBUBd03k21v2k0dNtDvMsxLbqT0      rʾ      rʾ               𓂋𓏤           𓂋𓏤   
 3  IBUBdyzQkQqkHUDEtJYoxN3GjHE       n       n       

**Function to convert the values from the text dataframe to a string of text with or without line breaks and word segmentation**.
* param df: the text dataframe
* param column: the chosen column from the dataframe to construct the text from (preferably unicode_word, cf, or lemma)
* param break_perc: a threshold that determines whether broken words are included, depending on how broken they are.
                       It is compared against the `break_perc` column in the dataframe; words above the threshold are replaced with `X`.
                       Defaults to 1 (i.e. all words, whether broken or not, are included); can be any float between 0 and 1.
* param mask: boolean, whether to mask named entities or not; defaults to True.
* param segmentation: boolean, whether to keep word segmentation and line breaks; defaults to True. If False, all whitespace is removed.
* return: a tuple of (1) a string which includes all the words in the text according to the column chosen, with extra spaces between broken words or empty lines removed, (2) the full text length in words, and (3) the text length excluding `UNK` and `X` tokens.

In [51]:
def df2str(df, column, break_perc=1, mask=True, segmentation=True):

    # check if column exists in dataframe. If not, return empty text.
    if column not in df.columns:
        return ("", 0, 0)
    else:
        # remove rows that include duplicate values for compound words
        if column not in ["norm", "cf", "sense", "pos"]:
            df = df.drop_duplicates("ref").copy()
        # if column entry is empty string, replace with UNK (can happen with normalization or lemmatization)
        mask_empty = df[column]==""
        df[column] = df[column].where(~mask_empty, other="UNK")
        # mask proper nouns
        if mask and "pos" in df.columns:
            mask_bool = df["pos"].isin(["PN", "RN", "DN", "GN", "MN", "SN", "n"])
            df[column] = df[column].where(~mask_bool, other=df["pos"])
        # change number masking from `n` to `NUM`
        # !comment out for Egyptian
        #if mask:
        #    mask_num = df[column]=="n"
        #    df[column] = df[column].where(~mask_num, other="NUM")
        # remove rows without break_perc (happens with non-Akkadian words)
        if "" in df["break_perc"].unique():
            df = df[df["break_perc"]!=""].copy()
        # filter according to break_perc
        mask_break = df["break_perc"] <= break_perc
        df[column] = df[column].where(mask_break, other="X")
        # calculate text length with and without UNK and x tokens
        text_length_full = df.shape[0]
        mask_partial = df[column].isin(["UNK", "X", "x"])
        text_length_partial = text_length_full - sum(mask_partial)
        # create text lines
        text = ""
        df_lines = split_df_by_column_value(df, "line")
        for line in df_lines:
            word_list = list(filter(None, line[column].to_list()))
            if word_list != []:
                text += " ".join(map(str, word_list)).replace("x", "X").strip() + "\n"

        if segmentation == False:
            # remove all white spaces (word segmentation and line breaks)
            text = re.sub(r"[\s\u00A0]+", "", text)

        return (text, text_length_full, text_length_partial)
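To make the three masking steps in `df2str` easier to follow, the sketch below applies the same `Series.where` pattern to a small invented dataframe. The rows, the `DN` tag on the third word and the threshold of 0 are assumptions chosen for illustration only.

```python
import pandas as pd

# Toy word-level dataframe (illustrative values only)
toy_df = pd.DataFrame({
    "cf": ["m", "", "Ra", "pr"],
    "pos": ["PREP", "NOUN", "DN", "NOUN"],
    "break_perc": [0.0, 0.0, 0.0, 0.5],
})

tokens = toy_df["cf"]

# 1. empty entries become "UNK"
tokens = tokens.where(tokens != "", other="UNK")

# 2. words tagged as proper nouns are replaced by their tag
is_name = toy_df["pos"].isin(["PN", "RN", "DN", "GN", "MN", "SN"])
tokens = tokens.where(~is_name, other=toy_df["pos"])

# 3. words more broken than the threshold become "X"
break_perc = 0
tokens = tokens.where(toy_df["break_perc"] <= break_perc, other="X")

masked = tokens.tolist()
print(masked)
```

`Series.where` keeps a value where the condition is true and substitutes `other` where it is false, which is why each condition is phrased as "what to keep".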

In [52]:
df2str(corpus[0], "cf")

('ḥꜣ.t-ꜥ m rʾ n.ꞽ wꜣḥ pẖr.t ḥr ꜥ.t nb n.ꞽ z\npri̯ ⸗ꞽ m Ꞽwn.w ḥnꜥ Wr.w-n.w-Ḥw.t-ꜥꜣ.t nb-mkw.t ḥqꜣ-nḥḥ\nnḥm.n pri̯ ⸗ꞽ m Zꜣw ḥnꜥ mw.t-nṯr\nrḏi̯ ⸗sn n ⸗ꞽ mkw.t ⸗sn\nꞽw ṯꜣz n ⸗ꞽ ꞽri̯ nb-r-ḏr r dr s.t-ꜥ nṯr nṯr.t mwt mwt.t ḥmw.t-rʾ n.tꞽ m tp ⸗ꞽ pn m nḥb.t ⸗ꞽ ꞽptn tn m qꜥḥ ⸗ꞽ ꞽpn m ꞽwf ⸗ꞽ pn m ꜥ.t ⸗ꞽ ꞽptn r szwnu̯ srḫ.y ḥr.ꞽ sꜥq ẖnn.w m ꞽwf ⸗ꞽ pn bꞽbꞽ m ꜥ.t ⸗ꞽ ꞽptn m ꜥq m ꞽwf ⸗ꞽ pn m tp ⸗ꞽ pn m qꜥḥ ⸗ꞽ ꞽpn m ḥꜥ.w ⸗ꞽ pn m ꜥ.t ⸗ꞽ ꞽptn\nn.ꞽ wꞽ Rꜥw ḏd ⸗f\nꞽnk nḏ ⸗ꞽ sw m-ꜥ ḫft.ꞽ ⸗f\nsšm.w ⸗f pw Ḏḥw.tꞽ\nꞽw ⸗f rḏi̯ ⸗f mdwi̯ drf\nꞽri̯ ⸗f dmḏ.t\nrḏi̯ ⸗f ꜣḫ.w n rḫ-ꞽḫ.t n zwn.w ꞽm.ꞽ-ḫt ⸗f r wḥꜥ mri̯ nṯr sꜥnḫ ⸗f sw\nꞽnk pw mri̯ nṯr sꜥnḫ ⸗f wꞽ\nḏd-mdw ḫft wꜣḥ pẖr.t ḥr ꜥ.t nb n.ꞽ z n.tꞽ mr\nsšr-mꜣꜥ ḥḥ n.ꞽ zp\nky rʾ n.ꞽ wḥꜥ wt nb\nwḥꜥ zp ꞽn Ꜣs.t\nwḥꜥ Ḥr.w ꞽn Ꜣs.t m ḏw.t ꞽri̯ r ⸗f ꞽn sn ⸗f Stẖ m smꜣ ⸗f ꞽtꞽ ⸗f Wsꞽr\nꞽ Ꜣs.t wr.t-ḥkꜣ.w wḥꜥ ⸗ṯ wꞽ\nsfḫ ⸗ṯ wꞽ m-ꜥ ꞽḫ.t nb bꞽn ḏw dšr m-ꜥ s.t-ꜥ nṯr s.t-ꜥ nṯr.t m-ꜥ mwt mwt.t m-ꜥ ḏꜣ.yw ḏꜣ.yt ḏꜣi̯ ⸗fꞽ sw m ⸗ꞽ mꞽ wḥꜥ ⸗ṯ mꞽ sfḫ ⸗ṯ m-ꜥ zꜣ ⸗ṯ Ḥr.w ḥr-n.tꞽt ꜥq ⸗ꞽ m ḫ.t pri̯ ⸗

### To convert to specific word levels and create dictionaries

**Function to convert the dataframes into strings of lemmatized texts**.
* param corpus: a list of dataframes
* param break_perc: a threshold that determines whether broken words are included, depending on how broken they are.
                       It is compared against the `break_perc` column in the dataframe.
                       Defaults to 1 (i.e. all words, whether broken or not, are included); can be any float between 0 and 1.
* param mask: boolean, whether to mask named entities or not; defaults to True.
* return: a dictionary where the keys are the text IDs and the values are tuples of the lemmatized text, its full length in words, and its length excluding `UNK` and `X` tokens.

In [53]:
def get_lemmatized_texts(corpus, break_perc=1, mask=True):

    texts_dict = {}
    for df in corpus:
        # get the text number from the dataframe "text" column
        key = df["text"].iloc[0]
        text, text_length_full, text_length_partial = df2str(df, "cf", break_perc, mask)
        texts_dict[key] = (text, text_length_full, text_length_partial)
    return texts_dict

In [54]:
get_lemmatized_texts((split_df_by_column_value(corpus[0], "text")))

{'5FTT5MPBHZC3HK4LCBHO5ILEQY': ('ḥꜣ.t-ꜥ m rʾ n.ꞽ wꜣḥ pẖr.t ḥr ꜥ.t nb n.ꞽ z\npri̯ ⸗ꞽ m Ꞽwn.w ḥnꜥ Wr.w-n.w-Ḥw.t-ꜥꜣ.t nb-mkw.t ḥqꜣ-nḥḥ\nnḥm.n pri̯ ⸗ꞽ m Zꜣw ḥnꜥ mw.t-nṯr\nrḏi̯ ⸗sn n ⸗ꞽ mkw.t ⸗sn\nꞽw ṯꜣz n ⸗ꞽ ꞽri̯ nb-r-ḏr r dr s.t-ꜥ nṯr nṯr.t mwt mwt.t ḥmw.t-rʾ n.tꞽ m tp ⸗ꞽ pn m nḥb.t ⸗ꞽ ꞽptn tn m qꜥḥ ⸗ꞽ ꞽpn m ꞽwf ⸗ꞽ pn m ꜥ.t ⸗ꞽ ꞽptn r szwnu̯ srḫ.y ḥr.ꞽ sꜥq ẖnn.w m ꞽwf ⸗ꞽ pn bꞽbꞽ m ꜥ.t ⸗ꞽ ꞽptn m ꜥq m ꞽwf ⸗ꞽ pn m tp ⸗ꞽ pn m qꜥḥ ⸗ꞽ ꞽpn m ḥꜥ.w ⸗ꞽ pn m ꜥ.t ⸗ꞽ ꞽptn\nn.ꞽ wꞽ Rꜥw ḏd ⸗f\nꞽnk nḏ ⸗ꞽ sw m-ꜥ ḫft.ꞽ ⸗f\nsšm.w ⸗f pw Ḏḥw.tꞽ\nꞽw ⸗f rḏi̯ ⸗f mdwi̯ drf\nꞽri̯ ⸗f dmḏ.t\nrḏi̯ ⸗f ꜣḫ.w n rḫ-ꞽḫ.t n zwn.w ꞽm.ꞽ-ḫt ⸗f r wḥꜥ mri̯ nṯr sꜥnḫ ⸗f sw\nꞽnk pw mri̯ nṯr sꜥnḫ ⸗f wꞽ\nḏd-mdw ḫft wꜣḥ pẖr.t ḥr ꜥ.t nb n.ꞽ z n.tꞽ mr\nsšr-mꜣꜥ ḥḥ n.ꞽ zp\nky rʾ n.ꞽ wḥꜥ wt nb\nwḥꜥ zp ꞽn Ꜣs.t\nwḥꜥ Ḥr.w ꞽn Ꜣs.t m ḏw.t ꞽri̯ r ⸗f ꞽn sn ⸗f Stẖ m smꜣ ⸗f ꞽtꞽ ⸗f Wsꞽr\nꞽ Ꜣs.t wr.t-ḥkꜣ.w wḥꜥ ⸗ṯ wꞽ\nsfḫ ⸗ṯ wꞽ m-ꜥ ꞽḫ.t nb bꞽn ḏw dšr m-ꜥ s.t-ꜥ nṯr s.t-ꜥ nṯr.t m-ꜥ mwt mwt.t m-ꜥ ḏꜣ.yw ḏꜣ.yt ḏꜣi̯ ⸗fꞽ sw m ⸗ꞽ mꞽ wḥꜥ ⸗ṯ mꞽ sfḫ ⸗ṯ m-ꜥ zꜣ ⸗ṯ Ḥ

**Function to convert the dataframes into strings of normalized texts**.
* param corpus: a list of dataframes
* param break_perc: a threshold that determines whether broken words are included, depending on how broken they are.
                       It is compared against the `break_perc` column in the dataframe.
                       Defaults to 1 (i.e. all words, whether broken or not, are included); can be any float between 0 and 1.
* param mask: boolean, whether to mask named entities or not; defaults to True.
* return: a dictionary where the keys are the text IDs and the values are tuples of the normalized text, its full length in words, and its length excluding `UNK` and `X` tokens.

In [55]:
def get_normalized_texts(corpus, break_perc=1, mask=True):

    texts_dict = {}
    for df in corpus:
        # get the text number from the dataframe "text" column
        key = df["text"].iloc[0]
        text, text_length_full, text_length_partial = df2str(df, "norm", break_perc, mask)
        texts_dict[key] = (text, text_length_full, text_length_partial)
    return texts_dict

In [56]:
get_normalized_texts((split_df_by_column_value(corpus[0], "text")))

{'5FTT5MPBHZC3HK4LCBHO5ILEQY': ('ḥꜣ.t-ꜥ m rʾ n wꜣḥ pẖr.t ḥr ꜥ.t nb.t n.t s\npri̯.n ⸗ꞽ m ꞽwn.w ḥnꜥ wr.w.PL-n.w-ḥw.t-ꜥꜣ.t.PL nb.w.PL-mk.t ḥqꜣ.w.PL-nḥḥ\nnḥm.n pri̯.n ⸗ꞽ m sꜣw ḥnꜥ mw.t-nṯr.PL\nrḏi̯.n ⸗sn n ⸗ꞽ mk.t ⸗sn\nꞽw ṯs.w.PL n ⸗ꞽ ꞽri̯.n nb-r-ḏr r dr s.t-ꜥ nṯr nṯr.t mwt mwt.t ḥmw.t-rʾ n.tꞽ m dp ⸗ꞽ pn m nḥb.t ⸗ꞽ UNK tn m qꜥḥ.w.PL ⸗ꞽ ꞽpn m ꞽwf ⸗ꞽ pn m ꜥ.t.PL ⸗ꞽ ꞽptn r sswnu̯ srḫ.y ḥr.ꞽ sꜥq.y.w.PL ẖnn m ꞽwf ⸗ꞽ pn bꞽbꞽ m ꜥ.t.PL ⸗ꞽ ꞽptn m ꜥq.t m ꞽwf ⸗ꞽ pn m dp ⸗ꞽ pn m qꜥḥ.w.PL ⸗ꞽ ꞽpn m ḥꜥ.PL ⸗ꞽ pn m ꜥ.t.PL ⸗ꞽ ꞽptn\nnꞽ wꞽ rꜥ ḏd.n ⸗f\nꞽnk nḏ ⸗ꞽ sw m-ꜥ ḫft.ꞽ.w.PL ⸗f\nsšm.w ⸗f pw ḏḥw.tꞽ\nꞽw ⸗f ḏi̯ ⸗f mdwi̯ drf\nꞽri̯ ⸗f dmḏ.t.PL\nḏi̯ ⸗f ꜣḫ n rḫ.w.PL-ꞽḫ.t.PL n swn.w.PL ꞽm.ꞽ.w.PL-ḫt.PL ⸗f r wḥꜥ mrr.w nṯr sꜥnḫ ⸗f sw\nꞽnk pw mrr.w nṯr sꜥnḫ ⸗f wꞽ\nḏd-mdw ḫft wꜣḥ pẖr.t ḥr ꜥ.t nb.t n.t s n.t.t mḥr.tꞽ\nšs-mꜣꜥ ḥḥ n zp\nky rʾ n wḥꜥ wt nb\nwḥꜥ zp-2 ꞽn ꜣs.t\nwḥꜥ ḥr.w ꞽn ꜣs.t m ḏw.t.PL ꞽri̯.yt.PL r ⸗f ꞽn sn ⸗f stḫ m smꜣm ⸗f ꞽt ⸗f ws-ꞽr\nꞽ ꜣs.t wr.t-ḥkꜣ.w wḥꜥ ⸗t wꞽ\nsfḫ ⸗t wꞽ m-ꜥ ꞽḫ.t nb.t bꞽn.t ḏw.t dšr.t m-

**Function to convert the dataframes into strings of segmented unicode texts**.
* param corpus: a list of dataframes
* param break_perc: a threshold that determines whether broken words are included, depending on how broken they are.
                       It is compared against the `break_perc` column in the dataframe.
                       Defaults to 1 (i.e. all words, whether broken or not, are included); can be any float between 0 and 1.
* param mask: boolean, whether to mask named entities or not; defaults to True.
* return: a dictionary where the keys are the text IDs and the values are tuples of the segmented unicode text, its full length in words, and its length excluding `UNK` and `X` tokens.

In [57]:
def get_segmented_unicode_texts(corpus, break_perc=1, mask=True):

    texts_dict = {}
    for df in corpus:
        # get the text number from the dataframe "text" column
        key = df["text"].iloc[0]
        text, text_length_full, text_length_partial = df2str(df, "unicode_word", break_perc, mask)
        texts_dict[key] = (text, text_length_full, text_length_partial)
    return texts_dict

In [58]:
get_segmented_unicode_texts((split_df_by_column_value(corpus[0], "text")))

{'5FTT5MPBHZC3HK4LCBHO5ILEQY': ('𓄂𓂝 𓅓 𓂋𓏤 𓈖 𓎝𓎛\U00013455𓏛 𓄲𓂋𓏏𓈒𓏥 𓁷𓏤 𓂝𓏏𓄹 𓎟𓏏 𓈖𓏏 𓊃𓀀𓏤\n𓉐𓂋𓂻𓈖 𓀀 𓅓 𓉺𓏌𓊖 𓎛𓈖𓂝 𓀗𓏲𓅆𓏪𓏌𓏤𓉗𓉻𓏏𓉐𓅆𓏪 𓎟𓏲𓏥𓅓𓂝𓎢𓏏𓏛 𓋾𓈎𓏲𓅆𓏪𓎛𓇳𓎛\n𓈖𓈞𓅓𓂝𓈖 𓉐𓂋𓂻𓈖 𓀀 𓅓 𓊃𓅭𓄿𓏲𓊖 𓎛𓈖𓂝 𓅐𓏏𓏯𓅆𓏪𓊹𓊹𓊹𓅆𓏪\n𓂋𓂝𓈖 𓋴𓈖𓏥 𓈖 𓀀 𓅓𓂝𓎢𓏏𓏛𓏪 𓋴𓈖𓏥\n𓇋𓏲 𓋭𓊃𓏲𓀁𓏪 𓈖 𓀀 𓁹𓈖 𓎟𓂋𓇥𓂋𓅆 𓂋 𓂧𓂋𓀜 𓊨𓏏𓂝𓏤𓅪𓏥 𓊹𓅆 𓊹𓂋𓏏𓏯𓆗 𓅓𓏏𓏱 𓅓𓏏𓏏𓏱 𓍍𓏏𓏤𓏛𓏥𓂋𓏤 𓈖𓏏𓏭 𓅓 𓁶𓏤 𓀀 𓊪𓈖 𓅓 𓅘𓎛𓃀𓏏𓄹 𓀀 𓇋𓊪𓏏𓈖 UNK 𓅓 𓈎𓂝𓎛𓏲𓂢𓏥 𓀀 𓇋𓊪𓈖 𓅓 𓇋𓏲𓆑𓄹𓏥 𓀀 𓊪𓈖 𓅓 𓂝𓏏𓄹𓏥 𓀀 𓇋𓊪𓏏𓈖 𓂋 𓋴𓋴𓃹𓈖𓌕𓅪 𓋴𓂋𓐍𓇋𓇋𓀁𓏱 𓁷𓂋𓏭𓇯 𓋴𓅧𓈎𓇋𓇋𓏲𓂻𓏥 𓂙𓈖𓈖𓐎 𓅓 𓇋𓏲𓆑𓄹𓏥 𓀀 𓊪𓈖 𓃀𓇋𓃀𓇋𓀁 𓅓 𓂝𓏏𓄹𓏥 𓀀 𓇋𓊪𓏏𓈖 𓅓 𓅧𓈎𓏏𓂻 𓅓 𓇋𓏲𓆑𓄹𓏥 𓀀 𓊪𓈖 𓅓 𓁶𓏤 𓀀 𓊪𓈖 𓅓 𓈎𓂝𓎛𓏲𓂢𓏥 𓀀 𓇋𓊪𓈖 𓅓 𓎛𓂝𓄹𓏥 𓀀 UNK 𓅓 𓂝𓏏𓄹𓏥 𓀀 𓇋𓊪𓏏𓈖\n𓈖 𓅱𓀀 𓇳𓅆 𓆓𓂧𓈖 𓆑\n𓏌𓎢𓀀 𓐩𓏌𓂝 𓀀 𓇓𓏲 𓅓𓂝 𓐍𓆑𓏏𓅂𓏱𓏥 𓆑\n𓋴𓌫𓅓𓏲𓂻 𓆑 𓊪𓏲 𓅝𓏏𓏭\n𓇋𓏲 𓆑 𓂞 𓆑 𓌃𓂧𓏲𓀁 𓂧𓂋𓆑𓏯𓏛\n𓁹 𓆑 𓋬𓂧𓏏𓍼\n𓂞 𓆑 𓅜𓐍𓏛 𓈖 𓂋𓐍𓏲𓏛𓏥𓐍𓏏𓏛𓏥𓀀𓏥 𓈖 𓌕𓏌𓏤𓀀𓏥 𓏶𓅓𓏲𓏪𓆱𓐍𓏏𓂻𓀀𓏥 𓆑 𓂋 𓊠𓂝𓍢𓂝 𓌻𓂋𓂋𓏲𓀁 𓊹𓅆 𓋴𓋹𓈖𓐍 𓆑 𓇓𓏲\n𓏌𓎢𓀀 𓊪𓏲 𓌻𓂋𓂋𓏲𓀁 𓊹𓅆 𓋴𓋹𓈖𓐍 𓆑 𓏲𓀀\n𓆓𓂧𓌃𓏤𓏥 𓐍𓆑𓏏 𓎝𓎛𓏛 𓄲𓂋𓏏𓈒𓏥 𓁷𓏤 𓂝𓏏𓄹 𓎟𓏏 𓈖𓏏 𓊃𓀀𓏤 𓈖𓏏𓏏 𓍋𓅓𓂋𓅪𓍘𓇋\n𓍱𓏤𓏛𓏥𓌶𓂝𓏛 𓁨\U00013445 𓈖 𓊃𓊪𓊗\n𓎢\U00013455𓇋𓇋 𓂋𓏤 𓈖 𓊠𓂝𓍢𓂝 𓐎𓂝 𓎟\n𓊠𓂝𓍢𓂝 𓊗𓏻 𓇋𓈖 𓊨𓏏𓏯𓆗\n𓊠𓂝𓍢𓂝 𓅃𓅆 𓇋𓈖 𓆇𓏏𓏯𓅆 𓅓 𓈋𓅱𓏏𓅪𓏥 𓁹𓇋𓇋𓏏𓏥 𓂋 𓆑 𓇋𓈖 𓌢𓅆 𓆑 𓃩𓅆 𓅓 𓋴𓌳𓄿𓅓𓌪𓂝 𓆑 𓇋𓏏𓆑𓅆 𓆑 𓍟𓁹𓅆\n𓇋𓀁 𓊨𓏏𓆇𓏯𓅆 𓅨𓂋𓏏𓎛𓂓𓄿𓏲𓀁𓏪𓅆 𓊠𓂝𓍢𓂝 𓏏 𓏲𓀀\n𓋴𓆑𓐍𓍼𓏛 𓏏 𓏲𓀀 𓅓𓂝 𓐍𓏏𓏛𓏥 𓎟𓏏 𓃀𓇋𓈖𓏏𓅪 𓈋𓅱𓏏𓅪𓏥 𓂧𓈙𓂋𓏏𓏯 𓅓𓂝 𓊨𓏏𓂝𓏤𓅪𓏥 𓊹𓅆 𓊨𓏏𓂝𓏤𓅪𓏥 𓊹𓂋𓏏𓏯𓆗 𓅓𓏏 𓅓𓏏𓏱 𓅓𓏏𓏏𓏱 𓅓𓏏 𓍑𓄿𓇋𓇋𓏲𓏱 𓍑𓄿𓇋𓇋𓏏𓏱 𓍑𓄿𓏏𓏴 𓆑𓏭 𓇓𓏲 𓇋𓅓 𓀀 𓏇𓇋 𓊠𓂝𓍢𓂝 𓏏 𓏇𓇋 

### To create the vector space model

#### vectorizing texts

**Converts a list of texts into a document-term matrix based on TF-IDF scores**.

For full documentation of the parameters of sklearn's TfidfVectorizer, see: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer
* param corpus: a dataframe in which the texts are in a `"text"` column and the dataframe's index is the text ids.
* param analyzer: whether the feature should be made of word or character n-grams.
                     use `"word"` for word features, `"char_wb"` for character n-grams within word boundaries,
                     or `"char"` for character n-grams without word boundaries.
* param ngram_range: the lower and upper boundary of the range of n-values for different n-grams to be extracted.
* param max_df: threshold to ignore terms that have a document frequency above a certain value.
                   If the threshold is a float, it represents a proportion of the documents.
                   If the threshold is an integer, it represents the absolute number of documents in which the term appears.
* param min_df: threshold to ignore terms that have a document frequency below a certain value.
                   If the threshold is a float, it represents a proportion of the documents.
                   If the threshold is an integer, it represents the absolute number of documents in which the term appears.
* param max_features: if not `None`, build a vocabulary that only considers the top `max_features` terms ordered by term frequency across the corpus.
* param stop_words: if `None`, no stop words are used. Otherwise, can be a list of words to be removed from the resulting tokens.
* return: `counts` the TF-IDF matrix produced by the vectorizer (one row per text),
             `counts_df` a dataframe of the same scores where the index is the text ids and the columns are the tokens,
             `stop_words` the terms the vectorizer ignored because of `max_df`, `min_df` or `max_features`
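To show what a TF-IDF score is, here is a hand-rolled sketch on an invented three-document corpus. It assumes the weighting that, to our understanding, `TfidfVectorizer` applies by default (which `vectorize` does not override): raw term counts multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The variable and function names are hypothetical.

```python
import math

# Toy tokenized corpus (illustrative only)
docs = [["m", "pr", "m"], ["m", "nb"], ["nb", "nb", "z"]]
vocabulary = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
doc_freq = {term: sum(term in doc for doc in docs) for term in vocabulary}

# Smoothed inverse document frequency: rarer terms get a higher weight
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
       for term in vocabulary}

def tfidf_row(doc):
    """Weight raw counts by idf, then scale the row to unit length."""
    raw = [doc.count(term) * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

tfidf = [tfidf_row(doc) for doc in docs]
for row in tfidf:
    print([round(value, 3) for value in row])
```

Terms that occur in fewer documents (here `pr` and `z`) receive a higher idf than terms that occur in many (`m`, `nb`), so they contribute more to the distances computed later.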

![](https://www.humanitiesdataanalysis.org/_images/bow.png)



**Figure 1**. Example of a document-term matrix extracted from a corpus, see Fig. 3 in Karsdorp, F., Kestemont, M., & Riddell, A. (2021). Humanities Data Analysis: Case Studies with Python. Princeton University Press.

In [59]:
def vectorize(corpus, analyzer="word", ngram_range=(1,1), max_df=1.0, min_df=1, max_features=None, stop_words=["UNK", "X"]):

    vectorizer = TfidfVectorizer(input="content", lowercase=False, analyzer=analyzer,
                                 # RegEx for Akkadian
                                 #token_pattern=r"(?u)\b\w+\b", ngram_range=ngram_range,
                                 # RegEx for Egyptian
                                 token_pattern=r"(?u)\b[\w\.]+\b", ngram_range=ngram_range,
                                 max_df=max_df, min_df=min_df, max_features=max_features, stop_words=stop_words)

    counts = vectorizer.fit_transform(corpus["text"].tolist()).toarray()
    stop_words = vectorizer.stop_words_

    # saving the vocab used for vectorization (maps each token to its feature index)
    vocab = vectorizer.vocabulary_
    # ordering the tokens by feature index so they can serve as column names
    # in the counts dataframe for easier viewing
    column_names = sorted(vocab, key=vocab.get)

    counts_df = pd.DataFrame(counts, index=corpus.index, columns=column_names)

    return (counts, counts_df, stop_words)
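The two `token_pattern` expressions in `vectorize` differ only in whether a period may occur inside a token. As a rough illustration, the sketch below applies both patterns with Python's `re.findall` to an invented ASCII line written in the style of Egyptian transliteration. How scikit-learn applies the pattern internally is not shown here; this only demonstrates what each regular expression matches.

```python
import re

akkadian_pattern = r"(?u)\b\w+\b"        # word characters only
egyptian_pattern = r"(?u)\b[\w\.]+\b"    # word characters and periods

# Invented line in the style of Egyptian transliteration (ASCII only)
line = "wr.t nb.t n z"

tokens_akkadian_pattern = re.findall(akkadian_pattern, line)
tokens_egyptian_pattern = re.findall(egyptian_pattern, line)

print(tokens_akkadian_pattern)
print(tokens_egyptian_pattern)
```

With the first pattern a form such as `wr.t` is split at the period, whereas the second keeps it together as one token, which matters because the period marks morphological boundaries in the transliteration.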

#### calculating distances between vectorized documents

**Converts a document-term matrix to a matrix of pairwise distances between texts**.
* param counts: the TF-IDF matrix returned by the `vectorize` function.
* param metric: the metric by which to calculate the distances between the texts in the corpus. For an overview of the different metrics, see "Computing distances between documents" in [Karsdorp, Kestemont, & Riddell 2021](https://www.humanitiesdataanalysis.org/vector-space-model/notebook.html#computing-distances-between-documents).
                   Valid metrics are:
                   ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’,
                   ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulczynski1’,
                   ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’,
                   ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’.
* param text_ids: list of unique text_ids.
* return: a dataframe matrix of distance between texts.

In [60]:
def distance_calculator(counts, metric, text_ids):

    return pd.DataFrame(squareform(pdist(counts, metric=metric)), index=text_ids, columns=text_ids)
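The function chains two SciPy calls, which the following sketch separates on three invented vectors: `pdist` returns one distance per pair of rows in a flat ("condensed") array, and `squareform` expands that array into the symmetric matrix that is later plotted as a heatmap.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three toy document vectors, one per row (illustrative values only)
vectors = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])

# One value per pair of rows, in the order (0,1), (0,2), (1,2)
condensed = pdist(vectors, metric="cosine")

# Symmetric 3x3 matrix with zeros on the diagonal
square = squareform(condensed)

print(condensed)
print(square)
```

In this toy case the first two rows are identical (distance 0), while the third shares no nonzero feature with them (cosine distance 1).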

#### reducing dimensions with pca or tsne

**Reduces multidimensional data into two dimensions using PCA**.
* param df: dataframe holding the dimensions to reduce. All columns should include numerical values only.
               The dataframe's index should hold the unique text ids.
* param metadata: the rest of the metadata in the corpus, to help visualize the resulting clusters in meaningful ways.
                     The metadata's index should hold the unique text ids.
* return: a dataframe holding the coordinates of the two remaining dimensions, joined to all columns from the metadata.

In [61]:
def reduce_dimensions_pca(df, metadata):

    pca = PCA(n_components=2)
    reduced_data = pca.fit_transform(df)
    reduced_df = pd.DataFrame(data=reduced_data, index=df.index, columns=["component 1", "component 2"])
    reduced_df_metadata = metadata.join(reduced_df)
    return reduced_df_metadata
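For readers who want to see what the projection does, the sketch below reproduces the core of PCA with plain NumPy on random toy data: center the columns, take a singular value decomposition, and project onto the first two principal axes. This is a swapped-in illustration rather than the `PCA` class used above; the results should agree with it up to the sign of each component, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(6, 4))  # 6 toy "texts" x 4 toy features

# PCA works on centered data: subtract each column's mean
centered = data - data.mean(axis=0)

# Rows of vt are the principal axes, ordered by decreasing singular value
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Coordinates of each text on the first two principal axes
projected = centered @ vt[:2].T

# Share of the total variance captured by each axis
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)
print(projected.shape, explained_variance_ratio.round(3))
```

Checking how much variance the first two axes capture is a useful sanity check before reading too much into a two-dimensional plot.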

**Reduces multidimensional data into two dimensions using TSNE**.

See full documentation of sklearn's TSNE on: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
* param df: dataframe holding the dimensions to reduce. All columns should include numerical values only.
               The dataframe's index should hold the unique text ids.
* param perplexity: perplexity is a measure that weighs the importance of nearby versus distant points when creating a lower-dimension mapping.
                       t-SNE first converts the distances between points into conditional probabilities that represent similarities,
                       using Gaussian probability distributions.
                       The perplexity parameter influences the variance used to compute these probabilities.
                       A higher perplexity leads to a broader Gaussian that considers a larger number of neighbors when assessing similarity.
                       Lower perplexity puts more focus on the local structure and considers fewer neighbors.
                       A good perplexity depends greatly on dataset size and density.
                       The documentation recommends a value between 5 and 50.
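                       We recommend starting with the square root of the number of texts in the corpus.
* param n_iter: maximum number of iterations for the optimization. As far as we know, scikit-learn 1.5 renamed the corresponding `TSNE` argument from `n_iter` to `max_iter`, so the function below may need adjusting on newer versions.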
* param metric: the metric to be used when calculating distances between vectors.
                   Valid metrics are:
                   ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’,
                   ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulczynski1’,
                   ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’,
                   ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’.
* param metadata: the rest of the metadata in the corpus, to help visualize the resulting clusters in meaningful ways.
                     The metadata's index should hold the unique text ids.
* return: a dataframe holding the coordinates of the two remaining dimensions, joined to all columns from the metadata.
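The square-root starting point for perplexity can be written as a small helper. The function name `suggest_perplexity` is hypothetical and the value it returns is only a first guess to be tuned by inspecting the resulting plots. To our understanding, scikit-learn also expects the perplexity to be smaller than the number of samples, which the square root satisfies for any corpus of two or more texts.

```python
import math

def suggest_perplexity(n_texts):
    """Hypothetical helper: square-root starting point for t-SNE perplexity."""
    if n_texts < 2:
        raise ValueError("need at least two texts")
    return math.sqrt(n_texts)

# Example: a corpus of 47 texts
starting_perplexity = suggest_perplexity(47)
print(round(starting_perplexity, 2))
```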

In [62]:
def reduce_dimensions_tsne(df, perplexity, n_iter, metric, metadata):

    tsne = TSNE(n_components=2, perplexity=perplexity, n_iter=n_iter, metric=metric, init="pca")
    reduced_data = tsne.fit_transform(df)
    reduced_df = pd.DataFrame(data=reduced_data, index=df.index, columns=["component 1", "component 2"])
    reduced_df_metadata = metadata.join(reduced_df)
    return reduced_df_metadata

## Process texts from dataframes and combine results with metadata dataframe

In [63]:
# Function to combine processed texts with metadata

def get_corpus_metadata(texts_dict, metadata):
  texts_df = pd.DataFrame(texts_dict, index=["text", "full_length", "partial_length"]).transpose()
  df = metadata.join(texts_df)
  return df
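As a small illustration of what `get_corpus_metadata` does, the sketch below runs the same two steps on invented data: a dictionary of `(text, full_length, partial_length)` tuples becomes a dataframe with one row per text, which is then joined to a metadata table on the shared index. The ids `T1` and `T2` and the `project` values are made up.

```python
import pandas as pd

# Toy output in the shape returned by get_lemmatized_texts (illustrative only)
texts_dict = {
    "T1": ("m pr\nnb", 3, 3),
    "T2": ("X z", 2, 1),
}
# Toy metadata indexed by the same text ids
toy_metadata = pd.DataFrame({"project": ["A", "B"]}, index=["T1", "T2"])

# Each dictionary key becomes a column; transposing turns texts into rows
texts_df = pd.DataFrame(
    texts_dict, index=["text", "full_length", "partial_length"]
).transpose()

# Join on the index, i.e. on the text ids
combined = toy_metadata.join(texts_df)
print(combined)
```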

In [64]:
## vectorize lemma forms
corpus_dict = get_lemmatized_texts(corpus, break_perc=0)
## vectorize normalized forms
# corpus_dict = get_normalized_texts(corpus, break_perc=0)
## vectorize segmented Unicode signs (cuneiform or hieroglyphs)
# corpus_dict = get_segmented_unicode_texts(corpus, break_perc=0)

corpus_metadata = get_corpus_metadata(corpus_dict, metadata)

## For Akkadian
## remove texts which have fewer than n words excluding UNK and X
#n = 10
#print(f"Number of texts before filtering: {corpus_metadata.shape[0]}")
#corpus_metadata = corpus_metadata[corpus_metadata["partial_length"]>=n]
#print(f"Number of texts after filtering: {corpus_metadata.shape[0]}")


# For Egyptian use this instead, resetting the index
n = 10
print(f"Number of texts before filtering: {corpus_metadata.shape[0]}")
corpus_metadata = corpus_metadata[corpus_metadata["partial_length"]>=n].set_index("popular_name")
print(f"Number of texts after filtering: {corpus_metadata.shape[0]}")

Number of texts before filtering: 47
Number of texts after filtering: 47


In [65]:
get_corpus_metadata(corpus_dict, metadata)

Unnamed: 0,popular_name,credits,script_type,period,collection,project,text,full_length,partial_length
2SQY75KWF5HUXBJCDHOUZPKT3E,pEbers_757-760,"Lutz Popko, unter Mitarbeit von Altägyptisches...",hieratic,16th c. BCE (18th dyn.),"Bibliotheca Albertina, Papyrus- und Ostrakasam...",pEbers,ḥꜣ.t-ꜥ m pẖr.t n.ꞽ srwḫ gs wnm.ꞽ m rwy.t\nꜣḥ w...,99,99
3RU7Z4VQ45CYFIQ4PUGQ3HDJFU,pEbers_other-1,"Lutz Popko, unter Mitarbeit von Altägyptisches...",hieratic,16th c. BCE (18th dyn.),"Bibliotheca Albertina, Papyrus- und Ostrakasam...",pEbers,rnp.t-zp 1...n ḫr ḥm n.ꞽ nswt-bꞽ.tꞽ Ḏsr-kꜣ-Rꜥw...,93,93
4QZJJ2HEGBGDTIKGGBXB2L2NA4,pEbers_761-763,"Lutz Popko, unter Mitarbeit von Altägyptisches...",hieratic,16th c. BCE (18th dyn.),"Bibliotheca Albertina, Papyrus- und Ostrakasam...",pEbers,ḥꜣ.t-ꜥ m pẖr.t n.ꞽ rš\nbnꞽ.w\nmḥ rʾ ⸗f ꞽm\nky ...,93,93
5FTT5MPBHZC3HK4LCBHO5ILEQY,pEbers_1-3,"Lutz Popko, unter Mitarbeit von Altägyptisches...",hieratic,16th c. BCE (18th dyn.),"Bibliotheca Albertina, Papyrus- und Ostrakasam...",pEbers,ḥꜣ.t-ꜥ m rʾ n.ꞽ X pẖr.t ḥr ꜥ.t nb n.ꞽ z\npri̯ ...,375,372
5VTRLPNGTVB4NFASALVWWRA524,pEbers_432-436,"Lutz Popko, unter Mitarbeit von Altägyptisches...",hieratic,16th c. BCE (18th dyn.),"Bibliotheca Albertina, Papyrus- und Ostrakasam...",pEbers,ky n.ꞽ pzḥ n.ꞽ rmṯ\nẖꜥꜥ n.ꞽ šḏ.t n.tꞽ m ꜥnḏ.w ...,116,116
7GYITRCORFDG5LSMIKTYILCDJY,pEbers_764-782,"Lutz Popko, unter Mitarbeit von Altägyptisches...",hieratic,16th c. BCE (18th dyn.),"Bibliotheca Albertina, Papyrus- und Ostrakasam...",pEbers,ḥꜣ.t-ꜥ m pẖr.t n.ꞽ msḏr nḏs sḏm ⸗f\nmnš.t ḏrḏ ...,569,567
7RRQAVSPNFFTRNFXVC7JTPGQM4,pEbers_284-293,"Lutz Popko, unter Mitarbeit von Altägyptisches...",hieratic,16th c. BCE (18th dyn.),"Bibliotheca Albertina, Papyrus- und Ostrakasam...",pEbers,ḥꜣ.t-ꜥ m pẖr.t n.ꞽ rḏi̯ šzp ꞽb tʾ\nꞽwf ḏdꜣ 1.....,172,170
7USDA2QTOZEKHKANKSDDZY7KXM,pEbers_305-325,"Lutz Popko, unter Mitarbeit von Altägyptisches...",hieratic,16th c. BCE (18th dyn.),"Bibliotheca Albertina, Papyrus- und Ostrakasam...",pEbers,ḥꜣ.t-ꜥ m pẖr.t n.ꞽ dr sry.t\nḏꜣr.t wꜣḏ\nrḏi̯ ḥ...,541,541
7YKDGDFXMVDU7FGWTYWL5VG7IA,pEbers_451-463,"Lutz Popko, unter Mitarbeit von Altägyptisches...",hieratic,16th c. BCE (18th dyn.),"Bibliotheca Albertina, Papyrus- und Ostrakasam...",pEbers,ḥꜣ.t-ꜥ m pẖr.t n.ꞽ dr skm srwḫ šnꞽ\nznf n.ꞽ bḥ...,228,227
A3SCLECB6ZH23OARP74WFWLFAU,pEbers_627-696,"Lutz Popko, unter Mitarbeit von Altägyptisches...",hieratic,16th c. BCE (18th dyn.),"Bibliotheca Albertina, Papyrus- und Ostrakasam...",pEbers,ḥꜣ.t-ꜥ m nwd.t n.ꞽ smn mt\npẖr.t n.ꞽ snḏm mt\n...,1615,1615


# Exploring the Corpus using the Vector Space Model

In [66]:
# vectorize corpus
counts, counts_df, stop_words = vectorize(corpus_metadata, max_features=200)

In [67]:
counts_df.head(3)

popular_name,1,1...n,16,2,2...1,32,64,_,bnꞽ,bꜣq,...,ꞽrṯ.t,ꞽt,ꞽw,ꞽwf,ꞽwi,ꞽšd,ꞽḥ,ꞽḫ.t,ꞽꜣd.t,ꞽꜥr
pEbers_757-760,0.56861,0.360432,0.0,0.0,0.514457,0.0,0.06492,0.291715,0.0,0.0,...,0.0,0.035448,0.0,0.0,0.0,0.0,0.0,0.01897,0.0,0.0
pEbers_other-1,0.0,0.661201,0.0,0.0,0.0,0.0,0.0,0.447379,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
pEbers_761-763,0.0,0.110962,0.0,0.103672,0.0,0.0,0.0,0.0,0.089885,0.0,...,0.179769,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [68]:
# calculate distance between vectorized texts
matrix = distance_calculator(counts, "cosine", corpus_metadata.index)

In [69]:
# visualize matrix
fig = px.imshow(matrix)

# adjust size of the matrix
fig.update_layout(
    autosize=False,
    width=1000,
    height=1000,
)
fig.show()

In [70]:
# reduce matrix dimensions
reduced_tsne = reduce_dimensions_tsne(matrix, perplexity=matrix.shape[0]**0.5, n_iter=5000, metric="euclidean", metadata=corpus_metadata)

In [71]:
# visualize reduced dimensions

# adjust size column for visualization
size_min = 3
size_max = 70
size = (reduced_tsne["partial_length"] / reduced_tsne["partial_length"].max() * (size_max - size_min) + size_min).tolist()

# create figure
# for Akkadian use symbol = "script"
# for Egyptian delete that parameter or switch to symbol = 'script_type'
fig = px.scatter(reduced_tsne, x="component 1", y="component 2", size=size, color="project", symbol="script_type", hover_data=["partial_length", "full_length", reduced_tsne.index])
fig.update_traces(marker=dict(line=dict(width=1, color='black')))
fig.show()

# Find Shared Tokens

**Creates a mini dataframe that includes only the chosen texts and the tokens shared by those texts**
  (i.e., all tokens that are nonzero in all of the chosen texts).
* param df: the counts_df where the index is the text ids and the columns are the tokens.
* param text_ids: a list containing text ids.
* return: a dataframe where the index holds the shared tokens and the columns are the texts.
           The values are the TF-IDF scores.

In [72]:
def find_shared_tokens(df, text_ids):

  mini_df = df[df.index.isin(text_ids)].copy()
  mini_df = mini_df.loc[:, (mini_df != 0).all(axis=0)].copy()
  return mini_df.transpose()
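The column filter `(mini_df != 0).all(axis=0)` is the heart of this function. The sketch below applies the same two steps to an invented score table with three texts and three tokens, so the effect can be checked by eye.

```python
import pandas as pd

# Toy TF-IDF table: rows are texts, columns are tokens (illustrative values)
toy_counts = pd.DataFrame(
    {"m": [0.3, 0.2, 0.1], "nb": [0.0, 0.4, 0.2], "pr": [0.5, 0.0, 0.3]},
    index=["T1", "T2", "T3"],
)

# Keep only the chosen texts
selected = toy_counts[toy_counts.index.isin(["T1", "T3"])]

# Keep only the columns (tokens) that are nonzero in every remaining row,
# then flip the table so that tokens become the index
shared = selected.loc[:, (selected != 0).all(axis=0)].transpose()
print(shared)
```

Here `nb` is dropped because its score is 0 in `T1`, while `m` and `pr` are kept because both chosen texts contain them.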

In [73]:
# Akkadian
#shared_tokens = find_shared_tokens(counts_df, ["Q003450", "Q003711", "Q003790"])
# Egyptian
shared_tokens = find_shared_tokens(counts_df, ["pEdwinSmith_Wundenbuch_1-27", "pEdwinSmith_Wundenbuch_28-48", "pEdwinSmith_Hautverschoenerung", "pEbers_432-436"])#, "3RU7Z4VQ45CYFIQ4PUGQ3HDJFU"])

shared_tokens

popular_name,pEbers_432-436,pEdwinSmith_Wundenbuch_28-48,pEdwinSmith_Wundenbuch_1-27,pEdwinSmith_Hautverschoenerung
bꞽ.t,0.041768,0.023721,0.019329,0.197175
m,0.252499,0.263158,0.25352,0.17028
n.ꞽ,0.247292,0.174394,0.148158,0.083385
pẖr.t,0.076803,0.001678,0.003332,0.09064
wꜥ,0.085326,0.009319,0.002468,0.201397
ḥr,0.274495,0.118202,0.091868,0.092557
ꞽḫ.t,0.081792,0.019652,0.035485,0.193057


In [74]:
px.scatter(shared_tokens)

*This notebook was created by [Avital Romach](https://github.com/ARomach), with additional code and text by [Eliese-Sophia Lincke](https://www.berliner-antike-kolleg.org/en/bak/team/fda/eliese-sophia_lincke/index.html) and [Shai Gordin](https://digitalpasts.github.io/) in Spring 2024 for the course [Ancient Language Processing](https://digitalpasts.github.io/ALP-course/). Code can be reused under a [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.*