# Stylometric Analysis for Akkadian Cuneiform Texts

This Jupyter notebook provides code to perform stylometric analysis on cuneiform texts. The code was designed specifically for Akkadian texts; however, it is likely to work well on all languages that used the cuneiform script, that is, all languages whose transliteration system follows the same principles. Texts written in other non-alphabetic writing systems whose characters have Unicode representations could probably benefit from this code as well. The code is based on [*Introduction to stylometry with python* from the Programming Historian website](https://programminghistorian.org/en/lessons/introduction-to-stylometry-with-python).

It is accompanied by a `metadata` folder and a `texts` folder. The `texts` folder currently holds one corpus as a case study: 81 legal and administrative documents from the Neo-Babylonian period (ca. 7^th^-5^th^ centuries BCE). These texts were edited in the 1975 dissertation of Raymond B. Dillard. They are currently held in the Free Library of Philadelphia (FLP), having been purchased on the antiquities market in the early 20^th^ century by John Frederik Lewis.

The notebook employs the following methods on the texts:

1. Tf-idf vectorization
2. Creation of a distance matrix between the texts
3. t-SNE dimension reduction to display the distances between the texts on a scatter plot
4. Network analysis of the relationships between similar texts
5. A close reading: looking at groupings of nearest texts

The code is written so that no previous knowledge of Python or stylometric methods is assumed. Instructions throughout explain the code and the stylometric methods, and give the information needed to choose specific parameters.

As different parameters can give different results, it is possible to run the code several times and assess the differences.

Certain parameters are decided in advance (e.g. not to lowercase all characters) and some are left to the user (e.g. choosing 1-gram, 2-gram, or 3-gram). The idea behind this was that certain parameters are necessary for a logical parsing of cuneiform transliterations or their representation in Unicode cuneiform glyphs.

## Adding corpora

The stylometric analysis can be performed on texts which are either transliterated, or in which the signs are represented in Unicode. For transliterated texts, it is recommended that personal names and numbers be replaced with PN and NUM (or similar), depending on the focus of study (i.e. period, genre, scribal habits etc.).
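The replacement described above can be sketched as follows. This is only an illustration under two assumptions: numbers are written as standalone digit tokens, and the personal names are supplied by you as a list. The function name `mask_text` and the sample line are hypothetical and not part of the notebook.

```python
import re

def mask_text(text, personal_names, pn_label="PN", num_label="NUM"):
    """Replace listed personal names and digit-only tokens with placeholder labels."""
    # Longer names first, so that a short name never clips part of a longer one.
    for name in sorted(personal_names, key=len, reverse=True):
        # (?<!\S) and (?!\S) make sure the match starts and ends at a token boundary.
        pattern = r"(?<!\S)" + re.escape(name) + r"(?!\S)"
        text = re.sub(pattern, pn_label, text)
    # Tokens that consist only of digits become the number label.
    return re.sub(r"(?<!\S)\d+(?!\S)", num_label, text)

# Invented sample line, with hyphens and dots already removed.
sample = "5 GIN2 KU3 BABBAR sza Iddin Marduk"
masked = mask_text(sample, ["Iddin Marduk"])
print(masked)
```

Sign indices such as the `2` in `GIN2` are left untouched, because only tokens made up entirely of digits are treated as numbers.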

The new corpora need to be added to the relevant folders before starting to use the code. This includes the following:

1. A folder which includes the texts for stylometric analysis:
- The folder should be placed under the `texts` folder.
- Each file should be a plain text file and contain one text only.
- The texts should be transliterated texts without any editorial marks (including no hyphens and dots between signs), or texts in Unicode cuneiform.
- It is recommended that each file name be the name of the text (this is later used as the text identifier in the dataframes and CSVs produced).
2. A metadata file
- The file should be placed under the `metadata` folder.
- The file should be a CSV file.
- One of the columns **needs to be titled** `text_name`, and its values need to be identical to the file names of the texts under the `texts` folder, without the file extension (the code keeps only the part of the file name before the first dot).
- It is recommended that additional columns hold relevant information about the texts you want to study (period, genre, area, scribe, etc.).
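Before running the notebook on a new corpus, it can help to confirm that the text files and the metadata agree. The sketch below assumes the identifiers are derived the same way as in the notebook's code, `os.path.basename(x).split(".")[0]`; the function name `check_metadata` and the sample names are hypothetical.

```python
import os

def check_metadata(file_names, text_names):
    """Compare text file names (extension stripped) with the `text_name` values."""
    # Same key derivation as in the notebook: everything before the first dot.
    file_keys = {os.path.basename(name).split(".")[0] for name in file_names}
    meta_keys = set(text_names)
    return {
        "missing_from_metadata": sorted(file_keys - meta_keys),
        "missing_from_texts": sorted(meta_keys - file_keys),
    }

# Invented example: one file has no metadata row, one metadata row has no file.
report = check_metadata(["FLP_1.txt", "FLP_2.txt"], ["FLP_1", "FLP_3"])
print(report)
```

In practice the two arguments would come from `os.listdir()` on the corpus folder and from the `text_name` column of the metadata CSV.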

## Step 1: Importing Libraries and Defining Functions

The following five blocks of code import the Python libraries needed and define functions which will be called later in the code.

A short explanation precedes each function to explain what its different parameters mean. A fuller explanation of what the functions do appears below, where they are used in the main code.

You will only need to run these initial five code blocks once.

In [None]:
import os
import time
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from scipy.spatial.distance import pdist, squareform
import networkx as nx
from networkx.algorithms.community import *
import panel as pn
import panel.widgets as pnw
pn.extension()
pn.extension("plotly", "tabulator")
import holoviews as hv
from holoviews import dim, opts
hv.extension("bokeh", "matplotlib")

In [None]:
!jupyter nbextension enable --py widgetsnbextension --sys-prefix

### vectorize() function

This function uses the tf-idf vectorizer from the sklearn library.

Two parameters are set: the _input_ is `filename` and _lowercase_ is set to `False`.

The function takes the following variables:

`file_paths`: the full paths to the text files

`analyzer`: word, char, or char_wb, see further below

`ngram_range`: see further below

`max_df`: see further below

`min_df`: see further below

`max_features`: see further below

`file_keys`: the file names which are used as keys in the metadata dataframe

The function returns a list with the following:

`counts`: the tf-idf score of every word for every text as an array

`counts_df`: the tf-idf score of every word for every text as a dataframe, where the rows are the file names and the columns are the words in the vocabulary

`vocab_df`: a dataframe of the vocabulary words

In [None]:
def vectorize(file_paths, analyzer, ngram_range, max_df, min_df, max_features, file_keys):
    # fall back to sklearn's defaults when a size-limiting parameter was not chosen (False, None or 0)
    if not max_df:
        max_df = 1.0
    # a slider value of 1 means "keep every term": pass the integer 1 (a document count),
    # because sklearn reads the float 1.0 as "appears in 100% of the documents"
    if not min_df or min_df >= 1:
        min_df = 1
    if not max_features:
        max_features = None

    vectorizer = TfidfVectorizer(input="filename", lowercase=False, analyzer=analyzer, ngram_range=ngram_range, max_df=max_df, min_df=min_df, max_features=max_features)
    counts = vectorizer.fit_transform(file_paths).toarray()
    # saving the vocab used for vectorization, and switching the dictionary so that the feature index is the key
    vocab = vectorizer.vocabulary_
    switched_vocab = {value: key for key, value in vocab.items()}
    # listing the vocab words in feature-index order, to label the columns of the counts dataframe
    column_names = [switched_vocab[i] for i in range(len(switched_vocab))]

    vocab_df = pd.DataFrame(vocab, index=["feature index"]).transpose()
    counts_df = pd.DataFrame(counts, index=file_keys, columns=column_names)

    return [counts, counts_df, vocab_df]

### distance_calculator() function

Calculates the distances between texts based on their tf-idf counts, using the scipy library.

Takes the following variables:

`counts`: tf-idf counts, calculated using the previous function

`metric`: the metric by which to calculate the distances. A list of the available metrics is given below.

`file_keys`: the file names that are used to identify the texts in the dataframe produced.

The function returns a dataframe which contains a matrix of the distances between each text and every other text in the corpus.

In [None]:
def distance_calculator(counts, metric, file_keys):
    return pd.DataFrame(squareform(pdist(counts, metric=metric)), index=file_keys, columns=file_keys)

### reduce_dimensions() function

This function uses t-SNE to reduce multi-dimensional vectors to two-dimensional vectors.

It takes the following variables:

`df`: the dataframe which holds the matrix of relative distances between the texts in the corpus.

`file_keys`: the names of the texts used as an index.

`perplexity`: see further below.

`n_iter`: see further below.

`metric`: see further below.

The function returns a list with the following:

`reduced_df`: a dataframe which holds the resulting two-dimensional vectors.

`kl_divergence`: the Kullback–Leibler divergence score.

`n_features_in`: the number of features seen by the model during fitting. Because the input is the square distance matrix, this equals the number of texts in the corpus.

In [None]:
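def reduce_dimensions(df, file_keys, perplexity, n_iter, metric):
    # note: scikit-learn 1.5 renamed TSNE's `n_iter` argument to `max_iter`;
    # on newer versions, pass max_iter=n_iter instead
    tsne = TSNE(n_components=2, perplexity=perplexity, n_iter=n_iter, metric=metric, init="pca")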
    reduced_data = tsne.fit_transform(df)
    kl_divergence = tsne.kl_divergence_
    n_features_in = tsne.n_features_in_
    reduced_df = pd.DataFrame(data=reduced_data, index=file_keys, columns=["component 1", "component 2"])
    
    return [reduced_df, kl_divergence, n_features_in]

## Step 2: Tf-idf Vectorization and Distance Matrix

Running the following code block will load a widget in which you can choose the corpus you want to use and the corresponding metadata file. It also allows you to choose parameters for the vectorization of the corpus and for calculating the distances between all the texts in the corpus. The resulting distances between the texts are shown as a heatmap on the second tab of the widget. Changing the parameters dynamically updates the visualization.

Before running the code block, read the explanation below carefully to understand what the different parameters and choices mean.

### Vectorization

Our first step is to vectorize the texts - turn the words or characters of each text into meaningful combinations of numbers, creating a vector for each text. This is done using [Bag-of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) and [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), through the sklearn library. The tf-idf gives a numerical value that represents the relative importance of any given word or character in the document and in the entire corpus.
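To make the idea concrete, here is a by-hand sketch of tf-idf on a two-document toy corpus. It assumes the smoothed variant that sklearn documents as its default, idf = ln((1 + n) / (1 + df)) + 1, followed by scaling each document vector to unit length. The toy tokens and the function name `tfidf_by_hand` are hypothetical; the notebook itself relies on `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf_by_hand(docs):
    """Return the sorted vocabulary and one L2-normalised tf-idf vector per document."""
    n_docs = len(docs)
    vocab = sorted({token for doc in docs for token in doc})
    # Document frequency: in how many documents does each term occur?
    df = {term: sum(term in doc for doc in docs) for term in vocab}
    # Smoothed inverse document frequency (assumed sklearn default).
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(value * value for value in raw))
        vectors.append([value / norm for value in raw])
    return vocab, vectors

# Invented toy corpus: two tokens are shared, one token is unique to each document.
docs = [["KU3", "BABBAR", "ina"], ["KU3", "BABBAR", "ana"]]
vocab, vectors = tfidf_by_hand(docs)
for term, weight in zip(vocab, vectors[0]):
    print(term, round(weight, 3))
```

In the first document, the token that occurs in only one text (`ina`) receives a higher weight than the two tokens shared by both texts, which is the "relative importance" described above.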

Most of the basic parameters of sklearn's vectorizer were kept at their defaults. One was changed for the purposes of cuneiform transliteration. The default settings of the vectorizer turn all words to lowercase, but in cuneiform transliterations the distinction between logograms (written in caps) and phonetic readings (written in lowercase) is important, so lowercasing was turned off.
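The effect of the lowercase setting can be seen on a single invented line in which the same sign appears once as a logogram and once as a syllable. This is a small sketch using sklearn's `TfidfVectorizer` with its default word tokenizer; the sample line is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented line: AN (uppercase, logographic) next to an (lowercase, syllabic).
docs = ["AN an na"]

# Default behaviour: everything is lowercased, so AN and an merge into one term.
default_vocab = sorted(TfidfVectorizer().fit(docs).vocabulary_)

# Setting used in this notebook: case is preserved, so the two stay distinct.
cased_vocab = sorted(TfidfVectorizer(lowercase=False).fit(docs).vocabulary_)

print(default_vocab)
print(cased_vocab)
```

With the default settings the vocabulary has two terms; with `lowercase=False` it has three, because `AN` and `an` are kept apart.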

There are three additional parameters to choose values for. They decide how the texts are tokenized and how many tokens to keep:

1. analyzer
2. n_gram range
3. max features / max_df / min_df

#### analyzer

The analyzer parameter decides whether to tokenize the texts on the character or the word level. There are three options:

1. `word`: tokenization of each word, i.e. everything that is separated by a space.
2. `char`: tokenization on the character level, where sequences of adjacent characters (of a length of your choosing, see n_gram range) are tokenized as "words". This setting does not take word boundaries into consideration, meaning it will put together sequences of characters even if they belong to two different words. This setting is recommended for Unicode cuneiform input.
3. `char_wb`: this tokenization is like the previous one, except that **it does** take word boundaries into consideration. Characters at the end or beginning of words will be padded with a space if needed, depending on the n_grams chosen (see below). This setting is recommended for Unicode cuneiform input.

#### n_gram range

The tokens don't have to be standalone words.

- If you have chosen `word` as your analyzer, the n_grams will represent the sequence of words grouped together as a token.
- If you have chosen `char` or `char-wb`, the n_grams will represent the sequence of characters grouped together as a token.

The n_gram parameter requires two numbers: the minimum length of the sequence and the maximum length.

- If you wish to have tokens in the lengths of 2 and 3 characters, or 2 and 3 words, your first number will be 2 and the second 3.
- If you don't want to have a range of possible sequences, you can put the same number twice. For example, to have only sequences of 3 characters, set both ends of the range slider below to 3.
- If you want the features to be sequences of 1 word or character, set both ends to 1.

For good results, it is recommended not to go beyond 3-grams.
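The three analyzers and the n_gram setting can be illustrated with a few lines of plain Python. These are simplified imitations written for this explanation, not sklearn's own code: they split on spaces only, handle a single n-gram length, and ignore sklearn's default rule that the `word` analyzer drops one-character tokens. The function names and the sample line are hypothetical.

```python
def word_ngrams(text, n):
    """Sequences of n adjacent words (compare analyzer="word")."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    """Sequences of n adjacent characters, crossing word boundaries (compare "char")."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def char_wb_ngrams(text, n):
    """Sequences of n characters inside each space-padded word (compare "char_wb")."""
    grams = []
    for word in text.split():
        padded = " " + word + " "
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

line = "ana bit szarri"  # invented sample line
print(word_ngrams(line, 2))
print(char_ngrams("ana bit", 2))
print(char_wb_ngrams("ana bit", 2))
```

Note how `char` produces the bigram `"a "` followed by `" b"`, which straddle the two words, while `char_wb` closes each word with its own padded bigram before starting the next.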

#### max features / max_df / min_df

It is usually ineffective to take into consideration all the vocabulary, or tokens, in a given corpus: not only is that sometimes computationally intensive, but calculations can also be thrown off by very rare or very common words that carry no real significance.

There are three ways to limit the size of the vocabulary in sklearn's vectorizer.

1. `max_df`: this parameter is an effective method for removing stopwords. It can take an integer or a float; the code below allows only a float between 0 and 1, which represents the *proportion of the documents in which a certain term appears*. Terms that appear in **more than** the chosen proportion of the texts will not be used in the vectorization. For example, if `max_df` is set to 0.7, terms that appear in more than 70% of the texts in the corpus will not be kept as features.
2. `min_df`: this parameter is the opposite of `max_df`, and it is an effective method for removing particularly rare words. It can take an integer or a float; the code below allows only a float between 0 and 1, which represents the *proportion of the documents in which a certain term appears*. Terms that appear in **fewer than** the chosen proportion of the texts will not be used in the vectorization. For example, if `min_df` is set to 0.1, terms that appear in fewer than 10% of the texts in the corpus will not be used as features. However, **if `min_df` is set to 1**, all terms that appear in at least one document are used; this is the default setting, which does not ignore any of the terms.
3. `max features`: the maximum number of terms used for vectorization. After taking `max_df` into consideration (if that parameter was given), it keeps the x most important terms in the corpus based on term frequency. For example, if 1000 is chosen for `max features` (and `max_df` was left at its default), the vectorization will use the 1000 features that have the highest term frequency.
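The document-frequency thresholds can be sketched in plain Python. This is a simplified imitation of the filtering idea, assuming both thresholds are given as proportions and that a term is dropped only when it lies strictly above `max_df` or strictly below `min_df`; the function name and the toy corpus are hypothetical.

```python
def filter_by_document_frequency(docs, max_df=1.0, min_df=0.0):
    """Keep the terms whose document frequency lies within the two proportions."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        # set() so that a term is counted once per document, however often it occurs.
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return sorted(
        term for term, count in df.items()
        if min_df * n_docs <= count <= max_df * n_docs
    )

# Invented corpus of four tokenized texts.
docs = [
    ["ina", "KU3", "GUR"],
    ["ina", "KU3", "GUR"],
    ["ina", "KU3"],
    ["ina", "MU"],
]
kept = filter_by_document_frequency(docs, max_df=0.75, min_df=0.5)
print(kept)
```

Here `ina` is dropped because it appears in all four texts (more than 75%), and `MU` is dropped because it appears in only one (fewer than 50%). `KU3`, which appears in exactly 75% of the texts, is kept.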

### Calculating distances between vectors

Once texts are turned into vectors, it is possible to calculate their distance from each other. In this notebook, the `pdist` function from the scipy Python library is used.

There is one parameter to decide for the `pdist` function, and that is the metric by which we measure the distance between each pair of vectors. 

The following metrics for measuring distance are available in the scipy library:

‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulsinski’, ‘kulczynski1’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’.

A description of how the different distances are calculated is available at [scipy's documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html). Usually, it is recommended to use the ‘euclidean’ or ‘cosine’ metrics.
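The difference between the two recommended metrics shows up on three toy vectors. The sketch below uses `pdist` and `squareform` in the same way as the notebook's `distance_calculator()` function; the vectors are invented for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Rows 0 and 1 point in the same direction but differ in length;
# row 2 points in a perpendicular direction.
vectors = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]])

euclidean = squareform(pdist(vectors, metric="euclidean"))
cosine = squareform(pdist(vectors, metric="cosine"))

print(euclidean.round(3))
print(cosine.round(3))
```

Euclidean distance separates rows 0 and 1 because their lengths differ, while cosine distance treats them as identical because only the direction of the vectors counts.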

In [None]:
# text_options = os.listdir("texts/")
# metadata_options = os.listdir("metadata/")
metrices = ['braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice', 'euclidean', 'hamming', 'jaccard', 'jensenshannon', 'kulsinski', 'kulczynski1', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']

folder_name = pnw.Select(name="Corpus", value=os.listdir("texts/")[0], options=os.listdir("texts/"))
metadata_name = pnw.Select(name="Metadata", value=os.listdir("metadata/")[0], options=os.listdir("metadata/"))
analyzer = pnw.Select(name="Analyzer", value="word", options=["word", "char", "char_wb"])
n_gram = pnw.IntRangeSlider(name="n-grams", value=(1, 1), start=1, end=5, step=1)
max_df = pnw.FloatSlider(name="max_df", value=1.0, start=0, end=1, step=0.01)
min_df = pnw.FloatSlider(name="min_df", value=1, start=0, end=1, step=0.01)
max_features = pnw.IntSlider(name="max features", value=500, start=50, end=10000, step=50)
dis_metric = pnw.Select(name="Distance Metric", value="euclidean", options=metrices)

@pn.depends(folder_name.param.value, metadata_name.param.value, analyzer.param.value, n_gram.param.value, max_df.param.value, min_df.param.value, max_features.param.value, dis_metric.param.value)
def get_matrix(folder_name, metadata_name, analyzer, n_gram, max_df, min_df, max_features, dis_metric):
    texts_path = "texts/" + folder_name
    files_path = [texts_path + "/" + x for x in os.listdir(texts_path)]
    file_keys = [os.path.basename(x).split(".")[0] for x in files_path]   
    metadata_path = "metadata/" + metadata_name
    metadata = pd.read_csv(metadata_path).set_index("text_name")
    
    vectorization = vectorize(files_path, analyzer, n_gram, max_df, min_df, max_features, file_keys)
    distance_matrix = distance_calculator(vectorization[0], dis_metric, file_keys)
    
    len_vocab = pnw.StaticText(name="Number of Vocabulary Words", value=len(vectorization[2].index))
    row1 = pn.Column(len_vocab, vectorization[2].sort_values("feature index"))#.servable("Cross-selector")
    row2 = pn.Row(px.imshow(distance_matrix))
    tabs = pn.Tabs(("Vocabulary", row1), ("Distance matrix", row2), dynamic=True)
    return tabs

widgets = pn.WidgetBox(folder_name, metadata_name, analyzer, n_gram, max_df, min_df, max_features, dis_metric)
col = pn.Column(widgets)
tabs = pn.Row(col, get_matrix)
tabs

## Step 3: Dimension Reduction

In this part of the notebook, we are going to use t-SNE to reduce our multi-dimensional matrix to two dimensions, which we can put on a scatter plot to get a simplified view of groupings of similar texts in the corpus.

t-SNE is a probabilistic method that tries to find the best way to reduce dimensions with as little distortion and information loss as possible. Because it is based on probabilities, every time the code is run the resulting scatter plot will look slightly different, even when the same parameters are used. While the exact placements of the texts in the scatter plot will change, the relative distance between texts usually stays more or less the same.

The code below uses the default settings for t-SNE in sklearn, except for the following three parameters, which are left to the user (the number of iterations is preset to 5,000 rather than the default 1,000):

1. perplexity
2. n_iter
3. distance metric

In addition to the explanations below, the [following resource](https://distill.pub/2016/misread-tsne/) can help build a better intuition for how the parameters affect the results.

### perplexity

The recommended size of perplexity changes depending on the size of the corpus under study. Generally speaking, the larger the corpus (the more dimensions there are), the higher perplexity is needed to get good results that represent the corpus well in 2D space. It is defined as follows in sklearn's documentation:

> The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. Different values can result in significantly different results.

However, some have shown that for truly large datasets, a perplexity even larger than 50 is recommended.

A useful rule of thumb for choosing a starting perplexity is to take the square root of the number of dimensions (i.e. raise it to the power of 0.5). In our case, the number of dimensions is the number of texts in the corpus. If there are, for example, 500 texts, the square root of 500 is ca. 22.36, so a good starting perplexity would be about 22 or 23. In the case of the texts published in Dillard's dissertation, there are 81 documents, so a good perplexity to start with is 9.

The following code block automatically calculates this recommended rule of thumb, but you can also change it if you wish.
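The rule of thumb can be written as a small function. This is a sketch: the function name is hypothetical, and the bounds of 5 and 50 are an assumption taken from the range of the perplexity slider in the widget below.

```python
def starting_perplexity(n_texts, low=5, high=50):
    """Square root of the corpus size, rounded and kept within the slider range."""
    guess = int(round(n_texts ** 0.5))
    return max(low, min(high, guess))

print(starting_perplexity(81))   # the Dillard corpus
print(starting_perplexity(500))
print(starting_perplexity(16))   # a very small corpus falls back to the lower bound
```

For very small corpora the square root drops below the lower bound of the slider, so the sketch returns the bound itself.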

### n_iter

This parameter is the number of iterations the t-SNE will run. Each iteration attempts to place the different datapoints on a 2D plot in the best way possible. The default setting for t-SNE in sklearn is 1000. Generally speaking, a large number of iterations is recommended: when there are too few, the data will not divide into meaningful groupings. The main disadvantage is that, for large datasets, a large number of iterations will take significantly longer to compute.

For the texts published by Dillard, 5000 iterations were found to be sufficient for reproducible results.

### distance metric

The same metrics that were used before are available here:

‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulsinski’, ‘kulczynski1’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’.

The first code block below calculates the distance matrix based on the latest settings from the previous code block (if you change the settings there, you will need to rerun this code block to apply the changes to the t-SNE visualization). A widget will then appear in which you can choose the parameters for the t-SNE.

Then running the code block below it will display the results of the t-SNE in a 2D scatter plot. You can change the color and the shape of the points in the scatter plot according to information from the metadata, to see if the calculations managed to detect meaningful groupings.

Red warning messages may appear after running the second cell: if the scatter plot eventually displays, they can be safely ignored.

In [None]:
# getting needed values from previous
metrices = ['braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice', 'euclidean', 'hamming', 'jaccard', 'jensenshannon', 'kulsinski', 'kulczynski1', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']
metadata_path = "metadata/" + metadata_name.value
metadata = pd.read_csv(metadata_path).set_index("text_name")
texts_path = "texts/" + folder_name.value
files_path = [texts_path + "/" + x for x in os.listdir(texts_path)]
file_keys = [os.path.basename(x).split(".")[0] for x in files_path]
vectorization = vectorize(files_path, analyzer.value, n_gram.value, max_df.value, min_df.value, max_features.value, file_keys)
distance_matrix = distance_calculator(vectorization[0], dis_metric.value, file_keys)

#creating tsne widget box and button
tsne_metric_widg = pnw.Select(name="T-SNE Metric", value="euclidean", options=metrices)
n_iter_widg = pnw.IntSlider(name="Number of Iterations", value=5000, start=500, end=10000, step=100)
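# the starting value is the square root of the number of texts, kept within the slider range (5 to 50)
perplexity_widg = pnw.IntSlider(name="Perplexity", value=min(50, max(5, int(round(len(file_keys)**0.5)))), start=5, end=50, step=1)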

pn.WidgetBox(tsne_metric_widg, n_iter_widg, perplexity_widg)

In [None]:
# set t-sne parameters for existing values in boxes above
tsne_metric = tsne_metric_widg.value
n_iter = n_iter_widg.value
perplexity = perplexity_widg.value

distance_tsne = reduce_dimensions(distance_matrix, file_keys, perplexity, n_iter, tsne_metric)
distance_2Dplot = pd.concat([distance_tsne[0], metadata], axis=1)

marker_labels = ["circle", "triangle", "diamond", "square", "plus", "star", "hex", "inverted_triangle", "circle_cross", "circle_dot", "circle_x", "circle_y", "triangle_dot", "triangle_pin", "diamond_dot", "diamond_cross", "square_cross", "square_dot", "square_pin", "square_x", "star_dot", "hex_dot"]
marker_columns = [col for col in metadata if len(metadata[col].unique()) < len(marker_labels)]

color = pnw.Select(name="Color", value=list(metadata.columns)[0], options=list(metadata.columns))
marker = pnw.Select(name="Shape", value=marker_columns[0], options=marker_columns)

@pn.depends(color.param.value, marker.param.value)
def create_tsne(color, marker):
    opts = dict(cmap="Category20", width=700, height=600, line_color="black", size=10, tools=["hover"], legend_position="left", color=color, marker=hv.dim(marker).categorize(marker_labels))
    return hv.Points(distance_2Dplot, ["component 1", "component 2"]).opts(**opts)

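def get_time():
    # assumed helper (get_time is not defined elsewhere in this notebook): a timestamp that is safe to use in file names
    return time.strftime("%Y-%m-%d %H-%M-%S")

def save(event):
    t = get_time()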
    folder = "experiments/" + max([f for f in os.listdir("experiments/")], key=lambda x: os.stat(os.path.join("experiments/",x)).st_ctime)
    tsne_title = "/distance tsne perplexity"+str(perplexity)+" metric-"+tsne_metric+" n-iter"+str(n_iter)+" kl-divergence"+str(round(distance_tsne[1], 2))+" "+t
    distance_2Dplot.to_csv(folder+tsne_title+".csv", encoding="utf-8")
    graph = hv.Points(distance_2Dplot, ["component 1", "component 2"]).opts(cmap="Category20", width=700, height=600, line_color="black", size=10, tools=["hover"], legend_position="left", color=color.value, marker=hv.dim(marker.value).categorize(marker_labels))
    hv.save(graph, folder+tsne_title+".html", fmt="html")
    #hv.save(graph, folder+tsne_title+".png", fmt="png")
    
    with open(folder+"/"+"T-SNE documentation"+t+".md", "a", encoding="utf-8") as doc:
        doc.write(f"## T-SNE Experiment run at {t}")
        doc.write(f"\nmetric: {tsne_metric}")
        doc.write(f"\nperplexity: {perplexity}")
        doc.write(f"\nn_iter: {n_iter}")
        doc.write(f"\nKullback-Leibler divergence after optimization: {distance_tsne[1]}")
        doc.write(f"\nnumber of features seen during fit: {distance_tsne[2]}\n\n")
    
    col.append(pnw.StaticText(value=f"Your T-SNE results were saved successfully! You will find a new CSV under the same experiments folder ({folder}), and the parameters for this t-SNE were documented in a documentation markdown file."))

button = pnw.Button(name="Save current results and settings", button_type="primary")
button.on_click(save)
widgets = pn.WidgetBox(color, marker)
col = pn.Column(widgets, button)
pn.Row(col, create_tsne).servable("Cross-selector")

## Step 4: Network Analysis

Another way of looking at relationships between the texts in the corpus is to map their relations on a graph.

The code below creates edge links between the texts, or nodes, in the corpus. Edges are created between each node and a subset of the texts that are closest to it. The number of closest texts is chosen by running the first code block below.

Running the second code block will visualize the network. The nodes can be colored based on the metadata.
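The idea of linking every text to its closest neighbours can be sketched without any libraries. The example below assumes the distances are held in a nested dictionary rather than the notebook's dataframe; the function name and the three toy texts are hypothetical.

```python
def nearest_edges(distances, k):
    """For every text, return (source, target, distance) for its k nearest other texts."""
    edges = []
    for source, row in distances.items():
        # Sort the other texts by distance, leaving out the text itself.
        others = sorted((dist, target) for target, dist in row.items() if target != source)
        for dist, target in others[:k]:
            edges.append((source, target, dist))
    return edges

# Invented symmetric distance matrix for three texts.
distances = {
    "A": {"A": 0.0, "B": 0.2, "C": 0.9},
    "B": {"A": 0.2, "B": 0.0, "C": 0.5},
    "C": {"A": 0.9, "B": 0.5, "C": 0.0},
}
edges = nearest_edges(distances, 1)
print(edges)
```

The result is not symmetric: C points to B as its nearest text, but B points to A. This is why the network is built as a directed graph.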

In [None]:
n_nearest_texts = pnw.IntSlider(name="Number of Nearest Texts to consider", value=3, start=1, end=10, step=1)

pn.WidgetBox(n_nearest_texts)

In [None]:
# create a list of dictionaries that holds the edges

texts = distance_matrix.index.tolist()
num = n_nearest_texts.value
edges_dict = []

for text in texts:
    # take the `num` nearest texts, leaving out the text itself (whose distance to itself is 0)
    nearest = distance_matrix[text].drop(text).nsmallest(num)

    for target, weight in nearest.items():
        edges_dict.append({"source": text, "target": target, "weight": weight})

## normalizing weights so that higher values would reflect closer connections between texts
weights = [edge["weight"]**2 for edge in edges_dict]
max_weight = max(weights)+1
for edge in edges_dict:
    edge["weight"] = max_weight - edge["weight"]
        
edges_df = pd.DataFrame.from_records(edges_dict)

# create a network with networkx
G = nx.from_pandas_edgelist(edges_df, "source", "target", "weight", create_using=nx.DiGraph)

# create metadata for the network

def get_metadata_for_node(metadata_df, metadata_type, nodes):
    metadata_dict = {}
    for node in nodes:
        metadata_dict[node] = metadata_df.loc[node][metadata_type] #"FLP "+str(node)
    return metadata_dict

# set node attributes

for meta in metadata.columns:
    data = get_metadata_for_node(metadata, meta, G.nodes)
    nx.set_node_attributes(G, name=meta, values=data)

nx.set_node_attributes(G, name="Degree", values=dict(nx.degree(G)))

positions = nx.layout.fruchterman_reingold_layout(G)

color1 = pnw.Select(name="Node color 1", value=list(metadata.columns)[0], options=list(metadata.columns))
color2 = pnw.Select(name="Node color 2", value=list(metadata.columns)[1], options=list(metadata.columns))

def opts_settings(node_color, title):
    return opts.Graph(directed=True, node_size="Degree", arrowhead_length=0.005, width=400, height=400, node_color=node_color, cmap="Category20", edge_color="grey", edge_line_width="weight", show_legend=True, title=title)

@pn.depends(color1.param.value, color2.param.value)
def make_networks(color1, color2):
    network1 = hv.Graph.from_networkx(G, positions)
    network2 = hv.Graph.from_networkx(G, positions)
    network1.opts(opts_settings(color1, color1))
    network2.opts(opts_settings(color2, color2))
    return hv.Layout(network1 + network2)

widgets = pn.WidgetBox(color1, color2)
pn.Column(widgets, make_networks).servable("Cross-selector")

## Step 5: Close Reading

What makes texts in the corpus similar to each other, from a computational perspective?

In the first code block below, you can choose the name of the text you want to compare, and the number of texts most similar to it. That number includes the text chosen for comparison: if 5 is chosen, the five texts will be the chosen text and its 4 nearest texts. You can then see a table with information from the metadata that may or may not explain the similarities.

The code block after it looks at shared terms used in pairs of documents. It creates a scatter plot in which the points are terms that appear in both texts. The x-axis is the tf-idf value of the term in the first text, and the y-axis is the tf-idf value of the same term in the second text. In some pairs of texts that are particularly similar, the shared terms fall along a line on the scatter plot, indicating that key terms have similar tf-idf values in similar documents.
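The selection of shared terms can be sketched with two small dictionaries that map terms to tf-idf values. The function name, the terms, and the values are invented for illustration; the notebook works on rows of the `counts_df` dataframe instead.

```python
def shared_terms(tfidf_a, tfidf_b):
    """Return {term: (value in text A, value in text B)} for terms that are non-zero in both."""
    return {
        term: (value, tfidf_b[term])
        for term, value in tfidf_a.items()
        if value != 0 and tfidf_b.get(term, 0) != 0
    }

# Invented tf-idf values for two texts.
text_a = {"KU3": 0.40, "BABBAR": 0.40, "ina": 0.55, "MU": 0.0}
text_b = {"KU3": 0.35, "BABBAR": 0.38, "ina": 0.0, "MU": 0.60}

points = shared_terms(text_a, text_b)
print(points)
```

Each remaining pair of values gives the x and y coordinates of one point on the scatter plot.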

In [None]:
text = pnw.Select(name="Text", value=file_keys[0], options=file_keys)
num = pnw.IntInput(name="Number of Comparisons", value=5, start=1, end=len(file_keys))

@pn.depends(text.param.value, num.param.value)
def compare_ntexts(text, num):
    nearest_texts = distance_matrix.nsmallest(num, text)[text]
    nearest_texts_metadata = pd.concat([nearest_texts, metadata], axis=1, join="inner")
    #pd.options.display.max_columns = None
    #display(nearest_texts_metadata)
    tabulator_editors = {col:None for col in nearest_texts_metadata.columns}
    return pn.widgets.Tabulator(nearest_texts_metadata, widths=100, editors=tabulator_editors)


widgets = pn.WidgetBox(text, num, width=200)
pn.Row(widgets, compare_ntexts).servable("Cross-selector")

### Compare Two Texts

In [None]:
text1 = pnw.Select(name="Text 1", value=file_keys[0], options=file_keys)
text2 = pnw.Select(name="Text 2", value=file_keys[1], options=file_keys)

@pn.depends(text1.param.value, text2.param.value)
def compare2texts(text1, text2):
    opts = dict(width=500, height=500, size=10, tools=["hover"])
    if text1 is not None and text2 is not None:
        comparison_df = vectorization[1].loc[[text1, text2]].transpose()
        comparison_df = comparison_df[comparison_df[text1] != 0]
        comparison_df = comparison_df[comparison_df[text2] != 0]
        comparison_df["token"] = comparison_df.index 

        return hv.Points(comparison_df).opts(**opts)
    
widgets = pn.WidgetBox(text1, text2, width=200, height=600)
pn.Row(widgets, compare2texts).servable("Cross-selector")