# Distance Measurements: Sumerian Literature

This is a work-in-progress Notebook.


In [None]:
import pandas as pd
import glob
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Read in Data
First read the directory with the cleaned ETCSL texts. These files contain lemmatization in ORACC (ePSD2 style). The texts list lemmatizations per line.

In [None]:
path =r'../Scrape-etcsl/cleaned/' # use your path
allFiles = glob.glob(path + "/*.txt")
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None, header=0)
    list_.append(df)
etcsl_data = pd.concat(list_)
etcsl_data

# Create Document Term Matrix
In order to transform this DataFrame into a proper Document Term Matrix we need to discard the columns `version` and `l_no` and concatenate all the text that belongs to a single composition. Some lines have no content in the `text` column - these lines need to be dropped.

First select the relevant columns and drop the rows that have no text content.

In [None]:
etcsl_data = etcsl_data[['id_text', 'text_name', 'text']]
etcsl_data = etcsl_data.dropna()
etcsl_data.head()

Group the rows by `id_text` and apply the `join` function to the `text` column. Transform the aggregated data into a new DataFrame.

In [None]:
etcsl_bytext = etcsl_data['text'].groupby(etcsl_data['id_text']).apply(' '.join)
etcsl_bytext_df = pd.DataFrame(etcsl_bytext)
etcsl_bytext_df.head()
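
As a small illustration of the grouping step, the sketch below joins the lines of two made-up compositions. The `id_text` values and lemmas are hypothetical (not taken from ETCSL), and `agg(' '.join)` is used here as an equivalent of the `apply(' '.join)` call above.

```python
import pandas as pd

# Hypothetical line-by-line data: one row per line, as in the cleaned files.
lines = pd.DataFrame({
    "id_text": ["c.0.0.1", "c.0.0.1", "c.0.0.2"],
    "text": ["lugal[king]N e[house]N", "du[build]V/t", "kur[mountain]N"],
})

# Concatenate all lines that share an id_text into one string per composition.
by_text = lines.groupby("id_text")["text"].agg(" ".join)
print(by_text)
```

Each composition is now represented by a single string of space-separated lemmas, which is the input format the vectorizer expects.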

Create a DataFrame of `id_text` and `text_name` equivalencies, with `id_text` set as index (row names). Then merge this DataFrame with `etcsl_bytext_df` using the indexes.

In [None]:
etcsl_id_names = etcsl_data[['id_text', 'text_name']].drop_duplicates().set_index('id_text')
etcsl_data_df = pd.merge(etcsl_id_names, etcsl_bytext_df, right_index=True, left_index=True)
etcsl_data_df.head()

Transform the DataFrame into a Document Term Matrix (DTM) by using `CountVectorizer`. This function uses a Regular Expression (`token_pattern`) to indicate how to find the beginning and end of each word (or token). In lemmatized Sumerian, a space indicates the boundary between two lemmas. The expression `r'[^ ]+'` means: any sequence of one or more characters, except the space.

The output of the CountVectorizer (`etcsl_dtm`) is not in a human-readable format. It is transformed into another DataFrame, with the ETCSL numbers as index.

In [None]:
cv = CountVectorizer(analyzer='word', token_pattern=r'[^ ]+')
etcsl_dtm = cv.fit_transform(etcsl_data_df['text'])
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
etcsl_dtm_df = pd.DataFrame(etcsl_dtm.toarray(), columns = cv.get_feature_names_out(), index = etcsl_data_df.index.values)
etcsl_dtm_df.head()
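
To see what the custom `token_pattern` does, the following sketch rebuilds a tiny Document Term Matrix with the standard library only. The two texts are hypothetical examples in `lemma[gloss]POS` form, and the lowercasing mimics the default behaviour of `CountVectorizer`. It is meant as an illustration of the counting logic, not as a replacement for the cell above.

```python
import re
from collections import Counter

# Hypothetical lemmatized texts (not taken from ETCSL).
docs = {
    "text_a": "lugal[king]N e[house]N du[build]V/t",
    "text_b": "lugal[king]N lugal[king]N",
}

# One token = a run of one or more non-space characters.
token_pattern = re.compile(r"[^ ]+")

# CountVectorizer lowercases by default, so the sketch does the same.
counts = {name: Counter(token_pattern.findall(text.lower()))
          for name, text in docs.items()}

# Columns are the sorted vocabulary; rows are the documents.
vocabulary = sorted(set().union(*counts.values()))
dtm = [[counts[name][term] for term in vocabulary] for name in docs]

print(vocabulary)
print(dtm)
```

The default `token_pattern` of `CountVectorizer` only matches runs of two or more word characters, so it would split a lemma such as `lugal[king]N` at the brackets. Matching on non-space characters keeps each lemma, gloss, and part-of-speech tag together as one token.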

# Compute Text Length
The length of each composition in ETCSL may be computed by adding up all the numbers in a row of the Document Term Matrix. The text length will be used at various places in further computations. Text length is added as a column to `etcsl_data_df` for future use.

In [None]:
etcsl_data_df['length'] =  etcsl_dtm_df.sum(axis=1)
etcsl_data_df.head()

# Normalize
Each row in the Document Term Matrix may be perceived as a vector with 4296 positions. The length of each vector depends on the number of lemmatized words. In order to make the vectors more comparable, we may set vector length at 1. This is done by dividing each value by the length of the vector, which equals the square root of the sum of the squared values: 

vector length = $\sqrt{\sum_{i=1}^nX^2_i}$.

We now have two versions of the same Document Term Matrix: `etcsl_dtm_df` (with regular vocabulary counts) and `etcsl_dtm_df_norm` with normalized counts.

In [None]:
etcsl_dtm_df_norm = etcsl_dtm_df.apply(lambda x: (x / np.sqrt(sum(np.square(x)))), axis = 1)
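
A quick sanity check of the normalization, sketched on a made-up count matrix: after dividing each row by its Euclidean length, every row should have length 1, and two texts with the same proportions of words should end up with identical vectors regardless of their size.

```python
import numpy as np

# Toy count matrix (rows = documents, columns = terms); the values are made up.
counts = np.array([[3.0, 4.0, 0.0],
                   [30.0, 40.0, 0.0],   # same proportions as row 0, ten times longer
                   [1.0, 0.0, 0.0]])

# Euclidean length of each row, kept as a column so that it broadcasts.
lengths = np.sqrt(np.square(counts).sum(axis=1, keepdims=True))
normalized = counts / lengths

print(normalized)
print(np.linalg.norm(normalized, axis=1))
```

The third row also shows the effect discussed in the next section: a text with a single word gets the full weight of 1 on that word.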

# Eliminate very short compositions
Some of the compositions in [ETCSL](http://etcsl.orinst.ox.ac.uk) are very short because they are only known from catalog texts or they exist only in a single (fragmentary) exemplar. Normalizing will dramatically increase the weight of the few words that appear in those texts. In addition to normalizing, therefore, we will also eliminate texts that have fewer than 50 lemmatized words. The regular DTM (`etcsl_dtm_df`) still contains all 394 compositions.

The following 39 compositions have been eliminated from the normalized DTM (listed by length of the composition).

In [None]:
etcsl_dtm_df_norm = etcsl_dtm_df_norm[etcsl_data_df['length'] > 49]
etcsl_data_df[['text_name', 'length']][etcsl_data_df['length'] < 50].sort_values(by='length')

# Distances and Clustering
From here on use the explanations in the blog by [Jörn Hees](https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/) for hierarchical clustering and creating dendrograms for step 1 of the clustering analysis (step 2 is K-means clustering). Check also the text clustering Notebook by [Brandon Rose](http://brandonrose.org/clustering), which combines K-means, hierarchical clustering and topic modeling.

In [None]:
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, ward
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist, squareform

#import numpy as np

In [None]:
# Show the graphs inline in the notebook. Disable the %matplotlib line
# in order to get the interactive matplotlib GUI instead.
%matplotlib inline
np.set_printoptions(precision=5, suppress=True)  # suppress scientific float notation

# The function dendro()

The function `dendro()` draws a dendrogram of a hierarchical clustering analysis, using the Ward method for the linkage and Euclidean distances. The `dendro()` function will be used to try different approaches to hierarchical clustering: clustering the entire corpus in raw counts; clustering with normalized vectors; and clustering with `tf-idf` weighted counts. In addition, the function is used to cluster a sub-corpus: the heroic narratives around Gilgamesh, Enmerkar, and Lugalbanda.

The `dendro()` function draws a horizontal dendrogram. It takes four parameters: the DTM that is to be analyzed; the labels to be used for each data point (a list, to be derived from the DTM); the label to be printed along the Y-axis (describing the dataset); and the height of the figure (an integer). The analysis of the entire corpus will produce a very large image. Right-click the image and open it in a new browser tab to inspect it in more detail.

The function returns the linkage matrix Z, which may be used in computing the Cophenetic Correlation Coefficient.

In [None]:
def dendro(dtm, labels, ylabel, height = 10):
    # generate the linkage matrix
    Z = linkage(dtm, 'ward')
    # calculate full dendrogram
    plt.figure(figsize=(25, height))
    plt.title('Hierarchical Clustering Dendrogram')
    plt.xlabel('distance')
    plt.ylabel(ylabel)
    dendrogram(
        Z,
        labels = labels,
        orientation = 'right',
#        leaf_rotation=90.,  # rotates the x axis labels
        leaf_font_size=10.,  # font size for the x axis labels
    )
    plt.show()
    return Z

In [None]:
Z = dendro(etcsl_dtm_df, etcsl_data_df['text_name'], 'full etcsl corpus\nno normalization', 50)
c, coph_dists = cophenet(Z, pdist(etcsl_dtm_df))
c
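
The Cophenetic Correlation Coefficient `c` compares the pairwise distances in the original data with the heights at which the same pairs are merged in the dendrogram. Values close to 1 suggest that the tree preserves the original distances well. A minimal sketch on synthetic data (two artificial, well-separated groups of points, not ETCSL data):

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Two synthetic groups of five points each, far apart from one another.
points = np.vstack([rng.normal(0.0, 0.1, size=(5, 3)),
                    rng.normal(5.0, 0.1, size=(5, 3))])

# Ward linkage on Euclidean distances, as in dendro().
Z = linkage(points, "ward")

# c is the correlation; coph_dists holds the cophenetic distances in condensed form.
c, coph_dists = cophenet(Z, pdist(points))
print(Z.shape, round(float(c), 3))
```

With such clearly separated groups the coefficient should be very high. Real corpora will usually score lower, so the value is mainly useful for comparing different representations of the same data.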

In [None]:
labels = etcsl_data_df['text_name'][etcsl_data_df['length'] > 49]
Z = dendro(etcsl_dtm_df_norm, labels, 'normalized etcsl corpus (vector length = 1)\ntexts < 50 lemmatized words omitted', 50)
c, coph_dists = cophenet(Z, pdist(etcsl_dtm_df_norm))
c

# Select
Select only the Gilgamesh, Enmerkar and Lugalbanda stories.

In [None]:
heroic = ['c.1.8.1.1', 'c.1.8.1.2', 'c.1.8.1.3', 'c.1.8.1.4', 'c.1.8.1.5', 'c.1.8.1.5.1', 'c.1.8.2.1', 
          'c.1.8.2.2', 'c.1.8.2.3', 'c.1.8.2.4'
         ]
# .ix was removed in pandas 1.0; .loc selects by index label
labels = list(etcsl_data_df.loc[heroic, 'text_name'])


Compute a distance matrix [or perhaps skip this step]

In [None]:
X = etcsl_dtm_df.loc[heroic]
dist = pdist(X)
heroic_dist_df = pd.DataFrame(squareform(dist)).round(2)
heroic_dist_df.index = labels
heroic_dist_df.columns = labels
heroic_dist_df
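
For readers unfamiliar with `pdist` and `squareform`, here is a small sketch with three made-up points. `pdist` returns the distances in condensed form (one value per pair), and `squareform` expands them into the symmetric matrix that is shown as a DataFrame above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three made-up points on a line through the origin.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

# Euclidean distance is the default metric; order is (0,1), (0,2), (1,2).
condensed = pdist(X)

# Symmetric matrix with zeros on the diagonal.
square = squareform(condensed)

print(condensed)
print(square)
```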

In [None]:
Z = dendro(etcsl_dtm_df.loc[heroic], labels, 'heroic narratives\nno normalization')

# Normalized

In [None]:
X = etcsl_dtm_df_norm.loc[heroic]
dist = pdist(X)
heroic_dist_df = pd.DataFrame(squareform(dist)).round(2)
heroic_dist_df.index = labels
heroic_dist_df.columns = labels
heroic_dist_df
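
A side note on what these distances measure: for vectors of length 1, the squared Euclidean distance equals `2 - 2 * cosine similarity`, so the normalized distances rank pairs of texts in the same order as cosine similarity would. The sketch below checks this identity on two made-up vectors.

```python
import numpy as np

# Two made-up count vectors.
a = np.array([3.0, 4.0, 0.0])
b = np.array([1.0, 0.0, 2.0])

# Scale both to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

euclidean = np.linalg.norm(a_unit - b_unit)
cosine = float(a_unit @ b_unit)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(euclidean ** 2, 2 - 2 * cosine)
```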

In [None]:
Z = dendro(etcsl_dtm_df_norm.loc[heroic], labels, 'heroic narratives normalized')

Is this due to personal names? Apparently not.

In [None]:
heroic_dtm = etcsl_dtm_df_norm.loc[heroic]

In [None]:
qpn = {'cn', 'dn', 'en', 'fn', 'gn', 'mn', 'on', 'pn', 'rn', 'sn', 'tn', 'wn'}
words = heroic_dtm.columns
words_no_names = [word for word in words if not word[-2:] in qpn]

In [None]:
heroic_no_names_dtm = heroic_dtm[words_no_names]
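
The filter above relies on the last two characters of each column name, that is, on the part-of-speech tag that closes each lemma. The sketch below applies the same test to a few hypothetical, lowercased column names to show which ones survive.

```python
# Part-of-speech tags for the various kinds of proper nouns, as in the cell above.
qpn = {'cn', 'dn', 'en', 'fn', 'gn', 'mn', 'on', 'pn', 'rn', 'sn', 'tn', 'wn'}

# Hypothetical column names in lemma[gloss]pos form (lowercased by the vectorizer).
vocabulary = ["enlil[1]dn", "lugal[king]n", "unug[1]sn", "du[build]v/t"]

# "lugal[king]n" ends in "]n" and "du[build]v/t" in "/t", so neither matches a tag in qpn.
kept = [word for word in vocabulary if word[-2:] not in qpn]
dropped = [word for word in vocabulary if word[-2:] in qpn]

print(kept)
print(dropped)
```

Because every lemma has a closing bracket before its tag, a common noun ending in `]n` cannot be mistaken for one of the two-letter proper noun tags.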

In [None]:
Z = dendro(heroic_no_names_dtm, labels, 'heroic narratives normalized \n no proper names')

# Testing

The entire dendrogram without proper names.

In [None]:
words = etcsl_dtm_df_norm.columns
words_no_names = [word for word in words if not word[-2:] in qpn]
etcsl_no_names_dtm = etcsl_dtm_df_norm[words_no_names]
labels = etcsl_data_df['text_name'][etcsl_data_df['length'] > 49]
Z = dendro(etcsl_no_names_dtm, labels, 'etcsl normalized \n no proper names', 50)

# TF-IDF
tf-idf is an alternative vectorizer that weighs words and normalizes for text length. Texts that are too short (< 50 lemmatized words) are omitted before applying `tf-idf`. Among the settings for `tf-idf` are `max_df` and `min_df` or maximum and minimum document frequency. Document frequency refers to the percentage of the documents that contain a certain term at least once. Common values are 80% and 20% - that is, terms that appear in more than 80% or less than 20% of the documents in the corpus are excluded from the analysis. This setting excludes far too many words for the present purpose (it keeps only 178 out of more than 4,200 unique words). Currently `max_df` and `min_df` are set to 98% and 2%, respectively (keeping 1185 unique terms) - but more research is needed to find an appropriate setting.
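
The effect of `min_df` and `max_df` can be sketched without scikit-learn. The five texts below are hypothetical, and the thresholds (40% and 80%) are chosen only to make the example small. The comparison follows the rule described above: a term is dropped when its document frequency is above `max_df` or below `min_df`.

```python
# Five hypothetical lemmatized texts (not taken from ETCSL).
docs = [
    "lugal[king]n e[house]n",
    "lugal[king]n kur[mountain]n",
    "lugal[king]n e[house]n",
    "lugal[king]n dumu[child]n",
    "lugal[king]n e[house]n",
]

min_df, max_df = 0.4, 0.8  # illustrative thresholds, not the notebook's settings

# Document frequency: share of documents that contain the term at least once.
vocabulary = sorted({term for doc in docs for term in doc.split(" ")})
doc_freq = {term: sum(term in doc.split(" ") for doc in docs) / len(docs)
            for term in vocabulary}

kept = [term for term in vocabulary if min_df <= doc_freq[term] <= max_df]

print(doc_freq)
print(kept)
```

Here `lugal[king]n` is dropped because it occurs in every text, while `kur[mountain]n` and `dumu[child]n` are dropped because each occurs in only one text out of five.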

In [None]:
long = etcsl_data_df['length'] > 49
etcsl_noshort_df = etcsl_data_df[long]

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.98, max_features=200000,
                                 min_df=0.02,
                                 use_idf=True, token_pattern=r'[^ ]+', ngram_range=(1,1))
tfidf_matrix = tfidf_vectorizer.fit_transform(etcsl_noshort_df['text'])
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
etcsl_tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns = tfidf_vectorizer.get_feature_names_out(), 
                              index = etcsl_noshort_df.index.values)
etcsl_tfidf_df.head()

In [None]:
labels = etcsl_noshort_df['text_name']
Z = dendro(etcsl_tfidf_df, labels, 'etcsl tfidf', 50)

tf-idf without proper nouns. Note that in this dendrogram *Enmerkar and Ensuhgirana* and *Enmerkar and the Lord of Aratta* end up in entirely different branches of the tree. The two *Lugalbanda* stories are still together.

In [None]:
labels = etcsl_noshort_df['text_name']
words = etcsl_tfidf_df.columns
words_no_names = [word for word in words if not word[-2:] in qpn]
Z = dendro(etcsl_tfidf_df[words_no_names], labels, 'etcsl tf-idf\nno proper nouns', 50)

In [None]:
labels = list(etcsl_data_df.loc[heroic, 'text_name'])
Z = dendro(etcsl_tfidf_df.loc[heroic], labels, 'tf-idf heroic narratives')

One can also look at n-grams using tf-idf (keep all other parameters the same).

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.98, min_df=0.02,
                                 use_idf=True, token_pattern=r'[^ ]+', ngram_range=(1,4))
tfidf_matrix = tfidf_vectorizer.fit_transform(etcsl_noshort_df['text'])
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
etcsl_tfidf_ngrams_df = pd.DataFrame(tfidf_matrix.toarray(), columns = tfidf_vectorizer.get_feature_names_out(), 
                              index = etcsl_noshort_df.index.values)
etcsl_tfidf_ngrams_df.head()

This gives a very nice clustering - but one may ask whether some of that clustering is based primarily on PNs?

In [None]:
labels = etcsl_noshort_df['text_name']
Z = dendro(etcsl_tfidf_ngrams_df, labels, 'etcsl tfidf\nn-gram range 1-4', 50)

Kicking the names out still gives an interesting picture. Note that Enmerkar and Ensuhgirana and Enmerkar and the Lord of Aratta do not cluster directly anymore - but are still pretty close.

In [None]:
labels = etcsl_noshort_df['text_name']
words = etcsl_tfidf_ngrams_df.columns
words_no_names = [word for word in words if not word[-2:] in qpn]
Z = dendro(etcsl_tfidf_ngrams_df[words_no_names], labels, 'etcsl tf-idf\nn-grams 1-4\nno proper nouns', 50)
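
One thing that may be worth checking here: with n-grams, each column name consists of up to four lemmas separated by spaces, and `word[-2:]` only looks at the end of the last lemma. An n-gram with a proper noun in a non-final position therefore passes the filter. The sketch below uses hypothetical n-grams to show the difference between this suffix test and a stricter, token-by-token test (the stricter variant is a suggestion, not part of the original analysis).

```python
qpn = {'cn', 'dn', 'en', 'fn', 'gn', 'mn', 'on', 'pn', 'rn', 'sn', 'tn', 'wn'}

# Hypothetical n-gram column names (lemmas joined by single spaces).
ngrams = [
    "enmerkar[1]rn",
    "lugal[king]n enmerkar[1]rn",
    "enmerkar[1]rn lugal[king]n",
    "lugal[king]n e[house]n",
]

# Suffix test as used above: only the final lemma of the n-gram is inspected.
suffix_only = [g for g in ngrams if g[-2:] not in qpn]

# Stricter alternative: drop the n-gram if any of its lemmas is a proper noun.
strict = [g for g in ngrams
          if all(token[-2:] not in qpn for token in g.split(" "))]

print(suffix_only)
print(strict)
```

If the stricter behaviour is wanted, the same comprehension could be applied to `etcsl_tfidf_ngrams_df.columns` in place of the suffix test.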

In [None]:
labels = list(etcsl_data_df.loc[heroic, 'text_name'])
Z = dendro(etcsl_tfidf_ngrams_df.loc[heroic], labels, 'tf-idf n-grams 1-4\nheroic narratives')