# Part 2: Analysis based on presence/absence of entries

One way to explore the similarity of texts is through patterns of shared words (or, in our case, entries). This notebook implements an analysis of shared entries among documents.

In [1]:
import pandas as pd
import numpy as np
import re

## Load the Data

We structured the data in a previous notebook (`02_structure_data.ipynb`) and exported the results to the directory `data/pass`. Here we load that structured data for further analysis.

In [2]:
# Provide a filename. 
filename = input('Filename: ')

Filename: Q39_par.csv


In [3]:
# Read in the data in that file. 
file = '../data/pass/' + filename
df = pd.read_csv(file)

In [4]:
# Preview data
df.head()

Unnamed: 0,id_text,id_line,label,lemma,base,skip,entry
0,P117395,2,o 1,ŋešed[key]N,{ŋeš}e₃-a,0.0,ŋešed[key]N
1,P117395,3,o 2,pakud[~tree]N,{ŋeš}pa-kud,0.0,pakud[~tree]N
2,P117395,4,o 3,raba[clamp]N,{ŋeš}raba,0.0,raba[clamp]N
3,P117404,2,o 1,ig[door]N eren[cedar]N,{ŋeš}ig {ŋeš}eren,0.0,ig[door]N_eren[cedar]N
4,P117404,3,o 2,ig[door]N dib[board]N,{ŋeš}ig dib,0.0,ig[door]N_dib[board]N


The columns contain the following information:

- id_text: the ID for the text in [ORACC](http://oracc.museum.upenn.edu/dcclt/pager).
- id_line: an abstract line ID within the specified text (integer).
- label: the human-legible line number (e.g. o i 4 means obverse column 1 line 4, according to common Assyriological conventions).
- lemma: the lemmatized text.
- base: the transliterated text.
- skip: the number of lines skipped/missing.
- entry: the complete entry on the specified line, with words separated by underscores.
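
The relationship between the `lemma` and `entry` columns can be illustrated with a small sketch. It assumes that an entry is simply the lemmatized line with its spaces replaced by underscores, which is what the preview above suggests; the helper name `lemma_to_entry` is hypothetical and does not come from the earlier notebook.

```python
def lemma_to_entry(lemma: str) -> str:
    """Join the words of a lemmatized line with underscores.

    Hypothetical helper: it only mirrors the pattern visible in the preview,
    not necessarily the code used in 02_structure_data.ipynb.
    """
    return "_".join(lemma.split())


# A two-word line becomes a single entry; a one-word line is unchanged.
two_words = lemma_to_entry("ig[door]N eren[cedar]N")
one_word = lemma_to_entry("raba[clamp]N")
print(two_words)
print(one_word)
```

Treating the whole line as one unit is what allows the later steps to compare documents by entry rather than by individual word.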

## Group Entries by Document
The `groupby()` function is used to group the data by document. Applying `' '.join` to each group concatenates the text in the `entry` column, separating the entries with a single space. Missing entries are first replaced by empty strings, because `join` only accepts strings. The grouped result is a Pandas Series, which `reset_index()` then transforms into a new DataFrame.

The outcome is a DataFrame in which each row contains all of the entries of a single document.
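
To make the grouping step concrete, here is a minimal sketch in plain Python of what the `groupby()` and `' '.join` combination computes. It is an illustration under stated assumptions, not the notebook's own code: the toy rows and the names `rows`, `grouped`, and `by_text` are invented for the example, and `None` stands in for a missing value (`NaN`) in the `entry` column.

```python
from collections import defaultdict

# Toy rows of (id_text, entry); the second row of P117404 has a missing entry.
rows = [
    ("P117395", "ŋešed[key]N"),
    ("P117395", "pakud[~tree]N"),
    ("P117404", "ig[door]N_eren[cedar]N"),
    ("P117404", None),
    ("P117404", "ig[door]N_dib[board]N"),
]

grouped = defaultdict(list)
for id_text, entry in rows:
    # Mirror fillna(''): a missing entry becomes an empty string,
    # because str.join() raises a TypeError on non-string values.
    grouped[id_text].append(entry if entry is not None else "")

# One space-separated string of entries per document.
by_text = {id_text: " ".join(entries) for id_text, entries in grouped.items()}

for id_text, text in by_text.items():
    print(id_text, "->", text)
```

An empty entry leaves two consecutive spaces in the joined string. That is harmless here, because the tokenizer used later treats any run of non-space characters as one entry.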

In [5]:
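# Replace missing entries with empty strings so that ' '.join does not fail on NaN.
df['entry'] = df['entry'].fillna('')
entries_df = df[['id_text', 'id_line', 'entry']]
#entries_df = entries_df.dropna()
# Concatenate all entries of each document into one space-separated string.
grouped = entries_df['entry'].groupby(entries_df['id_text']).apply(' '.join).reset_index()
by_text_df = pd.DataFrame(grouped)
by_text_df = by_text_df.set_index('id_text')
by_text_df.head()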

id_text,entry
P117395,ŋešed[key]N pakud[~tree]N raba[clamp]N
P117404,ig[door]N_eren[cedar]N ig[door]N_dib[board]N i...
P128345,garig[comb]N_siki[hair]N garig[comb]N_siki-sik...
P224980,gigir[chariot]N e[house]N_gigir[chariot]N e[ho...
P224986,guza[chair]N_anše[equid]N guza[chair]N_kaskal[...


In [6]:
len(by_text_df)

138

## Create a Document Term Matrix

A [document term matrix](https://en.wikipedia.org/wiki/Document-term_matrix) (DTM) is "a mathematical matrix that describes the frequency of terms that occur in a collection of documents."

Here we transform the dataframe into a DTM with scikit-learn's `CountVectorizer`. It uses a regular expression (`token_pattern`) to determine where each token begins and ends. In the current dataframe, entries are separated from each other by a space. The expression `r"[^ ]+"` matches any run of one or more characters other than the space, so each entry becomes exactly one token.

The output of the CountVectorizer (`dtm`) is not in a human-readable format. It is transformed into another dataframe, with `id_text` as index.

This returns a dataframe where each row is a single document, with one column for each entry in the entire collection of documents. The number in each cell represents the number of times that a specific entry appears in that document. Most cells will be zero, as most entries appear in only a few documents. Because `CountVectorizer` lowercases its input by default, the part-of-speech tags appear in lowercase in the column names (e.g. `n` rather than `N`).
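
The effect of the custom `token_pattern` can be checked with the standard `re` module alone. The sketch below applies two patterns to one toy document: the pattern used in this notebook, and `r"(?u)\b\w\w+\b"`, which is the default `token_pattern` of `CountVectorizer`. The lowercasing step imitates the vectorizer's default `lowercase=True`; the variable names are assumptions made for the example.

```python
import re

# One toy document: two entries separated by a space.
doc = "ig[door]N_eren[cedar]N ig[door]N_dib[board]N"

# CountVectorizer lowercases the text before tokenizing (lowercase=True by default).
lowered = doc.lower()

# Pattern used in this notebook: every run of non-space characters is one token.
custom = re.findall(r"[^ ]+", lowered)

# Default CountVectorizer pattern: runs of two or more word characters.
default = re.findall(r"(?u)\b\w\w+\b", lowered)

print(custom)   # ['ig[door]n_eren[cedar]n', 'ig[door]n_dib[board]n']
print(default)  # fragments split at the square brackets
```

With the default pattern the square brackets act as token boundaries, so entries would be broken into fragments such as `door` and `cedar`. The custom pattern keeps each entry intact.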

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
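# Treat every run of non-space characters as one token, so each entry stays intact.
cv = CountVectorizer(analyzer='word', token_pattern=r'[^ ]+')
dtm = cv.fit_transform(by_text_df['entry'])
# get_feature_names_out() requires scikit-learn 1.0 or later; the older
# get_feature_names() was removed in scikit-learn 1.2.
dtm_df = pd.DataFrame(dtm.toarray(), columns = cv.get_feature_names_out(), index = by_text_df.index.values)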
dtm_df.head()

Unnamed: 0,1/2[na]na_giŋ[unit]n,5/6[na]na_sila[unit]n,a[arm]n_apin[plow]n,a[arm]n_diš[na]na,a[arm]n_gud[ox]n_apin[plow]n,a[arm]n_umbin[wheel]n_margida[cart]n,a[arm]n_ŋešrin[scales]n,ab[cow]n,ab[cow]n_ib[hips]n_gig[sick]v/i,ab[cow]n_mah[mature]v/i,...,šuʾa[stool]n_burgul[stone-cutter]n,šuʾa[stool]n_dus[bathroom]n,šuʾa[stool]n_kaskal[way]n,šuʾa[stool]n_nagar[carpenter]n,šuʾa[stool]n_niŋgula[greatness]n,šuʾa[stool]n_suhsah[sound]n,šuʾa[stool]n_tibira[sculptor]n,šuʾa[stool]n_x[na]na,šuʾa[stool]n_šu[hand]n,šuʾa[stool]n_šuʾi[barber]n
P117395,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
P117404,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
P128345,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
P224980,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
P224986,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
len(dtm_df)

138

## Analyzing the DTM
Each document in the DTM may be understood as a vector, which allows for various kinds of computations, such as distance or cosine-similarity. 

It is important to recall that the DTM does not preserve information about the order of entries.

It is also important to realize that the documents in this analysis differ greatly in length (from 2 to 778 entries), with about half of the documents containing 5 entries or fewer. The composite text from Nippur is by far the longest document and will dominate any comparison.
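
The points above can be illustrated with a small sketch that treats each document as a vector of entry counts and compares the vectors by cosine similarity. It uses only the standard library; the toy documents and the helper `cosine` are assumptions made for the example, not part of the analysis carried out in the separate `R` notebook.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two entry-count vectors (illustrative helper)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Same entries in a different order: the count vectors are identical.
doc_a = Counter("ig[door]n gigir[chariot]n guza[chair]n".split())
doc_b = Counter("guza[chair]n ig[door]n gigir[chariot]n".split())

# A two-entry document fully contained in an eight-entry document.
short_doc = Counter("ig[door]n gigir[chariot]n".split())
long_doc = Counter(
    "ig[door]n gigir[chariot]n guza[chair]n apin[plow]n "
    "ab[cow]n raba[clamp]n ŋešed[key]n margida[cart]n".split()
)

order_free = cosine(doc_a, doc_b)         # close to 1.0: order is not preserved
contained = cosine(short_doc, long_doc)   # about 0.5, despite full containment
print(round(order_free, 3), round(contained, 3))
```

The second comparison shows why length matters: every entry of the short document occurs in the long one, yet the similarity is only about 0.5, because the long document contains many entries that the short one lacks.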

In [9]:
# Sum each row to get the number of entries per document, then summarize

df_length = dtm_df.sum(axis=1)
df_length.describe()

count    138.000000
mean      58.014493
std      134.665901
min        2.000000
25%        3.000000
50%        5.500000
75%       29.000000
max      778.000000
dtype: float64

The above output provides descriptive statistics of the lengths of the texts in this corpus. For example, the `max` represents the length of the longest text; `min` the length of the shortest.  

Further analysis of the DTM is carried out in `R` in a separate notebook. 

In [10]:
len(dtm_df)

138

In [11]:
# Write the data out to csv
save_file = "../data/pass/" + filename[:-4] + '_dtm.csv'
dtm_df.to_csv(save_file)