# Part 2: Analysis based on presence/absence of entries

One way to explore the similarity of texts is through patterns of shared words (or, in our case, shared entries). This notebook implements an analysis of shared entries among documents.

In [1]:
import pandas as pd
import numpy as np
import re

## Load the Data

We structured the data in a previous notebook (`02_structure_data.ipynb`) and exported the results in the directory `data/pass`. Here we load that structured data in for further analysis.

In [2]:
# Provide a filename. 
filename = input('Filename: ')

Filename: Q39_par.csv


In [3]:
# Read in the data in that file. 
file = '../data/pass/' + filename
#df = pd.read_csv(file, dtype={'extent': object}).drop('Unnamed: 0', axis=1)
df = pd.read_csv(file)

Note to Niek: The "drop Unnamed" and "dtype extent" arguments appear to be outdated. Are these needed? 

In [4]:
# Preview data
df.head()

Unnamed: 0,id_text,id_line,label,lemma,base,skip,entry
0,P224980,4,o i 1,gigir[chariot]N,{ŋeš}gigir,0.0,gigir[chariot]N
1,P224980,5,o i 2,e[house]N gigir[chariot]N,{ŋeš}e₂ gigir,0.0,e[house]N_gigir[chariot]N
2,P224980,6,o i 3,e[house]N usan[whip]N gigir[chariot]N,{ŋeš}e₂ usan₃ gigir,0.0,e[house]N_usan[whip]N_gigir[chariot]N
3,P224986,4,o i 1,guza[chair]N anše[equid]N,{ŋeš}gu-za anše,0.0,guza[chair]N_anše[equid]N
4,P224986,5,o i 2,guza[chair]N kaskal[way]N,{ŋeš}gu-za kaskal,0.0,guza[chair]N_kaskal[way]N


The columns contain the following information:

- id_text: The ID for the text in [ORACC](http://oracc.museum.upenn.edu/dcclt/pager).  
- id_line: The line number of the specified text. 
- label: The human-readable line label (for example, `o i 1`).  
- lemma: The lemmatized text.  
- base: The transliterated text.  
- skip: Number of lines skipped/missing. 
- entry: The complete entry on the specified line, with words separated by underscores.

## Group Entries by Document
The `groupby()` function groups the data by document (`id_text`). Calling `apply(' '.join)` on the grouped `entry` column concatenates the entries of each document, separated by a single space. The result is a Series, which `reset_index()` turns into a new DataFrame; `id_text` is then set as its index.  

This transforms the data into a dataframe in which each row contains all of the entries of a single document. 

In [6]:
df['entry'] = df['entry'].fillna('')
entries_df = df[['id_text', 'id_line', 'entry']]
#entries_df = entries_df.dropna()
grouped = entries_df['entry'].groupby(entries_df['id_text']).apply(' '.join).reset_index()
by_text_df = pd.DataFrame(grouped)
by_text_df = by_text_df.set_index('id_text')
by_text_df.head()

id_text,entry
P224980,gigir[chariot]N e[house]N_gigir[chariot]N e[ho...
P224986,guza[chair]N_anše[equid]N guza[chair]N_kaskal[...
P224994,{ŋeš}x-x[NA]NA {ŋeš}SI-x[NA]NA {ŋeš}šu-x[NA]NA
P224996,guza[chair]N guza[chair]N_gid[long]V/i guza[ch...
P225006,ig[door]N suku[pole]N_ig[door]N zara[pivot]N_i...
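
The grouping step can be tried on a small example. The following is a minimal sketch on hypothetical toy data (the IDs `T1` and `T2` and the name `toy_by_text` are illustrative assumptions, not part of the dataset); it groups the `entry` column by `id_text` and joins the entries of each document with a space.

```python
import pandas as pd

# Hypothetical toy data: two documents, three lines in total.
toy = pd.DataFrame({
    'id_text': ['T1', 'T1', 'T2'],
    'entry': ['gigir[chariot]N', 'e[house]N_gigir[chariot]N', 'guza[chair]N'],
})

# Group the entry column by document and join each group's entries with a space.
joined = toy.groupby('id_text')['entry'].apply(' '.join)

# The grouped result is a Series indexed by id_text; to_frame() makes it a DataFrame.
toy_by_text = joined.to_frame()
print(toy_by_text)
```

Each row of `toy_by_text` now holds one document, with its entries in their original order.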


## Create a Document Term Matrix

A [document term matrix](https://en.wikipedia.org/wiki/Document-term_matrix) (DTM) is "a mathematical matrix that describes the frequency of terms that occur in a collection of documents."

Here we transform the dataframe into a DTM with scikit-learn's `CountVectorizer`. This class uses a regular expression (`token_pattern`) to decide what counts as a token. In the current dataframe, entries are separated from each other by a space. The expression `r'[^ ]+'` means: any sequence of one or more characters other than a space. Note that `CountVectorizer` lowercases its input by default, which is why the entries in the column headers of the output appear in lowercase.

The output of the `CountVectorizer` (`dtm`) is a sparse matrix, which is not in a human-readable format. It is transformed into another dataframe, with `id_text` as index.

This returns a dataframe where each row is a single document, and there is one column for each entry in the entire collection of documents. The number in each cell represents the number of times that a specific entry appears in that document. Most cells are zero, since most entries appear in only a few documents.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(analyzer='word', token_pattern=r'[^ ]+')
dtm = cv.fit_transform(by_text_df['entry'])
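# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() requires version 1.0 or later.
dtm_df = pd.DataFrame(dtm.toarray(), columns = cv.get_feature_names_out(), index = by_text_df.index.values)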
dtm_df.head()

Unnamed: 0,1/2[na]na_giŋ[unit]n,5/6[na]na_sila[unit]n,a[arm]n_apin[plow]n,a[arm]n_diš[na]na,a[arm]n_gud[ox]n_apin[plow]n,a[arm]n_umbin[wheel]n_margida[cart]n,a[arm]n_ŋešrin[scales]n,ab[cow]n,ab[cow]n_ib[hips]n_gig[sick]v/i,ab[cow]n_mah[mature]v/i,...,šuʾa[stool]n_burgul[stone-cutter]n,šuʾa[stool]n_dus[bathroom]n,šuʾa[stool]n_kaskal[way]n,šuʾa[stool]n_nagar[carpenter]n,šuʾa[stool]n_niŋgula[greatness]n,šuʾa[stool]n_suhsah[sound]n,šuʾa[stool]n_tibira[sculptor]n,šuʾa[stool]n_x[na]na,šuʾa[stool]n_šu[hand]n,šuʾa[stool]n_šuʾi[barber]n
P224980,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
P224986,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
P224994,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
P224996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
P225006,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
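
To see what the token pattern does, the tokenization can be imitated by hand. This is a rough sketch using only the standard library, not `CountVectorizer` itself: it lowercases a hypothetical document string and collects every match of `r'[^ ]+'`, which is, as far as tokenization goes, what the vectorizer does with these settings.

```python
import re
from collections import Counter

# Hypothetical document: three entries separated by single spaces.
doc = 'gigir[chariot]N e[house]N_gigir[chariot]N gigir[chariot]N'

# Lowercase first (the CountVectorizer default), then take every run of
# non-space characters as one token.
tokens = re.findall(r'[^ ]+', doc.lower())

# Count how often each token (entry) occurs in the document.
counts = Counter(tokens)

print(tokens)
print(counts)
```

Because the underscore is not a space, a multi-word entry such as `e[house]n_gigir[chariot]n` stays together as a single token.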


## Analyzing the DTM
Each document in the DTM may be understood as a vector, which allows for various kinds of computations, such as distance or cosine-similarity. 

It is important to recall that the DTM does not preserve information about the order of entries.
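
The loss of order can be illustrated with a small sketch. The two document strings below are hypothetical; they contain the same entries in a different sequence, and counting the entries (as the DTM does) makes them indistinguishable.

```python
from collections import Counter

# Hypothetical documents: the same three entries in different orders.
doc_1 = 'gigir[chariot]N e[house]N_gigir[chariot]N guza[chair]N_kaskal[way]N'
doc_2 = 'guza[chair]N_kaskal[way]N gigir[chariot]N e[house]N_gigir[chariot]N'

# Count entries per document, ignoring their position.
counts_1 = Counter(doc_1.split(' '))
counts_2 = Counter(doc_2.split(' '))

print(doc_1 == doc_2)        # compares the texts, including order
print(counts_1 == counts_2)  # compares only the entry counts
```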

It is also important to realize that the documents in this analysis are of very different lengths (from 2 to 778 entries in the summary that follows), with half of the documents containing 10 entries or fewer. The composite text from Nippur is by far the longest document and will dominate any comparison.
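
How document length affects a comparison depends on the measure. As a hedged illustration on hypothetical count vectors (the arrays and helper functions below are assumptions made for this sketch, not part of the notebook), Euclidean distance grows with the difference in length, whereas cosine similarity only reflects the proportions of the entries.

```python
import numpy as np

# Hypothetical DTM rows over a four-entry vocabulary.
doc_a = np.array([1, 1, 0, 2])
doc_b = np.array([2, 2, 0, 4])   # same proportions as doc_a, twice as long
doc_c = np.array([0, 0, 3, 0])   # shares no entries with doc_a

def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    """Straight-line distance between two count vectors."""
    return float(np.linalg.norm(u - v))

print(cosine_similarity(doc_a, doc_b), euclidean_distance(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c), euclidean_distance(doc_a, doc_c))
```

Here `doc_a` and `doc_b` have a cosine similarity of 1 despite their different lengths, while their Euclidean distance is greater than zero.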

In [9]:
# Sum across the columns of each row (total entries per document) and summarize

df_length = dtm_df.sum(axis=1)
df_length.describe()

count     71.000000
mean      90.591549
std      170.194475
min        2.000000
25%        3.000000
50%       10.000000
75%       99.000000
max      778.000000
dtype: float64

The above output summarizes the lengths of the documents in our collection (i.e., the corpus), measured as the total number of entries per document. For example, `max` is the number of entries in the longest document. 
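
The direction of the sum matters. A minimal sketch on a hypothetical toy matrix (all names below are illustrative) shows what each `axis` value of `DataFrame.sum` adds up in a DTM.

```python
import pandas as pd

# Hypothetical DTM: two documents (rows), three entries (columns).
toy_dtm = pd.DataFrame(
    [[1, 0, 2],
     [0, 1, 1]],
    index=['T1', 'T2'],
    columns=['entry_a', 'entry_b', 'entry_c'],
)

# axis=1 sums across the columns of each row: one total per document.
doc_length = toy_dtm.sum(axis=1)

# axis=0 sums down the rows of each column: one total per entry.
entry_frequency = toy_dtm.sum(axis=0)

# Number of documents in which each entry occurs at least once.
doc_frequency = (toy_dtm > 0).sum(axis=0)

print(doc_length)
print(entry_frequency)
print(doc_frequency)
```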

Further analysis of the DTM is carried out in `R` in a separate notebook. 

In [10]:
# Write the data out to csv
save_file = "../data/pass/" + filename[:-4] + '_dtm.csv'
dtm_df.to_csv(save_file)
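
Because `to_csv` writes the index as an unnamed first column, the document IDs need to be restored when the file is read back. The sketch below uses a hypothetical toy matrix and an in-memory buffer instead of a file on disk; it shows where an `Unnamed: 0` column comes from and how `index_col=0` avoids it.

```python
import io
import pandas as pd

# Hypothetical DTM with document IDs as index.
toy_dtm = pd.DataFrame(
    [[1, 0], [0, 2]],
    index=['T1', 'T2'],
    columns=['entry_a', 'entry_b'],
)

# to_csv writes the index as a first column without a header.
csv_text = toy_dtm.to_csv()

# Without index_col, that column is read back as 'Unnamed: 0'.
plain = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the document IDs become the index again.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())
print(restored.index.tolist())
```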