# Part 2: Analysis based on presence/absence of entries

In [2]:
import pandas as pd
import numpy as np
import re

## Load the Data
Load the data in the format produced by `02_structure_data.ipynb` and saved in `data/pass`.

In [3]:
filename = input('Filename: ')

Filename: Q01_par.csv


In [4]:
file = '../data/pass/' + filename
df = pd.read_csv(file, dtype={'extent': object}).drop('Unnamed: 0', axis=1)

## Group by Document
The `groupby()` function is used to group the data by document. The function `apply(' '.join)` concatenates the text in the `entry` column of each group, separating the entries with a single space. The Pandas `groupby()` operation results in a Series, which is then transformed into a new DataFrame with `id_text` as index.

In [5]:
df['entry'] = df['entry'].fillna('')
entries_df = df[['id_text', 'line', 'entry']]
#entries_df = entries_df.dropna()
grouped = entries_df['entry'].groupby(entries_df['id_text']).apply(' '.join).reset_index()
by_text_df = pd.DataFrame(grouped)
by_text_df = by_text_df.set_index('id_text')
by_text_df.head()

id_text,entry
P225009,udu[sheep]N_urmah[lion]N_gu[eat]V/t udu[sheep]...
P235800,muš[snake]N muš[snake]N_huš[reddish]V/i muš[sn...
P247526,unknown udu[sheep]N_kudkudra[disabled]N udu[sh...
P247533,unknown maš[goat]N_babbar[white]V/i maš[goat]N...
P247541,amar[young]N_babbar[white]V/i amar[young]N_gig...
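
To make the grouping step concrete, here is a minimal sketch on a made-up three-row DataFrame. The ids `T1` and `T2` and the variable names `toy` and `toy_by_text` are invented for illustration and are not part of the data. It follows the same pattern as the cell above: fill missing entries, group by `id_text`, and join the entries of each document with a space.

```python
import pandas as pd

# Toy data: hypothetical ids and entries, for illustration only
toy = pd.DataFrame({
    'id_text': ['T1', 'T1', 'T2'],
    'entry': ['udu[sheep]N', 'maš[goat]N_babbar[white]V/i', None],
})

# Missing entries become empty strings so that ' '.join only sees strings
toy['entry'] = toy['entry'].fillna('')

# One row per document: all entries of a document joined by a single space
toy_by_text = toy.groupby('id_text')['entry'].apply(' '.join).to_frame()
print(toy_by_text)
```

Document `T1` ends up with both of its entries in one string, while `T2`, whose only entry was missing, ends up with an empty string.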


## Document Term Matrix
Transform the DataFrame into a Document Term Matrix (DTM) by using `CountVectorizer`. This class uses a regular expression (`token_pattern`) to indicate how to find the beginning and end of a token. In the current DataFrame, entries are separated from each other by a space. The expression `r'[^ ]+'` means: any sequence of one or more characters other than the space.

The output of the CountVectorizer (`dtm`) is a sparse matrix, which is not in a human-readable format. It is transformed into another DataFrame, with `id_text` as index.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(analyzer='word', token_pattern=r'[^ ]+')
dtm = cv.fit_transform(by_text_df['entry'])
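# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() requires 1.0 or later
dtm_df = pd.DataFrame(dtm.toarray(), columns = cv.get_feature_names_out(), index = by_text_df.index.values)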
dtm_df.head()

Unnamed: 0,a[arm]n_dalu[object]n,ab[cow]n,ab[cow]n_babbar[white]v/i,ab[cow]n_giggi[black]v/i,ab[cow]n_gud[ox]n,ab[cow]n_gunu[speckled]v/i,ab[cow]n_ib[hips]n_gig[sick]v/i,ab[cow]n_mah[mature]v/i,ab[cow]n_namra[booty]n,ab[cow]n_peš[thick]v/i,...,šah[pig]n_gi[thicket]n,šah[pig]n_ma₂-gan[na]na,šah[pig]n_namerim[oath]n,šah[pig]n_namuruna[~herd]n,šah[pig]n_niga[fattened]v/i,šah[pig]n_udšuš[daily]aj,šaran[pestle-bottom]n,šeg[animal]n,šeŋbar[animal]n,šurun[cricket]n
P225009,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
P235800,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
P247526,1,0,1,1,1,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
P247533,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
P247541,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
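
The effect of the token pattern can be checked without scikit-learn. The sketch below mimics, in plain Python, what `CountVectorizer` does to a single document with the settings used above: lowercase the text (the default), then collect every match of `[^ ]+`. The sample string and the variable names are invented for illustration.

```python
import re
from collections import Counter

# Invented sample document: entries separated by single spaces
doc = 'udu[sheep]N udu[sheep]N_niga[fattened]V/i udu[sheep]N'

# Same pattern as token_pattern above: runs of non-space characters
tokens = re.findall(r'[^ ]+', doc.lower())

# Count each entry, as one row of the DTM would
counts = Counter(tokens)
print(counts)
```

Each entry survives as a single token, brackets, underscores, and slashes included. The lowercasing step is why the column names in the table above read `ab[cow]n` rather than `ab[cow]N`.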


## Analyzing the DTM
Each document in the DTM may be understood as a vector, which allows for various kinds of computations, such as distance or cosine similarity.

It is important to recall that the DTM does not preserve information about the order of entries.
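
As a small illustration of both points, the sketch below computes cosine similarity for three made-up count vectors in plain Python. The function `cosine_similarity` and the vectors are hypothetical and written only for this example. Two documents that contain the same entries in a different order produce the same row in the DTM, so their similarity is 1 (up to floating-point rounding).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors.

    Assumes neither vector is all zeros (otherwise the norm is 0).
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vocabulary of three entries; each list is one document's row
doc_a = [2, 1, 0]   # entries in one order
doc_b = [2, 1, 0]   # the same entries in another order: identical row
doc_c = [0, 0, 3]   # shares no entry with the other two

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(sim_ab, sim_ac)
```

Documents without any shared entry, such as `doc_a` and `doc_c`, have a similarity of 0.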

It is also important to realize that the documents in this analysis are of very different lengths: in the output below they range from 3 to 511 entries, and more than half of the documents have 3 entries or fewer. The composite text from Nippur is by far the longest document and will dominate any comparison.

In [7]:
df_length = dtm_df.sum(axis=1)
df_length.describe()

count     15.000000
mean      53.000000
std      133.165847
min        3.000000
25%        3.000000
50%        3.000000
75%       15.500000
max      511.000000
dtype: float64

In [10]:
save_file = "../data/pass/" + filename[:-4] + '_dtm.csv'
dtm_df.to_csv(save_file)
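
The slice `filename[:-4]` assumes that the input name ends in a three-letter extension such as `.csv`. As a sketch of an alternative that does not depend on the extension length, the hypothetical helper `dtm_filename` below uses `os.path.splitext` from the standard library.

```python
import os

def dtm_filename(filename):
    """Build the output name for the DTM file (hypothetical helper)."""
    # splitext separates 'Q01_par.csv' into ('Q01_par', '.csv')
    stem, _ext = os.path.splitext(filename)
    return stem + '_dtm.csv'

print(dtm_filename('Q01_par.csv'))  # prints Q01_par_dtm.csv
```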

# 05 Analyze DTM
Further analysis of the DTM is done in the notebook `05_analyze_DTM_R.ipynb`.