# Section order

Here we analyze how similar or different documents are from one another, based on the degree to which they share the order of their sections. Sections are defined in 03_define_sections_R.

Our approach is based on the idea of *synteny* from the field of genomics. Two (or more) genes are said to be syntenic if they are found in proximity across a set of genomes. The degree of synteny is proportional to the frequency with which genes are colocalized and the evolutionary distance of organisms in which that colocalization occurs.

A genome-wide *synteny index* (SI) between a pair of genomes can be defined by calculating the *synteny* for all genes shared between that genome pair.

Reference: ~/resources/biology/gene-order-phylogenetics/Ordered-orthology-as-a-tool-in-prokaryotic-evolutionary-inference.pdf

For this analysis, **genes** are equivalent to **sections** (not **entries**) and **genomes** are equivalent to **documents**.  

### Limitations/Issues: 
- SI is only defined for sections shared between two documents. If there is no overlap in sections, the genome-wide SI is 0. Short documents will tend to have extremes of SI, either 1 (all sections shared) or 0 (no sections shared). 
    - That's ok and makes sense.
- Need to set a reasonable value for `k` by trial and error. This may also depend on the distribution of document lengths across our corpus. 
    - Try 2,3,5 or 7 and see what makes sense.
- This method was developed for microbial genomes, which are usually circular. For our purposes, we will need to decide how to treat sections that occur within `k` of the beginning or end of a document.
    - Idea: add a section called "top" and one called "bottom" when known to a document. Treat these as sections in the SI calculations.
- Is the document-wide SI the average of the SIs of all shared sections, or are non-shared sections counted too?
    - Answer: Don't count non-shared sections.
    - All of our documents are broken, so we usually don't know whether a section was originally present in the unbroken document. (There is a difference between a section that is missing because of the sample and a section that is truly missing.)
- When a section appears multiple times in a document, how to deal with this? (Analogy in genomics is repeated genes.)
    - How common is this? 
- When two sections are flipped in order, this method wouldn't notice (unless the documents are relatively long compared to k). Do we want it to?
- Throw in a handful of outliers (documents from other list types).
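
The "top"/"bottom" idea above could look something like the following. This is only a minimal sketch: the helper name `add_boundaries` and its `has_top`/`has_bottom` arguments are hypothetical, and it assumes a document is stored as a character vector of section names in order.

```r
# Hypothetical helper: mark known document boundaries with pseudo-sections
# so that they take part in the k-neighborhoods like any other section.
add_boundaries <- function(doc, has_top = FALSE, has_bottom = FALSE) {
  # `if` without `else` yields NULL when the condition is FALSE,
  # and c() silently drops NULL, so unknown boundaries add nothing.
  c(if (has_top) "top", doc, if (has_bottom) "bottom")
}

add_boundaries(c("tree", "apple", "boat"), has_top = TRUE)
# "top" "tree" "apple" "boat"
```

A broken document whose beginning or end is not preserved is simply left unpadded on that side.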

### Steps: 
- Determine order of sections within each document in corpus.  
- Provisionally set `k` to some low number (2?).
- Define a document pair. For each section present in `Doc A`, check whether it is present in `Doc B`. 
    - If NO, SI for that section = 0.
    - If YES, get the set of sections within `k` of that section in `Doc A` and `Doc B`. Calculate overlap. SI = overlap (as decimal). 
- Document wide SI = average SI for all sections in that document.
- Document wide SI is **directional**. `SI(Doc A --> Doc B)` is not the same as `SI(Doc B --> Doc A)`. (Is this true?)
- Perform this operation for each document pair in each direction(?)

- The synteny index (SI) of a gene is the number of common genes in the `k`-neighborhoods of that gene across the genomes being compared.
- Distance = 1 - genome-wide SI for any pair of genomes.
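
The steps above can be sketched in R roughly as follows. This is a minimal sketch under stated assumptions, not the method of the referenced paper: the function names (`k_neighborhood`, `section_si`, `document_si`) and the toy documents are hypothetical; the overlap is normalized by `2 * k`; neighborhoods are truncated at the document ends rather than wrapped around; and only the first occurrence of a repeated section is used.

```r
# Sections within k positions of position i (excluding i itself),
# truncated at the document ends (no wrap-around).
k_neighborhood <- function(doc, i, k) {
  idx <- setdiff(seq(max(1, i - k), min(length(doc), i + k)), i)
  unique(doc[idx])
}

# SI for one section: shared neighbors / 2k (assumed normalization).
section_si <- function(section, doc_a, doc_b, k) {
  pos_a <- match(section, doc_a)  # first occurrence only (assumption)
  pos_b <- match(section, doc_b)
  if (is.na(pos_a) || is.na(pos_b)) return(0)  # section not shared
  shared <- intersect(k_neighborhood(doc_a, pos_a, k),
                      k_neighborhood(doc_b, pos_b, k))
  length(shared) / (2 * k)
}

# Directional document-wide SI: mean over the distinct sections of doc_a.
document_si <- function(doc_a, doc_b, k = 2) {
  mean(sapply(unique(doc_a), section_si,
              doc_a = doc_a, doc_b = doc_b, k = k))
}

# Toy documents (hypothetical section orders)
doc_a <- c("tree", "apple", "boat", "chariot", "door")
doc_b <- c("tree", "apple", "chariot", "boat", "door", "chair")

document_si(doc_a, doc_b, k = 2)      # 0.6
document_si(doc_b, doc_a, k = 2)      # 0.5
1 - document_si(doc_a, doc_b, k = 2)  # distance A --> B: 0.4
```

In this toy case the two directions differ only because `doc_b` has an extra section ("chair") that scores 0 and enters its average. With the `2 * k` normalization, sections within `k` of a document's end can never reach an SI of 1; dividing by the size of the `Doc A` neighborhood instead would be one alternative to try.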



In [3]:
setwd("../data/sections/")

####### Read in section definitions from file #######

Q1_sections = read.csv("Q01_sections.csv", stringsAsFactors = FALSE)
Q39_sections = read.csv("Q39_sections.csv", stringsAsFactors = FALSE)
Q40_sections = read.csv("Q40_sections.csv", stringsAsFactors = FALSE)
Q41_sections = read.csv("Q41_sections.csv", stringsAsFactors = FALSE)
Q42_sections = read.csv("Q42_sections.csv", stringsAsFactors = FALSE)

In [4]:
# Read in a corpus to analyze

df = read.csv("../pass/Q39_par.csv", stringsAsFactors = FALSE)
head(df)

X,id_line,label,lemma,base,id_text,line,skip,entry
0,P117395.2,o 1,ŋešed[key]N,{ŋeš}e₃-a,P117395,2,0,ŋešed[key]N
1,P117395.3,o 2,pakud[~tree]N,{ŋeš}pa-kud,P117395,3,0,pakud[~tree]N
2,P117395.4,o 3,raba[clamp]N,{ŋeš}raba,P117395,4,0,raba[clamp]N
3,P117404.2,o 1,ig[door]N eren[cedar]N,{ŋeš}ig {ŋeš}eren,P117404,2,0,ig[door]N_eren[cedar]N
4,P117404.3,o 2,ig[door]N dib[board]N,{ŋeš}ig dib,P117404,3,0,ig[door]N_dib[board]N
5,P117404.4,o 3,ig[door]N i[oil]N,{ŋeš}ig i₃,P117404,4,0,ig[door]N_i[oil]N


In [5]:
# for each entry in the corpus, identify which section it belongs to
# add section to which each entry belongs to new column in dataframe
# if an entry is found in more than one unique section, 
# output all sections names separated by ":"

section_defs = Q39_sections
df$section = NA_character_  # default: entry belongs to no section
for (i in seq_len(nrow(df))) {
    entry = tolower(df$entry[i])
    # names of the section columns that contain this entry
    sections = names(which(sapply(section_defs, function(x) any(x == entry))))
    if (length(sections) > 0) df$section[i] = paste(sections, collapse = ":")
}
df

X,id_line,label,lemma,base,id_text,line,skip,entry,section
0,P117395.2,o 1,ŋešed[key]N,{ŋeš}e₃-a,P117395,2,0,ŋešed[key]N,
1,P117395.3,o 2,pakud[~tree]N,{ŋeš}pa-kud,P117395,3,0,pakud[~tree]N,
2,P117395.4,o 3,raba[clamp]N,{ŋeš}raba,P117395,4,0,raba[clamp]N,
3,P117404.2,o 1,ig[door]N eren[cedar]N,{ŋeš}ig {ŋeš}eren,P117404,2,0,ig[door]N_eren[cedar]N,
4,P117404.3,o 2,ig[door]N dib[board]N,{ŋeš}ig dib,P117404,3,0,ig[door]N_dib[board]N,door
5,P117404.4,o 3,ig[door]N i[oil]N,{ŋeš}ig i₃,P117404,4,0,ig[door]N_i[oil]N,door
6,P128345.2,o 1,garig[comb]N siki[hair]N,{ŋeš}ga-rig₂ siki,P128345,2,0,garig[comb]N_siki[hair]N,comb
7,P128345.3,o 2,garig[comb]N siki-siki[NA]NA,{ŋeš}ga-rig₂ siki-siki,P128345,3,0,garig[comb]N_siki-siki[NA]NA,
8,P128345.4,o 3,garig[comb]N saŋdu[head]N,{ŋeš}ga-rig₂ saŋ-du,P128345,4,0,garig[comb]N_saŋdu[head]N,comb
9,P224980.4,o i 1,gigir[chariot]N,{ŋeš}gigir,P224980,4,0,gigir[chariot]N,chariot
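
To make the lookup above easier to follow, here is a small self-contained sketch on toy data. It assumes, as the loop does, that section definitions are stored one section per column, with the (lowercased) entries of that section as the column's values. The names `toy_defs` and `find_sections` and the "oil" section are hypothetical and only serve the illustration.

```r
# Toy section definitions: one column per section, padded with NA.
toy_defs <- data.frame(
  door = c("ig[door]n_dib[board]n", "ig[door]n_i[oil]n"),
  oil  = c("ig[door]n_i[oil]n", NA),
  stringsAsFactors = FALSE
)

# Return all sections containing `entry`, joined by ":", or NA if none.
find_sections <- function(entry, defs) {
  hits <- vapply(defs,
                 function(col) any(col == tolower(entry), na.rm = TRUE),
                 logical(1))
  if (!any(hits)) return(NA_character_)
  paste(names(defs)[hits], collapse = ":")
}

find_sections("ig[door]N_dib[board]N", toy_defs)  # "door"
find_sections("ig[door]N_i[oil]N", toy_defs)      # "door:oil"
find_sections("gigir[chariot]N", toy_defs)        # NA
```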


In [9]:
grep(":", df$section)

In [10]:
df[grep(":", df$section),]

Unnamed: 0,X,id_line,label,lemma,base,id_text,line,skip,entry,section
479,478,P247864.280,o v 28,ŋešgana[pestle]N,ŋeš-gan-na,P247864,280,0,ŋešgana[pestle]N,X.2:mortar
1860,1859,P273880.293,b iii 17,ŋešgana[pestle]N,ŋeš-gan-na,P273880,293,0,ŋešgana[pestle]N,X.2:mortar
2328,2327,P312012.3,3,ŋešgana[pestle]N,ŋeš-gan-na,P312012,3,0,ŋešgana[pestle]N,X.2:mortar
4509,4508,Q000039.172,145a,ŋešgana[pestle]N,ŋeš-gana,Q000039,172,0,ŋešgana[pestle]N,X.2:mortar
4633,4632,Q000039.296,260,ŋešgana[pestle]N,ŋeš-gan-na,Q000039,296,0,ŋešgana[pestle]N,X.2:mortar


In [205]:
df[df$id_text == "P250371",]

Unnamed: 0,X,id_line,label,lemma,base,id_text,line,skip,entry,section
1108,1107,P250371.3,o 1,e[house]N usan[whip]N gigir[chariot]N,{ŋeš}e₂ usan₃ gigir,P250371,3,0,e[house]N_usan[whip]N_gigir[chariot]N,chariot
1109,1108,P250371.4,o 2,gaba[chest]N gigir[chariot]N,{ŋeš}gaba gigir,P250371,4,0,gaba[chest]N_gigir[chariot]N,chariot
1110,1109,P250371.5,o 3,gabaŋal[guard]N gigir[chariot]N,{ŋeš}gaba-ŋal₂ gigir,P250371,5,0,gabaŋal[guard]N_gigir[chariot]N,chariot
1111,1110,P250371.6,o 4,sahargi[dust-guard]N gigir[chariot]N,{ŋeš}sahar-gi gigir,P250371,6,0,sahargi[dust-guard]N_gigir[chariot]N,chariot
1112,1111,P250371.7,o 5,si[horn]N gigir[chariot]N,{ŋeš}si gigir,P250371,7,0,si[horn]N_gigir[chariot]N,chariot
1113,1112,P250371.8,o 6,saŋ[head]N gigir[chariot]N,{ŋeš}saŋ gigir,P250371,8,0,saŋ[head]N_gigir[chariot]N,
1114,1113,P250371.9,o 7,lirum[strength]N gigir[chariot]N,{ŋeš}lirum gigir,P250371,9,0,lirum[strength]N_gigir[chariot]N,
1115,1114,P250371.10,o 8,guza[chair]N gigir[chariot]N,{ŋeš}gu-za gigir,P250371,10,0,guza[chair]N_gigir[chariot]N,chair
1116,1115,P250371.12,r 1,mud[tube]N gigir[chariot]N,{ŋeš}mud gigir,P250371,12,0,mud[tube]N_gigir[chariot]N,chariot
1117,1116,P250371.13,r 2,gag[nail]N mud[tube]N gigir[chariot]N,{ŋeš}gag mud gigir,P250371,13,0,gag[nail]N_mud[tube]N_gigir[chariot]N,chariot


In [None]:
if(is.na(NA)) print("Hi")

In [None]:
# Scratch check: an entry that matches no section yields character(0)
test = print(names(which(sapply(section_defs, function(x) any(x == "Hi erin")) == TRUE)))

In [None]:
length(test)

In [199]:
# How often does a section appear more than once within a document?

# First split documents from df

df$id_text = as.factor(df$id_text)
docs = split(df, df$id_text)
# lapply always returns a list (sapply could simplify to a matrix)
docs = lapply(docs, function(x) x$section)
#str(docs)

In [185]:
test = na.omit(docs$P235262)[1:5]
test = c(test, "tree")
test
rle(test)$values
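
As a small illustration of why `rle` is useful here: it collapses consecutive repeats into runs, so a section that reappears after an interruption shows up more than once in `$values`. The vector `x` below is made-up toy data.

```r
# Toy sequence of section labels, one per entry
x <- c("tree", "tree", "apple", "apple", "tree")

runs <- rle(x)
runs$values   # "tree" "apple" "tree"
runs$lengths  # 2 2 1
```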

In [200]:
# get rid of NAs (entries that do not belong to a section)

#docs_na_omit = sapply(docs, function(x) na.exclude(x))
docs_na_omit = lapply(docs, function(x) x[!is.na(x)])

non_empty_docs = sapply(docs_na_omit, function(x) length(x) > 0)

docs_na_omit = docs_na_omit[non_empty_docs]
str(docs_na_omit)
    

List of 97
 $ P117404: chr [1:2] "door" "door"
 $ P128345: chr [1:2] "comb" "comb"
 $ P224980: chr [1:3] "chariot" "chariot" "chariot"
 $ P224986: chr [1:2] "chair" "chair"
 $ P224996: chr [1:3] "chair" "chair" "chair"
 $ P225006: chr [1:2] "door" "door"
 $ P225023: chr "X.6"
 $ P225033: chr [1:3] "pole" "pole" "comb"
 $ P225059: chr "jug"
 $ P225062: chr "tree.5"
 $ P225086: chr [1:2] "tree.2" "acacia"
 $ P225109: chr "vehicle_place"
 $ P225132: chr "ship"
 $ P229426: chr [1:8] "ship_fisherman" "ship_fisherman" "ship_fisherman" "ship_fisherman" ...
 $ P230069: chr "chair"
 $ P235262: chr [1:47] "tree" "tree" "apple" "apple" ...
 $ P247543: chr [1:2] "ship_fisherman" "ship_fisherman"
 $ P247864: chr [1:155] "cedar" "X.6" "tree.3" "tree" ...
 $ P249383: chr [1:2] "chariot" "chariot"
 $ P250361: chr [1:11] "bed" "chair" "chair" "chair" ...
 $ P250362: chr [1:6] "tool" "tool" "X.3" "X.3" ...
 $ P250363: chr [1:12] "door" "door" "door" "door" ...
 $ P250364: chr [1:70] "chariot" "chariot" 

In [196]:
unname(rle(unlist(docs_na_omit[20]))$values)

In [202]:
doc_sections = lapply(docs_na_omit, 
                      function(x) unname(rle(unlist(x))$values))
                          
doc_sections

In [204]:
lapply(doc_sections, table)

$P117404

door 
   1 

$P128345

comb 
   1 

$P224980

chariot 
      1 

$P224986

chair 
    1 

$P224996

chair 
    1 

$P225006

door 
   1 

$P225023

X.6 
  1 

$P225033

comb pole 
   1    1 

$P225059

jug 
  1 

$P225062

tree.5 
     1 

$P225086

acacia tree.2 
     1      1 

$P225109

vehicle_place 
            1 

$P225132

ship 
   1 

$P229426

ship_fisherman             X1 
             1              1 

$P230069

chair 
    1 

$P235262

         apple           boat        chariot           dish        harness 
             1              1              1              1              1 
     horn_ship         mortar           rope ship_fisherman          staff 
             1              1              1              1              1 
         stair          table       tamarisk           tree         tree.1 
             1              1              1              1              1 
        trough  vehicle_place            X.6 
             1              1      

In [34]:
# Per document: number of sections that label more than one entry
sapply(split(df, df$id_text), function(x) sum(table(x$section) > 1))
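
To answer the question of how often a section recurs within a document, one option is to count, in the run-collapsed sequences, the sections that appear in more than one run. The sketch below uses made-up toy data; `toy_sections`, its document IDs, and `repeats` are hypothetical names, with `toy_sections` standing in for a list shaped like `doc_sections`.

```r
# Toy stand-in for doc_sections: run-collapsed section order per document
toy_sections <- list(
  P1 = c("door", "chair", "door"),  # "door" recurs after an interruption
  P2 = c("tree", "apple")
)

# For each document, the sections that occur in more than one run
repeats <- lapply(toy_sections, function(x) {
  counts <- table(x)
  names(counts)[counts > 1]
})

repeats$P1               # "door"
sapply(repeats, length)  # P1: 1, P2: 0
```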