# Section order

Here we measure how similar or different documents are, based on the degree to which they share the same order of sections. Sections are defined in 03_define_sections_R. 

Our approach is based on the idea of *synteny* from the field of genomics. Two (or more) genes are said to be in synteny if they are found in proximity to one another across a set of genomes. The degree of synteny is proportional to the frequency with which the genes are colocalized and to the evolutionary distance between the organisms in which that colocalization occurs.  

A genome-wide *synteny index* (SI) between a pair of genomes can be defined by calculating the *synteny* for all genes shared between that genome pair.

Reference: ~/resources/biology/gene-order-phylogenetics/Ordered-orthology-as-a-tool-in-prokaryotic-evolutionary-inference.pdf

For this analysis, **genes** are equivalent to **sections** (not **entries**) and **genomes** are equivalent to **documents**.  

### Limitations/Issues: 
- SI is only defined for sections shared between two documents. If there is no overlap in sections, the document-wide (genome-wide) SI is 0. Short documents will tend to have extreme SI values, either 1 (all sections shared) or 0 (no sections shared).  
- Need to set a reasonable value for `k` by trial and error. This may also depend on the distribution of document lengths across our corpus.  
- This method was developed for microbial genomes, which are usually circular. For our purposes, we will need to decide how to treat sections that occur within `k` of the beginning or end of a document.
- Is the document-wide SI the average over shared sections only, or are non-shared sections also counted (with SI = 0)?
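
To make the boundary question concrete, here is a minimal sketch of the two obvious conventions: truncating the neighborhood at the document edges (linear) versus wrapping around (circular, as in the original genomic setting). The function name `neighbor_idx` and its arguments are hypothetical and not part of the existing code.

```r
# Hypothetical helper: positions within k of position i in a document of n sections.
# circular = FALSE truncates at the document edges; circular = TRUE wraps around.
neighbor_idx = function(n, i, k, circular = FALSE) {
    idx = i + setdiff(-k:k, 0)             # candidate positions on both sides of i
    if (circular) {
        idx = ((idx - 1) %% n) + 1         # wrap around, as for a circular genome
    } else {
        idx = idx[idx >= 1 & idx <= n]     # drop positions that fall off either end
    }
    setdiff(unique(idx), i)                # never count the section itself
}

neighbor_idx(5, 1, 2)                   # linear: only the two following positions
neighbor_idx(5, 1, 2, circular = TRUE)  # circular: also wraps to the end of the document
```

With truncation, sections near the start or end have fewer than `2k` neighbors, so their maximum possible overlap is smaller unless the normalization is adjusted to match.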

### Steps: 
- Determine order of sections within each document in corpus.  
- Provisionally set `k` to some low number (2?).
- Define a document pair. For each section present in `Doc A`, check whether it is present in `Doc B`. 
    - If NO, SI for that section = 0.
    - If YES, get the set of sections within `k` of that section in `Doc A` and in `Doc B`, then calculate the overlap between the two sets. SI = overlap (as a decimal fraction). 
- Document-wide SI = average SI for all sections in that document.
- Document-wide SI is **directional**: `SI(Doc A --> Doc B)` is not necessarily the same as `SI(Doc B --> Doc A)`. (Is this true?)
- Perform this operation for each document pair in each direction(?)
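
The steps above can be sketched as follows. This is only one possible reading, under several assumptions that the notes leave open: each document is a character vector of section names in order, with each section occurring once; neighborhoods are truncated at the document edges; the overlap is normalized by `2k`; and sections missing from `Doc B` count as 0 in the average. The names `neighborhood` and `doc_si` are hypothetical.

```r
# Hypothetical helper: sections within k positions of position i (truncated at the edges)
neighborhood = function(doc, i, k) {
    idx = setdiff(max(1, i - k):min(length(doc), i + k), i)
    unique(doc[idx])
}

# Directional document-wide SI: average per-section SI over every section in doc_a
doc_si = function(doc_a, doc_b, k = 2) {
    si = vapply(seq_along(doc_a), function(i) {
        j = match(doc_a[i], doc_b)
        if (is.na(j)) return(0)                    # section absent from doc_b
        shared = intersect(neighborhood(doc_a, i, k), neighborhood(doc_b, j, k))
        length(shared) / (2 * k)                   # assumed normalization by 2k
    }, numeric(1))
    mean(si)
}

# Made-up section orders, for illustration only
doc_a = c("tree", "door", "comb", "chariot", "mortar")
doc_c = c("tree", "door")

doc_si(doc_c, doc_a, k = 1)   # 0.5
doc_si(doc_a, doc_c, k = 1)   # 0.2
```

Under these assumptions the two directions differ whenever the documents contain different numbers of sections, because the average is taken over the sections of the first document only.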

- The synteny index (SI) of a gene is the number of common genes in the `k`-neighborhoods of that gene across the genomes being compared.  
- Distance = 1 - genome-wide SI for any pair of genomes.
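
As a rough illustration of the distance step, the sketch below converts a matrix of directional SI values into distances. The SI values are made-up placeholders, not results, and averaging the two directions to obtain a symmetric distance is an assumption; the notes do not say how the directional values should be combined.

```r
# Placeholder (made-up) directional SI values; rows = from, columns = to
docs = c("A", "B", "C")
si = matrix(c(1.0, 0.2, 0.5,
              0.2, 1.0, 0.0,
              0.2, 0.0, 1.0),
            nrow = 3, byrow = TRUE, dimnames = list(docs, docs))

d_directional = 1 - si                            # Distance = 1 - SI, per direction
d_sym = (d_directional + t(d_directional)) / 2    # assumed: average the two directions
d = as.dist(d_sym)                                # "dist" object, usable with e.g. hclust()
d
```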



In [6]:
#setwd("../data/sections/")

####### Read in section definitions from file #######

Q1_sections = read.csv("Q01_sections.csv", stringsAsFactors = FALSE)
Q39_sections = read.csv("Q39_sections.csv", stringsAsFactors = FALSE)
Q40_sections = read.csv("Q40_sections.csv", stringsAsFactors = FALSE)
Q41_sections = read.csv("Q41_sections.csv", stringsAsFactors = FALSE)
Q42_sections = read.csv("Q42_sections.csv", stringsAsFactors = FALSE)

In [7]:
# Read in a corpus to analyze

df = read.csv("../pass/Q39_par.csv", stringsAsFactors = FALSE)
head(df)

X,id_line,label,lemma,base,id_text,line,skip,entry
0,P117395.2,o 1,ŋešed[key]N,{ŋeš}e₃-a,P117395,2,0,ŋešed[key]N
1,P117395.3,o 2,pakud[~tree]N,{ŋeš}pa-kud,P117395,3,0,pakud[~tree]N
2,P117395.4,o 3,raba[clamp]N,{ŋeš}raba,P117395,4,0,raba[clamp]N
3,P117404.2,o 1,ig[door]N eren[cedar]N,{ŋeš}ig {ŋeš}eren,P117404,2,0,ig[door]N_eren[cedar]N
4,P117404.3,o 2,ig[door]N dib[board]N,{ŋeš}ig dib,P117404,3,0,ig[door]N_dib[board]N
5,P117404.4,o 3,ig[door]N i[oil]N,{ŋeš}ig i₃,P117404,4,0,ig[door]N_i[oil]N


In [8]:
# For each entry in the corpus, identify which section(s) it belongs to
# and record the result in a new `section` column of the data frame.
# If an entry is found in more than one section,
# output all section names separated by ":".
# Entries that match no section are left as NA.

section_defs = Q39_sections
df$section = NA_character_   # initialize the column before filling it row by row
for (i in seq_len(nrow(df))) {
    entry = tolower(df$entry[i])
    # names of the section columns that contain this entry (NA padding is ignored)
    sections = names(which(sapply(section_defs, function(x) any(x == entry, na.rm = TRUE))))
    if (length(sections) > 0) df$section[i] = paste(sections, collapse = ":")
}
df

X,id_line,label,lemma,base,id_text,line,skip,entry,section
0,P117395.2,o 1,ŋešed[key]N,{ŋeš}e₃-a,P117395,2,0,ŋešed[key]N,
1,P117395.3,o 2,pakud[~tree]N,{ŋeš}pa-kud,P117395,3,0,pakud[~tree]N,
2,P117395.4,o 3,raba[clamp]N,{ŋeš}raba,P117395,4,0,raba[clamp]N,
3,P117404.2,o 1,ig[door]N eren[cedar]N,{ŋeš}ig {ŋeš}eren,P117404,2,0,ig[door]N_eren[cedar]N,
4,P117404.3,o 2,ig[door]N dib[board]N,{ŋeš}ig dib,P117404,3,0,ig[door]N_dib[board]N,door
5,P117404.4,o 3,ig[door]N i[oil]N,{ŋeš}ig i₃,P117404,4,0,ig[door]N_i[oil]N,door
6,P128345.2,o 1,garig[comb]N siki[hair]N,{ŋeš}ga-rig₂ siki,P128345,2,0,garig[comb]N_siki[hair]N,comb
7,P128345.3,o 2,garig[comb]N siki-siki[NA]NA,{ŋeš}ga-rig₂ siki-siki,P128345,3,0,garig[comb]N_siki-siki[NA]NA,
8,P128345.4,o 3,garig[comb]N saŋdu[head]N,{ŋeš}ga-rig₂ saŋ-du,P128345,4,0,garig[comb]N_saŋdu[head]N,comb
9,P224980.4,o i 1,gigir[chariot]N,{ŋeš}gigir,P224980,4,0,gigir[chariot]N,chariot
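
Once each entry carries a section label, the first step of the analysis (the order of sections within each document) could be derived along these lines. This is a sketch on a toy data frame: the column names `id_text`, `line`, and `section` follow the data frame above, but the values are made up, and dropping unassigned entries and collapsing consecutive repeats are assumptions about how section order should be defined.

```r
# Toy stand-in for df; values are made up for illustration
toy = data.frame(
    id_text = c("P1", "P1", "P1", "P1", "P2", "P2", "P2"),
    line    = c(2, 3, 4, 5, 2, 3, 4),
    section = c("tree", "tree", NA, "door", "door", "comb", "comb"),
    stringsAsFactors = FALSE
)

# One ordered vector of section names per document
section_order = lapply(split(toy, toy$id_text), function(d) {
    s = d$section[order(d$line)]   # sections in line order
    s = s[!is.na(s)]               # drop entries not assigned to any section
    rle(s)$values                  # collapse consecutive repeats into one occurrence
})

section_order$P1   # "tree" "door"
section_order$P2   # "door" "comb"
```

Entries assigned to several sections (labels containing ":") are kept as-is here and would need a separate decision.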


In [9]:
grep(":", df$section)

In [10]:
df[grep(":", df$section),]

Unnamed: 0,X,id_line,label,lemma,base,id_text,line,skip,entry,section
479,478,P247864.280,o v 28,ŋešgana[pestle]N,ŋeš-gan-na,P247864,280,0,ŋešgana[pestle]N,X.2:mortar
1860,1859,P273880.293,b iii 17,ŋešgana[pestle]N,ŋeš-gan-na,P273880,293,0,ŋešgana[pestle]N,X.2:mortar
2328,2327,P312012.3,3,ŋešgana[pestle]N,ŋeš-gan-na,P312012,3,0,ŋešgana[pestle]N,X.2:mortar
4509,4508,Q000039.172,145a,ŋešgana[pestle]N,ŋeš-gana,Q000039,172,0,ŋešgana[pestle]N,X.2:mortar
4633,4632,Q000039.296,260,ŋešgana[pestle]N,ŋeš-gan-na,Q000039,296,0,ŋešgana[pestle]N,X.2:mortar


In [11]:
df[df$id_text == "P273880",]

Unnamed: 0,X,id_line,label,lemma,base,id_text,line,skip,entry,section
1578,1577,P273880.4,unknown,unknown,unknown,P273880,4,999,unknown,
1579,1578,P273880.5,a i 1',{ŋeš}x-x[NA]NA,{ŋeš}x-x,P273880,5,0,{ŋeš}x-x[NA]NA,
1580,1579,P273880.6,a i 2',{ŋeš}x-x[NA]NA,{ŋeš}x-x,P273880,6,0,{ŋeš}x-x[NA]NA,
1581,1580,P273880.7,a i 3',{ŋeš}x-x[NA]NA,{ŋeš}x-x,P273880,7,0,{ŋeš}x-x[NA]NA,
1582,1581,P273880.8,a i 4',{ŋeš}x-x-x[NA]NA,{ŋeš}x-x-x,P273880,8,0,{ŋeš}x-x-x[NA]NA,
1583,1582,P273880.9,a i 5',allanum[oak]N,{ŋeš}al-la-nu-um,P273880,9,0,allanum[oak]N,
1584,1583,P273880.10,a i 6',halub[tree]N,{ŋeš}ha-lu-ub₂,P273880,10,0,halub[tree]N,tree
1585,1584,P273880.11,a i 7',{ŋeš}x-TUG₂[NA]NA,{ŋeš}x-TUG₂,P273880,11,0,{ŋeš}x-TUG₂[NA]NA,
1586,1585,P273880.12,a i 8',{ŋeš}x-x[NA]NA,{ŋeš}x-x,P273880,12,0,{ŋeš}x-x[NA]NA,
1587,1586,P273880.13,a i 9',šimgig[tree]N,{ŋeš}šim-gig,P273880,13,0,šimgig[tree]N,tree.2


In [None]:
# Scratch check: is.na(NA) is TRUE, so this prints "Hi"
if (is.na(NA)) print("Hi")

In [None]:
# Scratch check: look up an entry that should not appear in any section
test = print(names(which(sapply(section_defs, function(x) any(x == "Hi erin")) == TRUE)))

In [None]:
length(test)