# Serialize XML to CEX


## Configuring CITE libraries for almond kernel

First, we make the Bintray repository that hosts the CITE libraries available to your almond kernel.

In [27]:
val myBT = coursierapi.MavenRepository.of("https://dl.bintray.com/neelsmith/maven")
interp.repositories() ++= Seq(myBT)

myBT: coursierapi.MavenRepository = MavenRepository(https://dl.bintray.com/neelsmith/maven)

Next, we bring in specific libraries from the new repository using almond's `$ivy` magic:

In [28]:
import $ivy.`edu.holycross.shot::ohco2:10.16.0`
import $ivy.`edu.holycross.shot.cite::xcite:4.1.1`
import $ivy.`edu.holycross.shot::scm:7.2.0`
import $ivy.`edu.holycross.shot::dse:5.2.2`
import $ivy.`edu.holycross.shot::citebinaryimage:3.1.1`
import $ivy.`edu.holycross.shot::citeobj:7.3.4`
import $ivy.`edu.holycross.shot::citerelations:2.5.2`
import $ivy.`edu.holycross.shot::cex:6.3.3`

## Imports

From this point on, your notebook consists of completely generic Scala, with the CITE Libraries available to use.

In [29]:
// Import some CITE libraries
import edu.holycross.shot.cite._
import edu.holycross.shot.ohco2._
import edu.holycross.shot.scm._
import edu.holycross.shot.citeobj._
import edu.holycross.shot.citerelation._
import edu.holycross.shot.dse._
import edu.holycross.shot.citebinaryimage._

// Import some other stuff
import scala.xml.XML

import almond.display.UpdatableDisplay
import almond.interpreter.api.DisplayData.ContentType
import almond.interpreter.api.{DisplayData, OutputHandler}

import java.io.File
import java.io.PrintWriter

import scala.io.Source

## Useful Functions

Save a String

In [30]:
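def saveString(s: String, filePath: String = "", fileName: String = "temp.txt"): Unit = {
    // filePath is prepended verbatim, so it must end with a path separator (e.g. "../BoT_CEX/")
    val writer = new PrintWriter(new File(s"${filePath}${fileName}"))
    try {
        writer.write(s)
    } finally {
        writer.close() // release the file handle even if the write fails
    }
}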

defined function saveString

Pretty Print many things:

In [31]:
def showMe(v: Any): Unit = {
  v match {
    // One line per histogram entry: count, tab, string
    case sh: StringHistogram =>
      for (h <- sh.histogram) println(s"${h.count}\t${h.s}")
    // One line per citable node: passage reference, tabs, text
    case c: Corpus =>
      for (n <- c.nodes) println(s"${n.urn.passageComponent}\t\t${n.text}")
    // Any other collection (Vector included): one element per line
    case it: Iterable[_] => println(s"""\n----\n${it.mkString("\n")}\n----\n""")
    case _ => println(s"\n-----\n${v}\n----\n")
  }
}

defined function showMe

## Load Some Template Information

In [32]:
val cexCatalogTemplatePath: String = "../data/cex_template.cex"

// Get it as a String

val rawCexTemplateString: String = scala.io.Source.fromFile(cexCatalogTemplatePath).mkString

// Give it a valid URN

val basicCatalogDesc: String = "Demonstration CEX of Benjamin of Tudela’s Itineraries. XML Editions"

val basicCatalogUrn: Cite2Urn = Cite2Urn("urn:cite2:fu_elijah:cexCatalogs.2021a:bot_plainText_editions")

val cexTemplateString = rawCexTemplateString
            .replaceAll("CEX_URN_GOES_HERE",basicCatalogUrn.toString)
            .replaceAll("CEX_DESC_GOES_HERE", basicCatalogDesc)

val cexPath: String = "../BoT_CEX/"

val cexXmlArchivalFileName: String = "BoT_XML.cex"

val cexXmlDisplayFileName: String = "BoT_XML_display.cex"

cexCatalogTemplatePath: String = "../data/cex_template.cex"
rawCexTemplateString: String = """// 

#!cexversion
3.0

#!citelibrary
name#CEX_DESC_GOES_HERE
urn#CEX_URN_GOES_HERE
license#CC Share Alike.  For details, see more info.

"""
basicCatalogDesc: String = "Demonstration CEX of Benjamin of Tudela\u2019s Itineraries. XML Editions"
basicCatalogUrn: Cite2Urn = Cite2Urn(
  "urn:cite2:fu_elijah:cexCatalogs.2021a:bot_plainText_editions"
)
cexTemplateString: String = """// 

#!cexversion
3.0

#!citelibrary
name#Demonstration CEX of Benjamin of Tudela’s Itineraries. XML Editions
urn#urn:cite2:fu_elijah:cexCatalogs.2021a:bot_plainText_editions
license#CC Share Alike.  For details, see more info.

"""
cexPath: String = "../BoT_CEX/"
cexXmlArchivalFileName: String = "BoT_XML.cex"
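
The substitution above uses `replaceAll`, which treats its first argument as a regular expression and gives `$` and `\` a special meaning in the replacement string. That is harmless for the description and URN used here, but a description containing `$0` would be read as a group reference rather than literal text. A minimal sketch of the same substitution with the literal `String.replace`; the template text and the description below are invented for illustration and are not the project's real values:

```scala
// Stand-in template with the same placeholders as cex_template.cex (contents assumed).
val template: String =
  """#!citelibrary
    |name#CEX_DESC_GOES_HERE
    |urn#CEX_URN_GOES_HERE""".stripMargin

// Hypothetical description containing "$0", which replaceAll would treat as a group reference.
val desc: String = "Editions priced at $0"
val urn: String = "urn:cite2:fu_elijah:cexCatalogs.2021a:bot_plainText_editions"

// String.replace substitutes literally: no regex, no special replacement characters.
val filled: String = template
  .replace("CEX_DESC_GOES_HERE", desc)
  .replace("CEX_URN_GOES_HERE", urn)

println(filled)
```

If a regular expression is really needed, `java.util.regex.Matcher.quoteReplacement` can be applied to the replacement string instead.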

## Set Up for Working With Base XML Files

We will set things up so that we can iterate through the master XML files. We need two catalog entries for each XML file, because each file contains two texts: an Introduction and the Itinerary.

Class `CatalogEntry` is part of the CITE OHCO2 library: <https://cite-architecture.github.io/cite-api-docs/ohco2/api/edu/holycross/shot/ohco2/CatalogEntry.html>.

In [33]:
case class TextVersion(
    path: String,
    mainCatalogEntry: CatalogEntry, 
    introCatalogEntry: CatalogEntry
)


val engXml =  TextVersion(
    path = "../data/BTAdler20210419.xml",
    mainCatalogEntry = CatalogEntry(
        urn = CtsUrn("urn:cts:elijahlab:benTud.itin.englishXml:"),
        citationScheme = "geographic narrative / section",
        lang = "eng",
        groupName = "Benjamin of Tudela",
        workTitle = "Itineraries",
        versionLabel = Some("English translation, XML. Marcus Nathan Adler, The Itinerary of Benjamin of Tudela, Critical Text, Translation and Commentary. London 1907, as made available in Project Gutenberg, https://www.gutenberg.org/files/14981/14981-h/14981-h.htm "),
        exemplarLabel = None,
        online = true
    ),
    introCatalogEntry = CatalogEntry(
        urn = CtsUrn("urn:cts:elijahlab:benTud.itinIntro.englishXml:"),
        citationScheme = "head, body",
        lang = "eng",
        groupName = "Benjamin of Tudela",
        workTitle = "Introduction to the Itineraries",
        versionLabel = Some("English translation, XML. Marcus Nathan Adler, The Itinerary of Benjamin of Tudela, Critical Text, Translation and Commentary. London 1907, as made available in Project Gutenberg, https://www.gutenberg.org/files/14981/14981-h/14981-h.htm "),
        exemplarLabel = None,
        online = true
    )
)

val hebXml = TextVersion(
    path = "../data/BTAsher20210429.xml",
    mainCatalogEntry = CatalogEntry(
        urn = CtsUrn("urn:cts:elijahlab:benTud.itin.hebrewXml:"),
        citationScheme = "geographic narrative / section",
        lang = "heb",
        groupName = "Benjamin of Tudela",
        workTitle = "Itineraries",
        versionLabel = Some("Hebrew edition, XML. Abraham Asher, The Itinerary of Rabbi Benjamin of Tudela. London-Berlin 1840-1841. Vol. 1"),
        exemplarLabel = None,
        online = true
    ),
    introCatalogEntry = CatalogEntry(
        urn = CtsUrn("urn:cts:elijahlab:benTud.itinIntro.hebrewXml:"),
        citationScheme = "head, body",
        lang = "heb",
        groupName = "Benjamin of Tudela",
        workTitle = "Introduction to the Itineraries",
        versionLabel = Some("Hebrew edition, XML. Abraham Asher, The Itinerary of Rabbi Benjamin of Tudela. London-Berlin 1840-1841. Vol. 1"),
        exemplarLabel = None,
        online = true
    )
)

val araXml =  TextVersion(
    path = "../data/BTHaddad20210425.xml",
    mainCatalogEntry = CatalogEntry(
        urn = CtsUrn("urn:cts:elijahlab:benTud.itin.arabicXml:"),
        citationScheme = "geographic narrative / section",
        lang = "ara",
        groupName = "Benjamin of Tudela",
        workTitle = "Itineraries",
        versionLabel = Some("Arabic translation, XML. Translated from the Hebrew original, with Introduction, Notes and Appendixes By Ezra H. Haddad, Baghdad 1945"),
        exemplarLabel = None,
        online = true
    ),
    introCatalogEntry = CatalogEntry(
        urn = CtsUrn("urn:cts:elijahlab:benTud.itinIntro.arabicXml:"),
        citationScheme = "head, body",
        lang = "ara",
        groupName = "Benjamin of Tudela",
        workTitle = "Introduction to the Itineraries",
        versionLabel = Some("Arabic translation, XML. Translated from the Hebrew original, with Introduction, Notes and Appendixes By Ezra H. Haddad, Baghdad 1945"),
        exemplarLabel = None,
        online = true
    )

)

// We'll throw those into a Vector so we can iterate across them.

val textVec: Vector[TextVersion] = Vector(engXml, hebXml, araXml)





defined class TextVersion
engXml: TextVersion = TextVersion(
  "../data/BTAdler20210419.xml",
  CatalogEntry(
    CtsUrn("urn:cts:elijahlab:benTud.itin.englishXml:"),
    "geographic narrative / section",
    "eng",
    "Benjamin of Tudela",
    "Itineraries",
    Some(
      "English translation, XML. Marcus Nathan Adler, The Itinerary of Benjamin of Tudela, Critical Text, Translation and Commentary. London 1907, as made available in Project Gutenberg, https://www.gutenberg.org/files/14981/14981-h/14981-h.htm "
    ),
    None,
    true
  ),
  CatalogEntry(
    CtsUrn("urn:cts:elijahlab:benTud.itinIntro.englishXml:"),
    "head, body",
    "eng",
    "Benjamin of Tudela",
    "Introduction to the Itineraries",
    Some(
      "English translati

## Load XML

In [34]:
val engTemplatePath: String = "../data/BTAdler20210419.xml"

// Named engXmlDoc so that it does not shadow the engXml TextVersion defined above
val engXmlDoc: xml.Elem = XML.loadFile(engTemplatePath)


engTemplatePath: String = "../data/BTAdler20210419.xml"
engXmlDoc: xml.Elem = <TEI xmlns="http://www.tei-c.org/ns/1.0">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title><persName xml:id="recogito-be8535af-1af8-4169-9717-6d69a9caee34">Benjamin</persName> of Tudela</title>
         </titleStmt>
         <publicationStmt>
            <p>Publication Information</p>
         </publicationStmt>
         <sourceDesc>
            <listPlace>
               <place type="point" xml:id="K6347">
                  <location>
                     <geo>41.65606 -0.87734</geo>
                  </location>
                  <idno type="URI">http://geo-kima.org/Place/6347</idno>
               </place>
               <place type="line" xml:id="GN3123754">
                  <idno type="URI">https://www.geonames.org/3123754/</idno>
               </place>
               <place type="point" xml:id="K9805">
                  <l

## Process the Texts

### A Method for Creating a Catalog Entry

In [35]:
def processCatalogEntry( tv: TextVersion ): String = {
    val txt: String = {
        tv.mainCatalogEntry.cex("#")
    }
    val intro: String = {
        tv.introCatalogEntry.cex("#")
    }
    txt + "\n" + intro
}

defined function processCatalogEntry
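
To make the shape of a catalog line concrete, here is a rough sketch of what one delimited row might look like. `CatalogRow` and `toCex` are hypothetical stand-ins, not the ohco2 `CatalogEntry` class or its `cex` method, and the column order is an assumption taken from the `#!ctscatalog` header line used later in this notebook. The version label is shortened for readability.

```scala
// Hypothetical stand-in for the library's CatalogEntry; NOT the ohco2 class.
final case class CatalogRow(
    urn: String,
    citationScheme: String,
    groupName: String,
    workTitle: String,
    versionLabel: Option[String],
    exemplarLabel: Option[String],
    online: Boolean,
    lang: String
) {
  // Column order assumed from the header:
  // urn#citationScheme#groupName#workTitle#versionLabel#exemplarLabel#online#lang
  def toCex(delimiter: String = "#"): String =
    Seq(
      urn,
      citationScheme,
      groupName,
      workTitle,
      versionLabel.getOrElse(""),  // a missing label becomes an empty column
      exemplarLabel.getOrElse(""),
      online.toString,
      lang
    ).mkString(delimiter)
}

val row = CatalogRow(
  urn = "urn:cts:elijahlab:benTud.itin.englishXml:",
  citationScheme = "geographic narrative / section",
  groupName = "Benjamin of Tudela",
  workTitle = "Itineraries",
  versionLabel = Some("English translation, XML"), // shortened for this sketch
  exemplarLabel = None,
  online = true,
  lang = "eng"
)

println(row.toCex())
```

The real output of `cex("#")` is determined by the library; comparing one generated line against the header is a quick way to confirm the column order.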

### A Method for Processing Intros

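First, a method for sanitizing nodes. What do we want to remove?

For starters, we replace newlines and tabs with spaces, and we replace any "#" in the text itself with a space, since "#" is our CEX delimiter. Runs of spaces are then collapsed to a single space. When `wrapped` is false, the node's outermost opening and closing tags are stripped as well, leaving only its content (including any internal tags).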


In [36]:
def sanitizeNode(n: xml.Node, wrapped: Boolean): String = {
    if (wrapped) {
        n.toString.replaceAll("[\n\t#]", " ").replaceAll(" +", " ")
    } else {
        n.toString.replaceAll("[\n\t#]", " ").replaceAll(" +", " ").replaceAll("^<[^>]+>","").replaceAll("<[^>]+>$","")
    }
}

defined function sanitizeNode
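
To see what the two modes produce, here is a small sketch that applies the same regular expressions to a plain `String` rather than to an `xml.Node`, so it runs without the scala-xml library. `sanitizeString` is a stand-in introduced for this illustration, and the sample `<seg>` (id and text) is invented.

```scala
// String-based stand-in for sanitizeNode: it starts from the serialized node (n.toString).
def sanitizeString(serialized: String, wrapped: Boolean): String = {
  val flattened = serialized
    .replaceAll("[\n\t#]", " ") // newlines, tabs and the CEX delimiter become spaces
    .replaceAll(" +", " ")      // collapse runs of spaces
  if (wrapped) {
    flattened
  } else {
    flattened
      .replaceAll("^<[^>]+>", "") // drop the outermost opening tag
      .replaceAll("<[^>]+>$", "") // drop the outermost closing tag
  }
}

// Hypothetical <seg>; the id and the text are invented for illustration.
val sample = "<seg xml:id=\"GN1S2\">From\n\t<placeName>Tudela</placeName> # onward</seg>"

println(sanitizeString(sample, wrapped = true))
println(sanitizeString(sample, wrapped = false))
```

Note that the internal `<placeName>` element survives in both modes; only the outermost wrapper is removed when `wrapped` is false.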

Now on to processing…

In [37]:

def processIntro(tv: TextVersion, wrapped: Boolean): String = {
    val thisText: TextVersion = tv
    
    // Suck in the whole XML
    val allXml: xml.Elem = XML.loadFile(thisText.path)
    
    // Get a NodeSeq of the <front> element
    val front: xml.NodeSeq = allXml \\ "front"

    // A little sanity check… there should be only one…
    assert(front.size == 1)

    // Filter out everything but <head> and <ab> elements
    val keepers: Vector[String] = Vector("head", "ab")
    val introElements: xml.NodeSeq = front.head.child.filter(n => {
        keepers.contains(n.label)
    })
        
    // Sanity Check… there should be only two elements left
    assert(introElements.size == 2)
 

    val introCex: String = introElements.zipWithIndex.toVector.map( t => {
        val newUrn = thisText.introCatalogEntry.urn.addPassage(t._2.toString)
        val newText = sanitizeNode(t._1, wrapped)
        newUrn.toString + "#" + newText
    }).mkString("\n")
                                                             
      "\n\n#!ctsdata\n" + introCex // return it!                                                      

}



defined function processIntro
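
One detail of `processIntro` worth spelling out: `zipWithIndex` counts from zero, so the `<head>` and `<ab>` of an introduction are cited as passages `0` and `1`. The sketch below approximates `addPassage` with plain string concatenation, which is an assumption about how the library builds the URN; the element content is invented.

```scala
// Version-level URN of the English introduction, as used in the catalog entries above.
val introUrn = "urn:cts:elijahlab:benTud.itinIntro.englishXml:"

// Hypothetical, already-sanitized <head> and <ab> content.
val introTexts = Vector("<head>Introduction</head>", "<ab>Text of the introduction.</ab>")

// zipWithIndex counts from 0, so the passages are cited as "0" and "1".
val introLines: Vector[String] = introTexts.zipWithIndex.map { case (text, index) =>
  s"${introUrn}${index}#${text}"
}

println(introLines.mkString("\n"))
```

If one-based citations are preferred, adding 1 to the index before building the URN would do it.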

### A Method for Processing the Main Texts

In [38]:
def idToCitation( n: xml.Node ): String = {
    val citeAttr: String = (n.attributes.head.value.toString).replaceAll("GN","").replaceAll("S",".")

    //println(citeAttr)
    citeAttr
    
}

def processSeg( n: xml.Node, u: CtsUrn, tl: String, wrapped: Boolean ): String = {
    
    val segs: xml.NodeSeq = n \\ "seg"
    
    val stringVec: Vector[String] = segs.zipWithIndex.map( s => {
        val segN: String = (s._2 + 1).toString
        //val newPassage: String = tl + "." + segN
        val newPassage: String = idToCitation(s._1)
        val newUrn: CtsUrn = u.addPassage(newPassage)
        val newText: String = sanitizeNode(s._1, wrapped)
        newUrn.toString + "#" + newText
    }).toVector
    
    stringVec.mkString("\n")
}

def processMain(tv: TextVersion, wrapped: Boolean): String = {
    val thisText: TextVersion = tv
    
    // Suck in the whole XML
    val allXml: xml.Elem = XML.loadFile(thisText.path)
    
    // Get a NodeSeq of the <body> element
    val body: xml.NodeSeq = allXml \\ "body"

    // A little sanity check… there should be only one…
    assert(body.size == 1)

    // Filter out everything but <ab> elements
    val keepers: Vector[String] = Vector("ab")
    val mainElements: xml.NodeSeq = body.head.child.filter(n => {
        keepers.contains(n.label)
    })
        
    


    val mainCex: String = mainElements.zipWithIndex.toVector.map( t => {
        val newTopLevel: String = (t._2 + 1).toString
                    
        val thisUrn: CtsUrn = thisText.mainCatalogEntry.urn
        processSeg( t._1, thisUrn, newTopLevel, wrapped)
    }).mkString("\n")
                                                             
      "\n\n#!ctsdata\n" + mainCex + "\n\n" // return it!                                                      

}



defined function idToCitation
defined function processSeg
defined function processMain
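
The passage component of each main-text URN comes from `idToCitation`, which reads the first attribute of the `<seg>` (so it assumes the identifier is that first attribute), removes `GN`, and turns every `S` into a period. A string-only sketch of that mapping, using a made-up id since the real ids are not shown in this notebook:

```scala
// Same two substitutions as idToCitation, applied to a plain String instead of an attribute value.
def idStringToCitation(id: String): String =
  id.replaceAll("GN", "").replaceAll("S", ".")

// Hypothetical id; the real ids come from the first attribute of each <seg>.
val citation = idStringToCitation("GN3123754S2")

println(citation) // 3123754.2
```

Because every capital `S` is replaced, an id with an `S` anywhere else would gain an extra citation level.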

## Do It! (Twice)

We actually want two XML editions of our texts. One will be for further processing, with each `citableNode` being well-formed XML (wrapped in a single root element). But, specific to this project and to the current state of the CITE tools, we want another with *internal* XML tags present in each `citableNode`, but without being wrapped in an XML element. 

Why? It has to do with how CITE-App determines whether a `citableNode`'s text-content is a left-to-right orthography or a right-to-left orthography. Currently, it (dumbly) grabs the first ten characters of a passage, and if any of them are in the Hebrew, Arabic, or Persian blocks of Unicode, it declares the whole passage RTL. There is probably a better way to do that, but here we are.
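
A rough approximation of that heuristic shows why the wrapper matters. This is a sketch of the behavior described above, not CITE-App's actual code; `looksRtl` is a name invented here, and Persian is covered by the Arabic block because it is written in Arabic script.

```scala
import java.lang.Character.UnicodeBlock

// Assumed approximation of the heuristic described above; NOT CITE-App's actual code.
def looksRtl(passage: String): Boolean = {
  val rtlBlocks: Set[UnicodeBlock] = Set(UnicodeBlock.HEBREW, UnicodeBlock.ARABIC)
  // Only the first ten characters are inspected.
  passage.take(10).exists(c => rtlBlocks.contains(UnicodeBlock.of(c)))
}

// Hypothetical Hebrew passage, written with Unicode escapes to keep the source left-to-right.
val unwrapped = "\u05E1\u05E4\u05E8 \u05DE\u05E1\u05E2\u05D5\u05EA"
// The same passage wrapped in an element: its first ten characters are all Latin markup.
val wrapped = "<seg xml:id=\"GN1S1\">" + unwrapped + "</seg>"

println(looksRtl(unwrapped)) // true
println(looksRtl(wrapped))   // false
```

With the wrapper in place, the opening tag alone fills the ten-character window, so a Hebrew or Arabic passage would be displayed left-to-right.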

So this will build two CEX files, one named whatever `cexXmlArchivalFileName` is set to, above, and another named whatever `cexXmlDisplayFileName` is set to, above.

In [39]:
def makeXmlEdition( tv: Vector[TextVersion], wrapped: Boolean, path: String, fn: String ): Unit = {
    
    val xmlCexStrings: Vector[String] = {
        // Do Intros First
        val intros: Vector[String] = tv.map( tv => {
            processIntro(tv, wrapped)
        })
    
        val mains: Vector[String] = tv.map( tv => {
            processMain(tv, wrapped)
        })
    
        Vector("\n\n") ++ intros ++ Vector("\n\n") ++ mains
    }
    
    val xmlCexCatalog: Vector[String] = {
        val intros: Vector[String] = tv.map( tv => {
            processCatalogEntry(tv)
        })

        s"""#!ctscatalog
urn#citationScheme#groupName#workTitle#versionLabel#exemplarLabel#online#lang""" +: intros
    }    
    
    // cexTemplateString

    val outputCexString: String = {
        val vec: Vector[String] = {
            Vector(cexTemplateString) ++ xmlCexCatalog ++ xmlCexStrings
        }

        vec.mkString("\n")
    }

    saveString(outputCexString, path, fn)  
    
}


makeXmlEdition( textVec, true, cexPath, cexXmlArchivalFileName)

makeXmlEdition( textVec, false, cexPath, cexXmlDisplayFileName)



defined function makeXmlEdition
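
A quick sanity check on the result is to count the block labels, that is, the lines starting with `#!`. Since `processIntro` and `processMain` each emit their own `#!ctsdata` label, three text versions should yield six `#!ctsdata` blocks and one `#!ctscatalog` block per file. The sketch below runs the count on a small invented CEX string rather than on the real output:

```scala
// Hypothetical miniature CEX text with the same kinds of blocks that makeXmlEdition writes.
val sampleCex: String =
  """#!cexversion
    |3.0
    |
    |#!ctscatalog
    |urn#citationScheme#groupName#workTitle#versionLabel#exemplarLabel#online#lang
    |
    |#!ctsdata
    |urn:cts:elijahlab:benTud.itinIntro.englishXml:0#<head>Introduction</head>
    |
    |#!ctsdata
    |urn:cts:elijahlab:benTud.itin.englishXml:1.1#<seg>First section</seg>
    |""".stripMargin

// Block labels are the lines that start with "#!"; count how often each one occurs.
val blockCounts: Map[String, Int] = sampleCex.linesIterator
  .filter(_.startsWith("#!"))
  .toVector
  .groupBy(identity)
  .map { case (label, hits) => label -> hits.size }

println(blockCounts)
```

To check a generated file instead, `sampleCex` could be replaced with `scala.io.Source.fromFile(cexPath + cexXmlArchivalFileName).mkString`.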