# Plain Text Editions from XML

## Configuring CITE libraries for the almond kernel

First, we make a Bintray repository that hosts the CITE libraries available to the almond kernel.

In [21]:
val myBT = coursierapi.MavenRepository.of("https://dl.bintray.com/neelsmith/maven")
interp.repositories() ++= Seq(myBT)

[36mmyBT[39m: [32mcoursierapi[39m.[32mMavenRepository[39m = MavenRepository(https://dl.bintray.com/neelsmith/maven)

Next, we bring in specific libraries from the new repository using almond's `$ivy` magic:

In [22]:
import $ivy.`edu.holycross.shot::ohco2:10.16.0`
import $ivy.`edu.holycross.shot.cite::xcite:4.1.1`
import $ivy.`edu.holycross.shot::scm:7.2.0`
import $ivy.`edu.holycross.shot::dse:5.2.2`
import $ivy.`edu.holycross.shot::citebinaryimage:3.1.1`
import $ivy.`edu.holycross.shot::citeobj:7.3.4`
import $ivy.`edu.holycross.shot::citerelations:2.5.2`
import $ivy.`edu.holycross.shot::cex:6.3.3`


[32mimport [39m[36m$ivy.$                                  
[39m
[32mimport [39m[36m$ivy.$                                     
[39m
[32mimport [39m[36m$ivy.$                              
[39m
[32mimport [39m[36m$ivy.$                              
[39m
[32mimport [39m[36m$ivy.$                                          
[39m
[32mimport [39m[36m$ivy.$                                  
[39m
[32mimport [39m[36m$ivy.$                                        
[39m
[32mimport [39m[36m$ivy.$                              
[39m

## Imports

From this point on, your notebook consists of completely generic Scala, with the CITE Libraries available to use.

In [23]:
// Import some CITE libraries
import edu.holycross.shot.cite._
import edu.holycross.shot.ohco2._
import edu.holycross.shot.scm._
import edu.holycross.shot.citeobj._
import edu.holycross.shot.citerelation._
import edu.holycross.shot.dse._
import edu.holycross.shot.citebinaryimage._

// Import some other stuff
import scala.xml.XML

import almond.display.UpdatableDisplay
import almond.interpreter.api.DisplayData.ContentType
import almond.interpreter.api.{DisplayData, OutputHandler}

import java.io.File
import java.io.PrintWriter

import scala.io.Source

[32mimport [39m[36medu.holycross.shot.cite._
[39m
[32mimport [39m[36medu.holycross.shot.ohco2._
[39m
[32mimport [39m[36medu.holycross.shot.scm._
[39m
[32mimport [39m[36medu.holycross.shot.citeobj._
[39m
[32mimport [39m[36medu.holycross.shot.citerelation._
[39m
[32mimport [39m[36medu.holycross.shot.dse._
[39m
[32mimport [39m[36medu.holycross.shot.citebinaryimage._
[39m
[32mimport [39m[36medu.holycross.shot.ohco2._

// Import some other stuff
[39m
[32mimport [39m[36mscala.xml.XML

[39m
[32mimport [39m[36malmond.display.UpdatableDisplay
[39m
[32mimport [39m[36malmond.interpreter.api.DisplayData.ContentType
[39m
[32mimport [39m[36malmond.interpreter.api.{DisplayData, OutputHandler}

[39m
[32mimport [39m[36mjava.io.File
[39m
[32mimport [39m[36mjava.io.PrintWriter

[39m
[32mimport [39m[36mscala.io.Source[39m

## Useful Functions

Save a String to a file:

In [24]:
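// Write `s` to the file at filePath + fileName. The two are concatenated as-is,
// so a non-empty filePath must end with a path separator.
def saveString(s: String, filePath: String = "", fileName: String = "temp.txt"): Unit = {
  val writer = new PrintWriter(new File(s"${filePath}${fileName}"))
  // Close the writer even if the write fails.
  try writer.write(s) finally writer.close()
}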

defined function saveString
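
As a quick check of how the path and file name are combined, here is a minimal round-trip sketch. The temp-directory target and the file name `saveString_demo.txt` are assumptions made for illustration; the function body is repeated only so the sketch runs on its own.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

// Same shape as the notebook's saveString, repeated here so the sketch is self-contained.
def saveString(s: String, filePath: String = "", fileName: String = "temp.txt"): Unit = {
  val writer = new PrintWriter(new File(s"${filePath}${fileName}"))
  try writer.write(s) finally writer.close()
}

// Hypothetical target directory: the JVM's temp directory, with a trailing separator
// because saveString simply concatenates filePath and fileName.
val tmpDir: String = System.getProperty("java.io.tmpdir") + File.separator
saveString("urn#text\n", tmpDir, "saveString_demo.txt")

// Read the file back to confirm the round trip.
val src = Source.fromFile(tmpDir + "saveString_demo.txt")
val roundTrip: String = try src.mkString finally src.close()
```

Because the two arguments are joined without a separator, a call such as `saveString(s, "../BoT_Cex", "out.cex")` would write to `../BoT_Cexout.cex`; the paths used later in this notebook end with `/` for that reason.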

Pretty Print many things:

In [25]:
// Print a value in a readable form, depending on its type.
def showMe(v: Any): Unit = v match {
  // One line per entry: count, tab, string.
  case sh: StringHistogram =>
    for (h <- sh.histogram) println(s"${h.count}\t${h.s}")
  // One line per node: passage reference, tabs, text.
  case c: Corpus =>
    for (n <- c.nodes) println(s"${n.urn.passageComponent}\t\t${n.text}")
  // Any collection (including Vector): one element per line between rules.
  case xs: Iterable[_] => println(s"""\n----\n${xs.mkString("\n")}\n----\n""")
  case _ => println(s"\n-----\n${v}\n----\n")
}

defined function showMe
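
The dispatch-by-type idea can be tried without the CITE libraries. The sketch below is an illustration only: `DemoCount`, `DemoHistogram`, and `render` are hypothetical stand-ins, and `render` returns the string instead of printing it so the result is easy to inspect.

```scala
// Hypothetical stand-ins for the library's histogram types.
final case class DemoCount(s: String, count: Int)
final case class DemoHistogram(histogram: Vector[DemoCount])

// Build the display string; a typed pattern binds the matched value, so no cast is needed.
def render(v: Any): String = v match {
  case h: DemoHistogram => h.histogram.map(c => s"${c.count}\t${c.s}").mkString("\n")
  case xs: Iterable[_]  => xs.mkString("\n----\n", "\n", "\n----\n")
  case other            => s"\n-----\n${other}\n----\n"
}

val histText: String = render(DemoHistogram(Vector(DemoCount("Tudela", 3), DemoCount("Rome", 1))))
val listText: String = render(Vector("1.1", "1.2"))
```

The order of the cases matters: the specific types come first and the catch-all comes last.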

## Set Up for Working With Base XML Texts in CEX

We will load a CITE Library of texts whose CitableNodes consist of well-formed XML. We'll use a custom class, `TextVersion`, to make it a bit easier to generate catalog information for our new plain-text editions. We need two catalog entries for each version because each contains two texts: an Introduction and the Itinerary.

Class `CatalogEntry` is part of the CITE OHCO2 library: <https://cite-architecture.github.io/cite-api-docs/ohco2/api/edu/holycross/shot/ohco2/CatalogEntry.html>.

In [26]:
val cexCatalogTemplatePath: String = "../data/cex_template.cex"

// Get it as a String

val rawCexTemplateString: String = scala.io.Source.fromFile(cexCatalogTemplatePath).mkString

// Give it a valid URN

val basicCatalogDesc: String = "Demonstration CEX of Benjamin of Tudela’s Itineraries. Plain text editions."

val basicCatalogUrn: Cite2Urn = Cite2Urn("urn:cite2:fu_elijah:cexCatalogs.2021a:bot_plainText_editions")

// String.replace substitutes literally; replaceAll would treat its arguments as regex syntax.
val cexTemplateString = rawCexTemplateString
            .replace("CEX_URN_GOES_HERE", basicCatalogUrn.toString)
            .replace("CEX_DESC_GOES_HERE", basicCatalogDesc)

val xmlCexPath: String = "../BoT_Cex/BoT_XML.cex"

val textCexPath: String = "../BoT_Cex/"
val textCexFN: String = "BoT_plain_text.cex"

case class TextVersion(
    baseMainUrn: CtsUrn,
    mainCatalogEntry: CatalogEntry, 
    baseIntroUrn: CtsUrn,
    introCatalogEntry: CatalogEntry,
    path: String = xmlCexPath
)


val engPT =  TextVersion(
    baseMainUrn = CtsUrn("urn:cts:elijahlab:benTud.itin.englishXml:"),
    baseIntroUrn = CtsUrn("urn:cts:elijahlab:benTud.itinIntro.englishXml:"),
    mainCatalogEntry = CatalogEntry(
        urn = CtsUrn("urn:cts:elijahlab:benTud.itin.english:"),
        citationScheme = "geographic narrative / section",
        lang = "eng",
        groupName = "Benjamin of Tudela",
        workTitle = "Itineraries",
        versionLabel = Some("English translation, plain-text.  Marcus Nathan Adler, The Itinerary of Benjamin of Tudela, Critical Text, Translation and Commentary. London 1907, as made available in Project Gutenberg, https://www.gutenberg.org/files/14981/14981-h/14981-h.htm"),
        exemplarLabel = None,
        online = true
    ),
    introCatalogEntry = CatalogEntry(
        urn = CtsUrn("urn:cts:elijahlab:benTud.itinIntro.english:"),
        citationScheme = "head, body",
        lang = "eng",
        groupName = "Benjamin of Tudela",
        workTitle = "Introduction to the Itineraries",
        versionLabel = Some("English translation, plain-text.  Marcus Nathan Adler, The Itinerary of Benjamin of Tudela, Critical Text, Translation and Commentary. London 1907, as made available in Project Gutenberg, https://www.gutenberg.org/files/14981/14981-h/14981-h.htm"),
        exemplarLabel = None,
        online = true
    )
)

val hebPT = TextVersion(
    baseMainUrn = CtsUrn("urn:cts:elijahlab:benTud.itin.hebrewXml:"),
    baseIntroUrn = CtsUrn("urn:cts:elijahlab:benTud.itinIntro.hebrewXml:"),
    mainCatalogEntry = CatalogEntry(
        urn = CtsUrn("urn:cts:elijahlab:benTud.itin.hebrew:"),
        citationScheme = "geographic narrative / section",
        lang = "heb",
        groupName = "Benjamin of Tudela",
        workTitle = "Itineraries",
        versionLabel = Some("Hebrew edition, plain-text.  Abraham Asher, The Itinerary of Rabbi Benjamin of Tudela. London-Berlin 1840-1841. Vol. 1"),
        exemplarLabel = None,
        online = true
    ),
    introCatalogEntry = CatalogEntry(
        urn = CtsUrn("urn:cts:elijahlab:benTud.itinIntro.hebrew:"),
        citationScheme = "head, body",
        lang = "heb",
        groupName = "Benjamin of Tudela",
        workTitle = "Introduction to the Itineraries",
        versionLabel = Some("Hebrew edition, plain-text.  Abraham Asher, The Itinerary of Rabbi Benjamin of Tudela. London-Berlin 1840-1841. Vol. 1"),
        exemplarLabel = None,
        online = true
    )
)

val araPT =  TextVersion(
    baseMainUrn = CtsUrn("urn:cts:elijahlab:benTud.itin.arabicXml:"),
    baseIntroUrn = CtsUrn("urn:cts:elijahlab:benTud.itinIntro.arabicXml:"),
    mainCatalogEntry = CatalogEntry(
        urn = CtsUrn("urn:cts:elijahlab:benTud.itin.arabic:"),
        citationScheme = "geographic narrative / section",
        lang = "ara",
        groupName = "Benjamin of Tudela",
        workTitle = "Itineraries",
        versionLabel = Some("Arabic translation, plain-text. Translated from the Hebrew original, with Introduction, Notes and Appendixes By Ezra H. Haddad, Baghdad 1945"),
        exemplarLabel = None,
        online = true
    ),
    introCatalogEntry = CatalogEntry(
        urn = CtsUrn("urn:cts:elijahlab:benTud.itinIntro.arabic:"),
        citationScheme = "head, body",
        lang = "ara",
        groupName = "Benjamin of Tudela",
        workTitle = "Introduction to the Itineraries",
        versionLabel = Some("Arabic translation, plain-text. Translated from the Hebrew original, with Introduction, Notes and Appendixes By Ezra H. Haddad, Baghdad 1945"),
        exemplarLabel = None,
        online = true
    )

)

// We'll throw those into a Vector so we can iterate across them.

val textVec: Vector[TextVersion] = Vector(engPT, hebPT, araPT)






cexCatalogTemplatePath: String = "../data/cex_template.cex"
rawCexTemplateString: String = """// 

#!cexversion
3.0

#!citelibrary
name#CEX_DESC_GOES_HERE
urn#CEX_URN_GOES_HERE
license#CC Share Alike.  For details, see more info.

"""
basicCatalogDesc: String = "Demonstration CEX of Benjamin of Tudela\u2019s Itineraries. Plain text editions."
basicCatalogUrn: Cite2Urn = Cite2Urn(
  "urn:cite2:fu_elijah:cexCatalogs.2021a:bot_plainText_editions"
)
cexTemplateString: String = """// 

#!cexversion
3.0

#!citelibrary
name#Demonstration CEX of Benjamin of Tudela’s Itineraries. Plain text editions.
urn#urn:cite2:fu_elijah:cexCatalogs.2021a:bot_plainText_editions
license#CC Share Alike.  For details, see more info.

"""
xmlCexPath: String = "../BoT_Cex/BoT_XML.cex"
textCexPath: String = 

## Load the XML Versions into a Cite Library

In [27]:
val lib: CiteLibrary = CiteLibrarySource.fromFile(xmlCexPath)

val tr: TextRepository = lib.textRepository.get

May 15, 2021 10:22:19 AM wvlet.log.Logger log
INFO: Building text repo from cex ...
May 15, 2021 10:22:19 AM wvlet.log.Logger log
INFO: Building collection repo from cex ...
May 15, 2021 10:22:19 AM wvlet.log.Logger log
INFO: Building relations from cex ...
May 15, 2021 10:22:19 AM wvlet.log.Logger log
INFO: All library components built.


[36mlib[39m: [32mCiteLibrary[39m = [33mCiteLibrary[39m(
  [32m"Demonstration CEX of Benjamin of Tudela\u2019s Itineraries. XML Editions"[39m,
  [33mCite2Urn[39m([32m"urn:cite2:fu_elijah:cexCatalogs.2021a:bot_plainText_editions"[39m),
  [32m"CC Share Alike.  For details, see more info."[39m,
  [33mVector[39m(),
  [33mSome[39m(
    [33mTextRepository[39m(
      [33mCorpus[39m(
        [33mVector[39m(
          [33mCitableNode[39m(
            [33mCtsUrn[39m([32m"urn:cts:elijahlab:benTud.itinIntro.englishXml:0"[39m),
            [32m"<head xmlns=\"http://www.tei-c.org/ns/1.0\"> THE ITINERARY OF <persName xml:id=\"recogito-9ea93359-2c2c-4427-a28b-55a60927450d\">BENJAMIN</persName> OF TUDELA. HEBREW INTRODUCTION.</head>"[39m
          ),
          [33mCitableNode[39m(
            [33mCtsUrn[39m([32m"urn:cts:elijahlab:benTud.itinIntro.englishXml:1"[39m),
            [32m"<ab xmlns=\"http://www.tei-c.org/ns/1.0\"> This is the book of travels, which was c

## A Little Validation

Let's take a moment to confirm that these texts are truly "aligned", that is, that their citation schemes match. If they aren't, we can re-edit the original XML files and use the [xml-editions.ipynb](xml-editions.ipynb) notebook to regenerate the CEX, repeating as necessary.
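
The check comes down to comparing the sequences of passage references. As a minimal sketch with made-up citation values (not taken from the actual texts), `Vector.diff` lists what one version cites and the other lacks:

```scala
// Made-up citation values for two versions of one work.
val versionA: Vector[String] = Vector("1.1", "1.2", "1.3", "2.1")
val versionB: Vector[String] = Vector("1.1", "1.3", "2.1", "2.2")

// Passages cited in A but missing from B, and the reverse.
val onlyInA: Vector[String] = versionA.diff(versionB)
val onlyInB: Vector[String] = versionB.diff(versionA)

// Equality of the two Vectors requires the same references in the same order.
val aligned: Boolean = versionA == versionB
```

Here `onlyInA` is `Vector("1.2")` and `onlyInB` is `Vector("2.2")`, so the two made-up versions are not aligned. Note that `diff` does not detect reordering; comparing the Vectors with `==` does.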

In [28]:
// We have two Works, each represented by three Versions

val itinUrns = Vector(
  CtsUrn("urn:cts:elijahlab:benTud.itin.englishXml:"),
  CtsUrn("urn:cts:elijahlab:benTud.itin.hebrewXml:"),
  CtsUrn("urn:cts:elijahlab:benTud.itin.arabicXml:")
)

val introUrns = Vector(
  CtsUrn("urn:cts:elijahlab:benTud.itinIntro.englishXml:"),
  CtsUrn("urn:cts:elijahlab:benTud.itinIntro.arabicXml:"),
  CtsUrn("urn:cts:elijahlab:benTud.itinIntro.hebrewXml:")
)

val itinTexts: Vector[Corpus] = itinUrns.map( u => {tr.corpus ~~ u })

val introTexts: Vector[Corpus] = introUrns.map( u => {tr.corpus ~~ u})

// Do the citation values for the Introduction match across three texts?

val introPassages: Vector[Vector[String]] = introTexts.map( t => {
    t.urns.map(_.passageComponent)
})

assert( introPassages(0) == introPassages(1) )
assert( introPassages(1) == introPassages(2) )

// Do the citation values for the Itinerary match across three texts?

val itinPassages: Vector[Vector[String]] = itinTexts.map( t => {
    t.urns.map(_.passageComponent)
})

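// These assertions stay disabled while the Itinerary versions differ;
// the lists printed by this cell show which passages are not yet aligned.
//assert( itinPassages(0) == itinPassages(1) )
//assert( itinPassages(1) == itinPassages(2) )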

println(s"\nThe following passages are present in ${itinUrns(0)} but not in ${itinUrns(1)}\n")

for (d <- itinPassages(0).diff(itinPassages(1))) println(d)

println(s"\nThe following passages are present in ${itinUrns(1)} but not in ${itinUrns(0)}\n")

for (d <- itinPassages(1).diff(itinPassages(0))) println(d)

println(s"\nThe following passages are present in ${itinUrns(1)} but not in ${itinUrns(2)}\n")

for (d <- itinPassages(1).diff(itinPassages(2))) println(d)

println(s"\nThe following passages are present in ${itinUrns(2)} but not in ${itinUrns(1)}\n")

for (d <- itinPassages(2).diff(itinPassages(1))) println(d)

println(s"\nThe following passages are present in ${itinUrns(0)} but not in ${itinUrns(2)}\n")

for (d <- itinPassages(0).diff(itinPassages(2))) println(d)

println(s"\nThe following passages are present in ${itinUrns(2)} but not in ${itinUrns(0)}\n")

for (d <- itinPassages(2).diff(itinPassages(0))) println(d)




The following passages are present in urn:cts:elijahlab:benTud.itin.englishXml: but not in urn:cts:elijahlab:benTud.itin.hebrewXml:

10.16
14.7
21.7
36.4
36.6
38.2

The following passages are present in urn:cts:elijahlab:benTud.itin.hebrewXml: but not in urn:cts:elijahlab:benTud.itin.englishXml:

11.8
12.3
15.6
22.8

The following passages are present in urn:cts:elijahlab:benTud.itin.hebrewXml: but not in urn:cts:elijahlab:benTud.itin.arabicXml:

9.16
10.13
10.17
11.8
14.9
16.4
16.5
21.10
22.3
22.8
36.5

The following passages are present in urn:cts:elijahlab:benTud.itin.arabicXml: but not in urn:cts:elijahlab:benTud.itin.hebrewXml:

10.16
14.7
21.7
36.4
36.6

The following passages are present in urn:cts:elijahlab:benTud.itin.englishXml: but not in urn:cts:elijahlab:benTud.itin.arabicXml:

9.16
10.13
10.17
14.9
16.4
16.5
21.10
22.3
36.5
38.2

The following passages are present in urn:cts:elijahlab:benTud.itin.arabicXml: but not in urn:cts:elijahlab:benTud.itin.englishXml:

12.3
15.6


itinUrns: Vector[CtsUrn] = Vector(
  CtsUrn("urn:cts:elijahlab:benTud.itin.englishXml:"),
  CtsUrn("urn:cts:elijahlab:benTud.itin.hebrewXml:"),
  CtsUrn("urn:cts:elijahlab:benTud.itin.arabicXml:")
)
introUrns: Vector[CtsUrn] = Vector(
  CtsUrn("urn:cts:elijahlab:benTud.itinIntro.englishXml:"),
  CtsUrn("urn:cts:elijahlab:benTud.itinIntro.arabicXml:"),
  CtsUrn("urn:cts:elijahlab:benTud.itinIntro.hebrewXml:")
)
itinTexts: Vector[Corpus] = Vector(
  Corpus(
    Vector(
      CitableNode(
        CtsUrn("urn:cts:elijahlab:benTud.itin.englishXml:1.1"),
        "<seg xml:id=\"GN1S1\" xmlns=\"http://www.tei-c.org/ns/1.0\">--I journeyed first from my native town to the city of <placeName ref=\" K6347\" 

## Make Plain-Text Editions

The steps are pretty simple:

- For each text in `textVec`,
- filter the repository's Corpus with the `~~` ("twiddle") URN-matching operator, once for `_.baseMainUrn` and once for `_.baseIntroUrn`,
- for each `CitableNode`, load the `_.text` into a Scala `xml.NodeSeq`,
- get the `.text` content,
- create a new `CitableNode` with the new URN and this plain-text content,
- wrap them all into a `Corpus`,
- combine with the `CatalogEntry`,
- serialize to CEX and save.
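
The XML-to-text step can be sketched on its own. This is a minimal illustration that assumes the scala-xml module is on the classpath (as it is in this notebook); the XML string is made up in the style of the edition's nodes.

```scala
import scala.xml.XML

// A made-up node in the style of the XML edition (TEI namespace, inline markup).
val xmlString: String =
  """<ab xmlns="http://www.tei-c.org/ns/1.0"> The book of <persName>Benjamin</persName> of <placeName>Tudela</placeName>. </ab>"""

// Parse the string, then take the concatenated text of all descendant text nodes.
val plain: String = XML.loadString(xmlString).text.trim
```

The result is `"The book of Benjamin of Tudela."`: element tags and attributes are dropped, and `trim` removes the whitespace left at either end of the element's content.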

In [29]:
// Get a vector of Corpus objects

val newCorpora: Vector[Corpus] = {
    textVec.map( tv => {
        val mainCorp: Corpus = tr.corpus ~~ tv.baseMainUrn
        val introCorp: Corpus = tr.corpus ~~ tv.baseIntroUrn
             
        val newMainNodes: Vector[CitableNode] = mainCorp.nodes.map( c => {
            val newUrn: CtsUrn = tv.mainCatalogEntry.urn.addPassage(c.urn.passageComponent)
            val xmlText: xml.NodeSeq = xml.XML.loadString(c.text)
            val newText: String = xmlText.head.text.trim
            CitableNode(newUrn, newText)
        })
        
        val newMainCorp = Corpus(newMainNodes)
        
        val newIntroNodes: Vector[CitableNode] = introCorp.nodes.map( c => {
            val newUrn: CtsUrn = tv.introCatalogEntry.urn.addPassage(c.urn.passageComponent)
            val xmlText: xml.NodeSeq = xml.XML.loadString(c.text)
            val newText: String = xmlText.head.text.trim
            CitableNode(newUrn, newText)
        })
        
        val newIntroCorp = Corpus(newIntroNodes)
        
        Vector(newIntroCorp, newMainCorp)
        
    }).flatten
}


newCorpora: Vector[Corpus] = Vector(
  Corpus(
    Vector(
      CitableNode(
        CtsUrn("urn:cts:elijahlab:benTud.itinIntro.english:0"),
        "THE ITINERARY OF BENJAMIN OF TUDELA. HEBREW INTRODUCTION."
      ),
      CitableNode(
        CtsUrn("urn:cts:elijahlab:benTud.itinIntro.english:1"),
        "This is the book of travels, which was compiled by Rabbi Benjamin, the son of Jonah, of the land of Navarre--his repose be in Paradise. The said Rabbi Benjamin set forth from Tudela, his native city, and passed through many remote countries, as is related in his book. In every place which he entered, he made a record of all that he saw, or was told of by trustworthy persons--matters not previously heard of in the land of Sepharad. Also he mentions some of the sages and illustrious men residing in each place. He brought this book with him on his ret

## Write To CEX

In [30]:
// textCexPath

def processCatalogEntry( tv: TextVersion ): String = {
    val txt: String = {
        tv.mainCatalogEntry.cex("#")
    }
    val intro: String = {
        tv.introCatalogEntry.cex("#")
    }
    txt + "\n" + intro
}

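// Note: `wrapped` is not used yet, and the texts are taken from the `newCorpora` value defined above.
def makePTEdition( tv: Vector[TextVersion], wrapped: Boolean, path: String, fn: String ): Unit = {
    
    // make the catalog: a header line, then two entries (Itinerary and Introduction) per version
    
    val ptCexCatalog: Vector[String] = {
        val entries: Vector[String] = tv.map( version => {
            processCatalogEntry(version)
        })

        s"""#!ctscatalog
urn#citationScheme#groupName#workTitle#versionLabel#exemplarLabel#online#lang""" +: entries
    }    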
    
    // make the Corpora
    
    val ptCexTexts: Vector[String] = newCorpora.map( nc => {
        val nodeVec: String = nc.cex("#") 
        Vector("#!ctsdata", nodeVec).mkString("\n")
    })

    val outputCexString: String = {
        val vec: Vector[String] = {
            Vector(cexTemplateString, ptCexCatalog.mkString("\n"), ptCexTexts.mkString("\n"))
        }

        vec.mkString("\n\n")
    }

    saveString(outputCexString, path, fn)  
    
}



makePTEdition( textVec, false, textCexPath, textCexFN)




defined function processCatalogEntry
defined function makePTEdition
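
To make the shape of the output file easier to picture, here is a minimal sketch of how the blocks fit together, without the CITE libraries. `DemoNode`, `nodesToCex`, and the `urn:cts:demo:` URNs are hypothetical stand-ins; only the block labels and the catalog header line are taken from the cell above.

```scala
// Hypothetical, simplified stand-in for a citable node: a URN string plus its text.
final case class DemoNode(urn: String, text: String)

// One line per node, with URN and text separated by the delimiter.
def nodesToCex(nodes: Vector[DemoNode], delimiter: String = "#"): String =
  nodes.map(n => s"${n.urn}${delimiter}${n.text}").mkString("\n")

val header: String = "#!cexversion\n3.0"

val catalogBlock: String = Vector(
  "#!ctscatalog",
  "urn#citationScheme#groupName#workTitle#versionLabel#exemplarLabel#online#lang"
).mkString("\n")

val dataBlock: String = "#!ctsdata\n" + nodesToCex(Vector(
  DemoNode("urn:cts:demo:group.work.version:1", "First passage."),
  DemoNode("urn:cts:demo:group.work.version:2", "Second passage.")
))

// The three parts are separated by a blank line, as in makePTEdition.
val cexDoc: String = Vector(header, catalogBlock, dataBlock).mkString("\n\n")
```

A real catalog block would also carry one `#`-delimited entry line per text under the header line, which is what `processCatalogEntry` supplies.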