# Making a CEX File from a Text File




## Configuring CITE libraries for almond kernel

First, we make a Bintray Maven repository that hosts the CITE libraries available to the almond kernel.

In [1]:
val myBT = coursierapi.MavenRepository.of("https://dl.bintray.com/neelsmith/maven")
interp.repositories() ++= Seq(myBT)

myBT: coursierapi.MavenRepository = MavenRepository(https://dl.bintray.com/neelsmith/maven)

Next, we bring in specific libraries from the new repository using almond's `$ivy` magic:

In [2]:
import $ivy.`edu.holycross.shot::ohco2:10.16.0`
import $ivy.`edu.holycross.shot.cite::xcite:4.1.1`
import $ivy.`edu.holycross.shot::scm:7.2.0`
import $ivy.`edu.holycross.shot::dse:5.2.2`
import $ivy.`edu.holycross.shot::citebinaryimage:3.1.1`
import $ivy.`edu.holycross.shot::citeobj:7.3.4`
import $ivy.`edu.holycross.shot::citerelations:2.5.2`
import $ivy.`edu.holycross.shot::cex:6.3.3`
import $ivy.`edu.holycross.shot::greek:2.3.3`



## Imports

From this point on, your notebook consists of completely generic Scala, with the CITE Libraries available to use.

In [3]:
// Import some CITE libraries
import edu.holycross.shot.cite._
import edu.holycross.shot.ohco2._
import edu.holycross.shot.scm._
import edu.holycross.shot.citeobj._
import edu.holycross.shot.citerelation._
import edu.holycross.shot.dse._
import edu.holycross.shot.citebinaryimage._
import edu.holycross.shot.ohco2._
import edu.holycross.shot.greek._

import almond.display.UpdatableDisplay
import almond.interpreter.api.DisplayData.ContentType
import almond.interpreter.api.{DisplayData, OutputHandler}

import java.io.File
import java.io.PrintWriter

import scala.io.Source



## Useful Functions

Save a string:

In [4]:
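def saveString(s: String, filePath: String = "", fileName: String = "temp.txt"): Unit = {
  // filePath is prepended verbatim, so it must be empty or end with a path separator
  val writer = new PrintWriter(new File(s"${filePath}${fileName}"))
  try writer.write(s)
  finally writer.close() // close the file even if the write fails
}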

defined function saveString
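
A quick round-trip check of `saveString` can confirm that the path handling works as expected. This is only a sketch: it assumes the system temp directory is writable, the file name `saveString_demo.txt` is made up for illustration, and the function is repeated (in an equivalent form that closes the writer in a `finally` block) so the cell runs on its own.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

// Equivalent to the definition above, repeated so this cell is self-contained
def saveString(s: String, filePath: String = "", fileName: String = "temp.txt"): Unit = {
  val writer = new PrintWriter(new File(s"${filePath}${fileName}"))
  try writer.write(s) finally writer.close()
}

// java.io.tmpdir may or may not end with a separator, so normalize it
val tmp: String = System.getProperty("java.io.tmpdir")
val demoDir: String = if (tmp.endsWith(File.separator)) tmp else tmp + File.separator

// Hypothetical file name, used only for this demonstration
saveString("urn#text\n", demoDir, "saveString_demo.txt")

// Read the file back to confirm what was written
val src = Source.fromFile(demoDir + "saveString_demo.txt")
val roundTrip: String = try src.mkString finally src.close()
```

Because `filePath` and `fileName` are simply concatenated, a directory passed without a trailing separator would silently produce a differently named file in the parent directory.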

Pretty-print several kinds of values:

In [5]:
def showMe(v: Any): Unit = {
  v match {
    // Histogram: one "count<TAB>string" line per entry
    case sh: StringHistogram =>
      for (h <- sh.histogram) {
        println(s"${h.count}\t${h.s}")
      }
    // Corpus: one "passage<TAB><TAB>text" line per citable node
    case c: Corpus =>
      for (n <- c.nodes) {
        println(s"${n.urn.passageComponent}\t\t${n.text}")
      }
    // Collections: one element per line, set off by rules
    case vec: Vector[_] => println(s"""\n----\n${vec.mkString("\n")}\n----\n""")
    case it: Iterable[_] => println(s"""\n----\n${it.mkString("\n")}\n----\n""")
    // Anything else: fall back to toString
    case _ => println(s"\n-----\n${v}\n----\n")
  }
}

defined function showMe

## Load Library

To start, we load a version-level, bilingual CEX file:

In [6]:
val cexPath = "cex/aristot_poetics.cex"
val lib = CiteLibrary(scala.io.Source.fromFile(cexPath).mkString)

Feb 20, 2020 11:52:38 PM wvlet.log.Logger log
INFO: Building text repo from cex ...
Feb 20, 2020 11:52:38 PM wvlet.log.Logger log
INFO: Building collection repo from cex ...
Feb 20, 2020 11:52:38 PM wvlet.log.Logger log
INFO: Building relations from cex ...
Feb 20, 2020 11:52:38 PM wvlet.log.Logger log
INFO: All library components built.


cexPath: String = "cex/aristot_poetics.cex"
lib: CiteLibrary = CiteLibrary(
  "CEX library",
  Cite2Urn("urn:cite2:cex:TEMPCOLL.TEMPVERSION:TEMP_ID"),
  "CC 3.0 NC-BY",
  Vector(),
  Some(
    TextRepository(
      Corpus(
        Vector(
          CitableNode(
            CtsUrn("urn:cts:greekLit:tlg0086.tlg034.bekker_fu:1.head"),
            "\u03a0\u0395\u03a1\u0399 \u03a0\u039f\u0399\u0397\u03a4\u0399\u039a\u0397\u03a3."
          ),
          CitableNode(
            CtsUrn("urn:cts:greekLit:tlg0086.tlg034.bekker_fu:1.1"),
            "\u03a0\u03b5\u03c1\u03b9 \u03c0\u03bf\u03b9\u03b7\u03c4\u03b9\u03ba\u1fc6\u03c2 \u03b1\u1f50\u03c4\u1fc6\u03c2 \u03c4\u03b5 \u03ba\u03b1\u1f76 \u03c4\u1ff6\u03bd \u03b5\u1f30\u03b4\u1ff6\u03bd \u03b1\u1f50\u03c4\u1fc6\u03c2, \u1

Pull out the parts of the library we will work with:

In [7]:
lazy val tr: TextRepository = lib.textRepository.get
lazy val corp: Corpus = tr.corpus
lazy val cat: Catalog = tr.catalog
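
The library output above shows the text repository wrapped in `Some(...)`, so `lib.textRepository` is an `Option`, and calling `.get` on it throws a `NoSuchElementException` if a CEX file contains no text data. A minimal sketch of handling the empty case explicitly, using a plain `Option[String]` as a hypothetical stand-in for the real value:

```scala
// Hypothetical stand-in for an Option such as lib.textRepository;
// None models a library that has no text repository
val maybeRepo: Option[String] = None

// Pattern matching covers both cases instead of throwing on None
val description: String = maybeRepo match {
  case Some(repo) => s"text repository found: ${repo}"
  case None       => "no text repository in this library"
}

println(description)
```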

In [8]:
val engUrn = CtsUrn("urn:cts:greekLit:tlg0086.tlg034.fyfe_fu:")
val grcUrn = CtsUrn("urn:cts:greekLit:tlg0086.tlg034.bekker_fu:")
val urn = CtsUrn("urn:cts:greekLit:tlg0086.tlg034:")

val engCorp = corp ~~ engUrn
val grcCorp = corp ~~ grcUrn

engUrn: CtsUrn = CtsUrn("urn:cts:greekLit:tlg0086.tlg034.fyfe_fu:")
grcUrn: CtsUrn = CtsUrn("urn:cts:greekLit:tlg0086.tlg034.bekker_fu:")
urn: CtsUrn = CtsUrn("urn:cts:greekLit:tlg0086.tlg034:")
engCorp: Corpus = Corpus(
  Vector(
    CitableNode(
      CtsUrn("urn:cts:greekLit:tlg0086.tlg034.fyfe_fu:1.head"),
      "Poetics"
    ),
    CitableNode(
      CtsUrn("urn:cts:greekLit:tlg0086.tlg034.fyfe_fu:1.1"),
      "Let us here deal with Poetry, its essence and its several species, with the characteristic function of each species and the way in which plots must be constructed if the poem is to be a success; and also with the number and character of the constituent parts of a poem, and similarly with all other matters proper to this same inquiry; and let u

A little validation:

In [9]:
val grcPsgs = grcCorp.urns.map(_.collapsePassageTo(2).passageComponent)
val engPsgs = engCorp.urns.map(_.collapsePassageTo(2).passageComponent)

grcPsgs: Vector[String] = Vector(
  "1.head", "1.1", "1.2", "1.3", "1.4", "1.5", "1.6", "1.7", "1.8", "1.9", "1.10", "1.11", "1.12", "1.13", "1.14",
  "2.1", "2.2", "2.3", "2.4", "2.5", "2.6", "2.7",
  "3.1", "3.2", "3.3", "3.4", "3.5", "3.6", "3.7",
  "4.1", "4.2", "4.3", "4.4", "4.5", "4.6", "4.7", "4.8", "4.9",
...
engPsgs: Vector[String] = Vector(
  "1.head", "1.1", "1.2", "1.3", "1.4", "1.5",

In [10]:
assert( grcPsgs.diff(engPsgs).size == 0)
assert(engPsgs.distinct.diff(grcPsgs).size == 0)
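
When one of these assertions fails, it helps to see which passages differ. The sketch below uses two made-up passage lists (`grcDemo` and `engDemo` are hypothetical, not the notebook's data) to show what `diff` returns in each direction:

```scala
// Hypothetical passage lists; the English list is missing "1.2"
val grcDemo: Vector[String] = Vector("1.head", "1.1", "1.2", "1.3")
val engDemo: Vector[String] = Vector("1.head", "1.1", "1.3")

// Elements of the left-hand vector that have no counterpart in the right-hand one
val onlyInGrc: Vector[String] = grcDemo.diff(engDemo)
val onlyInEng: Vector[String] = engDemo.diff(grcDemo)

// Report what differs rather than just failing
if (onlyInGrc.nonEmpty) println(s"Missing from English: ${onlyInGrc.mkString(", ")}")
if (onlyInEng.nonEmpty) println(s"Missing from Greek: ${onlyInEng.mkString(", ")}")
```

Note that `diff` is a multiset difference: a reference that appears twice on one side and once on the other also shows up in the result, which is why applying `distinct` first changes what the check means.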

## Divide Both Corpora into Chunks of Equal Size

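The following tail-recursive function does that. It adds citable nodes to a chunk until the chunk's character count reaches the target, so the chunks are approximately, not exactly, equal in size: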

In [11]:
def equalSize( corp: Corpus, target: Int = 5000 ): Vector[Corpus] = {
		// This is the tail-recursive bit…
		// 'resultCorpusVec' is the "accumulator"
		// 'whatsLeft' is the unprocessed part of the original Corpus
		// 'target' is the number of chars we want to aim for
		def recurseEqualSize( resultCorpusVec: Vector[Corpus], whatsLeft: Corpus, target: Int): Vector[Corpus] = {

			// First, we see the size of the latest Corpus in the list
			//		We start with an empty accumulator, so we need to check for that possibility
			val workingCorpusSize: Int = {
				if (resultCorpusVec.size == 0) 0
				else {
					// Take the last Corpus in the list; count its characters.
					resultCorpusVec.last.nodes.map(_.text).mkString.size
				}
			}

			/* Three possibilities…
		 		 Case 1. There is only one CitableNode left in whatsLeft
		 		 Case 2. We've just met the target
		 		 Case 3. We haven't met the target
			*/
			if ( whatsLeft.size == 1) { 
				// Case 1: Append the last node as its own final Corpus and return (no further recursion)
				val newResultVec: Vector[Corpus] = resultCorpusVec :+ whatsLeft
				newResultVec
			} else if (workingCorpusSize >= target) { 
				// Case 2: Recurse with an empty final Corpus as the '.last' in results
				val emptyNewCorpus: Corpus = Corpus(Vector[CitableNode]())
				val newResultVec: Vector[Corpus] = resultCorpusVec :+ emptyNewCorpus
				recurseEqualSize( newResultVec, whatsLeft, target)
			} else {
				// Case 3: Add one more node to the latest Corpus, recurse
				val workingCorpus: Corpus = {
					// The very first time through, we'll have an empty Corpus, so check for this
					if ( resultCorpusVec.size == 0 ) {
						Corpus(Vector[CitableNode]())
					} else {
						resultCorpusVec.last	
					}
				}
				// All the untreated citable nodes…
				val poolNodes: Vector[CitableNode] = whatsLeft.nodes
				// Add the next node to our working corpus
				val expandedCorp: Corpus = workingCorpus ++ Corpus(Vector(poolNodes.head))
				// Remove that node from whatsLeft
				val newWhatsLeft: Corpus = Corpus(poolNodes.tail)
				// Add the new version of the working corpus to results
				val newResultCorpusVec: Vector[Corpus] = resultCorpusVec.dropRight(1) :+ expandedCorp
				// Recurse!
				recurseEqualSize( newResultCorpusVec, newWhatsLeft, target)
			}
		}

		// Invoke the recursive function for the first time.
		val answer: Vector[Corpus] = recurseEqualSize( Vector[Corpus](), corp, target)
		answer
	}

defined function equalSize
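
The accumulator pattern is easier to see without the CITE types. The following is a simplified sketch, not the function above: `Node` is a hypothetical stand-in for a citable node, the current chunk is carried as its own parameter, and two details differ deliberately (empty input is handled, and the final node joins the current chunk instead of always forming a chunk of its own).

```scala
import scala.annotation.tailrec

// Hypothetical stand-in for a citable node: a passage reference plus its text
final case class Node(ref: String, text: String)

def chunkBySize(nodes: Vector[Node], target: Int): Vector[Vector[Node]] = {
  // 'done' holds finished chunks, 'current' the chunk being filled,
  // 'rest' the nodes not yet assigned to any chunk
  @tailrec
  def loop(done: Vector[Vector[Node]], current: Vector[Node], rest: Vector[Node]): Vector[Vector[Node]] = {
    val currentSize = current.map(_.text.length).sum
    if (rest.isEmpty) {
      // Nothing left: flush the chunk in progress, if any
      if (current.isEmpty) done else done :+ current
    } else if (currentSize >= target) {
      // Target met: close this chunk and start an empty one
      loop(done :+ current, Vector.empty, rest)
    } else {
      // Target not met: move one node from 'rest' into the current chunk
      loop(done, current :+ rest.head, rest.tail)
    }
  }
  loop(Vector.empty, Vector.empty, nodes)
}

// Made-up sample data, sized so that a target of 6 characters yields three chunks
val demoNodes = Vector(
  Node("1.1", "aaaa"), Node("1.2", "bbbb"), Node("1.3", "cc"),
  Node("2.1", "dddddd"), Node("2.2", "e")
)
val demoChunks = chunkBySize(demoNodes, 6)
```

Because a chunk is closed only after it reaches the target, each chunk except possibly the last holds at least `target` characters, and no node is ever split across chunks.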

A function that builds the top of a CEX file: the version, library, and catalog blocks, ending with the `#!ctsdata` label:

In [12]:
def makeCexTop( urns: Vector[CtsUrn], cat: Catalog ): String = {
    val cexTop: String = """
#!cexversion
3.0

#!citelibrary
name#CEX library
urn#urn:cite2:cex:TEMPCOLL.TEMPVERSION:TEMP_ID
license#CC 3.0 NC-BY

#!ctscatalog
urn#citationScheme#groupName#workTitle#versionLabel#exemplarLabel#online#lang"""
    
    val cexLines: String = {
        urns.map( u => cat.entriesForUrn(u).map( _.cex("#") )).flatten.mkString("\n")
    }
    
    cexTop + "\n" + cexLines + "\n\n#!ctsdata\n"
}

def makeCexTop( urn: CtsUrn, cat: Catalog ): String = {
    makeCexTop( Vector(urn), cat)
}

defined function makeCexTop
defined function makeCexTop

In [13]:
makeCexTop(urn, cat)

res12: String = """
#!cexversion
3.0

#!citelibrary
name#CEX library
urn#urn:cite2:cex:TEMPCOLL.TEMPVERSION:TEMP_ID
license#CC 3.0 NC-BY

#!ctscatalog
urn#citationScheme#groupName#workTitle#versionLabel#exemplarLabel#online#lang
urn:cts:greekLit:tlg0086.tlg034.fyfe_fu:#section/subsection#Aristotle#Poetics#W.H. Fyfe, trans. 1932##true#eng
urn:cts:greekLit:tlg0086.tlg034.bekker_fu:#section/subsection#Aristotle#Poetics#Bekker, 1837##true#grc

#!ctsdata
"""
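
The header ends with the `#!ctsdata` label, after which a CEX file lists one citable node per line in the form `urn#text`. A minimal sketch of assembling that body, using plain `(urn, text)` pairs as hypothetical stand-ins for citable nodes and a one-line placeholder for the header string:

```scala
// Hypothetical sample: (URN, text) pairs standing in for citable nodes.
// The second text is abbreviated for the example.
val sampleNodes: Vector[(String, String)] = Vector(
  ("urn:cts:greekLit:tlg0086.tlg034.fyfe_fu:1.head", "Poetics"),
  ("urn:cts:greekLit:tlg0086.tlg034.fyfe_fu:1.1", "Let us here deal with Poetry")
)

// Placeholder for the header string; only its final block label matters here
val headerDemo: String = "#!ctsdata\n"

// One line per node: the URN, the "#" delimiter, then the text
val dataLines: String = sampleNodes.map { case (u, t) => s"${u}#${t}" }.mkString("\n")

val cexDemo: String = headerDemo + dataLines
```

Because `#` is the column delimiter, node text that itself contains `#` would need handling that this sketch omits.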

## Actual Chunking

We chunk the Greek first, then use those chunks as a map to extract the aligned English passages.

In [14]:
val grcChunks: Vector[Corpus] = equalSize(grcCorp, 5000)

grcChunks: Vector[Corpus] = Vector(
  Corpus(
    Vector(
      CitableNode(
        CtsUrn("urn:cts:greekLit:tlg0086.tlg034.bekker_fu:1.head"),
        "\u03a0\u0395\u03a1\u0399 \u03a0\u039f\u0399\u0397\u03a4\u0399\u039a\u0397\u03a3."
      ),
      CitableNode(
        CtsUrn("urn:cts:greekLit:tlg0086.tlg034.bekker_fu:1.1"),
        "\u03a0\u03b5\u03c1\u03b9 \u03c0\u03bf\u03b9\u03b7\u03c4\u03b9\u03ba\u1fc6\u03c2 \u03b1\u1f50\u03c4\u1fc6\u03c2 \u03c4\u03b5 \u03ba\u03b1\u1f76 \u03c4\u1ff6\u03bd \u03b5\u1f30\u03b4\u1ff6\u03bd \u03b1\u1f50\u03c4\u1fc6\u03c2, \u1f25\u03bd \u03c4\u03b9\u03bd\u03b1 \u03b4\u1f7b\u03bd\u03b1\u03bc\u03b9\u03bd \u1f15\u03ba\u03b1\u03c3\u03c4\u03bf\u03bd \u1f14\u03c7\u03b5\u03b9, \u03ba\u03b1\u1f76 \u03c0\u1ff6\u03c2 \u03b4\u03b5\u1fd6 \u03c3\u03c5\u03bd\u1f77\u03c3\u03c4\u03b1\u03c3\u03b8\u03b1\u03b9 \u03c4\u03bf\u1f7a\u03c2 \u0

Now we can map the English corpus:

In [15]:
val engChunks: Vector[Corpus] = {
    grcChunks.map( gc => {
        val reff = gc.compressReff(gc.urns)
        engCorp ~~ reff.head.dropVersion
    })
}

engChunks: Vector[Corpus] = Vector(
  Corpus(
    Vector(
      CitableNode(
        CtsUrn("urn:cts:greekLit:tlg0086.tlg034.fyfe_fu:1.head"),
        "Poetics"
      ),
      CitableNode(
        CtsUrn("urn:cts:greekLit:tlg0086.tlg034.fyfe_fu:1.1"),
        "Let us here deal with Poetry, its essence and its several species, with the characteristic function of each species and the way in which plots must be constructed if the poem is to be a success; and also with the number and character of the constituent parts of a poem, and similarly with all other matters proper to this same inquiry; and let us, as nature directs, begin first with first principles."
      ),
      CitableNode(
        CtsUrn("urn:cts:greekLit:tlg0086.tlg034.fyfe_fu:1.2"),
        "Epic poetry, then, and the poetry of tragic drama, and, moreo

Some more validation:

In [16]:
assert(grcChunks.size == engChunks.size)
assert( grcChunks.head.urns.map(_.dropVersion).diff(engChunks.head.urns.map(_.dropVersion)).size == 0)
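
The second assertion compares only the first pair of chunks. The same comparison can be run over every pair. The sketch below illustrates the idea on plain passage references rather than on `Corpus` objects: `grcChunkRefs` and `engRefs` are made-up data, and the alignment is done by simple filtering, which is a stand-in for, not a copy of, the URN matching used above.

```scala
// Hypothetical data: passage references of each Greek chunk, and of the whole English text
val grcChunkRefs: Vector[Vector[String]] = Vector(Vector("1.head", "1.1"), Vector("1.2", "1.3"))
val engRefs: Vector[String] = Vector("1.head", "1.1", "1.2", "1.3")

// For each Greek chunk, keep the English references that belong to it
val engChunkRefs: Vector[Vector[String]] = grcChunkRefs.map { chunk =>
  val wanted: Set[String] = chunk.toSet
  engRefs.filter(wanted.contains)
}

// Indices of chunk pairs whose references differ
val mismatched: Vector[Int] = grcChunkRefs.zip(engChunkRefs).zipWithIndex.collect {
  case ((g, e), i) if g != e => i
}

assert(mismatched.isEmpty, s"Chunks out of alignment: ${mismatched.mkString(", ")}")
```

Collecting the indices first means a failure names the offending chunks instead of stopping at the first one.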

## Construct Bilingual Corpora for DUCAT

We want to construct, from our data, a script to create DUCAT files for each chunk.

In [27]:
val script: String = (0 until grcChunks.size).map( c => {
    val grc = grcChunks(c)
    val eng = engChunks(c)
    // Passage range of the Greek chunk, from its first to its last node
    val grcPsg: String = {
        val startPsg = grc.nodes.head.urn.passageComponent
        val endPsg = grc.nodes.last.urn.passageComponent
        s"${startPsg}-${endPsg}"
    }
    // Passage range of the aligned English chunk
    val engPsg: String = {
        val startPsg = eng.nodes.head.urn.passageComponent
        val endPsg = eng.nodes.last.urn.passageComponent
        s"${startPsg}-${endPsg}"
    }

    s"""
    
clearAll
addText("v6","${grcPsg}")
addText("v7","${engPsg}")
writeCex("aristotle_poetics${c}_${grcPsg}.cex")
"""                                                        
    
}).toVector.mkString("\n\n")

saveString(script,"/Users/cblackwell/cite/scala/cexshop/","aristotle_poetics.sc")

script: String = """
    
clearAll
addText("v6","1.head-4.3")
addText("v7","1.head-4.3")
writeCex("aristotle_poetics0_1.head-4.3.cex")



    
clearAll
addText("v6","4.4-6.5")
addText("v7","4.4-6.5")
writeCex("aristotle_poetics1_4.4-6.5.cex")



    
clearAll
addText("v6","6.6-7.12")
addText("v7","6.6-7.12")
writeCex("aristotle_poetics2_6.6-7.12.cex")



    
clearAll
addText("v6","8.1-11.5")
addText("v7","8.1-11.5")
writeCex("aristotle_poetics3_8.1-11.5.cex")



    
clearAll
addText("v6","11.6-14.5")
addText("v7","11.6-14.5")
writeCex("aristotle_poetics4_11.6-14.5.cex")

...
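
The script is plain string templating: one block of commands per chunk, with the chunk index and passage range substituted in. A compact sketch of the same idea, assuming the passage ranges are already available as strings (`demoRanges` is made-up input taken from the output above; `clearAll`, `addText`, and `writeCex` are simply the command names that appear in the generated script):

```scala
// Hypothetical input: passage ranges as they appear in the generated script
val demoRanges: Vector[String] = Vector("1.head-4.3", "4.4-6.5")

// zipWithIndex pairs each range with its chunk number;
// stripMargin removes the leading "|" used to keep the template aligned
val demoScript: String = demoRanges.zipWithIndex.map { case (psg, i) =>
  s"""clearAll
     |addText("v6","${psg}")
     |addText("v7","${psg}")
     |writeCex("aristotle_poetics${i}_${psg}.cex")""".stripMargin
}.mkString("\n\n")

println(demoScript)
```

Using `stripMargin` avoids the stray indentation and blank lines that end up in the output when a triple-quoted template is indented along with the surrounding code.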

## Save Non-Tokenized Greek for Readers

In [28]:
val readerCexTop = makeCexTop(grcUrn, cat)

readerCexTop: String = """
#!cexversion
3.0

#!citelibrary
name#CEX library
urn#urn:cite2:cex:TEMPCOLL.TEMPVERSION:TEMP_ID
license#CC 3.0 NC-BY

#!ctscatalog
urn#citationScheme#groupName#workTitle#versionLabel#exemplarLabel#online#lang
urn:cts:greekLit:tlg0086.tlg034.bekker_fu:#section/subsection#Aristotle#Poetics#Bekker, 1837##true#grc

#!ctsdata
"""

In [37]:
val parseScript: String = (0 until grcChunks.size).map( c => {
    val grc = grcChunks(c)
    // Passage range of the Greek chunk, from its first to its last node
    val grcPsg: String = {
        val startPsg = grc.nodes.head.urn.passageComponent
        val endPsg = grc.nodes.last.urn.passageComponent
        s"${startPsg}-${endPsg}"
    }

    s"""
    
clearAll
addText("v6","${grcPsg}")

writeCex("parsable_aristotle_poetics${c}.cex")
"""                                                        
    
}).toVector.mkString("\n\n")

saveString(parseScript,"/Users/cblackwell/cite/scala/cexshop/","parsable_aristotle_poetics.sc")

parseScript: String = """
    
clearAll
addText("v6","1.head-4.3")

writeCex("parsable_aristotle_poetics0.cex")



    
clearAll
addText("v6","4.4-6.5")

writeCex("parsable_aristotle_poetics1.cex")



    
clearAll
addText("v6","6.6-7.12")

writeCex("parsable_aristotle_poetics2.cex")



    
clearAll
addText("v6","8.1-11.5")

writeCex("parsable_aristotle_poetics3.cex")



    
clearAll
addText("v6","11.6-14.5")

writeCex("parsable_aristotle_poetics4.cex")

...

res29: String = """urn:cts:greekLit:tlg0086.tlg034.bekker_fu:1.head#ΠΕΡΙ ΠΟΙΗΤΙΚΗΣ.
urn:cts:greekLit:tlg0086.tlg034.bekker_fu:1.1#Περι ποιητικῆς αὐτῆς τε καὶ τῶν εἰδῶν αὐτῆς, ἥν τινα δύναμιν ἕκαστον ἔχει, καὶ πῶς δεῖ συνίστασθαι τοὺς μύθους, εἰ μέλλει καλῶς ἕξειν ἡ ποίησις, ἔτι δὲ ἐκ πόσων καὶ ποίων ἐστὶ μορίων, ὁμοίως δὲ καὶ περὶ τῶν ἄλλων ὅσα τῆς αὐτῆς ἐστὶ μεθόδου, λέγωμεν, ἀρξάμενοι κατὰ φύσιν πρῶτον ἀπὸ τῶν πρώτων.
urn:cts:greekLit:tlg0086.tlg034.bekker_fu:1.2#Ἐποποιία δὴ καὶ ἡ τῆς τραγῳδίας ποίησις, ἔτι δὲ κωμῳδία καὶ ἡ διθυραμβοποιητικὴ καὶ τῆς αὐλητικῆς ἡ πλείστη καὶ κιθαριστικῆς, πᾶσαι τυγχάνουσιν οὖσαι μιμήσεις τὸ σύνολον.
urn:cts:greekLit:tlg0086.tlg034.bekker_fu:1.3#Διαφέρουσι δὲ ἀλλήλων τρισίν· ἢ γὰρ τῷ γένει ἑτέροις μιμεῖσθαι, ἢ τῷ ἕτερα, ἢ τῷ ἑτέρως καὶ μὴ τὸν αὐτὸν τρόπον.
urn:cts:greekLit:tlg0086.tlg034.bekker_fu:1.4#Ὥσπερ γὰρ καὶ χρώμασι καὶ σχήμασι πολλὰ μιμοῦνταί τινες ἀπεικάζοντες, οἱ μὲν διὰ τέχνης οἱ δὲ διὰ συνηθείας, ἕτεροι δὲ διὰ τῆς φω