# Transforming a Text File to CEX



## Configuring CITE Libraries for the almond Kernel

First, we make the Bintray Maven repository that hosts the CITE libraries available to the almond kernel.

In [1]:
val myBT = coursierapi.MavenRepository.of("https://dl.bintray.com/neelsmith/maven")
interp.repositories() ++= Seq(myBT)

myBT: coursierapi.MavenRepository = MavenRepository(https://dl.bintray.com/neelsmith/maven)

Next, we bring in specific libraries from the new repository using almond's `$ivy` magic:

In [2]:
import $ivy.`edu.holycross.shot::ohco2:10.16.0`
import $ivy.`edu.holycross.shot.cite::xcite:4.1.1`
import $ivy.`edu.holycross.shot::scm:7.2.0`
import $ivy.`edu.holycross.shot::dse:5.2.2`
import $ivy.`edu.holycross.shot::citebinaryimage:3.1.1`
import $ivy.`edu.holycross.shot::citeobj:7.3.4`
import $ivy.`edu.holycross.shot::citerelations:2.5.2`
import $ivy.`edu.holycross.shot::cex:6.3.3`

import $ivy.$
import $ivy.$
import $ivy.$
import $ivy.$
import $ivy.$
import $ivy.$
import $ivy.$
import $ivy.$

## Imports

From this point on, the notebook is ordinary Scala, with the CITE libraries available for import.

In [3]:
// Import some CITE libraries
import edu.holycross.shot.cite._
import edu.holycross.shot.ohco2._
import edu.holycross.shot.scm._
import edu.holycross.shot.citeobj._
import edu.holycross.shot.citerelation._
import edu.holycross.shot.dse._
import edu.holycross.shot.citebinaryimage._

import almond.display.UpdatableDisplay
import almond.interpreter.api.DisplayData.ContentType
import almond.interpreter.api.{DisplayData, OutputHandler}

import java.io.File
import java.io.PrintWriter

import scala.io.Source


import edu.holycross.shot.cite._
import edu.holycross.shot.ohco2._
import edu.holycross.shot.scm._
import edu.holycross.shot.citeobj._
import edu.holycross.shot.citerelation._
import edu.holycross.shot.dse._
import edu.holycross.shot.citebinaryimage._

import almond.display.UpdatableDisplay
import almond.interpreter.api.DisplayData.ContentType
import almond.interpreter.api.{DisplayData, OutputHandler}

import java.io.File
import java.io.PrintWriter

import scala.io.Source

## Useful Functions

Save a string to a named file:

In [4]:
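// Write `s` to the file at filePath + fileName. The two are concatenated
// as-is, so filePath must end with a path separator (e.g. "cex/").
def saveString(s: String, filePath: String = "", fileName: String = "temp.txt"): Unit = {
  val writer = new PrintWriter(new File(s"${filePath}${fileName}"))
  try writer.write(s)
  finally writer.close() // close the file even if the write fails
}

defined function saveString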
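A quick round trip shows how the path arguments combine. This is a minimal sketch: `scratchDir` and `demo.cex` are hypothetical names, and the function is repeated (mirroring the notebook's `saveString`) only so the block runs on its own.

```scala
import java.io.{File, PrintWriter}
import java.nio.file.Files
import scala.io.Source

// Mirrors the notebook's saveString so this block is self-contained.
def saveString(s: String, filePath: String = "", fileName: String = "temp.txt"): Unit = {
  val writer = new PrintWriter(new File(s"${filePath}${fileName}"))
  try writer.write(s) finally writer.close()
}

// Hypothetical scratch directory. The trailing separator matters because
// saveString concatenates filePath and fileName without adding one.
val scratchDir: String = Files.createTempDirectory("cex-demo").toString + File.separator

saveString("urn#text\n", scratchDir, "demo.cex")

// Read the file back to confirm what was written.
val src = Source.fromFile(scratchDir + "demo.cex")
val roundTrip: String = try src.mkString finally src.close()
```

Passing a directory without the trailing separator would instead write a file whose name begins with the directory name.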

Convert a Roman Numeral to an Integer:

In [5]:
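def fromRoman(s: String): Int = {
  try {
    val numerals = Map('I' -> 1, 'V' -> 5, 'X' -> 10, 'L' -> 50, 'C' -> 100, 'D' -> 500, 'M' -> 1000)

    // Fold left to right, tracking (running sum, previous value). When the
    // previous value is smaller than the current one (e.g. the I in IV), it
    // has already been added once, so subtract it twice.
    s.toUpperCase.map(numerals).foldLeft((0, 0)) {
      case ((sum, last), curr) => (sum + curr + (if (last < curr) -2 * last else 0), curr)
    }._1
  } catch {
    // An unknown character makes the Map lookup throw; report the whole input.
    case e: Exception => throw new Exception(s""" "${s}" is not a valid Roman Numeral.""")
  }
}

defined function fromRoman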
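The fold in `fromRoman` is easier to follow when every intermediate state is kept. The sketch below swaps `foldLeft` for `scanLeft` to record each `(sum, last)` pair; `romanTrace` is a hypothetical helper written for this illustration, not part of the notebook.

```scala
val numerals = Map('I' -> 1, 'V' -> 5, 'X' -> 10, 'L' -> 50, 'C' -> 100, 'D' -> 500, 'M' -> 1000)

// Same step function as fromRoman, but scanLeft keeps every (sum, last) pair.
def romanTrace(s: String): Vector[(Int, Int)] =
  s.toUpperCase.toVector.map(numerals).scanLeft((0, 0)) {
    case ((sum, last), curr) => (sum + curr + (if (last < curr) -2 * last else 0), curr)
  }

val trace = romanTrace("XIV")
// trace == Vector((0,0), (10,10), (11,1), (14,5))
// At the final step, I (1) precedes V (5), so the sum is 11 + 5 - 2 = 14.
```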

Like `.split`, but preserving the character we split on:

In [6]:
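def splitWithSplitter(text: String, puncs: String): Vector[String] = {
  // Alternative that also splits *before* each delimiter, isolating it as its own element:
  //val regexWithSplitter = s"((?<=${puncs})|(?=${puncs}))"
  // The lookbehind splits *after* each match of `puncs`, so the delimiter
  // stays attached to the end of the preceding chunk.
  val regexWithSplitter = s"((?<=${puncs}))"
  text.split(regexWithSplitter).toVector.filter(_.size > 0)
}

defined function splitWithSplitter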
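To see the difference from a plain `.split`, compare the two on a short sample. This is a sketch under stated assumptions: the sample sentence and the `[.;]` character class are made up for illustration, and the function is repeated so the block runs on its own.

```scala
// Same lookbehind technique as the notebook's splitWithSplitter.
def splitWithSplitter(text: String, puncs: String): Vector[String] =
  text.split(s"((?<=${puncs}))").toVector.filter(_.size > 0)

val sample = "Arma virumque cano. Troiae qui primus; ab oris"

// Plain split consumes the delimiters.
val dropped: Vector[String] = sample.split("[.;]").toVector
// dropped == Vector("Arma virumque cano", " Troiae qui primus", " ab oris")

// The lookbehind split keeps each delimiter on the chunk it ends.
val kept: Vector[String] = splitWithSplitter(sample, "[.;]")
// kept == Vector("Arma virumque cano.", " Troiae qui primus;", " ab oris")
```

Note that the leading space of each following chunk is preserved; nothing is trimmed.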

Pretty-print values for inspection:

In [7]:
def showMe(v: Any): Unit = {
  v match {
    // Histogram: one "count<TAB>string" line per entry
    case sh: StringHistogram =>
      for (h <- sh.histogram) println(s"${h.count}\t${h.s}")
    // Corpus: one "passage<TAB><TAB>text" line per citable node
    case c: Corpus =>
      for (n <- c.nodes) println(s"${n.urn.passageComponent}\t\t${n.text}")
    // Any other collection (Vector included): one element per line, framed by rules
    case it: Iterable[_] => println(s"""\n----\n${it.mkString("\n")}\n----\n""")
    // Everything else: the value's toString, framed by rules
    case _ => println(s"\n-----\n${v}\n----\n")
  }
}

defined function showMe

## Load a Template File

Load the file into a `Vector[String]`, dropping empty lines. The filter `_.size > 1` also drops any line that is only one character long:

In [8]:
// Settings for the English version (`eng`); swap them in for the active block below to convert it.
/*
val filePath = "Anon_2_eng.txt"
val urnBase = CtsUrn("urn:cts:xolotl:anonMex.001.crapo_eng:")
val cexHeaderPath = "cex_header_anon_eng.txt"
val numberOffset: Int = 1
val fileName: String = "anon_eng.cex"
*/

// Settings for the Nahuatl version (`nah1`), currently active.
val filePath = "Anon_2_nah1.txt"                                // source text, one citable unit per line
val urnBase = CtsUrn("urn:cts:xolotl:anonMex.001.crapo_nah1:")  // Version-level URN, no passage yet
val cexHeaderPath = "cex_header_anon_nah1.txt"                  // CEX header and catalog block
val numberOffset: Int = 0                                       // added to each zero-based index in citations
val fileName: String = "anon_nah1.cex"                          // output file name



val lines: Vector[String] = {
    val src = scala.io.Source.fromFile(filePath)
    // Keep only lines longer than one character, then release the file handle.
    try src.mkString.split("\n").toVector.filter( _.size > 1 )
    finally src.close()
}

filePath: String = "Anon_2_nah1.txt"
urnBase: CtsUrn = CtsUrn("urn:cts:xolotl:anonMex.001.crapo_nah1:")
cexHeaderPath: String = "cex_header_anon_nah1.txt"
numberOffset: Int = 0
fileName: String = "anon_nah1.cex"
lines: Vector[String] = Vector(
  "[Ynic Capitulo II]<note n=\"88\">",
  "_nican motenehua<note n=\"89\"> yn\u00edc \u00f4me [.......n]<note n=\"90\"> oaltepe manaco nic\u00e1 mex\u00ed-catlal ytzmapan.<note n=\"91\">_",
  "Zatepanian moch\u00ed yntlall\u00ed ocahualoc yeo yaque intolteca ypan Yn tlatilanaltin tlazintlan, Ypanceme Altepetl qu\u00edtocayo-tia<note n=\"93\"> Amaqu\u00e9m\u00ea.||<note n=\"94\"> ocatcaya<note n=\"95\"> \u00e7e theuctl\u00ed-yntlatocauh \u00fd chichimeca ytocaCatca<note n=\"96\"> Tlamacatzin, auh y ni macehualhu\u00e1, moch\u00

### We need to capture citation values for Chapters and Paragraphs.

We attach an index number to every line, which later supplies the citation value. `zipWithIndex` yields a Vector of tuples, `(String, Int)`; because positional tuple fields are easy to confuse, we define a case class, `IndexedLine`, and map to a `Vector[IndexedLine]`:

In [9]:
case class IndexedLine( index: Int, text: String)
val indexedLines: Vector[IndexedLine] = lines.zipWithIndex.map {
    case (text, index) => IndexedLine(index, text)
}

defined class IndexedLine
indexedLines: Vector[IndexedLine] = Vector(
  IndexedLine(0, "[Ynic Capitulo II]<note n=\"88\">"),
  IndexedLine(
    1,
    "_nican motenehua<note n=\"89\"> yn\u00edc \u00f4me [.......n]<note n=\"90\"> oaltepe manaco nic\u00e1 mex\u00ed-catlal ytzmapan.<note n=\"91\">_"
  ),
  IndexedLine(
    2,
    "Zatepanian moch\u00ed yntlall\u00ed ocahualoc yeo yaque intolteca ypan Yn tlatilanaltin tlazintlan, Ypanceme Altepetl qu\u00edtocayo-tia<note n=\"93\"> Amaqu\u00e9m\u00ea.||<note n=\"94\"> ocatcaya<note n=\"95\"> \u00e7e theuctl\u00ed-yntlatocauh \u00fd chichimeca ytocaCatca<note n=\"96\"> Tlamacatzin, auh y ni macehualhu\u00e1, moch\u00ed. cenpetlauhtinem\u00eda, zantlaquentitinemia Yca tecuan cuetlaxt\u00edn temamauhtique, cayncemitol yn Yaoyotl quipiaya tetotocamitl, mintli Yn tlahuitol, quicauyia yn tlen cacia yolca
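The gain in readability from the case class shows up on a small sample. A minimal sketch, assuming made-up `sampleLines` in place of the loaded file:

```scala
case class IndexedLine(index: Int, text: String)

// Hypothetical sample lines standing in for the loaded file.
val sampleLines = Vector("[Chapter heading]", "First paragraph.", "Second paragraph.")

// zipWithIndex pairs each line with its zero-based position: (text, index).
val tuples: Vector[(String, Int)] = sampleLines.zipWithIndex

// Mapping to the case class gives the two fields names.
val indexed: Vector[IndexedLine] = tuples.map { case (text, i) => IndexedLine(i, text) }

val viaTuple: String = tuples(1)._1     // positional access: which field is _1?
val viaClass: String = indexed(1).text  // named access: self-explanatory
```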

We can now build the citable nodes for this Version, one per line, each cited as chapter 2 plus the line's index:

In [10]:
val paragraphCexVec: Vector[CitableNode] = indexedLines.map( il => {
    // Passage reference: chapter 2, then the line index shifted by numberOffset
    val u = urnBase.addPassage(s"2.${il.index+numberOffset}")
    val psg = il.text
    CitableNode(u, psg)
})

paragraphCexVec: Vector[CitableNode] = Vector(
  CitableNode(
    CtsUrn("urn:cts:xolotl:anonMex.001.crapo_nah1:2.0"),
    "[Ynic Capitulo II]<note n=\"88\">"
  ),
  CitableNode(
    CtsUrn("urn:cts:xolotl:anonMex.001.crapo_nah1:2.1"),
    "_nican motenehua<note n=\"89\"> yn\u00edc \u00f4me [.......n]<note n=\"90\"> oaltepe manaco nic\u00e1 mex\u00ed-catlal ytzmapan.<note n=\"91\">_"
  ),
  CitableNode(
    CtsUrn("urn:cts:xolotl:anonMex.001.crapo_nah1:2.2"),
    "Zatepanian moch\u00ed yntlall\u00ed ocahualoc yeo yaque intolteca ypan Yn tlatilanaltin tlazintlan, Ypanceme Altepetl qu\u00edtocayo-tia<note n=\"93\"> Amaqu\u00e9m\u00ea.||<note n=\"94\"> ocatcaya<note n=\"95\"> \u00e7e theuctl\u00ed-yntlatocauh \u00fd chichimeca ytocaCatca<note n=\"96\"> Tlamacatzin, auh y ni macehualhu\u00e1, moch\u00ed. cenpetlauhtinem\u00eda, zantlaque

## Make a Sentence-Level Exemplar 

We have a section-level Version in `paragraphCexVec`. Let's split each of its nodes into sentences to make a sentence-level exemplar.

In [12]:
// Characters that end a sentence (a regex character class)
val sentenceSplitters: String = """[·.;?!]"""

// Exemplar identifier added to each Version-level URN
val exemplarString = "sentences"

// The Version's nodes, copied unchanged from paragraphCexVec
val versionNodes: Vector[CitableNode] = paragraphCexVec.map( scv => {
    val urn = scv.urn
    val text = scv.text
    CitableNode(urn, text)
})

// Split each Version node into sentences and cite each one by appending its
// index to the parent passage (with numberOffset = 0, 2.1 yields 2.1.0, 2.1.1, ...).
val exemplarNodes: Vector[CitableNode] = versionNodes.flatMap( vn => {
    val newUrnBase: CtsUrn = vn.urn.addExemplar(exemplarString)
    // Fragments of a single character are discarded.
    val splitText: Vector[String] = splitWithSplitter(vn.text, sentenceSplitters).filter(_.size > 1)
    splitText.zipWithIndex.map { case (newTxt, i) =>
        val newUrn: CtsUrn = CtsUrn(s"${newUrnBase}.${i + numberOffset}")
        CitableNode(newUrn, newTxt)
    }
})


sentenceSplitters: String = "[\u00b7.;?!]"
exemplarString: String = "sentences"
versionNodes: Vector[CitableNode] = Vector(
  CitableNode(
    CtsUrn("urn:cts:xolotl:anonMex.001.crapo_nah1:2.0"),
    "[Ynic Capitulo II]<note n=\"88\">"
  ),
  CitableNode(
    CtsUrn("urn:cts:xolotl:anonMex.001.crapo_nah1:2.1"),
    "_nican motenehua<note n=\"89\"> yn\u00edc \u00f4me [.......n]<note n=\"90\"> oaltepe manaco nic\u00e1 mex\u00ed-catlal ytzmapan.<note n=\"91\">_"
  ),
  CitableNode(
    CtsUrn("urn:cts:xolotl:anonMex.001.crapo_nah1:2.2"),
    "Zatepanian moch\u00ed yntlall\u00ed ocahualoc yeo yaque intolteca ypan Yn tlatilanaltin tlazintlan, Ypanceme Altepetl qu\u00edtocayo-tia<note n=\"93\"> Amaqu\u00e9m\u00ea.||<note n=\"94\"> ocatcaya<note n=\"95\"> \u00e7e theuctl\u00ed-yntl
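The way sentence-level URNs derive from their parent passage can be sketched with plain strings, without the CITE libraries. This is an illustration only, not how `CtsUrn.addExemplar` is implemented: `PlainNode` and `withExemplar` are hypothetical, and the sample text is invented.

```scala
// Hypothetical stand-in for CitableNode, holding the URN as a plain string.
case class PlainNode(urn: String, text: String)

// Hypothetical helper: append an exemplar ID to the work component of a
// CTS URN string (fields: urn, cts, namespace, work hierarchy, passage).
def withExemplar(urn: String, exemplarId: String): String = {
  val parts = urn.split(":", -1).toVector
  (parts.take(3) ++ Vector(s"${parts(3)}.${exemplarId}") ++ parts.drop(4)).mkString(":")
}

// Lookbehind split that keeps each delimiter on the chunk it ends.
def splitKeeping(text: String, puncs: String): Vector[String] =
  text.split(s"(?<=${puncs})").toVector.filter(_.nonEmpty)

val paragraph = PlainNode(
  "urn:cts:xolotl:anonMex.001.crapo_nah1:2.1",
  "Prima sententia. Secunda sententia; tertia."
)

// Each sentence is cited as parent passage + "." + sentence index.
val sentences: Vector[PlainNode] =
  splitKeeping(paragraph.text, "[.;?!]").zipWithIndex.map { case (txt, i) =>
    PlainNode(s"${withExemplar(paragraph.urn, "sentences")}.${i}", txt)
  }
```

Here the three sentences receive the passage references `2.1.0`, `2.1.1`, and `2.1.2` under the `sentences` exemplar.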

## Make Final CEX File

In [13]:

val headerLines: Vector[String] = {
    val src = scala.io.Source.fromFile(cexHeaderPath)
    try src.mkString.split("\n").toVector
    finally src.close()
}
val cexHeader: String = "\n" + headerLines.mkString("\n")

// Each text gets its own #!ctsdata block. The new names avoid shadowing the
// earlier `exemplarString` ("sentences"), which would break re-running that cell.
val versionCex = "#!ctsdata" + "\n" + Corpus(versionNodes).cex("#") + "\n\n"
// replaceAll removes spaces that directly follow a "#", such as the leading
// space the split leaves on each sentence.
val exemplarCex = "#!ctsdata" + "\n" + Corpus(exemplarNodes).cex("#").replaceAll("# +","#")
val allCex = cexHeader + "\n\n" + versionCex + "\n\n" + exemplarCex

headerLines: Vector[String] = Vector(
  "#!cexversion",
  "3.0",
  "",
  "#!citelibrary",
  "name#CEX library",
  "urn#urn:cite2:cex:temp_xolotl.v1:temp1",
  "license#CC 3.0 NC-BY",
  "",
  "#!ctscatalog",
  "urn#citationScheme#groupName#workTitle#versionLabel#exemplarLabel#online#lang",
  "urn:cts:xolotl:anonMex.001.crapo_nah1:#chapter/section#Anonymo Mexicano#Bibliothe\u0300que Nationale de Paris, coll. Aubin-Goupil 254: Documents en nahuatl relatifs aux Tolte\u0300ques, etc.#Nahautl. Diplomatic. Richley H. Crapo, Bonnie Glass-Coffin, edd. and trans.##true#nah",
  "urn:cts:xolotl:anonMex.001.crapo_nah1.sentences:#chapter/section/sentence#Anonymo Mexicano#Bibliothe\u0300que Nationale de Paris, coll. Aubin-Goupil 254: Documents en nahuatl relatifs aux Tolte\u0300ques, etc.#Nahuatl. DiplomatiRichley H. Crapo, Bonnie Glass-Coffin, edd.

Save it! The `cex/` directory must already exist, since `saveString` does not create it:

In [14]:
saveString( allCex, "cex/", fileName)
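
For orientation, the layout of a `#!ctsdata` block can be sketched in plain Scala without the CITE libraries. This is an illustration, not the `Corpus.cex` implementation: `SketchNode` and `ctsDataBlock` are hypothetical names, the sample texts are shortened, and the sketch assumes each data line is a URN and its text joined by the `#` delimiter.

```scala
// Hypothetical stand-in for CitableNode, with the URN as a plain string.
case class SketchNode(urn: String, text: String)

// Assumption: one "urn#text" line per node, under a "#!ctsdata" block label.
def ctsDataBlock(nodes: Vector[SketchNode], delimiter: String = "#"): String =
  "#!ctsdata\n" + nodes.map(n => s"${n.urn}${delimiter}${n.text}").mkString("\n")

// Shortened sample nodes for illustration.
val sketchNodes = Vector(
  SketchNode("urn:cts:xolotl:anonMex.001.crapo_nah1:2.0", "[Ynic Capitulo II]"),
  SketchNode("urn:cts:xolotl:anonMex.001.crapo_nah1:2.1", "nican motenehua")
)

val block: String = ctsDataBlock(sketchNodes)
```

Because `#` is the field delimiter, a text that itself contains `#` would need to be cleaned before it is written this way.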