# Aesop: Greek and Portuguese




This notebook takes a plain-text file containing the text of Aesop, *Fabulae*, 1–17, in the Greek edition of Helm (1872), and a new Portuguese translation by M.C. Dezotti (2020), and transforms it into a canonically-citable, CITE-compliant digital library serialized into [CEX format](http://cite-architecture.org/citedx/CEX-spec-3.0.1).

**This is not a generic script!** The input file is clean and well-structured plain-text, but in an idiosyncratic format. Because it is well-structured, we can work with it. Because it is idiosyncratic, this is an exercise in *some techniques* for moving legacy data into CEX.

## Configuring CITE libraries for almond kernel

First, we'll make a bintray repository with CITE libraries available to your almond kernel.

In [None]:
val myBT = coursierapi.MavenRepository.of("https://dl.bintray.com/neelsmith/maven")
interp.repositories() ++= Seq(myBT)

Next, we bring in specific libraries from the new repository using almond's `$ivy` magic:

In [None]:
import $ivy.`edu.holycross.shot::ohco2:10.16.0`
import $ivy.`edu.holycross.shot.cite::xcite:4.1.1`
import $ivy.`edu.holycross.shot::scm:7.2.0`
import $ivy.`edu.holycross.shot::dse:5.2.2`
import $ivy.`edu.holycross.shot::citebinaryimage:3.1.1`
import $ivy.`edu.holycross.shot::citeobj:7.3.4`
import $ivy.`edu.holycross.shot::citerelations:2.5.2`
import $ivy.`edu.holycross.shot::cex:6.3.3`


## Imports

From this point on, your notebook consists of completely generic Scala, with the CITE Libraries available to use.

In [None]:
// Import some CITE libraries
import edu.holycross.shot.cite._
import edu.holycross.shot.ohco2._
import edu.holycross.shot.scm._
import edu.holycross.shot.citeobj._
import edu.holycross.shot.citerelation._
import edu.holycross.shot.dse._
import edu.holycross.shot.citebinaryimage._

import almond.display.UpdatableDisplay
import almond.interpreter.api.DisplayData.ContentType
import almond.interpreter.api.{DisplayData, OutputHandler}

import java.io.File
import java.io.PrintWriter

import scala.io.Source


## Useful Functions

A function for saving a String to a file.

In [None]:
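def saveString(s: String, filePath: String = "", fileName: String = "temp.txt"): Unit = {
    // filePath is prepended verbatim, so a non-empty value must end with a path separator
    val writer = new PrintWriter(new File(s"${filePath}${fileName}"))
    try {
        writer.write(s)
    } finally {
        writer.close() // release the file handle even if the write fails
    }
}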

A function to pretty-print lists and OHCO2 corpora.

In [None]:
def showMe(v: Any): Unit = {
  v match {
    // A histogram prints one "count <tab> string" row per entry
    case sh: StringHistogram =>
      for (h <- sh.histogram) println(s"${h.count}\t${h.s}")
    // A Corpus prints one "passage <tab><tab> text" row per citable node
    case c: Corpus =>
      for (n <- c.nodes) println(s"${n.urn.passageComponent}\t\t${n.text}")
    // Any other collection (a Vector included) prints one element per line
    case it: Iterable[_] =>
      println(s"""\n----\n${it.mkString("\n")}\n----\n""")
    case _ => println(s"\n-----\n${v}\n----\n")
  }
}

## Load the Source File

Load the plain-text source, `aesop.txt`, as a Vector of lines, discarding empty ones:

In [None]:
val filePath = s"aesop.txt"
val allLines: Vector[String] = {
    scala.io.Source.fromFile(filePath).mkString.split("\n").toVector.filter( _.size > 0 )
}

## Parse Data

We define a custom Class that is String + Index:

In [None]:
case class IndexedLine( text: String, index: Int)

We want to separate heading-lines from the content-lines.

First, we attach an index-number to each line of the text (this will stay with the lines, and be useful later).

In [None]:
val indexedLines: Vector[IndexedLine] = allLines.zipWithIndex.map ( l => {
    IndexedLine( l._1, l._2 )
})
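
To see what the indexing step produces, here is a minimal sketch on three made-up lines. The sample text and the names `DemoLine`, `demoSampleLines`, and `demoIndexed` are hypothetical, chosen so that running the cell does not shadow anything defined in this notebook.

```scala
// Stand-in for IndexedLine, under a different name so the notebook's own class is untouched
case class DemoLine(text: String, index: Int)

// Hypothetical sample lines, not the real contents of aesop.txt
val demoSampleLines: Vector[String] = Vector("1. αβγδ Abcd", "greek line", "portuguese line")

// zipWithIndex pairs each element with its zero-based position
val demoIndexed: Vector[DemoLine] = demoSampleLines.zipWithIndex.map {
  case (text, i) => DemoLine(text, i)
}

println(demoIndexed.mkString("\n"))
```

Because the index travels with the text, a line can still be located in the original sequence after filtering has thrown its neighbours away.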

We want to pull out just the lines that are headings. We start with a Regular Expression pattern that (we happen to know) will match all of these lines: lines beginning with Arabic numerals are our headings.

In [None]:
val pattern = "^[0-9]".r // note that .r after a String makes it into a RegEx

Now we use that regular expression, `pattern`, as a filter to get a Vector of just our heading-lines.

In [None]:
val headingLines: Vector[IndexedLine] = indexedLines.filter( l => {
    pattern.findAllIn(l.text).size > 0
})
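
The same filter can be tried on a few made-up lines. This is only a sketch: the sample lines and the `demo…` names are hypothetical, and it uses `findFirstIn(...).isDefined`, which asks the same question as counting the matches.

```scala
// Same idea as `pattern`: a line is a heading if it begins with an Arabic numeral
val demoHeadingPattern = "^[0-9]".r

// Hypothetical sample lines, not taken from aesop.txt
val demoText: Vector[String] = Vector(
  "1. αβγδ Abcd",
  "greek line",
  "portuguese line",
  "2. εζηθ Efgh",
  "another greek line",
  "another portuguese line"
)

// findFirstIn returns Some(match) when the line starts with a digit, None otherwise
val demoHeadings: Vector[String] = demoText.filter( l => {
  demoHeadingPattern.findFirstIn(l).isDefined
})

println(demoHeadings.mkString("\n"))
```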

### Group Text

We want to group our text by section. The procedure will be:

- Identify the index number of one heading.
- Identify the index number of the *next* heading.
- Get all lines that fall between the two.
- Attach them to the first heading.

Scala's [`.sliding`](http://daily-scala.blogspot.com/2009/11/iteratorsliding.html) method is ideal for this. It will group all the headings into pairs.

Below, `headingPairs` is a Vector of Vectors of IndexedLine objects. Each inner Vector will have two IndexedLines, each one a heading. The first pair will consist of the first heading and the second; the second pair will consist of the *second* heading (again) and the third.

In [None]:
val headingPairs: Vector[Vector[IndexedLine]] = headingLines.sliding(2,1).toVector
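
A quick look at `.sliding(2,1)` on plain integers shows the shape of the result. The numbers below are hypothetical heading positions, used only for illustration.

```scala
// Hypothetical index numbers of four headings
val demoHeadingIndices: Vector[Int] = Vector(0, 3, 6, 9)

// A window of size 2, advancing 1 element at a time
val demoPairs: Vector[Vector[Int]] = demoHeadingIndices.sliding(2, 1).toVector
// Vector(Vector(0, 3), Vector(3, 6), Vector(6, 9))

// With a single element there is no pair to form, so one short window comes back
val demoSingle: Vector[Vector[Int]] = Vector(0).sliding(2, 1).toVector
// Vector(Vector(0))
```

Four headings give three pairs, which is why the final section has no closing partner and needs separate handling.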

We can map this Vector of pairs and get all the chapters except the last one. For the last one, we need a variant. 

> In other programming idioms, we would iterate through the pairs, with a check, each time, to see if we were on the last one, or beyond the last one. In Scala's Functional Programming Idiom, we "do something to everything", and know in advance that this will not include the last section, and treat that differently. This helps avoid "off by one" errors, among other things.

In [None]:
val mappedHeadings: Vector[( IndexedLine, Vector[IndexedLine])] = {
    
    // We use up all the pairs…
    val allButLast: Vector[( IndexedLine, Vector[IndexedLine])] = headingPairs.map( p => {
        val firstIndex: Int = p.head.index
        val lastIndex: Int = p.last.index
        val firstLine: IndexedLine = indexedLines(firstIndex)
        val allLines: Vector[IndexedLine] = indexedLines.filter( il => {
            ( il.index > firstIndex) && ( il.index < lastIndex )
        })
        ( firstLine, allLines )
    })
    
    // We go get the last section, which we know was not included…
    val lastSection: Vector[( IndexedLine, Vector[IndexedLine])] = {
        val firstIndex: Int = headingPairs.last.last.index
        val firstLine: IndexedLine = indexedLines(firstIndex)
        val allLines: Vector[IndexedLine] = indexedLines.filter( il => {
            ( il.index > firstIndex) 
        })
        val tup = ( firstLine, allLines )
        Vector[( IndexedLine, Vector[IndexedLine])](tup)
    }
    
    // We concatenate the two Vectors…
    allButLast ++ lastSection
}
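
As an alternative to treating the last section separately, one could append a sentinel (the total number of lines) to the list of heading positions, so that every heading, including the last, has a closing partner. This is a sketch of that variant on made-up lines, not the approach taken in the cell above; all `demo…` names and the sample text are hypothetical.

```scala
// Hypothetical sample: two sections, each a heading followed by two content lines
val demoSource: Vector[String] = Vector(
  "1. αβγδ Abcd", "greek one", "portuguese one",
  "2. εζηθ Efgh", "greek two", "portuguese two"
)

val demoStartsWithDigit = "^[0-9]".r

// Positions of the heading lines
val demoHeadingAt: Vector[Int] = demoSource.zipWithIndex.collect {
  case (line, i) if demoStartsWithDigit.findFirstIn(line).isDefined => i
}

// The sentinel (demoSource.size) closes the final section
val demoBounds: Vector[Vector[Int]] = (demoHeadingAt :+ demoSource.size).sliding(2, 1).toVector

// For each (start, end) pair: the heading, then the lines strictly between the two positions
val demoSections: Vector[(String, Vector[String])] = demoBounds.collect {
  case Vector(start, end) => (demoSource(start), demoSource.slice(start + 1, end))
}

demoSections.foreach { case (h, ls) => println(s"${h} -> ${ls.mkString(" | ")}") }
```

The trade-off is one fewer special case at the cost of a value in the list that is not a real heading position.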

### A Useful Function for Title Lines

The title-line of this text consists of:

- An Arabic number (1–17), followed by a period.
- A Greek title
- The Portuguese title

In XML, *vel sim.*, all of these would be wrapped in some kind of markup. They are not, here, but we can still work with these three discrete sets of data, because the plain-text is clean and predictable.

We *could* do this in-line, but it is easier to see, and test, if we pull it out into a defined Function.

We grab the Heading-number (which we turn into a String, because it is merely a *label*), using a Regular Expression.

To split the Greek title from the Portuguese title, we do the following:

- Grab the chapter-label (some Arabic numerals) with a Regex
- Remove the chapter-label (and following period '.') before further processing: this is the String `val` called `twoTitles`
- Turn that into a Vector of `Char`.
- Filter out everything except `[A-Z]` (we know that the Greek title is first, and the Portuguese title begins with an upper-case Latin letter).
- The first element in the resulting list will be the start of the Portuguese title.
- Using Scala's [`.indexOf`](https://www.geeksforgeeks.org/scala-string-indexof-method-with-example/) method, we get the index of the first occurrence of the first `Char` of the Portuguese title in the `twoTitles` String.
- Using `.take` we grab the Greek title.
- Using `.takeRight` and some arithmetic we grab the Portuguese title.

The result will be a 3-tuple of Strings: chapter-label, Greek title, Portuguese title.

In [None]:
def splitTitle( testString: String ): (String, String, String) = {
    
    val chapterId: String = {
        val rx = "^[0-9]+".r
        val foundOption: Option[String] = rx.findFirstIn(testString)
        foundOption.getOrElse("NO_ID")
        
    }
    
    val twoTitles: String = testString.replaceAll("""^[0-9]+\.""", "").trim
    
    val charVec = twoTitles.toVector
    val filteredVec = charVec.filter( c => {
        val s = c.toString
        val rpl = s.replaceAll("[A-Z]", "")
        rpl == ""
    })
    val firstChar: Char = filteredVec.head.toChar
    val firstPorIndex: Int = charVec.indexOf(firstChar)
    val greekTitle: String = twoTitles.take(firstPorIndex - 1)
    val porTitle: String = twoTitles.takeRight( twoTitles.size - firstPorIndex )
    
    (chapterId, greekTitle, porTitle)
}

splitTitle("12. αβγδ ABCD")
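
`splitTitle` assumes every heading contains an upper-case Latin letter, and will throw an exception on a line that does not. As a hedged alternative, here is a sketch of a variant that does the split with a single regular expression and returns an `Option`, so a malformed line yields `None`. The name `demoSplitTitle` and its pattern are hypothetical and are not used elsewhere in this notebook.

```scala
// Hypothetical variant of splitTitle: None when the line does not fit the expected shape
def demoSplitTitle(line: String): Option[(String, String, String)] = {
  // group 1: the digits; group 2: everything before the first A-Z; group 3: the rest
  val shape = """^([0-9]+)\.\s*([^A-Z]*?)\s*([A-Z].*)$""".r
  line match {
    case shape(id, greek, port) => Some((id, greek, port))
    case _ => None
  }
}

println(demoSplitTitle("12. αβγδ ABCD"))
println(demoSplitTitle("αβγδ with no number"))
```

Like the original, this still treats the first `[A-Z]` character as the start of the Portuguese title, so it shares the same assumption about how the titles are laid out.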

### Make a CEX File!

We can make two CEX blocks, one for Greek and one for Portuguese. We happen to know that, for each section, there is a header-line, a Greek section (one line), and a Portuguese section (one line). 

**So this is not a generic script!** It only works with this file!

First we define our URN-base:

In [None]:
val urnBase = CtsUrn("urn:cts:greekLit:tlg0096.tlg002:")

We make a CEX block for Greek first…

In [None]:
val greekBlock: Vector[String] = mappedHeadings.map( h => {
    val heading: IndexedLine = h._1
    val section: IndexedLine = h._2.head
    val splitHeading = splitTitle(heading.text)
    val sectionId = splitHeading._1
    val sectionHeading = splitHeading._2
    val versionUrn = urnBase.addVersion("First1K-grc1")
    Vector(
        s"${versionUrn}${sectionId}.head#${sectionHeading}",
        s"${versionUrn}${sectionId}.text#${section.text}"
    )
}).flatten

Now a CEX block for Portuguese…

In [None]:
val portBlock: Vector[String] = mappedHeadings.map( h => {
    val heading: IndexedLine = h._1
    val section: IndexedLine = h._2.last
    val splitHeading = splitTitle(heading.text)
    val sectionId = splitHeading._1
    val sectionHeading = splitHeading._3
    val versionUrn = urnBase.addVersion("mcdezotti")
    Vector(
        s"${versionUrn}${sectionId}.head#${sectionHeading}",
        s"${versionUrn}${sectionId}.text#${section.text}"
    )
}).flatten
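
Each line in these blocks is simply a URN, a `#`, and the text. The sketch below builds two such lines from plain Strings. It assumes that the version-level URN prints as shown (the cells above obtain it from `CtsUrn.addVersion`); the `demo…` names and the sample heading and text are hypothetical.

```scala
// Assumed string form of the version-level URN for the Portuguese translation
val demoVersionUrn: String = "urn:cts:greekLit:tlg0096.tlg002.mcdezotti:"

// One fable becomes two citable nodes: <id>.head and <id>.text
def demoCexLines(sectionId: String, heading: String, text: String): Vector[String] = Vector(
  s"${demoVersionUrn}${sectionId}.head#${heading}",
  s"${demoVersionUrn}${sectionId}.text#${text}"
)

val demoBlock: Vector[String] = demoCexLines("12", "ABCD", "sample text")

println(demoBlock.mkString("\n"))
```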

**Final Assembly**

We need to add the `#!ctsdata` header before each block, and of course the overall CEX header and CTS Catalog, which are conveniently saved in a separate template file.

> Concatenating, appending, and prepending things to Vectors in Scala is flexible, but the syntax is hard to remember. [This site](https://alvinalexander.com/scala/how-to-append-prepend-items-vector-seq-in-scala) is the definitive reference.

First, we load the CEX Header:

In [None]:
val filePath = s"aesop_cex_header.txt"
val cexHeader: String = {
    scala.io.Source.fromFile(filePath).mkString.split("\n").toVector.filter( _.size > 0 ).mkString("\n")
}

Now give our blocks their proper headers:

In [None]:
val greekCex: String = {
    ( "#!ctsdata" +: greekBlock ).mkString("\n")
}

In [None]:
val portCex: String = {
    ( "#!ctsdata" +: portBlock ).mkString("\n")
}
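
For reference, here is a short sketch of the three Vector operators involved in this kind of assembly, on made-up Strings (the `demo…` names are hypothetical). A way to remember `+:` and `:+` is that the colon always sits on the side of the collection.

```scala
val demoBody: Vector[String] = Vector("line one", "line two")

// element +: vector  (prepend one element)
val demoPrepended: Vector[String] = "#!ctsdata" +: demoBody

// vector :+ element  (append one element)
val demoAppended: Vector[String] = demoBody :+ "line three"

// vector ++ vector   (concatenate two collections)
val demoJoined: Vector[String] = demoPrepended ++ Vector("line three")

println(demoJoined.mkString("\n"))
```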

Put the whole thing together:

In [None]:
val aesopCex: String = {
    cexHeader + "\n\n" + greekCex + "\n\n" + portCex + "\n"
}

Save it…

In [None]:
saveString( aesopCex, "", "aesop.cex")

## Test It!

### Load the Library

We can test the validity of our work by trying to load it into a [CiteLibrary](https://cite-architecture.github.io/cite-api-docs/).

In [None]:
val cexPath = "aesop.cex"
val lib = CiteLibrary(scala.io.Source.fromFile(cexPath).mkString)

If that worked (!??!), we can now try a little retrieval and analysis. 

A CITE Library has many possible components. The one we have just loaded is text-only, so let's get some parts of it convenient to hand.

> A CiteLibrary possesses an `Option[TextRepository]`. So there may or may not be a TextRepository in any given CiteLibrary: the value of `lib.textRepository` may be either `Some[TextRepository]` or `None`. We can "get" the TR with `lib.textRepository.get`. If the value is actually `None`, this will throw an exception. But in that case, something failed, above, so there is no point doing elaborate checking.

In [None]:
val tr: TextRepository = lib.textRepository.get // Go for it!
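
Where a missing component should *not* stop the notebook, the usual alternative to `.get` is a pattern match on the `Option`. The sketch below uses a plain `Option[String]` as a stand-in for `lib.textRepository`; the `demo…` names and messages are hypothetical.

```scala
// Stand-ins for an optional library component
val demoPresent: Option[String] = Some("text repository")
val demoMissing: Option[String] = None

// Handle both cases explicitly instead of calling .get
def demoDescribe(o: Option[String]): String = o match {
  case Some(r) => s"found: ${r}"
  case None    => "nothing there"
}

println(demoDescribe(demoPresent))
println(demoDescribe(demoMissing))
```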

A TextRepository **must** have both a `Catalog` and a `Corpus`. See [the API docs for the `OHCO2` library](https://cite-architecture.github.io/cite-api-docs/ohco2/api/edu/holycross/shot/ohco2/index.html).

In [None]:
val cat: Catalog = tr.catalog

val corp: Corpus = tr.corpus

### Retrieval

For this exercise, we will define some URNs, and use them to retrieve passages of text. This will take advantage of the `showMe()` Function defined above.

In [None]:
// Urn to Aesop's Fabulae
val aesopUrn = CtsUrn("urn:cts:greekLit:tlg0096.tlg002:")

// Version ID for Greek
val greekVers = "First1K-grc1"

// Version ID for Portuguese
val portVers = "mcdezotti"

#### Retrieve Fables

One fable, in Greek:

In [None]:
val oneGreekCitation = aesopUrn.addVersion(greekVers).addPassage("3")

We use the `~~` method to retrieve a passage, based on a URN, from a Corpus.

In [None]:
val oneGreekFable: Corpus = corp ~~ oneGreekCitation

showMe(oneGreekFable)

One fable, in Portuguese:

In [None]:
val onePortCitation = aesopUrn.addVersion(portVers).addPassage("3")

We use the `~~` method to retrieve a passage, based on a URN, from a Corpus.

In [None]:
val onePortFable: Corpus = corp ~~ onePortCitation

showMe(onePortFable)

Two fables, in Greek:

In [None]:
val twoGreekCitations = aesopUrn.addVersion(greekVers).addPassage("4-5")

We use the `~~` method to retrieve a passage, based on a URN, from a Corpus.

In [None]:
val twoGreekFables: Corpus = corp ~~ twoGreekCitations

showMe(twoGreekFables)

Two fables, in Portuguese:

In [None]:
val twoPortCitations = aesopUrn.addVersion(portVers).addPassage("4-5")

We use the `~~` method to retrieve a passage, based on a URN, from a Corpus.

In [None]:
val twoPortFables: Corpus = corp ~~ twoPortCitations

showMe(twoPortFables)

#### Retrieve Parts of Fables

The examples above retrieve by canonical citation, that is, by Fable. The library we define separates the heading from the text of a fable, for more precise identification and retrieval, *if so desired*.

In [None]:
val fableFiveGreekHead: Corpus = {
    corp ~~ aesopUrn.addVersion(greekVers).addPassage("5.head")
}

showMe( fableFiveGreekHead )

In [None]:
val fableFiveGreekText: Corpus = {
    corp ~~ aesopUrn.addVersion(greekVers).addPassage("5.text")
}

showMe( fableFiveGreekText )

#### Retrieve Multitext Fables

Because the [CITE Architecture](http://cite-architecture.org) has always been developed in the context of the [Homer Multitext](http://www.homermultitext.org), its *raison d’être* has been **identification and retrieval** of passages of texts, by **canonical citation**, across versions. We can capitalize on this here:

In [None]:
val fableFiveHeadAll: Corpus = {
    corp ~~ aesopUrn.addPassage("5.head")
}

showMe(fableFiveHeadAll)

In [None]:
val fableFiveAll: Corpus = {
    corp ~~ aesopUrn.addPassage("5")
}

showMe(fableFiveAll)
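
The idea behind a version-agnostic request can be illustrated with a deliberately simplified model. This is a rough sketch of the concept only: it is not how the CITE libraries implement `~~`, and `DemoUrn`, `demoMatches`, and the sample nodes are all hypothetical.

```scala
// Hypothetical, simplified model of a CTS URN: a work, an optional version, a passage
case class DemoUrn(work: String, version: Option[String], passage: String)

// A query matches a node when the work is the same, the version is either
// unspecified in the query or equal, and the node's passage is the query
// passage itself or one of its sub-passages
def demoMatches(query: DemoUrn, node: DemoUrn): Boolean = {
  val sameWork = query.work == node.work
  val versionOk = query.version.forall(v => node.version.contains(v))
  val passageOk = query.passage.isEmpty ||
    node.passage == query.passage ||
    node.passage.startsWith(query.passage + ".")
  sameWork && versionOk && passageOk
}

val demoWork = "tlg0096.tlg002"
val demoNodes: Vector[DemoUrn] = Vector(
  DemoUrn(demoWork, Some("First1K-grc1"), "5.head"),
  DemoUrn(demoWork, Some("First1K-grc1"), "5.text"),
  DemoUrn(demoWork, Some("mcdezotti"), "5.head"),
  DemoUrn(demoWork, Some("mcdezotti"), "5.text"),
  DemoUrn(demoWork, Some("First1K-grc1"), "6.head")
)

// No version in the query: both versions of 5.head come back
val demoHeadsAll = demoNodes.filter(n => demoMatches(DemoUrn(demoWork, None, "5.head"), n))

// Version given, passage "5": head and text of that one version
val demoPortFive = demoNodes.filter(n => demoMatches(DemoUrn(demoWork, Some("mcdezotti"), "5"), n))
```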

### Analysis

For information about using the OHCO2 library’s built-in analytical tools, see the [API documentation](https://cite-architecture.github.io/cite-api-docs/ohco2/api/edu/holycross/shot/ohco2/index.html). We can test our new library, though, with a quick linguistic analysis or two.

We can do a quick search for an NGram, in Greek or Portuguese, or for the whole Corpus.

We start by defining Corpora for analysis.

**N.b.** The `val` named `corp`, the Corpus in our TextRepository, contains both Greek and Portuguese.

In [None]:
val greekCorpus: Corpus = corp ~~ aesopUrn.addVersion(greekVers)

val portCorpus: Corpus = corp ~~ aesopUrn.addVersion(portVers)

We ask for repeating patterns of 3 words that occur more than 2 times:

In [None]:
val threeGramsGreek = greekCorpus.ngramHisto(3, 2)

showMe( threeGramsGreek )
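
To make concrete what an n-gram histogram counts, here is a plain-Scala sketch on a made-up String. It is not the OHCO2 implementation: splitting on whitespace, treating the threshold as "at least this many", and the name `demoNgramCounts` are all assumptions made for illustration.

```scala
// Count every run of n consecutive tokens; keep those seen at least `threshold` times
def demoNgramCounts(text: String, n: Int, threshold: Int): Vector[(String, Int)] = {
  val tokens: Vector[String] = text.split("\\s+").toVector.filter(_.nonEmpty)
  tokens.sliding(n, 1).toVector
    .filter(_.size == n)                 // drop the short window produced when tokens.size < n
    .map(_.mkString(" "))
    .groupBy(identity)
    .map { case (gram, hits) => (gram, hits.size) }
    .filter { case (_, count) => count >= threshold }
    .toVector
    .sortBy { case (gram, count) => (-count, gram) } // most frequent first, then alphabetical
}

val demoCounts: Vector[(String, Int)] = demoNgramCounts("a b c a b c a b", 3, 2)

demoCounts.foreach { case (gram, count) => println(s"${count}\t${gram}") }
```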

Let's do the same for Portuguese:

In [None]:
val threeGramsPort = portCorpus.ngramHisto(3, 2)

showMe( threeGramsPort )

Let's do the same for both languages:

In [None]:
val threeGramsAll = corp.ngramHisto(3, 2)

showMe( threeGramsAll )