# Validate a CEX File

## Configuring CITE libraries for the almond kernel

First, we make the Bintray repository that hosts the CITE libraries available to your almond kernel.

In [1]:
val myBT = coursierapi.MavenRepository.of("https://dl.bintray.com/neelsmith/maven")
interp.repositories() ++= Seq(myBT)

myBT: coursierapi.MavenRepository = MavenRepository(https://dl.bintray.com/neelsmith/maven)

Next, we bring in specific libraries from the new repository using almond's `$ivy` magic:

In [1]:
import $ivy.`edu.holycross.shot::ohco2:10.16.0`
import $ivy.`edu.holycross.shot.cite::xcite:4.1.1`
import $ivy.`edu.holycross.shot::scm:7.2.0`
import $ivy.`edu.holycross.shot::dse:5.2.2`
import $ivy.`edu.holycross.shot::citebinaryimage:3.1.1`
import $ivy.`edu.holycross.shot::citeobj:7.3.4`
import $ivy.`edu.holycross.shot::citerelations:2.5.2`
import $ivy.`edu.holycross.shot::cex:6.3.3`

Downloading https://repo1.maven.org/maven2/edu/holycross/shot/ohco2_2.13/10.16.0/ohco2_2.13-10.16.0.pom
Downloaded https://repo1.maven.org/maven2/edu/holycross/shot/ohco2_2.13/10.16.0/ohco2_2.13-10.16.0.pom
Downloading https://repo1.maven.org/maven2/edu/holycross/shot/ohco2_2.13/10.16.0/ohco2_2.13-10.16.0.pom.sha1
Downloaded https://repo1.maven.org/maven2/edu/holycross/shot/ohco2_2.13/10.16.0/ohco2_2.13-10.16.0.pom.sha1
Downloading https://dl.bintray.com/neelsmith/maven/edu/holycross/shot/ohco2_2.13/10.16.0/ohco2_2.13-10.16.0.pom
Downloaded https://dl.bintray.com/neelsmith/maven/edu/holycross/shot/ohco2_2.13/10.16.0/ohco2_2.13-10.16.0.pom
Downloading https://dl.bintray.com/neelsmith/maven/edu/holycross/shot/ohco2_2.13/10.16.0/ohco2_2.13-10.16.0.pom.sha1
Downloaded https://dl.bintray.com/neelsmith/maven/edu/holycross/shot/ohco2_2.13/10.16.0/ohco2_2.13-10.16.0.pom.sha1
Failed to resolve ivy dependencies:Error downloading edu.holycross.shot:ohco2_2.13:10.16.0
  not found: /Users/cblackwel


## Imports

From this point on, your notebook consists of completely generic Scala, with the CITE Libraries available to use.


In [None]:
// Import some CITE libraries
import edu.holycross.shot.cite._
import edu.holycross.shot.ohco2._
import edu.holycross.shot.scm._
import edu.holycross.shot.citeobj._
import edu.holycross.shot.citerelation._
import edu.holycross.shot.dse._
import edu.holycross.shot.citebinaryimage._

import almond.display.UpdatableDisplay
import almond.interpreter.api.DisplayData.ContentType
import almond.interpreter.api.{DisplayData, OutputHandler}

import java.io.File
import java.io.PrintWriter

import scala.io.Source



### Useful Functions

Easily write a String to file:

In [None]:
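def saveString(s: String, filePath: String = "", fileName: String = "temp.txt"): Unit = {
    // filePath is prepended verbatim, so a non-empty value must end with a path separator
    val writer = new PrintWriter(new File(s"${filePath}${fileName}"))
    try {
        writer.write(s)
    } finally {
        writer.close() // release the file handle even if the write fails
    }
}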
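
Because the texts in this notebook are Greek, the character encoding of the output file matters. `new PrintWriter(file)` uses the JVM's default charset, which on some platforms and older JVMs is not UTF-8. The following is a minimal sketch of a write and read-back pair with the encoding made explicit; the names `writeUtf8` and `readUtf8` are hypothetical helpers, not part of the CITE libraries.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

// Hypothetical helper: write a String to a file, always as UTF-8.
def writeUtf8(s: String, path: String): Unit = {
  val writer = new PrintWriter(new File(path), "UTF-8")
  try writer.write(s)
  finally writer.close() // close even if the write fails
}

// Hypothetical helper: read a whole file back as one String, decoding as UTF-8.
def readUtf8(path: String): String = {
  val src = Source.fromFile(path, "UTF-8")
  try src.mkString
  finally src.close()
}
```

Reading the file back with `readUtf8` is a quick way to confirm that polytonic Greek survived the round trip.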

Given a path to a CEX file, return an `Option[Corpus]`. This lets us drill right down through a potentially complex CITE Library and get just the texts.

In [None]:
def corpusFromCex(cexPath: String): Option[Corpus] = {
    val lib: CiteLibrary = CiteLibrarySource.fromFile(cexPath)
    // textRepository is itself an Option; map over it to pull out the Corpus, if any
    lib.textRepository.map(tr => tr.corpus)
}

A useful function for printing out collections, vectors, etc.

In [None]:
def showMe(v: Any): Unit = {
  v match {
    // Print one histogram entry per line
    case sh: StringHistogram =>
        println(s"""\n----\n${sh.histogram.map(_.toString).mkString("\n")}\n----\n""")
    // Vector, List, Set, etc. are all Iterables: print one element per line
    case it: Iterable[_] =>
        println(s"""\n----\n${it.mkString("\n")}\n----\n""")
    // Anything else: fall back to its toString
    case _ => println(s"\n----\n${v}\n----\n")
  }
}

## Load Corpora

We want to work with two versions of the *Encheiridion*: 

- The Greek text
- A lemmatized version of the Greek text

So let's load them now. We will make **three** `Corpus` objects: one for each version, and one that combines both versions in a single `Corpus`.

In [None]:
val epicGrkFilePath = "epictetus_encheiridion_greek_edition.cex"
val epicLemFilePath = "epictetus_encheiridion_greek_lemmatized.cex"

// Now we turn these into Corpus-objects.
// We will assume that the CEX files are present and correct… if anything goes wrong, we'll get an exception

// corpusFromCex returns an Option[Corpus], so we take a chance and `.get` the results

val epicGrk: Corpus = {
    corpusFromCex(epicGrkFilePath).get 
}

val epicLem: Corpus = {
    corpusFromCex(epicLemFilePath).get 
}


val epicAll: Corpus = {
    epicGrk ++ epicLem
}

/* But let's throw a little test in here */

val grkSize: Int = epicGrk.size
val lemSize: Int = epicLem.size
val allSize: Int = epicAll.size

assert( grkSize == lemSize )
assert ( (grkSize + lemSize) == allSize )
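
Calling `.get` on an `Option` throws a bare `NoSuchElementException` when the value is `None`, which says nothing about which file was the problem. The following sketch shows two alternatives in plain Scala. Note that `loadSketch` is a hypothetical stand-in for `corpusFromCex`, and a `Vector[String]` stands in for a `Corpus`, so that the example runs without the CITE libraries.

```scala
// Hypothetical stand-in for corpusFromCex: yields Some(lines) only for ".cex" paths.
def loadSketch(path: String): Option[Vector[String]] =
  if (path.endsWith(".cex")) Some(Vector("line 1", "line 2")) else None

// Combine two optional results; the whole expression is None if either load yields None.
def loadBoth(pathA: String, pathB: String): Option[Vector[String]] =
  for {
    a <- loadSketch(pathA)
    b <- loadSketch(pathB)
  } yield a ++ b

// Turn a missing result into a descriptive message instead of an exception.
def loadOrExplain(path: String): Either[String, Vector[String]] =
  loadSketch(path).toRight(s"No text repository found in $path")
```

For a short exploratory notebook `.get` is a reasonable shortcut; the `Either` version is more useful when many files are loaded and you want to know which one failed.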


## How to find N-Grams

We can use the [OHCO2](https://cite-architecture.github.io/cite-api-docs/ohco2/api/edu/holycross/shot/ohco2/index.html) library to run n-gram queries on our `Corpus` objects.

In the first example, we will look for N-Grams in the Greek text.

In [None]:
val n: Int = 3 // number of words in a pattern
val t: Int = 2 // occurring more than this many times in the Corpus

val ngh: StringHistogram = epicGrk.ngramHisto(n, t)

// Pretty-print the result
showMe(ngh)
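
To make the idea of an n-gram histogram concrete, here is a minimal sketch in plain Scala. It is not the OHCO2 implementation: `ngramCounts` is a hypothetical function that splits on whitespace only, and it assumes the threshold means "more than this many times", following the comment in the cell above.

```scala
// Count every run of n consecutive tokens, keep those seen more than
// `threshold` times, and sort by descending count (ties broken alphabetically).
def ngramCounts(text: String, n: Int, threshold: Int): Vector[(String, Int)] = {
  val tokens: Vector[String] = text.split("\\s+").filter(_.nonEmpty).toVector
  tokens
    .sliding(n)
    .filter(_.size == n) // drop the short window produced when there are fewer than n tokens
    .map(_.mkString(" "))
    .toVector
    .groupBy(identity)
    .map { case (gram, hits) => (gram, hits.size) }
    .filter { case (_, count) => count > threshold }
    .toVector
    .sortBy { case (gram, count) => (-count, gram) }
}

val sample = "a b c a b c a b d"
// Trigrams occurring more than once: "a b c", "b c a" and "c a b", each twice.
val trigrams = ngramCounts(sample, 3, 1)
```

With `n = 1` the same function yields a word-frequency list.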

The next example shows how to get a list of unique words, by frequency. We simply use the same method but ask for "1-grams". We'll use a high threshold for this example so that the list does not get too long.

In [None]:
val n: Int = 1 // number of words in a pattern
val t: Int = 400 // occurring more than this many times in the Corpus

val ngh: StringHistogram = epicGrk.ngramHisto(n, t)

// Pretty-print the result
showMe(ngh)

Let's compare vocabulary in the first half of the *Encheiridion* with vocabulary in the second half.

In [None]:
// We get two Corpus objects by splitting the Greek text in half

val firstHalf: Corpus = {
    // We define the half with a citation
    val urn: CtsUrn = CtsUrn("urn:cts:greekLit:tlg0557.tlg002.perseus-grc1:1-26")
    // We make a new Corpus by "twiddling" the Greek text
    epicGrk ~~ urn
}

val secondHalf: Corpus = {
    // We define the half with a citation
    val urn: CtsUrn = CtsUrn("urn:cts:greekLit:tlg0557.tlg002.perseus-grc1:27-53")
    // We make a new Corpus by "twiddling" the Greek text
    epicGrk ~~ urn
}

// We get full vocabulary for both

val n: Int = 1 // number of words in a pattern
val t: Int = 0 // occurring more than this many times in the Corpus

val histo1: StringHistogram = firstHalf.ngramHisto(n, t)
val histo2: StringHistogram = secondHalf.ngramHisto(n, t)

// Let's map those so we just have the words. We don't need the counts right now.

val vocab1: Vector[String] = histo1.histogram.map(_.s)
val vocab2: Vector[String] = histo2.histogram.map(_.s)


Now that we have that data, we can ask questions of it.

In [None]:
// Comparing sizes

println(s"First Half: ${vocab1.size} unique words")
println(s"Second Half: ${vocab2.size} unique words")

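// Comparing the 100 most frequent words
// (take(100) gives the most frequent words only if the histogram is sorted
// by descending count, which we assume here)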

val mf1: Vector[String] = vocab1.take(100)
val mf2: Vector[String] = vocab2.take(100)

showMe(mf1.diff(mf2)) // will show words that mf1 has, but mf2 does not have

showMe(mf2.diff(mf1)) // will show words that mf2 has, but mf1 does not have


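/* .diff can be confusing. Here is a very simple example. */

val t1 = Vector(1,2,3,4,5,6)

val t2 = Vector(4,5,6,7,8)

showMe(t1.diff(t2)) // shows 1, 2, 3: the items in t1 that are not in t2

showMe(t2.diff(t1)) // shows 7, 8: the items in t2 that are not in t1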
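
One more point about `.diff` is worth a sketch: on a `Vector` it is a multiset difference, so each item in the argument cancels only one occurrence in the receiver. That makes no difference for the vocabulary lists above, where every word appears once, but it matters for lists with repeats. The words below are made-up sample data.

```scala
val a = Vector("kai", "de", "kai", "men")
val b = Vector("kai")

// diff removes only ONE "kai", because b contains it once.
val multisetDiff = a.diff(b)            // Vector("de", "kai", "men")

// Filtering against a Set removes EVERY occurrence of anything found in b.
val setStyleDiff = a.filterNot(b.toSet) // Vector("de", "men")

// intersect gives the items the two lists share.
val shared = a.distinct.intersect(b)    // Vector("kai")
```

If you want "words that never occur in the other half" from lists that may contain repeats, the `filterNot` form is the safer choice.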
