## Counting and Zipf's Law

Let's extend our counting skills to see what Zipf's Law looks like in Greek.

The first cell below just loads a long text. Don't worry yet about how it works, but notice that at the end we've created a named String value called `scholia`.

In [1]:
import scala.io.Source
val scholiaUrl = "https://raw.githubusercontent.com/hmteditors/iliad23-2020/master/presentation/e3ScholiaData"

val scholia = Source.fromURL(scholiaUrl).mkString("")






[32mimport [39m[36mscala.io.Source
[39m
[36mscholiaUrl[39m: [32mString[39m = [32m"https://raw.githubusercontent.com/hmteditors/iliad23-2020/master/presentation/e3ScholiaData"[39m
[36mscholia[39m: [32mString[39m = [32m"""urn:cts:greekLit:tlg5026.e3.hmt:23.e3_294r_4.lemma#ιζ
urn:cts:greekLit:tlg5026.e3.hmt:23.e3_294r_4.comment#Νῆας . τὸν τόπον τῶν νηῶν~ Ἑλλήσποντον δὲ τὴν μέχρι Σιγείου θάλασσαν ⁑
urn:cts:greekLit:tlg5026.e3.hmt:23.e3_294r_5.lemma#ιη
urn:cts:greekLit:tlg5026.e3.hmt:23.e3_294r_5.comment#ἀπ' ἀλλήλων ἐχωρίζοντο ⁑
urn:cts:greekLit:tlg5026.e3.hmt:23.e3_294r_6.lemma#ιθ
urn:cts:greekLit:tlg5026.e3.hmt:23.e3_294r_6.comment#ἀπ' αὐτοῦ χωρίζεσθαι~ ἢ παρέλκει ἡ ἀπό ὡς τὸ ἀχητίμησεν ⁑
urn:cts:greekLit:tlg5026.e3.hmt:23.e3_294r_7.lemma#κ
urn:cts:greekLit:tlg5026.e3.hmt:23.e3_294r_7.comment#τοὺς ὑπὸ τοῖς ὀχήμασι μονώνυχας ἵππους ⁑
urn:cts:greekLit:tlg5026.e3.hmt:23.e3_294r_8.lemma#urn:cite2:hmt:dingbats.v1:dingbats22
urn:cts:greekLit:tlg5026.e3.hmt:23.e3_294r_8.comment#λε

Ever wonder how long the *scholia* text is? Calling `size` on a String gives its length in characters.

In [2]:

scholia.size


[36mres1[39m: [32mInt[39m = [32m17469[39m

## Is the *Iliad* Zipfie?

So let's count "words".  We could spend a long time deciding what a word is, but let's keep it simple today by:

1. throwing away some punctuation characters
2. then splitting the long text on whitespace



In [4]:
// We'll learn more later about what's going on here: the list of
// characters inside square brackets is actually a *regular expression*
val stripped = scholia.replaceAll("[,·\\.:~⁑]", "")
// The expression [ \n]+ means "one or more occurrences of a space or newline character"
val words = stripped.split("[ \n]+").toVector

// Print first 100 words to see if they look right...
println(words.take(100))

Vector(urnctsgreekLittlg5026e3hmt23e3_294r_4lemma#ιζ, urnctsgreekLittlg5026e3hmt23e3_294r_4comment#Νῆας, τὸν, τόπον, τῶν, νηῶν, Ἑλλήσποντον, δὲ, τὴν, μέχρι, Σιγείου, θάλασσαν, urnctsgreekLittlg5026e3hmt23e3_294r_5lemma#ιη, urnctsgreekLittlg5026e3hmt23e3_294r_5comment#ἀπ', ἀλλήλων, ἐχωρίζοντο, urnctsgreekLittlg5026e3hmt23e3_294r_6lemma#ιθ, urnctsgreekLittlg5026e3hmt23e3_294r_6comment#ἀπ', αὐτοῦ, χωρίζεσθαι, ἢ, παρέλκει, ἡ, ἀπό, ὡς, τὸ, ἀχητίμησεν, urnctsgreekLittlg5026e3hmt23e3_294r_7lemma#κ, urnctsgreekLittlg5026e3hmt23e3_294r_7comment#τοὺς, ὑπὸ, τοῖς, ὀχήμασι, μονώνυχας, ἵππους, urnctsgreekLittlg5026e3hmt23e3_294r_8lemma#urncite2hmtdingbatsv1dingbats22, urnctsgreekLittlg5026e3hmt23e3_294r_8comment#λείπει, ἡ, συν, urnctsgreekLittlg5026e3hmt23e3_294r_9lemma#κα, urnctsgreekLittlg5026e3hmt23e3_294r_9comment#Αἰσχύλος, φησὶν, ὅτι, στεναγμὶ, τῶν, πόνων, ἰάματα, οὐ, μέτρια, τυγχάνουσιν, urnctsgreekLittlg5026e3hmt23e3_294v_1lemma#, urnctsgreekLittlg5026e3hmt23e3_294v_1comment#, του, κλαιειν, ε

[36mstripped[39m: [32mString[39m = [32m"""urnctsgreekLittlg5026e3hmt23e3_294r_4lemma#ιζ
urnctsgreekLittlg5026e3hmt23e3_294r_4comment#Νῆας  τὸν τόπον τῶν νηῶν Ἑλλήσποντον δὲ τὴν μέχρι Σιγείου θάλασσαν 
urnctsgreekLittlg5026e3hmt23e3_294r_5lemma#ιη
urnctsgreekLittlg5026e3hmt23e3_294r_5comment#ἀπ' ἀλλήλων ἐχωρίζοντο 
urnctsgreekLittlg5026e3hmt23e3_294r_6lemma#ιθ
urnctsgreekLittlg5026e3hmt23e3_294r_6comment#ἀπ' αὐτοῦ χωρίζεσθαι ἢ παρέλκει ἡ ἀπό ὡς τὸ ἀχητίμησεν 
urnctsgreekLittlg5026e3hmt23e3_294r_7lemma#κ
urnctsgreekLittlg5026e3hmt23e3_294r_7comment#τοὺς ὑπὸ τοῖς ὀχήμασι μονώνυχας ἵππους 
urnctsgreekLittlg5026e3hmt23e3_294r_8lemma#urncite2hmtdingbatsv1dingbats22
urnctsgreekLittlg5026e3hmt23e3_294r_8comment#λείπει ἡ συν 
urnctsgreekLittlg5026e3hmt23e3_294r_9lemma#κα
urnctsgreekLittlg5026e3hmt23e3_294r_9comment#Αἰσχύλος φησὶν  ὅτι στεναγμὶ τῶν πόνων ἰάματα οὐ μέτρια τυγχάνουσιν 
urnctsgreekLittlg5026e3hmt23e3_294v_1lemma#
urnctsgreekLittlg5026e3hmt23e3_294v_1comment# του κλαιειν ενοπλο
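
The output above shows that each line's URN gets fused onto the first word that follows it, because the URN and the text are separated by `#` rather than by whitespace. One way to avoid counting those fused strings as "words" is to discard everything up to the first `#` before stripping punctuation. The following is only a minimal sketch on two sample lines copied from the data above; the names `sample`, `textOnly`, and `cleanWords` are made up for illustration and are not part of the notebook.

```scala
// Two sample lines copied from the scholia data: URN, then '#', then text.
val sample = Vector(
  "urn:cts:greekLit:tlg5026.e3.hmt:23.e3_294r_5.lemma#ιη",
  "urn:cts:greekLit:tlg5026.e3.hmt:23.e3_294r_5.comment#ἀπ' ἀλλήλων ἐχωρίζοντο ⁑"
)

// Keep only what follows the first '#'; lines without a '#' are dropped.
val textOnly = sample.flatMap { line =>
  line.split("#", 2) match {
    case Array(_, text) => Some(text)
    case _              => None
  }
}

// Same two steps as the cell above: strip punctuation, then split on whitespace.
// The final filter discards empty strings left by lines with no text after '#'.
val cleanWords = textOnly
  .map(_.replaceAll("[,·\\.:~⁑]", ""))
  .flatMap(_.split("[ \n]+"))
  .filter(_.nonEmpty)

println(cleanWords)
```

To try this on the full text, `sample` could be replaced with `scholia.split("\n").toVector`. Note that the short lemma labels (such as `ιη`) would still be counted as words under this approach.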

In [5]:
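// Group identical words together: each entry pairs a word with the list of its occurrences
val grouped = words.groupBy( w => w).toVector
// Replace each list of occurrences with its size, giving (word, count) pairs
val frequencies = grouped.map{ case (word, wordList) => (word, wordList.size)}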

[36mgrouped[39m: [32mVector[39m[([32mString[39m, [32mVector[39m[[32mString[39m])] = [33mVector[39m(
  ([32m"\u1f21\u03b4\u1f73\u03c9\u03bd"[39m, [33mVector[39m([32m"\u1f21\u03b4\u1f73\u03c9\u03bd"[39m)),
  (
    [32m"\u03bf\u1f50\u03b4\u03b5\u03bd\u1f78\u03c2"[39m,
    [33mVector[39m([32m"\u03bf\u1f50\u03b4\u03b5\u03bd\u1f78\u03c2"[39m)
  ),
  (
    [32m"\u1f14\u03b4\u03b1\u03c6\u03bf\u03c2"[39m,
    [33mVector[39m([32m"\u1f14\u03b4\u03b1\u03c6\u03bf\u03c2"[39m)
  ),
  ([32m"\u03c4\u03b9\u03ba\u1f79\u03bd'"[39m, [33mVector[39m([32m"\u03c4\u03b9\u03ba\u1f79\u03bd'"[39m)),
  ([32m"\u03bb\u03b5\u03c5\u03ba\u1f70"[39m, [33mVector[39m([32m"\u03bb\u03b5\u03c5\u03ba\u1f70"[39m)),
  (
    [32m"\u03b4\u03b5\u03c5\u03c4\u1f73\u03c1\u03bf\u03c5"[39m,
    [33mVector[39m([32m"\u03b4\u03b5\u03c5\u03c4\u1f73\u03c1\u03bf\u03c5"[39m)
  ),
  ([32m"\u03b1\u1f7b\u03be\u03b7"[39m, [33mVector[39m([32m"\u03b1\u1f7b\u03be\u03b7"[39m)),
  (
    [32m"\u1f40
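
The grouping step can be hard to read in the long output above, so here is a small sketch of the same two operations on a toy list. The list `toy` is made up for illustration (it is not taken from the scholia), and the names `toyGrouped` and `toyFrequencies` are hypothetical.

```scala
// A tiny, made-up word list, just to show the shape of each step.
val toy = Vector("τὸ", "καὶ", "τὸ", "δὲ", "τὸ", "καὶ")

// groupBy collects identical words under one key:
// each pair is (word, Vector of every occurrence of that word).
val toyGrouped = toy.groupBy(w => w).toVector

// The size of each Vector of occurrences is that word's frequency.
val toyFrequencies = toyGrouped.map { case (word, occurrences) =>
  (word, occurrences.size)
}

println(toyFrequencies)
```

The order of the pairs is not guaranteed, because `groupBy` builds an unordered Map before `toVector` is applied. That is why a separate sorting step is needed afterwards.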

In [6]:
// Sort by count (ascending), then reverse so the most frequent words come first
val mostToFewest = frequencies.sortBy{ case (w,freq) => freq }.reverse
println(mostToFewest.take(100))

Vector((τὸ,54), (καὶ,47), (δὲ,36), (τοῦ,31), (τὴν,20), (γὰρ,18), (τὸν,14), (τῶν,13), (ἢ,12), (ὡς,12), (πρὸς,11), (τοῖς,11), (τὰ,10), (ἐν,9), (διὰ,9), (ἡ,9), (οἱ,9), (τῷ,9), (τὰς,8), (τοῦτο,8), (ὅτι,8), (ὁ,8), (τῆς,7), (ἀντὶ,7), (ἀλλ',7), (παρὰ,7), (αὐτοῦ,6), (μὲν,6), (τῇ,6), (ἐπὶ,6), (μετὰ,5), (εἰς,5), (οὐ,5), (τοὺς,4), (οὖν,4), (νῦν,4), (σῶμα,4), (πλῆθος,4), (φησι,4), (τε,4), (ἐστι,4), (οὐκ,4), (ἔστι,4), (ἐκ,4), (γοῦν,3), (μὴ,3), (ἵνα,3), (χεῖρας,3), (αὐτῷ,3), (πρὸ,3), (πιμελὴν,3), (Ἀχιλλεὺς,3), (περὶ,3), (αὐτὸν,3), (δεῖπνον,3), (δηλοῖ,3), (παρ',3), (ᾖ,3), (γένη,2), (μετ',2), (νεκροῦ,2), (πλευρὰς,2), (αἱ,2), (τάφου,2), (λόγος,2), (εἰ,2), (ὑψοῦ,2), (πῶς,2), (κοτύλην,2), (δέ,2), (εἶναι,2), (ἵν',2), (ἐπιμέλειαν,2), (πρότερον,2), (ποσὶ,2), (σώματος,2), (ἡμῶν,2), (ὅπλα,2), (Ἀχιλλεῖ,2), (ἐστίν,2), (ἐστιν,2), (ἐάν,2), (ταῖς,2), (ὀνείρου,2), (τόπον,2), (ποδῶν,2), (σορὸν,2), (κατὰ,2), (παρέλκει,2), (ἀλλὰ,2), (τάφον,2), (ὑπὲρ,2), (ἐμφαίνει,2), (ὃν,2), (ὄϊν,2), (εἶπε,2), (κηδείας,2), (πένθει,2),

[36mmostToFewest[39m: [32mVector[39m[([32mString[39m, [32mInt[39m)] = [33mVector[39m(
  ([32m"\u03c4\u1f78"[39m, [32m54[39m),
  ([32m"\u03ba\u03b1\u1f76"[39m, [32m47[39m),
  ([32m"\u03b4\u1f72"[39m, [32m36[39m),
  ([32m"\u03c4\u03bf\u1fe6"[39m, [32m31[39m),
  ([32m"\u03c4\u1f74\u03bd"[39m, [32m20[39m),
  ([32m"\u03b3\u1f70\u03c1"[39m, [32m18[39m),
  ([32m"\u03c4\u1f78\u03bd"[39m, [32m14[39m),
  ([32m"\u03c4\u1ff6\u03bd"[39m, [32m13[39m),
  ([32m"\u1f22"[39m, [32m12[39m),
  ([32m"\u1f61\u03c2"[39m, [32m12[39m),
  ([32m"\u03c0\u03c1\u1f78\u03c2"[39m, [32m11[39m),
  ([32m"\u03c4\u03bf\u1fd6\u03c2"[39m, [32m11[39m),
  ([32m"\u03c4\u1f70"[39m, [32m10[39m),
  ([32m"\u1f10\u03bd"[39m, [32m9[39m),
  ([32m"\u03b4\u03b9\u1f70"[39m, [32m9[39m),
  ([32m"\u1f21"[39m, [32m9[39m),
  ([32m"\u03bf\u1f31"[39m, [32m9[39m),
  ([32m"\u03c4\u1ff7"[39m, [32m9[39m),
  ([32m"\u03c4\u1f70\u03c2"[39m, [32m8[39m),
  ([32m"\u03c

In [7]:
// Make plotly libraries available to this notebook:
import $ivy.`org.plotly-scala::plotly-almond:0.7.1`
// Import plotly libraries, and set display defaults suggested for use in Jupyter NBs:
import plotly._, plotly.element._, plotly.layout._, plotly.Almond._
repl.pprinter() = repl.pprinter().copy(defaultHeight = 3)

[32mimport [39m[36m$ivy.$                                      
// Import plotly libraries, and set display defaults suggested for use in Jupyter NBs:
[39m
[32mimport [39m[36mplotly._, plotly.element._, plotly.layout._, plotly.Almond._
[39m

In [8]:
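// Split the (word, count) pairs into two parallel lists for the x and y axes.
// Note: this redefines `words`, replacing the Vector created in the earlier cell.
val words = mostToFewest.map(frequency => frequency._1)
val counts = mostToFewest.map(frequency => frequency._2)

// A bar chart with words in rank order along x and their counts along y
val zipf = Vector(
  Bar(x = words, y = counts)
)
plot(zipf)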

[36mwords[39m: [32mVector[39m[[32mString[39m] = [33mVector[39m(
  [32m"\u03c4\u1f78"[39m,
...
[36mcounts[39m: [32mVector[39m[[32mInt[39m] = [33mVector[39m(
  [32m54[39m,
...
[36mzipf[39m: [32mVector[39m[[32mBar[39m] = [33mVector[39m(
  [33mBar[39m(
...
[36mres7_3[39m: [32mString[39m = [32m"plot-c0ed0bd8-4330-4787-be44-44439c9813a0"[39m

## Can you do this for the *scholia*?