## Counting and Zipf's Law

Let's extend our counting skills to see what Zipf's Law looks like in Greek.

The first cell below loads a long text. Don't worry yet about how it works, but notice that at the end, we've created a named String value called `scholia`.

In [1]:
import scala.io.Source
val scholiaUrl = "https://raw.githubusercontent.com/hmteditors/iliad23-2020/master/presentation/vbScholiaData"

val scholia = Source.fromURL(scholiaUrl).mkString("")

import scala.io.Source

scholiaUrl: String = "https://raw.githubusercontent.com/hmteditors/iliad23-2020/master/presentation/vbScholiaData"
scholia: String = """urn:cts:greekLit:tlg5026.msB.hmt:23.msB_303r_4.lemma#ιζ
urn:cts:greekLit:tlg5026.msB.hmt:23.msB_303r_4.comment#νῆας τὸν τόπον τῶν νηῶν~ Ἑλλήσποντον δὲ , τὴν μέχρι Σιγείου θάλασσαν ⁑
urn:cts:greekLit:tlg5026.msB.hmt:23.msB_303r_5.lemma#ιη
urn:cts:greekLit:tlg5026.msB.hmt:23.msB_303r_5.comment#ἀπ' ἀλλήλων ἐχωρίζοντο ⁑
urn:cts:greekLit:tlg5026.msB.hmt:23.msB_303r_6.lemma#ιθ
urn:cts:greekLit:tlg5026.msB.hmt:23.msB_303r_6.comment#ἀπ' αὐτοῦ χωρίζεσθαι~ ἢ παρέλκει ἡ ἀπό ὡς τὸ ἀχητίμησεν ⁑
urn:cts:greekLit:tlg5026.msB.hmt:23.msB_303r_7.lemma#κ
urn:cts:greekLit:tlg5026.msB.hmt:23.msB_303r_7.comment#τοὺς ὑπὸ τοῖς ὀχήμασι μονώνυχας ἵππους ⁑
urn:cts:greekLit:tlg5026.msB.hmt:23.msB_303r_8.lemma#urn:cite2:hmt:dingbats.v1:dingbats22
urn:cts:greekLit:tlg5026.msB.hmt:23.m

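Ever wonder how long the Venetus A *Iliad* or *scholia* are? Note that `size` on a String counts characters, so the result below is a character count, not a word count.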

In [2]:

scholia.size

res1: Int = 15754

## Is the *Iliad* Zipfie?

So let's count "words".  We could spend a long time deciding what a word is, but let's keep it simple today by:

1. throwing away some punctuation characters
2. then splitting the long text on whitespace



In [8]:
// We'll learn more later about what's going on here: the list of
// characters inside square brackets is actually a *regular expression*
val stripped = scholia.replaceAll("[,·\\.:~⁑]", "")
// The expression [ \n]+ means "one or more occurrences of a space or newline character"
val words = stripped.split("[ \n]+").toVector

// Print first 100 words to see if they look right...
println(words.take(100))

Vector(urnctsgreekLittlg5026msBhmt23msB_303r_4lemma#ιζ, urnctsgreekLittlg5026msBhmt23msB_303r_4comment#νῆας, τὸν, τόπον, τῶν, νηῶν, Ἑλλήσποντον, δὲ, τὴν, μέχρι, Σιγείου, θάλασσαν, urnctsgreekLittlg5026msBhmt23msB_303r_5lemma#ιη, urnctsgreekLittlg5026msBhmt23msB_303r_5comment#ἀπ', ἀλλήλων, ἐχωρίζοντο, urnctsgreekLittlg5026msBhmt23msB_303r_6lemma#ιθ, urnctsgreekLittlg5026msBhmt23msB_303r_6comment#ἀπ', αὐτοῦ, χωρίζεσθαι, ἢ, παρέλκει, ἡ, ἀπό, ὡς, τὸ, ἀχητίμησεν, urnctsgreekLittlg5026msBhmt23msB_303r_7lemma#κ, urnctsgreekLittlg5026msBhmt23msB_303r_7comment#τοὺς, ὑπὸ, τοῖς, ὀχήμασι, μονώνυχας, ἵππους, urnctsgreekLittlg5026msBhmt23msB_303r_8lemma#urncite2hmtdingbatsv1dingbats22, urnctsgreekLittlg5026msBhmt23msB_303r_8comment#λείπει, ἡ, συν, urnctsgreekLittlg5026msBhmt23msB_303r_9lemma#κα, urnctsgreekLittlg5026msBhmt23msB_303r_9comment#Αἰσχύλος, φησὶν, ὅτι, στεναγμὶ, ;, τῶν, πόνων, ἰάματα, οὐ, μέτρια, τυγχάνουσιν, urnctsgreekLittlg5026msBhmt23msB_303v_1lemma#α, urnctsgreekLittlg5026msBhmt23msB

stripped: String = """urnctsgreekLittlg5026msBhmt23msB_303r_4lemma#ιζ
urnctsgreekLittlg5026msBhmt23msB_303r_4comment#νῆας τὸν τόπον τῶν νηῶν Ἑλλήσποντ...
words: Vector[String] = Vector(
  "urnctsgreekLittlg5026msBhmt23msB_303r_4lemma#\u03b9\u03b6",
...
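
The output above shows each URN fused to the first word of its line (for example `...comment#νῆας`), because the punctuation inside the URN is stripped and the `#` separator is not whitespace. One possible way around this, assuming every line has the form `urn#text` as in the data printed earlier, is to keep only what follows the first `#` before stripping and splitting. The names `ScholiaTokens`, `commentText`, `tokenize` and `sample` are made up for this sketch and are not part of the notebook.

```scala
object ScholiaTokens {
  // Keep only the text after the first '#'; a line without '#' is kept whole.
  def commentText(line: String): String = {
    val idx = line.indexOf('#')
    if (idx >= 0) line.substring(idx + 1) else line
  }

  // Same two steps as the cell above (strip punctuation, split on
  // spaces/newlines), applied after the URN column has been dropped.
  def tokenize(data: String): Vector[String] = {
    val textOnly = data.split("\n").toVector.map(commentText).mkString("\n")
    val stripped = textOnly.replaceAll("[,·\\.:~⁑]", "")
    stripped.split("[ \n]+").toVector.filter(_.nonEmpty)
  }

  // Two lines copied from the scholia data shown earlier.
  val sample: String =
    "urn:cts:greekLit:tlg5026.msB.hmt:23.msB_303r_5.lemma#ιη\n" +
    "urn:cts:greekLit:tlg5026.msB.hmt:23.msB_303r_5.comment#ἀπ' ἀλλήλων ἐχωρίζοντο ⁑"
}
```

In a notebook cell, `ScholiaTokens.tokenize(scholia)` would give a token list without the URN prefixes. Lemma lines are still included here; whether they should count as "words" is a separate decision.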

In [9]:
val grouped = words.groupBy(w => w).toVector
val frequencies = grouped.map { case (word, wordList) => (word, wordList.size) }

grouped: Vector[(String, Vector[String])] = Vector(
  ("\u1f21\u03b4\u1f73\u03c9\u03bd", Vector("\u1f21\u03b4\u1f73\u03c9\u03bd")),
...
frequencies: Vector[(String, Int)] = Vector(
  ("\u1f21\u03b4\u1f73\u03c9\u03bd", 1),
...
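
To see what the grouping step does, it can help to run it on a handful of words. The sketch below uses the same `groupBy` and `map` calls as the cell above on a small hand-made sample; `CountSketch`, `wordCounts` and `sample` are illustrative names only. Note that `groupBy` returns its groups in no meaningful order, which is why the first entry in the output above is not the most frequent word.

```scala
object CountSketch {
  // Group identical words together, then replace each group by its size.
  def wordCounts(words: Vector[String]): Vector[(String, Int)] =
    words.groupBy(w => w).toVector.map {
      case (word, occurrences) => (word, occurrences.size)
    }

  // A small hand-made sample in which one word occurs twice.
  val sample: Vector[String] = Vector("τὸν", "τόπον", "τῶν", "νηῶν", "τῶν")
}
```

Calling `CountSketch.wordCounts(CountSketch.sample)` yields one `(word, count)` pair per distinct word, and the counts add up to the number of words in the sample.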

In [16]:
val mostToFewest = frequencies.sortBy{ case (w,freq) => freq }.reverse
println(mostToFewest.take(100))



Vector((τὸ,48), (καὶ,44), (δὲ,32), (τοῦ,28), (γὰρ,17), (τὴν,14), (τὸν,13), (ὡς,12), (τῶν,11), (ἢ,10), (πρὸς,10), (οἱ,10), (τῷ,9), (ἐν,8), (τοῦτο,8), (τοῖς,8), (τῆς,8), (διὰ,8), (ἡ,8), (τὰς,7), (τὰ,7), (ὁ,7), (παρὰ,7), (ὅτι,6), (μὲν,6), (ἀντὶ,6), (ἀλλ',6), (μετὰ,5), (αὐτοῦ,5), (οὐ,5), (τῇ,5), (ἐπὶ,5), (οὖν,4), (νῦν,4), (σῶμα,4), (τε,4), (οὐκ,4), (ἔστι,4), (τοὺς,3), (γοῦν,3), (ἦ,3), (μὴ,3), (ἵνα,3), (ἵν',3), (πλῆθος,3), (εἰς,3), (ἐστίν,3), (χεῖρας,3), (αὐτῷ,3), (πρὸ,3), (πιμελὴν,3), (περὶ,3), (αὐτὸν,3), (δεῖπνον,3), (φησὶ,3), (δηλοῖ,3), (παρ',3), (ἐκ,3), (μετ',2), (νεκροῦ,2), (πλευρὰς,2), (αἱ,2), (τάφου,2), (λόγος,2), (εἰ,2), (ὑψοῦ,2), (πῶς,2), (κοτύλην,2), (δέ,2), (ἐπιμέλειαν,2), (πρότερον,2), (σώματος,2), (ἡμῶν,2), (ὅπλα,2), (Ἀχιλλεὺς,2), (ἐάν,2), (ὀνείρου,2), (τόπον,2), (σορὸν,2), (παρέλκει,2), (ἐστι,2), (ἀλλὰ,2), (τάφον,2), (ὑπὲρ,2), (ὃν,2), (ὄϊν,2), (εἶπε,2), (κηδείας,2), (πένθει,2), (δυνάμενον,2), (Ἀχιλλέως,2), (ἀπὸ,2), (ὕστερον,2), (γένη,1), (urnctsgreekLittlg5026msBhmt23msB_303v_

mostToFewest: Vector[(String, Int)] = Vector(
  ("\u03c4\u1f78", 48),
...
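
Sorting by frequency and then reversing works, but words with the same count end up in an order that depends on how `groupBy` happened to arrange them. A possible alternative, sketched here with the hypothetical names `SortSketch` and `mostToFewest`, sorts on a pair: the negated count first (so larger counts come first), then the word itself to break ties by ordinary String ordering.

```scala
object SortSketch {
  // Descending by count; ties are broken by String ordering so that
  // the result is the same on every run.
  def mostToFewest(frequencies: Vector[(String, Int)]): Vector[(String, Int)] =
    frequencies.sortBy { case (word, freq) => (-freq, word) }
}
```

String ordering compares character codes, so it is not a proper Greek alphabetical collation; it only serves to make the order reproducible.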

In [11]:
// Make plotly libraries available to this notebook:
import $ivy.`org.plotly-scala::plotly-almond:0.7.1`
// Import plotly libraries, and set display defaults suggested for use in Jupyter NBs:
import plotly._, plotly.element._, plotly.layout._, plotly.Almond._
repl.pprinter() = repl.pprinter().copy(defaultHeight = 3)

import $ivy.$

import plotly._, plotly.element._, plotly.layout._, plotly.Almond._

In [13]:
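// Split the (word, count) pairs into two parallel Vectors for plotting.
// Note: this rebinds `words`, which earlier held the full list of tokens.
val words = mostToFewest.map(frequency => frequency._1)
val counts = mostToFewest.map(frequency => frequency._2)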

val zipf = Vector(
  Bar(x = words, y = counts)
)
plot(zipf)

words: Vector[String] = Vector(
  "\u03c4\u1f78",
...
counts: Vector[Int] = Vector(
  48,
...
zipf: Vector[Bar] = Vector(
  Bar(
...
res12_3: String = "plot-9311b454-ad2a-432e-b9ec-e9ec2f97656a"

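Besides looking at the bar chart, the counts can be checked against Zipf's Law numerically. In its simplest form the law says that the frequency of the word at rank *r* is roughly proportional to 1/*r*, so an idealised curve would predict the top count divided by the rank. The sketch below pairs each observed count with that idealised value; `ZipfCheck`, `compare` and `topCounts` are names invented for this example, and the five numbers are the highest counts from the sorted output above.

```scala
object ZipfCheck {
  // For each count, return (rank, observed count, idealised Zipf value),
  // where the idealised value is the rank-1 count divided by the rank.
  def compare(counts: Vector[Int]): Vector[(Int, Int, Double)] =
    counts.zipWithIndex.map { case (count, i) =>
      val rank = i + 1
      (rank, count, counts.head.toDouble / rank)
    }

  // The five highest counts from the sorted frequency list above.
  val topCounts: Vector[Int] = Vector(48, 44, 32, 28, 17)
}
```

In a notebook cell, `ZipfCheck.compare(counts)` would run the same comparison over the whole frequency list. For the five values used here, the observed counts sit above the idealised ones, which is not surprising for so short a text.
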
## Can you do this for the *scholia*?