## Counting and Zipf's Law

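Let's extend our counting skills to see what Zipf's Law looks like in Greek. Zipf's Law is the observation that, in a large body of text, a word's frequency is roughly inversely proportional to its rank in the frequency table: the second most common word occurs about half as often as the most common one, the third about a third as often, and so on.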

The first cell below just loads two long texts. Don't worry yet about how it works, but notice that by the end we've created two named String values, `iliad` and `scholia`.

In [None]:
import scala.io.Source
val iliadUrl = "https://raw.githubusercontent.com/neelsmith/summer2020nbs/master/data/iliad-dipl.txt"
val scholiaUrl = "https://raw.githubusercontent.com/neelsmith/summer2020nbs/master/data/scholia-dipl.txt"

// Read each text as UTF-8 so the Greek characters decode the same way on every platform
val iliad = Source.fromURL(iliadUrl, "UTF-8").mkString
val scholia = Source.fromURL(scholiaUrl, "UTF-8").mkString

Ever wonder how long the Venetus A *Iliad* and *scholia* are? Note that `size` on a String counts characters, not words.

In [None]:
iliad.size
scholia.size


## Is the *Iliad* Zipfian?

So let's count "words".  We could spend a long time deciding what a word is, but let's keep it simple today by:

1. throwing away some punctuation characters
2. then splitting the long text on whitespace
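
Before running these two steps on the whole poem, it can help to try them on a short string. The sketch below uses a made-up sample (not a line of the *Iliad*), and the names `sample`, `sampleStripped` and `sampleWords` are chosen only for this illustration.

```scala
// A short made-up sample, used only for illustration.
val sample = "alpha, beta· gamma.\nalpha: beta ~ alpha"

// Step 1: throw away the punctuation characters , · . : ~
val sampleStripped = sample.replaceAll("[,·\\.:~]", "")

// Step 2: split on runs of spaces and newlines
val sampleWords = sampleStripped.split("[ \n]+").toVector

println(sampleWords)
```

Because the split pattern ends in `+`, the two spaces left behind where `~` was removed count as a single separator, so no empty "words" appear in the result.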



In [None]:
// We'll learn more later about what's going on here: the list of
// characters inside square brackets is actually a *regular expression*
// that matches any one of the listed punctuation characters.
val stripped = iliad.replaceAll("[,·\\.:~]", "")
// The expression "[ \n]+" means "one or more occurrences of a space or newline character"
val words = stripped.split("[ \n]+").toVector

// Print first 100 words to see if they look right...
println(words.take(100))

In [None]:
val grouped = words.groupBy( w => w).toVector
val frequencies = grouped.map{ case (word, wordList) => (word, wordList.size)}
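
To see what `groupBy` and the following `map` produce, here is the same pattern on a small made-up Vector. The names `tiny`, `tinyGrouped`, `tinyFrequencies` and `tinySorted` are hypothetical and exist only for this sketch.

```scala
// Made-up word list for illustration only.
val tiny = Vector("alpha", "beta", "alpha", "gamma", "alpha", "beta")

// groupBy builds a Map from each distinct word to a Vector of all its occurrences.
val tinyGrouped = tiny.groupBy(w => w)

// Replace each Vector of occurrences with its size to get (word, count) pairs.
val tinyFrequencies = tinyGrouped.toVector.map { case (word, wordList) => (word, wordList.size) }

// Sort by count, then reverse so the most frequent word comes first.
val tinySorted = tinyFrequencies.sortBy { case (w, freq) => freq }.reverse

println(tinySorted)
```

Every count in this sample is different, so the sorted order is unambiguous. In a real text many words share the same count, and their relative order within a tie is not meaningful.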

In [None]:
val mostToFewest = frequencies.sortBy{ case (w,freq) => freq }.reverse
println(mostToFewest.take(100))
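
One rough way to check a ranked list against Zipf's Law: if frequency falls off approximately as 1/rank, then rank times frequency should stay in the same ballpark down the list. The sketch below uses a hypothetical list of invented counts (`ranked` is not real *Iliad* data). In the notebook, the same steps could be applied to `mostToFewest.take(20)` instead.

```scala
// Hypothetical (word, count) pairs with invented counts, already sorted most to fewest.
// In the notebook, substitute mostToFewest.take(20) for this Vector.
val ranked = Vector(("a", 1200), ("b", 610), ("c", 395), ("d", 305))

// zipWithIndex pairs each entry with its 0-based position; rank is position + 1.
val rankTimesFreq = ranked.zipWithIndex.map { case ((word, freq), idx) =>
  val rank = idx + 1
  (word, rank, freq, rank * freq)
}

// Print one (word, rank, frequency, rank * frequency) tuple per line.
rankTimesFreq.foreach(println)
```

With real data the products will wander more than in this tidy example. The question is whether they stay within the same order of magnitude rather than whether they match exactly.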

In [None]:
// Make plotly libraries available to this notebook:
import $ivy.`org.plotly-scala::plotly-almond:0.7.1`
// Import plotly libraries, and set display defaults suggested for use in Jupyter NBs:
import plotly._, plotly.element._, plotly.layout._, plotly.Almond._
repl.pprinter() = repl.pprinter().copy(defaultHeight = 3)

In [None]:
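// Separate the (word, count) pairs into two parallel Vectors for plotting.
// The name rankedWords avoids shadowing the earlier `words` Vector of all tokens,
// so the earlier cells still work if they are re-run.
val rankedWords = mostToFewest.map(frequency => frequency._1)
val counts = mostToFewest.map(frequency => frequency._2)

val zipf = Vector(
  Bar(x = rankedWords, y = counts)
)
plot(zipf)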

## Can you do this for the *scholia*?