# Spark LDA

An example of topic modelling a corpus of texts using Spark ML's LDA.

The first two sections capture the main decisions about how to topic model your corpus: setting key values, and downloading and cleaning up your texts.


## Settings

- `k` is the traditional name for the number of topics to find
- `iterations` is the number of cycles the LDA algorithm should run through
- `stopWords` is an Array of words to omit from the model
- `vocabSize` is the maximum number of distinct terms to consider
- `minimumTokenLength` is the length, in characters, of the shortest token to keep
- `termsToDisplay` is the number of terms to use in describing a topic
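
As a rough illustration of how token-level settings such as `stopWords` and a minimum token length (the cell below defines `minimumTokenLength`) are typically applied before a corpus reaches LDA, the sketch below filters a short token list. The sample values and the helper name `keepToken` are hypothetical, and the filtering logic is an assumption rather than this notebook's actual pipeline.

```scala
// Hypothetical sample settings, named after the settings in this notebook.
val sampleStopWords: Set[String] = Set("kai", "autou", "mentoi")
val sampleMinimumTokenLength: Int = 4

// Keep a token only if it is long enough and is not a stop word.
def keepToken(token: String): Boolean =
  token.length >= sampleMinimumTokenLength && !sampleStopWords.contains(token)

val tokens = Vector("mhnin", "kai", "autou", "aeide", "qea")
val kept = tokens.filter(keepToken)
// "kai" and "qea" are too short; "autou" is long enough but is a stop word.
```

Using a `Set` for the stop words keeps each lookup cheap, which matters once the token list is an entire corpus rather than five words.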

In [2]:
val k = 8
val iterations = 50
// Stop words, written in ASCII transliteration
val stopWords = Array(
  "de", "kai", "to", "thn", "gar", "twn", "h", "tou", "ws", "o",
  "ths", "ton", "dia", "mh", "oti", "ou", "pros", "eis", "men", "oi",
  "ouk", "en", "tous", "epi", "ta", "tw", "tois", "auton", "ei", "nun",
  "peri", "hn", "oun", "autw", "autou", "alla", "tas", "all", "esti", "estin",
  "te", "th", "touto", "tauta", "apo", "ek", "meta", "ti", "ec", "anti",
  "oude", "tines", "epei", "d", "outws", "outw", "oux", "ke", "an", "ina",
  "ai", "ot", "out", "upo", "mentoi", "tis", "pro", "ge", "t", "htoi",
  "tais", "osson", "oson", "ep", "einai", "autar", "eite", "eisin", "toutwn", "authn",
  "auto", "allw", "allois", "autos", "he", "si", "min", "moi", "ote", "oud"
)

val vocabSize = 10000
val minimumTokenLength = 4
val termsToDisplay = 15

// Cosmetic setting for table display:
val maxWidth = 1000

k: Int = 8
iterations: Int = 50
stopWords: Array[String] = Array("de", "kai", "to", "thn", "gar", "twn", "h", "tou", ...)
vocabSize: Int = 10000
minimumTokenLength: Int = 4

## Download data and clean up text


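This example uses delimited-text (CEX) data: scholia on books 9, 10, and 23 of the *Iliad*, drawn from three sources that the code below labels `venetusA`, `venetusB`, and `ups`.
We load each file as a citable corpus, combine the corpora, transliterate the Greek text to ASCII, then tidy up the data by: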

- converting all text to lower case
- removing all characters *except* alphabetic `a-z` and the space character
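
The two tidying steps listed above can be sketched with the standard library alone. The function name `tidy` and the sample string are hypothetical, and the sketch assumes the text has already been transliterated to ASCII.

```scala
// Lower-case the text, then drop every character other than a-z and space.
def tidy(text: String): String =
  text.toLowerCase.replaceAll("[^a-z ]", "")

// Hypothetical sample: transliterated text with accent and punctuation marks.
val raw = "Mh=nin a)/eide, qea/!"
val cleaned = tidy(raw)
// cleaned == "mhnin aeide qea"
```

Because spaces are preserved, the cleaned string can still be split into word tokens afterwards.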

In [3]:
val personalRepo = coursierapi.MavenRepository.of("https://dl.bintray.com/neelsmith/maven")
interp.repositories() ++= Seq(personalRepo)

personalRepo: coursierapi.MavenRepository = MavenRepository(https://dl.bintray.com/neelsmith/maven)

In [4]:
import $ivy.`edu.holycross.shot.cite::xcite:4.3.0`
import $ivy.`edu.holycross.shot::ohco2:10.20.3`
import $ivy.`edu.holycross.shot::greek:5.5.3`
import $ivy.`edu.holycross.shot.mid::orthography:2.0.0`


In [5]:
import edu.holycross.shot.cite._
import edu.holycross.shot.ohco2._
import edu.holycross.shot.greek._
import edu.holycross.shot.mid.orthography._

In [6]:
val venetusAbk9Url = "https://raw.githubusercontent.com/neelsmith/transmission-evolution/master/data/poster-data/vascholia-9.cex"
val venetusAbk10Url = "https://raw.githubusercontent.com/neelsmith/transmission-evolution/master/data/poster-data/vascholia-10.cex"
val venetusAbk23Url = "https://raw.githubusercontent.com/neelsmith/transmission-evolution/master/data/poster-data/vascholia-23.cex"
val upsbk9scholia = "https://raw.githubusercontent.com/neelsmith/transmission-evolution/master/data/poster-data/e3scholia-9.cex"
val upsbk10scholia = "https://raw.githubusercontent.com/neelsmith/transmission-evolution/master/data/poster-data/e3scholia-10.cex"
val upsbk23scholia = "https://raw.githubusercontent.com/neelsmith/transmission-evolution/master/data/poster-data/e3scholia-23.cex"
val venetusBbk9Url = "https://raw.githubusercontent.com/neelsmith/transmission-evolution/master/data/poster-data/vbscholia-9.cex"
val venetusBbk10Url = "https://raw.githubusercontent.com/neelsmith/transmission-evolution/master/data/poster-data/vbscholia-10.cex"
val venetusBbk23Url = "https://raw.githubusercontent.com/neelsmith/transmission-evolution/master/data/poster-data/vbscholia-23.cex"

In [7]:
// Create source corpora
val venetusAbk9 = CorpusSource.fromUrl(venetusAbk9Url)
val venetusAbk10 = CorpusSource.fromUrl(venetusAbk10Url)
val venetusAbk23 = CorpusSource.fromUrl(venetusAbk23Url)
val upbk9 = CorpusSource.fromUrl(upsbk9scholia)
val upbk10 = CorpusSource.fromUrl(upsbk10scholia)
val upbk23 = CorpusSource.fromUrl(upsbk23scholia)
val venetusBbk9 = CorpusSource.fromUrl(venetusBbk9Url)
val venetusBbk10 = CorpusSource.fromUrl(venetusBbk10Url)
val venetusBbk23 = CorpusSource.fromUrl(venetusBbk23Url)

venetusAbk9: Corpus = Corpus(
  Vector(
    CitableNode(
      CtsUrn("urn:cts:greekLit:tlg5026.msA.hmt:9.1.comment"),
      "\u1f29 \u0399 \u03c4\u1fc6\u03c2 \u1f39\u03bb\u03b9\u1f71\u03b4\u03bf\u03c2 \u1f10\u03c0\u03b9\u03b3\u03c1\u1f71\u03c6\u03b5\u03c4\u03b1\u03b9 \u039b\u03b9\u03c4\u03b7 \u1f45\u03c4\u03b9 \u1f08\u03b3\u03b1\u03bc\u1f73\u03bc\u03bd\u03c9\u03bd \u039d\u1f73\u03c3\u03c4\u03bf\u03c1\u03bf\u03c2 \u03c3\u03c5\u03bc\u03b2\u03bf\u03c5\u03bb\u03b5\u1f7b\u03c3\u03b1\u03bd\u03c4\u03bf\u03c2 \u03c0\u03c1\u03bf\u03c2 \u03c4\u1f78\u03bd \u1f08\u03c7\u03b9\u03bb\u03bb\u1f73\u03b1 \u1f04\u03bd\u03b4\u03c1\u03b1\u03c2 \u03bb\u03af\u03c3\u03c3\u03b5\u03c3\u03b8\u03b1\u03b9 \u1f14\u03c0\u03b5\u03bc\u03c8\u03b5\u03bd \u1f00\u03c1\u03af\u03c3\u03c4\u03bf\u03c5\u03c2 . \u03a6\u03bf\u03af\u03bd\u03b9\u03ba\u03b1 . \u1f48\u03b4\u03c5\u03c3\u03c3\u1f73\u03b1 . \u0391\u1f30\u1f71\u03bd\u03c4  \u2051"
    ),
...

In [12]:
val scholia = venetusAbk9 ++ venetusAbk10 ++ venetusAbk23 ++ upbk9 ++ upbk10 ++ upbk23 ++ venetusBbk9 ++ venetusBbk10 ++ venetusBbk23

In [13]:
val scholiaAscii = scholia.nodes.map(n => CitableNode(n.urn, LiteraryGreekString(n.text).ascii))

2020-07-29 21:07:32.679Z  warn [CodePointTranscoder] CodePointTranscoder: no character matching unicode ⁑  - (CodePointTranscoder.scala:55)
2020-07-29 21:07:38.312Z  warn [CodePointTranscoder] CodePointTranscoder: no character matching ascii τ  - (CodePointTranscoder.scala:37)
...
[34m2020-07-29 21:07:46.168Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ⁑[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:46.173Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ⁑[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:46.200Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ⁑[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:46.204Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ⁑[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:46.216Z[0m  [33mwarn[0m [[37mCodePointTranscoder

[34m2020-07-29 21:07:47.125Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:47.168Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:47.173Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:47.185Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:47.190Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:47.194Z[0m  [33mwarn[0m [[37mCodePointTranscoder

[34m2020-07-29 21:07:47.729Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:47.759Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:47.764Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:47.768Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:47.776Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:47.787Z[0m  [33mwarn[0m [[37mCodePointTranscoder

[34m2020-07-29 21:07:48.661Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ’[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:48.727Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode к[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:48.728Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode т[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:48.737Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:48.761Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:48.767Z[0m  [33mwarn[0m [[37mCodePointTranscoder

[34m2020-07-29 21:07:49.171Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:49.173Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:49.173Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ⁑[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:49.178Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:49.179Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:49.180Z[0m  [33mwarn[0m [[37mCodePointTranscoder

[34m2020-07-29 21:07:49.407Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:49.409Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:49.411Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:49.413Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:49.460Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:49.464Z[0m  [33mwarn[0m [[37mCodePointTranscoder

[34m2020-07-29 21:07:49.759Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ⁑[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:49.765Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:49.767Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:49.769Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:49.771Z[0m  [33mwarn[0m [[37mCodePointTranscoder[0m] [33mCodePointTranscoder: no character matching unicode ~[0m  [34m- (CodePointTranscoder.scala:55)[0m
[34m2020-07-29 21:07:49.772Z[0m  [33mwarn[0m [[37mCodePointTranscoder

: 

In [11]:
scholiaAscii.nodes.map( n => CitableNode( n.urn, LiteraryGreekString(n.text.replaceAll("⁑","" )).ascii))



2020-07-29 21:05:39.397Z  warn [CodePointTranscoder] CodePointTranscoder: no character matching unicode ‡ - (CodePointTranscoder.scala:55)
2020-07-29 21:05:39.616Z  warn [CodePointTranscoder] CodePointTranscoder: no character matching unicode ̈ - (CodePointTranscoder.scala:55)
2020-07-29 21:05:45.537Z  warn [CodePointTranscoder] CodePointTranscoder: no character matching unicode · - (CodePointTranscoder.scala:55)
2020-07-29 21:05:48.258Z  warn [CodePointTranscoder] CodePointTranscoder: no character matching ascii α - (CodePointTranscoder.scala:37)
2020-07-29 21:05:50.970Z  warn [CodePointTranscoder] CodePointTranscoder: no character matching unicode ~ - (CodePointTranscoder.scala:55)
2020-07-29 21:05:52.605Z  warn [CodePointTranscoder] CodePointTranscoder: no character matching unicode ’ - (CodePointTranscoder.scala:55)
...

res10: Vector[CitableNode] = Vector(
  CitableNode(
    CtsUrn("urn:cts:greekLit:tlg5026.msA.hmt:9.1.comment"),
    "*h( *i th=s *i(lia/dos e)pigra/fetai *lith o(/ti *a)game/mnwn *ne/storos sumbouleu/santos pros to\\n *a)xille/a a)/ndras li/ssesqai e)/pemyen a)ri/stous . *foi/nika . *o)dusse/a . *ai)a/nt  "
  ),
  CitableNode(
    CtsUrn("urn:cts:greekLit:tlg5026.msA.hmt:9.2.lemma"),
    "w(\\s oi( me\\n *trw=es fulaka\\s e)/xon :"
  ),
  CitableNode(
    CtsUrn("urn:cts:greekLit:tlg5026.msA.hmt:9.2.comment"),
    "kalw=s ei)=pen : oi( me\\n ga\\r *trw=es ta\\s fulaka\\s ei)=xen i(/na mh\\ fu/gwsin oi( *h(ellh/nes dia nukto/s , oi( de\\ *e(/llenes ei)/xonto u(po tou= de/ous "
  ),
  CitableNode(
    CtsUrn("urn:cts:greekLit:tlg5026.msA.hmt:9.3.lemma"),
    "qespesi/h e)/xe fu/za :"
  ),
  

In [14]:
scholiaAscii.size


res13: Int = 2247
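
The transcoder warnings in the earlier cell outputs name several characters that have no mapping, and the cell above removes one of them (`⁑`) with `replaceAll` before transcoding. As a rough sketch of how the same idea could be extended to a whole set of characters in one pass, the plain-Scala helper below may be useful. The name `stripUnmapped` and the `unmapped` set are hypothetical and not part of the notebook's libraries; which characters are safe to drop is a judgment call for your corpus (for example, `·` may be meaningful punctuation).

```scala
// Hypothetical helper: NOT part of the notebook's libraries.
// Characters taken from the "no character matching unicode ..." warnings.
// Adjust this set to suit your own corpus.
val unmapped: Set[Char] = Set('⁑', '‡', '·', '~', '’')

// Remove every character in `drop` from `text`, leaving everything else untouched.
def stripUnmapped(text: String, drop: Set[Char] = unmapped): String =
  text.filterNot(drop.contains)

// Example: clean a string before handing it to a transcoder.
val cleaned = stripUnmapped("kalw=s ⁑ ei)=pen ‡")
```

Under these assumptions, the cleaned text could be passed where `n.text.replaceAll("⁑","" )` is used in the cell above, so that fewer warnings are logged.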

## Set up a Spark notebook session

Import libraries, configure debugging, and start up a local Spark notebook session. These four cells fall into the category of "stuff you copy and paste in to set up a Jupyter notebook with Spark and don't think about too much."

In [15]:
import $ivy.`org.apache.spark::spark-sql:2.4.5` // Or use any other 2.x version here
import org.apache.spark.sql._
import $ivy.`org.apache.spark::spark-mllib:2.4.5`


Downloading https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.7.25/slf4j-api-1.7.25-sources.jar
Downloaded https://repo1.maven.org/maven2/com/google/code/gson/gson/2.2.4/gson-2.2.4-sources.jar
Downloading https://repo1.maven.org/maven2/xmlenc/xmlenc/0.52/xmlenc-0.52-sources.jar
Downloaded https://repo1.maven.org/maven2/org/apache/httpcomponents/httpcore/4.2.4/httpcore-4.2.4-sources.jar
Downloading https://repo1.maven.o

Downloading https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client/2.6.5/hadoop-client-2.6.5-sources.jar
Downloaded https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-client-core/2.6.5/hadoop-mapreduce-client-core-2.6.5-sources.jar
Downloading https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-yarn-server-nodemanager/2.6.5/hadoop-yarn-server-nodemanager-2.6.5-sources.jar
Downloaded https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-yarn-server-common/2.6.5/hadoop-yarn-server-common-2.6.5-sources.jar
Downloading https://repo1.maven.org/maven2/org/fusesource/leveldbjni/leveldbjni-all/1.8/leveldbjni-all-1.8-sources.jar
Downloaded https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-client-shuffle/2.6.5/hadoop-mapreduce-client-shuffle-2.6.5-sources.jar
Downloading https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-client-app/2.6.5/hadoop-mapreduce-client-app-2.6.5-sources.jar
Downloaded https://repo1.maven.org/maven2/org/apache/h

Downloading https://repo1.maven.org/maven2/org/json4s/json4s-scalap_2.12/3.5.3/json4s-scalap_2.12-3.5.3-sources.jar
Downloaded https://repo1.maven.org/maven2/org/json4s/json4s-scalap_2.12/3.5.3/json4s-scalap_2.12-3.5.3-sources.jar
Downloading https://repo1.maven.org/maven2/org/glassfish/hk2/hk2-api/2.4.0-b34/hk2-api-2.4.0-b34-sources.jar
Downloaded https://repo1.maven.org/maven2/org/roaringbitmap/RoaringBitmap/0.7.45/RoaringBitmap-0.7.45-sources.jar
Downloading https://repo1.maven.org/maven2/org/glassfish/hk2/hk2-utils/2.4.0-b34/hk2-utils-2.4.0-b34-sources.jar
Downloaded https://repo1.maven.org/maven2/org/javassist/javassist/3.18.1-GA/javassist-3.18.1-GA-sources.jar
Downloading https://repo1.maven.org/maven2/org/json4s/json4s-ast_2.12/3.5.3/json4s-ast_2.12-3.5.3-sources.jar
Downloaded https://repo1.maven.org/maven2/com/google/inject/guice/3.0/guice-3.0-sources.jar
Downloading https://repo1.maven.org/maven2/com/clearspring/analytics/stream/2.7.0/stream-2.7.0-sources.jar
Downloaded https

Downloading https://repo1.maven.org/maven2/org/codehaus/janino/commons-compiler/3.0.9/commons-compiler-3.0.9-sources.jar
Downloaded https://repo1.maven.org/maven2/org/apache/parquet/parquet-format/2.4.0/parquet-format-2.4.0-sources.jar
Downloading https://repo1.maven.org/maven2/org/glassfish/hk2/external/aopalliance-repackaged/2.4.0-b34/aopalliance-repackaged-2.4.0-b34-sources.jar
Downloaded https://repo1.maven.org/maven2/org/glassfish/hk2/external/aopalliance-repackaged/2.4.0-b34/aopalliance-repackaged-2.4.0-b34-sources.jar
Downloading https://repo1.maven.org/maven2/org/apache/parquet/parquet-encoding/1.10.1/parquet-encoding-1.10.1-sources.jar
Downloaded https://repo1.maven.org/maven2/io/dropwizard/metrics/metrics-json/3.1.5/metrics-json-3.1.5-sources.jar
Downloading https://repo1.maven.org/maven2/com/vlkan/flatbuffers/1.2.0-3f79e055/flatbuffers-1.2.0-3f79e055-sources.jar
Downloaded https://repo1.maven.org/maven2/org/codehaus/janino/commons-compiler/3.0.9/commons-compiler-3.0.9-source

Downloading https://repo1.maven.org/maven2/com/google/code/findbugs/jsr305/1.3.9/jsr305-1.3.9.pom
Downloading https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1.pom
Downloading https://repo1.maven.org/maven2/net/sf/opencsv/opencsv/2.3/opencsv-2.3.pom
Downloading https://repo1.maven.org/maven2/org/scalanlp/breeze-macros_2.12/0.13.2/breeze-macros_2.12-0.13.2.pom
Downloaded https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.7.16/slf4j-api-1.7.16.pom
Downloading https://repo1.maven.org/maven2/com/github/fommil/netlib/core/1.1.2/core-1.1.2.pom
Downloaded https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1.pom
Downloaded https://repo1.maven.org/maven2/org/scalanlp/breeze-macros_2.12/0.13.2/breeze-macros_2.12-0.13.2.pom
Downloaded https://repo1.maven.org/maven2/com/google/code/findbugs/jsr305/1.3.9/jsr305-1.3.9.pom
Downloading https://repo1.maven.org/maven2/com/github/rwl/jtransforms/2.4.0/jtransf

[32mimport [39m[36m$ivy.$                                   // Or use any other 2.x version here
[39m
[32mimport [39m[36morg.apache.spark.sql._
[39m
[32mimport [39m[36m$ivy.$                                    
[39m

In [16]:
import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF)

import org.apache.log4j.{Level, Logger}

In [17]:
val spark = {
  NotebookSparkSession.builder()
    .master("local[*]")
    .getOrCreate()
}

Loading spark-stubs
Getting spark JARs
Creating SparkSession

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

spark: SparkSession = org.apache.spark.sql.SparkSession@724c4429

## Topic modelling with Spark LDA

After importing a small mountain of Spark libraries, the following cells go through the basic steps of topic modelling:

1. Create a text corpus
2. Tokenize
3. Filter stop words
4. Count word occurrences for each text
5. Create the LDA model by "fitting" it to our data
6. Apply the model to compute the topics and their distribution in each document of our corpus


In [18]:
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.mllib.linalg.Vector
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.functions._

import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.mllib.linalg.Vector
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.functions._

### 1. Create `DataFrame` with text corpus

Getting your clean text into a Spark `DataFrame` is an awkward, two-step process: first build an RDD that pairs each text with an index, then convert that RDD to a `DataFrame`. (This should be simpler in future versions of Spark.)

The important output is `corpus_df`, a `DataFrame` with one row for every text.


In [19]:
// Create RDD:
val scholiaText = scholiaAscii.nodes.map(n => n.text)
val txtRdd = spark.sparkContext.parallelize(scholiaText).zipWithIndex

scholiaText: collection.immutable.Vector[String] = Vector(
  "\u1f29 \u0399 \u03c4\u1fc6\u03c2 \u1f39\u03bb\u03b9\u1f71\u03b4\u03bf\u03c2 \u1f10\u03c0\u03b9\u03b3\u03c1\u1f71\u03c6\u03b5\u03c4\u03b1\u03b9 \u039b\u03b9\u03c4\u03b7 \u1f45\u03c4\u03b9 \u1f08\u03b3\u03b1\u03bc\u1f73\u03bc\u03bd\u03c9\u03bd \u039d\u1f73\u03c3\u03c4\u03bf\u03c1\u03bf\u03c2 \u03c3\u03c5\u03bc\u03b2\u03bf\u03c5\u03bb\u03b5\u1f7b\u03c3\u03b1\u03bd\u03c4\u03bf\u03c2 \u03c0\u03c1\u03bf\u03c2 \u03c4\u1f78\u03bd \u1f08\u03c7\u03b9\u03bb\u03bb\u1f73\u03b1 \u1f04\u03bd\u03b4\u03c1\u03b1\u03c2 \u03bb\u03af\u03c3\u03c3\u03b5\u03c3\u03b8\u03b1\u03b9 \u1f14\u03c0\u03b5\u03bc\u03c8\u03b5\u03bd \u1f00\u03c1\u03af\u03c3\u03c4\u03bf\u03c5\u03c2 . \u03a6\u03bf\u03af\u03bd\u03b9\u03ba\u03b1 . \u1f48\u03b4\u03c5\u03c3\u03c3\u1f73\u03b1 . \u0391\u1f30\u1f71\u03bd\u03c4  \u2051",
  "\u1f63\u03c2 \u03bf\u1f31 \u03bc\u1f72\u03bd \u03a4\u03c1\u1ff6\u03b5\u03c

In [20]:
// Import implicits *after* creation of context.
import spark.sqlContext.implicits._

val corpus_df = txtRdd.toDF("corpus", "id")

import spark.sqlContext.implicits._

corpus_df: DataFrame = [corpus: string, id: bigint]

While we're at it, we can paste in this handy snippet, which defines a function that beautifies the display of Spark `DataFrame`s in HTML. (We'll use the `showHTML` function later.)

In [21]:
// based on a snippet by Ivan Zaitsev
// https://github.com/almond-sh/almond/issues/180#issuecomment-364711999
implicit class RichDF(val df: DataFrame) {
  def showHTML(limit:Int = 20, truncate: Int = 20) = {
    import xml.Utility.escape
    val data = df.take(limit)
    val header = df.schema.fieldNames.toSeq
    val rows: Seq[Seq[String]] = data.map { row =>
      row.toSeq.map { cell =>
        val str = cell match {
          case null => "null"
          case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")
          case array: Array[_] => array.mkString("[", ", ", "]")
          case seq: Seq[_] => seq.mkString("[", ", ", "]")
          case _ => cell.toString
        }
        if (truncate > 0 && str.length > truncate) {
          // do not show ellipses for strings shorter than 4 characters.
          if (truncate < 4) str.substring(0, truncate)
          else str.substring(0, truncate - 3) + "..."
        } else {
          str
        }
      }: Seq[String]
    }

    publish.html(s"""
      <table class="table">
        <tr>
        ${header.map(h => s"<th>${escape(h)}</th>").mkString}
        </tr>
        ${rows.map { row =>
          s"<tr>${row.map { c => s"<td>${escape(c)}</td>" }.mkString}</tr>"
        }.mkString
        }
      </table>""")
  }
}

defined class RichDF

### 2. Tokenize
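
The cell below splits each text into tokens with Spark's `RegexTokenizer`. As a rough, Spark-free picture of what its settings mean, here is a plain-Scala sketch. It assumes the tokenizer's default behaviour of lower-casing the input and treating the pattern as the gap between tokens; `tokenizeSketch` is an illustrative name, and the sample is the start of the ASCII scholion displayed later in this notebook.

```scala
// Plain-Scala approximation of the RegexTokenizer settings used in this notebook
// (assumes the defaults: lower-case first, pattern matches the gaps between tokens).
def tokenizeSketch(text: String, minTokenLength: Int): Seq[String] =
  text.toLowerCase
    .split("[\\W_]+")                    // split on runs of non-word characters or underscores
    .toSeq
    .filter(_.length >= minTokenLength)  // drop tokens shorter than the minimum

val sample = "h eikotws tauta poiei tois khruci pros to duswpein"
val sampleTokens = tokenizeSketch(sample, 4)
// sampleTokens: eikotws, tauta, poiei, tois, khruci, pros, duswpein
```

With `minimumTokenLength = 4`, very short words such as `h` and `to` are already dropped at this stage, before any stop-word filtering.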

In [22]:
val tokenizer = new RegexTokenizer()
  .setPattern("[\\W_]+")                  // split on runs of non-word characters and underscores
  .setMinTokenLength(minimumTokenLength)  // drop tokens shorter than this
  .setInputCol("corpus")
  .setOutputCol("tokens")
val tokenized_df = tokenizer.transform(corpus_df)

tokenizer: RegexTokenizer = regexTok_c9d707217415
tokenized_df: DataFrame = [corpus: string, id: bigint ... 1 more field]

### 3. Filter out stop words

We'll want to think about a serious stop-word list at some point, but here's the technique.
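
As a Spark-free illustration of the filtering step, the sketch below keeps every token that is not in the stop list. It assumes case-insensitive matching, which is the default for `StopWordsRemover`; `removeStopWords`, `someStopWords` and the token list are illustrative names and data, with the three stop words taken from the `stopWords` array in the settings.

```scala
// Plain-Scala approximation of stop-word removal: keep tokens not in the stop list.
// Matching is case-insensitive here, mirroring Spark's default (caseSensitive = false).
def removeStopWords(tokens: Seq[String], stopList: Set[String]): Seq[String] = {
  val lowered = stopList.map(_.toLowerCase)
  tokens.filterNot(t => lowered.contains(t.toLowerCase))
}

val someStopWords = Set("tois", "pros", "tauta")  // three entries from the stopWords array
val tokens = Seq("eikotws", "tauta", "poiei", "tois", "khruci", "pros", "duswpein")
val kept = removeStopWords(tokens, someStopWords)
// kept: eikotws, poiei, khruci, duswpein
```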

In [23]:
val remover = new StopWordsRemover()
  .setStopWords(stopWords)
  .setInputCol("tokens")
  .setOutputCol("filtered")
val filtered_df = remover.transform(tokenized_df)

remover: StopWordsRemover = stopWords_19991768bb25
filtered_df: DataFrame = [corpus: string, id: bigint ... 2 more fields]

### 4. Compute counts of each token for each text
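
The cell below uses Spark's `CountVectorizer` for this step. As a minimal sketch of the idea, assuming the vocabulary is ordered by corpus-wide term frequency and restricted by `vocabSize` and `minDF`, the plain-Scala code here builds a vocabulary and then counts terms per document. The function names and toy documents are hypothetical, and the alphabetical tie-break is an assumption made only to keep the example deterministic.

```scala
// Sketch of count vectorization: build a vocabulary, then count terms per document.
def fitVocabulary(docs: Seq[Seq[String]], vocabSize: Int, minDF: Int): Vector[String] = {
  // Corpus-wide term frequency.
  val termFreq: Map[String, Int] =
    docs.flatten.groupBy(identity).map { case (term, occ) => term -> occ.size }
  // Document frequency: number of documents containing each term.
  val docFreq: Map[String, Int] =
    docs.flatMap(_.distinct).groupBy(identity).map { case (term, occ) => term -> occ.size }
  termFreq.toVector
    .filter { case (term, _) => docFreq(term) >= minDF }
    .sortBy { case (term, n) => (-n, term) }  // most frequent first; alphabetical tie-break is an assumption
    .take(vocabSize)
    .map(_._1)
}

// Sparse counts for one document: vocabulary index -> number of occurrences.
def countVector(doc: Seq[String], vocab: Vector[String]): Map[Int, Int] = {
  val index = vocab.zipWithIndex.toMap
  doc.flatMap(term => index.get(term)).groupBy(identity).map { case (i, occ) => i -> occ.size }
}

val toyDocs = Seq(
  Seq("para", "kata", "para"),
  Seq("kata", "zeus"),
  Seq("para", "kata")
)
val toyVocab = fitVocabulary(toyDocs, vocabSize = 10, minDF = 2)  // Vector("kata", "para")
val firstCounts = countVector(toyDocs.head, toyVocab)            // Map(0 -> 1, 1 -> 2)
```

Here `zeus` is left out of the vocabulary because it occurs in only one document, which is the effect `setMinDF` has on rare terms.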


In [24]:
val vectorizer = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("features")
  .setVocabSize(vocabSize)  // keep at most this many terms
  .setMinDF(5)              // ignore terms that appear in fewer than 5 documents
  .fit(filtered_df)
val countVectors = vectorizer.transform(filtered_df).select("id", "features")

### 5. Create ("fit") LDA model

In [21]:
val lda = new LDA().setK(k).setMaxIter(iterations)
val model = lda.fit(countVectors)

20/07/29 18:47:23 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
20/07/29 18:47:23 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS

lda: LDA = lda_21e790e5eb14
model: org.apache.spark.ml.clustering.LDAModel = lda_21e790e5eb14

### 6. Compute topics and their distribution in each document

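Each topic is a set of terms with corresponding weights. `describeTopics` returns the `termsToDisplay` highest-weighted terms for each topic, identified by their index in the vectorizer's vocabulary.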


In [22]:
val topics = model.describeTopics(termsToDisplay)

topics: DataFrame = [topic: int, termIndices: array<int> ... 1 more field]

In [23]:
topics.showHTML(truncate=1000)

topic,termIndices,termWeights
0,"[2, 4, 3, 0, 5, 7, 1, 11, 16, 20, 19, 10, 15, 17, 37]","[0.01986889626600968, 0.01932522649981618, 0.019323379541894697, 0.018745833002878125, 0.013076311991565037, 0.012980631820756445, 0.008930328573828014, 0.007860709814748767, 0.0067491893833538435, 0.006498753243127346, 0.006286027639712898, 0.0057437515600670935, 0.005335177991950501, 0.0049954086050735666, 0.004758706842736377]"
1,"[211, 672, 263, 493, 50, 173, 1027, 626, 420, 1497, 1423, 1294, 835, 2646, 631]","[0.013414109308485404, 0.007941193880247291, 0.0065626891318716915, 0.006007845709038321, 0.005487971652209451, 0.005179925976446777, 0.004980840957096819, 0.00493300458474624, 0.004895292640291317, 0.004797147236058964, 0.004772138882516078, 0.004616209217937882, 0.004503089364053169, 0.004437047663657107, 0.004241483923575095]"
2,"[214, 518, 223, 87, 423, 621, 138, 194, 744, 133, 111, 395, 75, 1009, 942]","[0.00915490718650201, 0.008856476283498154, 0.008712854152544323, 0.007840280736737332, 0.006863206400065984, 0.006107387760620054, 0.0058826763674664955, 0.005739410123345928, 0.005728924330512992, 0.005436379905677169, 0.005430206109917215, 0.0053863326145196355, 0.005342586322654278, 0.005258679134534685, 0.005184976362331733]"
3,"[1, 6, 0, 9, 12, 18, 14, 10, 71, 68, 91, 63, 47, 56, 76]","[0.026150227855462265, 0.022135753761362748, 0.01819767961914994, 0.014115597379082141, 0.009961834117858122, 0.009196594203493261, 0.007255978473848224, 0.005394743133536403, 0.0049112484585441835, 0.004709830270836235, 0.004311627532390238, 0.004176979892500168, 0.004130215089818105, 0.003935797680399047, 0.003913904824382821]"
4,"[8, 21, 189, 182, 343, 318, 107, 212, 351, 407, 1196, 914, 761, 302, 50]","[0.09003188812965747, 0.03364120209066239, 0.01456984641737162, 0.009830547729720914, 0.009556463503811188, 0.009317827350900714, 0.006271738380387499, 0.006112270991840083, 0.005904507996195954, 0.004638306259383047, 0.004177922593453237, 0.004073193233113424, 0.004069608961355008, 0.003990981355650367, 0.003909623139734178]"
5,"[13, 40, 25, 22, 5, 96, 108, 143, 30, 60, 149, 0, 173, 66, 103]","[0.02534459683987993, 0.011355902526753725, 0.009132024060401795, 0.007595144968968877, 0.007517514349217672, 0.007253415120621234, 0.00644957261150314, 0.005864512897930556, 0.005845395831852146, 0.0053342732689725175, 0.005226404897535996, 0.004778899656773448, 0.004769607450541042, 0.0046574241865424155, 0.004427850028516633]"
6,"[51, 26, 1, 67, 79, 153, 183, 209, 29, 477, 140, 162, 174, 128, 23]","[0.011169682669454468, 0.010669247735154907, 0.007576665848133972, 0.007306281069739649, 0.006895240906197557, 0.006330255089279749, 0.006231158335656667, 0.004813236572446734, 0.004554480537488906, 0.004458653735388815, 0.004435406248664073, 0.004383993689298136, 0.0042354616586295765, 0.003984022339646123, 0.0038004802113434373]"
7,"[121, 573, 832, 1312, 1063, 517, 1356, 443, 1466, 1857, 1549, 1227, 778, 1849, 1539]","[0.015450685350239243, 0.0091268365300425, 0.007072391793670206, 0.006137663673153784, 0.0058289053901272505, 0.005648822551735348, 0.005469017727016834, 0.004897992949877147, 0.004623272287933023, 0.004577739230994999, 0.0043240662827098075, 0.003941706281900579, 0.0038225120422273164, 0.003797338355977286, 0.0035304234152881693]"


### 7. Label topics

For human readers, we'll replace index numbers for each term with the actual term.

1. Create a new DataFrame with ordered lists of terms by looking up the term for each term index.
2. Number the rows of this DataFrame so we can join it with the existing topic data.
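
The two steps above can be pictured with ordinary Scala collections before doing them in Spark. In this sketch, `vocabularySlice` holds the first eight vocabulary entries as they can be inferred by comparing the `termIndices` output with the labelled output in this notebook; the index lists are hypothetical, shortened stand-ins for the `termIndices` column, and all names are illustrative.

```scala
// First eight vocabulary entries (index 0 = kata, 1 = para, 2 = greeklit, ...),
// inferred from the notebook's own output.
val vocabularySlice =
  Vector("kata", "para", "greeklit", "tlg001", "tlg0012", "legei", "aristarxos", "echs")

// Hypothetical, shortened index lists standing in for the termIndices column.
val indexLists = Seq(Seq(2, 4, 3, 0), Seq(1, 6, 0))

// Step 1: look up the term for each index.
val termLists: Seq[Seq[String]] = indexLists.map(_.map(i => vocabularySlice(i)))

// Step 2: number the rows so they can be joined back to the topic data.
val numbered: Seq[(Int, Seq[String])] =
  termLists.zipWithIndex.map { case (terms, n) => (n, terms) }
// numbered(0) is (0, Seq(greeklit, tlg0012, tlg001, kata))
```

Numbering rows by position only matches the topic ids if the rows arrive in topic order, which this sketch, like the cell below, takes for granted.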

In [24]:
// Look up the term for each term index in the vectorizer's vocabulary.
val topicLabels = topics.select("termIndices").map {
  case Row(r: WrappedArray[Integer]) => r.map(i => vectorizer.vocabulary(i))
}
// Number the rows; this assumes they arrive in topic order.
val labelsNumberedLong = topicLabels.rdd.zipWithIndex.toDF("terms", "topicLong")
val labelsIndexed = labelsNumberedLong.withColumn("topic", $"topicLong".cast(IntegerType)).drop("topicLong")

// Join the labels to the topic data on the topic number.
val topicsWithTerms = labelsIndexed.join(topics, labelsIndexed.col("topic") === topics.col("topic")).drop(labelsIndexed.col("topic"))

topicLabels: Dataset[WrappedArray[String]] = [value: array<string>]
labelsNumberedLong: DataFrame = [terms: array<string>, topicLong: bigint]
labelsIndexed: DataFrame = [terms: array<string>, topic: int]
topicsWithTerms: DataFrame = [terms: array<string>, topic: int ... 2 more fields]

In [25]:
val weightedLabels = topicsWithTerms.withColumn(
  "termsWithWeight",
  expr("zip_with(terms, termWeights, (t,w) -> concat(t, ' ', w))")
)

weightedLabels: DataFrame = [terms: array<string>, topic: int ... 3 more fields]

In [26]:
// Flat view
weightedLabels.select("topic", "termsWithWeight").showHTML(truncate=1000)



topic,termsWithWeight
0,"[greeklit 0.01986889626600968, tlg0012 0.01932522649981618, tlg001 0.019323379541894697, kata 0.018745833002878125, legei 0.013076311991565037, echs 0.012980631820756445, para 0.008930328573828014, fhsin 0.007860709814748767, diastalteon 0.0067491893833538435, dios 0.006498753243127346, braxu 0.006286027639712898, fhsi 0.0057437515600670935, oion 0.005335177991950501, dunatai 0.0049954086050735666, oute 0.004758706842736377]"
1,"[oros 0.013414109308485404, egxei 0.007941193880247291, monos 0.0065626891318716915, upomnhmatwn 0.006007845709038321, outos 0.005487971652209451, tisi 0.005179925976446777, olumpos 0.004980840957096819, pezos 0.00493300458474624, xeiras 0.004895292640291317, cite2 0.004797147236058964, dingbats 0.004772138882516078, anwgei 0.004616209217937882, enia 0.004503089364053169, amunai 0.004437047663657107, armata 0.004241483923575095]"
2,"[safws 0.00915490718650201, trwessi 0.008856476283498154, xarin 0.008712854152544323, aqetountai 0.007840280736737332, aspida 0.006863206400065984, ektori 0.006107387760620054, stixoi 0.0058826763674664955, doru 0.005739410123345928, eqnos 0.005728924330512992, treis 0.005436379905677169, omoion 0.005430206109917215, taxews 0.0053863326145196355, axillews 0.005342586322654278, acia 0.005258679134534685, logoi 0.005184976362331733]"
3,"[para 0.026150227855462265, aristarxos 0.022135753761362748, kata 0.01819767961914994, zhnodotos 0.014115597379082141, grafei 0.009961834117858122, omoiws 0.009196594203493261, exei 0.007255978473848224, fhsi 0.005394743133536403, tonon 0.0049112484585441835, askalwniths 0.004709830270836235, telous 0.004311627532390238, enqade 0.004176979892500168, oper 0.004130215089818105, egeneto 0.003935797680399047, onoma 0.003913904824382821]"
4,"[aristarx 0.09003188812965747, alloi 0.03364120209066239, arist 0.01456984641737162, eixon 0.009830547729720914, akws 0.009556463503811188, aristofanous 0.009317827350900714, pantes 0.006271738380387499, shmeiountai 0.006112270991840083, pasai 0.005904507996195954, oios 0.004638306259383047, ponton 0.004177922593453237, kuna 0.004073193233113424, agan 0.004069608961355008, axaioi 0.003990981355650367, outos 0.003909623139734178]"
5,"[grafetai 0.02534459683987993, trwwn 0.011355902526753725, diplh 0.009132024060401795, palin 0.007595144968968877, legei 0.007517514349217672, ektwr 0.007253415120621234, amfi 0.00644957261150314, apac 0.005864512897930556, aristarxou 0.005845395831852146, pote 0.0053342732689725175, ippwn 0.005226404897535996, kata 0.004778899656773448, tisi 0.004769607450541042, polla 0.0046574241865424155, entauqa 0.004427850028516633]"
6,"[prwton 0.011169682669454468, zeus 0.010669247735154907, para 0.007576665848133972, arxhs 0.007306281069739649, enqa 0.006895240906197557, polemou 0.006330255089279749, allhs 0.006231158335656667, trwas 0.004813236572446734, toutou 0.004554480537488906, pasan 0.004458653735388815, upostikteon 0.004435406248664073, perispasteon 0.004383993689298136, perissos 0.0042354616586295765, allwn 0.003984022339646123, logos 0.0038004802113434373]"
7,"[teixos 0.015450685350239243, skhptron 0.0091268365300425, xrushn 0.007072391793670206, agorh 0.006137663673153784, prwi 0.0058289053901272505, tode 0.005648822551735348, poseidawn 0.005469017727016834, ariston 0.004897992949877147, anastrefetai 0.004623272287933023, belewn 0.004577739230994999, taxista 0.0043240662827098075, esqlon 0.003941706281900579, kunes 0.0038225120422273164, upnou 0.003797338355977286, nukta 0.0035304234152881693]"
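
Each entry in `termsWithWeight` comes from the `zip_with` expression, which walks the `terms` and `termWeights` arrays in parallel and joins each pair into one string. A plain-Scala counterpart might look like the sketch below; the weights are rounded, illustrative values rather than the model's exact output.

```scala
// Plain-Scala counterpart of zip_with(terms, termWeights, (t, w) -> concat(t, ' ', w)).
val terms = Seq("greeklit", "tlg0012", "kata")
val weights = Seq(0.02, 0.019, 0.018)  // rounded, illustrative weights
val termsWithWeight = terms.zip(weights).map { case (t, w) => s"$t $w" }
// termsWithWeight.head is "greeklit 0.02"
```

One difference to keep in mind: Scala's `zip` stops at the shorter list, whereas Spark's `zip_with` pads the shorter array with nulls. Here both arrays hold `termsToDisplay` entries, so the distinction does not come into play.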


Here's the same information, but displayed one term at a time:

In [27]:
// Exploded view
val explodedTerms = weightedLabels.select(col("*"),explode(col("termsWithWeight"))).select("topic","col")

explodedTerms.showHTML(explodedTerms.count.toInt, 1000)

topic,col
0,greeklit 0.01986889626600968
0,tlg0012 0.01932522649981618
0,tlg001 0.019323379541894697
0,kata 0.018745833002878125
0,legei 0.013076311991565037
0,echs 0.012980631820756445
0,para 0.008930328573828014
0,fhsin 0.007860709814748767
0,diastalteon 0.0067491893833538435
0,dios 0.006498753243127346

explodedTerms: DataFrame = [topic: int, col: string]

### 8. Compute distribution of topics per document


To apply this topic model to a specific document or set of documents, we can compute the weight of each topic in each document.

In [28]:
val transformed = model.transform(countVectors)
transformed.printSchema // show(false)



root
 |-- id: long (nullable = false)
 |-- features: vector (nullable = true)
 |-- topicDistribution: vector (nullable = true)

transformed: DataFrame = [id: bigint, features: vector ... 1 more field]

Here are the weightings for the first ten documents:

In [29]:
val documentsToShow = 10
transformed.showHTML(documentsToShow, 1000)

id,features,topicDistribution
0,"(3383,[20,35,40,80,89,97,128,132,188,200,232,261,402,674,768,920,1030,1127,1129,1226,1525,1534,1939,1947,2172,2348,2385,2455],[2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])","[0.4848243899854027,0.003253415905908029,0.003349604114554357,0.004532637197832403,0.003489071342027341,0.2663936934412065,0.230936699589109,0.00322048842395955]"
1,"(3383,[188,438,460,2401,2973],[1.0,1.0,1.0,1.0,1.0])","[0.028146261339950303,0.01640553873108122,0.016891458135848986,0.022846722893950008,0.017596345271649105,0.863250613504095,0.018620036522588687,0.016243023600836588]"
2,"(3383,[31,70,807,1397,1830,2476,2851],[2.0,1.0,1.0,1.0,1.0,1.0,1.0])","[0.01877510701204562,0.010897262396726315,0.011220898727280607,0.015163533966856569,0.011688856759179687,0.9090928712185797,0.012372097160097052,0.010789372759234363]"
3,"(3383,[21,27,200,406,457,677,807,1499,1722],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])","[0.016873190651323807,0.00980040910873928,0.010090609396704085,0.013652939991871466,0.20357005634074538,0.012560833521244933,0.723748675525299,0.009703285464072104]"
4,"(3383,[386,898,2175,2851],[1.0,1.0,1.0,1.0])","[0.03372544876207572,0.019729935190353376,0.020314249299078466,0.027478127461575064,0.021162075379496152,0.8356627641738856,0.022393019147866053,0.019534380585669597]"
5,"(3383,[71,372,952,1362,2068,2227,2263],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])","[0.02136227764629989,0.012270844567493455,0.012634116957550715,0.8987363081982651,0.01316224540706082,0.015751104035816903,0.013933742467166995,0.012149360720345994]"
6,"(3383,[1,7,45,47,78,84,93,113,166,249,362,490,1069,1180,1253,1260,3135],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])","[0.009324087442202042,0.005428896995968504,0.005589693455872278,0.9553371812725351,0.005823273295088367,0.0069587801696187265,0.006162834634425509,0.005375252734289472]"
7,"(3383,[101,471,1646,2559],[1.0,1.0,1.0,1.0])","[0.672282612374509,0.1915518502568753,0.020325828440241177,0.02747545202779478,0.02116192889068281,0.025270950485707887,0.02239689790318204,0.019534479621006837]"
8,"(3383,[33,65,76,105,261,317,367,379,682,1015,1028,1473,2061,2467,2559],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])","[0.5652570052102823,0.006110399777950387,0.006293560864773296,0.07031740039895461,0.006553986911305018,0.33247073800572013,0.006947049778050828,0.006049859052963416]"
9,"(3383,[27,101,272,517,1258,2640,3004],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])","[0.6815114203604435,0.012271528634023188,0.012634795282772313,0.01708839458257975,0.01316160499870884,0.015716232387833956,0.1151482465113082,0.13246777724233016]"

documentsToShow: Int = 10

### 9. Exploring results

When I ran the analysis in the previous cell, the document with id 7 (the eighth document, since ids start at 0) came up as weighted most heavily toward the first topic (topic 0), at about 0.67.
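
A reading like this can also be made programmatically. The following plain-Scala sketch, which does not touch Spark, takes the `topicDistribution` values shown for id 7 in the table above and picks out the heaviest topic; the names `distribution`, `topWeight` and `dominantTopic` are illustrative.

```scala
// topicDistribution for the document with id 7, copied from the table above.
val distribution = Vector(
  0.672282612374509, 0.1915518502568753, 0.020325828440241177, 0.02747545202779478,
  0.02116192889068281, 0.025270950485707887, 0.02239689790318204, 0.019534479621006837
)

// Weight and index of the heaviest topic.
val (topWeight, dominantTopic) = distribution.zipWithIndex.maxBy(_._1)
// dominantTopic is 0

// The weights form a probability distribution, so they should sum to roughly 1.
val total = distribution.sum
```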

Let's compare the contents of document 7 with the definition of topic 0.

We can index directly into our original corpus of texts to see the contents of that "document":

In [30]:
val documentIndex = 7


scholiaAscii.nodes(documentIndex)

documentIndex: Int = 7
res29_1: CitableNode = CitableNode(
  CtsUrn("urn:cts:greekLit:tlg5026.e3.e3_simpleascii:9.e3_109v_8"),
  "h eikotws tauta poiei tois khruci pros to duswpein ekeinous h gar sumfora tapeinoi kai ta megala fronhmata"
)

We can set a condition on the `weightedLabels` data frame to filter it to a given topic.

In [31]:
val topicIndex = 0

val topic = weightedLabels.filter(weightedLabels("topic") === topicIndex).select("termsWithWeight") //.showHTML(truncate=1000)

topicIndex: Int = 0
topic: DataFrame = [termsWithWeight: array<string>]

We can break the resulting array out into one element per row with Spark's `explode` function.


In [32]:
topic.select( explode(col("termsWithWeight"))).showHTML(truncate=maxWidth)


col
greeklit 0.01986889626600968
tlg0012 0.01932522649981618
tlg001 0.019323379541894697
kata 0.018745833002878125
legei 0.013076311991565037
echs 0.012980631820756445
para 0.008930328573828014
fhsin 0.007860709814748767
diastalteon 0.0067491893833538435
dios 0.006498753243127346
