# Spark LDA

An example of topic modelling a corpus of texts using Spark ML's LDA.

In the first few code cells, you make the main decisions about how to topic model your corpus: you set key values, then download and assemble your texts.


## Settings

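- `k` is the traditional name for the number of topics to find
- `iterations` is the maximum number of iterations the LDA algorithm should run
- `stopWords` is an Array of words to omit from the model
- `vocabSize` is the maximum number of distinct terms to keep in the vocabulary
- `minimumTokenLength` is the length in characters of the shortest token to keep
- `termsToDisplay` is the number of terms to use in describing a topic
- `maxWidth` is a cosmetic setting for table display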

In [1]:
val k = 10
val iterations = 20
val stopWords = Array("de", "kai", "to", "thn", "gar", "twn", "h", "tou", "ws", "o", "ths", "ton", "dia", "mh", "oti", "ou", "pros", "eis", "men", "oi", "ouk", "en", "tous", "epi", "ta", "tw|", "tois", "auton", "ei", "nun", "peri", "hn", "oun", "autw|", "autou", "alla", "tas", "all'", "esti", "estin", "te", "th|", "touto", "tauta", "apo", "ek", "meta", "ti", "ec", "anti", "oude")

val vocabSize = 10000
val minimumTokenLength = 4
val termsToDisplay = 15

// Cosmetic setting for table display:
val maxWidth = 1000

k: Int = 10
iterations: Int = 20
stopWords: Array[String] = Array(
  "de",
  "kai",
  "to",
  "thn",
  "gar",
  "twn",
  "h",
  "tou",
  "ws",
  "o",
  "ths",
  "ton",
  "dia",
  "mh",
  "oti",
  "ou",
  "pros",
  "eis",
  "men",
  "oi",
  "ouk",
  "en",
  "tous",
  "epi",
  "ta",
  "tw|",
  "tois",
  "auton",
  "ei",
  "nun",
  "peri",
  "hn",
  "oun",
  "autw|",
  "autou",
  "alla",
  "tas",
  "all'",
...
vocabSize: Int = 10000
minimumTokenLength: Int = 4

## Download data and clean up text


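This example uses delimited-text data from two collections of scholia, downloaded from GitHub (the `upsbk9scholia` and `venetusBUrl` sources defined below). The texts shown in the output are already in a lower-case ASCII transliteration, so the cells in this section only need to:

- load each source as a citable `Corpus` with `CorpusSource.fromUrl`
- concatenate the two corpora into a single corpus, `scholiaAscii`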

In [2]:
// Add a personal Maven repository to the dependency resolver.
// Note: JFrog retired Bintray in 2021, so this URL may no longer resolve.
val personalRepo = coursierapi.MavenRepository.of("https://dl.bintray.com/neelsmith/maven")
interp.repositories() ++= Seq(personalRepo)

personalRepo: coursierapi.MavenRepository = MavenRepository(https://dl.bintray.com/neelsmith/maven)

In [3]:
import $ivy.`edu.holycross.shot.cite::xcite:4.3.0`
import $ivy.`edu.holycross.shot::ohco2:10.20.3`
import $ivy.`edu.holycross.shot::greek:5.5.3`
import $ivy.`edu.holycross.shot.mid::orthography:2.0.0`


In [4]:
import edu.holycross.shot.cite._
import edu.holycross.shot.ohco2._
import edu.holycross.shot.greek._
import edu.holycross.shot.mid.orthography._

In [5]:
val venetusBUrl = "https://raw.githubusercontent.com/hmteditors/iliad23-2020/master/presentation/vbScholiaData"
val upsbk9scholia = "https://raw.githubusercontent.com/neelsmith/transmission-evolution/master/data/texts/diplomatic/ascii/upsilon9scholia_ascii.cex"

venetusBUrl: String = "https://raw.githubusercontent.com/hmteditors/iliad23-2020/master/presentation/vbScholiaData"
upsbk9scholia: String = "https://raw.githubusercontent.com/neelsmith/transmission-evolution/master/data/texts/diplomatic/ascii/upsilon9scholia_ascii.cex"

In [6]:
// Create the two source corpora:
val upbk9 = CorpusSource.fromUrl(upsbk9scholia)
val vbScholia = CorpusSource.fromUrl(venetusBUrl)

upbk9: Corpus = Corpus(
  Vector(
    CitableNode(
      CtsUrn("urn:cts:greekLit:tlg5026.e3.e3_simpleascii:9.e3_109v_1"),
      "a litas men thn rayw|dian kalousin epei dh de oi trwes ek paradocou nikwsi belesi dios ouk oikeia| dunamei panti ponw| thn tuxhn fullatousi parembolhn epi tw| naustaqmw| poioumenoi tois de ellhsin apanta dusxerh prwta men en kairw| mh parontos agaqou summaxou eita kai meta parabasin tosouton eutuxountwn trwwn oi keraunoi tou dios malista de pantwn o thn aitian exwn agamemnwn axqetai ot' an de allwn pragmatwn arxesqai ws oi nomimoi twn istoriografwn paragrafas emballei metabainwn gar epi ta ellhnwn apekorufwse ton logon"
    ),
    CitableNode(
      CtsUrn("urn:cts:greekLit:tlg5026.e3.e3_simpleascii:9.e3_109v_2"),
      "b ora pws to antiqeton eni edhlwse rhmati trwes exon axaious exe"
    ),
    CitableNode(
      ...

In [7]:
val scholiaAscii = upbk9 ++ vbScholia

scholiaAscii: Corpus = Corpus(
  Vector(
    CitableNode(
      CtsUrn("urn:cts:greekLit:tlg5026.e3.e3_simpleascii:9.e3_109v_1"),
      "a litas men thn rayw|dian kalousin epei dh de oi trwes ek paradocou nikwsi belesi dios ouk oikeia| dunamei panti ponw| thn tuxhn fullatousi parembolhn epi tw| naustaqmw| poioumenoi tois de ellhsin apanta dusxerh prwta men en kairw| mh parontos agaqou summaxou eita kai meta parabasin tosouton eutuxountwn trwwn oi keraunoi tou dios malista de pantwn o thn aitian exwn agamemnwn axqetai ot' an de allwn pragmatwn arxesqai ws oi nomimoi twn istoriografwn paragrafas emballei metabainwn gar epi ta ellhnwn apekorufwse ton logon"
    ),
    CitableNode(
      CtsUrn("urn:cts:greekLit:tlg5026.e3.e3_simpleascii:9.e3_109v_2"),
      "b ora pws to antiqeton eni edhlwse rhmati trwes exon axaious exe"
    ),
    CitableNode(
      ...

In [8]:
scholiaAscii.size

res7: Int = 477

## Set up a Spark notebook session

Import libraries, turn off Spark's log output, and start up a local Spark notebook session. These three cells fall in the category of "stuff you copy and paste in to set up a Jupyter notebook with Spark and don't think about too much."

In [9]:
import $ivy.`org.apache.spark::spark-sql:2.4.5` // Or use any other 2.x version here
import org.apache.spark.sql._
import $ivy.`org.apache.spark::spark-mllib:2.4.5`



In [10]:
import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF)

In [11]:
val spark = {
  NotebookSparkSession.builder()
    .master("local[*]")
    .getOrCreate()
}

spark: SparkSession = org.apache.spark.sql.SparkSession@3cadf653

## Topic modelling with Spark LDA

After importing a small mountain of Spark libraries, the following cells go through the basic steps of topic modelling:

1. Create a text corpus
2. Tokenize
3. Filter stop words
4. Count word occurrences for each text
5. Create the LDA model by "fitting" it to our data
6. Apply the model to compute the topics and their distribution in each document of our corpus
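
Steps 2 to 4 can be sketched without Spark, using plain Scala collections, to show what each stage does to the data. This is only an illustration under stated assumptions: the two strings are fragments of the scholia displayed earlier, the names `toyCorpus`, `toyStopWords`, `toyMinLength`, `toyTokens`, `toyFiltered`, and `toyCounts` are hypothetical, and the Spark cells that follow are what actually build the model.

```scala
// Toy stand-ins for the corpus and settings (illustrative values only).
val toyCorpus = Vector(
  "b ora pws to antiqeton eni edhlwse rhmati trwes exon axaious exe",
  "poioumenoi tois de ellhsin apanta dusxerh"
)
val toyStopWords = Set("tois", "meta")
val toyMinLength = 4

// Step 2: split each text on runs of non-word characters, keep long-enough tokens.
val toyTokens: Vector[Vector[String]] = toyCorpus.map { text =>
  text.toLowerCase.split("[\\W_]+").toVector.filter(_.length >= toyMinLength)
}

// Step 3: drop stop words.
val toyFiltered: Vector[Vector[String]] =
  toyTokens.map(_.filterNot(toyStopWords.contains))

// Step 4: count occurrences of each remaining token in each text.
val toyCounts: Vector[Map[String, Int]] =
  toyFiltered.map(_.groupBy(identity).map { case (term, hits) => term -> hits.size })
```

In this sketch the short words (`b`, `ora`, `pws`, `to`, `eni`, `exe`, `de`) disappear at the tokenizing stage, and `tois` is removed by the stop-word filter.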


In [12]:
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.mllib.linalg.Vector
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.functions._

### 1. Create `DataFrame` with text corpus

Getting your clean text into a Spark `DataFrame` is an awkward, two-step process: first create an RDD that pairs each text with an index, then convert the RDD to a `DataFrame`. (This should be simpler in future versions of Spark.)

The important output is `corpus_df`, a `DataFrame` with one row for every text.


In [13]:
// Create RDD:
val scholiaText = scholiaAscii.nodes.map(n => n.text)
val txtRdd = spark.sparkContext.parallelize(scholiaText).zipWithIndex

scholiaText: collection.immutable.Vector[String] = Vector(
  "a litas men thn rayw|dian kalousin epei dh de oi trwes ek paradocou nikwsi belesi dios ouk oikeia| dunamei panti ponw| thn tuxhn fullatousi parembolhn epi tw| naustaqmw| poioumenoi tois de ellhsin apanta dusxerh prwta men en kairw| mh parontos agaqou summaxou eita kai meta parabasin tosouton eutuxountwn trwwn oi keraunoi tou dios malista de pantwn o thn aitian exwn agamemnwn axqetai ot' an de allwn pragmatwn arxesqai ws oi nomimoi twn istoriografwn paragrafas emballei metabainwn gar epi ta ellhnwn apekorufwse ton logon",
  "b ora pws to antiqeton eni edhlwse rhmati trwes exon axaious exe",
  "g h boulhsei qewn progegenhmenh fuza de aei men h meta deous fugh oqen kai fuzakinh|s elafoisi nun de ekplhcis apologeitai de oti ek qewn kruoeis de o yuxros to gar qermon epileipei tous dediotas",
  "d oi men alloi en fughi oide aristoi en pe
...

In [14]:
// Import implicits *after* creation of context.
import spark.sqlContext.implicits._

val corpus_df = txtRdd.toDF("corpus", "id")

corpus_df: DataFrame = [corpus: string, id: bigint]

While we're at it, we can paste in this handy snippet, which defines a function to display Spark `DataFrame`s as HTML tables. (We'll use the `showHTML` function later.)

In [15]:
// based on a snippet by Ivan Zaitsev
// https://github.com/almond-sh/almond/issues/180#issuecomment-364711999
implicit class RichDF(val df: DataFrame) {
  def showHTML(limit:Int = 20, truncate: Int = 20) = {
    import xml.Utility.escape
    val data = df.take(limit)
    val header = df.schema.fieldNames.toSeq
    val rows: Seq[Seq[String]] = data.map { row =>
      row.toSeq.map { cell =>
        val str = cell match {
          case null => "null"
          case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")
          case array: Array[_] => array.mkString("[", ", ", "]")
          case seq: Seq[_] => seq.mkString("[", ", ", "]")
          case _ => cell.toString
        }
        if (truncate > 0 && str.length > truncate) {
          // do not show ellipses for strings shorter than 4 characters.
          if (truncate < 4) str.substring(0, truncate)
          else str.substring(0, truncate - 3) + "..."
        } else {
          str
        }
      }: Seq[String]
    }

    publish.html(s"""
      <table class="table">
        <tr>
        ${header.map(h => s"<th>${escape(h)}</th>").mkString}
        </tr>
        ${rows.map { row =>
          s"<tr>${row.map { c => s"<td>${escape(c)}</td>" }.mkString}</tr>"
        }.mkString
        }
      </table>""")
  }
}

defined class RichDF

### 2. Tokenize
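
As a rough illustration of what the tokenizer in this step does, the following plain-Scala sketch applies the same pattern and minimum length to a fragment of the first scholion. It assumes `RegexTokenizer`'s default behaviour (lower-casing the input and treating the pattern as a delimiter); `roughTokens`, `sample`, and `sampleTokens` are hypothetical names, not part of the notebook.

```scala
// Plain-Scala approximation of the tokenizer settings used in this step:
// split on runs of non-word characters or underscores, then drop short tokens.
def roughTokens(text: String, minLength: Int = 4): Vector[String] =
  text.toLowerCase.split("[\\W_]+").toVector.filter(_.length >= minLength)

// A fragment of the first scholion shown in the corpus output.
val sample = "thn tuxhn fullatousi parembolhn epi tw| naustaqmw| poioumenoi"
val sampleTokens = roughTokens(sample)
```

Because `|` is a non-word character, it acts as a delimiter: `naustaqmw|` yields the token `naustaqmw`, while `tw|` yields `tw`, which falls below the minimum length and is dropped along with `thn` and `epi`.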

In [16]:
val tokenizer = new RegexTokenizer().setPattern("[\\W_]+").setMinTokenLength(minimumTokenLength).setInputCol("corpus").setOutputCol("tokens")
val tokenized_df = tokenizer.transform(corpus_df)


tokenizer: RegexTokenizer = regexTok_66260550c083
tokenized_df: DataFrame = [corpus: string, id: bigint ... 1 more field]
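
To make the tokenizer settings concrete, here is a rough plain-Scala sketch (no Spark involved) of what `RegexTokenizer` does with its defaults: the pattern is treated as a delimiter, the text is lowercased, and tokens shorter than the minimum length are dropped. The names `TokenizeSketch` and `tokenize` are hypothetical and exist only for this illustration.

```scala
object TokenizeSketch {
  // Lowercase, split on runs of non-word characters or underscores,
  // and keep only tokens of at least minLength characters.
  def tokenize(text: String, minLength: Int): Seq[String] =
    text.toLowerCase.split("[\\W_]+").toSeq.filter(_.length >= minLength)

  def main(args: Array[String]): Unit = {
    // Opening words of a scholion that appears later in this notebook.
    val sample = "h eikotws tauta poiei tois khruci"
    println(tokenize(sample, 4).mkString(", "))
  }
}
```

With a minimum length of 4, the one-letter word `h` is dropped and the other five words survive. Note that a character such as `|` counts as a delimiter under this pattern, so it never appears inside a token.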

### 3. Filter out stop words

We'll need to think about a serious stop-word list at some point, but here's the technique.

In [17]:
val remover = new StopWordsRemover().setStopWords(stopWords).setInputCol("tokens").setOutputCol("filtered")
val filtered_df = remover.transform(tokenized_df)





remover: StopWordsRemover = stopWords_67d898e11b3c
filtered_df: DataFrame = [corpus: string, id: bigint ... 2 more fields]
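
As a minimal sketch of the filtering step, the following plain-Scala function drops every token that appears in a stop-word set, ignoring case (which is `StopWordsRemover`'s default behaviour). `StopWordSketch` and `removeStopWords` are hypothetical names, not part of Spark.

```scala
object StopWordSketch {
  // Remove any token that matches a stop word, comparing case-insensitively.
  def removeStopWords(tokens: Seq[String], stopWords: Set[String]): Seq[String] = {
    val lowered = stopWords.map(_.toLowerCase)
    tokens.filterNot(t => lowered.contains(t.toLowerCase))
  }

  def main(args: Array[String]): Unit = {
    val tokens = Seq("eikotws", "tauta", "poiei", "tois", "khruci")
    // "tauta" and "tois" both occur in this notebook's stopWords setting.
    println(removeStopWords(tokens, Set("tauta", "tois")).mkString(", "))
  }
}
```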

### 4. Compute counts of each token for each text


In [18]:
val vectorizer = new CountVectorizer().setInputCol("filtered").setOutputCol("features").setVocabSize(vocabSize).setMinDF(5).fit(filtered_df)
val countVectors = vectorizer.transform(filtered_df).select("id", "features")



vectorizer: org.apache.spark.ml.feature.CountVectorizerModel = cntVec_6b4f2c036b22
countVectors: DataFrame = [id: bigint, features: vector]
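
The idea behind this step can be sketched in plain Scala: build a vocabulary from the terms that occur in at least `minDF` documents, ordered by corpus-wide frequency and capped at `vocabSize`, then represent each document as sparse (term index, count) pairs. This is only an illustration of the technique, not Spark's implementation. The names `CountSketch`, `buildVocabulary` and `vectorize` are hypothetical, and breaking frequency ties alphabetically is an assumption made here to keep the result deterministic.

```scala
object CountSketch {
  // Vocabulary: terms found in at least minDF documents, most frequent first.
  // Ties are broken alphabetically (an assumption of this sketch).
  def buildVocabulary(docs: Seq[Seq[String]], vocabSize: Int, minDF: Int): Vector[String] = {
    val docFreq = docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }
    val termFreq = docs.flatten.groupBy(identity).map { case (t, occ) => t -> occ.size }
    termFreq.toVector
      .filter { case (t, _) => docFreq(t) >= minDF }
      .sortBy { case (t, n) => (-n, t) }
      .take(vocabSize)
      .map(_._1)
  }

  // Sparse count vector: (term index, count) pairs sorted by index.
  // Tokens outside the vocabulary are ignored.
  def vectorize(doc: Seq[String], vocab: Vector[String]): Seq[(Int, Double)] = {
    val index = vocab.zipWithIndex.toMap
    doc.flatMap(t => index.get(t))
      .groupBy(identity)
      .map { case (i, occ) => i -> occ.size.toDouble }
      .toSeq
      .sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq(
      Seq("poiei", "autwn", "poiei"),
      Seq("poiei", "logon"),
      Seq("autwn", "logon", "oide")
    )
    val vocab = buildVocabulary(docs, vocabSize = 10, minDF = 2)
    println(vocab.mkString(", "))
    docs.foreach(d => println(vectorize(d, vocab)))
  }
}
```

The same shape shows up in Spark's sparse vector display: a value such as `(133,[5],[1.0])` reads as a vector of size 133 whose only non-zero entry is a count of 1.0 at term index 5.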

### 5. Create ("fit") LDA model

In [19]:
val lda = new LDA().setK(k).setMaxIter(iterations)
val model = lda.fit(countVectors)

20/07/29 16:23:41 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
20/07/29 16:23:41 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS


lda: LDA = lda_c6ce2a090a2b
model: org.apache.spark.ml.clustering.LDAModel = lda_c6ce2a090a2b

### 6. Compute topics and their distribution in each document

Each topic is a set of terms with corresponding weights.


In [20]:
val topics = model.describeTopics(termsToDisplay)


topics: DataFrame = [topic: int, termIndices: array<int> ... 1 more field]

In [21]:
topics.showHTML(truncate=1000)

topic,termIndices,termWeights
0,"[5, 23, 20, 21, 123, 90, 32, 118, 53, 49, 67, 33, 50, 81, 7]","[0.07629388823238725, 0.0421964060813928, 0.0341188060878836, 0.02209522488221627, 0.016766780597135784, 0.012476529449122671, 0.011980008439663846, 0.01197131896365704, 0.011903895188608321, 0.011396389572251564, 0.01100073285499437, 0.01093382317837741, 0.010545400069597661, 0.010464894381428844, 0.009472207668041617]"
1,"[63, 44, 2, 51, 80, 32, 94, 95, 102, 65, 3, 16, 115, 68, 57]","[0.04387274279047389, 0.036096253844634586, 0.03360926152841028, 0.03066059759994718, 0.02857518031127364, 0.02853293762246996, 0.023055400869623474, 0.019032073560368044, 0.014603967207137879, 0.014356728160156698, 0.013819142796785932, 0.013785776474238905, 0.013369630800874397, 0.012411429062133798, 0.01068429377613894]"
2,"[50, 83, 108, 96, 51, 55, 48, 34, 93, 20, 2, 80, 28, 118, 127]","[0.023855769810755812, 0.017049146763120207, 0.012280103388996454, 0.011728545054366704, 0.011200596532987056, 0.011010744409512786, 0.010806956702435146, 0.00952939980409811, 0.008909589115458716, 0.008746783271841124, 0.008593494549460944, 0.008445526411978061, 0.00842672579660268, 0.00836810194083666, 0.008207810554005863]"
3,"[1, 7, 0, 3, 4, 13, 9, 11, 19, 47, 22, 30, 49, 6, 82]","[0.0726934532179258, 0.045626373094066336, 0.043591586399756015, 0.03117105118728241, 0.02395845851647052, 0.02337485796109314, 0.0226886666213506, 0.02221387419391414, 0.02139575495804602, 0.020109443966723588, 0.018767024527635012, 0.017280082595090233, 0.015421513382679936, 0.015418335434392826, 0.014674902490904924]"
4,"[92, 54, 104, 94, 31, 90, 37, 105, 99, 50, 63, 43, 66, 96, 24]","[0.03164100644774529, 0.029266730076525443, 0.024025813930056397, 0.0169564611960709, 0.014625670726986293, 0.013114570394136264, 0.0113926696727895, 0.010219768512357686, 0.009791630316457576, 0.009768340307663674, 0.00975610865822214, 0.009681631089544283, 0.009654095673067355, 0.009091503509576037, 0.00906419858282177]"
5,"[10, 18, 57, 45, 8, 34, 33, 48, 21, 17, 1, 40, 106, 37, 107]","[0.04433758332168167, 0.04027142922496963, 0.03340346062583548, 0.030123905956587118, 0.029154934083337677, 0.026215589709238096, 0.024875446288202586, 0.024197930203958398, 0.022029755327049698, 0.02084353476991295, 0.02082326609499786, 0.02004019567782944, 0.017342179557552538, 0.01722238018269003, 0.017001712971046086]"
6,"[25, 14, 126, 125, 101, 85, 56, 54, 115, 89, 42, 13, 59, 110, 77]","[0.06108338762499844, 0.050086879799897836, 0.021876191581130206, 0.021660409937316435, 0.021365114835497818, 0.02118510144075782, 0.019420260749494902, 0.018215165582676184, 0.015686942731461113, 0.014683220213558333, 0.013364365079721266, 0.01127037834592032, 0.01078149718440151, 0.00925454117144154, 0.008378656597591511]"
7,"[102, 61, 14, 38, 67, 74, 126, 31, 52, 12, 16, 114, 129, 113, 120]","[0.03833262513282234, 0.028026918026789346, 0.01819015040552277, 0.017371556607963117, 0.017086335271709028, 0.010353792560759667, 0.009088350118238407, 0.008474162286043798, 0.008220409267421359, 0.008188296279883503, 0.00816858049709522, 0.008009674858458342, 0.007996517858598978, 0.007866317249051324, 0.007804490702390968]"
8,"[0, 29, 6, 53, 62, 26, 9, 52, 28, 12, 59, 91, 86, 15, 131]","[0.08219075091390693, 0.028433896110593375, 0.023263289725412795, 0.022928989719292036, 0.020839902532620802, 0.02017322628211502, 0.020135051370491295, 0.01935667678155314, 0.019209736490551337, 0.018409735001317415, 0.01724486008737611, 0.01719966698695932, 0.016688124624582433, 0.015575555871653282, 0.015188657511512053]"
9,"[110, 4, 100, 127, 79, 77, 125, 128, 76, 101, 10, 24, 64, 129, 106]","[0.03484179068446127, 0.025645847885370644, 0.02363245162194494, 0.02350952619364318, 0.011347427633089453, 0.010132252392672682, 0.009399042851140084, 0.009228908747780625, 0.009196232774473208, 0.00890251360808275, 0.008702819186548178, 0.008666898050477387, 0.00827651609932364, 0.00827074239525636, 0.008252085906584284]"


### 7. Label topics

For human readers, we'll replace index numbers for each term with the actual term.

1. Create a new DataFrame with ordered lists of terms by looking up the term for each term index.
2. Number the rows of this DataFrame so we can join it with the existing topic data.

In [22]:
val topicLabels = topics.select("termIndices").map { case Row(r:  WrappedArray[Integer]) => r.map( i => vectorizer.vocabulary(i) ) }
val labelsNumberedLong = topicLabels.rdd.zipWithIndex.toDF("terms", "topicLong")
val labelsIndexed = labelsNumberedLong.withColumn("topic", $"topicLong".cast(IntegerType)).drop("topicLong")

val topicsWithTerms = labelsIndexed.join(topics, labelsIndexed.col("topic") === topics.col("topic")).drop(labelsIndexed.col("topic"))





topicLabels: Dataset[WrappedArray[String]] = [value: array<string>]
labelsNumberedLong: DataFrame = [terms: array<string>, topicLong: bigint]
labelsIndexed: DataFrame = [terms: array<string>, topic: int]
topicsWithTerms: DataFrame = [terms: array<string>, topic: int ... 2 more fields]

In [23]:
val weightedLabels = topicsWithTerms.withColumn("termsWithWeight", expr("zip_with(terms, termWeights, (t,w) -> concat(t, ' ', w))"))


weightedLabels: DataFrame = [terms: array<string>, topic: int ... 3 more fields]
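
Stripped of the DataFrame machinery, the labelling in the last two cells amounts to a lookup followed by a zip. The plain-Scala sketch below shows that idea; `LabelSketch` and `label` are hypothetical names, the eight vocabulary entries are read off the topic tables in this notebook, and the weights are rounded from topic 3's output.

```scala
object LabelSketch {
  // Vocabulary entries 0 to 7, inferred by matching term indices to terms
  // in this notebook's topic tables.
  val vocabulary: Vector[String] =
    Vector("fhsi", "autw", "para", "axilleus", "agamemnonos", "poiei", "kalws", "tais")

  // Replace each term index with its term and pair it with its weight.
  def label(indices: Seq[Int], weights: Seq[Double], vocab: Vector[String]): Seq[(String, Double)] =
    indices.map(i => vocab(i)).zip(weights)

  def main(args: Array[String]): Unit = {
    // First five term indices of topic 3, with weights rounded for brevity.
    val indices = Seq(1, 7, 0, 3, 4)
    val weights = Seq(0.0727, 0.0456, 0.0436, 0.0312, 0.0240)
    label(indices, weights, vocabulary).foreach { case (t, w) => println(s"$t $w") }
  }
}
```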

In [24]:
// Flat view
weightedLabels.select("topic", "termsWithWeight").showHTML(truncate=1000)



topic,termsWithWeight
0,"[poiei 0.07629388823238725, dwrwn 0.0421964060813928, eauton 0.0341188060878836, autwn 0.02209522488221627, toutwn 0.016766780597135784, plhqos 0.012476529449122671, eisi 0.011980008439663846, sumferon 0.01197131896365704, mallon 0.011903895188608321, legwn 0.011396389572251564, gnwmhn 0.01100073285499437, oqen 0.01093382317837741, usteron 0.010545400069597661, tote 0.010464894381428844, tais 0.009472207668041617]"
1,"[qelei 0.04387274279047389, tines 0.036096253844634586, para 0.03360926152841028, outws 0.03066059759994718, oper 0.02857518031127364, eisi 0.02853293762246996, basileus 0.023055400869623474, logou 0.019032073560368044, oion 0.014603967207137879, autois 0.014356728160156698, axilleus 0.013819142796785932, exwn 0.013785776474238905, peiqw 0.013369630800874397, toiouton 0.012411429062133798, odusseus 0.01068429377613894]"
2,"[usteron 0.023855769810755812, polla 0.017049146763120207, ginetai 0.012280103388996454, axaiwn 0.011728545054366704, outws 0.011200596532987056, uper 0.011010744409512786, isws 0.010806956702435146, opws 0.00952939980409811, eautou 0.008909589115458716, eauton 0.008746783271841124, para 0.008593494549460944, oper 0.008445526411978061, axillea 0.00842672579660268, sumferon 0.00836810194083666, dingbats 0.008207810554005863]"
3,"[autw 0.0726934532179258, tais 0.045626373094066336, fhsi 0.043591586399756015, axilleus 0.03117105118728241, agamemnonos 0.02395845851647052, epei 0.02337485796109314, legei 0.0226886666213506, axillews 0.02221387419391414, foinic 0.02139575495804602, pantwn 0.020109443966723588, monon 0.018767024527635012, goun 0.017280082595090233, legwn 0.015421513382679936, kalws 0.015418335434392826, logos 0.014674902490904924]"
4,"[andrwn 0.03164100644774529, qewn 0.029266730076525443, outw 0.024025813930056397, basileus 0.0169564611960709, malista 0.014625670726986293, plhqos 0.013114570394136264, einai 0.0113926696727895, legetai 0.010219768512357686, prwhn 0.009791630316457576, usteron 0.009768340307663674, qelei 0.00975610865822214, ouden 0.009681631089544283, allws 0.009654095673067355, axaiwn 0.009091503509576037, ellhnwn 0.00906419858282177]"
5,"[logon 0.04433758332168167, oide 0.04027142922496963, odusseus 0.03340346062583548, legein 0.030123905956587118, autos 0.029154934083337677, opws 0.026215589709238096, oqen 0.024875446288202586, isws 0.024197930203958398, autwn 0.022029755327049698, axillei 0.02084353476991295, autw 0.02082326609499786, allwn 0.02004019567782944, foiniki 0.017342179557552538, einai 0.01722238018269003, kairon 0.017001712971046086]"
6,"[logwn 0.06108338762499844, autous 0.050086879799897836, eipe 0.021876191581130206, tosouton 0.021660409937316435, bouletai 0.021365114835497818, piqanws 0.02118510144075782, panta 0.019420260749494902, qewn 0.018215165582676184, peiqw 0.015686942731461113, eipwn 0.014683220213558333, eisin 0.013364365079721266, epei 0.01127037834592032, kefalaion 0.01078149718440151, oikeia 0.00925454117144154, htoi 0.008378656597591511]"
7,"[oion 0.03833262513282234, agamemnoni 0.028026918026789346, autous 0.01819015040552277, axille 0.017371556607963117, gnwmhn 0.017086335271709028, xarin 0.010353792560759667, eipe 0.009088350118238407, malista 0.008474162286043798, eipen 0.008220409267421359, dhloi 0.008188296279883503, exwn 0.00816858049709522, deiknusin 0.008009674858458342, aitian 0.007996517858598978, eautw 0.007866317249051324, allois 0.007804490702390968]"
8,"[fhsi 0.08219075091390693, prwton 0.028433896110593375, kalws 0.023263289725412795, mallon 0.022928989719292036, exei 0.020839902532620802, oute 0.02017322628211502, legei 0.020135051370491295, eipen 0.01935667678155314, axillea 0.019209736490551337, dhloi 0.018409735001317415, kefalaion 0.01724486008737611, qumou 0.01719966698695932, pantas 0.016688124624582433, kata 0.015575555871653282, axaious 0.015188657511512053]"
9,"[oikeia 0.03484179068446127, agamemnonos 0.025645847885370644, cite2 0.02363245162194494, dingbats 0.02350952619364318, foinika 0.011347427633089453, htoi 0.010132252392672682, tosouton 0.009399042851140084, eita 0.009228908747780625, dios 0.009196232774473208, bouletai 0.00890251360808275, logon 0.008702819186548178, ellhnwn 0.008666898050477387, nestwr 0.00827651609932364, aitian 0.00827074239525636, foiniki 0.008252085906584284]"


Here's the same information, but displayed one term at a time:

In [25]:
// Exploded view
val explodedTerms = weightedLabels.select(col("*"),explode(col("termsWithWeight"))).select("topic","col")

explodedTerms.showHTML(explodedTerms.count.toInt, 1000)

topic,col
0,poiei 0.07629388823238725
0,dwrwn 0.0421964060813928
0,eauton 0.0341188060878836
0,autwn 0.02209522488221627
0,toutwn 0.016766780597135784
0,plhqos 0.012476529449122671
0,eisi 0.011980008439663846
0,sumferon 0.01197131896365704
0,mallon 0.011903895188608321
0,legwn 0.011396389572251564


explodedTerms: DataFrame = [topic: int, col: string]

### 8. Compute distribution of topics per document


To apply this topic model to a specific document or set of documents, we can compute the weight of each topic in each document.

In [26]:
val transformed = model.transform(countVectors)
transformed.printSchema // show(false)



root
 |-- id: long (nullable = false)
 |-- features: vector (nullable = true)
 |-- topicDistribution: vector (nullable = true)



transformed: DataFrame = [id: bigint, features: vector ... 1 more field]

Here are the weightings for the first ten documents:

In [27]:
val documentsToShow = 10
transformed.showHTML(documentsToShow, 1000)

id,features,topicDistribution
0,"(133,[10,13,16,24,31,36,40,47,76,110,125,128,129],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0])","[0.006318574758219746,0.006266214380606355,0.006029286444847529,0.5679254307974858,0.006095060859757801,0.18423373339442792,0.006184459412492185,0.006057284370300238,0.007126315780652294,0.20376363980121004]"
1,"(133,[131],[1.0])","[0.04812002227162666,0.047721397211695,0.045917241766029196,0.05677168042148435,0.0464165901822108,0.05230344888547392,0.04709641609770613,0.04613091205609922,0.5631569549322223,0.046365336175452423]"
2,"(133,[33,54],[1.0,2.0])","[0.023848967574098826,0.023649634567674882,0.022755706520829994,0.028133484085038456,0.5957484807042899,0.20979437664829637,0.02334107759996324,0.022861608644809694,0.026888944544356242,0.02297771911064245]"
3,"(133,[18,21,31],[1.0,1.0,1.0])","[0.0238483555472512,0.023649437753341267,0.022755529872690833,0.028133046754818758,0.023003660332384095,0.7825417946104626,0.02333933521769802,0.022861313440449955,0.026890216891254863,0.022977309579648487]"
4,"(133,[],[])","[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]"
5,"(133,[],[])","[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]"
6,"(133,[2,32,44,51,63,80,95],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])","[0.011871007874839185,0.8906203939651702,0.011327598185397999,0.014005222416433885,0.011450672036911216,0.012903902190318879,0.011618103954871426,0.011380159969306174,0.013385057653609917,0.011437881753140986]"
7,"(133,[5],[1.0])","[0.5570340347023075,0.04772026570143764,0.04591655058009276,0.05676482681489387,0.04641559165355366,0.05230074539610605,0.04709425327972048,0.0461297629055399,0.054260194031476715,0.04636377493487142]"
8,"(133,[7,13,53,66,90,94,99,104,129],[2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])","[0.00862402665876497,0.00855225080894248,0.008228680421585036,0.4150752048630358,0.5154036738722374,0.009373286952609855,0.008439774953355324,0.008266848032022931,0.009727382864276575,0.008308870573169508]"
9,"(133,[5,21],[1.0,1.0])","[0.7064276470900485,0.03162556966994906,0.030430217148911767,0.03761976755398109,0.030760919824128014,0.0346701009161836,0.031210687032882906,0.030571465216898328,0.03595700206898631,0.0307266234780303]"


documentsToShow: Int = 10
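
One simple way to read a `topicDistribution` vector is to pick out its largest entry. The sketch below does this in plain Scala; `DominantTopicSketch` and `dominantTopic` are hypothetical names, and the sample values are rounded from the rows for ids 6 and 7 above. Documents whose feature vector is empty (ids 4 and 5 in the table) get an all-zero distribution, so the function returns `None` for them rather than naming an arbitrary topic.

```scala
object DominantTopicSketch {
  // Index and weight of the heaviest topic, or None when the distribution
  // is empty or all zeros (a document with no vocabulary terms).
  def dominantTopic(distribution: Seq[Double]): Option[(Int, Double)] =
    if (distribution.isEmpty || distribution.forall(_ == 0.0)) None
    else Some(distribution.zipWithIndex.map(_.swap).maxBy(_._2))

  def main(args: Array[String]): Unit = {
    // Values rounded from the table above.
    val id6 = Seq(0.012, 0.891, 0.011, 0.014, 0.011, 0.013, 0.012, 0.011, 0.013, 0.011)
    val id7 = Seq(0.557, 0.048, 0.046, 0.057, 0.046, 0.052, 0.047, 0.046, 0.054, 0.046)
    val empty = Seq.fill(10)(0.0)
    println(dominantTopic(id6))
    println(dominantTopic(id7))
    println(dominantTopic(empty))
  }
}
```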

### 9. Exploring results

When I ran the analysis in the previous cell, the document with id 7 (the eighth document, since ids start at 0) came up as heavily weighted to the first topic (topic 0).

Let's compare the contents of that document with the definition of topic 0.

We can just index directly into our original corpus of texts to see the contents of that "document":

In [28]:
val documentIndex = 7


scholiaAscii.nodes(documentIndex)

documentIndex: Int = 7
res27_1: CitableNode = CitableNode(
  CtsUrn("urn:cts:greekLit:tlg5026.e3.e3_simpleascii:9.e3_109v_8"),
  "h eikotws tauta poiei tois khruci pros to duswpein ekeinous h gar sumfora tapeinoi kai ta megala fronhmata"
)

We can set a condition on the `weightedLabels` data frame to filter it to a given topic.

In [29]:
val topicIndex = 0

val topic = weightedLabels.filter(weightedLabels("topic") === topicIndex).select("termsWithWeight") //.showHTML(truncate=1000)



topicIndex: Int = 0
topic: DataFrame = [termsWithWeight: array<string>]

We can break the resulting array out to one element per row with Spark's `explode` function.


In [30]:
topic.select( explode(col("termsWithWeight"))).showHTML(truncate=maxWidth)


col
poiei 0.07629388823238725
dwrwn 0.0421964060813928
eauton 0.0341188060878836
autwn 0.02209522488221627
toutwn 0.016766780597135784
plhqos 0.012476529449122671
eisi 0.011980008439663846
sumferon 0.01197131896365704
mallon 0.011903895188608321
legwn 0.011396389572251564
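
Conceptually, `explode` behaves like a `flatMap` over rows: each row holding an array becomes one row per array element, with the other columns repeated. A minimal plain-Scala sketch of that idea follows; `ExplodeSketch` is a hypothetical name and the function is an analogy, not Spark's implementation.

```scala
object ExplodeSketch {
  // Turn each (key, values) row into one (key, value) row per element.
  def explode[A, B](rows: Seq[(A, Seq[B])]): Seq[(A, B)] =
    rows.flatMap { case (key, values) => values.map(v => (key, v)) }

  def main(args: Array[String]): Unit = {
    // Topic numbers paired with the first terms of topics 0 and 1.
    val rows = Seq((0, Seq("poiei", "dwrwn")), (1, Seq("qelei")))
    explode(rows).foreach { case (topic, term) => println(s"$topic,$term") }
  }
}
```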
