In [1]:
spark

Waiting for a Spark session to start...

In [2]:
import edu.umd.cloud9.collection.XMLInputFormat

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io._ // LongWritable, Text

In [4]:
import edu.umd.cloud9.collection.wikipedia._ // WikipediaPage
import edu.umd.cloud9.collection.wikipedia.language._ // EnglishWikipediaPage

In [5]:
def wikiXmlToPlainText(pageXml: String): Option[(String, String)] = {
  val page = new EnglishWikipediaPage()

  // Wikipedia has changed its dump format slightly since Cloud9 was written:
  // the <text> tag may carry a bytes="N" attribute. Stripping it is sometimes
  // required to get parsing to work.
  val hackedPageXml = pageXml.replaceFirst(
    "<text xml:space=\"preserve\" bytes=\"\\d+\">", "<text xml:space=\"preserve\">")

  WikipediaPage.readPage(page, hackedPageXml)
  // Keep only real articles: skip empty pages, redirects, and disambiguation pages.
  if (page.isEmpty || !page.isArticle || page.isRedirect || page.isDisambiguation ||
      page.getTitle.contains("(disambiguation)")) {
    None
  } else {
    Some((page.getTitle, page.getContent))
  }
}

wikiXmlToPlainText: (pageXml: String)Option[(String, String)]
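
The comment in wikiXmlToPlainText describes a replacement that strips the bytes attribute from the <text> tag. A minimal sketch of that one step in isolation, run on a made-up page fragment (sampleXml and normalized are hypothetical names used only for illustration; no Cloud9 classes are involved):

```scala
// Hypothetical sample: a <text> tag carrying the newer bytes="N" attribute.
val sampleXml =
  """<page><revision><text xml:space="preserve" bytes="42">Hello</text></revision></page>"""

// Same regex and replacement as in wikiXmlToPlainText: drop bytes="N".
val normalized = sampleXml.replaceFirst(
  "<text xml:space=\"preserve\" bytes=\"\\d+\">", "<text xml:space=\"preserve\">")

println(normalized)
```

A page whose <text> tag has no bytes attribute does not match the pattern and passes through unchanged.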


In [6]:
val path = "/user/kranthidr/dataSets/wikidump-02-sept-2018.xml"

// Hadoop's Configuration is not Serializable, so mark it @transient in the notebook.
@transient val conf = new Configuration()

// Treat each <page>...</page> element of the dump as one input record.
conf.set(XMLInputFormat.START_TAG_KEY, "<page>")
conf.set(XMLInputFormat.END_TAG_KEY, "</page>")

val kvs = spark.sparkContext.newAPIHadoopFile(
  path,
  classOf[XMLInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf)


path = /user/kranthidr/dataSets/wikidump-02-sept-2018.xml
conf = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml
kvs = /user/kranthidr/dataSets/wikidump-02-sept-2018.xml NewHadoopRDD[0] at newAPIHadoopFile at <console>:50



In [7]:
// Keep only the page XML (the Text value); the LongWritable key is not needed.
val rawXmls = kvs.map(_._2.toString).toDS()

rawXmls = [value: string]



In [8]:
rawXmls.take(2)

Array(<page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <redirect title="Computer accessibility" />
    <revision>
      <id>854851586</id>
      <parentid>834079434</parentid>
      <timestamp>2018-08-14T06:47:24Z</timestamp>
      <contributor>
        <username>Godsy</username>
        <id>23257138</id>
      </contributor>
      <comment>remove from category for seeking instructions on rcats</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]]
{{R from move}}
{{R from CamelCase}}
{{R unprintworthy}}</text>
      <sha1>42l0cvblwtb4nnupxm6wo000d27t6kf</sha1>
    </revision>
  </page>, <page>
    <title>Anarchism</title>
    <ns>0</ns>
    <...



In [9]:
val docTexts = rawXmls.filter(_ != null).flatMap(wikiXmlToPlainText)

docTexts = [_1: string, _2: string]



In [10]:
docTexts.take(1).foreach(println)

of Lucifer, the Light-Bearer  and many anarchists were "ardent freethinkers; reprints from freethought papers such as Lucifer, the Light-Bearer, Freethought and The Truth Seeker appeared in Liberty... The church was viewed as a common ally of the state and as a repressive force in and of itself". In 1901, Catalan anarchist and free thinker Francesc Ferrer i Guàrdia established "modern" or progressive schools in Barcelona in defiance of an educational system controlled by the Catholic Church.    The schools' stated goal was to "educate the working class in a rational, secular and non-coercive setting". Fiercely anti-clerical, Ferrer believed in "freedom in education", education free from the authority of church and state.  Murray Bookchin wrote: "This period [1890s] was the heyday of libertarian schools and pedagogical projects in all areas of the country where Anarchists exercised some degree of influence. Perhaps the best-known effort in this field was Francisco Ferrer's Modern School

In [11]:
// docTexts.show(1, false)
// Disabled: with truncation off, the full article text makes the table output unreadable.


In [12]:
docTexts

[_1: string, _2: string]

In [13]:
rawXmls.printSchema()

root
 |-- value: string (nullable = true)



In [14]:
docTexts.printSchema()

root
 |-- _1: string (nullable = true)
 |-- _2: string (nullable = true)



In [15]:
import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer

import edu.stanford.nlp.ling.CoreAnnotations._ // LemmaAnnotation, SentencesAnnotation, TokensAnnotation
import edu.stanford.nlp.pipeline._ // Annotation, StanfordCoreNLP

import java.util.Properties

import org.apache.spark.ml.feature._ // CountVectorizer, IDF
import org.apache.spark.sql.functions._
import org.apache.spark.sql._ // DataFrame, Dataset, SparkSession

In [16]:
def createNLPPipeline(): StanfordCoreNLP = {
  val props = new Properties()
  // Tokenize, split sentences, tag parts of speech, then lemmatize.
  props.setProperty("annotators", "tokenize, ssplit, pos, lemma")
  new StanfordCoreNLP(props)
}

createNLPPipeline: ()edu.stanford.nlp.pipeline.StanfordCoreNLP


In [17]:
def isOnlyLetters(str: String): Boolean = {
  str.forall(c => Character.isLetter(c))
}

isOnlyLetters: (str: String)Boolean


In [18]:
def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP)
  : Seq[String] = {
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences.asScala;
       token <- sentence.get(classOf[TokensAnnotation]).asScala) {
    // Lowercase before the stop-word check, since the stop-word list is all lowercase.
    val lemma = token.get(classOf[LemmaAnnotation]).toLowerCase
    // Keep lemmas longer than two characters that are letters only and not stop words.
    if (lemma.length > 2 && !stopWords.contains(lemma) && isOnlyLetters(lemma)) {
      lemmas += lemma
    }
  }
  lemmas
}

plainTextToLemmas: (text: String, stopWords: Set[String], pipeline: edu.stanford.nlp.pipeline.StanfordCoreNLP)Seq[String]
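
The filter inside plainTextToLemmas keeps a lemma only if it is longer than two characters, is not a stop word, and consists of letters only. A minimal sketch of that rule without CoreNLP, assuming a small hand-written list in place of real lemmas (sampleLemmas, sampleStopWords, and kept are hypothetical names; the real stop-word list comes from stopwords.txt):

```scala
// Letters-only check, as defined in the notebook.
def isOnlyLetters(str: String): Boolean = str.forall(c => Character.isLetter(c))

// Hypothetical stand-ins for CoreNLP lemmas and for the stop-word list.
val sampleLemmas = Seq("anarchism", "be", "the", "political", "1901", "state-less", "Oxford")
val sampleStopWords = Set("the", "be", "and")

// Same three conditions as in plainTextToLemmas, followed by lowercasing.
val kept = sampleLemmas
  .filter(l => l.length > 2 && !sampleStopWords.contains(l) && isOnlyLetters(l))
  .map(_.toLowerCase)

println(kept.mkString(", ")) // anarchism, political, oxford
```

Here "be" fails the length check, "the" is a stop word, and "1901" and "state-less" contain non-letter characters.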


In [20]:
val stopWords = scala.io.Source.fromFile("./stopwords.txt").getLines().toSet

stopWords = Set(down, it's, ourselves, that's, for, further, she'll, any, there's, this, haven't, in, ought, myself, have, your, off, once, i'll, are, is, his, why, too, why's, am, than, isn't, didn't, himself, but, you're, below, what, would, i'd, if, you'll, own, they'll, up, we're, they'd, so, our, do, all, him, had, nor, before, it, a, she's, as, hadn't, because, has, she, yours, or, above, yourself, herself, she'd, such, they, each, can't, don't, i, until, that, out, he's, cannot, to, we've, hers, you, did, let's, most, here, these, hasn't, was, there, when's, shan't, doing, at, through, been, over, i've, on, being, same, how, whom, my, after, who, itself, me, them, by, then, couldn't, he, should, few, wasn't, again, while, their, not, with, ...



In [21]:
val bStopWords = spark.sparkContext.broadcast(stopWords)

bStopWords = Broadcast(4)



In [25]:
bStopWords.value

Set(down, it's, ourselves, that's, for, further, she'll, any, there's, this, haven't, in, ought, myself, have, your, off, once, i'll, are, is, his, why, too, why's, am, than, isn't, didn't, himself, but, you're, below, what, would, i'd, if, you'll, own, they'll, up, we're, they'd, so, our, do, all, him, had, nor, before, it, a, she's, as, hadn't, because, has, she, yours, or, above, yourself, herself, she'd, such, they, each, can't, don't, i, until, that, out, he's, cannot, to, we've, hers, you, did, let's, most, here, these, hasn't, was, there, when's, shan't, doing, at, through, been, over, i've, on, being, same, how, whom, my, after, who, itself, me, them, by, then, couldn't, he, should, few, wasn't, again, while, their, not, with, from, you've, they've, what's, wouldn't, both, could, its, under, which, you'd, an, be, here's, into, where, he'll, her, themselves, were, more, we'd, where's, they're, who's, between, aren't, ours, about, doesn't, how's, against, during, no, very, we, ha

In [26]:
val terms: Dataset[(String, Seq[String])] =
  docTexts.mapPartitions { iter =>
    // Build the CoreNLP pipeline once per partition rather than once per document.
    val pipeline = createNLPPipeline()
    iter.map { case (title, contents) =>
      (title, plainTextToLemmas(contents, bStopWords.value, pipeline))
    }
  }

terms = [_1: string, _2: array<string>]



In [27]:
terms.take(1)

[(Anarchism,List(anarchism, anarchism, political, philosophy, advocate, society, base, voluntary, institution, often, describe, stateless, society, although, several, author, define, specifically, institution, base, free, association, anarchism, hold, state, undesirable, unnecessary, harmful, slevin, carl, anarchism, concise, oxford, dictionary, politics, iain, mclean, alistair, mcmillan, oxford, university, press, opposition, state, central, anarchism, specifically, entail, oppose, authority, hierarchical, organisation, conduct, human, relation, anarchism, usually, consider, ideology, much, anarchist, economics, anarchist, legal, philosophy, reflect, interpretation, communism, collectivism, syndicalism, mutualism, participatory, economics, anarchism, offer, fix, body, doctrine, single, particular, world, view, instead, flux, flow, philosophy, many, type, tradition, anarchism, exist, mutually, exclusive, anarchist, school, thought, can, differ, fundamentally, support, anything, extreme
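
Cell In [26] calls createNLPPipeline() inside mapPartitions, so one pipeline is built per partition and reused for every document in it. The sketch below imitates that pattern with plain Scala iterators and no Spark at all; ExpensiveResource, partitions, constructions, and results are hypothetical names standing in for the CoreNLP pipeline and the Dataset partitions:

```scala
// Counts how many times the stand-in resource is constructed.
var constructions = 0

// Hypothetical stand-in for an object that is costly to build, such as an NLP pipeline.
class ExpensiveResource {
  constructions += 1
  def process(s: String): String = s.toUpperCase
}

// Two "partitions" simulated as plain iterators (an assumption; not real Spark partitions).
val partitions = Seq(Iterator("a", "b", "c"), Iterator("d", "e"))

val results = partitions.flatMap { iter =>
  val resource = new ExpensiveResource() // built once per partition
  iter.map(resource.process).toList      // reused for every element in the partition
}

println(s"processed ${results.size} elements with $constructions constructions")
```

Five elements are processed with only two constructions, one per simulated partition; building the resource inside the per-element map would construct it five times.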