# Prometheus - Feature Notebook

## Getting Set Up with Jupyter

`sbt publishM2`

Re-run this whenever the project has been updated, and also clear the cached artifacts:

`rm -rf ~/.m2/repository/sonymobile`

`rm -rf metastore_db`

(*Important*: run `kernel -> restart all` afterwards; otherwise the previously loaded version of the `prometheus-relation-model` library may be used again, which is very confusing. It is not certain that this step is actually required.)

`prometheus-relation-model` requires [docforia](https://github.com/marcusklang/docforia), which must first be cloned with git and installed with `mvn install` (unfortunately it is not yet available from a Maven repository).

## Data
This notebook requires the annotated herd data.

In [1]:
%AddDeps se.lth.cs.nlp docforia 1.0-SNAPSHOT --transitive --repository file:/Users/erik/.m2/repository

Marking se.lth.cs.nlp:docforia:1.0-SNAPSHOT for download
Preparing to fetch from:
-> file:/var/folders/x9/cvvmdcld0szc31byy1ns_vfr0000gn/T/toree_add_deps2145311932042217219/
-> file:/Users/erik/.m2/repository
-> https://repo1.maven.org/maven2
-> New file at /Users/erik/.m2/repository/se/lth/cs/nlp/docforia/1.0-SNAPSHOT/docforia-1.0-SNAPSHOT.jar
-> New file at /Users/erik/.m2/repository/com/fasterxml/jackson/core/jackson-annotations/2.7.0/jackson-annotations-2.7.0.jar
-> New file at /Users/erik/.m2/repository/junit/junit/4.11/junit-4.11.jar
-> New file at /Users/erik/.m2/repository/org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3.jar
-> New file at /Users/erik/.m2/repository/com/fasterxml/jackson/core/jackson-core/2.7.3/jackson-core-2.7.3.jar
-> New file at /Users/erik/.m2/repository/com/google/protobuf/protobuf-java/2.6.1/protobuf-java-2.6.1.jar
-> New file at /Users/erik/.m2/repository/com/fasterxml/jackson/core/jackson-databind/2.7.3/jackson-databind-2.7.3.jar
-> New file at /Users/e

In [2]:
import java.io.IOError

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.{Accumulator, SparkContext}
import se.lth.cs.docforia.Document
import se.lth.cs.docforia.memstore.MemoryDocumentIO

def readCorpus(
    file: String,
    sampleSize: Double = 1.0)
    (implicit sqlContext: SQLContext, sc: SparkContext): RDD[Document] = {

    var df: DataFrame = sqlContext.read.parquet(file)
    df = df.where(df("type").equalTo("ARTICLE"))

    val ioErrors: Accumulator[Int] = sc.accumulator(0, "IO_ERRORS")

    // We might need to filter for articles only here, but that would not be a generalized solution.
    val docs = (if(sampleSize == 1.0) df else df.sample(false, sampleSize)).flatMap{row =>
      try {
        val doc: Document = MemoryDocumentIO.getInstance().fromBytes(row.getAs(5): Array[Byte])
        List(doc)
      } catch {
        case e:IOError =>
          ioErrors.add(1)
          List()
      }
    }

    docs
  }
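
The decode-or-skip pattern used in `readCorpus` can be tried without Spark or docforia. The following is a minimal sketch under stated assumptions: `decode` is a hypothetical stand-in for `MemoryDocumentIO.fromBytes`, and a local counter stands in for the Spark accumulator (a plain `var` would not work across Spark executors).

```scala
import java.io.{IOError, IOException}

// Hypothetical decoder standing in for MemoryDocumentIO.fromBytes:
// it rejects empty payloads with an IOError, for illustration only.
def decode(bytes: Array[Byte]): String =
  if (bytes.isEmpty) throw new IOError(new IOException("empty record"))
  else new String(bytes, "UTF-8")

// Decode every record, dropping the ones that fail and counting the failures.
def decodeAll(records: Seq[Array[Byte]]): (Seq[String], Int) = {
  var ioErrors = 0 // local stand-in for the "IO_ERRORS" accumulator
  val docs = records.flatMap { record =>
    try {
      List(decode(record))
    } catch {
      case _: IOError =>
        ioErrors += 1
        Nil
    }
  }
  (docs, ioErrors)
}

val (decoded, failed) =
  decodeAll(Seq("a".getBytes("UTF-8"), Array.empty[Byte], "b".getBytes("UTF-8")))
```

Returning an empty list from the `catch` branch is what lets `flatMap` silently skip unreadable rows while the counter keeps a record of how many were lost.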

In [3]:
val docs = readCorpus("../../data/wikipedia-corpus-herd/sv/")(sqlContext, sc)

In [4]:
docs.count()

6205

# Dependency Features

In [5]:
val doc = docs.take(1)(0)

In [6]:
doc

Microtegeus rugosus är en kvalsterart som beskrevs av Sandór Mahunka 1982. Microtegeus rugosus ingår i släktet Microtegeus och familjen Microtegeidae. Inga underarter finns listade i Catalogue of Life.

Källor. (2007) , PDF, Subías 2007: World Oribatida catalog. Bisby F.A., Roskov Y.R., Orrell T.M., Nicolson D., Paglinawan L.E., Bailly N., Kirk P.M., Bourgoin T., Baillargeon G., Ouvrard D. (red.) (1 mars). ”Species 2000 & ITIS Catalogue of Life: 2011 Annual Checklist.”. Species 2000: Reading, UK. http://www.catalogueoflife.org/annual-checklist/2011/search/all/key/microtegeus+rugosus/match/1. Läst 24 september. ITIS: The Integrated Taxonomic Information System. Orrell T. (custodian), 2011-04-26.

Externa länkar. Wikispecies har information om Microteg...

In [7]:
import se.lth.cs.docforia.graph.text.{Sentence, Token, DependencyRelation, Mention}
import se.lth.cs.docforia.query.QueryCollectors
import scala.collection.JavaConverters._

val DR = DependencyRelation.`var`()
val T = Token.`var`()

val tokens = doc.nodes(classOf[Token]).asScala.toSeq.toList
tokens

List(Microtegeus, rugosus, är, en, kvalsterart, som, beskrevs, av, Sandór, Mahunka, 1982, ., Microtegeus, rugosus, ingår, i, släktet, Microtegeus, och, familjen, Microtegeidae, ., Inga, underarter, finns, listade, i, Catalogue, of, Life, ., Källor, ., (, 2007, ), ,, PDF, ,, Subías, 2007, :, World, Oribatida, catalog, ., Bisby, F, ., A, ., ,, Roskov, Y, ., R, ., ,, Orrell, T, ., M, ., ,, Nicolson, D, ., ,, Paglinawan, L, ., E, ., ,, Bailly, N, ., ,, Kirk, P, ., M, ., ,, Bourgoin, T, ., ,, Baillargeon, G, ., ,, Ouvrard, D, ., (, red, ., ), (, 1, mars, ), ., ”, Species, 2000, &, ITIS, Catalogue, of, Life, :, 2011, Annual, Checklist, ., ”, ., Species, 2000, :, Reading, ,, UK, ., http://www.catalogueoflife.org/annual-checklist/2011/search/all...

## Lemmatization

In [39]:
tokens.map(_.getLemma)

List(microtegeus, rugosus, vara, en, kvalsterart, som, beskriva, av, Sandór, Mahunka, 1982, ., microtegeus, rugosus, ingå, i, släkte, Microtegeus, och, familj, Microtegeidae, ., ingen, underart, finnas, lista, i, catalogue, of, life, ., källa, ., (, 2007, ), ,, pdf, ,, Subías, 2007, :, World, Oribatida, catalog, ., Bisby, F, ., a, ., ,, roskov, y, ., r, ., ,, Orrell, T, ., m, ., ,, Nicolson, D, ., ,, Paglinawan, L, ., e, ., ,, Bailly, N, ., ,, Kirk, P, ., m, ., ,, bourgoin, t, ., ,, Baillargeon, G, ., ,, Ouvrard, D, ., (, rida, ., ), (, 1:a, mars, ), ., ”, specie, 2000, &, itis, catalogue, of, life, :, 2011, annual, Checklist, ., ”, ., species, 2000, :, reading, ,, uk, ., http://www.catalogueoflife.org/annual-checklist/2011/search/all/key/microtegeus+rugosus/match/...

In [8]:
val deps = tokens(1).connectedEdges(classOf[DependencyRelation]).toList.asScala

In [9]:
deps.map(d => s"${d.getTail} -- ${d.getRelation} --> ${d.getHead}")

ArrayBuffer(Microtegeus -- DT --> rugosus, rugosus -- SS --> är)

In [10]:
// Create a dependency path between these two entities.
val ent1 = tokens(0)
val ent2 = tokens(8)

In [11]:
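// Depth-first search over dependency edges, followed in both directions.
// Returns the chain of relations from `current` to `target`, or an empty Seq if no path exists.
def dfs(current: Token, visited: Set[Token], chain: Seq[DependencyRelation], target: Token): Seq[DependencyRelation] = {
    if(current == target) {
        chain
    }else if(visited.contains(current)){
        // Already explored on this branch: abandon it.
        Seq()
    }else{
        val deps = current.connectedEdges(classOf[DependencyRelation]).toList.asScala
        val newVisited = visited + current
        // Try both endpoints of every edge; dead ends contribute nothing to the result.
        deps.flatMap(d => {
            dfs(d.getHead[Token], newVisited, chain :+ d, target) ++ dfs(d.getTail[Token], newVisited, chain :+ d, target)
        })
    }
}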
val path = dfs(ent2, Set(), Seq(), ent1)
path.map(d => (d, d.getHead[Token].getStart < d.getTail[Token].getStart)).map{
    case (d, dir) => 
        if (dir) s"${d.getHead} <-- ${d.getRelation} -- ${d.getTail}" else s"${d.getTail} -- ${d.getRelation} --> ${d.getHead}" 
}

ArrayBuffer(av <-- PA -- Sandór, beskrevs <-- AG -- av, kvalsterart <-- ET -- beskrevs, är <-- SP -- kvalsterart, rugosus -- SS --> är, Microtegeus -- DT --> rugosus)
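
The same path search can be sketched without docforia. This is an illustrative sketch only, not the notebook's method: it swaps the depth-first search for a breadth-first one, and `Dep` and `dependencyPath` are hypothetical names. It also identifies tokens by their text, which only works because each word in this one sentence is unique; real documents need token identity, since "Microtegeus" occurs several times in the article.

```scala
// Hypothetical stand-in for a docforia DependencyRelation: tail -- relation --> head.
final case class Dep(tail: String, relation: String, head: String)

// Breadth-first search over the dependency edges, treated as undirected.
// Returns the edges on the path from `start` to `target`, or None if unreachable.
def dependencyPath(edges: Seq[Dep], start: String, target: String): Option[List[Dep]] = {
  // Index every edge under both of its endpoints so it can be walked either way.
  val incident: Map[String, Seq[Dep]] =
    edges.flatMap(e => Seq(e.tail -> e, e.head -> e))
      .groupBy(_._1)
      .map { case (node, pairs) => node -> pairs.map(_._2) }

  @annotation.tailrec
  def loop(frontier: List[(String, List[Dep])], visited: Set[String]): Option[List[Dep]] =
    frontier match {
      case Nil => None
      case (node, path) :: _ if node == target => Some(path.reverse)
      case (node, path) :: rest =>
        val next = for {
          e <- incident.getOrElse(node, Seq.empty).toList
          neighbour = if (e.tail == node) e.head else e.tail
          if !visited(neighbour)
        } yield (neighbour, e :: path)
        loop(rest ++ next, visited ++ next.map(_._1))
    }

  loop(List((start, Nil)), Set(start))
}

// Edges copied from the first sentence's output above.
val edges = Seq(
  Dep("Sandór", "PA", "av"),
  Dep("av", "AG", "beskrevs"),
  Dep("beskrevs", "ET", "kvalsterart"),
  Dep("kvalsterart", "SP", "är"),
  Dep("rugosus", "SS", "är"),
  Dep("Microtegeus", "DT", "rugosus")
)

val relationPath = dependencyPath(edges, "Sandór", "Microtegeus").map(_.map(_.relation))
```

Because a dependency parse is a tree, there is only one simple path between two tokens, so the breadth-first and depth-first variants should find the same chain of relations.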

## Named Entity Tags

In [17]:
import se.lth.cs.docforia.graph.disambig.NamedEntityDisambiguation
import se.lth.cs.docforia.graph.text.{NamedEntity}
val neds = doc.nodes(classOf[NamedEntityDisambiguation]).asScala.toSeq.toList
neds

List(Sandór Mahunka, Microtegeus, Microtegeidae, Catalogue of Life, PDF, Catalogue of Life, spindeldjur)

In [13]:
val N = NamedEntity.`var`()
val ned = doc.select(N).where(N).coveredBy(tokens(0))


## The Main Verb

## Chunking

In [38]:
val groups = docs.map(doc => {
    val NED = NamedEntityDisambiguation.`var`()
    val T = Token.`var`()
    val nedGroups = doc.select(NED, T).where(T).coveredBy(NED)
    .stream()
    .collect(QueryCollectors.groupBy(doc, NED).values(T).collector())
    .asScala
    .toList
    
    // Collapse each multi-token entity mention into a single token.
    nedGroups.foreach(pg => {
        val values = pg.nodes(T).asScala
        if(values.size > 1){
            val head = values.head
            val last = values.last
            // Stretch the first token over the whole mention, then drop the remaining tokens.
            head.setRange(head.getStart, last.getEnd)
            values.tail.foreach(doc.remove)
        }
    })
    doc
}).take(1)
groups(0).nodes(classOf[Token]).asScala.toSeq.toList

List(Microtegeus, rugosus, är, en, kvalsterart, som, beskrevs, av, Sandór Mahunka, 1982, ., Microtegeus, rugosus, ingår, i, släktet, Microtegeus, och, familjen, Microtegeidae, ., Inga, underarter, finns, listade, i, Catalogue of Life, ., Källor, ., (, 2007, ), ,, PDF, ,, Subías, 2007, :, World, Oribatida, catalog, ., Bisby, F, ., A, ., ,, Roskov, Y, ., R, ., ,, Orrell, T, ., M, ., ,, Nicolson, D, ., ,, Paglinawan, L, ., E, ., ,, Bailly, N, ., ,, Kirk, P, ., M, ., ,, Bourgoin, T, ., ,, Baillargeon, G, ., ,, Ouvrard, D, ., (, red, ., ), (, 1, mars, ), ., ”, Species, 2000, &, ITIS, Catalogue of Life, :, 2011, Annual, Checklist, ., ”, ., Species, 2000, :, Reading, ,, UK, ., http://www.catalogueoflife.org/annual-checklist/2011/search/all/key...
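
The chunking step above merges the tokens of each entity mention by mutating the docforia document in place. The idea can also be sketched on plain, immutable data. This is a minimal sketch under stated assumptions: `Tok`, `Span` and `mergeCovered` are hypothetical names, tokens are assumed to be sorted by start offset, and entity spans are assumed not to overlap.

```scala
// Hypothetical stand-ins for docforia nodes: a token and an entity span as character ranges.
final case class Tok(text: String, start: Int, end: Int)
final case class Span(start: Int, end: Int)

// Merge every run of tokens covered by the same entity span into a single token.
def mergeCovered(source: String, tokens: Seq[Tok], spans: Seq[Span]): Seq[Tok] = {
  def covering(t: Tok): Option[Span] =
    spans.find(s => s.start <= t.start && t.end <= s.end)

  tokens.foldLeft(Vector.empty[Tok]) { (acc, tok) =>
    (acc.lastOption, covering(tok)) match {
      // The previous output token lies in the same span: extend it over this token.
      case (Some(prev), Some(span)) if span.start <= prev.start && prev.end <= span.end =>
        acc.init :+ Tok(source.substring(prev.start, tok.end), prev.start, tok.end)
      case _ =>
        acc :+ tok
    }
  }
}

val source = "beskrevs av Sandór Mahunka 1982"
// Whitespace tokenisation with character offsets, for this example only.
val toks = "\\S+".r.findAllMatchIn(source).map(m => Tok(m.matched, m.start, m.end)).toVector
val name = "Sandór Mahunka"
val spans = Seq(Span(source.indexOf(name), source.indexOf(name) + name.length))

val merged = mergeCovered(source, toks, spans).map(_.text).toList
```

Building a new token sequence avoids removing nodes from a document while iterating over it, at the cost of losing any annotations attached to the dropped tokens.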