# Prometheus - Feature Notebook

## Getting Set Up with Jupyter

`sbt publishM2`

Rerun this whenever the project has been updated, and also remove the stale artifacts and the local metastore:

`rm -rf ~/.m2/repository/sonymobile`

`rm -rf metastore_db`

(*Important*: run `kernel -> restart all` afterwards, otherwise the previously loaded version of the `prometheus-relation-model` lib will be used again, which is very confusing! Note: not sure if this is correct.)

`prometheus-relation-model` requires [docforia](https://github.com/marcusklang/docforia), which must first be cloned with git and installed with `mvn install` (unfortunately it is not yet available from a Maven repository).

## Data
This notebook requires the annotated herd data.

In [1]:
%AddDeps se.lth.cs.nlp docforia 1.0-SNAPSHOT --transitive --repository file:/Users/axel/.m2/repository

Marking se.lth.cs.nlp:docforia:1.0-SNAPSHOT for download
Preparing to fetch from:
-> file:/var/folders/28/5mj8jbrd13z_nssxk35nd25c0000gn/T/toree_add_deps8234528462432356968/
-> file:/Users/axel/.m2/repository
-> https://repo1.maven.org/maven2
-> New file at /Users/axel/.m2/repository/se/lth/cs/nlp/docforia/1.0-SNAPSHOT/docforia-1.0-SNAPSHOT.jar
-> New file at /Users/axel/.m2/repository/com/fasterxml/jackson/core/jackson-annotations/2.7.0/jackson-annotations-2.7.0.jar
-> New file at /Users/axel/.m2/repository/junit/junit/4.11/junit-4.11.jar
-> New file at /Users/axel/.m2/repository/com/fasterxml/jackson/core/jackson-core/2.7.3/jackson-core-2.7.3.jar
-> New file at /Users/axel/.m2/repository/org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3.jar
-> New file at /Users/axel/.m2/repository/com/google/protobuf/protobuf-java/2.6.1/protobuf-java-2.6.1.jar
-> New file at /Users/axel/.m2/repository/com/fasterxml/jackson/core/jackson-databind/2.7.3/jackson-databind-2.7.3.jar
-> New file at /Users/a

In [2]:
import java.io.IOError

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.{Accumulator, SparkContext}
import se.lth.cs.docforia.Document
import se.lth.cs.docforia.memstore.MemoryDocumentIO

def readCorpus(
    file: String,
    sampleSize: Double = 1.0)
    (implicit sqlContext: SQLContext, sc: SparkContext): RDD[Document] = {

    var df: DataFrame = sqlContext.read.parquet(file)
    df = df.where(df("type").equalTo("ARTICLE"))

    val ioErrors: Accumulator[Int] = sc.accumulator(0, "IO_ERRORS")

    // We might need to filter for only articles here, but that wouldn't be a generalized solution.
    val docs = (if(sampleSize == 1.0) df else df.sample(false, sampleSize)).flatMap{row =>
      try {
        val doc: Document = MemoryDocumentIO.getInstance().fromBytes(row.getAs(5): Array[Byte])
        List(doc)
      } catch {
        case e:IOError =>
          ioErrors.add(1)
          List()
      }
    }

    docs
  }
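
The decode-or-skip pattern in `readCorpus` can be tried out without Spark or docforia. The following is only a minimal sketch: `decode` is a hypothetical stand-in for `MemoryDocumentIO.fromBytes`, and the failure count plays the role of the `IO_ERRORS` accumulator. One thing to keep in mind for the real version is that Spark may apply accumulator updates made inside a transformation more than once if a task is re-executed, so the count is best treated as approximate.

```scala
import scala.util.{Success, Try}

// Hypothetical decoder standing in for MemoryDocumentIO.fromBytes:
// it fails on empty input and otherwise reads the bytes as UTF-8 text.
def decode(bytes: Array[Byte]): String =
  if (bytes.isEmpty) throw new IllegalArgumentException("empty record")
  else new String(bytes, "UTF-8")

// Decode every row, keep the successes and count the failures.
def decodeAll(rows: Seq[Array[Byte]]): (Seq[String], Int) = {
  val results = rows.map(r => Try(decode(r)))
  val decoded = results.collect { case Success(d) => d }
  val failures = results.count(_.isFailure)
  (decoded, failures)
}

val sampleRows: Seq[Array[Byte]] =
  Seq("a".getBytes("UTF-8"), Array.empty[Byte], "b".getBytes("UTF-8"))
val (decodedDocs, failedRows) = decodeAll(sampleRows)
```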

In [3]:
val docs = readCorpus("../../data/wikipedia-corpus-herd/en/")(sqlContext, sc)

In [4]:
docs.count()

21195

# Dependency Features

In [5]:
val doc = docs.take(3)(2)

In [6]:
doc

Hamblin Bay is a bay of Lake Mead on the Colorado River, to the east of Las Vegas and Callville Bay in the U.S. state of Nevada. It lies between Sandy Cove which lies to the west and Rotary Cove and Rufus Cove which lie to the east. Hamblin Bay is also a fault of the same name in the vicinity, which "strikes at a low angle to the easternmost mapped branch of the Las Vegas Shear Zone".

Name. It is named after Mormon missionary William Hamblin.

References. Geological Survey Professional Paper. U.S. Government Printing Office. 1974. p. 3. American Mining Congress (1977). Proceedings of the First Annual William T. Pecora Memorial Symposium, October 1975, Sioux Falls, South Dakota. U.S. Government Printing Office. p. 259. Carlson, Helen S. (1 January 19...

In [67]:
import se.lth.cs.docforia.graph.text.{Sentence, Token, DependencyRelation, Mention}
import se.lth.cs.docforia.query.QueryCollectors
import scala.collection.JavaConverters._

val DR = DependencyRelation.`var`()
val T = Token.`var`()

val tokens = doc.nodes(classOf[Token]).asScala.toSeq.toList
%Truncation off
tokens.zipWithIndex

Output will NOT be truncated


List((Hamblin,0), (Bay,1), (is,2), (a,3), (bay,4), (of,5), (Lake,6), (Mead,7), (on,8), (the,9), (Colorado,10), (River,11), (,,12), (to,13), (the,14), (east,15), (of,16), (Las,17), (Vegas,18), (and,19), (Callville,20), (Bay,21), (in,22), (the,23), (U.S.,24), (state,25), (of,26), (Nevada,27), (.,28), (It,29), (lies,30), (between,31), (Sandy,32), (Cove,33), (which,34), (lies,35), (to,36), (the,37), (west,38), (and,39), (Rotary,40), (Cove,41), (and,42), (Rufus,43), (Cove,44), (which,45), (lie,46), (to,47), (the,48), (east,49), (.,50), (Hamblin,51), (Bay,52), (is,53), (also,54), (a,55), (fault,56), (of,57), (the,58), (same,59), (name,60), (in,61), (the,62), (vicinity,63), (,,64), (which,65), (",66), (strikes,67), (at,68), (a,69), (low,70), (angle,71), (to,72), (the,73), (easternmost,74), (mapped,75), (branch,76), (of,77), (the,78), (Las,79), (Vegas,80), (Shear,81), (Zone,82), (",83), (.,84), (Name,85), (.,86), (It,87), (is,88), (named,89), (after,90), (Mormon,91), (missionary,92), (William,

## Lemmatization

In [68]:
tokens.map(_.getLemma)

List(Hamblin, Bay, be, a, bay, of, Lake, Mead, on, the, Colorado, River, ,, to, the, east, of, Las, Vegas, and, Callville, Bay, in, the, U.S., state, of, Nevada, ., it, lie, between, Sandy, Cove, which, lie, to, the, west, and, rotary, cove, and, Rufus, Cove, which, lie, to, the, east, ., Hamblin, Bay, be, also, a, fault, of, the, same, name, in, the, vicinity, ,, which, ``, strike, at, a, low, angle, to, the, easternmost, map, branch, of, the, Las, Vegas, Shear, Zone, '', ., name, ., it, be, name, after, Mormon, missionary, William, Hamblin, ., reference, ., Geological, Survey, Professional, Paper, ., U.S., Government, Printing, Office, ., 1974, ., p., 3, ., american, mining, Congress, -lrb-, 1977, -rrb-, ., Proceedings, of, the, First, annual, William, T., Pecora, Memorial, Symposium, ,, October, 1975, ,, Sioux, Falls, ,, South, Dakota, ., U.S., Government, Printing, Office, ., p., 259, ., Carlson, ,, Helen, S., -lrb-, 1, January, 1974, -rrb-, ., Nevada, Place, Names, :, a, geographi

## Dependency Path

In [61]:
val deps = tokens(0).connectedEdges(classOf[DependencyRelation]).toList.asScala

In [62]:
deps.map(d => s"${d.getTail} -- ${d.getRelation} --> ${d.getHead}")

ArrayBuffer(Hamblin -- compound --> Bay)

In [69]:
// Create a dependency path between these two entities.
val ent1 = tokens(52)
val ent2 = tokens(80)
(ent1, ent2)

(Bay,Vegas)

In [70]:
import scala.collection.mutable

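// Depth-first search over the dependency edges, treated as undirected.
// Returns the chain of relations from `current` to `target`, or an empty Seq if there is none.
def dfs(current: Token, visited: Set[Token], chain: Seq[DependencyRelation], target: Token): Seq[DependencyRelation] = {
    if(current == target) {
        chain
    }else if(visited.contains(current)){
        Seq()
    }else{
        val deps = current.connectedEdges(classOf[DependencyRelation]).toList.asScala
        val newVisited = visited + current
        // Follow each edge in both directions; the endpoint equal to `current` is cut off by the visited check.
        deps.flatMap(d => {
            dfs(d.getHead[Token], newVisited, chain :+ d, target) ++ dfs(d.getTail[Token], newVisited, chain :+ d, target)
        })
    }
}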
val path = dfs(ent1, Set(), Seq(), ent2)
path.map(d => (d, d.getHead[Token].getStart < d.getTail[Token].getStart)).map{
    case (d, dir) => 
        if (dir) s"${d.getHead} <-- ${d.getRelation} -- ${d.getTail}" else s"${d.getTail} -- ${d.getRelation} --> ${d.getHead}" 
}

ArrayBuffer(Bay -- nsubj --> fault, fault <-- nmod -- name, name <-- acl:relcl -- strikes, strikes <-- nmod -- angle, angle <-- nmod -- branch, branch <-- nmod -- Zone, Vegas -- compound --> Zone)
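
The path search in the `dfs` cell can be reproduced on a toy graph without docforia. This is a sketch under assumptions: `Tok` and `Dep` are hypothetical stand-ins for `Token` and `DependencyRelation`, and it uses a breadth-first search instead of the depth-first search above, so it returns a single shortest path even when the graph is not a tree.

```scala
// Toy stand-ins for docforia's Token and DependencyRelation (hypothetical types).
case class Tok(id: Int, text: String)
case class Dep(tail: Tok, relation: String, head: Tok)

// Breadth-first search over the edges, treated as undirected.
// Returns the edges on a shortest path from `start` to `target`, or None.
def shortestPath(edges: Seq[Dep], start: Tok, target: Tok): Option[Seq[Dep]] = {
  @annotation.tailrec
  def loop(frontier: List[(Tok, Vector[Dep])], visited: Set[Tok]): Option[Seq[Dep]] =
    frontier match {
      case Nil => None
      case (tok, chain) :: _ if tok == target => Some(chain)
      case (tok, chain) :: rest =>
        // Expand to every unvisited neighbour, extending the chain by the edge used.
        val next = for {
          d <- edges.toList
          if d.head == tok || d.tail == tok
          other = if (d.head == tok) d.tail else d.head
          if !visited.contains(other)
        } yield (other, chain :+ d)
        loop(rest ++ next, visited ++ next.map(_._1))
    }
  loop(List((start, Vector.empty[Dep])), Set(start))
}

val hamblin = Tok(0, "Hamblin")
val bay = Tok(1, "Bay")
val fault = Tok(2, "fault")
val name = Tok(3, "name")
val toyEdges = Seq(
  Dep(hamblin, "compound", bay),
  Dep(bay, "nsubj", fault),
  Dep(name, "nmod", fault))

val toyPath = shortestPath(toyEdges, hamblin, name).getOrElse(Seq.empty[Dep])
val rendered = toyPath.map(d => s"${d.tail.text} -- ${d.relation} --> ${d.head.text}")
```

Each edge is rendered tail-to-head here; the cell above additionally flips the arrow depending on token order in the text.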

## Dependency Window
Collects the dependency relations that connect an entity to tokens outside the dependency path.

In [71]:
val visited = path.flatMap(p => {Seq(p.getHead[Token], p.getTail[Token])}).toSet
visited

Set(Vegas, Bay, name, branch, Zone, strikes, fault, angle)

In [81]:
def entityWindow(entity: Token, dependencyPath: Seq[DependencyRelation]): Set[DependencyRelation] = {
    val excluded: Set[Token] = dependencyPath.flatMap(p => {Seq(p.getHead[Token], p.getTail[Token])}).toSet
    entity.connectedEdges(classOf[DependencyRelation]).toList.asScala.filter(d => {
     (!excluded.contains(d.getTail[Token]) || !excluded.contains(d.getHead[Token]))
    }).toSet
}

val window = entityWindow(ent1, path)
window.map(d => (d, d.getHead[Token].getStart < d.getTail[Token].getStart)).map{
    case (d, dir) => 
        if (dir) s"${d.getHead} <-- ${d.getRelation} -- ${d.getTail}" else s"${d.getTail} -- ${d.getRelation} --> ${d.getHead}" 
}

Set(Hamblin -- compound --> Bay)
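
The same filter can be checked on a toy graph. A minimal sketch, assuming hypothetical `Word` and `Arc` types in place of docforia's `Token` and `DependencyRelation`, with the path passed in explicitly:

```scala
// Toy stand-ins for docforia's Token and DependencyRelation (hypothetical types).
case class Word(id: Int, text: String)
case class Arc(tail: Word, relation: String, head: Word)

// Keep the arcs touching `entity` that have at least one endpoint outside the path.
def toyWindow(entity: Word, arcs: Seq[Arc], path: Seq[Arc]): Set[Arc] = {
  val onPath: Set[Word] = path.flatMap(a => Seq(a.head, a.tail)).toSet
  arcs
    .filter(a => a.head == entity || a.tail == entity)
    .filter(a => !onPath.contains(a.head) || !onPath.contains(a.tail))
    .toSet
}

val wHamblin = Word(0, "Hamblin")
val wBay = Word(1, "Bay")
val wFault = Word(2, "fault")
val wName = Word(3, "name")
val compound = Arc(wHamblin, "compound", wBay)
val nsubj = Arc(wBay, "nsubj", wFault)
val nmod = Arc(wName, "nmod", wFault)

// The path from "Bay" to "fault" is the single nsubj arc, so only the compound arc is left in the window.
val bayWindow = toyWindow(wBay, Seq(compound, nsubj, nmod), Seq(nsubj))
```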

## NE tag

In [17]:
import se.lth.cs.docforia.graph.disambig.NamedEntityDisambiguation
import se.lth.cs.docforia.graph.text.{NamedEntity}
val neds = doc.nodes(classOf[NamedEntityDisambiguation]).asScala.toSeq.toList
neds

List(Sandór Mahunka, Microtegeus, Microtegeidae, Catalogue of Life, PDF, Catalogue of Life, spindeldjur)

In [13]:
val N = NamedEntity.`var`()
val ned = doc.select(N).where(N).coveredBy(tokens(0))


## The Main Verb

## Chunking

In [38]:
val groups = docs.map(doc => {
    val NED = NamedEntityDisambiguation.`var`()
    val T = Token.`var`()
    val nedGroups = doc.select(NED, T).where(T).coveredBy(NED)
    .stream()
    .collect(QueryCollectors.groupBy(doc, NED).values(T).collector())
    .asScala
    .toList
    
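    // Merge the tokens of each multi-token entity into a single token spanning the whole entity.
    nedGroups.foreach(pg => {
        val values = pg.nodes(T).asScala
        if(values.size > 1){
            val head = values.head
            val last = values.last
            // Stretch the first token over the full range, then drop the remaining tokens.
            head.setRange(head.getStart, last.getEnd)
            values.tail.foreach(doc.remove)
        }
    })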
    doc
}).take(1)
groups(0).nodes(classOf[Token]).asScala.toSeq.toList

List(Microtegeus, rugosus, är, en, kvalsterart, som, beskrevs, av, Sandór Mahunka, 1982, ., Microtegeus, rugosus, ingår, i, släktet, Microtegeus, och, familjen, Microtegeidae, ., Inga, underarter, finns, listade, i, Catalogue of Life, ., Källor, ., (, 2007, ), ,, PDF, ,, Subías, 2007, :, World, Oribatida, catalog, ., Bisby, F, ., A, ., ,, Roskov, Y, ., R, ., ,, Orrell, T, ., M, ., ,, Nicolson, D, ., ,, Paglinawan, L, ., E, ., ,, Bailly, N, ., ,, Kirk, P, ., M, ., ,, Bourgoin, T, ., ,, Baillargeon, G, ., ,, Ouvrard, D, ., (, red, ., ), (, 1, mars, ), ., ”, Species, 2000, &, ITIS, Catalogue of Life, :, 2011, Annual, Checklist, ., ”, ., Species, 2000, :, Reading, ,, UK, ., http://www.catalogueoflife.org/annual-checklist/2011/search/all/key...
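
The merging step can be illustrated on plain character offsets. This is only a sketch, assuming a hypothetical `Span` type and entity spans that line up with token boundaries; it does not touch docforia's document model.

```scala
// Hypothetical simplified span: character offsets into the text.
case class Span(start: Int, end: Int)

// Replace every run of tokens covered by an entity span with one span covering the entity.
def mergeEntityTokens(tokens: Seq[Span], entities: Seq[Span]): Seq[Span] = {
  def covering(t: Span): Option[Span] =
    entities.find(e => e.start <= t.start && t.end <= e.end)
  // Tokens inside the same entity map to the same span, so distinct collapses them.
  tokens.map(t => covering(t).getOrElse(t)).distinct
}

val text = "in Catalogue of Life ."
val tokenSpans = Seq(Span(0, 2), Span(3, 12), Span(13, 15), Span(16, 20), Span(21, 22))
val entitySpans = Seq(Span(3, 20)) // "Catalogue of Life"

val merged = mergeEntityTokens(tokenSpans, entitySpans)
val mergedTexts = merged.map(s => text.substring(s.start, s.end))
```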

In [1]:
case class DependencyPath(dependency: String, word: String, direction: java.lang.Boolean)

case class TrainingDataPoint(
  relationId: String,
  relationName: String,
  relationClass: Long,
  pointType: String,
  wordFeatures: Seq[String],
  posFeatures: Seq[String],
  wordsBetween: Seq[String],
  posBetween: Seq[String],
  ent1PosTags: Seq[String],
  ent2PosTags: Seq[String],
  ent1Type: String,
  ent2Type: String,
  dependencyPath: Seq[DependencyPath],
  ent1DepWindow: Seq[DependencyPath],
  ent2DepWindow: Seq[DependencyPath])

In [None]:
val file = "/Users/axel/utveckling/prometheus/data/wip/relation_model/sv/training_sentences/part-r-00207-4e1f6240-21e9-407f-b1ff-dcc30efd740c.gz.parquet"


## Coref Resolution

In [2]:
%AddDeps org.scalaj scalaj-http_2.10 2.3.0 --transitive
%AddDeps se.lth.cs.nlp docforia 1.0-SNAPSHOT --transitive --repository file:/Users/axel/.m2/repository
import scalaj.http._
import se.lth.cs.docforia.Document
import se.lth.cs.docforia.memstore.MemoryDocument
import se.lth.cs.docforia.graph.disambig.NamedEntityDisambiguation
import se.lth.cs.docforia.graph.text.{CoreferenceChainEdge, CoreferenceChain, CoreferenceMention, NamedEntity, Sentence, Token}
import se.lth.cs.docforia.memstore.MemoryDocumentIO
import se.lth.cs.docforia.query.{QueryCollectors, StreamUtils}
import scala.collection.JavaConverters._

Marking org.scalaj:scalaj-http_2.10:2.3.0 for download
Preparing to fetch from:
-> file:/var/folders/28/5mj8jbrd13z_nssxk35nd25c0000gn/T/toree_add_deps8234528462432356968/
-> https://repo1.maven.org/maven2
-> New file at /var/folders/28/5mj8jbrd13z_nssxk35nd25c0000gn/T/toree_add_deps8234528462432356968/https/repo1.maven.org/maven2/org/scalaj/scalaj-http_2.10/2.3.0/scalaj-http_2.10-2.3.0.jar
Marking se.lth.cs.nlp:docforia:1.0-SNAPSHOT for download
Preparing to fetch from:
-> file:/var/folders/28/5mj8jbrd13z_nssxk35nd25c0000gn/T/toree_add_deps8234528462432356968/
-> file:/Users/axel/.m2/repository
-> https://repo1.maven.org/maven2
-> New file at /Users/axel/.m2/repository/se/lth/cs/nlp/docforia/1.0-SNAPSHOT/docforia-1.0-SNAPSHOT.jar
-> New file at /Users/axel/.m2/repository/com/fasterxml/jackson/core/jackson-annotations/2.7.0/jackson-annotations-2.7.0.jar
-> New file at /Users/axel/.m2/repository/com/fasterxml/jackson/core/jackson-databind/2.7.3/jackson-databind-2.7.3.jar
-> New file at 

In [3]:
def annotate(input: String, lang: String, conf: String): Either[String, Document] = {
    val CONNECTION_TIMEOUT = 2000
    val READ_TIMEOUT = 10000
    val vildeURL = s"http://vilde.cs.lth.se:9000/$lang/$conf/api/json"
    try {
      val response: HttpResponse[String] = Http(vildeURL)
        .timeout(connTimeoutMs = CONNECTION_TIMEOUT, readTimeoutMs = READ_TIMEOUT)
        .postData(input)
        .header("content-type", "application/json; charset=UTF-8")
        .asString

      val docJson = response.body
      Right(MemoryDocumentIO.getInstance().fromJson(docJson))
    } catch {
      case e: java.net.SocketTimeoutException => Left(e.getMessage)
    }
  }

In [3]:
def resolveCoref(doc: Document): Document = {
    val T = Token.`var`()
    val M = CoreferenceMention.`var`()
    val NED = NamedEntityDisambiguation.`var`()

    val chains = doc.select(T, M, NED).where(T).coveredBy(M).where(NED).coveredBy(M)
      .stream()
      .collect(QueryCollectors.groupBy(doc, M, NED).values(T).collector())
      .asScala
      .map(pg => {
        val mention = pg.key(M)
        val corefs = mention
          .connectedEdges(classOf[CoreferenceChainEdge]).asScala
          .flatMap(edge => edge.getHead[CoreferenceChain].connectedNodes(classOf[CoreferenceMention]).asScala)

        val ned = pg.key(NED)
        // Copy the entity's disambiguation onto every non-proper mention (e.g. pronouns) in its chain.
        val mentions = corefs.filter(m => m.getProperty("mention-type") != "PROPER").map(m => {
          val newNed = new NamedEntityDisambiguation(doc)
            .setRange(m.getStart, m.getEnd)
            .setIdentifier(ned.getIdentifier)
            .setScore(ned.getScore)
          if (ned.hasProperty("LABEL"))
            newNed.putProperty("LABEL", ned.getProperty("LABEL"))
          m
        })
          
        println(s"$ned -> $mentions")
        println("*"*20)
        (ned, mentions)
      })
      .toList
      // println(chains.mkString(" "))
    
    // Print all the NEDs in the doc to check that it worked
    doc.select(T, NED).where(T).coveredBy(NED)
        .stream()
        .collect(QueryCollectors.groupBy(doc, NED).values(T).collector())
        .asScala
        .foreach(pg => {
            val ned = pg.key(NED)
            println(s"ned: $ned ${ned.getIdentifier}")
        })

    doc
}

In [4]:
val str = "Barack Obama was a detective married to Michelle. He became the president. He was elected by the majority of the people."
val annDoc = annotate(str, "en", "herd")
annDoc match {
    case Right(doc) => {
        val resolvedDoc = resolveCoref(doc)
    }
    case Left(e) => println(s"error $e")
}

Barack Obama -> List(He, He)
********************
ned: Barack Obama urn:wikidata:Q76
ned: Michelle urn:wikidata:Q13133
ned: He urn:wikidata:Q76
ned: He urn:wikidata:Q76
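
The propagation that `resolveCoref` performs can be sketched on plain data. Everything here is an assumption for illustration: `ChainMention` is a hypothetical type, and `"PRONOMINAL"` is a made-up mention type standing for anything that is not `"PROPER"`.

```scala
// Hypothetical simplified mention: surface text, mention type and an optional entity id.
case class ChainMention(text: String, mentionType: String, entityId: Option[String])

// Within one coreference chain, copy the first known entity id
// onto every non-proper mention that has no id of its own.
def propagate(chain: Seq[ChainMention]): Seq[ChainMention] = {
  val id = chain.flatMap(_.entityId).headOption
  chain.map { m =>
    if (m.mentionType != "PROPER" && m.entityId.isEmpty) m.copy(entityId = id) else m
  }
}

val obamaChain = Seq(
  ChainMention("Barack Obama", "PROPER", Some("urn:wikidata:Q76")),
  ChainMention("He", "PRONOMINAL", None),
  ChainMention("He", "PRONOMINAL", None))

val resolvedChain = propagate(obamaChain)
```

As in the output above, both occurrences of "He" end up with the identifier of "Barack Obama".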
