# Named Entity Recognition Pipeline

The pipeline takes the URL of an RSS feed, fetches the title and description of each article in the feed, detects the named entities (NEs) with a simple model, and displays them sorted by frequency of occurrence.

### Versions
Tested with:
* Almond 0.6.0
* Ammonite 1.6.7
* Scala library version **2.11.12** -- Copyright 2002-2017, LAMP/EPFL
* Java 1.8.0_282

For more information, go to Help -> About Scala Kernel.

## 1. Get the text

### 1.1 Import libraries

In [1]:
// Equivalent of adding dependencies to maven or sbt files
// For example, to add "org.scalaj" %% "scalaj-http" % "2.4.2" 
import $ivy.`org.scalaj::scalaj-http:2.4.2`
// "org.scala-lang.modules" %% "scala-xml" % "1.3.0"
import $ivy.`org.scala-lang.modules::scala-xml:1.3.0`

In [2]:
import scalaj.http.{Http, HttpResponse}
import scala.xml.XML

In [3]:
import $ivy.`org.json4s::json4s-jackson:3.4.0`

In [4]:
import org.json4s.JsonDSL._
import org.json4s._

In [5]:
import org.json4s.jackson.JsonMethods._

In [6]:
import scala.util.matching.Regex

In [7]:
import scala.collection.mutable.ArrayBuffer

### 1.2 Get the text from the RSS feed

We make an HTTP request, which returns an `HttpResponse` instance. The `body` attribute of the `HttpResponse` holds the feed text in XML format. The XML is then parsed to extract the `title` and `description` fields.

In [8]:
class ParseRSS {
    // Fetch the feed and return the raw response body
    def request(url: String): String = {
        val response: HttpResponse[String] =
            Http(url).timeout(connTimeoutMs = 2000, readTimeoutMs = 5000).asString
        response.body
    }

    // Extract the text of the title and description of every <item>
    def parseURL(responsebody: String): Seq[String] = {
        val xml = XML.loadString(responsebody)
        (xml \\ "item").map { item =>
            (item \ "title").text + " " + (item \ "description").text
        }
    }
}

class ParseJSON extends ParseRSS {
    implicit val formats = DefaultFormats

    // Parse a Reddit feed in JSON; `responsebody` is the full response body
    override def parseURL(responsebody: String): Seq[String] = {
        // Pattern that matches URLs
        val pattern = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]".r
        // Response body with the URLs removed
        val jsonString = pattern.replaceAllIn(responsebody, "").trim()
        // Parse the JSON and extract each post's "data" object as a Map
        val result = (parse(jsonString) \ "data" \ "children" \ "data")
            .extract[List[Map[String, Any]]]
        // Keep only the "title" and "selftext" entries of each post, cast to String
        val filteredmap = result.map(
            _.filter(t => t._1 == "title" || t._1 == "selftext")
                .mapValues(_.asInstanceOf[String])
                .values)
        // Flatten the per-post values into a single Seq[String]
        filteredmap.flatten.toSeq
    }
}

defined class ParseRSS
defined class ParseJSON
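
To make the extraction step concrete, here is a minimal sketch that applies the same `\\` and `\` selectors used in `ParseRSS.parseURL` to a small inline feed. The feed content and the names `sampleFeed` and `texts` are made up for illustration, and the snippet assumes scala-xml is on the classpath, as in the import cells above.

```scala
import scala.xml.XML

// Hypothetical two-item feed, used only for illustration
val sampleFeed =
  """<rss version="2.0"><channel>
    |<title>Sample feed</title>
    |<item><title>First headline</title><description>First summary.</description></item>
    |<item><title>Second headline</title><description>Second summary.</description></item>
    |</channel></rss>""".stripMargin

val xml = XML.loadString(sampleFeed)

// `\\ "item"` selects every <item> descendant of the root.
// `\ "title"` selects direct children only, so the channel-level
// <title> is not mixed into the article texts.
val texts: Seq[String] = (xml \\ "item").map { item =>
  (item \ "title").text + " " + (item \ "description").text
}

texts.foreach(println)
```

Each element of `texts` is one article, with its title and description joined by a space, which is the shape the later steps expect.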

In [9]:
val rssmodel = new ParseRSS
val jsonmodel = new ParseJSON

rssmodel: ParseRSS = ammonite.$sess.cmd7$Helper$ParseRSS@5dc597ac
jsonmodel: ParseJSON = ammonite.$sess.cmd7$Helper$ParseJSON@75edf494

In [10]:
val rssresponsebody = rssmodel.request("https://www.chicagotribune.com/arcio/rss/category/sports/?query=display_date:[now-2d+TO+now]&sort=display_date:desc")
val jsonresponsebody = jsonmodel.request("https://www.reddit.com/r/Android/hot/.json?count=10")

rssresponsebody: String = """<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"><channel><title>Chicago Tribune</title><link>https://www.chicagotribune.com</link><language>en-US</language><copyright>© 2021 Chicago Tribune</copyright><atom:link href="https://www.chicagotribune.com/arcio/rss/category/sports/?query=display_date:%5Bnow-2d+TO+now%5D&amp;sort=display_date:desc" rel="self" type="application/rss+xml"/><description>Chicago Tribune News Feed</description><lastBuildDate>Tue, 01 Jun 2021 22:35:52 +0000</lastBuildDate><ttl>1</ttl><sy:updatePeriod>hourly</sy:updatePeriod><sy:updateFrequency>1</sy:updateFrequency><item><title>NCAA athletes in Illinois could sign endorsement deals under bill passed by state lawmaker

In [11]:
val parsedrss = rssmodel.parseURL(rssresponsebody)
val parsedjson = jsonmodel.parseURL(jsonresponsebody)

parsedrss: Seq[String] = List(
  "NCAA athletes in Illinois could sign endorsement deals under bill passed by state lawmakers Illinois college athletes would be able to hire agents and sign endorsement deals starting this summer under legislation passed by state lawmakers.",
  "Jos\u00e9 Abreu is named the American League Player of the Week after the Chicago White Sox first baseman had 10 RBIs in 7 games Chicago White Sox first baseman Jos\u00e9 Abreu earned American League Player of the Week honors for the Week of May 24-30 on Tuesday. It\u2019s his sixth career weekly honor.",
  "Column: The future is now for Adbert Alzolay, the exuberant Chicago Cubs pitcher fast becoming a fan favorite The Chicago Cubs haven't had a starting pitching prospect break through since the rebuild began in 2012. Adbert Alzolay is ready to stop the streak.",
  "Japan\u2019s vaccine push ahead of the Olympics looks to be too late and

In [18]:
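class FeedService[T <: ParseRSS] {
    // Subscribed feeds: each URL paired with the parser that handles it
    var urls: Seq[(String, T)] = Seq[(String, T)]()

    // Expand the "%s" placeholder in urlTemplate once per parameter and register each URL
    def subscribe(urlTemplate: String, params: Seq[String], parser: T): Unit = {
        val pat = "%s".r
        val urlparam = params.map { s => pat.replaceAllIn(urlTemplate, s) }.map { url => (url, parser) }
        urls = urls ++ urlparam
    }

    // Fetch every subscribed URL and parse the response body with its parser.
    // parseURL expects the response body, not the URL, so request() must run first.
    def getText(): Seq[String] = {
        val texts = urls.map { case (url, parser) =>
            parser.parseURL(parser.request(url))
        }
        texts.map(_.mkString(","))
    }
}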


defined class FeedService

In [21]:
val feed = new FeedService[ParseRSS]

feed: FeedService[ParseRSS] = ammonite.$sess.cmd17$Helper$FeedService@45b5974e

In [22]:
feed.subscribe("https://www.chicagotribune.com/arcio/rss/category/%s/?query=display_date:[now-2d+TO+now]&sort=display_date:desc",Seq("sports","business"),rssmodel)

In [23]:
feed.urls

res22: Seq[(String, ParseRSS)] = List(
  (
    "https://www.chicagotribune.com/arcio/rss/category/sports/?query=display_date:[now-2d+TO+now]&sort=display_date:desc",
    ammonite.$sess.cmd13$Helper$ParseRSS@523dfd90
  ),
  (
    "https://www.chicagotribune.com/arcio/rss/category/business/?query=display_date:[now-2d+TO+now]&sort=display_date:desc",
    ammonite.$sess.cmd13$Helper$ParseRSS@523dfd90
  )
)

In [21]:
feed.getText


## 2. Detect the named entities

### 2.1 Create the model

The **model** is just the function `getNEs`, which takes a list of texts.
Each text is split into words on spaces, and a word is considered a named entity if it starts with an uppercase letter, is longer than one character, and is not a stopword.

The following code lists the punctuation symbols and some common English words that are removed from the text.

In [13]:
class NERmodel{
     val STOPWORDS = Seq (
    "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you",
    "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she",
    "her", "hers", "herself", "it", "its", "itself", "they", "them", "your",
    "their", "theirs", "themselves", "what", "which", "who", "whom",
    "this", "that", "these", "those", "am", "is", "are", "was", "were",
    "be", "been", "being", "have", "has", "had", "having", "do", "does",
    "did", "doing", "a", "an", "the", "and", "but", "if", "or",
    "because", "as", "until", "while", "of", "at", "by", "for", "with",
    "about", "against", "between", "into", "through", "during", "before",
    "after", "above", "below", "to", "from", "up", "down", "in", "out",
    "off", "over", "under", "again", "further", "then", "once", "here",
    "there", "when", "where", "why", "how", "all", "any", "both", "each",
    "few", "more", "most", "other", "some", "such", "no", "nor", "not",
    "only", "own", "same", "so", "than", "too", "very", "s", "t", "can",
    "will", "just", "don", "should", "now", "on",
    // Contractions without '
    "im", "ive", "id", "youre", "youd", "youve",
    "hes", "hed", "shes", "shed", "itd", "were", "wed", "weve",
    "theyre", "theyd", "theyve",
    "shouldnt", "couldnt", "mustnt", "cant", "wont",
    // Common uppercase words
    "hi", "hello"
)
    val punctuationSymbols = ".,()!?;:'`´\n"
    val punctuationRegex = "\\" + punctuationSymbols.split("").mkString("|\\")
   
    // Extract the candidate named entities from a single text
    def getNEsSingle(text: String): Seq[String] =
        text.replaceAll(punctuationRegex, "").split(" ")
            .filter { word: String => word.length > 1 &&
                      Character.isUpperCase(word.charAt(0)) &&
                      !STOPWORDS.contains(word.toLowerCase) }.toSeq

    def getNEs(textList: Seq[String]): Seq[Seq[String]] = textList.map(getNEsSingle)

    // Apply the pipeline
    def ExtractNEs(rssText: Seq[String]): Seq[Seq[String]] = {
        getNEs(rssText)
    }

    // Count each entity and sort by descending frequency
    def SortNEs(result: Seq[Seq[String]]): List[(String, Int)] = {
        val counts: Map[String, Int] = result.flatten
            .foldLeft(Map.empty[String, Int]) {
                (count, word) => count + (word -> (count.getOrElse(word, 0) + 1)) }
        counts.toList
            .sortBy(_._2)(Ordering[Int].reverse)
    }
}

defined class NERmodel
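
The capitalization heuristic can be tried on a single sentence with the following sketch. It is a simplified stand-in for `getNEsSingle`, not the class itself: the reduced `stopwords` set, the character-class regex, and the names `capitalizedWords`, `sample` and `found` are assumptions chosen for illustration.

```scala
// Reduced stopword list (assumption: only the words needed for this example)
val stopwords = Set("the", "a", "in", "is", "and")

// Strip a few punctuation symbols, split on spaces, and keep words that
// are longer than one character, start with an uppercase letter,
// and are not stopwords.
def capitalizedWords(text: String): Seq[String] =
  text.replaceAll("[.,()!?;:]", "")
    .split(" ")
    .filter { w =>
      w.length > 1 &&
      Character.isUpperCase(w.charAt(0)) &&
      !stopwords.contains(w.toLowerCase)
    }
    .toSeq

val sample = "The Chicago Cubs play in Chicago, and Monday is a game day."
val found = capitalizedWords(sample)

println(found.mkString(", "))
```

Here "The" is dropped because its lowercase form is a stopword, while "Monday" is kept even though it is not really an entity. That kind of false positive is inherent to the heuristic, and it also shows up in the notebook results as words such as "Note" or "Please".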

In [14]:
val model = new NERmodel

model: NERmodel = ammonite.$sess.cmd12$Helper$NERmodel@529d6a2c

### 2.2 Apply the "model" to the data

In [15]:
val resultjson = model.ExtractNEs(parsedjson)
val resultrss = model.ExtractNEs(parsedrss)

resultjson: Seq[Seq[String]] = List(
  ArrayBuffer(
    "Note",
    "Join",
    "Ctrl-F",
    "Note",
    "Join",
    "IRC",
    "Telegram",
    "Monday",
    "Please",
    "Feel",
    "Home",
    "Smartphones",
    "Top",
    "Phones"
  ),
  ArrayBuffer("Weekly", "Superthread", "Jun"),
  ArrayBuffer(),
  ArrayBuffer("HTC", "Pixel", "XL"),
  ArrayBuffer(),
  ArrayBuffer("It\u2019s", "Flagship", "Exynos", "AMD", "RDNA2", "GPU"),
  ArrayBuffer(),
  ArrayBuffer("Nearby", "Shares", "Comes", "ChromeOS", "Share", "Android"),
  ArrayBuffer()

## 3. Count and sort the entities

Concatenate all the lists, count each named entity, and then sort by frequency.
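
As a rough illustration of this step, the sketch below counts and sorts a small hand-written input. It uses `groupBy` instead of the `foldLeft` in `SortNEs`, and it breaks ties alphabetically so the order is deterministic. Both are choices made for this example, not what the class does, and the input data and the names `entities`, `counts` and `sorted` are invented.

```scala
// Hypothetical per-text results, shaped like the output of ExtractNEs
val entities: Seq[Seq[String]] = Seq(
  Seq("Android", "Google"),
  Seq("Android"),
  Seq(),
  Seq("Google", "Android", "Pixel")
)

// Concatenate all lists, then count how many times each entity occurs
val counts: Map[String, Int] =
  entities.flatten.groupBy(identity).map { case (word, occurrences) =>
    word -> occurrences.size
  }

// Sort by descending count; ties are ordered alphabetically
val sorted: List[(String, Int)] =
  counts.toList.sortBy { case (word, n) => (-n, word) }

sorted.foreach { case (word, n) => println(s"$word: $n") }
```

In `SortNEs`, entities with the same count keep whatever order the `Map` yields, so the relative order of ties in the output is not meaningful.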

In [16]:
val sortedNEsjson = model.SortNEs(resultjson)
val sortedNEsrss = model.SortNEs(resultrss)

sortedNEsjson: List[(String, Int)] = List(
  ("Android", 10),
  ("Google", 9),
  ("Note", 5),
  ("Join", 5),
  ("Samsung", 3),
  ("Telegram", 3),
  ("IRC", 3),
  ("Please", 2),
  ("OTA-Package", 2),
  ("Xperia", 2),
  ("Tablet", 2),
  ("Ctrl-F", 2),
  ("ETA", 2),
  ("AMOLED", 2),
  ("Monday", 2),
  ("Yes", 2),
  ("May", 2),
  ("III", 2),
  ("Nexus", 2),
  ("Sunday", 2),
  ("Fire", 2),
  ("Wear", 2),
  ("PRIME", 2),
  ("OS", 2),
  ("New", 2),
  (