# P3 Bird sightings

## Getting the data (scraping)

First we need the data. We will use three bird-sighting websites: [shorebirder](https://www.shorebirder.com/), [trevorsbirding](https://www.trevorsbirding.com/) and [dantallmansbirdblog](https://dantallmansbirdblog.blogspot.com/).

Browsing the sites with the browser inspector (F12) shows that the descriptions we need live in paragraph tags (between *\<p\> TEXT \</p\>*). Knowing that, we write a few utility functions that download a page and extract its text.

Downloads go to `data/raw`, and the extracted texts are saved in `data/posts`.

**Note:** `dantallmansbirdblog` has a slightly different structure (between *\<p\>\</p\> TEXT \<p\>\</p\>*), so the `get_texts` function (below) would have to be modified to extract its texts.

In [1]:
# I/O utils
import os
import re
import wget
import tqdm

data_posts_cache = "../data/cache" # store the results of SPARQL queries
data_raw_path = "../data/raw" # downloads
data_posts_path = "../data/posts" # store the texts of the scraped posts

def maybe_mkdir(path):
  # create the directory (and any missing parents); do nothing if it already exists
  os.makedirs(path, exist_ok=True)

maybe_mkdir("../data")
maybe_mkdir(data_raw_path)
maybe_mkdir(data_posts_path)

def download(url, out_label):
  return wget.download(url, out=f"{data_raw_path}/{out_label}")

def get_texts(filename):
  # read as UTF-8 so non-ASCII characters (e.g. "ø") are not mangled
  with open(filename, 'r', encoding='utf-8') as file:
    text = file.read()

  # get the contents of every <p>...</p>
  get_p = re.compile(r'<p>((.|\n)*?)</p>')
  texts = get_p.findall(text)

  # remove inner tags, escaped newlines, repeated spaces, backslashes and HTML entities
  remove_tags = re.compile(r'(<.*?>)|\\n| +(?= )|\\|\&.+?\;')
  return map(lambda text: re.sub(remove_tags, "", str(text[0]).lower()), texts)

def write(path, filename, data):
  filepath = f"{path}/{filename}.txt"
  file = open(filepath, "a", encoding="utf-8")
  for item in data:
    file.write(str(item)+"\n")
  file.close()
  return filepath


Next, using the functions above, we scrape the `shorebirder` home page.

In [2]:
# scrape shorebirder.com
shorebirder_filename = "shorebirder_home.html"
shorebirder_home = download("https://www.shorebirder.com/", shorebirder_filename)
posts = get_texts(shorebirder_home)
shorebirder_posts_file = write(data_posts_path, shorebirder_filename, posts)

# read back as UTF-8, the encoding used by write()
open(shorebirder_posts_file, "r", encoding="utf-8").readlines()[0]

"my late march solo visit to norway is in the books and was about as much fun as i've had in a while. the middle few days of the trip were spent birding around varanger, bookended by more touristy time intromsøand oslo. at some point in the coming months there will be a full trip report here plus a very detailed cloudbirders submission. in the meantime, here is some proof that i actually went.\n"

The same for `trevorsbirding`.

In [3]:
# scrape trevorsbirding.com
trevorsbirding_filename = "trevorsbirding_home.html"
trevorsbirding_home = download("https://www.trevorsbirding.com/", trevorsbirding_filename)
posts = get_texts(trevorsbirding_home)
trevorsbirding_posts_file = write(data_posts_path, trevorsbirding_filename, posts)

# read back as UTF-8, the encoding used by write()
open(trevorsbirding_posts_file, "r", encoding="utf-8").readlines()[2]

'earlier this week i glanced out of my sunroom window to check whether there were any birds at my birdbaths. i currently have three birdbaths just outside the room, one on the ground, one on a pedestal at about 60cm and one hanging from a tree branch at a height of about 1.5 metres. i was delighted to see a small flock of purple-crowned lorikeets having a drink and dipping into the water for a bath. i have just checked my list of species to have visited the birdbaths. this was bird species number 36, in addition to the three reptiles and two mammal species.\n'
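
For `dantallmansbirdblog`, where the text sits between empty paragraph markers rather than inside `<p>...</p>`, a variant of `get_texts` could look roughly like the sketch below. It assumes the markers are literal `<p></p>` pairs; the function name `get_texts_between_empty_paragraphs` and the sample HTML are hypothetical, and a real page would likely need extra trimming of navigation and sidebar text.

```python
import re

# Hypothetical variant of get_texts for pages where the text sits *between*
# empty paragraph markers instead of inside <p>...</p>.
EMPTY_P = re.compile(r'<p>\s*</p>')
CLEANUP = re.compile(r'<.*?>|&.+?;')  # inner tags and HTML entities

def get_texts_between_empty_paragraphs(html):
    """Return the lower-cased text chunks enclosed by <p></p> markers."""
    # Everything before the first and after the last marker is page chrome,
    # so only the chunks in between are kept.
    chunks = EMPTY_P.split(html)[1:-1]
    texts = []
    for chunk in chunks:
        text = CLEANUP.sub(" ", chunk)                     # tags/entities -> space
        text = re.sub(r"\s+", " ", text).strip().lower()   # collapse whitespace
        if text:
            texts.append(text)
    return texts

sample = "<p></p>Saw a <b>Bald Eagle</b> today.<p></p>Two Mallards&nbsp;too.<p></p>"
print(get_texts_between_empty_paragraphs(sample))
```

Replacing tags and entities with a space (instead of an empty string) keeps neighbouring words from being glued together when they were separated only by `&nbsp;`.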

## Attempt 1: Using spaCy unmodified

In [4]:
# Load the NLP dependencies
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_lg")

In [5]:
# Test
for text in open(shorebirder_posts_file, "r", encoding="utf-8").readlines()[3:5]:
  doc = nlp(text)
  displacy.render(doc, jupyter=True, style="ent")

We can see that it identifies several kinds of entities in the sentences, but not birds.

## Attempt 2: Using a SPARQL query to find birds

Idea: use spaCy's tokenizer to split the sentences into tokens. For every token tagged as a noun (`NOUN`) we send a request to DBpedia. The DBpedia result then tells us whether a matching bird exists and what its label / URL is.

In [6]:
# SparQL class extension
# Prefixes and Class based from https://github.com/ejrav/pydbpedia
from SPARQLWrapper import SPARQLWrapper, JSON

class SparqlEndpoint(object):

    def __init__(self, endpoint, prefixes=None):
        self.sparql = SPARQLWrapper(endpoint)
        self.prefixes = {
            "dbo": "http://dbpedia.org/ontology/",
            "owl": "http://www.w3.org/2002/07/owl#",
            "xsd": "http://www.w3.org/2001/XMLSchema#",
            "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
            "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
            "foaf": "http://xmlns.com/foaf/0.1/",
            "dc": "http://purl.org/dc/elements/1.1/",
            "dbpedia2": "http://dbpedia.org/property/",
            "dbpedia": "http://dbpedia.org/",
            "skos": "http://www.w3.org/2004/02/skos/core#",
            "yago": "http://dbpedia.org/class/yago/",
            }
        # merge caller-supplied prefixes, if any (None avoids a shared mutable default)
        self.prefixes.update(prefixes or {})
        self.sparql.setReturnFormat(JSON)

    def query(self, q):
        lines = ["PREFIX %s: <%s>" % (k, r) for k, r in self.prefixes.items()]
        lines.extend(q.split("\n"))
        query = "\n".join(lines)
        self.sparql.setQuery(query)
        results = self.sparql.query().convert()
        return results["results"]["bindings"]


class DBpediaEndpoint(SparqlEndpoint):
    def __init__(self, endpoint, prefixes = {}):
        super(DBpediaEndpoint, self).__init__(endpoint, prefixes)

s = DBpediaEndpoint(endpoint = "http://dbpedia.org/sparql")

In [7]:
# Look up a bird by (part of) its name
def search_bird_dbpedia(token):
  return s.query('''
    SELECT *
    WHERE {
      ?bird a dbo:Bird ;
            rdfs:label ?name ;
            dbo:abstract ?comment .

      filter (!isLiteral(?name) ||
              langmatches(lang(?name), "en")) .

      filter (!isLiteral(?comment) ||
              langmatches(lang(?comment), "en")) .

      filter (CONTAINS(LCASE(STR(?name)), "{token}")) .
    }
    limit 5
  '''.replace("{token}", token))

search_bird_dbpedia("falcon")

[{'bird': {'type': 'uri',
   'value': 'http://dbpedia.org/resource/Collared_falconet'},
  'name': {'type': 'literal', 'xml:lang': 'en', 'value': 'Collared falconet'},
  'comment': {'type': 'literal',
   'xml:lang': 'en',
   'value': 'The collared falconet (Microhierax caerulescens) is a species of bird of prey in the family Falconidae. It is found in the Indian Subcontinent and Southeast Asia, ranging across Bangladesh, Bhutan, Cambodia, India, Laos, Myanmar, Nepal, Thailand, Malaysia, and Vietnam.Its natural habitat is temperate forest, often on the edges of broadleaf forest. It is 18 cm long. Rapid wingbeats are interspersed with long glides. When perched, it is described as being "rather shrikelike."'}},
 {'bird': {'type': 'uri',
   'value': 'http://dbpedia.org/resource/Peregrine_falcon'},
  'name': {'type': 'literal', 'xml:lang': 'en', 'value': 'Peregrine falcon'},
  'comment': {'type': 'literal',
   'xml:lang': 'en',
   'value': 'The peregrine falcon (Falco peregrinus), also known

In [8]:
# Test
nlp = spacy.load("en_core_web_lg")
maybe_matches = {}
for text in open(shorebirder_posts_file, "r").readlines()[4:5]:
  doc = nlp(text)
  for chunk in doc.noun_chunks:
    for token in chunk:
      if token.pos_ == 'NOUN':
        results = search_bird_dbpedia(token.lemma_)
        if len(results) > 0:
          maybe_matches[token] = results
          print(token)

body
time
ol'
car
gull
ring
gull
wing
peregrine
falcon


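OK, it works, more or less: real birds such as `gull` and `peregrine falcon` are found, but so are false positives such as `body`, `time` and `car`, because the query accepts any bird label that merely *contains* the token. It is also very slow, and we are hammering DBpedia with queries.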
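
One way to reduce the load, independent of the next attempt, would be to remember the answer for every token already asked (note that `gull` is queried three times in the output above). The sketch below is only an illustration under assumptions: `make_cached_search` and `fake_search` are hypothetical names, `fake_search` stands in for `search_bird_dbpedia` so the example runs offline, and a temporary directory stands in for the so far unused `data_posts_cache` path.

```python
import hashlib
import json
import os
import tempfile

def make_cached_search(search, cache_dir):
    """Wrap `search(token)` so each distinct token reaches the backend once.

    Results are kept in memory and mirrored to one JSON file per token,
    so they also survive a kernel restart.
    """
    os.makedirs(cache_dir, exist_ok=True)
    memory = {}

    def cached(token):
        if token in memory:
            return memory[token]
        # hash the token so characters like "/" or "'" never end up in a filename
        name = hashlib.sha1(token.encode("utf-8")).hexdigest() + ".json"
        path = os.path.join(cache_dir, name)
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                memory[token] = json.load(f)
        else:
            memory[token] = search(token)
            with open(path, "w", encoding="utf-8") as f:
                json.dump(memory[token], f)
        return memory[token]

    return cached

# Offline stand-in for search_bird_dbpedia (hypothetical, for illustration only).
calls = []
def fake_search(token):
    calls.append(token)
    return [{"name": {"value": f"{token} (fake result)"}}]

search = make_cached_search(fake_search, tempfile.mkdtemp())
for token in ["gull", "ring", "gull", "wing", "gull"]:
    search(token)
print(calls)  # only the distinct tokens were sent to the backend
```

A cache like this removes repeated queries but still sends one request per distinct noun, so it does not fix the false positives or the per-request latency.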

## Attempt 3: Caching / indexing DBpedia

Building on the previous attempt, we fetch the results for all birds and turn them into a dictionary, so that lookups are easy and we only send a small, fixed number of queries to DBpedia. Of course, this strategy is only feasible if the set is finite. That is our case: there are n bird species, and that number is not growing day by day.

### Strategy
- Get the list of all bird names.
- Analyze each sighting post with spaCy and get its noun chunks (`noun_chunks`).
  - To make spaCy faster we disable the pipeline components we do not use, `lemmatizer` and `ner`.
- For each chunk, run a fuzzy search over the list of bird names to find the chunks that look like bird names.

In [9]:
# Fetch all birds. The descriptions are fetched too, for use in later attempts.
birds_sparql = s.query("""
  SELECT DISTINCT *
  WHERE {
    ?bird a dbo:Bird ;
          rdfs:label ?name ;
          dbo:abstract ?comment .

    filter (!isLiteral(?name) ||
            langmatches(lang(?name), "en")) .

    filter (!isLiteral(?comment) ||
            langmatches(lang(?comment), "en")) .
    
  }
  limit 10000
""")

birds_sparql += s.query("""
  SELECT DISTINCT *
  WHERE {
    ?bird a dbo:Bird ;
          rdfs:label ?name ;
          dbo:abstract ?comment .

    filter (!isLiteral(?name) ||
            langmatches(lang(?name), "en")) .

    filter (!isLiteral(?comment) ||
            langmatches(lang(?comment), "en")) .
    
  }
  limit 10000
  offset 10000
""")

write("../data", "birds", birds_sparql)

print(f"Fetched the names of {len(birds_sparql)} birds")
for d in birds_sparql[0:5]:
  print(d['name']['value'])

Fetched the names of 10369 birds
Actenoides
African goshawk
African pitta
African red-eyed bulbul
Alcedo
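
The two hard-coded pages above (`limit 10000`, then `offset 10000`) could be generalized into a loop that keeps asking for pages until one comes back short. This is a minimal sketch, not the notebook's method: `fetch_all`, `fake_run_query` and `FAKE_ROWS` are hypothetical names, and the fake backend replaces the real endpoint so the example runs offline.

```python
def fetch_all(run_query, page_size=10000):
    """Call run_query(limit, offset) repeatedly until a page comes back short."""
    rows, offset = [], 0
    while True:
        page = run_query(page_size, offset)
        rows.extend(page)
        if len(page) < page_size:   # last (possibly empty) page reached
            return rows
        offset += page_size

# Offline stand-in for the endpoint: 25 fake rows served in slices.
FAKE_ROWS = [{"name": {"value": f"bird {i}"}} for i in range(25)]

def fake_run_query(limit, offset):
    return FAKE_ROWS[offset:offset + limit]

all_rows = fetch_all(fake_run_query, page_size=10)
print(len(all_rows))
```

With the real endpoint, `run_query` would append `limit {limit} offset {offset}` to the SPARQL text and pass it to `s.query`. Since SPARQL does not guarantee a result order without `ORDER BY`, adding for example `ORDER BY ?bird` to the query would make the pages consistent with each other.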


We convert the raw data into a dictionary.

In [10]:
# SPARQL results to dictionary
birds = {}
for bird in birds_sparql:
  key = bird["name"]["value"].lower()
  birds[key] = {
    "name": bird["name"]["value"],
    "url": bird["bird"]["value"],
    "description": bird["comment"]["value"],
  }
bird_keys = birds.keys() # we will search by key
assert len(birds_sparql) == len(bird_keys) # make sure no two birds share the same key

In [11]:
# disable the spaCy pipes we do not need
nlp = spacy.load("en_core_web_lg", disable=['lemmatizer', 'ner'])

doc = nlp("Black-billed flycatcher")
displacy.render(doc, jupyter=True)


We can see that spaCy's `tokenizer` splits hyphenated compound words.

To fix this, we modify the tokenizer so that it does not split words on hyphens.

In [12]:
# make the tokenizer keep hyphenated words together
# https://stackoverflow.com/questions/59993683/how-can-i-get-spacy-to-stop-splitting-both-hyphenated-numbers-and-words-into-sep

from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

def custom_tokenizer(nlp):
    inf = list(nlp.Defaults.infixes)               # Default infixes
    inf.remove(r"(?<=[0-9])[+\-\*^](?=[0-9-])")    # Remove the generic op between numbers or between a number and a -
    inf = tuple(inf)                               # Convert inf to tuple
    infixes = inf + tuple([r"(?<=[0-9])[+*^](?=[0-9-])", r"(?<=[0-9])-(?=-)"])  # Add the removed rule after subtracting (?<=[0-9])-(?=[0-9]) pattern
    infixes = [x for x in infixes if '-|–|—|--|---|——|~' not in x] # Remove - between letters rule
    infix_re = compile_infix_regex(infixes)

    return Tokenizer(nlp.vocab, prefix_search=nlp.tokenizer.prefix_search,
                                suffix_search=nlp.tokenizer.suffix_search,
                                infix_finditer=infix_re.finditer,
                                token_match=nlp.tokenizer.token_match,
                                rules=nlp.Defaults.tokenizer_exceptions)

In [13]:
# Test
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp("Black-billed flycatcher")
displacy.render(doc, jupyter=True)

Great! In addition, `Black-billed` is now detected as an adjective rather than as adjective + verb.

In [14]:
# Fuzzy search for birds
from fuzzywuzzy import fuzz
def search_bird_dict(chunk):
  best_match = (False, None, 0)
  for key in bird_keys:
    score = fuzz.ratio(key, chunk)
    if score > 80 and score > best_match[-1]:
      # print(f"\tchunk: '{chunk}' compare '{key}' score: '{score}'")
      best_match = (True, key, score)
  return best_match

def get_nouns(chunk):
  only_nouns = []
  for token in chunk:
    if token.pos_ == "NOUN" or token.pos_ == "ADJ":
      only_nouns.append(token.lower_)

  if len(only_nouns) > 0:
    return " ".join(only_nouns)
  return None

maybe_matches = []
for text in open(shorebirder_posts_file, "r").readlines()[0:5]:
  doc = nlp(text)
  for chunk in [get_nouns(chunk) for chunk in doc.noun_chunks]:
    if chunk != None:
      (found, bird_key, score) = search_bird_dict(chunk)
      if found:
        maybe_matches.append(bird_key)
        print(f"Found '{bird_key}' in '{chunk}' with score of {score}")
          
# Test result
print(maybe_matches)



Found 'ring-billed gull' in 'ring-billed gull' with score of 100
Found 'peregrine falcon' in 'female peregrine falcon' with score of 82
['ring-billed gull', 'peregrine falcon']
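
To see where a score such as 82 comes from, the ratio can be reproduced with the standard library. This is a sketch that uses `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio` (an assumption; it agrees with the 82 and 100 printed above): the score is `2 * matched characters / total characters`, scaled to 0..100. The third chunk is a made-up example.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Similarity in 0..100: 2 * matched characters / (len(a) + len(b))."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

# identical strings: every character matches
print(ratio("ring-billed gull", "ring-billed gull"))

# 16 matched characters, 16 + 23 in total: 2 * 16 / 39 = 0.82
print(ratio("peregrine falcon", "female peregrine falcon"))

# one more modifier (hypothetical chunk) drops the score below the threshold of 80
print(ratio("peregrine falcon", "adult female peregrine falcon"))
```

This is why `get_nouns` keeps only nouns and adjectives before matching: every extra word in the chunk lowers the ratio, and with a cut-off of 80 a bird name quickly stops matching.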


In [15]:
# run over every post
maybe_matches = []
for text in tqdm.tqdm(open(shorebirder_posts_file, "r").readlines()):
  doc = nlp(text)
  for chunk in [get_nouns(chunk) for chunk in doc.noun_chunks]:
    if chunk != None:
      (found, bird_key, score) = search_bird_dict(chunk)
      if found:
        maybe_matches.append(bird_key)

print(maybe_matches)

100%|██████████| 56/56 [05:42<00:00,  6.12s/it]

['ring-billed gull', 'peregrine falcon', 'gull', 'gull', 'gull', 'gull', 'least bittern', 'triller', 'khopesh', 'american bittern', 'sunbittern', 'gull', 'tern', 'khopesh', 'owl', 'great horned owl', 'black-crowned night heron', 'egret', 'gull', 'common blackbird', 'gull', 'reed cormorant', 'little blue heron', 'yellow-bellied sapsucker', 'scarlet tanager', 'passer', "bonaparte's gull", 'common tern', 'least sandpiper', 'american bittern', 'surf scoter', 'trocaz pigeon', 'andean avocet', 'mountain robin', 'collared inca', 'hooded mountain tanager', 'chestnut-crested cotinga', 'cock-of-the-rock', 'solitary eagle', 'mountaingem', 'lyre-tailed nightjar', 'andean cock-of-the-rock', 'wire-crested thorntail', 'vervain hummingbird', 'trogon', 'macaw', 'yellow-bellied dacnis', 'yellow-browed sparrow', 'sand-coloured nighthawk', 'nighthawk', 'parrot', 'blue-headed parrot', 'hoatzin', 'seabird', 'tern', 'tern', 'peruvian booby', 'peruvian diving petrel', 'shearwater', 'purple sandpiper', 'pelica




In [16]:
# Print the result
result_lines = []
for bird_key in set(maybe_matches):
  bird = dict(birds[bird_key])
  name = bird["name"]
  url = bird["url"]
  result_lines.append(f"Found '{name}' with DBpedia entry '{url}'.")

result_lines_file = write(data_posts_path, "shorebirder_results_3", result_lines)
open(result_lines_file, "r", encoding="utf-8").readlines()[0:5]

["Found 'Trogon' with DBpedia entry 'http://dbpedia.org/resource/Trogon'.\n",
 "Found 'Least bittern' with DBpedia entry 'http://dbpedia.org/resource/Least_bittern'.\n",
 "Found 'Cock-of-the-rock' with DBpedia entry 'http://dbpedia.org/resource/Cock-of-the-rock'.\n",
 "Found 'Yellow-bellied sapsucker' with DBpedia entry 'http://dbpedia.org/resource/Yellow-bellied_sapsucker'.\n",
 "Found 'Common blackbird' with DBpedia entry 'http://dbpedia.org/resource/Common_blackbird'.\n"]

This is the first working version that does not hammer DBpedia. It is still very slow, though (about 6 s per post, almost 6 minutes for 56 posts), so let's keep trying.

## Attempt 4: Training spaCy

We have seen that we can get results with a mix of our own code and spaCy. In this section we try to find the birds using spaCy alone.

The spaCy [pipeline](https://spacy.io/usage/spacy-101#pipelines) component that labels entities is the [Entity Recognizer](https://spacy.io/api/entityrecognizer), or `ner`. To be able to label birds, we use the DBpedia descriptions to [train spaCy](https://spacy.io/usage/training).
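
The training examples pair a text with a list of `(start, end, label)` character offsets. A minimal sketch of how such offsets can be derived from a known name, using only the standard library; `find_spans` is a hypothetical helper, and the peregrine sentence is made up for illustration.

```python
import re

def find_spans(name, text, label):
    """Return (start, end, label) for every case-insensitive hit of `name`."""
    pattern = re.escape(name)  # match the name literally, not as a regex
    return [m.span() + (label,) for m in re.finditer(pattern, text, re.IGNORECASE)]

# "Tokyo Tower" occupies characters 0..11 of the sentence
print(find_spans("Tokyo Tower", "Tokyo Tower is 333m tall.", "BUILDING"))

sentence = "The peregrine falcon is fast. A Peregrine Falcon dives."
spans = find_spans("Peregrine falcon", sentence, "BIRD")
for start, end, label in spans:
    print(label, repr(sentence[start:end]))
```

Escaping matters because a label is arbitrary text: characters such as `.`, `+` or `(` would otherwise be interpreted as regex syntax and either raise an error or match the wrong span.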

In [17]:
# Turn the descriptions into training and test data for spaCy
import random
import unicodedata
def strip_accents(s):
   return ''.join(c for c in unicodedata.normalize('NFD', s)
                  if unicodedata.category(c) != 'Mn')

# build training sentences
training_data = [
  # ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]), # example
]

tag = "BIRD"

remove_parentesis_text = re.compile(r'\(.*\)')

# could be improved, but it does the job
for key in tqdm.tqdm(bird_keys):
  bird = birds[key]
  # drop parenthesised suffixes such as "(bird)" and the surrounding spaces
  bird_name = re.sub(remove_parentesis_text, "", str(bird["name"])).strip()

  if len(bird_name) == 0:
    continue

  train_sentence = bird["description"]
  positions = []
  # escape the name so characters like "." or "+" are matched literally
  for match in re.finditer(re.escape(strip_accents(bird_name)), strip_accents(train_sentence), re.IGNORECASE):
    positions.append(match.span() + (tag,))

  training_data.append(
    (train_sentence, positions)
  )

def outer_join(lst1, lst2):
  lst3 = [value for value in lst1 if not value in lst2]
  return lst3

test_data = random.sample(training_data, k=round(len(training_data)*0.2))
training_data = outer_join(training_data, test_data)

for text, annotations in training_data[0:3]:
  print(text, annotations)

100%|██████████| 10369/10369 [00:02<00:00, 3999.91it/s]


Actenoides is a genus of kingfishers in the subfamily Halcyoninae. The genus Actenoides was introduced by the French ornithologist Charles Lucien Bonaparte in 1850. The type species is Hombron's kingfisher (Actenoides hombroni). The name of the genus is from the classical Greek aktis, aktinos for "beam" or "brightness" and -oidēs for "resembling". A molecular study published in 2017 found that the genus Actenoides, as currently defined, is paraphyletic. The glittering kingfisher in the monotypic genus Caridonax is a member of the clade containing the species in the genus Actenoides. The genus contains the following species: 
* Green-backed kingfisher (Actenoides monachus) 
* Black-headed kingfisher (Actenoides monachus capucinus) 
* Scaly-breasted kingfisher (Actenoides princeps) 
* Plain-backed kingfisher (Actenoides princeps regalis) 
* Moustached kingfisher (Actenoides bougainvillei) 
* Guadalcanal moustached kingfisher (Actenoides bougainvillei excelsus) 
* Spotted wood kingfisher 

In [18]:
from spacy.tokens import DocBin
from spacy.util import filter_spans

nlp = spacy.blank("en")
skips = 0
# the DocBin will store the example documents
db = DocBin()
for text, annotations in tqdm.tqdm(training_data):
  doc = nlp(text)
  ents = []
  for start, end, label in annotations:
    span = doc.char_span(start, end, label=label, alignment_mode="contract")
    if span is None:
      skips += 1
    else:
      ents.append(span)
  filtered_ents = filter_spans(ents)
  doc.ents = filtered_ents
  db.add(doc)

print(f"Skipped {skips} entries")
db.to_disk("./birds_train.spacy")

100%|██████████| 8295/8295 [00:07<00:00, 1133.50it/s]


Skipped 302 entries


In [19]:
nlp = spacy.blank("en")
skips = 0
# the DocBin will store the example documents
db = DocBin()
for text, annotations in test_data:
  doc = nlp(text)
  ents = []
  for start, end, label in annotations:
    span = doc.char_span(start, end, label=label, alignment_mode="contract")
    if span is None:
      skips += 1
    else:
      ents.append(span)
  filtered_ents = filter_spans(ents)
  doc.ents = filtered_ents
  db.add(doc)

print(f"Skipped {skips} entries")
db.to_disk("./birds_test.spacy")

Skipped 123 entries


In [20]:
# Train the model https://spacy.io/usage/training
os.environ['KMP_DUPLICATE_LIB_OK']='True'

# train the model
!python -m spacy init fill-config birds_config.cfg config.cfg
!python -m spacy train config.cfg --output ./output --paths.train ./birds_train.spacy --paths.dev ./birds_test.spacy 

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
✔ Created output directory: output
ℹ Saving to output directory: output
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     25.33    0.05    0.18    0.03    0.00
  0     200        781.44   1797.31   63.66   76.17   54.67    0.64
  0     400         96.82    393.86   75.13   94.32   62.43    0.75
  0     600        262.38    389.24   74.66   83.61   67.45    0.75
  0     800        243.92    495.40   72.90   87.23   62.61    0.73
  0    1000       1174.98    457.54   76.36   90.28   66.16    0.76
  0    1200        486.68    539.92   75.76   90.48   65.16    0.76
  0    1400  

[2022-05-12 22:09:02,139] [INFO] Set up nlp object from config
[2022-05-12 22:09:02,155] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-05-12 22:09:02,161] [INFO] Created vocabulary
[2022-05-12 22:09:02,163] [INFO] Finished initializing nlp object
[2022-05-12 22:09:15,787] [INFO] Initialized pipeline components: ['tok2vec', 'ner']


In [21]:
nlp = spacy.load("en_core_web_lg", exclude=["ner"])
nlp_entity = spacy.load("./output/model-best")
nlp.add_pipe("ner", source=nlp_entity)

for text in open(shorebirder_posts_file, "r").readlines()[4:5]:
  doc = nlp(text)
  if len(doc.ents) > 0:
    displacy.render(doc, jupyter=True, style="ent")



It seems that, from context alone, the model is not able to identify the birds.

## Attempt 5: Solution! A new [entity_ruler](https://spacy.io/api/entityruler) pipe

In this approach we add one more pipe, an `entity_ruler`, to a spaCy `nlp` object (the cell below attaches it to a blank `English()` pipeline; the pre-trained `en_core_web_lg` is loaded as `auxiliar_nlp` but is not used for the matching). To do so we need a list of all the patterns we want to match: every bird name we want detected is turned into a pattern, and the new pipe is then added to the `nlp` object.
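
To make the pattern format concrete, here is a small standard-library sketch. It only illustrates the idea of exact token-sequence matching and is not spaCy's implementation: `build_pattern` and `naive_match` are hypothetical helpers, and whitespace splitting stands in for the spaCy tokenizer.

```python
def build_pattern(name, label):
    """One ruler entry: a label plus one {"LOWER": ...} dict per token."""
    # Assumption: whitespace splitting stands in for the spaCy tokenizer.
    return {"label": label,
            "pattern": [{"LOWER": tok.lower()} for tok in name.split()]}

def naive_match(tokens, patterns):
    """Toy matcher: find patterns whose lower-cased tokens appear as a
    consecutive run in `tokens`, trying longer patterns first."""
    lowered = [t.lower() for t in tokens]
    ordered = sorted(patterns, key=lambda p: len(p["pattern"]), reverse=True)
    hits, i = [], 0
    while i < len(lowered):
        for p in ordered:
            want = [d["LOWER"] for d in p["pattern"]]
            if want and lowered[i:i + len(want)] == want:
                hits.append((" ".join(want), p["label"]))
                i += len(want) - 1   # skip the tokens consumed by this match
                break
        i += 1
    return hits

names = ["Gull", "Ring-billed gull", "Peregrine falcon"]
patterns = [build_pattern(n, "BIRD") for n in names]
tokens = "a Ring-billed Gull chased the peregrine falcon past another gull".split()
print(patterns[1])
print(naive_match(tokens, patterns))
```

Trying longer patterns first means `ring-billed gull` is reported as one entity instead of a bare `gull`. Unlike the fuzzy search, nothing is scored: a name either matches token for token or it does not.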

In [22]:
# using entity_ruler
from spacy.lang.en import English

auxiliar_nlp = spacy.load("en_core_web_lg", exclude=['tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner'])
nlp = English()

auxiliar_nlp.tokenizer = custom_tokenizer(auxiliar_nlp)
nlp.tokenizer = custom_tokenizer(nlp)

# Add the bird names as patterns
patterns = []
for key in bird_keys:
  bird = birds[key]
  doc = nlp(bird["name"])
  pattern = []
  for token in doc:
    pattern.append({
      "LOWER": token.lower_
    })

  patterns.append({
    "label": tag,
    "pattern": pattern
  })

ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

doc = nlp("that feeding gull flock continued to produce by sucking in passers by. at one point a Bonaparte's gull got in on the action, and a flock of 21 common terns appeared from the east and eventually settled into that flock.")
displacy.render(doc, jupyter=True, style="ent")

In [23]:
# run over every post
maybe_matches = []
for text in tqdm.tqdm(open(shorebirder_posts_file, "r").readlines()):
  doc = nlp(text)
  for ent in doc.ents:
    if ent.label_ == "BIRD":
      maybe_matches.append(str(ent))

print(maybe_matches)

100%|██████████| 56/56 [00:00<00:00, 85.61it/s] 

['gull', 'ring-billed gull', 'peregrine falcon', 'gull', 'gull', 'gull', 'iceland gull', 'american bittern', 'hawk', 'little blue heron', 'glossy ibis', 'yellow-bellied sapsucker', 'scarlet tanager', 'gull', "bonaparte's gull", 'least sandpiper', 'saltmarsh sparrow', "nelson's sparrow", 'marsh wren', 'american bittern', 'puna plover', 'hummingbird', 'collared inca', 'shining sunbeam', 'chestnut-crested cotinga', 'cock-of-the-rock', 'solitary eagle', 'lyre-tailed nightjar', 'andean cock-of-the-rock', 'wire-crested thorntail', 'hummingbird', 'white-throated toucan', 'green honeycreeper', 'yellow-bellied dacnis', 'owl', 'nighthawk', 'parrot', 'hoatzin', 'elaenia', 'seabird', 'cinclodes', 'purple sandpiper', 'humboldt penguin']





In [24]:
# Print and save the result
result_lines = []
for bird_key in set(maybe_matches):
  bird = dict(birds[bird_key])
  name = bird["name"]
  url = bird["url"]
  result_lines.append(f"Found '{name}' with DBpedia entry '{url}'.")

result_lines_file = write(data_posts_path, "shorebirder_results_5", result_lines)
open(result_lines_file, "r", encoding="utf-8").readlines()[0:5]

["Found 'Cock-of-the-rock' with DBpedia entry 'http://dbpedia.org/resource/Cock-of-the-rock'.\n",
 "Found 'Hummingbird' with DBpedia entry 'http://dbpedia.org/resource/Hummingbird'.\n",
 "Found 'Yellow-bellied sapsucker' with DBpedia entry 'http://dbpedia.org/resource/Yellow-bellied_sapsucker'.\n",
 "Found 'Cinclodes' with DBpedia entry 'http://dbpedia.org/resource/Cinclodes'.\n",
 "Found 'Nighthawk' with DBpedia entry 'http://dbpedia.org/resource/Nighthawk'.\n"]