# P3 Bird sightings

## Getting the data (scraping)

First we need to get the data. To do so, we will visit the bird-sighting websites [shorebirder](https://www.shorebirder.com/), [trevorsbirding](https://www.trevorsbirding.com/) and [dantallmansbirdblog](https://dantallmansbirdblog.blogspot.com/).

While browsing the sites with the browser inspector (F12), we can see that the descriptions we need are inside paragraph tags (between *\<p\> TEXT \</p\>*). Knowing that, we will create utility functions that download each page's content and extract the text.

Downloads are stored in `data/raw`, while the extracted texts are saved in `data/posts`.

** `dantallmansbirdblog` has a slightly different structure (between *\<p\>\</p\> TEXT \<p\>\</p\>*), so the `get_texts` function (below) would have to be modified to get its texts.

In [1]:
import os
import re
import wget

data_posts_cache = "../data/cache" # store the results of SPARQL queries
data_raw_path = "../data/raw" # downloads
data_posts_path = "../data/posts" # store the texts of the scraped posts

def maybe_mkdir(path):
  # create the directory (and any missing parents) only if it does not exist yet
  os.makedirs(path, exist_ok=True)

maybe_mkdir("../data")
maybe_mkdir(data_raw_path)
maybe_mkdir(data_posts_path)

def download(url, out_label):
  return wget.download(url, out=f"{data_raw_path}/{out_label}")

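def get_texts(filename):
  # read the downloaded HTML; utf-8 is assumed here and undecodable bytes are replaced
  with open(filename, 'r', encoding="utf-8", errors="replace") as file:
    text = file.read()

  # get the contents of every <p>...</p> block
  get_p = re.compile(r'<p>((.|\n)*?)</p>')
  texts = get_p.findall(text)

  # remove styling and inner tags, repeated spaces and HTML entities
  remove_tags = re.compile(r'(<.*?>)|\\n| +(?= )|\\|\&.+?\;')
  return map(lambda text: re.sub(remove_tags, "", str(text[0]).lower()), texts)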

def write(path, filename, data):
  filepath = f"{path}/{filename}.txt"
  # mode "a" appends: re-running the cell adds the same lines again to an existing file
  with open(filepath, "a", encoding="utf-8") as file:
    for item in data:
      file.write(str(item)+"\n")
  return filepath
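
The pattern `<p>` in `get_texts` only matches a bare opening tag, so a paragraph written as `<p class="...">` would be skipped. As a possible alternative, here is a minimal sketch based on the standard library's `html.parser` instead of regular expressions. The class `ParagraphExtractor` and the function `extract_paragraphs` are hypothetical names that are not part of the notebook, and the sample HTML is made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the visible text of every <p> element (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; before handle_data
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def _flush(self):
        # collapse runs of whitespace and lowercase, as get_texts does
        text = " ".join("".join(self._buffer).split()).lower()
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self._in_p:  # previous <p> was never closed
                self._flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        # inner tags (<b>, <a>, ...) are dropped, only their text is kept
        if self._in_p:
            self._buffer.append(data)


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# made-up sample, not taken from any of the blogs
sample = '<p class="intro">Saw a <b>Peregrine</b>&nbsp;Falcon.</p><div>menu</div><p>Two   gulls.</p>'
print(extract_paragraphs(sample))
```

The trade-off is a few more lines in exchange for not depending on the exact spelling of the opening tag.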


Next, using the functions above, we scrape the home page of `shorebirder`.

In [2]:
shorebirder_filename = "shorebirder_home.html"
shorebirder_home = download("https://www.shorebirder.com/", shorebirder_filename)
posts = get_texts(shorebirder_home)
shorebirder_posts_file = write(data_posts_path, shorebirder_filename, posts)

open(shorebirder_posts_file, "r").readlines()[0]

"my late march solo visit to norway is in the books and was about as much fun as i've had in a while. the middle few days of the trip were spent birding around varanger, bookended by more touristy time intromsøand oslo. at some point in the coming months there will be a full trip report here plus a very detailed cloudbirders submission. in the meantime, here is some proof that i actually went.\n"

The same for `trevorsbirding`.

In [3]:
trevorsbirding_filename = "trevorsbirding_home.html"
trevorsbirding_home = download("https://www.trevorsbirding.com/", trevorsbirding_filename)
posts = get_texts(trevorsbirding_home)
trevorsbirding_posts_file = write(data_posts_path, trevorsbirding_filename, posts)

open(trevorsbirding_posts_file, "r").readlines()[2]

'earlier this week i glanced out of my sunroom window to check whether there were any birds at my birdbaths. i currently have three birdbaths just outside the room, one on the ground, one on a pedestal at about 60cm and one hanging from a tree branch at a height of about 1.5 metres. i was delighted to see a small flock of purple-crowned lorikeets having a drink and dipping into the water for a bath. i have just checked my list of species to have visited the birdbaths. this was bird species number 36, in addition to the three reptiles and two mammal species.\n'
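
The third site, `dantallmansbirdblog`, is not scraped above because its texts sit between empty `<p></p>` pairs rather than inside the paragraph tags. A minimal sketch of how such a variant of `get_texts` might look, assuming the structure is exactly the one described earlier. The function name `get_texts_between_empty_paragraphs` and the sample HTML are hypothetical, and the real pages may need extra clean-up.

```python
import re


def get_texts_between_empty_paragraphs(html):
    # split the page on empty paragraph markers: <p></p> (whitespace allowed inside)
    parts = re.split(r'<p>\s*</p>', html)
    # only the pieces enclosed by two markers are posts: drop the first and last piece
    inner = parts[1:-1]

    cleaned = []
    for part in inner:
        text = re.sub(r'<[^>]*>', '', part)       # remove inner tags
        text = re.sub(r'&[#\w]+;', ' ', text)     # replace HTML entities with a space
        text = re.sub(r'\s+', ' ', text).strip()  # collapse whitespace
        if text:
            cleaned.append(text.lower())
    return cleaned


# made-up sample, not taken from the blog
sample = "<div><p></p>Saw a Bald <b>Eagle</b> today.<p></p>Two&nbsp;loons on the lake.<p></p></div>"
print(get_texts_between_empty_paragraphs(sample))
```

The returned list could be passed to `write` in the same way as the output of `get_texts`.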

## Analyzing the posts

In [4]:
# Load the NLP dependencies
import spacy
from collections import Counter
from spacy import displacy
nlp = spacy.load("en_core_web_lg")

In [5]:
for text in open(shorebirder_posts_file, "r").readlines()[3:5]:
  train = nlp(text)
  displacy.render(train, jupyter=True, style="ent")

This NLP model is of no use to us as it is, or it still needs training: it cannot find the bird names.

Let's try training it with bird names from DBpedia.

In [6]:
# Prefixes and class based on https://github.com/ejrav/pydbpedia
from SPARQLWrapper import SPARQLWrapper, JSON

class SparqlEndpoint(object):

    def __init__(self, endpoint, prefixes={}):
        self.sparql = SPARQLWrapper(endpoint)
        self.prefixes = {
            "dbo": "http://dbpedia.org/ontology/",
            "owl": "http://www.w3.org/2002/07/owl#",
            "xsd": "http://www.w3.org/2001/XMLSchema#",
            "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
            "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
            "foaf": "http://xmlns.com/foaf/0.1/",
            "dc": "http://purl.org/dc/elements/1.1/",
            "dbpedia2": "http://dbpedia.org/property/",
            "dbpedia": "http://dbpedia.org/",
            "skos": "http://www.w3.org/2004/02/skos/core#",
            "yago": "http://dbpedia.org/class/yago/",
            }
        self.prefixes.update(prefixes)
        self.sparql.setReturnFormat(JSON)

    def query(self, q):
        lines = ["PREFIX %s: <%s>" % (k, r) for k, r in self.prefixes.items()]
        lines.extend(q.split("\n"))
        query = "\n".join(lines)
        self.sparql.setQuery(query)
        results = self.sparql.query().convert()
        return results["results"]["bindings"]


class DBpediaEndpoint(SparqlEndpoint):
    def __init__(self, endpoint, prefixes = {}):
        super(DBpediaEndpoint, self).__init__(endpoint, prefixes)

In [115]:
s = DBpediaEndpoint(endpoint = "http://dbpedia.org/sparql")

# birds = s.query("""
#   SELECT DISTINCT *
#   WHERE {
#     ?bird a dbo:Bird ;
#           a yago:Bird101503061 ;
#           rdfs:label ?name ; # initially this property was there to avoid picking up events and years in the search
#           dbo:abstract ?comment .

#     filter (!isLiteral(?name) ||
#             langmatches(lang(?name), "en")) .

#     filter (!isLiteral(?comment) ||
#             langmatches(lang(?comment), "en")) .
    
#   }
#   limit 10000
# """)

birds = s.query("""
  SELECT DISTINCT *
  WHERE {
    ?bird a dbo:Bird ;
          rdfs:label ?name ;
          dbo:abstract ?comment .

    filter (!isLiteral(?name) ||
            langmatches(lang(?name), "en")) .

    filter (!isLiteral(?comment) ||
            langmatches(lang(?comment), "en")) .
    
  }
  limit 10000
""")

birds += s.query("""
  SELECT DISTINCT *
  WHERE {
    ?bird a dbo:Bird ;
          rdfs:label ?name ;
          dbo:abstract ?comment .

    filter (!isLiteral(?name) ||
            langmatches(lang(?name), "en")) .

    filter (!isLiteral(?comment) ||
            langmatches(lang(?comment), "en")) .
    
  }
  limit 10000
  offset 10000
""")

write("../data", "birds", birds)

print(f"We got the names of {len(birds)} birds")
for d in birds[0:5]:
  print(d['name']['value'])

We got the names of 10369 birds
Actenoides
African goshawk
African pitta
African red-eyed bulbul
Alcedo
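
The two queries above differ only in their `offset`. A minimal sketch of how that paging could be written as a loop that stops when a page comes back short. The helper `fetch_all` and the stand-in `fake_run_query` are hypothetical; in the notebook, `s.query` could be passed as `run_query` together with a query that has no `limit`/`offset` of its own. Note that without an `ORDER BY`, SPARQL does not guarantee a stable order across pages.

```python
import re


def fetch_all(run_query, base_query, page_size=10000, max_pages=50):
    """Run base_query page by page and concatenate the rows (hypothetical helper)."""
    rows = []
    for page in range(max_pages):
        paged = f"{base_query}\nLIMIT {page_size}\nOFFSET {page * page_size}"
        batch = run_query(paged)
        rows.extend(batch)
        if len(batch) < page_size:  # a short page means there is nothing left
            break
    return rows


# Stand-in for the endpoint so the sketch runs offline: 25 fake rows
FAKE_ROWS = [{"name": {"value": f"bird {i}"}} for i in range(25)]


def fake_run_query(query):
    limit = int(re.search(r"LIMIT (\d+)", query).group(1))
    offset = int(re.search(r"OFFSET (\d+)", query).group(1))
    return FAKE_ROWS[offset:offset + limit]


all_rows = fetch_all(fake_run_query, "SELECT DISTINCT * WHERE { ... }", page_size=10)
print(len(all_rows))
```

`max_pages` is only a safety cap so that a misbehaving endpoint cannot make the loop run forever.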


In [22]:
import random
import unicodedata
def strip_accents(s):
   return ''.join(c for c in unicodedata.normalize('NFD', s)
                  if unicodedata.category(c) != 'Mn')

# build training sentences
training_data = [
  # ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]), # example
]

tag = "BIRD"

remove_parentesis_text = re.compile(r'\(.*\)')

# could be improved, but it does the job
for bird in birds:
  bird_name = bird['name']['value']
  if len(bird_name) == 0:
    continue
  
  # drop any "(...)" qualifier and the whitespace it leaves behind
  bird_name = re.sub(remove_parentesis_text, "", str(bird_name).lower()).strip()
  train_sentence = bird['comment']['value']
  try:
    start_at = strip_accents(train_sentence).lower().index(strip_accents(bird_name))
    ends_at = start_at + len(bird_name)
    training_data.append(
      (bird_name, [(0, len(bird_name), tag)])
    )
    training_data.append(
      (train_sentence, [(start_at, ends_at, tag)])
    )
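  except ValueError:
    # str.index raises ValueError when the name is not found in the abstract;
    # in that case only the bare name is kept as a training example
    # print("Error: substring not found")
    # print(bird_name)
    # print(train_sentence)
    # print("-"*10)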
    training_data.append(
      (bird_name, [(0, len(bird_name), tag)])
    )

def outer_join(lst1, lst2):
  # keep the items of lst1 that do not appear in lst2
  return [value for value in lst1 if value not in lst2]

test_data = random.sample(training_data, k=round(len(training_data)*0.10))
training_data = outer_join(training_data, test_data)

for text, annotations in training_data[0:5]:
  print(text)
  print(annotations)

african pitta
[(0, 13, 'BIRD')]
The African pitta (Pitta angolensis) is an Afrotropical bird of the family Pittidae. It is a locally common to uncommon species, resident and migratory in the west, and an intra-African migrant between equatorial and southeastern Africa. They are elusive and hard to observe despite their brightly coloured plumage, and their loud, explosive calls are infrequently heard. The plump, somewhat thrush-like birds forage on leaf litter under the canopy of riparian or coastal forest and thickets, or in climax miombo forest. They spend much time during mornings and at dusk scratching in leaf litter or around termitaria, or may stand motionless for long periods. Following rains breeding birds call and display from the mid-canopy.
[(4, 17, 'BIRD')]
african red-eyed bulbul
[(0, 23, 'BIRD')]
andean coot
[(0, 11, 'BIRD')]
The Andean coot (Fulica ardesiaca), also known as the slate-coloured coot, is a species of bird in the family Rallidae. It is found in the Andes from
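
Each annotation is a `(start, end, label)` triple of character offsets into the text. A minimal sketch of how the offsets for the African pitta sentence come out, reusing `strip_accents` from the cell above. The helper `find_span` is a hypothetical name that uses `str.find` instead of `index` plus `try/except`.

```python
import unicodedata


def strip_accents(s):
    # same helper as in the notebook: decompose and drop combining marks
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')


def find_span(sentence, name):
    """Return (start, end) of name inside sentence, ignoring case and accents."""
    haystack = strip_accents(sentence).lower()
    needle = strip_accents(name).lower()
    start = haystack.find(needle)
    if start == -1:
        return None  # the name does not appear in the sentence
    return (start, start + len(needle))


sentence = "The African pitta (Pitta angolensis) is an Afrotropical bird of the family Pittidae."
span = find_span(sentence, "african pitta")
print(span)                       # (4, 17)
print(sentence[span[0]:span[1]])  # African pitta
```

One caveat: the offsets are computed on the accent-stripped string, so they only line up with the original text when every accented character there is a single precomposed code point.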

In [23]:
from spacy.tokens import DocBin
from spacy.util import filter_spans

nlp = spacy.blank("en")
skips = 0
# the DocBin will store the example documents
db = DocBin()
for text, annotations in training_data:
  doc = nlp(text)
  ents = []
  for start, end, label in annotations:
    span = doc.char_span(start, end, label=label, alignment_mode="contract")
    if span is None:
      skips += 1
    else:
      ents.append(span)
  filtered_ents = filter_spans(ents)
  doc.ents = filtered_ents
  db.add(doc)

print(f"Skipped {skips} entries")
db.to_disk("./birds_train.spacy")

Skipped 11 entries


In [24]:
nlp = spacy.blank("en")
skips = 0
# the DocBin will store the example documents
db = DocBin()
for text, annotations in test_data:
  doc = nlp(text)
  ents = []
  for start, end, label in annotations:
    span = doc.char_span(start, end, label=label, alignment_mode="contract")
    if span is None:
      skips += 1
    else:
      ents.append(span)
  filtered_ents = filter_spans(ents)
  doc.ents = filtered_ents
  db.add(doc)

print(f"Skipped {skips} entries")
db.to_disk("./birds_test.spacy")

Skipped 0 entries


In [25]:
# train the model
!python -m spacy init fill-config birds_config.cfg config.cfg

# birds_test.spacy is used as the validation (dev) set. Windows users are better off using PowerShell ("cd src" and run the command without the leading !)
!python -m spacy train config.cfg --output ./output --paths.train ./birds_train.spacy --paths.dev ./birds_test.spacy 

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy

✘ Invalid config override '#': name should start with --
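
The error message suggests that a literal `#` reached the spaCy CLI as an argument, which can happen when the shell running the command does not treat `#` as the start of a comment. That reading is an assumption. Under it, one way around the problem is to build the argument list in Python and run it without a shell. A minimal sketch, where `build_train_command` is a hypothetical helper and the paths are the ones used in the cell above:

```python
import sys


def build_train_command(config="config.cfg", output="./output",
                        train="./birds_train.spacy", dev="./birds_test.spacy"):
    """Return the `spacy train` invocation as an argument list (hypothetical helper)."""
    return [
        sys.executable, "-m", "spacy", "train", config,
        "--output", output,
        "--paths.train", train,
        "--paths.dev", dev,
    ]


command = build_train_command()
print(" ".join(command))

# To actually launch training (requires spaCy and the files above):
# import subprocess
# subprocess.run(command, check=True)
```

Because the arguments are passed as a list, no shell parses them, so a stray comment cannot end up in the command.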



In [26]:
my_nlp = spacy.load("./output/model-best")
for text in open(shorebirder_posts_file, "r").readlines()[4:5]:
  doc = my_nlp(text)
  if len(doc.ents) > 0:
    displacy.render(doc, jupyter=True, style="ent")

The NLP approach failed... one last attempt using spaCy.

In [79]:
nlp = spacy.load("en_core_web_lg")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [116]:
def search_bird_dbpedia(token):
  return s.query('''
    SELECT *
    WHERE {
      ?bird a dbo:Bird ;
            rdfs:label ?name ;
            dbo:abstract ?comment .

      filter (!isLiteral(?name) ||
              langmatches(lang(?name), "en")) .

      filter (!isLiteral(?comment) ||
              langmatches(lang(?comment), "en")) .

      filter (CONTAINS(LCASE(STR(?name)), "{token}")) .
    }
    limit 5
  '''.replace("{token}", token))

search_bird_dbpedia("falcon")

[{'bird': {'type': 'uri',
   'value': 'http://dbpedia.org/resource/Collared_falconet'},
  'name': {'type': 'literal', 'xml:lang': 'en', 'value': 'Collared falconet'},
  'comment': {'type': 'literal',
   'xml:lang': 'en',
   'value': 'The collared falconet (Microhierax caerulescens) is a species of bird of prey in the family Falconidae. It is found in the Indian Subcontinent and Southeast Asia, ranging across Bangladesh, Bhutan, Cambodia, India, Laos, Myanmar, Nepal, Thailand, Malaysia, and Vietnam.Its natural habitat is temperate forest, often on the edges of broadleaf forest. It is 18 cm long. Rapid wingbeats are interspersed with long glides. When perched, it is described as being "rather shrikelike."'}},
 {'bird': {'type': 'uri',
   'value': 'http://dbpedia.org/resource/Peregrine_falcon'},
  'name': {'type': 'literal', 'xml:lang': 'en', 'value': 'Peregrine falcon'},
  'comment': {'type': 'literal',
   'xml:lang': 'en',
   'value': 'The peregrine falcon (Falco peregrinus), also known

In [None]:
maybe_matches = {}
for text in open(shorebirder_posts_file, "r").readlines()[4:5]:
  doc = nlp(text)
  for chunk in doc.noun_chunks:
    for token in chunk:
      if token.pos_ == 'NOUN':
        results = search_bird_dbpedia(token.lemma_)
        if len(results) > 0:
          maybe_matches[token] = results
          print(token)

OK, better. But it is very slow and we are hammering DBpedia with queries. Let's combine both strategies.

From the first attempt we take the results for all the birds and turn them into a dictionary so that lookups are easier.
With spaCy we only tokenize and get the `noun chunks`, then look the birds up with fuzzy search.
To make spaCy faster we disable the pipeline components we don't use, which are `lemmatizer` and `ner`.
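
Independently of the combined strategy, if the same noun shows up several times in a post, the per-token lookup asks the endpoint the same question again each time. A minimal sketch of memoising such a lookup with `functools.lru_cache`. `make_cached_search` and `fake_search` are hypothetical names; in the notebook, `search_bird_dbpedia` could be wrapped in the same way.

```python
from functools import lru_cache


def make_cached_search(search):
    """Wrap a token -> results function so each token is only looked up once."""
    @lru_cache(maxsize=None)
    def cached(token):
        # store a tuple so the cached value cannot be modified by accident
        return tuple(search(token))
    return cached


# Stand-in for the remote lookup that records every real call
calls = []


def fake_search(token):
    calls.append(token)
    return [{"name": {"value": f"{token} (fake result)"}}] if token == "falcon" else []


cached_search = make_cached_search(fake_search)
for token in ["falcon", "gull", "falcon", "falcon"]:
    cached_search(token)

print(calls)  # ['falcon', 'gull']
```

This cache lives only in memory. Persisting it to disk would be a further step that is not sketched here.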

In [200]:
birds_dic = {}
for bird in birds:
  key = bird["name"]["value"].lower()
  birds_dic[key] = {
    "name": bird["name"]["value"],
    "url": bird["bird"]["value"],
    "description": bird["comment"]["value"],
  }
birds_keys = birds_dic.keys() # we will search by key
assert len(birds) == len(birds_keys) # make sure no two birds share the same key

In [168]:
# disable the spaCy pipes we do not need
slim_nlp = spacy.load("en_core_web_lg", disable=['lemmatizer', 'ner'])

for text in open(shorebirder_posts_file, "r").readlines()[4:5]:
  doc = slim_nlp(text)
  displacy.render(doc, jupyter=True)


In [None]:
# make the tokenizer keep hyphenated words together
# https://stackoverflow.com/questions/59993683/how-can-i-get-spacy-to-stop-splitting-both-hyphenated-numbers-and-words-into-sep

from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

def custom_tokenizer(slim_nlp):
    inf = list(slim_nlp.Defaults.infixes)               # Default infixes
    inf.remove(r"(?<=[0-9])[+\-\*^](?=[0-9-])")    # Remove the generic op between numbers or between a number and a -
    inf = tuple(inf)                               # Convert inf to tuple
    infixes = inf + tuple([r"(?<=[0-9])[+*^](?=[0-9-])", r"(?<=[0-9])-(?=-)"])  # Add the removed rule after subtracting (?<=[0-9])-(?=[0-9]) pattern
    infixes = [x for x in infixes if '-|–|—|--|---|——|~' not in x] # Remove - between letters rule
    infix_re = compile_infix_regex(infixes)

    return Tokenizer(slim_nlp.vocab, prefix_search=slim_nlp.tokenizer.prefix_search,
                                suffix_search=slim_nlp.tokenizer.suffix_search,
                                infix_finditer=infix_re.finditer,
                                token_match=slim_nlp.tokenizer.token_match,
                                rules=slim_nlp.Defaults.tokenizer_exceptions)

slim_nlp.tokenizer = custom_tokenizer(slim_nlp)

# Test
for text in open(shorebirder_posts_file, "r").readlines()[4:5]:
  doc = slim_nlp(text)
  displacy.render(doc, jupyter=True)

In [186]:
from fuzzywuzzy import fuzz
def search_bird_dict(chunk):
  best_match = (False, None, 0)
  for key in birds_keys:
    score = fuzz.ratio(key, chunk)
    if score > 80 and score > best_match[-1]:
      # print(f"\tchunk: '{chunk}' compare '{key}' score: '{score}'")
      best_match = (True, key, score)
  return best_match

def get_nouns(chunk):
  only_nouns = []
  for token in chunk:
    if token.pos_ == "NOUN" or token.pos_ == "ADJ":
      only_nouns.append(token.lower_)

  if len(only_nouns) > 0:
    return " ".join(only_nouns)
  return None

maybe_matches = []
for text in open(shorebirder_posts_file, "r").readlines()[4:5]:
  doc = slim_nlp(text)
  for chunk in [get_nouns(chunk) for chunk in doc.noun_chunks]:
    if chunk is not None:
      (found, bird_key, score) = search_bird_dict(chunk)
      if found:
        maybe_matches.append(bird_key)
        print(f"Found '{bird_key}' in '{chunk}' with score of {score}")
          
# Test result
print(maybe_matches)

Found 'ring-billed gull' in 'ring-billed gull' with score of 100
Found 'peregrine falcon' in 'female peregrine falcon' with score of 82
['ring-billed gull', 'peregrine falcon']


We tuned it so that it picks up the birds in that sentence. Let's see whether it can also get the ones in the other sentences.

In [187]:
maybe_matches = []
for text in open(shorebirder_posts_file, "r").readlines():
  doc = slim_nlp(text)
  for chunk in [get_nouns(chunk) for chunk in doc.noun_chunks]:
    if chunk is not None:
      (found, bird_key, score) = search_bird_dict(chunk)
      if found:
        maybe_matches.append(bird_key)
        print(f"Found '{bird_key}' in '{chunk}' with score of {score}")

print(maybe_matches)

Found 'ring-billed gull' in 'ring-billed gull' with score of 100
Found 'peregrine falcon' in 'female peregrine falcon' with score of 82
Found 'gull' in 'gull' with score of 100
Found 'gull' in 'gull' with score of 100
Found 'gull' in 'gulls' with score of 89
Found 'gull' in 'gulls' with score of 89
Found 'least bittern' in 'least bit' with score of 82
Found 'triller' in 'tiller' with score of 92
Found 'khopesh' in 'hopes' with score of 83
Found 'american bittern' in 'american bittern' with score of 100
Found 'sunbittern' in 'bittern' with score of 82
Found 'gull' in 'gulls' with score of 89
Found 'tern' in 'terns' with score of 89
Found 'khopesh' in 'hopes' with score of 83
Found 'owl' in 'owls' with score of 86
Found 'great horned owl' in 'great horned owls' with score of 97
Found 'black-crowned night heron' in 'only black-crowned night-herons' with score of 86
Found 'egret' in 'egrets' with score of 91
Found 'gull' in 'gulls' with score of 89
Found 'common blackbird' in 'common backy
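
Several of the matches above are false positives ('khopesh' for 'hopes', 'least bittern' for 'least bit'), while most correct plural matches score just under 90. A minimal sketch of one possible tightening: try a naive singular form of the chunk and raise the threshold to 90. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio`, so scores may not always coincide with fuzzywuzzy's. The names `score`, `candidates` and `best_match` are hypothetical.

```python
from difflib import SequenceMatcher


def score(a, b):
    # similarity on a 0-100 scale
    return round(100 * SequenceMatcher(None, a, b).ratio())


def candidates(chunk):
    # the chunk itself plus a naive singular ("gulls" -> "gull")
    forms = [chunk]
    if chunk.endswith("s") and len(chunk) > 3:
        forms.append(chunk[:-1])
    return forms


def best_match(chunk, keys, threshold=90):
    """Return (key, score) for the best key at or above threshold, else None."""
    best = None
    for key in keys:
        s = max(score(key, form) for form in candidates(chunk))
        if s >= threshold and (best is None or s > best[1]):
            best = (key, s)
    return best


# a handful of keys taken from the output above
keys = ["gull", "khopesh", "least bittern", "great horned owl"]
for chunk in ["gulls", "hopes", "least bit", "great horned owls"]:
    print(chunk, "->", best_match(chunk, keys))
```

This is not a complete fix: the 'tiller' → 'triller' match from the output above (score 92) would still pass a threshold of 90.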

In [210]:
result_lines = []
for bird_key in set(maybe_matches):
  bird = dict(birds_dic[bird_key])
  name = bird["name"]
  url = bird["url"]
  result_lines.append(f"Found '{name}' with DBpedia entry '{url}'.")

result_lines_file = write(data_posts_path, "shorebirder_results", result_lines)
open(result_lines_file, "r").readlines()[0:5]

["Found 'Yellow-bellied dacnis' with DBpedia entry 'http://dbpedia.org/resource/Yellow-bellied_dacnis'.\n",
 "Found 'Grey gull' with DBpedia entry 'http://dbpedia.org/resource/Grey_gull'.\n",
 "Found 'Great horned owl' with DBpedia entry 'http://dbpedia.org/resource/Great_horned_owl'.\n",
 "Found 'Trogon' with DBpedia entry 'http://dbpedia.org/resource/Trogon'.\n",
 "Found 'Yellow-bellied sapsucker' with DBpedia entry 'http://dbpedia.org/resource/Yellow-bellied_sapsucker'.\n"]