# P3 Bird sightings

## Getting the data (scraping)

First we need to get the data. To do so, we visit the bird-sighting websites [shorebirder](https://www.shorebirder.com/), [trevorsbirding](https://www.trevorsbirding.com/) and [dantallmansbirdblog](https://dantallmansbirdblog.blogspot.com/).

While browsing the sites with the browser inspector (F12), we can see that the descriptions we need live inside paragraph tags (between *\<p\> TEXT \</p\>*). Knowing that, we create utility functions that download a page's content and extract the text.

The downloads are saved in `data/raw`, while the extracted texts are saved in `data/posts`.

** `dantallmansbirdblog` has a slightly different structure (between *\<p\>\</p\> TEXT \<p\>\</p\>*), so the `get_texts` function (below) would need to be modified to extract its texts.
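
As a rough sketch of that modification, assuming the post text really sits between empty `<p></p>` markers as described above, one could split on those markers instead of capturing the contents of each `<p>` element. The name `get_texts_between_empty_p` and the sample HTML are illustrative assumptions, not taken from the blog. Unlike `get_texts`, this version replaces tags and entities with a space and then collapses whitespace.

```python
import re

# Hypothetical variant of get_texts for pages where the text sits
# between empty <p></p> markers instead of inside <p>...</p>.
EMPTY_P = re.compile(r'<p>\s*</p>')
TAGS_AND_ENTITIES = re.compile(r'<[^>]*>|&#?\w+;')
SPACES = re.compile(r'\s+')

def get_texts_between_empty_p(html):
  # Keep only what lies between the first and the last marker.
  chunks = EMPTY_P.split(html)[1:-1]
  texts = []
  for chunk in chunks:
    # Replace tags and entities with a space so words do not get joined.
    text = TAGS_AND_ENTITIES.sub(" ", chunk.lower())
    text = SPACES.sub(" ", text).strip()
    if text:
      texts.append(text)
  return texts

# Made-up sample, only to exercise the function.
sample = "<div><p></p>A <b>Snowy Owl</b> flew by.<p></p>Second&nbsp;post<br/>here.<p></p></div>"
print(get_texts_between_empty_p(sample))
```

The function takes the HTML as a string rather than a filename, so it can be tried on small snippets before being wired into the download step.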

In [1]:
import os
import re
import wget

data_raw_path = "../data/raw"
data_posts_path = "../data/posts"

def maybe_mkdir(path):
  # Create the directory (and any missing parents); do nothing if it exists.
  os.makedirs(path, exist_ok=True)

maybe_mkdir("../data")
maybe_mkdir(data_raw_path)
maybe_mkdir(data_posts_path)

def download(url, out_label):
  return wget.download(url, out=f"{data_raw_path}/{out_label}")

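def get_texts(filename):
  # Read the downloaded HTML file.
  with open(filename, 'r') as file:
    text = file.read()

  # Get the contents of every <p>...</p>; DOTALL lets "." span newlines.
  get_p = re.compile(r'<p>(.*?)</p>', re.DOTALL)
  texts = get_p.findall(text)

  # Remove styling and inner tags, literal "\n" sequences, repeated spaces,
  # backslashes and HTML entities. Entities are dropped rather than decoded,
  # so the words around an entity such as &nbsp; end up joined.
  remove_tags = re.compile(r'(<.*?>)|\\n| +(?= )|\\|\&.+?\;')
  return [remove_tags.sub("", paragraph.lower()) for paragraph in texts]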

def write_posts(filename, posts):
  filepath = f"{data_posts_path}/{filename}.txt"
  # Write one post per line.
  with open(filepath, "w") as file:
    for post in posts:
      file.write(post + "\n")
  return filepath

Next, using the functions above, we scrape the `shorebirder` home page.

In [2]:
shorebirder_filename = "shorebirder_home.html"
shorebirder_home = download("https://www.shorebirder.com/", shorebirder_filename)
posts = get_texts(shorebirder_home)
shorebirder_posts_file = write_posts(shorebirder_filename, posts)

open(shorebirder_posts_file, "r").readlines()[0]

"my late march solo visit to norway is in the books and was about as much fun as i've had in a while. the middle few days of the trip were spent birding around varanger, bookended by more touristy time intromsøand oslo. at some point in the coming months there will be a full trip report here plus a very detailed cloudbirders submission. in the meantime, here is some proof that i actually went.\n"
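
Note the joined word "intromsøand" in the output above. This is consistent with `get_texts` deleting HTML entities (for example `&nbsp;`) instead of turning them into the characters they stand for. A minimal sketch of an alternative cleanup, using only the standard library; the name `clean_paragraph` and the sample string are assumptions made for illustration:

```python
import html
import re

TAGS = re.compile(r'<[^>]*>')
SPACES = re.compile(r'\s+')

def clean_paragraph(raw):
  # Drop inner tags first, then decode entities (&nbsp; becomes U+00A0).
  text = TAGS.sub("", raw)
  text = html.unescape(text)
  # \s also matches the non-breaking space, so it collapses to a plain space.
  return SPACES.sub(" ", text).strip().lower()

# Made-up sample resembling the paragraph above.
sample = "touristy time in&nbsp;<i>Troms&oslash;</i>&nbsp;and Oslo"
print(clean_paragraph(sample))
```

Removing tags before decoding means that escaped markup such as `&lt;b&gt;` is kept as text instead of being mistaken for a tag.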

The same goes for `trevorsbirding`.

In [3]:
trevorsbirding_filename = "trevorsbirding_home.html"
trevorsbirding_home = download("https://www.trevorsbirding.com/", trevorsbirding_filename)
posts = get_texts(trevorsbirding_home)
trevorsbirding_posts_file = write_posts(trevorsbirding_filename, posts)

open(trevorsbirding_posts_file, "r").readlines()[2]

'earlier this week i glanced out of my sunroom window to check whether there were any birds at my birdbaths. i currently have three birdbaths just outside the room, one on the ground, one on a pedestal at about 60cm and one hanging from a tree branch at a height of about 1.5 metres. i was delighted to see a small flock of purple-crowned lorikeets having a drink and dipping into the water for a bath. i have just checked my list of species to have visited the birdbaths. this was bird species number 36, in addition to the three reptiles and two mammal species.\n'

## Analyzing the articles

In [4]:
# Load the NLP dependencies
import spacy
from collections import Counter
from spacy import displacy
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_lg")

In [5]:
for text in open(shorebirder_posts_file, "r").readlines()[3:5]:
  train = nlp(text)
  displacy.render(train, jupyter=True, style="ent")

This `nlp` pipeline does not work for us as-is, or it still needs training: it is not able to find bird names.

Let's try to teach it using bird names from DBpedia.

In [6]:
# Prefixes and class based on https://github.com/ejrav/pydbpedia
from SPARQLWrapper import SPARQLWrapper, JSON
import sys
import pprint
class SparqlEndpoint(object):

    def __init__(self, endpoint, prefixes={}):
        self.sparql = SPARQLWrapper(endpoint)
        self.prefixes = {
            "dbpedia-owl": "http://dbpedia.org/ontology/",
            "owl": "http://www.w3.org/2002/07/owl#",
            "xsd": "http://www.w3.org/2001/XMLSchema#",
            "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
            "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
            "foaf": "http://xmlns.com/foaf/0.1/",
            "dc": "http://purl.org/dc/elements/1.1/",
            "dbpedia2": "http://dbpedia.org/property/",
            "dbpedia": "http://dbpedia.org/",
            "skos": "http://www.w3.org/2004/02/skos/core#",
            }
        self.prefixes.update(prefixes)
        self.sparql.setReturnFormat(JSON)

    def query(self, q):
        lines = ["PREFIX %s: <%s>" % (k, r) for k, r in self.prefixes.items()]
        lines.extend(q.split("\n"))
        query = "\n".join(lines)
        self.sparql.setQuery(query)
        results = self.sparql.query().convert()
        return results["results"]["bindings"]


class DBpediaEndpoint(SparqlEndpoint):
    def __init__(self, endpoint, prefixes = {}):
        super(DBpediaEndpoint, self).__init__(endpoint, prefixes)

In [7]:
s = DBpediaEndpoint(endpoint = "http://dbpedia.org/sparql")

results = s.query("""
  SELECT DISTINCT *
  WHERE {
    ?bird a dbpedia-owl:Bird ;
          rdfs:label ?name .

    filter (!isLiteral(?name) ||
            langmatches(lang(?name), "en"))
  }
""")
print(len(results))
for d in results[0:5]:
    print(d['name']['value'])

10000
Actenoides
African goshawk
African pitta
African red-eyed bulbul
Alcedo


In [10]:
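config = {
   # The posts were lowercased by get_texts, so match case-insensitively.
   "phrase_matcher_attr": "LOWER",
   "validate": True,
   "overwrite_ents": False,
   "ent_id_sep": "||",
}
# Reuse the ruler if this cell has already been run.
if "entity_ruler" in nlp.pipe_names:
  ruler = nlp.get_pipe("entity_ruler")
else:
  ruler = nlp.add_pipe("entity_ruler", config=config)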

patterns = []
for bird in results:
  name = bird['name']['value']
  patterns.append({"label": "BIRD", "pattern": name})

for d in patterns[0:5]:
    print(d)

ruler.add_patterns(patterns=patterns)

for text in open(shorebirder_posts_file, "r").readlines()[4:5]:
  doc = nlp(text)
  # for np in doc.noun_chunks:
  #   for noun in np.ents:
  #     print(noun.lemma_)
  #     print(noun.label_)
  #     print(list(noun.lefts))
  #     print("-"*10)
  displacy.render(doc, jupyter=True, style="ent")

{'label': 'BIRD', 'pattern': 'Actenoides'}
{'label': 'BIRD', 'pattern': 'African goshawk'}
{'label': 'BIRD', 'pattern': 'African pitta'}
{'label': 'BIRD', 'pattern': 'African red-eyed bulbul'}
{'label': 'BIRD', 'pattern': 'Alcedo'}
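
Before handing the names to the ruler it may be worth normalizing them. The helper below is only a sketch under stated assumptions: the name `build_patterns` is not from the notebook, and the input is assumed to be shaped like the `results` bindings above. It skips empty names and drops duplicates that differ only in case.

```python
def build_patterns(bindings, label="BIRD"):
    """Turn SPARQL bindings into entity-ruler style pattern dicts."""
    seen = set()
    patterns = []
    for row in bindings:
        name = row["name"]["value"].strip()
        key = name.lower()
        if not name or key in seen:
            continue  # skip blanks and case-insensitive duplicates
        seen.add(key)
        patterns.append({"label": label, "pattern": name})
    return patterns

# Made-up bindings, only to exercise the function.
sample = [
    {"name": {"value": "African goshawk"}},
    {"name": {"value": "African Goshawk"}},
    {"name": {"value": " "}},
    {"name": {"value": "Alcedo"}},
]
print(build_patterns(sample))
```

The returned list has the same shape as `patterns` in the cell above, so it could be passed to `ruler.add_patterns` in its place.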


