Regex discards important text #1

gvilarino · 2013-08-13T20:24:22Z

The scraper misdetects article text and discards important data. For instance, scraping Cedom's bill 400 yields the following result:

{
"sancion": "01/06/2000",
"publicacion": "BOCBA N° 989 del 21/07/2000",
"promulgacion": "De Hecho del 03/07/2000",
"_id": {
  "$oid": "520a8ec68be1e20000000002"
},
"articulos": [
  {
    "articulo": "</b> Proh&iacute;bese a los establecimientos educativos ",
    "_id": {
      "$oid": "520a8ec68be1e20000000008"
    }
  },
  {
    "articulo": "</b> Ning&uacute;n alumno, con motivo de mora en el ",
    "_id": {
      "$oid": "520a8ec68be1e20000000007"
    }
  },
  {
    "articulo": " </b>los alumnos de los establecimientos citados ",
    "_id": {
      "$oid": "520a8ec68be1e20000000006"
    }
  },
  {
    "articulo": "</b> De verse configurados los extremos descriptos en ",
    "_id": {
      "$oid": "520a8ec68be1e20000000005"
    }
  },
  {
    "articulo": "</b> La Secretar&iacute;a de Educaci&oacute;n podr&aacute; ",
    "_id": {
      "$oid": "520a8ec68be1e20000000004"
    }
  },
  {
    "articulo": "</b> Comun&iacute;quese, etc</P>",
    "_id": {
      "$oid": "520a8ec68be1e20000000003"
    }
  }
],
"__v": 0
}

As you can see, each article's text is incomplete: it is cut off mid-sentence and starts with a leftover </b> tag.

In other cases, such as when part of the text contains double quotes (e.g. "Some text"), all of the article's text up to that point is also discarded.

As a general rule, ALL text between two consecutive article titles should be included as part of the article that the first title opens.
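
To make the rule above concrete, here is a minimal sketch in Python (the scraper's own language is not shown in this thread). It assumes article titles look like "<b>Artículo 1º.-</b>"; that pattern, the name TITLE_RE, and the function split_articles are hypothetical and would need adjusting to the real markup.

```python
import re
from typing import List

# Hypothetical title pattern: assumes titles look like "<b>Artículo 1º.-</b>".
# The real markup on the scraped pages may differ.
TITLE_RE = re.compile(
    r"<b>\s*Art[ií]culo\s+\d+\s*[º°]?\s*\.?\s*-?\s*</b>",
    re.IGNORECASE,
)

def split_articles(html: str) -> List[str]:
    """Return one string per article: everything between consecutive titles."""
    titles = list(TITLE_RE.finditer(html))
    articles = []
    for i, match in enumerate(titles):
        start = match.end()
        # The article runs until the next title, or to the end of the input.
        end = titles[i + 1].start() if i + 1 < len(titles) else len(html)
        articles.append(html[start:end].strip())
    return articles

sample = (
    '<b>Artículo 1º.-</b> First article with "quoted text" inside.\n'
    "Second line of the first article.\n"
    "<b>Artículo 2º.-</b> Second article."
)
articles = split_articles(sample)
```

Because the slice boundaries come only from the title positions, nothing inside an article (line breaks, double quotes, nested tags) can shorten it.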


ultraklon · 2013-08-29T04:06:24Z

Guido, can you check whether this is solved? The data is still dirty with HTML, but at least it is no longer lost. Please check.
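
For the leftover HTML, one possible cleanup pass is sketched below in Python. It is only an illustration, not what the scraper does: the helper name clean_fragment is hypothetical, and the regex-based tag stripping assumes simple fragments like the ones in the output above, not arbitrary HTML.

```python
import html
import re

# Matches anything that looks like an HTML tag, e.g. "</b>" or "</P>".
TAG_RE = re.compile(r"<[^>]+>")

def clean_fragment(fragment: str) -> str:
    """Strip tags, decode entities, and collapse whitespace."""
    # Remove tags first so that escaped text such as "&lt;b&gt;" is not
    # mistaken for markup after decoding.
    text = TAG_RE.sub("", fragment)
    # Decode entities such as "&iacute;" into the characters they stand for.
    text = html.unescape(text)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())

cleaned = clean_fragment("</b> Comun&iacute;quese, etc</P>")
```

Running the cleanup as a separate step after extraction keeps the raw text available, so a cleanup mistake cannot cause data loss again.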

ghost assigned ultraklon Aug 13, 2013

ultraklon mentioned this issue Aug 21, 2013

Parser says Error parsing: SyntaxError: Unexpected token #7

Open

ultraklon added a commit that referenced this issue Aug 29, 2013

modified the scrapper to create an object directly instead of a string and parse it and this fixed a lot of problems, we are very happy now. #1 #2

c12aace
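
The commit message describes building an object directly instead of assembling a string and parsing it. The sketch below, in Python, illustrates why that tends to be more robust; it is an assumption that this mirrors the original bug, and the variable names are hypothetical. Text containing double quotes, as reported in this issue, is a typical input that breaks hand-built JSON.

```python
import json

article_text = 'Text with "double quotes" inside'

# Fragile: splicing raw text into a JSON string by hand.
hand_built = '{"articulo": "' + article_text + '"}'
try:
    json.loads(hand_built)
    hand_built_ok = True
except json.JSONDecodeError:
    # The unescaped quotes end the string early and break the JSON syntax.
    hand_built_ok = False

# Sturdier: build the object directly. If it ever has to be serialized,
# the serializer escapes the quotes itself.
record = {"articulo": article_text}
round_tripped = json.loads(json.dumps(record))
```

With the object built directly, the article text is never interpreted as syntax, so quotes and other special characters pass through unchanged.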
