Regex discards important text #1

Open
gvilarino opened this issue Aug 13, 2013 · 1 comment

@gvilarino
Member

The scraper misdetects content and discards important data. For instance, scraping Cedom's bill 400 yields the following result:

{
"sancion": "01/06/2000",
"publicacion": "BOCBA N° 989 del 21/07/2000",
"promulgacion": "De Hecho del 03/07/2000",
"_id": {
  "$oid": "520a8ec68be1e20000000002"
},
"articulos": [
  {
    "articulo": "</b> Proh&iacute;bese a los establecimientos educativos ",
    "_id": {
      "$oid": "520a8ec68be1e20000000008"
    }
  },
  {
    "articulo": "</b> Ning&uacute;n alumno, con motivo de mora en el ",
    "_id": {
      "$oid": "520a8ec68be1e20000000007"
    }
  },
  {
    "articulo": " </b>los alumnos de los establecimientos citados ",
    "_id": {
      "$oid": "520a8ec68be1e20000000006"
    }
  },
  {
    "articulo": "</b> De verse configurados los extremos descriptos en ",
    "_id": {
      "$oid": "520a8ec68be1e20000000005"
    }
  },
  {
    "articulo": "</b> La Secretar&iacute;a de Educaci&oacute;n podr&aacute; ",
    "_id": {
      "$oid": "520a8ec68be1e20000000004"
    }
  },
  {
    "articulo": "</b> Comun&iacute;quese, etc</P>",
    "_id": {
      "$oid": "520a8ec68be1e20000000003"
    }
  }
],
"__v": 0
}

As you can see, the text of each article is incomplete.

In other cases, such as when part of the text contains double quotes (e.g. "Some text"), all of the article's text up to that point is also discarded.

As a general rule, ALL text between two articles' titles should be included as part of the article.
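
To illustrate that rule, here is a minimal sketch in Python of one way to slice a page by title positions instead of trying to match each article body with a single regex. It is not the scraper's actual code. The title pattern (`<b>Artículo 1º.-</b>` or `<b>Art. 2º.-</b>`), the `split_articles` name, and the `numero` field are all assumptions made for the example, and the real Cedom markup may differ.

```python
import html
import re

# Hypothetical title pattern, e.g. "<b>Art&iacute;culo 1&ordm;.-</b>" or
# "<b>Art. 2&ordm;.-</b>". Adjust it to whatever the real markup uses.
TITLE_RE = re.compile(
    r"<b>\s*Art(?:\.|&iacute;culo|ículo)\s*(\d+)[^<]*</b>",
    re.IGNORECASE,
)
# Naive tag matcher, good enough for a sketch but not a full HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def split_articles(page: str) -> list[dict]:
    """Return one entry per article, keeping ALL text between titles."""
    titles = list(TITLE_RE.finditer(page))
    articles = []
    for i, match in enumerate(titles):
        # The body runs from the end of this title to the start of the
        # next title, or to the end of the input for the last article.
        end = titles[i + 1].start() if i + 1 < len(titles) else len(page)
        raw = page[match.end():end]
        # Drop leftover tags, decode entities, and collapse whitespace.
        text = html.unescape(TAG_RE.sub(" ", raw))
        articles.append(
            {"numero": int(match.group(1)), "articulo": " ".join(text.split())}
        )
    return articles
```

Because the body is whatever lies between two title matches, quotes, line breaks, or nested tags inside an article cannot cut it short. A proper HTML parser would likely be more robust than the tag-stripping regex used here.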

@ghost ghost assigned ultraklon Aug 13, 2013
ultraklon added a commit that referenced this issue Aug 29, 2013
…g and parse it and this fixed a lot of problems, we are very happy now. #1 #2
@ultraklon
Contributor

Guido, can you check whether this is solved? The data is still dirty with HTML, but at least nothing is lost.
