The scraper misdetects content and discards important data. For instance, scraping Cedom's bill 400 yields the following result:
{ "sancion": "01/06/2000", "publicacion": "BOCBA N� 989 del 21/07/2000", "promulgacion": "De Hecho del 03/07/2000", "_id": { "$oid": "520a8ec68be1e20000000002" }, "articulos": [ { "articulo": "</b> Prohíbese a los establecimientos educativos ", "_id": { "$oid": "520a8ec68be1e20000000008" } }, { "articulo": "</b> Ningún alumno, con motivo de mora en el ", "_id": { "$oid": "520a8ec68be1e20000000007" } }, { "articulo": " </b>los alumnos de los establecimientos citados ", "_id": { "$oid": "520a8ec68be1e20000000006" } }, { "articulo": "</b> De verse configurados los extremos descriptos en ", "_id": { "$oid": "520a8ec68be1e20000000005" } }, { "articulo": "</b> La Secretaría de Educación podrá ", "_id": { "$oid": "520a8ec68be1e20000000004" } }, { "articulo": "</b> Comuníquese, etc</P>", "_id": { "$oid": "520a8ec68be1e20000000003" } } ], "__v": 0 }
As you can see, the text of each article is incomplete.
In other cases, such as when part of the text contains double quotes (e.g. "Some text"), all of the article's text up to that point is also discarded.
As a general rule, ALL text between two articles' titles should be included as part of the article.
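The rule above could be sketched as follows. This is a minimal illustration in Python, not the project's actual scraper: the heading pattern (`Artículo 1º.-` and similar), the `numero` key, the function name `split_articles`, and the sample text are all assumptions, since this issue does not show Cedom's raw markup.

```python
import re

# Assumed title format, e.g. "Artículo 1º.-" or "Art. 2°". Hypothetical:
# the real Cedom markup is not shown in this issue.
ARTICLE_TITLE = re.compile(
    r"Art(?:ículo|\.)\s*(\d+)\s*[º°]?\s*\.?\s*-?", re.IGNORECASE
)


def split_articles(text):
    """Return one dict per article, keeping ALL text up to the next title."""
    matches = list(ARTICLE_TITLE.finditer(text))
    articles = []
    for i, match in enumerate(matches):
        start = match.end()
        # The article body runs until the next title, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        articles.append(
            {"numero": int(match.group(1)), "articulo": text[start:end].strip()}
        )
    return articles


# Made-up sample text; note the double quotes inside the first article.
sample = (
    'Artículo 1º.- Prohíbese la medida "de ejemplo" en todos los casos. '
    "Artículo 2º.- Comuníquese, etc."
)
for article in split_articles(sample):
    print(article)
```

Slicing between consecutive title matches means quotes, tags, or line breaks inside an article cannot cut it short. One limitation of this sketch: a cross-reference such as "según el Artículo 1º" inside an article body would be mistaken for a title, so a real implementation would probably need to anchor on the markup around the titles.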
Modified the scraper to create an object directly instead of a string that then gets parsed; this fixed a lot of problems, we are very happy now. #1 #2 (commit c12aace)
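To illustrate why building the object directly can help, here is a small sketch in Python. The scraper's own code is not shown in this issue, so the function names and the sample text are hypothetical. Concatenating a JSON string by hand breaks as soon as the article text contains double quotes, which is consistent with the symptom reported in the issue; building the object directly skips the parsing step altogether.

```python
import json


def article_via_string(text):
    """Fragile: hand-built JSON; quotes inside `text` are not escaped."""
    raw = '{"articulo": "' + text + '"}'
    return json.loads(raw)


def article_via_object(text):
    """Robust: build the object directly, so no parsing step is needed."""
    return {"articulo": text}


# Made-up article text containing double quotes.
text = 'Prohíbese la medida "de ejemplo" en todos los casos.'

try:
    article_via_string(text)
except json.JSONDecodeError as error:
    print("string approach failed:", error)

# Serialization escapes the inner quotes correctly when it is needed.
article = article_via_object(text)
print(json.dumps(article, ensure_ascii=False))
```

With the object approach, escaping only happens at serialization time and is handled by the library, so the content of the text can no longer affect how the record is built.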
Guido, can you check whether this is solved? The data is still dirty with HTML, but at least it is no longer lost.
ultraklon