Rework notebooks to use the static self-hosted fake job board #350

martin-martin · 2022-12-21T13:56:42Z

indeed.com has tightened their bot protection against web scraping, which is why the requests to their site described in this course now return 403 Forbidden status codes.

I've attempted to circumvent this using fake headers (something that would be explainable in an intro course), but no luck: the 403 prevails.

I've previously [reworked the written tutorial](https://realpython.com/beautiful-soup-web-scraper-python/#step-1-inspect-your-data-source) to use a self-hosted [fake job board](https://realpython.github.io/fake-jobs/) that I set up just for the purpose of the tutorial.

As a quick fix for the video course, I added an explanatory lesson to the video course and reworked the Jupyter notebooks.

The information and processes that I explain in the rest of the course are still valid and a good introduction to how to approach scraping a static website.
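
For context, the "fake headers" attempt described above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: it uses the standard library's `urllib.request` so that it runs without extra packages, the header values are hypothetical (the PR does not list which ones were tried), and `build_request` and `fetch_status` are helper names invented for this sketch. The request is only built here, not sent.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# URL of the self-hosted fake job board linked in this PR.
FAKE_JOBS_URL = "https://realpython.github.io/fake-jobs/"

# Hypothetical browser-like headers. The PR does not say which headers
# were actually tried against indeed.com.
BROWSER_LIKE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/108.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url, headers=None):
    """Create a request object with optional extra headers (nothing is sent)."""
    return Request(url, headers=headers or {})


def fetch_status(request, timeout=10):
    """Send the request and return the HTTP status code, including 4xx/5xx."""
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # urlopen raises for error statuses such as 403; report the code instead.
        return error.code


request = build_request(FAKE_JOBS_URL, BROWSER_LIKE_HEADERS)
print(request.full_url)
print(request.get_header("User-agent"))  # urllib stores header names capitalized
```

Calling `fetch_status(request)` would perform the actual network request. Against a site with strict bot protection, browser-like headers alone may still yield a 403, which matches what is reported above.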

Where to put new files:

New files should go into a top-level subfolder, named after the article slug. For example: my-awesome-article

How to merge your changes:

Make sure the CI code style tests all pass (+ run the automatic code formatter if necessary).
Find an RP Team member on Slack and ask them to review & approve your PR.
Once the PR has one positive ("approved") review, GitHub lets you merge the PR.
🎉


gahjelle

@martin-martin Great job updating this! I agree with removing the output from the notebooks!

I found one tiny bug (title -> title_element) that's noted as a line comment.

Otherwise, this looks good to me!

We could potentially ask @KateFinegan to take a quick LE look at the changes as well.

gahjelle · 2023-02-06T14:02:40Z

build-a-web-scraper/03_parse.ipynb

   "source": [
-    "link_text = title_link.text\n",
-    "link_text"
+    "title = title.text\n",


`title` is currently not defined. Should we refer to `title_element` instead?

Suggested change

-    "title = title.text\n",
+    "title = title_element.text\n",
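
To illustrate why the suggestion matters, here is a minimal sketch of the naming pattern. It makes several assumptions: the markup is a simplified, hypothetical job card rather than the real page, and the standard library's `xml.etree.ElementTree` stands in for Beautiful Soup so the snippet runs without third-party packages. In the notebook, the element would come from a Beautiful Soup `find()` call instead.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified markup for a single job card (hypothetical values).
CARD_MARKUP = """
<div class="card-content">
  <h2 class="title">Senior Python Developer</h2>
  <h3 class="company">Example Company</h3>
  <p class="location">Example City</p>
</div>
""".strip()

card = ET.fromstring(CARD_MARKUP)

# Keep the element and its text under different names, so the text
# lookup always refers to a name that has already been defined.
title_element = card.find("h2")
title = title_element.text.strip()

print(title)
```

With the original line, `title = title.text`, the name `title` is read before anything has been assigned to it, so a fresh notebook run would raise a `NameError`.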


martin-martin and others added 3 commits December 21, 2022 14:48

Merge branch 'master' into patch-bs4-course

4a45d53

Merge branch 'master' into patch-bs4-course

e8612bb

martin-martin requested review from gahjelle and digiglean February 6, 2023 11:29

gahjelle approved these changes Feb 6, 2023


Apply TR suggestions

eb80ade

Co-authored-by: gahjelle <geirarne@gmail.com>
