In [None]:
import requests

URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)
print(page)

print(page.text)

**Beautiful Soup is a Python library for parsing structured data. Import the library in your Python script and create a Beautiful Soup object. The object takes page.content, which is the HTML content you scraped earlier, as its input. The second argument, "html.parser", makes sure that you use the appropriate parser for HTML content.**

**Note: You’ll want to pass page.content instead of page.text to avoid problems with character encoding. The .content attribute holds raw bytes, which can be decoded better than the text representation you printed earlier using the .text attribute.**

In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, "html.parser")
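To see why raw bytes are the safer input, here is a minimal sketch that uses only the standard library. The sample string and the wrongly guessed codec are illustrative assumptions, not values taken from the fake-jobs page. The same bytes decode cleanly with the right codec and turn into garbage with the wrong one:

```python
# Illustrative only: "Café" stands in for any non-ASCII text on a page.
raw_bytes = "Café".encode("utf-8")

# Decoding with the codec the bytes were written in round-trips cleanly.
decoded_ok = raw_bytes.decode("utf-8")

# Decoding with a wrongly guessed codec raises no error,
# but the accented character is corrupted.
decoded_wrong = raw_bytes.decode("latin-1")

print(decoded_ok)
print(decoded_wrong)
```

Passing page.content hands the undecoded bytes to Beautiful Soup, so it can work out the encoding itself rather than inheriting whatever guess was used to build page.text.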

**Find Elements by ID**

**In an HTML web page, every element can have an id attribute assigned. As the name already suggests, that id attribute makes the element uniquely identifiable on the page. You can begin to parse your page by selecting a specific element by its ID.**

**Beautiful Soup allows you to find that specific HTML element by its ID:**

In [None]:
results = soup.find(id="ResultsContainer")

**For easier viewing, you can prettify any Beautiful Soup object when you print it out. If you call .prettify() on the results variable that you just assigned above, then you’ll see all the HTML contained within the <div>:**

In [None]:
print(results.prettify())
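The lookup by ID can also be tried offline. The following sketch assumes a small hand-written HTML string (sample_html is hypothetical, not the real page) and shows that .find() returns the matching element, or None when no element carries the requested ID:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
sample_html = """
<div id="ResultsContainer">
  <div class="card-content"><h2 class="title">Python Developer</h2></div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# An ID that exists returns the element itself.
container = sample_soup.find(id="ResultsContainer")

# An ID that does not exist returns None rather than raising an error.
missing = sample_soup.find(id="NoSuchId")

print(container.name)
print(missing)
```

Checking for None before calling methods such as .prettify() avoids an AttributeError if the page layout ever changes.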

**Find Elements by HTML Class Name**

**You’ve seen that every job posting is wrapped in a <div> element with the class card-content. Now you can work with your new object called results and select only the job postings in it.**

In [None]:
job_elements = results.find_all("div", class_="card-content")

Here, you call .find_all() on a Beautiful Soup object, which returns an iterable containing all the HTML for all the job listings displayed on that page.

In [None]:
for job_element in job_elements:
    print(job_element, end="\n"*2)
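As a small offline sketch of the same call, the snippet below runs .find_all() on a hypothetical HTML string (sample_html and its job titles are made up for illustration). It shows that the result behaves like a list: you can take its length, index into it, and iterate over it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two job cards and one unrelated <div>.
sample_html = """
<div id="ResultsContainer">
  <div class="card-content"><h2 class="title">Job A</h2></div>
  <div class="card-content"><h2 class="title">Job B</h2></div>
  <div class="card-footer">Not a job card</div>
</div>
"""

sample_results = BeautifulSoup(sample_html, "html.parser").find(
    id="ResultsContainer"
)

# Only <div> elements whose class is "card-content" are returned.
cards = sample_results.find_all("div", class_="card-content")

print(len(cards))
first_card = cards[0]
print(first_card.h2.text)
```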

That’s already pretty neat, but there’s still a lot of HTML! You saw earlier that your page has descriptive class names on some elements. You can pick out those child elements from each job posting with .find():



In [None]:
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element)
    print(company_element)
    print(location_element)
    print()

Each job_element is a Beautiful Soup Tag object, just like results. Therefore, you can use the same methods on it as you did on its parent element, results.

**Extract Text From HTML Elements**  
You can add .text to a Beautiful Soup object to return only the text content of the HTML elements that the object contains:

In [None]:
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element.text)
    print(company_element.text)
    print(location_element.text)
    print()

However, it’s possible that you’ll also get some extra whitespace. Since you’re now working with Python strings, you can .strip() the superfluous whitespace. You can also apply any other familiar Python string methods to further clean up your text:



In [None]:
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()
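Beyond .strip(), ordinary string methods can also tidy whitespace inside the text. The sketch below assumes a made-up raw_title value of the kind that .text might return from indented HTML, and compares .strip() with a split-and-join that collapses every run of whitespace into a single space:

```python
# Hypothetical raw text, as .text might return it from indented HTML.
raw_title = "\n    Senior   Python\n    Developer  \n"

# .strip() removes only leading and trailing whitespace.
stripped = raw_title.strip()

# .split() with no arguments splits on any run of whitespace,
# so joining the pieces with one space also cleans up the middle.
collapsed = " ".join(raw_title.split())

print(repr(stripped))
print(repr(collapsed))
```

Which variant is appropriate depends on whether line breaks inside the element carry meaning for you.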

However, you’re looking for a position as a software developer, and these results contain job postings in many other fields as well.



**Find Elements by Class Name and Text**
Not all of the job listings are developer jobs. Instead of printing out all the jobs listed on the website, you’ll first filter them using keywords.

You know that job titles in the page are kept within h2 elements. To filter for only specific jobs, you can use the string argument:

In [None]:
python_jobs = results.find_all("h2", string="Python")

This code finds all <h2> elements where the contained string matches "Python" exactly. Note that you’re directly calling the method on your first results variable. If you go ahead and print() the output of the above code snippet to your console, then you might be disappointed because it’ll be empty:

In [None]:
print(python_jobs)

[]


There was a Python job in the search results, so why is it not showing up?

When you use string= as you did above, your program looks for that string exactly. Any differences in the spelling, capitalization, or whitespace will prevent the element from matching. 
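The exact-match behavior is easy to reproduce on a small example. This sketch uses a hypothetical HTML string with three titles (none of them taken from the real page) and shows that only the element whose whole text equals "Python" is returned:

```python
from bs4 import BeautifulSoup

# Hypothetical titles: only the second one is exactly "Python".
sample_html = """
<h2 class="title">Senior Python Developer</h2>
<h2 class="title">Python</h2>
<h2 class="title">python</h2>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# A plain string must equal the element's text exactly.
exact = sample_soup.find_all("h2", string="Python")

print(len(exact))
print(exact[0].text)
```

The longer title and the lowercase variant are both skipped, which mirrors the empty list you got from the real page.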

**Pass a Function to a Beautiful Soup Method**
In addition to strings, you can sometimes pass functions as arguments to Beautiful Soup methods. You can change the previous line of code to use a function instead:

In [None]:
python_jobs = results.find_all(
    "h2", string=lambda text: "python" in text.lower()
)

Now you’re passing an anonymous function to the string= argument. The lambda function looks at the text of each h2 element, converts it to lowercase, and checks whether the substring "python" is found anywhere. You can check whether you managed to identify all the Python jobs with this approach:

In [None]:
print(len(python_jobs))

10


Your program has found 10 matching job posts that include the word "python" in their job title!
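A compiled regular expression is another kind of filter that string= accepts, and it can serve as an alternative to the lambda. The sketch below is an illustration on a hypothetical HTML string, not a replacement for the approach above. It uses re.IGNORECASE to get the same case-insensitive substring match:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical titles with mixed capitalization.
sample_html = """
<h2 class="title">Senior Python Developer</h2>
<h2 class="title">Energy engineer</h2>
<h2 class="title">Software Engineer (python)</h2>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Beautiful Soup applies the pattern with a search, so it matches
# anywhere in the title, regardless of case.
pattern = re.compile(r"python", re.IGNORECASE)
regex_matches = sample_soup.find_all("h2", string=pattern)

titles = [h2.text for h2 in regex_matches]
print(titles)
```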

In [None]:
for job_element in python_jobs:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()

AttributeError: ignored

When you run your scraper to print out the information of the filtered Python jobs, it fails with the AttributeError shown above.

**Identify Error Conditions**
When you look at a single element in python_jobs, you’ll see that it consists of only the <h2> element that contains the job title.



When you revisit the code you used to select the items, you’ll see that that’s what you targeted. You filtered for only the h2 title elements of the job postings that contain the word "python". As you can see, these elements don’t include the rest of the information about the job.

You tried to find the job title, the company name, and the job’s location in each element in python_jobs, but each element contains only the job title text.

Your diligent parsing library still runs each .find() call, but it returns None because none of those elements is nested inside the h2 element. Then your code fails with the AttributeError shown earlier when it tries to read the .text attribute of one of these None objects.

The text you’re looking for is nested in sibling elements of the h2 elements your filter returned. Beautiful Soup can help you to select sibling, child, and parent elements of each Beautiful Soup object.
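The failure mode and two ways around it can be sketched on a tiny example. The markup, the company name, and the helper text_or_default() are hypothetical and only serve the illustration. The real page nests its elements more deeply, so the sibling lookup shown here would not reach every field there.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified job card.
sample_html = """
<div class="card-content">
  <h2 class="title">Python Developer</h2>
  <h3 class="company">Example Corp</h3>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
h2_only = sample_soup.find("h2", class_="title")

# .find() searches only inside the <h2>, so the sibling <h3> is invisible.
company_in_h2 = h2_only.find("h3", class_="company")


def text_or_default(element, default="N/A"):
    """Return the stripped text, or a default when the element is None."""
    return element.text.strip() if element is not None else default


# Sibling navigation reaches the <h3> that sits next to the <h2>.
company_sibling = h2_only.find_next_sibling("h3")

print(text_or_default(company_in_h2))
print(text_or_default(company_sibling))
```

Guarding against None keeps the loop running when a field is missing, while sibling or parent navigation fixes the underlying selection.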

**Access Parent Elements**
One way to get access to all the information you need is to step up in the hierarchy of the DOM starting from the h2 elements that you identified. Take another look at the HTML of a single job posting. Find the h2 element that contains the job title as well as its closest parent element that contains all the information that you’re interested in.

The div element with the card-content class contains all the information you want. It’s a third-level parent of the h2 title element that you found using your filter.

With this information in mind, you can now use the elements in python_jobs and fetch their great-grandparent elements instead to get access to all the information you want:

In [None]:
python_jobs = results.find_all(
    "h2", string=lambda text: "python" in text.lower()
)

python_job_elements = [
    h2_element.parent.parent.parent for h2_element in python_jobs
]


for job_element in python_job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()
    
    links = job_element.find_all("a")
    for link in links:
        link_url = link["href"]
        print(f"Apply here: {link_url}\n")

Senior Python Developer
Payne, Roberts and Davis
Stewartbury, AA

Apply here: https://www.realpython.com

Apply here: https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html

Software Engineer (Python)
Garcia PLC
Ericberg, AE

Apply here: https://www.realpython.com

Apply here: https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html

Python Programmer (Entry-Level)
Moss, Duncan and Allen
Port Sara, AE

Apply here: https://www.realpython.com

Apply here: https://realpython.github.io/fake-jobs/jobs/python-programmer-entry-level-20.html

Python Programmer (Entry-Level)
Cooper and Sons
West Victor, AE

Apply here: https://www.realpython.com

Apply here: https://realpython.github.io/fake-jobs/jobs/python-programmer-entry-level-30.html

Software Developer (Python)
Adams-Brewer
Brockburgh, AE

Apply here: https://www.realpython.com

Apply here: https://realpython.github.io/fake-jobs/jobs/software-developer-python-40.html

Python Developer
Rivera and Son

When you run your script another time, you’ll see that your code once again has access to all the relevant information. That’s because you’re now looping over the div class="card-content" elements instead of just the h2 title elements.

You added a list comprehension that operates on each of the h2 title elements in python_jobs that you got by filtering with the lambda expression. You’re selecting the parent element of the parent element of the parent element of each h2 title element. That’s three generations up!

When you were looking at the HTML of a single job posting, you identified that this specific parent element with the class name card-content contains all the information you need.
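Counting .parent steps ties the code to the exact nesting depth. As a hedged alternative, .find_parent() searches upward for the first ancestor that matches a filter. The sketch below uses a hypothetical HTML string that mimics three levels of nesting (the inner class names and the location are assumptions) and checks that both approaches land on the same element:

```python
from bs4 import BeautifulSoup

# Hypothetical card with the title nested three levels deep.
sample_html = """
<div class="card-content">
  <div class="media">
    <div class="media-content">
      <h2 class="title">Python Developer</h2>
    </div>
  </div>
  <p class="location">Example City</p>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
h2_element = sample_soup.find("h2", class_="title")

# Three explicit steps up the tree.
by_steps = h2_element.parent.parent.parent

# Search upward for the nearest ancestor with the wanted class.
by_search = h2_element.find_parent("div", class_="card-content")

print(by_steps is by_search)
print(by_search.find("p", class_="location").text)
```

The upward search keeps working if a wrapper element is added or removed between the title and the card, as long as the class name stays the same.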

The for loop above already iterates over these parent elements. The cell below reuses the links variable left over from the loop’s last iteration:



In [None]:
for link in links:
    link_url = link["href"]
    print(f"Apply here: {link_url}\n")

Apply here: https://www.realpython.com

Apply here: https://realpython.github.io/fake-jobs/jobs/software-developer-python-90.html



In this code snippet, you first fetched all links from each of the filtered job postings. Then you extracted the href attribute, which contains the URL, using ["href"] and printed it to your console.
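Two details are worth handling when links come from less tidy pages: an a element without an href raises a KeyError under ["href"], and an href may be relative. The sketch below is an assumption-laden illustration (the HTML string and the example-job path are made up) that uses .get() and urljoin() from the standard library to cover both cases:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://realpython.github.io/fake-jobs/"

# Hypothetical links: absolute, relative, and one with no href at all.
sample_html = """
<a href="https://www.realpython.com">Learn</a>
<a href="jobs/example-job.html">Apply</a>
<a>No link target</a>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

apply_urls = []
for link in sample_soup.find_all("a"):
    # .get() returns None for a missing attribute instead of raising.
    href = link.get("href")
    if href is None:
        continue
    # urljoin() leaves absolute URLs alone and resolves relative ones.
    apply_urls.append(urljoin(base_url, href))

for url in apply_urls:
    print(f"Apply here: {url}")
```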