# Web Scraping Tutorial Using Beautiful Soup

This notebook contains my notes on the web scraping tutorial using Beautiful Soup from realpython.com.  The markdown panels are where I jot down my thought process.  I also included code snippets where I put some of these concepts to use.  The tutorial takes a comprehensive journey through a **web scraping pipeline**: 1. inspecting the webpage, 2. understanding the DOM and HTML structure, 3. connecting to the data source and navigating the HTML structure with the Beautiful Soup package.  The webpage we use is a static fake job website from which we will extract various pieces of information.  All relevant links are included below.

The purpose of this notebook is to help me remember what I learned in the tutorial and to share my thought process with other learners.  I must say that not everything on this page follows best coding practice.  I just started learning to code with Python, and I would love your feedback if you spot any errors or have suggestions to improve my coding etiquette.

Shout out to Martin Breuss for making this comprehensive tutorial.  I have learned a lot from it and I hope you do as well!


Data source: https://realpython.github.io/fake-jobs/

Tutorial link: https://realpython.com/beautiful-soup-web-scraper-python

HTML formatter: https://webformatter.com/html


In [None]:
import requests 

URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL) # retrieve the HTML data and store it in a Python object

print(page.text) # print the .text attribute, which looks just like the HTML source code

The code below extracts all the HTML elements relating to the individual job panels.  

In [None]:
import requests
from bs4 import BeautifulSoup # Allows parsing of structured data

URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL) 

soup = BeautifulSoup(page.content, "html.parser") # Input: HTML content; html.parser is an appropriate parser for our data

results = soup.find(id="ResultsContainer") # Find the specific HTML element with id "ResultsContainer"
print(results.prettify()) # .prettify() returns the element's HTML as a nicely indented string

Although the content is now laid out in the hierarchical format we typically see HTML in, it is still long and messy.
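
One thing worth knowing about ```.find()``` before moving on: when nothing matches, it returns ```None``` rather than raising an error. The sketch below uses a tiny made-up HTML string instead of the real page (so it runs without a network connection), and the id "NoSuchContainer" is a deliberately wrong, hypothetical id.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the real page; this markup is made up for illustration.
html = """
<html><body>
  <div id="ResultsContainer"><p>job cards go here</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

found = soup.find(id="ResultsContainer")    # a Tag object for the <div>
missing = soup.find(id="NoSuchContainer")   # None, because no element has this id

print(found.name)
print(missing)

# Checking for None first avoids an AttributeError on a later .prettify() or .find_all()
if missing is None:
    print("Container not found; double-check the id in the browser's inspector.")
```

A check like this is useful when a typo in the id would otherwise only show up as a confusing error a few lines later.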

### Finding elements by HTML Classes

Every job posting on the page is wrapped in a ```<div>``` with the class "card-content" (i.e. ```<div class="card-content">```).  The code below lets us view only the job postings by specifying the class (i.e. "card-content") that we are interested in.  It seems that **the ```class_``` argument doesn't have to match the element's full class attribute, as long as it names one of the classes listed there** (for example, ```class_="title"``` matches ```<h2 class="title is-5">```).

In [None]:
job_elements = results.find_all("div", class_="card-content")  # Creates an 'iterable' containing all HTML for job postings

# Using a for loop to print out all elements of the 'iterable'
for job_element in job_elements:
    print(job_element.prettify(), end="\n"*2)


The output is still very lengthy but a lot neater than the previous output.  We can trim it down further by picking out the child elements of each job posting with ```.find()```.  These child elements are the job title, company, and location.


Each job element is itself a Beautiful Soup object, so we can call ```.find()``` on it just like we did above.

In [None]:
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title") 
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element.prettify())
    print(company_element.prettify())
    print(location_element.prettify())
    print()

Now it is MUCH more organized.  The code is showing us the job title, the company, and the location of each job.  However, **there are still HTML tags floating around.**  Our goal is to extract only the text from the webpage.
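
The class lookups in this section deserve a closer look, because the title tag on the page is ```<h2 class="title is-5">``` while we only asked for ```class_="title"```. A minimal sketch of how that matching behaves, using a single tag copied from the page as a stand-in for the full HTML (the fragment "tit" is just a made-up non-class for comparison):

```python
from bs4 import BeautifulSoup

# One tag taken from a job card; it carries two classes: "title" and "is-5".
html = '<h2 class="title is-5">Senior Python Developer</h2>'
soup = BeautifulSoup(html, "html.parser")

by_one_class = soup.find("h2", class_="title")   # "title" is one of the two classes
by_other_class = soup.find("h2", class_="is-5")  # so is "is-5"
by_fragment = soup.find("h2", class_="tit")      # a fragment of a class name is not a class

print(by_one_class is not None)
print(by_other_class is not None)
print(by_fragment is None)
```

So the match is made against each whole class name in the attribute, not against arbitrary pieces of the text.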


### Extracting Text from HTML elements


Adding ```.text``` to a Beautiful Soup object returns the text content of the HTML element.  Once we have that text as a Python string, we can edit it however we want using Python's string methods.


In [None]:
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip()) # .strip() removes leading and trailing whitespace
    print()
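
To make the "edit the text however we want" part concrete, here is a small sketch on one location element. The extra line breaks and indentation inside the ```<p>``` are added on purpose to show what ```.strip()``` removes, and splitting the location into ```city``` and ```state``` is my own illustration, not something the tutorial does.

```python
from bs4 import BeautifulSoup

# Location element with surrounding whitespace added for illustration.
html = """
<p class="location">
    Stewartbury, AA
</p>
"""
soup = BeautifulSoup(html, "html.parser")
location_element = soup.find("p", class_="location")

raw = location_element.text   # plain Python string, whitespace included
clean = raw.strip()           # leading and trailing whitespace removed

# From here on it is ordinary string handling.
city, state = [part.strip() for part in clean.split(",")]

print(repr(raw))
print(repr(clean))
print(city, "|", state)

# get_text(strip=True) is a Beautiful Soup shortcut that strips while extracting.
print(location_element.get_text(strip=True) == clean)
```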

### Finding Elements by Class Name and Text Content

Say we are only interested in developer jobs; we will need to filter our findings further.  We can do so in ```.find_all()``` by passing the HTML tag and a specific string.  For instance, say we want to find all Python jobs:

In [16]:
python_jobs = results.find_all("h2", string = "Python")
print(python_jobs)

[]


What? How come no results are showing up? Well, that is because ```string=``` only looks for elements whose text matches the specified string exactly.  We need to search for strings in a more general sense.
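
A quick way to convince ourselves of the exact-match behavior is to try it on two titles. In this sketch the first ```<h2>``` is copied from the page, while the second one (whose whole text is just "Python") is a made-up element added for contrast.

```python
from bs4 import BeautifulSoup

# First <h2> is from the real page; the second is hypothetical.
html = """
<div>
  <h2 class="title is-5">Senior Python Developer</h2>
  <h2 class="title is-5">Python</h2>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all("h2", string="Python")                       # whole text must equal "Python"
full_title = soup.find_all("h2", string="Senior Python Developer") # whole text must equal the full title

print([h2.text for h2 in exact])
print([h2.text for h2 in full_title])
```

Only the element whose entire text equals the search string is returned, which is why the real page (where no title is exactly "Python") gave us an empty list.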

### Passing function to a Beautiful Soup Method


Here we pass an **anonymous function** (the lambda function) into the ```string=``` argument, which looks at the text of each ```<h2>``` element, converts it to lowercase, and checks whether the substring "python" is found anywhere.

In [18]:
python_jobs = results.find_all("h2", string = lambda text: "python" in text.lower())
print(len(python_jobs)) # show number of html elements found

10


We can see there are 10 Python jobs. (Note: the 10 refers to the 10 HTML elements found, each of which is an ```<h2>``` element enclosing a matching title.)  We can try to use the loop we previously made to extract all the job titles, companies, and locations.

In [None]:
python_jobs = results.find_all(
    "h2", string=lambda text: "python" in text.lower()
)

for job_element in python_jobs:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()

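We get an error message ```AttributeError: 'NoneType' object has no attribute 'text'```.  If we print the elements in python_jobs, we find that it only contains the ```<h2>``` job title elements, so ```.find()``` has no company or location elements to search through and returns ```None```.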

In [None]:
python_jobs = results.find_all("h2", string = lambda text: "python" in text.lower())
print(python_jobs)
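
To see why the loop fails, it helps to reproduce the problem on a small piece of the card. This is a minimal sketch using the ```media-content``` block from the page as a stand-in; the "(not found)" fallback is just my own placeholder.

```python
from bs4 import BeautifulSoup

# The <h2> and <h3> are siblings inside the same <div>.
html = """
<div class="media-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
h2_element = soup.find("h2", class_="title")

# .find() only searches INSIDE the element it is called on.
inner_title = h2_element.find("h2", class_="title")      # there is no <h2> inside the <h2>
inner_company = h2_element.find("h3", class_="company")  # the <h3> is a sibling, not a child

print(inner_title)
print(inner_company)

# Guarding against None avoids the AttributeError from calling .text on it.
company_text = inner_company.text.strip() if inner_company is not None else "(not found)"
print(company_text)
```

Both lookups come back as ```None```, and calling ```.text``` on ```None``` is exactly what triggers the error message above.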

I tried to get around this issue by adding a familiar IF statement, which only keeps the job details if the word "Python" is found in the job title element of the HTML.  I put this condition inside a loop over the job_elements iterable we created in the previous section.  It seems to work fine.

In [40]:
job_list = []

for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    if "Python" in title_element.text.strip(): # Only print out job if it has "Python" in the job title
        job_list.append(title_element.text.strip())
        job_list.append(company_element.text.strip())
        job_list.append(location_element.text.strip())

print(job_list)
print(len(job_list)) # 30 meaning 10 jobs since each job card has 3 parts to it...

['Senior Python Developer', 'Payne, Roberts and Davis', 'Stewartbury, AA', 'Software Engineer (Python)', 'Garcia PLC', 'Ericberg, AE', 'Python Programmer (Entry-Level)', 'Moss, Duncan and Allen', 'Port Sara, AE', 'Python Programmer (Entry-Level)', 'Cooper and Sons', 'West Victor, AE', 'Software Developer (Python)', 'Adams-Brewer', 'Brockburgh, AE', 'Python Developer', 'Rivera and Sons', 'East Michaelfort, AA', 'Back-End Web Developer (Python, Django)', 'Stewart-Alexander', 'South Kimberly, AA', 'Back-End Web Developer (Python, Django)', 'Jackson, Ali and Mckee', 'New Elizabethside, AA', 'Python Programmer (Entry-Level)', 'Mathews Inc', 'Robertborough, AP', 'Software Developer (Python)', 'Moreno-Rodriguez', 'Martinezburgh, AE']
30
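
A flat list works, but we have to remember that every three items belong together. As a possible alternative (my own variation, not from the tutorial), each job can be stored as a dictionary so that the length of the list is the number of jobs. The sketch uses two simplified cards: the first is copied from the page, and the second, with its "Example" names, is made up.

```python
from bs4 import BeautifulSoup

# Two simplified job cards; the second one is hypothetical.
html = """
<div id="ResultsContainer">
  <div class="card-content">
    <h2 class="title is-5">Senior Python Developer</h2>
    <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
    <p class="location">Stewartbury, AA</p>
  </div>
  <div class="card-content">
    <h2 class="title is-5">Example Engineer</h2>
    <h3 class="subtitle is-6 company">Example Company</h3>
    <p class="location">Example Town, AA</p>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
job_elements = soup.find_all("div", class_="card-content")

jobs = []
for job_element in job_elements:
    title = job_element.find("h2", class_="title").text.strip()
    if "Python" not in title:
        continue  # skip cards without "Python" in the title
    jobs.append({
        "title": title,
        "company": job_element.find("h3", class_="company").text.strip(),
        "location": job_element.find("p", class_="location").text.strip(),
    })

print(len(jobs))   # number of jobs, no dividing by 3 needed
print(jobs[0]["company"])
```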


### Accessing Parent Elements

Look at the hierarchy of the DOM starting from the ```<h2>``` that we are interested in.  Then identify the parent element that encompasses all the elements we need (```<div>``` stands for division, which is used as a container for HTML).  We are interested in the ```<div>``` element with the card-content class because it contains not just the job title, but also the company and location.  It's a **3rd level** parent of the ```<h2>``` title element (hint: in Visual Studio Code, count the number of lines that lead you to that division).

In [None]:
<div class="card">
    <div class="card-content"> 
      <div class="media">
        <div class="media-left">
          <figure class="image is-48x48">
            <img
              src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg"
              alt="Real Python Logo"
            />
          </figure>
        </div>
        <div class="media-content">
          <h2 class="title is-5">Senior Python Developer</h2>
          <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
        </div>
      </div>
  
      <div class="content">
        <p class="location">Stewartbury, AA</p>
        <p class="is-small has-text-grey">
          <time datetime="2021-04-08">2021-04-08</time>
        </p>
      </div>
      <footer class="card-footer">
        <a
          href="https://www.realpython.com"
          target="_blank"
          class="card-footer-item"
          >Learn</a
        >
        <a
          href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html"
          target="_blank"
          class="card-footer-item"
          >Apply</a
        >
      </footer>
    </div>
  </div>

Once we know how many levels separate the element we filtered on from the container we want, we can chain ```.parent``` that many times to reach the information we need.

In [None]:
# Find all elements with "python" in text
python_jobs = results.find_all(
    "h2", string=lambda text: "python" in text.lower()
) 

# List comprehension that climbs 3 levels up the hierarchy from each <h2> in python_jobs
python_job_elements = [
    h2_element.parent.parent.parent for h2_element in python_jobs # 3 levels separate <h2> from <div class="card-content">
]

# Note that python_jobs itself is unchanged; python_job_elements is a new list

# Print out all the Python related jobs 
job_list = []

for job_element in python_job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    job_list.append(title_element.text.strip())
    job_list.append(company_element.text.strip())
    job_list.append(location_element.text.strip())
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()

print(len(job_list))

We can see now that our loop goes over the entire \<div class="card-content"> elements instead of just the \<h2> elements.  All this was possible because we identified \<div class="card-content"> as the HTML element that contains all the HTML elements we are interested in.

On a side note, the number of elements in this job list matches the number in the list we built with the simple IF statement.  Still, I think the second method is better because it navigates through the hierarchy.  If the website had the word "Python" in several different parts of its hierarchy, a looser search could pick those up even when the word is not in the job title.
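
Counting levels by hand works, but it breaks if the page ever adds or removes a wrapper. Beautiful Soup also has a ```.find_parent()``` method that searches upward by tag name and class, which is not used in the tutorial section above; I show it here only as an alternative sketch on a trimmed copy of the job card.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one job card, keeping the nesting between <h2> and card-content.
html = """
<div class="card-content">
  <div class="media">
    <div class="media-content">
      <h2 class="title is-5">Senior Python Developer</h2>
      <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
    </div>
  </div>
  <div class="content">
    <p class="location">Stewartbury, AA</p>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
h2_element = soup.find("h2", class_="title")

by_counting = h2_element.parent.parent.parent                   # three levels up, counted by hand
by_name = h2_element.find_parent("div", class_="card-content")  # nearest matching ancestor

print(by_counting is by_name)
print(by_name.find("p", class_="location").text.strip())
```

Both approaches land on the same ```<div>``` here; the second just states what we are looking for instead of how far away it is.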

### Extract Attributes From HTML Elements

Next we will try to scrape the links from these job cards.  Note that there is a "Learn" and an "Apply" link on each job card.  In HTML, all links are enclosed in ```<a>``` tags, with the URL stored in the ```href``` attribute.  We can try to extract them the same way as we extracted the job titles...

In [None]:
for job_element in python_job_elements:
    # -- snip --
    links = job_element.find_all("a")  # Links are <a> tags, with the URL in the href attribute
    for link in links:
        print(link.text.strip())

What? We don't actually get the links... Well, this is because the ```.text``` attribute only gives us the visible content, AKA the text, of the HTML elements.  So first we need to extract all the ```<a>``` elements in the job card and then get their ```href``` values using square-bracket notation.

In [None]:
for job_element in python_job_elements:
    # -- snip --
    links = job_element.find_all("a")  # Links are <a> tags, with the URL in the href attribute
    for link in links:
        link_url = link["href"] # Extract link URL using [] reference
        print(f"Apply here: {link_url}\n") # This format prints link URLs

As you can see, we also included the "Learn" links.  If we only want the "Apply" links, we have to find a way to filter out the "Learn" links.  Luckily for us, we can add parameters to the ```.find_all()``` method.  I did this by looking at the DOM to see what unique text is contained in the "Apply" link.  Then I reused the **anonymous function** format from above.

In [None]:
for job_element in python_job_elements:
    # -- snip --
    links = job_element.find_all(
        "a", string = lambda text:"apply" in text.lower() # Only match <a> elements whose text contains "apply"
        ) 
    for link in links:
        link_url = link["href"] 
        print(f"Apply here: {link_url}\n") 

The tutorial suggests a different method.  We know that the "Apply" link is the second link that shows up in each job card.  Since ```find_all()``` gives us a list-like result, we can index into it just like a list.  We just need to ```find_all``` the ```<a>``` elements, select the second item, and read its link, which is always stored in the ```href``` attribute.

In [None]:
# Here is another way to do it

for job_element in python_job_elements:
    link_url = job_element.find_all("a")[1]["href"] # Find the second link in the job_element container
    print(f"Apply here: {link_url}\n")
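
Both ways of reading links rest on assumptions about the page: ```[1]``` assumes "Apply" is always the second link, and ```["href"]``` assumes every ```<a>``` has an ```href```. The sketch below shows two slightly more defensive options, picking the link by its text with a regular expression and reading attributes with ```.get()```. It uses the footer of one job card as a stand-in, and the third "Share" link without an ```href``` is a made-up edge case.

```python
import re
from bs4 import BeautifulSoup

# Footer modeled on one job card; the third <a> (no href) is hypothetical.
html = """
<footer class="card-footer">
  <a href="https://www.realpython.com" class="card-footer-item">Learn</a>
  <a href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" class="card-footer-item">Apply</a>
  <a class="card-footer-item">Share</a>
</footer>
"""
soup = BeautifulSoup(html, "html.parser")

# Pick the link by its visible text instead of its position in the card.
apply_link = soup.find("a", string=re.compile("apply", re.IGNORECASE))
apply_url = apply_link["href"]
print(f"Apply here: {apply_url}")

# .get() returns None for a missing attribute, where ["href"] would raise a KeyError.
all_urls = [link.get("href") for link in soup.find_all("a")]
print(all_urls)
```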



### Summary

Thank you for sticking with me if you are still here!  I really enjoyed following this tutorial and I hope you did as well.  As a sanity check, here are all the aspects of web scraping we have covered so far:

1. Retrieving HTML from a URL using ```requests``` and parsing it with ```BeautifulSoup```
2. Creating an iterable with the ```.find_all()``` method to locate and extract HTML elements by **tag** and HTML **class**
3. Using the ```.find()``` method to extract a specific HTML element by **ID**
4. Extracting text from HTML elements using the ```.text``` attribute and the ```.strip()``` method
5. Passing **anonymous functions** (```lambda```) into ```.find_all()``` to search for substrings
6. Using the ```.parent``` attribute to access a parent element that encompasses all the information we want to extract
7. Extracting **links** from the ```href``` attribute of ```<a>``` tags

That being said, the code below is a condensed version of what we did above.  It retrieves the HTML from the "Fake Job" website and extracts all the Python jobs.

In [None]:
# Import Libraries 
import requests 
from bs4 import BeautifulSoup

# Retrieve HTML elements from URL and parse 
URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

# Narrow down to the element that encompasses all job cards using its ID
results = soup.find(id="ResultsContainer")

# Create iterable containing <h2> elements whose job title contains "python"
python_jobs = results.find_all(
    "h2", string = lambda text:"python" in text.lower() # tag name, string = lambda text: "..."
)

# Climb three levels up the hierarchy from each <h2> to get the whole job card
python_job_elements = [
    h2_elements.parent.parent.parent for h2_elements in python_jobs
]
print(len(python_job_elements)) # Check how many jobs selected

# Print out all python related jobs and their links
for job_element in python_job_elements:
    # Print job title
    title = job_element.find("h2", class_ = "title")
    print(title.text.strip())
    # Print Company 
    company = job_element.find("h3", class_ = "company") 
    print(company.text.strip())
    # Print location
    location = job_element.find("p", class_ = "location")
    print(location.text.strip())
    # Print apply links
    apply_link = job_element.find_all("a")[1]["href"]
    print(f"Apply here: {apply_link}\n")


# Damn, that's quite a lot! Congrats!!!

### Additional Notes

Query parameters in a URL consist of three main components:
1. **Start symbol**: A question mark (?) denotes the beginning of the query parameters.
2. **Information pairs**: Key-value pairs joined by an equal sign (key=value) hold the information.
3. **Separator**: Multiple query parameters are separated by an ampersand symbol (&).
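
The three components above can be seen with Python's built-in ```urllib.parse``` module. This is just a small sketch; the URL and its parameters (```q``` and ```location``` on example.com) are made up for illustration and are not part of the fake jobs site.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

# Hypothetical URL with two query parameters.
url = "https://example.com/jobs?q=python&location=remote"

query = urlsplit(url).query   # everything after the "?"
params = parse_qs(query)      # split on "&", then on "=" into key/value pairs

print(query)
print(params)

# Going the other way: build the query string from key-value pairs.
rebuilt = "https://example.com/jobs?" + urlencode({"q": "python", "location": "remote"})
print(rebuilt == url)
```

Note that ```parse_qs``` stores each value in a list, because the same key is allowed to appear more than once in a query string.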