
# Web Crawling with Beautiful Soup

BeautifulSoup documentation: [beautiful-soup-4.readthedocs.io](https://beautiful-soup-4.readthedocs.io)



In [None]:
# Install the beautiful soup library to the environment
!pip install beautifulsoup4



In [None]:
# Import the module
from bs4 import BeautifulSoup

For this exercise, we try to extract all job titles from [this website](https://realpython.github.io/fake-jobs/). Before we start, we inspect the structure of the website via Chrome developer tools:    

*   In Chrome, right-click anywhere on the page
*   In the context menu, select `Inspect` to open the Developer Tools
*   In the new pane, click the icon in the top-left corner to inspect elements
*   Hover over the element you want to inspect and click it

We see that all job titles are listed inside `<h2>` tags. Let's try to access those.

First attempt to work with BeautifulSoup: the documentation shows how to parse a local file by passing an open file handle, so we try to open a webpage in the same way we would open a file.

In [None]:
# See if we can load a website directly into bs4
url = 'https://realpython.github.io/fake-jobs/'

with open(url) as fp:
    soup = BeautifulSoup(fp)

# Turns out this does not work: open() expects a path on the local
# filesystem, so it raises a FileNotFoundError for a URL.

FileNotFoundError: ignored

Lesson learned: `open` reads files from the local filesystem and does not work with URLs.
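To make the distinction concrete, here is a minimal sketch that tells a web address apart from a local path before deciding how to load it. The helper name `looks_like_url` and the file name `fake-jobs.html` are hypothetical and only serve as illustration; the check relies on `urlparse` from the standard library.

```python
from urllib.parse import urlparse


def looks_like_url(source):
    """Hypothetical helper: True if `source` is an http(s) address rather than a file path."""
    parsed = urlparse(source)
    # A web address has an http/https scheme and a host name (netloc)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(looks_like_url("https://realpython.github.io/fake-jobs/"))  # web address
print(looks_like_url("fake-jobs.html"))                           # local file name
```

A string that passes this check has to be fetched over the network, while anything else can be handed to `open`.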

For our second attempt, we rely on the `requests` library to get the content of a webpage and generate a `bs4` object.

In [None]:
# Import the new library
import requests

In [None]:
# Use requests to receive a webpage
my_url = 'https://realpython.github.io/fake-jobs/'
my_page = requests.get(my_url)

In [None]:
# Test if the request was successful
print(my_page)

# Response [200] is good news, the request succeeded.
# Response [4xx] signals a client-side error (e.g. 404 Not Found) and
# Response [5xx] a server-side error: we did not receive the page we wanted.

<Response [200]>
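As a rough illustration of how status codes are grouped, the sketch below classifies a code by its range. `describe_status` is a hypothetical helper, not part of `requests`; it uses `HTTPStatus` from the standard library to look up the official phrase and would raise a `ValueError` for a code that `HTTPStatus` does not know.

```python
from http import HTTPStatus


def describe_status(code):
    """Hypothetical helper: label an HTTP status code by its range."""
    phrase = HTTPStatus(code).phrase  # official reason phrase for the code
    if 200 <= code < 300:
        outcome = "success"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "other"
    return f"{code} {phrase}: {outcome}"


for code in (200, 404, 500):
    print(describe_status(code))
```

With a real response, the numeric code is available as `my_page.status_code`.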


In [None]:
# Receive a string representation of the webpage's html content
# print(my_page.text)
# Commented out because the output is massive.

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Fake Python</title>
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css">
  </head>
  <body>
  <section class="section">
    <div class="container mb-5">
      <h1 class="title is-1">
        Fake Python
      </h1>
      <p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>
    </div>
    <div class="container">
    <div id="ResultsContainer" class="columns is-multiline">
    <div class="column is-half">
<div class="card">
  <div class="card-content">
    <div class="media">
      <div class="media-left">
        <figure class="image is-48x48">
          <img src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1" alt="Real Python Logo">
        </figure>
      </div>
      <div class="media-content">
        <h2 class="title is-

In [None]:
# Pass the HTML string we received through requests to BeautifulSoup to
# get a parsed bs4 object. Naming the parser explicitly avoids a warning.
soup = BeautifulSoup(my_page.text, 'html.parser')

In [None]:
# Test if we can retrieve all h2 elements with soup.find_all()
all_h2 = soup.find_all('h2')

In [None]:
# Let's see what we've received
print(all_h2)

# We see this method has returned a list of elements.

[<h2 class="title is-5">Senior Python Developer</h2>, <h2 class="title is-5">Energy engineer</h2>, <h2 class="title is-5">Legal executive</h2>, <h2 class="title is-5">Fitness centre manager</h2>, <h2 class="title is-5">Product manager</h2>, <h2 class="title is-5">Medical technical officer</h2>, <h2 class="title is-5">Physiological scientist</h2>, <h2 class="title is-5">Textile designer</h2>, <h2 class="title is-5">Television floor manager</h2>, <h2 class="title is-5">Waste management officer</h2>, <h2 class="title is-5">Software Engineer (Python)</h2>, <h2 class="title is-5">Interpreter</h2>, <h2 class="title is-5">Architect</h2>, <h2 class="title is-5">Meteorologist</h2>, <h2 class="title is-5">Audiological scientist</h2>, <h2 class="title is-5">English as a second language teacher</h2>, <h2 class="title is-5">Surgeon</h2>, <h2 class="title is-5">Equities trader</h2>, <h2 class="title is-5">Newspaper journalist</h2>, <h2 class="title is-5">Materials engineer</h2>, <h2 class="title is-

In [None]:
# Checking the type of the object returned by find_all()
type(all_h2)

bs4.element.ResultSet

In [None]:
# Testing if we can access the text in a tag with .string and checking
# its type - success!
type(all_h2[1].string)

bs4.element.NavigableString
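One caveat worth knowing: `.string` only returns text when a tag has a single child. The sketch below uses a small made-up HTML snippet (the nested `<b>` tag is an assumption for illustration and does not appear on the fake-jobs page) to compare `.string` with `.get_text()`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second heading nests a <b> tag inside the <h2>
html = "<h2>Energy engineer</h2><h2>Senior <b>Python</b> Developer</h2>"
soup = BeautifulSoup(html, "html.parser")
simple, nested = soup.find_all("h2")

print(simple.string)      # the tag's single text child
print(nested.string)      # None, because the tag has more than one child
print(nested.get_text())  # text of all descendants joined together
```

For the job titles on this page `.string` is sufficient, because every `<h2>` contains plain text only.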

In [None]:
# Looping through all h2 elements in the website to extract the title
for element in all_h2:
    print(element.string)

Senior Python Developer
Energy engineer
Legal executive
Fitness centre manager
Product manager
Medical technical officer
Physiological scientist
Textile designer
Television floor manager
Waste management officer
Software Engineer (Python)
Interpreter
Architect
Meteorologist
Audiological scientist
English as a second language teacher
Surgeon
Equities trader
Newspaper journalist
Materials engineer
Python Programmer (Entry-Level)
Product/process development scientist
Scientist, research (maths)
Ecologist
Materials engineer
Historic buildings inspector/conservation officer
Data scientist
Psychiatrist
Structural engineer
Immigration officer
Python Programmer (Entry-Level)
Neurosurgeon
Broadcast engineer
Make
Nurse, adult
Air broker
Editor, film/video
Production assistant, radio
Engineer, communications
Sales executive
Software Developer (Python)
Futures trader
Tour manager
Cytogeneticist
Designer, multimedia
Trade union research officer
Chemist, analytical
Programmer, multimedia
Engineer, b

Quick recap of what we've done so far:    
* We inspected the webpage to learn about its structure
* We learned that all job titles are listed inside `<h2>` tags - jackpot!
* We downloaded the webpage using the `requests` library
* We passed a string of all HTML content to `BeautifulSoup`
* We used the `.find_all()` method to find all `<h2>` elements in our webpage
* We accessed the `.string` attribute of an individual element to retrieve the job title we're looking for.

Great success! Next, we wrap these steps in three functions:
1. Retrieve a webpage and generate a Beautiful Soup object
2. Extract a list of all elements featuring a certain tag
3. Extract text from these elements (as we did with the job titles)

In [None]:
# Function that retrieves a website and returns it as a bs4 object
def read_in_website(url):
    '''
    Accepts a URL, retrieves the webpage and returns it as
    a bs4 object. Requires the 'requests' module to
    function.
    '''
    page = requests.get(url)
    return BeautifulSoup(page.text, 'html.parser')

In [None]:
# Testing the read_in function
webpage_content = read_in_website(my_url)

In [None]:
# print(webpage_content)
# Commented out because the output is massive

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>Fake Python</title>
<link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
</head>
<body>
<section class="section">
<div class="container mb-5">
<h1 class="title is-1">
        Fake Python
      </h1>
<p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>
</div>
<div class="container">
<div class="columns is-multiline" id="ResultsContainer">
<div class="column is-half">
<div class="card">
<div class="card-content">
<div class="media">
<div class="media-left">
<figure class="image is-48x48">
<img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
</figure>
</div>
<div class="media-content">
<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
</div>


In [None]:
# Function to extract all elements with a certain tag
def extract_elements(content, tag):
    '''
    Accepts a bs4 object and a tag name as a string (e.g. 'h2').
    Returns a bs4 ResultSet (a list-like object) of all elements
    in the bs4 object that match the tag.
    '''
    return content.find_all(tag)

In [None]:
# Testing the element extraction function with the h2 tag
my_tag = 'h2'
my_results = extract_elements(webpage_content, my_tag)

In [None]:
print(my_results)

[<h2 class="title is-5">Senior Python Developer</h2>, <h2 class="title is-5">Energy engineer</h2>, <h2 class="title is-5">Legal executive</h2>, <h2 class="title is-5">Fitness centre manager</h2>, <h2 class="title is-5">Product manager</h2>, <h2 class="title is-5">Medical technical officer</h2>, <h2 class="title is-5">Physiological scientist</h2>, <h2 class="title is-5">Textile designer</h2>, <h2 class="title is-5">Television floor manager</h2>, <h2 class="title is-5">Waste management officer</h2>, <h2 class="title is-5">Software Engineer (Python)</h2>, <h2 class="title is-5">Interpreter</h2>, <h2 class="title is-5">Architect</h2>, <h2 class="title is-5">Meteorologist</h2>, <h2 class="title is-5">Audiological scientist</h2>, <h2 class="title is-5">English as a second language teacher</h2>, <h2 class="title is-5">Surgeon</h2>, <h2 class="title is-5">Equities trader</h2>, <h2 class="title is-5">Newspaper journalist</h2>, <h2 class="title is-5">Materials engineer</h2>, <h2 class="title is-

In [None]:
# Function to extract text from elements
def extract_text(result_set):
    '''
    Accepts a bs4 ResultSet and extracts text from each tag.
    Returns a list of NavigableString objects.
    '''
    return [element.string for element in result_set]

    # The above statement is a list comprehension and equivalent to
    # this for loop:
    #
    # results = []
    # for element in result_set:
    #     results.append(element.string)
    #
    # return results

In [None]:
# Testing the function
all_jobs = extract_text(my_results)

# Print the first five elements to test
for job in all_jobs[:5]:
    print(job)

Senior Python Developer
Energy engineer
Legal executive
Fitness centre manager
Product manager
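The parsing and extraction steps can also be tried without a network connection. The sketch below is an assumption-laden illustration: `SAMPLE_HTML` is a hand-written stand-in for the downloaded page, reduced to two cards, and `titles_from_html` is a hypothetical helper that combines the `find_all()` and `.string` steps on an HTML string.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the downloaded page (assumption: two cards only)
SAMPLE_HTML = """
<div class="media-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
<div class="media-content">
  <h2 class="title is-5">Energy engineer</h2>
</div>
"""


def titles_from_html(html, tag="h2"):
    """Hypothetical helper: parse an HTML string and return the text of every `tag` element."""
    soup = BeautifulSoup(html, "html.parser")
    # Same two steps as above: find the elements, then read their .string
    return [element.string for element in soup.find_all(tag)]


for title in titles_from_html(SAMPLE_HTML):
    print(title)
```

Working on a string like this keeps the download step separate, so the extraction logic can be checked against a small known input before it is pointed at the live site.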
