# Web Scraping Tutorial

This is a realpython.com tutorial on web scraping.

## Challenges of Web Scraping

The Web has grown organically out of many sources. It combines many different technologies, styles, and personalities, and it continues to grow to this day. In other words, the Web is a hot mess! Because of this, you’ll run into some challenges when scraping the Web:

* **Variety:** Every website is different. While you’ll encounter general structures that repeat themselves, each website is unique and will need personal treatment if you want to extract the relevant information.

* **Durability:** Websites constantly change. Say you’ve built a shiny new web scraper that automatically cherry-picks what you want from your resource of interest. The first time you run your script, it works flawlessly. But when you run the same script only a short while later, you run into a discouraging and lengthy stack of tracebacks!

## Web Scraping Alternatives

An API (application programming interface) is an alternative to web scraping that some websites offer. APIs give you access to data in a predefined manner through HTTP requests.

### API Properties

* You don't need to parse HTML when you use an API.
* Data is typically returned in a structured format such as JSON or XML.
* An API is generally more stable to work with than a web scraper.
* Changes to a website's HTML don't affect the structure of its API.
* API structure is more permanent than HTML, but it can change too.
* You need to read an API's documentation to learn how to use it.
* The challenges of variety and durability apply to APIs too.
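To illustrate the first two points, here is a minimal sketch of reading data from an API response. The JSON payload and its field names (`jobs`, `title`, `location`) are hypothetical and invented for illustration; a real API documents its own structure.

```python
import json

# Hypothetical JSON body, standing in for what an API might return.
sample_response = """
{
  "jobs": [
    {"title": "Senior Python Developer", "location": "Stewartbury, AA"},
    {"title": "Energy engineer", "location": "Christopherville, AA"}
  ]
}
"""

# The data arrives already structured, so no HTML parsing is needed.
data = json.loads(sample_response)
titles = [job["title"] for job in data["jobs"]]
print(titles)
```

The values are reachable by key, which is why a change to the website's HTML layout would not affect code like this.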

## Scraping the [Fake Python Job Site](https://realpython.github.io/fake-jobs/)

In this tutorial, you’ll build a web scraper that fetches Python software developer job listings from the [Fake Python Jobs site](https://realpython.github.io/fake-jobs/). It’s an example site with fake job postings that you can freely scrape to train your skills. Your web scraper will parse the HTML on the site to pick out the relevant information and filter that content for specific words.

These fundamentals carry over to scraping other static websites.

### Step 1: Inspect Your Data Source

Before you write any Python code, you need to get to know the website that you want to scrape. That should be your first step for any web scraping project you want to tackle. You’ll need to understand the site structure to extract the information that’s relevant for you. Start by opening the site you want to scrape with your favorite browser.

#### Explore the Website
Click through the site and interact with it just like any typical job searcher would. 

##### Decipher the Information in URLs
This is the URL of one job posting on the site:

```
https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html
```

You can deconstruct the above URL into two main parts:

1. **The base URL** represents the path to the search functionality of the website. In the example above, the base URL is `https://realpython.github.io/fake-jobs/`.

2. **The specific site location** that ends with `.html` is the path to the job description’s unique resource.

Any job posted on this website will use the same base URL. However, the unique resources’ location will be different depending on what specific job posting you’re viewing.
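The split described above can be sketched with the standard library's `urllib.parse` module. The variable names `base_url` and `resource` are chosen here for illustration only.

```python
from urllib.parse import urlsplit

url = "https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html"
base_url = "https://realpython.github.io/fake-jobs/"

# urlsplit() breaks a URL into its standard components.
parts = urlsplit(url)
print(parts.scheme)  # https
print(parts.netloc)  # realpython.github.io
print(parts.path)    # /fake-jobs/jobs/senior-python-developer-0.html

# Everything after the base URL identifies the specific job posting.
resource = url[len(base_url):] if url.startswith(base_url) else None
print(resource)      # jobs/senior-python-developer-0.html
```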

URLs can hold more information than just the location of a file. Some websites use query parameters to encode values that you submit when performing a search. You can think of them as query strings that you send to the database to retrieve specific records.

You’ll find query parameters at the end of a URL. For example, if you go to [Indeed](https://www.indeed.com/) and search for “software developer” in “Australia” through their search bar, you’ll see that the URL changes to include these values as query parameters:

```
https://au.indeed.com/jobs?q=software+developer&l=Australia
```

The query parameters in this URL are `?q=software+developer&l=Australia`. Query parameters consist of three parts:

1. **Start:** The beginning of the query parameters is denoted by a question mark (`?`).
2. **Information:** The pieces of information constituting one query parameter are encoded in key-value pairs, where related keys and values are joined together by an equals sign (`key=value`).
3. **Separator:** Every URL can have multiple query parameters, separated by an ampersand symbol (`&`).

Equipped with this information, you can pick apart the URL’s query parameters into two key-value pairs:

1. **q=software+developer** selects the type of job.
2. **l=Australia** selects the location of the job.

If you change and submit the values in the website’s search box, then it’ll be directly reflected in the URL’s query parameters and vice versa. If you change either of them, then you’ll see different results on the website.
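As a small sketch of the same idea in code, the standard library's `urllib.parse` module can pick the query parameters apart and build new ones. The replacement search term `data engineer` is just an illustrative value.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = "https://au.indeed.com/jobs?q=software+developer&l=Australia"

# The query string is everything after the question mark.
query = urlsplit(url).query
print(query)   # q=software+developer&l=Australia

# parse_qs() splits on "&" and "=", and decodes "+" as a space.
params = parse_qs(query)
print(params)  # {'q': ['software developer'], 'l': ['Australia']}

# urlencode() goes the other way and builds a query string from a dict.
new_query = urlencode({"q": "data engineer", "l": "Australia"})
print(new_query)  # q=data+engineer&l=Australia
```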

#### Inspect the Site Using Developer Tools

Next, you’ll want to learn more about how the data is structured for display. You’ll need to understand the page structure to pick what you want from the HTML response that you’ll collect in one of the upcoming steps.

Developer tools can help you understand the structure of a website. All modern browsers come with developer tools installed. In this section, you’ll see how to work with the developer tools in Chrome. The process will be very similar to other modern browsers.

In Chrome on macOS, you can open up the developer tools through the menu by selecting View → Developer → Developer Tools. On Windows and Linux, you can access them by clicking the top-right menu button (⋮) and selecting More Tools → Developer Tools. You can also access your developer tools by right-clicking on the page and selecting the Inspect option or using a keyboard shortcut:

* Mac: Cmd+Alt+I
* Windows/Linux: Ctrl+Shift+I

Developer tools allow you to interactively explore the site’s document object model (DOM) to better understand your source. To dig into your page’s DOM, select the Elements tab in developer tools. You’ll see a structure with clickable HTML elements.

Play around and explore! The more you get to know the page you’re working with, the easier it will be to scrape it. However, don’t get too overwhelmed with all that HTML text. You’ll use the power of programming to step through this maze and cherry-pick the information that’s relevant to you.

## Step 2: Scrape HTML Content From a Page

You'll use the requests library to fetch the page at the site's URL.

**Note:** It's good practice to create a new virtual environment for each new project.

```python
try:
    import requests
except ModuleNotFoundError:
    # In a Jupyter notebook, "!" runs a shell command
    !pip install requests
    import requests
```

```python
URL = "https://realpython.github.io/fake-jobs/"

page = requests.get(URL)  # Send a plain HTTP GET request

# print(page)  # Prints the response object, including the status code
print(page.text)  # Prints the HTML of the page
```

```html
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Fake Python</title>
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css">
  </head>
  <body>
  <section class="section">
    <div class="container mb-5">
      <h1 class="title is-1">
        Fake Python
      </h1>
      <p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>
    </div>
    <div class="container">
    <div id="ResultsContainer" class="columns is-multiline">
    <div class="column is-half">
<div class="card">
  <div class="card-content">
    <div class="media">
      <div class="media-left">
        <figure class="image is-48x48">
          <img src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1" alt="Real Python Logo">
        </figure>
      </div>
      <div class="media-content">
        <h2 class="title is-
```

This code issues an HTTP GET request to the given URL. It retrieves the HTML data that the server sends back and stores that data in a Python object.

If you print the .text attribute of page, then you’ll notice that it looks just like the HTML that you inspected earlier with your browser’s developer tools. You successfully fetched the static site content from the Internet! You now have access to the site’s HTML from within your Python script.

It can be challenging to wrap your head around a long block of HTML code. To make it easier to read, you can use an [HTML formatter](https://htmlformatter.com/) to clean it up automatically. Good readability helps you better understand the structure of any code block. While it may or may not help improve the HTML formatting, it’s always worth a try.

**Note:** Keep in mind that every website will look different. That’s why it’s necessary to inspect and understand the structure of the site you’re currently working with before moving forward.


The HTML you’ll encounter will sometimes be confusing. Luckily, the HTML of this job board has descriptive class names on the elements that you’re interested in:

* `class="title is-5"` contains the title of the job posting.
* `class="subtitle is-6 company"` contains the name of the company that offers the position.

In case you ever get lost in a large pile of HTML, remember that you can always go back to your browser and use the developer tools to further explore the HTML structure interactively.

By now, you’ve successfully harnessed the power and user-friendly design of Python’s requests library. With only a few lines of code, you managed to scrape static HTML content from the Web and make it available for further processing.

However, there are more challenging situations that you might encounter when you’re scraping websites. Before you learn how to pick the relevant information from the HTML that you just scraped, you’ll take a quick look at two of these more challenging situations.

### Hidden Websites
Some pages contain information that’s hidden behind a login. That means you’ll need an account to be able to scrape anything from the page. The process to make an HTTP request from your Python script is different from how you access a page from your browser. Just because you can log in to the page through your browser doesn’t mean you’ll be able to scrape it with your Python script.

However, the requests library comes with the built-in capacity to [handle authentication](https://docs.python-requests.org/en/master/user/authentication/). With these techniques, you can log in to websites when making the HTTP request from your Python script and then scrape information that’s hidden behind a login. You won’t need to log in to access the job board information, which is why this tutorial won’t cover authentication.

### Dynamic Websites

In this tutorial, you’ll learn how to scrape a static website. Static sites are straightforward to work with because the server sends you an HTML page that already contains all the page information in the response. You can parse that HTML response and immediately begin to pick out the relevant data.

On the other hand, with a dynamic website, the server might not send back any HTML at all. Instead, you could receive JavaScript code as a response. This code will look completely different from what you saw when you inspected the page with your browser’s developer tools.

What happens in the browser is not the same as what happens in your script. Your browser will diligently execute the JavaScript code it receives from a server and create the DOM and HTML for you locally. However, if you request a dynamic website in your Python script, then you won’t get the HTML page content.
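To make the difference concrete, here is a rough sketch that compares the visible text in a static response with the visible text in a dynamic one. Both HTML strings and the `renderJobs` function inside the script are hypothetical, and the sketch uses the standard library's `html.parser` module only so that it runs without extra installs.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text that appears outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a script or style.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical responses: one static page, one JavaScript-driven page.
static_html = '<html><body><h2 class="title is-5">Senior Python Developer</h2></body></html>'
dynamic_html = (
    '<html><body><div id="app"></div>'
    '<script>renderJobs(["Senior Python Developer"]);</script></body></html>'
)

print(visible_text(static_html))   # ['Senior Python Developer']
print(visible_text(dynamic_html))  # []
```

In the dynamic case the job title exists only inside the script, so nothing readable shows up until that code is executed.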

When you use requests, you only receive what the server sends back. In the case of a dynamic website, you’ll end up with some JavaScript code instead of HTML. The only way to go from the JavaScript code you received to the content that you’re interested in is to execute the code, just like your browser does. The requests library can’t do that for you, but there are other solutions that can.

For example, [requests-html](https://github.com/psf/requests-html) is a project created by the author of the requests library that allows you to render JavaScript using syntax that’s similar to the syntax in requests. It also includes capabilities for parsing the data by using Beautiful Soup under the hood.

**Note:** Another popular choice for scraping dynamic content is [Selenium](https://realpython.com/modern-web-automation-with-python-and-selenium/). You can think of Selenium as a slimmed-down browser that executes the JavaScript code for you before passing on the rendered HTML response to your script.

## Step 3: Parse HTML Code With Beautiful Soup

You’ve successfully scraped some HTML from the Internet, but when you look at it, it just seems like a huge mess. There are tons of HTML elements here and there, thousands of attributes scattered around—and wasn’t there some JavaScript mixed in as well? It’s time to parse this lengthy code response with the help of Python to make it more accessible and pick out the data you want.

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library for parsing structured data. It allows you to interact with HTML in a similar way to how you interact with a web page using developer tools. The library exposes a couple of intuitive functions you can use to explore the HTML you received. To get started, install Beautiful Soup. In a notebook, the following cell installs it, together with requests and pandas, if any of the imports fails:



```python
try:
    from bs4 import BeautifulSoup
    import requests
    import pandas as pd
except ModuleNotFoundError:
    # In a Jupyter notebook, "!" runs a shell command
    !pip install requests
    !pip install pandas
    !pip install beautifulsoup4

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
```

```python
URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
```

This code creates a Beautiful Soup object that takes page.content, which is the HTML content you scraped earlier, as its input.

**Note:** You’ll want to pass page.content instead of page.text to avoid problems with character encoding. The .content attribute holds raw bytes, which can be decoded better than the text representation you printed earlier using the .text attribute.

The second argument, "html.parser", makes sure that you use the appropriate parser for HTML content.
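The encoding note above can be illustrated without any network access. This sketch uses a made-up string to show how the same raw bytes turn into different text depending on the encoding used to decode them; passing bytes such as page.content lets the parser work out the encoding itself rather than rely on a guess made earlier.

```python
# Made-up sample text with non-ASCII characters, encoded as UTF-8 bytes.
raw = "Café Zürich".encode("utf-8")

# Decoding with the wrong encoding garbles the non-ASCII characters.
wrong = raw.decode("latin-1")
print(wrong)  # CafÃ© ZÃ¼rich

# Decoding with the right encoding recovers the original text.
right = raw.decode("utf-8")
print(right)  # Café Zürich
```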

### Find Elements by ID

In an HTML web page, every element can have an id attribute assigned. As the name already suggests, that id attribute makes the element uniquely identifiable on the page. You can begin to parse your page by selecting a specific element by its ID.

Switch back to developer tools and identify the HTML object that contains all the job postings. Explore by hovering over parts of the page and using right-click to Inspect.

**Note:** It helps to periodically switch back to your browser and interactively explore the page using developer tools. This helps you learn how to find the exact elements you’re looking for.

```python
results = soup.find(id="ResultsContainer")
print(results.prettify())  # Pretty-print the HTML with indentation
```

```html
<div class="columns is-multiline" id="ResultsContainer">
 <div class="column is-half">
  <div class="card">
   <div class="card-content">
    <div class="media">
     <div class="media-left">
      <figure class="image is-48x48">
       <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
      </figure>
     </div>
     <div class="media-content">
      <h2 class="title is-5">
       Senior Python Developer
      </h2>
      <h3 class="subtitle is-6 company">
       Payne, Roberts and Davis
      </h3>
     </div>
    </div>
    <div class="content">
     <p class="location">
      Stewartbury, AA
     </p>
     <p class="is-small has-text-grey">
      <time datetime="2021-04-08">
       2021-04-08
      </time>
     </p>
    </div>
    <footer class="card-footer">
     <a class="card-footer-item" href="https://www.realpython.com" target="_blank">
      Learn
     </a>
     <a class="card-footer-item" href=
```

### Find Elements by HTML Class Name

You’ve seen that every job posting is wrapped in a `<div>` element with the class card-content. Now you can work with your new object called results and select only the job postings in it. These are, after all, the parts of the HTML that you’re interested in! You can do this in one line of code:

```python
job_elements = results.find_all("div", class_="card-content")
```

`.find_all()` returns a list-like collection of all the elements that match the description, so you need to loop through it to print each element:

```python
for job_element in job_elements:
    print(job_element.prettify(), end="\n" * 2)
```

```html
<div class="card-content">
 <div class="media">
  <div class="media-left">
   <figure class="image is-48x48">
    <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
   </figure>
  </div>
  <div class="media-content">
   <h2 class="title is-5">
    Senior Python Developer
   </h2>
   <h3 class="subtitle is-6 company">
    Payne, Roberts and Davis
   </h3>
  </div>
 </div>
 <div class="content">
  <p class="location">
   Stewartbury, AA
  </p>
  <p class="is-small has-text-grey">
   <time datetime="2021-04-08">
    2021-04-08
   </time>
  </p>
 </div>
 <footer class="card-footer">
  <a class="card-footer-item" href="https://www.realpython.com" target="_blank">
   Learn
  </a>
  <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">
   Apply
  </a>
 </footer>
</div>


<div class="card-content">
 <div class="media">
  <div class="media-lef
```

That’s already pretty neat, but there’s still a lot of HTML! You saw earlier that your page has descriptive class names on some elements. You can pick out those child elements from each job posting with .find():

```python
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    date_element = job_element.find("time")
    print(title_element)
    print(company_element)
    print(location_element)
    print(date_element)
    print()
```

```html
<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
<p class="location">
        Stewartbury, AA
      </p>
<time datetime="2021-04-08">2021-04-08</time>

<h2 class="title is-5">Energy engineer</h2>
<h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
<p class="location">
        Christopherville, AA
      </p>
<time datetime="2021-04-08">2021-04-08</time>

<h2 class="title is-5">Legal executive</h2>
<h3 class="subtitle is-6 company">Jackson, Chambers and Levy</h3>
<p class="location">
        Port Ericaburgh, AA
      </p>
<time datetime="2021-04-08">2021-04-08</time>

<h2 class="title is-5">Fitness centre manager</h2>
<h3 class="subtitle is-6 company">Savage-Bradley</h3>
<p class="location">
        East Seanview, AP
      </p>
<time datetime="2021-04-08">2021-04-08</time>

<h2 class="title is-5">Product manager</h2>
<h3 class="subtitle is-6 company">Ramirez Inc</h3>
<p class="location">
        North Jamieview, AP
```

Each job_element is a Beautiful Soup Tag object. Therefore, you can use the same search methods on it as you did on its parent element, results.

With this code snippet, you’re getting closer and closer to the data that you’re actually interested in. Still, there’s a lot going on with all those HTML tags and attributes floating around.

### Extract Text From HTML Elements

You only want to see the title, company, and location of each job posting. And behold! Beautiful Soup has got you covered. You can add .text to a Beautiful Soup object to return only the text content of the HTML elements that the object contains.

```python
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    date_element = job_element.find("time")
    print(title_element.text)
    print(company_element.text)
    print(location_element.text)
    print(date_element.text)
    print()
```

```
Senior Python Developer
Payne, Roberts and Davis

        Stewartbury, AA
      
2021-04-08

Energy engineer
Vasquez-Davidson

        Christopherville, AA
      
2021-04-08

Legal executive
Jackson, Chambers and Levy

        Port Ericaburgh, AA
      
2021-04-08

Fitness centre manager
Savage-Bradley

        East Seanview, AP
      
2021-04-08

Product manager
Ramirez Inc

        North Jamieview, AP
      
2021-04-08

Medical technical officer
Rogers-Yates

        Davidville, AP
      
2021-04-08

Physiological scientist
Kramer-Klein

        South Christopher, AE
      
2021-04-08

Textile designer
Meyers-Johnson

        Port Jonathan, AE
      
2021-04-08

Television floor manager
Hughes-Williams

        Osbornetown, AE
      
2021-04-08

Waste management officer
Jones, Williams and Villa

        Scotttown, AP
      
2021-04-08

Software Engineer (Python)
Garcia PLC

        Ericberg, AE
      
2021-04-08

Interpreter
Gregory and Sons

        Ramireztown, AE
      
2021-04-0
```

Run the above code snippet, and you’ll see the text of each element displayed. However, it’s possible that you’ll also get some extra whitespace. Since you’re now working with Python strings, you can .strip() the superfluous whitespace. You can also apply any other familiar Python string methods to further clean up your text:

```python
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    date_element = job_element.find("time")
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print(date_element.text.strip())
    print()
```

```
Senior Python Developer
Payne, Roberts and Davis
Stewartbury, AA
2021-04-08

Energy engineer
Vasquez-Davidson
Christopherville, AA
2021-04-08

Legal executive
Jackson, Chambers and Levy
Port Ericaburgh, AA
2021-04-08

Fitness centre manager
Savage-Bradley
East Seanview, AP
2021-04-08

Product manager
Ramirez Inc
North Jamieview, AP
2021-04-08

Medical technical officer
Rogers-Yates
Davidville, AP
2021-04-08

Physiological scientist
Kramer-Klein
South Christopher, AE
2021-04-08

Textile designer
Meyers-Johnson
Port Jonathan, AE
2021-04-08

Television floor manager
Hughes-Williams
Osbornetown, AE
2021-04-08

Waste management officer
Jones, Williams and Villa
Scotttown, AP
2021-04-08

Software Engineer (Python)
Garcia PLC
Ericberg, AE
2021-04-08

Interpreter
Gregory and Sons
Ramireztown, AE
2021-04-08

Architect
Clark, Garcia and Sosa
Figueroaview, AA
2021-04-08

Meteorologist
Bush PLC
Kelseystad, AA
2021-04-08

Audiological scientist
Salazar-Meyers
Williamsburgh, AE
2021-04-08

English a
```
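The loops above assume that every job posting contains all four elements. On a less tidy page, `.find()` returns `None` when nothing matches, and calling `.text` on `None` raises an `AttributeError`. The helper below is one possible way to guard against that; the name `text_or_default` and the `"N/A"` default are illustrative choices, and the `SimpleNamespace` objects are stand-ins for real Beautiful Soup elements so that the sketch runs on its own.

```python
from types import SimpleNamespace


def text_or_default(element, default="N/A"):
    """Return the cleaned text of an element, or a default if it is missing."""
    if element is None:
        return default
    # Collapse runs of whitespace, including newlines, into single spaces.
    return " ".join(element.text.split())


# Stand-ins for what job_element.find(...) might return.
found = SimpleNamespace(text="\n        Stewartbury, AA\n      ")
missing = None

print(text_or_default(found))    # Stewartbury, AA
print(text_or_default(missing))  # N/A
```

Inside the loop, a call such as `text_or_default(location_element)` could then replace `location_element.text.strip()`.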

## Step 4: Save to a CSV File

Now you can store the results in a CSV file. First, append the required data to individual lists:

```python
title = []
company = []
location = []
date_posted = []


for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    date_element = job_element.find("time")
    title.append(title_element.text.strip())
    company.append(company_element.text.strip())
    location.append(location_element.text.strip())
    date_posted.append(date_element.text.strip())
```

```python
print('title: \n', title)
print('\nlocation: \n', location)
print('\ncompany: \n', company)
print('\ndate_posted: \n', date_posted)
```

```
title: 
 ['Senior Python Developer', 'Energy engineer', 'Legal executive', 'Fitness centre manager', 'Product manager', 'Medical technical officer', 'Physiological scientist', 'Textile designer', 'Television floor manager', 'Waste management officer', 'Software Engineer (Python)', 'Interpreter', 'Architect', 'Meteorologist', 'Audiological scientist', 'English as a second language teacher', 'Surgeon', 'Equities trader', 'Newspaper journalist', 'Materials engineer', 'Python Programmer (Entry-Level)', 'Product/process development scientist', 'Scientist, research (maths)', 'Ecologist', 'Materials engineer', 'Historic buildings inspector/conservation officer', 'Data scientist', 'Psychiatrist', 'Structural engineer', 'Immigration officer', 'Python Programmer (Entry-Level)', 'Neurosurgeon', 'Broadcast engineer', 'Make', 'Nurse, adult', 'Air broker', 'Editor, film/video', 'Production assistant, radio', 'Engineer, communications', 'Sales executive', 'Software Developer (Python)', 'Futures trade
```

Then zip the lists into rows and convert them to a DataFrame using pandas:

```python
column_titles = ['title', 'company', 'location', 'date posted']

job_posting_list = pd.DataFrame(
    list(zip(title, company, location, date_posted)),
    columns=column_titles,
)
print(job_posting_list.head(10))
```

```
                       title                     company  \
0    Senior Python Developer    Payne, Roberts and Davis   
1            Energy engineer            Vasquez-Davidson   
2            Legal executive  Jackson, Chambers and Levy   
3     Fitness centre manager              Savage-Bradley   
4            Product manager                 Ramirez Inc   
5  Medical technical officer                Rogers-Yates   
6    Physiological scientist                Kramer-Klein   
7           Textile designer              Meyers-Johnson   
8   Television floor manager             Hughes-Williams   
9   Waste management officer   Jones, Williams and Villa   

                location date posted  
0        Stewartbury, AA  2021-04-08  
1   Christopherville, AA  2021-04-08  
2    Port Ericaburgh, AA  2021-04-08  
3      East Seanview, AP  2021-04-08  
4    North Jamieview, AP  2021-04-08  
5         Davidville, AP  2021-04-08  
6  South Christopher, AE  2021-04-08  
7      Port Jonathan, AE  2
```
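The `zip()` call above is what turns four column lists into one row per job. Here is a small sketch with two sample postings taken from the output, showing the shape of the result and one thing to watch out for.

```python
title = ["Senior Python Developer", "Energy engineer"]
company = ["Payne, Roberts and Davis", "Vasquez-Davidson"]
location = ["Stewartbury, AA", "Christopherville, AA"]
date_posted = ["2021-04-08", "2021-04-08"]

# zip() pairs up the items at each position, giving one tuple per job.
rows = list(zip(title, company, location, date_posted))
print(rows[0])

# zip() stops at the shortest input, so a list with a missing value
# silently drops rows instead of raising an error.
short_rows = list(zip(title, company[:1]))
print(len(short_rows))  # 1
```

Because of that second behavior, it may be worth checking that all four lists have the same length before building the DataFrame.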

Finally, write the DataFrame to a CSV file:

```python
job_posting_list.to_csv('fake-python-job-listing-data.csv', encoding='utf-8')
```

You now have a readable list of jobs, including each job’s company name and location, exported as a CSV file. However, you’re looking for a position as a software developer, and these results contain job postings in many other fields as well.
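If pandas isn't available, a similar export can be sketched with the standard library's `csv` module instead. This is a swapped-in alternative, not the approach used in this tutorial. The sketch writes to an in-memory buffer so that it runs without touching the file system, and the two rows are sample data taken from the output above.

```python
import csv
import io

column_titles = ["title", "company", "location", "date posted"]
rows = [
    ("Senior Python Developer", "Payne, Roberts and Davis", "Stewartbury, AA", "2021-04-08"),
    ("Energy engineer", "Vasquez-Davidson", "Christopherville, AA", "2021-04-08"),
]

# Write the header and the rows; fields containing commas get quoted.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(column_titles)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original values.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed[1][1])  # Payne, Roberts and Davis
```

To write a real file, you could open one with `open(path, "w", newline="", encoding="utf-8")` and pass it to `csv.writer()` in place of the buffer.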

### Find Elements by Class Name and Text Content

Not all of the job listings are developer jobs. Instead of printing out all the jobs listed on the website, you’ll first filter them using keywords.

You know that job titles in the page are kept within `<h2>` elements. To filter for only specific jobs, you can use the string argument: