# Introduction to Web Scraping

## What is web scraping?

Web scraping is the automated process of extracting data from websites. It's a technique that is widely used for gathering data, allowing programmers to extract structured data from the web directly when an API either doesn't exist or is too restrictive.

At a high level, any web scraping project follows roughly these steps:
1. Fetch the page
2. Parse the HTML
3. Extract the info needed


![Image](webscraping.png)



## Legal and Ethical Considerations

It's important to consider both legal and ethical implications before writing any code. If you’re scraping a page respectfully for educational purposes, then you’re unlikely to have any problems. Still, it’s a good idea to do some research on your own and make sure that you’re not violating any Terms of Service before you start a large-scale project.

Check out this [article](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01) about the ethics of web scraping. To summarize, you should always:

- Be transparent about scraping intentions
- Respect site load by scraping at reasonable rates
- Store only necessary data
- Credit original sources and respect their content
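
As one way to act on the point about site load, a scraper can pause between requests. The sketch below is a minimal illustration rather than a prescribed method: `fetch_politely`, its `delay_seconds` parameter, and the `fake_fetch` stand-in are hypothetical names, and the one-second default delay is an assumption, not a rule.

```python
import time
from typing import Callable, Iterable, List


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 1.0,
) -> List[str]:
    """Call `fetch` on each URL, sleeping between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one
            time.sleep(delay_seconds)
        pages.append(fetch(url))
    return pages


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP call so the example runs offline."""
    return f"<html>{url}</html>"


pages = fetch_politely(["page-1", "page-2", "page-3"], fake_fetch, delay_seconds=0.1)
print(len(pages))
```

In a real script, `fetch` could be something like `lambda url: requests.get(url).text`, and a suitable delay depends on the site being scraped.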



## Use Cases and Applications

When you try to get the information you want manually, you might spend a lot of time clicking, scrolling, and searching, especially if you need large amounts of data from websites that are regularly updated with new content. Manual web scraping can take a lot of time and repetition, and doesn't sound like the most fun or productive way to spend your time.

For example, if you're doing a project that relies on product and price information from an ecommerce website, it would be much easier to write some code that collects the data for you than to click into each product and copy and paste the data manually. A few other suitable applications of web scraping include automatically collecting:

- offers, discounts, limited-time deals
- job postings and internships
- phone numbers/emails for sales and marketing campaigns

![Image](applications.png)


## Overview of Tools and Libraries for Web Scraping in Python

Python offers several libraries that help with web scraping, the most popular being `requests` and `BeautifulSoup`. The `requests` library allows users to easily send HTTP requests to websites, while `BeautifulSoup` is used for parsing HTML and XML documents. Together, these tools enable developers to automate data collection and process web content efficiently.

It's important to note that BeautifulSoup is excellent for static content, but not that great when it comes to dynamic webpages that rely heavily on JavaScript for rendering their content. In these cases, `Selenium` is a better solution. Selenium is a web automation tool primarily used for testing web applications, but it's also highly effective for web scraping purposes. Unlike BeautifulSoup, Selenium can interact with web pages just like a human user would, navigating through pages, clicking on buttons, and even filling out forms. This capability allows it to execute JavaScript and scrape data from websites that update their content dynamically.


# Demo

Here's the website we'll use in this part of the guide: https://realpython.github.io/fake-jobs/

It's a static page designed for educational purposes, allowing learners to practice their web scraping skills without the ethical and legal complexities that come with scraping real-world websites.

Start off by scrolling and clicking through the website to get yourself familiarized with it. There are lots of fake job postings in a card format, and each of them has two buttons. If you click Apply, then you’ll see a new page that contains more detailed descriptions of the selected job.

Next, inspect the website with your browser's developer tools (right click anywhere, and click "Inspect"). This little window will be your best friend when web scraping, as it provides a detailed view of the HTML structure of the webpage. You can use it to identify the specific tags, classes, and ids associated with the data you wish to extract.

Let's get to writing some code! Make sure you have the `bs4` and `requests` packages.

In [None]:
# !pip install bs4 requests

In [6]:
import requests

# Send a GET request to the URL, returns the raw HTML
site = requests.get('https://realpython.github.io/fake-jobs/')

print(site.text)

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Fake Python</title>
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css">
  </head>
  <body>
  <section class="section">
    <div class="container mb-5">
      <h1 class="title is-1">
        Fake Python
      </h1>
      <p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>
    </div>
    <div class="container">
    <div id="ResultsContainer" class="columns is-multiline">
    <div class="column is-half">
<div class="card">
  <div class="card-content">
    <div class="media">
      <div class="media-left">
        <figure class="image is-48x48">
          <img src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1" alt="Real Python Logo">
        </figure>
      </div>
      <div class="media-content">
        <h2 class="title is-

In [7]:
from bs4 import BeautifulSoup

# Give the HTML to bs4 to parse
soup = BeautifulSoup(site.text, 'html.parser')

Let's go through some essential functions:

## Find Elements by ID


In HTML, the `id` attribute is used to assign a unique identifier to an HTML element. No two elements should have the same `id` value in a single page.

If you inspect the webpage, you can find the HTML element that contains all the job postings. Notice the following `div`:

`<div id="ResultsContainer" class="columns is-multiline">`

![Image](ID.png)

In [12]:
# find the ResultsContainer id
ids = soup.find(id="ResultsContainer")

If you call `.prettify()` on the `ids` variable, then you’ll see all the HTML contained within the `div` element in a formatted, readable way (as opposed to `ids.text`, which returns only the text with the tags stripped).

In [14]:
print(ids.prettify())

<div class="columns is-multiline" id="ResultsContainer">
 <div class="column is-half">
  <div class="card">
   <div class="card-content">
    <div class="media">
     <div class="media-left">
      <figure class="image is-48x48">
       <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
      </figure>
     </div>
     <div class="media-content">
      <h2 class="title is-5">
       Senior Python Developer
      </h2>
      <h3 class="subtitle is-6 company">
       Payne, Roberts and Davis
      </h3>
     </div>
    </div>
    <div class="content">
     <p class="location">
      Stewartbury, AA
     </p>
     <p class="is-small has-text-grey">
      <time datetime="2021-04-08">
       2021-04-08
      </time>
     </p>
    </div>
    <footer class="card-footer">
     <a class="card-footer-item" href="https://www.realpython.com" target="_blank">
      Learn
     </a>
     <a class="card-footer-item" href=
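
Because `find` returns `None` when nothing matches, it can help to check the result before using it. The following sketch runs on a small hand-written HTML string (an assumption made so that it works offline, not a copy of the real page), and `MissingContainer` is a deliberately nonexistent id.

```python
from bs4 import BeautifulSoup

# Hand-written HTML that mimics the structure of the results container
html = """
<div id="ResultsContainer" class="columns is-multiline">
  <div class="card"><h2 class="title is-5">Senior Python Developer</h2></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element
container = soup.find(id="ResultsContainer")
print(container.name)          # tag name of the match
print(container.text.strip())  # only the text, without any markup

# find() returns None when no element matches
missing = soup.find(id="MissingContainer")
if missing is None:
    print("No element with that id")
```

Checking for `None` up front avoids an `AttributeError` later, for example when calling `.find_all()` on a container that was never found.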

## Find Elements by HTML Class Name

Now that we have this object that holds the individual job listings, we can access the individual cards and all of the information in them.

In [23]:
# finds all the divs within the container, with class "card-content"
job_elements = ids.find_all("div", class_="card-content")

In [24]:
# loop through each div, and print its contents

for job_element in job_elements:
    print(job_element.prettify())
    print("------")

<div class="card-content">
 <div class="media">
  <div class="media-left">
   <figure class="image is-48x48">
    <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
   </figure>
  </div>
  <div class="media-content">
   <h2 class="title is-5">
    Senior Python Developer
   </h2>
   <h3 class="subtitle is-6 company">
    Payne, Roberts and Davis
   </h3>
  </div>
 </div>
 <div class="content">
  <p class="location">
   Stewartbury, AA
  </p>
  <p class="is-small has-text-grey">
   <time datetime="2021-04-08">
    2021-04-08
   </time>
  </p>
 </div>
 <footer class="card-footer">
  <a class="card-footer-item" href="https://www.realpython.com" target="_blank">
   Learn
  </a>
  <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">
   Apply
  </a>
 </footer>
</div>

------
<div class="card-content">
 <div class="media">
  <div class="med

If you scroll through the output of the last cell, you can notice some patterns. For example, 

- all the job titles are in a ```<h2 class="title is-5">```, 
- all the companies in ```<h3 class="subtitle is-6 company">```, 
- and all the locations in ```<p class="location">```.

Let's extract each of these fields!

In [30]:
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title is-5")
    company_element = job_element.find("h3", class_="subtitle is-6 company")
    location_element = job_element.find("p", class_="location")
    print(title_element)
    print(company_element)
    print(location_element)
    print("----------------------------------------")

<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
<p class="location">
        Stewartbury, AA
      </p>
----------------------------------------
<h2 class="title is-5">Energy engineer</h2>
<h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
<p class="location">
        Christopherville, AA
      </p>
----------------------------------------
<h2 class="title is-5">Legal executive</h2>
<h3 class="subtitle is-6 company">Jackson, Chambers and Levy</h3>
<p class="location">
        Port Ericaburgh, AA
      </p>
----------------------------------------
<h2 class="title is-5">Fitness centre manager</h2>
<h3 class="subtitle is-6 company">Savage-Bradley</h3>
<p class="location">
        East Seanview, AP
      </p>
----------------------------------------
<h2 class="title is-5">Product manager</h2>
<h3 class="subtitle is-6 company">Ramirez Inc</h3>
<p class="location">
        North Jamieview, AP
      </p>
---------------

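The output is still a little messy because it includes the HTML tags and extra whitespace. We can clean it up by taking each element's `.text` and calling ```.strip()``` on it: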

In [31]:
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title is-5")
    company_element = job_element.find("h3", class_="subtitle is-6 company")
    location_element = job_element.find("p", class_="location")
    print(f"Title: {title_element.text.strip()}")
    print(f"Company: {company_element.text.strip()}")
    print(f"Location: {location_element.text.strip()}")
    print("----------------------------------------")

Title: Senior Python Developer
Company: Payne, Roberts and Davis
Location: Stewartbury, AA
----------------------------------------
Title: Energy engineer
Company: Vasquez-Davidson
Location: Christopherville, AA
----------------------------------------
Title: Legal executive
Company: Jackson, Chambers and Levy
Location: Port Ericaburgh, AA
----------------------------------------
Title: Fitness centre manager
Company: Savage-Bradley
Location: East Seanview, AP
----------------------------------------
Title: Product manager
Company: Ramirez Inc
Location: North Jamieview, AP
----------------------------------------
Title: Medical technical officer
Company: Rogers-Yates
Location: Davidville, AP
----------------------------------------
Title: Physiological scientist
Company: Kramer-Klein
Location: South Christopher, AE
----------------------------------------
Title: Textile designer
Company: Meyers-Johnson
Location: Port Jonathan, AE
----------------------------------------
Title: Televisi
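
One detail of the lookups above is worth illustrating: a multi-word `class_` string such as `"subtitle is-6 company"` is compared against the exact value of the `class` attribute, so it stops matching if the classes appear in a different order. The sketch below shows this on a hand-written card whose class order has been changed on purpose (an assumption for illustration; the real page uses the order shown earlier).

```python
from bs4 import BeautifulSoup

# Hand-written card; note the class order differs from the real page
html = """
<div class="card-content">
  <h2 class="title is-5">Energy engineer</h2>
  <h3 class="company subtitle is-6">Vasquez-Davidson</h3>
</div>
"""
card = BeautifulSoup(html, "html.parser")

# A multi-word class_ string must match the attribute value exactly
exact = card.find("h3", class_="subtitle is-6 company")
print(exact)  # None, because the classes appear in a different order

# A single class name matches regardless of other classes or their order
by_single_class = card.find("h3", class_="company")
print(by_single_class.text.strip())

# A CSS selector also ignores class order
by_selector = card.select_one("h3.subtitle.company")
print(by_selector.text.strip())
```

Searching by a single distinctive class (here `company`) is usually the more forgiving choice if the site's markup might change.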

## Loading into a pandas dataframe

So we can extract and print some elements in a page, but now we want to build a dataset out of it.

In [34]:
import pandas as pd

# lists to store the extracted data
titles = []
companies = []
locations = []

# Extract each element and append to the respective list
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title is-5")
    company_element = job_element.find("h3", class_="subtitle is-6 company")
    location_element = job_element.find("p", class_="location")
    
    # if the element exists, append to list
    if title_element:
        titles.append(title_element.text.strip())
    else:
        titles.append(None)
    
    if company_element:
        companies.append(company_element.text.strip())
    else:
        companies.append(None)
    
    if location_element:
        locations.append(location_element.text.strip())
    else:
        locations.append(None)

# Create a DataFrame using pandas
jobs_df = pd.DataFrame({
    'Title': titles,
    'Company': companies,
    'Location': locations
})

Now we have a DataFrame that we can export as a CSV file!

In [36]:
jobs_df.head()

# jobs_df.to_csv("Jobs.csv")

| | Title | Company | Location |
|---|---|---|---|
| 0 | Senior Python Developer | Payne, Roberts and Davis | Stewartbury, AA |
| 1 | Energy engineer | Vasquez-Davidson | Christopherville, AA |
| 2 | Legal executive | Jackson, Chambers and Levy | Port Ericaburgh, AA |
| 3 | Fitness centre manager | Savage-Bradley | East Seanview, AP |
| 4 | Product manager | Ramirez Inc | North Jamieview, AP |
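
The repeated `if`/`else` blocks in the extraction loop can be folded into a small helper. This is only one possible arrangement: `text_or_none` is a hypothetical helper name, and the two hand-written cards below (the second deliberately has no location) stand in for the real page so the example runs offline.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, tag, class_name):
    """Return the stripped text of the first match, or None if absent."""
    element = parent.find(tag, class_=class_name)
    return element.text.strip() if element is not None else None


# Hand-written cards; the second one deliberately has no location
html = """
<div class="card-content">
  <h2 class="title is-5">Legal executive</h2>
  <h3 class="subtitle is-6 company">Jackson, Chambers and Levy</h3>
  <p class="location"> Port Ericaburgh, AA </p>
</div>
<div class="card-content">
  <h2 class="title is-5">Product manager</h2>
  <h3 class="subtitle is-6 company">Ramirez Inc</h3>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# One dictionary per job keeps the three fields of a row together
records = [
    {
        "Title": text_or_none(card, "h2", "title"),
        "Company": text_or_none(card, "h3", "company"),
        "Location": text_or_none(card, "p", "location"),
    }
    for card in soup.find_all("div", class_="card-content")
]
print(records)
```

Passing `records` to `pd.DataFrame(records)` should produce the same three columns as the dictionary-of-lists version, with `None` where a field was missing.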


# Selenium

You've now seen an example of scraping data from a static webpage. However, many modern websites include dynamic content generated or manipulated by JavaScript, which makes pure HTML scraping insufficient for comprehensive data collection. On top of finding tags and elements that are loaded with JavaScript, Selenium can also drag and drop elements, click buttons, fill out forms, and type into search fields.

## Setup
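1. Run `pip install selenium`.
2. Follow [this guide](https://chromedriver.chromium.org/getting-started) to download and set up ChromeDriver.
    - Make sure this code block runs as your final step:

    ```
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    # In Selenium 4, the ChromeDriver path is passed through a Service object
    driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))
    ```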

    


After installing, try running this cell. A separate Chrome window should pop up with the ```quotes.toscrape.com/js-delayed``` webpage. This page differs from the previous example: it uses JavaScript to load its content dynamically, and there is an intentional delay when loading pages.

In [30]:
# !pip install selenium
from selenium.webdriver import Chrome, ChromeOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


# note: after setting your path to your chromedriver once, you can just init like so (without specifying path)
driver = Chrome() 

driver.get('https://quotes.toscrape.com/js-delayed/')

## Scraping Quotes

Let's see what happens if we try to use `requests` and bs4 on this website:

In [40]:
s = requests.get('https://quotes.toscrape.com/js-delayed/')

soup = BeautifulSoup(s.text, 'html.parser')

quotes = soup.find_all('div', class_='quote')

# print all the quotes
for quote in quotes:
    print(quote)

# no output at all! bs4 does not work here.

What's happening here? 

When you make a request with `requests.get`, it retrieves the static HTML that the server sends back on the initial request. It does not execute JavaScript. If a website uses JavaScript to load or modify content after the initial page load (as is common with most modern websites), that dynamically loaded content won't be included in the HTML retrieved by requests.get.
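
The effect can be reproduced offline. In the sketch below, the HTML string is a hypothetical server response (not the actual markup of the quotes site): the container is empty and the data lives inside a script that only a browser would run.

```python
from bs4 import BeautifulSoup

# Hypothetical server response: the container is empty and a script
# is expected to fill it in once a browser runs the page
html = """
<html>
  <body>
    <div id="quotes"></div>
    <script>
      var data = [{"text": "An example quote", "author": "Someone"}];
      // a browser would turn data into quote elements here
    </script>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# No quote elements exist in the markup, since the script was never run
quotes = soup.find_all("div", class_="quote")
print(len(quotes))

# The raw script source is still present, but only as plain text
script = soup.find("script")
print("An example quote" in script.string)
```

The parser sees the script's source code but never executes it, which is why the search for quote elements comes back empty.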

Selenium, on the other hand, drives a real browser, so it can run JavaScript just as a regular user's browser would and access the dynamically loaded content. Combined with `WebDriverWait` and expected conditions (`EC`), Selenium can be told to wait until certain elements become visible, ensuring that the content loaded or altered by JavaScript is available before you read it.



In [39]:
driver = Chrome() 

driver.get('https://quotes.toscrape.com/js-delayed/')

# these lines set up a condition to wait for up to 20 seconds for all elements with the class name "quote" to be visible on the page
wait = WebDriverWait(driver, 20)
quotes = wait.until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "quote")))

for quote in quotes:
    print(quote.text)
    print("-----")

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by Albert Einstein
Tags: change deep-thoughts thinking world
-----
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
by J.K. Rowling
Tags: abilities choices
-----
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
by Albert Einstein
Tags: inspirational life live miracle miracles
-----
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
by Jane Austen
Tags: aliteracy books classic humor
-----
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
by Marilyn Monroe
Tags: be-yourself inspirational
-----
“Try not to become a man of success. Rather become a man of value.”
by Albert Einstein
Tags: adulthood success value
-----
“It is better to be hated fo
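
Conceptually, an explicit wait is a polling loop: check a condition, and if it is not yet met, sleep briefly and try again until a deadline passes. The sketch below is a simplified illustration of that idea and not Selenium's actual implementation; `wait_until`, `poll_interval`, and the simulated `load_quotes` function are hypothetical names.

```python
import time


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Raises TimeoutError if nothing truthy is returned within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition was not met in time")
        time.sleep(poll_interval)


# Simulate content that only becomes available after a short delay
ready_at = time.monotonic() + 0.3


def load_quotes():
    """Return an empty list until the simulated delay has passed."""
    if time.monotonic() >= ready_at:
        return ["first quote", "second quote"]
    return []


quotes = wait_until(load_quotes, timeout=2.0, poll_interval=0.1)
print(quotes)
```

Waiting this way returns as soon as the content appears, which is generally preferable to a fixed `time.sleep(20)` that always costs the full delay.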

## Form Filling with Selenium

Now that we've seen how to collect dynamically loaded content with Selenium, let's explore some of its other capabilities. Notice how there's a Login button on the top right of the page. Let's get Selenium to click on that button, fill out a username and password, and click login.

The first thing we need to do is figure out how to find the Login button. Inspecting the element shows that the Login button is an anchor (`a`) tag with the attribute `href="/login"`:

`<a href="/login">Login</a>`

We can find this using a CSS selector. In the context of web scraping, CSS selectors are used to identify and interact with specific elements on a webpage based on their tag name, id, class, attributes, and even their hierarchy or position within the HTML document.

So when you see a CSS selector like `a[href="/login"]`, it's saying, "Find me an anchor tag with an href attribute equal to '/login'". Right after finding the button, we can call `.click()` to follow the href, which brings us to the Login page.
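
Selector syntax can be tried out without launching a browser. The sketch below uses BeautifulSoup's `select_one` and `select` (swapped in for Selenium so that it runs offline) on a hand-written snippet; the markup is an assumption for illustration, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hand-written markup loosely modelled on a page with a login link and form
html = """
<div class="header">
  <a href="/">Quotes to Scrape</a>
  <a href="/login">Login</a>
</div>
<form>
  <input type="text" class="form-control" id="username" name="username">
  <input type="submit" value="Login">
</form>
"""
soup = BeautifulSoup(html, "html.parser")

# tag[attribute="value"]: an anchor whose href is exactly "/login"
print(soup.select_one('a[href="/login"]').text)

# #id: the element whose id is "username"
print(soup.select_one("#username")["name"])

# tag.class: every input carrying the class "form-control"
print(len(soup.select("input.form-control")))

# parent descendant: anchors anywhere inside the div with class "header"
print([a.text for a in soup.select("div.header a")])
```

The same selector strings can be passed to Selenium through `driver.find_element(By.CSS_SELECTOR, ...)`.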

Notice that the Login button does not take time to be loaded, unlike the actual quotes. So we can throw out the `WebDriverWait`.



In [43]:
# should bring you to the Login page

driver = Chrome()

driver.get('https://quotes.toscrape.com/js-delayed/')

login_button = driver.find_element(By.CSS_SELECTOR, 'a[href="/login"]')
login_button.click()

Next, we need to find the username and password fields, and type in them. Inspecting the page, we can see that these fields are in input elements like so: 

`<input type="text" class="form-control" id="username" name="username">`

and


`<input type="password" class="form-control" id="password" name="password">`

We can easily find these using `.find_element()` and filtering with `By.NAME` (or `By.ID`).

After finding the elements, we can "send keys" to populate these fields.

In [45]:
driver = Chrome()

driver.get('https://quotes.toscrape.com/js-delayed/')

login_button = driver.find_element(By.CSS_SELECTOR, 'a[href="/login"]')
login_button.click()


# find the username and password fields
username_field = driver.find_element(By.ID, 'username')
password_field = driver.find_element(By.ID, 'password')

# send login info
username_field.send_keys('your_username')
password_field.send_keys('your_password')



Finally, we find and click on the submit button.

Here's the final script:

In [47]:
driver = Chrome()

driver.get('https://quotes.toscrape.com/js-delayed/')

login_button = driver.find_element(By.CSS_SELECTOR, 'a[href="/login"]')
login_button.click()


username_field = driver.find_element(By.ID, 'username')
password_field = driver.find_element(By.ID, 'password')

username_field.send_keys('your_username')
password_field.send_keys('your_password')

submit_button = driver.find_element(By.CSS_SELECTOR, 'input[type="submit"]')
submit_button.click()
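
None of the scripts above close the browser, so each run leaves a Chrome window open. One possible way to guarantee cleanup is a small context manager that always calls the driver's `quit()` method. In this sketch, `managed_driver` is a hypothetical helper and `FakeDriver` is a stand-in class, used so that the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Create a driver, hand it to the caller, and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # Runs even if the scraping code raises an exception
        driver.quit()


class FakeDriver:
    """Minimal stand-in for a Selenium driver so the example runs offline."""

    def __init__(self):
        self.visited = []
        self.closed = False

    def get(self, url):
        self.visited.append(url)

    def quit(self):
        self.closed = True


with managed_driver(FakeDriver) as driver:
    driver.get("https://quotes.toscrape.com/js-delayed/")

print(driver.closed)
```

With Selenium installed, `managed_driver(Chrome)` would be used the same way, with the login steps placed inside the `with` block.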
