# Web Scraping

## Setup
Web scraping is the process of extracting data from websites. It is an important skill for a data scientist, because data is not always as easy to obtain as querying a database or downloading a dataset from Kaggle.

For this session, we are going to scrape a website using BeautifulSoup and Selenium. To install these packages, you can run the command below.
```
pip install bs4 selenium
```

### For Safari
- To use Selenium with Safari, we first need to enable the Remote Automation feature in Safari.
    1. Open Safari settings/preferences and go to **Advanced** tab.
    <img src="https://github.com/FTDS-learning-materials/phase-0/blob/main/img/web-scraping-1.png?raw=true" />
    2. Check the "Show features for web developers" option.
    <img src="https://github.com/FTDS-learning-materials/phase-0/blob/main/img/web-scraping-2.png?raw=true" />
    3. After that, click on the **Developer** tab and check the "Allow Remote Automation" option.
    <img src="https://github.com/FTDS-learning-materials/phase-0/blob/main/img/web-scraping-3.png?raw=true" />

## Basic Web Component

The website that you are scraping in this lesson is made up of several components:
- HTML — the main content of the page.
- CSS — used to add styling to make the page look nicer.
- JS — Javascript files add interactivity to web pages.
- Images — image formats, such as JPG and PNG, allow web pages to show pictures.

There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping. When we perform web scraping, we’re interested in the main content of the web page, so we look primarily at the HTML. 

Hence, we must first learn the fundamentals of how HTML works. But don't worry, we don't need to dive deeply into it.

### HTML Structure

HyperText Markup Language (HTML) is the standard markup language for creating web pages. An HTML document consists of a series of elements.

HTML has many functions that are similar to what you might find in a word processor like Microsoft Word — it can make text bold, create paragraphs, and so on.

<img src="https://developer.mozilla.org/en-US/docs/Glossary/Element/anatomy-of-an-html-element.png"/><br>
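
To connect this anatomy to the tools used later in the lesson, here is a minimal sketch that parses a single element with BeautifulSoup and prints its parts. The HTML snippet is a made-up example and is not taken from the target website.

```python
from bs4 import BeautifulSoup

# a made-up element for illustration: opening tag, attribute, content, closing tag
snippet = '<p class="editor-note">My cat is very grumpy</p>'

soup = BeautifulSoup(snippet, "html.parser")
element = soup.find("p")

tag_name = element.name        # the element name from the opening tag
attributes = element.attrs     # the attributes as a dictionary
content = element.get_text()   # the text between the opening and closing tags

print("tag name  :", tag_name)
print("attributes:", attributes)
print("content   :", content)
```

Note that BeautifulSoup stores the value of `class` as a list, because an element may carry several classes at once.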


## Accessing the Web

Now, we will access https://www.scrapethissite.com/pages/forms/ for this lesson.

<img src="https://www.scrapingbee.com/blog/getting-started-with-mechanicalsoup/hockey-teams-page_hu2277696443619028977.png" />

To start scraping, we can set up our code as shown below.

In [2]:
# import packages
from selenium import webdriver
from bs4 import BeautifulSoup

# initialize selenium browser - Chrome
driver = webdriver.Chrome()

# define target url
url = "https://www.scrapethissite.com/pages/forms"

# tell the browser to open the web page
driver.get(url)

# extracting html from the page - in strings
html = driver.page_source

# convert html string to BeautifulSoup object
soup = BeautifulSoup(html, "html.parser")

# display object as output
display(soup)

# close the browser session
driver.quit()

<html lang="en"><head>
<meta charset="utf-8"/>
<title>Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping</title>
<link href="/static/images/scraper-icon.png" rel="icon" type="image/png"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components." name="description"/>
<link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css"/>
<link href="/static/css/styles.css" rel="stylesheet" type="text/css"/>
<meta content="noindex" name="robots"/>
</head>
<bod

As we can see from the output, we successfully retrieved the HTML from the target page. This data is still raw, so next we must decide which data we want to extract. For example, suppose we want to extract the team names. To do this, we first have to identify which element to target, using the Inspect Element feature in our browser.

<img src="https://github.com/FTDS-learning-materials/phase-0/blob/main/img/web-scraping-4.png?raw=true" />

From Inspect Element, we can see that each team name is located in a `<td>` element with the attribute `class="name"`. Next, we can extract the targeted elements using the `find()` or `find_all()` method.

- `find()` -> returns the first matching element, or `None` if nothing matches.
- `find_all()` -> returns a list of all matching elements (empty if nothing matches).

In this lesson we call both methods with two arguments: the first is the element name, and the second is an **optional** dictionary of element attribute(s).

Example

- `find("td", {"class": "name"})`
- `find_all("td", {"class": "name"})`
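
The difference between the two methods is easiest to see on a small inline snippet. The HTML below is a made-up fragment, assumed to resemble the target table. The sketch also shows what each method returns when nothing matches.

```python
from bs4 import BeautifulSoup

# a made-up table fragment that mimics the structure of the target page
html = """
<table>
  <tr class="team"><td class="name">Boston Bruins</td><td class="year">1990</td></tr>
  <tr class="team"><td class="name">Buffalo Sabres</td><td class="year">1990</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

first_match = soup.find("td", {"class": "name"})       # a single element
all_matches = soup.find_all("td", {"class": "name"})   # a list of elements
no_match = soup.find("td", {"class": "wins"})          # None, nothing matches
no_matches = soup.find_all("td", {"class": "wins"})    # an empty list

print(first_match)
print(len(all_matches))
print(no_match)
print(len(no_matches))
```

Because `find()` can return `None`, calling `get_text()` on its result without a check may raise an `AttributeError`.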

In [3]:
# import packages
from selenium import webdriver
from bs4 import BeautifulSoup

# initialize selenium browser - Chrome
driver = webdriver.Chrome()

# define target url
url = "https://www.scrapethissite.com/pages/forms"

# tell the browser to open the web page
driver.get(url)

# extracting html from the page - in strings
html = driver.page_source

# convert html string to BeautifulSoup object
soup = BeautifulSoup(html, "html.parser")

# find td elements
result = soup.find_all("td", {"class": "name"})

# the result is a list, so we can loop over it to process each element
for element in result:
    print("result: ", element)

# close the browser session
driver.quit()

result:  <td class="name">
                            Boston Bruins
                        </td>
result:  <td class="name">
                            Buffalo Sabres
                        </td>
result:  <td class="name">
                            Calgary Flames
                        </td>
result:  <td class="name">
                            Chicago Blackhawks
                        </td>
result:  <td class="name">
                            Detroit Red Wings
                        </td>
result:  <td class="name">
                            Edmonton Oilers
                        </td>
result:  <td class="name">
                            Hartford Whalers
                        </td>
result:  <td class="name">
                            Los Angeles Kings
                        </td>
result:  <td class="name">
                            Minnesota North Stars
                        </td>
result:  <td class="name">
                            Montreal Canadiens
       

Although we have successfully retrieved each `td` element, the data is not clean yet: each result still includes the tag and the surrounding whitespace. Now we have to extract the content of each element using `get_text()`.
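
On a single made-up cell, the effect of `get_text()` and `strip()` can be sketched as follows. The whitespace in the snippet is assumed to resemble the real output. The `strip=True` argument of `get_text()` is shown as an equivalent shortcut for this simple case.

```python
from bs4 import BeautifulSoup

# a made-up cell with the same kind of surrounding whitespace as the real output
html = """<td class="name">
                            Boston Bruins
                        </td>"""

cell = BeautifulSoup(html, "html.parser").find("td", {"class": "name"})

raw_text = cell.get_text()              # text only, whitespace still included
clean_text = cell.get_text().strip()    # leading and trailing whitespace removed
also_clean = cell.get_text(strip=True)  # same result using the strip argument

print(repr(raw_text))
print(repr(clean_text))
print(repr(also_clean))
```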

In [4]:
# import packages
from selenium import webdriver
from bs4 import BeautifulSoup

# initialize selenium browser - Chrome
driver = webdriver.Chrome()

# define target url
url = "https://www.scrapethissite.com/pages/forms"

# tell the browser to open the web page
driver.get(url)

# extracting html from the page - in strings
html = driver.page_source

# convert html string to BeautifulSoup object
soup = BeautifulSoup(html, "html.parser")

# find td elements
result = soup.find_all("td", {"class": "name"})

for element in result:
    # process each element
    # extract the content using get_text()
    # add strip() to remove whitespace from the extracted content
    print("result: ", element.get_text().strip())

# close the browser session
driver.quit()

result:  Boston Bruins
result:  Buffalo Sabres
result:  Calgary Flames
result:  Chicago Blackhawks
result:  Detroit Red Wings
result:  Edmonton Oilers
result:  Hartford Whalers
result:  Los Angeles Kings
result:  Minnesota North Stars
result:  Montreal Canadiens
result:  New Jersey Devils
result:  New York Islanders
result:  New York Rangers
result:  Philadelphia Flyers
result:  Pittsburgh Penguins
result:  Quebec Nordiques
result:  St. Louis Blues
result:  Toronto Maple Leafs
result:  Vancouver Canucks
result:  Washington Capitals
result:  Winnipeg Jets
result:  Boston Bruins
result:  Buffalo Sabres
result:  Calgary Flames
result:  Chicago Blackhawks


Now we have successfully extracted each team name from the page. Next, we can store the result in a DataFrame for analysis.

In [5]:
# import packages
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

# initialize DataFrame
df = pd.DataFrame()

# prepare temp list for storing all team names
teamNames = []

# initialize selenium browser - Chrome
driver = webdriver.Chrome()

# define target url
url = "https://www.scrapethissite.com/pages/forms"

# tell the browser to open the web page
driver.get(url)

# extracting html from the page - in strings
html = driver.page_source

# convert html string to BeautifulSoup object
soup = BeautifulSoup(html, "html.parser")

# find td elements
result = soup.find_all("td", {"class": "name"})

for element in result:
    # add validation before appending to the list
    # check that the element exists
    if element is not None:
        teamNames.append(element.get_text().strip())
    else:
        # otherwise add None to the list
        teamNames.append(None)

# close the browser session
driver.quit()

# populate the DataFrame
df['Team'] = teamNames

display(df)

|   | Team |
|---|---|
| 0 | Boston Bruins |
| 1 | Buffalo Sabres |
| 2 | Calgary Flames |
| 3 | Chicago Blackhawks |
| 4 | Detroit Red Wings |
| 5 | Edmonton Oilers |
| 6 | Hartford Whalers |
| 7 | Los Angeles Kings |
| 8 | Minnesota North Stars |
| 9 | Montreal Canadiens |


And that's it: we have done a simple web scrape of a sandbox website. You can apply this technique to other use cases you may come across in the future.

## More Samples
### Extracting from Multiple Pages

In [1]:
# import packages
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

# initialize DataFrame
df = pd.DataFrame()

# prepare temp lists for storing all the data
teamNames = []
teamYears = []
teamWins = []
teamLosses = []

# initialize selenium browser - Chrome
driver = webdriver.Chrome()


# using for loop to iterate multiple pages from the website
for page in range(1, 6):
    # define target url
    url = f"https://www.scrapethissite.com/pages/forms?page_num={page}"

    # tell the browser to open the web page
    driver.get(url)

    # extracting html from the page - in strings
    html = driver.page_source

    # convert html string to BeautifulSoup object
    soup = BeautifulSoup(html, "html.parser")

    # find tr elements with class team
    result = soup.find_all("tr", {"class": "team"})

    for tr_element in result:
        # find every td element inside of tr
        name = tr_element.find('td', {"class": "name"})
        year = tr_element.find('td', {"class": "year"})
        win = tr_element.find('td', {"class": "wins"})
        loss = tr_element.find('td', {"class": "losses"})

        # extract the text of each td element inside tr
        # find() returns None when nothing matches, so check before calling get_text()
        if name is not None:
            teamNames.append(name.get_text().strip())
        else:
            teamNames.append(None)

        if year is not None:
            teamYears.append(year.get_text().strip())
        else:
            teamYears.append(None)

        if win is not None:
            teamWins.append(win.get_text().strip())
        else:
            teamWins.append(None)

        if loss is not None:
            teamLosses.append(loss.get_text().strip())
        else:
            teamLosses.append(None)

# close the browser session
driver.quit()

# populate the DataFrame
df['Team'] = teamNames
df['Year'] = teamYears
df['Wins'] = teamWins
df['Losses'] = teamLosses

display(df)

|   | Team | Year | Wins | Losses |
|---|---|---|---|---|
| 0 | Boston Bruins | 1990 | 44 | 24 |
| 1 | Buffalo Sabres | 1990 | 31 | 30 |
| 2 | Calgary Flames | 1990 | 46 | 26 |
| 3 | Chicago Blackhawks | 1990 | 49 | 23 |
| 4 | Detroit Red Wings | 1990 | 34 | 38 |
| ... | ... | ... | ... | ... |
| 120 | Boston Bruins | 1995 | 40 | 31 |
| 121 | Buffalo Sabres | 1995 | 33 | 42 |
| 122 | Calgary Flames | 1995 | 34 | 37 |
| 123 | Chicago Blackhawks | 1995 | 40 | 28 |
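
The repeated `if ... else` blocks in the multi-page cell can be folded into one small helper. This is only a sketch: `text_or_none` is a hypothetical helper name, and the HTML is a made-up fragment in which the second row is missing its `losses` cell.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, tag, css_class):
    """Return the stripped text of the first matching child, or None if absent."""
    element = parent.find(tag, {"class": css_class})
    if element is None:
        return None
    return element.get_text().strip()

# made-up rows; the second one has no "losses" cell
html = """
<table>
  <tr class="team">
    <td class="name"> Boston Bruins </td><td class="year"> 1990 </td>
    <td class="wins"> 44 </td><td class="losses"> 24 </td>
  </tr>
  <tr class="team">
    <td class="name"> Buffalo Sabres </td><td class="year"> 1990 </td>
    <td class="wins"> 31 </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr_element in soup.find_all("tr", {"class": "team"}):
    rows.append({
        "Team": text_or_none(tr_element, "td", "name"),
        "Year": text_or_none(tr_element, "td", "year"),
        "Wins": text_or_none(tr_element, "td", "wins"),
        "Losses": text_or_none(tr_element, "td", "losses"),
    })

for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`. Note that every value is still a string at this point, since `get_text()` always returns text.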


**Notes**
- In some cases you may need to add `time.sleep()` while scraping. It pauses the script for a fixed number of seconds so that the browser can finish loading the page before its HTML is read through `driver.page_source`.
    ```
    ...
    import time  # import the `time` package first
    ...
    driver.get(url)

    # pause between opening the web page and extracting its HTML
    time.sleep(5)

    html = driver.page_source
    ...
    ```

### Accessing Each Individual Detail 

<img src="https://github.com/FTDS-learning-materials/phase-0/blob/main/img/web-scraping-5.png?raw=true">

In this example, we're going to use the Gramedia website. On this [page](https://www.gramedia.com/categories/buku/komik) we can see that there are many subcategories ("Fantasi", "Fiksi Sejarah", "Horror", etc.). Our target is to scrape the information for every book listed in each subcategory. So, we will open each subcategory page and scrape each book's `title`, `author`, and `price`.

In [None]:
# import packages
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time

# initialize DataFrame
df = pd.DataFrame()

# prepare list variables
bookTitles = []
bookAuthors = []
bookPrices = []

# open browser
driver = webdriver.Chrome()

# open web page
url = "https://www.gramedia.com/categories/buku/komik"
driver.get(url)

# wait 3 seconds for the page to finish loading
time.sleep(3)

# extract HTML from the page
html = driver.page_source
page = BeautifulSoup(html, "html.parser")

# iterate over every subcategory
for subcategory in page.find_all('a', {"data-sentry-component": "CategoryPill"}):
    # open each subcategory page
    driver.get(subcategory['href'].lower())

    # wait 5 seconds for the page to finish loading
    time.sleep(5)

    # extract HTML from the page
    html = driver.page_source
    sub_page = BeautifulSoup(html, "html.parser")

    # iterate over every book inside the subcategory page
    for books in sub_page.find_all('div', {"data-testid": "productCardContent"}):
        # find each element
        title = books.find('h2', {"data-testid": "productCardTitle"})
        author = books.find('div', {"data-testid": "productCardAuthor"})
        price = books.find('div', {"data-testid": "productCardFinalPrice"})

        # extract each element, storing None when it is missing
        if title is not None:
            bookTitles.append(title.get_text().strip())
        else:
            bookTitles.append(None)

        if author is not None:
            bookAuthors.append(author.get_text().strip())
        else:
            bookAuthors.append(None)

        if price is not None:
            bookPrices.append(price.get_text().strip())
        else:
            bookPrices.append(None)

driver.quit()

# populate DataFrame
df['Title'] = bookTitles
df['Author'] = bookAuthors
df['Price'] = bookPrices

display(df)

|   | Title | Author | Price |
|---|---|---|---|
| 0 | Koloni : Sakti Family Begins | Alisnaik | Rp45.000 |
| 1 | Koloni Gundala: Amuk Vol. 2 | Iskandar Salim, Wahyu Widiatmoko, Wastukancono... | Rp51.000 |
| 2 | Light Novel: Overlord 3 - The Bloody Valkyrie | Kugane Maruyama | Rp135.000 |
| 3 | Light Novel The Rising Shield Hero 02 | Aneko Yusagi | Rp115.000 |
| 4 | Light Novel So I'm a Spider, So What? 2 | Okina Baba | Rp115.000 |
| ... | ... | ... | ... |
| 116 | Koloni Jawara Sejati | Tanfidz Tammamudin & Ragha Sukma | Rp36.000 |
| 117 | Eknath | Sotogakpaketomat | Rp89.100 |
| 118 | Komik Jingga dalam Elegi | Esti Kinasih | Rp62.250 |
| 119 | Flesh Out | Bella Zmr | Rp33.750 |
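
The scraped prices are strings such as `Rp45.000`, so they need to be converted before any numeric analysis. Here is a minimal sketch, assuming that the dot is a thousands separator and that prices never contain decimals. `parse_rupiah` is a hypothetical helper name.

```python
def parse_rupiah(price_text):
    """Convert a price string such as 'Rp45.000' to an integer, or None if empty."""
    if price_text is None:
        return None
    # keep only the digits; assumes "." separates thousands and there are no decimals
    digits = "".join(ch for ch in price_text if ch.isdigit())
    return int(digits) if digits else None

# sample values copied from the output above, plus a missing price
prices = ["Rp45.000", "Rp135.000", "Rp89.100", None]
cleaned = [parse_rupiah(p) for p in prices]

print(cleaned)  # → [45000, 135000, 89100, None]
```

The same function could be applied to the whole column with `df['Price'].apply(parse_rupiah)`.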
