## Tools Preparation
Scraping is one way to retrieve data, and it is an important skill for a data scientist because data is not always as easy to get as querying a database or downloading a dataset from Kaggle. In this lesson we will scrape Gramedia.com using BeautifulSoup. Before going further, please install BeautifulSoup.

To install BeautifulSoup and Selenium, run the following command in Anaconda Prompt (Windows) or a terminal (Linux/Mac/VSCode):

```
pip install bs4 selenium
```

You also need to install requests to access a web address:

```
pip install requests
```
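
Before moving on, it can help to confirm that the three packages are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for this illustration and is not part of any of the installed libraries.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


# bs4, selenium, and requests are the import names of the packages installed above
missing = missing_packages(["bs4", "selenium", "requests"])
print("Missing packages:", ", ".join(missing) if missing else "none")
```

If the list is not empty, rerun the matching `pip install` command in the same environment that runs your notebook.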

### Selenium WebDriver

Selenium WebDriver is a powerful tool for automating browser interactions and testing web applications. It provides a programming interface to control browser behavior and perform actions such as clicking buttons, filling forms, and navigating through web pages.

To get started with Selenium WebDriver for different browsers, you'll need to ensure that you have the appropriate browser drivers installed and set up correctly. Each browser requires its specific driver to communicate with Selenium.

1. Google Chrome:
   - You need to download the ChromeDriver executable and place it in a location that is in your system's PATH.
   - Official ChromeDriver download page: https://sites.google.com/chromium.org/driver/

2. Safari:
   - SafariDriver is automatically installed with Safari on macOS.
   - To enable it, go to Safari preferences, then to the 'Advanced' tab, and check the "Show Develop menu in menu bar" option.
   - After that, in the Develop menu, go to "Allow Remote Automation" to enable SafariDriver.

3. Firefox:
   - You need to download the geckodriver executable and place it in a location that is in your system's PATH.
   - Official geckodriver download page: https://github.com/mozilla/geckodriver/releases

4. Microsoft Edge:
   - For Microsoft Edge (Chromium-based version), you need to download the Microsoft Edge Driver (also known as MSEdgeDriver) and place it in a location that is in your system's PATH.
   - Official MSEdgeDriver download page: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/

Once you have set up the appropriate drivers, you can use Selenium WebDriver in your preferred programming language (Python, Java, C#, etc.) to automate interactions with the browsers.

Here is how to create a Selenium WebDriver instance in Python for Chrome, Safari, Firefox, and Microsoft Edge:

```python
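from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.edge.service import Service as EdgeService
from selenium.webdriver.firefox.service import Service as FirefoxService

# Selenium 4 takes the driver path through a Service object.
# Create only the driver for the browser you actually use.

# Create a new instance of the Chrome browser
driver = webdriver.Chrome(service=ChromeService(executable_path="/path/to/chromedriver"))

# Create a new instance of the Safari browser
driver = webdriver.Safari()

# Create a new instance of the Firefox browser
driver = webdriver.Firefox(service=FirefoxService(executable_path="/path/to/geckodriver"))

# Create a new instance of the Microsoft Edge browser
driver = webdriver.Edge(service=EdgeService(executable_path="/path/to/msedgedriver"))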
```

Similarly, you can use WebDriver with other browsers by using the appropriate driver for each browser and modifying the setup accordingly.

Always ensure you are using the latest versions of Selenium WebDriver and browser drivers to avoid compatibility issues. You can check the Selenium official website (https://www.selenium.dev/) and the respective browser driver download pages for updates and documentation.
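
None of the snippets above close the browser when the work is done. One way to guarantee cleanup is a small context manager, sketched below. `managed_driver` is a hypothetical helper name, and `FakeDriver` only stands in for a real browser so that the sketch runs without one.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always call quit() afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Stand-in for a real WebDriver, used only to demonstrate the helper."""

    def __init__(self):
        self.closed = False
        self.visited = []

    def get(self, url):
        self.visited.append(url)

    def quit(self):
        self.closed = True


with managed_driver(FakeDriver) as driver:
    driver.get("https://www.selenium.dev/")

print(driver.closed)  # True: quit() ran when the with-block ended
```

With a real browser, pass any callable that builds your driver, for example `webdriver.Safari`.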

## Basic Web Component

The website that you are scraping in this lesson contains several components. Those are:
- HTML — the main content of the page.
- CSS — used to add styling to make the page look nicer.
- JS — Javascript files add interactivity to web pages.
- Images — image formats, such as JPG and PNG, allow web pages to show pictures.

There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping. When we perform web scraping, we’re interested in the main content of the web page, so we look primarily at the HTML. Hence, you need to know some HTML structure to make scraping easier. Don't worry, you don't need to dive deeply into it.

### HTML Structure

HyperText Markup Language (HTML) is the language that web pages are created in. HTML isn’t a programming language, like Python, though. It’s a markup language that tells a browser how to display content.

HTML has many functions that are similar to what you might find in a word processor like Microsoft Word — it can make text bold, create paragraphs, and so on.

Below is an example of HTML structure:

```html
<HTML>
    <HEAD>
        <TITLE>My cool title</TITLE>
    </HEAD>
    <BODY>
        <H1>This is a Header</H1>
        <ul id="list" class="coolList">
            <li>item 1</li>
            <li>item 2</li>
            <li>item 3</li>
        </ul>
    </BODY>
</HTML>
```

- Names wrapped in angle brackets, such as `<HTML>`, `<H1>`, and `<ul>`, are called tags or elements. A tag always starts with "<".
- HTML, HEAD, and BODY are the main elements, and the rest are the content. We will focus on the content.
- `id` and `class` are attributes, which give extra information about a tag.
- `"list"` and `"coolList"` are the attribute values.
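
To connect these terms to code, here is a small sketch that parses the example above with BeautifulSoup and prints a tag name, its attributes, and its content. The variable names `html_doc` and `soup_example` are chosen for this illustration only.

```python
from bs4 import BeautifulSoup

html_doc = """
<HTML>
    <HEAD>
        <TITLE>My cool title</TITLE>
    </HEAD>
    <BODY>
        <H1>This is a Header</H1>
        <ul id="list" class="coolList">
            <li>item 1</li>
            <li>item 2</li>
            <li>item 3</li>
        </ul>
    </BODY>
</HTML>
"""

soup_example = BeautifulSoup(html_doc, "html.parser")
ul = soup_example.find("ul")

print(ul.name)      # the tag name: ul
print(ul["id"])     # the value of the id attribute: list
print(ul["class"])  # class values come back as a list: ['coolList']
print([li.get_text() for li in ul.find_all("li")])  # the content of each <li>
```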


## Accessing the Web

In this lesson we will access https://www.gramedia.com/categories/buku. Before going further, we need to understand how to access a URL in Python. To do that, we use the requests library.

```python
import requests
page = requests.get("https://www.gramedia.com/categories/buku")
page
```

```
<Response [200]>
```

If the output is `<Response [200]>`, you have successfully accessed the URL. "200" is an HTTP status code. You can read https://id.wikipedia.org/wiki/Daftar_kode_status_HTTP for further explanation.
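
As a quick reference for what the numbers mean, the sketch below uses Python's standard `http.HTTPStatus` enum to look up the reason phrase for a few common codes. The helper name `describe_status` is made up for this example; with requests, the number itself is available as `page.status_code`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the status code together with its standard reason phrase."""
    return f"{code} {HTTPStatus(code).phrase}"


for code in (200, 404, 500):
    print(describe_status(code))
# 200 OK
# 404 Not Found
# 500 Internal Server Error
```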

Now you can check the HTML content of the page in Python. You can also check it in your browser by right-clicking and choosing Inspect element, which makes the web structure easier to understand.

We need to parse the HTML with BeautifulSoup to make it clear and accessible to scrape. The code below loads the page with Selenium and passes the page source to BeautifulSoup.

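```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(executable_path='/Users/fahmimn21/Downloads/chromedriver'))

url = "https://www.gramedia.com/categories/buku"
driver.get(url)
html = driver.page_source

soup = BeautifulSoup(html, "html.parser")
print(soup.prettify()[:700])
```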

```
<html class="" lang="id">
 <head>
  <base href="/"/>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="#281e5a" name="theme-color"/>
  <meta content="index, follow" name="robots"/>
  <link href="/assets/favicon.ico" rel="icon" type="image/x-icon"/>
  <link href="manifest.json" rel="manifest"/>
  <link href="/assets/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
  <meta content="z91Vp6ZYo9UoX5D4ur6i4Lrl0l1j3DDoCH08fD3n53g" name="google-site-verification"/>
  <meta content="810657685650228" property="fb:app_id"/>
  <style>
   .async-hide {
        opacity: 0 !important
    }
  </style>
  <script async="" src="https://app.
```


<img src="https://i.ibb.co/vsz2M33/message-Image-1636690176458.jpg"></img>

Suppose we want to retrieve the books' titles. Let's find where a title sits in the HTML using Inspect element!

Based on Inspect element, a book's title lies in this code:

```html
<div _ngcontent-web-gramedia-c53="" class="list-title">Creepy Case Club 4: Kasus Pohon Pemanggil</div>
```

"Creepy Case Club 4: Kasus Pohon Pemanggil" is located in a **div** tag with the attribute **class** and the value "*list-title*". We will use this information to tell the soup where the titles are.

So we need to find all div elements whose class attribute has the value "list-title".

To do that, we use ```soup.find_all("<element>",{"<attribute>":"<attribute value>"})```
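
The pattern can be tried offline first. The sketch below runs it on a small made-up HTML snippet (the book names are placeholders) and also shows two equivalent spellings that BeautifulSoup offers: the `class_` keyword and a CSS selector passed to `select`.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two matching divs, one div with another class, one non-div
sample_html = """
<div class="list-title">Book A</div>
<div class="list-title">Book B</div>
<div class="other">Not a title</div>
<p class="list-title">Wrong tag</p>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

by_attrs = sample_soup.find_all("div", {"class": "list-title"})
by_keyword = sample_soup.find_all("div", class_="list-title")
by_css = sample_soup.select("div.list-title")

# All three spellings select the same two div elements
for result in (by_attrs, by_keyword, by_css):
    print([tag.get_text() for tag in result])
```

Only the two `div` elements with class "list-title" are returned; the `p` tag and the other `div` are ignored.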

```python
soup.find_all('div',{"class":"list-title"})
```

```
[<div _ngcontent-web-gramedia-c26="" class="list-title">Semangat Baja Ibnu Sutanto</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Tanya Jawab Seru Tentang Cuaca</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Kumpulan Dongeng Paud Mengenal Suara Di Sekitar Kita (Bonus Stiker Mewarnai Bip)</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Think And Grow Rich : Cara Para Jutawan Dan Miliarder Meraih Kekayaan</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Mungkin Kita Hanya</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Dongeng Karakter Positif PAUD : Permintaan Rara Jonggrang</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Dongeng Karakter Positif PAUD : Petualangan Sangkuriang</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Kitty dan Tragedi di Pekan Raya</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Kitty dan Kompetisi Lampion</div>,
 <div _ngcontent-web-gramedia-c
```

The soup found all div elements whose class is "list-title", but we only need the title text. To extract it, call the .get_text() method on each element of the list.

```python
for div_tag in soup.find_all('div',{"class":"list-title"}):
    print(div_tag.get_text())
```

```
Semangat Baja Ibnu Sutanto
Tanya Jawab Seru Tentang Cuaca
Kumpulan Dongeng Paud Mengenal Suara Di Sekitar Kita (Bonus Stiker Mewarnai Bip)
Think And Grow Rich : Cara Para Jutawan Dan Miliarder Meraih Kekayaan
Mungkin Kita Hanya
Dongeng Karakter Positif PAUD : Permintaan Rara Jonggrang
Dongeng Karakter Positif PAUD : Petualangan Sangkuriang
Kitty dan Tragedi di Pekan Raya
Kitty dan Kompetisi Lampion
Jejak Nostalgia Pat Hendranto
Hoegeng Polisi dan Menteri Teladan Edisi Revisi
Shimmer & Shine : Di Dalam Rumah Boneka
Long Hu Men The Vengeance Continues 36
AKASHA : Re: Zero, Starting Life in Another World Chapter 2 : A Week at the Mansion 03
The Apothecary Diaries 03
The Promised Neverland 20 (END)
Lil` Sis Please Cook For me! 02
```


It is easy, isn't it?

Next, we will do more. Our task is to get the title, author, price, link to the book's page, and link to the image.

Based on Inspect element, this information is located in:
- Title: ```<div _ngcontent-web-gramedia-c53="" class="list-title">Creepy Case Club 4: Kasus Pohon Pemanggil</div>```
- Author: ```<p class="div-author"><span _ngcontent-web-gramedia-c53="" class="list-author ng-star-inserted"> Arvidan None </span>```
- Price: ```<p _ngcontent-web-gramedia-c53="" class="formats-price">Rp 79.000</p>```
- Link: ```<div class="ng-star-inserted"><a _ngcontent-web-gramedia-c53="" href="/products/think-and-grow-rich-cara-para-jutawan-dan-miliarder-meraih-kekayaan">```
- Image: ```<img _ngcontent-web-gramedia-c26="" class="product-list-img ng-star-inserted ng-lazyloaded" src="https://cdn.gramedia.com/uploads/items/9786230405990_Think_and_Grow_Rich__w149_hauto.jpeg" alt="Think And Grow Rich : Cara Para Jutawan Dan Miliarder Meraih Kekayaan">```
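
Note that the `href` in the Link example is relative, while the image `src` is already a full URL. A minimal sketch of turning both into absolute URLs with the standard library's `urljoin` follows; using the site root `https://www.gramedia.com` as the base is an assumption of this example.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.gramedia.com"  # assumed site root

# Relative href taken from the Link example above
href = "/products/think-and-grow-rich-cara-para-jutawan-dan-miliarder-meraih-kekayaan"
print(urljoin(BASE_URL, href))

# An absolute URL, like the image src above, is returned unchanged
src = "https://cdn.gramedia.com/uploads/items/9786230405990_Think_and_Grow_Rich__w149_hauto.jpeg"
print(urljoin(BASE_URL, src))
```

Compared with plain string concatenation, `urljoin` handles both cases with one call, so relative placeholder paths and full CDN links can be processed the same way.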

Let's wrap up the code and put the data into a pandas DataFrame.

```python
import pandas as pd

data = pd.DataFrame()

data['Title'] = [ title.get_text() for title in soup.find_all( 'div', {"class":"list-title"} ) ]
data['Author'] = [ author.get_text() for author in soup.find_all( 'p', {"class":"div-author"} ) ]
data['Price'] = [ price.get_text() for price in soup.find_all( 'p', {"class":"formats-price"} ) ]
data['Image'] = [ img['src'] for img in soup.find_all( 'img',{"class":"product-list-img"} ) ]

links = []
for tag in soup.find_all( 'div',{"_ngcontent-web-gramedia-c26":"","class":"ng-star-inserted"} ):
    try:
        links.append("https://www.gramedia.com"+tag.find_all('a',{"_ngcontent-web-gramedia-c26":""})[0]['href'])
    except (IndexError, KeyError):
        # Skip divs that hold no matching <a> tag or no href
        pass

data['Link'] = links

data
```

|  | Title | Author | Price | Image | Link |
|---|---|---|---|---|---|
| 0 | Semangat Baja Ibnu Sutanto | Robert Adhi Ksp | Rp 100.000 | https://cdn.gramedia.com/uploads/items/semanga... | https://www.gramedia.com/products/semangat-baj... |
| 1 | Tanya Jawab Seru Tentang Cuaca | Miles Kelly | Rp 45.000 | https://cdn.gramedia.com/uploads/items/img2021... | https://www.gramedia.com/products/tanya-jawab-... |
| 2 | Kumpulan Dongeng Paud Mengenal Suara Di Sekita... | Heru Kurniawan | Rp 118.000 | https://cdn.gramedia.com/uploads/items/img2021... | https://www.gramedia.com/products/kumpulan-don... |
| 3 | Think And Grow Rich : Cara Para Jutawan Dan Mi... | Napoleon Hill | Rp 77.000 | https://cdn.gramedia.com/uploads/items/9786230... | https://www.gramedia.com/products/think-and-gr... |
| 4 | Mungkin Kita Hanya | Nugroho Putu | Rp 65.000 | https://cdn.gramedia.com/uploads/items/9786230... | https://www.gramedia.com/products/mungkin-kita... |
| 5 | Dongeng Karakter Positif PAUD : Permintaan Rar... | Heru Kurniawan Umi Khomsiyatun | Rp 75.000 | https://cdn.gramedia.com/uploads/items/9786230... | https://www.gramedia.com/products/dongeng-kara... |
| 6 | Dongeng Karakter Positif PAUD : Petualangan Sa... | Heru Kurniawan, Endah Kusumaningrum | Rp 75.000 | https://cdn.gramedia.com/uploads/items/9786230... | https://www.gramedia.com/products/dongeng-kara... |
| 7 | Kitty dan Tragedi di Pekan Raya | PAULA HARRISON | Rp 67.000 | https://cdn.gramedia.com/uploads/items/COVER_d... | https://www.gramedia.com/products/kitty-dan-tr... |
| 8 | Kitty dan Kompetisi Lampion | PAULA HARRISON | Rp 67.000 | /assets/default-images/product.png | https://www.gramedia.com/products/kitty-dan-ko... |
| 9 | Jejak Nostalgia Pat Hendranto | Pat Hendranto | Rp 69.000 | /assets/default-images/product.png | https://www.gramedia.com/products/jejak-nostal... |
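
The Price column holds text such as "Rp 79.000", which cannot be summed or sorted numerically. Below is a minimal sketch of converting it to an integer, assuming the dot is a thousands separator and prices have no decimal part; `parse_price` is a hypothetical helper name.

```python
import re


def parse_price(text):
    """Convert a price string such as 'Rp 79.000' to an integer number of rupiah."""
    digits = re.sub(r"\D", "", text)  # drop everything that is not a digit
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


print(parse_price("Rp 79.000"))   # 79000
print(parse_price("Rp 118.000"))  # 118000
```

On a DataFrame, the same function could be applied to the column with `data['Price'].map(parse_price)`.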


## Multipage

So far we have worked on a single page. However, the catalog spans many more pages, as shown below:

<img src="https://i.ibb.co/CQ6JQLv/message-Image-1636716930335.jpg"></img>

If we look at the next page, such as page 2, we can see that the URL changes to https://www.gramedia.com/categories/buku?page=2, and for page 3 to https://www.gramedia.com/categories/buku?page=3. Each page number appears in the URL, so we can access many pages automatically using a loop. The image links are less reliable because the image loader depends heavily on your connection, so some rows may show a placeholder image path. Let's check the code below.
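
The URL pattern described above can be generated without opening a browser. This is a small sketch using the standard library's `urlencode`; the names `CATEGORY_URL` and `page_urls` are chosen for this illustration.

```python
from urllib.parse import urlencode

CATEGORY_URL = "https://www.gramedia.com/categories/buku"


def page_urls(first, last):
    """Return the category URLs for pages first..last, inclusive."""
    return [f"{CATEGORY_URL}?{urlencode({'page': n})}" for n in range(first, last + 1)]


for page_url in page_urls(1, 3):
    print(page_url)
# https://www.gramedia.com/categories/buku?page=1
# https://www.gramedia.com/categories/buku?page=2
# https://www.gramedia.com/categories/buku?page=3
```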

```python
from selenium.webdriver.chrome.service import Service

titles = []
authors = []
prices = []
images = []
links = []

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(executable_path='/Users/fahmimn21/Downloads/chromedriver'))

for i in range(1,21):
    url="https://www.gramedia.com/categories/buku?page={}".format(i)
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")

    titles += [ tag.get_text() for tag in soup.find_all( 'div', {"class":"list-title"} ) ]
    authors += [ tag.get_text() for tag in soup.find_all( 'p', {"class":"div-author"} ) ]
    prices += [ tag.get_text() for tag in soup.find_all( 'p', {"class":"formats-price"} ) ]
    images += [ img['src'] for img in soup.find_all( 'img',{"class":"product-list-img"} ) ]

    for tag in soup.find_all( 'div',{"_ngcontent-web-gramedia-c26":"","class":"ng-star-inserted"} ):
        try:
            links.append("https://www.gramedia.com"+tag.find_all('a',{"_ngcontent-web-gramedia-c26":""})[0]['href'])
        except (IndexError, KeyError):
            # Skip divs that hold no matching <a> tag or no href
            pass

driver.quit()  # close the browser once all pages are collected

data_multipage = pd.DataFrame()
data_multipage['Title'] = titles
data_multipage['Author'] = authors
data_multipage['Price'] = prices
data_multipage['Image'] = images
data_multipage['Link'] = links

data_multipage
```

|  | Title | Author | Price | Image | Link |
|---|---|---|---|---|---|
| 0 | Semangat Baja Ibnu Sutanto | Robert Adhi Ksp | Rp 100.000 | https://cdn.gramedia.com/uploads/items/semanga... | https://www.gramedia.com/products/semangat-baj... |
| 1 | Tanya Jawab Seru Tentang Cuaca | Miles Kelly | Rp 45.000 | https://cdn.gramedia.com/uploads/items/img2021... | https://www.gramedia.com/products/tanya-jawab-... |
| 2 | Kumpulan Dongeng Paud Mengenal Suara Di Sekita... | Heru Kurniawan | Rp 118.000 | https://cdn.gramedia.com/uploads/items/img2021... | https://www.gramedia.com/products/kumpulan-don... |
| 3 | Think And Grow Rich : Cara Para Jutawan Dan Mi... | Napoleon Hill | Rp 77.000 | https://cdn.gramedia.com/uploads/items/9786230... | https://www.gramedia.com/products/think-and-gr... |
| 4 | Mungkin Kita Hanya | Nugroho Putu | Rp 65.000 | https://cdn.gramedia.com/uploads/items/9786230... | https://www.gramedia.com/products/mungkin-kita... |
| ... | ... | ... | ... | ... | ... |
| 392 | Pengantar Pemahaman Konsepsi Dasar Sekitar Hak... | Dr. Bambang Kesowo, S.H., LL.M. | Rp 119.000 | /assets/default-images/product.png | https://www.gramedia.com/products/pengantar-pe... |
| 393 | Siapa Orang Asli Palestina? Sejarah Singkat Pa... | Zafarul Islam Khan, Ph.D | Rp 58.000 | /assets/default-images/product.png | https://www.gramedia.com/products/siapa-orang-... |
| 394 | Jam Berapa Sekarang? (Boardbook) | Erika Medinah | Rp 69.000 | /assets/default-images/product.png | https://www.gramedia.com/products/jam-berapa-s... |
| 395 | Next G: Tenteram Dengan Shalat (Republish) | Nafila Radyana Syamma, Dkk | Rp 39.000 | /assets/default-images/product.png | https://www.gramedia.com/products/next-g-tente... |
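
Each column above is collected by its own `find_all` call, so if one book lacks a field, the lists end up with different lengths. A small check before building the DataFrame can surface this early. The sketch below uses made-up values, and the helper names `column_lengths` and `assert_aligned` are hypothetical.

```python
def column_lengths(columns):
    """Map each column name to the number of scraped values it holds."""
    return {name: len(values) for name, values in columns.items()}


def assert_aligned(columns):
    """Raise ValueError if the scraped columns differ in length."""
    lengths = column_lengths(columns)
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")


# Made-up values standing in for the scraped lists
scraped = {
    "Title": ["Book A", "Book B"],
    "Author": ["Author A", "Author B"],
    "Price": ["Rp 10.000"],  # one price is missing
}

try:
    assert_aligned(scraped)
except ValueError as err:
    print(err)
```

Equal lengths do not prove that every value sits next to the right book, so this is only a first sanity check.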


## Accessing Individual Page

<img src="https://i.ibb.co/F8D5bCy/message-Image-1637134633305.jpg"></img>

Suppose we want more detailed information about the books, but that information is on each book's individual page. We will open each individual page and scrape it for the title, author, price, description, number of pages, date of issue, and publisher.

```python
from selenium.webdriver.chrome.service import Service

rows = []

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(executable_path='/Users/fahmimn21/Downloads/chromedriver'))

for i in range(1,3):
    url="https://www.gramedia.com/categories/buku?page={}".format(i)
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")

    for tag in soup.find_all( 'div',{"_ngcontent-web-gramedia-c26":"","class":"ng-star-inserted"} ):
        try:
            link = "https://www.gramedia.com"+tag.find('a',{"_ngcontent-web-gramedia-c26":""})['href']
            driver.get(link)
            html_ind = driver.page_source
            soup_ind = BeautifulSoup(html_ind, "html.parser")
            details = soup_ind.find('div',{'class':'detail-section'}).find_all('p')

            # Build the whole row first so a missing field cannot misalign the columns
            rows.append({
                'Title': soup_ind.find( 'div', {"class":"book-title"} ).get_text(),
                'Author': soup_ind.find('span',{"class":"title-author"}).get_text(),
                'Price': soup_ind.find('div', {'class':'price-product'}).get_text(),
                'Desc': soup_ind.find('pre').get_text(),
                'Num Pages': details[0].get_text(),
                'Date Issue': details[2].get_text(),
                'Publisher': details[1].get_text(),
            })
        except Exception:
            # Skip entries without a link, with missing fields, or that fail to load
            pass

driver.quit()  # close the browser once all pages are collected

pages = pd.DataFrame(rows, columns=['Title', 'Author', 'Price', 'Desc', 'Num Pages', 'Date Issue', 'Publisher'])

pages
```


|  | Title | Author | Price | Desc | Num Pages | Date Issue | Publisher |
|---|---|---|---|---|---|---|---|
| 0 | Semangat Baja Ibnu Sutanto | Robert Adhi Ksp | Rp 100.000 | Buku Semangat Baja Ibnu Su... | 350.0 | 17 Nov 2021 | Pbk |
| 1 | Think And Grow Rich : Cara Para Jutawan Dan Mi... | Napoleon Hill | Rp 77.000 | Kekayaan dimulai dengan ke... | 364.0 | 17 Nov 2021 | Bhuana Ilmu Populer |
| 2 | Kitty dan Tragedi di Pekan Raya | PAULA HARRISON | Rp 67.000 | Kitty sangat ingin pergi k... | 128.0 | 17 Nov 2021 | Bhuana Ilmu Populer |
| 3 | Hoegeng Polisi dan Menteri Teladan Edisi Revisi | Suhartono | Rp 75.000 | Namun, bagaimana kiprah Ho... | 300.0 | 17 Nov 2021 | Pbk |
| 4 | Lil` Sis Please Cook For me! 02 | IUNOSU | Rp 40.000 | Yuzuki akhirnya mulai terb... | 200.0 | 17 Nov 2021 | Elex Media Komputindo |
| 5 | Detektif Conan Premium 09 | Aoyama Gosho | Rp 65.000 | Kogoro Mouri menerima sepu... | 368.0 | 17 Nov 2021 | Elex Media Komputindo |
| 6 | Kumpulan Latihan PHP | Eri Mardiani | Rp 75.000 | Saat ini pemrograman sanga... | 224.0 | 17 Nov 2021 | Elex Media Komputindo |
| 7 | Only Human (Themis Files #3) | Sylvain Neuvel | Rp 110.000 | Sebentuk tangan raksasa ta... | 408.0 | 17 Nov 2021 | Elex Media Komputindo |
| 8 | Mereka yang Tak Kembali (Long Bright River) | Liz Moore | Rp 129.000 | Jalanan kota Philadelphia ... | 392.0 | 17 Nov 2021 | Gramedia Pustaka Utama |
| 9 | MetroPop: Saat-Saat Jauh | Lia Seplia | Rp 87.000 | Aline dan Alex saling perc... | 280.0 | 17 Nov 2021 | Gramedia Pustaka Utama |
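
When an element is missing from a detail page, `find` returns `None` and calling `.get_text()` on it fails, which is why the loop above needs a `try` block. An alternative, sketched here on a made-up detail page, is to record `None` for the missing field so the book is still kept. `text_or_none` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, name, attrs):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = soup.find(name, attrs)
    return tag.get_text(strip=True) if tag is not None else None


# Made-up detail page that has a title and an author but no price element
detail_html = """
<div class="book-title"> Example Book </div>
<span class="title-author">Example Author</span>
"""
detail_soup = BeautifulSoup(detail_html, "html.parser")

print(text_or_none(detail_soup, "div", {"class": "book-title"}))     # Example Book
print(text_or_none(detail_soup, "div", {"class": "price-product"}))  # None
```

Missing values then show up as empty cells in the DataFrame, where they can be counted or filtered later.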
