# Week 3: Day 5 AM // Web Scraping

## Tools Preparation
Scraping is one way to retrieve data, and it is an important skill for a data scientist because data is not always as easy to get as querying a database or downloading a dataset from Kaggle. In this lesson we will scrape Gramedia.com using BeautifulSoup. Before going further, please install BeautifulSoup.

To install BeautifulSoup, run the following command in Anaconda Prompt (Windows) or a terminal (Linux/Mac/VSCode):

```
pip install bs4
```

You also need to install requests to access a web address:

```
pip install requests
```

## Basic Web Component

The website that you are scraping in this lesson contains several components:
- HTML: the main content of the page.
- CSS: used to add styling to make the page look nicer.
- JS: JavaScript files add interactivity to web pages.
- Images: image formats, such as JPG and PNG, allow web pages to show pictures.

There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping. We are interested in the main content of the web page, so we look primarily at the HTML. Hence, you need to know some HTML structure to make your scraping work easier. But don't worry, you don't need to dive deeply into it.

### HTML Structure

HyperText Markup Language (HTML) is the language that web pages are created in. HTML isn’t a programming language, like Python, though. It’s a markup language that tells a browser how to display content. 

HTML has many functions that are similar to what you might find in a word processor like Microsoft Word — it can make text bold, create paragraphs, and so on.

Below is an example of HTML structure:

```html
<HTML>
    <HEAD>
        <TITLE>My cool title</TITLE>
    </HEAD>
    <BODY>
        <H1>This is a Header</H1>
        <ul id="list" class="coolList">
            <li>item 1</li>
            <li>item 2</li>
            <li>item 3</li>
        </ul>
    </BODY>
</HTML>
```

- The red items are called tags or elements (for example `ul` and `li`). A tag name follows "<".
- HTML, HEAD, and BODY are the main elements and the rest are the content. We will focus on the content.
- The orange items are attributes (here `id` and `class`), which give information about the tag.
- The blue text is the attribute value (here `"list"` and `"coolList"`).
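To make the terms concrete, here is a minimal sketch that parses the example page with BeautifulSoup and reads back a tag name, its attributes, and its content. The HTML is the example from above, copied in lowercase; the variable names are chosen only for this illustration.

```python
from bs4 import BeautifulSoup

# The example page from above (tag names written in lowercase).
html = """
<html>
    <head>
        <title>My cool title</title>
    </head>
    <body>
        <h1>This is a Header</h1>
        <ul id="list" class="coolList">
            <li>item 1</li>
            <li>item 2</li>
            <li>item 3</li>
        </ul>
    </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

ul = soup.find("ul")
tag_name = ul.name        # the tag (element) name
attributes = ul.attrs     # attribute names mapped to their values
items = [li.get_text() for li in ul.find_all("li")]  # the content of each <li>
page_title = soup.title.get_text()

print(tag_name)
print(attributes)
print(items)
print(page_title)
```

Note that BeautifulSoup stores `class` as a list, because an element may carry several classes at once.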


## Accessing the Web

Now we will access https://www.gramedia.com/categories/buku. Before going further, we need to understand how to access a URL in Python. To do so, we use the requests library.

In [1]:
import requests
page = requests.get("https://www.gramedia.com/categories/buku")
page

<Response [200]>

If the output is `<Response [200]>`, you have successfully accessed the URL. "200" is an HTTP status code. See https://id.wikipedia.org/wiki/Daftar_kode_status_HTTP for further explanation.
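As a small offline illustration of what the status code tells you, the sketch below uses Python's standard `http.HTTPStatus` to turn a code into a readable label. With requests, the integer itself is available as `page.status_code`. The helper names `describe_status` and `is_success` are made up for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '200 OK' for an HTTP status code."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


def is_success(code):
    """Codes from 200 to 299 mean the request succeeded."""
    return 200 <= code < 300


for code in (200, 404, 500):
    outcome = "success" if is_success(code) else "not a success"
    print(describe_status(code), "-", outcome)
```

In a scraper you would typically check `is_success(page.status_code)` before trying to parse the page.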

Now you can check the HTML content of the page in Python. You can also check it in your browser by right-clicking and choosing Inspect element, which makes the web structure easier to understand.

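The response holds the HTML that Python retrieved. We need to parse it with BeautifulSoup to make it readable and easy to scrape. The cell below loads the page through Selenium, which drives a real Chrome browser, and hands the resulting page source to BeautifulSoup. Selenium is a separate package (`pip install selenium`) and needs Chrome on your machine (older Selenium versions also need a chromedriver executable on your PATH).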

In [2]:
from bs4 import BeautifulSoup
from selenium import webdriver
# Start Chrome; chromedriver is located automatically (PATH, or Selenium Manager on 4.6+)
driver = webdriver.Chrome()

url="https://www.gramedia.com/categories/buku"
driver.get(url)
html = driver.page_source

soup = BeautifulSoup(html, "html.parser")
print(soup.prettify()[:700])

<html class="async-hide" lang="id">
 <head>
  <base href="/"/>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="#281e5a" name="theme-color"/>
  <meta content="index, follow" name="robots"/>
  <link href="/assets/favicon.ico" rel="icon" type="image/x-icon"/>
  <link href="manifest.json" rel="manifest"/>
  <link href="/assets/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
  <meta content="z91Vp6ZYo9UoX5D4ur6i4Lrl0l1j3DDoCH08fD3n53g" name="google-site-verification"/>
  <meta content="810657685650228" property="fb:app_id"/>
  <style>
   .async-hide {
        opacity: 0 !important
    }
  </style>
  <script async="" src="//


<img src="https://i.ibb.co/vsz2M33/message-Image-1636690176458.jpg"></img>

Suppose we want to retrieve the books' titles. Let's check where a title sits in the HTML using Inspect element!

Based on Inspect element, a book's title lies in this code:

```html
<div _ngcontent-web-gramedia-c53="" class="list-title">Creepy Case Club 4: Kasus Pohon Pemanggil</div>
```

"Creepy Case Club 4: Kasus Pohon Pemanggil" is located in a **div** tag with the attribute **class** and the value "*list-title*". We will use this information to tell the soup where the titles are.

So we need to find all div elements whose class attribute has the value "list-title".

To do that, we use ```soup.find_all("<element>",{"<attribute>":"<attribute value>"})```

In [3]:
soup.find_all('div',{"class":"list-title"})

[]

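`find_all` returns a list of every div element whose class attribute has the value "list-title". An empty list, like the output above, means that nothing matched when the cell was run, for example because the page had not finished rendering or the site's markup had changed. When there are matches we need the title text only; to extract it, call the .get_text() method on each element of the list.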

In [4]:
for div_tag in soup.find_all('div',{"class":"list-title"}):
    print(div_tag.get_text())

It is easy, isn't it?
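To see the same pattern produce output without depending on the live site, here is a minimal sketch that runs it on a hand-written fragment. The fragment is an assumption modelled on the markup shown above (both titles appear in this lesson); it is not fetched from Gramedia.

```python
from bs4 import BeautifulSoup

# Hand-written fragment modelled on the markup above; not fetched from the site.
html = """
<div class="list-title">Creepy Case Club 4: Kasus Pohon Pemanggil</div>
<div class="list-title">The Book Of Membangun Relasi</div>
<div class="other">not a title</div>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# Same call shape as in the lesson: element name, then {attribute: value}.
titles = [div.get_text() for div in soup_demo.find_all("div", {"class": "list-title"})]

# A class that does not occur in the fragment gives an empty result.
no_match = soup_demo.find_all("div", {"class": "no-such-class"})

print(titles)
print(len(no_match))
```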

Next, we will do more. Our task is to get each book's title, author, price, link to the book's page, and link to its image.

Based on Inspect element, we know that this information is located in:
- Title: ```<div _ngcontent-web-gramedia-c53="" class="list-title">Creepy Case Club 4: Kasus Pohon Pemanggil</div>```
- Author: ```<p class="div-author"><span _ngcontent-web-gramedia-c53="" class="list-author ng-star-inserted"> Arvidan None </span>```
- Price: ```<p _ngcontent-web-gramedia-c53="" class="formats-price">Rp 79.000</p>```
- Link: ```<div class="ng-star-inserted"><a _ngcontent-web-gramedia-c53="" href="/products/think-and-grow-rich-cara-para-jutawan-dan-miliarder-meraih-kekayaan">```
- Image: ```<img _ngcontent-web-gramedia-c26="" class="product-list-img ng-star-inserted ng-lazyloaded" src="https://cdn.gramedia.com/uploads/items/9786230405990_Think_and_Grow_Rich__w149_hauto.jpeg" alt="Think And Grow Rich : Cara Para Jutawan Dan Miliarder Meraih Kekayaan">```
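The lookups for these five fields can be tried offline first. The sketch below is only an illustration under assumptions: the fragment is assembled by hand from the snippets above (so it mixes details of two different books), the closing tags are added by hand, and `BASE_URL` is a name introduced here. Text comes from `.get_text()`, while the link and image come from the `href` and `src` attributes.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.gramedia.com"  # assumed base for the relative links

# Hand-assembled fragment based on the snippets above; not fetched from the site.
html = """
<div class="ng-star-inserted">
  <a href="/products/think-and-grow-rich-cara-para-jutawan-dan-miliarder-meraih-kekayaan"></a>
  <img class="product-list-img ng-star-inserted ng-lazyloaded"
       src="https://cdn.gramedia.com/uploads/items/9786230405990_Think_and_Grow_Rich__w149_hauto.jpeg">
  <div class="list-title">Creepy Case Club 4: Kasus Pohon Pemanggil</div>
  <p class="div-author"><span class="list-author ng-star-inserted"> Arvidan None </span></p>
  <p class="formats-price">Rp 79.000</p>
</div>
"""

card = BeautifulSoup(html, "html.parser")

# strip=True removes the spaces around the text, e.g. " Arvidan None ".
title = card.find("div", {"class": "list-title"}).get_text(strip=True)
author = card.find("p", {"class": "div-author"}).get_text(strip=True)
price = card.find("p", {"class": "formats-price"}).get_text(strip=True)

# Attribute values are read with square brackets, like a dictionary.
link = urljoin(BASE_URL, card.find("a")["href"])
image = card.find("img", {"class": "product-list-img"})["src"]

print(title, author, price, link, image, sep="\n")
```

Searching for `"product-list-img"` still matches the image even though the tag carries three classes, because BeautifulSoup matches against each class separately.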

Let's wrap up the code and put the data into a Pandas DataFrame.

In [5]:
import pandas as pd

data = pd.DataFrame()

data['Title'] = [ title.get_text() for title in soup.find_all( 'div', {"class":"list-title"} ) ]
data['Author'] = [ author.get_text() for author in soup.find_all( 'p', {"class":"div-author"} ) ]
data['Price'] = [ price.get_text() for price in soup.find_all( 'p', {"class":"formats-price"} ) ]
data['Image'] = [ img['src'] for img in soup.find_all( 'img',{"class":"product-list-img"} ) ]

links = []
for tag in soup.find_all( 'div',{"_ngcontent-web-gramedia-c26":"","class":"ng-star-inserted"} ):
    try:
        links.append("https://www.gramedia.com"+tag.find_all('a',{"_ngcontent-web-gramedia-c26":""})[0]['href'])
    except (IndexError, KeyError):
        # Skip divs that do not contain a matching link
        pass

data['Link'] = links

data

| | Title | Author | Price | Image | Link |
|---|---|---|---|---|---|


## Multipage

So far we have worked on a single page. However, the category spans more pages, as shown below:

<img src="https://i.ibb.co/CQ6JQLv/message-Image-1636716930335.jpg"></img>

If we look at the next page, such as page 2, the URL changes to https://www.gramedia.com/categories/buku?page=2, and for page 3 to https://www.gramedia.com/categories/buku?page=3. Each page is numbered in the URL, so we can visit many pages automatically with a loop. Note that the code still collects the image links, but the image loader depends heavily on your connection, so some rows may contain a default placeholder path rather than the real cover. Let's check the code below.
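The numbering pattern can be generated on its own before any page is loaded. This is a minimal sketch; `BASE_URL` and `page_urls` are names introduced only for this example.

```python
BASE_URL = "https://www.gramedia.com/categories/buku"


def page_urls(first_page, last_page):
    """Build the listing URL for every page from first_page to last_page, inclusive."""
    return [f"{BASE_URL}?page={n}" for n in range(first_page, last_page + 1)]


for url in page_urls(1, 3):
    print(url)
```

Printing the URLs first is a cheap way to confirm the pattern before spending time loading each page in the browser.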

In [6]:
titles = []
authors = []
prices = []
images = []
links = []

# Start Chrome; chromedriver is located automatically (PATH, or Selenium Manager on 4.6+)
driver = webdriver.Chrome()

for i in range(1,21):
    url="https://www.gramedia.com/categories/buku?page={}".format(i)
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")

    # Loop variables get their own names so they do not shadow the result lists
    titles += [ div.get_text() for div in soup.find_all( 'div', {"class":"list-title"} ) ]
    authors += [ p.get_text() for p in soup.find_all( 'p', {"class":"div-author"} ) ]
    prices += [ p.get_text() for p in soup.find_all( 'p', {"class":"formats-price"} ) ]
    images += [ img['src'] for img in soup.find_all( 'img',{"class":"product-list-img"} ) ]

    for tag in soup.find_all( 'div',{"_ngcontent-web-gramedia-c26":"","class":"ng-star-inserted"} ):
        try:
            links.append("https://www.gramedia.com"+tag.find_all('a',{"_ngcontent-web-gramedia-c26":""})[0]['href'])
        except (IndexError, KeyError):
            # Skip divs that do not contain a matching link
            pass

driver.quit()  # close the browser once all pages are collected

data_multipage = pd.DataFrame()
data_multipage['Title'] = titles
data_multipage['Author'] = authors
data_multipage['Price'] = prices
data_multipage['Image'] = images
data_multipage['Link'] = links

data_multipage

| | Title | Author | Price | Image | Link |
|---|---|---|---|---|---|
| 0 | The Book Of Menaklukkan Audiens | Nufi Wibisana | Rp 49.500 | https://cdn.gramedia.com/uploads/items/The_Boo... | https://www.gramedia.com/products/the-book-of-... |
| 1 | The Book Of Membangun Relasi | Ridho Aldhily | Rp 44.500 | https://cdn.gramedia.com/uploads/items/The_Boo... | https://www.gramedia.com/products/the-book-of-... |
| 2 | The Book Of Memotivasi Jiwa | Kinanti Linda Rahayu | Rp 44.500 | https://cdn.gramedia.com/uploads/items/The_Boo... | https://www.gramedia.com/products/the-book-of-... |
| 3 | Diet Sodium: Diet Sehat Tanpa Garam | NINGGAR D DIASTITI, A.MD.GZ. | Rp 30.500 | /assets/default-images/product.png | https://www.gramedia.com/products/diet-sodium-... |
| 4 | The Book Of Melepaskan Emosi & Depresi | M. Heri Susilo | Rp 58.500 | https://cdn.gramedia.com/uploads/items/The_Boo... | https://www.gramedia.com/products/the-book-of-... |
| 5 | Hello, Korean! | Borassaem | Rp 119.000 | /assets/default-images/product.png | https://www.gramedia.com/products/hello-korean |
| 6 | The Golden Story Of Zulkarnain | Rizem Aizid | Rp 50.000 | https://cdn.gramedia.com/uploads/items/The_Gol... | https://www.gramedia.com/products/the-golden-s... |
| 7 | Jagoan Trading Crypto | Diar Puji Oktavian | Rp 100.000 | https://cdn.gramedia.com/uploads/items/cover_d... | https://www.gramedia.com/products/jagoan-tradi... |
| 8 | From Zero To Master English Speaking | Zae Arsy | Rp 68.000 | https://cdn.gramedia.com/uploads/items/From_Ze... | https://www.gramedia.com/products/from-zero-to... |
| 9 | Jagat Batin Syekh Siti Jenar | Imron Mustofa | Rp 60.000 | https://cdn.gramedia.com/uploads/items/Jagat_B... | https://www.gramedia.com/products/jagat-batin-... |


## Accessing Individual Page

<img src="https://i.ibb.co/F8D5bCy/message-Image-1637134633305.jpg"></img>

Suppose we want more detailed information about the books, but that information is on each book's individual page. We will therefore open each individual page and scrape it for the title, author, price, description, number of pages, date of issue, and publisher.
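The per-book extraction step can be sketched offline before any browser is involved. In the sketch below, the class names are the ones this lesson's scraping code looks up; everything else is an assumption: the fragment is hand-written, the `PLACEHOLDER` values are invented, and `parse_book_page` is a hypothetical helper. Returning one dictionary per book, or `None` when an element is missing, keeps all columns the same length.

```python
from bs4 import BeautifulSoup


def parse_book_page(html):
    """Return one dict for a book page, or None if an expected element is missing."""
    soup = BeautifulSoup(html, "html.parser")
    try:
        details = soup.find("div", {"class": "detail-section"}).find_all("p")
        return {
            "Title": soup.find("div", {"class": "book-title"}).get_text(strip=True),
            "Author": soup.find("span", {"class": "title-author"}).get_text(strip=True),
            "Price": soup.find("div", {"class": "price-product"}).get_text(strip=True),
            "Desc": soup.find("pre").get_text(strip=True),
            "Num Pages": details[0].get_text(strip=True),
            "Publisher": details[1].get_text(strip=True),
            "Date Issue": details[2].get_text(strip=True),
        }
    except (AttributeError, IndexError):
        # find() returned None, or there were fewer <p> tags than expected
        return None


# Hand-written fragment with invented placeholder values; not real site data.
sample = """
<div class="book-title">The Book Of Menaklukkan Audiens</div>
<span class="title-author">Nufi Wibisana</span>
<div class="price-product">Rp 49.500</div>
<pre>PLACEHOLDER description</pre>
<div class="detail-section">
  <p>PLACEHOLDER pages</p>
  <p>PLACEHOLDER publisher</p>
  <p>PLACEHOLDER date</p>
</div>
"""
incomplete = '<div class="book-title">Only a title</div>'

parsed = [parse_book_page(sample), parse_book_page(incomplete)]
rows = [row for row in parsed if row is not None]
print(rows)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)` to build the table.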

In [8]:
from time import sleep
from random import randint

title = []
author = []
price = []
desc = []
num_pages = []
date_issue = []
publisher = []

# Start Chrome; chromedriver is located automatically (PATH, or Selenium Manager on 4.6+)
driver = webdriver.Chrome()

for i in range(1,2):
    url="https://www.gramedia.com/categories/buku?page={}".format(i)
    driver.get(url)
    sleep(randint(5,7))  # give the page time to load and space out requests
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")

    for tag in soup.find_all( 'div',{"_ngcontent-web-gramedia-c26":"","class":"ng-star-inserted"} ):

        try:
            link = "https://www.gramedia.com"+tag.find('a',{"_ngcontent-web-gramedia-c26":""})['href']
            driver.get(link)
            sleep(randint(5,7))

            html_ind = driver.page_source
            soup_ind = BeautifulSoup(html_ind, "html.parser")
            details = soup_ind.find('div',{'class':'detail-section'}).find_all('p')

            # Extract every field before appending, so that a missing element
            # cannot leave the lists with different lengths
            values = (
                soup_ind.find( 'div', {"class":"book-title"} ).get_text(),
                soup_ind.find('span',{"class":"title-author"}).get_text(),
                soup_ind.find('div', {'class':'price-product'}).get_text(),
                soup_ind.find('pre').get_text(),
                details[0].get_text(),  # number of pages
                details[2].get_text(),  # date of issue
                details[1].get_text(),  # publisher
            )

        except Exception:
            # Skip divs without a link and pages with an unexpected layout
            continue

        for column, value in zip((title, author, price, desc, num_pages, date_issue, publisher), values):
            column.append(value)

driver.quit()  # close the browser once all pages are collected

pages = pd.DataFrame()
pages['Title'] = title
pages['Author'] = author
pages['Price'] = price
pages['Desc'] = desc
pages['Num Pages'] = num_pages
pages['Date Issue'] = date_issue
pages['Publisher'] = publisher

pages

| | Title | Author | Price | Desc | Num Pages | Date Issue | Publisher |
|---|---|---|---|---|---|---|---|
