In [1]:
from IPython.core.display import display, HTML

Since it's just a string, we can use **string operations** and **regex** to extract meaningful information from it. But even though Python has great support for working with strings, it would still be a lot of work.
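To get a feel for the regex route, here is a rough sketch. The `html` string and the pattern are made up for illustration; the pattern works for this one snippet, but it would need rewriting as soon as the markup changes (extra attributes, nested tags, different casing).

```python
import re

# Hypothetical snippet, used only for illustration
html = "<html><head><title>\n  My page \n</title></head><body><p>Hello</p></body></html>"

# Non-greedy match between the opening and closing title tags;
# re.DOTALL lets '.' match the newlines inside the tag
match = re.search(r"<title>(.*?)</title>", html, flags=re.DOTALL)
title = match.group(1).strip() if match else None
print(title)
```

Every new piece of information (links, list items, attributes) would need its own hand-written pattern like this one.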

If you look closely, you might have already noticed that the above HTML document has a particular structure. The most important element in an HTML document is the **tag**, which may contain other tags or strings. For example:
- The complete document is built using tags, such as `<html>`, `<head>`, `<body>`, `<div>`, `<p>`, `<a>`, etc.
- Most tags have a complementary closing tag, such as `</html>`, `</head>`, `</body>`, `</div>`, `</p>`, `</a>`, etc. (a few, like `<img>` and `<br>`, have none).
- Tags can also have attributes.

![](https://i.imgur.com/wf3Ahyg.png)

Don't worry if you are looking at HTML for the first time. It might look weird at first, but with time it will grow on you.

This structure allowed developers to write very efficient HTML parsers. These parsers make it super easy to extract information from the HTML document. Parsers also provide ways of navigating, searching, and modifying the parse tree.

[Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) is a Python library for extracting data out of HTML and XML files. There are many such libraries like [requests-html](https://requests.readthedocs.io/projects/requests-html/en/latest/index.html), [lxml](https://lxml.de/), [gazpacho](https://gazpacho.xyz/), etc. for parsing HTML docs.

In [2]:
# import
from bs4 import BeautifulSoup as bs

We start by creating a `soup` object which will help us to extract details. The input can be a *string*, or a *file object*.
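As a small sketch of both input types: the snippet below builds one soup from a string and one from a file-like object. `io.StringIO` is only a stand-in for a real file opened with `open(...)`, and the `markup` string is made up for illustration.

```python
import io
from bs4 import BeautifulSoup

markup = "<p>Hello, <b>soup</b></p>"

# From a string
soup_from_str = BeautifulSoup(markup, "html.parser")

# From a file-like object; io.StringIO stands in for an opened HTML file
with io.StringIO(markup) as fh:
    soup_from_file = BeautifulSoup(fh, "html.parser")

print(soup_from_str.b.text, soup_from_file.b.text)
```

Both objects hold the same parse tree, so everything that follows works the same way for either input.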

In [3]:
ai_html = """
<html>
  <head>
   <title>
     Web Scraping 101 - by aiadventures
   </title>
  </head>
  <body>
    <div id="course">
      <h3> Courses at 
        <a href="www.aiadventures.in">aiadventures</a>
      </h3>
      <ul>
        <li>Python</li>
        <li>Data Science</li>
        <li>Machine Learning</li>
        <li>Deep Learning</li>
        <li>Computer Vision</li>
      </ul>
    </div>
    <div class="follow_us">
      <h3> Follow Us </h3>
      <ul>
        <li><a href="https://www.instagram.com/aiadventures.pune">Instagram</a></li>
        <li><a href="https://www.linkedin.com/company/aiadventures">LinkedIn</a></li>
        <li><a href="https://medium.com/aiadventures">Medium</a></li>
        <li><a href="https://www.youtube.com/channel/UCPZqWUIXZAs926TBRclhUGw">Youtube</a></li>
      </ul>
    </div>
  </body>
</html>
"""

In [4]:
soup = bs(ai_html, 'html.parser')  # name the parser explicitly to avoid a warning
soup

<html>
<head>
<title>
     Web Scraping 101 - by aiadventures
   </title>
</head>
<body>
<div id="course">
<h3> Courses at 
        <a href="www.aiadventures.in">aiadventures</a>
</h3>
<ul>
<li>Python</li>
<li>Data Science</li>
<li>Machine Learning</li>
<li>Deep Learning</li>
<li>Computer Vision</li>
</ul>
</div>
<div class="follow_us">
<h3> Follow Us </h3>
<ul>
<li><a href="https://www.instagram.com/aiadventures.pune">Instagram</a></li>
<li><a href="https://www.linkedin.com/company/aiadventures">LinkedIn</a></li>
<li><a href="https://medium.com/aiadventures">Medium</a></li>
<li><a href="https://www.youtube.com/channel/UCPZqWUIXZAs926TBRclhUGw">Youtube</a></li>
</ul>
</div>
</body>
</html>

The output looks almost the same as the input. But under the hood, the complete string has been parsed and organised in the form of a tree, for easy access. For example,

In [5]:
soup.title

<title>
     Web Scraping 101 - by aiadventures
   </title>

## Selecting Tags
To extract information, we first have to learn how to select/search HTML tags. Only after selecting the HTML tags can we extract meaningful information from them. So, let's get started.

### Searching by Tag names
The easiest way to search for a tag (in BeautifulSoup) is to **search by its name**. For example, you can select the `title` tag with the `find` method.

In [6]:
soup.find('title')

<title>
     Web Scraping 101 - by aiadventures
   </title>

**Note:** `find()` returns only the first tag/element. You can use `find_all()` to get a list of all the tags/elements.  

In [7]:
soup.find_all('div')

[<div id="course">
 <h3> Courses at 
         <a href="www.aiadventures.in">aiadventures</a>
 </h3>
 <ul>
 <li>Python</li>
 <li>Data Science</li>
 <li>Machine Learning</li>
 <li>Deep Learning</li>
 <li>Computer Vision</li>
 </ul>
 </div>,
 <div class="follow_us">
 <h3> Follow Us </h3>
 <ul>
 <li><a href="https://www.instagram.com/aiadventures.pune">Instagram</a></li>
 <li><a href="https://www.linkedin.com/company/aiadventures">LinkedIn</a></li>
 <li><a href="https://medium.com/aiadventures">Medium</a></li>
 <li><a href="https://www.youtube.com/channel/UCPZqWUIXZAs926TBRclhUGw">Youtube</a></li>
 </ul>
 </div>]

**Note:** `find()` returns a *Tag* object and `find_all()` returns a *ResultSet* object, which is very similar to a *Python list*. So, it's important to keep checking the data type, because it tells you which operations are allowed.
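A minimal sketch of those return types, on a made-up two-paragraph snippet. It also shows what comes back when nothing matches, which is worth knowing before chaining calls.

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet, Tag

html = "<div><p>one</p><p>two</p></div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")             # a single Tag
all_p = soup.find_all("p")           # a ResultSet (a list subclass)
missing = soup.find("table")         # None, because nothing matched
none_found = soup.find_all("table")  # an empty ResultSet

print(type(first_p).__name__, type(all_p).__name__, missing, len(none_found))
```

Because `find()` can return `None`, an expression such as `soup.find('table').text` fails with an `AttributeError` when the tag is absent.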

You can also select multiple tags by passing a list of tags. 

In [8]:
## Select both h3 and ul tags
soup.find_all(['h3', 'ul'])

[<h3> Courses at 
         <a href="www.aiadventures.in">aiadventures</a>
 </h3>,
 <ul>
 <li>Python</li>
 <li>Data Science</li>
 <li>Machine Learning</li>
 <li>Deep Learning</li>
 <li>Computer Vision</li>
 </ul>,
 <h3> Follow Us </h3>,
 <ul>
 <li><a href="https://www.instagram.com/aiadventures.pune">Instagram</a></li>
 <li><a href="https://www.linkedin.com/company/aiadventures">LinkedIn</a></li>
 <li><a href="https://medium.com/aiadventures">Medium</a></li>
 <li><a href="https://www.youtube.com/channel/UCPZqWUIXZAs926TBRclhUGw">Youtube</a></li>
 </ul>]

### Searching by Tag attributes

Sometimes, we want to select tags based on an attribute and its value. It's pretty simple: just pass the attribute (with the value) as an argument.

For example, to select *div* tag with `id = course`, we can write . . . 

In [9]:
soup.find('div', id='course')

<div id="course">
<h3> Courses at 
        <a href="www.aiadventures.in">aiadventures</a>
</h3>
<ul>
<li>Python</li>
<li>Data Science</li>
<li>Machine Learning</li>
<li>Deep Learning</li>
<li>Computer Vision</li>
</ul>
</div>

and to select *div* tag with `class = follow_us`, we can write . . .

In [10]:
soup.find('div', class_='follow_us')

<div class="follow_us">
<h3> Follow Us </h3>
<ul>
<li><a href="https://www.instagram.com/aiadventures.pune">Instagram</a></li>
<li><a href="https://www.linkedin.com/company/aiadventures">LinkedIn</a></li>
<li><a href="https://medium.com/aiadventures">Medium</a></li>
<li><a href="https://www.youtube.com/channel/UCPZqWUIXZAs926TBRclhUGw">Youtube</a></li>
</ul>
</div>

**Note:** `class` is a reserved keyword in Python, hence BeautifulSoup uses `class_` (extra '_' at the end).

You can also select a tag by checking if an attribute is present or not.

In [11]:
## Selects the first 'div' that has an 'id' attribute
soup.find('div', id=True)

<div id="course">
<h3> Courses at 
        <a href="www.aiadventures.in">aiadventures</a>
</h3>
<ul>
<li>Python</li>
<li>Data Science</li>
<li>Machine Learning</li>
<li>Deep Learning</li>
<li>Computer Vision</li>
</ul>
</div>

#### Regular Expressions

Everywhere a string is accepted, you can **use regular expressions** instead to select tags and their attribute values.

In [12]:
import re
soup.find(re.compile('di'), class_= re.compile('follow_us'))

<div class="follow_us">
<h3> Follow Us </h3>
<ul>
<li><a href="https://www.instagram.com/aiadventures.pune">Instagram</a></li>
<li><a href="https://www.linkedin.com/company/aiadventures">LinkedIn</a></li>
<li><a href="https://medium.com/aiadventures">Medium</a></li>
<li><a href="https://www.youtube.com/channel/UCPZqWUIXZAs926TBRclhUGw">Youtube</a></li>
</ul>
</div>

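The above code will select the first tag whose name matches the regular expression `di`, and where the value of the class attribute matches the regular expression `follow_us`. BeautifulSoup applies the pattern with `search()`, so `di` matches any tag name that contains "di", not only `div`.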
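The search behaviour is easy to trip over, so here is a small sketch on a made-up snippet. It compares the loose pattern `di` with an anchored pattern that only accepts the exact name `div`.

```python
import re
from bs4 import BeautifulSoup

html = "<body><div>a</div><dialog>b</dialog><audio>c</audio></body>"
soup = BeautifulSoup(html, "html.parser")

# 'di' is searched for anywhere in the tag name
loose = [tag.name for tag in soup.find_all(re.compile("di"))]

# Anchoring with ^ and $ restricts the match to the exact name 'div'
strict = [tag.name for tag in soup.find_all(re.compile("^div$"))]

print(loose)
print(strict)
```

When you only want one specific tag, a plain string such as `'div'` is usually the simpler choice.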

### Accessing information

So far, we have learnt how to select HTML tags. This is important because, once you have selected the elements, you can access all the information present inside them.

Every tag has 3 major components:
- Tag name
- Text between the open & close tags, called **Inner text**.
- Tag attributes and their values

Let's see how we can extract all this information from our `title_tag`.

In [13]:
title_tag = soup.find('title')
title_tag

<title>
     Web Scraping 101 - by aiadventures
   </title>

#### Tag name

To access the tag name, you can simply run `tag_element.name`.

In [14]:
title_tag.name

'title'

#### Inner text

To access the inner text, you can simply run `tag_element.text`.

In [15]:
title_tag.text

'\n     Web Scraping 101 - by aiadventures\n   '

#### Attribute values

You can also extract the attribute values as follows:

In [16]:
a_tag = soup.find('a')
a_tag

<a href="www.aiadventures.in">aiadventures</a>

Once you have the tag, just think of it as a dictionary. You can easily access any attribute by passing it as a key.

In [17]:
a_tag['href']

'www.aiadventures.in'
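The dictionary analogy goes a little further. As a small sketch on the same `a` tag (rebuilt here so the snippet runs on its own), `.attrs` exposes all attributes at once and `.get()` avoids a `KeyError` when an attribute is missing.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="www.aiadventures.in">aiadventures</a>', "html.parser")
a_tag = soup.find("a")

print(a_tag.attrs)         # all attributes as a dict
print(a_tag.get("href"))   # same value as a_tag['href']
print(a_tag.get("title"))  # missing attribute gives None instead of an error
```

`a_tag['title']` on this tag would raise a `KeyError`, just like a missing key in a regular dictionary.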

Once you know how to extract attribute values, you can easily extract all the links by running the following code:

In [18]:
[a_tag['href'] for a_tag in soup.find_all('a')]

['www.aiadventures.in',
 'https://www.instagram.com/aiadventures.pune',
 'https://www.linkedin.com/company/aiadventures',
 'https://medium.com/aiadventures',
 'https://www.youtube.com/channel/UCPZqWUIXZAs926TBRclhUGw']

Take a minute to think and see if the above code makes sense.

### Further Reading

So far, we have just scratched the surface. But I think this is good enough to get you started with `BeautifulSoup` & to scrape most of the static pages on the web. `BeautifulSoup` has much more to offer, like 
- Searching the tree, [read more](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree)
- CSS Selector, [read more](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors)
- Navigating DOM tree, [read more](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree)
- Manipulating Elements, [read more](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#modifying-the-tree)
- and much more . . .
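As a taste of the CSS selector support listed above, here is a minimal sketch using `select()` and `select_one()`. The HTML is a trimmed, made-up version of the "follow us" block from earlier.

```python
from bs4 import BeautifulSoup

html = """
<div class="follow_us">
  <ul>
    <li><a href="https://medium.com/aiadventures">Medium</a></li>
    <li><a href="https://www.instagram.com/aiadventures.pune">Instagram</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# 'div.follow_us a' = every <a> nested anywhere inside <div class="follow_us">
links = [a["href"] for a in soup.select("div.follow_us a")]

# select_one() returns only the first match (or None)
first = soup.select_one("div.follow_us a").text

print(links, first)
```

`select()` plays the role of `find_all()` and `select_one()` the role of `find()`, only with CSS syntax instead of keyword arguments.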

I highly recommend taking some time to read more about `BeautifulSoup`. Here are some good YouTube videos on Web Scraping:
- https://www.youtube.com/watch?v=RUQWPJ1T6Zc
- https://www.youtube.com/watch?v=RKsLLG-bzEY

Note: Both the videos are approximately 3hrs long but are worth it. Also, they include projects & some more project ideas for you to practice later.

We will now shift gears and will try to scrape a real web page.

# Scraping a Real web page

**Web Scraping is not always the solution!**

Some websites provide data that can be downloaded in CSV format, or accessed via Application Programming Interfaces (APIs). Use web scraping only when neither of these options is available.

**Legalities**

Generally, if you are going to use the scraped data for personal or educational purposes, there may not be any problem. But if you are going to use it for commercial purposes, I highly recommend doing some background research on the website's scraping policies, as well as on the data you are going to scrape.

To understand a website's scraping policies, you can start with its **robots.txt** file, which lists the paths that crawlers are asked not to visit. For any website, simply append */robots.txt* to the website address. For example, www.google.com/robots.txt
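Python's standard library can read these rules for you. The sketch below parses a made-up `robots.txt` offline, so the rules and the `example.com` URLs are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for illustration
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(user_agent, url) tells whether the rules allow the URL
print(parser.can_fetch("*", "https://example.com/catalogue/page-1.html"))
print(parser.can_fetch("*", "https://example.com/private/data.html"))
```

For a live site you would call `parser.set_url(...)` with the address of its `robots.txt` and then `parser.read()` instead of `parse()`.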

**Finally,**

Once you have made sure that you are not breaking any law/policy, you should spend some time **analysing the web page**, doing things like:
- View page source
- Inspect DOM elements
- Is the page static/dynamic?
- Is it using AJAX calls?

Once you know the answers to all the above questions, you are good to start web scraping.

In this notebook, we will scrape all the books from http://books.toscrape.com/ website. So, try looking at page source, & also inspect the book element.

### Requests library

`requests` is a Python library used to download the contents of a web page. With the help of `requests`, we can get the raw HTML of web pages, which can then be parsed using `BeautifulSoup`. 

Remember, BeautifulSoup is a parsing library, it cannot fetch a web page by itself. 

In [19]:
import requests

In [20]:
url = 'http://books.toscrape.com/'
response = requests.get(url)
response

<Response [200]>

You can also check the response status as follows:

In [21]:
response.status_code

200

*200* means the request was successfully served. These are called **HTTP response status codes**. You can read more about them, [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).
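To get a feel for a few common codes without making any request, the standard library's `http.HTTPStatus` can translate them into their standard phrases. The `is_success` helper is a hypothetical name introduced here for illustration.

```python
from http import HTTPStatus

# Print the standard phrase for a few common status codes
for code in (200, 301, 404, 500):
    print(code, HTTPStatus(code).phrase)


def is_success(code):
    """True for any 2xx status code."""
    return 200 <= code < 300


print(is_success(200), is_success(404))
```

With `requests` itself, `response.ok` gives a quick boolean check, and `response.raise_for_status()` raises an exception for 4xx and 5xx responses.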

Now that the request to http://books.toscrape.com/ has been successfully served, we can get the full HTML text from `response.text`. Let's look at the first 1000 characters.

In [22]:
print(response.text[:1000])

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->
    <head>
        <title>
    All products | Books to Scrape - Sandbox
</title>

        <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
        <meta name="created" content="24th Jun 2016 09:29" />
        <meta name="description" content="" />
        <meta name="viewport" content="width=device-width" />
        <meta name="robots" content="NOARCHIVE,NOCACHE" />

        <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
        <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->

        
            <link rel="shortcut icon" href="static/oscar/favicon.

**`response.text` vs `response.content`**

- `text` is the content of the response in **Unicode**, and `content` is the content of the response in **bytes**.

- `text` would be preferred for textual responses, such as an HTML or XML document, and `content` would be preferred for "binary" filetypes, such as an image or PDF file.
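How the bytes are decoded matters. `requests` picks the codec for `response.text` from the response headers, and when that guess is off, non-ASCII characters such as `£` come out garbled. The sketch below reproduces the effect without any network access; the byte string is made up for illustration.

```python
raw = "£51.77".encode("utf-8")     # bytes, as a server would send them

wrong = raw.decode("iso-8859-1")   # decoded with the wrong codec
right = raw.decode("utf-8")        # decoded with the right codec

print(repr(wrong))
print(repr(right))
```

If you see this on a real page, two common remedies are setting `response.encoding = 'utf-8'` before reading `response.text`, or handing `response.content` to `BeautifulSoup` and letting it detect the encoding.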


In [23]:
type(response.text)

str

Since `response.text` is simply a Python string, we can directly pass it to `BeautifulSoup` to get the *soup* object.

In [24]:
soup = bs(response.text, 'html.parser')
type(soup)

bs4.BeautifulSoup

Let's have a look at the site title.

In [25]:
soup.find('title').text

'\n    All products | Books to Scrape - Sandbox\n'

Not very pretty to look at. We can use the `strip()` string method to remove the surrounding whitespace.

In [26]:
soup.find('title').text.strip()

'All products | Books to Scrape - Sandbox'

It's time to inspect the HTML tags and identify the book tag, so that we can extract information about the books.

![](https://i.imgur.com/LpmDAgg.jpg)

As you can see in the above image, our book is placed inside 'article' tag and the class name is 'product_pod'. We can easily select all the 'article' tags where the class is 'product_pod' using the below code . . .

In [27]:
books_tag = soup.find_all('article', class_='product_pod')

There are 20 books on a page. Let's verify it.

In [28]:
len(books_tag)

20

Great! Our selection is perfect.

Now, let's select a single book and extract all the information we can.

In [29]:
book_tag = books_tag[0]
book_tag

<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">Â£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>

**Book title**

The title is inside an 'a' tag. A book tag contains more than one 'a' tag, though, and we only want the one with a title attribute. So, let's select it.

In [30]:
title_tag = book_tag.find('a', title=True)
title_tag

<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>

Wonderful! We have our title tag. To get the title, we can simply use `title_tag.text`.

In [31]:
title_tag.text

'A Light in the ...'

But as you can see, it's not the complete title. The complete title of the book can be extracted from the title attribute.

In [32]:
title_tag['title']

'A Light in the Attic'

And . . . you have your title. You can easily write everything in just one line of code.

In [33]:
## Title
title = book_tag.find('a', title=True)['title']
title

'A Light in the Attic'

Follow the same process (as above) to extract ratings, price & book_link.

In [34]:
## Rating
rating = book_tag.find('p')['class'][1]
rating

'Three'

In [35]:
## Price
## The tag text is 'Â£51.77' (see the book tag above); [1:] drops the stray 'Â'
price = book_tag.find('p', class_='price_color').text[1:]
price

'£51.77'

In [36]:
## Book link
link = 'http://books.toscrape.com/' + book_tag.find('a')['href']
link

'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
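The rating comes back as a word and the price as a string. If you plan to sort or do arithmetic on them later, a conversion step along these lines may help. This is only a sketch: `RATING_WORDS` is a hypothetical lookup table, and the two input values are copied from the outputs above.

```python
# Values as extracted in the cells above
rating = "Three"
price = "£51.77"

# Hypothetical lookup table for the star-rating class names
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

rating_num = RATING_WORDS.get(rating)  # None for an unexpected word
price_num = float(price.lstrip("£"))   # drop the currency sign, then convert

print(rating_num, price_num)
```

The rest of this notebook keeps the raw strings, so this step is optional.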

Let's put the above code inside a function.

In [37]:
def get_details(book_tag):
    title = book_tag.find('a', title=True)['title']
    rating = book_tag.find('p')['class'][1]
    price = book_tag.find('p', class_='price_color').text[1:]
    link = 'http://books.toscrape.com/' + book_tag.find('a')['href']
    return title, rating, price, link

`get_details` function takes a 'book_tag', extracts all the details from it and returns them.

Let's write some more functions to fetch a page and extract all the books on it.

In [38]:
def get_soup(url):
    """Takes URL and returns a soup object (None if the request fails)"""
    resp = requests.get(url)
    if resp.status_code == 200:
        return bs(resp.text, 'html.parser')
    return None


def get_books(url):
    """Extract details from all the book tags on a page"""
    soup = get_soup(url)
    if soup is None:
        return []
    book_tags = soup.find_all('article', class_='product_pod')

    books = []
    for book_tag in book_tags:
        books.append(get_details(book_tag))

    return books

In [39]:
url = 'http://books.toscrape.com/'
books = get_books(url)
len(books)

20

Let's have a look at books . . .

In [40]:
books[:3]

[('A Light in the Attic',
  'Three',
  '£51.77',
  'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'),
 ('Tipping the Velvet',
  'One',
  '£53.74',
  'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'),
 ('Soumission',
  'One',
  '£50.10',
  'http://books.toscrape.com/catalogue/soumission_998/index.html')]

Not very pretty, right? Don't worry, we will make a pandas dataframe.

It's time we write a function that can extract all 1000 books from the website. 

In [41]:
import pandas as pd

def get_all_books(page = 3):
    books = []
    for i in range(1, page+1):
        ## This is how the url changes with every page
        url = f'http://books.toscrape.com/catalogue/page-{i}.html'
        soup = get_soup(url)
        if soup:    
            book_tags = soup.find_all('article', class_='product_pod')

            for book_tag in book_tags:
                books.append(get_details(book_tag))
            
    books = pd.DataFrame(books, columns=['title', 'rating', 'price', 'link'])
    return books

We will only scrape the first 3 pages to test our code.

In [42]:
df = get_all_books(3)
df.head()

|   | title | rating | price | link |
|---|---|---|---|---|
| 0 | A Light in the Attic | Three | £51.77 | http://books.toscrape.com/a-light-in-the-attic... |
| 1 | Tipping the Velvet | One | £53.74 | http://books.toscrape.com/tipping-the-velvet_9... |
| 2 | Soumission | One | £50.10 | http://books.toscrape.com/soumission_998/index... |
| 3 | Sharp Objects | Four | £47.82 | http://books.toscrape.com/sharp-objects_997/in... |
| 4 | Sapiens: A Brief History of Humankind | Five | £54.23 | http://books.toscrape.com/sapiens-a-brief-hist... |


DataFrame looks much better. We scraped 3 pages so we should have 60 (3 x 20) records. 

In [43]:
df.shape

(60, 4)

Perfection! Exactly what we expected. 
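One detail in the table above deserves a second look: the links no longer contain the `catalogue/` segment they had when we scraped the home page. The `href` values are relative to the page they appear on, so gluing them onto the site root only works for the home page. A small sketch with `urllib.parse.urljoin` shows one way to resolve them; the second `href` is assumed from the truncated links in the table, not copied from the page.

```python
from urllib.parse import urljoin

href_on_home = "catalogue/a-light-in-the-attic_1000/index.html"
href_on_page = "a-light-in-the-attic_1000/index.html"  # assumed value

home_url = "http://books.toscrape.com/"
page_url = "http://books.toscrape.com/catalogue/page-1.html"

# Resolve each href against the page it was found on
link_from_home = urljoin(home_url, href_on_home)
link_from_page = urljoin(page_url, href_on_page)

print(link_from_home)
print(link_from_page)
```

Both calls produce the same absolute URL, which is what you want regardless of which listing page the book was found on.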

Before we scrape all the 1000 books, we will have to take care of a few more things. 

- Whenever you are scraping a website, try to be responsible. A normal user generally makes 2-5 requests (clicks) per minute, but your Python program can make hundreds of requests in the same time. This can use up the server's resources, and sometimes it can even crash the server. So, make sure you sleep for a couple of seconds before you make the next request.

- There are multiple things that can go wrong when scraping a website, like a network error, slow connection, timeout, missing element, changed page layout, etc. So, it's highly recommended to use `try / except` blocks to handle errors effectively.
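The two points above can be combined into a small retry helper. This is a sketch under assumptions, not part of the scraper itself: `fetch_with_retry` and `flaky_fetch` are hypothetical names, and the fetcher is a stand-in that simulates failures so the example runs without a network.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url); on failure wait and retry, doubling the delay."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as err:  # narrow this to the errors your fetcher raises
            print(f"Attempt {attempt} failed for {url}: {err}")
            if attempt < retries:
                time.sleep(delay)
                delay *= 2
    return None


# Stand-in fetcher that fails twice, then succeeds (no network involved)
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network error")
    return "<html>ok</html>"


result = fetch_with_retry(flaky_fetch, "http://books.toscrape.com/", delay=0.01)
print(result)
```

In practice, `fetch` could be any function that downloads a page and raises on failure; the growing delay gives a struggling server some breathing room.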


This is how your final code will look.

In [44]:
import time
from urllib.parse import urljoin

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

BASE_URL = 'http://books.toscrape.com/'
## Errors raised when a tag or attribute is missing
PARSE_ERRORS = (AttributeError, KeyError, IndexError, TypeError)


def get_soup(url):
    """Takes URL and returns a soup object (None if the request fails)"""
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        return None

    if resp.status_code == 200:
        return bs(resp.text, 'html.parser')
    return None


def get_details(book_tag, page_url=BASE_URL):
    """Extracts title, rating, price and link from a single book tag"""
    ## Title
    try:
        title = book_tag.find('a', title=True)['title']
    except PARSE_ERRORS:
        title = None

    ## Rating
    try:
        rating = book_tag.find('p')['class'][1]
    except PARSE_ERRORS:
        rating = None

    ## Price ([1:] drops the stray 'Â' in front of '£')
    try:
        price = book_tag.find('p', class_='price_color').text[1:]
    except PARSE_ERRORS:
        price = None

    ## Link (resolved against the page the book was found on)
    try:
        link = urljoin(page_url, book_tag.find('a')['href'])
    except PARSE_ERRORS:
        link = None

    return title, rating, price, link


def get_all_books(page=3):
    books = []
    for i in range(1, page + 1):
        url = f'http://books.toscrape.com/catalogue/page-{i}.html'
        soup = get_soup(url)
        if soup:
            book_tags = soup.find_all('article', class_='product_pod')
            for book_tag in book_tags:
                books.append(get_details(book_tag, url))
        else:
            print(f'Error reading page {i} . . .')

        time.sleep(1)  # sleep before making the next request

    books = pd.DataFrame(books, columns=['title', 'rating', 'price', 'link'])
    return books

In [None]:
df = get_all_books(50) # 20 books x 50 pages = 1000 books
df.head()

In [None]:
df.shape