<a href="https://colab.research.google.com/github/Rahul-7131/CrowdSource-Workshop/blob/main/web_scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img style="width:600px;" src="https://i.imgur.com/Kj8fCpd.png" alt="aiadventures Logo"/>

<p align="center"><a href="https://www.aiadventures.in">Website </a> | <a href="https://www.instagram.com/aiadventures.pune/">Instagram</a> | <a href="https://www.linkedin.com/company/aiadventures">LinkedIn</a> | <a href="https://www.youtube.com/channel/UCPZqWUIXZAs926TBRclhUGw">YouTube</a></p>

# Web Scraping 101

### What & Why Web Scraping?
The web is the largest source of information available. Web scraping lets us extract, parse, download, and organize useful information from it automatically. Common applications include:

- Machine Learning & Data Science
- Research
- Business Intelligence
- E-commerce Websites
- Marketing and Sales Campaigns
- Search Engine Optimization (SEO)


### Pre-requisites

None are strictly required, but prior knowledge of the following will surely help.
- HTML & CSS
- Basic understanding of HTML/DOM tree
- Python basics


### Installations
`BeautifulSoup` is not part of the Python standard library, so we need to install it first. We will also need the `requests` library to download content from web pages.

In [None]:
!pip install -q beautifulsoup4
!pip install -q requests

### Important links
- [BeautifulSoup docs](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- http://toscrape.com/


---

# Coding Time

In [None]:
ai_html = """
<html>
  <head>
   <title>
     Web Scraping 101 - by aiadventures
   </title>
  </head>
  <body>
    <div id="course">
      <h3> Courses at 
        <a href="www.aiadventures.in">aiadventures</a>
      </h3>
      <ul>
        <li>Python</li>
        <li>Data Science</li>
        <li>Machine Learning</li>
        <li>Deep Learning</li>
        <li>Computer Vision</li>
      </ul>
    </div>
    <div class="follow_us">
      <h3> Follow Us </h3>
      <ul>
        <li><a href="https://www.instagram.com/aiadventures.pune">Instagram</a></li>
        <li><a href="https://www.linkedin.com/company/aiadventures">LinkedIn</a></li>
        <li><a href="https://medium.com/aiadventures">Medium</a></li>
        <li><a href="https://www.youtube.com/channel/UCPZqWUIXZAs926TBRclhUGw">Youtube</a></li>
      </ul>
    </div>
  </body>
</html>
"""

CodePen: https://codepen.io/pen/

Since it's just a string, we can use **string operations** & **regex** to extract meaningful information from it. Even though Python has great support for working with strings, this would still be a lot of work.
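
To see why, here is a minimal sketch of the regex approach. The `snippet` string and both patterns are made up for illustration; they are not part of the workshop page. The first pattern works until a tag gains an attribute, and even the second one would still break on nested or multi-line markup.

```python
import re

# A tiny, hypothetical fragment shaped like the course list above.
snippet = "<ul><li>Python</li><li>Data Science</li><li class='new'>Deep Learning</li></ul>"

# Naive pattern: only matches <li> tags that carry no attributes.
naive_items = re.findall(r"<li>(.*?)</li>", snippet)

# More tolerant pattern: allows attributes inside the opening tag.
tolerant_items = re.findall(r"<li[^>]*>(.*?)</li>", snippet)

print(naive_items)     # misses the item whose <li> has a class attribute
print(tolerant_items)  # finds all three items
```

Every new quirk in the markup needs another tweak to the pattern, which is exactly the work a parser does for us.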

If you know HTML, you already know that all HTML documents share a common structure. This structure has allowed developers to write very efficient HTML parsers.

These parsers make it super easy to extract information from the HTML document. Parsers also provide ways of navigating, searching, and modifying the parse tree.

[Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) is a Python library for extracting data out of HTML and XML files.

There are many other libraries for parsing HTML documents, such as [requests-html](https://requests.readthedocs.io/projects/requests-html/en/latest/index.html), [lxml](https://lxml.de/), [gazpacho](https://gazpacho.xyz/), etc.

In [None]:
# import
from bs4 import BeautifulSoup as bs

We start by creating a `soup` object which will help us to extract details.

The input can be a *string*, or a *file object*.

In [None]:
# Name the parser explicitly; otherwise bs4 picks one for you and shows a warning
soup = bs(ai_html, 'html.parser')
type(soup)

In [None]:
soup.title

The output looks exactly the same. But under the hood, the complete string has been parsed and organised into a tree for easy access.

### HTML Tags
**Tags** are among the most important elements of an HTML document, and they may contain other tags/strings. The easiest way to find a tag is to **search by its name**.

In [None]:
soup.find('title')

In [None]:
type(soup.find('title'))

**Note:** `find()` returns only the first tag/element. You can use `find_all()` to get a list of all the tags/elements.  

In [None]:
soup.find_all('div')

In [None]:
type(soup.find_all('div'))

Remember, `find()` returns a `Tag` object and `find_all()` returns a *ResultSet* object, which behaves much like a *Python list*. So it's very important to keep checking the data type, because it tells you which operations are allowed.
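
Here is a small sketch of that difference. The `html` string is a made-up fragment so the example runs on its own, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, not the workshop page.
html = "<div id='a'><p>one</p></div><div id='b'><p>two</p></div>"
soup = BeautifulSoup(html, "html.parser")

first_div = soup.find("div")      # a single Tag
all_divs = soup.find_all("div")   # a ResultSet, which behaves like a list

print(type(first_div).__name__)
print(type(all_divs).__name__)

# List-style operations work on a ResultSet: len(), indexing, looping.
count = len(all_divs)
second_id = all_divs[1]["id"]
texts = [div.text for div in all_divs]

# find() returns None when nothing matches, so check before using the result.
missing = soup.find("table")
```

Calling `.text` directly on the ResultSet would fail, which is why the loop reads each tag one by one.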

You can also select multiple tags by passing a list of tags.

In [None]:
## Select both h3 and ul tags
soup.find_all(['h3', 'ul'])

### Tag attributes

Sometimes we want to select tags based on their attributes.

In [None]:
soup.find('div', id='course')

In [None]:
soup.find('div', class_='follow_us')

You can also select a tag by checking if an attribute is present or not.

In [None]:
## Selects the first 'div' that has an 'id' attribute (use find_all() to get all of them)
soup.find('div', id=True)

### Regex

Instead of strings, you can **use regular expressions** to match tag names as well as attribute values.

In [None]:
import re
soup.find(re.compile('di'), class_= re.compile('follow_us'))

### Accessing meaningful information

So far, we have learnt how to select HTML tags. This is important because once you have selected an element, you can access all the information inside it.

In [None]:
title_tag = soup.find('title')
title_tag

To access the text inside a tag, you can simply use `tag_element.text`. For example,

In [None]:
title_tag.text.strip()

You can also extract the attribute values as follows:

In [None]:
a_tag = soup.find('a')
a_tag

Once you have the tag, just think of it as a dictionary. You can easily access any attribute by passing it as a key.

In [None]:
a_tag['href']

You can extract all the links by running the following code

In [None]:
[a_tag['href'] for a_tag in soup.find_all('a')]
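
One caveat: `a_tag['href']` raises a `KeyError` when an `<a>` tag has no `href`. The sketch below shows two ways around that. The `html` fragment is made up so the example runs on its own.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the middle <a> tag has no href attribute.
html = """
<a href="https://example.com/a">A</a>
<a name="anchor-only">No link here</a>
<a href="https://example.com/b">B</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Option 1: only match <a> tags that have an href attribute at all.
links = [a["href"] for a in soup.find_all("a", href=True)]

# Option 2: use .get(), which returns None instead of raising KeyError.
maybe_links = [a.get("href") for a in soup.find_all("a")]

print(links)
print(maybe_links)
```

Option 1 is usually the cleaner choice when you only care about real links.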

### Further Reading

So far, we have just scratched the surface. But this is enough to get you started with `BeautifulSoup` and to scrape most static pages on the web. `BeautifulSoup` has much more to offer, such as:
- Searching the tree, [read more](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree)
- CSS Selector, [read more](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors)
- Navigating DOM tree, [read more](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree)
- Manipulating Elements, [read more](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#modifying-the-tree)
- and much more . . .

I highly recommend taking some time to read more about `BeautifulSoup`.
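
As a small taste of CSS selectors, here is a sketch using `select()` and `select_one()`. The `html` fragment is a shortened, made-up version of the sample document, so treat the ids, classes and URL as assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the sample document above.
html = """
<div id="course"><ul><li>Python</li><li>Data Science</li></ul></div>
<div class="follow_us"><ul><li><a href="https://example.com/x">X</a></li></ul></div>
"""
soup = BeautifulSoup(html, "html.parser")

# '#course' selects by id; 'div#course li' means every <li> inside that div.
courses = [li.text for li in soup.select("div#course li")]

# '.follow_us' selects by class; select_one() returns the first match or None.
first_link = soup.select_one("div.follow_us a")

print(courses)
print(first_link["href"])
```

If you already know CSS, selectors are often shorter than chaining `find()` calls.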

# Scraping a Real web page

### Web Scraping is not always the solution!
Some websites offer datasets that are downloadable in CSV format or accessible via an Application Programming Interface (API).

For example, 
- IMDB
- Netflix
- Google, etc.

### Legalities
Generally, if you are going to use the scraped data for personal or educational purposes, there may not be any problem. But if you are going to use it commercially, I highly recommend doing some background research on the website's scraping policies as well as on the data you are going to scrape.

You should start with the **robots.txt** file. To view it, simply append */robots.txt* to the website's root address. For example, www.google.com/robots.txt
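
Python's standard library can read these rules for you. The sketch below feeds `urllib.robotparser` a made-up set of rules so it runs offline; the rules and the `example.com` URLs are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, inlined so the example runs offline.
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(user_agent, url) answers: may this agent request this URL?
allowed = parser.can_fetch("*", "https://example.com/catalogue/page-1.html")
blocked = parser.can_fetch("*", "https://example.com/private/data.html")

print(allowed)
print(blocked)
```

For a live site you would call `parser.set_url(...)` with the robots.txt address followed by `parser.read()` instead of `parse()`. Keep in mind that robots.txt only states the site's crawling preferences; it does not replace reading the terms of use.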

### Finally,
Once you have made sure that you are not breaking any law/policy, spend some time **analyzing the web page**.
- View page source
- Inspect DOM elements
- Is the page static or dynamic?
- Is it using AJAX calls?

In this workshop, we will scrape all the books from the http://books.toscrape.com/ website. Now is a good time to analyze the page.


### `requests` library
`requests` is a Python library for accessing web pages. With it, we can get the raw HTML of a web page, which can then be parsed to retrieve the data.

We will use the `requests` library to download the contents of the web page, because `BeautifulSoup` needs an input document to create a `soup` object and cannot fetch a web page by itself.

In [None]:
import requests

In [None]:
url = 'http://books.toscrape.com/'
response = requests.get(url)
response

You can also check the status as follows:

In [None]:
response.status_code

*200* means the request was served successfully. These numbers are called **HTTP response status codes**. You can read more about them [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).
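
If you want a quick reminder of what a code means without leaving Python, the standard library knows the official names. This is a small sketch; `describe_status` is a hypothetical helper, not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Codes that are not in the standard list end up here.
        return f"{code} (non-standard status code)"

for code in (200, 404, 503):
    print(describe_status(code))
```

In a scraper you often do not need the name at all: calling `response.raise_for_status()` raises an exception for 4xx and 5xx responses, so failures cannot go unnoticed.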

Now that the request to http://books.toscrape.com/ has been served successfully, we can get the full HTML text from `response.text`.

In [None]:
print(response.text)

### `response.text` vs `response.content`

- `text` is the content of the response in **Unicode**, and `content` is the content of the response in **bytes**.

- `text` would be preferred for textual responses, such as an HTML or XML document, and `content` would be preferred for "binary" filetypes, such as an image or PDF file.
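
The difference matters because bytes only become text once an encoding is chosen. The sketch below imitates that step without any network call; the price string is just sample data.

```python
# Bytes as they might arrive over the network (what `response.content` holds).
raw = "£51.77".encode("utf-8")

# Decoding with the right encoding recovers the original text.
good = raw.decode("utf-8")

# Decoding with the wrong encoding silently produces a stray character.
bad = raw.decode("latin-1")

print(type(raw).__name__, type(good).__name__)
print(good)
print(bad)
```

If `response.text` ever shows stray characters like this, you can set `response.encoding = 'utf-8'` before reading `response.text`, or pass `response.content` to `BeautifulSoup` and let it work out the encoding.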


In [None]:
type(response.text)

Since `response.text` is simply a Python string, we can pass it directly to `BeautifulSoup`.

In [None]:
soup = bs(response.text, 'html.parser')
type(soup)

In [None]:
soup.find('title').text

In [None]:
books_tag = soup.find_all('article', class_='product_pod')

In [None]:
book_tag = books_tag[0]
book_tag

In [None]:
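## Title (the full title sits in the 'title' attribute of the <a> inside <h3>)
title = book_tag.find('h3').find('a')['title']
title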

In [None]:
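## Rating (stored as the second CSS class of the star-rating <p>, e.g. 'Three')
rating = book_tag.find('p', class_='star-rating')['class'][1]
rating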

In [None]:
## Price (text of the <p> with class 'price_color')
price = book_tag.find('p', class_='price_color').text
price

In [None]:
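## Book link (relative URL in the 'href' attribute of the <a> inside <h3>)
link = book_tag.find('h3').find('a')['href']
link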

In [None]:
## function get_details
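
One way to fill in this cell is to wrap the four lookups above into a single function. This is only a sketch: the dictionary keys are my choice, and the `sample` tag is hand-written to imitate the markup we assume each `product_pod` article has.

```python
from bs4 import BeautifulSoup

def get_details(book_tag):
    """Extract title, rating, price and link from one product_pod <article> tag."""
    link_tag = book_tag.find("h3").find("a")
    rating_tag = book_tag.find("p", class_="star-rating")
    price_tag = book_tag.find("p", class_="price_color")
    return {
        "title": link_tag["title"],
        # The class list looks like ['star-rating', 'Three']; keep the second item.
        "rating": rating_tag["class"][1],
        "price": price_tag.text.strip(),
        "link": link_tag["href"],
    }

# Offline check on a hand-written tag that imitates the assumed markup.
sample = BeautifulSoup("""
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="catalogue/some-book_1/index.html" title="Some Book">Some ...</a></h3>
  <p class="price_color">£10.00</p>
</article>
""", "html.parser").find("article")

details = get_details(sample)
print(details)
```

Returning a dictionary keeps each book self-describing, which makes it easy to build a table from a list of them later.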

In [None]:
## function get_soup
def get_soup(url):
    """Download a page and return its soup, or None if the status is not 200."""
    resp = requests.get(url)
    if resp.status_code == 200:
        return bs(resp.text, 'html.parser')
    return None

In [None]:
## function get_books

In [None]:
url = 'http://books.toscrape.com/'
books = get_books(url)
len(books)

In [None]:
## function get_all_books. Add 'try/except' + 'pandas' + 'time'

In [None]:
df = get_all_books()
df.head()

In [None]:
df.shape

## Limitations:

- Won't work for **dynamic pages**. But why is it so?
    - Content generated by JavaScript code
    - Content generated upon user action, for example, reaching the bottom of the page.

- Won't work for pages that need authentication. 

    Your browser hides a lot of complexity (like cookies) from you. If you want to programmatically access your account, then you will have to address all this complexity yourself. 
    
    It's not impossible, but it's a lot of work.

In [None]:
url = 'http://quotes.toscrape.com/js/'
soup = get_soup(url)
len(soup.find_all('div', class_='quote'))

[Selenium](https://selenium-python.readthedocs.io/) is a powerful tool for controlling web browsers through programs and performing browser automation. Selenium scripts can be written in many programming languages, such as Java, Python, C#, Ruby, and Perl.

Selenium runs a browser instance, allowing you to do everything that a human user can do. But it has a steep learning curve. 😜

Thank you all for your time. I hope you had a wonderful time learning with [aiadventures](https://www.aiadventures.in).