# Web Scraping

---

Web scraping is extracting information from a website and parsing it into a readable format (informally, "getting the soup").

---

## Why do we need web scraping?
1. Collect data for AI projects (movie recommendation systems, NLP for product reviews, corona prediction, sports predictions, etc.)
2. Market analysis
3. Marketing and sales solution
4. SEO
5. Product comparison



## Demand for web scraping
https://www.upwork.com/ab/jobs/search/?q=web%20scraping&sort=recency

# Introduction to Beautiful Soup

* A Python package for parsing HTML and XML documents into parse trees.
* Provides easier ways of navigating, searching, and modifying the parse tree.


## Installing Beautiful Soup

To install Beautiful Soup, use the following command.

```
pip install beautifulsoup4
```

## Importing Beautiful Soup

In [None]:
from bs4 import BeautifulSoup

## Parsing an HTML file with Beautiful Soup

In [None]:
# Say you have the following HTML string.

HTML_STRING = '''
<!DOCTYPE html>
<html lang="en">
<head>
    <title>Document</title>
</head>
<body>
    <div>
        <ol>
            <li>Item 1</li>
            <li>Item 2</li>
            <li>Item 3</li>
            <li>Item 4</li>
        </ol>
    </div>
</body>
</html>
'''

Parsing an HTML document with Beautiful Soup is easy: you create an instance of `BeautifulSoup`. The constructor takes either an open file or a string containing HTML, so you can create a `BeautifulSoup` object directly from the string above. The second argument names the parser to use; `'html.parser'` is the one built into Python.
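For the file case, a minimal sketch might look like the following. The file name `example.html` and its contents are made up for illustration, and the file is written to a temporary directory so the snippet runs on its own.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical file name and contents, used only for this illustration.
html = "<html><head><title>Document</title></head><body><p>Hello</p></body></html>"
path = os.path.join(tempfile.mkdtemp(), "example.html")

with open(path, "w", encoding="utf-8") as f:
    f.write(html)

# Pass the open file object straight to the constructor.
with open(path, encoding="utf-8") as f:
    file_soup = BeautifulSoup(f, "html.parser")

print(file_soup.title.text)
```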

In [None]:
soup = BeautifulSoup(HTML_STRING, 'html.parser')
soup.find_all('li')
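Navigating, searching, and modifying the parse tree can each be tried on a small string. The snippet below is a sketch on a shortened version of the list above; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html = "<div><ol><li>Item 1</li><li>Item 2</li></ol></div>"
demo_soup = BeautifulSoup(html, "html.parser")

# Navigating: walk down the tree with attribute access.
first_item = demo_soup.div.ol.li
print(first_item.text)

# Searching: collect the text of every <li> element.
items = [li.text for li in demo_soup.find_all("li")]
print(items)

# Modifying: change the text of the first item in place.
first_item.string = "Changed item"
print(demo_soup.ol)
```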

Beautiful Soup has excellent documentation, which you can read [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc).


# Finding Elements of a Webpage

## Getting the HTML for a page

In [None]:
import requests
page = requests.get('http://quotes.toscrape.com/')

The HTML string for the web page is in `page.text`.

In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.text, 'html.parser')
soup
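A request can fail or return an error page, so it may be worth checking the response before parsing it. The helper below is a sketch under that assumption; `fetch_soup` is a hypothetical name, and the 10-second timeout is an arbitrary choice.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


# Example call (needs network access):
# soup = fetch_soup("http://quotes.toscrape.com/")
```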

## Inspecting the source for the page

* All the quotes lie inside a `<div>` element of class `col-md-8`.

* All the quote boxes are `<div>` elements of class `quote`.

* Quotes, tags and names of the authors are together in a `<div>` element of class `quote` for each quote.

* Each quote is in a `<span>` element of class `text`.

* Each author is in a `<small>` element of class `author` and is nested in a `<span>` element with no class specified.

* Tags are in `<a>` elements, all of class `tag`. These elements are inside a `<div>` element of class `tags`.
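The structure described above can be tried offline on a small hand-written fragment. The markup below is an assumption modelled on these bullets, not a copy of the live page.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that mimics the structure described above (an assumption).
SAMPLE_HTML = '''
<div class="col-md-8">
  <div class="quote">
    <span class="text">"A sample quote."</span>
    <span>by <small class="author">Sample Author</small></span>
    <div class="tags">
      <a class="tag" href="/tag/one/">one</a>
      <a class="tag" href="/tag/two/">two</a>
    </div>
  </div>
</div>
'''

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_quote_div = sample_soup.find("div", {"class": "quote"})

print(sample_quote_div.find("span", {"class": "text"}).text)
print(sample_quote_div.find("small", {"class": "author"}).text)
print([a.text for a in sample_quote_div.find_all("a", {"class": "tag"})])
```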

A good place to begin is to get the `<div>` elements that contain the quotes in the page.

In [None]:
# Calling the soup object directly is shorthand for soup.find_all(...).
all_quote_divs = soup('div', {'class':'quote'})
print('Found {} quotes.'.format(len(all_quote_divs)))

## Getting the quotes

In [None]:
quotes = []

for quote_div in all_quote_divs:
    quote_span = quote_div.find('span', {'class':'text'})
    quote = quote_span.text
    quotes.append(quote)

quotes

## Getting the authors

In [None]:
authors = []

for quote_div in all_quote_divs:
    author_small = quote_div.find('small', {'class':'author'})
    author = author_small.text
    authors.append(author)

authors

## Getting the tags

In [None]:
all_tags = []

for quote_div in all_quote_divs:
    tags_as = quote_div.find_all('a', {'class':'tag'})
    
    tags = []
    
    for tag_a in tags_as:
        tag = tag_a.text
        tags.append(tag)
    all_tags.append(tags)

all_tags
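The three loops above can also be folded into a single pass that keeps each quote together with its author and tags. This is one possible sketch; `parse_quotes` is a hypothetical helper name, and the sample markup is hand-written so the code runs without network access.

```python
from bs4 import BeautifulSoup


def parse_quotes(soup):
    """Return one dict per quote box with its text, author, and tags."""
    records = []
    for quote_div in soup.find_all("div", {"class": "quote"}):
        records.append({
            "quote": quote_div.find("span", {"class": "text"}).text,
            "author": quote_div.find("small", {"class": "author"}).text,
            "tags": [a.text for a in quote_div.find_all("a", {"class": "tag"})],
        })
    return records


# Tiny hand-written sample (not taken from the live page).
sample = BeautifulSoup(
    '<div class="quote"><span class="text">Hello</span>'
    '<span>by <small class="author">Someone</small></span>'
    '<div class="tags"><a class="tag">greeting</a></div></div>',
    "html.parser",
)
print(parse_quotes(sample))
```

On the real page, the same function could be called as `parse_quotes(soup)` with the `soup` object built earlier.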

# CLASSWORK: Scrape the title, weekend revenue, and gross revenue of the Top Box Office (US) movies from IMDB

Source: https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht 

In [None]:
import requests as req
from bs4 import BeautifulSoup as bs

In [None]:
page = req.get('https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht')
soup = bs(page.text, 'html.parser')

In [None]:
trs = soup.find('table', {'class': 'chart full-width'}).find('tbody').find_all('tr')


In [None]:
for tr in trs:
    image = tr.find('img').get('src')
    title = tr.find('td', {'class': 'titleColumn'}).find('a').text
    weekend = tr.find('td', {'class': 'ratingColumn'}).text.strip()
    gross = tr.find('span', {'class': 'secondaryInfo'}).text
    
    print("%s \n %s \n %s \n %s" % (image, title, weekend, gross))
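`find` returns `None` when nothing matches, so a chained `.text` raises `AttributeError` if the page markup differs from the class names used above. A defensive variant might look like the sketch below; `text_or_none` is a hypothetical helper, and the table row is hand-written rather than taken from IMDB.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, name, class_name):
    """Return the stripped text of the first matching child, or None if absent."""
    node = parent.find(name, {"class": class_name})
    return node.text.strip() if node is not None else None


# Hand-written row; the class names are the ones used in the loop above.
row = BeautifulSoup(
    '<table><tr><td class="titleColumn"><a>Some Movie</a></td></tr></table>',
    "html.parser",
).find("tr")

print(text_or_none(row, "td", "titleColumn"))   # cell is present
print(text_or_none(row, "td", "ratingColumn"))  # cell is missing
```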


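## Accessing forbidden resources

Some sites reject requests that do not look like they come from a browser. Compare the response to a plain request with the response to one that sends a browser-like `User-Agent` header.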

In [None]:
import requests

page = requests.get('https://hamrobazaar.com/')
page.text

In [None]:
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36'}
page = requests.get('https://hamrobazaar.com/', headers = headers)
page.text
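When several requests need the same header, it can be set once on a `requests.Session`. The sketch below builds the request without sending it, so the outgoing header can be inspected offline; the commented lines at the end show how it might be sent.

```python
import requests

# Same User-Agent string as above, set once on a session so every request reuses it.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36"
})

# Build the request without sending it, to inspect the headers that would go out.
request = requests.Request("GET", "https://hamrobazaar.com/")
prepared = session.prepare_request(request)
print(prepared.headers["User-Agent"])

# Sending it needs network access:
# page = session.send(prepared, timeout=10)
# print(page.status_code)
```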

### Proxy

In [None]:
import requests

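# requests picks the proxy by URL scheme, so an https URL needs an 'https' entry.
# Replace 10.10.10.10:1234 with the address of your own proxy.
proxy = {'http': 'http://10.10.10.10:1234', 'https': 'http://10.10.10.10:1234'}
page = requests.get('https://www.google.com/', proxies = proxy)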
page
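`requests` chooses a proxy by the scheme of the target URL, so an `https://` address needs an `https` entry in the dictionary. The sketch below uses `requests.utils.select_proxy` to show which proxy would be picked without sending anything; the proxy address is the placeholder from the example above.

```python
from requests.utils import select_proxy

# Placeholder proxy address taken from the example above; it is not a real server.
PROXY_URL = "http://10.10.10.10:1234"

http_only = {"http": PROXY_URL}
both_schemes = {"http": PROXY_URL, "https": PROXY_URL}

target = "https://www.google.com/"
print(select_proxy(target, http_only))     # no entry for the https scheme
print(select_proxy(target, both_schemes))  # the https entry is selected
```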