# In this notebook, we show how to scrape data from web pages using [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), a Python library.

In [5]:
# To install only the requirements of this notebook, run this cell

# ===========================

# !pip install numpy==1.19.5
!pip install beautifulsoup4

# ===========================



In [3]:
# To install the requirements for the entire chapter, run this cell

# ===========================

try:
    import google.colab  # raises ModuleNotFoundError outside Google Colab
    # On Colab, download the chapter's requirements file, then install from it.
    # %pip cannot read a requirements file piped from curl, so we download it first.
    !curl -O https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch2/ch2-requirements.txt
    %pip install -r ch2-requirements.txt
except ModuleNotFoundError:
    # In a local environment, install from the local requirements file
    %pip install -r "ch2-requirements.txt"
# ===========================

Collecting numpy==1.19.5 (from -r ch2-requirements.txt (line 1))
  Using cached numpy-1.19.5.zip (7.3 MB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
Note: you may need to restart the kernel to use updated packages.



In [4]:
# making the necessary imports
from pprint import pprint
from bs4 import BeautifulSoup
from urllib.request import urlopen 

In [5]:
myurl = "https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python"  # URL of the page to scrape

from urllib.request import Request

# Send a browser-like User-Agent header, since some sites reject urllib's default one
req = Request(myurl, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()  # fetch the page; the result is the raw HTML as bytes
soupified = BeautifulSoup(html, 'html.parser')  # parse the HTML into a BeautifulSoup object
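
Network requests can fail, so it may help to wrap the fetch in a small helper. The following is a minimal sketch rather than part of the original notebook: `fetch_html` is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Return the raw bytes at `url`, or None if the request fails.

    Hypothetical helper for illustration; the default timeout is an assumption.
    """
    # Send a browser-like User-Agent, as in the cell above.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urlopen(req, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status (for example 403 or 404).
        print(f"HTTP error {err.code} for {url}")
    except URLError as err:
        # The request could not be completed (bad host, unknown scheme, ...).
        print(f"Could not reach {url}: {err.reason}")
    return None
```

With such a helper, one could write `html = fetch_html(myurl)` and check that the result is not `None` before passing it to `BeautifulSoup`. `HTTPError` is caught first because it is a subclass of `URLError`.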

Because the parsed page (`soupified`) is large, only the first 2,000 characters of its HTML are shown here.

In [6]:
pprint(soupified.prettify())  # print the full, indented HTML of the page

('<!DOCTYPE html>\n'
 '<html class="html__responsive " itemscope="" '
 'itemtype="https://schema.org/QAPage" lang="en">\n'
 ' <head>\n'
 '  <title>\n'
 '   datetime - How do I get the current time in Python? - Stack Overflow\n'
 '  </title>\n'
 '  <link '
 'href="https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196" '
 'rel="shortcut icon"/>\n'
 '  <link '
 'href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" '
 'rel="apple-touch-icon"/>\n'
 '  <link '
 'href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" '
 'rel="image_src"/>\n'
 '  <link href="/opensearch.xml" rel="search" title="Stack Overflow" '
 'type="application/opensearchdescription+xml"/>\n'
 '  <link '
 'href="https://stackoverflow.com/questions/415511/how-do-i-get-the-current-time-in-python" '
 'rel="canonical">\n'
 '   <meta content="width=device-width, height=device-height, '
 'initial-scale=1.0, minimum-scale=1.0" name="vie

In [10]:
pprint(soupified.prettify()[:2000])  # first 2,000 characters, to get an idea of the page's HTML structure


In [11]:
soupified.title # to get the title of the web page 

<title>datetime - How do I get the current time in Python? - Stack Overflow</title>
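
Note that `soupified.title` returns the whole tag, markup included. To get only the text, Beautiful Soup offers `.string` and `.get_text()`. The sketch below illustrates the difference on a tiny, made-up HTML string (an assumption for illustration, so that it runs without a network request).

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for this illustration.
sample_html = (
    "<html><head><title>Example page</title></head>"
    "<body><p>Hello</p></body></html>"
)
soup = BeautifulSoup(sample_html, "html.parser")

title_tag = soup.title          # a Tag object, markup included
title_text = soup.title.string  # only the text inside the tag

print(title_tag)
print(title_text)
```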

In [12]:
question = soupified.find("div", {"class": "question"})  # find the tag (div) and class that hold the question
questiontext = question.find("div", {"class": "s-prose js-post-body"})  # the post body inside it
print("Question: \n", questiontext.get_text().strip())

answer = soupified.find("div", {"class": "answer"})  # find() returns the first matching answer on the page
answertext = answer.find("div", {"class": "s-prose js-post-body"})  # the post body inside it
print("Best answer: \n", answertext.get_text().strip())

Question: 
 How do I get the current time in Python?
Best answer: 
 Use datetime:
>>> import datetime
>>> now = datetime.datetime.now()
>>> now
datetime.datetime(2009, 1, 6, 15, 8, 24, 78915)
>>> print(now)
2009-01-06 15:08:24.789150

For just the clock time without the date:
>>> now.time()
datetime.time(15, 8, 24, 78915)
>>> print(now.time())
15:08:24.789150


To save typing, you can import the datetime object from the datetime module:
>>> from datetime import datetime

Then remove the prefix datetime. from all of the above.
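
The lookups above assume that every tag is present: `find` returns `None` when nothing matches, so a change in the page layout would raise an `AttributeError` on the next call. A more defensive variant, sketched here on made-up markup that only loosely mimics the page, uses CSS selectors via `select_one`. The helper name `extract_text` and the sample HTML are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup that loosely mimics the structure queried above.
# The class order differs between the two post bodies on purpose.
sample_html = """
<div class="question"><div class="js-post-body s-prose">What time is it?</div></div>
<div class="answer"><div class="s-prose js-post-body">Use datetime.</div></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")


def extract_text(soup, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = soup.select_one(selector)
    return node.get_text().strip() if node is not None else None


# A CSS selector matches on each class separately, whatever their order.
question_text = extract_text(soup, "div.question div.s-prose.js-post-body")
answer_text = extract_text(soup, "div.answer div.s-prose.js-post-body")
missing = extract_text(soup, "div.comment div.s-prose")  # no such element

print(question_text)
print(answer_text)
print(missing)
```

Returning `None` for a missing element lets the caller decide how to handle a layout change instead of failing mid-extraction.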


Beautiful Soup is one of many libraries for scraping web pages. Depending on your needs, you can choose among the available options, such as Beautiful Soup, Scrapy, and Selenium.