# Web scraping with cbs_utils

This notebook gives some small examples of how to use the web scraping utilities from cbs_utils.

## Using the get_page_from_url function

The *get_page_from_url* function obtains the contents of a url and stores the result in cache. The next time you run the function with the same url, the result is read from cache. Here, a small example is given. First, import the required modules:

In [1]:
import logging
from pathlib import Path

from bs4 import BeautifulSoup

from cbs_utils.misc import (create_logger, merge_loggers)
from cbs_utils.regular_expressions import (KVK_REGEXP, ZIP_REGEXP, BTW_REGEXP)
from cbs_utils.web_scraping import (get_page_from_url, UrlSearchStrings)

Set up logging with the *create_logger* function from cbs_utils.misc:

In [2]:
# set up logging
log_level = logging.DEBUG  # DEBUG gives the most detail; use logging.INFO for less output
log_format = logging.Formatter('%(levelname)8s --- %(message)s')
logger = create_logger(console_log_level=log_level, formatter=log_format)
merge_loggers(logger, "cbs_utils.web_scraping", logger_level_to_merge=logging.INFO)

<Logger cbs_utils.web_scraping (INFO)>

For this example a *tmp* directory is made in your working directory to store the cache. First we make sure this directory is removed in case it still exists from a previous run:

In [3]:
# set the cache directory and remove any cache files left from a previous run
cache_directory = Path("tmp")
clean_cache = True
if clean_cache:
    if cache_directory.exists():
        for item in cache_directory.iterdir():
            item.unlink()
        cache_directory.rmdir()
    else:
        logger.info(f"Cache directory {cache_directory} was already removed")


Now we can demonstrate the *get_page_from_url* function. Note that the *%%time* cell magic is only added to report the processing time:

In [4]:
%%time
url = "https://www.example.com"
page = get_page_from_url(url, cache_directory=cache_directory)

CPU times: user 35 ms, sys: 13 ms, total: 48 ms
Wall time: 5.47 s


As you can see, it took about 5.5 s to get all the information from the internet. Because a *cache_to_disk* decorator has been added to the *get_page_from_url* function, a cache file was created in the *tmp* directory:

In [5]:
for ii, item in enumerate(cache_directory.iterdir()):
    logger.info(f"Cache file {ii}: {item}")

    INFO --- Cache file 0: tmp/get_page_from_url_https_www_example_com_.pkl
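
The idea behind caching a function result to disk can be sketched as follows. This is only an illustration under stated assumptions, not the cbs_utils implementation: the decorator name `cache_to_disk_sketch`, the function `slow_lookup` and the file naming scheme are all hypothetical.

```python
import functools
import pickle
import tempfile
from pathlib import Path


def cache_to_disk_sketch(func):
    """Hypothetical decorator: pickle the result of func, one file per key."""

    @functools.wraps(func)
    def wrapper(key, cache_directory="tmp"):
        cache_dir = Path(cache_directory)
        cache_dir.mkdir(parents=True, exist_ok=True)
        # build a file-system-safe name from the function name and the key
        safe_key = "".join(c if c.isalnum() else "_" for c in key)
        cache_file = cache_dir / f"{func.__name__}_{safe_key}.pkl"
        if cache_file.exists():
            # cache hit: read the stored result instead of calling func
            with cache_file.open("rb") as stream:
                return pickle.load(stream)
        # cache miss: do the real work and store the result for next time
        result = func(key)
        with cache_file.open("wb") as stream:
            pickle.dump(result, stream)
        return result

    return wrapper


calls = []


@cache_to_disk_sketch
def slow_lookup(key):
    calls.append(key)  # record how often the real work is done
    return key.upper()  # stand-in for downloading a page


with tempfile.TemporaryDirectory() as tmp_dir:
    first = slow_lookup("https://www.example.com", cache_directory=tmp_dir)
    second = slow_lookup("https://www.example.com", cache_directory=tmp_dir)
```

The second call returns the same value as the first, but `slow_lookup` itself is only executed once; the second result comes from the pickle file.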


The contents of the url were stored in *page* and look like this:

In [6]:
soup = BeautifulSoup(page.text, 'lxml')
logger.info("\n{}\n".format(soup.body))

    INFO --- 
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.</p>
<p><a href="http://www.iana.org/domains/example">More information...</a></p>
</div>
</body>



We can run the same function again. Since we now have a cache file, it will be more than 1000 times faster:

In [7]:
%%time
page2 = get_page_from_url(url, cache_directory=cache_directory)


CPU times: user 2 ms, sys: 0 ns, total: 2 ms
Wall time: 2.12 ms


Indeed, exactly the same function call now runs in about 2 ms. Now compare the results:

In [8]:
soup2 = BeautifulSoup(page2.text, 'lxml')
logger.info("Contents is equal: {}".format(soup.body == soup2.body))

    INFO --- Contents is equal: True
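
The timings in this section were reported inside a notebook. In a plain Python script a comparable measurement could be made with a small context manager. The sketch below is an illustration built on `time.perf_counter`; the class name `Timer` is hypothetical here and is not claimed to match any helper in cbs_utils.

```python
import time


class Timer:
    """Hypothetical helper: measure the wall-clock time of a code block."""

    def __init__(self, label="block"):
        self.label = label
        self.elapsed = None

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self._start
        print(f"{self.label}: {self.elapsed * 1000:.2f} ms")
        return False  # do not swallow exceptions raised inside the block


with Timer("sleep") as timer:
    time.sleep(0.01)  # stand-in for a slow call such as a page download
```

The elapsed time in seconds remains available as `timer.elapsed` after the block, so two runs (with and without cache) can be compared in code.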


## Using the *UrlSearchStrings* class

The *UrlSearchStrings* class can be used to recursively crawl a web site and search for a list of regular expressions that we want to obtain from the web site. Again, the result is cached, so in case you want to run it again with different search strings it will run significantly faster.
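
The crawling part of such a search can be sketched as a breadth-first traversal that follows only links on the same domain. The sketch below is a simplified illustration, not the way *UrlSearchStrings* is implemented: the `crawl` function, the `fetch` argument and the in-memory `SITE` dictionary (used so that no network access is needed) are hypothetical, and links are extracted with a simple regular expression where a real crawler would use an HTML parser such as BeautifulSoup.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "web site" (url -> html), so no network is needed
SITE = {
    "https://shop.example/":
        '<a href="/about.html">About</a> <a href="/contact.html">Contact</a>',
    "https://shop.example/about.html":
        '<a href="/">Home</a> <a href="https://other.example/">Partner</a>',
    "https://shop.example/contact.html":
        '<a href="/about.html">About</a>',
}

HREF_PATTERN = re.compile(r'href="([^"]+)"')


def crawl(start_url, fetch, max_pages=100):
    """Return the urls visited in a breadth-first crawl of one domain."""
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        html = fetch(url)
        if html is None:  # page could not be retrieved
            continue
        visited.append(url)
        for href in HREF_PATTERN.findall(html):
            link = urljoin(url, href)  # resolve relative links
            # follow only links that stay on the start domain
            if urlparse(link).netloc == domain and link not in visited:
                queue.append(link)
    return visited


pages = crawl("https://shop.example/", fetch=SITE.get)
```

Here `pages` contains the three pages of the hypothetical shop in the order they were reached; the link to the other domain is not followed.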

Let us set up our first search session, trying to retrieve the postal code and kvk number from a web site:

In [9]:
%%time
# the regular expressions are obtained from the cbs_utils.regular_expressions module
searches = dict(
    postcode=ZIP_REGEXP,
    kvknumber=KVK_REGEXP
)

url = "www.be-one.nl"
url_analyse = UrlSearchStrings(url, search_strings=searches, cache_directory=cache_directory,
                               store_page_to_cache=True)

    INFO --- Get (cached) page: https://www.be-one.nl/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/be-one/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/beaumont/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mijn-account/verlanglijstje.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mila-sierra-suede-sneaker-plato_cognac_19541.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/milestone/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/ml-collections/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/moment-by-moment/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mos-mosh/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mouwloos-bloemen_groen_19532.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/new-arrivals/ with valida

It took us about 2 minutes to crawl the whole site. The results can be viewed by just printing the *url_analyse* object to screen:

In [10]:
logger.info(url_analyse)

    INFO --- Matches in https://www.be-one.nl/
postcode : ['9206 BE']
kvknumber : ['01066434']


So we have found one postal code and one kvk number. Now, let's assume we would also like to have the tax number (btw in Dutch). We can run our search again, but much faster, because everything has been stored in cache. We add the new search to our *searches* dictionary and run again:

In [14]:
%%time
searches["btwnummer"] = BTW_REGEXP

url_analyse = UrlSearchStrings(url, search_strings=searches, cache_directory=cache_directory,
                               store_page_to_cache=True, schema=url_analyse.schema,
                               ssl_valid=url_analyse.ssl_valid,
                               validate_url=False
                              )

    INFO --- Get (cached) page: https://www.be-one.nl/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/be-one/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/beaumont/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mijn-account/verlanglijstje.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mila-sierra-suede-sneaker-plato_cognac_19541.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/milestone/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/ml-collections/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/moment-by-moment/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mos-mosh/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mouwloos-bloemen_groen_19532.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/new-arrivals/ with valida

This time we could run our search in about 5 seconds instead of two minutes. Note that we have explicitly passed the url scheme ("https") and the SSL validity found in the first run, and set a flag that the urls should not be validated again.

The results can be seen by printing the object:

In [16]:
logger.info(url_analyse)

    INFO --- Matches in https://www.be-one.nl/
postcode : ['9206 BE']
kvknumber : ['01066434']
btwnummer : ['NL8019.96.028.B.01']


Indeed, a btw number was added this time.
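
To make the role of the *searches* dictionary concrete, the sketch below applies a dictionary of named regular expressions to a piece of text and then extends it with one more pattern. The three patterns are hypothetical and written for this illustration only; they are not the *ZIP_REGEXP*, *KVK_REGEXP* and *BTW_REGEXP* from cbs_utils, and the sample sentence is made up from the values found above.

```python
import re

# Hypothetical patterns, for illustration only (not the cbs_utils ones)
searches_sketch = dict(
    postcode=re.compile(r"\b\d{4}\s?[A-Z]{2}\b"),  # 4 digits + 2 capitals
    kvknumber=re.compile(r"\b\d{8}\b"),            # 8 digits
)

# made-up sample text containing the values found in this notebook
text = "Postcode 9206 BE, KvK-nummer 01066434, btw-nummer NL8019.96.028.B.01"


def find_all(patterns, content):
    """Return a dict with, for every named pattern, the list of matches."""
    return {name: pattern.findall(content) for name, pattern in patterns.items()}


first_result = find_all(searches_sketch, text)

# add a third search and run again on the same text
searches_sketch["btwnummer"] = re.compile(
    r"\bNL\d{4}\.?\d{2}\.?\d{3}\.?B\.?\d{2}\b"
)
second_result = find_all(searches_sketch, text)
```

The first result holds only the postal code and kvk number; the second result contains the same two entries plus the btw number.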