# Web scraping with cbs_utils

In this notebook some small examples are given on how to use the web scraping utilities from cbs_utils. The following utilities are discussed:

1. [*get_page_from_url*](#get_page_from_url) : retrieve the contents of a url from the internet or from cache
2. [*UrlSearchStrings*](#urlseachstrings) : crawl a domain and search for strings


<a id=get_page_from_url></a>

## Using the get_page_from_url function

The *get_page_from_url* function retrieves the contents of a url and stores the result in a cache. The next time you call the function with the same url, the result is read from the cache instead of from the internet. The benefits of caching your data are:
1. Significant speed up of processing time
2. During development of a crawler you reduce the burden on a domain
3. You can work off-line

Here, a small example is given. First, import the required modules:

In [1]:
import logging
from pathlib import Path

from bs4 import BeautifulSoup

from cbs_utils.misc import (create_logger, merge_loggers)
from cbs_utils.regular_expressions import (KVK_REGEXP, ZIP_REGEXP, BTW_REGEXP)
from cbs_utils.web_scraping import (get_page_from_url, UrlSearchStrings)

*BeautifulSoup* is used to parse the contents of the web site. The *create_logger* and *merge_loggers* functions are used to quickly set up the logging system. The *regular_expressions* module provides standard regular expressions we can use to find strings such as the postal code (Dutch form), tax number, etc.

Next, set up the logging module using the cbs_utils misc function *create_logger*

In [2]:
# set up logging
log_level = logging.DEBUG  # DEBUG gives the most detail; use logging.INFO for less output
log_format = logging.Formatter('%(levelname)8s --- %(message)s')
logger = create_logger(console_log_level=log_level, formatter=log_format)
merge_loggers(logger, "cbs_utils.web_scraping", logger_level_to_merge=logging.INFO)

<Logger cbs_utils.web_scraping (INFO)>

For this example a *tmp* directory is made in your working directory to store the cache. First, we remove this directory in case it still exists from a previous run:

In [3]:
# define the cache directory and remove the cache files of a previous run
cache_directory = Path("tmp")
clean_cache = True
if clean_cache:
    if cache_directory.exists():
        for item in cache_directory.iterdir():
            item.unlink()
        cache_directory.rmdir()
    else:
        logger.info(f"Cache directory {cache_directory} was already removed")
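
The clean-up cell above assumes that the cache directory only contains plain files: `item.unlink()` fails on a sub-directory. If that assumption might not hold, a more tolerant variant can be sketched with `shutil.rmtree` from the standard library. The helper name `remove_cache` and the throw-away demo directory are made up for this illustration and are not part of cbs_utils.

```python
import shutil
import tempfile
from pathlib import Path


def remove_cache(cache_directory):
    """Remove the cache directory and everything below it.

    Returns True if the directory existed before the call.
    """
    cache_directory = Path(cache_directory)
    existed = cache_directory.exists()
    # ignore_errors=True turns the call into a no-op if the directory is gone
    shutil.rmtree(cache_directory, ignore_errors=True)
    return existed


# Demonstration in a throw-away location, so no real cache is touched
demo_root = Path(tempfile.mkdtemp())
demo_cache = demo_root / "tmp"
(demo_cache / "nested").mkdir(parents=True)
(demo_cache / "nested" / "page.pkl").write_bytes(b"")

removed_first = remove_cache(demo_cache)   # directory existed
removed_second = remove_cache(demo_cache)  # already removed
shutil.rmtree(demo_root, ignore_errors=True)
```

Calling the helper a second time is harmless, which makes it convenient at the top of a notebook that is re-run often.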


Now we can demonstrate the *get_page_from_url* function. 

In [4]:
%%time
url = "https://www.example.com"
page = get_page_from_url(url, cache_directory=cache_directory)

CPU times: user 36 ms, sys: 3 ms, total: 39 ms
Wall time: 5.51 s


As you can see, it took about 5.5 s to get all the information from the internet. Because the *get_page_from_url* function is wrapped with a *cache_to_disk* decorator, a cache file was made in the *tmp* directory:

In [5]:
for ii, item in enumerate(cache_directory.iterdir()):
    logger.info(f"Cache file {ii}: {item}")

    INFO --- Cache file 0: tmp/get_page_from_url_https_www_example_com_.pkl
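
To make the caching idea concrete, here is a minimal sketch of how a disk cache of this kind could work. It is not the cbs_utils implementation: the decorator `cache_to_disk_sketch`, the file naming scheme and the stand-in function `fake_download` are assumptions made for this illustration. The sketch pickles the result of the first call and returns the pickled result on later calls.

```python
import functools
import pickle
import re
import tempfile
from pathlib import Path


def cache_to_disk_sketch(func):
    """Hypothetical decorator: pickle the result of func, keyed on its first argument."""

    @functools.wraps(func)
    def wrapper(key, cache_directory="tmp", **kwargs):
        cache_directory = Path(cache_directory)
        # Replace every run of characters that is unsafe in a file name by "_"
        safe_key = re.sub(r"[^0-9a-zA-Z]+", "_", key)
        cache_file = cache_directory / f"{func.__name__}_{safe_key}.pkl"
        if cache_file.exists():
            # Cache hit: read the stored result, do not call func at all
            with open(cache_file, "rb") as stream:
                return pickle.load(stream)
        # Cache miss: call func and store its result for the next time
        result = func(key, **kwargs)
        cache_directory.mkdir(parents=True, exist_ok=True)
        with open(cache_file, "wb") as stream:
            pickle.dump(result, stream)
        return result

    return wrapper


calls = []


@cache_to_disk_sketch
def fake_download(url):
    """Stand-in for a slow download; records every real call."""
    calls.append(url)
    return f"<html>{url}</html>"


with tempfile.TemporaryDirectory() as tmp_dir:
    first = fake_download("https://www.example.com", cache_directory=tmp_dir)
    second = fake_download("https://www.example.com", cache_directory=tmp_dir)
    cache_files = [path.name for path in Path(tmp_dir).iterdir()]
```

After the two calls, `calls` holds a single entry: the second call was served from the pickle file and never reached `fake_download`.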


The contents of the url were stored in *page* and look like this:

In [6]:
soup = BeautifulSoup(page.text, 'lxml')
logger.info("\n{}\n".format(soup.body))

    INFO --- 
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.</p>
<p><a href="http://www.iana.org/domains/example">More information...</a></p>
</div>
</body>



We can run the same function again. Since we now have a cache file, it will be a few hundred times faster:

In [7]:
%%time
page2 = get_page_from_url(url, cache_directory=cache_directory)


CPU times: user 1 ms, sys: 6 ms, total: 7 ms
Wall time: 20.9 ms


Indeed, the same function call now returns in about 21 ms instead of 5.5 s. Now compare the results:

In [8]:
soup2 = BeautifulSoup(page2.text, 'lxml')
logger.info("Contents is equal: {}".format(soup.body == soup2.body))

    INFO --- Contents is equal: True


<a id=urlseachstrings></a>

## Using the *UrlSearchStrings* class

The *UrlSearchStrings* class can be used to recursively crawl a website and search for a list of regular expressions we want to obtain from the website. Again, the result is cached, so in case you want to run it again with different search strings it will run significantly faster.

Let's set up our first search session, trying to retrieve the postal code and kvk number from a web page. The regular expressions are obtained from the *regular_expressions* module of *cbs_utils* and are discussed below.

In [9]:
# the regular expressions are obtained from the cbs_utils.regular_expressions module
searches = dict(
    postcode=ZIP_REGEXP,
    kvknumber=KVK_REGEXP
)

url = "www.be-one.nl"

logger.info(f"Start crawling the url {url} and search for the following regular expressions:")
for key, reg_exp in searches.items():
    logger.info("{:10s}: {}".format(key, reg_exp))
logger.info("\n")

    INFO --- Start crawling the url www.be-one.nl and search for the following regular expressions:
    INFO --- postcode  : [1-9]\d{3}\s{0,1}[A-Z]{2}
    INFO --- kvknumber : ((?![-\w])|(\s|^))([\d][\.]{0,1}){7}\d((?![-\w])|(\s|^))
    INFO --- 



#### Some remarks on the regular expressions
The postcode regular expression is quite clear: it matches any four-digit number (where the first digit cannot be a 0), followed by 2 capital letters. There may be a space between the digits and the letters. So both 1234AB and 4545 YZ match.
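
The behaviour of the postcode expression can be checked with a few test strings. The pattern below is copied from the log output above; the test strings are made up for this illustration. Note that `fullmatch` is used here to test whole strings, whereas a crawler would search inside a longer text.

```python
import re

# Pattern copied from the log output of the search set-up cell
postcode_pattern = re.compile(r"[1-9]\d{3}\s{0,1}[A-Z]{2}")

candidates = ["1234AB", "4545 YZ", "0123AB", "1234ab", "1234  AB"]

# True if the whole string is a valid postcode according to the pattern
postcode_results = {text: bool(postcode_pattern.fullmatch(text)) for text in candidates}

for text, is_match in postcode_results.items():
    print(f"{text!r:12}: {is_match}")
```

The last three candidates are rejected because of the leading zero, the lower-case letters and the double space, respectively.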

The kvknumber expression is a bit more complicated. A kvk number is an 8-digit number which may contain dots, such as 123.456.78 or 12345678. Normally, we would put word boundaries (\b) around the 8 digits to prevent a 10-digit number from matching as well. However, the position next to a hyphen (-) counts as a word boundary too, so a string such as M-12345678 would match. Strings of this type occur frequently in urls, but they are not kvk numbers. To exclude hyphens, the expression spells out explicitly which characters may surround the number instead of using \b, so that it matches real kvk numbers more reliably.
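
The difference between the two approaches can be made visible by running both on the same strings. The first pattern is copied from the log output above; the second is a plain word-boundary variant written for this comparison only, and the test strings are made up.

```python
import re

# Pattern copied from the log output of the search set-up cell
kvk_pattern = re.compile(r"((?![-\w])|(\s|^))([\d][\.]{0,1}){7}\d((?![-\w])|(\s|^))")

# Naive alternative with word boundaries, for comparison only
word_boundary_pattern = re.compile(r"\b(\d\.?){7}\d\b")

candidates = ["12345678", "123.456.78", "KvK 12345678", "M-12345678", "1234567890"]

kvk_results = {text: bool(kvk_pattern.search(text)) for text in candidates}
boundary_results = {text: bool(word_boundary_pattern.search(text)) for text in candidates}

for text in candidates:
    print(f"{text!r:16} kvk: {kvk_results[text]!s:6} word boundary: {boundary_results[text]}")
```

Both patterns reject the 10-digit number, but only the word-boundary variant accepts M-12345678.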

#### Crawling the domain

Now let's crawl the domain for the first time

In [10]:
%%time
url_analyse = UrlSearchStrings(url, search_strings=searches, cache_directory=cache_directory,                         
                               store_page_to_cache=True)

    INFO --- Get (cached) page: https://www.be-one.nl/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/be-one/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/beaumont/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mijn-account/verlanglijstje.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mila-sierra-suede-sneaker-plato_cognac_19541.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/milestone/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/ml-collections/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/moment-by-moment/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mos-mosh/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mouwloos-bloemen_groen_19532.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/new-arrivals/ with valida

It took us about 2 minutes to crawl the whole site. The results can be viewed by just printing the *url_analyse* object to screen:

In [11]:
logger.info(url_analyse)

    INFO --- Matches in https://www.be-one.nl/
postcode : ['9206 BE']
kvknumber : ['01066434']


So we have found one postal code and one kvk number. Now, let's assume we would also like to have the tax number (btw in Dutch). We can run our search again, but much faster, because everything has been stored in cache. We add the new search to our *searches* dictionary and run again:

In [12]:
%%time
# add a new search string to our dictionary
searches["btwnummer"] = BTW_REGEXP

url_analyse = UrlSearchStrings(url, search_strings=searches, cache_directory=cache_directory,
                               store_page_to_cache=True, schema=url_analyse.schema,
                               ssl_valid=url_analyse.ssl_valid,
                               validate_url=False
                              )

    INFO --- Get (cached) page: https://www.be-one.nl/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/be-one/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/beaumont/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mijn-account/verlanglijstje.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mila-sierra-suede-sneaker-plato_cognac_19541.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/milestone/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/ml-collections/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/moment-by-moment/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mos-mosh/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mouwloos-bloemen_groen_19532.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/new-arrivals/ with valida

This time the search ran in about 5 seconds instead of two minutes. Note that we have explicitly passed the url scheme ("https") and the *ssl_valid* flag obtained in the first run, and set a flag so that the urls are not validated. This was not needed the first time we ran the code, because the scheme is then determined internally. Since that takes a lot of time, we switch the validation off and impose the known scheme instead.

The results can be seen by printing the object

In [13]:
logger.info(url_analyse)

    INFO --- Matches in https://www.be-one.nl/
postcode : ['9206 BE']
kvknumber : ['01066434']
btwnummer : ['NL8019.96.028.B.01']


Indeed, a btwnummer was added this time.

In case you want to access the search results: these are stored in the *matches* attribute, which is just a normal dictionary:

In [14]:
for key, value in url_analyse.matches.items():
    logger.info(f"The search key {key} has the following matches: {value}")

    INFO --- The search key postcode has the following matches: ['9206 BE']
    INFO --- The search key kvknumber has the following matches: ['01066434']
    INFO --- The search key btwnummer has the following matches: ['NL8019.96.028.B.01']


#### Adding a search order in the domain

There is one more trick to speed up your crawl sessions. In this example we searched the whole domain to look for a string, which still takes a lot of time. In many cases the string we are looking for occurs in standard locations. Information on the company, for instance, is often found on a page with 'contact' or 'about-us' in the hyperlink (href).

We can make use of this information by giving a list of href names which we want to search first, before the rest is searched. We can also stop any further crawling as soon as we find a match. Let's have a look at an example.

First, we make a list of common href names where company information is stored. The strings in the list are treated as regular expressions, so they don't have to be exact: if a part of an href matches a string in the list, that page is searched first.


In [15]:
sort_order_hrefs=[
    "about",
    "over",
    "contact",
    "privacy",
    "algeme",
    "voorwaarden",
    "klanten",
    "customer",
]
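
How such a priority list could be applied is sketched below. This is not the cbs_utils implementation: the helper `prioritise_hrefs`, the shortened priority list and the example hrefs are made up for this illustration. Each href gets the index of the first pattern it matches as its rank, and hrefs that match nothing are placed at the end.

```python
import re


def prioritise_hrefs(hrefs, sort_order):
    """Return hrefs sorted by the first pattern in sort_order they match."""

    def rank(href):
        for index, pattern in enumerate(sort_order):
            # re.search matches anywhere in the href, so a partial hit suffices
            if re.search(pattern, href):
                return index
        # No pattern matched: put this href after all prioritised ones
        return len(sort_order)

    # sorted() is stable, so hrefs with the same rank keep their original order
    return sorted(hrefs, key=rank)


demo_sort_order = ["about", "over", "contact", "privacy", "algeme"]
demo_hrefs = [
    "/new-arrivals/",
    "/klantenservice/algemene-voorwaarden.html",
    "/contact/",
    "/be-one/",
]
ordered_hrefs = prioritise_hrefs(demo_hrefs, demo_sort_order)
print(ordered_hrefs)
```

In this sketch "/contact/" comes first because "contact" appears earlier in the priority list than "algeme".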

Now we can pass this list to the UrlSearchStrings class and crawl again. Note that we have also added 'btwnummer' to the *stop_search_on_found_keys* list. This argument takes a list of keys from our *search_strings* dictionary for which we want to stop searching as soon as we have found a match.

In [16]:
%%time
# btwnummer was already added above; repeated here so this cell can run on its own
searches["btwnummer"] = BTW_REGEXP

url_analyse = UrlSearchStrings(url, search_strings=searches, cache_directory=cache_directory,
                               store_page_to_cache=True, schema=url_analyse.schema,
                               ssl_valid=url_analyse.ssl_valid,
                               validate_url=False, 
                               sort_order_hrefs=sort_order_hrefs,
                               stop_search_on_found_keys=['btwnummer']
                              )

    INFO --- Get (cached) page: https://www.be-one.nl/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/algemene-voorwaarden.html with validate True
CPU times: user 142 ms, sys: 4 ms, total: 146 ms
Wall time: 187 ms


As you can see, this time we started searching in an href which we included in our *sort_order_hrefs* list. As a result we scraped the href *klantenservice/algemene-voorwaarden.html* first, which was almost the last page we crawled when we did not give this sort list. Since we have passed *stop_search_on_found_keys* as well, we immediately stop crawling as soon as we have found a match for *btwnummer*. Combined with the fact that we were also crawling from cache, this time our crawl took only 187 ms. Compared to the 2 minutes of our first crawl, this is quite a speed-up indeed.
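
The early-stop logic can be sketched as a simple loop over pages. This is an illustration under assumptions, not the cbs_utils code: the function `search_pages` and the page texts are made up, and the sketch assumes that crawling stops once every key in the stop list has at least one match. The postcode pattern is the one shown in the log output earlier.

```python
import re


def search_pages(pages, searches, stop_search_on_found_keys=None):
    """Search (href, text) pairs in order; stop once every stop key has a match."""
    matches = {key: [] for key in searches}
    visited = []
    for href, text in pages:
        visited.append(href)
        for key, pattern in searches.items():
            matches[key].extend(m.group(0).strip() for m in re.finditer(pattern, text))
        # Assumption: stop only when all requested keys have been found
        if stop_search_on_found_keys and all(
            matches[key] for key in stop_search_on_found_keys
        ):
            break
    return matches, visited


# Made-up page texts, already placed in the preferred search order
demo_pages = [
    ("/contact/", "Questions? Send us a message."),
    ("/klantenservice/algemene-voorwaarden.html", "Postal address: 9206 BE"),
    ("/new-arrivals/", "New items every week"),
]
demo_searches = {"postcode": r"[1-9]\d{3}\s{0,1}[A-Z]{2}"}

demo_matches, demo_visited = search_pages(
    demo_pages, demo_searches, stop_search_on_found_keys=["postcode"]
)
print(demo_matches)
print(demo_visited)
```

The third page is never visited, because the postcode is found on the second one.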