# URL Scraper Starter Kit
## Structure of this Starter Kit

1. Source code (in Python) - library and application
2. Jupyter Notebook files (ipynb) including manuals inside
3. Example files - data with urls - url.txt

### Data processing schema
URL list in files -> URLScraper -> Websites in NoSQL collections for further processing

### Prerequisites
A data source containing the URLs to scrape is needed. It can be any iterable, such as a list or a data frame column containing URLs. For this starter kit, we use a line-separated input file that looks like this:

http://stat.gov.pl

http://destatis.de

http://www.nsi.bg
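
Since any iterable of URL strings can serve as input, the loading step can be sketched as a small helper. `load_urls` is a hypothetical name that is not part of the starter kit; it only strips whitespace and drops blank lines, so the result looks the same whether the URLs came from a file or from a list.

```python
def load_urls(lines):
    """Return the non-empty, whitespace-stripped entries of an iterable of URL strings."""
    return [line.strip() for line in lines if line.strip()]


# Works with a plain list ...
urls = load_urls(["http://stat.gov.pl\n", "\n", "http://destatis.de\n", "http://www.nsi.bg"])
print(urls)

# ... and equally with an open text file, for example:
# with open("url.txt") as f:
#     urls = load_urls(f)
```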

The application runs in seven steps:

1. Import libraries
2. Create a connection to the MongoDB server
3. Set the database name
4. Set the file name of URLs to import
5. Define the class and methods for web scraping
6. Start the web scraping
7. Retrieve saved website data from MongoDB

# 1. Import libraries.
If any of these libraries are missing (the packages are pymongo, requests and beautifulsoup4), install them into your Python environment with pip, pip3, conda or easy_install. See the manual for details.

In [1]:
# import libraries
from pymongo import MongoClient
import requests
from bs4 import BeautifulSoup
import re
from datetime import datetime
import sys
import time

# 2. Create a connection to mongodb server. 

Replace the values below with your own.

### Variables to set:

host - replace with the IP address or name of the server, e.g. 192.168.1.1 or serverdb.domain.com

port - change the port number - for MongoDB default is 27017

In [2]:
host='localhost'
port=27017
# define the client connection
# host - default localhost
# port - default 27017
client=MongoClient('mongodb://'+str(host)+":"+str(port))

# 3. Set the database name.

### Variable to set:

dbname - if the database does not exist, MongoDB creates it when the first document is inserted.

In [3]:
dbname='URLScraping'
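try:
    database=client[dbname]
    # client[dbname] is lazy and never contacts the server,
    # so ping the server to surface connection problems here
    client.admin.command('ping')
except Exception as e:
    print('Error connecting to the database:', type(e).__name__, e)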

# 4. Import the file containing URLs to scrape. 
We created a line-separated file containing URLs, as explained in the prerequisites.

### Variable to set:

filename - the name of the file, e.g. url.txt

In [4]:
filename='url.txt'
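# Read the URLs into a list, skipping blank lines; the file is closed automatically
with open(filename,'r') as f:
    file=[line.strip() for line in f if line.strip()]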

# 5. Define Class and methods for web scraping

In [5]:
class ScrapeDomain():
    ##################
    # init
    ##################
    def __init__(self, domain, lang_codes=None, max_pages=1, accept_subdomains=False):
        self.domain = domain.lower().strip().replace('http://','').replace('https://','').replace('www.','')
        self.domain_link = 'https://'+self.domain
        self.json = {'domain': self.domain,
                     'content': {}} # All scraped pages are embedded documents within content dict
        self.max_pages = max_pages
        self.link_set = {self.domain_link}
        self.num_pages = 0
        self.scraped = set()
        self.accept_subdomains = accept_subdomains
        self.lang_codes = lang_codes
        if self.lang_codes:
            self.lang_idents = []
            #TODO: language tag identification with regular expressions
            for language in self.lang_codes:
                self.lang_idents.append("/{}/".format(language))
                self.lang_idents.append("/{}-{}/".format(language, language))
                self.lang_idents.append("?lang={}".format(language))

    
    ##################
    # Utility functions
    ##################
    # Exclude downloadable files, pictures, etc from being scraped
    # List taken from ARGUS by datawizard1337 (I added xls and xlsx)
    # https://github.com/datawizard1337/ARGUS --> language prioritising was also inspired by ARGUS
    filetypes = set(['mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif', 'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg',
            'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',
            '3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv', 'm4a',
            'css', 'pdf', 'doc', 'exe', 'bin', 'rss', 'zip', 'rar', 'msu', 'flv', 'dmg', 'xls', 'xlsx',
            'mng?download=true', 'pct?download=true', 'bmp?download=true', 'gif?download=true', 'jpg?download=true', 'jpeg?download=true', 'png?download=true', 'pst?download=true', 'psp?download=true', 'tif?download=true', 'tiff?download=true', 'ai?download=true', 'drw?download=true', 'dxf?download=true', 'eps?download=true', 'ps?download=true', 'svg?download=true',
            'mp3?download=true', 'wma?download=true', 'ogg?download=true', 'wav?download=true', 'ra?download=true', 'aac?download=true', 'mid?download=true', 'au?download=true', 'aiff?download=true',
            '3gp?download=true', 'asf?download=true', 'asx?download=true', 'avi?download=true', 'mov?download=true', 'mp4?download=true', 'mpg?download=true', 'qt?download=true', 'rm?download=true', 'swf?download=true', 'wmv?download=true', 'm4a?download=true',
            'css?download=true', 'pdf?download=true', 'doc?download=true', 'exe?download=true', 'bin?download=true', 'rss?download=true', 'zip?download=true', 'rar?download=true', 'msu?download=true', 'flv?download=true', 'dmg?download=true'])

    def det_prio(self):
        '''Determines which URL should be scraped next'''
        not_scraped = list(self.link_set.difference(self.scraped))
        if not self.lang_codes:
            link_stack = sorted(not_scraped, key=len)
        else:
            correct_lang = []
            other_lang = []
            for link in not_scraped:
                if any(tag.lower() in link.lower() for tag in self.lang_idents):
                    correct_lang.append(link)
                else:
                    other_lang.append(link)
            # Sort urls that were not yet scraped by link length
            link_stack = sorted(correct_lang, key=len) + sorted(other_lang, key=len)
        self.to_scrape = link_stack[0]
    def extractLinks(self):      
        soup = BeautifulSoup(self.website.text,"html.parser")
        for a in soup.find_all('a', href=True):
           url = self.get_internalURL(a['href'])
           if url:
               # only include links whose file extension is not in the excluded filetypes
               # (mailto, tel and javascript links were already dropped by get_internalURL)
               if url.split(".")[-1].lower() not in self.filetypes:
                   self.link_set.add(url)
    def get_internalURL(self, url):
        #ignore javascript, mailto and telephone links
        pattern = re.compile("mailto:|tel:|javascript:", re.IGNORECASE)
        if url and not re.search(pattern, url):
            if self.domain in url:
                if self.accept_subdomains == False:
                    # Test whether link doesn't contain a subdomain
                    cleaned_url = url.lower().replace('http://','').replace('https://','').replace('www.','')
                    if cleaned_url.split('.')[0]==self.domain.split('.')[0]:
                        return url
                else:
                    return url
            elif url[0:2]=="./":
                return self.domain_link+url.replace('./','/')
            elif url[0]=="/":
                return self.domain_link+url
            elif "http" not in url:
                return self.domain_link+'/'+url

    ##################
    # Scraper
    ##################
    def url_scraping(self, user_agent, timeOutConnect=10,
                     timeOutRead=15, timeBetweenRequests=2):
        '''Scrapes link'''
        self.det_prio()
        print('Scraping', self.to_scrape)
        headers = {'user-agent': user_agent}
        self.scraped.add(self.to_scrape)
        try:
            self.website=requests.get(self.to_scrape, headers=headers,
                                      timeout=(timeOutConnect,timeOutRead))
            self.num_pages += 1
            self.scraped.add(self.website.url) # Add scraped url to scraped set (in case of redirect)
            print('Success in scraping page no', self.num_pages, 'of Domain:', self.domain)
            self.json['content'].update({str(self.num_pages): {'url': self.website.url,
                                                          'page': self.website.text,
                                                          'date': str(datetime.now())}})
            # I use website.url to obtain the url that was actually scraped 
            # (different to original url in case of redirects)
            if self.num_pages < self.max_pages:
                self.extractLinks()
            if self.num_pages >= self.max_pages or not self.link_set.difference(self.scraped):
                # Save json to MongoDB when max_pages is reached or no links are left for scraping
                try:
                    result = collectionName.insert_one(self.json)
                    print('Saved scraped pages to database')
                except Exception as e:
                    print(f'Error while saving into database occurred: {type(e)}: {e}')
                    
        except Exception as e:
            print('Error scraping', self.to_scrape)
            print(f'Error message: {type(e)}: {e}')
        finally:
            # Even if the current page produced an error, continue scraping if there are still page links left
            if self.link_set.difference(self.scraped) and self.num_pages<self.max_pages:
                # N second delay on purpose
                time.sleep(timeBetweenRequests)
                self.url_scraping(user_agent, timeOutConnect, timeOutRead, timeBetweenRequests)
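
The class above filters downloads by comparing the text after the last dot with the `filetypes` set, which is why each extension appears a second time with `?download=true` appended. As an alternative sketch (not the method used by the class), the extension can be taken from the URL path alone with the standard library, so that any query string is ignored. `BLOCKED_EXTENSIONS` and `has_blocked_extension` are hypothetical names, and the set is a shortened subset for illustration.

```python
from urllib.parse import urlparse
import posixpath

# Shortened, illustrative subset of the class's filetypes set (assumption)
BLOCKED_EXTENSIONS = {"pdf", "jpg", "png", "zip", "xls", "xlsx"}


def has_blocked_extension(url, blocked=BLOCKED_EXTENSIONS):
    """Return True if the URL path ends in a blocked file extension."""
    path = urlparse(url).path                 # leaves out the query string and fragment
    extension = posixpath.splitext(path)[1]   # e.g. '.pdf', or '' if there is none
    return extension.lstrip(".").lower() in blocked


print(has_blocked_extension("https://destatis.de/report.PDF?download=true"))
print(has_blocked_extension("https://destatis.de/EN/Home/_node.html"))
```

With this approach a new query-string variant does not require a new entry in the set, at the cost of parsing every URL.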

# 6. Start the web scraping.

### Variables to set:

collectionName - default database.websites - value after dot can be changed, e.g. database.myfirstcollection, database.wpc_20200301

max_pages - maximum number of pages on domain to be scraped

preferred_langs - list of ISO language codes that should be prioritised while scraping (can be omitted)

accept_subdomains - if set to True, the crawler also scrapes subdomains, which have the format subdomain.domain. For example, in https://www-genesis.destatis.de/genesis/online, "www-genesis" is a subdomain of destatis.de (this is where you can access the database of destatis).

userAgent - the name of the robot (should be changed to the name of your organization and the purpose of scraping)

timeBetweenRequests - set the time between requests - in seconds (suggested 3-5 seconds).

timeOutConnect - maximum time in seconds to connect to the website

timeOutRead - maximum time in seconds to read the website
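
To illustrate what accept_subdomains = False means in practice, the check inside `get_internalURL` can be reproduced as a standalone sketch. The function names `strip_scheme_and_www` and `is_same_host` are hypothetical; the logic mirrors the class, which compares the first dot-separated label of the cleaned URL with that of the domain.

```python
def strip_scheme_and_www(url):
    """Mirror the cleaning used by ScrapeDomain: lower-case, drop scheme and 'www.'."""
    return url.lower().replace('http://', '').replace('https://', '').replace('www.', '')


def is_same_host(url, domain):
    """Return True if the first host label of the URL equals that of the domain."""
    return strip_scheme_and_www(url).split('.')[0] == domain.split('.')[0]


domain = 'destatis.de'
print(is_same_host('https://www.destatis.de/EN/Home/_node.html', domain))      # True
print(is_same_host('https://www-genesis.destatis.de/genesis/online', domain))  # False
```

Links for which this check fails are skipped unless accept_subdomains is set to True.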


In [6]:
# Database collection to be used
collectionName = database.websites

# Parameters for ScrapeDomain class
max_pages = 10
preferred_langs = ['en']
accept_subdomains = False

# Parameters for scraping
userAgent='python-app/0.1 experimental for statistical purposes'
timeBetweenRequests=2
timeOutConnect=10
timeOutRead=15
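
The effect of preferred_langs on the crawl order can be illustrated with a standalone sketch of the ordering that `det_prio` applies: links containing a language identifier come first, and within each group shorter links come first. `order_links` is a hypothetical helper written for this illustration; the example links are taken from the sample output of this notebook.

```python
def order_links(links, lang_codes=None):
    """Order links as det_prio does: preferred-language links first, shortest first."""
    if not lang_codes:
        return sorted(links, key=len)
    idents = []
    for code in lang_codes:
        idents += ["/{}/".format(code), "/{}-{}/".format(code, code), "?lang={}".format(code)]
    preferred, others = [], []
    for link in links:
        if any(tag.lower() in link.lower() for tag in idents):
            preferred.append(link)
        else:
            others.append(link)
    return sorted(preferred, key=len) + sorted(others, key=len)


links = [
    'https://destatis.de/Europa/EN/Home/_node.html',
    'https://destatis.de',
    'https://destatis.de/EN/Home/_node.html',
]
for link in order_links(links, ['en']):
    print(link)
```

Without language codes the same list is ordered by length only, so the domain root would come first.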


In [7]:
# Loop through domain urls to scrape

for domain in file:
    URLScraping = ScrapeDomain(domain, 
                               max_pages = max_pages,
                               lang_codes = preferred_langs,
                               accept_subdomains = accept_subdomains)
    URLScraping.url_scraping(userAgent, timeOutConnect, timeOutRead, timeBetweenRequests)

Scraping https://stat.gov.pl
Error scraping https://stat.gov.pl
Error message: <class 'requests.exceptions.SSLError'>: HTTPSConnectionPool(host='stat.gov.pl', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))
Scraping https://destatis.de
Success in scraping page no 1 of Domain: destatis.de
Scraping https://destatis.de/EN/Home/_node.html
Success in scraping page no 2 of Domain: destatis.de
Scraping https://destatis.de/Europa/EN/Home/_node.html
Success in scraping page no 3 of Domain: destatis.de
Scraping https://destatis.de/EN/Service/Reporting-Online/_node.html
Success in scraping page no 4 of Domain: destatis.de
Scraping https://destatis.de/EN/Themes/Countries-Regions/International-Statistics/_node.html
Success in scraping page no 5 of Domain: destatis.de
Scraping https://destatis.de/EN/Home/_node.html;jsessionid=2A275C368E8569F427E9580611C84031.internet8

Remark: The language prioritising doesn't work as well for nsi.bg, because it expects the language tag to be enclosed by two slashes. I should probably change it so that a URL like http://nsi.bg/en is prioritised over http://nsi.bg/bg when 'en' is specified.
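
One possible way to address this remark is a regular expression that also accepts a language tag at the end of the path. The following is only a sketch under that assumption and is not implemented in the class; `lang_pattern` and `prefers_language` are hypothetical names, and example.org is a placeholder domain.

```python
import re


def lang_pattern(code):
    """Compile a pattern matching `code` as a path segment or as a lang= query value."""
    c = re.escape(code)
    # /en, /en/, /en-gb/ or /en?x=1, but not /england
    path_part = "/" + c + r"(?:-[a-z]{2})?(?=/|$|\?|#)"
    # ?lang=en or &lang=en, but not ?lang=english
    query_part = r"[?&]lang=" + c + r"(?=&|$|#)"
    return re.compile(path_part + "|" + query_part, re.IGNORECASE)


def prefers_language(url, code):
    """Return True if the URL appears to carry the given language tag."""
    return lang_pattern(code).search(url) is not None


print(prefers_language('http://nsi.bg/en', 'en'))
print(prefers_language('http://nsi.bg/bg', 'en'))
```

A check like this could replace the substring test in `det_prio`, so that http://nsi.bg/en lands in the preferred group.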

# 7. Retrieve saved website data from MongoDB
This part shows how the data is saved in the NoSQL database.

In [None]:
# Get the first saved website as an example
a_website = database.websites.find_one()

In [None]:
# Print the first page of the first saved domain (a dict with its url, HTML page and date)
print(a_website['content']['1'])
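
For orientation, the shape of a stored document can be sketched without a database connection. The dictionary below follows the structure that `ScrapeDomain` builds before calling `insert_one`; the page strings are placeholders rather than real scraped HTML, and a document read back from MongoDB additionally carries an `_id` field.

```python
from datetime import datetime

# Illustrative document in the shape built by ScrapeDomain (placeholder content)
a_website = {
    'domain': 'destatis.de',
    'content': {
        '1': {'url': 'https://destatis.de',
              'page': '<html>placeholder 1</html>',
              'date': str(datetime.now())},
        '2': {'url': 'https://destatis.de/EN/Home/_node.html',
              'page': '<html>placeholder 2</html>',
              'date': str(datetime.now())},
    },
}

# Page numbers are stored as string keys, so sort them numerically
for number, entry in sorted(a_website['content'].items(), key=lambda item: int(item[0])):
    print(number, entry['url'], len(entry['page']))

# The HTML of the first page only
first_page_html = a_website['content']['1']['page']
```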