# URL Scraper Starter Kit
## Structure of this Starter Kit

1. Source code (in Python) - library and application
2. Jupyter Notebook files (ipynb) including manuals inside
3. Example files - data with urls - url.txt

### What does this module do?
The URLScraper takes a URL and scrapes the page behind it. Optionally, it also scrapes internal links of the same domain, up to a configurable maximum number of pages. When choosing which internal links to scrape, it prioritizes links tagged with a preferred language (if one is defined) and shorter links. The scraped data is then saved to a NoSQL database (MongoDB).

URL list in files -> URLScraper -> Websites in NoSQL collections for further processing
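
The link prioritization described above can be illustrated with a small sketch. This is not the code used inside `ScrapeDomain`; the function name `prioritise_links`, the `(url, lang)` input format, and the `limit` parameter are assumptions made for illustration only. It ranks links with a preferred language tag first and breaks ties by URL length.

```python
def prioritise_links(links, preferred_langs=None, limit=5):
    """Order candidate links: preferred-language links first, then shorter URLs.

    links is a list of (url, lang) pairs, where lang is the language tag found
    on the link (or None if the link carries no tag). Hypothetical helper.
    """
    preferred = set(preferred_langs or [])

    def sort_key(item):
        url, lang = item
        # 0 sorts before 1, so preferred-language links come first;
        # within each group, shorter URLs come first.
        return (0 if lang in preferred else 1, len(url))

    ranked = sorted(links, key=sort_key)
    return [url for url, _ in ranked[:limit]]


candidates = [
    ("https://example.org/de/ueber-uns/organisation", "de"),
    ("https://example.org/en/about", "en"),
    ("https://example.org/contact", None),
    ("https://example.org/en/press/releases/2020", "en"),
]
selected = prioritise_links(candidates, preferred_langs=["en"], limit=3)
for url in selected:
    print(url)
```

With `preferred_langs=["en"]`, both English links are selected before the shorter untagged link, and the German link is dropped because of the limit.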

### Prerequisites
A data source containing the URLs to scrape is needed. It can be an iterable like a list, or a data frame column containing URLs. For this starter kit, we use an input file that is line-separated and looks like this:

http://stat.gov.pl

http://destatis.de

http://www.nsi.bg

Running this application takes five steps:

1. Import libraries
2. Create a connection to mongodb server
3. Set the database name
4. Set the file name of URLs to import
5. Start the web scraping

# 1. Import libraries.
If any of these libraries are missing, install them into your Python environment with pip, pip3, conda or easy_install. See the manual for details.
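
Before running the import cell, it can help to check which third-party libraries are missing. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is an assumption for illustration. Note that `bs4` is the import name of the BeautifulSoup package.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Third-party libraries imported by this notebook
required = ["pymongo", "requests", "bs4"]
missing = missing_modules(required)
if missing:
    print("Please install:", ", ".join(missing))
else:
    print("All required libraries are available")
```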

In [1]:
import os
import sys
sys.path.insert(0, os.path.abspath('../src/'))
from DomainScraper import ScrapeDomain

from pymongo import MongoClient
import requests
from bs4 import BeautifulSoup
import re
from datetime import datetime
import time

# 2. Create a connection to mongodb server. 

Replace the values below with your own.

### Variables to set:

host - replace with the IP address or name of the server, e.g. 192.168.1.1 or serverdb.domain.com

port - the port number; the MongoDB default is 27017

In [2]:
host='localhost'
port=27017
# define the client connection
# host - default localhost
# port - default 27017
client=MongoClient('mongodb://'+str(host)+":"+str(port))

# 3. Set the database name.

### Variable to set:

dbname - if the database does not exist it will be created.

In [3]:
dbname='URLScraping'
try:
    # Selecting a database does not contact the server yet; it is created on the first write
    database=client[dbname]
except Exception:
    print('Error connecting the database', sys.exc_info()[0])
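
`MongoClient` connects lazily, so neither creating the client nor selecting the database shows whether the server is reachable. The sketch below pings the server with a short timeout. The helper names `build_mongo_uri` and `server_is_reachable` and the 3000 ms timeout are assumptions for illustration, not part of the starter kit.

```python
def build_mongo_uri(host="localhost", port=27017):
    """Build the same URI string the notebook passes to MongoClient."""
    return "mongodb://" + str(host) + ":" + str(port)


def server_is_reachable(uri, timeout_ms=3000):
    """Return True if a MongoDB server answers a ping within timeout_ms."""
    # Imported here so the URI helper also works without pymongo installed.
    from pymongo import MongoClient
    from pymongo.errors import PyMongoError

    client = MongoClient(uri, serverSelectionTimeoutMS=timeout_ms)
    try:
        client.admin.command("ping")  # forces a round trip to the server
        return True
    except PyMongoError:
        return False
    finally:
        client.close()


uri = build_mongo_uri("localhost", 27017)
print(uri)
```

Calling `server_is_reachable(uri)` before step 5 lets a wrong host or port fail early instead of in the middle of a scraping run.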

# 4. Import the file containing URLs to scrape. 
The input is a line-separated file of URLs, as described in the prerequisites.

### Variable to set:

filename - the name of the file, e.g. url.txt

In [4]:
filename='url.txt'
file=open(filename,'r')
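
As an alternative to keeping the file handle open, the URLs can be read into a list up front. The sketch below is an assumption about how one might do this, not the kit's own loader: the hypothetical helper `load_urls` closes the file, strips the trailing newline from each line, and skips blank lines. The demo writes a throwaway file so that it runs on its own.

```python
import os
import tempfile


def load_urls(path):
    """Return the non-empty, whitespace-stripped lines of a URL file."""
    with open(path, "r", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Demo with a temporary file that mimics url.txt, including blank lines
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "url.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("http://stat.gov.pl\n\nhttp://destatis.de\n\nhttp://www.nsi.bg\n")
    urls = load_urls(sample_path)

print(urls)
```

The resulting list can be passed to `enumerate` in the same way as the open file.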

# 5. Start the web scraping.

### Variables to set:

collectionName - default database.websites - value after dot can be changed, e.g. database.myfirstcollection, database.wpc_20200301

max_pages - maximum number of pages on domain to be scraped

preferred_langs - list of ISO language codes that should be prioritised while scraping (can be omitted)

accept_subdomains - if set to True, the crawler also scrapes subdomains, which have the format subdomain.domain. For example, in https://www-genesis.destatis.de/genesis/online, "www-genesis" is a subdomain of destatis.de (this is where you can access the destatis database)

userAgent - the name of the robot (should be changed to the name of your organization and the purpose of scraping)

timeBetweenRequests - set the time between requests - in seconds (suggested 3-5 seconds).

timeOutConnect - maximum time in seconds to connect to the website

timeOutRead - maximum time in seconds to read the website
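
To show how the request parameters above typically fit together, here is a minimal sketch. It does not reproduce `ScrapeDomain.url_scraping`, whose internals are not shown in this notebook; the function names `fetch_with_requests` and `fetch_politely` are assumptions for illustration. The `requests` library accepts the connect and read timeouts as a `(connect, read)` tuple, and the pause between requests is a plain sleep.

```python
import time


def fetch_with_requests(url, user_agent, connect_timeout, read_timeout):
    """Fetch one page, identifying the robot and limiting connect/read time."""
    import requests  # imported lazily so the offline demo below runs without it

    response = requests.get(url,
                            headers={"User-Agent": user_agent},
                            timeout=(connect_timeout, read_timeout))
    response.raise_for_status()
    return response.text


def fetch_politely(urls, fetch, delay_seconds, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing delay_seconds between requests."""
    pages = {}
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay_seconds)  # no pause needed before the first request
        try:
            pages[url] = fetch(url)
        except Exception as error:
            pages[url] = None
            print("Error scraping", url, "-", type(error).__name__)
    return pages


# Offline demo: a fake fetch function and a recorded (not real) sleep
def fake_fetch(url):
    return "<html>" + url + "</html>"


pauses = []
pages = fetch_politely(["http://a.example", "http://b.example"],
                       fake_fetch, delay_seconds=2, sleep=pauses.append)
print(len(pages), "pages fetched with pauses", pauses)
```

For real requests, pass something like `lambda url: fetch_with_requests(url, userAgent, timeOutConnect, timeOutRead)` as the `fetch` argument.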


In [5]:
# Database collection to be used
collectionName = database.websites

# Parameters for ScrapeDomain class
max_pages = 1
preferred_langs = ['en']
accept_subdomains = False

# Parameters for scraping
userAgent='python-app/0.1 experimental for statistical purposes'
timeBetweenRequests=2
timeOutConnect=10
timeOutRead=15


In [6]:
# Loop through domain urls to scrape

for index, domain in enumerate(file, 1):
    URLScraping = ScrapeDomain(domain = domain,
                               index = index,
                               max_pages = max_pages,
                               lang_codes = preferred_langs,
                               accept_subdomains = accept_subdomains)
    URLScraping.url_scraping(userAgent, collectionName, timeOutConnect, timeOutRead, timeBetweenRequests)

Scraping https://stat.gov.pl
Error scraping https://stat.gov.pl
Error message: <class 'requests.exceptions.SSLError'>: HTTPSConnectionPool(host='stat.gov.pl', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))
Scraping https://destatis.de
Success in scraping page no 1 of Domain: destatis.de
Saved scraped pages to database
Scraping https://nsi.bg
Success in scraping page no 1 of Domain: nsi.bg
Saved scraped pages to database


# 6. Retrieve saved website data from MongoDB
This part shows how the data is saved in the NoSQL database

In [7]:
# Get the first saved website as an example
a_website = database.websites.find_one()

In [8]:
# Print the first page of the first saved domain (a dict with the page URL and its HTML code)
print(a_website['content']['1'])

{'url': 'https://www.destatis.de/DE/Home/_inhalt.html', 'page': '<!doctype html>\n<html lang="de">\n<head>\n  <base href="https://www.destatis.de/"/>\n  <meta charset="UTF-8"/>\n  <title>Startseite  -  Statistisches Bundesamt</title>\n  <meta name="title" content="Startseite"/>\n  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.5, user-scalable=1"/>\n  <meta name="generator" content="Government Site Builder"/>\n  \n  \n  \n    <meta name="keywords" content="Amtliche Statistik, Pressemitteilung, Publikation, Statistik, Statistisches Bundesamt / Deutschland, Tabelle"/>\n    <meta name="description" content="Internetangebot des Statistischen Bundesamtes mit aktuellen Informationen, Publikationen, Zahlen und Fakten der amtlichen Statistik"/>\n\n  \n\n\n\n\n\n\n\n<meta property="og:site_name" content="Statistisches Bundesamt"/>\n<meta property="og:type" content="website"/>\n<meta property="og:title" content="Startseite"/>\n<meta property="og:description
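
Each stored page is a dictionary with the keys `url` and `page`, as the output above shows. As a sketch of further processing, the following extracts the page title from such a record. It uses only the standard library so that it runs anywhere; BeautifulSoup, which the notebook already imports, would be a natural alternative. The names `TitleParser` and `page_title` are hypothetical, and the record is a shortened copy of the output above.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_title(page_record):
    """Return the whitespace-normalized title of a stored page record."""
    parser = TitleParser()
    parser.feed(page_record["page"])
    return " ".join(parser.title.split())


# Shortened copy of the record printed above
record = {
    "url": "https://www.destatis.de/DE/Home/_inhalt.html",
    "page": '<!doctype html>\n<html lang="de">\n<head>\n'
            '  <title>Startseite  -  Statistisches Bundesamt</title>\n'
            '</head>\n</html>',
}
print(record["url"], "->", page_title(record))
```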