# Essential skills for Internet crawling

## Regular expressions

Regular expressions (aka regex, regexp) are used to search for patterns. Machine-readable languages often (though not always) have a regular structure, or are at least unambiguous.

The obvious way is, of course, to let the machine parse the document and then process the result (as in the previous lab). But this often results in additional dependencies and significant memory and time overhead (which is OK for a single document, but won't work for millions).

#### Predefined Character Classes:

In [13]:
import re

string = "we have only 5 do11ars. This amount of $ is small. H0w should we sur-vive?"

# Raw strings (r"...") keep the backslash intact for the regex engine.
# Only the last assignment is used below; comment out the others to try each class.
pattern = r"\d"  # Matches any digit (0-9).
pattern = r"\D"  # Matches any non-digit.
pattern = r"\w"  # Matches any word character (a-z, A-Z, 0-9, _).
pattern = r"\W"  # Matches any non-word character.
pattern = r"\s"  # Matches any whitespace character (spaces, tabs, line breaks).
pattern = r"\S"  # Matches any non-whitespace character.

print(pattern, end=": ",)
print(re.findall(pattern, string))
print(len(re.findall(pattern, string)))


# Extra point task -
#   1) return all the digits that are surrounded by whitespace:
#   2) return all the non-word characters that are surrounded by letters:
# pattern1 = r"\s\d\s"
# pattern2 = r"\w\W\w"
# print(pattern1, end=": ")
# print(re.findall(pattern1, string))
# print(len(re.findall(pattern1, string)))



\S: ['w', 'e', 'h', 'a', 'v', 'e', 'o', 'n', 'l', 'y', '5', 'd', 'o', '1', '1', 'a', 'r', 's', '.', 'T', 'h', 'i', 's', 'a', 'm', 'o', 'u', 'n', 't', 'o', 'f', '$', 'i', 's', 's', 'm', 'a', 'l', 'l', '.', 'H', '0', 'w', 's', 'h', 'o', 'u', 'l', 'd', 'w', 'e', 's', 'u', 'r', '-', 'v', 'i', 'v', 'e', '?']
60
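
Because `pattern` is reassigned six times in the cell above, only the last class (`\S`) is actually reported. A minimal sketch that loops over all six classes on the same string could look like this; the `classes` dictionary and the `counts` name are illustrative and not part of the original lab.

```python
import re

string = "we have only 5 do11ars. This amount of $ is small. H0w should we sur-vive?"

# Illustrative mapping of each predefined class to a short description.
classes = {
    r"\d": "digit",
    r"\D": "non-digit",
    r"\w": "word character",
    r"\W": "non-word character",
    r"\s": "whitespace",
    r"\S": "non-whitespace",
}

counts = {}
for pattern, meaning in classes.items():
    matches = re.findall(pattern, string)
    counts[pattern] = len(matches)
    print(f"{pattern} ({meaning}): {len(matches)} matches")
```

Each class and its uppercase counterpart are complements, so their counts add up to `len(string)`.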


#### Quantifiers:



In [14]:
import re

string = "W8! we have only 5 do11111ars. This amount of $ is small. How should we sur-vive?"

pattern = 'll*.'      #: Matches 0 or more of the preceding character or group. E.g., ab*c matches "ac", "abc", "abbc", etc.
pattern = 'll+.'      #: Matches 1 or more of the preceding character or group. E.g., ab+c matches "abc", "abbc", but not "ac".
pattern = 'll?.'      #: Matches 0 or 1 of the preceding character or group. E.g., ab?c matches "ac" and "abc", but not "abbc".
pattern = 'll{0}.'    #: Matches exactly n of the preceding character or group.
pattern = 'll{1,}.'   #: Matches n or more of the preceding character or group.
pattern = 'do1{1,2}ar'  #: Matches between n and m of the preceding character or group.

print(pattern, end=": ",)
print(re.findall(pattern, string))
print(len(re.findall(pattern, string)))

# Extra point task -
#  1) return all the words that contain digits in the middle (and then any word that has a digit):
#  2) return all the words that end with two 'l's:
# pattern1 = r"\w+\d+\w+"
pattern2 = r"\w+l{2}"
print(pattern2, end=": ",)
print(re.findall(pattern2, string))
print(len(re.findall(pattern2, string)))


do1{1,2}ar: []
0
\w+l{2}: ['small']
1
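
The examples given in the comments (`ab*c`, `ab+c`, `ab?c`) can be checked directly. The sketch below uses `re.fullmatch` on a small, made-up list of candidate strings; `candidates`, `patterns` and `results` are illustrative names.

```python
import re

# Made-up candidates with zero, one, two and three "b" characters.
candidates = ["ac", "abc", "abbc", "abbbc"]

patterns = ["ab*c", "ab+c", "ab?c", "ab{2}c", "ab{2,}c", "ab{1,2}c"]

# For each pattern, keep the candidates that it matches in full.
results = {
    p: [s for s in candidates if re.fullmatch(p, s)]
    for p in patterns
}

for p, matched in results.items():
    print(f"{p:10} -> {matched}")

# Why the cell above prints an empty list: "do11111ars" has five 1s,
# and {1,2} allows at most two before the following "ar".
bounded = re.findall("do1{1,2}ar", "do11111ars")
unbounded = re.findall("do1+ar", "do11111ars")
print(bounded, unbounded)
```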


#### Character Classes and Sets

In [15]:
string = "we have only 5 do11111ars. This amount of $ is small. How should we sur-vive?"

pattern = "[a1s]"   # Matches any one character inside the brackets, so it matches either "a", "1", or "s".
pattern = "[^ah ]"  # Matches any one character not inside the brackets.
pattern = "[a-z]"   # Matches any lowercase letter.
pattern = "[A-Z]"   # Matches any uppercase letter.
pattern = "[0-9]"   # Matches any digit.
pattern = "(abc)"   # Matches the exact characters "abc" and groups them together.
pattern = "(a|b)"   # Matches either "a" or "b".


print(pattern, end=": ",)
print(re.findall(pattern, string))
print(len(re.findall(pattern, string)))

#Extra point task - 
#  1) Find all words that start with a digit and end with a non-word character:

# string = "3dogs! 4cats. 5fishes But not this one."
# pattern = "" 
# print(pattern, end=": ",)
# print(re.findall(pattern, string))
# print(len(re.findall(pattern, string)))


#  2) Find all sequences in a text that start with an uppercase letter,
# followed by one or more lowercase letters, and end with either the character 'a', '1', or 's':

# string = "Apple1 Bananas TreeS Mango Orange1 Grapes Peachs Pear"
# pattern = "" 
# print(pattern, end=": ",)
# print(re.findall(pattern, string))
# print(len(re.findall(pattern, string)))


(a|b): ['a', 'a', 'a', 'a']
4
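
Parentheses also change what `re.findall` returns: with one capturing group it returns the group contents, with several groups it returns tuples, and with a non-capturing group `(?:...)` it returns the whole match. A small sketch on a made-up string (`text` and the result names are illustrative):

```python
import re

text = "abc abd"

# One capturing group: findall returns only what the group captured.
one_group = re.findall(r"ab(c|d)", text)

# Non-capturing group: findall returns the whole match.
no_group = re.findall(r"ab(?:c|d)", text)

# Two capturing groups: findall returns a tuple per match.
two_groups = re.findall(r"(a)(b)", text)

print(one_group)
print(no_group)
print(two_groups)
```

This matters once a pattern has several groups, because the result is then a list of tuples rather than a list of matched strings.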


### Find URLs/URIs vs parse the doc

Instead of building a DOM model and extracting `href` and `src` attributes, you may rely on the structure of the URL itself. Extract all URLs from [the page](https://math.stackexchange.com/questions/411486/understanding-the-singular-value-decomposition-svd) with a regexp. Your major tool is [re.findall(...)](https://docs.python.org/3/library/re.html#). You may also be interested in compiled regular expressions (if you reuse one).
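
As a rough illustration of reusing a compiled expression, the sketch below compiles one pattern and applies it to several strings. The `pages` list is a hypothetical stand-in for downloaded documents, and the pattern is deliberately simplistic rather than a complete URL grammar.

```python
import re

# Hypothetical snippets standing in for downloaded pages.
pages = [
    '<a href="https://example.com/a">A</a>',
    '<img src="http://example.org/img.png">',
    '<p>no links here</p>',
]

# Compile once, reuse many times.
# Simplistic assumption: a URL is the scheme plus everything up to
# whitespace, a quote or an angle bracket.
url_pattern = re.compile(r"https?://[^\s\"'<>]+")

found = [url_pattern.findall(page) for page in pages]
print(found)
```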

In [16]:
import re
import requests

url = "https://math.stackexchange.com/questions/"\
        "411486/understanding-the-singular-value-decomposition-svd"

text = requests.get(url).text

# my inspiration - 
# I took some example URL regexp from the internet, 
# specifically from here:
# https://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url
expressions = [
    "(?:([A-Za-z]+):)?(\/{0,3})([0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?",
    "(www|http:|https:)+[^\s]+[\w]",
    "https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)",
    "[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)?",
    "(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})",
    "(?!mailto:)(?:(?:http|https|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?:(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[0-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,})))|localhost)(?::\\d{2,5})?(?:(/|\\?|#)[^\\s]*)?",
    "https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&\/=]*)",
]

for expression in expressions:
    print()
    pattern = re.compile(expression)
    urls = pattern.findall(text)
    print(expression)
    print(urls[:10])


(?:([A-Za-z]+):)?(\/{0,3})([0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?
[('', '', 'DOCTYPE', '', '', '', ''), ('', '', 'html', '', '', '', ''), ('', '', 'html', '', '', '', ''), ('', '', 'itemscope', '', '', '', ''), ('', '', 'itemtype', '', '', '', ''), ('https', '//', 'schema.org', '', 'QAPage" class="html__responsive " lang="en">\r\n\r\n    <head>\r\n\r\n        <title>linear algebra - Understanding the singular value decomposition (SVD) - Mathematics Stack Exchange</title>\r\n        <link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/math/Img/favicon.ico', 'v=92addaa54d18">\r\n        <link rel="apple-touch-icon" href="https://cdn.sstatic.net/Sites/math/Img/apple-touch-icon.png?v=0ae50baa40ed">\r\n        <link rel="image_src" href="https://cdn.sstatic.net/Sites/math/Img/apple-touch-icon.png?v=0ae50baa40ed"> \r\n        <link rel="search" type="application/opensearchdescription+xml" title="Mathematics Stack Exchange" href="/opensearch.xml">\r\n    

Was this a success?

Compose your own minimalistic expression:

In [17]:
import re
import requests

url = "https://math.stackexchange.com/questions/"\
        "411486/understanding-the-singular-value-decomposition-svd"

text = requests.get(url).text

#print(text)

# write the regex for the domain and path to get the extra point
# example of a link: "https://cdn.sstatic.net/Sites/math/Img/apple-touch-icon.png?v=0ae50baa40ed"
# raw strings keep the backslashes intact for the regex engine
protocol = r"https?://"
domain = r"[\w\-\.]+"
path = r"[/\w\-\.]*"
args = r"(?:\?[\w\=&-_;\[\]]+)?"
hashtail = r"(?:#[\w$%-_;]+)?"

expression = protocol + domain + path + args + hashtail
pattern = re.compile(expression)
regexp_urls = pattern.findall(text)
print(regexp_urls[:20])

['https://schema.org/QAPage', 'https://cdn.sstatic.net/Sites/math/Img/favicon.ico?v=92addaa54d18', 'https://cdn.sstatic.net/Sites/math/Img/apple-touch-icon.png?v=0ae50baa40ed', 'https://cdn.sstatic.net/Sites/math/Img/apple-touch-icon.png?v=0ae50baa40ed', 'https://math.stackexchange.com/questions/411486/understanding-the-singular-value-decomposition-svd', 'https://math.stackexchange.com/questions/411486/understanding-the-singular-value-decomposition-svd', 'https://cdn.sstatic.net/Sites/math/Img/apple-touch-icon', 'https://cdn.sstatic.net/', 'https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js', 'https://cdn.sstatic.net/Js/third-party/npm/', 'https://cdn.sstatic.net/Js/stub.en.js?v=98d9f62851ed', 'https://cdn.sstatic.net/Shared/stacks.css?v=85feea4d403d', 'https://cdn.sstatic.net/Sites/math/primary.css?v=61593710227a', 'https://cdn.sstatic.net/Shared/Channels/channels.css?v=a4d77abedec3', 'https://cdn.sstatic.net/Js/third-party/citation-helper.js?v=2591ce444a3f', 'https://c
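
One detail worth checking in the `args` and `hashtail` patterns above: inside a character class, a hyphen between two characters denotes a range, so `&-_` covers every character from `&` to `_` rather than three literals. The sketch below compares the unescaped and escaped forms on made-up probe strings; all names in it are illustrative.

```python
import re

unescaped = re.compile(r"[&-_]")   # range from '&' (0x26) to '_' (0x5F)
escaped = re.compile(r"[&\-_]")    # exactly three literals: & - _

# Characters that usually delimit a URL in HTML, plus a digit and a letter.
probe = "<>'5A"

range_hits = unescaped.findall(probe)
literal_hits = escaped.findall(probe)
print(range_hits)
print(literal_hits)

# The escaped form still matches the three intended characters.
intended = escaped.findall("a&b-c_d")
print(intended)
```

Whether the wider range changes the extracted URLs depends on the page, so it is worth comparing both variants on real input.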

# Streams and files

When you deal with big files, you should take care of the RAM. Today 1 GB won't surprise anyone on a desktop, but server machines that run crawlers may be optimized for resource usage.

Using streams instead of RAM-cached files is a good strategy.

- Look for solution here: https://stackoverflow.com/a/16696317
- Look for the sample big file here: http://xcal1.vodafone.co.uk/
- Read about python memory measurement here: https://pythonspeed.com/articles/measuring-memory-python/

In [18]:
import psutil, gc 

def get_mem():
    return psutil.Process().memory_info().rss

In [19]:
large_file_url = "http://212.183.159.230/100MB.zip"

First, download the file the simple way:

In [20]:
gc.collect()
print("Resident set size:", get_mem())
data = requests.get(large_file_url).content
print("Resident set size:", get_mem())

with open('100-RAM', 'wb') as f:
    f.write(data)

print("Resident set size:", get_mem())
data = None
gc.collect()
print("Resident set size:", get_mem())

Resident set size: 91357184
Resident set size: 196259840
Resident set size: 196259840
Resident set size: 91402240


And then use the streaming mode of the `requests` library.

Hint: if you have the request open as `r`, such as:
```
with requests.get(url, stream=True) as r:
```
You can use the line:
```
shutil.copyfileobj(r.raw, f)
```
to write the data to a new file as a stream instead of loading it all at once.


(Don't forget to check that the request is valid!)


In [21]:
import requests
import shutil

def download_file(url, destination):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(destination, 'wb') as f:
            shutil.copyfileobj(r.raw, f)

gc.collect()
print("Resident set size:", get_mem())
download_file(large_file_url, "100-stream")
print("Resident set size:", get_mem())

Resident set size: 91402240
Resident set size: 91402240
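
The flat memory reading comes from copying in fixed-size chunks, so only one chunk needs to be held at a time. A minimal offline sketch of such a loop is shown here; `io.BytesIO` objects stand in for `r.raw` and the output file, and `copy_in_chunks` and `CHUNK_SIZE` are hypothetical names, not part of `requests` or `shutil`.

```python
import io
import shutil

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size in bytes


def copy_in_chunks(source, destination, chunk_size=CHUNK_SIZE):
    """Copy a readable binary stream to a writable one, chunk by chunk.

    Returns the total number of bytes copied and the largest chunk seen.
    """
    copied = 0
    largest = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        destination.write(chunk)
        copied += len(chunk)
        largest = max(largest, len(chunk))
    return copied, largest


payload = b"x" * (1024 * 1024)  # 1 MiB of made-up data
destination = io.BytesIO()
copied, largest = copy_in_chunks(io.BytesIO(payload), destination)
print(copied, largest)

# shutil.copyfileobj does the same kind of chunked copy.
reference = io.BytesIO()
shutil.copyfileobj(io.BytesIO(payload), reference, CHUNK_SIZE)
```

Here the whole payload lives in memory anyway because it is an in-memory stream; with a network response and a real file, only the current chunk does.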


# BeautifulSoup

Plain text HTML is a mixture of content, markup, and code. Extracting structure, or URLs, or plain text might be tricky with regular expressions. 

Building a DOM model is slow, but may save a lot of code and keep you from mistakes.

## Extract all sentences
For indexing and semantic analysis we use different granularities. Often a sentence is a good choice.

In [22]:
! pip install nltk
import nltk
nltk.download("punkt")

Defaulting to user installation because normal site-packages is not writeable


[nltk_data] Downloading package punkt to /home/kamil/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [23]:
from bs4 import BeautifulSoup
from bs4.element import Comment
from nltk import tokenize

doc_url = "https://math.stackexchange.com/questions/"\
        "411486/understanding-the-singular-value-decomposition-svd"

text = requests.get(doc_url).text
dom = BeautifulSoup(text, "html.parser")  # name the parser explicitly to avoid a bs4 warning
paragraphs = [p.strip() for p in dom.text.split('\n') if p.strip()]

sents = []
for p in paragraphs:
    sents += tokenize.sent_tokenize(p)
    
print(sents[90:100])

['I know they are square roots of eigenvalues of $\\textbf{A}^{\\textrm{T}}\\textbf{A}$.', "What I don't understand is the meaning?", 'I know if I e.g.', 'take covariance matrix and diagonalize it, I end up with eigenvalues (or maximum/unique/?singular?', 'values) in a diagonal matrix representing variances.', 'SVD however is product of three matrices: outer product o, singular values, inner product of A.', "But I still don't see the meaning of all this.", '$\\endgroup$', '–\xa0Celdor', 'Jun 5, 2013 at 2:55']
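
Note the split after `e.g.` in the output above: abbreviations are a classic difficulty for sentence segmentation. The sketch below is a naive regex splitter, not what NLTK does internally, and it shows the same failure mode on a sample adapted from that output; `naive_sentences` and `sample` are illustrative names.

```python
import re


def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


sample = (
    "I know if I e.g. take covariance matrix and diagonalize it. "
    "What does it mean? I still don't see it."
)

sentences = naive_sentences(sample)
for s in sentences:
    print(repr(s))
```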


# Extract URLs from nodes

Be careful with relative links. How would you process them?

In [24]:
import urllib.parse

all_hrefs = dom.find_all('a', href=True)
all_urls = set()

for a in all_hrefs:
    url = a["href"]
    url = urllib.parse.urljoin(doc_url, url)
    all_urls.add(url)

all_urls = list(all_urls)
all_urls[:10]

['https://math.stackexchange.com/users/35472/mhenni-benghorbal',
 'https://stackexchange.com/sites#culturerecreation',
 'https://math.stackexchange.com/a/3283853',
 'https://data.stackexchange.com/',
 'https://math.stackexchange.com/users/signup?ssrc=hero&returnurl=https%3a%2f%2fmath.stackexchange.com%2fquestions%2f411486%2funderstanding-the-singular-value-decomposition-svd',
 'https://math.stackexchange.com/users/signup?ssrc=head&returnurl=https%3a%2f%2fmath.stackexchange.com%2fquestions%2f411486%2funderstanding-the-singular-value-decomposition-svd',
 'https://math.stackexchange.com/posts/3242024/timeline',
 'https://math.meta.stackexchange.com',
 'https://math.stackexchange.com/posts/411486/timeline',
 'https://math.stackexchange.com/users/login?ssrc=site_switcher&returnurl=https%3a%2f%2fmath.stackexchange.com%2fquestions%2f411486%2funderstanding-the-singular-value-decomposition-svd']
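
To see how `urllib.parse.urljoin` treats different kinds of links, here is a small sketch with made-up `href` values resolved against the document URL. The `hrefs` list and the `resolved` name are illustrative; `urldefrag` is shown as one possible way to drop the `#fragment` part when two URLs should count as the same page.

```python
from urllib.parse import urljoin, urldefrag

doc_url = ("https://math.stackexchange.com/questions/"
           "411486/understanding-the-singular-value-decomposition-svd")

# Made-up href values covering the common cases.
hrefs = [
    "/users/login",                     # root-relative path
    "timeline",                         # relative to the current "directory"
    "//cdn.sstatic.net/x.js",           # scheme-relative
    "https://data.stackexchange.com/",  # already absolute
    "#comment",                         # fragment only
]

resolved = [urljoin(doc_url, href) for href in hrefs]
for href, url in zip(hrefs, resolved):
    print(f"{href!r:40} -> {url}")

# Strip the fragment so that page#a and page#b compare equal.
without_fragment, fragment = urldefrag("https://stackexchange.com/sites#culturerecreation")
print(without_fragment, fragment)
```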

Discuss the following result:

In [25]:
print("|DOM ∩ REGX| =", len(set(all_urls) & set(regexp_urls)))
print("|DOM \\ REGX| =", len(set(all_urls) - set(regexp_urls)))
print("|REGX \\ DOM| =", len(set(regexp_urls) - set(all_urls)))

|DOM ∩ REGX| = 52
|DOM \ REGX| = 86
|REGX \ DOM| = 62
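
As a starting point for the discussion, the following sketch reproduces the effect on a tiny made-up HTML snippet. It swaps in the standard-library `HTMLParser` for BeautifulSoup so that it runs without extra dependencies, and uses a simplified URL pattern; `sample_html`, `base_url` and `HrefCollector` are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

base_url = "https://example.com/questions/1"
sample_html = (
    '<a href="/users/1">relative link</a>'
    '<a href="https://example.com/page">absolute link</a>'
    '<script src="https://example.com/app.js"></script>'
)


class HrefCollector(HTMLParser):
    """Collect href values of <a> tags only, like find_all('a', href=True)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


collector = HrefCollector()
collector.feed(sample_html)

# DOM side: anchors only, with relative links resolved.
dom_urls = {urljoin(base_url, href) for href in collector.hrefs}
# Regex side: anything in the raw text that looks like an absolute URL.
regex_urls = set(re.findall(r"https?://[\w\-.]+[/\w\-.]*", sample_html))

print("both:      ", dom_urls & regex_urls)
print("DOM only:  ", dom_urls - regex_urls)
print("regex only:", regex_urls - dom_urls)
```

In this toy case the DOM-only URL is a resolved relative link, and the regex-only URL comes from a `src` attribute that the anchor search never visits.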


# Unique file name

Please, never try to convert a domain (`google.com`) or a path component (`/index.php`) into a filename. They are not unique!

It is also better not to substitute the special characters of the full URL (`/:`...): you will definitely forget one. You may also easily exceed the maximum file name length.

A nice way is to use hash strings with a fixed length and character set. Compute hash strings from the previous list.
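
To illustrate why substitution is fragile, the sketch below compares a naive character replacement with a SHA-1 digest on two made-up URLs. `sanitized_name` and `hashed_name` are hypothetical helpers, and the replacement rule is just one example of a scheme someone might try.

```python
import hashlib
import re


def sanitized_name(url):
    """Naive approach: replace every character outside [A-Za-z0-9_.-] with '_'."""
    return re.sub(r"[^\w.\-]", "_", url)


def hashed_name(url):
    """Fixed-length, filesystem-safe name: 40 hexadecimal characters."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


# Two different made-up URLs.
url_a = "https://example.com/a/b"
url_b = "https://example.com/a_b"

print(sanitized_name(url_a), sanitized_name(url_b))  # the two names collide
print(hashed_name(url_a))
print(hashed_name(url_b))
```

The digest is one-way, so if the original URL is needed later it has to be stored separately, for example in an index that maps hashes to URLs.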

In [26]:
import hashlib

for url in all_urls[:20]:
    s = hashlib.sha1(url.encode("utf-8")).hexdigest()
    print(s, url[:15] + "..." + url[-15:])

14cae2a34115ade11b8d04a176e80c6fa6d6d19c https://math.st...enni-benghorbal
26d1efe6d30a87bf2ef9632d071e27d290da6ee9 https://stackex...lturerecreation
4c503f4728e89a5bc88d7ace1ae51d5f524ddedc https://math.st...e.com/a/3283853
a4a70f3831e9838a1749a380175c47ccbc7cd1c2 https://data.st...ckexchange.com/
9625c6bae49474d74fd9053edd6fc418f46ef890 https://math.st...composition-svd
4cdb43765692872f778544154b17aa3583e9c143 https://math.st...composition-svd
563a7e7dd58b7d8804cacb267ac004ed8f1aef31 https://math.st...242024/timeline
9c8c77289ac89aca7890ded12b955220eb6aed0c https://math.me...ackexchange.com
bf49f5ca4da237a0928e1e8d707fc5802770ecd0 https://math.st...411486/timeline
d0175b49a3f4d3d023dab9e5a107a48634104ca7 https://math.st...composition-svd
1110b7e19857f9453abaa59c3859ea356ad9a2a6 https://ell.sta...mell-like-sweat
7d86b074441292cf1d8abcd5099ae070cae78bec https://math.st...on?noredirect=1
86860182649069c8cf3fd53be5244d1f6b80104a https://twitter...m/stackoverflow
fd40c62904aab7a72327f10af