# Essential skills for Internet crawling

## Regular expressions

Regular expressions (aka regex, regexp) are used to search for patterns in text. Machine-readable languages often (though not always) have a regular structure, or are at least unambiguous.

The obvious way is, of course, to let the machine parse the document and then process the result (as in the previous lab). But this often results in additional dependencies and significant memory and time overhead (which is OK for a single document, but won't work for millions).

#### Predefined Character Classes:

In [8]:
import re

string = "we have only 5 do11ars. This amount of $ is small. H0w should we sur-vive?"

# Raw strings (r"...") keep the backslashes intact for the regex engine.
# pattern = r"\d"  #: Matches any digit (0-9).
# pattern = r"\D"  #: Matches any non-digit.
# pattern = r"\w"  #: Matches any word character (a-z, A-Z, 0-9, _).
# pattern = r"\W"  #: Matches any non-word character.
# pattern = r"\s"  #: Matches any whitespace character (spaces, tabs, line breaks).
# pattern = r"\S"  #: Matches any non-whitespace character.

# print(pattern, end=": ",)
# print(re.findall(pattern, string))
# print(len(re.findall(pattern, string)))


# Extra point task -
#   1) return all the digits that are surrounded by whitespace:
#   2) return all the non-word characters that are surrounded by letters:
# pattern = r"\s\d\s"
pattern = r"[a-zA-Z]\W[a-zA-Z]"
print(pattern, end=": ", )
print(re.findall(pattern, string))
print(len(re.findall(pattern, string)))



[a-zA-Z]\W[a-zA-Z]: ['e h', 'e o', 's a', 't o', 's s', 'w s', 'd w', 'e s', 'r-v']
9
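A side note on the two extra-point tasks: the patterns above return the neighbouring characters together with the digit or non-word character, and a letter consumed by one match cannot start the next one. One possible alternative, sketched here on a short made-up string (`sample` is not part of the lab), is to check the neighbours with lookbehind and lookahead assertions, which test the context without consuming it.

```python
import re

# Short hypothetical sample, chosen so the difference is easy to see.
sample = "a-b c 7 d"

# Consuming pattern: the surrounding letters are part of each match,
# so the "b" used by the first match cannot start a second one.
consuming = re.findall(r"[a-zA-Z]\W[a-zA-Z]", sample)

# Lookaround pattern: the neighbours are checked but not consumed.
lookaround = re.findall(r"(?<=[a-zA-Z])\W(?=[a-zA-Z])", sample)

# The same idea for a digit surrounded by whitespace.
digits = re.findall(r"(?<=\s)\d(?=\s)", sample)

print(consuming)   # ['a-b']
print(lookaround)  # ['-', ' ']
print(digits)      # ['7']
```

Whether the surrounding characters should be part of the result depends on how the task is read, so treat this as one interpretation rather than the expected answer.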


#### Quantifiers:



In [14]:
import re

string = "W8! we have only 5 do11111ars. This amount of $ is small. How should we sur-vive?"

pattern = 'll*.'  #: Matches 0 or more of the preceding character or group. E.g., ab*c matches "ac", "abc", "abbc", etc.
#pattern = 'll+.'      #: Matches 1 or more of the preceding character or group. E.g., ab+c matches "abc", "abbc", but not "ac".
#pattern = 'll?.'      #: Matches 0 or 1 of the preceding character or group. E.g., ab?c matches "ac" and "abc", but not "abbc".
#pattern = 'll{0}.'    #: Matches exactly n of the preceding character or group.
#pattern = 'll{1,}.'   #: Matches n or more of the preceding character or group.
#pattern = 'do1{1,2}ar'  #: Matches between n and m of the preceding character or group.

# print(pattern, end=": ",)
# print(re.findall(pattern, string))
# print(len(re.findall(pattern, string)))

# Extra point task -
#  1) return all the words that contain digits in the middle (and then any word that has a digit):
#  2) return all the words that end with two 'l's:
# pattern2 = r"[a-zA-Z]+\d+[a-zA-Z]+"  # with digits in the middle
# pattern2 = r"[a-zA-Z]*\d+[a-zA-Z]*"  # a word that has a digit
pattern2 = r"\w*ll"  # appending \b would require the 'll' to sit at the end of a word
print(pattern2, end=": ", )
print(re.findall(pattern2, string))
print(len(re.findall(pattern2, string)))


\w*ll: ['small']
1
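The `ab*c`-style examples from the comments above can be checked directly. The following is a minimal sketch, assuming a small hand-made list of candidate strings (`candidates` is illustrative only); it uses `re.fullmatch` so that a pattern counts only when it covers the whole string.

```python
import re

# Hypothetical candidates differing only in the number of "b" characters.
candidates = ["ac", "abc", "abbc", "abbbc"]

patterns = {
    r"ab*c": "zero or more b",
    r"ab+c": "one or more b",
    r"ab?c": "zero or one b",
    r"ab{2}c": "exactly two b",
    r"ab{2,}c": "two or more b",
    r"ab{1,2}c": "one or two b",
}

results = {}
for pattern, meaning in patterns.items():
    # fullmatch() succeeds only if the whole candidate fits the pattern
    results[pattern] = [s for s in candidates if re.fullmatch(pattern, s)]
    print(f"{pattern:10} ({meaning}): {results[pattern]}")
```

Note that a quantifier applies only to the item directly before it: in `ll*.` the `*` affects the second `l` alone.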


#### Character Classes and Sets

In [20]:
string = "we have only 5 do11111ars. This amount of $ is small. How should we sur-vive?"

#pattern="[a1s]"   #: Matches any one character inside the brackets, so it matches either "a", "1", or "s".
#pattern="[^ah ]"  #: Matches any one character not inside the brackets.
#pattern="[a-z]"   #: Matches any lowercase letter.
#pattern="[A-Z]"   #: Matches any uppercase letter.
#pattern="[0-9]"   #: Matches any digit.
#pattern="(abc)"   #: Matches the exact characters "abc" and groups them together.
#pattern="(a|b)"   #: Matches either "a" or "b".


# print(pattern, end=": ",)
# print(re.findall(pattern, string))
# print(len(re.findall(pattern, string)))

#Extra point task - 
#  1) Find all words that start with a digit and end with a non-word character:

# string = "3dogs! 4cats. 5fishes But not this one."
# pattern = r"\d\w*\W"
# print(pattern, end=": ",)
# print(re.findall(pattern, string))
# print(len(re.findall(pattern, string)))


#  2) Find all sequences in a text that start with an uppercase letter,
# followed by one or more lowercase letters, and end with either the character 'a', '1', or 's':

string = "Apple1 Bananas TreeS Mango Orange1 Grapes Peachs Pear"
pattern = r"[A-Z][a-z]+[a1s]"
print(pattern, end=": ")
print(re.findall(pattern, string))
print(len(re.findall(pattern, string)))


[A-Z][a-z]+[a1s]: ['Apple1', 'Bananas', 'Orange1', 'Grapes', 'Peachs', 'Pea']
6
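Groups change what `re.findall` returns, which is easy to overlook. The sketch below, on a made-up string, shows the three cases: with one capturing group you get the group's content, with a non-capturing group `(?:...)` you get the whole match, and with several capturing groups you get one tuple per match.

```python
import re

# Hypothetical sample string.
sample = "ac bc dc"

# One capturing group: findall() returns what the group captured.
one_group = re.findall(r"(a|b)c", sample)

# Non-capturing group (?:...): findall() returns the whole match.
no_group = re.findall(r"(?:a|b)c", sample)

# Two capturing groups: findall() returns one tuple per match.
two_groups = re.findall(r"(a|b)(c)", sample)

print(one_group)   # ['a', 'b']
print(no_group)    # ['ac', 'bc']
print(two_groups)  # [('a', 'c'), ('b', 'c')]
```

This is worth keeping in mind when reading the output of any pattern that contains several parenthesized parts.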


### Find URLs/URIs vs parse the doc

Instead of building a DOM model and extracting `href` and `src` attributes, you may rely on the structure of the URL itself. Extract all URLs from [the page](https://math.stackexchange.com/questions/411486/understanding-the-singular-value-decomposition-svd) with a regexp. Your major tool is [re.findall(...)](https://docs.python.org/3/library/re.html#). You may also be interested in compiled regular expressions (if you reuse one).
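As a small illustration of reusing a compiled expression, the sketch below compiles one deliberately simplistic URL pattern and applies it to several documents. The HTML fragments in `pages` and the name `URL_PATTERN` are hypothetical; the pattern is not meant as a complete URL matcher.

```python
import re

# Hypothetical HTML fragments standing in for downloaded pages.
pages = [
    '<a href="https://example.com/a">A</a> <img src="/img/logo.png">',
    '<link href="https://example.com/style.css"> <a href="#top">up</a>',
]

# Compile once, then reuse the pattern object for every page.
# The pattern takes everything after the scheme up to whitespace, a quote or an angle bracket.
URL_PATTERN = re.compile(r'https?://[^\s"<>]+')

found = [URL_PATTERN.findall(page) for page in pages]
for urls in found:
    print(urls)
```

Note that relative links such as `/img/logo.png` and `#top` are not found at all, because the pattern relies on the `http(s)://` prefix.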

In [21]:
import re
import requests

url = "https://math.stackexchange.com/questions/" \
      "411486/understanding-the-singular-value-decomposition-svd"

text = requests.get(url).text

# my inspiration - 
# I took some example URL regexp from the internet, 
# specifically from here:
# https://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url
expressions = [
    # Raw strings: in a normal string Python reads "\b" as a backspace character,
    # so the word-boundary checks below would silently never match.
    r"(?:([A-Za-z]+):)?(\/{0,3})([0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?",
    r"(www|http:|https:)+[^\s]+[\w]",
    r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)",
    r"[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)?",
    r"(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})",
    # This one already doubles its backslashes, so it stays a normal string.
    "(?!mailto:)(?:(?:http|https|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?:(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[0-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,})))|localhost)(?::\\d{2,5})?(?:(/|\\?|#)[^\\s]*)?",
    r"https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&\/=]*)",
]

for expression in expressions:
    print()
    pattern = re.compile(expression)
    urls = pattern.findall(text)
    print(expression)
    print(urls[:10])


(?:([A-Za-z]+):)?(\/{0,3})([0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?
[('', '', 'DOCTYPE', '', '', '', ''), ('', '', 'html', '', '', '', ''), ('', '', 'html', '', '', '', ''), ('', '', 'itemscope', '', '', '', ''), ('', '', 'itemtype', '', '', '', ''), ('https', '//', 'schema.org', '', 'QAPage" class="html__responsive " lang="en">\r\n\r\n    <head>\r\n\r\n        <title>linear algebra - Understanding the singular value decomposition (SVD) - Mathematics Stack Exchange</title>\r\n        <link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/math/Img/favicon.ico', 'v=92addaa54d18">\r\n        <link rel="apple-touch-icon" href="https://cdn.sstatic.net/Sites/math/Img/apple-touch-icon.png?v=0ae50baa40ed">\r\n        <link rel="image_src" href="https://cdn.sstatic.net/Sites/math/Img/apple-touch-icon.png?v=0ae50baa40ed"> \r\n        <link rel="search" type="application/opensearchdescription+xml" title="Mathematics Stack Exchange" href="/opensearch.xml">\r\n    

Was this a success?

Compose your own minimalistic expression:

In [98]:
import re
import requests

url = "https://math.stackexchange.com/questions/" \
      "411486/understanding-the-singular-value-decomposition-svd"

text = requests.get(url).text

#print(text)

# write the regex for the domain and path to get the extra point
# example of a link: "https://cdn.sstatic.net/Sites/math/Img/apple-touch-icon.png?v=0ae50baa40ed"
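protocol = r"https?://"
domain = r"[\w\d\-\.]+\/"
path = r"[/\w\-\.]+"
# note: inside [...] an unescaped "-" between two characters forms a range,
# so "&-_" and "%-_" below match every character from "&" (or "%") up to "_"
args = r"(?:\?[\w\=&-_;\[\]]+)?"
hashtail = r"(?:#[\w$%-_;]+)?"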

expression = protocol + domain + path + args + hashtail
pattern = re.compile(expression)
regexp_urls = pattern.findall(text)
print(regexp_urls[:20])

['https://schema.org/QAPage', 'https://cdn.sstatic.net/Sites/math/Img/favicon.ico?v=92addaa54d18', 'https://cdn.sstatic.net/Sites/math/Img/apple-touch-icon.png?v=0ae50baa40ed', 'https://cdn.sstatic.net/Sites/math/Img/apple-touch-icon.png?v=0ae50baa40ed', 'https://math.stackexchange.com/questions/411486/understanding-the-singular-value-decomposition-svd', 'https://math.stackexchange.com/questions/411486/understanding-the-singular-value-decomposition-svd', 'https://cdn.sstatic.net/Sites/math/Img/apple-touch-icon', 'https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js', 'https://cdn.sstatic.net/Js/third-party/npm/', 'https://cdn.sstatic.net/Js/stub.en.js?v=98d9f62851ed', 'https://cdn.sstatic.net/Shared/stacks.css?v=85feea4d403d', 'https://cdn.sstatic.net/Sites/math/primary.css?v=61593710227a', 'https://cdn.sstatic.net/Shared/Channels/channels.css?v=a4d77abedec3', 'https://cdn.sstatic.net/Js/third-party/citation-helper.js?v=2591ce444a3f', 'https://cdnjs.cloudflare.com/ajax/lib

# Streams and files

When you deal with big files, you should take care of RAM usage. Today 1 GB won't surprise anyone on a desktop, but server machines that run crawlers may be tightly constrained in this resource.

Using streams instead of RAM-cached files is a good strategy.

- Look for solution here: https://stackoverflow.com/a/16696317
- Look for the sample big file here: http://xcal1.vodafone.co.uk/
- Read about python memory measurement here: https://pythonspeed.com/articles/measuring-memory-python/

In [66]:
import psutil, gc


def get_mem():
    return psutil.Process().memory_info().rss

In [67]:
large_file_url = "http://212.183.159.230/100MB.zip"

First, download the file the simple way:

In [68]:
gc.collect()
print("Resident set size:", get_mem())
data = requests.get(large_file_url).content
print("Resident set size:", get_mem())

with open('100-RAM', 'wb') as f:
    f.write(data)

print("Resident set size:", get_mem())
data = None
gc.collect()
print("Resident set size:", get_mem())

Resident set size: 86278144
Resident set size: 190570496
Resident set size: 190570496
Resident set size: 85708800


And then use the streaming mode of the `requests` library.

Hint: if you have the request open as `r`, for example:
```
with requests.get(url, stream=True) as r:
```
You can use the line:
```
shutil.copyfileobj(r.raw, f)
```
to write the data to a new file as a stream instead of loading it all at once.


(Don't forget to check that the request succeeded!)


In [82]:
import requests
import shutil


def download_file(url, path):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(path, 'wb') as f:
            shutil.copyfileobj(r.raw, f)


gc.collect()
print("Resident set size:", get_mem())
download_file(large_file_url, "100-stream")
print("Resident set size:", get_mem())

Resident set size: 111181824
Resident set size: 111181824
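The resident set size stays flat because the data passes through memory in small pieces. The sketch below shows roughly the kind of loop that `shutil.copyfileobj` performs; `copy_in_chunks` is a hypothetical helper written for illustration, and in-memory `io.BytesIO` objects stand in for the network response and the output file so that the example runs offline.

```python
import io


def copy_in_chunks(src, dst, chunk_size=64 * 1024):
    """Copy src to dst chunk by chunk; return (total bytes, largest chunk)."""
    total = 0
    largest = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty bytes object means end of stream
            break
        dst.write(chunk)
        total += len(chunk)
        largest = max(largest, len(chunk))
    return total, largest


# In-memory stand-ins for the network response and the output file.
source = io.BytesIO(b"x" * 1_000_000)
target = io.BytesIO()

total, largest = copy_in_chunks(source, target)
print(total, largest)  # 1000000 65536
```

However large the source is, the loop holds at most one chunk of `chunk_size` bytes at a time, which is the property the streaming download relies on.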


# BeautifulSoup

Plain text HTML is a mixture of content, markup, and code. Extracting structure, or URLs, or plain text might be tricky with regular expressions. 

Building a DOM model is slow, but may save a lot of code and keep you from mistakes.

## Extract all sentences
For indexing and semantic analysis we can use different levels of granularity. Often a sentence is a good choice.

In [71]:
! pip install nltk

Defaulting to user installation because normal site-packages is not writeable
Collecting nltk
  Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting joblib
  Using cached joblib-1.3.2-py3-none-any.whl (302 kB)
Collecting regex>=2021.8.3
  Using cached regex-2023.8.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (771 kB)
Collecting tqdm
  Using cached tqdm-4.66.1-py3-none-any.whl (78 kB)
Installing collected packages: tqdm, regex, joblib, nltk
Successfully installed joblib-1.3.2 nltk-3.8.1 regex-2023.8.8 tqdm-4.66.1


In [72]:
import nltk

nltk.download('punkt')


[nltk_data] Downloading package punkt to /home/kamil/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Hint: you can use:
```
tokenize.sent_tokenize(p)
```
to return the list of sentences in a paragraph

In [83]:
from bs4 import BeautifulSoup
from bs4.element import Comment
from nltk import tokenize

doc_url = "https://math.stackexchange.com/questions/" \
          "411486/understanding-the-singular-value-decomposition-svd"

text = requests.get(doc_url).text
dom = BeautifulSoup(text, 'html.parser')  # Specify the parser to avoid any potential warning
paragraphs = [p.strip() for p in dom.text.split('\n') if p.strip()]

sents = []
for p in paragraphs:
    sents += tokenize.sent_tokenize(p)

print(sents[90:100])


['I know they are square roots of eigenvalues of $\\textbf{A}^{\\textrm{T}}\\textbf{A}$.', "What I don't understand is the meaning?", 'I know if I e.g.', 'take covariance matrix and diagonalize it, I end up with eigenvalues (or maximum/unique/?singular?', 'values) in a diagonal matrix representing variances.', 'SVD however is product of three matrices: outer product o, singular values, inner product of A.', "But I still don't see the meaning of all this.", '$\\endgroup$', '–\xa0Celdor', 'Jun 5, 2013 at 2:55']


## Extract URLs from nodes

Be careful with relative links. How would you process them?

You can use `urllib.parse` to merge two links, e.g.:

```
joined_url = urllib.parse.urljoin(url1, url2)
```
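To get a feeling for how `urljoin` treats different kinds of links, here is a short sketch. The `href` values are hypothetical examples of what a page might contain; the base is the document URL used throughout this lab.

```python
from urllib.parse import urljoin

base = ("https://math.stackexchange.com/questions/"
        "411486/understanding-the-singular-value-decomposition-svd")

# Hypothetical href values of the kinds found in real pages.
hrefs = [
    "/help",                            # relative to the site root
    "related",                          # relative to the current path
    "//cdn.sstatic.net/Js/stub.en.js",  # scheme-relative
    "https://stackoverflow.com/legal",  # already absolute
    "#comments",                        # fragment only
]

resolved = {href: urljoin(base, href) for href in hrefs}
for href, full in resolved.items():
    print(f"{href!r:40} -> {full}")
```

Absolute URLs pass through unchanged, so calling `urljoin` on every link is one possible way to avoid special-casing; whether fragment-only links should count as separate URLs is a design decision for the crawler.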

In [101]:
import urllib.parse

all_hrefs = dom.find_all('a', href=True)
all_urls = set()

for a in all_hrefs:
    url = a['href']
    # startswith() is safe for an empty href, unlike url[0]
    if url.startswith('/'):
        url = urllib.parse.urljoin(doc_url, url)
    all_urls.add(url)

all_urls = list(all_urls)
all_urls[:10]

['https://math.stackexchange.com/help',
 'https://math.stackexchange.com/questions/411486/understanding-the-singular-value-decomposition-svd',
 'https://stackexchange.com/sites#professional',
 'https://math.stackexchange.com/',
 'https://math.stackexchange.com/questions/4315564/singular-value-decomposition-of-symmetric-matrix',
 'https://meta.stackexchange.com/questions/392048/our-design-vision-for-stack-overflow-and-the-stack-exchange-network',
 'https://math.stackexchange.com/questions/704238/singular-value-decomposition-of-rank-1-matrix',
 'https://stackoverflow.com/legal',
 'https://math.stackexchange.com/q/4718815',
 'https://math.stackexchange.com/users/38053/julien']

Discuss the next result:

In [102]:
print("|DOM ∩ REGX| =", len(set(all_urls) & set(regexp_urls)))
print("|DOM \\ REGX| =", len(set(all_urls) - set(regexp_urls)))
print("|REGX \\ DOM| =", len(set(regexp_urls) - set(all_urls)))

|DOM ∩ REGX| = 41
|DOM \ REGX| = 98
|REGX \ DOM| = 59


# Unique file name

Please never try to convert a domain (`google.com`) or a path component (`/index.php`) into a filename. They are not unique!

Also, it is better not to substitute the special characters of the full URL (`/:`...): you will definitely forget one. You may also easily exceed the maximum file name length.

A nice way is to use hash strings with a fixed length and character set. Compute hash strings from the previous list.
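To see why character substitution is fragile, consider the following sketch. Both helper functions and both URLs are hypothetical: `naive_name` replaces every character outside a small safe set with `_`, while `hashed_name` uses SHA-256 of the full URL.

```python
import hashlib
import re


def naive_name(url):
    """Replace every character that is not a letter, digit or dot with '_'."""
    return re.sub(r"[^A-Za-z0-9.]", "_", url)


def hashed_name(url):
    """Fixed-length name made of hex digits, derived from the full URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


# Two different (hypothetical) URLs that differ in a single character.
url_a = "http://example.com/a/b"
url_b = "http://example.com/a_b"

print(naive_name(url_a), naive_name(url_b))    # identical names: a collision
print(hashed_name(url_a) == hashed_name(url_b))  # False
print(len(hashed_name(url_a)))                 # 64
```

The substituted names collide, while the hashed names differ and always have 64 hexadecimal characters regardless of the URL length.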

In [105]:
import hashlib

for url in all_urls[:20]:
    s = hashlib.sha256(url.encode()).hexdigest()
    print(s, url[:15] + "..." + url[-15:])

a5cc359892fa656ce65a1c7c20ac57afa4c692a7d8de3b3d7d3990d2185baabf https://math.st...change.com/help
528f2428d8cb51fcfff1ae391819ad45c852ec58b2a708462a2c28d57bc7f60d https://math.st...composition-svd
5127882c52094af9fa7263d26ad5a3996575432df85b4a84477092d229eac028 https://stackex...es#professional
2a5bd98544149d360b5bab831c4a1493e13c57a6d8a932d6033a26320ca1a60b https://math.st...ckexchange.com/
b21503e2b4d457ce8af62b0eced5b4133e0a102d212911c304c6ce8242900305 https://math.st...ymmetric-matrix
dde05676d2e73238d6b1c2c9688f47d7a3e4931e81e7ffcecb3c66fac918c001 https://meta.st...xchange-network
6d8f4231171c22462edc7a79e7d45b8aa6c283915b5991fd7eec545c0f1ca2c3 https://math.st...f-rank-1-matrix
907f1a66da444e9f511881e242aa45743da3c9b622e99a4a4c266c1979b6d411 https://stackov...rflow.com/legal
60825c068c1b5d5399ec11393a5b94aa0cd4ca0f22f3130efc3dd571a3050535 https://math.st...e.com/q/4718815
dac689438e0c51a7bf22d0bd87bfe8b1e98a05274131dadae078495bd0e1dad8 https://math.st...rs/38053/julien
d700f665fb