# Parse me if you can #

Sometimes when crawling we have to parse websites that turn out to be SaaS, i.e., there is a special JS application that is downloaded first and then renders the documents. As a result, the data to be rendered initially arrives in a proprietary format. Google Drive is one example. Last time we downloaded and parsed some files from GDrive; however, we didn't parse GDrive-specific file formats, such as Google Sheets or Google Slides.

Today we will learn to obtain and parse such data using Selenium - a special framework for testing web-applications.

## 1. Getting started

Let's try to load and parse the page the way we did before:

In [1]:
import requests, os 
from bs4 import BeautifulSoup
resp = requests.get("https://docs.google.com/presentation/d/1LuZvz3axBD8UuHLagdv0EbhsGEWJmpd7gN5KjwYCp9Y/edit?usp=sharing")
soup = BeautifulSoup(resp.text, 'lxml')
print(soup.body.text[:1000])

Не удалось открыть файл, поскольку в вашем браузере отключено использование JavaScript. Включите его и перезагрузите страницу.Некоторые функции PowerPoint не поддерживаются в Google Презентациях. Они будут удалены, если вы измените документ.Подробнее…6. Approximate nearest neighbours search 2. Trees   Смотреть  Открыть доступВойтиИспользуемая вами версия браузера больше не поддерживается. Установите поддерживаемую версию браузера.Закрытьdocument.getElementById('docs-unsupported-browser-bar').addEventListener('click', function () {this.parentNode.parentNode.removeChild(this.parentNode);return false;});ФайлИзменитьВидСправкаСпециальные возможностиОтладкаНесохраненные изменения: ДискПоследние изменения      Специальные возможности  Только просмотр     DOCS_timing['che'] = new Date().getTime();DOCS_timing['chv'] = new Date().getTime();Презентация в виде HTML(function(){/*

 Copyright The Closure Library Authors.
 SPDX-License-Identifier: Apache-2.0
*/
var a=this||self;function b(){this.a=t

As we see, the output is not what we expect: the page reports (in Russian) that the file could not be opened because JavaScript is disabled in the browser, and the rest is mostly menu labels and script code. So, what can we do when a page is not loaded right away, but is rendered by a script? Browser engines can help us get the data. Let's load the same web page in a different way: we give the browser some time to load and run the scripts, and then work with the DOM (Document Object Model) obtained from the browser engine itself, not from BeautifulSoup.
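
To see why a plain HTTP download is not enough, here is a minimal sketch using only the standard library. The page in `RAW_HTML` is made up for illustration (it is not the real Google Slides markup), and `VisibleTextExtractor` and `visible_text` are hypothetical helper names. The text we actually want exists only inside a script that never runs, so a static parser cannot see it.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while we are inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a static parser can see, without executing any script."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: the real content would be created by the script at run time.
RAW_HTML = """
<html><body>
  <noscript>Please enable JavaScript.</noscript>
  <div id="app"></div>
  <script>document.getElementById('app').textContent = 'Forests of search trees';</script>
</body></html>
"""

print(visible_text(RAW_HTML))
```

Only the fallback message is printed. The slide title would appear in the DOM only after a browser engine has executed the script.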

Where do we get a browser engine from? Simply installing a browser will do. How do we send commands to it from code and retrieve the DOM? Service applications called drivers interpret our commands and translate them into browser actions.


To support a browser engine you will need to:
1. install the browser itself;
2. download a 'driver', a binary executable that passes commands from Selenium to the browser, e.g. [geckodriver for Firefox](https://github.com/mozilla/geckodriver/releases), [ChromeDriver](http://chromedriver.storage.googleapis.com/index.html);
3. unpack the driver into a folder listed in the PATH environment variable, or specify the exact location of the binary.
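
Before launching a browser it can help to check that the driver binary is actually reachable. The sketch below uses only the standard library. `locate_driver` is a hypothetical helper name, and `chromedriver` is assumed to be the name of the binary on your system.

```python
import os
import shutil


def locate_driver(name="chromedriver", explicit_path=None):
    """Return a usable path to the driver binary, or None if it cannot be found."""
    if explicit_path is not None:
        # An explicit location wins, but only if it is an executable file.
        if os.path.isfile(explicit_path) and os.access(explicit_path, os.X_OK):
            return os.path.abspath(explicit_path)
        return None
    # Otherwise search the folders listed in the PATH environment variable.
    return shutil.which(name)


print(locate_driver("chromedriver"))
```

If this prints `None`, the driver is neither on PATH nor at the given location, and starting the browser from code will most likely fail.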

### Download driver

Download the driver and place it in any folder, ideally one listed in the PATH environment variable.

### Install & Import selenium

In [2]:
# !pip3 install -U selenium
from selenium import webdriver
import time

from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

### Launch browser

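This will open a browser window.

In [3]:
# browser = webdriver.Chrome()
# browser = webdriver.Safari()
# Selenium 4 takes the driver location through a Service object
from selenium.webdriver.chrome.service import Service
browser = webdriver.Chrome(service=Service('./chromedriver'))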

### Download the page

In [4]:
# navigate to page
browser.get('http://tiny.cc/00dhkz')
browser.implicitly_wait(15)  # element lookups will wait up to 15 seconds

# select all text parts from document
elements = browser.find_elements(By.CSS_SELECTOR, "g.sketchy-text-content-text")
# note: if this number differs from launch to launch, extend the wait time
print("Elements found:", len(elements))

# oh no! The words do not come back as one clean line!
print("What if just a silly approach:", elements[0].text)

# GDrive stores all text blocks word-by-word
subnodes = elements[0].find_elements(By.CSS_SELECTOR, "text")
text = " ".join(n.text for n in subnodes)
print("What if a smart approach:", text)

Elements found: 110
What if just a silly approach: Forests
of
search
trees
What if a smart approach: Forests of search trees
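
The word-by-word storage can be reproduced offline. The sketch below parses a made-up SVG fragment (an assumption that only imitates the structure seen above, not real Google Slides markup) with the standard library and joins the `text` nodes the same way. `join_words` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up fragment imitating a slide text block: one <text> node per word.
SVG = """
<g class="sketchy-text-content-text">
  <text>Forests</text>
  <text>of</text>
  <text>search</text>
  <text>trees</text>
</g>
"""


def join_words(svg_fragment):
    """Join the content of all <text> nodes with single spaces."""
    root = ET.fromstring(svg_fragment.strip())
    words = [(node.text or "").strip() for node in root.iter("text")]
    return " ".join(w for w in words if w)


print(join_words(SVG))
```

Joining the nodes explicitly makes the result independent of how the browser chooses to serialize the parent element's text.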


In [5]:
browser.quit()

### Problems
- Too slow: we have to wait for the browser to open and then to render the page

### Headless mode
Browsers (at least [FF](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode), [Chrome](https://intoli.com/blog/running-selenium-with-headless-chrome/), IE) have a headless mode: no window is rendered. This means it should work much faster!

In [6]:
chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('--kiosk') #for mac os (maximize window)
chrome_options.add_argument('headless')

In [7]:
# Selenium 4 takes the driver location through Service and the options through `options`
from selenium.webdriver.chrome.service import Service
browser = webdriver.Chrome(service=Service('./chromedriver'), options=chrome_options)  # open browser
browser.maximize_window()

# navigate to page
browser.get('http://tiny.cc/00dhkz')
browser.implicitly_wait(5)  # element lookups will wait up to 5 seconds

# select all text parts from document
elements = browser.find_elements(By.CSS_SELECTOR, "g.sketchy-text-content-text")
# note: if this number differs from launch to launch, extend the wait time
print("Elements found:", len(elements))

# oh no! It adds NEW LINE. Behavior differs!!!!
print("What if just a silly approach:", elements[0].text)

# GDrive stores all text blocks word-by-word
subnodes = elements[0].find_elements(By.CSS_SELECTOR, "text")
text = " ".join(n.text for n in subnodes)
print("What if a smart approach:", text)

browser.quit()  # close browser


Elements found: 95
What if just a silly approach: Forests
of
search
trees
What if a smart approach: Forests of search trees


### NB 
Note that browser behavior differs for the same code: the first run found 110 elements, while the headless run found 95.
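
One way to make the element count less dependent on timing is to poll until it stops changing. This is a minimal sketch in plain Python, not a Selenium feature: `wait_until_stable` is a hypothetical helper, and the fake counter below stands in for a real element lookup.

```python
import time


def wait_until_stable(count_fn, timeout=15.0, poll=0.5, stable_polls=3):
    """Poll count_fn() until it returns the same value stable_polls times in a row.

    Returns the last observed value; gives up once timeout seconds have passed.
    """
    deadline = time.monotonic() + timeout
    last = count_fn()
    same = 1
    while same < stable_polls and time.monotonic() < deadline:
        time.sleep(poll)
        current = count_fn()
        same = same + 1 if current == last else 1
        last = current
    return last


# Fake counter imitating a page that is still rendering (made-up numbers).
fake_counts = iter([10, 40, 95, 110, 110, 110, 110])
print(wait_until_stable(lambda: next(fake_counts), poll=0.0))

# With a live session (assuming a `browser` object like the one above) it could be used as:
# wait_until_stable(lambda: len(browser.find_elements(By.CSS_SELECTOR,
#                                                     "g.sketchy-text-content-text")))
```

The helper trades a little extra waiting for repeatable results; the timeout keeps it from looping forever on a page that never settles.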

## 2. Task
Our lectures usually have lots of links. Here are the links to the original versions of the documents.

[4. Vector space](https://docs.google.com/presentation/d/1UxjGZPPrPTM_3lCa_gWTk8yZI_qNmTKwtMxr8JZQCIc/edit?usp=sharing)

[6. search trees](https://docs.google.com/presentation/d/1LuZvz3axBD8UuHLagdv0EbhsGEWJmpd7gN5KjwYCp9Y/edit?usp=sharing)

[7-8. Web basics](https://docs.google.com/presentation/d/1bgsCgpjMcQmrFpblRI6oH9SnG4bjyo5SzSSdKxxHNlg/edit?usp=sharing)

Please complete the following tasks:

### 2.1 Inverted index for slides with numbers
I want to type a word and be told which slides of which lecture contain it.
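
As a warm-up, the idea can be sketched on toy data. The slide texts below are made up, and the layout (word mapped to a set of `(lecture, slide_number)` pairs) is just one simple option, not the required solution. `build_index` is a hypothetical name.

```python
from collections import defaultdict

# Made-up slide texts: {lecture name: {slide number: text}}.
lectures = {
    "lecture6": {1: "Forests of search trees", 2: "Quad trees and KD trees"},
    "lecture7": {1: "Web basics", 2: "Search engines crawl the web"},
}


def build_index(lectures):
    """Map every lowercased word to the set of (lecture, slide) pairs containing it."""
    index = defaultdict(set)
    for lecture, slides in lectures.items():
        for slide_num, text in slides.items():
            for word in text.lower().split():
                index[word].add((lecture, slide_num))
    return index


index = build_index(lectures)
print(sorted(index["search"]))
```

A real index would also need tokenization, stop-word removal and stemming, which this sketch deliberately skips.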

Loading and parsing text and images from google slides

In [8]:
from selenium.webdriver.chrome.service import Service

def getTextAndImgsFromSlides(url):
    slides_text = dict()  # dictionary slide_num : slide_text
    img_list = []  # list of image urls

    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--kiosk')  # for mac os (maximize window)
    chrome_options.add_argument('headless')  # hide browser

    # Initialize the Chrome webdriver (Selenium 4 style) and open the URL
    browser = webdriver.Chrome(service=Service('./chromedriver'), options=chrome_options)
    browser.get(url)
    browser.implicitly_wait(5)  # element lookups will wait up to 5 seconds

    action = ActionChains(browser).send_keys(Keys.PAGE_DOWN)  # action to scroll down the page

    prev_url = browser.current_url  # each slide has a unique url when active
    # loop through all the slides, from top to bottom
    i = 1
    while True:
        elements = browser.find_elements(By.CSS_SELECTOR, "div#slides-view")  # the slide display part
        print(i)
        # extract content only for the current slide (elements normally has length 1)
        # extract the slide text, presenter notes and images if any
        for el in elements:
            if el.is_displayed():
                text = " ".join(n.text for n in el.find_elements(By.CSS_SELECTOR, "text")).strip()
                print(text)
                slides_text[i] = text
                i += 1
                images = el.find_elements(By.TAG_NAME, 'image')  # could be more specific
                img_list += [img.get_attribute('href') for img in images if img.is_displayed()]

        action.perform()  # move to next slide
        # if scrolling down did not change the url, we have reached the end of the document
        if browser.current_url == prev_url:
            print("Reached the END")
            break
        prev_url = browser.current_url
        time.sleep(0.2)  # wait a bit for the current slide content to load

    browser.quit()
    return slides_text, img_list

Parsing three presentations

In [10]:
links = ["https://docs.google.com/presentation/d/1UxjGZPPrPTM_3lCa_gWTk8yZI_qNmTKwtMxr8JZQCIc/edit?usp=sharing", 
         "https://docs.google.com/presentation/d/1LuZvz3axBD8UuHLagdv0EbhsGEWJmpd7gN5KjwYCp9Y/edit?usp=sharing",
         "https://docs.google.com/presentation/d/1bgsCgpjMcQmrFpblRI6oH9SnG4bjyo5SzSSdKxxHNlg/edit?usp=sharing"]

all_imgs = dict()
all_texts = dict()

for i, link in enumerate(links,1):
    texts, imgs = getTextAndImgsFromSlides(link)
    all_imgs[f'presentation{i}'] = imgs
    all_texts[f'presentation{i}'] = texts



1
Vector space modelling with ML Stanislav Protasov and Albina Khusainova
2
Refresh ● Term-document matrix ● Vector space model ● Distributive semantics ● Latent space ● LSA, PCA, SVD
3
Agenda ● LSA - what is missing ● Neural networks solving embedding task ○ word|doc 2vec ○ DSSM ○ BERT
4
LSA critics Speed issue . Even optimized SVD is slow and requires memory and CPU time ● Fast Randomized SVD (facebook) ● Alternating Least Squares (ALS). Distributed and streaming versions Model issue . PCA assumes normal data distribution, but life is complicated ● pLSA . Statistical independence ( any distribution ) vs linear orthogonality. Roots in statistics What about adding new words/texts? Can we take some model and don’t care about distributions, statistics, memory and so? Существенным недостатком метода является значительное снижение скорости вычисления при увеличении объёма входных данных (например, при SVD-преобразовании) скорость вычисления соответствует порядку N(doc+term)^(2*k) Предполаг

11
QuadTree insertion #1 function insert (p) { if (!this. boundary.containsPoint (p)) return false; // object cannot be added if (this. points.size < QT_NODE_CAPACITY && northWest == null ) { this.points. append (p); return true; } if (this.northWest == null) this. subdivide (); if (this.northWest->insert(p)) return true; if (this.northEast->insert(p)) return true; if (this.southWest->insert(p)) return true; if (this.southEast->insert(p)) return true; } idea
12
QuadTree insertion #2 A \ B \ C - small rotation A \ B / C - big rotation
13
QuadTree deletion <<… In fact, it seems that one cannot do better than to reinsert all of the stranded nodes , one by one, into the new tree. This answer is not very satisfactory, and it is a matter of some interest whether there exists any merging algorithm that works faster than n log n , where n is the total number of nodes in the two trees to be merged…>> Common problem for all DS with balance
14
QuadTree optimization By an optimized tree we will me

35
See also M-trees R-trees and R*-trees Octree …
Reached the END
1
Web basics Stanislav Protasov and Albina Khusainova
2
Agenda ● Internet in a nutshell ● Web 1.0 … 4.0 ○ Idea of the Web ○ Web transport and formats ○ Browsers and DOM ● Crawling basics ○ robots.txt, sitemap ○ Terms and Conditions ○ APIs
3
Ok, Google, what is internet?
4
IP protocol and address http://192.34.57.61/ http://0xC022393D/ http://3223468349/ http://meduza.io:@3223468349/ IPv6 - 128 bits Task – create URL that will lead you Zones and fixed ips: localhost/loopback, broadcast, /*.*.*.255, any IP – 0.0.0.0 Адреса, используемые в локальных сетях, относят к частным. К частным относятся IP-адреса из следующих сетей: 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16 Также для внутреннего использования: 127.0.0.0/8 169.254.0.0/16 — используется для автоматической настройки сетевого интерфейса в случае отсутствия DHCP . When you test your site on local machine or server, be sure that IP will not change (DHCP, delegation). Parall

20
HTTP HTTP (hypertext transfer protocol) – application (7) level protocol to deliver text data. Created to transfer hypertext. Provide communication between client (usually browser) and server (web-server) using client requests and server responses. HTTP v1.0 – does not support using single TCP session for multiple requests. Supports following client request methods: • GET – get content from the server • HEAD – get only header from the server without content ("what to expect") • POST – sent data to the server HTTP v1.1. – supports also PUT, DELETE, TRACE, OPTIONS, CONNECT, PATCH HTTP/2 - SDPY (Google) based update. Binary. Header compression, Server pushes, conveyor requests, request multiplexing over single TCP Придуман для передачи html, но используется далеко не только для. При этом, конечно, схлопатывает оверхэд – данные кодятся в текст (base64, UTF-7 например – 8/6 в лучшем случае). Одна из особенностей клиент серверной архитектуры в том, что сервер («обслуживающий») отвечает на

35
COMMON FACTS ABOUT AUTHENTICATION If server returns 401, this means it wants to authenticate you. Server must send WWW-Authenticate header to you. HTTP/1.0 401 Unauthorized Cache-Control: no-cache Pragma: no-cache Content-Length: 58 Content-Type: text/html Expires: -1 Server: Microsoft-IIS/8.0 WWW-Authenticate: Basic realm=“ area to be accessed ” WWW-Authenticate – это т.н. Challenge – задача клиенту на прохождение аутентификации NB 401 Unauthorized – это неправильно. На самом деле не аутентифицирован 403 Forbidden – это и есть Unauthorized Realm – это то, что показывают пользователю, чтобы он знал, что вводить.
36
BASIC AUTHENTICATION Easiest way to setup authentication GET /sometail.aspx HTTP/1.1 Host: somehost Authorization: Basic bG9naW46cGFzc3cwcmQ= where “ bG9naW46cGFzc3cwcmQ= ” == base64(“ login:passw0rd ”) NB: • Login and password are not secured in fact! Only way to use – over HTTPS • You can send this without challenge • With each request Показать сайт http://www.httpwatch

53
BROWSER ENGINES = LAYOUT ENGINE + JS +… • Good article about browser architecture • Browser Layout Engine (html + css) ○ Trident (IE), " Edge " (Spartan) → Chromium (2019) ○ Gecko (Mozilla) ○ WebKit (Safari, Chromium-family), WebCore ■ Blink (Chrome 28+, Opera 15+, Chrome for Android) ○ Others (KHTML, Presto ) Что делают собсна, как работают изнутри Различия между движком и самими браузером . Что браузер – это ещё и JS, например. WebKit = это WebCore + JavaScriptCore Главное – этапы отрисовки со схемы идут последовательно, но может быть с наложением по времени (из-за нежелания ждать). Дерево отображения – порядок расположения прямоугольников. Здесь уже работают стили подгруженные или по умолчанию. Большая часть кода синтаксического анализатора служит для исправления ошибок разработчиков!!!!!
54
HTML DOCUMENT STRUCTURE Любой sgml/xml/html объект – это дерево. Поэтому документ можно представить как дерево разбора. GOTO simple.html Рассказать про DOM - Можно манипулировать после создан

### 2.2 Tests

In [11]:
texts, imgs = getTextAndImgsFromSlides('http://tiny.cc/00dhkz')

assert len(texts) == 35 # equal to the total number of slides in the presentation 
print(len(texts))

assert len(imgs) > 26 # can be more than that due to visitor icons
print(len(imgs))

assert any("Navigable" in value for value in texts.values()) # word is on a slide
assert any("MINUS" in value for value in texts.values()) # word is in speaker notes


1
Forests of search trees Stanislav Protasov and Albina Khus a inova Context: vector space Cosine Sublinear (log) for any metrics space MINUS: NSW - avg log, not worst case
2
Refresh ● Proximity graphs ● Small World Graphs ● Navigable Small World Ideas of last lect
3
Agenda ANNS with trees: - Search trees - Quad trees - KD-trees - Annoy - And some others Focus on KD
4
Approach #3. Trees 1 - clustering 2 - Proximity graphs
5
Refresher for [B]ST ● K-ary (usually binary) trees ● Built upon comparable keys (scalars) ● Similar search procedure ● Preserved balance property, ensures O(log(N)) max path length ● Can be homogeneous (AVL) and not ( B+ tree ) K-ary = REGULAR SCALARS BALANCE Non-homogeneous - search always stops in the leaf node
6
But what if we have vectors? Vectors - coordinates, object embeddings
7
Originated from Computer Graphics Why from CG? - Don’t render that you don’t need to render. 2D and 3D scenes. Overlapping regions, polygons are used in 3D and 2.5D (Doom) graphics. -

32
BSP-tree To store polygons in a list: ● Choose a polygon P from a list L . ● Make a node N , and add P to the list of N . ● For each other polygon Q in the list: ○ If Q is in front of P plane, move Q to the list L F “ in front of P ”. ○ If Q is behind P plane, move Q to the list L B “ behind P ” . ○ If Q intersects P plane, split it into two polygons and move them to the respective lists. ○ If that polygon lies in the plane containing P , add it to the list of N . ● Apply this algorithm to L F and L B . BSP search is pointing a viewpoint and search allows to obtain correct rendering order with the same procedure as interval tree 1. If the current node is a leaf node , render the polygons at the current node. 2. Otherwise, if the viewing location V is in front of the current node: 1. Render the child BSP tree containing polygons behind the current node 2. Render the polygons at the current node 3. Render the child BSP tree containing polygons in front of the current node 3. Otherwise

Building an inverted index and searching with it (boolean retrieval is good enough here).

In [12]:
import re, nltk
from nltk.corpus import stopwords
from collections import Counter

class Preprocessor:
    """Text Preprocessor class"""
    
    def __init__(self):
        self.stop_words = stopwords.words('russian') + stopwords.words('english')
        self.ps = nltk.stem.PorterStemmer()

    def tokenize(self, text):
        """word tokenize text using nltk library"""
        return nltk.word_tokenize(text)

    def stem(self, word, stemmer):
        """stem word using provided stemmer"""
        return stemmer.stem(word)

    def is_apt_word(self, word):
        """check if word is appropriate - not a stop word and isalpha"""
        return word not in self.stop_words and word.isalpha()

    def preprocess(self, text):
        """tokenizes lowercased text and stems it, ignoring not appropriate words"""
        text = str(text)
        # keep only Cyrillic and Latin letters ("|" inside [] is a literal pipe, not "or")
        text = " ".join(re.split("[^а-яА-Яa-zA-Z]+", text))
        text = " ".join(text.split())
        tokenized = self.tokenize(text.lower())
        return [self.stem(w, self.ps) for w in tokenized if self.is_apt_word(w) and len(w) > 2]

def find(query, index):
    """Boolean retrieval: slides that contain every known query term"""
    postings = []
    for term in set(query):
        if term not in index:  # ignoring absent terms
            continue
        # skip the leading total frequency, keep document ids only
        postings.append({doc_id for doc_id, _ in index[term][1:]})
    if not postings:  # no known terms: set.intersection() with no arguments raises TypeError
        return set()
    return set.intersection(*postings)

def index_documents(all_texts):
    """Creates documents index"""
    prep = Preprocessor()
    inverted_index = {}
    for doc_num , doc in all_texts.items():
        for slide_num, text in doc.items():
            content = prep.preprocess(text)
            doc_index = Counter(content)

            for term in doc_index.keys():
                term_freq = doc_index[term]
                if term not in inverted_index:                
                    inverted_index[term] = [term_freq, (f'{doc_num}_slide_{slide_num}', term_freq)]
                else:
                    inverted_index[term][0] += term_freq
                    inverted_index[term].append((f'{doc_num}_slide_{slide_num}', term_freq))

    return inverted_index 
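
To make the index layout concrete, here is a small sketch on made-up data in the same shape as above: each term maps to a list whose first item is the total frequency, followed by `(slide_id, frequency)` pairs. `slides_for` and `boolean_and` are hypothetical helper names. Unlike `find`, this sketch treats an unknown term as matching nothing, which is an assumption about the desired behavior.

```python
# Made-up index in the same layout: term -> [total_freq, (slide_id, freq), ...]
sample_index = {
    "tree": [3, ("presentation2_slide_1", 1), ("presentation2_slide_3", 2)],
    "search": [2, ("presentation2_slide_1", 1), ("presentation3_slide_5", 1)],
}


def slides_for(term, index):
    """Return the set of slide ids for a term, skipping the leading total frequency."""
    entry = index.get(term)
    if entry is None:
        return set()
    return {slide_id for slide_id, _freq in entry[1:]}


def boolean_and(terms, index):
    """Slides that contain every term; an empty query matches nothing."""
    postings = [slides_for(t, index) for t in terms]
    return set.intersection(*postings) if postings else set()


print(boolean_and(["tree", "search"], sample_index))
```

Whether unknown terms should be ignored or should empty the result is a design choice; ignoring them is more forgiving for users, while the strict version is closer to textbook boolean AND.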

### 2.3 Tests

In [13]:
queries = ["architecture", "algorithm", "function", "dataset", 
           "Protasov", "cosine", "модель", "например"]

inverted_index = index_documents(all_texts)
prep = Preprocessor()
for query in queries:
    r = find(prep.preprocess(query), inverted_index)
    print("Results for: ", query)
    print("\t", r)
    assert len(r) > 0, "Query should return at least 1 document"
    assert len(r) > 1, "Query should return at least 2 documents"

Results for:  architecture
	 {'presentation1_slide_5', 'presentation1_slide_19', 'presentation1_slide_16', 'presentation3_slide_53'}
Results for:  algorithm
	 {'presentation2_slide_32', 'presentation2_slide_13', 'presentation3_slide_22', 'presentation2_slide_34', 'presentation2_slide_14'}
Results for:  function
	 {'presentation2_slide_10', 'presentation2_slide_11', 'presentation3_slide_34'}
Results for:  dataset
	 {'presentation2_slide_23', 'presentation1_slide_9', 'presentation2_slide_27'}
Results for:  Protasov
	 {'presentation1_slide_1', 'presentation2_slide_1', 'presentation3_slide_1'}
Results for:  cosine
	 {'presentation1_slide_16', 'presentation2_slide_1'}
Results for:  модель
	 {'presentation3_slide_51', 'presentation1_slide_16'}
Results for:  например
	 {'presentation3_slide_20', 'presentation3_slide_38', 'presentation3_slide_13', 'presentation1_slide_4', 'presentation3_slide_53'}


### 2.4 Images saved
Save all images used in a presentation as separate files.

In [14]:
# load and save all images from slides on disk (from one | all presentations, both are fine)
for pres_name, imgs in all_imgs.items():
    os.makedirs(pres_name, exist_ok=True)  # create a folder for each presentation if missing
    for j, img_url in enumerate(imgs, 1):
        try:
            img_data = requests.get(img_url).content
            with open(f'./{pres_name}/image_{j}.jpg', 'wb') as handler:
                handler.write(img_data)
        except (requests.RequestException, OSError):
            pass  # ignore broken urls and files that cannot be written
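
The cell above saves every image with a `.jpg` extension regardless of its real type. A possible refinement, sketched here with the standard library only, is to derive the extension from the `Content-Type` response header. `extension_for` is a hypothetical helper, and reading the header from a `requests` response is shown only as a comment.

```python
import mimetypes


def extension_for(content_type, default=".bin"):
    """Guess a file extension from a Content-Type header value."""
    if not content_type:
        return default
    # Drop parameters such as "; charset=utf-8" before the lookup.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mimetypes.guess_extension(mime) or default


print(extension_for("image/png"))

# With a requests response (assumption: `resp = requests.get(img_url)`):
# ext = extension_for(resp.headers.get("Content-Type"))
```

The fallback extension keeps the download loop from failing when the server sends no header or an unknown type.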

## 3. Bonus task.  Links exploration

Find all external links in these presentations and index them for search too. I.e., parse them and extend the inverted index you built with the content of the external links: when searching by a word, we should get not only the slides that contain it, but also any links mentioned in the slides whose pages contain the query word.
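
A possible first step is pulling the external URLs out of the slide text. The sketch below is only a starting point, not a full solution: the regular expression is deliberately simple, the sample text and the `example.com` / `example.org` addresses are made up, and treating `docs.google.com` as the only internal host is an assumption. `external_links` is a hypothetical name.

```python
import re
from urllib.parse import urlparse

# Deliberately simple pattern: a scheme followed by anything up to whitespace or a bracket.
URL_RE = re.compile(r"https?://[^\s<>()]+")


def external_links(text, internal_hosts=("docs.google.com",)):
    """Return unique URLs found in text whose host is not in internal_hosts."""
    links = []
    for match in URL_RE.findall(text):
        url = match.rstrip(".,;:!?")  # strip punctuation that ends a sentence
        host = urlparse(url).hostname or ""
        if host and host not in internal_hosts and url not in links:
            links.append(url)
    return links


sample = ("Docs: https://example.com/kd-trees, slides: "
          "https://docs.google.com/presentation/d/abc/edit and https://example.org/annoy.")
print(external_links(sample))
```

Links that are attached to slide elements rather than written out in the text would have to be collected from the DOM instead, for example from the link attributes of the rendered elements.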

In [None]:
#TODO: bonus task
# all_texts