# Parse me if you can #

Sometimes when crawling we have to parse websites that turn out to be SaaS applications: a dedicated JS application is downloaded first and then renders the documents. As a result, the data to be rendered initially arrives in a proprietary format. Google Drive is one example. Last time we downloaded and parsed some files from GDrive; however, we didn't parse GDrive-specific file formats, such as Google Sheets or Google Slides.

Today we will learn to obtain and parse such data using Selenium, a framework for automating browsers that is commonly used for testing web applications.

## 1. Getting started

Let's try to load and parse the page the way we did before:

In [1]:
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://docs.google.com/presentation/d/1LuZvz3axBD8UuHLagdv0EbhsGEWJmpd7gN5KjwYCp9Y/edit?usp=sharing")
soup = BeautifulSoup(resp.text, 'lxml')
print(soup.body.text[:1000])

Не удалось открыть файл, поскольку в вашем браузере отключено использование JavaScript. Включите его и перезагрузите страницу.Некоторые функции PowerPoint не поддерживаются в Google Презентациях. Они будут удалены, если вы измените документ.Подробнее…6. Approximate nearest neighbours search 2. Trees   Смотреть  Открыть доступВойтиИспользуемая вами версия браузера больше не поддерживается. Установите поддерживаемую версию браузера.Закрытьdocument.getElementById('docs-unsupported-browser-bar').addEventListener('click', function () {this.parentNode.parentNode.removeChild(this.parentNode);return false;});ФайлИзменитьВидСправкаСпециальные возможностиОтладкаНесохраненные изменения: ДискПоследние изменения      Специальные возможности  Только просмотр     DOCS_timing['che'] = new Date().getTime();DOCS_timing['chv'] = new Date().getTime();Презентация в виде HTML(function(){/*

 Copyright The Closure Library Authors.
 SPDX-License-Identifier: Apache-2.0
*/
var a=this||self;function b(){this.a=t

As we see, the output is not what we expect: instead of slide contents we mostly get interface strings and scripts, starting with a message (in Russian, the locale the page was served in) saying that the file could not be opened because JavaScript is disabled in the browser. So what can we do when a page is not loaded right away but is rendered by a script? Browser engines can help us get the data. Let's load the same web page in a different way: give a browser some time to load and run the scripts, and then work with the DOM (Document Object Model) obtained from the browser engine itself, not from BeautifulSoup.

Where do we get a browser engine from? Simply installing a browser will do. How do we send commands to it from code and retrieve the DOM? Service applications called drivers interpret our commands and translate them into browser actions.
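To see why the static approach falls short, here is a minimal sketch using only the standard library. The HTML string is a made-up stand-in for a script-rendered page (it is not actual Google Slides markup), and `VisibleTextParser` and `static_text` are hypothetical helper names. The interesting text exists only inside a script, so a parser that does not execute JavaScript never sees it.

```python
from html.parser import HTMLParser

# Hypothetical page: the real content is produced by JavaScript at run time.
SAMPLE_PAGE = """
<html><body>
  <noscript>Please enable JavaScript.</noscript>
  <div id="app"></div>
  <script>
    document.getElementById('app').textContent = 'Forests of search trees';
  </script>
</body></html>
"""


class VisibleTextParser(HTMLParser):
    """Collects text outside <script>/<style>, as a static parser would."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a script or style block.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def static_text(html):
    """Return the text a non-JavaScript parser can see in the page."""
    parser = VisibleTextParser()
    parser.feed(html)
    return " ".join(parser.chunks)


print(static_text(SAMPLE_PAGE))
```

Only the fallback message survives; the slide title never appears because nothing ran the script that would have inserted it.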


To support a browser engine you will need to:
1. install the browser itself;
2. download the 'driver', a binary executable that passes commands from Selenium to the browser, e.g. [geckodriver for Firefox](https://github.com/mozilla/geckodriver/releases), [ChromeDriver](http://chromedriver.storage.googleapis.com/index.html);
3. unpack the driver into a folder listed in the PATH environment variable, or specify the exact binary location.
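A quick way to check step 3 is to ask Python where it would find the driver. This is a small sketch using the standard library only; `find_driver` is a hypothetical helper name, and the executable names are the usual ones for Firefox and Chrome drivers.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of a driver executable found on PATH, or None."""
    return shutil.which(name)


for driver in ("geckodriver", "chromedriver"):
    path = find_driver(driver)
    print(driver, "->", path if path else "not found on PATH")
```

If a driver is reported as not found, either move it into a folder on PATH or pass its location explicitly when creating the browser. Recent Selenium releases (4.6 and later) also ship Selenium Manager, which tries to locate or download a matching driver automatically, so the manual step may be unnecessary there.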

### Download driver

Place it in any folder, or in a folder listed in the PATH environment variable.

### Install selenium

In [None]:
!pip install -U selenium

In [3]:
from selenium import webdriver

### Launch browser

This will open a browser window.

In [4]:
browser = webdriver.Firefox()
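# or explicitly (Selenium 4 style: driver path goes into Service,
# browser binary into options)
# from selenium.webdriver.firefox.service import Service
# options = webdriver.FirefoxOptions()
# options.binary_location = 'C:/Program Files/Mozilla Firefox/firefox.exe'
# browser = webdriver.Firefox(
#     service=Service(executable_path='C:/bin/geckodriver.exe'),
#     options=options,
# )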

### Download the page

In [5]:
from selenium.webdriver.common.by import By

# navigate to page
browser.get('http://tiny.cc/00dhkz')
# element lookups below will wait up to 5 seconds for a match
browser.implicitly_wait(5)

# select all text parts from document
elements = browser.find_elements(By.CSS_SELECTOR, "g.sketchy-text-content-text")
# note: if this number differs from launch to launch, the page has probably
# not finished rendering, so extend the wait time
print("Elements found:", len(elements))

# oh no! It glues all the words!
print("What if just a silly approach:", elements[0].text)

# GDrive stores all text blocks word-by-word
subnodes = elements[0].find_elements(By.CSS_SELECTOR, "text")
text = " ".join(n.text for n in subnodes)
print("What if a smart approach:", text)

Elements found: 140
What if just a silly approach: Forestsofsearchtrees
What if a smart approach: Forests of search trees
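The difference between the two outputs can be reproduced without a browser. The sketch below assumes a simplified, made-up fragment with one `<text>` node per word (the real Google Slides markup is more complex) and uses the standard library XML parser instead of Selenium.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment: one <text> node per word, no spaces stored.
FRAGMENT = (
    '<g class="sketchy-text-content-text">'
    '<text>Forests</text><text>of</text><text>search</text><text>trees</text>'
    '</g>'
)

group = ET.fromstring(FRAGMENT)

# Concatenating all descendant text reproduces the "glued" result.
glued = "".join(group.itertext())

# Joining the per-word nodes with spaces restores the phrase.
spaced = " ".join(node.text for node in group.iter("text") if node.text)

print(glued)
print(spaced)
```

Since the spaces are not stored in the nodes, they have to be re-inserted when joining, which is what the "smart approach" above does.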


In [6]:
browser.quit()

### Problems
- Too slow: we have to wait for the browser to open and then to render the page

### Headless mode
Browsers (at least [FF](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode), [Chrome](https://intoli.com/blog/running-selenium-with-headless-chrome/), IE) have a headless mode: no window is rendered, and so on. This means it should work much faster!

In [7]:
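options = webdriver.FirefoxOptions()

# Firefox takes '-headless'; window size is set via --width/--height
# ('window-size=...' is a Chrome argument)
options.add_argument('-headless')
options.add_argument('--width=1200')
options.add_argument('--height=600')
browser = webdriver.Firefox(options=options)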

In [8]:
## SAME CODE
from selenium.webdriver.common.by import By

# navigate to page
browser.get('http://tiny.cc/00dhkz')
# element lookups below will wait up to 5 seconds for a match
browser.implicitly_wait(5)

# select all text parts from document
elements = browser.find_elements(By.CSS_SELECTOR, "g.sketchy-text-content-text")
# note: if this number differs from launch to launch, the page has probably
# not finished rendering, so extend the wait time
print("Elements found:", len(elements))

# oh no! Headless mode may add a NEW LINE here. Behavior differs!
print("What if just a silly approach:", elements[0].text)

# GDrive stores all text blocks word-by-word
subnodes = elements[0].find_elements(By.CSS_SELECTOR, "text")
text = " ".join(n.text for n in subnodes)
print("What if a smart approach:", text)

Elements found: 140
What if just a silly approach: Forestsofsearchtrees
What if a smart approach: Forests of search trees


In [9]:
browser.quit()

### NB 
Note that browser behavior can differ for the same code!

## 2. Task
Our lectures usually have lots of links. Here are the links to the original versions of the documents.

[4. Vector space](https://docs.google.com/presentation/d/1UxjGZPPrPTM_3lCa_gWTk8yZI_qNmTKwtMxr8JZQCIc/edit?usp=sharing)

[6. search trees](https://docs.google.com/presentation/d/1LuZvz3axBD8UuHLagdv0EbhsGEWJmpd7gN5KjwYCp9Y/edit?usp=sharing)

[7-8. Web basics](https://docs.google.com/presentation/d/1bgsCgpjMcQmrFpblRI6oH9SnG4bjyo5SzSSdKxxHNlg/edit?usp=sharing)

Please complete the following tasks:

### 2.1 Inverted index for slides with numbers
I want to type a word and get back the slides, along with the lectures they belong to, that contain this word.

Loading and parsing text and images from Google Slides

In [11]:
def getTextAndImgsFromSlides(url):    
    slides_text = dict() # dictionary slide_num : slide_text
    img_list = [] # list of image urls 
    #TODO: parse google slides and save all text and image urls in slides_text and img_list
    # you should get the contents from ALL slides - however, you will see that at any given
    # moment only a single slide + a few slide previews on the left are visible. To be able to
    # reach all slides you will need to scroll to and click these previews. While slide contents
    # can be obtained from the previews themselves, speaker notes (which you also have to extract)
    # can be viewed only if a particular slide is open.
    # to scroll the element of interest into view, you can use this:
    # browser.execute_script("arguments[0].scrollIntoView();", el)
    # to click the element, you can use the ActionChains class
    
    
    return slides_text, img_list

Parsing three presentations

In [12]:
links = ["https://docs.google.com/presentation/d/1UxjGZPPrPTM_3lCa_gWTk8yZI_qNmTKwtMxr8JZQCIc/edit?usp=sharing", 
         "https://docs.google.com/presentation/d/1LuZvz3axBD8UuHLagdv0EbhsGEWJmpd7gN5KjwYCp9Y/edit?usp=sharing",
         "https://docs.google.com/presentation/d/1bgsCgpjMcQmrFpblRI6oH9SnG4bjyo5SzSSdKxxHNlg/edit?usp=sharing"]

all_imgs = []
all_texts = dict()

for i, link in enumerate(links):
    texts, imgs = getTextAndImgsFromSlides(link)


0
Vector space modelling with ML Stanislav Protasov and Albina Khusainova
1
Refresh ● Term-document matrix ● Vector space model ● Distributive semantics ● Latent space ● LSA, PCA, SVD
2
Agenda ● LSA - what is missing ● Neural networks solving embedding task ○ word|doc 2vec ○ DSSM ○ BERT
3
LSA critics Speed issue . Even optimized SVD is slow and requires memory and CPU time ● Fast Randomized SVD (facebook) ● Alternating Least Squares (ALS). Distributed and streaming versions Model issue . PCA assumes normal data distribution, but life is complicated ● pLSA . Statistical independence ( any distribution ) vs linear orthogonality. Roots in statistics What about adding new words/texts? Can we take some model and don’t care about distributions, statistics, memory and so? Существенным недостатком метода является значительное снижение скорости вычисления при увеличении объёма входных данных (например, при SVD-преобразовании) скорость вычисления соответствует порядку N(doc+term)^(2*k) Предполаг

### 2.2 Tests

In [13]:
texts, imgs = getTextAndImgsFromSlides('http://tiny.cc/00dhkz')

assert len(texts) == 35 # equal to the total number of slides in the presentation 
print(len(texts))

assert len(imgs) > 26 # can be more than that due to visitor icons
print(len(imgs))

assert any("Navigable" in value for value in texts.values()) # word is on a slide
assert any("MINUS" in value for value in texts.values()) # word is in speaker notes

35
28


Building an inverted index and searching with it (Boolean retrieval is good enough)

In [15]:
#TODO: build an inverted index and enable search in it
inverted_index = None
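For orientation, here is one possible shape of such an index, as a sketch rather than the reference solution. It assumes the parsed texts are collected as `{lecture_no: {slide_no: slide_text}}` and that slide ids look like `"lecture_slide"`, which is inferred from the sample results in the tests below. `tokenize`, `build_index` and the toy data are hypothetical; `find` mirrors the name the tests call.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split into word tokens (Unicode-aware, so Cyrillic works)."""
    return re.findall(r"\w+", text.lower())


def build_index(lectures):
    """Map each term to the set of 'lecture_slide' ids that contain it."""
    index = defaultdict(set)
    for lecture_no, slides in lectures.items():
        for slide_no, text in slides.items():
            for term in tokenize(text):
                index[term].add(f"{lecture_no}_{slide_no}")
    return index


def find(query, index):
    """Boolean AND retrieval: slides containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Toy data standing in for the parsed presentations.
toy = {
    1: {1: "Vector space modelling", 2: "Cosine similarity"},
    2: {1: "Forests of search trees", 2: "Search with cosine distance"},
}
toy_index = build_index(toy)
print(find("cosine", toy_index))
print(find("search trees", toy_index))
```

Lowercasing both the documents and the query keeps capitalized queries such as "Protasov" working; stemming or lemmatization is deliberately left out of this sketch.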

### 2.3 Tests

In [22]:
queries = ["architecture", "algorithm", "function", "dataset", 
           "Protasov", "cosine", "модель", "например"]

for query in queries:
    r = find(query, inverted_index)
    print("Results for: ", query)
    print("\t", r)
    assert len(r) > 0, "Query should return at least 1 document"
    assert len(r) > 1, "Query should return at least 2 documents"

Results for:  architecture
	 {'3_53', '1_16'}
Results for:  algorithm
	 {'2_34', '2_14', '2_13', '2_32'}
Results for:  function
	 {'3_34', '2_11', '2_10'}
Results for:  dataset
	 {'2_23', '2_27'}
Results for:  Protasov
	 {'2_1', '1_1', '3_1'}
Results for:  cosine
	 {'2_1', '1_16'}
Results for:  модель
	 {'3_51', '1_16'}
Results for:  например
	 {'1_4', '3_13', '3_53', '3_20', '3_38'}


### 2.4 Images saved
Save all images used in a presentation as separate files.

In [18]:
#TODO: load and save all images from slides on disk (from one | all presentations, both are fine)
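As a starting point, saving a list of image URLs could look like the sketch below. `download`, `save_images` and the `img_NNN.bin` naming are assumptions made for illustration; the fetcher is passed in as a parameter so the demonstration runs offline with a fake one. Some image URLs may only be reachable from inside the browser session, which this sketch does not handle.

```python
import os
import tempfile
import urllib.request


def download(url):
    """Default fetcher: return the raw bytes behind a URL."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def save_images(urls, out_dir, fetch=download):
    """Save each URL as out_dir/img_<n>.bin and return the written paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for n, url in enumerate(urls):
        path = os.path.join(out_dir, f"img_{n:03d}.bin")
        with open(path, "wb") as f:
            f.write(fetch(url))
        paths.append(path)
    return paths


# Offline demonstration with a fake fetcher instead of real network calls.
demo_dir = tempfile.mkdtemp()
saved = save_images(
    ["http://example.com/a", "http://example.com/b"],
    demo_dir,
    fetch=lambda url: url.encode(),
)
print(saved)
```

In the notebook you would call `save_images(imgs, "images")` with the list returned by `getTextAndImgsFromSlides`, and pick a proper file extension once the image format is known.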

## 3. Bonus task. Links exploration

Find all external links in these presentations and index them for search too. That is, parse them and extend the inverted index you built with the content of the external links: when searching by word, we should get not only the slides that contain it, but also any links mentioned in the slides whose content contains the query word.

In [None]:
#TODO: bonus task
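A possible first step for the bonus task is pulling external URLs out of the slide text. The sketch below is an assumption-laden illustration: it looks for URLs with a simple regular expression (links stored only as `href` attributes would need to be read from the DOM instead) and treats `docs.google.com` as the only internal host. `extract_external_links` is a hypothetical helper.

```python
import re
from urllib.parse import urlparse

# Simple pattern: stops at whitespace, angle brackets, parentheses and brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")


def extract_external_links(text, internal_hosts=("docs.google.com",)):
    """Return unique http(s) URLs whose host is not internal, in order of appearance."""
    links = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:!?")  # drop trailing sentence punctuation
        host = urlparse(url).netloc.lower()
        if host and host not in internal_hosts and url not in links:
            links.append(url)
    return links


sample = (
    "See https://en.wikipedia.org/wiki/K-d_tree, and "
    "https://docs.google.com/presentation/d/abc/edit for slides."
)
print(extract_external_links(sample))
```

Each extracted link can then be fetched and parsed the same way as the slides, and its terms added to the inverted index under an id that marks it as a link rather than a slide.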