# Parse me if you can #

Sometimes when crawling we have to parse websites that turn out to be SaaS - i.e., there is a special JS application which renders documents and which is downloaded first. Therefore, data that is to be rendered initially comes in a proprietary format. One of the examples is Google Drive. Last time we downladed and parsed some files from GDrive, however, we didn't parse GDrive-specific file formats, such as google sheets or google slides.

Today we will learn to obtain and parse such data using Selenium - a special framework for testing web-applications.

## 1. Getting started

Let's try to load and parse the page the way we did before:

In [1]:
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://docs.google.com/presentation/d/1LuZvz3axBD8UuHLagdv0EbhsGEWJmpd7gN5KjwYCp9Y/edit?usp=sharing")
soup = BeautifulSoup(resp.text, 'lxml')
print(soup.body.text[:1000])

Не удалось открыть файл, поскольку в вашем браузере отключено использование JavaScript. Включите его и перезагрузите страницу.Некоторые функции PowerPoint не поддерживаются в Google Презентациях. Они будут удалены, если вы измените документ.Подробнее…6. Approximate nearest neighbours search 2. Trees   Смотреть  Открыть доступВойтиИспользуемая вами версия браузера больше не поддерживается. Установите поддерживаемую версию браузера.Закрытьdocument.getElementById('docs-unsupported-browser-bar').addEventListener('click', function () {this.parentNode.parentNode.removeChild(this.parentNode);return false;});ФайлИзменитьВидСправкаСпециальные возможностиОтладкаНесохраненные изменения: ДискПоследние изменения      Специальные возможности  Только просмотр     DOCS_timing['che'] = new Date().getTime();DOCS_timing['chv'] = new Date().getTime();Презентация в виде HTML(function(){/*

 Copyright The Closure Library Authors.
 SPDX-License-Identifier: Apache-2.0
*/
var a=this||self;function b(){this.a=t

As we see, the output is not what we expect. So, what can we do when a page is not being loaded right away, but is rather rendered by a script? Browser engines can help us get data. Let's try to load the same web-page, but do it in a different way: let's give some time to a browser to load the scripts and run them; and then will work with DOM (Document Object Model), but will get it from browser engine itself, not from BeautifulSoup.

Where do we get browser engine from? Simply installing a browser will do the thing. How do we send commands to it from code and retrieve DOM? Service applications called drivers will interpret out commands and translate them into browser actions.


For each browser engine suport you will need to:
1. install browser itself;
2. download 'driver' - binary executable, which passed commands from selenium to browser. E.g. [Gecko == Firefox](https://github.com/mozilla/geckodriver/releases), [ChromeDriver](http://chromedriver.storage.googleapis.com/index.html);
3. unpack driver into a folder under PATH environment variable. Or specify exact binary location.

### Download driver

And place it in any folder or under PATH env. variable.

### Install selenium

In [2]:
!pip install -U selenium

Requirement already up-to-date: selenium in c:\soft\anaconda3\lib\site-packages (3.141.0)


In [3]:
from selenium import webdriver

### Launch browser

This will open browser window

In [4]:
# browser = webdriver.Firefox()
# or explicitly
browser = webdriver.Chrome(
    executable_path='C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe', 
#     firefox_binary='C:/Program Files/Mozilla Firefox/firefox.exe'
)

### Download the page

In [5]:
# navigate to page
browser.get('http://tiny.cc/00dhkz')
browser.implicitly_wait(5)  # wait 5 seconds

# select all text parts from document
elements = browser.find_elements_by_css_selector("g.sketchy-text-content-text")
# note that if number differs from launch to launch this means better extend wait time
print("Elements found:", len(elements))

# oh no! It glues all the words!
print("What if just a silly approach:", elements[0].text)

# GDrive stores all text blocks word-by-word
subnodes = elements[0].find_elements_by_css_selector("text")
text = " ".join(n.text for n in subnodes)
print("What if a smart approach:", text)

Elements found: 110
What if just a silly approach: Forests
of
search
trees
What if a smart approach: Forests of search trees


In [6]:
browser.quit()

### Problems
- Too slow, wait for browser to open, browser to render

### Headless mode
Browsers (at least [FF](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode), [Chrome](https://intoli.com/blog/running-selenium-with-headless-chrome/), IE) have headless mode - no window rendering and so on. Means it should work much faster!

In [196]:
options = webdriver.ChromeOptions()

options.add_argument('headless')
options.add_argument('window-size=1200x600')
driver = webdriver.Chrome('C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe',options=options)

In [130]:
## SAME CODE

# navigate to page
driver.get('http://tiny.cc/00dhkz')
driver.implicitly_wait(5)  # wait 5 seconds

# select all text parts from document
elements = driver.find_elements_by_css_selector("g.sketchy-text-content-text")
# note that if number differs from launch to launch this means better extend wait time
print("Elements found:", len(elements))

# oh no! It adds NEW LINE. Behavior differs!!!!
print("What if just a silly approach:", elements[0].text)
# GDrive stores all text blocks word-by-word
subnodes = elements[0].find_elements_by_css_selector("text")
text = " ".join(n.text for n in subnodes)
print("What if a smart approach:", text)

Elements found: 112
What if just a silly approach: Forests
of
search
trees
What if a smart approach: Forests of search trees


In [197]:
driver.quit()

In [191]:
slides_text = dict() # dictionary slide_num : slide_text
img_list = [] # list of image urls 
current_page = 0

driver = webdriver.Chrome('C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe',options=options)
driver.get("https://docs.google.com/presentation/d/1UxjGZPPrPTM_3lCa_gWTk8yZI_qNmTKwtMxr8JZQCIc/edit?usp=sharing")
driver.implicitly_wait(5)  # wait 5 seconds

current_slide = 0
all_slides = driver.find_elements_by_class_name ("punch-filmstrip-thumbnail")

#Go throuth all slides
while True:
    #condition to stop
    if current_page > (len(all_slides) -1):
        break
        
    all_slides = driver.find_elements_by_class_name ("punch-filmstrip-thumbnail")
    #scroll to slide
    driver.execute_script("arguments[0].scrollIntoView();", all_slides[current_page])
    #click on slide
    webdriver.ActionChains(driver).move_to_element(all_slides[current_page]).click(all_slides[current_page]).perform()

    #preparation for slide
    text_list = ""
    #extract text top
    # select all text parts from document
    g = driver.find_element_by_id("workspace").find_elements_by_css_selector("text")
    text = " ".join(n.text for n in g)
    text_list+=text
    #extract text comments
    g = driver.find_element_by_id("speakernotes-workspace").find_elements_by_css_selector("text")
    text = " ".join(n.text for n in g)
    text_list+=text
    #update slides_text
    slides_text[current_page] = text
    #requred next page
    current_page +=1

# extract image
images = driver.find_elements_by_tag_name('image')
for s in range (len(images)):
    img_list.append(images[s].get_attribute("href"))
    
# driver.quit()

### NB 
Note, that browser behavior differs for the same code!

## 2.Task 
Our lectures usually have lot's of links. Here are the links to oviginal versions of the documents.

[4. Vector space](https://docs.google.com/presentation/d/1UxjGZPPrPTM_3lCa_gWTk8yZI_qNmTKwtMxr8JZQCIc/edit?usp=sharing)

[6. search trees](https://docs.google.com/presentation/d/1LuZvz3axBD8UuHLagdv0EbhsGEWJmpd7gN5KjwYCp9Y/edit?usp=sharing)

[7-8. Web basics](https://docs.google.com/presentation/d/1bgsCgpjMcQmrFpblRI6oH9SnG4bjyo5SzSSdKxxHNlg/edit?usp=sharing)

Please complete the following tasks:

### 2.1 Inverted index for slides with numbers
I want to type a word, and it should say which slides of which lecture has this word.

Loading and parsing text and images from google slides

In [202]:
def getTextAndImgsFromSlides(url):    
    slides_text = dict() # dictionary slide_num : slide_text
    img_list = [] # list of image urls 
    current_page = 0

    driver = webdriver.Chrome('C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe',options=options)
    driver.get(url)
    driver.implicitly_wait(5)  # wait 5 seconds

    all_slides = driver.find_elements_by_class_name ("punch-filmstrip-thumbnail")
    #Go throuth all slides
    while True:
        #condition to stop
        if current_page > (len(all_slides) -1):
            break

        all_slides = driver.find_elements_by_class_name ("punch-filmstrip-thumbnail")
        #scroll to slide
        driver.execute_script("arguments[0].scrollIntoView();", all_slides[current_page])
        #click on slide
        webdriver.ActionChains(driver).move_to_element(all_slides[current_page]).click(all_slides[current_page]).perform()

        #preparation for slide
        text_list = ""
        #extract text top
        # select all text parts from document
        g = driver.find_element_by_id("workspace").find_elements_by_css_selector("text")
        text = " ".join(n.text for n in g)
        text_list+=text
        #extract text comments
        g = driver.find_element_by_id("speakernotes-workspace").find_elements_by_css_selector("text")
        text = " ".join(n.text for n in g)
        text_list+=text
        #update slides_text
        slides_text[current_page] = text_list
        #requred next page
        current_page +=1

    # extract image
    images = driver.find_elements_by_tag_name('image')
    for s in range (len(images)):
        img_list.append(images[s].get_attribute("href"))

    driver.quit()
    return slides_text, img_list

Parsing three presentations

In [206]:
links = ["https://docs.google.com/presentation/d/1UxjGZPPrPTM_3lCa_gWTk8yZI_qNmTKwtMxr8JZQCIc/edit?usp=sharing", 
         "https://docs.google.com/presentation/d/1LuZvz3axBD8UuHLagdv0EbhsGEWJmpd7gN5KjwYCp9Y/edit?usp=sharing",
         "https://docs.google.com/presentation/d/1bgsCgpjMcQmrFpblRI6oH9SnG4bjyo5SzSSdKxxHNlg/edit?usp=sharing"]

all_imgs = []
all_texts = dict()
for i, link in enumerate(links):
    texts, imgs = getTextAndImgsFromSlides(link)
    all_imgs+=imgs
    all_texts.update({i+1:texts})

### 2.2 Tests

In [204]:
texts, imgs = getTextAndImgsFromSlides('http://tiny.cc/00dhkz')

assert len(texts) == 35 # equal to the total number of slides in the presentation 
print(len(texts))

assert len(imgs) > 26 # can be more than that due to visitor icons
print(len(imgs))

assert any("Navigable" in value for value in texts.values()) # word is on a slide
assert any("MINUS" in value for value in texts.values()) # word is in speaker notes

35
54


Building an inverted index and search using it (boolean retrieval is just ok)

In [241]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer

class Preprocessor:
    def __init__(self):
        self.stop_words = {'a', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'for', 'from', 'has', 'he', 'in', 'is', 'it', 'its',
                      'of', 'on', 'that', 'the', 'to', 'was', 'were', 'will', 'with'}
        self.ps = nltk.stem.PorterStemmer()

    
    def tokenize(self, text):
        #TODO word tokenize text using nltk lib
        
        return word_tokenize (text)

    
    def stem(self, word, stemmer):
        #TODO stem word using provided stemmer
        stemmed_word = stemmer.stem(word.lower())
        return stemmed_word

    
    def is_apt_word(self, word):
        #TODO check if word is appropriate - not a stop word and isalpha, 
        # i.e consists of letters, not punctuation, numbers, datesа
  
        if word.isalpha():
          if word not in self.stop_words:
            return True
        return False
    def preprocess(self, text):
        #TODO combine all previous methods together: tokenize lowercased text 
        # and stem it, ignoring not appropriate words
        tokenized = self.tokenize (text)
        filtered = []
        for word in tokenized:
          word = self.stem(word,self.ps)
          if self.is_apt_word(word):
            filtered.append(word)
        
          
        return filtered

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Dorak\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [297]:
all_text_copy=all_texts.copy()
prep = Preprocessor()
index = dict()

In [335]:
def retrieval (query, index):
    postings = []

    if query not in index:
        return 0
    posting = index[query][1:]
    # extract document info only
    posting = [i[0] for i in posting]
    postings.append(posting)
    docs = set.intersection(*map(set,postings))
    return docs 

In [341]:
index["architectur"]

['1', '_', '4', '1_15', '1_18', '3_52']

In [319]:
#finder
def find(query, index):
    text_ids = retrieval(query, index)
    return text_ids

In [299]:
# preprocess
for i in all_text_copy:
    for z in all_text_copy[i]:
        all_text_copy [i][z] = prep.preprocess(all_text_copy [i][z])
        print (all_text_copy [i][z])

['vector', 'space', 'model', 'ml', 'stanislav', 'protasov', 'albina', 'khusainova']
['refresh', 'matrix', 'vector', 'space', 'model', 'distribut', 'semant', 'latent', 'space', 'lsa', 'pca', 'svd']
['agenda', 'lsa', 'what', 'miss', 'neural', 'network', 'solv', 'embed', 'task', 'dssm', 'bert']
['lsa', 'critic', 'speed', 'issu', 'even', 'optim', 'svd', 'slow', 'requir', 'memori', 'cpu', 'time', 'fast', 'random', 'svd', 'facebook', 'altern', 'least', 'squar', 'al', 'distribut', 'stream', 'version', 'model', 'issu', 'pca', 'assum', 'normal', 'data', 'distribut', 'but', 'life', 'complic', 'plsa', 'statist', 'independ', 'ani', 'distribut', 'vs', 'linear', 'orthogon', 'root', 'statist', 'what', 'about', 'ad', 'new', 'can', 'we', 'take', 'some', 'model', 'don', 't', 'care', 'about', 'distribut', 'statist', 'memori', 'so', 'существенным', 'недостатком', 'метода', 'является', 'значительное', 'снижение', 'скорости', 'вычисления', 'при', 'увеличении', 'объёма', 'входных', 'данных', 'например', 'при

In [337]:
#build index
index = dict()
for i in all_text_copy:
    for z in all_text_copy[i]:
        this_set = set(all_text_copy[i][z])
        for f in this_set:
            if f in index:
                index[f].append( "%d_%d" %(i, z))
            else:
                index.update({f:list( "%d_%d" %(i, z))})
                
                


### 2.3 Tests

In [338]:
queries = ["architecture", "algorithm", "function", "dataset", 
           "Protasov", "cosine", "модель", "например"]

for query in queries:
    query = prep.preprocess(query)
    r = find(query[0], index)
    print("Results for: ", query)
    print("\t", r)
    assert len(r) > 0, "Query should return at least 1 document"
    assert len(r) > 1, "Query should return at least 2 documents"

Results for:  ['architectur']
	 {'1', '_', '4', '3'}
Results for:  ['algorithm']
	 {'1', '_', '2', '3'}
Results for:  ['function']
	 {'_', '2', '3', '9'}
Results for:  ['dataset']
	 {'_', '2', '8'}
Results for:  ['protasov']
	 {'0', '_', '2', '3'}
Results for:  ['cosin']
	 {'1', '_', '2', '5'}
Results for:  ['модель']
	 {'1', '_', '3', '5'}
Results for:  ['например']
	 {'_', '3'}


### 2.4 Images saved
Save all images used in a presentation as separate files.

In [223]:
import urllib.request
count = 0
#TODO: load and save all images from slides on disk (from one | all presentations, both are fine)
for image in all_imgs:
  urllib.request.urlretrieve(image, "local-filename%d.jpg" %(count))
  count+=1

## 3. Bonus task.  Links exploration

Find all external links in these presentations and index them for search too. I.e., parse them and extend inverted index you built with external links content: when searching by word we should get not only slides that contain it, but also any links that were mentioned in slides and contain this query word. 

In [None]:
#TODO: bonus task