# Parse me if you can #

Sometimes when crawling we have to parse websites that turn out to be SaaS applications: a JavaScript application is downloaded first, and it then renders the documents. As a result, the data to be rendered initially arrives in a proprietary format. Google Drive is one example. Last time we downloaded and parsed some files from GDrive; however, we didn't parse GDrive-specific file formats, such as Google Sheets or Google Slides.

Today we will learn to obtain and parse such data using Selenium - a special framework for testing web-applications.

## 1. Getting started

Let's try to load and parse the page the way we did before:

In [1]:
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://docs.google.com/presentation/d/1LuZvz3axBD8UuHLagdv0EbhsGEWJmpd7gN5KjwYCp9Y/edit?usp=sharing")
soup = BeautifulSoup(resp.text, 'lxml')
print(soup.body.text[:1000])

The file could not be opened because JavaScript is disabled in your browser. Enable it and reload the page.Some PowerPoint features are not supported in Google Slides. They will be removed if you edit the document.Learn more…6. Approximate nearest neighbours search 2. Trees   View  ShareSign inThe browser version you are using is no longer supported. Install a supported browser version.Closedocument.getElementById('docs-unsupported-browser-bar').addEventListener('click', function () {this.parentNode.parentNode.removeChild(this.parentNode);return false;});FileEditViewHelpAccessibilityDebugUnsaved changes: DriveLatest changes      Accessibility  View only     DOCS_timing['che'] = new Date().getTime();DOCS_timing['chv'] = new Date().getTime();Presentation as HTML(function(){/*

 Copyright The Closure Library Authors.
 SPDX-License-Identifier: Apache-2.0
*/
var a=this||self;function b(){this.a=t

As we can see, the output is not what we expect. So what can we do when a page is not delivered ready-made, but is rendered by a script? Browser engines can help us get the data. Let's load the same web page in a different way: give the browser some time to load and run the scripts, and then work with the DOM (Document Object Model) obtained from the browser engine itself rather than from BeautifulSoup.

Where do we get a browser engine from? Simply installing a browser will do. How do we send commands to it from code and retrieve the DOM? Service applications called drivers interpret our commands and translate them into browser actions.


To support each browser engine you will need to:
1. install the browser itself;
2. download the 'driver', a binary executable that passes commands from Selenium to the browser. E.g. [geckodriver for Firefox](https://github.com/mozilla/geckodriver/releases), [ChromeDriver](http://chromedriver.storage.googleapis.com/index.html);
3. unpack the driver into a folder listed in the PATH environment variable, or specify the exact binary location.
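
Before launching anything, it can help to check whether a driver is actually reachable through PATH. The snippet below is a minimal sketch using only the standard library; the executable names `chromedriver` and `geckodriver` are assumptions based on the two drivers linked above, and `find_drivers` is a hypothetical helper, not part of Selenium.

```python
import shutil

# Executable names are assumptions based on the two drivers linked above.
DRIVER_NAMES = ("chromedriver", "geckodriver")


def find_drivers(names=DRIVER_NAMES):
    """Map each driver name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_drivers().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If a driver shows up as not found, either add its folder to PATH or pass the exact binary location when creating the browser.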

### Download driver

Place it in any folder, or in a folder listed in the PATH environment variable.

### Install selenium

In [None]:
# !pip install -U selenium

In [2]:
from selenium import webdriver

### Launch browser

This will open a browser window.

In [3]:
browser = webdriver.Chrome(r'C:\Users\Rufina\Downloads\selenium\chromedriver.exe')
# or explicitly
# browser = webdriver.Firefox(
#     executable_path='C:/bin/geckodriver.exe', 
#     firefox_binary='C:/Program Files/Mozilla Firefox/firefox.exe'
# )

### Download the page

In [4]:
# navigate to page
browser.get('http://tiny.cc/00dhkz')
browser.implicitly_wait(5)  # poll for up to 5 seconds when looking up elements
# select all text parts from document
elements = browser.find_elements_by_css_selector("g.sketchy-text-content-text")
# note: if this number differs from launch to launch, extend the wait time
print("Elements found:", len(elements))

# oh no! It glues all the words!
print("What if just a silly approach:", elements[0].text)

# GDrive stores all text blocks word-by-word
subnodes = elements[0].find_elements_by_css_selector("text")
text = " ".join(n.text for n in subnodes)
print("What if a smart approach:", text)

Elements found: 95
What if just a silly approach: Forests
of
search
trees
What if a smart approach: Forests of search trees
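
The word-by-word joining can be tried offline as well. The sketch below assumes a simplified, hypothetical fragment in which every word sits in its own `<text>` node inside the text group; the markup the browser really renders carries more attributes and nesting than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: one <text> node per word inside a text group.
SVG_FRAGMENT = """
<g class="sketchy-text-content-text">
  <text>Forests</text>
  <text>of</text>
  <text>search</text>
  <text>trees</text>
</g>
"""


def join_words(svg_fragment):
    """Join the contents of all <text> descendants with single spaces."""
    root = ET.fromstring(svg_fragment.strip())
    words = [(node.text or "").strip() for node in root.iter("text")]
    return " ".join(w for w in words if w)


print(join_words(SVG_FRAGMENT))  # Forests of search trees
```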


In [5]:
browser.quit()
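
The code comment above warns that the number of elements may differ from launch to launch when the page has not finished rendering. One way to make this less fragile is to poll until the count stops changing. The helper below is a sketch under that assumption; `wait_for_stable_count` and its parameters are hypothetical names, and it takes any zero-argument callable so it does not depend on Selenium itself.

```python
import time


def wait_for_stable_count(count_fn, timeout=10.0, poll=0.5, stable_polls=3):
    """Poll count_fn() until it returns the same non-zero value
    `stable_polls` times in a row, or until `timeout` seconds pass.
    Returns the last observed count."""
    deadline = time.monotonic() + timeout
    last, streak = None, 0
    while True:
        current = count_fn()
        # extend the streak if the value repeated, otherwise start over
        streak = streak + 1 if current == last else 1
        last = current
        if current and streak >= stable_polls:
            return current
        if time.monotonic() >= deadline:
            return current
        time.sleep(poll)
```

With a live browser, the callable could be `lambda: len(browser.find_elements_by_css_selector("g.sketchy-text-content-text"))`.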

### Problems
- Too slow: we have to wait for the browser to open and then for it to render the page.

### Headless mode
Browsers (at least [FF](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode), [Chrome](https://intoli.com/blog/running-selenium-with-headless-chrome/), IE) have a headless mode: no window is rendered, and so on. This means it should work much faster!
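
As a rough sketch, the headless switch can be kept behind a flag so the same code runs with or without a window. `chrome_arguments` and `make_browser` are hypothetical helpers; the comma-separated `--window-size=width,height` form is how we understand Chrome's command-line switch to be written, so treat it as an assumption to check against your Chrome version.

```python
def chrome_arguments(headless=True, width=1200, height=600):
    """Build the list of command-line switches to pass to Chrome."""
    args = [f"--window-size={width},{height}"]
    if headless:
        args.append("--headless")
    return args


def make_browser(headless=True):
    """Create a Chrome browser; needs Selenium and a driver on PATH."""
    # Imported lazily so chrome_arguments() can be used without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)


print(chrome_arguments())
```

Calling `make_browser(headless=False)` while debugging and `make_browser()` for batch runs keeps the rest of the parsing code identical.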

In [6]:
options = webdriver.ChromeOptions() # webdriver.FirefoxOptions()

# options.add_argument('headless')
options.add_argument('window-size=1200x600')
browser = webdriver.Chrome(options=options) # webdriver.Firefox(options=options)

In [7]:
## SAME CODE

# navigate to page
browser.get('http://tiny.cc/00dhkz')
browser.implicitly_wait(5)  # poll for up to 5 seconds when looking up elements

# select all text parts from document
elements = browser.find_elements_by_css_selector("g.sketchy-text-content-text")
# note: if this number differs from launch to launch, extend the wait time
print("Elements found:", len(elements))

# oh no! It adds NEW LINE. Behavior differs!!!!
print("What if just a silly approach:", elements[0].text)

# GDrive stores all text blocks word-by-word
subnodes = elements[0].find_elements_by_css_selector("text")
text = " ".join(n.text for n in subnodes)
print("What if a smart approach:", text)

browser.quit()

Elements found: 95
What if just a silly approach: Forests
of
search
trees
What if a smart approach: Forests of search trees


### NB 
Note that browser behavior differs for the same code!

## 2. Task
Our lectures usually have lots of links. Here are the links to the original versions of the documents.

[4. Vector space](https://docs.google.com/presentation/d/1UxjGZPPrPTM_3lCa_gWTk8yZI_qNmTKwtMxr8JZQCIc/edit?usp=sharing)

[6. search trees](https://docs.google.com/presentation/d/1LuZvz3axBD8UuHLagdv0EbhsGEWJmpd7gN5KjwYCp9Y/edit?usp=sharing)

[7-8. Web basics](https://docs.google.com/presentation/d/1bgsCgpjMcQmrFpblRI6oH9SnG4bjyo5SzSSdKxxHNlg/edit?usp=sharing)

Please complete the following tasks:

### 2.1 Inverted index for slides with numbers
I want to type a word and be told which slides of which lectures contain it.
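
To make the goal concrete, here is a minimal sketch of such an index on toy data. It is not the solution to the task: the slide texts are invented for illustration, `build_slide_index` is a hypothetical helper, and words are split on whitespace without any stemming or stop-word removal.

```python
from collections import defaultdict


def build_slide_index(lectures):
    """lectures: {lecture_name: {slide_number: slide_text}}.
    Returns {word: set of (lecture_name, slide_number)}."""
    index = defaultdict(set)
    for lecture, slides in lectures.items():
        for slide_number, text in slides.items():
            for word in text.lower().split():
                index[word].add((lecture, slide_number))
    return index


# Toy data, invented for illustration only.
toy = {
    "trees": {1: "Forests of search trees", 2: "KD trees"},
    "web": {1: "Web basics"},
}
index = build_slide_index(toy)
print(sorted(index["trees"]))  # [('trees', 1), ('trees', 2)]
```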

Loading and parsing text and images from google slides

In [8]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import time
import re


def getTextAndImgsFromSlides(url = 'http://tiny.cc/00dhkz'):    
    slides_text = dict() # dictionary slide_num : slide_text
    img_list = [] # list of image urls 
    #TODO: parse google slides and save all text and image urls in slides_text and img_list
    # you should get the contents from ALL slides - however, you will see that at one moment 
    # of time only single slide + few slide previews on the left are visible. To be able to    
    # reach all slides you will need to scroll to and click these previews. While slide contents 
    # can be obtained from previews themselves, speaker notes (which you also have to extract)
    # can be viewed only if a particular slide is open.
    # to scroll the element of interest into view, you can use this: 
    # browser.execute_script("arguments[0].scrollIntoView();", el)
    # to click the element, you can use the ActionChains library   

    options = webdriver.ChromeOptions() 
    options.add_argument('headless')
    options.add_argument('window-size=1200x600')
    browser = webdriver.Chrome(options=options) 

    browser.get(url)
    browser.implicitly_wait(10)  # poll for up to 10 seconds when looking up elements
    actions = ActionChains(browser)
    elems_num = 0
    i = 1
    # get left bar that is visible for now marking slides that are visited
    # click on all of them and then scroll down, update left bar
    
    left_bar = browser.find_elements_by_css_selector('g.punch-filmstrip-thumbnail')
    while True:
        all_visited = True
        for lb in left_bar:
            # get slide number
            slide_n = lb.text.split()
            if len(slide_n)==0:
                continue
            slide_n = slide_n[0]
            if slide_n not in slides_text:
                all_visited = False
                # click on that slide
                lb.click()
                time.sleep(0.1)
                # take info
                text = ''
                images = browser.find_elements_by_tag_name('image')
                slide = browser.find_element_by_xpath("//div[@id='workspace-container']")
                elements = slide.find_elements_by_css_selector('g.sketchy-text-content-text')
                speaker_notes = browser.find_element_by_xpath("//div[@id='speakernotes-workspace']")
                speaker_notes = speaker_notes.find_elements_by_css_selector('g.sketchy-text-content-text')
                for e in elements:
                    text += e.text
                    text += ' '
                for e in speaker_notes:
                    text += e.text
                    text += ' '
                text = text.replace('\n', ' ')
                text = re.sub(' +', ' ', text)
                slides_text[slide_n] = text
        # stop once a full pass finds no new slides; otherwise re-read the filmstrip
        if all_visited:
            break
        left_bar = browser.find_elements_by_css_selector('g.punch-filmstrip-thumbnail')
    
    for img in images:
        img_list.append(img.get_attribute('href'))

    browser.quit()
    return slides_text, img_list
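
The function above relies on the `find_elements_by_*` helpers. To our knowledge, newer Selenium releases (4.3 and later) no longer provide them and expect `find_elements(by, value)` instead. The sketch below shows that call style for the text blocks only; `slide_text_blocks` is a hypothetical helper, and the locator strategy is written as the plain string that `By.CSS_SELECTOR` stands for.

```python
# By.CSS_SELECTOR from selenium.webdriver.common.by is the plain string
# "css selector"; it is spelled out here so the sketch has no import-time
# dependency on Selenium.
CSS_SELECTOR = "css selector"


def slide_text_blocks(browser, selector="g.sketchy-text-content-text"):
    """Return the text of every element matching `selector`,
    with line breaks collapsed into single spaces."""
    elements = browser.find_elements(CSS_SELECTOR, selector)
    return [" ".join(el.text.split()) for el in elements]
```

In real code, `from selenium.webdriver.common.by import By` and `By.CSS_SELECTOR` would be the more readable spelling.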

Parsing three presentations

In [9]:
links = ["https://docs.google.com/presentation/d/1UxjGZPPrPTM_3lCa_gWTk8yZI_qNmTKwtMxr8JZQCIc/edit?usp=sharing", 
         "https://docs.google.com/presentation/d/1LuZvz3axBD8UuHLagdv0EbhsGEWJmpd7gN5KjwYCp9Y/edit?usp=sharing",
         "https://docs.google.com/presentation/d/1bgsCgpjMcQmrFpblRI6oH9SnG4bjyo5SzSSdKxxHNlg/edit?usp=sharing"]

all_imgs = []
all_texts = dict()

for i, link in enumerate(links):
    texts, imgs = getTextAndImgsFromSlides(link)
    all_imgs += imgs
    for slide_n in texts:
        all_texts[f'{i}_{slide_n}'] = texts[slide_n]

### 2.2 Tests

In [10]:
texts, imgs = getTextAndImgsFromSlides('http://tiny.cc/00dhkz')

assert len(texts) == 35 # equal to the total number of slides in the presentation 
print(len(texts))

assert len(imgs) > 26 # can be more than that due to visitor icons
print(len(imgs))

assert any("Navigable" in value for value in texts.values()) # word is on a slide
assert any("MINUS" in value for value in texts.values()) # word is in speaker notes

35
53


Building an inverted index and searching with it (boolean retrieval is good enough)
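
Boolean retrieval here means AND semantics: a document matches when it contains every query term. The sketch below runs on a toy index that maps terms directly to sets of document ids, which is a simpler layout than the one used in the cell that follows. Skipping terms that are absent from the index is one possible policy, not the only one, and `boolean_and` is a hypothetical name.

```python
def boolean_and(terms, index):
    """Return ids of documents that contain every term found in `index`.
    Terms missing from the index are skipped (one possible policy);
    if no term is known, the result is an empty set."""
    postings = [index[t] for t in terms if t in index]
    if not postings:
        return set()
    return set.intersection(*postings)


# Toy index, invented for illustration only.
toy_index = {
    "search": {"1_2", "1_3", "2_7"},
    "trees": {"1_3", "1_5"},
}
print(boolean_and(["search", "trees"], toy_index))  # {'1_3'}
print(boolean_and(["unknown"], toy_index))          # set()
```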

In [11]:
#TODO: build an inverted index and enable search in it

import nltk
from collections import Counter
nltk.download('stopwords')

class Preprocessor:
    
    def __init__(self):
        self.stop_words = nltk.corpus.stopwords.words('english')
        self.ps = nltk.stem.PorterStemmer()


    # word tokenize text using nltk lib
    def tokenize(self, text):
        return nltk.word_tokenize(text)


    # stem word using provided stemmer
    def stem(self, word, stemmer):
        return stemmer.stem(word)


    # check if word is appropriate - not a stop word and isalpha, 
    # i.e consists of letters, not punctuation, numbers, dates
    def is_apt_word(self, word):
        return word not in self.stop_words and word.isalpha()


    # combines all previous methods together
    # tokenizes lowercased text and stems it, ignoring not appropriate words
    def preprocess(self, text):
        tokenized = self.tokenize(text.lower())
        return [self.stem(w, self.ps) for w in tokenized if self.is_apt_word(w)]

def build_inverted_index(files_data):
    index = dict()

    def index_doc(doc_content, doc_id):
        prep = Preprocessor()
        doc_content = prep.preprocess(doc_content)
        article_index = Counter(doc_content)
        for term in article_index.keys():
            article_freq = article_index[term]
            if term not in index:
                index[term] = [article_freq, (doc_id, article_freq)]
            else:
                index[term][0] += article_freq
                index[term].append((doc_id, article_freq))

    # build search index from files: the slide key doubles as the document id
    for file_name in files_data:
        index_doc(files_data[file_name], file_name)
    return index

class QueryProcessing:
    @staticmethod
    def prepare_query(raw_query):
        prep = Preprocessor()
        # pre-process query the same way as documents
        query = prep.preprocess(raw_query)
        # count frequency
        return Counter(query)
    
    @staticmethod
    def boolean_retrieval(query, index):
        postings = []
        for term in query.keys():
            if term not in index:  # ignoring absent terms
                continue
            posting = index[term][1:]
            # extract document info only
            posting = [i[0] for i in posting]
            postings.append(posting)
        if not postings:  # none of the query terms is in the index
            return set()
        docs = set.intersection(*map(set, postings))

        return docs

def find(query, index):
    #TODO implement search procedure
    # preprocess query
    query = QueryProcessing.prepare_query(query)
    return QueryProcessing.boolean_retrieval(query, index)


inverted_index = build_inverted_index(all_texts)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Rufina\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### 2.3 Tests

In [12]:
queries = ["architecture", "algorithm", "function", "dataset", 
           "Protasov", "cosine", "модель", "например"]

for query in queries:
    r = find(query, inverted_index)
    print("Results for: ", query)
    print("\t", r)
    assert len(r) > 0, "Query should return at least 1 document"
    assert len(r) > 1, "Query should return at least 2 documents"

Results for:  architecture
	 {'2_53', '0_16', '0_5', '0_19'}
Results for:  algorithm
	 {'1_13', '1_32', '1_14', '1_34', '2_22'}
Results for:  function
	 {'1_11', '2_34', '1_10'}
Results for:  dataset
	 {'0_9', '1_27', '1_23'}
Results for:  Protasov
	 {'0_1', '1_1', '2_1'}
Results for:  cosine
	 {'0_16', '1_1'}
Results for:  модель
	 {'0_16', '2_51'}
Results for:  например
	 {'0_4', '2_38', '2_53', '2_20', '2_13'}
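
Result keys such as `'2_53'` combine the position of the lecture in the `links` list with the slide number. A small sketch for turning them back into something readable follows; `decode_hit` is a hypothetical helper, the titles are taken from the link list in section 2, and it assumes the slide number read from the thumbnail is always numeric.

```python
def decode_hit(key, lecture_titles):
    """Split an index key such as '2_53' into (lecture title, slide number)."""
    lecture_idx, slide_n = key.split("_", 1)
    return lecture_titles[int(lecture_idx)], int(slide_n)


# Titles follow the order of the `links` list used when the index was built.
titles = ["4. Vector space", "6. search trees", "7-8. Web basics"]
for hit in sorted({"2_53", "0_16"}):
    print(decode_hit(hit, titles))
```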


### 2.4 Images saved
Save all images used in a presentation as separate files.

In [13]:
#TODO: load and save all images from slides on disk (from one | all presentations, both are fine)
import urllib.request

for idx, img_url in enumerate(all_imgs):
    try:
        urllib.request.urlretrieve(img_url, f'{idx}.jpg')
    except ValueError:
        continue  # not a downloadable URL (e.g. a video)
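
The loop above saves every file as `.jpg` and relies on an exception to skip entries that are not URLs. A possible refinement, sketched here with the standard library only, is to filter the list up front and to pick the extension from the response's Content-Type header. Both helper names are hypothetical, and how the header is obtained (for example from the object returned by `urllib.request.urlopen`) is left out.

```python
import mimetypes
from urllib.parse import urlparse


def is_http_url(value):
    """True for absolute http(s) URLs; False for None, empty or other schemes."""
    if not value:
        return False
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def extension_for(content_type, default=".bin"):
    """Pick a file extension from a Content-Type header value."""
    mime = content_type.split(";", 1)[0].strip()
    return mimetypes.guess_extension(mime) or default


print(is_http_url("https://example.com/picture.png"))
print(extension_for("image/png"))
```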

## 3. Bonus task.  Links exploration

Find all external links in these presentations and index them for search too. That is, parse them and extend the inverted index you built with the content of the external links: when searching by word we should get not only the slides that contain it, but also any links that are mentioned in the slides and whose content contains the query word.
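
One small building block for this task is deciding which collected `href` values count as external. The sketch below keeps unique http(s) links whose host is not on a list of internal hosts; treating only `docs.google.com` as internal is an assumption, and `external_links` is a hypothetical helper. It says nothing about how the links are extracted from the slides.

```python
from urllib.parse import urlparse


def external_links(hrefs, internal_hosts=("docs.google.com",)):
    """Keep unique http(s) links whose host is not in `internal_hosts`,
    preserving first-seen order."""
    seen, result = set(), []
    for href in hrefs:
        if not href:
            continue
        parsed = urlparse(href)
        if parsed.scheme not in ("http", "https"):
            continue
        if parsed.hostname in internal_hosts or href in seen:
            continue
        seen.add(href)
        result.append(href)
    return result


sample = [
    "https://docs.google.com/presentation/d/abc",
    "https://research.fb.com/blog/2014/09/fast-randomized-svd/",
    None,
    "https://research.fb.com/blog/2014/09/fast-randomized-svd/",
    "mailto:someone@example.com",
]
print(external_links(sample))
```

Each surviving URL could then be fetched, converted to text, and indexed under the URL itself as the document id, next to the slide keys.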

In [16]:
!pip install html2text

Collecting html2text
  Downloading https://files.pythonhosted.org/packages/ae/88/14655f727f66b3e3199f4467bafcc88283e6c31b562686bf606264e09181/html2text-2020.1.16-py3-none-any.whl
Installing collected packages: html2text
Successfully installed html2text-2020.1.16



In [41]:
# #TODO: bonus task
# import requests
# from urllib.parse import quote
# from bs4 import BeautifulSoup
# from bs4.element import Comment
# import urllib.parse
# import os
# import html2text

# def get_url_content(url):
#     response = requests.get(url)
#     if response.status_code == 200:
#         h = html2text.HTML2Text()
#         return h.handle(response.content.decode('utf-8'))
#     else:
#         return None

# # print(get_url_content('https://research.fb.com/blog/2014/09/fast-randomized-svd/'))

# def getTextAndImgsAndLinksFromSlides(url = 'http://tiny.cc/00dhkz'):    
#     slides_text = dict() # dictionary slide_num : slide_text
#     img_list = [] # list of image urls 
#     links_list = []

#     options = webdriver.ChromeOptions() 
#     options.add_argument('headless')
#     options.add_argument('window-size=1200x600')
#     browser = webdriver.Chrome(options=options) 

#     browser.get(url)
#     browser.implicitly_wait(10)  # poll for up to 10 seconds when looking up elements
#     actions = ActionChains(browser)
#     elems_num = 0
#     i = 1
#     # get left bar that is visible for now marking slides that are visited
#     # click on all of them and then scroll down, update left bar
    
#     left_bar = browser.find_elements_by_css_selector('g.punch-filmstrip-thumbnail')
#     while True:
#         all_visited = True
#         for lb in left_bar:
#             # get slide number
#             slide_n = lb.text.split()
#             if len(slide_n)==0:
#                 continue
#             slide_n = slide_n[0]
#             if slide_n not in slides_text:
#                 all_visited = False
#                 # click on that slide
#                 lb.click()
#                 # take info
#                 text = ''
#                 images = browser.find_elements_by_tag_name('image')
#                 slide = browser.find_element_by_xpath("//div[@id='workspace-container']")
#                 elements = slide.find_elements_by_css_selector('g.sketchy-text-content-text')
#                 speaker_notes = browser.find_element_by_xpath("//div[@id='speakernotes-workspace']")
#                 speaker_notes = speaker_notes.find_elements_by_css_selector('g.sketchy-text-content-text')
#                 for e in elements:
#                     text += e.text
#                     text += ' '
#                     # click on a text to obtain the link
#                     e.click()
#                     link = slide.find_element_by_css_selector('div.docs-bubble.docs-linkbubble-bubble.docs-linkbubble-link-preview')
#                     link = slide.find_element_by_tag_name('a')
#                     links_list.append(link.get_attribute('href'))
#                 for e in speaker_notes:
#                     text += e.text
#                     text += ' '
#                 text = text.replace('\n', ' ')
#                 text = re.sub(' +', ' ', text)
#                 slides_text[slide_n] = text
#         # now we check if at least one slide was new for now, if there was updates we scroll down
#         if all_visited:
#             break
#         left_bar = browser.find_elements_by_css_selector('g.punch-filmstrip-thumbnail')
    
#     for img in images:
#         img_list.append(img.get_attribute('href'))
#     for link in links_list:
#         print(link)
#     print(len(links_list))

#     browser.quit()
#     return slides_text, img_list

# out = getTextAndImgsAndLinksFromSlides()

AttributeError: 'WebElement' object has no attribute 'w3c'