# Web Scraping: Selenium
_Automate your browser._ <br>
_Collect data from dynamically generated web pages or those requiring user interaction._

### Docs

- [Selenium homepage](https://www.seleniumhq.org/) 
- [Selenium documentation](https://selenium-python.readthedocs.io/) - unofficial, but helpful

### Installation

With conda:
- `conda install -c conda-forge selenium`

With pip:
- `pip install -U selenium`

#### ChromeDriver

You will also need to install a web driver to use Selenium.  ChromeDriver is recommended but others are also available.

1. Check your browser's version _(Chrome > About Google Chrome)_
![Browser Version](images/browser_version.png) 
<br>
2. Navigate to the [ChromeDriver downloads page](https://sites.google.com/a/chromium.org/chromedriver/downloads).
<br><br>
3. Download appropriately based on your browser's version and your OS.
![Download ChromeDriver zip file](images/chromedriver_options.png)

4. Unzip the driver.
<br><br>
5. Move the unzipped driver to your Applications folder (or wherever your Chrome application is).

In [21]:
ls

 Volume in drive C is Windows
 Volume Serial Number is A020-D39D

 Directory of C:\Users\fahadd\t5_bootcamp_directory\PRACTICE\gamma\NBM_Regression_Gamma

11/22/2021  11:12 AM    <DIR>          .
11/22/2021  11:12 AM    <DIR>          ..
11/21/2021  09:54 AM    <DIR>          careers
11/22/2021  11:08 AM    <DIR>          chromedriver_win32
11/21/2021  09:54 AM    <DIR>          curriculum
11/21/2021  09:54 AM    <DIR>          pairs
11/21/2021  09:54 AM             9,690 README.md
11/21/2021  09:54 AM    <DIR>          resources
               1 File(s)          9,690 bytes
               7 Dir(s)  693,355,278,336 bytes free


In [29]:
from bs4 import BeautifulSoup
import requests
import time, os

In [30]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# driver= webdriver.Chrome(ChromeDriverManager().install())


chromedriver = "chromedriver_win32/chromedriver.exe" # path to the chromedriver executable
os.environ["webdriver.chrome.driver"] = chromedriver

## Example 1 - YouTube

### Dynamic Pages

Some pages serve their content dynamically, which means they could look different each time they are loaded into the browser.  HTML that you see by inspecting elements in your browser might be missing from `requests` and `BeautifulSoup` because it is generated at access time.

In [45]:
# query = "data science"
youtube_search = "https://www.youtube.com"
# youtube_query = youtube_search + query.replace(' ', '+')
page = requests.get(youtube_search).text

soup = BeautifulSoup(page, 'html5lib')
print(soup.find('div'))

<div class="ytd-searchbox-spt" id="search-container" slot="search-container"></div>


In [36]:
query = "data science"
youtube_search = "https://www.youtube.com/results?search_query="
youtube_query = youtube_search + query.replace(' ', '+')
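Replacing spaces with `+` works for simple queries, but characters such as `&` or `+` inside a query would change the meaning of the URL. As a minimal sketch, the standard library's `urlencode` can build the same search URL safely. The helper name `build_search_url` and the constant `YOUTUBE_RESULTS_URL` are introduced here for illustration only.

```python
from urllib.parse import urlencode

# Base URL taken from the cell above, without the query string
YOUTUBE_RESULTS_URL = "https://www.youtube.com/results"


def build_search_url(query: str) -> str:
    """Return a YouTube results URL with the query safely URL-encoded."""
    # urlencode turns spaces into '+' and escapes characters such as '&'
    return f"{YOUTUBE_RESULTS_URL}?{urlencode({'search_query': query})}"


print(build_search_url("data science"))
print(build_search_url("R & Python"))
```

For `"data science"` this produces the same URL as the string replacement above, so the rest of the notebook is unaffected.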

In [37]:
page = requests.get(youtube_query).text

soup = BeautifulSoup(page, 'html5lib')

In [38]:
soup.find('div', id='contents')

Uh oh.  The video links should be under the contents div, but it's missing from our request.

> **QUESTION**: Why do you think this happened?

One option is to first load the page with Selenium THEN parse the page's HTML with BeautifulSoup.

First we launch the YouTube search page through our ChromeDriver.  A new browser should pop up.  **To continue using Selenium, keep this window open!**

In [40]:
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(chromedriver))
driver.get(youtube_query)


We can access the page's HTML through the driver:

In [None]:
driver.page_source[:1000]

Now we parse this with `BeautifulSoup` and the video information appears!

In [41]:
soup = BeautifulSoup(driver.page_source, 'html.parser')

In [42]:
soup.find('div', id='contents')

<div class="style-scope ytd-section-list-renderer" id="contents"><ytd-item-section-renderer class="style-scope ytd-section-list-renderer" use-height-hack=""><!--css-build:shady--><div class="style-scope ytd-item-section-renderer" id="header"></div>
<div class="style-scope ytd-item-section-renderer" id="spinner-container">
<tp-yt-paper-spinner-lite aria-hidden="true" class="style-scope ytd-item-section-renderer"><!--css-build:shady--><div class="style-scope tp-yt-paper-spinner-lite" id="spinnerContainer"><div class="spinner-layer style-scope tp-yt-paper-spinner-lite"><div class="circle-clipper left style-scope tp-yt-paper-spinner-lite"><div class="circle style-scope tp-yt-paper-spinner-lite"></div></div><div class="circle-clipper right style-scope tp-yt-paper-spinner-lite"><div class="circle style-scope tp-yt-paper-spinner-lite"></div></div></div></div></tp-yt-paper-spinner-lite>
</div>
<div class="style-scope ytd-item-section-renderer" id="contents"><ytd-promoted-sparkles-web-renderer 

In [43]:
contents_div = soup.find('div', id='contents')

for title in contents_div.find_all('a', id='video-title'):
    print(title.text.strip())

Data Science In 5 Minutes | Data Science For Beginners | What Is Data Science? | Simplilearn
Learn Data Science Tutorial - Full Course for Beginners
What REALLY is Data Science? Told by a Data Scientist
Is Data Science Dying?
How I Would Learn Data Science (If I Had to Start Over)
Главные профессии в ИИ - Data Science, Data Analyst, Data Engineer | Академия ИИ
Why you should not be a data scientist
A Day in The Life of a Google Data Scientist (Analytics)
يعني إيه علم البيانات Data Science | #إسأل_مصطفى
Data science - كل ما تود معرفته عن مجال علم البيانات
تعريف علم البيانات | What is Data Science?
علم البيانات - data science
Data Scientist vs Data Analyst (funny!)
MUST WATCH Before Choosing Data Analyst As Your Career!
Why I Left Data Engineering for Data Science
My Data Science Journey with Non-Tech Background


> **QUESTION**: We only got about 20 video titles -- surely there are more videos about data science.  What do you think is happening?

### Interacting with Pages

We can also interact with pages using Selenium.  For example, we can 
- click
- type into input fields
- scroll
- drag and drop, etc.

If we want more data science video titles, we need to scroll down to the bottom of the screen for more videos to populate.

In [None]:
for i in range(5):
    #Scroll
    driver.execute_script(
        "window.scrollTo(0, document.documentElement.scrollHeight);" #Alternatively, document.body.scrollHeight
    )
    
    #Wait for page to load
    time.sleep(1)
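A fixed number of scrolls either stops too early or keeps scrolling after the page has run out of results. One possible variation, sketched below, compares the page height before and after each scroll and stops when it no longer grows. The function name `scroll_to_bottom` and its parameters `pause` and `max_scrolls` are assumptions made for this example; it only relies on `driver.execute_script`, which the loop above already uses.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_scrolls=20):
    """Scroll until the page height stops growing or max_scrolls is reached.

    Returns the number of scrolls that loaded new content.
    """
    get_height = "return document.documentElement.scrollHeight;"
    last_height = driver.execute_script(get_height)
    loaded = 0
    for _ in range(max_scrolls):
        driver.execute_script(
            "window.scrollTo(0, document.documentElement.scrollHeight);"
        )
        time.sleep(pause)  # give the page time to append new results
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new appeared, so stop early
        last_height = new_height
        loaded += 1
    return loaded
```

Calling `scroll_to_bottom(driver)` would replace the fixed loop. If the page loads slowly, a `pause` that is too short can make the function stop before new results arrive, so the value may need tuning.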

In [None]:
soup = BeautifulSoup(driver.page_source, 'html.parser')

In [None]:
contents_div = soup.find('div', id='contents')

len(contents_div.find_all('a', id='video-title'))

Awesome!  Now we have several more videos to analyze and we could continue scrolling if we wanted even more.

What if we want to perform a new search for machine learning?

In [None]:
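# "xpath" is the locator strategy (the value of By.XPATH);
# find_element_by_xpath was removed in Selenium 4.3
search_box = driver.find_element("xpath", "//input[@id='search']")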

#clear the current search
search_box.clear()

#input new search
search_box.send_keys("machine learning")

#hit enter
search_box.send_keys(Keys.RETURN)  

time.sleep(1)
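A fixed `time.sleep(1)` can be too short on a slow connection and wasteful on a fast one. Selenium provides `WebDriverWait` (in `selenium.webdriver.support.ui`) for waiting on a condition instead. The sketch below shows the same polling idea in plain Python so the logic is visible; `wait_until` is a hypothetical helper written for this example, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)  # wait briefly before checking again
```

With a live driver, something like `wait_until(lambda: driver.find_elements("xpath", "//a[@id='video-title']"))` would return the list of title links as soon as at least one is present, since `find_elements` returns an empty list when nothing matches.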

And can we filter to short videos (< 4 minutes) only?

In [None]:
filter_button = driver.find_element(
    "xpath", '//a[contains(@class, "ytd-toggle-button")]'
)
filter_button.click()

In [None]:
short_link = driver.find_element(
    "xpath", '//div[contains(@title, "Search for Short")]'
)
short_link.click()

Now we can either parse the page source with Beautiful Soup like before or pull text directly.  

For example, the title of the first short ML video (that isn't an ad!) can be found with:

In [None]:
first_title = driver.find_element("xpath", "//a[@id='video-title']")
first_title.text

In [None]:
first_author = driver.find_element(
    "xpath", "//ytd-video-renderer//ytd-channel-name//a"
)
first_author.text

#### Notes

- Check [here](https://www.w3schools.com/xml/xpath_syntax.asp) for additional help writing XPath selectors.

- To select multiple elements, switch to `driver.find_elements("xpath", ...)`, which returns a list of matching elements. (Selenium 4.3 removed the older `find_elements_by_xpath(...)` form.)

- You can also access elements by id, name, etc.  Check [the docs](https://selenium-python.readthedocs.io/locating-elements.html) for more options.

Finally, when you are finished with the driver, be sure to close it.

In [None]:
driver.close()
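If a cell raises an error halfway through, the browser can be left running in the background. One way to guard against that is a small context manager that always shuts the driver down. This is a sketch only: `browser_session` and its `make_driver` argument are hypothetical names, and it calls `quit()` rather than `close()` so that every window in the session is closed.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    """Create a driver, hand it to the caller, and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()  # ends the whole session, even if scraping raised an error


# Example usage (needs a real browser, so it is left as a comment):
# with browser_session(lambda: webdriver.Chrome()) as driver:
#     driver.get("https://www.youtube.com")
```

Passing a function that builds the driver, instead of the driver itself, keeps the helper independent of how the browser is configured.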

## Example 2 - Open Table  _(Optional)_

Let's try one more example: gathering information from Open Table about restaurants with available reservation slots.

In [None]:
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service(chromedriver))
driver.get('http://www.opentable.com/')
time.sleep(1)  #pause to be sure page has loaded

Inspecting this page, we see that the drop-down for picking the number of people has the **aria-label** `Party size selector`. Let's set the reservation for 4 people:

In [None]:
people_dropdown = driver.find_element("xpath", '//select[@aria-label="Party size selector"]')
people_dropdown.send_keys("4 people")
time.sleep(1)

Now select the reservation date: 3 days from now.

In [None]:
from datetime import datetime, timedelta

In [None]:
# Reservation date: three days from today, formatted like "Thu, Nov 25, 2021"
res_date = (datetime.today() + timedelta(days=3)).strftime('%a, %b %d, %Y')
res_date
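The date string has to match the calendar's `aria-label` exactly, and `%d` always zero-pads the day (`05`, not `5`). Whether the site pads single-digit days is not something this notebook confirms, so the sketch below supports both forms. The helper `reservation_label` and its `zero_pad` flag are assumptions made for illustration.

```python
from datetime import date, timedelta


def reservation_label(start: date, days_ahead: int = 3, zero_pad: bool = True) -> str:
    """Format the date `days_ahead` days after `start` as a calendar label."""
    target = start + timedelta(days=days_ahead)
    if zero_pad:
        return target.strftime("%a, %b %d, %Y")  # e.g. "Fri, Nov 05, 2021"
    # Build the day from target.day to avoid the leading zero
    return f"{target:%a, %b} {target.day}, {target:%Y}"


print(reservation_label(date(2021, 11, 22)))                # Thu, Nov 25, 2021
print(reservation_label(date(2021, 11, 2), zero_pad=False))  # Fri, Nov 5, 2021
```

If clicking the date fails with a "no such element" error, comparing this string with the label shown in the browser's inspector is a quick first check.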

In [None]:
#Expand the calendar
date_picker = driver.find_element("xpath", '//div[@aria-label="Date selector"]')
date_picker.click()
time.sleep(1)

In [None]:
#Select the date three days from now
date_element = driver.find_element("xpath", f'//div[@aria-label="{res_date}"]')
date_element.click()
time.sleep(1)

Set our reservation time for 8 PM.

In [None]:
time_dropdown = driver.find_element("xpath", '//select[@aria-label="Time selector"]')
time_dropdown.send_keys("8:00 PM")
time.sleep(1)

And search!

In [None]:
search_button = driver.find_element("xpath", "//*[contains(text(), 'Let’s go')]")
search_button.click()
time.sleep(1)

On this new page we find a long list of restaurants with available reservations for 4 people at roughly our desired day/time.  At this point we could grab the HTML (`driver.page_source`) and parse with BeautifulSoup.  

In [None]:
soup = BeautifulSoup(driver.page_source, 'html.parser')

In [None]:
for rest in soup.find_all('div', class_='rest-row-header')[:20]:
    print(rest.find('a').text)

Or we could click into an individual restaurant to learn more.

In [None]:
first_rest = driver.find_element("xpath", '//div[@class="rest-row-header"]//a')
first_rest.click()

In [None]:
rest_soup = BeautifulSoup(driver.page_source, 'html.parser')

print(rest_soup.title.text)

> **QUESTION**:  Why doesn't the title of this page match up with this individual restaurant?

In [None]:
#Switch windows!
driver.switch_to.window(driver.window_handles[1])
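Indexing `window_handles[1]` assumes the new tab is the second handle in the list. A slightly more defensive sketch records the handles before the click and then switches to whichever handle is new. The helper name `switch_to_new_window` and its `handles_before` parameter are hypothetical; it uses only `driver.window_handles` and `driver.switch_to.window`, both of which appear above.

```python
def switch_to_new_window(driver, handles_before):
    """Switch to a window that was not open when handles_before was recorded.

    Returns the handle that was switched to.
    """
    new_handles = [h for h in driver.window_handles if h not in handles_before]
    if not new_handles:
        raise RuntimeError("no new window or tab has been opened")
    driver.switch_to.window(new_handles[0])
    return new_handles[0]


# Example usage with a live driver (left as comments):
# before = driver.window_handles
# first_rest.click()
# switch_to_new_window(driver, before)
```

This also works when more than two windows are already open, which a fixed index would not handle.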

In [None]:
rest_soup = BeautifulSoup(driver.page_source, 'html.parser')

print(rest_soup.title.text)

As usual when working with Selenium, make sure to close your browser.  Since we have two windows up, we use `driver.quit()` to close the entire browser session.

In [None]:
driver.quit()