# Trump's Lies

In [19]:
%%html
<center><iframe src="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" width="960" height="400">
</iframe> </center>


---


# Web Scraping


Using the Python programming language, it is possible to "scrape" data from the web in a quick and efficient manner.

**Web Scraping** commonly refers to the practice of writing an automated program
that queries a web server, requests data (usually in the form of HTML and other files
that compose web pages), and then parses that data to extract needed information.



Web scraping is a valuable tool in the data scientist's skill set.

<span style="margin-right: 10%;  margin-left: 20%;">[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)</span>
<span style="margin-right: 10%;  margin-left: 15%;">[Selenium](https://www.selenium.dev)</span>

<p>

<img src="https://www.crummy.com/software/BeautifulSoup/10.1.jpg" style="float: left; margin-bottom: 1.5em; margin-top: 1.5em; margin-right: 10%;  margin-left: 15%; " width=220/>
 <img src="https://www.selenium.dev/images/selenium_logo_large.png" style="float: left; margin-top: 7.5em;" width=280/>



---


# A Detour to Python Modules

Most of Python's functionality is provided by **modules**, which live in the **standard library** or in third-party packages. A module typically corresponds to a Python program file that defines the names (functions, classes, constants) we want to import and use.

We can use <a href="https://docs.python.org/3.7/reference/simple_stmts.html#grammar-token-import-stmt" > the `import` statement</a> to read in a whole module or package:

In [22]:
import random

The first time a module or a package is imported, Python creates a module object:

In [16]:
random

<module 'random' from 'C:\\ProgramData\\Anaconda3\\lib\\random.py'>

Using `dir()`, we can find the name `random` imported to the current namespace:


In [2]:
print(dir())

['In', 'Out', '_', '__', '___', '__builtin__', '__builtins__', '__doc__', '__loader__', '__name__', '__package__', '__spec__', '_dh', '_i', '_i1', '_i2', '_ih', '_ii', '_iii', '_oh', 'exit', 'get_ipython', 'quit', 'random']


Like classes and instances, modules are also self-contained namespace objects:

In [6]:
print(dir(random))

['BPF', 'LOG4', 'NV_MAGICCONST', 'RECIP_BPF', 'Random', 'SG_MAGICCONST', 'SystemRandom', 'TWOPI', '_Sequence', '_Set', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '_accumulate', '_acos', '_bisect', '_ceil', '_cos', '_e', '_exp', '_inst', '_log', '_os', '_pi', '_random', '_repeat', '_sha512', '_sin', '_sqrt', '_test', '_test_generator', '_urandom', '_warn', 'betavariate', 'choice', 'choices', 'expovariate', 'gammavariate', 'gauss', 'getrandbits', 'getstate', 'lognormvariate', 'normalvariate', 'paretovariate', 'randint', 'random', 'randrange', 'sample', 'seed', 'setstate', 'shuffle', 'triangular', 'uniform', 'vonmisesvariate', 'weibullvariate']


All of the module's functionality is now available, but we must use dot notation (a.k.a. attribute reference notation) to refer to an individual name:

In [12]:
random.__file__           # give the location where the module was found in the file system

'C:\\ProgramData\\Anaconda3\\lib\\random.py'

In [9]:
random.randint(0, 10)     # return a random integer in range [a, b], including both end points; help(random.randint)

9



Alternatively, we can import only a few selected functionalities from a module by explicitly listing them with the `from` form:

In [3]:
from random import randint, sample

In [4]:
print(dir())

['In', 'Out', '_', '__', '___', '__builtin__', '__builtins__', '__doc__', '__loader__', '__name__', '__package__', '__spec__', '_dh', '_i', '_i1', '_i2', '_i3', '_i4', '_ih', '_ii', '_iii', '_oh', 'exit', 'get_ipython', 'quit', 'randint', 'random', 'sample']


Then we don't need to use the prefix every time we use the imported names:

In [16]:
randint(0, 10)

8

In [17]:
sample(range(100), 20)     #  help(sample)

[37, 6, 84, 82, 96, 73, 78, 38, 19, 51, 88, 35, 12, 94, 55, 47, 65, 58, 36, 60]
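
The two import forms above can also bind names under an alias with `as`. The sketch below is only an illustration: the alias names `rnd` and `ri` are made up for this example. It also checks that repeated imports reuse the single module object that Python caches in `sys.modules`.

```python
import sys
import random
import random as rnd                # bind the same module object to a second name
from random import randint as ri    # bind a single function under a shorter name

# Both names refer to the one module object cached in sys.modules.
print(rnd is random)
print(sys.modules["random"] is random)

# The aliased function is the very same object as random.randint.
print(ri is random.randint)

# Seeding makes the pseudo-random sequence repeatable, whichever name we call it by.
rnd.seed(42)
first = rnd.randint(0, 10)
rnd.seed(42)
second = ri(0, 10)
print(first == second)
```

Aliases are mostly a convenience for long module names; they do not create a second copy of the module.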

We can use the command below to check the installed packages:

In [None]:
!pip freeze     # packages are listed in a case-insensitive sorted order

<div class="alert alert-info">Any command that works at the system command line can be used in a Jupyter notebook by prefixing it with the ! character.</div>



---


# HTML

<br>

[**HyperText Markup Language**](https://developer.mozilla.org/en-US/docs/Web/HTML) (HTML for short) is a markup language for describing web documents.


---

 ```html
<html>
<head><title>Sample HTML Page</title></head>
  
<body>
<h1>This is a heading.</h1>
<p>This is a typical paragraph.</p>
<p class = "notThisOne"> This is a paragraph of the "notThisOne" class. </p>
<p id = "thisOne"> 
But I only want this <a href = "sample.html">paragraph</a>.
</p>
</body>
  
</html>
```


<html>
<head><title>Sample HTML Page</title></head>
  
<body>
<h1>This is a heading.</h1>
<p>This is a typical paragraph.</p>
<p class = "notThisOne"> This is a paragraph of the "notThisOne" class. </p>
<p id = "thisOne"> 
But I only want this <a href = "sample.html"> paragraph</a>. </p>
</body>
  
</html>


---

HTML elements are written with a start tag, an end tag, and with the content in between: `<tagname>content</tagname>`.

- `<h1>`, `<h2>`,..., `<h6>`: largest heading, second largest heading, etc.
- `<p>`: paragraphs
- `<ul>` or `<ol>`: unordered (bulleted) or ordered (numbered) list
- `<li>`: individual list item
- `<div>`: division or section
- `<table>`: table
- `<img>`: image
- `<a>`: anchor
- and many others ...


The tags typically contain the textual content we wish to scrape and may include attributes. 


```html
<tag attribute1="value1" attribute2="value2">content</tag>
```


These textual components form a hierarchical tree, where the top-level `<html>` tag contains the `<head>` and `<body>` tags, which in turn contain other text content and tags, and so on:




<img src="https://raw.githubusercontent.com/justinjiajia/img/master/python/html_nodes.PNG" width=300/>

 ```html
<html>
<head><title>Sample HTML Page</title></head>
  
<body>
<h1>This is a heading.</h1>
<p>This is a typical paragraph.</p>
<p class = "notThisOne"> This is a paragraph of the "notThisOne" class. </p>
<p id = "thisOne"> 
But I only want this <a href = "sample.html"> paragraph</a>. </p>
</body>
  
</html>
```
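
To make the hierarchy concrete, here is a minimal sketch that prints the tag tree of the sample page using only the standard library's `html.parser` module. The class name `TreeOutliner` is a made-up helper for this illustration, and the depth counting assumes every tag is explicitly closed, as in the sample; void tags such as `<img>` would need extra handling.

```python
from html.parser import HTMLParser

SAMPLE = """<html>
<head><title>Sample HTML Page</title></head>
<body>
<h1>This is a heading.</h1>
<p>This is a typical paragraph.</p>
<p class="notThisOne">This is a paragraph of the "notThisOne" class.</p>
<p id="thisOne">But I only want this <a href="sample.html">paragraph</a>.</p>
</body>
</html>"""


class TreeOutliner(HTMLParser):
    """Record each start tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []          # list of (depth, tag name) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1            # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1            # leaving the tag returns to the parent's level


outliner = TreeOutliner()
outliner.feed(SAMPLE)
for depth, tag in outliner.outline:
    print("  " * depth + tag)      # indent each tag by its depth
```

The indentation of the printed names mirrors the nesting in the diagram: `head` and `body` sit under `html`, and the `a` tag sits under the last `p`.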

---


# Beautiful Soup



<img src="https://www.crummy.com/software/BeautifulSoup/10.1.jpg"  width=220/>

>Beautiful Soup, so rich and green,<br>
&nbsp;&nbsp;Waiting in a hot tureen!<br>
&nbsp;&nbsp;Who for such dainties would not stoop?<br>
&nbsp;&nbsp;Soup of the evening, beautiful Soup!






<br>

[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) is a Python library for pulling data out of HTML and XML files. 

BeautifulSoup helps format and organize the messy web (**tag soup**) by fixing bad HTML and presenting us with easily traversable Python objects representing XML structures.


In [8]:
from bs4 import BeautifulSoup

The most commonly used object in the BeautifulSoup library is the `BeautifulSoup` object. 

Let's pass a string of HTML source code into the `BeautifulSoup` constructor to "***make the soup***":



In [33]:
html_sample_code = '<html><head><link href="example.css" rel="stylesheet" type="text/css"><title> \
Sample HTML Page</title></head><body><h1>This is a heading.</h1><p>This is a typical paragraph. \
</p><p class = "notThisOne">This is a paragraph of the "notThisOne" class.</p><p id = "thisOne"> \
But I only want this <a href = "sample.html">paragraph</a>.</p></body></html>'



bs_sample = BeautifulSoup(html_sample_code, 'html.parser')     # pass it into the BeautifulSoup constructor


- The 1st argument is the HTML text the object is based on, and the 2nd specifies the parser that we want BeautifulSoup to use in order to create a `BeautifulSoup` object.

 

In [34]:
bs_sample

<html><head><link href="example.css" rel="stylesheet" type="text/css"/><title> Sample HTML Page</title></head><body><h1>This is a heading.</h1><p>This is a typical paragraph. </p><p class="notThisOne">This is a paragraph of the "notThisOne" class.</p><p id="thisOne"> But I only want this <a href="sample.html">paragraph</a>.</p></body></html>

In [109]:
type(bs_sample)

bs4.BeautifulSoup

A `BeautifulSoup` object represents the parsed document as a whole, and for most purposes, can be viewed as a `Tag` object, which contains strings and other tags as its children.

We can use the `prettify()` method to format the HTML source code so as to visualize its structure:


In [35]:
print(bs_sample.prettify())             

<html>
 <head>
  <link href="example.css" rel="stylesheet" type="text/css"/>
  <title>
   Sample HTML Page
  </title>
 </head>
 <body>
  <h1>
   This is a heading.
  </h1>
  <p>
   This is a typical paragraph.
  </p>
  <p class="notThisOne">
   This is a paragraph of the "notThisOne" class.
  </p>
  <p id="thisOne">
   But I only want this
   <a href="sample.html">
    paragraph
   </a>
   .
  </p>
 </body>
</html>



`BeautifulSoup` provides many attributes for navigating and iterating over a tag's children. The simplest way to navigate is to use the name of the child tag we want as an attribute:

In [26]:
bs_sample.html.body        # the <body> tag is beneath the <html> tag

<body><h1>This is a heading.</h1><p>This is a typical paragraph. </p><p class="notThisOne">This is a paragraph of the "notThisOne" class.</p><p id="thisOne"> But I only want this <a href="sample.html">paragraph</a>.</p></body>

In [68]:
bs_sample.body             # we can also call the <body> tag directly as long as there's no ambiguity

<body><h1>This is a heading.</h1><p>This is a typical paragraph. </p><p class="notThisOne">This is a paragraph of the "notThisOne" class.</p><p id="thisOne"> But I only want this <a href="sample.html">paragraph</a>.</p></body>

In [69]:
# the <h1> tag is nested two layers deep into the BeautifulSoup object structure (html → body → h1)
bs_sample.html.body.h1     # equivalently, bs_sample.body.h1 or bs_sample.h1 

<h1>This is a heading.</h1>

In [10]:
bs_sample.p

<p>This is a typical paragraph. </p>

A tag's children are available in a list called `.contents`:

In [70]:
bs_sample.html.contents

[<head><link href="example.css" rel="stylesheet" type="text/css"/><title> Sample HTML Page</title></head>,
 <body><h1>This is a heading.</h1><p>This is a typical paragraph. </p><p class="notThisOne">This is a paragraph of the "notThisOne" class.</p><p id="thisOne"> But I only want this <a href="sample.html">paragraph</a>.</p></body>]

In [168]:
bs_sample.body.contents

[<h1>This is a heading.</h1>,
 <p>This is a typical paragraph. </p>,
 <p class="notThisOne">This is a paragraph of the "notThisOne" class.</p>,
 <p id="thisOne"> But I only want this <a href="sample.html">paragraph</a>.</p>]

Alternatively, we can iterate over a tag's children using `.children`:

In [72]:
for child in bs_sample.html.body.children:
    print(child)

<h1>This is a heading.</h1>
<p>This is a typical paragraph. </p>
<p class="notThisOne">This is a paragraph of the "notThisOne" class.</p>
<p id="thisOne"> But I only want this <a href="sample.html">paragraph</a>.</p>


If a tag has only one child, and that child is a string (a `NavigableString`, Beautiful Soup's own string type), the string can be accessed with `.string`:

In [74]:
bs_sample.h1.contents

['This is a heading.']

In [76]:
bs_sample.h1.string        # we may need str() to convert it to a regular string 

'This is a heading.'

<div class="alert alert-info"> Tag.string operates recursively. If tag A contains a single tag B and nothing else, then A.string is the same as B.string. </div>

If we only want the human-readable text inside a document or tag, we can use the `get_text()` method, which returns all the text in a document or beneath a tag, as a single regular string:

In [93]:
bs_sample.body.get_text()

'This is a heading.This is a typical paragraph. This is a paragraph of the "notThisOne" class. But I only want this paragraph.'

In [94]:
bs_sample.h1.get_text()

'This is a heading.'

We can access a tag's attributes with `.attrs`:

In [28]:
bs_sample.link.attrs    

{'href': 'example.css', 'rel': ['stylesheet'], 'type': 'text/css'}

In [37]:
bs_sample.body.p           # get the first <p> tag beneath the <body> tag:

<p>This is a typical paragraph.</p>

Using a tag name as an attribute will give us only the first tag by that name.  If we need to get all tags with a certain name, we'll need to use `find_all()`.

The `find_all()` method takes a variety of filters and returns a list of all matching tags; `find()` takes the same filters but returns only the first match:

In [169]:
bs_sample.find_all('p')                            # perform a match against that exact string; return a list of tags  

[<p>This is a typical paragraph. </p>,
 <p class="notThisOne">This is a paragraph of the "notThisOne" class.</p>,
 <p id="thisOne"> But I only want this <a href="sample.html">paragraph</a>.</p>]

In [18]:
bs_sample.find_all(["p", "a"])                     # perform a string match against any item in that list

[<p>This is a typical paragraph. </p>,
 <p class="notThisOne">This is a paragraph of the "notThisOne" class.</p>,
 <p id="thisOne"> But I only want this <a href="sample.html">paragraph</a>.</p>,
 <a href="sample.html">paragraph</a>]

 We can also form filters based on tags' various attributes:

- The 1st argument is a filter on tag name;

- The 2nd argument is a dictionary of filters on attribute values.

In [39]:
bs_sample.find_all('p', {'class': 'notThisOne'})   # return a list that contains a single tag

[<p class="notThisOne">This is a paragraph of the "notThisOne" class.</p>]

In [170]:
bs_sample.find('p', {'id': 'thisOne'})             # return a single tag

<p id="thisOne"> But I only want this <a href="sample.html">paragraph</a>.</p>

---


In [21]:
%%html
<center><iframe src="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" width="960" height="400">
</iframe> </center>

---



We'll need to first use `urlopen()` in the [`urllib.request` module](https://docs.python.org/3/library/urllib.request.html#module-urllib.request) to open a URL for reading its content. 


 

 

In [1]:
from urllib.request import urlopen

lies_page = urlopen("https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html")  # open the URL

 Next, we'll transform the returned file object to a `BeautifulSoup` object:

In [None]:
bs_lies = BeautifulSoup(lies_page, 'html.parser')

In [None]:
print(bs_lies.prettify())
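
Some sites respond differently to requests that carry no browser-like `User-Agent`, and a slow server can make a bare `urlopen()` call hang. The sketch below shows one possible way to add a header, a timeout, and explicit decoding with `urllib.request`. The helper names `build_request` and `fetch_html`, the header value, and the 10-second timeout are assumptions made for this illustration, not part of the original notebook.

```python
from urllib.request import Request, urlopen

URL = "https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html"


def build_request(url, user_agent="Mozilla/5.0 (course-notebook example)"):
    """Return a Request carrying an explicit User-Agent header (hypothetical value)."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download a page and return its body as text.

    Raises urllib.error.URLError (or its subclass HTTPError) on failure.
    """
    with urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


request = build_request(URL)        # building the request does not touch the network
print(request.full_url)
print(request.get_header("User-agent"))
# html_text = fetch_html(URL)       # uncomment to perform the actual download
```

The string returned by `fetch_html()` could be passed to the `BeautifulSoup` constructor in the same way as the file object above.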

---


In the HTML code, every record is surrounded by the `<span>` tag of `class="short-desc"`:

```html
<span class="short-desc">
      <strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span>
</span>

```






In [31]:
item_list = bs_lies.find_all('span', {'class':'short-desc'})  

This returns a list of all tags that match the given criteria:

In [36]:
item_list[0]

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [36]:
item_list[5]

<span class="short-desc"><strong>Jan. 25 </strong>“You had millions of people that now aren't insured anymore.” <span class="short-truth"><a href="https://www.nytimes.com/2017/03/13/us/politics/fact-check-trump-obamacare-health-care.html" target="_blank">(The real number is less than 1 million, according to the Urban Institute.)</a></span></span>

The general structure of a single record is:

```html
<strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span>
```


Use `.find()` with the tag name `"strong"` to select the tag that contains the `DATE`:



In [87]:
item_list[0].find("strong")

<strong>Jan. 21 </strong>

Then use `.get_text()` to extract only the text, passing `strip=True` to remove leading and trailing whitespace:

In [91]:
item_list[0].find("strong").get_text(strip=True)

'Jan. 21'

Next, use `.contents` with list indexing to extract the `LIE`:

In [32]:
item_list[0].contents

[<strong>Jan. 21 </strong>,
 "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” ",
 <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>]

In [36]:
child_nodes = item_list[0].contents
child_nodes[1][1:-2]

"I wasn't a fan of Iraq. I didn't want to go into Iraq."

For the `EXPLANATION`, select the text within the `<span>` tag, which is the 3rd child of the tag:

In [37]:
child_nodes[2].get_text(strip=True)[1:-2]

'He was for an invasion before he was against it'
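
The `[1:-2]` slices above work by position, so it helps to see exactly which characters they drop. The sketch below replays them on the two strings from the first record, then shows an alternative that names the characters to remove. The alternative is only an illustration and agrees with the slices only for records shaped like this one.

```python
raw_lie = "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "
raw_explanation = "(He was for an invasion before he was against it.)"

# [1:-2] drops the first character and the last two characters.
lie = raw_lie[1:-2]                    # removes “ at the front, and ” plus a space at the end
explanation = raw_explanation[1:-2]    # removes ( at the front, and .) at the end

# Alternative: name the characters to remove instead of counting positions.
lie_alt = raw_lie.strip().strip("“”")
explanation_alt = raw_explanation.strip("()").rstrip(".")

print(lie)
print(explanation)
print(lie == lie_alt, explanation == explanation_alt)
```

Positional slicing is shorter, but it silently cuts real characters if a record lacks the trailing space or the final period.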

Note that the `URL` is an attribute (the `href` attribute) within the `<a>` tag.  We can access the tag's attribute dictionary directly with `.attrs`:


In [96]:
item_list[0].find('a').attrs

{'href': 'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the',
 'target': '_blank'}

In [84]:
item_list[0].find('a').attrs['href']

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'

Finally, we apply the same steps to every record using a `for` loop:

In [85]:
date_list = []; lie_list = []; explanation_list = []; url_list = []

for item in item_list:
    first, middle, last = item.contents
    date_list.append(first.get_text(strip=True))
    lie_list.append(middle[1:-2])
    explanation_list.append(last.get_text(strip=True)[1:-2])
    url_list.append(item.find('a').attrs['href'])


In [104]:
print(date_list)

['Jan. 21', 'Jan. 21', 'Jan. 23', 'Jan. 25', 'Jan. 25', 'Jan. 25', 'Jan. 25', 'Jan. 26', 'Jan. 26', 'Jan. 28', 'Jan. 29', 'Jan. 30', 'Feb. 3', 'Feb. 4', 'Feb. 5', 'Feb. 6', 'Feb. 6', 'Feb. 6', 'Feb. 6', 'Feb. 7', 'Feb. 7', 'Feb. 9', 'Feb. 9', 'Feb. 10', 'Feb. 12', 'Feb. 16', 'Feb. 16', 'Feb. 16', 'Feb. 16', 'Feb. 16', 'Feb. 16', 'Feb. 18', 'Feb. 18', 'Feb. 24', 'Feb. 24', 'Feb. 24', 'Feb. 27', 'Feb. 27', 'Feb. 28', 'Feb. 28', 'Feb. 28', 'March 3', 'March 4', 'March 4', 'March 7', 'March 13', 'March 13', 'March 15', 'March 17', 'March 20', 'March 21', 'March 22', 'March 22', 'March 22', 'March 29', 'March 31', 'April 2', 'April 2', 'April 5', 'April 6', 'April 11', 'April 12', 'April 12', 'April 12', 'April 12', 'April 16', 'April 18', 'April 21', 'April 21', 'April 27', 'April 28', 'April 28', 'April 28', 'April 29', 'April 29', 'April 29', 'April 29', 'April 29', 'April 29', 'May 1', 'May 1', 'May 1', 'May 2', 'May 4', 'May 4', 'May 4', 'May 8', 'May 8', 'May 8', 'May 12', 'May 12', '

We can now combine the data into a pandas `DataFrame` (a tabular data model that makes data manipulation and analysis easy; we'll learn more about pandas later) for future analyses:

In [106]:
import pandas as pd
lie_df = pd.DataFrame({'date': date_list, 'lie': lie_list, 'explanation': explanation_list, 'url': url_list})    
lie_df

| | date | lie | explanation | url |
|---|---|---|---|---|
| 0 | Jan. 21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | Jan. 21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | Jan. 23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | Jan. 25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | Jan. 25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud | https://www.nytimes.com/2017/01/24/us/politics... |
| ... | ... | ... | ... | ... |
| 175 | Oct. 25 | We have trade deficits with almost everybody. | We have trade surpluses with more than 100 cou... | https://www.bea.gov/newsreleases/international... |
| 176 | Oct. 27 | Wacky & totally unhinged Tom Steyer, who has b... | Steyer has financially supported many winning ... | https://www.opensecrets.org/donor-lookup/resul... |
| 177 | Nov. 1 | Again, we're the highest-taxed nation, just ab... | We're not | http://www.politifact.com/truth-o-meter/statem... |
| 178 | Nov. 7 | When you look at the city with the strongest g... | Several other cities, including New York and L... | http://www.politifact.com/truth-o-meter/statem... |


Save the output in the file system:

In [107]:
lie_df.to_csv('trump_lies.csv')
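
For comparison, the same four columns can be written without pandas. The sketch below uses the standard library's `csv` module and an in-memory buffer standing in for the file, so it runs without touching the file system. It holds only the first scraped record, and the names `FIELDS`, `records`, and `restored` are assumptions made for this illustration.

```python
import csv
import io

FIELDS = ["date", "lie", "explanation", "url"]

# The first record from the scraped output above; the remaining rows are omitted.
records = [
    {
        "date": "Jan. 21",
        "lie": "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
        "explanation": "He was for an invasion before he was against it",
        "url": "https://www.buzzfeed.com/andrewkaczynski/"
               "in-2002-donald-trump-said-he-supported-invading-iraq-on-the",
    }
]

buffer = io.StringIO()     # stands in for open('trump_lies.csv', 'w', newline='')
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()       # first line: date,lie,explanation,url
writer.writerows(records)

# Read the text back to confirm that the round trip preserves every field.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored == records)
```

Unlike `DataFrame.to_csv()` with its default settings, this writes no row-index column.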

---


# Scraping Dynamic Web Pages


We are increasingly encountering pages whose contents are dynamically generated within the user's Web browser; that is, the content is determined only when the page is rendered and is updated dynamically based on user interactions and inputs.



Is there a programmatic approach to drive a browser to mimic human users' actions, e.g., clicking on a button, filling in a form, etc., to load contents dynamically?

<img src="https://www.selenium.dev/images/selenium_logo_large.png"  width=260/>

 




[Selenium](https://www.selenium.dev) is an umbrella project for a range of tools and libraries that enable and support the automation of web browsers. At its core is WebDriver, an interface to write instruction sets that can be run interchangeably in many browsers.


Python language bindings for Selenium WebDriver are provided by the [selenium](https://pypi.org/project/selenium/) package.




---

## Getting Started

The `selenium.webdriver` module provides all the WebDriver implementations. Currently supported WebDriver implementations are Firefox, Chrome, IE and Remote.

In [154]:
from selenium.webdriver import Chrome

driver = Chrome()         # create an instance of Chrome WebDriver; or Firefox(), Ie(), etc.

In [48]:
type(driver)

selenium.webdriver.chrome.webdriver.WebDriver

In [17]:
driver.__dict__

{'service': <selenium.webdriver.chrome.service.Service at 0x209e1047dc0>,
 'command_executor': <selenium.webdriver.chrome.remote_connection.ChromeRemoteConnection at 0x209e1047970>,
 '_is_remote': False,
 'session_id': '2b53a02f28cf3600e6da5dcde17eb7cb',
 'capabilities': {'acceptInsecureCerts': False,
  'browserName': 'chrome',
  'browserVersion': '86.0.4240.111',
  'chrome': {'chromedriverVersion': '86.0.4240.22 (398b0743353ff36fb1b82468f63a3a93b4e2e89e-refs/branch-heads/4240@{#378})',
   'userDataDir': 'C:\\Users\\justi\\AppData\\Local\\Temp\\scoped_dir14756_1643016801'},
  'goog:chromeOptions': {'debuggerAddress': 'localhost:53624'},
  'networkConnectionEnabled': False,
  'pageLoadStrategy': 'normal',
  'platformName': 'windows',
  'proxy': {},
  'setWindowRect': True,
  'strictFileInteractability': False,
  'timeouts': {'implicit': 0, 'pageLoad': 300000, 'script': 30000},
  'unhandledPromptBehavior': 'dismiss and notify',
  'webauthn:virtualAuthenticators': True},
 'error_handler':


The first thing we'll want to do with `WebDriver` is navigate to a page given by its URL. A convenient way to do so is to call the `get()` method:




In [155]:
driver.get("https://www.google.com/webhp?gl=us")

<div class="alert alert-info">WebDriver will wait until the page has fully loaded before returning control to the script. </div>

In [115]:
print(dir(driver))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_file_detector', '_is_remote', '_mobile', '_switch_to', '_unwrap_value', '_web_element_cls', '_wrap_value', 'add_cookie', 'application_cache', 'back', 'capabilities', 'close', 'command_executor', 'create_options', 'create_web_element', 'current_url', 'current_window_handle', 'delete_all_cookies', 'delete_cookie', 'desired_capabilities', 'error_handler', 'execute', 'execute_async_script', 'execute_cdp_cmd', 'execute_script', 'file_detector', 'file_detector_context', 'find_element', 'find_element_by_class_name', 'find_element_by_css_selector', 'find_element_by_id', 'find_element_by_link_text', 'find_element_by_name', 'find_

In [116]:
driver.title   

'Google'

`WebDriver` offers a number of ways to find elements. 


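- We can use one of its `find_element_by_*()` or `find_elements_by_*()` methods to locate the first matching `WebElement` or a list of all matching `WebElement`s in a page. (These helpers are available in the Selenium release used here; Selenium 4 deprecated and later removed them in favor of calls such as `find_element(By.NAME, "q")`, with `By` imported from `selenium.webdriver.common.by`.) For example: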


A hyperlink element that contains a specific link text

```html
<a href = "continue.html">Continue</a>
```

can be found by using either of:

```python
driver.find_element_by_link_text('Continue')
driver.find_element_by_partial_link_text('Cont')
```

And a text field defined as:

```html
<input type="text" name="passwd" id="passwd-id" />
```

can be located using either of:

```python
driver.find_element_by_id("passwd-id")
driver.find_element_by_name("passwd")
```

In [158]:
search_box = driver.find_element_by_name("q")   # locate the input text element by its name attribute

In [60]:
search_box

<selenium.webdriver.remote.webelement.WebElement (session="2b53a02f28cf3600e6da5dcde17eb7cb", element="9e9b8da7-3291-4a39-9d75-1c3df0440213")>

<div class="alert alert-info">A parent WebElement can be chained with find_element(s)_by_*() to access child elements.</div>

Virtualized device input can be generated by the `send_keys()` method:



In [119]:
search_box.clear()                           # clear any pre-populated text
search_box.send_keys("us election 2020")

<div class="alert alert-info">Typing something into a text field won't automatically clear it. Instead, what we type will be appended to what's already there.</div>

Special keys can be sent using the `Keys` class imported from `selenium.webdriver.common.keys`:

In [121]:
from selenium.webdriver.common.keys import Keys
search_box.send_keys(Keys.RETURN)

Here is the list of keystrokes that `WebDriver` supports:

In [125]:
print(dir(Keys))

['ADD', 'ALT', 'ARROW_DOWN', 'ARROW_LEFT', 'ARROW_RIGHT', 'ARROW_UP', 'BACKSPACE', 'BACK_SPACE', 'CANCEL', 'CLEAR', 'COMMAND', 'CONTROL', 'DECIMAL', 'DELETE', 'DIVIDE', 'DOWN', 'END', 'ENTER', 'EQUALS', 'ESCAPE', 'F1', 'F10', 'F11', 'F12', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7', 'F8', 'F9', 'HELP', 'HOME', 'INSERT', 'LEFT', 'LEFT_ALT', 'LEFT_CONTROL', 'LEFT_SHIFT', 'META', 'MULTIPLY', 'NULL', 'NUMPAD0', 'NUMPAD1', 'NUMPAD2', 'NUMPAD3', 'NUMPAD4', 'NUMPAD5', 'NUMPAD6', 'NUMPAD7', 'NUMPAD8', 'NUMPAD9', 'PAGE_DOWN', 'PAGE_UP', 'PAUSE', 'RETURN', 'RIGHT', 'SEMICOLON', 'SEPARATOR', 'SHIFT', 'SPACE', 'SUBTRACT', 'TAB', 'UP', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__']


In [159]:
search_box.clear()         
search_box.send_keys("us election 2020" + Keys.RETURN)

---

## Transitioning to Beautiful Soup


After retrieving the search result page, we instruct selenium to hand off the page source to Beautiful Soup:

In [128]:
bs_google = BeautifulSoup(driver.page_source, 'html.parser')

In [None]:
 print(bs_google.prettify())  

---

<div class="g">
    <div class="rc" data-hveid="CB8QAA" data-ved="2ahUKEwiNs73kjtLsAhVNpZ4KHXkJAkYQFSgAMAt6BAgfEAA">
        <div class="yuRUbf">
            <a href="https://www.theguardian.com/us-news/2020/oct/26/us-election-polls-tracker-who-is-leading-in-the-swing-states" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://www.theguardian.com/us-news/2020/oct/26/us-election-polls-tracker-who-is-leading-in-the-swing-states&amp;ved=2ahUKEwiNs73kjtLsAhVNpZ4KHXkJAkYQFjALegQIHxAC">
            <br>
            <span>US election polls tracker: who is leading in the swing states ...</span>
            <div class="TbwUpd NJjxre">
                <cite class="iUh30 gBIQub qLRx3b tjvcx">www.theguardian.com
                    <span class="eipWBe"><span> › us-news › 2020 › oct › us-electi...</span></span>
                </cite>
            </div>
            </a>
            <div class="B6fmyf"><div class="TbwUpd"><cite class="iUh30 gBIQub qLRx3b tjvcx">www.theguardian.com<span class="eipWBe"><span> › us-news › 2020 › oct › us-electi...</span></span></cite></div><div class="eFM0qc"></div></div></div><div class="IsZvec"><div><span class="aCOpRe"><span class="f">2 hours ago — </span>
        <span>As the presidential campaign heats up, the Guardian is tracking the latest polling in eight states that could decide the ... <em>US elections 2020</em>&nbsp;...</span></span></div></div></div><!--n--></div>


---

In [129]:
search_result_list = bs_google.find_all('div', {'class':'yuRUbf'})

In [130]:
search_result_list[0]

<div class="yuRUbf"><a href="https://www.theguardian.com/us-news/2020/oct/28/us-election-polls-tracker-swing-states-donald-trump-joe-biden" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://www.theguardian.com/us-news/2020/oct/28/us-election-polls-tracker-swing-states-donald-trump-joe-biden&amp;ved=2ahUKEwjT7Pv49NjsAhUNfXAKHdJDDKsQFjABegQIARAC"><br/><h3 class="LC20lb DKV0Md"><span>US election polls tracker: who is leading in swing states, Trump</span></h3><div class="TbwUpd NJjxre"><cite class="iUh30 gBIQub qLRx3b tjvcx">www.theguardian.com<span class="dyjrff qzEoUe"><span> › us-news › 2020 › oct › us-electi...</span></span></cite></div></a><div class="B6fmyf"><div class="TbwUpd"><cite class="iUh30 gBIQub qLRx3b tjvcx">www.theguardian.com<span class="dyjrff qzEoUe"><span> › us-news › 2020 › oct › us-electi...</span></span></cite></div><div class="eFM0qc"></div></div></div>

In [131]:
search_result_list[0].find("a").attrs['href']

'https://www.theguardian.com/us-news/2020/oct/28/us-election-polls-tracker-swing-states-donald-trump-joe-biden'

In [132]:
search_result_list[0].find("a").contents

[<br/>,
 <h3 class="LC20lb DKV0Md"><span>US election polls tracker: who is leading in swing states, Trump</span></h3>,
 <div class="TbwUpd NJjxre"><cite class="iUh30 gBIQub qLRx3b tjvcx">www.theguardian.com<span class="dyjrff qzEoUe"><span> › us-news › 2020 › oct › us-electi...</span></span></cite></div>]

In [135]:
search_result_list[0].find("a").contents[1].get_text()

'US election polls tracker: who is leading in swing states, Trump'

In [136]:
url_list = []
text_list = []

for search_result in search_result_list:
    link_node = search_result.find("a")
    url_list.append(link_node.attrs['href'])
    text_list.append(link_node.contents[1].get_text())

search_result_df = pd.DataFrame({'text': text_list, 'url': url_list})    
search_result_df

| | text | url |
|---|---|---|
| 0 | US election polls tracker: who is leading in s... | https://www.theguardian.com/us-news/2020/oct/2... |
| 1 | US Election Day 2020: When is it, what time do... | https://www.telegraph.co.uk/news/2020/10/28/20... |
| 2 | US election 2020: Latest news on Biden, Trump ... | https://www.cnn.com/politics/live-news/us-elec... |
| 3 | Live Biden vs. Trump Election Updates - The Ne... | https://www.nytimes.com/live/2020/10/28/us/tru... |
| 4 | US Election 2020 - BBC News - BBC.com | https://www.bbc.com/news/election/us2020 |
| 5 | US election 2020 polls: Who is ahead - Trump o... | https://www.bbc.com/news/election-us-2020-5365... |
| 6 | Biden vs Trump: US presidential election 2020 ... | https://ig.ft.com/us-election-2020/ |
| 7 | 2020 United States presidential election - Wik... | https://en.wikipedia.org/wiki/2020_United_Stat... |


Navigating around the search result is just a repeated application of generating keystrokes with `selenium`:

In [87]:
driver.find_element_by_id("pnnext").send_keys(Keys.RETURN)

Or we can click on an element using the `click()` method:

In [160]:
driver.find_element_by_id("pnnext").click() 

`WebDriver`'s `back()` and `forward()` methods allow us to move backward and forward in the browser's history:


In [161]:
driver.back()      # or driver.forward() to move forward again

`.refresh()` refreshes the current page:

In [162]:
driver.refresh()

When we are finished with the browser session, we should close the browser window:

In [101]:
driver.close() 

<div class="alert alert-info"> We can also call the quit() method instead of close(): quit() exits the entire browser and ends the WebDriver session, whereas close() closes only the current window or tab.</div>