# [Trump's Lies](https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html)


---


# Web Scraping


Using the Python programming language, it is possible to "scrape" data from the Web in a quick and efficient manner.

**Web Scraping** commonly refers to the practice of writing an automated program
that queries a web server, requests data (usually in the form of HTML and other files
that compose web pages), and then parses that data to extract needed information.
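
As a rough illustration of the "parse and extract" half of that definition, here is a minimal sketch that uses only Python's standard library. The `LinkCollector` class and the `response_text` string are made up for this example; the string stands in for the body of a real server response.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


# Hypothetical stand-in for the HTML returned by a web server
response_text = '<p>See <a href="page1.html">one</a> and <a href="page2.html">two</a>.</p>'

collector = LinkCollector()
collector.feed(response_text)
print(collector.links)
```

Writing a parser class by hand quickly becomes tedious, which is why the rest of these notes rely on a dedicated library.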



Web scraping is a valuable tool in the data scientist's skill set.

<span style="margin-right: 10%;  margin-left: 20%;">[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)</span>
<span style="margin-right: 10%;  margin-left: 15%;">[Selenium](https://www.selenium.dev)</span>

<p> 

<img src="https://raw.githubusercontent.com/justinjiajia/img/master/python/beautifulsoup.jpg" style="float: left; margin-bottom: 1.5em; margin-top: 1.5em; margin-right: 10%;  margin-left: 15%; " width=220/>
 <img src="https://upload.wikimedia.org/wikipedia/commons/d/d5/Selenium_Logo.png" style="float: left; margin-top: 3.8em;" width=140/>



---


# HTML

<br>

[**HyperText Markup Language**](https://developer.mozilla.org/en-US/docs/Web/HTML) (HTML for short) is a markup language for describing Web documents.


---

 ```html
<!DOCTYPE html>
<html>
<head><title>Sample HTML Page</title></head>
  
<body>
<h1>This is a heading.</h1>
<p>This is a typical paragraph.</p>
<p class = "notThisOne"> This is a paragraph of the "notThisOne" class. </p>
<p id = "thisOne"> 
But I only want this <a href = "sample.html">paragraph</a>.
</p>
</body>
  
</html>
```


<!DOCTYPE html>
<html>
<head><title>Sample HTML Page</title></head>
  
<body>
<h1>This is a heading.</h1>
<p>This is a typical paragraph.</p>
<p class = "notThisOne"> This is a paragraph of the "notThisOne" class. </p>
<p id = "thisOne"> 
But I only want this <a href = "sample.html"> paragraph</a>. </p>
</body>
  
</html>


---

HTML elements are written with a start tag, an end tag, and with the content in between: `<tagname>content</tagname>`.

- `<h1>`, `<h2>`,..., `<h6>`: largest heading, second largest heading, etc.
- `<p>`: paragraphs
- `<ul>` or `<ol>`: unordered (bulleted) or ordered (numbered) list
- `<li>`: an individual list item
- `<div>`: division or section
- `<table>`: table
- `<img>`: image
- `<a>`: anchor
- and many others ...


The tags typically contain the textual content we wish to scrape and may include attributes. 


```html
<tag attribute1="value1" attribute2="value2">content</tag>
```


These elements form a "family tree", where the top-level `<html>` tag contains the `<head>` and `<body>` tags, which in turn contain text and further tags, and so on:


<img src="https://raw.githubusercontent.com/justinjiajia/img/master/python/html_nodes.PNG" width=300/>

 ```html
<!DOCTYPE html>
<html>
<head><title>Sample HTML Page</title></head>
  
<body>
<h1>This is a heading.</h1>
<p>This is a typical paragraph.</p>
<p class = "notThisOne"> This is a paragraph of the "notThisOne" class. </p>
<p id = "thisOne"> 
But I only want this <a href = "sample.html"> paragraph</a>. </p>
</body>
  
</html>
```
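
The nesting can also be made visible programmatically. The sketch below assumes the Beautiful Soup library (covered later in these notes) is installed; the helper name `outline` and the shortened `html_doc` string are our own and only serve the illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened version of the sample page (assumption: trimmed for brevity)
html_doc = (
    "<html><head><title>Sample HTML Page</title></head>"
    "<body><h1>This is a heading.</h1>"
    '<p id="thisOne">But I only want this <a href="sample.html">paragraph</a>.</p>'
    "</body></html>"
)


def outline(tag, depth=0):
    """Return one indented line per nested tag, parents before children."""
    lines = ["  " * depth + tag.name]
    for child in tag.children:
        if isinstance(child, Tag):      # skip plain text nodes
            lines.extend(outline(child, depth + 1))
    return lines


soup = BeautifulSoup(html_doc, "html.parser")
tree_lines = outline(soup.html)
print("\n".join(tree_lines))
```

Each level of indentation in the printed outline corresponds to one generation in the family tree.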




An HTML node can further have the `class` or `id` attribute:


```html
<p>This is a typical paragraph.</p>
<p class = "notThisOne"> This is a paragraph of the "notThisOne" class.</p>
<p id = "thisOne"> But I only want this <a href = "sample.html">paragraph</a>.</p>
```

- We can use the `class` or `id` attribute to differentiate the section we want from other sections.
- The difference between an `id` and a `class` is that an `id` identifies a single element (its value should be unique within a page), whereas a `class` can be shared by any number of elements.


---



## CSS Selectors


[**Cascading Style Sheets**](https://developer.mozilla.org/en-US/docs/Web/CSS) (CSS for short) is a style sheet language for describing the presentation of a document written in a markup language.


 

```css
h1 {
  color: royalblue;
  text-align: center; }

p {
  color: salmon;
  font-family: "Century Gothic", CenturyGothic, Geneva, 
  AppleGothic, sans-serif; }

p.notThisOne {
  margin-bottom: 40px;
  font-family:"Times New Roman", Georgia, Serif; }

p#thisOne {
  color: #a1d99b;
  font-style: italic; }
```


<!DOCTYPE html>
<html>

<head><title>Sample HTML Page</title></head>
  
<body>
<h1 style="color: royalblue; text-align: center;">This is a heading.</h1>
<p style="color: salmon; font-family: Century Gothic, CenturyGothic, Geneva, AppleGothic, sans-serif">This is a typical paragraph.</p>
<p class = "notThisOne" style="color: salmon; font-family: Times New Roman, Georgia, Serif;"> This is a paragraph of the "notThisOne" class. </p>
<p id = "thisOne"   style="color: #a1d99b; font-style: italic;  font-family: Century Gothic, CenturyGothic, Geneva, AppleGothic, sans-serif"> 
But I only want this <a href = "sample.html"> paragraph</a>. </p>
</body>
  
</html>

HTML dictates the content and structure of a webpage, while CSS controls the design and display of its HTML elements.

There are three ways of applying CSS to HTML:

- Inline, right on an HTML tag, using the `style` attribute
- An embedded style sheet in the `<head>` of the document
- As an external style sheet in a separate file

**`sample.html`**

```html
<!DOCTYPE html>
<html>
  
<head>
<link href="example.css" rel="stylesheet" type="text/css">
<title>Sample HTML Page</title>
</head>
  
<body>
<h1>This is a heading.</h1>
<p>This is a typical paragraph.</p>
<p class = "notThisOne">
This is a paragraph of the "notThisOne" class.</p>
<p id = "thisOne">
But I only want this <a href = "sample.html">paragraph</a>.
</p>
</body>
  
</html>
```

**`example.css`**

```css
h1 {
  color: royalblue;
  text-align: center; }

p {
  color: salmon;
  font-family: "Century Gothic", CenturyGothic, Geneva, 
  AppleGothic, sans-serif; }

p.notThisOne {
  margin-bottom: 40px;
  font-family:"Times New Roman", Georgia, Serif; }

p#thisOne {
  color: #a1d99b;
  font-style: italic; }
```

In CSS, selectors are patterns used to select the element(s) we want to style.



| Selector |  Example | Explanation |
|-----|-----|-----|
| `element` | `p` | Select all `<p>` elements|
| `.class`| `.notThisOne` | Select all elements with `class="notThisOne"` |
| `#id` | `#thisOne` | Select the element with `id="thisOne"` |
|`[attribute]`|`[id]` | Select all elements with an `id` attribute |
| `element.class`| `p.notThisOne` | Select all `<p>` elements with `class="notThisOne"` |
| `element#id`| `p#thisOne` | Select the `<p>` element with `id="thisOne"` |
| `,`| `div, p` | Select all `<p>` elements as well as all `<div>` elements |
| ` `| `div p` | Select all `<p>` elements inside `<div>` elements |
| `>`| `div > p` | Select all `<p>` elements whose parent is a `<div>` element|
| `+`| `div + p` | Select every `<p>` element that immediately follows a `<div>` element (adjacent sibling) |
| `~`| `div ~ p` | Select all `<p>` elements that come after a `<div>` element and share its parent (general sibling) |
 


More selectors can be found [here](https://css-tricks.com/almanac/selectors/).
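
These selector patterns are useful for scraping as well as styling. As a hedged sketch, the snippet below tries a few of them with Beautiful Soup's `select()` and `select_one()` methods, assuming the `bs4` package (with its CSS selector support) is installed. The variable names are ours, and `html_doc` is a trimmed copy of the sample page.

```python
from bs4 import BeautifulSoup

html_doc = """
<body>
<h1>This is a heading.</h1>
<p>This is a typical paragraph.</p>
<p class="notThisOne">This is a paragraph of the "notThisOne" class.</p>
<p id="thisOne">But I only want this <a href="sample.html">paragraph</a>.</p>
</body>
"""
soup = BeautifulSoup(html_doc, "html.parser")

all_paragraphs = soup.select("p")          # element selector
by_class = soup.select("p.notThisOne")     # element.class
by_id = soup.select_one("#thisOne")        # #id (first match, or None)
child_links = soup.select("p > a")         # <a> tags whose parent is a <p>
after_h1 = soup.select("h1 ~ p")           # <p> siblings that come after <h1>
next_to_h1 = soup.select("h1 + p")         # the <p> right after <h1>

print(len(all_paragraphs), len(by_class), by_id["id"])
print(child_links[0]["href"], len(after_h1), len(next_to_h1))
```

`select()` always returns a list, even for an `#id` selector, whereas `select_one()` returns a single tag.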

---


# Beautiful Soup



<img src="https://raw.githubusercontent.com/justinjiajia/img/master/python/beautifulsoup.jpg"  width=220/>

>Beautiful Soup, so rich and green,<br>
&nbsp;&nbsp;Waiting in a hot tureen!<br>
&nbsp;&nbsp;Who for such dainties would not stoop?<br>
&nbsp;&nbsp;Soup of the evening, beautiful Soup!






<br>

[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) is a Python library for pulling data out of HTML and XML files. 

An example of ["**tag soup**"](https://en.wikipedia.org/wiki/Tag_soup):

```html
<H1>Tacky HTML</H1>
Hi there.
<p><img src=tiny.png>
Browsers tolerate a lot of completely broken HTML.
<UL>
<LI>List one
<LI>List 2
</UL>
```
    

<H1>Tacky HTML</H1>
Hi there.
<p><img src=tiny.png>
Browsers tolerate a lot of completely broken HTML.
<UL>
<LI>List one
<LI>List 2
</UL>


BeautifulSoup helps format and organize the messy web by repairing bad HTML and presenting us with easily traversable Python objects that mirror the document's tree structure.

The most commonly used object in the BeautifulSoup library is the `BeautifulSoup` object. 

In [4]:
from bs4 import BeautifulSoup



Let's pass a string of HTML source code into the `BeautifulSoup` constructor to "***make the soup***":



In [4]:
html_sample_code = '<html><head><link href="example.css" rel="stylesheet" type="text/css"><title> \
Sample HTML Page</title></head><body><h1>This is a heading.</h1><p>This is a typical paragraph. \
</p><p class = "notThisOne">This is a paragraph of the "notThisOne" class.</p><p id = "thisOne"> \
But I only want this <a href = "sample.html">paragraph</a>.</p></body></html>'



bs_sample = BeautifulSoup(html_sample_code, 'html.parser')     # pass it into the BeautifulSoup constructor


- The 1st argument is the HTML text the object is based on, and the 2nd specifies the parser that we want BeautifulSoup to use in order to create a `BeautifulSoup` object.


<div class='alert alert-info'><code>html.parser</code> is the HTML parser included in the standard Python 3 library. Information on other HTML parsers is available <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser">here</a>.

</div>
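
Returning to the tag soup shown earlier, the following sketch suggests how forgiving the parsing step is. It assumes `bs4` is installed and uses the built-in `html.parser`; other parsers may build a somewhat different tree from the same broken markup. The variable names are ours.

```python
from bs4 import BeautifulSoup

tag_soup = """<H1>Tacky HTML</H1>
Hi there.
<p><img src=tiny.png>
Browsers tolerate a lot of completely broken HTML.
<UL>
<LI>List one
<LI>List 2
</UL>
"""

soup = BeautifulSoup(tag_soup, "html.parser")

heading = soup.h1.string                 # tag names are normalized to lower case
image_src = soup.img["src"]              # the unquoted attribute value is still read
item_count = len(soup.find_all("li"))    # both unclosed <LI> tags are found

print(heading, image_src, item_count)
```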
 

In [11]:
bs_sample

<html><head><link href="example.css" rel="stylesheet" type="text/css"/><title> Sample HTML Page</title></head><body><h1>This is a heading.</h1><p>This is a typical paragraph. </p><p class="notThisOne">This is a paragraph of the "notThisOne" class.</p><p id="thisOne"> But I only want this <a href="sample.html">paragraph</a>.</p></body></html>

In [109]:
type(bs_sample)

bs4.BeautifulSoup

A `BeautifulSoup` object represents the parsed document as a whole, and for most purposes, can be viewed as a `Tag` object, which contains strings and other tags as its children.

We can use the `prettify()` method to format the HTML source code so as to visualize its structure:


In [6]:
print(bs_sample.prettify())             

<html>
 <head>
  <link href="example.css" rel="stylesheet" type="text/css"/>
  <title>
   Sample HTML Page
  </title>
 </head>
 <body>
  <h1>
   This is a heading.
  </h1>
  <p>
   This is a typical paragraph.
  </p>
  <p class="notThisOne">
   This is a paragraph of the "notThisOne" class.
  </p>
  <p id="thisOne">
   But I only want this
   <a href="sample.html">
    paragraph
   </a>
   .
  </p>
 </body>
</html>



`BeautifulSoup` provides a lot of different attributes for navigating and iterating over a tag's children. 

Among them, the simplest way to navigate a tag is to say the name of the child we want:

In [26]:
bs_sample.html.body          # the <body> tag is beneath the <html> tag

<body><h1>This is a heading.</h1><p>This is a typical paragraph. </p><p class="notThisOne">This is a paragraph of the "notThisOne" class.</p><p id="thisOne"> But I only want this <a href="sample.html">paragraph</a>.</p></body>

In [68]:
bs_sample.body               # we can also call the <body> tag directly as long as there's no ambiguity

<body><h1>This is a heading.</h1><p>This is a typical paragraph. </p><p class="notThisOne">This is a paragraph of the "notThisOne" class.</p><p id="thisOne"> But I only want this <a href="sample.html">paragraph</a>.</p></body>

In [69]:
# the <h1> tag is nested two layers deep into the BeautifulSoup object structure (html → body → h1)
bs_sample.html.body.h1       # equivalently, bs_sample.body.h1 or bs_sample.h1 

<h1>This is a heading.</h1>

In [10]:
bs_sample.body.p             # get the 1st <p> tag beneath the <body> tag

<p>This is a typical paragraph. </p>

A tag's children are available in a list called `.contents`:

In [70]:
bs_sample.html.contents

[<head><link href="example.css" rel="stylesheet" type="text/css"/><title> Sample HTML Page</title></head>,
 <body><h1>This is a heading.</h1><p>This is a typical paragraph. </p><p class="notThisOne">This is a paragraph of the "notThisOne" class.</p><p id="thisOne"> But I only want this <a href="sample.html">paragraph</a>.</p></body>]

In [168]:
bs_sample.body.contents

[<h1>This is a heading.</h1>,
 <p>This is a typical paragraph. </p>,
 <p class="notThisOne">This is a paragraph of the "notThisOne" class.</p>,
 <p id="thisOne"> But I only want this <a href="sample.html">paragraph</a>.</p>]

Alternatively, we can iterate over a tag's children using `.children`:

In [72]:
for child in bs_sample.html.body.children:
    print(child)

<h1>This is a heading.</h1>
<p>This is a typical paragraph. </p>
<p class="notThisOne">This is a paragraph of the "notThisOne" class.</p>
<p id="thisOne"> But I only want this <a href="sample.html">paragraph</a>.</p>


If a tag has only one child and that child is a string (a `NavigableString`, Beautiful Soup's own string type), the string can be accessed with `.string`:

In [5]:
bs_sample.h1.contents

['This is a heading.']

In [6]:
type(bs_sample.h1.string)

bs4.element.NavigableString

In [76]:
bs_sample.h1.string        # we may need str() to convert it to a regular string 

'This is a heading.'

<div class="alert alert-info"> Tag.string operates recursively. If tag A contains a single tag B and nothing else, then A.string is the same as B.string. </div>
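
A small sketch of that recursive behaviour, and of what happens when a tag has more than one child. The markup is invented for the illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><p><b>Only child</b></p><p>Some <i>mixed</i> content</p></div>",
    "html.parser",
)
first_p, second_p = soup.find_all("p")

nested = first_p.string     # <p> holds a single <b>, so this is the string of <b>
mixed = second_p.string     # several children, so .string is None

print(repr(nested), repr(mixed))
```

Because `.string` can be `None`, it is worth checking the result before calling string methods on it.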

If we only want the human-readable text inside a document or tag, we can use the `get_text()` method, which returns all the text in a document or beneath a tag, as a single regular string:

In [93]:
bs_sample.body.get_text()

'This is a heading.This is a typical paragraph. This is a paragraph of the "notThisOne" class. But I only want this paragraph.'

In [94]:
bs_sample.h1.get_text()

'This is a heading.'
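
Note how the pieces of text in the first output run together. `get_text()` also accepts a `separator` string and a `strip` flag, which the following sketch compares on a small invented document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<body><h1> Title </h1><p>First paragraph.</p><p>Second paragraph.</p></body>",
    "html.parser",
)

raw = soup.body.get_text()                              # pieces joined as they are
tidy = soup.body.get_text(separator=" ", strip=True)    # stripped, then joined by a space

print(repr(raw))
print(repr(tidy))
```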

We can access a tag's attributes with `.attrs`:

In [7]:
print(bs_sample.prettify())

<html>
 <head>
  <link href="example.css" rel="stylesheet" type="text/css"/>
  <title>
   Sample HTML Page
  </title>
 </head>
 <body>
  <h1>
   This is a heading.
  </h1>
  <p>
   This is a typical paragraph.
  </p>
  <p class="notThisOne">
   This is a paragraph of the "notThisOne" class.
  </p>
  <p id="thisOne">
   But I only want this
   <a href="sample.html">
    paragraph
   </a>
   .
  </p>
 </body>
</html>


In [28]:
bs_sample.link.attrs    

{'href': 'example.css', 'rel': ['stylesheet'], 'type': 'text/css'}
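
A tag can also be indexed like a dictionary to read a single attribute. The sketch below uses an invented `<a>` tag to show the two access styles and how multi-valued attributes such as `class` (or `rel` above) come back as lists.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="sample.html" class="nav external" id="home">Home</a>',
    "html.parser",
)
link = soup.a

href = link["href"]            # dictionary-style access; raises KeyError if absent
classes = link["class"]        # multi-valued attribute, returned as a list
missing = link.get("title")    # .get() returns None instead of raising

print(href, classes, missing)
```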

Using a tag name as an attribute will give us only the first tag by that name.  If we need to get all tags with a certain name, we'll need to use `find_all()`.

The `find_all()` method accepts a variety of filters and returns a list of all matching tags; `find()` accepts the same filters but returns only the first match:

In [169]:
bs_sample.find_all('p')                            # perform a match against that exact string; return a list of tags  

[<p>This is a typical paragraph. </p>,
 <p class="notThisOne">This is a paragraph of the "notThisOne" class.</p>,
 <p id="thisOne"> But I only want this <a href="sample.html">paragraph</a>.</p>]

In [18]:
bs_sample.find_all(["p", "a"])                     # perform a string match against any item in that list

[<p>This is a typical paragraph. </p>,
 <p class="notThisOne">This is a paragraph of the "notThisOne" class.</p>,
 <p id="thisOne"> But I only want this <a href="sample.html">paragraph</a>.</p>,
 <a href="sample.html">paragraph</a>]

We can also filter on a tag's attributes by passing two arguments:

- The 1st argument is a filter on tag name;

- The 2nd argument is a dictionary of filters on attribute values.

In [39]:
bs_sample.find_all('p', {'class': 'notThisOne'})   # return a list that contains a single tag

[<p class="notThisOne">This is a paragraph of the "notThisOne" class.</p>]

In [170]:
bs_sample.find('p', {'id': 'thisOne'})             # return a single tag

<p id="thisOne"> But I only want this <a href="sample.html">paragraph</a>.</p>
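
The same filters can be written as keyword arguments, which some readers find easier to scan. The sketch below is one possible variant on a trimmed copy of the sample page; note the trailing underscore in `class_`, needed because `class` is a reserved word in Python.

```python
from bs4 import BeautifulSoup

html_doc = (
    "<body><p>This is a typical paragraph.</p>"
    '<p class="notThisOne">This is a paragraph of the "notThisOne" class.</p>'
    '<p id="thisOne">But I only want this <a href="sample.html">paragraph</a>.</p>'
    "</body>"
)
soup = BeautifulSoup(html_doc, "html.parser")

by_class = soup.find_all("p", class_="notThisOne")   # keyword filter on the class attribute
by_id = soup.find(id="thisOne")                      # keyword filter on the id attribute
first_two = soup.find_all("p", limit=2)              # stop after two matches
no_match = soup.find("table")                        # find() returns None when nothing matches

print(len(by_class), by_id.name, len(first_two), no_match)
```

Since `find()` may return `None`, chained calls such as `soup.find("table").get_text()` fail on pages that lack the tag.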

---

# Scraping Textual Data on Trump's Lies


We'll need to first use `urlopen()` in the [`urllib.request` module](https://docs.python.org/3/library/urllib.request.html#module-urllib.request) to open a URL for reading its content. 


 

 

In [13]:
from urllib.request import urlopen

lies_page = urlopen("https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html")  # open the URL

Next, we'll turn the returned response object into a `BeautifulSoup` object:

In [14]:
bs_lies = BeautifulSoup(lies_page, 'html.parser')
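
Requests can fail, for example when a page has moved or the server is unreachable. The following is a minimal sketch of a defensive wrapper around `urlopen()`. The helper name `fetch_html`, the ten-second timeout and the UTF-8 decoding are assumptions made for this illustration. To keep the demonstration offline, it reads a temporary file through a `file://` URL instead of contacting a real server.

```python
import tempfile
from pathlib import Path
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the decoded page source, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except HTTPError as err:        # the server answered with an error status
        print(f"HTTP error {err.code} for {url}")
    except URLError as err:         # the resource could not be reached
        print(f"Could not open {url}: {err.reason}")
    return None


# Offline demonstration: a temporary file addressed through a file:// URL
with tempfile.TemporaryDirectory() as tmp_dir:
    page_path = Path(tmp_dir) / "page.html"
    page_path.write_text("<html><body><p>Hello</p></body></html>", encoding="utf-8")
    page_source = fetch_html(page_path.as_uri())
    missing_source = fetch_html((Path(tmp_dir) / "missing.html").as_uri())

print(page_source)
```

`HTTPError` is caught first because it is a subclass of `URLError`.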

To read in the source code from a local HTML file:

In [33]:
with open("C:/Users/justi/Downloads/lies.html", encoding="utf8") as lies_page:  # open the local html file
    bs_lies = BeautifulSoup(lies_page.read(), 'html.parser')                     # the file is closed automatically afterwards

In [46]:
type(bs_lies)

bs4.BeautifulSoup

In [34]:
print(bs_lies.prettify())

<!DOCTYPE html>
<html class="desktop page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive wf-active has-js has-flexbox has-flexboxlegacy has-canvas has-canvastext has-webgl has-no-touch has-geolocation has-postmessage has-websqldatabase has-indexeddb has-hashchange has-history has-draganddrop has-websockets has-rgba has-hsla has-multiplebgs has-backgroundsize has-borderimage has-borderradius has-boxshadow has-textshadow has-opacity has-cssanimations has-csscolumns has-cssgradients has-cssreflections has-csstransforms has-csstransforms3d has-csstransitions has-fontface has-generatedcontent has-video has-audio has-localstorage has-sessionstorage has-webworkers has-applicationcache has-svg has-inlinesvg has-smil has-svgclippaths has-cors tr-coretext tr-aa-unknown-subpixel g-resizer-v3-init edition-domestic gr__nytimes_com viewport-small viewport-small-10 viewport-small-20 viewport-medium" lang="en" style="">
 

---


In the HTML code, each record is wrapped in a `<span>` tag with `class="short-desc"`:

```html
<span class="short-desc">
      <strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span>
</span>

```






In [35]:
item_list = bs_lies.find_all('span', {'class':'short-desc'})  

This returns a list of all tags that match the given criteria:

In [36]:
item_list[0]

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [37]:
item_list[5]

<span class="short-desc"><strong>Jan. 25 </strong>“You had millions of people that now aren't insured anymore.” <span class="short-truth"><a href="https://www.nytimes.com/2017/03/13/us/politics/fact-check-trump-obamacare-health-care.html" target="_blank">(The real number is less than 1 million, according to the Urban Institute.)</a></span></span>

The general structure of a single record is:

```html
<strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span>
```


Use `.find()` with the tag name `"strong"` to select the tag that contains the `DATE`:



In [38]:
item_list[0].find("strong")

<strong>Jan. 21 </strong>

Then use `.get_text()` to extract only the text, passing `strip=True` to remove leading and trailing whitespace:

In [39]:
item_list[0].find("strong").get_text(strip=True)

'Jan. 21'

Next, use `.contents` with list indexing to extract the `LIE`:

In [40]:
item_list[0].contents

[<strong>Jan. 21 </strong>,
 "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” ",
 <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>]

In [41]:
child_nodes = item_list[0].contents
child_nodes[1][1:-2]

"I wasn't a fan of Iraq. I didn't want to go into Iraq."

For the `EXPLANATION`, select the text within the `<span>` tag, which is the 3rd child of the tag:

In [42]:
child_nodes[2].get_text(strip=True)[1:-2]

'He was for an invasion before he was against it'

Note that the `URL` is an attribute (the `href` attribute) within the `<a>` tag.  We can access the tag's attribute dictionary directly with `.attrs`:


In [96]:
item_list[0].find('a').attrs

{'href': 'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the',
 'target': '_blank'}

In [84]:
item_list[0].find('a').attrs['href']

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'

Finally, we extend this process to all the remaining records using a `for` loop:

In [43]:
date_list, lie_list, explanation_list, url_list = [], [], [], []

for item in item_list:
    first, middle, last = item.contents
    date_list.append(first.get_text(strip=True))
    lie_list.append(middle[1:-2])
    explanation_list.append(last.get_text(strip=True)[1:-2])
    url_list.append(item.find('a').attrs['href'])
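
The loop above relies on every record having exactly three children and on fixed slice positions. As a hedged alternative, the sketch below strips the surrounding punctuation by character instead of by position. The helper `parse_record` and the shortened sample record (including its `example.com` URL) are invented for the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical record that follows the structure shown above
sample_html = (
    '<span class="short-desc"><strong>Jan. 21 </strong>'
    "“I wasn't a fan of Iraq.” "
    '<span class="short-truth"><a href="https://example.com/fact-check" '
    'target="_blank">(He was for an invasion.)</a></span></span>'
)


def parse_record(item):
    """Turn one <span class="short-desc"> tag into a dictionary of fields."""
    date_tag = item.find("strong")
    link = item.find("a")
    return {
        "date": date_tag.get_text(strip=True),
        # the lie is the text node that directly follows the <strong> tag
        "lie": str(date_tag.next_sibling).strip().strip("“”"),
        # drop the enclosing parentheses and the final period
        "explanation": link.get_text(strip=True).strip("()."),
        "url": link["href"],
    }


soup = BeautifulSoup(sample_html, "html.parser")
records = [parse_record(item) for item in soup.find_all("span", {"class": "short-desc"})]
print(records[0])
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame()`.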


In [44]:
print(date_list)

['Jan. 21', 'Jan. 21', 'Jan. 23', 'Jan. 25', 'Jan. 25', 'Jan. 25', 'Jan. 25', 'Jan. 26', 'Jan. 26', 'Jan. 28', 'Jan. 29', 'Jan. 30', 'Feb. 3', 'Feb. 4', 'Feb. 5', 'Feb. 6', 'Feb. 6', 'Feb. 6', 'Feb. 6', 'Feb. 7', 'Feb. 7', 'Feb. 9', 'Feb. 9', 'Feb. 10', 'Feb. 12', 'Feb. 16', 'Feb. 16', 'Feb. 16', 'Feb. 16', 'Feb. 16', 'Feb. 16', 'Feb. 18', 'Feb. 18', 'Feb. 24', 'Feb. 24', 'Feb. 24', 'Feb. 27', 'Feb. 27', 'Feb. 28', 'Feb. 28', 'Feb. 28', 'March 3', 'March 4', 'March 4', 'March 7', 'March 13', 'March 13', 'March 15', 'March 17', 'March 20', 'March 21', 'March 22', 'March 22', 'March 22', 'March 29', 'March 31', 'April 2', 'April 2', 'April 5', 'April 6', 'April 11', 'April 12', 'April 12', 'April 12', 'April 12', 'April 16', 'April 18', 'April 21', 'April 21', 'April 27', 'April 28', 'April 28', 'April 28', 'April 29', 'April 29', 'April 29', 'April 29', 'April 29', 'April 29', 'May 1', 'May 1', 'May 1', 'May 2', 'May 4', 'May 4', 'May 4', 'May 8', 'May 8', 'May 8', 'May 12', 'May 12', '

We can now combine the data into a pandas `DataFrame` (a tabular data model that makes data manipulation and analysis easy; we'll learn more about pandas later) for future analyses:

In [45]:
import pandas as pd
lie_df = pd.DataFrame({'date': date_list, 'lie': lie_list, 'explanation': explanation_list, 'url': url_list})    
lie_df

|  | date | lie | explanation | url |
|-----|-----|-----|-----|-----|
| 0 | Jan. 21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | Jan. 21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | Jan. 23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | Jan. 25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | Jan. 25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud | https://www.nytimes.com/2017/01/24/us/politics... |
| ... | ... | ... | ... | ... |
| 175 | Oct. 25 | We have trade deficits with almost everybody. | We have trade surpluses with more than 100 cou... | https://www.bea.gov/newsreleases/international... |
| 176 | Oct. 27 | Wacky & totally unhinged Tom Steyer, who has b... | Steyer has financially supported many winning ... | https://www.opensecrets.org/donor-lookup/resul... |
| 177 | Nov. 1 | Again, we're the highest-taxed nation, just ab... | We're not | http://www.politifact.com/truth-o-meter/statem... |
| 178 | Nov. 7 | When you look at the city with the strongest g... | Several other cities, including New York and L... | http://www.politifact.com/truth-o-meter/statem... |


Save the output in the file system:

In [107]:
lie_df.to_csv('trump_lies.csv')
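
By default, `to_csv()` also writes the row index as an extra column without a header, which pandas labels `Unnamed: 0` when the file is read back. The sketch below compares the default with `index=False` on a tiny made-up frame, using in-memory buffers in place of files on disk.

```python
import io

import pandas as pd

# Tiny stand-in for lie_df (hypothetical values)
small_df = pd.DataFrame({"date": ["Jan. 21", "Jan. 23"], "lie": ["first", "second"]})

with_index = io.StringIO()
small_df.to_csv(with_index)                      # default: the index is written too
without_index = io.StringIO()
small_df.to_csv(without_index, index=False)      # only the data columns are written

with_index.seek(0)
without_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```

Whether to keep the index depends on whether the row numbers carry any meaning for later analyses.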

---


# Scraping Dynamic Web Pages


We are increasingly encountering pages whose contents are dynamically generated within the user's Web browser; that is, the content is determined only when the page is rendered and is updated dynamically based on user interactions and inputs.



Is there a programmatic approach to drive a browser to mimic human users' actions, e.g., clicking on a button, filling in a form, etc., to load contents dynamically?

<img src="https://upload.wikimedia.org/wikipedia/commons/d/d5/Selenium_Logo.png"  width=150/>

 




[Selenium](https://www.selenium.dev) is an umbrella project for a range of tools and libraries that enable and support the automation of web browsers. At its core is WebDriver, an interface to write instruction sets that can be run interchangeably in many browsers.


Python language bindings for Selenium WebDriver are provided by the [selenium](https://pypi.org/project/selenium/) package.




---

## Getting Started

The `selenium.webdriver` module provides all the WebDriver implementations. Supported implementations include Firefox, Chrome, IE and Remote.

In [47]:
from selenium.webdriver import Chrome

# create an instance of the Chrome WebDriver (Firefox(), Ie(), etc. work the same way)
# the argument is the path to the ChromeDriver executable; passing it positionally is the Selenium 3 style
driver = Chrome("C:/chromedriver.exe")

In [5]:
type(driver)

selenium.webdriver.chrome.webdriver.WebDriver

In [6]:
driver.__dict__

{'service': <selenium.webdriver.chrome.service.Service at 0x20db3012490>,
 'command_executor': <selenium.webdriver.chrome.remote_connection.ChromeRemoteConnection at 0x20db2cf9280>,
 '_is_remote': False,
 'session_id': '6657cfe0559d7823a1f80373284837c7',
 'capabilities': {'acceptInsecureCerts': False,
  'browserName': 'chrome',
  'browserVersion': '95.0.4638.54',
  'chrome': {'chromedriverVersion': '95.0.4638.17 (a9d0719444d4b035e284ed1fce73bf6ccd789df2-refs/branch-heads/4638@{#178})',
   'userDataDir': 'C:\\Users\\justi\\AppData\\Local\\Temp\\scoped_dir17684_1809611791'},
  'goog:chromeOptions': {'debuggerAddress': 'localhost:58956'},
  'networkConnectionEnabled': False,
  'pageLoadStrategy': 'normal',
  'platformName': 'windows',
  'proxy': {},
  'setWindowRect': True,
  'strictFileInteractability': False,
  'timeouts': {'implicit': 0, 'pageLoad': 300000, 'script': 30000},
  'unhandledPromptBehavior': 'dismiss and notify',
  'webauthn:extension:credBlob': True,
  'webauthn:extension:


The first thing we'll want to do with `WebDriver` is to navigate to a page by its URL. A convenient way to do so is to call the `get()` method:




In [48]:
driver.get("https://www.google.com/webhp?gl=us")

<div class="alert alert-info">WebDriver will wait until the page has fully loaded before returning control to the script. </div>

In [None]:
print(dir(driver))

In [49]:
driver.title   

'Google'

`WebDriver` offers a number of ways to find elements. 


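- We can use one of its `find_element_by_*()` or `find_elements_by_*()` methods to locate the first matching `WebElement` or a list of all matching `WebElement`s in a page. These methods belong to the Selenium 3 API used throughout these notes; Selenium 4 deprecated them, and releases from 4.3 onward offer only the `find_element(By.NAME, "q")` style, with `By` imported from `selenium.webdriver.common.by`. For example: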


A paragraph element of a specific class:

```html
<p class = "notThisOne"> This is a paragraph of the "notThisOne" class. </p>
```

can be found by using any of:

```python
driver.find_element_by_class_name('notThisOne')
driver.find_element_by_tag_name('p')
driver.find_element_by_css_selector("p.notThisOne")   # "." selects by class; "#" would select by id
```

A hyperlink element that contains a specific link text:

```html
<a href = "continue.html">Continue</a>
```

can be found by using either of:

```python
driver.find_element_by_link_text('Continue')
driver.find_element_by_partial_link_text('Cont')
```

And a text field defined as:

```html
<input type="text" name="passwd" id="passwd-id" />
```

can be located using any of:

```python
driver.find_element_by_id("passwd-id")
driver.find_element_by_name("passwd")
driver.find_element_by_css_selector("input#passwd-id")
```

In [50]:
search_box = driver.find_element_by_name("q")   # locate the input text element by its name attribute

In [51]:
search_box

<selenium.webdriver.remote.webelement.WebElement (session="445eaeb976a82150ef0d1bd69dc79b6b", element="49656bcc-174b-4de8-b5e2-7587509c118f")>

<div class="alert alert-info">A parent WebElement can be chained with <code>find_element(s)_by_*()</code> to access child elements.</div>

Keyboard input can be simulated with the `send_keys()` method:



In [52]:
search_box.clear()                           # clear any pre-populated text
search_box.send_keys("us election 2020")

<div class="alert alert-info">Typing something into a text field won't automatically clear it. Instead, what we type will be appended to what's already there.</div>

 Special keys can be sent using the `Keys` class imported from `selenium.webdriver.common.keys`:

In [53]:
from selenium.webdriver.common.keys import Keys
search_box.send_keys(Keys.RETURN)

Here is the list of special keys that `WebDriver` supports:

In [125]:
print(dir(Keys))

['ADD', 'ALT', 'ARROW_DOWN', 'ARROW_LEFT', 'ARROW_RIGHT', 'ARROW_UP', 'BACKSPACE', 'BACK_SPACE', 'CANCEL', 'CLEAR', 'COMMAND', 'CONTROL', 'DECIMAL', 'DELETE', 'DIVIDE', 'DOWN', 'END', 'ENTER', 'EQUALS', 'ESCAPE', 'F1', 'F10', 'F11', 'F12', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7', 'F8', 'F9', 'HELP', 'HOME', 'INSERT', 'LEFT', 'LEFT_ALT', 'LEFT_CONTROL', 'LEFT_SHIFT', 'META', 'MULTIPLY', 'NULL', 'NUMPAD0', 'NUMPAD1', 'NUMPAD2', 'NUMPAD3', 'NUMPAD4', 'NUMPAD5', 'NUMPAD6', 'NUMPAD7', 'NUMPAD8', 'NUMPAD9', 'PAGE_DOWN', 'PAGE_UP', 'PAUSE', 'RETURN', 'RIGHT', 'SEMICOLON', 'SEPARATOR', 'SHIFT', 'SPACE', 'SUBTRACT', 'TAB', 'UP', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__']


In [54]:
search_box = driver.find_element_by_name("q")   
search_box.clear()         
search_box.send_keys("us election 2020", Keys.RETURN)

We can use a `find_elements_by_*()` method to return a list of matching `WebElement`s:

In [55]:
search_results = driver.find_elements_by_class_name('yuRUbf')
search_results

[<selenium.webdriver.remote.webelement.WebElement (session="445eaeb976a82150ef0d1bd69dc79b6b", element="91df156e-2fca-44ae-846c-11b644a7114d")>,
 <selenium.webdriver.remote.webelement.WebElement (session="445eaeb976a82150ef0d1bd69dc79b6b", element="51451a93-cbff-468f-aab2-74a7fb234617")>,
 <selenium.webdriver.remote.webelement.WebElement (session="445eaeb976a82150ef0d1bd69dc79b6b", element="ec4f46a6-be4c-43b0-9798-44a0df211964")>,
 <selenium.webdriver.remote.webelement.WebElement (session="445eaeb976a82150ef0d1bd69dc79b6b", element="e7cb9321-bc93-49a6-ae5d-b242c8cc0ecb")>,
 <selenium.webdriver.remote.webelement.WebElement (session="445eaeb976a82150ef0d1bd69dc79b6b", element="12c0712f-3851-4da6-8cd6-2a6cb3e400ed")>,
 <selenium.webdriver.remote.webelement.WebElement (session="445eaeb976a82150ef0d1bd69dc79b6b", element="bc011730-a885-4ca5-a73d-7f91310dbdc9")>,
 <selenium.webdriver.remote.webelement.WebElement (session="445eaeb976a82150ef0d1bd69dc79b6b", element="f750107a-d632-489a-90f0-73

The rendered text of a specific element can be retrieved by `.text`:

In [56]:
search_results[0].text

'2020 United States presidential election - Wikipedia\nhttps://en.wikipedia.org › wiki › 2020_United_States_pres...'

The parent `WebElement` can also be chained with a `find_element_by_*()` method to access child elements:

In [57]:
search_results[0].find_element_by_class_name("LC20lb").text

'2020 United States presidential election - Wikipedia'

---

## Transitioning to Beautiful Soup


After retrieving the search result page, we instruct Selenium to hand off the page source to Beautiful Soup:

In [58]:
bs_google = BeautifulSoup(driver.page_source, 'html.parser')

In [59]:
print(bs_google.prettify())

<html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="origin" name="referrer"/>
  <meta content="Anb2GUnhMjfTIX0D2a4a6NPAqPI5GaxxRAiF81XTjHJ2qK4E3Hw3VFM4eaJBgRzz45CNPt624audv+wHOJwfAAEAAABieyJvcmlnaW4iOiJodHRwczovL2dvb2dsZS5jb206NDQzIiwiZmVhdHVyZSI6IlRydXN0VG9rZW5zIiwiZXhwaXJ5IjoxNjI2MjIwNzk5LCJpc1N1YmRvbWFpbiI6dHJ1ZX0=" http-equiv="origin-trial"/>
  <meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"/>
  <title>
   us election 2020 - Google Search
  </title>
  <script async="" nonce="" src="https://apis.google.com/_/scs/abc-static/_/js/k=gapi.gapi.en.hvE_rrhCzPE.O/m=gapi_iframes,googleapis_client/rt=j/sv=1/d=1/ed=1/rs=AHpOoo-98F2Gk-siNaIBZOtcWfXQWKdTpQ/cb=gapi.loaded_0">
  </script>
  <script nonce="">
   (function(){window.google={kEI:'oKl6YcS8GtSC-QbBtIv4Dw',kEXPI:'31',kBL:'VsEN'};google.sn='web';google.kHL='en';})();(function(){
var f=this||self;var h,k=[];function l(

In [61]:
search_result_list = bs_google.find_all('div', {'class':'yuRUbf'})

In [62]:
search_result_list[0]

<div class="yuRUbf"><a data-ved="2ahUKEwjEjLH2nu3zAhVUQd4KHUHaAv8QFnoECBYQAQ" href="https://en.wikipedia.org/wiki/2020_United_States_presidential_election" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://en.wikipedia.org/wiki/2020_United_States_presidential_election&amp;ved=2ahUKEwjEjLH2nu3zAhVUQd4KHUHaAv8QFnoECBYQAQ"><br/><h3 class="LC20lb DKV0Md">2020 United States presidential election - Wikipedia</h3><div class="TbwUpd NJjxre"><cite class="iUh30 Zu0yb qLRx3b tjvcx">https://en.wikipedia.org<span class="dyjrff qzEoUe"> › wiki › 2020_United_States_pres...</span></cite></div></a><div class="B6fmyf"><div class="TbwUpd"><cite class="iUh30 Zu0yb qLRx3b tjvcx">https://en.wikipedia.org<span class="dyjrff qzEoUe"> › wiki › 2020_United_States_pres...</span></cite></div><div class="eFM0qc"></div><div class="csDOgf"><div><div data-acc="n" data-enjspb="true" data-ved="2ahUKEwjEjLH2nu3zAhVUQd4KHUHaAv8Q2esEegQIFhAE" jscontroller="exgaYe" jsdata="l7Bhpb;_;CEAvKM"><div jsaction="KyPa0e:WZTn

In [63]:
search_result_list[0].find("a").attrs['href']

'https://en.wikipedia.org/wiki/2020_United_States_presidential_election'

In [64]:
search_result_list[0].find("a").contents

[<br/>,
 <h3 class="LC20lb DKV0Md">2020 United States presidential election - Wikipedia</h3>,
 <div class="TbwUpd NJjxre"><cite class="iUh30 Zu0yb qLRx3b tjvcx">https://en.wikipedia.org<span class="dyjrff qzEoUe"> › wiki › 2020_United_States_pres...</span></cite></div>]

In [65]:
search_result_list[0].find("a").contents[1].get_text()

'2020 United States presidential election - Wikipedia'

In [66]:
url_list = []
text_list = []

for search_result in search_result_list:
    link_node = search_result.find("a")
    url_list.append(link_node.attrs['href'])
    text_list.append(link_node.contents[1].get_text())

search_result_df = pd.DataFrame({'text': text_list, 'url': url_list})    
search_result_df

| | text | url |
|---|---|---|
| 0 | 2020 United States presidential election - Wik... | https://en.wikipedia.org/wiki/2020_United_Stat... |
| 1 | 2020 United States elections - Wikipedia | https://en.wikipedia.org/wiki/2020_United_Stat... |
| 2 | Presidential Election Results and Electoral Ma... | https://www.cnn.com/election/2020/results/pres... |
| 3 | Presidential Election of 2020 - 270toWin | https://www.270towin.com/2020_Election |
| 4 | 2020 Presidential Election Results: Joe Biden ... | https://www.nytimes.com/interactive/2020/11/03... |
| 5 | Presidential election, 2020 - Ballotpedia | https://ballotpedia.org/Presidential_election,... |
| 6 | United States presidential election of 2020 - ... | https://www.britannica.com/event/United-States... |
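
The loop above relies on the heading being the second child of the link (`contents[1]`), which breaks if the markup shifts. As a minimal sketch, and not the only way to do it, we can look the heading up by tag name and skip blocks that have no link. The HTML string below is a hypothetical, simplified stand-in for the real page, and `extract_results` is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the result blocks shown earlier.
sample_html = """
<div class="yuRUbf"><a href="https://en.wikipedia.org/wiki/2020_United_States_presidential_election"><br/><h3 class="LC20lb DKV0Md">2020 United States presidential election - Wikipedia</h3></a></div>
<div class="yuRUbf"><span>a block without a link</span></div>
<div class="yuRUbf"><a href="https://example.com/page"><h3 class="LC20lb">Example page</h3></a></div>
"""


def extract_results(soup):
    """Return one {'text', 'url'} dict per result block that contains a link."""
    rows = []
    for block in soup.find_all("div", class_="yuRUbf"):
        link_node = block.find("a")
        if link_node is None:
            continue  # skip blocks that carry no anchor at all
        heading = link_node.find("h3")  # locate the title by tag, not by position
        rows.append({
            "text": heading.get_text(strip=True) if heading else "",
            "url": link_node.get("href"),  # .get() returns None if href is missing
        })
    return rows


rows = extract_results(BeautifulSoup(sample_html, "html.parser"))
for row in rows:
    print(row["text"], "->", row["url"])
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)` to build the same kind of table as before.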


Moving through the pages of search results is a matter of repeatedly sending keystrokes with `selenium`. Here we press Return on the element with id `pnnext`, the link to the next page of results:

In [13]:
driver.find_element_by_id("pnnext").send_keys(Keys.RETURN)

Or we can click on an element using the `click()` method:

In [24]:
driver.find_element_by_id("pnnext").click() 
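
Repeating this click lets us walk through several result pages in a row. The function below is only a sketch under a few assumptions: `visit_result_pages` and `handle_page` are hypothetical names, the next-page link is assumed to keep the id `pnnext`, and a real page may need an explicit wait after each click before its source is read. It uses the generic `find_elements(by, value)` call, which returns an empty list instead of raising when nothing matches.

```python
def visit_result_pages(driver, handle_page, max_pages=3):
    """Pass the source of up to max_pages result pages to handle_page.

    Returns the number of pages that were visited.
    """
    visited = 0
    while visited < max_pages:
        handle_page(driver.page_source)
        visited += 1
        if visited == max_pages:
            break  # do not click past the last page we want
        # "id" is the locator strategy string behind selenium's By.ID.
        next_links = driver.find_elements("id", "pnnext")
        if not next_links:
            break  # no next-page link: this was the last result page
        next_links[0].click()
    return visited
```

For example, `visit_result_pages(driver, pages.append)` with `pages = []` would collect the page source of the first three result pages, each of which could then be parsed with Beautiful Soup.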

`WebDriver`'s `back()` and `forward()` methods allow us to move backward and forward in the browser's history:


In [25]:
driver.back()      # use driver.forward() to move forward again

`.refresh()` refreshes the current page:

In [162]:
driver.refresh()

When we are finished with the browser session, we should close the browser window:

In [14]:
driver.close() 

<div class="alert alert-info"> We can also call the <code>quit()</code> method instead of <code>close()</code>. <code>quit()</code> exits the entire browser and ends the WebDriver session, whereas <code>close()</code> closes only the current window or tab.</div>
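
If a scraping script raises an error midway, the browser can be left running in the background. One way to guard against that, shown here as a sketch and not as part of Selenium itself, is a small context manager that always calls `quit()`. The name `browser_session` and its `make_driver` parameter are hypothetical; `make_driver` is any zero-argument callable that returns a driver.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    """Yield a driver built by make_driver() and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # quit() ends the whole session; close() would only close the current window.
        driver.quit()
```

With Selenium installed, this could be used as `with browser_session(webdriver.Chrome) as driver: ...`, assuming Chrome is the browser being driven. The browser is then shut down even if the code inside the `with` block fails.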