# Web Scraping the president's lies in 16 lines of Python
This repository contains the Jupyter notebook and dataset from Data School's introductory web scraping tutorial. All that is required to follow along is a basic understanding of the Python programming language.

By the end of the tutorial, you will be able to scrape data from a static web page using the requests and Beautiful Soup libraries, and export that data into a structured text file using the pandas library.

[Tutorial](https://www.youtube.com/watch?v=Zh2fkZ-uzBU)

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


## 1. Reading the web page into Python

In [3]:
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')
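
Before parsing, it can help to confirm that the request actually succeeded. The sketch below wraps the call in a hypothetical helper, `fetch_html` (not part of the original tutorial), which sets a timeout and raises an exception on HTTP error codes. The 10-second timeout is an arbitrary assumption.

```python
import requests

URL = 'https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html'


def fetch_html(url, timeout=10):
    """Hypothetical helper: return the page's HTML, or raise if the request failed."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Usage (needs network access):
# html = fetch_html(URL)
# print(html[0:500])
```

Failing early here is usually easier to debug than an empty result list later in the notebook.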

## 2. Parsing the HTML using Beautiful Soup
[Differences between parsers](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers) 

In [4]:
soup = BeautifulSoup(r.text, 'html.parser')

## 3. Collecting all of the records


In [51]:
results = soup.find_all('span', attrs={'class':'short-desc'})
print(results[0:3])
print(len(results))

[<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>, <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>, <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” <span class="short-truth"><a href="https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html" target="_bl

## 4. Extracting the date

In [52]:
first_result = results[0]
first_result

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [53]:
first_result.find('strong')

<strong>Jan. 21 </strong>

In [54]:
first_result.find('strong').text

'Jan. 21\xa0'

In [55]:
first_result.find('strong').text[0:-1]

'Jan. 21'

In [56]:
first_result.find('strong').text[0:-1]+', 2017'

'Jan. 21, 2017'
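
Slicing with `[0:-1]` assumes there is exactly one trailing character to drop. As a possible alternative, `str.strip()` removes any leading or trailing whitespace, including the non-breaking space (`\xa0`). The HTML below is a shortened stand-in for one record, not the real page.

```python
from bs4 import BeautifulSoup

# Inline HTML mimicking one record from the page (\xa0 is a non-breaking space)
html = '<span class="short-desc"><strong>Jan. 21\xa0</strong>“Quote.” </span>'
tag = BeautifulSoup(html, 'html.parser').find('strong')

# strip() removes leading/trailing whitespace, including the non-breaking space
date = tag.text.strip() + ', 2017'
print(date)
```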

## 5. Extracting the lie

In [57]:
first_result.contents

[<strong>Jan. 21 </strong>,
 "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” ",
 <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>]

In [58]:
first_result.contents[1]

"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

In [59]:
first_result.contents[2]

<span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>

In [60]:
first_result.contents[1][1:-2]

"I wasn't a fan of Iraq. I didn't want to go into Iraq."

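The slice `[1:-2]` depends on the string having one opening quote at the start and a closing quote plus one space at the end. A sketch of a variant that does not count characters: strip the whitespace first, then strip the curly quotation marks by name.

```python
raw = "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

# First drop surrounding whitespace, then drop the curly quotes themselves
lie = raw.strip().strip('“”')
print(lie)
```

Apostrophes inside the sentence are untouched, because `strip()` only removes characters from the two ends of the string.
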
## 6. Extracting the explanation

In [61]:
first_result.contents[2]

<span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>

In [62]:
first_result.find('a')

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

In [63]:
first_result.find('a').text[1:-1]

'He was for an invasion before he was against it.'

## 7. Extracting the URL

In [64]:
first_result.find('a')

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

In [65]:
first_result.find('a')['href']

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'
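
Square-bracket access raises a `KeyError` if a tag has no `href` attribute. If you are not sure every link has one, `Tag.get()` returns `None` instead. The snippet and the `example.com` URL below are made up for illustration.

```python
from bs4 import BeautifulSoup

# Two links: one with an href attribute, one without (illustrative HTML only)
html = '<span><a href="https://example.com/source">(Explanation.)</a><a>(No link.)</a></span>'
links = BeautifulSoup(html, 'html.parser').find_all('a')

# get() returns the attribute value, or None when the attribute is absent
with_href = links[0].get('href')
without_href = links[1].get('href')
print(with_href, without_href)
```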

## 8. Recap: Beautiful Soup methods and attributes
Before we finish building the dataset, I want to summarize a few ways you can interact with Beautiful Soup objects.

You can apply these two methods to either the initial soup object or a Tag object (such as first_result):

- find(): searches for the first matching tag, and returns a Tag object
- find_all(): searches for all matching tags, and returns a ResultSet object (which you can treat like a list of Tags)

You can extract information from a Tag object (such as first_result) using these two attributes:

- text: extracts the text of a Tag, and returns a string
- contents: extracts the children of a Tag, and returns a list of Tags and strings

It's important to keep track of whether you are interacting with a Tag, ResultSet, list, or string, because that affects which methods and attributes you can access.
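
The types described in this recap can be checked directly with `isinstance()`. This is a small sketch on a made-up snippet shaped like one record from the page; the `example.com` URL is a placeholder.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, ResultSet, Tag

html = ('<span class="short-desc"><strong>Jan. 21 </strong>“Quote.” '
        '<span class="short-truth"><a href="https://example.com">(Fact.)</a></span></span>')
soup = BeautifulSoup(html, 'html.parser')

results = soup.find_all('span', attrs={'class': 'short-desc'})
first_result = results[0]

print(isinstance(results, ResultSet))           # find_all() returns a ResultSet
print(isinstance(first_result, Tag))            # each item in it is a Tag
print(isinstance(first_result.find('a'), Tag))  # find() returns a Tag
print(isinstance(first_result.text, str))       # text is a string
print(isinstance(first_result.contents, list))  # contents is a list
# String children are NavigableString objects, which behave like strings
print(isinstance(first_result.contents[1], NavigableString))
```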

## 9. Building the dataset

In [66]:
records = []
for result in results:
    date = result.find('strong').text[0:-1] + ', 2017'
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append((date, lie, explanation, url))

In [67]:
len(records)

180

In [68]:
records[0:3]

[('Jan. 21, 2017',
  "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
  'He was for an invasion before he was against it.',
  'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'),
 ('Jan. 21, 2017',
  'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.',
  'Trump was on the cover 11 times and Nixon appeared 55 times.',
  'http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/'),
 ('Jan. 23, 2017',
  'Between 3 million and 5 million illegal votes caused me to lose the popular vote.',
  "There's no evidence of illegal voting.",
  'https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html')]

## 10. Applying a tabular data structure

In [69]:
import pandas as pd
df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])

In [70]:
df

| | date | lie | explanation | url |
|---|---|---|---|---|
| 0 | Jan. 21, 2017 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | Jan. 21, 2017 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | Jan. 23, 2017 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | Jan. 25, 2017 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | Jan. 25, 2017 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |
| ... | ... | ... | ... | ... |
| 175 | Oct. 25, 2017 | We have trade deficits with almost everybody. | We have trade surpluses with more than 100 cou... | https://www.bea.gov/newsreleases/international... |
| 176 | Oct. 27, 2017 | Wacky & totally unhinged Tom Steyer, who has b... | Steyer has financially supported many winning ... | https://www.opensecrets.org/donor-lookup/resul... |
| 177 | Nov. 1, 2017 | Again, we're the highest-taxed nation, just ab... | We're not. | http://www.politifact.com/truth-o-meter/statem... |
| 178 | Nov. 7, 2017 | When you look at the city with the strongest g... | Several other cities, including New York and L... | http://www.politifact.com/truth-o-meter/statem... |

In [71]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   date         180 non-null    object
 1   lie          180 non-null    object
 2   explanation  180 non-null    object
 3   url          180 non-null    object
dtypes: object(4)
memory usage: 5.8+ KB

In [73]:
# The dates now contain a space after the comma ('Jan. 21, 2017'), as in section 4.
# Without that space, '21,2017' is not read as a day followed by a year.
df['date'] = pd.to_datetime(df['date'])
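
By default, `pd.to_datetime` has to guess the layout of each string, so a small difference such as a missing space after the comma can change the result. One way to make the expectation explicit is to pass a `format` string. The sketch below uses made-up sample values and assumes English month abbreviations followed by a period, like the rows shown above; dates with full month names would need a different format. With `errors='coerce'`, strings that do not match become `NaT` instead of raising.

```python
import pandas as pd

# Sample strings: two well-formed dates and one missing the space after the comma
dates = pd.Series(['Jan. 21, 2017', 'Nov. 1, 2017', 'Jan. 21,2017'])

# %b = abbreviated month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(dates, format='%b. %d, %Y', errors='coerce')
print(parsed)
```

Counting the `NaT` values afterwards (for example with `parsed.isna().sum()`) shows how many rows did not match the expected format.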

## 11. Exporting the dataset to a CSV file

In [96]:
df.to_csv('trump_lies.csv', index=False, encoding='utf-8')

In [97]:
df = pd.read_csv('trump_lies.csv', parse_dates=['date'], encoding='utf-8')
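
The export and re-import can be tried without touching the disk. This sketch uses a tiny made-up DataFrame and an in-memory buffer in place of `trump_lies.csv`. Passing `index=False` keeps the row index out of the file, so no extra unnamed column appears when the file is read back.

```python
import io

import pandas as pd

# Small stand-in for the real dataset
df = pd.DataFrame({'date': pd.to_datetime(['2017-01-21', '2017-01-23']),
                   'lie': ['First quote.', 'Second quote.']})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates converts the named column back to a datetime column while reading
restored = pd.read_csv(buffer, parse_dates=['date'])
print(restored.dtypes)
```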

## Summary: 16 lines of Python code

In [98]:
import requests
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('span', attrs={'class':'short-desc'})

records = []
for result in results:
    date = result.find('strong').text[0:-1] + ', 2017'
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append((date, lie, explanation, url))

import pandas as pd
df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])
df['date'] = pd.to_datetime(df['date'])
df.to_csv('trump_lies.csv', index=False, encoding='utf-8')

## Appendix C: Alternative syntax for Beautiful Soup

In [99]:
# search for a tag by name
first_result.find('strong')

# shorter alternative: access it like an attribute
first_result.strong

<strong>Jan. 21 </strong>

In [100]:
# search for multiple tags by name and attribute
results = soup.find_all('span', attrs={'class':'short-desc'})

# shorter alternative: if you don't specify a method, it's assumed to be find_all()
results = soup('span', attrs={'class':'short-desc'})

# even shorter alternative: you can specify the attribute as if it's a parameter
results = soup('span', class_='short-desc')
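
One more option, not covered in the cells above, is Beautiful Soup's `select()` method, which takes a CSS selector. The sketch below runs on a made-up snippet and compares it with the keyword form.

```python
from bs4 import BeautifulSoup

# Three spans, two of which carry the class we want (illustrative HTML only)
html = ('<span class="short-desc">A</span>'
        '<span class="short-truth">B</span>'
        '<span class="short-desc">C</span>')
soup = BeautifulSoup(html, 'html.parser')

# keyword form from the cell above
by_keyword = soup('span', class_='short-desc')

# CSS selector form: tag name, a dot, then the class name
by_selector = soup.select('span.short-desc')
print(len(by_keyword), len(by_selector))
```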