## Web Scraping:  
Web scraping is the process of collecting structured web data in an automated fashion. It's also called web data extraction.

Points related to web scraping:  
1. HTML consists of TAGS
2. TAGS can have ATTRIBUTES
3. TAGS can be NESTED

Scraping works only when the HTML follows a consistent pattern, so that the same tags and attributes can be used to locate every record.
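
The three points above can be seen on a small, made-up HTML fragment. The sketch below uses only Python's standard-library `html.parser` module; the `TagCollector` class and the `sample` string are illustrative names, not part of the notebook's scraping code.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record (nesting depth, tag name, attributes) for every opening tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((self.depth, tag, dict(attrs)))
        self.depth += 1  # anything that follows is nested inside this tag

    def handle_endtag(self, tag):
        self.depth -= 1  # the tag is closed, so step back out


# Hypothetical fragment: tags, attributes, and nesting in one line.
sample = (
    '<span class="short-desc"><strong>DATE</strong>LIE'
    '<span class="short-truth"><a href="URL">JUSTIFICATION</a></span></span>'
)

collector = TagCollector()
collector.feed(sample)
for depth, tag, attrs in collector.seen:
    print("  " * depth + tag, attrs)
```

The indentation of the printed lines mirrors the nesting, and each dictionary holds that tag's attributes.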

Import the **requests** library to read the HTML of a web page into Python.

In [1]:
import requests

In [2]:
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

In [3]:
r

<Response [200]>
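
A status of 200 means the request succeeded. As a hedged sketch, the download can be wrapped in a small helper that fails loudly on an error status instead of silently returning an error page. `fetch_html` and its `get` parameter are hypothetical names introduced here; `timeout` and `raise_for_status()` are standard parts of requests.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the HTML of `url`, raising if the server reports an error.

    `get` is an illustrative hook so the function can be exercised
    without network access; by default it is requests.get.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text


# Usage (performs a real network request):
# html = fetch_html('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')
```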

The Response object **r** has a **text** attribute, which contains the HTML code.

In [4]:
# print the first 1000 characters of the HTML
print(r.text[0:1000])

<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]> <!--><html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/"><!--<![endif]-->
<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
<!--[if IE 8]> <html lang="en" class="no-js ie8 lt-ie10 lt-ie9 page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
<!--[if (lt IE 8)]> <html lang="en" class="no-js lt-ie10 lt-ie9 lt-ie8 page-intera

### Parsing the HTML using Beautiful Soup  
Import the **Beautiful Soup** library, which parses HTML and is widely used for web scraping.

In [5]:
from bs4 import BeautifulSoup

In [6]:
soup = BeautifulSoup(r.text,'html.parser')

The code above parses the HTML (stored in r.text) into a special object called soup.

Each record follows the same format. There's an outer **span** tag, and nested within it are a **strong** tag plus another **span** tag, which itself contains an **a** tag. E.g.:  
<span class='short-desc'><strong>DATE</strong>LIE<span class='short-truth'><a href='URL'>JUSTIFICATION</a></span></span>

###  Finding all of the records
The code below searches the soup object for all **span** tags with the attribute class="short-desc".

In [7]:
records =  soup.find_all('span',attrs={'class':'short-desc'})

In [8]:
len(records)

180
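
To see how `find_all()` selects records without downloading the page, the sketch below runs the same search on a made-up two-record fragment. `sample_html` and the placeholder values are assumptions for illustration; the `class_` keyword is Beautiful Soup's shorthand for filtering on the class attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two records in the same format as the page.
sample_html = """
<p>
<span class="short-desc"><strong>DATE 1 </strong>LIE 1 <span class="short-truth"><a href="URL1">(WHY 1)</a></span></span>
<span class="short-desc"><strong>DATE 2 </strong>LIE 2 <span class="short-truth"><a href="URL2">(WHY 2)</a></span></span>
</p>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Same search written two ways: an attrs dictionary, or the class_ keyword.
by_attrs = sample_soup.find_all("span", attrs={"class": "short-desc"})
by_keyword = sample_soup.find_all("span", class_="short-desc")

# The nested short-truth spans are not matched, so each search finds 2 tags.
print(len(by_attrs), len(by_keyword))
```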

In [9]:
records[0]

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [10]:
records[1]

<span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>

In [11]:
records[-1]

<span class="short-desc"><strong>Nov. 11 </strong>“I'd rather have him  – you know, work with him on the Ukraine than standing and arguing about whether or not  – because that whole thing was set up by the Democrats.” <span class="short-truth"><a href="https://www.nytimes.com/interactive/2017/12/10/us/politics/trump-and-russia.html" target="_blank">(There is no evidence that Democrats "set up" Russian interference in the election.)</a></span></span>

We have 180 records, but we need to separate each record into its four components (date, lie, justification, and URL) in order to give the dataset some structure.

#### 1) Extracting the Date

In [12]:
first_record = records[0]

In [13]:
first_record

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

To extract the date, we use the **find()** method, which returns the first tag that matches a specific pattern.

In [14]:
first_record.find('strong')

<strong>Jan. 21 </strong>

In [15]:
first_record.find('strong').text

'Jan. 21\xa0'

In [16]:
first_record.find('strong').text[:-1]

'Jan. 21'
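
The slice `[:-1]` works because every date on this page ends with a non-breaking space (`\xa0`). A possible alternative, shown here only as a sketch, is `str.strip()`, which removes surrounding whitespace (including `\xa0`) and leaves the text intact when there is nothing to remove. The variable names are illustrative.

```python
raw_date = 'Jan. 21\xa0'

# Both approaches give the same result when the trailing character is present.
print(repr(raw_date[:-1]))     # 'Jan. 21'
print(repr(raw_date.strip()))  # 'Jan. 21'

# If a date had no trailing character, slicing would cut off a real digit.
clean_date = 'Jan. 21'
print(repr(clean_date[:-1]))     # 'Jan. 2'
print(repr(clean_date.strip()))  # 'Jan. 21'
```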

#### 2) Extracting the Lie

In [17]:
first_record

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [18]:
first_record.contents

[<strong>Jan. 21 </strong>,
 "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” ",
 <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>]

In [19]:
first_record.contents[1]

"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

In [20]:
first_record.contents[1][1:-2]

"I wasn't a fan of Iraq. I didn't want to go into Iraq."

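The `contents` list mixes two kinds of objects: tags and plain text nodes. The sketch below, run on a made-up record, shows the type of each child and a slightly more defensive way to trim the lie text. The `html` string and variable names are assumptions; `NavigableString` is Beautiful Soup's class for text nodes.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical record in the same format as the page.
html = (
    '<span class="short-desc"><strong>DATE </strong>“LIE” '
    '<span class="short-truth"><a href="URL">(WHY)</a></span></span>'
)
record = BeautifulSoup(html, "html.parser").find("span")

# The children are: the strong tag, a text node, and the inner span tag.
kinds = [type(child).__name__ for child in record.contents]
print(kinds)

text_node = record.contents[1]
print(isinstance(text_node, NavigableString))

# Strip whitespace first, then the surrounding curly quotes.
lie = text_node.strip().strip("“”")
print(lie)
```

Unlike the fixed slice `[1:-2]`, this does not depend on exactly one trailing space, although it would also remove a curly quote that legitimately ends the quoted text.
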
#### 3) Extracting the Justification

In [21]:
first_record.find('a')

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

In [22]:
first_record.find('a').text

'(He was for an invasion before he was against it.)'

In [23]:
first_record.find('a').text[1:-1]

'He was for an invasion before he was against it.'

#### 4) Extracting the URL

In [24]:
first_record.find('a')

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

In [25]:
first_record.find('a')['href']

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'
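
Indexing a tag with `['href']` raises a `KeyError` when the attribute is absent. Every record on this page has a link, but as a hedged sketch for less regular pages, the tag's `get()` method returns `None` instead. The fragment and variable names below are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one link with an href, one without.
html = '<span><a href="https://example.com/source">(WHY)</a><a>(no link)</a></span>'
with_href, without_href = BeautifulSoup(html, "html.parser").find_all("a")

print(with_href["href"])         # dictionary-style access, as in the notebook
print(without_href.get("href"))  # None, because the attribute is missing

try:
    without_href["href"]
except KeyError:
    print("no href attribute on this tag")
```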

#### Similarly fetching all components for the 2nd record

In [26]:
second_record = records[1]

In [27]:
print('Date : ',second_record.find('strong').text[:-1])
print('Lie : ',second_record.contents[1][1:-2])
print('Justification :',second_record.find('a').text[1:-1] )
print('URL : ',second_record.find('a')['href'])

Date :  Jan. 21
Lie :  A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.
Justification : Trump was on the cover 11 times and Nixon appeared 55 times.
URL :  http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/


### Creating the dataset
We use a for loop to repeat the extraction steps above for all 180 records.

In [28]:
results = []

for record in records:
    date = record.find('strong').text[:-1]
    lie = record.contents[1][1:-2]
    justification = record.find('a').text[1:-1]
    URL = record.find('a')['href']
    results.append((date,lie,justification,URL))

In [29]:
len(results) # total 180 records

180

In [30]:
results[0]

('Jan. 21',
 "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
 'He was for an invasion before he was against it.',
 'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the')
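
The loop above assumes every record contains a **strong** tag and an **a** tag, which holds for this page. A minimal sketch of a more defensive variant, assuming malformed records should simply be skipped, is shown below on made-up HTML. `parse_record` and `sample_html` are hypothetical names, not part of the original notebook.

```python
from bs4 import BeautifulSoup


def parse_record(record):
    """Return (date, lie, justification, url), or None if a piece is missing."""
    strong = record.find("strong")
    link = record.find("a")
    if strong is None or link is None or len(record.contents) < 2:
        return None
    date = strong.text.strip()
    # Strip whitespace, then drop the opening and closing quote characters.
    lie = str(record.contents[1]).strip()[1:-1]
    # Drop the surrounding parentheses, as in the notebook.
    justification = link.text.strip()[1:-1]
    return (date, lie, justification, link.get("href"))


# Hypothetical input: one well-formed record and one malformed record.
sample_html = (
    '<span class="short-desc"><strong>DATE\xa0</strong>“LIE” '
    '<span class="short-truth"><a href="URL">(WHY)</a></span></span>'
    '<span class="short-desc">malformed record</span>'
)
sample_records = BeautifulSoup(sample_html, "html.parser").find_all(
    "span", class_="short-desc"
)

parsed = [parse_record(record) for record in sample_records]
sample_results = [row for row in parsed if row is not None]
print(sample_results)
```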

### Converting the data into tabular structure
We use the **pandas** library to convert the list of tuples into a DataFrame.

In [31]:
import pandas as pd

In [32]:
df = pd.DataFrame(results, columns=['Date','Lie','Justification','URL'])

In [33]:
df.head()

| | Date | Lie | Justification | URL |
|---|---|---|---|---|
| 0 | Jan. 21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | Jan. 21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | Jan. 23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | Jan. 25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | Jan. 25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |
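
The conversion can be checked on a small, made-up list of tuples. In the sketch below, each tuple becomes one row and the `columns` argument names the tuple positions in order. `sample_results` and `sample_df` hold placeholder values, not scraped data.

```python
import pandas as pd

# Hypothetical rows in the same (date, lie, justification, URL) order.
sample_results = [
    ("DATE 1", "LIE 1", "WHY 1", "URL1"),
    ("DATE 2", "LIE 2", "WHY 2", "URL2"),
]

sample_df = pd.DataFrame(
    sample_results, columns=["Date", "Lie", "Justification", "URL"]
)

print(sample_df.shape)          # (rows, columns)
print(sample_df.loc[0, "Lie"])  # value from the first row's Lie column
```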
