
***READING THE WEB PAGE INTO PYTHON***

In [1]:
import requests
# Fetch the webpage from the URL & store it in a response object (r)
r = requests.get ('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')
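A request can fail or hang, so a slightly more defensive fetch may be worth considering. The sketch below is only an illustration: `fetch_html` and its 10-second default timeout are hypothetical choices, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as text, or raise if the request fails.

    fetch_html is a hypothetical helper; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text
```

Calling `fetch_html` with the article URL would return the same string as `r.text` above when the request succeeds.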

In [2]:
# Print the first 500 characters of the page's HTML
print (r.text [0:500])

<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]> <!--><html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/"><!--<![endif]-->
<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion page


***PARSE THE HTML USING BEAUTIFUL SOUP LIBRARY***

In [3]:
from bs4 import BeautifulSoup
# Parse the HTML stored in r.text into a BeautifulSoup object called soup
soup = BeautifulSoup (r.text, 'html.parser') 

***COLLECTING ALL THE RECORDS***

In [4]:
# FORMAT OF EACH RECORD
# This is the pattern that allows us to build our dataset
'''<span class = "short-desc"><strong> DATE </strong> LIE <span class = "short-truth">
<a href = "URL"> EXPLANATION </a></span></span>'''

'<span class = "short-desc"><strong> DATE </strong> LIE <span class = "short-truth">\n<a href = "URL"> EXPLANATION </a></span></span>'

In [5]:
# Find all the records using Beautiful Soup
# This code searches the soup object for all <span> tags with the attribute class = "short-desc"
# It returns a special Beautiful Soup object, called a ResultSet, containing the search results

In [6]:
results = soup.find_all('span', attrs={'class':'short-desc'})

In [7]:
# results acts like a Python list
len (results)

180

In [8]:
print("Title of the website is : ")
for title in soup.find_all('title'):
    print(title.get_text())

Title of the website is : 
Opinion | President Trump’s Lies, the Definitive List - The New York Times


In [9]:
# Slice the object like a list, in order to examine the first three results.
results [0:3]

[<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” <span class="short-truth"><a href="https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html" target="_

In [10]:
# Last result
results [-1]

<span class="short-desc"><strong>Nov. 11 </strong>“I'd rather have him  – you know, work with him on the Ukraine than standing and arguing about whether or not  – because that whole thing was set up by the Democrats.” <span class="short-truth"><a href="https://www.nytimes.com/interactive/2017/12/10/us/politics/trump-and-russia.html" target="_blank">(There is no evidence that Democrats "set up" Russian interference in the election.)</a></span></span>

***EXTRACTING THE DATE***

In [11]:
# First record
first_result = results [0]
first_result

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

*To locate the date, we can use the Tag's find() method, which returns the first tag that matches a specific pattern, in contrast to the find_all() method we used above, which returns all tags that match a pattern:*

In [12]:
'''This code searches first_result for the first instance of a <strong> tag, 
and again returns a Beautiful Soup "Tag" object (not a string).'''
first_result.find ('strong')

<strong>Jan. 21 </strong>
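The two methods also differ in what they return when nothing matches, which matters once the code runs over many records. The sketch below uses a made-up HTML snippet (the statement, explanation, and example.com URL are placeholders, not data from the article).

```python
from bs4 import BeautifulSoup

# Hypothetical record in the same shape as the article's markup
html = (
    '<span class="short-desc"><strong>Jan. 21 </strong>“Example statement.” '
    '<span class="short-truth"><a href="https://example.com/source">'
    '(Example explanation.)</a></span></span>'
)
snippet = BeautifulSoup(html, 'html.parser')

first_strong = snippet.find('strong')   # a single Tag
all_spans = snippet.find_all('span')    # every matching Tag, outer and inner
missing_one = snippet.find('table')     # no match: find() returns None
missing_all = snippet.find_all('table') # no match: find_all() returns an empty result
```

Because `find()` can return `None`, chaining `.text` onto it directly raises an `AttributeError` for any record that lacks the tag.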

In [13]:
'''Since we want to extract the text between the opening and closing tags, 
we can access its text attribute, which does in fact return a regular Python string:'''
first_result.find ('strong').text 

'Jan. 21\xa0'

In [14]:
'''\xa0 is the "escape sequence" for the non-breaking space character, 
which is written as &nbsp; in the HTML source.'''
# Slice it off from the end of the string
first_result.find ('strong').text [0:-1]

'Jan. 21'

In [15]:
'''Finally, we're going to add the year, 
since we don't want our dataset to include ambiguous dates:'''
first_result.find ('strong').text [0:-1] + ', 2017' 

'Jan. 21, 2017'
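Slicing off the last character assumes every date ends with exactly one non-breaking space. As a hedged alternative, `str.strip()` removes surrounding whitespace, and Python counts `\xa0` as whitespace, so it is safe even when the trailing character is absent. The variable names below are illustrative only.

```python
raw_date = 'Jan. 21\xa0'

# strip() removes leading/trailing whitespace, including the non-breaking space
clean_date = raw_date.strip()
full_date = clean_date + ', 2017'

# A date that has no trailing character is left unchanged by strip(),
# whereas slicing with [0:-1] would cut off its last digit
already_clean = 'Jan. 21'.strip()
```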

***EXTRACTING THE LIE***

In [16]:
first_result

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

*Our goal is to extract the two sentences about Iraq. Unfortunately, there isn't a pair of opening and closing tags that starts immediately before the lie and ends immediately after the lie. Therefore, we need a different technique: the contents attribute, which lists a tag's children.*

In [17]:
first_result.contents

[<strong>Jan. 21 </strong>,
 "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” ",
 <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>]

In [18]:
# Index the list to extract the 2nd element
first_result.contents [1]  # index 1, since Python lists are zero-indexed

"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

In [19]:
'''Finally, we'll slice off the curly quotation marks as well as the extra space at the end:'''
first_result.contents [1] [1:-2]

"I wasn't a fan of Iraq. I didn't want to go into Iraq."

***EXTRACTING THE EXPLANATION***

In [20]:
# The first option is to slice the contents attribute
first_result.contents [2]

<span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>

In [21]:
# The second option is to search for the surrounding tag, like we did when extracting the date:
first_result.find ('a')

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

In [22]:
'''Either way, we can access the text attribute and then slice off 
the opening and closing parentheses:'''
first_result.find ('a').text [1:-1]

'He was for an invasion before he was against it.'
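The fixed slices `[1:-2]` and `[1:-1]` depend on each string having exactly the expected wrapping characters. One possible alternative, sketched below, strips whitespace first and then strips only the named wrapping characters. `strip_wrapping` is a hypothetical helper, not something defined in the notebook.

```python
def strip_wrapping(text, chars):
    """Remove surrounding whitespace, then any of `chars` from both ends.

    Note: strip(chars) removes every leading/trailing character in the set,
    so doubled characters such as '))' at the end would both be removed.
    """
    return text.strip().strip(chars)


raw_lie = "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "
raw_explanation = '(He was for an invasion before he was against it.)'

clean_lie = strip_wrapping(raw_lie, '“”')
clean_explanation = strip_wrapping(raw_explanation, '()')
```

For these two strings the results match the slices used above.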

***EXTRACTING THE URL***

In [23]:
'''Extract the URL of the article that substantiates the writer's claim that the President was lying.

Let's examine the <a> tag within first_result:'''
first_result.find ('a')

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

*In this case, the text we want to extract is located within the tag itself. Specifically, we want to access the value of the href attribute within the <a> tag.*

*Beautiful Soup treats tag attributes and their values like key-value pairs in a dictionary: you put the attribute name in brackets (like a dictionary key), and you get back the attribute's value:*

In [24]:
first_result.find ('a') ['href']

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'
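The dictionary analogy extends a little further: a Tag also offers `get()`, which returns `None` for a missing attribute instead of raising `KeyError`, and `attrs`, which holds all attributes at once. The link below is a made-up example, not one from the article.

```python
from bs4 import BeautifulSoup

# Hypothetical link in the same shape as the article's <a> tags
html = '<a href="https://example.com/source" target="_blank">(Example explanation.)</a>'
link = BeautifulSoup(html, 'html.parser').find('a')

href = link['href']            # bracket access, like a dictionary key
title = link.get('title')      # attribute is absent, so get() returns None
all_attributes = link.attrs    # every attribute as a dictionary
```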

**You can apply these two methods to either the initial soup object or a Tag object (such as first_result):**

- find(): searches for the first matching tag, and returns a Tag object
- find_all(): searches for all matching tags, and returns a ResultSet object (which you can treat like a list of Tags)

**You can extract information from a Tag object (such as first_result) using these two attributes:**

- text: extracts the text of a Tag, and returns a string
- contents: extracts the children of a Tag, and returns a list of Tags and strings
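The return types listed above can be checked directly. This is a small sketch on a made-up record (placeholder text and an example.com URL), importing `Tag` and `ResultSet` from `bs4.element` purely for the type checks.

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet, Tag

# Hypothetical record in the same shape as the article's markup
html = (
    '<span class="short-desc"><strong>Jan. 21 </strong>“Example statement.” '
    '<span class="short-truth"><a href="https://example.com/source">'
    '(Example explanation.)</a></span></span>'
)
record = BeautifulSoup(html, 'html.parser').find('span')

found_one = record.find('a')        # Tag
found_all = record.find_all('a')    # ResultSet, which behaves like a list
record_text = record.text           # plain string with all nested text
children = record.contents          # list: <strong> Tag, a string, <span> Tag
```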

***BUILDING THE DATASET***

In [25]:
# Start with an empty list; each record is appended to it as a tuple
records = []
# Loop through the results, extracting the same 4 components from each one
for result in results:
    date = result.find('strong').text[0:-1] + ', 2017'
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    # Append the record to the records list as a tuple
    records.append((date, lie, explanation, url))

In [26]:
# Since there were 180 results, we should have 180 records
len (records)

180
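The loop above works because all 180 records share the same structure. If the page layout changed, a missing tag would raise an exception partway through. The following is a more defensive sketch of the same loop: `parse_record` is a hypothetical helper, and the HTML is made-up sample data including one deliberately malformed entry.

```python
from bs4 import BeautifulSoup


def parse_record(result, year='2017'):
    """Return (date, lie, explanation, url), or None if the record is malformed."""
    strong = result.find('strong')
    link = result.find('a')
    if strong is None or link is None or len(result.contents) < 2:
        return None
    date = strong.text.strip() + ', ' + year
    lie = str(result.contents[1]).strip().strip('“”')
    explanation = link.text.strip().strip('()')
    url = link.get('href')
    return (date, lie, explanation, url)


# Hypothetical sample: one well-formed record and one without a date or link
html = (
    '<span class="short-desc"><strong>Jan. 21\xa0</strong>“Example statement.” '
    '<span class="short-truth"><a href="https://example.com/source">'
    '(Example explanation.)</a></span></span>'
    '<span class="short-desc">Malformed entry without a date or link</span>'
)
soup = BeautifulSoup(html, 'html.parser')
results = soup.find_all('span', attrs={'class': 'short-desc'})

# Keep only the records that parsed successfully
records = [rec for rec in map(parse_record, results) if rec is not None]
```

Comparing `len(results)` with `len(records)` then shows how many entries were skipped.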

In [27]:
# 1st 3 records
records [0:3]

[('Jan. 21, 2017',
  "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
  'He was for an invasion before he was against it.',
  'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'),
 ('Jan. 21, 2017',
  'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.',
  'Trump was on the cover 11 times and Nixon appeared 55 times.',
  'http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/'),
 ('Jan. 23, 2017',
  'Between 3 million and 5 million illegal votes caused me to lose the popular vote.',
  "There's no evidence of illegal voting.",
  'https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html')]

***APPLYING A TABULAR DATA STRUCTURE***

*The last major step in this process is to apply a tabular data structure to our existing structure (which is a list of tuples). We're going to do this using the pandas library. The primary data structure in pandas is the "DataFrame", which is suitable for tabular data with columns of different types.*

In [28]:
# We can convert our list of tuples into a DataFrame by passing it to the 
# DataFrame constructor and specifying the desired column names.
import pandas as pd
df = pd.DataFrame (records, columns = ['date','lie','explanation','url'])

In [29]:
# Top of the DataFrame
df.head ()

|   | date | lie | explanation | url |
|---|------|-----|-------------|-----|
| 0 | Jan. 21, 2017 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | Jan. 21, 2017 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | Jan. 23, 2017 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | Jan. 25, 2017 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | Jan. 25, 2017 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |


In [30]:
# Bottom of the DataFrame
df.tail ()

|   | date | lie | explanation | url |
|---|------|-----|-------------|-----|
| 175 | Oct. 25, 2017 | We have trade deficits with almost everybody. | We have trade surpluses with more than 100 cou... | https://www.bea.gov/newsreleases/international... |
| 176 | Oct. 27, 2017 | Wacky & totally unhinged Tom Steyer, who has b... | Steyer has financially supported many winning ... | https://www.opensecrets.org/donor-lookup/resul... |
| 177 | Nov. 1, 2017 | Again, we're the highest-taxed nation, just ab... | We're not. | http://www.politifact.com/truth-o-meter/statem... |
| 178 | Nov. 7, 2017 | When you look at the city with the strongest g... | Several other cities, including New York and L... | http://www.politifact.com/truth-o-meter/statem... |
| 179 | Nov. 11, 2017 | I'd rather have him – you know, work with him... | There is no evidence that Democrats "set up" R... | https://www.nytimes.com/interactive/2017/12/10... |


In [31]:
# Convert the date column to pandas' special "datetime" format
df['date'] = pd.to_datetime(df['date'])
df

|   | date | lie | explanation | url |
|---|------|-----|-------------|-----|
| 0 | 2017-01-21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | 2017-01-21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | 2017-01-23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | 2017-01-25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | 2017-01-25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |
| ... | ... | ... | ... | ... |
| 175 | 2017-10-25 | We have trade deficits with almost everybody. | We have trade surpluses with more than 100 cou... | https://www.bea.gov/newsreleases/international... |
| 176 | 2017-10-27 | Wacky & totally unhinged Tom Steyer, who has b... | Steyer has financially supported many winning ... | https://www.opensecrets.org/donor-lookup/resul... |
| 177 | 2017-11-01 | Again, we're the highest-taxed nation, just ab... | We're not. | http://www.politifact.com/truth-o-meter/statem... |
| 178 | 2017-11-07 | When you look at the city with the strongest g... | Several other cities, including New York and L... | http://www.politifact.com/truth-o-meter/statem... |
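When converting scraped strings to dates, it can help to find out which values failed to parse rather than have the conversion stop with an error. A minimal sketch, assuming a small made-up Series that includes one deliberately invalid value: passing `errors='coerce'` to `pd.to_datetime` turns unparseable entries into `NaT`.

```python
import pandas as pd

# Hypothetical sample: two dates in the scraped format plus one bad value
dates = pd.Series(['Jan. 21, 2017', 'Nov. 11, 2017', 'not a date'])

# Unparseable strings become NaT instead of raising an exception
parsed = pd.to_datetime(dates, errors='coerce')

# Select the original strings that could not be converted
unparsed = dates[parsed.isna()]
```

Inspecting `unparsed` before overwriting the column shows whether any record needs manual cleanup.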


***EXPORTING THE DATASET TO A CSV FILE***

*We'll use pandas to export the DataFrame to a CSV (comma-separated values) file, which is the simplest and most common way to store tabular data in a text file.*

In [32]:
df.to_csv ('trump_lies.csv', index = False, encoding = 'utf-8')

In [33]:
# Read the CSV file back into pandas
df = pd.read_csv('trump_lies.csv', parse_dates=['date'], encoding='utf-8')
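Several fields in this dataset contain commas themselves (every date string does before conversion, and so do many of the lies), so the CSV format has to quote them. The sketch below uses only the standard library's `csv` module and a made-up row to show that quoting and the round trip. It illustrates the file format, not how pandas implements `to_csv`.

```python
import csv
import io

# Hypothetical row in the same shape as the records built above
rows = [(
    'Jan. 21, 2017',
    'Example statement, with a comma.',
    'Example explanation.',
    'https://example.com/source',
)]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['date', 'lie', 'explanation', 'url'])
writer.writerows(rows)
csv_text = buffer.getvalue()

# Reading the text back recovers the original fields, commas included
round_tripped = list(csv.reader(io.StringIO(csv_text)))
```

Fields containing a comma appear wrapped in double quotes in `csv_text`, which is why the values survive the round trip intact.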

In [34]:
# Summary (16 Lines Of Code)

# import requests
# r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

# from bs4 import BeautifulSoup
# soup = BeautifulSoup(r.text, 'html.parser')
# results = soup.find_all('span', attrs={'class':'short-desc'})

# records = []
# for result in results:
#     date = result.find('strong').text[0:-1] + ', 2017'
#     lie = result.contents[1][1:-2]
#     explanation = result.find('a').text[1:-1]
#     url = result.find('a')['href']
#     records.append((date, lie, explanation, url))

# import pandas as pd
# df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])
# df['date'] = pd.to_datetime(df['date'])
# df.to_csv('trump_lies.csv', index=False, encoding='utf-8')