# 'Trump's Lies' is a New York Times article that quotes statements made by Trump and pairs each with an explanation.

## Objective: extract the text by web scraping with the Beautiful Soup and requests libraries, load it into a pandas data frame, and export it to a CSV file.

## Reading text in Python

In [1]:
import requests

In [2]:
r=requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

## Parsing HTML

In [4]:
from bs4 import BeautifulSoup

In [5]:
soup=BeautifulSoup(r.text,'html.parser')

## Collecting all text data

Each record has the following format:

<span class="short-desc"><strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span></span>

There's an outer <span> tag, and then nested within it is a <strong> tag plus another <span> tag, which itself contains an <a> tag. All of these tags affect the formatting of the text.
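To make the nesting concrete, here is a minimal sketch that parses a single record and walks down through its tags. The HTML string is copied from the first record printed in this notebook; the names sample_html, record, inner_span and link are illustrative only.

```python
from bs4 import BeautifulSoup

# One record, copied from the output shown in this notebook
sample_html = (
    '<span class="short-desc"><strong>Jan. 21\xa0</strong>'
    '“I wasn\'t a fan of Iraq. I didn\'t want to go into Iraq.” '
    '<span class="short-truth">'
    '<a href="https://www.buzzfeed.com/andrewkaczynski/'
    'in-2002-donald-trump-said-he-supported-invading-iraq-on-the" '
    'target="_blank">(He was for an invasion before he was against it.)</a>'
    '</span></span>'
)

soup = BeautifulSoup(sample_html, 'html.parser')
record = soup.find('span', attrs={'class': 'short-desc'})

# The outer <span> holds three children: <strong>, a text node, inner <span>
child_types = [type(child).__name__ for child in record.contents]
print(child_types)

# The <a> tag sits one level further down, inside the inner <span>
inner_span = record.find('span', attrs={'class': 'short-truth'})
link = inner_span.find('a')
print(link.parent['class'], link['href'])
```

Printing the child types shows why the statement can later be picked out by position: it is the only plain text node among the outer span's direct children.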

In [7]:
results=soup.find_all('span',attrs={'class':'short-desc'})

In [8]:
len(results)

180

In [9]:
results[0:3]

[<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” <span class="short-truth"><a href="https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html" target="_

In [11]:
results[-1]

<span class="short-desc"><strong>Nov. 11 </strong>“I'd rather have him  – you know, work with him on the Ukraine than standing and arguing about whether or not  – because that whole thing was set up by the Democrats.” <span class="short-truth"><a href="https://www.nytimes.com/interactive/2017/12/10/us/politics/trump-and-russia.html" target="_blank">(There is no evidence that Democrats "set up" Russian interference in the election.)</a></span></span>

In [12]:
first_result=results[0]

In [13]:
first_result

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [14]:
first_result.find('strong')

<strong>Jan. 21 </strong>

In [15]:
first_result.find('strong').text

'Jan. 21\xa0'

In [18]:
first_result.find('strong').text[0:7]

'Jan. 21'

Since all of these statements were recorded in 2017, we can append the year 2017 so that the date can be parsed once it is in a data frame.

In [44]:
first_result.find('strong').text[0:7]+', 2017'

'Jan. 21, 2017'
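The fixed slice [0:7] assumes every date string is exactly seven characters long. As a hedged alternative, str.strip() removes the trailing non-breaking space whatever the length, and datetime.strptime can then confirm that the result is a real date. The helper name clean_date and its year argument are assumptions made for this sketch, not part of the original notebook.

```python
from datetime import datetime


def clean_date(raw_text, year=2017):
    """Strip surrounding whitespace (non-breaking spaces too) and append the year."""
    return f"{raw_text.strip()}, {year}"


# '\xa0' is the non-breaking space seen at the end of the <strong> text
date_string = clean_date('Jan. 21\xa0')
print(date_string)

# '%b.' matches an abbreviated month name followed by a period, as in 'Jan.'
parsed = datetime.strptime(date_string, '%b. %d, %Y')
print(parsed.date())
```

This format string only covers dates written like 'Jan. 21'; a date whose month is spelled differently would need its own format.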

The remaining text has two parts: first the statement, then the explanation.

## Extracting Statement

In the HTML, the first piece of text is the statement and the second is the explanation of it, so let us extract them separately.

In [20]:
first_result

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [21]:
first_result.contents

[<strong>Jan. 21 </strong>,
 "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” ",
 <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>]

In [22]:
first_result.contents[1]

"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

In [25]:
first_result.contents[1][1:-2]

"I wasn't a fan of Iraq. I didn't want to go into Iraq."

In [26]:
first_result.find('a')

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

In [27]:
first_result.find('a').text

'(He was for an invasion before he was against it.)'

In [29]:
first_result.find('a').text[1:-2]

'He was for an invasion before he was against it'

In [30]:
first_result.find('a')['href']

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'
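The four extraction steps above can be gathered into one helper. The sketch below is a hedged variant rather than the notebook's own code: it strips characters by value instead of by position, and it tolerates a record without an <a> tag, since find() returns None when nothing matches. The function name parse_record and its year argument are hypothetical.

```python
from bs4 import BeautifulSoup


def parse_record(record, year=2017):
    """Return (date, statement, explanation, url) for one short-desc span."""
    date = f"{record.find('strong').text.strip()}, {year}"
    # contents[1] is the text node between <strong> and the inner <span>;
    # drop surrounding whitespace first, then the curly quotes
    statement = record.contents[1].strip().strip('“”')
    link = record.find('a')  # None when the record has no link
    explanation = link.text.strip('(). ') if link is not None else None
    url = link['href'] if link is not None else None
    return date, statement, explanation, url


# One record, copied from the output shown in this notebook
sample_html = (
    '<span class="short-desc"><strong>Jan. 21\xa0</strong>'
    '“I wasn\'t a fan of Iraq. I didn\'t want to go into Iraq.” '
    '<span class="short-truth">'
    '<a href="https://www.buzzfeed.com/andrewkaczynski/'
    'in-2002-donald-trump-said-he-supported-invading-iraq-on-the" '
    'target="_blank">(He was for an invasion before he was against it.)</a>'
    '</span></span>'
)
record = BeautifulSoup(sample_html, 'html.parser').find(
    'span', attrs={'class': 'short-desc'}
)
print(parse_record(record))
```

Stripping by value keeps the helper working if a statement happens to lack the trailing space that the positional slice [1:-2] relies on.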

## Building the dataset

In [45]:
records=[]
for result in results:
    # read every field from the current result, not from first_result
    # strip() drops the trailing \xa0 whatever the length of the date
    date=result.find('strong').text.strip()+', 2017'
    statement=result.contents[1][1:-2]
    explanation=result.find('a').text[1:-2]
    url=result.find('a')['href']
    records.append((date,statement,explanation,url))

In [46]:
records[0:2]

[('Jan. 21, 2017',
  "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
  'He was for an invasion before he was against it',
  'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'),
 ('Jan. 21, 2017',
  'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.',
  'Trump was on the cover 11 times and Nixon appeared 55 times',
  'http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/')]

In [47]:
len(records)

180

In [48]:
records[0]

('Jan. 21, 2017',
 "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
 'He was for an invasion before he was against it',
 'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the')

## Extracting into a data frame

In [49]:
import pandas as pd

In [50]:
df=pd.DataFrame(data=records,columns=['date','statement','explanation','url'])

In [51]:
df.head(2)

,date,statement,explanation,url
0,"Jan. 21, 2017",I wasn't a fan of Iraq. I didn't want to go in...,He was for an invasion before he was against it,https://www.buzzfeed.com/andrewkaczynski/in-20...
1,"Jan. 21, 2017",A reporter for Time magazine — and I have been...,Trump was on the cover 11 times and Nixon appe...,http://nation.time.com/2013/11/06/10-things-yo...

The date strings look like 'Jan. 21, 2017', so the matching format is "%b. %d, %Y": an abbreviated month name, a period, the day, a comma, and the year. Any date whose month is written differently would need its own handling.

In [56]:
df['date']=pd.to_datetime(df['date'],format="%b. %d, %Y")

In [57]:
df.head(2)

,date,statement,explanation,url
0,2017-01-21,I wasn't a fan of Iraq. I didn't want to go in...,He was for an invasion before he was against it,https://www.buzzfeed.com/andrewkaczynski/in-20...
1,2017-01-21,A reporter for Time magazine — and I have been...,Trump was on the cover 11 times and Nixon appe...,http://nation.time.com/2013/11/06/10-things-yo...


## Exporting data to a CSV file

In [59]:
df.to_csv('Trump_lies.csv',index=False,encoding='utf-8')
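A CSV file stores dates as plain text, so the datetime type is lost on export unless the reader asks for it back. The sketch below round-trips a one-row stand-in for the scraped data through an in-memory buffer. The names sample, buffer and restored are illustrative, and the buffer takes the place of Trump_lies.csv so that nothing is written to disk.

```python
import io

import pandas as pd

columns = ['date', 'statement', 'explanation', 'url']
# A one-row stand-in built from the first record shown in this notebook
sample = pd.DataFrame(
    data=[(
        'Jan. 21, 2017',
        "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
        'He was for an invasion before he was against it',
        'https://www.buzzfeed.com/andrewkaczynski/'
        'in-2002-donald-trump-said-he-supported-invading-iraq-on-the',
    )],
    columns=columns,
)
sample['date'] = pd.to_datetime(sample['date'], format='%b. %d, %Y')

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# Without parse_dates the date column would come back as plain strings
restored = pd.read_csv(buffer, parse_dates=['date'])
print(restored.dtypes)
```

The same parse_dates argument applies when reading the exported file by name instead of from a buffer.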