---
    Gabriel Graells Solé - gabriel.graells01@estudiant.upf.edu
---

# Politifact Scraper

The goal of this notebook is to retrieve all of the fact-checked news on the [Politifact](https://www.politifact.com/factchecks/list/) website. After retrieving the information, we will clean the data and mine it.

First we will retrieve the links to all fact checks. Then, from each fact check, we will retrieve the title, body, tags, author and rating.

In [None]:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import csv
import pandas as pd

---

## Retrieve List of Fact Checked News

Navigate through all the list pages until the end is reached.

In [None]:
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

#Connect Selenium to Chrome webdriver (PATH is assumed to be the folder containing chromedriver)
driver = webdriver.Chrome(f'{PATH}chromedriver')

#Open Politifact Fact Check list Page
driver.get("https://www.politifact.com/factchecks/list/")

urls = []
click = True
while click:
    #Get HTML
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    
    #Select the links to the fact checks
    urls_ = soup.select('.o-listicle__content .o-listicle__list .m-statement__quote [href]')
    
    #Extract the URLs from the selected elements
    for u in urls_:
        urls.append(u['href'])
    
    #Navigate to the next page; stop when there is no 'Next' link
    try:
        link = driver.find_element(By.LINK_TEXT, 'Next')
        link.click()
        time.sleep(0.5)
    except NoSuchElementException:
        click = False
        print('Done.')

#Exit
driver.quit()

#Remove duplicates
urls = set(urls)

num_urls = len(urls)
print(f'There is a total of {num_urls} URLS.')

There is a total of 18421 URLS.
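
Converting the list to a `set` removes duplicates but discards the order in which the links appeared on the listing pages. If that order matters (for example, to keep the fact checks in listing order), a minimal sketch of an order-preserving alternative is shown below. The helper name `dedupe_keep_order` and the sample paths are hypothetical and only serve as illustration.

```python
def dedupe_keep_order(items):
    """Return the unique items in first-seen order (dicts keep insertion order)."""
    return list(dict.fromkeys(items))

# Hypothetical relative links standing in for the scraped hrefs.
sample = ['/factchecks/a/', '/factchecks/b/', '/factchecks/a/', '/factchecks/c/']
unique = dedupe_keep_order(sample)
print(unique)
```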


In [None]:
#Save the list in a CSV file
with open(f'{PATH}ULR_list.csv', 'w', newline='') as f:
    wr = csv.writer(f, quoting=csv.QUOTE_ALL)
    wr.writerow(urls)
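
`writerow` stores the whole collection as a single comma-separated row with every field quoted. A sketch of an alternative layout, one URL per row, which is simpler to read back with `csv.reader`, is shown below. It writes to an in-memory buffer rather than to `ULR_list.csv`, and the sample links are hypothetical.

```python
import csv
import io

# Hypothetical links standing in for the scraped set.
sample_urls = {'/factchecks/a/', '/factchecks/b/'}

# Write one URL per row; sorted() gives a stable order for a set.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
for u in sorted(sample_urls):
    writer.writerow([u])

# Read the rows back; csv.reader removes the quotes and line endings.
buffer.seek(0)
restored = [row[0] for row in csv.reader(buffer) if row]
print(restored)
```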

---

## Retrieve news information

First we will retrieve the **Title, Body, Tags, Author and Rating** from a single fact check. This is only a preliminary test.

In [None]:
#Get HTML 
url = 'https://www.politifact.com/factchecks/2020/oct/02/tweets/trump-did-not-ask-supporters-421-million-help-him-/'
r = requests.get(url)

soup = BeautifulSoup(r.content, 'html.parser')

## Title

In [None]:
title_ = soup.select('.m-statement__content .m-statement__quote')
title = title_[0].text.replace('\n','')
title

'Says Donald Trump’s reelection campaign emailed supporters to “donate to help him recover from” COVID-19.'

## Tags

In [None]:
tags_ = soup.select('.m-list.m-list--horizontal .m-list__item')

tags = ''
for t in tags_:
    tags += t.select_one('span').text + ','

tags

'Facebook Fact-checks,Coronavirus,Tweets,'
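
The loop above leaves a trailing comma at the end of the string (`'...Tweets,'`). A minimal sketch using `str.join` avoids that. The list `tag_names` is a stand-in for the `span` texts extracted from `tags_`.

```python
# Stand-in for [t.select_one('span').text for t in tags_]
tag_names = ['Facebook Fact-checks', 'Coronavirus', 'Tweets']

# join() places the separator only between items, so no trailing comma remains.
tags = ','.join(name.strip() for name in tag_names)
print(tags)
```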

## Author

In [None]:
author_ = soup.select(".m-statement__author .m-statement__name")
author = author_[0].text.replace('\n','')
author

'Tweets'

## Body

Note that the body is not the news item itself; it is Politifact's justification for the rating. This text therefore cannot be used as input to the model.

In [None]:
body_ = soup.select(".t-row .t-row__center .m-textblock p")

body = ''
for b in body_:
    body += b.text

body

'After President Donald Trump tested positive for COVID-19, a fake fundraising email from his re-election campaign started to circulate on social media.A screenshot of the bogus email, which a reader sent us on Twitter, leads with the news of the positive results for the president and first lady Melania Trump. Then, it pivots to a fundraising plea."President Trump would like to ask a favor. Will you please DONATE to help him recover from this disease?" says the email, which was shared by Rev. James Woodall, president of the Georgia NAACP. "It is only fair since he has sacrificed millions of dollars as your President."(Screenshot from Twitter)After we reached out for a comment, Woodall replied to the tweeted screenshot with a correction.\xa0"This is not a legitimate email coming from the Trump campaign. Want to clarify this instead of simply deleting it," he said.The Trump campaign confirmed to us that the email isn’t real. So did the Republican National Committee."That is a fake," RNC 
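
The output above shows two side effects of concatenating the paragraphs directly: sentences from adjacent paragraphs are fused (`social media.A screenshot`) and non-breaking spaces appear as `\xa0`. A possible cleanup, sketched here with shortened plain strings standing in for the `b.text` values, joins the paragraphs with a space and applies Unicode NFKC normalization, which maps the non-breaking space to a regular space.

```python
import unicodedata

# Shortened stand-ins for the text of each <p> element in body_.
paragraphs = [
    'A fake fundraising email started to circulate on social media.',
    'Woodall replied with a correction.\xa0"This is not a legitimate email."',
]

# Join with a space, then normalize compatibility characters such as \xa0.
body = unicodedata.normalize('NFKC', ' '.join(p.strip() for p in paragraphs))
print(body)
```

NFKC also rewrites other compatibility characters (ligatures, superscripts), so replacing only `'\xa0'` with `' '` may be preferable if the text must stay otherwise untouched.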

## Rating

Politifact uses a discrete set of ratings that grade the veracity of a statement from completely true to completely fabricated: **{True, Mostly True, Half True, Mostly False, False, Pants on Fire}**.

In [None]:
rating_ = soup.select(".m-statement__body .m-statement__meter [alt]")
rating = rating_[0]['alt']
rating

'pants-fire'
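
The `alt` attribute holds a slug (`'pants-fire'`) rather than the display label listed above. A sketch of mapping slugs to labels and to an ordinal score follows. Only `'pants-fire'` is confirmed by the output above; every other slug in `RATING_LABELS` is an assumption and should be checked against the scraped data. Slugs missing from the table return `None`, so unexpected values are easy to spot.

```python
# Slug -> (label, ordinal score). Only 'pants-fire' is confirmed by the
# notebook output; all other slugs are ASSUMED and must be verified.
RATING_LABELS = {
    'true': ('True', 5),
    'mostly-true': ('Mostly True', 4),
    'half-true': ('Half True', 3),
    'barely-true': ('Mostly False', 2),  # assumed slug for "Mostly False"
    'false': ('False', 1),
    'pants-fire': ('Pants on Fire', 0),
}

def normalize_rating(slug):
    """Return (label, score) for a known slug, or None if it is unknown."""
    return RATING_LABELS.get(slug.strip().lower())

print(normalize_rating('pants-fire'))
```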

## Conclusions

After a long visual inspection, it can be seen that the **Title** of the fact check is indeed a brief summary of the news item, comment or social media post. We will therefore use the **Title** as the text whose veracity is rated.

Accessing the original source, such as the original tweet, is rather cumbersome. Politifact does not always place the original source in the same location on its pages, so an automated retrieval would yield imprecise and unwanted results. That is why we are going to use the title in place of the original source.

---

# Dataset Creation

By combining the previous code snippets, we will automatically retrieve all the information.

In [None]:
#Get links from CSV file (csv.reader strips the quotes and the trailing line ending)
urls = []
with open(f'{PATH}ULR_list.csv', 'r', newline='') as f:
    urls = next(csv.reader(f))

num_urls = len(urls)
print(f'There is a total of {num_urls} URLS')

#Final DataFrame
df  = pd.DataFrame(columns=['Title','Tags','Author','Rating'])
count = 0

for u in urls:
    dic = {}

    #Construct Link
    link = 'https://www.politifact.com'+u.replace('"','')

    #Request
    r = requests.get(link)
    soup = BeautifulSoup(r.content, 'html.parser')

    #Get Title
    title_ = soup.select('.m-statement__content .m-statement__quote')
    title = title_[0].text.replace('\n','')
    dic['Title'] = title

    #Get Tags
    tags_ = soup.select('.m-list.m-list--horizontal .m-list__item')
    tags = ''
    for t in tags_:
        tags += t.select_one('span').text + ','
    dic['Tags'] = tags

    #Get Author
    author_ = soup.select(".m-statement__author .m-statement__name")
    author = author_[0].text.replace('\n','')
    dic['Author'] = author

    #Get Rating
    rating_ = soup.select(".m-statement__body .m-statement__meter [alt]")
    rating = rating_[0]['alt']
    dic['Rating'] = rating

    #Save into DataFrame (DataFrame.append was removed in pandas 2.0)
    df = pd.concat([df, pd.DataFrame([dic])], ignore_index=True)


    if len(df)%1000 == 0:
        percentage = (len(df)/num_urls)*100
        print(f'{percentage}%')
        
df.to_csv('politifactDataset.csv', index = False)
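
The loop above assumes that every request succeeds and every selector matches. A single failure (a network error, or `title_[0]` on an empty list) stops the run and loses the rows collected so far. Below is a hedged sketch of two helpers that could make the loop more forgiving. `build_link` and `fetch_with_retries` are hypothetical names, and `fetch` is any callable that takes a URL and either returns a response or raises, such as `requests.get`. A fake one is used here so the example runs offline.

```python
import time
from urllib.parse import urljoin

BASE_URL = 'https://www.politifact.com'

def build_link(path):
    """Join a relative path from the CSV with the site root."""
    return urljoin(BASE_URL, path.strip().strip('"'))

def fetch_with_retries(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) up to `retries` times (retries >= 1), pausing between attempts."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as err:  # broad on purpose: fetch is caller-supplied
            last_error = err
            if attempt < retries - 1:
                time.sleep(delay)
    raise last_error

# Offline demonstration with a fake fetch that fails once, then succeeds.
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) == 1:
        raise ConnectionError('temporary failure')
    return f'<html>{url}</html>'

# Hypothetical path, including the quotes and line ending left by a raw split.
link = build_link('"/factchecks/2020/oct/02/tweets/example/"\r\n')
page = fetch_with_retries(flaky_fetch, link, delay=0)
print(link, len(calls))
```

In the loop, `requests.get(link)` could become `fetch_with_retries(requests.get, build_link(u))`, optionally wrapped in `try`/`except` so that one bad page is skipped instead of ending the run.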