# DSCI 511: Data acquisition and pre-processing<br>Chapter 9: Distribution, accessibility, and data sharing
## Exercises
Note: numberings refer to the main notes.

## Additional In-depth Exercises
### A. Releasing a fact-checked tweets dataset
Previously in Chapter 5, we explored a web scraping task focused on the individuals who have been covered by the fact-checking website, PolitiFact:

- https://www.politifact.com/

As discussed then, although the site's data can be read from the front end, reproducing or re-releasing its content is not allowed. Any dataset built on it therefore needs a different distribution method, which we'll explore here.

For this exercise, our high-level goal will be to distribute a PolitiFact dataset that could support downstream algorithms that automatically fact-check tweets.

#### 1. Understanding the release data
Review the following object in the local data directory:

- `./data/fact-checked-tweets.json`

and discuss in the response box below what these data describe with respect to the two platforms (PolitiFact and Twitter) and why they are reasonable to release.
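
As a starting point for the review, here is a minimal sketch for loading and summarizing the file. It assumes the layout written by code block 3 at the end of this exercise: a dictionary mapping each PolitiFact fact-check endpoint to a list of tweet urls. The helper names `load_release` and `summarize_release`, and the sample entries, are hypothetical.

```python
import json

def load_release(path="./data/fact-checked-tweets.json"):
    """Read the release file into a dict of endpoint -> list of tweet urls."""
    with open(path) as f:
        return json.load(f)

def summarize_release(release):
    """Count the fact-check endpoints and the tweet urls they point to."""
    return {
        "fact_checks": len(release),
        "tweet_urls": sum(len(urls) for urls in release.values()),
    }

# hypothetical sample in the assumed layout (not real PolitiFact data)
sample_release = {
    "/factchecks/example-a/": ["https://twitter.com/user_a/status/111"],
    "/factchecks/example-b/": [
        "https://twitter.com/user_b/status/222",
        "https://twitter.com/user_c/status/333",
    ],
}
summary = summarize_release(sample_release)
print(summary)
```

With the real file in place, `summarize_release(load_release())` would give the same kind of overview.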

_Response._

In [None]:
## code here

#### 2. Planning hydrated data objects
Assuming we'll release the `fact-checked-tweets.json` data, we now need to plan some data downloaders for the PolitiFact and Twitter data that our code release will access and integrate.

In particular, discuss what tools we'll have to use in conjunction with `fact-checked-tweets.json` to access the following data:
1. fully hydrated tweet objects
2. PolitiFact ratings and fact-check sources

_Response._

#### 3. Build a tweet hydrator
This is more or less a standard `Twython` task, aside from the fact that the tweet ids themselves will need to be isolated from the urls.
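
The id-isolation step might look something like the sketch below. It reuses the url pattern from code block 3 (with a capture group added) and only handles the string processing, not the `Twython` calls. The names `tweet_id_from_url` and `batched` are hypothetical, and the batch size of 100 is an assumption based on the per-request limit of Twitter's v1.1 `statuses/lookup` endpoint.

```python
import re

# same shape as the pattern in code block 3, with the numeric id captured
TWEET_URL_PATTERN = re.compile(r"^https://twitter\.com[^ ]+/status/(\d+)$")

def tweet_id_from_url(url):
    """Return the tweet id as a string, or None if the url is not a status url."""
    match = TWEET_URL_PATTERN.match(url)
    return match.group(1) if match else None

def batched(ids, size=100):
    """Yield successive slices of ids, each holding at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

# hypothetical urls for illustration
example_urls = [
    "https://twitter.com/example_user/status/1234567890",
    "https://twitter.com/example_user",
]
example_ids = [tweet_id_from_url(url) for url in example_urls]
print(example_ids)
```

Ids are kept as strings because they are only passed back to the API, never used in arithmetic.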

In [None]:
## code here

#### 4. Build a ratings scraper
The other side of this integration job must now take the specified endpoints and append them onto the following base url:

- `https://www.politifact.com`

to resolve the appropriate fact-check data. Utilizing `BeautifulSoup`, build a function that constructs a full url, accesses the webpage's html, and applies `BeautifulSoup` to access a given fact-check's rating (from `'True'` to `'Pants on Fire'`) and sources (a list of urls provided at the bottom of the page).
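
The url-construction half of this task can be sketched with the standard library alone. The helper names `build_factcheck_url` and `clean_source_urls` are hypothetical, as is the example endpoint. The second helper assumes the source links have already been pulled from the page (for instance with `BeautifulSoup`) as a list of `href` strings. Locating the rating and source elements in the html is left to the exercise.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.politifact.com"

def build_factcheck_url(endpoint, base_url=BASE_URL):
    """Join the base url and a fact-check endpoint into one absolute url."""
    return urljoin(base_url, endpoint)

def clean_source_urls(hrefs):
    """Keep absolute http(s) links only, dropping duplicates but keeping order."""
    seen = set()
    cleaned = []
    for href in hrefs:
        href = href.strip()
        if href.startswith(("http://", "https://")) and href not in seen:
            seen.add(href)
            cleaned.append(href)
    return cleaned

# hypothetical endpoint and hrefs for illustration
full_url = build_factcheck_url("/factchecks/example-statement/")
sources = clean_source_urls(
    ["https://a.org/x", "#top", "https://a.org/x", "http://b.org"]
)
print(full_url, sources)
```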

In [None]:
## code here

#### 5. Build a full data access wrapper
Now that you have the two data access functions in place, package them up in a script that, when run, will integrate the full data set.
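
One possible shape for the wrapper is sketched below. It takes the two data access functions as arguments, so it makes no assumption about how they are implemented. The name `integrate`, the record fields, and the stub functions are all hypothetical, and the stubs stand in for the real hydrator and scraper so the sketch can run offline.

```python
def integrate(release, hydrate_tweets, scrape_fact_check):
    """Combine each fact-check's rating and sources with its hydrated tweets.

    release: dict of endpoint -> list of tweet urls
    hydrate_tweets: callable taking a list of tweet urls, returning tweet objects
    scrape_fact_check: callable taking an endpoint, returning a dict with
        'rating' and 'sources' keys (an assumed return shape)
    """
    records = []
    for endpoint, tweet_urls in release.items():
        fact_check = scrape_fact_check(endpoint)
        records.append({
            "endpoint": endpoint,
            "rating": fact_check.get("rating"),
            "sources": fact_check.get("sources", []),
            "tweets": hydrate_tweets(tweet_urls),
        })
    return records

# stand-in functions for illustration only; they return made-up values
def stub_hydrator(tweet_urls):
    return [{"url": url} for url in tweet_urls]

def stub_scraper(endpoint):
    return {"rating": "placeholder", "sources": []}

records = integrate(
    {"/factchecks/example-a/": ["https://twitter.com/user_a/status/111"]},
    stub_hydrator,
    stub_scraper,
)
print(records)
```

In a script, the same call would receive the loaded `fact-checked-tweets.json` mapping and the real functions from parts 3 and 4.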

In [None]:
## code here

#### 6. Reviewing our withheld code and planning a strategic release
While we'll release the `'fact-checked-tweets.json'` data and our script to integrate the full dataset, some of the code we had to write to construct that file won't be released publicly.

To complete this exercise, review the code below to understand how the `'fact-checked-tweets.json'` object was constructed and how it could be expanded to a full, PolitiFact-wide data set with many more tweets. Place your discussion in the response box below, and specifically address how the three main code blocks provided can be used together to produce the full data set that we'd like to release.



_Response._

In [None]:
## code block 1: accessing a list of all personalities covered by PolitiFact
## note: this was not used for the fact-checked tweets, but will be essential
## to gather _all_ fact-checked tweets from the site.

import requests, re
from bs4 import BeautifulSoup

personalities_url = "https://www.politifact.com/personalities/"
personalities_html = requests.get(personalities_url).text
personalities_soup = BeautifulSoup(personalities_html, 'html.parser')

data = {}
for i, section in enumerate(personalities_soup.find_all("section", {"class": "o-platform o-platform--has-thin-border o-platform--is-wide"})):  
  title = section.find("h2", {"class": "c-title c-title--section"}).text.strip()
  section_id = section['id']
  data[section_id] = {"id": section_id, "section": title, "personalities": {}}
  for personality in section.find_all('div', {"class": "c-chyron"}):
    link = personality.find('a')
    url = link['href']
    group = personality.find('div', {"class": "c-chyron__subline"}).text.strip()
    name = link.text.strip()
    personality_id = re.split("/", url)[-2]
    data[section_id]['personalities'][personality_id] = {'url': url, "name": name, "id": personality_id, "group": group}

In [None]:
## code block 2: accessing a list of all fact checked statements made by a personality
## note: the personality 'tweets' is just a catch-all category that the site uses for 
## fact checks of tweets made by lesser-known personalities (so it appears)

list_base_url = "https://www.politifact.com/factchecks/list/"
for personality in ["tweets"]:
  next_page = "?speaker=" + personality
  statements_urls = []
  while next_page:
    speakerchecks_url = list_base_url + next_page
    speakerchecks_html = requests.get(speakerchecks_url).text
    speakerchecks_soup = BeautifulSoup(speakerchecks_html, 'html.parser')
    for statement in speakerchecks_soup.find_all('li', {'class': 'o-listicle__item'}):
      description = statement.find('div', {'class': "m-statement__desc"})
      if description:
        if re.search("(tweet|twitter|post)", description.text.lower()):
          statement_url = statement.find('div', {"class": "m-statement__quote"}).find('a')['href']
          statements_urls.append(statement_url)
    
    next_page = ''
    for link in speakerchecks_soup.find_all('a', {"class": "c-button c-button--hollow"}):
      if link.get('href',''):
        if link.text == "Next":
          next_page = link['href']

In [None]:
## code block 3: retrieve any twitter-specific urls of tweets from each statement's page
import json
from collections import defaultdict

statement_tweets = defaultdict(list)
for statement_url in statements_urls:
  statement_html = requests.get("https://www.politifact.com" + statement_url).text
  statement_soup = BeautifulSoup(statement_html, 'html.parser')

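  ## skip pages that lack the expected article element instead of raising an error
  article = statement_soup.find('article', {"class": "m-textblock"})
  for link in (article.find_all('a') if article else []):
    ## raw string so that \d and \. reach the regex engine unaltered
    if re.search(r"^https://twitter\.com[^ ]+/status/\d+$", link.get('href', '')):
      statement_tweets[statement_url].append(link['href'])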

with open('./data/fact-checked-tweets.json', 'w') as f:
  f.write(json.dumps(statement_tweets))
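
As one way to connect the blocks, the nested `data` dictionary from code block 1 could be flattened into a list of personality ids to replace the hard-coded `["tweets"]` list in code block 2. The sketch below assumes the structure built in code block 1. The helper name `all_personality_ids` and the sample entries are hypothetical. Note that code block 2 resets `statements_urls` inside its loop, so that list would need to be initialized before the loop to accumulate urls across personalities.

```python
def all_personality_ids(data):
    """Flatten the section -> personalities mapping into one list of ids."""
    return [
        personality_id
        for section in data.values()
        for personality_id in section["personalities"]
    ]

# hypothetical sample in the shape produced by code block 1
sample_data = {
    "a": {"id": "a", "section": "A", "personalities": {
        "example-person": {"url": "/personalities/example-person/",
                           "name": "Example Person",
                           "id": "example-person", "group": "Example Group"},
    }},
    "t": {"id": "t", "section": "T", "personalities": {
        "tweets": {"url": "/personalities/tweets/", "name": "Tweets",
                   "id": "tweets", "group": ""},
    }},
}
personality_ids = all_personality_ids(sample_data)

# each id becomes the starting query string used in code block 2
start_pages = ["?speaker=" + pid for pid in personality_ids]
print(start_pages)
```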