##### Use BeautifulSoup to parse The New York Times and Fox News articles   

# Working with Web Pages and HTML

There are several ways to extract or import data from the Internet. As in the tasks you have just completed, you can use APIs to retrieve information from major websites such as Twitter, Twitch, Instagram, and Facebook, which provide APIs for accessing their data. All of this data is available in a structured form.

But relying on APIs has some drawbacks. First, most websites do not provide an API. Second, the results usually come in a somewhat raw form with no formatting or visual representation (like the results of a database query), which is far from ideal for end users because interpreting the raw information takes some cognitive overhead.

HTML, by contrast, is quite easy for a human to interpret visually, but before we can perform any programmatic analysis we first need to parse it into a more structured form.

As a general rule of thumb, if the data you need can be accessed or retrieved in a structured form (either from a bulk download or an API), prefer that first. But if the data you want (and need) is not available that way, as in our case, we need to resort to alternative (messier) means.
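
As a minimal sketch of what "parsing HTML into a more structured form" can look like, the snippet below feeds a small hand-written HTML string to `BeautifulSoup` and collects the title and paragraphs into a dictionary. The HTML, the `body` class name, and the helper name `html_to_dict` are assumptions made for illustration; they are not taken from any real site.

```python
from bs4 import BeautifulSoup

# Hand-written HTML used only for illustration (not from a real page).
SAMPLE_HTML = """
<html>
  <head><title>Sample headline</title></head>
  <body>
    <p class="body">First paragraph.</p>
    <p class="body">Second paragraph.</p>
  </body>
</html>
"""

def html_to_dict(html):
    """Turn an HTML string into a small dictionary of its parts (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "Title": soup.find("title").get_text(),
        # one list element per <p class="body"> paragraph
        "Content": [p.get_text() for p in soup.find_all("p", class_="body")],
    }

parsed = html_to_dict(SAMPLE_HTML)
print(parsed)
```

Real article pages are far less tidy, so the tag names and classes to search for have to be found by inspecting each site's HTML.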

# Parse a Coronavirus Article from The New York Times

Using `BeautifulSoup`, parse the HTML of a coronavirus article from The New York Times and extract its contents in a structured form. Fill in the following function stubs to parse a single article page and return:

1. the article features as a structured Python dictionary
2. the total number of words in the article (do not include the word counts of the summary, only the content!)

Be sure to structure your Python dictionary as follows (so that it is graded correctly). The order of the keys doesn't matter; only the keys and the data types of the values matter:

```python
{
    'Title': 'F.D.A. Approves First Coronavirus Antibody Test in U.S.',  # str
    'Author': ['Apoorva Mandavilli'],  # list, a list of author names
    'Date': '2020-04-03',  # str, yyyy-mm-dd
    'Summary': '.....',  # str, a paragraph summarizing the article
    'Content': ['.....', '.....'],  # list, the whole article content, every element is a paragraph
}
```

Note: Remember to remove blank lines and redundant parts from every element. Some articles do not include a summary; handle this case by setting `Summary` to `"No summary"` when the article has none.
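
The cleanup described in the note can be sketched with plain Python, independent of any particular site. The helper names below (`clean_paragraphs`, `summary_or_default`, `count_words`) are hypothetical, and counting whitespace-separated tokens is only one possible definition of a "word"; a grader may define it differently.

```python
def clean_paragraphs(raw_paragraphs):
    """Strip surrounding whitespace and drop paragraphs that end up empty."""
    stripped = (p.strip() for p in raw_paragraphs)
    return [p for p in stripped if p]

def summary_or_default(summary):
    """Return the stripped summary, or "No summary" when it is missing or blank."""
    if summary is None or not summary.strip():
        return "No summary"
    return summary.strip()

def count_words(paragraphs):
    """Count whitespace-separated tokens across all paragraphs."""
    return sum(len(p.split()) for p in paragraphs)

# Made-up input with blank and padded entries, for illustration only.
raw = ["  First paragraph here. \n", "", "\n\n", "Second one."]
content = clean_paragraphs(raw)
print(content)
print(summary_or_default(None))
print(count_words(content))
```

Note that `len(paragraph)` on a string returns the number of characters, so it is not interchangeable with `len(paragraph.split())`.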

In [1]:
#import packages
import requests
import re
import numpy as np
from bs4 import BeautifulSoup
import datetime
from testing.testing import test

In [1]:
#retrieve a url using BeautifulSoup
def retrieve_url(url):
    page =requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    return soup

In [3]:
import re

def article_check_nyt(article):
    type_check = lambda field, typ: field in article and typ(article[field])
    test.true(type_check("Title", lambda r: isinstance(r, str)))
    test.true(type_check("Author", lambda r: isinstance(r, list)))

    datecheck = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # raw string avoids invalid-escape warnings
    test.true(type_check("Date", lambda r: datecheck.match(r)))
    test.true(type_check("Summary", lambda r: isinstance(r, str)))
    test.true(type_check("Content", lambda r: isinstance(r, list)))

def parse_page_nyt_test(parse_page_nyt):
    article1, num_words_1 = parse_page_nyt("https://www.nytimes.com/2020/04/21/health/fda-in-home-test-coronavirus.html?searchResultPosition=1")
    article_check_nyt(article1)
    test.equal(article1['Title'], "F.D.A. Authorizes First In-Home Test for Coronavirus")
    test.equal(len(article1['Summary']), 119)
    test.equal(num_words_1, 3343)
    article2, num_words_2 = parse_page_nyt("https://www.nytimes.com/2020/04/18/health/kidney-dialysis-coronavirus.html?searchResultPosition=9")
    article_check_nyt(article2)
    test.equal(len(article2['Author']), 4)
    test.equal(article2['Date'], '2020-04-18')
    test.equal(num_words_2, 11249)

@test
def parse_page_nyt(url):

    """
    Parse the article on a single page of The New York Times.
    
    Args:
        url (string): URL of a Coronavirus related article from The New York Times

    Returns:
        Tuple(Dict, int): a tuple of two elements
            first element: The dictionary of this single article
            second element: number of words in the content
    """
    
    soup=retrieve_url(url)
    
    dic={}
    
    dic['Title']=soup.find("title").get_text().replace(" - The New York Times", "")
    
    dic['Author']=[]
    for i in soup.find_all('span', itemprop="name"):
        dic['Author'].append(i.get_text())
        
    dic['Date']=soup.find("time")["datetime"][:10]
    
    if soup.find('p', id="article-summary") is not None:
        dic['Summary']=soup.find('p', id="article-summary").get_text()
    else:
        dic['Summary']="No summary"
        
    dic['Content']=[]
    for i in soup.find_all('p', class_='css-exrw3m evys1bk0')[:-1]:
        dic['Content'].append(i.get_text())
        
    
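    # NOTE: len(paragraph) counts characters, not whitespace-separated words.
    # The expected values in parse_page_nyt_test match this count.
    word_count=0
    for paragraph in dic['Content']:
        word_count+=len(paragraph)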
    
    
    
    return dic, word_count

### TESTING parse_page_nyt: PASSED 16/16
###



# Parse Several Coronavirus Articles from The New York Times

Now you know how to parse a single article page from The New York Times using `BeautifulSoup`. However, we often want to parse many articles quickly, and parsing one article at a time is not time-efficient. Let's parse several articles at once!



To make sure everyone gets the same data, let's set some conditions when retrieving articles from The New York Times.
1. Search for `coronavirus` in the search box on The New York Times home page
2. Set the date range from 2020/3/27 to 2020/4/27 (basically, a date range in which around 200 articles were posted)
3. Set the section to `health`
4. Set the type to `Article`
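
On The New York Times, these conditions end up in the query string of the search URL. As a small illustration, the sketch below decodes the search URL used in the test for `parse_several_pages_nyt` with the standard library; the helper name `search_conditions` is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

# Search URL copied from the parse_several_pages_nyt test in this notebook.
SEARCH_URL = (
    "https://www.nytimes.com/search?dropmab=false&endDate=20200417"
    "&query=coronavirus"
    "&sections=Health%7Cnyt%3A%2F%2Fsection%2F9f943015-a899-5505-8730-6d30ed861520"
    "&sort=best&startDate=20200327&types=article"
)

def search_conditions(url):
    """Return the query-string parameters of a search URL as a flat dict."""
    query = urlsplit(url).query
    # parse_qs maps each key to a list of values; every key here occurs once.
    return {key: values[0] for key, values in parse_qs(query).items()}

conditions = search_conditions(SEARCH_URL)
for key in sorted(conditions):
    print(key, "=", conditions[key])
```

Decoding the URL this way is a quick check that the date range, section, and type you selected in the browser are the ones actually being requested.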

For each article, use `parse_page_nyt` to get its dictionary and article word count. Return two things:
1. a list of tuples (each containing a dictionary and the article word count), in the order the articles appear on the page
2. the number of articles. *Note: do not take the number of articles from the search page; use `len(list)` to get it!

In this function, we have to use `webdriver` from the `selenium` package to handle results that span multiple pages. Please refer to the `selenium` documentation at https://selenium-python.readthedocs.io/ for more information on this package.

Remember to call `time.sleep()` when parsing several pages with `selenium`, since the website might interrupt your visits if you access it too many times in a short period.
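
The idea of pausing between visits can be sketched without a browser. In the snippet below, `visit_with_pause` is a hypothetical helper, and `visit` stands in for whatever function actually loads a page (here `str.upper`, so the example runs offline); the pause length is an arbitrary choice for the demo.

```python
import time

def visit_with_pause(urls, visit, pause_seconds=3.0):
    """Call visit(url) for each URL, sleeping between consecutive calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(pause_seconds)  # pause before every visit except the first
        results.append(visit(url))
    return results

# Stand-in visit function so the sketch runs without network access.
visited = visit_with_pause(["page-1", "page-2", "page-3"], visit=str.upper, pause_seconds=0.01)
print(visited)
```

In real use a pause of a few seconds is more appropriate than the tiny value used here to keep the demo fast.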

In [4]:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time
options = webdriver.ChromeOptions()
options.add_argument('--headless')

In [5]:
#retrieve the url by webdriver
def retrieve_url_by_driver(url):
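    # Passing the driver path positionally is the Selenium 3 calling style;
    # Selenium 4 expects a Service object instead.
    driver = webdriver.Chrome("./chromedriver", options=options)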
    driver.get(url)
    return driver

In [15]:
def parse_several_pages_nyt_test(parse_several_pages_nyt):
    articles, num_articles = parse_several_pages_nyt("https://www.nytimes.com/search?dropmab=false&endDate=20200417&query=coronavirus&sections=Health%7Cnyt%3A%2F%2Fsection%2F9f943015-a899-5505-8730-6d30ed861520&sort=best&startDate=20200327&types=article")
    
    article_10, num_wc_10=articles[10]
    article_check_nyt(article_10)
    test.equal(num_wc_10, 13178)
    
    article_38, num_wc_38=articles[38]
    article_check_nyt(article_38)
    test.equal(num_wc_38, 5315)
    
    test.equal(num_articles, 70)

@test
def parse_several_pages_nyt(base_url):
    """
    Retrieve ALL of the articles (including their content) listed on a search results page of The New York Times.

    Args:
        base_url (string): The New York Times URL of the search results page.

    Returns:
        Tuple(List(tuple), int): a tuple of two elements
            first element: a list of tuples (each a dictionary and the article word count) for the articles on the search results page
            second element: the number of articles
    """
    
    driver=retrieve_url_by_driver(base_url)

    while True:
        try:
            driver.find_element_by_css_selector("button[data-testid='search-show-more-button']").click()
            time.sleep(3)
        except NoSuchElementException:
            break
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    
    
    ans=[]
    for i in soup.find_all("a", href=True):
        # keep search-result links, skipping any whose href contains "=0"
        if "searchResultPosition" in i['href'] and "=0" not in i['href']:
            ans.append(parse_page_nyt("https://www.nytimes.com"+i['href']))

    return ans, len(ans)

### TESTING parse_several_pages_nyt: PASSED 13/13
###



# Parse a Coronavirus Article from Fox News

After parsing several articles from The New York Times, let's parse articles from another source. This time, let's parse an article from Fox News.

Similar to `parse_page_nyt`, `parse_page_fn` also returns two elements:
1. the article features as a structured Python dictionary
2. the total number of words in the article


Be sure to structure your Python dictionary as follows (so that it is graded correctly). The order of the keys doesn't matter; only the keys and the data types of the values matter:

```python
{
    'Title': 'F.D.A. Approves First Coronavirus Antibody Test in U.S.',  # str
    'Author': ['Apoorva Mandavilli'],  # list, a list of author names
    'Date': '2020-04-03',  # str, yyyy-mm-dd
    'Content': ['.....', '.....'],  # list, the whole article content, every element is a paragraph
}
```

Note 1: Fox News articles do not include a summary, but you have to capture the descriptions under pictures as paragraphs of the content.
Note 2: If no date is specified, set `Date` to `"No Date Specified"`.
Note 3: If both the author and the article source are provided, `Author` should include only the author; if only the article source is provided, `Author` should be the article source.
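
Notes 2 and 3 can be sketched as two small helpers. The names `normalize_date` and `pick_author` are hypothetical, and the sketch assumes the page shows the date as an English month and day without a year (for example "April 6"), so the year has to be supplied separately; the `"No author"` fallback is likewise just one possible choice.

```python
import datetime

def normalize_date(text, year=2020):
    """Convert text such as "April 6" to "yyyy-mm-dd", or return the fallback string."""
    if text is None or not text.strip():
        return "No Date Specified"
    try:
        parsed = datetime.datetime.strptime(f"{text.strip()}, {year}", "%B %d, %Y")
    except ValueError:
        # text that is not "Month day" is treated as an unspecified date
        return "No Date Specified"
    return parsed.date().isoformat()

def pick_author(authors, source):
    """Prefer named authors; fall back to the article source, then to a placeholder."""
    if authors:
        return list(authors)
    if source:
        return [source]
    return ["No author"]

print(normalize_date("April 6"))
print(normalize_date(None))
print(pick_author(["David Aaro"], "Fox News"))
print(pick_author([], "Associated Press"))
```

Keeping these rules in separate functions makes it easier to adjust them when an article page turns out to be structured differently.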

In [22]:
def article_check_fn(article):
    type_check = lambda field, typ: field in article and typ(article[field])
    test.true(type_check("Title", lambda r: isinstance(r, str)))
    test.true(type_check("Author", lambda r: isinstance(r, list)))

    datecheck = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # raw string avoids invalid-escape warnings
    test.true(type_check("Date", lambda r: datecheck.match(r)))
    test.true(type_check("Content", lambda r: isinstance(r, list)))

def parse_page_fn_test(parse_page_fn):
    article1, num_words_1 = parse_page_fn("https://www.foxnews.com/health/dying-alone-coronavirus-volunteers-ipads-virtual-connect")
    article_check_fn(article1)
    test.equal(article1['Title'], "Dying alone from coronavirus: Group collects used iPads to virtually connect patients with family")
    test.equal(num_words_1, 4574)
    article2, num_words_2 = parse_page_fn("https://www.foxnews.com/health/is-it-safe-go-into-supermarkets-amid-coronavirus-outbreak")
    article_check_fn(article2)
    test.equal(article2['Author'], ['David Aaro'])
    test.equal(article2['Date'], '2020-04-06')
    test.equal(num_words_2, 6467)

@test
def parse_page_fn(url):

    """
    Parse the article on a single page of Fox News.
    
    Args:
        url (string): URL of a Coronavirus related article from Fox News

    Returns:
        Tuple(Dict, int): a tuple of two elements
            first element: The dictionary of this single article
            second element: number of words in the content
    """
    
    dic={}
    
    page =requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    if soup.find("h1", class_='headline') is not None:
        dic['Title']=soup.find("h1", class_='headline').get_text()
    else:
        dic['Title']=soup.find("h1", class_='title').get_text()

    dic['Author']=[]
    if soup.find("div", class_='author-byline') is not None:
        for i in soup.find("div", class_='author-byline').find("span"):
            if "By" not in i:

                if i.find("a")!= None and type(i.find("a"))!=int:
                    dic['Author'].append(i.find("a", href=True ).get_text())
                    if "|" in dic['Author'][-1]:
                        del dic['Author'][-1]
                        dic['Author'].append(i.get_text().split("|")[0].strip())
                else:
                    dic['Author'].append(i.get_text())
    else:
        dic['Author'].append("No author")
     
    if soup.find("time") is not None:
        # the format below expects text like "April 6"; the year 2020 is hard-coded
        date_str=soup.find("time").get_text().strip()+", 2020"
        datetime_obj = datetime.datetime.strptime(date_str, '%B %d, %Y')
        dic['Date']=str(datetime_obj.date())
    else:
        dic['Date']="No Date Specified"
    
        
    dic['Content']=[]
    
    if len(soup.find_all('p'))>4:
        for i in soup.find_all('p')[3:-4]:
            if not i.find("strong"):
                dic['Content'].append(i.get_text().strip())
    else:
        for i in soup.find_all('p', itemprop="description"):
             dic['Content'].append(i.get_text().strip())
        
        
    
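    # NOTE: len(paragraph) counts characters, not whitespace-separated words.
    # The expected values in parse_page_fn_test match this count.
    word_count=0
    for paragraph in dic['Content']:
        word_count+=len(paragraph)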
    
    
    
    return dic, word_count

### TESTING parse_page_fn: PASSED 13/13
###



# Parse Several Coronavirus Articles from Fox News

After parsing an article, let's parse several articles from Fox News. 

As when we parsed several coronavirus articles from The New York Times, we can set the date range, article type, and section to find the articles that meet our needs.

However, the biggest difference from The New York Times is that the URL does not change when you set the conditions manually. Therefore, you have to write an `auto_click` function that clicks the dropdown checkboxes for you.

In [53]:
def auto_click(driver, s, c, min_month_id, min_day_id, max_month_id, max_day_id, year_id):
    '''
    A helper function that takes a webdriver object, section (s), content (c),
    min_month_id, min_day_id, max_month_id, max_day_id, and year_id
    to apply the user's conditions when searching for articles on Fox News.
    Returns the webdriver object for further use.
    '''
    time.sleep(5)
    
    # --------------Section--------------
    section = driver.find_element_by_css_selector("div.filter.section")
    section.click()
    section.find_element_by_css_selector("ul.option>li>label>input[value=\"%s\"]"%s).click()
    section.click()
    # --------------Content--------------
    content = driver.find_element_by_css_selector("div.filter.content")
    content.click()
    content.find_element_by_css_selector("ul.option>li>label>input[value=\"%s\"]"%c).click()
    content.click()
    # --------------DateRange--------------
    # -------Start-------
    min_month = driver.find_element_by_css_selector("div.date.min div.sub.month")
    min_month.click()
    min_month.find_element_by_css_selector("ul.option>li[id=\"%s\"]"%min_month_id).click()

    min_day = driver.find_element_by_css_selector("div.date.min div.sub.day")
    min_day.click()
    min_day.find_element_by_css_selector("ul.option>li[id=\"%s\"]"%min_day_id).click()

    min_year = driver.find_element_by_css_selector("div.date.min div.sub.year")
    min_year.click()
    min_year.find_element_by_css_selector("ul.option>li[id=\"%s\"]"%year_id).click()
    # --------End--------
    max_month = driver.find_element_by_css_selector("div.date.max div.sub.month")
    max_month.click()
    max_month.find_element_by_css_selector("ul.option>li[id=\"%s\"]"%max_month_id).click()

    max_day = driver.find_element_by_css_selector("div.date.max div.sub.day")
    max_day.click()
    max_day.find_element_by_css_selector("ul.option>li[id=\"%s\"]"%max_day_id).click()

    max_year = driver.find_element_by_css_selector("div.date.max div.sub.year")
    max_year.click()
    max_year.find_element_by_css_selector("ul.option>li[id=\"%s\"]"%year_id).click()

    search = driver.find_element_by_css_selector("div.search-form a")
    search.click()

    time.sleep(5)
    
    return driver

Call `auto_click` in `parse_several_pages_fn` to help you parse several pages on Fox News. Return two things:
1. a list of tuples (each containing a dictionary and the article word count), in the order the articles appear on the page
2. the number of articles
*Note: This time you might have to read the total number of articles shown at the top of the page. However, when returning this value, please return `len(list)` to check whether you retrieved all the articles.

Note: You might need to change your `parse_page_fn` function, since the article structure on Fox News is not as uniform as on The New York Times. Try to handle all the exceptions!

Set `Section=Health`, `Content=Article`, and the date range from 2020/3/27 to 2020/4/4 for testing.
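
Knowing the total number of articles lets you estimate how many times a "load more" button has to be clicked. The sketch below assumes, purely for illustration, that each batch shows 10 results and that the first batch is already visible; `load_more_clicks` is a hypothetical helper and the real batch size should be checked on the page.

```python
import math

def load_more_clicks(total_results, per_page=10):
    """Estimate the number of "load more" clicks needed to reveal all results."""
    if total_results <= per_page:
        return 0
    # The first batch is already visible, so it needs no click.
    return math.ceil(total_results / per_page) - 1

print(load_more_clicks(93))
```

Because the estimate depends on an assumed batch size, it is still worth stopping early when the button is no longer found, as a safeguard.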

In [48]:
def parse_several_pages_fn_test(parse_several_pages_fn):
    articles, num_articles = parse_several_pages_fn("https://www.foxnews.com/search-results/search?q=coronavirus")
    
    article_10, num_wc_10=articles[10]
    article_check_fn(article_10)
    test.equal(num_wc_10, 3014)
    
    article_38, num_wc_38=articles[38]
    article_check_fn(article_38)
    test.equal(num_wc_38, 2553)
    
    test.equal(num_articles, 93)

@test    
def parse_several_pages_fn(url, s='Health', c='Article', min_month_id='03', \
                           min_day_id='27', max_month_id='04', max_day_id='04', year_id='2020'):
    
    driver = retrieve_url_by_driver(url)
    driver=auto_click(driver, s, c, min_month_id, min_day_id, max_month_id, max_day_id, year_id)   
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    p=int(soup.find("div", class_="num-found").find_all("span")[2].get_text())

    n=0
    while n<p/10:
        try:
            time.sleep(3)
            driver.find_element_by_css_selector("div.button.load-more> a >span").click()
        except NoSuchElementException:
            break
        n=n+1

    time.sleep(10)        
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    
    ans=[]
    for i in soup.select('h2.title a[href]'):
        ans.append(parse_page_fn(i['href']))

    return ans, len(ans)
    

### TESTING parse_several_pages_fn: PASSED 11/11
###

