# Using the Newspaper module

The Newspaper module makes it much easier to scrape news sites. Rather than requiring you to investigate the structure of each webpage yourself, it understands the structure of many news sites and does the hard work for you.

Sites I have tested this with:
- https://cnn.com/
- https://bbc.co.uk/
- https://www.telegraph.co.uk/
- https://www.theguardian.com/

**IMPORTANT** Before you begin, you need to install the Newspaper module by running the following code block. You only need to do this once per Noteable session.


In [None]:
!pip install newspaper3k
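
If you are not sure whether the install cell has already been run in this session, a quick check like the one below may help. It is only a sketch: `is_installed` is a hypothetical helper name, and it relies on the fact that the package installed as `newspaper3k` is imported as `newspaper`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be imported in this session."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("newspaper"):
    print("newspaper is already installed")
else:
    print("newspaper is missing: run the install cell first")
```
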

## Scraping the news site
Now that we have installed the Newspaper module, we can use it to scrape the site.

By default, the following code block scrapes the Guardian site and writes the results to 'guardian.csv'.

To change this, enter a different URL in the 'news_source' variable and a different filename in the 'csv_file' variable.
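
If you switch between sources often, one option is to derive the filename from the URL so the two settings cannot drift apart. This is a minimal sketch, not part of the notebook: `csv_name_for` is a hypothetical helper, and it simply takes the first part of the host name, so the Guardian URL gives 'theguardian.csv' rather than 'guardian.csv'.

```python
from urllib.parse import urlparse


def csv_name_for(news_source):
    """Build a CSV filename from the host name of a news site URL."""
    host = urlparse(news_source).netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # drop the leading 'www.'
    # Keep only the first label of the host, e.g. 'bbc' from 'bbc.co.uk'
    return host.split(".")[0] + ".csv"


for url in ["https://www.theguardian.com/", "https://bbc.co.uk/"]:
    print(url, "->", csv_name_for(url))
```

You could then set `csv_file = csv_name_for(news_source)` instead of typing the filename by hand.
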

In [None]:
import newspaper
import csv
import os
from datetime import date

today = date.today()

# Declare news source
news_source = 'https://www.theguardian.com/'
# news_source = 'https://bbc.co.uk/'
# news_source = 'https://www.telegraph.co.uk/'
# news_source = 'https://cnn.com/'

# declare export file name - news articles are written to this csv file
csv_file = 'guardian.csv'

paper = newspaper.build(news_source,  memoize_articles=False)

# create empty list for existing news article links
links = []

# Check if csv already exists and if so store the news article links in a list
if os.path.exists(csv_file):
    with open(csv_file, 'r', newline='', encoding='utf-8') as f:
        csvreader = csv.reader(f, delimiter=",")
        for row in csvreader:
            if len(row) > 2:  # skip blank or malformed rows
                links.append(row[2])
                print(row[2]) # show existing links


# Open the file ready for writing (newline='' stops the csv module adding blank rows)
file = open(csv_file, "a", newline='', encoding='utf-8')
writer = csv.writer(file, quoting=csv.QUOTE_ALL)

cnt = 0 # Set a counter


for article in paper.articles:
    if article.url not in links:
        # Retrieving the page (skip articles that fail to download or parse)
        try:
            article.download()
            article.parse()
        except newspaper.ArticleException:
            continue

        # Getting the article link
        link = article.url

        # Getting the title
        title = article.title

        # Getting the authors
        authors = article.authors
        authors = ', '.join(authors) # convert authors to a comma separated list

        # Get all of the page content
        txt = article.text

        # Removing line-breaks
        txt = txt.replace('\n', ' ').replace('\r', '')

        # Get publication date
        pubdate = article.publish_date

        if txt:  # Check there is an article on the page (text is not empty)

            # Perform Natural Language Processing on text to extract keywords
            article.nlp()
            keywords = ', '.join(article.keywords) # convert keywords to a comma separated list

            print('Retrieving article -- ' + title)
            cnt += 1
            writer.writerow([pubdate, title, link, authors, txt, keywords])

file.close()  # Close the file so all rows are written to disk

print('Added ' + str(cnt) + ' news articles to ' + csv_file)
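
The CSV file has no header row, so it can be handy to read it back with column names attached. The sketch below assumes the column order used in the `writer.writerow(...)` call (publication date, title, link, authors, text, keywords) and that the file is UTF-8 encoded. `FIELDNAMES` and `load_articles` are hypothetical names chosen for illustration.

```python
import csv

# Column order matches the writer.writerow(...) call in the scraping code
FIELDNAMES = ["pubdate", "title", "link", "authors", "text", "keywords"]


def load_articles(path):
    """Read the header-less CSV back into a list of dicts, one per article."""
    with open(path, "r", newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, fieldnames=FIELDNAMES))
```

For example, `articles = load_articles('guardian.csv')` followed by `print(articles[0]['title'])` would show the title of the first saved article, provided the file exists and contains at least one row.
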
