# Web Scraping News Sites
## Guardian News scraper
https://www.theguardian.com/uk

The following code block imports the necessary modules and then requests the URL. Once the response is received, it displays the first 50 bytes of the page.

Note the 'headers' variable. It replicates the headers that a web browser would send, which reduces the chance of the site treating our request as a bot and blocking it.

In [None]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import time
import re
import csv
import os.path


headers = {
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
}

# url to fetch
url = "https://www.theguardian.com/uk"

# Request
req = requests.get(url, headers=headers)


# Save the front page content
frontpage = req.content

print(frontpage[:50])
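
To check which headers a request would carry without contacting the site, one option is to build the request with `requests.Request` and inspect its prepared form. This is a minimal sketch; the shortened `demo_headers` dictionary is an illustrative assumption, not the full set used above.

```python
import requests

# Illustrative, shortened header set (an assumption for this sketch)
demo_headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)',
    'Accept-Language': 'en-US,en;q=0.8',
}

# Build the request but do not send it
prepared = requests.Request(
    'GET', 'https://www.theguardian.com/uk', headers=demo_headers
).prepare()

# Header lookups on a prepared request are case-insensitive
user_agent = prepared.headers['user-agent']
print(prepared.method, prepared.url)
print(user_agent)
```

Nothing is sent over the network here, so the sketch is safe to run repeatedly while adjusting the headers.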


##  BeautifulSoup
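We can use BeautifulSoup to find all the article titles and display a total count. The tag and class name searched for depend on the site's markup, so they may need updating if the page layout changes.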

In [None]:
# Feed the page into Beautiful Soup
soup = BeautifulSoup(frontpage, 'html.parser')

# Find news articles
frontpage_news = soup.find_all('h3', class_='fc-item__title')

# Total number of articles
total = len(frontpage_news)

print(total)
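
The same lookup can be tried on a small hand-written HTML fragment, which makes it easier to see what `find_all` returns. The fragment and its example.com links are made up for illustration; only the tag name and class mirror the code above.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the structure searched for above
sample_html = """
<h3 class="fc-item__title"><a href="https://example.com/a">First story</a></h3>
<h3 class="fc-item__title"><a href="https://example.com/b">Second story</a></h3>
<h3 class="other-title"><a href="https://example.com/c">Not matched</a></h3>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Only <h3> tags carrying the fc-item__title class are returned
sample_titles = sample_soup.find_all('h3', class_='fc-item__title')

for item in sample_titles:
    anchor = item.find('a')
    print(anchor.get_text(), '->', anchor['href'])

print(len(sample_titles))
```

If the count printed for the real page is 0, a likely first thing to check is whether the class name still matches the site's markup.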


## Create a CSV file to hold the articles
In the next code block we declare the name of our CSV\* file. If it doesn't exist, it will be created.

If the file does exist, the URLs are extracted and stored in a list so we have a reference of articles already downloaded. We can then print out this list for reference.

\* CSV stands for Comma-Separated Values. This file type can be viewed in MS Excel or a text editor.

In [None]:
# Declare the CSV file name
csv_file = 'guardian.csv'

# Create an empty list for the links of articles already saved
links = []

# If the CSV already exists, store its article links (third column) in the list
if os.path.exists(csv_file):
    with open(csv_file, 'r', newline='', encoding='utf-8') as f:
        csvreader = csv.reader(f, delimiter=",")
        for row in csvreader:
            if len(row) > 2:  # Skip blank or malformed rows
                links.append(row[2])
                print(row[2])

# Open the file ready for appending (newline='' avoids blank rows on Windows)
file = open(csv_file, "a", newline='', encoding='utf-8')
writer = csv.writer(file, quoting=csv.QUOTE_ALL)

cnt = 0  # Count of articles added in this run
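
To see how the row layout and `csv.QUOTE_ALL` interact, here is a small round trip on a temporary file. The rows and the `demo.csv` name are invented for the example; the column order (date, title, link, text) is the one the scraper writes.

```python
import csv
import os
import tempfile

# Invented rows in the order: date, title, link, text
demo_rows = [
    ['2020-01-01T09:00:00+0000', 'A "quoted" title', 'https://example.com/a', 'Body, with a comma'],
    ['2020-01-02T09:00:00+0000', 'Another title', 'https://example.com/b', 'More text'],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')

    # QUOTE_ALL wraps every field in quotes, so commas inside text are safe
    with open(demo_path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f, quoting=csv.QUOTE_ALL).writerows(demo_rows)

    # Read the file back and collect the third column (the links)
    with open(demo_path, 'r', newline='', encoding='utf-8') as f:
        demo_links = [row[2] for row in csv.reader(f) if len(row) > 2]

print(demo_links)
```

Because the link sits in the third column, index 2 is the value to compare against when deciding whether an article has already been saved.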


## Get the article
The next block of code does the main work. It:

- Extracts the URL
- Requests the contents of this page
- Identifies the various pieces of content (title, date, body text)
- Checks that the article doesn't exist in the CSV file, and if not, writes it to the file.


In [None]:
for item in frontpage_news:

    # Skip headlines that have no link
    anchor = item.find('a')
    if anchor is None or not anchor.get('href'):
        continue

    # Getting the article link
    link = anchor['href']

    # Skip articles that are already in the CSV file
    if link in links:
        continue

    # Getting the title and tidying its whitespace
    title = anchor.get_text().replace('\n', ' ').replace('\r', '').strip()

    # Retrieving the page, sending the same browser-like headers as before
    page = requests.get(link, headers=headers)

    print('found link - ' + link)

    # Parse the page content with Beautiful Soup
    soup_article = BeautifulSoup(page.content, 'html.parser')
    article_content = soup_article.find('div', class_='content__article-body from-content-api js-article__body')

    if article_content is None:  # Check there is an article on the page
        continue

    # Publication date, if the page has a <time> element
    time_tag = soup_article.find('time')
    pubdate = time_tag.get('datetime', '') if time_tag is not None else ''

    # Unifying the paragraphs
    list_paragraphs = [p.get_text() for p in article_content.find_all('p')]
    final_article = " ".join(list_paragraphs)

    # Replacing non-breaking spaces with ordinary spaces
    txt = final_article.replace('\xa0', ' ')

    print('Retrieving article -- ' + title)
    cnt += 1
    writer.writerow([pubdate, title, link, txt])
    links.append(link)  # Remember the link so it is not written twice

# Close the file so that all rows are flushed to disk
file.close()

print('Added ' + str(cnt) + ' news articles to ' + csv_file)
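
The paragraph-joining step can be looked at in isolation. The sketch below uses a hypothetical helper, `join_paragraphs`, on made-up strings; it swaps non-breaking spaces for ordinary ones, which is one reasonable choice rather than the only one.

```python
def join_paragraphs(paragraphs):
    """Join paragraph strings into one text and normalise whitespace."""
    text = ' '.join(paragraphs)
    # Turn non-breaking spaces into ordinary spaces
    text = text.replace('\xa0', ' ')
    # Collapse any run of whitespace into a single space
    return ' '.join(text.split())


demo_paragraphs = ['First\xa0paragraph.', '  Second paragraph\n', 'Third.']
demo_text = join_paragraphs(demo_paragraphs)
print(demo_text)
```

Replacing the non-breaking space with a normal space, rather than deleting it, keeps neighbouring words from running together.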


## Viewing the CSV file

The file 'guardian.csv' will now be available to view or download. Go back to the Noteable Home Page tab to see the file.

Running the code cells above again, starting from the first one, will add any new articles to the CSV file.
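
The effect of re-running can be sketched without any network access. The hypothetical helper `append_new_rows` below mirrors the check-then-append idea on a temporary file with made-up rows; it is an illustration, not the notebook's exact code.

```python
import csv
import os
import tempfile


def append_new_rows(path, rows):
    """Append rows whose link (third column) is not yet in the file."""
    seen = set()
    if os.path.exists(path):
        with open(path, 'r', newline='', encoding='utf-8') as f:
            seen = {row[2] for row in csv.reader(f) if len(row) > 2}

    added = 0
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        for row in rows:
            if row[2] not in seen:
                writer.writerow(row)
                seen.add(row[2])
                added += 1
    return added


# Made-up rows in the order: date, title, link, text
demo_rows = [
    ['2020-01-01', 'Title A', 'https://example.com/a', 'Text A'],
    ['2020-01-02', 'Title B', 'https://example.com/b', 'Text B'],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')
    first_run = append_new_rows(demo_path, demo_rows)
    second_run = append_new_rows(demo_path, demo_rows)

print(first_run, second_run)
```

The second call adds nothing because every link is already in the file, which is the behaviour to expect when the front page has not changed between runs.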