# Scraping articles from the Irish Times website
In this notebook we scrape data about news articles published on the Irish Times website [ https://www.irishtimes.com/ ]. We restrict the scrape to articles tagged as *Economy* that contain the words *Irish* and *economy*, on the assumption that these articles are about the Irish economy.

For each of the relevant articles, we want to store the **Title**, **Date** and **Text**. These features will describe each article in our dataset.

### Imports
We use:
- *requests* to request HTML code from a given URL.
- *BeautifulSoup* to parse the HTML code received.
- *datetime* to parse dates from strings and alter their formatting.
- *pandas* to write our data to CSV.

In [5]:
import requests
from bs4 import BeautifulSoup
import datetime
import pandas as pd

### Get article links
This function takes the URL for a page of search results generated by the Irish Times search feature [ https://www.irishtimes.com/search ], and returns a list of the absolute URLs of the articles that page links to.

In [54]:
def getArticleLinks( url ):
    
    articleUrls = []
    
    page = requests.get(url)
    htmlResponse = page.text
    
    soup = BeautifulSoup(htmlResponse, 'html.parser')
    # We find all divs that contain links to search result articles
    searchResultDivs = soup.find_all("div", {"class": "search_items_title"})
    
    for searchResultDiv in searchResultDivs:
        spanElem = searchResultDiv.find("span", {"class": "h2"})
        # Take the 'href' of the <a> tag inside the span; skip results without a link
        linkElem = spanElem.find("a", href=True) if spanElem is not None else None
        if linkElem is not None:
            articleUrls.append('https://www.irishtimes.com' + linkElem['href'])
    
    return articleUrls
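
To check the selectors without making a network request, the same extraction can be run against a small inline HTML string. This is a minimal sketch: the markup in `SAMPLE_HTML` is hypothetical, modelled only on the class names used in `getArticleLinks`, and `extract_links` is an illustrative helper rather than part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure getArticleLinks expects.
SAMPLE_HTML = """
<div class="search_items_title">
  <span class="h2"><a href="/business/economy/example-one">Example one</a></span>
</div>
<div class="search_items_title">
  <span class="h2"><a href="/business/economy/example-two">Example two</a></span>
</div>
<div class="search_items_title"><span class="h2">No link here</span></div>
"""

def extract_links(html, base="https://www.irishtimes.com"):
    """Return absolute article URLs found in a page of search results."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for div in soup.find_all("div", {"class": "search_items_title"}):
        span = div.find("span", {"class": "h2"})
        # Look for an <a> tag with an href inside the span; skip results without one.
        anchor = span.find("a", href=True) if span is not None else None
        if anchor is not None:
            links.append(base + anchor["href"])
    return links

sample_links = extract_links(SAMPLE_HTML)
print(sample_links)
```

Searching for the `<a>` tag by name, rather than taking the first child of the span, keeps the extraction working when whitespace or other nodes precede the link.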

### Parse articles

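This function takes the URL for an Irish Times article and extracts the title, publish date and text content from its HTML. It then creates a one-row DataFrame containing these values and appends that row to the CSV file `trial3.csv`, whose name is hard-coded in the function. Articles marked as 'subscriber only' are skipped and nothing is written for them.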
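
The date handling can be tried in isolation. The sketch below assumes the `<time>` element's text looks like `Tue, Jun 9, 2020, 06:00`, that is, a date followed by a comma and the time of day; that input format is an assumption inferred from the format string used in `parseArticle`, and `normalise_date` is a hypothetical helper name.

```python
import datetime

def normalise_date(time_text):
    """Convert text such as 'Tue, Jun 9, 2020, 06:00' to '2020-06-09'."""
    # Drop everything from the last comma onwards (assumed to be the time of day).
    date_part = time_text[:time_text.rindex(',')]
    # Parse "Weekday, Month Day, Year" and re-emit the date in ISO format.
    parsed = datetime.datetime.strptime(date_part.strip(), '%a, %b %d, %Y')
    return parsed.strftime('%Y-%m-%d')

# Hypothetical input in the assumed format.
example = normalise_date("Tue, Jun 9, 2020, 06:00")
print(example)
```

Note that `%a` and `%b` match day and month names in the current locale, so this relies on an English (or default C) locale.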

In [55]:
def parseArticle( url ):
    
    page = requests.get(url)
    htmlResponse = page.text
    
    soup = BeautifulSoup(htmlResponse, 'html.parser')
    
    # Skip 'subscriber only' articles, identified by the intercept modal div
    subOnlyElem = soup.find("div", {"class": "intercept-modal"})
    if subOnlyElem is not None:
        return
    
    # Get article title
    headerSectionElem = soup.find("hgroup")
    titleElem = headerSectionElem.find("h1")
    titleText = titleElem.text
    
    # Get article date
    timeElem = soup.find("time")
    timeText = timeElem.text
    timeText = timeText[:timeText.rindex(',')]
    dateText = datetime.datetime.strptime(timeText, '%a, %b %d, %Y').strftime('%Y-%m-%d')
    
    # Get article text
    articleElem = soup.find("div", {"class": "article_bodycopy"})
    paragraphElems = articleElem.find_all("p")
    
    # Article text consists of a set of paragraphs, which we join with single spaces
    # so that the last word of one paragraph does not run into the first word of the next
    paragraphText = " ".join(paragraphElem.text for paragraphElem in paragraphElems)
    
    # Make a DataFrame with these values in a row, and append that row to a csv file
    data = [[titleText, dateText, paragraphText]]
    df = pd.DataFrame(data, columns=['title', 'date', 'text'])
    df.to_csv('trial3.csv', mode='a', header=False, index=False)
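
Because each row is appended with `header=False`, the resulting file has no header line, so the column names have to be supplied again when the data is read back. A minimal sketch of that round trip follows; the rows are made-up placeholders and the file is written to a temporary directory rather than to `trial3.csv`.

```python
import os
import tempfile

import pandas as pd

columns = ["title", "date", "text"]

# Hypothetical rows standing in for scraped articles.
rows = [
    ["Example headline", "2020-06-09", "First paragraph. Second paragraph."],
    ["Another headline", "2020-06-08", 'Text with a comma, and a "quote".'],
]

csv_path = os.path.join(tempfile.mkdtemp(), "articles_demo.csv")

# Append one row at a time, in the same way parseArticle does.
for row in rows:
    pd.DataFrame([row], columns=columns).to_csv(
        csv_path, mode="a", header=False, index=False
    )

# The file has no header line, so the column names are passed explicitly.
articles = pd.read_csv(csv_path, names=columns)
print(articles.shape)
```

Without `names=columns`, `read_csv` would treat the first article as the header row.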
    

### Parsing all available articles
We generate the URL for each of the 634 pages of search results, collect the article links on each page, and parse every linked article.

In [53]:
baseUrl = "https://www.irishtimes.com/search/search-7.2285082?q=irish+economy&toDate=09-06-2020&pageId=2.709&page="

for i in range(0,634):
    articleLinks = getArticleLinks(baseUrl + str(i))
    for link in articleLinks:
        parseArticle(link)
        

KeyboardInterrupt: 
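
The run above ended with a `KeyboardInterrupt`, and a single malformed article (for example, one without a `<time>` element) would also stop the loop with an exception. One possible way to make a long crawl more tolerant is sketched below. The `crawl` function, its parameters and the one-second default delay are all assumptions for illustration and are not part of the original notebook; stub functions replace the real scrapers so the sketch runs without network access.

```python
import time

def crawl(page_urls, get_links, parse, delay=1.0, sleep=time.sleep):
    """Visit each results page, parse each linked article, and collect failures."""
    failed = []
    for page_url in page_urls:
        try:
            links = get_links(page_url)
        except Exception as exc:
            # Record the failing results page and move on to the next one.
            failed.append((page_url, repr(exc)))
            continue
        for link in links:
            try:
                parse(link)
            except Exception as exc:
                # Record the failing article instead of aborting the whole run.
                failed.append((link, repr(exc)))
            # Pause between article requests to avoid hammering the server.
            sleep(delay)
    return failed

# Stub callables so the sketch runs offline.
def fake_get_links(page_url):
    return [page_url + "/a", page_url + "/b"]

parsed = []

def fake_parse(link):
    if link.endswith("1/b"):
        raise AttributeError("missing <time> element")
    parsed.append(link)

failures = crawl(["page0", "page1"], fake_get_links, fake_parse,
                 delay=0, sleep=lambda seconds: None)
print(len(parsed), len(failures))
```

In the notebook this could be called as `crawl((baseUrl + str(i) for i in range(634)), getArticleLinks, parseArticle)`. Catching `Exception` rather than `BaseException` means a manual interrupt still stops the run.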