# Scraping articles from the Irish Times website
In this notebook we scrape data about news articles published on the Irish Times website (https://www.irishtimes.com/). We are interested in articles tagged as *Economy* that contain the words *Irish* and *economy*, on the assumption that those articles are about the Irish economy.

For each of the relevant articles, we want to store the **Title**, **Date** and **Text**. These features will describe each article in our dataset.

### Imports
We use:
- *requests* to request HTML code from a given URL.
- *BeautifulSoup* to parse the HTML code received.
- *datetime* to parse dates from strings and alter their formatting.
- *pandas* to write our data to a CSV file.

In [3]:
import requests
from bs4 import BeautifulSoup
import datetime
import pandas as pd

### Get article links
This function takes the URL for a page of search results generated by the Irish Times search feature (https://www.irishtimes.com/search), and returns a list of URLs for the articles that page links to.

In [4]:
def getArticleLinks(url):
    """Return the article URLs linked from one page of search results."""
    articleUrls = []

    page = requests.get(url)
    htmlResponse = page.text

    soup = BeautifulSoup(htmlResponse, 'html.parser')
    # Find all divs that contain links to search result articles
    searchResultDivs = soup.find_all("div", {"class": "search_items_title"})

    for searchResultDiv in searchResultDivs:
        spanElem = searchResultDiv.find("span", {"class": "h2"})
        # The <a> tag inside the span holds the relative URL in its 'href' attribute
        linkElem = spanElem.find("a")
        articleUrls.append('https://www.irishtimes.com' + linkElem['href'])

    return articleUrls
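
The function above builds absolute URLs by prepending the site root to each `href`. As a minimal sketch of that one step, the standard library's `urljoin` does the same job and also copes with an `href` that is already absolute. The helper name `to_absolute` and the example paths are made up for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

SITE_ROOT = "https://www.irishtimes.com"

def to_absolute(href, root=SITE_ROOT):
    """Turn a relative href into an absolute URL; leave absolute URLs unchanged."""
    return urljoin(root, href)

# Hypothetical hrefs, as they might appear in a page of search results
relative_href = "/business/economy/example-article-1.1234567"
absolute_href = "https://www.irishtimes.com/business/economy/another-example-1.7654321"

print(to_absolute(relative_href))
print(to_absolute(absolute_href))
```

With plain string concatenation, the second `href` would end up with the site root repeated at the front.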

### Parse articles

This function takes the URL for an Irish Times article and extracts the title, publication date and text content from its HTML. Subscriber-only articles are skipped. For every other article, the function builds a one-row DataFrame from these values and appends it to the CSV file `csv/trial3.csv`.

In [27]:
def parseArticle(url):
    """Extract an article's title, date and text, and append them to the CSV file."""
    page = requests.get(url)
    htmlResponse = page.text

    soup = BeautifulSoup(htmlResponse, 'html.parser')

    # Skip 'subscriber only' articles
    subOnlyElem = soup.find("div", {"class": "intercept-modal"})
    if subOnlyElem is not None:
        return

    # Get article title
    headerSectionElem = soup.find("hgroup")
    titleElem = headerSectionElem.find("h1")
    titleText = titleElem.text

    # Get article date; fall back to the element text if 'title' is missing or empty
    timeElem = soup.find("time")
    timeText = timeElem.get('title', '')
    if timeText == '':
        timeText = timeElem.text
    # Drop everything after the last comma, leaving e.g. 'Sat, Jun 13, 2020'
    timeText = timeText[:timeText.rindex(',')]
    dateText = datetime.datetime.strptime(timeText, '%a, %b %d, %Y').strftime('%Y-%m-%d')

    # Get article text
    articleElem = soup.find("div", {"class": "article_bodycopy"})
    paragraphElems = articleElem.find_all("p")

    # Article text consists of a set of paragraphs, which we join with spaces
    # so that words at paragraph boundaries do not run together
    paragraphText = " ".join(paragraphElem.text for paragraphElem in paragraphElems)

    # Make a DataFrame with these values in a row, and append that row to a csv file
    data = [[titleText, dateText, paragraphText]]
    df = pd.DataFrame(data, columns=['title', 'date', 'text'])
    df.to_csv('csv/trial3.csv', mode='a', header=False, index=False)
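
The date handling in `parseArticle` can be tried in isolation. The sketch below assumes the `<time>` element carries a string such as `'Sat, Jun 13, 2020, 06:00'`; that sample value is an assumption inferred from the format string `'%a, %b %d, %Y'`, not copied from the site, and the helper name `normaliseDate` is hypothetical.

```python
import datetime

def normaliseDate(timeText):
    """Convert e.g. 'Sat, Jun 13, 2020, 06:00' to an ISO date string."""
    # Drop everything after the last comma (assumed to be the time of day)
    dateOnly = timeText[:timeText.rindex(',')]
    # Parse the remaining 'weekday, month day, year' string
    parsed = datetime.datetime.strptime(dateOnly, '%a, %b %d, %Y')
    return parsed.strftime('%Y-%m-%d')

print(normaliseDate('Sat, Jun 13, 2020, 06:00'))  # prints 2020-06-13
```

Note that `%a` and `%b` match English day and month abbreviations only under an English (or default C) locale, and that a string with no comma at all makes `rindex` raise a `ValueError`.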
    

### Parsing all available articles
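The search returns 1066 pages of results. We build the URL for each page by appending its page number to the base URL, which lets us parse all available articles. The run shown here covers pages 902 to 1066.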

In [29]:
baseUrl = "https://www.irishtimes.com/search/search-7.4195619?q=%22irish+economy%22&toDate=14-06-2020&page="

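for i in range(902, 1067):
    articleLinks = getArticleLinks(baseUrl + str(i))
    # Print the page number to track progress
    print(i)
    for link in articleLinks:
        try:
            parseArticle(link)
        except Exception as e:
            # Report which article failed and why, then carry on with the next one
            print("Exception for", link, ":", e)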

[Output: the page numbers 902 through 1066, printed one per line]
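
The loop above prints a message when an article fails to parse, but does not record which article it was. One possible variation, sketched below, collects the failures in a list and pauses between requests. The function `crawlPages`, its parameters and the stub functions are all hypothetical; the stubs stand in for `getArticleLinks` and `parseArticle` so that the example runs without network access.

```python
import time

def crawlPages(pageNumbers, getLinks, parse, delaySeconds=1.0):
    """Parse every article on the given result pages and return the failures."""
    failures = []
    for pageNumber in pageNumbers:
        try:
            links = getLinks(pageNumber)
        except Exception as exc:
            # The results page itself could not be fetched or parsed
            failures.append((pageNumber, None, repr(exc)))
            continue
        for link in links:
            try:
                parse(link)
            except Exception as exc:
                failures.append((pageNumber, link, repr(exc)))
            # Pause between articles to avoid hammering the server
            time.sleep(delaySeconds)
    return failures

# Stub functions standing in for getArticleLinks and parseArticle
def fakeGetLinks(pageNumber):
    return [f"page{pageNumber}-a", f"page{pageNumber}-b"]

parsed = []

def fakeParse(link):
    if link.endswith("-b"):
        raise ValueError("unparseable article")
    parsed.append(link)

failures = crawlPages([1, 2], fakeGetLinks, fakeParse, delaySeconds=0)
print(parsed)
print(failures)
```

In the notebook, the equivalent call would be `crawlPages(range(902, 1067), lambda i: getArticleLinks(baseUrl + str(i)), parseArticle)`, after which the returned list can be inspected or retried.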
