### Introduction

In this notebook, I scraped Singapore local news articles from the Straits Times website, using its XML sitemap to find the article links.

References:
- https://realpython.com/beautiful-soup-web-scraper-python/#step-2-scrape-html-content-from-a-page
- https://stackoverflow.com/questions/34467410/python-doesnt-have-permission-to-access-on-this-server-return-city-state-from

In [1]:
import requests
from bs4 import BeautifulSoup
import re
from tqdm import tqdm
import pandas as pd
from datetime import datetime

pd.set_option('display.max_rows', 100)

### Scrape Singapore News from Straits Times Website

Consideration: The Straits Times was chosen because (1) it is an established local news agency, and (2) its website is organised in a straightforward manner for scraping (it has a sitemap). The other news agencies considered were Today Online (no sitemap could be found) and Channel News Asia (its website is less straightforward to scrape because of dynamic loading).
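
Sites often advertise their sitemap in `robots.txt` through a `Sitemap:` line, so that file is one place to look when a sitemap is not obvious. The sketch below is only an illustration of that check: the `robots.txt` body is a made-up sample inlined so the code runs offline, `example.com` is a placeholder, and `find_sitemaps` is a hypothetical helper that is not part of the original notebook.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""


def find_sitemaps(robots_txt: str) -> list[str]:
    """Return the sitemap URLs declared in a robots.txt body (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap lines are present (Python 3.8+).
    return parser.site_maps() or []


sitemaps = find_sitemaps(SAMPLE_ROBOTS_TXT)
print(sitemaps)
```

For a live site, the same parser could be pointed at the real file with `set_url(...)` followed by `read()`.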

In [2]:
# Compile the links to all of the pages listed in the Straits Times XML sitemap index

headers = {'user-agent': 'Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0'}
outerpage = requests.get("https://www.straitstimes.com/sitemap.xml", headers=headers)
outersoup = BeautifulSoup(outerpage.content, features='xml')
outerpagelinks = outersoup.find_all('loc')
outerpagelinks = [entry.get_text() for entry in outerpagelinks]
outerpagelinks

['https://www.straitstimes.com/sitemap.xml?page=1',
 'https://www.straitstimes.com/sitemap.xml?page=2',
 'https://www.straitstimes.com/sitemap.xml?page=3',
 'https://www.straitstimes.com/sitemap.xml?page=4',
 'https://www.straitstimes.com/sitemap.xml?page=5',
 'https://www.straitstimes.com/sitemap.xml?page=6',
 'https://www.straitstimes.com/sitemap.xml?page=7',
 'https://www.straitstimes.com/sitemap.xml?page=8',
 'https://www.straitstimes.com/sitemap.xml?page=9',
 'https://www.straitstimes.com/sitemap.xml?page=10',
 'https://www.straitstimes.com/sitemap.xml?page=11',
 'https://www.straitstimes.com/sitemap.xml?page=12',
 'https://www.straitstimes.com/sitemap.xml?page=13',
 'https://www.straitstimes.com/sitemap.xml?page=14',
 'https://www.straitstimes.com/sitemap.xml?page=15',
 'https://www.straitstimes.com/sitemap.xml?page=16',
 'https://www.straitstimes.com/sitemap.xml?page=17',
 'https://www.straitstimes.com/sitemap.xml?page=18',
 'https://www.straitstimes.com/sitemap.xml?page=19',
 '
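
The cell above treats `sitemap.xml` as an index whose `<loc>` elements point to the paginated sitemap files. As a rough illustration of that structure, the sketch below pulls `<loc>` values out of a small inline sample with the standard library's `xml.etree.ElementTree` instead of BeautifulSoup. The sample XML, the `example.com` URLs, and the `extract_locs` helper are assumptions for illustration, not taken from the Straits Times site.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap index, inlined so the sketch runs offline.
SAMPLE_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap.xml?page=1</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap.xml?page=2</loc></sitemap>
</sitemapindex>
"""


def extract_locs(xml_text: str) -> list[str]:
    """Return the text of every <loc> element in a sitemap or sitemap index."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


page_links = extract_locs(SAMPLE_INDEX)
print(page_links)
```

The same function works on the per-page sitemaps as well, since both document types store their URLs in `<loc>` elements.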

In [3]:
# For each sitemap page, compile all of the links that belong to Singapore news

singaporepagelinks = []

for outerpagelink in tqdm(outerpagelinks):
    # Try each sitemap page at most twice before giving up on it
    for attempt in range(2):
        try:
            innerpage = requests.get(outerpagelink, headers=headers)
            innersoup = BeautifulSoup(innerpage.content, features='xml')
            innerpagelinks = [entry.get_text() for entry in innersoup.find_all('loc')]
            # Keep only the URLs whose first path segment is 'singapore'
            innerpagelinks_singapore = [entry for entry in innerpagelinks if entry.split('/')[3] == 'singapore']
            singaporepagelinks += innerpagelinks_singapore
            break
        except Exception:
            # Catch Exception rather than using a bare except, so that Ctrl+C still stops the loop
            continue

singaporepagelinks

100%|███████████████████████████████████████████████████████████████████████████████████████████████| 34/34 [00:50<00:00,  1.50s/it]


['https://www.straitstimes.com/singapore/singapore-ushers-in-2015-with-brilliant-fireworks-display-song-performances',
 'https://www.straitstimes.com/singapore/party-over-cleanup-time',
 'https://www.straitstimes.com/singapore/transport/upcoming-premium-school-bus-service-that-cuts-commuting-time-by-half-will-cost',
 'https://www.straitstimes.com/singapore/columbarium-in-sengkang-fernvale-link-will-be-out-of-public-view',
 'https://www.straitstimes.com/singapore/its-a-girl-singapores-first-jubilee-baby-born-at-stroke-of-midnight',
 'https://www.straitstimes.com/singapore/scdf-tackles-blaze-at-tampines-flat',
 'https://www.straitstimes.com/singapore/environment/singapore-could-expect-more-volatile-weather',
 'https://www.straitstimes.com/singapore/here-are-the-10-most-read-stories-on-sts-website-in-2014',
 'https://www.straitstimes.com/singapore/environment/have-you-used-your-eco-bag-11-times',
 'https://www.straitstimes.com/singapore/courts-crime/beware-of-con-men-abusing-pic-scheme',
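
The filter above relies on `entry.split('/')[3]`, which raises an `IndexError` for a URL with no path at all (for example a bare domain), and that error would be swallowed by the surrounding `except`. A minimal sketch of a more defensive check, using `urllib.parse` from the standard library, is shown below. `in_section` is a hypothetical helper name, not part of the original notebook.

```python
from urllib.parse import urlparse


def in_section(url: str, section: str = "singapore") -> bool:
    """Return True when the first path segment of the URL equals `section`."""
    segments = [segment for segment in urlparse(url).path.split("/") if segment]
    return bool(segments) and segments[0] == section


sample_links = [
    "https://www.straitstimes.com/singapore/party-over-cleanup-time",
    "https://www.straitstimes.com/world/some-story",  # hypothetical non-Singapore link
    "https://www.straitstimes.com",                   # no path: split('/')[3] would fail here
]
singapore_only = [link for link in sample_links if in_section(link)]
print(singapore_only)
```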


In [4]:
len(singaporepagelinks)

31646

In [5]:
# Extract the relevant details from each article and compile them into a dataframe

publishdates = []
publishdates_standardised = []
titles = []
articles = []
links = []

df = pd.DataFrame(columns = ['Published Date', 'Published Date (Standardised)', 'Title', 'Article', 'Links'])

batch_size=2

for singaporepagelink in tqdm(singaporepagelinks):
    # Try each article at most twice. The lists are only appended to once every
    # field has been extracted, so they always stay the same length and row-aligned.
    for attempt in range(2):
        try:
            articlepage = requests.get(singaporepagelink, headers=headers)
            # Name the parser explicitly; lxml is already required by the 'xml' feature used above
            articlesoup = BeautifulSoup(articlepage.content, features='lxml')

            publishdate = articlesoup.find(class_='story-postdate').get_text().strip()
            publishdate_standardised = datetime.strptime(publishdate, "%B %d, %Y at %I:%M %p").strftime("%d/%m/%y")

            title = articlesoup.find(class_='node-header').get_text().strip()

            paragraphs = articlesoup.find_all(class_ = 'clearfix text-formatted field field--name-field-paragraph-text field--type-text-long field--label-hidden field__item')
            paragraphs = [entry.get_text() for entry in paragraphs]
            article = ' '.join(paragraphs)
            article = re.sub(r'\n', ' ', article)
            article = re.sub(' +', ' ', article).strip()
        except Exception:
            # Extraction failed: retry once, then skip this article
            continue

        publishdates.append(publishdate)
        publishdates_standardised.append(publishdate_standardised)
        titles.append(title)
        articles.append(article)
        links.append(singaporepagelink)
        break

    if len(articles) > batch_size:
        df_batch = pd.DataFrame(zip(publishdates, publishdates_standardised, titles, articles, links), columns = ['Published Date', 'Published Date (Standardised)', 'Title', 'Article', 'Links'])
        df = pd.concat([df, df_batch]).reset_index(drop=True)
        publishdates = []
        publishdates_standardised = []
        titles = []
        articles = []
        links = []
    
if len(articles) > 0:
    df_batch = pd.DataFrame(zip(publishdates, publishdates_standardised, titles, articles, links), columns = ['Published Date', 'Published Date (Standardised)', 'Title', 'Article', 'Links'])
    df = pd.concat([df, df_batch]).reset_index(drop=True)

100%|██████████████████████████████████████████████████████████████████████████████████████| 31646/31646 [13:28:32<00:00,  1.53s/it]
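
Both scraping loops above retry a failed request once. A minimal sketch of how that pattern could be factored into a reusable helper is given below. `with_retries` and `flaky` are hypothetical names introduced for illustration, and the failure is simulated so the example runs without network access.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(task: Callable[[], T], attempts: int = 2, delay: float = 0.0) -> Optional[T]:
    """Run `task` up to `attempts` times; return its result, or None if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as exc:
            # Report the failure instead of silently passing, so skipped items are visible.
            print(f"attempt {attempt}/{attempts} failed: {exc!r}")
            if attempt < attempts and delay > 0:
                time.sleep(delay)
    return None


# Simulated task: fails on the first call and succeeds on the second.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("simulated network failure")
    return "ok"


result = with_retries(flaky)
print(result, calls["count"])
```

In the scraper, `task` could be a small function that fetches and parses one URL, so a `None` result marks a link that was skipped after both attempts.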


In [6]:
df['Published Date (Standardised)'] = pd.to_datetime(df['Published Date (Standardised)'], dayfirst=True)
df = df.sort_values('Published Date (Standardised)').reset_index(drop=True)
df

| | Published Date | Published Date (Standardised) | Title | Article | Links |
|---|---|---|---|---|---|
| 0 | July 18, 2007 at 2:00 PM | 2007-07-18 | From the ST archives: Student, 19, missing for... | This article appeared in the print edition of ... | https://www.straitstimes.com/singapore/from-th... |
| 1 | August 13, 2008 at 5:00 AM | 2008-08-13 | Silent star of Singapore | SINGAPORE - The Prime Minister has an official... | https://www.straitstimes.com/singapore/silent-... |
| 2 | April 5, 2010 at 6:00 AM | 2010-04-05 | A RUNNER’S DIARY\n\n It's crazy only if... | This story was first published on April 5, 201... | https://www.straitstimes.com/singapore/its-cra... |
| 3 | April 19, 2012 at 6:00 AM | 2012-04-19 | Top 29 pupils head for the Big Spell | Twelve-year-old Jordan Foo received a belated ... | https://www.straitstimes.com/singapore/top-29-... |
| 4 | April 22, 2012 at 6:00 AM | 2012-04-22 | Why celebrate a spelling champ in the digital age | Competitive spelling is a mind sport. Like wat... | https://www.straitstimes.com/singapore/why-cel... |
| ... | ... | ... | ... | ... | ... |
| 31244 | October 25, 2023 at 12:00 PM | 2023-10-25 | Mandatory treatment for senior with mental con... | SINGAPORE – An elderly man suffering from bipo... | https://www.straitstimes.com/singapore/courts-... |
| 31245 | October 25, 2023 at 12:35 PM | 2023-10-25 | Former president Halimah Yacob awarded Singapo... | SINGAPORE – Former president Halimah Yacob has... | https://www.straitstimes.com/singapore/former-... |
| 31246 | October 25, 2023 at 2:55 PM | 2023-10-25 | Over three years’ jail for man who cheated 68 ... | SINGAPORE – The man behind one of Singapore’s ... | https://www.straitstimes.com/singapore/courts-... |
| 31247 | October 25, 2023 at 3:45 PM | 2023-10-25 | Porsche, 2 Rolls-Royces among 4 cars seized fr... | SINGAPORE – Four luxury cars linked to Singapo... | https://www.straitstimes.com/singapore/courts-... |
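
The standardised dates above come from strings such as "October 25, 2023 at 12:00 PM". The sketch below shows the same `strptime` format emitting an ISO `YYYY-MM-DD` string, which keeps the four-digit year and does not need a day-first hint when parsed later. It also shows a whitespace clean-up that could be applied to titles, since row 2 above still carries embedded newlines. `standardise_date` and `collapse_whitespace` are hypothetical helpers, and the ISO output format is a suggested alternative rather than what the notebook currently does.

```python
import re
from datetime import datetime


def standardise_date(raw: str) -> str:
    """Convert e.g. 'October 25, 2023 at 12:00 PM' to an ISO 'YYYY-MM-DD' string."""
    parsed = datetime.strptime(raw.strip(), "%B %d, %Y at %I:%M %p")
    return parsed.strftime("%Y-%m-%d")


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (including newlines) with a single space."""
    return re.sub(r"\s+", " ", text).strip()


iso_date = standardise_date("October 25, 2023 at 12:00 PM")
clean_title = collapse_whitespace("A RUNNER’S DIARY\n\n It's crazy only if")
print(iso_date)
print(clean_title)
```

Note that `%B` and `%p` depend on the active locale, so this assumes English month names and AM/PM markers, as in the scraped strings.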


In [7]:
# df.to_csv('straitstimes_articles_raw.csv', index=False)