# NCCS RetrieveURLS Script

This is a cleaned-up script that only retrieves URLs from the National Climate Change Secretariat (NCCS) website at https://www.nccs.gov.sg/.

The output of this notebook is a .txt file, urls.txt, listing the links to pages that contain useful information.

This may not be an exhaustive list of all links on the NCCS website, because the crawl only covers the links on the home page and the links found on those pages. I believe it still covers most of the site.

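I have also removed links that do not contain useful information or that require additional processing. These are:

- Links to PDF files (handled separately)
- Links to media releases
- Links to public consultations
- `mailto` links and housekeeping pages (home page, privacy statement, careers, contact info, terms of use)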


*Apart from the links found on the NCCS website, I also find NCCS's recent publications to contain highly relevant and useful information. These are stored as PDF files and will be scraped at a later stage.*

In [3]:
import requests
from bs4 import BeautifulSoup
import copy 

In [4]:
def collect_links(home_page, home_url=None):
    """Return all relative links embedded in home_page as absolute URLs.

    Arguments:
        - home_page (str): URL of the page to scrape links from
        - home_url (str): base URL prepended to each relative link;
          defaults to home_page
    """
    response = requests.get(home_page)
    soup = BeautifulSoup(response.content, 'html.parser')

    if home_url is None:
        home_url = home_page

    collected_links_set = set()

    # Find all the anchor tags in the page
    for link in soup.find_all('a'):
        url = link.get('href')
        # Skip anchors without an href and absolute links (starting with 'http')
        if url is None or url.startswith('http'):
            continue
        collected_links_set.add(home_url + url)

    return list(collected_links_set)
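
`collect_links` builds absolute URLs by string concatenation and drops every href that starts with `http`, including absolute links that point back to the same site. As a possible alternative, the sketch below resolves hrefs with the standard library's `urllib.parse.urljoin` and keeps only links on the same host. `resolve_internal_links` is a hypothetical helper and not part of this notebook. It works on a plain list of href strings, so it can be tried without any network access, and its output will differ from the counts shown in this notebook.

```python
from urllib.parse import urljoin, urlparse


def resolve_internal_links(hrefs, base_url):
    """Resolve raw href values against base_url and keep same-host links only.

    Hypothetical helper for illustration; not used elsewhere in the notebook.
    """
    base_host = urlparse(base_url).netloc
    resolved = set()
    for href in hrefs:
        # Skip missing hrefs, in-page anchors, and non-HTTP schemes
        if not href or href.startswith(('#', 'mailto:', 'tel:', 'javascript:')):
            continue
        absolute = urljoin(base_url, href)
        # Keep the link only if it stays on the same host as base_url
        if urlparse(absolute).netloc == base_host:
            resolved.add(absolute)
    return sorted(resolved)


sample_hrefs = ['/careers', None, '#top', 'mailto:someone@example.com',
                'https://www.gov.sg/', 'faqs/', 'https://www.nccs.gov.sg/media/']
print(resolve_internal_links(sample_hrefs, 'https://www.nccs.gov.sg/'))
```

In the notebook, the href list could come from `[a.get('href') for a in soup.find_all('a')]`.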

In [5]:
initial_links = collect_links('https://www.nccs.gov.sg')
print(initial_links)

all_links = copy.deepcopy(initial_links)
print(len(all_links))

['https://www.nccs.gov.sg/about-climate-change/how-we-are-affecting-climate-change/', 'https://www.nccs.gov.sg/singapores-climate-action/Mitigation-Efforts/overview/', 'https://www.nccs.gov.sg/careers', 'https://www.nccs.gov.sg/singapores-climate-action/overview/what-we-can-do-overview/', 'https://www.nccs.gov.sg/singapores-climate-action/overview/adaptation-overview', 'https://www.nccs.gov.sg/media/parliamentary-replies/', 'https://www.nccs.gov.sg/who-we-are/inter-ministerial-committee-on-climate-change/', 'https://www.nccs.gov.sg/singapores-climate-action/overview/national-circumstances/', 'https://www.nccs.gov.sg/media/press-releases/addendum-to-the-presidents-address-2023/', 'https://www.nccs.gov.sg/media/speeches/speech-by-senior-minister-and-coordinating-minister-for-national-security-teo-chee-hean-committee-of-supply-2023/', 'https://www.nccs.gov.sg/singapores-climate-action/overview/adaptation-overview/', 'https://www.nccs.gov.sg/media/speeches/speech-ds-cindy-khoo-wwf-sg-2023/

In [6]:
# Collect new links from all previously collected webpages

all_links_set = set(all_links)

for link in initial_links: 
    new_found_links = collect_links(link, home_url = 'https://www.nccs.gov.sg')
    
    for i in new_found_links: 
        all_links_set.add(i)
        
all_links = list(all_links_set)
print(len(all_links))

466


In [9]:
all_links = [link for link in all_links if "media" not in link and "mailto" not in link]
all_links = [link for link in all_links if "pdf" not in link] # Handle PDF separately
all_links = [link for link in all_links if "public-consultation" not in link]

unclean_urls = ['https://www.nccs.gov.sg/', 'https://www.nccs.gov.sg/privacy-statement/',
                'https://www.nccs.gov.sg/careers/', 'https://www.nccs.gov.sg/pages/contact-us/contact-info/',
                'https://www.nccs.gov.sg/careers', 'https://www.nccs.gov.sg/terms-of-use/']

all_links = [link for link in all_links if link not in unclean_urls]

print(len(all_links))

49
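
The filtering step above can also be sketched as a single reusable function, which keeps the exclusion rules in one place. The names `EXCLUDED_SUBSTRINGS`, `keep_link`, and `filter_links` are hypothetical and not part of this notebook. The rules themselves mirror the cell above. Note that these are plain substring checks, so a URL is dropped whenever the text appears anywhere in it, not only as a path segment.

```python
# Substrings that mark a link for removal (same rules as the filtering cell)
EXCLUDED_SUBSTRINGS = ("media", "mailto", "pdf", "public-consultation")


def keep_link(url, excluded_urls=(), excluded_substrings=EXCLUDED_SUBSTRINGS):
    """Return True if url passes both the exact-match and substring filters."""
    if url in excluded_urls:
        return False
    return not any(fragment in url for fragment in excluded_substrings)


def filter_links(links, excluded_urls=()):
    """Return the links that survive the filters, preserving their order."""
    return [url for url in links if keep_link(url, excluded_urls)]


sample_links = [
    'https://www.nccs.gov.sg/faqs/',
    'https://www.nccs.gov.sg/media/parliamentary-replies/',
    'https://www.nccs.gov.sg/careers',
]
print(filter_links(sample_links, excluded_urls=['https://www.nccs.gov.sg/careers']))
```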


In [11]:
with open("urls.txt", "w") as file:
    file.write(str(all_links))
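
Because the cell above writes `str(all_links)`, urls.txt holds a single Python list literal rather than one URL per line. The sketch below shows one way to read such a file back with `ast.literal_eval`, along with a one-URL-per-line alternative that other tools may find easier to consume. The function names are hypothetical and not part of this notebook.

```python
import ast
from pathlib import Path


def load_links_from_literal(path):
    """Parse a file that was written with file.write(str(list_of_urls))."""
    return ast.literal_eval(Path(path).read_text(encoding="utf-8"))


def save_links_one_per_line(links, path):
    """Alternative format: write each URL on its own line."""
    Path(path).write_text("\n".join(links) + "\n", encoding="utf-8")


def load_links_one_per_line(path):
    """Read back a file written by save_links_one_per_line."""
    text = Path(path).read_text(encoding="utf-8")
    return [line for line in text.splitlines() if line]
```

For example, `load_links_from_literal("urls.txt")` should recover the list that this notebook saved.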