# Web-Scraping News from the United Nations site
We will scrape the latest news-posts from the UN's site and compile them into a dataframe that can be exported to different file types.

In [1]:
from bs4 import BeautifulSoup # The nice libraries
import urllib.request as ur, pandas as pd

# Tag/class reference for the parts of each post we look for (unused below, kept for reference)
cards_ref = dict(title=['h2',{'class':'blog-shortcode-post-title'}],
                 meta=['p',{'class':'fusion-single-line-meta'}],
                 info=['div',{'class':'fusion-post-content-container'}]
                 )

# Collecting the HTML Cards/Panels per post on the 'News' section of the site
url = 'https://www.un.org/youthenvoy/news/'
req = ur.Request(url, headers = {'User-Agent': 'Someone'})
page_content = ur.urlopen(req).read()
soup = BeautifulSoup(page_content, "html.parser")
cards = soup.find_all("div", attrs={'class':'fusion-post-content post-content'})

print("Page-1 Posts collected (in HTML) from UN's News-page.")

Page-1 Posts collected (in HTML) from UN's News-page.
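
The `find_all` call above passes the full two-class string. As far as the BeautifulSoup documentation describes it, a multi-class string is compared against the exact value of the `class` attribute, so the match depends on the site emitting the classes in that order. Below is a minimal offline sketch of an order-independent alternative using a CSS selector. The HTML snippet and the `sample_*` names are made up for illustration and only mimic the class names used above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the class names scraped above
sample_html = """
<div class="fusion-post-content post-content"><h2><a href="/a">First</a></h2></div>
<div class="post-content fusion-post-content"><h2><a href="/b">Second</a></h2></div>
<div class="sidebar"><a href="/c">Not a post</a></div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# CSS selector: matches any <div> carrying both classes, in any order
sample_cards = sample_soup.select("div.fusion-post-content.post-content")
sample_titles = [c.find("a", href=True).text.strip() for c in sample_cards]
print(sample_titles)
```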


### Collecting details per News-post (title, date, etc...)

In [2]:
print('Details being collected per news-post card, an example of the first 3:\n')

UN_news_cards = []
for i, card in enumerate(cards, 1):
    
    card_link = card.find('a', href=True)
    title = card_link.text.strip() # Or: card.find('h2').text
    date = card.find('span').text.strip()
    subtext = card.find('div',
                        attrs={'class':'fusion-post-content-container'}).text.strip()
    card_url = card_link['href']
    
    if i < 4: 
        print(f'Title:\t\t\t{title}\n'+
              f'Date:\t\t\t{date}\n'+
              f'Subtext:\t\t{subtext}...\n'+
              f'Read more (post-link):\t{card_link["href"]}\n\n')
    
    UN_news_cards.append([title, date, subtext, card_url])

Details being collected per news-post card, an example of the first 3:

Title:			Launch of “Be Seen, Be Heard” Campaign
Date:			11 May 2022
Subtext:		The United Nations Secretary-General’s Envoy on Youth and The Body Shop launch global collaboration calling for more young voices in the halls of power   The Body Shop and United Nations Secretary-General’s Envoy on Youth...
Read more (post-link):	https://www.un.org/youthenvoy/2022/05/launch-of-the-be-seen-be-heard-campaign/


Title:			World leaders renew commitment to “Youth, Peace and Security” agenda
Date:			21 January 2022
Subtext:		Governments announce new actions to meaningfully include youth in peacebuilding efforts at the High-Level Global Conference on Youth-Inclusive Peace Processes 21 January 2022 (Doha, Qatar) —  At the virtual High-Level Global Conference on Youth-Inclusive...
Read more (post-link):	https://www.un.org/youthenvoy/2022/01/world-leaders-renew-commitment-to-youth-peace-and-security-agenda/


Title:			First-ever G
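
The loop above assumes every card contains a link, a `<span>` date and a content container. `find` returns `None` when a tag is missing, which would raise an `AttributeError` on `.text`. The sketch below shows one way to guard against that. `parse_card` is a hypothetical helper, not part of the original notebook, and the demo card is invented.

```python
from bs4 import BeautifulSoup

def parse_card(card):
    """Return [title, date, subtext, link] for one card, with None for missing parts."""
    link = card.find("a", href=True)
    date_tag = card.find("span")
    body = card.find("div", attrs={"class": "fusion-post-content-container"})
    return [
        link.text.strip() if link else None,
        date_tag.text.strip() if date_tag else None,
        body.text.strip() if body else None,
        link["href"] if link else None,
    ]

# Hypothetical card without a date <span>, to exercise the guard
demo = BeautifulSoup(
    '<div class="fusion-post-content post-content">'
    '<h2><a href="https://example.org/post">Demo post</a></h2>'
    '<div class="fusion-post-content-container"> Some text </div></div>',
    "html.parser",
)
demo_row = parse_card(demo.div)
print(demo_row)
```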

### News-Posts compiled as a Dataframe

In [3]:
UN_news_df = pd.DataFrame(UN_news_cards, columns=['UN_News','Date','Subtext','Link'])

UN_news_df

,UN_News,Date,Subtext,Link
0,"Launch of “Be Seen, Be Heard” Campaign",11 May 2022,The United Nations Secretary-General’s Envoy o...,https://www.un.org/youthenvoy/2022/05/launch-o...
1,"World leaders renew commitment to “Youth, Peac...",21 January 2022,Governments announce new actions to meaningful...,https://www.un.org/youthenvoy/2022/01/world-le...
2,First-ever Global Report on Protecting Young P...,18 June 2021,The launch of the report comes alongside a hig...,https://www.un.org/youthenvoy/2021/06/first-ev...
3,Young Human Rights Defenders Adapting to COVID-19,9 December 2020,Celebrating Youth Resilience & Creativity on I...,https://www.un.org/youthenvoy/2020/12/young-hu...
4,"On International Volunteers Day, Meet Our Firs...",4 December 2020,"5 December 2020 Earlier this year, the Off...",https://www.un.org/youthenvoy/2020/12/on-inter...
5,Breaking gender barriers for young women parli...,25 November 2020,Twenty-five years after the Beijing Declaratio...,https://www.un.org/youthenvoy/2020/11/breaking...
6,Statement Issued by the Office of the UN Secre...,14 April 2020,The Office of the UN Secretary-General’s Envoy...,https://www.un.org/youthenvoy/2020/04/statemen...
7,Want to be a Fellow? Apply Now! No more unpaid...,5 February 2020,The Office of the Secretary-General’s Envoy on...,https://www.un.org/youthenvoy/2020/02/want-to-...
8,UN Secretary-General’s Envoy on Youth visits G...,24 December 2019,"In October 2019, the United Nations Secretary-...",https://www.un.org/youthenvoy/2019/12/un-secre...
9,Summer of Solutions,29 November 2019,As the effects of climate change become more a...,https://www.un.org/youthenvoy/2019/11/summer-o...


Some may prefer a dataframe with dates as the index. Let's set it up that way, converting the "Date" column to date-time objects:

In [4]:
datetime_index = pd.to_datetime(UN_news_df.Date)

UN_news_df.set_index(datetime_index, inplace=True)
UN_news_df.drop('Date', axis=1, inplace=True)

print("A cleaner dataframe indexed by date-times:")
UN_news_df

A cleaner dataframe indexed by date-times:


Date,UN_News,Subtext,Link
2022-05-11,"Launch of “Be Seen, Be Heard” Campaign",The United Nations Secretary-General’s Envoy o...,https://www.un.org/youthenvoy/2022/05/launch-o...
2022-01-21,"World leaders renew commitment to “Youth, Peac...",Governments announce new actions to meaningful...,https://www.un.org/youthenvoy/2022/01/world-le...
2021-06-18,First-ever Global Report on Protecting Young P...,The launch of the report comes alongside a hig...,https://www.un.org/youthenvoy/2021/06/first-ev...
2020-12-09,Young Human Rights Defenders Adapting to COVID-19,Celebrating Youth Resilience & Creativity on I...,https://www.un.org/youthenvoy/2020/12/young-hu...
2020-12-04,"On International Volunteers Day, Meet Our Firs...","5 December 2020 Earlier this year, the Off...",https://www.un.org/youthenvoy/2020/12/on-inter...
2020-11-25,Breaking gender barriers for young women parli...,Twenty-five years after the Beijing Declaratio...,https://www.un.org/youthenvoy/2020/11/breaking...
2020-04-14,Statement Issued by the Office of the UN Secre...,The Office of the UN Secretary-General’s Envoy...,https://www.un.org/youthenvoy/2020/04/statemen...
2020-02-05,Want to be a Fellow? Apply Now! No more unpaid...,The Office of the Secretary-General’s Envoy on...,https://www.un.org/youthenvoy/2020/02/want-to-...
2019-12-24,UN Secretary-General’s Envoy on Youth visits G...,"In October 2019, the United Nations Secretary-...",https://www.un.org/youthenvoy/2019/12/un-secre...
2019-11-29,Summer of Solutions,As the effects of climate change become more a...,https://www.un.org/youthenvoy/2019/11/summer-o...
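
The conversion above lets pandas infer the date format. A small sketch of a stricter variant follows, assuming the dates always look like `11 May 2022` (day, full English month name, year). The three rows are illustrative, with one deliberately malformed value.

```python
import pandas as pd

# Two date strings in the style of the table above, plus one malformed value
raw = pd.DataFrame({
    "UN_News": ["Launch", "Defenders", "Broken"],
    "Date": ["11 May 2022", "9 December 2020", "not a date"],
})

# Explicit day-month-year format; errors="coerce" turns unparseable values into NaT
parsed = pd.to_datetime(raw["Date"], format="%d %B %Y", errors="coerce")

# Non-mutating equivalent of set_index(..., inplace=True) followed by drop(...)
indexed = raw.assign(Date=parsed).set_index("Date")
print(indexed)
```

Stating the format makes an unexpected date layout show up as `NaT` instead of being silently misread.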


### Collecting a range of news-posts from the UN site
We first need to find out how many news-posts have been published on the site so far, and hence how many pages there are. Then we can repeat the collection steps above for each news-page.

In [5]:
pages_url = url + 'page/'  # paginated listings live under .../news/page/<n>

print("Indication of the total number of news-posts in the site:")
soup.find('a', attrs={'title':'News stories'}).text

Indication of the total number of news-posts in the site:


'News for Youth (312)'

Since we collected *10 posts* on page 1 of the site, there should be **32 pages** of news-posts in total (312 / 10 = 31.2, rounded up).

In [6]:
total_posts = 312
posts_per_page = UN_news_df.shape[0]
post_pages, remaining_posts = divmod(total_posts, posts_per_page)
post_pages += 1 if remaining_posts else 0  # a partly filled last page still counts

print(f"Total number of News-posts \t= {total_posts},\n"+
      f"Number of News-posts per page \t= {posts_per_page},\n"+
      f"Total number of post-pages \t= {post_pages}\n")

Total number of News-posts 	= 312,
Number of News-posts per page 	= 10,
Total number of post-pages 	= 32
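
The post count was typed in by hand above. As a small sketch, it could instead be read from the link text with a regular expression and turned into a page count with ceiling division. The variable names here are illustrative, and the label is the string returned by the lookup above.

```python
import math
import re

label = "News for Youth (312)"   # text returned by the link lookup above
posts_on_first_page = 10          # posts found on page 1 above

# Pull the number between the parentheses; fall back to 0 if the label changes
match = re.search(r"\((\d+)\)", label)
total = int(match.group(1)) if match else 0

# Ceiling division: a partly filled last page still counts as a page
pages = math.ceil(total / posts_on_first_page)
print(total, pages)
```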



#### A function to get News-posts from a single page or a range of pages

In [7]:
import numpy as np

def UN_news(url='https://www.un.org/youthenvoy/news/page/%d', page=1, to_page = 0, as_df=True, save=False):
    
    if not isinstance(page, int):
        raise ValueError(f"Invalid {type(page)} page input, only numbers (integers) allowed.")
        
    url = url%page if page >= 0 else url.replace('page/%d','')
    
    # Collecting page-content via BeautifulSoup
    req = ur.Request(url, headers = {'User-Agent': 'Someone'})
    page_content = ur.urlopen(req).read()
    soup = BeautifulSoup(page_content, "html.parser")
    cards = soup.find_all("div", attrs={'class':'fusion-post-content post-content'})
    
    # Loop-collection of details per News-post on page.
    UN_news_cards = []
    for i, card in enumerate(cards, 1):

        card_link = card.find('a', href=True)
        title = card_link.text.strip() # Or: card.find('h2').text
        date = card.find('span').text.strip()
        subtext = card.find('div',
                            attrs={'class':'fusion-post-content-container'}).text.strip()
        card_url = card_link['href']

        UN_news_cards.append([title, date, subtext, card_url])
    
    # Pages-Recursion: Add more pages of news-posts until the final page ("to_page")
    if to_page > page:
        for i, pg in enumerate(range(page+1, to_page+1), 1):
            page_posts = UN_news(page=pg, as_df=False, save=True)
            UN_news_cards += page_posts
            
            page_in10s = i%10
            
            if page_in10s == 0:
                if i < 11:
                    print("Scraped the first 10 pages.")
                else:
                    print("Scraped the next 10 pages.")
            
            if i == (to_page - page):  # the loop above runs to_page - page times
                print("Last page reached! (all scraped)")
    
    # If we want to return our results as a dataframe (date-time indexed)
    if as_df: 
        UN_news_df = pd.DataFrame(UN_news_cards, columns=['UN_News','Date','Subtext','Link'])
        datetime_index = pd.to_datetime(UN_news_df.Date)

        UN_news_df.set_index(datetime_index, inplace=True)
        UN_news_df.drop('Date', axis=1, inplace=True)
        
        UN_news_cards = UN_news_df
    
    if save: # This is mostly for saving progress of iterative results (multiple page scraping)
        np.save('news_from_UN', {'posts':UN_news_cards}, allow_pickle=True)
    
    return UN_news_cards
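
In the function above, each inner call saves only its own page, so `news_from_UN.npy` ends up holding the most recent page, not the cumulative progress. Below is a minimal sketch of a plain-loop alternative that checkpoints everything gathered so far and can pause between requests. `scrape_pages`, `fetch_page` and `fake_fetch` are hypothetical names: `fetch_page` stands in for the single-page part of `UN_news`, and the fake version lets the sketch run without network access.

```python
import time

def scrape_pages(fetch_page, first, last, delay=0.0, checkpoint=None):
    """Collect rows from pages first..last (inclusive) with a plain loop.

    fetch_page: callable taking a page number and returning a list of rows.
    checkpoint: optional callable receiving a copy of all rows gathered so far.
    """
    rows = []
    for page in range(first, last + 1):
        rows += fetch_page(page)
        if checkpoint is not None:
            checkpoint(list(rows))   # cumulative progress, not just this page
        if delay and page < last:
            time.sleep(delay)        # pause between requests
    return rows

# Offline stand-in for a real page fetch
def fake_fetch(page):
    return [[f"title {page}-{n}", "1 January 2020", "text", f"link-{page}-{n}"]
            for n in range(2)]

saved = []
all_rows = scrape_pages(fake_fetch, first=2, last=4, checkpoint=saved.append)
print(len(all_rows), len(saved))
```

Passing the fetcher in as an argument keeps the loop testable offline. With the real scraper, `checkpoint` could wrap the `np.save` call used above.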

#### Testing our function to collect News-posts of the next two pages

In [8]:
from ttictoc import tic, toc # To track code run-time

tic()
pages_2to3 = UN_news(page=2, to_page=3)

print(f"Total time taken to scrape news pages 2 - 3: {round(toc(), 3)} seconds")

Total time taken to scrape news pages 2 - 3: 35.135 seconds


In [9]:
pages_2to3

Date,UN_News,Subtext,Link
2019-06-25,UN Youth Envoy at Lisboa+21 Conference,"21 years ago in 1998, the Lisbon Declaration w...",https://www.un.org/youthenvoy/2019/06/un-youth...
2019-06-22,UN Youth Envoy Visit to West Bank and Gaza,"On June 19, I arrived on my first visit to Eas...",https://www.un.org/youthenvoy/2019/06/un-youth...
2019-06-20,UN Youth Envoy visit to Jordan,I recently returned from my mission to Jordan....,https://www.un.org/youthenvoy/2019/06/un-youth...
2019-06-05,UN Youth Envoy at Women Deliver Conference 2019,"At the beginning of June, I had the opportunit...",https://www.un.org/youthenvoy/2019/06/130275/
2019-05-21,UN Youth Envoy at Global Platform for Disaster...,In a world where more disasters are caused by ...,https://www.un.org/youthenvoy/2019/05/un-youth...
2019-05-10,Speak Up to Save Lives – The 5th UN Global Roa...,"According to UN, nearly 1.3 million people di...",https://www.un.org/youthenvoy/2019/05/speak-up...
2019-05-10,"UN Youth Envoy Visit to Belgrade, Pristina and...",Three cities in one week! I just came back fro...,https://www.un.org/youthenvoy/2019/05/un-youth...
2019-05-07,2019 ECOSOC Youth Forum,"On 8-9 April 2019, the 8th Economic and Social...",https://www.un.org/youthenvoy/2019/05/2019-eco...
2019-03-27,UN Secretary-General’s Envoy on Youth at the 6...,The sixty-third session of the Commission on t...,https://www.un.org/youthenvoy/2019/03/un-secre...
2019-02-11,First International Symposium on Youth Partici...,"Dear friends and fellow young people, I am exc...",https://www.un.org/youthenvoy/2019/02/first-in...


The above collection of the next **two pages of news-posts** took about **35 seconds** of scraping time. At that rate, the remaining **29 pages** of news-posts on the site should take around **8 minutes** (507.5 seconds). Let's find out:

In [10]:
tic()

next_pages = UN_news(page=4, to_page=32)

end_time = round(toc(), 3)

print(f"Total time taken to scrape the next 29 pages: {end_time} seconds! ({end_time/60} minutes)")

Scraped the first 10 pages.
Scraped the next 10 pages.
Total time taken to scrape the next 29 pages: 487.209 seconds! (8.12015 minutes)
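
The estimate made before this run and the measured time can be compared with a few lines of arithmetic. The sketch below uses only the figures quoted in this notebook, and also shows the standard library's `time.perf_counter` as an alternative to `tic()`/`toc()`. The summed range is just a placeholder workload.

```python
import time

# Figures quoted in this notebook
seconds_per_page = 35 / 2          # about 35 seconds for pages 2-3
estimate_s = seconds_per_page * 29 # 29 remaining pages
measured_s = 487.209               # time reported by the run above

print(f"estimated: {estimate_s} s, measured: {measured_s} s")
print(f"measured time per page: {measured_s / 29:.2f} s")

# Standard-library timing pattern
start = time.perf_counter()
sum(range(1000))                   # placeholder workload
elapsed = time.perf_counter() - start
```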


It did indeed take about 8 minutes to collect the rest of the news-posts. Finally, let's combine the dataframes above into one complete table of all news-posts from the site.

In [11]:
next_pages

Date,UN_News,Subtext,Link
2017-04-17,Young entrepreneurs in Tanzania: Where are the...,Via ILO The ILO Kazi Nje Nje BDS apprenticeshi...,https://www.un.org/youthenvoy/2017/04/young-en...
2017-04-17,Exploring opportunities to build youth capacity,"Via UN Volunteers Mr Toily Kurbanov, Deputy ...",https://www.un.org/youthenvoy/2017/04/explorin...
2017-04-05,‘Traditional leaders stand up to protect youth...,"Via UNFPA SHISELWENI, Swaziland – “We have hig...",https://www.un.org/youthenvoy/2017/04/traditio...
2017-03-30,From where I stand: “We barely had a meal in t...,From UN Women Florence Luanda Maheshe found he...,https://www.un.org/youthenvoy/2017/03/stand-ba...
2017-03-23,Young women call for increased women’s leaders...,From UN Women Young women attending the 61st s...,https://www.un.org/youthenvoy/2017/03/young-wo...
...,...,...,...
2013-08-26,UN Envoy on Youth Announces Partnership with I...,"On 17 July 2013, the UN Secretary-General’s En...",https://www.un.org/youthenvoy/2013/08/envoy-on...
2013-08-25,"In Venezuela, Music Provides Hope for Impoveri...","Nehyda Alas, a teacher and director of Venezue...",https://www.un.org/youthenvoy/2013/08/in-venez...
2013-08-25,UN Envoy on Youth Attends the Child and Youth ...,"On 7 May, UN Secretary General’s Envoy on Yout...",https://www.un.org/youthenvoy/2013/08/envoy-on...
2013-08-25,UNV youth volunteers welcomed at Czech Republi...,by Veronika Jemelikova Seven young Czech natio...,https://www.un.org/youthenvoy/2013/08/unv-yout...


### All News-posts from the UN site (until the latest from May-11th 2022)

In [12]:
pd.concat([UN_news_df, pages_2to3, next_pages])

Date,UN_News,Subtext,Link
2022-05-11,"Launch of “Be Seen, Be Heard” Campaign",The United Nations Secretary-General’s Envoy o...,https://www.un.org/youthenvoy/2022/05/launch-o...
2022-01-21,"World leaders renew commitment to “Youth, Peac...",Governments announce new actions to meaningful...,https://www.un.org/youthenvoy/2022/01/world-le...
2021-06-18,First-ever Global Report on Protecting Young P...,The launch of the report comes alongside a hig...,https://www.un.org/youthenvoy/2021/06/first-ev...
2020-12-09,Young Human Rights Defenders Adapting to COVID-19,Celebrating Youth Resilience & Creativity on I...,https://www.un.org/youthenvoy/2020/12/young-hu...
2020-12-04,"On International Volunteers Day, Meet Our Firs...","5 December 2020 Earlier this year, the Off...",https://www.un.org/youthenvoy/2020/12/on-inter...
...,...,...,...
2013-08-26,UN Envoy on Youth Announces Partnership with I...,"On 17 July 2013, the UN Secretary-General’s En...",https://www.un.org/youthenvoy/2013/08/envoy-on...
2013-08-25,"In Venezuela, Music Provides Hope for Impoveri...","Nehyda Alas, a teacher and director of Venezue...",https://www.un.org/youthenvoy/2013/08/in-venez...
2013-08-25,UN Envoy on Youth Attends the Child and Youth ...,"On 7 May, UN Secretary General’s Envoy on Yout...",https://www.un.org/youthenvoy/2013/08/envoy-on...
2013-08-25,UNV youth volunteers welcomed at Czech Republi...,by Veronika Jemelikova Seven young Czech natio...,https://www.un.org/youthenvoy/2013/08/unv-yout...
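
`pd.concat` stacks the frames as they are. If a new post were published while scraping, posts could shift between pages and the same post might be collected twice. The sketch below shows one way to check for that and restore newest-first order. It uses made-up rows and assumes the `Link` column identifies a post uniquely.

```python
import pandas as pd

def frame(dates, links):
    """Build a tiny date-indexed frame shaped like the ones above (made-up rows)."""
    return pd.DataFrame({"Link": links},
                        index=pd.to_datetime(dates).rename("Date"))

part_a = frame(["2022-05-11", "2022-01-21"], ["link-1", "link-2"])
part_b = frame(["2022-01-21", "2021-06-18"], ["link-2", "link-3"])  # overlaps part_a

combined = pd.concat([part_a, part_b])

# Drop posts scraped twice (same Link), then order newest first
deduped = (combined[~combined["Link"].duplicated()]
           .sort_index(ascending=False))
print(len(combined), len(deduped))
```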


#### Saving our dataframe as a dictionary-contained numpy-file
We could save the data in other file types too (CSV, Excel, etc.); NumPy's `.npy` format is chosen here with the aim of keeping the file small. Saving the dataframe inside a dictionary, which NumPy pickles, keeps its column names and attributes (assigned below) intact when the data is loaded later.

In [13]:
all_newsposts = pd.concat([UN_news_df, pages_2to3, next_pages])
all_newsposts.attrs = {'news_url':url, 'news_pages':post_pages}

np.save('news_from_UN', {'df' : all_newsposts}, allow_pickle = True)

print("Saved! Asalamualaikum.")

Saved! Asalamualaikum.
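
To read the file back, the dictionary has to be unwrapped from the 0-d object array that `np.save` creates. Below is a minimal round-trip sketch using a temporary directory and a made-up two-row frame in place of `all_newsposts`; the file names are illustrative. Loading with `allow_pickle=True` should only be done with files from a trusted source.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up stand-in for all_newsposts
demo_df = pd.DataFrame({"UN_News": ["A", "B"], "Link": ["link-a", "link-b"]},
                       index=pd.to_datetime(["2022-05-11", "2022-01-21"]).rename("Date"))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "news_demo.npy")
    np.save(path, {"df": demo_df}, allow_pickle=True)

    # np.save wrapped the dict in a 0-d object array; .item() unwraps it
    loaded = np.load(path, allow_pickle=True).item()["df"]

    # Plain-text export for comparison
    csv_path = os.path.join(tmp, "news_demo.csv")
    loaded.to_csv(csv_path)
    csv_written = os.path.exists(csv_path)

print(loaded.equals(demo_df))
```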
