In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


# **Making a GET request:**



A GET request is how your browser asks a server for the content associated with a specific URL. Below we make a GET request to fetch content from the Meetings Coverage and Press Releases website of the United Nations. The requests.get method sends a GET request to the https://press.un.org/en/ URL. You are essentially mimicking what happens when you click on a link in a web browser.

![UN Website](https://github.com/IshitaGopal/TRIADS_workshops/raw/main/intro_to_web_scraping/un_homepage.png)


In [2]:
# Make a simple GET request
response = requests.get("https://press.un.org/en/")


We can check the status code of our request. If the status code is 200, it means the request was successful. 😄

In [3]:
response

<Response [200]>

If I request content from a website or page that does not exist I will get a 404. 😞

The status code 404 is associated with "Not Found."
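Status codes are standardized, so you can look up what any code means rather than memorizing them. The sketch below uses Python's built-in `http.HTTPStatus` to turn a numeric code into its standard phrase and a rough category. The helper name `describe_status` is made up for this illustration and is not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the standard phrase and broad category for an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    if code < 200:
        kind = "informational"
    elif code < 300:
        kind = "success"
    elif code < 400:
        kind = "redirection"
    elif code < 500:
        kind = "client error"
    else:
        kind = "server error"
    return f"{code} {status.phrase}: {kind}"

print(describe_status(200))  # 200 OK: success
print(describe_status(404))  # 404 Not Found: client error
```

With a real response object, `response.status_code` gives you the integer to pass in, and `response.raise_for_status()` raises an exception for 4xx and 5xx codes.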



![UN Website](https://github.com/IshitaGopal/TRIADS_workshops/raw/main/intro_to_web_scraping/bad_request.png)



In [None]:
requests.get("https://press.un.org/en/random")

<Response [404]>

You can access the raw HTML content of an HTTP response using the response.content attribute.

In [None]:
response.content[:2000]

b'<!DOCTYPE html>\n<html lang="en" dir="ltr">\n  <head>\n    <meta charset="utf-8" />\n<link rel="canonical" href="https://press.un.org/en" />\n<link rel="shortlink" href="https://press.un.org/en" />\n<meta name="MobileOptimized" content="width" />\n<meta name="HandheldFriendly" content="true" />\n<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no" />\n<meta http-equiv="x-ua-compatible" content="ie=edge" />\n<link rel="icon" href="/themes/custom/un3_press/favicon.ico" type="image/vnd.microsoft.icon" />\n<link rel="alternate" hreflang="en" href="https://press.un.org/en" />\n<link rel="alternate" hreflang="fr" href="https://press.un.org/fr" />\n<link rel="preload" href="/themes/custom/un3/un3_base/node_modules/%40fontsource/roboto/files/roboto-latin-400-normal.woff2" as="font" type="font/woff2" crossorigin />\n<link rel="preload" href="/themes/custom/un3/un3_base/node_modules/%40fontsource/roboto/files/roboto-latin-400-italic.woff2" as="font" type="font/
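The `b'...'` prefix in the output shows that `response.content` is a `bytes` object rather than a regular string. As a small illustration, the snippet below decodes a short, made-up byte string (a stand-in for a real response body) into text, assuming the page is UTF-8 encoded, as the `<meta charset="utf-8" />` tag above declares.

```python
# A made-up stand-in for response.content (raw bytes sent by the server)
raw = b'<!DOCTYPE html>\n<html lang="fr">\n<title>Fran\xc3\xa7ais</title>\n</html>'

# Decoding turns the bytes into a regular Python string
decoded = raw.decode("utf-8")

print(type(raw).__name__)       # bytes
print(type(decoded).__name__)   # str
print(decoded.splitlines()[2])  # <title>Français</title>
```

`requests` also offers `response.text`, which is the already-decoded string version of the same body.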

# **Converting the HTTP response into something beautiful:**
Now, we will use the **BeautifulSoup library in Python** to parse and extract information from the HTML content received above.

We will create a BeautifulSoup object, which is a data structure representing the HTML document. The BeautifulSoup object allows you to navigate and search through the HTML content easily.

Specifically, we will use a specialized **HTML parser** in BeautifulSoup ('html.parser') to analyze HTML code, break it down into its constituent elements, and build a data structure that represents the hierarchical structure of the document. This data structure is typically a tree-like representation called the **Document Object Model (DOM)**. Each element in the DOM corresponds to an HTML tag, and the relationships between elements reflect the nesting structure of the HTML document.

In [None]:
soup = BeautifulSoup(response.content, 'html.parser')
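To make the tree idea concrete, here is a small sketch that parses a tiny hand-written HTML snippet (not the UN page) and walks up and down the resulting tree. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document, used only for illustration
tiny_html = "<html><body><div><p>Hello <b>world</b></p></div></body></html>"
tiny_soup = BeautifulSoup(tiny_html, "html.parser")

bold = tiny_soup.find("b")

# Walk up the tree from <b> to the root, collecting tag names
ancestors = [parent.name for parent in bold.parents]
print(ancestors)  # ['p', 'div', 'body', 'html', '[document]']

# Walk down: the direct children of <p> are a piece of text and the <b> tag
children = list(tiny_soup.find("p").children)
print(len(children))  # 2
```

The last ancestor, `[document]`, is the BeautifulSoup object itself, which sits at the root of the tree.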


We can look at this tree-like representation:

In [None]:
# Get the pretty version of the HTML content
pretty_html = soup.prettify()

# Print only the first few lines (let's say 100 lines)
print('\n'.join(pretty_html.splitlines()[:100]))


<!DOCTYPE html>
<html dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://press.un.org/en" rel="canonical"/>
  <link href="https://press.un.org/en" rel="shortlink"/>
  <meta content="width" name="MobileOptimized"/>
  <meta content="true" name="HandheldFriendly"/>
  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
  <meta content="ie=edge" http-equiv="x-ua-compatible"/>
  <link href="/themes/custom/un3_press/favicon.ico" rel="icon" type="image/vnd.microsoft.icon"/>
  <link href="https://press.un.org/en" hreflang="en" rel="alternate"/>
  <link href="https://press.un.org/fr" hreflang="fr" rel="alternate"/>
  <link as="font" crossorigin="" href="/themes/custom/un3/un3_base/node_modules/%40fontsource/roboto/files/roboto-latin-400-normal.woff2" rel="preload" type="font/woff2"/>
  <link as="font" crossorigin="" href="/themes/custom/un3/un3_base/node_modules/%40fontsource/roboto/files/roboto-latin-400-italic.woff2" rel="preload" t

# **Navigating and extracting information from "soup":**

In [None]:
# Let's get the title tag of the page
soup.title

<title>Meetings Coverage and Press Releases | Meetings Coverage and Press Releases</title>

In [None]:
# We can also write this like:
soup.find("title")

## *Some common HTML tags:*
HTML tags give a web page its structure. Here are some of the most common tags and what they do:

1. **`<html>`:**
   - Defines the root of an HTML document.

2. **`<head>`:**
   - Contains metadata about the HTML document, such as the title, character set, and linked stylesheets.

3. **`<title>`:**
   - Sets the title of the HTML document, which appears in the browser's title bar or tab.

4. **`<body>`:**
   - Contains the main content of the HTML document, such as text, images, links, and other elements.

5. **`<h1> to <h6>`:**
   - Define heading levels, with `<h1>` being the highest level and `<h6>` the lowest.

6. **`<p>`:**
   - Represents a paragraph of text.

7. **`<a>`:**
   - Creates hyperlinks, linking to other pages or resources.

8. **`<img>`:**
   - Embeds images in the HTML document.

9. **`<ul>`:**
   - Defines an unordered list, typically used with `<li>` (list items) inside.

10. **`<ol>`:**
    - Defines an ordered list, where items are numbered, often used with `<li>` inside.

11. **`<li>`:**
    - Represents a list item within `<ul>` (unordered list) or `<ol>` (ordered list).

12. **`<div>`:**
    - Divides the HTML document into sections, often used for layout and styling.

13. **`<span>`:**
    - Defines a small section of text within a larger block of content.

14. **`<strong>` and `<em>`:**
    - `<strong>` is used to define strong importance or emphasis.
    - `<em>` is used to define emphasized text.

15. **`<br>`:**
    - Inserts a line break within text.

16. **`<hr>`:**
    - Represents a horizontal rule or line, often used to separate sections of content.

17. **`<form>`:**
    - Creates a form for user input, often used with input elements like text fields, buttons, and checkboxes.

18. **`<input>`:**
    - Defines an input field within a form.

19. **`<textarea>`:**
    - Defines a multiline text input field within a form.

20. **`<select>` and `<option>`:**
    - `<select>` creates a dropdown list within a form.
    - `<option>` defines an option within a dropdown list.

These tags provide the basic building blocks for structuring and styling web pages. Understanding how to use them effectively is crucial for creating well-organized and visually appealing HTML documents. Keep in mind that there are many more HTML tags, each serving a specific purpose, and their usage often depends on the requirements of the content and design.
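To see several of these tags working together, the sketch below counts the tags in a small hand-written page using Python's built-in `html.parser` module. The sample page and the `TagCounter` class are made up for this illustration; in the rest of this notebook BeautifulSoup does this kind of work for us.

```python
from collections import Counter
from html.parser import HTMLParser

sample_page = """
<html>
  <head><title>My page</title></head>
  <body>
    <h1>Shopping list</h1>
    <p>Things to <strong>buy</strong> from <a href="https://example.com">the shop</a>:</p>
    <ul>
      <li>Apples</li>
      <li>Bread</li>
    </ul>
  </body>
</html>
"""

class TagCounter(HTMLParser):
    """Counts how many times each opening tag appears."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1

counter = TagCounter()
counter.feed(sample_page)
print(counter.counts["li"])    # 2
print(sorted(counter.counts))  # alphabetical list of the tags that were found
```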

## Let's find more tags using soup.find()



In [None]:
# Find the first <a> (anchor) tag
first_link = soup.find('a')
first_link

<a class="skip-link" href="#main-content">Skip to main content / navigation</a>

In [None]:
# Find the first <p> (paragraph) tag
first_paragraph = soup.find('p')
first_paragraph

<p>Press releases, press conferences, statements, official travel</p>

In [None]:
# We can use the .text attribute to retrieve the text content of a tag
first_paragraph.text

'Press releases, press conferences, statements, official travel'
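Besides its text, a tag also carries attributes such as `href` or `class`. The sketch below reads both from a hand-written link modeled on the first `<a>` tag above (extra spaces were added on purpose to show the effect of `get_text(strip=True)`).

```python
from bs4 import BeautifulSoup

snippet = '<a class="skip-link" href="#main-content">  Skip to main content  </a>'
link = BeautifulSoup(snippet, "html.parser").find("a")

print(repr(link.text))            # text exactly as written, spaces included
print(link.get_text(strip=True))  # Skip to main content
print(link["href"])               # #main-content
print(link.get("title"))          # None, because there is no title attribute
print(link["class"])              # ['skip-link'] (class comes back as a list)
```

Square brackets raise a `KeyError` when the attribute is missing, while `.get()` quietly returns `None`, so `.get()` is the safer choice when you are not sure an attribute exists.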

## Now let's use soup.find_all() to search for all occurrences of a particular HTML tag within our parsed document.

It returns a ResultSet, which is essentially a list containing all the matching tags.

In [None]:
# Find all <a> tags in the document and display the first 10
all_links = soup.find_all('a')
all_links[:10]

[<a class="skip-link" href="#main-content">Skip to main content / navigation</a>,
 <a class="site-welcome navbar-brand text-body fs-default p-0" href="https://www.un.org/en" title="United Nations">
 <i class="fa-solid fa-house fs-4 me-m align-bottom"></i><span>Welcome to the United Nations</span>
 </a>,
 <a class="language-link nav-link" data-drupal-link-system-path="node/330972" href="/fr/frontpage" hreflang="fr">Français</a>,
 <a href="https://www.un.org/en" title="United Nations">
 <img alt="United Nations" src="/themes/custom/un3/un3_base/images/logos/UN_logo_en.svg"/>
 </a>,
 <a href="/en" rel="home" title="Meetings Coverage and Press Releases">
             Meetings Coverage and Press Releases
           </a>,
 <a class="language-link nav-link" data-drupal-link-system-path="node/330972" href="/fr/frontpage" hreflang="fr">Français</a>,
 <a href="/en" rel="home" title="Meetings Coverage and Press Releases">
             Meetings Coverage and Press Releases
           </a>,
 <a data

In [None]:
# Number of links in the page
print(f"There are {len(all_links)} links in this page")


There are 104 links in this page
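The difference between `find()` and `find_all()` matters most when nothing matches. Here is a minimal sketch on a hand-written list of links (not the UN page) that also shows the `href=True` filter, which keeps only tags that have an `href` attribute.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li><a href="/en">Home</a></li>
  <li><a href="/fr">Français</a></li>
  <li><a>No destination</a></li>
</ul>
"""
demo_soup = BeautifulSoup(html, "html.parser")

first = demo_soup.find("a")               # a single tag: the first match
every = demo_soup.find_all("a")           # a list-like ResultSet of all matches
missing = demo_soup.find("table")         # None when nothing matches
none_found = demo_soup.find_all("table")  # an empty list when nothing matches

# href=True keeps only the <a> tags that actually have an href attribute
with_href = demo_soup.find_all("a", href=True)

print(first.text, len(every), missing, len(none_found), len(with_href))
# Home 3 None 0 2
```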


In [None]:
# Extract links and link titles
# href=True skips any <a> tag without an href, which would otherwise raise a KeyError
links = [(a['href'], a.text) for a in soup.find_all('a', href=True)]
# Display the result
for link, title in links:
    print(f"Link: {link}\nTitle: {title}\n")


Link: #main-content
Title: Skip to main content / navigation

Link: https://www.un.org/en
Title: 
Welcome to the United Nations


Link: /fr/frontpage
Title: Français

Link: https://www.un.org/en
Title: 



Link: /en
Title: 
            Meetings Coverage and Press Releases
          

Link: /fr/frontpage
Title: Français

Link: /en
Title: 
            Meetings Coverage and Press Releases
          

Link: #CollapsingNavbar
Title: 

Link: /en
Title: Home

Link: /en/content/secretary-general
Title: Secretary-General

Link: /en/content/secretary-general
Title: Latest

Link: /en/content/secretary-general/press-release
Title: Press Releases

Link: /en/content/secretary-general/press-conference
Title: Press Conferences

Link: /en/content/general-assembly
Title: General Assembly

Link: /en/content/general-assembly
Title: Latest

Link: /en/content/general-assembly/meetings-coverage
Title: Meetings Coverage

Link: /en/content/general-assembly/press-release
Title: Press Releases

Link: /en/content

# **SelectorGadget and css selectors:**
Now, to make our lives even easier, we will use **CSS selectors *and* BeautifulSoup** to find and collect content. A **CSS selector** is a pattern that web pages use to pick out HTML elements for styling; the same pattern can be used to pick out elements for scraping. CSS selectors let you express complex selection criteria concisely, which often gives shorter and more readable code than nested .find() calls.
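As a rough comparison, the sketch below finds the same links twice on a hand-written page: once with nested `find` calls and once with a single CSS selector. The ids and class names here are made up for the illustration.

```python
from bs4 import BeautifulSoup

html = """
<div id="content">
  <h3 class="page-header"><a href="/a1">First headline</a></h3>
  <h3 class="page-header"><a href="/a2">Second headline</a></h3>
</div>
<div id="footer">
  <h3 class="page-header"><a href="/about">About</a></h3>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Nested find calls: locate the container, then the headers, then the links
container = demo_soup.find("div", id="content")
nested = [header.find("a") for header in container.find_all(class_="page-header")]

# One CSS selector: "#content" is an id, ".page-header" a class, "a" a tag name
selected = demo_soup.select("#content .page-header a")

print([a.text for a in selected])  # ['First headline', 'Second headline']
print([a["href"] for a in nested] == [a["href"] for a in selected])  # True
```

Both approaches skip the footer link because it does not sit inside the element with `id="content"`.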

We’re going to use a tool called **“SelectorGadget”** available here: https://selectorgadget.com/

When installed as an extension in the Chrome browser, it lets you point and click on what you want to find, and it builds a “css selector” that identifies the element or elements you need. Once it is installed, we turn on SelectorGadget by clicking on its icon. Then we select one of the links that we want to capture, which highlights everything currently being captured. (You can then click on things you do not want it to capture, if need be.)

We will then pass the "css selector" to "soup" to parse out the content we are interested in.


<img src="https://github.com/IshitaGopal/TRIADS_workshops/raw/main/intro_to_web_scraping/css_select.png"  width="1000">


But clicking on the first link also selects some things we don't want 🧐:

<img src="https://github.com/IshitaGopal/TRIADS_workshops/raw/main/intro_to_web_scraping/extra_selects.png"  width="1000">


We can unselect things we don't want captured by clicking on them. The box will turn red and the item will be deselected:

<img src="https://github.com/IshitaGopal/TRIADS_workshops/raw/main/intro_to_web_scraping/css_unselect.png"  width="1000">

# **Let's use the selector gadget and search "soup" using soup.select():**

First, let's get all of the headlines (displayed under the Meetings Coverage and Press Releases section of the website) that we selected above.


In [None]:
all_headlines = soup.select("#block-un3-press-content .page-header a")
all_headlines[:5]


 <a href="/en/2024/sc15612.doc.htm" hreflang="en">Briefing Security Council on Afghanistan, Special Representative Urges de Facto Authorities Reverse Repressive Policies towards Women</a>,
 <a href="/en/2024/gaab4453.doc.htm" hreflang="en">Hold Managers Responsible for Implementing Staff Recruitment Recommendations, Speakers Urge as Fifth Committee Reviews Progress in Strengthening UN Accountability Culture </a>,
 <a href="/en/2024/ga12586.doc.htm-0" hreflang="en">Veto of Security Council Resolution Calling for Ceasefire in Gaza Emboldens Israel to Continue Crimes against Palestinian people, Speakers Tell General Assembly</a>,
 <a href="/en/2024/ga12586.doc.htm" hreflang="en">General Assembly: plenary</a>]

In [None]:
# There are 19 headlines
len(all_headlines)

19

In [None]:
# Let's extract all the text in each of the returned items:
headline_text = [headline.text for headline in all_headlines]
headline_text[:5]

 'Briefing Security Council on Afghanistan, Special Representative Urges de Facto Authorities Reverse Repressive Policies towards Women',
 'Hold Managers Responsible for Implementing Staff Recruitment Recommendations, Speakers Urge as Fifth Committee Reviews Progress in Strengthening UN Accountability Culture ',
 'Veto of Security Council Resolution Calling for Ceasefire in Gaza Emboldens Israel to Continue Crimes against Palestinian people, Speakers Tell General Assembly',
 'General Assembly: plenary']

We could also return the headline along with the date it was published. We can use the selector gadget and click on additional items we want returned:


<img src="https://github.com/IshitaGopal/TRIADS_workshops/raw/main/intro_to_web_scraping/css_select_date_too.png"  width="1000">

In [None]:
all_headlinesDate = soup.select('#block-un3-press-content .datetime , #block-un3-press-content .page-header a')
all_headlinesDate[:5]
# Note that the relative link to the full article is returned in the href attribute

[<time class="datetime" datetime="2024-03-07T12:00:00Z">7 March 2024</time>,
 <time class="datetime" datetime="2024-03-06T12:00:00Z">6 March 2024</time>,
 <a href="/en/2024/sc15612.doc.htm" hreflang="en">Briefing Security Council on Afghanistan, Special Representative Urges de Facto Authorities Reverse Repressive Policies towards Women</a>,
 <time class="datetime" datetime="2024-03-06T12:00:00Z">6 March 2024</time>]
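The comma in a CSS selector means "match either pattern", and `select()` returns the matches in the order they appear in the document, which is why dates and headlines come back interleaved. Below is a sketch on a hand-written page; its structure, id, and `<article>` wrappers are assumptions for illustration, not the real UN markup. It also shows an alternative that pairs each date with its headline by looking inside one container at a time.

```python
from bs4 import BeautifulSoup

html = """
<div id="news">
  <article>
    <time class="datetime">7 March 2024</time>
    <h3 class="page-header"><a href="/story-1">Story one</a></h3>
  </article>
  <article>
    <time class="datetime">6 March 2024</time>
    <h3 class="page-header"><a href="/story-2">Story two</a></h3>
  </article>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Either pattern may match; results follow document order
mixed = demo_soup.select("#news .datetime, #news .page-header a")
print([tag.name for tag in mixed])  # ['time', 'a', 'time', 'a']

# Alternative: visit each container and pull the date and headline from inside it
pairs = [
    (article.select_one(".datetime").text, article.select_one(".page-header a").text)
    for article in demo_soup.select("#news article")
]
print(pairs)  # [('7 March 2024', 'Story one'), ('6 March 2024', 'Story two')]
```

The container-by-container version does not rely on dates and headlines alternating perfectly, so it holds up better if one item is missing a date.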

In [None]:
for i in range(1, len(all_headlinesDate), 2):
  print(i)

1
3
5
7
9
11
13
15
17
19
21
23
25
27
29
31
33
35
37
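The odd indices printed above are the positions of the headlines, and the even indices (0, 2, 4, ...) are the positions of the dates. The same split can be written with slice notation. Below is a minimal sketch on a plain list of made-up strings standing in for the alternating tags.

```python
# Made-up stand-in for an alternating [date, headline, date, headline, ...] list
items = ["7 March 2024", "Headline A", "6 March 2024", "Headline B",
         "6 March 2024", "Headline C"]

dates = items[0::2]      # every second item, starting at index 0
headlines = items[1::2]  # every second item, starting at index 1

# Pairing only makes sense if the two lists line up one-to-one
assert len(dates) == len(headlines), "dates and headlines are out of step"

for date, headline in zip(dates, headlines):
    print(date, "|", headline)
```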


In [None]:
# Let's extract all of this information into separate lists:
# Extracting dates from the list 'all_headlinesDate' with a step of 2 starting from index 0
headline_date = [all_headlinesDate[i].text for i in range(0, len(all_headlinesDate), 2)]
print(headline_date[:3])

# Extracting titles from the list 'all_headlinesDate' with a step of 2 starting from index 1
headline_title = [all_headlinesDate[i].text for i in range(1, len(all_headlinesDate), 2)]
print(headline_title[:3])

# Creating article links by combining the base URL and extracting href attributes
article_link = ["https://press.un.org" + all_headlinesDate[i]["href"] for i in range(1, len(all_headlinesDate), 2)]
print(article_link[:3])


['7 March 2024', '6 March 2024', '6 March 2024']
['https://press.un.org/en/2024/sc15613.doc.htm', 'https://press.un.org/en/2024/sc15612.doc.htm', 'https://press.un.org/en/2024/gaab4453.doc.htm']
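Joining strings with `+` works here because the headline links shown above are all relative paths that start with `/`. Python's standard library also has `urljoin`, which handles both relative and already-absolute links. A brief sketch:

```python
from urllib.parse import urljoin

base_url = "https://press.un.org"

# A relative link is resolved against the base URL
absolute = urljoin(base_url, "/en/2024/sc15612.doc.htm")
print(absolute)  # https://press.un.org/en/2024/sc15612.doc.htm

# A link that is already absolute is returned unchanged
unchanged = urljoin(base_url, "https://www.un.org/en")
print(unchanged)  # https://www.un.org/en
```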


In [None]:
# Let's convert all this data into a pandas dataframe
headline_df = pd.DataFrame({"headline_date":headline_date,
              "headline_title":headline_title,
              "article_link": article_link})

headline_df

| | headline_date | headline_title | article_link |
|---|---|---|---|
| 0 | 7 March 2024 | Honor Values of Ramadan in Sudan through Cessa... | https://press.un.org/en/2024/sc15613.doc.htm |
| 1 | 6 March 2024 | Briefing Security Council on Afghanistan, Spec... | https://press.un.org/en/2024/sc15612.doc.htm |
| 2 | 6 March 2024 | Hold Managers Responsible for Implementing Sta... | https://press.un.org/en/2024/gaab4453.doc.htm |
| 3 | 5 March 2024 | Veto of Security Council Resolution Calling fo... | https://press.un.org/en/2024/ga12586.doc.htm-0 |
| 4 | 5 March 2024 | General Assembly: plenary | https://press.un.org/en/2024/ga12586.doc.htm |
| 5 | 5 March 2024 | If Not Managed Carefully, South Sudan Election... | https://press.un.org/en/2024/sc15611.doc.htm |
| 6 | 5 March 2024 | Fifth Committee Examines $3 Million Financing ... | https://press.un.org/en/2024/gaab4452.doc.htm |
| 7 | 4 March 2024 | Amid ‘Catastrophic, Unconscionable, Shameful’ ... | https://press.un.org/en/2024/ga12585.doc.htm |
| 8 | 4 March 2024 | Syria’s Full Cooperation Essential to Closing ... | https://press.un.org/en/2024/sc15610.doc.htm |
| 9 | 7 March 2024 | Security Council 2140 Committee Discusses Work... | https://press.un.org/en/2024/sc15614.doc.htm |
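The dates in the table are still plain strings, so they cannot be sorted or compared as dates yet. One possible next step, sketched here with Python's built-in `datetime` module on a few of the date strings from the table, is to parse them. This assumes English month names, which is what the page uses.

```python
from datetime import datetime

date_strings = ["7 March 2024", "6 March 2024", "5 March 2024"]

# "%d %B %Y" means: day number, full month name, four-digit year
parsed = [datetime.strptime(s, "%d %B %Y").date() for s in date_strings]

print(parsed[0].isoformat())      # 2024-03-07
print(max(parsed) - min(parsed))  # 2 days, 0:00:00
```

pandas offers `pd.to_datetime`, which accepts the same `format` string, if you would rather convert the whole `headline_date` column at once.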


In [None]:
# We can loop through the collected article links and gather the text from each article.
# Separate variable names keep the homepage `response` and `soup` from being overwritten.
article_text = []
for link in article_link:
    article_response = requests.get(link)
    article_soup = BeautifulSoup(article_response.content, 'html.parser')
    text = " ".join([p.text for p in article_soup.select("#block-un3-press-content p")])
    article_text.append(text)
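When looping over many pages, it is considerate to pause between requests, and useful to keep going if one page fails. The sketch below shows that pattern with a stand-in `fetch` function so that it runs without a network connection. The names `collect_pages` and `fake_fetch`, and the delay value, are hypothetical choices for this illustration.

```python
import time

def collect_pages(links, fetch, delay_seconds=1.0):
    """Call fetch(link) for each link, pausing between calls.

    Failed pages are recorded as None so the result lines up with `links`.
    """
    results = []
    for position, link in enumerate(links):
        if position > 0:
            time.sleep(delay_seconds)  # be polite: do not hammer the server
        try:
            results.append(fetch(link))
        except Exception as error:
            print(f"Skipping {link}: {error}")
            results.append(None)
    return results

# A stand-in for a real download, used only to demonstrate the loop
def fake_fetch(link):
    if link.endswith("/missing"):
        raise ValueError("page not found")
    return f"text of {link}"

pages = collect_pages(["/a", "/missing", "/b"], fake_fetch, delay_seconds=0)
print(pages)  # ['text of /a', None, 'text of /b']
```

In real use, `fetch` could be a small function that calls `requests.get(link, timeout=10)`, then `raise_for_status()`, and returns the parsed article text. Keeping one result per link, even for failures, means the list can still be added to the dataframe as a column.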

In [None]:
# Let's add the text to our dataframe
headline_df["article_text"] = article_text
headline_df

| | headline_date | headline_title | article_link | article_text |
|---|---|---|---|---|
| 0 | 7 March 2024 | Honor Values of Ramadan in Sudan through Cessa... | https://press.un.org/en/2024/sc15613.doc.htm | April will mark one year since the outbreak of... |
| 1 | 6 March 2024 | Briefing Security Council on Afghanistan, Spec... | https://press.un.org/en/2024/sc15612.doc.htm | Calling for sustained international engagement... |
| 2 | 6 March 2024 | Hold Managers Responsible for Implementing Sta... | https://press.un.org/en/2024/gaab4453.doc.htm | Managers who don’t implement the recommendatio... |
| 3 | 5 March 2024 | Veto of Security Council Resolution Calling fo... | https://press.un.org/en/2024/ga12586.doc.htm-0 | The use of the Security Council veto by the Un... |
| 4 | 5 March 2024 | General Assembly: plenary | https://press.un.org/en/2024/ga12586.doc.htm | (Note: Due to the financial liquidity crisis ... |
| 5 | 5 March 2024 | If Not Managed Carefully, South Sudan Election... | https://press.un.org/en/2024/sc15611.doc.htm | While South Sudan is not currently ready to ho... |
| 6 | 5 March 2024 | Fifth Committee Examines $3 Million Financing ... | https://press.un.org/en/2024/gaab4452.doc.htm | The Fifth Committee (Administrative and Budget... |
| 7 | 4 March 2024 | Amid ‘Catastrophic, Unconscionable, Shameful’ ... | https://press.un.org/en/2024/ga12585.doc.htm | Lamenting the Security Council’s inability to ... |
| 8 | 4 March 2024 | Syria’s Full Cooperation Essential to Closing ... | https://press.un.org/en/2024/sc15610.doc.htm | Syria’s full cooperation with the Organisation... |
| 9 | 7 March 2024 | Security Council 2140 Committee Discusses Work... | https://press.un.org/en/2024/sc15614.doc.htm | On 23 February 2024, the Security Council Comm... |


# **Fin!**