# UN Treaty Collection Scraper

## Project Overview
This notebook scrapes data from the UN Treaty Collection website to gather information about countries' participation in international treaties, specifically focusing on the Convention on the Prevention and Punishment of the Crime of Genocide (Paris, 9 December 1948).

## Data Collection Goals
We aim to extract the following information for each country:
1. Country name
2. Signature status (signed or not signed)
3. Approval status (ratification, accession, or succession)

## Methodology

### Signature Status Detection
- Extract country names from the first column of the table
- Check if the signature field (the `<td>` element in the second column) contains actual data or only a non-breaking space (`&nbsp;`)
- If data exists → country has signed the treaty
- If `&nbsp;` exists → country has not signed the treaty
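
The check described above can be sketched on plain strings, without touching the network. This is a minimal illustration, not the notebook's exact code: the helper names `cell_has_data` and `signature_status` are hypothetical, and the sample inputs are made up. It relies on the fact that `&nbsp;` decodes to the character `\xa0`, which Python's `str.strip()` treats as whitespace.

```python
import html

def cell_has_data(raw_cell):
    """Return True if a table cell holds visible content.

    raw_cell is the inner HTML or text of a <td> element (an assumed
    input format for this sketch).
    """
    # html.unescape turns the entity &nbsp; into the character '\xa0';
    # str.strip() treats '\xa0' as whitespace and removes it.
    return bool(html.unescape(raw_cell).strip())

def signature_status(raw_cell):
    """Map a signature cell to the labels used in this notebook."""
    return "signed" if cell_has_data(raw_cell) else "not signed"

print(signature_status("11 Dec 1948"))
print(signature_status("&nbsp;"))
```

Because the entity is decoded before stripping, the same helper works whether the cell arrives as raw HTML (`&nbsp;`) or as already-parsed text (`\xa0`).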

### Approval Status Detection
- Extract the ratification/accession/succession information from the third column
- Apply the same logic to determine if a country has approved the treaty
- Record the specific type of approval when available

## Note on Treaty Approval Types
Different approval mechanisms reflect various ways countries commit to treaties:

- **Ratification**: When a country has signed the treaty and later formally approves it
- **Accession (a)**: When a country directly joins a treaty without having signed it first
- **Succession (d)**: When a newly independent country declares it will continue to be bound by a treaty that applied to its territory before independence

These mechanisms achieve the same legal effect (the country is bound by the treaty) but differ in how the country came to be bound: after signing, without signing, or by continuing obligations that applied before independence.
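
As a rough sketch of how the approval type could be recorded, the helper below reads the suffix letter from the text of the third column. It assumes the cell text is a date optionally followed by a single letter (for example `22 Mar 1956 a`), and that a missing suffix means ratification; the function name `approval_type` and the sample strings are hypothetical.

```python
import re

# Suffix letters as listed in the note above. Treating a missing suffix
# as ratification is an assumption of this sketch.
SUFFIX_TO_TYPE = {"a": "accession", "d": "succession"}

def approval_type(cell_text):
    """Return the approval type for a third-column cell, or None if it is empty."""
    text = cell_text.replace("\xa0", " ").strip()
    if not text:
        return None
    # Look for a standalone trailing letter, e.g. the "a" in "22 Mar 1956 a".
    match = re.search(r"\s([A-Za-z])$", text)
    if match:
        # Letters other than "a" and "d" are reported as "unknown".
        return SUFFIX_TO_TYPE.get(match.group(1), "unknown")
    return "ratification"

print(approval_type("14 Oct 1950"))
print(approval_type("22 Mar 1956 a"))
print(approval_type("12 Mar 2001 d"))
```

The lookup is case-sensitive on purpose, so an unexpected letter is reported as `unknown` rather than silently mapped to one of the three types.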

## Data Structure
The collected data will be structured as follows:
```json
{
    "country-name": "Country Name",
    "signature-status": "signed/not signed",
    "approval-status": "approved/not-stated"
}
```
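
A small sketch of how records with this shape might be built and serialized. The helper `make_record` and the sample country names are hypothetical; only the three keys and their allowed values come from the structure above.

```python
import json
from typing import TypedDict

# The hyphenated keys are not valid Python identifiers, so the
# functional TypedDict syntax is used instead of a class body.
CountryRecord = TypedDict(
    "CountryRecord",
    {"country-name": str, "signature-status": str, "approval-status": str},
)

def make_record(name: str, signed: bool, approved: bool) -> CountryRecord:
    """Build one record using the labels defined in the structure above."""
    return {
        "country-name": name,
        "signature-status": "signed" if signed else "not signed",
        "approval-status": "approved" if approved else "not-stated",
    }

records = [
    make_record("Country A", signed=True, approved=True),
    make_record("Country B", signed=False, approved=True),
]

# ensure_ascii=False keeps accented country names readable in the output.
print(json.dumps(records, indent=4, ensure_ascii=False))
```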

In [None]:
# Import requests, BeautifulSoup, and re
import requests
from bs4 import BeautifulSoup
import re

In [None]:
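url = "https://treaties.un.org/Pages/ViewDetails.aspx?src=TREATY&mtdsg_no=IV-1&chapter=4&clang=_en"
# Fetch the page and fail early on HTTP errors instead of parsing an error page
response = requests.get(url, timeout=30)
response.raise_for_status()
html = response.content
doc = BeautifulSoup(html, 'html.parser')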

In [None]:
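# find() returns None when the id is missing, so check before calling find_all()
table = doc.find(id='ctl00_ctl00_ContentPlaceHolder1_ContentPlaceHolderInnerPage_tblgrid')
tableRows = table.find_all('tr') if table else []

if tableRows:
    print(tableRows)
else:
    print("Element not found")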

The data should be structured as follows:

{
    'treaty-name': 'treaty name',
    'treaty-code': 'treaty code',
    'country-name': 'country name',
}
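
One possible way to fill the treaty code field is to read it from the page URL. This sketch assumes that the `mtdsg_no` query parameter can serve as the treaty code, which the notebook does not confirm; the helper name `treaty_code_from_url` and the sample country name are hypothetical.

```python
from urllib.parse import urlparse, parse_qs

URL = "https://treaties.un.org/Pages/ViewDetails.aspx?src=TREATY&mtdsg_no=IV-1&chapter=4&clang=_en"

def treaty_code_from_url(url):
    """Return the mtdsg_no query parameter, assumed here to act as the treaty code."""
    params = parse_qs(urlparse(url).query)
    values = params.get("mtdsg_no")
    return values[0] if values else None

record = {
    "treaty-name": "Convention on the Prevention and Punishment of the Crime of Genocide",
    "treaty-code": treaty_code_from_url(URL),
    "country-name": "Country A",  # placeholder value
}
print(record)
```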

In [None]:
import re
status_list = []

treaty_name = doc.find('div', class_= 'treatyCenter').text
page_title = re.sub(r'\s+', ' ', treaty_name)
treaty_title = re.sub(r'^\s*\d+\.\s*', '', page_title).strip()

for row in tableRows[1:]:
    cells = row.find_all('td')
    if len(cells) < 3:
        continue  # skip rows that lack the three expected columns
    country_name = cells[0].get_text().strip()

    # BeautifulSoup decodes &nbsp; to '\xa0', which strip() removes,
    # so an empty string means the cell holds no data.
    signed = cells[1].get_text().strip()
    sig = "signed" if signed else "not signed"

    ratification = cells[2].get_text().strip()
    status = "approved" if ratification else "not-stated"
    info_dec = {
        "country-name": country_name,
        "signature-status": sig,
        "approval-status": status
    }
    status_list.append(info_dec)

print(status_list)

Notes: The page lists Vietnam as "Viet Nam"; this should be replaced with "Vietnam". Some country names are followed by numbers that refer to other sections of the webpage; these numbers should be removed.
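
A minimal sketch of the cleanup described in the notes. It assumes the reference numbers appear at the end of the name, optionally separated by commas; the helper name `clean_country_name`, the replacement table, and the sample inputs are hypothetical.

```python
import re

# Spellings to normalize; only the case named in the notes is listed.
NAME_REPLACEMENTS = {"Viet Nam": "Vietnam"}

def clean_country_name(raw_name):
    """Strip trailing reference numbers and apply spelling replacements."""
    # Collapse whitespace, including non-breaking spaces.
    text = re.sub(r"\s+", " ", raw_name.replace("\xa0", " ")).strip()
    # Drop trailing reference numbers such as " 3, 4" or "5".
    text = re.sub(r"\s*\d+(?:\s*,\s*\d+)*$", "", text)
    return NAME_REPLACEMENTS.get(text, text)

print(clean_country_name("Example State 3, 4"))
print(clean_country_name("Viet Nam"))
```

The helper could be applied to `country_name` before each record is appended to `status_list`.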