# UN Human Rights Index Data Scraper

## Project Overview
This notebook implements a web scraper for the Universal Human Rights Index, focusing on extracting data from the Committee against Torture (CAT) and other UN human rights mechanisms. The scraper collects information about human rights recommendations and concerns addressed to various countries.

## Data Sources
- [Universal Human Rights Index - Committee against Torture](https://uhri.ohchr.org/en/search-human-rights-recommendations?mechanisms=c33cacab-3cce-4f17-85e7-d1498db6b1aa&mechanismsOpened=c33cacab-3cce-4f17-85e7-d1498db6b1aa)
- Special Rapporteur in the field of cultural rights

## Goals
1. Extract structured data about human rights recommendations
2. Identify which countries received recommendations
3. Categorize concerns by affected groups and human rights themes
4. Create a comprehensive dataset for analysis and visualization

## Data Structure
Each record in our dataset will follow this structure:
```json
{
    "year": "2022",
    "country": "Republic of Korea", 
    "term": "Fundamental legal safeguards",
    "concerned-group": [
        "Children", 
        "Youth & juveniles",
        "Persons deprived of their liberty & detainees"
    ],
    "human-rights-themes": [
        "Right to liberty and security of person",
        "Right to a fair trial",
        "Right not to be subjected to torture"
    ],
    "annotation-type": [
        "Recommendation",
        "Concern"
    ]    
}
```
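
A minimal sketch of how a scraped record could be brought into this shape. `normalize_record`, the field tuples, and the `"not provided"` default are illustrative assumptions, not part of the scraper below; the sketch assumes scraped strings may carry stray whitespace and that some fields may be missing.

```python
from typing import Any

# Field names follow the structure shown above.
TEXT_FIELDS = ("year", "country", "term")
LIST_FIELDS = ("concerned-group", "human-rights-themes", "annotation-type")


def normalize_record(raw: dict[str, Any], default: str = "not provided") -> dict[str, Any]:
    """Return a copy of `raw` with every field present and whitespace stripped."""
    record: dict[str, Any] = {}
    for field in TEXT_FIELDS:
        value = raw.get(field)
        cleaned = value.strip() if isinstance(value, str) else ""
        record[field] = cleaned or default
    for field in LIST_FIELDS:
        items = raw.get(field) or []
        cleaned_items = [v.strip() for v in items if isinstance(v, str) and v.strip()]
        record[field] = cleaned_items or [default]
    return record


example = normalize_record({
    "year": " 2022 ",
    "country": "Republic of Korea",
    "concerned-group": ["Children ", "", "Youth & juveniles"],
})
print(example)
```

Filling missing fields with one shared placeholder keeps later aggregation simple, because every record has the same keys.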

## Workflow
1. **Data Extraction**: Scrape data from the Universal Human Rights Index
2. **Data Processing**: Filter by year and organize data
3. **Data Export**: Save structured data to CSV format
4. **Database Integration**: Upload data to MongoDB for further analysis
5. **API Development**: Create an interface for data access
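
Steps 2 and 3 of this workflow can be sketched with the standard library alone. This is an illustration under assumptions: `extract_year`, `filter_by_year`, and `export_csv` are hypothetical helper names, the `year` field is assumed to contain a four-digit year somewhere in its text, and list values are joined with `"; "` for the CSV. The sample records are placeholders, not scraped data.

```python
import csv
import io
import re

YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_year(text):
    """Return the first four-digit year found in `text`, or None."""
    match = YEAR_PATTERN.search(text or "")
    return int(match.group(0)) if match else None


def filter_by_year(records, first, last):
    """Keep records whose year lies in the inclusive range [first, last]."""
    kept = []
    for record in records:
        year = extract_year(record.get("year"))
        if year is not None and first <= year <= last:
            kept.append(record)
    return kept


def export_csv(records, handle):
    """Write records to an open text handle, joining list values with '; '."""
    fields = ["year", "country", "term", "concerned-group"]
    writer = csv.DictWriter(handle, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        row = {k: "; ".join(v) if isinstance(v, list) else v for k, v in record.items()}
        writer.writerow(row)


sample = [
    {"year": "2022", "country": "Country A", "term": "Term 1",
     "concerned-group": ["Children", "Youth & juveniles"]},
    {"year": "2019", "country": "Country B", "term": "Term 2",
     "concerned-group": ["Children"]},
]

recent = filter_by_year(sample, 2022, 2024)
buffer = io.StringIO()
export_csv(recent, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```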

## Potential Applications
- Analyze patterns of human rights concerns across countries
- Track changes in recommendations over time
- Identify frequently affected groups by country/region
- Support human rights advocacy with data-driven insights

---

**Disclaimer**: The classification labels for Concerned persons/groups, Human Rights Themes and SDGs are provided for informational purposes only. While we strive for accuracy, these labels may not be exhaustive and should be considered as indicative insights rather than definitive categorizations.

In [None]:
import pandas as pd
from playwright.async_api import async_playwright

#### STEP 1

In [None]:
# setup
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)
page = await browser.new_page()
await page.goto('https://uhri.ohchr.org/en/search-human-rights-recommendations?mechanisms=c33cacab-3cce-4f17-85e7-d1498db6b1aa&mechanismsOpened=c33cacab-3cce-4f17-85e7-d1498db6b1aa')
await page.wait_for_load_state("domcontentloaded")


# wait for the results container so the elements queried below exist
await page.wait_for_selector('.result-section-lists-wrapper.row')

# getting information about the affected persons
ep_class = await page.query_selector('.affected-persons')
ep_element = await ep_class.query_selector_all('li')
ep_title = []
for ep in ep_element:
    ep_title.append(await ep.text_content())

# getting countries information 
countries_class = await page.query_selector('.countries')
countries_element = await countries_class.query_selector('li')
countries_name = await countries_element.text_content()

# getting the annotation type
anno_class = await page.query_selector('.annotationtype')
anno_element = await anno_class.query_selector_all('li')
anno_type = []
for anno in anno_element:  # `anno` avoids shadowing the built-in `type`
    anno_type.append(await anno.text_content())

# getting human rights themes
huth_class = await page.query_selector('.themes')
huth_element = await huth_class.query_selector_all('li')
huth_themes = []
for theme in huth_element:
    # read the text once and fall back to a placeholder when it is empty
    theme_text = await theme.text_content()
    huth_themes.append(theme_text if theme_text else 'not found')

# creating a dictionary with the information found
record = {
    'country': countries_name,
    'concerned-group': ep_title,
    'annotation-type': anno_type,
    'human-rights-themes': huth_themes,
}

# print
print(record)

# close the browser and stop Playwright after finishing
await browser.close()
await playwright.stop()

#### STEP 2

In [None]:
# setup
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)
page = await browser.new_page()
search = 'https://uhri.ohchr.org/en/search-human-rights-recommendations?mechanisms=c33cacab-3cce-4f17-85e7-d1498db6b1aa&mechanismsOpened=c33cacab-3cce-4f17-85e7-d1498db6b1aa'
await page.goto(search)
await page.wait_for_load_state("domcontentloaded")

# wait for the needed element to load to ensure it exists 
wait = await page.wait_for_selector('.result-section-column-1')

# creating the list of records
information = []
wrapper_class = await page.query_selector_all('.result-section-column-1')
for element in wrapper_class:
    # default avoids a NameError or a stale term from the previous result
    term = 'not provided'
    term_elements = await element.query_selector_all('h3[data-h3-number]')
    for term_element in term_elements:
        term = await term_element.text_content()
    ep_info = []
    affected_persons_class = await element.query_selector('.affected-persons')
    if affected_persons_class is not None:
        ep_element = await affected_persons_class.query_selector_all('li')
        for ep in ep_element:
            ep_content = await ep.text_content()
            ep_info.append(ep_content)
    else:
        ep_info.append('not provided')
    country = []
    country_elements = await element.query_selector_all('.countries')
    for country_element in country_elements:
        country_names = await country_element.query_selector_all('li')
        for country_li in country_names:
            country_li_element = await country_li.text_content()
            country.append(country_li_element)
    information_dictionary = {
        'country': country,
        'term': term,
        'concerned-group': ep_info,
    }
    information.append(information_dictionary)

await browser.close()
await playwright.stop()
len(information)

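The next step is to extract the groups (with regex where needed) so that each record carries a single value, for example `'concerned-group': 'Children'` and `'concerned-group': 'Youth & juveniles'`. Aggregating with pandas then shows which groups are named most often in the recommendations addressed to each country. A record currently looks like this: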

      {
      'term': 'Fundamental legal safeguards',
      'concerned-group': [
        'Law enforcement / police & prison officials',
        'Children',
        'Medical staff / health professionals',
        'Youth & juveniles',
        'Persons deprived of their liberty & detainees'
        ]
      }
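
The flattening and counting described here can be sketched in plain Python. `explode_groups` is a hypothetical helper, the sample records are placeholders rather than scraped data, and `country` is assumed to be a list, as produced by the Step 2 code.

```python
from collections import Counter


def explode_groups(records):
    """Yield one (country, group) pair per country and concerned group."""
    for record in records:
        for country in record.get("country", []):
            for group in record.get("concerned-group", []):
                yield country.strip(), group.strip()


sample = [
    {"country": ["Country A"], "concerned-group": ["Children", "Youth & juveniles"]},
    {"country": ["Country A"], "concerned-group": ["Children"]},
    {"country": ["Country B"], "concerned-group": ["not provided"]},
]

# Count how often each group is named per country.
counts = Counter(explode_groups(sample))
for (country, group), n in counts.most_common():
    print(country, group, n)
```

With pandas, `DataFrame.explode` on the `concerned-group` column followed by a `groupby` should give the same long format.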

#### STEP 3

In [None]:
# Setup 
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)
page = await browser.new_page()
search = 'https://uhri.ohchr.org/en/search-human-rights-recommendations?mechanisms=c33cacab-3cce-4f17-85e7-d1498db6b1aa&mechanismsOpened=c33cacab-3cce-4f17-85e7-d1498db6b1aa'
await page.goto(search)
await page.wait_for_load_state("domcontentloaded")

# wait for the needed element to load to ensure it exists 
wait = await page.wait_for_selector('.result-row-section')

# creating the list of records
n_pages = 1  # number of result batches to scrape
for i in range(n_pages):
    # rebuilt on every pass; this assumes each click appends results to the page
    information = []

    wrapper_class = await page.query_selector_all('.result-row-section')
    for element in wrapper_class:
        # query inside the current result; querying `page` collects every year on the page
        year = []
        year_elements = await element.query_selector_all('.link-document-result-row')
        for year_content in year_elements:
            year_ = await year_content.text_content()
            year.append(year_)
        # default avoids a NameError or a stale term from the previous result
        term = 'not provided'
        term_elements = await element.query_selector_all('h3[data-h3-number]')
        for term_element in term_elements:
            term = await term_element.text_content()
        ep_info = []
        affected_persons_class = await element.query_selector('.affected-persons')
        if affected_persons_class is not None:
            ep_element = await affected_persons_class.query_selector_all('li')
            for ep in ep_element:
                ep_content = await ep.text_content()
                ep_info.append(ep_content)
        else:
            ep_info.append('not provided')
        country = []
        country_elements = await element.query_selector_all('.countries')
        for country_element in country_elements:
            country_names = await country_element.query_selector_all('li')
            for country_li in country_names:
                country_li_element = await country_li.text_content()
                country.append(country_li_element)
        information_dictionary = {
            'year': year,
            'country': country,
            'term': term,
            'concerned-group': ep_info,
        }
        information.append(information_dictionary)
    # click for more results on every pass except the last, then close the browser
    if i < n_pages - 1:
        await page.get_by_text("Display more recommendations").click()
    else:
        await browser.close()

information
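
An alternative to re-scraping after every click is to load all results first and scrape once. The helper below is a sketch, not the notebook's method: `load_more` and its parameters are hypothetical, `page` is assumed to behave like a Playwright async `Page`, and the fixed pause is an assumption about how long new results take to render.

```python
async def load_more(page, label="Display more recommendations", max_clicks=30, pause_ms=1000):
    """Click the 'load more' control until it disappears or max_clicks is reached.

    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        button = page.get_by_text(label)
        if await button.count() == 0:
            break  # the control is gone, so nothing is left to load
        await button.first.click()
        # Assumption: a short fixed pause is enough for the new rows to appear.
        await page.wait_for_timeout(pause_ms)
        clicks += 1
    return clicks
```

In a notebook cell this would be called as `clicks = await load_more(page)` before querying `.result-row-section` a single time.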

#### STEP 4

In [None]:
# setup
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)
page = await browser.new_page()
search = 'https://uhri.ohchr.org/en/search-human-rights-recommendations?mechanismsOpened=c33cacab-3cce-4f17-85e7-d1498db6b1aa%2C1803e7a0-a065-47d9-8aa6-fe890f858b07%2C84f66960-afa8-46cf-8ed1-b302b395e8fb%2C0c43227f-61e1-48f9-878d-5af39566249b%2Cc8bdd184-e10d-499d-931a-145a90cf546f%2C18fe6bff-ff5d-4dc3-9047-7b139e10834d%2Caf9f184b-b74c-4788-90fb-e1b3990710d0&mechanisms=af9f184b-b74c-4788-90fb-e1b3990710d0'
await page.goto(search)
await page.wait_for_load_state("domcontentloaded")


# wait for the needed element to load to ensure it exists 
wait = await page.wait_for_selector('.result-row-section')

# creating the list of records
n_pages = 30  # number of result batches to scrape
for i in range(n_pages):
    # rebuilt on every pass; this assumes each click appends results to the page
    information = []
    wrapper_class = await page.query_selector_all('.result-row-section')
    for element in wrapper_class:
        # defaults avoid a NameError or stale values from the previous result
        year = 'not provided'
        year_elements = await element.query_selector_all('.link-document-result-row')
        for year_class in year_elements:
            year = await year_class.text_content() or 'not provided'
        term = 'not provided'
        term_elements = await element.query_selector_all('h3[data-h3-number]')
        for term_element in term_elements:
            term = await term_element.text_content() or 'not provided'
        ep_info = []
        affected_persons_class = await element.query_selector('.affected-persons')
        if affected_persons_class is not None:
            ep_element = await affected_persons_class.query_selector_all('li')
            for ep in ep_element:
                ep_info.append(await ep.text_content())
        else:
            ep_info.append('not provided')
        country = []
        country_elements = await element.query_selector_all('.countries')
        for country_element in country_elements:
            country_names = await country_element.query_selector_all('li')
            for country_li in country_names:
                country.append(await country_li.text_content() or 'not provided')
        huth_themes = []
        huth_class = await element.query_selector_all('.themes')
        for huth_element in huth_class:
            for theme in await huth_element.query_selector_all('li'):
                huth_themes.append(await theme.text_content() or 'not provided')
        anno_type = []
        anno_class = await element.query_selector_all('.annotationtype')
        for anno in anno_class:
            # `anno_li` avoids shadowing the built-in `type`
            for anno_li in await anno.query_selector_all('li'):
                anno_type.append(await anno_li.text_content() or 'not provided')
        information_dictionary = {
            'year': year,
            'country': country,
            'term': term,
            'concerned-group': ep_info,
            'human-rights-themes': huth_themes,
            'annotation-type': anno_type,
        }
        information.append(information_dictionary)
    # click for more results on every pass except the last, then shut down
    if i < n_pages - 1:
        await page.get_by_text('Display more recommendations').click()
        # assumption: a short pause lets the new results render before the next pass
        await page.wait_for_timeout(1000)
    else:
        await browser.close()
        await playwright.stop()


In [None]:
len(information)

In [None]:
df = pd.DataFrame(information)
df.to_csv('ohchr_SR-Cultural-Rights_2022_2024.csv', index=False)
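
Because `to_csv` writes list-valued cells as their Python string form (for example `"['Children', 'Youth & juveniles']"`), the lists have to be parsed again when the file is read back. A minimal sketch using only the standard library; `parse_list_cell` and `read_records` are hypothetical helper names, the column names are assumed to match the dictionary keys above, and the CSV text is a placeholder rather than real output.

```python
import ast
import csv
import io

LIST_COLUMNS = ("country", "concerned-group")


def parse_list_cell(cell):
    """Turn "['a', 'b']" back into ['a', 'b']; wrap any other text in a one-item list."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return [cell]
    return list(value) if isinstance(value, (list, tuple)) else [str(value)]


def read_records(handle):
    """Read CSV rows and restore the list-valued columns."""
    records = []
    for row in csv.DictReader(handle):
        for column in LIST_COLUMNS:
            if column in row:
                row[column] = parse_list_cell(row[column])
        records.append(row)
    return records


csv_text = """year,country,concerned-group
2022,"['Country A']","['Children', 'Youth & juveniles']"
"""

records = read_records(io.StringIO(csv_text))
print(records[0]["concerned-group"])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text read from a file.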