# Your first scraper
In this project, we will guide you step by step through the process of:

1. creating a self-contained development environment
2. retrieving some information from an API (a website for computers)
3. leveraging it to scrape a website that does not provide an API
4. saving the output for later processing

Here we query an API for a list of countries and their past leaders. We then extract and sanitize their short bio from Wikipedia. Finally, we save the data to disk.

This task is often the first (coding) step of a data science project, and you will come back to it often.

You will study topics such as *scraping*, *data structures*, *regular expressions*, *concurrency* and *file handling*. We will point out useful resources at the appropriate time. 

Let's dive in!

## 0. Creating a clean environment

Use the [`venv`](https://docs.python.org/3/library/venv.html) command to create a new environment called `wikipedia_scraper_env`.

Activate it and add it to your `.gitignore` file.

You will find more info about virtual environments in the course content and on the web.

## 1. API Scraping

### 1a. A simple API query
You will start with the basics: how to do a simple request to an [API endpoint](../../2.python/2.python_advanced/05.Scraping/5.apis.ipynb).

You will use the [requests](https://requests.readthedocs.io/en/latest/) external library through the `import` keyword. NOTE: external libraries need to be installed first. Check the [Requests Quickstart](https://requests.readthedocs.io/en/latest/user/quickstart/) section of the documentation to:

1. Use the `get()` method to connect to this endpoint: https://country-leaders.onrender.com/status
2. Check if the `status_code` is equal to 200, which means OK.
    * if OK, `print()` the `text` of the response.
    * if not, `print()` the `status_code`. 

Here is an explanation of [HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).
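
To get a feel for these codes without leaving Python, the standard library's `http.HTTPStatus` enum maps each code to its official phrase. The sketch below is only an illustration; the helper name `describe_status` is made up for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know about
        return f"{code} (unknown status code)"


for code in (200, 403, 404):
    print(describe_status(code))
```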


In [34]:
# import the requests library, plus the other modules used later in this notebook
import json
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

# assign the root url (without /status) to the root_url variable for ease of reference
root_url = "https://country-leaders.onrender.com"

# assign the /status endpoint to another variable called status_url
status_url = f"{root_url}/status"

# query the /status endpoint using the get() method and store it in the req variable
req = requests.get(status_url)

# check the status_code using a condition and print appropriate messages
if req.status_code == 200:
    print(req.text)
else:
    print(req.status_code)


"Alive"


### 1b. Dealing with JSON

[JSON](https://quickref.me/json) is the preferred format for exchanging data over the web. You cannot avoid it, so you had better get acquainted with it.

Connect to another endpoint called `/countries` but this time the API will return data in the JSON format. 
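
As a minimal sketch of roughly what happens behind `req.json()`: JSON text maps onto ordinary Python lists, dictionaries, strings and numbers, and the standard `json` module does the conversion. The two payloads below are hard-coded stand-ins shaped like responses from this API; no request is made.

```python
import json

# Hard-coded JSON text standing in for the body of an HTTP response
error_body = '{"message": "The cookie is missing"}'
list_body = '["ma", "fr", "us", "ru", "be"]'

error = json.loads(error_body)   # JSON object -> dict
codes = json.loads(list_body)    # JSON array  -> list

print(type(error).__name__, error["message"])
print(type(codes).__name__, len(codes))

# Going the other way: Python object -> JSON text
print(json.dumps({"country": codes[0]}))
```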


In [5]:
# Set the countries_url variable
countries_url = f"{root_url}/countries"

# query the /countries endpoint using the get() method and store it in the req variable
req = requests.get(countries_url)

# Get the JSON content and store it in the countries variable
countries = req.json()

# display the request's status code and the countries variable
print(req.status_code, countries)

403 {'message': 'The cookie is missing'}


### 1c. Cookies anyone?

It looks like the access to this API is restricted...
Query the `/cookie` endpoint and extract the appropriate field to access your cookie.

You will need to use this cookie in each of the following API requests.
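
If cookies are new to you: a cookie is a small name/value pair that the server hands out in a `Set-Cookie` response header and expects back on later requests, often with an expiry time attached. The sketch below parses such a header with the standard library; the cookie name `user_cookie`, its value and the 30-second lifetime are invented for illustration and are not taken from this API.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header value (name, value and lifetime are made up)
header = "user_cookie=abc123; Max-Age=30; Path=/"

jar = SimpleCookie()
jar.load(header)

morsel = jar["user_cookie"]
print(morsel.value)        # the token to send back with later requests
print(morsel["max-age"])   # lifetime in seconds, after which it expires
```

With `requests` you rarely parse this header yourself: `response.cookies` already holds the parsed cookies and can be passed to the `cookies=` argument of the next call.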

In [6]:
# Set the cookie_url variable
cookie_url = f"{root_url}/cookie"

# Query the endpoint, set the cookies variable and display it
response = requests.get(cookie_url)
cookies = response.cookies

Try to query the countries endpoint using the cookie, save the output and print it.

In [7]:
# query the /countries endpoint, assign the output to the countries variable
countries = requests.get(countries_url, cookies=cookies).json()

# display the countries variable
print(countries)

['ma', 'fr', 'us', 'ru', 'be']


Chances are the cookie has expired... Thankfully, you got a nice error message. For now, simply execute the last 2 cells quickly so you get a result.

### 1d. Getting the actual data from the API

Query the `/leaders` endpoint.

In [8]:
# Set the leaders_url variable
leaders_url = f"{root_url}/leaders"

# Choose one country and store it in the country variable
country = countries[0]

# Query the /leaders endpoint for the chosen country using the cookie
response = requests.get(f"{leaders_url}?country={country}", cookies=cookies)
leaders = response.json()


It looks like this endpoint requires additional information in order to return its result. Check the API [*documentation*](https://country-leaders.onrender.com/docs) in your web browser.

Change the query to accept *parameters*. You should know where to find help by now.
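
For intuition, *parameters* end up in the query string, the part of the URL after `?`. The sketch below builds that string by hand with the standard library, which is roughly what the `params` argument of `requests.get()` does for you; it sends no request.

```python
from urllib.parse import urlencode

leaders_url = "https://country-leaders.onrender.com/leaders"
params = {"country": "be"}

# urlencode() escapes keys and values and joins the pairs with "&"
full_url = f"{leaders_url}?{urlencode(params)}"
print(full_url)
```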

In [9]:
# query the /leaders endpoint using cookies and parameters (take any country in countries)
# assign the output to the leaders variable
leaders = requests.get(leaders_url, cookies=cookies, params={"country": countries[4]}).json()

### 1e. A sneak peek at the data (finally)

Look inside a few examples. Notice the dictionary keys available for each entry. You have your first example of *structured data*. This data was sanitized for your benefit, meaning it is readily exploitable without modification.

You will also notice there is a Wikipedia link for each entry. You will need to extract additional information there. This will be a case of *semi-structured* data.
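
A short sketch of what "structured" means in practice: every entry is a dictionary with the same keys, so fields can be read by name. The two entries below are abridged copies of records this API returns for Belgium, hard-coded so the snippet runs offline.

```python
leaders = [
    {"id": "Q12978", "first_name": "Guy", "last_name": "Verhofstadt",
     "wikipedia_url": "https://nl.wikipedia.org/wiki/Guy_Verhofstadt"},
    {"id": "Q12981", "first_name": "Yves", "last_name": "Leterme",
     "wikipedia_url": "https://nl.wikipedia.org/wiki/Yves_Leterme"},
]

# The keys available on one entry
print(sorted(leaders[0].keys()))

# Because every entry has the same shape, one expression covers them all
names = [f"{entry['first_name']} {entry['last_name']}" for entry in leaders]
print(names)
```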

The /countries endpoint returns a `list` of several country codes.

You need to loop through this list and query the /leaders endpoint for each one. Save each `json` result in a dictionary called `leaders_per_country`.

In [10]:
# 4 lines
# Initialize the leaders_per_country dictionary
leaders_per_country = {}

# Loop through countries, query the /leaders endpoint with cookies and parameters, and store results
for country in countries:
    leaders_per_country[country] = requests.get(leaders_url, cookies=cookies, params={"country": country}).json()


In [11]:
# Create a dictionary of all leaders per country in one line (1 line)
leaders_per_country = {country: requests.get(leaders_url, cookies=cookies, params={"country": country}).json() for country in countries}

It is finally time to create a `get_leaders()` function for the above code. You will build on it later on. This function takes no parameters. Inside it, you will need to:
1. define the urls
2. get the cookies
3. get the countries
4. loop over them and save their leaders in a dictionary
5. return the dictionary

In [12]:
# < 15 lines
# Define the get_leaders() function
def get_leaders():
    root_url = "https://country-leaders.onrender.com"
    cookie_url = f"{root_url}/cookie"
    countries_url = f"{root_url}/countries"
    leaders_url = f"{root_url}/leaders"

    cookies = requests.get(cookie_url).cookies
    countries = requests.get(countries_url, cookies=cookies).json()

    leaders_per_country = {}
    for country in countries:
        leaders_per_country[country] = requests.get(leaders_url, cookies=cookies, params={"country": country}).json()

    return leaders_per_country


Test your function, save the result in the `leaders_per_country` dictionary and check its output.

In [13]:
# 2 lines
# Test the function and store the result
leaders_per_country = get_leaders()
print(leaders_per_country['be'][1]['last_name'])


Leterme


## 2. Extracting data from Wikipedia

Query one of the leaders' Wikipedia urls and display its `text` (not JSON).

In [14]:
# 3 lines
# Select one leader's Wikipedia URL and assign it to the wiki_url variable
wiki_url = leaders_per_country["us"][1]["wikipedia_url"]

# Query the Wikipedia page with a proper User-Agent header
response = requests.get(wiki_url, headers={"User-Agent": "Mozilla/5.0 (Wikipedia Scraper Project - Educational Use)"})

# Display the text content of the page
print(response.text[:100])  # Display first 100 chars to avoid flooding the notebook

print(wiki_url)

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-la
https://en.wikipedia.org/wiki/Barack_Obama


Ouch! You get the raw HTML code of the webpage. If you try to deal with it without tools, you will be there all night. Instead, use the [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) *external* library. You will find more info about it [here](../../2.python/2.python_advanced/05.Scraping/1.beautifulsoup_basic.ipynb) and [here](../../2.python/2.python_advanced/05.Scraping/2.beautifulsoup_advanced.ipynb).

Using the Quickstart section, start by importing the library and loading the `text` of the response you just retrieved.

Use the `prettify()` function and print it to take a look. You will start the actual parsing in the next step.

In [15]:
# 3 lines
# Load the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")

# Use prettify() to visualize the structure
print(soup.prettify()[:100])  # Limit output to 100 chars for readability

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-la


That looks better but you need to extract the right part of the webpage: the text of the first paragraph.

It is a bit tricky because Wikipedia pages slightly differ in structure from one language to the next. We cannot simply get the text for the first HTML paragraph.

You will start by getting all the HTML paragraphs from the HTML source and saving them in the `paragraphs` variable.

Use the documentation or google the appropriate keywords.

In [16]:
# 2 lines
# Extract all HTML paragraphs and store them in the paragraphs variable
paragraphs = soup.find_all("p")

# Display how many paragraphs were found
print(f"Found {len(paragraphs)} paragraphs")

# Show the first few paragraphs to inspect
print(paragraphs[:1])


Found 128 paragraphs
[<p class="mw-empty-elt">
</p>]


If you try different urls, you might find that the paragraph you want may be at a different index each time.

That is where you need to be clever and ask yourself what would be a reliable way to identify the right index, i.e. which string matches only the first paragraph, whatever the language...

Spend a good 30 minutes on the problem and brainstorm with your fellow learners. If you come out empty-handed, ask your coach.

1. Loop over the HTML paragraphs
2. When you have identified the correct one:
   * Store the [text](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#output) inside the `first_paragraph` variable
   * Exit the loop
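
One possible heuristic, offered as a sketch rather than the expected answer: Wikipedia introductions usually print the subject's name in bold, so the first paragraph containing a `<b>` tag is a reasonable candidate in many languages. The snippet works on hard-coded HTML strings so it runs without a network connection; the function name `pick_intro` and the sample strings are made up, and the assumption about bold text will not hold on every page.

```python
import re

# Matches an opening <b> tag, with or without attributes (but not <br> or <body>)
BOLD_TAG = re.compile(r"<b[\s>]")


def pick_intro(paragraphs_html):
    """Return the first paragraph that contains bold text, or None."""
    for html in paragraphs_html:
        if BOLD_TAG.search(html):
            return html
    return None


sample = [
    '<p class="mw-empty-elt">\n</p>',
    "<p>Coordinates and other page furniture.</p>",
    "<p><b>George Washington</b> was the first president.</p>",
    "<p>A later paragraph about his <b>legacy</b>.</p>",
]
print(pick_intro(sample))
```

With Beautiful Soup, the equivalent test on a paragraph tag `p` is `p.find("b") is not None`.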

In [17]:
# 1) Loop over the HTML paragraphs
paragraphs = soup.find_all("p")

first_paragraph = None
for p in paragraphs:
    text = p.get_text(strip=True)
    # Heuristic: skip empty or very short paragraphs (nav/meta/fillers)
    if len(text) > 60:
        # 2) Identified the correct one
        first_paragraph = text
        #    * Store in first_paragraph
        #    * Exit the loop
        break 

# Optional: sanity check
if first_paragraph is None:
    print("No meaningful paragraph found.")
else:
    print(first_paragraph)


Barack Hussein Obama II[a](born August 4, 1961) is an American politician who served as the 44thpresident of the United Statesfrom 2009 to 2017. A member of theDemocratic Party, he was the firstAfrican Americanpresident. Obama previously served as aU.S. senatorrepresenting Illinois from 2005 to 2008 and as anIllinois state senatorfrom 1997 to 2004.


At this stage, you can create a function to maintain consistency in your code. We will give you its *skeleton*; you will copy the code you wrote and make it work inside a function.

Don't forget to test your function.

In [18]:
def get_first_paragraph(url):
    """Fetch a Wikipedia page and return the first meaningful intro paragraph."""
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.text, "html.parser")

    first_paragraph = None
    for p in soup.find_all("p"):
        text = p.get_text(" ", strip=True)  # add a space between tags
        if len(text) > 60:
            first_paragraph = text
            break

    return first_paragraph


In [19]:
# Test: 3 lines
test_url = leaders_per_country["us"][0]["wikipedia_url"]
print(get_first_paragraph(test_url))


George Washington (February 22, 1732 [ O.S. February 11, 1731] [ a ] – December 14 , 1799) was a Founding Father and the first president of the United States , serving from 1789 to 1797. As commander of the Continental Army , Washington led Patriot forces to victory in the American Revolutionary War against the British Empire . He is commonly known as the Father of the Nation for his role in bringing about American independence .


### 2a. Regular expressions to the rescue

Now that you have extracted the content of the first paragraph, the only thing that remains to finish your Wikipedia scraper is to sanitize the output.

Indeed some Wikipedia references, HTML code, phonetic pronunciation etc. may linger. You might find *regular expressions* handy to get rid of them and obtain pristine text. You will find some useful documentation about regular expressions [here](../../2.python/2.python_advanced/03.Regex/regex.ipynb)

Once you have one of your regex working online, try it in the cell below. 

Hints: 
* Check the `sub()` method documentation.
* Make sure to test urls in different languages. Some may look good but others do not.
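
A minimal sketch of `re.sub()` at work, on a hard-coded sentence in the style of the scraped text. It removes bracketed reference markers and then collapses the doubled spaces they leave behind; the sample string is made up for the example.

```python
import re

raw = ("George Washington (February 22, 1732 [ a ] – December 14, 1799) "
       "was the first president [ 1 ] of the United States.")

# 1) drop anything between square brackets, brackets included
no_refs = re.sub(r"\[[^\]]*\]", "", raw)

# 2) collapse the runs of whitespace left behind into a single space
clean = re.sub(r"\s{2,}", " ", no_refs).strip()

print(clean)
```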

In [20]:
test_url = leaders_per_country["ma"][2]["wikipedia_url"]
print(get_first_paragraph(test_url))

محمد الخامس بن يوسف بن الحسن بن محمد بن عبد الرحمن بن هشام بن محمد بن عبد الله بن إسماعيل بن الشريف بن علي العلوي وُلد ( 1327 هـ / 10 أغسطس 1909م بالقصر السلطاني بفاس ) [ 1 ] وتوفي ( 1381 هـ / 26 فبراير 1961م بالرباط ) خَلَف والده السلطان مولاي يوسف الذي توفي بُكرة يوم الخميس 22 جمادى الأولى سنة 1346 هـ موافق 17 نوفمبر سنة 1927م [ 2 ] فبويع ابنه سيدي محمد سلطانا للمغرب في اليوم الموالي بعد صلاة الجمعة 23 جمادى الأولى سنة 1346 هـ موافق 18 نوفمبر سنة 1927م في القصر السلطاني بفاس [ 3 ] ولم يزل سلطان المغرب إلى سنة 1957م ، قضى منها المنفى بين ( 1953 - 1955 )، ثم اتخذ لقب الملك سنة 1957م ولم يزل ملكا إلى وفاته سنة 1961م ، ساند السلطان محمد الخامس نضالات الحركة الوطنية المغربية المطالبة بتحقيق الاستقلال ، الشيء الذي دفعه إلى الاصطدام بسلطات الحماية . وكانت النتيجة قيام سلطات الحماية بنفيه إلى مدغشقر . وعلى إثر ذلك اندلعت مظاهرات مطالبة بعودته إلى وطنه. وأمام اشتداد حدة المظاهرات، قبلت السلطات الفرنسية بإرجاع السلطان إلى عرشه يوم 16 نوفمبر 1955 . وبعد بضعة شهور تم إعلان استقلال المغرب . كان ا

In [21]:
# 3 lines

# clean the paragraph by removing reference markers and brackets
clean_paragraph = re.sub(r'\[[^\]]*\]', '', get_first_paragraph(test_url))

# display the cleaned text
print(clean_paragraph.strip())


محمد الخامس بن يوسف بن الحسن بن محمد بن عبد الرحمن بن هشام بن محمد بن عبد الله بن إسماعيل بن الشريف بن علي العلوي وُلد ( 1327 هـ / 10 أغسطس 1909م بالقصر السلطاني بفاس )  وتوفي ( 1381 هـ / 26 فبراير 1961م بالرباط ) خَلَف والده السلطان مولاي يوسف الذي توفي بُكرة يوم الخميس 22 جمادى الأولى سنة 1346 هـ موافق 17 نوفمبر سنة 1927م  فبويع ابنه سيدي محمد سلطانا للمغرب في اليوم الموالي بعد صلاة الجمعة 23 جمادى الأولى سنة 1346 هـ موافق 18 نوفمبر سنة 1927م في القصر السلطاني بفاس  ولم يزل سلطان المغرب إلى سنة 1957م ، قضى منها المنفى بين ( 1953 - 1955 )، ثم اتخذ لقب الملك سنة 1957م ولم يزل ملكا إلى وفاته سنة 1961م ، ساند السلطان محمد الخامس نضالات الحركة الوطنية المغربية المطالبة بتحقيق الاستقلال ، الشيء الذي دفعه إلى الاصطدام بسلطات الحماية . وكانت النتيجة قيام سلطات الحماية بنفيه إلى مدغشقر . وعلى إثر ذلك اندلعت مظاهرات مطالبة بعودته إلى وطنه. وأمام اشتداد حدة المظاهرات، قبلت السلطات الفرنسية بإرجاع السلطان إلى عرشه يوم 16 نوفمبر 1955 . وبعد بضعة شهور تم إعلان استقلال المغرب . كان الملك محمد الخام

Overwrite the `get_first_paragraph()` function by applying your regex to the first paragraph before returning it.

In [22]:
# 10 lines

def get_first_paragraph(url):
    """Fetch a Wikipedia page, extract and clean the first meaningful intro paragraph."""
    
    # query the Wikipedia URL
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    
    # parse the HTML using BeautifulSoup
    soup = BeautifulSoup(response.text, "html.parser")
    
    # initialize variable
    first_paragraph = None
    
    # loop over HTML paragraphs
    for p in soup.find_all("p"):
        text = p.get_text(" ", strip=True)
        if len(text) > 60:
            first_paragraph = text
            break

    if not first_paragraph:
        return None

    # --- Apply regex cleaning before returning ---
    # remove reference markers like [1], [a], etc.
    cleaned = re.sub(r'\[[^\]]*\]', '', first_paragraph)
    # remove phonetic pronunciations like /ˈwɒʃɪŋtən/
    cleaned = re.sub(r'/[^/]+/', '', cleaned)
    # collapse extra spaces
    cleaned = re.sub(r'\s{2,}', ' ', cleaned).strip()
    
    return cleaned


In [23]:
test_url = leaders_per_country["ma"][2]["wikipedia_url"]
print(get_first_paragraph(test_url))

محمد الخامس بن يوسف بن الحسن بن محمد بن عبد الرحمن بن هشام بن محمد بن عبد الله بن إسماعيل بن الشريف بن علي العلوي وُلد ( 1327 هـ 26 فبراير 1961م بالرباط ) خَلَف والده السلطان مولاي يوسف الذي توفي بُكرة يوم الخميس 22 جمادى الأولى سنة 1346 هـ موافق 17 نوفمبر سنة 1927م فبويع ابنه سيدي محمد سلطانا للمغرب في اليوم الموالي بعد صلاة الجمعة 23 جمادى الأولى سنة 1346 هـ موافق 18 نوفمبر سنة 1927م في القصر السلطاني بفاس ولم يزل سلطان المغرب إلى سنة 1957م ، قضى منها المنفى بين ( 1953 - 1955 )، ثم اتخذ لقب الملك سنة 1957م ولم يزل ملكا إلى وفاته سنة 1961م ، ساند السلطان محمد الخامس نضالات الحركة الوطنية المغربية المطالبة بتحقيق الاستقلال ، الشيء الذي دفعه إلى الاصطدام بسلطات الحماية . وكانت النتيجة قيام سلطات الحماية بنفيه إلى مدغشقر . وعلى إثر ذلك اندلعت مظاهرات مطالبة بعودته إلى وطنه. وأمام اشتداد حدة المظاهرات، قبلت السلطات الفرنسية بإرجاع السلطان إلى عرشه يوم 16 نوفمبر 1955 . وبعد بضعة شهور تم إعلان استقلال المغرب . كان الملك محمد الخامس يكنى: أبا عبد الله.
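
Compare this output with the one from the previous test: the text between the two date separators `/` is gone. The pattern `/[^/]+/` removes everything between any two slashes, not only phonetic transcriptions. The sketch below reproduces the effect on a made-up English sentence and tries a narrower pattern that refuses to cross digits. That restriction is an assumption (phonetic transcriptions rarely contain digits, dates do) and should be checked against pages in each language.

```python
import re

LOOSE = r"/[^/]+/"      # anything between two slashes
NARROW = r"/[^/\d]+/"   # same, but the span may not contain a digit

dates = "born 1327 AH / 10 August 1909, died 1381 AH / 26 February 1961"
phonetic = "Washington /ˈwɒʃɪŋtən/ was a Founding Father"

print(re.sub(LOOSE, "", dates))      # the first date is swallowed
print(re.sub(NARROW, "", dates))     # the dates survive
print(re.sub(NARROW, "", phonetic))  # the transcription is still removed
```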


## 3. Putting it all together

Let's go back to your `get_leaders()` function and update it with an *inner* loop over each leader. You will query the url provided and extract the first paragraph using the `get_first_paragraph()` function you just finished. You will then update that `leader`'s dictionary and move on to the next one.

Note that the rest of the code should not change, since you modify each leader's data one by one.

In [24]:
# < 20 lines

def get_leaders():
    """Retrieve leaders for all countries and enrich them with Wikipedia intro paragraphs."""

    root_url = "https://country-leaders.onrender.com"
    cookie_url = f"{root_url}/cookie"
    countries_url = f"{root_url}/countries"
    leaders_url = f"{root_url}/leaders"

    # Get cookie
    cookies = requests.get(cookie_url).cookies
    # Get countries
    countries = requests.get(countries_url, cookies=cookies).json()

    # Initialize container
    leaders_per_country = {}
    # Loop through each country
    for country in countries:
        leaders = requests.get(leaders_url, cookies=cookies, params={"country": country}).json()

        # Inner loop: scrape first paragraph for each leader
        for leader in leaders:
            wiki_url = leader.get("wikipedia_url")
            leader["first_paragraph"] = get_first_paragraph(wiki_url) if wiki_url else None

        # Save to main dictionary
        leaders_per_country[country] = leaders

    # Return results
    return leaders_per_country


In [25]:
# Check the output of your function
leaders_per_country = get_leaders()
print(leaders_per_country["us"][0])



AttributeError: 'str' object has no attribute 'get'

Does the function crash in the middle of the loop? Chances are the cookies have expired while looping over the leaders.

Modify your function to handle the *exception*, or check whether the `status_code` signals a cookie error. In either case, get new cookies and query the API again.
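
The status-code variant can be sketched as a small wrapper. It assumes that an expired cookie is reported with status 403, as the missing cookie was earlier; check what your own run returns. The name `get_with_fresh_cookie` is made up, `get` is meant to be `requests.get` (or a session's `get`), and the stand-in class and fake server exist only so the example runs offline.

```python
class FakeResponse:
    """Offline stand-in exposing the two attributes the wrapper reads."""

    def __init__(self, status_code, cookies=None):
        self.status_code = status_code
        self.cookies = cookies


def get_with_fresh_cookie(get, url, cookie_url, cookies, **kwargs):
    """Call get(url); on a 403, fetch a new cookie and try once more."""
    resp = get(url, cookies=cookies, **kwargs)
    if resp.status_code == 403:
        cookies = get(cookie_url).cookies
        resp = get(url, cookies=cookies, **kwargs)
    # Return the cookies too, so the caller keeps using the fresh ones
    return resp, cookies


def fake_get(url, cookies=None, **kwargs):
    """Pretend server: only the cookie 'fresh' is accepted."""
    if url.endswith("/cookie"):
        return FakeResponse(200, cookies="fresh")
    return FakeResponse(200 if cookies == "fresh" else 403)


resp, cookies = get_with_fresh_cookie(
    fake_get,
    "https://example.invalid/leaders",
    "https://example.invalid/cookie",
    cookies="stale",
    params={"country": "be"},
)
print(resp.status_code, cookies)
```

Returning the refreshed cookies alongside the response lets the calling loop carry them into the next iteration instead of refreshing on every request.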

If your code did not crash,

In [26]:
# < 25 lines

def get_leaders(country):
    """Retrieve leaders for a specific country and enrich them with Wikipedia intro paragraphs."""
    
    root_url = "https://country-leaders.onrender.com"
    cookie_url = f"{root_url}/cookie"
    leaders_url = f"{root_url}/leaders"
    
    # Helper to get a fresh cookie
    def get_cookie():
        return requests.get(cookie_url).cookies
    
    cookies = get_cookie()
    leaders_per_country = {}

    try:
        # Request leaders for the given country
        resp = requests.get(leaders_url, cookies=cookies, params={"country": country})

        # If cookie expired or invalid, refresh and retry
        if resp.status_code != 200 or "cookie" in resp.text.lower():
            print(f"Cookie expired while fetching {country}. Refreshing...")
            cookies = get_cookie()
            resp = requests.get(leaders_url, cookies=cookies, params={"country": country})

        leaders = resp.json()

        # Enrich each leader with their Wikipedia first paragraph
        for leader in leaders:
            wiki_url = leader.get("wikipedia_url")
            leader["first_paragraph"] = get_first_paragraph(wiki_url) if wiki_url else None

        leaders_per_country[country] = leaders

    except Exception as e:
        print(f"Error processing {country}: {e}")
    
    return leaders_per_country



Check the output of your function again.

In [27]:
# Check the output of your function (2 lines)
leaders_us = get_leaders("us")
print(leaders_us["us"][0])

{'id': 'Q23', 'first_name': 'George', 'last_name': 'Washington', 'birth_date': '1732-02-22', 'death_date': '1799-12-14', 'place_of_birth': 'Westmoreland County', 'wikipedia_url': 'https://en.wikipedia.org/wiki/George_Washington', 'start_mandate': '1789-04-30', 'end_mandate': '1797-03-04', 'first_paragraph': 'George Washington (February 22, 1732 – December 14 , 1799) was a Founding Father and the first president of the United States , serving from 1789 to 1797. As commander of the Continental Army , Washington led Patriot forces to victory in the American Revolutionary War against the British Empire . He is commonly known as the Father of the Nation for his role in bringing about American independence .'}


Well done! It took a while however... Let's speed things up. The main *bottleneck* is the loop. We call on the Wikipedia website many times.

You will use the same *session* to call all the Wikipedia pages. Check the *Advanced Usage* section of the Requests module's documentation.

Start by modifying the `get_first_paragraph()` function to accept a session parameter and adjust the `get()` method call.

In [28]:
# < 20 lines
def get_first_paragraph(url, session=None):
    """Fetch and clean the first paragraph of a Wikipedia page using a shared session."""
    
    # use provided session or create a temporary one
    session = session or requests.Session()
    
    # query the Wikipedia URL (using session)
    response = session.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    
    # parse HTML
    soup = BeautifulSoup(response.text, "html.parser")
    
    # extract the first meaningful paragraph
    first_paragraph = None
    for p in soup.find_all("p"):
        text = p.get_text(" ", strip=True)
        if len(text) > 60:
            first_paragraph = text
            break
    
    if not first_paragraph:
        return None

    # clean the text with regex
    cleaned = re.sub(r'\[[^\]]*\]', '', first_paragraph)   # remove [1], [a], etc.
    cleaned = re.sub(r'/[^/]+/', '', cleaned)              # remove phonetic /ˈ.../
    cleaned = re.sub(r'\s{2,}', ' ', cleaned).strip()      # collapse spaces

    return cleaned


In [29]:
with requests.Session() as s:
    test_url = "https://en.wikipedia.org/wiki/George_Washington"
    print(get_first_paragraph(test_url, s))


George Washington (February 22, 1732 – December 14 , 1799) was a Founding Father and the first president of the United States , serving from 1789 to 1797. As commander of the Continental Army , Washington led Patriot forces to victory in the American Revolutionary War against the British Empire . He is commonly known as the Father of the Nation for his role in bringing about American independence .


Modify your `get_leaders()` function to make use of a single session for all the Wikipedia calls.
1. create a `Session` object outside of the loop over countries.
2. pass it to the `get_first_paragraph()` function as an argument.

In [30]:
# < 25 lines
def get_leaders(country):
    """Retrieve leaders for a specific country and enrich them with Wikipedia intro paragraphs."""
    
    root_url = "https://country-leaders.onrender.com"
    cookie_url = f"{root_url}/cookie"
    countries_url = f"{root_url}/countries"
    leaders_url = f"{root_url}/leaders"
    
    # Helper to get a fresh cookie
    def get_cookie():
        return requests.get(cookie_url).cookies
    
    cookies = get_cookie()
    leaders_per_country = {}

    # Create a single session for ALL Wikipedia calls, before any page is fetched
    wiki_session = requests.Session()

    try:
        # Request leaders for the given country
        resp = requests.get(leaders_url, cookies=cookies, params={"country": country})

        # If cookie expired or invalid, refresh and retry
        if resp.status_code != 200 or "cookie" in resp.text.lower():
            print(f"Cookie expired while fetching {country}. Refreshing...")
            cookies = get_cookie()
            resp = requests.get(leaders_url, cookies=cookies, params={"country": country})

        leaders = resp.json()

        # Enrich each leader with their Wikipedia first paragraph, reusing the session
        for leader in leaders:
            wiki_url = leader.get("wikipedia_url")
            leader["first_paragraph"] = get_first_paragraph(wiki_url, session=wiki_session) if wiki_url else None

        leaders_per_country[country] = leaders

    except Exception as e:
        print(f"Error processing {country}: {e}")
    
    # Return the data together with the shared session so the caller can reuse it
    return leaders_per_country, wiki_session




In [31]:
# Call get_leaders() — this now returns both the leaders data and the wiki session
leaders_per_country, wiki_session = get_leaders("ma")

# Select the first Moroccan leader for example (country code "ma")
leader = leaders_per_country["ma"][0]
url = leader.get("wikipedia_url")

# Get the first paragraph using the same session
leader["first_paragraph"] = get_first_paragraph(url, session=wiki_session) if url else None

# Print the results
print(f"Leader name: {leader['first_name']} {leader['last_name']}")
print(f"Wikipedia URL: {url}")
print("\nExtracted first paragraph:\n")
print(leader["first_paragraph"])


Leader name: Mohammed None
Wikipedia URL: https://ar.wikipedia.org/wiki/%D9%85%D8%AD%D9%85%D8%AF_%D8%A7%D9%84%D8%B3%D8%A7%D8%AF%D8%B3_%D8%A8%D9%86_%D8%A7%D9%84%D8%AD%D8%B3%D9%86

Extracted first paragraph:

مُحمد السادس بن الحسن الثاني العلوي (مواليد 21 أغسطس 1963) هو ملك المملكة المغربية منذ عام 1999 والملك الثالث والعشرون للمغرب من سلالة العلويين الفيلاليين ، تولى الحكم خلفًا لوالده الملك الحسن الثاني بعد وفاته، وبويع ملكًا يوم الجمعة 9 ربيع الثاني سنة 1420 هـ الموافق 23 يوليو 1999 بالقصر الملكي بالرباط .


## 4. Saving your hard work

The final step is to save the ``leaders_per_country`` dictionary in the `leaders.json` file using the [json](https://docs.python.org/3/library/json.html) module. Check out the `with` statement.
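
Two details are worth a quick sketch before writing the real file: the `with` statement closes the file for you, even if an error occurs, and `ensure_ascii=False` keeps accented or non-Latin characters readable instead of escaping them. The snippet writes a tiny sample to a temporary directory; the sample data and the file name `sample.json` are made up for the example.

```python
import json
import tempfile
from pathlib import Path

sample = {"be": [{"first_name": "Léon", "last_name": "Delacroix"}]}

# Default: non-ASCII characters are escaped
print(json.dumps("Léon"))
# ensure_ascii=False: characters are written as-is
print(json.dumps("Léon", ensure_ascii=False))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.json"

    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=2)
    file_closed = f.closed  # True: leaving the with block closed the file

    with open(path, "r", encoding="utf-8") as f:
        restored = json.load(f)

print(file_closed, restored == sample)
```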

In [33]:


# Save as JSON (2 lines)
with open("leaders.json", "w", encoding="utf-8") as f:
    json.dump(leaders_per_country, f, ensure_ascii=False, indent=2)

# Save as CSV (1 line)
pd.DataFrame([
    {"country": c, **l} for c, leaders in leaders_per_country.items() for l in leaders
]).to_csv("leaders.csv", index=False, encoding="utf-8")


NameError: name 'pd' is not defined

Make sure the file can be read back: write the code to read the file and check that the variables are the same.

In [205]:
# 3 lines
# Read back the JSON file (1 line)
with open("leaders.json", "r", encoding="utf-8") as f:
    leaders_check = json.load(f)

# Compare variables, values included and not only the keys (1 line)
print("Same content:", leaders_per_country == leaders_check)

Same content: True


In [206]:
# Read JSON file (1 line)
with open("leaders.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Convert nested JSON into a flat DataFrame (1 line)
leaders_df = pd.DataFrame([
    {"country": c, **l} for c, leaders in data.items() for l in leaders
])

# Display first rows (1 line)
leaders_df.head()

| | country | id | first_name | last_name | birth_date | death_date | place_of_birth | wikipedia_url | start_mandate | end_mandate | first_paragraph |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | be | Q12978 | Guy | Verhofstadt | 1953-04-11 | | Dendermonde | https://nl.wikipedia.org/wiki/Guy_Verhofstadt | 1999-07-12 | 2008-03-20 | |
| 1 | be | Q12981 | Yves | Leterme | 1960-10-06 | | Wervik | https://nl.wikipedia.org/wiki/Yves_Leterme | 2009-11-25 | 2011-12-06 | |
| 2 | be | Q12983 | Herman | | 1947-10-31 | | Etterbeek | https://nl.wikipedia.org/wiki/Herman_Van_Rompuy | 2008-12-30 | 2009-11-25 | |
| 3 | be | Q14989 | Léon | Delacroix | 1867-12-27 | 1929-10-15 | Saint-Josse-ten-Noode | https://nl.wikipedia.org/wiki/L%C3%A9on_Delacroix | 1918-11-21 | 1920-11-20 | |
| 4 | be | Q14990 | Henry | Carton | 1869-01-31 | 1951-05-06 | Brussels | https://nl.wikipedia.org/wiki/Henri_Carton_de_... | 1920-11-20 | 1921-12-16 | |


Make a function `save(leaders_per_country)` to call this code easily.

In [None]:
# 3 lines
def save(leaders_per_country):
    """Save leaders data to both JSON and CSV files."""
    # Save as JSON 
    with open("leaders.json", "w", encoding="utf-8") as f:
        json.dump(leaders_per_country, f, ensure_ascii=False, indent=2)

    # Save as CSV
    pd.DataFrame([
        {"country": c, **l} for c, leaders in leaders_per_country.items() for l in leaders
    ]).to_csv("leaders.csv", index=False, encoding="utf-8")

In [37]:
# Call the function (1 line)
save(leaders_per_country)


## 5. Tidy things up in a stand-alone python script

Congratulations! You now have a working scraper! However, your code is scattered throughout this notebook alongside the tutorial text. Hardly production-ready...

Copy and paste what you need into a separate `leaders_scraper.py` file.
Make sure it works by running `python3 leaders_scraper.py`.
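
A possible layout for `leaders_scraper.py`, as a sketch: functions first, then a `main()` that chains them, guarded by `if __name__ == "__main__":` so the file can be imported without running the scraper. The `get_leaders()` body here is a placeholder returning hard-coded data; replace it with the version from your notebook. The `path` parameter on `save()` is an addition for this example.

```python
import json


def get_leaders():
    """Placeholder: paste the notebook's get_leaders() here."""
    return {"be": [{"first_name": "Yves", "last_name": "Leterme"}]}


def save(leaders_per_country, path="leaders.json"):
    """Write the dictionary to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(leaders_per_country, f, ensure_ascii=False, indent=2)


def main():
    leaders_per_country = get_leaders()
    save(leaders_per_country)
    print(f"Saved leaders for {len(leaders_per_country)} countries")


if __name__ == "__main__":
    main()
```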

## (Optional) To go further

If you want to practice scraping, you can read this section and tackle the exercises.

1. Restructure your code by using OOP (see ReadMe).
2. You may have noticed that the API returns very partial results for country leaders: many are missing. Rewrite the `get_leaders()` function so that it gets its list from Wikipedia and extracts the leaders' *personal details* from the frame on the side of each page.

Good luck!