**Instructions**

Create a Python script to automate data extraction from Wikipedia pages. The script will retrieve HTML content, extract article titles and text, collect internal links, and consolidate these tasks into one function that accepts a Wikipedia URL.

This will be tested on a specific Wikipedia page to validate functionality.

1. Write a function to get and parse the HTML content of a Wikipedia page.

2. Write a function to extract the article title.

3. Write a function to extract the article text, paragraph by paragraph, together with the corresponding headings. Map each heading to its paragraphs in a dictionary.

4. Write a function to collect every link that points to another Wikipedia page.

5. Wrap all the previous functions into a single function that takes a Wikipedia link as its parameter.

6. Test the last function on a Wikipedia page of your choice.

In [None]:
import requests
from bs4 import BeautifulSoup

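def get_html_content(url):
    """Fetch and parse HTML content from a Wikipedia page."""
    # The User-Agent string and the 10-second timeout are assumptions;
    # replace the User-Agent with one that identifies your own script.
    headers = {"User-Agent": "wiki-scraper-exercise/0.1 (educational script)"}
    response = requests.get(url, headers=headers, timeout=10)
    # Raises requests.HTTPError on a 4xx/5xx response instead of a bare Exception
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")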

def extract_title(soup):
    """Extract the article title."""
    title_tag = soup.find("h1", id="firstHeading")
    return title_tag.text.strip() if title_tag else "Title not found"

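def extract_text(soup):
    """Extract article text, mapping each heading to its paragraphs."""
    # Note: find_all scans the whole page, so headings outside the article
    # body (e.g. the table-of-contents heading "Contents") also start a section.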
    content = {}
    current_section = "Introduction"
    paragraphs = []

    for element in soup.find_all(['h2', 'h3', 'p']):
        if element.name in ['h2', 'h3']:
            if paragraphs:
                content[current_section] = " ".join(paragraphs)
                paragraphs = []
            current_section = element.text.strip()
        elif element.name == 'p' and element.text.strip():
            paragraphs.append(element.text.strip())

    if paragraphs:
        content[current_section] = " ".join(paragraphs)

    return content

def extract_internal_links(soup, base_url):
    """Collect all internal Wikipedia links."""
    links = set()
    for link in soup.find_all("a", href=True):
        href = link["href"]
        if href.startswith("/wiki/") and ":" not in href:
            full_link = base_url + href
            links.add(full_link)
    return links

def scrape_wikipedia_page(url):
    """Consolidate all functions to scrape a Wikipedia page."""
    base_url = "https://en.wikipedia.org"
    soup = get_html_content(url)
    title = extract_title(soup)
    text_content = extract_text(soup)
    internal_links = extract_internal_links(soup, base_url)

    return {
        "title": title,
        "text": text_content,
        "internal_links": list(internal_links)
    }
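
The test output further down files the opening paragraphs under "Contents" rather than "Introduction", because `extract_text` walks every `h2`, `h3` and `p` on the page, including the table-of-contents heading. One possible way around this is to look only inside the article body. The sketch below assumes the body sits in an element with `id="mw-content-text"`; that id, the function name `extract_text_from_body` and the sample HTML are illustrative assumptions, not part of the original exercise.

```python
from bs4 import BeautifulSoup


def extract_text_from_body(soup, container_id="mw-content-text"):
    """Map each heading to its paragraphs, looking only inside the article body."""
    # Assumption: the article body is wrapped in an element with this id.
    body = soup.find(id=container_id)
    if body is None:
        # Fall back to the whole document if the container is missing
        body = soup

    content = {}
    current_section = "Introduction"
    for element in body.find_all(["h2", "h3", "p"]):
        text = element.get_text().strip()
        if not text:
            continue
        if element.name in ("h2", "h3"):
            current_section = text
        else:
            # setdefault keeps earlier paragraphs if a heading text repeats
            content.setdefault(current_section, []).append(text)

    return {heading: " ".join(parts) for heading, parts in content.items()}


# Hypothetical miniature page: a "Contents" heading outside the article body
sample_html = """
<html><body>
  <nav><h2>Contents</h2></nav>
  <div id="mw-content-text">
    <p>Intro paragraph.</p>
    <h2>Description</h2>
    <p>First detail.</p>
    <p>Second detail.</p>
  </div>
</body></html>
"""
sections = extract_text_from_body(BeautifulSoup(sample_html, "html.parser"))
print(sections)
```

With this sample, the paragraph before the first article heading ends up under "Introduction", and the "Contents" heading never becomes a key. On a live page the result depends on Wikipedia's current markup, so the container id is worth checking in the browser's inspector first.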

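`extract_internal_links` builds each URL by string concatenation, so `/wiki/Moth` and `/wiki/Moth#Etymology` count as two different links. A minimal sketch of one way to normalize them with the standard library's `urllib.parse` follows. The helper name `normalize_internal_links` and the sample hrefs are hypothetical, and the filter (same host, `/wiki/` path, no colon) simply mirrors the rule used above.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_internal_links(hrefs, page_url):
    """Return sorted, de-duplicated article URLs from raw href values."""
    host = urlparse(page_url).netloc
    links = set()
    for href in hrefs:
        # Pure fragment links point back into the same page; skip them
        if href.startswith("#"):
            continue
        # Resolve relative hrefs against the page URL, then drop any #fragment
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        # Namespaced pages such as File: or Category: have a colon in the path
        if (
            parsed.netloc == host
            and parsed.path.startswith("/wiki/")
            and ":" not in parsed.path
        ):
            links.add(absolute)
    return sorted(links)


# Hypothetical href values of the kind found in an article's <a> tags
sample_hrefs = [
    "/wiki/Moth",
    "/wiki/Moth#Etymology",
    "/wiki/Lepidoptera",
    "/wiki/File:Example.jpg",
    "#cite_note-1",
    "https://example.org/wiki/Moth",
]
links = normalize_internal_links(sample_hrefs, "https://en.wikipedia.org/wiki/Butterfly")
print(links)
```

Returning a sorted list also makes the printed order reproducible, which a plain `set` does not guarantee.
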
In [None]:
# Testing the function
test_url = "https://en.wikipedia.org/wiki/Butterfly"
data = scrape_wikipedia_page(test_url)

# Printing the title
print("Title:", data["title"], "\n")

# Printing the extracted text with headings
print("Article Content:")
for heading, paragraph in data["text"].items():
    print(f"\n{heading}\n{'-' * len(heading)}")
    print(paragraph)

# Printing internal Wikipedia links
print("\nInternal Wikipedia Links:")
for link in data["internal_links"]:
    print(link, "\n")

Title: Butterfly 

Article Content:

Contents
--------
Rhopalocera Butterflies are winged insects from the lepidopteran superfamily Papilionoidea, characterized by large, often brightly coloured wings that often fold together when at rest, and a conspicuous, fluttering flight. The oldest butterfly fossils have been dated to the Paleocene, about 56 million years ago, though molecular evidence suggests that they likely originated in the  Cretaceous.[1] Butterflies have a four-stage life cycle, and like other holometabolous insects they undergo complete metamorphosis.[2] Winged adults lay eggs on the food plant on which their larvae, known as caterpillars, will feed. The caterpillars grow, sometimes very rapidly, and when fully developed, pupate in a chrysalis. When metamorphosis is complete, the pupal skin splits, the adult insect climbs out, expands its wings to dry, and flies off. Some butterflies, especially in the tropics, have several generations in a year, while others have a singl