<a href="https://colab.research.google.com/github/Gaks978/DML-Checkpoint/blob/main/Web_Scraping_checkpoint.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

What You're Aiming For

The objective is to automate the extraction of HTML content, article titles, text, and internal links from Wikipedia pages into a consolidated function that accepts any Wikipedia URL for efficient data retrieval and processing.


Instructions

Create a Python script to automate data extraction from Wikipedia pages. The script will retrieve HTML content, extract article titles and text, collect internal links, and consolidate these tasks into one function that accepts a Wikipedia URL. This will be tested on a specific Wikipedia page to validate functionality.

1) Write a function to Get and parse html content from a Wikipedia page

2) Write a function to Extract article title

3) Write a function to Extract article text for each paragraph with their respective

headings. Map those headings to their respective paragraphs in the dictionary.

4) Write a function to collect every link that redirects to another Wikipedia page

5) Wrap all the previous functions into a single function that takes as parameters a Wikipedia link

6) Test the last function on a Wikipedia page of your choice

In [1]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Step 1: Get and parse HTML content from a Wikipedia page
def get_html_content(url):
    response = requests.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')

# Step 2: Extract article title
def extract_title(soup):
    return soup.find('h1', {'id': 'firstHeading'}).text.strip()

# Step 3: Extract article text and map it to headings
def extract_text_by_headings(soup):
    content = {}
    content_div = soup.find('div', {'class': 'mw-parser-output'})
    current_heading = "Introduction"
    content[current_heading] = []

    for element in content_div.find_all(['h2', 'h3', 'p'], recursive=False):
        if element.name in ['h2', 'h3']:
            span = element.find('span', {'class': 'mw-headline'})
            if span:
                current_heading = span.text.strip()
                content[current_heading] = []
        elif element.name == 'p':
            text = element.get_text(strip=True)
            if text:
                content[current_heading].append(text)

    # Combine paragraphs for each heading
    return {heading: ' '.join(paragraphs) for heading, paragraphs in content.items()}

# Step 4: Collect internal Wikipedia links
def extract_internal_links(soup):
    links = set()
    for link in soup.find_all('a', href=True):
        href = link['href']
        if href.startswith('/wiki/') and ':' not in href:  # Exclude special pages like "Category:", "File:", etc.
            full_url = urljoin('https://en.wikipedia.org', href)
            links.add(full_url)
    return list(links)

# Step 5: Wrap all into a single function
def extract_wikipedia_data(url):
    soup = get_html_content(url)
    title = extract_title(soup)
    content = extract_text_by_headings(soup)
    internal_links = extract_internal_links(soup)

    return {
        'url': url,
        'title': title,
        'content_by_heading': content,
        'internal_links': internal_links
    }

# Step 6: Test the function
if __name__ == "__main__":
    test_url = "https://en.wikipedia.org/wiki/Web_scraping"
    data = extract_wikipedia_data(test_url)

    # Displaying just the basics
    print("Title:", data['title'])
    print("\nHeadings and Content Preview:")
    for heading, text in list(data['content_by_heading'].items())[:3]:  # show top 3 sections only
        print(f"\n{heading}:\n{text[:300]}...")  # preview first 300 characters

    print(f"\nNumber of internal links found: {len(data['internal_links'])}")


Title: Web scraping

Headings and Content Preview:

Introduction:
...

Number of internal links found: 133
