Scraping text from the Wikipedia website using Beautiful Soup

Watch this video first: https://www.youtube.com/watch?v=YY5skv756pc

After watching it, you will be able to:

1.1) Write a function to get and parse the HTML content of a Wikipedia page

1.2) Write a function to extract the article title

1.3) Write a function to extract the article text paragraph by paragraph, and map each heading to its paragraphs in a dictionary

1.4) Write a function to collect every link that points to another Wikipedia page

1.5) Wrap all the previous functions in a single function that takes a Wikipedia link as its parameter

1.6) Test that function on a Wikipedia page of your choice
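
Before running anything against the live site, the heading-to-paragraph mapping described in 1.3 can be tried offline on a small HTML string. The sketch below is only an illustration: the function name `map_headings_to_paragraphs` and the sample markup are made up for this example, and real Wikipedia pages have a much messier structure.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from a real Wikipedia page.
SAMPLE_HTML = """
<h1 id="firstHeading">Sample</h1>
<p>Intro paragraph.</p>
<h2>History</h2>
<p>First.</p>
<p>Second.</p>
<h2>Usage</h2>
<p>Third.</p>
"""


def map_headings_to_paragraphs(html):
    """Group paragraph texts under the nearest preceding h2-h6 heading."""
    soup = BeautifulSoup(html, "html.parser")
    mapping = {}
    for paragraph in soup.find_all("p"):
        # find_previous walks backwards through the document from the paragraph.
        heading_element = paragraph.find_previous(["h2", "h3", "h4", "h5", "h6"])
        if heading_element:
            heading = heading_element.get_text(strip=True)
        else:
            heading = "No Heading"
        mapping.setdefault(heading, []).append(paragraph.get_text(strip=True))
    return mapping


for heading, paragraphs in map_headings_to_paragraphs(SAMPLE_HTML).items():
    print(heading, "->", paragraphs)
```

The paragraph that appears before the first `h2` ends up under "No Heading", because the `h1` title is deliberately left out of the heading search.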

In [4]:
import requests
from bs4 import BeautifulSoup


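def get_html_content(url):
    """Fetch a page and return its raw HTML bytes, or None on failure."""
    try:
        # A timeout keeps the call from hanging on an unresponsive server.
        response = requests.get(url, timeout=10)
    except requests.RequestException as error:
        print("Failed to retrieve HTML content:", error)
        return None

    if response.status_code == 200:
        return response.content

    print("Failed to retrieve HTML content. Status code:", response.status_code)
    return None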


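def extract_article_title(html_content):
    """Return the text of the page's <h1 id="firstHeading">, or None if absent."""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Wikipedia renders the article title in the h1 element with this id.
    title_element = soup.find("h1", id="firstHeading")

    if title_element:
        return title_element.text
    return None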


def extract_article_text(html_content):
    """Return a dict mapping each heading to the list of paragraphs under it."""
    soup = BeautifulSoup(html_content, 'html.parser')

    paragraphs = soup.find_all("p")

    article_data = {}

    for paragraph in paragraphs:
        # The nearest heading that appears before the paragraph in the document.
        heading_element = paragraph.find_previous(["h2", "h3", "h4", "h5", "h6"])

        if heading_element:
            heading = heading_element.text
        else:
            heading = "No Heading"

        # Create the list on first use, then append the paragraph text.
        article_data.setdefault(heading, []).append(paragraph.text)

    return article_data


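def collect_redirect_links(html_content):
    """Return the href of every link that points to another Wikipedia article."""
    soup = BeautifulSoup(html_content, 'html.parser')

    links = soup.find_all("a")

    redirect_links = []

    for link in links:
        href = link.get("href")

        # Article links start with /wiki/; a colon marks namespaced pages
        # such as File:, Help: or Category:, which are skipped.
        if href and href.startswith("/wiki/") and ":" not in href:
            redirect_links.append(href)

    return redirect_links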


def scrape_wikipedia_page(url):
    """Fetch a Wikipedia page and print its title, text by heading, and links."""
    html_content = get_html_content(url)

    if not html_content:
        print("Failed to scrape Wikipedia page.")
        return

    title = extract_article_title(html_content)
    print("Article Title:", title)

    article_data = extract_article_text(html_content)
    for heading, paragraphs in article_data.items():
        print("Heading:", heading)
        for paragraph in paragraphs:
            print(paragraph)

    redirect_links = collect_redirect_links(html_content)
    print("Redirect Links:")
    for link in redirect_links:
        print(link)


scrape_wikipedia_page("https://en.wikipedia.org/wiki/Python_(programming_language)")

Article Title: Python (programming language)
Heading: Contents




Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation via the off-side rule.[34]

Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.[35][36]

Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991 as Python 0.9.0.[37] Python 2.0 was released in 2000. Python 3.0, released in 2008, was a major revision not completely backward-compatible with earlier versions. Python 2.7.18, released in 2020, was the last release of Python 2.[38]

Python consistently ranks as one of the most popular programming languages.[39][40][41

Redirect Links:
/wiki/Main_Page
/wiki/Main_Page
/wiki/Python_(programming_language)
/wiki/Python_(programming_language)
/wiki/Python_(programming_language)
/wiki/Programming_paradigm
/wiki/Multi-paradigm_programming_language
/wiki/Object-oriented_programming
/wiki/Procedural_programming
/wiki/Imperative_programming
/wiki/Functional_programming
/wiki/Structured_programming
/wiki/Reflective_programming
/wiki/Software_design
/wiki/Guido_van_Rossum
/wiki/Software_developer
/wiki/Python_Software_Foundation
/wiki/Software_release_life_cycle
/wiki/Software_release_life_cycle#Beta
/wiki/Type_system
/wiki/Duck_typing
/wiki/Dynamic_typing
/wiki/Strong_and_weak_typing
/wiki/Gradual_typing
/wiki/CPython
/wiki/Operating_system
/wiki/Windows
/wiki/MacOS
/wiki/Linux
/wiki/Android_(operating_system)
/wiki/Software_license
/wiki/Python_Software_Foundation_License
/wiki/Filename_extension
/wiki/Programming_language_implementation
/wiki/CPython
/wiki/PyPy
/wiki/Stackless_Python
/wiki/MicroPython
/wiki/Ci
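
The link list printed above contains duplicates, relative paths, and section anchors such as `#Beta`. One possible clean-up step, sketched here with the standard library only, is to drop the fragment, turn each path into an absolute URL, and keep the first occurrence of each. The helper name `normalize_links` and the `BASE_URL` constant are assumptions made for this example (English Wikipedia is assumed as the host); they are not part of the functions defined earlier.

```python
from urllib.parse import urldefrag, urljoin

# Assumption: the links were collected from English Wikipedia.
BASE_URL = "https://en.wikipedia.org"


def normalize_links(hrefs, base_url=BASE_URL):
    """Return absolute, fragment-free URLs with duplicates removed, order kept."""
    seen = set()
    result = []
    for href in hrefs:
        # Strip any "#section" part so the same article is counted once.
        path, _fragment = urldefrag(href)
        absolute = urljoin(base_url, path)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


sample_links = [
    "/wiki/Main_Page",
    "/wiki/Main_Page",
    "/wiki/Software_release_life_cycle",
    "/wiki/Software_release_life_cycle#Beta",
    "/wiki/CPython",
]

for url in normalize_links(sample_links):
    print(url)
```

The output of `collect_redirect_links` could be passed to this helper directly, since it is a plain list of `href` strings.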