
What You're Aiming For

The objective is to automate extracting a Wikipedia page's HTML content, article title, text, and internal links, and to consolidate these steps into a single function that accepts any Wikipedia URL.


Instructions

Create a Python script to automate data extraction from Wikipedia pages. The script will retrieve HTML content, extract article titles and text, collect internal links, and consolidate these tasks into one function that accepts a Wikipedia URL. This will be tested on a specific Wikipedia page to validate functionality.
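As a rough illustration of the "collect internal links" step, the sketch below filters a list of raw `href` values and removes duplicates while keeping first-seen order. It works on a plain list of strings rather than a parsed page, so it runs without network access. The names `internal_links`, `BASE_URL`, and `sample_hrefs` are hypothetical and used only for this example.

```python
from urllib.parse import urljoin

# Assumption: links are resolved against the English Wikipedia host.
BASE_URL = "https://en.wikipedia.org"


def internal_links(hrefs, base_url=BASE_URL):
    """Return absolute article URLs, de-duplicated, in first-seen order."""
    kept = []
    for href in hrefs:
        # Internal article links are site-relative and start with /wiki/.
        if not href.startswith("/wiki/"):
            continue
        # Special: pages are tools (search, login), not articles.
        if href.startswith("/wiki/Special:"):
            continue
        kept.append(urljoin(base_url, href))
    # dict.fromkeys keeps the first occurrence of each URL and preserves order,
    # unlike list(set(...)), whose ordering is arbitrary.
    return list(dict.fromkeys(kept))


# Hypothetical href values standing in for the anchors of a real page.
sample_hrefs = [
    "/wiki/Jollof_rice",
    "#cite_note-1",
    "https://example.org/",
    "/wiki/Special:Search",
    "/wiki/Palm_oil",
    "/wiki/Jollof_rice",
]
links = internal_links(sample_hrefs)
print(links)
```

Keeping the order stable means that printing "the first 10 links" gives the same result on every run, which makes the output easier to check by eye.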

1) Write a function to get and parse the HTML content of a Wikipedia page.

2) Write a function to extract the article title.

3) Write a function to extract the article text, paragraph by paragraph, and map each heading to its paragraphs in a dictionary.

4) Write a function to collect every link that points to another Wikipedia page.

5) Wrap all the previous functions into a single function that takes a Wikipedia link as its parameter.

6) Test the last function on a Wikipedia page of your choice.
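Step 3 is the least obvious of the six, so here is a minimal sketch of the grouping idea on its own: walk the elements in document order, remember the most recent heading, and file each paragraph under it. To stay runnable offline, it takes hypothetical `(tag, text)` pairs instead of a parsed page. `group_by_heading` and `sample` are illustrative names, not part of any library.

```python
import re


def group_by_heading(elements, default_heading="Introduction"):
    """Group paragraph text under the most recent heading, in document order."""
    sections = {}
    current = default_heading
    for tag, text in elements:
        # Drop citation markers such as [1], then trim whitespace.
        text = re.sub(r"\[\d+\]", "", text).strip()
        if tag in ("h2", "h3"):
            # A heading starts a new section; later paragraphs belong to it.
            current = text
        elif tag == "p" and text:
            # setdefault creates the list the first time a heading gets text.
            sections.setdefault(current, []).append(text)
    return sections


# Hypothetical (tag, text) pairs standing in for a parsed page.
sample = [
    ("p", "Lead paragraph.[1]"),
    ("h2", "History"),
    ("p", "First history paragraph."),
    ("p", "Second history paragraph.[2]"),
    ("h2", "See also"),
]
sections = group_by_heading(sample)
print(sections)
```

Text that appears before the first heading lands under the default "Introduction" key, and a heading with no paragraphs (here "See also") never becomes a key.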

In [4]:
import re

import requests
from bs4 import BeautifulSoup


def get_wiki_content(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    # A timeout keeps the call from hanging indefinitely on a stalled connection.
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        return BeautifulSoup(response.content, "html.parser")
    raise Exception(f"Unable to retrieve page content. Status code: {response.status_code}")

def article_title(soup):
    title = soup.find("h1", id = "firstHeading")
    if title:
        return title.text
    else:
        return "no title found"

def article_text(soup):
    content = soup.find("div", {"class": "mw-parser-output"})
    if not content:
        return {}
    text_data = {}
    bookmark_heading = "Introduction"
    paragraphs  = []

    def clean_text(text):
        text = re.sub(r"\[\d+\]", "", text)
        return text.strip()

    for element in content.find_all(["h2", "h3", "p", "ul"]):
        if element.name in ["h2", "h3"]:
            # a new heading closes the previous section
            if paragraphs:
                text_data[bookmark_heading] = paragraphs
                paragraphs = []
            bookmark_heading = element.get_text().replace("[edit]", "").strip()
        elif element.name == "p":
            text = clean_text(element.get_text())
            # skip fragments with no sentence punctuation (captions, stray labels)
            if not any(punct in text for punct in [".", "!", "?"]):
                continue
            paragraphs.append(text)
        elif element.name == "ul":
            items = [clean_text(li.get_text()) for li in element.find_all("li")]
            for item in items:
                # keep if more than 2 words and has punctuation
                if len(item.split()) > 2 and any(punct in item for punct in [".", "!", "?"]):
                    paragraphs.append(item)
    if paragraphs:
        text_data[bookmark_heading] = paragraphs

    return text_data


def article_redirecting_links(soup):
    links = []
    for a_href in soup.find_all("a", href = True):
        href = a_href["href"]
        if href.startswith("/wiki/") and not href.startswith("/wiki/Special:"):
            making_full_url = "https://en.wikipedia.org" + href
            links.append(making_full_url)
    return list(set(links))

def start_web_scraping(wiki_url):
    soup = get_wiki_content(wiki_url)
    data = {
        "title" : article_title(soup),
        "text" : article_text(soup),
        "links" : article_redirecting_links(soup)
    }
    return data


# Test the combined function on a sample page
url_link = "https://en.wikipedia.org/wiki/Nigerian_cuisine"
web = start_web_scraping(url_link)

print("  *TITLE*   ")
print(web["title"].upper())


for heading, paragraphs in web["text"].items():
    print(f"\n--- {heading} ---")
    for p in paragraphs:
        print("-", p)

print(f"\n   *FIRST 10 INTERNAL LINKS ({len(web['links'])} total)*   ")
for link in web["links"][:10]:
    print("-", link)


  *TITLE*   
NIGERIAN CUISINE

--- Introduction ---
- Nigerian cuisine consists of dishes or food items from the hundreds of Native African ethnic groups that comprise Nigeria. Like other West African cuisines, it uses spices and herbs with palm oil or groundnut oil to create deeply flavored sauces and soups.
- Nigerian feasts can be colourful and lavish, while aromatic market and roadside snacks cooked on barbecues or fried in oil are in abundance and varied. Bushmeat is also consumed in Nigeria. The brush-tailed porcupine and cane rats are the most popular bushmeat species in Nigeria.
- Tropical fruits such as watermelon, pineapple, coconut, banana, orange, papaya and mango are mostly consumed in Nigeria.
- Nigerian cuisine, like many West African cuisines, is known for being savoury and spicy.

--- Rice-based ---
- Coconut rice is rice made with coconut milk,  and other spices.
- Jollof rice is a rice dish made with pureed tomato and Scotch bonnet-based sauce.
- Ofada rice is a popu