
# ***Scraping text from a Wikipedia page using Beautiful Soup***


**Instructions**

After watching the video below, you will be able to:

➡️ https://www.youtube.com/watch?v=YY5skv756pc

1.1) Write a function to get and parse HTML content from a Wikipedia page

1.2) Write a function to extract the article title

1.3) Write a function to extract the article text paragraph by paragraph, mapping each heading to its paragraphs in a dictionary

1.4) Write a function to collect every link that leads to another Wikipedia page

1.5) Wrap all the previous functions into a single function that takes a Wikipedia link as its parameter

1.6) Test the last function on a Wikipedia page of your choice



* # **Write a function to get and parse HTML content from a Wikipedia page**




In [2]:
import requests
from bs4 import BeautifulSoup


# 1.1) Function to Get and Parse HTML Content
def get_html_content(url):
    """Fetch and parse HTML content from a Wikipedia page."""
    # A timeout prevents the call from hanging indefinitely on a stalled connection
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup
    else:
        raise Exception(f"Failed to retrieve content. Status code: {response.status_code}")
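
The function above sends a bare request with the library defaults. As a hedged variation, the sketch below adds a descriptive User-Agent header (Wikimedia's User-Agent policy asks automated clients to identify themselves) and lets `requests` raise on error statuses. The names `fetch_soup` and `USER_AGENT`, the header value, and the optional `session` parameter are assumptions made for illustration, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical identifier: replace with your own project name and contact details
USER_AGENT = "WikiScraperCheckpoint/0.1 (educational example)"


def fetch_soup(url, session=None, timeout=10):
    """Fetch a page and return it parsed as a BeautifulSoup object."""
    # Accept an injected session so the function can also be exercised offline
    http = session if session is not None else requests.Session()
    response = http.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

Passing a session is optional: `fetch_soup('https://en.wikipedia.org/wiki/Data_science')` creates its own session, while a shared `requests.Session()` can be reused when several pages are fetched in a row.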



* # **Write a function to extract the article title**



In [3]:
# 1.2) Function to Extract Article Title
def extract_title(soup):
    """Extract the title of the Wikipedia article."""
    title = soup.find('h1', id='firstHeading').text
    return title.strip()
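
`extract_title` assumes the page has an `<h1 id="firstHeading">` element and fails with an `AttributeError` when it does not. A minimal sketch of a more forgiving variant follows, assuming that falling back to the page's `<title>` tag is acceptable; the name `extract_title_safe` is hypothetical.

```python
from bs4 import BeautifulSoup


def extract_title_safe(soup):
    """Return the article title, or None when no title can be found."""
    heading = soup.find("h1", id="firstHeading")
    if heading is not None:
        # get_text() keeps the spaces between nested tags such as <i>...</i>
        return heading.get_text().strip()
    # Fallback: the <title> tag, which usually also carries a site suffix
    if soup.title is not None and soup.title.string:
        return soup.title.string.strip()
    return None


demo = BeautifulSoup("<h1 id='firstHeading'> <i>Data</i> science </h1>", "html.parser")
print(extract_title_safe(demo))
```

Returning `None` instead of raising lets the caller decide whether a missing title is an error.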



* # **Write a function to extract the article text paragraph by paragraph, mapping each heading to its paragraphs in a dictionary**


In [4]:
# 1.3) Function to Extract Article Text by Headings
def extract_text_by_headings(soup):
    """Extract paragraphs under their respective headings."""
    content = soup.find('div', {'class': 'mw-parser-output'})
    if content is None:
        # Not a standard article layout: there is nothing to extract
        return {}
    result = {}
    current_heading = None

    # Iterate over elements in the content area
    for element in content.find_all(['h2', 'h3', 'h4', 'h5', 'h6', 'p']):
        if element.name in ['h2', 'h3', 'h4', 'h5', 'h6']:
            # Extract heading text
            current_heading = element.text.strip()
            result[current_heading] = []
        elif element.name == 'p' and current_heading:
            # Append paragraph text to the current heading
            result[current_heading].append(element.text.strip())

    # Remove empty sections
    result = {heading: paragraphs for heading, paragraphs in result.items() if paragraphs}
    return result
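
Because `current_heading` starts as `None`, the function above skips every paragraph that appears before the first heading, which on an article page is the lead section. The sketch below is one possible variation, not the notebook's method: it files those paragraphs under a hypothetical `"Introduction"` key, ignores empty paragraphs, and merges sections that share a heading instead of overwriting them. The names `extract_sections`, `HEADING_TAGS`, and `lead_key` are assumptions, and the HTML is a hand-written sample rather than real page content.

```python
from bs4 import BeautifulSoup

HEADING_TAGS = ["h2", "h3", "h4", "h5", "h6"]


def extract_sections(soup, lead_key="Introduction"):
    """Map each heading to its non-empty paragraphs, keeping the lead section."""
    content = soup.find("div", {"class": "mw-parser-output"})
    if content is None:
        return {}
    sections = {}
    current = lead_key  # hypothetical label for text before the first heading
    for element in content.find_all(HEADING_TAGS + ["p"]):
        if element.name in HEADING_TAGS:
            current = element.get_text().strip()
            sections.setdefault(current, [])
        else:
            text = element.get_text().strip()
            if text:
                sections.setdefault(current, []).append(text)
    # Drop headings that collected no paragraphs
    return {heading: paras for heading, paras in sections.items() if paras}


# Hand-written sample markup, not scraped data
html = """
<div class="mw-parser-output">
  <p>Lead paragraph.</p>
  <h2>History</h2>
  <p>First.</p><p>  </p><p>Second.</p>
  <h2>See also</h2>
</div>
"""
print(extract_sections(BeautifulSoup(html, "html.parser")))
```

For this sample the result has two keys, `Introduction` and `History`; `See also` is dropped because it has no paragraphs.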



* # **Write a function to collect every link that redirects to another Wikipedia page**

In [5]:
# 1.4) Function to Collect Links Redirecting to Other Wikipedia Pages
def extract_links(soup):
    """Collect every link that redirects to another Wikipedia page."""
    links = []
    for link in soup.find_all('a', href=True):
        href = link['href']
        if href.startswith('/wiki/') and ':' not in href:
            full_url = 'https://en.wikipedia.org' + href
            links.append(full_url)
    return links
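
`extract_links` returns every matching anchor, so a page that links to the same article several times yields duplicates. A minimal sketch of a de-duplicating variant is shown here, assuming that links differing only by their `#fragment` should count as one page; `extract_unique_links` and `BASE_URL` are hypothetical names, and the markup in the usage lines is hand-written.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://en.wikipedia.org"


def extract_unique_links(soup, base_url=BASE_URL):
    """Collect internal article links once each, in order of first appearance."""
    seen = {}  # dict keys preserve insertion order
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        # Skip external links and namespaced pages such as /wiki/File:...
        if not href.startswith("/wiki/") or ":" in href:
            continue
        # Drop the #fragment so links to sections of one page collapse
        path = href.split("#", 1)[0]
        seen.setdefault(urljoin(base_url, path), None)
    return list(seen)


sample = BeautifulSoup(
    '<a href="/wiki/Statistics">a</a>'
    '<a href="/wiki/Statistics#History">b</a>'
    '<a href="/wiki/File:X.png">c</a>'
    '<a href="/wiki/Big_data">d</a>',
    "html.parser",
)
print(extract_unique_links(sample))
```

Using a dictionary rather than a set keeps the links in the order they appear on the page.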




* # **Wrap all the previous functions into a single function that takes a Wikipedia link as its parameter**

In [6]:
# 1.5) Wrapper Function
def parse_wikipedia_page(url):
    """Wrapper function to parse a Wikipedia page and return title, content, and links."""
    soup = get_html_content(url)

    # 1.2) Print the Article Title
    title = extract_title(soup)
    print(f"Title: {title}\n")

    # 1.3) Print Text by Headings
    text_by_headings = extract_text_by_headings(soup)
    print("Text by Headings:")
    for heading, paragraphs in text_by_headings.items():
        print(f"\n{heading}")
        for paragraph in paragraphs:
            print(paragraph)

    # 1.4) Print the Links
    links = extract_links(soup)
    print("\nLinks:")
    for link in links:
        print(link)

    # Return the parsed result
    return {
        'title': title,
        'text_by_headings': text_by_headings,
        'links': links
    }
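
The wrapper prints every paragraph and link, which can run to thousands of lines for a long article. As an optional companion, the sketch below condenses the returned dictionary into a few counts. `summarize_result` is a hypothetical helper, and the `sample` dictionary is hand-made to mirror the keys the wrapper returns; it is not real scraped data.

```python
def summarize_result(result):
    """Return a compact overview of a parse_wikipedia_page() result dictionary."""
    sections = result["text_by_headings"]
    return {
        "title": result["title"],
        "section_count": len(sections),
        "paragraph_count": sum(len(paras) for paras in sections.values()),
        "link_count": len(result["links"]),
        "unique_link_count": len(set(result["links"])),
    }


# Hand-made sample with the same keys the wrapper returns (not scraped data)
sample = {
    "title": "Data science",
    "text_by_headings": {
        "Foundations": ["first paragraph", "second paragraph"],
        "Etymology": ["third paragraph"],
    },
    "links": [
        "https://en.wikipedia.org/wiki/Statistics",
        "https://en.wikipedia.org/wiki/Big_data",
        "https://en.wikipedia.org/wiki/Statistics",
    ],
}
print(summarize_result(sample))
```

Comparing `link_count` with `unique_link_count` shows how many of the collected links are repeats.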


* # **Test the last function on a Wikipedia page of your choice**

In [7]:
# 1.6) Testing the Function
if __name__ == "__main__":
    # Testing with the Wikipedia page for 'Data science'
    url = 'https://en.wikipedia.org/wiki/Data_science'
    print("Testing parse_wikipedia_page() on:", url, "\n")
    result = parse_wikipedia_page(url)

Testing parse_wikipedia_page() on: https://en.wikipedia.org/wiki/Data_science 

Title: Data science

Text by Headings:

Foundations
Data science is an interdisciplinary field[10] focused on extracting knowledge from typically large data sets and applying the knowledge and insights from that data to solve problems in a wide range of application domains. The field encompasses preparing data for analysis, formulating data science problems, analyzing data, developing data-driven solutions, and presenting findings to inform high-level decisions in a broad range of application domains. As such, it incorporates skills from computer science, statistics, information science, mathematics, data visualization, information visualization, data sonification, data integration, graphic design, complex systems, communication and business.[11][12] Statistician Nathan Yau, drawing on Ben Fry, also links data science to human–computer interaction: users should be able to intuitively control and explore dat