# Web Scraping Script Documentation

## Purpose

This Python script has been developed as part of Task 1 of the internship project, adhering to the following guidelines:

1. **Website Selection:**
   - The script scrapes the English Wikipedia main page (`https://en.wikipedia.org/`), a publicly accessible source of information.

2. **Web Scraping:**
   - Using the Requests and Beautiful Soup libraries, the script fetches the page and extracts the section headings (`h2` to `h6`) found inside its `bodyContent` element.

3. **Data Processing:**
   - The script counts the words in each heading's text and adds a total row. It does not count the words of the content that follows each heading.

4. **Automation:**
   - Automation is handled by the `schedule` library. The intended interval is every 24 hours, so the results are refreshed regularly; note that the code shown in section 4 registers the job every 2 minutes.

5. **Documentation:**
   - The script includes comprehensive documentation to explain its purpose, execution steps, components, dependencies, and any additional notes.


### 1. Import Necessary Libraries:

```python
import requests
from bs4 import BeautifulSoup
import re
from tabulate import tabulate
import schedule
import time
```

## Imported Libraries

1. **requests**: Sends HTTP requests; here it fetches the Wikipedia page.

2. **BeautifulSoup** (imported from the `bs4` package): Parses the HTML response and locates elements such as headings.

3. **re**: Python's standard-library regular expression module, used here to find the words in a heading.

4. **tabulate**: Formats the list of rows as a text table for display.

5. **schedule**: Registers a function to run at a fixed interval.

6. **time**: Standard-library module; `time.sleep` pauses the scheduling loop between checks.

### 2. Define Word Count Function

```python
def count_words(text):
    """
    Function to count the number of words in a given text.

    Parameters:
    - text (str): The input text.

    Returns:
    - int: The number of words in the text.
    """
    words = re.findall(r'\b\w+\b', text)
    return len(words)
```

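The `count_words` function uses the regular expression `\b\w+\b` to find every run of word characters (letters, digits, and underscores) in the text and returns how many it found. Punctuation separates words, so a contraction such as "today's" counts as two words.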
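To make the behavior of the pattern concrete, the following sketch runs `count_words` on two headings from the Wikipedia main page. The second function, `count_words_keep_apostrophes`, is a hypothetical variant that is not part of the original script; it assumes contractions should count as a single word.

```python
import re

def count_words(text):
    """Count runs of word characters, as in the script above."""
    return len(re.findall(r'\b\w+\b', text))

def count_words_keep_apostrophes(text):
    """Hypothetical variant: an apostrophe inside a word is part of the word."""
    return len(re.findall(r"\b\w+(?:['’]\w+)*\b", text))

heading = "From today's featured article"
print(count_words(heading))                   # 5: "today's" splits into "today" and "s"
print(count_words_keep_apostrophes(heading))  # 4
print(count_words("Did you know ..."))        # 3: the dots are not word characters
```

Which count is preferable depends on how the word totals will be used; the original pattern is kept in the script.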


### 3. Define Web Scraping Functions:

```python
def scrape_and_display():
    """
    Function to initiate web scraping and display the word count of subheadings on the Wikipedia page.
    """
    url = "https://en.wikipedia.org/"
    scrapeWikiArticle(url)

def scrapeWikiArticle(url):
    """
    Function to perform web scraping on the provided Wikipedia URL, count words, and display the results.

    Parameters:
    - url (str): The URL of the Wikipedia page.
    """
    try:
        # A timeout keeps a stalled connection from blocking the scheduling loop
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error making the request: {e}")
        return

    soup = BeautifulSoup(response.content, 'html.parser')

    # Only headings inside the element with id="bodyContent" are considered
    bodyContent = soup.find(id="bodyContent")
    if bodyContent:
        subheadings = bodyContent.find_all(['h2', 'h3', 'h4', 'h5', 'h6'])

        data = []
        for subheading in subheadings:
            subheading_text = subheading.get_text(strip=True)
            subheading_words = count_words(subheading_text)
            subheading_level = subheading.name

            data.append([subheading_level, subheading_text, subheading_words])

        # Append a final row with the sum of all heading word counts
        total_words = sum(row[2] for row in data)
        data.append(["Total", "", total_words])

        headers = ["Level", "Subheading", "Number of Words"]
        print(tabulate(data, headers=headers, tablefmt="grid"))
    else:
        print("Body content not found")
```


## Explanation:

Two functions handle the scraping and the display:

- `scrape_and_display` sets the URL (the Wikipedia main page) and calls `scrapeWikiArticle`.
- `scrapeWikiArticle` requests the page, finds every `h2` to `h6` heading inside the `bodyContent` element, counts the words in each heading's text, appends a total row, and prints the table with `tabulate`. If the request fails or `bodyContent` is missing, it prints a message and returns without printing a table.
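The row-building step inside `scrapeWikiArticle` can be checked without a network request. The sketch below assumes the headings have already been extracted as `(level, text)` pairs; `build_rows` is a hypothetical helper used for illustration, not a function from the original script.

```python
import re

def count_words(text):
    """Count runs of word characters, as in the script above."""
    return len(re.findall(r'\b\w+\b', text))

def build_rows(headings):
    """Return one [level, text, word_count] row per heading plus a final total row.

    `headings` is an iterable of (level, text) pairs, e.g. ("h2", "In the news").
    """
    rows = [[level, text, count_words(text)] for level, text in headings]
    total = sum(row[2] for row in rows)
    rows.append(["Total", "", total])
    return rows

sample = [("h2", "In the news"), ("h2", "On this day")]
for row in build_rows(sample):
    print(row)
# ['h2', 'In the news', 3]
# ['h2', 'On this day', 3]
# ['Total', '', 6]
```

The returned list has the same shape as `data` in the script, so it could be passed to `tabulate` unchanged.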

### 4. Schedule the Task:

```python
# Schedule the task to run every 2 minutes
schedule.every(2).minutes.do(scrape_and_display)
```

Output:

```
Every 2 minutes do scrape_and_display() (last run: [never], next run: 2023-12-11 18:45:29)
```

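## Explanation:
This part of the script uses the `schedule` library to register `scrape_and_display` as a recurring job that runs every 2 minutes. Registering the job does not run it; the job only executes when `schedule.run_pending()` is called, as in the next step. For the 24-hour interval described under Purpose, the call would be `schedule.every(24).hours.do(scrape_and_display)`.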

### 5. Keep the Script Running:

```python
# Keep the script running to allow scheduled tasks
while True:
    schedule.run_pending()
    time.sleep(1)
```

Output (truncated when the run was interrupted):

```
+---------+-------------------------------+-------------------+
| Level   | Subheading                    |   Number of Words |
+=========+===============================+===================+
| h2      | From today's featured article |                 5 |
+---------+-------------------------------+-------------------+
| h2      | Did you know ...              |                 3 |
+---------+-------------------------------+-------------------+
| h2      | In the news                   |                 3 |
+---------+-------------------------------+-------------------+
| h2      | On this day                   |                 3 |
+---------+-------------------------------+-------------------+
| h2      | From today's featured list    |                 5 |
+---------+-------------------------------+-------------------+
| h2      | Today's featured picture      |                 4 |
+---------+-------------------------------+-------------------+
| h2      | Other areas of Wikipedia      |                 4 |
+---------+-----------------------------

KeyboardInterrupt: 
```

## Explanation:
The script enters an infinite loop so that the scheduled job can run. Each iteration calls `schedule.run_pending()`, which executes any job that is due, and then sleeps for 1 second to avoid busy-waiting. The loop runs until the process is interrupted, for example with Ctrl+C, which raises the `KeyboardInterrupt` shown in the output above.
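If the traceback on interruption is unwanted, the loop can be wrapped so that Ctrl+C ends it quietly. The following is a minimal sketch, not part of the original script: `run_scheduler` and its parameters are hypothetical names, and the function takes the "run pending jobs" callable as an argument so it can be tried without the `schedule` library or real waiting.

```python
import time

def run_scheduler(run_pending, poll_seconds=1.0, sleep=time.sleep, max_iterations=None):
    """Call `run_pending` in a loop until interrupted or `max_iterations` is reached.

    Returns the number of completed iterations.
    """
    iterations = 0
    try:
        while max_iterations is None or iterations < max_iterations:
            run_pending()
            sleep(poll_seconds)
            iterations += 1
    except KeyboardInterrupt:
        # Ctrl+C ends the loop without printing a traceback
        print("Scheduler stopped.")
    return iterations

# Demonstration with a stand-in for schedule.run_pending and no real waiting
calls = []
completed = run_scheduler(lambda: calls.append("tick"),
                          sleep=lambda seconds: None,
                          max_iterations=3)
print(completed, calls)  # 3 ['tick', 'tick', 'tick']
```

In the script itself the equivalent call would be `run_scheduler(schedule.run_pending)`, which loops until interrupted.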

# Dependencies

The script relies on the following dependencies to execute successfully:

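1. **Python 3.x** (the `re` and `time` modules used by the script are part of the standard library)

2. **Requests Library** (package `requests`)

3. **BeautifulSoup Library** (package `beautifulsoup4`, imported as `bs4`)

4. **Tabulate Library** (package `tabulate`)

5. **Schedule Library** (package `schedule`)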
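Before running the script, it can help to confirm that the third-party modules are importable. The sketch below uses only the standard library; `missing_modules` and `REQUIRED_MODULES` are hypothetical names introduced here for illustration.

```python
import importlib.util

# Import names used by the script; the pip package that provides bs4 is beautifulsoup4
REQUIRED_MODULES = ["requests", "bs4", "tabulate", "schedule"]

def missing_modules(names):
    """Return the names from `names` that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All third-party modules are available.")
```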



## Notes:
The script fetches the headings from the Wikipedia main page, counts the words in each heading, and prints the results as a table. The `bodyContent` id and the heading tags are specific to Wikipedia's page layout, so other websites or data processing requirements will need different selectors.
