1. Install and import the necessary libraries:
   - `requests`: Used to send HTTP requests to the webpage.
   - `BeautifulSoup`: A library for parsing HTML content and navigating the document.

In [None]:
!pip install beautifulsoup4

In [None]:
!pip install requests

2. Define the target URL: The script specifies the URL of the Wikipedia Main Page.

3. Send an HTTP GET request to the URL using the `requests.get(url)` method and store the response in the `response` variable.

4. Check the response status using `print(response)`. Printing the response object shows its status code, for example `<Response [200]>` for a successful request.

In [None]:
import requests

url = "https://en.wikipedia.org/wiki/Main_Page"

response = requests.get(url)

print(response)

5. Create a BeautifulSoup object to parse the HTML content of the page. `html_content` stores the raw HTML text, and `soup` is the BeautifulSoup object built from it with Python's built-in `html.parser`.
6. Optionally, call `soup.prettify()` to get the HTML back as an indented string that is easier to read.

In [None]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Main_Page"

response = requests.get(url)
html_content = response.text

soup = BeautifulSoup(html_content, "html.parser")

# prettify() returns a string; print it so the indentation is rendered
print(soup.prettify())

7. Extract Featured Articles:
   - Use `soup.find_all("div", class_="mp-featured-article")` to locate all `div` elements with the specified class. If no element matches, `find_all` returns an empty list.
   - Iterate through the found elements, read each article title with `.find("h2").text`, and print it.

8. Extract "Did You Know" items:
   - Use `soup.find("div", id="mp-dyk")` to locate the section containing the "Did You Know" items. If the section is not found, `find` returns `None`.
   - Iterate through the list items within this section using `.find_all("li")`, and print each item's text.
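
The two lookups in steps 7 and 8 can be tried offline before running them against the live page. The sketch below is only an illustration: `sample_html` is hypothetical markup written to match the selectors used in this notebook, not a copy of the real Wikipedia page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure the selectors expect.
sample_html = (
    '<div class="mp-featured-article"><h2>Example featured article</h2></div>'
    '<div id="mp-dyk"><ul>'
    "<li>... that the first fact is a placeholder?</li>"
    "<li>... that the second fact is one too?</li>"
    "</ul></div>"
)

soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching element (an empty result if none match).
titles = [
    div.find("h2").get_text(strip=True)
    for div in soup.find_all("div", class_="mp-featured-article")
]

# find returns only the first match, or None if nothing matches.
dyk_section = soup.find("div", id="mp-dyk")
facts = [li.get_text(strip=True) for li in dyk_section.find_all("li")]

print(titles)
print(facts)
```

Working on a small inline string makes it easy to see what each call returns before pointing the same code at a real response.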


In [None]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Main_Page"

response = requests.get(url)
html_content = response.text

soup = BeautifulSoup(html_content, "html.parser")

# Example: Extracting all the featured article titles
featured_articles = soup.find_all("div", class_="mp-featured-article")

for article in featured_articles:
    title = article.find("h2").text
    print(title)

# Example: Extracting the "Did you know" section
did_you_know = soup.find("div", id="mp-dyk")

for item in did_you_know.find_all("li"):
    print(item.text)

9. Error Handling:
   - The request is wrapped in a try-except block. `response.raise_for_status()` raises `requests.exceptions.HTTPError` for 4xx and 5xx responses; the `except` clause catches it, prints the error, and exits.

10. Exception Handling:
   - A second try-except block follows the first and wraps the parsing code. It catches the `AttributeError` raised when an expected element is missing, for example when `soup.find(...)` returns `None`.
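
As a possible variation on steps 9 and 10, the two concerns can be split into small functions. This is a sketch under a few assumptions, not the notebook's own method: the helper names `fetch_html` and `extract_did_you_know` and the 10-second timeout are hypothetical, the code catches `requests.exceptions.RequestException` (the base class that also covers connection failures and timeouts) instead of only `HTTPError`, and it checks for `None` explicitly rather than relying on `AttributeError`.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Return the page HTML, or None if the request fails (hypothetical helper)."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    except requests.exceptions.RequestException as error:
        # RequestException is the base class of HTTPError, ConnectionError, Timeout
        print(f"Request failed: {error}")
        return None
    return response.text


def extract_did_you_know(html):
    """Return the "Did you know" items, or [] if the section is missing."""
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find("div", id="mp-dyk")
    if section is None:  # find() returns None when nothing matches
        return []
    return [item.get_text(strip=True) for item in section.find_all("li")]
```

A caller can then write `html = fetch_html(url)` and only call `extract_did_you_know(html)` when `html` is not `None`, which avoids calling `exit()` inside a notebook.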


In [None]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Main_Page"

try:
    response = requests.get(url)
    response.raise_for_status()
except requests.exceptions.HTTPError as error:
    print(error)
    exit()

html_content = response.text

try:
    soup = BeautifulSoup(html_content, "html.parser")
    featured_articles = soup.find_all("div", class_="mp-featured-article")
    for article in featured_articles:
        title = article.find("h2").text
        print(title)

    did_you_know = soup.find("div", id="mp-dyk")
    for item in did_you_know.find_all("li"):
        print(item.text)
except AttributeError as error:
    print(error)
    exit()

11. Saving Data:
   - The extracted data, including featured article titles and "Did You Know" items, is stored in a Python dictionary called `data`.
   - This data is then saved as a JSON file named "data.json" using the `json.dump(data, outfile)` method.
   
12. The script keeps the error handling from the previous steps: it prints the error and exits if the HTTP request fails or an expected element is missing from the parsed HTML.
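
The saving step can also be tried on its own. The sketch below uses placeholder values instead of scraped results and writes to a hypothetical file name, `data_example.json`, so it does not overwrite the notebook's `data.json`. The `indent`, `ensure_ascii`, and `encoding` arguments are optional additions that make the file easier to read.

```python
import json

# Placeholder values standing in for the scraped results.
data = {
    "featured_articles": ["Example featured article"],
    "did_you_know": ["... that this café fact is a placeholder?"],
}

# ensure_ascii=False writes non-ASCII characters as-is, so the file is opened as UTF-8.
with open("data_example.json", "w", encoding="utf-8") as outfile:
    json.dump(data, outfile, ensure_ascii=False, indent=2)

# Read the file back to confirm the data survives the round trip.
with open("data_example.json", encoding="utf-8") as infile:
    loaded = json.load(infile)

print(loaded == data)
```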

In [None]:
import requests
from bs4 import BeautifulSoup
import json

url = "https://en.wikipedia.org/wiki/Main_Page"

try:
    response = requests.get(url)
    response.raise_for_status()
except requests.exceptions.HTTPError as error:
    print(error)
    exit()

html_content = response.text

try:
    soup = BeautifulSoup(html_content, "html.parser")
    featured_articles = soup.find_all("div", class_="mp-featured-article")
    article_titles = [article.find("h2").text for article in featured_articles]

    did_you_know = soup.find("div", id="mp-dyk")
    items = [item.text for item in did_you_know.find_all("li")]

    data = {"featured_articles": article_titles, "did_you_know": items}

    with open("data.json", "w") as outfile:
        json.dump(data, outfile)

except AttributeError as error:
    print(error)
    exit()

Ethical web scraping involves responsible and respectful behavior towards websites, their data, and their users. It's crucial to strike a balance between extracting valuable information and respecting the rights and wishes of website owners and data subjects.

Respecting Robots.txt:

Always check a website's `robots.txt` file, served at the site root (for example `https://en.wikipedia.org/robots.txt`), before scraping, and follow the rules it specifies for your user agent.
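
One way to automate this check is Python's standard `urllib.robotparser` module. The sketch below parses a small set of hypothetical rules offline; the user agent name `my-scraper` and the `example.com` URLs are placeholders. A real script would instead point the parser at the site's own file with `set_url(...)` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; a real robots.txt would be downloaded from the site.
example_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(example_rules)

# can_fetch reports whether the given user agent may request the URL.
allowed = parser.can_fetch("my-scraper", "https://example.com/wiki/Main_Page")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data")

print(allowed, blocked)
```

Running a check like this before each request lets the scraper skip any URL the site has asked automated clients to avoid.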