In [4]:
!pip install requests
!pip install beautifulsoup4



In [6]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def crawl_webpage(url, max_depth=3, allowed_domains=None):
    visited_urls = set()  # Set to track visited URLs
    urls_to_visit = [(url, 0)]  # Initialize the list of URLs to visit, starting from the input URL
    
    while urls_to_visit:
        current_url, depth = urls_to_visit.pop(0)  # Pop the next URL to visit
        if current_url in visited_urls or depth > max_depth:  # Skip if already visited or max depth exceeded
            continue
        
        print(f"Crawling: {current_url} (Depth: {depth})")  # Print the URL being crawled
        visited_urls.add(current_url)  # Mark as visited up front so a failing URL is not retried
        try:
            response = requests.get(current_url, timeout=10)  # Fetch the page; the timeout prevents hanging on an unresponsive server
            if response.status_code == 200:  # If the page was fetched successfully
                soup = BeautifulSoup(response.text, "html.parser")  # Parse the page content
                
                # Extract all the links from the page
                for link in soup.find_all("a"):
                    link_href = link.get("href")
                    if link_href:
                        absolute_link = urljoin(current_url, link_href)  # Get the absolute URL
                        parsed_link = urlparse(absolute_link)  # Parse the link to check its domain
                        
                        # Print the found link
                        print(f"Found link: {absolute_link}")
                        
                        # If allowed_domains is specified, queue the link only if its host is an
                        # allowed domain or a subdomain of one (e.g. "www.bbc.com" matches "bbc.com")
                        if allowed_domains:
                            host = (parsed_link.hostname or "").lower()
                            if any(host == d or host.endswith("." + d) for d in allowed_domains):
                                urls_to_visit.append((absolute_link, depth + 1))  # Add the link to visit
                        else:
                            # If no domain filter, visit any link
                            urls_to_visit.append((absolute_link, depth + 1))
        except Exception as e:
            print(f"Error crawling {current_url}: {e}")  # Handle exceptions

# Start crawling from a specific URL and limit the crawl to specific websites
start_url = "https://www.bbc.com"
allowed_domains = ["bbc.com"]  # Only crawl within bbc.com
crawl_webpage(start_url, max_depth=3, allowed_domains=allowed_domains)

Crawling: https://www.bbc.com (Depth: 0)
Found link: https://www.bbc.com#main-content
Found link: https://www.bbc.com/
Found link: https://www.bbc.com/
Found link: https://www.bbc.com/news
Found link: https://www.bbc.com/sport
Found link: https://www.bbc.com/business
Found link: https://www.bbc.com/innovation
Found link: https://www.bbc.com/culture
Found link: https://www.bbc.com/arts
Found link: https://www.bbc.com/travel
Found link: https://www.bbc.com/future-planet
Found link: https://www.bbc.com/video
Found link: https://www.bbc.com/live
Found link: https://www.bbc.com/home
Found link: https://www.bbc.com/news
Found link: https://www.bbc.com/news/topics/c2vdnvdg6xxt
Found link: https://www.bbc.com/news/war-in-ukraine
Found link: https://www.bbc.com/news/us-canada
Found link: https://www.bbc.com/news/uk
Found link: https://www.bbc.com/news/politics
Found link: https://www.bbc.com/news/england
Found link: https://www.bbc.com/news/northern_ireland
Found link: https://www.bbc.com/news/

In [None]:
Explanation of the Code:

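URL Management: The function crawl_webpage() takes an initial URL, max_depth (the maximum link depth to follow), and allowed_domains (an optional list of domains to restrict the crawl to). It records crawled pages in the visited_urls set and keeps pending (URL, depth) pairs in the urls_to_visit list.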

Fetching Web Pages: requests.get(current_url) fetches the page. If the response status is 200, BeautifulSoup parses the HTML, and the href of every <a> tag is resolved to an absolute URL with urljoin.
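
The link-extraction step can be tried without any network access by parsing a small HTML string. The sketch below is an illustration only: extract_links and the sample HTML are hypothetical names and data, and the function additionally strips #fragment suffixes with urldefrag, skips non-HTTP schemes, and removes duplicates, which the crawler cell does not do.

```python
from urllib.parse import urldefrag, urljoin, urlparse

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return the unique absolute http(s) links in html, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(base_url, anchor["href"])  # resolve relative hrefs
        absolute, _fragment = urldefrag(absolute)  # drop "#main-content" and similar
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar links
        if absolute not in links:
            links.append(absolute)
    return links


# Hypothetical sample page, loosely modelled on the links in the output above.
sample_html = """
<a href="#main-content">Skip to content</a>
<a href="/news">News</a>
<a href="/news">News (again)</a>
<a href="mailto:someone@example.com">Contact</a>
"""
print(extract_links(sample_html, "https://www.bbc.com/"))
```

Normalizing links this way would collapse entries such as https://www.bbc.com#main-content and the repeated https://www.bbc.com/news lines seen in the crawl output into one URL each.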

Link Filtering: urlparse extracts the host from each link so that, when allowed_domains is provided, only links on those domains are queued for crawling.
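
One detail matters when filtering by domain: urlparse returns the full host name, so the host of https://www.bbc.com/news is www.bbc.com, which is not equal to the string "bbc.com". A minimal sketch of a subdomain-aware check follows; is_allowed is a hypothetical helper name, not part of the notebook.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """True if url's host equals an allowed domain or is a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


print(urlparse("https://www.bbc.com/news").netloc)          # www.bbc.com
print(is_allowed("https://www.bbc.com/news", ["bbc.com"]))  # True
print(is_allowed("https://notbbc.com/", ["bbc.com"]))       # False
```

Matching on "." + domain rather than a plain suffix keeps look-alike hosts such as notbbc.com out of the crawl.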

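Crawl Depth: Each discovered link is appended to urls_to_visit with depth + 1, and any URL whose depth exceeds max_depth is skipped when it is taken off the list. Because URLs are removed from the front of the list, the crawl is breadth-first: all pages at one depth are visited before any page at the next.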
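
The depth limit is easier to see on a small in-memory link graph than on a live site. The following sketch mirrors the queue logic of crawl_webpage() under that simplification; crawl_order and toy_links are hypothetical names, and collections.deque is used in place of a list so that removing from the front is cheap.

```python
from collections import deque


def crawl_order(start, links, max_depth):
    """Breadth-first traversal of a link graph, returning (page, depth) pairs."""
    visited = set()
    queue = deque([(start, 0)])
    order = []
    while queue:
        page, depth = queue.popleft()
        if page in visited or depth > max_depth:
            continue  # already seen, or deeper than allowed
        visited.add(page)
        order.append((page, depth))
        for target in links.get(page, []):
            queue.append((target, depth + 1))
    return order


# Hypothetical site structure: each key links to the pages in its list.
toy_links = {
    "/": ["/news", "/sport"],
    "/news": ["/news/uk", "/"],
    "/news/uk": ["/news/uk/article"],
}
print(crawl_order("/", toy_links, max_depth=1))
# [('/', 0), ('/news', 1), ('/sport', 1)]
print(crawl_order("/", toy_links, max_depth=2))
# [('/', 0), ('/news', 1), ('/sport', 1), ('/news/uk', 2)]
```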

Additional Considerations:

User Input: You can modify the code to take user input for the search term (e.g., news topic) and filter links based on that.
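
One simple way to filter by topic is a case-insensitive substring match against each link's URL and anchor text. This is only a sketch: matches_topic is a hypothetical helper, and the anchor texts below are made up for illustration (the URLs are taken from the crawl output above).

```python
def matches_topic(link_url, link_text, topic):
    """Case-insensitive check for topic in the link's URL or anchor text."""
    needle = topic.lower()
    return needle in link_url.lower() or needle in link_text.lower()


# (URL, anchor text) pairs; the anchor texts are assumed, not crawled.
found = [
    ("https://www.bbc.com/news/war-in-ukraine", "War in Ukraine"),
    ("https://www.bbc.com/sport", "Sport"),
    ("https://www.bbc.com/news/politics", "Politics"),
]
topic = "ukraine"  # in a notebook this could come from input("Topic: ")
relevant = [url for url, text in found if matches_topic(url, text, topic)]
print(relevant)
# ['https://www.bbc.com/news/war-in-ukraine']
```

Substring matching is crude (it cannot tell a passing mention from a story about the topic), so it is best treated as a first-pass filter.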

Data Storage: For a real-world application, you would typically store the crawled data (like news articles) into a database or file system for further processing (e.g., sentiment analysis or topic classification).
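
As a sketch of the storage step, the standard-library sqlite3 module can hold crawled pages in a single table keyed by URL. The table layout, the save_pages helper, and the placeholder rows are all assumptions made for illustration; the notebook itself does not store anything.

```python
import sqlite3


def save_pages(conn, pages):
    """Insert (url, title, body) rows, replacing any row with the same URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, title, body) VALUES (?, ?, ?)",
        pages,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # pass a file path instead to persist between runs
save_pages(conn, [
    ("https://www.bbc.com/news", "News", "placeholder body text"),
    ("https://www.bbc.com/sport", "Sport", "placeholder body text"),
])
# Saving the same URL again updates the existing row instead of duplicating it.
save_pages(conn, [("https://www.bbc.com/news", "News (updated)", "placeholder body text")])
print(conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0])  # 2
```

Using the URL as the primary key means a re-crawl overwrites stale copies, which keeps later analysis from counting the same article twice.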


Web crawling refers to the process of automatically navigating the web and collecting information from websites. 
This technique is essential for search engines, data extraction, and many other applications where automated access to the internet is required.

When it comes to gathering news stories on a specific topic, the process typically involves the following steps:

Web Crawling: Using an automated crawler to visit websites and extract news articles based on the given topic.
HTML Parsing: Extracting relevant content (e.g., headlines, articles) from the HTML structure of the pages using a parser.
Filtering: Allowing the user to limit the crawl to specific websites or domains. This is particularly important when focusing on collecting news from trusted sources.
Storing Results: Saving the collected news stories for further processing, analysis, or display.
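
The HTML parsing step in the list above can be sketched on a static snippet. The example assumes, purely for illustration, that headlines live in h1 to h3 tags; real news sites differ, so the tag names (and the extract_headlines helper itself) would need adapting to the markup of each source.

```python
from bs4 import BeautifulSoup


def extract_headlines(html):
    """Return the text of every h1/h2/h3 element, with whitespace collapsed."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        " ".join(heading.get_text().split())
        for heading in soup.find_all(["h1", "h2", "h3"])
    ]


# Hypothetical page structure, not copied from any real site.
sample_page = """
<html><body>
  <h1>Top stories</h1>
  <article><h2>  First example headline </h2><p>Body text.</p></article>
  <article><h2>Second example headline</h2><p>More text.</p></article>
</body></html>
"""
print(extract_headlines(sample_page))
# ['Top stories', 'First example headline', 'Second example headline']
```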

For this experiment, we use the Python libraries requests and BeautifulSoup for crawling and parsing HTML content. 
requests allows us to fetch the content from URLs, while BeautifulSoup is used to parse the HTML content and extract useful information.