**Experiment 13:
Program to implement a simple web crawler and scrape web pages.
Algorithm
A simple web crawler is a program that systematically navigates through web pages, extracts information, and may follow links to discover more pages.
1.	Seed URL: Start with an initial URL (seed URL) that you want to crawl.
2.	Initialize Queue: Create a queue to manage the URLs to be crawled. Initially, enqueue the seed URL.
3.	Crawl Loop: Repeat the following steps until the queue is empty or a specified limit (for example, a maximum depth or page count) is reached:
	a.	Dequeue a URL from the queue.
	b.	Send an HTTP request to fetch the HTML content of the page at that URL.
	c.	Parse the HTML content to extract relevant information or links. In Python, BeautifulSoup is commonly used for HTML parsing, and Scrapy provides a complete crawling framework.
	d.	Process the extracted information or store it for further analysis.
	e.	Enqueue any new URLs found on the page, skipping those that have already been visited to avoid duplicate crawling.
4.	Data Storage (Optional): Store the extracted data in a database or file for later analysis.
5.	Respect robots.txt: Follow ethical practices by respecting the rules in each website's "robots.txt" file, which defines which parts of the site are off-limits for crawling.
6.	Error Handling: Handle issues such as connection errors or unexpected content during the crawling process.

**
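
The robots.txt step in the algorithm is not covered by the program in this experiment. A minimal sketch of how it could be checked with the standard library's `urllib.robotparser` is shown here. The helper names `build_robot_parser` and `is_allowed`, the sample rules, and the `example.com` URLs are all hypothetical and used only for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_robot_parser(robots_lines):
    """Build a parser from the lines of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse() accepts an iterable of text lines
    return parser


def is_allowed(parser, url, user_agent="*"):
    """Return True if the rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, standing in for a real site's robots.txt
SAMPLE_ROBOTS = [
    "User-agent: *",
    "Disallow: /private/",
]

sample_parser = build_robot_parser(SAMPLE_ROBOTS)
print(is_allowed(sample_parser, "https://example.com/private/report"))
print(is_allowed(sample_parser, "https://example.com/internships"))
```

In a live crawl, the rules would be downloaded instead of hard-coded, for example by calling `set_url()` with the site's robots.txt address and then `read()` on the parser, and `is_allowed` would be consulted before each request.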

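The optional data storage step could be as simple as writing each crawled URL and its depth to a CSV file. The sketch below assumes the crawler collects `(depth, url)` pairs; the function names `save_links` and `load_links` and the two-column layout are hypothetical choices, not part of the program in this experiment.

```python
import csv


def save_links(rows, path):
    """Write (depth, url) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["depth", "url"])
        writer.writerows(rows)


def load_links(path):
    """Read the CSV back into a list of (depth, url) tuples."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return [(int(row["depth"]), row["url"]) for row in reader]
```

A crawler could append to a list of pairs each time it visits a page and call `save_links` once at the end, so the results survive after the notebook session is closed.
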
In [None]:
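from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def get_links(url):
    """Return the set of absolute http(s) links found on the page at `url`."""
    response = requests.get(url, timeout=10)  # avoid hanging on slow servers
    response.raise_for_status()  # treat 4xx/5xx responses as errors
    soup = BeautifulSoup(response.text, 'html.parser')

    links = set()
    for anchor in soup.find_all('a', href=True):  # skip anchors without an href
        # urljoin resolves relative links and leaves absolute ones unchanged
        full_url = urljoin(url, anchor['href'])
        # Keep only web links; this drops mailto:, javascript:, tel:, etc.
        if urlparse(full_url).scheme in ('http', 'https'):
            links.add(full_url)

    return links

def crawl(start_url, max_depth=3):
    """Breadth-first crawl from `start_url`, following links up to `max_depth`."""
    visited = set()
    queue = deque([(start_url, 0)])  # deque gives O(1) pops from the left

    while queue:
        current_url, depth = queue.popleft()
        if current_url in visited or depth > max_depth:
            continue

        # Mark the URL before fetching so a failing page is not retried
        visited.add(current_url)
        print(f"Depth: {depth}, Crawling: {current_url}")

        try:
            links = get_links(current_url)
        except requests.RequestException as e:
            print(f"Error crawling {current_url}: {e}")
            continue

        queue.extend((link, depth + 1) for link in links if link not in visited)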

if __name__ == "__main__":
    seed_url = "https://www.internshala.com"
    crawl(seed_url)


Depth: 0, Crawling: https://www.internshala.com
Depth: 1, Crawling: https://trainings.internshala.com/machine-learning-course/?utm_source=is_web_internshala-menu-dropdown-most-popular
Depth: 1, Crawling: https://trainings.internshala.com/android-course/?utm_source=is_web_internshala-menu-dropdown-most-popular
Depth: 1, Crawling: https://www.internshala.com/internships/marketing-internship
Depth: 1, Crawling: https://trainings.internshala.com/french-course/?utm_source=is_web_internshala-menu-dropdown
Depth: 1, Crawling: https://trainings.internshala.com/ansys-course/?utm_source=is_web_internshala-menu-dropdown
Depth: 1, Crawling: https://internshala.com/internships/hr-internship
Depth: 1, Crawling: https://trainings.internshala.com/?utm_source=is_web_homepage_banner/#placement-and-job-guarantee-courses
Depth: 1, Crawling: https://internshala.com/internships/civil-internship
Depth: 1, Crawling: https://trainings.internshala.com/iot-course/?utm_source=is_web_internshala-menu-dropdown
Dept

KeyboardInterrupt: ignored
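
The output above includes URLs that carry `utm_source` tracking parameters and a `#` fragment, so the same page can be queued under several different strings. One possible way to reduce such duplicates is to normalize each URL before checking it against the visited set. The sketch below is an assumption about what normalization is appropriate, not part of the original program: the function name `normalize_url` is hypothetical, and dropping every query parameter that starts with `utm_` is a choice that may not suit every site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase the scheme and host, drop the fragment and utm_* parameters."""
    parts = urlsplit(url)
    # Keep every query parameter except tracking ones such as utm_source
    query_pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
    ]
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path,
        urlencode(query_pairs),
        "",  # an empty fragment removes the "#..." part
    ))


print(normalize_url("https://Example.com/a/?utm_source=x&id=3#top"))
```

Calling `normalize_url` on each link before adding it to the queue and the visited set would let the crawler treat these variants as one page. It does not merge hosts such as `www.internshala.com` and `internshala.com`; whether those are the same site is left for the crawler's author to decide.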