# Crawling an entire site
While crawling an entire site, we will often find the same link many times on a page, and again on other pages. To avoid crawling the same page twice, it is extremely important that all internal links discovered are formatted consistently and kept in a running set for easy lookups while the program is running.
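
As a rough illustration of "formatted consistently", the sketch below reduces several spellings of the same internal link to one canonical string before adding it to the set. The helper name `normalize_link` and the constant `BASE_URL` are assumptions made for this example, not part of the crawler below; which differences should count as "the same page" (fragments, trailing slashes, scheme) depends on the site.

```python
from urllib.parse import urljoin, urldefrag, urlparse

BASE_URL = 'http://en.wikipedia.org'  # assumption: same host the crawler below uses

def normalize_link(href, base=BASE_URL):
    """Return one canonical absolute URL for an internal link, or None if it is external."""
    base_parts = urlparse(base)
    absolute = urljoin(base + '/', href)        # resolve relative links against the site root
    absolute, _fragment = urldefrag(absolute)   # '#section' anchors point at the same page
    parsed = urlparse(absolute)
    if parsed.netloc.lower() != base_parts.netloc.lower():
        return None                             # link leaves the site
    path = parsed.path.rstrip('/') or '/'       # treat '/wiki/X/' and '/wiki/X' alike
    url = f'{base_parts.scheme}://{base_parts.netloc.lower()}{path}'
    if parsed.query:
        url += '?' + parsed.query               # keep the query string as-is
    return url

# Three spellings of the same page collapse into a single set entry
seen = set()
for href in ['/wiki/Python', '/wiki/Python#History', 'http://en.wikipedia.org/wiki/Python/']:
    url = normalize_link(href)
    if url is not None:
        seen.add(url)
print(seen)
```

Because membership tests on a set take constant time on average, checking `url in seen` stays cheap even after many thousands of pages.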

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

In [None]:
# pages = set()
# def getlinks(pageUrl):
#     html = urlopen(f'http://en.wikipedia.org{pageUrl}')
#     bs = BeautifulSoup(html.read(), 'html.parser')
#     for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
#         if link.attrs['href'] not in pages:
#             #we have found a new page
#             newPage = link.attrs['href']
#             print(newPage)
#             pages.add(newPage)
#             getlinks(newPage) #recursively run this function to get all the links of this page

# getlinks('')

Python has a default recursion limit (the maximum depth of nested function calls) of 1,000. Because Wikipedia’s network of links is extremely large, this program will eventually hit that limit and stop with a `RecursionError`, unless you add a depth counter or some other guard to prevent that from happening.
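
One possible guard is a depth counter passed along with each recursive call. The sketch below is not the crawler from this notebook: `crawl`, `max_depth`, and the `get_links` callable are hypothetical names, and `get_links` stands in for "fetch the page and return its internal links" so the idea can be tried without touching the network.

```python
def crawl(page_url, get_links, seen=None, depth=0, max_depth=50):
    """Recursively collect links, but never recurse deeper than max_depth.

    get_links is a hypothetical callable: given a URL, it returns an iterable
    of the internal links found on that page.
    """
    if seen is None:
        seen = set()
    if depth > max_depth:
        return seen                  # stop well before Python's recursion limit
    for link in get_links(page_url):
        if link not in seen:
            seen.add(link)           # record the page before descending into it
            crawl(link, get_links, seen, depth + 1, max_depth)
    return seen

# A fake "site" whose pages form one long chain: /wiki/0 -> /wiki/1 -> /wiki/2 -> ...
def chain_links(url):
    number = int(url.rsplit('/', 1)[1])
    return [f'/wiki/{number + 1}']

found = crawl('/wiki/0', chain_links, max_depth=50)
print(len(found))
```

Without the `depth > max_depth` check, the chain above would recurse until Python raised `RecursionError`. The trade-off is that pages deeper than `max_depth` are never visited; an iterative crawl driven by a queue avoids recursion altogether.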

In [None]:
pages = set()
def getlinks(pageUrl):
    html = urlopen(f'http://en.wikipedia.org{pageUrl}')
    bs = BeautifulSoup(html.read(), 'html.parser')
    try:
        print(bs.h1.get_text()) #title of the page
        print(bs.find(id ='mw-content-text').find_all('p')[0].text) #first paragraph of page
        # print(bs.find(id='ca-edit').find('span').find('a').attrs['href']) #edit links
    except (AttributeError, IndexError): #a missing tag raises AttributeError; a page with no <p> raises IndexError
        print('page not found! continuing')
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if link.attrs['href'] not in pages:
            #we have found a new page
            newPage = link.attrs['href']
            print('-'*20)
            print(newPage)
            pages.add(newPage)
            getlinks(newPage) #recursively run this function to get all the links of this page

getlinks('')

# Handling redirects
Redirects allow a web server to point one domain name or URL to a piece of content
at a different location. There are two types of redirects:<br>
• Server-side redirects, where the URL is changed before the page is loaded <br>
• Client-side redirects, sometimes seen with a “You will be redirected in 10 seconds” type of message, where the page loads before redirecting to the new one <br>
<br>
With server-side redirects, you usually don’t have to worry. If you’re using the urllib
library with Python 3.x, it handles redirects automatically. If you’re using the requests
library, redirects are followed by default for every request method except HEAD; you can make this explicit by setting the `allow_redirects` flag to True:
<br>
`r = requests.get('http://github.com', allow_redirects=True)`
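
Since urllib follows server-side redirects silently, it can be useful to check where a request actually ended up. The sketch below compares the requested URL with the `url` attribute of the response. The helper name `resolve_final_url` and its `opener` parameter are assumptions for this example; `opener` exists only so the function can be tried with a stand-in response instead of a live request.

```python
from urllib.request import urlopen

def resolve_final_url(url, opener=urlopen):
    """Return (final_url, was_redirected) once any server-side redirects have been followed.

    opener defaults to urllib's urlopen; it is a parameter only so the function
    can be exercised without network access (an assumption of this sketch).
    """
    with opener(url) as response:
        final_url = response.url     # URL of the resource that was actually retrieved
    return final_url, final_url != url

# Example call (needs network access, so it is left commented out):
# print(resolve_final_url('http://github.com'))
```

With the requests library, the same information is available on the response object as `r.url` (the final URL) and `r.history` (the list of intermediate redirect responses).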