## Random Walk through Wikipedia
 
We start by writing a short script that returns all the link tags present in a Wikipedia page.

In [1]:
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

# get html content in bs4 objects
html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')
 
# get all the "a" tags that represent links in a web page.
link_tags = bs.find_all('a')
for t in link_tags[:3]:
    print(t,'\n'+'_'*50+'\n')

<a id="top"></a> 
__________________________________________________

<a href="/wiki/Wikipedia:Protection_policy#semi" title="This article is semi-protected to promote compliance with the policy on biographies of living persons"><img alt="Page semi-protected" data-file-height="512" data-file-width="512" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/30px-Semi-protection-shackle.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/40px-Semi-protection-shackle.svg.png 2x" width="20"/></a> 
__________________________________________________

<a class="mw-jump-link" href="#mw-head">Jump to navigation</a> 
__________________________________________________



The problem with the previous script is that it returns every link on the page, including sidebar, footer, and header links and other irrelevant elements. We need to filter those tags to keep only the pertinent ones, i.e. the links that point to other Wikipedia articles. Inspecting those links reveals the following patterns:
    
   *  They reside within the div with the id set to bodyContent.
   *  The URLs do not contain colons.
   *  The URLs begin with /wiki/.
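
The last two rules can be checked offline with a single regular expression, before touching the network. Here is a minimal sketch; the sample hrefs are made-up values chosen for illustration, not scraped from a real page:

```python
import re

# Matches hrefs that start with /wiki/ and contain no colon afterwards.
ARTICLE_LINK = re.compile(r'^(/wiki/)((?!:).)*$')

# Hypothetical sample hrefs (assumptions for this example).
samples = [
    '/wiki/Kyra_Sedgwick',                     # article link: kept
    '/wiki/Category:Living_people',            # contains a colon: rejected
    '/wiki/Wikipedia:Protection_policy#semi',  # contains a colon: rejected
    '#mw-head',                                # does not start with /wiki/: rejected
    '//upload.wikimedia.org/x.png',            # does not start with /wiki/: rejected
]

article_links = [href for href in samples if ARTICLE_LINK.match(href)]
print(article_links)  # ['/wiki/Kyra_Sedgwick']
```

The negative lookahead `(?!:)` is applied before every character after `/wiki/`, so a single colon anywhere in the rest of the href makes the whole match fail.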

In [2]:
# first we get the div tag that has its id attribute set to 'bodyContent'
div = bs.find('div', {'id':'bodyContent'})

# now we get all the 'a' tags whose href starts with '/wiki/' and doesn't contain a ':'
for link in div.find_all('a', href=re.compile('^(/wiki/)((?!:).)*$'))[:5]:
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Kevin_Bacon_(disambiguation)
/wiki/Philadelphia,_Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)


Let's rewrite this as reusable functions. We'll use a `set` instead of a `list` to avoid duplicate links.

In [7]:
import random

def getLinks(initial_link:str)->set:
    """ takes an initial link and returns the set of all wiki article links found on that page """
    
    # get html content in bs4 objects
    html = urlopen('http://en.wikipedia.org{}'.format(initial_link))
    bs = BeautifulSoup(html, 'html.parser')
    
    # first we get the div tag that has its id attribute set to 'bodyContent'
    div = bs.find('div', {'id':'bodyContent'})
    
    links = set()
    for link in div.find_all('a', href=re.compile('^(/wiki/)((?!:).)*$')):
        if 'href' in link.attrs:
            if link['href'] not in links:
                links.add(link.attrs['href'])
    
    return links


def web_crawl_wikipedia(initial_link:str):
    """ random walk through a website, going from link to link """
    
    # get the initial list of links from the first article
    liste = getLinks(initial_link)
    
    # for each article, repeat the operation
    while(len(liste) > 0):
        random_link = random.choice(tuple(liste))
        print(random_link)
        liste = getLinks(random_link)

In [8]:
web_crawl_wikipedia('/wiki/Kevin_Bacon')

/wiki/Albert_Finney
/wiki/Timothy_Hutton
/wiki/Satellite_Award_for_Best_Cast_%E2%80%93_Television_Series
/wiki/Satellite_Award_for_Best_Actor_%E2%80%93_Miniseries_or_Television_Film
/wiki/Nelson_Mandela
/wiki/Camfed
/wiki/Climate_change


KeyboardInterrupt: 

## Collecting Data Across an Entire Site (Wikipedia)

Web crawlers would be fairly boring if all they did was hop from one page to the other. To make them useful, you need to be able to do something on the page while you’re there. Let’s look at how to build a scraper that collects the title, the first paragraph of content, and the link to edit the page (if available).

In [11]:
pages = set()

def getLinks(pageUrl:str):
    """ 
    start with a page, print its title, first paragraph and its edit link if available.
    get all the "a" tags (links) inside the initial page, loop over them and repeat the operation for each
    """
    global pages
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    
    try:
        # Title, first paragraph, and link to edit the page.
        print(bs.h1.get_text())
        print(bs.find(id ='mw-content-text').find_all('p')[0])
        print(bs.find(id='ca-edit').find('span').find('a').attrs['href'])
    
    except AttributeError:
        print('This page is missing something! Continuing.')
    
    # get all the wiki links inside this page
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        # verify that the tag is valid and not already explored
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print('_'*40,'\n',newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks('')

Main Page
<p>The <b><a href="/wiki/Cleveland_Centennial_half_dollar" title="Cleveland Centennial half dollar">Cleveland Centennial half dollar</a></b> is a commemorative <a href="/wiki/Half_dollar_(United_States_coin)" title="Half dollar (United States coin)">United States half dollar</a>, dated 1936, issued to mark the 100th anniversary of <a href="/wiki/Cleveland" title="Cleveland">Cleveland, Ohio</a>, as an incorporated city, and in commemoration of the <a href="/wiki/Great_Lakes_Exposition" title="Great Lakes Exposition">Great Lakes Exposition</a>, held in Cleveland that year. In the mid-1930s, <a href="/wiki/United_States_commemorative_coins#Early_commemoratives" title="United States commemorative coins">commemorative coins</a> were increasing in value, and <a href="/wiki/Cincinnati" title="Cincinnati">Cincinnati</a> businessman <a href="/wiki/Thomas_G._Melish" title="Thomas G. Melish">Thomas G. Melish</a>, a coin collector, lobbied Congress to authorize several new issues, for wh

KeyboardInterrupt: 
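
Because `getLinks` calls itself once for every new link, a long crawl would eventually exceed Python's recursion limit (1000 frames by default in CPython) if it were not interrupted first. One possible alternative is to keep the pages still to visit in an explicit stack. The sketch below is an assumption-laden illustration, not the notebook's method: `crawl`, `get_links`, `max_pages`, and the toy `graph` are hypothetical names, and the link lookup is injected so the example runs without network access.

```python
def crawl(start, get_links, max_pages=1000):
    """Visit pages depth-first with an explicit stack instead of recursion.

    get_links is a hypothetical callable that maps a page URL to an
    iterable of the URLs it links to.
    """
    seen = {start}    # pages already queued or visited
    stack = [start]   # pages waiting to be visited
    order = []        # pages in the order they were visited

    while stack and len(order) < max_pages:
        page = stack.pop()
        order.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                stack.append(link)
    return order


# Toy link graph standing in for real pages (made-up data).
graph = {
    '/wiki/A': ['/wiki/B', '/wiki/C'],
    '/wiki/B': ['/wiki/A', '/wiki/C'],
    '/wiki/C': [],
}

visited = crawl('/wiki/A', lambda page: graph.get(page, []))
print(visited)  # ['/wiki/A', '/wiki/C', '/wiki/B']
```

In a real crawl, `get_links` could wrap the download and the `find_all` call, and `max_pages` gives the loop a stopping condition other than a keyboard interrupt.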

## Crawling Across the Internet

Just as in the previous example, the web crawlers you are going to build will follow links from page to page, building out a map of the web. But this time, they will not ignore external links that lead to other websites; they will follow them.

The following program starts at http://oreilly.com and randomly hops from external link to external link.

The `urlparse` function parses a URL into six components, returning a 6-item named tuple. This corresponds to the general structure of a URL: **`scheme://netloc/path;parameters?query#fragment`**.
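
The six components can be inspected by name on the returned tuple. A minimal sketch, using a made-up URL (an assumption for this example) that exercises every field:

```python
from urllib.parse import urlparse

# Hypothetical URL chosen so that all six components are non-empty.
parts = urlparse('http://example.com/shop/item;color=red?id=42&lang=en#reviews')

print(parts.scheme)    # http
print(parts.netloc)    # example.com
print(parts.path)      # /shop/item
print(parts.params)    # color=red
print(parts.query)     # id=42&lang=en
print(parts.fragment)  # reviews

# Scheme and netloc together identify the site, which is how a crawler
# can tell internal links from external ones.
domain = '{}://{}'.format(parts.scheme, parts.netloc)
print(domain)          # http://example.com
```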

In [15]:
# import libs, initialize the set object and set the random seed 
import re
import random
import datetime

from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup

pages = set()
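# seed with a float timestamp: recent Python versions reject a datetime object as a seed
random.seed(datetime.datetime.now().timestamp())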

In [28]:
def getInternalLinks(bs, includeUrl):
    """ Retrieves a list of all Internal links found on a page """
    includeUrl = '{}://{}'.format(urlparse(includeUrl).scheme, urlparse(includeUrl).netloc)
    internalLinks = []
    
    # Find all the local links of the current page. Two possible forms:
    # either a link that starts with "/",
    # or an absolute URL that contains the site's own address.
    for link in bs.find_all('a', href=re.compile('^(/|.*'+includeUrl+')')):
        if link.attrs['href'] not in internalLinks:

            # case where the link has the form /TermsOfService
            if link.attrs['href'].startswith('/'):
                internalLinks.append(includeUrl+link.attrs['href'])

            # case where the link has the form InitialPage/TermsOfService
            else:
                internalLinks.append(link.attrs['href'])
    return internalLinks
            
    
def getExternalLinks(bs, excludeUrl):
    """ return a Python list of all external links found on a page """
    externalLinks = []
    
    # Find all links that start with "http" or "www" and do not contain the current URL
    for link in bs.find_all('a', href=re.compile('^(http|www)((?!'+excludeUrl+').)*$')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks


def getRandomExternalLink(startingPage):
    html = urlopen(startingPage)
    bs = BeautifulSoup(html, 'html.parser')
    
    externalLinks = getExternalLinks(bs, urlparse(startingPage).netloc)
    
    if len(externalLinks) == 0:
        print('No external links, looking around the site for one')
        domain = '{}://{}'.format(urlparse(startingPage).scheme, urlparse(startingPage).netloc)
        internalLinks = getInternalLinks(bs, domain)
        return getRandomExternalLink(random.choice(internalLinks))
    else:
        return random.choice(externalLinks)
    
    
def followExternalOnly(startingSite):
    externalLink = getRandomExternalLink(startingSite)
    print('Random external link is: {}'.format(externalLink))
    followExternalOnly(externalLink)
           

In [30]:
followExternalOnly('http://oreilly.com')

Random external link is: https://play.google.com/store/apps/details?id=com.safariflow.queue
Random external link is: http://developer.android.com/index.html
Random external link is: https://developers.google.com/
Random external link is: https://flutter.dev/
Random external link is: https://policies.google.com/privacy
Random external link is: https://calendar.google.com/calendar
Random external link is: https://accounts.google.com/TOS?loc=FR&hl=fr
Random external link is: https://chat.google.com/
Random external link is: https://accounts.google.com/TOS?loc=FR&hl=fr
Random external link is: https://www.google.com/about/philosophy.html?hl=fr
Random external link is: https://twitter.com/google
No external links, looking around the site for one
Random external link is: https://help.pscp.tv/customer/portal/articles/2460220
Random external link is: https://help.twitter.com/using-twitter/hide-delete-broadcast
Random external link is: https://about.twitter.com/en/who-we-are/brand-toolkit.html


HTTPError: HTTP Error 403: Forbidden
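
The run above ends on an uncaught `HTTPError`, since some servers refuse the request. One way to keep the walk going is to wrap the download and treat a failure as "no page". This is a minimal sketch under stated assumptions: `fetch_or_none` and its `opener` parameter are hypothetical names introduced here, not part of the notebook, and `opener` exists only so the function can be exercised without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_or_none(url, opener=urlopen):
    """Return the response for url, or None if the request fails.

    opener is a hypothetical injection point; it defaults to urlopen.
    """
    try:
        return opener(url)
    except HTTPError as err:
        # The server answered, but with an error status such as 403 or 404.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print('Skipping {}: HTTP {}'.format(url, err.code))
    except URLError as err:
        # The server could not be reached at all.
        print('Skipping {}: {}'.format(url, err.reason))
    return None


# Offline demonstration with a fake opener that always answers 403.
def always_forbidden(url):
    raise HTTPError(url, 403, 'Forbidden', None, None)

result = fetch_or_none('http://example.com', opener=always_forbidden)
print(result)  # None
```

A function such as `getRandomExternalLink` could call this helper instead of `urlopen` and pick another link whenever it gets `None` back.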