In this section, you’ll begin a project that will become a Six Degrees of Wikipedia solution finder: you’ll be able to take the Eric Idle page and find the fewest number of link clicks that will take you to the Kevin Bacon page.

In [25]:
from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import re
import random

In [2]:
html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')
for link in bs.find_all('a'):
    if 'href' in link.attrs:
        print(link.attrs['href'])

#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
//en.wikipedia.org/wiki/Wikipedia:Contact_us
https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Main_Page
/wiki/Special:Search
/w/index.php?title=Special:CreateAccount&returnto=Kevin+Bacon
/w/index.php?title=Special:UserLogin&returnto=Kevin+Bacon
/w/index.php?title=Special:CreateAccount&returnto=Kevin+Bacon
/w/index.php?title=Special:UserLogin&returnto=Kevin+Bacon
/wiki/Help:Introduction
/wiki/Special:MyContributions
/wiki/Special:MyTalk
#
#Early_life_and_education
#Acting_career
#Early_work
#1980s
#1990s
#2000s
#2010s
#Other_ventures
#Six_Degrees_of_Kevin_Bacon
#Personal_life
#Accolades
#Awards_and_nominations
#Other_honors
#S

If you examine the links that point to article pages (as opposed to other internal pages), you'll see that they all have three things in common:

* They reside within the div with the id set to bodyContent.

* The URLs do not contain colons.

* The URLs begin with /wiki/.
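
The last two rules in the list above can be checked without touching the network. The following is a minimal sketch that applies the regular expression used in this section to a few sample href values copied from the earlier output. The `is_article_link` helper name is an assumption introduced for illustration; the bodyContent rule is not covered here because it depends on the page structure rather than on the URL.

```python
import re

# Same pattern as in the crawler: starts with /wiki/ and contains no colon.
ARTICLE_PATTERN = re.compile('^(/wiki/)((?!:).)*$')

def is_article_link(href):
    """Return True if href looks like a link to a Wikipedia article."""
    return ARTICLE_PATTERN.match(href) is not None

samples = [
    '/wiki/Kyra_Sedgwick',            # article page
    '/wiki/Footloose_(1984_film)',    # article page
    '/wiki/Special:Random',           # internal page, contains a colon
    '/w/index.php?title=Special:UserLogin&returnto=Kevin+Bacon',
    '#Personal_life',                 # in-page anchor
]

for href in samples:
    print(href, is_article_link(href))
```

Only the first two samples pass the filter; the others either contain a colon or do not start with /wiki/.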

In [4]:
# ((?!:).) matches any single character that is not a colon.
# The trailing * means "zero or more" of that group, so the pattern
# matches /wiki/ followed by any sequence of characters without a colon.
for link in bs.find('div', {'id':'bodyContent'}).find_all(
    'a', href=re.compile('^(/wiki/)((?!:).)*$')):
    print(link.attrs['href'])

/wiki/Kevin_Bacon_(disambiguation)
/wiki/Peabody_Awards
/wiki/Philadelphia
/wiki/Kevin_Bacon_filmography
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Leading_man
/wiki/Character_actor
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/National_Lampoon%27s_Animal_House
/wiki/Footloose_(1984_film)
/wiki/Diner_(1982_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Frost/Nixon_(film)
/wiki/Friday_the_13th_(1980_film)
/wiki/Tremors_(1990_film)
/wiki/The_River_Wild
/wiki/The_Woodsman_(2004_film)
/wiki/Crazy,_Stupid,_Love
/wiki/Patriots_Day_(film)
/wiki/Losing_Chase
/wiki/Loverboy_(2005_film)
/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Miniseries_or_Television_Film
/wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Male_Actor_in_a_Miniseries_or_Television_Movie
/wiki/Michael_Strobl
/wiki/HBO
/wiki/Taking_Chance
/wiki/Fox_Broa

A more complex crawler

In [12]:
import datetime
import random

random.seed(datetime.datetime.now().microsecond)

def getLinks(articleUrl):
    """Return the article links found in the body of a Wikipedia page."""
    html = urlopen('http://en.wikipedia.org{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find('div', {'id': 'bodyContent'}).find_all(
        'a', href=re.compile('^(/wiki/)((?!:).)*$'))

# Random walk: keep following a randomly chosen article link
# until a page with no article links is reached.
links = getLinks('/wiki/Kevin_Bacon')
while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs['href']
    print(newArticle)
    links = getLinks(newArticle)

/wiki/HBO_Films
/wiki/Chris_Albrecht
/wiki/HBO_Original_Programming
/wiki/Sky_One
/wiki/Sky_Mix
/wiki/Road_Wars_(TV_series)
/wiki/Sick_of_It_(TV_series)
/wiki/Ross_Kemp_in_Search_of_Pirates
/wiki/The_Race_(TV_series)
/wiki/Project_Catwalk
/wiki/Louie_Spence%27s_Showbusiness
/wiki/Wolfe_(TV_series)
/wiki/After_Hours_(2015_British_TV_series)
/wiki/Police_Stop!
/wiki/The_Race_(TV_series)
/wiki/Mad_Dogs_(British_TV_series)
/wiki/British_Academy_Television_Awards_2011
/wiki/2021_British_Academy_Television_Awards
/wiki/The_Ranganation
/wiki/Broadcast_Awards
/wiki/Debbie_tucker_green
/wiki/Channel_4
/wiki/The_Comic_Strip
/wiki/Arnold_Brown_(comedian)
/wiki/Sarah_Millican
/wiki/The_Bubble_(UK_TV_series)
/wiki/Marcus_Brigstocke
/wiki/Television
/wiki/Philadelphia
/wiki/Atlantic_City,_New_Jersey
/wiki/Wilmington,_Delaware
/wiki/1860_United_States_census
/wiki/Maine
/wiki/Foreign_policy_of_the_United_States
/wiki/Iran%E2%80%93Iraq_War
/wiki/2004_Qamishli_riots
/wiki/Popular_Front_for_Change_and_L

KeyboardInterrupt: 
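
The loop above stops only when a page has no article links or when it is interrupted by hand, which is what the KeyboardInterrupt shows. A minimal sketch of a walk with a step limit follows. `random_walk`, `max_steps`, and the in-memory `TOY_LINKS` graph are assumptions introduced so the example runs without network access; in the real crawler, `get_links` would be a function returning the href strings collected by `getLinks`.

```python
import random

# Hypothetical link graph standing in for live Wikipedia pages.
TOY_LINKS = {
    '/wiki/Kevin_Bacon': ['/wiki/Philadelphia', '/wiki/HBO'],
    '/wiki/Philadelphia': ['/wiki/Kevin_Bacon', '/wiki/HBO'],
    '/wiki/HBO': [],  # dead end: no outgoing article links
}

def random_walk(start, get_links, max_steps=10):
    """Follow random links from start, stopping at a dead end or after max_steps."""
    path = []
    links = get_links(start)
    while links and len(path) < max_steps:
        next_page = random.choice(links)
        path.append(next_page)
        links = get_links(next_page)
    return path

walk = random_walk('/wiki/Kevin_Bacon', lambda url: TOY_LINKS.get(url, []), max_steps=5)
print(walk)
```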

A way to avoid exploring the same page twice

In [7]:
pages = set()
def getLinks(pageUrl):
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')

    for link in bs.find_all('a', href=re.compile('^(/wiki/)')): # No trailing $, so this matches every link on the page that starts with /wiki/
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages: # We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
       
getLinks('')

/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Special:Search
/wiki/Special:MyContributions
/wiki/Special:MyTalk
/wiki/Special:WhatLinksHere/User_talk:2806:261:486:7AE:847E:59FE:A0B8:B5FF


HTTPError: HTTP Error 404: Not Found
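
The traceback above comes from a single page that could not be fetched, and a recursive crawler can also hit Python's recursion limit on a large site. The following is a sketch, under stated assumptions, of an iterative version that keeps the idea of the `pages` set but uses an explicit queue and skips pages that fail to load. The `crawl` function, its `fetch_links` and `max_pages` parameters, and the `fake_fetch` stand-in are hypothetical names used so the example runs offline.

```python
from collections import deque
from urllib.error import HTTPError, URLError

def crawl(start, fetch_links, max_pages=100):
    """Visit pages breadth-first, never fetching the same URL twice."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            links = fetch_links(url)
        except (HTTPError, URLError) as err:
            # Skip pages that cannot be fetched instead of aborting the crawl.
            print(f'Skipping {url}: {err}')
            continue
        visited.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical stand-in for a function that downloads a page and returns its links.
def fake_fetch(url):
    graph = {
        '/wiki/A': ['/wiki/B', '/wiki/Missing'],
        '/wiki/B': ['/wiki/A', '/wiki/C'],
        '/wiki/C': [],
    }
    if url not in graph:
        raise HTTPError(url, 404, 'Not Found', None, None)
    return graph[url]

result = crawl('/wiki/A', fake_fetch)
print(result)
```

The missing page is reported and skipped, and the crawl continues with the remaining pages in the queue.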

Crawlers for collecting data

In [17]:
pages = set()
def getLinks(pageUrl):
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    try:
        print(bs.h1.get_text()) # Page title
        print(bs.find(id ='mw-content-text').find_all('p')[0].prettify()) # First paragraph of the content
        print(bs.find(id='ca-edit').find('span')
             .find('a').attrs['href']) # Edit link (href is a plain string, so it has no prettify())
    except AttributeError:
        print("The page is missing something! Continuing")
        
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                newPage = link.attrs['href']
                print('-'*20)
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks('')

Main Page
<p>
 The
 <b>
  <a href="/wiki/UEFA_Euro_2004_final" title="UEFA Euro 2004 final">
   UEFA Euro 2004 final
  </a>
 </b>
 was the final match of
 <a href="/wiki/UEFA_Euro_2004" title="UEFA Euro 2004">
  Euro 2004
 </a>
 , the 12th
 <a href="/wiki/UEFA_European_Championship" title="UEFA European Championship">
  European Championship
 </a>
 , organised by
 <a href="/wiki/UEFA" title="UEFA">
  UEFA
 </a>
 for the senior men's national
 <a href="/wiki/Association_football" title="Association football">
  association football
 </a>
 teams of its member associations. The match was played at the
 <a href="/wiki/Est%C3%A1dio_da_Luz" title="Estádio da Luz">
  Estádio da Luz
 </a>
 in
 <a href="/wiki/Lisbon" title="Lisbon">
  Lisbon
 </a>
 , Portugal, and contested by
 <a href="/wiki/Portugal_national_football_team" title="Portugal national football team">
  Portugal
 </a>
 and
 <a href="/wiki/Greece_national_football_team" title="Greece national football team">
  Greece
 </a>
 . The t

IndexError: list index out of range

Crawling across the internet

In [19]:
def getInternalLinks(bs, url):
    netloc = urlparse(url).netloc # Domain, e.g. www.domain.com
    scheme = urlparse(url).scheme # Protocol: http or https
    internalLinks = set()
    for link in bs.find_all('a'):
        if not link.attrs.get('href'): # No href attribute: skip to the next iteration
            continue
        parsed = urlparse(link.attrs['href']) # Split the href to inspect its netloc
        if parsed.netloc == '': # Empty netloc: a relative link on the same domain
            internalLinks.add(f'{scheme}://{netloc}/{link.attrs["href"].strip("/")}') # Build an absolute URL (strip removes leading and trailing slashes)
        elif parsed.netloc == netloc: # Same netloc: an absolute link to this site
            internalLinks.add(link.attrs['href']) # Already absolute, add it unchanged
    return list(internalLinks)
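
Building absolute URLs by string concatenation handles simple paths, but the standard library's `urllib.parse.urljoin` also resolves cases such as protocol-relative links. The following sketch shows one way the internal/external split could be expressed with it. `classify_links` is a hypothetical helper, and skipping fragment-only and non-HTTP links is an assumption about what a crawler should ignore, not something the function above does.

```python
from urllib.parse import urljoin, urlparse

def classify_links(hrefs, page_url):
    """Split hrefs into (internal, external) sets of absolute URLs."""
    base_netloc = urlparse(page_url).netloc
    internal, external = set(), set()
    for href in hrefs:
        if not href or href.startswith('#'):
            continue  # empty value or in-page anchor
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme not in ('http', 'https'):
            continue  # e.g. mailto: or javascript: links
        if parsed.netloc == base_netloc:
            internal.add(absolute)
        else:
            external.add(absolute)
    return internal, external

hrefs = [
    '/wiki/Main_Page',
    '#Personal_life',
    '//en.wikipedia.org/wiki/Wikipedia:Contact_us',
    'https://donate.wikimedia.org/',
    'mailto:someone@example.com',
]
internal, external = classify_links(hrefs, 'https://en.wikipedia.org/wiki/Kevin_Bacon')
print(sorted(internal))
print(sorted(external))
```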

In [20]:
def getExternalLinks(bs, url):
    netloc = urlparse(url).netloc
    externalLinks = set()
    for link in bs.find_all('a'):
        if not link.attrs.get('href'):
            continue
        parsed = urlparse(link.attrs['href'])
        if parsed.netloc != '' and parsed.netloc != netloc: # Absolute link that points to a different domain
            externalLinks.add(link.attrs['href'])
    return list(externalLinks)

In [22]:
def getRandomExternalLink(startingPage):
    bs = BeautifulSoup(urlopen(startingPage), 'html.parser')
    externalLinks = getExternalLinks(bs, startingPage)
    if not len(externalLinks): # No external links on this page
        print("No external links, looking around the site for one")
        internalLinks = getInternalLinks(bs, startingPage) # Collect the internal links instead
        return getRandomExternalLink(random.choice(internalLinks)) # Follow a random internal link and repeat until an external link is found
    else:
        return random.choice(externalLinks) # Return one random external link

In [34]:
def followExternalOnly(startingPage):
    externalLink = getRandomExternalLink(startingPage)
    print(f'Random external link is: {externalLink}')
    followExternalOnly(externalLink)

In [35]:
followExternalOnly('https://www.oreilly.com/')

Random external link is: https://learning.oreilly.com/search/?query=author%3A%22Sari%20Greene%22&extended_publisher_data=true&highlight=true&include_assessments=false&include_case_studies=true&include_courses=true&include_playlists=true&include_collections=true&include_notebooks=true&include_sandboxes=true&include_scenarios=true&is_academic_institution_account=false&source=user&sort=date_added&facet_json=true&json_facets=true&page=0&include_facets=false
Random external link is: https://oreillylearning.in/
Random external link is: https://www.oreilly.com/emails/newsletters/
Random external link is: https://play.google.com/store/apps/details?id=com.safariflow.queue
Random external link is: https://store.google.com/?playredirect=true
Random external link is: https://home.nest.com/es/MX
Random external link is: https://nest.com/-apps/learn-more-about-nest/
No external links, looking around the site for one


IndexError: Cannot choose from an empty sequence
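
The IndexError above is raised by `random.choice` when a page yields no external links and no internal links to fall back on. A minimal sketch of a guard follows. `pick_next_link` is a hypothetical helper that returns `None` instead of raising, so the caller can decide to stop; the example.com URLs are placeholders.

```python
import random

def pick_next_link(external_links, internal_links):
    """Prefer a random external link, fall back to an internal one, else None."""
    if external_links:
        return 'external', random.choice(external_links)
    if internal_links:
        return 'internal', random.choice(internal_links)
    return None  # nothing to follow: the caller should stop here

print(pick_next_link(['https://example.com/'], []))
print(pick_next_link([], ['https://example.com/page']))
print(pick_next_link([], []))
```

A caller such as `followExternalOnly` could check for `None` and end the walk rather than letting the exception propagate.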

In [36]:
allExtLinks = []
allIntLinks = []

def getAllExternalLinks(url):
    bs = BeautifulSoup(urlopen(url), 'html.parser')
    internalLinks = getInternalLinks(bs, url)
    externalLinks = getExternalLinks(bs, url)
    
    for link in externalLinks:
        if link not in allExtLinks:
            allExtLinks.append(link)
            print(link)
    
    for link in internalLinks:
        if link not in allIntLinks:
            allIntLinks.append(link)
            getAllExternalLinks(link)
    

allIntLinks.append('https://oreilly.com')
getAllExternalLinks('https://www.oreilly.com/')

https://learning.oreilly.com/search/?query=author%3A%22Ken%20Kousen%22&extended_publisher_data=true&highlight=true&include_assessments=false&include_case_studies=true&include_courses=true&include_playlists=true&include_collections=true&include_notebooks=true&include_sandboxes=true&include_scenarios=true&is_academic_institution_account=false&source=user&sort=date_added&facet_json=true&json_facets=true&page=0&include_facets=false
https://itunes.apple.com/us/app/safari-to-go/id881697395
https://oreillylearning.in/
https://www.amazon.com/OReilly-Media-Inc/dp/B087YYHL5C/ref=sr_1_2?dchild=1&keywords=oreilly&qid=1604964116&s=mobile-apps&sr=1-2
https://learning.oreilly.com/search/?query=author%3A%22Bruno%20Gon%C3%A7alves%22&extended_publisher_data=true&highlight=true&include_assessments=false&include_case_studies=true&include_courses=true&include_playlists=true&include_collections=true&include_notebooks=true&include_sandboxes=true&include_scenarios=true&is_academic_institution_account=false&so

HTTPError: HTTP Error 404: Not Found