# Six Degrees of Kevin Bacon

![](images/Kevin_Bacon.jpg)

This activity is motivated by the text **Web Scraping with Python** by Ryan Mitchell, available through O'Reilly [here](http://shop.oreilly.com/product/0636920078067.do).  The book covers common web scraping tasks with a range of Python libraries in much more depth, and I highly recommend it.  We will focus on moving from a base page to further pages through their links.

In [1]:
import requests
from bs4 import BeautifulSoup

# requests sends the HTTP request to the website and receives the response
# BeautifulSoup helps navigate the response

### Scraping Links

Below, we take the Wikipedia page on the Six Degrees of Kevin Bacon game.  Our goal is to extract links to other pages that we will subsequently pass to `requests`.  Recall that a link lives in an `<a>` tag, with its target stored in the `href` attribute.  For example, the tag

```HTML
<a href="/wiki/Six_degrees_of_separation" title="Six degrees of separation">six degrees of separation</a>
```

references the Six Degrees of Separation article.  Note that this is a relative URL pointing to another page within Wikipedia.  We can isolate these internal Wikipedia references.  To begin, let's inspect the link content.
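Before turning to the live page, the tag above can be checked in isolation. The following is a minimal sketch that uses only the standard library's `html.parser` rather than BeautifulSoup, so it is a stand-in for the approach used in the rest of this notebook, not a copy of it. The class name `HrefCollector` is hypothetical.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


snippet = ('<a href="/wiki/Six_degrees_of_separation" '
           'title="Six degrees of separation">six degrees of separation</a>')
collector = HrefCollector()
collector.feed(snippet)
print(collector.hrefs)
```

The printed list holds the single relative path from the `href` attribute, which is the piece we want to pull from every link on the page.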

In [2]:
response = requests.get('https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon')

In [3]:
soup = BeautifulSoup(response.text, 'html.parser')

In [4]:
soup.find('a')

<a id="top"></a>

In [6]:
soup.find_all('a')[:10]

# find_all returns a list of tags
# each tag carries its own attributes

[<a id="top"></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#p-search">Jump to search</a>,
 <a class="image" href="/wiki/File:Kevin_Bacon.jpg"><img alt="" class="thumbimage" data-file-height="461" data-file-width="369" height="275" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Kevin_Bacon.jpg/220px-Kevin_Bacon.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Kevin_Bacon.jpg/330px-Kevin_Bacon.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/d/d2/Kevin_Bacon.jpg 2x" width="220"/></a>,
 <a class="internal" href="/wiki/File:Kevin_Bacon.jpg" title="Enlarge"></a>,
 <a class="mw-redirect" href="/wiki/Parlor_game" title="Parlor game">parlor game</a>,
 <a href="/wiki/Six_degrees_of_separation" title="Six degrees of separation">six degrees of separation</a>,
 <a href="/wiki/Kevin_Bacon" title="Kevin Bacon">Kevin Bacon</a>,
 <a href="/wiki/Hollywood" title="Hollywood">Hollywood</a>,
 <a href="/wiki/Kevin_Bacon" ti

In [7]:
for link in soup.find_all('a')[:10]:
    print(link.attrs)

{'id': 'top'}
{'class': ['mw-jump-link'], 'href': '#mw-head'}
{'class': ['mw-jump-link'], 'href': '#p-search'}
{'href': '/wiki/File:Kevin_Bacon.jpg', 'class': ['image']}
{'href': '/wiki/File:Kevin_Bacon.jpg', 'class': ['internal'], 'title': 'Enlarge'}
{'href': '/wiki/Parlor_game', 'class': ['mw-redirect'], 'title': 'Parlor game'}
{'href': '/wiki/Six_degrees_of_separation', 'title': 'Six degrees of separation'}
{'href': '/wiki/Kevin_Bacon', 'title': 'Kevin Bacon'}
{'href': '/wiki/Hollywood', 'title': 'Hollywood'}
{'href': '/wiki/Kevin_Bacon', 'title': 'Kevin Bacon'}


In [9]:
for link in soup.find_all('a')[:15]:
    if 'href' in link.attrs:
        print(link.attrs['href'])

# navigate the tags and use their attributes to filter
#the "." identifies a file

#mw-head
#p-search
/wiki/File:Kevin_Bacon.jpg
/wiki/File:Kevin_Bacon.jpg
/wiki/Parlor_game
/wiki/Six_degrees_of_separation
/wiki/Kevin_Bacon
/wiki/Hollywood
/wiki/Kevin_Bacon
/wiki/Kevin_Bacon
/wiki/Charitable_organization
/wiki/SixDegrees.org
#History
#Bacon_numbers


The output includes more than the internal article links.  The article links, however, share three features: they begin with `/wiki/`, they contain no colons, and they all sit within the body of the page.  Exploiting these features, we can write a regular expression

```
^(/wiki/)((?!:).)*$
```

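that matches only the article links.  The pattern covers the first two features; the third is handled by searching only inside the `bodyContent` div.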
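To see what the pattern accepts and rejects, here is a small sketch that runs it against a few `href` values taken from the outputs in this notebook. The helper name `is_article_link` is hypothetical and used only for illustration.

```python
import re

# ^(/wiki/)   the href must start with /wiki/
# ((?!:).)*$  every remaining character must not be a colon
WIKI_LINK = re.compile('^(/wiki/)((?!:).)*$')


def is_article_link(href):
    """Return True when href looks like an internal article link."""
    return WIKI_LINK.match(href) is not None


samples = [
    '/wiki/Kevin_Bacon',
    '/wiki/SixDegrees.org',
    '/wiki/File:Kevin_Bacon.jpg',
    '#mw-head',
    '//en.wikipedia.org/w/index.php?title=List_of_triple_albums&action=edit',
]

for href in samples:
    print(href, is_article_link(href))
```

Only the first two samples pass: the `File:` link fails on its colon, and the last two fail because they do not start with `/wiki/`.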

In [10]:
import re

In [11]:
for link in soup.find('div', {'id': 'bodyContent'}).find_all('a', href = re.compile('^(/wiki/)((?!:).)*$'))[:10]:
    if 'href' in link.attrs:
        print(link.attrs['href'])
        
# find the div whose id is bodyContent, then search only inside it
# ^(/wiki/)((?!:).)*$ is a regular expression: hrefs that start with /wiki/ and contain no colon

/wiki/Parlor_game
/wiki/Six_degrees_of_separation
/wiki/Kevin_Bacon
/wiki/Hollywood
/wiki/Kevin_Bacon
/wiki/Kevin_Bacon
/wiki/Charitable_organization
/wiki/SixDegrees.org
/wiki/Premiere_(magazine)
/wiki/Kevin_Bacon
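The output above repeats `/wiki/Kevin_Bacon` several times. If each target should be visited only once, one option is to drop repeats while keeping the original order. The sketch below works on a plain list of the `href` strings printed above; the helper name `unique_in_order` is hypothetical.

```python
hrefs = [
    '/wiki/Parlor_game',
    '/wiki/Six_degrees_of_separation',
    '/wiki/Kevin_Bacon',
    '/wiki/Hollywood',
    '/wiki/Kevin_Bacon',
    '/wiki/Kevin_Bacon',
    '/wiki/Charitable_organization',
    '/wiki/SixDegrees.org',
    '/wiki/Premiere_(magazine)',
    '/wiki/Kevin_Bacon',
]


def unique_in_order(items):
    """Remove duplicates, keeping the first occurrence of each item."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(items))


for href in unique_in_order(hrefs):
    print(href)
```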


### A Function for Links

Now, let's write a function that extracts the links from any Wikipedia page.  We can reuse the idea that the links we care about are located in the same place as in our Six Degrees example.  

In [12]:
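def get_wikilinks(url):
    """Return the <a> tags for internal article links on a Wikipedia page."""
    links = []
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # search only the main content div, keeping hrefs that start with /wiki/ and contain no colon
    for link in soup.find('div', {'id': 'bodyContent'}).find_all('a', href = re.compile('^(/wiki/)((?!:).)*$')):
        links.append(link)
    return links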

In [13]:
links = get_wikilinks('https://en.wikipedia.org/wiki/Kevin_Bacon')

In [14]:
links[:10]

# next we can follow the links we scraped
# these hrefs are relative links, so they need the https://en.wikipedia.org prefix

[<a class="mw-disambig" href="/wiki/Kevin_Bacon_(disambiguation)" title="Kevin Bacon (disambiguation)">Kevin Bacon (disambiguation)</a>,
 <a href="/wiki/San_Diego_Comic-Con" title="San Diego Comic-Con">San Diego Comic-Con</a>,
 <a href="/wiki/Philadelphia" title="Philadelphia">Philadelphia</a>,
 <a href="/wiki/Pennsylvania" title="Pennsylvania">Pennsylvania</a>,
 <a href="/wiki/Kyra_Sedgwick" title="Kyra Sedgwick">Kyra Sedgwick</a>,
 <a href="/wiki/Sosie_Bacon" title="Sosie Bacon">Sosie Bacon</a>,
 <a href="/wiki/Edmund_Bacon_(architect)" title="Edmund Bacon (architect)">Edmund Bacon</a>,
 <a href="/wiki/Michael_Bacon_(musician)" title="Michael Bacon (musician)">Michael Bacon</a>,
 <a href="/wiki/Footloose_(1984_film)" title="Footloose (1984 film)">Footloose</a>,
 <a href="/wiki/JFK_(film)" title="JFK (film)">JFK</a>]
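Because the scraped `href` values are relative, they have to be joined to the site's base URL before they can be requested. A minimal sketch using the standard library's `urllib.parse.urljoin` follows; the names `BASE_URL` and `to_absolute` are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = 'https://en.wikipedia.org'


def to_absolute(href, base=BASE_URL):
    """Resolve a relative href against the base URL."""
    return urljoin(base, href)


for href in ['/wiki/Footloose_(1984_film)', '/wiki/JFK_(film)']:
    print(to_absolute(href))
```

Plain string concatenation gives the same result for these paths; `urljoin` is simply more forgiving if a link ever turns out to be absolute already.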

### Connecting Pages

Now we want to follow these references, gather more URLs, and repeat.  To keep the walk from running indefinitely, I cut it short by requiring the current page to have more than 100 links before continuing.  To keep walking until we reach a page with no links at all, we would simply change the 

```python
while len(links) > 100:
```

to 

```python
while len(links) > 0:
```

In [15]:
import random


In [13]:
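links = get_wikilinks('https://en.wikipedia.org/wiki/Kevin_Bacon')
# keep walking while the current page has more than 100 article links
while len(links) > 100:
    # pick a random link and prepend the domain, since the hrefs are relative
    new_article = 'https://en.wikipedia.org' + random.choice(links).attrs['href']
    print(new_article)
    links = get_wikilinks(new_article)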

https://en.wikipedia.org/wiki/Stacy_Keach
https://en.wikipedia.org/wiki/Old_Yeller-Belly
https://en.wikipedia.org/wiki/The_Simpsons_(season_29)
https://en.wikipedia.org/wiki/D%27oh!
https://en.wikipedia.org/wiki/TV_Land
https://en.wikipedia.org/wiki/Nicktoons_(United_States)
https://en.wikipedia.org/wiki/Hallmark_Channel
https://en.wikipedia.org/wiki/Super_Bowl
https://en.wikipedia.org/wiki/New_Haven,_Connecticut
https://en.wikipedia.org/wiki/List_of_Connecticut_locations_by_per_capita_income
https://en.wikipedia.org/wiki/Vernon,_Connecticut
https://en.wikipedia.org/wiki/Pacific_Islander_(U.S._Census)
https://en.wikipedia.org/wiki/Hispanic_and_Latino_Americans
https://en.wikipedia.org/wiki/Chicano
https://en.wikipedia.org/wiki/List_of_U.S._communities_with_Hispanic_majority_populations_in_the_2010_census
https://en.wikipedia.org/wiki/Kendale_Lakes,_Florida
https://en.wikipedia.org/wiki/Native_Alaskan
https://en.wikipedia.org/wiki/2010_United_States_Census
https://en.wikipedia.org/wiki/
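The loop above stops only when it lands on a page with few enough links, so its running time is unpredictable. One way to bound it is to cap the number of steps. The sketch below shows that idea on a made-up link graph so it runs without network access; `random_walk`, `TOY_GRAPH`, and `toy_links` are hypothetical names, and in practice the link-fetching function would be something like `get_wikilinks` adapted to return `href` strings.

```python
import random


def random_walk(start, get_links, max_steps=10, rng=random):
    """Follow randomly chosen links until a dead end or max_steps is reached."""
    path = [start]
    links = get_links(start)
    # len(path) - 1 is the number of steps taken so far
    while links and len(path) - 1 < max_steps:
        next_page = rng.choice(links)
        path.append(next_page)
        links = get_links(next_page)
    return path


# A made-up graph standing in for real pages (an assumption for illustration)
TOY_GRAPH = {
    '/wiki/A': ['/wiki/B', '/wiki/C'],
    '/wiki/B': ['/wiki/A'],
    '/wiki/C': [],
}


def toy_links(page):
    """Return the outgoing links of a page in the toy graph."""
    return TOY_GRAPH.get(page, [])


path = random_walk('/wiki/A', toy_links, max_steps=5, rng=random.Random(0))
print(path)
```

Passing the link-fetching function as an argument keeps the walk logic separate from the scraping, which also makes it easy to try out without hitting Wikipedia.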

### Problem

Write a function to retrieve a list of albums in any category you are interested in, using Wikipedia's Lists of albums page: https://en.wikipedia.org/wiki/Lists_of_albums.

In [16]:
response_2 = requests.get('https://en.wikipedia.org/wiki/List_of_triple_albums')

In [18]:
soup_2 = BeautifulSoup(response_2.text, 'html.parser')

In [19]:
soup_2.find('li')

<li><a href="/wiki/The_Allman_Brothers_Band" title="The Allman Brothers Band">The Allman Brothers Band</a> - <i>Jones Beach, Wantagh, NY 8/24/04</i></li>

In [27]:
soup_2.find_all('li')[:10]

[<li><a href="/wiki/The_Allman_Brothers_Band" title="The Allman Brothers Band">The Allman Brothers Band</a> - <i>Jones Beach, Wantagh, NY 8/24/04</i></li>,
 <li>The Allman Brothers Band – <i>Chronicles: 3 Classic Albums</i> (2005) - compilation</li>,
 <li><a href="/wiki/Alan_Silva" title="Alan Silva">Alan Silva</a> and The Celestrial Communication Orchestra- <i><a class="mw-redirect" href="/wiki/Seasons" title="Seasons">Seasons</a></i> - (3×LP, 2×CD)</li>,
 <li><a href="/wiki/Alquin" title="Alquin">Alquin</a> - <i>The Ultimate Collection</i> (2007) - 3×CD compilation</li>,
 <li><a href="/wiki/America_(band)" title="America (band)">America</a> - <i><a href="/wiki/Highway_(America_album)" title="Highway (America album)">Highway - 30 years of America</a></i> (2000)</li>,
 <li>America - <i>The Triple Album Collection</i> (2015)</li>,
 <li><a href="/wiki/Armand_(singer)" title="Armand (singer)">Armand</a> - <i>Singles A's &amp; B's</i> (2003)</li>,
 <li><a href="/wiki/The_Band" title="The B

In [28]:
soup_2.find('li').text

'The Allman Brothers Band - Jones Beach, Wantagh, NY 8/24/04'
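Each list item's text follows a rough "artist, separator, title" pattern, with either a hyphen or an en dash as the separator. As a sketch of one way to split such text into its two parts, assuming the first dash followed by whitespace is the separator (the names `SEPARATOR` and `split_entry` are hypothetical):

```python
import re

# a hyphen or en dash, optionally preceded by spaces and followed by at least one space
SEPARATOR = re.compile(r'\s*[-–]\s+')


def split_entry(text):
    """Split 'Artist - Title ...' into (artist, rest); return None if no separator."""
    parts = SEPARATOR.split(text.strip(), maxsplit=1)
    if len(parts) != 2:
        return None
    artist, title = parts
    return artist, title


entries = [
    'The Allman Brothers Band - Jones Beach, Wantagh, NY 8/24/04',
    'The Band – The Last Waltz',
    'Bonzo Dog Doo-Dah Band – Cornology',
]

for entry in entries:
    print(split_entry(entry))
```

Requiring whitespace after the dash keeps hyphenated names such as "Doo-Dah" intact. Entries that do not fit the pattern come back as `None`, so they can be inspected by hand.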

In [24]:
for link in soup_2.find_all('li')[:3]:
    print(link.attrs)

{}
{}
{}


In [25]:
for link in soup_2.find_all('a')[:10]:
    print(link.attrs)

{'id': 'top'}
{'class': ['mw-jump-link'], 'href': '#mw-head'}
{'class': ['mw-jump-link'], 'href': '#p-search'}
{'href': '/wiki/Triple_album', 'class': ['mw-redirect'], 'title': 'Triple album'}
{'href': '/wiki/Gramophone_record', 'class': ['mw-redirect'], 'title': 'Gramophone record'}
{'href': '/wiki/Compact_disc', 'title': 'Compact disc'}
{'href': '/wiki/Wikipedia:WikiProject_Lists#Incomplete_lists', 'title': 'Wikipedia:WikiProject Lists'}
{'class': ['external', 'text'], 'href': '//en.wikipedia.org/w/index.php?title=List_of_triple_albums&action=edit'}
{'href': '/wiki/Wikipedia:Identifying_reliable_sources', 'title': 'Wikipedia:Identifying reliable sources'}
{'href': '#top'}


In [29]:
for result in soup_2.find_all('li'):
    print(result.text)

The Allman Brothers Band - Jones Beach, Wantagh, NY 8/24/04
The Allman Brothers Band – Chronicles: 3 Classic Albums (2005) - compilation
Alan Silva and The Celestrial Communication Orchestra- Seasons - (3×LP, 2×CD)
Alquin - The Ultimate Collection (2007) - 3×CD compilation
America - Highway - 30 years of America (2000)
America - The Triple Album Collection (2015)
Armand - Singles A's & B's (2003)
The Band – The Last Waltz
Syd Barrett – Crazy Diamond – compilation
The Beatles – Anthology 1 3xLP, 2xCD
The Beatles – Anthology 2 3xLP, 2xCD
The Beatles – Anthology 3 3xLP, 2xCD
The Beatles – Live at the BBC Volume 1 3xLP, 2xCD
The Beatles – On Air – Live at the BBC Volume 2 3xLP, 2xCD
Carla Bley – Escalator over the Hill
Blue Floyd – Keswick Thr., Glenside, Pa, 1-21-2000 (2010) - live
Bob Dylan - Triplicate (2017)
Boards of Canada – Geogaddi
Bonzo Dog Doo-Dah Band – Cornology
Boris with Merzbow – Rock Dream 3xLP, 2xCD
Billy Bragg – Must I Paint You A Picture? The Essential Billy Bragg (2003)