# Web Scraping with Beautiful Soup - Lab

## Introduction

Now that you've read some documentation on Beautiful Soup and seen it in use, it's time to practice and put that knowledge to work! In this lab you'll formalize some of our example code into functions and scrape the lyrics from an artist of your choice.

## Objectives
You will be able to:
* Scrape static webpages
* Select specific elements from the DOM

## Link Scraping

Write a function to collect the links to each of the song pages from a given artist page.

In [1]:
#Starter Code

from bs4 import BeautifulSoup
import requests

def get_songs(artist_url):
    url = artist_url #Put the URL of your AZLyrics Artist Page here!
    html_page = requests.get(url) #Make a get request to retrieve the page
    soup = BeautifulSoup(html_page.content, 'html.parser') #Pass the page contents to beautiful soup for parsing
    albums = soup.find_all("div", class_="album")

    #The example from our lecture/reading
    data = [] #Create a storage container
    for album_n, cur_album in enumerate(albums):
        album_songs = cur_album.find_next_siblings('a') #songs after current album
        #On the last album, we won't be able to look forward
        if album_n < len(albums)-1:
            next_album = albums[album_n+1]
            sbna = next_album.find_previous_siblings('a') #songs before next album
            #album songs are those listed after the current album but before the next one!
            album_songs = [song for song in album_songs if song in sbna]
        for song in album_songs:
            page = song.get('href')
            title = song.text
            album = cur_album.text
            data.append((title, page, album))
    return data

In [2]:
tl_songs = get_songs('https://www.azlyrics.com/t/thinlizzy.html')
tl_songs[:2]

[('The Friendly Ranger At Clontarf Castle',
  '../lyrics/thinlizzy/thefriendlyrangeratclontarfcastle.html',
  'album: "Thin Lizzy" (1971)'),
 ('Honesty Is No Excuse',
  '../lyrics/thinlizzy/honestyisnoexcuse.html',
  'album: "Thin Lizzy" (1971)')]
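
The `page` values in the output above are relative links, so they need to be combined with the site address before they can be requested. A minimal sketch of one way to do that with the standard library's `urljoin` follows; the helper name `absolute_song_url` and the constant `ARTIST_URL` are hypothetical and not part of the lab's starter code.

```python
from urllib.parse import urljoin

# The artist page the links were scraped from (URL taken from the lab above)
ARTIST_URL = 'https://www.azlyrics.com/t/thinlizzy.html'

def absolute_song_url(relative_page, base_url=ARTIST_URL):
    """Resolve a relative href such as '../lyrics/...' against the page it came from."""
    return urljoin(base_url, relative_page)

full_url = absolute_song_url('../lyrics/thinlizzy/honestyisnoexcuse.html')
print(full_url)
```

Resolving against the artist page avoids hand-editing the `..` prefix, at the cost of having to keep the artist URL around alongside the scraped links.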

## Text Scraping
Write a secondary function that scrapes the lyrics for each song page.

In [3]:
#Remember to open up the webpage in a browser and control-click/right-click and go to inspect!
from bs4 import BeautifulSoup
import requests

#Example page
url = 'https://www.azlyrics.com/lyrics/thinlizzy/thefriendlyrangeratclontarfcastle.html'


html_page = requests.get(url)
soup = BeautifulSoup(html_page.content, 'html.parser')
soup.prettify()[:1000]

'<!DOCTYPE html>\n<html lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>\n  <meta content="width=device-width, initial-scale=1" name="viewport"/>\n  <meta content=\'Lyrics to "The Friendly Ranger At Clontarf Castle" song by Thin Lizzy: The friendly ranger paused And scooping a bowl of beans Spreading them like stars Falling like justi...\' name="description"/>\n  <meta content="The Friendly Ranger At Clontarf Castle lyrics, Thin Lizzy The Friendly Ranger At Clontarf Castle lyrics, Thin Lizzy lyrics" name="keywords"/>\n  <meta content="noarchive" name="robots"/>\n  <meta content="//www.azlyrics.com/az_logo_tr.png" property="og:image"/>\n  <title>\n   Thin Lizzy - The Friendly Ranger At Clontarf Castle Lyrics | AZLyrics.com\n  </title>\n  <link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.4/css/bootstrap.min.css" rel="stylesheet"/>\n  <link href="//www.azlyrics.com/bsaz.css" rel="stylesheet"/>\n  <!-- HTML5 shim and Respond.j

In [4]:
divs = soup.find_all('div')

for n, div in enumerate(divs):
    print(n, div, "\n\n\n\n")

0 <div id="fb-root"></div> 




1 <div class="container">
<!-- Brand and toggle get grouped for better mobile display -->
<div class="navbar-header">
<button class="navbar-toggle collapsed" data-target="#search-collapse" data-toggle="collapse" type="button">
<span class="glyphicon glyphicon-search"></span>
</button>
<button class="navbar-toggle collapsed" data-target="#artists-collapse" data-toggle="collapse" type="button">
<span class="glyphicon glyphicon-th-list"></span>
</button>
<a class="navbar-brand" href="//www.azlyrics.com"><img alt="AZLyrics.com" class="pull-left" src="//www.azlyrics.com/az_logo_tr.png" style="max-height:40px; margin-top:-10px;"/></a>
</div>
<ul class="collapse navbar-collapse nav navbar-nav" id="artists-collapse">
<li>
<div class="btn-group text-center" role="group">
<a class="btn btn-menu" href="//www.azlyrics.com/a.html">A</a>
<a class="btn btn-menu" href="//www.azlyrics.com/b.html">B</a>
<a class="btn btn-menu" href="//www.azlyrics.com/c.html">C</a>
<a cla

In [5]:
sectionv1 = soup.find('div', {"class" : "col-xs-12 col-lg-8 text-center"})
sectionv1.find_all('div')

[<div class="div-share noprint">
 <div class="fb-like" data-action="like" data-href="https://www.azlyrics.com/lyrics/thinlizzy/thefriendlyrangeratclontarfcastle.html" data-layout="button_count" data-share="false" data-show-faces="false" style="float:left;"></div>
 <!-- AddThis Button BEGIN -->
 <script src="https://s7.addthis.com/js/300/addthis_widget.js#username=azlyrics" type="text/javascript"></script>
 <div class="addthis_toolbox addthis_default_style" style="float:right;">
 <a class="btn btn-xs btn-share addthis_button_email">
 <span class="playblk"><img alt="Email" class="playblk" height="18" src="//www.azlyrics.com/images/email.svg" width="56"/></span>
 </a>
 <a class="btn btn-xs btn-share addthis_button_print" style="margin-right: 0px !important;">
 <span class="playblk"><img alt="Print" class="playblk" height="18" src="//www.azlyrics.com/images/print.svg" width="56"/></span>
 </a>
 </div>
 </div>,
 <div class="fb-like" data-action="like" data-href="https://www.azlyrics.com/lyr

In [13]:
import re
# My version: works but includes 'Usage of azlyrics.com...' intro text
#def get_lyrics(url):
#    html_page = requests.get(url)
#    soup = BeautifulSoup(html_page.content, 'html.parser')
#    for elem in soup(text=re.compile(r'Usage of azlyrics')):
#        print(elem.parent)

# From solutions:
def scrape_lyrics(song_page_url):
    html_page = requests.get(song_page_url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    #Drill down through the nested containers that hold the lyrics
    main_page = soup.find('div', {"class": "container main-page"})
    main_l2 = main_page.find('div', {"class" : "row"})
    main_l3 = main_l2.find('div', {"class" : "col-xs-12 col-lg-8 text-center"})
    #The solution takes the seventh nested div as the lyrics; this positional index is fragile if the page layout changes
    lyrics = main_l3.find_all('div')[6].text
    return lyrics

In [7]:
# lyrics1 = get_lyrics('https://www.azlyrics.com/lyrics/thinlizzy/thefriendlyrangeratclontarfcastle.html')

<div>
<!-- Usage of azlyrics.com content by any third-party lyrics provider is prohibited by our licensing agreement. Sorry about that. -->
The friendly ranger paused<br/>
And scooping a bowl of beans<br/>
Spreading them like stars<br/>
Falling like justice on different scenes<br/>
<br/>
"I'm damned"<br/>
"Indeed, comrade"<br/>
"I'm being bombed"<br/>
And all the people's faces turned strawberry blonde<br/>
<br/>
By the morning gate the friendly ranger waits<br/>
For the sun making sure it's not late<br/>
"Just in time" "No need to fear" "Well, just in case"<br/>
And all the people are happy for another year<br/>
<br/>
And in the evening shade he climbs upon the sun<br/>
Getting it's glow<br/>
He goes on<br/>
Singing this song<br/>
<br/>
To feel the goodness glowing inside<br/>
To walk down a street with my arms about your hips, side by side<br/>
To play with a sad eyed child till he smiles<br/>
To look at a starry sky at night, realize the miles<br/>
<br/>
To see the sun set behind t

## Synthesizing
Create a script using your two functions above to scrape all of the song lyrics for a given artist.


Note: both versions below raise connection errors because of too many requests. The top version is mine; the second is from the solutions. Try both.

In [15]:
#Use this block for your code!
tl_songs = get_songs('https://www.azlyrics.com/t/thinlizzy.html')
lyrics = []
for song in tl_songs:
    url_end = song[1].replace('..','') #Strip the leading '..' from the relative link
    url = "https://www.azlyrics.com" + url_end
    song_lyrics = scrape_lyrics(url) #The function defined in the Text Scraping section
    lyrics.append(song_lyrics)

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

In [16]:
songs = get_songs("https://www.azlyrics.com/t/thinlizzy.html")
url_base = "https://www.azlyrics.com"
lyrics = []
for song in songs:
    try:
        url_sffx = song[1].replace('..','')
        url = url_base + url_sffx
        lyr = scrape_lyrics(url)
        lyrics.append(lyr)
    except Exception: #Record a placeholder when a page fails to download or parse
        lyrics.append("N/A")

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

In [10]:
len(tl_songs)

134

In [11]:
len(lyrics)

10
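
Only 10 of the 134 songs were collected before the connection was dropped, which is consistent with the site cutting off rapid repeated requests. One common mitigation is to pause between attempts and retry a failed page a few times. The sketch below assumes that approach; `fetch_with_retries` and its parameters are hypothetical names, and whether this is enough to satisfy any particular site is not guaranteed.

```python
import time

def fetch_with_retries(fetch, url, max_retries=3, delay=2.0, sleep=time.sleep):
    """Call fetch(url), pausing and retrying when it raises an exception.

    Returns whatever fetch returns, or None if every attempt failed.
    `sleep` is injectable so the pauses can be skipped when testing.
    """
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception:
            # Wait longer after each failure: delay, 2*delay, 4*delay, ...
            sleep(delay * 2 ** attempt)
    return None
```

In the loop above, a call such as `fetch_with_retries(scrape_lyrics, url)` could stand in for calling `scrape_lyrics(url)` directly. Adding a short `time.sleep` after each successful request as well would slow the overall request rate further.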

## Visualizing
Generate two bar graphs to compare lyrical changes for the artist of your choice. For example, the two bar charts could compare the lyrics for two different songs or two different albums.

In [None]:
#Use this block for your code!
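
Before plotting, the lyrics need to be reduced to something countable. As one possible starting point, the sketch below tallies the most frequent words in a lyrics string using `collections.Counter`. The function name `top_words` and the simple word pattern are assumptions made for illustration; a real analysis might also remove common stop words.

```python
import re
from collections import Counter

def top_words(lyrics, n=5):
    """Return the n most common lowercase words in a lyrics string as (word, count) pairs."""
    # Match runs of letters, allowing one internal apostrophe (e.g. "it's")
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", lyrics.lower())
    return Counter(words).most_common(n)
```

The returned pairs can be split into labels and heights for a bar chart, for example with matplotlib's `plt.bar`, once for each song or album being compared.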

## Level Up

Think about how you structured the data from your web scraper. Did you scrape the entire song lyrics verbatim? Did you simply store the words and their frequency counts, or did you do something else entirely? List out a few different options for how you could have stored this data. What are advantages and disadvantages of each? Be specific and think about what sort of analyses each representation would lend itself to.

In [None]:
#Use this block for your code!
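
As one concrete option among those the question asks about, the sketch below keeps each song as a record with its lyrics stored verbatim and round-trips the collection through JSON. The record layout and the placeholder lyrics string are hypothetical; the title and album values are taken from the `get_songs` output earlier in this lab.

```python
import json

# Hypothetical record layout: one dict per song, keeping the lyrics verbatim
songs = [
    {
        "title": "Honesty Is No Excuse",
        "album": 'album: "Thin Lizzy" (1971)',
        "lyrics": "placeholder text standing in for the scraped lyrics",
    },
]

serialized = json.dumps(songs)     # a string that could be written to a file
restored = json.loads(serialized)  # the same list of dicts, read back
```

Storing the verbatim text keeps every later analysis possible, since word counts can always be derived from it, whereas storing only counts is more compact but discards word order and line structure.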

## Summary

Congratulations! You've now practiced your Beautiful Soup knowledge!