# What will you need?

* Python libraries
    * [`requests`](https://requests.readthedocs.io/en/master/), which allows us to fetch the webpage we want; and
    * [`Beautiful Soup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), which parses the content of the webpage and allows us to extract tags from an HTML document.

* Knowledge
    - [Some basic HTML knowledge](https://www.w3schools.com/html/html_elements.asp) (its tree structure, and how tags define the branches that hold the information we are looking for)
    - [Python string methods](https://www.w3schools.com/python/python_ref_string.asp)
    
    
# Important steps

## Step 1: Website inspection

### In the browser

Open the webpage's HTML source by right-clicking on the page and choosing `View Page Source` (Firefox) or `View page source` (Chrome and Microsoft Edge). If you need the details of a specific element, right-click on it and choose `Inspect Element` (Firefox) or `Inspect` (Chrome and Microsoft Edge) instead.

### Use `prettify` from Beautiful Soup

As the following examples show, `prettify` also makes the HTML easier to read inside a Jupyter notebook.
    
During the inspection our goal is to check for **tags** around the information of interest (content of the tag).
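
As a minimal sketch of what `prettify` does, the snippet below parses a made-up one-line HTML string (not taken from any real page) and prints it with one element per line. It uses Python's built-in `"html.parser"` so that nothing besides Beautiful Soup needs to be installed; that choice is an assumption of this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line HTML snippet, used only for illustration
raw_html = "<html><body><div><p>Hello, <a href='/wiki/James_Bond'>Bond</a></p></div></body></html>"

# "html.parser" ships with Python, so no extra parser has to be installed
soup = BeautifulSoup(raw_html, "html.parser")

# prettify() returns the document as an indented, multi-line string
pretty = soup.prettify()
print(pretty)
```

The indentation makes the nesting visible: the `<a>` tag sits inside `<p>`, which sits inside `<div>`, and so on.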

Some frequently encountered tags are:

`<head>` : Contains metadata useful to the web browser which is rendering the page but which is invisible to the user.

`<body>` : Contains the content of an HTML document with which the user interacts. Every page has only one body.

`<div>`: Section of the body.

`<p>`: Delimits paragraphs.

`<a>` : Creates a hyperlink to web pages, files, email addresses, locations in the same page, or anything else a URL can address.

For more definitions of elements check these links: [dev_mozilla](https://developer.mozilla.org/en-US/docs/Web/HTML/Element) or [w3s](https://www.w3schools.com/html/html_elements.asp)
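
To make the tag list above concrete, here is a small sketch that builds a toy document containing each of those tags and reaches them by name. The HTML string and the variable names are invented for this illustration.

```python
from bs4 import BeautifulSoup

# Toy document (hypothetical) containing the tags described above
html_doc = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <div id="intro">
      <p>First paragraph with a <a href="https://example.com">link</a>.</p>
      <p>Second paragraph.</p>
    </div>
  </body>
</html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

page_title = soup.head.title.text      # text of <title>, inside <head>
first_div = soup.body.div              # first <div> found inside <body>
paragraphs = first_div.find_all("p")   # every <p> inside that <div>
first_link = soup.a                    # first <a> anywhere in the document

print(page_title)
print(first_div["id"])
print(len(paragraphs))
print(first_link["href"])
```

Accessing a tag as an attribute (`soup.body.div`) returns only the first match; `find_all` is the way to collect every occurrence.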

<img src="../images/webpage_code_ex01.JPG" width="800" />

## Step 2: Access Content of Website

For this we need to:

1. Access the website using `requests`
2. Parse its content with `Beautiful Soup` so we can extract what we need from within the tags


## Step 3: Use parser to extract desired information

You can use different attributes of the parser depending on which information you want and on the tag that wraps it, as we will see soon.
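
As a rough preview, the sketch below shows three common ways to pull information out of a parsed tag: `.text` for the visible text, `.get()` for a single attribute, and `.attrs` for all attributes at once. The HTML snippet is a made-up example, not copied from a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a paragraph wrapping a hyperlink
snippet = '<p class="lead">Directed by <a href="/wiki/Terence_Young" title="Terence Young">Terence Young</a></p>'
soup = BeautifulSoup(snippet, "html.parser")

paragraph = soup.find("p", class_="lead")  # filter by tag name and CSS class
link = paragraph.find("a")                 # search only inside that paragraph

paragraph_text = paragraph.text   # all text inside the tag, nested tags included
link_target = link.get("href")    # value of one attribute
missing = link.get("rel")         # .get() returns None when the attribute is absent
link_attrs = link.attrs           # dictionary with every attribute of the tag

print(paragraph_text)
print(link_target, missing, link_attrs)
```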

# Examples

## Example 1: Extracting text from webpages

The function below performs the steps needed to parse a webpage.

In [1]:
import requests
from bs4 import BeautifulSoup

def parse_website(url):
    """ 
    Parse the content of a website.

    Args:
        url (str): URL of the website whose content we want to access.

    Returns:
        bs4.BeautifulSoup: representation of the document as a nested data structure.
    """
    # Send request and catch response
    response = requests.get(url)

    # get the content of the response
    content = response.content

    # parse webpage
    parser = BeautifulSoup(content, "lxml")
    
    return parser  
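
The function above assumes that the request always succeeds. A slightly more defensive variant might look like the sketch below; the name `parse_website_checked`, the 10-second default timeout and the switch to the built-in `"html.parser"` are choices made for this illustration, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def parse_website_checked(url, timeout=10):
    """Variant of parse_website that fails loudly on network or HTTP errors."""
    # A timeout keeps the call from hanging indefinitely on a slow server
    response = requests.get(url, timeout=timeout)

    # Raise requests.HTTPError for 4xx/5xx answers instead of parsing an error page
    response.raise_for_status()

    # "html.parser" is Python's built-in parser; "lxml" also works if it is installed
    return BeautifulSoup(response.content, "html.parser")
```

It can be called exactly like `parse_website`, for example `parse_website_checked(main_url)`.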

In [2]:
main_url = "https://en.wikipedia.org/wiki/List_of_James_Bond_films"
parser = parse_website(main_url)

Inspecting in the website...

![](../images/james_bond_films_page_html.PNG)

Inspecting using prettify...

In [3]:
print(parser.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of James Bond films - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"63c58dcd-5c19-43ff-8ab3-12f8676265d6","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_James_Bond_films","wgTitle":"List of James Bond films","wgCurRevisionId":1070734356,"wgRevisionId":1070734356,"wgArticleId":33190861,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Use dmy dates from J

Extracting title of the page...

In [4]:
parser.title

<title>List of James Bond films - Wikipedia</title>

In [5]:
type(parser.title)

bs4.element.Tag

In [6]:
title = parser.title.text
title

'List of James Bond films - Wikipedia'

Extracting some text...

`<body>` is the main branch of the HTML document, where elements such as paragraphs and hyperlinks are located. To access paragraphs we use the tag `p`. `find` returns only the first paragraph, while `find_all` returns all of them.

In [7]:
# only the 1st paragraph

print(parser.body.find('p').text)


James Bond is a fictional character created by novelist Ian Fleming in 1953. A British secret agent working for MI6 under the codename 007, he has been portrayed on film by actors Sean Connery, David Niven, George Lazenby, Roger Moore, Timothy Dalton, Pierce Brosnan and Daniel Craig in twenty-seven productions. All but two films were made by Eon Productions, which now holds the adaptation rights to all of Fleming's Bond novels.[1][2]



In [8]:
# all paragraphs

parser.body.find_all('p')

[<p>
 <a href="/wiki/James_Bond_(literary_character)" title="James Bond (literary character)">James Bond</a> is a <a href="/wiki/Character_(arts)" title="Character (arts)">fictional character</a> created by novelist <a href="/wiki/Ian_Fleming" title="Ian Fleming">Ian Fleming</a> in 1953. A British secret agent working for <a href="/wiki/Secret_Intelligence_Service" title="Secret Intelligence Service">MI6</a> under the codename 007, he has been <a href="/wiki/Portrayal_of_James_Bond_in_film" title="Portrayal of James Bond in film">portrayed on film</a> by actors <a href="/wiki/Sean_Connery" title="Sean Connery">Sean Connery</a>, <a href="/wiki/David_Niven" title="David Niven">David Niven</a>, <a href="/wiki/George_Lazenby" title="George Lazenby">George Lazenby</a>, <a href="/wiki/Roger_Moore" title="Roger Moore">Roger Moore</a>, <a href="/wiki/Timothy_Dalton" title="Timothy Dalton">Timothy Dalton</a>, <a href="/wiki/Pierce_Brosnan" title="Pierce Brosnan">Pierce Brosnan</a> and <a href="

In [9]:
type(parser.body.find_all('p'))

bs4.element.ResultSet

In [10]:
print(parser.body.find_all('p')[1].text)

In 1961, producers Albert R. Broccoli and Harry Saltzman purchased the filming rights to Fleming's novels.[3] They founded Eon Productions and, with financial backing by United Artists, produced Dr. No, directed by Terence Young and featuring Connery as Bond.[4] Following its release in 1962, Broccoli and Saltzman created the holding company Danjaq to ensure future productions in the James Bond film series.[5] The series currently has twenty-five films, with the most recent, No Time to Die, released in September 2021. With a combined gross of nearly $7 billion to date, it is the fifth-highest-grossing film series.[6] Accounting for inflation, it has earned over $14 billion at current prices.[a] The films have won five Academy Awards: for Sound Effects (now Sound Editing) in Goldfinger (at the 37th Awards), to John Stears for Visual Effects in Thunderball (at the 38th Awards), to Per Hallberg and Karen Baker Landers for Sound Editing, to Adele and Paul Epworth for Original Song in Skyfa

In [11]:
len(parser.body.find_all('p'))

56

In [12]:
# find all paragraphs within the body of html
list_paragraphs = parser.body.find_all('p')
# extract the string within it
list_paragraphs = [p.text for p in list_paragraphs]
# show the first 2 paragraphs
list_paragraphs[:2]

["\nJames Bond is a fictional character created by novelist Ian Fleming in 1953. A British secret agent working for MI6 under the codename 007, he has been portrayed on film by actors Sean Connery, David Niven, George Lazenby, Roger Moore, Timothy Dalton, Pierce Brosnan and Daniel Craig in twenty-seven productions. All but two films were made by Eon Productions, which now holds the adaptation rights to all of Fleming's Bond novels.[1][2]\n",
 'In 1961, producers Albert R. Broccoli and Harry Saltzman purchased the filming rights to Fleming\'s novels.[3] They founded Eon Productions and, with financial backing by United Artists, produced Dr. No, directed by Terence Young and featuring Connery as Bond.[4] Following its release in 1962, Broccoli and Saltzman created the holding company Danjaq to ensure future productions in the James Bond film series.[5] The series currently has twenty-five films, with the most recent, No Time to Die, released in September 2021. With a combined gross of ne

In [13]:
text_films = ' '.join(list_paragraphs).strip()
# First 2000 characters
print(text_films[:2000])

James Bond is a fictional character created by novelist Ian Fleming in 1953. A British secret agent working for MI6 under the codename 007, he has been portrayed on film by actors Sean Connery, David Niven, George Lazenby, Roger Moore, Timothy Dalton, Pierce Brosnan and Daniel Craig in twenty-seven productions. All but two films were made by Eon Productions, which now holds the adaptation rights to all of Fleming's Bond novels.[1][2]
 In 1961, producers Albert R. Broccoli and Harry Saltzman purchased the filming rights to Fleming's novels.[3] They founded Eon Productions and, with financial backing by United Artists, produced Dr. No, directed by Terence Young and featuring Connery as Bond.[4] Following its release in 1962, Broccoli and Saltzman created the holding company Danjaq to ensure future productions in the James Bond film series.[5] The series currently has twenty-five films, with the most recent, No Time to Die, released in September 2021. With a combined gross of nearly $7 bi
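
The extracted text still carries Wikipedia's footnote markers such as `[1]`, `[2]` or `[a]`. One possible way to drop them is a small regular expression, sketched below; the helper name `remove_citation_markers` and the sample sentence (stitched together from fragments of the output above) are assumptions for this illustration.

```python
import re


def remove_citation_markers(text):
    """Drop bracketed footnote markers such as [1], [23] or [a]."""
    # \d+ matches numeric references, [a-z] matches single-letter notes
    return re.sub(r"\[(?:\d+|[a-z])\]", "", text)


sample = "Fleming's Bond novels.[1][2] It has earned over $14 billion at current prices.[a]"
cleaned = remove_citation_markers(sample)
print(cleaned)
```

In the notebook this could be applied as `remove_citation_markers(text_films)`.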

## Example 2: Extracting tables from websites

Usually, you will find tables under [tag `tbody`](https://www.w3schools.com/tags/tag_tbody.asp)

<img src="../images/tbody.PNG" width="400" />

### Example 2.1

The `<tr>` HTML element defines a row of cells in a table. The row's cells can then be established using a mix of `<td>` (data cell) and `<th>` (header cell) elements.
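
A minimal sketch of that row and cell structure, using a tiny hand-written table (the HTML below is invented for illustration and much simpler than the Wikipedia table):

```python
from bs4 import BeautifulSoup

# Hypothetical table: one header row and two data rows
table_html = """
<table>
  <tbody>
    <tr><th scope="col">Title</th><th scope="col">Year</th></tr>
    <tr><th scope="row">Dr. No</th><td>1962</td></tr>
    <tr><th scope="row">Goldfinger</th><td>1964</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(table_html, "html.parser")

rows = []
for tr in soup.find("tbody").find_all("tr"):
    # Passing a list to find_all matches header and data cells alike, in document order
    cells = tr.find_all(["th", "td"])
    rows.append([cell.text.strip() for cell in cells])

header, data = rows[0], rows[1:]
print(header)
print(data)
```

Walking the table row by row keeps the cells of each record together, which is harder to guarantee when all `<td>` tags are collected at once.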

In [14]:
len(parser.find_all('tbody'))

8

In [15]:
parser.find_all('tbody')[1]

<tbody><tr>
<th rowspan="2" scope="col">Title
</th>
<th rowspan="2" scope="col">Year
</th>
<th rowspan="2" scope="col">Bond actor
</th>
<th rowspan="2" scope="col">Director(s)
</th>
<th class="unsortable" colspan="2">Box office (millions)<sup class="reference" id="cite_ref-FOOTNOTEBlockAutrey_Wilson2010428–429_15-20"><a href="#cite_note-FOOTNOTEBlockAutrey_Wilson2010428–429-15">[14]</a></sup>
</th>
<th class="unsortable" colspan="2">Budget (millions)<sup class="reference" id="cite_ref-FOOTNOTEBlockAutrey_Wilson2010428–429_15-21"><a href="#cite_note-FOOTNOTEBlockAutrey_Wilson2010428–429-15">[14]</a></sup>
</th>
<th class="unsortable" rowspan="2" scope="col"><span class="nowrap"><abbr title="References">Ref(s)</abbr></span>
</th></tr>
<tr class="unsortable">
<th data-sort-type="number" scope="col">Actual $
</th>
<th data-sort-type="number" scope="col">Adjusted 2005 $
</th>
<th data-sort-type="number" scope="col">Actual $
</th>
<th data-sort-type="number" scope="col">Adjusted 2005 $
</th>

Finding the column headers...

In [16]:
table_02 = parser.find_all('tbody')[1]

In [17]:
table_02.find_all('th', scope = "col")

[<th rowspan="2" scope="col">Title
 </th>,
 <th rowspan="2" scope="col">Year
 </th>,
 <th rowspan="2" scope="col">Bond actor
 </th>,
 <th rowspan="2" scope="col">Director(s)
 </th>,
 <th class="unsortable" rowspan="2" scope="col"><span class="nowrap"><abbr title="References">Ref(s)</abbr></span>
 </th>,
 <th data-sort-type="number" scope="col">Actual $
 </th>,
 <th data-sort-type="number" scope="col">Adjusted 2005 $
 </th>,
 <th data-sort-type="number" scope="col">Actual $
 </th>,
 <th data-sort-type="number" scope="col">Adjusted 2005 $
 </th>,
 <th colspan="4" scope="col">Total of non-Eon films
 </th>]

![](../images/tab2.PNG)

In [18]:
# Obtain column names from the <th> tags whose attribute scope is "col"
list_col_01 = table_02.find_all('th', scope = "col")
list_col_01 = [item.text.strip() for item in list_col_01 if 'Ref' not in item.text and 'Total' not in item.text]
list_col_01

['Title',
 'Year',
 'Bond actor',
 'Director(s)',
 'Actual $',
 'Adjusted 2005 $',
 'Actual $',
 'Adjusted 2005 $']

In [19]:
# Obtain the group headers (the <th> tags with class "unsortable") and remove the footnote marker so the names are clean
list_col_02 = table_02.find_all('th', class_="unsortable")
list_col_02 = [item.text.strip().replace('[14]',"") for item in list_col_02 if 'Ref' not in item.text and 'Total' not in item.text]
# Each group header spans two sub-columns, so duplicate the names and sort them to line up with list_col_01[4:]
list_col_02=list_col_02*2
list_col_02.sort()
list_col_02


['Box office (millions)',
 'Box office (millions)',
 'Budget (millions)',
 'Budget (millions)']

In [20]:
# Putting it all together: the first four names stand alone, the others get their group header as a prefix
list_columns = [name if idx < 4 else list_col_02[idx - 4] + ' ' + name for idx, name in enumerate(list_col_01)]
list_columns

['Title',
 'Year',
 'Bond actor',
 'Director(s)',
 'Box office (millions) Actual $',
 'Box office (millions) Adjusted 2005 $',
 'Budget (millions) Actual $',
 'Budget (millions) Adjusted 2005 $']

In [21]:
# Obtain title of the movies
list_films = table_02.find_all('th', scope = "row")
list_films = [film.text.strip() for film in list_films]
list_films

['Casino Royale', 'Never Say Never Again']

In [22]:
# Obtain all other information about those movies
list_info_films = [item.text.strip() for item in table_02.find_all('td')]
# Each film row has 8 <td> cells; drop the 8th one (the Ref(s) column)
list_info_films = [list_info_films[idx] for idx in range(len(list_info_films)) if idx % 8 != 7]
# Show the first 10 elements of the list
list_info_films[:10]

['1967',
 'David Niven',
 'Ken HughesJohn HustonJoseph McGrathRobert ParrishVal GuestRichard Talmadge',
 '44.4',
 '260.0',
 '12.0',
 '70.0',
 '1983',
 'Sean Connery',
 'Irvin Kershner']

In [23]:
# Organize list_info_films by feature: 7 values remain per film, so every 7th element belongs to the same column
list_year_film = list_info_films[0::7]
list_actor = list_info_films[1::7]
list_director = list_info_films[2::7]
list_box_office_actual = list_info_films[3::7]
list_box_office_adj_2005 = list_info_films[4::7]
list_budget_actual = list_info_films[5::7]
list_budget_adj_2005 = list_info_films[6::7]

In [24]:
list_of_lists_films = [list_films, list_year_film, list_actor, list_director, list_box_office_actual, list_box_office_adj_2005, 
                 list_budget_actual, list_budget_adj_2005]

In [25]:
# Build a dictionary for our dataframe
dict_films = {list_columns[idx]:list_of_lists_films[idx] for idx in range(len(list_columns))}
# showing 2 items of the dictionary
dict(list(dict_films.items())[0:2])

{'Title': ['Casino Royale', 'Never Say Never Again'], 'Year': ['1967', '1983']}

In [26]:
import pandas as pd

df_films = pd.DataFrame(dict_films)
df_films

| | Title | Year | Bond actor | Director(s) | Box office (millions) Actual $ | Box office (millions) Adjusted 2005 $ | Budget (millions) Actual $ | Budget (millions) Adjusted 2005 $ |
|---|---|---|---|---|---|---|---|---|
| 0 | Casino Royale | 1967 | David Niven | Ken HughesJohn HustonJoseph McGrathRobert Parr... | 44.4 | 260.0 | 12.0 | 70.0 |
| 1 | Never Say Never Again | 1983 | Sean Connery | Irvin Kershner | 160.0 | 314.0 | 36.0 | 71.0 |
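
Notice that the directors of *Casino Royale* are glued together (`Ken HughesJohn Huston...`). This typically happens when the names in a cell are separated only by tags such as `<br/>`, which `.text` simply skips. A possible fix, sketched on a made-up cell that assumes this structure, is `get_text` with a separator:

```python
from bs4 import BeautifulSoup

# Hypothetical cell: names separated by <br/> tags instead of plain text
cell_html = "<td>Ken Hughes<br/>John Huston<br/>Joseph McGrath</td>"
cell = BeautifulSoup(cell_html, "html.parser").td

glued = cell.text                                      # pieces are concatenated directly
separated = cell.get_text(separator=", ", strip=True)  # pieces are stripped, then joined with ", "

print(glued)
print(separated)
```

In the notebook, `item.get_text(separator=", ", strip=True)` could replace `item.text.strip()` when building `list_info_films`.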


### Example 2.2 : Another table

In [27]:
url = "https://nl.wikipedia.org/wiki/Lijst_van_titelsongs_uit_de_James_Bondfilms"

In [28]:
parser_2 = parse_website(url)

In [29]:
len(parser_2.find_all("tbody"))

3

In [30]:
table_01 = parser_2.find('tbody')

In [31]:
print(table_01.prettify())

<tbody>
 <tr>
  <th>
   Titelsong
  </th>
  <th>
   Artiest
  </th>
  <th>
   Film
  </th>
  <th>
   Jaar
  </th>
  <th>
   Componist
  </th>
 </tr>
 <tr>
  <td>
   <i>
    <a href="/wiki/James_Bond_Theme" title="James Bond Theme">
     James Bond Theme
    </a>
   </i>
   en
   <br/>
   <i>
    Kingston Calypso
   </i>
  </td>
  <td>
   Orkest o.l.v.
   <a href="/wiki/John_Barry" title="John Barry">
    John Barry
   </a>
  </td>
  <td>
   <i>
    <a href="/wiki/Dr._No_(film)" title="Dr. No (film)">
     Dr. No
    </a>
   </i>
  </td>
  <td>
   1962
  </td>
  <td>
   <a href="/wiki/Monty_Norman" title="Monty Norman">
    Monty Norman
   </a>
   &amp;
   <a href="/wiki/John_Barry" title="John Barry">
    John Barry
   </a>
  </td>
 </tr>
 <tr>
  <td>
   <i>
    <a href="/wiki/From_Russia_with_Love_(film)" title="From Russia with Love (film)">
     From Russia with Love
    </a>
   </i>
  </td>
  <td>
   <a href="/wiki/Matt_Monro" title="Matt Monro">
    Matt Monro
   </a>
  </td>
  <t

In [32]:
list_columns_songs = table_01.find_all('th')
list_columns_songs = [item.text.strip() for item in list_columns_songs]

In [33]:
list_columns_songs

['Titelsong', 'Artiest', 'Film', 'Jaar', 'Componist']

In [34]:
songs_data = pd.DataFrame(columns = list_columns_songs)

In [35]:
# Loop over the table rows (skipping the header row) and append each one to songs_data
for j in table_01.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(songs_data)
    songs_data.loc[length] = row

In [36]:
songs_data

| | Titelsong | Artiest | Film | Jaar | Componist |
|---|---|---|---|---|---|
| 0 | James Bond Theme en Kingston Calypso | Orkest o.l.v. John Barry | Dr. No | 1962 | Monty Norman & John Barry\n |
| 1 | From Russia with Love | Matt Monro | From Russia with Love | 1963 | John Barry & Lionel Bart\n |
| 2 | Goldfinger | Shirley Bassey | Goldfinger | 1964 | John Barry & Anthony Newley & Leslie Bricusse\n |
| 3 | Thunderball | Tom Jones | Thunderball | 1965 | John Barry & Don Black\n |
| 4 | You Only Live Twice | Nancy Sinatra | You Only Live Twice | 1967 | John Barry & Leslie Bricusse\n |
| 5 | On Her Majesty's Secret Service | Orkest o.l.v. John Barry | On Her Majesty's Secret Service | 1969 | John Barry\n |
| 6 | Diamonds Are Forever | Shirley Bassey | Diamonds Are Forever | 1971 | John Barry & Don Black\n |
| 7 | Live and Let Die | Paul McCartney & Wings | Live and Let Die | 1973 | Paul McCartney & Linda McCartney\n |
| 8 | The Man with the Golden Gun | Lulu | The Man with the Golden Gun | 1974 | John Barry & Don Black\n |
| 9 | Nobody Does It Better | Carly Simon | The Spy Who Loved Me | 1977 | Marvin Hamlisch & Carole Bayer Sager\n |
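
The `Componist` values above still end with `\n` because the cell text was not stripped. As an alternative sketch, the rows can be collected in a plain list, cleaned as they are read, and turned into a DataFrame in one step instead of being appended with `.loc`. The table below is a shortened, hand-written stand-in for the real page, reusing two rows from the output above; its exact markup is an assumption.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Shortened stand-in for the songs table (markup is hypothetical)
table_html = """
<table><tbody>
<tr><th>Titelsong</th><th>Jaar</th><th>Componist</th></tr>
<tr><td><i>Thunderball</i></td><td>1965</td><td>John Barry &amp; Don Black
</td></tr>
<tr><td><i>On Her Majesty's Secret Service</i></td><td>1969</td><td>John Barry
</td></tr>
</tbody></table>
"""
table = BeautifulSoup(table_html, "html.parser").find("tbody")

columns = [th.text.strip() for th in table.find_all("th")]

# get_text(" ", strip=True) removes surrounding whitespace, including the trailing newline
rows = [
    [td.get_text(" ", strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")[1:]  # skip the header row
]

songs = pd.DataFrame(rows, columns=columns)
print(songs)
```

Building the DataFrame once from a list of rows also avoids growing it row by row, which gets slow for large tables.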


## Example 3: Accessing information within hyperlinks

At the beginning we saw that hyperlinks are defined by `<a>` tags. Inspecting the web page, you will notice that the attribute `href` holds the hyperlink address.

In [37]:
url = "https://www.stlyrics.com/b/bestofbondjamesbond.htm"

In [38]:
parser_03 = parse_website(url)

In [39]:
parser_03.find_all('a')

[<a class="pull-left mainLogo" href="/"><img alt="Best of Bond... James Bond Lyrics - STLyrics.com" height="46" src="/images/desktop/Logo.png" width="180"/></a>,
 <a href="/">Soundtracks:</a>,
 <a href="/a.htm">A</a>,
 <a href="/b.htm">B</a>,
 <a href="/c.htm">C</a>,
 <a href="/d.htm">D</a>,
 <a href="/e.htm">E</a>,
 <a href="/f.htm">F</a>,
 <a href="/g.htm">G</a>,
 <a href="/h.htm">H</a>,
 <a href="/i.htm">I</a>,
 <a href="/j.htm">J</a>,
 <a href="/k.htm">K</a>,
 <a href="/l.htm">L</a>,
 <a href="/m.htm">M</a>,
 <a href="/n.htm">N</a>,
 <a href="/o.htm">O</a>,
 <a href="/p.htm">P</a>,
 <a href="/q.htm">Q</a>,
 <a href="/r.htm">R</a>,
 <a href="/s.htm">S</a>,
 <a href="/t.htm">T</a>,
 <a href="/u.htm">U</a>,
 <a href="/v.htm">V</a>,
 <a href="/w.htm">W</a>,
 <a href="/x.htm">X</a>,
 <a href="/y.htm">Y</a>,
 <a href="/z.htm">Z</a>,
 <a href="/19.htm">#</a>,
 <a href="/songs/">List of artists:</a>,
 <a href="/songs/a.html">A</a>,
 <a href="/songs/b.html">B</a>,
 <a href="/songs/c.html">C

In [40]:
def retrieve_hyperlinks(main_url):
    """ 
    Find hyperlinks in 'main_url'.

    Args:
        main_url (str): URL of the main webpage containing the hyperlinks.

    Returns:
        list of str: hyperlinks found in main_url.
        
    """
    # parse website containing hyperlinks
    parser = parse_website(main_url)
    
    # Find all 'a' tags (which define hyperlinks): a_tags

    a_tags = parser.find_all('a')

    # Create a list with hyperlinks found

    list_links = [link.get('href') for link in a_tags]

    # Remove None values, if there are any
    
    list_links = list(filter(None, list_links)) 
    
    return list_links

In [41]:
main_url = "https://www.stlyrics.com/b/bestofbondjamesbond.htm"

list_links = retrieve_hyperlinks(main_url)

list_links = list(set(list_links))

print('\n Number of links before filtering:', len(list_links))
list_links[:20]


 Number of links before filtering: 109


['/contact.htm',
 '/songs/i.html',
 '/h/hawkeye.htm',
 '/lyrics/bestofbondjamesbond/youknowmyname.htm',
 '/lyrics/bestofbondjamesbond/anotherwaytodie.htm',
 '/songs/0-9.html',
 '/dmca.htm',
 '/19/355.htm',
 '/songs/e.html',
 '/songs/q.html',
 '/songs/o.html',
 '/songs/',
 '/n/notimetodie.htm',
 '/',
 '/h.htm',
 '/c.htm',
 '/r.htm',
 '/songs/k.html',
 '/lyrics/bestofbondjamesbond/liveandletdie.htm',
 '/g.htm']

In [42]:
list_links = [link for link in list_links if 'bestofbondjamesbond' in link]
print('\n Number of links after filtering:', len(list_links))
list_links


 Number of links after filtering: 23


['/lyrics/bestofbondjamesbond/youknowmyname.htm',
 '/lyrics/bestofbondjamesbond/anotherwaytodie.htm',
 '/lyrics/bestofbondjamesbond/liveandletdie.htm',
 '/lyrics/bestofbondjamesbond/goldfinger.htm',
 '/lyrics/bestofbondjamesbond/moonraker.htm',
 '/lyrics/bestofbondjamesbond/onhermajestyssecretservice.htm',
 '/lyrics/bestofbondjamesbond/wehaveallthetimeintheworld.htm',
 '/lyrics/bestofbondjamesbond/alltimehigh.htm',
 '/lyrics/bestofbondjamesbond/dieanotherday.htm',
 '/lyrics/bestofbondjamesbond/thunderball.htm',
 '/lyrics/bestofbondjamesbond/foryoureyesonly.htm',
 '/lyrics/bestofbondjamesbond/tomorrowneverdies.htm',
 '/lyrics/bestofbondjamesbond/diamondsareforever.htm',
 '/lyrics/bestofbondjamesbond/licencetokill.htm',
 '/lyrics/bestofbondjamesbond/youonlylivetwice.htm',
 '/lyrics/bestofbondjamesbond/aviewtoakill.htm',
 '/lyrics/bestofbondjamesbond/themanwiththegoldengun.htm',
 '/lyrics/bestofbondjamesbond/theworldisnotenough.htm',
 '/lyrics/bestofbondjamesbond/goldeneye.htm',
 '/lyrics
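
The same selection can also be pushed into the parser. The sketch below, run on a made-up handful of links, uses `find_all("a", href=True)` to skip anchors without an `href`, and a CSS attribute selector with `select` to keep only links whose address contains a given substring.

```python
from bs4 import BeautifulSoup

# Hypothetical navigation block mimicking the kinds of links found on the page
nav_html = """
<a class="mainLogo" href="/"><img alt="logo"/></a>
<a href="/a.htm">A</a>
<a name="top">anchor without href</a>
<a href="/lyrics/bestofbondjamesbond/goldfinger.htm">Goldfinger</a>
<a href="/lyrics/bestofbondjamesbond/moonraker.htm">Moonraker</a>
"""
soup = BeautifulSoup(nav_html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute
all_links = [a["href"] for a in soup.find_all("a", href=True)]

# a[href*="..."] is a CSS selector meaning: href contains this substring
album_links = [a["href"] for a in soup.select('a[href*="bestofbondjamesbond"]')]

print(all_links)
print(album_links)
```

With `href=True` there are no `None` values to filter out afterwards.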

In [43]:
complete_urls = ["https://www.stlyrics.com"+link for link in list_links]
complete_urls

['https://www.stlyrics.com/lyrics/bestofbondjamesbond/youknowmyname.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/anotherwaytodie.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/liveandletdie.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/goldfinger.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/moonraker.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/onhermajestyssecretservice.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/wehaveallthetimeintheworld.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/alltimehigh.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/dieanotherday.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/thunderball.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/foryoureyesonly.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/tomorrowneverdies.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/diamondsareforever.htm',
 'https://
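
String concatenation works here because every link starts with `/`. A more general option, sketched below, is `urljoin` from the standard library, which resolves a link against the address of the page it was found on and leaves links that are already absolute untouched. The third entry in `relative_links` is an invented example added to show that case.

```python
from urllib.parse import urljoin

page_url = "https://www.stlyrics.com/b/bestofbondjamesbond.htm"

relative_links = [
    "/lyrics/bestofbondjamesbond/youknowmyname.htm",  # site-relative link from the page
    "/songs/",                                        # another site-relative link
    "https://example.com/other.htm",                  # hypothetical absolute link
]

# urljoin resolves each link the way a browser would
absolute_links = [urljoin(page_url, link) for link in relative_links]
print(absolute_links)
```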

To retrieve the lyrics of a song, we apply what we learnt in the previous sections:

In [44]:
# hyperlink to first lyric
lyrics_url = complete_urls[0]

# parse hyperlink
parse_lyrics = parse_website(lyrics_url)

# Access lyric
lyrics_list = parse_lyrics.find_all('div', class_="highlight")

# Extract the text within each tag and strip leading and trailing whitespace
lyrics_list = [item.text.strip() for item in lyrics_list]

# Remove empty strings, if there are any
lyrics_list = list(filter(None, lyrics_list)) 

print('\n'.join(lyrics_list))

If you take a life, do you know what you'll give
Odds are, you won't like what it is
When the storm arrives, would you be seen with me
By the merciless eyes I've deceived
I've seen angels fall from blinding heights
But you yourself are nothing so divine
Just next in line
Arm yourself because no-one else here will save you
The odds will betray you
And I will replace you
You can't deny the prize; it may never fulfill you
It longs to kill you
Are you willing to die
The coldest blood runs through my veins
You know my name
If you come inside, things will not be the same
When you return to my night
If you think you've won
You never saw me change
The game that we have been playing
I've seen diamonds cut through harder men
Than you yourself
but if you must pretend
You may meet your end
Arm yourself because no-one else here will save you
The odds will betray you
And I will replace you
You can't deny the prize; it may never fulfill you
It longs to kill you
Are you willing to die
The coldest bloo