The links we want are underlined in black in the screenshot below.
We will collect these links and then extract the text of the pages they point to.

![The links we want are underlined in black](picture.png)

# Set-up and Workflow

### Importing the packages

In [1]:
# Load the packages
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

### Making a GET request

In [2]:
# Defining the url of the site
base_site = "https://en.wikipedia.org/wiki/Music"

# Making a get request
response = requests.get(base_site)
response.status_code

200

In [3]:
# Extracting the HTML
html = response.content

# Checking that the reply is indeed HTML by inspecting the first 100 bytes
html[:100]

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title'
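
The `b'...'` prefix in the output above shows that `response.content` is a `bytes` object rather than a string. A minimal sketch of the difference follows; the hard-coded `raw` value is an assumption standing in for the real response, so the snippet runs without a network connection.

```python
# Stand-in for response.content; the real value comes from requests.get(...)
raw = b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">'

# Bytes must be decoded to obtain a regular string
decoded = raw.decode("utf-8")

print(type(raw).__name__)
print(type(decoded).__name__)

# A quick sanity check that the reply looks like an HTML document
looks_like_html = decoded.lstrip().lower().startswith("<!doctype html")
print(looks_like_html)
```

With requests, `response.text` returns the decoded string directly. BeautifulSoup accepts either form.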

### Making the soup

In [4]:
# Convert the HTML to a BeautifulSoup object. This will allow us to parse out content from the HTML more easily.
# Using "html.parser", as it ships with Python's standard library
soup = BeautifulSoup(html, "html.parser")

# Extracting data from nested tags

In [5]:
# Our objective now is to extract all links that appear in the notes under section headings,
# i.e. the notes marked as 'Main article:' or 'See also:'
# By quick inspection, we see that these are contained in div tags with attribute 'role' set to 'note'

div_notes = soup.find_all("div", {"role": "note"})
div_notes

[<div class="hatnote navigation-not-searchable" role="note">For other uses, see <a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>.</div>,
 <div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/History_of_music" title="History of music">History of music</a></div>,
 <div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/Music_of_Egypt" title="Music of Egypt">Music of Egypt</a></div>,
 <div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/Music_of_Greece" title="Music of Greece">Music of Greece</a></div>,
 <div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/20th-century_music" title="20th-century music">20th-century music</a></div>,
 <div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/Musical_composition" title="Musical composition">Musical composition<
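
The attribute filter used above can be tried out on a small hand-written HTML sample. This is only a sketch: `sample_html` and the other names are made up for illustration and are not the real Wikipedia markup. It also shows that the same filter can be written as a CSS attribute selector with `select`.

```python
from bs4 import BeautifulSoup

# Small hand-written HTML sample (an assumption, not the real Wikipedia page)
sample_html = """
<div role="note">Main article: <a href="/wiki/History_of_music">History of music</a></div>
<div class="thumb">Not a note</div>
<div role="note">See also: <a href="/wiki/Binary_form">Binary form</a></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Filter by tag name and attribute value with a dictionary
by_dict = sample_soup.find_all("div", {"role": "note"})

# The same filter written as a CSS attribute selector
by_css = sample_soup.select('div[role="note"]')

print(len(by_dict), len(by_css))
print([div.a["href"] for div in by_css])
```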

In [6]:
div_notes[0]

<div class="hatnote navigation-not-searchable" role="note">For other uses, see <a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>.</div>

In [7]:
# We can apply find() and find_all() to a tag in the same way we do it to the whole document
div_notes[0].find('a')

<a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>

In [8]:
# A naive approach to get all links would be to use find
div_links = [div.find('a') for div in div_notes]
div_links

[<a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>,
 <a href="/wiki/History_of_music" title="History of music">History of music</a>,
 <a href="/wiki/Music_of_Egypt" title="Music of Egypt">Music of Egypt</a>,
 <a href="/wiki/Music_of_Greece" title="Music of Greece">Music of Greece</a>,
 <a href="/wiki/20th-century_music" title="20th-century music">20th-century music</a>,
 <a href="/wiki/Musical_composition" title="Musical composition">Musical composition</a>,
 <a href="/wiki/Musical_notation" title="Musical notation">Musical notation</a>,
 <a href="/wiki/Musical_improvisation" title="Musical improvisation">Musical improvisation</a>,
 <a href="/wiki/Music_theory" title="Music theory">Music theory</a>,
 <a href="/wiki/History_of_music" title="History of music">History of music</a>,
 <a href="/wiki/Strophic_form" title="Strophic form">Strophic form</a>,
 <a href="/wiki/Aesthetics_of_music" title="Aesthetics of music">Aesthe

In [9]:
len(div_links)

23

In [10]:
# However, some divs have more than 1 link
div_notes[10]

<div class="hatnote navigation-not-searchable" role="note">See also: <a href="/wiki/Strophic_form" title="Strophic form">Strophic form</a>, <a href="/wiki/Binary_form" title="Binary form">Binary form</a>, <a href="/wiki/Ternary_form" title="Ternary form">Ternary form</a>, <a class="mw-redirect" href="/wiki/Rondo_form" title="Rondo form">Rondo form</a>, <a href="/wiki/Variation_(music)" title="Variation (music)">Variation (music)</a>, and <a class="mw-redirect" href="/wiki/Musical_development" title="Musical development">Musical development</a></div>

In [11]:
# This div has 6 links in it
div_notes[10].find_all('a')

[<a href="/wiki/Strophic_form" title="Strophic form">Strophic form</a>,
 <a href="/wiki/Binary_form" title="Binary form">Binary form</a>,
 <a href="/wiki/Ternary_form" title="Ternary form">Ternary form</a>,
 <a class="mw-redirect" href="/wiki/Rondo_form" title="Rondo form">Rondo form</a>,
 <a href="/wiki/Variation_(music)" title="Variation (music)">Variation (music)</a>,
 <a class="mw-redirect" href="/wiki/Musical_development" title="Musical development">Musical development</a>]

In [12]:
# Therefore we need to use find_all
# Let's use a for loop

# Start with an empty list of links
div_links = []

for div in div_notes:
    anchors = div.find_all('a')

    # Add every link from anchors to div_links
    for a in anchors:
        div_links.append(a)

    # div_links.extend(anchors) could replace the inner for loop

In [13]:
div_links

[<a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>,
 <a href="/wiki/History_of_music" title="History of music">History of music</a>,
 <a href="/wiki/Music_of_Egypt" title="Music of Egypt">Music of Egypt</a>,
 <a href="/wiki/Music_of_Greece" title="Music of Greece">Music of Greece</a>,
 <a href="/wiki/20th-century_music" title="20th-century music">20th-century music</a>,
 <a href="/wiki/Musical_composition" title="Musical composition">Musical composition</a>,
 <a href="/wiki/Musical_notation" title="Musical notation">Musical notation</a>,
 <a href="/wiki/Musical_improvisation" title="Musical improvisation">Musical improvisation</a>,
 <a href="/wiki/Music_theory" title="Music theory">Music theory</a>,
 <a href="/wiki/History_of_music" title="History of music">History of music</a>,
 <a href="/wiki/Elements_of_music" title="Elements of music">Elements of music</a>,
 <a href="/wiki/Strophic_form" title="Strophic form">Stroph

In [14]:
# We now have a complete list
len(div_links)

31
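
The difference between the two approaches can be reproduced on a tiny sample. The sketch below uses made-up HTML (an assumption, not the real page) and writes the nested loop as a single list comprehension, which is one possible alternative to the explicit `for` loops.

```python
from bs4 import BeautifulSoup

# Hand-written sample: the second div holds two links
sample_html = """
<div role="note">Main article: <a href="/wiki/History_of_music">History of music</a></div>
<div role="note">See also: <a href="/wiki/Strophic_form">Strophic form</a>,
<a href="/wiki/Binary_form">Binary form</a></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_notes = sample_soup.find_all("div", {"role": "note"})

# find() keeps only the first link of each div
first_only = [div.find("a") for div in sample_notes]

# A nested comprehension collects every link of every div
all_links = [a for div in sample_notes for a in div.find_all("a")]

print(len(first_only), len(all_links))
print([a.text for a in all_links])
```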

In [15]:


# Let's get the absolute URLs by joining each relative href with the base URL
note_urls = [urljoin(base_site, link.get('href')) for link in div_links]
note_urls

['https://en.wikipedia.org/wiki/Music_(disambiguation)',
 'https://en.wikipedia.org/wiki/History_of_music',
 'https://en.wikipedia.org/wiki/Music_of_Egypt',
 'https://en.wikipedia.org/wiki/Music_of_Greece',
 'https://en.wikipedia.org/wiki/20th-century_music',
 'https://en.wikipedia.org/wiki/Musical_composition',
 'https://en.wikipedia.org/wiki/Musical_notation',
 'https://en.wikipedia.org/wiki/Musical_improvisation',
 'https://en.wikipedia.org/wiki/Music_theory',
 'https://en.wikipedia.org/wiki/History_of_music',
 'https://en.wikipedia.org/wiki/Elements_of_music',
 'https://en.wikipedia.org/wiki/Strophic_form',
 'https://en.wikipedia.org/wiki/Binary_form',
 'https://en.wikipedia.org/wiki/Ternary_form',
 'https://en.wikipedia.org/wiki/Rondo_form',
 'https://en.wikipedia.org/wiki/Variation_(music)',
 'https://en.wikipedia.org/wiki/Musical_development',
 'https://en.wikipedia.org/wiki/Aesthetics_of_music',
 'https://en.wikipedia.org/wiki/Neuroscience_of_music',
 'https://en.wikipedia.org/

In [16]:
len(note_urls)

31
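
To make the role of `urljoin` concrete, here is a small sketch with hypothetical `href` values of the kinds found in anchor tags. It also shows an optional de-duplication step, which the notebook does not perform, since some pages (for example History of music) are linked more than once.

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Music"

# Hypothetical href values, chosen for illustration
hrefs = [
    "/wiki/History_of_music",    # root-relative path
    "/wiki/Music_theory",
    "/wiki/History_of_music",    # repeated link
    "https://example.org/page",  # already absolute, left unchanged
]

urls = [urljoin(base, h) for h in hrefs]

# dict.fromkeys keeps the first occurrence of each URL and preserves order
unique_urls = list(dict.fromkeys(urls))

for u in unique_urls:
    print(u)
```

Removing duplicates before scraping avoids requesting the same page twice.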

# Scraping multiple pages automatically - Extracting all the text from the note URLs

In [17]:
# We will use the links we obtained above
note_urls

['https://en.wikipedia.org/wiki/Music_(disambiguation)',
 'https://en.wikipedia.org/wiki/History_of_music',
 'https://en.wikipedia.org/wiki/Music_of_Egypt',
 'https://en.wikipedia.org/wiki/Music_of_Greece',
 'https://en.wikipedia.org/wiki/20th-century_music',
 'https://en.wikipedia.org/wiki/Musical_composition',
 'https://en.wikipedia.org/wiki/Musical_notation',
 'https://en.wikipedia.org/wiki/Musical_improvisation',
 'https://en.wikipedia.org/wiki/Music_theory',
 'https://en.wikipedia.org/wiki/History_of_music',
 'https://en.wikipedia.org/wiki/Elements_of_music',
 'https://en.wikipedia.org/wiki/Strophic_form',
 'https://en.wikipedia.org/wiki/Binary_form',
 'https://en.wikipedia.org/wiki/Ternary_form',
 'https://en.wikipedia.org/wiki/Rondo_form',
 'https://en.wikipedia.org/wiki/Variation_(music)',
 'https://en.wikipedia.org/wiki/Musical_development',
 'https://en.wikipedia.org/wiki/Aesthetics_of_music',
 'https://en.wikipedia.org/wiki/Neuroscience_of_music',
 'https://en.wikipedia.org/

In [18]:
# The objective is to get all the useful text from those Wikipedia pages

# We will do that by extracting all text contained in a paragraph element,
# for all paragraphs on a page,
# for all pages (in note_urls)

In [19]:
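# lxml is a third-party parser; BeautifulSoup only needs it to be installed,
# so this import simply confirms that it is available
import lxml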

In [20]:
# Initialize a list to store the paragraph text of each webpage
par_text = []

# Loop through each URL in note_urls, counting from 1 to keep track of our place
for i, url in enumerate(note_urls, start=1):

    # Connect to the webpage
    note_resp = requests.get(url)

    # Check whether the request was successful
    if note_resp.status_code == 200:            # Everything is OK!
        print('URL #{0}: {1}'.format(i, url))

    else:                                       # Something is wrong!
        print('Status code {0}: Skipping URL #{1}: {2}'.format(note_resp.status_code, i, url))
        # Store an empty list so that par_text stays aligned with note_urls
        par_text.append([])
        continue

    # Get the HTML from the webpage
    note_html = note_resp.content

    # Convert the HTML to a BeautifulSoup object
    note_soup = BeautifulSoup(note_html, 'lxml')

    # Find all "p" tags on the webpage
    note_pars = note_soup.find_all("p")

    # Get the text from each "p" tag
    text = [p.text for p in note_pars]

    # Append the list of paragraph texts for this page to par_text
    par_text.append(text)


URL #1: https://en.wikipedia.org/wiki/Music_(disambiguation)
URL #2: https://en.wikipedia.org/wiki/History_of_music
URL #3: https://en.wikipedia.org/wiki/Music_of_Egypt
URL #4: https://en.wikipedia.org/wiki/Music_of_Greece
URL #5: https://en.wikipedia.org/wiki/20th-century_music
URL #6: https://en.wikipedia.org/wiki/Musical_composition
URL #7: https://en.wikipedia.org/wiki/Musical_notation
URL #8: https://en.wikipedia.org/wiki/Musical_improvisation
URL #9: https://en.wikipedia.org/wiki/Music_theory
URL #10: https://en.wikipedia.org/wiki/History_of_music
URL #11: https://en.wikipedia.org/wiki/Elements_of_music
URL #12: https://en.wikipedia.org/wiki/Strophic_form
URL #13: https://en.wikipedia.org/wiki/Binary_form
URL #14: https://en.wikipedia.org/wiki/Ternary_form
URL #15: https://en.wikipedia.org/wiki/Rondo_form
URL #16: https://en.wikipedia.org/wiki/Variation_(music)
URL #17: https://en.wikipedia.org/wiki/Musical_development
URL #18: https://en.wikipedia.org/wiki/Aesthetics_of_music
UR
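
The parsing step of the loop above can be isolated in a small function that works on HTML alone, so it can be tried without sending any requests. This is a sketch under assumptions: `paragraphs_from_html` is a hypothetical helper name, the sample HTML is made up, and `"html.parser"` is used instead of `"lxml"` only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

def paragraphs_from_html(html):
    """Return the text of every <p> tag in an HTML document."""
    page_soup = BeautifulSoup(html, "html.parser")
    return [p.text for p in page_soup.find_all("p")]

# Hand-written sample page (not real Wikipedia content)
sample_html = """
<html><body>
<h1>Title</h1>
<p>First paragraph.</p>
<ul><li>A list item, not a paragraph</li></ul>
<p>Second <b>paragraph</b>.</p>
</body></html>
"""

print(paragraphs_from_html(sample_html))
```

Note that the text of nested tags such as `<b>` is included, while the heading and the list item are not.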

In [21]:
# Inspecting the result for the first page
par_text[0]

['Music is an art form consisting of sound and silence, expressed through time.\n',
 'Music may also refer to:\n']

In [22]:
# We see that we have a list of all paragraph strings
# It would be more useful to have all the text as one string, not as a list of strings

# Merging all paragraphs of the first page into one long string
page_text = "".join(par_text[0])
page_text

'Music is an art form consisting of sound and silence, expressed through time.\nMusic may also refer to:\n'

In [23]:
# Let's do that for all pages

# Merging all paragraphs for all pages
page_text = ["".join(text) for text in par_text]

# Inspect the result for the first webpage
page_text[0]

'Music is an art form consisting of sound and silence, expressed through time.\nMusic may also refer to:\n'
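
The merging step can be checked on a small made-up input. In this sketch `sample_par_text` is a hypothetical stand-in for `par_text`, with one inner list of paragraph strings per page.

```python
# Hypothetical paragraph lists for two pages
sample_par_text = [
    ["First paragraph.\n", "Second paragraph.\n"],
    ["Only paragraph.\n"],
]

# One string per page; the paragraphs already end in newlines,
# so an empty separator keeps each of them on its own line
sample_page_text = ["".join(pars) for pars in sample_par_text]

print(sample_page_text[0])
print(len(sample_page_text))
```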

In [24]:
# Inspect result
print(page_text[4])

The following Wikipedia articles deal with 20th-century music.







In [25]:
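# Creating a dictionary with the (key, value) pairs being (url, text)
# zip pairs each URL with the text at the same position, so page_text must hold one entry per URL, in the same order
# A URL that appears more than once in note_urls ends up as a single key
url_to_text = dict(zip(note_urls, page_text))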

In [26]:
print(url_to_text['https://en.wikipedia.org/wiki/Music_theory'])


Music theory is the study of the practices and possibilities of music. The Oxford Companion to Music describes three interrelated uses of the term "music theory". The first is the "rudiments", that are needed to understand music notation (key signatures, time signatures, and rhythmic notation); the second is learning scholars' views on music from antiquity to the present; the third is a sub-topic of musicology that "seeks to define processes and general principles in music". The musicological approach to theory differs from music analysis "in that it takes as its starting-point not the individual work or performance but the fundamental materials from which it is built."[1]
Music theory is frequently concerned with describing how musicians and composers make music, including tuning systems and composition methods among other topics. Because of the ever-expanding conception of what constitutes music (see Definition of music), a more inclusive definition could be the consideration of any
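
Two properties of `dict(zip(...))` are worth keeping in mind when building such a lookup. The sketch below uses made-up URLs and texts (all names are hypothetical) to show that repeated keys collapse into one entry and that `zip` silently stops at the shorter input.

```python
sample_urls = ["https://example.org/a", "https://example.org/b", "https://example.org/a"]
sample_texts = ["text A", "text B", "text A again"]

# zip pairs items by position; a dict keeps one value per key,
# so the repeated URL keeps only the value assigned last
mapping = dict(zip(sample_urls, sample_texts))
print(len(mapping))
print(mapping["https://example.org/a"])

# If the lists differ in length, zip stops at the shorter one without an error
short_texts = sample_texts[:2]
pairs = list(zip(sample_urls, short_texts))
print(len(pairs))
```

This is why the list of texts needs exactly one entry per URL, in the same order as the URL list.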

In [27]:
# A word of caution:
# We have not extracted all of the text in the main content,
# as some of it may sit in lists and tables, outside the paragraphs we scraped
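
One possible way to widen the extraction is to pass a list of tag names to `find_all`, which matches any of them. The following is a minimal sketch on made-up HTML; which tags are worth including (here `li` and `td`) is an assumption that depends on the pages being scraped.

```python
from bs4 import BeautifulSoup

# Hand-written sample with a paragraph, a list and a table
sample_html = """
<p>A paragraph.</p>
<ul><li>First item</li><li>Second item</li></ul>
<table><tr><td>Cell text</td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Paragraphs only, as in the notebook
only_paragraphs = [p.text for p in sample_soup.find_all("p")]

# A list of tag names matches any of them, in document order
wider = [tag.get_text(strip=True) for tag in sample_soup.find_all(["p", "li", "td"])]

print(only_paragraphs)
print(wider)
```

If the matched tags are nested inside each other (for example a `p` inside an `li`), the same text would be collected twice, so the result may need de-duplication.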