## Using BeautifulSoup

BeautifulSoup is a library that parses HTML into a tree of Python objects, which we can navigate, extract information from, and edit.

In [1]:
#!conda install -y bs4

#### Convert the raw HTML string to a BeautifulSoup object

In [1]:
import requests
from bs4 import BeautifulSoup

### Task 1: Extract the poem titles, URLs and poem texts with Beautiful Soup

In [2]:
url = 'https://www.loc.gov/programs/poetry-and-literature/poet-laureate/poet-laureate-projects/poetry-180/all-poems/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url,headers=headers)
response.text

'<!DOCTYPE html>\n\n\n<html lang="en" class="no-js" prefix="lc: http://loc.gov/#">\n<head>\n\n    \n<meta charset="utf-8">\n<meta name="viewport" content="width=device-width,initial-scale=1"/>\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="version" content="$Revision$"/>\n<meta name="msvalidate.01" content="5C89FB9D99590AB2F55BD95C3A59BD81"/>\n<link title="schema(DC)" rel="schema.dc" href="http://purl.org/dc/elements/1.1/"/>\n<meta name="dc.language" content="eng" />\n<meta name="dc.source" content="Library of Congress, Washington, D.C. 20540 USA" />\n\n\n    <link rel="alternate" type="application/json" href="https://www.loc.gov/programs/poetry-and-literature/poet-laureate/poet-laureate-projects/poetry-180/all-poems/?fo=json" />\n\n\n<meta property="fb:admins" content="libraryofcongress"/>\n<meta property="og:site_name" content="The Library of Congress"/>\n<meta property="og:type" content="article" />\n\n\n<meta property="twitter:site" content="librarycongress"/>\

In [3]:
# check status code
response.status_code

200

In [4]:
type(response.text)

str

In [5]:
soup = BeautifulSoup(response.text,'html.parser')
soup

<!DOCTYPE html>

<html class="no-js" lang="en" prefix="lc: http://loc.gov/#">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width,initial-scale=1" name="viewport">
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="$Revision$" name="version">
<meta content="5C89FB9D99590AB2F55BD95C3A59BD81" name="msvalidate.01"/>
<link href="http://purl.org/dc/elements/1.1/" rel="schema.dc" title="schema(DC)"/>
<meta content="eng" name="dc.language"/>
<meta content="Library of Congress, Washington, D.C. 20540 USA" name="dc.source"/>
<link href="https://www.loc.gov/programs/poetry-and-literature/poet-laureate/poet-laureate-projects/poetry-180/all-poems/?fo=json" rel="alternate" type="application/json"/>
<meta content="libraryofcongress" property="fb:admins"/>
<meta content="The Library of Congress" property="og:site_name"/>
<meta content="article" property="og:type"/>
<meta content="librarycongress" property="twitter:site"/>
<title>
    
        List of All 180 Poems  | 


In [6]:
type(soup)

bs4.BeautifulSoup

(Bonus: [what other parsers are there and what is the difference between them?](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers))

In [7]:
# Page title (the text shown on the browser tab)
soup.head.title.text

'\n    \n        List of All 180 Poems \xa0|\xa0\n    \n        Poetry 180 \xa0|\xa0\n    \n        Poet Laureate Projects \xa0|\xa0\n    \n        Poet Laureate \xa0|\xa0\n    \n        Poetry & Literature \xa0|\xa0\n    \n        Programs \xa0|\xa0\n    \n        Library of Congress \n    \n    '

In [26]:
# Can use chained .find() method: (header,div,li)
soup.find('div').find('li').a['href']

'/discover/'

In [35]:
# Get every href on the page
for link in soup.find_all('a'):
    print(link.get('href'))

#skip-to-content
/
/discover/
/services-and-programs/
/visit/
/education/
/connect/
/about/
https://ask.loc.gov/
/help/
/contact/
https://catalog.loc.gov/
http://copyright.gov/
https://www.congress.gov/
https://www.loc.gov
https://www.loc.gov/programs/
https://www.loc.gov/programs/poetry-and-literature/
https://www.loc.gov/programs/poetry-and-literature/poet-laureate/
https://www.loc.gov/programs/poetry-and-literature/poet-laureate/poet-laureate-projects/
https://www.loc.gov/programs/poetry-and-literature/poet-laureate/poet-laureate-projects/poetry-180/
#
#
#
None
https://www.loc.gov/programs/poetry-and-literature/about-this-program/
https://www.loc.gov/programs/poetry-and-literature/poet-laureate/
https://www.loc.gov/programs/poetry-and-literature/national-ambassador-for-young-peoples-literature/
https://www.loc.gov/programs/poetry-and-literature/prizes/
https://www.loc.gov/programs/poetry-and-literature/audio-recordings/
https://www.loc.gov/programs/poetry-and-literature/featured-vid
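
The list above mixes absolute URLs, site-relative paths such as `/discover/`, in-page anchors (`#`), and `None` for `<a>` tags that have no `href`. A minimal sketch of one way to normalise such a list with the standard library's `urljoin`; the helper name `absolute_links` is hypothetical and not part of BeautifulSoup.

```python
from urllib.parse import urljoin

def absolute_links(hrefs, base_url):
    """Return absolute URLs, skipping missing hrefs and in-page anchors."""
    cleaned = []
    for href in hrefs:
        # link.get('href') yields None when the attribute is absent
        if not href or href.startswith('#'):
            continue
        # urljoin leaves absolute URLs alone and resolves relative ones
        cleaned.append(urljoin(base_url, href))
    return cleaned

base = 'https://www.loc.gov/programs/poetry-and-literature/'
sample = ['#skip-to-content', '/discover/', None, 'https://ask.loc.gov/']
print(absolute_links(sample, base))
# ['https://www.loc.gov/discover/', 'https://ask.loc.gov/']
```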

In [49]:
# Locate the poem title and link in the page
# (found via the browser inspector: class "item-description-title")
soup.find(class_='item-description-title').a.text

# The same element can also be selected by tag name and attrs:
# soup.find('span', attrs={'class': 'item-description-title'})

'\n        \n\n        \n            Poem 001:\n        \n\n        \n            \n                \n                    \n                    Introduction to Poetry\n                    \n                \n            \n        \n\n        \n        '

In [54]:
one_title = soup.find(class_='item-description-title').a.text
one_title

'\n        \n\n        \n            Poem 001:\n        \n\n        \n            \n                \n                    \n                    Introduction to Poetry\n                    \n                \n            \n        \n\n        \n        '

In [58]:
one_title.split('\n')[11].strip()

'Introduction to Poetry'
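
Indexing with `split('\n')[11]` depends on the exact number of blank lines in the markup, so it breaks as soon as the page's whitespace changes. A sketch of a more tolerant alternative, assuming every title keeps the `Poem NNN:` prefix seen above; `clean_title` is a hypothetical helper name.

```python
def clean_title(raw):
    """Collapse whitespace and drop a leading 'Poem NNN:' prefix if present."""
    collapsed = ' '.join(raw.split())            # e.g. 'Poem 001: Introduction to Poetry'
    prefix, sep, title = collapsed.partition(':')
    if sep and prefix.startswith('Poem'):
        return title.strip()
    return collapsed                             # no recognised prefix: keep the whole text

raw = '\n        \n\n            Poem 001:\n        \n\n                    Introduction to Poetry\n        '
print(clean_title(raw))  # Introduction to Poetry
```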

In [59]:
# Use find_all to collect every title on the page
all_poem_titles = soup.find_all(class_='item-description-title')
titles = []
for title in all_poem_titles:
    # element 11 of the split text holds the title (see one_title above)
    titles.append(title.a.text.split('\n')[11].strip())

In [60]:
titles

['Introduction to Poetry',
 'The Good Life',
 'Abecedarian Requiring Further Examination of Anglikan Seraphym Subjugation of a Wild Indian Rezervation',
 'Question',
 'Thanks',
 'How Bright It Is',
 '"Do You Have Any Advice For Those of Us Just Starting Out?"',
 'Numbers',
 'The Cord',
 'At the Un-National Monument Along the Canadian Border',
 'Can We Touch Your Hair?',
 'The Bat',
 'Did I Miss Anything?',
 'Neglect',
 'The Poet',
 'Radio',
 'Bad Day',
 'The Farewell',
 'The Partial Explanation',
 'Wife',
 'Wheels',
 'Remora, Remora',
 'Tour',
 'After Us',
 'Domestic Work, 1937',
 'Before She Died',
 'Poetry',
 'American Cheese',
 'Advice from the Experts',
 'One Morning',
 'Walking Home',
 'Rabbits and Fire',
 'The Meadow',
 'The Summer I Was Sixteen',
 'Hand Shadows',
 'El Florida Room',
 "She Didn't Mean to Do It",
 'Cartoon Physics, part 1',
 'Snow',
 'Driving to Town Late to Mail a Letter',
 'Halloween',
 'The Poetry of Bad Weather',
 'The Green One Over There',
 'A Man I Knew',
 

In [99]:
lst1= [1,2,3]
lst2 =['a','b','c']
list(zip(lst1,lst2))

[(1, 'a'), (2, 'b'), (3, 'c')]

In [100]:
import pandas as pd
pd.DataFrame(list(zip(lst1,lst2)),columns=['num','alpha'])

|   | num | alpha |
| ------------- | ------------- | ------------- |
| 0 | 1 | a |
| 1 | 2 | b |
| 2 | 3 | c |


In [None]:
# to find one poem link 
soup.find('div', attrs={'class' : 'item-description'}).a['href']

In [69]:
# Extract all poem links
links = []
poemlinks = soup.find_all('span', attrs={'class' : 'item-description-title'})
for link in poemlinks:
    links.append(link.a['href'])
links

['https://www.loc.gov/programs/poetry-and-literature/poet-laureate/poet-laureate-projects/poetry-180/all-poems/item/poetry-180-001/introduction-to-poetry/',
 'https://www.loc.gov/programs/poetry-and-literature/poet-laureate/poet-laureate-projects/poetry-180/all-poems/item/poetry-180-002/the-good-life/',
 'https://www.loc.gov/programs/poetry-and-literature/poet-laureate/poet-laureate-projects/poetry-180/all-poems/item/poetry-180-003/abecedarian-requiring-further-examination-of-anglikan-seraphym-subjugation-of-a-wild-indian-rezervation/',
 'https://www.loc.gov/programs/poetry-and-literature/poet-laureate/poet-laureate-projects/poetry-180/all-poems/item/poetry-180-004/question/',
 'https://www.loc.gov/programs/poetry-and-literature/poet-laureate/poet-laureate-projects/poetry-180/all-poems/item/poetry-180-005/thanks/',
 'https://www.loc.gov/programs/poetry-and-literature/poet-laureate/poet-laureate-projects/poetry-180/all-poems/item/poetry-180-006/how-bright-it-is/',
 'https://www.loc.gov/
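
Collecting titles and links in two separate loops works as long as both lists keep the same length and order. As a sketch, both fields can be read from each `<span class="item-description-title">` in a single pass. The HTML below is a simplified, made-up stand-in for the real page, and the `example.org` URLs are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up markup that only mimics the span/a nesting used on the page
sample_html = """
<ul>
  <li><span class="item-description-title">
    <a href="https://example.org/poem-001/">Poem 001: First Title</a></span></li>
  <li><span class="item-description-title">
    <a href="https://example.org/poem-002/">Poem 002: Second Title</a></span></li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
records = []
for span in sample_soup.find_all('span', class_='item-description-title'):
    anchor = span.a
    if anchor is None:  # skip spans without a link instead of raising AttributeError
        continue
    records.append({'title': anchor.get_text(strip=True),
                    'link': anchor.get('href')})

print(records)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)`, which keeps each title next to its own link.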

In [68]:
# Put the info of title and link in dataframe
import pandas as pd
df = pd.DataFrame({'title':titles,'links':links})
df

|   | title | links |
| ------------- | ------------- | ------------- |
| 0 | Introduction to Poetry | https://www.loc.gov/programs/poetry-and-litera... |
| 1 | The Good Life | https://www.loc.gov/programs/poetry-and-litera... |
| 2 | Abecedarian Requiring Further Examination of A... | https://www.loc.gov/programs/poetry-and-litera... |
| 3 | Question | https://www.loc.gov/programs/poetry-and-litera... |
| 4 | Thanks | https://www.loc.gov/programs/poetry-and-litera... |
| ... | ... | ... |
| 175 | How to Change a Frog Into a Prince | https://www.loc.gov/programs/poetry-and-litera... |
| 176 | Eagle Plain | https://www.loc.gov/programs/poetry-and-litera... |
| 177 | End of April | https://www.loc.gov/programs/poetry-and-litera... |
| 178 | Bike Ride with Older Boys | https://www.loc.gov/programs/poetry-and-litera... |


In [75]:
# Extract the text of one poem (naming the parser avoids bs4's "no parser specified" warning)
lyric1 = BeautifulSoup(requests.get(df.links[2], headers=headers).text, 'html.parser')
lyric1.find('div',attrs={'class':'poem'}).pre.text

'Angels don’t come to the reservation.\r\nBats, maybe, or owls, boxy mottled things.\r\nCoyotes, too. They all mean the same thing—\r\ndeath. And death\r\neats angels, I guess, because I haven’t seen an angel\r\nfly through this valley ever.\r\nGabriel? Never heard of him. Know a guy named Gabe though—\r\nhe came through here one powwow and stayed, typical\r\nIndian. Sure he had wings,\r\njailbird that he was. He flies around in stolen cars. Wherever he stops,\r\nkids grow like gourds from women’s bellies.\r\nLike I said, no Indian I’ve ever heard of has ever been or seen an angel.\r\nMaybe in a Christmas pageant or something—\r\nNazarene church holds one every December,\r\norganized by Pastor John’s wife. It’s no wonder\r\nPastor John’s son is the angel—everyone knows angels are white.\r\nQuit bothering with angels, I say. They’re no good for Indians.\r\nRemember what happened last time\r\nsome white god came floating across the ocean?\r\nTruth is, there may be angels, but if there ar

In [81]:
df['poem_lyric'] = 'blank'
df

|   | title | links | poem_lyric |
| ------------- | ------------- | ------------- | ------------- |
| 0 | Introduction to Poetry | https://www.loc.gov/programs/poetry-and-litera... | blank |
| 1 | The Good Life | https://www.loc.gov/programs/poetry-and-litera... | blank |
| 2 | Abecedarian Requiring Further Examination of A... | https://www.loc.gov/programs/poetry-and-litera... | blank |
| 3 | Question | https://www.loc.gov/programs/poetry-and-litera... | blank |
| 4 | Thanks | https://www.loc.gov/programs/poetry-and-litera... | blank |
| ... | ... | ... | ... |
| 175 | How to Change a Frog Into a Prince | https://www.loc.gov/programs/poetry-and-litera... | blank |
| 176 | Eagle Plain | https://www.loc.gov/programs/poetry-and-litera... | blank |
| 177 | End of April | https://www.loc.gov/programs/poetry-and-litera... | blank |
| 178 | Bike Ride with Older Boys | https://www.loc.gov/programs/poetry-and-litera... | blank |


In [84]:
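import time
df['poem_lyric'] = 'blank'
# Fetch only the first five poems, pausing between requests
for i, link in enumerate(df['links'][:5]):
    lyric = BeautifulSoup(requests.get(link, headers=headers).text, 'html.parser')
    # .loc writes into df directly; df['poem_lyric'][i] = ... is chained assignment,
    # which pandas warns about and which may not update df at all
    df.loc[i, 'poem_lyric'] = lyric.find(class_='poem').pre.text
    time.sleep(3)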

In [85]:
df

|   | title | links | poem_lyric |
| ------------- | ------------- | ------------- | ------------- |
| 0 | Introduction to Poetry | https://www.loc.gov/programs/poetry-and-litera... | I ask them to take a poem\r\nand hold it up to... |
| 1 | The Good Life | https://www.loc.gov/programs/poetry-and-litera... | When some people talk about moneyThey speak as... |
| 2 | Abecedarian Requiring Further Examination of A... | https://www.loc.gov/programs/poetry-and-litera... | Angels don’t come to the reservation.\r\nBats,... |
| 3 | Question | https://www.loc.gov/programs/poetry-and-litera... | Body my house\r\nmy horse my hound\r\nwhat wil... |
| 4 | Thanks | https://www.loc.gov/programs/poetry-and-litera... | Thanks for the tree\r\nbetween me & a sniper’s... |
| ... | ... | ... | ... |
| 175 | How to Change a Frog Into a Prince | https://www.loc.gov/programs/poetry-and-litera... | blank |
| 176 | Eagle Plain | https://www.loc.gov/programs/poetry-and-litera... | blank |
| 177 | End of April | https://www.loc.gov/programs/poetry-and-litera... | blank |
| 178 | Bike Ride with Older Boys | https://www.loc.gov/programs/poetry-and-litera... | blank |
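
In row 1 the lines run together (`moneyThey`), while the other poems keep their `\r\n` breaks. One plausible cause, which is an assumption and has not been checked against the page, is that this poem separates its lines with `<br>` tags, which `.text` drops without a trace. A sketch on a made-up snippet showing how `get_text()` with a separator keeps such breaks:

```python
from bs4 import BeautifulSoup

# Made-up markup: lines separated by <br/> instead of newline characters
snippet = '<div class="poem"><pre>first line<br/>second line<br/>third line</pre></div>'
pre = BeautifulSoup(snippet, 'html.parser').find(class_='poem').pre

joined = pre.text                # <br> tags contribute no text, so lines fuse
separated = pre.get_text('\n')   # put a newline between the text fragments

print(repr(joined))      # 'first linesecond linethird line'
print(repr(separated))   # 'first line\nsecond line\nthird line'
```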


In [86]:
# Optionally save each poem as a text file
import os
import time
os.makedirs('poem', exist_ok=True)  # create the output folder once
for i in range(len(df.links[:2])):
    lyric = BeautifulSoup(requests.get(df.links[i], headers=headers).text, 'html.parser')
    time.sleep(3)
    filename = f"poem/{df.title[i]}.txt"
    # the with-block closes the file on exit, so no explicit f.close() is needed
    with open(filename,'w') as f:
        f.write(lyric.find(class_='poem').pre.text)
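
The poem title goes straight into the file name here, which can fail or behave unexpectedly for titles containing characters such as `?`, `"` or `/` (for example `'Can We Touch Your Hair?'` on Windows). A minimal sketch of a sanitising helper; `safe_filename` is a hypothetical name, and keeping only letters, digits, underscores, hyphens and spaces is one arbitrary choice among many.

```python
import re

def safe_filename(title, fallback='untitled'):
    """Turn a poem title into a conservative file name (without extension)."""
    kept = re.sub(r'[^\w\- ]', '', title)   # drop punctuation and path separators
    name = '_'.join(kept.split())           # collapse whitespace runs into underscores
    return name or fallback                 # never return an empty name

print(safe_filename('Can We Touch Your Hair?'))   # Can_We_Touch_Your_Hair
print(safe_filename('Domestic Work, 1937'))       # Domestic_Work_1937
```

Two different titles can still map to the same name, so a scraper that must not overwrite files would need an extra check.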

In [None]:
# Read from saved txt files

In [88]:
lines = []
for i in range(len(df.links[:2])):
    filename = f"poem/{df.title[i]}.txt"
    # the with-block closes the file on exit, so no explicit f.close() is needed
    with open(filename, "r") as f:
        lines.append(f.read())

In [89]:
lines

["I ask them to take a poem\nand hold it up to the light\nlike a color slide\n                  \nor press an ear against its hive.\n                \nI say drop a mouse into a poem\nand watch him probe his way out,\nor walk inside the poem's room\nand feel the walls for a light switch.\n                  \nI want them to waterski\nacross the surface of a poem\nwaving at the author's name on the shore.\n                 \nBut all they want to do\nis tie the poem to a chair with rope\nand torture a confession out of it.\n                 \nThey begin beating it with a hose\nto find out what it really means.",
 'When some people talk about moneyThey speak as if it were a mysterious loverWho went out to buy milk and neverCame back, and it makes me nostalgicFor the years I lived on coffee and bread,Hungry all the time, walking to work on paydayLike a woman journeying for waterFrom a village without a well, then livingOne or two nights like everyone elseOn roast chicken and red wine.']

#### Variations in extracting links

In [90]:
html = """<a class="px-2 py-1 default-link list-group-item-action d-block" href="https://lyrics.az/the-weeknd/-/live-for-freestyle.html" title='The Weeknd - "Live For" Freestyle lyrics'>"Live For" Freestyle</a>"""
test = BeautifulSoup(html)
test
#test.find('a',attrs={'class':'px-2'})['href']

<html><body><a class="px-2 py-1 default-link list-group-item-action d-block" href="https://lyrics.az/the-weeknd/-/live-for-freestyle.html" title='The Weeknd - "Live For" Freestyle lyrics'>"Live For" Freestyle</a></body></html>

In [96]:
test.a['href']

'https://lyrics.az/the-weeknd/-/live-for-freestyle.html'

In [97]:
test.find('a',attrs={'class':'px-2'})['href']

'https://lyrics.az/the-weeknd/-/live-for-freestyle.html'

In [98]:
html1 = """<td class="tal qx"><strong><a href="/lyric/36119617/The+Weeknd/Or+Nah">Or Nah</a></strong></td>"""
test1 = BeautifulSoup(html1)
test1.find('td',attrs={'class':'tal qx'}).a['href']

'/lyric/36119617/The+Weeknd/Or+Nah'
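
The two lookups above match classes differently. A single class name such as `'px-2'` matches any tag that carries that class among several, whereas a multi-class string such as `'tal qx'` is compared against the attribute value as written, so the order of the classes matters. A sketch on made-up markup (with the classes deliberately swapped), including a CSS selector as an order-independent alternative:

```python
from bs4 import BeautifulSoup

# Made-up cell whose classes appear as "qx tal" rather than "tal qx"
markup = '<table><tr><td class="qx tal"><a href="/lyric/1/Example">Example</a></td></tr></table>'
cell_soup = BeautifulSoup(markup, 'html.parser')

by_one_class = cell_soup.find('td', class_='tal')                  # matches: 'tal' is one of the classes
by_exact_string = cell_soup.find('td', attrs={'class': 'tal qx'})  # None: the string order differs
by_selector = cell_soup.select_one('td.tal.qx a')                  # matches regardless of class order

print(by_one_class is not None, by_exact_string is None, by_selector['href'])
```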

### Task 2: curriculum extraction - Exercise
From the [SPICED Academy Data Science Page](https://www.spiced-academy.com/en/program/data-science), extract the text about the curriculum and store it in a table:

| Topic  | Description |
| ------------- | ------------- |
| Data Analysis in Python  | Become fluent in using Python to collect...  |
| Machine Learning  | Delve into the world of Supervised and Unsupervised...  |

#### Get the HTML text from the website
https://www.spiced-academy.com/en/program/data-science

In [None]:
html = BeautifulSoup(requests.get('https://www.spiced-academy.com/en/program/data-science').text, 'html.parser')

In [None]:
headings = []
desc = []
for heading in html.find_all(class_='description'):
    headings.append(heading.h3.text)
    desc.append(heading.find(class_='mob-hidden').text)

In [None]:
pd.DataFrame({'heading':headings,'desc':desc})

## Avoiding Getting Blocked or Being a Bad Scraper

### Customising your requests
* Add a user agent (search the web for "what's my user agent" to find your browser's string)
* Pause between requests, for example with `time.sleep()`

In [None]:
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}

In [None]:
http_request = requests.get(url,headers=headers)

In [None]:
import time
for i in range(10):
    time.sleep(1)
    http_request = requests.get(url,headers=headers)
    #do something with the text
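
The loop above sleeps for a fixed second before every request. A sketch of the same idea as a reusable helper with a randomised pause; `fetch_politely` and its parameters are hypothetical names, and the `fetch` argument stands in for whatever actually downloads a page. The demo uses a stand-in function so that it runs without network access.

```python
import random
import time

def fetch_politely(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results = {}
    for n, url in enumerate(urls):
        if n > 0:  # no pause needed before the very first request
            sleep(random.uniform(min_delay, max_delay))
        results[url] = fetch(url)
    return results

# Demo: a fake fetch function, and pauses recorded instead of actually slept
pauses = []
pages = fetch_politely(['page-a', 'page-b', 'page-c'],
                       fetch=lambda u: f'<html>{u}</html>',
                       sleep=pauses.append)
print(len(pages), len(pauses))  # 3 2
```

With the objects defined in this notebook, `fetch` could be something like `lambda u: requests.get(u, headers=headers).text`.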

## Next steps
* Pick a lyrics website: lyrics.com, lyrics.az, azlyrics.com, or one of your own choosing
* Set a user agent and, optionally, a delay between requests
* Inspect the site and find a page that links to all of an artist's songs
* Request that page and check the response you get back
* Save that page to disk
* If you get blocked: wait a bit and try again, or try another IP address, proxy routing, etc.
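
The last point, waiting and trying again, can be sketched as a retry loop with growing pauses. This is an illustration under stated assumptions: `get_with_backoff` is a hypothetical helper, `fetch` is any callable that returns an object with a `status_code` attribute (as `requests.get` does), and treating 429 and 5xx responses as retryable is a common convention, not something these sites document.

```python
import time
from types import SimpleNamespace

def get_with_backoff(fetch, url, attempts=4, base_delay=2.0, sleep=time.sleep):
    """Retry fetch(url) on 429/5xx responses, doubling the wait each time."""
    response = None
    for attempt in range(attempts):
        response = fetch(url)
        retryable = response.status_code == 429 or response.status_code >= 500
        if not retryable:
            return response
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # waits 2s, 4s, 8s, ...
    return response  # still failing after the last attempt: hand back the last response

# Demo with canned responses instead of real requests
replies = iter([SimpleNamespace(status_code=429),
                SimpleNamespace(status_code=503),
                SimpleNamespace(status_code=200)])
waits = []
result = get_with_backoff(lambda u: next(replies), 'https://example.org/', sleep=waits.append)
print(result.status_code, waits)  # 200 [2.0, 4.0]
```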

In [104]:
for i,link in enumerate(range(90,102)):
    print(i,link)

0 90
1 91
2 92
3 93
4 94
5 95
6 96
7 97
8 98
9 99
10 100
11 101


In [None]:
link = []