# Webscraping with Beautiful Soup

<h2>Content</h2>

Explain content

<h2>Caveat</h2>
Websites often change the format of their pages, so these examples may not always work. If one fails, examine the HTML content of the page and rework the example accordingly. Most browsers let you view the HTML source (look for a "page source" option), though you might first have to turn on the developer mode in your browser preferences. For example, on Chrome you need to click the "developer mode" check box under Extensions in the Preferences/Options menu.
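One way to soften the effect of layout changes is to check whether a tag was found before reading from it. The following is a minimal sketch under stated assumptions: the HTML snippet and the helper name `safe_text` are hypothetical and not part of the original examples, and Python's built-in `html.parser` is used so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page snippet; a real page will look different.
html = '<div class="quote"><span class="text">Hello</span></div>'
soup = BeautifulSoup(html, "html.parser")

def safe_text(parent, name, css_class, default=None):
    """Return the stripped text of the first matching tag, or `default` if none matches."""
    tag = parent.find(name, class_=css_class)
    return tag.get_text(strip=True) if tag is not None else default

quote_text = safe_text(soup, "span", "text")         # the tag exists
author = safe_text(soup, "small", "author", "n/a")   # the tag is missing, so the default is used
print(quote_text, author)
```

A missing tag then shows up as a default value instead of an `AttributeError`, which makes it easier to see which part of the page has changed.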

<h2>1. How to set up Soup</h2>
The code in this section serves as a template and will not run as is if executed. For executable code, see the examples in sections 2 and 3.

<h3>Import necessary modules</h3>

In [22]:
# import necessary libraries 
from bs4 import BeautifulSoup
import requests

<h3>Http request response cycle</h3>

In [24]:
# HTTP request/response cycle
# check whether the website can be reached
url = "https://quotes.toscrape.com/"  # replace with your own URL, e.g. "https://www.yourexampleurl.com/"
response = requests.get(url)
if response.status_code == 200:
    print("Success")
else:
    print("Failure")

Success
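As an alternative to comparing `status_code` with 200 by hand, `requests` responses offer `raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx codes. The sketch below is illustrative only: the helper name `check_response` is hypothetical, and the two responses are built by hand so that the example runs without network access.

```python
import requests

def check_response(response):
    """Return True if the response is usable, False if the server reported an error."""
    try:
        response.raise_for_status()  # raises requests.HTTPError for 4xx and 5xx codes
    except requests.HTTPError as err:
        print("Failure:", err)
        return False
    print("Success")
    return True

# Stand-in responses built by hand (assumption: no network access).
# In practice: response = requests.get(url, timeout=10)
ok_response = requests.Response()
ok_response.status_code = 200
missing_response = requests.Response()
missing_response.status_code = 404

ok_result = check_response(ok_response)
missing_result = check_response(missing_response)
```

Passing a `timeout` to `requests.get` is worth considering as well, since without one a request can wait indefinitely on an unresponsive server.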


<h3>Set up the BeautifulSoup object</h3>

In [27]:
# parse the page content and print the HTML structure
results_page = BeautifulSoup(response.content,'lxml')
print(results_page.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert

<h3>Useful Soup functions</h3>

In [28]:

# find_all
    # find_all finds all instances of a specified tag
    # returns a ResultSet (behaves like a list)
all_a_tags = results_page.find_all('a')
print(type(all_a_tags))

# find
    # find finds the first instance of a specified tag
    # returns a bs4 element, or None if nothing matches
div_tag = results_page.find('div')

# nested application of Soup functions
    # Soup functions can also be called on the elements they return
div_tag.find('a')



# Both find and find_all can be narrowed by tag attributes such as the CSS class
    # using attribute=value
    # using a dictionary

        # With the keyword form, use 'class_' rather than 'class' (class is a reserved word in Python)
        # We get a list back because find_all returns a list
        # 'item' and 'css-class-title' are placeholders; replace them with a tag and class from your page
results_page.find_all('item',class_='css-class-title')


        # With the dictionary form the key is a string, so the reserved word is not a problem
        # We get a single element back because find returns an element
results_page.find('item',{'class':'css-class-title'})

# get_text
    # get_text() returns the text content enclosed in a tag, without the markup
    # returns a string
    # with the placeholders above, find returns None and this line raises an AttributeError
results_page.find('item',{'class':'css-class-title'}).get_text()


# get
    # get returns the value of a tag attribute (e.g., href)
    # returns a string
item_tag = results_page.find('item',{'class':'css-class-title'})
item_link = item_tag.find('a')
print("a tag:",item_link)
link_url = item_link.get('href')
print("link url:",link_url)
print(type(link_url))

<class 'bs4.element.ResultSet'>


AttributeError: 'NoneType' object has no attribute 'get_text'
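The template cell stops with an `AttributeError` because the placeholder tag `item` does not exist on the page, so `find` returns `None`. For comparison, here is a small runnable sketch of the same functions. It works on a hypothetical HTML fragment (the quote text, author name, and link are invented for illustration) and uses the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, loosely modelled on the quote markup printed earlier
html = """
<div class="quote">
  <span class="text">An example quote.</span>
  <small class="author">Jane Doe</small>
  <a href="/author/Jane-Doe">(about)</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

all_a_tags = soup.find_all("a")                   # ResultSet, behaves like a list
quote_div = soup.find("div", class_="quote")      # keyword form, note the underscore
same_div = soup.find("div", {"class": "quote"})   # dictionary form, same result
author = quote_div.find("small", class_="author").get_text()
link_url = quote_div.find("a").get("href")        # attribute value as a string
missing = soup.find("item", class_="css-class-title")  # no such tag, so find returns None

print(len(all_a_tags), author, link_url, missing)
```

Calling `.get_text()` on `missing` would reproduce the error shown in the output of the template cell.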

In [None]:
/html/body/div/div[2]/div[1]/div[1]/div/a[1]

## 2. Scraping Example 1 - Scraping from Quotes to Scrape

In this example we systematically collect data from https://quotes.toscrape.com/. The site is provided by Scrapinghub and is dedicated to web scraping tutorials.

2.1 First we connect to the website and parse its content

In [29]:
# import necessary libraries 
from bs4 import BeautifulSoup
import requests 

# BeautifulSoup http response cycle 
url = "https://quotes.toscrape.com/"
response = requests.get(url)
if response.status_code == 200:
    print("Success")
else:
    print("Failure")

# parse the page content and print the HTML structure
results_page = BeautifulSoup(response.content,'lxml')
print(results_page.prettify())

Success
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/autho

2.2 We try to find text, author, and tags of a quote

In [41]:
first_quote_text = results_page.find('span',class_='text').get_text()
first_quote_author = results_page.find('small',class_='author').get_text()
first_quote_tags = results_page.find('div',class_='tags').get_text() # naive approach

print("Text: "+first_quote_text)
print("Author: "+first_quote_author)
print("Tags: "+first_quote_tags) # what went wrong here?



Text: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Author: Albert Einstein
Tags: 
            Tags:
            
change
deep-thoughts
thinking
world



In [5]:
# solution attempt
print("tags: "+first_quote_tags.replace("\n", ", ")) # what went wrong here?

tags: ,             Tags:,             , change, deep-thoughts, thinking, world, 


In [43]:
# solution 
    # recursive application of Soup functions combined with a for loop 
first_quote_tags = results_page.find('div',class_='tags') 

tags_list = []

for tags in first_quote_tags.find_all('a'):
    tags_list.append(tags.get_text())

print("Text: "+first_quote_text)
print("Author: "+first_quote_author)
print("Tags: "+str(tags_list)) # storing the tags in a list makes them iterable

Text: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Author: Albert Einstein
Tags: ['change', 'deep-thoughts', 'thinking', 'world']
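The same loop can be written as a list comprehension, and `str.join` produces a clean comma-separated line. This is only a sketch: the fragment below is hypothetical, and its `class="tag"` and `href` attributes are assumptions, since the page output printed earlier is cut off before the tags block.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like a tags block (attribute names are assumptions)
html = """
<div class="tags">
  Tags:
  <a class="tag" href="/tag/change/">change</a>
  <a class="tag" href="/tag/world/">world</a>
</div>
"""
tags_div = BeautifulSoup(html, "html.parser").find("div", class_="tags")

# collect the text of every link inside the div, ignoring the "Tags:" label
tags_list = [a.get_text(strip=True) for a in tags_div.find_all("a")]
tags_line = ", ".join(tags_list)
print("Tags: " + tags_line)
```

Selecting the `a` elements first avoids the stray label and whitespace that `get_text()` on the whole `div` returns.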


2.3 We find the authors of all quotes and store them as a JSON file

In [45]:
import json 
import pandas as pd
from pandas import DataFrame

# create a list to store all authors
author_list = []

# find all authors listed on the webpage
authors = results_page.find_all('small',class_='author')

# iterate through results and append author list 
for author in authors:
    author_list.append(author.get_text())

# create a dataframe
df = DataFrame(author_list, columns=['authors'])


# save data frame as json
#df.to_json(r'C:\Users\danie\OneDrive\Dokumente\PhD\04 - Lehre\01 - UI DS\Code\02-Scraping\authors.json')

# this is what the data looks like as JSON with one entry per row
# (note: to_json without orient= groups the values by column instead)
data = df.to_json(orient='index')
print(data)

{"0":{"authors":"Albert Einstein"},"1":{"authors":"J.K. Rowling"},"2":{"authors":"Albert Einstein"},"3":{"authors":"Jane Austen"},"4":{"authors":"Marilyn Monroe"},"5":{"authors":"Albert Einstein"},"6":{"authors":"Andr\u00e9 Gide"},"7":{"authors":"Thomas A. Edison"},"8":{"authors":"Eleanor Roosevelt"},"9":{"authors":"Steve Martin"}}


In [44]:
df

             authors
0    Albert Einstein
1       J.K. Rowling
2    Albert Einstein
3        Jane Austen
4     Marilyn Monroe
5    Albert Einstein
6         André Gide
7   Thomas A. Edison
8  Eleanor Roosevelt
9       Steve Martin
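The cells above collect only the authors. To keep each quote's text paired with its author, one option is to iterate over the enclosing `div` elements and search inside each one. The sketch below assumes the `quote`, `text`, and `author` class names from the markup printed in 2.1; the two-quote HTML string and its contents are hypothetical.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical two-quote page; class names follow the markup printed in 2.1
html = """
<div class="quote">
  <span class="text">First example quote.</span>
  <small class="author">Author One</small>
</div>
<div class="quote">
  <span class="text">Second example quote.</span>
  <small class="author">Author Two</small>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# one dictionary per quote, so text and author stay together
records = []
for quote in soup.find_all("div", class_="quote"):
    records.append({
        "text": quote.find("span", class_="text").get_text(strip=True),
        "author": quote.find("small", class_="author").get_text(strip=True),
    })

# ensure_ascii=False keeps characters such as "é" readable in the output
json_text = json.dumps(records, ensure_ascii=False, indent=2)
print(json_text)
```

Searching within each `div` rather than across the whole page means a quote with a missing field cannot shift the remaining authors onto the wrong texts.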


2.4 We write everything into a function 

In [18]:
def get_author_list():

    # import necessary libraries 
    from bs4 import BeautifulSoup
    import requests 

    # BeautifulSoup http response cycle 
    url = "https://quotes.toscrape.com/"
    response = requests.get(url)
    if response.status_code == 200:
        print("Success")
    else:
        print("Failure")

    # parse the page content
    results_page = BeautifulSoup(response.content,'lxml')
    

    # create a list to store all authors
    author_list = []

    # find all authors listed on the webpage
    authors = results_page.find_all('small',class_='author')

    # iterate through results and append author list 
    for author in authors:
        author_list.append(author.get_text())

    print("authors have been extracted successfully and stored as a json file")
    return author_list

In [19]:
author_list = get_author_list()

df = DataFrame (author_list,columns=['authors'])
df.to_json(r'C:\Users\danie\OneDrive\Dokumente\PhD\04 - Lehre\01 - UI DS\Code\02-Scraping\authors.json')



Success
authors have been extracted successfully and stored as a json file


['Albert Einstein',
 'J.K. Rowling',
 'Albert Einstein',
 'Jane Austen',
 'Marilyn Monroe',
 'Albert Einstein',
 'André Gide',
 'Thomas A. Edison',
 'Eleanor Roosevelt',
 'Steve Martin']
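A possible refinement is to separate parsing from downloading and saving, so that each step can be tried out on its own. In this sketch the function names `extract_authors` and `save_authors`, the sample HTML, and the temporary output directory are all hypothetical choices, not part of the original notebook.

```python
import json
import tempfile
from pathlib import Path
from bs4 import BeautifulSoup

def extract_authors(html):
    """Return the author names found in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all("small", class_="author")]

def save_authors(authors, path):
    """Write the author list to `path` as JSON and return the path."""
    path = Path(path)
    path.write_text(json.dumps({"authors": authors}, ensure_ascii=False), encoding="utf-8")
    return path

# Hypothetical input; with a live page this would be response.content
sample_html = '<small class="author">Author One</small><small class="author">Author Two</small>'
authors = extract_authors(sample_html)

# write to a temporary directory so the sketch does not depend on a machine-specific path
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = save_authors(authors, Path(tmp_dir) / "authors.json")
    saved = json.loads(out_path.read_text(encoding="utf-8"))

print(saved)
```

Taking the output path as a parameter keeps absolute, machine-specific paths out of the function body.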

## 3. Scraping Example 2 - Scraping a real live page 

In [46]:
import requests 


url = "https://api.llama.fi/protocols"
response = requests.get(url)

print(response)


<Response [200]>


In [50]:
response.text

'[{"id":"1","name":"Uniswap","address":"0x1f9840a85d5af5bf1d1762f925bdaddc4201f984","symbol":"UNI","url":"https://info.uniswap.org/","description":"A fully decentralized protocol for automated liquidity provision on Ethereum.\\r\\n","chain":"Ethereum","logo":null,"audits":"2","audit_note":null,"gecko_id":"uniswap","cmcId":"7083","category":"Dexes","chains":["Ethereum"],"module":"uniswap/index.js","twitter":"Uniswap","audit_links":["https://github.com/Uniswap/uniswap-v3-core/tree/main/audits","https://github.com/Uniswap/uniswap-v3-periphery/tree/main/audits","https://github.com/ConsenSys/Uniswap-audit-report-2018-12"],"oracles":["Uniswap"],"slug":"uniswap","tvl":4641707124.474894,"chainTvls":{"Ethereum":4641707124.474894},"change_1h":0.5385022367304941,"change_1d":-0.5998220016819431,"change_7d":2.2099088878963897,"fdv":26137303088,"mcap":13587670116},{"id":"2","name":"WBTC","address":"0x2260fac5e5542a773aa44fbcfedf7c193bc2c599","symbol":"WBTC","url":"https://wbtc.network/","description
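
Because this endpoint returns JSON rather than HTML, the text can be parsed directly without BeautifulSoup. The sketch below uses a shortened stand-in for `response.text` so that it runs offline; the field names and values are copied from the first record in the output above, and the choice of fields to print is arbitrary.

```python
import json

# Shortened stand-in for response.text (fields copied from the first record above)
sample_text = '[{"id":"1","name":"Uniswap","symbol":"UNI","category":"Dexes","tvl":4641707124.474894}]'

# with a live response, response.json() does the same parsing
protocols = json.loads(sample_text)

# pick a few fields from each record
rows = [(p["name"], p["symbol"], p["category"], p["tvl"]) for p in protocols]
for name, symbol, category, tvl in rows:
    print(f"{name} ({symbol}), {category}: TVL {tvl:,.0f}")
```

The resulting list of dictionaries can also be passed to `pandas.DataFrame` if a table is more convenient.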