# Ready-To-Use Web Scraping Notebook with BeautifulSoup and Python


## Ex.1 : Extracting data inside of HTML tags

We start by importing the requests library and making a simple GET request to the URL we choose, storing the result in a variable called r:

(We will only extract the quotes that come with a photo. As you can see on the website, 8 of them are on the main page.)

In [1]:
import requests
url = 'https://www.brainyquote.com/topics/motivational-quotes'
r = requests.get(url)
print(r.content)

b'\n<!DOCTYPE html>\n<html lang="en">\n<head>\n<title>Motivational Quotes - BrainyQuote</title>\n<meta name="robots" content="all">\n<meta charset="utf-8" />\n<meta name="viewport" content="width=device-width, initial-scale=1.0, viewport-fit=cover"><meta name="description" content="Explore 285 Motivational Quotes by authors including Winston Churchill, Confucius, and Helen Keller at BrainyQuote.">\n<meta name="googlebot" content="NOODP">\n<meta property="ver" content="13.3.11:5238611">\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="msapplication-config" content="none">\n<meta name="apple-mobile-web-app-capable" content="yes">\n<meta property="ts" content="1676364178">\n<meta property="og:site_name" content="BrainyQuote">\n<meta property="og:title" content="Motivational Quotes - BrainyQuote">\n<meta property="og:type" content="article">\n<meta property="og:description" content="Explore 285 Motivational Quotes by authors including Winston Churchill, Confucius, and He

Now we parse the data with BeautifulSoup, using html5lib as the parser (html5lib is a separate package that has to be installed). We also create the quotes list that we will use later:

(We need to pass two values into BeautifulSoup():

1: the HTML from the website; here 'r.content'

2: which HTML parser to use; here 'html5lib')
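
As a minimal sketch of those two arguments, the cell below parses a small hand-written HTML string instead of a live page. The markup and the variable names are made up for illustration, and it uses Python's built-in 'html.parser' so that it runs without html5lib; parsers can differ slightly in how they repair malformed HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for r.content (not taken from the real site)
sample_html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# First argument: the HTML to parse; second argument: the parser name
demo_soup = BeautifulSoup(sample_html, "html.parser")

page_title = demo_soup.title.string        # text inside <title>
first_paragraph = demo_soup.p.get_text()   # text inside the first <p>
print(page_title, "|", first_paragraph)
```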

In [2]:
from bs4 import BeautifulSoup
import csv
soup = BeautifulSoup(r.content, 'html5lib')

quotes = []

At this point, go to the site you are scraping, open the DevTools (F12), and go to the Elements tab. We are looking for the outermost element that contains all the quotes, here the div whose id is quotesList:

In [3]:
table = soup.find('div', attrs = {'id':'quotesList'})

In [None]:
print(table.prettify())
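
The lookup above can also be tried offline. In this sketch the HTML fragment is made up (only the id 'quotesList' and the class 'qti-listm' come from the notebook). It also shows that find() returns None when nothing matches, which can happen if the site serves a different page to scripts, so it may be worth checking the result before calling prettify() on it.

```python
from bs4 import BeautifulSoup

# Made-up fragment; only the id and class names are taken from the notebook
fragment = '<div id="quotesList"><div class="qti-listm">first</div></div>'
demo_soup = BeautifulSoup(fragment, "html.parser")

# Same lookup as above: one <div> whose id attribute is 'quotesList'
found = demo_soup.find("div", attrs={"id": "quotesList"})

# An id that does not exist in the fragment: find() returns None
missing = demo_soup.find("div", attrs={"id": "doesNotExist"})

if missing is None:
    print("No such container; check the id or the HTML that was returned")
print(found.get_text())
```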

We want the text stored in the alt attribute of each img tag (this way we only get the quotes that have an image). For each row we create a quote dictionary and store this data in it:

In [5]:
for row in table.find_all('div', attrs = {'class':'qti-listm'}):
  quote = {}
  try:
    # The alt text has the form "<quote> - <author>"
    Author = quote['author'] = row.img['alt'].split("-")[1]
    Text = quote['text'] = row.img['alt'].split("-")[0]
    quotes.append(quote)
    print(quote)

  except TypeError:
    # row.img is None when the row has no image: skip this row
    continue

# Don't run this cell more than once, or the same quotes will be appended to the list again

{'author': ' Winston Churchill', 'text': "If you're going through hell, keep going. "}
{'author': ' Jim Rohn', 'text': 'Either you run the day or the day runs you. '}
{'author': ' George S. Patton', 'text': 'A good plan violently executed now is better than a perfect plan executed next week. '}
{'author': ' A. P. J. Abdul Kalam', 'text': 'We should not give up and we should not allow the problem to defeat us. '}
{'author': ' Walter Elliot', 'text': 'Perseverance is not a long race; it is many short races one after the other. '}
{'author': ' Zig Ziglar', 'text': 'What you get by achieving your goals is not as important as what you become by achieving your goals. '}
{'author': ' Les Brown', 'text': 'You are never too old to set another goal or to dream a new dream. '}
{'author': ' Horace', 'text': "Don't think, just do. "}


As you can see, the body of the loop is wrapped in a 'try' statement. If one of the rows does not have the data we are looking for, no error is raised and the loop moves on to the next row. The results are also split at '-': as you saw earlier, the text and the author name are separated by a '-', so we use it to separate the two.
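
One caveat with splitting at every '-': a quote that itself contains a hyphen would be cut in the wrong place. A possible alternative, sketched here on a made-up alt string, is str.rpartition, which splits only at the last occurrence of the separator. It assumes the alt text has the form 'quote - author'; an author name containing ' - ' would still be split wrongly.

```python
# Made-up alt text in the assumed "quote - author" form;
# the hyphen in "well-made" is what trips up a plain split("-")
alt_text = "A well-made plan beats luck. - Jane Doe"

# Plain split cuts at the first hyphen it finds
naive_text = alt_text.split("-")[0]

# rpartition splits at the LAST " - " only and returns (before, separator, after)
text, sep, author = alt_text.rpartition(" - ")

quote = {"author": author.strip(), "text": text.strip()}
print(repr(naive_text))
print(quote)
```

The strip() calls also remove the leading and trailing spaces that are visible in the output above.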

We save the data in a CSV file. For that, we first put the data in a DataFrame to organize it a bit:

In [6]:
import pandas as pd

In [7]:
# quotes is a list of dictionaries, so the keys 'author' and 'text' become the column names
df = pd.DataFrame(quotes)
print(df)

                  author                                               text
0      Winston Churchill         If you're going through hell, keep going. 
1               Jim Rohn       Either you run the day or the day runs you. 
2       George S. Patton  A good plan violently executed now is better t...
3   A. P. J. Abdul Kalam  We should not give up and we should not allow ...
4          Walter Elliot  Perseverance is not a long race; it is many sh...
5             Zig Ziglar  What you get by achieving your goals is not as...
6              Les Brown  You are never too old to set another goal or t...
7                 Horace                             Don't think, just do. 


In [8]:
df.to_csv("/content/sample_data/data.csv", index=False)  # We save the result in a csv file
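
The notebook imports csv but saves through pandas. As an alternative sketch, the standard-library csv.DictWriter can write a list of dictionaries directly. The two rows below are copied from the loop output earlier in this exercise (with the surrounding spaces removed), and the data is written to an in-memory buffer so that nothing on disk is touched.

```python
import csv
import io

# Two rows copied from the loop output of this exercise (whitespace stripped)
rows = [
    {"author": "Jim Rohn", "text": "Either you run the day or the day runs you."},
    {"author": "Horace", "text": "Don't think, just do."},
]

buffer = io.StringIO()  # stands in for a file opened with newline=''
writer = csv.DictWriter(buffer, fieldnames=["author", "text"])
writer.writeheader()     # header line with the column names
writer.writerows(rows)   # one CSV line per dictionary

csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, the buffer can be replaced by open('data.csv', 'w', newline='') used in a with block.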

## Ex.2 : Extracting raw text

This time we will extract all the quotes from the main page that do not have an image with them. Here the text is raw, so we need another method to get it.

In [9]:
import requests
import pandas as pd
url = 'https://www.brainyquote.com/topics/motivational-quotes'
r = requests.get(url)
print(r.content)

b'\n<!DOCTYPE html>\n<html lang="en">\n<head>\n<title>Motivational Quotes - BrainyQuote</title>\n<meta name="robots" content="all">\n<meta charset="utf-8" />\n<meta name="viewport" content="width=device-width, initial-scale=1.0, viewport-fit=cover"><meta name="description" content="Explore 285 Motivational Quotes by authors including Winston Churchill, Confucius, and Helen Keller at BrainyQuote.">\n<meta name="googlebot" content="NOODP">\n<meta property="ver" content="13.3.11:5238611">\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="msapplication-config" content="none">\n<meta name="apple-mobile-web-app-capable" content="yes">\n<meta property="ts" content="1676364726">\n<meta property="og:site_name" content="BrainyQuote">\n<meta property="og:title" content="Motivational Quotes - BrainyQuote">\n<meta property="og:type" content="article">\n<meta property="og:description" content="Explore 285 Motivational Quotes by authors including Winston Churchill, Confucius, and He

In [10]:
from bs4 import BeautifulSoup
import csv
soup = BeautifulSoup(r.content, 'html5lib')

quotes = []

In [11]:
table = soup.find('div', attrs = {'id':'quotesList'})

In [None]:
print(table.prettify())

In [None]:
# Print the text of every link (<a> tag) inside the quotes container
for link in table.find_all("a"):
  print(link.get_text())

It is complicated to sort this raw text into a DataFrame automatically, so you can fill a CSV file by hand, or use a pre-trained NLTK model to classify each piece of text, store the results in variables and then load them into a DataFrame.
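
If the link texts happen to alternate strictly between a quote and its author, a simpler option is to pair consecutive items. This is only a sketch: the strings below are placeholders reused from the output of Ex.1, and the real page may contain other links that break the alternation, so the assumption should be checked against the printed output first.

```python
# Link texts in the ASSUMED order quote, author, quote, author, ...
link_texts = [
    "Either you run the day or the day runs you.", "Jim Rohn",
    "Don't think, just do.", "Horace",
]

# Even positions hold quotes, odd positions hold authors; zip pairs them up
pairs = [
    {"text": text, "author": author}
    for text, author in zip(link_texts[0::2], link_texts[1::2])
]
print(pairs)
```

The resulting list of dictionaries has the same shape as the quotes list in Ex.1, so it can be passed to pd.DataFrame in the same way.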

## Ex.3 : Loop to fetch data through all the pages of a website

In [14]:
import requests
import bs4
base_url = 'http://quotes.toscrape.com/'
result = requests.get(base_url)
# result.text

In [15]:
soup = bs4.BeautifulSoup(result.text,'lxml')
# soup

In [16]:
soup.select('.author')

[<small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">J.K. Rowling</small>,
 <small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">Jane Austen</small>,
 <small class="author" itemprop="author">Marilyn Monroe</small>,
 <small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">André Gide</small>,
 <small class="author" itemprop="author">Thomas A. Edison</small>,
 <small class="author" itemprop="author">Eleanor Roosevelt</small>,
 <small class="author" itemprop="author">Steve Martin</small>]

In [17]:
for author in soup.select('.author'):
    print(author.text)

Albert Einstein
J.K. Rowling
Albert Einstein
Jane Austen
Marilyn Monroe
Albert Einstein
André Gide
Thomas A. Edison
Eleanor Roosevelt
Steve Martin


In [18]:
authors = set()

for author in soup.select('.author'):
    authors.add(author.text)

In [19]:
authors

{'Albert Einstein',
 'André Gide',
 'Eleanor Roosevelt',
 'J.K. Rowling',
 'Jane Austen',
 'Marilyn Monroe',
 'Steve Martin',
 'Thomas A. Edison'}

In [20]:
type(authors)

set
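
The set is what removes the repeated 'Albert Einstein' entries. A small offline sketch with the ten names printed for the first page shows the effect; sorted() is used only to get a stable order for display, since sets are unordered.

```python
# Author names as printed for the first page (with duplicates)
names = [
    "Albert Einstein", "J.K. Rowling", "Albert Einstein", "Jane Austen",
    "Marilyn Monroe", "Albert Einstein", "André Gide", "Thomas A. Edison",
    "Eleanor Roosevelt", "Steve Martin",
]

unique_names = set(names)             # duplicates collapse to one entry
ordered_names = sorted(unique_names)  # sort only for a predictable display order

print(len(names), "names,", len(unique_names), "unique")
```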

Create a list of all the quotes on the first page:

In [21]:
quotes = []

for quote in soup.select('.text'):
    quotes.append(quote.text)

In [22]:
quotes

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 '“Try not to become a man of success. Rather become a man of value.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“I have not failed. I've just found 10,000 ways that won't work.”",
 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 '“A day without sunshine is like, you know, night.”']

Inspect the site and use Beautiful Soup to extract the top ten tags shown on the top right of the home page (e.g. love, inspirational, life, etc.) from the response text:

In [23]:
soup.select('.tag-item')

[<span class="tag-item">
 <a class="tag" href="/tag/love/" style="font-size: 28px">love</a>
 </span>, <span class="tag-item">
 <a class="tag" href="/tag/inspirational/" style="font-size: 26px">inspirational</a>
 </span>, <span class="tag-item">
 <a class="tag" href="/tag/life/" style="font-size: 26px">life</a>
 </span>, <span class="tag-item">
 <a class="tag" href="/tag/humor/" style="font-size: 24px">humor</a>
 </span>, <span class="tag-item">
 <a class="tag" href="/tag/books/" style="font-size: 22px">books</a>
 </span>, <span class="tag-item">
 <a class="tag" href="/tag/reading/" style="font-size: 14px">reading</a>
 </span>, <span class="tag-item">
 <a class="tag" href="/tag/friendship/" style="font-size: 10px">friendship</a>
 </span>, <span class="tag-item">
 <a class="tag" href="/tag/friends/" style="font-size: 8px">friends</a>
 </span>, <span class="tag-item">
 <a class="tag" href="/tag/truth/" style="font-size: 8px">truth</a>
 </span>, <span class="tag-item">
 <a class="tag" href

In [24]:
for tag in soup.select('.tag-item'):
    print(tag.text)


love


inspirational


life


humor


books


reading


friendship


friends


truth


simile
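
The blank lines in this output come from the newlines around the a element inside each span. A possible way to drop them, sketched on two items copied from the soup.select('.tag-item') output above, is get_text(strip=True), or selecting the inner links directly.

```python
from bs4 import BeautifulSoup

# Two items copied from the select('.tag-item') output shown earlier
fragment = """
<span class="tag-item">
<a class="tag" href="/tag/love/" style="font-size: 28px">love</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/life/" style="font-size: 26px">life</a>
</span>
"""
demo_soup = BeautifulSoup(fragment, "html.parser")

# strip=True removes the whitespace around each piece of text
tags = [item.get_text(strip=True) for item in demo_soup.select(".tag-item")]

# Alternative: select the <a> elements nested inside the spans
tags_from_links = [a.text for a in demo_soup.select(".tag-item a.tag")]
print(tags, tags_from_links)
```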



Notice that there is more than one page, and subsequent pages look like this: http://quotes.toscrape.com/page/2/. Use what you know about for loops and string concatenation to loop through all the pages and get all the unique authors on the website:

In [25]:
base_url = 'http://quotes.toscrape.com/page/'

In [26]:
authors = set()

for i in range(1, 11):  # pages 1 to 10
    scrape_url = base_url + str(i)
    result = requests.get(scrape_url)
    soup = bs4.BeautifulSoup(result.text,'lxml')

    for author in soup.select('.author'):
        authors.add(author.text)

In [None]:
authors

In [None]:
# Request a page number that does not exist to see what the site returns in that case
scrape_url = base_url + str(999999)
result = requests.get(scrape_url)
soup = bs4.BeautifulSoup(result.text,'lxml')
soup

In [29]:
page = 1
authors = set()
base_url = 'http://quotes.toscrape.com/page/'

# Keep requesting pages until the site answers with its "No quotes found!" message
while True:
    scrape_url = base_url + str(page)
    result = requests.get(scrape_url)

    if 'No quotes found!' in result.text:
        break

    soup = bs4.BeautifulSoup(result.text,'lxml')

    for author in soup.select('.author'):
        authors.add(author.text)

    page += 1

In [30]:
authors

{'Albert Einstein',
 'Alexandre Dumas fils',
 'Alfred Tennyson',
 'Allen Saunders',
 'André Gide',
 'Ayn Rand',
 'Bob Marley',
 'C.S. Lewis',
 'Charles Bukowski',
 'Charles M. Schulz',
 'Douglas Adams',
 'Dr. Seuss',
 'E.E. Cummings',
 'Eleanor Roosevelt',
 'Elie Wiesel',
 'Ernest Hemingway',
 'Friedrich Nietzsche',
 'Garrison Keillor',
 'George Bernard Shaw',
 'George Carlin',
 'George Eliot',
 'George R.R. Martin',
 'Harper Lee',
 'Haruki Murakami',
 'Helen Keller',
 'J.D. Salinger',
 'J.K. Rowling',
 'J.M. Barrie',
 'J.R.R. Tolkien',
 'James Baldwin',
 'Jane Austen',
 'Jim Henson',
 'Jimi Hendrix',
 'John Lennon',
 'Jorge Luis Borges',
 'Khaled Hosseini',
 "Madeleine L'Engle",
 'Marilyn Monroe',
 'Mark Twain',
 'Martin Luther King Jr.',
 'Mother Teresa',
 'Pablo Neruda',
 'Ralph Waldo Emerson',
 'Stephenie Meyer',
 'Steve Martin',
 'Suzanne Collins',
 'Terry Pratchett',
 'Thomas A. Edison',
 'W.C. Fields',
 'William Nicholson'}
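
The stop condition of the while loop can be tried without network access by replacing requests.get with a stand-in. In this sketch, fake_pages and fetch_page are hypothetical helpers that imitate a three-page site; only the 'No quotes found!' marker is taken from the notebook. The max_pages cap is an added safety net (an assumption, not part of the original loop) in case the marker text ever changes.

```python
# Hypothetical stand-in for the site: page number -> author names on that page
fake_pages = {
    1: ["Albert Einstein", "J.K. Rowling"],
    2: ["Jane Austen", "Albert Einstein"],
    3: ["Mark Twain"],
}

def fetch_page(page):
    """Imitate result.text: author names, or the site's marker for an empty page."""
    names = fake_pages.get(page)
    return "\n".join(names) if names else "No quotes found!"

authors = set()
page = 1
max_pages = 100  # safety cap so the loop cannot run forever

while page <= max_pages:
    text = fetch_page(page)
    if "No quotes found!" in text:
        break  # we are past the last page
    authors.update(text.split("\n"))
    page += 1

print(page - 1, "pages read,", len(authors), "unique authors")
```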