# Scraping [Quotes to Scrape](http://quotes.toscrape.com) using Python and BeautifulSoup

![](https://i.imgur.com/3on9fB7.png)

## Web Scraping
Web scraping is the process of collecting and parsing raw data from a website and storing it in a structured form such as a database. For this project, the data will be saved in a CSV file.

## Quotes to Scrape (quotes.toscrape.com)
Quotes to Scrape is a website containing pages of quotes from some of the most influential people in human history. As the name suggests, the website is designed so that information is easy to scrape from it. So, like many beginners before me, I am using quotes.toscrape.com as a testing ground to sharpen my web scraping skills. My target is to scrape the text, the author, the URL of the author's page, and the tags assigned to each quote.

## Task
To create a function that scrapes data from the 'quotes.toscrape.com' website.
### Objectives
- To scrape quotes from each desired page on the website.
- To scrape the text from each quote per page.
- To scrape the author's name from each quote per page.
- To scrape the tags from each quote per page.
- To scrape the author's URL and the URLs of each quote's tags per page.

## Outline of the project
1. Understanding the structure of the Quotes to Scrape website
2. Installing and importing the required libraries
3. Downloading a page and locating the HTML tags of its different elements using BeautifulSoup
4. Using those tags to access the different elements of the website and extract the required text
5. Storing the extracted data in the quote dictionary
6. Using pandas to create a quote DataFrame, from which a CSV file will be created

[Quotes to Scrape Website](http://quotes.toscrape.com)
- The Quotes to Scrape website (code):
![](https://i.imgur.com/Q8MfFh5.jpg)
- The desired end product of this scraping project:
![](https://i.imgur.com/7VJQ44U.jpg)

## Examining the website in Python with BeautifulSoup
- Start by importing the requests and BeautifulSoup libraries.
- Then create a variable holding **'http://quotes.toscrape.com/'**, the base of every page URL on the website.
- Use the get() function from requests to get a response object containing the data from the web page.

In [None]:
import requests
from bs4 import BeautifulSoup

#Getting the web page that holds the quotes (page 2 of the website)
response = requests.get('http://quotes.toscrape.com/page/2/')

#Checking if the request was successful
if response.status_code != 200:
    print('Status code:', response.status_code)
    raise Exception('Failed to fetch web page ' + response.url)
# Successful status codes range from 200 to 299; this page returns exactly 200
print('Status Code:', response.status_code)

Status Code: 200
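
The check above accepts only 200, while the comment notes that any code from 200 to 299 signals success. A minimal sketch of that wider check follows; `is_success` is a hypothetical helper name and not part of requests.

```python
def is_success(status_code):
    """Return True for any 2xx HTTP status code (hypothetical helper)."""
    return 200 <= status_code <= 299

# 200 is the code returned for the quotes page above; 404 means "not found"
print(is_success(200))
print(is_success(404))
```

With requests, `response.ok` offers a similar shortcut, although it is True for any status code below 400, so it is slightly more permissive than the 2xx range.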


In [None]:
#Writing the page contents into an html file (utf-8 so the curly quotes are saved correctly)
with open('scrapping quotes.html','w', encoding='utf-8') as f:
    f.write(response.text)
# Converting the page to a Beautiful Soup document using html.parser
doc = BeautifulSoup(response.text,'html.parser')

![](https://i.imgur.com/KvsoBMi.jpg)

Defining a function that downloads a given page with requests, checks response.status_code to confirm the request succeeded, and returns the page as a BeautifulSoup object.

In [None]:
def get_page_data(page):
    #Getting the web page with the data for the quotes as a Beautiful Soup document
    page_repos_url = 'http://quotes.toscrape.com/page/' + str(page) + '/'
    response = requests.get(page_repos_url)
    #Checking if the request was successful
    if response.status_code != 200:
        print('Status code:', response.status_code)
        raise Exception('Failed to fetch web page ' + page_repos_url)
    doc = BeautifulSoup(response.text,'html.parser')
    return doc

## Extracting the elements of the website
The different elements are:
* quote text
* author's name
* author's url
* tags of the quote
* url of the tags of the quote

In [None]:
#After inspecting the elements on the website, we use the 'find_all' function to collect the following tags
t_tags = doc.find_all('span', class_= 'text')
a_tags = doc.find_all('small', class_= 'author')
u_c_tags = doc.find_all('div', itemtype = 'http://schema.org/CreativeWork')
q_tags = doc.find_all('div', class_= 'tags')
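
The four `find_all` calls above produce four parallel lists that are later matched up by index. As an alternative sketch (not the approach used in the rest of this notebook), each quote block can be visited once and all of its fields read from it. The HTML below is a hand-written stand-in that assumes the same tag and class names as the site; `SAMPLE_HTML` and `extract_quotes` are illustrative names.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates the structure of one quote block (assumption)
SAMPLE_HTML = '''
<div class="quote" itemtype="http://schema.org/CreativeWork">
  <span class="text">"A sample quote."</span>
  <span>by <small class="author">Sample Author</small>
  <a href="/author/Sample-Author">(about)</a></span>
  <div class="tags">Tags:
    <a class="tag" href="/tag/sample/page/1/">sample</a>
    <a class="tag" href="/tag/demo/page/1/">demo</a>
  </div>
</div>
'''

def extract_quotes(doc):
    """Return one dictionary per quote block found in a BeautifulSoup document."""
    quotes = []
    for block in doc.find_all('div', itemtype='http://schema.org/CreativeWork'):
        tag_links = block.find_all('a', class_='tag')
        quotes.append({
            'quote_text': block.find('span', class_='text').text,
            'author_name': block.find('small', class_='author').text,
            # The first 'a' tag in the block links to the author's page
            'author_href': block.find('a')['href'],
            'tags': [a.text for a in tag_links],
            'tag_hrefs': [a['href'] for a in tag_links],
        })
    return quotes

sample_doc = BeautifulSoup(SAMPLE_HTML, 'html.parser')
for quote in extract_quotes(sample_doc):
    print(quote)
```

Keeping all fields of a quote in one dictionary means a quote with no tags simply gets empty lists, and nothing can drift out of alignment between lists.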

In [None]:
#To get the text from t_tags we index into the list returned by find_all
q_text = t_tags[2].text
#q_text contains the text of the 3rd quote on the 2nd page of the quotes website.
q_text

"“If you can't explain it to a six year old, you don't understand it yourself.”"

In [None]:
#To get the author's name from a_tags we index into the list returned by find_all
a_name = a_tags[2].text
#a_name contains the name of the author of the 3rd quote on the 2nd page of the quotes website.
a_name

'Albert Einstein'

In [None]:
#From inspection of the website we see that the author's url is in the first 'a' tag of the quote's 'div' tag
u_tags = u_c_tags[2].find_all('a')

#Using indexing and the 'href' attribute we get the url for the author of the
#3rd quote on the 2nd page of the website.
a_url = 'http://quotes.toscrape.com/' + (u_tags[0]['href'])
a_url

'http://quotes.toscrape.com//author/Albert-Einstein'
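
The result above contains a double slash after the domain because the base ends with '/' and the 'href' value starts with '/'. As an optional alternative, here is a sketch using `urljoin` from the standard library, which resolves the path against the base; the 'href' value is copied from the output above.

```python
from urllib.parse import urljoin

base_url = 'http://quotes.toscrape.com/'
href = '/author/Albert-Einstein'  # the 'href' value behind the output above

# Plain concatenation keeps both slashes
print(base_url + href)
# urljoin resolves the path against the base and yields a single slash
print(urljoin(base_url, href))
```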

In [None]:
#To get the tags of the quote from q_tags we index into the list and split the text into words
t_quote = q_tags[2].text.split()

#To clean the result we use the list method 'pop' to remove the first element of t_quote,
#since for every quote that has tags it is the label 'Tags:' and we don't want it in our results
t_quote.pop(0)
t_quote

['simplicity', 'understand']
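
`pop(0)` assumes the list is non-empty and raises an `IndexError` on an empty list, which could happen for a quote without tags. A small sketch that removes the 'Tags:' label only when it is present; `strip_tags_label` is a hypothetical helper name.

```python
def strip_tags_label(words):
    """Return the words without a leading 'Tags:' label (hypothetical helper)."""
    if words and words[0] == 'Tags:':
        return words[1:]
    return list(words)

print(strip_tags_label(['Tags:', 'simplicity', 'understand']))
print(strip_tags_label([]))
```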

In [None]:
#From inspection of the website we see that the urls for the tags of a quote are in the 'a' tags
#with class 'tag' inside the quote's 'div' tag.
raw_t_q_url = u_c_tags[2].find_all('a',class_='tag')

#Since most quotes have more than one tag, we use a for-loop to extract the urls
#for the tags of the 3rd quote on the 2nd page of quotes.toscrape.com.
for i in raw_t_q_url:
    #To build the full url we read the 'href' attribute of each element of raw_t_q_url
    i_t_q_url = 'http://quotes.toscrape.com/' + i['href']
    print(i_t_q_url)

http://quotes.toscrape.com//tag/simplicity/page/1/
http://quotes.toscrape.com//tag/understand/page/1/


## Creating functions to automate scraping all quotes on a desired page
### Our sample page will be *'page 3'*

In [None]:
sample_page = get_page_data(3)

In [None]:
def get_t_repositotries(doc):
    #Scraping the required tags per quote in "doc"
    t_tags = doc.find_all('span', class_= 'text')
    a_tags = doc.find_all('small', class_= 'author')
    u_c_tags = doc.find_all('div', itemtype = 'http://schema.org/CreativeWork')
    q_tags = doc.find_all('div', class_= 'tags')
    return t_tags, a_tags, u_c_tags, q_tags

In [None]:
repos3 = get_t_repositotries(sample_page)

In [None]:
def get_t_for_elements(repo):
    #Get the quote's text and the author's name for every quote on the page
    #The quote's text tags and the author's name tags are the 1st and 2nd elements of repo respectively
    length = len(repo[0])
    q_text = []
    a_name = []
    for i in range(length):
        raw_q_text = repo[0][i].text
        raw_a_name = repo[1][i].text
        q_text.append(raw_q_text)
        a_name.append(raw_a_name)
    return q_text, a_name

In [None]:
text_elements = get_t_for_elements(repos3)

In [None]:
text_elements[0]

['“I love you without knowing how, or when, or from where. I love you simply, without problems or pride: I love you in this way because I do not know any other way of loving but this, in which there is no I or you, so intimate that your hand upon my chest is my hand, so intimate that when I fall asleep your eyes close.”',
 '“For every minute you are angry you lose sixty seconds of happiness.”',
 '“If you judge people, you have no time to love them.”',
 '“Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.”',
 '“Beauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.”',
 '“Today you are You, that is truer than true. There is no one alive who is Youer than You.”',
 '“If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales.”',
 '“It is impossible to live without fai

In [None]:
text_elements[1]

['Pablo Neruda',
 'Ralph Waldo Emerson',
 'Mother Teresa',
 'Garrison Keillor',
 'Jim Henson',
 'Dr. Seuss',
 'Albert Einstein',
 'J.K. Rowling',
 'Albert Einstein',
 'Bob Marley']

In [None]:
#We need to create a variable base_url which will contain the base of every quotes.toscrape url
base_url = 'http://quotes.toscrape.com/'
def get_a_url(repo):
    #Getting the url for the author of each quote on the page
    #The url for each author is part of the 3rd element of repo
    #The list is created inside the function so repeated calls do not accumulate results
    a_url = []
    for i in repo[2]:
        #To extract the url we take the first 'a' tag of each element of repo[2]
        u_tags = i.find_all('a')
        raw_a_url = base_url + u_tags[0]['href']
        a_url.append(raw_a_url)
    return a_url

In [None]:
A_url_element = get_a_url(repos3)
A_url_element

['http://quotes.toscrape.com//author/Pablo-Neruda',
 'http://quotes.toscrape.com//author/Ralph-Waldo-Emerson',
 'http://quotes.toscrape.com//author/Mother-Teresa',
 'http://quotes.toscrape.com//author/Garrison-Keillor',
 'http://quotes.toscrape.com//author/Jim-Henson',
 'http://quotes.toscrape.com//author/Dr-Seuss',
 'http://quotes.toscrape.com//author/Albert-Einstein',
 'http://quotes.toscrape.com//author/J-K-Rowling',
 'http://quotes.toscrape.com//author/Albert-Einstein',
 'http://quotes.toscrape.com//author/Bob-Marley']

In [None]:
def get_ts_of_quote(repo):
    #The tags of each quote are in the 4th element of repo
    #Each quote can have several tags, so 't_quote' holds one list of words per quote
    length = len(repo[3])
    t_quote = []
    for i in range(length):
        raw_t_quote = repo[3][i].text.split()
        t_quote.append(raw_t_quote)
    return t_quote

In [None]:
ts_of_quote = get_ts_of_quote(repos3)
ts_of_quote

[['Tags:', 'love', 'poetry'],
 ['Tags:', 'happiness'],
 ['Tags:', 'attributed-no-source'],
 ['Tags:', 'humor', 'religion'],
 ['Tags:', 'humor'],
 ['Tags:', 'comedy', 'life', 'yourself'],
 ['Tags:', 'children', 'fairy-tales'],
 [],
 ['Tags:', 'imagination'],
 ['Tags:', 'music']]

In [None]:
def get_ts_of_quote_url(repo):
    #The url for each tag of a quote is part of the 3rd element of repo
    #Each quote can have several tags, so 'ts_of_quote_url' holds one list of urls per quote
    ts_of_quote_url = []
    for i in repo[2]:
        #To extract the urls we take the 'a' tags with the class 'tag' from each element of repo[2]
        raw_ts_of_quote_url = i.find_all('a',class_='tag')
        i_ts_of_quote_url = []
        for z in raw_ts_of_quote_url:
            #To build the full url we read the 'href' attribute of each element of raw_ts_of_quote_url
            individual_ts_of_quote_url = base_url + z['href']
            i_ts_of_quote_url.append(individual_ts_of_quote_url)
        ts_of_quote_url.append(i_ts_of_quote_url)
    return ts_of_quote_url

In [None]:
ts_of_quote_url = get_ts_of_quote_url(repos3)
ts_of_quote_url

[['http://quotes.toscrape.com//tag/love/page/1/',
  'http://quotes.toscrape.com//tag/poetry/page/1/'],
 ['http://quotes.toscrape.com//tag/happiness/page/1/'],
 ['http://quotes.toscrape.com//tag/attributed-no-source/page/1/'],
 ['http://quotes.toscrape.com//tag/humor/page/1/',
  'http://quotes.toscrape.com//tag/religion/page/1/'],
 ['http://quotes.toscrape.com//tag/humor/page/1/'],
 ['http://quotes.toscrape.com//tag/comedy/page/1/',
  'http://quotes.toscrape.com//tag/life/page/1/',
  'http://quotes.toscrape.com//tag/yourself/page/1/'],
 ['http://quotes.toscrape.com//tag/children/page/1/',
  'http://quotes.toscrape.com//tag/fairy-tales/page/1/'],
 [],
 ['http://quotes.toscrape.com//tag/imagination/page/1/'],
 ['http://quotes.toscrape.com//tag/music/page/1/']]

## Scraping multiple pages
We will write code to scrape any number of pages of the quotes.toscrape.com website, with the output being a CSV file.

In [None]:
b = 5
p = list(range(1,b+1,1))
print (p)

[1, 2, 3, 4, 5]
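
The same range can be turned directly into the list of page URLs to visit. A minimal sketch, assuming the '/page/N/' URL pattern used by `get_page_data`; `build_page_urls` is an illustrative name.

```python
def build_page_urls(number_of_pages):
    """Return the URLs of pages 1..number_of_pages (illustrative helper)."""
    return ['http://quotes.toscrape.com/page/' + str(page) + '/'
            for page in range(1, number_of_pages + 1)]

for url in build_page_urls(3):
    print(url)
```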


In [None]:
import requests
from bs4 import BeautifulSoup
base_url = 'http://quotes.toscrape.com/'
# quote_dict is the dictionary from which the csv file is going to be created
quote_dict = {'quote_text' : [], 'author_name' : [], 'author_url' :[], 'tags_of_quote' : [], 'tags_of_quote_url' : []}

def scrape_quotes(number):
    '''Get the quote's text, author's name, author's url, tags and the url of each tag for every quote
    on pages 1 to `number` of quotes.toscrape.com and write them to a CSV file'''
    # number stands for the number of pages, starting from page 1, to be scraped into the csv file
    if isinstance(number, int):
        #Since scrape_quotes only takes inputs of the type 'int'
        if 1 <= number <= 10:
            #Since quotes.toscrape.com has a maximum of 10 pages
            # We use the range function to create the page numbers from 1 to number in steps of 1
            page_list = list(range(1,number+1,1))
            for page in page_list:
                #Creating Page_doc which will be a Beautiful Soup document
                Page_doc = get_page_data(page)
                #Tag_repositories will contain the tags from which we will get our desired data
                Tag_repositories = get_tag_repositotries(Page_doc)
                #get_author_url adds the url for the author of each quote to quote_dict
                get_author_url(Tag_repositories)
                #get_tags_of_quote_url adds the urls of the tags of each quote to quote_dict
                get_tags_of_quote_url(Tag_repositories)
                #get_tags_for_elements adds the author's name and quote's text of each quote to quote_dict
                get_tags_for_elements(Tag_repositories)
                #get_tags_of_quote_ adds the tags of each quote to quote_dict
                get_tags_of_quote_(Tag_repositories)
            #After the loop, quote_dict holds every scraped page; the file is named after the last page
            path = str(page) + '.csv'
            write_csv(quote_dict, path)
            print('The data for quotes on page "{}" of quotes.toscrape.com written to file "{}"'.format(page, path))
        else:
            print('The input {} is out of range; it must be between 1 and 10.'.format(number))
    else:
        print('The input must be an integer.')

def get_page_data(page):
    #Getting the web page with the data for the quotes as a Beautiful Soup document
    page_repos_url = 'http://quotes.toscrape.com/page/' + str(page) + '/'
    response = requests.get(page_repos_url)
    #Checking if the request was successful
    if response.status_code != 200:
        print('Status code:', response.status_code)
        raise Exception('Failed to fetch web page ' + page_repos_url)
    doc = BeautifulSoup(response.text,'html.parser')
    return doc

def get_tag_repositotries(doc):
    #Scraping the tags per element of a quote in "doc"
    text_tags = doc.find_all('span', class_= 'text')
    author_tags = doc.find_all('small', class_= 'author')
    url_contents_tags = doc.find_all('div', itemtype = 'http://schema.org/CreativeWork')
    quote_tags = doc.find_all('div', class_= 'tags')
    return text_tags, author_tags, url_contents_tags, quote_tags

def get_author_url(repo):
    #Getting the url for the author of each quote in order to add it to the dictionary
    #The url for each author is part of the 3rd element of Tag_repositories
    for i in repo[2]:
        #To extract the url we take the first 'a' tag of each element of Tag_repositories[2]
        url_tags = i.find_all('a')
        author_url = base_url + url_tags[0]['href']
        quote_dict['author_url'].append(author_url)
    return quote_dict

def get_tags_of_quote_url(repo):
    #Getting the urls of the tags attached to a quote in order to add them to the dictionary
    #One list of urls is created per quote and added to the dict
    #The url for each tag per quote is part of the 3rd element of Tag_repositories
    for i in repo[2]:
        #To extract the urls we take the 'a' tags with the class 'tag' from each element of Tag_repositories[2]
        raw_tags_of_quote_url = i.find_all('a',class_='tag')
        tags_of_quote_url = []
        for z in raw_tags_of_quote_url:
            #To build the full url we read the 'href' attribute of each element of raw_tags_of_quote_url
            individual_tags_of_quote_url = base_url + z['href']
            tags_of_quote_url.append(individual_tags_of_quote_url)
        quote_dict['tags_of_quote_url'].append(tags_of_quote_url)
    return quote_dict

def get_tags_for_elements(repo):
    #Get the quote's text and the author's name and add them to the dictionary
    #The quote's text tags and the author's name tags are the 1st and 2nd elements of Tag_repositories respectively
    length = len(repo[0])
    for i in range(length):
        quote_text = repo[0][i].text
        author_name = repo[1][i].text
        quote_dict['quote_text'].append(quote_text)
        quote_dict['author_name'].append(author_name)
    return quote_dict

def get_tags_of_quote_(repo):
    #Get the list of each quote's tags and add it to the dictionary
    #The tags of each quote are in the 4th element of Tag_repositories
    length = len(repo[3])
    for i in range(length):
        individual_tags_of_quote = repo[3][i].text.split()
        quote_dict['tags_of_quote'].append(individual_tags_of_quote)
    return quote_dict

def write_csv(items, path):
    #Creates a csv file from a dictionary of equal-length lists
    import pandas as pd
    if isinstance(items, dict):
        #Nothing is written if the dictionary is empty
        if len(items) == 0:
            return
        #Creates a df from the dict and writes it to path without the index column
        df = pd.DataFrame(items)
        df.to_csv(path, index=False)
    else:
        print('The 1st argument must be a dictionary')

In [None]:
scrape_quotes(4)

The data for quotes on page "4" of quotes.toscrape.com written to file "4.csv"
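
For comparison, here is a sketch of the same final step without pandas, using `csv.DictWriter` from the standard library. It assumes one dictionary per quote rather than the column-oriented `quote_dict`; the two rows are copied from the page 3 outputs earlier in this notebook, and `rows_to_csv` is a hypothetical helper that writes to any file-like object.

```python
import csv
import io

FIELDS = ['quote_text', 'author_name', 'tags_of_quote']

def rows_to_csv(rows, file_obj):
    """Write a list of per-quote dictionaries to file_obj as CSV (hypothetical helper)."""
    writer = csv.DictWriter(file_obj, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        # Join the list of tags into one cell so the CSV stays flat
        flat = dict(row, tags_of_quote=' '.join(row['tags_of_quote']))
        writer.writerow(flat)

rows = [
    {'quote_text': '“For every minute you are angry you lose sixty seconds of happiness.”',
     'author_name': 'Ralph Waldo Emerson', 'tags_of_quote': ['happiness']},
    {'quote_text': '“If you judge people, you have no time to love them.”',
     'author_name': 'Mother Teresa', 'tags_of_quote': ['attributed-no-source']},
]

buffer = io.StringIO()  # stands in for open(path, 'w', newline='', encoding='utf-8')
rows_to_csv(rows, buffer)
print(buffer.getvalue())
```

Writing row by row keeps memory use small and removes the pandas dependency, at the cost of losing the DataFrame for later analysis.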


In [None]:
import pandas as pd
quote_df = pd.DataFrame(quote_dict)
quote_df

| | quote_text | author_name | author_url | tags_of_quote | tags_of_quote_url |
| --- | --- | --- | --- | --- | --- |
| 0 | “The world as we have created it is a process ... | Albert Einstein | http://quotes.toscrape.com//author/Albert-Eins... | [Tags:, change, deep-thoughts, thinking, world] | [http://quotes.toscrape.com//tag/change/page/1... |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | http://quotes.toscrape.com//author/J-K-Rowling | [Tags:, abilities, choices] | [http://quotes.toscrape.com//tag/abilities/pag... |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | http://quotes.toscrape.com//author/Albert-Eins... | [Tags:, inspirational, life, live, miracle, mi... | [http://quotes.toscrape.com//tag/inspirational... |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | http://quotes.toscrape.com//author/Jane-Austen | [Tags:, aliteracy, books, classic, humor] | [http://quotes.toscrape.com//tag/aliteracy/pag... |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | http://quotes.toscrape.com//author/Marilyn-Monroe | [Tags:, be-yourself, inspirational] | [http://quotes.toscrape.com//tag/be-yourself/p... |
| 5 | “Try not to become a man of success. Rather be... | Albert Einstein | http://quotes.toscrape.com//author/Albert-Eins... | [Tags:, adulthood, success, value] | [http://quotes.toscrape.com//tag/adulthood/pag... |
| 6 | “It is better to be hated for what you are tha... | André Gide | http://quotes.toscrape.com//author/Andre-Gide | [Tags:, life, love] | [http://quotes.toscrape.com//tag/life/page/1/,... |
| 7 | “I have not failed. I've just found 10,000 way... | Thomas A. Edison | http://quotes.toscrape.com//author/Thomas-A-Ed... | [Tags:, edison, failure, inspirational, paraph... | [http://quotes.toscrape.com//tag/edison/page/1... |
| 8 | “A woman is like a tea bag; you never know how... | Eleanor Roosevelt | http://quotes.toscrape.com//author/Eleanor-Roo... | [Tags:, misattributed-eleanor-roosevelt] | [http://quotes.toscrape.com//tag/misattributed... |
| 9 | “A day without sunshine is like, you know, nig... | Steve Martin | http://quotes.toscrape.com//author/Steve-Martin | [Tags:, humor, obvious, simile] | [http://quotes.toscrape.com//tag/humor/page/1/... |


The csv file should look like:
- [Sample result 'csv file'](https://drive.google.com/file/d/13AQ75u6YeFOV7zSsywqNPg3HE5WahAuB/view?usp=sharing)
- ![](https://i.imgur.com/7VJQ44U.jpg) 

## Summary
* The scraping was done using the Python libraries Requests, to download the pages, and BeautifulSoup, to extract the data.
* Five fields were scraped from each quote: 'quote_text', 'author_name', 'author_url', 'tags_of_quote' and 'tags_of_quote_url'.
* Scraping all 10 pages of the website will produce a CSV file with 5 columns and 101 rows (100 quotes plus the header row).

## Future work
- Organising the extracted data according to the most used tags.
- Code optimization
- Improving the documentation part of the project
- Working on a dynamic website with a similar structure.
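
The first item in this list could start from a simple tag count. A sketch using `collections.Counter`, with the tag lists taken from the page 3 output shown earlier (the 'Tags:' label removed); how the final project would rank or group quotes is left open.

```python
from collections import Counter

# Tag lists for the quotes on page 3, copied from the earlier output without the 'Tags:' label
tags_per_quote = [
    ['love', 'poetry'],
    ['happiness'],
    ['attributed-no-source'],
    ['humor', 'religion'],
    ['humor'],
    ['comedy', 'life', 'yourself'],
    ['children', 'fairy-tales'],
    [],
    ['imagination'],
    ['music'],
]

# Count every tag across all quotes
tag_counts = Counter(tag for tags in tags_per_quote for tag in tags)

# The most frequent tag and its count
print(tag_counts.most_common(1))
```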

## References
- [Quotes.toScrape.com home page](http://quotes.toscrape.com)
- [Quotes.toScrape.com page 3](http://quotes.toscrape.com/page/3/)
- [Let's Build a Python Web Scraping Project from Scratch - Hands-On Tutorial](https://www.youtube.com/watch?v=RKsLLG-bzEY)
- [Web Scraping and REST APIs - Jovian](https://share.descript.com/view/ge5yco930I5)
