# QUOTES SCRAPER
This workbook demonstrates **WEB SCRAPING** in Python.

Here we use the **https://quotes.toscrape.com** web page to scrape the data present and build dictionaries that help search quotes by author name, tag, etc.

In [243]:
website_url = 'https://quotes.toscrape.com'

In [244]:
import requests
import os

In [245]:
status = requests.get(website_url)
status.status_code

200

In [246]:
type(status.status_code)

int

A status code of 200 indicates that the request to the website succeeded.
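As a rough sketch of what that check amounts to, the snippet below classifies a status code with the standard library's `http.HTTPStatus`. The helper name `describe_status` is hypothetical and not part of this workbook; with `requests` itself, `Response.ok` or `Response.raise_for_status()` serve a similar purpose.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description of an HTTP status code (hypothetical helper)."""
    phrase = HTTPStatus(code).phrase          # e.g. 'OK' for 200
    # Codes in the 2xx range are the ones that signal success.
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {phrase}: {outcome}"

print(describe_status(200))
print(describe_status(404))
```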

In [247]:
website_text = status.text

**website_text** now contains the **HTML** code of the webpage.

In [248]:
print(website_text)

<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
    
    
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="auth

Now we save the **HTML** content above in a file to be used later on for extracting values and other tasks.

In [249]:
with open('quotestoscrape.txt','w',encoding='utf-8') as f:
    f.write(website_text)

In [250]:
from bs4 import BeautifulSoup

The **Beautiful Soup** library is used to parse HTML/XML data. First, the content of the **quotestoscrape.txt** file is read into a variable, which is then passed in to create the **soup** object.

In [251]:
quotes = ""

with open("quotestoscrape.txt","r",encoding='utf-8') as f:
    quotes+=f.read()

quotes

'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n    \n    \n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>\

NOTE: When reading a file in Python,
1. **readlines()** returns a list in which each element is one line of the file.
2. **read()** returns the entire content as a single string.
3. **readline()** returns the next line of the file each time it is called.
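The difference between the three methods can be seen on a small throwaway file. This is a minimal sketch: the file name `sample.txt` and its three lines are made up for illustration and are unrelated to the scraped page.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")   # hypothetical sample file
    with open(path, "w", encoding="utf-8") as f:
        f.write("first\nsecond\nthird\n")

    # read(): the whole file as one string
    with open(path, encoding="utf-8") as f:
        whole = f.read()

    # readlines(): a list with one element per line
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()

    # readline(): one line per call, moving forward each time
    with open(path, encoding="utf-8") as f:
        line_one = f.readline()
        line_two = f.readline()

print(repr(whole))
print(lines)
print(repr(line_one), repr(line_two))
```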

In [252]:
import lxml

Because of its speed, lxml's HTML parser is used here. Python's built-in **html.parser** can be used instead, at the cost of speed.

In [253]:
soup=BeautifulSoup(quotes,'lxml')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert

HTML parser error : error parsing attribute name
ss="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" 
                                                                               ^
HTML parser error : error parsing attribute name
         <meta class="keywords" itemprop="keywords" content="abilities,choices" 
                                                                               ^
HTML parser error : error parsing attribute name
eywords" itemprop="keywords" content="inspirational,life,live,miracle,miracles" 
                                                                               ^
HTML parser error : error parsing attribute name
ta class="keywords" itemprop="keywords" content="aliteracy,books,classic,humor" 
                                                                               ^
HTML parser error : error parsing attribute name
 <meta class="keywords" itemprop="keywords" content="be-yourself,inspirational" 
                          

### RETRIEVING QUOTES AND ASSOCIATED AUTHORS
Using Beautiful Soup, the content is scraped to obtain the quotes, the author of each quote, and the tags each quote belongs to.

In [255]:
# Map authors to quotes, tags to authors, and quotes to tags for the first page.
authors_with_quotes={}   # author name -> list of that author's quotes
author_pos={}            # position of the quote on the page -> author name
tags_with_authors={}     # tag -> authors of the quotes carrying that tag
quotes_to_tags={}        # quote text -> list of its tags

for pos,item in enumerate(soup.find_all('div',attrs={'class':'quote'})):

    quote=str(item.find('span',attrs={'class':'text'}).string)
    author=str(item.find('small',attrs={'class':'author'}).string)

    authors_with_quotes.setdefault(author,[]).append(quote)
    author_pos[pos]=author

    for tag_item in item.find_all('a',attrs={'class':'tag'}):
        tag_name=str(tag_item.string)

        # One entry is added per tagged quote, so an author with several
        # quotes under the same tag is listed more than once.
        tags_with_authors.setdefault(tag_name,[]).append(author)
        quotes_to_tags.setdefault(quote,[]).append(tag_name)
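The same fields can also be pulled out with CSS selectors through `select` and `select_one`. The sketch below runs on a small hand-written HTML snippet (an assumption that only imitates the structure of one quote block) and uses Python's built-in `html.parser`, so it does not depend on lxml. The names `sample_html`, `sample_soup` and `records` are illustrative.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates one quote block of the page (assumption).
sample_html = """
<div class="quote">
  <span class="text">“A sample quote.”</span>
  <span>by <small class="author">Sample Author</small></span>
  <div class="tags">
    <a class="tag" href="/tag/one/">one</a>
    <a class="tag" href="/tag/two/">two</a>
  </div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

records = []
for block in sample_soup.select("div.quote"):
    records.append({
        "quote": block.select_one("span.text").get_text(strip=True),
        "author": block.select_one("small.author").get_text(strip=True),
        "tags": [a.get_text(strip=True) for a in block.select("a.tag")],
    })

print(records)
```

Collecting one record per quote first keeps the parsing step separate from the step that builds the lookup dictionaries.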


In [256]:
def Find_Authors_With_Labels(tag_name):
    return tags_with_authors[tag_name]

In [257]:
tag='humor'
ans=Find_Authors_With_Labels(tag)

print(f"The following are authors with {tag} quotes\n",ans)

The following are authors with humor quotes
 ['Jane Austen', 'Steve Martin']
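If each author should appear only once per tag, one option is to store the authors in a set. The following is a minimal sketch on made-up records: `sample_records` and `authors_by_tag` are assumptions for illustration, not the data or names used in this workbook.

```python
from collections import defaultdict

# Made-up (author, tags) pairs standing in for scraped quotes.
sample_records = [
    ("Author A", ["life", "humor"]),
    ("Author B", ["life"]),
    ("Author A", ["life"]),      # same author, same tag again
]

authors_by_tag = defaultdict(set)
for author, tags in sample_records:
    for tag_label in tags:
        authors_by_tag[tag_label].add(author)   # a set ignores repeats

# Sets are unordered, so sort for a stable, readable result.
life_authors = sorted(authors_by_tag["life"])
print(life_authors)
```

The trade-off is that a set does not remember the order in which authors were first seen.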


### OBTAIN TOP TAGS
On the site there exists a set of most searched tags, which is retrieved here.

In [258]:
Top_searched_tags=[]

for item in soup.find(class_='tags-box').find_all('a'):
    Top_searched_tags.append(str(item.string))


print(f"There are {len(Top_searched_tags)} most searched tags and they are\n")
Top_searched_tags

There are 10 most searched tags and they are



['love',
 'inspirational',
 'life',
 'humor',
 'books',
 'reading',
 'friendship',
 'friends',
 'truth',
 'simile']

## SCRAPING MULTIPLE PAGES
Having scraped the first page, the same is repeated for the remaining pages of the website.

In [259]:
pages_of_website=[]

for page_num in range(2,11):
    page_link=website_url+'/page/'+str(page_num)+'/'
    pages_of_website.append(page_link)

In [260]:
pages_of_website

['https://quotes.toscrape.com/page/2/',
 'https://quotes.toscrape.com/page/3/',
 'https://quotes.toscrape.com/page/4/',
 'https://quotes.toscrape.com/page/5/',
 'https://quotes.toscrape.com/page/6/',
 'https://quotes.toscrape.com/page/7/',
 'https://quotes.toscrape.com/page/8/',
 'https://quotes.toscrape.com/page/9/',
 'https://quotes.toscrape.com/page/10/']
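The same list of links can also be built in a single expression with `urllib.parse.urljoin`, which takes care of joining the base URL and the path. This is only an alternative sketch; `base_url` and `page_links` are illustrative names, and the range 2 to 10 is copied from the loop above.

```python
from urllib.parse import urljoin

base_url = "https://quotes.toscrape.com"

# Pages 2 to 10, matching the hard-coded range used in this workbook.
page_links = [urljoin(base_url, f"/page/{page_num}/") for page_num in range(2, 11)]

print(page_links[0])
print(len(page_links))
```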

In [263]:
def Map_Quotes_to_Tags_Author_to_Quotes_and_Tags_to_Authors():
    # The first page is already processed above, so data is pulled from the 2nd page onwards.
    for website_page_link in pages_of_website:
        res=requests.get(website_page_link)

        # Pages that do not return status 200 are skipped.
        if res.status_code==200:
            soup=BeautifulSoup(res.text,'lxml')
            pos=len(author_pos)

            # Get authors and their quotes.
            for item in soup.find_all('div',attrs={'class':'quote'}):

                quote=str(item.find('span',attrs={'class':'text'}).string)
                author=str(item.find('small',attrs={'class':'author'}).string)

                authors_with_quotes.setdefault(author,[]).append(quote)

                # Record each author only once in author_pos.
                if author not in author_pos.values():
                    author_pos[pos]=author
                    pos+=1

                # Map tags to authors and quotes to tags.
                for tag_item in item.find_all('a',attrs={'class':'tag'}):
                    tag_name=str(tag_item.string)

                    # One entry is added per tagged quote, so an author with
                    # several quotes under the same tag is listed more than once.
                    tags_with_authors.setdefault(tag_name,[]).append(author)
                    quotes_to_tags.setdefault(quote,[]).append(tag_name)
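The loop above silently skips any page that does not return status 200. A possible variation, sketched here with a stand-in `fake_fetch` function instead of a real network call (both `fetch_pages` and `fake_fetch` are hypothetical names), keeps a list of the pages that failed so they can be retried or reported.

```python
def fetch_pages(urls, fetch):
    """Fetch every URL with `fetch` and separate successes from failures.

    `fetch` is any callable returning a (status_code, text) pair.
    """
    pages, failed = {}, []
    for url in urls:
        status_code, text = fetch(url)
        if status_code == 200:
            pages[url] = text
        else:
            failed.append((url, status_code))
    return pages, failed


def fake_fetch(url):
    """Stand-in for a network request: page 3 pretends to be missing."""
    if url.endswith("/page/3/"):
        return 404, ""
    return 200, f"<html>{url}</html>"


urls = [f"https://quotes.toscrape.com/page/{n}/" for n in (2, 3, 4)]
pages, failed = fetch_pages(urls, fake_fetch)
print(sorted(pages))
print(failed)
```

With `requests`, `fetch` could be a small wrapper around `requests.get` that returns `(res.status_code, res.text)`.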
            
            

In [264]:
Map_Quotes_to_Tags_Author_to_Quotes_and_Tags_to_Authors()

HTML parser error : error parsing attribute name
temprop="keywords" content="friends,heartbreak,inspirational,life,love,sisters" 
                                                                               ^
HTML parser error : error parsing attribute name
           <meta class="keywords" itemprop="keywords" content="courage,friends" 
                                                                               ^
HTML parser error : error parsing attribute name
     <meta class="keywords" itemprop="keywords" content="simplicity,understand" 
                                                                               ^
HTML parser error : error parsing attribute name
            <meta class="keywords" itemprop="keywords" content="love" /    > 
                                                                      ^
HTML parser error : error parsing attribute name
            <meta class="keywords" itemprop="keywords" content="fantasy" /    > 
                                      

In [265]:
print(f"There are {len(authors_with_quotes)} authors with quotes on the website")

There are 50 authors with quotes on the website


#### DISPLAY AUTHORS WITH TAGS

In [266]:
tag='life'
# An author is stored once per tagged quote, so drop repeated names while keeping first-seen order.
authors=list(dict.fromkeys(Find_Authors_With_Labels(tag)))
print(f"There are {len(authors)} authors with quotes about {tag} and they are:\n",authors)

There are 11 authors with quotes about life and they are:
 ['Albert Einstein', 'André Gide', 'Marilyn Monroe', 'Douglas Adams', 'Mark Twain', 'Allen Saunders', 'Dr. Seuss', 'George Bernard Shaw', 'Ralph Waldo Emerson', 'Jimi Hendrix', 'Khaled Hosseini']


#### DISPLAY QUOTES OF AUTHOR

In [267]:
def Show_Quotes_Of_Author(author_name):
    for quote in authors_with_quotes[author_name]:
        print(quote)
        print('\n')

In [268]:
author_name='Douglas Adams'
print(f"Here are all quotes of {author_name}:\n")
Show_Quotes_Of_Author(author_name)

Here are all quotes of Douglas Adams:

“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”




#### DISPLAY QUOTES GIVEN AUTHOR AND TAG
Displays all quotes for a given tag and author, if any exist.

In [274]:
def Display_Quotes_For_Tag_And_Author(tag_name,author_name):
    # Return early with an empty list when the lookup cannot succeed;
    # otherwise authors_with_quotes[author_name] below may raise a KeyError.
    if tag_name not in tags_with_authors:
        print(f"No tag named {tag_name} found")
        return []
    if author_name not in authors_with_quotes:
        print(f"No author named {author_name} found")
        return []
    if author_name not in tags_with_authors[tag_name]:
        print(f"No {tag_name} quotes found for author {author_name}")
        return []

    # Keep only this author's quotes that carry the requested tag.
    return [quote for quote in authors_with_quotes[author_name]
            if tag_name in quotes_to_tags[quote]]

In [277]:
tag_name='life'
author_name='Albert Einstein'
ans=Display_Quotes_For_Tag_And_Author(tag_name,author_name)

print(f"All quotes of {author_name} about {tag_name} are as follows:\n")
ans

All quotes of Albert Einstein about life are as follows:



['“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“Life is like riding a bicycle. To keep your balance, you must keep moving.”']