<div style="background-color: lightgreen; color: black; padding: 20px;">
    <h3>Web Scraping (Quotes) Using Beautiful Soup
</h3> </div>

# Objective:

* The objective of this assignment is to help trainees gain hands-on experience with Beautiful Soup, a popular Python library for web scraping. By the end of this assignment, trainees should be able to scrape data from websites, navigate HTML structures, and store the extracted data in various formats.

* Task 1: Install and Set Up Beautiful Soup
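* Install Required Libraries: Install Beautiful Soup together with the requests library for making HTTP requests and pandas for tabulating the results. The notebook uses Python's built-in html.parser, so a separate parser such as lxml is optional.

* !pip install beautifulsoup4
* !pip install requests
* !pip install pandas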

#### Key features of BeautifulSoup:

 - Easy-to-use functions and methods for navigating and extracting data.
 - Handles a wide range of HTML and XML structures.
 - Parses documents efficiently, even large ones.
 - Tolerant of malformed or incorrect HTML.
 - Can be paired with different parsers (html.parser, lxml, html5lib).
  


#### Common use cases for BeautifulSoup:

 - Web scraping: Extracting data from websites for analysis or automation.
 - Data extraction: Parsing HTML or XML files to extract specific information.
 - Web automation: Interacting with web pages programmatically.
 - Web development: Creating and manipulating HTML structures.

### Web Scraping using Beautiful Soup

In [37]:
# Install once if needed: pip install beautifulsoup4 requests
# Import the necessary libraries
from bs4 import BeautifulSoup  # bs4 is the package name for Beautiful Soup 4
import requests

In [75]:
#Task 2: Choose a Website to Scrape


#The goal is to scrape the following fields:

    #Title
    #Quotes
    #Author
    #Tags

#Task 3: Scrape Data Using Beautiful Soup
#Make an HTTP Request: Use the requests library to send a GET request to the website and retrieve the HTML content.


In [2]:
url = "http://quotes.toscrape.com"
url

'http://quotes.toscrape.com'
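
A request can fail because of a timeout, a DNS error, or an HTTP 4xx/5xx status. The sketch below wraps the GET call in a hypothetical helper, `fetch_html`, that returns `None` when anything goes wrong. The helper name and the 10-second timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as text, or None if the request fails.

    fetch_html and the 10-second timeout are illustrative choices.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raise an HTTPError for 4xx/5xx status codes.
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response.text


# A URL without a scheme is rejected by requests before any network call,
# so this line exercises the failure path without going online.
result = fetch_html("not-a-valid-url")
```

In the notebook, the same helper could be called as `html_content = fetch_html(url)`, followed by a check that the result is not `None` before parsing.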

In [40]:
# requests is a popular Python library for making HTTP requests.
# requests.get(url) sends a GET request and returns a Response object
# holding the status code, headers and body of the server's reply.
response = requests.get(url)
html_content = response.text  # the page's HTML as a string
print(html_content)

<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
    
    
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="auth

In [45]:
#Parse the HTML content

soup = BeautifulSoup(html_content, "html.parser")
#<a> is an anchor (hyperlink) tag
#<b> is a bold tag
#<p> is a paragraph tag
#href is the attribute that holds a link's target (hypertext reference)
#triple quotes delimit multi-line strings in Python
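
To see how those tags and attributes look to Beautiful Soup, here is a minimal sketch that parses a small, made-up HTML snippet written as a triple-quoted string. The snippet and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document written as a triple-quoted string.
sample_html = """
<html>
  <body>
    <p>Visit <a href="/login">the <b>login</b> page</a>.</p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

paragraph = sample_soup.p          # first <p> tag
link = sample_soup.a               # first <a> tag
link_target = link["href"]         # value of the href attribute
bold_text = sample_soup.b.string   # text inside the <b> tag

print(link_target)
print(bold_text)
print(paragraph.get_text())
```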

In [46]:
print(soup.prettify())
#prettify() re-indents the HTML, one tag per line, so the nesting is easier to read

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert

In [47]:
#Extract quotes, authors and tags

quotes = soup.find_all('div', class_='quote')


for quote in quotes:
    # Extract the text of the quote
    text = quote.find('span', class_='text').text
    
    # Extract the author of the quote
    author = quote.find('small', class_='author').text
    
    # Extract the tags associated with the quote
    tags = [tag.text for tag in quote.find_all('a', class_='tag')]

# These print calls sit outside the loop, so they run once after it finishes
# and show only the last quote on the page.
print(text)
print(author)
print(tags)

“A day without sunshine is like, you know, night.”
Steve Martin
['humor', 'obvious', 'simile']
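
To keep every quote rather than only the last one, each block can be appended to a list inside the loop. The following is a sketch under stated assumptions: `extract_quotes` is a hypothetical helper, and the HTML is a made-up fragment that mimics the page's markup, with one block deliberately missing its author.

```python
from bs4 import BeautifulSoup

# Two made-up quote blocks that mimic the page's markup; the second
# one deliberately has no author element and no tags.
sample_html = """
<div class="quote">
  <span class="text">First quote.</span>
  <small class="author">Author One</small>
  <a class="tag" href="/tag/a/">a</a>
  <a class="tag" href="/tag/b/">b</a>
</div>
<div class="quote">
  <span class="text">Second quote.</span>
</div>
"""


def extract_quotes(soup):
    """Return one dict per quote block; missing fields become None."""
    records = []
    for block in soup.find_all("div", class_="quote"):
        text_el = block.find("span", class_="text")
        author_el = block.find("small", class_="author")
        records.append({
            "text": text_el.get_text(strip=True) if text_el else None,
            "author": author_el.get_text(strip=True) if author_el else None,
            "tags": [a.get_text(strip=True)
                     for a in block.find_all("a", class_="tag")],
        })
    return records


records = extract_quotes(BeautifulSoup(sample_html, "html.parser"))
for record in records:
    print(record)
```

Checking each element before reading its text avoids an `AttributeError` when a block lacks an author or a quote.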


In [48]:
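#Handle Missing Data: Implement error handling to manage cases where certain elements might be missing or where requests might fail.

#Task 4: Store the Scraped Data

import csv
import json

# 'quotes' from the previous cell holds Tag objects, not dictionaries,
# so convert each one to a plain dict before saving.
records = []
for quote in quotes:
    text_el = quote.find('span', class_='text')
    author_el = quote.find('small', class_='author')
    records.append({
        'text': text_el.text if text_el else None,      # None if the element is missing
        'author': author_el.text if author_el else None,
        'tags': [tag.text for tag in quote.find_all('a', class_='tag')],
    })

#Store the scraped data in a JSON file.
def store_data(data):
    with open('scraped_data.json', 'w', encoding='utf-8') as file:
        json.dump(data, file, ensure_ascii=False, indent=2)

store_data(records)

#Save Data to a CSV
with open('quotes.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Quote', 'Author', 'Tags'])  # Write header row

    for record in records:
        # Join tags into a single string
        tags_string = ', '.join(record['tags'])
        writer.writerow([record['text'], record['author'], tags_string])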
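
Once each quote is a plain dictionary with `text`, `author` and `tags` keys, saving and reloading it takes only the standard library. The sketch below uses made-up records and a temporary folder (both assumptions for illustration) to show the round trip through JSON and CSV.

```python
import csv
import json
import os
import tempfile

# Made-up records in the shape described above: text, author, tags.
records = [
    {"text": "First quote.", "author": "Author One", "tags": ["a", "b"]},
    {"text": "Second quote.", "author": "Author Two", "tags": []},
]

# Write into a temporary folder so the example leaves no files behind.
with tempfile.TemporaryDirectory() as folder:
    json_path = os.path.join(folder, "scraped_data.json")
    csv_path = os.path.join(folder, "quotes.csv")

    # JSON keeps the tags as a real list.
    with open(json_path, "w", encoding="utf-8") as file:
        json.dump(records, file, ensure_ascii=False, indent=2)

    # CSV is flat, so the tags are joined into one string per row.
    with open(csv_path, "w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=["Quote", "Author", "Tags"])
        writer.writeheader()
        for record in records:
            writer.writerow({
                "Quote": record["text"],
                "Author": record["author"],
                "Tags": ", ".join(record["tags"]),
            })

    # Read both files back to check what was stored.
    with open(json_path, encoding="utf-8") as file:
        loaded_json = json.load(file)
    with open(csv_path, newline="", encoding="utf-8") as file:
        loaded_csv = list(csv.DictReader(file))

print(loaded_json == records)
print(loaded_csv[0]["Tags"])
```

JSON preserves the nested tag list, while CSV needs it flattened into a string; which format fits better depends on how the data will be used afterwards.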

In [49]:
soup.title

<title>Quotes to Scrape</title>

In [50]:
soup.title.name

'title'

In [51]:
soup.title.string

'Quotes to Scrape'

In [52]:
soup.div

<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
<

In [53]:
soup.a
#soup.a returns the first <a> (anchor) tag, a hyperlink to another resource

<a href="/" style="text-decoration: none">Quotes to Scrape</a>

In [54]:
#get_text() returns all the text on the page with the tags stripped out
print(soup.get_text())





Quotes to Scrape








Quotes to Scrape




Login






“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by Albert Einstein
(about)


            Tags:
            
change
deep-thoughts
thinking
world



“It is our choices, Harry, that show what we truly are, far more than our abilities.”
by J.K. Rowling
(about)


            Tags:
            
abilities
choices



“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
by Albert Einstein
(about)


            Tags:
            
inspirational
life
live
miracle
miracles



“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
by Jane Austen
(about)


            Tags:
            
aliteracy
books
classic
humor



“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
by Marilyn Monroe
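
The blank lines in this output come from the indentation and line breaks in the HTML source. `get_text()` accepts `separator` and `strip` arguments that tidy this up. The sketch below uses a made-up fragment with similar indentation; the fragment and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# A made-up fragment with the same kind of indentation as the real page.
sample_html = """
<div class="tags">
    Tags:
    <a class="tag" href="/tag/life/">life</a>
    <a class="tag" href="/tag/love/">love</a>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Default call: whitespace from the source is kept as-is.
raw_text = sample_soup.get_text()

# strip=True trims each piece of text and skips empty ones;
# separator=" " puts a single space between the pieces.
clean_text = sample_soup.get_text(separator=" ", strip=True)

print(repr(raw_text))
print(clean_text)
```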

In [55]:
whole = soup.find_all('a')
whole
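#find_all('a') returns every <a> tag on the page: the site header link, the login link, each author's "(about)" link and each tag link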

[<a href="/" style="text-decoration: none">Quotes to Scrape</a>,
 <a href="/login">Login</a>,
 <a href="/author/Albert-Einstein">(about)</a>,
 <a class="tag" href="/tag/change/page/1/">change</a>,
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>,
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>,
 <a class="tag" href="/tag/world/page/1/">world</a>,
 <a href="/author/J-K-Rowling">(about)</a>,
 <a class="tag" href="/tag/abilities/page/1/">abilities</a>,
 <a class="tag" href="/tag/choices/page/1/">choices</a>,
 <a href="/author/Albert-Einstein">(about)</a>,
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>,
 <a class="tag" href="/tag/life/page/1/">life</a>,
 <a class="tag" href="/tag/live/page/1/">live</a>,
 <a class="tag" href="/tag/miracle/page/1/">miracle</a>,
 <a class="tag" href="/tag/miracles/page/1/">miracles</a>,
 <a href="/author/Jane-Austen">(about)</a>,
 <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>,
 <a class="tag" href

In [56]:
soup.find_all('span')
#find_all('span') is easier to read, but it only covers the quote text and the author; the tags are <a> elements, not spans

[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>,
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
 <span>by <small class="author" itemprop="author">J.K. Rowling</small>
 <a href="/author/J-K-Rowling">(about)</a>
 </span>,
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>,
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,

In [57]:
entire = soup.find_all('div')
entire
#find_all('div') returns every <div>, including the quote containers that hold the text, author and tags

[<div class="container">
 <div class="row header-box">
 <div class="col-md-8">
 <h1>
 <a href="/" style="text-decoration: none">Quotes to Scrape</a>
 </h1>
 </div>
 <div class="col-md-4">
 <p>
 <a href="/login">Login</a>
 </p>
 </div>
 </div>
 <div class="row">
 <div class="col-md-8">
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
 <a class="tag" href="/tag/change/page/1/">change</a>
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>
 <a class="tag" href="/tag

In [58]:
#find() returns only the first match, so this extracts the first quote on the page
quote_text = soup.find(itemprop='text').text
author = soup.find(itemprop='author').text
tags = soup.find(class_='tags').text
print(quote_text)
print(author)
print(tags)

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Albert Einstein

            Tags:
            
change
deep-thoughts
thinking
world
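
Only the first quote appears here because `find()` returns a single element, whereas `find_all()` returns every match. A minimal sketch of the difference, using a made-up two-quote fragment (an assumption for illustration):

```python
from bs4 import BeautifulSoup

# A made-up fragment with two quotes and no author element.
sample_html = """
<span itemprop="text">Quote one</span>
<span itemprop="text">Quote two</span>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

first = sample_soup.find(itemprop="text")        # a single Tag
every = sample_soup.find_all(itemprop="text")    # a list-like ResultSet
missing = sample_soup.find(itemprop="author")    # None: nothing matches

print(first.text)
print([tag.text for tag in every])
print(missing)

# Calling .text on None raises AttributeError, so check first.
author_name = missing.text if missing is not None else "unknown"
```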



In [59]:
quotess = soup.find_all(itemprop = 'text')
for al in quotess:
    print(al.text)

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”


In [60]:
authorr = soup.find_all(itemprop = 'author')
for al in authorr:
    print(al.text)

Albert Einstein
J.K. Rowling
Albert Einstein
Jane Austen
Marilyn Monroe
Albert Einstein
André Gide
Thomas A. Edison
Eleanor Roosevelt
Steve Martin
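
With the author names collected, a quick tally shows who appears most often. This sketch retypes the ten names printed above into a plain list and counts them with `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

# The ten author names printed above, in page order.
authors = [
    "Albert Einstein", "J.K. Rowling", "Albert Einstein", "Jane Austen",
    "Marilyn Monroe", "Albert Einstein", "André Gide", "Thomas A. Edison",
    "Eleanor Roosevelt", "Steve Martin",
]

author_counts = Counter(authors)
unique_authors = len(author_counts)
most_common_author, quote_count = author_counts.most_common(1)[0]

print(unique_authors)
print(most_common_author, quote_count)
```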


In [61]:
tagss = soup.find_all(class_='tags')
for al in tagss:
    print(al.text)


            Tags:
            
change
deep-thoughts
thinking
world


            Tags:
            
abilities
choices


            Tags:
            
inspirational
life
live
miracle
miracles


            Tags:
            
aliteracy
books
classic
humor


            Tags:
            
be-yourself
inspirational


            Tags:
            
adulthood
success
value


            Tags:
            
life
love


            Tags:
            
edison
failure
inspirational
paraphrased


            Tags:
            
misattributed-eleanor-roosevelt


            Tags:
            
humor
obvious
simile
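
The `.text` of each tags container carries the "Tags:" label and a lot of whitespace. Two cleaner options are sketched below on a made-up fragment shaped like one of the page's containers: reading each `<a class="tag">` link, or splitting the `content` attribute of the `<meta class="keywords">` element. The fragment and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# A made-up fragment shaped like one of the page's tag containers.
sample_html = """
<div class="tags">
    Tags:
    <meta class="keywords" itemprop="keywords" content="abilities,choices"/>
    <a class="tag" href="/tag/abilities/page/1/">abilities</a>
    <a class="tag" href="/tag/choices/page/1/">choices</a>
</div>
"""

container = BeautifulSoup(sample_html, "html.parser").find(class_="tags")

# Option 1: collect the text of each <a class="tag"> link.
tags_from_links = [a.get_text(strip=True)
                   for a in container.find_all("a", class_="tag")]

# Option 2: split the comma-separated content of the <meta> element.
keywords = container.find("meta", class_="keywords")["content"]
tags_from_meta = keywords.split(",")

print(tags_from_links)
print(tags_from_meta)
```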



In [62]:
#assembling everything together
all_quotes = soup.find_all(itemprop = 'text')
all_author = soup.find_all(itemprop = 'author')
all_tags = soup.find_all(class_='tags')

for q,a,t in zip(all_quotes,all_author,all_tags):
    print(q.text)
    print(a.text)
    print(t.text)
#Python's built-in zip() iterates over several sequences in parallel
#Here it pairs the i-th entries of all_quotes, all_author and all_tags
#zip() stops as soon as the shortest sequence runs out

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Albert Einstein

            Tags:
            
change
deep-thoughts
thinking
world

“It is our choices, Harry, that show what we truly are, far more than our abilities.”
J.K. Rowling

            Tags:
            
abilities
choices

“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Albert Einstein

            Tags:
            
inspirational
life
live
miracle
miracles

“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Jane Austen

            Tags:
            
aliteracy
books
classic
humor

“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Marilyn Monroe

            Tags:
            
be-yourself
inspirational

“Try not to become a man of success. Rather become a man of v
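
One caveat with `zip()` is that it stops at the shortest input, so if one quote lacked an author the lists would fall out of step without any error. The sketch below shows this with made-up lists (an assumption for illustration) and contrasts it with `itertools.zip_longest`, which keeps every item.

```python
from itertools import zip_longest

quotes = ["Quote one", "Quote two", "Quote three"]
authors = ["Author One", "Author Two"]  # one author is missing

# zip() silently drops the third quote because authors is shorter.
paired = list(zip(quotes, authors))

# zip_longest() keeps every quote and fills the gap with a placeholder.
padded = list(zip_longest(quotes, authors, fillvalue=None))

print(paired)
print(padded)
```

Comparing the lengths of the lists before zipping, or extracting all fields from each quote container in turn, avoids this kind of silent misalignment.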

In [63]:
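#all_quotes, all_author and all_tags hold Tag objects, not strings, which is why the output below shows HTML markup
data = {'Quote':all_quotes,'Author':all_author,'Tags':all_tags}
data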

{'Quote': [<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
  <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
  <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
  <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,
  <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>,
  <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>,
  <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for w

In [64]:
import pandas as pd 

In [65]:
df = pd.DataFrame(data)
df

| | Quote | Author | Tags |
|---|---|---|---|
| 0 | [“The world as we have created it is a process... | [Albert Einstein] | [\n Tags:\n , [], \n, [c... |
| 1 | [“It is our choices, Harry, that show what we ... | [J.K. Rowling] | [\n Tags:\n , [], \n, [a... |
| 2 | [“There are only two ways to live your life. O... | [Albert Einstein] | [\n Tags:\n , [], \n, [i... |
| 3 | [“The person, be it gentleman or lady, who has... | [Jane Austen] | [\n Tags:\n , [], \n, [a... |
| 4 | [“Imperfection is beauty, madness is genius an... | [Marilyn Monroe] | [\n Tags:\n , [], \n, [b... |
| 5 | [“Try not to become a man of success. Rather b... | [Albert Einstein] | [\n Tags:\n , [], \n, [a... |
| 6 | [“It is better to be hated for what you are th... | [André Gide] | [\n Tags:\n , [], \n, [l... |
| 7 | [“I have not failed. I've just found 10,000 wa... | [Thomas A. Edison] | [\n Tags:\n , [], \n, [e... |
| 8 | [“A woman is like a tea bag; you never know ho... | [Eleanor Roosevelt] | [\n Tags:\n , [], \n, [m... |
| 9 | [“A day without sunshine is like, you know, ni... | [Steve Martin] | [\n Tags:\n , [], \n, [h... |
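
The bracketed cells in this table appear because each cell holds a Beautiful Soup element rather than a string. A sketch of the same table built from plain values follows; it retypes the first three rows of the page by hand, and the names `rows` and `clean_df` are assumptions for illustration. In the notebook, the equivalent step would be taking `.text` from each element before building the DataFrame.

```python
import pandas as pd

# First three rows of the page, typed in as plain Python values.
rows = [
    {"Quote": "“The world as we have created it is a process of our thinking. "
              "It cannot be changed without changing our thinking.”",
     "Author": "Albert Einstein",
     "Tags": ["change", "deep-thoughts", "thinking", "world"]},
    {"Quote": "“It is our choices, Harry, that show what we truly are, "
              "far more than our abilities.”",
     "Author": "J.K. Rowling",
     "Tags": ["abilities", "choices"]},
    {"Quote": "“There are only two ways to live your life. One is as though "
              "nothing is a miracle. The other is as though everything is a miracle.”",
     "Author": "Albert Einstein",
     "Tags": ["inspirational", "life", "live", "miracle", "miracles"]},
]

clean_df = pd.DataFrame(rows)

# Join each tag list into one comma-separated string for display and export.
clean_df["Tags"] = clean_df["Tags"].apply(", ".join)

print(clean_df.shape)
print(clean_df["Author"].nunique())
print(clean_df.loc[1, "Tags"])
```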


In [66]:
df.shape

(10, 3)

In [67]:
df.size

30

In [68]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Quote   10 non-null     object
 1   Author  10 non-null     object
 2   Tags    10 non-null     object
dtypes: object(3)
memory usage: 372.0+ bytes


In [69]:
df.describe()

| | Quote | Author | Tags |
|---|---|---|---|
| count | 10 | 10 | 10 |
| unique | 10 | 8 | 10 |
| top | [“The world as we have created it is a process... | [Albert Einstein] | [\n Tags:\n , [], \n, [c... |
| freq | 1 | 3 | 1 |


In [70]:
df.head()

| | Quote | Author | Tags |
|---|---|---|---|
| 0 | [“The world as we have created it is a process... | [Albert Einstein] | [\n Tags:\n , [], \n, [c... |
| 1 | [“It is our choices, Harry, that show what we ... | [J.K. Rowling] | [\n Tags:\n , [], \n, [a... |
| 2 | [“There are only two ways to live your life. O... | [Albert Einstein] | [\n Tags:\n , [], \n, [i... |
| 3 | [“The person, be it gentleman or lady, who has... | [Jane Austen] | [\n Tags:\n , [], \n, [a... |
| 4 | [“Imperfection is beauty, madness is genius an... | [Marilyn Monroe] | [\n Tags:\n , [], \n, [b... |


In [71]:
df.isnull()

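| | Quote | Author | Tags |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | False | False |
| 2 | False | False | False |
| 3 | False | False | False |
| 4 | False | False | False |
| 5 | False | False | False |
| 6 | False | False | False |
| 7 | False | False | False |
| 8 | False | False | False |
| 9 | False | False | False |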


In [72]:
df.isnull().sum()

Quote     0
Author    0
Tags      0
dtype: int64
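
Every count is zero here because this page has no gaps. To show what the same check looks like when a value is missing, the sketch below builds a small made-up DataFrame with one absent author and fills it with a placeholder. The data and the "Unknown" label are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; the second one has no author.
sample_df = pd.DataFrame({
    "Quote": ["Quote one", "Quote two", "Quote three"],
    "Author": ["Author One", None, "Author Three"],
})

# Count missing values per column.
missing_per_column = sample_df.isnull().sum()

# Replace missing authors with a placeholder, leaving other columns alone.
filled_df = sample_df.fillna({"Author": "Unknown"})

print(missing_per_column)
print(filled_df)
```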

In [73]:
#saving data as json file 
df.to_json('data.json')

In [74]:
#saving data as csv file 
df.to_csv('data.csv', index = False)