<a href="https://colab.research.google.com/github/Bhavu211/Data_Science/blob/main/web_scrapping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**What is Web Scraping?**

Web scraping is an automated method of obtaining large amounts of data from websites. Most of this data is unstructured HTML, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications. There are several ways to scrape a website: using online services, using particular APIs, or writing your own scraping code from scratch. Many large websites, such as Google, Twitter, Facebook, and StackOverflow, have APIs that allow you to access their data in a structured format. When an API is available, it is usually the best option. Other sites either do not allow users to access large amounts of data in a structured form or are simply not that technologically advanced. In that situation, web scraping is the way to collect the data.

Web scraping requires two parts: the crawler and the scraper. The crawler is a program that browses the web by following links to find the pages that hold the required data. The scraper is a tool built to extract the data from those pages. Its design can vary greatly with the complexity and scope of the project, so that it extracts the data quickly and accurately.

**Why is Python a popular programming language for Web Scraping?**

Python is one of the most popular languages for web scraping because it handles most of the steps involved with little code. It also has a variety of libraries created specifically for web scraping. Scrapy is a popular open-source web crawling framework written in Python; it suits web scraping as well as extracting data through APIs. Beautiful Soup is another Python library well suited to web scraping. It builds a parse tree that can be used to extract data from a page's HTML, and it offers features for navigating, searching, and modifying that tree.
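As a small illustration of the parse tree described above, the sketch below parses a hand-written HTML snippet (made up for this example, not taken from any site) and shows one case each of searching, navigating, and modifying.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document, used only for illustration.
html = """
<div class="card">
  <h2>Title</h2>
  <p class="note">First note</p>
  <p class="note">Second note</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching: find the first match, or every match.
first_note = soup.find("p", class_="note").text
all_notes = [p.text for p in soup.find_all("p", class_="note")]

# Navigating: move from a tag to its parent and read an attribute.
parent_classes = soup.find("h2").parent.get("class")

# Modifying: replace a tag's text in place.
soup.find("h2").string = "New title"
new_title = soup.find("h2").text

print(first_note, all_notes, parent_classes, new_title)
```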

**What is Web Scraping used for?**

Web scraping has applications across many industries. Some of them are listed below.

1. Price Monitoring
Companies can use web scraping to scrape product data for their own and competing products and see how it affects their pricing strategies. They can then use this data to set prices that maximize revenue.

2. Market Research
Companies can also use web scraping for market research. High-quality scraped data collected in large volumes helps them analyze consumer trends and decide which direction to move in.

3. News Monitoring
Web scraping news sites can provide detailed reports on the current news to a company. This is even more essential for companies that are frequently in the news or that depend on daily news for their day-to-day functioning. After all, news reports can make or break a company in a single day!

4. Sentiment Analysis
If companies want to understand how consumers feel about their products, sentiment analysis is a must. They can use web scraping to collect posts from social media sites such as Facebook and Twitter and gauge the general sentiment about their products. This helps them create products that people want and stay ahead of the competition.

5. Email Marketing
Companies can also use web scraping for email marketing. They can collect email addresses from various sites and then send bulk promotional and marketing emails to the owners of those addresses.



**How to Scrape a Webpage?**

In [None]:
import requests
import bs4
from bs4 import BeautifulSoup

In [None]:
link='https://quotes.toscrape.com/'

In [None]:
res= requests.get(link)

In [None]:
res

<Response [200]>
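A `<Response [200]>` means the request succeeded. A minimal sketch of making that check explicit is shown below; `fetch_html` and its `session` parameter are hypothetical names introduced for this example, while `raise_for_status()` and the `timeout` argument are part of the `requests` API.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page body as text, raising an exception on an HTTP error."""
    # Any object with a requests-style .get() works; by default use requests itself.
    http = session if session is not None else requests
    res = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    res.raise_for_status()
    return res.text


# Example call (needs network access):
# html = fetch_html('https://quotes.toscrape.com/')
```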

In [None]:
print(res.text)

<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
    
    
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="auth

In [None]:
html= res.text

In [None]:
type(html)

str

In [None]:
# Save the page source; the with-block closes the file even if the write fails
with open('main.html','w', encoding='utf-8') as fd:
  fd.write(html)

In [None]:
soup=BeautifulSoup(res.text, 'html.parser')
print(soup.find('span',class_='text').text[1:-1])

The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.
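The slice `[1:-1]` drops the first and last character, which works here because every quote is wrapped in curly quotation marks. A slightly more defensive sketch, assuming the same curly marks, strips them only when they are present; `clean_quote` is a hypothetical helper name.

```python
def clean_quote(raw):
    """Remove surrounding whitespace and curly quotation marks, if any."""
    return raw.strip().strip("“”")


print(clean_quote("“A day without sunshine is like, you know, night.”"))
```

Unlike the slice, this leaves a string without quotation marks unchanged.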


In [None]:
soup.find_all('span',class_='text')

[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,
 <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>,
 <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>,
 <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.

**Scraping Quotes**

In [None]:
quotes=[]
for quote in soup.find_all('span',class_='text'):
  quotes.append(quote.text[1:-1])
quotes

['The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.',
 'It is our choices, Harry, that show what we truly are, far more than our abilities.',
 'There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.',
 'The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.',
 "Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.",
 'Try not to become a man of success. Rather become a man of value.',
 'It is better to be hated for what you are than to be loved for what you are not.',
 "I have not failed. I've just found 10,000 ways that won't work.",
 "A woman is like a tea bag; you never know how strong it is until it's in hot water.",
 'A day without sunshine is like, you know, night.']

**Scraping Quotes with Author Details**

In [None]:
soup.find_all('div',class_='quote')

[<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
 <a class="tag" href="/tag/change/page/1/">change</a>
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>
 <a class="tag" href="/tag/world/page/1/">world</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
 <span>by <small class="author" itempr

In [None]:
for sp in soup.find_all('div',class_='quote'):
  print(sp)
  print()

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
<span>by <small class="author" itemprop="author">J.K. 

In [None]:
sp

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>
<span>by <small class="author" itemprop="author">Steve Martin</small>
<a href="/author/Steve-Martin">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="humor,obvious,simile" itemprop="keywords"/>
<a class="tag" href="/tag/humor/page/1/">humor</a>
<a class="tag" href="/tag/obvious/page/1/">obvious</a>
<a class="tag" href="/tag/simile/page/1/">simile</a>
</div>
</div>

In [None]:
quote= sp.find('span', class_='text').text[1:-1]

In [None]:
quote

'A day without sunshine is like, you know, night.'

In [None]:
author= sp.find('small',class_='author').text

In [None]:
author

'Steve Martin'

In [None]:
author_id= sp.find('a').get('href')

In [None]:
author_id

'/author/Steve-Martin'

In [None]:
tags= []
for tag in sp.find_all('a',class_='tag'):
  tags.append(tag.text)

In [None]:
tags

['humor', 'obvious', 'simile']

In [None]:
','.join(tags)

'humor,obvious,simile'

In [None]:
print(quote, author, author_id, tags)
print('-'*10)

A day without sunshine is like, you know, night. Steve Martin /author/Steve-Martin ['humor', 'obvious', 'simile']
----------


In [None]:
for sp in soup.find_all('div',class_='quote'):
  quote= sp.find('span', class_='text').text[1:-1]
  author= sp.find('small',class_='author').text
  author_id= sp.find('a').get('href')
  tags= []
  for tag in sp.find_all('a',class_='tag'):
    tags.append(tag.text)
  tags= ','.join(tags)
  print(quote)
  print(author)
  print(author_id)
  print(tags)
  print('-'*10)

The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.
Albert Einstein
/author/Albert-Einstein
change,deep-thoughts,thinking,world
----------
It is our choices, Harry, that show what we truly are, far more than our abilities.
J.K. Rowling
/author/J-K-Rowling
abilities,choices
----------
There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.
Albert Einstein
/author/Albert-Einstein
inspirational,life,live,miracle,miracles
----------
The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.
Jane Austen
/author/Jane-Austen
aliteracy,books,classic,humor
----------
Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.
Marilyn Monroe
/author/Marilyn-Monroe
be-yourself,inspirational
----------
Try not to become a man of success. Rather become a man of value.
Albe

In [None]:
data=[]
for sp in soup.find_all('div',class_='quote'):
  quote= sp.find('span', class_='text').text[1:-1]
  author= sp.find('small',class_='author').text
  author_id= sp.find('a').get('href')
  tags= []
  for tag in sp.find_all('a',class_='tag'):
    tags.append(tag.text)
  tags= ','.join(tags)
  data.append([quote, author, author_id, tags])

In [None]:
data[0]

['The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.',
 'Albert Einstein',
 '/author/Albert-Einstein',
 'change,deep-thoughts,thinking,world']

In [None]:
import pandas as pd

In [None]:
df= pd.DataFrame(data, columns=['quote','author','author_id','tags'])

In [None]:
df.to_csv('quotes.csv', index=False)

In [None]:
df

Unnamed: 0,quote,author,author_id,tags
0,The world as we have created it is a process o...,Albert Einstein,/author/Albert-Einstein,"change,deep-thoughts,thinking,world"
1,"It is our choices, Harry, that show what we tr...",J.K. Rowling,/author/J-K-Rowling,"abilities,choices"
2,There are only two ways to live your life. One...,Albert Einstein,/author/Albert-Einstein,"inspirational,life,live,miracle,miracles"
3,"The person, be it gentleman or lady, who has n...",Jane Austen,/author/Jane-Austen,"aliteracy,books,classic,humor"
4,"Imperfection is beauty, madness is genius and ...",Marilyn Monroe,/author/Marilyn-Monroe,"be-yourself,inspirational"
5,Try not to become a man of success. Rather bec...,Albert Einstein,/author/Albert-Einstein,"adulthood,success,value"
6,It is better to be hated for what you are than...,André Gide,/author/Andre-Gide,"life,love"
7,"I have not failed. I've just found 10,000 ways...",Thomas A. Edison,/author/Thomas-A-Edison,"edison,failure,inspirational,paraphrased"
8,A woman is like a tea bag; you never know how ...,Eleanor Roosevelt,/author/Eleanor-Roosevelt,misattributed-eleanor-roosevelt
9,"A day without sunshine is like, you know, night.",Steve Martin,/author/Steve-Martin,"humor,obvious,simile"


**Scraping Quotes from Multiple Pages**

In [None]:
for page in range(2,11):
  link= 'https://quotes.toscrape.com/page/' + str(page)
  res = requests.get(link)
  soup=BeautifulSoup(res.text, 'html.parser')
  print(link)

https://quotes.toscrape.com/page/2
https://quotes.toscrape.com/page/3
https://quotes.toscrape.com/page/4
https://quotes.toscrape.com/page/5
https://quotes.toscrape.com/page/6
https://quotes.toscrape.com/page/7
https://quotes.toscrape.com/page/8
https://quotes.toscrape.com/page/9
https://quotes.toscrape.com/page/10
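The loop above hard-codes the page range. An alternative is to follow each page's "next" link until there is none. The sketch below assumes the link is marked up as `<li class="next"><a href="...">`; `iter_pages` and its `fetch` callable are hypothetical names, and `fetch` is passed in so that the function can be tried without network access.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iter_pages(start_url, fetch, max_pages=50):
    """Yield (url, soup) for each page, following the "next" link."""
    url = start_url
    for _ in range(max_pages):  # hard cap so circular links cannot loop forever
        soup = BeautifulSoup(fetch(url), "html.parser")
        yield url, soup
        # Assumed markup: <li class="next"><a href="/page/2/">Next</a></li>
        next_link = soup.select_one("li.next > a")
        if next_link is None:
            break
        url = urljoin(url, next_link.get("href"))


# Example call (needs network access and `import requests`):
# for url, soup in iter_pages('https://quotes.toscrape.com/',
#                             lambda u: requests.get(u).text):
#     print(url)
```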


In [None]:
from tqdm import tqdm

In [None]:
data=[]
for page in tqdm(range(1,11)):
  link= 'https://quotes.toscrape.com/page/' + str(page)
  res = requests.get(link)
  soup=BeautifulSoup(res.text, 'html.parser')
  for sp in soup.find_all('div', class_='quote'):
    quote= sp.find('span', class_='text').text[1:-1]
    author= sp.find('small',class_='author').text
    author_id= sp.find('a').get('href')
    tags= []
    for tag in sp.find_all('a',class_='tag'):
      tags.append(tag.text)
    tags= ','.join(tags)
    data.append([quote, author, author_id, tags])


100%|██████████| 10/10 [00:00<00:00, 10.52it/s]


In [None]:
len(data)

100

In [None]:
df = pd.DataFrame(data, columns=['quote','author','author_id','tags'])

In [None]:
df.to_csv('quotes.csv', index=False)

In [None]:
df

Unnamed: 0,quote,author,author_id,tags
0,The world as we have created it is a process o...,Albert Einstein,/author/Albert-Einstein,"change,deep-thoughts,thinking,world"
1,"It is our choices, Harry, that show what we tr...",J.K. Rowling,/author/J-K-Rowling,"abilities,choices"
2,There are only two ways to live your life. One...,Albert Einstein,/author/Albert-Einstein,"inspirational,life,live,miracle,miracles"
3,"The person, be it gentleman or lady, who has n...",Jane Austen,/author/Jane-Austen,"aliteracy,books,classic,humor"
4,"Imperfection is beauty, madness is genius and ...",Marilyn Monroe,/author/Marilyn-Monroe,"be-yourself,inspirational"
...,...,...,...,...
95,You never really understand a person until you...,Harper Lee,/author/Harper-Lee,better-life-empathy
96,You have to write the book that wants to be wr...,Madeleine L'Engle,/author/Madeleine-LEngle,"books,children,difficult,grown-ups,write,write..."
97,Never tell the truth to people who are not wor...,Mark Twain,/author/Mark-Twain,truth
98,"A person's a person, no matter how small.",Dr. Seuss,/author/Dr-Seuss,inspirational


In [None]:
df['author_link']='https://quotes.toscrape.com' + df['author_id']

In [None]:
df.head()

Unnamed: 0,quote,author,author_id,tags,author_link
0,The world as we have created it is a process o...,Albert Einstein,/author/Albert-Einstein,"change,deep-thoughts,thinking,world",https://quotes.toscrape.com/author/Albert-Eins...
1,"It is our choices, Harry, that show what we tr...",J.K. Rowling,/author/J-K-Rowling,"abilities,choices",https://quotes.toscrape.com/author/J-K-Rowling
2,There are only two ways to live your life. One...,Albert Einstein,/author/Albert-Einstein,"inspirational,life,live,miracle,miracles",https://quotes.toscrape.com/author/Albert-Eins...
3,"The person, be it gentleman or lady, who has n...",Jane Austen,/author/Jane-Austen,"aliteracy,books,classic,humor",https://quotes.toscrape.com/author/Jane-Austen
4,"Imperfection is beauty, madness is genius and ...",Marilyn Monroe,/author/Marilyn-Monroe,"be-yourself,inspirational",https://quotes.toscrape.com/author/Marilyn-Monroe


In [None]:
df.to_csv('quotes.csv', index=False)

**Book Scraper | Scraping Books Data from the Home Page**

In [52]:
import requests
from bs4 import BeautifulSoup
link='https://books.toscrape.com/catalogue/page-1.html'

In [53]:
res=requests.get(link)
soup=BeautifulSoup(res.text,'html.parser')

In [54]:
soup.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')

[<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod">
 <div class="image_container">
 <a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
 <div class="product_price">
 <p class="price_color">Â£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>
 </li>,
 <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod">
 <div class="image_container">
 <a 

In [55]:
data=[]

for sp in soup.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3'):
  # The last <a> in each item is the title link; it carries both href and title
  title_tag= sp.find_all('a')[-1]
  book_link= 'https://books.toscrape.com/catalogue/' + title_tag.get('href')
  # src is relative ('../media/...'), so the '../' stays in the joined URL
  img_link= 'https://books.toscrape.com/' + sp.find('img').get('src')
  title= title_tag.get('title')
  # The first <p> has class "star-rating Three"; the last class is the rating word
  ratings= sp.find('p').get('class')[-1]
  # The text reads 'Â£51.77' in this response; [1:] drops the stray 'Â'
  price= sp.find('p', class_='price_color').text[1:]
  availability= sp.find('p', class_='instock availability').text.strip()

  data.append([title, book_link, img_link, ratings, price, availability])

In [56]:
data[0]

['A Light in the Attic',
 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'https://books.toscrape.com/../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg',
 'Three',
 '£51.77',
 'In stock']
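The `img_link` above keeps the `../` from the `src` attribute because the two strings are simply concatenated. One way to resolve relative paths is the standard library's `urljoin`, sketched here with the values from the first book; the variable names are chosen for this example.

```python
from urllib.parse import urljoin

page_url = "https://books.toscrape.com/catalogue/page-1.html"

# Relative values as they appear in the href and src attributes on that page.
book_href = "a-light-in-the-attic_1000/index.html"
img_src = "../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"

# urljoin resolves each value against the URL of the page it was found on.
book_link = urljoin(page_url, book_href)
img_link = urljoin(page_url, img_src)

print(book_link)
print(img_link)
```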

In [57]:
len(data)

20
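The rating is stored as a word (`'Three'`) and the price as a string (`'£51.77'`). If numeric values are needed later, a conversion along these lines may help. `RATING_WORDS`, `rating_to_int`, and `price_to_float` are hypothetical names, and the mapping assumes ratings run from One to Five.

```python
import re

# Assumed set of rating words used in the star-rating class.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_int(word):
    """Convert a rating word such as 'Three' to the integer 3."""
    return RATING_WORDS[word]


def price_to_float(text):
    """Pull the number out of a price string, ignoring any currency symbols."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())


print(rating_to_int("Three"), price_to_float("£51.77"))
```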

**Scraping Books Data from Multiple Pages**

In [58]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm
link = 'https://books.toscrape.com/catalogue/page-1.html'

In [59]:
print('https://books.toscrape.com/catalogue/page-'+str(2)+'.html')

https://books.toscrape.com/catalogue/page-2.html


In [60]:
for i in range(1,51):
  print('https://books.toscrape.com/catalogue/page-'+str(i)+'.html')

https://books.toscrape.com/catalogue/page-1.html
https://books.toscrape.com/catalogue/page-2.html
https://books.toscrape.com/catalogue/page-3.html
https://books.toscrape.com/catalogue/page-4.html
https://books.toscrape.com/catalogue/page-5.html
https://books.toscrape.com/catalogue/page-6.html
https://books.toscrape.com/catalogue/page-7.html
https://books.toscrape.com/catalogue/page-8.html
https://books.toscrape.com/catalogue/page-9.html
https://books.toscrape.com/catalogue/page-10.html
https://books.toscrape.com/catalogue/page-11.html
https://books.toscrape.com/catalogue/page-12.html
https://books.toscrape.com/catalogue/page-13.html
https://books.toscrape.com/catalogue/page-14.html
https://books.toscrape.com/catalogue/page-15.html
https://books.toscrape.com/catalogue/page-16.html
https://books.toscrape.com/catalogue/page-17.html
https://books.toscrape.com/catalogue/page-18.html
https://books.toscrape.com/catalogue/page-19.html
https://books.toscrape.com/catalogue/page-20.html
https://b

In [61]:
data=[]
for i in tqdm(range(1,51)):
  link= 'https://books.toscrape.com/catalogue/page-'+str(i)+'.html'
  res=requests.get(link)
  soup=BeautifulSoup(res.text,'html.parser')
  for sp in soup.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3'):
    book_link= ('https://books.toscrape.com/catalogue/'+sp.find_all('a')[-1].get('href'))
    img_link= 'https://books.toscrape.com/'+sp.find('img').get('src')
    title= (sp.find_all('a')[-1].get('title'))
    ratings= sp.find('p').get('class')[-1]
    price= sp.find('p', class_='price_color').text[1:]
    availability= sp.find('p', class_='instock availability').text.strip()

  # This append runs once per page, after the inner loop, so only the last book
  # of each page is stored (50 rows). Indent it into the inner loop to keep every book.
  data.append([title, ratings,price,availability, book_link, img_link])

100%|██████████| 50/50 [00:17<00:00,  2.93it/s]


In [62]:
len(data)

50
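Whether `data.append` sits inside or outside the inner `for` loop decides how many rows are collected: inside, there is one row per book; outside, one row per page. The sketch below shows the per-book variant as a function that parses one listing page. `parse_books` is a hypothetical name, and the selectors assume the `product_pod` markup printed earlier.

```python
from bs4 import BeautifulSoup


def parse_books(html):
    """Return one [title, rating, price, availability] row per book on a page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for pod in soup.find_all("article", class_="product_pod"):
        title = pod.find("h3").find("a").get("title")
        rating = pod.find("p", class_="star-rating").get("class")[-1]
        price = pod.find("p", class_="price_color").text
        availability = pod.find("p", class_="availability").text.strip()
        # The append is inside the loop, so every book contributes a row.
        rows.append([title, rating, price, availability])
    return rows
```

Calling `data.extend(parse_books(res.text))` once per page would then collect every book on every page.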

In [63]:
df=pd.DataFrame(data, columns=['title','ratings','price','availability','book_link','img_link'])

In [64]:
df.head()

Unnamed: 0,title,ratings,price,availability,book_link,img_link
0,It's Only the Himalayas,Two,£45.17,In stock,https://books.toscrape.com/catalogue/its-only-...,https://books.toscrape.com/../media/cache/27/a...
1,You can't bury them all: Poems,Two,£33.63,In stock,https://books.toscrape.com/catalogue/you-cant-...,https://books.toscrape.com/../media/cache/e9/2...
2,The Natural History of Us (The Fine Art of Pre...,Three,£45.22,In stock,https://books.toscrape.com/catalogue/the-natur...,https://books.toscrape.com/../media/cache/5d/7...
3,"Rat Queens, Vol. 3: Demons (Rat Queens (Collec...",Three,£50.40,In stock,https://books.toscrape.com/catalogue/rat-queen...,https://books.toscrape.com/../media/cache/f3/e...
4,In the Country We Love: My Family Divided,Four,£22.00,In stock,https://books.toscrape.com/catalogue/in-the-co...,https://books.toscrape.com/../media/cache/fe/e...


In [65]:
df.isnull().sum()

Unnamed: 0,0
title,0
ratings,0
price,0
availability,0
book_link,0
img_link,0


In [66]:
df.to_csv('books.csv', index=False)

**Individual Page Scraper**

In [67]:
import requests
import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup


In [68]:
df= pd.read_csv('books.csv')
df.head()

Unnamed: 0,title,ratings,price,availability,book_link,img_link
0,It's Only the Himalayas,Two,£45.17,In stock,https://books.toscrape.com/catalogue/its-only-...,https://books.toscrape.com/../media/cache/27/a...
1,You can't bury them all: Poems,Two,£33.63,In stock,https://books.toscrape.com/catalogue/you-cant-...,https://books.toscrape.com/../media/cache/e9/2...
2,The Natural History of Us (The Fine Art of Pre...,Three,£45.22,In stock,https://books.toscrape.com/catalogue/the-natur...,https://books.toscrape.com/../media/cache/5d/7...
3,"Rat Queens, Vol. 3: Demons (Rat Queens (Collec...",Three,£50.40,In stock,https://books.toscrape.com/catalogue/rat-queen...,https://books.toscrape.com/../media/cache/f3/e...
4,In the Country We Love: My Family Divided,Four,£22.00,In stock,https://books.toscrape.com/catalogue/in-the-co...,https://books.toscrape.com/../media/cache/fe/e...


**Approach-1**

In [69]:
data=[]
for link in tqdm(df['book_link']):
  res=requests.get(link)
  soup=BeautifulSoup(res.text,'html.parser')
  # The third breadcrumb link is the book's category
  typ= (soup.find('ul',class_='breadcrumb').find_all('a')[2].text)
  # Each value is read from the product table by position;
  # [2:] drops the leading 'Â£' from the price cells
  upc= (soup.find('table', class_='table table-striped').find_all('td')[0].text)
  price_x= (soup.find('table', class_='table table-striped').find_all('td')[2].text[2:])
  price_i= (soup.find('table', class_='table table-striped').find_all('td')[3].text[2:])
  tax= (soup.find('table', class_='table table-striped').find_all('td')[4].text[2:])
  avail= (soup.find('table', class_='table table-striped').find_all('td')[5].text)
  reviews= (soup.find('table', class_='table table-striped').find_all('td')[6].text)
  data.append([typ,upc,price_x,price_i,tax,avail,reviews])

100%|██████████| 50/50 [00:12<00:00,  4.01it/s]


In [70]:
typ

'Travel'

In [71]:
upc

'228ba5e7577e1d49'

In [72]:
price_x

'26.08'

In [73]:
price_i

'26.08'

In [74]:
tax

'0.00'

In [75]:
avail

'In stock (1 available)'

In [76]:
reviews

'0'

**Approach-2**

In [77]:
data=[]
for link in tqdm(df['book_link']):
  res=requests.get(link)
  soup=BeautifulSoup(res.text,'html.parser')
  typ= (soup.find('ul',class_='breadcrumb').find_all('a')[2].text)
  temp=soup.find('table', class_='table table-striped').find_all('td')
  upc= temp[0].text
  price_x=temp[2].text[2:]
  price_i=temp[3].text[2:]
  tax=temp[4].text[2:]
  avail=temp[5].text
  reviews=temp[6].text
  data.append([typ,upc,price_x,price_i,tax,avail,reviews])

100%|██████████| 50/50 [00:12<00:00,  4.00it/s]
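Both approaches read the table cells by position (`temp[0]`, `temp[2]`, and so on), which breaks silently if a row is added or reordered. A sketch of reading the values by label instead is shown below. It assumes each table row pairs a `<th>` label with a `<td>` value; `table_to_dict` is a hypothetical name.

```python
from bs4 import BeautifulSoup


def table_to_dict(html):
    """Map each row's <th> label to its <td> value in the product table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="table-striped")
    info = {}
    for row in table.find_all("tr"):
        label = row.find("th").text.strip()
        value = row.find("td").text.strip()
        info[label] = value
    return info
```

The values can then be looked up by name, for example `info['UPC']`, provided the page uses that label.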


In [80]:
data=[]
for link in tqdm(df['book_link']):
  res=requests.get(link)
  soup=BeautifulSoup(res.text,'html.parser')
  typ= (soup.find('ul',class_='breadcrumb').find_all('a')[2].text)
  temp=soup.find('table', class_='table table-striped').find_all('td')
  upc= temp[0].text
  price_x=temp[2].text[2:]
  price_i=temp[3].text[2:]
  tax=temp[4].text[2:]
  avail=temp[5].text  # e.g. 'In stock (19 available)'
  reviews=temp[6].text
  # avail goes last so that it lines up with the 'quantity' column below
  data.append([typ,upc,price_x,price_i,tax,reviews, avail])

df= pd.DataFrame(data, columns=['Category','upc','Price_e_tax','Price_i_tax','tax','reviews','quantity'])
df.head()

100%|██████████| 50/50 [00:12<00:00,  3.88it/s]


Unnamed: 0,Category,upc,Price_e_tax,Price_i_tax,tax,reviews,quantity
0,Travel,a22124811bfa8350,45.17,45.17,0.0,0,In stock (19 available)
1,Poetry,55f9da0c5eea2e10,33.63,33.63,0.0,0,In stock (17 available)
2,Young Adult,cedf82b5086e4691,45.22,45.22,0.0,0,In stock (16 available)
3,Sequential Art,c82a3e358c773c73,50.4,50.4,0.0,0,In stock (16 available)
4,Nonfiction,b136b1b180ca753a,22.0,22.0,0.0,0,In stock (16 available)
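The `quantity` column holds text such as `'In stock (19 available)'`. If the count itself is needed, a regular expression can extract it, as in this sketch; `stock_count` is a hypothetical helper name.

```python
import re


def stock_count(text):
    """Return the number in 'In stock (19 available)', or 0 if none is found."""
    match = re.search(r"\((\d+) available\)", text)
    return int(match.group(1)) if match else 0


print(stock_count("In stock (19 available)"))
```

It could be applied to the whole column with `df['quantity'].apply(stock_count)`.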


In [81]:
df.to_csv('data.csv', index=False)

**Data Combining**

In [82]:
import pandas as pd
df_1=pd.read_csv('books.csv')
df_2=pd.read_csv('data.csv')

In [83]:
df_2.head()

Unnamed: 0,Category,upc,Price_e_tax,Price_i_tax,tax,reviews,quantity
0,Travel,a22124811bfa8350,45.17,45.17,0.0,0,In stock (19 available)
1,Poetry,55f9da0c5eea2e10,33.63,33.63,0.0,0,In stock (17 available)
2,Young Adult,cedf82b5086e4691,45.22,45.22,0.0,0,In stock (16 available)
3,Sequential Art,c82a3e358c773c73,50.4,50.4,0.0,0,In stock (16 available)
4,Nonfiction,b136b1b180ca753a,22.0,22.0,0.0,0,In stock (16 available)


In [84]:
df_1.head()

Unnamed: 0,title,ratings,price,availability,book_link,img_link
0,It's Only the Himalayas,Two,£45.17,In stock,https://books.toscrape.com/catalogue/its-only-...,https://books.toscrape.com/../media/cache/27/a...
1,You can't bury them all: Poems,Two,£33.63,In stock,https://books.toscrape.com/catalogue/you-cant-...,https://books.toscrape.com/../media/cache/e9/2...
2,The Natural History of Us (The Fine Art of Pre...,Three,£45.22,In stock,https://books.toscrape.com/catalogue/the-natur...,https://books.toscrape.com/../media/cache/5d/7...
3,"Rat Queens, Vol. 3: Demons (Rat Queens (Collec...",Three,£50.40,In stock,https://books.toscrape.com/catalogue/rat-queen...,https://books.toscrape.com/../media/cache/f3/e...
4,In the Country We Love: My Family Divided,Four,£22.00,In stock,https://books.toscrape.com/catalogue/in-the-co...,https://books.toscrape.com/../media/cache/fe/e...


**1) Creating a New DataFrame**

In [85]:
df= pd.DataFrame()

In [90]:
# Columns are added in the order they should appear in the combined table
df['title']= df_1['title']
df['upc']= df_2['upc']
df['Category']= df_2['Category']
df['price']= df_1['price']
df['Price_e_tax']= df_2['Price_e_tax']
df['Price_i_tax']= df_2['Price_i_tax']
df['tax']= df_2['tax']
df['ratings']= df_1['ratings']
df['reviews']= df_2['reviews']
df['quantity']= df_2['quantity']
df['availability']= df_1['availability']
df['book_link']= df_1['book_link']
df['img_link']= df_1['img_link']





In [91]:
df.head()

Unnamed: 0,title,upc,Category,price,Price_e_tax,Price_i_tax,tax,ratings,reviews,quantity,availability,book_link,img_link
0,It's Only the Himalayas,a22124811bfa8350,Travel,£45.17,45.17,45.17,0.0,Two,0,In stock (19 available),In stock,https://books.toscrape.com/catalogue/its-only-...,https://books.toscrape.com/../media/cache/27/a...
1,You can't bury them all: Poems,55f9da0c5eea2e10,Poetry,£33.63,33.63,33.63,0.0,Two,0,In stock (17 available),In stock,https://books.toscrape.com/catalogue/you-cant-...,https://books.toscrape.com/../media/cache/e9/2...
2,The Natural History of Us (The Fine Art of Pre...,cedf82b5086e4691,Young Adult,£45.22,45.22,45.22,0.0,Three,0,In stock (16 available),In stock,https://books.toscrape.com/catalogue/the-natur...,https://books.toscrape.com/../media/cache/5d/7...
3,"Rat Queens, Vol. 3: Demons (Rat Queens (Collec...",c82a3e358c773c73,Sequential Art,£50.40,50.4,50.4,0.0,Three,0,In stock (16 available),In stock,https://books.toscrape.com/catalogue/rat-queen...,https://books.toscrape.com/../media/cache/f3/e...
4,In the Country We Love: My Family Divided,b136b1b180ca753a,Nonfiction,£22.00,22.0,22.0,0.0,Four,0,In stock (16 available),In stock,https://books.toscrape.com/catalogue/in-the-co...,https://books.toscrape.com/../media/cache/fe/e...


In [92]:
df.to_csv('combined_data.csv', index=False)
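Copying columns one at a time works because both frames share the same row index. Under the same assumption, `pd.concat` with `axis=1` is a shorter way to place two frames side by side. The sketch below uses two small stand-in frames with made-up values rather than `df_1` and `df_2`.

```python
import pandas as pd

# Small stand-in frames with made-up values, used only for illustration.
left = pd.DataFrame({"title": ["Book A", "Book B"], "price": ["£1.00", "£2.00"]})
right = pd.DataFrame({"upc": ["aaa", "bbb"], "tax": [0.0, 0.0]})

# axis=1 places the frames side by side, matching rows by index label.
combined = pd.concat([left, right], axis=1)

print(combined.columns.tolist())
```

The column order of the result follows the order of the frames in the list, so a final reordering step may still be wanted.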