# Web scraping and storing into a JSON file

- Scrape book information from a website that consists of 50 pages, each containing the info of 20 books
- First, scrape the book info from a single page and store it in a JSON file
- Then, scrape the book info from all 50 pages and store it in a JSON file

In [1]:
import requests
from bs4 import BeautifulSoup
from pprint import pprint
import csv
import json

**Snippet for testing the code**

```python
import json

with open('total_book_info.json', 'w') as f:
    json.dump(results, f)

with open('total_book_info.json') as f:
    total_book=json.load(f)
    print(json.dumps(total_book, indent=2))
```
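The snippet above depends on `results`, which only exists once the scraper has run. As a minimal sketch of the same write-then-read check that can be run on its own, the block below uses two hand-written sample records (hypothetical data in the same shape as the scraped books) and a temporary file, so nothing is left on disk. The file name `sample_book_info.json` is an assumption made for illustration.

```python
import json
import os
import tempfile

# Hypothetical sample records in the same shape the scraper produces.
sample = [
    {"Title": "Book A", "Price": 47.82, "Stock": "In stock"},
    {"Title": "Book B", "Price": 50.1, "Stock": "In stock"},
]

# Write to a temporary directory so the check leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_book_info.json")

    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=2)

    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# The data read back should equal the data written.
round_trip_ok = loaded == sample
print(round_trip_ok)
```

Comparing `loaded` with the original list is a quick way to confirm that nothing was lost or altered when the data was written to JSON.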

## Collect data from a single webpage

In [2]:
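url = 'https://books.toscrape.com/'


def clean(item):
    """Strip whitespace, the pound sign, and the stray 'Â' that can precede it."""
    return item.strip().replace('£', '').replace('Â', '')


def get_page(url):
    """Download a page and return it as a parsed BeautifulSoup tree."""
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    return soup


def parse_soup(soup):
    """Return one dict (Title, Price, Stock) for each book listed on the page."""
    books_list = []
    # Each book sits in its own <article class="product_pod"> element
    books = soup.find_all('article', class_='product_pod')

    for book in books:
        book_info = {}

        title = book.h3.a.text  # visible link text; long titles are abbreviated
        price = float(clean(book.find('p', class_='price_color').text))
        stock = clean(book.find('p', class_='instock availability').text)

        book_info['Title'] = title
        book_info['Price'] = price
        book_info['Stock'] = stock

        books_list.append(book_info)
    return books_list


soup = get_page(url)
book_list = parse_soup(soup)

pprint(book_list)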

[{'Price': 51.77, 'Stock': 'In stock', 'Title': 'A Light in the ...'},
 {'Price': 53.74, 'Stock': 'In stock', 'Title': 'Tipping the Velvet'},
 {'Price': 50.1, 'Stock': 'In stock', 'Title': 'Soumission'},
 {'Price': 47.82, 'Stock': 'In stock', 'Title': 'Sharp Objects'},
 {'Price': 54.23, 'Stock': 'In stock', 'Title': 'Sapiens: A Brief History ...'},
 {'Price': 22.65, 'Stock': 'In stock', 'Title': 'The Requiem Red'},
 {'Price': 33.34, 'Stock': 'In stock', 'Title': 'The Dirty Little Secrets ...'},
 {'Price': 17.93, 'Stock': 'In stock', 'Title': 'The Coming Woman: A ...'},
 {'Price': 22.6, 'Stock': 'In stock', 'Title': 'The Boys in the ...'},
 {'Price': 52.15, 'Stock': 'In stock', 'Title': 'The Black Maria'},
 {'Price': 13.99,
  'Stock': 'In stock',
  'Title': 'Starving Hearts (Triangular Trade ...'},
 {'Price': 20.66, 'Stock': 'In stock', 'Title': "Shakespeare's Sonnets"},
 {'Price': 17.46, 'Stock': 'In stock', 'Title': 'Set Me Free'},
 {'Price': 52.29,
  'Stock': 'In stock',
  'Title': "
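The `clean` helper removes a stray `Â` as well as `£`. A likely cause, stated here as an assumption rather than something verified against this site, is that the page is UTF-8 but its text was decoded as Latin-1, which is the fallback `requests` uses for `text/*` responses whose header names no charset. The sketch below reproduces the effect with plain strings and no network access.

```python
# The pound sign takes two bytes in UTF-8: 0xC2 0xA3.
price_text = "£51.77"
raw_bytes = price_text.encode("utf-8")

# Decoding those bytes as Latin-1 turns each byte into its own character,
# so 0xC2 becomes 'Â' and 0xA3 becomes '£'.
wrong = raw_bytes.decode("latin-1")

# Decoding with the correct codec gives back the original text.
right = raw_bytes.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

If this is indeed the cause, setting `r.encoding = 'utf-8'` before reading `r.text`, or passing `r.content` to `BeautifulSoup` so it can detect the encoding itself, should make the `.replace('Â', '')` step unnecessary.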

**Store the data in a JSON file**

In [3]:
with open('book_info.json', 'w') as f:
    json.dump(book_list, f)

**Open the JSON file and confirm the data**

In [4]:
with open('book_info.json', 'r') as f:
    data=json.load(f)
    
    print(json.dumps(data, indent=2))

[
  {
    "Title": "A Light in the ...",
    "Price": 51.77,
    "Stock": "In stock"
  },
  {
    "Title": "Tipping the Velvet",
    "Price": 53.74,
    "Stock": "In stock"
  },
  {
    "Title": "Soumission",
    "Price": 50.1,
    "Stock": "In stock"
  },
  {
    "Title": "Sharp Objects",
    "Price": 47.82,
    "Stock": "In stock"
  },
  {
    "Title": "Sapiens: A Brief History ...",
    "Price": 54.23,
    "Stock": "In stock"
  },
  {
    "Title": "The Requiem Red",
    "Price": 22.65,
    "Stock": "In stock"
  },
  {
    "Title": "The Dirty Little Secrets ...",
    "Price": 33.34,
    "Stock": "In stock"
  },
  {
    "Title": "The Coming Woman: A ...",
    "Price": 17.93,
    "Stock": "In stock"
  },
  {
    "Title": "The Boys in the ...",
    "Price": 22.6,
    "Stock": "In stock"
  },
  {
    "Title": "The Black Maria",
    "Price": 52.15,
    "Stock": "In stock"
  },
  {
    "Title": "Starving Hearts (Triangular Trade ...",
    "Price": 13.99,
    "Stock": "In stock"
  },
  {
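Several stored titles end in `...` because `book.h3.a.text` returns the visible link text, which the page abbreviates. On pages built like this one the full title is usually kept in the link's `title` attribute; treat that as an assumption to check against the page source. The sketch below runs on a small hand-written HTML fragment (hypothetical, modelled on the `product_pod` structure used above) and uses the built-in `html.parser` so it does not need `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics one book entry on the page.
SAMPLE_HTML = """
<article class="product_pod">
  <h3><a href="example-book/index.html"
         title="An Example Book With a Very Long Title">An Example Book With ...</a></h3>
  <p class="price_color">£12.34</p>
  <p class="instock availability">
    In stock
  </p>
</article>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
book = soup.find("article", class_="product_pod")
link = book.h3.a

short_title = link.text                      # abbreviated text shown on the page
full_title = link.get("title", short_title)  # fall back to the text if no attribute

print(short_title)
print(full_title)
```

Using `link.get("title", short_title)` instead of `link["title"]` keeps the parser from raising a `KeyError` on entries that lack the attribute.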
  

<h2> Collecting all the data from 50 webpages </h2>

<h3> Analyse the web URL </h3>

- When moving to the next page, only the number in the URL changes, so we can put that number into a variable

https://books.toscrape.com/catalogue/page-1.html <br>
https://books.toscrape.com/catalogue/page-2.html <br>
... <br>
https://books.toscrape.com/catalogue/page-50.html
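Because only the page number changes, the full list of addresses can be generated up front. This is a minimal sketch; the names `BASE_URL`, `TOTAL_PAGES` and `page_urls` are chosen for illustration, and the count of 50 pages comes from the description at the top of this notebook.

```python
# URL template with a placeholder for the page number.
BASE_URL = "https://books.toscrape.com/catalogue/page-{}.html"
TOTAL_PAGES = 50

# range(1, TOTAL_PAGES + 1) yields 1..50, so no "+ 1" is needed later.
page_urls = [BASE_URL.format(n) for n in range(1, TOTAL_PAGES + 1)]

print(page_urls[0])
print(page_urls[-1])
print(len(page_urls))
```

Starting the range at 1 avoids the off-by-one adjustment that is needed when the loop counter starts at 0.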

In [5]:
# Only the page number in the URL changes between pages:
# url = f'https://books.toscrape.com/catalogue/page-{i}.html'

import time


page_num = 3  # Control the number of webpages from which we will collect data


def clean(item):
    return item.strip().replace('£', '').replace('Â', '')


def get_page(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    return soup


def parse_soup(soup):
    books_list = []
    books = soup.find_all('article', class_='product_pod')

    for book in books:
        book_info = {}

        title = book.h3.a.text
        price = float(clean(book.find('p', class_='price_color').text))
        stock = clean(book.find('p', class_='instock availability').text)

        book_info['Title'] = title
        book_info['Price'] = price
        book_info['Stock'] = stock
        print(book_info)

        books_list.append(book_info)
    return books_list


def retrieve_all(page_num):
    total_book_list = []
    for i in range(page_num):
        # i starts at 0, so we should add 1 to it
        url = f'https://books.toscrape.com/catalogue/page-{i+1}.html'
        print(url)

        soup = get_page(url)

        # append keeps one list per page, so the result is a list of lists
        total_book_list.append(parse_soup(soup))

        time.sleep(0.5)  # pause briefly between requests

    return total_book_list


results = retrieve_all(page_num)

https://books.toscrape.com/catalogue/page-1.html
{'Title': 'A Light in the ...', 'Price': 51.77, 'Stock': 'In stock'}
{'Title': 'Tipping the Velvet', 'Price': 53.74, 'Stock': 'In stock'}
{'Title': 'Soumission', 'Price': 50.1, 'Stock': 'In stock'}
{'Title': 'Sharp Objects', 'Price': 47.82, 'Stock': 'In stock'}
{'Title': 'Sapiens: A Brief History ...', 'Price': 54.23, 'Stock': 'In stock'}
{'Title': 'The Requiem Red', 'Price': 22.65, 'Stock': 'In stock'}
{'Title': 'The Dirty Little Secrets ...', 'Price': 33.34, 'Stock': 'In stock'}
{'Title': 'The Coming Woman: A ...', 'Price': 17.93, 'Stock': 'In stock'}
{'Title': 'The Boys in the ...', 'Price': 22.6, 'Stock': 'In stock'}
{'Title': 'The Black Maria', 'Price': 52.15, 'Stock': 'In stock'}
{'Title': 'Starving Hearts (Triangular Trade ...', 'Price': 13.99, 'Stock': 'In stock'}
{'Title': "Shakespeare's Sonnets", 'Price': 20.66, 'Stock': 'In stock'}
{'Title': 'Set Me Free', 'Price': 17.46, 'Stock': 'In stock'}
{'Title': "Scott Pilgrim's Preciou

In [6]:
import json

with open('total_book_info.json', 'w') as f:
    json.dump(results, f)

In [7]:
with open('total_book_info.json') as f:
    total_book=json.load(f)
    
    print(json.dumps(total_book, indent=2))

[
  [
    {
      "Title": "A Light in the ...",
      "Price": 51.77,
      "Stock": "In stock"
    },
    {
      "Title": "Tipping the Velvet",
      "Price": 53.74,
      "Stock": "In stock"
    },
    {
      "Title": "Soumission",
      "Price": 50.1,
      "Stock": "In stock"
    },
    {
      "Title": "Sharp Objects",
      "Price": 47.82,
      "Stock": "In stock"
    },
    {
      "Title": "Sapiens: A Brief History ...",
      "Price": 54.23,
      "Stock": "In stock"
    },
    {
      "Title": "The Requiem Red",
      "Price": 22.65,
      "Stock": "In stock"
    },
    {
      "Title": "The Dirty Little Secrets ...",
      "Price": 33.34,
      "Stock": "In stock"
    },
    {
      "Title": "The Coming Woman: A ...",
      "Price": 17.93,
      "Stock": "In stock"
    },
    {
      "Title": "The Boys in the ...",
      "Price": 22.6,
      "Stock": "In stock"
    },
    {
      "Title": "The Black Maria",
      "Price": 52.15,
      "Stock": "In stock"
    },
    {
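The saved file holds a list of lists, one inner list per page, because each page's result is added with `append`. If a single flat list of books is easier to work with, the pages can be joined after loading. The sketch below uses hand-written sample data (hypothetical, in the same shape as `total_book`) rather than the real file.

```python
from itertools import chain

# Hypothetical data shaped like the loaded JSON: one inner list per page.
nested = [
    [
        {"Title": "Book A", "Price": 10.0, "Stock": "In stock"},
        {"Title": "Book B", "Price": 12.5, "Stock": "In stock"},
    ],
    [
        {"Title": "Book C", "Price": 8.0, "Stock": "In stock"},
    ],
]

# chain.from_iterable walks the inner lists in order and yields each book.
flat = list(chain.from_iterable(nested))

print(len(flat))
print([book["Title"] for book in flat])
```

An alternative is to call `extend` instead of `append` while collecting, which builds the flat list directly at the cost of losing the per-page grouping.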
   