# Web scraping and storing into a CSV file

- Scrape book information from a website that consists of 50 pages, each containing information on 20 books
- First, scrape the book information from a single page and store it in a CSV file
- Then, scrape the book information from all 50 pages

In [1]:
import requests
from bs4 import BeautifulSoup
from pprint import pprint
import csv

**Snippet for testing the code**

```python
f=open('books_info.csv', 'w')
writer=csv.writer(f)
writer.writerow(['Title', 'Price', 'Stock'])
f.close()

with open('books_info.csv', 'r') as f:
    reader=csv.reader(f)
    print(next(reader))
```

## Collect data from a single webpage

**Opening a csv file**

In [2]:
f=open('books_info.csv', 'w', newline='')
writer=csv.writer(f)
writer.writerow(['Title', 'Price', 'Stock'])

19
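
The `19` shown above appears to be the return value of `writer.writerow`, which passes through whatever the underlying file's `write` method returns, here the number of characters written. A minimal sketch using an in-memory buffer (an assumption for illustration; the notebook writes to a real file):

```python
import csv
import io

# In-memory stand-in for the CSV file, used only for illustration
buffer = io.StringIO()
writer = csv.writer(buffer)

# writerow returns what buffer.write returns: the number of characters written
written = writer.writerow(['Title', 'Price', 'Stock'])

print(written)                  # 19
print(repr(buffer.getvalue()))  # 'Title,Price,Stock\r\n'
```

The 19 characters are the 17 characters of `Title,Price,Stock` plus the `\r\n` line terminator that `csv.writer` uses by default.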

In [3]:
url = 'https://books.toscrape.com/'
def clean (item):
    return item.strip().replace('£', '').replace('Â', '')

def get_page(url):
    r=requests.get(url)
    soup=BeautifulSoup(r.text, 'lxml')
    return soup

def parse_soup(soup):
    book_list=[]
    books=soup.find_all('article', class_='product_pod')

    for book in books:
        title=book.h3.a.text
        price=float(clean(book.find('p', class_='price_color').text))
        stock=clean(book.find('p', class_='instock availability').text)

        # Keep the record in memory and write it to the open CSV file
        book_list.append((title, price, stock))
        writer.writerow([title, price, stock])

    return book_list


soup=get_page(url)
book_list=parse_soup(soup)
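
The `replace('Â', '')` in `clean` suggests that the page text was decoded with the wrong character set: a UTF-8 `£` read as Latin-1 becomes `Â£`. The sketch below reproduces that effect offline with a hard-coded byte string (an assumption standing in for the response body) and shows that decoding as UTF-8 avoids the stray character.

```python
# Bytes as a server might send them: the price encoded as UTF-8 (illustrative value)
raw = '£51.77'.encode('utf-8')

# Decoding with the wrong charset produces the stray 'Â' that the notebook strips out
wrong = raw.decode('latin-1')
# Decoding with the right charset keeps the text intact
right = raw.decode('utf-8')

print(wrong)  # Â£51.77
print(right)  # £51.77

def clean(item):
    # With correct decoding, only the currency sign needs removing
    return item.strip().replace('£', '')

print(float(clean(right)))  # 51.77
```

With `requests`, one way to get the same result is to set `r.encoding = 'utf-8'` before reading `r.text`, or to pass `r.content` to `BeautifulSoup`. Whether this is needed depends on the headers the server sends.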

**Closing the csv file**

In [4]:
f.close()

**Confirming the stored data in the csv file**

In [5]:
with open ('books_info.csv', 'r') as f:
    reader=csv.reader(f)
    for line in reader:
        print(line)

['Title', 'Price', 'Stock']
['A Light in the ...', '51.77', 'In stock']
['Tipping the Velvet', '53.74', 'In stock']
['Soumission', '50.1', 'In stock']
['Sharp Objects', '47.82', 'In stock']
['Sapiens: A Brief History ...', '54.23', 'In stock']
['The Requiem Red', '22.65', 'In stock']
['The Dirty Little Secrets ...', '33.34', 'In stock']
['The Coming Woman: A ...', '17.93', 'In stock']
['The Boys in the ...', '22.6', 'In stock']
['The Black Maria', '52.15', 'In stock']
['Starving Hearts (Triangular Trade ...', '13.99', 'In stock']
["Shakespeare's Sonnets", '20.66', 'In stock']
['Set Me Free', '17.46', 'In stock']
["Scott Pilgrim's Precious Little ...", '52.29', 'In stock']
['Rip it Up and ...', '35.02', 'In stock']
['Our Band Could Be ...', '57.25', 'In stock']
['Olio', '23.88', 'In stock']
['Mesaerion: The Best Science ...', '37.59', 'In stock']
['Libertarianism for Beginners', '51.33', 'In stock']
["It's Only the Himalayas", '45.17', 'In stock']


<h2> Collecting all the data from 50 webpages </h2>

<h3> Analyse the web URL </h3>

- When moving to the next page, only the page number in the URL changes, so we can put that number into a variable

https://books.toscrape.com/catalogue/page-1.html <br>
https://books.toscrape.com/catalogue/page-2.html <br>
... <br>
https://books.toscrape.com/catalogue/page-50.html
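
The URL pattern above can be turned into a list of page URLs. This is a minimal sketch; `BASE_URL` and `page_urls` are hypothetical names introduced here for illustration.

```python
BASE_URL = 'https://books.toscrape.com/catalogue/page-{}.html'

def page_urls(last_page):
    """Return the catalogue URLs for pages 1..last_page (hypothetical helper)."""
    return [BASE_URL.format(n) for n in range(1, last_page + 1)]

urls = page_urls(50)
print(urls[0])    # https://books.toscrape.com/catalogue/page-1.html
print(urls[-1])   # https://books.toscrape.com/catalogue/page-50.html
print(len(urls))  # 50
```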

In [6]:
# no pagination in this example

# url = f'https://books.toscrape.com/catalogue/page-{i}.html'

import time


page_num=3 # Control the number of webpages from which we will collect data

def clean(item):
    return item.strip().replace('£', '').replace('Â', '')

def get_page(url):
    r=requests.get(url)
    soup=BeautifulSoup(r.text, 'lxml')
    return soup

def parse_soup(soup):
    book_list=[]
    books=soup.find_all('article', class_='product_pod')
    for book in books:
        title=book.h3.a.text
        price=float(clean(book.find('p', class_='price_color').text))
        stock=clean(book.find('p', class_='instock availability').text)
        book_list.append((title, price, stock))
    return book_list

def retrieve_all(page_num):
    total_book_list=[]
    for i in range(page_num):
        url = f'https://books.toscrape.com/catalogue/page-{i+1}.html' # i starts at 0, so we add 1 to it
        print(url)

        soup=get_page(url)
        total_book_list.append(parse_soup(soup)) # one list of books per page

        time.sleep(0.5) # pause briefly between requests

    return total_book_list

results=retrieve_all(page_num)

https://books.toscrape.com/catalogue/page-1.html
https://books.toscrape.com/catalogue/page-2.html
https://books.toscrape.com/catalogue/page-3.html


**To check whether the information from all 50 pages has been collected (with `page_num=50`), we can use the code below**

```python
pprint(results[49][19])
```
output: ('1,000 Places to See ...', 26.08, 'In stock')
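
Indexing `results[49][19]` only works once all 50 pages have been collected; with `page_num=3` it raises an `IndexError`. A check that works for any number of pages could count pages and books instead. The sketch below uses a hypothetical `summarize` helper and made-up data with the same shape as `results`.

```python
def summarize(results):
    """Return (number of pages, total number of books) for a list of per-page lists."""
    return len(results), sum(len(page) for page in results)

# Made-up data with the same shape as `results`: one list of tuples per page
sample_results = [
    [('Book A', 10.0, 'In stock'), ('Book B', 12.5, 'In stock')],
    [('Book C', 8.99, 'In stock')],
]

pages, books = summarize(sample_results)
print(pages, books)  # 2 3
```

For a full run over 50 pages of 20 books each, the expected result would be `(50, 1000)`.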


<h3> Store the extracted data into csv file </h3>

- This creates one CSV file per page (50 files for the full site). Here only 3 pages were collected

In [7]:
for idx, result in enumerate(results):
    # The with block closes each file, not only the last one
    with open(f'books_info{idx+1}.csv', 'w', newline='') as f:
        writer=csv.writer(f)
        writer.writerow(['Title', 'Price', 'Stock'])

        for (title, price, stock) in result:
            writer.writerow([title, price, stock])
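
If a single file is the end goal, the per-page lists could also be flattened and written in one pass, skipping the intermediate files. This is a sketch under assumptions: `write_books`, the output file name, and the sample data are hypothetical and not part of the notebook.

```python
import csv
import os
import tempfile

def write_books(results, path):
    """Write every (title, price, stock) tuple from all pages into one CSV file."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Title', 'Price', 'Stock'])
        for page in results:
            writer.writerows(page)

# Made-up data with the same shape as `results`
sample_results = [
    [('Book A', 10.0, 'In stock'), ('Book B', 12.5, 'In stock')],
    [('Book C', 8.99, 'In stock')],
]

path = os.path.join(tempfile.mkdtemp(), 'all_books.csv')
write_books(sample_results, path)

# Read the file back to confirm its contents
with open(path, newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))
print(rows)
```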

<h3> Combine all the csv files into a single csv file </h3>

In [8]:
import os
from glob import glob
import pandas as pd

os.listdir('.')

['.ipynb_checkpoints',
 'books.db',
 'books_info.csv',
 'books_info1.csv',
 'books_info2.csv',
 'books_info3.csv',
 'chromedriver.exe',
 'chromedriver_win32.zip',
 'python_beautifulSoup_advanced.ipynb',
 'python_beautifulSoup_basic.ipynb',
 'python_beautifulSoup_SQLite .ipynb',
 'python_BeautifulSoup_storeToCsvFile.ipynb',
 'python_BeautifulSoup_storeToSQLiteDB.html',
 'python_BeautifulSoup_storeToSQLiteDB.ipynb',
 'python_selenium_advanced.ipynb',
 'python_selenium_advanced_2.ipynb',
 'python_selenium_basic.ipynb',
 'result.jpg',
 'result1.jpg',
 'sel.py',
 'total_book_info.csv',
 '__pycache__']

In [9]:
sorted(glob('books_info*.csv'))

['books_info.csv', 'books_info1.csv', 'books_info2.csv', 'books_info3.csv']

In [10]:
df=pd.concat([ pd.read_csv(file) for file in sorted(glob('books_info*.csv'))], axis=0)

In [11]:
df.head()

|   | Title | Price | Stock |
|---|---|---|---|
| 0 | A Light in the ... | 51.77 | In stock |
| 1 | Tipping the Velvet | 53.74 | In stock |
| 2 | Soumission | 50.1 | In stock |
| 3 | Sharp Objects | 47.82 | In stock |
| 4 | Sapiens: A Brief History ... | 54.23 | In stock |


In [12]:
len(df)

80
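
A count of 80 rather than the 60 expected from three pages of 20 books is consistent with `books_info.csv` (the single-page file) also matching `books_info*.csv`, so the first page is counted twice. One possible fix is a narrower pattern that requires a digit after `books_info`. The sketch below creates empty files in a temporary folder (an assumption for illustration) and compares the two patterns.

```python
import os
import tempfile
from glob import glob

# Empty stand-in files in a temporary folder, named like the notebook's output
folder = tempfile.mkdtemp()
for name in ['books_info.csv', 'books_info1.csv', 'books_info2.csv', 'books_info3.csv']:
    open(os.path.join(folder, name), 'w').close()

# The original pattern also matches the single-page file
broad = sorted(os.path.basename(p) for p in glob(os.path.join(folder, 'books_info*.csv')))
# Requiring a digit after 'books_info' leaves only the per-page files
narrow = sorted(os.path.basename(p) for p in glob(os.path.join(folder, 'books_info[0-9]*.csv')))

print(broad)   # ['books_info.csv', 'books_info1.csv', 'books_info2.csv', 'books_info3.csv']
print(narrow)  # ['books_info1.csv', 'books_info2.csv', 'books_info3.csv']
```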

In [13]:
df.to_csv('total_book_info.csv', index=False)

**Confirming the collected data**

In [14]:
with open('total_book_info.csv') as f:
    reader=csv.reader(f)
    
    for line in reader:
        print(line)

['Title', 'Price', 'Stock']
['A Light in the ...', '51.77', 'In stock']
['Tipping the Velvet', '53.74', 'In stock']
['Soumission', '50.1', 'In stock']
['Sharp Objects', '47.82', 'In stock']
['Sapiens: A Brief History ...', '54.23', 'In stock']
['The Requiem Red', '22.65', 'In stock']
['The Dirty Little Secrets ...', '33.34', 'In stock']
['The Coming Woman: A ...', '17.93', 'In stock']
['The Boys in the ...', '22.6', 'In stock']
['The Black Maria', '52.15', 'In stock']
['Starving Hearts (Triangular Trade ...', '13.99', 'In stock']
["Shakespeare's Sonnets", '20.66', 'In stock']
['Set Me Free', '17.46', 'In stock']
["Scott Pilgrim's Precious Little ...", '52.29', 'In stock']
['Rip it Up and ...', '35.02', 'In stock']
['Our Band Could Be ...', '57.25', 'In stock']
['Olio', '23.88', 'In stock']
['Mesaerion: The Best Science ...', '37.59', 'In stock']
['Libertarianism for Beginners', '51.33', 'In stock']
["It's Only the Himalayas", '45.17', 'In stock']
['A Light in the ...', '51.77', 'In sto

<h2> Create a csv file for each webpage while parsing </h2>

In [15]:
# no pagination in this example

# url = f'https://books.toscrape.com/catalogue/page-{i}.html'

import time
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd


page_num=3 # Control the number of webpages from which we will collect data

def clean(item):
    return item.strip().replace('£', '').replace('Â', '')

def get_page(url):
    r=requests.get(url)
    soup=BeautifulSoup(r.text, 'lxml')
    return soup

def parse_soup(soup, page):
    print('parse_soup starts')

    # Write the books of this page to their own CSV file
    with open(f'books_info{page}.csv', 'w', newline='') as f:
        writer=csv.writer(f)
        writer.writerow(['Title', 'Price', 'Stock'])

        books=soup.find_all('article', class_='product_pod')
        for book in books:
            title=book.h3.a.text
            price=float(clean(book.find('p', class_='price_color').text))
            stock=clean(book.find('p', class_='instock availability').text)

            writer.writerow([title, price, stock])

def retrieve_all(page_num):
    for i in range(page_num):
        url = f'https://books.toscrape.com/catalogue/page-{i+1}.html' # i starts at 0, so we add 1 to it
        print(url)

        soup=get_page(url)
        print('get_page finished')

        parse_soup(soup, i+1)
        print('parse_soup finished')

        time.sleep(0.5) # pause briefly between requests

retrieve_all(page_num)

https://books.toscrape.com/catalogue/page-1.html
get_page finished
parse_soup starts
parse_soup finished
https://books.toscrape.com/catalogue/page-2.html
get_page finished
parse_soup starts
parse_soup finished
https://books.toscrape.com/catalogue/page-3.html
get_page finished
parse_soup starts
parse_soup finished


<h3> Combine all the csv files into a single csv file </h3>

In [16]:
import os
from glob import glob
import pandas as pd

os.listdir('.')

['.ipynb_checkpoints',
 'books.db',
 'books_info.csv',
 'books_info1.csv',
 'books_info2.csv',
 'books_info3.csv',
 'chromedriver.exe',
 'chromedriver_win32.zip',
 'python_beautifulSoup_advanced.ipynb',
 'python_beautifulSoup_basic.ipynb',
 'python_beautifulSoup_SQLite .ipynb',
 'python_BeautifulSoup_storeToCsvFile.ipynb',
 'python_BeautifulSoup_storeToSQLiteDB.html',
 'python_BeautifulSoup_storeToSQLiteDB.ipynb',
 'python_selenium_advanced.ipynb',
 'python_selenium_advanced_2.ipynb',
 'python_selenium_basic.ipynb',
 'result.jpg',
 'result1.jpg',
 'sel.py',
 'total_book_info.csv',
 '__pycache__']

In [17]:
sorted(glob('books_info*.csv'))

['books_info.csv', 'books_info1.csv', 'books_info2.csv', 'books_info3.csv']

In [18]:
df=pd.concat([ pd.read_csv(file) for file in sorted(glob('books_info*.csv'))], axis=0)

In [19]:
df.head()

|   | Title | Price | Stock |
|---|---|---|---|
| 0 | A Light in the ... | 51.77 | In stock |
| 1 | Tipping the Velvet | 53.74 | In stock |
| 2 | Soumission | 50.1 | In stock |
| 3 | Sharp Objects | 47.82 | In stock |
| 4 | Sapiens: A Brief History ... | 54.23 | In stock |


In [20]:
df.to_csv('total_book_info.csv', index=False)
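
The same merge can be sketched with the `csv` module alone, which may be useful where pandas is not installed. `merge_csv` and the tiny input files are hypothetical; the sketch assumes every input file starts with the same header row.

```python
import csv
import os
import tempfile

def merge_csv(paths, out_path):
    """Concatenate CSV files that share a header, writing the header only once."""
    with open(out_path, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        for index, path in enumerate(paths):
            with open(path, newline='', encoding='utf-8') as f:
                reader = csv.reader(f)
                header = next(reader)      # assumes each file has a header row
                if index == 0:
                    writer.writerow(header)
                writer.writerows(reader)   # remaining rows are the data

# Two tiny made-up input files in a temporary folder
folder = tempfile.mkdtemp()
sample_rows = [['Book A', '10.0', 'In stock'], ['Book B', '12.5', 'In stock']]
inputs = []
for n, row in enumerate(sample_rows, start=1):
    path = os.path.join(folder, f'books_info{n}.csv')
    with open(path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows([['Title', 'Price', 'Stock'], row])
    inputs.append(path)

out_path = os.path.join(folder, 'total_book_info.csv')
merge_csv(inputs, out_path)

with open(out_path, newline='', encoding='utf-8') as f:
    merged = list(csv.reader(f))
print(merged)
```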

<h3>Confirming the collected data</h3>

In [21]:
with open('total_book_info.csv') as f:
    reader=csv.reader(f)
    
    for line in reader:
        print(line)

['Title', 'Price', 'Stock']
['A Light in the ...', '51.77', 'In stock']
['Tipping the Velvet', '53.74', 'In stock']
['Soumission', '50.1', 'In stock']
['Sharp Objects', '47.82', 'In stock']
['Sapiens: A Brief History ...', '54.23', 'In stock']
['The Requiem Red', '22.65', 'In stock']
['The Dirty Little Secrets ...', '33.34', 'In stock']
['The Coming Woman: A ...', '17.93', 'In stock']
['The Boys in the ...', '22.6', 'In stock']
['The Black Maria', '52.15', 'In stock']
['Starving Hearts (Triangular Trade ...', '13.99', 'In stock']
["Shakespeare's Sonnets", '20.66', 'In stock']
['Set Me Free', '17.46', 'In stock']
["Scott Pilgrim's Precious Little ...", '52.29', 'In stock']
['Rip it Up and ...', '35.02', 'In stock']
['Our Band Could Be ...', '57.25', 'In stock']
['Olio', '23.88', 'In stock']
['Mesaerion: The Best Science ...', '37.59', 'In stock']
['Libertarianism for Beginners', '51.33', 'In stock']
["It's Only the Himalayas", '45.17', 'In stock']
['A Light in the ...', '51.77', 'In sto