# sqlite3 - building a database

The lecture for this script is available <a href="https://www.youtube.com/watch?v=RZI-v-Z1W4c">here</a>.

- SQL basics: create a t-shirt stock inventory db

- Web scraping to get information about books from a website

- I made some modifications to the code used in the lecture to make it a little simpler

- I also wrote some code that retrieves the information from all 50 webpages

<h2>A simple db for t-shirts inventory management</h2>

In [1]:
import sqlite3

In [2]:
con = sqlite3.connect('example.db') # if the database does not exist, this command creates it

cur = con.cursor() # the cursor is used to execute SQL commands

# create a table for our database

**Add a primary key constraint to 'sku', which prevents duplicate rows from being inserted when the code is executed multiple times**

In [3]:
# make 'sku' the primary key, which prevents any duplicate input

cur.execute(''' 
                CREATE TABLE IF NOT EXISTS tshirts
                (sku text primary key, 
                name text,
                size text,
                price real) 
                '''
           )


cur.execute(''' INSERT INTO tshirts VALUES
('sku1234', 'balck logo tshirts', 'medium', '24.99') ''')

# commit these changes to the example.db
con.commit()

for row in cur.execute ('''SELECT * FROM tshirts'''):
    print(row)

IntegrityError: UNIQUE constraint failed: tshirts.sku
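
The `IntegrityError` above is what the primary key produces when a row with an existing `sku` is inserted again, for example when the cell is run a second time. A minimal sketch of that behaviour, assuming an in-memory database (`:memory:`) so that `example.db` is not touched; the `demo_` names are only for illustration:

```python
import sqlite3

# in-memory database, used here only for the demonstration
demo_con = sqlite3.connect(':memory:')
demo_cur = demo_con.cursor()

demo_cur.execute('''CREATE TABLE IF NOT EXISTS tshirts
                    (sku text primary key, name text, size text, price real)''')

row = ('sku1234', 'black logo tshirts', 'medium', 24.99)
demo_cur.execute('INSERT INTO tshirts VALUES (?, ?, ?, ?)', row)

# inserting the same sku again violates the primary key constraint
try:
    demo_cur.execute('INSERT INTO tshirts VALUES (?, ?, ?, ?)', row)
    duplicate_rejected = False
except sqlite3.IntegrityError as err:
    duplicate_rejected = True
    print('IntegrityError:', err)

demo_con.commit()

# the table still holds only the first row
row_count = demo_cur.execute('SELECT COUNT(*) FROM tshirts').fetchone()[0]
print(row_count)
```

Catching the exception keeps the script running, but the duplicate row is still rejected.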

**INSERT OR IGNORE does not raise an error even if the same data is inserted again; the duplicate row is simply skipped**

In [4]:
cur.execute(''' INSERT OR IGNORE INTO tshirts VALUES
('sku1235', 'balck logo tshirts', 'large', '24.99') ''') 

# commit these changes to the example.db
con.commit()

for row in cur.execute ('''select * from tshirts'''):
    print(row)
    
con.commit()

('sku1234', 'balck logo tshirts', 'medium', 24.99)
('sku1235', 'balck logo tshirts', 'large', 24.99)
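
To see that `INSERT OR IGNORE` really is safe to re-run, the sketch below executes the same insert three times against an in-memory database (an assumption made so the example runs on its own) and then lists the table. It also passes the values through `?` placeholders instead of writing them into the SQL string.

```python
import sqlite3

demo_con = sqlite3.connect(':memory:')
demo_cur = demo_con.cursor()
demo_cur.execute('''CREATE TABLE IF NOT EXISTS tshirts
                    (sku text primary key, name text, size text, price real)''')

row = ('sku1235', 'black logo tshirts', 'large', 24.99)

# run the same insert three times; only the first one adds a row
for _ in range(3):
    demo_cur.execute('INSERT OR IGNORE INTO tshirts VALUES (?, ?, ?, ?)', row)
demo_con.commit()

rows = demo_cur.execute('SELECT * FROM tshirts').fetchall()
print(rows)
```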


<h2> Web scraping and storing the info in our db </h2>

In [5]:
import requests
from bs4 import BeautifulSoup
import sqlite3
from pprint import pprint

<h3> Create db and connect to it</h3>

In [6]:
con=sqlite3.connect('books.db')

cur=con.cursor()

cur.execute(""" CREATE TABLE IF NOT EXISTS books
                (title text PRIMARY KEY,
                price real,
                stock text)""")

<sqlite3.Cursor at 0x261b024ece0>

<h3> Collecting data from a single webpage </h3>

In [7]:
# no pagination in this example

url = 'https://books.toscrape.com/'

def clean (item):
    return item.strip().replace('£', '').replace('Â', '')

def get_page(url):
    r=requests.get(url)
    soup=BeautifulSoup(r.text, 'lxml')
    return soup

def parse_soup(soup):
    book_list=[]
    books=soup.find_all('article', class_='product_pod')

    for book in books:
        title=book.h3.a.text
        price=float(clean(book.find('p', class_='price_color').text))
        stock=clean(book.find('p', class_='instock availability').text)


        book_list.append((title, price, stock))

    return book_list


soup=get_page(url)
book_list=parse_soup(soup)

pprint(book_list)

[('A Light in the ...', 51.77, 'In stock'),
 ('Tipping the Velvet', 53.74, 'In stock'),
 ('Soumission', 50.1, 'In stock'),
 ('Sharp Objects', 47.82, 'In stock'),
 ('Sapiens: A Brief History ...', 54.23, 'In stock'),
 ('The Requiem Red', 22.65, 'In stock'),
 ('The Dirty Little Secrets ...', 33.34, 'In stock'),
 ('The Coming Woman: A ...', 17.93, 'In stock'),
 ('The Boys in the ...', 22.6, 'In stock'),
 ('The Black Maria', 52.15, 'In stock'),
 ('Starving Hearts (Triangular Trade ...', 13.99, 'In stock'),
 ("Shakespeare's Sonnets", 20.66, 'In stock'),
 ('Set Me Free', 17.46, 'In stock'),
 ("Scott Pilgrim's Precious Little ...", 52.29, 'In stock'),
 ('Rip it Up and ...', 35.02, 'In stock'),
 ('Our Band Could Be ...', 57.25, 'In stock'),
 ('Olio', 23.88, 'In stock'),
 ('Mesaerion: The Best Science ...', 37.59, 'In stock'),
 ('Libertarianism for Beginners', 51.33, 'In stock'),
 ("It's Only the Himalayas", 45.17, 'In stock')]
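
The stray `Â` that `clean` removes is probably an encoding artifact: `£` takes two bytes in UTF-8, and if those bytes are read as ISO-8859-1 they show up as `Â£`. This is an assumption about the cause, not something taken from the lecture. The sketch below reproduces the effect with plain strings, without any network access:

```python
def clean(item):
    return item.strip().replace('£', '').replace('Â', '')

# encode the price as UTF-8, then decode it with the wrong codec
misdecoded = '£51.77'.encode('utf-8').decode('iso-8859-1')
print(repr(misdecoded))

# clean() removes both characters, so float() can parse the rest
price = float(clean(misdecoded))

# strip() also removes the whitespace around the stock text
stock = clean('\n    In stock\n')
print(price, stock)
```

If this is indeed the cause, setting `r.encoding = 'utf-8'` before reading `r.text` in `get_page` would likely make the second `replace` unnecessary.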


**Snippet for testing the code above**

```python

from pprint import pprint

book_list=[]

def clean (item):
    return item.strip().replace('£', '').replace('Â', '')

# 'soup' is the object returned by get_page(url) above
books=soup.find_all('article', class_='product_pod')

# print and keep only the first five books
for book in books[:5]:
    title=book.h3.a.text
    price=float(clean(book.find('p', class_='price_color').text))
    stock=clean(book.find('p', class_='instock availability').text)
    print(title)
    print(price)
    print(stock)
    print()
    book_list.append((title, price, stock))

pprint(book_list)

```

<h3> Store the extracted data into our db </h3>

In [8]:
cur.executemany("""INSERT OR IGNORE INTO books VALUES (?, ?, ?)""", book_list)
con.commit()

**Confirmed the contents of the books database**

<img src='result.jpg' width=600 height=400>
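
The contents can also be checked from Python with a query. A minimal sketch, assuming an in-memory database and three rows copied from the output above so that it runs on its own; against the real file the connection would be `sqlite3.connect('books.db')`:

```python
import sqlite3

demo_con = sqlite3.connect(':memory:')
demo_cur = demo_con.cursor()
demo_cur.execute("""CREATE TABLE IF NOT EXISTS books
                    (title text PRIMARY KEY, price real, stock text)""")

# three rows taken from the scraped output, for illustration only
sample_books = [
    ('Tipping the Velvet', 53.74, 'In stock'),
    ('Soumission', 50.1, 'In stock'),
    ('Sharp Objects', 47.82, 'In stock'),
]
demo_cur.executemany('INSERT OR IGNORE INTO books VALUES (?, ?, ?)', sample_books)
demo_con.commit()

# how many rows were stored
total = demo_cur.execute('SELECT COUNT(*) FROM books').fetchone()[0]
print(total)

# the cheapest book in the table
cheapest = demo_cur.execute(
    'SELECT title, price FROM books ORDER BY price LIMIT 1').fetchone()
print(cheapest)
```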

<h2> Collecting all the data from 50 webpages </h2>

<h3> Analyse the page URL </h3>

- When moving to the next page, only the number in the URL changes, so we can put that number into a variable

https://books.toscrape.com/catalogue/page-1.html <br>
https://books.toscrape.com/catalogue/page-2.html <br>
... <br>
https://books.toscrape.com/catalogue/page-50.html
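
The pattern above can be turned into a list of URLs without making any request. A small sketch; `base_url` and `urls` are names chosen here for illustration:

```python
# the page number is the only part of the URL that changes
base_url = 'https://books.toscrape.com/catalogue/page-{}.html'

# range(1, 51) yields 1..50, so no "+1" adjustment is needed
urls = [base_url.format(n) for n in range(1, 51)]

print(urls[0])
print(urls[-1])
print(len(urls))
```

Starting the range at 1 is an alternative to adding 1 to a zero-based counter inside the loop.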

In [9]:
# paginated version: loop over all 50 catalogue pages

# url = f'https://books.toscrape.com/catalogue/page-{i}.html'

import time


page_num=50

def clean (item):
    return item.strip().replace('£', '').replace('Â', '')

def get_page(url):
    r=requests.get(url)
    soup=BeautifulSoup(r.text, 'lxml')
    return soup

def parse_soup(soup):
    book_list=[]
    books=soup.find_all('article', class_='product_pod')
    for book in books:
        title=book.h3.a.text
        price=float(clean(book.find('p', class_='price_color').text))
        stock=clean(book.find('p', class_='instock availability').text)
        book_list.append((title, price, stock))
    return book_list

def retrieve_all(page_num):
    total_book_list=[]
    for i in range(page_num):
        url = f'https://books.toscrape.com/catalogue/page-{i+1}.html' # i starts at 0, so we add 1 to it
        print(url)

        soup=get_page(url)

        # one list of books per page, so the result is a list of lists
        total_book_list.append(parse_soup(soup))

        time.sleep(0.5) # short pause between requests

    return total_book_list

results=retrieve_all(page_num)


https://books.toscrape.com/catalogue/page-1.html
https://books.toscrape.com/catalogue/page-2.html
https://books.toscrape.com/catalogue/page-3.html
https://books.toscrape.com/catalogue/page-4.html
https://books.toscrape.com/catalogue/page-5.html
https://books.toscrape.com/catalogue/page-6.html
https://books.toscrape.com/catalogue/page-7.html
https://books.toscrape.com/catalogue/page-8.html
https://books.toscrape.com/catalogue/page-9.html
https://books.toscrape.com/catalogue/page-10.html
https://books.toscrape.com/catalogue/page-11.html
https://books.toscrape.com/catalogue/page-12.html
https://books.toscrape.com/catalogue/page-13.html
https://books.toscrape.com/catalogue/page-14.html
https://books.toscrape.com/catalogue/page-15.html
https://books.toscrape.com/catalogue/page-16.html
https://books.toscrape.com/catalogue/page-17.html
https://books.toscrape.com/catalogue/page-18.html
https://books.toscrape.com/catalogue/page-19.html
https://books.toscrape.com/catalogue/page-20.html
...

In [10]:
pprint(results[49][19])

('1,000 Places to See ...', 26.08, 'In stock')


<h3> Store the extracted data into our db </h3>

In [11]:
for result in results:
    cur.executemany("""INSERT OR IGNORE INTO books VALUES (?, ?, ?)""", result)
con.commit()

<img src='result1.jpg' width=600 height=800>
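
`results` is a list of lists, one per page, which is why the cell above loops over it. A sketch of an alternative that flattens it first with `itertools.chain.from_iterable`, so that a single `executemany` call is enough. The two-page `demo_results` is made-up data shaped like `results`, and the in-memory database is an assumption made so the example runs on its own. It also shows a side effect of using `title` as the primary key: titles on the listing pages are shortened with `...`, so two different books may end up with the same title, and `INSERT OR IGNORE` then silently keeps only the first.

```python
import sqlite3
from itertools import chain

# made-up data with the same shape as results: one list of tuples per page
demo_results = [
    [('Book A', 10.0, 'In stock'), ('The Long ...', 12.5, 'In stock')],
    [('Book B', 20.0, 'In stock'), ('The Long ...', 30.0, 'In stock')],
]

# flatten the list of lists into one list of (title, price, stock) tuples
all_books = list(chain.from_iterable(demo_results))

demo_con = sqlite3.connect(':memory:')
demo_cur = demo_con.cursor()
demo_cur.execute("""CREATE TABLE IF NOT EXISTS books
                    (title text PRIMARY KEY, price real, stock text)""")

demo_cur.executemany('INSERT OR IGNORE INTO books VALUES (?, ?, ?)', all_books)
demo_con.commit()

# the second 'The Long ...' row has a duplicate title and is skipped
stored = demo_cur.execute('SELECT COUNT(*) FROM books').fetchone()[0]
print(len(all_books), stored)
```

Comparing the number of scraped tuples with the number of stored rows is a quick way to find out whether any books were dropped this way.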