# Practical Exercise
**1. Collect the following data from the page: https://books.toscrape.com**

**Catalogue:**
- Classics

- Science Fiction

- Humor

- Business

**Collect the following data for each book:**
- Book name

- Price in pounds

- Customer rating

- Stock availability

**2. Write a plan for each business question, containing:**
- Output: a mock-up of the final table and chart.

- Process: the sequence of steps, ordered by the logic of execution.

- Input: the link to the data sources.


# Solution Planning (SAPE: Output, Process, Input)


### Output
**Answer**
- Collect the data for all categories

**Delivery format**
- Table in .csv format

**Delivery location**
- Git repository

### Process

**Building the answer**
1. Collect the data with the BeautifulSoup package
2. Understand the structure of the site and its HTML
3. Identify references in the HTML that are valid for every page to be scraped
4. Get the number of listing pages so that all data can be collected
5. Collect the URLs of all books
6. Create a for loop to collect all requested data: book name, price in pounds, customer rating, stock availability, and book category
7. Build a DataFrame from the collected data with Pandas
8. Create a column with the scrape date and time, using datetime
9. Export the data to .csv

**Delivery format**
- Table with the following columns:
| book_name | book_category | book_price | book_rating | book_stock | book_id | scrapy_datetime |

**Delivery location**
- Jupyter Notebook

### Input

**Data source**
- https://books.toscrape.com/

**Tools**
- Python 3.8.5
- BeautifulSoup
- Jupyter Notebook

# 0.0 Import Libs

In [3]:
import pandas as pd
import numpy as np
import requests
import re
import sqlite3

from sqlalchemy import create_engine
from datetime import datetime
from bs4 import BeautifulSoup

# 1.0 Create path and headers

In [None]:
url = 'https://books.toscrape.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5),AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
page = requests.get(url, headers=headers)

# BeautifulSoup object
soup = BeautifulSoup(page.text, 'html.parser')

## 1.1 All URL Books

In [None]:
#================== Get all url books ====================#
# page count: total number of results divided by 20 books per page
n_page = soup.find('form', class_='form-horizontal')
n_page = n_page.get_text().split(' ')
n_page = n_page[0].split('\n')[3]
n_page = int(np.ceil(int(n_page) / 20))  # round up so a partial last page is not dropped
page_list = list(range(1, n_page + 1))

list_url = []

for page_number in page_list:
    url = 'https://books.toscrape.com/catalogue/page-' + str(page_number) + '.html'
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    product_url = soup.find_all('div', class_='image_container')
    # relative link to each book's detail page
    list_url.extend(p.find('a').get('href') for p in product_url)
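
The page-count arithmetic and the URL building in the cell above can be checked offline with the standard library alone. The following is a minimal sketch, not the notebook's code: `page_count` and `book_url` are hypothetical helper names, the figure of 20 books per page is taken from the cell above, and the sample href is illustrative rather than a real book link.

```python
import math
from urllib.parse import urljoin

BOOKS_PER_PAGE = 20  # page size assumed from the cell above


def page_count(total_results: int, per_page: int = BOOKS_PER_PAGE) -> int:
    """Number of listing pages, rounded up so a partial last page is counted."""
    return math.ceil(total_results / per_page)


def book_url(listing_url: str, href: str) -> str:
    """Resolve a relative href found on a listing page against that page's URL."""
    return urljoin(listing_url, href)


listing = 'https://books.toscrape.com/catalogue/page-1.html'
print(page_count(1000))  # 50
print(page_count(1001))  # 51
print(book_url(listing, 'some-book_1/index.html'))
```

Resolving each href against the page it was found on avoids hard-coding the `catalogue/` prefix, which may matter if links are ever collected from a page at a different depth.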

# 2.0 Get Books Details

In [None]:
#================== Get books details ====================#
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5),AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
cols = ['book_name','book_category','book_price','book_rating','book_stock','book_id']
rows = []

for i in range(len(list_url)):
    url = 'https://books.toscrape.com/catalogue/' + list_url[i]
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')

    # product name
    product_main = soup.find('div', class_='col-sm-6 product_main')
    name = str(product_main.h1.string)

    # product price (drop the first character, keep the rest as text)
    price = soup.find('p', class_='price_color').get_text()[1:]

    # product rating: the second CSS class of the star-rating tag
    # is the rating word ('One' to 'Five')
    rating_tag = product_main.find('p', class_='star-rating')
    rating = rating_tag['class'][1] if rating_tag else None

    # product stock
    stock = soup.find('p', class_='instock availability').get_text().strip()

    # product category (third breadcrumb entry)
    breadcrumb = soup.find('ul', class_='breadcrumb')
    category = list(filter(None, breadcrumb.get_text().split('\n')))[2]

    # book_id is the position of the book in list_url
    rows.append([name, category, price, rating, stock, i])

# build the DataFrame once, after the loop
books = pd.DataFrame(rows, columns=cols)

# scrape timestamp
books['scrapy_datetime'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

In [None]:
# Export to csv
books.to_csv('books-list-complete.csv', index=False )

## 2.1 Clean Data

In [109]:
data = pd.read_csv('books-list-complete.csv')

# book_name
# book_category
data['book_category'] = data['book_category'].apply(lambda x: x.replace(' ','_').lower() if pd.notnull(x) else x )

# book_price
data['book_price'] = data['book_price'].apply(lambda x: x.replace('£','')).astype(float)

# book_rating: unknown values become NaN instead of silently counting as 5
rating_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
data['book_rating'] = data['book_rating'].map(rating_map)

# book_stock: keep only the number of available units, as an integer
regex = r'(\d+)'
data['book_stock'] = data['book_stock'].apply(lambda x: int(re.search(regex, x).group(0)) if pd.notnull(x) else x)

# book_id
# scrapy-datetime

# book_category: rename 'add_a_comment' to 'default'
data['book_category'] = data['book_category'].apply(lambda x: 'default' if x == 'add_a_comment' else x)

# book category sorted
data = data.sort_values('book_category', ascending=True)
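
The three parsing rules used in the cleaning cell (price, rating word, stock count) can also be written as plain functions, which makes them easy to test without a DataFrame. This is a sketch under assumptions: the function names are hypothetical, the sample strings are illustrative inputs rather than rows from the scraped file, and returning `None` for unparseable input is a design choice made here, not something the notebook specifies.

```python
import re

RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}


def parse_price(text):
    """Extract the numeric part of a price string such as '£51.77'."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    return float(match.group(0)) if match else None


def parse_rating(word):
    """Map a rating word to 1-5; unknown words give None instead of a default."""
    return RATING_WORDS.get(word)


def parse_stock(text):
    """Extract the first integer from text such as 'In stock (22 available)'."""
    match = re.search(r'\d+', text)
    return int(match.group(0)) if match else None


print(parse_price('£51.77'))                   # 51.77
print(parse_rating('Three'))                   # 3
print(parse_stock('In stock (22 available)'))  # 22
```

Because `parse_price` searches for digits rather than stripping a fixed prefix, it gives the same result whether or not stray characters precede the pound sign.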

In [111]:
# export to csv data cleaned
data.to_csv('books-data-clean.csv', index=False )

# 3.0 Create database

In [4]:
data = pd.read_csv('books-data-clean.csv')

In [9]:
#========== Create database ==========#
query_schema = """
    CREATE TABLE IF NOT EXISTS books_scrapy(
        book_name       TEXT, 
        book_category   TEXT, 
        book_price      REAL, 
        book_rating     INTEGER, 
        book_stock      INTEGER,
        book_id         INTEGER, 
        scrapy_datetime TEXT
    )

"""

conn = sqlite3.connect('books_scrapy.sqlite')
cursor = conn.execute(query_schema)
conn.commit()

In [14]:
# insert dataset in database
conn = create_engine('sqlite:///books_scrapy.sqlite', echo=False)

data.to_sql('books_scrapy', con=conn, if_exists='append', index=False)
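
To see how the declared column types affect stored values without touching `books_scrapy.sqlite`, a throwaway in-memory database is enough. The sketch below uses the `sqlite3` module directly instead of SQLAlchemy, and the single row is hypothetical, not scraped data. The stock value is deliberately passed as text to show SQLite's type affinity at work.

```python
import sqlite3

schema = """
    CREATE TABLE IF NOT EXISTS books_scrapy(
        book_name       TEXT,
        book_category   TEXT,
        book_price      REAL,
        book_rating     INTEGER,
        book_stock      INTEGER,
        book_id         INTEGER,
        scrapy_datetime TEXT
    )
"""

conn_mem = sqlite3.connect(':memory:')
conn_mem.execute(schema)
conn_mem.execute(schema)  # IF NOT EXISTS makes a second run a no-op

# hypothetical row; the stock ('7') is passed as text on purpose
row = ('Example Book', 'example', 12.5, 4, '7', 0, '2021-07-16 15:48:52')
conn_mem.execute('INSERT INTO books_scrapy VALUES (?, ?, ?, ?, ?, ?, ?)', row)
conn_mem.commit()

# typeof() reports the storage class SQLite actually used for each value
types = conn_mem.execute(
    'SELECT typeof(book_price), typeof(book_rating), typeof(book_stock) '
    'FROM books_scrapy'
).fetchone()
print(types)  # ('real', 'integer', 'integer')
conn_mem.close()
```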

## 3.1 Practice time!

In [48]:
query = """
    SELECT *
    FROM books_scrapy
"""

df = pd.read_sql_query(query, conn)
df.head()

Unnamed: 0,book_name,book_category,book_price,book_rating,book_stock,book_id,scrapy_datetime
0,Logan Kade (Fallen Crest High #5.5),academic,13.12,2,5,616,2021-07-16 15:48:52
1,Fifty Shades Freed (Fifty Shades #3),adult_fiction,15.36,5,3,844,2021-07-16 15:48:52
2,Art and Fear: Observations on the Perils (and ...,art,48.63,4,9,441,2021-07-16 15:48:52
3,The Story of Art,art,41.14,4,7,500,2021-07-16 15:48:52
4,The New Drawing on the Right Side of the Brain,art,43.02,3,8,450,2021-07-16 15:48:52


In [27]:
query = """
    SELECT distinct book_category
    FROM books_scrapy
"""

df = pd.read_sql_query(query, conn)
df.head()

Unnamed: 0,book_category
0,academic
1,adult_fiction
2,art
3,autobiography
4,biography


In [32]:
query = """
    SELECT *
    FROM books_scrapy
    WHERE book_category = 'autobiography' or book_category = 'biography'
"""

df = pd.read_sql_query(query, conn)
df.head()

Unnamed: 0,book_name,book_category,book_price,book_rating,book_stock,book_id,scrapy_datetime
0,M Train,autobiography,27.18,1,11,402,2021-07-16 15:48:52
1,Running with Scissors,autobiography,12.91,4,3,785,2021-07-16 15:48:52
2,The Argonauts,autobiography,10.93,2,15,163,2021-07-16 15:48:52
3,A Heartbreaking Work of Staggering Genius,autobiography,54.29,5,3,885,2021-07-16 15:48:52
4,Approval Junkie: Adventures in Caring Too Much,autobiography,58.81,5,5,637,2021-07-16 15:48:52


In [36]:
query = """
    SELECT *
    FROM books_scrapy
    WHERE book_price between 10 and 20
"""

df = pd.read_sql_query(query, conn)
df.sort_values('book_price',ascending=True)

Unnamed: 0,book_name,book_category,book_price,book_rating,book_stock,book_id,scrapy_datetime
191,An Abundance of Katherines,young_adult,10.00,5,5,638,2021-07-16 15:48:52
153,The Origin of Species,science,10.01,4,7,501,2021-07-16 15:48:52
46,The Tipping Point: How Little Things Can Make ...,default,10.02,2,3,716,2021-07-16 15:48:52
169,Patience,sequential_art,10.16,3,16,84,2021-07-16 15:48:52
25,Greek Mythic History,default,10.23,5,14,302,2021-07-16 15:48:52
...,...,...,...,...,...,...,...
52,The Zombie Room,default,19.69,5,1,913,2021-07-16 15:48:52
185,I've Got Your Number,womens_fiction,19.69,1,3,827,2021-07-16 15:48:52
93,The Age of Genius: The Seventeenth Century and...,history,19.73,1,16,71,2021-07-16 15:48:52
130,Reskilling America: Learning to Labor in the T...,nonfiction,19.83,2,16,78,2021-07-16 15:48:52


In [50]:
query = """
    SELECT *
    FROM books_scrapy
    WHERE book_price between 10 and 20 and book_rating >=4
    ORDER BY book_category DESC
"""

df = pd.read_sql_query(query, conn)
df

Unnamed: 0,book_name,book_category,book_price,book_rating,book_stock,book_id,scrapy_datetime
0,Scarlet (The Lunar Chronicles #2),young_adult,14.57,4,3,782,2021-07-16 15:48:52
1,Set Me Free,young_adult,17.46,5,19,12,2021-07-16 15:48:52
2,The Darkest Corners,young_adult,11.33,5,5,601,2021-07-16 15:48:52
3,Kill the Boy Band,young_adult,15.52,5,5,619,2021-07-16 15:48:52
4,An Abundance of Katherines,young_adult,10.00,5,5,638,2021-07-16 15:48:52
...,...,...,...,...,...,...,...
70,Walt Disney's Alice in Wonderland,childrens,12.96,5,14,223,2021-07-16 15:48:52
71,The Third Wave: An Entrepreneur’s Vision of ...,business,12.61,5,15,138,2021-07-16 15:48:52
72,Running with Scissors,autobiography,12.91,4,3,785,2021-07-16 15:48:52
73,History of Beauty,art,10.29,4,8,479,2021-07-16 15:48:52


In [71]:
query = """
    SELECT book_category, AVG(book_price) as "mean", AVG(book_rating) as "mean_rating"
    FROM books_scrapy
    GROUP BY book_category
"""

df = pd.read_sql_query(query, conn)
df.head()

Unnamed: 0,book_category,mean,mean_rating
0,academic,13.12,2.0
1,adult_fiction,15.36,5.0
2,art,38.52,3.625
3,autobiography,37.053333,3.0
4,biography,33.662,2.2


In [75]:
query = """
    SELECT book_rating, AVG(book_price) as "mean"
    FROM books_scrapy
    GROUP BY book_rating
"""

df = pd.read_sql_query(query, conn)
df

Unnamed: 0,book_rating,mean
0,1,34.561195
1,2,34.810918
2,3,34.69202
3,4,36.093296
4,5,35.37449


In [76]:
query = """
    SELECT book_rating, book_category, AVG(book_price) as "mean"
    FROM books_scrapy
    GROUP BY book_rating, book_category
"""

df = pd.read_sql_query(query, conn)
df

Unnamed: 0,book_rating,book_category,mean
0,1,autobiography,34.015000
1,1,biography,39.550000
2,1,business,43.930000
3,1,childrens,32.593750
4,1,christian,25.770000
...,...,...,...
172,5,sports_and_games,44.840000
173,5,thriller,33.715000
174,5,travel,26.080000
175,5,womens_fiction,40.902500


In [80]:
query = """
    SELECT book_name, book_price, book_stock ,book_price * book_stock as "revenue"
    FROM books_scrapy
"""

df = pd.read_sql_query(query, conn)
df

Unnamed: 0,book_name,book_price,book_stock,revenue
0,Logan Kade (Fallen Crest High #5.5),13.12,5,65.60
1,Fifty Shades Freed (Fifty Shades #3),15.36,3,46.08
2,Art and Fear: Observations on the Perils (and ...,48.63,9,437.67
3,The Story of Art,41.14,7,287.98
4,The New Drawing on the Right Side of the Brain,43.02,8,344.16
...,...,...,...,...
995,Golden (Heart of Dread #3),42.21,3,126.63
996,Catching Jordan (Hundred Oaks),50.83,14,711.62
997,The Art of Not Breathing,40.83,1,40.83
998,No Love Allowed (Dodge Cove #1),54.65,12,655.80


In [83]:
query = """
    SELECT SUM(book_stock) as "total stock", SUM(book_price * book_stock) as "revenue"
    FROM books_scrapy
"""

df = pd.read_sql_query(query, conn)
df

Unnamed: 0,total stock,revenue
0,8585,300188.27


In [87]:
query = """
    SELECT book_category, SUM(book_stock) as "total stock", SUM(book_price * book_stock) as "revenue"
    FROM books_scrapy
    GROUP BY book_category
    HAVING "total stock" >= 100
"""

df = pd.read_sql_query(query, conn)
df

Unnamed: 0,book_category,total stock,revenue
0,business,133,4019.64
1,childrens,229,7957.43
2,default,1861,65232.01
3,fantasy,372,14591.77
4,fiction,588,21445.99
5,food_and_drink,319,10347.51
6,historical_fiction,194,6946.26
7,history,181,6467.28
8,horror,136,4472.98
9,music,111,4161.89
