# Webscraping OkadaBooks

## Part 1: Getting specific data 

There are 22 categories on okadabooks.com. My aim is to extract the relevant information (title, author, number of reads, rating, price and link to the book) from each bookcard. Before looping through all the categories, the code for extracting each field is worked out on a single category (in this case, fiction).

### Reading the webpage into python

In [1]:
import requests

In [2]:
### Setting a User-Agent header so the request looks like it comes from a browser window
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}

In [None]:
response = requests.get('https://okadabooks.com/category/fiction/5', headers=headers)

In [None]:
### Print the first 500 characters of the HTML
print(response.text[0:500])

### Parsing the html through BeautifulSoup

In [4]:
from bs4 import BeautifulSoup

In [4]:
soup= BeautifulSoup(response.text, 'html.parser')

In [5]:
## On 'https://okadabooks.com/category/fiction/5', right-click on a bookcard and click on inspect,
## then take note of the class of the tag that highlights the whole bookcard.

bookcards= soup.find_all('div', {'class':'book__card'})

In [6]:
bookcards

[<div class="book__card"><div class="book"><div class="book__cover--container"><a href="/book/about/gibborim_-_the_beginning/29355"><img alt="GIBBORIM - THE BEGINNING." src="https://okada-assets-production.s3.eu-west-2.amazonaws.com/applications/content/images/bookImages/946a5b1abba6598021a0cb5cc24c5653.jpg"/></a></div><div class="book__content"><h3 class="title"><a href="/book/about/gibborim_-_the_beginning/29355"><span class="text-truncate">gibborim - the beginning.</span></a></h3><div class="stats__container"><div class="reads"><span class="icon"><i class="icon ion-md-eye"></i></span><span class="text">0<!-- --> <!-- -->read</span></div><div class="ratings" title="ratings"><span class="icon"><i class="icon ion-md-star"></i></span><span class="text">0</span></div></div><span class="prize"><strong>₦200.00</strong></span><p class="description"><span width="0"><span></span><span>When the rebel angels were cast down from Heaven...
 It happened in waves.
 Want to know more?
 Find out.</sp

In [7]:
len(bookcards)

12

In [8]:
### Concentrating on the first bookcard to scrape specific data.
bookcard= bookcards[0]

In [28]:
### This arranges the html content in a structured manner so that the different tags can 
### be found quickly.

print(bookcard.prettify())

<div class="book__card">
 <div class="book">
  <div class="book__cover--container">
   <a href="/book/about/kanyinsola_adeyeye_my_survival_story/29231">
    <img alt="Kanyinsola Adeyeye (my survival story)" src="https://okada-assets-production.s3.eu-west-2.amazonaws.com/applications/content/images/bookImages/bdecd1725185bc3bc7ff06d7afae67a1.jpg"/>
   </a>
  </div>
  <div class="book__content">
   <h3 class="title">
    <a href="/book/about/kanyinsola_adeyeye_my_survival_story/29231">
     <span class="text-truncate">
      kanyinsola adeyeye (my survival story)
     </span>
    </a>
    <span class="info">
     18+
    </span>
   </h3>
   <div class="stats__container">
    <div class="reads">
     <span class="icon">
      <i class="icon ion-md-eye">
      </i>
     </span>
     <span class="text">
      2
      <!-- -->
      <!-- -->
      reads
     </span>
    </div>
    <div class="ratings" title="ratings">
     <span class="icon">
      <i class="icon ion-md-star">
      </i>
   

In [9]:
### Title
title = bookcard.find('span', {'class':"text-truncate"})

In [10]:
title.text

'gibborim - the beginning.'

In [11]:
###Price
price= bookcard.find("strong")

In [12]:
price.text

'₦200.00'

In [14]:
price.text[1:-3]

'200'

In [56]:
bookcard.find('strong').text[1:-3]

'200'
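
The slice `[1:-3]` relies on there being exactly one currency symbol in front and a two-digit decimal part behind. As a minimal sketch (the helper name `parse_price` is hypothetical, and the comma-separated and digit-free labels are assumptions about what other bookcards might contain), a regular expression can pull the number out instead:

```python
import re

def parse_price(text):
    """Return the numeric part of a price label such as '₦200.00' as a float.

    Returns None when the label has no digits (an assumed case, not seen above).
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return float(match.group().replace(",", ""))

print(parse_price("₦200.00"))    # 200.0
print(parse_price("₦1,500.00"))  # 1500.0
```

This returns a number rather than a string, which may be more convenient once the records are loaded into a table.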

In [15]:
### Author's name

Author= bookcard.find('h5', {'class':"name"})

In [16]:
Author.text

'by Amobi Ivan'

In [17]:
Author.text[3:]

'Amobi Ivan'

In [18]:
### No of reads
read= bookcard.find('div', {'class':"reads"})

In [19]:
read.text

'0 read'

In [21]:
read.text[:-5]

'0'
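
The `[:-5]` slice fits the singular label `'0 read'`, but a plural label such as `'2 reads'` would come out as `'2 '` with a trailing space. A small sketch of a more tolerant alternative, assuming the count is always the first run of digits in the label (`parse_reads` is a hypothetical helper name):

```python
import re

def parse_reads(text):
    """Return the leading count from a label such as '0 read' or '2 reads'."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None

print(parse_reads("0 read"))   # 0
print(parse_reads("2 reads"))  # 2
```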

In [22]:
### Rating's score
rating = bookcard.find('div', {'title':"ratings"})

In [23]:
rating.text

'0'

In [27]:
### Blurb/ Book description
blurb = bookcard.find('p', {'class':"description"})

In [28]:
blurb.text

'When the rebel angels were cast down from Heaven...\r\nIt happened in waves.\r\nWant to know more?\r\nFind out....'
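
The blurb text keeps the `\r\n` line breaks from the page. If a single-line description is preferred, one possible clean-up (a sketch only; `clean_blurb` is a hypothetical helper) is to collapse every run of whitespace into one space:

```python
def clean_blurb(text):
    """Collapse line breaks and repeated whitespace into single spaces."""
    return " ".join(text.split())

sample = "When the rebel angels were cast down from Heaven...\r\nIt happened in waves."
print(clean_blurb(sample))
```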

In [29]:
### The link to the book is embedded in the 'a' tag while the book cover is in the 'img' tag
bookcard.find_all('a')

[<a href="/book/about/gibborim_-_the_beginning/29355"><img alt="GIBBORIM - THE BEGINNING." src="https://okada-assets-production.s3.eu-west-2.amazonaws.com/applications/content/images/bookImages/946a5b1abba6598021a0cb5cc24c5653.jpg"/></a>,
 <a href="/book/about/gibborim_-_the_beginning/29355"><span class="text-truncate">gibborim - the beginning.</span></a>,
 <a href="https://okadabooks.com/user/RABBONI"><div class="avatar"><img alt="Author Image" src="/static/assets/images/default-avatar.png"/></div><h5 class="name">by Amobi Ivan</h5></a>,
 <a href="/book/about/gibborim_-_the_beginning/29355"><span class=""><span>Available to read on app and web</span><i class="icon fa fa-dribbble"></i></span></a>]

In [32]:
###About the book(Link to a summary of the book)
bookcard.find('a')['href']

In [None]:
base = 'https://okadabooks.com'

In [33]:
book_link= base + bookcard.find('a')['href']
book_link

'https://okadabooks.com/book/about/gibborim_-_the_beginning/29355'
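
String concatenation works here because the book href is relative. The author link in the list above is already a full url, so concatenating the base onto it would produce a broken address. As a hedged alternative, `urljoin` from the standard library handles both cases; the two hrefs below are taken from the output above:

```python
from urllib.parse import urljoin

base = "https://okadabooks.com"

# A relative href is resolved against the base.
relative = urljoin(base, "/book/about/gibborim_-_the_beginning/29355")
# An href that is already absolute is returned unchanged.
absolute = urljoin(base, "https://okadabooks.com/user/RABBONI")

print(relative)
print(absolute)
```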

In [46]:
### Bookcover image
bookcard.find('img')['src']

'https://okada-assets-production.s3.eu-west-2.amazonaws.com/applications/content/images/bookImages/946a5b1abba6598021a0cb5cc24c5653.jpg'

In [None]:
book_image= bookcard.find('img')['src']
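
One thing to keep in mind before looping over many bookcards: BeautifulSoup's `find` returns `None` when no tag matches, so calling `.text` on the result raises an `AttributeError`. A minimal sketch of a guard follows; `safe_text` is a hypothetical helper, and `SimpleNamespace` only stands in for a parsed tag so the example runs without the site:

```python
from types import SimpleNamespace

def safe_text(node, default=""):
    """Return the stripped text of a tag, or a default when the tag was not found."""
    return node.text.strip() if node is not None else default

# Stand-in for a tag returned by bookcard.find(...); real tags also expose `.text`.
found = SimpleNamespace(text=" by Amobi Ivan ")
missing = None  # what find() returns when nothing matches

print(safe_text(found))
print(repr(safe_text(missing)))
```

With real data the call would look like `safe_text(bookcard.find('h5', {'class': "name"}))`.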

---

## Part 2 : Automation

In each category, there is a load more button which, when clicked, reveals more bookcards. I am going to use Selenium to keep clicking the button until the end of the page is reached and the button is no longer available. This is where dynamic scraping begins.

In [None]:
from selenium import webdriver

In [None]:
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito') 

## In the final for loop, a headless argument will be used so that the browser does not open while 
## the automation is running. The incognito option means an incognito window will be used.

In [None]:
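### The executable path is where my chrome driver is located on my computer.
### Note: `executable_path` and `chrome_options` are Selenium 3 style arguments; newer Selenium
### releases may expect a Service object and `options=` instead.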
driver=webdriver.Chrome(executable_path= 'C:/Users/ru/Downloads/Programs/chromedriver', chrome_options=options)

In [None]:
driver.get('https://okadabooks.com/category/fiction/5')

In [None]:
### Inspect the load more button, right-click its html tag and copy the css selector.
### The code below finds the button and clicks it once.

okada= driver.find_element_by_css_selector('#_app > div.main-wrapper > main > div > main > div > div.col-lg-9 > div > div.container.d-flex.justify-content-center.mb-5 > button')
okada.click()

In [None]:
### This keeps pressing the load more button until it is no longer there. It is commented out
### here. Check the final for loop in Part 4 for usage.

import time 

# LoadMore = True
# while LoadMore:
#    time.sleep(1)
#    try:
#        if okada:
#            okada.click()
#    except:
#        LoadMore = False
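
The same click-until-gone idea can be sketched without a browser, which makes the stopping rule easier to see. Everything named here is hypothetical: `click_until_gone` is an illustrative helper, `find_button` stands for any callable that returns the button or raises when it is missing, and `FakeButton` only simulates a page with three more batches to load. The `max_clicks` cap is an added safety net, not part of the loop above.

```python
import time

def click_until_gone(find_button, pause=1.0, max_clicks=500):
    """Click the button returned by find_button() until it fails or max_clicks is hit."""
    clicks = 0
    while clicks < max_clicks:
        try:
            # Look the button up again on every pass in case the page re-renders it.
            find_button().click()
        except Exception:
            break  # button missing or not clickable: nothing more to load
        clicks += 1
        time.sleep(pause)
    return clicks

class FakeButton:
    def click(self):
        pass

remaining = {"batches": 3}

def fake_find():
    # Simulates a page where the button disappears after three clicks.
    if remaining["batches"] == 0:
        raise LookupError("load more button not found")
    remaining["batches"] -= 1
    return FakeButton()

clicks_made = click_until_gone(fake_find, pause=0)
print(clicks_made)  # 3
```

With Selenium, `find_button` could be a small function wrapping the css selector lookup shown earlier.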

## Part 3: Getting all the urls for each category

Instead of typing out the url for each category, the urls can easily be read from the html. On okadabooks.com/store, right-click and inspect the category wrapper (clicking a category there redirects you to that category's page).


In [None]:
r = requests.get("https://okadabooks.com/store", headers=headers)
soup= BeautifulSoup(r.text, 'html.parser')

In [None]:
soup.find_all('a')[8]['href']

In [None]:
### Corresponding category can be gotten from the url
soup.find_all('a')[8]['href'][10:-2]

In [None]:
base = 'https://okadabooks.com'
urls=[]
category= []
for i in range(8,30):
    urls.append(base + soup.find_all('a')[i]['href'])
    if i <=16:
        category.append(soup.find_all('a')[i]['href'][10:-2])
    else:
        category.append(soup.find_all('a')[i]['href'][10:-3])

In [None]:
urls

In [None]:
### Creating a dictionary that maps each url to its category
cat= dict(zip(urls,category))

In [None]:
cat
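
The `[10:-2]` and `[10:-3]` slices depend on whether the trailing category id has one or two digits, which is why the loop switches at `i <= 16`. A sketch of an alternative that splits on `/` instead, assuming every href has the shape `/category/<name>/<id>` seen for fiction (`category_from_href` is a hypothetical helper and the second href is a made-up example):

```python
def category_from_href(href):
    """Return the category name from an href shaped like '/category/<name>/<id>'."""
    parts = href.strip("/").split("/")
    return parts[1]

print(category_from_href("/category/fiction/5"))    # fiction
print(category_from_href("/category/romance/12"))   # romance (hypothetical href)
```

Because the id length no longer matters, the `if`/`else` on the index would not be needed.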

---

## Part 4: Final Scraping

All sections of the code explained above have been merged into a for loop that iterates through the 22 category pages.

In [None]:
## important

from selenium import webdriver
import requests
from bs4 import BeautifulSoup
import time

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
r = requests.get("https://okadabooks.com/store", headers=headers)
soup= BeautifulSoup(r.text, 'html.parser')

base = 'https://okadabooks.com'
urls=[]
category= []

for i in range(8,30):
    urls.append(base + soup.find_all('a')[i]['href'])
    if i <=16:
        category.append(soup.find_all('a')[i]['href'][10:-2])
    else:
        category.append(soup.find_all('a')[i]['href'][10:-3])
        
cat= dict(zip(urls,category))

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
driver=webdriver.Chrome(executable_path= 'C:/Users/ru/Downloads/Programs/chromedriver', chrome_options=options)


records = []  ## created once, before the loop, so that every category's books are kept

load_more = '#_app > div.main-wrapper > main > div > main > div > div.col-lg-9 > div > div.container.d-flex.justify-content-center.mb-5 > button'

for url in urls:
    driver.get(url)

    ## Keep clicking the load more button until it can no longer be found or clicked.
    ## The button is looked up again on every pass in case the page re-renders it.
    LoadMore = True
    while LoadMore:
        time.sleep(1)
        try:
            okada = driver.find_element_by_css_selector(load_more)
            okada.click()
        except Exception:
            LoadMore = False

    page_source = driver.page_source  ## get page source after clicking all load more buttons

    soup = BeautifulSoup(page_source, 'html.parser')

    bookcards = soup.find_all('div', {'class':'book__card'})

    for bookcard in bookcards:
        title = bookcard.find('span', {'class':"text-truncate"}).text
        author = bookcard.find('h5', {'class':"name"}).text[3:]
        genre = cat[url]
        price = bookcard.find("strong").text[1:-3]
        ratings = bookcard.find('div', {'title':"ratings"}).text
        reads = bookcard.find('div', {'class':"reads"}).text[:-5]
        blurb = bookcard.find('p', {'class':"description"}).text
        book_link = base + bookcard.find('a')['href']

        records.append((title, author, genre, price, ratings, reads, blurb, book_link))

    time.sleep(1)

driver.quit()  ## close the headless browser once every category has been scraped

## Count the collected records, not the characters in the last title
print("I have successfully scraped {} books".format(len(records)))

26400

In [None]:
len(records)

In [None]:
records[0:3]

In [6]:
import pandas as pd


In [None]:
df= pd.DataFrame(records, columns= ["title", "author", "genre", "price", "ratings", "reads","blurb", "book_link"])

In [None]:
df.head()