In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd 

In [2]:
pages = []
titles = []
prices = []
stars = []
urls = []

pages_to_scrap = 5

In [3]:
# Build the list of catalogue page URLs
for i in range(1, pages_to_scrap+1):
    page_url = 'http://books.toscrape.com/catalogue/page-{}.html'.format(i)
    pages.append(page_url)

# Download and parse each page
for item in pages:
    response = requests.get(item, timeout=(3.05, 5)) # (connect, read) timeouts in seconds
    # if response: # True when response.status_code is below 400
    #     print('Success!')
    # else: # e.g. response.status_code == 404
    #     print('Not Found.')
    soup = BeautifulSoup(response.text, 'html.parser') # parse the page's HTML into a searchable tree
    # soup.prettify() # returns the parsed HTML as an indented string
    # Get the title text from every 'h3' (long titles are truncated with '...')
    for h3 in soup.find_all('h3'):
        titles.append(h3.get_text())
    # Get the price text from every 'p' with class 'price_color'
    for p in soup.find_all('p', class_='price_color'):
        prices.append(p.get_text())
    # Get the star rating: the second class name, e.g. ['star-rating', 'Three']
    for p in soup.find_all('p', class_='star-rating'):
        stars.append(p['class'][1])
    # Find the URLs of all thumbnail images
    for div in soup.find_all('div', class_='image_container'):
        img = div.find('img', class_='thumbnail')
        url = 'http://books.toscrape.com/' + str(img['src'])
        urls.append(url.replace('../', '')) # drop the relative '../' prefix
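
The image URLs above are built by prefixing the site root and then deleting every `../`. A possible alternative, sketched here under the assumption that each `src` is relative to the page it was scraped from, is to let the standard library resolve the path. The `src` value below is a made-up placeholder, not a real file on the site.

```python
from urllib.parse import urljoin

# The page the thumbnail was found on
page_url = 'http://books.toscrape.com/catalogue/page-1.html'

# Hypothetical relative path, shaped like the 'src' attributes on the catalogue pages
src = '../media/cache/2c/da/example.jpg'

# urljoin resolves '../' against the page URL instead of relying on string replacement
image_url = urljoin(page_url, src)
print(image_url)
```

Inside the scraping loop, `item` would take the place of `page_url`, so the result stays correct even if the site's directory layout differs between pages.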

In [4]:
# Status code of the most recent response (here, the last page fetched)
response.status_code
# 200 - OK --> success
# 204 - No Content --> success (no body returned)
# 304 - Not Modified --> redirection (use the cached copy)
# 404 - Not Found --> client error

200
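
The cell above inspects only the last response. As a rough sketch, the standard library's `http.HTTPStatus` can turn any code into its reason phrase, and the first digit gives the class of the response. The helper name `describe_status` is hypothetical and not part of requests.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    # The first digit of an HTTP status code identifies its class
    classes = {1: 'informational', 2: 'success', 3: 'redirection',
               4: 'client error', 5: 'server error'}
    phrase = HTTPStatus(code).phrase
    return '{} {} ({})'.format(code, phrase, classes[code // 100])

for code in (200, 204, 304, 404):
    print(describe_status(code))
```

To check every page rather than just the last one, calling `response.raise_for_status()` inside the download loop would raise an exception for any 4xx or 5xx response.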

# Performance
When using requests, especially in a production application environment, it’s important to consider performance implications. Features like timeout control, sessions, and retry limits can help you keep your application running smoothly.
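
As a minimal sketch of the session and retry features mentioned above: a `requests.Session` reuses connections across requests, and an `HTTPAdapter` configured with urllib3's `Retry` caps the number of retries. The function name `make_session` and the specific values (3 retries, 0.5 backoff factor, the listed status codes) are illustrative assumptions, not recommendations from the original text.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(total_retries=3, backoff=0.5):
    """Build a session that retries failed requests a limited number of times."""
    retry = Retry(
        total=total_retries,                    # upper bound on retries per request
        backoff_factor=backoff,                 # wait longer between successive retries
        status_forcelist=[500, 502, 503, 504],  # also retry on these server errors
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Use the retrying adapter for both plain and TLS connections
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

session = make_session()
# A timeout still has to be passed on every call, for example:
# response = session.get('http://books.toscrape.com/catalogue/page-1.html', timeout=(3.05, 5))
```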

## Timeouts
When you make an inline request to an external service, your system has to wait for the response before moving on. If your application waits too long for that response, requests to your service could back up, your user experience could suffer, or your background jobs could hang.

By default, requests will wait indefinitely on the response, so you should almost always specify a timeout duration to prevent these things from happening. To set the request’s timeout, use the timeout parameter. timeout can be an integer or float representing the number of seconds to wait on a response before timing out:

>>> requests.get('https://api.github.com', timeout=1)
<Response [200]>
>>> requests.get('https://api.github.com', timeout=3.05)
<Response [200]>

The first request will time out after 1 second without a response from the server; the second will time out after 3.05 seconds.

You can also pass a tuple to timeout with the first element being a connect timeout (the time it allows for the client to establish a connection to the server), and the second being a read timeout (the time it will wait on a response once your client has established a connection):

>>> requests.get('https://api.github.com', timeout=(2, 5))
<Response [200]>

If the request establishes a connection within 2 seconds and receives data within 5 seconds of the connection being established, then the response will be returned as it was before. If the request times out, then the function will raise a Timeout exception:

>>> import requests
>>> from requests.exceptions import Timeout

>>> try:
...     response = requests.get('https://api.github.com', timeout=1)
... except Timeout:
...     print('The request timed out')
... else:
...     print('The request did not time out')

Your program can catch the Timeout exception and respond accordingly.
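
One way to respond accordingly, sketched below, is to tell the two timeouts of the tuple form apart: requests raises `ConnectTimeout` when the connection could not be established in time and `ReadTimeout` when the server was too slow to send data. The helper name `fetch` and its `get` parameter are hypothetical; `get` exists only so the function can be tried without a network connection.

```python
import requests
from requests.exceptions import ConnectTimeout, ReadTimeout

def fetch(url, timeout=(3.05, 5), get=requests.get):
    """Return the response, or None if the request timed out.

    `timeout` must be a (connect, read) tuple in seconds.
    """
    try:
        return get(url, timeout=timeout)
    except ConnectTimeout:
        # The connection was not established within timeout[0] seconds
        print('Could not connect to {} within {} s'.format(url, timeout[0]))
    except ReadTimeout:
        # The connection succeeded, but no data arrived within timeout[1] seconds
        print('Connected to {}, but no data within {} s'.format(url, timeout[1]))
    return None
```

In the scraping loop, `fetch(item)` could replace the direct `requests.get` call, with pages that return `None` skipped or queued for another attempt.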

In [5]:
# Gather all the data in a dictionary
data = {'Titles':titles, 'Prices':prices, 'Star ratings':stars, 'URLs':urls}
data

{'Titles': ['A Light in the ...',
  'Tipping the Velvet',
  'Soumission',
  'Sharp Objects',
  'Sapiens: A Brief History ...',
  'The Requiem Red',
  'The Dirty Little Secrets ...',
  'The Coming Woman: A ...',
  'The Boys in the ...',
  'The Black Maria',
  'Starving Hearts (Triangular Trade ...',
  "Shakespeare's Sonnets",
  'Set Me Free',
  "Scott Pilgrim's Precious Little ...",
  'Rip it Up and ...',
  'Our Band Could Be ...',
  'Olio',
  'Mesaerion: The Best Science ...',
  'Libertarianism for Beginners',
  "It's Only the Himalayas",
  'In Her Wake',
  'How Music Works',
  'Foolproof Preserving: A Guide ...',
  'Chase Me (Paris Nights ...',
  'Black Dust',
  'Birdsong: A Story in ...',
  "America's Cradle of Quarterbacks: ...",
  'Aladdin and His Wonderful ...',
  'Worlds Elsewhere: Journeys Around ...',
  'Wall and Piece',
  'The Four Agreements: A ...',
  'The Five Love Languages: ...',
  'The Elephant Tree',
  'The Bear and the ...',
  "Sophie's World",
  'Penny Maybe',
  'Maud

In [6]:
# Create dataframe from the dictionary
my_data = pd.DataFrame(data=data)
my_data

Unnamed: 0,Titles,Prices,Star ratings,URLs
0,A Light in the ...,Â£51.77,Three,http://books.toscrape.com/media/cache/2c/da/2c...
1,Tipping the Velvet,Â£53.74,One,http://books.toscrape.com/media/cache/26/0c/26...
2,Soumission,Â£50.10,One,http://books.toscrape.com/media/cache/3e/ef/3e...
3,Sharp Objects,Â£47.82,Four,http://books.toscrape.com/media/cache/32/51/32...
4,Sapiens: A Brief History ...,Â£54.23,Five,http://books.toscrape.com/media/cache/be/a5/be...
...,...,...,...,...
95,Lumberjanes Vol. 3: A ...,Â£19.92,Two,http://books.toscrape.com/media/cache/5f/b1/5f...
96,"Layered: Baking, Building, and ...",Â£40.11,One,http://books.toscrape.com/media/cache/98/d1/98...
97,Judo: Seven Steps to ...,Â£53.90,Two,http://books.toscrape.com/media/cache/5f/52/5f...
98,Join,Â£35.67,Five,http://books.toscrape.com/media/cache/93/63/93...
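
The `Â£` in the Prices column is most likely `£` encoded as UTF-8 but decoded as Latin-1, which can happen when the server does not declare a charset and `response.text` falls back to a default. The sketch below, with hypothetical helper names `fix_mojibake` and `parse_price`, repairs such strings after the fact and converts prices and star ratings to numbers.

```python
# Map the rating words used in the 'star-rating' class to integers
STAR_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

def fix_mojibake(text):
    """Undo UTF-8 text that was decoded as Latin-1 (e.g. 'Â£' becomes '£')."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # already correct, or not recoverable this way

def parse_price(text):
    """Convert a price string such as '£51.77' (or 'Â£51.77') to a float."""
    return float(fix_mojibake(text).lstrip('£'))

print(fix_mojibake('Â£51.77'))
print(parse_price('Â£51.77'))
print(STAR_WORDS['Three'])
```

Setting `response.encoding = 'utf-8'` before reading `response.text`, or passing `response.content` to `BeautifulSoup`, should avoid the problem at the source, assuming the pages really are served as UTF-8.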


In [7]:
# Shift the default index so that it starts at 1 instead of 0
my_data.index += 1
my_data

Unnamed: 0,Titles,Prices,Star ratings,URLs
1,A Light in the ...,Â£51.77,Three,http://books.toscrape.com/media/cache/2c/da/2c...
2,Tipping the Velvet,Â£53.74,One,http://books.toscrape.com/media/cache/26/0c/26...
3,Soumission,Â£50.10,One,http://books.toscrape.com/media/cache/3e/ef/3e...
4,Sharp Objects,Â£47.82,Four,http://books.toscrape.com/media/cache/32/51/32...
5,Sapiens: A Brief History ...,Â£54.23,Five,http://books.toscrape.com/media/cache/be/a5/be...
...,...,...,...,...
96,Lumberjanes Vol. 3: A ...,Â£19.92,Two,http://books.toscrape.com/media/cache/5f/b1/5f...
97,"Layered: Baking, Building, and ...",Â£40.11,One,http://books.toscrape.com/media/cache/98/d1/98...
98,Judo: Seven Steps to ...,Â£53.90,Two,http://books.toscrape.com/media/cache/5f/52/5f...
99,Join,Â£35.67,Five,http://books.toscrape.com/media/cache/93/63/93...


In [8]:
# Export the data as an Excel file (pandas needs an Excel writer such as openpyxl installed for .xlsx)
my_data.to_excel('C:/Users/HP/Desktop/Python/Web_Scraping/Output.xlsx')
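
The hard-coded path above only works on one machine. A minimal sketch of a more portable export, assuming the caller chooses the output directory: the helper name `export_excel` and the `'No.'` index label are hypothetical, while `index_label` itself is an existing `DataFrame.to_excel` parameter.

```python
from pathlib import Path

def export_excel(frame, directory, filename='Output.xlsx'):
    """Write `frame` (a pandas DataFrame) to directory/filename and return the path."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    out_path = out_dir / filename
    # index_label names the index column in the spreadsheet
    frame.to_excel(out_path, index_label='No.')
    return out_path

# Example call, writing next to the notebook:
# export_excel(my_data, 'Web_Scraping')
```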