# Web Scraping

The following sites and data sources were suggested by an AI assistant as candidates for scraping for academic or personal use:

1. Wikipedia
2. Books To Scrape (https://books.toscrape.com/)
3. Quotes To Scrape (https://quotes.toscrape.com/)
4. Scrape This Site (http://www.scrapethissite.com/)
5. IMDb (with restrictions)
6. GitHub (public repositories)
7. Data.gov
8. Craigslist (with limitations)
9. Amazon product listings (with restrictions)
10. The New York Times Developer Network (with API key)
11. OpenStreetMap
12. Goodreads (with API)
13. National Weather Service

A second suggested list, with a short note on each source:

1. **Common Crawl** - A public repository of web crawl data that can be freely accessed and used.
2. **Wikipedia** - The content is freely available under a Creative Commons license, and scraping is allowed within certain guidelines.
3. **OpenWeatherMap** - Provides weather data through an API with a free tier; use the API rather than scraping the site.
4. **Reddit** - Allows scraping for non-commercial purposes as long as it abides by their API usage policy.
5. **Twitter** - Permits scraping through their API for personal use, subject to rate limits.
6. **IMDb** - Has a public dataset available for non-commercial use.
7. **GitHub** - Public repositories can be scraped, but there are rate limits and terms of service to consider.
8. **News websites with public APIs** - Such as The Guardian and The New York Times, often provide APIs for accessing their data.

Before scraping any website:

1. Check the site's robots.txt file.
2. Review the site's terms of service.
3. Use an API when one is available.
4. Respect rate limits and do not overload servers.
5. Identify your scraper in the user-agent string.
6. Avoid scraping login-protected content or data behind paywalls without permission.
7. For academic or research purposes, consider contacting the website administrators to seek explicit permission.
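
The first and fifth checks above can be sketched with the standard library's `urllib.robotparser`. This is a minimal illustration, not a complete policy check: the user-agent string and the robots.txt rules below are made up for the example, and the rules are inlined so the snippet runs offline. For a real site you would call `parser.set_url(...)` followed by `parser.read()` instead of `parser.parse(...)`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical identifier: replace with your own project name and contact.
USER_AGENT = "my-research-bot/0.1 (contact: you@example.com)"

# Example rules, inlined so the sketch runs without network access.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())


def is_allowed(url: str) -> bool:
    """Return True if the parsed robots.txt rules allow USER_AGENT to fetch url."""
    return parser.can_fetch(USER_AGENT, url)


print(is_allowed("https://example.com/catalogue/page-1.html"))
print(is_allowed("https://example.com/private/report.html"))
# Crawl-delay, if the site declares one, is a reasonable lower bound for time.sleep().
print(parser.crawl_delay(USER_AGENT))
```

The same `USER_AGENT` string can be passed to `requests.get(url, headers={"User-Agent": USER_AGENT})` so that site operators can identify the scraper.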

References
* https://en.wikipedia.org/robots.txt
* [Google Scholar](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=web+scraping+python&btnG=&oq=web+scraping)

Overview
* Understanding HTML
* Extracting Web Content
* Extracting all books from a page
* Extracting data of a single book
* Fetching the data of all books in one page
* Scraping all books from all the pages
* Fixing the data formatting


### Import libs

In [198]:
import requests
from bs4 import BeautifulSoup as bs # can fetch data from html tags
from urllib.parse import urljoin
import time
import json
import os
import csv

### Demo

In [69]:
html_doc = """
<!DOCTYPE html>
<html>
<head>
    <title>Book Store</title>
</head>
<body>
    <div id="bestsellers">
        <h2>Best Selling Books</h2>
        <ul>
            <li><a href="/book1">The Great Gatsby</a></li>
            <li><a href="/book2">To Kill a Mockingbird</a></li>
            <li><a href="/book3">1984</a></li>
        </ul>
    </div>
    <div id="new-releases">
        <h2>New Releases</h2>
        <ul>
            <li><a href="/book4">The Testaments</a></li>
            <li><a href="/book5">Normal People</a></li>
        </ul>
    </div>
</body>
</html>
"""

soup = bs(html_doc, 'html.parser')
titles = [a.get_text() for a in soup.find_all('a')]
print(titles)

['The Great Gatsby', 'To Kill a Mockingbird', '1984', 'The Testaments', 'Normal People']
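
The demo above collects every link on the page. When only one section is wanted, CSS selectors are one way to narrow the search. The sketch below uses a shortened copy of the demo HTML and Beautiful Soup's `select()` and `select_one()`; the variable names are chosen for this example only.

```python
from bs4 import BeautifulSoup

html = """
<div id="bestsellers">
  <ul>
    <li><a href="/book1">The Great Gatsby</a></li>
    <li><a href="/book3">1984</a></li>
  </ul>
</div>
<div id="new-releases">
  <ul>
    <li><a href="/book4">The Testaments</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "#bestsellers a" matches only <a> tags nested inside the element with id="bestsellers".
bestsellers = {a.get_text(): a["href"] for a in soup.select("#bestsellers a")}

# select_one() returns the first match, or None if nothing matches.
first_new = soup.select_one("#new-releases a")

print(bestsellers)
print(first_new.get_text())
```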


### Get website data
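The `status_code` attribute of the response holds the HTTP status code returned by the server. Common values:

1. **200 OK**: The request has been successfully processed, and the server returns the requested content.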
2. **404 Not Found**: The requested resource or page could not be found on the server.
3. **403 Forbidden**: Access to the requested resource is forbidden or not allowed for the client.
4. **500 Internal Server Error**: The server encountered an unexpected error while processing the request.
5. **302 Found (or 301 Moved Permanently)**: The requested resource has been temporarily (or permanently) moved to a different URL, and the client should follow the redirection.
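
As a small illustration of the codes listed above, the standard library's `http.HTTPStatus` enum can turn a numeric code into its reason phrase. The helper name `describe_status` is made up for this sketch; it only classifies codes and makes no network request.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable description of an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif code >= 500:
        kind = "server error"
    else:
        kind = "informational"
    return f"{code} {phrase} ({kind})"


for code in (200, 301, 302, 403, 404, 500):
    print(describe_status(code))
```

With `requests`, redirects are followed by default for `GET`, so a 301 or 302 usually shows up in `response.history` rather than in `response.status_code`. Calling `response.raise_for_status()` is an alternative to comparing against 200 by hand: it raises an exception for 4xx and 5xx responses.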

In [168]:
url = "https://books.toscrape.com/"
base_url = "https://books.toscrape.com/index.html"
book_list = requests.get(base_url)  # Response object for the first listing page
if book_list.status_code == 200:
    print("Success")
else:
    print(f"Failed: {book_list.status_code}")

Success


### Verify received data

In [53]:
print(bs(book_list.content[:500], 'html.parser').prettify())

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  &lt;meta http-equiv="content-type" content="text/html; charset=UTF-8" /
 </head>
</html>



### Parse received data

In [80]:
# bs?

In [79]:
# Parse the HTML page we downloaded
# Naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning
soup = bs(
    markup = book_list.content,
    features = "html.parser"
)

In [57]:
books = soup.find_all(name = "li", class_ = "col-xs-6 col-sm-4 col-md-3 col-lg-3")
print(len(books))
print(books[1])

20
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/tipping-the-velvet_999/index.html"><img alt="Tipping the Velvet" class="thumbnail" src="media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg"/></a>
</div>
<p class="star-rating One">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>
<div class="product_price">
<p class="price_color">£53.74</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>
</li>
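
The listing card printed above already contains the title, price, and star rating, so some fields can be read without visiting the detail page. The sketch below runs on a trimmed copy of that card. It assumes the rating is encoded as a word in the `class` attribute (`star-rating One`), as in the markup shown; the names `RATING_WORDS` and `summarize_card` are made up for this example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one listing card from the output above.
card_html = """
<article class="product_pod">
  <p class="star-rating One"></p>
  <h3><a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>
  <div class="product_price"><p class="price_color">£53.74</p></div>
</article>
"""

RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def summarize_card(card) -> dict:
    """Read title, price text, and numeric rating from one product card."""
    rating_tag = card.find("p", class_="star-rating")
    rating = None
    if rating_tag is not None:
        # The class attribute is a list such as ["star-rating", "One"].
        for cls in rating_tag.get("class", []):
            if cls in RATING_WORDS:
                rating = RATING_WORDS[cls]
    return {
        "title": card.find("h3").find("a")["title"],
        "price": card.find("p", class_="price_color").get_text(strip=True),
        "rating": rating,
    }


card = BeautifulSoup(card_html, "html.parser").find("article", class_="product_pod")
summary = summarize_card(card)
print(summary)
```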


### Get first book meta data

In [58]:
# Search for the first hyperlink tag inside the card
# Note: books[1] is the second book on the page (books[0] is the first)
book_one_anchor = books[1].find("a")
book_one_anchor

<a href="catalogue/tipping-the-velvet_999/index.html"><img alt="Tipping the Velvet" class="thumbnail" src="media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg"/></a>

In [59]:
# Extract book detail page end point link
book_one_url = book_one_anchor.get("href")
book_one_url

'catalogue/tipping-the-velvet_999/index.html'

In [61]:
book_one_url = urljoin(base_url, book_one_url)
book_one_url

'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'
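
`urljoin` resolves a relative link the way a browser would: the last path segment of the base URL is replaced, and each `..` moves up one directory. The examples below illustrate this with URLs shaped like the ones in this notebook. The image path and the `example.com` address are hypothetical.

```python
from urllib.parse import urljoin

listing = "https://books.toscrape.com/index.html"

# "index.html" is replaced by the relative path.
detail = urljoin(listing, "catalogue/tipping-the-velvet_999/index.html")

# Relative to a page inside /catalogue/, the link stays inside /catalogue/.
next_page = urljoin("https://books.toscrape.com/catalogue/page-2.html", "page-3.html")

# Each ".." climbs one directory from the detail page (hypothetical image path).
image = urljoin(detail, "../../media/cache/example.jpg")

# Pitfall: without a trailing slash, "catalogue" counts as a file name and is replaced.
no_slash = urljoin("https://books.toscrape.com/catalogue", "page-3.html")

# An absolute link ignores the base entirely.
absolute = urljoin(listing, "https://example.com/other.html")

for value in (detail, next_page, image, no_slash, absolute):
    print(value)
```

This is why the base passed to `urljoin` should be the URL of the page the link was found on, not a fixed site root.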

### Get first book HTML code from link

In [173]:
# Extract details webpage html data
the_book_one_page = requests.get(book_one_url)
#print(bs(the_book_one_page.content[6000:len(the_book_one_page.content)], 'html.parser').prettify())

In [89]:
book_one_soup = bs(markup = the_book_one_page.content, features = "html.parser")
title = book_one_soup.find("h1")
title.text

'Tipping the Velvet'

In [95]:
book_one_table = book_one_soup.find_all("tr")

In [97]:
print(book_one_table, end = "\n\n")
print(book_one_table[0], end = "\n\n") # Key Value Pair
print(len(book_one_table))

[<tr>
<th>UPC</th><td>90fa61229261140a</td>
</tr>, <tr>
<th>Product Type</th><td>Books</td>
</tr>, <tr>
<th>Price (excl. tax)</th><td>£53.74</td>
</tr>, <tr>
<th>Price (incl. tax)</th><td>£53.74</td>
</tr>, <tr>
<th>Tax</th><td>£0.00</td>
</tr>, <tr>
<th>Availability</th>
<td>In stock (20 available)</td>
</tr>, <tr>
<th>Number of reviews</th>
<td>0</td>
</tr>]

<tr>
<th>UPC</th><td>90fa61229261140a</td>
</tr>

7


In [172]:
# Extract data from HTML and add it to dictionary
book_one_dict = {
    "Title" : title.text  # store the heading text, not the Tag object
}
# Each table row holds a key in <th> and its value in <td>
for row in book_one_table:
    key = row.find("th").text
    value = row.find("td").text
    book_one_dict[key] = value

book_one_dict

{'Title': 'Tipping the Velvet',
 'UPC': '90fa61229261140a',
 'Product Type': 'Books',
 'Price (excl. tax)': '£53.74',
 'Price (incl. tax)': '£53.74',
 'Tax': '£0.00',
 'Availability': 'In stock (20 available)',
 'Number of reviews': '0'}

### Create a function to get HTML code from link

In [219]:
# Encapsulate all of the above code in a single function
# Useful bs4 lookups: find(), find_all(), select(), select_one()
def scrape_book(book_url: str):
    book_page = requests.get(book_url)
    book_soup = bs(book_page.content, "html.parser")

    # store data in dict
    book_dict = {}

    image = book_soup.find(name = "div", class_ = "item active").find("img").get("src")
    title = book_soup.find("h1").text
    description = book_soup.find("p").text
    book_dict["title"] = title
    # the image src is relative; it is resolved against the global site root `url`
    book_dict["image"] = urljoin(url, image)
    # book_dict["description"] = description

    book_table_data = book_soup.find_all("tr")

    # for product info iterate and get all key-value pairs
    for row in book_table_data:
        key = row.find("th").text
        value = row.find("td").text

        book_dict[key] = value

    return book_dict

In [171]:
# Test the method with dummy url
scrape_book("http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")

{'title': 'A Light in the Attic',
 'image': 'https://books.toscrape.com/media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg',
 'UPC': 'a897fe39b1053632',
 'Product Type': 'Books',
 'Price (excl. tax)': '£51.77',
 'Price (incl. tax)': '£51.77',
 'Tax': '£0.00',
 'Availability': 'In stock (22 available)',
 'Number of reviews': '0'}

### Create a function to scrape the entire webpage

In [249]:
book_list = []
def scrape_page(base_url: str):
    page_url = base_url

    # Follow the "next" links until the last page, which has no <li class="next">
    while page_url:
        page = requests.get(page_url)
        page_soup = bs(page.content, "html.parser")

        books = page_soup.find_all(name = "li", class_ = "col-xs-6 col-sm-4 col-md-3 col-lg-3")

        for i in range(len(books)):
            # the href is a relative path, so resolve it against the current page URL
            relative_path = books[i].find("a").get("href")
            book_url = urljoin(page_url, relative_path)
            book_data = scrape_book(book_url)
            book_list.append(book_data)
            print(f"Fetched {i + 1}/{len(books)}")

        next_item = page_soup.find(name = "li", class_ = "next")
        page_url = urljoin(page_url, next_item.find("a").get("href")) if next_item else None

        # Wait for 1 second before requesting the next page
        time.sleep(1)
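
The page-to-page traversal can also be separated from the network call, which makes it possible to test the "next" link logic offline. The sketch below is one way to do that, not the notebook's own method: `iter_pages`, the `fetch` parameter, and `FAKE_SITE` are all made up for the example, and the fake pages only mimic the `<li class="next">` markup.

```python
from typing import Callable, Iterator, Tuple
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iter_pages(start_url: str, fetch: Callable[[str], str]) -> Iterator[Tuple[str, BeautifulSoup]]:
    """Yield (url, soup) for each listing page, following 'next' links until none is left."""
    url = start_url
    seen = set()  # guards against a site whose 'next' links form a loop
    while url and url not in seen:
        seen.add(url)
        soup = BeautifulSoup(fetch(url), "html.parser")
        yield url, soup
        next_link = soup.select_one("li.next a")
        url = urljoin(url, next_link["href"]) if next_link else None


# Hypothetical three-page site, stored in a dict so the example needs no network.
FAKE_SITE = {
    "https://example.com/index.html": '<li class="next"><a href="catalogue/page-2.html">next</a></li>',
    "https://example.com/catalogue/page-2.html": '<li class="next"><a href="page-3.html">next</a></li>',
    "https://example.com/catalogue/page-3.html": "<p>last page</p>",
}

visited = [url for url, _ in iter_pages("https://example.com/index.html", FAKE_SITE.__getitem__)]
print(visited)
```

For real use, `fetch` could be a small function that calls `requests.get(url).text` and sleeps between requests.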

In [250]:
# fetch all books from all pages
scrape_page('http://books.toscrape.com/index.html')
len(book_list)

Fetched 1/20
Fetched 2/20
Fetched 3/20
Fetched 4/20


KeyboardInterrupt: 

### Format unstructured data

In [251]:
book_list[0]

{'title': 'A Light in the Attic',
 'image': 'https://books.toscrape.com/media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg',
 'UPC': 'a897fe39b1053632',
 'Product Type': 'Books',
 'Price (excl. tax)': '£51.77',
 'Price (incl. tax)': '£51.77',
 'Tax': '£0.00',
 'Availability': 'In stock (22 available)',
 'Number of reviews': '0'}

In [252]:
def fix(item: dict):
    # Convert the price string (e.g. "£51.77") to a float and drop the duplicate price columns
    if isinstance(item.get("Price (excl. tax)"), str):
        price = float(item["Price (excl. tax)"][1:])
        item["Price"] = price
        item.pop("Price (excl. tax)")
        if "Price (incl. tax)" in item:
            item.pop("Price (incl. tax)")

    tax = float(item["Tax"][1:])
    item["Tax"] = tax

    # "In stock (22 available)" -> "In stock" and 22
    stuff = item["Availability"].split("(")
    availability = stuff[0].strip()
    quantity = int(stuff[1].split(" ")[0])
    is_available = quantity > 0
    item["Availability"] = availability
    item["Quantity"] = quantity
    item["IsAvailable"] = is_available

    item["Number of reviews"] = int(item["Number of reviews"])

    return item

formatted_books_list = list(map(fix, book_list))
formatted_books_list[:3]

[{'title': 'A Light in the Attic',
  'image': 'https://books.toscrape.com/media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg',
  'UPC': 'a897fe39b1053632',
  'Product Type': 'Books',
  'Tax': 0.0,
  'Availability': 'In stock',
  'Number of reviews': 0,
  'Price': 51.77,
  'Quantity': 22,
  'IsAvailable': True},
 {'title': 'Tipping the Velvet',
  'image': 'https://books.toscrape.com/media/cache/08/e9/08e94f3731d7d6b760dfbfbc02ca5c62.jpg',
  'UPC': '90fa61229261140a',
  'Product Type': 'Books',
  'Tax': 0.0,
  'Availability': 'In stock',
  'Number of reviews': 0,
  'Price': 53.74,
  'Quantity': 20,
  'IsAvailable': True},
 {'title': 'Soumission',
  'image': 'https://books.toscrape.com/media/cache/ee/cf/eecfe998905e455df12064dba399c075.jpg',
  'UPC': '6957f44c3847a760',
  'Product Type': 'Books',
  'Tax': 0.0,
  'Availabi
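
Slicing off the first character with `[1:]` works only while the price string starts with exactly one currency symbol. As a hedged alternative, the two text fields can be parsed with regular expressions, which also tolerates a mis-decoded prefix such as `Â£`. The helper names below are made up for this sketch.

```python
import re


def parse_price(text: str) -> float:
    """Extract the first decimal number from a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())


def parse_availability(text: str) -> tuple:
    """Split 'In stock (22 available)' into ('In stock', 22).

    The quantity is None when the text carries no '(N available)' part.
    """
    status = text.split("(")[0].strip()
    match = re.search(r"\((\d+) available\)", text)
    quantity = int(match.group(1)) if match else None
    return status, quantity


print(parse_price("£51.77"))
print(parse_price("Â£51.77"))
print(parse_availability("In stock (22 available)"))
print(parse_availability("In stock"))
```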

### Export all books to csv

In [196]:
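def convert_dict_to_csv():
    # Using "with" to open a file automatically closes it afterwards
    # newline='' lets the csv module control line endings (avoids blank rows on Windows)
    with open('all_books.csv', 'w', newline='', encoding='utf-8') as f:
        # w = csv.writer(sys.stderr) # see what is being generated
        w = csv.writer(f)
        # Header row: assumes every book dict has the same keys as the first one
        w.writerow(book_list[0].keys())
        for book in book_list:
            w.writerow(book.values())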
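
Writing `keys()` and `values()` row by row assumes that every dictionary has the same keys in the same order. If that is not guaranteed, `csv.DictWriter` is one option: it places each value under its own column name and fills missing fields with a default. The sketch below writes to an in-memory buffer so it can run without touching the disk; `rows_to_csv` and the sample rows are made up for the example.

```python
import csv
import io


def rows_to_csv(rows: list) -> str:
    """Serialize a list of dicts to CSV text, using the union of all keys as columns."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen column order

    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sample = [
    {"title": "A", "Price": 51.77},
    {"title": "B", "Quantity": 20},  # no "Price" key: the cell is left empty
]
csv_text = rows_to_csv(sample)
print(csv_text)
```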

### Export all books to json

In [197]:
def convert_dict_to_json():
    # Convert and write JSON object to file
    with open("all_books.json", "w") as outfile: 
        json.dump(book_list, outfile)
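
By default `json.dump` escapes non-ASCII characters, so a price such as `£50.10` is stored as `\u00a350.10`. That is valid JSON and loads back unchanged, but it is harder to read in a text editor. The short sketch below compares the default with `ensure_ascii=False` on a made-up sample record.

```python
import json

# Sample record shaped like the scraped data (values are illustrative).
books = [{"title": "Soumission", "Price (excl. tax)": "£50.10"}]

escaped = json.dumps(books)                                 # non-ASCII escaped as \uXXXX
readable = json.dumps(books, ensure_ascii=False, indent=2)  # keeps "£", pretty-printed

print(escaped)
print(readable)

# Both forms decode to the same Python objects.
round_trip_ok = json.loads(escaped) == json.loads(readable) == books
print(round_trip_ok)
```

When writing the readable form to disk, opening the file with `encoding="utf-8"` avoids depending on the platform's default encoding.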

### Export all books to text file

In [161]:
def convert_dict_to_textfile():
    with open('all_books.txt', 'w') as convert_file:
        convert_file.write(json.dumps(book_list))

### Download Images of all books

In [211]:
def download_images():
    image_url_list = list(map(lambda x : x["image"], book_list))

    # Create the output directory once, before the loop
    img_dir = "book_images"
    os.makedirs(img_dir, exist_ok = True)

    for i in range(len(image_url_list)):
        img_data = requests.get(image_url_list[i])

        if img_data.status_code != 200:
            print(f"Failed to download {image_url_list[i]} with error {img_data.status_code}")
            continue

        # Use the last URL segment as the file name
        filename = image_url_list[i].split('/')[-1]

        with open(f"{img_dir}/{filename}", 'wb') as f:
            f.write(img_data.content)

        print(f"Downloaded {i + 1}/{len(image_url_list)}")

In [212]:
download_images()

Downloaded 1/20
Downloaded 2/20
Downloaded 3/20
Downloaded 4/20
Downloaded 5/20
Downloaded 6/20
Downloaded 7/20
Downloaded 8/20
Downloaded 9/20
Downloaded 10/20
Downloaded 11/20
Downloaded 12/20
Downloaded 13/20
Downloaded 14/20
Downloaded 15/20
Downloaded 16/20
Downloaded 17/20
Downloaded 18/20
Downloaded 19/20
Downloaded 20/20
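
Taking the file name with `split('/')[-1]` works for the URLs above, but it would keep any query string and return an empty name for a URL that ends in a slash. A slightly more defensive sketch, using only the standard library, is shown below. The function name `image_path` and the sample URLs are made up for the example.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def image_path(image_url: str, img_dir: str = "book_images") -> Path:
    """Build the local path for an image URL, ignoring any query string or fragment."""
    name = PurePosixPath(urlparse(image_url).path).name
    if not name:
        raise ValueError(f"cannot derive a file name from {image_url!r}")
    return Path(img_dir) / name


plain = image_path("https://books.toscrape.com/media/cache/fe/72/example.jpg")
with_query = image_path("https://example.com/media/example.jpg?size=large")

print(plain)
print(with_query)
```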
