# Using Web Scraping For Your Hobby Or:

# One Of My Favourite Designers Anna Maria Horner on Sew Hot UK
 

In this notebook, I build a web scraper that works with a search term, in my case "Anna Maria Horner". The website shows the search results with 30 hits per page. The scraper below only gathers information from the first page.

The first step is always to check the Sew Hot UK robots.txt. It shows that there are only a few rules to consider:

User-agent: *

Disallow: /wp-admin/

Allow: /wp-admin/admin-ajax.php
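
The rules above can also be checked programmatically. This is a minimal sketch using `urllib.robotparser` from the standard library. It parses the three lines quoted above instead of downloading the file, and the `/wp-admin/` test URL is only there for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt excerpt above (not fetched live)
rules = """\
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The search URL used in this notebook, plus an admin URL for comparison
search_url = "https://www.sewhot.co.uk/?s=Anna+Maria+Horner&post_type=product&product_cat=0"
admin_url = "https://www.sewhot.co.uk/wp-admin/"

search_allowed = parser.can_fetch("*", search_url)
admin_allowed = parser.can_fetch("*", admin_url)
print(search_allowed, admin_allowed)
```

To check the live file instead, one could call `parser.set_url(...)` and `parser.read()` before `can_fetch()`.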


So, let's start and find some products.

## Packages

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time

## Initiating request

I started with a timeout of 2 seconds, which was not enough, so I set it to 5. Printing the status code gives 200, so we are good to go.

In [2]:
url = "https://www.sewhot.co.uk/?s=Anna+Maria+Horner&post_type=product&product_cat=0"

page = requests.get(url, timeout=5)
print(page.status_code)

200
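
The search URL above carries the search term in its query string. Here is a small sketch of how the same URL could be built for any search term with the standard library. The helper name `build_search_url` is my own invention, and the parameter names are simply copied from the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.sewhot.co.uk/"

def build_search_url(search_term):
    """Build the product search URL for a search term (hypothetical helper)."""
    # Parameter names and values are taken from the URL used in this notebook
    params = {"s": search_term, "post_type": "product", "product_cat": 0}
    # urlencode() turns spaces into "+" and escapes special characters
    return BASE_URL + "?" + urlencode(params)

url = build_search_url("Anna Maria Horner")
print(url)
```

The result can be passed to `requests.get(url, timeout=5)` exactly as before.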


## BeautifulSoup

In [3]:
annamaria = BeautifulSoup(page.content, "html.parser")

## Finding Each Product on the First Page

Using Inspect on the Sew Hot UK website showed that each product sits in a div tag with class="product-element-bottom product-information". I use find_all() to find all the individual products and store them in products. There are 30 products per page, so len() should return 30. Which it does, yeah!

In [4]:
products = annamaria.find_all("div", {"class" : "product-element-bottom product-information"})
len(products)

30
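
One thing to keep in mind: passing the full class string to find_all() matches the class attribute exactly as written. A CSS selector via select() is a possible alternative that also matches when the classes appear in a different order or next to extra classes. The sketch below uses a tiny hand-written HTML sample, not the real Sew Hot UK page, to show the difference.

```python
from bs4 import BeautifulSoup

# Hand-written sample (an assumption for illustration, not the real page):
# the second div lists its classes in another order and has an extra class
sample_html = """
<div class="product-element-bottom product-information"><a href="#a">A</a></div>
<div class="product-information extra product-element-bottom"><a href="#b">B</a></div>
<div class="something-else"><a href="#c">C</a></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Exact match on the class attribute string
exact = soup.find_all("div", {"class": "product-element-bottom product-information"})

# CSS selector: any div that carries both classes, in any order
by_selector = soup.select("div.product-element-bottom.product-information")

print(len(exact), len(by_selector))
```

On the real result page both approaches should give the same 30 products, as long as the site keeps the class attribute as shown above.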

## Checking out the first product

As a starting point, I am interested in the product title, the price and the link to the specific product page. All of this information is in my result, as the printout of the first product shows.

In [5]:
products[0]

<div class="product-element-bottom product-information">
<h3 class="wd-entities-title"><a href="https://www.sewhot.co.uk/product/anna-marias-welcome-home-quilt-kit/">Anna Maria’s Welcome Home Quilt Kit</a></h3> <div class="product-rating-price">
<div class="wrapp-product-price">
<span class="price"><span class="woocommerce-Price-amount amount"><bdi><span class="woocommerce-Price-currencySymbol">£</span>285.00</bdi></span></span>
</div>
</div>
<div class="fade-in-block wd-scroll">
<div class="hover-content wd-more-desc">
<div class="hover-content-inner wd-more-desc-inner">
					Anna Maria Horner’s Welcome Home Quilt Kits have finally arrived! It’s not too late to join in on the fun				</div>
<a aria-label="Read more description" class="wd-more-desc-btn" href="#" rel="nofollow"><span></span></a>
</div>
<div class="wd-bottom-actions">
<div class="wrap-wishlist-button"></div>
<div class="wd-add-btn wd-add-btn-replace">
<input class="wood-quick-quantity" min="1" name="quantity" step="1" sty

## Gathering product name and link to the specific product page

For this task, I started with the first element in the results (products). I then tested whether my code obtains the right information by applying it to the second result as well.

Both the product name and the link can be extracted from the first anchor tag ("a"): .get_text() returns the product name and .get("href") returns the link.

In [6]:
products[0].find("a").get_text()

'Anna Maria’s Welcome Home Quilt Kit'

In [7]:
products[1].find("a").get_text()

'Made My Day Canna Toffee'

In [8]:
products[0].find("a").get("href")

'https://www.sewhot.co.uk/product/anna-marias-welcome-home-quilt-kit/'

In [9]:
products[1].find("a").get("href")

'https://www.sewhot.co.uk/product/made-my-day-canna-toffee/'

## Searching for the price

To extract the price, I had to try a few things out. First, I used the span tag with class=price. The output shows that it contains both the price and the currency. With indexing and the .get_text() function, I could fetch currency and price.

In [10]:
products[0].find_all("span", {"class" : "price"})

[<span class="price"><span class="woocommerce-Price-amount amount"><bdi><span class="woocommerce-Price-currencySymbol">£</span>285.00</bdi></span></span>]

In [11]:
products[0].find_all("span", {"class" : "price"})[0].get_text()

'£285.00'

Another possibility is to search for the bdi tag. The result is a bit narrower.

In [12]:
products[0].find("bdi")

<bdi><span class="woocommerce-Price-currencySymbol">£</span>285.00</bdi>

In [13]:
products[0].find("bdi").get_text()

'£285.00'

One might want to separate the currency symbol from the number. Of course, everything is in pounds on this page, so the currency can later go into a column name, if it is of interest. The price can also be converted from string type to float type.

In [19]:
price = products[0].find_all("bdi")[0].get_text()
price = float(price.split("£")[1])
price

285.0
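
Splitting on "£" works for the prices seen here. A slightly more defensive variant is sketched below with a regular expression. The helper name `parse_price` is hypothetical, and the cases with a thousands separator or without any number are assumptions about what the site might show, not something I observed on this page.

```python
import re

def parse_price(text):
    """Return the first number in a price string as a float, or None if there is none."""
    # Drop thousands separators, then look for digits with an optional decimal part
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None

print(parse_price("£285.00"))    # price as shown on the page
print(parse_price("£1,285.00"))  # assumed format with a thousands separator
print(parse_price("Sold out"))   # assumed text without a price
```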

## Creating a DataFrame

Besides product name, product page and price, I wanted to add another variable: product category. This information can be found in the product name, which is where I want to start. An alternative would of course be to check the product page for any information on that. But let's start with the result page entries and the information we can get there.

I created a function find_product_category() which assigns each product to a product category, depending on whether the keyword pointing to a certain category is included in the product name or not. I purposely did not split the product name afterwards and left it as is. I just wanted to create a column which I can easily browse for the different product categories. I call the function inside my loop, which builds my DataFrame step by step (see below).

In [15]:
def find_product_category(product_name):
    if "Fat Quarter Bundle" in product_name:
        product_category = "Fat Quarter Bundle"
    elif "Kit" in product_name:
        product_category = "Quilt Kit"
    elif "Pattern" in product_name:
        product_category = "Pattern"
    elif "Thread" in product_name:
        product_category = "Thread"
    elif "Template" in product_name:
        product_category = "Templates"
    else:
        product_category = "Bulk Goods"

    return product_category
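
The if/elif chain above can also be written as an ordered lookup table, which makes it easier to add categories later. This is only a sketch of that idea: the name `find_product_category_from_table` is my own, the keywords are copied from the function above, and returning None for a missing name is an assumption about how missing data should be handled.

```python
# Ordered (keyword, category) pairs copied from find_product_category();
# the first matching keyword wins, so the order matters
CATEGORY_KEYWORDS = [
    ("Fat Quarter Bundle", "Fat Quarter Bundle"),
    ("Kit", "Quilt Kit"),
    ("Pattern", "Pattern"),
    ("Thread", "Thread"),
    ("Template", "Templates"),
]

def find_product_category_from_table(product_name, default="Bulk Goods"):
    """Return the category of the first keyword found in the product name."""
    if not product_name:
        # No name, no category (assumption: missing names stay uncategorised)
        return None
    for keyword, category in CATEGORY_KEYWORDS:
        if keyword in product_name:
            return category
    return default

print(find_product_category_from_table("Visions Quilt Acrylic Template Set"))
print(find_product_category_from_table("Made My Day Canna Toffee"))
```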

In [16]:
product_list = []

for product in products:

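    # Product name: text of the first anchor tag
    try:
        product_name = product.find("a").get_text()
    except AttributeError:
        product_name = None

    # Only categorise when a name was found; find_product_category(None) would raise a TypeError
    product_category = find_product_category(product_name) if product_name else None
    
    # Link to the product page: href of the same anchor tag
    try:
        product_page = product.find("a").get("href")
    except AttributeError:
        product_page = None

    # Price: strip the pound sign and convert to float
    try:
        price_in_gbp = product.find("bdi").get_text()
        price_in_gbp = float(price_in_gbp.split("£")[1])
    except (AttributeError, IndexError, ValueError):
        # No bdi tag, no pound sign, or a price that is not a plain number
        price_in_gbp = None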


    product_list.append({
                "Product Name": product_name,
                "Product Category" : product_category,
                "Product Page": product_page,
                "Price in GBP": price_in_gbp
            })
    
df = pd.DataFrame(product_list)
df

| | Product Name | Product Category | Product Page | Price in GBP |
|---|---|---|---|---|
| 0 | Anna Maria’s Welcome Home Quilt Kit | Quilt Kit | https://www.sewhot.co.uk/product/anna-marias-w... | 285.0 |
| 1 | Made My Day Canna Toffee | Bulk Goods | https://www.sewhot.co.uk/product/made-my-day-c... | 15.5 |
| 2 | Made My Day Canna Jade | Bulk Goods | https://www.sewhot.co.uk/product/made-my-day-c... | 15.5 |
| 3 | Pathways Quilt Pattern – downloadable | Pattern | https://www.sewhot.co.uk/product/pathways-quil... | 10.0 |
| 4 | Aurifil Thread Labs Subscription Box 2.0 | Thread | https://www.sewhot.co.uk/product/aurifil-threa... | 66.0 |
| 5 | Tambourine Stitchery Brass | Bulk Goods | https://www.sewhot.co.uk/product/tambourine-st... | 14.0 |
| 6 | Our Fair Home Peony Heather 108″ Wideback | Bulk Goods | https://www.sewhot.co.uk/product/our-fair-home... | 29.9 |
| 7 | Visions Quilt Acrylic Template Set | Templates | https://www.sewhot.co.uk/product/visions-quilt... | 64.0 |
| 8 | Visions Quilt Pattern | Pattern | https://www.sewhot.co.uk/product/visions-quilt... | 29.0 |
| 9 | Our Fair Home Fat Quarter Bundle | Fat Quarter Bundle | https://www.sewhot.co.uk/product/our-fair-home... | 69.95 |


## Write DataFrame to csv file

I am storing the DataFrame as a csv file, as I would like to use it later on in another project.

In [17]:
df.to_csv('amh_firstpage.csv', index=False) 

## Outlook

Two things can be done on top. The first is to iterate through every page of my search results (as given by the Sew Hot UK website) and store all the available products plus their info. The second is to access each product page and extract even more details on each product.
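
For the first idea, a rough sketch of the paging loop could look like this. I have not checked how Sew Hot UK builds the URLs of its later result pages, so fetching is left to a `fetch_page` function that the caller supplies. The names `scrape_all_pages`, `fetch_page` and `parse_products` are hypothetical, and the demo uses stand-in data instead of real requests.

```python
import time

def scrape_all_pages(fetch_page, parse_products, max_pages=50, delay_seconds=1.0):
    """Collect products page by page until a page yields no products."""
    all_products = []
    for page_number in range(1, max_pages + 1):
        html = fetch_page(page_number)    # e.g. a requests.get(...) call
        products = parse_products(html)   # e.g. the BeautifulSoup code above
        if not products:
            break                         # an empty page means we are done
        all_products.extend(products)
        time.sleep(delay_seconds)         # be polite: pause between requests
    return all_products

# Demo with stand-in functions instead of real HTTP requests
fake_pages = {1: "a,b,c", 2: "d,e", 3: ""}
demo = scrape_all_pages(
    fetch_page=lambda n: fake_pages.get(n, ""),
    parse_products=lambda html: [p for p in html.split(",") if p],
    delay_seconds=0,
)
print(demo)
```

Stopping at the first empty page and capping the loop with `max_pages` keeps the scraper from running forever if the site behaves differently than expected.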