# Beginner version - Using Web Scraping For Your Hobby Or:

# One Of My Favourite Designers Anna Maria Horner
 

In this notebook, I build a web scraper that works with one search term, in my case "Anna Maria Horner". As the HTML structure differs from website to website, this scraper only works with the site I scraped, but it can easily be adapted.

The following example serves as an introduction: it contains some explanations and includes only one function at the end. Treat it as a prototype: the code is organized in an imperative style, as a series of statements.

The other notebooks of this repo take a different approach: they work with functions, which also include docstrings.


The website - a patchwork shop in the UK - shows me the search result with 30 hits per page. The following scraper only gathers information from the first page.


## First step: Respect Scraping Ethics!

**The scraped content belongs to the website owner. Check robots.txt and the Terms of Use! If in doubt, scrape websites which have an API. Making requests to a website can take a toll on its performance, so please do not disrupt its regular functioning (too many requests, huge requests...)!**

The first step is always to check the online shop's robots.txt. It shows me that there are only a few things to note:

User-agent: *

Disallow: /wp-admin/

Allow: /wp-admin/admin-ajax.php
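
The rules quoted above can also be checked programmatically. The following is a minimal sketch using Python's built-in urllib.robotparser. The domain example.com and the query string are placeholders of mine, not the real shop URL, and the rules are passed in as a list so that no network request is needed.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, as they appear in the shop's robots.txt
robots_lines = [
    "User-agent: *",
    "Disallow: /wp-admin/",
    "Allow: /wp-admin/admin-ajax.php",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse the rules without a network request

# example.com and the query string are placeholders, not the real shop URL
search_allowed = parser.can_fetch("*", "https://example.com/?s=anna+maria+horner")
admin_allowed = parser.can_fetch("*", "https://example.com/wp-admin/")

print(search_allowed, admin_allowed)
```

For a live site, one could call set_url() and read() on the parser instead of parse(), which downloads the current robots.txt.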


So, let's start and find some products.

## Packages

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time

## Initiating request

The requests library needs a working URL to the website which you would like to scrape. Mine includes the search term in the path/GET parameters. The easiest way to get the URL is to go to the website, find the page you want to scrape and copy the link.
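
If you would rather build the URL in code than copy it, here is a small sketch with the standard library. The base URL and the parameter names s and post_type are assumptions for illustration only; check the address bar of your own search result to see what the site really uses.

```python
from urllib.parse import urlencode

# Placeholder values: the real shop URL and parameter names may differ
base_url = "https://example.com/"
query = {"s": "Anna Maria Horner", "post_type": "product"}

# urlencode() escapes spaces and special characters in the search term
url = f"{base_url}?{urlencode(query)}"
print(url)
```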

You don't have to set the timeout parameter, but be aware that requests does not time out by default: without it, a request can hang indefinitely. That is why I often see it set to 2 seconds. In my case the timeout had to be set to 5, after starting with 2, which threw a timeout error.

The following code executes an HTTP request to the specified URL. It retrieves the HTML data that the server sends back and saves this data in a Python object: requests.Response().

I am printing the status code, which prints out 200. Good to go.

In [2]:
url = "webpage you want to scrape"

page = requests.get(url, timeout=5)
print(page.status_code)

200
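
A timeout error like the one I ran into can also be caught instead of stopping the notebook. The helper below is only a sketch: fetch_page is a hypothetical name of mine, not part of requests, and returning None on failure is just one possible design choice.

```python
import requests


def fetch_page(url, timeout=5):
    """Return the Response for url, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx status codes into exceptions
    except requests.exceptions.Timeout:
        print(f"No answer within {timeout} seconds: {url}")
        return None
    except requests.exceptions.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response
```

Calling fetch_page(url) would then replace the plain requests.get(url, timeout=5) call, followed by a check for None before parsing.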


In [3]:
type(page)

requests.models.Response

The output is a requests.Response object. There are a number of key properties and methods available on a Response object. Above we saw .status_code, which stores the integer HTTP status code of the response, such as 200 or 404.

In the following, we are going to use .content, which contains the body of the response as bytes.

## BeautifulSoup

What happened so far: I scraped HTML code from the website and it is stored in the requests.Response() object which I called page. The scraped code still contains a lot of HTML tags and attributes. That's why I would like to make the data more readable and pick out what interests me.

With Beautiful Soup, another Python library, you can parse structured data (like HTML). The library provides a number of intuitive functions that you can use to analyze the HTML you receive. See the documentation here: https://beautiful-soup-4.readthedocs.io/en/latest/ . 

The following line of code creates a BeautifulSoup object that receives the HTML content as input. When the object is instantiated, you also instruct Beautiful Soup to use the corresponding parser. In the example below, this is the HTML parser.

In [4]:
annamaria = BeautifulSoup(page.content, "html.parser")

In [5]:
type(annamaria)

bs4.BeautifulSoup

## Finding Each Product on the First Page

How to look at the HTML code of a web page: use the Inspect functionality (Firefox browser) on the respective page.

How to look for certain parts of the web page in the HTML code: you can hover over the part that you are interested in. This will highlight the respective web page section. Use the drop-down arrows in the HTML inspector to isolate the part even more. You will observe HTML tags, classes and attributes, which can also be found in the BeautifulSoup object. With the BeautifulSoup library, one can interact with HTML in a way similar to the Inspect functionality.

I found that each product can be detected by the tag div, filtered by class="product-element-bottom product-information". I use find_all() to find all the individual products in the annamaria variable.

I store the result in a variable called products. There are at most 30 products per page. We know that there are several result pages, so len(products) should return 30. Which it does, yeah!

In [4]:
products = annamaria.find_all("div", {"class" : "product-element-bottom product-information"})
len(products)

30

In [5]:
type(products)

bs4.element.ResultSet
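
One detail worth knowing about the filter used above: when the class string contains two class names, Beautiful Soup compares it with the exact value of the class attribute, so the order matters. The sketch below uses made-up HTML, not the shop's real markup, to compare this with a CSS selector via select(), which ignores the order.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the structure described above
html = """
<div class="product-element-bottom product-information"><a href="/p/1">Item One</a></div>
<div class="product-information product-element-bottom"><a href="/p/2">Item Two</a></div>
<div class="sidebar"><a href="/about">About</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# A string with two classes is matched against the exact attribute value
exact = soup.find_all("div", {"class": "product-element-bottom product-information"})

# A CSS selector matches both classes regardless of their order
by_selector = soup.select("div.product-element-bottom.product-information")

print(len(exact), len(by_selector))
```

On the site I scraped, the exact string found all 30 products, so either variant would do there.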

## Checking out the first product

As a starting point, I am interested in the product title, the price and the link to the specific product page. I can find all this information in my result, as the print out of the first product shows. 

Note: you can use indexing or slicing to access any of the 30 products stored in the variable.

In [None]:
products[0]

## Gathering product name and link to the specific product page

To achieve that task, I started with the first element in the products variable. I tested whether my code obtains the right information by applying it to the second element in products as well.

Product name and link can both be extracted from the first anchor tag ("a"): .get_text() returns the product name, and .get("href") returns the link to the product page.

In [7]:
products[0].find("a").get_text()

'Anna Maria’s Welcome Home Quilt Kit'

In [8]:
products[1].find("a").get_text()

'Made My Day Canna Toffee'

In [None]:
products[0].find("a").get("href")

In [None]:
products[1].find("a").get("href")
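
In case the href values on a site turn out to be relative (starting with "/" instead of "https://"), they can be turned into full links with urljoin from the standard library. The URLs below are placeholders for illustration, not real product pages.

```python
from urllib.parse import urljoin

# Placeholder values: the page that was scraped and two kinds of href
page_url = "https://example.com/shop/?s=anna+maria+horner"
relative_href = "/product/welcome-home-quilt-kit/"
absolute_href = "https://example.com/product/made-my-day/"

# Relative links are resolved against the page URL, absolute ones stay unchanged
full_from_relative = urljoin(page_url, relative_href)
full_from_absolute = urljoin(page_url, absolute_href)

print(full_from_relative)
print(full_from_absolute)
```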

## Searching for the price

To be able to extract the price, I had to try a few things out. First, I used the "span" tag with class="price". One can see that both the price and the currency are in there. With indexing and the .get_text() method, I could fetch currency and price.

In [11]:
products[0].find_all("span", {"class" : "price"})

[<span class="price"><span class="woocommerce-Price-amount amount"><bdi><span class="woocommerce-Price-currencySymbol">£</span>285.00</bdi></span></span>]

In [12]:
products[0].find_all("span", {"class" : "price"})[0].get_text()

'£285.00'

Another possibility is to search for the "bdi" tag. The result is a bit narrower.

In [13]:
products[0].find("bdi")

<bdi><span class="woocommerce-Price-currencySymbol">£</span>285.00</bdi>

In [14]:
products[0].find("bdi").get_text()

'£285.00'

One might want to separate currency symbol from number. Of course, everything is in pounds on this page. This can later be accounted for in the column name, if it is of interest. Also, one can transform the price from string type to float type.

In [15]:
price = products[0].find_all("bdi")[0].get_text()
price = float(price.split("£")[1])
price

285.0
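
Splitting on "£" works for the prices on this page, but float() would fail on a value with a thousands separator such as "£1,285.00". A slightly more forgiving variant is sketched below. parse_price is a hypothetical helper of mine, and the assumption that a comma is always a thousands separator only holds for UK-style prices.

```python
import re


def parse_price(text):
    """Return the first number in a price string as float, or None."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    # drop thousands separators such as the comma in "1,285.00"
    return float(match.group().replace(",", ""))


print(parse_price("£285.00"))
print(parse_price("£1,285.00"))
print(parse_price("Sold out"))
```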

## Creating a DataFrame with all the products

Besides product name, product page and price, I wanted to add another variable: product category. The information can be found in the product name, which is where I want to start. An alternative would of course be to check the product page for that information. But let's start with the result page entries and the information we can get there.

I created a function find_product_category() where I assigned each product to a product category, depending on whether the respective keyword pointing to a certain category was included in the product name or not. I purposely did not split the product name afterwards, and left it as is. I just wanted to create a column which I can easily browse for the different product categories. I included my function in my loop, which will step by step create my DataFrame (see below).

In [16]:
def find_product_category(product_name):
    if "Fat Quarter Bundle" in product_name:
        product_category = "Fat Quarter Bundle"
    elif "Kit" in product_name:
        product_category = "Quilt Kit"
    elif "Pattern" in product_name:
        product_category = "Pattern"
    elif "Thread" in product_name:
        product_category = "Thread"
    elif "Template" in product_name:
        product_category = "Templates"
    else:
        product_category = "Bulk Goods"

    return product_category
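
If the list of categories grows, the same rules can be kept in a list of pairs instead of a chain of elif branches. This is just an alternative sketch with the keywords from the function above. CATEGORY_RULES and categorize are names I made up for it.

```python
# (keyword, category) pairs, checked in order; same rules as the function above
CATEGORY_RULES = [
    ("Fat Quarter Bundle", "Fat Quarter Bundle"),
    ("Kit", "Quilt Kit"),
    ("Pattern", "Pattern"),
    ("Thread", "Thread"),
    ("Template", "Templates"),
]


def categorize(product_name, default="Bulk Goods"):
    """Return the category of the first keyword found in product_name."""
    for keyword, category in CATEGORY_RULES:
        if keyword in product_name:
            return category
    return default


print(categorize("Anna Maria’s Welcome Home Quilt Kit"))
print(categorize("Made My Day Canna Toffee"))
```

Adding a new category then only means adding one line to the list.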

In [None]:
product_list = []

for product in products:

    try:
        product_name = product.find("a").get_text()
    except AttributeError:
        product_name = None

    # guard: a missing name would make the "in" checks raise a TypeError
    product_category = find_product_category(product_name) if product_name else None
    
    try:
        product_page = product.find("a").get("href")
    except AttributeError:
        product_page = None

    try:
        price_in_gbp = product.find("bdi").get_text()
        price_in_gbp = float(price_in_gbp.split("£")[1])
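    # also covers a price without "£" (IndexError) or a non-numeric value (ValueError)
    except (AttributeError, IndexError, ValueError):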
        price_in_gbp = None


    product_list.append({
                "Product Name": product_name,
                "Product Category" : product_category,
                "Product Page": product_page,
                "Price in GBP": price_in_gbp
            })
    
df = pd.DataFrame(product_list)
df.head()

## Write DataFrame to csv file

I am storing the DataFrame as a csv file, as I would like to use it later on in another project. You can store the csv file right within the same folder by just using df.to_csv(). Refer to the pandas documentation for parameters. I used another approach.

I want to add the csv file to a separate folder called data. The path to my folder is stored in a text file, path_to_data.txt. You obviously don't have to do that. I extract the path and combine it with the file name using os.path.join(), then pass the result to to_csv().


I added the path_to_data.txt file to my .gitignore. If needed, one can add the data folder to the .gitignore file as well to avoid pushing it to GitHub.



In [18]:
import os

In [19]:
with open('path_to_data.txt') as file:
    # the first line of the file holds the path to the data folder
    path = file.readline().strip()

In [20]:
df.to_csv(os.path.join(path,r'amh_first_result_page.csv'))

In [None]:
# if one does not want to follow the steps above to store the DataFrame in an extra folder,
# uncomment the following line to store the csv in the current working folder

#df.to_csv('amh_first_result_page.csv')
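
The same idea can be written with pathlib, which also allows creating the data folder if it does not exist yet. This is a sketch only: build_csv_path is a hypothetical helper, and it assumes, as above, that the first line of the text file holds the folder path.

```python
from pathlib import Path


def build_csv_path(path_file, file_name):
    """Read the data folder from path_file and return the full csv path."""
    first_line = Path(path_file).read_text().splitlines()[0]
    data_dir = Path(first_line.strip())
    data_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    return data_dir / file_name
```

The result can be passed straight to pandas, for example df.to_csv(build_csv_path("path_to_data.txt", "amh_first_result_page.csv")).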

## Outlook

Two things can be done on top. One is to iterate through every result page of my search, thereby storing all the available products plus their info. The other is to access each product's own page and extract even more details on each product.
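
Going through several result pages could look roughly like the sketch below. It pauses between requests to keep the load on the shop low, in line with the scraping ethics from the beginning. The URL pattern with /page/n/ is an assumption that has to be checked against the real site, and fake_fetch only stands in for the actual request so that the example runs without network access.

```python
import time


def scrape_pages(page_urls, fetch, delay_seconds=2.0):
    """Call fetch(url) for every URL, pausing between requests."""
    results = []
    for index, url in enumerate(page_urls):
        if index > 0:
            time.sleep(delay_seconds)  # be polite: wait before the next request
        results.append(fetch(url))
    return results


# Placeholder URL pattern; check how the shop numbers its result pages
page_urls = [f"https://example.com/page/{n}/?s=anna+maria+horner" for n in range(1, 4)]


def fake_fetch(url):
    """Stand-in for the real request; returns a dummy string."""
    return f"html of {url}"


pages = scrape_pages(page_urls, fake_fetch, delay_seconds=0)
print(len(pages))
```

In a real run, fetch would be a function that requests the page and returns the list of products found on it.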