# A little more advanced: Using Web Scraping for Your Hobby

This notebook builds on the previous one, "WebScraping_SearchResult_FirstPage". The code is organized into helper functions, followed by a main function.

The helper functions, though not hard to understand, include docstrings, both for practice and as usage documentation.

I included an input function that asks for a search term, in case someone would like to search for other products or designers.

The steps include building a DataFrame from the retrieved information of interest and storing it as a CSV file.

In [1]:
import pandas as pd
import requests

from bs4 import BeautifulSoup

## Helper Functions

In [2]:
def ask_for_search_term():

    """
    Asks the user what they would like to search for.
    
    Uses Python's built-in input() function.
    Returns the search term entered by the user as a string.
    
    :return: user_input - search term chosen by the user
    :rtype: string
    """
    
    user_input = input("What designer or product would you like to search for? ")

    return user_input

In [3]:
def process_input(search_phrase):

    """
    Processes the input from the user for further steps.
    
    Parameter search_phrase takes the search term the user defines.
    Converts all letters to lower case and replaces spaces with plus signs.
    Returns processed_phrase in the format required for the URL on SEW HOT UK.
    
    :param search_phrase: A string storing the search term/query
    :type search_phrase: string
    :return: processed_phrase
    :rtype: string
    """
    
    processed_phrase = search_phrase.lower().replace(" ", "+")

    return processed_phrase
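
Lower-casing and swapping spaces for plus signs works for plain search terms. For terms that contain characters such as `&`, a more defensive variant could let the standard library handle the encoding. The following is a minimal sketch, assuming the same query parameters that `main()` uses further down in this notebook; `build_search_path` is a hypothetical helper name and not part of the notebook.

```python
from urllib.parse import urlencode


def build_search_path(search_phrase):
    """Build the path part of the search URL (hypothetical helper)."""
    params = {
        "s": search_phrase.lower(),  # lower-cased, as in process_input
        "post_type": "product",
        "product_cat": "0",
    }
    # urlencode applies quote_plus by default: spaces become "+",
    # reserved characters such as "&" are percent-encoded.
    return "/?" + urlencode(params)


print(build_search_path("Anna Maria Horner"))
# /?s=anna+maria+horner&post_type=product&product_cat=0
```

The result can be passed to `initiate_request` as `path` in place of the hand-built f-string.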

In [4]:
def initiate_request(base, path):

    """
    Sends a GET request to SEW HOT UK.

    Uses requests.get() to connect to the given website and gather its content.
    Prints a message stating whether the GET request was successful.
    The URL of the website is defined by the parameters base and path.
    Base contains the base URL.
    Path is the remainder of the URL and includes, among other things, the query.
    The function returns page.
    
    :param base: A string storing the base URL
    :type base: string
    :param path: A string storing the path of the URL
    :type path: string
    
    :return: The response of the called webpage.
    :rtype: requests.Response object
    """

    page = requests.get(base + path, timeout=5)
    status = page.status_code
    
    if status != 200:
        print(f"The returned status code is: {status}. "
              "Please check the underlying issue.")
    else:
        print("Your status code is 200, you're good to go!")

    return page

In [5]:
def parse_content(result):

    """
    Parses the HTML code from the GET request stored in .content.
    
    Uses BeautifulSoup() for parsing and extracting data from HTML code.
    One parameter: result, a requests.Response object with HTML content.
    Returns a BeautifulSoup object built with the html parser.

    :param result: A Response object which stores the HTML received.
    :type result: requests.Response object

    :return: parsed HTML document/code
    :rtype: bs4.BeautifulSoup object
    """
    
    parsed_result = BeautifulSoup(result.content, "html.parser")

    return parsed_result

In [12]:
def get_products(website_content):

    """
    Gets every single product stored in the response/result.
    
    Uses BeautifulSoup's .find_all() method with the required tag/class.
    The parameter website_content is a bs4.BeautifulSoup object.
    The function checks whether any product was found for the search term.
    If not, it prints a message and returns an empty list.
    If yes, it returns the obtained BeautifulSoup result set.

    :param website_content: parsed HTML document/code
    :type website_content: bs4.BeautifulSoup object

    :return: products stores all of the products found in HTML code/response
    :rtype: bs4.element.ResultSet (empty list if nothing was found)
    """

    # Default to an empty list so the return value is always defined
    products = []

    if website_content.find("div", {"class" : "woocommerce-no-products-found"}):
        print("No products were found matching your search term. "
              "Check the spelling, or search for another product.")
    else: 
        products = website_content.find_all(
            "div", {"class" : "product-element-bottom product-information"}
        )
    
    return products
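
To see how the two branches behave without contacting the website, the lookup can be tried on small hand-written HTML snippets. This is a sketch under assumptions: the markup below is invented for illustration and only mimics the class names used in `get_products`, so the real page structure may differ. `select_products` is a hypothetical name. It uses CSS selectors via `.select()`, which match the two classes regardless of the order in which they appear in the attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the class names used above
HTML_WITH_PRODUCTS = """
<div class="product-element-bottom product-information">
  <a href="https://example.com/product/a">Example Quilt Kit</a>
</div>
<div class="product-information product-element-bottom">
  <a href="https://example.com/product/b">Example Pattern</a>
</div>
"""
HTML_WITHOUT_PRODUCTS = '<div class="woocommerce-no-products-found">Nothing</div>'


def select_products(soup):
    """Return matching product blocks, or an empty list if none were found."""
    if soup.select_one("div.woocommerce-no-products-found"):
        return []
    return soup.select("div.product-element-bottom.product-information")


found = select_products(BeautifulSoup(HTML_WITH_PRODUCTS, "html.parser"))
print([item.find("a").get_text() for item in found])
print(select_products(BeautifulSoup(HTML_WITHOUT_PRODUCTS, "html.parser")))
```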

In [7]:
def find_product_category(product_name):

    """
    Fetches the product category for each product.

    The function extracts the product category, which is stored in
    the product name. The only parameter - product_name - is a string.
    If no known keyword is found, the category defaults to "Bulk Goods".
    
    :param product_name: Product name which also includes the product category
    :type product_name: string

    :return: the product category as stored in product name
    :rtype: string
    """
    
    if "Fat Quarter Bundle" in product_name:
        product_category = "Fat Quarter Bundle"
    elif "Kit" in product_name:
        product_category = "Quilt Kit"
    elif "Pattern" in product_name:
        product_category = "Pattern"
    elif "Thread" in product_name:
        product_category = "Thread"
    elif "Template" in product_name:
        product_category = "Templates"
    else:
        product_category = "Bulk Goods"

    return product_category
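
As the number of categories grows, the if/elif chain can become tedious to maintain. One possible alternative, shown here only as a sketch, keeps the keyword-to-category pairs in a list and loops over them in the same order. `CATEGORY_KEYWORDS` and `lookup_product_category` are hypothetical names introduced for this example.

```python
# Keyword -> category pairs, checked in order (first match wins),
# mirroring the if/elif chain in find_product_category
CATEGORY_KEYWORDS = [
    ("Fat Quarter Bundle", "Fat Quarter Bundle"),
    ("Kit", "Quilt Kit"),
    ("Pattern", "Pattern"),
    ("Thread", "Thread"),
    ("Template", "Templates"),
]


def lookup_product_category(product_name, default="Bulk Goods"):
    """Return the category of the first keyword found in product_name."""
    for keyword, category in CATEGORY_KEYWORDS:
        if keyword in product_name:
            return category
    return default


for name in ["Aurifil Thread Labs Subscription Box 2.0", "Made My Day Canna Toffee"]:
    print(name, "->", lookup_product_category(name))
```

Adding a category then means adding one tuple rather than another branch. The order of the list still matters, because the first matching keyword wins.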

In [8]:
def build_products_df(list_products):

    """
    Retrieves the information of interest from the HTML code/response.

    Uses try-except blocks with BeautifulSoup's .find(), .get_text() and
    .get() methods to retrieve the information of interest.
    Adds some formatting steps and calls the find_product_category function.
    Returns a DataFrame with all the products plus the gathered information
    found in the HTML code of the requests response.

    :param list_products: products found in the parsed HTML
    :type list_products: bs4.element.ResultSet

    :return: DataFrame with products plus info found in HTML code response
    :rtype: DataFrame
    
    """
    
    product_list = []

    for product in list_products:

        try:
            product_name = product.find("a").get_text()
        except AttributeError:
            product_name = None
    
        # A missing name cannot be searched for keywords, so skip the lookup
        if product_name is None:
            product_category = None
        else:
            product_category = find_product_category(product_name)
        
        try:
            product_page = product.find("a").get("href")
        except AttributeError:
            product_page = None
    
        try:
            price_in_gbp = product.find("bdi").get_text()
            price_in_gbp = float(price_in_gbp.split("£")[1])
        except (AttributeError, IndexError, ValueError):
            # Also covers prices without a pound sign or a non-numeric value
            price_in_gbp = None
    
    
        product_list.append({
                    "Product Name": product_name,
                    "Product Category" : product_category,
                    "Product Page": product_page,
                    "Price in GBP": price_in_gbp
                })
    
    df = pd.DataFrame(product_list)

    return df
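
The price conversion above relies on the text after the pound sign being a plain number. A slightly more tolerant parser could be sketched as follows. It assumes that prices might contain thousands separators such as `£1,285.50`, which the notebook's output does not confirm, and `parse_price_in_gbp` is a hypothetical helper name.

```python
import re


def parse_price_in_gbp(price_text):
    """Convert text such as "£1,285.50" to a float, or None if that fails."""
    if price_text is None:
        return None
    # Digits (optionally with thousands separators and decimals)
    # directly after the pound sign
    match = re.search(r"£\s*(\d[\d,]*(?:\.\d+)?)", price_text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


print(parse_price_in_gbp("£285.00"))
print(parse_price_in_gbp("Price on request"))
```

Returning `None` for anything that cannot be parsed keeps the behaviour in line with the try-except blocks in `build_products_df`.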

In [10]:
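def store_df_as_csv(dataframe):

    """
    Stores the DataFrame as a CSV file.

    Reads the target directory from the first line of path_to_data.txt
    and writes the DataFrame to amh_first_result_page_adv.csv in it.

    :param dataframe: DataFrame with the scraped product information
    :type dataframe: DataFrame
    """

    import os
    
    with open("path_to_data.txt") as file:
        path = file.readline().rstrip("\n")

    dataframe.to_csv(os.path.join(path, "amh_first_result_page_adv.csv"))

    print("Dataframe was successfully stored as csv file.")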

## Running the Scraper

In [13]:
def main():

    search_term = ask_for_search_term()
    processed_query = process_input(search_term)

    base_url = "https://www.sewhot.co.uk"
    next_page = f"/?s={processed_query}&post_type=product&product_cat=0"

    print("Scraping and building of DataFrame with product info starts now.")
    
    query_result = initiate_request(base_url, next_page)
    parsed_data = parse_content(query_result)
    query_products = get_products(parsed_data)
    query_df = build_products_df(query_products)
    store_df_as_csv(query_df)

if __name__ == "__main__":
    main()

What designer or product would you like to search for?  Anna Maria Horner


Scraping and building of DataFrame with product info starts now.
Your status code is 200, you're good to go!
Dataframe was successfully stored as csv file.


## Check the result

In [14]:
with open("path_to_data.txt") as file:
    path = file.readlines()
path = path[0].replace("\n", "")
filename = "/amh_first_result_page_adv.csv"

In [15]:
df = pd.read_csv(filepath_or_buffer=path+filename)
df.head()

| Unnamed: 0.1 | Unnamed: 0 | Product Name | Product Category | Product Page | Price in GBP |
|---|---|---|---|---|---|
| 0 | 0 | Anna Maria’s Welcome Home Quilt Kit | Quilt Kit | https://www.sewhot.co.uk/product/anna-marias-w... | 285.0 |
| 1 | 1 | Made My Day Canna Toffee | Bulk Goods | https://www.sewhot.co.uk/product/made-my-day-c... | 15.5 |
| 2 | 2 | Made My Day Canna Jade | Bulk Goods | https://www.sewhot.co.uk/product/made-my-day-c... | 15.5 |
| 3 | 3 | Pathways Quilt Pattern – downloadable | Pattern | https://www.sewhot.co.uk/product/pathways-quil... | 10.0 |
| 4 | 4 | Aurifil Thread Labs Subscription Box 2.0 | Thread | https://www.sewhot.co.uk/product/aurifil-threa... | 66.0 |
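
An `Unnamed: 0` column like the one in the output above is what pandas typically produces when a DataFrame's index is written to CSV and then read back as an ordinary column. Below is a minimal sketch of one way to avoid this. It uses a throwaway DataFrame and a temporary directory rather than the notebook's real file, and `df_demo` and `demo.csv` are names made up for the example.

```python
import os
import tempfile

import pandas as pd

# Throwaway data for illustration only
df_demo = pd.DataFrame({
    "Product Name": ["Example Quilt Kit", "Example Pattern"],
    "Price in GBP": [285.0, 10.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "demo.csv")

    # Default behaviour: the index is written as an unnamed first column
    df_demo.to_csv(csv_path)
    with_index = pd.read_csv(csv_path)

    # index=False leaves the index out of the file
    df_demo.to_csv(csv_path, index=False)
    without_index = pd.read_csv(csv_path)

print(list(with_index.columns))
print(list(without_index.columns))
```

Alternatively, the file can be written as before and read with `pd.read_csv(csv_path, index_col=0)`, which turns the first column back into the index.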
