## Data Scraping
Data scraping is a technique in which a computer program extracts data from human-readable output coming from another program.

Beautiful Soup (called bs4 when calling the package in Python) is a Python package for parsing HTML and XML documents. This will be used to extract data from HTML pages.
Urllib is a Python package that collects several modules for working with URLs:
* request
* error
* parse
* robot parser

The following code block loads these packages and imports BeautifulSoup as soup, which means that we can use "soup" when calling BeautifulSoup functions instead of "BeautifulSoup" to simplify the code somewhat. Also, urllib as "uReq" because the code is only importing the request module.

In [1]:
from bs4 import BeautifulSoup as soup  # Library for HTML data structures
from urllib.request import urlopen as uReq  # Library for opening URLs
import re # Library for regular expressions

Next a variable is created to store the website URL of interest. The example used here is lifeinformatica.com for computer central processing unit (CPUs) products:

In [2]:
page_url = "https://lifeinformatica.com/categoria-producto/family-componentes/family-procesadores/"

Next the connection is opened and the HTML page from the URl is downloaded:

In [3]:
uClient = uReq(page_url)

Next the html is parsed into a soup data structure. This will allow navigation through the HTML data in a way similar to json data type. After, the connection is closed to the URL:

In [4]:
page_soup = soup(uClient.read(), "html.parser")
uClient.close()

page_soup now holds all the HTML data from the URL. The data of interest is each CPU product, specifically the name, brand, speed and price. The other infomation contained within the HTML is not of importance to the project right now. By navigating through the HTML structure, it is possible to find the class that holds the data for each CPU product:

In [5]:
containers = page_soup.findAll("div", {"class": "product-inner product-item__inner"})

Next, the out_filename is a variable that stores the name of the output file in csv format. The headers variable is used to write to local disk and header of csv file to be written.

In [6]:

out_filename = "cpu.csv"
headers = "manufacturer,product_name,speed,price \n"


Next the file is opened and the headers are written to the file. The "w" parameter overwrites any existing content:

In [7]:
f = open(out_filename, "w")
f.write(headers)

39

Next the actual data extraction from the HTML structure. The data needed is the Manufacturer, Product Name, Speed of the CPU, and the Price.
Idealy, the HTML structure would be constructed in a way that all the data needed is already seperated into elements. This is not the case for this website. Here, the data can be extracted by first finding the parts needed. Most of the parts are in the title of each container. From here strings can be used to carefully pull out the data parts that are needed, in a systematic apporach, meaning that it works well for all the CPU products in the webpage.

In [8]:
for container in containers:

    element = container.findAll("h2", {"class": "woocommerce-loop-product__title"})[0]

    #Manufacturer
    manufacturer = element.text.split(' ', 1)[0]

    #Speed
    full_title = element.text 
    split_word = 'GHz'   
    if manufacturer == "AMD":
        speed = full_title.partition(split_word)[0].split(' ', 4)[4]
    else:
        speed = re.search('([^x]+)'+speed, split_word).group(1)

    #Product Name
    if manufacturer == "AMD":
        product_name = re.search(manufacturer+'(.*?)'+speed, full_title).group(1)
    else:
        product_name = re.search(manufacturer+'(.*?)'+speed, full_title).group(1)

    #Price
    price = container.findAll("span", {"class": "woocommerce-Price-amount amount"})[0].text.strip().replace("€", "")

    print("manufacturer: " + manufacturer + "\n")
    print("product_name: " + product_name + "\n")
    print("speed: " + speed + "\n")
    print("price: " + price + "\n")

manufacturer: AMD

product_name:  Ryzen 5 3400G 

speed: 4.2 

price: 139,90



AttributeError: 'NoneType' object has no attribute 'group'