# Purpose

The overall goal of this exercise is to generate the billings for a grocery store/supermarket, so that we can create a recommendation system informed by customer purchases.

This notebook downloads a sample of Tesco's groceries, which will serve as the list of products we have in our grocery shop.

To do this, we will first go through the HTML elements we need, then work through a way of downloading the sample for each product category (fresh food, bakery, etc.).

In [9]:
import requests
from bs4 import BeautifulSoup
from pprint import pprint  # for pretty printing
import json  # for saving the results

We can get the contents of the page by sending a GET request to the fresh-food page: https://www.tesco.com/groceries/en-GB/shop/fresh-food/all.

In [4]:
resp = requests.get(
    'https://www.tesco.com/groceries/en-GB/shop/fresh-food/all',
    headers={'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
)
print(resp.text)

<!DOCTYPE html>
<html class="no-js" lang="en" data-base-static-url="/groceries/web-assets">
  <head>
    <meta name="sentry-trace" content="2a61087392c34dd2b4a16abf9d3566b6" />
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta http-equiv="content-language" content="en-GB">
      <script type="application/json" id="appdynamics-values">{"eumKey":"AD-AAB-AAE-NSN","userPageName":"28_Lego"}</script>
      <script type="text/javascript" nonce="93c1cf0f-ebf5-4aaa-abc3-49345289988a">
  window["adrum-start-time"] = new Date().getTime();
      (function(config) {
        var appDValues = JSON.parse(document.getElementById("appdynamics-values").innerText);
        config.spa = {"spa2": true}
        config.xd = {enable : false}
        config.appKey = appDValues.eumKey
        config.adrumExtUrlHttps = "https://cdn.appdynamics.com";
        config.beaconUrlHttps = "https://pdx-col.eum-appdynamics.com";
        (function(info) {
          info.PageView 

In [10]:
# BeautifulSoup is excellent at parsing the HTML from the page.
# Naming the parser explicitly avoids a warning and keeps results consistent.
soup = BeautifulSoup(resp.text, 'html.parser')
products = []
for li in soup.find_all('li', attrs={'class': 'product-list--list-item'}):
    # We get the name of the product
    product_element = li.find('div', attrs={'class': 'product-details--content'})
    # We get the price of the product
    price_element = li.find('span', attrs={'class': 'value'})
    
    # Let's store it in a sensible data format
    products.append(
        {
            'product': product_element.text,
            'price': float(price_element.text)
        }
    )
pprint(products)

[{'price': 1.48, 'product': 'Tesco British Salted Block Butter 250G'},
 {'price': 1.48, 'product': 'Tesco British Unsalted Butter 250G'},
 {'price': 1.6, 'product': 'Tesco Gala Apple Minimum 5 Pack'},
 {'price': 3.0, 'product': 'Tesco Blueberries 250G'},
 {'price': 2.0, 'product': 'Tesco Red Seedless Grapes 500G'},
 {'price': 1.15, 'product': 'Tesco Maris Piper Potatoes 2.5Kg'},
 {'price': 3.5, 'product': 'Tesco 2 Boneless Salmon Fillets 260G'},
 {'price': 3.75, 'product': 'Tesco British Chicken Breast Portions 650G'},
 {'price': 1.35, 'product': 'Tesco Clementine Or Sweet Easy Peeler Pack 600G'},
 {'price': 2.0, 'product': 'Tesco Raspberries 150G'},
 {'price': 0.85, 'product': 'Tesco Little Gem Lettuce Twin Pack'},
 {'price': 0.89, 'product': 'Tesco British Semi Skimmed Milk 1.13L, 2 Pints'},
 {'price': 1.0, 'product': 'Tesco Baby Plum Tomatoes 325G'},
 {'price': 1.35, 'product': 'Tesco Sweet Peppers 500G'},
 {'price': 0.85, 'product': 'Tesco Brown Onions Minimum 3 Pack 385G'},
 {'pri

Let's encapsulate this in a function:

In [11]:
def scrape(url: str, products: list) -> list:
    """Appends the products found on the page at `url` to `products` and returns the list."""
    prod_type = url.split('/')[-2]
    resp = requests.get(
        url, 
        headers={
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
            'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
        }
    )
    soup = BeautifulSoup(resp.text, 'html.parser')
    for li in soup.find_all('li', attrs={'class': 'product-list--list-item'}):
        product_element = li.find('div', attrs={'class': 'product-details--content'})
        
        # We check to see if there is a price attached to the product
        # (out of stock products don't have a price element)
        price_element = li.find('span', attrs={'class': 'value'})
        if price_element is None:
            # Skip the product rather than reuse the previous product's price
            continue
        price = price_element.text
        products.append(
            {
                'product': product_element.text,
                'price': float(price),
                'type': prod_type
            }
        )
    return products
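
The comment in `scrape` notes that out-of-stock products have no price element. One way to make that case explicit is a small conversion helper. The sketch below is an illustration only: `parse_price` is a hypothetical name, not part of the notebook, and it assumes the price text is a plain decimal such as `1.48`.

```python
from typing import Optional


def parse_price(text: Optional[str]) -> Optional[float]:
    """Convert price text to a float, or return None when there is no usable price."""
    if text is None:
        # No price element was found (for example, an out-of-stock product)
        return None
    try:
        return float(text.strip())
    except ValueError:
        # Text that is not a plain decimal number is treated as "no price"
        return None


print(parse_price('1.48'))  # a normal price
print(parse_price(None))    # a product with no price element
```

The caller can then decide whether to skip products whose price is `None` or keep them with a missing price.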

In [12]:
# Let's test
products = []
products = scrape(
    'https://www.tesco.com/groceries/en-GB/shop/fresh-food/all',
    products
)
pprint(products)

[{'price': 1.48,
  'product': 'Tesco British Salted Block Butter 250G',
  'type': 'fresh-food'},
 {'price': 1.48,
  'product': 'Tesco British Unsalted Butter 250G',
  'type': 'fresh-food'},
 {'price': 1.6,
  'product': 'Tesco Gala Apple Minimum 5 Pack',
  'type': 'fresh-food'},
 {'price': 3.0, 'product': 'Tesco Blueberries 250G', 'type': 'fresh-food'},
 {'price': 2.0,
  'product': 'Tesco Red Seedless Grapes 500G',
  'type': 'fresh-food'},
 {'price': 1.15,
  'product': 'Tesco Maris Piper Potatoes 2.5Kg',
  'type': 'fresh-food'},
 {'price': 3.5,
  'product': 'Tesco 2 Boneless Salmon Fillets 260G',
  'type': 'fresh-food'},
 {'price': 3.75,
  'product': 'Tesco British Chicken Breast Portions 650G',
  'type': 'fresh-food'},
 {'price': 1.35,
  'product': 'Tesco Clementine Or Sweet Easy Peeler Pack 600G',
  'type': 'fresh-food'},
 {'price': 2.0, 'product': 'Tesco Raspberries 150G', 'type': 'fresh-food'},
 {'price': 0.85,
  'product': 'Tesco Little Gem Lettuce Twin Pack',
  'type': 'fresh-food
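
The `type` value in the output above comes from the second-to-last segment of the URL path. As a sketch, the same slug can be extracted with the standard library's `urllib.parse`, which also ignores any query string or trailing slash. The function name `category_from_url` is hypothetical and used only for illustration.

```python
from urllib.parse import urlsplit


def category_from_url(url: str) -> str:
    """Return the category slug, i.e. the second-to-last path segment."""
    path = urlsplit(url).path
    # Drop empty segments caused by leading or trailing slashes
    segments = [segment for segment in path.split('/') if segment]
    return segments[-2]


print(category_from_url('https://www.tesco.com/groceries/en-GB/shop/fresh-food/all'))
```

For the URLs used in this notebook it gives the same result as `url.split('/')[-2]`.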

Now, there is more than just fresh food on offer. There are bakery products, frozen foods, drinks, and so on.

We need a way to get a sample of the products on sale in each category. The groceries landing page lists the categories in its navigation menu, so we can read the category URLs from there.

### Getting the list of categories

In [13]:
resp = requests.get(
    'https://www.tesco.com/groceries/',
    headers={'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
)

In [14]:
soup = BeautifulSoup(resp.text, 'html.parser')
for li in soup.find('ul', attrs={'class': 'menu menu-superdepartment'}).find_all('li'):
    print(li.find('a', href=True)['href'].replace('?include-children=true', '/all'))

/groceries/en-GB/shop/christmas/all
/groceries/en-GB/shop/fresh-food/all
/groceries/en-GB/shop/bakery/all
/groceries/en-GB/shop/frozen-food/all
/groceries/en-GB/shop/food-cupboard/all
/groceries/en-GB/shop/drinks/all
/groceries/en-GB/shop/baby/all
/groceries/en-GB/shop/health-and-beauty/all
/groceries/en-GB/shop/pets/all
/groceries/en-GB/shop/household/all
/groceries/en-GB/shop/home-and-ents/all
/groceries/en-GB/shop/inspiration-and-events/all
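
The `replace` call above turns each menu link into the category's `/all` listing, but the result is still a relative path. Below is a minimal sketch of building the absolute URL with `urllib.parse.urljoin`. `BASE_URL` and `category_url` are names assumed here for illustration, and the example `href` is the form implied by the `replace` call.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.tesco.com'  # assumed site root, taken from the request URL


def category_url(href: str) -> str:
    """Turn a menu href into the absolute URL of the category's full listing."""
    listing_path = href.replace('?include-children=true', '/all')
    return urljoin(BASE_URL, listing_path)


print(category_url('/groceries/en-GB/shop/bakery?include-children=true'))
```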


With this, we can iterate through the possible URLs with a small helper function:

In [15]:
from typing import Iterator


def generate_links() -> Iterator[str]:
    """A small helper function to generate category URLs.

    Yields one full URL per category in the navigation menu."""
    resp = requests.get(
        'https://www.tesco.com/groceries/',
        headers={
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
            'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
    )
    soup = BeautifulSoup(resp.text, 'html.parser')
    for li in soup.find(
        'ul', 
        attrs={
            'class': 'menu menu-superdepartment'
        }
    ).find_all('li'):
        href = li.find('a', href=True)['href']
        yield 'https://www.tesco.com' + href.replace('?include-children=true', '/all')

Now, we put it all together!

### Putting it all together

In [16]:
products = []
# We are going to skip the Christmas selection.
for link in generate_links():
    if 'christmas' not in link:
        products = scrape(
            link,
            products
        )

pprint(products)

[{'price': 1.48,
  'product': 'Tesco British Salted Block Butter 250G',
  'type': 'fresh-food'},
 {'price': 1.48,
  'product': 'Tesco British Unsalted Butter 250G',
  'type': 'fresh-food'},
 {'price': 1.6,
  'product': 'Tesco Gala Apple Minimum 5 Pack',
  'type': 'fresh-food'},
 {'price': 3.0, 'product': 'Tesco Blueberries 250G', 'type': 'fresh-food'},
 {'price': 2.0,
  'product': 'Tesco Red Seedless Grapes 500G',
  'type': 'fresh-food'},
 {'price': 1.15,
  'product': 'Tesco Maris Piper Potatoes 2.5Kg',
  'type': 'fresh-food'},
 {'price': 3.5,
  'product': 'Tesco 2 Boneless Salmon Fillets 260G',
  'type': 'fresh-food'},
 {'price': 3.75,
  'product': 'Tesco British Chicken Breast Portions 650G',
  'type': 'fresh-food'},
 {'price': 1.35,
  'product': 'Tesco Clementine Or Sweet Easy Peeler Pack 600G',
  'type': 'fresh-food'},
 {'price': 2.0, 'product': 'Tesco Raspberries 150G', 'type': 'fresh-food'},
 {'price': 0.85,
  'product': 'Tesco Little Gem Lettuce Twin Pack',
  'type': 'fresh-food
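
With all categories scraped into one list, a quick sanity check is to count how many products came from each category. The sketch below runs on a tiny hand-written sample in the same shape as `products`. The name `sample` and the bakery row are illustrative, not scraped output.

```python
from collections import Counter

sample = [
    {'product': 'Tesco Blueberries 250G', 'price': 3.0, 'type': 'fresh-food'},
    {'product': 'Tesco Raspberries 150G', 'price': 2.0, 'type': 'fresh-food'},
    # Made-up row, only here to give the sample a second category
    {'product': 'Example Loaf', 'price': 1.0, 'type': 'bakery'},
]

# Count products per category slug
counts = Counter(item['type'] for item in sample)
for category, n in counts.most_common():
    print(f'{category}: {n}')
```

A category with a count of zero, or one missing entirely, would suggest that its page uses different HTML classes or returned no products.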

Before we move on, we should remove the "Tesco" string from the product names. We could call the products generic or something similar, but we don't want a brand name associated with the notebooks (not because of Tesco, I shop there, but because this small project is not affiliated with them).

In [18]:
for product in products:
    product['product'] = product['product'].replace('Tesco ', '')

pprint(products)

[{'price': 1.48,
  'product': 'British Salted Block Butter 250G',
  'type': 'fresh-food'},
 {'price': 1.48,
  'product': 'British Unsalted Butter 250G',
  'type': 'fresh-food'},
 {'price': 1.6, 'product': 'Gala Apple Minimum 5 Pack', 'type': 'fresh-food'},
 {'price': 3.0, 'product': 'Blueberries 250G', 'type': 'fresh-food'},
 {'price': 2.0, 'product': 'Red Seedless Grapes 500G', 'type': 'fresh-food'},
 {'price': 1.15, 'product': 'Maris Piper Potatoes 2.5Kg', 'type': 'fresh-food'},
 {'price': 3.5,
  'product': '2 Boneless Salmon Fillets 260G',
  'type': 'fresh-food'},
 {'price': 3.75,
  'product': 'British Chicken Breast Portions 650G',
  'type': 'fresh-food'},
 {'price': 1.35,
  'product': 'Clementine Or Sweet Easy Peeler Pack 600G',
  'type': 'fresh-food'},
 {'price': 2.0, 'product': 'Raspberries 150G', 'type': 'fresh-food'},
 {'price': 0.85,
  'product': 'Little Gem Lettuce Twin Pack',
  'type': 'fresh-food'},
 {'price': 0.89,
  'product': 'British Semi Skimmed Milk 1.13L, 2 Pints',


That's better!
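
Note that `str.replace` removes every occurrence of `'Tesco '`, including any in the middle of a name. If only a leading brand should be dropped, a prefix check is a slightly stricter alternative. Here is a sketch, with `strip_brand` as a hypothetical helper name:

```python
def strip_brand(name: str, brand: str = 'Tesco ') -> str:
    """Remove `brand` only when it appears at the start of `name`."""
    if name.startswith(brand):
        return name[len(brand):]
    return name


print(strip_brand('Tesco Blueberries 250G'))
print(strip_brand('Blueberries 250G'))
```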

### Saving

Since the data is a list of dictionaries, let's save it as JSON (though we could just as easily use pandas to save a CSV):

In [21]:
# A context manager makes sure the file is flushed and closed
with open('../data/raw/sample_products.json', 'w') as fp:
    json.dump(products, fp)
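
To check that the file was written correctly, it can be read back with `json.load` and compared with the original list. The sketch below is self-contained: it uses a temporary directory instead of `../data/raw/` and a two-item stand-in for `products`, both of which are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the scraped list, in the same shape as `products`
products = [
    {'product': 'Blueberries 250G', 'price': 3.0, 'type': 'fresh-food'},
    {'product': 'Raspberries 150G', 'price': 2.0, 'type': 'fresh-food'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'sample_products.json'

    # Write the list of dictionaries to disk
    with open(path, 'w') as fp:
        json.dump(products, fp)

    # Read it back while the temporary directory still exists
    with open(path) as fp:
        loaded = json.load(fp)

# True when the data survived the round trip unchanged
print(loaded == products)
```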