# Day 5 - Webscraping

Today's goal is to scrape the locations of all Rema1000 stores in Denmark.

On the Rema1000 website, we can see that the element listing all Danish stores is a `div` with the following class:
`
class="grid aspect-square h-auto auto-rows-min grid-cols-1 divide-y overflow-y-auto overflow-x-hidden lg:aspect-auto lg:h-[504px]"
`

<div>
<img src="pictures/stores.png" width="1000"/>
</div>


Second, we can see that this "root" div has a "child" div for each store, each referenced by an id such as `id="store-list__item-329"`.

<div>
<img src="pictures/address.png" width="1000"/>
</div>


Above, we can see that each individual store element has an `h2` header (`class="text-xl font-bold"`) holding the name of the store and an `address` element (`class="text-base font-medium not-italic"`) holding its address.

With this in place we should be able to scrape all stores from the Rema1000 website and store them.
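
The structure described above can be tried out on a small hand-written HTML fragment before touching the live site. This is only a sketch: the fragment imitates the ids and classes seen in the screenshots, the pairing of ids with stores is an assumption made for illustration, and the names `SAMPLE_HTML`, `sample_soup` and `sample_stores` are hypothetical. The real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that imitates the structure described above.
# Which id belongs to which store is an assumption for illustration only.
SAMPLE_HTML = """
<div class="grid aspect-square h-auto auto-rows-min grid-cols-1 divide-y">
  <div id="store-list__item-329">
    <h2 class="text-xl font-bold">Aabybro</h2>
    <address class="text-base font-medium not-italic">Aabybro Centret 2 C, 9440 Aabybro</address>
  </div>
  <div id="store-list__item-330">
    <h2 class="text-xl font-bold">Ålbæk</h2>
    <address class="text-base font-medium not-italic">Søndre Havnevej 4, 9982 Ålbæk</address>
  </div>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Select every div whose id starts with "store-list__item-".
items = sample_soup.select('div[id^="store-list__item-"]')

# Pull the header and address text out of each store element.
sample_stores = [
    {
        "name": item.find("h2").get_text(strip=True),
        "address": item.find("address").get_text(strip=True),
    }
    for item in items
]

print(sample_stores)
```

Selecting on the id prefix is one option; selecting on the class string works just as well as long as the class names stay stable.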


In [29]:
# Now let's use BeautifulSoup to scrape the site
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [5]:
# base information
URL = "https://rema1000.dk/find-butik-og-abningstider"
page = requests.get(URL)

In [39]:
# Check if we are retrieving the HTML from the site
page.text

'<!DOCTYPE html><html  lang="da" data-capo=""><head><meta charset="utf-8">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<script id="CookieConsent" src="https://policy.app.cookieinformation.com/uc.js" type="text/javascript" data-culture="DA"></script>\n<script type="text/javascript">var _mtm = window._mtm = window._mtm || [];_mtm.push({\'mtm.startTime\': (new Date().getTime()), \'event\': \'mtm.Start\'});(function() {var d=document, g=d.createElement(\'script\'), s=d.getElementsByTagName(\'script\')[0];g.async=true; g.src=\'https://matomo.digital.rema1000.dk/js/container_tZ2kTSpc.js\'; s.parentNode.insertBefore(g,s);})();</script>\n<title>Find butik og åbningstider | REMA 1000</title>\n<link rel="preconnect" href="https://content-images.digital.rema1000.dk">\n<link rel="preconnect" href="https://d15493jtiio2fp.cloudfront.net">\n<script src="https://d15493jtiio2fp.cloudfront.net/shopping-list-web-sdk/rema1000-shopping-list.umd.js" async defer data-hid="12f2af6">
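
Inspecting `page.text` by eye is fine for a one-off check. A slightly more defensive variant, sketched here as an assumption rather than what this notebook ran, wraps the request in a hypothetical helper `fetch_html` that sets a timeout and raises on HTTP error codes. The `get` parameter exists only so the helper can be exercised without network access.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Hypothetical helper: return the page HTML, or raise if the request failed."""
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text
```

In the notebook this could be called as `fetch_html(URL)`, and the returned string passed to `BeautifulSoup` in place of `page.content`.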

It works. Now let's retrieve the store elements.

In [37]:
# Parse
soup = BeautifulSoup(page.content, "html.parser")

# Retrieve all store elements
stores = soup.find_all("div", class_="py-2 text-rema-evening")

# Create list for storing store data
data_stores = []

# Loop through all stores and retrieve individual store names and addresses
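for store in stores:
    name = store.find("h2", class_="text-xl font-bold")
    address = store.find("address", class_="text-base font-medium not-italic")

    # Skip elements that lack a name or an address instead of raising AttributeError
    if name is None or address is None:
        continue

    # Add dictionaries to list
    data_stores.append({"name": name.text, "address": address.text})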

# Add data to dataframe
df_stores = pd.DataFrame(data_stores)

# Display dataframe
df_stores


| | name | address |
|---|---|---|
| 0 | Aabenraa, Cimbria Parken | Reberbanen 7 st., Cimbria Parken, 6200 Aabenraa |
| 1 | Aabenraa, Nyløkke | Nyløkke 3, 6200 Aabenraa |
| 2 | Aabenraa, Rugkobbel | Farversmøllevej 2-4, Rugkobbelcentret, 6200 Aa... |
| 3 | Aabybro | Aabybro Centret 2 C, 9440 Aabybro |
| 4 | Aalborg C, Budolfi Plads | Vingårdsgade 6, 9000 Aalborg |
| ... | ... | ... |
| 425 | Åbyhøj | Silkeborgvej 261, 8230 Åbyhøj |
| 426 | Ålbæk | Søndre Havnevej 4, 9982 Ålbæk |
| 427 | Årslev | Overvejen 79, 5792 Årslev |
| 428 | Ølgod | Østerbro 24, 6870 Ølgod |


Thus we've scraped the name and address of every Rema1000 store in Denmark.
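
To keep the result around, the dataframe can be written to disk. A minimal sketch, assuming a hypothetical file name `rema1000_stores.csv` and a two-row stand-in (`df_sample`) for `df_stores`: passing `index=False` leaves the row index out of the file, which avoids the extra `Unnamed: 0` column that pandas otherwise creates when the file is read back.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two-row stand-in for df_stores (values taken from the output above).
df_sample = pd.DataFrame(
    [
        {"name": "Aabybro", "address": "Aabybro Centret 2 C, 9440 Aabybro"},
        {"name": "Ølgod", "address": "Østerbro 24, 6870 Ølgod"},
    ]
)

# Hypothetical output location; a temporary directory keeps the sketch side-effect free.
csv_path = Path(tempfile.mkdtemp()) / "rema1000_stores.csv"

# index=False leaves the row index out of the file.
df_sample.to_csv(csv_path, index=False, encoding="utf-8")

# Reading the file back yields only the name and address columns.
df_roundtrip = pd.read_csv(csv_path)
print(df_roundtrip.columns.tolist())
```

In the notebook itself the same call would be made on `df_stores`, with a path of your choosing.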