# Mediamarkt scraper

This scraper collects data from four product categories on the Mediamarkt website (mediamarkt.nl):
- Mobile phones
- Televisions
- Laptops
- Tablets

Once scraping is finished, two files (CSV and JSON) are exported. They contain data on the products in these categories, such as prices, availability, ratings, reviews and other attributes.

Following all steps takes around 45 minutes.

### Import packages

First, import the following packages:

In [1]:
# Import packages
from bs4 import BeautifulSoup
import requests
from time import sleep
import pandas as pd
import json
# json_normalize is called as pd.json_normalize further down, so it needs no separate import

### Collect page URLs

Next, we will collect all page URLs of the categories.

In [2]:
# Define function to check whether there is a next page
def check_next_page(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    next_btn = soup.find(class_="pagination-next")
    # Guard against a "next" element that contains no link
    link = next_btn.find("a") if next_btn else None
    return link.attrs["href"] if link else None
print("Function created.")

Function created.
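
To see how the next-page check behaves without sending any requests, the same lookup can be tried on a small HTML snippet. This is a minimal sketch: the markup is invented for illustration (only the `pagination-next` class comes from the function above), and `find_next_href` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Invented markup, modelled on the "pagination-next" class used above
HTML_WITH_NEXT = """
<ul class="pagination">
  <li class="pagination-next"><a href="/nl/category/_example-1.html?page=2">Next</a></li>
</ul>
"""
HTML_LAST_PAGE = '<ul class="pagination"><li class="active">3</li></ul>'


def find_next_href(html):
    """Return the href of the next-page link, or None when there is none."""
    soup = BeautifulSoup(html, "html.parser")
    next_btn = soup.find(class_="pagination-next")
    if next_btn is None:
        return None
    link = next_btn.find("a")
    # A "next" element without an <a> tag is treated as "no next page"
    return link.get("href") if link else None


print(find_next_href(HTML_WITH_NEXT))
print(find_next_href(HTML_LAST_PAGE))
```

The first call prints the relative link, the second prints `None`, which is the value that ends the pagination loop.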


In [3]:
# Define a function to collect all page urls
def generate_page_urls(page_url):
    page_urls = []
    while page_url:
        print("Saving: ", page_url)
        page_urls.append(page_url)
        # Request the page only once and reuse the result
        next_page = check_next_page(page_url)
        if next_page is None:
            break
        page_url = "https://www.mediamarkt.nl" + next_page
        # Pause between requests to avoid overloading the server
        sleep(1)
    print("Done with this category!")

    return page_urls
print("Function created.")

Function created.
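
The pagination loop itself can be checked offline by swapping the network lookup for a dictionary. The sketch below makes several assumptions: `FAKE_SITE`, `fake_next_page`, `collect_page_urls` and the `max_pages` safety limit are all hypothetical names introduced for illustration, and `urljoin` from the standard library is used here instead of string concatenation to build absolute URLs.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.mediamarkt.nl"

# Hypothetical stand-in for the website: each page path maps to the
# relative link of the next page, or None on the last page
FAKE_SITE = {
    "/nl/category/_example-1.html?page=1": "/nl/category/_example-1.html?page=2",
    "/nl/category/_example-1.html?page=2": "/nl/category/_example-1.html?page=3",
    "/nl/category/_example-1.html?page=3": None,
}


def fake_next_page(url):
    """Look up the next-page href in FAKE_SITE (no network access)."""
    return FAKE_SITE.get(url[len(BASE_URL):])


def collect_page_urls(first_url, get_next=fake_next_page, max_pages=100):
    """Follow next-page links until none is left or max_pages is reached."""
    page_urls = []
    url = first_url
    while url and len(page_urls) < max_pages:
        page_urls.append(url)
        next_href = get_next(url)  # one lookup per page
        url = urljoin(BASE_URL, next_href) if next_href else None
    return page_urls


pages = collect_page_urls(BASE_URL + "/nl/category/_example-1.html?page=1")
print(len(pages))
```

Passing the lookup as a parameter means the same loop could be run with `check_next_page` for real scraping, and the page limit protects against a site that keeps linking to further pages.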


In [4]:
# Use the generate_page_urls function to collect all the page urls of the categories you want to scrape

# 1. Define the first page of every category
smartphones_url = "https://www.mediamarkt.nl/nl/category/_smartphones-483222.html?page=1"
laptops_url = "https://www.mediamarkt.nl/nl/category/_laptops-482723.html?page=1"
tablets_url = "https://www.mediamarkt.nl/nl/category/_tablets-645048.html?page=1"
tvs_url = "https://www.mediamarkt.nl/nl/category/_televisies-450682.html?page=1"

# 2. Use the function on all categories, first checking whether there is a next page and if so, adding it to page_urls
page_urls = (
    generate_page_urls(smartphones_url)
    + generate_page_urls(laptops_url)
    + generate_page_urls(tablets_url)
    + generate_page_urls(tvs_url)
)
print("All page URLS have been collected.")

Saving:  https://www.mediamarkt.nl/nl/category/_smartphones-483222.html?page=1
Saving:  https://www.mediamarkt.nl/nl/category/_smartphones-483222.html?page=2
Saving:  https://www.mediamarkt.nl/nl/category/_smartphones-483222.html?page=3
Saving:  https://www.mediamarkt.nl/nl/category/_smartphones-483222.html?page=4
Saving:  https://www.mediamarkt.nl/nl/category/_smartphones-483222.html?page=5
Saving:  https://www.mediamarkt.nl/nl/category/_smartphones-483222.html?page=6
Saving:  https://www.mediamarkt.nl/nl/category/_smartphones-483222.html?page=7
Saving:  https://www.mediamarkt.nl/nl/category/_smartphones-483222.html?page=8
Saving:  https://www.mediamarkt.nl/nl/category/_smartphones-483222.html?page=9
Saving:  https://www.mediamarkt.nl/nl/category/_smartphones-483222.html?page=10
Saving:  https://www.mediamarkt.nl/nl/category/_smartphones-483222.html?page=11
Saving:  https://www.mediamarkt.nl/nl/category/_smartphones-483222.html?page=12
Saving:  https://www.mediamarkt.nl/nl/category/_s

### Collect product page URLs

All page URLs are collected, so now the product page URLs can be collected.

In [5]:
# Create a function to collect the product_urls
def create_product_urls(page_urls):
    product_urls = []
    for page_url in page_urls:
        res = requests.get(page_url)
        soup = BeautifulSoup(res.text, "html.parser")
        products = soup.find_all("h2")
        
        for product in products:
            try:
                product_url = "https://www.mediamarkt.nl" + product.find("a").attrs["href"]
                product_urls.append(product_url)
                print("Saving " + product_url)
            # Headings without a link (or without an href) are not products
            except (AttributeError, KeyError):
                print("this is no product")
            
        sleep(1)
        
    return product_urls
print("Function created.")

Function created.
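
The extraction step relies on product links sitting inside `h2` headings, while other headings on the page carry no link. A small offline sketch of that filtering is shown below; the HTML is invented sample markup and `extract_product_urls` is a hypothetical helper, so the real listing pages may be structured differently.

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Invented listing markup: two product headings and one heading without a link
LISTING_HTML = """
<div class="product-wrapper">
  <h2><a href="/nl/product/_example-phone-1.html">Example phone</a></h2>
</div>
<h2>Related categories</h2>
<div class="product-wrapper">
  <h2><a href="/nl/product/_example-laptop-2.html">Example laptop</a></h2>
</div>
"""


def extract_product_urls(html, base_url="https://www.mediamarkt.nl"):
    """Return absolute URLs for every <h2> that contains a link with an href."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for heading in soup.find_all("h2"):
        link = heading.find("a", href=True)
        if link is None:
            continue  # heading without a link, so not a product
        urls.append(urljoin(base_url, link["href"]))
    return urls


for product_url in extract_product_urls(LISTING_HTML):
    print(product_url)
```

Checking for a missing link explicitly avoids the need for a `try`/`except` around the lookup.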


In [6]:
# Use the create_product_urls function to create a list product_urls
product_urls = create_product_urls(page_urls)
print("All product URLS have been saved.")

Saving https://www.mediamarkt.nl/nl/product/_samsung-galaxy-a52-128-gb-zwart-1686730.html
Saving https://www.mediamarkt.nl/nl/product/_motorola-moto-e7-power-64-gb-dual-sim-blauw-1714397.html
Saving https://www.mediamarkt.nl/nl/product/_samsung-galaxy-a52s-5g-128-gb-zwart-1703821.html
Saving https://www.mediamarkt.nl/nl/product/_samsung-galaxy-a32-5g-128-gb-zwart-1686726.html
Saving https://www.mediamarkt.nl/nl/product/_samsung-galaxy-a32-4g-128-gb-zwart-1686722.html
Saving https://www.mediamarkt.nl/nl/product/_xiaomi-redmi-9a-1715992.html
Saving https://www.mediamarkt.nl/nl/product/_samsung-galaxy-s21-5g-128-gb-grijs-1684429.html
Saving https://www.mediamarkt.nl/nl/product/_samsung-galaxy-a22-5g-64-gb-grijs-1698326.html
Saving https://www.mediamarkt.nl/nl/product/_xiaomi-11-lite-5g-new-edition-128gb-zwart-1710311.html
Saving https://www.mediamarkt.nl/nl/product/_samsung-galaxy-a03s-32-gb-zwart-1707156.html
Saving https://www.mediamarkt.nl/nl/product/_motorola-moto-g60s-128gb-dual-sim-

### Collect product-specific data

From these product page URLs, we will now collect all product-specific data. This includes the data mentioned earlier: prices, availability, ratings, reviews and other attributes.

In [7]:
# Search for the right elements, store them in variables and put them together in a dictionary
# The list lives at module level, so results collected so far remain available if the run is interrupted
product_data = []

def scrape(product_urls):
    for url in product_urls:
        res = requests.get(url)
        soup = BeautifulSoup(res.text, "html.parser")
        device_type = soup.find(class_="breadcrumbs").find_all("li")[2].text.replace("\n", "")
        names = soup.find(class_="stickable").img.attrs["alt"]
        # soup.find returns None for a missing element, which raises AttributeError on access
        try:
            prices = soup.find("div", class_="price big").text
        except AttributeError:
            prices = "no price"
        try:
            instock = soup.find(class_="box infobox availability").meta.attrs["content"]
        except (AttributeError, KeyError):
            instock = "OutOfStock"
        try:
            ratings = soup.find(class_="bvseo-ratingValue").text
        except AttributeError:
            ratings = "no rating"
        try:
            reviews = soup.find(class_="bvseo-reviewCount").text
        except AttributeError:
            reviews = "no reviews"

        # Get product attributes and store them in attributes_json
        attributes = soup.find(class_="specification").find_all("dt")
        values = soup.find(class_="specification").find_all("dd")

        attributes_json = {}
        for x, y in zip(attributes, values):
            attributes_json[x.text] = y.text

        # Store all variables in product
        product = {"device_type": device_type, "name": names, "price": prices, "instock": instock,
                   "rating": ratings, "nr_reviews": reviews, "attributes": attributes_json}
        product_data.append(product)
        print("Saving: ", product["name"])

        # Pause between requests to avoid overloading the server
        sleep(1)

    return product_data
print("Function created.")

Function created.
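
The attribute step pairs each `dt` (attribute name) with the `dd` (value) that follows it. The sketch below shows that pairing on an invented specification block; `parse_specification` is a hypothetical helper, and the use of `get_text(strip=True)` to trim whitespace is an addition that the scraper above does not make.

```python
from bs4 import BeautifulSoup

# Invented specification markup with two name/value pairs
SPEC_HTML = """
<div class="specification">
  <dl>
    <dt>Merk:</dt><dd>EXAMPLE</dd>
    <dt>Opslagcapaciteit:</dt><dd> 128 GB </dd>
  </dl>
</div>
"""


def parse_specification(html):
    """Return a dict mapping each <dt> text to the text of its <dd>."""
    soup = BeautifulSoup(html, "html.parser")
    spec = soup.find(class_="specification")
    if spec is None:
        return {}  # page without a specification block
    names = spec.find_all("dt")
    values = spec.find_all("dd")
    return {
        name.get_text(strip=True): value.get_text(strip=True)
        for name, value in zip(names, values)
    }


print(parse_specification(SPEC_HTML))
```

Because `zip` stops at the shorter list, a `dt` without a matching `dd` is silently dropped rather than raising an error.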


In [8]:
# Use the scraping function to save the data in product_data
product_data = scrape(product_urls)
print("All product data have been collected.")

Saving:  SAMSUNG Galaxy A52 - 128 GB Zwart
Saving:  MOTOROLA moto e7 power - 64 GB Dual-Sim Blauw
Saving:  SAMSUNG Galaxy A52s 5G - 128 GB Zwart
Saving:  SAMSUNG Galaxy A32 5G - 128 GB Zwart
Saving:  SAMSUNG Galaxy A32 4G - 128 GB Zwart
Saving:  XIAOMI Redmi 9A
Saving:  SAMSUNG Galaxy S21 5G - 128 GB Grijs
Saving:  SAMSUNG Galaxy A22 5G - 64 GB Grijs
Saving:  XIAOMI 11 Lite 5G New Edition 128GB Zwart
Saving:  SAMSUNG Galaxy A03s - 32 GB Zwart
Saving:  MOTOROLA moto g60s - 128GB Dual-Sim - Blauw
Saving:  SAMSUNG Galaxy A12 - 32 GB Blauw
Saving:  MOTOROLA moto e20 - 32GB Dual-Sim - Grijs
Saving:  APPLE iPhone 12 - 128 GB Zwart 5G
Saving:  APPLE iPhone 12 - 64 GB Zwart 5G
Saving:  SAMSUNG Galaxy A13 - 128 GB Zwart
Saving:  APPLE iPhone 11 - 64 GB Wit
Saving:  SAMSUNG Galaxy S20 FE 4G - 128 GB Donkerblauw
Saving:  SAMSUNG Galaxy A12 - 32 GB Wit
Saving:  SAMSUNG Galaxy Xcover 5 EE - 64 GB Zwart
Saving:  SAMSUNG Galaxy A22 5G - 64 GB Paars
Saving:  APPLE iPhone 13 - 128 GB Green 5G
Saving:  

KeyboardInterrupt: ignored

### Store and export the product data

Now that all data has been collected, the raw data is written to a JSON file and then flattened into a CSV file for further use.

In [None]:
# Write the raw data, product_data, to a json_file
with open('raw_product_data.json', 'w') as json_file:
    json.dump(product_data, json_file)
print("Data have been saved in a json file.")

In [None]:
# Opening and normalizing the raw JSON data

# 1. Open the saved raw JSON data and convert it into a pandas dataframe
df = pd.read_json('raw_product_data.json')

# 2. Normalize the data: turn the items nested in 'attributes' into separate columns and drop the 'attributes' column
df = df.join(pd.json_normalize(df.attributes)).drop(columns=['attributes'])
print("Dataframe has been created.")
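
To illustrate what the normalization step does, here is a small sketch on two invented records (the names, prices and attribute values are sample data, not scraped results). Each key in the nested `attributes` dictionary becomes its own column, and products that lack an attribute get a missing value in that column.

```python
import pandas as pd

# Invented sample records in the same shape as the scraped product data
records = [
    {"name": "Example phone", "price": "199",
     "attributes": {"Merk:": "EXAMPLE", "Kleur:": "Zwart"}},
    {"name": "Example laptop", "price": "699",
     "attributes": {"Merk:": "SAMPLE", "Schermdiagonaal:": "15.6"}},
]

df = pd.DataFrame(records)

# Expand the nested dictionaries into columns, then drop the original column
flat = df.join(pd.json_normalize(df["attributes"].tolist())).drop(columns=["attributes"])

print(flat.columns.tolist())
print(flat)
```

Since every product category has its own set of attributes, the resulting table is typically wide and sparse, which is worth keeping in mind during data preparation.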

In [None]:
# Write the pandas dataframe to a csv
df.to_csv("mediamarkt_scraper_output.csv", sep = ",", index = False)
print("Done, the csv is ready for data preparation.")