# Bambus Technology LLC Python Test - Web scraping - 1
-- -------------------------------------------------------------------------------------------

### Overview
Product Data is a task run on all marketplaces where we sell our products. The goal is to collect product information for a list of offers. The result is an up-to-date dataset of all our products, used for content changes, display checks, marketing decisions, and so on.

This test is to evaluate your understanding of a written task and your ability to complete it by developing a script using web scraping methods.

### Provided resources
    •	input.xlsx:
        o	Contains a list of links referring to Mister Sandman offers (products)
### Task description
You will crawl the input list to collect information (attributes) for each product in the list. The script needs to read the input correctly, scrape each link, and export the result as an Excel file in the expected format (see Expected result below).

The code will be written in Python and you are free to use the IDE of your choice and all Python libraries you need to complete the task.

The performance of the solution, as well as the documentation and readability of the code, will be evaluated.
### Expected result
The solution needs to be a zipped folder named “Test_Coding_yourname_dd.mm.yyyy.zip” containing:
        
        •	The input file “input.xlsx”
        
        •	The Python script “test_coding.py” containing the code to execute the task
        
        •	An Excel file “Webshop_ProdData_ dd.mm.yyyy-hhmmss.xlsx” containing the product data (generated by your script)
            o	Attributes for each product:
                - url: link of the product (same as the input)
                - product_title: title displayed on the product page
                - variation_id: unique identifier that distinguishes the child variations of one parent product (e.g. product X available in Y different sizes = parent X has Y variations and therefore Y unique variation IDs)
                - size: size of the product
                - price: price of the product
                - ean: European Article Number (barcode), a standard 13-digit number
                - img1_url: URL of the 1st picture visible on the product page
                - img2_url: URL of the 2nd picture visible on the product page (in the carousel)
            o	Result sample:
            
            ![image.png](attachment:image.png)
            
         N.B.: the image URLs can differ depending on the method you use, but they are considered correct as long as they refer to the correct picture (see the example below: 2 different URLs, same picture).
         
        https://mister-sandman.de/cdn/shop/products/babymatratze60x120vollansicht_695x695.jpg?v=1678786128
        https://cdn.shopify.com/s/files/1/1563/5705/products/babymatratze60x120vollansicht.jpg?v=1678786128
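
The two example URLs above differ in host, path, and a `_695x695` size suffix, yet point at the same file. A minimal sketch of how such URLs could be compared is shown here. The helper names `image_key` and `same_picture` are hypothetical, and the assumption that a size suffix always has the form `_<width>x<height>` directly before the extension is inferred from this single example.

```python
import posixpath
import re
from urllib.parse import urlparse

# Assumed pattern: an optional "_<width>x<height>" suffix right before the extension
_SIZE_SUFFIX = re.compile(r"_\d+x\d+(?=\.[A-Za-z0-9]+$)")


def image_key(url: str) -> str:
    """Return the file name of an image URL without host, query, or size suffix."""
    path = urlparse(url).path               # drops scheme, host, and the "?v=..." query
    file_name = posixpath.basename(path)    # keeps only the last path segment
    return _SIZE_SUFFIX.sub("", file_name)  # removes "_695x695" if present


def same_picture(url_a: str, url_b: str) -> bool:
    """Treat two URLs as the same picture when their cleaned file names match."""
    return image_key(url_a) == image_key(url_b)


url_1 = "https://mister-sandman.de/cdn/shop/products/babymatratze60x120vollansicht_695x695.jpg?v=1678786128"
url_2 = "https://cdn.shopify.com/s/files/1/1563/5705/products/babymatratze60x120vollansicht.jpg?v=1678786128"
print(same_picture(url_1, url_2))  # True
```

Comparing file names only is a heuristic: two different pictures stored under the same file name would be reported as equal.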


#### Importing Necessary Packages

In [2]:
import pandas as pd
import numpy as np
import re
import json
import time
from datetime import datetime

import requests
from bs4 import BeautifulSoup
from urllib import request

import pickle

### Reading the Excel file

In [2]:
df = pd.read_excel('input.xlsx')

In [3]:
df

Unnamed: 0,Link
0,https://www.mister-sandman.de/products/babymat...
1,https://www.mister-sandman.de/products/steppbe...
2,https://www.mister-sandman.de/products/steppbe...
3,https://www.mister-sandman.de/products/steppbe...
4,https://www.mister-sandman.de/products/steppbe...
...,...
552,https://www.mister-sandman.de/products/topper-...
553,https://www.mister-sandman.de/products/topper-...
554,https://www.mister-sandman.de/products/topper-...
555,https://www.mister-sandman.de/products/topper-...


In [4]:
## Checking Duplicate values

df[df.duplicated()].values[0]

array(['https://www.mister-sandman.de/products/kaltschaum-topper-doppeltuchbezug-80x200'],
      dtype=object)

In [5]:
### This link appears twice

df[df['Link']=='https://www.mister-sandman.de/products/kaltschaum-topper-doppeltuchbezug-80x200']

Unnamed: 0,Link
463,https://www.mister-sandman.de/products/kaltsch...
472,https://www.mister-sandman.de/products/kaltsch...


### Drop the duplicates

In [6]:
df = df.drop_duplicates()

### No duplicates remain

In [7]:
df[df.duplicated()]

Unnamed: 0,Link
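
The same "keep the first occurrence" behaviour that `drop_duplicates()` applies to the DataFrame can also be sketched on a plain list of links. This is only an illustration: the function name `unique_links` and the example URLs are made up.

```python
def unique_links(links):
    """Return the links without duplicates, keeping the first occurrence of each."""
    # A dict preserves insertion order, so dict.fromkeys drops only the later repeats
    return list(dict.fromkeys(links))


# Hypothetical input with one repeated link
links = [
    "https://example.com/products/a",
    "https://example.com/products/b",
    "https://example.com/products/a",
]
print(unique_links(links))
# ['https://example.com/products/a', 'https://example.com/products/b']
```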


In [8]:
### Collect the links into a list

links_lists = list(df['Link'].values)

In [9]:
# type(links_lists), len(links_lists), links_lists

### Functions for scraping the required data from the web pages

In [10]:
def prod_names(soup):
    """Return the product title from the page, or '' if it is not found."""
    prod_name = ''  # Default, so the return below never hits an undefined name

    try:
        titles = soup.find('h1', class_='product-title')
        if titles is not None:
            prod_name = titles.text
    except Exception:
        prod_name = ''

    return prod_name

def extract_json_data_ean(soup):
    script_tag = soup.find('script', {'type': 'application/json', 'data-section-type': 'static-product'})
    if script_tag:
        try:
            json_data = json.loads(script_tag.string)
            return json_data
        except (json.JSONDecodeError, TypeError):  # TypeError: script_tag.string is None
            print("Error decoding JSON")
            return None
    return None

def sizes(soup):
    chosen_value = ''  # Initialize chosen_value with an empty string

    try:
        # Look for the size dropdown by its element id
        size_select = soup.find('select', id='template--20429025378649__main-data-variant-option-0-template--20429025378649__main')

        if size_select is not None:
            # Ensure the select element is found
            select_element = soup.find('select', class_='options-selection__input-select')

            if select_element:
                # Extract the value of the 'data-variant-option-chosen-value' attribute
                chosen_value = select_element.get('data-variant-option-chosen-value')

        else:
            # No dropdown: fall back to the last word of the product title
            chosen_value = prod_names(soup).split()[-1]

    except Exception as e:
        print(f"An error occurred: {e}. Sizes(soup) function.")

    return chosen_value

def vpb_lists(soup):
    
    result = ['', '', '']  # Initialize result with default values
    
    try:
        # Find the price element
        price = soup.find('div', class_='product-pricing').find('div', class_='price__current').find_all('span')
        #print(type(price), len(price))  #, price)
        if price:
            price_text = price[-1].text.strip()
            price_text = price_text.replace(',', '').replace('€', '').strip()
            #print(price_text)
            if price_text.isdigit():  # Check that only digits remain
                # With the comma removed, "29,90" becomes 2990 (the amount in cents)
                price_to_find = int(price_text)
                data1 = extract_json_data_ean(soup)
                size = sizes(soup)  # Look up the size once instead of once per variant
                # Iterate over the variants and find matching price
                for variant in data1['product']['variants']:
                    if variant['price'] == price_to_find:
                        if size in variant['title']:
                            # Note: this branch stores the price in cents
                            result = [variant['id'], price_to_find, variant['barcode']]
                            break  # Stop iteration once a match is found
                        elif size in data1['product']['title']:
                            # Note: this branch stores the price in euros
                            result = [variant['id'], price_to_find/100, variant['barcode']]
            else:
                print(f"Non-numeric price found: {price_text}")
                return result  # Return default result if price is not numeric

    except Exception as e:
        print(f"An error occurred: {e}")

    return result

def images_list(soup):
    """Return the URLs of the first two gallery images ('' where one is missing)."""
    image_urls = ['', '']  # Defaults for img1_url and img2_url

    try:
        # Find all divs with the specified class; the first two hold the pictures
        divs = soup.find_all('div', class_='product-gallery--image-background')

        for index, div in enumerate(divs[:2]):
            # Use the first img tag within the div
            img = div.find('img')
            if img is not None and 'src' in img.attrs:
                # Prepend the scheme, as the original code does for every src value
                image_urls[index] = 'https:' + img['src']
    except Exception as e:
        print(f"An error occurred: {e}. images_list(soup) function.")

    return image_urls[0], image_urls[1]
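
`vpb_lists` strips the comma and the euro sign from the displayed price and compares the remaining digits with `variant['price']`, which suggests that the embedded JSON stores amounts in cents. A hedged sketch of a more explicit conversion follows. The function names are hypothetical, and the German number format (`.` for thousands, `,` for decimals) is an assumption about how the shop displays prices.

```python
from decimal import Decimal, InvalidOperation


def parse_german_price(text: str):
    """Convert a price such as '1.299,90 €' to Decimal('1299.90'); None if not parseable."""
    cleaned = text.replace("€", "").strip()
    # Assumed German notation: '.' groups thousands, ',' separates the decimals
    cleaned = cleaned.replace(".", "").replace(",", ".")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def to_cents(price: Decimal) -> int:
    """Express a euro amount in cents, e.g. for comparison with a cents-based value."""
    return int(price * 100)


price = parse_german_price("29,90 €")
print(price, to_cents(price))  # 29.90 2990
```

Using `Decimal` instead of `float` avoids rounding surprises when multiplying by 100.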


### The code below scrapes the web pages listed in links_lists

In [11]:
start = datetime.now()

if __name__ == '__main__':
    
    # Initialize a requests session with custom headers
    session = requests.Session()
    session.headers.update({
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
        "Accept-Encoding": "*",
        "Connection": "keep-alive"})

    data = {'url':[], 'Product_title': [], 'Variation_id':[], 'Size':[], 'Price':[], 'Ean':[], 'Img1_url':[], 'Img2_url':[]}

    for url in links_lists:

        # Fetch the page through the shared session
        req = session.get(url)

        soup = BeautifulSoup(req.text, 'html.parser')

        vpb = vpb_lists(soup)

        img1, img2 = images_list(soup)

        data['url'].append(url)
        data['Product_title'].append(prod_names(soup))
        data['Variation_id'].append(vpb[0])
        data['Price'].append(vpb[1])
        data['Ean'].append(vpb[2])
        data['Size'].append(sizes(soup))
        data['Img1_url'].append(img1)
        data['Img2_url'].append(img2)

        # Pause for two seconds between requests to avoid overloading the server
        time.sleep(2)

final_df = pd.DataFrame(data)

end = datetime.now()

duration = end - start

In [12]:
final_df

Unnamed: 0,url,Product_title,Variation_id,Size,Price,Ean,Img1_url,Img2_url
0,https://www.mister-sandman.de/products/babymat...,Babymatratze,23030724231216,60x120,2990.0,4063585872390,https://mister-sandman.de/cdn/shop/products/ba...,https://mister-sandman.de/cdn/shop/products/Ed...
1,https://www.mister-sandman.de/products/steppbe...,Ganzjahresbettdecke,30732042895492,135x200,1890.0,4063585977477,https://mister-sandman.de/cdn/shop/products/St...,https://mister-sandman.de/cdn/shop/products/de...
2,https://www.mister-sandman.de/products/steppbe...,Ganzjahresbettdecke,30732042961028,155x220,1990.0,4063585977453,https://mister-sandman.de/cdn/shop/products/St...,https://mister-sandman.de/cdn/shop/products/de...
3,https://www.mister-sandman.de/products/steppbe...,Ganzjahresbettdecke,30732042928260,200x200,2190.0,4063585977460,https://mister-sandman.de/cdn/shop/products/St...,https://mister-sandman.de/cdn/shop/products/de...
4,https://www.mister-sandman.de/products/steppbe...,Ganzjahresbettdecke,30732042993796,200x220,2290.0,4063585977446,https://mister-sandman.de/cdn/shop/products/St...,https://mister-sandman.de/cdn/shop/products/de...
...,...,...,...,...,...,...,...,...
551,https://www.mister-sandman.de/products/topper-...,Topper und Matratzenschoner - Soft-Topper für ...,40443745960068,120x200,1990.0,4063585308905,https://mister-sandman.de/cdn/shop/products/so...,https://mister-sandman.de/cdn/shop/products/3_...
552,https://www.mister-sandman.de/products/topper-...,Topper und Matratzenschoner - Soft-Topper für ...,39459118940292,140x200,1990.0,4063585341742,https://mister-sandman.de/cdn/shop/products/so...,https://mister-sandman.de/cdn/shop/products/3_...
553,https://www.mister-sandman.de/products/topper-...,Topper und Matratzenschoner - Soft-Topper für ...,40443748352132,160x200,2190.0,4063585308882,https://mister-sandman.de/cdn/shop/products/so...,https://mister-sandman.de/cdn/shop/products/3_...
554,https://www.mister-sandman.de/products/topper-...,Topper und Matratzenschoner - Soft-Topper für ...,39459118973060,180x200,2290.0,4063585341735,https://mister-sandman.de/cdn/shop/products/so...,https://mister-sandman.de/cdn/shop/products/3_...
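
Since the `Ean` column is expected to hold 13-digit EAN numbers, the scraped values can be sanity-checked with the standard EAN-13 check digit. The sketch below is not part of the original solution, and the function name `is_valid_ean13` is hypothetical.

```python
def is_valid_ean13(code) -> bool:
    """Return True if code has 13 digits and a correct EAN-13 check digit."""
    text = str(code).strip()
    if len(text) != 13 or not (text.isascii() and text.isdigit()):
        return False
    digits = [int(ch) for ch in text]
    # Weights alternate 1, 3, 1, 3, ... over the first 12 digits
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    check_digit = (10 - total % 10) % 10
    return check_digit == digits[12]


print(is_valid_ean13("4063585872390"))  # True (first row of the table above)
print(is_valid_ean13("4063585872391"))  # False (last digit altered)
```

Rows with an empty or failing EAN are candidates for a manual look at the product page.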


### Scraping the web pages for the 556 links takes about 25 to 28 minutes.
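
A large share of that runtime comes from handling the links one after another with a fixed two-second pause. One possible way to shorten it, sketched here under assumptions, is a small thread pool. `scrape_all` and `fake_scrape` are hypothetical names, `fake_scrape` stands in for a real "fetch and parse one page" function, and whether the shop tolerates parallel requests is not known, so the worker count should stay low.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_all(urls, scrape_one, max_workers=4):
    """Apply scrape_one to every URL using a small thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order, not in completion order
        return list(pool.map(scrape_one, urls))


def fake_scrape(url):
    """Stand-in for a real scraping function (hypothetical)."""
    return {"url": url, "length": len(url)}


rows = scrape_all(["https://example.com/a", "https://example.com/bb"], fake_scrape)
print([row["url"] for row in rows])
# ['https://example.com/a', 'https://example.com/bb']
```

Because the results keep the input order, they can be turned into a DataFrame row by row without re-sorting.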

In [25]:
# Get the current date and time
file_name = datetime.now()
# Format the datetime object
formatted_file_name = file_name.strftime("%d.%m.%Y-%H%M%S")

# Convert the DataFrame into an Excel file
final_df.to_excel(f'Webshop_ProdData_ {formatted_file_name}.xlsx', index=False)

print(f'Webshop_ProdData_ {formatted_file_name}.xlsx File Created')

Webshop_ProdData_ 10.07.2024-184012.xlsx File Created
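
The `strftime` pattern above maps directly onto the required `dd.mm.yyyy-hhmmss` part of the file name. As a small sketch, the name can be built by a helper that takes the timestamp as an argument, which makes the format easy to check. The function name `output_file_name` is hypothetical.

```python
from datetime import datetime


def output_file_name(moment: datetime) -> str:
    """Build the export name 'Webshop_ProdData_ dd.mm.yyyy-hhmmss.xlsx' for a given time."""
    # %d.%m.%Y gives the date, %H%M%S the 24-hour time without separators
    return f"Webshop_ProdData_ {moment.strftime('%d.%m.%Y-%H%M%S')}.xlsx"


print(output_file_name(datetime(2024, 7, 10, 18, 40, 12)))
# Webshop_ProdData_ 10.07.2024-184012.xlsx
```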


In [27]:
pd.read_excel('Webshop_ProdData_ 10.07.2024-184012.xlsx')

Unnamed: 0,url,Product_title,Variation_id,Size,Price,Ean,Img1_url,Img2_url
0,https://www.mister-sandman.de/products/babymat...,Babymatratze,23030724231216,60x120,2990.0,4063585872390,https://mister-sandman.de/cdn/shop/products/ba...,https://mister-sandman.de/cdn/shop/products/Ed...
1,https://www.mister-sandman.de/products/steppbe...,Ganzjahresbettdecke,30732042895492,135x200,1890.0,4063585977477,https://mister-sandman.de/cdn/shop/products/St...,https://mister-sandman.de/cdn/shop/products/de...
2,https://www.mister-sandman.de/products/steppbe...,Ganzjahresbettdecke,30732042961028,155x220,1990.0,4063585977453,https://mister-sandman.de/cdn/shop/products/St...,https://mister-sandman.de/cdn/shop/products/de...
3,https://www.mister-sandman.de/products/steppbe...,Ganzjahresbettdecke,30732042928260,200x200,2190.0,4063585977460,https://mister-sandman.de/cdn/shop/products/St...,https://mister-sandman.de/cdn/shop/products/de...
4,https://www.mister-sandman.de/products/steppbe...,Ganzjahresbettdecke,30732042993796,200x220,2290.0,4063585977446,https://mister-sandman.de/cdn/shop/products/St...,https://mister-sandman.de/cdn/shop/products/de...
...,...,...,...,...,...,...,...,...
551,https://www.mister-sandman.de/products/topper-...,Topper und Matratzenschoner - Soft-Topper für ...,40443745960068,120x200,1990.0,4063585308905,https://mister-sandman.de/cdn/shop/products/so...,https://mister-sandman.de/cdn/shop/products/3_...
552,https://www.mister-sandman.de/products/topper-...,Topper und Matratzenschoner - Soft-Topper für ...,39459118940292,140x200,1990.0,4063585341742,https://mister-sandman.de/cdn/shop/products/so...,https://mister-sandman.de/cdn/shop/products/3_...
553,https://www.mister-sandman.de/products/topper-...,Topper und Matratzenschoner - Soft-Topper für ...,40443748352132,160x200,2190.0,4063585308882,https://mister-sandman.de/cdn/shop/products/so...,https://mister-sandman.de/cdn/shop/products/3_...
554,https://www.mister-sandman.de/products/topper-...,Topper und Matratzenschoner - Soft-Topper für ...,39459118973060,180x200,2290.0,4063585341735,https://mister-sandman.de/cdn/shop/products/so...,https://mister-sandman.de/cdn/shop/products/3_...


In [4]:
dur =  []

start = datetime.now()

end = datetime.now()

duration = end - start
dur.append(duration)

print(dur)

[datetime.timedelta(0)]


In [7]:
duration_df = pd.DataFrame({'Duration':dur})

In [9]:
duration_df

Unnamed: 0,Duration
0,0 days
