# **Sustainability in Materials and Product Prices H&M**

A scraper built with the BeautifulSoup and Selenium libraries for Python gathers information about the products on the Dutch H&M website. This information can then be used to examine the correlation between the price and the material of H&M products. This Jupyter notebook covers the following three parts:

1. The requirements for scraping
2. The H&M webscraper and saving the scraped data in a json file 
3. The data inspection of the scraped H&M data

This notebook describes the code and what each part does. The feasible sample contains 7904 products from the H&M Women section.

NOTE: Due to possible revisions to the H&M website, the class names used to extract the data may have changed.

## Chapter 1: Requirements for scraping

### 1.1 Import libraries and packages

The first cell of the notebook imports the libraries required to run the code:

* The requests library will be used to retrieve the HTML content of the Dutch H&M website.
* The BeautifulSoup library will be used to extract the product URLs to create a list of product URLs. 
* The time library will be used to pause between actions, so that pages and buttons have time to load and open.
* The json library will be used to store the scraped data in a JSON file.
* The random library is needed to shuffle the product list to get the URLs in random order. 
* The csv library will be used to store the product list with URLs in a csv file. 
* The selenium library will be used to extract the product information of the product pages. 


In [None]:
import requests
from bs4 import BeautifulSoup
import time
import json
import random 
import csv

from selenium import webdriver
import selenium.webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

### 1.2 Running requests

This part of the web scraper contains code to scrape the H&M web page(s) of the H&M Women View All section (in Dutch: 'Dames, bekijk alle items').

In [None]:
header = {'User-agent': 'Mozilla/5.0'} # the User-Agent header tells the server which browser the request claims to come from
request = requests.get('https://www2.hm.com/nl_nl/dames/shop-by-product/view-all.html', headers = header)
request.encoding = request.apparent_encoding # use the encoding detected from the response content
soup = BeautifulSoup(request.text, 'html.parser') # name the parser explicitly, as in section 1.4

### 1.3 Collecting the page URLs

The code below builds the list of page URLs. H&M paginates the View All section through an `offset` parameter in the page URL, so the list `page_urls` is created by increasing a counter in steps of 36 (the number of products per page) from 0 up to and including 9972.

In [None]:
counter = 0
page_urls = []
while counter <= 9972:
    page_urls.append(f'https://www2.hm.com/nl_nl/dames/shop-by-product/view-all.html?sort=stock&image-size=small&image=model&offset={counter}&page-size=36.html')
    counter+=36
page_urls
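
As a cross-check on the loop above, the same offsets can be generated with `range`, which also makes the number of page URLs explicit. This is a minimal sketch rather than the code that was run; `PAGE_SIZE`, `LAST_OFFSET` and `page_offsets` are illustrative names.

```python
PAGE_SIZE = 36      # products per result page, as in the loop above
LAST_OFFSET = 9972  # highest offset used in the loop above


def page_offsets(last_offset=LAST_OFFSET, page_size=PAGE_SIZE):
    """Return every offset from 0 up to and including last_offset."""
    return list(range(0, last_offset + 1, page_size))


offsets = page_offsets()
print(len(offsets))              # prints 278
print(offsets[:3], offsets[-1])  # prints [0, 36, 72] 9972
```

If the section holds exactly 9972 products and the offset is zero-based, the last URL would point past the final product and return an empty page, which is harmless for the scraper.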

### 1.4 Collecting the product URLs

The product URLs are scraped with BeautifulSoup. The following function extracts the product URLs from a parsed page; it is first applied to the first page of the View All section.

In [None]:
def get_producturls(soup):
    producturl = []
    for item in soup.find_all('li', class_="product-item"):
        firstclick = 'https://www2.hm.com' + item.find('a')['href']
        producturl.append(firstclick)

    return producturl
    
get_producturls(soup)

To gather the product URLs of all 9972 items in the Women View All section, we apply the function to every page URL collected above. The product URLs are added to the list `product_list`. Printing the length of `product_list` shows whether it indeed contains all 9972 URLs.

In [None]:
product_list = []
for page_url in page_urls:
    request = requests.get(page_url, headers=header)
    soup = BeautifulSoup(request.text, 'html.parser')
    product_list.extend(get_producturls(soup))

In [None]:
print(len(product_list))
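
The length alone does not show whether the same product appears on more than one page. A minimal sketch of a duplicate check follows, using a small made-up list in place of `product_list`; the `example.com` URLs are placeholders and `find_duplicates` and `drop_duplicates` are hypothetical helper names.

```python
from collections import Counter


def find_duplicates(urls):
    """Return the URLs that occur more than once, mapped to their counts."""
    return {url: n for url, n in Counter(urls).items() if n > 1}


def drop_duplicates(urls):
    """Remove repeated URLs while keeping the original order."""
    return list(dict.fromkeys(urls))


# Placeholder URLs standing in for product_list
sample_urls = [
    'https://example.com/a.html',
    'https://example.com/b.html',
    'https://example.com/a.html',
]

print(find_duplicates(sample_urls))       # prints {'https://example.com/a.html': 2}
print(len(drop_duplicates(sample_urls)))  # prints 2
```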

The product URLs in `product_list` are shuffled, so that any subset taken from the start of the list is a random sample of the products in this section.

In [None]:
random.shuffle(product_list)
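
`random.shuffle` reorders the list in place and gives a different order on every run. If the sample has to be reproducible, one option is to shuffle a copy with a seeded generator. The sketch below assumes this is wanted; the seed value, the placeholder URLs and the helper name `shuffled_copy` are arbitrary choices, not part of the original scraper.

```python
import random


def shuffled_copy(urls, seed):
    """Return a shuffled copy of urls; the same seed gives the same order."""
    rng = random.Random(seed)  # dedicated generator, leaves the global state alone
    result = list(urls)
    rng.shuffle(result)
    return result


# Placeholder URLs standing in for product_list
urls = [f'https://example.com/product-{i}.html' for i in range(10)]

first = shuffled_copy(urls, seed=42)
second = shuffled_copy(urls, seed=42)
sample = first[:3]  # the first items of a shuffled list form a random sample

print(first == second)  # prints True
print(len(sample))      # prints 3
```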

### 1.5 Saving the product URLs in a CSV file

The complete list of product URLs has been saved in a CSV file. 

In [None]:
with open('productlistHM.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    # Write one URL per row; no row is skipped, so the file holds the complete list
    for product_url in product_list:
        writer.writerow([product_url])
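
To confirm what ended up in the file, it can be read back and compared with the list in memory. The sketch below writes to a temporary directory so that it does not touch `productlistHM.csv`; the URLs are placeholders and `write_urls` and `read_urls` are hypothetical helper names.

```python
import csv
import os
import tempfile


def write_urls(path, urls):
    """Write one URL per row."""
    with open(path, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        for url in urls:
            writer.writerow([url])


def read_urls(path):
    """Read the first column of every non-empty row."""
    with open(path, newline='') as csvfile:
        return [row[0] for row in csv.reader(csvfile) if row]


# Placeholder URLs standing in for product_list
urls = [f'https://example.com/product-{i}.html' for i in range(5)]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'urls.csv')
    write_urls(path, urls)
    restored = read_urls(path)

print(restored == urls)  # prints True
```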

## Chapter 2: The H&M Webscraper

Chapter 2 contains the code that has been used to scrape the H&M website and to save the scraped data as a JSON file.

### 2.1 Running the chromedriver

In [None]:
driver = selenium.webdriver.Chrome()

### 2.2 Collecting the product information

The following function collects the product information. The data collected for the items in the H&M Women's section comprises the product title, the price, the color, the product URL, the timestamp of scraping, the outer-layer (Dutch: 'buitenlaag') information, the material information, and whether or not the product is a new arrival. Selenium first accepts the cookies automatically to close the cookie pop-up, after which scraping can continue. Selenium is also used to find, click and open the buttons that hold the material and new-arrival information for each product.


In [None]:
def scrape(url):
    driver.get(url)
    
    time.sleep(3)

    # The cookies are automatically accepted to close pop-ups before proceeding with web scraping.
    try:
        button = driver.find_element(By.ID,"onetrust-accept-btn-handler")
        button.click()
    except:
        button = ''
    
    time.sleep(3)

    # Extracting the title of the product being scraped
    try:
        title = driver.find_element(By.CLASS_NAME, "ProductName-module--container__bmkk9").text
    except:
        title = 'No title'  

    print(title) 
    
    # Extracting the price of the product being scraped
    try:
        price = driver.find_element(By.CLASS_NAME, "Price-module--black-large__Fa6KP").text
    except:
        price = 'No price'

    print(price)
    
    # Extracting the color of the product being scraped
    try:
        color = driver.find_element(By.CLASS_NAME, "product-input-label").text
    except:
        color = 'No color'  

    print(color)     

    # Extracting the current timestamp of collecting the product information
    timestamp = int(time.time())
    print(f'{timestamp}')
    
    # Extracting the product URL of the product being scraped
    print(f'{url}')
    
    time.sleep(10)

    # Find the material information button element by its ID
    button = driver.find_element(By.ID,"toggle-materialsAndSuppliersAccordion")

    time.sleep(3)
    
    # Click on the button
    button.click()

    time.sleep(3)
    
    # Set aria-expanded attribute to true
    driver.execute_script("arguments[0].setAttribute('aria-expanded', 'true')", button)

    # Wait for the box to open 
    time.sleep(5)

    # Extracting the buitenlaag of the product being scraped
    try:
        buitenlaag = driver.find_element(By.CLASS_NAME, "f94b22").text
    except:
        buitenlaag = 'No buitenlaag' 

    print(buitenlaag)

    # Extracting the material of the product being scraped
    try:
        material = driver.find_element(By.CLASS_NAME, "f0d614").text 
    except:
        material = 'No material'  

    print(material)

    # Find the description button element by its ID
    button = driver.find_element(By.ID,"toggle-descriptionAccordion")
    
    time.sleep(3)
    
    # Click on the button
    button.click()
    
    time.sleep(3)
    
    # Set aria-expanded attribute to true
    driver.execute_script("arguments[0].setAttribute('aria-expanded', 'true')", button)

    # Wait for the box to open 
    time.sleep(3)

    # Extracting if the product being scraped is a new arrival product
    try:
        newarrival = driver.find_element(By.CLASS_NAME, "fccfcd").text
    except:
        newarrival = 'No New Arrival'  

    print(newarrival)

    my_data = {'title': title,
    'price': price,
    'color': color,
    'timestamp': timestamp,
    'url': url,
    'buitenlaag': buitenlaag,
    'material': material,
    'newarrival': newarrival
    }
    
    return my_data

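# driver.quit() is not called here: the browser has to stay open for the loop in section 2.3.
# Call driver.quit() once all products have been scraped.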
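
The fixed `time.sleep` pauses in `scrape` wait the full duration even when the page is ready sooner, and they may be too short on a slow connection. A common alternative is to poll for a condition until a timeout expires; Selenium offers this through `WebDriverWait`, which is imported in section 1.1 but not used above. The sketch below illustrates only the polling idea in plain Python. `wait_until`, its parameters and `accordion_is_open` are hypothetical names, and this is not the Selenium API.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll_interval)


# Stand-in for "the accordion has opened": becomes true on the third check
attempts = {'count': 0}


def accordion_is_open():
    attempts['count'] += 1
    return attempts['count'] >= 3


wait_until(accordion_is_open, timeout=2.0, poll_interval=0.01)
print(attempts['count'])  # prints 3
```

With this pattern the scraper continues as soon as the condition holds, and a page that never loads produces a clear error instead of a silent placeholder value.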

### 2.3 Looping through the products and writing the collected data to a JSON file 

The following code loops over every product URL saved in the CSV file and collects the product information for each one. The collected data is appended to a JSON file, one JSON object per line. When an error occurred during scraping, for example because one of the buttons could not be found, the product was not written to the JSON file. Fields that could not be located are handled inside `scrape` and stored as placeholders such as 'No price', so a saved record is not necessarily complete.



In [None]:
with open('productlistHM.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        url = row[0]
        try:
            my_data = scrape(url)
            with open('HM_product_information.json', 'a', encoding='utf-8') as f:
                f.write(json.dumps(my_data))
                f.write('\n')

        except Exception as e:
            print(f"Error scraping {url}: {e}")
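
Because `scrape` stores a placeholder when a field is not found, it can be useful to separate complete from incomplete records when the file is read back. The sketch below shows one way to do this under stated assumptions: the two records are made up, the choice of which fields count as required is an assumption, and `append_record`, `read_records` and `is_complete` are hypothetical helper names. It writes to a temporary directory rather than to `HM_product_information.json`.

```python
import json
import os
import tempfile

# Placeholder strings taken from scrape(); treating these three fields as
# required is an assumption made for this example.
REQUIRED = {'title': 'No title', 'price': 'No price', 'material': 'No material'}


def append_record(path, record):
    """Append one record as a single JSON line."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record))
        f.write('\n')


def read_records(path):
    """Read a file that holds one JSON object per line."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


def is_complete(record):
    """True when no required field is missing or set to its placeholder."""
    return all(record.get(field, placeholder) != placeholder
               for field, placeholder in REQUIRED.items())


# Made-up records, not scraped data
sample_records = [
    {'title': 'Example top', 'price': '€ 9,99', 'material': 'Katoen 100%'},
    {'title': 'Example dress', 'price': 'No price', 'material': 'No material'},
]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sample.json')
    for record in sample_records:
        append_record(path, record)
    restored = read_records(path)

complete = [r for r in restored if is_complete(r)]
print(len(restored), len(complete))  # prints 2 1
```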

## Chapter 3: Data inspection
The output of the web scraper is stored in a JSON file. RStudio has been used to inspect the data. To import the data into RStudio, the dataset has been converted to a CSV file with Python. The dataset is also cleaned so that it can be used to identify shared characteristics and distinctive features among the products. The Python code used to convert the dataset and save it as a CSV file is shown below.

In [None]:
import json
import pandas as pd  # needed for pd.DataFrame below

with open('final_combined_HM.json') as f:
    data = []
    for line in f:
        try:
            data.append(json.loads(line))
        except json.JSONDecodeError:
            print(f"Could not parse line: {line}")
    df = pd.DataFrame(data)

In [None]:
df.to_csv('HM_products.csv', index=False)