# CC Assignment 3
*Web-Scraping Wallapop*

**Team A3:** Berta Alguero, John Bergmann, Federico Colombo, Nimit Jain, Nathaniel Thomas-Copeland

## 1) Setting up the working environment

Before data can be extracted from Wallapop using this web-scraper, the following libraries should be installed:
```arrow```, ```numpy```, ```pandas```, ```selenium```, ```tqdm```, and ```webdriver_manager```

Uncommenting and running the following cells will install the necessary libraries.

In [65]:
# pip install arrow

In [66]:
# pip install numpy

In [67]:
# pip install pandas

In [68]:
# pip install selenium

In [69]:
# pip install tqdm

In [70]:
# pip install webdriver-manager

In [71]:
# Standard library imports
import time
import random
from datetime import datetime
import re

# Third party library imports
import arrow
import numpy as np
import pandas as pd
from tqdm import tqdm
from selenium.common.exceptions import TimeoutException
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC 
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

## 2) Setting up the webdriver and accessing Wallapop

In [72]:
# Creating empty lists for scraped data to be stored in

links = []
titles = []
descriptions = []
prices = []
images = []
bike_types = []
bike_states = []
children = []
bike_sizes = []
bike_size_letters = []
dates = []

The following block of code assigns our web-scraper a random User Agent (UA), which helps "mask" its online presence and avoid detection by anti-scraping mechanisms.

Furthermore, uncommenting the following line of code allows the scraper to work in "headless" mode, disabling the graphical user-interface of the browser:
> ```#opts.add_argument("--headless")```

In [73]:
user_agents = ["Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36",
               "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36",
               "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36",
               "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
               "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
               "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
               "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"
               ]

user_agent = random.choice(user_agents)

opts = Options()
opts.add_argument(f"user-agent={user_agent}")
#opts.add_argument("--headless")
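The user-agent selection and the optional headless flag can be sketched without a browser. This is a minimal illustration, not the notebook's own code: `build_chrome_args` and the placeholder agent strings are hypothetical names introduced here.

```python
import random


def build_chrome_args(user_agents, headless=False, rng=random):
    """Pick one user agent at random and return the Chrome arguments as strings."""
    # Hypothetical helper: the notebook passes these strings to opts.add_argument
    args = [f"user-agent={rng.choice(user_agents)}"]
    if headless:
        # Same flag as the commented-out line in the notebook
        args.append("--headless")
    return args


# Placeholder strings for illustration only, not real user agents
example_agents = ["agent-a", "agent-b"]
chrome_args = build_chrome_args(example_agents, headless=True, rng=random.Random(0))
print(chrome_args)
```

Each returned string could then be handed to `opts.add_argument(...)` in a loop.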

In [74]:
# Initiate chromium instance using options defined in cell above and maximizing window

try:
    driver = webdriver.Chrome(service = Service(ChromeDriverManager().install()), options = opts)
    driver.maximize_window()
except Exception:
    # Fallback for older Selenium releases that take the driver path as the first argument
    driver = webdriver.Chrome(ChromeDriverManager().install(), options = opts)
    driver.maximize_window()

In [75]:
# If the cell above raises an error/warning related to ChromeDriverManager and/or the Service module,
# installing chromedriver locally and pointing Selenium to the executable should resolve the problem.
# The code to run would then be as follows:

#driver = webdriver.Chrome(service = Service("path to your chromedriver executable"), options = opts)

In [76]:
# Access Wallapop homepage

driver.get("https://es.wallapop.com")

In [77]:
# Await cookie pop-up request -> accept if present, otherwise, move to next cell

try:
    cookies = WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, "//*[@id='onetrust-accept-btn-handler']")))
    cookies.click()
except TimeoutException:
    pass

## 3) Defining our static search parameters

These search parameters will remain the same for the entirety of the scraping process.

The steps of this process are as follows:

>```1. Entering "bicicleta" in the search bar```

>```2. Specifying "bicicletas" as desired category```

>```3. Setting search location to Barcelona within a 10km radius```

>```4. Setting the maximum price to 800€```

In [78]:
searchbar = WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, "//*[@id='searchBoxForm']/div/div[1]/input[1]")))
searchbar.clear()
searchbar.send_keys("bicicleta")
searchbar.submit()

time.sleep(1)

In [79]:
categories = WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, "//div[contains(text(), 'Todas las categorías')]")))
categories.click()

time.sleep(1)

In [80]:
bike_category = WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, "//span[contains(text(), 'Bicicletas')]")))
bike_category.click()

time.sleep(1)

In [81]:
location_filter = WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, "//div[contains(text(), 'España, Madrid')]")))
location_filter.click()

time.sleep(1)

In [82]:
location_search = WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, "//input[@class='LocationFilter__input py-0 px-5 w-100 form-control ng-untouched ng-pristine ng-valid']")))
location_search.clear()
location_search.send_keys("España, Barcelona")
time.sleep(1)
location_search.send_keys(Keys.ENTER) 

time.sleep(1)

In [83]:
radius_slider = driver.find_element(By.XPATH, "//span[@role = 'slider']")
radius_distance = driver.find_element(By.XPATH, "//span[@class = 'ngx-slider-span ngx-slider-bubble ngx-slider-model-value']").text.replace("km", "")

while radius_distance != "10":
    ActionChains(driver).drag_and_drop_by_offset(radius_slider, -50, 0).perform()
    radius_distance = driver.find_element(By.XPATH, "//span[@class = 'ngx-slider-span ngx-slider-bubble ngx-slider-model-value']").text.replace("km", "")
    
time.sleep(1)
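The slider loop above keeps dragging until the label reads "10", so it would never end if the label never shows that value. A bounded variant could look like the sketch below; `drag_until`, its parameters, and the fake slider are hypothetical stand-ins for the Selenium calls, used here only so the logic can run without a browser.

```python
def drag_until(read_value, drag_once, target="10", max_steps=50):
    """Call drag_once() until read_value() equals target; give up after max_steps drags."""
    for steps_taken in range(max_steps + 1):
        if read_value() == target:
            return steps_taken
        if steps_taken < max_steps:
            drag_once()
    raise RuntimeError(f"Slider did not reach {target!r} after {max_steps} drags")


# Fake slider for illustration: starts at 50 km and loses 10 km per drag
fake_slider = {"km": 50}


def read_fake():
    return str(fake_slider["km"])


def drag_fake():
    fake_slider["km"] -= 10


steps = drag_until(read_fake, drag_fake, target="10")
print(steps, fake_slider["km"])
```

In the notebook, `read_value` would wrap the `find_element(...).text` lookup and `drag_once` the `ActionChains` drag.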

In [84]:
accept_location = WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'Aplicar')]")))
accept_location.click()

time.sleep(1)

In [85]:
price_filter = WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, "//div[contains(text(), 'Precio')]")))
price_filter.click()

time.sleep(1)

In [86]:
max_price = WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, "//input[@placeholder = 'Sin límite']")))
max_price.click()

time.sleep(1)

In [87]:
max_price.clear()
max_price.send_keys("800")

time.sleep(1)

In [88]:
accept_price_buttons = driver.find_elements(By.XPATH, "//*[@class = 'btn btn-filter btn-primary']")

for button in accept_price_buttons:
    try:
        if button.text == "Aplicar":
            button.click()
    except Exception:
        # Some matching buttons may not be interactable; skip them
        pass

time.sleep(1)
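The "find all primary buttons, click the one labelled Aplicar" pattern recurs several times in this notebook. One way to factor it out is sketched below. `click_apply` and `FakeButton` are hypothetical names; unlike the notebook's loop, this version stops after the first successful click, which is an assumption about the desired behaviour.

```python
def click_apply(buttons, label="Aplicar"):
    """Click the first button whose visible text equals label; return True if one was clicked."""
    for button in buttons:
        try:
            if button.text.strip() == label:
                button.click()
                return True
        except Exception:
            # Assumption: buttons that cannot be read or clicked are simply skipped
            continue
    return False


class FakeButton:
    """Stand-in for a Selenium WebElement, for illustration only."""

    def __init__(self, text):
        self.text = text
        self.clicked = False

    def click(self):
        self.clicked = True


fake_buttons = [FakeButton("Cancelar"), FakeButton("Aplicar"), FakeButton("Aplicar")]
was_clicked = click_apply(fake_buttons)
print(was_clicked, [b.clicked for b in fake_buttons])
```

With a real driver, the argument would be the list returned by `driver.find_elements(...)`.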

## 4) Designing the web-scraper

This web-scraper accomplishes the following tasks:

>```- Dynamically adjusts the search filters to retrieve a maximum of 250 entries for each combination of the "Estado del producto" and "Subcategoría" filters```

>```- Retrieves the necessary information and appends it to the lists defined above```
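The 250-entry cap is enforced while scrolling: the scraper moves down one screen height at a time and stops once the page has no more content or enough results are loaded. The sketch below isolates that stopping rule. `scroll_until_loaded` and `FakePage` are hypothetical, and the one-second pauses used in the notebook are left out.

```python
def scroll_until_loaded(scroll_to, get_scroll_height, count_results,
                        screen_height, max_results=250):
    """Scroll one screen at a time until the page ends or max_results items are loaded."""
    i = 1
    while True:
        scroll_to(screen_height * i)
        i += 1
        # Stop when the next scroll position lies past the page end or the cap is reached
        if screen_height * i > get_scroll_height() or count_results() >= max_results:
            return count_results()


class FakePage:
    """Stand-in for the browser: every scroll loads 40 more items, up to 300."""

    def __init__(self):
        self.items = 40

    def scroll_to(self, y):
        self.items = min(self.items + 40, 300)

    def scroll_height(self):
        return self.items * 100

    def count(self):
        return self.items


page = FakePage()
found = scroll_until_loaded(page.scroll_to, page.scroll_height, page.count,
                            screen_height=1000)
print(found)
```

Because items load in batches, the count can overshoot the cap, which is why the scraper later slices the results to the first 250.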

In [89]:
def super_scraper(bike_model, bike_condition):

    # Click on the "Subcategoría" button
    subfield = WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, "//div[contains(text(), 'Subcategoría')]")))
    subfield.click()
    time.sleep(1)

    # Select "Bicicletas y triciclos" from the drop-down menu
    subfield_specific = WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, "//p[contains(text(), 'Bicicletas y triciclos')]")))
    subfield_specific.click()
    time.sleep(1)

    # Select the matching bike model and confirm filter selection
    select_bike_labels = driver.find_elements(By.XPATH, "//*[@class = 'w-100 ng-star-inserted']")
    for bike in select_bike_labels:

        bike_label = bike.text.strip()

        if bike_label == bike_model:
            bike.find_element(By.XPATH, ".//tsl-checkbox-form").click()
            time.sleep(0.5)

    accept_subfield_buttons = driver.find_elements(By.XPATH, "//*[@class = 'btn btn-filter btn-primary']")
    for button in accept_subfield_buttons:
        try:
            if button.text == "Aplicar":
                button.click()
            else:
                continue
        except:
            pass
    time.sleep(1)
    
    # Click on the "Estado del producto" button
    state_filter = WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, "//div[contains(text(), 'Estado del producto')]")))
    state_filter.click()
    time.sleep(1)
    
    # Select the matching bike condition and confirm filter selection
    select_states = driver.find_elements(By.XPATH, "//*[@class = 'w-100 ng-star-inserted']")
    for bike in select_states:
    
        bike_state = bike.find_element(By.XPATH, ".//p").text.strip()
        
        if bike_state == bike_condition:
            bike.find_element(By.XPATH, ".//tsl-checkbox-form").click()
            time.sleep(0.5)

    accept_state_buttons = driver.find_elements(By.XPATH, "//*[@class = 'btn btn-filter btn-primary']")
    for button in accept_state_buttons:
        try:
            if button.text == "Aplicar":
                button.click()
            else:
                continue
        except:
            pass
    time.sleep(1)
    
    # If more than 40 postings are available for a given filter combination, the following code clicks on the 
    # "Ver más productos" button, allowing one to access all results
    try:
        # Scroll to the button
        load_more = driver.find_element(By.XPATH, "//button[contains(text(), 'Ver más productos')]")
        driver.execute_script("return arguments[0].scrollIntoView(true);", load_more)

        # Click "Ver más productos" button
        driver.execute_script("arguments[0].click();", load_more)
        time.sleep(1)

        # Scroll back up
        driver.execute_script("window.scrollTo(0, 0);")
    except Exception:
        # No "Ver más productos" button was found, so continue with the postings already listed
        pass

    # Get the screen height of the webpage
    screen_height = driver.execute_script("return window.screen.height;")
    i = 1

    # Scroll until all postings are visible OR at least 250 results have been collected
    while True:
        # Scroll one screen height at a time
        driver.execute_script(f"window.scrollTo(0, {screen_height}*{i});")
        i += 1
        time.sleep(1)
        # Update the total scroll height after each scroll action
        scroll_height = driver.execute_script("return document.body.scrollHeight;")
        # This class was specifically chosen since it ignores ads
        results = driver.find_elements(By.XPATH, "//*[@class = 'ItemCardList__item ng-star-inserted']")
        # Break the loop once the next scroll position exceeds the total scroll height OR at least 250 results have been found
        if (screen_height * i > scroll_height) or (len(results) >= 250):
            break
        time.sleep(1)

    # Extract information for each identified posting
    for offer in tqdm(results[:250], desc = f"Scraping bicycles of type '{bike_model}' and condition '{bike_condition}'", leave = False):
    
        driver.execute_script("return arguments[0].scrollIntoView(true);", offer)
        time.sleep(0.5)
        # WebDriverWait(driver, 15).until(EC.element_to_be_clickable((offer))) -> commented out since it caused problems on some machines and would require additional try/except blocks
        
        title = offer.find_element(By.TAG_NAME, "p").text.strip()
        titles.append(title)

        img = offer.find_element(By.XPATH, ".//img").get_attribute("src")
        images.append(img)
        
        offer.click()
        driver.switch_to.window(driver.window_handles[1])
        
        # During the development of this scraper, some detailed postings turned out to be no longer accessible
        # The following try/except block mitigates this issue by attempting to locate a given posting's details
        try:
            WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, "//div[@class = 'container-detail clearfix']")))
            time.sleep(1)

        # Append the faulty link while setting all other features to np.nan
        except TimeoutException:
            links.append(driver.current_url)
            descriptions.append(np.nan)
            prices.append(np.nan)
            bike_types.append(np.nan)
            bike_states.append(np.nan)
            children.append(np.nan)
            bike_sizes.append(np.nan)
            bike_size_letters.append(np.nan)
            dates.append(np.nan)

            driver.close()
            driver.switch_to.window(driver.window_handles[0])
            
            continue

        link = driver.current_url
        links.append(link)

        desc = driver.find_element(By.XPATH, "//p[@class = 'js__card-product-detail--description card-product-detail-description']").text.strip()
        descriptions.append(desc)
        time.sleep(0.25)

        # Strip the currency label and convert the price to a float (decimal commas become decimal points)
        price_text = driver.find_element(By.XPATH, "//div[@class = 'card-product-price-info']").text.replace("EUR", "").strip()
        price = float(price_text.replace(",", "."))
        prices.append(price)

        bike_types.append(bike_model)
        bike_states.append(bike_condition)
        
        child_spelling = ["niño/a", "niño", "niña", "niños" , "niñas", "niño/as"]
        
        # Check whether any spelling variant in "child_spelling" appears in the title or description
        # of the posting and append True or False accordingly
        if (any(word in child_spelling for word in title.split()) or any(word in child_spelling for word in desc.split())):
            children.append(True)
        else:
            children.append(False)
        
        # Look for a bike size next to the word "talla" in the title, description, or hashtags of the posting
        # and append it as a number (50-60) or a letter (S, M, L); np.nan is appended when no size is found
        hashtags = driver.find_element(By.XPATH, "//div[@class = 'mb-3 card-product-detail-mobile-horizontal-scroll']").text.strip().replace("#"," ")
        talla_locations = (title + " " + desc + " " + hashtags).split()
        size_letters = ["s", "m", "l"]
        sizes = [str(x) for x in range(50, 61)]

        force_break = False
        for index, word in enumerate(talla_locations): 
            word = re.sub('[^A-Za-z0-9]+', '', word.casefold()) #regex that removes everything except ASCII letters and digits

            if "talla" in word: #check for a size attached directly to "talla", e.g. "tallam" or "talla54"
                suffix = word.split("talla", 1)[1]
                if suffix in size_letters:
                    force_break = True
                    bike_size_letters.append(suffix.upper())
                    bike_sizes.append(np.nan)

                elif suffix in sizes:
                    force_break = True
                    bike_sizes.append(int(suffix))
                    bike_size_letters.append(np.nan)

            if force_break:
                break

            elif "talla" in word: #check next 2 elements for characters about size
                next2 = [1,2]
                for x in next2:
                    try:
                        y = re.sub('[^A-Za-z0-9]+', '', talla_locations[index+x].casefold()) #fancy re that gets rid of everything except numbers and text
                        if y in size_letters:
                            force_break = True
                            bike_size_letters.append(y.upper())
                            bike_sizes.append(np.nan)
                            break
                        elif y in sizes:
                            force_break = True
                            bike_sizes.append(int(y))
                            bike_size_letters.append(np.nan)
                            break
                    except IndexError:
                        pass

            if force_break:
                break

            elif index+1 == len(talla_locations):
                bike_size_letters.append(np.nan)
                bike_sizes.append(np.nan)
                
            else:
                continue
        
        # Append the posting date in datetime format using the arrow library
        date = driver.find_element(By.XPATH, "//div[@class = 'card-product-detail-user-stats-published']").text.strip()
        date = arrow.get(date, "DD-MMM-YYYY", locale = "es")
        date = date.datetime
        dates.append(date)

        driver.close()
        driver.switch_to.window(driver.window_handles[0])
        driver.execute_script("window.scrollTo(0, 0);")
        time.sleep(1)

    # Clear filter for bike model so that next filter combination can be applied
    model = WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, f"//div[contains(text(), '{bike_model}')]")))
    model.click()
    clear_model = WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, "//p[contains(text(), 'Restablecer')]")))
    clear_model.click()

    accept_clear_model = driver.find_elements(By.XPATH, "//*[@class = 'btn btn-filter btn-primary']")
    for button in accept_clear_model:
        try:
            if button.text == "Aplicar":
                button.click()
            else:
                continue
        except:
            pass
    time.sleep(1)

    # Clear filter for bike condition so that next filter combination can be applied
    condition = WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, f"//div[contains(text(), '{bike_condition}')]")))
    condition.click()
    clear_condition = WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, "//p[contains(text(), 'Restablecer')]")))
    clear_condition.click()

    accept_clear_condition = driver.find_elements(By.XPATH, "//*[@class = 'btn btn-filter btn-primary']")
    for button in accept_clear_condition:
        try:
            if button.text == "Aplicar":
                button.click()
            else:
                continue
        except:
            pass
    time.sleep(1)
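The size detection inside `super_scraper` walks through the words of the title, description, and hashtags looking for a size near "talla". As a point of comparison, the same idea can be sketched with a single regular expression. This is a different technique from the notebook's loop, not a drop-in replacement: `SIZE_PATTERN` and `extract_size` are hypothetical names, and the pattern only covers a size that follows "talla" within a few non-word characters.

```python
import re

# Numbers 50-60 may be attached to "talla" ("talla54"); letters need a separator,
# so that the plural "tallas" is not read as size S
SIZE_PATTERN = re.compile(
    r"\btalla(?:\W{0,3}(5\d|60)|\W{1,3}([sml]))\b",
    re.IGNORECASE,
)


def extract_size(text):
    """Return (numeric_size, letter_size); each part is None when it was not found."""
    match = SIZE_PATTERN.search(text)
    if match is None:
        return None, None
    number, letter = match.groups()
    if number is not None:
        return int(number), None
    return None, letter.upper()


for sample in ["Bici de carretera talla 54", "Vendo MTB, Talla M.", "varias tallas disponibles"]:
    print(sample, "->", extract_size(sample))
```

Phrasings such as "talla de cuadro 54" are not matched by this pattern, so it trades some recall for fewer false positives.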

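The price conversion in `super_scraper` can also be tested in isolation. The helper below is a hedged sketch: `parse_price` is a hypothetical name, and treating dots as thousands separators whenever a decimal comma is present is an assumption about how prices are displayed, not something the notebook confirms.

```python
def parse_price(text):
    """Convert a price string such as '115,50 EUR' to a float."""
    cleaned = text.replace("EUR", "").replace("€", "").strip()
    if "," in cleaned:
        # Assumption: the comma is the decimal separator and dots group thousands
        cleaned = cleaned.replace(".", "").replace(",", ".")
    return float(cleaned)


for raw in ["90 EUR", "115,50 EUR", "1.250,50 EUR"]:
    print(raw, "->", parse_price(raw))
```

A value such as "1.250" without a comma is still read as 1.25, exactly as `float` would read it; with the 800 € price cap used here that case should not arise.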
## 5) Applying web-scraping function to desired filter combinations

In [90]:
models = ["Bicicletas de carretera", "Bicicletas plegables", "MTB"]
conditions = ["Nuevo", "Como nuevo", "En buen estado"]

In [91]:
for m in models:
    for c in conditions:
        super_scraper(m, c)
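The nested loop visits every pairing of the three models with the three conditions. A quick way to list those pairings, and to see that at most 9 × 250 = 2250 postings can be collected before duplicates are removed, is sketched below with `itertools.product`; the variable `combinations` is introduced only for this illustration.

```python
from itertools import product

models = ["Bicicletas de carretera", "Bicicletas plegables", "MTB"]
conditions = ["Nuevo", "Como nuevo", "En buen estado"]

# Same order as the nested for-loops: the model varies slowest
combinations = list(product(models, conditions))

for bike_model, bike_condition in combinations:
    print(f"{bike_model} / {bike_condition}")

# Upper bound on collected postings, given the cap of 250 per combination
print(len(combinations) * 250)
```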

In [92]:
driver.quit()

## 6) Creating a main dataframe with the collected data

In [93]:
df = pd.DataFrame({
    "Link": links,
    "Title": titles,
    "Description": descriptions,
    "Price": prices,
    "Image": images,
    "Type": bike_types,
    "State": bike_states,
    "Children": children,
    "Size": bike_sizes,
    "Size (letter)": bike_size_letters,
    "Date": dates
})
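Building a DataFrame from a dictionary of lists only works when every list has the same length, so a scraping run that skipped an append somewhere would fail at this step with a `ValueError`. A small check like the one below can show which column is off before the DataFrame is built. `check_equal_lengths` and the sample data are hypothetical.

```python
def check_equal_lengths(columns):
    """Return the shared length of all lists in columns, or raise ValueError."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


# Illustrative data only; in the notebook these would be the scraped lists
sample_columns = {"Link": ["a", "b"], "Title": ["x", "y"], "Price": [90.0, 290.0]}
n_rows = check_equal_lengths(sample_columns)
print(n_rows)
```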

In [94]:
# Removing any possible duplicate entries based on link (since unique)

df.drop_duplicates(subset = ["Link"], inplace = True)

In [95]:
df.head()

| | Link | Title | Description | Price | Image | Type | State | Children | Size | Size (letter) | Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://es.wallapop.com/item/bicicleta-montana... | Bicicleta montaña | Bicicleta de montaña para niño/a a partir de 1... | 90.0 | https://cdn.wallapop.com/images/10420/e5/qy/__... | Bicicletas de carretera | Nuevo | True | NaN | NaN | 2022-12-08 00:00:00+00:00 |
| 1 | https://es.wallapop.com/item/bicicleta-deporti... | bicicleta deportiva 290€ | No la he usado esta nueva | 290.0 | https://cdn.wallapop.com/images/10420/e5/q0/__... | Bicicletas de carretera | Nuevo | False | NaN | NaN | 2022-12-08 00:00:00+00:00 |
| 2 | https://es.wallapop.com/item/bicicleta-krn-kra... | Bicicleta KRN Krampus Monster Truck | Cuadro de Aluminio Horquilla de Aluminio Posib... | 795.0 | https://cdn.wallapop.com/images/10420/e5/q2/__... | Bicicletas de carretera | Nuevo | False | NaN | NaN | 2022-12-09 00:00:00+00:00 |
| 3 | https://es.wallapop.com/item/bicicleta-montana... | Bicicleta montaña barata y nueva | Vendo boco nueva, nunca ha sido utilizado. | 115.0 | https://cdn.wallapop.com/images/10420/e4/sm/__... | Bicicletas de carretera | Nuevo | False | NaN | NaN | 2022-12-03 00:00:00+00:00 |
| 4 | https://es.wallapop.com/item/bicicleta-nueva-8... | BICICLETA NUEVA | Bicicleta nueva de hace un año, usada solo dos... | 300.0 | https://cdn.wallapop.com/images/10420/e3/ty/__... | Bicicletas de carretera | Nuevo | False | NaN | L | 2022-11-28 00:00:00+00:00 |


In [96]:
df = df.astype({"Size": "Int64"})

In [97]:
df.dtypes

Link                              object
Title                             object
Description                       object
Price                            float64
Image                             object
Type                              object
State                             object
Children                            bool
Size                               Int64
Size (letter)                     object
Date             datetime64[ns, tzutc()]
dtype: object

## 7) Providing an overview of average bike pricing grouped by type/state combinations

In [98]:
agg = df.groupby(["Type", "State"])[["Price"]].mean().rename(columns = {"Price": "Avg Price"})

In [99]:
agg.head(10)

| Type | State | Avg Price |
|---|---|---|
| Bicicletas de carretera | Como nuevo | 317.65 |
| Bicicletas de carretera | En buen estado | 274.47956 |
| Bicicletas de carretera | Nuevo | 317.805556 |
| Bicicletas plegables | Como nuevo | 168.354167 |
| Bicicletas plegables | En buen estado | 113.930385 |
| Bicicletas plegables | Nuevo | 258.770909 |
| MTB | Como nuevo | 235.375 |
| MTB | En buen estado | 259.925 |
| MTB | Nuevo | 344.41525 |
