# **WEB ANALYTICS – Data Science and Engineering Degree**  
**(1st Semester, 4th-year-level Course)**  

## **Web Scraping with Selenium**  

This lab was part of my **Web Analytics** course at **Universidad Carlos III de Madrid (UC3M)**, where I studied abroad from **September 2024 to December 2024** as part of my **Computer Science degree**. The lab focused on **web scraping automation using Selenium**, in particular on extracting data from websites with JavaScript-rendered content. It introduced **browser automation, handling dynamic elements, and interacting with web pages programmatically**.

Working in a **group of three students**, we developed an automated web scraper using **Selenium and Python** to interact with websites, extract structured data, and apply best practices in **web automation, data extraction, and ethical scraping**.

---

## **Automated Web Scraping and Data Extraction**  
We implemented a series of milestones that covered **real-world automated web scraping scenarios**, including:

- **Navigating and interacting with JavaScript-heavy websites using Selenium.**  
- **Extracting real-time job listings, prices, and product availability from dynamic web pages.**  
- **Handling cookies, login authentication, and AJAX-loaded content.**  
- **Using headless browsers for efficient data collection.**  

---

## **Milestones**  

### **Milestone 1: Automating Browser Interaction**  
- Launched a Selenium-controlled browser to navigate a target website.  
- Simulated user interactions, such as clicking buttons and scrolling pages.  

### **Milestone 2: Extracting Job Listings from Dynamic Websites**  
- Accessed job postings on a live employment platform.  
- Extracted **job titles, company names, locations, and salaries** from dynamically loaded listings.  

### **Milestone 3: Scraping Product Prices from E-commerce Sites**  
- Identified and bypassed **anti-scraping mechanisms** to collect product details.  
- Extracted **product names, prices, availability, and ratings** from a retail website.  

### **Milestone 4: Handling Authentication & Cookies**  
- Automated login sequences for a protected website.  
- Stored and managed session cookies to maintain access for data extraction.  
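
The cookie handling described in this milestone can be sketched as follows. This is a minimal sketch, not the code we submitted: the helper names `save_cookies` and `load_cookies` and the JSON file format are hypothetical choices, while `driver.get_cookies()` and `driver.add_cookie()` are standard Selenium WebDriver methods. It assumes the driver has already opened a page on the cookies' domain before `load_cookies` is called, because Selenium only accepts cookies for the current domain.

```python
import json
from pathlib import Path


def save_cookies(driver, path):
    """Write the current session's cookies to a JSON file (hypothetical helper)."""
    cookies = driver.get_cookies()  # list of dicts, one per cookie
    Path(path).write_text(json.dumps(cookies, indent=2), encoding="utf-8")
    return len(cookies)


def load_cookies(driver, path):
    """Re-add previously saved cookies to the current session (hypothetical helper).

    Assumes driver.get(...) was already called on the cookies' domain.
    """
    cookies = json.loads(Path(path).read_text(encoding="utf-8"))
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)
```

A typical flow would be to log in once, call `save_cookies(driver, "cookies.json")`, and in a later session open the site, call `load_cookies(driver, "cookies.json")`, and refresh the page.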

---

## **Outcome**  
Through this lab, we gained hands-on experience in **automated web scraping, browser simulation, and handling dynamic web content**. We developed Python-based scripts using **Selenium WebDriver**, interacted with **JavaScript-heavy web pages**, and applied ethical and responsible web scraping techniques. This lab prepared us to **extract structured data from modern websites, automate repetitive web tasks, and build intelligent data collection systems**.

---

## **Technologies Used**  
- **Python**  
- **Selenium WebDriver**  
- **BeautifulSoup (for post-processing scraped data)**  
- **Chromedriver / Geckodriver**  
- **Headless Browser Execution**  

# 0. Lab Preparation

1. Study and understand the concepts explained in the theoretical class and the introductory lab.

2. Gain experience using [Selenium](https://www.selenium.dev/). The exercises in this lab rely mainly on functions offered by this library.

3. Students are assumed to have experience with Python notebooks, using either a local installation (e.g., local Python + Jupyter) or a cloud-based solution (e.g., Google Colab). *We recommend the second option*.

# 1. Lab Introduction

* In this lab, we will implement a web scraper using [Selenium](https://www.selenium.dev/), one of the tools explained in the theoretical class.

* The lab will be done in groups of 4 people.

* The lab defines a set of milestones the students must complete. Upon completing all the milestones, students should call the professor, who will check the correctness of the solution (*if the professor is busy, do not wait; move on to the next lab*).

* **The final mark will be computed as a function of the number of milestones successfully completed.**

* **Each group should also share their lab notebook with the professor upon the finalization of the lab.**

* In this lab we will use the [Selenium](https://www.selenium.dev/) library to create a web scraper that extracts information from the web. As indicated in the *Lab Preparation* section above, students are expected to have gained experience with the library before the first session of the lab.

* It is recommended to use [Google Colab](https://colab.research.google.com/) to produce the Python notebook with the solution of the lab. Students who prefer their local programming environment (e.g., Jupyter) and Python installation are welcome to use it.

In [None]:
!apt-get update
!apt install chromium-chromedriver
!pip install selenium webdriver_manager
!cp /usr/lib/chromium-browser/chromedriver /usr/bin


# MILESTONE 1

a) Access the website [BACHELOR IN DATA SCIENCE AND ENGINEERING](https://www.uc3m.es/bachelor-degree/data-science).

b) Find the element with `id="program"` and print the result.

c) Find the table inside PROGRAM for Course 1 - Semester 1 and print the result


In [None]:
# set options to be headless
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# open it, go to a website, and get results
driver = webdriver.Chrome(options=options)

url = "https://www.uc3m.es/bachelor-degree/data-science"
driver.get(url)  # Open the URL

# Find the element with id="program"
program_element = driver.find_element(By.ID, 'program')
program_text = program_element.text
print(program_text)  # Print the text content


PROGRAM


In [None]:
table_element = driver.find_element(By.XPATH, "//h3[contains(text(), 'Year 1 - Semester 1')]//following::table[1]")
print(table_element.get_attribute('innerHTML'))

<caption class="oculto">General subjects</caption><thead><tr><th class="first" id="Subjects-1-1-1371240957274">Subjects</th><th id="ECTS-1-1-1371240957274">ECTS</th><th id="TYPE-1-1-1371240957274">TYPE</th><th id="Language-1-1-1371240957274" class="last">Language</th></tr></thead><tbody><tr><td data-label="Subject" headers="Subjects-1-1-1371240957274"><a href="https://aplicaciones.uc3m.es/cpa/generaFicha?&amp;est=350&amp;plan=392&amp;asig=16472&amp;idioma=2"><span class="">Calculus I</span></a></td><td data-label="ECTS" headers="ECTS-1-1-1371240957274"><span class="">6</span></td><td data-label="TYPE" headers="TYPE-1-1-1371240957274"><span class="">BC</span></td><td class="listaIdiomas" data-label="Language" headers="Language-1-1-1371240957274"><img class="idioma_img" src="/base/media/base/img/decorativa/IMG_Comunes_IdiomaEN_Square/ingles.jpg" alt="English"></td></tr><tr><td data-label="Subject" headers="Subjects-1-1-1371240957274"><a href="https://aplicaciones.uc3m.es/cpa/generaFicha?
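
The raw `innerHTML` above is hard to read, so it can help to turn it into rows of cell text. The following is a minimal sketch using only the standard library's `html.parser`; the class name `TableRowParser` and the `SAMPLE_HTML` string are hypothetical, and the sample is a shortened version of the output above with most attributes removed. With the live page, `parser.feed(table_element.get_attribute('innerHTML'))` would be used instead of the sample.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "img" and self._cell is not None:
            # The language column is an image, so use its alt text.
            alt = dict(attrs).get("alt")
            if alt:
                self._cell.append(alt)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Shortened sample of the table markup printed above (assumption: same structure).
SAMPLE_HTML = (
    '<thead><tr><th>Subjects</th><th>ECTS</th><th>TYPE</th><th>Language</th></tr></thead>'
    '<tbody><tr><td><a href="#"><span class="">Calculus I</span></a></td>'
    '<td><span class="">6</span></td><td><span class="">BC</span></td>'
    '<td><img class="idioma_img" src="ingles.jpg" alt="English"></td></tr></tbody>'
)

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
for row in parser.rows:
    print(row)
```

BeautifulSoup, listed under Technologies Used, would do the same job with less code; the standard-library parser is used here only to keep the sketch free of extra dependencies.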

# MILESTONE 2

a) Obtain the link to the Web Analytics course by finding the corresponding element in the source code.

b) Access this URL by clicking on the link.

c) Print the text inside the _Learning activities and methodology_ section.

In [None]:
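url = "https://www.uc3m.es/bachelor-degree/data-science"
driver.get(url)

# a) Locate the <a> element that wraps the "Web Analytics" span and print its URL
link = driver.find_element(By.XPATH, '//span[text()="Web Analytics"]/parent::a')
href = link.get_attribute('href')
print(href)

# Click the "Program" heading before clicking the course link in the next cell
program = driver.find_element(By.XPATH, '//h2[contains(text(), "Program")]')
program.click()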

https://aplicaciones.uc3m.es/cpa/generaFicha?&est=350&plan=392&asig=16507&idioma=2


In [None]:
link = driver.find_element(By.XPATH, '//span[text()="Web Analytics"]/parent::a')
link.click()

In [None]:
heading = driver.find_element(By.XPATH, ".//div[@class='panel-heading degradado' and contains(text(), 'Learning activities and methodology')]")

# Get the panel body, which is the heading's following sibling
panel_body = heading.find_element(By.XPATH, "./following-sibling::div[@class='panel-body']")

# Extract the text content from the panel body
learning_activities_text = panel_body.text.strip()

print(learning_activities_text)

driver.quit()

The course will be based in the following activities:

- LECTURES: theoretical lessons that will introduce the main concepts of the course. Students participation to discuss the concepts and problems introduced in the lectures will be encouraged.

- LABS: practical lessons in which students will bring to practice the concepts introduced in lectures. Students will have to solve practical problems associated to web analytics.

- FINAL GROUP PROJECT: Students will be assigned a project that will be developed throughout the semester in groups of 2 oe 3 people. Students should propose their own project. In exceptional cases the professors may offer a list of projects to students. The responsible professor has to approve the student proposal. The project will include the following elements: 
 1- An initial definition of the goals of the project, technology used and expected results
 2- Use of any of the data collection studied to retrieve information from some popular online service or socia

# MILESTONE 3

Now let's build the first steps of a price monitoring tool. For that, we are going to use yamovil.es to obtain car prices. Specifically, we want to find SEAT cars in Madrid and the price of each of them.

Follow these steps:

a) Check https://www.yamovil.es/robots.txt and see if the site can be crawled or not for our specific search. Explain.

b) If yes, use this [URL](https://www.yamovil.es/coches-segunda-mano/seat-ocasion-en-madrid) which already includes the indicated search (SEAT Cars Madrid Second Hand), print the cookies banner text and accept the cookies by clicking on the accept button.

c) Scrape the HTML using _Selenium_, and print the **make**, **model**, **version** and **price** of each available car.

d) Click on the last car, print the URL you have navigated to, and print the location of the car ("este coche se encuentra en ...").

**HINT**: do not forget to quit the driver at the end of your code with `driver.quit()`

a) Yes. The site's robots.txt file allows crawling for our search.

The search path (`/coches-segunda-mano/seat-ocasion-en-madrid`) does not match any of the `Disallow` rules, which apply to all user agents:

User-agent: *

Disallow: /admin/

Disallow: /feed/

Disallow: /goal/

Disallow: /sobre-coches-y-concesionarios/category/

Disallow: /sobre-coches-y-concesionarios/articulos/

Disallow: /sobre-coches-y-concesionarios/author/
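
The same reasoning can be checked programmatically with the standard library's `urllib.robotparser`. This is a minimal sketch: the rules are copied from the answer above and parsed from a string, so nothing is fetched from the network. Against the live site, `set_url("https://www.yamovil.es/robots.txt")` followed by `read()` would replace the `parse` call, and the result could differ if the file has changed since.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the answer above; parsed offline instead of fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /feed/
Disallow: /goal/
Disallow: /sobre-coches-y-concesionarios/category/
Disallow: /sobre-coches-y-concesionarios/articulos/
Disallow: /sobre-coches-y-concesionarios/author/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

search_url = "https://www.yamovil.es/coches-segunda-mano/seat-ocasion-en-madrid"
admin_url = "https://www.yamovil.es/admin/"

print(robots.can_fetch("*", search_url))  # our search page
print(robots.can_fetch("*", admin_url))   # a disallowed path, for comparison
```

With these rules, the search URL is reported as fetchable and the `/admin/` URL is not.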

In [None]:
# The driver was closed with driver.quit() at the end of Milestone 2,
# so start a new headless session with the same options
driver = webdriver.Chrome(options=options)

# Step 1: Access the page
driver.get("https://www.yamovil.es/coches-segunda-mano/seat-ocasion-en-madrid")

# Step 2: Find the cookies banner text and print it
cookies_banner = driver.find_element(By.ID, "CybotCookiebotDialogBodyContentText")
print(cookies_banner.text)

# Step 3: Accept the cookies (JavaScript click on the accept button)
accept_button = driver.find_element(By.ID, "CybotCookiebotDialogBodyLevelButtonLevelOptinAllowAll")
driver.execute_script('arguments[0].click();', accept_button)

Las cookies de este sitio web se usan para personalizar el contenido y los anuncios, ofrecer funciones de redes sociales y analizar el tráfico. Además, compartimos información sobre el uso que haga del sitio web con nuestros partners de redes sociales, publicidad y análisis web, quienes pueden combinarla con otra información que les haya proporcionado o que hayan recopilado a partir del uso que haya hecho de sus servicios. para más información visita nuestra política de cookies.


In [None]:
# Each car on the results page is one "vehicle-list__item" element
vehicle_list = driver.find_elements(By.CLASS_NAME, "vehicle-list__item")
for vehicle in vehicle_list:
    make = vehicle.find_element(By.CLASS_NAME, "make").text
    model = vehicle.find_element(By.CLASS_NAME, "model").text
    version = vehicle.find_element(By.CLASS_NAME, "version").text
    price = vehicle.find_element(By.CLASS_NAME, "price").text

    print(make)
    print(model)
    print(version)
    print(price)


Seat
Leon
1.4 TSI ACT SANDS FR 110 kW (150 CV)
16.450€
Seat
Leon
1.4 TSI ACT SANDS FR 110 kW (150 CV)
17.450€
Seat
Ateca
1.5 TSI SANDS X-Perience Go 110 kW (150 CV)
21.390€
Seat
Ateca
1.5 TSI SANDS Xcellence Edition 110 kW (150 CV)
18.950€
Seat
Leon ST
1.5 TGI GNC SANDS FR Edition Plus DSG 96 kW (130 CV)
19.250€
Seat
Leon
1.5 TSI SANDS FR Edition Plus 110 kW (150 CV)
16.950€
Seat
Mii
1.0 Style 55 kW (75 CV)
7.990€
Seat
Ibiza
1.2 TSI Reference Plus Limited 66 kW (90 CV)
9.750€
Seat
Ibiza SC
1.2 Reference 51 kW (70 CV)
7.690€
Seat
Leon
1.5 TSI SANDS FR Edition 96 kW (130 CV)
16.950€
Seat
Ibiza SC
1.6 TDI Style 66 kW (90 CV)
8.950€
Seat
Leon ST
1.5 TSI SANDS Xcellence Go L 96 kW (130 CV)
18.450€
Seat
Córdoba
1.4 16V Stylance 74 kW (100 CV)
6.450€
Seat
Leon
1.5 eTSI SANDS FR Go L DSG 110 kW (150 CV)
21.950€
Seat
Arona
1.0 TSI SANDS FR XM Edition 81 kW (110 CV)
17.950€
Seat
Arona
1.0 TSI FR XM 81 kW (110 CV)
18.150€
Seat
Arona
1.0 TSI FR XM DSG 81 kW (110 CV)
19.250€
Seat
Arona
1.0 TSI FR X
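
For price monitoring, the printed strings are easier to compare once the prices are numeric. The following is a minimal sketch, not part of the original solution: `parse_price_eur` and `Listing` are hypothetical names, and the parsing assumes the format seen in the output above (a dot as thousands separator, no cents, a trailing `€`).

```python
from dataclasses import dataclass


def parse_price_eur(text):
    """Convert a price such as '16.450€' to an integer number of euros.

    Assumes Spanish formatting: '.' separates thousands and no cents are shown.
    """
    digits = text.replace("€", "").replace(".", "").strip()
    return int(digits)


@dataclass
class Listing:
    make: str
    model: str
    version: str
    price_eur: int


# Two rows taken from the output above.
raw_rows = [
    ("Seat", "Leon", "1.4 TSI ACT SANDS FR 110 kW (150 CV)", "16.450€"),
    ("Seat", "Mii", "1.0 Style 55 kW (75 CV)", "7.990€"),
]

listings = [Listing(mk, md, ver, parse_price_eur(p)) for mk, md, ver, p in raw_rows]
cheapest = min(listings, key=lambda item: item.price_eur)
print(cheapest.model, cheapest.price_eur)
```

In the scraping loop, the four `.text` values of each vehicle would be passed to `Listing` in place of the hard-coded rows.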

In [None]:
# Click the last car in the list and print the URL we land on
last_vehicle = vehicle_list[-1]
last_vehicle.click()
print(driver.current_url)

https://www.yamovil.es/coches-segunda-mano/seat/mii/1-0-cosmopolitan-55-kw-75-cv


In [None]:
location = driver.find_element(By.CLASS_NAME, 'vehicle-header__branch')
print(location.text)

# Close the browser, as the hint above asks
driver.quit()

Este coche se encuentra en Pinto
