<a href="https://colab.research.google.com/github/Marcos-Sanson/UC3M-Web-Analytics/blob/main/Beautiful_Soup_Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **WEB ANALYTICS – Data Science and Engineering Degree**  
## *(1st Semester, 4th-Year-Level Course)*  

### **Web Scraping with BeautifulSoup**  

This lab was part of my **Web Analytics** course at **Universidad Carlos III de Madrid (UC3M)**, where I studied abroad from **September 2024 to December 2024** as part of my Computer Science degree. This specific lab focused on **web scraping techniques** using **Python, Requests, and BeautifulSoup** to extract structured data from websites. The lab introduced **HTML parsing, DOM traversal, and automated data collection**, providing hands-on experience in gathering information from public web pages.  

Working in a group of three students, we developed a **web scraper** that extracts information from several sources while applying **best practices in web crawling, data extraction, and API request handling**.  

### **Web Scraping and Data Extraction**  

We implemented a series of milestones that covered **real-world web scraping scenarios**, including:  
- Extracting **university program details** from the **UC3M website**.  
- Navigating **HTML elements and attributes** using **BeautifulSoup**.  
- Scraping **automobile listings** to build a **price monitoring system**.  
- Identifying **robots.txt restrictions** to respect website policies.  

### **Milestones**  

#### **Milestone 1: Extracting University Program Data**  
We accessed the **Bachelor in Data Science and Engineering program** page at UC3M and extracted key details:  
- Located and printed the **"Quality"** section from the page.  
- Retrieved and displayed **available student places per campus**.  

#### **Milestone 2: Extracting Course-Specific Information**  
We followed an internal link to the **Web Analytics course page** and used BeautifulSoup to extract:  
- The **URL linking to the Web Analytics course**.  
- The **Objectives section**, detailing the course's learning outcomes in **data visualization, web crawling, and machine learning applications**.  

#### **Milestone 3: Scraping Automobile Listings for Price Monitoring**  
We developed an initial **price monitoring system** for **second-hand SEAT vehicles in Madrid**:  
- Verified that **robots.txt** did not restrict scraping for this data.  
- Scraped the **Yamovil website**, extracting:  
  - **Car make, model, and version**.  
  - **Listed prices for 30 available SEAT vehicles**.  

This milestone demonstrated a **practical application of web scraping for market research**: **automated data collection for price tracking**.  

### **Outcome**  
Through this lab, we gained experience in **web scraping fundamentals**, including **HTML parsing, data extraction, and web automation**. We developed **Python-based web scrapers**, applied **ethical scraping techniques**, and **navigated real-world website structures** to extract valuable information efficiently.  


# 0. Lab Preparation

1.  Study the concepts explained in the theoretical class and the introductory lab until they are clear.

2.   Gain experience with the [Requests](https://docs.python-requests.org/en/master/) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) libraries. The exercises in this lab rely mainly on functions offered by these libraries.

3. Students are assumed to have experience with Python notebooks, using either a local installation (e.g., local Python installation + Jupyter) or a cloud-based solution (e.g., Google Colab). *We recommend the second option*.

# 1. Lab Introduction


* In this lab, we will implement a web scraper using the parsing library [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), one of the tools explained in the theoretical class.

* The lab will be done in groups of 3 people.

* The lab defines a set of milestones the students must complete. Upon completing each milestone, students should call the professor, who will check the correctness of the solution (*if the professor is busy, do not wait for them; move on to the next milestone*).

* **The final mark will be computed as a function of the number of milestones successfully completed.**

* **Each group should also share their lab notebook with the professor upon the finalization of the lab.**

* In this lab we will use the [Requests](https://docs.python-requests.org/en/master/) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) libraries for the creation of a web scraper, to extract information from the web. As indicated in the *Lab Preparation* section above, it is expected that students have gained experience in the use of the libraries before starting the first session of the lab.

- It is recommended to use [Google Colab](https://colab.research.google.com/) to produce the Python notebook with the solution of the lab. Of course, students who prefer their local programming environment (e.g., Jupyter) and Python installation are welcome to use it.

# MILESTONE 1

a) Access the website [BACHELOR IN DATA SCIENCE AND ENGINEERING](https://www.uc3m.es/bachelor-degree/data-science)

b) Create the _BeautifulSoup_ object.

c) Find the element tag with `id="quality"` and print the result.

d) Find the `Places offered:` inside QUALITY and print the result.


In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
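web_url = "https://www.uc3m.es/bachelor-degree/data-science#program"
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36'
# Identify the client with a browser-like User-Agent header
page_0 = requests.get(web_url, headers={"User-Agent": user_agent})

# Parse the downloaded HTML
soup_0 = BeautifulSoup(page_0.content, "html.parser")

# find_all returns a list with every element whose id is "quality"
element_tag = soup_0.find_all(id='quality')
print(element_tag)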

[<div class="marcoParrafo" id="quality">
<h2>Quality</h2>
</div>]


In [3]:
places_offered = soup_0.find("p", string="Places offered:").find_next().text

print(places_offered)


Leganes Campus: 50
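
The lookup above raises an `AttributeError` if the label is not found, because `find` returns `None` in that case. A minimal sketch of a more defensive version follows. It runs on a hypothetical HTML fragment (`SAMPLE_HTML`) rather than the live page, and it assumes the value sits in the `<p>` element that follows the label; the real markup may differ. The helper name `places_offered` is ours.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the program page (assumption).
SAMPLE_HTML = """
<div class="marcoParrafo" id="quality">
<h2>Quality</h2>
</div>
<p>Places offered:</p>
<p>Leganes Campus: 50</p>
"""


def places_offered(html):
    """Return the text of the <p> after 'Places offered:', or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # find returns None when no <p> has exactly this text
    label = soup.find("p", string="Places offered:")
    if label is None:
        return None
    # Next <p> in document order after the label
    value = label.find_next("p")
    return value.get_text(strip=True) if value is not None else None


result = places_offered(SAMPLE_HTML)
missing = places_offered("<p>No such label here</p>")
print(result)
print(missing)
```

Returning `None` instead of raising lets the caller decide how to react when the page layout changes.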



# MILESTONE 2

a) Obtain the link to the Web Analytics course (see inside Program) by finding the corresponding href with _BeautifulSoup_.

b) Access this URL and create a new _BeautifulSoup_ object.

c) Print the text inside the Objectives section.


In [4]:
web_analytics_url = "https://aplicaciones.uc3m.es/cpa/generaFicha?&est=350&plan=392&asig=16507&idioma=2"

page_1 = requests.get(web_analytics_url)
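
The cell above uses the course URL directly. Step a) asks for the link to be located through its `href`, which could be sketched as follows. The HTML fragment, the link text `"Web Analytics"`, and the helper name `find_course_url` are assumptions for illustration; the structure of the real Program section may differ.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.uc3m.es/bachelor-degree/data-science"

# Hypothetical fragment standing in for the Program section (assumption).
SAMPLE_HTML = """
<ul>
  <li><a href="https://aplicaciones.uc3m.es/cpa/generaFicha?&amp;est=350&amp;plan=392&amp;asig=16507&amp;idioma=2">Web Analytics</a></li>
  <li><a href="/some/other/course">Another course</a></li>
</ul>
"""


def find_course_url(html, link_text, base_url):
    """Return the absolute URL of the first <a> whose text equals link_text."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only anchors that actually carry an href attribute
    for anchor in soup.find_all("a", href=True):
        if anchor.get_text(strip=True) == link_text:
            # urljoin resolves relative links against the page URL
            return urljoin(base_url, anchor["href"])
    return None


course_url = find_course_url(SAMPLE_HTML, "Web Analytics", BASE_URL)
other_url = find_course_url(SAMPLE_HTML, "Another course", BASE_URL)
print(course_url)
print(other_url)
```

On the live page, the same function would receive `page_0.content` from Milestone 1 instead of `SAMPLE_HTML`.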

In [5]:
soup_1 = BeautifulSoup(page_1.content, "html.parser")

content = soup_1.find_all(class_='tarea')[1].text


In [6]:
print(content)

1. Students should be able to demonstrate they have acquired and understood the knowledge associated to an area that starts from high school education and reach a level that although it is based on text books, it also includes aspects that include concepts coming from up-to-date knowledge in the referread area.
2. Students should be able to apply the acquired knowledge to their job in a professional way and should incorporate the required competences that can be shown through solid arguments and the resolution of problems within their area of study.
3. Ability to design solutions based on automatic knowledge within applications applied to specific domains such as: recommendation systems, natural language processing, the WEB or online social networks.
4. Ability to develop web and mobile applications and crawlers to collect data using  them.
5. Ability to develop data visualization tools to communicate the results derived from data analysis.
6. Adequate knowledge and skills to analyze a

# MILESTONE 3

Now let's build the first steps for a price monitoring website. For that, we are going to use yamovil.es to obtain car prices. Specifically, we want to find SEAT cars in Madrid and the price of each of them.

Follow these steps:

a) Check https://www.yamovil.es/robots.txt and see if the site can be crawled or not for our specific search. Explain.

b) If yes, use this [URL](https://www.yamovil.es/coches-segunda-mano/seat-ocasion-en-madrid) which already includes the indicated search (SEAT Cars Madrid Second Hand), scrape the HTML using _BeautifulSoup_, and print the **make**, **model**, **version** and **price** of each available car.

**HINT:** The resulting list should have 30 cars (the ones that appear on the first page).


***a) Yes, the site can be crawled for our search, because its robots.txt only disallows the following paths:***

User-agent: *
*   Disallow: /admin/
*   Disallow: /feed/
*   Disallow: /goal/
*   Disallow: /sobre-coches-y-concesionarios/category/
*   Disallow: /sobre-coches-y-concesionarios/articulos/
*   Disallow: /sobre-coches-y-concesionarios/author/

None of these rules covers our specific search for SEAT Cars Madrid Second Hand, so we may crawl it. The listed paths, however, are off limits to us.
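
The same check can be done programmatically with the standard library's `urllib.robotparser`. The sketch below parses the rules as quoted above from an inline string instead of downloading them, so it reflects the file as we saw it; the live robots.txt may have changed since.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the answer above (snapshot, not fetched live).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /feed/
Disallow: /goal/
Disallow: /sobre-coches-y-concesionarios/category/
Disallow: /sobre-coches-y-concesionarios/articulos/
Disallow: /sobre-coches-y-concesionarios/author/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

search_url = "https://www.yamovil.es/coches-segunda-mano/seat-ocasion-en-madrid"
blocked_url = "https://www.yamovil.es/admin/"

# can_fetch applies the rules of the matching User-agent group to the URL path
search_allowed = parser.can_fetch("*", search_url)
admin_allowed = parser.can_fetch("*", blocked_url)
print(search_allowed, admin_allowed)
```

To test against the live file, `parser.set_url("https://www.yamovil.es/robots.txt")` followed by `parser.read()` would replace the `parse` call.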

In [7]:
yamovil_url = "https://www.yamovil.es/coches-segunda-mano/seat-ocasion-en-madrid"
page_2 = requests.get(yamovil_url)

soup_2 = BeautifulSoup(page_2.content, "html.parser")
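
For price monitoring, it may be more convenient to collect each listing into a record and store the price as a number instead of printing raw text. The following sketch uses the same class names as the scraping cell in this milestone, applied to a hypothetical two-car fragment. The helper names (`parse_price`, `extract_cars`) and the `price_eur` field are ours, and the price conversion assumes whole-euro prices with a dot as thousands separator, as in `15.750€`.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the listing page (assumption).
SAMPLE_HTML = """
<div class="vehicle-list__item">
  <span class="make">Seat</span>
  <span class="model">Arona</span>
  <span class="version">1.0 TSI Style Plus 81 kW (110 CV)</span>
  <span class="price">15.750€</span>
</div>
<div class="vehicle-list__item">
  <span class="make">Seat</span>
  <span class="model">Mii</span>
  <span class="version">1.0 Cosmopolitan 55 kW (75 CV)</span>
  <span class="price">7.750€</span>
</div>
"""


def parse_price(text):
    """Convert a whole-euro price such as '15.750€' to an int, or None."""
    digits = re.sub(r"\D", "", text)  # drop separators and the currency sign
    return int(digits) if digits else None


def extract_cars(html):
    """Return one dict per listing with make, model, version and price."""
    soup = BeautifulSoup(html, "html.parser")
    cars = []
    for item in soup.find_all(class_="vehicle-list__item"):
        record = {}
        for field in ("make", "model", "version", "price"):
            tag = item.find(class_=field)
            # Missing fields become None instead of raising AttributeError
            record[field] = tag.get_text(strip=True) if tag is not None else None
        record["price_eur"] = parse_price(record["price"]) if record["price"] else None
        cars.append(record)
    return cars


cars = extract_cars(SAMPLE_HTML)
for car in cars:
    print(car["make"], car["model"], car["version"], car["price_eur"])
print(len(cars))  # number of listings, printed once after the loop
```

Storing the price as an integer makes it straightforward to compare values between scraping runs.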

In [8]:
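# Each listing on the page is an element with class "vehicle-list__item"
cars = soup_2.find_all(class_='vehicle-list__item')
for car in cars:
  print(car.find(class_='make').text)
  print(car.find(class_='model').text)
  print(car.find(class_='version').text)
  print(car.find(class_='price').text)

  # Number of listings found (the hint expects 30); printed after each car
  print(len(cars))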

Seat
Arona
1.0 TSI Style Plus 81 kW (110 CV)
15.750€
30
Seat
Ateca
1.4 EcoTSI SANDS Xcellence 4Drive DSG 110 kW (150 CV)
19.650€
30
Seat
Arona
1.0 TSI Ecomotive SANDS Style Edition 70 kW (95 CV)
14.950€
30
Seat
Arona
1.0 TSI FR Go2 81 kW (110 CV)
15.980€
30
Seat
Mii
1.0 Cosmopolitan 55 kW (75 CV)
7.750€
30
Seat
Arona
1.0 TSI Ecomotive Style Edition 85 kW (115 CV)
13.950€
30
Seat
Leon ST
1.0 TSI SANDS Style 85 kW (115 CV)
13.450€
30
Seat
Ateca
1.5 TSI SANDS Style XL DSG 110 kW (150 CV)
23.490€
30
Seat
Ibiza
1.6 TDI Reference 66 kW (90 CV)
8.950€
30
Seat
Ibiza
1.4 TDI Reference 51 kW (70 CV)
6.750€
30
Seat
Tarraco
1.5 TSI SANDS Style Edition 110 kW (150 CV)
32.450€
30
Seat
Ateca
1.5 TSI SANDS Style Go M 110 kW (150 CV)
20.480€
30
Seat
Ateca
1.5 TSI SANDS FR XL DSG 110 kW (150 CV)
27.480€
30
Seat
Arona
1.0 TSI FR XM DSG 81 kW (110 CV)
19.350€
30
Seat
Leon ST
2.0 TDI SANDS FR Edition DSG-7  110 kW (150 CV)
19.450€
30
Seat
León
1.5 TSI SANDS FR Edition 96 kW (130 CV)
16.750€
30
Seat
Leon ST