# **WEB ANALYTICS – Data Science and Engineering Degree**  
**(1st Semester, 4th-year-level Course)**  

## **Introduction to Web Scraping**  

This lab was part of my **Web Analytics** course at **Universidad Carlos III de Madrid (UC3M)**, where I studied abroad from **September 2024 to December 2024** as part of my **Computer Science degree**. This specific lab introduced **fundamental web scraping techniques**, covering **HTML structure analysis, HTTP requests, and data extraction using Python**.  

Working in a **group of three students**, we developed initial web scrapers using **Python, Requests, and BeautifulSoup** to extract structured information from web pages while adhering to ethical scraping principles.  

---

## **Web Scraping Fundamentals and Data Extraction**  
We implemented a series of milestones that covered **real-world web scraping use cases**, including:

- **Understanding HTML structure and using CSS selectors for data extraction.**  
- **Sending HTTP requests and handling server responses.**  
- **Extracting structured data from static websites.**  
- **Applying best practices for ethical web scraping and handling robots.txt restrictions.**  

---

## **Milestones**  

### **Milestone 1: Understanding HTML and Extracting Text Data**  
- Loaded and inspected HTML pages to understand structure and element selection.  
- Extracted **specific text content** from an example website using BeautifulSoup.  

### **Milestone 2: Sending HTTP Requests and Parsing Responses**  
- Used the `requests` library to fetch web pages.  
- Handled **status codes, headers, and encoding issues** while retrieving data.  

### **Milestone 3: Extracting Structured Data from Tables and Lists**  
- Parsed HTML tables and lists to extract **formatted datasets**.  
- Stored extracted data in **CSV and JSON formats** for further analysis.  

### **Milestone 4: Implementing Ethical Web Scraping Practices**  
- Analyzed `robots.txt` files to understand scraping permissions.  
- Implemented **request delays** and **user-agent headers** to avoid detection and ensure responsible web scraping.  

---

## **Outcome**  
Through this lab, we gained experience in **web scraping, data extraction, and HTML parsing**. We developed Python-based scripts using **Requests and BeautifulSoup**, understood how to interact with **web page structures**, and applied ethical web scraping techniques. This lab prepared us for **more advanced data collection projects, including API interactions and dynamic content scraping**.  

---

## **Technologies Used**  
- **Python**  
- **Requests**  
- **BeautifulSoup**  
- **CSV and JSON for data storage**  
- **HTTP headers and status code handling**  


# Introduction to Web Scraping

In this notebook, we will learn how to use Python to retrieve information from websites.


---

**IMPORTANT** You do not have to submit your answers to this notebook.

---



Before we start, let's review the HTML structure, as we will navigate through it in our scrapers.

### HTML overview

**Hypertext Markup Language (HTML)** is the main language used to write/build web pages. HTML describes the structure of a web page and it can be used with **Cascading Style Sheets (CSS)** (to describe the presentation of web pages, including colors, layout, and fonts) and a scripting language such as **JavaScript** (to create interactive websites).

**HTML Tags:**

HTML is a markup language that uses tags to structure content. Tags are enclosed in angle brackets, as in `<tagname>`. Except for a few void elements such as `<br>` and `<img>`, HTML tags come in pairs like `<p>` and `</p>`. The first tag in a pair is the opening tag, and the second is the closing tag, which is written like the opening tag but with a slash before the tag name.

Some of the tags are:
* `<!DOCTYPE html>` declares the document type; this form identifies the document as HTML5, the current version of HTML.
* `<html>` element is the root element of an HTML page.
* `<head>` element contains meta information about the document.  
* `<title>` element specifies a title for the document.  
* `<body>` element contains the visible page content.  
* `<div>` tag defines a division or a section in an HTML document. It's usually a container for other elements.
* `<h1>` element defines a large heading.  
* `<p>` element defines a paragraph.  
* `<a>` element defines a hyperlink.

Here you have an example of a simple HTML structure:

```
<!DOCTYPE html>
<html>

   <head>
      <title>This is document title</title>
   </head>

   <body>
      <h1>This is a heading</h1>
      <p>This is a paragraph</p>
      <a href="https://www.uc3m.es/">This is a link to uc3m website</a>
   </body>

</html>
```

As you can see, an HTML document is structured like a tree. Browsers expose that tree through the **Document Object Model (DOM)**, a programming API that defines the logical structure of documents and the way a document is accessed and manipulated.

For a complete overview of HTML, check out the W3Schools tutorial: [Documentation](https://www.w3schools.com/html/)
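To make the tree structure concrete, here is a minimal sketch that feeds the sample document above to Python's standard-library `html.parser` and records one indented line per element. The class name `TreePrinter` is made up for this illustration. Note that this parser does not build a DOM; it only reports opening and closing tags, which is enough to reveal the nesting.

```python
from html.parser import HTMLParser

# The sample document from the example above.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
   <head>
      <title>This is document title</title>
   </head>
   <body>
      <h1>This is a heading</h1>
      <p>This is a paragraph</p>
      <a href="https://www.uc3m.es/">This is a link to uc3m website</a>
   </body>
</html>"""


class TreePrinter(HTMLParser):
    """Hypothetical helper: collect one indented line per element."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # current nesting level
        self.lines = []  # indented tag names, in document order

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


printer = TreePrinter()
printer.feed(SAMPLE_HTML)
print("\n".join(printer.lines))
# html
#   head
#     title
#   body
#     h1
#     p
#     a
```

The indentation mirrors the parent/child relationships that scrapers navigate: `title` is a child of `head`, while `h1`, `p`, and `a` are siblings inside `body`.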



### Short exercise
To get some experience with the HTML page structure, we will search for and locate elements in [IMDb](https://www.imdb.com/).

**Tip**: Remember to use the web browser's developer tools. Right-click the page and select *Inspect* to access the elements panel.


* Find the _Sign in_ button. Write here the corresponding HTML code.


In [None]:
<a class="ipc-btn ipc-btn--single-padding ipc-btn--center-align-content ipc-btn--default-height ipc-btn--core-baseAlt ipc-btn--theme-baseAlt ipc-btn--on-textPrimary ipc-text-button imdb-header__signin-text" tabindex="0" aria-disabled="false" href="/registration/signin/?ref=nv_generic_lgin&amp;u=%2F"><span class="ipc-btn__text">Sign In</span></a>

* Find the document title. Write here the corresponding HTML code.

In [None]:
<meta name="description" content="IMDb is the world's most popular and authoritative source for movie, TV and celebrity content. Find ratings and reviews for the newest movie and TV shows. Get personalized recommendations, and learn where to watch across hundreds of streaming providers." data-id="main">

* Find the IMDB logo. Write here the corresponding HTML code.

In [None]:
<meta property="og:image" content="https://m.media-amazon.com/images/G/01/imdb/images/social/imdb_logo.png">


* What is the _heading_ size of the titles in the main section of the page (e.g., the "Featured today" header)?

In [None]:
h3

# Scrape HTML Content From a Website

Now we will learn how to extract data from a website, an action called *web scraping*. Web scraping has become an effective way of extracting information from the web for decision making and analysis, and it is an essential part of the data science toolkit. Data scientists should know how to gather data from web pages and store it for further analysis.

Any web page you see on the internet can be crawled for information, and anything visible on a web page can be extracted. Every web page has its own structure and elements, so you need to write your web crawlers/spiders according to the page being scraped.


We will use *requests* to obtain the source code of a web page.

Documentation: [Requests](https://docs.python-requests.org/en/master/), an elegant and simple HTTP library for Python, built for human beings.

Here is an example to extract the source code of a website:

In [1]:
import requests

In [2]:
web_url = "https://www.imdb.com/"
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36'
headers = {'User-Agent': user_agent}
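# Use requests to retrieve data from a given website.
# Note: `headers` is defined above but not passed to this call, so the request is sent
# with the default python-requests User-Agent, which likely explains the 403 below.
# Passing `headers=headers`, as in the Wikipedia task later on, sends the browser-like one.
page = requests.get(web_url)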

# Print the source code
print(page.text)

<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
</body>
</html>
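The 403 response above shows why it helps to check the status code before using `page.text`. The following is a minimal sketch rather than part of the lab: the helper name `fetch_html` and its `get` parameter are assumptions made for this illustration, while `requests.get`, `raise_for_status()`, and `requests.HTTPError` are part of the Requests API.

```python
import requests

# Browser-like User-Agent copied from the cell above.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
    )
}


def fetch_html(url, headers=None, get=requests.get):
    """Return the page source, raising requests.HTTPError on a 4xx/5xx status.

    `get` is a hypothetical hook that defaults to requests.get; it only exists
    so the function can be tried with a stand-in when no network is available.
    """
    response = get(url, headers=headers, timeout=10)
    # Turn responses such as 403 Forbidden into an exception instead of
    # silently returning the error page as if it were the real content.
    response.raise_for_status()
    return response.text
```

A call such as `fetch_html("https://www.imdb.com/", headers=HEADERS)` would typically be wrapped in `try` / `except requests.HTTPError` so that a blocked request is reported rather than parsed. Whether a given site accepts the request still depends on the site.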



# Parse HTML using Beautiful Soup

We just saw how to download the source code of a website into a Python variable, but we need an extra library to parse the HTML. For that purpose, we will use `BeautifulSoup`, which will help us extract data from the HTML response.

Documentation: [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

### Review CSS Selectors

1. To get a tag, such as `<a></a>` or `<body></body>`, use the bare tag name. E.g. `select_one('a')` gets an anchor/link element and `select_one('body')` gets the body element.

2. `.temp` gets an element with a class of temp. E.g. to get `<a class="temp"></a>` use `select_one('.temp')`.

3. `#temp` gets an element with an id of temp. E.g. to get `<a id="temp"></a>` use `select_one('#temp')`.

4. `.temp.example` gets an element with both classes temp and example. E.g. to get `<a class="temp example"></a>` use `select_one('.temp.example')`.

5. `.temp a` gets an anchor element nested inside a parent element with class temp. E.g. to get `<div class="temp"><a></a></div>` use `select_one('.temp a')`. Note the space between `.temp` and `a`.

6. `.temp .example` gets an element with class example nested inside a parent element with class temp. E.g. to get `<div class="temp"><a class="example"></a></div>` use `select_one('.temp .example')`. Again, note the space between `.temp` and `.example`. The space is the descendant combinator: the element matched after the space may be nested at any depth inside the element matched before it. Use `>` instead to match direct children only.

7. Ids, such as `<a id="one"></a>`, are meant to be unique within a page, so you can usually use the id selector by itself to get the right element. There is no need for nested selectors when using ids.

Source text: [here](https://www.learndatasci.com/tutorials/ultimate-guide-web-scraping-w-python-requests-and-beautifulsoup/)

Complete list of CSS Selectors: [CSS Selector Reference](https://www.w3schools.com/cssref/css_selectors.asp)
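The selector rules above can be tried on a small inline document. This is a sketch under stated assumptions: the HTML snippet and the helper `text_of` are invented for illustration, while `select_one` is BeautifulSoup's CSS selector method.

```python
from bs4 import BeautifulSoup

# Made-up snippet: two top-level links and one link nested inside a div.
html = """
<body>
  <a id="one" class="temp">first</a>
  <a class="temp example">second</a>
  <div class="temp"><p><a class="example">third</a></p></div>
</body>
"""
soup = BeautifulSoup(html, "html.parser")


def text_of(selector):
    """Hypothetical helper: text of the first match, or None if nothing matches."""
    element = soup.select_one(selector)
    return element.get_text() if element is not None else None


print(text_of("a"))               # first  (first <a> in document order)
print(text_of(".temp"))           # first  (first element with class temp)
print(text_of("#one"))            # first  (element with id one)
print(text_of(".temp.example"))   # second (both classes on the same element)
print(text_of(".temp a"))         # third  (an <a> somewhere inside a .temp element)
print(text_of(".temp .example"))  # third  (.example somewhere inside a .temp element)
print(text_of(".temp > a"))       # None   (the <a> is inside <p>, not a direct child)
```

The last two lines show the difference between the space (any depth) and `>` (direct children only).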

### Tasks



*   Get HTML object from [Wikipedia, Universidad Carlos III de Madrid](https://en.wikipedia.org/wiki/Charles_III_University_of_Madrid) using BeautifulSoup.

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
web_url = "https://en.wikipedia.org/wiki/Charles_III_University_of_Madrid"
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36'
headers = {'User-Agent': user_agent}
page = requests.get(web_url, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")

print(soup.h1)


<h1 class="firstHeading mw-first-heading" id="firstHeading"><span class="mw-page-title-main">Charles III University of Madrid</span></h1>




* Print the result of `soup.h1` (the `h1` attribute of the `BeautifulSoup` object).

In [None]:
soup.h1

<h1 class="firstHeading mw-first-heading" id="firstHeading"><span class="mw-page-title-main">Charles III University of Madrid</span></h1>

* Find the element printed above, using the `find()` function.



In [None]:
soup.find('h1')


<h1 class="firstHeading mw-first-heading" id="firstHeading"><span class="mw-page-title-main">Charles III University of Madrid</span></h1>

* Print the text inside the `h1` tag.

In [None]:
soup.h1.text

'Charles III University of Madrid'

* Find all `a` tags that contain the class named `external`

In [None]:
external_links = soup.find_all('a', class_='external')
for link in external_links:
  print(link)


<a class="external text" href="http://uc3m.es" rel="nofollow">uc3m<wbr/>.es</a>
<a class="external text" href="https://miriadax.net/" rel="nofollow">Miríadax</a>
<a class="external text" href="http://www.goteo.org/about" rel="nofollow">Gotero.org</a>
<a class="external text" href="http://universidad-en-cifras.uc3m.es/Capit_08_en.html" rel="nofollow">University Figures - 2013</a>
<a class="external text" href="https://web.archive.org/web/20140714230527/http://universidad-en-cifras.uc3m.es/Capit_08_en.html" rel="nofollow">Archived</a>
<a class="external text" href="http://universidad-en-cifras.uc3m.es/Capit_07_en.html" rel="nofollow">University Figures - 2013</a>
<a class="external text" href="https://web.archive.org/web/20140714130948/http://universidad-en-cifras.uc3m.es/Capit_07_en.html" rel="nofollow">Archived</a>
<a class="external text" href="http://universidad-en-cifras.uc3m.es/Capit_03_en.html" rel="nofollow">University Figures - 2013</a>
<a class="external text" href="https://web

* Find all `span` tags that contain the class named `reference-text`

In [None]:
# Count the <span> elements that hold the text of each reference
reference_spans = soup.find_all('span', class_='reference-text')
len(reference_spans)

24
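The `find_all` calls above return `Tag` objects, and in practice you usually want specific fields from them. The sketch below pulls the link text and the `href` attribute out of each matching tag. The inline HTML is a simplified stand-in: the first two links are taken from the output above, and the third, internal link is added only to show that it is filtered out.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the Wikipedia page used in the tasks above.
html = """
<ul>
  <li><a class="external text" href="http://uc3m.es" rel="nofollow">uc3m.es</a></li>
  <li><a class="external text" href="https://miriadax.net/" rel="nofollow">Miríadax</a></li>
  <li><a href="/wiki/Madrid">Madrid</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# class_="external" matches any tag whose class list contains "external",
# so tags with class="external text" are included.
rows = [
    {"text": link.get_text(strip=True), "href": link["href"]}
    for link in soup.find_all("a", class_="external")
]

for row in rows:
    print(row["text"], "->", row["href"])
# uc3m.es -> http://uc3m.es
# Miríadax -> https://miriadax.net/
```

A list of dictionaries like `rows` is a convenient intermediate form because it can be written out with `csv.DictWriter` or `json.dump` for later analysis.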


* Check https://www.amazon.es/robots.txt to see whether the site allows crawling. Then go to Amazon, select an item (e.g., an Alexa device), copy its URL, and obtain the HTML. What happens?

In [None]:
url = 'https://amzn.eu/d/6YsPmsD'

In [None]:
page = requests.get(url)
# Save the response body so it can be opened in a browser and inspected
with open('output.html', 'w', encoding='utf-8') as file:
  file.write(page.text)
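Checking `robots.txt` by eye works, but the rules can also be evaluated in code. Here is a minimal sketch using the standard library's `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT`, the `example.com` URLs, and the bot name `my-lab-bot` are hypothetical and are not the contents of any real site's `robots.txt`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() takes the file's lines; no network access is needed here.
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("my-lab-bot", "https://www.example.com/index.html")
blocked = parser.can_fetch("my-lab-bot", "https://www.example.com/private/page.html")
print(allowed, blocked)  # True False
```

To evaluate a live file instead, call `parser.set_url(...)` with the `robots.txt` address followed by `parser.read()`, then use `can_fetch` in the same way.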

# Parse HTML using Scrapy

So far, you know how to extract information from a website using BeautifulSoup, a parsing library.  

Now, we will learn how to crawl with **Scrapy**.

[Scrapy](https://scrapy.org/) is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

* First, let's install it:

In [None]:
!pip install scrapy

Collecting scrapy
  Downloading Scrapy-2.11.2-py2.py3-none-any.whl.metadata (5.3 kB)
Collecting Twisted>=18.9.0 (from scrapy)
  Downloading twisted-24.7.0-py3-none-any.whl.metadata (18 kB)
Collecting cssselect>=0.9.1 (from scrapy)
  Downloading cssselect-1.2.0-py2.py3-none-any.whl.metadata (2.2 kB)
Collecting itemloaders>=1.0.1 (from scrapy)
  Downloading itemloaders-1.3.1-py3-none-any.whl.metadata (3.9 kB)
Collecting parsel>=1.5.0 (from scrapy)
  Downloading parsel-1.9.1-py2.py3-none-any.whl.metadata (11 kB)
Collecting queuelib>=1.4.2 (from scrapy)
  Downloading queuelib-1.7.0-py2.py3-none-any.whl.metadata (5.7 kB)
Collecting service-identity>=18.1.0 (from scrapy)
  Downloading service_identity-24.1.0-py3-none-any.whl.metadata (4.8 kB)
Collecting w3lib>=1.17.0 (from scrapy)
  Downloading w3lib-2.2.1-py3-none-any.whl.metadata (2.1 kB)
Collecting zope.interface>=5.1.0 (from scrapy)
  Downloading zope.interface-7.0.3-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_6

* Create a new folder for the lab and move there:

In [None]:
%mkdir Lab_scrapy
%cd "Lab_scrapy"

/content/Lab_scrapy


* Create a new scrapy project:

In [None]:
import scrapy

In [None]:
# Create first scrapy project
!scrapy startproject my_scrapy_crawler

New Scrapy project 'my_scrapy_crawler', using template directory '/usr/local/lib/python3.10/dist-packages/scrapy/templates/project', created in:
    /content/Lab_scrapy/my_scrapy_crawler

You can start your first spider with:
    cd my_scrapy_crawler
    scrapy genspider example example.com


* Move to the project folder:

In [None]:
%cd "my_scrapy_crawler"

/content/Lab_scrapy/my_scrapy_crawler


## Tasks

* Automatically create a spider called _wikipedia_spider_ that crawls the URL [Wikipedia, Universidad Carlos III de Madrid](https://en.wikipedia.org/wiki/Charles_III_University_of_Madrid)

In [None]:
!scrapy genspider wikipedia_spider https://en.wikipedia.org/wiki/Charles_III_University_of_Madrid

Spider 'wikipedia_spider' already exists in module:
  my_scrapy_crawler.spiders.wikipedia_spider


In [None]:
!scrapy crawl wikipedia_spider

2024-09-18 10:28:37 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: my_scrapy_crawler)
2024-09-18 10:28:37 [scrapy.utils.log] INFO: Versions: lxml 4.9.4.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.7.0, Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0], pyOpenSSL 24.2.1 (OpenSSL 3.3.2 3 Sep 2024), cryptography 43.0.1, Platform Linux-6.1.85+-x86_64-with-glibc2.35
2024-09-18 10:28:37 [scrapy.addons] INFO: Enabled addons:
[]
2024-09-18 10:28:37 [asyncio] DEBUG: Using selector: EpollSelector
2024-09-18 10:28:37 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-09-18 10:28:37 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-09-18 10:28:37 [scrapy.extensions.telnet] INFO: Telnet Password: ca1c84124f73a61c
2024-09-18 10:28:37 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsol

Now, you should have the following contents:

* `Lab_scrapy/`
  * `my_scrapy_crawler/`
    * `scrapy.cfg`
    * `my_scrapy_crawler/`
      * `__init__.py`
      * `items.py`
      * `middlewares.py`
      * `pipelines.py`
      * `settings.py`
      * `spiders/`
        * `__init__.py`
        * `wikipedia_spider.py`



* Modify the crawler you just created to extract the information inside the tag with `id=firstHeading`.

**Tip:** You have to modify the `parse` function inside the file `wikipedia_spider.py`.

In [None]:
def parse(self, response):
    # Select the element whose id is firstHeading and print its HTML
    print(response.css('#firstHeading').extract())

In [None]:
!scrapy crawl wikipedia_spider

2024-09-18 10:31:25 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: my_scrapy_crawler)
2024-09-18 10:31:25 [scrapy.utils.log] INFO: Versions: lxml 4.9.4.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.7.0, Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0], pyOpenSSL 24.2.1 (OpenSSL 3.3.2 3 Sep 2024), cryptography 43.0.1, Platform Linux-6.1.85+-x86_64-with-glibc2.35
2024-09-18 10:31:25 [scrapy.addons] INFO: Enabled addons:
[]
2024-09-18 10:31:25 [asyncio] DEBUG: Using selector: EpollSelector
2024-09-18 10:31:25 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-09-18 10:31:25 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-09-18 10:31:25 [scrapy.extensions.telnet] INFO: Telnet Password: 992565a63688224e
2024-09-18 10:31:25 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsol

**EXTRA**: One of the advantages of Scrapy is that you can run the same crawler on multiple URLs at once. Try adding more Wikipedia URLs to the `start_urls` list and check how easy it is!

* Modify the crawler you just created to extract all the `a` tags with class name `external`

In [None]:
!scrapy crawl wikipedia_spider

2024-09-18 10:38:37 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: my_scrapy_crawler)
2024-09-18 10:38:37 [scrapy.utils.log] INFO: Versions: lxml 4.9.4.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.7.0, Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0], pyOpenSSL 24.2.1 (OpenSSL 3.3.2 3 Sep 2024), cryptography 43.0.1, Platform Linux-6.1.85+-x86_64-with-glibc2.35
2024-09-18 10:38:37 [scrapy.addons] INFO: Enabled addons:
[]
2024-09-18 10:38:37 [asyncio] DEBUG: Using selector: EpollSelector
2024-09-18 10:38:37 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-09-18 10:38:37 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-09-18 10:38:37 [scrapy.extensions.telnet] INFO: Telnet Password: 9347722569da6370
2024-09-18 10:38:37 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsol

* Access Facebook using Scrapy (this means repeating the previous process and creating a new spider for Facebook). Can we use Scrapy to extract information from this website? Explain.

In [None]:
!scrapy genspider facebook_spider https://www.facebook.com/

Spider 'facebook_spider' already exists in module:
  my_scrapy_crawler.spiders.facebook_spider


In [None]:
!scrapy crawl facebook_spider

2024-09-18 10:40:32 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: my_scrapy_crawler)
2024-09-18 10:40:32 [scrapy.utils.log] INFO: Versions: lxml 4.9.4.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.7.0, Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0], pyOpenSSL 24.2.1 (OpenSSL 3.3.2 3 Sep 2024), cryptography 43.0.1, Platform Linux-6.1.85+-x86_64-with-glibc2.35
2024-09-18 10:40:32 [scrapy.addons] INFO: Enabled addons:
[]
2024-09-18 10:40:32 [asyncio] DEBUG: Using selector: EpollSelector
2024-09-18 10:40:32 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-09-18 10:40:32 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-09-18 10:40:32 [scrapy.extensions.telnet] INFO: Telnet Password: 6f36b6c47f15ff80
2024-09-18 10:40:32 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsol

# Parse HTML using Selenium

Last but not least, we will learn [Selenium](https://www.selenium.dev/), a powerful browser automation tool. It was originally created for website testing, but nowadays it is also used for web scraping. Selenium is useful for scraping dynamic websites, that is, pages that rely on cookies, JavaScript, or other dynamic content, or that require human-like interaction.

Selenium requires integration with a third-party browser in order to work. Let's see how to install it and how to build our first scraper with Selenium.

* Install Chrome and Selenium

In [None]:
!apt update
!apt install chromium-chromedriver
!pip install selenium

[33m0% [Working][0m            Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
[33m0% [Connecting to archive.ubuntu.com (185.125.190.81)] [Connecting to security.ubuntu.com (185.125.1[0m[33m0% [Waiting for headers] [Connecting to security.ubuntu.com (185.125.190.82)] [Waiting for headers] [0m                                                                                                    Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
[33m0% [Waiting for headers] [Connecting to security.ubuntu.com (185.125.190.82)] [Waiting for headers] [0m                                                                                                    Ign:3 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
[33m0% [Waiting for headers] [Connecting to security.ubuntu.com (185.125.190.82)] [Connected to ppa.laun[0m                                                                                   

* Import selenium webdriver and create our headless browser (a headless browser is a web browser without a graphical user interface).

In [None]:
# Import the WebDriver API and configure Chrome to run headless
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# Start the headless Chrome session
driver = webdriver.Chrome(options=options)

In [None]:
url = "https://en.wikipedia.org/wiki/Charles_III_University_of_Madrid"

In [None]:
driver.get(url)

In [None]:
driver.current_url

'https://en.wikipedia.org/wiki/Charles_III_University_of_Madrid'

## Tasks

* Find element by `id=firstHeading` and print the text

In [None]:
driver.find_element(By.ID, 'firstHeading').text

'Charles III University of Madrid'

* Find all elements by class name `external` and print the result

In [None]:
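# Collect the outer HTML of every element whose class list includes "external"
[element.get_attribute('outerHTML') for element in driver.find_elements(By.CLASS_NAME, 'external')]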

