In [1]:
# Import basic libraries
import numpy as np
import pandas as pd

# Import scraping libraries
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

## Exercise 1
Perform web scraping of two of the three proposed web pages using BeautifulSoup first and Selenium afterwards. 

- http://quotes.toscrape.com

- https://www.bolsamadrid.es

- www.wikipedia.es (do a search first and scrape some content)

### The websites I have chosen to scrape are Quotes to Scrape and Wikipedia

- Quotes to Scrape is a site that collects quotes from famous people.
- Each quote has several tags related to its text.
- An "(about)" link next to each quote leads to more information about its author.
- Links at the bottom of each page lead to the next or previous page.
- Each page has 10 quotes.

**Initial Observations**

Whether we inspect the page with the browser's developer tools or print the HTML here in the notebook, we can already discern some structures that will help us write functions to extract only the information we're interested in. Personally, I find this initial phase (step 0) easier to do on the web page itself.

- `<html lang="en">`: The `lang` attribute indicates the language in which the page is written.
- `<head>`: This section holds the page metadata, including the page title inside the `<title>` tag.
- `<body>`: This section contains the main, visible part of the page.
- `<div class>`: Blocks of content are separated by the `class` attribute and its value (e.g., `class="xx"`).
- `<span class="text" itemprop="text">`: This element contains the text of each quote.
- `<small class="author" itemprop="author">`: Inside the same `div`, this element holds the name of the author.
- `<a href...>`: This is the "(about)" link that takes us to a new page with more information about the author.

In conclusion, each quote is enclosed within a `div` whose class is "quote". Inside it, a `span class="text"` holds the main text of the quote. It is followed by another `span` (without a class) that adds the word "by" and introduces the author in a `small class="author"` element. The "(about)" link is an `a` tag whose `href` points to the author's page (it is the path that appears at the end of the browser's address bar once you click the link and the page opens). Finally, each quote block ends with a section for the tags, followed by the closing `</div>`.
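
The structure described above can be tried out on a small hand-written fragment before touching the live site. This is only a minimal sketch: `SAMPLE_HTML` and the helper `parse_quote` are illustrative names, and the fragment mimics the markup described here rather than copying the real page.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that mimics the structure described above.
# It is an assumption for illustration, not the site's exact markup.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">“A sample quote.”</span>
  <span>by <small class="author">Sample Author</small>
    <a href="/author/Sample-Author">(about)</a>
  </span>
  <div class="tags">
    Tags:
    <a class="tag" href="/tag/one/page/1/">one</a>
    <a class="tag" href="/tag/two/page/1/">two</a>
  </div>
</div>
"""


def parse_quote(quote_div):
    """Return the text, author, about link and tags of one quote block."""
    # The about link is taken to be the first <a> whose href starts with /author/.
    about_link = quote_div.find("a", href=lambda h: h and h.startswith("/author/"))
    return {
        "text": quote_div.find("span", class_="text").get_text(strip=True),
        "author": quote_div.find("small", class_="author").get_text(strip=True),
        "about": about_link["href"] if about_link else None,
        "tags": [a.get_text(strip=True) for a in quote_div.find_all("a", class_="tag")],
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = [parse_quote(div) for div in soup.find_all("div", class_="quote")]
print(records[0])
```

Keeping each quote as a dictionary makes it easy to load the results into a pandas DataFrame later with `pd.DataFrame(records)`.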

### BeautifulSoup

In [2]:
# URL of the web page
url = "http://quotes.toscrape.com"

# Send an HTTP GET request to the URL
response = requests.get(url)


In [3]:
# Parse the HTML content of the page using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

In [4]:
# Find the elements containing quotes and authors
quote_divs = soup.find_all('div', class_='quote')

In [5]:
# Extract and print the quotes and authors
for quote_div in quote_divs:
    quote_text = quote_div.find('span', class_='text').text
    author = quote_div.find('small', class_='author').text
    print(f"Quote: {quote_text}\nAuthor: {author}\n")

Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Author: Albert Einstein

Quote: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Author: J.K. Rowling

Quote: “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Author: Albert Einstein

Quote: “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Author: Jane Austen

Quote: “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Author: Marilyn Monroe

Quote: “Try not to become a man of success. Rather become a man of value.”
Author: Albert Einstein

Quote: “It is better to be hated for what you are than to be loved for what you are not.”
Author: André Gide

Quote: “I have not failed. I've just found 10,000 ways that won't work.”
Author: Thomas

In [6]:
# Iterate over the quote_divs and extract the data
for quote_div in quote_divs:
    quote_text = quote_div.find('span', class_='text').text
    author = quote_div.find('small', class_='author').text

    # Extract tags associated with the quote
    tags = [tag.text for tag in quote_div.find_all('a', class_='tag')]

    print(f"Quote: {quote_text}\nAuthor: {author}\nTags: {', '.join(tags)}\n")


Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Author: Albert Einstein
Tags: change, deep-thoughts, thinking, world

Quote: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Author: J.K. Rowling
Tags: abilities, choices

Quote: “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Author: Albert Einstein
Tags: inspirational, life, live, miracle, miracles

Quote: “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Author: Jane Austen
Tags: aliteracy, books, classic, humor

Quote: “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Author: Marilyn Monroe
Tags: be-yourself, inspirational

Quote: “Try not to become a man of success. Rather become a man of value.”
Author: Albert Einstein
Tags:

#### I would like to know how many quotes there are on the first ten pages.

In [7]:
# Function to count quotes on a page
def count_quotes_on_page(url):
    # Send an HTTP GET request to the URL
    response = requests.get(url)

    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the elements containing quotes
    quote_divs = soup.find_all('div', class_='quote')

    # Return the count of quotes on this page
    return len(quote_divs)

# Base URL of the website
base_url = "http://quotes.toscrape.com"
page_number = 1

while True:
    # Construct the URL for the current page
    page_url = f"{base_url}/page/{page_number}/"

    # Count quotes on the current page
    quote_count = count_quotes_on_page(page_url)

    # Check if there are quotes on the page
    if quote_count > 0:
        print(f"Page {page_number}: {quote_count} quotes")
        page_number += 1
    else:
        break


Page 1: 10 quotes
Page 2: 10 quotes
Page 3: 10 quotes
Page 4: 10 quotes
Page 5: 10 quotes
Page 6: 10 quotes
Page 7: 10 quotes
Page 8: 10 quotes
Page 9: 10 quotes
Page 10: 10 quotes
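
An alternative to probing page numbers until an empty page appears is to follow the "next" link mentioned in the site description. The sketch below assumes the pager wraps that link in `<li class="next">`; this should be confirmed in Developer Tools. The helper name `next_page_url` and the two pager fragments are hypothetical and only serve to exercise the function offline.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(html, current_url):
    """Return the absolute URL of the "next" link, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    next_item = soup.find("li", class_="next")  # assumed pager markup
    if next_item is None or next_item.a is None:
        return None
    # The href is relative, so resolve it against the current page URL.
    return urljoin(current_url, next_item.a["href"])


# Hand-written pager fragments used only to exercise the helper.
middle_page = '<ul class="pager"><li class="next"><a href="/page/3/">Next</a></li></ul>'
last_page = '<ul class="pager"><li class="previous"><a href="/page/9/">Previous</a></li></ul>'

print(next_page_url(middle_page, "http://quotes.toscrape.com/page/2/"))
print(next_page_url(last_page, "http://quotes.toscrape.com/page/10/"))
```

In a real crawl, the HTML passed to the helper would be `response.text` from each request, and the loop would stop as soon as the helper returns `None`.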


### SELENIUM

In [8]:
# Call the browser you will use
driver = webdriver.Chrome()

# Get the URL
driver.get("http://quotes.toscrape.com/")

# Find elements
quotes = driver.find_elements(By.CLASS_NAME, "text")
authors = driver.find_elements(By.CLASS_NAME, "author")
tags = driver.find_elements(By.CLASS_NAME, "tags")

# Print the title of the page
print("\033[1mTitle:\033[0m", driver.title)


Title: Quotes to Scrape


In [9]:
# Show the text extracted from the elements with class name "text"
for i, quote in enumerate(quotes, 1):
    quote = quote.text.strip()
    print(f"\n{i}- {quote}")


1- “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”

2- “It is our choices, Harry, that show what we truly are, far more than our abilities.”

3- “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”

4- “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”

5- “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”

6- “Try not to become a man of success. Rather become a man of value.”

7- “It is better to be hated for what you are than to be loved for what you are not.”

8- “I have not failed. I've just found 10,000 ways that won't work.”

9- “A woman is like a tea bag; you never know how strong it is until it's in hot water.”

10- “A day without sunshine is like, you know, night.”


In [10]:
# Show the text extracted from the elements with class name "author"
for i, author in enumerate(authors, 1):
    author = author.text.strip()
    print(f"\n{i}- {author}")


1- Albert Einstein

2- J.K. Rowling

3- Albert Einstein

4- Jane Austen

5- Marilyn Monroe

6- Albert Einstein

7- André Gide

8- Thomas A. Edison

9- Eleanor Roosevelt

10- Steve Martin


In [11]:
# Show the text extracted from the elements with class name "tags"
for i, tag in enumerate(tags, 1):
    tag = tag.text.strip()
    print(f"\n{i}- {tag}")


1- Tags: change deep-thoughts thinking world

2- Tags: abilities choices

3- Tags: inspirational life live miracle miracles

4- Tags: aliteracy books classic humor

5- Tags: be-yourself inspirational

6- Tags: adulthood success value

7- Tags: life love

8- Tags: edison failure inspirational paraphrased

9- Tags: misattributed-eleanor-roosevelt

10- Tags: humor obvious simile
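
With Selenium, each tags block arrives as a single string such as `Tags: life love`, and quotes, authors and tags live in three separate lists. The sketch below shows one possible way to split the tag string and combine the three lists into records. The helper names `split_tags` and `build_records` are hypothetical, and the code works on plain strings, so it assumes `.text` has already been read from each element.

```python
def split_tags(tags_text):
    """Turn a string like 'Tags: life love' into ['life', 'love']."""
    text = tags_text.strip()
    if text.startswith("Tags:"):
        text = text[len("Tags:"):]
    # split() with no arguments handles spaces and newlines alike.
    return text.split()


def build_records(quote_texts, author_names, tag_lines):
    """Combine the three parallel lists into one list of dictionaries."""
    return [
        {"quote": quote, "author": author, "tags": split_tags(tag_line)}
        for quote, author, tag_line in zip(quote_texts, author_names, tag_lines)
    ]


print(split_tags("Tags: change deep-thoughts thinking world"))
print(build_records(["A quote"], ["An author"], ["Tags: life love"]))
```

Note that `zip` pairs items by position, so this only works if the three lists have the same length and order, as they do in the output above.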


In [12]:
# Close the browser and end the WebDriver session
driver.quit()


## Wikipedia

**Initial Observations**

We can see that the complexity level of this page is higher (as it's not a dedicated practice page). However, we have identified some repeating tags, although it took some effort to find the ones we will use:

- `<html lang="es">`: The `lang` attribute indicates the language in which the page is written.
- `<head>`: This section holds the page metadata, including the page title inside the `<title>` tag.
- `<body class>`: It contains the main part of the page and encompasses most of the code.
- `<div class>`: Content is separated by the `class` attribute and its value (e.g., `class="xx"`).
- `<div class="mw-page-container">`: This is where the text is located. Inside it, we keep finding further `div` elements with different `class` values, so we narrow down the code step by step to find the parts that interest us.
- `<h2>`: These tags hold the titles of the different sections.
- `<p>`: These tags contain the main text of the page.
- `<a href...>`: These are links that take us to other pages with more information.

In summary, this page has a more complex structure, and we expect to find the content we're looking for within the `<div class="mw-page-container">`. We will proceed by examining and extracting data from the article on political ideologies.

Once again, both the Developer Tools and printing the code in the notebook proved helpful for locating the relevant parts of the HTML. Combining the two, Developer Tools for the initial inspection and the notebook for a closer look at the code, gives a fuller picture of the page structure and makes it easier to write accurate scraping scripts.


In [13]:
# URL of the Wikipedia page
url = "https://es.wikipedia.org/wiki/Ideologías_políticas"

# Send an HTTP GET request to the URL
response = requests.get(url)

# Parse the HTML content of the page using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Extract and print the page title
title = soup.find('h1', id='firstHeading').text
print(f"Page Title: {title}\n")

# Extract and print the main content of the page
content = soup.find('div', class_='mw-parser-output')
print("Main Content:")
print(content.text)


Page Title: Anexo:Ideologías políticas

Main Content:
Se entiende por ideologías políticas a los conjuntos de ideas o postulados fundamentales que caracterizan a los partidos políticos en relación con cómo deberían funcionar las instituciones de un Estado, una sociedad o una población. Según los estudios sociales, una ideología política es un juego ético de ideales, principios laborales y económicos, doctrinas, mitos o símbolos de un movimiento social, institución, clase o un grupo grande que explica cómo la sociedad debería funcionar. Las ideologías políticas ofrecen algún programa político y cultural para un cierto orden social. Una ideología política se ocupa mucho de cómo el poder debería asignarse y a cuáles fines debería concertar.
Algunos partidos siguen su ideología de manera estricta, aunque otros pueden tomar una inspiración amplia de un grupo de ideologías relacionadas, sin específicamente abrazar una idea específica. La popularidad de una ideología es en parte debida a la i

In [14]:
# Check if the request was successful (status code 200 indicates success)
if response.status_code == 200:
    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract and print the page title
    title_element = soup.find('h1', id='firstHeading')  # Find the title element
    title = title_element.text if title_element else "Title not found"
    print(f"Page Title: {title}\n")

    # Extract and print the main content of the page
    content_div = soup.find('div', class_='mw-parser-output')  # Find the main content div
    if content_div:
        # Extract the text from all paragraphs within the content div
        paragraphs = content_div.find_all('p')
        main_content = "\n".join([p.text for p in paragraphs])
        print("Main Content:")
        print(main_content)
    else:
        print("Main content not found on the page.")

else:
    print(f"Failed to retrieve the page. Status Code: {response.status_code}")


Page Title: Anexo:Ideologías políticas

Main Content:
Se entiende por ideologías políticas a los conjuntos de ideas o postulados fundamentales que caracterizan a los partidos políticos en relación con cómo deberían funcionar las instituciones de un Estado, una sociedad o una población. Según los estudios sociales, una ideología política es un juego ético de ideales, principios laborales y económicos, doctrinas, mitos o símbolos de un movimiento social, institución, clase o un grupo grande que explica cómo la sociedad debería funcionar. Las ideologías políticas ofrecen algún programa político y cultural para un cierto orden social. Una ideología política se ocupa mucho de cómo el poder debería asignarse y a cuáles fines debería concertar.

Algunos partidos siguen su ideología de manera estricta, aunque otros pueden tomar una inspiración amplia de un grupo de ideologías relacionadas, sin específicamente abrazar una idea específica. La popularidad de una ideología es en parte debida a la 

In [15]:
# Check if the request was successful (status code 200 indicates success)
if response.status_code == 200:
    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract and print only the titles or headings (e.g., h1, h2, h3, etc.)
    headings = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

    if headings:
        print("Headings:")
        for heading in headings:
            print(heading.text)
    else:
        print("No headings found on the page.")

else:
    print(f"Failed to retrieve the page. Status Code: {response.status_code}")


Headings:
Contenidos
Anexo:Ideologías políticas
Sociología[editar]
Jean-Jacques Rousseau[editar]
El Contrato Social[editar]
Karl Marx[editar]
El Trabajo Asalariado y Capital[editar]
Thomas Hobbes[editar]
Leviatán[editar]
Max Weber[editar]
El Espíritu Capitalista[editar]
División en grupos[editar]
Ideologías relacionadas con el anarquismo[editar]
Ideologías relacionadas con el conservadurismo[editar]
Ideologías relacionadas con el ecologismo[editar]
Ideologías relacionadas con el feminismo[editar]
Ideologías relacionadas con el liberalismo[editar]
Ideologías relacionadas con el nacionalismo[editar]
Ideologías relacionadas con la religión[editar]
Ideologías relacionadas con el socialismo[editar]
Véase también[editar]
Referencias[editar]
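
The `[editar]` suffix on most headings comes from Wikipedia's edit links, which are part of each heading's text. One way to remove it after extraction is sketched below. The helper name `clean_heading` is hypothetical, and the pattern assumes the suffix is always the literal text `[editar]` at the end of the string, as in the output above.

```python
import re

# Matches the literal "[editar]" suffix (plus surrounding whitespace) at the end.
EDIT_SUFFIX = re.compile(r"\s*\[editar\]\s*$")


def clean_heading(text):
    """Remove a trailing '[editar]' marker from a heading, if present."""
    return EDIT_SUFFIX.sub("", text).strip()


raw_headings = ["Sociología[editar]", "Karl Marx[editar]", "Contenidos"]
print([clean_heading(heading) for heading in raw_headings])
```

Headings without the marker, such as `Contenidos`, pass through unchanged.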


In [16]:
import re

# Check if the request was successful (status code 200 indicates success)
if response.status_code == 200:
    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find and extract the main content of the page
    main_content = soup.find('div', class_='mw-parser-output')

    if main_content:
        # Extract all the text from the main content
        all_text = main_content.get_text()

        # Remove unwanted characters and split the text into words
        words = re.findall(r'\b\w+\b', all_text)

        # Count the number of words
        word_count = len(words)

        print(f"Total Word Count: {word_count}")

        # Extract and print the first 100 words (adjust as needed)
        print("\nExtracted Words:")
        for i, word in enumerate(words[:100]):
            print(f"{i+1}. {word}")

    else:
        print("Main content not found on the page.")

else:
    print(f"Failed to retrieve the page. Status Code: {response.status_code}")


Total Word Count: 3889

Extracted Words:
1. Se
2. entiende
3. por
4. ideologías
5. políticas
6. a
7. los
8. conjuntos
9. de
10. ideas
11. o
12. postulados
13. fundamentales
14. que
15. caracterizan
16. a
17. los
18. partidos
19. políticos
20. en
21. relación
22. con
23. cómo
24. deberían
25. funcionar
26. las
27. instituciones
28. de
29. un
30. Estado
31. una
32. sociedad
33. o
34. una
35. población
36. Según
37. los
38. estudios
39. sociales
40. una
41. ideología
42. política
43. es
44. un
45. juego
46. ético
47. de
48. ideales
49. principios
50. laborales
51. y
52. económicos
53. doctrinas
54. mitos
55. o
56. símbolos
57. de
58. un
59. movimiento
60. social
61. institución
62. clase
63. o
64. un
65. grupo
66. grande
67. que
68. explica
69. cómo
70. la
71. sociedad
72. debería
73. funcionar
74. Las
75. ideologías
76. políticas
77. ofrecen
78. algún
79. programa
80. político
81. y
82. cultural
83. para
84. un
85. cierto
86. orden
87. social
88. Una
89. ideología
90. política
91. se
92.
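
A plain list of words is hard to read, so a natural next step is to count how often each word appears. The sketch below reuses the same regular expression with `collections.Counter`. The helper name `top_words`, the `min_length` filter for skipping short words such as "de" or "un", and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def top_words(text, n=5, min_length=4):
    """Return the n most common words that have at least min_length letters."""
    words = re.findall(r"\b\w+\b", text.lower())
    long_words = [word for word in words if len(word) >= min_length]
    return Counter(long_words).most_common(n)


# Made-up sample text; in the notebook this would be all_text from the cell above.
sample = "Ideología política, ideología social, programa político"
print(top_words(sample, n=1))
```

The length filter is a crude stand-in for a proper stop-word list, so the threshold may need adjusting for real text.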

### SELENIUM

In [17]:
# Call the browser
driver = webdriver.Chrome()

# Get the URL
driver.get("https://es.wikipedia.org/wiki/Ideologías_políticas")

# Print title
print(driver.title)

Anexo:Ideologías políticas - Wikipedia, la enciclopedia libre


In [18]:
# Find <p> elements
p_elements = driver.find_elements(By.TAG_NAME, 'p')

# Print the text of each <p> element
for p_element in p_elements:
    text_coms = p_element.text
    print("\n", text_coms)


 Wiki Loves Monuments: ¡Fotografía un monumento, ayuda a Wikipedia y gana!
Más información

 Se entiende por ideologías políticas a los conjuntos de ideas o postulados fundamentales que caracterizan a los partidos políticos en relación con cómo deberían funcionar las instituciones de un Estado, una sociedad o una población. Según los estudios sociales, una ideología política es un juego ético de ideales, principios laborales y económicos, doctrinas, mitos o símbolos de un movimiento social, institución, clase o un grupo grande que explica cómo la sociedad debería funcionar. Las ideologías políticas ofrecen algún programa político y cultural para un cierto orden social. Una ideología política se ocupa mucho de cómo el poder debería asignarse y a cuáles fines debería concertar.

 Algunos partidos siguen su ideología de manera estricta, aunque otros pueden tomar una inspiración amplia de un grupo de ideologías relacionadas, sin específicamente abrazar una idea específica. La popularidad 


 La institución de una república determina todas las normas que han de seguir aquellos que forman parte de esta. La institución de una república requiere una lealtad irrevocable entre sus súbditos y su líder. El soberano no puede quebrantar un pacto hecho con sus súbditos, ni estos con el soberano. Cualquier otro pacto previo queda totalmente anulado.

 De nacionalidad alemana y nacido en 1864, fue un filósofo, economista, jurista, historiador, politólogo y sociólogo alemán, de mucho reconocimiento por sus obras literarias. Se le considera uno de los fundadores del estudio de la sociología y la administración pública. Sus temas más importantes recaen en la sociología de la religión y el gobierno. Su definición del Estado fundamenta los estudios de Ciencias Políticas, de igual forma, su contribución literaria al desarrollo de sistemas económicos, como lo es el capitalismo, ha permeado en la sociedad contemporánea.

 En esta obra, él entiende que no se puede delimitar ni definir un conc

## Exercise 2
Document the data set you generated in a Word document, including the kind of information that Kaggle dataset pages provide.

### To know more

As an example of what is requested, you can consult this link:

-> https://www.kaggle.com/datasets/vivovinco/20212022-football-team-stats .

# Quotes to Scrape

## About Dataset

**Context:**
This dataset contains quotes from famous people, each tagged depending on the quote's content.

**Content:**
- There are 10 quotes on the first page.
- Includes the name of the author.
- Tags related to the quote for classification purposes.

**Website:** [http://quotes.toscrape.com](http://quotes.toscrape.com)

**Creation Date:** 10/09/2023

---

# Political Ideologies

## About Dataset

**Context:**
Political ideologies encompass a diverse array of beliefs that dictate how societies and their institutions should operate. These ideologies span the political spectrum, ranging from extreme left to extreme right, and they address fundamental questions related to ethics, economics, and societal principles. Two crucial dimensions classify these ideologies: their envisioned societal purposes and the methods proposed for achieving them. Key figures in the realm of political thought, such as Jean-Jacques Rousseau, Karl Marx, Thomas Hobbes, and Max Weber, have significantly contributed to the development of these ideologies. The source page also organizes ideologies into categories, including anarchism, conservatism, environmentalism, feminism, liberalism, nationalism, and religion, offering insights into their core tenets and impacts on political landscapes worldwide.

**Content:**

1. **Political Ideologies:** These are sets of ideas that define how a society or state should function, covering government type and economic system.

2. **Key Thinkers:** Influential figures like Rousseau, Marx, Weber, and Hobbes explored these ideologies and their impact on society.

**Website:** [https://es.wikipedia.org/wiki/Ideologías_políticas](https://es.wikipedia.org/wiki/Ideologías_políticas)

**Creation Date:** 10/09/2023


## Exercise 3
Choose a web page of your choice and perform web scraping using the Selenium library first and Scrapy later. 

In [20]:
# Call the browser you will use
driver = webdriver.Chrome()

# Get the URL
driver.get("https://worldmigrationreport.iom.int/wmr-2022-interactive/")

# Print title
print(driver.title)

Interactive World Migration Report 2022


In [21]:
# Find <p> elements
p_elements = driver.find_elements(By.TAG_NAME, 'p')

# Print the text of each <p> element
for p_element in p_elements:
    text_coms = p_element.text
    print("\n", text_coms)


 Since 2000, IOM has been producing world migration reports. The World Migration Report 2022, the eleventh in the world migration report series, has been produced to contribute to an increased understanding of migration throughout the world. This new edition presents key data and information on migration as well as thematic chapters on highly topical migration issues. This interactive represents only a small part of the Report. To access the full report, click on the download button.

 In most discussions on migration, the starting point is usually numbers. Understanding changes in scale, emerging trends and shifting demographics related to global social and economic transformations, such as migration, help us make sense of the changing world we live in and plan for the future. The current global estimate is that there were around 281 million international migrants in the world in 2020, which equates to 3.6 per cent of the global population.

 Overall, the estimated number of internat


 Afghanistan

 Venezuela (Bolivarian Republic of)

 Poland

 United Kingdom

 Indonesia

 Kazakhstan

 Romania

 Germany

 Myanmar

 Egypt

 Turkey

 Viet Nam

 Morocco

 Italy

 Colombia

 United States of America

 Nepal

 South Sudan

 France

 Republic of Korea

 Sudan

 Portugal

 Iraq

 Somalia

 Uzbekistan

 Algeria

 Sri Lanka

 Brazil

 Malaysia

 Democratic Republic of the Congo

 Haiti

 Cuba

 Bosnia and Herzegovina

 Bulgaria

 Nigeria

 Burkina Faso

 El Salvador

 Dominican Republic

 Peru

 Spain

 Belarus

 Guatemala

 Iran (Islamic Republic of)

 Mali

 Yemen

 Lao People’s Democratic Republic

 Canada

 Albania

 Zimbabwe

 Azerbaijan

 Republic of Moldova

 Côte d’Ivoire

 Ecuador

 Jamaica

 Cambodia

 Greece

 Argentina

 Thailand

 Croatia

 Czechia

 Ghana

 Serbia

 Honduras

 Netherlands

 Armenia

 Ethiopia

 Bolivia (Plurinational State of)

 South Africa

 Tunisia

 Paraguay

 Georgia

 Lebanon

 Central African Republic

 Eritrea

 New Zealand

 Japan

 U


 Migration Research and Analysis: Recent United Nations Contributions

 5

 The Great Disrupter: COVID-19’s impact on migration, mobility and migrants globally

 6

 Peace and security as drivers of stability, development and safe migration

 7

 International migration as a stepladder of opportunity: what do the global data actually show?

 8

 Disinformation about migration: an age-old issue with new tech dimensions

 9

 Migration and slow-onset impacts of climate change: taking stock and taking action

 10

 Human trafficking in migration pathways: trends, challenges and new forms of cooperation

 11

 Artificial intelligence, migration and mobility: implications for policy and practice

 12

 Reflections on migrant's contributions in an era of increasing disruption and disinformation (Repeat)
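
The output above mixes full paragraphs with blank entries, single country names and chapter numbers, because every `<p>` element is printed. One simple way to keep only the prose is to filter by word count, as sketched below. The helper name `keep_prose` and the `min_words` threshold are arbitrary assumptions, and the function works on plain strings, so it assumes `.text` has already been read from each element.

```python
def keep_prose(paragraph_texts, min_words=5):
    """Keep only the texts that contain at least min_words words."""
    kept = []
    for text in paragraph_texts:
        cleaned = " ".join(text.split())  # collapse whitespace and newlines
        if len(cleaned.split()) >= min_words:
            kept.append(cleaned)
    return kept


# Made-up sample in the shape of the output above.
sample_texts = [
    "",
    " Afghanistan",
    "5",
    " Since 2000, IOM has been producing world migration reports.",
]
print(keep_prose(sample_texts))
```

A word-count filter also keeps long chapter titles, so a more precise approach would select elements by their class or container instead of by tag name.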


In [25]:
driver.quit()