In [1]:
# Import basic libraries
import numpy as np
import pandas as pd

# Import scraping libraries
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

## Exercise 1
Perform web scraping of two of the three proposed web pages using BeautifulSoup first and Selenium afterwards. 

- http://quotes.toscrape.com

- https://www.bolsamadrid.es

- www.wikipedia.es (do a search first and scrape some content)

### The websites I have chosen to scrape are quotes.toscrape.com and Wikipedia

- quotes.toscrape.com is a site of quotes by famous people.
- Each quote has several tags related to the text.
- You have the option of clicking on a link (about) to learn more about the author of each quote.
- You have the option to go to the next or previous page by clicking a link.
- Each page has 10 quotes.

**Initial Observations**

Whether we view the HTML in the browser's developer tools or print it here in the notebook, we can already discern some structures that will help us write functions to "clean" the page and extract only the information we're interested in. Personally, I find this initial phase, or step 0, easier to do on the web page itself.

- `<html lang="en">`: Indicates the language in which the page is written.
- `<head>`: Contains the page metadata, including the `<title>` tag that holds the page title.
- `<body>`: Contains the visible part of the page.
- `<div class>`: Separates content by the `class` attribute and its value (e.g., `class="xx"`).
- `<span class="text" itemprop="text">`: Contains the text of each quote.
- `<small class="author" itemprop="author">`: Inside the same `div`, contains the name of the author.
- `<a href...>`: The link that leads to a page with more information about the author when we click "(about)".

In conclusion, each quote is enclosed in a `div` with the class "quote". Inside it, `span class="text"` holds the main text of the quote. It is followed by another `span` (without a class) that adds the word "by" and introduces the author through `small class="author"`. The "about" link uses the `href` attribute of an `a` tag to reference the author's page (it is the part appended to the address in the browser's address bar when you click the link). Finally, each quote block ends with a `div` for the tags, followed by the closing `</div>`.
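
The structure described above can be checked offline on a small fragment. The following is a minimal sketch, not the method used in the cells below: it relies only on Python's standard-library `html.parser` so that it runs without a network connection, and `SAMPLE_HTML` and `QuoteParser` are illustrative names with hand-written content rather than data fetched from the site.

```python
from html.parser import HTMLParser

# Hand-written fragment that mimics one quote block (illustrative, not fetched).
SAMPLE_HTML = """
<div class="quote">
  <span class="text">“A sample quote.”</span>
  <span>by <small class="author">Sample Author</small>
    <a href="/author/Sample-Author">(about)</a>
  </span>
  <div class="tags">
    Tags:
    <a class="tag" href="/tag/sample/">sample</a>
    <a class="tag" href="/tag/demo/">demo</a>
  </div>
</div>
"""


class QuoteParser(HTMLParser):
    """Collect the text, author, about-link and tags of one quote block."""

    def __init__(self):
        super().__init__()
        self.record = {"text": "", "author": "", "about": None, "tags": []}
        self._field = None  # which field the next text node belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css_class = attrs.get("class")
        href = attrs.get("href") or ""
        if tag == "span" and css_class == "text":
            self._field = "text"
        elif tag == "small" and css_class == "author":
            self._field = "author"
        elif tag == "a" and css_class == "tag":
            self._field = "tag"
        elif tag == "a" and href.startswith("/author/"):
            self.record["about"] = href

    def handle_data(self, data):
        if self._field == "tag":
            self.record["tags"].append(data.strip())
        elif self._field in ("text", "author"):
            self.record[self._field] += data.strip()

    def handle_endtag(self, tag):
        self._field = None  # text after a closing tag belongs to no field


parser = QuoteParser()
parser.feed(SAMPLE_HTML)
print(parser.record)
```

The same class names (`text`, `author`, `tag`) are the ones targeted with BeautifulSoup in the next section.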

### BeautifulSoup

In [2]:
# URL of the web page
url = "http://quotes.toscrape.com"

# Send an HTTP GET request to the URL
response = requests.get(url)

# Parse the HTML content of the page using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Find the elements containing quotes and authors
quote_divs = soup.find_all('div', class_='quote')

# Initialize empty lists to store data
quotes = []
authors = []
tags_list = []

# Iterate over the quote_divs and extract the data
for quote_div in quote_divs:
    # Extract the quote text
    quote_text = quote_div.find('span', class_='text').text
    quotes.append(quote_text)

    # Extract the author's name
    author = quote_div.find('small', class_='author').text
    authors.append(author)

    # Extract tags associated with the quote
    tags = [tag.text for tag in quote_div.find_all('a', class_='tag')]
    tags_list.append(tags)

# Create a DataFrame from the extracted data
data = {
    'Quote': quotes,
    'Author': authors,
    'Tags': tags_list
}

df = pd.DataFrame(data)

df


|   | Quote | Author | Tags |
|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | [change, deep-thoughts, thinking, world] |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | [abilities, choices] |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | [inspirational, life, live, miracle, miracles] |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | [aliteracy, books, classic, humor] |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | [be-yourself, inspirational] |
| 5 | “Try not to become a man of success. Rather be... | Albert Einstein | [adulthood, success, value] |
| 6 | “It is better to be hated for what you are tha... | André Gide | [life, love] |
| 7 | “I have not failed. I've just found 10,000 way... | Thomas A. Edison | [edison, failure, inspirational, paraphrased] |
| 8 | “A woman is like a tea bag; you never know how... | Eleanor Roosevelt | [misattributed-eleanor-roosevelt] |
| 9 | “A day without sunshine is like, you know, nig... | Steve Martin | [humor, obvious, simile] |


#### I would like to know how many quotes there are on the first ten pages.

In [3]:
# Function to count quotes on a page
def count_quotes_on_page(url):
    # Send an HTTP GET request to the URL
    response = requests.get(url)

    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the elements containing quotes
    quote_divs = soup.find_all('div', class_='quote')

    # Return the count of quotes on this page
    return len(quote_divs)

# Base URL of the website
base_url = "http://quotes.toscrape.com"
page_number = 1

while True:
    # Construct the URL for the current page
    page_url = f"{base_url}/page/{page_number}/"

    # Count quotes on the current page
    quote_count = count_quotes_on_page(page_url)

    # Check if there are quotes on the page
    if quote_count > 0:
        print(f"Page {page_number}: {quote_count} quotes")
        page_number += 1
    else:
        break


Page 1: 10 quotes
Page 2: 10 quotes
Page 3: 10 quotes
Page 4: 10 quotes
Page 5: 10 quotes
Page 6: 10 quotes
Page 7: 10 quotes
Page 8: 10 quotes
Page 9: 10 quotes
Page 10: 10 quotes
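
The loop above stops only because a page past the last one contains no quote blocks. As a hedged variation, the sketch below adds a safety cap so the loop always terminates even if that assumption fails. The names `count_quotes_across_pages`, `max_pages` and `fake_count` are hypothetical and not part of the original notebook; `fake_count` is an offline stand-in for `count_quotes_on_page` so the example runs without network access.

```python
def count_quotes_across_pages(count_fn, base_url, max_pages=50):
    """Return {page_number: quote_count}, stopping at the first empty page.

    count_fn is any callable that takes a page URL and returns the number
    of quotes on it (for example, count_quotes_on_page from the cell above).
    max_pages is a hypothetical safety cap so the loop always terminates.
    """
    counts = {}
    for page_number in range(1, max_pages + 1):
        page_url = f"{base_url}/page/{page_number}/"
        quote_count = count_fn(page_url)
        if quote_count == 0:
            break
        counts[page_number] = quote_count
    return counts


# Offline stand-in for the real counter: pretends the site has 3 pages.
def fake_count(url):
    page = int(url.rstrip("/").split("/")[-1])
    return 10 if page <= 3 else 0


counts = count_quotes_across_pages(fake_count, "http://quotes.toscrape.com")
print(counts, "total:", sum(counts.values()))
```

Passing `count_quotes_on_page` instead of `fake_count` would run the same loop against the real site.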


### Selenium

In [7]:
from selenium import webdriver
from selenium.webdriver.common.by import By

# Initialize the WebDriver (e.g., Chrome)
driver = webdriver.Chrome()
# Get the URL
driver.get("http://quotes.toscrape.com/")

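# Find elements by class name. Note that "tags" matches the whole tag
# container, so its text includes the "Tags:" label shown in the output.
quotes = driver.find_elements(By.CLASS_NAME, "text")
authors = driver.find_elements(By.CLASS_NAME, "author")
tags = driver.find_elements(By.CLASS_NAME, "tags")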

# Create empty lists to store the data
quote_texts = []
author_names = []
tag_texts = []

# Extract data and store it in lists
for quote, author, tag in zip(quotes, authors, tags):
    quote_texts.append(quote.text.strip())
    author_names.append(author.text.strip())
    tag_texts.append(tag.text.strip())

# Create a DataFrame from the extracted data
data = {
    "Quote": quote_texts,
    "Author": author_names,
    "Tags": tag_texts,
}

df = pd.DataFrame(data)

# Display the DataFrame
print(df)

# Close the WebDriver
driver.quit()


                                               Quote             Author  \
0  “The world as we have created it is a process ...    Albert Einstein   
1  “It is our choices, Harry, that show what we t...       J.K. Rowling   
2  “There are only two ways to live your life. On...    Albert Einstein   
3  “The person, be it gentleman or lady, who has ...        Jane Austen   
4  “Imperfection is beauty, madness is genius and...     Marilyn Monroe   
5  “Try not to become a man of success. Rather be...    Albert Einstein   
6  “It is better to be hated for what you are tha...         André Gide   
7  “I have not failed. I've just found 10,000 way...   Thomas A. Edison   
8  “A woman is like a tea bag; you never know how...  Eleanor Roosevelt   
9  “A day without sunshine is like, you know, nig...       Steve Martin   

                                             Tags  
0       Tags: change deep-thoughts thinking world  
1                         Tags: abilities choices  
2  Tags: inspirati

In [8]:
df

|   | Quote | Author | Tags |
|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | Tags: change deep-thoughts thinking world |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | Tags: abilities choices |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | Tags: inspirational life live miracle miracles |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | Tags: aliteracy books classic humor |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | Tags: be-yourself inspirational |
| 5 | “Try not to become a man of success. Rather be... | Albert Einstein | Tags: adulthood success value |
| 6 | “It is better to be hated for what you are tha... | André Gide | Tags: life love |
| 7 | “I have not failed. I've just found 10,000 way... | Thomas A. Edison | Tags: edison failure inspirational paraphrased |
| 8 | “A woman is like a tea bag; you never know how... | Eleanor Roosevelt | Tags: misattributed-eleanor-roosevelt |
| 9 | “A day without sunshine is like, you know, nig... | Steve Martin | Tags: humor obvious simile |
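
Unlike the BeautifulSoup result, the Selenium `Tags` column holds a single string that starts with the "Tags:" label. A minimal sketch of one way to turn those strings into lists is shown below; `split_tags` is a hypothetical helper name, and the sample rows are copied from the output above.

```python
def split_tags(raw):
    """Turn 'Tags: change deep-thoughts' into ['change', 'deep-thoughts']."""
    prefix = "Tags:"
    text = raw.strip()
    if text.startswith(prefix):
        text = text[len(prefix):]
    # Tags are separated by whitespace; an empty string yields an empty list.
    return text.split()


rows = [
    "Tags: change deep-thoughts thinking world",
    "Tags: abilities choices",
    "Tags:",
]
cleaned = [split_tags(row) for row in rows]
print(cleaned)
```

On the DataFrame itself, the same helper could be applied with `df["Tags"] = df["Tags"].apply(split_tags)`.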


## Wikipedia

**Initial Observations**

This page is more complex than the previous one, since it is not a dedicated practice site. Even so, some repeating tags can be identified, although it took some effort to find the ones used below:

- `<html lang="es">`: Indicates the language in which the page is written (Spanish).
- `<head>`: Contains the page metadata, including the `<title>` tag that holds the page title.
- `<body class>`: Contains the visible part of the page and encompasses most of the code.
- `<div class>`: Content is separated by the `class` attribute and its value (e.g., `class="xx"`).
- `<div class="mw-page-container">`: This is where the text is located. It contains further nested `div` elements with different `class` values, which we narrow down step by step to find the parts that interest us.
- `<h2>`: Holds the title of each section.
- `<p>`: Contains the main text of the page.
- `<a href...>`: Links that lead to other pages with more information.

In summary, this page has a more complex structure, and the content we are looking for, the article on political ideologies ("Ideologías políticas"), sits inside `<div class="mw-page-container">`. The cells below extract data from this page.

Once again, both the browser's Developer Tools and printing the HTML in the notebook helped to locate the relevant elements. Developer Tools are best suited to the initial inspection of the page structure, while viewing the code in the notebook shows what the script actually receives. Combining the two makes it easier to write scraping scripts that target the right elements.


In [9]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the Wikipedia page
url = "https://es.wikipedia.org/wiki/Ideologías_políticas"

# Send an HTTP GET request to the URL
response = requests.get(url)

# Parse the HTML content of the page using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Extract section headings and content
sections = []
headings = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

for heading in headings:
    section = {
        "Heading": heading.text.strip(),
        "Content": ""
    }
    
    # Extract content under the heading until the next heading
    content = []
    next_element = heading.find_next_sibling()
    
    while next_element and next_element.name not in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
        content.append(next_element.text.strip())
        next_element = next_element.find_next_sibling()
    
    section["Content"] = "\n".join(content)
    sections.append(section)

# Create a DataFrame from the extracted data
df = pd.DataFrame(sections)

# Display the DataFrame
df

|   | Heading | Content |
|---|---|---|
| 0 | Contenidos | mover a la barra lateral\nocultar |
| 1 | Anexo:Ideologías políticas | 19 idiomas\n\n\n\n\nالعربيةAzərbaycancaবাংলাDa... |
| 2 | Sociología[editar] | La sociología es el estudio de las convencione... |
| 3 | Jean-Jacques Rousseau[editar] | Jean-Jacques Rousseau a la edad de 41 años, pi... |
| 4 | El Contrato Social[editar] | En su obra maestra, El Contrato Social, Rousse... |
| 5 | Karl Marx[editar] | Karl Marx en 1875.\nFue un filósofo alemán de ... |
| 6 | El Trabajo Asalariado y Capital[editar] | En El capital, Marx hace referencia a la Econo... |
| 7 | Thomas Hobbes[editar] | Sociólogo nacido en Inglaterra, para el 1588. ... |
| 8 | Leviatán[editar] | En su obra, Leviatán, Thomas Hobbes abarca el ... |
| 9 | Max Weber[editar] | Max Weber en 1894.\nDe nacionalidad alemana y ... |
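
The headings in the output above carry the "[editar]" suffix from Wikipedia's edit links. The sketch below shows one possible way to strip it after extraction; `clean_heading` and `EDIT_SUFFIX` are hypothetical names, and the sample values are copied from the table above.

```python
import re

# Matches a trailing "[editar]" marker, with optional surrounding whitespace.
EDIT_SUFFIX = re.compile(r"\s*\[editar\]\s*$")


def clean_heading(heading):
    """Remove the trailing '[editar]' edit-link text from a section heading."""
    return EDIT_SUFFIX.sub("", heading).strip()


samples = ["Sociología[editar]", "Karl Marx[editar]", "Contenidos"]
cleaned_headings = [clean_heading(h) for h in samples]
print(cleaned_headings)
```

Applied to the DataFrame, this would be `df["Heading"] = df["Heading"].map(clean_heading)`.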


In [10]:
import re

# Check if the request was successful (status code 200 indicates success)
if response.status_code == 200:
    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find and extract the main content of the page
    main_content = soup.find('div', class_='mw-parser-output')

    if main_content:
        # Extract all the text from the main content
        all_text = main_content.get_text()

        # Remove unwanted characters and split the text into words
        words = re.findall(r'\b\w+\b', all_text)

        # Count the number of words
        word_count = len(words)

        # Create a list to store word count data
        word_count_data = []

        # Walk through the words in chunks of 100 (indices are 1-based)
        for i in range(1, word_count + 1, 100):
            # Slicing past the end of a list is safe, so the last chunk
            # is simply shorter and needs no special case
            extracted_words = words[i - 1:i + 99]

            # Append word count data to the list
            word_count_data.append({
                "Start Index": i,
                "End Index": i + len(extracted_words) - 1,
                "Total Word Count": len(extracted_words),
                "Words": " ".join(extracted_words)
            })

        # Create a DataFrame to store the word count data
        df = pd.DataFrame(word_count_data)

        # Display the DataFrame
        print(df)

    else:
        print("Main content not found on the page.")

else:
    print(f"Failed to retrieve the page. Status Code: {response.status_code}")

    Start Index  End Index  Total Word Count  \
0             1        100               100   
1           101        200               100   
2           201        300               100   
3           301        400               100   
4           401        500               100   
5           501        600               100   
6           601        700               100   
7           701        800               100   
8           801        900               100   
9           901       1000               100   
10         1001       1100               100   
11         1101       1200               100   
12         1201       1300               100   
13         1301       1400               100   
14         1401       1500               100   
15         1501       1600               100   
16         1601       1700               100   
17         1701       1800               100   
18         1801       1900               100   
19         1901       2000              

In [11]:
df.head()

|   | Start Index | End Index | Total Word Count | Words |
|---|---|---|---|---|
| 0 | 1 | 100 | 100 | Se entiende por ideologías políticas a los con... |
| 1 | 101 | 200 | 100 | a cuáles fines debería concertar Algunos parti... |
| 2 | 201 | 300 | 100 | la extrema izquierda también conocida como ult... |
| 3 | 301 | 400 | 100 | la mejor forma de gobierno por ejemplo la demo... |
| 4 | 401 | 500 | 100 | el funcionamiento de dichas sociedades En la s... |


### Selenium

In [12]:
# Call the browser
driver = webdriver.Chrome()

# Get the URL
driver.get("https://es.wikipedia.org/wiki/Ideologías_políticas")

# Print title
print(driver.title)

# Find <p> elements
p_elements = driver.find_elements(By.TAG_NAME, 'p')

# Create a list to store paragraph text
paragraphs = []

# Extract and store text from each <p> element
for p_element in p_elements:
    text = p_element.text
    if text:
        paragraphs.append(text)

# Create a DataFrame to store the paragraph text
df = pd.DataFrame({'Paragraphs': paragraphs})

# Display the DataFrame
print(df)

# Close the browser
driver.quit()


Anexo:Ideologías políticas - Wikipedia, la enciclopedia libre
                                           Paragraphs
0   Se entiende por ideologías políticas a los con...
1   Algunos partidos siguen su ideología de manera...
2   Las ideologías se clasifican a través del espe...
3   Una ideología es una colección de ideas. Usual...
4   Finalmente, las ideologías políticas se clasif...
5   La sociología es el estudio de las convencione...
6   Este personaje es de suma importancia en la hi...
7   En su obra maestra, El Contrato Social, Rousse...
8   Establece al soberano como un ser “supremo”, a...
9   Fue un filósofo alemán de origen judío, que se...
10  En El capital, Marx hace referencia a la Econo...
11  Establece que existe una lucha entre clases so...
12  El trabajo es un intercambio entre el obrero y...
13  En general, establece a la imposición de la ma...
14  Sociólogo nacido en Inglaterra, para el 1588. ...
15  En su obra, Leviatán, Thomas Hobbes abarca el ...
16  De igual forma, 

## Exercise 2
Document the data set you generated in a Word document, including the kind of information provided on Kaggle dataset pages.

### Learn more

As an example of what is requested, see this link:

- https://www.kaggle.com/datasets/vivovinco/20212022-football-team-stats

# Quotes to Scrape

## About Dataset

**Context:**
This dataset contains quotes from famous people, each tagged depending on the quote's content.

**Content:**
- There are 10 quotes on the first page.
- Includes the name of the author.
- Tags related to the quote for classification purposes.

**Website:** [http://quotes.toscrape.com](http://quotes.toscrape.com)

**Creation Date:** 10/09/2023

---

# Political Ideologies

## About Dataset

**Context:**
Political ideologies encompass a diverse array of beliefs that dictate how societies and their institutions should operate. These ideologies span the political spectrum, ranging from extreme left to extreme right, and they address fundamental questions related to ethics, economics, and societal principles. Two crucial dimensions classify these ideologies: their envisioned societal purposes and the methods proposed for achieving them. Key figures in the realm of political thought, such as Jean-Jacques Rousseau, Karl Marx, Thomas Hobbes, and Max Weber, have significantly contributed to the development of these ideologies. The source page also organizes ideologies into categories, including anarchism, conservatism, environmentalism, feminism, liberalism, nationalism, and religion, offering insights into their core tenets and impacts on political landscapes worldwide.

**Content:**

1. **Political Ideologies:** These are sets of ideas that define how a society or state should function, covering government type and economic system.

2. **Key Thinkers:** Influential figures like Rousseau, Marx, Weber, and Hobbes explored these ideologies and their impact on society.

**Website:** [https://es.wikipedia.org/wiki/Ideologías_políticas](https://es.wikipedia.org/wiki/Ideologías_políticas)

**Creation Date:** 10/09/2023


## Exercise 3
Choose a web page and perform web scraping using the Selenium library first and Scrapy afterwards.

In [13]:
# Call the browser
driver = webdriver.Chrome()

# Get the URL
driver.get("https://worldmigrationreport.iom.int/wmr-2022-interactive/")

# Print title
print(driver.title)

# Find <p> elements
p_elements = driver.find_elements(By.TAG_NAME, 'p')

# Create a list to store paragraph text
paragraphs = []

# Extract and store text from each <p> element
for p_element in p_elements:
    text = p_element.text
    if text:
        paragraphs.append(text)

# Create a DataFrame to store the paragraph text
df = pd.DataFrame({'Paragraphs': paragraphs})

# Display the DataFrame
print(df)

# Close the browser
driver.quit()


Interactive World Migration Report 2022
                                            Paragraphs
0    Since 2000, IOM has been producing world migra...
1    In most discussions on migration, the starting...
2    Overall, the estimated number of international...
3    International remittances are financial or in-...
4    The World Bank compiles global data on interna...
..                                                 ...
700  Human trafficking in migration pathways: trend...
701                                                 11
702  Artificial intelligence, migration and mobilit...
703                                                 12
704  Reflections on migrant's contributions in an e...

[705 rows x 1 columns]
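
Some of the 705 rows above contain only a number (for example "11" and "12"), because every `<p>` element is collected regardless of its content. A hedged sketch of a simple filter is shown below; `is_informative` and its `min_words` threshold are hypothetical, and the sample strings are illustrative rather than taken from the page.

```python
def is_informative(text, min_words=2):
    """Return True for paragraphs that are more than a bare number or a single word.

    min_words is a hypothetical threshold; adjust it to the page being scraped.
    """
    stripped = text.strip()
    if not stripped or stripped.isdigit():
        return False
    return len(stripped.split()) >= min_words


# Illustrative sample, shaped like the scraped list of paragraph strings.
sample_paragraphs = [
    "An example paragraph about migration data.",
    "11",
    "Another example paragraph.",
    "12",
    "   ",
]
kept = [p for p in sample_paragraphs if is_informative(p)]
print(kept)
```

The same predicate could replace the `if text:` check in the loop above, or be applied afterwards with `df[df["Paragraphs"].map(is_informative)]`.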
