# Web Scraping Exercises

# Exercise 1

Scrape two of the three proposed websites, first with BeautifulSoup and then with Selenium.

* http://quotes.toscrape.com

* https://www.bolsamadrid.es

* www.wikipedia.es (run a search first and scrape some of the content)


# First website: http://quotes.toscrape.com
# **Web Scraping with BeautifulSoup:**

Beautiful Soup is a Python library for extracting information from HTML or XML content.

We chose http://quotes.toscrape.com for this web scraping project. It is a static site: the server returns an HTML document that already contains the data we see when browsing the page.

In [None]:
!pip install undetected-chromedriver

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import pandas as pd
import requests
import undetected_chromedriver as uc
from bs4 import BeautifulSoup

# Send a GET request to the page we want to scrape
url = "http://quotes.toscrape.com/"
response = requests.get(url)

# Parse the page content with BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

### This is the code of our page, pretty-printed for easy reading.
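
As a minimal sketch of how such a formatted view can be produced, `prettify()` returns the parsed tree re-indented with one tag per line. The HTML string below is a hand-written stand-in for the downloaded page, not the site's real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written HTML standing in for response.content
sample_html = '<div class="container"><div class="quote"><span class="text">Hi</span></div></div>'

sample_soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns the document as a string, one tag per line,
# indented according to nesting depth
pretty = sample_soup.prettify()
print(pretty)
```

With the real page, the same call on `soup` gives the full document in this layout.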

In [None]:
# Look for the content we need inside the HTML
results = soup.find(class_='container')

In [None]:
# Working on the variable that holds our page, use find_all to get an iterable
# containing all of the quotes
job_elements = results.find_all('div', class_='quote')

**At this point we have the information for each quote; each one is an object in its own right.**
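
To make "object" concrete, here is a small sketch of what a BeautifulSoup `Tag` offers. The markup is a hypothetical imitation of one quote block and may differ from the real page; the names `quote_html`, `quote_text` and `author_href` are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating one quote block; the real page may differ
quote_html = """
<div class="quote">
  <span class="text">An example quote.</span>
  <span>by <small class="author">Jane Doe</small>
    <a href="/author/Jane-Doe">(about)</a></span>
  <meta class="keywords" content="example,demo">
</div>
"""

quote = BeautifulSoup(quote_html, "html.parser").find("div", class_="quote")

# .text gives the visible string inside a tag
quote_text = quote.find("span", class_="text").text
author = quote.find("small", class_="author").text

# Attributes are read like dictionary keys
author_href = quote.find("a")["href"]
keywords = quote.find("meta", class_="keywords")["content"].split(",")

print(quote_text, author, author_href, keywords)
```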

In [None]:
# Store the data for each quote in lists and print the objects as text.
text_list = []
author_list = []
link_list = []
keyword = []
keyword_list = []

for job_element in job_elements:
    text_element = job_element.find("span", class_="text")
    text_list.append(text_element.text)
    author_element = job_element.find("small", class_="author")
    author_list.append(author_element.text)
    link_element = job_element
    link_list.append("http://quotes.toscrape.com"+link_element.find('a')['href'])
    keywords_element = job_element.find("meta", class_="keywords")
    keyword.append(keywords_element['content'])
    
    # Same comma-separated keyword string as stored in `keyword` above
    keyword_list.append(keywords_element['content'])
   
    print(text_element.text)
    print(author_element.text)
    print(link_element.find('a')['href'])
    print(keywords_element['content'])
    print()

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Albert Einstein
/author/Albert-Einstein
change,deep-thoughts,thinking,world

“It is our choices, Harry, that show what we truly are, far more than our abilities.”
J.K. Rowling
/author/J-K-Rowling
abilities,choices

“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Albert Einstein
/author/Albert-Einstein
inspirational,life,live,miracle,miracles

“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Jane Austen
/author/Jane-Austen
aliteracy,books,classic,humor

“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Marilyn Monroe
/author/Marilyn-Monroe
be-yourself,inspirational

“Try not to become a man of success. Rather become a man of value.”
Albert Einstein
/author/Albert-Einstein
ad

In [None]:
# Define a function that returns the quotes from any page of the site
def get_pages(types, kind):
    web = f'http://quotes.toscrape.com/{types}/{kind}/'
    page = requests.get(web)
    soup = BeautifulSoup(page.content, "html.parser")
    # Search inside the main container, as in the earlier cell
    results = soup.find(class_='container')
    job_elements = results.find_all('div', class_='quote')
    
    text_list = []
    author_list = []
    link_list = []
    keyword = []
    
    for job_element in job_elements:
        text_element = job_element.find("span", class_="text")
        text_list.append(text_element.text)
        author_element = job_element.find("small", class_="author")
        author_list.append(author_element.text)
        link_element = job_element
        link_list.append("http://quotes.toscrape.com"+link_element.find('a')['href'])
        keywords_element = job_element.find("meta", class_="keywords")
        keyword.append(keywords_element['content'])

    dict_quotes = {'Cita': text_list,'Autor': author_list,'Link Autor': link_list,'Keywords': keyword}

    df = pd.DataFrame(dict_quotes)
    
    return df

In [None]:
# Build a single dataframe by concatenating the dataframes generated for each page we visit
page = 'page'

# Walk through pages 1 to 10 of the site
frames = [get_pages(page, kind) for kind in range(1, 11)]
df_quotes_complete = pd.concat(frames, ignore_index=True)

df_quotes_complete

Unnamed: 0,Cita,Autor,Link Autor,Keywords
0,“The world as we have created it is a process ...,Albert Einstein,http://quotes.toscrape.com/author/Albert-Einstein,"change,deep-thoughts,thinking,world"
1,"“It is our choices, Harry, that show what we t...",J.K. Rowling,http://quotes.toscrape.com/author/J-K-Rowling,"abilities,choices"
2,“There are only two ways to live your life. On...,Albert Einstein,http://quotes.toscrape.com/author/Albert-Einstein,"inspirational,life,live,miracle,miracles"
3,"“The person, be it gentleman or lady, who has ...",Jane Austen,http://quotes.toscrape.com/author/Jane-Austen,"aliteracy,books,classic,humor"
4,"“Imperfection is beauty, madness is genius and...",Marilyn Monroe,http://quotes.toscrape.com/author/Marilyn-Monroe,"be-yourself,inspirational"
...,...,...,...,...
95,“You never really understand a person until yo...,Harper Lee,http://quotes.toscrape.com/author/Harper-Lee,better-life-empathy
96,“You have to write the book that wants to be w...,Madeleine L'Engle,http://quotes.toscrape.com/author/Madeleine-LE...,"books,children,difficult,grown-ups,write,write..."
97,“Never tell the truth to people who are not wo...,Mark Twain,http://quotes.toscrape.com/author/Mark-Twain,truth
98,"“A person's a person, no matter how small.”",Dr. Seuss,http://quotes.toscrape.com/author/Dr-Seuss,inspirational
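
The loop above relies on knowing that the site has ten pages. An alternative sketch is to follow the "Next" link until it disappears. The function `crawl_quotes`, its `fetch` parameter and the `li.next a` selector are assumptions for illustration; the two pages in `fake_site` are invented so the example runs offline.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def crawl_quotes(start_url, fetch):
    """Collect quote texts page by page until no 'next' link is left.

    `fetch` is any callable mapping a URL to its HTML text,
    for example: lambda u: requests.get(u).text
    """
    texts, url = [], start_url
    while url:
        soup = BeautifulSoup(fetch(url), "html.parser")
        texts += [s.text for s in soup.select("div.quote span.text")]
        # Assumed pager markup: <li class="next"><a href="/page/2/">
        next_link = soup.select_one("li.next a")
        url = urljoin(url, next_link["href"]) if next_link else None
    return texts

# Offline demonstration with two made-up pages
fake_site = {
    "http://example.test/": '<div class="quote"><span class="text">A</span></div>'
                            '<li class="next"><a href="/page/2/">Next</a></li>',
    "http://example.test/page/2/": '<div class="quote"><span class="text">B</span></div>',
}
collected = crawl_quotes("http://example.test/", fake_site.__getitem__)
print(collected)
```

Passing the fetcher as a parameter keeps the crawl logic testable without network access.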


# Now we want the quotes from the page for the "life" tag.

In [None]:
tag = 'tag'
life = 'life'
get_pages(tag, life) # the function returns a dataframe with the quotes from the "life" tag page

Unnamed: 0,Cita,Autor,Link Autor,Keywords
0,“There are only two ways to live your life. On...,Albert Einstein,http://quotes.toscrape.com/author/Albert-Einstein,"inspirational,life,live,miracle,miracles"
1,“It is better to be hated for what you are tha...,André Gide,http://quotes.toscrape.com/author/Andre-Gide,"life,love"
2,“This life is what you make it. No matter what...,Marilyn Monroe,http://quotes.toscrape.com/author/Marilyn-Monroe,"friends,heartbreak,inspirational,life,love,sis..."
3,"“I may not have gone where I intended to go, b...",Douglas Adams,http://quotes.toscrape.com/author/Douglas-Adams,"life,navigation"
4,"“Good friends, good books, and a sleepy consci...",Mark Twain,http://quotes.toscrape.com/author/Mark-Twain,"books,contentment,friends,friendship,life"
5,“Life is what happens to us while we are makin...,Allen Saunders,http://quotes.toscrape.com/author/Allen-Saunders,"fate,life,misattributed-john-lennon,planning,p..."
6,"“Today you are You, that is truer than true. T...",Dr. Seuss,http://quotes.toscrape.com/author/Dr-Seuss,"comedy,life,yourself"
7,“Life is like riding a bicycle. To keep your b...,Albert Einstein,http://quotes.toscrape.com/author/Albert-Einstein,"life,simile"
8,“Life isn't about finding yourself. Life is ab...,George Bernard Shaw,http://quotes.toscrape.com/author/George-Berna...,"inspirational,life,yourself"
9,“Finish each day and be done with it. You have...,Ralph Waldo Emerson,http://quotes.toscrape.com/author/Ralph-Waldo-...,"life,regrets"
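
Because each `Keywords` value is a single comma-separated string, counting individual tags needs a split first. A minimal sketch with the standard library, using three keyword strings copied from the rows above (`keyword_strings` and `tag_counts` are illustrative names):

```python
from collections import Counter

# Keyword strings copied from three rows of the output above
keyword_strings = [
    "inspirational,life,live,miracle,miracles",
    "life,love",
    "life,navigation",
]

# Split every string on commas and count each tag across all rows
tag_counts = Counter(tag for s in keyword_strings for tag in s.split(","))
print(tag_counts.most_common(1))
```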


# **Web Scraping with Selenium.**

### Installing and configuring Selenium

In [None]:
!pip install selenium

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
%%shell

# Add debian buster
cat > /etc/apt/sources.list.d/debian.list <<'EOF'
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster.gpg] http://deb.debian.org/debian buster main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster-updates.gpg] http://deb.debian.org/debian buster-updates main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-security-buster.gpg] http://deb.debian.org/debian-security buster/updates main
EOF

# Add keys
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A

apt-key export 77E11517 | gpg --dearmour -o /usr/share/keyrings/debian-buster.gpg
apt-key export 22F3D138 | gpg --dearmour -o /usr/share/keyrings/debian-buster-updates.gpg
apt-key export E562B32A | gpg --dearmour -o /usr/share/keyrings/debian-security-buster.gpg

# Prefer debian repo for chromium* packages only
# Note the double-blank lines between entries
cat > /etc/apt/preferences.d/chromium.pref << 'EOF'
Package: *
Pin: release a=eoan
Pin-Priority: 500


Package: *
Pin: origin "deb.debian.org"
Pin-Priority: 300


Package: chromium*
Pin: origin "deb.debian.org"
Pin-Priority: 700
EOF

Executing: /tmp/apt-key-gpghome.u7ebY3q8G6/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
gpg: key DCC9EFBF77E11517: "Debian Stable Release Key (10/buster) <debian-release@lists.debian.org>" not changed
gpg: Total number processed: 1
gpg:              unchanged: 1
Executing: /tmp/apt-key-gpghome.Lq35otfHah/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
gpg: key DC30D7C23CBBABEE: "Debian Archive Automatic Signing Key (10/buster) <ftpmaster@debian.org>" not changed
gpg: Total number processed: 1
gpg:              unchanged: 1
Executing: /tmp/apt-key-gpghome.grZsTdpvDs/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A
gpg: key 4DFAB270CAA96DFA: "Debian Security Archive Automatic Signing Key (10/buster) <ftpmaster@debian.org>" not changed
gpg: Total number processed: 1
gpg:              unchanged: 1
gpg: cannot open '/dev/tty': No such device or address
gpg: [stdout]: write error: Broken pipe
gpg: filter_flush failed on c



In [None]:
# Install the Chromium browser and its driver
!apt-get update
!apt-get install -y chromium chromium-driver

Hit:1 http://deb.debian.org/debian buster InRelease
Hit:2 http://deb.debian.org/debian buster-updates InRelease
Hit:3 http://deb.debian.org/debian-security buster/updates InRelease
Hit:4 https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/ InRelease
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

In [None]:
def web_driver():
    options = webdriver.ChromeOptions()
    options.add_argument("--verbose")
    options.add_argument('--no-sandbox')
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    options.add_argument("--window-size=1920,1200")
    options.add_argument('--disable-dev-shm-usage')
    driver = webdriver.Chrome(options=options)
    return driver

In [None]:
# Open the browser
browser = web_driver()

# Load the website
browser.get('http://quotes.toscrape.com')

In [None]:
# Read the content
results = browser.find_element(By.CLASS_NAME, 'container')
print(results.text)

Quotes to Scrape
Login
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by Albert Einstein (about)
Tags: change deep-thoughts thinking world
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
by J.K. Rowling (about)
Tags: abilities choices
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
by Albert Einstein (about)
Tags: inspirational life live miracle miracles
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
by Jane Austen (about)
Tags: aliteracy books classic humor
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
by Marilyn Monroe (about)
Tags: be-yourself inspirational
“Try not to become a man of success. Rather become a man of value.”
by Albert Einstein (about)
Tags: adulthood success

In [None]:
# Locate the element that contains a quote (here, the first one on the page)
message = browser.find_element(by=By.XPATH, value="/html/body/div/div[2]/div[1]/div[1]")

In [None]:
# Walk through the whole site to collect every quote
cita_list = []
autor_list = []
tags_list = []

cortar = 1
while cortar < 11:
    i=1
    while i<11:
        message = browser.find_element(by=By.XPATH, value=("/html/body/div/div[2]/div[1]/div[{}]").format(i))
        cita_list.append((message.find_element(by=By.CLASS_NAME, value="text").text))
        autor_list.append((message.find_element(by=By.CLASS_NAME, value="author").text))
        tags = message.find_element(by=By.CLASS_NAME, value="tags")
        tags_list.append(str(tags.text)[6:])
        i+=1
    
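    # Pause, then click "Next" to load the following page
    time.sleep(3)
    if cortar == 1:
        # On the first page the pager holds only the "Next" link
        browser.find_element(by=By.XPATH, value='/html/body/div/div[2]/div[1]/nav/ul/li/a').click()

    elif cortar < 10:
        # On later pages "Previous" comes first, so "Next" is the second item
        browser.find_element(by=By.XPATH, value='/html/body/div/div[2]/div[1]/nav/ul/li[2]/a').click()
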
    cortar+=1 
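
The slice `[6:]` in the loop above drops the leading `Tags: ` label from the element text. A slightly more explicit sketch, using a hypothetical helper `parse_tags` that is not part of the original notebook, removes the label only when it is present and returns the tags as a list:

```python
def parse_tags(tags_text):
    """Turn the visible text of the tags block into a list of tag names."""
    prefix = "Tags:"
    text = tags_text.strip()
    # Drop the label only if the text really starts with it
    if text.startswith(prefix):
        text = text[len(prefix):]
    # split() with no argument also copes with newlines or repeated spaces
    return text.split()

print(parse_tags("Tags: change deep-thoughts thinking world"))
```

Checking for the prefix avoids silently cutting six characters off a tag string that happens to lack the label.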

In [None]:
# Build a dataframe with all the information from the site
dict_quotes_Sel_web = {'Cita': cita_list,'Autor': autor_list, 'Keywords': tags_list}

dict_quotes_Sel_web = pd.DataFrame(dict_quotes_Sel_web)
dict_quotes_Sel_web

Unnamed: 0,Cita,Autor,Keywords
0,“The world as we have created it is a process ...,Albert Einstein,change deep-thoughts thinking world
1,"“It is our choices, Harry, that show what we t...",J.K. Rowling,abilities choices
2,“There are only two ways to live your life. On...,Albert Einstein,inspirational life live miracle miracles
3,"“The person, be it gentleman or lady, who has ...",Jane Austen,aliteracy books classic humor
4,"“Imperfection is beauty, madness is genius and...",Marilyn Monroe,be-yourself inspirational
...,...,...,...
95,“You never really understand a person until yo...,Harper Lee,better-life-empathy
96,“You have to write the book that wants to be w...,Madeleine L'Engle,books children difficult grown-ups write write...
97,“Never tell the truth to people who are not wo...,Mark Twain,truth
98,"“A person's a person, no matter how small.”",Dr. Seuss,inspirational


In [None]:
dict_quotes_Sel_web['Cita'].nunique()

100

In [None]:
dict_quotes_Sel_web.dtypes

Cita        object
Autor       object
Keywords    object
dtype: object

In [None]:
# Save the data to a file
dict_quotes_Sel_web.to_csv('quotes_scraping_Sel.csv', index = False)

In [None]:
# Close the browser
browser.close()

# Second website: www.wikipedia.es

## **Web Scraping with Beautiful Soup.**

In [None]:
# Send a GET request to the page we want to scrape
# ==============================================================================
url = "https://es.wikipedia.org/wiki/Beautiful_Soup"
response = requests.get(url)

# Parse the page content with BeautifulSoup
# ==============================================================================
soup = BeautifulSoup(response.content, "html.parser")

In [None]:
# Look for the content we need inside the HTML
# ==============================================================================
results = soup.find(id='content')

In [None]:
# Title of the article found
# ==============================================================================
job_elements = results.find('h1', class_='firstHeading')
print(job_elements.text)

Beautiful Soup


In [None]:
# First paragraph (the definition) of the article.
# ==============================================================================
job_elements = results.find('div', class_='mw-parser-output').find('p')
print(job_elements.text)

Beautiful Soup es una biblioteca de Python para analizar documentos HTML (incluyendo los que tienen un marcado incorrecto). Esta biblioteca crea un árbol con todos los elementos del documento y puede ser utilizado para extraer información. Por lo tanto, esta biblioteca es útil para realizar web scraping — extraer información de sitios web.[2]​
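
The extracted paragraph ends with a footnote marker (`[2]`). If clean text is wanted, one possible post-processing step is sketched below; `strip_reference_markers` is a hypothetical helper, and the handling of a zero-width space after the marker is an assumption about how the page renders it.

```python
import re

def strip_reference_markers(paragraph):
    """Remove footnote markers such as [2] and zero-width spaces."""
    without_refs = re.sub(r"\[\d+\]", "", paragraph)
    # U+200B is a zero-width space that may trail these markers (assumption)
    return without_refs.replace("\u200b", "").strip()

sample = "extraer información de sitios web.[2]\u200b\n"
print(strip_reference_markers(sample))
```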



# **Web Scraping with Selenium.**

In [None]:
# Open the browser
# ==============================================================================
browser= web_driver()

# Load the website
# ==============================================================================
browser.get('https://es.wikipedia.org/wiki/Beautiful_Soup')

In [None]:
# Read the content
# ==============================================================================
results = browser.find_element(By.CLASS_NAME, value='mw-parser-output')
# General content of the Wikipedia page for the search "Beautiful Soup".
# ==============================================================================
print(results.text)

Beautiful Soup
Información general
Tipo de programa Parseador HTML, web scraping
Autor Leonard Richardson
Licencia Python Software Foundation License (Inferior a la versión 4)
Licencia MIT (A partir de la versión 4)1
Información técnica
Programado en Python
Plataformas admitidas Python
Versiones
Última versión estable 4.11.1
8 de abril de 2022
Enlaces
Sitio web oficial
Repositorio de código
[editar datos en Wikidata]
Beautiful Soup es una biblioteca de Python para analizar documentos HTML (incluyendo los que tienen un marcado incorrecto). Esta biblioteca crea un árbol con todos los elementos del documento y puede ser utilizado para extraer información. Por lo tanto, esta biblioteca es útil para realizar web scraping — extraer información de sitios web.2
Está disponible para Python 3.
Código de ejemplo[editar]
# extracción de todos los enlaces de un documento html
from bs4 import BeautifulSoup

with open("./index.html") as f:
    soup = BeautifulSoup(f)
 
for anchor in soup.find_all('a'

In [None]:
# Definition taken from the page.
# ==============================================================================
contenido = results.find_element(By.TAG_NAME, value = 'p')
print(contenido.text)

Beautiful Soup es una biblioteca de Python para analizar documentos HTML (incluyendo los que tienen un marcado incorrecto). Esta biblioteca crea un árbol con todos los elementos del documento y puede ser utilizado para extraer información. Por lo tanto, esta biblioteca es útil para realizar web scraping — extraer información de sitios web.2


In [None]:
browser.close()

# Exercise 2

**Document the dataset you generated in a Word file, following the kind of information provided with Kaggle datasets.**

To learn more

As an example of what is being asked, see this link:

* https://www.kaggle.com/datasets/vivovinco/20212022-football-team-stats .

## **About the datasets obtained with Beautiful Soup from https://quotes.toscrape.com/**

**First dataset.**

This dataset collects the quotes published on https://quotes.toscrape.com/. It has 100 records and four object-type columns:

* Cita: quotes extracted from the site, 100 unique values

* Autor: author of each quote

* Link Autor: URL of the author's page on the site

* Keywords: comma-separated keywords attached to each quote

* "life" tag: a second dataframe was obtained with the quotes from the page for the "life" tag.

## **About the dataset obtained with Selenium from https://quotes.toscrape.com/**

* Cita: quotes extracted from the site, 100 unique values

* Autor: authors of the quotes, 50 unique values

* Keywords: words that characterize the quote, categorical variable
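
Figures like the ones listed above can be computed directly from a dataframe. The sketch below uses a tiny invented frame with the same column names, so the numbers it prints are illustrative and not those of the scraped data; `sample_df` and `summary` are hypothetical names.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the scraped dataset
sample_df = pd.DataFrame({
    "Cita": ["quote one", "quote two", "quote three"],
    "Autor": ["Author A", "Author B", "Author A"],
    "Keywords": ["life love", "books", "life"],
})

# One row per column: its dtype and its number of distinct values
summary = pd.DataFrame({
    "dtype": sample_df.dtypes.astype(str),
    "unique_values": sample_df.nunique(),
})
print(f"{len(sample_df)} rows, {sample_df.shape[1]} columns")
print(summary)
```

Running the same two statements on `dict_quotes_Sel_web` would give the counts for the real dataset.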

## **About the datasets obtained with Beautiful Soup and Selenium from www.wikipedia.es**

The following was retrieved:

* The title and the definition of **Beautiful Soup**





# Exercise 3

Choose any website you like and scrape it, first with the Selenium library and then with Scrapy.

# **Web Scraping with Selenium.**

In [None]:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [None]:
# Open the browser
# ==============================================================================
browser= web_driver()

# Load the website
# ==============================================================================
browser.get("https://twitter.com/search?q=ChatGPT&src=typeahead_click")

In [None]:
# Find all elements matching the given XPath
elements = browser.find_elements(By.XPATH, '//span[@class="css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0"]')

In [None]:
# Print the text inside each element
for element in elements:
    print(element.text)

Don’t miss what’s happening
People on Twitter are the first to know.
Log in
Sign up
Explore
Settings
Top
Latest
People
Photos
Videos

People
ChatGPT NFT Club
@ChatGPTNFTs
Follow
10,000 NFT - 0.07 ETH
ChatGPT Chef
@ChatGPTChef
Follow
Cooking Interesting ChatGPT content. And helping users to understand
better Note: All the content generated using ChatGPT for Research. Use carefully
R “Ray” Wang 王瑞光 #1A #AI #ChatGPT #StableDiffusion
@rwang0
Follow


2X BestSelling Author

Keynoter
Provocateur




View all
推特 电报 谷歌 chatgpt 脸书 line tiktok ins账号出售批发
@UmerHayree
·
When you are sad, eat a candy and tell yourself that life is sweet! 47. Success is from failure to failure, and the original enthusiasm is not reduced at all. 
10
10
10
3
推特账号 电报号 谷歌账号 Chat GPT账号 ins账号 出售批发
@EsthefaniaB
·
As long as you outlive your competitors, you win. 59. Those who are stronger than me are still working hard. I have no reason not to work hard. 
10
10
10
5
推特账号 电报号 谷歌账号 Chat GPT账号 ins账号 出售批发
@EsthefaniaB
·
People 

In [None]:
browser.close()