<a href="https://colab.research.google.com/github/LabCInf/TallerScraping/blob/main/Web_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<p><img alt="Colaboratory logo" height="250px" src="https://upload.wikimedia.org/wikipedia/commons/archive/f/fb/20161010213812%21Escudo-UdeA.svg" align="left" hspace="10px" vspace="0px"></p>

<h1><b>Introduction to web scraping</b></h1>

<h2>Material prepared by: Jaider Ochoa Gutiérrez and Juan Fernando Pérez Pérez</h2>

This notebook gives a general overview of how to use computational tools to scrape web pages.

**Let's get to work!**

# Step 1: Set up the working environment and libraries

In [None]:
!pip install BeautifulSoup4
!pip install requests

In [None]:
from bs4 import BeautifulSoup
import pandas as pd
import requests

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Step 2: Select and inspect the website

**HTML page structure**

HyperText Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser. HTML describes the structure of a web page and can be combined with Cascading Style Sheets (CSS) and a scripting language such as JavaScript to build interactive websites. An HTML document consists of a series of elements that tell the browser how to display the content, and each element is written using tags.

Some common tags are listed below:



* The `<!DOCTYPE html>` declaration defines this document as HTML5.
* The `<html>` element is the root element of an HTML page.
* The `<div>` tag defines a division or section in an HTML document. It usually serves as a container for other elements.
* The `<head>` element contains meta-information about the document.
* The `<title>` element specifies a title for the document.
* The `<body>` element contains the visible page content.
* The `<h1>` element defines a large heading.
* The `<p>` element defines a paragraph.
* The `<a>` element defines a hyperlink.

HTML tags normally come in pairs, such as `<p>` and `</p>`. The first tag in a pair is the opening tag and the second is the closing tag. The closing tag is written like the opening tag, but with a forward slash inserted before the tag name.
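
The pairing of opening and closing tags can be seen with a small sketch. It uses Python's built-in `html.parser` module (not BeautifulSoup, which is introduced in the next steps) on a made-up page; `SAMPLE_HTML` and `TagLogger` are hypothetical names chosen only for this illustration.

```python
from html.parser import HTMLParser

# A tiny, made-up HTML page that uses the tags listed above.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head><title>Sample page</title></head>
<body>
<h1>Main heading</h1>
<p>A paragraph with <a href="https://example.com">a link</a>.</p>
</body>
</html>"""


class TagLogger(HTMLParser):
    """Record every opening and closing tag in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))


logger = TagLogger()
logger.feed(SAMPLE_HTML)
logger.close()

# Print each tag the way it appears in the source: <p> or </p>.
for kind, tag in logger.events:
    slash = "/" if kind == "close" else ""
    print(f"{kind:5} <{slash}{tag}>")

opened = [tag for kind, tag in logger.events if kind == "open"]
closed = [tag for kind, tag in logger.events if kind == "close"]
```

In this sample every opened tag is also closed, so `opened` and `closed` contain the same tag names. The `<!DOCTYPE html>` declaration is not an element and does not appear in either list.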

In [None]:
url = 'https://techcrunch.com/'

# Step 3: Send an HTTP request

In [None]:
url = 'https://techcrunch.com/'
page = requests.get(url) # Send the request to the website
soup = BeautifulSoup(page.content, 'html.parser')
page # The expected response is 200, which means the request succeeded and the page content was returned, so we can scrape it

<Response [200]>
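
The number shown in the response is an HTTP status code. With `requests`, the same number is available as `page.status_code`, and `page.raise_for_status()` raises an exception for 4xx and 5xx responses. The sketch below does not touch the network; it only uses the standard library's `http.HTTPStatus` to show how a few common codes could be read. The helper name `describe_status` is hypothetical.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short, human-readable label for an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    if 200 <= code < 300:
        outcome = "success"
    elif 300 <= code < 400:
        outcome = "redirect"
    elif 400 <= code < 500:
        outcome = "client error"
    elif code >= 500:
        outcome = "server error"
    else:
        outcome = "informational"
    return f"{code} {phrase} ({outcome})"


# Codes that are commonly seen while scraping.
for code in (200, 403, 404, 429, 503):
    print(describe_status(code))
```

Only a 2xx code means the page content was actually returned; for the other codes above, `page.content` would hold an error page rather than the articles.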

In [None]:
a = soup.find_all('a')
a

# Step 4: Extract specific sections

In [None]:
#Extract the news headlines
titulo = soup.find_all('a', {'class': 'post-block__title__link'}) # Inspect the page to identify the element or attribute to extract
titulo_list = [] # Empty list where the data will be stored
for x in titulo[1:]: # Loop over the matches (skipping the first one) to extract each headline
   titulo_list.append(x.get_text()) # append is a Python list method that adds an item to the list created above
                                    # .get_text() is a BeautifulSoup method that returns the text inside a tag
titulos=pd.DataFrame(titulo_list, columns=['Títulos']) # Store the result in a DataFrame
titulos # Display the data stored in the DataFrame

Unnamed: 0,Títulos
0,\n\t\t\t\tBrex just signed a term sheet for $3...
1,\n\t\t\t\tMesh++ raises $4.9M to make the worl...
2,\n\t\t\t\tSnap says iOS privacy changes hit it...
3,\n\t\t\t\tStarting your journey to zero trust ...
4,\n\t\t\t\tDaily Crunch: ‘To stand up to the ty...
5,"\n\t\t\t\tRed Hat continues to grow, but IBM’s..."
6,\n\t\t\t\tPotential winners and losers line up...
7,\n\t\t\t\tFacebook agrees to pay French publis...
8,\n\t\t\t\tESG and shareholder activism: A tsun...
9,\n\t\t\t\tCharting a course through the intern...
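
The headlines above still carry the line breaks and tabs that surround the text in the page source. A minimal sketch of one way to avoid that is shown below: `get_text(strip=True)` removes leading and trailing whitespace at extraction time. The HTML snippet is made up for illustration (only the class name `post-block__title__link` comes from the cell above), and `class_=` is used as an equivalent shorthand for the `{'class': ...}` dictionary.

```python
from bs4 import BeautifulSoup

# Made-up snippet imitating the structure inspected above.
SAMPLE_HTML = """
<h2><a class="post-block__title__link" href="https://example.com/first">
\t\t\t\tFirst headline\t\t\t</a></h2>
<h2><a class="post-block__title__link" href="https://example.com/second">
\t\t\t\tSecond headline\t\t\t</a></h2>
<a class="other-link" href="https://example.com/about">About</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Only the <a> tags with the requested class are returned.
links = soup.find_all("a", class_="post-block__title__link")

raw_titles = [a.get_text() for a in links]              # keeps \n and \t
clean_titles = [a.get_text(strip=True) for a in links]  # whitespace removed

print(raw_titles)
print(clean_titles)
```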


In [None]:
#Extract the news content (the summary of each article)
content = soup.find_all('div', {'class': 'post-block__content'}) # Inspect the page to identify the element or attribute to extract
content_list = []
for i in content[1:]: # Loop over the matches to extract each article summary
   content_list.append(i.get_text())

contenido=pd.DataFrame(content_list, columns=['Contenido']) # Store the result in a DataFrame
contenido

Unnamed: 0,Contenido
0,\n\t\tSo now Plaid says it’s a payments compan...
1,\n\t\tFacebook has reached a multi-year agreem...
2,\n\t\tWith the increase in attention on ESG is...
3,\n\t\tGoogle Chrome's Manifest V3 update is ju...
4,\n\t\tTC Sessions: SaaS 2021 kicks off in just...
5,\n\t\tAfter steadily expanding access over the...
6,"\n\t\tIn a bull market, it's especially hard t..."
7,"\n\t\tAmazon is rolling out “Local Selling,” a..."
8,"\n\t\tIf challenging the status quo was easy, ..."
9,\n\t\tThe climate measures in the budget recon...


In [None]:
#Extract the link to each article
link = soup.find_all('a', {'class': 'post-block__title__link'}) # Inspect the page to identify the element or attribute to extract
link_list = []
for i in link[1:]: # Loop over the matches to extract each article link
   link_list.append(i.get('href'))

enlaces=pd.DataFrame(link_list, columns=['Enlaces']) # Store the result in a DataFrame
enlaces

Unnamed: 0,Enlaces
0,https://techcrunch.com/2021/10/21/potential-wi...
1,https://techcrunch.com/2021/10/21/facebook-agr...
2,https://techcrunch.com/2021/10/21/esg-and-shar...
3,https://techcrunch.com/2021/10/21/charting-a-c...
4,https://techcrunch.com/2021/10/21/network-your...
5,https://techcrunch.com/2021/10/21/twitter-roll...
6,https://techcrunch.com/2021/10/21/lessons-from...
7,https://techcrunch.com/2021/10/21/amazon-rolls...
8,https://techcrunch.com/2021/10/21/surface-duo-...
9,https://techcrunch.com/2021/10/21/the-climate-...


In [None]:
#Concatenate all the DataFrames (the tables created above) side by side
df=pd.concat([titulos,contenido,enlaces],axis=1)
df

Unnamed: 0,Títulos,Contenido,Enlaces
0,\n\t\t\t\tPotential winners and losers line up...,\n\t\tSo now Plaid says it’s a payments compan...,https://techcrunch.com/2021/10/21/potential-wi...
1,\n\t\t\t\tFacebook agrees terms to pay French ...,\n\t\tFacebook has reached a multi-year agreem...,https://techcrunch.com/2021/10/21/facebook-agr...
2,\n\t\t\t\tESG and shareholder activism: A tsun...,\n\t\tWith the increase in attention on ESG is...,https://techcrunch.com/2021/10/21/esg-and-shar...
3,\n\t\t\t\tCharting a course through the intern...,\n\t\tGoogle Chrome's Manifest V3 update is ju...,https://techcrunch.com/2021/10/21/charting-a-c...
4,\n\t\t\t\tNetwork your way to opportunity at T...,\n\t\tTC Sessions: SaaS 2021 kicks off in just...,https://techcrunch.com/2021/10/21/network-your...
5,\n\t\t\t\tTwitter rolls out the ability for an...,\n\t\tAfter steadily expanding access over the...,https://techcrunch.com/2021/10/21/twitter-roll...
6,\n\t\t\t\tLessons from founders raising their ...,"\n\t\tIn a bull market, it's especially hard t...",https://techcrunch.com/2021/10/21/lessons-from...
7,\n\t\t\t\tAmazon rolls out in-store pickup for...,"\n\t\tAmazon is rolling out “Local Selling,” a...",https://techcrunch.com/2021/10/21/amazon-rolls...
8,\n\t\t\t\tSurface Duo 2 review: Getting better...,"\n\t\tIf challenging the status quo was easy, ...",https://techcrunch.com/2021/10/21/surface-duo-...
9,\n\t\t\t\tThe climate policies tucked into Con...,\n\t\tThe climate measures in the budget recon...,https://techcrunch.com/2021/10/21/the-climate-...
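
With `axis=1`, `pd.concat` places the tables side by side and pairs rows by their index label, not by their content. The following sketch, with made-up values, shows what that implies when the lists do not have the same length: the missing cells are filled with `NaN`. Checking the lengths before concatenating is one simple safeguard; it is a suggestion, not part of the original workflow.

```python
import pandas as pd

# Made-up data: three titles but only two summaries.
titles = pd.DataFrame(["Title A", "Title B", "Title C"], columns=["Títulos"])
summaries = pd.DataFrame(["Summary A", "Summary B"], columns=["Contenido"])

# Rows are matched on the index (0, 1, 2); index 2 has no summary.
combined = pd.concat([titles, summaries], axis=1)
print(combined)

# A quick check that could be run before concatenating scraped lists.
same_length = len(titles) == len(summaries)
print("Same number of rows:", same_length)
```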


In [None]:
#It is important to save the data regularly in JSON or CSV format to avoid losing it and having to rerun the code over and over
df.to_csv("/content/drive/MyDrive/2021/3. Estrategia digital TC/6. Semana de la innovación/ejemplo_1.csv", sep=";",encoding="utf-8-sig")

In [None]:
#Text cleaning
import re # regular expressions library

def clean_text(df, text_field):
  patternURLEMAIL=r'(\w+[.]?\w+@(\w+\.)+\w+)|((http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?\w+([\-\.]{1}\w+)*\.[a-z]{2,5}(\/)?(([^\s@])*(\/)?)*)'
  patternHashtagMention=r'(@\w+)|(#\w+)'
  # Apply the regular expressions above to remove URLs, emails, hashtags and mentions
  df[text_field] = df[text_field].apply(lambda elem: re.sub(patternURLEMAIL,'', elem))
  df[text_field] = df[text_field].apply(lambda elem: re.sub(patternHashtagMention,'', elem))
  # Collapse runs of whitespace into a single space
  df[text_field] = df[text_field].apply(lambda elem: re.sub(r'\s+',' ', elem))

  return df

In [None]:
# clean_text modifies df in place and returns it, so dfclean refers to the same DataFrame as df
dfclean=clean_text(df,"Contenido")
dfclean=clean_text(df,"Títulos")

In [None]:
dfclean

Unnamed: 0,Títulos,Contenido,Enlaces
0,Potential winners and losers line up as Plaid...,So now Plaid says it’s a payments company. It...,https://techcrunch.com/2021/10/21/potential-wi...
1,Facebook agrees terms to pay French publisher...,Facebook has reached a multi-year agreement t...,https://techcrunch.com/2021/10/21/facebook-agr...
2,ESG and shareholder activism: A tsunami is co...,With the increase in attention on ESG issues ...,https://techcrunch.com/2021/10/21/esg-and-shar...
3,Charting a course through the internet’s ever...,Google Chrome's Manifest V3 update is just on...,https://techcrunch.com/2021/10/21/charting-a-c...
4,Network your way to opportunity at TC Session...,TC Sessions: SaaS 2021 kicks off in just five...,https://techcrunch.com/2021/10/21/network-your...
5,Twitter rolls out the ability for anyone to h...,After steadily expanding access over the cour...,https://techcrunch.com/2021/10/21/twitter-roll...
6,Lessons from founders raising their first rou...,"In a bull market, it's especially hard to und...",https://techcrunch.com/2021/10/21/lessons-from...
7,Amazon rolls out in-store pickup for products...,"Amazon is rolling out “Local Selling,” a set ...",https://techcrunch.com/2021/10/21/amazon-rolls...
8,Surface Duo 2 review: Getting better,"If challenging the status quo was easy, we’d ...",https://techcrunch.com/2021/10/21/surface-duo-...
9,The climate policies tucked into Congress’ bu...,The climate measures in the budget reconcilia...,https://techcrunch.com/2021/10/21/the-climate-...


In [None]:
dfclean.Contenido[4]

' TC Sessions: SaaS 2021 kicks off in just five days on October 27. Attendees from around the globe will be in the virtual room ready to connect with founders, investors, engineers and journalists. Are '
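
The cleaned text above still starts and ends with a space, because replacing `\s+` with a single space collapses whitespace but does not remove it at the edges. The sketch below works on a plain string and shows one way to also trim the edges with `str.strip()`, along with the hashtag and mention pattern defined in `clean_text`. The function name `tidy` and the sample sentence are made up for this illustration.

```python
import re

# Same idea as patternHashtagMention in clean_text.
MENTION_OR_HASHTAG = re.compile(r"(@\w+)|(#\w+)")
WHITESPACE_RUN = re.compile(r"\s+")


def tidy(text):
    """Drop mentions and hashtags, collapse whitespace, trim the edges."""
    without_tags = MENTION_OR_HASHTAG.sub("", text)
    return WHITESPACE_RUN.sub(" ", without_tags).strip()


sample = "\n\t\tTC Sessions kicks off soon #SaaS   via @TechCrunch \n"
print(repr(tidy(sample)))
```

The same function could be applied to a DataFrame column with `df[column].apply(tidy)`.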

# Step 5: Store the data

In [None]:
#Save in CSV format
dfclean.to_csv("/content/drive/MyDrive/2021/3. Estrategia digital TC/6. Semana de la innovación/ejemplo_1_clean.csv", sep=";",encoding="utf-8-sig")

In [None]:
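# The index saved by to_csv is read back as an extra "Unnamed: 0" column because index_col is not set
dfclean=pd.read_csv("/content/drive/MyDrive/2021/3. Estrategia digital TC/6. Semana de la innovación/ejemplo_1_clean.csv", sep=";",encoding="utf-8-sig")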

In [None]:
#Save in JSON format
dfclean.to_json("/content/drive/MyDrive/2021/3. Estrategia digital TC/6. Semana de la innovación/ejemplo_1_clean.json") # export to JSON

In [None]:
dfclean=pd.read_json("/content/drive/MyDrive/2021/3. Estrategia digital TC/6. Semana de la innovación/ejemplo_1_clean.json")
dfclean

Unnamed: 0.1,Unnamed: 0,Títulos,Contenido,Enlaces
0,0,Potential winners and losers line up as Plaid...,So now Plaid says it’s a payments company. It...,https://techcrunch.com/2021/10/21/potential-wi...
1,1,Facebook agrees terms to pay French publisher...,Facebook has reached a multi-year agreement t...,https://techcrunch.com/2021/10/21/facebook-agr...
2,2,ESG and shareholder activism: A tsunami is co...,With the increase in attention on ESG issues ...,https://techcrunch.com/2021/10/21/esg-and-shar...
3,3,Charting a course through the internet’s ever...,Google Chrome's Manifest V3 update is just on...,https://techcrunch.com/2021/10/21/charting-a-c...
4,4,Network your way to opportunity at TC Session...,TC Sessions: SaaS 2021 kicks off in just five...,https://techcrunch.com/2021/10/21/network-your...
5,5,Twitter rolls out the ability for anyone to h...,After steadily expanding access over the cour...,https://techcrunch.com/2021/10/21/twitter-roll...
6,6,Lessons from founders raising their first rou...,"In a bull market, it's especially hard to und...",https://techcrunch.com/2021/10/21/lessons-from...
7,7,Amazon rolls out in-store pickup for products...,"Amazon is rolling out “Local Selling,” a set ...",https://techcrunch.com/2021/10/21/amazon-rolls...
8,8,Surface Duo 2 review: Getting better,"If challenging the status quo was easy, we’d ...",https://techcrunch.com/2021/10/21/surface-duo-...
9,9,The climate policies tucked into Congress’ bu...,The climate measures in the budget reconcilia...,https://techcrunch.com/2021/10/21/the-climate-...
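
The extra `Unnamed: 0` column in the table above comes from the CSV round trip: `to_csv` writes the row index as a first column without a name, and `read_csv` then loads it as regular data. A minimal sketch of one way to avoid this is to pass `index=False` when saving. It uses an in-memory buffer (`io.StringIO`) and made-up rows instead of the Google Drive path, so it can run anywhere.

```python
import io

import pandas as pd

# Made-up rows standing in for the scraped table.
df = pd.DataFrame({
    "Títulos": ["Title A", "Title B"],
    "Enlaces": ["https://example.com/a", "https://example.com/b"],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index, sep=";")
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index, sep=";")

# index=False leaves the index out of the file.
without_index = io.StringIO()
df.to_csv(without_index, sep=";", index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index, sep=";")

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```

An alternative that keeps the saved file unchanged is to read it with `index_col=0`, which tells `read_csv` to use the first column as the index.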


# References


[BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

[re documentation](https://docs.python.org/3/library/re.html)

[requests documentation](https://docs.python-requests.org/en/latest/)

[pandas documentation](https://pandas.pydata.org/)

[HTML specification](https://html.spec.whatwg.org/multipage/)

[JSON documentation](https://www.json.org/json-en.html)