# Web Scraping with Python

## Libraries

We will need Selenium, BeautifulSoup, and Pandas. The configuration step also imports webdriver_manager, so that package must be installed as well.

Install Selenium with the command `conda install -c conda-forge selenium`

Install BeautifulSoup with the command `conda install -c anaconda beautifulsoup4`

Now import webdriver from Selenium

In [58]:
from selenium import webdriver

Import BeautifulSoup from the bs4 package

In [59]:
from bs4 import BeautifulSoup as bs

And Pandas as pd

In [60]:
import pandas as pd

## Configuración

Now configure webdriver to use Chrome. Instead of locating chromedriver by hand, the cells below let ChromeDriverManager download a matching driver and pass its path to Selenium through a Service object.

In [61]:
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from urllib.parse import urljoin

In [62]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

Create three lists for the raw scraped values (title, price, and stock availability), then three more for the cleaned values

In [63]:
title_sucio = []
precio_sucio = []
stock_sucio = []

In [64]:
titles = []
prices = []
stock = []

Then call driver.get to open the page we are going to work with

In [65]:
#driver.get("https://www.inmuebles24.com/terrenos-en-venta-en-fraccionamiento-bugambilias.html")
driver.get("http://books.toscrape.com/")
url = "http://books.toscrape.com/"

## Extracción de Datos

Store the page's HTML in a variable named contenido

In [66]:
contenido = driver.page_source

In [67]:
contenido



Convert it to a BeautifulSoup object

In [68]:
soup = bs(contenido, "html.parser")

In [69]:
soup

<html class="no-js" lang="en-us"><!--<![endif]--><head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="static/oscar/js/bootstrap-datetimepicker/bootstrap-datetimepicker.css" rel="stylesheet"/>
<link href="static/oscar/css/datetimepicker.css" rel="stylesheet" type="text/css"/>
</head>
<body class="default" id="default">
<header class="header container-fluid">
<div class="page_inner">
<div class="

Write a for loop that visits each book and pulls out the fields we need with libro.find, using .append to add each value to the lists

In [70]:
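for libro in soup.find_all("article", attrs={"class": "product_pod"}):
    # The visible <h3> text may be truncated with "..." for long titles
    title = libro.find("h3")
    title_sucio.append(title.text)

    # Price as displayed, including the currency symbol
    precio = libro.find("p", attrs={"class": "price_color"})
    precio_sucio.append(precio.text)

    # Availability text, still padded with newlines and spaces
    sstock = libro.find("p", attrs={"class": "instock availability"})
    stock_sucio.append(sstock.text)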
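
The h3 text holds only the shortened title (for example "A Light in the ..."). A minimal sketch of one way to get the full title, assuming the link inside h3 carries it in a `title` attribute: the HTML below is a hand-written stand-in for one product card, not a copy of the live page, and `soup_demo`, `rows`, and `link` are names made up for this illustration.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written stand-in for one product card (an assumption,
# not copied from the live page).
html = """
<article class="product_pod">
  <h3><a href="book-1.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
  <p class="instock availability">
      In stock
  </p>
</article>
"""

soup_demo = BeautifulSoup(html, "html.parser")
rows = []
for libro in soup_demo.find_all("article", class_="product_pod"):
    link = libro.find("h3").find("a")
    rows.append({
        # Prefer the full title from the attribute; fall back to the visible text
        "Title": link.get("title", link.get_text(strip=True)),
        # get_text(strip=True) removes the surrounding newlines and spaces
        "Price": libro.find("p", class_="price_color").get_text(strip=True),
        "Availability": libro.find("p", class_="availability").get_text(strip=True),
    })

print(rows)
```

Collecting one dictionary per book keeps the three fields aligned even if one of them needs special handling later.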
    

In [71]:
stock_sucio

['\n\n    \n        In stock\n    \n',
 '\n\n    \n        In stock\n    \n',
 '\n\n    \n        In stock\n    \n',
 '\n\n    \n        In stock\n    \n',
 '\n\n    \n        In stock\n    \n',
 '\n\n    \n        In stock\n    \n',
 '\n\n    \n        In stock\n    \n',
 '\n\n    \n        In stock\n    \n',
 '\n\n    \n        In stock\n    \n',
 '\n\n    \n        In stock\n    \n',
 '\n\n    \n        In stock\n    \n',
 '\n\n    \n        In stock\n    \n',
 '\n\n    \n        In stock\n    \n',
 '\n\n    \n        In stock\n    \n',
 '\n\n    \n        In stock\n    \n',
 '\n\n    \n        In stock\n    \n',
 '\n\n    \n        In stock\n    \n',
 '\n\n    \n        In stock\n    \n',
 '\n\n    \n        In stock\n    \n',
 '\n\n    \n        In stock\n    \n']

In [72]:
for libro in title_sucio:
    titles.append(libro.replace("\n"," ").strip())

In [73]:
len(titles)

20

Write a for loop to clean the prices

In [74]:
for libro in precio_sucio:
    prices.append(libro.replace("\n"," ").strip())
len(prices)
prices

['£51.77',
 '£53.74',
 '£50.10',
 '£47.82',
 '£54.23',
 '£22.65',
 '£33.34',
 '£17.93',
 '£22.60',
 '£52.15',
 '£13.99',
 '£20.66',
 '£17.46',
 '£52.29',
 '£35.02',
 '£57.25',
 '£23.88',
 '£37.59',
 '£51.33',
 '£45.17']
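
The cleaned prices are still strings that start with a currency symbol, so they cannot be summed or averaged directly. A small sketch of one possible conversion follows; `parse_price`, `sample`, and `amounts` are hypothetical names, and it assumes every price starts with "£" as in the output above.

```python
from decimal import Decimal


def parse_price(text):
    """Turn a string like '£51.77' into a Decimal, dropping the currency symbol."""
    return Decimal(text.strip().lstrip("£"))


sample = ["£51.77", "£53.74", "£50.10"]  # first three values from the output above
amounts = [parse_price(p) for p in sample]
total = sum(amounts)
print(amounts, total)
```

Decimal is used here rather than float so that monetary values keep their exact two decimal places.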

In [75]:
for libro in stock_sucio:
    stock.append(libro.replace("\n"," ").strip())
len(stock)

20
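
Because the cleaning loops append to lists defined in earlier cells, running a cell twice adds every item a second time. As a hedged alternative, a list comprehension builds a fresh list on each run. The helper name `clean_text` and the variables `raw_stock` and `clean_stock` are made up for this sketch.

```python
def clean_text(raw):
    """Collapse any run of whitespace (newlines, tabs, spaces) into one space."""
    return " ".join(raw.split())


# Same shape as the values in stock_sucio shown earlier
raw_stock = ["\n\n    \n        In stock\n    \n"] * 3

# Rebuilt from scratch each time, so re-running the cell never duplicates items
clean_stock = [clean_text(s) for s in raw_stock]
print(clean_stock)
```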

Find the link to the next page so that scraping can continue beyond the first page

In [76]:
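# Look for the "next" link that leads to the following page
sig_page = soup.select_one("li.next > a")
if sig_page:
    sig_url = sig_page.get("href")
    # Resolve the relative href against the current URL
    url = urljoin(url, sig_url)
else:
    # No "next" link: this was the last page
    url = None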
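
The cell above computes the next URL only once. A rough sketch of how the same logic could drive a loop over every page is shown here. To keep it runnable without a browser, `PAGES` and `fetch` are hypothetical stand-ins for the site; with Selenium, `fetch` would call `driver.get(url)` and return `driver.page_source`. The function name `scrape_all` and the `max_pages` guard are also assumptions.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Tiny in-memory stand-in for the site (hypothetical HTML, no network access).
PAGES = {
    "http://books.toscrape.com/":
        '<h3>Book A</h3><ul><li class="next"><a href="catalogue/page-2.html">next</a></li></ul>',
    "http://books.toscrape.com/catalogue/page-2.html":
        '<h3>Book B</h3><ul><li class="next"><a href="page-3.html">next</a></li></ul>',
    "http://books.toscrape.com/catalogue/page-3.html":
        "<h3>Book C</h3>",
}


def fetch(url):
    """Hypothetical fetcher that returns the HTML for a URL."""
    return PAGES[url]


def scrape_all(start_url, max_pages=100):
    """Follow 'next' links from start_url, collecting h3 texts along the way."""
    found, visited = [], []
    url = start_url
    while url and len(visited) < max_pages:
        visited.append(url)
        soup = BeautifulSoup(fetch(url), "html.parser")
        found.extend(h3.get_text(strip=True) for h3 in soup.find_all("h3"))
        sig_page = soup.select_one("li.next > a")
        # Relative hrefs are resolved against the page they came from
        url = urljoin(url, sig_page.get("href")) if sig_page else None
    return found, visited


found, visited = scrape_all("http://books.toscrape.com/")
print(found)
print(visited)
```

Resolving each href against the current page URL matters because the relative link can change form from one page to the next, as the second entry in `PAGES` illustrates.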


## Almacenamiento

Store your data in a DataFrame

In [77]:
df = pd.DataFrame({"Title":titles,"Price":prices,"Availability":stock})

In [78]:
df.head(30)

| | Title | Price | Availability |
|---|---|---|---|
| 0 | A Light in the ... | £51.77 | In stock |
| 1 | Tipping the Velvet | £53.74 | In stock |
| 2 | Soumission | £50.10 | In stock |
| 3 | Sharp Objects | £47.82 | In stock |
| 4 | Sapiens: A Brief History ... | £54.23 | In stock |
| 5 | The Requiem Red | £22.65 | In stock |
| 6 | The Dirty Little Secrets ... | £33.34 | In stock |
| 7 | The Coming Woman: A ... | £17.93 | In stock |
| 8 | The Boys in the ... | £22.60 | In stock |
| 9 | The Black Maria | £52.15 | In stock |


Export your DataFrame to a CSV file and open it in Excel

In [79]:
df.to_csv("Libros.csv",index=False,encoding='utf-8')

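PermissionError: [Errno 13] Permission denied: 'Libros.csv'

This error usually means that Libros.csv is already open in another program, such as Excel, which locks the file. Close it and run the cell again.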
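
One way to keep the export from failing when the target file is locked is to fall back to a different filename. The sketch below is only an illustration: `save_csv`, the `_copy` suffix, and the demo DataFrame are assumptions, and the demo writes to a temporary directory so that it does not touch any real file. It also uses `utf-8-sig` instead of `utf-8`, on the assumption that the file will be opened in Excel, which relies on the byte order mark to display "£" correctly.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(df, path):
    """Write df to path; if the file is locked, write next to it under a new name."""
    path = Path(path)
    try:
        df.to_csv(path, index=False, encoding="utf-8-sig")
        return path
    except PermissionError:
        # Hypothetical fallback name, e.g. Libros_copy.csv
        fallback = path.with_name(f"{path.stem}_copy{path.suffix}")
        df.to_csv(fallback, index=False, encoding="utf-8-sig")
        return fallback


demo = pd.DataFrame(
    {"Title": ["Sharp Objects"], "Price": ["£47.82"], "Availability": ["In stock"]}
)
with tempfile.TemporaryDirectory() as tmp:
    written = save_csv(demo, Path(tmp) / "Libros.csv")
    read_back = pd.read_csv(written, encoding="utf-8-sig")
print(read_back)
```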
