## Web Scraper - IMFD Taller Periodistas 2019
Se obtendrán datos de fuentes de oferta inmobiliaria

#### Fuentes:
1. www.portalinmobiliario.com
2. http://www.propiedades.emol.com
3. www.zoominmobiliario.com

El objetivo es obtener ofertas inmobiliarias enfocadas en la clase media Chilena (C1b, C2, C3).

#### Supuestos de la clase media:
- Ingresos mensuales entre 900.000 y 2.000.000 CLP.
- Préstamo inmobiliario a 25 o hasta 40 años.
- persona Chilena promedio gasta en promedio 40% a 60% del sueldo en vivienda (360.000 a 1.200.000)
- Precio de compra entre 2500 y 4000 UFs.
- Arriendo entre 360.000 y 1.200.000 CLP (UF 13.0 a 44.0).

Tutoriales:
1. Selenium Web Scraping: https://medium.com/the-andela-way/introduction-to-web-scraping-using-selenium-7ec377a8cf72
2. GeckoDriver: https://askubuntu.com/questions/870530/how-to-install-geckodriver-in-ubuntu
3. Selenium: https://selenium-python.readthedocs.io/installation.html
4. Google Places: https://developers.google.com/places/web-service/search
5. Google Maps: https://developers.google.com/maps/documentation/geocoding/start

In [1]:
import sys

from selenium import webdriver 
from selenium.webdriver.common.by import By 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 
from selenium.common.exceptions import TimeoutException

def hasNumbers(inputString):
    return any(char.isdigit() for char in inputString)

----
### Portal Inmobiliario

In [2]:
# Variables

# Purchase min and max values
price_down = '2.500'
price_high = '4.000'

# Purchase min and max values
rent_min = '13'
rent_max = '44'

# First is dptos (pages[0]), second houses(pages[1])
# The inside bracket: first purchase (pages[0][0], pages[1][0]), second renting (pages[0][1], pages[1][1])
pages = [[199, 358],[113, 61]]

tipos = ['departamento', 'casa']

urls = []
for i in range(len(tipos)):
    t = tipos[i]
    # Buy
    for p in range(1, pages[i][0]+1):
        urls.append("https://www.portalinmobiliario.com/venta/"+t+"/metropolitana?pd="+price_down+"&ph="+price_high+"&pg="+str(p))
    # Rent
    for p in range(1, pages[i][1]+1):
        urls.append("https://www.portalinmobiliario.com/arriendo/"+t+"/metropolitana?pd="+rent_min+"&ph="+rent_max+"&pg="+str(p))
    
# for u in urls:
#     print(u)

In [3]:
# Get data
browser = webdriver.Firefox()
unit_browser = webdriver.Firefox()
p_inmobiliario = []

for u in urls:
    browser.get(u)
    
    # Find all offers' data
    titles_element = browser.find_elements_by_class_name('product-item-data')
    
    # Code, Address, Price(s), Size
    for prop in titles_element:
        
        data = prop.text.split('\n')
        try:
            rooms = 'n/a'
            
            # Get Lat/Lon
            link = prop.find_element_by_tag_name('a').get_attribute("href")
            unit_browser.get(link)
            lat_elem = unit_browser.find_element_by_xpath("//meta[@property='og:latitude']")
            # print(lat_elem.get_attribute('outerHTML'))
            lon_elem = unit_browser.find_element_by_xpath("//meta[@property='og:longitude']")
            # print(lon_elem.get_attribute('outerHTML'))
            lat = float(lat_elem.get_attribute('content'))
            lon = float(lon_elem.get_attribute('content'))
            
            # Get Comuna
            bc_dir = unit_browser.find_element_by_class_name('breadcrumb')
            comuna = bc_dir.text.split()[4]
            
            # Clean Data depending on "Proyecto" or "Propiedad Usada"
            if 'Proyecto' in data[0]:
                data = data[2:]
            else:
                rooms = data[3]
                data = data[1:3] + data[4:]

            # Get Address
            addr = data[0]

            # Get Code
            code = int(data[1].split()[1])
            
            # Get price
            price_min = float(data[3].split(',')[0].replace("UF ", "").replace(".", ''))
        
            
            # Check if place has more info:
            price_max = 0.0
            values = [0.0, 0.0]
            if(len(data)>= 5):
                # Check if there's "hasta" price
                if(data[4]=="Hasta:"):
                    price_max = float(data[5].split(',')[0].replace("UF ", "").replace(".", ''))
                    values = data[7].replace(",", ".").split()
                else:
                    values = data[5].replace(",", ".").split()

                size_min = float(values[0])

                if(len(values)>2):
                    size_max = float(values[2])

            purchase = 'venta' if 'venta' in u else 'arriendo'
            elem_type = 'casa' if 'casa' in u else 'departamento'

            p_inmobiliario.append([code, addr, comuna, lat, lon, rooms, price_min, price_max, size_min, size_max, purchase, elem_type])
            
        except:
            e = sys.exc_info()
            print(u)
            print(e)
            print(data)
            print(link)
            print(str(lat)+str(lon))

unit_browser.close()
browser.close()

https://www.portalinmobiliario.com/venta/departamento/metropolitana?pd=2.500&ph=4.000&pg=106
(<class 'KeyboardInterrupt'>, KeyboardInterrupt(), <traceback object at 0x7f7b37c3a648>)
['Propiedad usada, Venta, Departamento', 'Grajales / Almirante La Torre, Santiago', 'Código: 5028866', '2D/2B', 'Valor:', 'UF 3.180,79', 'Superficie:', '66 - 72 m²']
https://www.portalinmobiliario.com/venta/departamento/santiago-metropolitana/5028866-grajales-almirante-la-torre-uda?tp=2&op=1&iug=441&ca=3&pd=2500&ph=4000&ts=1&mn=2&or=&sf=1&sp=0&at=0&i=2645
-33.50912-70.66219


MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=38849): Max retries exceeded with url: /session/d3da1374-d82a-4050-8a36-b1ba2cc4587c/element/1f035bfd-09d8-4d87-b440-6ec15d76314d/text (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7b37c3b4a8>: Failed to establish a new connection: [Errno 111] Connection refused'))

Formato de los datos en `p_inmobiliario`:

| Código | Dirección | Comuna | Latitud | Longitud | Piezas | Precio Min (UF) | Precio Max (UF) | Tamaño Min (m<sup>2</sup>) | Tamaño Max (m<sup>2</sup>) | Pago | Tipo de Vivienda|
|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| 12345 | Vicuña Mackenna 3030, Macul | Macul| -33 | -70 | n/a | 2400.0 | 3000.0 | 37.0 | 38.0 | venta | departamento |
| 12346 | Marberia 476, Las Condes | Las Condes | -33 | -70 | 3D2B | 28.0 | 0.0 | 25.0 | 0.0 | arriendo | casa |

Lo siguiente es guardar los datos en un csv:

In [37]:
import csv
import datetime

# Add current date to CSV's filename
now = '{0:d%Y-%m-%d_t%H-%M-%S}'.format(datetime.datetime.now())
filename = "p_inmobiliario_"+now+"_parte1.csv"

with open(filename, 'w', newline="") as myfile:
    wr = csv.writer(myfile)
    wr.writerow(['Codigo', 'Direccion', 'Comuna' 'Latitud', 'Longitud', 'Piezas', 'Precio Min (UF)', 'Precio Max (UF)', 'Tamaño Min', 'Tamaño Max', 'Pago', 'Tipo de Vivienda'])
    wr.writerows(p_inmobiliario)

#### Google Places
Es necesario encontrar latitud y longitud de los campos de dirección obtenidos de la página

#### Google Maps
Después de obtener los datos, debemos mostrarlos en el mapa. Cargamos la llave de la API de Google Maps:

In [None]:
with open('API_key.txt') as f:
    API_key = f.readline()
    f.close

import gmaps
gmaps.configure(api_key=API_key)

### TRY STUFF BLOCK

In [148]:
browser.get('https://www.portalinmobiliario.com/venta/departamento/metropolitana?pd=2.000&ph=3.000&pg=1')
# find_elements_by_xpath returns an array of selenium objects.
titles_element = browser.find_elements_by_class_name('product-item-data')

# print(titles_element[0].text.split('\n'))#)[7]))#.split(',')[0].replace('UF ', '')))

p_inmobiliario = []

# Code, Address, Price(s), Size
for prop in titles_element:
    data = prop.text.split('\n')
    rooms = 'n/a'
    # Clean Data depending on "Proyecto" or "Propiedad Usada"
    if 'Proyecto' in data[0]:
        data = data[2:]
    else:
        rooms = data[3]
        data = data[1:3] + data[4:]
        
    print(data)
    # Get Address
    addr = data[0]
    # Get Code
    code = int(data[1].split()[1])
    # Get price
    price_min = float(data[3].split(',')[0].replace("UF ", "").replace(".", ''))
    values
    # Check if there's "hasta" price
    if(data[4]=="Hasta:"):
        price_max = float(data[5].split(',')[0].replace("UF ", "").replace(".", ''))
        values = data[7].replace(",", ".").split() 
    else:
        price_max = 0.0
        values = data[5].replace(",", ".").split()
    size_min = float(values[0])
    if(len(values)>2):
        size_max = float(values[2])
    else:
        size_max = 0.0
    
    p_inmobiliario.append([code, addr, rooms, price_min, price_max, size_min, size_max, "sell", "dpto"])
    print(p_inmobiliario[-1])



['Departamental 1475, La Florida', 'Código: 7364', 'Desde:', 'UF 2.100,00', 'Hasta:', 'UF 3.690,00', 'Superficie:', '37,21 - 67,65 m²']
[7364, 'Departamental 1475, La Florida', 'n/a', 2100.0, 3690.0, 37.21, 67.65, 'sell', 'dpto']
['Lazo 1456, San Miguel', 'Código: 7401', 'Desde:', 'UF 2.797,00', 'Hasta:', 'UF 3.570,00', 'Superficie:', '55,90 - 71,70 m²']
[7401, 'Lazo 1456, San Miguel', 'n/a', 2797.0, 3570.0, 55.9, 71.7, 'sell', 'dpto']
['Vicuña Mackenna 6130, La Florida', 'Código: 7612', 'Desde:', 'UF 2.460,00', 'Hasta:', 'UF 3.500,00', 'Superficie:', '39,06 - 74,24 m²']
[7612, 'Vicuña Mackenna 6130, La Florida', 'n/a', 2460.0, 3500.0, 39.06, 74.24, 'sell', 'dpto']
['Camino del Paisaje 6546, La Florida', 'Código: 4708', 'Desde:', 'UF 2.690,00', 'Hasta:', 'UF 6.920,00', 'Superficie:', '53,08 - 129,69 m²']
[4708, 'Camino del Paisaje 6546, La Florida', 'n/a', 2690.0, 6920.0, 53.08, 129.69, 'sell', 'dpto']
['José Miguel Carrera 680, Santiago', 'Código: 6952', 'Desde:', 'UF 2.825,00', 'Supe

In [138]:
print(data[1:])

['Avenida Vicuña Mackenna 2935 - Departamento 704, San Joaquín', 'Código: 4763668', '1D/1B', 'Valor:', 'UF 2.380,00', 'Superficie:', '33 - 36 m²']


In [6]:
arr = ['a','c,','s','r','t','f','d','q']

print(arr[1:3] + arr[4:])


['c,', 's', 't', 'f', 'd', 'q']


In [7]:
a = ['Carmen Victoria - 1D1B - Plaza San Isidr, Santiago', 'Código: 5005791', 'Valor:', 'UF 2.000,00']
len(a)

4