# Getting data from the Lamudi website

This notebook walks through the steps for collecting data from Lamudi, an online real estate marketplace.
The data covers condominiums for rent in the Philippines and is collected with the requests, BeautifulSoup, numpy, selenium, pandas, json, and time packages.

### Import necessary libraries to be used in web scraping

The requests package is used to download the HTML of Lamudi's pages.

BeautifulSoup is a Python package that parses the HTML so that data can be extracted from it.

NumPy is used for working with arrays.

The selenium package automates web browser interaction from Python. A webdriver, specifically Chrome, is used to open each condominium listing, since some elements are stored in flex boxes and can only be accessed through selenium and a webdriver.

The time module provides time-related functions. Here it is used to pause the code before it executes the next step, which helps minimize errors such as connection errors.

In [1]:
import requests
from bs4 import BeautifulSoup
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import re  # used below to clean up whitespace in addresses
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from time import sleep
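
The idea of pausing between steps to reduce connection errors can be sketched as a small retry helper. This is only an illustration, not part of the original notebook: `fetch_with_retry`, its parameters, and `flaky_fetch` are hypothetical names, and the simulated failure stands in for a real network call.

```python
import time


def fetch_with_retry(fetch, attempts=3, delay_seconds=5.0):
    """Call fetch() until it succeeds, sleeping between failed attempts.

    `fetch` is any zero-argument callable (hypothetical helper for illustration).
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError as error:  # covers built-in connection errors
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)  # wait before trying again
    raise last_error


# Simulated download that fails twice before succeeding (no network needed).
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "<html>ok</html>"


result = fetch_with_retry(flaky_fetch, attempts=3, delay_seconds=0)
print(result, calls["count"])
```

In the notebook, the callable could be something like `lambda: requests.get(test_url)`, assuming the request raises an exception on a connection failure.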

### Check for variables

If an element to be collected does not exist, the code stores "N/A" in the variable instead.
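
The "N/A" fallback can be sketched as one small helper, assuming the element is either `None` or an object with a `.text` attribute, as BeautifulSoup's `find` returns. `text_or_na` and `FakeTag` are hypothetical names used only for this illustration; `FakeTag` stands in for a real tag so the sketch runs without downloading a page.

```python
def text_or_na(element, default="N/A"):
    """Return the stripped text of an element, or a default when it is missing."""
    if element is None:
        return default
    return element.text.strip()


class FakeTag:
    """Stand-in for a BeautifulSoup tag (hypothetical, for illustration only)."""

    def __init__(self, text):
        self.text = text


found = text_or_na(FakeTag("  2 \n"))
missing = text_or_na(None)
print(found)
print(missing)
```

With a helper like this, each field could be collected in a single line, for example `text_or_na(condo_content[0].find('span', {'class': 'PriceSection-FirstPrice'}))`.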

In [4]:
test_url = "https://www.lamudi.com.ph/condominium/rent/?page=1"
pages = requests.get(test_url)
beautiful = BeautifulSoup(pages.content, 'html.parser')
condo_content = beautiful.find_all('div', class_ = 'ListingCell-AllInfo ListingUnit')

#### Get the name of the first listing

In [None]:
name = condo_content[0].find('h2', {'class': 'ListingCell-KeyInfo-title'}).text.strip()
print(name)

#### Get the address of the first listing

In [None]:
address = condo_content[0].find('a', {'class' : 'js-listing-link ellipsis'}).text.strip()
updatedaddress = re.sub(r"^\s+|\s+$|\s+(?=\s)", "", address)  # trim and collapse repeated whitespace
print(updatedaddress)
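
To make the effect of the regular expression concrete, the sketch below applies it to a made-up address string (the sample text is an assumption, not real Lamudi output). It also compares the result with a `split`/`join` approach, which is an alternative technique and not what the notebook uses.

```python
import re

# Same pattern as in the notebook: leading whitespace, trailing whitespace,
# or any whitespace run that is followed by one more whitespace character.
WHITESPACE_PATTERN = r"^\s+|\s+$|\s+(?=\s)"


def clean_with_regex(text):
    """Trim the ends and shrink each whitespace run to its last character."""
    return re.sub(WHITESPACE_PATTERN, "", text)


def clean_with_split(text):
    """Alternative: split on any whitespace and rejoin with single spaces."""
    return " ".join(text.split())


sample = "  Makati,\n   Metro Manila  "  # hypothetical address text
regex_result = clean_with_regex(sample)
split_result = clean_with_split(sample)
print(regex_result)
print(split_result)

# The regex keeps the LAST character of each run, which may be a newline.
newline_case = clean_with_regex("a \nb")
print(repr(newline_case))
```

The two approaches agree on the sample, but differ when a whitespace run ends in a newline: the regular expression keeps that newline, while `split`/`join` always yields a single space.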

#### Get the price of the first listing

In [None]:
price = condo_content[0].find('span', {'class' : 'PriceSection-FirstPrice'})
if price is None:
    price = "N/A"
else:
    price = price.text.strip()
print(price)

#### Get the number of bedrooms offered in the first listing

In [None]:
bedroom = condo_content[0].find('span', {'class' : 'KeyInformation-value_v2 KeyInformation-amenities-icon_v2 icon-bedrooms'})
if bedroom is None:
    bedroom = "N/A"
else:
    bedroom = bedroom.text.strip()
print(bedroom)

#### Get the number of bathrooms offered in the first listing

In [None]:
bath = condo_content[0].find('span', {'class' : 'KeyInformation-value_v2 KeyInformation-amenities-icon_v2 icon-bathrooms'})
if bath is None:
    bath = "N/A"
else:
    bath = bath.text.strip()
print(bath)

#### Get the measurement of floorarea in the first listing

In [None]:
floorarea = condo_content[0].find('span', {'class' : 'KeyInformation-value_v2 KeyInformation-amenities-icon_v2 icon-livingsize'})
if floorarea is None:
    floorarea = "N/A"
else:
    floorarea = floorarea.text.strip()
print(floorarea)

#### Get the amenities offered in the first listing

In [None]:
findLinkWrapper = condo_content[0].find("div", class_='ListingCell-keyInfo-wrapper')
findLink = findLinkWrapper.find('a')['href']
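# Note: executable_path is the Selenium 3 style; newer Selenium versions expect a Service object instead.
driver = webdriver.Chrome(executable_path='C:/Users/Clarisa Hilario/Downloads/chromedriver_win32/chromedriver.exe')
driver.implicitly_wait(30)
driver.get(findLink)
html = driver.page_source
driver.quit()  # close the browser once the page source has been captured
soup = BeautifulSoup(html, 'html.parser')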

findListing = soup.find("section", id="listing-amenities")
if findListing is None:
    amenities = "N/A"
else:
    findAmenities = findListing.find_all("div", class_="ellipsis")
    # Join the amenity names with ", " so there is no stray trailing separator
    amenities = ", ".join(n.text.strip() for n in findAmenities)
print(amenities)

#### Get the rating of the first listing

In [None]:
findLinkWrapper = condo_content[0].find("div", class_='ListingCell-keyInfo-wrapper')
findLink = findLinkWrapper.find('a')['href']
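# Note: executable_path is the Selenium 3 style; newer Selenium versions expect a Service object instead.
driver = webdriver.Chrome(executable_path='C:/Users/Clarisa Hilario/Downloads/chromedriver_win32/chromedriver.exe')
driver.implicitly_wait(30)
driver.get(findLink)
html = driver.page_source
driver.quit()  # close the browser once the page source has been captured
soup = BeautifulSoup(html, 'html.parser')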

findAreaRating=soup.find("div", class_='AreaRating-Overall')
rating = ""
if findAreaRating is None:
    rating = "N/A"
else:
    rating = findAreaRating.text.strip()
print(rating)

### Pagination

Since the listings are spread across 100 pages with 30 condominium listings per page, the "url" variable excludes the page number. A for loop accesses each page by appending the current page number, converted to a string, to the URL. There is also a 60-second delay before continuing to the next page to avoid connection errors.
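
The URL construction described above can be sketched on its own, without sending any requests. The base URL is the one used in the scraping loop; `page_urls` is a hypothetical helper name, and the expected total is only the arithmetic implied by 100 pages of 30 listings.

```python
def page_urls(base_url, first_page=1, last_page=100):
    """Build one URL per results page by appending the page number as a string."""
    return [base_url + str(page) for page in range(first_page, last_page + 1)]


base_url = "https://www.lamudi.com.ph/metro-manila/condominium/rent/?page="
urls = page_urls(base_url)

# 100 pages with 30 listings each gives the expected number of records.
expected_listings = len(urls) * 30

print(len(urls))
print(urls[0])
print(urls[-1])
print(expected_listings)
```

Comparing `expected_listings` with `len(condos)` after the loop finishes gives a quick check on whether any pages or listings were missed.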

In [None]:
url = "https://www.lamudi.com.ph/condominium/rent/?page="

In [None]:
condos = []
for x in range(1, 101):
    URL = 'https://www.lamudi.com.ph/metro-manila/condominium/rent/?page='
    pages = requests.get(URL+ str(x))
    beautiful = BeautifulSoup(pages.content, 'html.parser')
    sleep(60)
    condo_content = beautiful.find_all('div', class_ = 'ListingCell-AllInfo ListingUnit')
    for condo in condo_content:
        name = condo.find('h2', {'class': 'ListingCell-KeyInfo-title'}).text.strip()
        address = condo.find('a', {'class' : 'js-listing-link ellipsis'}).text.strip()
        updatedaddress = re.sub(r"^\s+|\s+$|\s+(?=\s)", "", address)  # trim and collapse repeated whitespace
        
        price = condo.find('span', {'class' : 'PriceSection-FirstPrice'})
        if price is None:
            price = "N/A"
        else:
            price = price.text.strip()
        
        bedroom = condo.find('span', {'class' : 'KeyInformation-value_v2 KeyInformation-amenities-icon_v2 icon-bedrooms'})
        if bedroom is None:
            bedroom = "N/A"
        else:
            bedroom = bedroom.text.strip()
        
        bath = condo.find('span', {'class' : 'KeyInformation-value_v2 KeyInformation-amenities-icon_v2 icon-bathrooms'})
        if bath is None:
            bath = "N/A"
        else:
            bath = bath.text.strip()
        
        floorarea = condo.find('span', {'class' : 'KeyInformation-value_v2 KeyInformation-amenities-icon_v2 icon-livingsize'})
        if floorarea is None:
            floorarea = "N/A"
        else:
            floorarea = floorarea.text.strip()
        
        findLinkWrapper = condo.find("div", class_='ListingCell-keyInfo-wrapper')
        findLink = findLinkWrapper.find('a')['href']
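        # Note: executable_path is the Selenium 3 style; newer Selenium versions expect a Service object instead.
        driver = webdriver.Chrome(executable_path='C:/Users/Clarisa Hilario/Downloads/chromedriver_win32/chromedriver.exe')
        driver.implicitly_wait(30)
        driver.get(findLink)
        html = driver.page_source
        driver.quit()  # close the browser so windows do not pile up across listings
        soup = BeautifulSoup(html, 'html.parser')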
        #amenities
        findListing = soup.find("section", id="listing-amenities")
        if findListing is None:
            amenities = "N/A"
        else:
            findAmenities = findListing.find_all("div", class_="ellipsis")
            # Join the amenity names with ", " so there is no stray trailing separator
            amenities = ", ".join(n.text.strip() for n in findAmenities)
        #ratings
        findAreaRating=soup.find("div", class_='AreaRating-Overall')
        rating = ""
        if findAreaRating is None:
            rating = "N/A"
        else:
            rating = findAreaRating.text.strip()
        
        condos.append({
            'name' : name,
            'address' : updatedaddress,
            'price' : price,
            'bedroom' : bedroom,
            'bath' : bath,
            'floorarea' : floorarea,
            'amenities' : amenities,
            'rating' : rating
        })
condos

### Check the number of listings collected

In [None]:
print(len(condos))

### Import JSON package and make a JSON file of collected data

JSON (JavaScript Object Notation) is an open standard file format that uses human-readable text to store and transmit data objects consisting of attribute-value pairs and arrays.

In [None]:
import json

In [None]:
with open('/Users/Clarisa Hilario/Desktop/condominiums.json', 'w') as outfile:
    json.dump(condos, outfile)
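
As a small illustration of what ends up in the file, the sketch below writes one record to a temporary JSON file and reads it back. The record values are invented for the example and are not real Lamudi data; the temporary path is used only so the sketch does not touch the Desktop file.

```python
import json
import os
import tempfile

# One made-up record with the same eight keys the scraper collects.
sample_condos = [
    {
        "name": "Sample 1BR condo for rent",
        "address": "Makati, Metro Manila",
        "price": "N/A",
        "bedroom": "1",
        "bath": "1",
        "floorarea": "N/A",
        "amenities": "N/A",
        "rating": "N/A",
    }
]

# Write to a temporary directory instead of the notebook's Desktop path.
path = os.path.join(tempfile.mkdtemp(), "condominiums_sample.json")
with open(path, "w", encoding="utf-8") as outfile:
    json.dump(sample_condos, outfile, indent=2)

# Reading the file back returns the same list of dictionaries.
with open(path, encoding="utf-8") as infile:
    loaded = json.load(infile)

print(loaded == sample_condos)
```

The `indent=2` argument is optional; it only makes the file easier to read by eye.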

### Import pandas and convert JSON file to CSV file

The pandas package is used to work with tabular data. Here it reads the JSON file so that it can be converted to a CSV (Comma-Separated Values) file.

In [None]:
import pandas as pd
df = pd.read_json(r'/Users/Clarisa Hilario/Desktop/condominiums.json')
# to_csv writes the file and returns None when given a path, so the result is not assigned
df.to_csv(r'/Users/Clarisa Hilario/Desktop/condominiums.csv', index=False, header=True)
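
To show what the conversion produces, here is a rough equivalent using only the standard library `csv` module instead of pandas. This is a swapped-in technique for illustration, and the records are made-up values with a reduced set of columns.

```python
import csv
import io

# Made-up records; the real data has eight columns per listing.
records = [
    {"name": "Sample condo A", "address": "Makati, Metro Manila", "price": "N/A"},
    {"name": "Sample condo B", "address": "Quezon City", "price": "N/A"},
]

# Write a header row followed by one row per record, into memory.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the same rows; values containing commas
# are quoted in the CSV so they stay in a single column.
rows = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]
print(rows == records)
```

The quoting matters for this dataset in particular, since addresses and amenity lists usually contain commas.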

### Variables collected

##### name
This is the title of the condominium-for-rent listing.
##### address
This indicates the address where the condominium is located.
##### price
This indicates the rent price of the condominium.
##### bedroom
This indicates the number of bedrooms available in the listing.
##### bath
This indicates the number of bathrooms available in the listing.
##### floorarea
This indicates the size of the condominium for rent.
##### amenities
This contains all amenities offered by the condominium.
##### rating
This indicates the overall rating the condominium has received.
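
A quick way to confirm that every collected record carries these eight variables is sketched below. `EXPECTED_FIELDS` and `missing_fields` are hypothetical names introduced for this check, and the two records are invented examples.

```python
EXPECTED_FIELDS = {
    "name", "address", "price", "bedroom",
    "bath", "floorarea", "amenities", "rating",
}


def missing_fields(record):
    """Return the expected keys that are absent from a record, sorted by name."""
    return sorted(EXPECTED_FIELDS - set(record))


# Invented examples: one complete record and one that lacks a rating.
complete = {field: "N/A" for field in EXPECTED_FIELDS}
incomplete = {field: "N/A" for field in EXPECTED_FIELDS if field != "rating"}

print(missing_fields(complete))
print(missing_fields(incomplete))
```

Running `[missing_fields(c) for c in condos]` over the scraped list would flag any record that was appended without one of the variables.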