# Using Selenium to scrape data from a website

> To use Selenium, we need to create a driver instance that is going to "drive" us through the webpage.
Here is what it could look like:

In [10]:
from selenium import webdriver
from time import sleep

driver = webdriver.Chrome()
driver.get("https://zoopla.co.uk")

We have now navigated to the [Zoopla.co.uk](https://zoopla.co.uk) website. Selenium lets us search for elements via `XPath` and also send mouse and keyboard actions. Let's recall the challenge we want to solve, which is extracting the following data for 50 houses:
* **Sale Price**: Our response variable 
* Number of Bedrooms 
* Description 
* Address
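Before scraping anything, it can help to decide what one scraped house will look like in code. The snippet below is a minimal sketch of such a container. The class name `PropertyRecord` and its field names are our own choices for illustration, not something Selenium or Zoopla provides.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class PropertyRecord:
    """One scraped listing; the class and field names are illustrative."""
    url: str
    price: Optional[str] = None        # raw text, e.g. "£1,405,000"
    bedrooms: Optional[str] = None
    description: Optional[str] = None
    address: Optional[str] = None


# Fields default to None so a record can be created as soon as the URL
# is known and then filled in as each element is located on the page.
record = PropertyRecord(url="https://www.zoopla.co.uk/new-homes/details/56648054/")
record.price = "£1,405,000"
row = asdict(record)  # plain dict, convenient for writing to CSV later
```

Keeping the raw text in the record and converting it to numbers in a later step makes it easier to re-check the scraped values against the website.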

We will focus on the London area only. The next cell takes us to the URL corresponding to properties in London:

In [11]:
driver = webdriver.Chrome()
URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"
driver.get(URL)

Oops... Looks like a cookie banner is blocking us. We need to find a way to get around this. Let's start by using `XPath` to find the "Accept All Cookies" button.

*Note: on the Zoopla website, the 'Accept Cookies' button sits inside a frame, so we have to tell Selenium to switch to that frame first. If a website does not put its cookie banner in a frame, you can skip the frame-switching step.*

In [12]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"
driver.get(URL)
time.sleep(2) # wait for a few seconds so the website does not suspect you are a bot
try:
    driver.switch_to.frame('gdpr-consent-notice') # This is the id of the frame
    accept_cookies_button = driver.find_element(by=By.XPATH, value='//*[@id="save"]') # This is the id of the button
    accept_cookies_button.click()
except AttributeError: # Fallback for older Selenium versions that expose switch_to_frame instead
    driver.switch_to_frame('gdpr-consent-notice') # This is the id of the frame
    accept_cookies_button = driver.find_element(by=By.XPATH, value='//*[@id="save"]') # This is the id of the button
    accept_cookies_button.click()
except Exception: # The frame or the button was not found
    print('No cookies to accept')

No cookies to accept


If we run the code, the webdriver will go to the website and click the button for us. Let's review the methods we used:
- `find_element()` To make the driver point to the element
- `click()` To make the driver click on the element that was pointed
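Since we will need to dismiss the cookie banner every time we open a fresh browser, the steps above can be wrapped in a small helper. This is only a sketch: the function name `accept_cookies` is ours, and the frame id and button XPath are the ones used in this notebook, which may change. The locator is passed as the plain string `"xpath"`, which is the value behind Selenium's `By.XPATH`; in a real session you can pass `By.XPATH` instead.

```python
import time

XPATH = "xpath"  # the string value behind selenium's By.XPATH


def accept_cookies(driver, frame_id="gdpr-consent-notice",
                   button_xpath='//*[@id="save"]', pause=0.0):
    """Try to click the consent button inside its frame.

    Returns True if the button was clicked, False otherwise.
    """
    clicked = False
    try:
        driver.switch_to.frame(frame_id)
        driver.find_element(by=XPATH, value=button_xpath).click()
        clicked = True
    except Exception:
        # Frame or button missing: nothing to accept on this page.
        pass
    finally:
        # Always return to the main document so later find_element
        # calls search the page rather than the consent frame.
        driver.switch_to.default_content()
    if pause:
        time.sleep(pause)
    return clicked
```

After `driver.get(URL)` and a short sleep, calling `accept_cookies(driver, pause=2)` would replace the whole `try`/`except` block. Switching back with `default_content()` matters because, once Selenium is inside a frame, it only searches that frame.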

Alright, so it is time to start extracting the data we are interested in. Let's extract the price, address, number of bedrooms and the description:

First of all, observe the HTML code corresponding to a property:
<p align=center><img src=images/Selenium_1.png width=900></p>
<figcaption align="center"><cite>Zoopla Website and Corresponding HTML Code</cite></figcaption>

If you get the XPath of that property, it will look like this:

`//*[@id="listing_60212639"]`

That is fine if we want to find a single property, but not so great if we want to list all the properties on that page. We will look at how to get all the properties shortly. For now, let's extract the URL of that property and then the information we need.

_Note: Zoopla is constantly adding new properties, so the XPath has likely changed. Make sure you follow all the steps and use the XPath from the current page._

Let's take another look at the HTML code. You will notice that it contains some `<a>` tags. These tags usually carry a hyperlink reference (`href`). Selenium allows us to read that `href`, but first we need to locate the `<a>` tag containing it.

So, if you expand one of the `<div>` tags corresponding to a property, you will see something like this:

<p align=center><img src=images/Selenium_2.png width=900></p>
<figcaption align="center"><cite>Property Div Tag</cite></figcaption>

Can you see the `<a>` tag? That is the tag that contains the URL we need. So, let's tell Selenium to extract it:

In [13]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome() 
URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"
driver.get(URL)
time.sleep(2) # Wait a couple of seconds, so the website doesn't suspect you are a bot
try:
    driver.switch_to.frame('gdpr-consent-notice') # This is the id of the frame
    accept_cookies_button = driver.find_element(by=By.XPATH, value='//*[@id="save"]')
    accept_cookies_button.click()

except AttributeError: # Older Selenium versions expose "switch_to_frame" instead of "switch_to.frame"
    driver.switch_to_frame('gdpr-consent-notice') # This is the id of the frame
    accept_cookies_button = driver.find_element(by=By.XPATH, value='//*[@id="save"]')
    accept_cookies_button.click()

except Exception: # No cookie frame or button was found, so carry on
    pass
driver.switch_to.default_content() # Leave the cookie frame so the next searches look at the main page
time.sleep(2)
house_property = driver.find_element(by=By.XPATH, value='//*[@id="listing_56648054"]') # Replace this XPath with the id of a listing on the current page
a_tag = house_property.find_element(by=By.TAG_NAME, value='a')
link = a_tag.get_attribute('href')
print(link)

https://www.zoopla.co.uk/new-homes/details/56648054/?search_identifier=91afb4c3755906a0f0140222305f21ae


Nice, now we can visit that link using Selenium. Alternatively, you can click on the `house_property` element (`house_property.click()`) and it will take you to the same page. But then you will have to:
- Click the element
- Sleep
- Extract the information
- Go back
- Sleep
- Find the next property 
- Click
- Sleep

On the other hand, if you have the links, you can visit them like this:

- Extract all the links
- Iterate through the list, and for each iteration, visit the corresponding URL
- Sleep
- Extract the information of the property
- Visit the next URL

So, it's up to you, but for many websites, first building a list of links (a script that does this is usually called a "crawler") is much more efficient.
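The link-first approach can be sketched as follows. Several things here are assumptions rather than facts about Zoopla: that every listing container has an id starting with `listing_` (based on the single id we saw above), that the first `<a>` tag inside it points to the property page, and that changing the `pn` parameter moves to the next results page. The function names are ours. The locator strings `"xpath"` and `"tag name"` are the values behind `By.XPATH` and `By.TAG_NAME`.

```python
import time
from urllib.parse import urlencode

XPATH = "xpath"        # same string as selenium's By.XPATH
TAG_NAME = "tag name"  # same string as selenium's By.TAG_NAME

BASE = "https://www.zoopla.co.uk/new-homes/property/london/"


def search_url(page, page_size=25):
    """Build the results URL for a given page number (the `pn` parameter)."""
    params = {
        "q": "London",
        "results_sort": "newest_listings",
        "search_source": "new-homes",
        "page_size": page_size,
        "pn": page,
        "view_type": "list",
    }
    return BASE + "?" + urlencode(params)


def collect_links(driver):
    """Return the first href found inside every listing on the current page."""
    # Assumption: every listing container has an id starting with "listing_".
    listings = driver.find_elements(by=XPATH, value='//*[starts-with(@id, "listing_")]')
    links = []
    for listing in listings:
        href = listing.find_element(by=TAG_NAME, value="a").get_attribute("href")
        if href and href not in links:
            links.append(href)
    return links


def crawl(driver, pages=2, delay=2.0):
    """Visit each results page in turn and gather the listing links."""
    links = []
    for page in range(1, pages + 1):
        driver.get(search_url(page))
        time.sleep(delay)  # give the page time to load before searching it
        for href in collect_links(driver):
            if href not in links:
                links.append(href)
    return links
```

With 25 results per page, `crawl(driver, pages=2)` would aim for the 50 houses we are after. Each link in the returned list can then be visited with `driver.get(link)` inside a loop, sleeping between visits.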

Enough talking (or writing), let's visit the link we extracted:

In [17]:
driver.get(link)


And it moved us to the webpage of that property

<p align=center><img src=images/Selenium_3.png width=900></p>

There, you can see the price, address, number of bedrooms, and the description. As before, let's take a look at the XPath of each of these elements, starting with the price:

<p align=center><img src=images/Selenium_4.png width=900></p>
<figcaption align="center"><cite>Property Xpath</cite></figcaption>

And there it is. If you do the same with the number of bedrooms, the address and the description, you should have something like the following:

In [21]:
price = driver.find_element(by=By.XPATH, value='//p[@data-testid="listing-price"]').text
print(price)
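# The address sits in an <h3> whose auto-generated class was "c-eFZDwI" when this was written; check the current page
address = driver.find_element(by=By.XPATH, value='//h3[@class="c-eFZDwI"]').text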
print(address)
# bedrooms = driver.find_element(by=By.XPATH, value='//div[@class="c-dkBAiW"]').text
# print(bedrooms)
# div_tag = driver.find_element(by=By.XPATH, value='//div[@data-testid="truncated_text_container"]')
# span_tag = div_tag.find_element(by=By.XPATH, value='.//span')
# description = span_tag.text
# print(description)

£1,405,000
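The price comes back as text, but since the sale price is our response variable we will eventually need it as a number. Here is a minimal sketch of that conversion. The helper name `parse_price` is ours, and it assumes the text holds a single whole-pound amount like the one printed above.

```python
import re


def parse_price(text):
    """Turn a price string such as "£1,405,000" into an integer number of pounds.

    Assumes a single whole-pound amount; returns None when the text
    contains no digits at all (for example "POA").
    """
    digits = re.sub(r"[^\d]", "", text)  # drop the currency sign and commas
    return int(digits) if digits else None


print(parse_price("£1,405,000"))  # 1405000
```

Prices shown as a range or with pence would need extra handling, so it is worth checking a few raw values before trusting the converted column.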
