## Using Selenium

> To use Selenium, we need to create an instance that is going to "drive" us through the webpage.  

Here is what it could look like:

In [None]:
from selenium import webdriver
from time import sleep

driver = webdriver.Chrome()
driver.get("https://zoopla.co.uk")

We can see that we've navigated to the [Zoopla.co.uk](https://www.zoopla.co.uk/) website. Selenium lets us search for elements via `XPath` and send mouse and keyboard actions to the page. Let's recall the challenge we want to solve: extracting the following data for 50 houses:
- **Sale Price**: Our response variable
- Number of bedrooms
- Square footage
- Description
- Address

We'll focus our efforts on the London area. The next cell will take us to the URL corresponding to properties in London:

In [None]:
driver = webdriver.Chrome() 
URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"
driver.get(URL)

Oh... Looks like cookies are blocking us... We need to find a way around this. Let's start by using XPath to find the "Accept All Cookies" button.

_Note: The Zoopla page embeds a frame, and the 'Accept Cookies' button lives inside it, so we have to tell Selenium to switch to that frame first. If the site you are scraping doesn't put its cookie notice in a frame, you can skip the frame-switching step. Older Selenium releases exposed this as `switch_to_frame`; newer ones use `switch_to.frame`._

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome() 
URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"
driver.get(URL)
time.sleep(2) # Wait a couple of seconds, so the website doesn't suspect you are a bot
try:
    driver.switch_to.frame('gdpr-consent-notice') # This is the id of the frame (older Selenium versions used the now-removed "switch_to_frame")
    accept_cookies_button = driver.find_element(by=By.XPATH, value='//*[@id="save"]')
    accept_cookies_button.click()
    driver.switch_to.default_content() # Return to the main page once the frame is dealt with
except Exception:
    pass # If there is no cookies frame or button, there is nothing to click, so we can carry on

If we run the code, the webdriver will go to the website and click the button for us. Let's review the methods we used:
- `find_element()`: makes the driver point to the element
- `click()`: makes the driver click on the element it is pointing to

Alright, it is time to start extracting the data we are interested in. Let's extract the price, address, number of bedrooms and the description.

First of all, observe the HTML code corresponding to a property:
<p align=center><img src=images/Selenium_1.png width=900></p>
<figcaption align="center"><cite>Zoopla Website and Corresponding HTML Code</cite></figcaption>

If you get the XPath of that property, it will look like this:

`//*[@id="listing_60212639"]`

That is fine if we want to find a single property, but not so great if we want to list all the properties on that page. We will look at how to get all the properties shortly. For now, let's get the URL of that property and extract the information we need from it.

_Note: Zoopla is constantly adding new properties, so the XPath has likely changed since this was written. Make sure that you follow all the steps and use an XPath taken from the current page._

Let's take another look at the HTML code. You will notice that there are some `<a>` tags in it. Usually, these tags carry a hyperlink reference (`href`). Selenium allows us to read that `href`, but first we need to locate the `<a>` tag containing it.

So, if you expand one of the `<div>` tags corresponding to a property, you will see something like this:

<p align=center><img src=images/Selenium_2.png width=900></p>
<figcaption align="center"><cite>Property Div Tag</cite></figcaption>

Can you see the `<a>` tag? That is the tag that contains the URL we need. So, let's tell Selenium to extract it:

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome() 
URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"
driver.get(URL)
time.sleep(2) # Wait a couple of seconds, so the website doesn't suspect you are a bot
try:
    driver.switch_to.frame('gdpr-consent-notice') # This is the id of the frame (older Selenium versions used the now-removed "switch_to_frame")
    accept_cookies_button = driver.find_element(by=By.XPATH, value='//*[@id="save"]')
    accept_cookies_button.click()
    driver.switch_to.default_content() # Return to the main page once the frame is dealt with
except Exception:
    pass # If there is no cookies frame or button, there is nothing to click, so we can carry on
time.sleep(2)
house_property = driver.find_element(by=By.XPATH, value='//*[@id="listing_61920149"]') # Replace this XPath with one taken from a property on the current page
a_tag = house_property.find_element(by=By.TAG_NAME, value='a') # "find_element_by_tag_name" was removed in recent Selenium versions
link = a_tag.get_attribute('href')
print(link)
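
To see what reading an `href` amounts to without opening a browser, here is a minimal offline sketch using only Python's standard library. The HTML snippet, the ids and the paths in it are made up for illustration and are much simpler than the real Zoopla markup. Note also that a browser usually hands back the fully resolved URL, whereas this parser returns the attribute exactly as written.

```python
from html.parser import HTMLParser

# Hypothetical, simplified markup; the real page is far more complex.
SAMPLE_HTML = """
<div id="listing_1">
  <a href="/new-homes/details/1/">First property</a>
</div>
<div id="listing_2">
  <a href="/new-homes/details/2/">Second property</a>
</div>
"""


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


collector = HrefCollector()
collector.feed(SAMPLE_HTML)
print(collector.hrefs)
```

This is only meant to illustrate the idea of "find the `<a>` tag, then read its `href`"; Selenium does the same thing against the live page.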


Nice, now we can visit that link using Selenium. Alternatively, you can click on the `house_property` element (`house_property.click()`) and it will take you to the same page. But then, for every property, you will have to:
- Click the element
- Sleep
- Extract the information
- Go back
- Sleep
- Find the next property 
- Click
- Sleep

On the other hand, if you have the links, you can visit them like this:

- Extract all the links
- Iterate through the list, and for each iteration, visit the corresponding URL
- Sleep
- Extract the information of the property
- Visit the next URL

So, it's up to you, but for many websites, creating a list of links first (which is usually called a "crawler") is much more efficient.
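
The link-based approach above can be sketched as a small loop. This is only an illustration under assumptions: `visit_links` and `FakeDriver` are hypothetical names, and `FakeDriver` is a stand-in so the sketch runs without a browser. With Selenium you would pass the real `driver` and an `extract` function that reads the details you need.

```python
import time
from typing import Callable, Iterable


def visit_links(driver, links: Iterable[str], extract: Callable, pause: float = 2.0) -> list:
    """Visit each link in turn and collect whatever `extract(driver)` returns."""
    results = []
    for link in links:
        driver.get(link)                  # load the page for this link
        time.sleep(pause)                 # give the page time to load
        results.append(extract(driver))   # pull out the data we care about
    return results


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver (only `get` and `current_url`)."""

    def __init__(self):
        self.current_url = None

    def get(self, url):
        self.current_url = url


pages = visit_links(
    FakeDriver(),
    ["https://example.com/a", "https://example.com/b"],
    extract=lambda d: d.current_url,  # here we simply record the visited URL
    pause=0,
)
print(pages)
```

Compared with clicking each property and going back, there is no "go back" step and no need to re-find elements after every navigation.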

Enough talking (or writing), let's visit the link we extracted:

In [None]:
driver.get(link)

And it took us to the webpage of that property:

<p align=center><img src=images/Selenium_3.png width=900></p>

There, you can see the price, address, number of bedrooms, and the description. As always, let's take a look at the XPath corresponding to each of these details:

<p align=center><img src=images/Selenium_4.png width=900></p>
<figcaption align="center"><cite>Property Xpath</cite></figcaption>

And there it is. If you do the same with the number of bedrooms, the address and the description, you should have something like the following:

In [None]:
price = driver.find_element(by=By.XPATH, value='//p[@data-testid="price"]').text
print(price)
address = driver.find_element(by=By.XPATH, value='//address[@data-testid="address-label"]').text
print(address)
bedrooms = driver.find_element(by=By.XPATH, value='//div[@class="c-PJLV c-PJLV-iiNveLf-css"]').text
print(bedrooms)
div_tag = driver.find_element(by=By.XPATH, value='//div[@data-testid="truncated_text_container"]')
span_tag = div_tag.find_element(by=By.XPATH, value='.//span')
description = span_tag.text
print(description)
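
The values returned by `.text` are strings, so the sale price (our response variable) still has to be turned into a number before it can be used in a model. Below is a minimal sketch of that clean-up. It assumes the price text looks like `£450,000` and the bedrooms text like `2 beds`; the actual text on the site may differ, and `parse_price` and `parse_bedrooms` are hypothetical helper names.

```python
import re
from typing import Optional


def parse_price(text: str) -> Optional[int]:
    """Return the first number in a price string as an int, or None if there is none."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


def parse_bedrooms(text: str) -> Optional[int]:
    """Return the first integer found in the bedrooms text, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


print(parse_price("£450,000"))   # 450000
print(parse_price("POA"))        # None
print(parse_bedrooms("2 beds"))  # 2
```

Returning `None` instead of raising keeps a single odd listing (for example, one without a published price) from stopping the whole scrape.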

Now that we can extract each detail, we can store the values in a dictionary:

In [None]:
dict_properties = {'Price': [], 'Address': [], 'Bedrooms': [], 'Description': []}
price = driver.find_element(by=By.XPATH, value='//p[@data-testid="price"]').text
dict_properties['Price'].append(price)
address = driver.find_element(by=By.XPATH, value='//address[@data-testid="address-label"]').text
dict_properties['Address'].append(address)
bedrooms = driver.find_element(by=By.XPATH, value='//div[@class="c-PJLV c-PJLV-iiNveLf-css"]').text
dict_properties['Bedrooms'].append(bedrooms)
div_tag = driver.find_element(by=By.XPATH, value='//div[@data-testid="truncated_text_container"]')
span_tag = div_tag.find_element(by=By.XPATH, value='.//span')
description = span_tag.text
dict_properties['Description'].append(description) # Append, so the list is not overwritten by a single string

In [None]:
dict_properties
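
Once several properties have been appended, a dictionary of lists like this one maps naturally onto a table. The sketch below shows one possible way to keep the four lists aligned and write them out as CSV with the standard library. The helper names `add_property` and `to_csv` and the two example rows are made up for illustration.

```python
import csv
import io

dict_properties = {'Price': [], 'Address': [], 'Bedrooms': [], 'Description': []}


def add_property(store: dict, price: str, address: str, bedrooms: str, description: str) -> None:
    """Append one property's details, keeping all four lists the same length."""
    store['Price'].append(price)
    store['Address'].append(address)
    store['Bedrooms'].append(bedrooms)
    store['Description'].append(description)


def to_csv(store: dict) -> str:
    """Render the dictionary of lists as CSV text, one row per property."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(list(store))             # header row: the dictionary keys
    writer.writerows(zip(*store.values()))   # one row per property
    return buffer.getvalue()


# Made-up values, for illustration only
add_property(dict_properties, '£450,000', '1 Example Street, London', '2 beds', 'A bright flat.')
add_property(dict_properties, '£600,000', '2 Example Road, London', '3 beds', 'A family home.')

print(to_csv(dict_properties))
```

Appending all four values in one place makes it harder for the lists to drift out of step, which would otherwise misalign rows when the data is tabulated.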

## Adding links to a list: Creating a Crawler

As mentioned, it would be more efficient to create a list with all the links and then iterate through that list. Here, I am going to give a small teaser of what it looks like, but, ultimately, it will be your task to complete the whole scraper.

Before we move on, I am going to create a function with the "Accept Cookies" functionality, so we don't have to repeat ourselves so many times:

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

def load_and_accept_cookies() -> webdriver.Chrome:
    '''
    Open Zoopla and accept the cookies
    
    Returns
    -------
    driver: webdriver.Chrome
        This driver is already in the Zoopla webpage
    '''
    driver = webdriver.Chrome() 
    URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"
    driver.get(URL)
    time.sleep(3) 
    try:
        driver.switch_to.frame('gdpr-consent-notice') # This is the id of the frame (older Selenium versions used the now-removed "switch_to_frame")
        accept_cookies_button = driver.find_element(by=By.XPATH, value='//*[@id="save"]')
        accept_cookies_button.click()
        time.sleep(1)
        driver.switch_to.default_content() # Return to the main page once the frame is dealt with
    except Exception:
        pass # If there is no cookies frame or button, there is nothing to click, so we can carry on

    return driver 

Let's use this function from now on. It will make our code much more readable:

In [None]:
driver = load_and_accept_cookies() # If it works, the driver should be on the Zoopla webpage with the cookies button clicked

Great, let's observe the list of properties one more time. All the properties are in a container as we can see in this image:

<p align=center><img src=images/Selenium_5.png width=900></p>
<figcaption align="center"><cite>Properties Container</cite></figcaption>

And each property is one `<div>` tag inside that container. For example, for the first two properties:

<p align=center><img src=images/Selenium_7.png width=450><img src=images/Selenium_6.png width=450></p>
<figcaption align="center"><cite>Property and Corresponding Div Tag</cite></figcaption>

So, we can find a way to iterate through _all_ the properties in that list, and for each iteration, extract the link. We saw earlier that if you have the property, you can easily find the `<a>` tag that contains the `href` like this:
```
house_property = driver.find_element(by=By.XPATH, value='//*[@id="listing_61920149"]') # Replace this XPath with one taken from a property on the current page
a_tag = house_property.find_element(by=By.TAG_NAME, value='a')
link = a_tag.get_attribute('href')
```
Let's use something similar:

In [None]:
driver = load_and_accept_cookies()
prop_container = driver.find_element(by=By.XPATH, value='//div[@class="css-1itfubx e5pbze00"]') # XPath corresponding to the Container

Now, `prop_container` is pointing to the list of properties on the website. We have to get ALL the `<div>` tags inside, but only those that are direct children. So, we have to use a relative XPath: `./div`
- The dot means the path is relative to the current element
- The single slash selects direct children only

Also, take into account that we are looking for ALL occurrences of this XPath, so we have to use the `find_elements` method (note the plural) rather than `find_element`.
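
To see the difference between direct children (`./div`) and all descendants (`.//div`) without a browser, here is a small offline sketch using the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath. The markup is a made-up, simplified stand-in for the property container, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the property container
container = ET.fromstring("""
<div class="container">
  <div id="listing_1"><div class="price">A</div></div>
  <div id="listing_2"><div class="price">B</div></div>
</div>
""")

direct_children = container.findall('./div')    # only the two listing divs
all_descendants = container.findall('.//div')   # the listings and the divs nested inside them

print(len(direct_children))   # 2
print(len(all_descendants))   # 4
```

With `.//div` the nested divs would be treated as if they were properties too, which is why the relative, single-slash form is the one we want here.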


In [None]:
prop_list = prop_container.find_elements(by=By.XPATH, value='./div')
link_list = []

for house_property in prop_list:
    a_tag = house_property.find_element(by=By.TAG_NAME, value='a')
    link = a_tag.get_attribute('href')
    link_list.append(link)
    
print(f'There are {len(link_list)} properties in this page')
print(link_list)

Now we have a list of the links of the properties on that page. How awesome is that?

Next, we need to iterate through this list and visit each link to extract the data we are interested in (price, address, number of bedrooms, description).

## Try it out

With your newly acquired knowledge, extract the data from all the properties on 5 different Zoopla pages. This means that, once you finish scraping a page, you have to click the 'Next Page' button (you can also change the URL if you know how to tweak it). So, once you have extracted the 25 links, you can go to the next page by clicking 'Next':

<p align=center><img src=images/Selenium_8.png width=450></p>
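
If you prefer tweaking the URL over clicking 'Next', one possible approach is to rewrite a query parameter. The sketch below assumes that the `pn` parameter in the search URL is the page number; the URL suggests this, but it is an assumption you should verify in the browser. `page_url` is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

BASE_URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"


def page_url(url: str, page: int) -> str:
    """Return `url` with its `pn` query parameter set to `page` (assumed to be the page number)."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))   # keeps the original parameter order
    query["pn"] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


for page in range(1, 6):  # the first 5 pages
    print(page_url(BASE_URL, page))
```

Each generated URL could then be passed to `driver.get(...)`, followed by a sleep, instead of locating and clicking the 'Next' button.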

Below is a template you can use to get started:

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By # Needed by get_links below

def get_links(driver: webdriver.Chrome) -> list:
    '''
    Returns a list with all the links in the current page
    Parameters
    ----------
    driver: webdriver.Chrome
        The driver that contains information about the current page
    
    Returns
    -------
    link_list: list
        A list with all the links in the page
    '''

    prop_container = driver.find_element(by=By.XPATH, value='//div[@class="css-1itfubx e5pbze00"]')
    prop_list = prop_container.find_elements(by=By.XPATH, value='./div')
    link_list = []

    for house_property in prop_list:
        a_tag = house_property.find_element(by=By.TAG_NAME, value='a')
        link = a_tag.get_attribute('href')
        link_list.append(link)

    return link_list

big_list = []
driver = load_and_accept_cookies()

for i in range(5): # The first 5 pages only
    big_list.extend(get_links(driver)) # Call the function we just created and extend the big list with the returned list
    ## TODO: Click the next button. Don't forget to use sleeps, so the website doesn't suspect you are a bot
    pass # This pass should be removed once the code is complete


for link in big_list:
    ## TODO: Visit all the links, and extract the data. Don't forget to use sleeps, so the website doesn't suspect you are a bot
    pass # This pass should be removed once the code is complete

driver.quit() # Close the browser when you finish
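
For the sleeps mentioned in the TODOs, one option is to vary the pause instead of always waiting exactly the same time. The sketch below is a hypothetical helper (`polite_sleep` is not part of Selenium); whether a varied delay actually makes a difference depends on the website, so treat it as one possible approach rather than a guarantee.

```python
import random
import time


def polite_sleep(min_seconds: float = 1.0, max_seconds: float = 3.0, sleep=time.sleep) -> float:
    """Pause for a random duration between min_seconds and max_seconds and return it.

    The `sleep` argument exists only so the pause can be replaced when testing.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


waited = polite_sleep(0.01, 0.02)  # short values here just to keep the example quick
print(f'Waited {waited:.3f} seconds')
```

In the template above, a call such as `polite_sleep()` could go right after each `driver.get(...)` or click.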

## Key Takeaways

- To start using Selenium, we first need to create a driver instance using the appropriate webdriver (such as ChromeDriver)
- If a website has cookies blocking access, one way around that is to use XPath to find the "Accept All Cookies" button and click it before crawling the data
- To extract the data for multiple items on a website, we find the `<a>` tags, collect their `href` values into a list of links to crawl, and then iterate through that list
- To find different details of items on a website, we can use `find_element(by=By.XPATH, value=...)` and then instruct Selenium to take an action on each element (such as a click)